# **Explain Like I'm Not a Scientist**
### *An exploration of (not so) scientific communication*
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|Emily K. Sanders| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |Project 3: NLP|
|DSB-318| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |May 3, 2024|
---
###### *A report for the 2024 Greater Lafayette Association for Data Science Conference on Activism for a Thriving Society*

## Problem Statement

Scientific communication - the relaying of complex scientific information from experts to laypeople - is hugely important for an educated populace, but infamously difficult to do well.  The purpose of this report is to establish the utility of analyzing submissions from r/explainlikeimfive and r/AskPhysics as exemplars of scientific communication, with r/explainlikeimfive modeling successful strategies for communicating with laypeople, and r/AskPhysics serving as a control, where members are expected to have a greater level of baseline knowledge.  I will do this by building a model to differentiate between documents from each subreddit, thus providing strong support for the notion that they are meaningfully different in nature, and that their points of divergence can offer useful insights.  Based on the results of this attempt, I will recommend future work to learn more about the dos and don'ts of successful communication, and provide a cleaned-up corpus of the documents I gathered for other scholars to use.

### Deliverables

For this project, I will deliver:
- A written report of my procedure, findings, and recommendations for future work.
- Slides from the presentation of this report to the *2024 Greater Lafayette Association for Data Science Conference on Activism for a Thriving Society*.
- A cleaned dataset containing my corpus of documents, labeled based on whether the model correctly classified them or not, available for others to use.

## Introduction

Scientific communication is notoriously hard to do well, and can cause a lot of damage when done badly.  All sorts of modern myths - from [vaccine conspiracy theories](http://vaccinescauseautism.com/) to the idea that ibuprofen makes COVID-19 worse ([it](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7289730/) [doesn't!](https://www.mayoclinic.org/diseases-conditions/coronavirus/in-depth/treating-covid-19-at-home/art-20483273)) - can trace their roots, at least in some part, to failures of scientific communication.  Any insight into how it works and how to do it well could yield exponential benefits in reduced future misinformation.  Therefore, I set out to study it.

For the present report, I am defining scientific communication as any information that purports to educate a layperson about a scientific concept, and/or to advise a course of action based on scientific findings, and that is presented by someone whom the listener perceives to have domain knowledge authority - that is, someone who "knows more than I do."  Scientific communication need not come from actual scientists, but it can.  A researcher publishing her findings, a newspaper summarizing those findings, and, in the present case, a redditor repeating those findings when asked by someone less knowledgeable all count as scientific communication.

One form that the problem with scientific communication can take is an imbalance between what the listener knows and what the speaker knows - more specifically, the tendency for speakers to assume that listeners know as much as they do.  Psychologists call this the "[curse of knowledge](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9794110/)," and it blights all kinds of communication.  However, it is especially pronounced in communications about science, because the gap between an expert and a layperson is so unusually large.  In this scenario, the curse of knowledge can create failures in communication because the expert is not explaining things in terms a layperson can understand.

The ubiquity of this problem has given rise to attempted solutions by laypeople seeking to understand complex ideas (scientific and otherwise).  Reddit hosts an abundance of [educational subreddits](https://www.reddit.com/r/ListOfSubreddits/wiki/listofsubreddits/#wiki_educational) for people looking to talk or learn about a particular topic with people who are knowledgeable about it.  These forums cover a wide variety of topics, and allow users to learn everything from [Japanese](https://www.reddit.com/r/LearnJapanese/) to [useless talents](https://www.reddit.com/r/LearnUselessTalents/), but for the purpose of investigating scientific communication, I chose [r/explainlikeimfive](https://www.reddit.com/r/explainlikeimfive/) and [r/AskPhysics](https://www.reddit.com/r/AskPhysics/new/).

Explainlikeimfive (also known as eli5) is a popular subreddit, one that is often known even to people who do not use reddit habitually.  [Its description](https://www.reddit.com/r/explainlikeimfive/) boasts that it "is the best forum and archive on the internet for layperson-friendly explanations," and it has approximately 23 million members, making it the 24th largest community on all of reddit.  The idea, as the name implies, is that people can come to this subreddit with questions for which they have not been able to find sufficient answers elsewhere, and/or that they would be embarrassed to admit elsewhere that they do not already know.  For example, the poster below had a question about "relatives" versus "descendants" that seems very simple on the surface.  I cannot honestly say that I would be willing to ask it anywhere that I was not sure would be nonjudgmental.

![](../images/content-examples/eli5-post-example.png)

However, I will admit to having learned something from the answer!  The people of eli5 took the poster at face value, and explained in clear terms how these two relationships can apply to a long-extinct species.  An example is below.

![](../images/content-examples/eli5-comment-example.png)

The culture of the subreddit caters specifically to the curse of knowledge problem; [answers are required to be designed with laypeople in mind](https://www.reddit.com/r/explainlikeimfive/wiki/detailed_rules/#wiki_rule_4.3A_explain_for_laypeople).  Although eli5 covers almost any topic for which an objective explanation is possible (i.e., not just science), its enormous size and active userbase nonetheless make it a worthy target of exploration for scientific communication.  Furthermore, because the subreddit is explicitly designed for non-experts, and because of its popularity, I deemed it safe to assume that eli5 is an example of scientific communication done well.  It offers explanations of complex ideas that are accessible to laypeople, and those laypeople find those explanations helpful enough to build a positive reputation for the subreddit.

For a comparison with eli5, I chose a subreddit dedicated to scientific discussions, but not necessarily targeted at laypeople.  Reddit offers many such communities, but I found r/AskPhysics to be one of the most active.  AskPhysics does not have a dedicated description like eli5 does, but its [number 1 rule](https://www.reddit.com/r/AskPhysics/new/) offers similar insight: "Questions should be relevant to physics, and answers should be on-topic and correct."  It has about 606,000 members, which is much smaller than eli5, but still in the top 1% of subreddits.

AskPhysics is by no means exclusive to expert physicists, but neither is it targeted at laypeople.  Curious redditors with no scientific background are welcome to post their questions, and students looking for homework help make up a large enough proportion of the membership to warrant an entry in the rules about cheating and plagiarism.  For example, consider this post:

![](../images/content-examples/askphysics-post-example.png)

The poster felt comfortable enough to pose the question in this subreddit despite knowing "nothing about physics," and as far as I saw, the members were indeed welcoming, and many offered detailed answers to the question.  However, at least in my opinion, these answers were nowhere near as clear as the kinds of answers I found in eli5, and some of them relied on the reader knowing terminology and concepts that many laypeople would not.  For example, consider the first paragraph of [this response](../images/content-examples/askphysics-comment-example-whole.png), the full text of which is available in the `images/content-examples` folder.

![](../images/content-examples/askphysics-comment-example-paragraph-1.png)

Although the commenter went to great effort to provide an explanation for the poster, they did not simplify the information to the point that a layperson could easily understand it.  To be clear, I do not mean this as a criticism of the commenter; AskPhysics is a different space with different expectations than eli5, and this comment was well-received by the community.  I do, however, think it is a perfect example of the difficulty of scientific communication, and of why I chose to compare these two subreddits in my effort to better understand the phenomenon.

Through the course of this report, I will scrape posts and comments from both subreddits, and attempt to build a natural language processing model able to tell them apart.  The purpose of the present report is simply to determine whether such a model *can* be designed, as a proof of concept that these two subreddits do represent meaningfully different paradigms of scientific communication and therefore have utility as exemplars for study.  If that goal is achieved, I will interpret my results so as to lay the groundwork for future projects.  If the model is able to perfectly differentiate the two subreddits (100% accuracy in both training and testing sets), I will recommend immediate action from my fellow data scientists to explore the corpus more deeply for insights on the differences between the communication norms of each subreddit, as well as inviting collaboration from qualitative researchers, in hopes of identifying actionable dos and don'ts of successful scientific communication based on what eli5 is doing differently than AskPhysics.  If, as is more likely, the model can differentiate some posts but not all, I will attempt to provide some insight into what features it is using.  Because the primary purpose of the current report is simply to demonstrate that such differentiation is possible, it may or may not be possible to draw inferences based on the type of model that ends up being most successful.  In any case, however, I will provide a corpus of documents, each labeled with the model's prediction for it.

In short, the purpose of this report is to establish the utility of analyzing submissions from r/explainlikeimfive and r/AskPhysics in future research on scientific communication.  I will do this by building a model to differentiate between documents from each subreddit, thus providing strong support for the notion that they are meaningfully different in nature.  My goal is to build a robust model, likely to generalize well to documents beyond those I have gathered; therefore, I will employ several types of models.  I will judge each model's quality on its accuracy score, because neither category is meaningfully a negative or a positive.  However, I am not particularly interested in the accuracy in and of itself.  I suspect that future research will benefit greatly from a model that does misclassify some documents, so that those documents can be evaluated and learned from.  Therefore, my main metric for success is low variance, meaning a small gap between accuracy on the training data and accuracy on the testing data.  If the most successful model is one with interpretable coefficients, I will interpret them.  If not, I will clearly state so, and recommend future work prioritizing that goal.  I would particularly welcome partnerships with qualitative researchers who could bring unique expertise to the analysis of why some posts were misclassified.  Here's hoping that those experts can explain it to me in a way I understand.
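To make the evaluation criteria above concrete, here is a minimal sketch in plain `python` of the two quantities involved: accuracy, and the gap between training and testing accuracy.  It also shows one way a document could be flagged as correctly or incorrectly classified for the shared corpus.  The names `accuracy`, `train_test_gap`, `label_corpus`, and `LabeledDocument` are hypothetical and exist only for illustration; they are not the code used in the project notebooks, and the labels in the usage note are toy values, not results.

```python
from dataclasses import dataclass


@dataclass
class LabeledDocument:
    """One corpus entry plus the model's verdict (hypothetical structure)."""
    text: str
    true_label: str       # subreddit the document actually came from
    predicted_label: str  # subreddit the model guessed
    correct: bool         # True when the two labels match


def accuracy(true_labels, predicted_labels):
    """Fraction of documents whose predicted label matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be the same length")
    if not true_labels:
        raise ValueError("cannot score an empty set of documents")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def train_test_gap(train_accuracy, test_accuracy):
    """Absolute difference between training and testing accuracy.

    A small gap suggests the model generalizes; a large gap suggests overfitting.
    """
    return abs(train_accuracy - test_accuracy)


def label_corpus(texts, true_labels, predicted_labels):
    """Pair each document with its true label, prediction, and a correct flag."""
    if not (len(texts) == len(true_labels) == len(predicted_labels)):
        raise ValueError("texts and label lists must be the same length")
    return [
        LabeledDocument(text, true, predicted, true == predicted)
        for text, true, predicted in zip(texts, true_labels, predicted_labels)
    ]
```

As a usage note, calling `accuracy(["eli5", "AskPhysics"], ["eli5", "eli5"])` on these toy labels scores one match out of two documents, and the misclassified entry is exactly the kind of document that `label_corpus` would mark with `correct=False` for later qualitative review.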

## Method

For this project, I used `python` to scrape content from reddit, specifically the [eli5](https://www.reddit.com/r/explainlikeimfive/) and [AskPhysics](https://www.reddit.com/r/AskPhysics/) subreddits.  I first scraped *posts*, which are the individual submissions made by users to the subreddits; these appear in the "main feed" when viewing the subreddit.  In my chosen subreddits, posts must be questions posed by the user to the community, as seen in the examples above.  Posts can have *titles*, which are usually a succinct form of the poster's question (in eli5, this is a rule), as well as *bodies* (also known as `selftext`), where the poster has more room to elaborate.  Some posters in these subreddits leave the body blank when their question is self-explanatory from the title.  I then scraped the *comments* that members of the subreddits left in reply to the posts.  Comments do not appear in the main feed of the subreddit, but rather, as sub-items underneath their parent posts.  They also do not have titles of their own, only bodies.  Users can also leave comments on comments, resulting in a chain of discussion represented by increasingly indented margins ([example](../images/content-examples/askphysics-comment-example-paragraph-1.png)).  In both subreddits, it is a rule that the "top level" comments - those left in direct reply to the post, rather than to another comment - be relevant answers to the question.  I scraped both types of bodies and the titles of the posts from several days' worth of recent activity on each subreddit.
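The structure described above (posts with titles and optional bodies, comments with bodies only) can be sketched as a small parsing function.  This is an illustration under stated assumptions, not the scraping code from the project notebooks: it assumes the data arrives as a reddit-style listing, a dictionary whose `data.children` entries are tagged `t3` for posts and `t1` for comments, and that a comment whose `parent_id` begins with `t3_` is a top-level reply.  The function name `documents_from_listing`, the output field names, and the choice to treat each title and body as its own document are all hypothetical.

```python
def documents_from_listing(listing):
    """Flatten a reddit-style listing dict into a list of text documents.

    Assumed input shape: {"data": {"children": [{"kind": ..., "data": {...}}]}}.
    """
    documents = []
    for child in listing.get("data", {}).get("children", []):
        kind = child.get("kind")
        data = child.get("data", {})
        subreddit = data.get("subreddit")

        if kind == "t3":
            # A post: always has a title, and may have an empty body (selftext).
            for field in ("title", "selftext"):
                text = (data.get(field) or "").strip()
                if text:
                    documents.append(
                        {"subreddit": subreddit, "source": "post", "field": field, "text": text}
                    )
        elif kind == "t1":
            # A comment: body only; top-level comments reply directly to a post.
            text = (data.get("body") or "").strip()
            if text:
                documents.append(
                    {
                        "subreddit": subreddit,
                        "source": "comment",
                        "field": "body",
                        "text": text,
                        "top_level": str(data.get("parent_id", "")).startswith("t3_"),
                    }
                )
        # Any other kind (for example, placeholders for collapsed threads) is skipped.
    return documents
```

Keeping the `source` and `top_level` fields alongside the text would make it possible to compare posts against comments later, or to restrict an analysis to the top-level answers that both subreddits require to address the question.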

Both posts and comments are included in the model so as to train it on a more complete representation of the type of communication in each subreddit, although I suspect that the comments will ultimately provide more useful insights in future research.  It is, after all, the commenters, not the posters, who are truly engaged in scientific communication.  Serendipitously for the current project of establishing these two subreddits as useful future models of study, each subreddit produced many comments per post, and thus the comments came to far outnumber the posts in the corpus.

### Apparatus

For this project, I used the `python` programming language and various relevant modules, including `pandas`, `requests`, `numpy`, `sklearn`, `nltk`, `datetime`, `getpass`, `string`, `os`, `matplotlib`, and `seaborn`.  The developers of `python` and of these modules have graciously made their work open source.

Additionally, I accessed the [reddit API](https://www.reddit.com/dev/api/) to scrape the content from the subreddits.  This tool is also freely available to the public. 

### Scraping Procedure

Please continue to notebook 2.