![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fdata-viz-of-the-week&branch=main&subPath=dog-rating-inflation/weratedogs-bestfit.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

### Callysto's Weekly Data Visualization

## WeRateDogs Inflationary Scoring

### Recommended Grade Level: 9-12

### Instructions

#### Step 1 (your only step): “Run” the cells to see the graphs
Click “Cell” and select “Run All.” This will import the data and run all the code to create the data visualizations (scroll back to the top after you’ve run the cells). **You don’t need to do any coding**.

The plots generated in this notebook are interactive. You can hover over and click on elements to see more information.   

![instructions](https://github.com/callysto/data-viz-of-the-week/blob/main/images/instructions.png?raw=true)

After a code cell runs, a number appears in the top left corner. If the code cell experiences a technical error, some red text will appear below the cell. Email contact@callysto.ca if you experience issues.

### About This Notebook

Callysto's Weekly Data Visualization is a learning resource that helps Grades 5-12 teachers and students grow and develop data literacy skills. We do this by providing a data visualization, like a graph, and asking teachers and students to interpret it. This companion resource walks learners through how the data visualization is created and interpreted using the data science process. The steps of this process are listed below and applied to each weekly topic.

1. Question - What are we trying to answer?
2. Gather - Find the data source(s) you will need.
3. Organize - Arrange the data so that you can easily explore it.
4. Explore - Examine the data to look for evidence to answer our question. This includes creating visualizations.
5. Interpret - Explain how the evidence answers our question.
6. Communicate - Reflect on the interpretation.

### Acknowledgment
The dataset we used in this week’s visualization is from [Greg Baker](https://www.sfu.ca/computing/people/faculty/gregbaker.html), a Computing Science instructor at Simon Fraser University in British Columbia. This visualization is inspired by [this](http://dhmontgomery.com/2017/03/dogrates/) data science exploration.

## 1. Question
**Do the WeRateDogs scores inflate (or increase) over time?**

<div>
    <br><br>
<img src="./images/brent.png" width="500"/>
    <br><br>
</div>

[WeRateDogs](https://twitter.com/dog_rates) is a [popular](https://en.wikipedia.org/wiki/WeRateDogs) Twitter account that offers humorous dog ratings and has spawned many memes. The Twitter exchange above was a popular meme a few years ago, and the account is even [credited](https://www.npr.org/sections/alltechconsidered/2017/04/23/524514526/dogs-are-doggos-an-internet-language-built-around-love-for-the-puppers) with creating or formalizing the "pupper" and "doggo" lingo used to describe dogs on the internet, much as 4chan users are credited with creating the "LOLCAT" pidgin language in the early 2000s.

Outside of exploring the history of internet memes, the account can demonstrate the concept of grade inflation. Usually when grade inflation is discussed, it is in the context of high schools or universities giving out more B and A grades leading to a diminishing value or meaning of those grades. Here we will see how scores given out by the WeRateDogs account may or may not be suffering from grade inflation. Hence, our question for this notebook is:
* Do the WeRateDogs scores inflate (or increase) over time?

## 2. Gather

First we will import the Python libraries we need.

In [None]:
#import libraries
%pip install -q pyodide_http plotly nbformat statsmodels
import pyodide_http
pyodide_http.patch_all()
import numpy as np
import pandas as pd
import re
import plotly.express as px
import os

Then we read in our dataset of collected WeRateDogs tweets from a csv file into a pandas dataframe.

In [None]:
#read in the dataset
path = os.path.join('datasets', 'dog_rates_tweets.csv')
data = pd.read_csv(path, dtype={'text': str}, parse_dates=['created_at']).set_index(keys='id')

The last part of the 'gather' step is to inspect what we pulled in, to make sure everything looks okay and to start understanding what our dataset looks like.

In [None]:
print('There are', data.shape[0], 'tweets in the dataset.\nThe first few rows look like this:')
#view the unmutated data
data.head()

## 3. Organize

This step will take some work, since our data does not have a score column. Instead the scores, if they are present, are nested in the text. First, we will need to extract scores from the tweets in an automated fashion – manually checking eight thousand tweets would take too long and be prone to errors.

To extract the scores, we create a helper function to find any '*x*/10' scores in a tweet, where *x* is a number (one or more digits, optionally with a decimal part).

In [None]:
def text_to_rating(text):
    # a helper function to find any 'x/10'
    # the function only returns the first such score found in a tweet
    match = re.search(r'(\d+(\.\d+)?)/10',str(text))
    if match:
        numerator, denominator = match[0].split('/')
        return float(numerator)/float(denominator)
    else:
        return None
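
As a quick check of the pattern used above, the sketch below runs the same regular expression on a few made-up tweet texts. The example strings and the `first_numerator` helper are hypothetical and are not part of the dataset or the notebook. It illustrates that decimals are captured, that only the first match in a text is used, and that any number in front of '/10' matches, however large.

```python
import re

# Same pattern as in text_to_rating: a number (optionally with decimals) followed by '/10'
PATTERN = r'(\d+(\.\d+)?)/10'

def first_numerator(text):
    # hypothetical helper: return the numerator of the first 'x/10' as text, or None
    match = re.search(PATTERN, str(text))
    return match.group(1) if match else None

# made-up example texts, for illustration only
examples = [
    'This is Max. 13/10 would pet',
    'A solid 9.5/10 pupper',
    'First 11/10, then 12/10',
    'A suspiciously large 1776/10',
    'No score in this one',
]

for text in examples:
    print(repr(text), '->', first_numerator(text))
```

Because very large numerators are matched too, the extracted scores need to be inspected before they are used.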

Next, we apply the above function to the `text` column to create a new column consisting of scores. We will also remove the original text from the dataset and remove any tweets that don't contain a score.

In [None]:
#find the ratings
data['rating'] = data['text'].map(text_to_rating)
#drop the actual content of the tweets
data = data.drop(['text'], axis =1)
#drop entries with no scores found
data = data[np.isfinite(data['rating'])]
#rename created_at for ease of plotting later
data.rename(columns={'created_at':'date created'}, inplace=True)

Here we print the data sorted by highest to lowest score to get a picture of the range of scores.

In [None]:
#sort the data
data = data.sort_values(by='rating', ascending=False)
#display the data
data

Looking at the output above, we see that the data has a large range of values. The top few seem particularly large for a rating system out of ten.

Data points that lie far from the rest of the data are usually referred to as outliers. Whether a point counts as an outlier depends on the use, the data, and the question. Outliers sometimes need to be removed, but other times they are important for gaining an accurate picture of the data. We will look at the tweets to see if they are valid scores or something else, and remove some of the largest values.


Sometimes the large scores were dates:
![holiday score](./images/1776.png)

Sometimes they were scores scraped from conversations about scoring:
![](./images/meta.png)

Sometimes they were honorary:
![](./images/honourary.png)

This dataset did not contain any genuine scores above 14/10. However, the web scraping used to collect the data did not find every single tweet. It's possibly missing valid 15/10 scores for especially good dogs.

The low scores, except for the 0/10, seemed to be genuine ratings. Many low scores, however, were rating animals other than dogs. For our purposes we will consider those to be true ratings. However, depending on how the question is interpreted, it could be fair to remove them.

Many low scores were like this one given to a goat:
![](./images/goat.png)

Since this data exploration isn't a serious one, we can freely decide on the criteria for dropping data points. However, data should never be removed without careful thought when doing actual research. Removing or changing key data points can significantly change the results.

Our last step in organizing the data is removing the outliers.

In [None]:
# remove the 13 largest and 3 smallest scores (data is sorted from highest to lowest rating)
data_cleaned = data.iloc[13:-3].copy()
# print the data
data_cleaned
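
The slice `13:-3` keeps every row except the first 13 and the last 3. Because the data was sorted from highest to lowest rating, those are the 13 largest and the 3 smallest scores. The sketch below shows the same position-based slicing on a short, made-up list, dropping the first 2 and the last 1 values instead. The numbers are invented for illustration.

```python
# made-up ratings, for illustration only
ratings = [1.2, 177.6, 1.0, 0.0, 1.3, 42.0, 1.1]

# sort from highest to lowest, like data.sort_values(by='rating', ascending=False)
ratings_sorted = sorted(ratings, reverse=True)

# drop the 2 largest values and the 1 smallest value
trimmed = ratings_sorted[2:-1]

print('sorted :', ratings_sorted)
print('trimmed:', trimmed)
```

Note that this kind of trimming only works as intended if the data is sorted first.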

## 4. Explore

In exploring the data we will create some plots. First, we create a helper function to make a scatterplot with a line of 'best fit' running through the points.

In [None]:
def create_plot(data):
    """
    A simple function that will take in a dataframe formatted like ours 
    and produce a scatterplot with best fit line
    input:
        a pandas dataframe with 'rating' and 'date created' columns
    return:
        a plotly express scatterplot with best fit line(untitled)
    """
    fig = px.scatter(data, x='date created', y='rating', trendline='ols')
    #highlight the best fit line in red to make it more visible
    fig.data[1].update(line_color='red')
    #show the tweets in the legend
    fig['data'][0]['showlegend']=True
    fig['data'][0]['name']='Tweet'
    # show the best fit line in the legend
    fig['data'][1]['showlegend']=True
    fig['data'][1]['name']='Best Fit Line (OLS)'
    fig['data'][1]['visible']='legendonly'
    fig.update_layout(showlegend=True)
    #return the figure (the caller shows it)
    return fig

### 4(a) Explore - Data With Outliers 

First, we'll create and show the plot from the dataset of scores without filtering out the outliers.

In [None]:
# create a plot without the outliers removed
fig = create_plot(data)
# add an appropriate title to the plot
fig.update_layout(title='Plot With Outliers Present')
# show the plot
fig.show()

**Click on the `Best Fit Line (OLS)` entry in the legend located on the right hand side of the plot to show the best fit line.**

After looking at the data with and without the best fit line, do you think the *WeRateDogs scores suffer from inflation*?

When the outliers are not removed, it is hard to make out any detail in the visual. The line that best fits the data has a slope of $2.8 \times 10^{-9}$.

### 4(b) Explore - Data Without Outliers

Next, we'll generate a plot with the outliers removed. We expect to see a much clearer visualization of the data. 

Note that the goal of removing outliers is **not** to make a clearer plot. The goal of removing outliers is to create a more accurate dataset to answer our question. We shouldn't try to create an aesthetically pleasing plot at the expense of accuracy.

In [None]:
# create a plot with the outliers removed
fig = create_plot(data_cleaned)
# add an appropriate title to the plot
fig.update_layout(title = 'WeRateDogs Scores Versus Time')
# show the plot
fig.show()

**Click on the `Best Fit Line (OLS)` entry in the legend located on the right hand side of the plot to show the best fit line.**

After looking at the data with and without the best fit line, do you think the *WeRateDogs scores suffer from inflation*?

Once the outliers are removed, the visual clearly shows separate scores, and the slope is now $3.4\times 10^{-9}$.
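
Slopes this small are hard to interpret on their own. The sketch below is a rough unit conversion under two assumptions: that the trendline measures time in seconds (our understanding of how Plotly Express treats dates when fitting, not something this notebook verifies), and that `rating` is the score divided by ten, as computed by `text_to_rating`. The function name `slope_to_points_per_year` is made up for this example.

```python
# assumption: the slope is in rating units per second
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

def slope_to_points_per_year(slope_per_second):
    """Convert a slope in (score/10) per second to points out of ten per year."""
    rating_per_year = slope_per_second * SECONDS_PER_YEAR
    # rating is stored as score/10, so multiply by 10 to get points out of ten
    return rating_per_year * 10

for label, slope in [('with outliers', 2.8e-9), ('without outliers', 3.4e-9)]:
    print(label, ':', round(slope_to_points_per_year(slope), 2), 'points per year')
```

Under these assumptions, the slope for the cleaned data corresponds to scores rising by roughly one point out of ten per year.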

## 5. Interpret

The positive slope on the red line reveals that scores have been increasing as time has gone on.

This is the line through the data points that minimizes the sum of the squared *y-value distances* between the line and the data points. This method is called *Ordinary Least Squares* and is a common method for fitting a straight line to a dataset to reveal a relationship. In this case the relationship shows that more recent scores are higher on average.

From this we can conclude that inflation has been occurring in the WeRateDogs scores.

However, removing the outliers did change the slope. Do you think they were fairly removed? Why or why not?

Do you think the low scores in tweets that contained images of animals other than dogs should have been removed?

Also, this data is missing tweets from mid-2018. Do you think they would change the best fit line?
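
To see why a few extreme points can move a best fit line, here is a minimal sketch of Ordinary Least Squares for a straight line, written in plain Python. It is not the code Plotly uses internally. The function name `ols_line` and the data are made up for illustration.

```python
def ols_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of (x - mean_x)*(y - mean_y) divided by sum of (x - mean_x)^2
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # the least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# made-up data: x could be months since the first tweet, y the rating
xs = [0, 1, 2, 3, 4]
ys = [1.0, 1.1, 1.2, 1.3, 1.4]
print('without outlier:', ols_line(xs, ys))

# the same data with one extreme early value added
print('with outlier   :', ols_line(xs + [0], ys + [42.0]))
```

In this made-up example, a single extreme value flips the slope from positive to negative, which is why the choice of which points to remove deserves careful thought.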


## 6. Communicate

When we look at the evidence, think about what you perceive about the information. Is this perception based on what the evidence shows? If others were to view it, what perceptions might they have? These writing prompts can help you reflect.

* I used to think __ but now I know __.
* I wish I knew more about __.
* This visualization reminds me of __.
* I really like __.


[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)