![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

### Callysto's Weekly Data Visualization
## WeRateDogs inflationary scoring


### Recommended grade level: 9-12

### Instructions
#### Step 1 (your only step): “Run” the cells to see the graphs
Click “Cell” and select “Run All.” This will import the data and run all the code to make this week's data visualizations (scroll to the top after you’ve run the cells). **You don’t need to do any coding**.

![instructions](https://github.com/callysto/data-viz-of-the-week/blob/main/images/instructions.png?raw=true)

### About This Notebook

Callysto's Weekly Data Visualization is a learning resource that helps Grades 5-12 teachers and students grow and develop data literacy skills. We do this by providing a data visualization, like a graph, and asking teachers and students to interpret it. This companion resource walks learners through how the data visualization is created and interpreted using the data science process. The steps of this process are listed below and applied to each weekly topic.

1. Question - What are we trying to answer?
2. Gather - Find the data source(s) you will need.
3. Organize - Arrange the data so that you can easily explore it.
4. Explore - Examine the data to look for evidence to answer our question. This includes creating visualizations.
5. Interpret - Explain how the evidence answers our question.
6. Communicate - Reflect on the interpretation.

### Acknowledgment

This project and its idea are based on [this](http://dhmontgomery.com/2017/03/dogrates/) data science exploration and use, with permission, a data set of tweets collected by [Greg Baker](https://www.sfu.ca/computing/people/faculty/gregbaker.html), a senior lecturer at [SFU](https://www.sfu.ca), for use in his Computational Data Science class.

## 1. Question
<div>
<img src="./images/brent.png" width="500"/>
</div>

[WeRateDogs](https://twitter.com/dog_rates) is a [popular](https://en.wikipedia.org/wiki/WeRateDogs) Twitter account that offers humorous dog ratings and has spawned many memes. The Twitter exchange above was a popular meme a few years ago, and the account is even [credited](https://www.npr.org/sections/alltechconsidered/2017/04/23/524514526/dogs-are-doggos-an-internet-language-built-around-love-for-the-puppers) with creating or formalizing the "pupper" and "doggo" lingo used to describe dogs on the internet, much like 4chan users are credited with creating the 'LOLCAT' pidgin language in the 2000s.

Beyond the history of internet memes, the account can demonstrate the concept of grade inflation. Grade inflation is usually discussed in the context of high schools or universities giving out more high B and A grades, which diminishes the value of those higher grades. Here we will see whether the scores given out by the WeRateDogs account are suffering from grade inflation. Hence, our question for this notebook is:
* Are the WeRateDogs scores suffering from inflation?



## 2. Gather

We will import the Python libraries we need, then read in our already collected dataset of WeRateDogs tweets.

In [None]:
#import needed libraries
import numpy as np
import pandas as pd
import re
import plotly.graph_objects as go
import plotly.express as px
import os

In [None]:
#read in the dataset
path = os.path.join('datasets', 'dog_rates_tweets.csv')
data = pd.read_csv(path, dtype={'text': str}, parse_dates=['created_at']).set_index(keys='id')

In [None]:
print('There are', data.shape[0], 'tweets in the dataset.\nThe first few rows look like this:')
#view the unmutated data
data.head()

## 3. Organize
This step will take some work. Our data does not have any scores clearly visible. First we will need to extract scores from our tweets in an automated fashion. It would take far too long and be far too error-prone to manually check eight thousand tweets. Then we will have to check that our automated process was robust enough for our study.
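
To illustrate both the extraction idea and why its results need checking, here is a minimal sketch that applies the same 'x/10' pattern used by this notebook's helper to a few made-up tweet texts. The sample texts and the name `first_score` are hypothetical and are not part of the dataset or the notebook.

```python
import re

# The 'x/10' pattern: a whole or decimal number followed by '/10'.
PATTERN = re.compile(r'(\d+(\.\d+)?)/10')

# Made-up tweet texts for illustration only (not taken from the dataset).
samples = [
    "This is Bella. She loves naps. 13/10 would cuddle",
    "Happy 4th of July! 1776/10",
    "A very good dog, no number today",
    "Solid 9.5/10 pupper, then later 12/10",
]

def first_score(text):
    """Return the first 'x/10' score as a fraction, or None if absent."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)) / 10

scores = [first_score(s) for s in samples]
for text, score in zip(samples, scores):
    print(f"{score!s:>6}  <- {text}")
```

The second sample shows the kind of false positive an automated search can produce: the pattern happily reads a date-like "1776/10" as a score of 177.6, which is why the extracted values need a sanity check.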

In [None]:
def text_to_rating(text):
    """a helper function to find any 'x/10'
    the function only returns the first such score found in a tweet"""
    match = re.search(r'(\d+(\.\d+)?)/10',str(text))
    if match:
        top, btm = match[0].split('/')
        return float(top)/float(btm)
    else:
        return None

In [None]:
#find the ratings
data['rating'] = data['text'].map(text_to_rating)
#drop the actual content of the tweets
data = data.drop(['text'], axis=1)
#drop tweets where no 'x/10' score was found (their rating is NaN)
data = data[np.isfinite(data['rating'])]
#rename created_at
data.rename(columns={'created_at':'date created'}, inplace=True)

In [None]:
data = data.sort_values(by='rating', ascending = False)
data

We see that the data has a large range of values, and the top few seem particularly large for an out-of-ten system. Very large data points are usually referred to as outliers. Depending on the use, the data and the question, outliers sometimes need to be removed, but at other times they are key to painting an accurate picture of the data. Below, we remove a number of the largest data points. This was decided by looking at the tweets and checking whether they were valid scores or something else. The scores above 14/10 all seemed irrelevant.

Sometimes they were dates:
![holiday score](./images/1776.png)


Sometimes they were scores scraped from conversations about scoring:
![](./images/meta.png)


Sometimes they were honorary:
![](./images/honourary.png)

This dataset did not contain any genuine scores above 14/10. However, the web scraping used to collect the data did not find every single tweet, so it is possibly missing valid 15/10 scores for especially good dogs.

The low scores, except for the 0/10 scores, seemed to be genuine ratings. Many low scores, however, were rating animals other than dogs on the same scale. For our purposes we will consider those to be true ratings.

Many of the low ratings were like this one, given to a goat:
![](./images/goat.png)

Since this data exploration isn't a serious one, we can decide fairly freely which data points to keep and which to drop. However, in hard science or research, data should never be removed without careful thought. Changing key data points can grossly change the results.
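
The next cell removes rows by position, which only works because the data was sorted by rating first. As an alternative, one could filter by value instead. The sketch below shows that idea on a made-up table; the cut-offs `LOW` and `HIGH` are assumptions based on the observations above (nothing genuine above 14/10, and the 0/10 scores set aside), and this filter is not guaranteed to select exactly the same rows as the notebook's positional slice.

```python
import pandas as pd

# Made-up ratings stored as fractions (13/10 -> 1.3), like the 'rating' column.
toy = pd.DataFrame(
    {"rating": [177.6, 42.0, 1.4, 1.3, 1.2, 1.0, 0.5, 0.0]},
    index=pd.Index([101, 102, 103, 104, 105, 106, 107, 108], name="id"),
)

# Hypothetical cut-offs: keep scores above 0/10 and up to 14/10.
LOW, HIGH = 0.0, 1.4

# Build a True/False mask, then keep only the rows where it is True.
keep = (toy["rating"] > LOW) & (toy["rating"] <= HIGH)
toy_cleaned = toy[keep].copy()

print(f"kept {len(toy_cleaned)} of {len(toy)} rows")
print(toy_cleaned)
```

A value-based filter states the rule for removal directly, so a reader can judge whether the rule is fair without needing to know how the rows happened to be ordered.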

In [None]:
#remove the 13 largest and 3 smallest scores (data is sorted by rating, largest first)
data_cleaned = data.iloc[13:-3].copy()

In [None]:
data_cleaned

## 4. Explore

In [None]:
def create_plot(data):
    """A simple function that will take in a dataframe formatted like ours
    and produce a scatterplot with a best fit line
    input:
        a pandas dataframe with 'rating' and 'date created' columns
    return:
        a plotly express scatterplot with a best fit line (untitled)
    """
    fig = px.scatter(data, x='date created', y='rating', trendline='ols')
    #highlight the best fit line in red to make it more visible
    fig.data[1].update(line_color='red')
    #show the tweets in the legend
    fig['data'][0]['showlegend']=True
    fig['data'][0]['name']='Tweet'
    # show the best fit line in the legend
    fig['data'][1]['showlegend']=True
    fig['data'][1]['name']='Best Fit Line (OLS)'
    fig.update_layout(showlegend=True)
    #return the figure so the caller can add a title and show it
    return fig

In [None]:
fig = create_plot(data)
fig.update_layout(title = 'Plot without outliers removed')
fig.show()

When the outliers are not removed, it is very hard to make out any detail in the visual, and the line of best fit has a slope of $2.8 \times 10^{-9}$.

In [None]:
fig = create_plot(data_cleaned)
fig.update_layout(title = 'WeRateDogs Scores Given Versus Time')
fig.show()

Once the outliers are removed, the visual clearly shows separate scores, and the slope is now $3.4\times 10^{-9}$.

## 5. Interpret
The positive slope of the red line reveals that scores have been increasing as time has gone on. This line is the line through the data points that minimizes the sum of the squared vertical (y-value) distances between the line and the data points. This method is called *Ordinary Least Squares* and is a common method for fitting a straight line through a dataset to reveal a relationship.
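
To make the least-squares idea concrete, here is a minimal sketch that computes the best fit line by hand for a few made-up points using the standard closed-form formulas. It assumes time is measured in seconds since the Unix epoch, which we believe is how the plotting library treats dates when fitting its trendline; that is an assumption, not something this notebook verifies. The points, the function name `ols_line` and the constant `SECONDS_PER_YEAR` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Made-up data: the rating rises by exactly 0.1 every 365 days.
start = datetime(2016, 1, 1, tzinfo=timezone.utc)
dates = [start + timedelta(days=365 * k) for k in range(4)]
ratings = [1.0, 1.1, 1.2, 1.3]

# Assumed time unit: seconds since the Unix epoch.
x = [d.timestamp() for d in dates]

def ols_line(x, y):
    """Return (slope, intercept) of the ordinary least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # How x and y vary together, and how x varies on its own.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope_per_second, intercept = ols_line(x, ratings)

# Convert the tiny per-second slope into a per-year slope.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
slope_per_year = slope_per_second * SECONDS_PER_YEAR

print(f"slope per second: {slope_per_second:.3e}")
print(f"slope per year:   {slope_per_year:.3f}")
```

If the per-second assumption holds, a slope that looks tiny, such as $3.4\times 10^{-9}$, would correspond to roughly 0.11 per year on the fraction scale, or about one extra point out of ten each year.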

We can conclude that grade inflation has been affecting the WeRateDogs scores.

However, removing the outliers did change the size of that slope. Do you think they were fairly removed? Why or why not?

Also, this data is missing tweets from mid-2018. Do you think they'd change the best fit line?

## 6. Communicate
Below we will reflect on the new information that is presented from the data. When we look at the evidence, think about what you perceive about the information. Is this perception based on what the evidence shows? If others were to view it, what perceptions might they have? These writing prompts can help you reflect.

* I used to think __ but now I know __.
* I wish I knew more about __.
* This visualization reminds me of __.
* I really like __.


[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)