# Lab 0: Introduction to Reddit Data

In this lab, we'll cover:
- What Reddit data look like
- Several ways to summarize the conversation's tone
- Evaluation of data over time

In [None]:
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt

%matplotlib inline

## Getting data
- Data files with Reddit comments are publicly available in many places online, including torrents, Google's BigQuery, and several data hosting websites. UM keeps a full copy in our Advanced Research Computing resources.
- Reddit is one of the biggest sites on the internet. 
    - It has over 3.5 billion comments, and the data take up several TB of disk space (`1 TB = 1024 GB`)! 
    - This makes working with the data difficult.
    - For simplicity, we went ahead and used some big data tools like `pyspark` and `hadoop` to go through all the comments and select out smaller sets to work with in this lab. 
- Let's start by looking at just the comments from the subreddit community for the University of Michigan
    - This file is only 34 MB: a more manageable size!
    - The `shape` property tells us that there are 66 thousand rows (comments) and 28 columns.

In [None]:
#read the data
um_comments = pd.read_csv('data/merged/uofm.tsv', sep='\t')

#convert our dates to the date data type
um_comments['date'] = pd.to_datetime(um_comments.date)
#show the shape of our table
um_comments.shape

### What information do we have about each comment?
- We have a lot! Here are some of the most interesting columns:
    - `body` the text of the comment
    - `author` the username of the person who posted it
    - `date` when the comment was made
    - `subreddit` which community a comment is from. Here, they're all from `r/uofm`
    - `politeness` scores, computed by the [Stanford NLP group's software](https://www.cs.cornell.edu/~cristian/Politeness.html), tell us how "polite" a comment is, from 0 (not at all) to 1 (very polite). The program that gives these scores was designed primarily for comments where someone was replying to a request.
    - `sentiment` (how positive or negative a comment is), computed by the [VADER program in NLTK](http://www.nltk.org/_modules/nltk/sentiment/vader.html). (-1 is very negative, 0 is neutral, and 1 is very positive). 
    - `pej_nouns`: Sometimes when an adjective for people is used as a noun, it takes on a pejorative meaning. Research has found this is often true for the words "female," "gay," "poor," and "illegal," so this column counts the number of times those words (or versions of them like "females") are used as nouns. For more information, see this paper:
        - Palmer, Alexis, Melissa Robinson, and Kristy Philips. 2017. “[Illegal Is Not a Noun: Linguistic Form for Detection of Pejorative Nominalizations](http://www.aclweb.org/anthology/W17-3014).” Pp. 91–100 in *Proceedings of the First Workshop on Abusive Language Online.* Vancouver.
    - Several scores from the [Perspective API](https://www.perspectiveapi.com/). In this project, Google and Jigsaw teamed up to build automatic systems for finding bad comments. We used their program to score these comments already, and the scores are saved in the file.
        - `ATTACK_ON_COMMENTER` the probability that this comment is a personal attack on another commenter 
        - `INCOHERENT` whether the comment seems to make sense (high values don't make sense).
        - `INFLAMMATORY` how inflammatory the comment is
        - `LIKELY_TO_REJECT` the likelihood that New York Times comment editors would reject the comment if it was posted on their site
        - `OBSCENE` probability that the comment is obscene
        - `TOXICITY` probability that the comment is 'toxic' for community discussion
    

In [None]:
um_comments.columns.values

In [None]:
um_comments.head()

### Example comments
- This randomly selects one of the comments and shows the text. 
- Run it multiple times to see different randomly chosen comments

In [None]:
print(um_comments.sample(1).body.iloc[0])

#### Helper function for finding examples of comments that score high or low
Run this code and scroll down.

In [None]:
def get_example(data, column, where='high'):
    #pick whether to use high or low scoring comments
    if where == 'high':
        asc = False
    else:
        asc = True
    #Select the 100 most extreme comments in this column
    df = data.sort_values(by=column, ascending=asc).head(100)
    #pick one at random and print the text of it
    print(df.sample(1).body.iloc[0])
    return

### What do all those scores mean? (They're not perfect!)
- **Run the code below, trying different column names** to see examples of comments that scored high or low in each measure.  
    - The function `get_example()` picks a comment at random, so run it more than once with `alt`+`enter` and you'll see different comments.
    - Do you notice a pattern with the types of comments that come up? 
    - Do any of the scores seem to mean something a little different than you expected?
- Note that the scores don't always seem right. For example, sometimes a comment that scored high in `ATTACK_ON_COMMENTER` isn't actually a personal attack.
    - The scores were made by some of the most advanced software for this in the world, and they're still not perfect. This reminds us just how hard it is for computers to understand human language.
- Still, most of the scores seem about right. And, as we know from statistics, we can still make inferences about average scores even when there are some errors in our measurements.
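
The last point can be checked with a small simulation. This is a minimal sketch using made-up numbers, not the lab data: we invent "true" scores, add a random scoring error to each one, and compare the averages. The noise level and the number of comments are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up "true" scores for 10,000 imaginary comments (not the lab data)
true_scores = rng.uniform(0, 1, size=10_000)
# assume the scoring software makes a random error on every comment
errors = rng.normal(loc=0, scale=0.2, size=10_000)
measured_scores = true_scores + errors

# individual comments can be scored quite badly...
typical_error = np.abs(errors).mean()
# ...but the average of the measured scores stays close to the true average
gap_in_means = abs(measured_scores.mean() - true_scores.mean())

print('typical error on one comment:', round(typical_error, 3))
print('error in the average score:  ', round(gap_in_means, 4))
```

This only works when the errors are random. If the software is consistently wrong in one direction, averaging will not remove that bias.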

In [None]:
get_example(data=um_comments, column='sentiment', where='high')

In [None]:
get_example(data=um_comments, column='sentiment', where='low')

In [None]:
get_example(data=um_comments, column='politeness', where='low')

In [None]:
get_example(data=um_comments, 
            column='ATTACK_ON_COMMENTER', where='high')

In [None]:
get_example(data=um_comments, column='TOXICITY', where='high')

In [None]:
get_example(data=um_comments, column='OBSCENE', where='high')

### Getting a feel for our data
- One of the first things to do with any data is plot it. We want to get a feel for what's in it. Take a look at the histograms below.
- Some of the scores, like sentiment and politeness, are roughly bell-shaped (close to a normal distribution). 
    - What might this mean?
    - Why might there be a spike of comments with exactly 0 (totally neutral) sentiment?
- The distributions of other scores, like personal attacks and obscenity, are very skewed. 
    - Most comments are nice (low scores), but a few are not (high scores). 
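
If you want a number to go with the picture, pandas can compute skewness directly. The sketch below uses made-up scores (not the lab data) to show what a roughly symmetric column and a right-skewed column look like in numbers. The distributions and their parameters are assumptions picked for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# made-up scores, not the lab data
toy = pd.DataFrame({
    # roughly symmetric, bell-shaped scores
    'symmetric': rng.normal(loc=0.5, scale=0.1, size=5_000),
    # mostly low scores with a long tail of high ones
    'right_skewed': rng.beta(a=1, b=8, size=5_000),
})

# skewness near 0 means symmetric; a large positive value means a long right tail
skewness = toy.skew()
# in a right-skewed column, the mean is pulled above the median
mean_minus_median = toy.mean() - toy.median()

print(skewness)
print(mean_minus_median)
```

On the lab data, the same idea would be something like `um_comments[['sentiment', 'OBSCENE']].skew()`.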

In [None]:
um_comments.sentiment.hist(bins=20)

In [None]:
um_comments.politeness.hist(bins=30)

In [None]:
um_comments.ATTACK_ON_COMMENTER.hist(bins=20)

In [None]:
um_comments.OBSCENE.hist(bins=20)

In [None]:
um_comments.TOXICITY.hist(bins=20)

## Seeing trends over time
- In this lab, we're not just interested in individual comments, but in the community (in this case, a subreddit forum) and how it changes over time. 
- To study this, we're going to be using the `groupby` and `resample` functions in pandas. They're two slightly different functions that do the same basic thing:
    - Take all of our comments and put them into groups (in our case, one group for each month).
    - Summarize each group (e.g. by telling us how many comments are in it or what their average score is).
- Once we have summaries for each group, we can plot them on a graph where the X axis is time. Take a look at the examples below.
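
Here is a minimal sketch of the two approaches side by side, using a tiny made-up table (not the lab data). The column names `date` and `score` are just for illustration. It uses the `'MS'` (month start) frequency for `resample` and a monthly period for `groupby`.

```python
import pandas as pd

# made-up comments, not the lab data
toy = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-03', '2017-01-20', '2017-02-11',
                            '2017-03-05', '2017-03-09', '2017-03-30']),
    'score': [0.2, 0.4, -0.1, 0.5, 0.1, 0.3],
})

# option 1: resample puts rows into calendar-month bins based on the date column
by_resample = toy.resample('MS', on='date').score.mean()

# option 2: groupby on a "which month is this?" label computed from the date
by_groupby = toy.groupby(toy.date.dt.to_period('M')).score.mean()

print(by_resample)
print(by_groupby)
```

Both give one average per month. One difference worth knowing: `resample` also creates a row for months that have no comments at all, while `groupby` only returns the months that actually appear in the data.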

In [None]:
#Group the comments by month
monthly = um_comments.resample('M', on='date')

#count the number of comments in each group
total_comments = monthly.body.count()

#plot the number of comments per month
total_comments.plot(title='Number of comments posted')

#### We can make the plots prettier with this helper function.
Don't worry about how this code works, just run it and scroll down.

In [None]:
def make_plot(grouped, columns='id', title=None, top=None, bottom=None, 
             games=None, exams=None, classes=None, agg='mean',
             years=[2012, 2018]):
    
    fig, axs = plt.subplots(figsize=(14,10))
    if bottom is not None:
        axs.set_ylim(bottom=bottom)
    if top is not None:
        axs.set_ylim(top=top)
    
    if years is not None:
        axs.set_xlim(left=datetime(year=years[0], month=1, day=1), 
                     right=datetime(year=years[1], month=1, day=1))
    
    if games is not None:
        for g in games.iterrows():
            if g[1].game_result == 'W':
                axs.axvline(g[1].date, color='k', alpha=.6)
            elif g[1].game_result == 'L':
                axs.axvline(g[1].date, color='r', alpha=.6)
                
    if exams is not None:
        for e in exams.iterrows():
            if e[1].exams == 1:
                axs.axvline(e[1].date, color='r', alpha=.5)
                
    if classes is not None:
        for c in classes.iterrows():
            axs.axvspan(c[1].class_start, c[1].class_end, 
                        color='g', alpha=0.35)
                
    if isinstance(columns, str):
        columns = [columns]
        
        
    if agg == 'mean':
        for c in columns:
            means = grouped[c].mean()
            sems = grouped[c].sem()
            axs.plot(means.index, means, label=c)
            axs.fill_between(sems.index, means-(1.96*sems), 
                             means+(1.96*sems), alpha=0.5)

        if title is None:
            title = 'Average scores with 95% confidence interval'
    elif agg == 'count':
        for c in columns:
            counts = grouped[c].count()
            axs.plot(counts, label=c)
        if title is None:
            title = 'Number of comments per month'
    elif agg == 'unique':
        for c in columns:
            counts = grouped[c].nunique()
            axs.plot(counts, label=c)
        if title is None:
            title = 'Number of unique ___ per month'
    axs.set_title(title)
    axs.set_xlabel('Time')

    if len(columns) == 1:
        axs.set_ylabel(columns[0])
    else:
        axs.legend()
                
    plt.show()
    return

In [None]:
make_plot(monthly, agg='count', years=[2011,2018])

In [None]:
make_plot(monthly, columns='author', agg='unique', years=[2011,2018])

### Looking for patterns
- Do you notice a pattern in the number of comments or active users over time?
    - It is a little messy, but it seems like there are fewer people posting comments in the middle of each year (summer time). Why might that be? 
    
### What about the comment scores? 
- We can plot the average score of comments each month.
- Because the score is an average, it also has a standard error.
- We'll use the `make_plot()` helper function defined above to make nice plots of the averages and the confidence interval around them.
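
To make the standard error and confidence interval concrete, here is a minimal sketch on a handful of made-up scores (not the lab data). It uses the same 1.96 multiplier that `make_plot()` uses for its shaded band, which is a normal approximation that works best when a month has many comments.

```python
import pandas as pd

# made-up scores for one month of comments (not the lab data)
scores = pd.Series([0.10, 0.35, 0.20, 0.05, 0.40, 0.25, 0.15, 0.30])

mean = scores.mean()
# standard error of the mean: sample standard deviation / sqrt(number of comments)
sem = scores.sem()
# approximate 95% confidence interval around the mean
lower = mean - 1.96 * sem
upper = mean + 1.96 * sem

print(f'mean = {mean:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})')
```

Months with more comments have a smaller standard error, so their shaded band on the plot is narrower.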


# Try it yourself:
- Call the function `make_plot()` with different column names to see different plots. 
- You can also call it with multiple column names in a list, like `columns=['TOXICITY', 'sentiment']`.
- **Hint** you can change the range of the y axis by setting the arguments `top` and `bottom`. Otherwise they'll be chosen automatically. 

In [None]:
make_plot(monthly, columns='TOXICITY', top=.4)

# Reflect: look for relationships
- In the example below, we see toxicity and sentiment seem to have an inverse relationship: when one goes up, the other goes down. (In fact, the correlation is -0.8)
    - Why might this be?
- Try different combinations of variables: do other scores seem to have a relationship like this?
- Write a few sentences for each question below.
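
A correlation like the one mentioned above can be computed with the pandas `corr()` method. The sketch below uses made-up monthly averages (not the lab data), so the number it prints is not the lab's result. It only shows the mechanics.

```python
import pandas as pd

# made-up monthly averages, not the lab data
monthly_means = pd.DataFrame({
    'TOXICITY':  [0.10, 0.14, 0.12, 0.18, 0.16],
    'sentiment': [0.22, 0.15, 0.19, 0.08, 0.13],
})

# Pearson correlation: -1 is a perfect inverse relationship,
# 0 is no linear relationship, +1 is a perfect positive relationship
r = monthly_means['TOXICITY'].corr(monthly_means['sentiment'])
print(round(r, 2))
```

On the lab data, one possible way to do this is to correlate the monthly averages, with something like `monthly[['TOXICITY', 'sentiment']].mean().corr()`. Correlating individual comments instead of monthly averages can give a different number.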

# Reflect here
- . 
- .

In [None]:
make_plot(monthly, columns=['TOXICITY', 'sentiment'], top=.3)

### Adding events
- Maybe some of the patterns we see in the data correspond to events happening at the same time. 

#### Let's load data about when UM classes are in session 

In [None]:
#read data
classes = pd.read_csv('data/UM_class_periods_no_summer.tsv', sep='\t')
#convert dates to date data type
classes['class_start'] = pd.to_datetime(classes.class_start)
classes['class_end'] = pd.to_datetime(classes.class_end)
#show the most recent information
classes.tail()

In [None]:
make_plot(monthly, columns=['sentiment'], classes=classes, top=.3)

#### See a pattern?
- The graph is green during times when classes are in session, and white otherwise.
- Sentiment on the r/uofm subreddit seems to get more negative when Summer ends and Fall semester begins each year.

# Try it yourself
- Does that happen for the other scores? **Try the function with different column names** instead of `sentiment` to see. Write a list of scores where you see a pattern.

# Reflect here
- .
- .

#### What about final exams?
- Load data on exams

In [None]:
exams = pd.read_csv('data/UM_academic_calendar_no_summer.tsv', sep='\t')
exams['date'] = pd.to_datetime(exams.date)
exams.tail()

#### Helper functions
- Don't worry about how this code works, just run it and scroll down.

In [None]:
def make_plot2(grouped, columns='id', title=None, 
               top=None, bottom=None, colors='vega',
               agg='mean', names = None):
    
    color_sets = {'UM': ['#024794', '#ffcb05', '#83b2a8',
                         '#989c97', '#7a121c'],
                  'vega': ['#1f77b4', '#ff7f0e', '#2ca02c', 
                           '#d62728', '#9467bd', '#8c564b',
                           '#e377c2', '#7f7f7f', '#bcbd22']}
    
    fig, axs = plt.subplots(figsize=(14,10))
    if bottom is not None:
        axs.set_ylim(bottom=bottom)
    if top is not None:
        axs.set_ylim(top=top)
                
    if isinstance(columns, str):
        columns = [columns]
    
    if not isinstance(grouped, list):
        grouped = [grouped]
        names=['']
        
    scheme = color_sets[colors]
          
    axs.axvline(0, color='k', linestyle='dashed', alpha=.5)
    i = 0
    for g, n in zip(grouped, names):
        if agg == 'mean':
            for c in columns:
                means = g[c].mean()
                sems = g[c].sem()
                axs.plot(means.index, means, color=scheme[i],
                         label=n+' '+c)
                axs.fill_between(sems.index, means-(1.96*sems), 
                                 means+(1.96*sems), 
                                 color=scheme[i], alpha=0.5)
                i += 1

            if title is None:
                title = 'Average with 95% confidence interval'
        elif agg == 'count':
            for c in columns:
                counts = g[c].count()
                axs.plot(counts, color=scheme[i], label=n+' '+c)
                i += 1
            if title is None:
                title = 'Number of comments'
            axs.set_ylabel('Count')
        elif agg == 'unique':
            for c in columns:
                counts = g[c].nunique()
                axs.plot(counts, color=scheme[i], label=n+' '+c)
                i += 1
            if title is None:
                title = 'Number of unique ___'
            axs.set_ylabel('Count')

    axs.set_title(title)
    axs.set_xlabel('Days from Event')

    if len(columns) == 1:
        axs.set_ylabel(columns[0])
        if len(grouped) > 1:
            axs.legend()
    else:
        axs.legend()
                
    plt.show()
    return

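def center_on_dates(comments, dates, window_size=14):
    subset = []
    for d in dates.date:
        #window from `window_size` days before to `window_size` days after
        start = d - pd.Timedelta(window_size, unit='d')
        end = d + pd.Timedelta(window_size+1, unit='d')
        #use < end so a comment at exactly midnight doesn't create an extra day
        tmp = comments[(comments.date >= start) &
                          (comments.date < end)].copy()
        #days between each comment and the event (negative = before)
        tmp['days'] = tmp.date.apply(lambda x: (x - d).days)
        subset.append(tmp)

    #combine the windows from every event and group by relative day
    subset = pd.concat(subset)
    return subset.groupby(by='days')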

def get_example_from_day(grouped, day=0, search=None):
    if search is not None:
        tmp = grouped.apply(lambda x: x.sample(frac=1))
        tmp = tmp[tmp.body.str.contains(search, na=False)]  
        tmp = tmp[tmp.days == day]
        if len(tmp) > 0:
            tmp = tmp.sample(1).body.values[0]
        else:
            tmp = 'Sorry, there are no comments with the search term "'
            tmp += search
            tmp += '" on day '
            tmp += str(day)
            tmp += ". Try another search or day."
    else:
        tmp = grouped.apply(lambda x: x.sample(1))
        tmp = tmp[tmp.days == day].body.values[0]
    print(tmp)
    return

In [None]:
exam_weeks = center_on_dates(um_comments, exams)

### Plots showing posts two weeks before and after finals start
- The vertical bar shows when finals start.
    - Note that we combined the two weeks before and after the start of finals from every semester, so each point on the plot includes comments from all semesters. That's why the X axis is "Days from Event" rather than a specific date.
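
To see what "days from event" means, here is a minimal sketch on a few made-up comments and two made-up event dates (not the lab data). It follows the same idea as `center_on_dates()`, but with a 2-day window to keep the output short.

```python
import pandas as pd

# made-up comments and event dates, not the lab data
comments = pd.DataFrame({
    'date': pd.to_datetime(['2016-12-13', '2016-12-15', '2016-12-16',
                            '2017-04-19', '2017-04-20', '2017-04-22']),
    'sentiment': [0.3, 0.1, -0.2, 0.2, 0.0, -0.1],
})
events = pd.to_datetime(['2016-12-15', '2017-04-20'])

pieces = []
for event in events:
    # days between each comment and this event (negative = before the event)
    days = (comments.date - event).dt.days
    # keep only the comments within 2 days of the event
    nearby = comments[days.abs() <= 2].copy()
    nearby['days'] = days[days.abs() <= 2]
    pieces.append(nearby)

# stack the windows from both events, then average by relative day
centered = pd.concat(pieces)
by_day = centered.groupby('days').sentiment.mean()
print(by_day)
```

Day 0 now mixes comments from both events, which is why the plots show relative days instead of calendar dates.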

In [None]:
make_plot2(exam_weeks, agg='count', bottom=0, top=5500)

In [None]:
make_plot2(exam_weeks, columns='sentiment')

In [None]:
make_plot2(exam_weeks, columns='INFLAMMATORY')

# Reflect
### Interpreting these graphs
- Do you see any interesting patterns? Pick one.
- What might be causing this pattern? Write a few sentences. Hints:
    - Do you think different people are posting on different days?
    - Do you think the same people might post different things on different days?
- Test your answers by looking at example posts from some of these days:
    - Run the function below multiple times: it will show you random posts from whatever day you ask for. 
    - Try asking it for different days.
- Try the plots above with different scores. Do exams correspond with trends in scores other than sentiment? Write a few sentences.

# Reflect here:
- .
- .
- .

In [None]:
get_example_from_day(exam_weeks, day=10)

#### You can also search for comments with specific words

In [None]:
get_example_from_day(exam_weeks, day=10, search='finals')

### Football games
- Load data from the subreddit for UM athletics, `r/MichiganWolverines`, and the dates of football games.

In [None]:
sports_comments = pd.read_csv('data/merged/MichiganWolverines.tsv', 
                              sep='\t')
sports_comments['date'] = pd.to_datetime(sports_comments.date)

games = pd.read_csv('data/UM_football.tsv', sep='\t')
games['date'] = pd.to_datetime(games.date)
games.head()

In [None]:
tmp = games[games.date > datetime(year=2011, month=1, day=1)]
tmp.shape

#### Just for fun, what is our all time win / loss record?

In [None]:
games.game_result.value_counts()

### Game day sentiment
- First, let's separate out the games we won and lost.

In [None]:
win_days = center_on_dates(sports_comments, 
                           games[games.game_result == 'W'], 
                           window_size=7)
loss_days = center_on_dates(sports_comments, 
                           games[games.game_result == 'L'], 
                           window_size=7)

#### Look at a few examples of posts from days we won and lost games.

In [None]:
#games we won
get_example_from_day(win_days, day=0)

In [None]:
#games we lost
get_example_from_day(loss_days, day=0)

In [None]:
#games we lost, where people mention referees
get_example_from_day(loss_days, day=0, search='ref')

In [None]:
make_plot2([win_days, loss_days], names=['win', 'loss'],
           columns='sentiment', 
           colors='UM',
           title='Average sentiment in r/MichiganWolverines before and after game days')


# Reflect
### Why is sentiment worse on game days, even when we win?
- Before going further, come up with a hypothesis that might explain lower sentiment on game days, regardless of whether we win. Write it down.
- In the next part, we separate out comments not by whether we won or lost the game that day, but by whether their sentiment was positive or negative. That will help us answer these questions:
    - Are there more negative comments on game days? 
    - Fewer positive ones?
    - Do the positive comments get *less* positive on game days? 
    - Do the negative ones get *more* negative?

# Hypothesis here
- .

In [None]:
game_pos = center_on_dates(sports_comments[sports_comments.sentiment > 0], 
                          games, window_size=7)
game_neg = center_on_dates(sports_comments[sports_comments.sentiment < 0], 
                           games, window_size=7)

In [None]:
make_plot2([game_pos, game_neg], names=['pos', 'neg'],
           columns='id', agg='count', colors='UM',
           title='Number of positive and negative comments in r/MichiganWolverines before and after game days')


In [None]:
make_plot2([game_pos, game_neg], names=['pos', 'neg'],
           columns='sentiment', colors='UM',
           title='Average sentiment of positive and negative comments in r/MichiganWolverines before and after game days')


## What we learned:
1. What Reddit is, and what comment data looks like.
2. Various ways of scoring comments to summarize their contents.
3. The difficulty of getting good scores.
4. Grouping data by time and showing trends in average comment scores.
5. Comparing time series data with events.

# Reflect
### Write a brief proposal:
Write a two-paragraph proposal to expand on this analysis. Answer the following questions:
1. What would you do to test our preliminary findings? That is, what analysis could you do to check if our initial guesses were right?
2. What outside factors would you need to control for or look at in your comparison?
    - **Hint:** Game days are usually Saturdays. What if our findings happen because it is Saturday, not because it is a game day? 
3. What other scores might you want to look at? Why?

# Write your proposal here
.
.
.
.
.

# Optional

## Do Saturdays cause negative comments? 
- This code will help us find out!

In [None]:
from datetime import timedelta

def get_dates(from_date, to_date, day_list=[5]):
    tmp_list = list()
    date_list = list()
    for x in range((to_date - from_date).days+1):
        tmp_list.append(from_date + timedelta(days=x))
    for date_record in tmp_list:
        if date_record.weekday() in day_list:
            date_list.append(date_record)
 
    return date_list

dates = get_dates(datetime(year=2012, month=1, day=1), 
          datetime(year=2018, month=1, day=1), 
          day_list=[5])

gd = set(games.date.tolist())

sats = pd.DataFrame(dates).rename(columns={0: 'date'})
sats = set(sats.date) - gd
sats = pd.DataFrame(list(sats)).rename(columns={0: 'date'})
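
As an aside, pandas can build the same list of Saturdays in one call. This is a sketch of an alternative, not a replacement for the code above. The `game_dates` here are hypothetical values used only to show how game days could be removed.

```python
from datetime import datetime, timedelta

import pandas as pd

start = datetime(year=2012, month=1, day=1)
end = datetime(year=2018, month=1, day=1)

# every Saturday in the range; 'W-SAT' means weekly, anchored on Saturday
saturdays = pd.date_range(start, end, freq='W-SAT')

# the same list built by hand, mirroring get_dates() above
by_hand = [start + timedelta(days=x)
           for x in range((end - start).days + 1)
           if (start + timedelta(days=x)).weekday() == 5]

# hypothetical game dates, for illustration only
game_dates = pd.to_datetime(['2012-01-07', '2012-01-21'])
# drop the Saturdays that were game days
non_game = saturdays.difference(game_dates)
sats_df = pd.DataFrame({'date': non_game})

print(len(saturdays), len(by_hand), len(sats_df))
```

The resulting `sats_df` has a single `date` column, which is the shape `center_on_dates()` expects for its `dates` argument.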

In [None]:
saturdays = center_on_dates(sports_comments, 
                           sats, 
                           window_size=7)

make_plot2([win_days, loss_days, saturdays], 
           colors='vega',
           names=['win', 'loss', 'saturday'],
           columns='sentiment', 
           title='Average sentiment in r/MichiganWolverines before and after game days')


## Looks like game days are worse than regular Saturdays!