# Lab 0: Introduction to Reddit Data

In this lab, we'll cover:
- What Reddit data look like
- Several ways to summarize the conversation's tone
- Evaluation of data over time

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

## Getting data
- Data files with Reddit comments are publicly available in many places online, including torrents, Google's BigQuery, and several data hosting websites. UM keeps a full copy in our Advanced Research Computing resources.
- Reddit is one of the biggest sites on the internet. 
    - It has over 3 billion comments, and the data take up several TB of disk space (`1TB = 1024 GB`)! 
    - This makes working with the data difficult.
    - For simplicity, we went ahead and used some big data tools like `pyspark` and `hadoop` to go through all the comments and select out smaller sets to work with in this lab. 
- Let's start by looking at just the comments from the subreddit community for the University of Michigan
    - This file is only 34 MB: a more manageable size!
    - The `shape` property tells us that there are 66 thousand rows (comments) and 51 columns.

In [None]:
um_comments = pd.read_csv('data/merged/uofm.tsv', sep='\t')
um_comments.shape
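
For a file too large to load at once, one option is to read it in chunks and keep only the rows of interest. The sketch below is not the `pyspark`/`hadoop` pipeline used to prepare this lab's files; it only illustrates the same filtering idea with pandas' `chunksize` option, on a tiny made-up table held in memory. The rows, the name `toy_um`, and the chunk size of 2 are assumptions for illustration.

```python
import io
import pandas as pd

# A tiny made-up TSV standing in for a huge comment dump (hypothetical rows)
raw = (
    "subreddit\tauthor\tbody\n"
    "uofm\talice\tGo Blue!\n"
    "aww\tbob\tCute dog\n"
    "uofm\tcarol\tWhere is the best study spot?\n"
    "news\tdave\tInteresting article\n"
    "uofm\talice\tTry the law library\n"
)

# Read two rows at a time and keep only r/uofm comments from each chunk
kept = []
for chunk in pd.read_csv(io.StringIO(raw), sep='\t', chunksize=2):
    kept.append(chunk[chunk.subreddit == 'uofm'])

# Stitch the filtered pieces back into one small DataFrame
toy_um = pd.concat(kept, ignore_index=True)
print(toy_um.shape)
```

Only one chunk is in memory at a time, so the same loop could in principle be pointed at a file far larger than the toy table.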

### A little cleanup for our data
- Don't worry about this code, just run it and scroll down

In [None]:
def clean_up(df):
    df['date'] = pd.to_datetime(df.created_utc, unit='s')
    drop_cols = ['approved_at_utc', 'approved_by', 'archived', 
                 'author_cakeday', 'author_flair_css_class', 
                 'author_flair_text', 'banned_at_utc', 'banned_by',
                 'can_gild', 'can_mod_post', 'collapsed', 
                 'collapsed_reason', 'created_utc', 'distinguished', 
                 'downs', 'likes', 'removal_reason', 
                 'report_reasons', 'retrieved_on', 'saved',
                 'score_hidden', 'stickied',  'subreddit_id', 'ups']
    df.drop(columns=drop_cols, errors='ignore', inplace=True)
    
    return df

um_comments = clean_up(um_comments)
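
The one line of `clean_up` worth a closer look is the date conversion. The `created_utc` column holds Unix timestamps (seconds since January 1, 1970, UTC), and `pd.to_datetime(..., unit='s')` turns those numbers into readable dates. A small sketch with made-up values, not taken from the lab data:

```python
import pandas as pd

# Made-up epoch timestamps in seconds (hypothetical values)
toy = pd.DataFrame({'created_utc': [0, 1500000000]})

# Same conversion that clean_up applies to the real comments
toy['date'] = pd.to_datetime(toy.created_utc, unit='s')
print(toy)
```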

### What information do we have about each comment?
- We have a lot! Here are some of the most interesting columns:
    - `body` the text of the comment
    - `author` the username of the person who posted it
    - `date` when the comment was made (the cleanup step above converts the raw `created_utc` timestamp into this column)
    - `subreddit` which community a comment is from. Here, they're all from `r/uofm`
    - Several scores from the [Perspective API](https://www.perspectiveapi.com/). In this project, Google and Jigsaw teamed up to build automatic systems for finding bad comments. We used their program to score these comments already, and the scores are saved in the file.
        - `ATTACK_ON_COMMENTER` the probability that this comment is a personal attack on another commenter
        - `INCOHERENT` whether the comment seems to make sense
        - `INFLAMMATORY` how inflammatory the comment is
        - `LIKELY_TO_REJECT` the likelihood that New York Times comment editors would reject the comment if it were posted on their site
        - `OBSCENE` whether the comment is obscene
        - `TOXICITY` whether the comment is 'toxic' for community discussion
    - `politeness` scores, computed by the [Stanford NLP group's software](https://www.cs.cornell.edu/~cristian/Politeness.html).
    - `sentiment` (how positive or negative a comment is), computed by the [VADER program in NLTK](http://www.nltk.org/_modules/nltk/sentiment/vader.html)

In [None]:
um_comments.columns.values

In [None]:
um_comments.head()
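
With this many columns, it can help to pull out just the score columns and summarize them with `describe()`. The sketch below uses a hypothetical three-row stand-in for `um_comments`; the values are invented, and only the column names come from the list above.

```python
import pandas as pd

# Hypothetical stand-in for um_comments with a few of its columns
toy = pd.DataFrame({
    'body': ['Thanks so much!', 'This is wrong.', 'ok'],
    'author': ['alice', 'bob', 'alice'],
    'TOXICITY': [0.02, 0.40, 0.05],
    'sentiment': [0.8, -0.5, 0.0],
})

# Pick out the score columns and summarize them (count, mean, min, max, ...)
score_cols = ['TOXICITY', 'sentiment']
summary = toy[score_cols].describe()
print(summary)
```

The same two lines should work on the real data by swapping `toy` for `um_comments`.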

### Example comments
- This randomly selects one of the comments and shows the text. Run it multiple times to see different randomly chosen comments

In [None]:
print(um_comments.sample(1).body.iloc[0])

### Getting a feel for our data
- One of the first things to do with any data is plot it. We want to get a feel for what's in it. 
- Some of the scores, like sentiment and politeness, are roughly normally distributed (bell-shaped).
    - What might this mean?
    - Why might there be a spike of comments with exactly 0 (totally neutral) sentiment?
- The distributions of other scores, like personal attacks and obscenity, are very skewed.
    - Most comments are nice (low scores), but a few are not (high scores). 

In [None]:
um_comments.sentiment.hist(bins=20)

In [None]:
um_comments.politeness.hist(bins=20)

In [None]:
um_comments.ATTACK_ON_COMMENTER.hist(bins=20)

In [None]:
um_comments.OBSCENE.hist(bins=20)
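
Histograms give the visual impression; two quick numbers can back it up. The sketch below uses made-up scores (not the lab data) to show how one might measure the share of comments at exactly 0 and the skewness of a column. The distributions chosen here are assumptions meant only to mimic the shapes described above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up sentiment: bell-shaped scores plus 200 comments at exactly 0
sentiment = pd.Series(np.concatenate([rng.normal(0, 0.4, 1000), np.zeros(200)]))
# Made-up attack scores: mostly near 0 with a long tail toward 1
attack = pd.Series(rng.beta(1, 10, 1200))

# Share of comments with exactly neutral sentiment
zero_share = (sentiment == 0).mean()

# Skewness near 0 means roughly symmetric; a large positive value
# means a long tail of high scores
print(zero_share, sentiment.skew(), attack.skew())
```

On the real data, the same calls would be `(um_comments.sentiment == 0).mean()` and `um_comments.ATTACK_ON_COMMENTER.skew()`.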

## Automated scores aren't perfect.
- Here we have a function that will show us examples of some comments that score highest or lowest in one of the measures. 
- The function picks a comment at random, so run it more than once and you'll see different comments.
- Note that the scores don't always seem right. For example, sometimes a comment that scored high in `ATTACK_ON_COMMENTER` isn't actually a personal attack.
    - The scores were made by some of the most advanced software for this in the world, and they're still not perfect. This reminds us just how hard it is for computers to understand human language.
- Still, most of the scores seem about right. And, as we know from statistics, we can still make inferences about average scores even when there are some errors in our measurements.
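
The claim about averages can be checked with a small simulation. This is a sketch on invented numbers, and it relies on one important assumption: the scoring errors are random with an average of zero. If the software were wrong in a consistent direction, the averages would be off too.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "true" scores for 10,000 comments, mostly low
true_scores = rng.beta(2, 8, 10_000)
# Measured scores = truth plus zero-mean noise (the key assumption)
measured = true_scores + rng.normal(0, 0.2, 10_000)

# Individual comments can be scored quite wrongly...
typical_error = np.abs(measured - true_scores).mean()
# ...but the two averages nearly agree because the errors cancel out
mean_gap = abs(measured.mean() - true_scores.mean())
print(typical_error, mean_gap)
```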

In [None]:
def get_example(data, column, where='high'):
    """Print one random comment from the 100 highest- or lowest-scoring in `column`."""
    # Sort descending for the highest scores, ascending for the lowest
    ascending = (where != 'high')
    # Select the 100 most extreme comments in this column
    df = data.sort_values(by=column, ascending=ascending).head(100)
    # Pick one at random and print its text
    print(df.sample(1).body.iloc[0])

In [None]:
get_example(data=um_comments, column='sentiment', where='high')

In [None]:
get_example(data=um_comments, column='politeness', where='low')

In [None]:
get_example(data=um_comments, 
            column='ATTACK_ON_COMMENTER', where='high')

## Seeing trends over time
- In this lab, we're not just interested in individual comments, but in the community (in this case, a subreddit forum) and how it changes over time. 
- To study this, we're going to be using the `groupby` and `resample` functions in pandas. They're two slightly different functions that do the same basic thing:
    - Take all of our comments and put them into groups (in our case, one group for each month).
    - Summarize each group (e.g. by telling us how many comments are in it or what their average score is).
- Once we have summaries for each group, we can plot them on a graph where the X axis is time. Take a look at the examples below.
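
To see that the two approaches really do the same thing, here is a sketch on a handful of made-up comments (the data and the names `by_resample` and `by_groupby` are hypothetical). It uses the `'MS'` alias, which labels each month by its first day, rather than a month-end alias, because `'MS'` is spelled the same way across pandas versions.

```python
import pandas as pd

# Made-up comments spread over three months (hypothetical data)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-05', '2017-01-20', '2017-02-11',
                            '2017-03-02', '2017-03-15', '2017-03-30']),
    'author': ['alice', 'bob', 'alice', 'carol', 'carol', 'bob'],
    'sentiment': [0.5, -0.1, 0.0, 0.3, 0.7, -0.2],
})

# Option 1: resample puts rows into calendar bins based on the date column
by_resample = toy.resample('MS', on='date').sentiment.mean()

# Option 2: groupby with a Grouper does the same monthly binning
by_groupby = toy.groupby(pd.Grouper(key='date', freq='MS')).sentiment.mean()

print(by_resample)
print(by_groupby)
```

`resample` is the shorter spelling when grouping only by time; `groupby` is more flexible because it can combine time with other columns, such as `subreddit`.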

In [None]:
#Group the comments by month
#('M' labels each group by its month-end date; pandas 2.2 and later prefer the alias 'ME')
monthly = um_comments.resample('M', on='date')

#count the number of comments in each group
total_comments = monthly.body.count()

#count the number of unique comment authors in each group
active_users = monthly.author.nunique()

#show first few months
active_users.head(20)

In [None]:
total_comments.plot(title='Number of comments posted')

In [None]:
#show all months in a graph
active_users.plot(title='Number of active users')

### Looking for patterns
- Do you notice a pattern in the number of comments or active users over time?
    - It is a little messy, but it seems like there are fewer people posting comments in the middle of each year (summertime). Why might that be?
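
One way to make a yearly pattern easier to see is to average the monthly numbers by calendar month, so that every January is pooled together, every February, and so on. The sketch below does this on made-up data in which a summer dip has been built in on purpose; it shows the technique, not a result about `r/uofm`.

```python
import pandas as pd

# Made-up comments: 20 authors post each month, but only 5 in June-August
rows = []
for month_start in pd.date_range('2015-01-01', periods=36, freq='MS'):
    n_authors = 5 if month_start.month in (6, 7, 8) else 20
    rows += [{'date': month_start, 'author': f'user{i}'} for i in range(n_authors)]
toy = pd.DataFrame(rows)

# Monthly active users ('MS' labels each month by its first day)
active = toy.resample('MS', on='date').author.nunique()

# Average across years for each calendar month (1 = January, ..., 12 = December)
by_calendar_month = active.groupby(active.index.month).mean()
print(by_calendar_month)
```

Applied to the lab's `active_users` series, the last step would be `active_users.groupby(active_users.index.month).mean()`.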