# Lab 2: Other Shocks to Communities

## Contents
1. [Setup](#Section-1%3A-Setup)
    1. [Import](#1.1-Import-Packages)
    1. [Data](#1.2-Data)
1. [Activity over Time](#Section-2%3A-Activity-over-Time)
    1. [Measuring Overall Conflict](#2.1-Measuring-Overall-Conflict)
    1. [Drivers of Change](#2.2-Drivers-of-Change)
    1. [Joining after Default Day](#2.3-Joining-after-Default-Day)
    1. [Being New to the Community](#2.4-Being-New-to-the-Community)

## Section 0: Background
- We saw in Lab 1 that the community r/TwoXChromosomes was dramatically changed when it was made a default subreddit.
- In this lab, we look at other communities that were made default subreddits on the same day that TwoX was (May 7, 2014).

## Section 1: Setup
### 1.1 Import Packages

In [None]:
import pandas as pd
import numpy as np
import os
import urllib.request
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
pd.plotting.register_matplotlib_converters()

%matplotlib inline

### 1.2 Data
- Pick the community you want to look at in this lab and put its name below. 
- Be sure your spelling and capitalization are **exactly** the same as in the list, or else the code will break.

In [None]:
links = {'photoshopbattles': 'https://docs.google.com/uc?export=download&id=1E_Yzu4-uH4Lu-xuniyjJPn9IZ7OmfOmE',
         'tifu': 'https://umich.box.com/shared/static/9ckqb90ink4looycv9zq3sxrf9texl3x.gz',
         'personalfinance': 'https://umich.box.com/shared/static/bda1e5c7nme84wfk7zjma0gm71zwvjvt.gz',
         'nottheonion': 'https://umich.box.com/shared/static/tz1lpysrl0u97ji6ekrwnu952kyiti2n.gz',
         'mildlyinteresting': 'https://umich.box.com/shared/static/7rmejj8d5avtigfic4qh0m8fdvw40s4p.gz',
         'dataisbeautiful': 'https://docs.google.com/uc?export=download&id=1OHXq7oHNgumXKI3n3oa8sWKF0jJhC_jM',
         'listentothis': 'https://docs.google.com/uc?export=download&id=1OjD6NDFdAsr_td94y8iJY1MbUml7fKir',
         'gadgets': 'https://docs.google.com/uc?export=download&id=1usRza8vTjXKbQJ5ACFLMYgjO1mc6u4pO',
         'Showerthoughts': 'https://umich.box.com/shared/static/lwgd51d1zwf0bwuj8rdhl14jhpo4nb11.gz',
         'LifeProTips': 'https://umich.box.com/shared/static/r1w0s4tpyunw2zxmc7dszipz7kvh6r3z.gz',
         'Jokes': 'https://docs.google.com/uc?export=download&id=1JbJ3VrzYVg2np4LdC7sO7yTWqWIHTCIE',
         'DIY': 'https://docs.google.com/uc?export=download&id=10esiYYTyqXEq40h7OTq84TzS91aRx0J4',
         'creepy': 'https://docs.google.com/uc?export=download&id=1gorS9SbXhtslWAp00xnKdZnKuGfZELXT',
         'Documentaries': 'https://docs.google.com/uc?export=download&id=1-NFpyno9pO_m0IY1w1uVfj-sNx7FFeCu',
         'Art': 'https://docs.google.com/uc?export=download&id=1snhr8irqgrMZ63-PlxFbVwEPOfcUiMOg',
         'GetMotivated': 'https://docs.google.com/uc?export=download&id=16XX8l41byuawooQRaFDm2OG5GPH2EwUQ',
         'UpliftingNews': 'https://docs.google.com/uc?export=download&id=1Og-9dsj1AbGz-spK4xowJpAB_bdPnX6a',
         'InternetIsBeautiful': 'https://docs.google.com/uc?export=download&id=15c2ye-kM1Lmknu0e50YuYCv4zWbxsWcb'
        }

print("Your options are:")
for l in links.keys():
    print('    - '+ l)

#### Choose your subreddit here

In [None]:
subreddit = 'UpliftingNews'

- Here we load the data and do a little cleaning up. 
- Don't worry about how the cleanup is done, just run the code and scroll down.

In [None]:
fname = 'data/merged/'+subreddit+'.tsv.gz'
if os.path.isfile(fname):
    print("You already downloaded the data. Great!")
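else:
    print('Downloading file...')
    #make sure the target folder exists before saving into it
    os.makedirs(os.path.dirname(fname), exist_ok=True)
    urllib.request.urlretrieve(url=links[subreddit],
                               filename=fname)
    print('Done!')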

comments = pd.read_csv(fname, sep='\t', index_col=0)

#downsample for size
comments = comments.sample(n=min(comments.shape[0], 3000000))

#convert our dates to the date data type
comments['date'] = pd.to_datetime(comments.date)
#compress data for lower memory environments
comments.drop_duplicates(subset=['id'], inplace=True)
comments['id'] = comments.index.values
f = comments.select_dtypes(include=['int']).apply(pd.to_numeric,downcast='unsigned')
comments[f.columns] = f
f = comments.select_dtypes(include=['float']).apply(pd.to_numeric,downcast='float')
comments[f.columns] = f
#remove unnecessary columns
if 'parent_id' in comments.columns:
    comments.drop(columns=['parent_id'], inplace=True)
comments['author'] = comments.author.astype('category')

comments.shape

#### Peeking at our data

In [None]:
comments.columns.values

In [None]:
comments.head()

#### Helper function for plotting
Don't worry about what's in it, just run the code and scroll down.

In [None]:
def make_plot(grouped, columns='id', title=None, top=None, bottom=None, 
             events=None, agg='mean', years=[2012,2016]):
    
    fig, axs = plt.subplots(figsize=(14,10))
    if bottom is not None:
        axs.set_ylim(bottom=bottom)
        
    if top is not None:
        axs.set_ylim(top=top)

    if years is not None:
        start = datetime(year=years[0], month=1, day=1)
        end = datetime(year=years[1], month=1, day=1)
        axs.set_xlim(left=start, right=end)

    if isinstance(columns, str):
        columns = [columns]
        
    if agg == 'mean':
        for c in columns:
            means = grouped[c].mean()
            sems = grouped[c].sem()
            axs.plot(means.index, means, label=c)
            axs.fill_between(sems.index, means-(1.96*sems), 
                             means+(1.96*sems), alpha=0.5)
        if title is None:
            title = 'Average scores with 95% confidence interval'
    elif agg == 'count':
        for c in columns:
            counts = grouped.count()[c]
            axs.plot(counts)
        if title is None:
            title = 'Number of comments per month'
    elif agg == 'unique':
        for c in columns:
            counts = grouped[c].nunique()
            axs.plot(counts)
        if title is None:
            title = 'Number of unique ' + ', '.join(columns) + ' values per month'
            
    if events is not None:
        if years is not None:
            events = events[events.date >= start]
            events = events[events.date <= end]
        for e in events.iterrows():
            if e[1].event == 1:
                axs.axvline(e[1].date, color='k', linestyle='dashed', 
                            alpha=.5)
                spot = axs.get_ylim()[1] - axs.get_ylim()[0]
                spot *= .1 
                spot += axs.get_ylim()[0]
                axs.text(e[1].date, spot, e[1].description)            
            
    axs.set_title(title)
    
    if len(columns) == 1:
        axs.set_ylabel(columns[0])
    else:
        axs.legend()
                
    plt.show()
    return

events = [{'date': datetime(year=2014, month=5, day=7),
          'event': 1,
          'description': 'Default Day'}]
events = pd.DataFrame(events)
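
The shaded band that `make_plot` draws when `agg='mean'` is the monthly mean plus or minus 1.96 standard errors. The sketch below works through that arithmetic on a handful of made-up scores. The values are hypothetical, and reading the band as a 95% interval assumes a normal approximation, which is weak for a sample this small.

```python
import pandas as pd

# Hypothetical scores for the comments in a single month.
scores = pd.Series([0.10, 0.20, 0.30, 0.40, 0.50])

mean = scores.mean()
# Standard error of the mean: sample standard deviation / sqrt(n).
sem = scores.sem()

# Same formula as the helper: mean +/- 1.96 standard errors.
lower = mean - 1.96 * sem
upper = mean + 1.96 * sem

print(f"mean={mean:.3f}, band=({lower:.3f}, {upper:.3f})")
```

Months with few comments have a larger standard error, so their band is wider and the monthly average is less certain.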

## Section 2: Activity over Time
- This code puts the comments into groups: one group for every month. Then it gets the count of comments and the count of people commenting in each month. Finally, it plots them for us to see how the level of activity changes over time. 
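
To see what "one group for every month" means in practice, here is a small sketch on made-up data. The `toy` table and its values are hypothetical; only the column names `id`, `author`, and `date` follow this lab.

```python
import pandas as pd

# Hypothetical toy data: five comments from three authors over two months.
toy = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'author': ['ann', 'bob', 'ann', 'cat', 'cat'],
    'date': pd.to_datetime(['2014-04-03', '2014-04-20', '2014-05-08',
                            '2014-05-09', '2014-05-30']),
})

# One group per calendar month. The lab uses 'M' (month-end labels);
# 'MS' (month-start labels) is used here only because it is accepted
# by both older and newer pandas releases.
toy_monthly = toy.resample('MS', on='date')

# Number of comments and number of distinct authors in each month.
comments_per_month = toy_monthly['id'].count()
authors_per_month = toy_monthly['author'].nunique()

print(comments_per_month.tolist())  # [2, 3]
print(authors_per_month.tolist())   # [2, 2]
```

Counting rows gives the number of comments, while `nunique` counts each author once per month no matter how often they commented.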

In [None]:
#Group the comments by month
monthly = comments.resample('M', on='date')

In [None]:
make_plot(monthly, agg='count', events=events)

In [None]:
make_plot(monthly, columns='author', agg='unique', events=events)

#### Notice anything?
- As we expected, the number of comments and the number of users both jump in May 2014, when the subreddit was added to the front page as a default.

### 2.1 Measuring Overall Conflict
- If there is conflict among people in the subreddit, we might expect more comments to get deleted. 
- The raw number of deleted comments looks a lot like the total number of comments, so it tells us little on its own. 
- Instead, the first graph below shows what share of comments are deleted each month. The graphs after it show the monthly averages of the other scores in our data.
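
The difference between a raw count and a share can be sketched on made-up data. This assumes that `deleted` is stored as 0 or 1 for each comment, which the lab does not state explicitly; the `toy` table is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: four comments in April, two in May.
# Assumption: 'deleted' is 1 if the comment was deleted, else 0.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2014-04-01', '2014-04-15', '2014-04-20',
                            '2014-04-28', '2014-05-10', '2014-05-11']),
    'deleted': [0, 0, 0, 1, 1, 0],
})

month = toy['date'].dt.to_period('M')

# Raw number of deleted comments per month.
raw_count = toy.groupby(month)['deleted'].sum()
# Share of comments deleted per month: the mean of a 0/1 column.
share = toy.groupby(month)['deleted'].mean()

print(raw_count.tolist())  # [1, 1]
print(share.tolist())      # [0.25, 0.5]
```

Both months have one deleted comment, but the share doubles in May because there are fewer comments overall. This is why the share is the more useful measure when activity changes a lot.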

In [None]:
make_plot(monthly, columns='deleted', 
          title="Percent of comments deleted",
          events=events)

In [None]:
make_plot(monthly, columns=['OBSCENE', 'TOXICITY', 
                            'ATTACK_ON_COMMENTER', 
                            'INFLAMMATORY'], 
          events=events)

In [None]:
make_plot(monthly, columns=['sentiment'], 
          events=events)

### 2.2 Drivers of Change
- We saw above that the community changed dramatically after it became a default community. But that leaves unanswered questions:
    - Did everyone start behaving differently after default day?
    - Is the change because there are different people commenting than before?
    - Is the change because newcomers haven't adapted to the community's norms yet?  
    - Does it matter what other communities a user is in if we want to know their behavior here?
- Next we're going to compare users who joined before default day with those who joined after it.

#### Short Answer 1:
- Before running the code below, answer these questions with a sentence or two each:
    - What do you think might be driving the changes in average scores that we saw above?
    - In what ways might users who joined after it became a default community differ, on our measures of comment civility, from users who joined before?

**Write your response here:**
....

The helper code below labels users based on when they joined the community.
- Don't worry about how the code in this cell works, just run it and scroll down.
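
For the curious, the labeling idea can be sketched on made-up data: find each author's first comment date, compare it with default day, and set aside authors who commented only once. This is a simplified, vectorized illustration and not the helper itself. The `toy` table is hypothetical, and the special handling of `[deleted]` authors is left out.

```python
import pandas as pd
from datetime import datetime

# Hypothetical toy data: three authors with different histories.
toy = pd.DataFrame({
    'author': ['ann', 'ann', 'bob', 'bob', 'cat'],
    'date': pd.to_datetime(['2014-03-01', '2014-06-01', '2014-05-20',
                            '2014-07-02', '2014-08-15']),
})
default_day = datetime(year=2014, month=5, day=7)

# For every comment, look up its author's comment count and first date.
n_comments = toy.groupby('author')['date'].transform('count')
first_date = toy.groupby('author')['date'].transform('min')

# Start with 'after', then overwrite the other two cases.
toy['cohort'] = 'after'
toy.loc[first_date < default_day, 'cohort'] = 'before'
toy.loc[n_comments == 1, 'cohort'] = 'once'

print(toy['cohort'].tolist())
# ['before', 'before', 'after', 'after', 'once']
```

Note that the label belongs to the author, not the comment: ann's June comment is still labeled `before` because her first comment predates default day.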

In [None]:
users = {}
groups = [{'name': 'before', 'direction': 'lt',
           'date': datetime(year=2014, month=5, day=7)},
          {'name': 'after', 'direction': 'gt',
           'date': datetime(year=2014, month=5, day=7)}]

#select people who only posted one comment ever
tmp = comments.groupby(by='author').count()
users['once'] = set(tmp[tmp.id == 1].index.values)
del tmp

#select all other people, 
firsts = comments[~comments.author.isin(users['once']
                                                 )][['author',
                                                     'date']]
#figure out their first post date
firsts.sort_values(by='date', inplace=True)
firsts.drop_duplicates(subset='author', keep='first', inplace=True)

#select the users in each group
for g in groups:
    if g['direction'] == 'lt':
        users[g['name']] = set(firsts[firsts.date < g['date']].author)
    elif g['direction'] == 'gt':
        users[g['name']] = set(firsts[firsts.date > g['date']].author)

def get_cohort(user):
    c = np.nan
    # these need to be in chronological order
    cohorts = ['once', 'before', 'after']
    for co in cohorts:
        if user == '[deleted]':
            c = np.nan
        elif user in users[co]:
            c = co
    
    return c

#label the comments by their author's cohort
comments['cohort'] = comments.author.apply(get_cohort)

del users

#group comments by both date and cohort
cohorts = comments.groupby(by='cohort').resample('M', on='date')

In [None]:
def plot_users(grouped, column='id', title=None, top=None, 
               bottom=None, events=None, agg='mean', 
               years=[2012,2016], cohorts='once'):
    
    fig, axs = plt.subplots(figsize=(14,10))
    if bottom is not None:
        axs.set_ylim(bottom=bottom)
    if top is not None:
        axs.set_ylim(top=top)
    
    if years is not None:
        start = datetime(year=years[0], month=1, day=1)
        end = datetime(year=years[1], month=1, day=1)
        axs.set_xlim(left=start, right=end)
                
    if isinstance(cohorts, str):
        cohorts = [cohorts]
        
    if agg == 'mean':
        #select the column first so text columns are never averaged
        means = grouped[column].mean()
        sems = grouped[column].sem()
        for c in cohorts:
            axs.plot(means[c].index, 
                     means[c],
                     label=c)
            axs.fill_between(sems[c].index, 
                             means[c]-(1.96*sems[c]), 
                             means[c]+(1.96*sems[c]), 
                             alpha=0.5)
        if title is None:
            title = 'Average '+column+' with 95% confidence interval'
    elif agg == 'count':
        counts = grouped.count()
        for c in cohorts:
            axs.plot(counts[column][c], 
                     label=c)
        if title is None:
            title = 'Number of comments per month'
    elif agg == 'unique':
        counts = grouped.nunique()
        for c in cohorts:
            axs.plot(counts[column][c], 
                     label=c)
        if title is None:
            title = 'Number of unique '+column+' values per month'
            
    if events is not None:
        if years is not None:
            events = events[events.date >= start]
            events = events[events.date <= end]
        for e in events.iterrows():
            if e[1].event == 1:
                axs.axvline(e[1].date, color='k', linestyle='dashed', 
                            alpha=.5)
                spot = axs.get_ylim()[1] - axs.get_ylim()[0]
                spot *= .1 
                spot += axs.get_ylim()[0]
                axs.text(e[1].date, spot, e[1].description) 
            
    axs.set_title(title)
    axs.set_ylabel(column)
       
    axs.legend()
                
    plt.show()
    return

### 2.3 Joining after Default Day

#### How many comments were made by people who joined before vs. after default day?

In [None]:
plot_users(cohorts, column='id', 
           cohorts=['before', 'after'], 
           agg='count', 
           events=events)

#### What does each group's posting behavior look like?

In [None]:
plot_users(cohorts, column='TOXICITY', 
           cohorts=['before', 'after'], 
           events=events)

In [None]:
plot_users(cohorts, column='ATTACK_ON_COMMENTER', 
           cohorts=['before', 'after'], 
           events=events)

In [None]:
plot_users(cohorts, column='INFLAMMATORY', 
           cohorts=['before', 'after'], 
           events=events)

#### Short Answer 2:
- Write a sentence or two for each question:
    1. Were the people who came to the community after default day different than those who came to the community before it?
    2. How do these graphs compare with what you predicted before you looked?
    3. Did either group seem to contribute more to the overall changes in the community we saw in the beginning?

**Write your response here:**
....

### 2.4 Being New to the Community
Is it about new users?
- We saw that people who joined after default day behaved differently than those who joined before. In the next section, we'll look at whether *new* users are always different than users who have been around longer. 
    - Here, we define comments from new users as comments posted within the first 7 days a user is in the community. All other comments are from "old" users, who have been in the community more than one week. 
- Run the helper code below and scroll down.
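
The 7-day rule can be sketched on made-up data as follows. The `toy` table is hypothetical, and this is a simplified illustration of the definition above rather than the helper code itself.

```python
import pandas as pd
from datetime import timedelta

# Hypothetical toy data: two authors, each with an early and a later comment.
toy = pd.DataFrame({
    'author': ['ann', 'ann', 'ann', 'bob', 'bob'],
    'date': pd.to_datetime(['2014-05-08', '2014-05-12', '2014-06-01',
                            '2014-05-10', '2014-05-17']),
})

# Each author's first comment date, repeated on every one of their rows.
first_date = toy.groupby('author')['date'].transform('min')

# Time since the author's first comment, at the moment of each comment.
time_in_community = toy['date'] - first_date

# 'new' if the comment falls within the author's first 7 days.
is_new = time_in_community < timedelta(days=7)
toy['new_user'] = is_new.map({True: 'new', False: 'old'})

print(toy['new_user'].tolist())
# ['new', 'new', 'old', 'new', 'old']
```

Unlike the before/after cohorts, this label belongs to the comment: the same author produces `new` comments at first and `old` comments later on.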

In [None]:
firsts = firsts[firsts.author != '[deleted]']

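tmp = comments[['author', 'date']].merge(firsts, 
                                              how='left', 
                                              on='author',
                                             )
tmp = (tmp.date_x - tmp.date_y) < timedelta(days=7)
#merge() returns a fresh 0..n-1 index, so assign the labels by position
#(a plain array) instead of letting pandas align on index labels
comments['new_user'] = np.where(tmp, 'new', 'old')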

del tmp
del firsts

#group comments by both date and new/old status
newbies = comments.groupby(by='new_user').resample('M', on='date')

In [None]:
plot_users(newbies, column='TOXICITY', 
           cohorts=['old', 'new'], 
           events=events)

In [None]:
plot_users(newbies, column='ATTACK_ON_COMMENTER', 
           cohorts=['old', 'new'], 
           events=events)

In [None]:
plot_users(newbies, column='INFLAMMATORY', 
           cohorts=['old', 'new'], 
           events=events)

#### Short Answer 3:
Answer these questions in a few sentences each:
- Are comments from people who just joined the community different than comments from users who have been members longer?
- How do these graphs compare with what you predicted before you looked?
- Did either group contribute more to the overall changes in the community we saw in the beginning?
- What do these comparisons tell us?
- Overall, what happened in this community on default day?
- How are users who joined after default day different from those who joined before?
- How did users who joined before default day change when it was made a default subreddit? 
    - In what ways did they stay the same?
    

**Write your response here:**
....