# Lab 2: Other shocks to communities

- We saw in Lab 1 that the community r/TwoXChromosomes was dramatically changed when it was made a default subreddit.
- In this lab, we look at other communities that were made default subreddits on the same day that TwoX was (May 7, 2014).
- Pick one of the communities below to use for this lab.

#### Communities made into default subreddits on May 7, 2014

| Subreddit | N comments | Operational | Interesting
| --- | ---:| --- | ---
| funny |67,239,334|No |
| pics |63,556,365|No |
| gaming |44,516,437|No |
| worldnews |39,636,199|No |
| videos |36,535,462|No |
| todayilearned |32,095,700|No |
| news |28,629,460|No |
| movies |22,180,508|No |
| gifs |14,803,594|No |
| aww |12,403,650|No |
| Fitness | 9,757,307|No |
| Showerthoughts | 9,747,251|Yes | mild
| TwoXChromosomes | 7,809,455|Yes | yes!
| mildlyinteresting | 7,355,302|Yes | yes
| personalfinance | 6,613,244|Yes | yes (odd)
| tifu | 4,727,801|Yes | yes
| LifeProTips | 4,410,593|Yes | yes (drift)
| nottheonion | 4,129,381|Yes | yes
| food | 3,638,060|Yes | barely
| Jokes | 3,607,785|Yes | barely
| DIY | 2,543,822|Yes | yes (drift)
| dataisbeautiful | 2,329,279|Yes | yes
| photoshopbattles | 2,158,450|Yes | yes!
| nosleep | 2,127,058|Yes | mild
| creepy | 2,043,946|Yes | mixed
| Documentaries | 1,776,541|Yes | yes!
| gadgets | 1,626,444|Yes | yes
| history | 1,618,372|Yes | mixed
| Art | 1,508,349|Yes | yes
| philosophy | 1,461,466|Yes | mild
| GetMotivated | 1,371,786|Yes | yes (reverse trends)
| UpliftingNews | 1,289,856|Yes | mild
| listentothis | 1,040,005|Yes | yes
| InternetIsBeautiful | 694,321|Yes | barely
| announcements | 540,792|Yes | no

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt

%matplotlib inline

## Data
- Here we load the data and do a little cleaning up. 
- Don't worry about how the cleanup is done, just run the code and scroll down.

In [None]:
cols = ['date', 'author', 'TOXICITY', 'id', 'subreddit', 'deleted',
        'ATTACK_ON_COMMENTER', 'INFLAMMATORY', 'LIKELY_TO_REJECT', 
        'OBSCENE', 'SEVERE_TOXICITY', 'ATTACK_ON_AUTHOR', 'SPAM', 
        'UNSUBSTANTIAL', 'INCOHERENT', 'sentiment', #'ups', 'edited', 
        #'is_submitter', 'parent_id', 'replies', 'score', 
        # 'stickied', 'archived', 'collapsed', 'controversiality', 
        #'body', 'politeness', 'sentiment', 'pej_nouns', 
       ]

comments = pd.read_csv('data/merged/gadgets.tsv', 
                       sep='\t', usecols=cols)
#convert our dates to the date data type
comments['date'] = pd.to_datetime(comments.date)
comments.shape

In [None]:
#select a smaller subset of the data for laptop memory limits
start = datetime(year=2012, month=1, day=1)
end = datetime(year=2016, month=1, day=1)
comments = comments[comments.date >= start]
comments = comments[comments.date <= end]

### Peeking at our data

In [None]:
comments.columns.values

In [None]:
comments.head()

### Helper function for plotting
Don't worry about what's in it, just run the code and scroll down.

In [None]:
def make_plot(grouped, columns='id', title=None, top=None, bottom=None, 
             events=None, agg='mean', years=[2012,2016]):
    
    fig, axs = plt.subplots(figsize=(14,10))
    if bottom is not None:
        axs.set_ylim(bottom=bottom)
        
    if top is not None:
        axs.set_ylim(top=top)

    if years is not None:
        start = datetime(year=years[0], month=1, day=1)
        end = datetime(year=years[1], month=1, day=1)
        axs.set_xlim(left=start, right=end)

    if isinstance(columns, str):
        columns = [columns]
        
    if agg == 'mean':
        for c in columns:
            means = grouped[c].mean()
            sems = grouped[c].sem()
            #label each line so the legend works when several columns are plotted
            axs.plot(means.index, means, label=c)
            axs.fill_between(sems.index, means-(1.96*sems), 
                             means+(1.96*sems), alpha=0.5)
        if title is None:
            title = 'Average scores with 95% confidence interval'
    elif agg == 'count':
        for c in columns:
            counts = grouped.count()[c]
            axs.plot(counts, label=c)
        if title is None:
            title = 'Number of comments per month'
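    elif agg == 'unique':
        for c in columns:
            counts = grouped[c].nunique()
            axs.plot(counts, label=c)
        if title is None:
            #name the column(s) being counted instead of leaving a blank
            title = 'Number of unique ' + ', '.join(columns) + ' values per month'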
            
    if events is not None:
        if years is not None:
            events = events[events.date >= start]
            events = events[events.date <= end]
        for e in events.iterrows():
            if e[1].event == 1:
                axs.axvline(e[1].date, color='k', linestyle='dashed', 
                            alpha=.5)
                spot = axs.get_ylim()[1] - axs.get_ylim()[0]
                spot *= .1 
                spot += axs.get_ylim()[0]
                axs.text(e[1].date, spot, e[1].description)            
            
    axs.set_title(title)
    
    if len(columns) == 1:
        axs.set_ylabel(columns[0])
    else:
        axs.legend()
                
    plt.show()
    return

events = [{'date': datetime(year=2014, month=5, day=7),
          'event': 1,
          'description': 'Default Day'}]
events = pd.DataFrame(events)
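
For `agg='mean'`, the shaded band that `make_plot` draws is the monthly mean plus or minus 1.96 standard errors, which is the usual normal-approximation 95% confidence interval. Here is a minimal sketch of that arithmetic for a single month; the four scores are made up for illustration.

```python
import pandas as pd

# Hypothetical scores for the comments in one month (not real data)
scores = pd.Series([0.10, 0.20, 0.30, 0.40])

mean = scores.mean()
# Standard error of the mean: sample standard deviation / sqrt(n)
sem = scores.sem()

# Same band that make_plot shades: mean +/- 1.96 standard errors
lower = mean - 1.96 * sem
upper = mean + 1.96 * sem

print(round(mean, 3), round(lower, 3), round(upper, 3))
```

Months with few comments have a larger standard error, so their band is wider and their average is less certain.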

### Measuring activity over time
- This code puts the comments into groups: one group for every month. Then it gets the count of comments and the count of people commenting in each month. Finally, it plots them for us to see how the level of activity changes over time. 

In [None]:
#Group the comments by month
monthly = comments.resample('M', on='date')

In [None]:
make_plot(monthly, agg='count', events=events)

In [None]:
make_plot(monthly, columns='author', agg='unique', events=events)
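
To see what the monthly grouping does, here is a small sketch on made-up data. It uses `dt.to_period('M')` to form one group per calendar month, which plays the same role here as the `resample('M', on='date')` call above; the toy authors and dates are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: five comments by three authors over two months
toy = pd.DataFrame({
    'id': ['a1', 'a2', 'a3', 'a4', 'a5'],
    'author': ['ann', 'bob', 'ann', 'cat', 'cat'],
    'date': pd.to_datetime(['2014-04-03', '2014-04-20', '2014-05-08',
                            '2014-05-09', '2014-05-30']),
})

# One group per calendar month
toy_monthly = toy.groupby(toy['date'].dt.to_period('M'))

# Number of comments in each month
n_comments = toy_monthly['id'].count()
# Number of distinct authors in each month
n_authors = toy_monthly['author'].nunique()

print(pd.DataFrame({'comments': n_comments, 'authors': n_authors}))
```

May has more comments than April but the same number of distinct authors, which is why the lab plots both measures.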

### Notice anything?
- As we expected, both the number of comments and the number of users jump in May 2014, when the subreddit was added to the front page. 

### Measuring overall conflict
- If there is conflict among people in the subreddit, we might expect more comments to get deleted. 
- The raw number of deleted comments mostly tracks the total number of comments, so the first graph shows what share of each month's comments were deleted.
- The next two graphs show the monthly averages of several toxicity-related scores and of comment sentiment.

In [None]:
make_plot(monthly, columns='deleted', 
          title="Percent of comments deleted",
          events=events)
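
The plot above uses the default `agg='mean'`. Assuming `deleted` is stored as a 0/1 (or True/False) flag, which this lab does not state explicitly, the monthly mean of that flag is the fraction of comments deleted that month. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data; assumes `deleted` is a 0/1 flag
toy = pd.DataFrame({
    'date': pd.to_datetime(['2014-04-03', '2014-04-20', '2014-04-25',
                            '2014-04-28', '2014-05-08', '2014-05-09']),
    'deleted': [0, 1, 0, 0, 1, 1],
})

by_month = toy.groupby(toy['date'].dt.to_period('M'))['deleted']

# Raw number of deleted comments per month
n_deleted = by_month.sum()
# Fraction of comments deleted per month (multiply by 100 for a percent)
share_deleted = by_month.mean()

print(pd.DataFrame({'n_deleted': n_deleted, 'share_deleted': share_deleted}))
```

Under that assumption the y-axis of the plot runs from 0 to 1, not 0 to 100.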

In [None]:
make_plot(monthly, columns=['OBSCENE', 'TOXICITY', 
                            'ATTACK_ON_COMMENTER', 
                            'INFLAMMATORY'], 
          events=events)

In [None]:
make_plot(monthly, columns=['sentiment'], 
          events=events)

## Who is driving the change?
- We saw above that the community changed dramatically after it became a default community. But that leaves open questions:
    - Did everyone start behaving differently after default day?
    - Is the change because there are so many *new* people?
    - Is the change because the new people are new, or do they stay different even once they're established? 
    - Does it matter what other communities a user is in if we want to know their behavior here?
- Next we're going to compare users who joined before default day with those who joined after it.
- Before running the code below, answer these questions:
    - Do you think the overall change is because a different kind of user started joining after the community was made a default? 
    - In what ways might users who joined a default community differ, on our measures of comment civility, from users who joined when it was a more niche community?

The code below labels users based on when they joined the community.
- Don't worry about how the code in this cell works, just run it and scroll down.

In [None]:
users = {}
groups = [{'name': 'before', 'direction': 'lt',
           'date': datetime(year=2014, month=5, day=7)},
          {'name': 'after', 'direction': 'gt',
           'date': datetime(year=2014, month=5, day=7)}]

#select people who only posted one comment ever
tmp = comments.groupby(by='author').count()
users['once'] = set(tmp[tmp.id == 1].index.values)
del tmp

#select all other people, 
firsts = comments[~comments.author.isin(users['once']
                                                 )][['author',
                                                     'date']]
#figure out their first post date
firsts.sort_values(by='date', inplace=True)
firsts.drop_duplicates(subset='author', keep='first', inplace=True)

#select the users in each group
for g in groups:
    if g['direction'] == 'lt':
        users[g['name']] = set(firsts[firsts.date < g['date']].author)
    elif g['direction'] == 'gt':
        users[g['name']] = set(firsts[firsts.date > g['date']].author)

del firsts

def get_cohort(user):
    c = np.nan
    # these need to be in chronological order
    cohorts = ['once', 'before', 'after']
    for co in cohorts:
        if user == '[deleted]':
            c = np.nan
        elif user in users[co]:
            c = co
    
    return c

#label the comments by their author's cohort
comments['cohort'] = comments.author.apply(get_cohort)

#group comments by both date and cohort
cohorts = comments.groupby(by='cohort').resample('M', on='date')
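
The labeling idea can be sketched on made-up data: find each author's first comment date, then compare it with default day. This sketch uses `groupby(...).transform` instead of the sets built above, and it skips the `[deleted]` handling; the authors and dates are hypothetical.

```python
import pandas as pd

default_day = pd.Timestamp('2014-05-07')

# Hypothetical toy data: six comments by four authors
toy = pd.DataFrame({
    'author': ['ann', 'bob', 'ann', 'cat', 'cat', 'dan'],
    'date': pd.to_datetime(['2014-03-01', '2014-04-10', '2014-06-01',
                            '2014-05-20', '2014-07-02', '2014-06-15']),
})

# For every comment, look up its author's comment count and first comment date
n_comments = toy.groupby('author')['date'].transform('count')
first_date = toy.groupby('author')['date'].transform('min')

# Label by when the author first appeared; one-time commenters get their own label
toy['cohort'] = 'after'
toy.loc[first_date < default_day, 'cohort'] = 'before'
toy.loc[n_comments == 1, 'cohort'] = 'once'

print(toy)
```

Note that ann's June comment is still labeled `before`: the cohort describes when the author joined, not when the comment was written.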

In [None]:
def plot_users(grouped, column='id', title=None, top=None, 
               bottom=None, events=None, agg='mean', 
               years=[2012,2016], cohorts='once'):
    
    fig, axs = plt.subplots(figsize=(14,10))
    if bottom is not None:
        axs.set_ylim(bottom=bottom)
    if top is not None:
        axs.set_ylim(top=top)
    
    if years is not None:
        start = datetime(year=years[0], month=1, day=1)
        end = datetime(year=years[1], month=1, day=1)
        axs.set_xlim(left=start, right=end)
                
    if isinstance(cohorts, str):
        cohorts = [cohorts]
        
    if agg == 'mean':
        means = grouped.mean()
        sems = grouped.sem()
        for c in cohorts:
            axs.plot(means[column][c].index, 
                     means[column][c],
                     label=c)
            axs.fill_between(sems[column][c].index, 
                             means[column][c]-(1.96*sems[column][c]), 
                             means[column][c]+(1.96*sems[column][c]), 
                             alpha=0.5)
        if title is None:
            title = 'Average '+column+' with 95% confidence interval'
    elif agg == 'count':
        counts = grouped.count()
        for c in cohorts:
            axs.plot(counts[column][c], 
                     label=c)
        if title is None:
            title = 'Number of comments per month'
    elif agg == 'unique':
        counts = grouped.nunique()
        for c in cohorts:
            axs.plot(counts[column][c], 
                     label=c)
        if title is None:
            title = 'Number of unique ' + column + ' values per month'
            
    if events is not None:
        if years is not None:
            events = events[events.date >= start]
            events = events[events.date <= end]
        for e in events.iterrows():
            if e[1].event == 1:
                axs.axvline(e[1].date, color='k', linestyle='dashed', 
                            alpha=.5)
                spot = axs.get_ylim()[1] - axs.get_ylim()[0]
                spot *= .1 
                spot += axs.get_ylim()[0]
                axs.text(e[1].date, spot, e[1].description) 
            
    axs.set_title(title)
    axs.set_ylabel(column)
       
    axs.legend()
                
    plt.show()
    return

### How many comments were made by each group?

In [None]:
plot_users(cohorts, column='id', 
           cohorts=['before', 'after'], 
           agg='count', 
           events=events)

### What does each group's posting behavior look like?

In [None]:
plot_users(cohorts, column='TOXICITY', 
           cohorts=['before', 'after'], 
           events=events)

In [None]:
plot_users(cohorts, column='ATTACK_ON_COMMENTER', 
           cohorts=['before', 'after'], 
           events=events)

In [None]:
plot_users(cohorts, column='INFLAMMATORY', 
           cohorts=['before', 'after'], 
           events=events)

### Reflect
- Were the people who joined the community after default day different from those who joined before it?
- How do these graphs compare with what you predicted before you looked?
- Did either group contribute more to the overall changes in the community we saw in the beginning?

### Is it about new users?
- We saw that people who joined after default day behaved differently than those who joined before. In the next section, we'll look at whether *new* users are always different than users who have been around longer.
- Run the helper code below and scroll down.

In [None]:
users = {}

#select people who only posted one comment ever
tmp = comments.groupby(by='author').count()
users['once'] = set(tmp[tmp.id == 1].index.values)
del tmp

#select all other people, 
firsts = comments[~comments.author.isin(users['once']
                                                 )][['author',
                                                     'date']]
del users

#figure out their first post date
firsts.sort_values(by='date', inplace=True)
firsts.drop_duplicates(subset='author', keep='first', inplace=True)
mask = firsts.author.str.startswith('[deleted]')
firsts.loc[mask, 'date'] = np.nan

tmp = comments[['author', 'date']].merge(firsts, 
                                              how='left', 
                                              on='author',
                                             )
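#a comment is 'new' if it was posted within 7 days of its author's first comment
m = timedelta(days=7)
tmp = (tmp.date_x - tmp.date_y) < m
#merge() resets the row index, so assign by position (np.where returns 
#a plain array) rather than letting pandas align on mismatched labels
comments['new_user'] = np.where(tmp, 'new', 'old')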

del tmp
del firsts

#group comments by both date and cohort
newbies = comments.groupby(by='new_user').resample('M', on='date')
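
Here is a minimal sketch of the new/old labeling on made-up data: a comment counts as `new` if it was written less than 7 days after its author's first comment. The sketch leaves out the helper's special handling of one-time commenters and `[deleted]` authors, and the authors and dates are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: five comments by two authors
toy = pd.DataFrame({
    'author': ['ann', 'ann', 'ann', 'bob', 'bob'],
    'date': pd.to_datetime(['2014-05-01', '2014-05-05', '2014-06-01',
                            '2014-05-10', '2014-05-18']),
})

# Each author's first comment date, repeated on every one of their comments
first_date = toy.groupby('author')['date'].transform('min')

# How long the author had been around when each comment was written
age = toy['date'] - first_date

# Less than 7 days since the first comment counts as 'new'
is_new = age < pd.Timedelta(days=7)
toy['new_user'] = is_new.map({True: 'new', False: 'old'})

print(toy)
```

Unlike the before/after cohorts, this label changes over time for the same person: ann's first two comments are `new` and her later one is `old`.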

In [None]:
plot_users(newbies, column='TOXICITY', 
           cohorts=['old', 'new'], 
           events=events)

In [None]:
plot_users(newbies, column='ATTACK_ON_COMMENTER', 
           cohorts=['old', 'new'], 
           events=events)

In [None]:
plot_users(newbies, column='INFLAMMATORY', 
           cohorts=['old', 'new'], 
           events=events)

### Reflect
- Are comments from people who just joined the community different than comments from users who have been members longer?
- How do these graphs compare with what you predicted before you looked?
- Did either group contribute more to the overall changes in the community we saw in the beginning?
- What do these comparisons tell us?
- Overall, what happened on default day?
- How are users who joined after default day different from those who joined before?
- How did users who joined before default day change when it was made a default subreddit? 
    - In what ways did they stay the same?
    
    
- Are there other ways to decompose the changes?
- Which comparisons were most illuminating?

TODO: track down Amit's paper.