# Lab 1: Evolution of Community Norms

## Contents
1. [Setup](#Section-1%3A-Setup)
    1. [Import](#1.1-Import-Packages)
    1. [Data](#1.2-Data)
1. [Activity over Time](#Section-2%3A-Activity-over-Time)
    1. [Measuring Overall Conflict](#2.1-Measuring-Overall-Conflict)
    1. [Drivers of Change](#2.2-Drivers-of-Change)
    1. [Joining after Default Day](#2.3-Joining-after-Default-Day)
    1. [Being New to the Community](#2.4-Being-New-to-the-Community)
    1. [Coming from Hostile Communities](#2.5-Coming-from-Hostile-Communities)    

## Section 0: Background
- `r/TwoXChromosomes` (TwoX) is a feminist subreddit community that started in 2009. People interested in the topic found it by searching or following links that other people posted.
- On May 7, 2014, it was made a "default subreddit," which means people saw it on the front page of Reddit.
- A flood of new people went to the subreddit. They didn't know the community, and many of them didn't like feminism.
    - Users of anti-feminist subreddits, in particular, were angry that a feminist subreddit made the list of defaults. 
- This was a big controversy, both within Reddit and in the broader news media.
    - For example: "[Reddit women protest at new front-page position](https://www.theguardian.com/technology/2014/may/13/reddit-women-protest-front-page-subforum-subreddit-position)"
- We're going to investigate what happened using data from this subreddit.

### 0.1 Topics
- Changes in online communities over time
- Changes in communities due to sudden events
- Whether community change is driven by longtime users changing their behavior, or new users bringing new behaviors

## Section 1: Setup
### 1.1 Import Packages

In [None]:
import pandas as pd
from tqdm import tqdm
tqdm.pandas()
import numpy as np
import urllib.request
import os.path
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
pd.plotting.register_matplotlib_converters()
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

### 1.2 Data
- Here we load the data and do a little cleaning up. 
- Don't worry about how the cleanup is done, just run the code and scroll down.

In [None]:
if os.path.isfile("data/merged/TwoXChromosomes.tsv.gz"):
    print("You already downloaded the data. Great!")
else:
    print('Downloading file...')
    urllib.request.urlretrieve(url="https://www.dropbox.com/s/ymwbvjky3i1otvb/TwoXChromosomes.tsv.gz?dl=1",
                               filename="data/merged/TwoXChromosomes.tsv.gz")
    print('Done!')
    
print("Loading data...")
twox_comments = pd.read_csv('data/merged/TwoXChromosomes.tsv.gz', 
                            sep='\t')

print('Sampling data...')
#the full data requires more than 4GB RAM to run this notebook
twox_comments = twox_comments.sample(frac=.6)

print("Converting dates...")
twox_comments['date'] = pd.to_datetime(twox_comments.date)

print('Compressing data...')
twox_comments.drop_duplicates(subset=['id'], inplace=True)
twox_comments['id'] = twox_comments.index.values
f = twox_comments.select_dtypes(include=['int']).apply(pd.to_numeric,downcast='unsigned')
twox_comments[f.columns] = f
f = twox_comments.select_dtypes(include=['float']).round(3).apply(pd.to_numeric,downcast='float')
twox_comments[f.columns] = f
if 'parent_id' in twox_comments.columns:
    twox_comments.drop(columns=['parent_id'], inplace=True)
twox_comments['author'] = twox_comments.author.astype('category')

print("Done!")
twox_comments.shape

In [None]:
twox_comments.head()

#### Helper function for plotting
Don't worry about what's in it, just run the code and scroll down.

In [None]:
def make_plot(grouped, columns='id', title=None, top=None, bottom=None, 
             events=None, agg='mean', years=[2012,2016]):
    
    plt.close('all')
    
    fig, axs = plt.subplots(figsize=(14,10))
    if bottom is not None:
        axs.set_ylim(bottom=bottom)
        
    if top is not None:
        axs.set_ylim(top=top)

    if years is not None:
        start = datetime(year=years[0], month=1, day=1)
        end = datetime(year=years[1], month=1, day=1)
        axs.set_xlim(left=start, right=end)

    if isinstance(columns, str):
        columns = [columns]
        
    if agg == 'mean':
        for c in columns:
            means = grouped[c].mean()
            sems = grouped[c].sem()
            axs.plot(means.index, means, label=c)
            axs.fill_between(sems.index, means-(1.96*sems), 
                             means+(1.96*sems), alpha=0.5)
        if title is None:
            title = 'Average scores with 95% confidence interval'
        axs.legend()
        axs.set_ylabel('Score')
            
    elif agg == 'count':
        for c in columns:
            counts = grouped.count()[c]
            axs.plot(counts, label=c)
        if title is None:
            title = 'Number of comments per month'
        axs.set_ylabel('Count')
            
    elif agg == 'unique':
        for c in columns:
            counts = grouped[c].nunique()
            axs.plot(counts, label=c)
        if title is None:
            title = 'Number of unique ' + '/'.join(columns) + ' values per month'
        axs.legend()
        axs.set_ylabel('Count')
            
        
    if events is not None:
        if years is not None:
            events = events[events.date >= start]
            events = events[events.date <= end]
        for e in events.iterrows():
            if e[1].event == 1:
                axs.axvline(e[1].date, color='k', linestyle='dashed', 
                            alpha=.5)
                spot = axs.get_ylim()[1] - axs.get_ylim()[0]
                spot *= .1 
                spot += axs.get_ylim()[0]
                axs.text(e[1].date, spot, e[1].description)            
            
    axs.set_title(title)      
                
    plt.show()
    return

events = [{'date': datetime(year=2014, month=5, day=7),
          'event': 1,
          'description': 'Default Day'},
          {'date': datetime(year=2017, month=2, day=1),
          'event': 1,
          'description': 'Ban Day'}]
events = pd.DataFrame(events)

## Section 2: Activity over Time
- When TwoX was added to Reddit's front page, people say there was a surge of new users. We can test this by looking at the data.
- This code puts the comments into groups: one group for every month. Then it gets the count of comments and the count of people commenting in each month. Finally, it plots them for us to see how the level of activity changes over time. 
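
The grouping step described above can be tried on a tiny made-up table first. This is a minimal sketch: the `toy` data is hypothetical, and it groups with `dt.to_period('M')`, which gives the same monthly buckets as the notebook's `resample('M', on='date')` for any month that contains comments.

```python
import pandas as pd

# Toy comments (hypothetical data, not taken from the TwoX file)
toy = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'author': ['ann', 'bob', 'ann', 'cat', 'cat'],
    'date': pd.to_datetime(['2014-04-03', '2014-04-20', '2014-05-08',
                            '2014-05-09', '2014-05-30']),
})

# One group per calendar month
by_month = toy.groupby(toy['date'].dt.to_period('M'))

comments_per_month = by_month['id'].count()        # number of comments
authors_per_month = by_month['author'].nunique()   # number of distinct commenters

print(comments_per_month)
print(authors_per_month)
```

The two series answer different questions: May has more comments than April in this toy data, but the same number of distinct people commenting.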

In [None]:
#Group the comments by month
monthly = twox_comments.resample('M', on='date')

In [None]:
make_plot(monthly, agg='count', events=events)

In [None]:
make_plot(monthly, columns='author', agg='unique', events=events)

#### Notice anything?
- There's a huge jump up in the number of comments and users in May 2014, when it was added to the front page. 

### 2.1 Measuring Overall Conflict
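- If there is conflict among people in the subreddit, we might expect more comments to get deleted. 
- The first graph shows the share of comments deleted each month, as a fraction (0.1 on the y-axis means 10%).
- The other graphs show monthly averages of several comment scores (obscenity, toxicity, attacks on commenters, inflammatory language, and sentiment).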
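
To see how these graphs are built, here is a small sketch on hypothetical data. It assumes `deleted` is a 0/1 column, so its monthly mean is the share of comments deleted, and it draws the shaded band the same way the plotting helper does: the mean plus or minus 1.96 standard errors.

```python
import pandas as pd

# Hypothetical comments: 1 means the comment was deleted, 0 means it was not
toy = pd.DataFrame({
    'date': pd.to_datetime(['2014-04-01', '2014-04-10', '2014-04-15', '2014-04-20',
                            '2014-05-08', '2014-05-09', '2014-05-20', '2014-05-30']),
    'deleted': [0, 0, 0, 1, 1, 1, 0, 0],
})

by_month = toy.groupby(toy['date'].dt.to_period('M'))['deleted']

share_deleted = by_month.mean()   # mean of a 0/1 column = fraction deleted
std_error = by_month.sem()        # standard error of that mean
lower = share_deleted - 1.96 * std_error   # bottom of the approximate 95% band
upper = share_deleted + 1.96 * std_error   # top of the approximate 95% band

print(pd.DataFrame({'share': share_deleted, 'lower': lower, 'upper': upper}))
```

With only four comments per month the band is very wide. In the real data each month has many comments, so the bands are much narrower.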

In [None]:
make_plot(monthly, columns='deleted', 
          title="Percent of comments deleted",
          events=events, bottom=0, top=.3)

In [None]:
make_plot(monthly, columns=['OBSCENE', 'TOXICITY', 
                            'ATTACK_ON_COMMENTER', 
                            'INFLAMMATORY'], 
          events=events, top=.4, bottom=.2)

In [None]:
make_plot(monthly, columns=['sentiment'], 
          events=events, top=.2)

### 2.2 Drivers of Change
- We saw above that the community changed dramatically after it became a default community. But that leaves unanswered questions:
    - Did everyone start behaving differently after default day?
    - Is the change because there are different people commenting than before?
    - Is the change because newcomers haven't adapted to the community's norms yet?  
    - Does it matter what other communities a user is in if we want to know their behavior here?
- Next we're going to compare users who joined before default day with those who joined after it.

#### Short Answer 1:
- Before running the code below, answer these questions with a sentence or two each:
    - What do you think might be driving the changes in average scores that we saw above?
    - In what ways might users who joined after TwoX became a default community differ from those who joined before, according to our measures of comment civility?

**Write your response here:**
....

#### Helper functions
- The code below labels users based on when they joined TwoX.
- Don't worry about how the code in this cell works, just run it and scroll down.
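
For the curious, the labeling idea can be sketched on a toy table. This is a simplified illustration under stated assumptions, not the notebook's implementation: the data and the `DEFAULT_DAY` name are hypothetical, and it uses `groupby(...).transform` where the helper cell builds sets of user names.

```python
import pandas as pd

DEFAULT_DAY = pd.Timestamp('2014-05-07')  # hypothetical name for the cutoff date

# Hypothetical comments
toy = pd.DataFrame({
    'author': ['ann', 'ann', 'bob', 'bob', 'cat', '[deleted]'],
    'date': pd.to_datetime(['2013-01-05', '2014-06-01', '2014-05-10',
                            '2014-07-02', '2014-03-03', '2014-05-20']),
})

# For every comment, look up its author's total comment count and first comment date
n_comments = toy.groupby('author')['date'].transform('count')
first_date = toy.groupby('author')['date'].transform('min')

cohort = pd.Series(pd.NA, index=toy.index, dtype='object')
cohort[first_date < DEFAULT_DAY] = 'before'       # first comment before default day
cohort[first_date > DEFAULT_DAY] = 'after'        # first comment after default day
cohort[n_comments == 1] = 'once'                  # one-time commenters get their own label
cohort[toy['author'] == '[deleted]'] = pd.NA      # deleted accounts cannot be assigned
toy['cohort'] = cohort

print(toy)
```

Every comment inherits its author's label, so both of ann's comments count as "before" even though one was written after default day.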

In [None]:
def p(top, bottom, events, years, axs):    
    if bottom is not None:
        axs.set_ylim(bottom=bottom)
    if top is not None:
        axs.set_ylim(top=top)
    
    # Leave the x-axis limits unchanged unless years or events supply a range
    start, end = None, None
    if years is not None:
        start = datetime(year=years[0], month=1, day=1)
        end = datetime(year=years[1], month=1, day=1)
            
    if events is not None:
        if years is not None:
            events = events[events.date >= start]
            events = events[events.date <= end]
        else:
            start = events.iloc[0].date - timedelta(days=30)
            end = events.iloc[0].date + timedelta(days=30)
        for e in events.iterrows():
            if e[1].event == 1:
                axs.axvline(e[1].date, color='k', linestyle='dashed', 
                            alpha=.5)
                spot = axs.get_ylim()[1] - axs.get_ylim()[0]
                spot *= .1 
                spot += axs.get_ylim()[0]
                axs.text(e[1].date, spot, e[1].description) 
    axs.set_xlim(left=start, right=end)
    axs.legend()
    plt.show()
    return 

def plot_users(counts, column='id', title=None, top=None, 
               bottom=None, events=None,
               years=[2012,2016], cohorts='once'):
    
    plt.close('all')
    fig, axs = plt.subplots(figsize=(14,10))
    
    if isinstance(cohorts, str):
        cohorts = [cohorts]
        
    for c in cohorts:
        axs.plot(counts[column][c], 
                 label=c)
        
    if title is None:
        title = 'Number of comments per month'
    axs.set_title(title)
    axs.set_ylabel(column)
    
    p(top, bottom, events, years, axs)
    return

def plot_users_mean(means, sems, title=None, top=None, 
               bottom=None, events=None,column='id',
               years=[2012,2016], cohorts='once'):
    
    plt.close('all')
    fig, axs = plt.subplots(figsize=(14,10))
    
    if isinstance(cohorts, str):
        cohorts = [cohorts]
        
    for c in cohorts:
        axs.plot(means[column][c].index, 
                 means[column][c],
                 label=c)
        axs.fill_between(sems[column][c].index, 
                         means[column][c]-(sems[column][c]), 
                         means[column][c]+(sems[column][c]), 
                         alpha=0.5)
    if title is None:
        title = 'Average '+column+' with 95% confidence interval'
    axs.set_title(title)
    axs.set_ylabel(column)
    
    p(top, bottom, events, years, axs)
    return

In [None]:
users = {}
groups = [{'name': 'before', 'direction': 'lt',
           'date': datetime(year=2014, month=5, day=7)},
          {'name': 'after', 'direction': 'gt',
           'date': datetime(year=2014, month=5, day=7)}]

print('Finding one-time posters...')
#select people who only posted one comment ever
tmp = twox_comments.groupby(by='author').count()
users['once'] = set(tmp[tmp.id == 1].index.values)
del tmp

#select all other people, 
firsts = twox_comments[~twox_comments.author.isin(users['once']
                                                 )][['author',
                                                     'date']]

print('Finding initial posts...')
#figure out their first post date
firsts.sort_values(by='date', inplace=True)
firsts.drop_duplicates(subset='author', keep='first', inplace=True)

print('Dividing users into groups...')
#select the users in each group
for g in groups:
    if g['direction'] == 'lt':
        users[g['name']] = set(firsts[firsts.date < g['date']].author)
    elif g['direction'] == 'gt':
        users[g['name']] = set(firsts[firsts.date > g['date']].author)

def get_cohort(user):
    c = np.nan
    # these need to be in chronological order
    cohorts = ['once', 'before', 'after']
    for co in cohorts:
        if user == '[deleted]':
            c = np.nan
        elif user in users[co]:
            c = co
    
    return c

#label the comments by their author's cohort
twox_comments['cohort'] = twox_comments.author.apply(get_cohort).astype('category')

del users

print('Calculating statistics....')
#group comments by both date and cohort
cohorts = twox_comments.groupby(by='cohort').resample('M', on='date')

cohort_mean = cohorts.mean()
cohort_error = cohorts.sem() * 1.96

### 2.3 Joining after Default Day

#### What fraction of posts were made by people who joined before vs after default day?

In [None]:
plot_users(cohorts.count(), column='id', 
           cohorts=['before', 'after'], 
           bottom=0, top=90000,
           events=events)

#### Short Answer 2:

The figure above shows different patterns after "Default Day" for users who joined before it and users who joined after it. These patterns could arise in a few different ways, depending on the number of users in each group and how frequently they comment. Write a few sentences on how these two dimensions could have produced the post-"Default Day" pattern for each of the two groups of users.
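
One way to think about the two dimensions: the number of comments in a month equals the number of active users times the average number of comments per user. The sketch below shows that breakdown for a single month of hypothetical data, where the column and author names are made up for illustration.

```python
import pandas as pd

# One hypothetical month of comments, already labeled by cohort
toy = pd.DataFrame({
    'author': ['ann', 'ann', 'ann', 'ann', 'bob', 'cat', 'dan', 'eve'],
    'cohort': ['before'] * 4 + ['after'] * 4,
})

summary = toy.groupby('cohort').agg(
    comments=('author', 'count'),    # total comments from the cohort
    users=('author', 'nunique'),     # distinct people who commented
)
summary['comments_per_user'] = summary['comments'] / summary['users']

print(summary)
```

Both cohorts post four comments here, but for opposite reasons: one very active user versus four users who each comment once.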


**Write your response here:**
....

#### What do the posts of people who joined before and after default day look like?

In [None]:
plot_users_mean(cohort_mean, cohort_error, column='TOXICITY', 
           cohorts=['before', 'after'], 
           events=events, bottom=.2, top=.4)

In [None]:
plot_users_mean(cohort_mean, cohort_error, column='ATTACK_ON_COMMENTER', 
           cohorts=['before', 'after'], 
           events=events, bottom=.2, top=.4)

In [None]:
plot_users_mean(cohort_mean, cohort_error, column='INFLAMMATORY', 
           cohorts=['before', 'after'], 
           events=events, bottom=.2, top=.4)

#### Short Answer 3:
- Write a sentence or two for each question:
    1. Were the people who came to TwoX after default day different than those who came to the community before it?
    2. How do these graphs compare with what you predicted before you looked?
    3. Did either group seem to contribute more to the overall changes in the community we saw in the beginning?

**Write your response here:**
....

### 2.4 Being New to the Community
Is it about new users?
- We saw that people who joined after default day behaved differently than those who joined before. In the next section, we'll look at whether *new* users are always different than users who have been around longer. 
    - Here, we define comments from new users as comments posted within the first 7 days a user is in the community. All other comments are from "old" users, who have been in the community more than one week. 
- Run the helper code below and scroll down.
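
The 7-day rule can be sketched on a toy table as follows. This is an illustration under assumptions, not the helper cell itself: the data is hypothetical, it finds each author's first comment with `groupby(...).transform('min')` where the helper cell uses a merge, and it counts every author, including people with a single comment.

```python
import pandas as pd

# Hypothetical comments
toy = pd.DataFrame({
    'author': ['ann', 'ann', 'ann', 'bob'],
    'date': pd.to_datetime(['2014-05-01', '2014-05-05', '2014-05-20', '2014-05-10']),
})

# Date of each author's first comment, repeated on every one of their rows
first_post = toy.groupby('author')['date'].transform('min')

# Time elapsed between the author's first comment and this comment
tenure = toy['date'] - first_post

# Comments written within the first 7 days count as coming from a "new" user
toy['new_user'] = (tenure < pd.Timedelta(days=7)).map({True: 'new', False: 'old'})

print(toy)
```

The label belongs to the comment, not the person: ann's first two comments are "new" and her third is "old".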

In [None]:
firsts.head()

In [None]:
print('Finding when users joined...')
firsts = firsts[firsts.author != '[deleted]']
tmp = twox_comments[['author', 'date']].merge(firsts, 
                                              how='left', 
                                              on='author',
                                             )

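print('Getting time since first post...')
# merge() returns rows in the same order but with a fresh 0..n-1 index, so take the
# raw values and assign by row position rather than by (mismatched) index label
tmp = ((tmp.date_x - tmp.date_y) < timedelta(days=7)).values
twox_comments['new_user'] = pd.Series(np.where(tmp, 'new', 'old'),
                                      index=twox_comments.index).astype('category')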

del tmp
del firsts

print('Calculating statistics...')
#group comments by both date and new/old status
newbies = twox_comments.groupby(by='new_user').resample('M', on='date')
newbies_mean = newbies.mean()
newbies_error = newbies.sem() * 1.96

In [None]:
plot_users_mean(newbies_mean, newbies_error, column='TOXICITY', 
           cohorts=['old', 'new'], 
           events=events, bottom=.2, top=.4)

In [None]:
plot_users_mean(newbies_mean, newbies_error, column='ATTACK_ON_COMMENTER', 
           cohorts=['old', 'new'], 
           events=events, bottom=.2, top=.4)

In [None]:
plot_users_mean(newbies_mean, newbies_error, column='INFLAMMATORY', 
           cohorts=['old', 'new'], 
           events=events, bottom=.2, top=.4)

### 2.5 Coming from Hostile Communities
Is it about what other communities users belong to?
- We saw above that posts from people who just joined aren't really different, on average, from posts by more established users. In the next section, we'll look at whether it matters what other communities a user belongs to.
    - Much of the debate was about *who* was joining and *why*, not when.
- In the next part, we divide users into two groups:
    - Users who post in *both* TwoX and anti-feminist (e.g. MRA) communities. 
    - Users who post in TwoX, but not in anti-feminist communities. 
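
The split into two groups is a set-membership check on the author name. Here is a minimal sketch with hypothetical data, where `other_authors` is a made-up stand-in for a list of people who post in the other communities.

```python
import numpy as np
import pandas as pd

# Hypothetical set of authors who also post in the other communities
other_authors = {'bob', 'dan'}

# Hypothetical TwoX comments
toy = pd.DataFrame({'author': ['ann', 'bob', '[deleted]', 'dan', 'ann']})

# True where the comment's author appears in the other set
in_both = toy['author'].isin(other_authors)
toy['MRA'] = np.where(in_both, 'TwoX and MRA', 'only TwoX')

# Deleted accounts cannot be matched to anyone, so blank out their label
toy['MRA'] = toy['MRA'].where(toy['author'] != '[deleted]')

print(toy)
```

A user lands in the "TwoX and MRA" group if they appear in the other set at all, regardless of how often or when they posted there.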
    
#### Short Answer 4:
- Do you think the two groups will be different? If so, in what ways? Write a few sentences explaining why.

**Write your response here:**
....

- The next cell loads data about the users and splits them into groups.
- Run the helper code below and scroll down.

In [None]:
print('Loading list of MRA authors...')
mra_authors = set(pd.read_csv('data/mra_authors.tsv.gz', sep='\t').author.values)

def get_mra(user):
    c = 'only TwoX'
    if user == '[deleted]':
        c = np.nan
    elif user in mra_authors:
        c = 'TwoX and MRA'
    return c

print('Matching with TwoX authors...')
twox_comments['MRA'] = twox_comments.author.apply(get_mra).astype('category')
del mra_authors

print('Calculating statistics...')
mra = twox_comments.groupby(by='MRA').resample('M', on='date')
mra_mean = mra.mean()
mra_error = mra.sem() * 1.96

#### How many posts come from each group?

In [None]:
plot_users(mra.count(), 
           cohorts=['only TwoX', 'TwoX and MRA'], 
           events=events)

#### How many users come from each group?

In [None]:
plot_users(mra.apply(lambda x: len(x["author"].unique())).reset_index().melt(id_vars="MRA").set_index(["MRA","date"]),
           column="value",
           title="Number of users per month",
           cohorts=['only TwoX', 'TwoX and MRA'], 
           events=events)

#### How does the behavior of each group compare?

In [None]:
plot_users_mean(mra_mean, mra_error, column='TOXICITY', 
           cohorts=['only TwoX', 'TwoX and MRA'], 
           events=events)

In [None]:
plot_users_mean(mra_mean, mra_error, column='ATTACK_ON_COMMENTER', 
           cohorts=['only TwoX', 'TwoX and MRA'], 
           events=events)

In [None]:
plot_users_mean(mra_mean, mra_error, column='sentiment', 
           cohorts=['only TwoX', 'TwoX and MRA'], 
           events=events)

In [None]:
plot_users_mean(mra_mean, mra_error, column='INFLAMMATORY', 
           cohorts=['only TwoX', 'TwoX and MRA'], 
           events=events)

#### Short Answer 5:
- Write a few sentences answering each of these questions:
    1. What do these comparisons tell us?
    2. Overall, what happened on default day?
    3. How are users who joined TwoX after default day different from those who joined before?
    4. How did users who joined before default day change when it was made a default subreddit? 
        - In what ways did they stay the same?
    5. How are users who post only in the feminist community TwoX different from those who also post in anti-feminist (MRA) communities? 
        - Why might that be? 
    6. How do you think the users who post only in the feminist community TwoX may have responded to the influx of users who post in anti-feminist (MRA) subreddits, and to their behavior?

**Write your response here:**
....