# Lab 1: Evolution of Community Norms

In this lab, we'll cover:
- Changes in online communities over time
- Changes in communities due to sudden events
- Whether community change is driven by longtime users changing their behavior, or new users bringing new behaviors

### Background Information
- `r/TwoXChromosomes` (TwoX) is a feminist subreddit community that started in 2009. People interested in the topic found it by searching or following links that other people posted.
- On May 7, 2014, it was made a "default subreddit," which means people saw it on the front page of reddit. 
- A flood of new people went to the subreddit. They didn't know the community, and many of them didn't like feminism.
    - Users of anti-feminist subreddits, in particular, were angry that a feminist subreddit made the list of defaults. 
- This was a big controversy, both within Reddit and in the broader news media.
    - For example: "[Reddit women protest at new front-page position](https://www.theguardian.com/technology/2014/may/13/reddit-women-protest-front-page-subforum-subreddit-position)"
- In February 2017, the moderators of TwoX tried to decrease the hostility in their subreddit by banning all users who post in explicitly anti-feminist subreddits from posting in TwoX.
- We're going to investigate what happened using data from this subreddit.

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt

%matplotlib inline

## Data
- Here we load the data and do a little cleaning up. 
- Don't worry about how the cleanup is done, just run the code and scroll down.

In [None]:
cols = ['date', 'author', 'TOXICITY', 
        'ATTACK_ON_COMMENTER', 'INFLAMMATORY', 'LIKELY_TO_REJECT', 
        'OBSCENE', 'SEVERE_TOXICITY', 'ATTACK_ON_AUTHOR', 'SPAM', 
        'UNSUBSTANTIAL', 'INCOHERENT', 
        'id', 'parent_id', 'replies','deleted', 
        'politeness', 'sentiment', 'pej_nouns', 
       ]

twox_comments = pd.read_csv('data/merged/TwoXChromosomes_thin.tsv', 
                            sep='\t', usecols=cols)
#convert our dates to the date data type
twox_comments['date'] = pd.to_datetime(twox_comments.date)
twox_comments.shape

In [None]:
start = datetime(year=2012, month=1, day=1)
end = datetime(year=2016, month=1, day=1)
twox_comments = twox_comments[twox_comments.date >= start]
twox_comments = twox_comments[twox_comments.date <= end]

In [None]:
twox_comments.columns.values

In [None]:
twox_comments.head()

### Helper function for plotting
Don't worry about what's in it, just run the code and scroll down.

In [None]:
def make_plot(grouped, columns='id', title=None, top=None, bottom=None, 
             events=None, agg='mean', years=[2012,2016]):
    
    fig, axs = plt.subplots(figsize=(14,10))
    if bottom is not None:
        axs.set_ylim(bottom=bottom)
        
    if top is not None:
        axs.set_ylim(top=top)

    if years is not None:
        start = datetime(year=years[0], month=1, day=1)
        end = datetime(year=years[1], month=1, day=1)
        axs.set_xlim(left=start, right=end)

    if isinstance(columns, str):
        columns = [columns]
        
    if agg == 'mean':
        for c in columns:
            means = grouped[c].mean()
            sems = grouped[c].sem()
            # label each line so axs.legend() has something to show
            axs.plot(means.index, means, label=c)
            axs.fill_between(sems.index, means-(1.96*sems), 
                             means+(1.96*sems), alpha=0.5)
        if title is None:
            title = 'Average scores with 95% confidence interval'
    elif agg == 'count':
        for c in columns:
            counts = grouped.count()[c]
            axs.plot(counts)
        if title is None:
            title = 'Number of comments per month'
    elif agg == 'unique':
        for c in columns:
            counts = grouped[c].nunique()
            axs.plot(counts)
        if title is None:
            title = 'Number of unique ' + ', '.join(columns) + ' values per month'
            
        
    if events is not None:
        if years is not None:
            events = events[events.date >= start]
            events = events[events.date <= end]
        for e in events.iterrows():
            if e[1].event == 1:
                axs.axvline(e[1].date, color='k', linestyle='dashed', 
                            alpha=.5)
                spot = axs.get_ylim()[1] - axs.get_ylim()[0]
                spot *= .1 
                spot += axs.get_ylim()[0]
                axs.text(e[1].date, spot, e[1].description)            
            
    axs.set_title(title)
    
    if len(columns) == 1:
        axs.set_ylabel(columns[0])
    else:
        axs.legend()
                
    plt.show()
    return

events = [{'date': datetime(year=2014, month=5, day=7),
          'event': 1,
          'description': 'Default Day'},
          {'date': datetime(year=2017, month=2, day=1),
          'event': 1,
          'description': 'Ban Day'}]
events = pd.DataFrame(events)

### Measuring activity over time
- When TwoX was added to Reddit's front page, people say there was a surge of new users. 
- This code puts the comments into groups: one group for every month. Then it gets the count of comments and the count of people commenting in each month. Finally, it plots them for us to see how the level of activity changes over time. 
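To see what "grouping by month" does, here is a minimal sketch on made-up data. The authors and comment ids are hypothetical, and it buckets dates with `dt.to_period('M')` instead of the `resample` call used in this notebook, so treat it as an illustration of the idea rather than the lab's exact method.

```python
import pandas as pd

# Toy data: five comments by three hypothetical authors across two months
toy = pd.DataFrame({
    'date': pd.to_datetime(['2014-04-03', '2014-04-20', '2014-05-08',
                            '2014-05-09', '2014-05-30']),
    'author': ['ann', 'bob', 'ann', 'cat', 'cat'],
    'id': ['c1', 'c2', 'c3', 'c4', 'c5'],
})

# Put each comment into a calendar-month bucket based on its timestamp
by_month = toy.groupby(toy['date'].dt.to_period('M'))

comments_per_month = by_month['id'].count()       # how many comments
authors_per_month = by_month['author'].nunique()  # how many distinct people

print(comments_per_month)
print(authors_per_month)
```

In the toy data, May has more comments than April but the same number of distinct authors, which is why the lab plots both measures.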

In [None]:
#Group the comments by month
monthly = twox_comments.resample('M', on='date')

In [None]:
make_plot(monthly, agg='count', events=events)

In [None]:
make_plot(monthly, columns='author', agg='unique', events=events)

### Notice anything?
- There's a huge jump up in the number of comments and users in May 2014, when it was added to the front page. 
- There's a huge drop down in the number of comments and users in February 2017, when many users were banned.

### Measuring overall conflict
- If there is conflict among people in the subreddit, we might expect more posts to get deleted. 
- The first graph shows us the raw number of posts that are deleted. It looks a lot like the total number of posts, though. 
- The second graph shows us what percent of posts are deleted.
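As a rough sketch of how a deletion rate and its shaded band can be computed, the example below assumes the `deleted` column holds 0/1 flags (an assumption; the values here are made up). The mean of such flags is the share of comments deleted, and `make_plot` shades mean ± 1.96 × the standard error around each line.

```python
import pandas as pd

# Hypothetical 0/1 flags for ten comments: 1 means the comment was deleted
deleted = pd.Series([0, 0, 1, 0, 1, 0, 0, 0, 0, 0])

fraction_deleted = deleted.mean()  # mean of 0/1 flags = share deleted
sem = deleted.sem()                # standard error of that mean

# Approximate 95% confidence interval: mean +/- 1.96 * SEM
low = fraction_deleted - 1.96 * sem
high = fraction_deleted + 1.96 * sem

print(fraction_deleted, low, high)
```

With only ten comments the interval is wide; months with many comments give a much narrower band.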

In [None]:
make_plot(monthly, columns='deleted', 
          title="Percent of comments deleted",
          events=events, bottom=0, top=.3)

In [None]:
make_plot(monthly, columns=['OBSCENE', 'TOXICITY', 
                            'ATTACK_ON_COMMENTER', 
                            'INFLAMMATORY'], 
          events=events, top=.55, bottom=.15)

In [None]:
make_plot(monthly, columns=['sentiment'], 
          events=events, top=.2)

## Who is driving the change?
- We saw above that the community changed dramatically after it became a default community. But that leaves open questions:
    - Did everyone start behaving differently after default day?
    - Is the change because there are so many *new* people?
    - Is the change because the new people are new, or do they stay different even once they're established? 
    - Does it matter what other communities a user is in if we want to know their behavior here?
- Next we're going to compare users who joined before default day with those who joined after it.
- Before running the code below, answer these questions:
    - Do you think the overall change is because a different kind of user started joining after the community was made a default? 
    - In what ways might users who joined a default community differ, on our measures of comment civility, from users who joined when it was a more niche community?

The code below labels users based on when they joined TwoX.
- Don't worry about how the code in this cell works, just run it and scroll down.
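If you are curious, the labeling idea can be sketched on toy data: find each author's first comment, compare it with default day, and set aside people who only commented once. The authors and dates are hypothetical, and this simplified version treats a first comment made exactly on default day as "after".

```python
import pandas as pd

default_day = pd.Timestamp(2014, 5, 7)

# Toy data: hypothetical authors and comment dates
toy = pd.DataFrame({
    'author': ['ann', 'ann', 'bob', 'bob', 'cat'],
    'date': pd.to_datetime(['2013-01-10', '2014-06-01',
                            '2014-05-20', '2014-07-04',
                            '2014-08-15']),
})

# Each author's earliest comment date and total number of comments
first = toy.groupby('author')['date'].min()
n = toy.groupby('author')['date'].count()

# Start everyone as 'after', then overwrite the other two groups
cohort = pd.Series('after', index=first.index)
cohort[first < default_day] = 'before'
cohort[n == 1] = 'once'

# Copy each author's label onto every one of their comments
toy['cohort'] = toy['author'].map(cohort)
print(toy)
```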

In [None]:
users = {}
groups = [{'name': 'before', 'direction': 'lt',
           'date': datetime(year=2014, month=5, day=7)},
          {'name': 'after', 'direction': 'gt',
           'date': datetime(year=2014, month=5, day=7)}]

#select people who only posted one comment ever
tmp = twox_comments.groupby(by='author').count()
users['once'] = set(tmp[tmp.id == 1].index.values)
del tmp

#select all other people, 
firsts = twox_comments[~twox_comments.author.isin(users['once']
                                                 )][['author',
                                                     'date']]
#figure out their first post date
firsts.sort_values(by='date', inplace=True)
firsts.drop_duplicates(subset='author', keep='first', inplace=True)

#select the users in each group
for g in groups:
    if g['direction'] == 'lt':
        users[g['name']] = set(firsts[firsts.date < g['date']].author)
    elif g['direction'] == 'gt':
        users[g['name']] = set(firsts[firsts.date > g['date']].author)

del firsts

def get_cohort(user):
    c = np.nan
    # these need to be in chronological order
    cohorts = ['once', 'before', 'after']
    for co in cohorts:
        if user == '[deleted]':
            c = np.nan
        elif user in users[co]:
            c = co
    
    return c

#label the comments by their author's cohort
twox_comments['cohort'] = twox_comments.author.apply(get_cohort)

#group comments by both date and cohort
cohorts = twox_comments.groupby(by='cohort').resample('M', on='date')

#### A helper function for making plots broken down by user group
- Again, run the code but don't worry about how it works

In [None]:
def plot_users(grouped, column='id', title=None, top=None, 
               bottom=None, events=None, agg='mean', 
               years=[2012,2016], cohorts='once'):
    
    fig, axs = plt.subplots(figsize=(14,10))
    if bottom is not None:
        axs.set_ylim(bottom=bottom)
    if top is not None:
        axs.set_ylim(top=top)
    
    if years is not None:
        start = datetime(year=years[0], month=1, day=1)
        end = datetime(year=years[1], month=1, day=1)
        axs.set_xlim(left=start, right=end)
                
    if isinstance(cohorts, str):
        cohorts = [cohorts]
        
    if agg == 'mean':
        means = grouped.mean()
        sems = grouped.sem()
        for c in cohorts:
            axs.plot(means[column][c].index, 
                     means[column][c],
                     label=c)
            axs.fill_between(sems[column][c].index, 
                             means[column][c]-(1.96*sems[column][c]), 
                             means[column][c]+(1.96*sems[column][c]), 
                             alpha=0.5)
        if title is None:
            title = 'Average '+column+' with 95% confidence interval'
    elif agg == 'count':
        counts = grouped.count()
        for c in cohorts:
            axs.plot(counts[column][c], 
                     label=c)
        if title is None:
            title = 'Number of comments per month'
    elif agg == 'unique':
        counts = grouped.nunique()
        for c in cohorts:
            axs.plot(counts[column][c], 
                     label=c)
        if title is None:
            title = 'Number of unique ' + column + ' values per month'
            
    if events is not None:
        if years is not None:
            events = events[events.date >= start]
            events = events[events.date <= end]
        for e in events.iterrows():
            if e[1].event == 1:
                axs.axvline(e[1].date, color='k', linestyle='dashed', 
                            alpha=.5)
                spot = axs.get_ylim()[1] - axs.get_ylim()[0]
                spot *= .1 
                spot += axs.get_ylim()[0]
                axs.text(e[1].date, spot, e[1].description) 
            
    axs.set_title(title)
    axs.set_ylabel(column)
       
    axs.legend()
                
    plt.show()
    return

### What fraction of posts were made by each group?

In [None]:
plot_users(cohorts, column='id', 
           cohorts=['before', 'after'], 
           agg='count', 
           events=events)

### What does each group's posting behavior look like?

In [None]:
plot_users(cohorts, column='TOXICITY', 
           cohorts=['before', 'after'], 
           events=events, bottom=.2, top=.4)

In [None]:
plot_users(cohorts, column='ATTACK_ON_COMMENTER', 
           cohorts=['before', 'after'], 
           events=events, bottom=.2, top=.4)

In [None]:
plot_users(cohorts, column='INFLAMMATORY', 
           cohorts=['before', 'after'], 
           events=events, bottom=.2, top=.4)

### Reflect
- Were the people who came to TwoX after default day different than those who came to the community before it?
- How do these graphs compare with what you predicted before you looked?
- Did either group contribute more to the overall changes in the community we saw in the beginning?

### Is it about new users?
- We saw that people who joined after default day behaved differently than those who joined before. In the next section, we'll look at whether *new* users are always different than users who have been around longer.
- Run the helper code below and scroll down.
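The helper code counts a comment as coming from a "new" user if it was written within 7 days of that author's first comment. Here is a minimal sketch of that rule on made-up data. Unlike the helper cell, it does not treat one-time commenters or deleted accounts specially.

```python
import pandas as pd

# Toy data: hypothetical authors and comment dates
toy = pd.DataFrame({
    'author': ['ann', 'ann', 'ann', 'bob'],
    'date': pd.to_datetime(['2014-05-08', '2014-05-10',
                            '2014-06-20', '2014-06-19']),
})

# Date of each author's first comment, repeated on every one of their rows
first_seen = toy.groupby('author')['date'].transform('min')

# "new" = written less than a week after that author's first comment
is_new = (toy['date'] - first_seen) < pd.Timedelta(days=7)
toy['new_user'] = is_new.map({True: 'new', False: 'old'})
print(toy)
```

Note that the same person can be "new" for some comments and "old" for later ones: the label describes the comment's timing, not the person.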

In [None]:
del monthly
del cohorts

In [None]:
users = {}

#select people who only posted one comment ever
tmp = twox_comments.groupby(by='author').count()
users['once'] = set(tmp[tmp.id == 1].index.values)
del tmp

#select all other people, 
firsts = twox_comments[~twox_comments.author.isin(users['once']
                                                 )][['author',
                                                     'date']]
del users

#figure out their first post date
firsts.sort_values(by='date', inplace=True)
firsts.drop_duplicates(subset='author', keep='first', inplace=True)
mask = firsts.author.str.startswith('[deleted]')
firsts.loc[mask, 'date'] = np.nan

tmp = twox_comments[['author', 'date']].merge(firsts, 
                                              how='left', 
                                              on='author',
                                             )
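m = timedelta(days=7)
# merge() returns rows in the same order but with a fresh 0..n-1 index, 
# while twox_comments kept its original index after the date filter,
# so assign by position (.values) rather than by index label
is_new = ((tmp.date_x - tmp.date_y) < m).values
twox_comments['new_user'] = np.where(is_new, 'new', 'old')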

del tmp
del firsts

#group comments by both date and cohort
newbies = twox_comments.groupby(by='new_user').resample('M', on='date')

In [None]:
plot_users(newbies, column='TOXICITY', 
           cohorts=['old', 'new'], 
           events=events, bottom=.2, top=.4)

In [None]:
plot_users(newbies, column='ATTACK_ON_COMMENTER', 
           cohorts=['old', 'new'], 
           events=events, bottom=.2, top=.4)

In [None]:
plot_users(newbies, column='INFLAMMATORY', 
           cohorts=['old', 'new'], 
           events=events, bottom=.2, top=.4)

### Reflect
- Are comments from people who just joined the community different than comments from users who have been members longer?
- How do these graphs compare with what you predicted before you looked?
- Did either group contribute more to the overall changes in the community we saw in the beginning?

### Is it about what other communities users belong to?
- We saw that posts from people who just joined aren't really different, on average, from posts by more established users. In the next section, we'll look at whether it matters what other communities a user belongs to.
    - Much of the debate was about *who* was joining and *why*, not when.
- In the next part, we divide users into two groups:
    - Users who post in *both* TwoX and anti-feminist (e.g. MRA) communities. These are the users who were banned in 2017 because the moderators saw them fighting. 
    - Users who post in TwoX, but not in anti-feminist communities. 
- Do you think the two groups will be different? If so, in what ways?
- This next cell loads data about the users and splits them into groups.
- Run the helper code below and scroll down.
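The split itself is a simple membership check. Here is a minimal sketch with hypothetical author names, using `isin` in place of the row-by-row function in the helper cell.

```python
import numpy as np
import pandas as pd

# Hypothetical set of authors who also post in anti-feminist communities
also_mra = {'bob'}

toy = pd.DataFrame({'author': ['ann', 'bob', '[deleted]', 'cat']})

# Label each comment by whether its author appears in the set
toy['MRA'] = np.where(toy['author'].isin(also_mra),
                      'TwoX and MRA', 'only TwoX')

# Deleted accounts cannot be attributed to anyone, so leave them unlabeled
toy.loc[toy['author'] == '[deleted]', 'MRA'] = np.nan
print(toy)
```

Rows with a missing label are dropped by `groupby` by default, so comments from deleted accounts fall out of both groups in the plots.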

In [None]:
mra_authors = set(pd.read_csv('data/mra_authors.tsv', sep='\t').author.values)

def get_mra(user):
    c = 'only TwoX'
    if user == '[deleted]':
        c = np.nan
    elif user in mra_authors:
        c = 'TwoX and MRA'
    return c

twox_comments['MRA'] = twox_comments.author.apply(get_mra)
mra_grouped = twox_comments.groupby(by='MRA').resample('M', on='date')

### How many posts come from each group?
(Note that the number of posts by people who also post in MRA communities is not zero after the ban. This is because some users only posted in those communities after the ban had already happened.)

In [None]:
plot_users(mra_grouped, column='id', 
           cohorts=['only TwoX', 'TwoX and MRA'], 
           agg='count',
           events=events)

### How does the behavior of each group compare?

In [None]:
plot_users(mra_grouped, column='TOXICITY', 
           cohorts=['only TwoX', 'TwoX and MRA'], 
           agg='mean',
           events=events)

In [None]:
plot_users(mra_grouped, column='ATTACK_ON_COMMENTER', 
           cohorts=['only TwoX', 'TwoX and MRA'], 
           agg='mean', 
           events=events)

In [None]:
plot_users(mra_grouped, column='sentiment', 
           cohorts=['only TwoX', 'TwoX and MRA'], 
           agg='mean', 
           events=events)

In [None]:
plot_users(mra_grouped, column='INFLAMMATORY', 
           cohorts=['only TwoX', 'TwoX and MRA'], 
           agg='mean', 
           events=events)

# Reflect
- What do these comparisons tell us?
- Overall, what happened on default day?
- Overall, what happened on ban day?
- How are users who joined TwoX after default day different from those who joined before?
- How did users who joined before default day change when it was made a default subreddit? 
    - In what ways did they stay the same?
- How are users who post only in the feminist community TwoX different from those who also post in anti-feminist (MRA) communities? 
    - Why might that be? 
    
    
- Are there other ways we could break the community down into groups?
- Which of the breakdowns above were most illuminating?