# Lab 1: The Evolution of Community Norms

In this lab, we'll cover:
- Changes in online communities over time
- Changes in communities due to sudden events
- Whether community change is driven by longtime users changing their behavior, or new users bringing new behaviors

### Background Information
- `r/TwoXChromosomes` is a feminist subreddit community that started in 2009. People interested in the topic found it by searching or following links that other people posted. (We'll abbreviate it as "TwoX" from now on.)
- On May 7, 2014, it was made a "default subreddit," which means that everyone who went to reddit.com saw a link to TwoXChromosomes on the front page. 
- A flood of new people clicked the link and went to the subreddit. They didn't know the community, and many of them didn't like feminism.
- Users of anti-feminist subreddits, in particular, were angry that a feminist subreddit made the list of defaults. 
- This was a big controversy, both within Reddit and in the broader news media.
    - For example: "[Reddit women protest at new front-page position](https://www.theguardian.com/technology/2014/may/13/reddit-women-protest-front-page-subforum-subreddit-position)"
- In February 2017, the moderators of TwoX tried to decrease the hostility in their subreddit by banning all users who post in explicitly anti-feminist subreddits (e.g. r/pussypassdenied) from posting in TwoX.
- We're going to investigate what happened using data from this subreddit.

In [None]:
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt

%matplotlib inline

## Data
- Here we load the data and do a little cleanup.
- Don't worry about how the cleanup is done; just run the code and scroll down.

In [None]:
twox_comments = pd.read_csv('data/merged/TwoXChromosomes.tsv', 
                            sep='\t')
#convert our dates to the date data type
twox_comments['date'] = pd.to_datetime(twox_comments.date)
twox_comments.shape

In [None]:
twox_comments.head()

### Measuring activity over time
- When TwoX was added to Reddit's front page, people say there was a surge of new users. 
- This code puts the comments into groups: one group for every month. Then it gets the count of comments and the count of people commenting in each month. Finally, it plots them for us to see how the level of activity changes over time. 
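
To see what the grouping does on a small scale, here is a minimal sketch using made-up data (the authors and comments below are hypothetical, not from the TwoX dataset). It groups by calendar month with `dt.to_period('M')` rather than `resample`, but the per-month counts are computed in the same way.

```python
import pandas as pd

# Hypothetical toy data: five comments by three made-up authors
toy = pd.DataFrame({
    'date': pd.to_datetime(['2014-04-03', '2014-04-20', '2014-05-08',
                            '2014-05-09', '2014-05-30']),
    'author': ['ana', 'ana', 'ana', 'bo', 'cy'],
    'body': ['hi', 'hello', 'welcome', 'first!', 'thanks'],
})

# One group per calendar month
toy_monthly = toy.groupby(toy['date'].dt.to_period('M'))

# Number of comments in each month
toy_comments = toy_monthly['body'].count()

# Number of distinct authors in each month
toy_users = toy_monthly['author'].nunique()

print(pd.DataFrame({'comments': toy_comments, 'users': toy_users}))
```

April has two comments from one person, and May has three comments from three people, so the two counts can move independently.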

In [None]:
#Group the comments by month
#(pandas 2.2 and later prefer the alias 'ME' and warn about 'M')
monthly = twox_comments.resample('M', on='date')

#count the number of comments in each group
total_comments = monthly.body.count()

#count the number of unique comment authors in each group
active_users = monthly.author.nunique()

#show first few months
active_users.head()

In [None]:
total_comments.plot(title='Comments posted each month')

In [None]:
active_users.plot(title='# People active in r/TwoXChromosomes')

### Notice anything?
- There's a huge jump in the number of comments and users in May 2014, when TwoX was added to the front page.
- There's a huge drop in the number of comments and users in February 2017, when many users were banned.

### Measuring overall conflict
- If there is conflict among people in the subreddit, we might expect more comments to get deleted or removed.
- The first graph shows us the raw number of comments that are deleted or removed. It looks a lot like the total number of comments, though, because busier months have more of everything.
- The second graph shows us what percent of comments are deleted or removed, which accounts for how busy each month was.
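
As a rough illustration of why the percentage is more informative than the raw count, here is a small sketch on made-up comments (hypothetical data, not the TwoX dataset). Both months have exactly one deleted or removed comment, but the share differs because the months are not equally busy.

```python
import pandas as pd

# Hypothetical toy data: four comments in April, two in May
toy = pd.DataFrame({
    'date': pd.to_datetime(['2014-04-03', '2014-04-20', '2014-04-25',
                            '2014-04-28', '2014-05-08', '2014-05-09']),
    'body': ['hi', '[deleted]', 'thanks', 'ok', '[removed]', 'welcome'],
})

# True where the comment text was replaced by a placeholder
is_gone = toy['body'].isin(['[deleted]', '[removed]'])

# Calendar month of each comment
month = toy['date'].dt.to_period('M')

toy_deleted = is_gone.groupby(month).sum()        # raw count per month
toy_total = toy['body'].groupby(month).count()    # all comments per month
toy_pct = toy_deleted / toy_total * 100           # share per month, in percent

print(toy_pct)
```

The raw counts are 1 and 1, while the percentages are 25 and 50.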

In [None]:
#count the number of deleted or removed comments in each group
deleted = monthly.body.apply(lambda x: x[(x=='[deleted]') | (x=='[removed]')].count())

In [None]:
deleted.plot(title='Number of comments deleted')

In [None]:
pct_deleted = (deleted / total_comments) * 100
pct_deleted.plot(title='Percent of all comments deleted', ylim=(0,50), grid=True)

#### A helper function for more complicated plots.
Run this code and scroll down. Don't worry about how it works.

In [None]:
def make_plot(grouped, columns, title=None, top=None, bottom=0, 
             events=None, years=None):
    
    fig, axs = plt.subplots(figsize=(14,10))
    axs.set_ylim(bottom=bottom, top=top)
    
    if years is not None:
        axs.set_xlim(left=datetime(year=years[0], month=1, day=1), 
                     right=datetime(year=years[1], month=1, day=1))
                
    if events is not None:
        for e in events.iterrows():
            if e[1].event == 1:
                axs.axvline(e[1].date, color='k', alpha=.75)
                
    if isinstance(columns, str):
        columns = [columns]
        
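    for c in columns:
        #monthly mean and standard error of the mean for this column
        means = grouped[c].mean()
        sems = grouped[c].sem()
        #label each line so that axs.legend() has something to show
        axs.plot(means.index, means, label=c)
        #shade the approximate 95% confidence band around the mean
        axs.fill_between(sems.index, means-(1.96*sems), 
                         means+(1.96*sems), alpha=0.55)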

    if title is None:
        title = 'Average scores with 95% confidence interval'
        
    axs.set_title(title)
    axs.legend()
                
    plt.show()
    return

events = [{'date': datetime(year=2014, month=5, day=7),
          'event': 1,
          'description': 'TwoX made default'},
          {'date': datetime(year=2017, month=2, day=1),
          'event': 1,
          'description': 'TwoX bans anti-feminists'}]
events = pd.DataFrame(events)
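
For the curious, the shaded band that `make_plot` draws is the mean plus or minus 1.96 standard errors. The sketch below shows that arithmetic on a handful of made-up scores (the values are hypothetical). The 1.96 multiplier is a normal approximation, so the band is only a rough guide when a month has few comments.

```python
import pandas as pd

# Hypothetical scores for the comments in one month
scores = pd.Series([0.10, 0.20, 0.30, 0.40])

mean = scores.mean()
sem = scores.sem()   # sample standard deviation divided by sqrt(n)

# Approximate 95% confidence interval for the mean
lower = mean - 1.96 * sem
upper = mean + 1.96 * sem

print(f'mean={mean:.3f}, interval=({lower:.3f}, {upper:.3f})')
```

With more comments in a month, the standard error shrinks and the band gets narrower.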

In [None]:
make_plot(monthly, columns=['OBSCENE', 'TOXICITY', 'ATTACK_ON_COMMENTER', 
                            'INFLAMMATORY', 'LIKELY_TO_REJECT'], 
          events=events, top=.55)

In [None]:
make_plot(monthly, columns=['sentiment', 'pej_nouns'], events=events, top=.2)