**Continued from Part 4 Language Model**

In this part, we look at how the topics of discussion change over time in both subreddits. To do so, we take the corpus within a moving window of fixed length (12 months) and identify the most overrepresented words within each window.

First, let's create a counter object of the words (restricted to nouns, adjectives, and verbs) grouped by subreddit, year, and month, similar to what we did in Part 3.

In [1]:
import itertools
import datetime
import pickle
import glob
import os
from collections import Counter

import pandas as pd
from pandas import DataFrame

from preprocess import *

data_path = "/home/jichao/MongoDB/reddit"
bot_file = os.path.join(data_path, "bot_authors_2015_05.csv")
author_bot = pd.read_csv(bot_file)

srs = ("MensRights", "Feminism")
col_names = ("body", "author", "subreddit", "created_utc")

for sr in srs:
    fn_wildcard = os.path.join(data_path, sr + "_RC_*.pickle")
    filenames = glob.glob(fn_wildcard)

    reddit = DataFrame()

    for fn in filenames:
        # Open in binary mode and close the file handle when done
        with open(fn, "rb") as f:
            df = pickle.load(f)
        df = df[list(col_names)]

        df["author"] = df["author"].astype(str)
        df["subreddit"] = df["subreddit"].astype(str)
        df["created_utc"] = df["created_utc"].astype(int)

        # Drop comments written by known bot accounts
        df = df.loc[~df["author"].isin(author_bot["author"]), :]
        df["created_utc"] = df["created_utc"].map(lambda x: datetime.datetime.fromtimestamp(x))

        reddit = pd.concat([reddit, df], axis=0)

    reddit = reddit.reset_index(drop=True)

    # Group by year and month
    reddit["year"] = reddit["created_utc"].map(lambda x: x.year)
    reddit["month"] = reddit["created_utc"].map(lambda x: x.month)
    groupby_obj = reddit.groupby(["year", "month"])

    for key, bymonth in groupby_obj:
        # Work on a copy so that adding a column does not touch `reddit`
        bymonth = bymonth.copy()
        bymonth["tokens"] = bymonth["body"].map(lambda text: get_reddit_tokens(text, njv_only=True))
        year = str(key[0])
        month = "{0:02d}".format(key[1])
        name = "_".join([sr, year, month]) + ".pickle"

        bymonth = bymonth[["author", "tokens"]]
        tokens = list(itertools.chain(*bymonth["tokens"]))
        bymonth = Counter(tokens)

        # Save the counter as pickle object
        with open(name, "wb") as f:
            pickle.dump(bymonth, f)

For example, the counter for MensRights April 2008 looks like this:

In [2]:
with open("MensRights_2008_04.pickle", "rb") as f:
    counter = pickle.load(f)
counter.most_common(10)

[(u'woman', 19),
 (u'men', 18),
 (u'work', 12),
 (u'get', 10),
 (u'pay', 10),
 (u'guy', 7),
 (u'article', 6),
 (u'child', 5),
 (u'go', 4),
 (u'likely', 4)]

Given the counter for each (year, month) pair, we consider two corpora: $C_{i}$, which contains the words inside a 12-month moving window (we merge the counters within each window), and $C_{j}$, which contains the words outside the window. We also need a background corpus $C_{0}$, which is simply the union of $C_{i}$ and $C_{j}$.
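
The construction of the three corpora can be sketched on toy data. The monthly counters, the window length of 2, and the helper name `window_corpora` below are made up for illustration and are not part of the analysis code.

```python
from collections import Counter

def window_corpora(monthly, start, size):
    """Split a list of monthly Counters into (C_i, C_j, C_0).

    C_i merges the `size` months starting at index `start`,
    C_0 merges all months, and C_j is everything outside the window.
    """
    c_0 = Counter()
    for counter in monthly:
        c_0.update(counter)

    c_i = Counter()
    for counter in monthly[start:start + size]:
        c_i.update(counter)

    # The "-" operator drops words whose count falls to zero;
    # Counter.subtract() would keep them with a count of 0 instead.
    c_j = c_0 - c_i
    return c_i, c_j, c_0

# Toy monthly counters (made-up numbers)
monthly = [
    Counter({"men": 3, "work": 1}),
    Counter({"men": 2, "pay": 4}),
    Counter({"work": 5, "child": 2}),
]

c_i, c_j, c_0 = window_corpora(monthly, start=0, size=2)
print(c_i)
print(c_j)
print(sum(c_0.values()))
```

Because $C_{0}$ is the union of the two corpora, the sizes of $C_{i}$ and $C_{j}$ always add up to the size of $C_{0}$.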

To identify the words that are overrepresented in $C_{i}$ relative to $C_{j}$, we compute the log-odds ratio as follows:

$$\delta_{w}^{i,j} = \log\left( \frac{y_{w}^{i} + \alpha_{w}}{n^{i} + \alpha_{0} - (y_{w}^{i} + \alpha_{w}) } \right) - \log\left( \frac{y_{w}^{j} + \alpha_{w}}{n^{j} + \alpha_{0} - (y_{w}^{j} + \alpha_{w}) } \right)$$

where
* $n^{i}$ ($n^{j}$) is the size of corpus $C_{i}$ ($C_{j}$)
* $y_{w}^{i}$ ($y_{w}^{j}$) is the count of word $w$ in corpus $C_{i}$ ($C_{j}$)
* $\alpha_{0}$ is the size of the background corpus $C_{0}$, and $\alpha_{w}$ is the count of word $w$ in the background corpus $C_{0}$

The word count $\alpha_{w}$ and the size $\alpha_{0}$ of the background corpus effectively act as prior information incorporated into the log-odds ratio.
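
To get a feel for the log-odds ratio, here is a minimal sketch that evaluates $\delta_{w}^{i,j}$ for a single word. The function name `log_odds_delta` and all counts are hypothetical; the only assumption taken from the text is that the background size equals $n^{i} + n^{j}$.

```python
from __future__ import division  # true division on Python 2 as well
import math

def log_odds_delta(y_i, n_i, y_j, n_j, a_w, a_0):
    """Log-odds ratio of one word between C_i and C_j with prior counts (a_w, a_0)."""
    odds_i = (y_i + a_w) / (n_i + a_0 - (y_i + a_w))
    odds_j = (y_j + a_w) / (n_j + a_0 - (y_j + a_w))
    return math.log(odds_i) - math.log(odds_j)

# Made-up corpus sizes: 1,000 words inside the window, 9,000 outside
n_i, n_j = 1000, 9000
a_0 = n_i + n_j

# A word that is relatively more frequent inside the window
delta_over = log_odds_delta(y_i=30, n_i=n_i, y_j=20, n_j=n_j, a_w=50, a_0=a_0)

# A word with the same relative frequency (0.5%) inside and outside the window
delta_flat = log_odds_delta(y_i=5, n_i=n_i, y_j=45, n_j=n_j, a_w=50, a_0=a_0)

print(delta_over, delta_flat)
```

A word whose relative frequency is the same in both corpora gets a log-odds ratio of zero (up to floating-point error), while a word concentrated in the window gets a positive value.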

We also estimate the variance of $\delta_{w}^{i,j}$ as

$$ \sigma^{2}(\delta_{w}^{i,j}) \approx \frac{1}{y_{w}^{i} + \alpha_{w}} + \frac{1}{y_{w}^{j} + \alpha_{w}} $$

and the z-score as:
$$Z = \frac{\delta_{w}^{i,j}}{\sqrt{\sigma^{2}(\delta_{w}^{i,j})}}$$
Thus, the larger the z-score, the more overrepresented a word $w$ is in corpus $C_{i}$ relative to $C_{j}$, and vice versa. This is implemented in the function ***diff_words*** as follows:

In [3]:
import os
import re
import glob
import pickle
from collections import Counter

import matplotlib.pyplot as plt
from pandas import DataFrame
import pandas as pd
import numpy as np

data_path = "."

def zscore(word, counter_i, n_i, 
           counter_j, n_j,
           counter_0, n_0, log=np.log):

    # np.float was only an alias of the builtin float and has been removed from NumPy
    y_i = float(counter_i[word])
    y_j = float(counter_j[word])
    a_w = float(counter_0[word])

    ratio_i = (y_i + a_w) / (n_i + n_0 - y_i - a_w)
    ratio_j = (y_j + a_w) / (n_j + n_0 - y_j - a_w)

    if ratio_i < 0.:
        raise ValueError("ratio_i is negative: %f\n" % ratio_i)
    if ratio_j < 0.:
        raise ValueError("ratio_j is negative: %f\n" % ratio_j)

    logratio = log(ratio_i) - log(ratio_j)

    # Both terms are finite as long as the word occurs in the background corpus
    var_logratio = 1. / (y_i + a_w) + 1. / (y_j + a_w)
    z = logratio / np.sqrt(var_logratio)

    return z

def diff_words(sr, size=12):
    fn_wildcard = os.path.join(data_path, sr + "*.pickle")
    filenames = glob.glob(fn_wildcard)
    filenames = sorted(filenames)

    # 1. The background corpus
    corpus_bg = Counter()
    for fn in filenames:
        with open(fn, "rb") as f:
            bymonth = pickle.load(f)
        corpus_bg.update(bymonth)

    # Size of the background corpus
    n_0 = sum(corpus_bg.values())

    result = []

    for i in range(len(filenames) - size + 1):
        corpus_i = Counter()
        for j in range(i, i + size):
            with open(filenames[j], "rb") as f:
                bymonth = pickle.load(f)
            corpus_i.update(bymonth)

        n_i = sum(corpus_i.values())

        corpus_j = corpus_bg.copy()
        corpus_j.subtract(corpus_i)
        n_j = sum(corpus_j.values())

        # take the intersection of the two corpora
        # z-score is computable only on the intersection
        common = set(corpus_i.keys()) & set(corpus_j.keys())
        common = list(common)

        df = [(w, zscore(w, counter_i=corpus_i, n_i=n_i,
                        counter_j=corpus_j, n_j=n_j,
                        counter_0=corpus_bg, n_0=n_0)) for w in common]

        # Build the frame directly from the (word, zscore) tuples
        df = pd.DataFrame(df, columns=["word", "zscore"])
        df["zscore"] = df["zscore"].astype(float)
        df = df.sort_values(by="zscore")

        under = df.head()
        under = under.reset_index(drop=True)
        over = df.tail()
        over = over.reset_index(drop=True)

        words = pd.concat([under, over], axis=1, keys=["Underrepr", "Overrepr"])

        bn = os.path.basename(filenames[i])
        bn = bn.split(".")[0]
        _, yyyy, mm = bn.split("_")
        start = "-".join([yyyy, mm])
    
        bn = os.path.basename(filenames[i + size - 1])
        bn = bn.split(".")[0]
        _, yyyy, mm = bn.split("_")
        end = "-".join([yyyy, mm])
    
        result.append((" to ".join([start, end]), words, df["zscore"].describe()))
        
    return result     
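
As a quick sanity check of the z-score, the sketch below restates the formulas in plain Python and ranks the words of a tiny two-corpus example. The helper `toy_zscore` and the counters are made up for illustration; it is not the function used for the results in this post.

```python
from __future__ import division  # true division on Python 2 as well
import math
from collections import Counter

def toy_zscore(word, c_i, c_j):
    """z-score of `word` in c_i versus c_j, using c_i + c_j as the background."""
    c_0 = c_i + c_j
    n_i = sum(c_i.values())
    n_j = sum(c_j.values())
    n_0 = sum(c_0.values())
    y_i, y_j, a_w = c_i[word], c_j[word], c_0[word]

    delta = (math.log((y_i + a_w) / (n_i + n_0 - y_i - a_w))
             - math.log((y_j + a_w) / (n_j + n_0 - y_j - a_w)))
    variance = 1 / (y_i + a_w) + 1 / (y_j + a_w)
    return delta / math.sqrt(variance)

# Made-up counts: "election" is concentrated inside the window
c_i = Counter({"election": 8, "people": 10, "law": 2})
c_j = Counter({"election": 2, "people": 50, "law": 28})

scores = {w: toy_zscore(w, c_i, c_j) for w in c_i}
ranked = sorted(scores, key=scores.get)  # most underrepresented first
print(ranked)
```

Sorting in ascending order mirrors what `diff_words` does with `sort_values`: the head of the ranking holds the underrepresented words and the tail the overrepresented ones.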

In [4]:
men = diff_words("MensRights")
fem = diff_words("Feminism")

***diff_words*** computes for each 12-month window the top over- and under-represented words, and the summary statistics of the distribution of z-scores over words:

In [5]:
men[0][0]

'2008-03 to 2009-02'

In [6]:
men[0][1]

| | Underrepr word | Underrepr zscore | Overrepr word | Overrepr zscore |
|---|---|---|---|---|
| 0 | feminist | -1.300563 | xtian | 3.295225 |
| 1 | rape | -0.973318 | pretards | 3.671119 |
| 2 | mra | -0.76923 | downmodded | 3.834875 |
| 3 | people | -0.758582 | soceity | 5.963146 |
| 4 | sub | -0.757366 | pn6 | 11.08058 |


In [7]:
men[0][2]

count    15330.000000
mean         0.208563
std          0.385303
min         -1.300563
25%         -0.005423
50%          0.087080
75%          0.254178
max         11.080580
Name: zscore, dtype: float64

Lastly, let's summarize the information about the overrepresented words using the following helper function:

In [8]:
def summary_overrepr_words(result):
    # string for the 12-month window
    index = pd.Index([result[i][0] for i in range(len(result))])

    overrepr = pd.concat([result[i][1].iloc[:, [2]].T for i in range(len(result))])
    overrepr = overrepr.reset_index(drop=True)
    overrepr.columns = ["overrepr_%d" % i for i in range(1, 6)]
    
    stats = pd.concat([DataFrame(result[i][2]).T for i in range(len(result))])
    stats = stats.reset_index(drop=True)
    
    df = pd.concat([overrepr, stats], axis=1)
    df.index = index
    return df

In [9]:
df_men = summary_overrepr_words(men)
df_fem = summary_overrepr_words(fem)

Let's take a look at which words are overrepresented in each 12-month window:

In [10]:
# MensRights
df_men.loc["2012-06 to 2013-05":"2015-08 to 2016-07"].iloc[:, 0:5]

| window | overrepr_1 | overrepr_2 | overrepr_3 | overrepr_4 | overrepr_5 |
|---|---|---|---|---|---|
| 2012-06 to 2013-05 | adria | srs | text | copy | sr |
| 2012-07 to 2013-06 | adria | srs | text | copy | sr |
| 2012-08 to 2013-07 | gww | poster | text | copy | sr |
| 2012-09 to 2013-08 | rape | patriarchy | text | copy | sr |
| 2012-10 to 2013-09 | adria | patriarchy | text | sr | copy |
| 2012-11 to 2013-10 | feminist | rape | text | patriarchy | copy |
| 2012-12 to 2013-11 | text | rape | copy | feminist | patriarchy |
| 2013-01 to 2013-12 | adria | feminist | text | copy | patriarchy |
| 2013-02 to 2014-01 | feminist | text | copy | adria | patriarchy |
| 2013-03 to 2014-02 | text | copy | geek | patriarchy | adria |

In [11]:
# Feminism
df_fem.loc["2012-06 to 2013-05":"2015-08 to 2016-07"].iloc[:, 0:5]

| window | overrepr_1 | overrepr_2 | overrepr_3 | overrepr_4 | overrepr_5 |
|---|---|---|---|---|---|
| 2012-06 to 2013-05 | rapist | men | sr | joke | rape |
| 2012-07 to 2013-06 | men | rapist | tosh | joke | rape |
| 2012-08 to 2013-07 | nerd | rapist | geek | picture | rape |
| 2012-09 to 2013-08 | femen | picture | geek | rapist | rape |
| 2012-10 to 2013-09 | soy | femen | consent | false | rape |
| 2012-11 to 2013-10 | false | rapist | consent | miley | rape |
| 2012-12 to 2013-11 | song | consent | cyrus | rape | miley |
| 2013-01 to 2013-12 | thicke | rape | objectionable | cyrus | miley |
| 2013-02 to 2014-01 | femen | thicke | objectionable | cyrus | miley |
| 2013-03 to 2014-02 | thicke | allen | objectionable | cyrus | miley |


* The 2016 Presidential Election seemed to dominate the topics of discussion in both subreddits starting from early 2016 ("bernie", "candidate", "hillary", "clinton")

* The resignation of Ellen Pao from Reddit (https://en.wikipedia.org/wiki/Ellen_Pao#Exit_from_Reddit) seemed to have ignited heated discussion from 2015-06 to 2016-01