**Continued from Part 4 Language Model**

In this part, we want to find out which topics were trending in **r/MensRights** and **r/Feminism** at different points in time. This may give us an intuitive understanding of how the topics evolved over the years.

Technically, we'll slide an interval $I$ along the time axis and build two corpora, $C_{I}$ and $C_{\overline{I}}$, which contain the words inside and outside the moving interval, respectively. We want to find the words that are most overrepresented in $C_{I}$ relative to $C_{\overline{I}}$. To reduce the variance of the statistical estimates, we'll use a moving interval of 12 months, so the sequence of $C_{I}$ looks like this: $C_{May,2009-Apr,2010}, C_{Jun,2009-May,2010}, \cdots$
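As a rough illustration of the moving interval, the sketch below enumerates 12-month windows over a run of consecutive calendar months. The helper names `month_range` and `sliding_windows` are hypothetical and not part of this notebook, and the sketch assumes that every calendar month is present, which is not guaranteed for the real data.

```python
def month_range(start, end):
    """Yield (year, month) tuples from `start` to `end`, inclusive."""
    year, month = start
    while (year, month) <= end:
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def sliding_windows(months, size=12):
    """Return every run of `size` consecutive entries of `months`."""
    return [months[i:i + size] for i in range(len(months) - size + 1)]


# 14 consecutive months give 14 - 12 + 1 = 3 windows
months = list(month_range((2009, 5), (2010, 6)))
windows = sliding_windows(months, size=12)

first, second = windows[0], windows[1]
# first runs from May 2009 to Apr 2010, second from Jun 2009 to May 2010
```

Each window shares 11 months with its neighbour, so consecutive windows change slowly and a single noisy month has limited influence on any one of them.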

In [1]:
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["reddit_polarization"]

In [2]:
post = db["MensRights"].find_one()

In [3]:
post["created_utc"]

datetime.datetime(2012, 4, 30, 20, 0, 15)

In [4]:
post["tokens_njv"]

u'event happen sixth grade way day study teacher right like egalitarian possible switch major something involve abnormal psych nonjudging'

Note that we've already stored a clean list of tokens (nouns, adjectives, verbs) in each document, as a space-separated string. We can merge the tokens of all posts that share the same (`year`, `month`) tuple, which results in $C_{May,2009}, C_{Jun,2009}, \cdots$

So $C_{May,2009-Apr,2010} = C_{May,2009} \cup C_{Jun,2009} \cup \cdots \cup C_{Apr,2010}$
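Since each monthly corpus is stored as a bag of word counts, the union above amounts to adding counts. A minimal sketch with made-up monthly counters (the words and numbers are purely illustrative, and `merge_counters` is a hypothetical helper):

```python
from collections import Counter

# Toy monthly counters; the real ones come from the subreddit data
monthly = {
    (2009, 5): Counter({"woman": 2, "pay": 1}),
    (2009, 6): Counter({"woman": 1, "court": 3}),
    (2009, 7): Counter({"court": 1, "child": 2}),
}


def merge_counters(counters):
    """Add up word counts over an iterable of Counters."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total


window = [(2009, 5), (2009, 6)]

# Corpus inside the window
corpus_in = merge_counters(monthly[m] for m in window)

# Corpus outside the window: everything minus the window
corpus_all = merge_counters(monthly.values())
corpus_out = corpus_all - corpus_in
```

Note that the `-` operator on `Counter` drops words whose count falls to zero, whereas `Counter.subtract` keeps them with a count of 0. Which behaviour is preferable depends on whether words absent from one corpus should still be scored.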

We define a function `counter_bymonth` that
* Retrieves data from MongoDB
* Groups posts by `(post.year, post.month)`
* Computes a `Counter` on the union of tokens in each group
* Returns the result as a dataframe with columns `month` (a `(year, month)` tuple) and `counter`
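The grouping logic in these steps can be tried without a database. The sketch below is an in-memory stand-in, not the notebook's function: `posts` is a hypothetical list of dicts shaped like the MongoDB documents, and `counter_bymonth_inmemory` is a made-up name.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical stand-ins for documents in the MongoDB collection
posts = [
    {"created_utc": datetime(2012, 4, 30, 20, 0, 15),
     "tokens_njv": "event happen grade"},
    {"created_utc": datetime(2012, 4, 2, 8, 30, 0),
     "tokens_njv": "teacher  grade"},  # note the double space
    {"created_utc": datetime(2012, 5, 1, 9, 0, 0),
     "tokens_njv": "study teacher"},
]


def counter_bymonth_inmemory(posts):
    """Group posts by (year, month) and count tokens per group."""
    counters = defaultdict(Counter)
    for post in posts:
        ts = post["created_utc"]
        # split() with no argument skips empty strings between spaces
        counters[(ts.year, ts.month)].update(post["tokens_njv"].split())
    # Sort chronologically by the (year, month) key
    return sorted(counters.items(), key=lambda kv: kv[0])


by_month = counter_bymonth_inmemory(posts)
```

One design detail: `split(" ")` returns an empty string for every pair of adjacent spaces, so an empty token can end up in the counts, while `split()` with no argument avoids this.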

In [5]:
import pandas as pd
from collections import Counter
    
def counter_bymonth(subreddit):    
    created_utc = list(db[subreddit].find({}, {"created_utc": 1}))
    df = pd.DataFrame(created_utc)
    df = df.set_index("_id")
    groupby_obj = df.groupby(df["created_utc"].map(lambda x: (x.year, x.month)))

    result = []
    for key in groupby_obj.groups.keys():
        c = Counter()
        query = db[subreddit].find({"_id": {"$in": list(groupby_obj.groups[key])}}, {"tokens_njv": 1})
        for doc in query:
            c.update(doc["tokens_njv"].split(" "))
        
        result.append(pd.Series([key, c]))
        
    counter = pd.DataFrame(result)
    counter.columns = ["month", "counter"]
    counter = counter.sort_values("month", axis=0)
    counter = counter.reset_index(drop=True)
    return counter

In [6]:
mensrights_counter = counter_bymonth("MensRights")
mensrights_counter[:5]

| | month | counter |
|---|---|---|
| 0 | (2008, 3) | {u'essay': 1, u'consider': 3, u'feminist': 4, ... |
| 1 | (2008, 4) | {u'incarcerate': 1, u'bait': 1, u'forget': 1, ... |
| 2 | (2008, 5) | {u'sibling': 1, u'comment': 1, u'poor': 1, u'k... |
| 3 | (2008, 6) | {u'': 3, u'limited': 2, u'pheremones': 1, u'dy... |
| 4 | (2008, 7) | {u'': 6, u'limited': 3, u'unscientific': 1, u'... |


In [7]:
feminism_counter = counter_bymonth("Feminism")
feminism_counter[:5]

| | month | counter |
|---|---|---|
| 0 | (2009, 2) | {u'interesting': 1, u'notion': 1} |
| 1 | (2009, 4) | {u'suicide': 2, u'many': 1, u'least': 1, u'app... |
| 2 | (2009, 5) | {u'woman': 2, u'group': 1, u'figure': 1, u'thi... |
| 3 | (2009, 6) | {u'live': 1, u'bit': 1, u'bummer': 1} |
| 4 | (2009, 7) | {u'code': 2, u'psyche': 1, u'keyboard': 1, u'f... |


For example, the counter for MensRights April 2008 looks like this:

In [8]:
mensrights_counter.loc[mensrights_counter["month"] == (2008, 4), "counter"].iloc[0].most_common(10)

[(u'woman', 19),
 (u'men', 18),
 (u'work', 12),
 (u'get', 10),
 (u'pay', 10),
 (u'guy', 7),
 (u'article', 6),
 (u'child', 5),
 (u'go', 4),
 (u'number', 4)]

Next, we'll merge the counters to form the corpus $C_{I}$ that spans a 12-month interval, and also compute the corresponding $C_{\overline{I}}$.

Let $C_{i} = C_{I}$ and $C_{j} = C_{\overline{I}}$.

To identify the words that are overrepresented in $C_{i}$ relative to $C_{j}$, we compute the log-odds ratio as follows:

$$\delta_{w}^{i,j} = \log\left( \frac{y_{w}^{i} + \alpha_{w}}{n^{i} + \alpha_{0} - (y_{w}^{i} + \alpha_{w}) } \right) - \log\left( \frac{y_{w}^{j} + \alpha_{w}}{n^{j} + \alpha_{0} - (y_{w}^{j} + \alpha_{w}) } \right)$$

where
* $n^{i}$ ($n^{j}$) is the size of corpus $C_{i}$ ($C_{j}$)
* $y_{w}^{i}$ ($y_{w}^{j}$) is the count of word $w$ in corpus $C_{i}$ ($C_{j}$)
* $\alpha_{0}$ is the size of the background corpus $C_{0}$, and $\alpha_{w}$ is the count of word $w$ in the background corpus $C_{0}$

The count $\alpha_{w}$ of a word in the background corpus and the size $\alpha_{0}$ of the background corpus effectively act as the prior information incorporated into the log-odds ratio.

We also estimate the variance of $\delta_{w}^{i,j}$ as

$$\sigma^{2}(\delta_{w}^{i,j}) \approx \frac{1}{y_{w}^{i} + \alpha_{w}} + \frac{1}{y_{w}^{j} + \alpha_{w}}$$

and the z-score as

$$Z = \frac{\delta_{w}^{i,j}}{\sqrt{\sigma^{2}(\delta_{w}^{i,j})}}$$
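To make the formulas concrete, here is a small numeric sketch with made-up counts. The function name `log_odds_zscore` and all numbers are illustrative assumptions. The background corpus is taken to be the union of the two corpora, so $\alpha_{0} = n^{i} + n^{j}$ and $\alpha_{w} = y_{w}^{i} + y_{w}^{j}$.

```python
import math


def log_odds_zscore(y_i, n_i, y_j, n_j, a_w, a_0):
    """z-score of the log-odds ratio with background counts as prior."""
    odds_i = (y_i + a_w) / float(n_i + a_0 - y_i - a_w)
    odds_j = (y_j + a_w) / float(n_j + a_0 - y_j - a_w)
    delta = math.log(odds_i) - math.log(odds_j)
    variance = 1.0 / (y_i + a_w) + 1.0 / (y_j + a_w)
    return delta / math.sqrt(variance)


# Toy sizes: 1,000 tokens inside the window, 9,000 outside
n_i, n_j = 1000, 9000
a_0 = n_i + n_j  # assumed background: both corpora together

# Word used 30 times inside and 30 times outside the window:
# its rate is much higher inside, so the z-score is positive
z_over = log_odds_zscore(30, n_i, 30, n_j, a_w=60, a_0=a_0)

# Word used at the same rate (0.5%) inside and outside:
# the two odds coincide, so the z-score is essentially zero
z_flat = log_odds_zscore(5, n_i, 45, n_j, a_w=50, a_0=a_0)

# Word used once inside and 59 times outside:
# its rate is lower inside, so the z-score is negative
z_under = log_odds_zscore(1, n_i, 59, n_j, a_w=60, a_0=a_0)
```

Under this choice of background, a word whose relative frequency is identical in both corpora gets a log-odds ratio of zero regardless of how different the corpus sizes are.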
Intuitively, the larger the z-score, the more overrepresented a word $w$ is in corpus $C_{i}$ than in $C_{j}$, and vice versa. It's implemented in the function `differential_words` as follows:

In [13]:
import numpy as np

def differential_words(df_subreddit, size=12):

    # 1. The background corpus
    corpus_bg = Counter()
    for counter in df_subreddit["counter"]:
        corpus_bg.update(counter)

    # Size of the background corpus
    n_0 = sum(corpus_bg.values())

    result = []
    for i in range(len(df_subreddit["counter"]) - size + 1):
        # 2. corpus `i`
        corpus_i = Counter()
        for j in range(i, i + size):
            counter = df_subreddit.loc[j, "counter"]
            corpus_i.update(counter)        
        # Size of corpus `i`
        n_i = sum(corpus_i.values())

        # 3. corpus `j`
        corpus_j = corpus_bg.copy()
        corpus_j.subtract(corpus_i)
        # Size of corpus `j`
        n_j = sum(corpus_j.values())

        # take the intersection of the two corpora
        # z-score is computable only on the intersection
        common = set(corpus_i.keys()) & set(corpus_j.keys())
        common = list(common)

        df = [(w, zscore(w, counter_i=corpus_i, n_i=n_i,
                        counter_j=corpus_j, n_j=n_j,
                        counter_0=corpus_bg, n_0=n_0)) for w in common]

        # Build the dataframe directly from the list of (word, zscore) tuples
        df = pd.DataFrame(df, columns=["word", "zscore"])
        df["zscore"] = df["zscore"].astype(float)
        df = df.sort_values(by="zscore")

        underrepresented = df.head()
        underrepresented = underrepresented.reset_index(drop=True)
        overrepresented = df.tail()
        overrepresented = overrepresented.reset_index(drop=True)

        words = pd.concat([underrepresented, overrepresented],
                          axis=1,
                          keys=["U_represented", "O_represented"])

        year, month = df_subreddit["month"].iloc[i]
        low = "{0}-{1:02d}".format(year, month)
        year, month = df_subreddit["month"].iloc[i + size - 1]
        high = "{0}-{1:02d}".format(year, month)
            
        result.append((" to ".join([low, high]), words, df["zscore"].describe()))
        
    return result     

def zscore(word, counter_i, n_i, 
           counter_j, n_j,
           counter_0, n_0, log=np.log):

    # Counts of `word` in corpus i, corpus j, and the background corpus
    y_i = float(counter_i[word])
    y_j = float(counter_j[word])
    a_w = float(counter_0[word])

    # Odds of `word` in each corpus, smoothed by the background counts
    ratio_i = (y_i + a_w) / (n_i + n_0 - y_i - a_w)
    ratio_j = (y_j + a_w) / (n_j + n_0 - y_j - a_w)

    if ratio_i < 0.:
        raise ValueError("ratio_i is negative: %f\n" % ratio_i)
    if ratio_j < 0.:
        raise ValueError("ratio_j is negative: %f\n" % ratio_j)

    logratio = log(ratio_i) - log(ratio_j)

    # Approximate variance of the log-odds ratio
    var_logratio = 1. / (y_i + a_w) + 1. / (y_j + a_w)

    return logratio / np.sqrt(var_logratio)

In [14]:
men = differential_words(mensrights_counter)
fem = differential_words(feminism_counter)

`differential_words` computes for each 12-month window the top over- and under-represented words, and the summary statistics of the distribution of z-scores over words:

In [15]:
men[0][0]

'2008-03 to 2009-02'

In [16]:
men[0][1]

| | U_represented word | U_represented zscore | O_represented word | O_represented zscore |
|---|---|---|---|---|
| 0 | feminist | -1.300174 | xtian | 3.29523 |
| 1 | rape | -0.972982 | pretards | 3.671122 |
| 2 | mra | -0.769091 | downmodded | 3.834883 |
| 3 | people | -0.758109 | soceity | 5.963154 |
| 4 | sub | -0.757249 | pn6 | 11.080591 |


In [17]:
men[0][2]

count    15331.000000
mean         0.208556
std          0.385296
min         -1.300174
25%         -0.005412
50%          0.087121
75%          0.254180
max         11.080591
Name: zscore, dtype: float64

Lastly, let's summarize the overrepresented words using the following helper function:

In [20]:
def summary_overrepr_words(result):
    # string for the 12-month window
    index = pd.Index([result[i][0] for i in range(len(result))])

    overrepr = pd.concat([result[i][1].iloc[:, [2]].T for i in range(len(result))])
    overrepr = overrepr.reset_index(drop=True)
    overrepr.columns = ["overrepr_%d" % i for i in range(1, 6)]
    
    stats = pd.concat([pd.DataFrame(result[i][2]).T for i in range(len(result))])
    stats = stats.reset_index(drop=True)
    
    df = pd.concat([overrepr, stats], axis=1)
    df.index = index
    return df

In [21]:
df_men = summary_overrepr_words(men)
df_fem = summary_overrepr_words(fem)

Let's take a look at which words are overrepresented in each 12-month window:

In [22]:
# MensRights
df_men.loc["2012-06 to 2013-05":"2015-08 to 2016-07"].iloc[:, 0:5]

| | overrepr_1 | overrepr_2 | overrepr_3 | overrepr_4 | overrepr_5 |
|---|---|---|---|---|---|
| 2012-06 to 2013-05 | adria | srs | text | copy | sr |
| 2012-07 to 2013-06 | adria | srs | text | copy | sr |
| 2012-08 to 2013-07 | gww | poster | text | copy | sr |
| 2012-09 to 2013-08 | rape | patriarchy | text | copy | sr |
| 2012-10 to 2013-09 | adria | patriarchy | text | sr | copy |
| 2012-11 to 2013-10 | feminist | rape | text | patriarchy | copy |
| 2012-12 to 2013-11 | text | rape | copy | feminist | patriarchy |
| 2013-01 to 2013-12 | adria | feminist | text | copy | patriarchy |
| 2013-02 to 2014-01 | feminist | text | adria | copy | patriarchy |
| 2013-03 to 2014-02 | text | copy | geek | patriarchy | adria |


In [23]:
# Feminism
df_fem.loc["2012-06 to 2013-05":"2015-08 to 2016-07"].iloc[:, 0:5]

| | overrepr_1 | overrepr_2 | overrepr_3 | overrepr_4 | overrepr_5 |
|---|---|---|---|---|---|
| 2012-06 to 2013-05 | rapist | men | sr | joke | rape |
| 2012-07 to 2013-06 | men | rapist | tosh | joke | rape |
| 2012-08 to 2013-07 | nerd | rapist | geek | picture | rape |
| 2012-09 to 2013-08 | femen | picture | geek | rapist | rape |
| 2012-10 to 2013-09 | soy | femen | consent | false | rape |
| 2012-11 to 2013-10 | false | rapist | consent | miley | rape |
| 2012-12 to 2013-11 | song | consent | cyrus | rape | miley |
| 2013-01 to 2013-12 | thicke | rape | objectionable | cyrus | miley |
| 2013-02 to 2014-01 | femen | thicke | objectionable | cyrus | miley |
| 2013-03 to 2014-02 | thicke | allen | objectionable | cyrus | miley |


* The 2016 Presidential Election seemed to dominate the topics of discussion in both subreddits starting from early 2016 ("bernie", "candidate", "hillary", "clinton")

* The resignation of Ellen Pao from Reddit (https://en.wikipedia.org/wiki/Ellen_Pao#Exit_from_Reddit) seemed to have ignited heated discussion from 2015-06 to 2016-01