This notebook explores methods for comparing two different textual datasets to identify the terms that are distinct to each one:

* Difference of proportions (described in [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf), section 3.2.2)
* Mann-Whitney rank-sum test (described in [Kilgarriff 2001, Comparing Corpora](https://www.sketchengine.eu/wp-content/uploads/comparing_corpora_2001.pdf), section 2.3)

In [6]:
import sys, operator
from collections import Counter
from scipy.stats import mannwhitneyu

In [7]:
# The convote data is already tokenized, so just split on single spaces.
with open("../data/repub.convote.txt", encoding="utf-8") as f:
    repub_tokens = f.read().split(" ")
with open("../data/dem.convote.txt", encoding="utf-8") as f:
    dem_tokens = f.read().split(" ")
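
A side note on the tokenization above: splitting on a single space is not the same as splitting on any whitespace. The short sketch below uses an invented sample string (not taken from the convote files) to show the difference, which is one reason the preprocessing step in this notebook filters out empty strings.

```python
# Invented sample: note the double space and the embedded newline.
raw = "mr. speaker ,  i rise\ntoday"

# Splitting on a single space keeps empty strings and leaves newlines inside tokens.
by_space = raw.split(" ")

# Splitting with no argument splits on any run of whitespace and drops empty strings.
by_whitespace = raw.split()

print(by_space)
print(by_whitespace)
```

Whether the real files contain double spaces or newlines is not something this sketch establishes; it only shows what each call would do if they did.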

Q1: First, calculate the simple "difference of proportions" measure from Monroe et al.'s "Fighting Words", section 3.2.2. Which ten terms does this measure rank as most Republican, and which ten as most Democratic?
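
For reference, the measure can be written as follows. The notation is an assumption made for this notebook (it is not copied from the paper): $c^{(i)}_w$ is the count of word $w$ in corpus $i$, and $n^{(i)}$ is the total number of tokens in corpus $i$.

$$
\hat{p}^{(i)}_w = \frac{c^{(i)}_w}{n^{(i)}}, \qquad \Delta_w = \hat{p}^{(1)}_w - \hat{p}^{(2)}_w
$$

A positive $\Delta_w$ means $w$ makes up a larger share of corpus 1 than of corpus 2; a word missing from one corpus simply has a proportion of 0 there.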

In [16]:
## Preprocess
# Remove uninformative tokens: period, comma, hyphen, double hyphen, and empty strings.
PLAIN_PUNCT = ['.', ',', '-', '--', '']
repub_tokens = [x for x in repub_tokens if x not in PLAIN_PUNCT]
dem_tokens = [x for x in dem_tokens if x not in PLAIN_PUNCT]

In [11]:
def difference_of_proportions(one_tokens, two_tokens):
    """Return {word: proportion in one_tokens - proportion in two_tokens}."""
    freq_dict1 = Counter(one_tokens)
    freq_dict2 = Counter(two_tokens)
    n1 = len(one_tokens)
    n2 = len(two_tokens)
    # Report the vocabulary size of each corpus.
    print(len(freq_dict1))
    print(len(freq_dict2))

    diff_dict = {}
    # A Counter returns 0 for unseen words, so one pass over the union covers
    # words shared by both corpora and words that appear in only one of them.
    for w in set(freq_dict1) | set(freq_dict2):
        diff_dict[w] = freq_dict1[w] / n1 - freq_dict2[w] / n2
    return diff_dict

In [17]:
diff_dict = difference_of_proportions(dem_tokens, repub_tokens)

15475
13938


In [18]:
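# diff_dict holds (Democratic proportion - Republican proportion), so the most
# negative values are the most Republican and the most positive the most Democratic.
most_repub = sorted(diff_dict.items(), key=lambda x: x[1])[:10]
most_dem = sorted(diff_dict.items(), key=lambda x: x[1], reverse=True)[:10]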

print("Most Republican words:")
for r in most_repub:
    print(r)
print("****\nMost Democratic words:")    
for d in most_dem:
    print(d)

Most Republican words:
('i', -0.0032929753891177554)
('we', -0.0023808265674042563)
('and', -0.001995265535181777)
('of', -0.0017297224302114632)
('chairman', -0.001099756559772172)
('that', -0.0010955393393343796)
('as', -0.001047375123724081)
('a', -0.0009286487567812772)
('gentleman', -0.0009272354531633575)
('mr.', -0.000866433585557211)
****
Most Democratic words:
('not', 0.0017953840754873105)
('$', 0.0017246920195740645)
('cuts', 0.0011791436974444912)
('bill', 0.0011663230146247991)
('republican', 0.0011447679771725845)
('budget', 0.0010580517051874212)
('billion', 0.0010081561042747087)
('would', 0.0008780387846397406)
('health', 0.0008609657671188335)
('for', 0.000832992252975907)


Simply analyzing the difference in relative frequencies has a number of downsides:
1.) As Monroe et al. (2009) point out (and as we can see here as well), it tends to emphasize high-frequency words (be sure you understand why).
2.) We're not measuring whether a difference is statistically meaningful or just due to chance. The $\chi^2$ test (described in Kilgarriff 2001 and, in the context of collocations, in Manning and Schuetze [here](https://nlp.stanford.edu/fsnlp/promo/colloc.pdf)) is one method that addresses the desideratum of finding statistically significant terms, but it has a downside of its own:
3.) Simply counting up the total number of mentions of a term doesn't account for the "burstiness" of language: if we see the word "Dracula" in a text, we're probably going to see it again in that same text. Occurrences of words are not independent random events; they are tightly coupled with each other. If we're trying to understand the robust differences between two corpora, we might prefer to prioritize words that show up more frequently *everywhere* in corpus A (but not in corpus B) over those that show up very frequently only within a narrow slice of A (such as one text in a genre, one chapter in a book, or one speaker when measuring the differences between political parties).
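
To make the burstiness point concrete, here is a minimal sketch on invented toy data. The corpus, the chunk size of 12, and the helper name `chunk` are all assumptions for illustration. Two words have the same total count, so any measure based on totals treats them identically, yet one is concentrated in a single chunk and the other is spread across every chunk.

```python
from collections import Counter

def chunk(tokens, size):
    """Split tokens into consecutive chunks of `size` tokens (hypothetical helper)."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

CHUNK_SIZE = 12  # toy value, chosen only to keep the example small

# Toy corpus of 10 chunks: "dracula" is bursty (all 10 mentions in the first
# chunk), while "blood" appears exactly once in every chunk.
first_chunk = ["dracula"] * 10 + ["blood", "filler"]
other_chunk = ["blood"] + ["filler"] * 11
corpus_a = first_chunk + other_chunk * 9

totals = Counter(corpus_a)
chunks = chunk(corpus_a, CHUNK_SIZE)

# Per-chunk counts are what distinguish the two words.
per_chunk = {w: [c.count(w) for c in chunks] for w in ("dracula", "blood")}
chunks_containing = {w: sum(1 for n in counts if n > 0)
                     for w, counts in per_chunk.items()}

print(totals["dracula"], totals["blood"])  # identical totals
print(per_chunk["dracula"])
print(per_chunk["blood"])
print(chunks_containing)
```

A test that works on the per-chunk counts rather than on the totals can tell these two words apart.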

Q2 (check-plus): One measure that does account for this burstiness is corpus linguistics' adaptation of the non-parametric Mann-Whitney rank-sum test. The specific adaptation of this test for text is described in Kilgarriff 2001, section 2.3. Implement this test using a fixed chunk size of 500 and the [SciPy mannwhitneyu function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html); which ten terms does this measure rank as most Republican, and which ten as most Democratic?

In [20]:
t = list(mannwhitneyu([1,2,3],[2,3,4]))

In [21]:
t[0]

2.0
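
To see where the 2.0 comes from, the U statistic can be computed by hand. This is a minimal pure-Python sketch, not SciPy's implementation, and `u_statistic` is a hypothetical helper name. U for the first sample counts the pairs in which a value from the first sample exceeds a value from the second, with ties counted as one half.

```python
def u_statistic(xs, ys):
    """U for xs: number of (x, y) pairs with x > y, counting ties as 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in xs for y in ys)

xs, ys = [1, 2, 3], [2, 3, 4]
u1 = u_statistic(xs, ys)  # U for the first sample
u2 = u_statistic(ys, xs)  # U for the second sample

print(u1, u2)
# The two statistics always sum to the number of pairs, len(xs) * len(ys).
print(u1 + u2 == len(xs) * len(ys))
```

Here only one pair has the first value strictly larger (3 > 2) and two pairs are ties, giving 1 + 0.5 + 0.5 = 2.0. Which of the two U values a given SciPy version reports may differ, so it is worth checking the documentation for the installed version.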

In [26]:
#TODO: remove punctuation

## Citation: code adapted from https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
def divide_chunks(chunk_size, ls):
    for i in range(0, len(ls), chunk_size):
        yield ls[i:i + chunk_size]

def mann_whitney_analysis(one_tokens, two_tokens):
    """Run one-sided Mann-Whitney tests on per-chunk word counts.

    Returns (one_dict, two_dict); each maps word -> [U statistic, p-value].
    """
    one_chunks = list(divide_chunks(500, one_tokens))
    two_chunks = list(divide_chunks(500, two_tokens))

    freq_dict1 = Counter(one_tokens)
    freq_dict2 = Counter(two_tokens)

    # Count each chunk once up front instead of once per word. The final chunk
    # of each corpus is dropped because it is usually shorter than 500 tokens.
    one_chunk_counts = [Counter(c) for c in one_chunks[:-1]]
    two_chunk_counts = [Counter(c) for c in two_chunks[:-1]]

    one_dict = {}
    two_dict = {}

    ## Only words that appear in both corpora are tested. Each test's sample
    ## sizes are the numbers of chunks per corpus, not the word's frequency.
    for word in set(freq_dict1).intersection(freq_dict2):
        word_counts1 = [wc[word] for wc in one_chunk_counts]
        word_counts2 = [wc[word] for wc in two_chunk_counts]
        # "less": corpus one's counts tend to be lower (word leans toward corpus two).
        less_U_score = list(mannwhitneyu(word_counts1, word_counts2, alternative="less"))
        # "greater": corpus one's counts tend to be higher (word leans toward corpus one).
        more_U_score = list(mannwhitneyu(word_counts1, word_counts2, alternative="greater"))
        two_dict[word] = less_U_score
        one_dict[word] = more_U_score

    return one_dict, two_dict

# Note: U1 + U2 = n1 * n2, where n1 and n2 are the numbers of chunks in each corpus.

In [27]:
dem_dict, repub_dict = mann_whitney_analysis(dem_tokens, repub_tokens)

In [46]:
print("Top 10 words from Republicans (mann whitney ranking test):")
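# Each value is [U, p-value]; Python compares lists element by element, so this sorts by U.
# The smallest U (Democratic chunk counts ranked lowest) marks the most Republican words.
repub10 = sorted(repub_dict.items(), key=lambda x: x[1])[:10]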
for word in repub10:
    print(word)

Top 10 words from Republicans (mann whitney ranking test):
('i', [247686.5, 1.1592872881373215e-11])
('gentleman', [260807.0, 4.790904828804205e-09])
('important', [262048.0, 1.722437149982608e-09])
('as', [263593.5, 3.422636280659407e-07])
('support', [265245.0, 1.5795241686504472e-07])
('may', [266898.5, 1.7649158778150073e-08])
('thank', [269151.0, 1.1110670536357667e-07])
('chairman', [269437.5, 1.871796332075302e-06])
('some', [269718.0, 1.4795575389241594e-06])
('and', [270318.5, 1.6134074124571757e-05])


In [48]:
print("Top 10 words from Democrats (mann whitney ranking test):")
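# Each value is [U, p-value]; Python compares lists element by element, so this sorts by U.
# The largest U (Democratic chunk counts ranked highest) marks the most Democratic words.
dem10 = sorted(dem_dict.items(), key=lambda x: x[1], reverse=True)[:10]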
for word in dem10:
    print(word)


Top 10 words from Democrats (mann whitney ranking test):
('republican', [391025.5, 7.544689920813004e-39])
('cuts', [364802.5, 8.794525913343548e-27])
('$', [363043.0, 3.6705298040946054e-12])
('not', [361343.5, 8.259394993039342e-10])
('majority', [360874.0, 1.320203124308803e-17])
('billion', [357353.5, 3.5467210394973643e-13])
('administration', [351468.5, 9.457601798583503e-13])
('republicans', [349844.0, 2.742561996311067e-16])
('cut', [348380.5, 8.438723219700627e-15])
('opposition', [348243.0, 9.35878959253233e-14])
