This notebook explores methods for comparing two different textual datasets to identify the terms that are distinct to each one:

* Difference of proportions (described in [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf), section 3.2.2)
* Mann-Whitney rank-sums test (described in [Kilgarriff 2001, Comparing Corpora](https://www.sketchengine.eu/wp-content/uploads/comparing_corpora_2001.pdf), section 2.3)

In [90]:
import sys, operator
from collections import Counter
from scipy.stats import mannwhitneyu

In [91]:
# the convote data is already tokenized, so just split on single spaces
repub_tokens=open("../data/repub.convote.txt", encoding="utf-8").read().split(" ")
dem_tokens=open("../data/dem.convote.txt", encoding="utf-8").read().split(" ")

Q1: First, calculate the simple "difference of proportions" measure from Monroe et al.'s "Fighting Words", section 3.2.2.  According to this measure, what are the ten most Republican terms and the ten most Democratic terms?

In [92]:
def count_differences(one_tokens, two_tokens):
    one_N=len(one_tokens)
    two_N=len(two_tokens)
    
    one_counts=Counter()
    two_counts=Counter()
    
    vocab={}
    for token in one_tokens:
        one_counts[token]+=1
        vocab[token]=1
        
    for token in two_tokens:
        two_counts[token]+=1    
        vocab[token]=1
        
    differences={}
    for word in vocab:
        freq1=one_counts[word]/one_N
        freq2=two_counts[word]/two_N
        
        diff=freq1-freq2
        differences[word]=diff
        
    return differences

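def difference_of_proportions(one_tokens, two_tokens):

    # differences are freq(one_tokens) - freq(two_tokens): negative values favor
    # two_tokens and positive values favor one_tokens.  The labels printed below
    # assume the call difference_of_proportions(dem_tokens, repub_tokens).
    differences=count_differences(one_tokens, two_tokens)
    
    sorted_differences = sorted(differences.items(), key=operator.itemgetter(1))
    print ("More Republican:")
    for k,v in sorted_differences[:10]:
        print ("%s\t%s" % (k,v))
    print("\nMore Democrat:")
    for k,v in reversed(sorted_differences[-10:]):
        print ("%s\t%s" % (k,v))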
        

In [93]:
difference_of_proportions(dem_tokens, repub_tokens)

More Republican:
i	-0.002870948015418236
we	-0.0020739540633471117
and	-0.0017279456625680124
of	-0.0014950519581076668
,	-0.00105321588184943
chairman	-0.0009598934247981516
that	-0.000945583476245123
as	-0.0009124972356492223
gentleman	-0.0008093810284795912
a	-0.0008020309007514565

More Democrat:
not	0.0015745433184340962
$	0.0015095648428079297
cuts	0.001031315818425968
bill	0.0010228370409021796
republican	0.001001288082839861
budget	0.0009261863664701928
billion	0.0008820967153998979
would	0.0007701123575280444
health	0.0007538336601492987
for	0.0007352277281844066


Simply analyzing the difference in relative frequencies has a number of downsides: 1.) As Monroe et al. (2009) point out (and as we can see here as well), it tends to emphasize high-frequency words (be sure you understand why).  2.) We're not measuring whether a difference is statistically meaningful or just due to chance; the $\chi^2$ test is one method (described in Kilgarriff 2001 and, in the context of collocations, in Manning and Schuetze [here](https://nlp.stanford.edu/fsnlp/promo/colloc.pdf)) that addresses the desideratum of finding statistically significant terms, but it too has a downside: 3.) Simply counting up the total number of mentions of a term doesn't account for the "burstiness" of language -- if we see the word "Dracula" in a text, we're probably going to see it again in that same text.  Occurrences of words are not independent random events; they are tightly coupled with each other. If we're trying to understand the robust differences between two corpora, we might prefer to prioritize words that show up more frequently *everywhere* in corpus A (but not in corpus B) over those that show up very frequently only within a narrow slice of A (such as one text in a genre, one chapter in a book, or one speaker when measuring the differences between political parties).
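
As a rough illustration of the $\chi^2$ approach mentioned above, the sketch below tests a single term with a 2x2 contingency table (term vs. all other tokens, corpus one vs. corpus two). The helper name `chi_square_for_term` and the toy token lists are hypothetical and not part of this notebook; the sketch also assumes that turning off the continuity correction in `scipy.stats.chi2_contingency` is acceptable for illustration.

```python
from collections import Counter
from scipy.stats import chi2_contingency

def chi_square_for_term(term, one_tokens, two_tokens):
    """Chi-square test on a 2x2 table for one term (hypothetical helper)."""
    a = Counter(one_tokens)[term]   # mentions of the term in corpus one
    b = Counter(two_tokens)[term]   # mentions of the term in corpus two
    # rows: the term / every other token; columns: corpus one / corpus two
    table = [[a, b],
             [len(one_tokens) - a, len(two_tokens) - b]]
    # correction=False disables Yates' continuity correction (an assumption
    # made here for simplicity).  chi2_contingency raises a ValueError if an
    # expected count is zero, e.g. when the term occurs in neither corpus.
    chi2, pval, dof, expected = chi2_contingency(table, correction=False)
    return chi2, pval

# toy data (made up): "tax" is three times as frequent in the first list
toy_one = ["tax"] * 30 + ["the"] * 970
toy_two = ["tax"] * 10 + ["the"] * 990

chi2, pval = chi_square_for_term("tax", toy_one, toy_two)
print("chi2=%.3f\tp=%.5f" % (chi2, pval))
```

Note that this test pools every mention of the term in each corpus into a single count, so it is exposed to the burstiness problem described in point 3.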

Q2 (check-plus): One measure that does account for this burstiness is corpus linguistics' adaptation of the non-parametric Mann-Whitney rank-sum test. The specific adaptation of this test for text is described in Kilgarriff 2001, section 2.3.  Implement this test using a fixed chunk size of 500 and the [SciPy mannwhitneyu function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html); according to this measure, what are the ten most Republican terms and the ten most Democratic terms?
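
To see why comparing per-chunk counts helps with burstiness, here is a toy sketch under made-up assumptions: a chunk size of 10 instead of 500, three invented token lists, and a hypothetical helper `term_counts_per_chunk`. The bursty and evenly spread lists contain "dracula" exactly the same number of times, so a pooled frequency comparison against the third list cannot tell them apart, while the rank-based test on chunk counts can.

```python
from scipy.stats import mannwhitneyu

def term_counts_per_chunk(tokens, term, chunk_length):
    """Count of `term` in each consecutive chunk_length-token window (hypothetical helper)."""
    return [tokens[i:i + chunk_length].count(term)
            for i in range(0, len(tokens), chunk_length)]

chunk_length = 10  # toy value; the exercise itself uses 500

# "dracula" occurs 10 times in both lists, but in very different patterns
bursty = ["dracula"] * 10 + ["the"] * 90        # all mentions fall in one chunk
spread = (["dracula"] + ["the"] * 9) * 10       # one mention in every chunk
other = ["the"] * 100                           # the term never occurs

bursty_counts = term_counts_per_chunk(bursty, "dracula", chunk_length)
spread_counts = term_counts_per_chunk(spread, "dracula", chunk_length)
other_counts = term_counts_per_chunk(other, "dracula", chunk_length)

# identical pooled counts, so identical pooled relative frequencies
assert sum(bursty_counts) == sum(spread_counts) == 10

_, p_bursty = mannwhitneyu(bursty_counts, other_counts, alternative="two-sided")
_, p_spread = mannwhitneyu(spread_counts, other_counts, alternative="two-sided")

print("bursty vs. other: p=%.5f" % p_bursty)
print("spread vs. other: p=%.5f" % p_spread)
```

In this toy setup the evenly spread term should receive a much smaller p-value than the bursty one, because only one of the bursty list's ten chunks differs from the comparison list.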

In [98]:
# convert a sequence of tokens into counts for each chunkLength-word window
def get_chunk_counts(tokens, chunkLength):
    chunks=[]
    for i in range(0, len(tokens), chunkLength):
        # slicing stops at the end of the list, so the final chunk may be shorter
        chunks.append(Counter(tokens[i:i+chunkLength]))
    return chunks

# calculate mann-whitney test for each word in vocabulary
def mann_whitney(one_tokens, two_tokens):

    chunkLength=500
    one_chunks=get_chunk_counts(one_tokens, chunkLength)
    two_chunks=get_chunk_counts(two_tokens, chunkLength)
    
    # vocab is the union of terms in both sets
    vocab={}
    
    for chunk in one_chunks:
        for word in chunk:
            vocab[word]=1
    for chunk in two_chunks:
        for word in chunk:
            vocab[word]=1
    
    pvals={}
    
    for word in vocab:
        
        a=[]
        b=[]
        
        # Note a and b can be different lengths (i.e., different sample sizes)
        # 
        # See Mann and Whitney (1947), "On a Test of Whether one of Two Random 
        # Variables is Stochastically Larger than the Other"
        # https://projecteuclid.org/download/pdf_1/euclid.aoms/1177730491
        
        # (This is part of their innovation over the case of equal sample sizes in Wilcoxon 1945)
        
        for chunk in one_chunks:
            a.append(chunk[word])
        for chunk in two_chunks:
            b.append(chunk[word])

        statistic,pval=mannwhitneyu(a,b, alternative="two-sided")
        
        # We'll use the p-value as our quantity of interest.  [Note in the normal approximation
        # that Mann-Whitney uses to assess significance for large sample sizes, the significance 
        # of the raw statistic depends on the number of ties in the data, so the statistic itself
        # isn't exactly comparable across different words]
        pvals[word]=pval

    return pvals
    
# calculate mann-whitney for each word in vocabulary and present the top 10 terms for each group
def mann_whitney_analysis(one_tokens, two_tokens):
    
    pvals=mann_whitney(one_tokens, two_tokens)
    
    # Mann-Whitney tells us the significance of a term's difference in two groups, but we also 
    # need the directionality of that difference (whether it's used more by group A or group B).
    
    # Let's use our difference-in-proportions function above to check the directionality.  
    # [Note we could also measure directionality by checking whether the Mann-Whitney statistic
    # is greater or less than the mean=len(one_chunks)*len(two_chunks)*0.5.]

    differences=count_differences(one_tokens, two_tokens)
    
    one_terms={k : pvals[k] for k in pvals if differences[k] <= 0}
    two_terms={k : pvals[k] for k in pvals if differences[k] > 0}
    
    sorted_pvals = sorted(one_terms.items(), key=operator.itemgetter(1))
    print("More Republican:\n")
    for k,v in sorted_pvals[:10]:
        print("%s\t%.15f" % (k,v))

    print("\nMore Democrat:\n")
    sorted_pvals = sorted(two_terms.items(), key=operator.itemgetter(1))
    for k,v in sorted_pvals[:10]:
        print("%s\t%.15f" % (k,v))



In [99]:
mann_whitney_analysis(dem_tokens, repub_tokens)

More Republican:

growth	0.000000000001727
i	0.000000000003221
important	0.000000006857649
economy	0.000000011767342
sensenbrenner	0.000000016397606
may	0.000000031020346
gentleman	0.000000037460807
thank	0.000000058448892
consume	0.000000296135922
forward	0.000000309308704

More Democrat:

republican	0.000000000000000
cuts	0.000000000000000
republicans	0.000000000000000
majority	0.000000000000000
cut	0.000000000000001
billion	0.000000000000008
--	0.000000000000044
opposition	0.000000000000079
$	0.000000000000388
fails	0.000000000000422
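
The bracketed note in `mann_whitney_analysis` mentions that directionality could also be read off the Mann-Whitney statistic itself. A minimal sketch of that idea follows, assuming (as the SciPy documentation describes for recent versions) that the statistic returned by `mannwhitneyu` is the U value for the first sample. The helper name `direction_from_u` and the toy chunk counts are hypothetical.

```python
from scipy.stats import mannwhitneyu

def direction_from_u(a, b):
    """Say which sample tends to have larger values, based on U (hypothetical helper)."""
    statistic, pval = mannwhitneyu(a, b, alternative="two-sided")
    # under the null hypothesis, U has mean len(a) * len(b) / 2
    mean_u = len(a) * len(b) * 0.5
    # assumption: `statistic` is U for the first argument, so values above
    # the mean indicate that `a` tends to be larger than `b`
    if statistic > mean_u:
        return "one", pval
    if statistic < mean_u:
        return "two", pval
    return "neither", pval

# toy per-chunk counts for a single word (made up); note the two samples
# may have different lengths
a = [3, 4, 5, 6, 4]
b = [0, 1, 0, 2, 1, 0]

print(direction_from_u(a, b)[0])
print(direction_from_u(b, a)[0])
```

Unlike the difference in pooled proportions, this direction is based on ranks of chunk counts, so the two approaches can disagree for a word whose mentions are concentrated in a few chunks.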
