[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp24/blob/main/2.compare/ChiSquare_Mann-Whitney_Log-odds.ipynb)

This notebook examines the words that distinguish the [2024 Democrat party platform](https://www.presidency.ucsb.edu/documents/2024-democratic-party-platform) from the [2024 Republican party platform](https://www.presidency.ucsb.edu/documents/2024-republican-party-platform) (both sourced from the American Presidency Project at UCSB), using the Chi-Square test, Mann-Whitney test and log-odds ratio.

In [None]:
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp24/main/data/2024_democrat_party_platform.txt
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp24/main/data/2024_republican_party_platform.txt

In [None]:
import sys
import json
import nltk
import math
import operator
from collections import Counter
from scipy.stats import mannwhitneyu

In [None]:
def read(filename):
    with open(filename, encoding="utf-8") as file:
        # lowercase text
        return file.read().lower()

In [None]:
democratText=read("2024_democrat_party_platform.txt")

In [None]:
republicanText=read("2024_republican_party_platform.txt")

Explore your assumptions about which words will most distinguish the Democrat and Republican platforms.  Before looking at the results of the tests, which words do you expect to be distinctive of each?  (If you're not familiar with either, scan the platforms linked above.)

In [None]:
def tokenize(data):
    return nltk.word_tokenize(data)

In [None]:
def get_counts(tokens):
    counts=Counter()
    for token in tokens:
        counts[token]+=1
    return counts

## $\chi^2$ test

The $\chi^2$ test, as used to compare texts, measures how statistically significant the distribution of counts in a 2x2 contingency table is; here the table holds, for each platform, the count of a given word and the count of all other words.  Use the following function to analyze the difference between the platforms.  How do the most distinct terms comport with your assumptions?

In [None]:
def chi_square(one_counts, two_counts):

    one_sum=0.
    two_sum=0.
    vocab={}
    for word in one_counts:
        one_sum+=one_counts[word]
        vocab[word]=1
    for word in two_counts:
        vocab[word]=1
        two_sum+=two_counts[word]

    N=one_sum+two_sum
    vals={}
    
    for word in vocab:
        O11=one_counts[word]
        O12=two_counts[word]
        O21=one_sum-one_counts[word]
        O22=two_sum-two_counts[word]
        
        # We'll use the simpler form given in Manning and Schuetze (1999) 
        # for 2x2 contingency tables: 
        # https://nlp.stanford.edu/fsnlp/promo/colloc.pdf, equation 5.7
        
        vals[word]=(N*(O11*O22 - O12*O21)**2)/((O11 + O12)*(O11+O21)*(O12+O22)*(O21+O22))
        
    sorted_chi = sorted(vals.items(), key=operator.itemgetter(1), reverse=True)
    one=[]
    two=[]
    for k,v in sorted_chi:
        if one_counts[k]/one_sum > two_counts[k]/two_sum:
            one.append(k)
        else:
            two.append(k)
    
    print ("Democrat:\n")
    for k in one[:20]:
        print("%s\t%s" % (k,vals[k]))

    print ("\n\nRepublican:\n")
    for k in two[:20]:
        print("%s\t%s" % (k,vals[k]))

In [None]:
democrat_tokens=tokenize(democratText)
democrat_counts=get_counts(democrat_tokens)

In [None]:
republican_tokens=tokenize(republicanText)
republican_counts=get_counts(republican_tokens)

In [None]:
chi_square(democrat_counts, republican_counts)

Are these results surprising? Examine specific words to check their frequency in both datasets.

In [None]:
print("Totals: R: %s, D: %s" % (len(republican_tokens), len(democrat_tokens)))

In [None]:
word="climate"
print("%s -- R: %s, D: %s" % (word, republican_counts[word], democrat_counts[word]))
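
To see how a single word's score arises, the sketch below fills in the 2x2 table for one word and checks the shortcut formula used in `chi_square` against the textbook sum of (observed - expected)^2 / expected. The counts are hypothetical and chosen only for illustration (they are not taken from the platforms), and the helper names `chi_square_shortcut` and `chi_square_full` are assumptions introduced here.

```python
def chi_square_shortcut(o11, o12, o21, o22):
    """Shortcut form for a 2x2 table (the same expression used in chi_square above)."""
    n = o11 + o12 + o21 + o22
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator


def chi_square_full(o11, o12, o21, o22):
    """Textbook form: sum over cells of (observed - expected)^2 / expected."""
    table = [[o11, o12], [o21, o22]]
    n = o11 + o12 + o21 + o22
    row_totals = [o11 + o12, o21 + o22]
    col_totals = [o11 + o21, o12 + o22]
    total = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / n
            total += (table[r][c] - expected) ** 2 / expected
    return total


# Hypothetical counts: the word occurs 30 times in 10,000 tokens of corpus one
# and 5 times in 8,000 tokens of corpus two.
word_in_one, total_one = 30, 10000
word_in_two, total_two = 5, 8000

o11 = word_in_one              # word, corpus one
o12 = word_in_two              # word, corpus two
o21 = total_one - word_in_one  # all other words, corpus one
o22 = total_two - word_in_two  # all other words, corpus two

shortcut = chi_square_shortcut(o11, o12, o21, o22)
full = chi_square_full(o11, o12, o21, o22)
print("shortcut: %.4f\tfull: %.4f" % (shortcut, full))
```

The two values should agree up to floating-point error, which is why the notebook can rank words with the shorter expression.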

## Mann-Whitney

We saw earlier that $\chi^2$ is not a perfect estimator, since it doesn't account for the "burstiness" of language -- if we see the word "Dracula" in a text, we're probably going to see it again in that same text. Occurrences of words are not independent random events; they are tightly coupled with each other. If we're trying to understand the robust differences between two corpora, we might prefer to prioritize words that show up more frequently everywhere in corpus A (but not in corpus B) over those that show up very frequently only within a narrow slice of A (such as one text in a genre, one chapter in a book, or one speaker when measuring the differences between political parties). Use the following function to run the Mann-Whitney test, which accounts for this phenomenon by comparing a word's counts across fixed-size chunks of each corpus, while finding distinctive terms.

In [None]:
def count_differences(one_tokens, two_tokens):
    one_N=len(one_tokens)
    two_N=len(two_tokens)
    
    one_counts=Counter()
    two_counts=Counter()
    
    vocab={}
    for token in one_tokens:
        one_counts[token]+=1
        vocab[token]=1
        
    for token in two_tokens:
        two_counts[token]+=1    
        vocab[token]=1
        
    differences={}
    for word in vocab:
        freq1=one_counts[word]/one_N
        freq2=two_counts[word]/two_N
        
        diff=freq1-freq2
        differences[word]=diff
        
    return differences

# convert a sequence of tokens into word counts for each consecutive chunkLength-token window
# (the final chunk may be shorter than chunkLength)
def get_chunk_counts(tokens, chunkLength):
    chunks=[]
    for i in range(0, len(tokens), chunkLength):
        chunks.append(Counter(tokens[i:i+chunkLength]))
    return chunks

# calculate mann-whitney test for each word in vocabulary
def mann_whitney(one_tokens, two_tokens):

    chunkLength=500
    one_chunks=get_chunk_counts(one_tokens, chunkLength)
    two_chunks=get_chunk_counts(two_tokens, chunkLength)
    
    # vocab is the union of terms in both sets
    vocab={}
    
    for chunk in one_chunks:
        for word in chunk:
            vocab[word]=1
    for chunk in two_chunks:
        for word in chunk:
            vocab[word]=1
    
    pvals={}
    
    for word in vocab:
        
        a=[]
        b=[]
        
        # Note a and b can be different lengths (i.e., different sample sizes)
        # 
        # See Mann and Whitney (1947), "On a Test of Whether one of Two Random 
        # Variables is Stochastically Larger than the Other"
        # https://projecteuclid.org/download/pdf_1/euclid.aoms/1177730491
        
        # (This is part of their innovation over the case of equal sample sizes in Wilcoxon 1945)
        
        for chunk in one_chunks:
            a.append(chunk[word])
        for chunk in two_chunks:
            b.append(chunk[word])

        statistic,pval=mannwhitneyu(a,b, alternative="two-sided")
        
        # We'll use the p-value as our quantity of interest.  [Note that in the normal approximation
        # that Mann-Whitney uses to assess significance for large sample sizes, the significance 
        # of the raw statistic depends on the number of ties in the data, so the statistic itself
        # isn't exactly comparable across different words]
        pvals[word]=pval

    return pvals
    
# calculate mann-whitney for each word in vocabulary and present the top terms for each group
def mann_whitney_analysis(one_tokens, two_tokens):
    
    pvals=mann_whitney(one_tokens, two_tokens)
    
    # Mann-Whitney tells us the significance of a term's difference in two groups, but we also 
    # need the directionality of that difference (whether it's used more by group A or group B).
    
    # Let's use our difference-in-proportions function above to check the directionality.  
    # [Note we could also measure directionality by checking whether the Mann-Whitney statistic
    # is greater or less than the mean=len(one_chunks)*len(two_chunks)*0.5.]

    differences=count_differences(one_tokens, two_tokens)
    
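    # differences[k] = (relative frequency in one) - (relative frequency in two), so a
    # non-positive difference means the term is relatively more frequent in two_tokens
    # (the Republican platform in the call below)
    more_two_terms={k : pvals[k] for k in pvals if differences[k] <= 0}
    more_one_terms={k : pvals[k] for k in pvals if differences[k] > 0}
    
    sorted_pvals = sorted(more_two_terms.items(), key=operator.itemgetter(1))
    print("More Republican:\n")
    for k,v in sorted_pvals[:20]:
        print("%s\t%.15f" % (k,v))

    print("\nMore Democrat:\n")
    sorted_pvals = sorted(more_one_terms.items(), key=operator.itemgetter(1))
    for k,v in sorted_pvals[:20]:
        print("%s\t%.15f" % (k,v))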

In [None]:
mann_whitney_analysis(democrat_tokens, republican_tokens)

How are the differences identified by Mann-Whitney similar to, and different from, those identified by $\chi^2$?  What conclusions would you draw from the differences between these platforms?
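
One way to build intuition for that comparison is a toy case in which a word has the same total count in a corpus but is distributed differently across chunks. The sketch below uses made-up per-chunk counts (not drawn from the platforms): in the "bursty" corpus all 20 occurrences fall in a single chunk, while in the "even" corpus each of the 10 chunks contains the word twice. Because the totals are identical, a test on pooled counts such as $\chi^2$ scores both cases the same, whereas Mann-Whitney on chunk counts does not.

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-chunk counts of one word (10 chunks per corpus)
bursty_chunks = [20, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # all occurrences in one chunk
even_chunks = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]     # spread across every chunk
other_corpus_chunks = [0] * 10                   # the word never appears

# Same pooled count in both cases, so a pooled-count test cannot tell them apart
assert sum(bursty_chunks) == sum(even_chunks)

_, p_bursty = mannwhitneyu(bursty_chunks, other_corpus_chunks, alternative="two-sided")
_, p_even = mannwhitneyu(even_chunks, other_corpus_chunks, alternative="two-sided")

print("bursty p-value: %.6f" % p_bursty)
print("even p-value:   %.6f" % p_even)
```

The evenly spread word should receive a much smaller p-value than the bursty one, which is the behavior motivating the chunk-based test above.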

## Log-odds ratio

The log-odds ratio with an informative (and uninformative) Dirichlet prior (described in [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf)) is a very common method for finding distinctive terms in two datasets (see [Jurafsky et al. 2014](https://firstmonday.org/ojs/index.php/fm/article/view/4944/3863) for an example article that uses it to make an empirical argument).  This method for finding distinguishing words combines a number of desirable properties:

* it specifies an intuitive metric (the log-odds) for the ratio of two probabilities
* it can incorporate prior information in the form of pseudocounts, which can either act as a smoothing factor (in the uninformative case) or incorporate real information about the expected frequency of words overall.
* it accounts for variability of a frequency estimate by essentially converting the log-odds to a z-score.

For the log-odds ratio with an uninformative Dirichlet prior, $\hat\zeta_w^{(i-j)}$ for word $w$ reflects the difference in usage between corpus $i$ and corpus $j$ and is given by the following equation:

$$
\hat\zeta_w^{(i-j)}= {\hat{d}_w^{(i-j)} \over \sqrt{\sigma^2\left(\hat{d}_w^{(i-j)}\right)}}
$$

Where: 

$$
\hat{d}_w^{(i-j)} = \log \left({y_w^i + \alpha_w} \over {n^i + \alpha_0 - y_w^i - \alpha_w} \right) -  \log \left({y_w^j + \alpha_w} \over {n^j + \alpha_0 - y_w^j - \alpha_w} \right)
$$

$$
\sigma^2\left(\hat{d}_w^{(i-j)}\right) \approx {1 \over {y_w^i + \alpha_w}} + {1 \over {y_w^j + \alpha_w} }
$$

And:

* $y_w^i = $ count of word $w$ in corpus $i$ (likewise for $j$)
* $\alpha_w$ = 0.01
* $V$ = size of vocabulary (number of distinct word types)
* $\alpha_0 = V * \alpha_w$
* $n^i = $ number of words in corpus $i$ (likewise for $j$)
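
As a worked illustration of these definitions, the sketch below evaluates $\hat\zeta_w^{(i-j)}$ for a single word. The counts, corpus sizes, and vocabulary size are hypothetical, and the helper name `log_odds_z` is an assumption introduced here. It compares two words with the same 3:1 usage ratio, one common and one rare, to show how dividing by the standard deviation discounts the rare word.

```python
import math


def log_odds_z(y_i, n_i, y_j, n_j, vocab_size, alpha_w=0.01):
    """z-scored log-odds ratio with an uninformative Dirichlet prior for one word."""
    alpha_0 = vocab_size * alpha_w
    # difference of smoothed log-odds between corpus i and corpus j
    d = (math.log((y_i + alpha_w) / (n_i + alpha_0 - y_i - alpha_w))
         - math.log((y_j + alpha_w) / (n_j + alpha_0 - y_j - alpha_w)))
    # approximate variance of that difference
    variance = 1.0 / (y_i + alpha_w) + 1.0 / (y_j + alpha_w)
    return d / math.sqrt(variance)


# Hypothetical setup: two corpora of 10,000 tokens each, 5,000 word types overall
n_i, n_j, V = 10000, 10000, 5000

z_common = log_odds_z(300, n_i, 100, n_j, V)  # common word, 3:1 ratio
z_rare = log_odds_z(3, n_i, 1, n_j, V)        # rare word, same 3:1 ratio
z_swapped = log_odds_z(100, n_j, 300, n_i, V) # swapping the corpora flips the sign

print("common word: %.3f" % z_common)
print("rare word:   %.3f" % z_rare)
print("swapped:     %.3f" % z_swapped)
```

Both words favor corpus $i$ by the same ratio, but the rare word's estimate rests on only four observations, so its score should be much closer to zero.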



In [None]:
def logodds_with_uninformative_prior(one_tokens, two_tokens, display=25):
    
    def get_counter_from_list(tokens):
        counter=Counter()
        for token in tokens:
            counter[token]+=1
        return counter


    oneCounter=get_counter_from_list(one_tokens)
    twoCounter=get_counter_from_list(two_tokens)
    
    vocab=dict(oneCounter) 
    vocab.update(dict(twoCounter))
    oneSum=sum(oneCounter.values())
    twoSum=sum(twoCounter.values())

    ranks={}
    alpha=0.01
    alphaV=len(vocab)*alpha
        
    for word in vocab:
        
        log_odds_ratio=math.log( (oneCounter[word] + alpha) / (oneSum+alphaV-oneCounter[word]-alpha) ) - math.log( (twoCounter[word] + alpha) / (twoSum+alphaV-twoCounter[word]-alpha) )
        variance=1./(oneCounter[word] + alpha) + 1./(twoCounter[word] + alpha)
        
        ranks[word]=log_odds_ratio/math.sqrt(variance)

    sorted_x = sorted(ranks.items(), key=operator.itemgetter(1), reverse=True)
    
    print("Most positive:")
    for k,v in sorted_x[:display]:
        print("%.3f\t%s" % (v,k))
    
    print("\nMost negative:")
    for k,v in reversed(sorted_x[-display:]):
        print("%.3f\t%s" % (v,k))

In [None]:
logodds_with_uninformative_prior(republican_tokens, democrat_tokens)

## Explore 

Explore these methods further by plugging your own texts into the process above. The texts should differ along some salient dimension that's of interest to you -- e.g., author, genre, topic, time period, etc.  ([Project Gutenberg](https://www.gutenberg.org) is one source for texts you could use here.)
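
If you use Project Gutenberg texts, each file typically wraps the book in license boilerplate that would otherwise show up as "distinctive" vocabulary. The following is a minimal sketch of one way to trim it, assuming the file contains the usual `*** START OF ... ***` and `*** END OF ...` marker lines (if a marker is missing, the text is left untrimmed on that side). The helper names and the regex-based `simple_tokenize` are hypothetical stand-ins introduced here; in this notebook you would normally pass the trimmed text to `tokenize` instead.

```python
import re

START_MARKER = re.compile(r"\*\*\*\s*START OF.*?\*\*\*", re.DOTALL)
END_MARKER = re.compile(r"\*\*\*\s*END OF")


def strip_gutenberg_boilerplate(text):
    """Keep only the text between the START and END marker lines, when present."""
    start = START_MARKER.search(text)
    begin = start.end() if start else 0
    end = END_MARKER.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish]


def simple_tokenize(text):
    """Hypothetical stand-in tokenizer: lowercase runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


# Tiny made-up sample in the shape of a Gutenberg file
sample = (
    "Header and license information\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Call me Ishmael.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "More license text\n"
)

tokens = simple_tokenize(strip_gutenberg_boilerplate(sample))
print(tokens)
```

Token lists built this way for two texts can then be passed to `get_counts` and `chi_square`, to `mann_whitney_analysis`, or to `logodds_with_uninformative_prior`; note that the printed "Democrat" and "Republican" labels in the first two functions are hard-coded and refer to the first and second argument respectively.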