This notebook explores methods for comparing two different textual datasets to identify the terms that are distinct to each one:

* Difference of proportions (described in [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf) section 3.2.2)
* Mann-Whitney rank-sums test (described in [Kilgarriff 2001, Comparing Corpora](https://www.sketchengine.eu/wp-content/uploads/comparing_corpora_2001.pdf), section 2.3)

### 1. Setup: Importing Necessary Libraries

First, we'll import the libraries needed for our analysis. 
- `sys` and `operator` are standard-library modules. `operator.itemgetter` builds the key function we use to sort (word, score) pairs by score.
- `Counter` from the `collections` module provides a specialized dictionary subclass for counting hashable objects, which is perfect for tallying word frequencies.
- `mannwhitneyu` from `scipy.stats` is the specific implementation of the Mann-Whitney U rank test we will use later.

In [None]:
# Import standard libraries for system operations and efficient functions.
import sys, operator
# Import the Counter class for easy frequency counting.
from collections import Counter
# Import the Mann-Whitney U test function from the SciPy statistics library.
from scipy.stats import mannwhitneyu

### 2. Loading the Data

Next, we load our two text corpora. The data consists of tokenized text from political speeches, separated into files for Republicans (`repub.convote.txt`) and Democrats (`dem.convote.txt`). We open each file, read its entire content into a single string, and then split that string by spaces to create a list of tokens (words).
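
The choice of separator matters if a file contains newlines or runs of spaces. As a small illustration on a made-up string (not taken from the convote files), this is how `split(" ")` and `split()` differ:

```python
# Hypothetical sample text, not taken from the convote files.
sample = "mr. speaker ,  i rise\ntoday"

# Splitting on a single space keeps empty strings and embedded newlines.
by_space = sample.split(" ")

# Splitting with no argument splits on any run of whitespace.
by_whitespace = sample.split()

print(by_space)
print(by_whitespace)
```

If the data files are single-space delimited with no line breaks, the two calls give the same tokens; otherwise `split(" ")` can produce empty or newline-containing tokens that end up in the vocabulary.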

In [None]:
# The convote data is already tokenized, so we just split on single spaces.
# Read the Republican speeches file and split it into a list of tokens.
with open("../data/repub.convote.txt", encoding="utf-8") as f:
    repub_tokens=f.read().split(" ")
# Read the Democrat speeches file and split it into a list of tokens.
with open("../data/dem.convote.txt", encoding="utf-8") as f:
    dem_tokens=f.read().split(" ")

### Method 1: Difference of Proportions

Our first approach is a straightforward comparison of word frequencies. For each word in the combined vocabulary, we calculate its relative frequency (proportion) in each corpus and then find the difference between these two proportions. A large positive difference means the word is more characteristic of the first corpus, while a large negative difference means it's more characteristic of the second.
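
As a compact sketch of the measure on toy data: the helper name `proportion_differences` and the two tiny corpora are illustrative assumptions, not part of the convote data or the assignment.

```python
from collections import Counter

def proportion_differences(one_tokens, two_tokens):
    """Return {word: freq_in_one - freq_in_two} over the combined vocabulary."""
    one_counts, two_counts = Counter(one_tokens), Counter(two_tokens)
    one_n, two_n = len(one_tokens), len(two_tokens)
    vocab = set(one_counts) | set(two_counts)
    return {w: one_counts[w] / one_n - two_counts[w] / two_n for w in vocab}

# Hypothetical toy corpora, for illustration only.
corpus_a = "tax cuts help growth and growth helps jobs".split()
corpus_b = "budget cuts hurt workers and hurt families".split()

toy_diffs = proportion_differences(corpus_a, corpus_b)

# Print words from most characteristic of corpus_a to most characteristic of corpus_b.
for word, diff in sorted(toy_diffs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}\t{diff:+.3f}")
```

Here "growth" gets the largest positive score (2 of 8 tokens in the first corpus, none in the second) and "hurt" the most negative (2 of 7 tokens in the second corpus, none in the first).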

Q1: First, calculate the simple "difference of proportions" measure from Monroe et al.'s "Fighting Words", section 3.2.2. What are the top ten terms in this measurement that are most Republican and most Democrat?

In [None]:
# This function calculates the difference in relative frequency for every word in two lists of tokens.
def count_differences(one_tokens, two_tokens):
    # Get the total number of tokens in the first list.
    one_N=len(one_tokens)
    # Get the total number of tokens in the second list.
    two_N=len(two_tokens)
    
    # Create a Counter object to store word frequencies for the first corpus.
    one_counts=Counter()
    # Create a Counter object to store word frequencies for the second corpus.
    two_counts=Counter()
    
    # Create a dictionary to hold the combined vocabulary from both corpora.
    vocab={}
    # Iterate through each token in the first list.
    for token in one_tokens:
        # Increment the count for the current token in the first corpus.
        one_counts[token]+=1
        # Add the token to our vocabulary set (value is arbitrary, just tracking keys).
        vocab[token]=1
        
    # Iterate through each token in the second list.
    for token in two_tokens:
        # Increment the count for the current token in the second corpus.
        two_counts[token]+=1    
        # Add the token to our vocabulary set.
        vocab[token]=1
        
    # Create a dictionary to store the calculated frequency differences.
    differences={}
    # Iterate through every unique word in the combined vocabulary.
    for word in vocab:
        # Calculate the relative frequency of the word in the first corpus.
        freq1=one_counts[word]/one_N
        # Calculate the relative frequency of the word in the second corpus.
        freq2=two_counts[word]/two_N
        
        # Compute the difference between the two frequencies.
        diff=freq1-freq2
        # Store the difference in our dictionary.
        differences[word]=diff
        
    # Return the dictionary of word-frequency differences.
    return differences

# This function uses the count_differences result to find and print the top 10 distinctive terms for each corpus.
def difference_of_proportions(one_tokens, two_tokens):

    # Call the previously defined function to get the frequency differences.
    differences=count_differences(one_tokens, two_tokens)
    
    # Sort the dictionary items by their value (the difference) in ascending order.
    # Words with the most negative scores will appear first.
    sorted_differences = sorted(differences.items(), key=operator.itemgetter(1))
    # Since we passed (dem_tokens, repub_tokens), negative scores mean the word is more frequent in the Republican corpus.
    print("More Republican:")
    # Loop through the first 10 items in the sorted list (most negative scores).
    for k,v in sorted_differences[:10]:
        # Print the word and its score.
        print("%s\t%s" % (k,v))
    # Print a blank line, then the header for the second list.
    print("\nMore Democrat:")
    # Loop through the last 10 items of the sorted list in reverse order (most positive scores).
    for k,v in reversed(sorted_differences[-10:]):
        # Print the word and its score.
        print("%s\t%s" % (k,v))

Now, we execute the `difference_of_proportions` function. We pass `dem_tokens` as the first argument and `repub_tokens` as the second. Therefore, words with a high positive score are more distinctive to Democrats, and words with a high negative score are more distinctive to Republicans.

In [None]:
# Run the analysis with Democrat tokens as the first corpus and Republican tokens as the second.
difference_of_proportions(dem_tokens, repub_tokens)

### Method 2: Mann-Whitney U Test

As we can see from the results above, the difference of proportions method tends to highlight very common words (like 'i', 'we', 'and', 'of'). This is because even a small percentage difference for a high-frequency word results in a large absolute difference. This method also doesn't tell us if the difference is statistically significant. 
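
A little arithmetic makes the frequency bias concrete. The counts below are made up for illustration (they are not measured from the convote data): a common word with a modest relative change still outscores a rare word whose usage quadruples.

```python
# Hypothetical counts per 100,000 tokens in each corpus; not real measurements.
N = 100_000
counts = {
    # word: (count in corpus one, count in corpus two)
    "the": (5_500, 5_000),   # 10% relative increase
    "medicare": (20, 5),     # 4x relative increase
}

# For each word, compute the absolute difference of proportions and the ratio.
scores = {}
for word, (c1, c2) in counts.items():
    diff = c1 / N - c2 / N
    ratio = c1 / c2
    scores[word] = (diff, ratio)
    print(f"{word}\tdiff={diff:.5f}\tratio={ratio:.2f}")
```

Ranking by `diff` puts "the" well ahead of "medicare", even though the ratio shows that "medicare" differs far more between the two corpora in relative terms.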

To address this, we turn to a more robust statistical test: the **Mann-Whitney U test**. This test helps us avoid the pitfalls of word "burstiness", where a word might appear many times in one document without being representative of the entire corpus. Instead of comparing total counts, we divide each corpus into fixed-size chunks (the last chunk of each corpus may be shorter) and compare the *distribution* of per-chunk word counts. The test asks whether the counts in chunks from one corpus are systematically higher or lower than the counts in chunks from the other corpus.

Simply analyzing the difference in relative frequencies has a number of downsides:

1. As Monroe et al. (2009) point out (and as we can see here as well), it tends to emphasize high-frequency words (be sure you understand why).
2. We're not measuring whether a difference is statistically meaningful or just due to chance. The $\chi^2$ test is one method (described in Kilgarriff 2001 and, in the context of collocations, in Manning and Schuetze [here](https://nlp.stanford.edu/fsnlp/promo/colloc.pdf)) that addresses the desideratum of finding statistically significant terms, but it too shares the next downside.
3. Simply counting up the total number of mentions of a term doesn't account for the "burstiness" of language: if we see the word "Dracula" in a text, we're probably going to see it again in that same text. Occurrences of words are not independent random events; they are tightly coupled with each other.

If we're trying to understand the robust differences between two corpora, we might prefer to prioritize words that show up more frequently *everywhere* in corpus A (but not in corpus B) over those that show up only very frequently within a narrow slice of A (such as one text in a genre, one chapter in a book, or one speaker when measuring the differences between political parties).
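
To see how a rank-based comparison treats bursty words, here is a minimal sketch that computes the Mann-Whitney U statistic by hand on made-up per-chunk counts. The helper `u_statistic` and all of the counts are illustrative assumptions; this is not SciPy's implementation, and it computes no p-value.

```python
def u_statistic(a, b):
    """Mann-Whitney U for sample a: number of pairs with a_i > b_j, ties counted as 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical per-chunk counts of a single word.
# Both versions of corpus A contain 12 mentions in total.
bursty_a = [12, 0, 0, 0, 0, 0]   # every mention falls in one chunk
spread_a = [2, 2, 2, 2, 2, 2]    # mentions appear in every chunk
corpus_b = [0, 1, 0, 0, 1, 0]    # occasional mentions in corpus B

n_pairs = len(bursty_a) * len(corpus_b)
bursty_share = u_statistic(bursty_a, corpus_b) / n_pairs
spread_share = u_statistic(spread_a, corpus_b) / n_pairs
print(bursty_share, spread_share)
```

Dividing U by the number of chunk pairs gives the share of pairs in which the chunk from A has the higher count (ties count half), so 0.5 means no systematic difference. The bursty word scores about 0.44 and the evenly spread word scores 1.0, even though both have the same total count in A.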

Q2 (check-plus): One measure that does account for this burstiness is the adaptation by corpus linguistics of the non-parametric Mann-Whitney rank-sum test. The specific adaptation of this test for text is described in Kilgarriff 2001, section 2.3. Implement this test using a fixed chunk size of 500 and the [SciPy mannwhitneyu function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html); what are the top ten terms in this measurement that are most Republican and most Democrat?

In [None]:
# This function converts a flat list of tokens into a list of frequency counters, one for each chunk.
def get_chunk_counts(tokens, chunkLength):
    # Initialize an empty list to store the chunk counts.
    chunks=[]
    # Iterate through the tokens, stepping by the chunkLength.
    for i in range(0, len(tokens), chunkLength):
        # Create a new Counter for the current chunk.
        counts=Counter()
        # Iterate from 0 to chunkLength to populate the chunk's counter.
        for j in range(chunkLength):
            # Make sure we don't go beyond the end of the token list.
            if i+j < len(tokens):
                # Increment the count for the token at the current position.
                counts[tokens[i+j]]+=1
        # Add the populated counter for the current chunk to our list of chunks.
        chunks.append(counts)
    # Return the list of chunk-based frequency counters.
    return chunks

# This function calculates the Mann-Whitney U p-value for each word in the vocabulary.
def mann_whitney(one_tokens, two_tokens):

    # Define the size of each text chunk.
    chunkLength=500
    # Get the chunked counts for the first corpus.
    one_chunks=get_chunk_counts(one_tokens, chunkLength)
    # Get the chunked counts for the second corpus.
    two_chunks=get_chunk_counts(two_tokens, chunkLength)
    
    # Create a dictionary to hold the combined vocabulary from both sets of chunks.
    vocab={}
    
    # Populate the vocabulary from the first corpus.
    for chunk in one_chunks:
        for word in chunk:
            vocab[word]=1
    # Populate the vocabulary from the second corpus.
    for chunk in two_chunks:
        for word in chunk:
            vocab[word]=1
    
    # Initialize a dictionary to store the p-value for each word.
    pvals={}
    
    # Iterate through every word in the combined vocabulary.
    for word in vocab:
        
        # Initialize a list 'a' to hold the counts of the current word in each chunk of the first corpus.
        a=[]
        # Initialize a list 'b' to hold the counts of the current word in each chunk of the second corpus.
        b=[]
        
        # The Mann-Whitney U test can handle samples of different sizes (i.e., a different number of chunks).
        
        # Populate list 'a' with the frequency of the word in each chunk of the first corpus.
        for chunk in one_chunks:
            a.append(chunk[word])
        # Populate list 'b' with the frequency of the word in each chunk of the second corpus.
        for chunk in two_chunks:
            b.append(chunk[word])

        # Perform the Mann-Whitney U test on the two lists of counts.
        # The 'alternative="two-sided"' checks if the distributions are different, regardless of direction.
        statistic,pval=mannwhitneyu(a,b, alternative="two-sided")
        
        # The p-value indicates the statistical significance of the difference.
        # A smaller p-value means the observed difference is less likely to be due to random chance.
        pvals[word]=pval

    # Return the dictionary of words and their corresponding p-values.
    return pvals
    
# This function orchestrates the Mann-Whitney analysis and presents the results.
def mann_whitney_analysis(one_tokens, two_tokens):
    
    # Calculate the p-values for all words using the mann_whitney function.
    pvals=mann_whitney(one_tokens, two_tokens)
    
    # The p-value tells us *if* there's a significant difference, but not the *direction* of that difference.
    # We reuse our simple 'count_differences' function to determine which corpus uses the word more overall.
    differences=count_differences(one_tokens, two_tokens)
    
    # Collect words used at least as often by the second corpus (Republicans in our case).
    # A non-positive difference (<= 0) means the word is more or equally frequent in corpus two.
    two_terms={k : pvals[k] for k in pvals if differences[k] <= 0}
    # Collect words used more by the first corpus (Democrats in our case).
    # A positive difference (> 0) means the word is more frequent in corpus one.
    one_terms={k : pvals[k] for k in pvals if differences[k] > 0}
    
    # Sort the Republican-leaning terms by their p-value in ascending order (most significant first).
    sorted_pvals = sorted(two_terms.items(), key=operator.itemgetter(1))
    print("More Republican:\n")
    # Print the top 10 most statistically significant Republican terms.
    for k,v in sorted_pvals[:10]:
        print("%s\t%.15f" % (k,v))

    print("\nMore Democrat:\n")
    # Sort the Democrat-leaning terms by their p-value in ascending order.
    sorted_pvals = sorted(one_terms.items(), key=operator.itemgetter(1))
    # Print the top 10 most statistically significant Democrat terms.
    for k,v in sorted_pvals[:10]:
        print("%s\t%.15f" % (k,v))



Finally, we run the complete Mann-Whitney U test analysis. The output will show the 10 words that are most statistically significant for each party, sorted by their p-value (a lower p-value indicates higher significance).

In [None]:
# Execute the Mann-Whitney analysis function with our token lists.
mann_whitney_analysis(dem_tokens, repub_tokens)

### Conclusion

Comparing the two methods, we see a marked difference in the results. The Mann-Whitney U test produces lists that seem more topically relevant (e.g., 'growth', 'economy' for Republicans; 'cuts', 'billion' for Democrats) and less dominated by common function words. This suggests that a method that looks at how a word is distributed across chunks, rather than only at its total count, can yield more interpretable results when comparing text corpora.