[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/2.compare/Log_odds_ratio_TODO.ipynb)

# Log odds-ratio

The log odds ratio with an informative (and uninformative) Dirichlet prior (described in [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf)) is a common method for finding distinctive terms in two datasets (see [Jurafsky et al. 2014](https://firstmonday.org/ojs/index.php/fm/article/view/4944/3863) for an example article that uses it to make an empirical argument). This method for finding distinguishing words combines a number of desirable properties:

* it specifies an intuitive metric (the log-odds) for the ratio of two probabilities
* it can incorporate prior information in the form of pseudocounts, which can either act as a smoothing factor (in the uninformative case) or incorporate real information about the expected frequency of words overall.
* it accounts for variability of a frequency estimate by essentially converting the log-odds to a z-score.
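
As a rough illustration of the first two properties, the sketch below compares the log-odds of one word in two corpora, with and without pseudocounts. All counts, the vocabulary size `V`, and the helper name `log_odds` are made up for the example; they are assumptions, not values from the assignment data.

```python
import math

def log_odds(count: float, total: float) -> float:
    """Log-odds of a word: log(count / (total - count))."""
    return math.log(count / (total - count))

# Hypothetical counts: the word occurs 30 times in corpus i and never in corpus j.
n_i, n_j = 10_000, 10_000
y_i, y_j = 30, 0

# Assumed prior settings for the illustration.
alpha_w = 0.01          # pseudocount added to every word
V = 2_000               # assumed vocabulary size
alpha_0 = V * alpha_w   # total pseudocount mass

# Without smoothing, a zero count leads to log(0), which Python rejects.
try:
    raw = log_odds(y_i, n_i) - log_odds(y_j, n_j)
except ValueError:
    raw = None

# With pseudocounts, the difference in log-odds is finite.
smoothed = (log_odds(y_i + alpha_w, n_i + alpha_0)
            - log_odds(y_j + alpha_w, n_j + alpha_0))

print("raw:", raw)
print("smoothed:", round(smoothed, 3))
```

A positive difference means the word is relatively more frequent in corpus $i$; the pseudocount is what keeps the value defined when a word is absent from one corpus.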

In this homework you will implement this ratio for two datasets of your choice to characterize the words that distinguish each one from the other.

## Part 1

Your first job is to find two datasets with some interesting opposition -- e.g., news articles from CNN vs. FoxNews, books written by Charles Dickens vs. James Joyce, screenplays of dramas vs. comedies. Be creative -- this should be driven by what interests you and should reflect your own originality. **This dataset cannot come from Kaggle**. Feel free to use web scraping (see [here](https://github.com/CU-ITSS/Web-Data-Scraping-S2023) for a great tutorial) or manually copying/pasting text. Aim for more than 10,000 tokens for each dataset.
   
Save those datasets in two files: "class1_dataset.txt" and "class2_dataset.txt"

**Describe each of those datasets and their source in 100-200 words.**

Type your response here:

Dataset 1: Barack Obama Speeches (class1_dataset.txt)
This dataset contains transcripts of two landmark speeches delivered by former U.S. President Barack Obama: his First Inaugural Address in 2009 and his Second Inaugural Address in 2013. Both speeches were collected from the official White House archives and the American Presidency Project, ensuring accurate and reliable sources. Obama's style in these addresses is measured and aspirational, often weaving together historical references, themes of unity, and calls for collective responsibility. His rhetoric emphasizes shared values like democracy, fairness, and resilience in the face of challenges. The speeches are rich in structured argumentation and make frequent use of inclusive language such as “we” and “our,” which reinforces his focus on community and collaboration. Altogether, the dataset contains more than 10,000 tokens, offering ample material for linguistic and rhetorical analysis. It provides a clear example of polished, narrative-driven political oratory that seeks to inspire and unify a diverse audience.


Dataset 2: Donald Trump Speeches (class2_dataset.txt)
This dataset includes transcripts of Donald Trump's Second Inaugural Address in 2025 and his First Inaugural Address in 2017, collected from official government archives and widely cited news outlets. Trump's rhetoric provides a striking contrast to Obama's, making this dataset especially valuable for comparative analysis. His style is direct, populist, and often repetitive, with a focus on nationalism, economic protectionism, and critiques of political opponents or institutions. Trump's language tends to be more conversational and emotionally charged, frequently appealing to patriotism through phrases such as “America First.” Compared to Obama's structured cadence, Trump often employs simpler sentence structures and slogans that are easy to remember and repeat. The two speeches combined easily surpass 10,000 tokens, creating a robust sample for analysis. This dataset captures Trump's distinct approach to political communication, highlighting how his rhetoric resonates with themes of strength, loyalty, and disruption of the political status quo.

## Part 2

Tokenize those texts by filling out the `read_and_tokenize` function below (your choice of tokenizer). The input is a filename and the output should be a list of tokens.

In [16]:
#Upload, auto-detect filenames, download NLTK resources, and tokenize
from google.colab import files
uploaded = files.upload()  #select both class1 and class2 files when prompted

import os, re, nltk, string
from nltk.tokenize import word_tokenize

for pkg in ("punkt", "punkt_tab"): #tokenizers
    try:
        nltk.data.find(f"tokenizers/{pkg}")
    except LookupError:
        nltk.download(pkg)

print("Files in working directory:", os.listdir()) #check if the files are actually present

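def resolve_path(candidates: list[str]) -> str:
    """Return the first candidate filename that exists in the working directory."""
    for name in candidates:
        if os.path.exists(name):
            return name
    raise FileNotFoundError(f"None of these files were found: {candidates}")

class1_path = resolve_path(["class1dataset.txt", "class1_dataset.txt"]) #path names
class2_path = resolve_path(["class2dataset.txt", "class2_dataset.txt"])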

def read_and_tokenize(filename: str) -> list[str]:
    """Read the file and output a list of strings (tokens)."""
    with open(filename, "r", encoding="utf-8") as f:
        text = f.read()

    try:
        tokens = word_tokenize(text.lower()) #Try NLTK tokenizer first else try a simple regex
    except LookupError:
        tokens = re.findall(r"[A-Za-z0-9']+", text.lower())

    punct = set(string.punctuation) #Remove bare punctuation tokens
    tokens = [t for t in tokens if t not in punct]
    return tokens

class1_tokens = read_and_tokenize(class1_path) #Tokenize both datasets
class2_tokens = read_and_tokenize(class2_path)

print("Using files:", class1_path, "and", class2_path)
print("Class 1 token count:", len(class1_tokens))
print("Class 2 token count:", len(class2_tokens))
print("Sample from Class 1:", class1_tokens[:20])


Saving class2dataset.txt to class2dataset (3).txt
Saving class1dataset.txt to class1dataset (3).txt
Files in working directory: ['.config', 'class1dataset (1).txt', 'class1dataset (3).txt', 'class2dataset.txt', 'class2_dataset.txt', 'class2dataset (1).txt', 'class1dataset.txt', 'class2dataset (3).txt', 'class2dataset (2).txt', 'class1_dataset.txt', 'class1dataset (2).txt', 'sample_data']
Using files: class1dataset.txt and class2dataset.txt
Class 1 token count: 18340
Class 2 token count: 16445
Sample from Class 1: ['\ufeffmy', 'fellow', 'citizens', 'i', 'stand', 'here', 'today', 'humbled', 'by', 'the', 'task', 'before', 'us', 'grateful', 'for', 'the', 'trust', 'you', "'ve", 'bestowed']
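
The sample above starts with the token `'\ufeffmy'`: the file begins with a byte-order mark (BOM) that survives reading with `encoding="utf-8"`. In addition, `string.punctuation` only covers ASCII punctuation, so a curly apostrophe (`’`) is not filtered out. One possible cleanup is sketched below; `normalize_text` is a hypothetical helper, not part of the submitted code, and opening the file with `encoding="utf-8-sig"` would be an alternative way to drop the BOM.

```python
import string

def normalize_text(text: str) -> str:
    """Hypothetical helper: strip a leading BOM and straighten curly quotes."""
    text = text.lstrip("\ufeff")
    # Map typographic quotes to their ASCII equivalents.
    quote_map = str.maketrans({
        "\u2018": "'", "\u2019": "'",
        "\u201c": '"', "\u201d": '"',
    })
    return text.translate(quote_map)

# Made-up sample resembling the start of the class 1 file.
sample = "\ufeffMy fellow citizens, you\u2019ve bestowed"
cleaned = normalize_text(sample)
print(repr(cleaned))

# The curly apostrophe is not in string.punctuation, the straight one is.
print("\u2019" in string.punctuation, "'" in string.punctuation)
```

Applying a step like this before tokenizing would change the token counts and the rankings reported in this notebook, so the recorded outputs reflect the text without it.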


## Part 3

Now let's find the words that characterize each of those sources (with respect to the other). Implement the log-odds ratio with an uninformative Dirichlet prior. This value, $\widehat\zeta_w^{(i-j)}$ for word $w$ reflecting the difference in usage between corpus $i$ and corpus $j$, is given by the following equation:

$$
\widehat{\zeta}_w^{(i-j)}= {\widehat{d}_w^{(i-j)} \over \sqrt{\sigma^2\left(\widehat{d}_w^{(i-j)}\right)}}
$$

Where:

$$
\widehat{d}_w^{(i-j)} = \log \left({y_w^i + \alpha_w} \over {n^i + \alpha_0 - y_w^i - \alpha_w} \right) - \log \left({y_w^j + \alpha_w} \over {n^j + \alpha_0 - y_w^j - \alpha_w} \right)
$$

$$
\sigma^2\left(\widehat{d}_w^{(i-j)}\right) \approx {1 \over {y_w^i + \alpha_w}} + {1 \over {y_w^j + \alpha_w} }
$$

And:

* $y_w^i = $ count of word $w$ in corpus $i$ (likewise for $j$)
* $\alpha_w$ = 0.01
* $V$ = size of vocabulary (number of distinct word types)
* $\alpha_0 = V \cdot \alpha_w$
* $n^i = $ number of words in corpus $i$ (likewise for $j$)

In this example, the two corpora are your class1 dataset (e.g., $i$ = your class1) and your class2 dataset (e.g., $j$ = class2). Using this metric, print out the 25 words most strongly aligned with class1, and 25 words most strongly aligned with class2.  Again, consult [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf) for more detail.
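
To see how the variance term changes the ranking, here is a minimal sketch that evaluates the formulas above for a single word. The function name `zeta`, the corpus sizes, and the vocabulary size are hypothetical values chosen for illustration. Both example words are about twice as frequent in corpus $i$ as in corpus $j$, but one is rare and the other is common.

```python
import math

def zeta(y_i: int, y_j: int, n_i: int, n_j: int, V: int,
         alpha_w: float = 0.01) -> float:
    """z-scored log-odds ratio for one word with an uninformative prior."""
    alpha_0 = V * alpha_w
    # Difference in smoothed log-odds between corpus i and corpus j.
    d_hat = (math.log((y_i + alpha_w) / (n_i + alpha_0 - y_i - alpha_w))
             - math.log((y_j + alpha_w) / (n_j + alpha_0 - y_j - alpha_w)))
    # Approximate variance of that difference.
    variance = 1.0 / (y_i + alpha_w) + 1.0 / (y_j + alpha_w)
    return d_hat / math.sqrt(variance)

# Assumed corpus sizes and vocabulary size.
n_i = n_j = 10_000
V = 2_000

z_rare = zeta(2, 1, n_i, n_j, V)          # 2 vs. 1 occurrences
z_frequent = zeta(200, 100, n_i, n_j, V)  # 200 vs. 100 occurrences

print(f"rare word:     {z_rare:.2f}")
print(f"frequent word: {z_frequent:.2f}")
```

The frequent word receives a much larger score than the rare one even though the frequency ratios are similar, because its counts give a smaller variance estimate. Swapping the two corpora flips the sign of the score.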

In [13]:
from collections import Counter
import math

def logodds_with_uninformative_prior(tokens_i: list[str], tokens_j: list[str], display=25):
    """
    Print the words most characteristic of class1 (tokens_i) vs class2 (tokens_j)
    using log-odds with an uninformative Dirichlet prior.
    """

    ci = Counter(tokens_i) #count how many words
    cj = Counter(tokens_j)

    vocab = set(ci.keys()) | set(cj.keys()) #Vocabulary = all unique words that appear in either corpus (using a set would give us unique words)
    V = len(vocab)

    alpha = 0.01
    alpha0 = V * alpha

    ni = sum(ci.values()) #Total token counts for each corpus
    nj = sum(cj.values())

    results = []  #stores (word, z_score)

    for w in vocab:
        yi = ci.get(w, 0)  # count of w in class1
        yj = cj.get(w, 0)  # count of w in class2

        di = math.log((yi + alpha) / (ni + alpha0 - yi - alpha)) #formula as per the text in the assignment and the paper
        dj = math.log((yj + alpha) / (nj + alpha0 - yj - alpha))
        d_hat = di - dj

        sigma2 = (1.0 / (yi + alpha)) + (1.0 / (yj + alpha))

        z = d_hat / math.sqrt(sigma2)

        results.append((w, z))

    results_sorted = sorted(results, key=lambda x: x[1], reverse=True) #sort in descending order of z-score

    top_i = results_sorted[:display]  # Top N for class1 (most positive z)

    top_j = sorted(results, key=lambda x: x[1])[:display] #Top N for class2 (most negative z first); the sign is kept, so negative means class2

    print(f"Top {display} words aligned with class1:")
    for w, z in top_i:
        print(f"{w:>15s}: {z: .2f}")

    print("\n" + f"Top {display} words aligned with class2:")
    for w, z in top_j:
        print(f"{w:>15s}: {z: .2f}")

_ = logodds_with_uninformative_prior(class1_tokens, class2_tokens, display=25) #using my class1 and class 2 tokens here

Top 25 words aligned with class1:
           that:  8.32
              ’:  7.63
             it:  5.75
             us:  5.65
             or:  4.79
              t:  4.77
             do:  4.23
           need:  4.10
            who:  3.98
             ve:  3.91
            can:  3.88
           when:  3.85
            let:  3.83
            how:  3.74
             so:  3.62
           work:  3.62
           want:  3.62
           what:  3.59
        because:  3.56
             re:  3.49
        economy:  3.45
          those:  3.36
           make:  3.27
              a:  3.19
         change:  3.18

Top 25 words aligned with class2:
           will: -6.72
          thank: -5.57
           very: -5.22
          never: -4.37
             be: -4.23
            was: -4.15
            you: -4.10
         united: -4.09
             my: -4.01
         states: -3.95
       american: -3.91
          again: -3.73
 administration: -3.59
         nation: -3.53
      president: -3.41
           