[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/2.compare/Log_odds_ratio_TODO.ipynb)

# Log odds-ratio

The log odds ratio with an informative (and uninformative) Dirichlet prior (described in [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf)) is a common method for finding distinctive terms in two datasets (see [Jurafsky et al. 2014](https://firstmonday.org/ojs/index.php/fm/article/view/4944/3863) for an example article that uses it to make an empirical argument). This method for finding distinguishing words combines a number of desirable properties:

* it specifies an intuitive metric (the log-odds) for the ratio of two probabilities
* it can incorporate prior information in the form of pseudocounts, which can either act as a smoothing factor (in the uninformative case) or incorporate real information about the expected frequency of words overall.
* it accounts for variability of a frequency estimate by essentially converting the log-odds to a z-score.
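
To make these properties concrete, the sketch below scores a single word under a few made-up scenarios. It is only an illustration: the counts, corpus sizes, vocabulary size, and the helper names `log_odds` and `z_score` are all hypothetical, and the prior is assumed to be the same small pseudocount for every word.

```python
import math

def log_odds(count: float, total: float) -> float:
    """Log-odds log(p / (1 - p)) for p = count / total."""
    return math.log(count / (total - count))

def z_score(y_i, n_i, y_j, n_j, alpha_w=0.01, vocab_size=1000):
    """Smoothed log-odds difference divided by its approximate standard deviation.

    alpha_w is a pseudocount added to every word; vocab_size is a made-up
    vocabulary size used only to compute the total pseudocount alpha_0.
    """
    alpha_0 = vocab_size * alpha_w
    delta = (log_odds(y_i + alpha_w, n_i + alpha_0)
             - log_odds(y_j + alpha_w, n_j + alpha_0))
    variance = 1 / (y_i + alpha_w) + 1 / (y_j + alpha_w)
    return delta / math.sqrt(variance)

# Same 2:1 relative frequency, but very different amounts of evidence.
rare = z_score(y_i=2, n_i=10_000, y_j=1, n_j=10_000)
common = z_score(y_i=200, n_i=10_000, y_j=100, n_j=10_000)
# A word absent from corpus j still gets a finite score thanks to the pseudocount.
unseen = z_score(y_i=5, n_i=10_000, y_j=0, n_j=10_000)
print(f"rare: {rare:.2f}  common: {common:.2f}  unseen in j: {unseen:.2f}")
```

The rare and common words have the same frequency ratio, yet the common word should receive a much larger score because its estimate is less variable. Without the pseudocount, the unseen word would require taking the log of zero.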

In this homework you will implement this ratio for two datasets of your choice to characterize the words that distinguish each one from the other.

## Part 1

Your first job is to find two datasets with some interesting opposition -- e.g., news articles from CNN vs. FoxNews, books written by Charles Dickens vs. James Joyce, screenplays of dramas vs. comedies. Be creative -- this should be driven by what interests you and should reflect your own originality. **This dataset cannot come from Kaggle**. Feel free to use web scraping (see [here](https://github.com/CU-ITSS/Web-Data-Scraping-S2023) for a great tutorial) or manually copy and paste text. Aim for more than 10,000 tokens for each dataset.
   
Save those datasets in two files: "class1_dataset.txt" and "class2_dataset.txt"

**Describe each of those datasets and their source in 100-200 words.**

Type your response here:



In [2]:
!pip -q install requests beautifulsoup4 tqdm

In [3]:
TOPICS = [
    "Abortion", "Gun_control", "Climate_change", "Feminism", "Immigration",
    "Same-sex_marriage", "Evolution", "Creationism", "Vaccination",
    "COVID-19_pandemic", "Transgender", "Affirmative_action",
    "Capital_punishment", "Socialism", "Free_market", "Universal_health_care",
    "Net_neutrality", "Renewable_energy", "Nuclear_power", "Gun_rights",
    "Intelligent_design", "Environmentalism", "BLM", "Woke", "Cultural_Marxism",
    "Critical_race_theory", "Gender_identity", "School_choice", "Tax_cuts",
    "Minimum_wage", "Universal_basic_income", "Welfare_state", "Secularism",
    "Christian_right", "LGBT_rights", "Great_Replacement", "Deep_state",
    "COVID-19_vaccine", "Mask_mandate", "Globalism", "Nationalism",
    "Multiculturalism", "Law_and_order", "Police_reform", "Media_bias",
    "Fake_news", "Abstinence-only_sex_education", "Sex_education",
    "Stem_cell_research", "Gun_violence"
]

WIKI_REST = "https://en.wikipedia.org/api/rest_v1/page/plain/{title}"
CONSERVA_RENDER = "https://www.conservapedia.com/index.php?title={title}&action=render"

HEADERS = {
    "User-Agent": "academic-research-notebook/1.0 (non-commercial; contact: student@example.edu)"
}

MIN_TOKENS_PER_CORPUS = 12000

In [4]:
import re, time, random, requests
from bs4 import BeautifulSoup
from urllib.parse import quote

def sanitize(text: str) -> str:
    text = re.sub(r"\[\d+\]", " ", text)
    text = re.sub(r"\{\{.*?\}\}", " ", text, flags=re.S)
    text = re.sub(r"<ref.*?>.*?</ref>", " ", text, flags=re.S)
    text = re.sub(r"<.*?>", " ", text, flags=re.S)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def approx_token_count(text: str) -> int:
    return len(re.findall(r"\w+", text))

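def fetch_wikipedia(title: str) -> str:
    # Request a plain-text extract through the MediaWiki Action API
    # (prop=extracts with explaintext) rather than the WIKI_REST path.
    params = {
        "action": "query", "prop": "extracts", "explaintext": 1,
        "redirects": 1, "format": "json", "formatversion": 2, "titles": title,
    }
    r = requests.get("https://en.wikipedia.org/w/api.php",
                     params=params, headers=HEADERS, timeout=30)
    if r.status_code != 200:
        return ""
    pages = r.json().get("query", {}).get("pages", [])
    return sanitize(pages[0].get("extract", "")) if pages else ""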

def fetch_conservapedia(title: str) -> str:
    url = CONSERVA_RENDER.format(title=quote(title))
    r = requests.get(url, headers=HEADERS, timeout=30)
    if r.status_code == 200 and r.text:
        soup = BeautifulSoup(r.text, "html.parser")
        text = soup.get_text(separator=" ")
        return sanitize(text)
    return ""

def write_with_license(path: str, corpus_text: str, source_label: str):
    with open(path, "w", encoding="utf-8") as f:
        f.write(
            f"### Dataset: {path}\n"
            f"### Source: {source_label}\n"
            "### Licensing:\n"
            "- Wikipedia: CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0/\n"
            "- Conservapedia: CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0/\n"
            "### Attribution:\n"
            "- Each section is prefixed with SOURCE_ARTICLE followed by the original page title.\n"
            "=============================================================\n"
        )
        f.write(corpus_text)


In [5]:
from tqdm import tqdm

def build_dataset(fetch_fn, name: str, topics, min_tokens=MIN_TOKENS_PER_CORPUS,
                  cool_down=(0.7, 1.6), max_errors=8) -> str:
    topics = topics[:]  # copy
    random.shuffle(topics)
    chunks, total, errors = [], 0, 0
    pbar = tqdm(total=min_tokens, desc=f"Building {name}", unit="tok")

    for t in topics:
        try:
            txt = fetch_fn(t)
            if not txt or len(txt) < 500:
                continue
            header = f"\n\n===== SOURCE_ARTICLE: {t.replace('_',' ')} =====\n"
            chunks.append(header + txt)
            gained = approx_token_count(txt)
            total += gained
            pbar.update(gained if total <= min_tokens else max(0, min_tokens - (total - gained)))
            if total >= min_tokens:
                break
            time.sleep(random.uniform(*cool_down))
        except Exception:
            errors += 1
            if errors >= max_errors:
                print(f"[WARN] {name} reached error cap; continuing with collected text.")
                break
            continue
    pbar.close()
    return "\n".join(chunks)

In [8]:
wiki_text = build_dataset(fetch_wikipedia, "Wikipedia", TOPICS)
conserva_text = build_dataset(fetch_conservapedia, "Conservapedia", TOPICS)

write_with_license("class1_dataset.txt", wiki_text, "Wikipedia (CC BY-SA 4.0)")
write_with_license("class2_dataset.txt", conserva_text, "Conservapedia (CC BY-SA 3.0)")

print("Saved:\n - class1_dataset.txt (Wikipedia)\n - class2_dataset.txt (Conservapedia)")

Building Wikipedia:   0%|          | 0/12000 [00:04<?, ?tok/s]
Building Conservapedia: 100%|██████████| 12000/12000 [00:15<00:00, 752.96tok/s]

Saved:
 - class1_dataset.txt (Wikipedia)
 - class2_dataset.txt (Conservapedia)





In [9]:
with open("class1_dataset.txt", "r", encoding="utf-8") as f:
    c1 = f.read()
with open("class2_dataset.txt", "r", encoding="utf-8") as f:
    c2 = f.read()

print("Approx tokens:")
print("class1_dataset.txt:", approx_token_count(c1))
print("class2_dataset.txt:", approx_token_count(c2))

Approx tokens:
class1_dataset.txt: 52
class2_dataset.txt: 14595


## Part 2

Tokenize those texts by filling out the `read_and_tokenize` function below (your choice of tokenizer). The input is a filename and the output should be a list of tokens.
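
The choice of tokenizer changes the counts that feed the log-odds computation. As a rough illustration, the sketch below compares two simple strategies on a made-up sentence. The sentence and the function names `whitespace_tokenize` and `regex_tokenize` are hypothetical and are not part of the assignment.

```python
import re

def whitespace_tokenize(text: str) -> list[str]:
    """Lower-case and split on whitespace; punctuation stays attached to words."""
    return text.lower().split()

def regex_tokenize(text: str) -> list[str]:
    """Lower-case and keep runs of word characters; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())

sample = "The U.S. economy isn't shrinking; it's growing!"
print(whitespace_tokenize(sample))
print(regex_tokenize(sample))
```

Whitespace splitting keeps tokens such as `shrinking;` with punctuation attached, while the regex splits contractions and abbreviations into fragments such as `s` and `t`. Either effect can surface in a list of distinctive words, so it is worth inspecting a few tokens before counting.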

In [10]:
def read_and_tokenize(filename: str) -> list[str]:
    """Read the file and output a list of strings (tokens)."""
    # your code here
    with open(filename, "r", encoding="utf-8") as f:
        text = f.read()
    tokens = re.findall(r"\b\w+\b", text.lower())
    return tokens

In [11]:
# change these file paths to wherever the datasets you created above live.
class1_tokens = read_and_tokenize("class1_dataset.txt")
class2_tokens = read_and_tokenize("class2_dataset.txt")

In [12]:
print("class1 tokens:", len(class1_tokens))
print("class2 tokens:", len(class2_tokens))

class1 tokens: 52
class2 tokens: 14595


## Part 3

Now let's find the words that characterize each of those sources (with respect to the other). Implement the log-odds ratio with an uninformative Dirichlet prior. This value, $\widehat\zeta_w^{(i-j)}$ for word $w$ reflecting the difference in usage between corpus $i$ and corpus $j$, is given by the following equation:

$$
\widehat{\zeta}_w^{(i-j)}= {\widehat{d}_w^{(i-j)} \over \sqrt{\sigma^2\left(\widehat{d}_w^{(i-j)}\right)}}
$$

Where:

$$
\widehat{d}_w^{(i-j)} = \log \left( {y_w^i + \alpha_w} \over {n^i + \alpha_0 - y_w^i - \alpha_w} \right) - \log \left( {y_w^j + \alpha_w} \over {n^j + \alpha_0 - y_w^j - \alpha_w} \right)
$$

$$
\sigma^2\left(\widehat{d}_w^{(i-j)}\right) \approx {1 \over {y_w^i + \alpha_w}} + {1 \over {y_w^j + \alpha_w} }
$$

And:

* $y_w^i = $ count of word $w$ in corpus $i$ (likewise for $j$)
* $\alpha_w$ = 0.01
* $V$ = size of vocabulary (number of distinct word types)
* $\alpha_0 = V \cdot \alpha_w$
* $n^i = $ number of words in corpus $i$ (likewise for $j$)

In this example, the two corpora are your class1 dataset (e.g., $i$ = your class1) and your class2 dataset (e.g., $j$ = class2). Using this metric, print out the 25 words most strongly aligned with class1, and 25 words most strongly aligned with class2.  Again, consult [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf) for more detail.
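
As a rough sanity check on the formulas above, here is a worked example with made-up numbers (none of these counts come from the assignment data): $y_w^i = 3$, $y_w^j = 1$, $n^i = n^j = 10$, and $V = 5$, so $\alpha_0 = 5 \times 0.01 = 0.05$. Values are rounded to three decimals.

$$
\widehat{d}_w^{(i-j)} = \log \left( {3.01 \over 10.05 - 3.01} \right) - \log \left( {1.01 \over 10.05 - 1.01} \right) \approx -0.850 - (-2.192) = 1.342
$$

$$
\sigma^2\left(\widehat{d}_w^{(i-j)}\right) \approx {1 \over 3.01} + {1 \over 1.01} \approx 1.322
$$

$$
\widehat{\zeta}_w^{(i-j)} \approx {1.342 \over \sqrt{1.322}} \approx 1.17
$$

A positive score means the word leans toward corpus $i$; a negative score means it leans toward corpus $j$.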

In [16]:
from collections import Counter
import math

def logodds_with_uninformative_prior(tokens_i: list[str], tokens_j: list[str], display=25):
    """Print out the log odds results given two lists of tokens."""
    # your code here
    counts_i = Counter(tokens_i)
    counts_j = Counter(tokens_j)
    vocab = set(counts_i.keys()) | set(counts_j.keys())
    V = len(vocab)

    alpha_w = 0.01
    alpha_0 = V * alpha_w
    n_i = len(tokens_i)
    n_j = len(tokens_j)

    z_scores = {}

    for w in vocab:
        y_iw = counts_i.get(w, 0)
        y_jw = counts_j.get(w, 0)

        d_w = (
            math.log((y_iw + alpha_w) / (n_i + alpha_0 - y_iw - alpha_w))
            - math.log((y_jw + alpha_w) / (n_j + alpha_0 - y_jw - alpha_w))
        )

        sigma_sq = 1 / (y_iw + alpha_w) + 1 / (y_jw + alpha_w)

        # z-score
        z_w = d_w / math.sqrt(sigma_sq)
        z_scores[w] = z_w

    sorted_i = sorted(z_scores.items(), key=lambda x: x[1], reverse=True)[:display]
    sorted_j = sorted(z_scores.items(), key=lambda x: x[1])[:display]

    print(f"Top {display} words aligned with class1:")
    for w, score in sorted_i:
        print(f"{w}\t{score:.4f}")

    print(f"\nTop {display} words aligned with class2:")
    for w, score in sorted_j:
        print(f"{w}\t{score:.4f}")

    return sorted_i, sorted_j

In [17]:
logodds_with_uninformative_prior(class1_tokens, class2_tokens)

Top 25 words aligned with class1:
sa	8.1048
0	7.8606
cc	6.2545
4	5.8437
by	5.6242
creativecommons	5.1000
wikipedia	5.1000
licenses	5.1000
org	4.7965
3	4.1695
attribution	3.6074
title	3.6074
prefixed	3.6074
section	3.6074
dataset	3.6074
licensing	3.6074
txt	3.6074
original	3.6074
source	3.5977
conservapedia	3.5977
followed	3.4647
https	3.2275
each	3.1855
page	3.1855
source_article	3.0617

Top {display} words aligned with class2:
the	-1.4772
of	-0.5481
and	-0.5208
to	-0.5049
in	-0.4911
a	-0.4852
for	-0.4275
that	-0.4191
s	-0.4079
covid	-0.4011
climate	-0.3964
as	-0.3940
vaccine	-0.3889
world	-0.3793
are	-0.3764
on	-0.3687
global	-0.3655
it	-0.3604
american	-0.3551
change	-0.3532
school	-0.3513
not	-0.3475
coronavirus	-0.3475
has	-0.3414
warming	-0.3414



To check your work, you can run log-odds on the party platforms from the lab section. With `nltk.word_tokenize` _before_ lower-casing, these should be your top 5 words (and scores, roughly). Depending on your tokenization strategy, your scores might be slightly different.

**Democrat**:
```
president:	4.75
biden:	4.27
to:	4.11
he:	4.09
has:	4.08
```
**Republican**:
```
republicans:	-13.45
our:	-11.23
will:	-10.88
american:	-10.01
restore:	-7.97
```

In [18]:
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/main/data/2024_democrat_party_platform.txt
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/main/data/2024_republican_party_platform.txt

--2025-09-09 17:11:03--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/2024_democrat_party_platform.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 283046 (276K) [text/plain]
Saving to: ‘2024_democrat_party_platform.txt’


2025-09-09 17:11:03 (7.23 MB/s) - ‘2024_democrat_party_platform.txt’ saved [283046/283046]

--2025-09-09 17:11:03--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/2024_republican_party_platform.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35319 (34K) [text/plain]
Saving to: ‘

In [22]:
import nltk
nltk.download('punkt_tab')

def read_platform_tokens(path: str) -> list[str]:
    # Tokenize the file's contents (not the filename), then lower-case each token.
    with open(path, "r", encoding="utf-8") as f:
        return [w.lower() for w in nltk.word_tokenize(f.read())]

logodds_with_uninformative_prior(
    read_platform_tokens("2024_democrat_party_platform.txt"),
    read_platform_tokens("2024_republican_party_platform.txt")
)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.