This notebook performs a comparative text analysis on the tweets of Donald Trump and Alexandria Ocasio-Cortez (AOC) using the Chi-Square ($\chi^2$) statistical test. The goal is to identify the words that most significantly distinguish one's Twitter vocabulary from the other's.

***

### Cell 1: Importing Necessary Libraries
This first cell imports the Python libraries required for the analysis.
* `sys`: Provides access to system-specific parameters and functions (imported here, although none of the cells below call it).
* `json`: Essential for reading the data, which is stored in JSON format.
* `nltk`: The Natural Language Toolkit, a powerful library for text processing. We'll use its `casual_tokenize` function, which is specifically designed for informal text like tweets.
* `operator`: Used later for sorting the results based on their statistical scores.
* `collections.Counter`: A specialized dictionary subclass used for counting word frequencies efficiently.

In [None]:
# Import necessary libraries for file handling, data processing, and natural language processing.
import sys
import json
import nltk
import operator
from collections import Counter

***

### Cell 2: Function to Read Tweet Data
This cell defines a helper function, `read_tweets_from_json`, to load the tweet data from a JSON file. It opens the specified file, parses the JSON content, and iterates through each tweet object to extract only the text, which it appends to a list.

In [None]:
# Defines a function to open a JSON file and extract the 'text' field from each tweet object.
def read_tweets_from_json(filename):
    # Initialize an empty list to store tweet texts.
    tweets=[]
    # Open the specified file with UTF-8 encoding to handle special characters and emojis.
    with open(filename, encoding="utf-8") as file:
        # Load the JSON data from the file.
        data=json.load(file)
        # Loop through each tweet in the loaded data.
        for tweet in data:
            # Append the text of the tweet to our list.
            tweets.append(tweet["text"])
    # Return the list of tweet texts.
    return tweets
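
For reference, the function expects the file to hold a JSON array of objects, each with a `text` key. The sketch below illustrates that layout with made-up data written to a temporary file. The two sample tweets and the temporary path are hypothetical, and the real data files may well carry additional fields per tweet.

```python
import json
import os
import tempfile


def read_tweets_from_json(filename):
    """Same logic as the cell above: collect the 'text' field of each tweet."""
    with open(filename, encoding="utf-8") as file:
        return [tweet["text"] for tweet in json.load(file)]


# Hypothetical sample data in the assumed layout: a list of objects with a "text" key.
sample = [
    {"text": "First example tweet #hello"},
    {"text": "Second example tweet @someone"},
]

# Write the sample to a temporary file, read it back, then clean up.
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    json.dump(sample, tmp)
    tmp_path = tmp.name

loaded_tweets = read_tweets_from_json(tmp_path)
os.remove(tmp_path)

print(loaded_tweets)
```

A tweet object without a `text` key would raise a `KeyError` here, so this layout is worth checking before loading a new dataset.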

***

### Cell 3: Loading Donald Trump's Tweets
Here, we call the `read_tweets_from_json` function to load the dataset containing tweets from Donald Trump. The returned list of tweet strings is stored in the `trump_tweets` variable.

In [None]:
# Use the previously defined function to load tweets from the Trump JSON file.
trump_tweets=read_tweets_from_json("../data/trump_tweets.json")

***

### Cell 4: Loading Alexandria Ocasio-Cortez's Tweets
Similarly, this cell loads the tweet data for Alexandria Ocasio-Cortez into the `aoc_tweets` variable. Now we have two distinct datasets ready for processing.

In [None]:
# Use the same function to load tweets from the AOC JSON file.
aoc_tweets=read_tweets_from_json("../data/aoc_tweets.json")

***

### Cell 5: Initial Assumptions (Pre-Analysis)
> Explore your assumptions about the words you think will most distinguish the tweets of Donald Trump from those of Alexandria Ocasio-Cortez. Before looking at the data, what words do you think will be comparatively distinctive of each? (If you're not familiar with either, see http://twitter.com/realDonaldTrump and http://twitter.com/AOC).

Before running the statistical analysis, it's useful to form a hypothesis.
* **For Donald Trump**, I would expect to see self-referential terms like "Trump" and "Donald," along with characteristic adjectives like "great," "tremendous," and "best." I'd also predict politically charged terms like "Fake News," "Witch Hunt," and "MAGA." His use of exclamation points and distinctive capitalization might also be a feature.
* **For Alexandria Ocasio-Cortez**, I'd anticipate words related to policy and social issues, such as "climate," "healthcare," "justice," "community," and "systemic." I would also expect to see terms related to her constituency, like "Bronx" and "Queens," and progressive political language.

***

### Cell 6: Function to Tokenize Tweets
Tokenization is the process of breaking raw text into smaller units, such as words or symbols, called "tokens". This function, `convert_tweets_to_tokens`, takes a list of tweet strings and passes each one to `nltk.casual_tokenize`. That tokenizer is designed for social media text: it keeps hashtags, mentions, and emoticons intact as single tokens rather than splitting them at punctuation. The function returns one long, flat list containing all tokens from all tweets.

In [None]:
# Defines a function that converts a list of tweet strings into a single list of tokens.
def convert_tweets_to_tokens(tweets):
    # Initialize an empty list to hold all tokens.
    tokens=[]
    # Iterate over each tweet in the input list.
    for tweet in tweets:
        # Tokenize the tweet using NLTK's casual tokenizer and add the resulting tokens to the list.
        tokens.extend(nltk.casual_tokenize(tweet))
    # Return the flat list of all tokens.
    return tokens

***

### Cell 7: Function to Count Token Frequencies
After tokenizing the text, we need to count how many times each token appears. This function, `get_counts`, takes a list of tokens and uses the `Counter` object to create a frequency distribution, which is essentially a dictionary mapping each unique token to its count.

In [None]:
# Defines a function to count the frequency of each token in a list of tokens.
def get_counts(tokens):
    # Create a Counter object, which will store tokens as keys and their frequencies as values.
    counts=Counter()
    # Iterate through each token in the list.
    for token in tokens:
        # Increment the count for the current token.
        counts[token]+=1
    # Return the Counter object with all the token counts.
    return counts
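
As a small aside, the loop above produces the same result as passing the token list straight to `Counter`. The sketch below checks this on a hypothetical token list. It also shows a `Counter` property that the later $\chi^2$ code relies on: looking up a token that was never counted returns `0` instead of raising a `KeyError`.

```python
from collections import Counter


def get_counts(tokens):
    """Loop-based counting, as in the cell above."""
    counts = Counter()
    for token in tokens:
        counts[token] += 1
    return counts


# Hypothetical tokens, for illustration only.
tokens = ["great", "!", "great", "#MAGA", "!"]

loop_counts = get_counts(tokens)
direct_counts = Counter(tokens)  # Counter accepts any iterable of hashable items

print(loop_counts == direct_counts)   # True
print(direct_counts["great"])         # 2
print(direct_counts["never-seen"])    # 0 (missing keys default to zero)
```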

***

### Cell 8: Introduction to the Chi-Square ($\chi^2$) Test
> The $\chi^2$ test as used in the comparison of different texts is designed to measure how statistically significant the distribution of counts in a 2x2 contingency table is. Use the following function to analyze the difference between these accounts. How do the most distinct terms comport with your assumptions?

The Chi-square ($\chi^2$) test is a statistical method used to determine if there is a significant association between two categorical variables. In our case, the variables are **the author** (Trump or AOC) and **the word**. For each word in the combined vocabulary, we create a 2x2 contingency table:

| | Trump's Tweets | AOC's Tweets |
| :--- | :---: | :---: |
| **Count of the word** | `O11` | `O12` |
| **Count of all other words**| `O21` | `O22` |

The $\chi^2$ test calculates a score based on the difference between the observed counts (`O`) in this table and the counts we would expect if there were no association between the author and the word. A high $\chi^2$ score indicates that an author uses a particular word with a significantly different frequency (either much more or much less) than the other author, making it a "distinctive" word.
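
To make the calculation concrete, here is a minimal sketch on one hypothetical table (the counts are invented and do not come from the tweet data). It computes the score two ways: from the textbook definition, $\chi^2 = \sum_{i,j} (O_{ij} - E_{ij})^2 / E_{ij}$ with $E_{ij} = (\text{row}_i \text{ total} \times \text{column}_j \text{ total}) / N$, and from the 2x2 shortcut $N (O_{11} O_{22} - O_{12} O_{21})^2 / \big((O_{11}+O_{12})(O_{11}+O_{21})(O_{12}+O_{22})(O_{21}+O_{22})\big)$. The two should agree up to floating-point rounding.

```python
import math

# Hypothetical counts for a single word (illustration only).
O11, O12 = 30.0, 5.0       # occurrences of the word in corpus one / corpus two
O21, O22 = 970.0, 1995.0   # occurrences of all other words in each corpus
N = O11 + O12 + O21 + O22

observed = [[O11, O12], [O21, O22]]
row_totals = [O11 + O12, O21 + O22]
col_totals = [O11 + O21, O12 + O22]

# Definition: sum of (observed - expected)^2 / expected over the four cells,
# where "expected" is the count under no association between author and word.
chi2_definition = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / N
        chi2_definition += (observed[i][j] - expected) ** 2 / expected

# Shortcut form for 2x2 tables.
chi2_shortcut = (N * (O11 * O22 - O12 * O21) ** 2) / (
    (O11 + O12) * (O11 + O21) * (O12 + O22) * (O21 + O22)
)

print(chi2_definition, chi2_shortcut)
print(math.isclose(chi2_definition, chi2_shortcut))  # True
```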

In [None]:
# Defines the main function to perform the Chi-Square test.
def chi_square(one_counts, two_counts):

    # Initialize total word counts for corpus one (Trump) and two (AOC).
    one_sum=0.
    two_sum=0.
    # Create a dictionary to store the combined vocabulary from both corpora.
    vocab={}
    # Calculate the total number of words in corpus one and populate the vocabulary.
    for word in one_counts:
        one_sum+=one_counts[word]
        vocab[word]=1
    # Do the same for corpus two, adding any new words to the vocabulary.
    for word in two_counts:
        vocab[word]=1
        two_sum+=two_counts[word]

    # Calculate the grand total number of words across both corpora.
    N=one_sum+two_sum
    # Initialize a dictionary to store the chi-square value for each word.
    vals={}
    
    # Iterate over every word in the combined vocabulary.
    for word in vocab:
        # O11: Observed count of the word in corpus one.
        O11=one_counts[word]
        # O12: Observed count of the word in corpus two.
        O12=two_counts[word]
        # O21: Observed count of all OTHER words in corpus one.
        O21=one_sum-one_counts[word]
        # O22: Observed count of all OTHER words in corpus two.
        O22=two_sum-two_counts[word]
        
        # We'll use the simpler, computationally faster form of the Chi-square formula
        # for 2x2 contingency tables, as described in Manning and Schuetze (1999).
        
        # Calculate the Chi-square score for the word.
        vals[word]=(N*(O11*O22 - O12*O21)**2)/((O11 + O12)*(O11+O21)*(O12+O22)*(O21+O22))
        
    # Sort the words by their Chi-square score in descending order.
    sorted_chi = sorted(vals.items(), key=operator.itemgetter(1), reverse=True)
    
    # Create empty lists to hold the distinctive words for each author.
    one=[]
    two=[]
    # Iterate through the sorted words.
    for k,v in sorted_chi:
        # To assign a word, we check who used it more frequently relative to their total word output.
        if one_counts[k]/one_sum > two_counts[k]/two_sum:
            one.append(k)
        else:
            two.append(k)
    
    # Print the top 20 most distinctive terms for Donald Trump.
    print("@realdonaldtrump:\n")
    for k in one[:20]:
        print("%s\t%s" % (k, vals[k]))

    # Print the top 20 most distinctive terms for AOC.
    print("\n\n@AOC:\n")
    for k in two[:20]:
        print("%s\t%s" % (k, vals[k]))

***

### Cell 10: Processing Trump's Tweets
Now we apply the preprocessing functions to Donald Trump's tweet data. First, we tokenize the list of tweets, and then we generate a frequency count for all of his tokens.

In [None]:
# Convert the list of Trump's tweets into a flat list of tokens.
trump_tokens=convert_tweets_to_tokens(trump_tweets)
# Get the frequency counts for each token in Trump's corpus.
trump_counts=get_counts(trump_tokens)

***

### Cell 11: Processing AOC's Tweets
We perform the exact same preprocessing steps for Alexandria Ocasio-Cortez's tweets, preparing her data for the statistical comparison.

In [None]:
# Convert the list of AOC's tweets into a flat list of tokens.
aoc_tokens=convert_tweets_to_tokens(aoc_tweets)
# Get the frequency counts for each token in AOC's corpus.
aoc_counts=get_counts(aoc_tokens)

***

### Cell 12: Running the Analysis and Displaying Results
With both datasets tokenized and counted, we can now execute the `chi_square` function. This will perform the statistical test and print the top 20 most distinctive words for each account, along with their $\chi^2$ scores.

In [None]:
# Run the Chi-Square analysis on the two count dictionaries and print the results.
chi_square(trump_counts, aoc_counts)

***

### Post-Analysis: Comparing Results with Assumptions
The results largely align with and add nuance to the initial assumptions.

* **For @realdonaldtrump**: The list includes self-references (`@realDonaldTrump`, `Trump`, `Donald`, `President`), his characteristic adjective (`great`/`Great`), and a key political opponent (`Obama`). The high scores for punctuation such as `"`, `!`, `.`, and `?` point to a distinct stylistic signature. The presence of `#Trump2016` grounds the dataset in a specific time period. This matches the hypothesis well.

* **For @AOC**: The results are revealing. The high score for `RT` (retweet) suggests a different pattern of platform use. The prominence of `Queens` and `Bronx` supports the hypothesis about constituency-focused language. The names `Ocasio-Cortez`, `Alexandria`, `Ocasio`, and her opponent `Crowley` are also highly distinctive. The strange character sequences (`窶ｦ`, `凋`, etc.) look like mojibake, that is, emojis or other Unicode symbols that were decoded with the wrong character encoding somewhere upstream; whatever the underlying characters are, they are clearly a distinguishing feature of her tweets. Overall this aligns with the initial hypothesis, while revealing stylistic markers (such as RTs and emojis) that are just as important as topic words.
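
One plausible explanation for the strange characters, offered as a guess, not a diagnosis of this dataset: `窶ｦ` appears to match what a UTF-8 ellipsis (`…`) turns into when its bytes are decoded as Shift-JIS. The sketch below reproduces that kind of garbling and then reverses it. The choice of Shift-JIS is an assumption, and the data may have been damaged in some other way.

```python
# A UTF-8 ellipsis decoded with the wrong codec.
# Assumption: Shift-JIS was the mistaken decoder; this is a guess about the data.
original = "\u2026"  # the single character "…"
garbled = original.encode("utf-8").decode("shift_jis")
print(garbled)

# Reversing the mistake recovers the original character.
repaired = garbled.encode("shift_jis").decode("utf-8")
print(repaired == original)
```

If this guess is right, many of these tokens are truncation ellipses and damaged emoji, so repairing the encoding before tokenizing would change which tokens appear in the top-20 lists.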

***

### Cell 13: Discussion on Burstiness and Alternative Methods

> We saw earlier that $\chi^2$ is not a perfect estimator since it doesn't account for the burstiness of language (the tendency of mentions of the same word to clump together in a text). Do you expect this to still hold on Twitter? Why or why not? How are the differences identified by a $\chi^2$ similar to those by Mann-Whitney?

#### Burstiness on Twitter
Yes, the problem of **burstiness** is not only present but likely **amplified** on Twitter. The platform's real-time, event-driven nature encourages it. For example:
* **Replying to a specific event**: A politician might send out a dozen tweets in a short period about a single breaking news story, causing topic words to "burst."
* **Live-tweeting**: During a debate or hearing, a user will post many tweets using the same hashtags and keywords.
* **Campaigning**: During a visit to a specific location (e.g., the Bronx), a politician might mention it repeatedly in a single day.

The $\chi^2$ test assumes that every word occurrence is an independent event, but burstiness violates this assumption. Because the counts are pooled over the whole corpus, a word appearing 10 times in one "bursty" tweet is treated the same as that word appearing once in each of 10 different, independent tweets. This can inflate the $\chi^2$ score for words associated with specific, high-volume events.
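
The pooling effect can be seen in a small sketch. The two toy corpora below are hypothetical, and whitespace splitting stands in for `nltk.casual_tokenize` to keep the example dependency-free. One corpus packs all ten uses of a word into a single tweet, the other spreads them over ten tweets, yet their pooled counts are identical, so any statistic computed from those counts cannot tell them apart.

```python
from collections import Counter


def corpus_counts(tweets):
    """Pool token counts over a whole corpus (simple whitespace tokenization)."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.split())
    return counts


# Hypothetical corpora with the same vocabulary and the same total size.
bursty_tweets = ["vote " * 10] + ["hello world"] * 9   # "vote" in 1 tweet only
spread_tweets = ["vote hello world"] * 9 + ["vote"]    # "vote" in all 10 tweets

bursty_counts = corpus_counts(bursty_tweets)
spread_counts = corpus_counts(spread_tweets)

print(bursty_counts == spread_counts)  # True: the pooled counts are identical
print(bursty_counts["vote"])           # 10

# The per-tweet picture is very different.
tweets_with_vote_bursty = sum("vote" in t.split() for t in bursty_tweets)
tweets_with_vote_spread = sum("vote" in t.split() for t in spread_tweets)
print(tweets_with_vote_bursty, tweets_with_vote_spread)  # 1 10
```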

#### Comparison to Mann-Whitney U Test
The Chi-square test and the Mann-Whitney U test identify differences between texts, but they do so in fundamentally different ways.

* **Chi-Square Test (as used here)**: This approach treats each author's entire collection of tweets as one large "bag of words." It compares the **total frequency** of a word in Trump's entire corpus against its total frequency in AOC's entire corpus. It answers the question: "Overall, does one author use this word more than the other?" It's a measure of aggregate frequency.

* **Mann-Whitney U Test**: This test would be used differently. Instead of one big bag of words, we would treat each **individual tweet** as a document. We would calculate the frequency of a word (e.g., "climate") in every single one of AOC's tweets, giving us a list of numbers (e.g., `[0, 1, 0, 0, 2, 0, ...]`). We'd do the same for Trump's tweets. The Mann-Whitney U test would then compare these two lists to see if the **distributions** are different. It's a non-parametric test that essentially asks: "If I pick a random tweet from AOC and a random one from Trump, what is the probability that the AOC tweet has a higher frequency of this word?"

**Similarity and Difference**:
* **Similarity**: Both tests can identify words that are used more by one author than another. A word with a high $\chi^2$ score will often also have a significantly different distribution according to a Mann-Whitney U test.
* **Key Difference**: Mann-Whitney is less sensitive to burstiness. A word used 20 times in a single "bursty" tweet is just one data point (one tweet with a high frequency) for Mann-Whitney. For Chi-square, those 20 occurrences significantly inflate the word's total count. Therefore, Mann-Whitney is better at identifying words that are **consistently** used more frequently on a *per-tweet basis*, while Chi-square identifies words that are more frequent **in total**, which may be due to consistent use or just a few high-volume events.
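
To illustrate the per-tweet view, here is a minimal pure-Python sketch of the Mann-Whitney U statistic applied to hypothetical per-tweet counts of a single word. It computes only U and the corresponding "probability of superiority" (U divided by the number of pairs, with ties counted as one half), not a p-value; a library routine such as `scipy.stats.mannwhitneyu` would be the usual choice for a full test. All counts are invented.

```python
def mann_whitney_u(xs, ys):
    """U statistic for xs versus ys.

    Counts the (x, y) pairs where x > y, with ties contributing 0.5 each.
    """
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Hypothetical per-tweet counts of one word (10 tweets per author).
other_author = [0] * 10        # the other author never uses the word
bursty = [20] + [0] * 9        # 20 uses, all packed into a single tweet
consistent = [2] * 10          # 20 uses, spread evenly over every tweet

for label, counts in [("bursty", bursty), ("consistent", consistent)]:
    u = mann_whitney_u(counts, other_author)
    superiority = u / (len(counts) * len(other_author))
    print(label, sum(counts), u, superiority)
# bursty 20 55.0 0.55
# consistent 20 100.0 1.0
```

Both authors use the word 20 times in total, so their pooled counts are the same, but the bursty pattern barely moves U away from the no-difference value of 50, while the consistent pattern pushes it to its maximum of 100.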