The log-odds ratio with an informative (and uninformative) Dirichlet prior (described in [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf)) is a common method for finding distinctive terms in two datasets (see [Jurafsky et al. 2014](https://firstmonday.org/ojs/index.php/fm/article/view/4944/3863) for an example article that uses it to make an empirical argument). This method for finding distinguishing words combines a number of desirable properties:

* it specifies an intuitive metric (the log-odds) for the ratio of two probabilities
* it can incorporate prior information in the form of pseudocounts, which can either act as a smoothing factor (in the uninformative case) or incorporate real information about the expected frequency of words overall.
* it accounts for variability of a frequency estimate by essentially converting the log-odds to a z-score.

In this homework you will implement this ratio for a dataset of your choice to characterize the words that differentiate each one.

Your first job is to find two datasets with some interesting opposition -- e.g., news articles from CNN vs. FoxNews, books written by Charles Dickens vs. James Joyce, screenplays of dramas vs. comedies.  Be creative -- this should be driven by what interests you and should reflect your own originality. **This dataset cannot come from Kaggle**.  Feel feel to use web scraping (see [here](https://github.com/CU-ITSS/Web-Data-Scraping-S2023) for a great tutorial) or manually copying/pasting text.  Aim for more than 10,000 tokens for each dataset. 
   
Save those datasets in two files: "class1_dataset.txt" and "class2_dataset.txt" 

Q1. Describe each of those datasets and their source in 100-200 words.

**ANSWER:**
The selected dataset for this assignment comprises two classic literary works from Project Gutenberg: Shakespeare's "Romeo and Juliet" and a collection of Brazilian Tales by various authors, including pieces from the renowned Brazilian novelist Machado de Assis (one of my favorite Brazilian authors). I chose these two texts due to their allegedly rich linguistic content, and both significantly exceeded the minimum requirement of 10,000 tokens for analysis. Moreover, I also wanted to analyze a text originally written in a different language, considering that the translation process might impact the vocabulary richness and complexity of the book.

However, in line with the class objectives of exploring and working with personal data, two additional datasets were included. One dataset is my own Statement of Purpose, used for my admission to the MIMS program here at Cal. The second dataset consists of an essay generated by Chat GPT, leveraging the GPT-3.5 model, with a similar goal of crafting a Statement of Purpose for myself.

The diversity of these datasets, from classic literature to personal narratives and AI-generated text, promises a multifaceted exploration of natural language processing techniques and the opportunity to gain insights into the distinctive linguistic features that characterize each text category.

**Additional Information:** For the ChatGPT-generated essay, I used the prompt: "Write a Statement of Purpose for Filipe Santos' application to a Master's program in Information Management and Systems at the University of California Berkeley. The Statement of Purpose should describe Filipe's relevant academic and professional experience and accomplishments, his future professional goals once the degree is acquired, and why he is drawn to the MIMS program and believes it would be a good fit for him and his goals. Filipe is a Brazilian industrial engineer with extensive experience in the tech world, being part of an early-stage startup from the beginning until its successful M&A. Filipe is also very interested in Smart Cities and how relevant education is as a transformational factor. The essay should be 2-3 pages long."

Q2. Tokenize those texts by filling out the `read_and_tokenize` function below (your choice of tokenizer). The input is a filename and the output should be a list of tokens.

In [125]:
import spacy
from collections import Counter
import numpy as np

In [126]:
def read_and_tokenize(filename):
#1. From the ExploreTokenization Jupyter Notebook, it firstly sets up some basic parameters for the nlp and instantiate the nlp tokenizer.
    nlp = spacy.load('en_core_web_sm', disable=['tagger,ner,parser'])
    nlp.remove_pipe('ner')
    nlp.remove_pipe('parser');
    tokenizer = nlp.tokenizer
    spacy_tokens = []

#2. Then it generates a list called spacy_tokens, that stores all the tokenized records of the text. It also references the code from the ExploreTokenization notebook.
    with open(filename, encoding="utf-8-sig") as file:
        text = file.read().lower()
        for line in text.splitlines():
            spacy_tokens.extend([token.text for token in nlp(line)])
    return spacy_tokens    

In [127]:
# change these file paths to wherever the datasets you created above live.
class1_tokens=read_and_tokenize("/Users/filipesantos/Documents/ANLP/class1_dataset.txt")
class2_tokens=read_and_tokenize("/Users/filipesantos/Documents/ANLP/class2_dataset.txt")

Q3.  Now let's find the words that characterize each of those sources (with respect to the other). Implement the log-odds ratio with an uninformative Dirichlet prior.  This value, $\hat\zeta_w^{(i-j)}$ for word $w$ reflecting the difference in usage between corpus $i$ and corpus $j$, is given by the following equation:

$$
\hat\zeta_w^{(i-j)}= {\hat{d}_w^{(i-j)} \over \sqrt{\sigma^2\left(\hat{d}_w^{(i-j)}\right)}}
$$

Where: 

$$
\hat{d}_w^{(i-j)} = \log \left({y_w^i + \alpha_w} \over {n^i + \alpha_0 - y_w^i - \alpha_w}) \right) -  \log \left({y_w^j + \alpha_w} \over {n^j + \alpha_0 - y_w^j - \alpha_w}) \right)
$$

$$
\sigma^2\left(\hat{d}_w^{(i-j)}\right) \approx {1 \over {y_w^i + \alpha_w}} + {1 \over {y_w^j + \alpha_w} }
$$

And:

* $y_w^i = $ count of word $w$ in corpus $i$ (likewise for $j$)
* $\alpha_w$ = 0.01
* $V$ = size of vocabulary (number of distinct word types)
* $\alpha_0 = V * \alpha_w$
* $n^i = $ number of words in corpus $i$ (likewise for $j$)

In this example, the two corpora are your class1 dataset (e.g., $i$ = your class1) and your class2 dataset (e.g., $j$ = class2). Using this metric, print out the 25 words most strongly aligned with class1, and 25 words most strongly aligned with class2.  Again, consult [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf) for more detail.

In [128]:
# As previously explained in this notebook, I included 4 different datasets for analysis, which are:
#class1_dataset: Romeo and Juliet (Shakespeare)
#class2_dataset: Brazilian_Tales (Multiple authors)
#class3_dataset: UC Berkeley Statement of Purpose (Filipe Santos)
#class4_dataset: LLM-generated Statement of Purpose (ChatGPT)

class1_dataset = ["/Users/filipesantos/Documents/ANLP/class1_dataset.txt"]
class2_dataset = ["/Users/filipesantos/Documents/ANLP/class2_dataset.txt"]
class3_dataset = ["/Users/filipesantos/Documents/ANLP/class3_dataset.txt"]
class4_dataset = ["/Users/filipesantos/Documents/ANLP/class4_dataset.txt"]

In [129]:
#Before we start, I added some specific changes to the previous version of the function because I wanted to adapt it to run not only for the two class_datasets that were part of the homework, but also to a different dataset. Therefore, the one_tokens and the two_tokens parameters should now refer to which of the classn_dataset the program should analyze. The suggested comparisons are either class1_dataset verus class2_dataset; or class3_dataset versus class4_dataset.

def logodds_with_uninformative_prior(one_tokens, two_tokens, display=25):

    #1. It creates Counters for each class to calculate word frequencies
    class1_word_counts = Counter()
    class2_word_counts = Counter()

    #2. It counts the word occurrences in both documents using the read_and_tokenize() function
    for doc in one_tokens:
        doc_tokens = read_and_tokenize(doc)  
        class1_word_counts.update(doc_tokens)

    for doc in two_tokens:
        doc_tokens = read_and_tokenize(doc) 
        class2_word_counts.update(doc_tokens)

    total_words_class1 = sum(class1_word_counts.values())
    total_words_class2 = sum(class2_word_counts.values())

    #3.  It calculates the sum of word frequencies across the two documents and the global parameter alpha_0, required in the formula provided on Q3.
    all_word_counts = class1_word_counts + class2_word_counts
    alpha_0 = len(all_word_counts) * 0.01

    #4. It calculates the frequency of each word through a for-loop
    log_odds_ratios = {}
    for word, freq in all_word_counts.items():
        class1_freq = class1_word_counts[word]
        class2_freq = class2_word_counts[word]

    #5. And then it applies the formula presented on Q3, breaking it down on the calculation of the numerator (log_odds_num) and the denominator (log_odds_denom.
        log_odds_num = np.log((class1_freq + 0.01) / (total_words_class1 + 100 - 0.01 + alpha_0)) - np.log((class2_freq + 0.01) / (total_words_class2 + 100-0.01 + alpha_0))
        log_odds_denom = np.sqrt((1/(class1_freq + 0.01))+(1/(class2_freq + 0.01)))
        log_odds = log_odds_num / log_odds_denom
        log_odds_ratios[word] = log_odds

    #6. Finally, it sorts the words by the calculated log-odds ratio to find the top-n words for each class. N is defined by the 'display' parameter
    top_class1_words = sorted(log_odds_ratios.items(), key=lambda x: x[1], reverse=True)[:display]
    top_class2_words = sorted(log_odds_ratios.items(), key=lambda x: x[1], reverse=False)[:display]

    print("Top 25 words for class1:")
    for word, log_odds in top_class1_words:
        print(f"{word}: {log_odds}")

    print("\nTop 25 words for class2:")
    for word, log_odds in top_class2_words:
        print(f"{word}: {log_odds}")

In [130]:
logodds_with_uninformative_prior(class1_dataset, class2_dataset)

Top 25 words for class1:
.: 16.02434997642278
my: 10.143548785827782
?: 9.848510769215055
’s: 8.930977012975237
,: 8.654346955643229
love: 8.096973046887959
i: 7.97171010408455
]: 7.355755651275787
o: 7.304737054147855
[: 7.287653334126232
thou: 6.799500851336238
what: 6.495312899134617
!: 6.456331322593055
shall: 6.253815643879929
me: 6.123529940683077
lady: 6.0711931570506925
good: 5.393093178220031
night: 5.3455745228428855
not: 5.306334470112611
be: 5.173352483112935
here: 5.0957844524344065
death: 5.036230342517764
thy: 5.019786257698284
enter: 5.014015531533947
come: 5.00489802592075

Top 25 words for class2:
the: -22.2518094643615
of: -15.208750645695574
was: -12.760877751558157
he: -11.207616612977775
his: -10.234678423358822
had: -9.138179313738322
at: -6.517570839844043
in: -5.580092373575886
who: -5.465441345423155
into: -4.933633700813729
little: -4.915903929780011
only: -4.754835994048599
were: -4.744128144995564
has: -4.599517492157525
(: -4.484133479632404
upon: -4.45288