The log-odds ratio with a Dirichlet prior, either informative or uninformative (described in [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf)), is a common method for finding the terms that distinguish two datasets (see [Jurafsky et al. 2014](https://firstmonday.org/ojs/index.php/fm/article/view/4944/3863) for an example article that uses it to make an empirical argument). This method combines a number of desirable properties:

* it specifies an intuitive metric (the log-odds) for the ratio of two probabilities
* it can incorporate prior information in the form of pseudocounts, which can either act as a smoothing factor (in the uninformative case) or incorporate real information about the expected frequency of words overall.
* it accounts for variability of a frequency estimate by essentially converting the log-odds to a z-score.
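
To make the last two properties concrete, here is a minimal sketch that uses the smoothed log-odds and variance approximation defined later in this assignment. All counts, the constants, and the helper names `log_odds` and `z_score` are made up for illustration.

```python
import math

ALPHA = 0.01       # pseudocount added to every word type
VOCAB_SIZE = 2000  # hypothetical vocabulary size
TOTAL = 100_000    # hypothetical token count, same for both corpora

def log_odds(count):
    """Smoothed log-odds of a word with the given count in one corpus."""
    return math.log((count + ALPHA) / (TOTAL + VOCAB_SIZE * ALPHA - count - ALPHA))

def z_score(count_a, count_b):
    """Log-odds difference divided by its approximate standard deviation."""
    difference = log_odds(count_a) - log_odds(count_b)
    variance = 1 / (count_a + ALPHA) + 1 / (count_b + ALPHA)
    return difference / math.sqrt(variance)

# Both words are twice as frequent in corpus A as in corpus B,
# but the second estimate rests on 100x more evidence
rare = z_score(2, 1)
frequent = z_score(200, 100)

# A word absent from corpus B still gets a finite score thanks to the pseudocount
unseen = z_score(5, 0)

print(f"rare: {rare:.2f}  frequent: {frequent:.2f}  unseen in B: {unseen:.2f}")
```

The rare and the frequent word have nearly the same log-odds ratio, yet the frequent word receives a much larger score because its estimate is less variable.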

In this homework you will implement this ratio for two datasets of your choice to characterize the words that distinguish each one from the other.

Q1. Your first job is to find two datasets with some interesting opposition -- e.g., news articles from CNN vs. FoxNews, books written by Charles Dickens vs. James Joyce, screenplays of dramas vs. comedies. Be creative -- this should be driven by what interests you and should reflect your own originality. **These datasets cannot come from Kaggle**. Feel free to use web scraping (see [here](https://github.com/CU-ITSS/Web-Data-Scraping-S2023) for a great tutorial) or to copy and paste text manually. Aim for more than 10,000 tokens in each dataset.

Save those datasets in two files: "class1_dataset.txt" and "class2_dataset.txt".

Describe each of those datasets and their source in 100-200 words.

Q2. Tokenize those texts by filling out the `read_and_tokenize` function below (your choice of tokenizer). The input is a filename and the output should be a list of tokens.

In [10]:
import sys, operator, math, nltk
from collections import Counter

In [11]:
def read_and_tokenize(filename):
    
    with open(filename, encoding="utf-8") as file:
        tokens=[]
        # lowercase
        for line in file:
            data=line.rstrip().lower()
            # This dataset is already tokenized, so we can split on whitespace
            tokens.extend(data.split(" "))
        return tokens
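
The version above assumes text that is already tokenized. For raw text (for instance, scraped pages), one possible alternative is a regular-expression tokenizer. This is only a sketch: the name `regex_tokenize` and its pattern are illustrative assumptions, and it keeps contractions such as "don't" whole rather than splitting off "n't" as the example dataset does.

```python
import re

# Hypothetical pattern: a run of word characters with optional internal
# apostrophes (e.g. "don't"), or a single non-space punctuation character
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def regex_tokenize(text):
    """Lowercase raw text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())

print(regex_tokenize("Don't panic: it's only a movie!"))
```

Inside `read_and_tokenize`, a function like this could replace the `data.split(" ")` call on each line.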

In [12]:
class1_tokens=read_and_tokenize("../data/negative.reviews.txt")
class2_tokens=read_and_tokenize("../data/positive.reviews.txt")

Q3.  Now let's find the words that characterize each of those sources (with respect to the other). Implement the log-odds ratio with an uninformative Dirichlet prior.  This value, $\hat\zeta_w^{(i-j)}$ for word $w$ reflecting the difference in usage between corpus $i$ and corpus $j$, is given by the following equation:

$$
\hat\zeta_w^{(i-j)}= {\hat{d}_w^{(i-j)} \over \sqrt{\sigma^2\left(\hat{d}_w^{(i-j)}\right)}}
$$

Where: 

$$
\hat{d}_w^{(i-j)} = \log \left({y_w^i + \alpha_w} \over {n^i + \alpha_0 - y_w^i - \alpha_w} \right) -  \log \left({y_w^j + \alpha_w} \over {n^j + \alpha_0 - y_w^j - \alpha_w} \right)
$$

$$
\sigma^2\left(\hat{d}_w^{(i-j)}\right) \approx {1 \over {y_w^i + \alpha_w}} + {1 \over {y_w^j + \alpha_w} }
$$

And:

* $y_w^i = $ count of word $w$ in corpus $i$ (likewise for $j$)
* $\alpha_w$ = 0.01
* $V$ = size of vocabulary (number of distinct word types)
* $\alpha_0 = V * \alpha_w$
* $n^i = $ number of words in corpus $i$ (likewise for $j$)
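
As a quick sanity check of these definitions, consider a worked example with made-up counts (not taken from any real corpus): $y_w^i = 30$, $y_w^j = 10$, $n^i = n^j = 10{,}000$ and $V = 1{,}000$, so that $\alpha_0 = 10$. Using the natural logarithm, the three quantities come out approximately as:

$$
\hat{d}_w^{(i-j)} = \log \left({30.01} \over {9979.99} \right) - \log \left({10.01} \over {9999.99} \right) \approx 1.100
$$

$$
\sigma^2\left(\hat{d}_w^{(i-j)}\right) \approx {1 \over 30.01} + {1 \over 10.01} \approx 0.1332
$$

$$
\hat\zeta_w^{(i-j)} \approx {1.100 \over \sqrt{0.1332}} \approx 3.01
$$

The positive sign reflects that this hypothetical word is relatively more frequent in corpus $i$.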

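Here the two corpora are movie reviews of opposite sentiment (e.g., $i$ = positive reviews and $j$ = negative reviews). A positive value of $\hat\zeta_w^{(i-j)}$ means that word $w$ is used relatively more in corpus $i$; a negative value means it is used relatively more in corpus $j$. Using this metric, print out the 25 words most strongly aligned with the positive corpus, and the 25 words most strongly aligned with the negative corpus. Note that the example run below passes the negative reviews as the first argument, so its "Most positive" scores belong to words typical of negative reviews.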

In [13]:
def logodds_with_uninformative_prior(one_tokens, two_tokens, display=25):
    
    # Word counts y_w for each corpus (missing words count as 0)
    oneCounter=Counter(one_tokens)
    twoCounter=Counter(two_tokens)
    
    # Joint vocabulary over both corpora
    vocab=dict(oneCounter) 
    vocab.update(dict(twoCounter))
    # Corpus sizes n^i and n^j
    oneSum=sum(oneCounter.values())
    twoSum=sum(twoCounter.values())

    ranks={}
    alpha=0.01                # alpha_w: pseudocount for each word
    alphaV=len(vocab)*alpha   # alpha_0 = V * alpha_w
        
    for word in vocab:
        
        # Difference between the smoothed log-odds of the word in each corpus
        log_odds_ratio=math.log( (oneCounter[word] + alpha) / (oneSum+alphaV-oneCounter[word]-alpha) ) - math.log( (twoCounter[word] + alpha) / (twoSum+alphaV-twoCounter[word]-alpha) )
        # Approximate variance of that difference
        variance=1./(oneCounter[word] + alpha) + 1./(twoCounter[word] + alpha)
        
        # z-score: difference scaled by its standard deviation
        ranks[word]=log_odds_ratio/math.sqrt(variance)

    # Sort words from highest to lowest z-score
    sorted_x = sorted(ranks.items(), key=operator.itemgetter(1), reverse=True)
    
    # Highest z-scores: words most associated with one_tokens
    print("Most positive:")
    for k,v in sorted_x[:display]:
        print("%.3f\t%s" % (v,k))
    
    # Lowest z-scores: words most associated with two_tokens
    print("\nMost negative:")
    for k,v in reversed(sorted_x[-display:]):
        print("%.3f\t%s" % (v,k))

In [14]:
logodds_with_uninformative_prior(class1_tokens, class2_tokens)

Most positive:
15.874	bad
15.035	?
11.949	n't
10.960	movie
9.929	worst
9.448	i
9.122	just
8.676	...
8.618	was
7.999	no
7.521	do
7.512	awful
7.446	terrible
7.373	they
7.053	horrible
7.020	why
6.935	this
6.931	poor
6.709	boring
6.685	any
6.674	waste
6.661	script
6.601	worse
6.552	have
6.475	stupid

Most negative:
-9.608	great
-8.445	his
-8.221	best
-8.068	as
-8.008	and
-7.466	love
-7.232	war
-7.143	excellent
-6.697	wonderful
-6.559	is
-6.389	her
-6.052	performance
-5.937	,
-5.769	of
-5.722	life
-5.707	highly
-5.664	world
-5.547	perfect
-5.490	in
-5.466	always
-5.380	performances
-5.356	beautiful
-5.198	most
-5.148	tony
-5.092	loved
