The log-odds ratio with an informative (and uninformative) Dirichlet prior (described in [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf)) is a common method for finding distinctive terms in two datasets (see [Jurafsky et al. 2014](https://firstmonday.org/ojs/index.php/fm/article/view/4944/3863) for an example article that uses it to make an empirical argument). This method for finding distinguishing words combines a number of desirable properties:

* it specifies an intuitive metric (the log-odds ratio) for comparing two probabilities.
* it incorporates prior information in the form of pseudocounts, which can either act as a smoothing factor (in the uninformative case) or incorporate real information about the expected frequency of words overall.
* it accounts for variability of a frequency estimate by essentially converting the log-odds to a z-score.
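
As a minimal sketch of the first property, the log-odds of a probability and the difference of two log-odds can be computed as follows. The function name `log_odds` and the two relative frequencies are hypothetical values chosen only for illustration; they are not taken from the review data.

```python
import math

def log_odds(p):
    """Log-odds (logit) of a probability p strictly between 0 and 1."""
    return math.log(p / (1 - p))

# Hypothetical relative frequencies of one word in two corpora.
p_i = 0.002
p_j = 0.001

# Positive when the word is relatively more frequent in corpus i,
# negative when it is relatively more frequent in corpus j.
log_odds_ratio = log_odds(p_i) - log_odds(p_j)
print("%.3f" % log_odds_ratio)
```

The sign of the result indicates which corpus favors the word; the variance term in the third property is what turns this raw difference into a z-score.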

In this homework you will implement both of these ratios and compare the results.

In [1]:
import sys, operator, math, nltk
from collections import Counter

In [2]:
def read_and_tokenize(filename):
    
    with open(filename, encoding="utf-8") as file:
        tokens=[]
        # lowercase
        for line in file:
            data=line.rstrip().lower()
            # This dataset is already tokenized, so we can split on whitespace
            tokens.extend(data.split(" "))
        return tokens

The data we'll use in this case comes from a sample of 1000 positive and 1000 negative movie reviews from the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/).  The version of the data used in this homework has already been tokenized for you.

In [3]:
negative_tokens=read_and_tokenize("../data/negative.reviews.txt")
positive_tokens=read_and_tokenize("../data/positive.reviews.txt")

Q1.  Implement the log-odds ratio with an uninformative Dirichlet prior.  This value, $\hat\zeta_w^{(i-j)}$ for word $w$ reflecting the difference in usage between corpus $i$ and corpus $j$, is given by the following equation:

$$
\hat\zeta_w^{(i-j)}= {\hat{d}_w^{(i-j)} \over \sqrt{\sigma^2\left(\hat{d}_w^{(i-j)}\right)}}
$$

Where: 

$$
\hat{d}_w^{(i-j)} = \log \left( \frac{y_w^i + \alpha_w}{n^i + \alpha_0 - y_w^i - \alpha_w} \right) - \log \left( \frac{y_w^j + \alpha_w}{n^j + \alpha_0 - y_w^j - \alpha_w} \right)
$$

$$
\sigma^2\left(\hat{d}_w^{(i-j)}\right) \approx {1 \over {y_w^i + \alpha_w}} + {1 \over {y_w^j + \alpha_w} }
$$

And:

* $y_w^i = $ count of word $w$ in corpus $i$ (likewise for $j$)
* $\alpha_w$ = 0.01
* $V$ = size of vocabulary (number of distinct word types)
* $\alpha_0 = V * \alpha_w$
* $n^i = $ number of words in corpus $i$ (likewise for $j$)
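
For a single word, the three equations above can be sketched as one small function. This is only an illustration under the definitions listed here: the name `zeta_uninformative` and the counts passed to it are hypothetical, not part of the assignment's data.

```python
import math

def zeta_uninformative(y_i, y_j, n_i, n_j, vocab_size, alpha_w=0.01):
    """z-scored log-odds ratio for one word with a constant pseudocount alpha_w."""
    alpha_0 = vocab_size * alpha_w

    # Difference in smoothed log-odds between corpus i and corpus j.
    d = (math.log((y_i + alpha_w) / (n_i + alpha_0 - y_i - alpha_w))
         - math.log((y_j + alpha_w) / (n_j + alpha_0 - y_j - alpha_w)))

    # Approximate variance of that difference.
    variance = 1.0 / (y_i + alpha_w) + 1.0 / (y_j + alpha_w)

    return d / math.sqrt(variance)

# Hypothetical counts: the word occurs 30 times in corpus i and 10 times in
# corpus j, each corpus has 1000 tokens, and the vocabulary has 500 types.
z = zeta_uninformative(30, 10, 1000, 1000, 500)
print("%.3f" % z)
```

Swapping the roles of the two corpora flips the sign of the score, and a word with identical counts in two equally sized corpora scores zero.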

Here the two corpora are the positive movie reviews ($i$ = positive) and the negative movie reviews ($j$ = negative). Using this metric, print out the 25 words most strongly aligned with the positive corpus and the 25 words most strongly aligned with the negative corpus.

In [4]:
def logodds_with_uninformative_prior(one_tokens, two_tokens, display=25):
    
    def get_counter_from_list(tokens):
        counter=Counter()
        for token in tokens:
            counter[token]+=1
        return counter


    oneCounter=get_counter_from_list(one_tokens)
    twoCounter=get_counter_from_list(two_tokens)
    
    vocab=dict(oneCounter) 
    vocab.update(dict(twoCounter))
    oneSum=sum(oneCounter.values())
    twoSum=sum(twoCounter.values())

    ranks={}
    alpha=0.01
    alphaV=len(vocab)*alpha
        
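    for word in vocab:
        
        # difference in smoothed log-odds of the word between the two corpora
        log_odds_ratio=math.log( (oneCounter[word] + alpha) / (oneSum+alphaV-oneCounter[word]-alpha) ) - math.log( (twoCounter[word] + alpha) / (twoSum+alphaV-twoCounter[word]-alpha) )
        # approximate variance of that difference
        variance=1./(oneCounter[word] + alpha) + 1./(twoCounter[word] + alpha)
        
        # z-score: the log-odds difference divided by its standard deviation
        ranks[word]=log_odds_ratio/math.sqrt(variance)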

    sorted_x = sorted(ranks.items(), key=operator.itemgetter(1), reverse=True)
    
    print("Most positive:")
    for k,v in sorted_x[:display]:
        print("%.3f\t%s" % (v,k))
    
    print("\nMost negative:")
    for k,v in reversed(sorted_x[-display:]):
        print("%.3f\t%s" % (v,k))

In [5]:
logodds_with_uninformative_prior(positive_tokens, negative_tokens)

Most positive:
9.608	great
8.445	his
8.221	best
8.068	as
8.008	and
7.466	love
7.232	war
7.143	excellent
6.697	wonderful
6.559	is
6.389	her
6.052	performance
5.937	,
5.769	of
5.722	life
5.707	highly
5.664	world
5.547	perfect
5.490	in
5.466	always
5.380	performances
5.356	beautiful
5.198	most
5.148	tony
5.092	loved

Most negative:
-15.874	bad
-15.035	?
-11.949	n't
-10.960	movie
-9.929	worst
-9.448	i
-9.122	just
-8.676	...
-8.618	was
-7.999	no
-7.521	do
-7.512	awful
-7.446	terrible
-7.373	they
-7.053	horrible
-7.020	why
-6.935	this
-6.931	poor
-6.709	boring
-6.685	any
-6.674	waste
-6.661	script
-6.601	worse
-6.552	have
-6.475	stupid


Q2: As you increase the constant $\alpha_w$ in the equations above, what happens to $\hat\zeta_w^{(i-j)}$, $\hat{d}_w^{(i-j)}$ and $\sigma^2\left(\hat{d}_w^{(i-j)}\right)$ (i.e., do they get bigger or smaller)?  Answer this by plugging the following values into your implementation of these quantities and varying $\alpha_w$ (and, consequently, $\alpha_0$).

* $y_w^i=34$
* $y_w^j=17$
* $n^i=1000$
* $n^j=1000$
* $V=500$

In [7]:
y_i=34
y_j=17
oneSum=1000
twoSum=1000
V=500
for alpha in [0.01, 0.1, 1, 10, 100, 1000]:
    alphaV=V*alpha
    log_odds_ratio=math.log( (y_i + alpha) / (oneSum+alphaV-y_i-alpha) ) - math.log( (y_j + alpha) / (twoSum+alphaV-y_j-alpha) )
    variance=1./(y_i + alpha) + 1./(y_j + alpha)

    print("%s:\t%.3f\t%.3f\t%.3f" % (alpha, log_odds_ratio, variance, log_odds_ratio/math.sqrt(variance)))

0.01:	0.710	0.088	2.392
0.1:	0.707	0.088	2.385
1:	0.677	0.084	2.332
10:	0.491	0.060	2.009
100:	0.136	0.016	1.075
1000:	0.017	0.002	0.376
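
One way to see why all three values shrink in the output above is a first-order expansion of the equations for large $\alpha_w$. This is only a sketch, assuming $\alpha_w \gg y_w$, $(V-1)\alpha_w \gg n - y_w$ and $V > 1$:

$$
\hat{d}_w^{(i-j)} \approx \frac{1}{\alpha_w}\left[\left(y_w^i - y_w^j\right) - \frac{\left(n^i - y_w^i\right) - \left(n^j - y_w^j\right)}{V-1}\right]
$$

$$
\sigma^2\left(\hat{d}_w^{(i-j)}\right) \approx \frac{2}{\alpha_w}, \qquad \hat\zeta_w^{(i-j)} \approx \hat{d}_w^{(i-j)} \sqrt{\frac{\alpha_w}{2}} \propto \frac{1}{\sqrt{\alpha_w}}
$$

With the values above ($n^i = n^j = 1000$, $V = 500$, $y_w^i - y_w^j = 17$) and $\alpha_w = 1000$, the approximation gives roughly 0.017, 0.002 and 0.38, close to the last row of the output. The pseudocounts swamp the observed counts, so both corpora look increasingly alike and the score moves toward zero.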


Now let's make that prior informative by including information about the overall frequency of a given word in a background corpus (i.e., a corpus that represents general word usage, without regard for labeled subcorpora).  To do so, there are only two small changes to make:

* We need to gather a background corpus $b$ and calculate $\hat\pi_w$, the relative frequency of word $w$ in $b$ (i.e., the number of times $w$ occurs in $b$ divided by the number of words in $b$).

* In the uninformative prior above, $\alpha_w$ was a constant (0.01) and $\alpha_0 = V * \alpha_w$.  Let us now set $\alpha_0 = 1000$ and $\alpha_w = \hat\pi_w * \alpha_0$.  This reflects a pseudocount capturing the fractional number of times we would expect to see word $w$ in a sample of 1000 words.

This allows us to specify that a common word like "the" (which has a relative frequency of $\approx 0.04$) would have $\alpha_w = 40$, while an infrequent word like "beneficiaries" (relative frequency $\approx 0.00002$) would have $\alpha_w = 0.02$. 
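
The two pseudocounts quoted above follow directly from $\alpha_w = \hat\pi_w * \alpha_0$. A minimal sketch, using only the approximate relative frequencies given in the text (the dictionary name `relative_freq` is hypothetical):

```python
alpha_0 = 1000

# Approximate relative frequencies quoted in the text above.
relative_freq = {"the": 0.04, "beneficiaries": 0.00002}

# alpha_w = pi_w * alpha_0: the expected number of occurrences of each word
# in a sample of alpha_0 = 1000 words.
pseudocounts = {word: freq * alpha_0 for word, freq in relative_freq.items()}

for word, alpha_w in pseudocounts.items():
    print("%s\t%.2f" % (word, alpha_w))
```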

Q3. Implement a log-odds ratio with informative prior, using a larger background corpus of 5M tokens drawn from the same dataset (given to you as `priors` below, which contains the relative frequencies of words calculated from that corpus) and set $\alpha_0 = 1000$. Using this metric, print out again the 25 words most strongly aligned with the positive corpus, and 25 words most strongly aligned with the negative corpus.  Is there a meaningful difference?

In [8]:
def read_priors(filename):
    counts=Counter()
    freqs={}
    tokens=read_and_tokenize(filename)
    total=len(tokens)

    for token in tokens:
        counts[token]+=1

    for word in counts:
        freqs[word]=counts[word]/total

    return freqs
    
priors=read_priors("../data/sentiment.background.txt")

In [11]:
def logodds_with_informative_prior(one_tokens, two_tokens, priors, display=25):
    
    def get_counter_from_list(tokens):
        counter=Counter()
        for token in tokens:
            counter[token]+=1
        return counter
    
    a0=1000
        
    oneCounter=get_counter_from_list(one_tokens)
    twoCounter=get_counter_from_list(two_tokens)
    
    vocab=dict(oneCounter) 
    vocab.update(dict(twoCounter))
    oneSum=sum(oneCounter.values())
    twoSum=sum(twoCounter.values())

    ranks={}

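    for word in vocab:

        # informative pseudocount: expected occurrences of the word in a0 tokens
        prior=priors[word]*a0

        # difference in smoothed log-odds of the word between the two corpora
        log_odds_ratio=math.log( (oneCounter[word] + prior) / (oneSum+a0-oneCounter[word]-prior) ) - math.log( (twoCounter[word] + prior) / (twoSum+a0-twoCounter[word]-prior) )
        # approximate variance of that difference
        variance=1./(oneCounter[word] + prior) + 1./(twoCounter[word] + prior)
        
        # z-score: the log-odds difference divided by its standard deviation
        ranks[word]=log_odds_ratio/math.sqrt(variance)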

    sorted_x = sorted(ranks.items(), key=operator.itemgetter(1), reverse=True)
    
    print("Most positive:")
    for k,v in sorted_x[:display]:
        print("%.3f\t%s" % (v,k))
    
    print("\nMost negative:")
    for k,v in reversed(sorted_x[-display:]):
        print("%.3f\t%s" % (v,k))
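
Because `priors` is a plain dictionary, the lookup `priors[word]` raises a `KeyError` for any word that does not occur in the background corpus. One possible way to guard against that is sketched below; the helper name `prior_pseudocount` and the floor value are assumptions made for illustration, not part of the assignment, and a different fallback may suit other data better.

```python
def prior_pseudocount(word, priors, alpha_0=1000, floor=1e-8):
    """Return alpha_w = pi_w * alpha_0 for a word.

    `floor` is a hypothetical fallback relative frequency for words missing
    from the background corpus; it keeps the pseudocount strictly positive
    so the logarithms in the log-odds ratio stay defined.
    """
    return priors.get(word, floor) * alpha_0

# Toy background frequencies, for illustration only.
toy_priors = {"the": 0.04}
seen = prior_pseudocount("the", toy_priors)
unseen = prior_pseudocount("zyzzyva", toy_priors)
```

A strictly positive fallback matters because a word with a zero count in one corpus and a zero pseudocount would otherwise lead to `math.log(0)`.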

In [12]:
logodds_with_informative_prior(positive_tokens, negative_tokens, priors)

Most positive:
9.591	great
8.428	his
8.207	best
8.052	as
7.990	and
7.454	love
7.226	war
7.136	excellent
6.693	wonderful
6.544	is
6.377	her
6.043	performance
5.922	,
5.755	of
5.712	life
5.704	highly
5.655	world
5.540	perfect
5.477	in
5.456	always
5.371	performances
5.346	beautiful
5.189	most
5.152	tony
5.085	loved

Most negative:
-15.858	bad
-15.012	?
-11.930	n't
-10.942	movie
-9.958	worst
-9.434	i
-9.107	just
-8.661	...
-8.604	was
-7.986	no
-7.513	awful
-7.509	do
-7.450	terrible
-7.361	they
-7.055	horrible
-7.008	why
-6.925	this
-6.923	poor
-6.719	waste
-6.702	boring
-6.674	any
-6.652	script
-6.597	worse
-6.541	have
-6.468	stupid
