# Bonus: Computing chi-squared and (non-normalized) correlation statistics

As a bonus, we will calculate the chi-squared statistic for all of the words in two novels, *Pride and Prejudice* and *Garland for Girls*, and then calculate the non-normalized correlation (which we will call the partisan score) for a few sample words in the corpus.

Don't worry if you don't understand all of this. If it helps some of you, great. If it's a bit advanced, no problem: this will not be part of any assignment. Stick with me as much as you can.

### 0. Document Term Matrix
First, I'll create a document term matrix from the two novels. We did this in the tutorial on February 15.

In [105]:
import pandas
from sklearn.feature_extraction.text import CountVectorizer

text_list = []
#open and read the novels, save them as variables
austen_string = open('../data/Austen_PrideAndPrejudice.txt', encoding='utf-8').read()
alcott_string = open('../data/Alcott_GarlandForGirls.txt', encoding='utf-8').read()

#append each novel to the list
text_list.append(austen_string)
text_list.append(alcott_string)

countvec = CountVectorizer(stop_words="english")


# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
novels_df = pandas.DataFrame(countvec.fit_transform(text_list).toarray(), columns=countvec.get_feature_names_out())
novels_df

   000  1500  15th  1813  1887  18th  20  2001  26th  30  ...  york  young  younge  younger  youngest  youngsters  youth  youthful  youths  zip
0    0     0     1     2     0     1   0     0     1   0  ...     1    129       4       29        14           0      9         0       1    0
1    1     1     1     0     2     0   1     1     0   1  ...     2    109       0        7         2           1      9         1       3    1
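
To see what a document term matrix holds, here is a minimal pure-Python sketch on two made-up sentences. It only illustrates the idea of one row per document and one column per word; it does not reproduce `CountVectorizer`'s tokenization or stop-word removal, and the sentences and variable names are invented for this example.

```python
from collections import Counter

# Two made-up "documents" (hypothetical, not taken from the novels)
docs = ["the sister wrote to her sister", "the child wrote home"]

# Count the words in each document
counts = [Counter(doc.split()) for doc in docs]

# The vocabulary is every distinct word, sorted alphabetically
vocab = sorted(set(word for c in counts for word in c))

# One row per document, one column per vocabulary word
# (a Counter returns 0 for a word it has never seen)
dtm = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in dtm:
    print(row)
```

Each row plays the role of one novel in `novels_df`, and each column holds the counts for one word.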


### 1. Chi-Squared for sample words

Let's look at the frequency for two words, "sister" and "child".

In [106]:
novels_df[['sister', 'child']]

   sister  child
0     218     14
1      40     79


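To calculate the chi-squared statistic for each of these words, we need its expected frequency: the count we would see in each novel if the word were used equally often in both. To get it, we divide the word's total frequency across both novels by two. (Splitting the total evenly treats the two novels as if they were the same length.)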

In [107]:
expected_sister = novels_df['sister'].sum(axis=0)/2
expected_sister

129.0

In [108]:
expected_child = novels_df['child'].sum(axis=0)/2
expected_child

46.5

To calculate the chi-squared statistic, for each novel we subtract the expected frequency from the actual frequency, square the difference, and divide by the expected frequency. We then add the two results together.
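
Written as a formula, the calculation looks like this. The symbols are introduced here only for illustration: $O_1$ and $O_2$ are the observed counts in the two novels and $E$ is the shared expected frequency. The second line plugs in the counts for "sister" shown above.

```latex
\chi^2 = \frac{(O_1 - E)^2}{E} + \frac{(O_2 - E)^2}{E}

\chi^2_{\text{sister}} = \frac{(218 - 129)^2}{129} + \frac{(40 - 129)^2}{129}
                       = \frac{2 \cdot 89^2}{129} \approx 122.81
```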

In [109]:
chi_sister = ((novels_df.loc[0,'sister'] - expected_sister)**2 / expected_sister) + ((novels_df.loc[1,'sister'] - expected_sister)**2 / expected_sister)
chi_sister

122.8062015503876

In [110]:
chi_child = ((novels_df.loc[0,'child'] - expected_child)**2 / expected_child) + ((novels_df.loc[1,'child'] - expected_child)**2 / expected_child)
chi_child

45.43010752688172

These are large values. Let's try a word whose frequency is much closer across the two novels. The result is a much smaller chi-squared statistic.

In [111]:
novels_df['writing']

0    14
1    11
Name: writing, dtype: int64

In [112]:
# variables named after the word being measured, 'writing'
expected_writing = novels_df['writing'].sum(axis=0)/2
chi_writing = ((novels_df.loc[0,'writing'] - expected_writing)**2 / expected_writing) + ((novels_df.loc[1,'writing'] - expected_writing)**2 / expected_writing)
chi_writing

0.35999999999999999
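
The arithmetic in this section repeats the same steps for every word, so it can be wrapped in a small helper. This is only a sketch: `chi_squared` is a hypothetical name, not something defined elsewhere in this tutorial, and it assumes the same equal-use expectation as above. The counts are copied from the outputs in this section.

```python
def chi_squared(count_a, count_b):
    """Chi-squared for one word, assuming equal expected use in both texts."""
    expected = (count_a + count_b) / 2
    return ((count_a - expected) ** 2 / expected
            + (count_b - expected) ** 2 / expected)

# Counts copied from the outputs above: (Austen, Alcott)
print(round(chi_squared(218, 40), 2))  # sister
print(round(chi_squared(14, 79), 2))   # child
print(round(chi_squared(14, 11), 2))   # writing
```

The three printed values should match the cell outputs above once rounded to two decimals.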

### 2. Partisan Score

Next, we can find the partisan score for our chosen words. We do this simply, by multiplying the word frequency in *Pride and Prejudice* by 1, multiplying the word frequency in *Garland for Girls* by -1, and adding the two together. A partisan score above 0 indicates the word is used more often in Austen; a negative score means it is used more often in Alcott.

In [113]:
sister_corr = novels_df.loc[0,'sister']*1 + novels_df.loc[1,'sister']*-1
sister_corr

178

In [114]:
child_corr = (novels_df.loc[0,'child']*1) + (novels_df.loc[1,'child']*-1)
child_corr

-65

In [115]:
writing_corr = (novels_df.loc[0,'writing']*1) + (novels_df.loc[1,'writing']*-1)
writing_corr

3

In [116]:
writes_corr = (novels_df.loc[0,'writes']*1) + (novels_df.loc[1,'writes']*-1)
writes_corr

0

What does a partisan score of 0 mean?

In [117]:
novels_df['writes']

0    1
1    1
Name: writes, dtype: int64
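
One way to read the score as a non-normalized correlation is to treat it as a sum of each novel's count multiplied by a label, +1 for Austen and -1 for Alcott. A minimal sketch using counts shown above; the `labels` list and the `partisan_score` function are names made up for this illustration.

```python
labels = [1, -1]  # +1 for Pride and Prejudice, -1 for Garland for Girls

def partisan_score(counts):
    """Sum of count * label over the two novels."""
    return sum(count * label for count, label in zip(counts, labels))

print(partisan_score([218, 40]))  # sister
print(partisan_score([14, 79]))   # child
print(partisan_score([1, 1]))     # writes
```

A score of 0 only says that the two counts are equal. It does not say whether the word is rare in both novels or common in both.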

### 3. Chi-Squared for every word, using a for-loop

Now we can calculate the chi-squared statistic for each word in our corpus. To do this we have to introduce the for loop. We've seen this before inside list comprehensions, but we're splitting it out now into multiple lines. To think about this intuitively, take this example:

For every child that knocks on my door on Halloween I will do the following:
1. Ask them what their costume is
2. Give them a piece of candy
3. Cackle wildly

The for loop in Python is intuitively the same. For every element in a list, we want to do something to that element.
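
As a small illustration (the word list is made up for this example), here is the same task written first as a list comprehension and then as a for loop spread over several lines:

```python
words = ["sister", "child", "writing"]

# List comprehension: everything on one line
lengths = [len(w) for w in words]

# The same work as a for loop, split across several lines
lengths_loop = []
for w in words:
    n = len(w)              # do something with the element
    lengths_loop.append(n)  # save the result

print(lengths)
print(lengths_loop)
```

Both versions build the same list. The loop form is easier to extend when each element needs more than one step.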

In this case, we will loop through all columns in our dataframe and calculate the chi-squared statistic. We will then append both the column name (our word) and the chi-squared statistic to a list using .append().

In [118]:
columns = list(novels_df)
chi_list = []

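for c in columns:
    # Note: `/ novels_df[c].sum(axis=0)/2` is evaluated left to right (divide by the
    # total, then by 2), so each score is one quarter of the chi-squared statistic
    # from section 1. The ranking of words is unaffected; wrap the expected
    # frequency in parentheses, `/ (novels_df[c].sum(axis=0)/2)`, to get the statistic itself.
    chi_list.append([c,((novels_df.loc[0,c] - novels_df[c].sum(axis=0)/2)**2 / novels_df[c].sum(axis=0)/2) + ((novels_df.loc[1,c] - novels_df[c].sum(axis=0)/2)**2 / novels_df[c].sum(axis=0)/2)])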

In [119]:
chi_list[:10]

[['000', 0.25],
 ['1500', 0.25],
 ['15th', 0.0],
 ['1813', 0.5],
 ['1887', 0.5],
 ['18th', 0.25],
 ['20', 0.25],
 ['2001', 0.25],
 ['26th', 0.25],
 ['30', 0.25]]

We can now sort this list by the second element of each entry, the chi-squared score, and print the top 50 "partisan" words. (Each entry is a two-element list rather than a tuple, but that makes no difference here.)

In [122]:
chi_list.sort(key=lambda x: x[1], reverse=True)
chi_list[:50]

[['elizabeth', 155.52507836990597],
 ['mr', 133.99658314350796],
 ['darcy', 104.5],
 ['bennet', 81.0],
 ['bingley', 76.75],
 ['jane', 52.160493827160494],
 ['wickham', 48.5],
 ['collins', 42.549450549450547],
 ['old', 42.275229357798167],
 ['lydia', 42.005813953488371],
 ['catherine', 31.5],
 ['family', 31.117283950617285],
 ['mrs', 30.748886414253896],
 ['sister', 30.7015503875969],
 ['don', 24.300000000000001],
 ['replied', 24.288095238095238],
 ['gardiner', 24.25],
 ['lizzy', 24.25],
 ['work', 23.859467455621303],
 ['gutenberg', 23.25],
 ['soon', 22.816787003610109],
 ['longbourn', 22.0],
 ['rosy', 22.0],
 ['charlotte', 21.25],
 ['jenny', 21.0],
 ['becky', 20.75],
 ['feelings', 20.544444444444444],
 ['project', 20.261764705882353],
 ['ll', 20.011904761904763],
 ['father', 19.848101265822784],
 ['ethel', 19.75],
 ['letter', 19.467557251908396],
 ['brother', 18.299382716049383],
 ['netherfield', 18.25],
 ['lucas', 17.75],
 ['jessie', 17.5],
 ['emily', 17.25],
 ['manner', 17.1203703703
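
As an alternative to the loop, pandas can apply the section 1 formula to every column at once. The sketch below is an assumption-laden illustration rather than part of the original tutorial: `toy_df` is a hypothetical four-word stand-in for `novels_df`, built from counts shown earlier, and the expected frequency is computed once and used as the divisor, as in section 1.

```python
import pandas

# A tiny stand-in for novels_df (row 0 = Austen, row 1 = Alcott)
toy_df = pandas.DataFrame({"sister": [218, 40], "child": [14, 79],
                           "writing": [14, 11], "writes": [1, 1]})

# Expected frequency for every word at once: column total divided by two
expected = toy_df.sum(axis=0) / 2

# (observed - expected)**2 / expected, summed over the two novels
chi = ((toy_df - expected) ** 2 / expected).sum(axis=0)

# Largest chi-squared values first
print(chi.sort_values(ascending=False))
```

Because `expected` is a Series indexed by word, pandas lines it up with the matching columns of `toy_df` automatically, so no explicit loop is needed.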

Exercise:

Calculate the partisan score for each word in the corpus and print the most partisan words for each novel.