# Comparing and contrasting groups of texts
We know how to easily make document-term matrices containing the raw and scaled counts for each of our texts. We can use these DTMs to compare groups of our texts based on groups derived from the metadata we have associated with them.

## Importing `make_dtm`

I have created a `.py` file containing all of our `make_dtm` functions that we can easily import. That way, we don't have to copy-paste all of that code in every time!

To import it, we have to have the `make_dtm.py` file located in our working directory:

In [130]:
import os
os.getcwd() # which directory am I in? "cwd" stands for "current working directory"

'/Users/e/Library/Mobile Documents/com~apple~CloudDocs/PhD/ltm/notebooks'

In [131]:
'make_dtm.py' in os.listdir() # is the file in our list of files? we want this to be True

True

Now we can import `make_dtm`:

In [18]:
from make_dtm import * # this means, "from the file make_dtm, import everything." the * means 'everything'
# * is sometimes known as a wildcard operator

Using the syntax above, we import *all* of the functions in `make_dtm`, including the ones that it depends on like `absolute_paths`, etc.

We can also call `help()` to read the documentation about our function:

In [132]:
help(make_dtm)

Help on function make_dtm in module make_dtm:

make_dtm(directory, scaled=False, drop_below=None)
    Makes a document-term matrix from a directory of .txt files.
    `scaled` option scaled frequencies when set to True.
    `drop_below` sets the rate of occurrence at which words will be dropped. Default is None. i.e. retain all words.



In [20]:
hp_dir = '/Users/e/code/literarytextmining/corpora/harry_potter/texts'
make_dtm(hp_dir)

filepath,a,aaaaaaaaaaaaaarrrrrrrrrrrrggggghhhhh,aaaaaaaaargh,aaaaaaaargh,aaaaaaaarrrrrgh,aaaaaaand,aaaaaaarrrgh,aaaaaah,aaaaaand,aaaaah,...,zombie,zone,zonko,zonkos,zoo,zoological,zoom,zoomed,zooming,éclairs
1 Sorcerers Stone.txt,1079,,,,,,,,,,...,2.0,,,,7.0,,1.0,1,2.0,1.0
2 Chamber of Secrets.txt,1901,,,,,,,,,,...,,,,,2.0,,,2,,
3 Prisoner of Azkaban.txt,2248,1.0,,,,,1.0,,,,...,1.0,,1.0,10.0,,,,9,3.0,
4 Goblet of Fire.txt,3714,,,,1.0,1.0,,,1.0,1.0,...,,,,1.0,,1.0,4.0,9,12.0,
5 Order of the Phoenix.txt,5017,,1.0,,,,,1.0,,,...,,1.0,,3.0,,,2.0,23,7.0,
6 Half-Blood Prince.txt,3361,,,1.0,,,,,,,...,,,,2.0,,,,7,2.0,2.0
7 Deathly Hallows.txt,3645,,,,,,,,,,...,,,,,,,1.0,6,5.0,


I added one other option to let us drop rare words from our DTM as we make it:

In [133]:
df = make_dtm(hp_dir, scaled = True, drop_below=15)

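`drop_below` takes an integer `n` and gets rid of words whose frequency falls below `n` times the lowest frequency in the corpus.
In a raw DTM the lowest count is always one, so `n` works as a minimum count; expressing the threshold as a multiple of the lowest value lets the same integer work for both scaled and raw document-term matrices.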
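
The source of `make_dtm` is not shown here, so the following is only a sketch of one plausible reading of `drop_below`. The function name `drop_rare_words` and the toy data are hypothetical; the real implementation may compute its threshold differently.

```python
import pandas as pd

def drop_rare_words(dtm, n):
    """Hypothetical stand-in for `drop_below`: keep only the words whose
    corpus-wide total is at least n times the smallest nonzero total."""
    totals = dtm.sum(axis=0)              # column sums; NaN cells are skipped
    lowest = totals[totals > 0].min()     # 1 for raw counts, a small fraction if scaled
    return dtm.loc[:, totals >= n * lowest]

nan = float("nan")
# Toy raw-count DTM (made-up numbers): two documents, three words
toy_dtm = pd.DataFrame(
    {"the": [10, 12], "wand": [3, nan], "zoo": [1, nan]},
    index=["doc1.txt", "doc2.txt"],
)

kept = drop_rare_words(toy_dtm, 3)
print(list(kept.columns))  # ['the', 'wand']
```

Because the threshold is a multiple of the lowest value rather than a fixed count, the same `n` would behave the same way on a scaled version of this toy table.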

In [134]:
df.head()

filepath,a,aaah,aback,abandoned,abandoning,abbott,aberforth,ability,able,abou,...,yule,zabini,zacharias,zat,ze,zero,zonkos,zoo,zoomed,zooming
1 Sorcerers Stone.txt,0.024407,2.3e-05,2.3e-05,,,2.3e-05,,,0.000226,6.8e-05,...,,2.3e-05,,,,,,0.000158,2.3e-05,4.5e-05
2 Chamber of Secrets.txt,0.021229,,,1.1e-05,1.1e-05,,,,0.000402,,...,,,,,,1.1e-05,,2.2e-05,2.2e-05,
3 Prisoner of Azkaban.txt,0.020045,,1.8e-05,1.8e-05,2.7e-05,9e-06,,9e-06,0.000321,5.3e-05,...,,,,,,2.7e-05,8.9e-05,,8e-05,2.7e-05
4 Goblet of Fire.txt,0.018758,2.5e-05,3.5e-05,3e-05,1.5e-05,1e-05,1e-05,1e-05,0.000263,4e-05,...,4e-05,,,2e-05,8.1e-05,1.5e-05,5e-06,,4.5e-05,6.1e-05
5 Order of the Phoenix.txt,0.019021,4e-06,3.8e-05,3.4e-05,8e-06,2.7e-05,4e-06,2.3e-05,0.000364,7.6e-05,...,1.9e-05,,8.7e-05,,,1.1e-05,1.1e-05,,8.7e-05,2.7e-05


# Comparing differences between groups of texts
There are a *ton* of different ways that we can calculate differences between groups of texts. Each of them serves a specific function, and some are more appropriate for particular use cases.


## Different ways of calculating distinctive words
We're going to discuss three different ways of calculating **distinctive words**, by which I mean simply words that are used differently between two different texts or groups of texts.

They are:
1. **Difference of means**: This analyzes the average difference per document between two groups. Mainly useful if you assume that your documents are broadly similar.
2. **Term-frequency `*` inverse document frequency**: This measure compares how rare words are in your corpus with how frequent they are in a particular document. Words that are rare in the corpus but frequent in an individual document are likely distinctive of that document. For example, `accio` would be common in *Harry Potter*, but absent in almost every other book. That makes it a distinctive word.
3. **Fisher's exact test**: This statistical test compares the total number of times a word was used and not used by two groups, which we'll call A and B. The test calculates the *odds ratio*: how much higher the odds are that group A uses that particular word as compared to group B. For example, a result could tell you that "the odds of using `accio` are 300 times higher for J.K. Rowling than for a group of authors who are not J.K. Rowling." It also calculates the p-value, which measures the probability of obtaining a result at least that extreme if the null hypothesis were correct. A low p-value indicates that we can reject the null hypothesis. In our case, the null hypothesis is that our groups use the word at equal rates. Rejecting the null hypothesis means that the groups use the word at rates that are significantly different.
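
To make the third method more concrete, here is a minimal pure-Python sketch of Fisher's exact test on a 2x2 table. The function name `fisher_exact_2x2` and the counts are made up for illustration; in practice you would more likely reach for a statistics library than write this by hand.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Odds ratio and two-sided p-value for the table
           word   other words
    A       a         b
    B       c         d
    """
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")

    row_a, row_b, col_word = a + b, c + d, a + c
    n_tables = comb(row_a + row_b, col_word)

    def prob(x):
        # Probability that group A accounts for x of the word's uses,
        # with the row and column totals held fixed
        return comb(row_a, x) * comb(row_b, col_word - x) / n_tables

    p_observed = prob(a)
    lowest = max(0, col_word - row_b)
    highest = min(row_a, col_word)
    # Two-sided: add up every table that is no more likely than the observed one
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return odds_ratio, p_value

# Toy counts: group A uses the word 8 times in 10 tokens, group B 2 times in 10
odds, p = fisher_exact_2x2(8, 2, 2, 8)
print(odds)          # 16.0
print(round(p, 3))   # 0.023
```

With these toy counts the odds of group A using the word are 16 times those of group B, and a p-value of about 0.023 would let us reject the null hypothesis at the conventional 0.05 level.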


## Grouping texts
In order to calculate the differences between two groups of texts, you need to have groups. These might be characteristics of the texts themselves (e.g. texts published before or after a certain date, texts with protagonists of specific genders), or they might have something to do with the characteristics of the authors (e.g. white vs. POC word usage).

For these examples, we're going to look for political differences in State of the Union addresses by U.S. presidents between 1900 and 2019. Our groups are going to be the two major political parties, Democrats and Republicans.

In [135]:
meta = '/Users/e/code/literarytextmining/corpora/sotu_1900-2019/meta.csv'

In [136]:
# get the metadata
import pandas as pd
meta = pd.read_csv(meta)

In [137]:
meta.head()

Unnamed: 0,president,year,filepath,party
0,McKinley,1900,1900.McKinley.txt,Republican
1,Roosevelt,1901,1901.Roosevelt.txt,Republican
2,Roosevelt,1902,1902.Roosevelt.txt,Republican
3,Roosevelt,1903,1903.Roosevelt.txt,Republican
4,Roosevelt,1904,1904.Roosevelt.txt,Republican


In [37]:
meta.tail()

Unnamed: 0,president,year,filepath,party
114,Obama,2015,2015.Obama.txt,Democrat
115,Obama,2016,2016.Obama.txt,Democrat
116,Trump,2017,2017.Trump.txt,Republican
117,Trump,2018,2018.Trump.txt,Republican
118,Trump,2019,2019.Trump.txt,Republican


Now let's use our imported `make_dtm` function to get a data frame of the SOTUS:

In [138]:
sotus = '/Users/e/code/literarytextmining/corpora/sotu_1900-2019/texts'

In [139]:
df = make_dtm(sotus, scaled = True, drop_below=50)

In [140]:
df.head()

filepath,a,abandon,abandoned,abiding,ability,able,abolished,abolition,abortion,about,...,yield,york,you,young,younger,your,yours,youth,zone,zones
1900.McKinley.txt,0.013367,,5.3e-05,5.3e-05,0.000211,0.000159,,,,0.000581,...,,,0.000581,,,0.000793,,,,5.3e-05
1901.Roosevelt.txt,0.013951,5.1e-05,,0.000152,0.000254,0.000558,5.1e-05,,,0.000203,...,,,0.000101,,5.1e-05,0.000203,,,,
1902.Roosevelt.txt,0.018451,0.000102,,,0.000204,0.000102,,,,0.00051,...,,,0.000102,0.000306,,0.000306,,,,
1903.Roosevelt.txt,0.014483,0.000135,6.7e-05,,0.000135,,,,,0.000539,...,,0.000202,0.000135,6.7e-05,,0.000202,,,0.000202,
1904.Roosevelt.txt,0.014109,5.7e-05,,,5.7e-05,0.000343,5.7e-05,5.7e-05,,0.0004,...,,0.000114,0.000228,0.000171,,0.000343,,5.7e-05,0.000114,


Since both data frames contain a `filepath` column, we can merge them easily:

In [141]:
meta.columns

Index(['president', 'year', 'filepath', 'party'], dtype='object')

In [154]:
sotu_df = pd.merge(df, meta, on='filepath')

In [155]:
sotu_df.head()

Unnamed: 0,filepath,a,abandon,abandoned,abiding,ability,able,abolished,abolition,abortion,...,young,younger,your,yours,youth,zone,zones,president_y,year_y,party_y
0,1900.McKinley.txt,0.013367,,5.3e-05,5.3e-05,0.000211,0.000159,,,,...,,,0.000793,,,,5.3e-05,McKinley,1900,Republican
1,1901.Roosevelt.txt,0.013951,5.1e-05,,0.000152,0.000254,0.000558,5.1e-05,,,...,,5.1e-05,0.000203,,,,,Roosevelt,1901,Republican
2,1902.Roosevelt.txt,0.018451,0.000102,,,0.000204,0.000102,,,,...,0.000306,,0.000306,,,,,Roosevelt,1902,Republican
3,1903.Roosevelt.txt,0.014483,0.000135,6.7e-05,,0.000135,,,,,...,6.7e-05,,0.000202,,,0.000202,,Roosevelt,1903,Republican
4,1904.Roosevelt.txt,0.014109,5.7e-05,,,5.7e-05,0.000343,5.7e-05,5.7e-05,,...,0.000171,,0.000343,,5.7e-05,0.000114,,Roosevelt,1904,Republican


We're going to want to set `filepath` as our index (i.e. row names) so that we don't calculate on it by accident:

In [156]:
sotu_df.set_index('filepath', inplace=True)

In [157]:
sotu_df.head()

filepath,a,abandon,abandoned,abiding,ability,able,abolished,abolition,abortion,about,...,young,younger,your,yours,youth,zone,zones,president_y,year_y,party_y
1900.McKinley.txt,0.013367,,5.3e-05,5.3e-05,0.000211,0.000159,,,,0.000581,...,,,0.000793,,,,5.3e-05,McKinley,1900,Republican
1901.Roosevelt.txt,0.013951,5.1e-05,,0.000152,0.000254,0.000558,5.1e-05,,,0.000203,...,,5.1e-05,0.000203,,,,,Roosevelt,1901,Republican
1902.Roosevelt.txt,0.018451,0.000102,,,0.000204,0.000102,,,,0.00051,...,0.000306,,0.000306,,,,,Roosevelt,1902,Republican
1903.Roosevelt.txt,0.014483,0.000135,6.7e-05,,0.000135,,,,,0.000539,...,6.7e-05,,0.000202,,,0.000202,,Roosevelt,1903,Republican
1904.Roosevelt.txt,0.014109,5.7e-05,,,5.7e-05,0.000343,5.7e-05,5.7e-05,,0.0004,...,0.000171,,0.000343,,5.7e-05,0.000114,,Roosevelt,1904,Republican


## Merge results

Notice that the merged table ends with three new columns: `president_y`, `year_y`, and `party_y`. The `_y` suffix appears because the DTM *already* had columns for the words `president`, `year`, and `party`. Rather than overwriting the existing columns, pandas keeps both sets and renames them: the colliding columns from the left table (the DTM) get the suffix `_x`, and those from the right table (our metadata) get `_y`.
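
The suffix behaviour is easy to reproduce on a toy example. The two small tables below are made up for illustration; `suffixes` is a standard argument of `pd.merge` that lets you choose more descriptive names than the defaults.

```python
import pandas as pd

# Toy DTM in which "party" happens to be one of the words
toy_dtm = pd.DataFrame({
    "filepath": ["1900.McKinley.txt", "1913.Wilson.txt"],
    "party": [0.0001, 0.0003],
    "war": [0.002, 0.001],
})

# Toy metadata that also has a "party" column
toy_meta = pd.DataFrame({
    "filepath": ["1900.McKinley.txt", "1913.Wilson.txt"],
    "party": ["Republican", "Democrat"],
})

# Default behaviour: colliding names get _x (left table) and _y (right table)
merged = pd.merge(toy_dtm, toy_meta, on="filepath")
print(list(merged.columns))   # ['filepath', 'party_x', 'war', 'party_y']

# Optional: pick your own suffixes
labelled = pd.merge(toy_dtm, toy_meta, on="filepath", suffixes=("_word", "_meta"))
print(list(labelled.columns))  # ['filepath', 'party_word', 'war', 'party_meta']
```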

Our words are everything up to, but not including, the last three columns, which came from our metadata. It'll help to save the word columns in a variable so that later we can look only at words:

In [158]:
words = sotu_df.columns[:-3] # everything from index 0 up to but not including the last 3

In [159]:
words

Index(['a', 'abandon', 'abandoned', 'abiding', 'ability', 'able', 'abolished',
       'abolition', 'abortion', 'about',
       ...
       'yield', 'york', 'you', 'young', 'younger', 'your', 'yours', 'youth',
       'zone', 'zones'],
      dtype='object', length=4844)

# Grouping with Pandas
There are quite a few ways we could group our data to check for differences.

One easy way would be to create variables containing slices of the data frame:

In [160]:
sotu_df['party_y'].unique() # party_y holds our party metadata

array(['Republican', 'Democrat'], dtype=object)

In [169]:
Rs = sotu_df[sotu_df['party_y'] == 'Republican'][words] # this gets all of the rows where the party is republican

In [166]:
Ds = sotu_df[sotu_df['party_y'] == 'Democrat'][words] # ditto, but for democrats
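
Boolean slices like the two above are one way to split the data. As an alternative sketch, pandas' `groupby` can compute per-party statistics in a single step. The toy frame and its values are made up; only the column name `party_y` is borrowed from our merged data frame.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "war": [0.002, 0.001, 0.004],
        "jobs": [0.001, 0.003, 0.001],
        "party_y": ["Republican", "Democrat", "Republican"],
    },
    index=["1901.Roosevelt.txt", "1913.Wilson.txt", "1905.Roosevelt.txt"],
)

toy_words = list(toy.columns[:-1])   # every column except the metadata

# One row per party, one column per word, holding the mean rate for that party
party_means = toy.groupby("party_y")[toy_words].mean()
print(party_means)
```

The slices are handy when you want to keep the full per-document tables for each group around; `groupby` is handy when you only need a summary per group.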

In [170]:
Rs.head()

filepath,a,abandon,abandoned,abiding,ability,able,abolished,abolition,abortion,about,...,yield,york,you,young,younger,your,yours,youth,zone,zones
1900.McKinley.txt,0.013367,,5.3e-05,5.3e-05,0.000211,0.000159,,,,0.000581,...,,,0.000581,,,0.000793,,,,5.3e-05
1901.Roosevelt.txt,0.013951,5.1e-05,,0.000152,0.000254,0.000558,5.1e-05,,,0.000203,...,,,0.000101,,5.1e-05,0.000203,,,,
1902.Roosevelt.txt,0.018451,0.000102,,,0.000204,0.000102,,,,0.00051,...,,,0.000102,0.000306,,0.000306,,,,
1903.Roosevelt.txt,0.014483,0.000135,6.7e-05,,0.000135,,,,,0.000539,...,,0.000202,0.000135,6.7e-05,,0.000202,,,0.000202,
1904.Roosevelt.txt,0.014109,5.7e-05,,,5.7e-05,0.000343,5.7e-05,5.7e-05,,0.0004,...,,0.000114,0.000228,0.000171,,0.000343,,5.7e-05,0.000114,


In [168]:
Ds.head()

filepath,a,abandon,abandoned,abiding,ability,able,abolished,abolition,abortion,about,...,yield,york,you,young,younger,your,yours,youth,zone,zones
1913.Wilson.txt,0.010937,,,0.00028,,,,,,0.000841,...,,,0.002804,,,0.001122,,,,
1914.Wilson.txt,0.014499,,,0.00022,,0.000439,,,,0.00022,...,0.00022,,0.001098,0.000659,,0.000439,,,,
1915.Wilson.txt,0.012451,,,,0.00013,0.00013,,,,0.000908,...,0.000389,,0.002464,0.00013,0.000259,0.001167,,,,
1916.Wilson.txt,0.012224,,,,,,,,,,...,,,0.003761,,,0.00094,,,,
1917.Wilson.txt,0.009909,,,,,0.000254,,,,0.001016,...,,,0.001778,,,0.000254,,,,


So, we used our metadata to group our data frame into Republicans and Democrats. Now, we can compare these two matrices:

## Difference of means
First, we're going to compare the parties' State of the Union addresses by looking at the differences between the mean rates at which they use certain words.

This is a simple but useful technique. It shows us which words appear more often on average in a given group's texts.

You calculate all of the mean values for every word in each party (`Ds.mean()`), then subtract the means of the other party (`Rs.mean()`) from each of those values.

So, words with more positive values are used more often on average in Democratic texts, whereas words with negative values are used more often in Republican texts. The advantage of this method is that it *cancels out* some of the similarities between the groups among very common words. The words closest to zero are used most similarly in both groups.
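
Here is the same calculation on a tiny made-up example, small enough to check by hand. The names `Ds_toy` and `Rs_toy` and all of the values are hypothetical.

```python
import pandas as pd

# Two toy "Democratic" documents and two toy "Republican" documents
Ds_toy = pd.DataFrame({"jobs": [0.003, 0.001], "war": [0.001, 0.001]},
                      index=["D1.txt", "D2.txt"])
Rs_toy = pd.DataFrame({"jobs": [0.001, 0.001], "war": [0.002, 0.004]},
                      index=["R1.txt", "R2.txt"])

# Mean rate per word in each group, then Democratic minus Republican
toy_diff = Ds_toy.mean() - Rs_toy.mean()

# jobs: 0.002 - 0.001 = +0.001 (leans Democratic)
# war:  0.001 - 0.003 = -0.002 (leans Republican)
print(toy_diff.sort_values())
```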

In [171]:
Ds.head()

filepath,a,abandon,abandoned,abiding,ability,able,abolished,abolition,abortion,about,...,yield,york,you,young,younger,your,yours,youth,zone,zones
1913.Wilson.txt,0.010937,,,0.00028,,,,,,0.000841,...,,,0.002804,,,0.001122,,,,
1914.Wilson.txt,0.014499,,,0.00022,,0.000439,,,,0.00022,...,0.00022,,0.001098,0.000659,,0.000439,,,,
1915.Wilson.txt,0.012451,,,,0.00013,0.00013,,,,0.000908,...,0.000389,,0.002464,0.00013,0.000259,0.001167,,,,
1916.Wilson.txt,0.012224,,,,,,,,,,...,,,0.003761,,,0.00094,,,,
1917.Wilson.txt,0.009909,,,,,0.000254,,,,0.001016,...,,,0.001778,,,0.000254,,,,


In [52]:
Ds.mean()[:5] # again, these are vectors so this calculates the mean of every column, i.e. word

a            0.016366
abandon      0.000247
abandoned    0.000202
abiding      0.000235
ability      0.000343
dtype: float64

In [178]:
diff_means = Ds.mean() - Rs.mean()

Now, there are words that will appear in one group but not the other. When you try to subtract a number from a number that does not exist, Pandas returns `NaN`. Since we're looking to compare and contrast these groups, we'll drop the `NaN` results:

In [179]:
diff_means[diff_means.isna()]

abortion          NaN
addicted          NaN
addiction         NaN
advocated         NaN
applaud           NaN
appreciation      NaN
autocracy         NaN
award             NaN
baghdad           NaN
biden             NaN
bosnia            NaN
brady             NaN
cheney            NaN
colored           NaN
commissioners     NaN
comparatively     NaN
competitiveness   NaN
convicted         NaN
etc               NaN
exposition        NaN
filipinos         NaN
fisheries         NaN
friction          NaN
gramm             NaN
gratifying        NaN
hague             NaN
hitler            NaN
hoarding          NaN
hollings          NaN
incident          NaN
internet          NaN
investigate       NaN
iraqis            NaN
isil              NaN
isis              NaN
kremlin           NaN
lebanon           NaN
legislature       NaN
megan             NaN
multiplied        NaN
municipal         NaN
nam               NaN
nazis             NaN
parcel            NaN
protective        NaN
qaida     

In [180]:
diff_means.dropna(inplace=True)
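
To see where those `NaN` values come from, here is a toy sketch with made-up numbers. It also shows one possible alternative, assuming you would rather treat a missing word as a rate of zero than drop it. Whether that is appropriate depends on your question, and note that it changes the means for every word that is missing from some documents.

```python
import pandas as pd

nan = float("nan")
Ds_toy = pd.DataFrame({"jobs": [0.003, 0.001], "internet": [0.0002, nan], "hague": [nan, nan]})
Rs_toy = pd.DataFrame({"jobs": [0.001, 0.001], "internet": [nan, nan], "hague": [0.0004, 0.0002]})

# A word that never appears in a group has no mean there, so the difference is NaN
toy_diff = Ds_toy.mean() - Rs_toy.mean()
print(toy_diff)

# What we do in this notebook: keep only words used by both groups
print(toy_diff.dropna())

# Alternative (an assumption, not what we do here): count a missing word as a rate of 0
filled_diff = Ds_toy.fillna(0).mean() - Rs_toy.fillna(0).mean()
print(filled_diff)
```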

In [174]:
diff_means.sort_values() # more negative values are more Republican; more positive values are more Democratic

the            -0.005448
applause       -0.004546
of             -0.003290
is             -0.002014
in             -0.001891
iraq           -0.001443
be             -0.001298
barrels        -0.001281
saddam         -0.001275
a              -0.001225
terrorists     -0.001140
iraqi          -0.001127
government     -0.001111
has            -0.001082
hussein        -0.000985
federal        -0.000904
been           -0.000899
qaeda          -0.000898
terror         -0.000807
by             -0.000755
mayor          -0.000593
audience       -0.000592
such           -0.000587
america        -0.000575
al             -0.000562
an             -0.000561
earmarks       -0.000552
law            -0.000546
states         -0.000533
under          -0.000521
                  ...   
who             0.000687
japanese        0.000723
planes          0.000744
but             0.000759
salt            0.000768
s               0.000790
all             0.000807
it              0.000875
respectfully    0.000875


And which words are used most similarly by the two groups?

In [181]:
# what is the smallest value greater than 0?
mn = diff_means[diff_means > 0].min()

In [182]:
mn

1.8705293707184378e-08

In [183]:
diff_means[diff_means==mn]

refused    1.870529e-08
dtype: float64

`refused` has the smallest positive difference, so it sits right next to zero once the values are sorted. We can use `get_loc` on the sorted index to find its numerical position, and then look at the words around it:

In [184]:
diff_means.sort_values(inplace=True) # make sure sort is consistent

In [61]:
diff_means.index.get_loc('refused')

1570

In [185]:
diff_means[1560:1580] # let's see the words around refused

turned       -5.362892e-07
penalty      -4.611454e-07
at           -4.054953e-07
remained     -3.925947e-07
knows        -3.763799e-07
offensive    -3.409084e-07
places       -2.980729e-07
different    -1.171362e-07
numerous     -7.731823e-08
earlier      -1.564578e-08
refused       1.870529e-08
sensible      1.879756e-07
defined       2.488349e-07
needless      2.541267e-07
commanders    2.653964e-07
protected     3.557504e-07
variety       3.759861e-07
industry      3.862119e-07
map           3.946424e-07
believes      4.317738e-07
dtype: float64

Again, the words with values this close to zero are used at the most *consistent* rates between groups, on average.
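
A more direct way to find the words closest to zero, sketched here as an alternative, is to rank by absolute value. The series below reuses a handful of values from the output above; note that by this measure `earlier` is slightly closer to zero than `refused`.

```python
import pandas as pd

toy_diff = pd.Series({
    "the": -0.005448,
    "numerous": -7.731823e-08,
    "earlier": -1.564578e-08,
    "refused": 1.870529e-08,
    "sensible": 1.879756e-07,
})

# Rank by distance from zero, ignoring the sign
closest = toy_diff.abs().nsmallest(3)

# Look the words back up in the original series to recover their signs
print(toy_diff[closest.index])
```

This avoids sorting and then hunting for a position with `get_loc`, at the cost of not showing the neighbouring words in sorted order.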

# Term-Frequency `*` Inverse-Document Frequency
TF-IDF is a technique commonly used in information retrieval, for example to rank documents against a search query.

It multiplies the scaled frequency of a term in a document by a log-scaled number: the number of documents in the corpus divided by the number of *documents in which the target word appears*.

TF-IDF is said to be a measure of "term specificity," i.e. how specific is this term to this document?

This is based on two assumptions:
1. If a word is relatively frequent in all documents, and frequent in one particular document, it probably isn't that distinctive of that one document.
2. But if a word is relatively infrequent in all documents, and relatively frequent in one document, it may well be distinctive of that document.
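
These two assumptions can be seen in a quick numeric sketch. The helper `idf` and the term frequency of 0.001 are made up for illustration; the corpus size of 119 and the figure of 68 documents match our State of the Union corpus and the word `jobs`.

```python
import math

def idf(n_docs, n_docs_with_word):
    """Log of (documents in the corpus / documents containing the word)."""
    return math.log(n_docs / n_docs_with_word)

n_docs = 119
tf = 0.001  # pretend the word has the same scaled frequency in every case

for label, docs_with_word in [("every document", 119),
                              ("68 documents", 68),
                              ("5 documents", 5)]:
    weight = idf(n_docs, docs_with_word)
    print(f"word appears in {label}: idf = {weight:.4f}, tf*idf = {tf * weight:.6f}")
```

With the term frequency held constant, the score is zero for a word that appears everywhere and grows as the word appears in fewer documents.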

## Term-Frequency
<center><img src="https://latex.codecogs.com/png.latex?TF = \frac{n_w}{n_d}"></center>

Where:
* *n_w* is the number of times a given word *w* appears in a document.
* *n_d* is the number of words in that document.

We already know how to calculate this; we have been referring to it as the scaled frequency. Term frequencies are exactly the values we have already calculated in our scaled DTMs, e.g.
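
As a reminder of what a scaled frequency is, here is a minimal sketch that computes term frequency for a single made-up "document" by hand. The whitespace tokenization is a simplification and may differ from what `make_dtm` does.

```python
from collections import Counter

tokens = "the jobs are the jobs we need".split()

counts = Counter(tokens)   # n_w for every word w
n_d = len(tokens)          # total number of words in the document

tf = {word: n_w / n_d for word, n_w in counts.items()}
print(tf["jobs"])          # 2 of the 7 tokens are "jobs"
```

Because every count is divided by the same document length, the term frequencies of a document add up to 1.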

In [186]:
sotu_df['jobs'].dropna()[:10]

filepath
1940.Roosevelt.txt     0.000622
1941.Roosevelt.txt     0.000300
1944.Roosevelt.txt     0.000261
1945.Roosevelt.txt     0.001706
1946.Truman.txt        0.000183
1948.Truman.txt        0.000589
1949.Truman.txt        0.000294
1951.Truman.txt        0.000500
1952.Truman.txt        0.000559
1955.Eisenhower.txt    0.000274
Name: jobs, dtype: float64

## Inverse Document Frequency

<center><img src="https://latex.codecogs.com/png.latex?IDF = \log \left( \frac{c_d}{i_d} \right)"></center>

Where:
* <img src="https://latex.codecogs.com/png.latex?{c_d}"> is the count of documents in the corpus.
* <img src="https://latex.codecogs.com/png.latex?{i_d}"> is the number of documents in which that word appears.

Calculating this requires answering three questions:
1. How many documents are there in the corpus?
2. How many documents does the word appear in?
3. What is the `log` of the number of documents divided by the number of documents in which the word appears?

In [187]:
# 1. How many documents are there? We can get that by counting the number of rows in our DTM. One row per document:
n_docs = len(sotu_df)
n_docs

119

In [66]:
# 2. How many documents does the word appear in?
# We can get that by finding out how many documents have a non-zero value:
(sotu_df['jobs'] > 0)[:5]

filepath
1900.McKinley.txt     False
1901.Roosevelt.txt    False
1902.Roosevelt.txt    False
1903.Roosevelt.txt    False
1904.Roosevelt.txt    False
Name: jobs, dtype: bool

In [188]:
# We can then use that True/False vector to filter our data frame
sotu_df[sotu_df['jobs'] > 0].head()

filepath,a,abandon,abandoned,abiding,ability,able,abolished,abolition,abortion,about,...,young,younger,your,yours,youth,zone,zones,president_y,year_y,party_y
1940.Roosevelt.txt,0.016791,0.000311,,,,0.000311,,,,0.000622,...,,,0.000311,,0.000933,,,Roosevelt,1940,Democrat
1941.Roosevelt.txt,0.015315,,,0.0003,0.000601,0.000601,,,,0.000601,...,,,0.000601,,0.0003,,,Roosevelt,1941,Democrat
1944.Roosevelt.txt,0.017778,,,,,0.000523,,,,0.000784,...,0.000261,,,,,,,Roosevelt,1944,Democrat
1945.Roosevelt.txt,0.013161,,,,0.000487,0.000731,,,,0.000609,...,0.000366,,,,,0.000122,,Roosevelt,1945,Democrat
1946.Truman.txt,0.013168,,,,0.00011,0.000147,,,,0.001724,...,3.7e-05,,,,,7.3e-05,3.7e-05,Truman,1946,Democrat


In [68]:
# And then we can calculate the number of documents containing the target word like so:
# How many rows (documents) exist where there are more than zero instances of the word "jobs"?
len(sotu_df[sotu_df['jobs'] > 0])

68

In [189]:
n_docs_with_word = len(sotu_df[sotu_df['jobs'] > 0])

In [190]:
# 3. Log-scale the inverse frequency.
# We can have the computer do this for us:
import numpy as np
idf = np.log(n_docs/n_docs_with_word)

In [191]:
idf

0.5596157879354227

Now we multiply every one of our term-frequencies by our inverse-document frequency:

In [193]:
(sotu_df['jobs'] * idf).sort_values(ascending = True)[:20]

filepath
1956.Eisenhower.txt    0.000067
1946.Truman.txt        0.000103
1959.Eisenhower.txt    0.000114
1957.Eisenhower.txt    0.000134
1944.Roosevelt.txt     0.000146
1955.Eisenhower.txt    0.000154
1986.Reagan.txt        0.000157
1967.Johnson.txt       0.000157
1949.Truman.txt        0.000164
1941.Roosevelt.txt     0.000168
2007.Bush.txt          0.000200
2003.Bush.txt          0.000208
1966.Johnson.txt       0.000212
1989.Bush.txt          0.000229
1971.Nixon.txt         0.000247
2001.Bush.txt          0.000258
1951.Truman.txt        0.000280
1987.Reagan.txt        0.000288
1990.Bush.txt          0.000290
1999.Clinton.txt       0.000296
Name: jobs, dtype: float64

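Because we sorted in ascending order, these are the 20 *lowest* scores for "jobs." At the other end of the ranking are Obama's addresses, as we'll see with `.nlargest()` below: Obama talks about "jobs" more often than other presidents, and the word counts as somewhat distinctive because not all presidents' State of the Union messages address "jobs."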

## TF`*`IDF function
We should put all of this into a function so we can get tf`*`idf for every word in any DTM:

In [194]:
def tf_idf(dtm):
    '''This function expects a scaled-frequency document-term matrix with an index of text names, and words as columns.
    It calculates tf*idf for every column (i.e. every word).'''
    dtm_tfidf = pd.DataFrame() # initialize empty dataframe to store results
    
    for word in dtm.columns: # for every word in our corpus
        # 1. tf: just the scaled frequency of our word
        tf_series = dtm[word]
        
        # 2. idf
        num_docs = len(dtm)
        num_docs_with_word=len(dtm[dtm[word]>0])
        idf=np.log(num_docs/num_docs_with_word)
        
        # 3. tf * idf
        tfidf_series = tf_series * idf
        
        # 4. Add the result to our dataframe
        dtm_tfidf[word]=tfidf_series
    
    return dtm_tfidf # 5. Return our new dataframe

In [78]:
sotu_tfidf = tf_idf(sotu_df[words]) # I call my original dataframe for the columns [words] because we don't want metadata
# this takes about 60 seconds to run on my laptop
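
The loop above adds one column at a time, which is part of why it is slow. As a sketch of a loop-free alternative, the same arithmetic can be done on the whole table at once. The name `tf_idf_vectorized` and the toy DTM are made up for illustration, and I have not timed it against the real corpus.

```python
import numpy as np
import pandas as pd

def tf_idf_vectorized(dtm):
    """Loop-free tf*idf sketch. Expects a scaled DTM with documents as rows
    and words as columns; missing words are NaN, as in our DTMs."""
    num_docs = len(dtm)
    docs_with_word = (dtm > 0).sum(axis=0)   # NaN > 0 is False, so NaN cells are not counted
    idf = np.log(num_docs / docs_with_word)  # one idf value per word
    return dtm * idf                         # the idf Series lines up with the columns

nan = float("nan")
toy_dtm = pd.DataFrame(
    {"a": [0.5, 0.4, 0.3], "jobs": [nan, 0.1, nan], "war": [0.2, nan, 0.1]},
    index=["d1.txt", "d2.txt", "d3.txt"],
)

toy_tfidf = tf_idf_vectorized(toy_dtm)
print(toy_tfidf)
```

One difference to be aware of: a column with no positive values makes the loop version raise a `ZeroDivisionError`, whereas this version yields an infinite IDF and `NaN` scores for that word.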

In [195]:
sotu_tfidf

filepath,a,abandon,abandoned,abiding,ability,able,abolished,abolition,abortion,about,...,yield,york,you,young,younger,your,yours,youth,zone,zones
1900.McKinley.txt,0.0,,0.000089,0.000092,0.000106,0.000039,,,,0.000056,...,,,0.000040,,,0.000195,,,,0.000106
1901.Roosevelt.txt,0.0,0.000067,,0.000264,0.000127,0.000138,0.000109,,,0.000020,...,,,0.000007,,0.000105,0.000050,,,,
1902.Roosevelt.txt,0.0,0.000134,,,0.000102,0.000025,,,,0.000049,...,,,0.000007,0.000162,,0.000075,,,,
1903.Roosevelt.txt,0.0,0.000177,0.000114,,0.000068,,,,,0.000052,...,,0.000300,0.000009,0.000036,,0.000050,,,0.000405,
1904.Roosevelt.txt,0.0,0.000075,,,0.000029,0.000084,0.000122,0.000131,,0.000039,...,,0.000169,0.000016,0.000091,,0.000084,,0.000081,0.000229,
1905.Roosevelt.txt,0.0,,,0.000069,0.000060,0.000147,0.000341,,,0.000043,...,,0.000118,0.000008,0.000021,,0.000108,,,0.000080,
1906.Roosevelt.txt,0.0,0.000056,,0.000147,0.000043,0.000073,0.000091,0.000194,,0.000037,...,,0.000377,0.000027,0.000202,,0.000125,0.000088,,0.000085,
1907.Roosevelt.txt,0.0,,,0.000127,0.000147,0.000072,0.000156,,,0.000046,...,0.000057,0.000325,0.000003,0.000019,,0.000090,,,,
1908.Roosevelt.txt,0.0,0.000203,,0.000089,0.000026,0.000165,,0.000236,,0.000060,...,,0.000076,0.000004,,,0.000038,,,,
1909.Taft.txt,0.0,,,,0.000181,0.000018,,0.000166,,0.000049,...,,0.000107,0.000005,0.000115,0.000149,0.000089,,,0.000290,


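Words like `a` that appear in every document score exactly 0, because their IDF is log(119/119) = log(1) = 0. They are not distinctive of any particular text. Higher values distinguish particular texts because they mark words that are frequent in that text but appear in relatively few of the other documents.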

We could use this dataframe for a couple of things. We could look at which documents are associated with which words.

`.nlargest()` gives us the largest results:

In [196]:
sotu_tfidf['jobs'].nlargest(10) # our function is working:

filepath
2013.Obama.txt      0.002642
2012.Obama.txt      0.002576
2011.Obama.txt      0.001978
1993.Clinton.txt    0.001906
2014.Obama.txt      0.001841
2010.Obama.txt      0.001784
2002.Bush.txt       0.001594
1994.Clinton.txt    0.001489
2015.Obama.txt      0.001459
2009.Obama.txt      0.001306
Name: jobs, dtype: float64

In [197]:
sotu_tfidf['vietnam'].nlargest(10) # would expect this to be during the US war in Vietnam

filepath
1966.Johnson.txt    0.011771
1967.Johnson.txt    0.005730
1969.Johnson.txt    0.003804
1973.Nixon.txt      0.002325
1968.Johnson.txt    0.001592
1970.Nixon.txt      0.001311
1985.Reagan.txt     0.000904
1977.Ford.txt       0.000837
1996.Clinton.txt    0.000612
1987.Reagan.txt     0.000501
Name: vietnam, dtype: float64

In [198]:
sotu_tfidf['war'].nlargest(10) # who talks the most about war

filepath
1944.Roosevelt.txt    0.000676
1943.Roosevelt.txt    0.000508
1945.Roosevelt.txt    0.000479
1942.Roosevelt.txt    0.000459
1917.Wilson.txt       0.000407
1946.Truman.txt       0.000381
1953.Truman.txt       0.000298
1941.Roosevelt.txt    0.000249
1918.Wilson.txt       0.000227
1920.Wilson.txt       0.000210
Name: war, dtype: float64

In [199]:
sotu_tfidf['taxes'].nlargest(10) # who talks the most about taxes?

filepath
1973.Nixon.txt     0.001005
1975.Ford.txt      0.000826
2004.Bush.txt      0.000583
1982.Reagan.txt    0.000577
2001.Bush.txt      0.000542
1920.Wilson.txt    0.000498
1984.Reagan.txt    0.000468
1972.Nixon.txt     0.000422
2012.Obama.txt     0.000375
2010.Obama.txt     0.000373
Name: taxes, dtype: float64

We could also use it in the opposite way, to show which words are most closely associated with each document. Our dataframe's `index` gives us each of the files:

In [87]:
sotu_tfidf.index[:3]

Index(['1900.McKinley.txt', '1901.Roosevelt.txt', '1902.Roosevelt.txt'], dtype='object', name='filepath')

And if we take any given document, and sort its *row* from high to low, we get the words that are most associated with that document:

In [88]:
sotu_tfidf.index[0]

'1900.McKinley.txt'

In [89]:
test = sotu_tfidf.index[0]

We can get all of the values for the row with `loc`:

In [91]:
sotu_tfidf.loc[test].nlargest(10) # these are the words that most distinguish McKinley's 1900 speech from all others

islands       0.004154
imperial      0.002139
convention    0.002084
chinese       0.001943
spain         0.001569
philippine    0.001551
island        0.001438
governor      0.001286
acres         0.001283
commission    0.001244
Name: 1900.McKinley.txt, dtype: float64

Then, we can use a few of the formatting tricks we've learned to print the top five words for every document:

In [200]:
n_words = 5
for index in sotu_tfidf.index:
    # get row for this index
    row=sotu_tfidf.loc[index]
    
    # get the words with the largest tf*idf scores
    top_words_series=row.nlargest(n_words)
    top_words_list=list(top_words_series.index)
    top_words_str=', '.join(top_words_list)
    
    # print
    print(index)
    print(top_words_str)
    print('-'*80)

1900.McKinley.txt
islands, imperial, convention, chinese, spain
--------------------------------------------------------------------------------
1901.Roosevelt.txt
ships, navy, forest, arid, streams
--------------------------------------------------------------------------------
1902.Roosevelt.txt
cable, tariff, corporations, navy, therein
--------------------------------------------------------------------------------
1903.Roosevelt.txt
isthmus, panama, colombia, canal, treaty
--------------------------------------------------------------------------------
1904.Roosevelt.txt
naturalization, forest, islands, indian, reserves
--------------------------------------------------------------------------------
1905.Roosevelt.txt
islands, supervision, railroad, chinese, interstate
--------------------------------------------------------------------------------
1906.Roosevelt.txt
colored, judge, japanese, conference, islands
---------------------------------------------------------------------

# How much more likely is a group to use a given word?
tf`*`idf tells us a lot about which words are in which documents, but not much about the groups as a whole.

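Earlier, we used difference of means to compare groups on average. However, text documents vary in length, and an average of per-document values gives a short speech as much weight as a long one. We might want an approach that pools every instance of a word across the whole group.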
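
To make the length problem concrete, here is a minimal sketch with invented numbers (two hypothetical documents, not taken from our corpus). It contrasts an average of per-document rates with a pooled rate that counts every instance:

```python
# Hypothetical (target-word count, total words) pairs for two documents in one group.
docs = [(5, 100), (5, 10_000)]

# Mean of per-document rates: the 100-word document weighs as much as the 10,000-word one.
mean_of_rates = sum(count / total for count, total in docs) / len(docs)

# Pooled rate: add up all instances first, then divide by all words.
pooled_rate = sum(count for count, _ in docs) / sum(total for _, total in docs)

print(mean_of_rates, pooled_rate)
```

The two numbers differ by more than an order of magnitude here, because the short document dominates the average. A contingency table of raw counts works with the pooled totals instead.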

## Fisher's exact test
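[Fisher's Exact Test](https://en.wikipedia.org/wiki/Fisher%27s_exact_test) allows us to calculate an odds ratio (and a p-value) for how much more likely one group is to use a given word than another. It is based on contingency tables (aka cross tabulations), which look like this (the classic example, in which a taster guesses whether milk or tea was poured first into each of eight cups):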

|            | Guess milk | Guess tea | Total |
|------------|------------|-----------|-------|
| Milk first | 4          | 0         | 4     |
| Tea first  | 0          | 4         | 4     |
| Total      | 4          | 4         | 8     |
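
To see where a p-value for this table comes from, here is a minimal sketch that works through the tea-tasting counts using only the standard library. The function names are my own, and the two-sided rule (add up every possible table that is no more probable than the observed one) is one common convention, stated here as an assumption:

```python
from math import comb

def table_probability(a, row1, row2, col1):
    """Probability of `a` in the top-left cell when all row and column totals are fixed."""
    return comb(row1, a) * comb(row2, col1 - a) / comb(row1 + row2, col1)

def fisher_two_sided(table):
    """Two-sided p-value for a 2x2 table of non-negative integer counts."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    observed = table_probability(a, row1, row2, col1)

    # every value the top-left cell could take without changing the totals
    possible = range(max(0, col1 - row2), min(row1, col1) + 1)

    p_value = 0.0
    for k in possible:
        prob = table_probability(k, row1, row2, col1)
        if prob <= observed * (1 + 1e-7):  # small tolerance for floating-point ties
            p_value += prob
    return p_value

tea = [[4, 0], [0, 4]]
print(fisher_two_sided(tea))
```

There are 70 ways to pick four cups out of eight, and only two of them (all right, or all wrong) are as extreme as the observed table, so the result is 2/70, or roughly 0.029. This sketch is meant for small tables; for corpus-sized counts a library implementation is the practical choice.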

To make a comparable contingency matrix, we need the following categories for data about words:

|         | Number of target word | All words except target word | Total |
|---------|-----------------------|------------------------------|-------|
| Group A | w                     | x                            | w + x |
| Group B | y                     | z                            | y + z |
| Total   | w + y                 | x + z                        | n     |

## Fisher's exact requires raw counts, not scaled frequencies
To start, we need a DTM containing raw counts:

In [96]:
sotus = '/Users/e/code/literarytextmining/corpora/sotu_1900-2019/texts'

In [97]:
df = make_dtm(sotus, scaled = False, drop_below=50)

In [98]:
df.head()

Unnamed: 0_level_0,a,ability,able,about,above,abroad,absolutely,abuse,abuses,accept,...,worthy,would,wrong,year,years,yes,yet,you,young,your
filepath,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1900.McKinley.txt,253,4.0,3.0,11.0,2.0,4.0,,,,,...,2.0,13.0,,48.0,8.0,,4.0,11.0,,15.0
1901.Roosevelt.txt,275,5.0,11.0,4.0,6.0,7.0,4.0,,3.0,1.0,...,,47.0,3.0,9.0,26.0,,10.0,2.0,,4.0
1902.Roosevelt.txt,181,2.0,1.0,5.0,8.0,4.0,1.0,,,,...,1.0,36.0,2.0,7.0,5.0,,6.0,1.0,3.0,3.0
1903.Roosevelt.txt,215,2.0,,8.0,9.0,,,,,,...,1.0,26.0,1.0,39.0,19.0,,9.0,2.0,1.0,3.0
1904.Roosevelt.txt,247,1.0,6.0,7.0,7.0,7.0,3.0,,5.0,4.0,...,3.0,43.0,10.0,12.0,13.0,,14.0,4.0,3.0,6.0


Let's re-add our metadata:

In [99]:
meta.head()

Unnamed: 0,president,year,filepath,party
0,McKinley,1900,1900.McKinley.txt,Republican
1,Roosevelt,1901,1901.Roosevelt.txt,Republican
2,Roosevelt,1902,1902.Roosevelt.txt,Republican
3,Roosevelt,1903,1903.Roosevelt.txt,Republican
4,Roosevelt,1904,1904.Roosevelt.txt,Republican


In [100]:
df = pd.merge(df,meta,on='filepath')

In [102]:
df.set_index('filepath', inplace=True)

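Note that once again we get the extra metadata columns at the end. Because `president`, `year`, and `party` also occur as words in the DTM, pandas adds the suffix `_x` to the word-count columns and `_y` to the metadata columns to tell them apart: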

In [103]:
df.head()

Unnamed: 0_level_0,a,ability,able,about,above,abroad,absolutely,abuse,abuses,accept,...,year_x,years,yes,yet,you,young,your,president_y,year_y,party_y
filepath,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1900.McKinley.txt,253,4.0,3.0,11.0,2.0,4.0,,,,,...,48.0,8.0,,4.0,11.0,,15.0,McKinley,1900,Republican
1901.Roosevelt.txt,275,5.0,11.0,4.0,6.0,7.0,4.0,,3.0,1.0,...,9.0,26.0,,10.0,2.0,,4.0,Roosevelt,1901,Republican
1902.Roosevelt.txt,181,2.0,1.0,5.0,8.0,4.0,1.0,,,,...,7.0,5.0,,6.0,1.0,3.0,3.0,Roosevelt,1902,Republican
1903.Roosevelt.txt,215,2.0,,8.0,9.0,,,,,,...,39.0,19.0,,9.0,2.0,1.0,3.0,Roosevelt,1903,Republican
1904.Roosevelt.txt,247,1.0,6.0,7.0,7.0,7.0,3.0,,5.0,4.0,...,12.0,13.0,,14.0,4.0,3.0,6.0,Roosevelt,1904,Republican
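
The `_x` and `_y` suffixes in the column names come from `pd.merge`, which renames any non-key column that exists in both frames. Here is a minimal sketch with two invented toy frames (`counts` and `meta_toy` are hypothetical names, not part of our data):

```python
import pandas as pd

# Hypothetical word counts: "year" is a word that happens to occur in the texts.
counts = pd.DataFrame({"filepath": ["a.txt", "b.txt"], "year": [3, 1], "war": [2, 0]})

# Hypothetical metadata: here "year" is the year of the speech.
meta_toy = pd.DataFrame({"filepath": ["a.txt", "b.txt"], "year": [1900, 1901]})

# Both frames have a "year" column, so merge keeps both and renames them.
merged = pd.merge(counts, meta_toy, on="filepath")
print(list(merged.columns))
```

The left frame's column becomes `year_x` (the word count) and the right frame's becomes `year_y` (the metadata), which is why we filter on `party_y` rather than `party` later on.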


In [106]:
words = df.columns[:-3] # just get our words to avoid calculating on the metadata

In [105]:
words[:3], words[-3:]

(Index(['a', 'ability', 'able'], dtype='object'),
 Index(['you', 'young', 'your'], dtype='object'))

From skimming the tf`*`idf results above, we know that George W. Bush was most likely to talk about `iraq`. Let's see if that holds true for the Republican party as a whole:

In [107]:
Rs=df[df['party_y'] == 'Republican'][words]
Ds= df[df['party_y'] == 'Democrat'][words]
Rs.head()

Unnamed: 0_level_0,a,ability,able,about,above,abroad,absolutely,abuse,abuses,accept,...,worthy,would,wrong,year_x,years,yes,yet,you,young,your
filepath,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1900.McKinley.txt,253,4.0,3.0,11.0,2.0,4.0,,,,,...,2.0,13.0,,48.0,8.0,,4.0,11.0,,15.0
1901.Roosevelt.txt,275,5.0,11.0,4.0,6.0,7.0,4.0,,3.0,1.0,...,,47.0,3.0,9.0,26.0,,10.0,2.0,,4.0
1902.Roosevelt.txt,181,2.0,1.0,5.0,8.0,4.0,1.0,,,,...,1.0,36.0,2.0,7.0,5.0,,6.0,1.0,3.0,3.0
1903.Roosevelt.txt,215,2.0,,8.0,9.0,,,,,,...,1.0,26.0,1.0,39.0,19.0,,9.0,2.0,1.0,3.0
1904.Roosevelt.txt,247,1.0,6.0,7.0,7.0,7.0,3.0,,5.0,4.0,...,3.0,43.0,10.0,12.0,13.0,,14.0,4.0,3.0,6.0


## First column: frequencies of target word by group

In [114]:
word='iraq'
sum_word_Rs = Rs[word].sum()
sum_word_Ds = Ds[word].sum()

print(sum_word_Rs,sum_word_Ds) # these are our frequencies for each group. they will be column 1 in our contingency table

105.0 31.0


## Second column: sum of all words minus our target word
We can get the sum of all of our words by calling `sum` twice. The first `sum` adds up each column, giving a series with one total per word; the second `sum` adds up that series, giving a single number:

In [115]:
Rs.sum().sum()

448709.0
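
To see what each `sum` does, here is a minimal sketch on an invented two-document table (the `toy` frame is hypothetical). It also shows that missing values, like the `NaN`s in our DTM, are skipped by default:

```python
import numpy as np
import pandas as pd

# Hypothetical counts; doc2.txt never uses "iraq", so that cell is NaN.
toy = pd.DataFrame(
    {"iraq": [2.0, np.nan], "war": [1.0, 3.0]},
    index=["doc1.txt", "doc2.txt"],
)

per_word = toy.sum()    # first sum: one total per column (word); NaN is ignored
total = per_word.sum()  # second sum: add the per-word totals together

print(per_word)
print(total)
```

The first call returns a series (`iraq` 2.0, `war` 4.0), and the second collapses it to the single total of 6.0.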

In [116]:
sum_allword_Rs=Rs.sum().sum()
sum_allword_Rs

448709.0

In [117]:
sum_allword_Ds=Ds.sum().sum()
sum_allword_Ds

305467.0

In [118]:
sum_notword_Rs = sum_allword_Rs - sum_word_Rs
sum_notword_Rs

448604.0

In [119]:
sum_notword_Ds = sum_allword_Ds - sum_word_Ds
sum_notword_Ds

305436.0

## Make the table:

In [120]:
contingency_table = [
    [sum_word_Rs, sum_notword_Rs],
    [sum_word_Ds, sum_notword_Ds]
]

In [122]:
contingency_table

[[105.0, 448604.0], [31.0, 305436.0]]

Filling in the template table from above with our example, `contingency_table` now has:

|            | Number of "iraq" | Sum of all words except "iraq" | Total  |
|------------|------------------|--------------------------------|--------|
| Republican | 105              | 448604                         | 448709 |
| Democrat   | 31               | 305436                         | 305467 |
| Total      | 136              | 754040                         | 754176 |

It's clear that Republicans use `iraq` more often, but they also use more words in total. Fisher's exact test takes both into account by comparing the *odds* of each party using the word.

Now that we have the table, we can run Fisher's exact test to see how much higher the odds are for one party than for the other:

In [123]:
from scipy.stats import fisher_exact

oddsratio, pvalue = fisher_exact(contingency_table)
oddsratio, pvalue

(2.3061347877472795, 1.5757521308192708e-05)

## Interpreting results
So, the odds that any given word in a Republican address is `iraq` are about 2.3 times the odds for a Democratic address. Because the p-value is far below the conventional `0.05` threshold, we can reject the null hypothesis that both parties use the word at the same rate.
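
The odds ratio itself is simple arithmetic on the four cells of the table. As a minimal sketch using the counts from our contingency table (the variable names are my own):

```python
# Counts taken from the contingency table above.
iraq_R, other_R = 105, 448604
iraq_D, other_D = 31, 305436

odds_R = iraq_R / other_R      # odds that a Republican word is "iraq"
odds_D = iraq_D / other_D      # odds that a Democratic word is "iraq"
odds_ratio = odds_R / odds_D   # how many times higher the Republican odds are

# For comparison: the ratio of plain rates (count divided by all words).
rate_ratio = (iraq_R / (iraq_R + other_R)) / (iraq_D / (iraq_D + other_D))

print(odds_ratio, rate_ratio)
```

The odds ratio agrees with the first value in the `fisher_exact` output above. Because `iraq` is rare compared with everything else, the odds ratio and the ratio of rates are nearly identical, which is why reading it as "about 2.3 times as likely" is a reasonable shorthand.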

# Fisher's exact for every word in our data frame
When we check all of our words, which ones are significantly more likely to be used by one group rather than another?

We can write a function to test this:

In [124]:
def fish(group_a, group_a_name, group_b, group_b_name):
    """Run Fisher's exact test on every word shared by two raw-count DTMs."""
    results = []
    if (group_a.columns == group_b.columns).all(): # test the columns for equivalence; don't run if the columns don't match
        # 1. calculate total number of words in each group
        #    (these don't change from word to word, so we only need them once)
        sum_allword_a = group_a.sum().sum()
        sum_allword_b = group_b.sum().sum()

        for word in group_a.columns:
            # 2. calculate frequencies of each word
            sum_word_a = group_a[word].sum()
            sum_word_b = group_b[word].sum()

            # 3. calculate total number of words minus the target word
            sum_notword_a = sum_allword_a - sum_word_a
            sum_notword_b = sum_allword_b - sum_word_b

            # 4. make contingency table
            contingency_table = [[sum_word_a, sum_notword_a], [sum_word_b, sum_notword_b]]

            # 5. run fisher's exact (imported above from scipy.stats)
            odds,pvalue = fisher_exact(contingency_table)

            # 6. capture results in dictionary
            d = {}
            d['word'] = word
            d['odds'] = odds
            d['pvalue'] = pvalue
            d['group_a']= group_a_name
            d['group_b'] = group_b_name

            results.append(d)
    
    return results # empty list if the columns didn't match

This may take a minute to run!

In [125]:
test = fish(Rs, 'Republicans', Ds, 'Democrats')
# this takes about a minute to run on my computer

In [126]:
import pandas as pd
df_fish = pd.DataFrame(test)

In [127]:
df_fish.head()

Unnamed: 0,group_a,group_b,odds,pvalue,word
0,Republicans,Democrats,1.073176,3.9e-05,a
1,Republicans,Democrats,0.943339,0.752629,ability
2,Republicans,Democrats,0.748276,0.012898,able
3,Republicans,Democrats,0.789779,0.000961,about
4,Republicans,Democrats,1.317702,0.082408,above


An odds ratio above 1 means the word is more likely to appear in group A; an odds ratio below 1 means it is more likely to appear in group B. The further the value is from 1, the bigger the difference between the groups.
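
An odds ratio below 1 can be hard to read at a glance. Since swapping the two groups simply inverts the ratio, one option is to take the reciprocal and read it from group B's side. A minimal sketch, using the value reported for `able` in the `df_fish.head()` output above (the variable names are my own):

```python
odds_a_vs_b = 0.748276          # odds ratio for "able", Republicans relative to Democrats
odds_b_vs_a = 1 / odds_a_vs_b   # the same comparison, Democrats relative to Republicans

print(round(odds_b_vs_a, 2))
```

So an odds ratio of roughly 0.75 for group A is the same finding as an odds ratio of about 1.34 for group B.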

Let's first keep only the words whose p-value is below the standard significance threshold of `0.05`:

In [128]:
df_fish = df_fish[df_fish['pvalue'] < 0.05]

In [129]:
df_fish.sort_values(by='odds')

Unnamed: 0,group_a,group_b,odds,pvalue,word
1751,Republicans,Democrats,0.117974,6.635271e-18,vietnam
292,Republicans,Democrats,0.118821,4.320487e-29,college
489,Republicans,Democrats,0.151229,1.171550e-25,don
1613,Republicans,Democrats,0.188356,2.997954e-72,t
222,Republicans,Democrats,0.191555,1.001661e-23,businesses
1712,Republicans,Democrats,0.201861,4.123396e-20,u
725,Republicans,Democrats,0.202565,1.376399e-14,global
290,Republicans,Democrats,0.207166,1.667630e-08,cold
916,Republicans,Democrats,0.209439,2.553747e-09,kids
1394,Republicans,Democrats,0.213546,4.577548e-09,republicans
