# A note about merging data frames with different column names
Some of you had column names in the homework that were mismatched across two different data frames. People used a number of different workarounds, including renaming columns.

There's one option that we didn't discuss: merging on two columns that have overlapping values but different column names.

Let's return to the example from the homework:

In [244]:
coffee = {'food':'coffee','size':'8oz','price':'$3.00'}
banana ={'food':'banana','size':'4oz','price':'$0.50'}
donut = {'food':'donut', 'size':'6oz', 'price':'$1.00'}
food_df = pd.DataFrame([coffee,banana,donut])

In [245]:
food_df

Unnamed: 0,food,price,size
0,coffee,$3.00,8oz
1,banana,$0.50,4oz
2,donut,$1.00,6oz


What if we want to combine that data with some other data about when and how those foods sell **and** the column names are mismatched?

In [155]:
coffee_sales = {'item':'coffee', 'peak sales':'9am', 'total sold':181}
banana_sales = {'item':'banana', 'peak sales':'1pm', 'total sold':36}
donut_sales = {'item':'donut', 'peak sales':'10am', 'total sold':96}
sales_df = pd.DataFrame([banana_sales, coffee_sales, donut_sales])

In [156]:
sales_df

Unnamed: 0,item,peak sales,total sold
0,banana,1pm,36
1,coffee,9am,181
2,donut,10am,96


Note that the **rows** are in a different order. `sales_df` starts with banana whereas `food_df` starts with coffee.

The column `food` in `food_df` matches the column `item` in `sales_df`. Although they don't have the same column name, we can match the columns since they have the same **values**.

We just have to use the arguments `right_on` and `left_on` with `pd.merge`:

In [151]:
pd.merge(food_df, sales_df, left_on = 'food', right_on = 'item')

Unnamed: 0,food,price,size,item,peak sales,total sold
0,coffee,$3.00,8oz,coffee,9am,181
1,banana,$0.50,4oz,banana,1pm,36
2,donut,$1.00,6oz,donut,10am,96


`pd.merge` matches the rows correctly using the two key columns, and orders the rows based on the left data frame, in this case `food_df`.

Since the column names are not the same, it retains them both. We can easily resolve that with a `df.drop()` call:

In [247]:
pd.merge(food_df, sales_df, left_on = 'food', right_on = 'item').drop('item', axis = 'columns')

Unnamed: 0,food,price,size,peak sales,total sold
0,coffee,$3.00,8oz,9am,181
1,banana,$0.50,4oz,1pm,36
2,donut,$1.00,6oz,10am,96
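
Another option, sketched here with the same toy data, is to rename the mismatched key before merging so that only one key column survives. The `rename` call and the choice of `on='food'` are just one way to do it, not the approach from the homework.

```python
import pandas as pd

food_df = pd.DataFrame([
    {'food': 'coffee', 'size': '8oz', 'price': '$3.00'},
    {'food': 'banana', 'size': '4oz', 'price': '$0.50'},
    {'food': 'donut', 'size': '6oz', 'price': '$1.00'},
])
sales_df = pd.DataFrame([
    {'item': 'banana', 'peak sales': '1pm', 'total sold': 36},
    {'item': 'coffee', 'peak sales': '9am', 'total sold': 181},
    {'item': 'donut', 'peak sales': '10am', 'total sold': 96},
])

# rename 'item' to 'food' so both frames share a key name, then merge on it
merged = pd.merge(food_df, sales_df.rename(columns={'item': 'food'}), on='food')
merged
```

Because both frames now share the key name, there is no leftover `item` column to drop.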


# Different ways of calculating distinctive words
We're going to discuss three different ways of calculating **distinctive words**, by which I mean simply words that are used differently between two different texts or groups of texts.

## Which words distinguish my groups from each other?
These groups could be defined in any way, for example by differences in time.

1. **Difference of means**: This analyzes the average difference per document between two groups. Mainly useful if you assume that your documents are broadly similar.

2. **Fisher's exact test**: This statistical test compares the total number of times a word was used and not used by two groups, which we'll call A and B. The test calculates the *odds ratio*: the odds that group A would use that particular word as compared to group B. For example, a result could tell you that "the odds are 300 to 1 that `accio` was used by J.K. Rowling as opposed to a group of authors who are not J.K. Rowling." It also calculates the p-value, which measures the probability of obtaining a result at least that extreme if the null-hypothesis were correct. A low p-value indicates that we can reject the null-hypothesis. In our case, the null-hypothesis is that our groups use the word at equal rates. Rejecting the null-hypothesis means that the groups of authors do use the word at rates that are significantly different.
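
As a rough illustration of the first method, here is a minimal sketch of a difference of means on a made-up, already-scaled document-term matrix. The data frame `toy`, its values, and the group labels are all invented for illustration.

```python
import pandas as pd

# hypothetical scaled frequencies: one row per document, one column per word
toy = pd.DataFrame(
    {'war': [0.010, 0.012, 0.002, 0.004],
     'peace': [0.001, 0.003, 0.008, 0.006]},
    index=['doc1', 'doc2', 'doc3', 'doc4'])
groups = pd.Series(['A', 'A', 'B', 'B'], index=toy.index)

# average each word's frequency per group, then subtract group B from group A
means = toy.groupby(groups).mean()
diff = means.loc['A'] - means.loc['B']

# positive values lean toward group A, negative values toward group B
diff.sort_values()
```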

## Which words are distinctive of each *document* in my corpus?
(Or, alternatively, which documents contain a distinctive number of a given word in my corpus?)

**Term-frequency `*` inverse document frequency**: This measure compares how rare words are in your corpus with how frequent they are in a particular document. Words that are rare in the corpus but frequent in an individual document are likely distinctive of that document. For example, `accio` would be common in *Harry Potter*, but absent in almost every other book. That makes it a distinctive word.
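
A minimal sketch of the idea on invented counts. The idf formula used here, the log of the number of documents divided by the number of documents containing the word, is one common variant and is an assumption; other implementations smooth or scale it differently.

```python
import numpy as np
import pandas as pd

# hypothetical raw counts: rows are documents, columns are words
counts = pd.DataFrame(
    {'the': [50, 40, 60], 'accio': [5, 0, 0], 'said': [10, 8, 0]},
    index=['harry_potter', 'book_b', 'book_c'])

# term frequency: each count divided by the document's total word count
tf = counts.div(counts.sum(axis=1), axis=0)

# inverse document frequency: log(N / number of documents containing the word)
n_docs = len(counts)
doc_freq = (counts > 0).sum(axis=0)
idf = np.log(n_docs / doc_freq)

tfidf = tf * idf
tfidf.loc['harry_potter'].sort_values(ascending=False)
```

In this toy corpus `the` appears in every document, so its idf is zero and it drops out, while `accio` rises to the top for the one document that uses it.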

## Fisher's exact test
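[Fisher's Exact Test](https://en.wikipedia.org/wiki/Fisher%27s_exact_test) allows us to calculate the odds (and the p-value) that a given word would be used by a given group. It is based on contingency tables (aka cross tabulations). The classic example is Fisher's "lady tasting tea" experiment, in which a taster is given eight cups of tea, four with the milk poured first and four with the tea poured first, and is asked to say which is which. A contingency table for that experiment looks like this: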

|            | Guess milk | Guess tea |   |
|------------|------------|-----------|---|
| Milk first | 4          | 0         |   |
| Tea first  | 0          | 4         |   |
|   Total    | 4          | 4         | 8 |

We can think about this as representing *one* conclusion of the experiment. Of course there are numerous other possibilities: One where the guesser gets some or all of their guesses incorrect. Fisher's exact test evaluates the probability that this particular distribution of data was reached by chance.
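
To make that concrete, here is a sketch that computes the probability of each possible outcome of the tea experiment directly, using only the standard library. It assumes the usual setup for this table (eight cups, four of each kind, and a taster who knows to label exactly four cups "milk first"); the function name `prob_correct` is made up for illustration.

```python
from math import comb

def prob_correct(k, n_milk=4, n_tea=4):
    """Probability of picking exactly k of the milk-first cups by pure guessing."""
    # choose k of the milk-first cups and the remaining picks from the tea-first cups
    ways = comb(n_milk, k) * comb(n_tea, n_milk - k)
    return ways / comb(n_milk + n_tea, n_milk)

for k in range(5):
    print(k, prob_correct(k))

# the table above (all four correct) has probability 1/70 under pure guessing
p_all_correct = prob_correct(4)
```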

When dealing with much higher frequencies, such as word counts, it might be preferable to use the chi-squared test, which we may cover at a later date in this class.

To make a comparable contingency matrix for word frequencies, we need the following categories for data about words:

|         | Number of target word | All words except target word |   |
|---------|-----------------------|------------------------------|---|
| Group A | w                     | x                            |   |
| Group B | y                     | z                            |   |
|  Total  |                       |                              | n |

## Fisher's exact requires raw counts, not scaled frequencies:
To start, we need a document-term matrix (dtm) containing raw counts:

In [2]:
from make_dtm import *

In [3]:
sotus = '/Users/e/code/literarytextmining/corpora/sotu_1900-2019/texts'

In [4]:
import pandas as pd
meta = '/Users/e/code/literarytextmining/corpora/sotu_1900-2019/meta.csv'
meta = pd.read_csv(meta)

In [5]:
meta.head()

Unnamed: 0,president,year,filepath,party
0,McKinley,1900,1900.McKinley.txt,Republican
1,Roosevelt,1901,1901.Roosevelt.txt,Republican
2,Roosevelt,1902,1902.Roosevelt.txt,Republican
3,Roosevelt,1903,1903.Roosevelt.txt,Republican
4,Roosevelt,1904,1904.Roosevelt.txt,Republican


In [6]:
df = make_dtm(sotus, scaled = False, drop_below=50)

In [7]:
df.head()

Unnamed: 0_level_0,a,ability,able,about,above,abroad,absolutely,abuse,abuses,accept,...,worthy,would,wrong,year,years,yes,yet,you,young,your
filepath,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1900.McKinley.txt,253,4.0,3.0,11.0,2.0,4.0,,,,,...,2.0,13.0,,48.0,8.0,,4.0,11.0,,15.0
1901.Roosevelt.txt,275,5.0,11.0,4.0,6.0,7.0,4.0,,3.0,1.0,...,,47.0,3.0,9.0,26.0,,10.0,2.0,,4.0
1902.Roosevelt.txt,181,2.0,1.0,5.0,8.0,4.0,1.0,,,,...,1.0,36.0,2.0,7.0,5.0,,6.0,1.0,3.0,3.0
1903.Roosevelt.txt,215,2.0,,8.0,9.0,,,,,,...,1.0,26.0,1.0,39.0,19.0,,9.0,2.0,1.0,3.0
1904.Roosevelt.txt,247,1.0,6.0,7.0,7.0,7.0,3.0,,5.0,4.0,...,3.0,43.0,10.0,12.0,13.0,,14.0,4.0,3.0,6.0


Let's re-add our metadata:

In [8]:
df = pd.merge(df,meta,on='filepath')

# Returning to Fisher's

In [256]:
df.set_index('filepath', inplace=True)

Note that once again we get the extra metadata columns at the end:

In [165]:
df.head()

Unnamed: 0_level_0,a,ability,able,about,above,abroad,absolutely,abuse,abuses,accept,...,year_x,years,yes,yet,you,young,your,president_y,year_y,party_y
filepath,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1900.McKinley.txt,253,4.0,3.0,11.0,2.0,4.0,,,,,...,48.0,8.0,,4.0,11.0,,15.0,McKinley,1900,Republican
1901.Roosevelt.txt,275,5.0,11.0,4.0,6.0,7.0,4.0,,3.0,1.0,...,9.0,26.0,,10.0,2.0,,4.0,Roosevelt,1901,Republican
1902.Roosevelt.txt,181,2.0,1.0,5.0,8.0,4.0,1.0,,,,...,7.0,5.0,,6.0,1.0,3.0,3.0,Roosevelt,1902,Republican
1903.Roosevelt.txt,215,2.0,,8.0,9.0,,,,,,...,39.0,19.0,,9.0,2.0,1.0,3.0,Roosevelt,1903,Republican
1904.Roosevelt.txt,247,1.0,6.0,7.0,7.0,7.0,3.0,,5.0,4.0,...,12.0,13.0,,14.0,4.0,3.0,6.0,Roosevelt,1904,Republican


In [257]:
words = df.columns[:-3] # just get our words to avoid calculating on the metadata

In [258]:
words[:3], words[-3:]

(Index(['a', 'ability', 'able'], dtype='object'),
 Index(['you', 'young', 'your'], dtype='object'))

From the tf`*`idf results, we know that George W. Bush was most likely to talk about `iraq`. Let's see if that holds true for the Republican party as a whole:

In [168]:
Rs=df[df['party_y'] == 'Republican'][words]
Ds= df[df['party_y'] == 'Democrat'][words]
Rs.head()

Unnamed: 0_level_0,a,ability,able,about,above,abroad,absolutely,abuse,abuses,accept,...,worthy,would,wrong,year_x,years,yes,yet,you,young,your
filepath,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1900.McKinley.txt,253,4.0,3.0,11.0,2.0,4.0,,,,,...,2.0,13.0,,48.0,8.0,,4.0,11.0,,15.0
1901.Roosevelt.txt,275,5.0,11.0,4.0,6.0,7.0,4.0,,3.0,1.0,...,,47.0,3.0,9.0,26.0,,10.0,2.0,,4.0
1902.Roosevelt.txt,181,2.0,1.0,5.0,8.0,4.0,1.0,,,,...,1.0,36.0,2.0,7.0,5.0,,6.0,1.0,3.0,3.0
1903.Roosevelt.txt,215,2.0,,8.0,9.0,,,,,,...,1.0,26.0,1.0,39.0,19.0,,9.0,2.0,1.0,3.0
1904.Roosevelt.txt,247,1.0,6.0,7.0,7.0,7.0,3.0,,5.0,4.0,...,3.0,43.0,10.0,12.0,13.0,,14.0,4.0,3.0,6.0


## First column: frequencies of target word by group

In [259]:
word='iraq'
sum_word_Rs = Rs[word].sum()
sum_word_Ds = Ds[word].sum()

print(sum_word_Rs,sum_word_Ds) # these are our frequencies for each group. they will be column 1 in our contingency table

105.0 31.0


## Second column: sum of all words minus our target words
We can get the sum of all of our words by calling `sum` twice. The first call adds up each column, giving a series with one total per word; the second call adds up the values in that series:

In [262]:
Rs.sum().sum()

448709.0
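
A quick sketch of what the double `sum` does, on a made-up frame with a missing value like the ones in our dtm. By default, pandas skips `NaN` when summing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [3, 2], 'able': [np.nan, 1], 'about': [4, 0]})

per_word = toy.sum()    # first sum: one total per column (word)
total = per_word.sum()  # second sum: add those totals together

per_word, total
```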

In [171]:
sum_allword_Rs=Rs.sum().sum()
sum_allword_Rs

448709.0

In [172]:
sum_allword_Ds=Ds.sum().sum()
sum_allword_Ds

305467.0

In [173]:
sum_notword_Rs = sum_allword_Rs - sum_word_Rs
sum_notword_Rs

448604.0

In [174]:
sum_notword_Ds = sum_allword_Ds - sum_word_Ds
sum_notword_Ds

305436.0

## Make the table:

In [263]:
contingency_table = [
    [sum_word_Rs, sum_notword_Rs],
    [sum_word_Ds, sum_notword_Ds]
]

In [264]:
contingency_table

[[105.0, 448604.0], [31.0, 305436.0]]

To return to the contingency table layout from above, `contingency_table` now holds the following data for our example:

|            | Number of "iraq" | Sum of all words except "iraq" | Sums   |
|------------|------------------|--------------------------------|--------|
| Republican | 105              | 448604                         | 448709 |
| Democrat   | 31               | 305436                         | 305467 |
| Total      | 136              | 754040                         | 754176 |

It's clear that Republicans use `iraq` more often, but they also use more words in total. Fisher's test will tell us the *odds* of either party using it.

Now that we have the table, we can use Fisher's exact test to see what the odds are of one party using this word more than the other:

In [265]:
from scipy.stats import fisher_exact

oddsratio, pvalue = fisher_exact(contingency_table)
oddsratio, pvalue

(2.3061347877472795, 1.5757521308192708e-05)

## Interpreting results
So, according to our `oddsratio`, the odds of Republicans using `iraq` are about 2.3 times the odds of Democrats using it. We know that value is significant because the p-value is much lower than the standard `0.05` threshold used to reject the null-hypothesis.
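
For reference, the odds ratio can be reproduced by hand from the four cells of the table. This is only a sketch of the arithmetic, not a replacement for `fisher_exact`, which also computes the p-value.

```python
# cells of the contingency table from above
word_Rs, notword_Rs = 105, 448604
word_Ds, notword_Ds = 31, 305436

# odds of the target word within each group
odds_Rs = word_Rs / notword_Rs
odds_Ds = word_Ds / notword_Ds

# the odds ratio compares the two; values above 1 favor the first row
odds_ratio = odds_Rs / odds_Ds
round(odds_ratio, 4)
```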

Remember, Fisher's exact begins from the premise that the *groups are equally likely to use the words*. (Of course we don't actually need to believe that will be true in advance in order to find significant differences between the groups!)

In [269]:
import time
start = time.time()
oddsratio, pvalue = fisher_exact(contingency_table)
(time.time() - start) * 1000 # time.time() is in seconds, so this is the elapsed time in milliseconds

20.16901969909668

# Fisher's exact for every word in our data frame
When we check all of our words, which ones are significantly more likely to be used by one group rather than another?

We can write a function to test this:

In [270]:
def fish(group_a, group_a_name, group_b, group_b_name):
    results = []
    # test the columns for equivalence; don't run if the columns don't match
    # (Index.equals also copes with groups that have different numbers of columns)
    if group_a.columns.equals(group_b.columns):
        # 1. calculate total number of words in each group
        #    (these never change, so we compute them once, outside the loop)
        sum_allword_a = group_a.sum().sum()
        sum_allword_b = group_b.sum().sum()

        for word in group_a.columns:
            # TODO: shrink relative df for rows in cases where a and b contain zeroes

            # 2. calculate frequencies of each word
            sum_word_a = group_a[word].sum()
            sum_word_b = group_b[word].sum()

            # 3. calculate total number of words minus the target word
            sum_notword_a = sum_allword_a - sum_word_a
            sum_notword_b = sum_allword_b - sum_word_b

            # 4. make contingency table
            contingency_table = [[sum_word_a, sum_notword_a], [sum_word_b, sum_notword_b]]

            # 5. run fisher's exact
            odds, pvalue = fisher_exact(contingency_table)

            # 6. capture results in dictionary
            results.append({
                'word': word,
                'odds': odds,
                'pvalue': pvalue,
                'group_a': group_a_name,
                'group_b': group_b_name,
            })

    # returns an empty list if the columns did not match
    return results

This may take a minute to run!

In [181]:
test = fish(Rs, 'Republicans', Ds, 'Democrats')
# this takes about a minute to run on my computer

In [271]:
df_fish = pd.DataFrame(test)

In [272]:
df_fish.head()

Unnamed: 0,group_a,group_b,odds,pvalue,word
0,Republicans,Democrats,1.073176,3.9e-05,a
1,Republicans,Democrats,0.943339,0.752629,ability
2,Republicans,Democrats,0.748276,0.012898,able
3,Republicans,Democrats,0.789779,0.000961,about
4,Republicans,Democrats,1.317702,0.082408,above


The words with the highest odds are more likely to appear in group A; the words with the lowest odds are more likely to appear in group B.

Let's first filter for words whose p-value is less than the standard significance threshold of `0.05`:

In [277]:
df_fish.shape

(1834, 5)

In [278]:
df_fish = df_fish[df_fish['pvalue'] < 0.05]

Because of the way we have structured our data, the odds are calculated *relative* to group A. Group A (in this case, Republicans) is very unlikely to use `vietnam`, so its odds value is very low. Likewise, the odds value for `applause` is extremely high because it is primarily associated with Republican texts, and appears rarely or never in Democratic texts, which are group B.

In [279]:
df_fish.sort_values(by='odds')

Unnamed: 0,group_a,group_b,odds,pvalue,word
1751,Republicans,Democrats,0.117974,6.635271e-18,vietnam
292,Republicans,Democrats,0.118821,4.320487e-29,college
489,Republicans,Democrats,0.151229,1.171550e-25,don
1613,Republicans,Democrats,0.188356,2.997954e-72,t
222,Republicans,Democrats,0.191555,1.001661e-23,businesses
1712,Republicans,Democrats,0.201861,4.123396e-20,u
725,Republicans,Democrats,0.202565,1.376399e-14,global
290,Republicans,Democrats,0.207166,1.667630e-08,cold
916,Republicans,Democrats,0.209439,2.553747e-09,kids
1394,Republicans,Democrats,0.213546,4.577548e-09,republicans


# Visualizing differences between texts

## Distance matrix

We can also use DTMs to think about the "distance" between documents in the DTM space.

We need to begin with a scaled document-term matrix. I'm going to choose words that appear very frequently:

In [280]:
df = make_dtm(sotus, scaled = True, drop_below=1000)

Let's collect our words:

In [281]:
words = df.columns

In [282]:
words

Index(['a', 'able', 'about', 'abroad', 'achieve', 'act', 'action',
       'additional', 'adequate', 'administration',
       ...
       'workers', 'working', 'world', 'would', 'year', 'years', 'yet', 'you',
       'young', 'your'],
      dtype='object', length=494)

And we'll add party metadata for analysis:

In [283]:
meta.columns

Index(['president', 'year', 'filepath', 'party'], dtype='object')

In [284]:
df = pd.merge(df, meta[['filepath','party']], on='filepath')

Now, I'm going to reset the index so that we can use the filepaths column as a label:

In [285]:
df.head()

Unnamed: 0,filepath,a,able,about,abroad,achieve,act,action,additional,adequate,...,working,world,would,year,years,yet,you,young,your,party
0,1900.McKinley.txt,0.013367,0.000159,0.000581,0.000211,,0.001004,0.001268,0.000264,0.000264,...,0.000106,0.000528,0.000687,0.002536,0.000423,0.000211,0.000581,,0.000793,Republican
1,1901.Roosevelt.txt,0.013951,0.000558,0.000203,0.000355,0.000101,0.000812,0.000609,0.000355,0.000304,...,0.000101,0.001522,0.002384,0.000457,0.001319,0.000507,0.000101,,0.000203,Republican
2,1902.Roosevelt.txt,0.018451,0.000102,0.00051,0.000408,,0.000612,0.000917,0.000714,0.000102,...,,0.001325,0.00367,0.000714,0.00051,0.000612,0.000102,0.000306,0.000306,Republican
3,1903.Roosevelt.txt,0.014483,,0.000539,,,0.000876,0.000674,6.7e-05,0.000269,...,0.000269,0.001078,0.001751,0.002627,0.00128,0.000606,0.000135,6.7e-05,0.000202,Republican
4,1904.Roosevelt.txt,0.014109,0.000343,0.0004,0.0004,0.000114,0.000857,0.000628,5.7e-05,0.000343,...,0.000228,0.000628,0.002456,0.000685,0.000743,0.0008,0.000228,0.000171,0.000343,Republican


**Different:** I'm going to keep filepath as both a column and as the index in order to use it to label points in a graph:

In [195]:
df.set_index('filepath', drop = False, inplace=True) # drop = False keeps the column in the data frame when it is used as the index

In [196]:
df.head()

Unnamed: 0_level_0,filepath,a,able,about,abroad,achieve,act,action,additional,adequate,...,working,world,would,year,years,yet,you,young,your,party
filepath,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1900.McKinley.txt,1900.McKinley.txt,0.013367,0.000159,0.000581,0.000211,,0.001004,0.001268,0.000264,0.000264,...,0.000106,0.000528,0.000687,0.002536,0.000423,0.000211,0.000581,,0.000793,Republican
1901.Roosevelt.txt,1901.Roosevelt.txt,0.013951,0.000558,0.000203,0.000355,0.000101,0.000812,0.000609,0.000355,0.000304,...,0.000101,0.001522,0.002384,0.000457,0.001319,0.000507,0.000101,,0.000203,Republican
1902.Roosevelt.txt,1902.Roosevelt.txt,0.018451,0.000102,0.00051,0.000408,,0.000612,0.000917,0.000714,0.000102,...,,0.001325,0.00367,0.000714,0.00051,0.000612,0.000102,0.000306,0.000306,Republican
1903.Roosevelt.txt,1903.Roosevelt.txt,0.014483,,0.000539,,,0.000876,0.000674,6.7e-05,0.000269,...,0.000269,0.001078,0.001751,0.002627,0.00128,0.000606,0.000135,6.7e-05,0.000202,Republican
1904.Roosevelt.txt,1904.Roosevelt.txt,0.014109,0.000343,0.0004,0.0004,0.000114,0.000857,0.000628,5.7e-05,0.000343,...,0.000228,0.000628,0.002456,0.000685,0.000743,0.0008,0.000228,0.000171,0.000343,Republican
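
To give a sense of what a distance matrix looks like, here is a minimal sketch on an invented scaled dtm. It assumes Euclidean distance and fills missing values with 0; both choices are assumptions for illustration, and other metrics (such as cosine distance) are also common.

```python
import numpy as np
import pandas as pd

# hypothetical scaled dtm: rows are documents, columns are words
toy = pd.DataFrame(
    {'a': [0.013, 0.014, 0.018], 'able': [0.0002, np.nan, 0.0001]},
    index=['doc1', 'doc2', 'doc3']).fillna(0)

values = toy.to_numpy()
# pairwise differences via broadcasting, then the Euclidean norm of each pair
diffs = values[:, None, :] - values[None, :, :]
dist = pd.DataFrame(np.sqrt((diffs ** 2).sum(axis=2)),
                    index=toy.index, columns=toy.index)
dist
```

Each cell holds the distance between the row document and the column document, so the diagonal is zero and the matrix is symmetric.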


# Other `nltk` methods

First, we need to create our familiar list of words. We can do it using our own `tokenize` function, or `nltk`'s:

In [554]:
johnson = '/Users/e/Downloads/1912_johnson_ex-colored.txt'
with open(johnson) as f: # the with block closes the file for us
    text = f.read()

In [555]:
tokens = tokenize(text)

In [556]:
tokens[:5]

['the', 'autobiography', 'of', 'an', 'ex']

We can then create an `nltk` `Text` object that will allow us to use some of its other features:

In [557]:
import nltk
autobio = nltk.Text(tokens)

In [558]:
autobio.concordance('sixth')

Displaying 6 of 6 matches:
ging house in th street just west of sixth avenue the house was run by a short 
 about the middle of a block between sixth and seventh avenues one of the young
ed that we go to the club we went to sixth avenue walked two blocks and turned 
t dark went round to a restaurant on sixth avenue and ate something then walked
ed to ten blocks the boundaries were sixth avenue from twenty third to thirty t
ss restaurants but i shunned the old sixth avenue district as though it were pe


Our `KWIC` function is already better than this! But NLTK does let us do some other useful stuff with our new `Text` object.

For example, `Text.similar()` finds words that appear in the same *contexts* as a given word. It does this by simply counting the number of unique contexts that the words share.

In [559]:
autobio.similar('sixth')

whites blacks them themselves me three shiny him four poor bliss
alaska fifth


In [560]:
autobio.similar('night')

day time race school moment man evening boy newspapers and question
people country me place times world once excitement if


In [561]:
autobio.similar('music')

club me house men table race it them life place world school others
south car paris day country time which


## What is a "context"?
Using the same data as with `Text.similar()`, we can see which contexts words share:

In [562]:
autobio.common_contexts(['sixth','whites'])

between_and


In [563]:
autobio.common_contexts(['sixth','blacks'])

between_and


In [564]:
autobio.common_contexts(['day','night'])

the_after that_i one_near one_at one_a one_i the_i that_she the_before
the_and one_he


In [565]:
autobio.common_contexts(['white','black'])

the_people the_race the_boys a_man the_one the_man the_men as_as


In each of the above cases, `'white'` and `'black'` both appear somewhere in the text in the position given by the `_`.

That is, the book contains both the phrase "the white boys" and the phrase "the black boys."
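
To see what a context is mechanically, here is a rough sketch that collects the (previous word, next word) pair around each token. This is a simplified stand-in for illustration, not `nltk`'s actual implementation; the helper `contexts` and the toy sentence are made up.

```python
from collections import defaultdict

def contexts(tokens):
    """Map each word to the set of (previous, next) pairs it appears between."""
    found = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        found[word].add((prev, nxt))
    return found

toy = 'the white boys and the black boys saw the white man'.split()
ctx = contexts(toy)

# contexts shared by both words, written the way nltk displays them
shared = {f'{p}_{n}' for p, n in ctx['white'] & ctx['black']}
shared
```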

## `nltk`'s collocates
This enables us to see which words appear next to each other more often than we would expect based on their distributions. By default, it ignores all stopwords.

Collocates are frequently used to identify words that have a specific meaning when combined that they do not have individually. A classic example would be "red wine." In this context, "red" is less a description than a conventional way of referring to some types of wines. We do not talk about "maroon wine" or "crimson wine"; it has a conventional name.
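
One simple way to quantify "more often than we would expect" is pointwise mutual information (PMI), sketched below on a toy token list. PMI is used here only as an easy-to-read illustration; it is not necessarily the measure `nltk` uses by default, and the token list is invented.

```python
import math
from collections import Counter

tokens = 'red wine and red wine with red meat and white wine'.split()

word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_words = len(tokens)
n_bigrams = len(tokens) - 1

def pmi(w1, w2):
    """log2 of the observed bigram probability over the probability expected by chance."""
    p_bigram = bigram_counts[(w1, w2)] / n_bigrams
    p_expected = (word_counts[w1] / n_words) * (word_counts[w2] / n_words)
    return math.log2(p_bigram / p_expected)

pmi('red', 'wine'), pmi('wine', 'and')
```

A higher score means the pair occurs together more often than the individual word frequencies alone would predict.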

NLTK makes collocates very easy to get:

In [None]:
autobio.collocation_list(num=20) # the num option tells nltk how many to return

You could use the collocates listed here to modify the way you count words in your text. For example, in every case where "united" is followed by "states" or "new" is followed by "york," that phrase refers to a different object than any of those words in isolation.
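
A minimal sketch of that idea: join chosen collocates into single tokens before counting. The helper `merge_collocates`, the underscore convention, and the sample tokens are assumptions for illustration.

```python
from collections import Counter

def merge_collocates(tokens, pairs):
    """Join any adjacent (w1, w2) found in `pairs` into one token 'w1_w2'."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + '_' + tokens[i + 1])
            i += 2  # skip the second word of the pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

toy = 'i left new york for the united states of my new life'.split()
pairs = {('new', 'york'), ('united', 'states')}
counts = Counter(merge_collocates(toy, pairs))
```

After merging, `new_york` is counted as its own item, and the remaining `new` (in "new life") is counted separately.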