# One quick thing from HW05

When you give a function parameter a default value of `True` or `False` in the function definition, you let your user decide *how* they want the function to run. The value can be used to determine which parts of the function run in a particular case. Arguments like these are sometimes called *flags*.

For example:

In [857]:
def weird_print(my_string, reverse_it = False):
    if reverse_it is True:
        print(my_string[::-1]) # this reverses the order of the characters in the string
    else:
        print(my_string)

In [858]:
weird_print('check it out')

check it out


In [859]:
weird_print('check it out', reverse_it=True)

tuo ti kcehc


In this case, the user decides whether to `reverse_it` by passing `True` or `False` as the second argument.

We can use the same principle to switch on or off certain parts of our functions.

# Collocates

Last time, Amanda asked about analyzing collocates for specific words. So, I decided to write something to help us do that today.

We'll start by gathering a list of texts using our familiar `absolute_paths` function:

In [860]:
import os

def absolute_paths(directory, txt_only = True):
    files = os.listdir(directory)
    absolute_paths = []
    
    for file in files:
        path = os.path.join(directory, file)
        absolute_paths.append(path)
    
    if txt_only is True:
        txts = []
        for x in absolute_paths:
            if x.endswith('.txt'): # match the file extension, not '.txt' anywhere in the path
                txts.append(x)
        return txts
    
    else:        
        return absolute_paths

In [861]:
# get harry potter paths using our absolute_paths function
hp_dir = '/Users/e/code/literarytextmining/corpora/harry_potter/texts'
hp_files = absolute_paths(hp_dir)
hp_files

['/Users/e/code/literarytextmining/corpora/harry_potter/texts/5 Order of the Phoenix.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/4 Goblet of Fire.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/6 Half-Blood Prince.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/1 Sorcerers Stone.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/3 Prisoner of Azkaban.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/7 Deathly Hallows.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/2 Chamber of Secrets.txt']

## Looping over our files
Now that we have our files, we want to do the following steps.

For every file:

1. Tokenize it.
2. Find all instances of our keyword.
3. Count all of the words within a certain number of words on either side of every instance of our keyword.
4. Append the resulting dictionary to a list.
5. Repeat.

Our result is going to be a list of dictionaries that we can drop into pandas easily.
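
To see why a list of dictionaries is convenient, here is a minimal sketch with made-up counts (the file names and numbers below are illustrative, not real output). pandas lines the keys up as columns and fills in `NaN` wherever a dictionary is missing a key:

```python
import pandas as pd

# two made-up dictionaries standing in for the per-file results
rows = [
    {'filepath': 'book_a.txt', 'door': 2, 'cabin': 1},
    {'filepath': 'book_b.txt', 'door': 5},  # no 'cabin' key here
]

toy_df = pd.DataFrame(rows).set_index('filepath')
print(toy_df)  # the 'cabin' cell for book_b.txt is NaN
```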

We'll start with our `tokenize` function:

In [862]:
import string
import re

def tokenize(text, keep_punct = False):
    if keep_punct is True:
        for punct in string.punctuation:
            text = text.replace(punct, ' ' + punct + ' ')
    else:
        for punct in string.punctuation:
            text = text.replace(punct, ' ')
    
    # this replaces *any* amount of whitespace with a single space using regular expressions
    text = re.sub(r'\s+', ' ', text) # raw string, so Python does not treat \s as a string escape
    
    result = []
    
    for x in text.lower().split(' '):
        if x.isalpha():
            result.append(x)
        else:
            word = []
            for y in x: # for every character
                if y.isalpha(): word.append(y)
                    
            result.append(''.join(word))
                
    return result

In [863]:
text = open(hp_files[0]).read()
tokens = tokenize(text)

In [864]:
tokens[:10]

['chapter',
 'one',
 'dudley',
 'demented',
 'the',
 'hottest',
 'day',
 'of',
 'the',
 'summer']

Let's make our test word `hagrid`. We want to find every index where `hagrid` appears in the list.

We're going to do that using `enumerate` to track the position of our tokens:

In [865]:
for i,x in enumerate(['this', 'that', 'the other']):
    print(i)
    print(x)

0
this
1
that
2
the other


In [866]:
indexes = []

for i, token in enumerate(tokens):
    if token == 'hagrid':
        indexes.append(i)

In [867]:
indexes[:5]

[27075, 43277, 47354, 53216, 53264]

In [868]:
len(indexes)

370

This says we got 370 instances of `'hagrid'`. Let's `count` to confirm our method worked:

In [869]:
tokens.count('hagrid')

370
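
The same index search can also be written as a list comprehension. This is just a sketch on a made-up token list (`toy_tokens` and `toy_indexes` are illustrative names), so you can check the logic by eye:

```python
toy_tokens = ['hagrid', 'said', 'harry', 'to', 'hagrid']

# keep the position i of every token equal to the target word
toy_indexes = [i for i, token in enumerate(toy_tokens) if token == 'hagrid']

print(toy_indexes)  # [0, 4]
print(len(toy_indexes) == toy_tokens.count('hagrid'))  # True
```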

Now we can use roughly the same ideas we used with our `KWIC` function to get all of our collocates:

In [870]:
indexes[0]

27075

In [871]:
tokens[indexes[0]-10:indexes[0]+10]

['cantering',
 'softly',
 'up',
 'and',
 'down',
 'outside',
 'the',
 'bedroom',
 'door',
 'and',
 'hagrid',
 'the',
 'care',
 'of',
 'magical',
 'creatures',
 'teacher',
 'was',
 'saying',
 'they']

One difference is that we don't want to count the target word itself, since that would over-count the number of times that `'hagrid'` is a collocate of itself.

We know where the target sits in each window. Because a slice excludes its end index, `tokens[i-10:i+10]` returns 20 tokens: the 10 before the target, the target itself at position 10, and the 9 after it. Half the length of the list is therefore the position of the target word, so we can remove it by deleting that element:

In [872]:
hagrid_test = tokens[indexes[0]-10:indexes[0]+10]

In [873]:
del hagrid_test[round(len(hagrid_test)/2)]

In [874]:
hagrid_test

['cantering',
 'softly',
 'up',
 'and',
 'down',
 'outside',
 'the',
 'bedroom',
 'door',
 'and',
 'the',
 'care',
 'of',
 'magical',
 'creatures',
 'teacher',
 'was',
 'saying',
 'they']

Now, all we need to do is extend all of these into a big list of words, and count them up the same way we have been with our full texts:

In [875]:
collocates = []

for index in indexes:
    colls = tokens[index-10:index+10]
    del colls[round(len(colls)/2)]
    collocates.extend(colls) # we use extend rather than append because we are adding additional elements *from* a list

In [876]:
collocates[:40]

['cantering',
 'softly',
 'up',
 'and',
 'down',
 'outside',
 'the',
 'bedroom',
 'door',
 'and',
 'the',
 'care',
 'of',
 'magical',
 'creatures',
 'teacher',
 'was',
 'saying',
 'they',
 'of',
 'the',
 'holidays',
 'approached',
 'he',
 'could',
 'not',
 'wait',
 'to',
 'see',
 'again',
 'to',
 'play',
 'quidditch',
 'even',
 'to',
 'stroll',
 'across',
 'the',
 'vanished',
 'six']

Now we can just count these using a dictionary like we've done with prior word lists:

In [877]:
d = {}

for coll in collocates:
    if coll not in d:
        d[coll] = 1
    else:
        d[coll] += 1

In [878]:
d['creatures']

9
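
If you'd rather not write the counting loop by hand, the standard library's `collections.Counter` builds the same kind of word-to-count mapping. A quick sketch on a made-up list (`toy_collocates` is illustrative):

```python
from collections import Counter

toy_collocates = ['the', 'care', 'of', 'magical', 'creatures', 'the', 'creatures']

toy_counts = Counter(toy_collocates)  # word -> number of occurrences
print(toy_counts['creatures'])  # 2
print(toy_counts.most_common(3))  # the three most frequent words with their counts
```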

Finally, we can wrap all of this logic into an abstract function that will work for any book.

I use the variable name `horizon` for the number of words that we want to catch on either side of our target term:

In [879]:
def get_collocates(filepath, target_word, horizon = 10):
    text = open(filepath).read() # get text
    tokens = tokenize(text) # get tokens
    
    indexes = []

    for i, token in enumerate(tokens):
        if token == target_word:
            indexes.append(i) # get indexes
    
    collocates = []

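    for index in indexes:
        start = max(index - horizon, 0) # a negative start would wrap around to the end of the list
        colls = tokens[start:index + horizon] # use horizon instead of a hard-coded 10
        del colls[index - start] # don't count target term
        collocates.extend(colls) # we use extend rather than append because we are adding additional elements *from* a list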
        
    d = {}
    # we want to make sure we get data about where our values are coming from. this tells us the file:
    d['filepath'] = os.path.split(filepath)[-1] 
    d['target_word'] = target_word # this tells us our target
    
    for coll in collocates:
        if coll not in d:
            d[coll] = 1 # count up collocates
        else:
            d[coll] += 1
    
    return d

In [880]:
hp_files[0]

'/Users/e/code/literarytextmining/corpora/harry_potter/texts/5 Order of the Phoenix.txt'

In [None]:
get_collocates(hp_files[0], 'hagrid') # it works!

Now that we have it working for one file, we can loop this function we've written over multiple files, and append our results to a list of dictionaries:

In [882]:
hp_files

['/Users/e/code/literarytextmining/corpora/harry_potter/texts/5 Order of the Phoenix.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/4 Goblet of Fire.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/6 Half-Blood Prince.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/1 Sorcerers Stone.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/3 Prisoner of Azkaban.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/7 Deathly Hallows.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/2 Chamber of Secrets.txt']

In [883]:
output = []

for file in hp_files:
    collocates = get_collocates(file, 'hagrid')
    output.append(collocates)

In [884]:
len(output)

7

In [885]:
import pandas as pd
hagrid_colls = pd.DataFrame(output)

In [886]:
hagrid_colls.set_index(['filepath', 'target_word']).sort_index()

| filepath | target_word | a | abandon | able | abound | about | above | abruptly | absence | absent | acceleration | ... | yet | you | young | your | yours | yourself | yowling | zey | zigzagged | zoomed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 Sorcerers Stone.txt | hagrid | 87 | NaN | NaN | NaN | 20 | 1.0 | NaN | NaN | NaN | NaN | ... | 5.0 | 27 | NaN | 5 | 1.0 | 2.0 | NaN | NaN | NaN | NaN |
| 2 Chamber of Secrets.txt | hagrid | 52 | NaN | NaN | NaN | 9 | 1.0 | 1.0 | NaN | NaN | NaN | ... | NaN | 26 | 1.0 | 2 | 1.0 | 1.0 | NaN | NaN | NaN | NaN |
| 3 Prisoner of Azkaban.txt | hagrid | 75 | NaN | 1.0 | NaN | 16 | NaN | NaN | NaN | NaN | NaN | ... | 2.0 | 38 | NaN | 5 | NaN | 1.0 | NaN | NaN | NaN | NaN |
| 4 Goblet of Fire.txt | hagrid | 97 | NaN | 3.0 | NaN | 30 | 1.0 | NaN | NaN | NaN | NaN | ... | 6.0 | 55 | NaN | 7 | NaN | NaN | 1.0 | NaN | NaN | NaN |
| 5 Order of the Phoenix.txt | hagrid | 143 | 1.0 | 4.0 | 1.0 | 17 | NaN | 2.0 | 2.0 | 1.0 | NaN | ... | 2.0 | 68 | NaN | 7 | 1.0 | NaN | NaN | NaN | NaN | NaN |
| 6 Half-Blood Prince.txt | hagrid | 58 | NaN | 3.0 | NaN | 3 | NaN | NaN | 1.0 | NaN | NaN | ... | NaN | 21 | 1.0 | 6 | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 Deathly Hallows.txt | hagrid | 56 | NaN | 1.0 | NaN | 3 | NaN | NaN | NaN | NaN | 1.0 | ... | 2.0 | 15 | NaN | 2 | NaN | NaN | NaN | 1.0 | 2.0 | 1.0 |


## Shrinking the data
This is a big data frame with 3,731 columns. We can use the techniques we discussed last time to cut it down to a more manageable size:

In [887]:
hagrid_colls = hagrid_colls.set_index(['filepath', 'target_word']).sort_index() # reassigning the changes above

Let's see which ones appear with `'hagrid'` the most in general:

In [888]:
hagrid_colls.sum().sort_values(ascending=False)

the            1181.0
said            858.0
to              689.0
and             682.0
a               568.0
harry           553.0
of              486.0
he              465.0
his             463.0
was             428.0
in              326.0
had             284.0
it              276.0
at              274.0
him             262.0
you             250.0
that            238.0
hagrid          231.0
on              220.0
as              214.0
with            209.0
they            197.0
hermione        188.0
up              179.0
i               178.0
ron             162.0
but             162.0
not             154.0
all             141.0
be              140.0
                ...  
ripping           1.0
furry             1.0
france            1.0
road              1.0
frankly           1.0
rubble            1.0
frantically       1.0
freckles          1.0
rub               1.0
rows              1.0
french            1.0
rowle             1.0
fret              1.0
roused            1.0
round     

Ok, so we have a lot that are very large, and others that are very small. Let's see how the data is distributed:

In [889]:
hagrid_colls.sum().describe() # describe() gives us information about the result above

count    3731.00000
mean        7.65398
std        39.56446
min         1.00000
25%         1.00000
50%         2.00000
75%         4.00000
max      1181.00000
dtype: float64

This shows us that 75% of `'hagrid'`'s collocates appear 4 times or fewer across all 7 books.

Clearly we should set a pretty high threshold. What's the value at the 90th percentile?

In [890]:
hagrid_sums = hagrid_colls.sum()

In [891]:
len(hagrid_sums)

3731

In [892]:
round(len(hagrid_sums) * 0.9)

3358

In [893]:
hagrid_sums.sort_values().iloc[3358] # sort_values sorts in ascending order by default; .iloc selects by position

10.0
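
pandas can also compute a percentile cutoff directly with `Series.quantile`. Here is a sketch on a made-up Series (the words and counts are illustrative). Note that `quantile` interpolates between neighboring values by default, so on real data it can differ slightly from the sort-and-index approach above:

```python
import pandas as pd

toy_sums = pd.Series({'the': 100, 'said': 50, 'cabin': 8, 'fang': 5, 'door': 3,
                      'yeh': 2, 'ter': 2, 'rub': 1, 'fret': 1, 'road': 1})

cutoff = toy_sums.quantile(0.9)  # value at the 90th percentile
top = toy_sums[toy_sums >= cutoff]  # keep only the words at or above the cutoff

print(cutoff)
print(top)
```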

Words that appear 10 times or more with `'hagrid'` represent the 90th+ percentile of collocates. So let's see those:

In [894]:
hagrid_sums.sort_values()[3358:]

grawp         10.0
shaggy        10.0
fact          10.0
suddenly      10.0
find          10.0
given         10.0
sure          10.0
case          10.0
followed      10.0
rita          10.0
supposed      10.0
looks         10.0
caught        10.0
expelled      10.0
someone       10.0
lupin         10.0
sound         10.0
point         10.0
stepped       10.0
heavily       10.0
eaters        10.0
heads         10.0
every         10.0
hard          10.0
big           11.0
deep          11.0
mouth         11.0
sounding      11.0
sight         11.0
small         11.0
             ...  
be           140.0
all          141.0
not          154.0
but          162.0
ron          162.0
i            178.0
up           179.0
hermione     188.0
they         197.0
with         209.0
as           214.0
on           220.0
hagrid       231.0
that         238.0
you          250.0
him          262.0
at           274.0
it           276.0
had          284.0
in           326.0
was          428.0
his         

Nice, some of these seem especially related to `'hagrid'`: `grawp`, `shaggy`, `rita`, etc.

## Using stopwords to shrink our collocates
But a lot of the highest value words here are not related to `'hagrid'` at all. Those are simply the most frequent words, which collocate a lot with just about any word.

There are multiple techniques we could use for getting rid of these. We'll start by using `NLTK`'s standard English stopword list:

In [895]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/e/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [896]:
from nltk.corpus import stopwords
len(stopwords.words('english'))

179

Again, these 179 stopwords are generally defined as high frequency / low content words. Because they are so common, they don't tell us much that is meaningful about the results that we're looking at, because they appear frequently *everywhere*.

Let's look at a few:

In [897]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

How can we use these words to shrink our collocates list further? Easy: we can check each of our words against the stopwords list and only retain those that do not appear in it.

In [898]:
hagrid_90th = hagrid_sums.sort_values()[3358:].index # index returns the names of the values here

In [899]:
hagrid_90th

Index(['grawp', 'shaggy', 'fact', 'suddenly', 'find', 'given', 'sure', 'case',
       'followed', 'rita',
       ...
       'was', 'his', 'he', 'of', 'harry', 'a', 'and', 'to', 'said', 'the'],
      dtype='object', length=373)

In [900]:
hagrid_nostops = []

for x in hagrid_90th:
    if x not in stopwords.words('english'):
        hagrid_nostops.append(x)

In [901]:
hagrid_nostops[:5]

['grawp', 'shaggy', 'fact', 'suddenly', 'find']

In [902]:
len(hagrid_nostops)

275
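
The loop above rebuilds the stopword list on every pass. For larger vocabularies it can help to build the stopwords once as a `set`, since membership checks on a set are fast. A sketch with a tiny made-up stopword set (`toy_stops` and `toy_words` are illustrative; with NLTK you could build the real set once with `set(stopwords.words('english'))`):

```python
# a tiny made-up stopword set, standing in for the full NLTK list
toy_stops = {'the', 'and', 'of', 'to', 'was'}

toy_words = ['grawp', 'the', 'shaggy', 'of', 'cabin', 'was']

# keep only the words that are not stopwords
toy_nostops = [w for w in toy_words if w not in toy_stops]
print(toy_nostops)  # ['grawp', 'shaggy', 'cabin']
```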

Now, finally, let's use this filter on our original dataframe:

In [903]:
hagrid_colls[hagrid_nostops]

| filepath | target_word | grawp | shaggy | fact | suddenly | find | given | sure | case | followed | rita | ... | dumbledore | like | ter | yeh | back | ron | hermione | hagrid | harry | said |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 Sorcerers Stone.txt | hagrid | NaN | 1.0 | 3.0 | 4.0 | NaN | 2.0 | 1.0 | NaN | 3.0 | NaN | ... | 8.0 | 14 | 14 | 12 | 17 | 12 | 1 | 37 | 79 | 84 |
| 2 Chamber of Secrets.txt | hagrid | NaN | 1.0 | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | ... | 17.0 | 4 | 5 | 3 | 6 | 26 | 10 | 15 | 46 | 65 |
| 3 Prisoner of Azkaban.txt | hagrid | NaN | 2.0 | NaN | 1.0 | 2.0 | 1.0 | 2.0 | 6.0 | 4.0 | NaN | ... | 7.0 | 6 | 18 | 16 | 24 | 29 | 32 | 32 | 70 | 115 |
| 4 Goblet of Fire.txt | hagrid | NaN | 1.0 | 1.0 | 4.0 | 4.0 | 3.0 | 1.0 | NaN | 2.0 | 10.0 | ... | 29.0 | 19 | 23 | 20 | 31 | 39 | 41 | 57 | 110 | 191 |
| 5 Order of the Phoenix.txt | hagrid | 3.0 | 3.0 | 6.0 | 1.0 | 2.0 | NaN | 3.0 | 3.0 | NaN | NaN | ... | 12.0 | 28 | 41 | 37 | 33 | 39 | 79 | 59 | 112 | 281 |
| 6 Half-Blood Prince.txt | hagrid | 6.0 | 2.0 | NaN | NaN | 2.0 | NaN | 2.0 | NaN | 1.0 | NaN | ... | 11.0 | 11 | 7 | 22 | 9 | 14 | 17 | 20 | 67 | 89 |
| 7 Deathly Hallows.txt | hagrid | 1.0 | NaN | NaN | NaN | NaN | 4.0 | NaN | 1.0 | NaN | NaN | ... | NaN | 3 | 8 | 8 | 15 | 3 | 8 | 11 | 69 | 33 |


We went from 3,731 collocates down to just 275, which is an amount of data we can actually read!

Let's see what the top results here are:

In [904]:
hagrid_shrunk = hagrid_colls[hagrid_nostops] # making a new dtm to save the smaller group

In [905]:
hagrid_shrunk.sum().sort_values(ascending = False)[:30]

said          858.0
harry         553.0
hagrid        231.0
hermione      188.0
ron           162.0
back          135.0
yeh           118.0
ter           116.0
like           85.0
dumbledore     84.0
see            83.0
looked         79.0
looking        75.0
got            75.0
one            74.0
know           73.0
could          70.0
asked          67.0
go             67.0
professor      63.0
would          62.0
told           59.0
door           55.0
around         55.0
head           50.0
get            50.0
still          49.0
come           47.0
hand           46.0
right          44.0
dtype: float64

Here, we have some of Hagrid's distinctive vocabulary (`ter` and `yeh`), as well as the characters he is closest to: Harry, Hermione, Ron, and Dumbledore. We're getting close.

# Scaling collocate results relative to the corpus

The numbers above represent the *absolute frequencies* of all of the words associated with Hagrid in *Harry Potter*. That might be ok in this context since we're dealing with a corpus that is a unified object. But in other cases, we might want to scale our collocates relative to their total frequency in the corpus.

To do that we need to make a data frame containing all of our raw frequencies, then divide each of our collocates by the total number of occurrences of that word in the corpus. The results will look very different!

We begin with our `make_dtm` function from last time:

In [906]:
import pandas as pd

def make_dtm(directory, scaled = False):
    files = absolute_paths(directory)
    
    result = [] # empty list where I will append the dictionaries of word counts
    
    for file in files: # looping over the results
        text = open(file).read() # read in text file
        tokens = tokenize(text) # make tokens list
        d = count_words(tokens) # use count_words to create a dictionary
        
        if scaled is True:
            total_words = sum(list(d.values()))
            for key,value in d.items():
                d[key] = d[key] / total_words
        
        # os.path.split() returns the base path and the filename as a pair; [-1] keeps just the filename
        d['filepath'] = os.path.split(file)[-1] # careful: this would overwrite the count for the token 'filepath' if a text contained it
        result.append(d) # append the (optionally scaled) result
    
    return pd.DataFrame(result).set_index('filepath').sort_index()
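
`make_dtm` relies on the `count_words` function from a previous session, which isn't reproduced in these notes. If you need a stand-in, here is a minimal sketch, assuming `count_words` takes a list of tokens and returns a dictionary of word counts (the original may differ in its details):

```python
def count_words(tokens):
    """Return a dictionary mapping each token to the number of times it appears."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1  # start at 0 the first time we see a token
    return counts

print(count_words(['the', 'cat', 'the']))  # {'the': 2, 'cat': 1}
```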

In [907]:
hp_dir = '/Users/e/code/literarytextmining/corpora/harry_potter/texts/'

In [908]:
hp_dtm = make_dtm(hp_dir)

In [909]:
hp_dtm

| filepath | a | aaaaaaaaargh | aaaaaaaarrrrrgh | aaaaaaand | aaaaaand | aaaaahed | aaaaargh | aaaah | aaah | aargh | ... | zograf | zombie | zone | zonko | zoo | zoological | zoom | zoomed | zooming | éclairs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 Sorcerers Stone.txt | 1066 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 2.0 | NaN | NaN | 7.0 | NaN | 1.0 | 1 | 2.0 | 1.0 |
| 2 Chamber of Secrets.txt | 1879 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | 2 | NaN | NaN |
| 3 Prisoner of Azkaban.txt | 2222 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 1.0 | NaN | 1.0 | NaN | NaN | NaN | 9 | 3.0 | NaN |
| 4 Goblet of Fire.txt | 3680 | NaN | 1.0 | 1.0 | 1.0 | 1.0 | NaN | NaN | 1.0 | NaN | ... | 1.0 | NaN | NaN | NaN | NaN | 1.0 | 4.0 | 9 | 12.0 | NaN |
| 5 Order of the Phoenix.txt | 4967 | 1.0 | NaN | NaN | NaN | NaN | 1.0 | 1.0 | NaN | 2.0 | ... | NaN | NaN | 1.0 | NaN | NaN | NaN | 2.0 | 23 | 7.0 | NaN |
| 6 Half-Blood Prince.txt | 3323 | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7 | 2.0 | 1.0 |
| 7 Deathly Hallows.txt | 3604 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 6 | 5.0 | NaN |


Let's shrink our new DTM down to include just those words we're interested in for `'hagrid'`:

In [910]:
hagrid_shrunk.columns

Index(['grawp', 'shaggy', 'fact', 'suddenly', 'find', 'given', 'sure', 'case',
       'followed', 'rita',
       ...
       'dumbledore', 'like', 'ter', 'yeh', 'back', 'ron', 'hermione', 'hagrid',
       'harry', 'said'],
      dtype='object', length=275)

In [911]:
hp_dtm[hagrid_shrunk.columns]

| filepath | grawp | shaggy | fact | suddenly | find | given | sure | case | followed | rita | ... | dumbledore | like | ter | yeh | back | ron | hermione | hagrid | harry | said |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 Sorcerers Stone.txt | NaN | 2.0 | 12 | 38 | 24 | 11 | 22 | 4 | 10 | NaN | ... | 68 | 118 | 50 | 58 | 142 | 160 | 52 | 182 | 648 | 420 |
| 2 Chamber of Secrets.txt | NaN | 2.0 | 20 | 50 | 54 | 22 | 51 | 14 | 32 | NaN | ... | 138 | 187 | 22 | 11 | 287 | 659 | 286 | 135 | 1542 | 1216 |
| 3 Prisoner of Azkaban.txt | NaN | 3.0 | 27 | 85 | 46 | 19 | 69 | 38 | 31 | NaN | ... | 139 | 240 | 49 | 39 | 368 | 695 | 615 | 201 | 1814 | 1511 |
| 4 Goblet of Fire.txt | NaN | 3.0 | 42 | 115 | 113 | 52 | 96 | 22 | 42 | 85.0 | ... | 518 | 451 | 55 | 50 | 600 | 975 | 806 | 311 | 2945 | 2673 |
| 5 Order of the Phoenix.txt | 35.0 | 5.0 | 69 | 50 | 133 | 61 | 184 | 52 | 87 | 30.0 | ... | 536 | 527 | 124 | 75 | 793 | 1169 | 1196 | 370 | 3643 | 3999 |
| 6 Half-Blood Prince.txt | 9.0 | 3.0 | 47 | 62 | 129 | 52 | 130 | 44 | 45 | 1.0 | ... | 863 | 352 | 32 | 38 | 423 | 780 | 646 | 172 | 2543 | 2465 |
| 7 Deathly Hallows.txt | 10.0 | NaN | 39 | 33 | 136 | 48 | 170 | 37 | 40 | 21.0 | ... | 460 | 448 | 11 | 15 | 542 | 1031 | 1079 | 132 | 2760 | 1978 |


Now we're going to sum all of these up:

In [912]:
hp_dtm[hagrid_shrunk.columns].sum()

grawp            54.0
shaggy           18.0
fact            256.0
suddenly        433.0
find            635.0
given           265.0
sure            722.0
case            211.0
followed        287.0
rita            137.0
supposed        336.0
looks           204.0
caught          333.0
expelled         73.0
someone         328.0
lupin           734.0
sound           312.0
point           289.0
stepped         153.0
heavily         109.0
eaters          342.0
heads           199.0
every           614.0
hard            459.0
big             175.0
deep            236.0
mouth           447.0
sounding         87.0
sight           340.0
small           477.0
               ...   
right          1493.0
hand           1160.0
come           1103.0
still          1697.0
get            1519.0
head           1308.0
around         2225.0
door           1297.0
told           1053.0
would          2253.0
professor      1808.0
go             1305.0
asked          1090.0
could          2760.0
know      

And then we're going to divide each of Hagrid's collocates by their overall frequency in the corpus:

In [913]:
hagrid_shrunk.sum() / hp_dtm[hagrid_shrunk.columns].sum()

grawp         0.185185
shaggy        0.555556
fact          0.039062
suddenly      0.023095
find          0.015748
given         0.037736
sure          0.013850
case          0.047393
followed      0.034843
rita          0.072993
supposed      0.029762
looks         0.049020
caught        0.030030
expelled      0.136986
someone       0.030488
lupin         0.013624
sound         0.032051
point         0.034602
stepped       0.065359
heavily       0.091743
eaters        0.029240
heads         0.050251
every         0.016287
hard          0.021786
big           0.062857
deep          0.046610
mouth         0.024609
sounding      0.126437
sight         0.032353
small         0.023061
                ...   
right         0.029471
hand          0.039655
come          0.042611
still         0.028874
get           0.032916
head          0.038226
around        0.024719
door          0.042406
told          0.056030
would         0.027519
professor     0.034845
go            0.051341
asked      

Let's put that result in a variable:

In [914]:
hagrid_colls_scaled = hagrid_shrunk.sum() / hp_dtm[hagrid_shrunk.columns].sum()

In [915]:
hagrid_colls_scaled.sort_values(ascending = False)

shaggy        0.555556
gamekeeper    0.545455
gotta         0.500000
steak         0.444444
yeh           0.412587
bin           0.373333
fer           0.370787
yer           0.340659
ter           0.338192
skrewts       0.325000
fang          0.271739
cabin         0.264151
maxime        0.244898
massive       0.240000
creatures     0.217949
growled       0.211111
beard         0.195402
ended         0.194444
dragons       0.193548
madame        0.192308
buckbeak      0.188119
grawp         0.185185
charlie       0.162500
beaming       0.161905
hagrid        0.153693
visit         0.146789
giant         0.143791
expelled      0.136986
forest        0.130435
sounding      0.126437
                ...   
hard          0.021786
place         0.021739
trying        0.021676
face          0.021615
death         0.021248
umbridge      0.020873
magic         0.020867
even          0.020790
put           0.020701
house         0.020436
seemed        0.020390
nothing       0.020031
black      

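That's more like it! This gives us words that are *very* closely associated with Hagrid. For a few of them, such as `shaggy` and `gamekeeper`, the ratio is above 0.5, which suggests that more than half of their occurrences fall within our window around `'hagrid'`. (Keep in mind that a word sitting near two mentions of `'hagrid'` gets counted once for each, so these ratios are rough guides rather than exact proportions.)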
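
To make the arithmetic concrete, here is the same division on a made-up example (the words and counts are illustrative, not taken from the corpus). A word that is common everywhere, like `said`, ends up with a small ratio even though its raw count near the target is the highest:

```python
import pandas as pd

# made-up counts: how often each word appears near the target word ...
near_target = pd.Series({'cabin': 14, 'door': 11, 'said': 40})
# ... and how often each word appears anywhere in the corpus
in_corpus = pd.Series({'cabin': 50, 'door': 220, 'said': 4000})

# pandas matches the two Series up by word before dividing
scaled = (near_target / in_corpus).sort_values(ascending=False)
print(scaled)  # cabin 0.28, door 0.05, said 0.01
```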

# Putting all this in a function
We don't want to repeat those steps for every word. We want results! Let's wrap all of this analysis in a function:

In [916]:
# in case NLTK isn't cooperating on your system
# (note: this reassigns the name `stopwords`, replacing the NLTK import from earlier)
stopwords = ['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 'not',
 'only',
 'own',
 'same',
 'so',
 'than',
 'too',
 'very',
 's',
 't',
 'can',
 'will',
 'just',
 'don',
 "don't",
 'should',
 "should've",
 'now',
 'd',
 'll',
 'm',
 'o',
 're',
 've',
 'y',
 'ain',
 'aren',
 "aren't",
 'couldn',
 "couldn't",
 'didn',
 "didn't",
 'doesn',
 "doesn't",
 'hadn',
 "hadn't",
 'hasn',
 "hasn't",
 'haven',
 "haven't",
 'isn',
 "isn't",
 'ma',
 'mightn',
 "mightn't",
 'mustn',
 "mustn't",
 'needn',
 "needn't",
 'shan',
 "shan't",
 'shouldn',
 "shouldn't",
 'wasn',
 "wasn't",
 'weren',
 "weren't",
 'won',
 "won't",
 'wouldn',
 "wouldn't"]

In [852]:
# this function depends upon a few of our old friends: absolute_paths, tokenize, get_collocates, and make_dtm
# txt_dir points to a directory where your text files are stored in .txt format

def corp_collocates(word, txt_dir, horizon = 10, percentile = 0.9, drop_stopwords = True):
    # 1. generate a list of files
    filepaths = absolute_paths(txt_dir)
    
    # 2. make a list of dictionaries containing our data
    output = []
    
    for filepath in filepaths:
        collocates = get_collocates(filepath, word, horizon)
        output.append(collocates)
    
    # 3. make a dataframe of our results
    dtm = pd.DataFrame(output)
    dtm = dtm.set_index(['filepath', 'target_word']).sort_index()
    
    # 4. optionally drop stopwords
    keep = []
    if drop_stopwords is True:
        for x in dtm.columns:
            if x not in stopwords:    
                keep.append(x)
    
        dtm = dtm[keep]        
        
    # 5. sum dtm and cut to percentile
    sums = dtm.sum()
    pct_index = round(len(sums) * percentile)
    top_words = sums.sort_values()[pct_index:].index # index returns the list of words
    
    # 6. scale results
    dtm = dtm[top_words]
    raw_values = make_dtm(txt_dir)[top_words]
    scaled_results = dtm.sum() / raw_values.sum()
    
    return scaled_results.sort_values(ascending = False)

In [None]:
corp_collocates('your_word', '/Users/erik/Downloads/corp/race/texts')

In [854]:
# example usage
corp_collocates('gringotts', hp_dir)

impostors    0.750000
rob          0.625000
vault        0.305085
bank         0.206897
goblins      0.142857
break        0.069444
london       0.058824
wizarding    0.048193
gringotts    0.048193
fer          0.044944
goblin       0.042553
griphook     0.032000
bill         0.023622
gold         0.020000
white        0.009615
work         0.008499
bit          0.007678
hogwarts     0.006985
need         0.006803
place        0.006689
set          0.006565
ever         0.006394
inside       0.005865
anything     0.005747
say          0.005741
never        0.005666
would        0.004882
take         0.004373
first        0.004175
felt         0.003448
something    0.003445
moment       0.003401
hagrid       0.003327
left         0.003286
tell         0.003250
like         0.003013
think        0.002971
dark         0.002907
toward       0.002885
asked        0.002752
saw          0.002488
know         0.002430
time         0.002375
see          0.002225
still        0.001768
ron       

# Why does this all matter?

Collocates help us see that the words we associate with specific concepts may or may not appear with those concepts as often as we would expect in our texts.

In the Underwood et al. essay from earlier this week, for example, the authors did not predict that `grin` and `smile` would separate male- and female-coded behaviors in the mid-20th century. Rather, that pattern emerged from the data.