# 1. Reading your corpus
It's important to know what's **in** your corpus before you begin working with it. One of the best ways to do that is to look through the files manually and confirm that they are what you expect them to be.

### a. Look over at least 5 text files in your group's corpus. Scroll around and read a few lines of each.

How do they look? Do they have any obvious typos or errors? Are there unusual characters or clearly misspelled words? Are they long enough? Are there formatting problems that need to be cleaned up? Anything else you notice?

*Write your response here*

# 2. Rig up your corpus
Now that you've read through your corpus, it's time to set it up for analysis.

*The notebooks for 10-28 and 10-30 contain everything you need to get through this problem very quickly.*
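
As a quick refresher, here is a minimal sketch of the two kinds of matrix on a tiny made-up corpus. The `texts` dictionary and its filenames are hypothetical stand-ins for your own `.txt` files, and the whitespace split is a simplification of the tokenizer from class, so treat this as an illustration rather than the expected solution.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy corpus: filename -> text (stand-ins for your .txt files)
texts = {
    'a.txt': 'the whale the sea',
    'b.txt': 'the sea',
}

rows = []
for filename, text in texts.items():
    counts = dict(Counter(text.split()))  # word -> raw count for this text
    counts['filepath'] = filename         # remember which file the row came from
    rows.append(counts)

# Raw frequencies: one row per text, one column per word, 0 where a word is absent
dtm_raw = pd.DataFrame(rows).set_index('filepath').fillna(0).sort_index()

# Scaled frequencies: divide each row by that text's total word count
dtm_scaled = dtm_raw.div(dtm_raw.sum(axis=1), axis=0)

print(dtm_raw)
print(dtm_scaled)
```

In this toy corpus, "the" appears twice in `a.txt` and once in `b.txt`, but its scaled frequency is 0.5 in both, because `a.txt` is twice as long.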

### a. Make a document-term matrix of your corpus with raw frequencies. Be sure to set your filename (or another unique value) as your index.

In [None]:
dtm_raw = None  # replace None with your code

### b. Make a document-term matrix of your corpus with frequencies scaled by the total number of words in each text. Be sure to set your filename (or another unique value) as your index.

In [None]:
dtm_scaled = None  # replace None with your code

### c. Explain the difference between document-term matrices with raw and scaled frequencies. What is the value of scaling? When might you want to use raw frequencies?

*Your response here*

# 3. Practicing with Pandas
Now that we have loaded your corpus as a data frame, we're going to practice manipulating it with Pandas. If you're lucky, you'll start to see some interesting results for your final project!

Again, check the notebooks for 10-28 and 10-30 for examples of each of the following commands:

### a. Pick a word you're interested in from your DTM's columns. Print out its scaled frequencies in each of your texts.

### b. Now, sort those values such that the highest values are at the top.

### c. Pick a second word you want to compare to your first word. Print a dataframe containing *only* those two words.

### d. What are the most frequent words in your corpus? Sum all of your columns, and sort them.

### e. How many words in your raw DTM appear only once? Sum your columns, and filter the resulting sums for values `== 1`.
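
If you don't have the class notebooks handy, the following sketch shows the general shape of these operations on a small made-up DTM. The words, counts, and filenames are invented for illustration, and there is usually more than one way to do each step in Pandas.

```python
import pandas as pd

# Made-up raw DTM with filenames as the index (a stand-in for your own)
dtm_raw = pd.DataFrame(
    {'whale': [3, 0, 1], 'sea': [2, 5, 0], 'harpoon': [1, 0, 0]},
    index=pd.Index(['a.txt', 'b.txt', 'c.txt'], name='filepath'),
)
# Scale each row by the total number of words in that text
dtm_scaled = dtm_raw.div(dtm_raw.sum(axis=1), axis=0)

# a. scaled frequencies of one word in every text
one_word = dtm_scaled['whale']

# b. the same values, highest first
sorted_word = one_word.sort_values(ascending=False)

# c. a data frame containing only two words (note the double brackets)
two_words = dtm_scaled[['whale', 'sea']]

# d. total count of every word across the corpus, most frequent first
totals = dtm_raw.sum().sort_values(ascending=False)

# e. words that appear exactly once in the whole corpus
hapaxes = totals[totals == 1]

print(sorted_word)
print(totals)
print(len(hapaxes))
```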

# 4. Adding metadata to your scaled DTM

Now we're going to practice adding metadata to your data frame. This will allow you to easily sort, filter, and group your results.

We talked about this in class on Monday, but here's a refresher on how to `merge` data frames:

In [36]:
# let's say we're working with data from a cafe:

import pandas as pd

coffee = {'food':'coffee', 'size':'8oz', 'price':'$3.00'}
banana = {'food':'banana', 'size':'4oz', 'price':'$0.50'}
donut = {'food':'donut', 'size':'6oz', 'price':'$1.00'}
food_df = pd.DataFrame([coffee, banana, donut])

In [37]:
food_df

| | food | price | size |
|---|---|---|---|
| 0 | coffee | $3.00 | 8oz |
| 1 | banana | $0.50 | 4oz |
| 2 | donut | $1.00 | 6oz |


What if we want to combine that data with some other data about when and how those foods sell?

In [40]:
coffee_sales = {'food':'coffee', 'peak sales':'9am', 'total sold':181}
banana_sales = {'food':'banana', 'peak sales':'1pm', 'total sold':36}
donut_sales = {'food':'donut', 'peak sales':'10am', 'total sold':96}
sales_df = pd.DataFrame([coffee_sales, banana_sales, donut_sales])

In [41]:
sales_df

| | food | peak sales | total sold |
|---|---|---|---|
| 0 | coffee | 9am | 181 |
| 1 | banana | 1pm | 36 |
| 2 | donut | 10am | 96 |


You can combine these data frames using `pd.merge`. For it to work, you have to pass **a column that both data frames share** to the `on` argument ("on" as in "merge the data frames *on* this column").

In this case, both data frames have the column `food` in common. So:

In [17]:
pd.merge(food_df, sales_df, on='food')

| | food | price | size | peak sales | total sold |
|---|---|---|---|---|---|
| 0 | coffee | $3.00 | 8oz | 9am | 181 |
| 1 | banana | $0.50 | 4oz | 1pm | 36 |
| 2 | donut | $1.00 | 6oz | 10am | 96 |


### a. Use `pd.merge` to add the `PUBL_DATE` column from your metadata file to your DTM.
1. Import your metadata using `pd.read_csv`
2. Slice your metadata so that you only have the `PUBL_DATE` column and the column you want to match on.
3. Merge your DTM with your metadata slice.

### b. Sort your DTM by year using the `PUBL_DATE` column.
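
Here is one way the steps in parts a and b might look on toy data. Everything except `PUBL_DATE` is an assumption: the `FILENAME` and `AUTHOR` columns, the filenames, and the dates are invented, and your metadata would come from `pd.read_csv` rather than being typed in by hand. Because the DTM keeps its filenames in the index, this sketch moves them into a regular column with `reset_index()` before merging.

```python
import pandas as pd

# Toy scaled DTM indexed by filename (a stand-in for your own scaled DTM)
dtm_scaled = pd.DataFrame(
    {'whale': [0.02, 0.00], 'sea': [0.01, 0.03]},
    index=pd.Index(['b.txt', 'a.txt'], name='filepath'),
)

# Toy metadata; in the assignment this would be loaded with pd.read_csv
# 'FILENAME' and 'AUTHOR' are hypothetical column names
metadata = pd.DataFrame({
    'FILENAME': ['a.txt', 'b.txt'],
    'AUTHOR': ['Author A', 'Author B'],
    'PUBL_DATE': [1851, 1820],
})

# Keep only the date and the column we will match on
meta_slice = metadata[['FILENAME', 'PUBL_DATE']]

# The two frames name the shared column differently, so use left_on / right_on
merged = pd.merge(
    dtm_scaled.reset_index(),
    meta_slice,
    left_on='filepath',
    right_on='FILENAME',
)

# Drop the duplicate filename column and restore the index
merged = merged.drop(columns='FILENAME').set_index('filepath')

# Sort by year, earliest first
by_year = merged.sort_values('PUBL_DATE')
print(by_year)
```

If your metadata column already has the same name as your DTM's filename column, `on='that_name'` works just as it does in the cafe example.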

# 5. Getting results with `corp_collocates`
The functions for `corp_collocates` are below. Be sure to run every cell so that all of the functions are defined before you call `corp_collocates`.

We're going to use them to start looking for results in your corpus.
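
Before reading the full functions, it may help to see the core idea in miniature. The sketch below is a simplified illustration, not the code used in this notebook: `window_counts` is a hypothetical helper, the sentence is invented, and the window is symmetric for simplicity. It counts the words that fall within `horizon` tokens of a target word, then divides each count by the word's total frequency, which is the same kind of scaling `corp_collocates` applies at the end.

```python
from collections import Counter

def window_counts(tokens, target, horizon=2):
    """Count the tokens within `horizon` positions of each occurrence of `target`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            start = max(0, i - horizon)                   # don't run off the front of the list
            counts.update(tokens[start:i])                # words before the target
            counts.update(tokens[i + 1:i + 1 + horizon])  # words after the target
    return counts

# Invented example sentence, already lowercased and stripped of punctuation
tokens = 'hagrid wiped his beard and hagrid stroked his beard'.split()

near_hagrid = window_counts(tokens, 'hagrid')
totals = Counter(tokens)

# Share of each word's occurrences that fall near the target
scaled = {word: near_hagrid[word] / totals[word] for word in near_hagrid}
print(scaled)
```

A scaled value of 0.5 for "beard" means that half of the occurrences of "beard" in the text fall inside a window around "hagrid". Read the numbers in the output further down the same way.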

In [42]:
# this function depends upon a few of our old friends like absolute_paths and tokenize
# txt_dir points to a directory where your text files are located, and stored in .txt format

def corp_collocates(word, txt_dir, horizon = 10, percentile = 0.9, drop_stopwords = True):
    # 1. generate a list of files
    filepaths = absolute_paths(txt_dir)
    
    # 2. make a list of dictionaries containing our data
    output = []
    
    for filepath in filepaths:
        collocates = get_collocates(filepath, word, horizon)
        output.append(collocates)
    
    # 3. make a dataframe of our results
    dtm = pd.DataFrame(output)
    dtm = dtm.set_index(['filepath', 'target_word']).sort_index()
    
    # 4. optionally drop stopwords
    keep = []
    if drop_stopwords is True:
        for x in dtm.columns:
            if x not in stopwords:    
                keep.append(x)
    
        dtm = dtm[keep]        
        
    # 5. sum dtm and cut to percentile
    sums = dtm.sum()
    pct_index = round(len(sums) * percentile)
    top_words = sums.sort_values()[pct_index:].index # index returns the list of words
    
    # 6. scale results
    dtm = dtm[top_words]
    raw_values = make_dtm(txt_dir)[top_words]
    scaled_results = dtm.sum() / raw_values.sum()
    
    return scaled_results.sort_values(ascending = False)

In [43]:
import os
import pandas as pd

def make_dtm(directory, scaled = False):
    files = absolute_paths(directory)
    
    result = [] # empty list where I will append the dictionaries of word counts
    
    for file in files: # looping over the results
        text = open(file).read() # read in text file
        tokens = tokenize(text) # make tokens list
        d = count_words(tokens) # use count_words to create a dictionary
        
        if scaled is True:
            total_words = sum(list(d.values()))
            for key,value in d.items():
                d[key] = d[key] / total_words
        
        # os.path.split() returns the directory and the filename as a pair; we keep only the filename:
        d['filepath'] = os.path.split(file)[-1] # note: this overwrites the count for the word "filepath" if a text contains it
        result.append(d) # append this text's counts (scaled if requested)
    
    return pd.DataFrame(result).set_index('filepath').sort_index()

In [44]:
def get_collocates(filepath, target_word, horizon = 10):
    text = open(filepath).read() # get text
    tokens = tokenize(text) # get tokens
    
    indexes = []

    for i, token in enumerate(tokens):
        if token == target_word:
            indexes.append(i) # get indexes
    
    collocates = []

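    for index in indexes:
        # clamp the start at 0: a negative start would wrap around to the end of the list
        start = max(0, index - horizon)
        # take up to `horizon` tokens before the target and `horizon - 1` after it, skipping the target itself
        colls = tokens[start:index] + tokens[index + 1:index + horizon]
        collocates.extend(colls) # we use extend rather than append because we are adding additional elements *from* a list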
        
    d = {}
    # we want to make sure we get data about where our values are coming from. this tells us the file:
    d['filepath'] = os.path.split(filepath)[-1] 
    d['target_word'] = target_word # this tells us our target
    
    for coll in collocates:
        if coll not in d:
            d[coll] = 1 # count up collocates
        else:
            d[coll] += 1
    
    return d

In [45]:
import string
import re

def tokenize(text, keep_punct = False):
    if keep_punct is True:
        for punct in string.punctuation:
            text = text.replace(punct, ' ' + punct + ' ')
    else:
        for punct in string.punctuation:
            text = text.replace(punct, ' ')
    
    # this replaces *any* amount of whitespace with a single space using regular expressions
    text = re.sub(r'\s+', ' ', text)
    
    result = []
    
    for x in text.lower().split(' '):
        if x.isalpha():
            result.append(x)
    
    return result

In [46]:
import os

def absolute_paths(directory, txt_only = True):
    files = os.listdir(directory)
    absolute_paths = []
    
    for file in files:
        path = os.path.join(directory, file)
        absolute_paths.append(path)
    
    if txt_only is True:
        txts = []
        for x in absolute_paths:
            if x.endswith('.txt'):
                txts.append(x)
        return txts
    
    else:        
        return absolute_paths

In [47]:
def count_words(word_list):
    d = {}
    
    for word in word_list:
        if word not in d:
            d[word] = 1
        else:
            d[word] += 1
    
    return d

In [48]:
stopwords = ['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 'not',
 'only',
 'own',
 'same',
 'so',
 'than',
 'too',
 'very',
 's',
 't',
 'can',
 'will',
 'just',
 'don',
 "don't",
 'should',
 "should've",
 'now',
 'd',
 'll',
 'm',
 'o',
 're',
 've',
 'y',
 'ain',
 'aren',
 "aren't",
 'couldn',
 "couldn't",
 'didn',
 "didn't",
 'doesn',
 "doesn't",
 'hadn',
 "hadn't",
 'hasn',
 "hasn't",
 'haven',
 "haven't",
 'isn',
 "isn't",
 'ma',
 'mightn',
 "mightn't",
 'mustn',
 "mustn't",
 'needn',
 "needn't",
 'shan',
 "shan't",
 'shouldn',
 "shouldn't",
 'wasn',
 "wasn't",
 'weren',
 "weren't",
 'won',
 "won't",
 'wouldn',
 "wouldn't"]

In [51]:
corp_collocates('hagrid', '/Users/e/code/literarytextmining/corpora/harry_potter/texts')

shaggy        0.555556
gamekeeper    0.545455
gotta         0.500000
steak         0.444444
yeh           0.412587
bin           0.373333
fer           0.370787
gruffly       0.370370
committee     0.347826
fence         0.346154
yer           0.340659
ter           0.338192
umbrella      0.333333
skrewts       0.325000
thestrals     0.310345
fang          0.271739
cabin         0.264151
aragog        0.256410
maxime        0.244898
massive       0.240000
sadly         0.219512
creatures     0.217949
growled       0.211111
wiping        0.204082
beard         0.195402
ended         0.194444
dragons       0.193548
madame        0.192308
buckbeak      0.188119
grawp         0.185185
                ...   
nothing       0.020031
black         0.020022
day           0.019651
hall          0.019576
bit           0.019194
robes         0.019149
something     0.018949
though        0.018847
next          0.018786
year          0.018762
felt          0.018391
going         0.018182
moment     

### a. Find at least 3 collocations of interest for your project, and write a sentence about the potential significance of each. (Be sure to print the collocates you discuss above so I can see them!)

*your response here*

### b. Then, write a sentence reflecting on the value of collocation as a method of analysis. What does it tell you? What more would you want to know about your results?

*your response here*