# Analyzing volumes for word frequencies
This notebook demonstrates some of the basic functionality of the HathiTrust `FeatureReader` object. We will look at a few easily replicable text analysis techniques, namely word frequency and visualization.

In [None]:
from htrc_features import FeatureReader
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## Part 1 — Word frequency in novels
The following cells load a collection of nine novels from the 18th-20th centuries, chosen from an HTRC collection. These novels are the starting point for our text analysis. A second collection, of math textbooks, is loaded and used in Part 2.

In [None]:
!rm -rf local-folder
download_output = !htid2rsync --f novels-word-use.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
suffix = '.json.bz2'
file_paths = ['local-folder/' + path for path in download_output if path.endswith(suffix)]

In [None]:
fr_novels = FeatureReader(file_paths)
for vol in fr_novels:
    print(vol.title)

## Selecting volumes
The following cell helps you choose a volume to work with. Set `title_word` to any word contained in the title of the volume you want (the comparison is case-insensitive, since some titles are lower-case). The matching volume is stored in the variable `vol`, which you can reassign to any other name. As an example, `title_word` is currently set to "grapes", so "The Grapes of Wrath" by John Steinbeck is the volume stored in `vol`. If no title matches, `vol` ends up as the last volume in the collection, so check the printed title. You can change this cell at any time to work with a different volume.

In [None]:
title_word = 'grapes'

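for vol in fr_novels:
    if title_word.lower() in vol.title.lower():
        print('Current volume:', vol.title)
        break
else:
    # The loop finished without a match; vol is now the last volume read.
    print('No title contains:', title_word)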
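
The case-insensitive title match can also be sketched on its own, away from the `FeatureReader` API. The helper name `find_by_title` and the list `toy_titles` are assumptions made for illustration; plain strings stand in for volume titles.

```python
def find_by_title(titles, title_word):
    """Return the first title containing title_word (case-insensitive), else None."""
    needle = title_word.lower()
    # next() with a default avoids an exception when nothing matches.
    return next((t for t in titles if needle in t.lower()), None)

# Plain strings standing in for vol.title values.
toy_titles = ['The grapes of wrath', 'To kill a mockingbird']

print(find_by_title(toy_titles, 'GRAPES'))
print(find_by_title(toy_titles, 'ulysses'))
```

Returning `None` for a missing title makes the "no match" case explicit, which a bare loop with `break` does not.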

## Sampling tokens from a book
The following cell displays the most common tokens (words or punctuation marks) in a given volume, alongside the number of times they appear. It also calculates their relative frequencies (each token's count divided by the total number of tokens in the book) and displays the results in a `DataFrame`. We'll do this for the volume we found above. The cell may take a few seconds to run on a long volume.

In [None]:
tokens = vol.tokenlist(pos=False, case=False, pages=False).sort_values('count', ascending=False)

# Relative frequency = token count / total token count (the total is computed once).
total = tokens['count'].sum()
freqs = (tokens['count'] / total).tolist()

tokens['rel_frequency'] = freqs
tokens
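
As a minimal sketch of the relative-frequency calculation, independent of the `FeatureReader` API, the snippet below counts an invented token list with the standard library's `Counter`. The names `toy_tokens` and `rel_freqs` are illustrative assumptions and do not appear elsewhere in the notebook.

```python
from collections import Counter

# Invented toy tokens standing in for a volume's token list.
toy_tokens = ['the', 'road', ',', 'the', 'car', '.', 'the', 'road']

counts = Counter(toy_tokens)
total = sum(counts.values())  # total number of tokens, computed once

# Relative frequency = occurrences of a token / total number of tokens.
# most_common() yields tokens from most to least frequent.
rel_freqs = {token: count / total for token, count in counts.most_common()}

for token, freq in rel_freqs.items():
    print(repr(token), freq)
```

Because every token is counted, punctuation included, the relative frequencies sum to 1.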

### Graphing word frequencies
The following cell outputs a bar plot of the most common tokens from the volume and their frequencies.

In [None]:
%matplotlib inline
# Build a list of frequencies and a list of tokens.
freqs_1, tokens_1 = [], []
for i in range(15):  # top 15 tokens
    freqs_1.append(freqs[i])
    tokens_1.append(tokens.index.get_level_values('lowercase')[i])

# Create a range for the x-axis
x_ticks = np.arange(len(tokens_1))

# Plot!
plt.bar(x_ticks, freqs_1)
plt.xticks(x_ticks, tokens_1)
plt.ylabel('Frequency', fontsize=14)
plt.xlabel('Token', fontsize=14)
plt.title('Common token frequencies in "' + vol.title[:14] + '..."', fontsize=14)

As you can see, the most common tokens in "The Grapes of Wrath" are mostly punctuation and function words that say little about the book's content. Let's see if we can narrow our search to gain more relevant insight. We can get a list of stopwords from the `nltk` library. Punctuation is in the `string` library:

In [None]:
# If the stopword corpus is missing, run nltk.download('stopwords') once first.
from nltk.corpus import stopwords
from string import punctuation

print(stopwords.words('english'))

print()

print(punctuation)

Now that we have a list of words to ignore in our search, we can make a few tweaks to our plotting cell.

In [None]:
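# Build the ignore set once instead of rebuilding the stopword list on every iteration.
ignore = set(stopwords.words('english')) | set(punctuation)
token_names = tokens.index.get_level_values('lowercase')

freqs_filtered, tokens_filtered, i = [], [], 0
# Stop at 15 kept tokens, or earlier if the volume runs out of tokens.
while len(tokens_filtered) < 15 and i < len(token_names):
    if token_names[i] not in ignore:
        freqs_filtered.append(freqs[i])
        tokens_filtered.append(token_names[i])
    i += 1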

# Create a range for the x-axis
x_ticks = np.arange(len(freqs_filtered))

# Plot!
plt.bar(x_ticks, freqs_filtered)
plt.xticks(x_ticks, tokens_filtered)
plt.ylabel('Frequency', fontsize=14)
plt.xlabel('Token', fontsize=14)
plt.title('Common token frequencies in "' + vol.title[:14] + '..."', fontsize=14)

That's better. The punctuation is gone and the frequencies on the y-axis are lower, which shows that the filter removed the most common tokens. This is also helpful if we're trying to find distinctive words in a text, because we removed the words that most texts share.
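
The filtering step can be sketched without `nltk` by using a tiny stand-in stopword list. Everything named here (`toy_stopwords`, `toy_counts`, `top_content_tokens`) is a hypothetical illustration; the notebook itself uses NLTK's full English list.

```python
from string import punctuation

# Hypothetical mini stopword list; the notebook uses nltk's English list instead.
toy_stopwords = {'the', 'and', 'a', 'of'}
# A set gives fast membership tests; set(punctuation) holds single characters only.
ignore = toy_stopwords | set(punctuation)

# Invented (token, count) pairs, already sorted by descending count.
toy_counts = [('the', 50), (',', 45), ('and', 30), ('road', 12), ('.', 11), ('car', 7)]

def top_content_tokens(sorted_counts, ignore, n):
    """Keep the first n tokens that are neither stopwords nor punctuation."""
    kept = [(tok, c) for tok, c in sorted_counts if tok not in ignore]
    return kept[:n]

print(top_content_tokens(toy_counts, ignore, 2))
```

Note that `set(punctuation)` only matches single punctuation characters, so a multi-character token such as `--` would pass through this filter.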

## Sampling tokens from all books
Now we can see how relative word frequencies compare across all the books in our sample. To do this, we'll need a few useful functions.
The first finds the most common noun in a volume, with an adjustable parameter for minimum word length.
The second calculates the relative frequency of a token across an entire volume, so that we don't have to repeat the calculation from the cell above each time.
Finally, we'll have a visualization function to create a bar plot of relative frequencies for all volumes in our sample, so that we can easily track how word frequencies differ across titles.

In [None]:
# A function to return the most common noun of length at least word_length in the volume.
# NOTE: word_length defaults to 2.
# e.g. most_common_noun(fr_novels.first()) returns 'time'.

def most_common_noun(vol, word_length=2):
    # Build a table of common nouns (POS tag 'NN'), most frequent first
    tokens_1 = vol.tokenlist(pages=False, case=False)
    nouns_only = tokens_1.loc[(slice(None), slice(None), ['NN']),]
    top_nouns = nouns_only.sort_values('count', ascending=False)

    token_index = top_nouns.index.get_level_values('lowercase')

    # Choose the first token at least as long as word_length that contains
    # none of the characters ' ! , ?
    for token in token_index:
        if len(token) >= word_length and not any(ch in token for ch in "'!,?"):
            return token
    print('There is no noun of this length')
    return None

In [None]:
most_common_noun(vol, 15)
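
The `.loc` selection inside `most_common_noun` is the least obvious line in the function, so here is a sketch of it on a toy table. It assumes the token table has three index levels (section, lowercase, pos), as the three-part selection in the function implies; the rows and the name `toy_tokens` are invented.

```python
import pandas as pd

# Toy token table with the three index levels assumed above; rows are invented.
index = pd.MultiIndex.from_tuples(
    [
        ('body', 'road', 'NN'),
        ('body', 'walk', 'VB'),
        ('body', 'walk', 'NN'),
        ('body', 'car', 'NN'),
    ],
    names=['section', 'lowercase', 'pos'],
)
# sort_index() keeps the MultiIndex sorted, which label-based slicing prefers.
toy_tokens = pd.DataFrame({'count': [12, 9, 3, 7]}, index=index).sort_index()

# slice(None) means "any value at this level"; the list names the POS tags to keep.
nouns = toy_tokens.loc[(slice(None), slice(None), ['NN']), :]
top_nouns = nouns.sort_values('count', ascending=False)

print(top_nouns.index.get_level_values('lowercase').tolist())
```

The verb reading of "walk" is dropped while its noun reading stays, which is why the function requests part-of-speech tags in the first place.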

In [None]:
# Return the usage frequency of a given word in a given volume.
# NOTE: frequency() returns a dictionary of the form {'word': frequency}.
# e.g. frequency(fr_novels.first(), 'blue') returns {'blue': 0.00012}

def frequency(vol, word):
    t1 = vol.tokenlist(pages=False, pos=False, case=False)
    matches = t1[t1.index.get_level_values("lowercase") == word]

    # The word does not occur in this volume
    if len(matches) == 0:
        return {word: 0}

    # .iloc[0] selects the first matching row by position
    count = matches['count'].iloc[0]
    freq = count / t1['count'].sum()

    return {word: float('%.5f' % freq)}

In [None]:
frequency(vol, 'blue')

In [None]:
# Returns a plot of the usage frequencies of the given word across all volumes in the given FeatureReader collection.
# NOTE: frequencies are given as percentages rather than true ratios.
def frequency_bar_plot(word, fr):
    freqs, titles = [], []
    for vol in fr:
        title = vol.title
        short_title = title[:6] + (title[6:] and '..')
        freqs.append(100*frequency(vol, word)[word])
        titles.append(short_title)
        
    # Format and plot the data
    x_ticks = np.arange(len(titles))
    plt.bar(x_ticks, freqs)
    plt.xticks(x_ticks, titles, fontsize=10)
    plt.ylabel('Frequency (%)', fontsize=12)
    plt.title('Frequency of "' + word + '"', fontsize=14)

In [None]:
frequency_bar_plot('blue', fr_novels)

Your turn! See if you can output a bar plot of the most common noun of length at least 5 in "To Kill a Mockingbird". REMEMBER, you may have to set vol to a different value than it already has.

In [None]:
# Use the provided frequency functions to plot the most common noun of at least 5 letters in "To Kill a Mockingbird".
# Your solution should be just one line of code.





## Part 2— Non-fiction volumes
Now we'll load a collection of 33 math textbooks from the 18th and 19th centuries. These volumes focus on number theory and arithmetic, and were written during the lives of Leonhard Euler and Joseph-Louis Lagrange, two of the most prolific researchers of number theory in history. As a result, we can expect the frequency of certain words and topics to shift over time to reflect the state of contemporary research. Let's load them and see.

In [None]:
download_output = !htid2rsync --f math-collection.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
file_paths = ['local-folder/' + path for path in download_output if path.endswith(suffix)]

In [None]:
fr_math = FeatureReader(file_paths)
for vol in fr_math:
    print(vol.title)

### Another frequency function
The next cell contains a `frequency_by_year` function that takes as inputs a query word and a `FeatureReader` object. The function calculates the relative frequency of the query word in every volume in the `FeatureReader`, sorts the results by volume year, and returns the first ten rows as a `DataFrame`. It also draws a scatter plot of all the frequencies, which lets us easily see trends in word usage across a time period.

In [None]:
# Returns a DF of relative frequencies, volume years, and page counts, along with a scatter plot.
# NOTE: frequencies are given in percentages rather than true ratios.
def frequency_by_year(query_word, fr):
    volumes = pd.DataFrame()
    years, page_counts, query_freqs = [], [], []

    for vol in fr:
        years.append(int(vol.year))
        page_counts.append(int(vol.page_count))
        query_freqs.append(100*frequency(vol, query_word)[query_word])
    
    volumes['year'], volumes['pages'], volumes['freq'] = years, page_counts, query_freqs
    volumes = volumes.sort_values('year')
    
    # Set plot dimensions and labels
    scatter_plot = volumes.plot.scatter('year', 'freq', color='black', s=50, fontsize=12)
    plt.ylim(0-np.mean(query_freqs), max(query_freqs)+np.mean(query_freqs))
    plt.ylabel('Frequency (%)', fontsize=12)
    plt.xlabel('Year', fontsize=12)
    plt.title('Frequency of "' + query_word + '"', fontsize=14)
    
    return volumes.head(10)

### Checking for shifts over time
In 1744, Euler began a huge volume of work on identifying quadratic forms and progressions of primes. It stands to reason, then, that mentions of these topics in number theory textbooks should see a discernible jump after the 1740s. The following cells call `frequency_by_year` on several relevant words.

In [None]:
frequency_by_year('quadratic', fr_math)

In [None]:
frequency_by_year('prime', fr_math)

In [None]:
frequency_by_year('factor', fr_math)
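
One rough way to put a number on a "jump" is to compare the mean frequency before and after a cutoff year. The sketch below does this with the standard library only. The helper `mean_before_after` and the records in `toy_records` are assumptions for illustration: the values are invented and are not results from the math collection.

```python
from statistics import mean

# Invented (year, frequency in %) pairs; NOT results from the math collection.
toy_records = [(1720, 0.01), (1735, 0.02), (1750, 0.06), (1790, 0.08)]

def mean_before_after(records, cutoff_year):
    """Return (mean before cutoff, mean from cutoff onward).

    Raises statistics.StatisticsError if either side of the cutoff is empty.
    """
    before = [freq for year, freq in records if year < cutoff_year]
    after = [freq for year, freq in records if year >= cutoff_year]
    return mean(before), mean(after)

before, after = mean_before_after(toy_records, 1744)
print('before:', before, 'after:', after)
```

With real data, the same comparison could be fed from the `year` and `freq` columns of the frame returned by `frequency_by_year`, bearing in mind that the function returns only the first ten rows.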

# All done!
That's all for this notebook, but it doesn't mean you can't apply what you've learned. Can you think of any words you'd like to track over time? Feel free to use the following empty cells however you'd like. An interesting challenge would be to see if you can incorporate the frequency functions from Part 1 into the scatter function from Part 2. Have fun!