# Getting started with HTRC Extracted Features

This tutorial will get you up-and-running with the HTRC Extracted Features dataset. Learn more about the data: https://wiki.htrc.illinois.edu/x/GoA5Ag

The code and instructions used in this notebook combine elements from a Programming Historian lesson called "Text Mining in Python through the HTRC Feature Reader" (https://programminghistorian.org/en/lessons/text-mining-with-extracted-features) and the Berkeley Data Science Module, "Library-HTRC" (https://github.com/ds-modules/Library-HTRC).


## Set-up and reading in files

To get started, we need to import the Python modules we'll use throughout this notebook.

In [None]:
from htrc_features import FeatureReader
import matplotlib
import matplotlib.pyplot as plt
import numpy 
import pandas 

Extracted Features files are originally formatted in JSON notation and compressed; you'll notice the file format is '.json.bz2'. The FeatureReader library is able to work with the files in that format without needing to decompress them first.
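
To make the '.json.bz2' idea concrete, here is a minimal sketch using only Python's standard library. The record below is a made-up toy, not the real Extracted Features schema, and this is not how the FeatureReader itself is implemented; it only shows that compressed JSON can be parsed without first writing an uncompressed file.

```python
import bz2
import json

# Hypothetical toy record; the real Extracted Features schema is much richer.
record = {"id": "example.001", "pages": [{"seq": "00000001", "tokenCount": 3}]}

# Compress the JSON text in memory, mimicking the contents of a '.json.bz2' file.
compressed = bz2.compress(json.dumps(record).encode("utf-8"))

# Decompress and parse in one pass, with no uncompressed file on disk.
restored = json.loads(bz2.decompress(compressed).decode("utf-8"))

print(restored["pages"][0]["tokenCount"])  # 3
```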

Within the library, there is a **FeatureReader object** that is used for loading the dataset files and making sense of them. It returns a **Volume object** for each file. A Volume is a representation of a single item in HathiTrust, for example a book or other textual work. From the Volume, you can access features about the work. To drill down to the features derived from individual pages, use the **Page object**.

We'll need to get the FeatureReader ready to use by pointing it to the file paths for the sample Extracted Features files we are using in this notebook. The files are in a directory called 'data', which is further divided into two directories: '1970' and '1930'.

With fr1970 = FeatureReader(file_paths_1970) below, the FeatureReader is initialized, meaning it is ready to use. An initialized FeatureReader holds references to the file paths that we gave it, and will load them into Volume objects when asked.

In [None]:
#create a list of file names from the 1970s directory, create a list of paths to those files, and load
#the data in the Feature Reader

filenames1970 = ['mdp.49015002203033.json.bz2', 'mdp.49015002203405.json.bz2', 'mdp.49015002203140.json.bz2', 
'mdp.49015002221761.json.bz2', 'mdp.49015002203157.json.bz2', 'mdp.49015002221779.json.bz2', 'mdp.49015002203215.json.bz2', 
'mdp.49015002221787.json.bz2', 'mdp.49015002203223.json.bz2', 'mdp.49015002221811.json.bz2', 'mdp.49015002203231.json.bz2', 
'mdp.49015002221829.json.bz2', 'mdp.49015002203249.json.bz2', 'mdp.49015002221837.json.bz2', 'mdp.49015002203272.json.bz2', 
'mdp.49015002221845.json.bz2']
file_paths_1970 = ['data/1970/' + file for file in filenames1970]
fr1970 = FeatureReader(file_paths_1970)
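
The idea of "holding references and loading when asked" can be sketched with a plain Python generator. The LazyReader class and the file names below are hypothetical stand-ins for illustration only; they are not part of the htrc_features library.

```python
class LazyReader:
    """Hypothetical stand-in for a reader that stores paths and loads on demand."""

    def __init__(self, paths):
        self.paths = list(paths)  # only the references are stored here
        self.loaded = []          # records which paths were actually opened

    def volumes(self):
        for path in self.paths:
            self.loaded.append(path)  # the "load" happens only during iteration
            yield {"path": path}      # a dict stands in for a Volume object


reader = LazyReader(["data/1970/a.json.bz2", "data/1970/b.json.bz2"])
print(len(reader.loaded))       # 0: nothing has been loaded yet

first = next(reader.volumes())
print(len(reader.loaded))       # 1: one volume was loaded on request
```

The design choice this illustrates is that creating the reader is cheap, and the cost of reading a file is paid only for the volumes you actually iterate over.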

Let's see what titles have been loaded as Volumes. Because these are serials, they have the same basic title.

In [None]:
for vol in fr1970.volumes():
    print(vol.title)

##### Bonus
Remove the comments from the following code box and re-run the following examples using the 1930s dataset. Do you find anything interesting?

In [None]:
#filenames1930 = ['mdp.49015002221860.json.bz2', 'miua.4925052,1928,001.json.bz2', 'mdp.49015002221878.json.bz2',
#'miua.4925383,1934,001.json.bz2', 'mdp.49015002221886.json.bz2']
#file_paths_1930 = ['data/1930/' + file for file in filenames1930]
#fr1930 = FeatureReader(file_paths_1930)

## File and page structure

We can call just one Volume at a time in order to examine its contents. In this example, we are taking the first file.

In [None]:
vol = fr1970.first()
vol

We can also call the URL for the volume and CTRL-click it to find the corresponding item in the HathiTrust Digital Library (HTDL). 

These volumes are in the public domain, so we will find that they are available for "Full View" in the HTDL. If they were still under copyright, we would be taken to a "Limited View" page. The Extracted Features dataset includes a snapshot of 15.7 million volumes from the HTDL and is agnostic to rights status, as the files represent data about the volumes. 

In [None]:
print(vol.handle_url)

Let's see what other metadata elements are available to you for each item in its corresponding Extracted Features file. Put your cursor between the period and the end parenthesis, and press tab. You can choose from the dropdown list. Then run the cell.

In [None]:
#Put your cursor between the period . and the end parenthesis ) and press tab. You can choose from the dropdown list.
print(vol.author)

It's time to access the first feature of vol: a table of total word counts for every page. It can be accessed simply by calling vol.tokens_per_page().

In [None]:
tokens = vol.tokens_per_page()
# Show just the first few rows, so we can look at what it looks like
tokens.head()

This is a straightforward table of information, similar to what you would see in Excel or Google Spreadsheets. Listed in the table are page numbers and the count of words on each page. 

With only two dimensions, it is trivial to plot the number of words per page. The table structure holding the data has a plot method for data graphics. Without extra arguments, tokens.plot() will assume that you want a line chart with the page on the x-axis and word count on the y-axis.

In [None]:
%matplotlib inline
tokens.plot()

How did we get here? When we ran vol.tokens_per_page(), it returned a Pandas DataFrame. This means that after setting tokens, we're no longer working with HTRC-specific code, just book data held in a common and very robust table-like construct from Pandas. tokens.head() used a DataFrame method to look at the first few rows of the dataset, and tokens.plot() uses a method from Pandas to visualize data.

Another DataFrame accessible to us is vol.tokenlist(), which returns section-, part-of-speech-, and word-specific details.

Let's use it to look at some words deeper into the book: rows 1000 up to (but not including) 1100, taking every 15th row, written as [1000:1100:15].

In [None]:
tl = vol.tokenlist()
tl[1000:1100:15]
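
The start:stop:step notation used above follows ordinary Python slicing rules. As a small sketch, a plain list of row positions stands in for the token list here:

```python
rows = list(range(2000))     # stand-in for 2,000 row positions
picked = rows[1000:1100:15]  # start at 1000, stop before 1100, step by 15

print(picked)  # [1000, 1015, 1030, 1045, 1060, 1075, 1090]
```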

The Pandas DataFrame type returned by the HTRC Feature Reader is very malleable. To work with the tokenlist that you retrieved earlier, three skills are particularly valuable: selecting subsets by a condition, slicing by named row index, and grouping and aggregating. 

We'll practice slicing by named row index first.

For example, we can add a word between the quotation marks to retrieve only pages where that word occurs. We are using the power of the DataFrame index to retrieve only the rows that match our criteria.

In [None]:
#add a word between the quotes
tl_all = vol.tokenlist(section='all')
chapter_pages = tl_all.loc[(slice(None), slice(None), ""),]
chapter_pages
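
To see how slicing by named row index behaves, here is a minimal sketch on a toy table. The values are invented, and the toy index has only three levels (page, section, token), which is an assumption made for brevity rather than the exact shape of a real token list.

```python
import pandas as pd

# Toy token list with hypothetical values.
index = pd.MultiIndex.from_tuples(
    [(1, "body", "chapter"), (1, "body", "the"),
     (2, "header", "chapter"), (2, "body", "sea"),
     (3, "body", "chapter")],
    names=["page", "section", "token"],
)
toy = pd.DataFrame({"count": [1, 4, 1, 2, 3]}, index=index).sort_index()

# slice(None) means "any value" for that level; the third slot names the token.
hits = toy.loc[(slice(None), slice(None), "chapter"), :]
pages = sorted(set(hits.index.get_level_values("page")))

print(hits)
print(pages)
```

Sorting the index first is a deliberate choice: pandas selects from a sorted MultiIndex more reliably and more quickly.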

Next we will subset the data by selection. 

To subset individual rows of a DataFrame, provide a series of True/False values to the DataFrame, formatted in square brackets. When True, the DataFrame returns that row; when False, the row is excluded from what is returned. 

tl_simple returns a token list without part-of-speech or page information. Then, we set matches to find only those rows that are "True" for the selection criterion of occurring more than 100 times. Finally, we sample 5 rows in our subset. Try adjusting the numbers in the code below to change the number of rows sampled or the counts on which you match.

In [None]:
#adjust the matching count number or sample set
tl_simple = vol.tokenlist(pos=False, pages=False)
matches = tl_simple['count'] > 100
tl_simple[matches].sample(5)
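
The True/False selection described above can be watched step by step on a tiny table. This is a sketch with invented counts, not data from the volume:

```python
import pandas as pd

toy = pd.DataFrame(
    {"count": [250, 3, 120, 99]},
    index=pd.Index(["the", "whale", "of", "ship"], name="token"),
)

mask = toy["count"] > 100  # one True/False value per row
frequent = toy[mask]       # keeps only the rows where the mask is True

print(mask.tolist())            # [True, False, True, False]
print(frequent.index.tolist())  # ['the', 'of']
```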

## Relative frequencies

Still looking at one volume, let's start to explore the relative frequencies of tokens within the volume. 

The following cell will display the most common tokens (words or punctuation marks) in a given volume, alongside the number of times they appear. It will also calculate their relative frequencies (found by dividing the number of appearances over the total number of words in the book) and display the results in a DataFrame. The cell may take a few seconds to run because we're looping through every word in the volume!

In [None]:
tokens = vol.tokenlist(pos=False, case=False, pages=False).sort_values('count', ascending=False)

# Compute the total once, then divide each count by it
total = tokens['count'].sum()
freqs = []
for count in tokens['count']:
    freqs.append(count / total)
    
tokens['rel_frequency'] = freqs
tokens
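
As a worked example of the arithmetic, relative frequency is simply a token's count divided by the total count of all tokens. The counts below are hypothetical and use only the standard library:

```python
from collections import Counter

counts = Counter({"the": 6, "of": 3, "whale": 1})  # hypothetical counts
total = sum(counts.values())                       # 6 + 3 + 1 = 10

# most_common() yields tokens from most to least frequent.
rel_freq = {token: n / total for token, n in counts.most_common()}

print(rel_freq)  # {'the': 0.6, 'of': 0.3, 'whale': 0.1}
```

Because every count is divided by the same total, the relative frequencies always add up to 1.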

Now, let's plot the most common tokens from the volume and their frequencies. The following cell outputs a bar plot using the matplotlib library.

In [None]:
%matplotlib inline
# Build a list of frequencies and a list of tokens.
freqs_1, tokens_1 = [], []
for i in range(15):  # top 15 tokens
    freqs_1.append(freqs[i])
    tokens_1.append(tokens.index.get_level_values('lowercase')[i])

# Create a range for the x-axis
x_ticks = numpy.arange(len(tokens_1))

# Plot!
plt.bar(x_ticks, freqs_1)
plt.xticks(x_ticks, tokens_1, rotation=90)
plt.ylabel('Frequency', fontsize=14)
plt.xlabel('Token', fontsize=14)
plt.title('Common token frequencies in "' + vol.title[:14] + '..."', fontsize=14)

As you can see, the most common tokens are mostly punctuation and basic words that don't provide context. Let's see if we can narrow our search to gain some more relevant insight. 

We can get a list of stopwords from the nltk library. Punctuation is in the string library. Let's import nltk and make the stopwords and punctuation accessible to us. 

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords
from string import punctuation

print(stopwords.words('english'))

print()

print(punctuation)


Now that we have a list of words to ignore in our DataFrame, we can make a few tweaks to our plotting cell to remove the punctuation and display only those words not in our stopword list.

In [None]:
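# Build the ignore list once instead of on every pass through the loop
ignore = set(stopwords.words('english')) | set(punctuation)
lowercase_tokens = tokens.index.get_level_values('lowercase')

freqs_filtered, tokens_filtered, i = [], [], 0
# Stop after 10 matches, or when we run out of tokens
while len(tokens_filtered) < 10 and i < len(lowercase_tokens):
    if lowercase_tokens[i] not in ignore:
        freqs_filtered.append(freqs[i])
        tokens_filtered.append(lowercase_tokens[i])
    i += 1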

# Create a range for the x-axis
x_ticks = numpy.arange(len(freqs_filtered))

# Plot!
plt.bar(x_ticks, freqs_filtered)
plt.xticks(x_ticks, tokens_filtered, rotation=90)
plt.ylabel('Frequency', fontsize=14)
plt.xlabel('Token', fontsize=14)
plt.title('Common token frequencies in "' + vol.title[:14] + '..."', fontsize=14)
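
The filtering step can be tried in isolation, without nltk or a volume. In this sketch the four-word stopword set and the ranked token list are invented for illustration; nltk's English list is much longer.

```python
from string import punctuation

# Hypothetical mini stopword list.
stop = {"the", "of", "and", "a"}
ignore = stop | set(punctuation)

# Tokens already sorted from most to least frequent (made-up values).
ranked = ["the", ",", "of", "whale", ".", "and", "sea", "ship"]

# Keep the first two tokens that are neither stopwords nor punctuation.
top_content = [token for token in ranked if token not in ignore][:2]

print(top_content)  # ['whale', 'sea']
```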

## Tokens from all volumes in our set

Now let's see how word frequencies compare across all the books in our samples. 

First we'll set up a few functions. The first finds the most common noun in a volume, with an adjustable minimum word length. The second calculates the relative frequency of a token across an entire volume, saving us from repeating the calculation in the cell above. Finally, a visualization function creates a bar plot of relative frequencies for all volumes in our sample, so that we can easily track how word frequencies differ across titles.


### 1

Let's see what the most common nouns in this work are by word length. To try, add a number to the second code box below.

NOTE: word_length defaults to 2, so most_common_noun(vol) is the same as most_common_noun(vol, 2).

In [None]:
#establishing the function used in the next box
def most_common_noun(vol, word_length=2):
    # Build a table of common nouns, most frequent first
    tokens_1 = vol.tokenlist(pages=False, case=False)
    nouns_only = tokens_1.loc[(slice(None), slice(None), ['NN']),]
    top_nouns = nouns_only.sort_values('count', ascending=False)

    token_index = top_nouns.index.get_level_values('lowercase')

    # Choose the first token at least as long as word_length
    # that contains none of these punctuation characters
    for token in token_index:
        if len(token) >= word_length and not any(ch in token for ch in "'!,?"):
            return token
    print('There is no noun of this length')
    return None

In [None]:
#add a number to the parenthesis between the comma , and end parenthesis )
most_common_noun(vol, )
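
Checking whether a token contains any of several unwanted characters is easy to get subtly wrong, so here is a minimal standalone sketch. The helper name is_clean is hypothetical. Note that a bare tuple such as ("'", "!") placed in an if statement is always truthy, which is why any() is used here.

```python
unwanted = "'!,?"


def is_clean(token):
    """Return True when the token contains none of the unwanted characters."""
    return not any(ch in token for ch in unwanted)


print(is_clean("time"))   # True
print(is_clean("don't"))  # False
```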

### 2
Here, the function frequency() returns the relative frequency of the given word in a single volume: the number of times the word occurs divided by the total number of tokens in that volume.

NOTE: frequency() returns a dictionary of the form {'word': frequency}. For example, frequency(vol, 'blue') returns a result shaped like {'blue': 0.00012}; the actual number depends on the volume.

Try adding a word in the single quotes in the last line below.

In [None]:
#establishing the function used in the next box
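def frequency(vol, word):
    # Token counts for the whole volume, case-folded, without POS or page detail
    t1 = vol.tokenlist(pages=False, pos=False, case=False)
    matches = t1[t1.index.get_level_values("lowercase") == word]

    # The word does not occur in this volume
    if len(matches) == 0:
        return {word: 0}

    # Relative frequency: occurrences of the word over all tokens in the volume
    count = matches['count'].sum()
    freq = count / t1['count'].sum()

    return {word: round(float(freq), 5)}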



In [None]:
#add a word in the quotes below
frequency(vol, '')

### Putting them together

The code below returns a plot of the usage frequencies of the given word across all volumes in the given FeatureReader collection.

Try adding different words to see their relative frequency in our sample.

NOTE: frequencies are given as percentages rather than true ratios.

In [None]:
#establishing the function used in the next box
def frequency_bar_plot(word, fr):
    freqs, titles = [], []
    for vol in fr:
        title = vol.title
        short_title = title[:6] + (title[6:] and '..')
        freqs.append(100*frequency(vol, word)[word])
        titles.append(short_title)
        
    # Format and plot the data
    x_ticks = numpy.arange(len(titles))
    plt.bar(x_ticks, freqs)
    plt.xticks(x_ticks, titles, fontsize=10, rotation=45)
    plt.ylabel('Frequency (%)', fontsize=12)
    plt.title('Frequency of "' + word + '"', fontsize=14)

In [None]:
#add a word between the quotes
frequency_bar_plot('', fr1970)

Okay, that's interesting, but since all our titles are the same, it's hard to make sense of the results. Let's try plotting relative frequency over time.

The code below returns a DataFrame of relative frequencies, volume years, and page counts, along with a scatter plot.

NOTE: frequencies are given in percentages rather than true ratios.

Try adding a word in the single quotes in the last line below.

In [None]:
#establishing the function used in the next box
def frequency_by_year(query_word, fr):
    volumes = pandas.DataFrame()
    years, page_counts, query_freqs = [], [], []

    for vol in fr:
        years.append(int(vol.year))
        page_counts.append(int(vol.page_count))
        query_freqs.append(100*frequency(vol, query_word)[query_word])
    
    volumes['year'], volumes['pages'], volumes['freq'] = years, page_counts, query_freqs
    volumes = volumes.sort_values('year')
    
    # Set plot dimensions and labels
    scatter_plot = volumes.plot.scatter('year', 'freq', color='black', s=50, fontsize=12)
    plt.ylim(0-numpy.mean(query_freqs), max(query_freqs)+numpy.mean(query_freqs))
    plt.ylabel('Frequency (%)', fontsize=12)
    plt.xlabel('Year', fontsize=12)
    plt.title('Frequency of "' + query_word + '"', fontsize=14)
    
    return volumes.head(10)

In [None]:
#add a word between the quotes
frequency_by_year('', fr1970)

### Making use of the structured file

One particularly useful thing about the Extracted Features dataset is that the tokens in the Extracted Features files are part-of-speech tagged to differentiate homonyms like rose, which can be a name, a noun, and a verb.

For each page, the data is also divided into a header, body, and footer section so that you can systematically remove headers or footers from your data if you choose.
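
A small sketch shows why both kinds of structure matter. The rows below are invented for illustration; the tags (NN, VBD, NNP, CD) are standard Penn Treebank codes.

```python
# Hypothetical rows: (section, token, part-of-speech tag, count)
rows = [
    ("header", "Chapter", "NN", 1),
    ("body", "rose", "NN", 2),    # the flower
    ("body", "rose", "VBD", 1),   # past tense of "rise"
    ("body", "Rose", "NNP", 3),   # a proper name
    ("footer", "12", "CD", 1),
]

# Drop headers and footers, keeping only the body of the page.
body_only = [row for row in rows if row[0] == "body"]

# The same spelling is counted separately under each part-of-speech tag.
rose_by_pos = {pos: n for _, token, pos, n in body_only if token.lower() == "rose"}

print(rose_by_pos)  # {'NN': 2, 'VBD': 1, 'NNP': 3}
```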

We already saw the possibility of drilling down to the part-of-speech tag earlier when we found the most frequently occurring noun in a volume. Below, we will look for one part of speech in just the body of our volumes.

What do you find? Try editing the code to retrieve tokens of another part of speech. Here are the codes in the Penn Treebank: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [None]:
idx = pandas.IndexSlice
adj_dfs = []
for vol in fr1970.volumes():
    tl = vol.tokenlist(pages=False)        # body section by default
    tl.index = tl.index.droplevel(0)       # drop the first index level (section)
    adj_dfs.append(tl.loc[idx[:, 'JJ'],])  # keep only this volume's adjectives
all_adj = pandas.concat(adj_dfs).groupby(level='token').sum().sort_values('count', ascending=False)[:50]

#prints the Pandas dataframe of all adjectives.
print(all_adj)
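
The concatenate-then-group step is the "grouping and aggregating" skill mentioned earlier. Here is a minimal sketch on two tiny tables; the helper toy_adjectives and all counts are hypothetical, standing in for per-volume adjective tables.

```python
import pandas as pd


def toy_adjectives(rows):
    """Build a small (token, pos) indexed table from hypothetical rows."""
    index = pd.MultiIndex.from_tuples(
        [(token, "JJ") for token, _ in rows], names=["token", "pos"]
    )
    return pd.DataFrame({"count": [count for _, count in rows]}, index=index)


vol_a = toy_adjectives([("great", 5), ("old", 2)])
vol_b = toy_adjectives([("great", 1), ("new", 4)])

# Stack the per-volume tables, then add up counts that share a token.
combined = pd.concat([vol_a, vol_b])
totals = combined.groupby(level="token").sum().sort_values("count", ascending=False)

print(totals)
```

Here "great" appears in both toy volumes, so its two counts are summed into a single row, which is the behavior you want when pooling many volumes.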