## Single File Exploration and Analysis

We'll start by looking at a single file using the HTRC FeatureReader. This will give us an opportunity to explore the feature reader API on a smaller dataset before moving on to analysis of larger collections.

Note: For this workshop, we're using a collection of documents from UCSF Health Sciences. This workshop is adapted from tutorial material and documentation from HathiTrust and Programming Historian. If you'd like to follow along with this workbook but don't have the UCSF dataset, you can (with some modifications) use sample datasets from HathiTrust. Links are below.

Programming Historian tutorial:
    https://programminghistorian.org/en/lessons/text-mining-with-extracted-features

Sample data:
    https://analytics.hathitrust.org/datasets

HathiTrust FeatureReader documentation and examples:
    https://github.com/htrc/htrc-feature-reader
    
The Feature Reader provides an extensive set of NLP tools for text analysis, more than we can cover in this workshop. The goal here is to introduce the API and do enough programming with the FeatureReader that you feel comfortable continuing to read, code, and apply it to your own research and projects.

### Import modules and set Python and notebook parameters

First, some setup...

In [None]:
!pip install htrc-feature-reader
!pip install pyLDAvis
!pip install pandas==1.5.3

In [None]:
from htrc_features import FeatureReader, Volume

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas
pandas.set_option('display.max_rows', 50)

### Read in a single, multi-page volume

We'll use the Volume interface to read in a single document. 

In [None]:
vol = Volume('data/uc1/30268/uc1.32106020265887.json.bz2')

In [None]:
print(vol)

Jupyter Notebook will provide some formatting and create links to this document. 

In [None]:
vol

You can also access individual attributes from Python:

In [None]:
print(vol.handle_url)
print(vol.id, vol.page_count, vol.year, vol.language, vol.handle_url)

To list the metadata fields available on a Volume:

In [None]:
vol.parser.meta.keys()

### Exercise

Take a look at the document and familiarize yourself with it. Think about how text could be extracted from it. What information, in addition to text, would you want to preserve for your research? What information might you lose if you rely purely on text extraction?

### The Volume interface

In this next section, we'll take a short tour of the Volume interface. The API is more extensive than this, so you're encouraged to keep exploring after this workshop!

### Tokens

The concept of a token is key to using the Volume API. 

"A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing."

https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html
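
To make the definition concrete, here is a minimal sketch of tokenization using only the Python standard library. The sample sentence and the regular expression are illustrative assumptions; the Extracted Features files were produced with HathiTrust's own tokenizer, which may split text differently.

```python
import re
from collections import Counter

# Hypothetical sample text; it does not come from the workshop dataset.
sample_text = "The nurse reads the chart. The chart is long."

# Naive tokenizer (an assumption for illustration): lowercase the text,
# then treat every run of letters as one token.
tokens = re.findall(r"[a-z]+", sample_text.lower())

# Count how many times each distinct token occurs.
token_counts = Counter(tokens)

print(tokens)
print(token_counts.most_common(3))
```

Each item in the list is a token (one occurrence), while the keys of the counter are the distinct tokens.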

### Token Counts

We can use tokens_per_page() to count the tokens on each page, then plot the result.

In [None]:
vol.tokens_per_page()

In [None]:
# %matplotlib inline
tokens = vol.tokens_per_page()
tokens.plot()

### Unique Tokens

To get the unique tokens in a document:

In [None]:
unique_tokens = vol.tokens()

# convert to a list to display only the first 10
list(unique_tokens)[:10]

### Token List

The Extracted Features dataset also provides token counts at a much finer granularity: one row per page, section, token, and part of speech, with a count for each.

Part-of-speech tags follow the Penn Treebank tag set:
    
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [None]:
vol.tokenlist()

In [None]:
tl = vol.tokenlist()

In [None]:
tl

In [None]:
# use a groupby to count the number of each part of speech
tl.groupby(level=["pos"]).sum()[:10]
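
If you don't have the dataset at hand, the sketch below builds a small stand-in for a token list so you can see how the groupby works. The rows are invented for illustration, and the index level names (page, section, token, pos) are assumed to match what tokenlist() returns by default.

```python
import pandas as pd

# Hypothetical rows shaped like a token list: page, section, token, pos, count.
rows = [
    (1, "body", "nursing", "NN", 4),
    (1, "body", "the", "DT", 12),
    (1, "body", "care", "NN", 3),
    (2, "body", "the", "DT", 9),
    (2, "body", "provides", "VBZ", 1),
]
toy_tl = pd.DataFrame(rows, columns=["page", "section", "token", "pos", "count"])
toy_tl = toy_tl.set_index(["page", "section", "token", "pos"])

# Sum the counts of all rows that share a part-of-speech tag,
# regardless of page, section, or token.
pos_totals = toy_tl.groupby(level="pos").sum()
print(pos_totals)
```

Grouping by an index level collapses the other levels, so the result has one row per tag.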

### Page and Parameters

You can access tokens for a specific page, and specify certain parameters (ignore case, select for specific part of speech or section of the document)

In [None]:
vol.tokenlist(page_select=9,case=False)
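
One way to picture what ignoring case does: tokens that differ only in capitalization are folded together and their counts are added. The sketch below imitates that idea with plain pandas on invented data. It approximates the concept and is not the library's implementation.

```python
import pandas as pd

# Hypothetical case-sensitive counts for a single page.
toy = pd.DataFrame(
    {
        "token": ["The", "the", "Nursing", "nursing"],
        "pos": ["DT", "DT", "NN", "NN"],
        "count": [2, 10, 1, 3],
    }
)

# Fold each token to lowercase, then add up counts that now share a key.
toy["lowercase"] = toy["token"].str.lower()
folded = toy.groupby(["lowercase", "pos"])["count"].sum()
print(folded)
```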

### Exercise

Try out a few other methods available on a Volume. You can consult the API documentation or start with the list below.

In [None]:
vol.line_counts()
#vol.sentence_counts()
#vol.empty_line_counts()
#vol.begin_line_chars()
#vol.end_line_chars()

In [None]:
vol.line_counts().plot()

### Exercise

Take a look at some other records and get a sense of how text is extracted:
* How complete is it?
* What do you gain from relying exclusively on word counts? What do you lose?
* Is there clutter? Are non-alphanumeric characters useful to you?
* How could it help to know the position or part of speech of a token?
* How could varying transcription thoroughness and accuracy influence your research?

For the full list of tokenlist() parameters, see the API documentation:

http://htrc.github.io/htrc-feature-reader/htrc_features/feature_reader.m.html#htrc_features.feature_reader.Volume.tokenlist

In [None]:
pandas.set_option('display.max_rows', None)
vol.tokenlist(drop_section=False, case=False, pos=False)[1000:1010]

In [None]:
tl = vol.tokenlist()

In [None]:
# Select counts of the word "academic" for all pages and all page sections (first 10 results)
tl.loc[(slice(None), slice(None), "academic"),][:10]

### Exercise

Try to find the word “nursing” in this record, and compare the pages where it shows up to the tokens-per-page pattern plotted earlier.

In [None]:
tl_nursing = vol.tokenlist()
nursing_pages = tl_nursing.loc[(slice(None), slice(None), "nursing"),]
nursing_pages
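
The selection above relies on pandas MultiIndex slicing. Here is a small sketch of the same pattern on invented data, so the mechanics are visible without the real volume. The rows and the index level names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical token-list rows: page, section, token, pos, count.
rows = [
    (1, "body", "care", "NN", 3),
    (1, "body", "nursing", "NN", 4),
    (2, "body", "the", "DT", 9),
    (3, "body", "nursing", "NN", 2),
]
toy_tl = (
    pd.DataFrame(rows, columns=["page", "section", "token", "pos", "count"])
    .set_index(["page", "section", "token", "pos"])
    .sort_index()  # label slicing on a MultiIndex generally expects a sorted index
)

# slice(None) means "any value" for that level:
# any page, any section, token equal to "nursing".
hits = toy_tl.loc[(slice(None), slice(None), "nursing"), :]

# Add up the matches on each page where the token appears.
per_page = hits.groupby(level="page")["count"].sum()
print(per_page)
```

A per-page series like this can be set side by side with the tokens-per-page counts to see whether the word clusters in the longer pages.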

### Counting and sorting

For the next exercise, we'll limit our analysis to one page. 

In [None]:
# displaying first 10
vol.tokenlist(page_select=8,case=False).sort_values('count', ascending=False)[:10]

### Filtering based on token count

Use a Boolean mask to keep only the tokens that occur more than 20 times on the page.

In [None]:
tl_page = vol.tokenlist(page_select=8,case=False)
pandas.set_option('display.max_rows', None)
tl_page[tl_page['count'] > 20]

In [None]:
# alternatively, you can order them and print the first n results
tl_page.sort_values('count', ascending=False)[:10]
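
If the Boolean mask idea is new to you, this minimal sketch separates the two steps on an invented table: first build the mask, then use it to filter. The tokens and counts are made up for illustration.

```python
import pandas as pd

# Hypothetical token counts for one page.
toy_page = pd.DataFrame(
    {"token": ["the", "of", "nursing", "school"], "count": [31, 24, 7, 2]}
).set_index("token")

# Step 1: the comparison yields a Series of True/False values, one per row.
mask = toy_page["count"] > 20

# Step 2: indexing with that Series keeps only the rows marked True.
frequent = toy_page[mask]

print(mask)
print(frequent)
```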

### Something to Consider

The most frequent words by count are mostly "stop words": very common words, such as articles and prepositions, that are rarely useful in an analysis.
In the next section, where we consider a collection of documents, we'll review a technique to remove stop words prior to analysis.
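
As a preview of the idea, the sketch below drops stop words from an invented table by inverting a Boolean mask. The tiny stop word set is a hypothetical placeholder, and the technique covered in the next section may differ.

```python
import pandas as pd

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
stop_words = {"the", "of", "and", "in"}

# Hypothetical token counts for one page.
toy_page = pd.DataFrame(
    {
        "token": ["the", "of", "nursing", "and", "school"],
        "count": [31, 24, 7, 6, 2],
    }
)

# True for rows whose token is a stop word.
is_stop = toy_page["token"].isin(stop_words)

# The ~ operator inverts the mask, keeping everything that is not a stop word.
content_words = toy_page[~is_stop].sort_values("count", ascending=False)
print(content_words)
```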

### Working with granular, row-by-row data

At some point, you may want to write your own code to work on a granular level with the contents of a record, rather than relying on dataframe operations. 

In [None]:
# the break at the end stops the loop after the first page

for page in vol.pages():
    print('page:', page)
    
    page_df = page.tokenlist()

    for i, r in page_df.iterrows():
        print('i:', i)
        print('r:', r)
        print()
    break
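
In the loop above, i is the row's index (a tuple with one value per index level) and r is a pandas Series holding the row's columns. The sketch below unpacks both on invented data. The index layout (page, section, token, pos) is an assumption based on the default token list.

```python
import pandas as pd

# Hypothetical token-list rows for a single page.
rows = [
    (1, "body", "nursing", "NN", 4),
    (1, "body", "the", "DT", 12),
]
toy_page = pd.DataFrame(
    rows, columns=["page", "section", "token", "pos", "count"]
).set_index(["page", "section", "token", "pos"])

summaries = []
for index, row in toy_page.iterrows():
    # The index tuple unpacks into one variable per index level.
    page, section, token, pos = index
    # Column values are read from the row by column name.
    summaries.append(f"page {page}, {section}: {token} ({pos}) x{row['count']}")

print("\n".join(summaries))
```

Row-by-row iteration is convenient for custom logic but slow on large frames, so prefer dataframe operations when they can express the same thing.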
