## Multiple File Exploration and Analysis

In the previous workbook, SingleFile, we used the FeatureReader to explore a single document. In the next two workbooks, "Multifile-Prep" and "Multifile-Analysis", we'll use the FeatureReader and Gensim to analyze a collection of documents and build a topic model.

This workbook prepares the data for analysis. Multifile-Analysis will then load the data created in this workbook to build a topic model using Gensim.

For this workshop, we'll continue to use a UCSF health sciences dataset. However, you should be able to swap in any of the sample datasets prepared by HathiTrust:

https://analytics.hathitrust.org/datasets

In [None]:
import pandas as pd
from htrc_features import FeatureReader, Volume
import glob
from tqdm.notebook import trange, tqdm
import warnings

### Read Data

We'll use glob to recursively find all .bz2 files in the directory structure and collect their paths in a list.

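As a minimal sketch of what the `**` pattern does, the snippet below builds a small throwaway directory tree and compares a recursive and a non-recursive search. The file names are made up for illustration and are not part of the workshop data.

```python
import glob
import os
import tempfile

# Build a small throwaway directory tree (hypothetical file names)
with tempfile.TemporaryDirectory() as root:
    for rel in ['top.bz2', 'a/one.bz2', 'a/b/two.bz2', 'a/b/notes.txt']:
        full = os.path.join(root, *rel.split('/'))
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, 'w').close()

    pattern = os.path.join(root, '**', '*.bz2')

    def relative(matches):
        # Report paths relative to the root, with forward slashes, in sorted order
        return sorted(os.path.relpath(m, root).replace(os.sep, '/') for m in matches)

    # With recursive=True, '**' matches zero or more directory levels
    recursive_matches = relative(glob.glob(pattern, recursive=True))
    # Without it, '**' behaves like '*' and matches exactly one level
    flat_matches = relative(glob.glob(pattern, recursive=False))

print(recursive_matches)
print(flat_matches)
```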
In [None]:
# recursively collect every .bz2 file under data/uc1
# (swap in the commented line below to use the SOM-30Vol sample instead)
paths = glob.glob('data/uc1/**/*.bz2', recursive=True)
#paths = glob.glob('data/SOM-30Vol/uc1/**/*.bz2', recursive=True)

In [None]:
len(paths)

In [None]:
print(paths[36])

### Prepare a dataset for a topic modeling exercise

To illustrate topic modeling, we'll prepare a dataset of records with "nursing" or "dentistry" in the title. Keep in mind that topic modeling doesn't require this step. Unsupervised machine learning doesn't need pre-defined categories or topics: the algorithm identifies clusters of documents around potential topics of interest for you. For an exercise, though, knowing our topics in advance helps us see how the algorithm identifies the clusters that emerge from the collection.

In [None]:
# this will take a few minutes to run.
# to make it run faster, you may want to lower the sample size

sample_size = 20
fr = FeatureReader(paths)
nursing_count = 0
dentistry_count = 0
vols = []
for vol in fr.volumes():
    title = vol.title.lower()
    # elif keeps a title that mentions both terms from being added twice
    if 'nursing' in title and nursing_count < sample_size:
        vols.append(vol)
        nursing_count += 1
    elif 'dentistry' in title and dentistry_count < sample_size:
        vols.append(vol)
        dentistry_count += 1
    # stop reading once both quotas are full
    if dentistry_count >= sample_size and nursing_count >= sample_size:
        break

In [None]:
len(vols)

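The quota logic used to build the sample can be tried out without reading any volumes. The sketch below applies the same idea to plain title strings. The helper `sample_by_keyword` and the titles are hypothetical and exist only for illustration.

```python
def sample_by_keyword(titles, keywords, sample_size):
    """Collect up to sample_size titles per keyword, stopping once every quota is full."""
    counts = {kw: 0 for kw in keywords}
    selected = []
    for title in titles:
        lowered = title.lower()
        for kw in keywords:
            if kw in lowered and counts[kw] < sample_size:
                selected.append(title)
                counts[kw] += 1
                break  # count each title against one keyword only
        if all(n >= sample_size for n in counts.values()):
            break
    return selected, counts

# Hypothetical titles, for illustration only
titles = [
    'Nursing outlook',
    'A textbook of operative dentistry',
    'Annual report of the library',
    'Public health nursing',
    'Nursing research',
    'Dentistry for children',
    'Journal of nursing education',
]
selected, counts = sample_by_keyword(titles, ['nursing', 'dentistry'], sample_size=2)
print(selected)
print(counts)
```

Because the loop stops as soon as both quotas are full, the last title is never examined. On a large collection, this early exit is what keeps the run time down when you lower the sample size.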
### Volumes

We can print the title of each volume in our collection by iterating over the vols list we created in the previous step.

In [None]:
for v in vols:
    print(v.title)

### Individual record access

We can access the metadata for each record in the list the same way we did in the single-file case.

In [None]:
vols[0]

In [None]:
# print the token lists for the first 10 pages only
for i, p in enumerate(vols[0].pages()):
    if i >= 10:
        break
    print(i+1, p.tokenlist())
    print()

### Exercise: Take a look at some other records and get a sense of how text is extracted

* how complete is it?
* what do you gain from relying exclusively on word count? what do you lose?
* is there clutter? Are non-alphanumeric characters useful to you?
* what do you lose if you don't know the position or part of speech of words? 
* how could varying transcription thoroughness and accuracy influence your research?

In [None]:
warnings.filterwarnings("ignore", category=FutureWarning)
vols[0].tokenlist(case=False)

### Cleaning Data

In the previous workbook, we discussed the potential "clutter" in our text. We may not want punctuation, non-alphanumeric characters, or stop words, and we may want to lemmatize or stem some of our terms.

A full review of cleaning text data is beyond the scope of this workshop, though we will remove the stop words.

In [None]:
import nltk
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

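To make the cleaning steps in this section concrete, here is a minimal sketch that filters a list of tokens. It assumes a tiny hand-picked stop list (`demo_stop`) in place of NLTK's English list so that it runs without a download. The helper `clean_tokens` and the sample tokens are illustrative only.

```python
# A tiny hand-picked stop list, standing in for NLTK's much longer English list
demo_stop = {'the', 'of', 'and', 'in', 'a', 'to', 'is'}

def clean_tokens(tokens, stop_words, min_length=3):
    """Lowercase tokens, then drop stop words, non-alphabetic tokens, and very short tokens."""
    cleaned = []
    for token in tokens:
        token = token.strip().lower()
        if token in stop_words:
            continue
        if not token.isalpha():
            continue
        if len(token) < min_length:
            continue
        cleaned.append(token)
    return cleaned

raw = ['The', 'practice', 'of', 'nursing', ',', 'in', '1923', 'and',
       'dental', 'x-ray', 'is', 'an', 'art', '.']
cleaned = clean_tokens(raw, demo_stop)
print(cleaned)
```

Note that the alphabetic check also discards hyphenated terms such as "x-ray", which is one example of what this kind of cleaning can lose.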
### Create a bag of words

This next bit of code is a little complicated. We're using some of the techniques discussed earlier for going line-by-line through the pages of each volume to create a bag of words model and term frequency vector for each page. 

For now, let's walk through this code, take a look at the output, and discuss how it works. You may also want to refer back to the bag-of-words visuals.

In [None]:
# note - tqdm is used to provide a progress bar, since this bit of code can take a while to run.
# it's optional but can be useful to get a sense of how your code is progressing and how much longer it will
# take to run

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

vol_df = pd.DataFrame(columns=['htid', 'page_number', 'page_tokens', 'title'])

for vol in tqdm(vols, total=len(vols)):
    title = vol.title
    htid = vol.id
    for page in vol.pages():
        page_num = str(page).split(' ')[1]
        page_df = page.tokenlist(section='body', case=False, pos=False)

        tkn_list = []

        # each row's index is a (page, section, token) tuple; the row itself holds the token's count
        for idx, row in page_df.iterrows():
            word = idx[2].strip()
            count = int(row.iloc[0])

            # keep alphabetic, non-stop-word tokens longer than two characters,
            # repeating each token once per occurrence on the page
            if word not in en_stop and word.isalpha() and len(word) > 2:
                tkn_list.extend([word] * count)

        # keep only pages with more than 50 tokens that mention both 'nursing' and 'dentistry'
        if len(tkn_list) > 50 and 'nursing' in tkn_list and 'dentistry' in tkn_list:
            app_df = pd.DataFrame({'title':title, 'htid': htid, 'page_number':  page_num, 'page_tokens': [tkn_list]})
            vol_df = pd.concat([vol_df, app_df], ignore_index=True)


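The core idea behind the bag-of-words step can be seen on a single toy page. The sketch below assumes a short list of hypothetical (token, count) pairs in place of a real page token list. It expands them into a bag of words and then derives a term frequency vector.

```python
from collections import Counter

# Hypothetical (token, count) pairs for one page, standing in for the rows of a page token list
page_counts = [('nursing', 3), ('care', 2), ('patient', 1)]

# Expand the counts into a bag of words: each token repeated once per occurrence
bag = []
for token, count in page_counts:
    bag.extend([token] * count)

# Collapsing the bag again recovers the term frequencies
term_freq = Counter(bag)

# A term frequency vector lists those counts in a fixed vocabulary order
vocabulary = sorted(term_freq)
vector = [term_freq[term] for term in vocabulary]

print(bag)
print(vocabulary, vector)
```

The workbook's loop performs the same expansion for every page and stores the resulting token list in the page_tokens column. Word order is not preserved, which is what makes this a "bag" of words.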
### Dataframe for Text Analysis

The bag-of-words loop above creates a dataframe containing, for each page, the document id, page number, tokens (including duplicates), and title.

In [None]:
vol_df

### Pickle

We've created our dataframe. We'll save it to a directory in our workspace and load it for analysis in the next workbook.

You may be wondering why we aren't using the to_csv method to export our data to a csv. We're using pickle (which serializes the dataframe to a binary file that can't be opened and read like a csv) because our page_tokens column contains lists rather than single values. Lists can cause issues when writing to or reading from csv files. There are ways around this, but pickling is easier if you don't need to read the data outside a pandas dataframe.

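The difference is easy to see on a single record. This sketch uses Python's built-in csv and pickle modules rather than pandas, and the record itself is hypothetical. The same thing happens to list-valued cells when a dataframe is written to csv.

```python
import csv
import io
import pickle

# One hypothetical record whose last field is a list of tokens
record = {'htid': 'example.001', 'page_number': '12',
          'page_tokens': ['nursing', 'care', 'care']}

# CSV round trip: the list is written out as its string representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

# Pickle round trip: the list comes back as a list
from_pickle = pickle.loads(pickle.dumps(record))

print(type(from_csv['page_tokens']).__name__, from_csv['page_tokens'])
print(type(from_pickle['page_tokens']).__name__, from_pickle['page_tokens'])
```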
In [None]:
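# pandas to_csv will convert lists into strings, which will be a hassle.
# better to pickle it.
import os
os.makedirs('processed_data', exist_ok=True)  # create the output directory if it doesn't exist yet
vol_df.to_pickle('processed_data/ucsf_medical.pkl')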