# Experimenting with the HathiTrust Corpus & Machine Learning

This notebook is based on the [Classifying HTRC Genre notebook](https://github.com/htrc/ACS-TT/blob/master/tools/notebooks/ClassifyingHtrcGenreWordFrequencies.ipynb) and uses data from the [HathiTrust Research Center](https://sharc.hathitrust.org/genre). A few caveats about the provenance and method of selection of the data:
- only English
- texts published 1700-1799, but may include reprints of earlier materials
- volumes held by large public or university libraries, primarily in the United States.
- the creators don't recommend the dataset as a source for literary research before 1750, since texts published pre-1800 are often held in Special Collections and their digitization is less predictable.
- On accuracy, the creators' short version: the model's predictions about genre matched human descriptions 93.6% of the time, which is roughly as often as our six human readers agreed with each other (94.5%). Moreover, the datasets provided here have passed through additional (automatic and manual) filtering that allows us to guarantee better than 97% precision.

See the HathiTrust Research Center for more observations on normalization, OCR correction, metadata, and more.

All right, let's set up our classifier:

The first step is to (re)build our classifier, which here separates Travel texts from Other texts. It's worth reiterating that the classifier is being trained on a relatively small corpus (so it isn't likely to be as representative as it might be) and that the new HTRC genre corpus is literature-specific (so a different kind of beast from our training corpus). Is it still useful as a classifier? That's part of what we'd like to find out.
The classifier created below is essentially the same as before, though we'll use the LinearSVC algorithm because it provides a way of not just classifying (Travel or Other) but also of expressing a value for how strongly the text leans towards one class or the other.

In [2]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# define the training corpus to use 
trav_data_dir = "C:/Users/broomgrass/iPython/Dream/data/texts"
trav_corpus = nltk.corpus.reader.plaintext.PlaintextCorpusReader(trav_data_dir+"/modcorpus", r".*\.txt")
filtered_fileids = [fileid for fileid in trav_corpus.fileids()]

# create TF-IDF (actually relative frequencies) vectorizer
stopword_vectorizer = TfidfVectorizer(use_idf=False, stop_words=nltk.corpus.stopwords.words("english"), max_features=10000)
X_train = stopword_vectorizer.fit_transform([trav_corpus.raw(fileid) for fileid in filtered_fileids])
trav_categories = ["Other" if "Other" in fileid else "Travel" for fileid in filtered_fileids]

# create a classifier
trav_clf = LinearSVC(loss='squared_hinge', penalty="l2", dual=False, tol=1e-3)
trav_clf.fit(X_train, trav_categories)



LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.001,
     verbose=0)
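
To make the difference between a class label and a "how strongly" value concrete, here is a minimal sketch on invented toy texts. The names `toy_vectorizer` and `toy_clf` and all the sample sentences are assumptions for illustration, not part of the training corpus above. `predict` returns the label, while `decision_function` returns a signed distance from the separating boundary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy training texts and labels (invented for illustration only)
train_texts = [
    "we sailed to the island and travelled inland by road",
    "the voyage took us across the sea to a distant port",
    "the sermon spoke of virtue and the duty of prayer",
    "a treatise on grammar and the parts of speech",
]
train_labels = ["Travel", "Travel", "Other", "Other"]

# Relative frequencies rather than TF-IDF, as in the cell above
toy_vectorizer = TfidfVectorizer(use_idf=False)
X_toy = toy_vectorizer.fit_transform(train_texts)

toy_clf = LinearSVC(dual=False, tol=1e-3)
toy_clf.fit(X_toy, train_labels)

new_texts = ["our voyage by sea to the island", "a sermon on prayer and virtue"]
X_new = toy_vectorizer.transform(new_texts)

labels = toy_clf.predict(X_new)            # class labels
scores = toy_clf.decision_function(X_new)  # signed distances from the boundary

# classes_ is sorted alphabetically, so a positive score leans towards classes_[1]
print(toy_clf.classes_)
for text, label, score in zip(new_texts, labels, scores):
    print(label, round(float(score), 3), text)
```

Because the classes are sorted, "Other" comes first and "Travel" second, so larger positive values mean "more Travel-like" and negative values mean "more Other-like".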

#### Analyzing the HTRC Genre Corpus
The HTRC Genre corpus is organized by genre (if you rsync the directory rather than just download the files from the web the files are organized into subfolders by genre). For each genre there's a metadata file with all the volumes for that genre and then a set of compressed archives (.tar.gz) organized by time slice. Our strategy here will be as follows:
- for each genre folder (fiction, drama, poetry)
    - read the metadata file into a table
    - for each compressed archive in the genre folder (`*.tar.gz`)
        - for each tab-separated values file in the archive
            - create a pseudo-text
            - for each word-count pair: add the word the specified number of times to our pseudo-text
            - produce a classifier decision (a value for how strongly the text leans towards Travel)
            - add the prediction value to the corresponding row in the metadata table
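
The archive-reading and pseudo-text steps above can be sketched in isolation. This is only an illustration under stated assumptions: the archive is built in memory, and the file name `example_volume.tsv`, its contents, and the helper `expand_counts` are all made up rather than taken from the HTRC data.

```python
import io
import tarfile

def expand_counts(pairs):
    """Repeat each word `count` times, skipping tokens that contain no letters."""
    parts = []
    for word, count in pairs:
        if any(c.isalpha() for c in word):
            parts.extend([word] * int(count))
    return " ".join(parts)

# Build a tiny in-memory .tar.gz holding one made-up TSV file
tsv_bytes = "ship\t3\nthe\t2\n1745\t4\n".encode("utf-8")
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="example_volume.tsv")
    info.size = len(tsv_bytes)
    tar.addfile(info, io.BytesIO(tsv_bytes))
buffer.seek(0)

# Read it back: one pseudo-text per TSV member, keyed by the name without ".tsv"
pseudo_texts = {}
with tarfile.open(fileobj=buffer, mode="r:gz") as tar:
    for member in tar:
        if member.isreg() and member.name.endswith(".tsv"):
            lines = tar.extractfile(member).read().decode("utf-8").splitlines()
            pairs = [line.split("\t") for line in lines if line.strip()]
            pseudo_texts[member.name[:-4]] = expand_counts(pairs)

print(pseudo_texts)
```

Collecting the repeated words in a list and joining once avoids building a very long string by repeated concatenation, which matters when a volume has many thousands of tokens.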

In [3]:
from os import walk
from os.path import join
import glob
import tarfile
import pandas as pd

def get_genre_metadata_and_predictions(genre_dir, clf, vectorizer):
    metadatas = {}
    for (dirpath, dirnames, filenames) in walk(genre_dir):
        for genre in dirnames:
            genre_path = join(genre_dir, genre)
            metadata = pd.read_csv(join(genre_path, genre+"_metadata.csv"), index_col=0)
            metadata['prediction'] = [float(0)] * len(metadata)
            for tgz in glob.glob(join(genre_path,"*.tar.gz")):
                print("Analyzing "+tgz)
                tar = tarfile.open(tgz, "r:gz")
                for tarinfo in tar:
                    if tarinfo.isreg() and tarinfo.name.endswith("tsv"):

                        # read in the TSV file and expand the text (it would probably be quicker to
                        # create a vectorizer that can use the feature counts directly, but oh well)
                        text = ""
                        tsv = tar.extractfile(tarinfo)
                        for line in tsv.readlines():
                            word, count = line.decode("utf-8").strip().split("\t")
                            if any(c for c in word if c.isalpha()):
                                text += (word + " ") * int(count)

                        # predict the class
                        X_test = vectorizer.transform([text])
                        metadata.loc[tarinfo.name[0:-4], 'prediction'] = clf.decision_function(X_test)[0]

                tar.close()
            metadatas[genre] = metadata.sort_values('prediction', ascending=False)
        break
    return metadatas
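
The comment in the function above wonders whether feeding the counts to the vectorizer directly would be quicker than expanding a pseudo-text. One possible approach, sketched here under assumptions, is to look each word up in the fitted vectorizer's `vocabulary_` and L2-normalise the resulting row, which is what `TfidfVectorizer(use_idf=False)` does with its default settings. The helper `counts_to_row` and the demo data are hypothetical. Results may differ from the pseudo-text route for tokens containing punctuation, since the vectorizer's tokenizer would split those.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

def counts_to_row(word_counts, vectorizer):
    """Build one L2-normalised feature row from (word, count) pairs."""
    vocab = vectorizer.vocabulary_
    totals = {}
    for word, count in word_counts:
        col = vocab.get(word.lower())  # words outside the vocabulary are ignored
        if col is not None:
            totals[col] = totals.get(col, 0) + int(count)
    cols = sorted(totals)
    data = [float(totals[c]) for c in cols]
    row = csr_matrix((data, ([0] * len(cols), cols)), shape=(1, len(vocab)))
    return normalize(row, norm="l2")

# Demo vectorizer fitted on two invented documents
demo_vectorizer = TfidfVectorizer(use_idf=False)
demo_vectorizer.fit(["ship sea island voyage", "sermon prayer virtue"])

pairs = [("ship", 3), ("sea", 1), ("unknownword", 5)]
direct = counts_to_row(pairs, demo_vectorizer)

# The same volume via the pseudo-text route, for comparison
expanded = demo_vectorizer.transform(["ship ship ship sea unknownword unknownword"])
print(np.allclose(direct.toarray(), expanded.toarray()))
```

The resulting row could be passed to `decision_function` in place of `X_test`, though whether that is actually faster on the full corpus is untested here.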

Now we should be ready to use our classifier on the HTRC Genre corpus. This returns a dictionary object with keys for each genre (fiction, drama, poetry) and values that are pandas dataframes with all the existing metadata for each volume, plus the classifier's prediction value that we've added.

In [4]:
htrc_genre_dir = "C:/Users/broomgrass/iPython/HathiTrust/data"
trav_metadatas = get_genre_metadata_and_predictions(htrc_genre_dir, trav_clf, stopword_vectorizer)

We can have a quick peek to see how many volumes are contained in each genre:


In [5]:
total = 0
for genre, metadata in trav_metadatas.items():
    total += len(metadata.index)
    print(genre+": "+"{:,}".format(len(metadata.index)))
print("total: "+"{:,}".format(total))

total: 0


Well, that is *not* what I wanted at all. After looking at everything, I'm not sure exactly where the problem is. My file structure looks like this: <img src="screenshot.png">

I suppose a first step is to ask Stefan what the file structure looked like, though this seems to match what was described?

A second step is to go through the code more thoroughly, though I have already pored over it and can't see exactly where it went wrong. I changed all the variables to match my own (i.e., trav vs. philo in the [original notebook](https://github.com/htrc/ACS-TT/blob/master/tools/notebooks/ClassifyingHtrcGenreWordFrequencies.ipynb)), but maybe I missed something? Update: after spending another chunk of time going over it, I don't think I missed a variable.

Is it a problem with how my files are being brought in?

In [6]:
htrc_genre_dir.fileids()

AttributeError: 'str' object has no attribute 'fileids'

In [9]:
testfiles = "C:/Users/broomgrass/iPython/Dream/data/texts/1600-1700corpus"
testfiles.fileids()

AttributeError: 'str' object has no attribute 'fileids'
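
Both errors have the same cause: `fileids()` is a method of NLTK corpus readers, not of plain path strings. To inspect what a path actually contains, the standard library is enough. The sketch below is one possible check, not a confirmed diagnosis: the helper `describe_genre_dir` is hypothetical, and the demo runs on a throwaway temporary directory rather than the real data folder. It relies on the fact that `os.walk` yields nothing for a path that does not exist, and yields an empty list of subfolders for a flat directory. Either case would leave the loop over genres with nothing to do and produce `total: 0`.

```python
import os
import tempfile

def describe_genre_dir(path):
    """Report what os.walk would find at the top level of `path`."""
    if not os.path.isdir(path):
        return {"exists": False, "subfolders": [], "files": []}
    dirpath, dirnames, filenames = next(os.walk(path))
    return {"exists": True, "subfolders": sorted(dirnames), "files": sorted(filenames)}

# Demonstration on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    # Flat layout: files sit directly in the folder, with no genre subfolders
    open(os.path.join(tmp, "fiction_metadata.csv"), "w").close()
    flat = describe_genre_dir(tmp)

    # Nested layout: a genre subfolder exists
    os.mkdir(os.path.join(tmp, "fiction"))
    nested = describe_genre_dir(tmp)

# The temporary directory is gone now, so this path does not exist
missing = describe_genre_dir(os.path.join(tmp, "does_not_exist"))

print(flat)
print(nested)
print(missing)
```

Calling `describe_genre_dir(htrc_genre_dir)` on the real folder would show whether the genre subfolders are where the function expects them.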