## Classification

This notebook demonstrates some of the classification tasks that can be accomplished using data retrieved from the HTRC API.

We will be using the scikit-learn library to tackle these classification problems. Install the library using pip:
```
   pip install scikit-learn
```

For these examples, we will gather data using the advanced search in the HathiTrust library. To create your training set, choose a search query for each label you want to predict. Run the search, then add the results to a collection named after that label.

Once you have completed adding to your collection, go to "My Collections" under ____. There you will find a button called "Download Metadata".

Download the JSON file associated with your collection and place it in the local directory that you are working in.

In [5]:
import json
import os

jsonFiles = [file for file in os.listdir('.') if file.endswith('.json')]

txts = []
for file in jsonFiles:
    with open(file) as f:
        data = json.load(f)
        
    texts = data['gathers']
    ids = [text['htitem_id'] for text in texts]
    
    filename = data['title'] + '.txt'
    txts.append(filename)
    
    #write each id into txt file
    with open(filename, 'w') as f:
        for textid in ids:
            f.write(textid + '\n')

print("ID lists written to text files")
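
The cell above assumes that each collection file has a `title`, a `gathers` list, and an `htitem_id` on every item. The following is a minimal sketch of that shape using a made-up collection; the IDs and the `ids_from_collection` helper are hypothetical and only illustrate which keys are read.

```python
import json
import os
import tempfile

# Hypothetical collection metadata with the three keys the cell above reads:
# 'title', 'gathers', and each item's 'htitem_id'. The IDs are made up.
sample = {
    "title": "poetry",
    "gathers": [
        {"htitem_id": "abc.0001"},
        {"htitem_id": "abc.0002"},
    ],
}

def ids_from_collection(json_path):
    """Return (title, list of volume IDs) from a collection metadata file."""
    with open(json_path) as f:
        data = json.load(f)
    return data["title"], [item["htitem_id"] for item in data["gathers"]]

# Write the sample to a temporary file, then read it back the same way.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "poetry.json")
    with open(path, "w") as f:
        json.dump(sample, f)
    title, ids = ids_from_collection(path)

print(title, ids)
```

If a downloaded file raises a `KeyError` here, its layout differs from what this notebook expects and the key names need adjusting.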

Once this step is complete, you can follow the instructions in the Setup notebook to load the data as needed.

## Task 1: Genre Classification

In this example, we'll be classifying texts into 3 different genres: Poetry, History, and Science Fiction. JSON files containing the metadata for 500 texts in each genre have been included.

In [25]:
import os

history_output = !htid2rsync --f history.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ history/
poetry_output = !htid2rsync --f poetry.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ poetry/
scifi_output = !htid2rsync --f scifi.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ scifi/
outputs = [history_output, poetry_output, scifi_output]
subjects = ['history', 'poetry', 'scifi']

paths = {}
suffix = '.json.bz2'
for subject, output in zip(subjects, outputs):
    folder = subject
    filePaths = [path for path in output if path.endswith(suffix)]
    paths[subject] = [os.path.join(folder, path) for path in filePaths]
    # write this subject's paths into a .txt file, one per line
    fn = subject + '_paths.txt'
    with open(fn, 'w') as f:
        for path in paths[subject]:
            f.write(str(path) + '\n')

print("Path extraction complete")

['[sandbox] Welcome to the HathiTrust Research Center rsync server.', '', 'receiving file list ... ', 'rsync: link_stat "uva/pairtree_root/x0/30/52/63/54/x030526354/uva.x030526354.json.bz2" (in features) failed: No such file or directory (2)', 'rsync: link_stat "uva/pairtree_root/x0/30/20/49/71/x030204971/uva.x030204971.json.bz2" (in features) failed: No such file or directory (2)', 'rsync: link_stat "osu/pairtree_root/32/43/50/75/58/68/83/32435075586883/osu.32435075586883.json.bz2" (in features) failed: No such file or directory (2)', 'rsync: link_stat "osu/pairtree_root/32/43/50/75/58/67/35/32435075586735/osu.32435075586735.json.bz2" (in features) failed: No such file or directory (2)', 'rsync: link_stat "osu/pairtree_root/32/43/50/24/29/71/45/32435024297145/osu.32435024297145.json.bz2" (in features) failed: No such file or directory (2)', 'rsync: link_stat "osu/pairtree_root/32/43/50/24/29/71/37/32435024297137/osu.32435024297137.json.bz2" (in features) failed: No such file or direct
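
The output above mixes transferred paths with `link_stat` failures for volumes that have no feature file on the server. Below is a minimal sketch, assuming the failure lines keep the format shown above, that separates the two so the missing volumes can be counted. `split_rsync_output` is a hypothetical helper, and the successful path in the sample is made up.

```python
import re

# Matches the failure lines rsync printed above, for example:
# rsync: link_stat "uva/.../uva.x030526354.json.bz2" (in features) failed: ...
FAILED = re.compile(r'^rsync: link_stat "(?P<path>[^"]+)" .*failed')

def split_rsync_output(lines, suffix=".json.bz2"):
    """Separate transferred file paths from paths rsync could not find."""
    downloaded, missing = [], []
    for line in lines:
        match = FAILED.match(line)
        if match:
            missing.append(match.group("path"))
        elif line.endswith(suffix):
            downloaded.append(line)
    return downloaded, missing

sample_output = [
    "receiving file list ... ",
    'rsync: link_stat "uva/pairtree_root/x0/30/52/63/54/x030526354/uva.x030526354.json.bz2" (in features) failed: No such file or directory (2)',
    "mdp/pairtree_root/00/01/mdp.0001.json.bz2",  # hypothetical success line
]
downloaded, missing = split_rsync_output(sample_output)
print(len(downloaded), "downloaded;", len(missing), "missing")
```

Comparing the two counts per subject shows how many of the requested volumes actually made it into the training set.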

Load the volumes for each class from the path lists saved above.

In [17]:
from htrc_features import FeatureReader

paths = {}
subjects = ['history', 'poetry', 'scifi']
for subject in subjects:
    with open(subject + '_paths.txt', 'r') as f:
        # strip the trailing newline from each saved path
        paths[subject] = [line.rstrip('\n') for line in f]
        
history = FeatureReader(paths['history'])

poetry = FeatureReader(paths['poetry'])

scifi = FeatureReader(paths['scifi'])

print("Finished reading paths")


Finished reading paths


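Recreate a string version of each text. The feature files store per-page token counts rather than the running text, so each token is repeated as many times as it was counted.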

In [46]:
import re
import os

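def isPure(s):
    # keep tokens made only of word characters or hyphens, with no digits
    return not any(char.isdigit() for char in s) and re.match(r'^[\w-]+$', s) is not None

def tableToString(ts, cs):
    # repeat each kept token as many times as it was counted
    s = ""
    for token, count in zip(ts, cs):
        if isPure(token):
            s += (token + " ") * int(count)
    return s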

def getTokenString(vol):
    volText = ""
    for page in vol:
        if page.token_count() > 0:
            counts = [t for t in page.tokenlist()['count']]
            tokens = page.tokens()
            volString = tableToString(tokens, counts)
            volText += volString
    return volText    

def writeVols(volumes, dirname, numTexts):
    if not os.path.exists(dirname):
        os.mkdir(dirname)
    i = 0
    for vol in volumes:
        if i == numTexts:
            break
        i += 1
        print(i, vol.title)
        volText = getTokenString(vol)
        fn = dirname + '/' + str(i) + '.txt'
        with open(fn, 'w') as f:
            f.write(volText)
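
To see what the string reconstruction produces without downloading any feature files, here is a small sketch on plain Python data. The `(token, count)` pairs stand in for one page's token list and are made up; `is_pure` and `expand_counts` are hypothetical names that mirror the filtering and repetition logic of the cell above.

```python
import re

def is_pure(token):
    """True for tokens made only of word characters or hyphens, with no digits."""
    return (re.match(r'^[\w-]+$', token) is not None
            and not any(c.isdigit() for c in token))

def expand_counts(token_counts):
    """Repeat each kept token `count` times, separated by spaces."""
    words = []
    for token, count in token_counts:
        if is_pure(token):
            words.extend([token] * count)
    return " ".join(words)

# Hypothetical page: (token, count) pairs standing in for one page's token list.
page = [("the", 3), ("whale", 2), ("1851", 1), (",", 4), ("sea-room", 1)]
print(expand_counts(page))
```

The result preserves how often each word occurs but not the order in which the words appeared, so any bigram features built from these strings reflect the reconstruction order rather than real phrases.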
    

In [45]:
scifiDir = 'scifi_texts'
histDir = 'history_texts'

writeVols(scifi.volumes(), scifiDir, 10)
writeVols(history.volumes(), histDir, 10)

print('Finished processing texts')

1 Modern masterpieces of science fiction, edited by Sam Moskowitz.



In [5]:
import numpy as np
import os
from sklearn.utils import shuffle

X = []
y = []
# label 0: science fiction
for file in os.listdir('scifi_texts'):
    with open(os.path.join('scifi_texts', file)) as f:
        X.append(f.read())
        y.append(0)

# label 1: history
for file in os.listdir('history_texts'):
    with open(os.path.join('history_texts', file)) as f:
        X.append(f.read())
        y.append(1)

np.random.seed(1)

X, y = shuffle(X, y, random_state=0)
print("Generated training data")

Generated training data


In [6]:
type(X)

list

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# unigram and bigram counts -> TF-IDF weights -> linear SVM
text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2))),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LinearSVC(random_state=0))
                     ])
# 3-fold cross-validation, matching the three scores shown below
scores = cross_val_score(text_clf, X, y, cv=3)
print(scores, np.mean(scores))

[ 1.          0.83333333  1.        ] 0.944444444444


In [None]:
#save output from CountVectorizer
#do a test prediction

print(np.mean(scores))
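
The test prediction noted in the cell above might look like the following sketch. It fits the same kind of pipeline on a made-up miniature corpus, so the documents, labels, and the `toy_clf` name are all illustrative assumptions rather than HathiTrust data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up miniature corpus: label 0 = science fiction, label 1 = history.
train_docs = [
    "robot spaceship alien planet",
    "alien robot laser planet",
    "spaceship laser galaxy alien",
    "king empire war treaty",
    "war king parliament treaty",
    "empire parliament revolution war",
]
train_labels = [0, 0, 0, 1, 1, 1]

toy_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2))),
                    ('tfidf', TfidfTransformer()),
                    ('clf', LinearSVC(random_state=0))])

# Fit on every training document, then label two unseen strings.
toy_clf.fit(train_docs, train_labels)
predictions = toy_clf.predict(["alien spaceship robot", "war empire king"])
print(predictions)
```

With the real data, the same `fit` and `predict` calls would be made on `text_clf`, ideally with a few texts held out from `X` so the prediction is checked on volumes the model has not seen.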

## Task 2: Author Gender Classification

Our next task will be to classify text based on the author's gender. We can find this under the 'htrc_gender' attribute found in each volume's metadata.

We will create the training set using the existing volumes we have already seen in the previous example by searching the metadata fields for gender.

In [None]:
X2 = []
y2 = []

if not os.path.exists('female_texts'):
    os.mkdir('female_texts')

if not os.path.exists('male_texts'):
    os.mkdir('male_texts')

subjects = [history, poetry, scifi]
male = 0
female = 0
for subject in subjects:
    for vol in subject.volumes():
        if male == 10 and female == 10:
            break
        try:
            # label 0: male author; label 1: any other recorded value
            if vol.metadata['htrc_gender'][0] == 'male':
                if male < 10:
                    X2.append(getTokenString(vol))
                    y2.append(0)
                    male += 1
            else:
                if female < 10:
                    X2.append(getTokenString(vol))
                    y2.append(1)
                    female += 1
        except Exception:
            # skip volumes with no usable gender metadata
            pass
    if male == 10 and female == 10:
        break
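
The sampling logic in the cell above, which collects a fixed number of volumes per class and stops once every class is full, can be sketched on plain data as follows. `take_balanced` is a hypothetical helper and the `(gender, title)` pairs are made up; `None` stands in for a volume with missing metadata.

```python
def take_balanced(labeled_items, labels, per_class):
    """Collect up to `per_class` items for each label, stopping once all are full."""
    picked = {label: [] for label in labels}
    for label, item in labeled_items:
        if label in picked and len(picked[label]) < per_class:
            picked[label].append(item)
        if all(len(items) == per_class for items in picked.values()):
            break
    return picked

# Hypothetical stream of (gender, volume title) pairs; None marks missing metadata.
stream = [("male", "A"), (None, "B"), ("male", "C"), ("female", "D"),
          ("male", "E"), ("female", "F"), ("female", "G")]
sample = take_balanced(stream, labels=("male", "female"), per_class=2)
print(sample)
```

Unlike the cell above, this sketch only accepts items whose label is explicitly listed, so volumes with missing or unexpected values are skipped instead of falling into the second class.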



In [None]:
from sklearn.model_selection import cross_val_score

text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2))),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LinearSVC(random_state=0))
                     ])
scores = cross_val_score(text_clf, X2, y2, cv=3)
print(scores, np.mean(scores))