## Classification

This notebook demonstrates some of the classification tasks that can be accomplished using data retrieved from the HTRC API.

We will be using the Scikit Learn library to tackle these classification problems. Install the library using pip:
```
   pip install sklearn
```

For these examples, we will be gathering data using the advance search from the Hathi Trust library. To create your training set, determine a search query that will become your labels. Once you search, add the search results to a collection named after your label.

Once you have completed adding to your collection, go to "My Collections" under ____. There you will find a button called "Download Metadata".

Download the JSON file associated with your collection and place it in the local directory that you are working in.

In [4]:
import json
import os

jsonFiles = [file for file in os.listdir('.') if file.find('json') != -1]

txts = []
for file in jsonFiles:
    with open(file) as f:
        data = json.load(f)
        
    texts = data['gathers']
    ids = [text['htitem_id'] for text in texts]
    
    filename = data['title'] + '.txt'
    txts.append(filename)
    
    #write each id into txt file
    with open(filename, 'w') as f:
        for textid in ids:
            f.write(textid + '\n')

print("JSON files created")

JSON files created


Once this step is complete, you can follow the instructions in the Setup notebook to load the data as needed.

## Task 1: Genre Classification

In this example, we'll be classifying texts into 3 different genres: Poetry, History, and Science Fiction. JSON files containing the metadata for 500 texts in each genre have been included.

In [5]:
import os

history_output = !htid2rsync --f history.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ history/
poetry_output = !htid2rsync --f poetry.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ poetry/
scifi_output = !htid2rsync --f scifi.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ scifi/
outputs = list([history_output, poetry_output, scifi_output])
subjects = ['history', 'poetry', 'scifi']

paths = {}
suffix = '.json.bz2'
for subject, output in zip(subjects, outputs):
    folder = subject
    filePaths = [path for path in output if path.endswith(suffix)]
    paths[subject] = [os.path.join(folder, path) for path in filePaths]
    fn = subject + '_paths.txt'
    with open(fn, 'w') as f:
        for path in paths[subject]:
            p = str(path) + '\n'
            f.write(p)

#write paths into a .txt file

print("Path extraction complete")

Path extraction complete


Save volumes for each class from their respective paths.

In [3]:
from htrc_features import FeatureReader

paths = {}
subjects = ['history', 'poetry', 'scifi']
for subject in subjects:
    with open(subject + '_paths.txt', 'r') as f:
#         print(f.readlines())
        paths[subject] = [line[:len(line)-1] for line in f.readlines()]
        
history = FeatureReader(paths['history'])

poetry = FeatureReader(paths['poetry'])

scifi = FeatureReader(paths['scifi'])

print("Finished reading paths")


Finished reading paths


To create our bag of words matrix, we need to keep a global dictionary of all words seen in each of our texts. We initialize "wordDict", which tracks all the words seen and records its index in the bag of words matrix. 

In [18]:
import numpy as np

wordDict = {}
i = 0 
volumes = []

print("Generating global dictionary")
volCount = 0
for vol in scifi.volumes():
    volumes.append(vol)
    tok_list = vol.tokenlist(pages=False)
    tokens = tok_list.index.get_level_values('token')
    if volCount == 200:
        break
    volCount += 1 
    
    for token in tokens:
        if token not in wordDict.keys():
            wordDict[token] = i
            i += 1

for vol in poetry.volumes():
    volumes.append(vol)
    tok_list = vol.tokenlist(pages=False)
    tokens = tok_list.index.get_level_values('token')
    if volCount == 400:
        break
    volCount += 1 
    
    for token in tokens:
        if token not in wordDict.keys():
            wordDict[token] = i
            i += 1
 

Generating global dictionary


Once we construct the global dictionary, we can fill the bag of words matrix with the word counts for each volume. Once we have this, we will use it to format the training data for our model.

In [19]:
print("Generating bag of words matrix")
dtm = np.zeros((volCount, len(wordDict.keys())))

for i, vol in enumerate(volumes):
    tok_list = vol.tokenlist(pages=False)
    counts = list(tok_list['count'])
    tokens = tok_list.index.get_level_values('token')
    
    for token, count in zip(tokens, counts):
        try:
            index = wordDict[token]
            dtm[i, index] = count
        except:
            pass
        
X = dtm
y = np.zeros((400))
y[200:400] = 1

print("Finished")

Generating bag of words matrix
Finished


We can then use the TfidfTransformer to format the bag of words matrix, so that we can fit it to our LinearSVC model. Let's see how our model does.

In [20]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn import cross_validation

tfidf = TfidfTransformer()
out = tfidf.fit_transform(X, y)

model = LinearSVC()

score = cross_validation.cross_val_score(model, X, y, cv=10)
print(np.mean(score))


0.9


## Task 2: Author Gender Classification

Our next task will be to classify text based on the author's gender. We can find this under the 'htrc_gender' attribute found in each volume's metadata.

We will create the training set using the existing volumes we have already seen in the previous example by searching the metadata fields for gender.

In [7]:
import sys

vols = []
X_2 = []
y_2 = []

wordDictGender = {}
subjects = [history, poetry, scifi]
index = 0
male = 0
female = 0
for subject in subjects:
    for vol in subject.volumes():
        if male == 10 and female == 10:
            break
        try:
            if str(vol.metadata['htrc_gender'][0]) == 'male':
                if male < 10:
                    vols.append(vol)
                    tok_list = vol.tokenlist(pages=False)
                    tokens = tok_list.index.get_level_values('token')
                    for token in tokens:
                        if token not in wordDictGender.keys():
                            wordDictGender[token] = index
                            index += 1
                            
                    y_2.append(0)
                    male += 1
                    
            elif str(vol.metadata['htrc_gender'][0]) == 'female':
                if female < 10:
                    vols.append(vol)
                    tok_list = vol.tokenlist(pages=False)
                    tokens = tok_list.index.get_level_values('token')
                    
                    for token in tokens:
                        if token not in wordDictGender.keys():
                            wordDictGender[token] = index
                            index += 1
                    
                    y_2.append(1)
                    female += 1
        except:
            print("Unexpected error:", sys.exc_info()[0])

        print(male, female)
    if male == 10 and female == 10:
        break


Unexpected error: <class 'IndexError'>
0 0
Unexpected error: <class 'KeyError'>
0 0
1 0
2 0
Unexpected error: <class 'KeyError'>
2 0
3 0
4 0
5 0
Unexpected error: <class 'KeyError'>
5 0
6 0
Unexpected error: <class 'KeyError'>
6 0
7 0
7 1
8 1
9 1
10 1
Unexpected error: <class 'KeyError'>
10 1
Unexpected error: <class 'KeyError'>
10 1
Unexpected error: <class 'KeyError'>
10 1
10 1
10 1
10 1
10 1
Unexpected error: <class 'KeyError'>
10 1
Unexpected error: <class 'KeyError'>
10 1
Unexpected error: <class 'KeyError'>
10 1
Unexpected error: <class 'KeyError'>
10 1
10 1
10 1
10 1
Unexpected error: <class 'KeyError'>
10 1
Unexpected error: <class 'KeyError'>
10 1
10 1
Unexpected error: <class 'KeyError'>
10 1
10 1
Unexpected error: <class 'KeyError'>
10 1
10 1
10 1
Unexpected error: <class 'KeyError'>
10 1
Unexpected error: <class 'IndexError'>
10 1
Unexpected error: <class 'IndexError'>
10 1
Unexpected error: <class 'IndexError'>
10 1
Unexpected error: <class 'IndexError'>
10 1
Unexpected er

In [13]:
import numpy as np

print("Generating bag of words matrix")
volCount = 20
dtm_gender = np.zeros((volCount, len(wordDictGender.keys())))

for i, vol in enumerate(vols):
    tok_list = vol.tokenlist(pages=False)
    counts = list(tok_list['count'])
    tokens = tok_list.index.get_level_values('token')
    
    for token, count in zip(tokens, counts):
        try:
            index = wordDictGender[token]
            dtm_gender[i, index] = count
        except:
            pass
        
X_2 = dtm_gender

print("Finished")

Generating bag of words matrix
Finished


In [14]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn import cross_validation

tfidf = TfidfTransformer()
out = tfidf.fit_transform(X_2, y_2)

model = LinearSVC()

score = cross_validation.cross_val_score(model, X_2, y_2, cv=10)
print(np.mean(score))



0.85
