## Classification

This notebook demonstrates some of the classification tasks that can be accomplished using data retrieved from the HTRC API.

We will be using the `scikit-learn` library to tackle these classification problems.

For these examples, we will be gathering data using the advance search from the Hathi Trust library. To create your training set, determine a search query that will become your labels. Once you search, add the search results to a collection named after your label.

Once you have completed adding to your collection, go to "My Collections" under ____. There you will find a button called "Download Metadata".

You'll download the JSON file associated with your collection and place it in the local directory that you are working in. We've already downloaded the example for you.

```python
import json
import os

jsonFiles = [file for file in os.listdir('.') if file.find('json') != -1]

txts = []
for file in jsonFiles:
    with open(file) as f:
        data = json.load(f)
        
    texts = data['gathers']
    ids = [text['htitem_id'] for text in texts]
    
    filename = data['title'] + '.txt'
    txts.append(filename)
    
    # write each id into txt file
    with open(filename, 'w') as f:
        for textid in ids:
            f.write(textid + '\n')

print("JSON files created")
```

Once this step is complete, you can follow the instructions in the Setup notebook to load the data as needed.

## Task 1: Genre Classification

In this example, we'll be classifying texts into 3 different genres: Poetry, History, and Science Fiction. JSON files containing the metadata for 500 texts in each genre have been included. We've already downloaded the data for you, but if you use your own data you'd need to run the cell below:

```python
import os

history_output = !htid2rsync --f history.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ history/
poetry_output = !htid2rsync --f poetry.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ poetry/
scifi_output = !htid2rsync --f scifi.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ scifi/
outputs = list([history_output, poetry_output, scifi_output])
subjects = ['history', 'poetry', 'scifi']

paths = {}
suffix = '.json.bz2'
for subject, output in zip(subjects, outputs):
    folder = subject
    filePaths = [path for path in output if path.endswith(suffix)]
    paths[subject] = [os.path.join(folder, path) for path in filePaths]
    fn = subject + '_paths.txt'
    with open(fn, 'w') as f:
        for path in paths[subject]:
            p = str(path) + '\n'
            f.write(p)

print("Path extraction complete")
```

As in the previous notebooks, we'll construct `FeatureReader` objects for each corpus. The line below reads in path files we created to the downloaded data:

In [5]:
from htrc_features import FeatureReader

paths = {}
subjects = ['history', 'poetry', 'scifi']
for subject in subjects:
    with open(subject + '_paths.txt', 'r') as f:
        paths[subject] = [line[:len(line)-1] for line in f.readlines()]
        
history = FeatureReader(paths['history'])

poetry = FeatureReader(paths['poetry'])

scifi = FeatureReader(paths['scifi'])

print("Finished reading paths")

Finished reading paths


To create our bag of words matrix, we need to keep a global dictionary of all words seen in each of our texts. We initialize "wordDict", which tracks all the words seen and records its index in the bag of words matrix. We also keep a list of volumes so that we can parse them later.

In [6]:
import numpy as np

wordDict = {}
i = 0 
volumes = []

print("Generating global dictionary")
volCount = 0
for vol in scifi.volumes():
    volumes.append(vol)
    tok_list = vol.tokenlist(pages=False)
    tokens = tok_list.index.get_level_values('token')
    if volCount == 200:  # first 200 from scifi volumes
        break
    volCount += 1 
    
    for token in tokens:
        if token not in wordDict.keys():
            wordDict[token] = i
            i += 1

for vol in poetry.volumes():
    volumes.append(vol)
    tok_list = vol.tokenlist(pages=False)
    tokens = tok_list.index.get_level_values('token')
    if volCount == 400:  # additional 200 from poetry volumes
        break
    volCount += 1 
    
    for token in tokens:
        if token not in wordDict.keys():
            wordDict[token] = i
            i += 1

print("Global dictionary generated")

Generating global dictionary
Global dictionary generated


## Challenge
How would you change the above code to have 500 training volumes per class?

---

Once we construct the global dictionary, we can fill the bag of words matrix with the word counts for each volume. Once we have this, we will use it to format the training data for our model.

In [7]:
print("Generating bag of words matrix")
dtm = np.zeros((volCount, len(wordDict.keys())))

for i, vol in enumerate(volumes):
    tok_list = vol.tokenlist(pages=False)
    counts = list(tok_list['count'])
    tokens = tok_list.index.get_level_values('token')
    
    for token, count in zip(tokens, counts):
        try:
            index = wordDict[token]
            dtm[i, index] = count
        except:
            pass
        
X = dtm
y = np.zeros((400))
y[200:400] = 1

print("Finished")

Generating bag of words matrix
Finished


We can then use the `TfidfTransformer` to format the bag of words matrix, so that we can fit it to our LinearSVC model. Let's see how our model does.

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn import cross_validation

tfidf = TfidfTransformer()
out = tfidf.fit_transform(X, y)

model = LinearSVC()

score = cross_validation.cross_val_score(model, X, y, cv=10)
print(np.mean(score))



0.8975


We can also get the most helpful features, or words, for each class. First we'll `fit` the model:

In [9]:
model.fit(X, y)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [10]:
feats = np.argsort(model.coef_[0])[:50]
top_scifi = [(list(feats).index(wordDict[w]) + 1, w) for w in wordDict.keys() if wordDict[w] in feats]
sorted(top_scifi)

[(1, 'w'),
 (2, 'science'),
 (3, 'fiction'),
 (4, 'SF'),
 (5, '"'),
 (6, 'Science'),
 (7, 'Fiction'),
 (8, 'London'),
 (9, 'from'),
 (10, 'at'),
 (11, 'you'),
 (12, 'story'),
 (13, 'it'),
 (14, 'into'),
 (15, '?'),
 (16, 'are'),
 (17, ':'),
 (18, 'Earth'),
 (19, 'But'),
 (20, 'they'),
 (21, 'He'),
 (22, 'stories'),
 (23, '('),
 (24, 'space'),
 (25, 'aliens'),
 (26, 'an'),
 (27, 'has'),
 (28, 'he'),
 (29, 'I'),
 (30, 'me'),
 (31, 'ale'),
 (32, 'but'),
 (33, 'two'),
 (34, 'W'),
 (35, 'Director'),
 (36, 'have'),
 (37, 'lub'),
 (38, 'za'),
 (39, 'could'),
 (40, 'Wells'),
 (41, 'your'),
 (42, 'Roberts'),
 (43, 'ship'),
 (44, 'fantasy'),
 (45, 'do'),
 (46, 'only'),
 (47, 'time'),
 (48, 'planet'),
 (49, 'tak'),
 (50, 'new')]

In [11]:
feats = np.argsort(model.coef_[0])[-50:]
top_poetry = [(list(feats).index(wordDict[w]) + 1, w) for w in wordDict.keys() if wordDict[w] in feats]
sorted(top_poetry, key=lambda tup: tup[0])

[(1, 'deal'),
 (2, 'AND'),
 (3, 'whole'),
 (4, '10'),
 (5, 'Land'),
 (6, 'America'),
 (7, 'Master'),
 (8, 'Columbia'),
 (9, 'by'),
 (10, 'after'),
 (11, 'live'),
 (12, 'on'),
 (13, 'rule'),
 (14, 'On'),
 (15, 'White'),
 (16, 'H.'),
 (17, "'"),
 (18, 'which'),
 (19, 'or'),
 (20, 'country'),
 (21, "'s"),
 (22, 'that'),
 (23, 'this'),
 (24, 'American'),
 (25, 'A'),
 (26, 'not'),
 (27, 'her'),
 (28, 'came'),
 (29, 'poetry'),
 (30, 'For'),
 (31, 'By'),
 (32, 'love'),
 (33, 'Law'),
 (34, 'planted'),
 (35, 'To'),
 (36, '1'),
 (37, 'fall'),
 (38, 'let'),
 (39, 'History'),
 (40, ';'),
 (41, 'truth'),
 (42, 'free'),
 (43, 'The'),
 (44, "'ll"),
 (45, 'all'),
 (46, 'we'),
 (47, 'Our'),
 (48, 'land'),
 (49, 'our'),
 (50, 'We')]

## Task 2: Author Gender Classification

Our next task will be to classify text based on the author's gender. We can find this under the 'htrc_gender' attribute found in each volume's metadata.

We will create the training set using the existing volumes we have already seen in the previous example by searching the metadata fields for gender. We will then add the volumes with these attributes, as well as add the correct labels to our `y` vector. For this example, we will use 10 training samples for each class to show the example.

In [12]:
vols = []
y_2 = []

wordDictGender = {}
subjects = [history, poetry, scifi]
index = 0
male = 0
female = 0

for subject in subjects:
    for vol in subject.volumes():
        if male == 10 and female == 10:
            break
        try:
            if str(vol.metadata['htrc_gender'][0]) == 'male':
                if male < 10:
                    vols.append(vol)
                    tok_list = vol.tokenlist(pages=False)
                    tokens = tok_list.index.get_level_values('token')
                    for token in tokens:
                        if token not in wordDictGender.keys():
                            wordDictGender[token] = index
                            index += 1
                            
                    y_2.append(0)
                    male += 1
                    
            elif str(vol.metadata['htrc_gender'][0]) == 'female':
                if female < 10:
                    vols.append(vol)
                    tok_list = vol.tokenlist(pages=False)
                    tokens = tok_list.index.get_level_values('token')
                    
                    for token in tokens:
                        if token not in wordDictGender.keys():
                            wordDictGender[token] = index
                            index += 1
                    
                    y_2.append(1)
                    female += 1
        except:
            pass
    if male == 10 and female == 10:
        break

ERROR:root:Unexpected: there were 0 results for coo.31924029579814 instead of 1.
ERROR:root:Unexpected: there were 0 results for inu.30000010053274 instead of 1.
ERROR:root:Unexpected: there were 0 results for inu.30000042013833 instead of 1.
ERROR:root:Unexpected: there were 0 results for inu.30000046348185 instead of 1.
ERROR:root:Unexpected: there were 0 results for inu.30000046368365 instead of 1.
ERROR:root:Unexpected: there were 0 results for inu.30000085261075 instead of 1.
ERROR:root:Unexpected: there were 0 results for inu.30000092448210 instead of 1.
ERROR:root:Unexpected: there were 0 results for inu.30000111022061 instead of 1.
ERROR:root:Unexpected: there were 0 results for inu.30000115666806 instead of 1.
ERROR:root:Unexpected: there were 0 results for inu.30000115670873 instead of 1.
ERROR:root:Unexpected: there were 0 results for inu.30000116553128 instead of 1.
ERROR:root:Unexpected: there were 0 results for inu.30000117880595 instead of 1.
ERROR:root:Unexpected: there

We can then take these volumes and create our bag of words matrix as we did in the previous example. Finally, we can run our LinearSVC model once again to show how well our model does.

In [13]:
import numpy as np

print("Generating bag of words matrix")
volCount = 20
dtm_gender = np.zeros((volCount, len(wordDictGender.keys())))

for i, vol in enumerate(vols):
    tok_list = vol.tokenlist(pages=False)
    counts = list(tok_list['count'])
    tokens = tok_list.index.get_level_values('token')
    
    for token, count in zip(tokens, counts):
        try:
            index = wordDictGender[token]
            dtm_gender[i, index] = count
        except:
            pass
        
X_2 = dtm_gender

print("Finished")

Generating bag of words matrix
Finished


In [15]:
tfidf = TfidfTransformer()
out = tfidf.fit_transform(X_2, y_2)

model = LinearSVC()

score = cross_validation.cross_val_score(model, X_2, y_2, cv=10)
print(np.mean(score))

0.85


In [16]:
model.fit(X_2, y_2)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [17]:
feats = np.argsort(model.coef_[0])[:50]
top_male = [(list(feats).index(wordDict[w]) + 1, w) for w in wordDict.keys() if wordDict[w] in feats]
sorted(top_male)

[(1, 'foul'),
 (2, 'fascinated'),
 (3, 'There’s'),
 (4, 'colorless'),
 (5, '..'),
 (6, 'culmination'),
 (7, 'conception'),
 (8, 'cyclonic'),
 (9, 'inﬂators'),
 (10, 'greatest'),
 (11, 'Physically'),
 (12, 'imperceptibly'),
 (13, 'Trained'),
 (14, 'arbiters'),
 (15, 'guard'),
 (16, 'foreground'),
 (17, 'atmospheric'),
 (18, "'from"),
 (19, 'nights'),
 (20, 'darts'),
 (21, 'more’n'),
 (22, 'Slammer'),
 (23, 'dojust'),
 (24, 'blue-tinged'),
 (25, 'Kung'),
 (26, 'integrator’s'),
 (27, 'faithful'),
 (28, 'example'),
 (29, 'mounted'),
 (30, 'answers'),
 (31, 'deceased'),
 (32, 'astrogater’s'),
 (33, 'kind'),
 (34, 'difficulties'),
 (35, 'Kumo'),
 (36, "'it"),
 (37, 'goggles'),
 (38, 'learn'),
 (39, 'braking'),
 (40, 'articulate'),
 (41, 'War'),
 (42, 'juus'),
 (43, 'doled'),
 (44, 'febrile'),
 (45, 'protracted'),
 (46, 'grin'),
 (47, '1946'),
 (48, 'wavered'),
 (49, 'instantaneously'),
 (50, 'car—and')]

In [18]:
feats = np.argsort(model.coef_[0])[-50:]
top_female = [(list(feats).index(wordDict[w]) + 1, w) for w in wordDict.keys() if wordDict[w] in feats]
sorted(top_female, key=lambda tup: tup[0])

[(1, 'ache'),
 (2, 'close'),
 (3, 'laugh'),
 (4, 'inﬂexible'),
 (5, 'legends'),
 (6, 'clarity'),
 (7, 'grandfather—did'),
 (8, 'audience'),
 (9, 'Aika'),
 (10, 'armed'),
 (11, 'Zim'),
 (12, 'does'),
 (13, 'Shooting'),
 (14, 'decipherable'),
 (15, 'What'),
 (16, 'Gailah'),
 (17, 'boring'),
 (18, 'close-up'),
 (19, 'consultin’'),
 (20, 'city'),
 (21, 'coruscating'),
 (22, 'drowsy'),
 (23, 'cooking'),
 (24, 'Graphic'),
 (25, 'exhibition'),
 (26, 'Gambler'),
 (27, 'irrational'),
 (28, 'contractors'),
 (29, 'lighter'),
 (30, 'Unhurt'),
 (31, 'Sophocles'),
 (32, 'chemistry'),
 (33, 'REY'),
 (34, 'Title'),
 (35, 'leam'),
 (36, 'body'),
 (37, '213'),
 (38, 'Sturgeon'),
 (39, ')'),
 (40, '!/hen'),
 (41, 'five'),
 (42, 'it—if'),
 (43, 'war—of'),
 (44, 'lava'),
 (45, 'climax'),
 (46, '...._'),
 (47, '0-88355-l26-8'),
 (48, '“have'),
 (49, 'Physiology'),
 (50, 'Phil—ﬁfteen')]

Congratulations! You've finished the tutorial. You now have the tools to run your own classification tasks with the HTRC library. Try using different models or adding more volumes to increase your accuracy scores for your model.