## Classification

This notebook demonstrates some of the classification tasks that can be accomplished using data retrieved from the HTRC API.

We will use the scikit-learn library to tackle these classification problems. Install it with pip (the package is named `scikit-learn`, although it is imported as `sklearn`):
```
   pip install scikit-learn
```

For these examples, we will gather data using the advanced search in the HathiTrust Digital Library. To create your training set, choose a search query for each label you want to predict. Run the search, then add the results to a collection named after that label.

Once you have finished adding volumes to your collection, go to "My Collections" under ____. There you will find a button called "Download Metadata".

Download the JSON file associated with your collection and place it in the local directory that you are working in.

In [1]:
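import json
import os

# Collection metadata files downloaded into the working directory.
# Match on the '.json' extension so unrelated files are not picked up.
jsonFiles = [name for name in os.listdir('.') if name.endswith('.json')]

txts = []
for jsonFile in jsonFiles:
    with open(jsonFile) as f:
        data = json.load(f)

    # Each entry in 'gathers' describes one volume in the collection.
    texts = data['gathers']
    ids = [text['htitem_id'] for text in texts]

    # Name the ID file after the collection title, which is the label.
    filename = data['title'] + '.txt'
    txts.append(filename)

    # Write one volume ID per line.
    with open(filename, 'w') as f:
        for textid in ids:
            f.write(textid + '\n')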

Once this step is complete, you can follow the instructions in the Setup notebook to load the data as needed.
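As a rough illustration of the input that the cell above expects, the sketch below builds a tiny stand-in for a downloaded collection file and converts it to an ID file inside a temporary directory. Only the keys `title`, `gathers`, and `htitem_id` come from the code above; the helper name `collection_to_id_file` is hypothetical, and a real export will contain many more fields.

```python
import json
import os
import tempfile

# Stand-in for a downloaded collection file. Only the keys read by the
# notebook ('title', 'gathers', 'htitem_id') are included; this is an
# assumption about the minimum shape, not the full export format.
sample_collection = {
    "title": "history",
    "gathers": [
        {"htitem_id": "uva.x001814089"},
        {"htitem_id": "uva.x001270953"},
    ],
}


def collection_to_id_file(json_path, out_dir):
    """Write one volume ID per line to '<title>.txt' and return its path."""
    with open(json_path) as f:
        data = json.load(f)
    ids = [item["htitem_id"] for item in data["gathers"]]
    out_path = os.path.join(out_dir, data["title"] + ".txt")
    with open(out_path, "w") as f:
        f.write("\n".join(ids) + "\n")
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "history.json")
    with open(json_path, "w") as f:
        json.dump(sample_collection, f)

    id_file = collection_to_id_file(json_path, tmp)
    with open(id_file) as f:
        written_ids = f.read().splitlines()

print(written_ids)
```

Working in a temporary directory keeps the experiment from overwriting any ID files already present in the notebook's working directory.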

## Task 1: Genre Classification

In this example, we'll classify texts into three genres: Poetry, History, and Science Fiction. JSON files containing the metadata for 500 texts in each genre have been included.
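To make the goal concrete before any data is downloaded, here is a minimal sketch of a three-genre classifier in scikit-learn. It assumes each volume has already been reduced to a dictionary of token counts; the tiny dictionaries below are invented for illustration and are not HathiTrust data. The choice of `DictVectorizer` with `MultinomialNB` is also an assumption, one reasonable baseline rather than the method this notebook prescribes.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented token counts standing in for per-volume features.
train_counts = [
    {"king": 4, "war": 3, "empire": 2},
    {"treaty": 2, "war": 5, "century": 3},
    {"love": 3, "heart": 4, "rose": 2},
    {"moon": 2, "heart": 3, "sigh": 2},
    {"robot": 4, "planet": 3, "ship": 2},
    {"alien": 3, "planet": 2, "laser": 2},
]
train_labels = ["history", "history", "poetry", "poetry", "scifi", "scifi"]

# DictVectorizer turns each dictionary into a row of a sparse count matrix;
# MultinomialNB is a common baseline for count-valued text features.
model = make_pipeline(DictVectorizer(), MultinomialNB())
model.fit(train_counts, train_labels)

# Two unseen "volumes", again invented.
new_counts = [
    {"war": 2, "empire": 1},
    {"planet": 1, "robot": 2},
]
predicted = model.predict(new_counts)
print(list(predicted))
```

With real data, the training dictionaries would come from the downloaded feature files, and a held-out test set would be needed to estimate accuracy.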

In [1]:
import os

# Download the feature files for each genre. htid2rsync converts the volume
# IDs in each .txt file into rsync paths, and the `!` syntax captures the
# rsync output as a list of lines.
history_output = !htid2rsync --f history.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ history/
poetry_output = !htid2rsync --f poetry.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ poetry/
scifi_output = !htid2rsync --f scifi.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ scifi/
outputs = [history_output, poetry_output, scifi_output]
subjects = ['history', 'poetry', 'scifi']
print(scifi_output)

# Keep only the output lines that name a feature file, and prefix each one
# with the folder it was downloaded into.
paths = {}
suffix = '.json.bz2'
for subject, output in zip(subjects, outputs):
    folder = subject
    filePaths = [path for path in output if path.endswith(suffix)]
    paths[subject] = [os.path.join(folder, path) for path in filePaths]

print(paths['history'])

# TODO: write paths into a .txt file

print("Path extraction complete")

['[sandbox] Welcome to the HathiTrust Research Center rsync server.', '', 'receiving file list ... ', 'rsync: link_stat "uva/pairtree_root/x0/01/81/40/89/x001814089/uva.x001814089.json.bz2" (in features) failed: No such file or directory (2)', 'rsync: link_stat "uva/pairtree_root/x0/01/27/09/53/x001270953/uva.x001270953.json.bz2" (in features) failed: No such file or directory (2)', 'rsync: link_stat "uva/pairtree_root/x0/00/52/61/76/x000526176/uva.x000526176.json.bz2" (in features) failed: No such file or directory (2)', 'rsync: link_stat "uva/pairtree_root/x0/01/28/37/50/x001283750/uva.x001283750.json.bz2" (in features) failed: No such file or directory (2)', 'rsync: link_stat "uva/pairtree_root/x0/01/28/37/49/x001283749/uva.x001283749.json.bz2" (in features) failed: No such file or directory (2)', 'rsync: link_stat "uva/pairtree_root/x0/30/22/15/65/x030221565/uva.x030221565.json.bz2" (in features) failed: No such file or directory (2)', 'rsync: link_stat "uva/pairtree_root/x0/01/28/
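The output above shows that rsync could not find some of the requested volumes, so each genre may end up with fewer files than requested. The sketch below separates the captured output lines into downloaded paths and missing paths. The first four sample lines are copied from the output above; the last one is a hypothetical success line added for contrast, and the helper name `split_rsync_output` is likewise an assumption.

```python
import re

sample_output = [
    '[sandbox] Welcome to the HathiTrust Research Center rsync server.',
    '',
    'receiving file list ... ',
    'rsync: link_stat "uva/pairtree_root/x0/01/81/40/89/x001814089/uva.x001814089.json.bz2" (in features) failed: No such file or directory (2)',
    # Hypothetical line representing a file that transferred successfully.
    'example/pairtree_root/ab/cd/abcd/example.abcd.json.bz2',
]

# Matches the "file not found" error lines and captures the quoted path.
MISSING = re.compile(
    r'^rsync: link_stat "(?P<path>[^"]+)" .*failed: No such file or directory'
)


def split_rsync_output(lines, suffix='.json.bz2'):
    """Return (downloaded, missing) lists of paths from rsync output lines."""
    downloaded, missing = [], []
    for line in lines:
        match = MISSING.match(line)
        if match:
            missing.append(match.group('path'))
        elif line.endswith(suffix):
            downloaded.append(line)
    return downloaded, missing


downloaded, missing = split_rsync_output(sample_output)
print(len(downloaded), "downloaded;", len(missing), "missing")
```

Comparing the two counts per genre is a quick way to check whether the classes are still roughly balanced before training.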

In [6]:
from htrc_features import FeatureReader

hist = FeatureReader(paths['history'])
vol = list(hist.volumes())[0]
vol.

[('schemaVersion', 'schema_version'),
 ('dateCreated', 'date_created'),
 ('title', 'title'),
 ('pubDate', 'pub_date'),
 ('language', 'language'),
 ('htBibUrl', 'ht_bib_url'),
 ('handleUrl', 'handle_url'),
 ('oclc', 'oclc'),
 ('imprint', 'imprint'),
 ('names', 'names'),
 ('classification', 'classification'),
 ('typeOfResource', 'type_of_resource'),
 ('issuance', 'issuance'),
 ('genre', 'genre'),
 ('bibliographicFormat', 'bibliographic_format'),
 ('pubPlace', 'pub_place'),
 ('governmentDocument', 'government_document'),
 ('sourceInstitution', 'source_institution'),
 ('enumerationChronology', 'enumeration_chronology'),
 ('hathitrustRecordNumber', 'hathitrust_record_number'),
 ('rightsAttributes', 'rights_attributes'),
 ('accessProfile', 'access_profile'),
 ('volumeIdentifier', 'volume_identifier'),
 ('sourceInstitutionRecordNumber', 'source_institution_record_number'),
 ('isbn', 'isbn'),
 ('issn', 'issn'),
 ('lccn', 'lccn'),
 ('lastUpdateDate', 'last_update_date')]
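Each pair in the listing above maps a camelCase key to a Python-style name. As a small sketch of how such pairs can be used, the code below renames the keys of a metadata dictionary. The four pairs are copied from the listing; the helper `to_python_names` and the sample record are hypothetical and are not part of the `htrc_features` API.

```python
# A few (camelCase key, Python-style name) pairs copied from the listing.
FIELD_PAIRS = [
    ('schemaVersion', 'schema_version'),
    ('pubDate', 'pub_date'),
    ('htBibUrl', 'ht_bib_url'),
    ('volumeIdentifier', 'volume_identifier'),
]


def to_python_names(raw_metadata, field_pairs=FIELD_PAIRS):
    """Return a copy of raw_metadata keyed by the Python-style names.

    Keys that are absent from raw_metadata are skipped.
    """
    return {
        py_name: raw_metadata[json_key]
        for json_key, py_name in field_pairs
        if json_key in raw_metadata
    }


# Hypothetical record; the values are invented for illustration.
raw = {
    'schemaVersion': '1.0',
    'pubDate': '1900',
    'volumeIdentifier': 'uva.x001814089',
}
renamed = to_python_names(raw)
print(renamed)
```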

0
