# Classifier training using Tensorflow Estimators

Tensorflow's Estimator class takes a lot of the work out of building transferable machine learning models; you can, for example, push a model into a JavaScript version that will live on the web, or swap between the few model types for which pre-built estimators exist.

In [2]:
import tensorflow as tf
import SRP
import pandas as pd
import numpy as np

The goal here is to learn how to classify books by their two-letter Library of Congress classification; for example, BF is psychology. First we read in a CSV with ids and classifications for about 2,000,000 books.

Note that you could replace this CSV with any other two-column dataset of htid and categorical labels.

In [3]:
metadata = pd.read_csv("/home/bschmidt/Dropbox/hathi_metadata/data_to_classify_on/lc_ic.csv.gz", names=["htid", "lc"])

Now I build a list of the possible category values and sort it alphabetically. Note that I've already pruned this down to a manageable number (221) by removing erroneous labels.

In [4]:
all_cats = list(set(list(metadata.lc)))
all_cats.sort()
print("There are {} categories".format(len(all_cats)))

There are 221 categories


A lookup dictionary stores the classification for each individual volume.

In [5]:
lookup = dict(zip(metadata.htid, metadata.lc))
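
To make the two steps above concrete, here is a minimal sketch on a made-up four-row stand-in for the CSV. The ids and labels are invented for illustration, and it uses plain Python rather than pandas, but it builds the same two structures: a sorted list of distinct labels and an id-to-label dictionary.

```python
# Toy stand-in for the (htid, lc) table; these rows are hypothetical.
rows = [
    ("vol.001", "BF"),
    ("vol.002", "QL"),
    ("vol.003", "BF"),
    ("vol.004", "PT"),
]

# Distinct labels, sorted so that each label has a stable position.
toy_cats = sorted({lc for _, lc in rows})

# Map each volume id to its label.
toy_lookup = dict(rows)

print("There are {} categories".format(len(toy_cats)))
print(toy_lookup["vol.003"])
```

Sorting matters because the position of each label in the list is what later ties a predicted probability back to a category name.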

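I'm using a half-precision (2 bytes per number), 640-dimensional feature set here. If you want to use 1280-dimensional, single-precision features instead, change these two values and point the vector file below at the matching file.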

In [6]:
bytes = 2
dims = 640

Now I need an iterator function to send data to Tensorflow. Tensorflow estimators love functions, so this is a function that returns a function. That's a little weird for Python, but perfectly normal for R or JavaScript or plenty of other languages.

Note that you'll need to change the file location in the body of this function.

In [9]:
def base_function(what = 'train', modulo = 10):
    # Returns a zero-argument generator function for the requested split.
    def fun():
        # Your directory will be different than this.
        full_hathi = SRP.Vector_file("/home/bschmidt/vector_models/ht-640d-half-precision.bin", precision = bytes)
        for i, (id, row) in enumerate(full_hathi):
            # Split by position in the file: 1 row in 10 goes to test,
            # 1 in 10 to validate, and the remaining 8 in 10 to train.
            if i % modulo == 0 and what != "test":
                continue
            if i % modulo == 1 and what != "validate":
                continue
            if i % modulo >= 2 and what != "train":
                continue
            # Skip volumes that have no label in the metadata.
            if id in lookup:
                cat = lookup[id]
                # Normalize vectors to unit length.
                row = row/np.linalg.norm(row.astype('<f4'))
                yield (row, cat)
    return fun




Now we can create the train, test, and validate functions. Each one takes no arguments and, when called, returns a generator that yields one (vector, label) pair at a time.

In [10]:
train = base_function('train')
test = base_function('test')
validate = base_function('validate')
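
The split logic is easier to see on toy data. The sketch below is a simplified stand-in for `base_function`: the name `make_split` and the list of integers are hypothetical, and it leaves out the vector file, the label lookup, and the normalization. It keeps only the function-returning-a-function pattern and the modulo-based assignment of rows to splits.

```python
def make_split(records, what="train", modulo=10):
    """Return a zero-argument generator function for one split."""
    def gen():
        for i, record in enumerate(records):
            bucket = i % modulo
            # Bucket 0 is test, bucket 1 is validate, the rest is train.
            if bucket == 0:
                split = "test"
            elif bucket == 1:
                split = "validate"
            else:
                split = "train"
            if split == what:
                yield record
    return gen

toy_records = list(range(100))
toy_train = make_split(toy_records, "train")
toy_test = make_split(toy_records, "test")
toy_validate = make_split(toy_records, "validate")

# Each call to the returned function starts a fresh pass over the data.
counts = {
    "train": len(list(toy_train())),
    "test": len(list(toy_test())),
    "validate": len(list(toy_validate())),
}
print(counts)
```

Because the assignment depends only on a row's position, the three splits never overlap as long as the underlying file is read in the same order each time.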

Now on to building a Tensorflow estimator. First we need to define a numeric feature column, which tells the estimator class about the input data: a 640-dimensional numeric vector (the value of `dims` set above).

In [12]:
dtype = tf.float32

numeric_feature_column = \
    tf.feature_column.numeric_column(key="embedding",
                                     shape = [dims], 
                                     dtype= dtype)


Create a DNN classifier. I use a single hidden layer of 1500 units here, which is a little less than ideal.
This also takes the category list we built earlier, and it uses a directory to store information for later training.

In [13]:
classifier = tf.estimator.DNNClassifier(feature_columns=[numeric_feature_column],
                                       hidden_units = [1500],
                                        n_classes = len(all_cats),
                                        label_vocabulary = all_cats,
                                        dropout = 0.5,
                                        model_dir = '/tmp/model2'
                                       )

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f19687245f8>, '_save_checkpoints_steps': None, '_num_worker_replicas': 1, '_evaluation_master': '', '_is_chief': True, '_service': None, '_tf_random_seed': None, '_task_id': 0, '_model_dir': '/tmp/model2', '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_task_type': 'worker', '_keep_checkpoint_max': 5, '_master': '', '_num_ps_replicas': 0, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_global_id_in_cluster': 0, '_session_config': None}


We want not one element at a time, but a tensorflow `Dataset` batched into (for example) 250 books at a time.

This is the ugliest part of the code. `dictize` converts from the tuple format our generator returns to a keyed dictionary for the x values. Technical note: `yielder` and `test_yielder` are separate functions because the dataset needs to actually be created inside a function that takes no arguments, for the purposes of variable scoping in the estimator.

In [16]:
BATCH_SIZE = 250

def dictize(x, y):
    return ({numeric_feature_column: x}, y)

def yielder():
    dataset = tf.data.Dataset.from_generator(
        train, (dtype, tf.string),
        (tf.TensorShape([dims]), tf.TensorShape([])))

    batches = dataset.map(dictize).repeat().batch(BATCH_SIZE)

    return batches

def test_yielder():
    dataset = tf.data.Dataset.from_generator(
        test, (dtype, tf.string),
        (tf.TensorShape([dims]), tf.TensorShape([])))

    batches = dataset.map(dictize).repeat().batch(BATCH_SIZE)

    return batches
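
For readers who have not used `tf.data`, the map, repeat, and batch chain can be imitated in plain Python. This is only an analogy under stated assumptions: the helper names (`toy_examples`, `to_features`, `repeat_forever`, `batched`) are hypothetical, the dictionary is keyed by the string "embedding" for readability where the code above uses the column object itself, and real Tensorflow batches are stacked tensors rather than lists.

```python
import itertools

def toy_examples():
    # Stand-in for the train() generator: yields (vector, label) pairs.
    for i in range(5):
        yield ([float(i)], "label{}".format(i))

def to_features(x, y):
    # Same idea as dictize: wrap x in a dictionary of named features.
    return ({"embedding": x}, y)

def repeat_forever(make_generator):
    # Restart the generator whenever it runs out, like .repeat().
    while True:
        for item in make_generator():
            yield item

def batched(iterable, batch_size):
    # Group consecutive items into lists of batch_size, like .batch().
    iterator = iter(iterable)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch

mapped = (to_features(x, y) for x, y in repeat_forever(toy_examples))
first_three = list(itertools.islice(batched(mapped, 2), 3))
for batch in first_three:
    print([label for _, label in batch])
```

The third batch wraps around to the start of the data, which is why a repeated dataset never runs dry and the number of training steps has to be capped explicitly.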



Now we're ready to train, using the `yielder` input function built on the training data.

On a normal laptop, this is a long process--I'd think about leaving it running overnight. But if you use tensorboard you can inspect some of the progress by typing `tensorboard --logdir '/tmp/model2'` into the command line and visiting `localhost:6006` in your browser.

Ideally we'd be evaluating performance on the validation data here while it runs, but I'm taking it easy.

I'll do 10,000 steps: with a batch size of 250 books at a time, that means 2.5 million books. That's about a single pass through the training data here.
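
The arithmetic can be checked in a couple of lines. The exact size of the training split is not stated here, so the two training-set sizes below are hypothetical: 2,500,000 would make 10,000 steps exactly one pass, while 1,600,000 is what eight tenths of 2,000,000 labelled books would come to if every one of them had a vector in the file.

```python
def passes_over_data(steps, batch_size, n_training_examples):
    """How many times training cycles through the training set."""
    return steps * batch_size / n_training_examples

# Books seen during training: steps times books per step.
examples_seen = 10000 * 250
print(examples_seen)

# Hypothetical training-set sizes, for illustration only.
for n_train in (1600000, 2500000):
    print(n_train, passes_over_data(10000, 250, n_train))
```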

In [17]:
classifier.train(input_fn = yielder, steps = 10000)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/model2/model.ckpt.
INFO:tensorflow:loss = 1350.954, step = 1
INFO:tensorflow:global_step/sec: 15.1715
INFO:tensorflow:loss = 817.1722, step = 101 (6.591 sec)
INFO:tensorflow:global_step/sec: 14.75
INFO:tensorflow:loss = 858.6796, step = 201 (6.780 sec)
INFO:tensorflow:global_step/sec: 13.5541
INFO:tensorflow:loss = 524.1978, step = 301 (7.378 sec)
INFO:tensorflow:global_step/sec: 15.6822
INFO:tensorflow:loss = 880.43884, step = 401 (6.377 sec)
INFO:tensorflow:global_step/sec: 14.9495
INFO:tensorflow:loss = 644.6073, step = 501 (6.689 sec)
INFO:tensorflow:global_step/sec: 14.1539
INFO:tensorflow:loss = 517.4074, step = 601 (7.066 sec)
INFO:tensorflow:global_step/sec: 13.1359
INFO:tensorflow:loss = 457.95

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x7f19686c0390>

## Evaluate the classifier performance on the test data.

In the paper, I got about 68% accuracy at this task. Here it's a little lower (about 63%) for a few reasons:

1. The dimensionality is half as big, and I'm only using 1500 neurons in the hidden layer.
2. The precision is only 2 bytes per number, not 4. I don't think this makes much of a difference; but it might make some.
3. The initializations aren't quite as well thought out--in particular, I haven't taken any steps here to avoid 'dead' neurons in the network by initializing to positive values, etc.
4. We're not choosing a stopping point based on the validation set. Based on my previous experience, I think a single pass is probably a little shorter than ideal: we might get better results with 20,000 or 30,000 steps.

In [18]:
classifier.evaluate(test_yielder, steps = 100)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-09-05-15:22:12
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/model2/model.ckpt-11150
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [10/100]
INFO:tensorflow:Evaluation [20/100]
INFO:tensorflow:Evaluation [30/100]
INFO:tensorflow:Evaluation [40/100]
INFO:tensorflow:Evaluation [50/100]
INFO:tensorflow:Evaluation [60/100]
INFO:tensorflow:Evaluation [70/100]
INFO:tensorflow:Evaluation [80/100]
INFO:tensorflow:Evaluation [90/100]
INFO:tensorflow:Evaluation [100/100]
INFO:tensorflow:Finished evaluation at 2018-09-05-15:22:18
INFO:tensorflow:Saving dict for global step 11150: accuracy = 0.627, average_loss = 1.4050226, global_step = 11150, loss = 351.25568


{'accuracy': 0.627,
 'average_loss': 1.4050226,
 'global_step': 11150,
 'loss': 351.25568}
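
A quick sanity check helps in reading these numbers. The sketch below assumes, as the figures themselves suggest, that `loss` is the cross-entropy summed over a batch of 250 and `average_loss` is the per-book mean. It also compares against the loss of a model that guesses uniformly among the 221 categories, which is the natural log of 221.

```python
import math

batch_size = 250
n_classes = 221

average_loss = 1.4050226    # per-book value from the evaluation output
reported_loss = 351.25568   # 'loss' from the same evaluation output
first_step_loss = 1350.954  # training loss logged at step 1

# If 'loss' is a per-batch sum, it should be about average_loss * 250.
print(average_loss * batch_size)

# Per-book loss of a uniform guess over all categories.
chance_loss = math.log(n_classes)
print(chance_loss)

# Per-book loss at the very first training step.
print(first_step_loss / batch_size)
```

Under that assumption the untrained network starts out almost exactly at the uniform-guess level, and the evaluated model's per-book loss is well below it.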

## Evaluate impressionistically on truly out-of-domain data.

Now that we have a model, we'll throw it against some demonstration texts. Here's a function to pull any article from Wikipedia.

In [41]:
import requests
from lxml import html

def get_wikitext(article_title, language="en"):
    # Ask the MediaWiki API for the rendered HTML of the article.
    response = requests.get(
        'https://{}.wikipedia.org/w/api.php'.format(language),
        params={
            'action': 'parse',
            'page': article_title,
            'format': 'json',
        }
    ).json()
    raw_html = response['parse']['text']['*']
    document = html.document_fromstring(raw_html)
    # Keep only the text of the paragraph elements, one per line.
    body = "\n".join([p.text_content() for p in document.xpath('//p')])
    return body

print(get_wikitext("Elephant")[:300] + "...")





Elephants are large mammals of the family Elephantidae and the order Proboscidea. Three species are currently recognised: the African bush elephant (Loxodonta africana), the African forest elephant (L. cyclotis), and the Asian elephant (Elephas maximus). Elephants are scattered throughout sub-Sa...


Finally, I wrap some prediction code into a function to explore any Wikipedia article. You can pull this apart to build a more complicated classifier.

In [44]:
import SRP

def predict(article, language='en'):
    text = get_wikitext(article, language=language)
    print(text[:300].strip() + "...")
    representation = SRP.SRP(640).stable_transform(text)
    representation = representation/np.linalg.norm(representation)

    predict_input_fn = tf.estimator.inputs.numpy_input_fn({numeric_feature_column: np.array([representation])},
                                                  y=None,
                                                  batch_size=1,
                                                  num_epochs=1,
                                                  shuffle=False)    


    p = classifier.predict(predict_input_fn)
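    # predict() returns a generator; we only asked for one prediction.
    predictions = next(p)
    # Pair each probability with its category, highest probability first.
    preds = sorted(zip(predictions['probabilities'], all_cats), reverse=True)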

    for i in range(10):
        print(" {:.02%} {}".format(*preds[i]))
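
The ranking step at the end of `predict` can be tried on its own without a trained model. This is a minimal sketch with invented probabilities; `top_k` is a hypothetical helper, and it relies on the same assumption the function above makes, namely that the probabilities come back in the same order as the label list passed as `label_vocabulary`.

```python
def top_k(probabilities, labels, k=10):
    """Pair each probability with its label and keep the k largest."""
    ranked = sorted(zip(probabilities, labels), reverse=True)
    return ranked[:k]

# Invented values for illustration; a real run has one entry per category.
toy_probs = [0.05, 0.70, 0.25]
toy_labels = ["BF", "QL", "SF"]

for prob, label in top_k(toy_probs, toy_labels, k=2):
    print(" {:.02%} {}".format(prob, label))
```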

Elephant is correctly classed as QL.

In [45]:
predict("elephant")

Elephants are large mammals of the family Elephantidae and the order Proboscidea. Three species are currently recognised: the African bush elephant (Loxodonta africana), the African forest elephant (L. cyclotis), and the Asian elephant (Elephas maximus). Elephants are scattered throughout sub-Sa...
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/model2/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
 84.42% QL
 5.44% GN
 3.19% QH
 1.36% SF
 0.83% G
 0.83% GR
 0.40% RC
 0.36% Q
 0.25% GV
 0.24% BL


So is 'rhinoceros', but there's a little more uncertainty here.

In [46]:
predict("rhinoceros")

Ceratotherium
Dicerorhinus
Diceros
Rhinoceros
Extinct genera, see text

A rhinoceros (/raɪˈnɒsərəs/, from Greek  rhinokeros, meaning 'nose-horned', from  rhinos, meaning 'nose', and  kerato/keras, meaning 'horn'), commonly abbreviated to 'rhino', is one of any five extant species of odd-toed ungu...
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/model2/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
 35.55% QL
 14.03% RC
 12.11% QH
 6.48% SF
 4.56% RA
 2.67% TX
 2.62% QP
 2.16% RM
 1.77% Q
 1.24% SB


The German-language article on Immanuel Kant is reasonably classed as 'B' (General Philosophy) rather than PT (German literature),
showing that we've learned German-specific rules as well as English ones. It's possible that BD or BH would be more correct, though. 

In [47]:
predict("Immanuel Kant", language='de')

Immanuel Kant (* 22. April 1724 in Königsberg, Preußen; † 12. Februar 1804 ebenda) war ein deutscher Philosoph der Aufklärung. Kant zählt zu den bedeutendsten Vertretern der abendländischen Philosophie. Sein Werk Kritik der reinen Vernunft kennzeichnet einen Wendepunkt in der Philosophiegeschichte u...
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/model2/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
 69.11% B
 5.59% BD
 3.21% BH
 2.45% PT
 1.45% BF
 1.29% BL
 1.21% HM
 1.13% LB
 0.94% Q
 0.91% N


French gets Kant right too.

In [48]:
predict("Emmanuel Kant", language='fr')

modifier - modifier le code - modifier Wikidata
Emmanuel Kant (Immanuel en allemand, prononcé dans cette langue [ɪˈmaːnu̯eːl kant]), né le 22 avril 1724 à Königsberg, capitale de la Prusse-Orientale, et mort dans cette même ville le 12 février 1804, est un philosophe allemand, fondateur du criticism...
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/model2/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
 35.84% B
 9.62% PN
 4.88% BJ
 4.74% Z
 3.59% N
 3.55% BD
 2.32% CB
 1.99% PQ
 1.70% AC
 1.69% LA


But William James, who is classified correctly in English and French, gets dumped into German history (DD) as the model's priors about German-language text overwhelm any psychology- or philosophy-specific content here.

In [52]:
predict("William James", language='de')

William James (* 11. Januar 1842 in New York; † 26. August 1910 in Chocorua, New Hampshire) war ein US-amerikanischer Psychologe und Philosoph. Von 1876 bis 1907 war er Professor für Psychologie und Philosophie an der Harvard University. James gilt sowohl als Begründer der Psychologie in den USA[1]...
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/model2/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
 13.55% DD
 10.68% B
 10.58% PT
 6.69% Z
 6.40% PN
 2.78% PG
 2.70% CT
 2.40% HX
 1.86% D
 1.83% BP


In some cases, the model is *extremely* confident: 98% certainty that Johannes Brahms is an article about music.

I've found, experimentally, that these probabilities tend to be useful; but also to overstate the actual accuracy. Probably it should say something more like 93%.

In [56]:
predict("Johannes Brahms", language='en')

Johannes Brahms (German: [joˈhanəs ˈbʁaːms]; 7 May 1833 – 3 April 1897) was a German composer and pianist of the Romantic period. Born in Hamburg into a Lutheran family, Brahms spent much of his professional life in Vienna, Austria. His reputation and status as a composer are such that he is some...
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/model2/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
 98.00% ML
 1.90% MT
 0.03% PN
 0.01% PR
 0.01% GV
 0.01% PT
 0.01% ND
 0.01% M
 0.00% CT
 0.00% NC


"Intersectionality" is not philosophy, but Subclass HQ. The family. Marriage. Women.

In [59]:
predict("Intersectionality", language="en")

Intersectionality is an analytic framework that attempts to identify how interlocking systems of power impact those who are most marginalized in society.[1] Intersectionality considers that various forms of social stratification, such as class, race, sexual orientation, age, disability and gender,...
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/model2/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
 29.35% HQ
 18.72% HV
 9.86% HD
 9.16% KF
 6.72% HM
 4.47% K
 2.12% JC
 1.99% HN
 1.77% RA
 1.15% HF


The Boston Red Sox are sports and leisure. Gee, I'm having trouble finding an example that doesn't work!

In [60]:
predict("Boston Red Sox", language="en")

The Boston Red Sox are an American professional baseball team based in Boston, Massachusetts. The Red Sox compete in Major League Baseball (MLB) as a member club of the American League (AL) East division. The Red Sox have won eight World Series championships and have played in twelve. In addit...
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/model2/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
 84.34% GV
 10.00% PN
 1.43% F
 1.30% E
 0.44% PE
 0.43% SF
 0.26% CT
 0.18% QC
 0.13% TL
 0.11% CS


Here we go. A lot of Wikipedia is famously Pokémon, but that's not something you can find a lot of in libraries. So it punts--maybe it's 'general literature' (PN), maybe it's bibliography (Z--a common catchall category). GV would probably be the best bet, but only shows up in third place.

In [64]:
predict("Pokémon", language="en")

Pokémon (Japanese: ポケモン, Hepburn: Pokemon, Japanese: [pokemoɴ]; English: /ˈpoʊkɪˌmɒn, -ki-, -keɪ-/),[1][2][3] also known as Pocket Monsters (ポケットモンスター) in Japan, is a Japanese media franchise managed by The Pokémon Company, a Japanese consortium between Nintendo, Game Freak, and Creatures.[4] The...
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/model2/model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
 26.16% PN
 20.31% Z
 10.83% GV
 8.43% NC
 7.01% TR
 2.74% BF
 2.31% QL
 2.23% N
 1.95% Q
 1.69% DS
