# Reading in the Training and Evaluation Data

We will use the msha.csv data set to build and evaluate our autocoder. We begin by dividing it into three pieces (training, validation, and test), each of which will be used for a separate purpose.

In [1]:
from __future__ import print_function
from pprint import pprint
import numpy as np
import pandas as pd

# read the csv file into a data frame, a special Python object for representing tabular data
df = pd.read_csv(r'msha.csv', parse_dates=['ACCIDENT_DT'])

# add a column to our data frame with the year the incident occurred
df['YEAR'] = df['ACCIDENT_DT'].dt.year

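# split the data frame into separate test, validation, and training dataframes based on the year of the incident
# (.copy() gives each piece its own data, so columns can be added later without a SettingWithCopyWarning)
df_training = df[df['YEAR'] <= 2011].copy()
df_validation = df[df['YEAR'] == 2012].copy()
df_test = df[df['YEAR'] == 2013].copy()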

print('''Data set sizes:
    test: %s
    validation: %s
    training: %s''' % (len(df_test), len(df_validation), len(df_training)))

Data set sizes:
    test: 8643
    validation: 9032
    training: 18681
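
The year-based split can be tried on a small scale before running it on the full file. The sketch below is only an illustration: the four rows in `toy` are made up, and only the column names are taken from the notebook.

```python
import pandas as pd

# Toy stand-in for msha.csv: the column names match the notebook, the rows are invented.
toy = pd.DataFrame({
    'ACCIDENT_DT': pd.to_datetime(['2010-03-01', '2011-07-15', '2012-01-20', '2013-11-02']),
    'NARRATIVE': ['slipped on ice', 'cut finger', 'strained back', 'twisted ankle'],
})

# Extract the year of each incident for the whole column at once.
toy['YEAR'] = toy['ACCIDENT_DT'].dt.year

# Older incidents train the model, later incidents are held out for evaluation.
toy_training = toy[toy['YEAR'] <= 2011]
toy_validation = toy[toy['YEAR'] == 2012]
toy_test = toy[toy['YEAR'] == 2013]

# The three pieces should not overlap and together should cover every row.
sizes = (len(toy_training), len(toy_validation), len(toy_test))
print(sizes)
```

Splitting by year rather than at random means the model is always evaluated on incidents that occurred after the ones it was trained on.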


# Converting Inputs to Vectors

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

# Create an instance of the CountVectorizer object
vectorizer = CountVectorizer()

# Use the narratives in our training data to create the vocabulary that will
# be represented by our feature vectors. This is remembered by the vectorizer.
vectorizer.fit(df_training['NARRATIVE'])

print('Our vectorizer has defined an input vector with %s elements' % len(vectorizer.vocabulary_))
pprint(vectorizer.vocabulary_)

Our vectorizer has defined an input vector with 11915 elements
{u'00': 0,
 u'000': 1,
 u'00023': 2,
 u'001': 3,
 u'002': 4,
 u'003': 5,
 u'004': 6,
 u'005': 7,
 u'009': 8,
 u'00a': 9,
 u'00am': 10,
 u'00p': 11,
 u'00pm': 12,
 u'01': 13,
 u'011': 14,
 u'013': 15,
 u'02': 16,
 u'023': 17,
 u'02429': 18,
 u'025lb': 19,
 u'03': 20,
 u'032': 21,
 u'0333': 22,
 u'04': 23,
 u'05': 24,
 u'0503': 25,
 u'0543': 26,
 u'055': 27,
 u'06': 28,
 u'07': 29,
 u'08': 30,
 u'080': 31,
 u'0800': 32,
 u'089': 33,
 u'08ic01': 34,
 u'09': 35,
 u'0906374': 36,
 u'0perator': 37,
 u'10': 38,
 u'100': 39,
 u'1000': 40,
 u'1003': 41,
 u'100ft': 42,
 u'100lb': 43,
 u'100lbs': 44,
 u'101': 45,
 u'1010': 46,
 u'1014': 47,
 u'1018': 48,
 u'1019': 49,
 u'102': 50,
 u'103': 51,
 u'1040': 52,
 u'10483': 53,
 u'105': 54,
 u'106': 55,
 u'1060': 56,
 u'1064': 57,
 u'107': 58,
 u'1074': 59,
 u'108': 60,
 u'10a': 61,
 u'10aa': 62,
 u'10am': 63,
 u'10b': 64,
 u'10db': 65,
 u'10e': 66,
 u'10ft': 67,
 u'10in': 68,
 u'10lb': 69,

Now that we've defined this mapping of inputs to vector positions, we can convert our training data to its vector representation.

In [3]:
# Convert the training narratives into their matrix representation.
x_training = vectorizer.transform(df_training['NARRATIVE'])

print(x_training.shape)

(18681, 11915)
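
To see what `fit` and `transform` do, it can help to run them on a corpus small enough to check by hand. This is a minimal sketch with three invented narratives, not the MSHA data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three invented narratives, small enough to inspect by hand.
docs = ['employee cut finger', 'employee hurt back', 'finger finger cut']

toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(docs)

# Each distinct word gets one position; positions follow alphabetical order:
# back=0, cut=1, employee=2, finger=3, hurt=4
vocab = toy_vectorizer.vocabulary_
print(sorted(vocab.items(), key=lambda item: item[1]))

# transform counts how often each known word occurs in a new narrative.
# 'wrench' was never seen during fit, so it is silently ignored.
row = toy_vectorizer.transform(['finger cut finger wrench']).toarray()[0]
print(row)
```

The vocabulary is fixed at `fit` time, which is why the notebook fits the vectorizer on the training narratives only and then reuses it for validation and test data.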


We can also examine the matrix directly, but it's not very interesting: for any given example, most of the elements of our feature matrix are 0.

In [5]:
n_features = x_training.shape[1]
dense_vector = x_training[0].todense()
print('The vector representing our first training narrative looks like this:', dense_vector)
print('Only %s of the %s elements in this vector are nonzero' % (np.count_nonzero(dense_vector), n_features))

The vector representing our first training narrative looks like this: [[0 0 0 ..., 0 0 0]]
Only 29 of the 11915 elements in this vector are nonzero
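
Because the matrix is stored in a sparse format, the nonzero counts can also be read without converting anything to a dense array. The sketch below uses three invented narratives; `nnz` is the attribute of SciPy sparse matrices that holds the number of stored entries.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three invented narratives, small enough to verify by hand.
docs = ['employee cut finger', 'employee hurt back', 'finger finger cut']
x = CountVectorizer().fit_transform(docs)

# Number of stored (nonzero) entries in the whole matrix and in the first row.
total_nonzero = x.nnz
first_row_nonzero = x[0].nnz

# Fraction of all cells that are nonzero.
n_rows, n_cols = x.shape
density = total_nonzero / float(n_rows * n_cols)

print('%s of %s cells are nonzero (density %.3f)' % (total_nonzero, n_rows * n_cols, density))
```

On a matrix the size of the real training data, avoiding `todense()` for the full matrix matters, since a dense copy would have to store every zero explicitly.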


In [9]:
vector = vectorizer.transform(['zipties zone'])
print(vector.todense())

[[0 0 0 ..., 0 1 1]]


# Selecting and Fitting the Model

Now that we have converted our training data to a feature matrix, we can select our model and fit it to the data.

In [10]:
from sklearn.linear_model import LogisticRegression

# y_training contains the codes associated with our training narratives
y_training = df_training['INJ_BODY_PART_CD']

# we create an instance of the LogisticRegression model with C=10
# (C is the inverse of the regularization strength, so larger values mean weaker regularization)
clf = LogisticRegression(C=10)

# we fit the model to our training data (i.e. we calculate the weights)
clf.fit(x_training, y_training)

LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Like our other outputs, we can inspect the Logistic Regression model. The most important attribute of the model is the coefficients, which show how strongly our various features are related to our various codes. Like our feature values, these coefficients are also stored in a matrix. 

In [11]:
print(clf.coef_.shape)
print(clf.coef_[0])

(46L, 11915L)
[ -5.74154250e-01  -6.41327514e-05  -1.76895150e-06 ...,  -3.49983043e-01
  -3.75010912e-04  -2.87351092e-03]
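
The layout of the coefficient matrix is easier to see on a tiny problem: one row per code and one column per vocabulary word. The narratives and codes below are invented for illustration, and `max_iter=1000` is only there to keep the solver from stopping early.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Six invented narratives covering three codes.
docs = ['cut finger on blade', 'smashed finger in door',
        'strained back lifting', 'back pain after lifting',
        'twisted ankle on step', 'rolled ankle walking']
codes = [340, 340, 420, 420, 520, 520]

vec = CountVectorizer()
x = vec.fit_transform(docs)
model = LogisticRegression(C=10, max_iter=1000).fit(x, codes)

# Row i of coef_ holds the weights for the code stored in classes_[i].
print(model.classes_)
print(model.coef_.shape)
```

Note that with only two codes scikit-learn stores a single row of coefficients, so the one-row-per-code layout applies to problems with three or more codes, like ours.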


If we want, we can look at which code each of these rows corresponds to, and at the features that have the highest weights (those that make that class much more likely).

In [12]:
# print the codes recognized by the classifier
print(clf.classes_)

[100 110 120 121 122 130 140 141 142 143 144 150 160 170 200 300 310 311
 312 313 314 320 330 340 350 400 410 420 430 440 450 460 500 510 511 512
 513 514 520 530 540 550 600 700 800 900]


We can examine these weights more closely and see which inputs they correspond to.

In [13]:
# select the index of the code whose weights we want to examine 
# index 0 corresponds to code 100 (head not elsewhere classified)
# index 3 corresponds to code 121 (ear, external)
code_index = 0
code = clf.classes_[code_index]

# Retrieve the weights for the specified code_index
code_weights = clf.coef_[code_index]

# find the feature_index with largest weight for this code
feature_index = code_weights.argmax()

# create a dictionary that maps from index, to word
feature_mapper = {v: k for k, v in vectorizer.vocabulary_.items()}

# map that index to the word it represents
word = feature_mapper[feature_index]
print('The word with the highest weight for code %s is "%s"' % (code, word))
print('It has a weight of:', clf.coef_[code_index, feature_index])

The word with the highest weight for code 100 is "forehead"
It has a weight of: 7.08494702566
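
The same lookup extends naturally from the single largest weight to the top few words for a code. The sketch below is one possible way to do it on invented data; `top_words` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Six invented narratives covering three codes.
docs = ['cut finger on blade', 'smashed finger in door',
        'strained back lifting', 'back pain after lifting',
        'twisted ankle on step', 'rolled ankle walking']
codes = [340, 340, 420, 420, 520, 520]

vec = CountVectorizer()
model = LogisticRegression(C=10, max_iter=1000).fit(vec.fit_transform(docs), codes)

# Map each column index back to the word it represents.
index_to_word = {int(index): word for word, index in vec.vocabulary_.items()}

def top_words(model, class_index, k=3):
    """Return the k words with the largest weights for one row of coef_ (hypothetical helper)."""
    order = np.argsort(model.coef_[class_index])[::-1][:k]
    return [index_to_word[int(i)] for i in order]

for class_index, code in enumerate(model.classes_):
    print(code, top_words(model, class_index))
```

Sorting in the opposite direction would instead list the words that make a code least likely.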


# Evaluation

In [14]:
from sklearn.metrics import accuracy_score, f1_score

# Convert the validation narratives to a feature matrix
x_validation = vectorizer.transform(df_validation['NARRATIVE'])

# Generate predicted codes for our validation narratives
y_validation_pred = clf.predict(x_validation)

# Calculate how accurately these match the true codes
y_validation = df_validation['INJ_BODY_PART_CD']
accuracy = accuracy_score(y_validation, y_validation_pred)
macro_f1 = f1_score(y_validation, y_validation_pred, average='macro')
print('accuracy = %s' % (accuracy))
print('macro f1 score = %s' % (macro_f1))

accuracy = 0.732506643047
macro f1 score = 0.493494990609
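
The gap between accuracy and macro F1 is easier to understand on a small worked example. Macro F1 averages the per-code F1 scores with equal weight, so mistakes on rare codes pull it down much more than they pull down accuracy. The labels below are invented for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Ten invented examples: code 100 is common, codes 520 and 700 are rare.
y_true = [100, 100, 100, 100, 100, 100, 520, 520, 700, 700]
y_pred = [100, 100, 100, 100, 100, 100, 520, 100, 700, 100]

# 8 of the 10 predictions are correct.
accuracy = accuracy_score(y_true, y_pred)

# One F1 score per code, in sorted code order (100, 520, 700).
per_code_f1 = f1_score(y_true, y_pred, average=None)

# Macro F1 is the unweighted mean of the per-code scores.
macro_f1 = f1_score(y_true, y_pred, average='macro')

print('accuracy = %.3f' % accuracy)
print('per-code f1 =', per_code_f1)
print('macro f1 = %.3f' % macro_f1)
```

Here the common code scores 6/7 while each rare code scores only 2/3, so the macro average (about 0.73) falls below the accuracy of 0.8.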




What if we had been foolish and evaluated our performance on the training data?

In [15]:
y_training_pred = clf.predict(x_training)
accuracy = accuracy_score(y_training, y_training_pred)
macro_f1 = f1_score(y_training, y_training_pred, average='macro')
print('accuracy = %s' % (accuracy))
print('macro f1 score = %s' % (macro_f1))

accuracy = 0.99250575451
macro f1 score = 0.993895312884


# Experimenting with Different Regularization

In [17]:
clf = LogisticRegression(C=1)
clf.fit(x_training, y_training)

y_training_pred = clf.predict(x_training)
training_accuracy = accuracy_score(y_training, y_training_pred)
print('accuracy on training data is: %s' % training_accuracy)

y_validation_pred = clf.predict(x_validation)
validation_accuracy = accuracy_score(y_validation, y_validation_pred)
print('accuracy on validation data is: %s' % validation_accuracy)

accuracy on training data is: 0.953749799261
accuracy on validation data is: 0.758082373782
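
Rather than editing `C` by hand and rerunning the cell, the comparison can be wrapped in a loop. The sketch below is only one way to organize it: `sweep_c` is a hypothetical helper, and the data comes from scikit-learn's `make_classification` generator rather than from the MSHA narratives, so the numbers it prints say nothing about our autocoder.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def sweep_c(x_train, y_train, x_val, y_val, c_values):
    """Fit one model per C value and report training and validation accuracy (hypothetical helper)."""
    results = []
    for c in c_values:
        model = LogisticRegression(C=c, max_iter=2000)
        model.fit(x_train, y_train)
        train_acc = accuracy_score(y_train, model.predict(x_train))
        val_acc = accuracy_score(y_val, model.predict(x_val))
        results.append((c, train_acc, val_acc))
    return results

# Synthetic stand-in data: 300 rows for training, 100 for validation.
x, y = make_classification(n_samples=400, n_features=50, n_informative=5, random_state=0)
results = sweep_c(x[:300], y[:300], x[300:], y[300:], [0.01, 0.1, 1, 10])

for c, train_acc, val_acc in results:
    print('C=%s  training=%.3f  validation=%.3f' % (c, train_acc, val_acc))
```

With the real data, the same function could be called with `x_training`, `y_training`, `x_validation`, and `y_validation`, picking the `C` with the best validation accuracy.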


# Experimenting with Inputs

### minimum document frequency
By default, CountVectorizer includes every word that occurs in our training data as a feature in our feature vector. We can use the min_df argument to specify that a feature should only be included if it occurs in at least a given number of different documents. Below, we specify that a feature should only be included if it occurs in at least 2 different documents. Previously our feature vector included 11,915 features. Now, we only get 6,797.

In [18]:
vectorizer1 = CountVectorizer(min_df=2)
x_training = vectorizer1.fit_transform(df_training['NARRATIVE'])
x_training.shape

(18681, 6797)
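
A tiny corpus makes the effect of `min_df` easy to verify. In the sketch below (invented narratives), note that `min_df` counts the number of documents a word appears in, not the total number of times it occurs.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three invented narratives: 'hurt' and 'back' each occur in only one of them.
docs = ['employee cut finger', 'employee hurt back', 'finger finger cut']

# Default behaviour: every word seen in the corpus becomes a feature.
all_words = sorted(CountVectorizer().fit(docs).vocabulary_)

# min_df=2: keep only the words that occur in at least 2 different documents.
frequent_words = sorted(CountVectorizer(min_df=2).fit(docs).vocabulary_)

print(all_words)
print(frequent_words)
```

Words that occur in a single narrative are often typos or identifiers, which is why dropping them tends to shrink the vocabulary substantially.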

### n-grams
Using just the occurrence of individual words (unigrams) loses word order, which might be useful. We can counteract that by using multi-word sequences, such as bigrams (2-word sequences), trigrams (3-word sequences), and so on. In practice it is rarely useful to go beyond bigrams for text classification. Below, we include both unigrams and bigrams (ngram_range=(1,2)) and raise min_df to 5.

In [19]:
vectorizer2 = CountVectorizer(min_df=5, ngram_range=(1,2))
x_training = vectorizer2.fit_transform(df_training['NARRATIVE'])
x_training.shape

(18681, 19914)
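
The loss of word order can be demonstrated with two invented narratives that contain the same words in a different order. With unigrams alone they are indistinguishable, while adding bigrams separates them.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same three words, opposite meaning.
docs = ['pipe hit head', 'head hit pipe']

# Unigrams only: both narratives produce exactly the same vector.
unigram_rows = CountVectorizer().fit_transform(docs).toarray()

# Unigrams plus bigrams: 'pipe hit' and 'hit head' occur only in the first narrative.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_rows = bigram_vectorizer.fit_transform(docs).toarray()

print((unigram_rows[0] == unigram_rows[1]).all())
print((bigram_rows[0] == bigram_rows[1]).all())
print(sorted(bigram_vectorizer.vocabulary_))
```

The cost is a larger vocabulary: three unigrams grow to seven features here, which is one reason the notebook raises the document-frequency cutoff when it adds bigrams.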

On a typical machine learning project you will likely spend a lot of time tweaking your model and retraining it, as we have done above. Eventually you will settle on the best version (as measured against your validation data). You will then measure its performance on the test data to get a final measure of performance.

In [22]:
# encode the training narratives with the unigram + bigram vectorizer before fitting
x_training = vectorizer2.transform(df_training['NARRATIVE'])
clf = LogisticRegression(C=1)
clf.fit(x_training, y_training)

y_training_pred = clf.predict(x_training)
training_accuracy = accuracy_score(y_training, y_training_pred)
print('accuracy on training data is: %s' % training_accuracy)

x_validation = vectorizer2.transform(df_validation['NARRATIVE'])
y_validation_pred = clf.predict(x_validation)
validation_accuracy = accuracy_score(y_validation, y_validation_pred)
print('accuracy on validation data is: %s' % validation_accuracy)

accuracy on training data is: 0.994272255233
accuracy on validation data is: 0.774025686448


# Saving the Model

Suppose we have settled on our final autocoder. This may have taken quite a bit of time to train and develop. Some models can take days, weeks, or even months to train. To save it for later, we need to save both the vectorizer used to encode our data and the model used to classify it. Below, we do this using the Python module `joblib`.

In [23]:
# joblib is a standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23
import joblib

# save the classifier object as LRclf.pkl
joblib.dump(clf, filename='LRclf.pkl')
# save the vectorizer object as vectorizer.pkl
joblib.dump(vectorizer2, filename='vectorizer.pkl')

['vectorizer.pkl']

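Later, we can load these back as shown below. Note: the libraries that define the objects stored in our pickled files (here, scikit-learn) must be installed and importable, and any functions or classes we defined ourselves must be defined again before loading; otherwise the files will fail to load.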

In [24]:
import joblib
# import the objects used by our saved vectorizer and classifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

# load the classifier
clf = joblib.load(filename='LRclf.pkl')
# load the vectorizer
vectorizer = joblib.load(filename='vectorizer.pkl')

# Using the Autocoder

There are a number of ways we might use our newly created autocoder. The simplest option is to just automatically assign the codes. Suppose, for example, that our test dataset is really just a set of uncoded narratives. We might code it as follows:

In [25]:
x_test = vectorizer.transform(df_test['NARRATIVE'])
y_test_pred = clf.predict(x_test)
print(x_test.shape)
print(y_test_pred.shape)
print(y_test_pred)

(8643, 19914)
(8643L,)
[520 200 420 ..., 700 100 700]


### assigning codes

In [34]:
df_test.loc[:, 'Autocode'] = y_test_pred
df_test.head(10)

| Unnamed: 0 | ACCIDENT_DT | FIPS_STATE_CD | INJ_BODY_PART | INJ_BODY_PART_CD | MINE_ID | NARRATIVE | YEAR | Autocode | Probability |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 2013-10-29 | 4 | ANKLE | 520 | 200024 | Contractor employee working as a carpenter mis... | 2013 | 520 | 0.956954 |
| 19 | 2013-06-11 | 56 | NECK | 200 | 4800152 | Employee hit his head on a low pipe, and injur... | 2013 | 200 | 0.831049 |
| 20 | 2013-11-11 | 42 | BACK (MUSCLES/SPINE/S-CORD/TAILBONE) | 420 | 3605023 | EE drove 970 loader all day and at the end of ... | 2013 | 420 | 0.937884 |
| 23 | 2013-07-14 | 19 | UPPER EXTREMITIES, MULTIPLE | 350 | 2100790 | Hot sand got blew out the door onto employee c... | 2013 | 313 | 0.662741 |
| 26 | 2013-03-01 | 32 | BACK (MUSCLES/SPINE/S-CORD/TAILBONE) | 420 | 2602535 | Employee was carrying a 2"X50' coil of rubber ... | 2013 | 420 | 0.958011 |
| 28 | 2013-07-11 | 32 | ANKLE | 520 | 2601962 | While walking to the bus during shift change s... | 2013 | 520 | 0.897337 |
| 35 | 2013-11-14 | 19 | FINGER(S)/THUMB | 340 | 1302067 | Employee was removing a inpullor from a pump i... | 2013 | 340 | 0.981709 |
| 42 | 2013-01-29 | 21 | HEAD,NEC | 100 | 1517232 | Employee stated he was operating opposite doub... | 2013 | 130 | 0.789098 |
| 46 | 2013-05-01 | 54 | KNEE/PATELLA | 512 | 4602380 | Climbing steps on shovel preparing to change c... | 2013 | 512 | 0.912291 |
| 48 | 2013-11-08 | 35 | FINGER(S)/THUMB | 340 | 2900097 | Employee was lowering a dozer belly pan, which... | 2013 | 340 | 0.943637 |


### accessing the predicted probabilities

In [31]:
y_pred_prob = clf.predict_proba(x_test)
print('The shape of the pred_prob matrix is: %s' % str(y_pred_prob.shape))
print('The probabilities for the first example are:\n%s' % y_pred_prob[0])
max_index = y_pred_prob[0].argmax()
code = clf.classes_[max_index]
print('The highest probability is at index %s which corresponds to code %s' % (max_index, code))

The shape of the pred_prob matrix is: (8643L, 46L)
The probabilities for the first example are:
[  1.04751543e-04   1.33515376e-04   2.06188691e-04   1.22708134e-04
   6.03564443e-04   4.08308745e-04   2.99472360e-04   5.23276380e-04
   8.22821601e-04   2.89555065e-04   3.45402418e-04   1.28194544e-03
   2.65480304e-04   3.49757034e-04   3.08101268e-04   3.09169977e-04
   3.58544945e-04   6.18139374e-04   1.38824804e-04   2.78700863e-04
   6.69055296e-04   1.23557935e-04   1.78611331e-04   6.00893323e-04
   3.98288620e-04   2.17864388e-04   3.33599681e-04   3.14071663e-04
   8.41622831e-04   2.33159131e-04   2.40546618e-04   1.82379031e-04
   4.30119532e-04   2.14262467e-04   1.73973281e-04   4.52610734e-04
   2.13365736e-02   1.67431805e-04   9.56954138e-01   2.61576900e-04
   3.95698956e-04   4.81694390e-03   5.20229716e-04   1.34384098e-03
   2.67453454e-04   5.63268318e-04]
The highest probability is at index 38 which corresponds to code 520


### adding the predicted probability for each code

In [33]:
# get a sequence indicating the position with the highest probability for each row
top_positions = y_pred_prob.argmax(axis=1)
top_probabilities = y_pred_prob[np.arange(len(top_positions)), top_positions]
print(top_probabilities.shape)
df_test.loc[:, 'Probability'] = top_probabilities
df_test.head()

(8643L,)


| Unnamed: 0 | ACCIDENT_DT | FIPS_STATE_CD | INJ_BODY_PART | INJ_BODY_PART_CD | MINE_ID | NARRATIVE | YEAR | Autocode | Probability |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 2013-10-29 | 4 | ANKLE | 520 | 200024 | Contractor employee working as a carpenter mis... | 2013 | 520 | 0.956954 |
| 19 | 2013-06-11 | 56 | NECK | 200 | 4800152 | Employee hit his head on a low pipe, and injur... | 2013 | 200 | 0.831049 |
| 20 | 2013-11-11 | 42 | BACK (MUSCLES/SPINE/S-CORD/TAILBONE) | 420 | 3605023 | EE drove 970 loader all day and at the end of ... | 2013 | 420 | 0.937884 |
| 23 | 2013-07-14 | 19 | UPPER EXTREMITIES, MULTIPLE | 350 | 2100790 | Hot sand got blew out the door onto employee c... | 2013 | 313 | 0.662741 |
| 26 | 2013-03-01 | 32 | BACK (MUSCLES/SPINE/S-CORD/TAILBONE) | 420 | 2602535 | Employee was carrying a 2"X50' coil of rubber ... | 2013 | 420 | 0.958011 |
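
One possible use of the `Probability` column is to accept only confident autocodes and send the rest to a human coder. The sketch below illustrates the idea with an invented probability matrix; the 0.8 cutoff is a hypothetical value, not a recommendation, and would need to be chosen against validation data.

```python
import numpy as np

# Invented stand-ins for clf.classes_ and clf.predict_proba(x_test).
classes = np.array([100, 420, 520])
probs = np.array([[0.02, 0.03, 0.95],
                  [0.40, 0.35, 0.25],
                  [0.10, 0.85, 0.05]])

# Position, code, and probability of the most likely code in each row.
top_positions = probs.argmax(axis=1)
top_codes = classes[top_positions]
top_probabilities = probs[np.arange(len(probs)), top_positions]

# Hypothetical cutoff: rows below it are flagged for manual review.
threshold = 0.8
needs_review = top_probabilities < threshold

print(top_codes)
print(top_probabilities)
print(needs_review)
```

Only the second row falls below the cutoff here, so it would be the one routed to manual coding.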


### saving the results

In [35]:
df_test.to_excel('msha_autocoded.xlsx')

# Extras

Scikit-learn provides a wide variety of powerful tools that we didn't have time to cover in this class, including many different learning algorithms and utilities for feature and model selection. To learn more, see the official and very extensive [online documentation](http://scikit-learn.org/stable/index.html).