# Reading in the Training and Evaluation Data

We will use the msha.csv data set to build and evaluate our autocoder. We begin by dividing it into 3 pieces, each of which will be used for a separate purpose. These pieces are training, validation, and test.

In [28]:
from __future__ import print_function
import numpy as np
import pandas as pd

# read the csv file into a data frame, a special Python object for representing tabular data
df = pd.read_csv(r'msha.csv', parse_dates=['ACCIDENT_DT'])

# add a column to our data frame with the year the incident occurred
df['YEAR'] = df['ACCIDENT_DT'].dt.year

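# split the data frame into separate test, validation, and training dataframes based on the year of the incident
# (.copy() gives each piece its own data, so adding columns later does not trigger a SettingWithCopyWarning)
df_test = df[df['YEAR'] == 2013].copy()
df_validation = df[df['YEAR'] == 2012].copy()
df_training = df[df['YEAR'] <= 2011].copy()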

print('''Data set sizes:
    test: %s
    validation: %s
    training: %s''' % (len(df_test), len(df_validation), len(df_training)))

Data set sizes:
    test: 8643
    validation: 9032
    training: 18681


# Feature Extraction
A critical step (perhaps the most important) in building our machine learning autocoder is determining the relevant inputs for our task and encoding them in a way that is both computationally efficient and easy for our algorithm to learn from.

For text classification tasks such as ours, a very simple approach that often works very well is known as the bag-of-words representation.

The basic idea is to create a 1 to 1 mapping of words to positions in a vector (list of numbers) and then to represent each collection of words with a vector where the values of the vector indicate which words occurred and did not occur.

For example, for the sake of simplicity suppose there are only 5 possible words that might occur in a narrative: "he", "fell", "hit", "rock", "car". We might map these to a vector of length 5 such that the first position of that vector always corresponds to the word "he", the second to "fell", the third to "hit", the fourth to "rock", and the fifth to "car".

<img src="feature_map.png">



Suppose also that we use a value of 0 to indicate that a word does not occur in a narrative, and a value of 1 to indicate that it does. For instance, a narrative consisting of the words "he hit rock" would be represented by the vector [1, 0, 1, 1, 0].

More generally, we convert each example (i.e. narrative) into a long vector (list of numbers), where each position in the vector corresponds to one of the words that occurs in our training data, and the value at that position indicates something about that word's occurrence in the specific example. Often, we use a value of 1 to indicate that the word is present in the narrative, and a 0 to indicate it is not. The resulting vector is known as a feature vector because it represents the features of our data that we consider relevant for our task.
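The five-word mapping described above can be sketched in a few lines of plain Python. The `encode` helper and the sample narratives are hypothetical and chosen only for illustration; in this sketch, words outside the vocabulary are simply ignored.

```python
# The toy 5-word vocabulary; list order defines each word's position in the vector.
vocabulary = ["he", "fell", "hit", "rock", "car"]
word_to_position = {word: i for i, word in enumerate(vocabulary)}


def encode(narrative):
    """Return a 0/1 vector marking which vocabulary words occur in the narrative."""
    vector = [0] * len(vocabulary)
    for word in narrative.lower().split():
        position = word_to_position.get(word)
        if position is not None:  # skip words that are not in the vocabulary
            vector[position] = 1
    return vector


print(encode("He fell"))                   # [1, 1, 0, 0, 0]
print(encode("car hit rock and stopped"))  # [0, 0, 1, 1, 1]
```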

Below, we use scikit-learn to convert the injury narratives from our training data into a matrix (table of numbers), with each row of that matrix corresponding to the vector produced by our bag-of-words representation of an example. 

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

# Create an instance of the CountVectorizer object
vectorizer = CountVectorizer()
# Use the narratives in our training data to create the vocabulary that will
# be represented by our feature vectors. This is remembered by the vectorizer.
vectorizer.fit(df_training['NARRATIVE'])
# Convert the training narratives into their matrix representation.
x_training = vectorizer.transform(df_training['NARRATIVE'])

Below, an examination of the feature matrix shows the number of training examples and features (i.e. words in our vocabulary).

In [37]:
n_rows = x_training.shape[0]
n_features = x_training.shape[1]
print('# of training examples:', n_rows)
print('# of features representing each example:', n_features)

# of training examples: 18681
# of features representing each example: 11915


We can also examine the matrix directly but it's not very interesting. Most of the elements of our feature matrix are 0 for any given example.

In [38]:
dense_vector = x_training[0].todense()
print('The vector representing our first training narrative looks like this:', dense_vector)
print('Only %s of the %s elements in this vector are nonzero' % (np.count_nonzero(dense_vector), n_features))

The vector representing our first training narrative looks like this: [[0 0 0 ..., 0 0 0]]
Only 29 of the 11915 elements in this vector are nonzero
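To make the structure of the feature matrix easier to see, here is a small sketch on two made-up narratives (not taken from msha.csv). It prints the word-to-column mapping stored in the vectorizer's `vocabulary_` attribute and shows that only the nonzero entries are stored. Note that CountVectorizer records word counts by default, which coincide with 0/1 indicators here because no word repeats within a narrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up narratives, used only for illustration.
toy_narratives = ["He fell and hit a rock", "He hit the car"]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_narratives)

# vocabulary_ maps each word to its column in the feature matrix.
# By default text is lowercased and one-letter tokens such as "a" are dropped.
for word, column in sorted(toy_vectorizer.vocabulary_.items(), key=lambda item: item[1]):
    print(column, word)

# The matrix is sparse: only nonzero entries are stored.
print(toy_matrix.toarray())
n_cells = toy_matrix.shape[0] * toy_matrix.shape[1]
print('stored nonzero entries: %s of %s' % (toy_matrix.nnz, n_cells))
```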


# Selecting and Fitting the Model
Another important consideration in building a machine learning autocoder is selecting and fitting our model. The model defines the ways in which our features can relate to the codes that will be assigned; the fitting procedure determines how the details of the model are ultimately learned.

Below, we use a regularized Logistic Regression model, which has been shown to work well on many text classification tasks. The regularization hyperparameter C controls how closely we allow the model to fit the training data (in scikit-learn, C is the inverse of the regularization strength, so larger values mean weaker regularization). If C is too high the model will tend to "overfit". If it is too low, the model may fail to learn important relationships between the features and the codes.

There is only one way to find the optimal value for C: experimentation.

In [39]:
from sklearn.linear_model import LogisticRegression

# y_training contains the codes associated with our training narratives
y_training = df_training['INJ_BODY_PART_CD']
# we create an instance of the LogisticRegression model
clf = LogisticRegression(C=1.0)
# we fit the model to our training data (i.e. we calculate the model parameters)
clf.fit(x_training, y_training)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)

As with our other outputs, we can inspect the Logistic Regression model. Its most important attribute is the coefficients, which show how strongly our various features are related to our various codes. Like our feature values, these coefficients are stored in a matrix.

Below, we see that the coefficients matrix consists of 46 rows, one for each of the possible codes, and 11,915 columns, one for each of our features.

In [40]:
print(clf.coef_.shape)
print(clf.coef_[2])

(46L, 11915L)
[ -1.50909819e-03  -1.26303852e-04  -2.11726350e-06 ...,  -1.06079037e-02
  -1.52733803e-04  -3.74806034e-03]
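Raw coefficients are hard to read, so it can help to look up which word carries the largest weight for each code. The sketch below does this on a made-up six-narrative dataset with three invented codes ('knee', 'back', 'hand'); all names prefixed with `toy_` are hypothetical. It inverts the vectorizer's `vocabulary_` mapping to go from column positions back to words.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up narratives and codes, used only for illustration.
toy_narratives = [
    "twisted knee on step",
    "knee struck by rock",
    "strained back lifting cable",
    "back pain after shoveling",
    "cut hand on metal",
    "hand caught in belt",
]
toy_codes = ['knee', 'knee', 'back', 'back', 'hand', 'hand']

toy_vectorizer = CountVectorizer()
toy_x = toy_vectorizer.fit_transform(toy_narratives)
toy_clf = LogisticRegression(C=1.0)
toy_clf.fit(toy_x, toy_codes)

# Invert the word -> column mapping so we can look up words by column.
position_to_word = {pos: word for word, pos in toy_vectorizer.vocabulary_.items()}

# coef_ has one row per code (in the order of classes_) and one column per word.
top_words = {}
for row, code in zip(toy_clf.coef_, toy_clf.classes_):
    top_words[str(code)] = position_to_word[int(np.argmax(row))]
    print('%s -> %s' % (code, top_words[str(code)]))
```

On this toy data the highest-weighted word for each code is the body part itself, since it is the only word that appears in every narrative with that code.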


# Evaluation

Already we have seen that there are many different ways to build our autocoder: the regularization parameter (C) alone can take infinitely many values. In fact, there are many other things we can vary, some of which we will discuss later. To choose among these options we must try them and measure their quality in some way. This is the purpose of our validation data.

Below, we use our vectorizer to convert the narratives from our validation data into a feature matrix using the same encoding scheme used for our training data. We then use our trained LogisticRegression classifier to predict codes for these narratives, and compare these to the "true" codes already assigned to the narratives by MSHA.

In [41]:
from sklearn.metrics import accuracy_score

# Convert the validation narratives to a feature matrix
x_validation = vectorizer.transform(df_validation['NARRATIVE'])
# Generate predicted codes for our validation narratives
y_validation_pred = clf.predict(x_validation)
# Calculate how accurately these match the true codes
y_validation = df_validation['INJ_BODY_PART_CD']
accuracy = accuracy_score(y_validation, y_validation_pred)
print('accuracy = %s' % (accuracy))

accuracy = 0.758082373782


Our validation data says we have about 76% accuracy, which is not bad considering we've done almost no work. It's perfectly possible that human accuracy is near 75% on this task, although that's a study for another time.
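Accuracy here simply means the fraction of narratives whose predicted code exactly matches the true code. A minimal sketch of that arithmetic, using five illustrative code pairs rather than the real validation set:

```python
# Five illustrative (true, predicted) code pairs; not the real validation data.
true_codes = [512, 700, 420, 330, 420]
predicted_codes = [512, 100, 420, 330, 420]

# Count exact matches and divide by the number of examples.
n_correct = sum(1 for t, p in zip(true_codes, predicted_codes) if t == p)
toy_accuracy = n_correct / float(len(true_codes))
print('accuracy = %s' % toy_accuracy)  # accuracy = 0.8
```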

What if we had been foolish and evaluated our model only on the training data?

In [42]:
y_training_pred = clf.predict(x_training)
accuracy = accuracy_score(y_training, y_training_pred)
print('accuracy = %s' % (accuracy))

accuracy = 0.953749799261


Yikes! Evaluating on our training data would have told us we were coding at 95% accuracy when really we get only about 76% accuracy on the validation data. A large gap between accuracy on the training data and validation data is an indicator that overfitting is occurring, and 20 percentage points is a very large gap. This suggests we might be able to improve our model by reducing overfitting.

There are a number of ways to do this: we might reduce the number of features in our model, we might gather more training data, or we might use a different model that is more resistant to overfitting. In this case we will reduce the regularization hyperparameter C, which also tends to reduce overfitting. Below, we train a new model with C set at 0.1 and re-evaluate its performance on our training and validation data.

In [43]:
clf = LogisticRegression(C=0.1)
clf.fit(x_training, y_training)

y_training_pred = clf.predict(x_training)
training_accuracy = accuracy_score(y_training, y_training_pred)
print('accuracy on training data is: %s' % training_accuracy)

y_validation_pred = clf.predict(x_validation)
validation_accuracy = accuracy_score(y_validation, y_validation_pred)
print('accuracy on validation data is: %s' % validation_accuracy)

accuracy on training data is: 0.843156147958
accuracy on validation data is: 0.763175376439


Decreasing the hyperparameter C has reduced our overfitting quite a bit. The accuracy on the training data has fallen substantially (but who cares, we don't need to predict data we already know the codes for). More importantly, our accuracy on the validation data seems to have improved slightly. This is likely to be a better model than what we were using previously.
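Since experimentation is the way to choose C, it is convenient to wrap the train-and-evaluate steps in a loop. The sketch below is one possible way to do that; the `sweep_c` helper is hypothetical, and it is demonstrated on synthetic data from `make_classification` rather than on the MSHA narratives, so the numbers it prints say nothing about our autocoder.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def sweep_c(x_train, y_train, x_val, y_val, c_values):
    """Fit one model per C value and return (C, training accuracy, validation accuracy) tuples."""
    results = []
    for c in c_values:
        model = LogisticRegression(C=c, max_iter=1000)
        model.fit(x_train, y_train)
        train_acc = accuracy_score(y_train, model.predict(x_train))
        val_acc = accuracy_score(y_val, model.predict(x_val))
        results.append((c, train_acc, val_acc))
    return results


# Synthetic stand-in data (an assumption made only so the sketch is runnable).
x, y = make_classification(n_samples=400, n_features=50, n_informative=5, random_state=0)
x_tr, x_va, y_tr, y_va = train_test_split(x, y, test_size=0.25, random_state=0)

c_values = [0.01, 0.1, 1.0, 10.0]
results = sweep_c(x_tr, y_tr, x_va, y_va, c_values)
for c, train_acc, val_acc in results:
    print('C=%s  training=%.3f  validation=%.3f' % (c, train_acc, val_acc))

# Pick the C with the best validation accuracy, never the best training accuracy.
best_c = max(results, key=lambda r: r[2])[0]
print('best C on validation data: %s' % best_c)
```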

There are many other ways we might adjust our model. For example, we might change the way we convert our narratives into feature vectors. This is controlled by the vectorizer we use, in this case CountVectorizer. The [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVr.html) online shows there are many ways this might be modified; we illustrate a few below.

### minimum document frequency
By default, CountVectorizer includes any feature (i.e. word) that occurs in our training data as a feature in our feature vector. We can use the min_df argument to specify that features should only be included in the feature vector if they occur in at least X different documents. Below, we specify that a feature should only be included if it occurs in at least 2 different documents. With the default settings, the feature vector included 12,336 features. Now we get only 7,163.

In [72]:
vectorizer1 = CountVectorizer(min_df=2)
x_training = vectorizer1.fit_transform(df_training['NARRATIVE'])
x_training.shape

(21035, 7163)
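The effect of min_df is easy to see on a tiny made-up corpus. In this sketch (the three narratives are invented for illustration), only the words that appear in at least two narratives survive when min_df=2.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up narratives, used only for illustration.
toy_narratives = ["he fell on rock", "he hit rock", "car hit him"]

all_words = CountVectorizer().fit(toy_narratives)
frequent_words = CountVectorizer(min_df=2).fit(toy_narratives)

# Default: every word seen in the corpus becomes a feature.
print(sorted(all_words.vocabulary_))       # ['car', 'fell', 'he', 'him', 'hit', 'on', 'rock']
# min_df=2: keep only words that occur in at least 2 narratives.
print(sorted(frequent_words.vocabulary_))  # ['he', 'hit', 'rock']
```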

### n-grams
Another CountVectorizer default is to include only individual words as features. Sometimes it's also useful to include short sequences of words: 2-word sequences (called bigrams) and, occasionally, 3-word sequences (called trigrams). Below, we tell the vectorizer to create features for all individual words and word pairs that occur in at least 2 examples in our training data. This has the effect of dramatically increasing the number of features in our feature vector to 56,853! Be careful with this: you can very easily end up with a feature matrix larger than your computer can handle.

In [73]:
vectorizer2 = CountVectorizer(min_df=2, ngram_range=(1,2))
x_training = vectorizer2.fit_transform(df_training['NARRATIVE'])
print(x_training.shape)

(21035, 56853)
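To see exactly what ngram_range=(1, 2) adds, the sketch below fits a vectorizer on a single made-up narrative and lists the resulting features. Each adjacent pair of words becomes its own feature alongside the individual words, which is why the feature count grows so quickly on real data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A single made-up narrative, used only for illustration.
toy_narrative = ["he hit rock"]

unigram_vectorizer = CountVectorizer().fit(toy_narrative)
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2)).fit(toy_narrative)

# Individual words only.
print(sorted(unigram_vectorizer.vocabulary_))  # ['he', 'hit', 'rock']
# Individual words plus adjacent word pairs.
print(sorted(bigram_vectorizer.vocabulary_))   # ['he', 'he hit', 'hit', 'hit rock', 'rock']
```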


On a typical machine learning project you will likely spend a lot of time tweaking your model and retraining it, as we have done above. Eventually you will settle on the best version (as measured against your validation data). You will then measure its performance on the test data to get a final measure of performance.

# Saving the Model

Suppose we have settled on our final autocoder. This may have taken quite a bit of time to train and develop. Some models can take days, weeks, or even months to train. To save it for later, we need to save both the vectorizer used to encode our data and the model used to classify it. Below, we do this using the Python module `joblib`.

In [74]:
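# joblib is installed alongside scikit-learn; older scikit-learn releases
# exposed it as sklearn.externals.joblib, which has since been removed
import joblib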

# save the classifier object as LRclf.pkl
joblib.dump(clf, filename='LRclf.pkl')
# save the vectorizer object as vectorizer.pkl
joblib.dump(vectorizer, filename='vectorizer.pkl')

['vectorizer.pkl']

Later, we can load these back as shown below. Note: we must first import any objects or functions used by our pickled files, otherwise they will fail to load.

In [75]:
# older scikit-learn releases exposed joblib as sklearn.externals.joblib
import joblib
# import the objects used by our saved vectorizer and classifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

# load the classifier
clf = joblib.load(filename='LRclf.pkl')
# load the vectorizer
vectorizer = joblib.load(filename='vectorizer.pkl')
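A quick way to gain confidence in the save-and-load step is a round trip: save both objects, load them back, and check that the restored pair produces the same predictions as the originals. The sketch below does this with a tiny made-up model written to a temporary directory. It assumes the standalone `joblib` package is importable, and the file names and `toy_` objects are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Train a tiny stand-in model on made-up narratives.
toy_narratives = ["twisted knee on step", "knee struck by rock",
                  "strained back lifting cable", "back pain after shoveling"]
toy_codes = ['knee', 'knee', 'back', 'back']
toy_vectorizer = CountVectorizer()
toy_clf = LogisticRegression().fit(toy_vectorizer.fit_transform(toy_narratives), toy_codes)

# Save both pieces: the classifier is useless without the matching vectorizer.
tmp_dir = tempfile.mkdtemp()
clf_path = os.path.join(tmp_dir, 'toy_clf.pkl')
vectorizer_path = os.path.join(tmp_dir, 'toy_vectorizer.pkl')
joblib.dump(toy_clf, clf_path)
joblib.dump(toy_vectorizer, vectorizer_path)

# Load them back and confirm the restored pair predicts the same codes.
loaded_clf = joblib.load(clf_path)
loaded_vectorizer = joblib.load(vectorizer_path)

new_narratives = ["hurt knee on rock", "sore back after lifting"]
original = toy_clf.predict(toy_vectorizer.transform(new_narratives))
restored = loaded_clf.predict(loaded_vectorizer.transform(new_narratives))
print(list(original) == list(restored))  # True
```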

# Using the Autocoder

There are a number of ways we might use our newly created autocoder. The simplest option is to just automatically assign the codes. Suppose, for example, that our test dataset is really just a set of uncoded narratives. We might code it as follows:

In [76]:
x_test = vectorizer.transform(df_test['NARRATIVE'])
y_test_pred = clf.predict(x_test)

df_test['Autocode'] = y_test_pred
df_test.head()

Unnamed: 0,ACCIDENT_DT,FIPS_STATE_CD,INJ_BODY_PART,INJ_BODY_PART_CD,MINE_ID,NARRATIVE,Autocode
35783,2013-03-11 00:00:00,1,KNEE/PATELLA,512,101401,Employee was walking by a supply car when he s...,512
40528,2010-09-11 00:00:00,32,MULTIPLE PARTS (MORE THAN ONE MAJOR),700,2601089,Employee in bus when bus hit a berm. Diagnose...,100
29973,2014-07-21 00:00:00,54,BACK (MUSCLES/SPINE/S-CORD/TAILBONE),420,4609086,Employee stated that he completed performing m...,420
38708,2011-02-01 00:00:00,21,HAND (NOT WRIST OR FINGERS),330,1519447,Cleaning back board off at #2 tailpiece with a...,330
9324,2013-01-09 00:00:00,48,BACK (MUSCLES/SPINE/S-CORD/TAILBONE),420,4100991,Miner had finished cleaning up a wet concrete ...,420


Another option is to be more selective. Our LogisticRegression model has the nice property that it can tell us not only which codes it thinks are best, but it can also give us an estimate of how likely those codes are to be correct. We might use this information to only assign codes that exceed a certain likelihood of being correct. Alternatively, we could use this information to "suggest" likely codes to a human coder.

To get classification probabilities we use the predict_proba() method of our Logistic Regression classifier. For each of the examples in the matrix fed to predict_proba() we get a sequence showing how confident the model is that each of the 46 possible codes should be assigned to it. 

Below, we run predict_proba() on our 10,000 example test dataset. The result is a 10,000 row matrix (one for each example), with 46 columns (one for each code). 

In [77]:
y_pred_prob = clf.predict_proba(x_test)
print('The shape of the pred_prob matrix is: %s' % str(y_pred_prob.shape))
print('The probabilities for the first example are:\n%s' % y_pred_prob[0])

The shape of the pred_prob matrix is: (10000L, 46L)
The probabilities for the first example are:
[  3.41020511e-03   3.01536102e-03   1.34404050e-03   2.73822059e-03
   3.34934214e-03   3.72835828e-03   2.71803865e-03   1.34908277e-03
   2.82138263e-03   8.00753984e-04   1.65381330e-03   7.07661375e-04
   1.21763663e-03   1.38525872e-03   5.37209415e-03   1.09459687e-03
   4.24352843e-03   3.76646598e-03   3.92432816e-03   4.05624446e-03
   1.43570253e-03   5.17993260e-03   5.10698189e-03   5.36387885e-03
   2.79492518e-03   1.31328913e-03   2.39314012e-03   2.79896495e-03
   6.85283046e-03   4.13963828e-03   6.14814400e-03   2.17366664e-03
   1.50971133e-03   6.02092928e-03   9.47741545e-03   7.96389434e-01
   1.27997256e-02   4.67647647e-03   3.75432874e-02   5.10830117e-03
   2.82194370e-03   3.30385409e-03   7.88862252e-03   1.19017681e-02
   1.02380146e-03   1.13722073e-03]


The position in the vector indicates the code the probability corresponds to. To recover just the probability and position of the most likely code, we can do the following:

In [78]:
# get a sequence indicating the position with the highest probability for each row
top_positions = y_pred_prob.argmax(axis=1)

codes = []
probabilities = []
# loop through each example and retrieve the best code and its probability
# if the probability exceeds .75, add it as an autocode, otherwise leave it blank
for n, pos in enumerate(top_positions):
    # grab the code that corresponds to the position with the highest probability
    code = clf.classes_[pos]
    # grab the probability at that position
    prob = y_pred_prob[n][pos]
    # add the probability to our list of probabilities
    probabilities.append(prob)
    # if prob exceeds our cutoff add the code to our list of codes
    if prob > .75:
        codes.append(code)
    # if prob does not exceed the cutoff, add a blank to our list of codes
    else:
        codes.append('')

# Add the autocodes to our DataFrame
df_test['Autocode'] = codes
# Add the probabilities to our DataFrame
df_test['Probability'] = probabilities
# Show selected columns of our updated DataFrame
df_test[['INJ_BODY_PART', 'INJ_BODY_PART_CD', 'Autocode', 'Probability']].head()

Unnamed: 0,INJ_BODY_PART,INJ_BODY_PART_CD,Autocode,Probability
35783,KNEE/PATELLA,512,512.0,0.796389
40528,MULTIPLE PARTS (MORE THAN ONE MAJOR),700,,0.489486
29973,BACK (MUSCLES/SPINE/S-CORD/TAILBONE),420,,0.72308
38708,HAND (NOT WRIST OR FINGERS),330,330.0,0.847322
9324,BACK (MUSCLES/SPINE/S-CORD/TAILBONE),420,,0.132973
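The other option mentioned above, suggesting likely codes to a human coder, can be sketched with the same probability matrix. The `suggest_codes` helper below is hypothetical: it returns the k most probable codes for each example, most likely first. It is demonstrated on a small hand-written probability matrix rather than on real `predict_proba()` output.

```python
import numpy as np


def suggest_codes(probabilities, classes, k=3):
    """Return, for each row, the k most probable (code, probability) pairs, best first."""
    probabilities = np.asarray(probabilities)
    classes = np.asarray(classes)
    # argsort is ascending, so reverse each row and keep the first k positions
    top_positions = np.argsort(probabilities, axis=1)[:, ::-1][:, :k]
    suggestions = []
    for row, positions in zip(probabilities, top_positions):
        suggestions.append([(classes[p].item(), float(row[p])) for p in positions])
    return suggestions


# Hand-written stand-ins for clf.classes_ and clf.predict_proba(x_test).
toy_classes = [330, 420, 512]
toy_probabilities = [[0.15, 0.05, 0.80],
                     [0.40, 0.35, 0.25]]

for suggestion in suggest_codes(toy_probabilities, toy_classes, k=2):
    print(suggestion)
```

With the real model, the same call would take `y_pred_prob` and `clf.classes_` as its first two arguments.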


# Extras

Scikit-learn provides a wide variety of powerful tools that we didn't have time to cover in this class, including many different learning algorithms and utilities for feature and model selection. To learn more, see the official and very extensive [online documentation](http://scikit-learn.org/stable/index.html).