[View in Colaboratory](https://colab.research.google.com/github/ameasure/colab_tutorials/blob/master/FastAI.ipynb)

# Download the prerequisites

1.   PyTorch (a library for building neural networks)
2.   fastai (a library built on top of PyTorch)
3.   spaCy English language model
4.   msha.xlsx data
5.   xlrd (a library for reading Excel files)

In [1]:
!pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu92/torch_nightly.html
!pip install fastai
!python -m spacy download en
!wget --no-clobber 'https://github.com/ameasure/autocoding-class/raw/master/msha.xlsx'
!pip install xlrd

Looking in links: https://download.pytorch.org/whl/nightly/cu92/torch_nightly.html

    Linking successful
    /usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/en

    You can now load the model via spacy.load('en')

File ‘msha.xlsx’ already there; not retrieving.



In [2]:
import pandas as pd

df = pd.read_excel('msha.xlsx')
df['ACCIDENT_YEAR'] = df['ACCIDENT_DT'].apply(lambda x: x.year)
df['ACCIDENT_YEAR'].value_counts()
df_train = df[df['ACCIDENT_YEAR'].isin([2010, 2011])].copy()
df_valid = df[df['ACCIDENT_YEAR'] == 2012].copy()
print('training rows:', len(df_train))
print('validation rows:', len(df_valid))

training rows: 18681
validation rows: 9032


In [0]:
from sklearn.preprocessing import LabelEncoder

labeler = LabelEncoder().fit(df['INJ_BODY_PART'])
df_train['LABEL'] = labeler.transform(df_train['INJ_BODY_PART'])
df_valid['LABEL'] = labeler.transform(df_valid['INJ_BODY_PART'])
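
To make the encoding concrete, here is a minimal pure-Python sketch of the mapping `LabelEncoder` produces: the distinct labels are sorted and each one is assigned its position in that order. The body-part strings below are made up for illustration and are not taken from msha.xlsx.

```python
# Toy labels; hypothetical values, not the real INJ_BODY_PART categories.
body_parts = ["HAND", "BACK", "EYE", "HAND", "BACK"]

# LabelEncoder sorts the distinct classes, then numbers them from 0.
classes = sorted(set(body_parts))
to_index = {name: i for i, name in enumerate(classes)}

# transform: string -> integer code; inverse: integer code -> string.
encoded = [to_index[name] for name in body_parts]
decoded = [classes[i] for i in encoded]

print(classes)   # ['BACK', 'EYE', 'HAND']
print(encoded)   # [2, 0, 1, 2, 0]
```

The codes always run from 0 to the number of classes minus one, and fitting on the full `df` means the training and validation rows share one mapping.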

FastAI is very picky about the data formats it accepts. We save the training and validation data as CSV files in the layout described below to meet its requirements. Specifically, FastAI requires:

1.   The first column must contain the label
2.   The second column must contain the text
3.   The CSV must not have a header (column names at the top)
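
As a sanity check on that layout, the following sketch writes two made-up rows in the same label-first, no-header format and reads them back with the standard `csv` module. It uses an in-memory buffer rather than a real file, and the labels and narratives are invented.

```python
import csv
import io

# Hypothetical (label, narrative) rows; the real ones come from df_train.
rows = [(3, "Employee slipped on ice, injuring his back."),
        (17, 'Struck left hand with hammer; said "it hurts".')]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)   # no header row is written

buffer.seek(0)
parsed = list(csv.reader(buffer))

# Column 0 is the label, column 1 is the text. Commas and quotes inside
# a narrative survive because the writer quotes those fields.
labels = [int(r[0]) for r in parsed]
texts = [r[1] for r in parsed]
print(labels)
print(texts)
```

The round trip shows why free-text narratives are safe in a two-column CSV as long as a proper CSV writer does the quoting.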


In [4]:
df_train[['LABEL', 'NARRATIVE']].to_csv('train.csv', header=False, index=False)
df_valid[['LABEL', 'NARRATIVE']].to_csv('valid.csv', header=False, index=False)
n_labels = len(labeler.classes_)
print(n_labels)

46


In [0]:
from fastai.text.data import TextDataset
from fastai.text.data import text_data_from_csv
from fastai.text.data import lm_data

Create a `TextDataset` from our training CSV file. It contains the tokenized text after it has been mapped to numbers, each number representing a token (roughly, a word) in the text.

In [6]:
train_ds = TextDataset.from_csv('.', 
                                name='train',
                                classes=list(labeler.classes_))

Tokenizing train.
Numericalizing train.
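
The two steps reported in that output can be pictured with a toy sketch. This is not fastai's tokenizer (which relies on spaCy); it is a plain whitespace splitter with an `<unk>` placeholder chosen for illustration, meant only to show how text becomes a list of integer ids. The names `itos` and `stoi` (index-to-string, string-to-index) are used here as a convention, and the sentences are made up.

```python
from collections import Counter

texts = ["employee cut his hand", "employee strained his back"]

# Tokenize: split each narrative into lowercase word tokens.
tokenized = [t.lower().split() for t in texts]

# Build the vocabulary: index 0 is reserved for unknown words, the rest
# are ordered by frequency (ties keep first-seen order).
counts = Counter(tok for toks in tokenized for tok in toks)
itos = ["<unk>"] + [tok for tok, _ in counts.most_common()]
stoi = {tok: i for i, tok in enumerate(itos)}

def numericalize(tokens):
    """Map tokens to ids, sending unseen words to the <unk> id."""
    return [stoi.get(tok, 0) for tok in tokens]

print(itos)
print(numericalize(tokenized[0]))
print(numericalize("employee cut his foot".split()))  # 'foot' is unseen
```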


We are going to use a pretrained language model to build a state-of-the-art text classifier. This involves the following steps:

1.   Load the weights of a pretrained language model, i.e. a model trained on a huge collection of text. We don't want to train it ourselves because that takes a huge amount of time.
2.   Finetune the language model on the text in our dataset. This can include all the text currently available, since this is an unsupervised process that needs no labels.
3.   Cut off the language model's output layer and put a classifier layer on top.
4.   Finetune the new model on our classification task.




# Load the pretrained model weights

We create a directory called 'models' and download the weights there.

In [7]:
import os

# Create the directory if needed; exist_ok avoids an error on re-runs.
os.makedirs('models', exist_ok=True)

Download the pretrained model weights (`lstm_wt103.pth`) and the matching vocabulary file (`itos_wt103.pkl`):

In [0]:
from fastai.core import download_url

download_url('http://files.fast.ai/models/wt103_v1/lstm_wt103.pth', 'models/lstm_wt103.pth')
download_url('http://files.fast.ai/models/wt103_v1/itos_wt103.pkl', 'models/itos_wt103.pkl')

# Finetune the language model on some of our own text

We start by preparing a "language model" DataBunch from our data. This is just our text in a format useful for language modelling.

In [9]:
data_lm = text_data_from_csv(path='.', train='train', data_func=lm_data)

Tokenizing valid.
Numericalizing valid.


Load the model and finetune it on our data for 1 epoch.

In [10]:
from fastai.text.learner import RNNLearner

learn = RNNLearner.language_model(data_lm, 
                                  pretrained_fnames=['lstm_wt103', 'itos_wt103'], 
                                  drop_mult=0.5)
learn.fit_one_cycle(1, 1e-2)

epoch  train loss  valid loss  accuracy
0      3.964359    3.674327    0.326862
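
`fit_one_cycle(1, 1e-2)` trains for one epoch with a learning rate that peaks at 1e-2 instead of staying constant. The sketch below shows the general shape of such a schedule: a warm-up followed by an annealing phase. The warm-up fraction, the starting divisor, and the cosine shape are illustrative assumptions and may not match the exact curve this fastai version uses.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, warmup_frac=0.3, div=25.0):
    """Learning rate at `step` for a simplified one-cycle schedule.

    Rises linearly from max_lr / div to max_lr during warm-up,
    then follows a half cosine down towards zero.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        start = max_lr / div
        return start + (max_lr - start) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# 100 hypothetical optimizer steps in the epoch, peak rate 1e-2.
schedule = [one_cycle_lr(s, 100, 1e-2) for s in range(100)]
print(f"first={schedule[0]:.2e} peak={max(schedule):.2e} last={schedule[-1]:.2e}")
```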


In [11]:
learn.unfreeze()
learn.fit_one_cycle(5, 1e-3)

epoch  train loss  valid loss  accuracy
0      3.591313    3.419873    0.354575
1      3.379102    3.256631    0.373550
2      3.220740    3.171059    0.382985
3      3.125974    3.129997    0.387092
4      3.095357    3.123070    0.388513
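
One way to read the validation loss: assuming it is the mean per-token cross-entropy in nats (the usual convention for PyTorch's cross-entropy loss), exponentiating it gives the perplexity, which is roughly the number of equally likely next words the model is choosing between.

```python
import math

# Validation losses copied from the language model training tables above.
loss_after_first_epoch = 3.674327
loss_after_unfrozen_epochs = 3.123070

for name, loss in [("after first epoch", loss_after_first_epoch),
                   ("after unfrozen epochs", loss_after_unfrozen_epochs)]:
    perplexity = math.exp(loss)
    print(f"{name}: loss={loss:.3f} perplexity={perplexity:.1f}")
```

Under that assumption, perplexity falls from roughly 39 to roughly 23 over the course of finetuning.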


In [0]:
learn.save_encoder('ft_enc')

# Fit the model to our classification task

In [0]:
#!wget --no-clobber https://www.dropbox.com/s/45i662vci7ja6vv/ft_enc.pth?dl=0
#!mv ft_enc.pth?dl=0 models/ft_enc.pth

In [14]:
from fastai.text.data import classifier_data

data_clas = text_data_from_csv('.', 
                               data_func=classifier_data, 
                               vocab=data_lm.train_ds.vocab, 
                               n_labels=n_labels, 
                               classes=labeler.classes_)
max(df_valid['LABEL'])

45

In [15]:
from fastai.text.learner import RNNLearner

learn = RNNLearner.classifier(data_clas, drop_mult=0.5)
learn.load_encoder('ft_enc')
learn.fit_one_cycle(1, 1e-2)

epoch  train loss  valid loss  accuracy
0      2.223177    1.900726    0.468888


In [16]:
learn.model

SequentialRNN(
  (0): MultiBatchRNNCore(
    (encoder): Embedding(5560, 400, padding_idx=1)
    (encoder_dp): EmbeddingDropout(
      (emb): Embedding(5560, 400, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDropout(
        (module): LSTM(400, 1150)
      )
      (1): WeightDropout(
        (module): LSTM(1150, 1150)
      )
      (2): WeightDropout(
        (module): LSTM(1150, 400)
      )
    )
    (input_dp): RNNDropout()
    (hidden_dps): ModuleList(
      (0): RNNDropout()
      (1): RNNDropout()
      (2): RNNDropout()
    )
  )
  (1): PoolingLinearClassifier(
    (layers): Sequential(
      (0): BatchNorm1d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Dropout(p=0.2)
      (2): Linear(in_features=1200, out_features=50, bias=True)
      (3): ReLU(inplace)
      (4): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): Dropout(p=0.1)
      (6): Linear(in_features=50, out_features=46, bias=Tr
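
The 1200 input features of the classifier head are three times the encoder's 400-dimensional output: the head concatenates the last hidden state with a max-pool and a mean-pool taken over all time steps. A pure-Python sketch of that pooling, with a tiny hidden size and made-up numbers:

```python
def concat_pool(hidden_states):
    """Concatenate [last state, max over time, mean over time].

    hidden_states: list of time steps, each a list of floats of the
    same length d. Returns a list of length 3 * d.
    """
    last = hidden_states[-1]
    columns = list(zip(*hidden_states))          # one tuple per feature
    max_pool = [max(col) for col in columns]
    mean_pool = [sum(col) / len(col) for col in columns]
    return list(last) + max_pool + mean_pool

# Three time steps, hidden size 2 (the encoder above outputs 400).
states = [[1.0, -2.0], [4.0, 0.0], [1.0, 5.0]]
features = concat_pool(states)
print(features)  # [1.0, 5.0, 4.0, 5.0, 2.0, 1.0]
```

Pooling over time gives the head a fixed-size input regardless of how long each narrative is.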

In [17]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(5e-3/2., 5e-3))

epoch  train loss  valid loss  accuracy
0      1.649309    1.266621    0.642936
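
`freeze_to(-2)` leaves only the last two layer groups trainable, so this epoch updates the end of the network while earlier groups keep their pretrained weights. A minimal sketch of that indexing follows; the group names are hypothetical stand-ins for the learner's real layer groups.

```python
# Hypothetical layer groups, ordered from input to output.
layer_groups = ["embedding", "lstm_0", "lstm_1", "lstm_2", "classifier_head"]

def freeze_to(groups, n):
    """Return {group: trainable?} with groups[:n] frozen and groups[n:] trainable."""
    trainable = {g: False for g in groups[:n]}
    trainable.update({g: True for g in groups[n:]})
    return trainable

print(freeze_to(layer_groups, -1))  # only the last group trainable
print(freeze_to(layer_groups, -2))  # last two groups trainable
print(freeze_to(layer_groups, 0))   # everything trainable, like unfreeze()
```

Unfreezing a little at a time lets the new classifier layers settle before the pretrained layers start to move.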


In [19]:
learn.unfreeze()
learn.fit_one_cycle(1, slice(2e-3/100, 2e-3))

epoch  train loss  valid loss  accuracy
0      1.137662    0.900267    0.753211
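
Passing `slice(2e-3/100, 2e-3)` asks for different learning rates per layer group: the earliest group gets the low end and the last group gets the high end. Later fastai v1 releases space the intermediate rates geometrically; whether this exact version does the same is an assumption, as is the group count of 5 used below.

```python
def spread_lrs(lr_slice, n_groups):
    """Geometrically spaced learning rates from slice.start to slice.stop."""
    start, stop = lr_slice.start, lr_slice.stop
    if n_groups == 1:
        return [stop]
    ratio = (stop / start) ** (1.0 / (n_groups - 1))
    return [start * ratio ** i for i in range(n_groups)]

# Same slice as in the cell above; 5 layer groups is a hypothetical count.
lrs = spread_lrs(slice(2e-3 / 100, 2e-3), n_groups=5)
for i, lr in enumerate(lrs):
    print(f"group {i}: lr = {lr:.2e}")
```

The idea is that early layers hold general language knowledge and should change slowly, while the task-specific layers at the end can take larger steps.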


In [20]:
learn.unfreeze()
learn.fit_one_cycle(5, slice(2e-3/100, 2e-3))

epoch  train loss  valid loss  accuracy
0      1.099562    0.853967    0.770926
1      0.990908    0.801697    0.783437
2      0.859269    0.762832    0.797608
3      0.865479    0.763538    0.797941
4      0.787889    0.757117    0.799491


In [0]:
learn.fit_one_cycle(5, slice(2e-3/100, 2e-3))

epoch  train loss  valid loss  accuracy
0      0.810765    0.747445    0.800819
1      0.847537    0.755049    0.802480
2      0.785435    0.746593    0.800266
