<a href="https://colab.research.google.com/github/ameasure/colab_tutorials/blob/master/Finetune2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transfer Learning with Finetune 
Finetune is a library that provides a scikit-learn style `fit()`/`predict()` interface to a variety of state-of-the-art pretrained language models, making them much easier to use.

# Resources:
* [Finetune Quick Start Guide](https://finetune.indico.io/)
* [Finetune Source Code](https://github.com/IndicoDataSolutions/finetune)
* [GPT1 Paper](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) (Default model in Finetune)
* [GPT2 Paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

# Download Packages and Data

In [2]:
!pip install -U finetune
!wget 'https://github.com/ameasure/autocoding-class/raw/master/msha.xlsx'

Requirement already up-to-date: finetune in /usr/local/lib/python3.6/dist-packages (0.6.7)
--2019-06-16 18:30:45--  https://github.com/ameasure/autocoding-class/raw/master/msha.xlsx
Resolving github.com (github.com)... 52.74.223.119
Connecting to github.com (github.com)|52.74.223.119|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ameasure/autocoding-class/master/msha.xlsx [following]
--2019-06-16 18:30:46--  https://raw.githubusercontent.com/ameasure/autocoding-class/master/msha.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4183086 (4.0M) [application/octet-stream]
Saving to: ‘msha.xlsx.3’


2019-06-16 18:30:46 (152 MB/s) - ‘msha.xlsx.3’ saved [4183086/4183086]



In [3]:
import pandas as pd

df = pd.read_excel('msha.xlsx')
df['ACCIDENT_YEAR'] = df['ACCIDENT_DT'].apply(lambda x: x.year)
df['ACCIDENT_YEAR'].value_counts()
df_train = df[df['ACCIDENT_YEAR'].isin([2010, 2011])][:3200].copy()
df_valid = df[df['ACCIDENT_YEAR'] == 2012][:1000].copy()
print('training rows:', len(df_train))
print('validation rows:', len(df_valid))

training rows: 3200
validation rows: 1000


# Train Classifier
By default, Finetune uses the GPT1 pretrained model. The parameters set below are:
* `batch_size` - The number of training examples used to calculate each gradient update. Bigger batches train the model faster but take up more memory on the GPU. If the batch is too big you will get out-of-memory errors.
* `max_length` - The maximum number of byte-pair encoded tokens (whole words or word pieces) that will be considered in each training example. You want this to be just big enough for your data. Longer lengths require more processing time and more GPU memory but also allow the model to read all of the text in longer narratives.
* `n_epochs` - The number of complete passes through the training set. Too many and the model risks overfitting. Too few and it risks underfitting.
* `val_size` - The number of examples from the training set that will be used to periodically validate the model during training. It is set to 0 here because we evaluate on a separate validation set.

Other parameters are left at their defaults. See [configuration options](https://finetune.indico.io/#finetune-model-configuration-options) for other options.
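
One way to pick `max_length` is to look at how long the narratives actually are. The sketch below is a rough illustration and not part of Finetune: `length_percentile` is a hypothetical helper, the narratives are made-up stand-ins for `df_train['NARRATIVE']`, and whitespace-separated words are only a proxy for byte-pair encoded tokens, which are usually more numerous.

```python
import math


def length_percentile(texts, pct=95):
    """Return the pct-th percentile (nearest-rank) of whitespace word counts."""
    counts = sorted(len(t.split()) for t in texts)
    rank = max(1, math.ceil(pct / 100 * len(counts)))
    return counts[rank - 1]


# Hypothetical narratives standing in for df_train['NARRATIVE']
narratives = [
    "Employee slipped on ice and twisted ankle",
    "Hearing loss",
    "Coal rolled out from rib striking employee on the lower leg",
    "Employee was welding and a foreign body entered the eye",
]

# Most narratives are no longer than this many words
p95 = length_percentile(narratives, pct=95)
print("95th percentile word count:", p95)
```

Because words undercount tokens, it is safer to leave some headroom above the number this reports.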

In [4]:
from finetune import Classifier

model = Classifier(batch_size=32, 
                   max_length=90, 
                   n_epochs=4, 
                   val_size=0)
model.fit(df_train['NARRATIVE'], df_train['INJ_BODY_PART'])

I0616 18:32:21.420229 139771120363392 base.py:104] Saving tensorboard output to /tmp/Finetunexxxoohxu
I0616 18:32:21.469501 139771120363392 config.py:78]  Visible GPUs: {0: Tesla T4}
Epoch 4/4: 100%|██████████| 3200/3200 [01:03<00:00, 22.55it/s]


In [5]:
# re-use the existing tensorflow graph
with model.cached_predict():
  # generate predictions
  df_valid['PREDICTED_PART'] = model.predict(df_valid['NARRATIVE'].values)
# look at a sample
df_valid[['NARRATIVE', 'INJ_BODY_PART', 'PREDICTED_PART']].sample(5)

  "Fallback behaviour is to use the first {} byte-pair encoded tokens".format(max_length - 2)
  max_length
Inference: 100%|██████████| 1000/1000 [00:19<00:00, 51.77it/s]


Unnamed: 0,NARRATIVE,INJ_BODY_PART,PREDICTED_PART
2486,Coal rolled out from rib striking employee on ...,LOWER LEG/TIBIA/FIBULA,LOWER LEG/TIBIA/FIBULA
3452,Employee was using a knife to cut a rope and s...,FINGER(S)/THUMB,FINGER(S)/THUMB
3457,Walking on belt line said too much weight on r...,BACK (MUSCLES/SPINE/S-CORD/TAILBONE),MULTIPLE PARTS (MORE THAN ONE MAJOR)
3106,Employee was welding and a foreign body enter ...,EYE(S) OPTIC NERVE/VISON,EYE(S) OPTIC NERVE/VISON
1984,This accident is still under investigation and...,BACK (MUSCLES/SPINE/S-CORD/TAILBONE),BACK (MUSCLES/SPINE/S-CORD/TAILBONE)


In [7]:
# calculate the predicted probabilities
with model.cached_predict():
  df_valid['PROB_DICT'] = model.predict_proba(df_valid['NARRATIVE'].values)
  
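# widen the column display; display() is needed because an expression
# inside a with block is not shown automatically
with pd.option_context('display.max_colwidth', 500):
  display(df_valid[['NARRATIVE', 'PREDICTED_PART', 'PROB_DICT']].head(2))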

  "Fallback behaviour is to use the first {} byte-pair encoded tokens".format(max_length - 2)
  max_length
Inference: 100%|██████████| 1000/1000 [00:07<00:00, 126.04it/s]


In [8]:
# function that takes a row of our dataframe and returns the predicted probability
def get_probability(row):
    predicted_part = row['PREDICTED_PART']
    probability_dict = row['PROB_DICT']
    return probability_dict[predicted_part]

# apply get_probability to each row in our dataframe and store the result
df_valid['PREDICTED_PROB'] = df_valid.apply(func=get_probability, axis=1)
# take a peek at what we get
df_valid[['NARRATIVE', 'INJ_BODY_PART', 'PREDICTED_PART', 'PREDICTED_PROB']].sample(2)

Unnamed: 0,NARRATIVE,INJ_BODY_PART,PREDICTED_PART,PREDICTED_PROB
421,Hearing Loss,EAR(S) INTERNAL & HEARING,EAR(S) INTERNAL & HEARING,4e-06
3249,Employee was performing routine maintenance on...,BODY SYSTEMS,BODY SYSTEMS,3.3e-05
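
A quick consistency check on this lookup: if `PROB_DICT` maps each label to its probability, as the `get_probability` function assumes, then the predicted label should normally be the key with the largest value. The sketch below uses a made-up probability dictionary. `top_prediction` is a hypothetical helper, not a Finetune function.

```python
def top_prediction(prob_dict):
    """Return the (label, probability) pair with the highest probability."""
    label = max(prob_dict, key=prob_dict.get)
    return label, prob_dict[label]


# Hypothetical probabilities for a single narrative
probs = {"FINGER(S)/THUMB": 0.72, "HAND": 0.20, "WRIST": 0.08}

label, prob = top_prediction(probs)
print(label, prob)
```

Comparing this label against `PREDICTED_PART` row by row is one way to confirm that predictions and probabilities line up.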


In [9]:
from sklearn.metrics import accuracy_score, f1_score

mf1 = f1_score(df_valid['INJ_BODY_PART'], df_valid['PREDICTED_PART'], average='macro')
acc = accuracy_score(df_valid['INJ_BODY_PART'], df_valid['PREDICTED_PART'])
print('macro-f1:', mf1)
print('accuracy:', acc)

macro-f1: 0.6082996613119874
accuracy: 0.805


  'precision', 'predicted', average, warn_for)
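
Macro-F1 is noticeably lower than accuracy here. Macro averaging weights every class equally, so a rare class that the model never predicts contributes an F1 of 0 no matter how few examples it has. The truncated warning fragment in the output above appears to be scikit-learn's notice about exactly such labels. The sketch below reproduces the effect on a made-up label set in plain Python. `f1_per_class` is an illustrative helper, not the scikit-learn implementation.

```python
def f1_per_class(y_true, y_pred):
    """Compute F1 for each label; a label with no true positives scores 0."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical data: a common class and a rare class that is never predicted
y_true = ["FINGER"] * 8 + ["EAR"] * 2
y_pred = ["FINGER"] * 10

scores = f1_per_class(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(scores.values()) / len(scores)
print("accuracy:", accuracy)
print("macro-f1:", round(macro_f1, 3))
```

Accuracy stays high because the common class dominates, while the zero for the rare class halves the macro average.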


# Semi-Supervised Learning

One of the benefits of language model pretraining is that it also allows us to further pretrain the model on our own unlabeled data. This often improves performance even further. We illustrate this below, assuming that we have access to only 100 labeled examples and 18,681 unlabeled examples.
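
The workflow has two stages: fit once on text alone, then fit again on the small labeled set. The sketch below shows only the order of calls. `RecordingModel` is a hypothetical stand-in that logs what it is asked to do, and `two_stage_fit` is an illustrative helper. Neither is part of Finetune, and a real run would pass a `Classifier` instead.

```python
class RecordingModel:
    """Hypothetical stand-in for a classifier; it only logs fit() calls."""

    def __init__(self):
        self.calls = []

    def fit(self, X, Y=None):
        # No labels means language-model-only training in this sketch
        self.calls.append("supervised" if Y is not None else "unsupervised")
        return self


def two_stage_fit(model, unlabeled_texts, labeled_texts, labels):
    """Adapt to unlabeled text first, then train on the labeled subset."""
    model.fit(unlabeled_texts)
    model.fit(labeled_texts, labels)
    return model


# Made-up data standing in for the narratives and body-part labels
unlabeled = ["narrative one", "narrative two", "narrative three"]
labeled = ["narrative one"]
labels = ["FINGER(S)/THUMB"]

model = two_stage_fit(RecordingModel(), unlabeled, labeled, labels)
print(model.calls)
```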

In [37]:
from sklearn.model_selection import train_test_split

df_train = df[df['ACCIDENT_YEAR'].isin([2010, 2011])].copy()
# randomly sample 100 rows to serve as our small labeled training set
df_small_train = df_train.sample(100)
# compare the size of the labeled sample with the full training pool
print(f'labeled examples in small train: {len(df_small_train)}')           
print(f'total "unlabeled" example in big train: {len(df_train)}')

labeled examples in small train: 100
total "unlabeled" example in big train: 18681


In [38]:
# max_length is the maximum number of tokens we will use from each narrative
model = Classifier(batch_size=32,
                   max_length=90,
                   n_epochs=5,
                   val_size=0)
# baseline: train the classifier on the 100 labeled examples only
model.fit(df_small_train['NARRATIVE'], df_small_train['INJ_BODY_PART'])

I0616 20:18:23.854972 139771120363392 base.py:104] Saving tensorboard output to /tmp/Finetunew0zalxzs
Epoch 5/5: 100%|██████████| 100/100 [00:01<00:00, 51.50it/s]


In [39]:
# re-use the existing tensorflow graph
with model.cached_predict():
  # generate predictions
  preds = model.predict(df_valid['NARRATIVE'].values)
acc = accuracy_score(y_true=df_valid['INJ_BODY_PART'], y_pred=preds)
mf1 = f1_score(y_true=df_valid['INJ_BODY_PART'], y_pred=preds, average='macro')
print(f'accuracy={acc}')
print(f'macro-f1={mf1}')

  "Fallback behaviour is to use the first {} byte-pair encoded tokens".format(max_length - 2)
  max_length
Inference: 100%|██████████| 1000/1000 [00:20<00:00, 49.10it/s]

accuracy=0.179
macro-f1=0.014154086241981731



  'precision', 'predicted', average, warn_for)


In [42]:
# max_length is the maximum number of tokens we will use from each narrative
model = Classifier(batch_size=32, 
                   max_length=90, 
                   n_epochs=1, 
                   val_size=0,
                   dataset_size=len(df_train['NARRATIVE']))
# finetune the language model to our narratives (no labels are used, just narratives)
pretrain_generator = lambda: iter(df_train['NARRATIVE'])
model.fit(pretrain_generator)

NameError: ignored
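
The cell above passes a zero-argument callable instead of an iterator and sets `dataset_size` explicitly. A plausible reason, stated here as an assumption about how the library consumes its input and not something the notebook confirms, is that a plain iterator can be read only once and has no length. The plain-Python sketch below shows both properties using made-up narratives.

```python
narratives = ["first narrative", "second narrative", "third narrative"]

# A one-shot iterator is empty after a single pass
one_shot = iter(narratives)
first_pass = list(one_shot)
second_pass = list(one_shot)

# A zero-argument callable hands out a fresh iterator on every call
make_iter = lambda: iter(narratives)
pass_a = list(make_iter())
pass_b = list(make_iter())

# Iterators do not support len(), so the size must be supplied separately
has_len = hasattr(make_iter(), "__len__")

print(len(first_pass), len(second_pass))
print(len(pass_a), len(pass_b))
print(has_len)
```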

In [0]:
# df_small_train is the labeled sample defined earlier in this notebook
model.config.dataset_size = len(df_small_train['NARRATIVE'])
model.config.n_epochs = 20
model.fit(df_small_train['NARRATIVE'], df_small_train['INJ_BODY_PART'])

In [65]:
# re-use the existing tensorflow graph
with model.cached_predict():
  # generate predictions
  preds = model.predict(df_valid['NARRATIVE'].values)
acc = accuracy_score(y_true=df_valid['INJ_BODY_PART'], y_pred=preds)
print(f'accuracy={acc}')

  "Fallback behaviour is to use the first {} byte-pair encoded tokens".format(max_length - 2)
  max_length
Inference: 100%|██████████| 1000/1000 [00:22<00:00, 44.81it/s]


accuracy=0.076


In [53]:
model.predict(df_strata_train['NARRATIVE'])

Inference:   0%|          | 0/223 [00:00<?, ?it/s]


StopIteration: ignored

In [24]:
model.config.lr_schedule

'warmup_linear'

In [25]:
model.config.lr_warmup

0.002

In [26]:
model.lr

AttributeError: ignored