<a href="https://colab.research.google.com/github/ameasure/colab_tutorials/blob/master/Finetune2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transfer Learning with Finetune
Finetune is a library that wraps a variety of state-of-the-art pretrained language models in a scikit-learn style `fit()`/`predict()` interface, which makes them much easier to use.
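
To make the `fit()`/`predict()` convention concrete, here is a minimal sketch of a class that follows it. `MajorityClassifier` and the example texts and labels are hypothetical and are not part of Finetune; the class only remembers the most frequent training label, which also makes it a rough baseline to compare a real model against.

```python
from collections import Counter


class MajorityClassifier:
    """Toy stand-in (not part of Finetune) that follows the fit/predict convention."""

    def fit(self, texts, labels):
        # "Training" here just remembers the most frequent label.
        self.most_common_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, texts):
        # Predict the remembered label for every input text.
        return [self.most_common_ for _ in texts]


# Hypothetical example data, for illustration only
train_texts = ["cut finger on knife", "slipped and hurt back", "pinched finger in door"]
train_labels = ["FINGER", "BACK", "FINGER"]

baseline = MajorityClassifier().fit(train_texts, train_labels)
baseline_predictions = baseline.predict(["dust in eye", "strained back"])
print(baseline_predictions)
```

Any object exposing these two methods can be swapped into the same training and evaluation code, which is the convenience the scikit-learn style interface provides.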

# Resources:
* [Finetune Quick Start Guide](https://finetune.indico.io/)
* [Finetune Source Code](https://github.com/IndicoDataSolutions/finetune)
* [GPT1 Paper](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) (Default model in Finetune)
* [GPT2 Paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

In [1]:
!pip install -U finetune
!wget 'https://github.com/ameasure/autocoding-class/raw/master/msha.xlsx'

Requirement already up-to-date: finetune in /usr/local/lib/python3.6/dist-packages (0.6.7)
--2019-06-16 15:59:33--  https://github.com/ameasure/autocoding-class/raw/master/msha.xlsx
Resolving github.com (github.com)... 13.250.177.223
Connecting to github.com (github.com)|13.250.177.223|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ameasure/autocoding-class/master/msha.xlsx [following]
--2019-06-16 15:59:33--  https://raw.githubusercontent.com/ameasure/autocoding-class/master/msha.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4183086 (4.0M) [application/octet-stream]
Saving to: ‘msha.xlsx.2’


2019-06-16 15:59:34 (124 MB/s) - ‘msha.xlsx.2’ saved [4183086/4183086]



In [2]:
import pandas as pd

df = pd.read_excel('msha.xlsx')
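# extract the accident year from the accident date
df['ACCIDENT_YEAR'] = df['ACCIDENT_DT'].apply(lambda x: x.year)
# number of narratives per year (not displayed, since it is not the cell's last expression)
df['ACCIDENT_YEAR'].value_counts()
# train on the first 3,200 narratives from 2010-2011, validate on the first 1,000 from 2012
df_train = df[df['ACCIDENT_YEAR'].isin([2010, 2011])][:3200].copy()
df_valid = df[df['ACCIDENT_YEAR'] == 2012][:1000].copy()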
print('training rows:', len(df_train))
print('validation rows:', len(df_valid))

training rows: 3200
validation rows: 1000


In [20]:
import math

batch_size = 32
val_size = 320
train_size = len(df_train) - val_size
# validate once every epoch
val_interval = math.floor(train_size / batch_size)
print(f'val_interval={val_interval}')

val_interval=90
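
The arithmetic behind that value can be sketched as a small helper. `validation_interval` is a hypothetical name introduced here for illustration; it assumes, as the cell above does, that the interval is counted in training batches.

```python
import math


def validation_interval(n_rows, val_size, batch_size):
    """Number of full training batches in one epoch (hypothetical helper)."""
    # Rows held out for validation are not used for training.
    train_size = n_rows - val_size
    # Only complete batches are counted, hence the floor.
    return math.floor(train_size / batch_size)


# Same numbers as the cell above: (3200 - 320) / 32 = 90
interval = validation_interval(n_rows=3200, val_size=320, batch_size=32)
print(interval)

# When the training size is not a multiple of the batch size, the result is rounded down
uneven_interval = validation_interval(n_rows=1000, val_size=100, batch_size=32)
print(uneven_interval)
```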


In [21]:
from finetune import Classifier

# max_length is the maximum number of byte-pair encoded tokens we will use from each narrative
model = Classifier(batch_size=batch_size, 
                   max_length=90, 
                   n_epochs=4, 
                   val_size=val_size,
                   val_interval=val_interval,
                   eval_acc=True)
model.fit(df_train['NARRATIVE'], df_train['INJ_BODY_PART'])

I0616 16:32:51.127326 140626639210368 base.py:104] Saving tensorboard output to /tmp/Finetunem8dmeg_p
I0616 16:33:10.129099 140626639210368 model.py:221] Adding evaluation metrics, Accuracy
Epoch 2/4:   3%|▎         | 96/2880 [00:01<00:53, 51.60it/s]
Validation:   0%|          | 0/320 [00:00<?, ?it/s]
Validation:  15%|█▌        | 49/320 [00:00<00:00, 484.47it/s]
Validation:  29%|██▉       | 93/320 [00:00<00:00, 469.41it/s]
Validation:  30%|███       | 96/320 [00:00<00:02, 78.58it/s]
Validation:  40%|████      | 128/320 [00:00<00:02, 92.57it/s]
Validation:  50%|█████     | 160/320 [00:00<00:01, 105.61it/s]
Validation:  60%|██████    | 192/320 [00:00<00:01, 117.55it/s]
Validation:  70%|███████   | 224/320 [00:01<00:00, 124.67it/s]
Validation:  80%|████████  | 256/320 [00:01<00:00, 129.54it/s]
Validation:  90%|█████████ | 288/320 [00:01<00:00, 137.39it/s]
Validation: 100%|██████████| 320/320 [00:01<00:00, 142.00it/s]
Epoch 3/4:   3%|▎         | 96/2880 [0

In [22]:
# re-use the existing tensorflow graph
with model.cached_predict():
  # generate predictions
  df_valid['PREDICTED_PART'] = model.predict(df_valid['NARRATIVE'].values)
# look at a sample
df_valid[['NARRATIVE', 'INJ_BODY_PART', 'PREDICTED_PART']].sample(5)

  "Fallback behaviour is to use the first {} byte-pair encoded tokens".format(max_length - 2)
  max_length
Inference: 100%|██████████| 1000/1000 [00:18<00:00, 54.27it/s]


Unnamed: 0,NARRATIVE,INJ_BODY_PART,PREDICTED_PART
2486,Coal rolled out from rib striking employee on left lower leg causing a contusion. ****DID NOT START LOSING TIME UNTIL 3/28/12.****,LOWER LEG/TIBIA/FIBULA,LOWER LEG/TIBIA/FIBULA
3452,Employee was using a knife to cut a rope and states that the knife slipped and cut his lt. little finger.,FINGER(S)/THUMB,FINGER(S)/THUMB
3457,"Walking on belt line said too much weight on right leg, slipped and fell on back. Had pain in back and right leg was trying to step up on walk board which was wet and muddy.",BACK (MUSCLES/SPINE/S-CORD/TAILBONE),MULTIPLE PARTS (MORE THAN ONE MAJOR)
3106,Employee was welding and a foreign body enter his left eye.,EYE(S) OPTIC NERVE/VISON,EYE(S) OPTIC NERVE/VISON
1984,This accident is still under investigation and my be revised at a later date. Employee had parked rock truck to help crusher crew. When he stepped on the second to last step he slipped and fell to the ground into his back. He did not wish for medical attention at that time. On 1/20/12 he decided to see a doctor.,BACK (MUSCLES/SPINE/S-CORD/TAILBONE),BACK (MUSCLES/SPINE/S-CORD/TAILBONE)
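
The fallback warning in the output above says that only the first `max_length - 2` byte-pair encoded tokens of a long narrative are used. The sketch below illustrates that truncation under two assumptions: whitespace splitting stands in for real byte-pair encoding, and the two reserved positions are presumed to hold special tokens. `truncate_tokens` is a hypothetical helper, not a Finetune function.

```python
def truncate_tokens(tokens, max_length, reserved=2):
    """Keep the first (max_length - reserved) tokens of a sequence."""
    # reserved=2 mirrors the "max_length - 2" in the warning; the assumption is
    # that those two positions are taken by special tokens.
    keep = max(max_length - reserved, 0)
    return tokens[:keep]


# Whitespace splitting is only a stand-in for byte-pair encoding
narrative = "Employee was using a knife to cut a rope and the knife slipped"
tokens = narrative.split()

truncated = truncate_tokens(tokens, max_length=10)
print(len(tokens), len(truncated))

# A narrative shorter than the limit is left untouched
untouched = truncate_tokens(tokens, max_length=90)
```

Because byte-pair encoding usually splits text into more pieces than whitespace does, a real narrative reaches the limit sooner than a word count would suggest.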


In [23]:
# widen the displayed columns so more of each probability dictionary is visible
pd.options.display.max_colwidth = 500
with model.cached_predict():
  df_valid['PROB_DICT'] = model.predict_proba(df_valid['NARRATIVE'].values)
df_valid[['NARRATIVE', 'PREDICTED_PART', 'PROB_DICT']].head(2)

  "Fallback behaviour is to use the first {} byte-pair encoded tokens".format(max_length - 2)
  max_length
Inference: 100%|██████████| 1000/1000 [00:07<00:00, 129.05it/s]


Unnamed: 0,NARRATIVE,PREDICTED_PART,PROB_DICT
2,"Employee, parked s/c on grade at 16-Block #3 Entry Spad #3868. S/c slid approx. 3' pinning oper. between s/c & rib, employee had set park brake and got off machine to move roof bolter cable.",CHEST (RIBS/BREAST BONE/CHEST ORGNS),"{'ABDOMEN/INTERNAL ORGANS': 1.6925336e-05, 'ANKLE': 1.4378819e-05, 'ARM, MULTIPLE PARTS': 1.3527091e-05, 'ARM,NEC': 3.2564294e-05, 'BACK (MUSCLES/SPINE/S-CORD/TAILBONE)': 8.9107025e-06, 'BODY SYSTEMS': 2.5714748e-05, 'BRAIN': 1.5362051e-05, 'CHEST (RIBS/BREAST BONE/CHEST ORGNS)': 2.2794888e-05, 'EAR(S) EXTERNAL': 4.5170873e-05, 'EAR(S) INTERNAL & HEARING': 4.3377437e-05, 'ELBOW': 2.8746328e-05, 'EYE(S) OPTIC NERVE/VISON': 3.5216468e-05, 'FACE, MULTIPLE PARTS': 5.3805532e-05, 'FACE,NEC': 3.61..."
5,Possible heart attack.,BODY SYSTEMS,"{'ABDOMEN/INTERNAL ORGANS': 0.00059487275, 'ANKLE': 0.00014386217, 'ARM, MULTIPLE PARTS': 0.0005284241, 'ARM,NEC': 0.0002205993, 'BACK (MUSCLES/SPINE/S-CORD/TAILBONE)': 0.0003225865, 'BODY SYSTEMS': 0.002054714, 'BRAIN': 0.102767654, 'CHEST (RIBS/BREAST BONE/CHEST ORGNS)': 0.00039653, 'EAR(S) EXTERNAL': 0.00032873746, 'EAR(S) INTERNAL & HEARING': 0.0005502523, 'ELBOW': 0.00045330886, 'EYE(S) OPTIC NERVE/VISON': 0.0011364677, 'FACE, MULTIPLE PARTS': 0.004714678, 'FACE,NEC': 0.0026187112, 'FIN..."
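
Each `PROB_DICT` entry maps every label to a probability, which is hard to read in full. A minimal sketch for pulling out only the most likely labels follows; `top_k` is a hypothetical helper and the example probabilities are made up, not taken from the output above.

```python
def top_k(prob_dict, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(prob_dict.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical probabilities, for illustration only
example = {"BACK": 0.70, "SHOULDER": 0.20, "KNEE": 0.06, "ANKLE": 0.04}

best = top_k(example, k=2)
print(best)
```

Under the assumption that `PROB_DICT` holds plain dictionaries, something like `df_valid['PROB_DICT'].apply(top_k)` would give a more compact column to inspect.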


In [24]:
# function that takes a row of our dataframe and returns the predicted probability
def get_probability(row):
    predicted_part = row['PREDICTED_PART']
    probability_dict = row['PROB_DICT']
    return probability_dict[predicted_part]

# apply get_probability to each row in our dataframe and store the result
df_valid['PREDICTED_PROB'] = df_valid.apply(func=get_probability, axis=1)
# take a peek at what we get
df_valid[['NARRATIVE', 'INJ_BODY_PART', 'PREDICTED_PART', 'PREDICTED_PROB']].sample(5)

Unnamed: 0,NARRATIVE,INJ_BODY_PART,PREDICTED_PART,PREDICTED_PROB
421,Hearing Loss,EAR(S) INTERNAL & HEARING,EAR(S) INTERNAL & HEARING,8e-06
3249,Employee was performing routine maintenance on the propel motor on shovel 4 when he experienced electric shock.,BODY SYSTEMS,BODY SYSTEMS,0.000145
463,A PIECE OF WATER LINE FELL TWO FEET AND STRUCK EMPLOYEE ON THE SHOULDER CAUSING BRUISING.,SHOULDERS (COLLARBONE/CLAVICLE/SCAPULA),SHOULDERS (COLLARBONE/CLAVICLE/SCAPULA),0.001458
2703,Operator was changing drill steel bit when the chain wrench broke and a piece of the bar hit the back/side of the operators left knee. No initial treatment was sought. Area around the impact began to swell after a couple of days and that's when the operator sought medical attention,KNEE/PATELLA,MULTIPLE PARTS (MORE THAN ONE MAJOR),0.000106
3402,Employee slipped and fell into a sump hole injuring shoulder and back. Company was notified this was a medical injury on 12/19/2012.,BACK (MUSCLES/SPINE/S-CORD/TAILBONE),MULTIPLE PARTS (MORE THAN ONE MAJOR),0.106359
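
In the sample above, several correct predictions carry probabilities well below 1%. One quick sanity check, sketched here with hypothetical helper names and made-up data, is to compare each predicted label with the highest-probability label in its dictionary and list the rows where they disagree.

```python
def argmax_label(prob_dict):
    """Label with the highest probability in a {label: probability} dict."""
    return max(prob_dict, key=prob_dict.get)


def find_mismatches(predicted_labels, prob_dicts):
    """Positions where the predicted label is not the most probable label."""
    return [
        i
        for i, (label, probs) in enumerate(zip(predicted_labels, prob_dicts))
        if label != argmax_label(probs)
    ]


# Hypothetical data, for illustration only
predicted = ["BACK", "KNEE"]
probabilities = [
    {"BACK": 0.9, "KNEE": 0.1},  # prediction agrees with the top probability
    {"BACK": 0.6, "KNEE": 0.4},  # prediction disagrees with the top probability
]

mismatches = find_mismatches(predicted, probabilities)
print(mismatches)
```

With the notebook's data, `find_mismatches(df_valid['PREDICTED_PART'], df_valid['PROB_DICT'])` should return the positions (not index labels) of any rows worth a closer look, assuming both columns are populated as in the cells above.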


In [25]:
from sklearn.metrics import accuracy_score, f1_score

mf1 = f1_score(df_valid['INJ_BODY_PART'], df_valid['PREDICTED_PART'], average='macro')
acc = accuracy_score(df_valid['INJ_BODY_PART'], df_valid['PREDICTED_PART'])
print('macro-f1:', mf1)
print('accuracy:', acc)

macro-f1: 0.5759598454236836
accuracy: 0.815


  'precision', 'predicted', average, warn_for)
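
The gap between accuracy and macro-F1 is typical when some labels are rare: accuracy weights every narrative equally, whereas macro-F1 averages the per-label F1 scores, so a label that is never predicted correctly contributes a zero no matter how few examples it has. The sketch below recomputes both metrics in plain Python on made-up labels; it is meant to illustrate the definitions rather than replace scikit-learn's implementation.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score for one label, treating that label as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        # No true positives: F1 is counted as 0 for this label.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels: a common class and a rare class that is never predicted
y_true = ["BACK"] * 8 + ["EAR"] * 2
y_pred = ["BACK"] * 10

acc_example = accuracy(y_true, y_pred)
mf1_example = macro_f1(y_true, y_pred)
print(acc_example, round(mf1_example, 4))
```

Here accuracy is 8/10, while the rare label's F1 of zero pulls the macro average down to 4/9.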


In [26]:
model.generate_text(seed_text='He was cleaning the')

'_start_he was cleaning the mess spills spills spills spills spills spill spill spill spill spill spill spill spill spill spill spill pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour pour '