In [None]:
! pip install smaberta

Collecting smaberta
  Downloading smaberta-0.0.2-py3-none-any.whl (12 kB)
Collecting tensorboardX
  Downloading tensorboardX-2.5-py2.py3-none-any.whl (125 kB)
Collecting transformers==2.6.0
  Downloading transformers-2.6.0-py3-none-any.whl (540 kB)
Collecting simpletransformers==0.22.1
  Downloading simpletransformers-0.22.1-py3-none-any.whl (144 kB)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
Collecting tokenizers==0.5.2
  Downloading tokenizers-0.5.2-cp37-cp37m-manylinux1_x86_64.whl (5.6 MB)
Collecting boto3
  Downloading boto3-1.22.9-

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import random
import torch
import pickle
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)
torch.cuda.manual_seed(1)

import sys
from smaberta import TransformerModel

In [None]:
!ls

gdrive	sample_data


In [None]:
sys.path.append('/content/gdrive/MyDrive/')

### Loading Data

Load the training data, stored in CSV format, using pandas. Pretty much any format is acceptable as long as it provides some form of text and accompanying labels; modify this according to your task. For the purpose of this tutorial, we are using a sample from the New York Times Front Page Dataset (Boydstun, 2014).

In [None]:
train_df = pd.read_csv("./data/data_train3.csv")

Loading test data

In [None]:
test_df = pd.read_csv("./data/data_test.csv")
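
As a minimal sketch of the "any format" point above, the snippet below builds the same two-column layout from an in-memory list and from tab-separated text. The example sentences, label values, and variable names are made up for illustration and do not come from the tutorial data.

```python
import io

import pandas as pd

# Hypothetical (text, label) pairs held in memory.
pairs = [("first example sentence", 0), ("second example sentence", 1)]
df_from_list = pd.DataFrame(pairs, columns=["text", "label"])

# The same layout read from tab-separated text instead of a CSV file.
tsv = "text\tlabel\nfirst example sentence\t0\nsecond example sentence\t1\n"
df_from_tsv = pd.read_csv(io.StringIO(tsv), sep="\t")

# Both frames expose the 'text' and 'label' columns used in this notebook.
print(list(df_from_list.columns), list(df_from_tsv.columns))
```

Whatever the source format, the rest of the notebook only relies on a `text` column and a `label` column being present.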

To get an idea of what this dataset looks like, we inspect a few rows.

The data is paired: free-form text accompanied by supervised labels for the task at hand. The tutorial text describes headlines of news stories, with a label that assigns each one to one of 25 possible subjects, each represented by a separate number. The sample printed below, however, shows sentences from medical case reports whose labels are binary, consistent with `num_labels=2` in the model initialisation further down.

In [None]:
print(len(train_df.label.values))

13642


In [None]:
train_df.head()

| Unnamed: 0 | text | label |
|---|---|---|
| 0 | On day 8 after alloBMT, the patient suddenly m... | 0 |
| 1 | Similar phenomenon associated with gemcitabine... | 0 |
| 2 | No renal or cardiac toxicity was observed. | 0 |
| 3 | The authors report two cases of catechol-O-met... | 0 |
| 4 | This paper describes a possible side effect pr... | 0 |


In [None]:
print(train_df.text[:10].tolist(), train_df.label[:10].tolist())

['On day 8 after alloBMT, the patient suddenly manifested high-grade fever, transfusion-resistant severe anemia, and thrombocytopenia.', 'Similar phenomenon associated with gemcitabine, the only FDA-approved drug for pancreatic cancer, is rarely reported.', 'No renal or cardiac toxicity was observed.', 'The authors report two cases of catechol-O-methyltransferase (COMT) inhibitor-induced asymptomatic hepatic dysfunction in women with Parkinson disease.', 'This paper describes a possible side effect previously unreported--papilledema not associated with peripheral neuropathy.', 'CONCLUSIONS: The piloerection observed after the replacement of fluvoxamine with milnacipran in this patient appears to have been due to an increase in the alpha(1)-adrenoceptor occupancy by endogenous norepinephrine induced by milnacipran.', '2-CdA typically causes a long-lasting state of immunodeficiency and the profound influence of this drug on the immune system has raised questions concerning the emergence 

### Learning Parameters
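These are the training arguments used to train the classifier. For the purposes of this tutorial we set some sample values; in practice you would typically choose them with a grid search or a random search with cross-validation. Note that a learning rate of 1e-3 is much larger than the 1e-5 to 5e-5 range commonly used when fine-tuning BERT-style models.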

In [None]:
lr = 1e-3
epochs = 5
print("Learning Rate ", lr)
print("Train Epochs ", epochs)

Learning Rate  0.001
Train Epochs  5
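
The grid search mentioned above could look roughly like the sketch below. It is not part of smaberta: `grid_search`, `train_and_score`, and `dummy_score` are hypothetical names, and the dummy scorer only exists so the snippet runs on its own. In a real run the scorer would train a model with the given settings and return a score measured on a held-out validation split, not on the test set.

```python
from itertools import product


def grid_search(train_and_score, learning_rates, epoch_counts):
    """Return (best_score, best_lr, best_epochs) over all combinations."""
    best = None
    for lr, epochs in product(learning_rates, epoch_counts):
        score = train_and_score(lr, epochs)
        if best is None or score > best[0]:
            best = (score, lr, epochs)
    return best


def dummy_score(lr, epochs):
    """Placeholder scorer; replace with real training plus validation."""
    return -abs(lr - 2e-5) * 1e4 - abs(epochs - 3) * 0.01


best_score, best_lr, best_epochs = grid_search(
    dummy_score, learning_rates=[1e-5, 2e-5, 5e-5], epoch_counts=[2, 3, 5]
)
print(best_lr, best_epochs)
```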


### Initialise model
1. The first argument selects the RoBERTa architecture (alternatives include BERT, XLNet, and others provided by Hugging Face). It also determines which tokenizer and classification head are used.
2. The second argument names the pretrained checkpoint to initialise from, as listed by Hugging Face [here](https://huggingface.co/transformers/pretrained_models.html). Examples: roberta-base, roberta-large, gpt2-large.
3. The tokenizer accepts free-form text and transforms it into a sequence of tokens suitable for input to the transformer. The transformer processes these tokens and passes the resulting representation to the classifier head, which maps it into the label space.
4. The number of labels is specified below so that the classification head is initialised with the right size. Change it to match your classification task.
5. The training arguments set above (learning rate and number of epochs) are passed in at model initialisation below.
6. Pass in the remaining training arguments at initialisation. Note in particular the output directory, where the model is saved and the training logs are written. `overwrite_output_dir` is a safeguard for when you rerun the experiment. Similarly, if you rerun the same experiment with different parameters, you may not want to reprocess the input every time: it is cached after the first run, so you may be able to reuse it. `fp16` refers to 16-bit floating point precision, which you set according to the GPUs available to you; it shouldn't affect the classification result, just the performance.
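
To make the flow in point 3 concrete, here is a deliberately tiny NumPy sketch of the text to tokens to representation to label scores pipeline. It is a toy under stated assumptions, not RoBERTa and not how smaberta is implemented: the whitespace tokenizer, the five-word vocabulary, the random embeddings, and the mean-pooling "encoder" are all stand-ins invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a tiny vocabulary, random embeddings, and a random linear head.
vocab = {"<unk>": 0, "no": 1, "toxicity": 2, "was": 3, "observed": 4}
hidden_size, num_labels = 8, 2
embeddings = rng.normal(size=(len(vocab), hidden_size))
head_weights = rng.normal(size=(hidden_size, num_labels))
head_bias = np.zeros(num_labels)


def tokenize(text):
    """Map whitespace-separated words to integer ids (unknown words map to 0)."""
    return [vocab.get(word, 0) for word in text.lower().split()]


def classify(text):
    token_ids = tokenize(text)                   # text -> token ids
    hidden = embeddings[token_ids].mean(axis=0)  # stand-in for the transformer encoder
    return hidden @ head_weights + head_bias     # classifier head -> one score per label


logits = classify("No toxicity was observed")
print(logits.shape)  # (2,)
```

The size of the final vector is what `num_labels` controls: one score per label.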

In [None]:
model = TransformerModel('roberta', 'roberta-base', num_labels=2, reprocess_input_data=True, num_train_epochs=epochs, learning_rate=lr, 
                  output_dir='./saved_model/', overwrite_output_dir=True, fp16=False)

### Run training

In [None]:
# Train on the training texts with their matching training labels.
model.train(train_df['text'], train_df['label'])

Starting Epoch:  0
Starting Epoch:  1
Starting Epoch:  2
Starting Epoch:  3
Starting Epoch:  4
Training of roberta model complete. Saved to ./saved_model/.


To see more in-depth logs, set the flag `show_running_loss=True` when calling the training function.

### Inference from model

At training time the model is saved to the output directory passed in at initialisation. We can either keep using the same model object or load the model from the directory it was saved to. This example loads it from disk to illustrate how that is done, which is helpful when you want to train and save a classifier once and then use it sporadically. For example, in an online setting where you have some labelled training data, you would train and save a model, then load it to classify tweets as your collection pipeline progresses.

In [None]:
model = TransformerModel('roberta', 'roberta-base',  num_labels=2, location="./saved_model/")

### Evaluate on test set

At inference time we have access to the model outputs, which we can use to make predictions as shown below. You could likewise perform any empirical analysis on the outputs before or after saving them; typically you would save the results for replication purposes. You can use the model outputs as you would those of a normal PyTorch model; here we just show label predictions and accuracy. This tutorial uses only a fraction of the available data, which is why the accuracy is not great. For the full results of our experiments, see our paper.

In [None]:
result, model_outputs, wrong_predictions = model.evaluate(test_df['text'], test_df['label'])
preds = np.argmax(model_outputs, axis = 1)


{'mcc': 0.0, 'tp': 0, 'tn': 2468, 'fp': 0, 'fn': 1059}
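
As a worked check on the numbers above, the sketch below recomputes accuracy and the Matthews correlation coefficient (MCC) from the four confusion counts using only the standard library. With `tp` and `fp` both zero, the model never predicted the positive class, so accuracy equals the share of negative examples and the MCC denominator is zero. Treating MCC as 0 in that case is a common convention and is assumed here.

```python
import math

# Counts copied from the evaluation output above.
tp, tn, fp, fn = 0, 2468, 0, 1059

total = tp + tn + fp + fn
accuracy = (tp + tn) / total

# MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn));
# by convention it is taken as 0 when the denominator is 0.
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0

print(total, round(accuracy, 4), mcc)  # 3527 0.6997 0.0
```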


In [None]:
len(test_df), len(preds)

(3527, 3527)
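
The following sketch shows one way to work with raw model outputs, assuming (as the `np.argmax(model_outputs, axis = 1)` call suggests) that they form a 2-D array with one row per example and one score per label. The logits and labels here are made-up values, and the `example_` names are hypothetical.

```python
import numpy as np

# Hypothetical logits for three examples and two labels.
example_outputs = np.array([[2.0, -1.0], [0.5, 0.5], [-0.3, 1.2]])
example_labels = np.array([0, 1, 1])

# Softmax, with the row maximum subtracted for numerical stability.
shifted = example_outputs - example_outputs.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# argmax over probabilities gives the same labels as argmax over raw logits.
example_preds = probs.argmax(axis=1)
example_accuracy = (example_preds == example_labels).mean()
print(example_preds, example_accuracy)
```

Softmax is monotonic within each row, so it is only needed when you want probabilities; the predicted label is unchanged.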

In [None]:
labels = test_df['label'].tolist()
print(len(labels))
print(labels)
print(preds)

correct = 0    # predictions that match the gold label
counter = 0    # predictions that are not class 0
for pred, label in zip(preds, labels):
    if pred != 0:
        counter += 1
    if pred == label:
        correct += 1

print(counter)
accuracy = correct / len(labels)
print("Accuracy: ", accuracy)

3527
[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0

In [None]:
# Use a context manager so the file is closed once the outputs are written.
with open("../model_outputs.pkl", "wb") as f:
    pickle.dump(model_outputs, f)
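
For replication, saved outputs can be read back with `pickle.load`. The round trip below is a small self-contained sketch: the `outputs` array is a stand-in for `model_outputs`, and the temporary directory is used only so the example does not touch your real files.

```python
import os
import pickle
import tempfile

import numpy as np

outputs = np.array([[0.9, 0.1], [0.2, 0.8]])  # stand-in for model_outputs

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_outputs.pkl")
    with open(path, "wb") as f:
        pickle.dump(outputs, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(np.array_equal(outputs, restored))
```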

### Run inference 

This is the use case where you have a new set of documents and no labels. If you just have a list of texts rather than a pandas DataFrame, you can make predictions as shown below. Note that you get both the predictions and the model outputs.

In [None]:
texts = test_df['text'].tolist()

In [None]:
preds, model_outputs = model.predict(texts)

KeyboardInterrupt: ignored

In [None]:
correct = 0
for i in range(len(labels)):
    if preds[i] == labels[i]:
        correct+=1

accuracy = correct/len(labels)
print("Accuracy: ", accuracy)

Accuracy:  0.7058823529411765


### References

Boydstun, Amber E. (2014). New York Times Front Page Dataset. www.comparativeagendas.net. Accessed April 26, 2019.



