### Text Classification Training with Simple Transformers
This notebook shows a simple text classification workflow with the Simple Transformers library, evaluated with 5-fold cross-validation. We use a multi-label classification model even though the BBC dataset is multi-class, so that we can extract usable activation values for visualization in other notebooks. Simple Transformers does not let you directly change the output-layer activation function of its models, so rather than resorting to vanilla PyTorch, the workaround here is simply to change the model type.
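
To illustrate why the model type matters for the activations, here is a minimal sketch comparing a softmax (the usual multi-class output) with independent per-label sigmoids (the usual multi-label output) on the same logits. The logit values are made up, and the assumption that the multi-label model applies one sigmoid per label is not confirmed anywhere in this notebook.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; outputs compete and sum to 1.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    # Each output is squashed independently, so the values need not sum to 1.
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

logits = [2.0, 0.5, -1.0, -3.0, 0.0]  # hypothetical logits for 5 classes
softmax_out = softmax(logits)
sigmoid_out = sigmoid(logits)

print("softmax:", [round(p, 3) for p in softmax_out])
print("sigmoid:", [round(p, 3) for p in sigmoid_out])
```

Under this assumption, each sigmoid output can be read as a per-class activation on its own, which is likely what makes it convenient for histogram-style visualization.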

In [23]:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from simpletransformers.classification import MultiLabelClassificationModel
import dill

In [24]:
df = pd.read_csv('res/bbc.csv')

In [27]:
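# truncate each text to its first 450 characters to keep inputs short
X = df.text.apply(lambda x: x[:450]).tolist()
# one-hot encode each label as a list of 0/1 values (one slot per class)
labels = df.label.unique()
y = df.label.apply(lambda x: [1 if l == x else 0 for l in labels]).to_numpy()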
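
The encoding step above can be sketched without pandas on a toy list of labels. The label names here are hypothetical; `dict.fromkeys` is used only to mimic the first-appearance order that `Series.unique()` returns.

```python
toy_labels = ["sport", "tech", "sport", "business"]  # hypothetical labels

# Unique classes in order of first appearance.
classes = list(dict.fromkeys(toy_labels))

# One list per row, with a 1 in the slot of the row's class and 0 elsewhere.
encoded = [[1 if c == label else 0 for c in classes] for label in toy_labels]

# The original class can be recovered from the position of the 1.
decoded = [classes[vec.index(1)] for vec in encoded]

for label, vec in zip(toy_labels, encoded):
    print(label, vec)
```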

In [28]:
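# pair each text with its label vector; the columns stay unnamed,
# so Simple Transformers falls back to column 0 as text and column 1 as labels
data = pd.DataFrame(list(zip(X, y)))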
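
The cross-validation cell that follows logs 1600 training rows, 400 evaluation rows, and 200 steps per epoch for each fold. The sketch below reproduces that arithmetic with plain Python. The row count of 2000 and the batch size of 8 are assumptions inferred from those logged numbers, and the shuffle here is only illustrative (it does not reproduce scikit-learn's ordering).

```python
import random

n_rows = 2000    # assumption: 1600 training + 400 evaluation rows per fold
n_splits = 5
batch_size = 8   # assumption: 1600 rows / 200 logged steps per epoch

indices = list(range(n_rows))
random.Random(42).shuffle(indices)  # illustrative shuffle, not scikit-learn's

# Cut the shuffled indices into n_splits equally sized held-out folds.
fold_size = n_rows // n_splits
folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(n_splits)]

summary = []
for k, test_idx in enumerate(folds):
    held_out = set(test_idx)
    train_idx = [i for i in indices if i not in held_out]
    steps_per_epoch = len(train_idx) // batch_size
    summary.append((len(train_idx), len(test_idx), steps_per_epoch))
    print(f"fold {k}: train={len(train_idx)} test={len(test_idx)} steps={steps_per_epoch}")
```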

In [29]:
# 5-fold cross-validation: train a fresh model per fold and record its LRAP
results = []
kfold = KFold(shuffle=True, random_state=42)
for train_index, test_index in kfold.split(X, y):
    train_df = data.iloc[train_index]
    test_df = data.iloc[test_index]
    model = MultiLabelClassificationModel(
        'roberta', 
        'roberta-base', 
        num_labels=len(df.label.unique()), 
        use_cuda=True, 
        args={
            'fp16': False, 
            'reprocess_input_data': True, 
            'overwrite_output_dir': True, 
            'num_train_epochs': 5
        }
    )
    model.train_model(train_df)
    result, _, _ = model.eval_model(test_df)
    print(f"LRAP: {result['LRAP']}")
    results.append(result['LRAP'])

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForMultiLabelSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForMultiLabelSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMultiLabelSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForMultiLabelSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'clas

  0%|          | 0/1600 [00:00<?, ?it/s]

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Running Epoch 0 of 5:   0%|          | 0/200 [00:00<?, ?it/s]

Running Epoch 1 of 5:   0%|          | 0/200 [00:00<?, ?it/s]

Running Epoch 2 of 5:   0%|          | 0/200 [00:00<?, ?it/s]

Running Epoch 3 of 5:   0%|          | 0/200 [00:00<?, ?it/s]

Running Epoch 4 of 5:   0%|          | 0/200 [00:00<?, ?it/s]

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."


  0%|          | 0/400 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/50 [00:00<?, ?it/s]

LRAP: 0.99625


LRAP: 0.99125


LRAP: 0.9858333333333335


LRAP: 0.996875


LRAP: 0.9933333333333333
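
For readers unfamiliar with the metric, the following is a minimal pure-Python sketch of label ranking average precision (LRAP). It is not the implementation the library calls, the toy scores are made up, and it assumes every row has at least one true label.

```python
def lrap(y_true, y_score):
    """Label ranking average precision for rows with at least one true label."""
    per_row = []
    for truth, scores in zip(y_true, y_score):
        relevant = [j for j, t in enumerate(truth) if t == 1]
        precisions = []
        for j in relevant:
            # Rank of label j: how many labels score at least as high as it does.
            rank = sum(1 for s in scores if s >= scores[j])
            # How many of those are themselves true labels.
            hits = sum(1 for k in relevant if scores[k] >= scores[j])
            precisions.append(hits / rank)
        per_row.append(sum(precisions) / len(precisions))
    return sum(per_row) / len(per_row)

y_true = [[1, 0, 0], [0, 0, 1]]               # one-hot targets, as in this notebook
y_score = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]  # hypothetical activations
score = lrap(y_true, y_score)
perfect = lrap(y_true, [[0.9, 0.1, 0.2], [0.1, 0.2, 0.9]])
print(score, perfect)
```

With one-hot targets, each row contributes the reciprocal of the rank of its true class, so a score close to 1 means the true class is ranked first for almost every evaluation row.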


In [30]:
results

[0.99625, 0.99125, 0.9858333333333335, 0.996875, 0.9933333333333333]
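
A single summary figure is often easier to report than five fold scores. The sketch below computes the mean and sample standard deviation of the LRAP values listed above using only the standard library. The variable names are illustrative.

```python
import statistics

# Per-fold LRAP values copied from the output above.
fold_lrap = [0.99625, 0.99125, 0.9858333333333335, 0.996875, 0.9933333333333333]

mean_lrap = statistics.mean(fold_lrap)
std_lrap = statistics.stdev(fold_lrap)  # sample standard deviation (n - 1)

print(f"LRAP: {mean_lrap:.4f} +/- {std_lrap:.4f}")
```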

In [43]:
# train on a small subsample (15 texts per class) to make the histogram analysis more interesting
model = MultiLabelClassificationModel(
    'roberta', 
    'roberta-base', 
    num_labels=len(df.label.unique()), 
    use_cuda=True, 
    args={
        'fp16': False, 
        'reprocess_input_data': True, 
        'overwrite_output_dir': True, 
        'num_train_epochs': 5
    }
)
model.train_model(data.groupby(data[1].apply(lambda x: x.index(1))).sample(15))

  0%|          | 0/75 [00:00<?, ?it/s]

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Running Epoch 0 of 5:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 1 of 5:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 2 of 5:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 3 of 5:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 4 of 5:   0%|          | 0/10 [00:00<?, ?it/s]

(50, 0.42597315162420274)
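
The training call above groups rows by the position of the 1 in each label vector and draws 15 rows from every group, which gives the 75 training rows shown in the progress bar. The same idea can be sketched in plain Python on synthetic rows. The toy data and the seed are hypothetical (the notebook's own call is not seeded).

```python
import random
from collections import defaultdict

# Synthetic rows: 100 texts spread evenly over 5 one-hot encoded classes.
rows = [(f"text {i}", [1 if j == i % 5 else 0 for j in range(5)]) for i in range(100)]
per_class = 15
rng = random.Random(0)  # hypothetical seed, added only for repeatability

# Group rows by class index, i.e. the position of the 1 in the label vector.
groups = defaultdict(list)
for text, vec in rows:
    groups[vec.index(1)].append((text, vec))

# Draw the same number of rows from every class without replacement.
subsample = [row for members in groups.values() for row in rng.sample(members, per_class)]
print(len(subsample))
```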

In [44]:
with open('res/bbc.dill', 'wb') as f:
    dill.dump(model, f)
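
Other notebooks can read the file back with the mirror-image call. The sketch below shows the dump/load round trip on a small stand-in object in a temporary directory, since the trained model and `res/bbc.dill` are not available here. The payload and path are hypothetical, and the fallback to `pickle` (whose `dump`/`load` interface `dill` follows) is only there so the sketch runs where `dill` is not installed.

```python
import os
import tempfile

try:
    import dill as serializer
except ImportError:
    import pickle as serializer  # same dump/load interface, used as a fallback

payload = {"labels": ["sport", "tech"], "epochs": 5}  # hypothetical stand-in for the model

path = os.path.join(tempfile.mkdtemp(), "example.dill")

# Write the object to disk in binary mode, as the cell above does.
with open(path, "wb") as f:
    serializer.dump(payload, f)

# Read it back; the same pattern would apply to the saved model file.
with open(path, "rb") as f:
    restored = serializer.load(f)

print(restored)
```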