## Fine-tune a BERT-style model on a custom dataset

### 1. Install libraries

In [22]:
! pip install transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


### 2. Load/define the dataset

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data/dummycsdata.csv')

In [3]:
# map string labels to zero-based integer ids (dense rank, descending alphabetical order)
df['intlabel'] = df['label'].rank(method='dense', ascending=False).astype(int) - 1

In [4]:
# prepare mapping from int labels back to str
labelmapping = {}
for key in df.intlabel.unique():
    value = df.loc[df['intlabel'] == key,'label'].unique()[0]
    labelmapping[key] = value
print(labelmapping)

{3: 'HELP', 4: 'CONFIRMATION_YES', 5: 'CONFIRMATION_NO', 1: 'NEXT', 0: 'RESTART', 2: 'IRRELEVANT'}
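The same mapping can be built without pandas ranking. A minimal sketch, assuming the label strings printed above; `label2id` and `id2label` are names chosen here for illustration, and sorting in reverse alphabetical order is meant to mirror the descending dense rank used in the notebook.

```python
# Label strings as they appear in the printed mapping above.
labels = ["HELP", "CONFIRMATION_YES", "CONFIRMATION_NO", "NEXT", "RESTART", "IRRELEVANT"]

# Reverse alphabetical order mirrors rank(method='dense', ascending=False) - 1.
ordered = sorted(set(labels), reverse=True)
label2id = {label: idx for idx, label in enumerate(ordered)}
id2label = {idx: label for label, idx in label2id.items()}

print(id2label)
```

Both dictionaries can also be passed to `from_pretrained` through its `id2label` and `label2id` arguments, so that the saved model config carries readable label names.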


In [5]:
n_labels = len(labelmapping)
print(n_labels)

6


In [6]:
print(df)

                                 text             label  intlabel
0                               pomoc              HELP         3
1                     potřebuju pomoc              HELP         3
2                            pomoz mi              HELP         3
3                     zobraz nápovědu              HELP         3
4                            nápověda              HELP         3
5                        nevím co dál              HELP         3
6                    co mám dělat dál              HELP         3
7                            poraď mi              HELP         3
8                                 ano  CONFIRMATION_YES         4
9                                  jo  CONFIRMATION_YES         4
10                                 ok  CONFIRMATION_YES         4
11                              jasně  CONFIRMATION_YES         4
12                                 ne   CONFIRMATION_NO         5
13                        ani náhodou   CONFIRMATION_NO         5
14        

In [7]:
texts = df.text.tolist()
labels = df.intlabel.tolist()

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
trntxt, tsttxt, trnlbl, tstlbl = train_test_split(texts, labels, test_size=0.1)

In [10]:
print(tsttxt, tstlbl, [labelmapping[key] for key in tstlbl])

['jasně', 'rozmyslel jsem se', 'potřebuju pomoc'] [4, 5, 3] ['CONFIRMATION_YES', 'CONFIRMATION_NO', 'HELP']
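The split above is random, so the test examples change on every run. A small sketch of a reproducible split, assuming placeholder data with the same size as the run logged later in this notebook (27 training and 3 test examples); the `random_state` value is arbitrary.

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the real texts and integer labels (assumption).
texts = [f"text {i}" for i in range(30)]
labels = [i % 6 for i in range(30)]

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
split_a = train_test_split(texts, labels, test_size=0.1, random_state=42)
split_b = train_test_split(texts, labels, test_size=0.1, random_state=42)

trntxt, tsttxt, trnlbl, tstlbl = split_a
print(len(trntxt), len(tsttxt))
```

Stratifying by label is not an option at this test size: scikit-learn requires the test set to hold at least one example per class, and three test examples cannot cover six classes. A larger `test_size` would be needed for that.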


### 3. Preprocess text

In [11]:
from transformers import AutoTokenizer



In [12]:
# load the same tokenizer the model was trained with
tokenizer = AutoTokenizer.from_pretrained("Seznam/small-e-czech")

In [13]:
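# padding=True pads every sequence to the longest one in the call;
# truncation=True has no effect here because no max_length is set (see the warning below)
trnencodings = tokenizer(trntxt, truncation=True, padding=True)
tstencodings = tokenizer(tsttxt, truncation=True, padding=True)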

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [17]:
# one dict per example, in the format the Trainer expects
trndata = [
    {'label': label, 'input_ids': inid, 'attention_mask': atmask}
    for label, inid, atmask in zip(trnlbl, trnencodings['input_ids'], trnencodings['attention_mask'])
]
tstdata = [
    {'label': label, 'input_ids': inid, 'attention_mask': atmask}
    for label, inid, atmask in zip(tstlbl, tstencodings['input_ids'], tstencodings['attention_mask'])
]
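To make the role of `padding=True` and `attention_mask` concrete, here is a rough pure-Python sketch of what padding a batch amounts to. The function `pad_batch`, the token ids, and the pad id of 0 are hypothetical; the real tokenizer uses its own pad token id and may return additional fields.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest one and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for two sentences of different length.
batch = pad_batch([[2, 45, 3], [2, 17, 98, 6, 3]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Every row ends up with the same length, and the mask records which positions carry real tokens.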

### 4. Load the pretrained model

In [18]:
from transformers import AutoModelForSequenceClassification

In [19]:
model = AutoModelForSequenceClassification.from_pretrained("Seznam/small-e-czech", num_labels=n_labels)

Some weights of the model checkpoint at Seznam/small-e-czech were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at Seznam/small-e-czech and are newly initialized: ['classifier.dense.bias', 'classifier.de

### 5. Fit the model on a custom dataset

In [20]:
from transformers import TrainingArguments, Trainer

In [99]:
training_args = TrainingArguments(
    output_dir="./tunedbert",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=20,
    weight_decay=0.01,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=trndata,
    eval_dataset=tstdata,
    tokenizer=tokenizer,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [100]:
trainer.train()

***** Running training *****
  Num examples = 27
  Num Epochs = 20
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 40


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=40, training_loss=1.760857391357422, metrics={'train_runtime': 9.7797, 'train_samples_per_second': 55.216, 'train_steps_per_second': 4.09, 'total_flos': 279286723440.0, 'train_loss': 1.760857391357422, 'epoch': 20.0})
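Two numbers in the log can be checked by hand. The sketch below uses only values printed above: it recomputes the number of optimization steps and works out the cross-entropy of a uniform guess over six classes as a baseline for the reported loss. The variable names are illustrative.

```python
import math

num_examples = 27   # "Num examples" in the training log
batch_size = 16     # per_device_train_batch_size
num_epochs = 20     # num_train_epochs
n_labels = 6        # number of classes

# The last, smaller batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Cross-entropy of predicting every class with equal probability.
chance_loss = math.log(n_labels)

print(total_steps, round(chance_loss, 3))
```

The step count matches the 40 steps reported by the trainer. The average training loss of about 1.76 sits only slightly below the uniform-guess value of about 1.79, which suggests the loss fell slowly over these 40 steps.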

### 6. Test the trained model

In [101]:
import numpy as np

In [111]:
rawpredictions = trainer.predict(tstdata)
pred_intlabels = np.argmax(rawpredictions.predictions, axis=1)
pred_labels = [labelmapping[lbl] for lbl in pred_intlabels]
gt_intlabels = np.array([entry['label'] for entry in tstdata])
gt_labels = [labelmapping[lbl] for lbl in gt_intlabels]

***** Running Prediction *****
  Num examples = 3
  Batch size = 16
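`rawpredictions.predictions` holds one row of raw scores (logits) per test example, and `argmax` picks the highest-scoring class. If class probabilities are wanted as well, a softmax over each row provides them. A minimal NumPy sketch with made-up logits (the values are not taken from the run above):

```python
import numpy as np

# Hypothetical logits for 2 examples and 6 classes.
logits = np.array([
    [0.1, -0.3, 0.0, 0.2, 1.5, -0.4],
    [0.3, 0.1, -0.2, 0.9, 0.2, 0.4],
])

# Subtract the row maximum before exponentiating for numerical stability.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

pred = probs.argmax(axis=1)
print(pred, probs.max(axis=1))
```

Softmax preserves the ordering within each row, so the predicted class is the same whether `argmax` is taken over logits or over probabilities.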


In [114]:
correct = np.sum(pred_intlabels == gt_intlabels)
accuracy = correct / gt_intlabels.shape[0]
print("ACCURACY", accuracy)

ACCURACY 0.6666666666666666


In [115]:
print("DATA:", tsttxt)
print("PREDICTED:", pred_labels)
print("GROUND TRUTH:", gt_labels)

DATA: ['jasně', 'rozmyslel jsem se', 'potřebuju pomoc']
PREDICTED: ['CONFIRMATION_YES', 'HELP', 'HELP']
GROUND TRUTH: ['CONFIRMATION_YES', 'CONFIRMATION_NO', 'HELP']
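With three test examples, each error moves accuracy by a third, so listing the individual mistakes is more informative than the single number. A short sketch using the predictions and ground truth printed above:

```python
texts = ['jasně', 'rozmyslel jsem se', 'potřebuju pomoc']
predicted = ['CONFIRMATION_YES', 'HELP', 'HELP']
truth = ['CONFIRMATION_YES', 'CONFIRMATION_NO', 'HELP']

# One boolean per example; the share of True values is the accuracy.
hits = [p == g for p, g in zip(predicted, truth)]
accuracy = sum(hits) / len(hits)

# Collect (text, predicted, expected) for every misclassified example.
errors = [(t, p, g) for t, p, g, ok in zip(texts, predicted, truth, hits) if not ok]
print(accuracy, errors)
```

The only miss here is 'rozmyslel jsem se', labelled `CONFIRMATION_NO` but predicted as `HELP`.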
