## fine-tune a BERT model (DistilBERT) on a custom dataset

### 1. install libraries

In [22]:
! pip install transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


### 2. load/define data set

In [1]:
import pandas as pd

In [44]:
df = pd.read_csv('data/dummydata.csv')

In [65]:
# map each distinct label to an integer id (0, 1, ...); ids follow descending label order
df['intlabel'] = df['label'].rank(method='dense', ascending=False).astype(int) - 1
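
The `rank` call above produces the integer ids but does not keep a way to map them back to label names. A minimal sketch of a more explicit alternative, assuming string labels: the toy DataFrame stands in for `data/dummydata.csv` (whose contents are not shown), and `label2id` / `id2label` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-in for data/dummydata.csv
df = pd.DataFrame({
    "text": ["sure", "no thanks", "sounds good", "stop it"],
    "label": ["yes", "no", "yes", "no"],
})

# Sort distinct labels in descending order so ids follow the same
# ordering as rank(method='dense', ascending=False) - 1
classes = sorted(df["label"].unique(), reverse=True)
label2id = {name: idx for idx, name in enumerate(classes)}
id2label = {idx: name for name, idx in label2id.items()}

df["intlabel"] = df["label"].map(label2id)
print(label2id)
```

Keeping `id2label` around makes it easy to turn predicted ids back into readable labels later.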

In [66]:
texts = df.text.tolist()
labels = df.intlabel.tolist()

In [67]:
from sklearn.model_selection import train_test_split

In [68]:
trntxt, tsttxt, trnlbl, tstlbl = train_test_split(texts, labels, test_size=0.2)
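
With a dataset this small, a purely random split can leave a class out of the test set and changes on every run. A sketch of a reproducible, stratified split, assuming every class has at least two examples; the toy lists stand in for `texts` and `labels`, and `random_state=42` is an arbitrary choice.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the notebook's texts and labels
texts = [f"example {i}" for i in range(10)]
labels = [0, 1] * 5

# stratify keeps the class proportions equal in both splits;
# random_state makes the split repeatable
trntxt, tsttxt, trnlbl, tstlbl = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)
print(len(trntxt), len(tsttxt))
```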

In [69]:
print(tsttxt, tstlbl)

["I don't want it", 'stop it', 'cancel', 'well, ok', 'nah'] [1, 1, 1, 0, 1]


In [70]:
data = {}
data['train'] = [{'text': txt, 'label': lbl} for txt, lbl in zip(trntxt, trnlbl)]
data['test'] = [{'text': txt, 'label': lbl} for txt, lbl in zip(tsttxt, tstlbl)]

### 3. preprocess text

In [71]:
import tensorflow as tf
from transformers import DistilBertTokenizerFast

In [72]:
# load the tokenizer that the pretrained model was trained with
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

In [73]:
trnencodings = tokenizer(trntxt, truncation=True, padding=True)
tstencodings = tokenizer(tsttxt, truncation=True, padding=True)
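
`padding=True` pads every sequence to the longest one in the call, and `truncation=True` cuts sequences that exceed the maximum length. The tokenizer returns `input_ids` plus an `attention_mask` that marks real tokens with 1 and padding with 0. A plain-Python sketch of that effect, not the library's implementation: `pad_and_truncate` is a hypothetical helper, the token ids are made up, and unlike the real tokenizer it does not preserve special tokens when truncating.

```python
def pad_and_truncate(batch, pad_id=0, max_length=None):
    """Illustrative only: mimics the effect of truncation=True, padding=True."""
    if max_length is not None:
        batch = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in batch)
    return {
        "input_ids": [ids + [pad_id] * (longest - len(ids)) for ids in batch],
        "attention_mask": [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch],
    }

# Made-up token ids for two texts of different length
enc = pad_and_truncate([[101, 7, 8, 9, 102], [101, 7, 102]])
print(enc["input_ids"])
print(enc["attention_mask"])
```

Because the train and test texts are tokenized in separate calls, each set is padded to its own longest sequence, so the two encodings can have different widths.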

In [74]:
trn_dataset = tf.data.Dataset.from_tensor_slices((
    dict(trnencodings),
    trnlbl
))
tst_dataset = tf.data.Dataset.from_tensor_slices((
    dict(tstencodings),
    tstlbl
))

### 4. load pretrained model

In [75]:
from transformers import TFDistilBertForSequenceClassification

In [76]:
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_transform', 'vocab_layer_norm', 'activation_13', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier', 'pre_classifier', 'dropout_79']
You should probably TRAIN this model on a down-stream task to be able to use i
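
`num_labels` sets the output size of the newly initialized classification head, so every integer label must be smaller than it. A minimal sketch of deriving the value from the data instead of hard-coding it; the toy `labels` list stands in for the notebook's `labels`.

```python
# Toy stand-in for the integer labels built earlier in the notebook
labels = [1, 1, 0, 1, 0]

# Number of distinct classes in the data
num_labels = len(set(labels))

# Label ids must lie in 0 .. num_labels - 1 to index the classifier outputs
assert min(labels) >= 0 and max(labels) < num_labels
print(num_labels)
```

The result could then be passed as `num_labels=num_labels` to `from_pretrained`.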

In [77]:
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss, metrics=['accuracy'])
# the datasets are already batched, so batch_size is not passed to fit();
# only the training data needs shuffling
model.fit(trn_dataset.shuffle(100).batch(16),
          epochs=3,
          validation_data=tst_dataset.batch(16))

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f59704b3350>
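
After training, the model produces one row of logits per text, with one column per class. A minimal numpy sketch of turning logits into class ids and an accuracy score. The logits here are made up; in a real run they would come from the model's predictions on the test set, whose exact return type depends on the transformers version.

```python
import numpy as np

# Made-up logits for 5 texts and 3 classes; real values would come from the model
logits = np.array([
    [0.2, 1.5, -0.3],
    [-1.0, 2.0, 0.1],
    [0.0, 0.9, 0.4],
    [1.8, -0.2, 0.3],
    [0.1, 0.7, -0.5],
])
true_ids = np.array([1, 1, 1, 0, 1])  # same values as the test labels printed earlier

# Softmax, subtracting the row maximum for numerical stability
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# The predicted class is the column with the highest probability
pred_ids = probs.argmax(axis=1)
accuracy = (pred_ids == true_ids).mean()
print(pred_ids, accuracy)
```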