
Fine-tuning Natural Language Processing models


Here, we will fine-tune BERT for text classification using the IMDb reviews dataset. First, download the dataset and extract it:

wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar -xf aclImdb_v1.tar.gz
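
If you want to sanity-check the extraction before training, a quick standard-library script like the following (just a convenience sketch, not part of quickai) counts the reviews in each split:

import os

# The aclImdb archive ships labeled reviews under train/ and test/,
# each with pos/ and neg/ subfolders of plain-text files.
for split in ("train", "test"):
    for label in ("pos", "neg"):
        path = os.path.join("aclImdb", split, label)
        print(f"{path}: {len(os.listdir(path))} reviews")

Each of these four folders contains 12,500 plain-text reviews.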

Now, create a new file, bert_tc_ft.py. Inside it, add the following import:

from quickai import TextFineTuning

Now, let's call TextFineTuning:

TextFineTuning("./aclImdb", "./IMDB_BERT_FT", "classification", ["pos", "neg"],
               epochs=1)

The first parameter is the path to the training data, which should be in the same format as the IMDb data. The second parameter is the name to use when saving the model. The third is the type of model; in this case, it is classification. The fourth is the list of classes for the classification. Above, we specified epochs as 1. This is fine if you are just experimenting, but if you want to train a production model, you should raise it (somewhere between 3 and 5 is typical), as in the sketch below.
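
For example, a production-style run might look like this (same arguments as above, only epochs changed):

# Longer training run; 3 epochs is a reasonable starting point
TextFineTuning("./aclImdb", "./IMDB_BERT_FT", "classification", ["pos", "neg"],
               epochs=3)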

Go ahead and run that file. Training will take quite some time, even on a GPU. Once it is done, you will see a folder called IMDB_BERT_FT next to your Python file. That folder is the saved model, and it can be loaded as follows:

from quickai import classification_ft

# Load the fine-tuned model from disk and run it
results = classification_ft("./IMDB_BERT_FT", ["pos", "neg"])
print(results)

The classification_ft function takes two parameters: the model path and the classes. Note that the classes must be listed in the same order as during training.
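
The exact shape of results depends on your quickai version, so treat the following as a hypothetical sketch: if results turns out to be a sequence of per-class scores, you could pair it with your label list like this:

# Hypothetical post-processing: assumes results is a per-class score
# sequence in label order; verify against your quickai version.
classes = ["pos", "neg"]
if isinstance(results, (list, tuple)) and len(results) == len(classes):
    for label, score in zip(classes, results):
        print(f"{label}: {score}")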

In this example, we only did text classification; however, the same basic principle applies to Q&A and token classification training. The differences are that for Q&A the data should be in SQuAD format, and for token classification the data should be in W-NUT format.
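
For reference, a minimal SQuAD-style record is sketched below (this mirrors the public SQuAD v1.1 JSON structure; quickai's exact loading expectations may differ, so check its documentation). W-NUT data, by contrast, is one token per line with a tab-separated tag and a blank line between sentences.

# SQuAD-style question answering record (structure of SQuAD v1.1;
# confirm quickai accepts this exact layout before training):
squad_example = {
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "QuickAI wraps Hugging Face models for fast fine-tuning.",
            "qas": [{
                "id": "1",
                "question": "What does QuickAI wrap?",
                "answers": [{"text": "Hugging Face models", "answer_start": 14}],
            }],
        }],
    }]
}

# W-NUT-style token classification data: token<TAB>tag on each line,
# sentences separated by blank lines.
wnut_example = "QuickAI\tB-product\nis\tO\nfast\tO\n"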