# Finetuning for Text Classification

In [1]:
import backprop

### Data prep

When you're finetuning a model, you're going to need to provide it with some data.

The benefit of finetuning is that the amount of data you need is substantially less than you'd need for training a new model from scratch.

Here, we'll be using the IMDb movie review dataset, and training a model to classify sentiment.

Each movie review in this dataset comes with a label indicating whether the review was positive or negative in tone. For this proof of concept, we'll train a working model on just 1000 examples.

In [2]:
from datasets import load_dataset

dataset = load_dataset("imdb")

Reusing dataset imdb (/home/kristo/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3)


In [3]:
dataset["train"][0]

{'label': 1,
 'text': 'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'}

One thing to note about this dataset's training set is that it's ordered: the first half is entirely **positive** reviews (`label` is `1`), and the latter half entirely **negative** (`label` is `0`) reviews.

Since we're just running a demo here, we're not going to use all 25000 examples in the set. We'll take 500 from the front half and 500 from the back half, which gives us a balanced set of 1000.

In [4]:
# For readability, we'll map the label numbers to strings.
# This isn't necessary, but it looks nicer on output.
def label_to_sentiment(label):
    if label == 0:
        return "negative"
    else:
        return "positive"

In [5]:
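inputs = []
outputs = []
train = dataset["train"]
for i in range(500):
    # i-th example from the front of the training set (positive reviews)
    inputs.append(train[i]["text"])
    outputs.append(label_to_sentiment(train[i]["label"]))
    # i-th example from the back (negative reviews). Using -(i + 1) rather
    # than -i matters: -0 == 0 would pick the first example a second time.
    inputs.append(train[-(i + 1)]["text"])
    outputs.append(label_to_sentiment(train[-(i + 1)]["label"]))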
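The same front-and-back sampling works on any label-ordered list. Here is a minimal sketch on a toy stand-in for the dataset, so it runs without downloading IMDb. The helper `take_from_both_ends` and the toy records are hypothetical names introduced for illustration; they are not part of `datasets` or `backprop`. Counting labels afterwards is a cheap way to confirm the subset is balanced and has no duplicates.

```python
from collections import Counter


def take_from_both_ends(examples, n):
    """Interleave the first n and the last n items of an ordered sequence."""
    if 2 * n > len(examples):
        raise ValueError("n is too large: the two ends would overlap")
    picked = []
    for i in range(n):
        picked.append(examples[i])         # i-th item from the front
        picked.append(examples[-(i + 1)])  # i-th item from the back (-1, -2, ...)
    return picked


# Toy stand-in for a label-ordered training set: positives first, then negatives.
toy_train = [{"text": f"good {k}", "label": 1} for k in range(10)]
toy_train += [{"text": f"bad {k}", "label": 0} for k in range(10)]

subset = take_from_both_ends(toy_train, 3)
label_counts = Counter(example["label"] for example in subset)
unique_texts = {example["text"] for example in subset}

print(label_counts)       # three of each label
print(len(unique_texts))  # 6, so nothing was picked twice
```

Indexing the back half with `-i` instead of `-(i + 1)` would select index `0` on the first iteration, so the check on unique texts is worth keeping around.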

### Finetuning

Now that we've got our data prepared, we're ready to train.

All we need to do is choose our task, supply it with the model we're training, and call `.finetune()` with our inputs and outputs.

Let's see what models are available for this purpose. Only a model that lists `text-classification` under "Finetunable tasks" can be finetuned for it.

In [6]:
backprop.TextClassification.list_models(display=True)

Name                     bart-large-mnli
Alias                    english
Description              Facebook's large version of BART, finetuned on the Multi-Genre Natural Language Inference dataset.
Supported tasks          ['text-classification']
Finetunable tasks        []
----------
Name                     xlmr-large-xnli
Alias                    multilingual
Description              XLM-RoBERTa is a multilingual variant of Facebook's RoBERTa model. This has been finetuned on the XNLI dataset, resulting in classification system that is effective on 100 different languages.
Supported tasks          ['text-classification']
Finetunable tasks        []
----------
Name                     xlnet-base-cased
Description              XLNet
Supported tasks          ['text-classification']
Finetunable tasks        ['text-classification']
----------


In [7]:
tc = backprop.TextClassification("xlnet-base-cased")
tc.finetune({"texts": inputs, "labels": outputs}, epochs=3)

Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.weight', 'sequence_summary.summary.bias', 'logits_proj.weight', 'logits_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions a

Finding the optimal batch size...


Batch size 2 succeeded, trying batch size 4
Batch size 4 succeeded, trying batch size 8
Batch size 8 succeeded, trying batch size 16
Batch size 16 succeeded, trying batch size 32
Batch size 32 succeeded, trying batch size 64
Batch size 64 failed, trying batch size 32
Finished batch size finder, will continue with full run using batch size 32
Restored states from the checkpoint file at /home/kristo/Documents/backprop/examples/scale_batch_size_temp_model.ckpt
GPU available: True, used: True
TPU available: None, using: 0 TPU cores

  | Name  | Type         | Params
---------------------------------------
0 | model | HFSeqTCModel | 117 M 
---------------------------------------
117 M     Trainable params
0         Non-trainable params
117 M     Total params
469.242   Total estimated model params size (MB)



Training finished! Save your model for later with backprop.save or upload it with backprop.upload


Great, that didn't take too long.

Let's write a couple of fake reviews and see how it does.

In [8]:
neg_review = """Backprop: The Movie had its highlights, but ultimately fell flat. 
            While AI is a fascinating concept, watching people sit and train models
            in real-time just doesn't make for an exciting viewing experience. 
            This is made especially egregious by the four-and-a-half hour runtime."""

pos_review = """Backprop: The Movie was an absolute pleasure from start to finish.
             The writing was witty, the concept was engaging, and the music was beautiful.
             There's just something about a well-trained model that ties a film together."""

In [13]:
tc(neg_review)

{'positive': 0.00520756421610713, 'negative': 0.9947924017906189}

In [14]:
tc(pos_review)

{'positive': 0.9908009171485901, 'negative': 0.009199077263474464}

Perfect -- after just a bit of training, our model can correctly classify the reviews of *Backprop: The Movie* (coming soon).
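The classifier returns a dictionary of label probabilities, so reducing it to a single prediction is a one-liner. A small sketch, using the two score dictionaries printed above; `top_label` is a hypothetical helper written for this example, not a `backprop` function.

```python
def top_label(scores):
    """Return the (label, probability) pair with the highest probability."""
    label = max(scores, key=scores.get)
    return label, scores[label]


# Scores copied from the two outputs above.
neg_scores = {"positive": 0.00520756421610713, "negative": 0.9947924017906189}
pos_scores = {"positive": 0.9908009171485901, "negative": 0.009199077263474464}

print(top_label(neg_scores)[0])  # negative
print(top_label(pos_scores)[0])  # positive
```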

It's worth noting here that this is a binary classifier: our dataset had only two labels. Every review it was trained on was labeled either positive or negative, with no middle ground.

Let's try a review with some grey area.

In [11]:
mix_review = """Backprop: The Movie was certainly an experimental film.
                The slow sections were an absolute drag. This was made worse 
                by just how long this film is.
                However, the idea was clever, and it was pretty well-written, even
                if some of the actors weren't great."""

In [12]:
tc(mix_review)

{'positive': 0.8351138830184937, 'negative': 0.16488608717918396}

Aha!

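Now our results are noticeably less polarized: the model gives about 0.84 to `positive`, compared with roughly 0.99 for the winning label on the two clear-cut reviews.

Even though our dataset didn't have a 'neutral' label, we can still treat lower confidence in calling a review `negative` or `positive` as an implied neutral label.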
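One way to act on that idea is to fall back to a third label whenever the top probability is below a cutoff. The sketch below assumes a cutoff of 0.9, which is an arbitrary value chosen for illustration and not something `backprop` provides; in practice it would need tuning on held-out reviews. `label_with_neutral` is a hypothetical helper, and the score dictionaries are copied from the outputs in this notebook.

```python
def label_with_neutral(scores, threshold=0.9):
    """Return the top label, or "neutral" if the model is not confident enough.

    `threshold` is an assumed cutoff for this example, not a library default.
    """
    label = max(scores, key=scores.get)
    if scores[label] < threshold:
        return "neutral"
    return label


neg_scores = {"positive": 0.00520756421610713, "negative": 0.9947924017906189}
pos_scores = {"positive": 0.9908009171485901, "negative": 0.009199077263474464}
mix_scores = {"positive": 0.8351138830184937, "negative": 0.16488608717918396}

print(label_with_neutral(neg_scores))  # negative
print(label_with_neutral(pos_scores))  # positive
print(label_with_neutral(mix_scores))  # neutral
```

Keep in mind that these probabilities are not guaranteed to be well calibrated, so the cutoff is a heuristic and not a true measure of how mixed a review is.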