# Text Classification (Sentiment Analysis) with Transformers

![](https://i.imgur.com/7SXKckD.png)

Transfer learning is the practice of taking already trained models and tuning or adapting them to our own downstream tasks.

# Sentiment Analysis

Sentiment analysis is one of the most widely performed analyses on text data. It has improved tremendously from the days of classic methods to recent times, in which state-of-the-art models use deep learning to improve performance.

# Fine-tuning a model on a text classification task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on a text classification task: sentiment analysis.

![](https://i.imgur.com/Pq7f3Fd.png)

## Install Relevant Libraries



In [1]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.3.2-py3-none-any.whl (362 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)

In [2]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.19.4-py3-none-any.whl (4.2 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.12.1 transformers-4.19.4


You will be leveraging 🤗 Transformers and 🤗 Datasets, as well as their dependencies.

## Get Dataset

In [4]:
import pandas as pd

dataset = pd.read_csv(r'https://github.com/dipanjanS/nlp_workshop_dhs18/raw/master/Unit%2011%20-%20Sentiment%20Analysis%20-%20Unsupervised%20Learning/movie_reviews.csv.bz2', compression='bz2')
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [6]:
dataset['sentiment'] = [1 if sentiment == 'positive' else 0 for sentiment in dataset['sentiment']]
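
The list comprehension above maps every value other than `'positive'` to 0, so a misspelled or missing label would silently become a negative example. A minimal sketch of a stricter encoding, assuming the only valid labels are `'positive'` and `'negative'` (the helper `encode_sentiment` and the `LABEL_TO_ID` mapping are illustrative names, not part of the notebook):

```python
# Hypothetical helper: strict label encoding that fails loudly on unknown values.
LABEL_TO_ID = {"negative": 0, "positive": 1}  # assumed label set


def encode_sentiment(labels):
    """Map string sentiments to integer ids, raising on anything unexpected."""
    encoded = []
    for position, label in enumerate(labels):
        if label not in LABEL_TO_ID:
            raise ValueError(f"Unexpected label {label!r} at position {position}")
        encoded.append(LABEL_TO_ID[label])
    return encoded


print(encode_sentiment(["positive", "negative", "positive"]))  # [1, 0, 1]
```

Used in place of the cell above, this would read `dataset['sentiment'] = encode_sentiment(dataset['sentiment'])`.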

In [7]:
dataset.head()

|   | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |


In [8]:
train_df = dataset.iloc[:35000]
test_df = dataset.iloc[35000:]
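
The split above is positional: the first 35,000 rows in file order become the training set. The resulting test split turns out to be close to balanced (7,490 vs 7,510 in the classification report at the end), so this works here. If the file order were not known to be random, a seeded shuffle of the row positions would be a safer choice. A minimal sketch, where `shuffled_split_indices` is a hypothetical helper and not part of the notebook:

```python
import random


def shuffled_split_indices(num_rows, num_train, seed=42):
    """Return (train_indices, test_indices) after a seeded shuffle of row positions."""
    if not 0 <= num_train <= num_rows:
        raise ValueError("num_train must be between 0 and num_rows")
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # local RNG, global random state untouched
    return indices[:num_train], indices[num_train:]


train_idx, test_idx = shuffled_split_indices(num_rows=50000, num_train=35000)
print(len(train_idx), len(test_idx))  # 35000 15000
```

The index lists could then be passed to `dataset.iloc[train_idx]` and `dataset.iloc[test_idx]`.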

In [9]:
train_df.to_csv('train.csv', index=False)
test_df.to_csv('test.csv', index=False)

In [11]:
!ls -l --block-size=MB

total 66MB
drwxr-xr-x 1 root root  1MB Jun  1 13:50 sample_data
-rw-r--r-- 1 root root 20MB Jun 15 22:28 test.csv
-rw-r--r-- 1 root root 47MB Jun 15 22:28 train.csv


## Load Dataset

Here we load the IMDB sentiment dataset from the CSV files we wrote above.

In [1]:
from datasets import load_dataset, load_metric

data_files = {"train": "train.csv", "test": "test.csv"}
imdb_data = load_dataset("csv", data_files=data_files)

Using custom data configuration default-f9419df487d4ddc2
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-f9419df487d4ddc2/0.0.0/51cce309a08df9c4d82ffd9363bbe090bf173197fc01a71b034e8594995a1a58)


  0%|          | 0/2 [00:00<?, ?it/s]

In [2]:
imdb_data.keys()

dict_keys(['train', 'test'])

In [3]:
imdb_data['train'][:2]

{'review': ["One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is d

If infrastructure is constrained, you can optionally subset the dataset, for example limiting training to 10,000 records. The cell below was not executed for the run recorded in this notebook, so all outputs that follow reflect the full 35,000-record training split and 15,000-record test split.

In [None]:
imdb_data['train'] = imdb_data['train'].shuffle(seed=42).select(range(10000))
imdb_data['test'] = imdb_data['test'].shuffle(seed=42).select(range(2000))


The `imdb_data` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key for each split, here the training and test sets:

In [4]:
imdb_data

DatasetDict({
    train: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 35000
    })
    test: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 15000
    })
})

To access an actual element, you need to select a split first, then give an index:

In [5]:
imdb_data["train"][0]

{'review': "One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is du
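
The sample above still contains raw `<br />` markup. The notebook feeds the text to the tokenizer as-is, which appears to work well enough given the final scores. If you would rather strip the tags first, a minimal sketch follows; `strip_html_breaks` is a hypothetical helper and was not part of the recorded run:

```python
import re

# Matches <br>, <br/>, and <br /> in any letter case.
_BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def strip_html_breaks(text):
    """Replace HTML line-break tags with a space and collapse repeated whitespace."""
    without_tags = _BR_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


sample = "They are right, as this is exactly what happened with me.<br /><br />The first thing"
print(strip_html_breaks(sample))
```

Under the same assumption, it could be applied to every split with `imdb_data.map(lambda example: {"review": strip_html_breaks(example["review"])})`.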

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [6]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
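
The retry loop in `show_random_elements` draws indices until it has `num_examples` distinct ones. As a side note, the standard library's `random.sample` gives the same guarantee in one call. A small sketch, with `pick_unique_indices` as an illustrative name:

```python
import random


def pick_unique_indices(dataset_length, num_examples=5):
    """Draw num_examples distinct positions from range(dataset_length)."""
    # random.sample raises ValueError if num_examples > dataset_length,
    # which plays the role of the assert in show_random_elements.
    return random.sample(range(dataset_length), num_examples)


picks = pick_unique_indices(dataset_length=35000, num_examples=5)
print(picks)
```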

In [7]:
show_random_elements(imdb_data["train"])

Unnamed: 0,review,sentiment
0,"This is one of my favorites along with the Mariette Hartley and Robert Lansing ""Sandy"" and the Agnes Moorhead-and-the-tiny-spacemen episodes.<br /><br />It is an important take, from mid-1961, on the long Cold War that the U.S. was then embroiled in. The beaten-down city-scene, the near-starving characters' sparse dialog, their threadbare uniforms, and the minimal action ""says"" it all: the absurdity of an on-going conflict that threatens to destroy human life, modern civilization, and all that is sweet and redeeming about it.<br /><br />It is a ""fable"" because it was made in a time in which, had events turned out differently, such as the second Berlin Crisis (Spring 1961) and the subsequent Cuban Missile Crisis (Oct. 1962), it would have actually been a reasonable representation of one of the U.S.'s major cities, ruined and replete with a few miserable survivors. I also see it as a ""fable"" because it is not only a cautionary tale, but because it is the most redemptive of all our popular myths: it is a love story, set in an impossible situation, and involving two highly mismatched lovers.",1
1,"Now I love Bela Lugosi,don't get me wrong,he is one of the most interesting people to ever make a movie but he certainly did his share of clunkers.This is just another one of those.<br /><br />Lugosi plays Dr.Lorenz,a doctor who has had his medical license pulled for unexplained reasons.He is however doing experiments to keep his wife young and beautiful.It's revealed that she is 70-80 years old yet Lugosi looks to be in his mid 50's so why he is married to this old woman is never really explained.<br /><br />Anyway these treatments or experiments involved giving brides who are at the altar being married some sort of sweet smelling substance whereby they pass out but are thought to be dead.Then Lugosi and some of his assistants steal the body on its way to the morgue and take it back to his lab where it's kept in some sort of suspended animation or catatonic state.Then the stolen brides have a needle rammed somewhere in their bodies,maybe the neck,and then the needle is rammed into the body of Lugosi's wife to bring her back to youth and beauty.We never really see where Lugosi sticks the needle or what it is that he draws out of the brides but it somehow restores his wife .Apparently old age makes you scream with pain because Lugosi's wife does a lot of screaming until she gets back to her younger state.Helping Lugosi in his lab is the only good thing about this movie....a weird old hag and her two deformed sons....one son is a big lumpy looking slow acting fellow who likes to fondle the snoozing brides and the other son is a mean little dwarf....little person, to be politically correct in today's world.At night these three just sort of pile up and sleep in Lugosi's dreary downstairs lab.Who these 3 are and how they came to be Lugosi's scared assistants is,like a lot of stuff in this film, never explained.<br /><br />So anyway a female reporter is given the assignment by her gruff editor to find out where all the stolen brides are going to.She quickly figures 
out that the one common thing among all the stolen brides is a rare orchid that is found on them.So she asks around and is told that there is a world renowned orchid expert living nearby who just happens to be the one who developed this particular orchid.This expert turns out to be creepy Dr.Lorenz.She quickly tracks him down and upsets his little house of horrors.I'm not sure where the police were during all this but they came in to mop up after the reporter had done all the dirty work.<br /><br />It seems that Lugosi's movies always had some sort of unnecessary silly plot line that just made the whole thing stink to high heavens.I mean a world famous orchid expert kidnaps brides by sending them a doped up orchid he himself is known to have developed? D'OH!<br /><br />And then later it's revealed that the young ladies don't even have to be brides for the procedure to work so why would Lugosi keep kidnapping brides from heavily guarded churches for his experiments and create all the attention and newspaper headlines? Why not just grab a prostitute off the street like a normal weirdo pervert would do? 
This clunker reminded me a lot of another Lugosi stinker,""The Devil Bat""....same silly plot lines and bad acting and same silly 'reporter gets bad guy' deal.<br /><br />But Lugosi is always good--he is creepy and sinister enough to keep you interested at least enough to keep watching him.The woman playing the reporter was just a terrible actor....she had no emotion whatsoever,she just delivered her lines like a machine gun ,spewing them out as quickly as she could.Everyone else pretty much blew too,when it came to being good actors.<br /><br />But this thing is watchable ,if only for Bela Lugosi fans.Lugosi was always so intense even when the picture was a dog.He must have known he was doing terrible pictures but maybe he also knew that if he gave it everything he had a little of that intensity might shine through past all the bad plots and bad acting which surrounded him.<br /><br />And he was right----we horror fans will always have a love for Bela Lugosi.He gave it his all every time he was in front of the camera.We do give two f**ks for you,Bela.",0
2,"It's unbelievable but the fourth is better than the second and the third. After the third that was awful, it's incredible how they could have an unexpected sequel with new ideas. Chuck is the same nasty doll of the previous movies. Interesting the final that lets know that a fifth can be done....",1
3,"No, this hilariously horrible 70's made-for-TV horror clinker isn't about a deadly demonically possessed dessert cake. Still, this exceptionally awful, yet undeniably amusing and thus enjoyable cathode ray refuse reaches a breathtaking apex of absolute, unremitting silliness and atrociousness that's quite tasty in a so-execrable-it's-downright-awesome sort of way. Richard Crenna, looking haggard and possibly inebriated, and Yvette Mimieux, who acts as if she never got over the brutal rape she endured in ""Jackson County Jail,"" sluggishly portray a disgustingly nice and respectable suburbanite couple whose quaint, dull, sleepy small town existence gets ripped asunder when the cute German Shepard they take in as the family pet turns out to be some ancient lethal evil spirit. Pretty soon Mimieux and her two repellently cutesy kids Kim Richards and Ike Eisenmann (the psychic alien moppets from the Disney ""Witch Mountain"" pictures) are worshiping a crude crayon drawing of the nasty, ugly canine entity in the den. Boy, now doesn't that sound really scary and disturbing? Well, scary and disturbing this laughably ludicrous claptrap sure ain't, but it sure is funny, thanks to Curtis (""Night Tide"") Harrington's hopelessly weak direction, cartoonish (not so) special effects, an almost painfully risible'n'ridiculous plot, and a game cast that struggles valiantly with the absurd story (besides the leads, both Martine Beswicke and R.G. Armstrong briefly pop up as members of a Satanic cult and Victor Jory has a nice cameo as a helpful Native American shaman). Favorite scene: the malicious Mephestophelion mutt puts the whammy on Crenna, practically forcing him to stick his hand into a wildly spinning lawnmower blade. While stuck-up snobby fright film fans may hold their noses at the perfectly putrid stench of this admittedly smelly schlock, devout TV trash lovers should deem this endearingly abominable offal the boob tube equivalent to Alpo.",1
4,"Directed and written by the famous/infamous Edward D. Wood Jr, using a pseudonym(Daniel Davis)playing the lead role of Glen/Glenda. This is an almost radical documentary about transvestism; Wood himself being a transvestite with a fetish for angora sweaters. It seems miles of stock footage and an incoherent Bela Lugosi is used to stretch this odd and awkward film to 67 minutes. Police inspector(Lyle Talbot)seeks enlightenment from a psychiatrist, Dr. Alton(Timothy Farrell), to better understand the emotional and disposition of transvestites.<br /><br />Also in the cast: Delores Fuller, ""Tommy"" Hanes, Captain DeZita and Wood's sister Evelyn. Of note: Farrell also acts as narrator. And Fuller later helped write songs included in the Elvis Presley movies BLUE HAWAII, KISSIN' COUSINS & KID GALAHAD.",0


We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the metric we need for evaluation. This can be easily done with the function `load_metric`. Here we load the GLUE `mrpc` configuration, which reports both accuracy and F1.

In [8]:
metric = load_metric('glue', 'mrpc')
metric

Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(res

The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric).

You can call its `compute` method with your predictions and labels. For classification, both are lists of integer class IDs, and the most common metrics are accuracy and F1 score:


In [9]:
predictions = [1,0,1,1,0]
references = [1,1,0,1,0]
scores = metric.compute(
    predictions=predictions, references=references
)
scores

{'accuracy': 0.6, 'f1': 0.6666666666666666}
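
To make the two numbers above concrete, here is a hand-rolled sketch of accuracy and binary F1, assuming class 1 is the positive class. The function name `accuracy_and_f1` is illustrative; it is meant as a check on the arithmetic, not a replacement for the library metric.

```python
def accuracy_and_f1(predictions, references, positive_label=1):
    """Compute accuracy and binary F1 from two equal-length lists of class ids."""
    if len(predictions) != len(references) or not predictions:
        raise ValueError("predictions and references must be non-empty and equal in length")
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Same toy inputs as the cell above: 3 of 5 correct, TP=2, FP=1, FN=1.
print(accuracy_and_f1([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))
```

With two true positives, one false positive, and one false negative, precision and recall are both 2/3, so F1 is also 2/3, in line with the output above.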

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in a format the model expects, as well as generate the other inputs the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

This notebook should work with any model checkpoint from the [Model Hub](https://huggingface.co/models), as long as that model has a version with a sequence classification head.

Here we picked the [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) checkpoint.

![](https://i.imgur.com/GmFRcP3.png)

BERT can be used for a variety of tasks and we will fine-tune it for classification (sentiment).

In [10]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We pass along `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you get an error with the previous call, remove that argument.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [11]:
tokenizer("Hello, this is a sentence!")

{'input_ids': [101, 7592, 1010, 2023, 2003, 1037, 6251, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. 

This will ensure that an input longer than what the selected model can handle will be truncated to the maximum length accepted by the model. 
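
As a rough illustration of what truncation means for a long review, the sketch below keeps the leading tokens and re-attaches the special tokens. It assumes the 512-token limit of `bert-base-uncased` and the `[CLS]`/`[SEP]` ids 101 and 102 seen in the tokenizer output above. The helper `truncate_with_special_tokens` is hypothetical and simplifies what the real tokenizer does.

```python
CLS_ID, SEP_ID = 101, 102  # special-token ids from the tokenizer output above


def truncate_with_special_tokens(token_ids, max_length=512):
    """Simplified sketch: keep at most max_length ids, counting [CLS] and [SEP]."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    budget = max_length - 2  # slots left for the actual text tokens
    return [CLS_ID] + list(token_ids[:budget]) + [SEP_ID]


long_review = list(range(1000, 2000))  # 1000 made-up token ids
truncated = truncate_with_special_tokens(long_review)
print(len(truncated), truncated[0], truncated[-1])  # 512 101 102
```

One consequence worth keeping in mind: everything past the limit is dropped, so for very long reviews the model only sees the beginning of the text.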

The padding will be dealt with later on (in a data collator) so we pad examples to the longest length in the batch and not the whole dataset.

In [12]:
def preprocess_function(examples):
    model_inputs = tokenizer(examples['review'], truncation=True)
    model_inputs["label"] = examples["sentiment"]
    return model_inputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [13]:
preprocess_function(imdb_data["train"][:2])

{'input_ids': [[101, 2028, 1997, 1996, 2060, 15814, 2038, 3855, 2008, 2044, 3666, 2074, 1015, 11472, 2792, 2017, 1005, 2222, 2022, 13322, 1012, 2027, 2024, 2157, 1010, 2004, 2023, 2003, 3599, 2054, 3047, 2007, 2033, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 2034, 2518, 2008, 4930, 2033, 2055, 11472, 2001, 2049, 24083, 1998, 4895, 10258, 2378, 8450, 5019, 1997, 4808, 1010, 2029, 2275, 1999, 2157, 2013, 1996, 2773, 2175, 1012, 3404, 2033, 1010, 2023, 2003, 2025, 1037, 2265, 2005, 1996, 8143, 18627, 2030, 5199, 3593, 1012, 2023, 2265, 8005, 2053, 17957, 2007, 12362, 2000, 5850, 1010, 3348, 2030, 4808, 1012, 2049, 2003, 13076, 1010, 1999, 1996, 4438, 2224, 1997, 1996, 2773, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2009, 2003, 2170, 11472, 2004, 2008, 2003, 1996, 8367, 2445, 2000, 1996, 17411, 4555, 3036, 2110, 7279, 4221, 12380, 2854, 1012, 2009, 7679, 3701, 2006, 14110, 2103, 1010, 2019, 6388, 2930, 1997, 1996, 3827, 2073, 2035, 1996, 4442, 2031, 3221, 21430

To apply this function to all the reviews in our dataset, we use the `map` method of the `imdb_data` object we created earlier. 

This applies the function to every element of every split in `imdb_data`, so our training and testing data are preprocessed in one single command.

In [14]:
tokenized_datasets = imdb_data.map(preprocess_function, batched=True)



  0%|          | 0/35 [00:00<?, ?ba/s]

  0%|          | 0/15 [00:00<?, ?ba/s]

In [16]:
tokenized_datasets = tokenized_datasets.remove_columns('review')
tokenized_datasets = tokenized_datasets.remove_columns('sentiment')

The results of `map` are automatically cached by the 🤗 Datasets library, which avoids spending time on this step the next time you run your notebook. 

The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus that the cached data should not be used). 

For instance, it should detect if you edit `preprocess_function` or switch to a different tokenizer checkpoint and rerun the notebook.

## Fine-tuning the Transformer Model 

Now that our data is ready, we can download the pretrained model and fine-tune it. 

Since our task is about sentence classification, we use the `AutoModelForSequenceClassification` class. 

Like with the tokenizer, the `from_pretrained` method will download and cache the model for us. 

The only thing we have to specify is the number of labels for our problem, which is 2 here (negative and positive).

In [17]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

The warning is telling us we are throwing away some weights (the `cls.predictions.*` and `cls.seq_relationship.*` layers listed above) and randomly initializing others (the new `classifier` layer). 

This is absolutely normal in this case, because we are removing the heads used to pretrain BERT (masked language modeling and next sentence prediction) and replacing them with a new classification head for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate a `Trainer`, we will need to define two more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [18]:
batch_size = 16
metric_name = "f1"
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-classification",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=False,
)

Here, we:

- set the evaluation to be done at the end of each epoch,
- tweak the learning rate,
- use the `batch_size` defined at the top of the cell,
- customize the weight decay.

Since the best model might not be the one at the end of training, we ask the `Trainer` to load the best model it saved (according to `metric_name`) at the end of training.

We use `DataCollatorWithPadding` to create batches of examples. It dynamically pads the tokenized inputs in each batch to the length of the longest element in that batch, so they have a uniform length. 

While it is possible to pad your text in the tokenizer function by setting `padding=True`, dynamic padding is more efficient.

In [19]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
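
To see why per-batch padding is cheaper, the sketch below pads a toy batch by hand and counts the token slots for dynamic versus fixed-length padding. It assumes a pad id of 0 (consistent with `padding_idx=0` in the model printout further down). The helpers `pad_batch` and `padded_slots` are illustrative, not the collator's actual implementation.

```python
PAD_ID = 0  # assumed [PAD] id; in practice read it from tokenizer.pad_token_id


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest length in this batch only."""
    if not sequences:
        raise ValueError("Cannot pad an empty batch")
    longest = max(len(seq) for seq in sequences)
    return {
        "input_ids": [list(seq) + [pad_id] * (longest - len(seq)) for seq in sequences],
        "attention_mask": [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences],
    }


def padded_slots(lengths, batch_size, fixed_length=None):
    """Count token slots after padding per batch (dynamic) or to a fixed length."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = fixed_length if fixed_length is not None else max(batch)
        total += target * len(batch)
    return total


batch = pad_batch([[101, 7592, 102], [101, 2023, 2003, 1037, 6251, 102]])
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0, 0]

lengths = [9, 120, 40, 512]  # made-up tokenized lengths
print(padded_slots(lengths, batch_size=2))                    # 1264
print(padded_slots(lengths, batch_size=2, fixed_length=512))  # 2048
```

In this toy case dynamic padding processes 1,264 token slots instead of 2,048. The saving on real data depends on how the review lengths are distributed.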


The last thing to define for our `Trainer` is how to compute the metrics from the predictions. 

We need to define a function for this, which will just use the `metric` we loaded earlier. The only preprocessing we have to do is to take the argmax of our predicted logits.

In [20]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [21]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

We can now fine-tune our model by just calling the `train` method:

In [22]:
trainer.train()

***** Running training *****
  Num examples = 35000
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 6564


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.1966 | 0.16732 | 0.9428 | 0.942754 |
| 2 | 0.1199 | 0.222054 | 0.9446 | 0.945333 |
| 3 | 0.0626 | 0.245842 | 0.947333 | 0.947606 |


***** Running Evaluation *****
  Num examples = 15000
  Batch size = 16
Saving model checkpoint to bert-base-uncased-finetuned-classification/checkpoint-2188
Configuration saved in bert-base-uncased-finetuned-classification/checkpoint-2188/config.json
Model weights saved in bert-base-uncased-finetuned-classification/checkpoint-2188/pytorch_model.bin
tokenizer config file saved in bert-base-uncased-finetuned-classification/checkpoint-2188/tokenizer_config.json
Special tokens file saved in bert-base-uncased-finetuned-classification/checkpoint-2188/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 15000
  Batch size = 16
Saving model checkpoint to bert-base-uncased-finetuned-classification/checkpoint-4376
Configuration saved in bert-base-uncased-finetuned-classification/checkpoint-4376/config.json
Model weights saved in bert-base-uncased-finetuned-classification/checkpoint-4376/pytorch_model.bin
tokenizer config file saved in bert-base-uncased-finetuned-classificatio

TrainOutput(global_step=6564, training_loss=0.13996629937257365, metrics={'train_runtime': 6178.6777, 'train_samples_per_second': 16.994, 'train_steps_per_second': 1.062, 'total_flos': 2.73284819870928e+16, 'train_loss': 0.13996629937257365, 'epoch': 3.0})
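
The step counts in the log can be reproduced with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept, which matches the settings printed at the start of training. The function name `optimization_steps` is illustrative.

```python
import math


def optimization_steps(num_examples, batch_size, num_epochs):
    """Return (steps_per_epoch, total_steps), counting the last partial batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs


per_epoch, total = optimization_steps(num_examples=35000, batch_size=16, num_epochs=3)
print(per_epoch, total)  # 2188 6564
```

This is why the first checkpoint is named `checkpoint-2188` (end of epoch 1) and training stops at `global_step=6564`.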

# Using your fine-tuned model for Classification

Once you’ve fine-tuned the model you can use it with a pipeline object, for inference as follows:

In [23]:
from transformers import pipeline

In [38]:
model.to('cuda')

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [39]:
clf = pipeline(task='text-classification', model=model, tokenizer=tokenizer, device=0)

In [40]:
document = "The movie was not good at all"

In [41]:
clf(document)

[{'label': 'LABEL_0', 'score': 0.9990333318710327}]

In [42]:
document = "The movie was amazing"

In [43]:
clf(document)

[{'label': 'LABEL_1', 'score': 0.995755672454834}]
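
The `score` values above are class probabilities. For a two-label classifier like this one, the pipeline is understood to apply a softmax to the model's logits and report the most likely class. The sketch below mimics that post-processing in plain Python. The logits are made up for illustration, and `to_pipeline_style` is a hypothetical helper, not the pipeline's actual code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_pipeline_style(logits):
    """Turn raw logits into a {'label', 'score'} dict like the output above."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
    return {"label": f"LABEL_{best}", "score": probs[best]}


print(to_pipeline_style([-3.5, 3.4]))  # made-up logits favouring class 1
```

Since the labels were encoded as 0 for negative and 1 for positive, `LABEL_0` corresponds to a negative review and `LABEL_1` to a positive one.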

## Fine-tuned Transformer performance on Test Data

We can feed our test set (which the model has not seen) to our pipeline to get a feel for the quality of the model predictions. 

In [44]:
imdb_data['test'][:2]

{'review': ["Just don't bother. I thought I would see a movie with great supspense and action.<br /><br />But it grows boring and terribly predictable after the interesting start. In the middle of the film you have a little social drama and all tension is lost because it slows down the speed. Towards the end the it gets better but not really great. I think the director took this movie just too serious. In such a kind of a movie even if u don't care about the plot at least you want some nice action. I nearly dozed off in the middle/main part of it. Rating 3/10.<br /><br />derboiler.",
  "Be careful with this one. Once you get yer mitts on it, it'll change the way you look at kung-fu flicks. You will be yearning a plot from all of the kung-fu films now, you will be wanting character depth and development, you will be craving mystery and unpredictability, you will demand dynamic camera work and incredible backdrops. Sadly, you won't find all of these aspects together in one kung-fu movie,

In [46]:
%%time

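# Classify every test review in batches of 100; truncation=True clips reviews that exceed the model's maximum input length
predictions = clf(imdb_data['test']['review'], batch_size=100, truncation=True)
predictions = [pred['label'] for pred in predictions]

# Map the pipeline's string labels to the dataset's integer labels (LABEL_0 -> 0, LABEL_1 -> 1)
predictions = [0 if item == 'LABEL_0' else 1 for item in predictions]
labels = imdb_data['test']['sentiment']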

CPU times: user 4min 13s, sys: 482 ms, total: 4min 14s
Wall time: 4min 10s


In [47]:
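import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report

# Per-class precision/recall/F1, then the confusion matrix (rows: true labels, columns: predicted labels)
print(classification_report(labels, predictions))
pd.DataFrame(confusion_matrix(labels, predictions))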

              precision    recall  f1-score   support

           0       0.95      0.94      0.95      7490
           1       0.94      0.95      0.95      7510

    accuracy                           0.95     15000
   macro avg       0.95      0.95      0.95     15000
weighted avg       0.95      0.95      0.95     15000



| actual (rows) / predicted (columns) | 0 | 1 |
|---|---|---|
| 0 | 7066 | 424 |
| 1 | 366 | 7144 |
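Every figure in the classification report can be recomputed from the confusion matrix alone. The sketch below does this in plain Python for the matrix above, assuming scikit-learn's convention that rows are true labels and columns are predicted labels. The helper name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(cm):
    """Compute accuracy and per-class precision/recall/F1 from a square
    confusion matrix whose rows are true labels and columns are predictions."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    report = {}
    for k in range(n):
        true_positives = cm[k][k]
        predicted_as_k = sum(cm[i][k] for i in range(n))  # column sum
        actually_k = sum(cm[k])                           # row sum (support)
        precision = true_positives / predicted_as_k
        recall = true_positives / actually_k
        f1 = 2 * precision * recall / (precision + recall)
        report[k] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actually_k,
        }
    return correct / total, report


# Confusion matrix of the fine-tuned model, copied from the output above.
fine_tuned_cm = [[7066, 424],
                 [366, 7144]]

accuracy, report = per_class_metrics(fine_tuned_cm)
print(f"accuracy: {accuracy:.2f}")
for label, scores in report.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
```

Rounded to two decimals, these values agree with the report printed by `classification_report`.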


## Pre-trained Transformer Model performance on Test Data

In [48]:
default_clf = pipeline(task='text-classification', device=0)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
All model checkpoint weights were used when initializing DistilBertForSequenceClassification.

In [49]:
document = "The movie was amazing"
default_clf(document)

[{'label': 'POSITIVE', 'score': 0.9998806715011597}]

In [50]:
%%time

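# Run the default (not fine-tuned on this dataset) pipeline over the same test reviews
predictions = default_clf(imdb_data['test']['review'], batch_size=100, truncation=True)
predictions = [pred['label'] for pred in predictions]

# This model emits 'NEGATIVE' / 'POSITIVE'; map them to the dataset's 0 / 1 labels
predictions = [0 if item == 'NEGATIVE' else 1 for item in predictions]
labels = imdb_data['test']['sentiment']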

CPU times: user 2min 15s, sys: 443 ms, total: 2min 15s
Wall time: 2min 13s


In [51]:
print(classification_report(labels, predictions))
pd.DataFrame(confusion_matrix(labels, predictions))

              precision    recall  f1-score   support

           0       0.86      0.92      0.89      7490
           1       0.92      0.86      0.89      7510

    accuracy                           0.89     15000
   macro avg       0.89      0.89      0.89     15000
weighted avg       0.89      0.89      0.89     15000



| actual (rows) / predicted (columns) | 0 | 1 |
|---|---|---|
| 0 | 6923 | 567 |
| 1 | 1081 | 6429 |
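To put the two runs side by side, the sketch below compares misclassification counts and inference throughput using only numbers reported in this notebook: the two confusion matrices and the two wall times (4min 10s and 2min 13s). The helper `error_rate` is a hypothetical name. Keep in mind that the default model is a DistilBERT checkpoint fine-tuned on SST-2 (per its name), so the comparison is between different architectures as well as different training data.

```python
def error_rate(cm):
    """Return (number of misclassified samples, error rate) for a square
    confusion matrix with true labels as rows and predictions as columns."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    wrong = total - correct
    return wrong, wrong / total


# Confusion matrices copied from the two evaluation runs in this notebook.
fine_tuned_cm = [[7066, 424], [366, 7144]]
default_cm = [[6923, 567], [1081, 6429]]

ft_wrong, ft_rate = error_rate(fine_tuned_cm)
df_wrong, df_rate = error_rate(default_cm)

# Fraction of the default model's errors that the fine-tuned model removes.
relative_reduction = 1 - ft_rate / df_rate

# Throughput from the reported wall times (15000 test reviews each).
n_reviews = 15000
ft_reviews_per_s = n_reviews / (4 * 60 + 10)
df_reviews_per_s = n_reviews / (2 * 60 + 13)

print(f"fine-tuned: {ft_wrong} errors ({ft_rate:.1%}), {ft_reviews_per_s:.0f} reviews/s")
print(f"default:    {df_wrong} errors ({df_rate:.1%}), {df_reviews_per_s:.0f} reviews/s")
print(f"relative error reduction: {relative_reduction:.1%}")
```

On these numbers, fine-tuning roughly halves the number of misclassified reviews, while the smaller default model processes the test set in about half the time.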
