# 2. Fine-tuning Sentiment Analysis
## Setup

In [1]:
!pip install -q transformers torch datasets accelerate

In [2]:
from datasets import load_dataset
import numpy as np

**Note:** The imports still need to be cleaned up and consolidated later, because the instructor imports things as they are needed.

## Explore data in GLUE benchmark for SST-2 task

In [3]:
# https://huggingface.co/datasets/amazon_polarity
# Larger dataset that takes a long time to process, we may want to try it later
# dataset = load_dataset("amazon_polarity")
raw_datasets = load_dataset("glue", "sst2")

Using the latest cached version of the dataset since glue couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'sst2' at /home/jupyter/.cache/huggingface/datasets/glue/sst2/1.0.0/fd8e86499fa5c264fcaad392a8f49ddf58bf4037 (last modified on Mon Dec 25 20:15:54 2023).


`raw_datasets` is a `DatasetDict`, a Python dictionary subclass that holds three `Dataset` objects: `train`, `validation` and `test`.

In [5]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

In [6]:
raw_datasets['train'] # Index dataset using its key

Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 67349
})

In [7]:
dir(raw_datasets['train']) # Let's see what attributes and methods the object has

['_TF_DATASET_REFS',
 '__class__',
 '__del__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getitems__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_build_local_temp_path',
 '_check_index_is_initialized',
 '_data',
 '_estimate_nbytes',
 '_fingerprint',
 '_format_columns',
 '_format_kwargs',
 '_format_type',
 '_generate_tables_from_cache_file',
 '_generate_tables_from_shards',
 '_get_cache_file_path',
 '_get_output_signature',
 '_getitem',
 '_indexes',
 '_indices',
 '_info',
 '_map_single',
 '_new_dataset_with_indices',
 '_output_all_columns',
 '_push_parquet_shards_to_hub',
 '_save_to_disk_single',
 '_select_contiguous',
 '_select_wi

In [8]:
type(raw_datasets['train'])

datasets.arrow_dataset.Dataset

In [9]:
raw_datasets['train'].data

MemoryMappedTable
sentence: string
label: int64
idx: int32
----
sentence: [["hide new secretions from the parental units ","contains no wit , only labored gags ","that loves its characters and communicates something rather beautiful about human nature ","remains utterly satisfied to remain the same throughout ","on the worst revenge-of-the-nerds clichés the filmmakers could dredge up ",...,"you wish you were at home watching that movie instead of in the theater watching this one ","'s no point in extracting the bare bones of byatt 's plot for purposes of bland hollywood romance ","underdeveloped ","the jokes are flat ","a heartening tale of small victories "],["suspense , intriguing characters and bizarre bank robberies , ","a gritty police thriller with all the dysfunctional family dynamics one could wish for ","with a wonderful ensemble cast of characters that bring the routine day to day struggles of the working class to life ","nonetheless appreciates the art and reveals a music s

In [10]:
raw_datasets['train'][0]

{'sentence': 'hide new secretions from the parental units ',
 'label': 0,
 'idx': 0}

In [11]:
raw_datasets['train'][50000:50003]

{'sentence': ['glow ',
  'a classical dramatic animated feature ',
  'best espionage picture '],
 'label': [1, 1, 1],
 'idx': [50000, 50001, 50002]}

Note that the result is a dictionary of lists:

In [12]:
type(raw_datasets['train'][50000:50003])

dict
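
Slicing a `Dataset` returns one list per column rather than one record per row. When individual examples are more convenient, the columns can be zipped back into rows. A minimal sketch in plain Python, using the values from the slice above (the helper name `columns_to_rows` is hypothetical, not part of `datasets`):

```python
def columns_to_rows(batch):
    """Turn a dict of equal-length lists into a list of per-example dicts."""
    keys = list(batch)
    return [dict(zip(keys, values)) for values in zip(*(batch[k] for k in keys))]

# Same content as raw_datasets['train'][50000:50003] shown above
batch = {
    'sentence': ['glow ', 'a classical dramatic animated feature ', 'best espionage picture '],
    'label': [1, 1, 1],
    'idx': [50000, 50001, 50002],
}

rows = columns_to_rows(batch)
for row in rows:
    print(row)
```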

Now let's check the `features` attribute, which even shows the class names for the `label` feature:

In [13]:
raw_datasets['train'].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'positive'], id=None),
 'idx': Value(dtype='int32', id=None)}

## Tokenization

In [14]:
from transformers import AutoTokenizer

In [15]:
checkpoint = "distilbert-base-uncased" # we chose this model checkpoint as it trains faster, but "bert-base-uncased" could've been a good option too
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


In [16]:
# Test our tokenizer on a subsample of our sentences
tokenized_sentences = tokenizer(raw_datasets['train'][0:3]['sentence'])
from pprint import pprint
pprint(tokenized_sentences)

{'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
 'input_ids': [[101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102],
               [101,
                3397,
                2053,
                15966,
                1010,
                2069,
                4450,
                2098,
                18201,
                2015,
                102],
               [101,
                2008,
                7459,
                2049,
                3494,
                1998,
                10639,
                2015,
                2242,
                2738,
                3376,
                2055,
                2529,
                3267,
                102]]}


In [17]:
# Wrap our tokenizer in a new function that sets truncation to True
def tokenize_fn(batch):
    return tokenizer(batch['sentence'], truncation=True)

In [18]:
# Map the wrapper function to all our samples
tokenized_datasets = raw_datasets.map(tokenize_fn, batched=True)
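
The tokenized sentences above have different lengths, so they must be padded before they can be stacked into one batch. We did not pad in `tokenize_fn`; when a tokenizer is passed to `Trainer`, padding is normally applied per batch by its default data collator. A rough sketch of that idea in plain Python, assuming a pad token id of 0 (the `[PAD]` id in this vocabulary); `pad_batch` is a hypothetical helper, not the library's implementation:

```python
PAD_ID = 0  # assumption: id of the [PAD] token

def pad_batch(input_ids, pad_id=PAD_ID):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    max_len = max(len(ids) for ids in input_ids)
    padded, masks = [], []
    for ids in input_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)       # real tokens, then padding
        masks.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {'input_ids': padded, 'attention_mask': masks}

# First two tokenized sentences from the output above (10 and 11 tokens)
input_ids = [
    [101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102],
    [101, 3397, 2053, 15966, 1010, 2069, 4450, 2098, 18201, 2015, 102],
]
batch = pad_batch(input_ids)
print(batch['input_ids'][0])
print(batch['attention_mask'][0])
```

Padding each batch only to its own longest sequence keeps batches of short sentences small, which is one reason to leave padding out of the `map` step.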

In [19]:
from transformers import TrainingArguments

In [20]:
# Define the training arguments; the model starts overfitting after just a few epochs, so we train for one
training_args = TrainingArguments(
    'my_trainer',
    evaluation_strategy='epoch', # Required if evaluating on a provided dataset
    save_strategy='epoch',
    num_train_epochs=1,
)
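
As a quick check on what one epoch means here, the number of optimizer steps can be estimated from the training set size. The sketch below assumes the default per-device batch size of 8, a single device, and no gradient accumulation, none of which is set explicitly above:

```python
import math

num_train_rows = 67349   # size of the train split shown earlier
batch_size = 8           # assumption: default per_device_train_batch_size
num_epochs = 1           # num_train_epochs above

# The last batch of an epoch may be smaller, hence the ceiling
steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 8419
```

Under those assumptions the result should match the `global_step` that `trainer.train()` reports.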

In [21]:
from transformers import AutoModelForSequenceClassification 

In [22]:
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Note:** This warning is expected. The classification head (`pre_classifier` and `classifier`) is newly initialized, and we do intend to train it (and therefore update its weights).

In [23]:
type(model)

transformers.models.distilbert.modeling_distilbert.DistilBertForSequenceClassification

In [24]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [25]:
!pip install torchinfo

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [26]:
from torchinfo import summary
summary(model)

Layer (type:depth-idx)                                  Param #
DistilBertForSequenceClassification                     --
├─DistilBertModel: 1-1                                  --
│    └─Embeddings: 2-1                                  --
│    │    └─Embedding: 3-1                              23,440,896
│    │    └─Embedding: 3-2                              393,216
│    │    └─LayerNorm: 3-3                              1,536
│    │    └─Dropout: 3-4                                --
│    └─Transformer: 2-2                                 --
│    │    └─ModuleList: 3-5                             42,527,232
├─Linear: 1-2                                           590,592
├─Linear: 1-3                                           1,538
├─Dropout: 1-4                                          --
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
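
The counts in this summary can be reproduced by hand from the layer shapes printed earlier: a `Linear(in, out)` layer has `in * out + out` parameters, an `Embedding(n, d)` has `n * d`, and a `LayerNorm(d)` has `2 * d`. The sketch below assumes a feed-forward hidden size of 3072 inside each transformer block and a second LayerNorm per block; those details are cut off in the model printout above, so treat them as assumptions:

```python
def linear(n_in, n_out):
    """Parameters of a Linear layer: weight matrix plus bias."""
    return n_in * n_out + n_out

def layer_norm(d):
    """Parameters of a LayerNorm: scale plus shift."""
    return 2 * d

d_model, vocab_size, max_positions = 768, 30522, 512
ffn_dim = 3072                 # assumption: not visible in the truncated printout
num_blocks, num_labels = 6, 2

word_emb = vocab_size * d_model
pos_emb = max_positions * d_model
emb_norm = layer_norm(d_model)

attention = 4 * linear(d_model, d_model)   # q_lin, k_lin, v_lin, out_lin
ffn = linear(d_model, ffn_dim) + linear(ffn_dim, d_model)
block = attention + layer_norm(d_model) + ffn + layer_norm(d_model)
blocks = num_blocks * block

pre_classifier = linear(d_model, d_model)
classifier = linear(d_model, num_labels)

total = word_emb + pos_emb + emb_norm + blocks + pre_classifier + classifier
print(f"{total:,}")  # 66,955,010
```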

**Note:** By default, the summary does not show the input and output sizes of each layer.

In [27]:
# Sanity check: do we end up training all the weights of the network, or only those
# in the newly initialized layers? To compare later, save the model params before fine-tuning.
params_before = []
for name, p in model.named_parameters():
    # .copy() so the snapshot cannot share memory with the live parameters
    params_before.append(p.detach().cpu().numpy().copy())

In [28]:
from transformers import Trainer
from datasets import load_metric

metric = load_metric("glue", "sst2")

  metric = load_metric("glue", "sst2")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [29]:
metric.compute(predictions=[1, 0, 1], references=[1, 0, 0])

{'accuracy': 0.6666666666666666}

In [30]:
def compute_metrics(logits_and_labels):
    # metric = load_metric("glue", "sst2")
    logits, labels = logits_and_labels
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
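
`compute_metrics` receives the model's raw logits together with the true labels, so it has to turn logits into class predictions first. A small sketch with NumPy and made-up logits (the numbers are hypothetical, chosen so the predictions match the `[1, 0, 1]` example above), computing accuracy by hand instead of through the metric object:

```python
import numpy as np

# Hypothetical logits for 3 sentences, columns = [negative, positive]
logits = np.array([[-1.2, 2.3],
                   [1.5, -0.4],
                   [-0.3, 0.9]])
labels = np.array([1, 0, 0])

predictions = np.argmax(logits, axis=-1)           # index of the highest logit per row
accuracy = float((predictions == labels).mean())   # fraction of correct predictions
print(predictions.tolist(), accuracy)
```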

In [31]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [32]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.198,0.345907,0.902523


TrainOutput(global_step=8419, training_loss=0.2634747613140432, metrics={'train_runtime': 398.8511, 'train_samples_per_second': 168.857, 'train_steps_per_second': 21.108, 'total_flos': 517212489917652.0, 'train_loss': 0.2634747613140432, 'epoch': 1.0})
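
The sanity check started earlier (saving `params_before`) can now be completed by comparing each parameter with its value after training. A minimal sketch of such a comparison; the helper `changed_parameters` and the stand-in arrays are hypothetical, used here so the example runs without the model:

```python
import numpy as np

def changed_parameters(names, before, after):
    """Return the names of parameters whose values differ between two snapshots."""
    return [n for n, b, a in zip(names, before, after) if not np.array_equal(b, a)]

# Hypothetical stand-in snapshots: two parameters updated, one left untouched
names = ["embeddings.weight", "frozen.bias", "classifier.weight"]
before = [np.zeros((2, 2)), np.ones(3), np.ones(3)]
after = [np.zeros((2, 2)) + 0.01, np.ones(3), np.ones(3) * 0.5]

print(changed_parameters(names, before, after))

# In the notebook this could be called roughly as:
# names = [n for n, _ in model.named_parameters()]
# params_after = [p.detach().cpu().numpy() for _, p in model.named_parameters()]
# changed_parameters(names, params_before, params_after)
```

If every parameter name comes back, the whole network was fine-tuned, not just the new classification layers.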

In [33]:
trainer.save_model('sa_saved_model')

In [34]:
!ls

'1 - Named Entity Recognition (NER).ipynb'
'1 - Pipeline Mask Language Modeling.ipynb'
'1 - Question Answering.ipynb'
'1 - Sentiment analysis.ipynb'
'1 - Text Generation.ipynb'
'1 - Text summarization.ipynb'
'1 - Translation.ipynb'
'1 - Zero-shot classification.ipynb'
'2 - Fine-tuning Sentiment Analysis.ipynb'
'2 - Fine-tuning sentiment custom dataset.ipynb'
'2 - Models & Tokenizers.ipynb'
'2 - NER.ipynb'
'2 - POS Tagger.ipynb'
'2 - QA - Advanced.ipynb'
 AirlineTweets.csv
'Fine-Tuning RTE.ipynb'
 README.md
 bbc_text_cls.csv
 data.json
 data_sentiment_finetuning.csv
 my_trainer
 ner_test.pkl
 ner_train.pkl
 robert_frost.txt
 sa_saved_model
 spa-eng
 spa-eng.zip



In [43]:
!ls sa_saved_model

config.json	   special_tokens_map.json  tokenizer_config.json  vocab.txt
model.safetensors  tokenizer.json	    training_args.bin



In [44]:
from transformers import pipeline

In [None]:
new_sa_model = pipeline('text-classification', model='sa_saved_model', device=0)

Let's now try our new Sentiment Analysis model:

In [46]:
new_sa_model('This movie is great!')

[{'label': 'LABEL_1', 'score': 0.9996929168701172}]
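
The `score` is the probability assigned to the predicted label, which for a two-class model like this one is typically a softmax over the two logits. A sketch in plain Python showing how large the gap between the two logits must be to produce the score above (the pipeline does not expose the logits here, so fixing the negative logit at 0 is an assumption for illustration):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

score = 0.9996929168701172            # score reported above for LABEL_1
gap = math.log(score / (1 - score))   # logit difference implied by that score

probs = softmax([0.0, gap])           # assumption: negative logit fixed at 0
print(round(gap, 2), probs)
```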

In [50]:
new_sa_model('I kind of liked this movie, but I did not love the main character actress')

[{'label': 'LABEL_0', 'score': 0.9623615145683289}]
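
The pipeline reports generic names (`LABEL_0`, `LABEL_1`) because no class names were stored with the model. Based on the `ClassLabel` feature shown earlier, index 0 is `negative` and index 1 is `positive`. A minimal post-processing sketch (the helper `rename_labels` is hypothetical):

```python
CLASS_NAMES = ['negative', 'positive']  # from raw_datasets['train'].features['label']

def rename_labels(results, class_names=CLASS_NAMES):
    """Replace 'LABEL_<i>' with the i-th class name in pipeline-style output."""
    renamed = []
    for item in results:
        index = int(item['label'].split('_')[-1])
        renamed.append({'label': class_names[index], 'score': item['score']})
    return renamed

# Outputs copied from the pipeline calls above
outputs = [
    {'label': 'LABEL_1', 'score': 0.9996929168701172},
    {'label': 'LABEL_0', 'score': 0.9623615145683289},
]
print(rename_labels(outputs))
```

An alternative is to pass `id2label` and `label2id` mappings to `from_pretrained` before training, so that the names are saved in the model config.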
