In [1]:
import numpy as np
import transformers
import tensorflow as tf
import pandas as pd

from datasets import load_dataset, concatenate_datasets
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments, pipeline



## Import Dataset

In [2]:
dataset = load_dataset('rotten_tomatoes')


In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [4]:
unique_labels = list(set(dataset['train']['label']))
print(unique_labels)

[0, 1]


## Metric Evaluation

In [5]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
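
To make the metric definitions concrete, here is a minimal pure-Python sketch of what `compute_metrics` evaluates for the positive class (label 1). The helper names `argmax_rows` and `binary_metrics` are hypothetical, and the toy logits and labels are made up for illustration; the notebook itself relies on scikit-learn.

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row (one class id per example)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def binary_metrics(labels, preds, positive=1):
    """Accuracy, precision, recall and F1 with `positive` as the target class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1,
            "precision": precision, "recall": recall}


# Toy data (made up): four examples, two logits each in the order (negative, positive).
toy_logits = [[0.2, 1.3], [1.5, -0.4], [0.1, 0.9], [-0.3, 0.8]]
toy_labels = [1, 0, 0, 1]

toy_preds = argmax_rows(toy_logits)
toy_metrics = binary_metrics(toy_labels, toy_preds)
print(toy_preds, toy_metrics)
```

One false positive among three positive predictions lowers precision while recall stays at 1.0, the same pattern that shows up in the few-shot results later in the notebook.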

## Tokenizer

In [6]:
def tokenize_function(examples):
    tokenized_batch = tokenizer(examples["text"], padding="max_length", truncation=True)
    tokenized_batch["labels"] = examples["label"]
    return tokenized_batch
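
As a rough illustration of what `padding="max_length"` and `truncation=True` do to each review, the sketch below pads or cuts a list of token ids to a fixed length and builds the matching attention mask. It is a simplification: the function name `pad_or_truncate`, the tiny `max_len`, and the padding id of 0 are assumptions for this example, and the real tokenizer also handles special tokens.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Cut a sequence to max_len, then right-pad it and build an attention mask."""
    kept = token_ids[:max_len]                       # truncation
    n_pad = max_len - len(kept)                      # how many pad tokens are needed
    input_ids = kept + [pad_id] * n_pad              # padding to a fixed length
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Made-up token ids, not real vocabulary entries.
short_ids, short_mask = pad_or_truncate([5, 6, 7], max_len=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=4)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example is padded to the same fixed length, short reviews carry many padding positions; padding each batch only to its longest member is a common alternative when training speed matters.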

In [7]:
tokenizer = DistilBertTokenizer.from_pretrained('typeform/distilbert-base-uncased-mnli')



In [8]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)


## Zero-Shot using DistilBERT trained on MNLI

In [9]:
classifier = pipeline("zero-shot-classification", model="typeform/distilbert-base-uncased-mnli")



In [10]:
def zero_shot_classification(dataset, classifier, candidate_labels):
    """Return predicted and true class ids for every example in `dataset`."""
    texts = dataset['text']
    true_labels = dataset['label']
    preds = []

    for text in texts:
        outputs = classifier(text, candidate_labels)
        # The pipeline returns labels sorted by descending score,
        # so the first entry is the predicted label.
        pred_label = outputs['labels'][0]
        # Convert the predicted label back to its position in candidate_labels.
        preds.append(candidate_labels.index(pred_label))

    return preds, true_labels
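
The notebook passes the raw class ids `[0, 1]` as candidate labels. A possible variation, sketched below under stated assumptions, is to use descriptive label words and map the top-scoring one back to a class id. The names `label_to_id` and `top_label_id` are hypothetical, and `mock_outputs` is a hand-written stand-in (with made-up scores) for one zero-shot pipeline result, so no model is loaded here.

```python
# Hypothetical descriptive labels; the notebook itself uses the class ids 0 and 1.
label_to_id = {"negative": 0, "positive": 1}


def top_label_id(outputs, label_to_id):
    """Map the highest-scoring label of one pipeline result to its class id."""
    scores = outputs["scores"]
    best = max(range(len(scores)), key=scores.__getitem__)
    return label_to_id[outputs["labels"][best]]


# Mocked result shaped like the pipeline output: labels and scores are parallel lists.
mock_outputs = {
    "sequence": "consistently clever and suspenseful .",
    "labels": ["positive", "negative"],
    "scores": [0.91, 0.09],
}

mock_pred = top_label_id(mock_outputs, label_to_id)
print(mock_pred)
```

Whether descriptive labels would improve on the scores reported below is untested here; the sketch only shows how the pipeline output can be read.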

In [11]:
preds, true_labels = zero_shot_classification(dataset['test'], classifier, unique_labels)

In [12]:
precision, recall, f1, _ = precision_recall_fscore_support(true_labels, preds, average='binary')
accuracy = accuracy_score(true_labels, preds)

print(f"Accuracy: {accuracy}")
print(f"F1 score: {f1}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")

Accuracy: 0.6031894934333959
F1 score: 0.6997870830376153
Precision: 0.5627853881278538
Recall: 0.924953095684803


## Few-Shot

In [13]:
few_shot_model = DistilBertForSequenceClassification.from_pretrained('typeform/distilbert-base-uncased-mnli', num_labels=2, ignore_mismatched_sizes=True)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at typeform/distilbert-base-uncased-mnli and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([3]) in the checkpoint and torch.Size([2]) in the model instantiated
- classifier.weight: found shape torch.Size([3, 768]) in the checkpoint and torch.Size([2, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
def stratified_sample(dataset, k=50):
    pos_samples = dataset.filter(lambda example: example['label'] == 1).select(range(k // 2))
    neg_samples = dataset.filter(lambda example: example['label'] == 0).select(range(k // 2))
    
    combined_samples = concatenate_datasets([pos_samples, neg_samples])
    return combined_samples.shuffle(seed=42)
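
The same idea can be sketched without the `datasets` library. The version below works on a plain list of dictionaries; the function name `stratified_sample_rows` and the toy rows are assumptions for illustration. Unlike `stratified_sample`, which takes the first `k // 2` rows of each class and therefore relies on the dataset having been shuffled beforehand, this sketch shuffles inside the function.

```python
import random


def stratified_sample_rows(rows, k=50, seed=42):
    """Pick k // 2 positive and k // 2 negative rows, then shuffle the result."""
    rng = random.Random(seed)
    shuffled = rows[:]            # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = k // 2
    pos = [r for r in shuffled if r["label"] == 1][:half]
    neg = [r for r in shuffled if r["label"] == 0][:half]
    sample = pos + neg
    rng.shuffle(sample)           # mix the two classes
    return sample


# Toy rows (made up): 200 reviews with alternating labels.
toy_rows = [{"text": f"review {i}", "label": i % 2} for i in range(200)]
toy_sample = stratified_sample_rows(toy_rows, k=50)
print(len(toy_sample), sum(r["label"] for r in toy_sample))
```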

In [15]:
shuffled_dataset = dataset.shuffle(seed=42)

In [16]:
few_shot_train_dataset = stratified_sample(shuffled_dataset['train'], k=50)
few_shot_eval_dataset = stratified_sample(shuffled_dataset['validation'], k=50)


In [17]:
tokenized_few_shot_train_dataset = few_shot_train_dataset.map(tokenize_function, batched=True)
tokenized_few_shot_eval_dataset = few_shot_eval_dataset.map(tokenize_function, batched=True)


In [18]:
few_shot_training_args = TrainingArguments(
    output_dir='./results',          # Output directory
    num_train_epochs=10,             # Total number of training epochs
    per_device_train_batch_size=8,   # Batch size per device during training
    per_device_eval_batch_size=16,   # Batch size for evaluation
    # Note: 500 warmup steps exceed the 40 total steps logged for this run,
    # so the learning rate never leaves the warmup phase.
    warmup_steps=500,                # Number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # Strength of weight decay
    logging_dir='./logs',            # Directory for storing logs
    logging_steps=10,                # Log every 10 update steps
    evaluation_strategy="epoch",     # Evaluate at the end of each epoch
    save_strategy="epoch",           # Save a checkpoint at the end of each epoch
    report_to="none"
)

In [19]:
few_shot_trainer = Trainer(
    model=few_shot_model,
    args=few_shot_training_args,
    train_dataset=tokenized_few_shot_train_dataset,
    eval_dataset=tokenized_few_shot_eval_dataset,
    compute_metrics=compute_metrics,
)

In [20]:
few_shot_trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | No log | 0.685623 | 0.58 | 0.704225 | 0.543478 | 1.0 |
| 2 | No log | 0.684344 | 0.58 | 0.704225 | 0.543478 | 1.0 |
| 3 | 0.706500 | 0.682324 | 0.58 | 0.704225 | 0.543478 | 1.0 |
| 4 | 0.706500 | 0.680062 | 0.58 | 0.704225 | 0.543478 | 1.0 |
| 5 | 0.835900 | 0.677363 | 0.6 | 0.714286 | 0.555556 | 1.0 |
| 6 | 0.835900 | 0.673933 | 0.6 | 0.714286 | 0.555556 | 1.0 |
| 7 | 0.835900 | 0.66968 | 0.62 | 0.724638 | 0.568182 | 1.0 |
| 8 | 0.684000 | 0.665781 | 0.66 | 0.746269 | 0.595238 | 1.0 |
| 9 | 0.684000 | 0.661668 | 0.68 | 0.757576 | 0.609756 | 1.0 |
| 10 | 0.644000 | 0.656433 | 0.72 | 0.78125 | 0.641026 | 1.0 |




TrainOutput(global_step=40, training_loss=0.7176114559173584, metrics={'train_runtime': 39.7897, 'train_samples_per_second': 12.566, 'train_steps_per_second': 1.005, 'total_flos': 66233699328000.0, 'train_loss': 0.7176114559173584, 'epoch': 10.0})
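
The logged `global_step=40` is worth comparing with `warmup_steps=500`. The sketch below reproduces the step count and evaluates a linear warmup-then-decay multiplier, which is the default schedule shape used by `Trainer`. Two things are assumptions here: the helper names are hypothetical, and `n_devices=2` is inferred from the logged step counts (40 for this run, 1602 for the full fine-tuning run) rather than stated anywhere in the notebook.

```python
import math


def total_steps(n_examples, per_device_batch, n_devices, epochs):
    """Optimizer steps for a run without gradient accumulation."""
    steps_per_epoch = math.ceil(n_examples / (per_device_batch * n_devices))
    return steps_per_epoch * epochs


def linear_warmup_factor(step, warmup_steps, n_total):
    """Learning-rate multiplier: linear ramp up during warmup, linear decay afterwards."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (n_total - step) / max(1, n_total - warmup_steps))


few_shot_steps = total_steps(50, per_device_batch=8, n_devices=2, epochs=10)
fine_tune_steps = total_steps(8530, per_device_batch=8, n_devices=2, epochs=3)

# Multiplier at the end of the few-shot run, which is still inside the warmup phase.
final_factor = linear_warmup_factor(few_shot_steps, 500, few_shot_steps)

# 5e-5 is the default learning rate of TrainingArguments.
final_lr = 5e-5 * final_factor
print(few_shot_steps, fine_tune_steps, final_factor, final_lr)
```

Under these assumptions the few-shot run ends at roughly 8% of the nominal learning rate, which may explain the slow improvement in the table above. A much smaller `warmup_steps` value would be a natural thing to try.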

In [21]:
test_results = few_shot_trainer.evaluate(tokenized_datasets['test'])
print(test_results)



{'eval_loss': 0.7083517909049988, 'eval_accuracy': 0.5281425891181989, 'eval_f1': 0.6271312083024463, 'eval_precision': 0.5183823529411765, 'eval_recall': 0.7936210131332082, 'eval_runtime': 9.8374, 'eval_samples_per_second': 108.362, 'eval_steps_per_second': 3.456, 'epoch': 10.0}


### Inferencing

In [27]:
sample_test_dataset = tokenized_datasets["test"].select(range(10))

In [28]:
predictions = few_shot_trainer.predict(sample_test_dataset)

pred_labels = np.argmax(predictions.predictions, axis=1)

for i, (text, pred_label) in enumerate(zip(sample_test_dataset["text"], pred_labels)):
    print(f"Example {i+1}:")
    print(f"Review: {text}")
    print(f"Predicted sentiment: {'positive' if pred_label else 'negative'}\n")



Example 1:
Review: lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .
Predicted sentiment: negative

Example 2:
Review: consistently clever and suspenseful .
Predicted sentiment: positive

Example 3:
Review: it's like a " big chill " reunion of the baader-meinhof gang , only these guys are more harmless pranksters than political activists .
Predicted sentiment: positive

Example 4:
Review: the story gives ample opportunity for large-scale action and suspense , which director shekhar kapur supplies with tremendous skill .
Predicted sentiment: positive

Example 5:
Review: red dragon " never cuts corners .
Predicted sentiment: positive

Example 6:
Review: fresnadillo has something serious to say about the ways in which extravagant chance can distort our perspective and throw us off the path of good sense .
Predicted sentiment: positive

Example 7:
Review: throws in enough clever and unexpected twists to make

## Fine-Tuning

In [22]:
fine_tune_model = DistilBertForSequenceClassification.from_pretrained('typeform/distilbert-base-uncased-mnli', num_labels=2, ignore_mismatched_sizes=True)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at typeform/distilbert-base-uncased-mnli and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([3]) in the checkpoint and torch.Size([2]) in the model instantiated
- classifier.weight: found shape torch.Size([3, 768]) in the checkpoint and torch.Size([2, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
fine_tune_training_args = TrainingArguments(
    output_dir='./results',          # Output directory
    num_train_epochs=3,             # Total number of training epochs
    per_device_train_batch_size=8,   # Batch size per device during training
    per_device_eval_batch_size=16,   # Batch size for evaluation
    warmup_steps=500,                # Number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # Strength of weight decay
    logging_dir='./logs',            # Directory for storing logs
    logging_steps=10,                # Log every 10 update steps
    evaluation_strategy="epoch",     # Evaluate at the end of each epoch
    save_strategy="epoch",           # Save a checkpoint at the end of each epoch
    report_to="none"
)

In [24]:
fine_tune_trainer = Trainer(
    model=fine_tune_model,
    args=fine_tune_training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics
)

In [25]:
fine_tune_trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.4113 | 0.417905 | 0.82364 | 0.805383 | 0.898383 | 0.729831 |
| 2 | 0.3504 | 0.467812 | 0.84334 | 0.840497 | 0.856031 | 0.825516 |
| 3 | 0.0832 | 0.722272 | 0.840525 | 0.842301 | 0.833028 | 0.851782 |




TrainOutput(global_step=1602, training_loss=0.29641858706211255, metrics={'train_runtime': 743.5128, 'train_samples_per_second': 34.418, 'train_steps_per_second': 2.155, 'total_flos': 3389840731607040.0, 'train_loss': 0.29641858706211255, 'epoch': 3.0})

In [26]:
test_results = fine_tune_trainer.evaluate(tokenized_datasets['test'])
print(test_results)



{'eval_loss': 0.7777389883995056, 'eval_accuracy': 0.8208255159474672, 'eval_f1': 0.8192999053926207, 'eval_precision': 0.8263358778625954, 'eval_recall': 0.8123827392120075, 'eval_runtime': 10.2815, 'eval_samples_per_second': 103.682, 'eval_steps_per_second': 3.307, 'epoch': 3.0}


### Inferencing

In [31]:
predictions = fine_tune_trainer.predict(sample_test_dataset)

pred_labels = np.argmax(predictions.predictions, axis=1)

for i, (text, pred_label) in enumerate(zip(sample_test_dataset["text"], pred_labels)):
    print(f"Example {i+1}:")
    print(f"Review: {text}")
    print(f"Predicted sentiment: {'positive' if pred_label else 'negative'}\n")



Example 1:
Review: lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .
Predicted sentiment: negative

Example 2:
Review: consistently clever and suspenseful .
Predicted sentiment: positive

Example 3:
Review: it's like a " big chill " reunion of the baader-meinhof gang , only these guys are more harmless pranksters than political activists .
Predicted sentiment: negative

Example 4:
Review: the story gives ample opportunity for large-scale action and suspense , which director shekhar kapur supplies with tremendous skill .
Predicted sentiment: positive

Example 5:
Review: red dragon " never cuts corners .
Predicted sentiment: negative

Example 6:
Review: fresnadillo has something serious to say about the ways in which extravagant chance can distort our perspective and throw us off the path of good sense .
Predicted sentiment: positive

Example 7:
Review: throws in enough clever and unexpected twists to make

## Summary

### Zero-Shot using DistilBERT MNLI
Using DistilBERT trained on the MultiNLI dataset, with no fine-tuning for sentiment analysis, the zero-shot pipeline reached an accuracy of 60.32% on the test set, with an F1 score of 69.98%, precision of 56.28%, and recall of 92.50%. Accuracy is only about ten points above the 50% expected from guessing on a two-class task, and the gap between recall and precision shows that the model labels most reviews as positive. Note that the candidate labels passed to the pipeline were the raw class ids 0 and 1 rather than descriptive words.

### Few-Shot
For few-shot learning with 50 training examples, performance on the 50-example validation subset improved gradually over 10 epochs. Accuracy rose from 58% in the first epoch to 72% in the 10th, F1 from 70.42% to 78.12%, and precision from 54.35% to 64.10%. Recall stayed at 100% throughout, so every error was a negative review predicted as positive, which indicates a bias towards the positive class. On the full test set the model reached only 52.81% accuracy and an F1 score of 62.71%, close to chance and below the zero-shot result.
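
The few-shot validation metrics can be reproduced from confusion counts. The sketch below assumes the validation subset holds 25 positive and 25 negative reviews (which is what `stratified_sample` with `k=50` produces); the counts themselves are inferred from the logged metrics, not printed by the notebook, and `metrics_from_counts` is a hypothetical helper.

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Return (accuracy, f1, precision, recall) from binary confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return accuracy, f1, precision, recall


# Inferred counts: all 25 positives are found in both epochs (fn=0),
# and only the number of negatives misclassified as positive changes.
epoch_1 = metrics_from_counts(tp=25, fp=21, tn=4, fn=0)
epoch_10 = metrics_from_counts(tp=25, fp=14, tn=11, fn=0)

print([round(v, 6) for v in epoch_1])
print([round(v, 6) for v in epoch_10])
```

Read this way, training moved the model from rejecting 4 of 25 negative reviews to rejecting 11 of 25, while it kept predicting positive for every positive review.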

### Fine-Tuning
With full fine-tuning on the 8,530 training examples, validation performance improved substantially:
- In the first epoch, the model reached an accuracy of 82.36% and an F1 score of 80.54%. Precision was very high at 89.84%, with recall at 72.98%.
- The second epoch saw further improvements, with accuracy climbing to 84.33% and F1 to 84.05%. Precision was 85.60%, with recall at 82.55%.
- In the third epoch, accuracy dipped slightly to 84.05% while F1 rose to 84.23%. Precision decreased to 83.30%, but recall increased to 85.18%, indicating a better balance between positive and negative predictions.

Validation loss rose from 0.418 to 0.722 over the same three epochs while training loss fell to 0.0832, which suggests that the model began to overfit after the first epoch. On the test set, the fine-tuned model reached 82.08% accuracy and an F1 score of 81.93%.
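
Which epoch counts as "best" depends on the selection criterion. The sketch below copies the three logged validation rows of the fine-tuning run into a hypothetical `fine_tune_log` list and picks the best epoch under three different criteria.

```python
# Validation rows copied from the fine-tuning log above.
fine_tune_log = [
    {"epoch": 1, "train_loss": 0.4113, "val_loss": 0.417905, "accuracy": 0.82364, "f1": 0.805383},
    {"epoch": 2, "train_loss": 0.3504, "val_loss": 0.467812, "accuracy": 0.84334, "f1": 0.840497},
    {"epoch": 3, "train_loss": 0.0832, "val_loss": 0.722272, "accuracy": 0.840525, "f1": 0.842301},
]

best_by_loss = min(fine_tune_log, key=lambda row: row["val_loss"])["epoch"]
best_by_accuracy = max(fine_tune_log, key=lambda row: row["accuracy"])["epoch"]
best_by_f1 = max(fine_tune_log, key=lambda row: row["f1"])["epoch"]

print(best_by_loss, best_by_accuracy, best_by_f1)
```

The three criteria select three different epochs here. Since the notebook evaluates whatever weights remain after the final epoch, the `load_best_model_at_end` and `metric_for_best_model` options of `TrainingArguments` could be used to restore a chosen checkpoint instead; that variation was not run in this notebook.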