# Text Classification Using a Representation Model

In this notebook, we use a representation model that has already been fine-tuned for text classification, so it predicts labels directly.
Text classification with a general-purpose embedding model is covered in a separate notebook.

In [6]:
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")

dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [7]:
dataset["train"][10]

{'text': 'this is a film well worth seeing , talking and singing heads and all .',
 'label': 1}

## Load the sentiment analysis model

In [10]:
from transformers import pipeline

model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"

pipe = pipeline(
    model=model_name,
    tokenizer=model_name,
    device="cuda:0",         # requires a CUDA-capable GPU
    return_all_scores=True   # return the score of every label, not only the top one
)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
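
The pipeline above hard-codes `device="cuda:0"`, which will not work on a machine without a CUDA GPU. A minimal sketch of a fallback, assuming PyTorch is the backend; `pick_device` is a hypothetical helper name and is not part of the original notebook:

```python
def pick_device() -> str:
    """Return "cuda:0" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch  # assumed backend; may be missing in other setups
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The returned string could then be passed as `device=device` in the `pipeline(...)` call instead of the literal `"cuda:0"`.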


In [13]:
from transformers.pipelines.pt_utils import KeyDataset
from tqdm import tqdm
import numpy as np

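y_pred = []

for out in tqdm(pipe(KeyDataset(dataset["test"], "text")), total=len(dataset["test"])):
    # The model scores three classes. Compare index 0 (negative) with index 2 (positive)
    # and ignore index 1 (neutral), so argmax yields 0 = negative, 1 = positive.
    y_pred.append(np.argmax([out[0]["score"], out[2]["score"]]))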

100%|██████████| 1066/1066 [00:10<00:00, 103.52it/s]
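
The loop above relies on the position of each label in the pipeline output. A sketch of the same mapping done by label name instead, assuming each output item is a list of `{"label", "score"}` dictionaries whose labels include "negative" and "positive" in some casing; `to_binary_label` is a hypothetical helper and the scores below are invented for illustration:

```python
def to_binary_label(scores):
    """Map per-label scores to 0 (negative) or 1 (positive), ignoring any neutral class."""
    # Index the scores by lower-cased label name so positions do not matter
    by_label = {item["label"].lower(): item["score"] for item in scores}
    # A tie resolves to 0, matching np.argmax on [negative, positive]
    return int(by_label["positive"] > by_label["negative"])


# Invented scores in the shape of a three-class sentiment output
example = [
    {"label": "negative", "score": 0.05},
    {"label": "neutral", "score": 0.30},
    {"label": "positive", "score": 0.65},
]
print(to_binary_label(example))
```

Because the lookup is by name, the same helper would also work for a two-class model, as long as its labels are named "negative" and "positive".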


In [14]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    performance = classification_report(y_true, y_pred, target_names=["Negative Review","Positive Review"])
    print(performance)

In [15]:
evaluate_performance(dataset["test"]["label"],y_pred)

                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066
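
The accuracy row can be cross-checked against the per-class recall: accuracy is the support-weighted average of the recalls. A small sketch using the rounded values from the report above, so the counts it implies are approximate:

```python
# Values copied from the classification report (recall is rounded to two decimals)
support = {"negative": 533, "positive": 533}
recall = {"negative": 0.88, "positive": 0.72}

# recall * support approximates the number of correctly classified reviews per class
correct = sum(recall[c] * support[c] for c in support)
accuracy = correct / sum(support.values())
print(round(accuracy, 2))
```

This also shows where the model loses accuracy: it recovers negative reviews well but misses a larger share of the positive ones.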



### Notes
1. The model reaches 0.80 accuracy even though it was trained on tweets rather than on the domain of this dataset (movie reviews).
2. There are two ways to improve performance further:
    * Option 1: Use a different model that was trained on in-domain data.
    * Option 2: Use a different kind of representation model, namely an embedding model.

#### Option 1

In [16]:
model_name = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

pipe_ft = pipeline(
    model=model_name,
    tokenizer=model_name,
    device="cuda:0",
    return_all_scores=True
)


In [17]:
y_pred_ft = []

for out in tqdm(pipe_ft(KeyDataset(dataset["test"], "text")), total=len(dataset["test"])):
    # This model has only two classes: index 0 (negative) and index 1 (positive)
    y_pred_ft.append(np.argmax([out[0]["score"], out[1]["score"]]))


100%|██████████| 1066/1066 [00:06<00:00, 177.04it/s]


In [18]:
evaluate_performance(dataset["test"]["label"],y_pred_ft)

                 precision    recall  f1-score   support

Negative Review       0.89      0.90      0.90       533
Positive Review       0.90      0.89      0.90       533

       accuracy                           0.90      1066
      macro avg       0.90      0.90      0.90      1066
   weighted avg       0.90      0.90      0.90      1066



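Accuracy rises from 0.80 to 0.90, so this model clearly outperforms the previous one. This fits its training data: SST-2 consists of sentences from movie reviews, the same domain as the test set.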