<a href="https://colab.research.google.com/github/aarushijohly/NLP/blob/main/text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Movie review sentiment analysis using **"twitter-roberta-base-sentiment-latest"** on the **"rotten tomatoes"** dataset.

In [None]:
!pip install transformers accelerate sentence_transformers datasets openai



In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

In [None]:
from datasets import load_dataset

data=load_dataset("rotten_tomatoes")
print(data)

In [None]:
model_path="cardiffnlp/twitter-roberta-base-sentiment-latest"

pipe=pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,
    device="cuda:0"
)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


In [None]:
y_pred=[]
for output in tqdm(pipe(KeyDataset(data["test"],"text")), total=len(data["test"])):
    negative_score=output[0]["score"]
    positive_score=output[2]["score"]
    assignment=np.argmax([negative_score,positive_score])
    y_pred.append(assignment)

Disabling tokenizer parallelism, we're using DataLoader multithreading already
100%|██████████| 1066/1066 [00:12<00:00, 82.57it/s]
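
The loop above keeps only the negative and positive scores and ignores the neutral class, so the three-way model yields a binary prediction. A minimal sketch of that step on made-up scores (the `outputs` list is hypothetical and only mimics the per-review output format, assuming the label order negative, neutral, positive):

```python
import numpy as np

# Hypothetical pipeline outputs for two reviews: one list of label/score dicts per review,
# assumed to be in the order negative, neutral, positive.
outputs = [
    [{"label": "negative", "score": 0.70},
     {"label": "neutral", "score": 0.20},
     {"label": "positive", "score": 0.10}],
    [{"label": "negative", "score": 0.05},
     {"label": "neutral", "score": 0.50},
     {"label": "positive", "score": 0.45}],
]

y_pred_demo = []
for output in outputs:
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    # 0 = negative, 1 = positive; the neutral score is never compared
    y_pred_demo.append(int(np.argmax([negative_score, positive_score])))

print(y_pred_demo)
```

Note that the second review is assigned "positive" even though "neutral" has the highest score, which is one reason the binary results may differ from the model's own top label.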


In [None]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print classification report"""
    performance=classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

In [None]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066



**Leveraging Embeddings**

In [None]:
from sentence_transformers import SentenceTransformer

model=SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

train_embeddings=model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings=model.encode(data["test"]["text"], show_progress_bar=True)

In [None]:
from sklearn.linear_model import LogisticRegression

clf=LogisticRegression(random_state=42)
clf.fit(train_embeddings, data["train"]["label"])

y_pred=clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)

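**Zero-shot learning**: To assign labels to documents without any labeled training data, we embed a short description of each candidate label and compute the cosine similarity for every document-label pair. Each document gets the label whose description it is most similar to.

Cosine similarity measures how similar a given document's embedding is to the embedding of a candidate label's description.

Try it with a very negative or very positive movie review.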

In [None]:
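from sklearn.metrics.pairwise import cosine_similarity

label_embeddings=model.encode(["A negative review", "A positive review"])
# One row per test review; column 0 = "A negative review", column 1 = "A positive review"
sim_matrix=cosine_similarity(test_embeddings, label_embeddings)
y_pred=np.argmax(sim_matrix, axis=1)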
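
To make the similarity step concrete, here is a small self-contained sketch that uses made-up 3-dimensional vectors in place of real embeddings. The names `cosine_sim`, `review_vec`, and `label_vecs` are illustrative and not part of the notebook.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for sentence embeddings (values are invented)
label_vecs = {
    "A negative review": np.array([1.0, 0.0, 0.0]),
    "A positive review": np.array([0.0, 1.0, 0.0]),
}
review_vec = np.array([0.2, 0.9, 0.1])

# Score the review against each label description and keep the best match
scores = {label: cosine_sim(review_vec, vec) for label, vec in label_vecs.items()}
predicted_label = max(scores, key=scores.get)
print(scores, predicted_label)
```

With real embeddings the same comparison runs over every review at once, which is what a similarity matrix followed by a row-wise argmax does.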

**With Generative Models**

In [None]:
from transformers import pipeline

pipe=pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    device="cuda:0"
)

In [None]:
# Trailing space keeps the question separate from the review text that is appended to it
prompt="Is the following sentence positive or negative? "
data=data.map(lambda example:{"t5": prompt+example['text']})
data
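
Once the prompts are built, the generated text still has to be turned into 0/1 labels before it can be compared with the dataset labels. A rough sketch of one way to do that, assuming the model answers with the literal word "positive" or "negative"; `text_to_label` is a hypothetical helper, and falling back to 0 for unrecognized text is only one possible choice:

```python
def text_to_label(generated_text):
    """Map generated text to 1 (positive) or 0 (negative, also used as the fallback)."""
    return 1 if generated_text.strip().lower().startswith("positive") else 0

# Made-up generations for illustration, not real model output
examples = ["positive", "Negative", " positive.", "unsure"]
labels = [text_to_label(t) for t in examples]
print(labels)
```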