<a href="https://colab.research.google.com/github/aarushijohly/NLP/blob/main/text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Aim: Leveraging pretrained language models for classifying text.

1. Using a representative pretrained task-specific model
2. When no model is pretrained for the specific task: using an embedding model to generate features, which are then fed to a classic machine learning classifier
3. When there is no labeled data
4. Using a generative model

Movie review sentiment analysis using **"twitter-roberta-base-sentiment-latest"** on the **"rotten tomatoes"** dataset.

This model is fine-tuned on tweets for sentiment analysis. Although it was not trained specifically on movie reviews, we explore how well it generalizes to them.

Embedding model used: sentence-transformers/all-mpnet-base-v2

In [None]:
!pip install transformers accelerate sentence_transformers datasets openai

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

In [None]:
from datasets import load_dataset

data=load_dataset("rotten_tomatoes")
print(data)

In [None]:
model_path="cardiffnlp/twitter-roberta-base-sentiment-latest"

pipe=pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,
    device="cuda:0"
)

In [None]:
y_pred=[]
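# The model returns three scores per review, ordered negative, neutral, positive.
for output in tqdm(pipe(KeyDataset(data["test"],"text")), total=len(data["test"])):
    negative_score=output[0]["score"]
    positive_score=output[2]["score"]
    # Ignore the neutral score and pick the higher of the two: 0=negative, 1=positive.
    assignment=np.argmax([negative_score,positive_score])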
    y_pred.append(assignment)
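
The loop above reduces a three-class output to a binary label by position. As a minimal sketch of the same idea, the snippet below looks the scores up by label name instead, which does not depend on the order of the outputs. The `toy_outputs` values and the `to_binary` helper are hypothetical and only illustrate the mapping; they are not real model scores.

```python
import numpy as np

# Hypothetical pipeline outputs for two reviews (made-up scores),
# using the label names negative, neutral, positive.
toy_outputs = [
    [{"label": "negative", "score": 0.70},
     {"label": "neutral", "score": 0.20},
     {"label": "positive", "score": 0.10}],
    [{"label": "negative", "score": 0.05},
     {"label": "neutral", "score": 0.60},
     {"label": "positive", "score": 0.35}],
]

def to_binary(output):
    """Map three-class scores to 0 (negative) or 1 (positive), ignoring neutral."""
    scores_by_label = {item["label"]: item["score"] for item in output}
    return int(np.argmax([scores_by_label["negative"], scores_by_label["positive"]]))

toy_pred = [to_binary(o) for o in toy_outputs]
print(toy_pred)
```

Note that the second toy review is mostly neutral, yet it is still forced into one of the two classes. That is a consequence of the binary labels in the dataset.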

In [None]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print classification report"""
    performance=classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

# Alias for the misspelled name that some cells call.
evaluate_performane=evaluate_performance

In [None]:
evaluate_performane(data["test"]["label"], y_pred)

**Leveraging Embeddings**

In [None]:
from sentence_transformers import SentenceTransformer

model=SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

train_embeddings=model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings=model.encode(data["test"]["text"], show_progress_bar=True)

In [None]:
from sklearn.linear_model import LogisticRegression

clf=LogisticRegression(random_state=42)
clf.fit(train_embeddings, data["train"]["label"])

y_pred=clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)

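**Classification Without Labeled Data**

We use **zero-shot classification with embeddings** to predict labels for the input text even though no classifier was trained on labeled examples. We write a short description of what each label should represent, embed those descriptions with the same model, and assign each review the label whose embedding is most similar to the review's embedding.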

In [None]:
label_embeddings = model.encode(["A negative review", "A positive review"])

from sklearn.metrics.pairwise import cosine_similarity

sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

evaluate_performance(data["test"]["label"], y_pred)
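
To make the similarity step concrete, here is a minimal sketch of cosine similarity followed by `argmax`, written with NumPy only. The two-dimensional vectors are hypothetical stand-ins for real embeddings, and `cosine_sim_matrix` is an illustrative helper, not part of scikit-learn.

```python
import numpy as np

def cosine_sim_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Hypothetical 2-D "embeddings": row 0 stands for the negative label,
# row 1 for the positive label.
toy_label_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_doc_emb = np.array([[0.9, 0.1], [0.2, 0.8], [3.0, 1.0]])

toy_sim = cosine_sim_matrix(toy_doc_emb, toy_label_emb)  # shape (3, 2)
toy_labels = np.argmax(toy_sim, axis=1)                  # closest label per document
print(toy_labels)
```

Because cosine similarity depends only on direction, the third toy vector is assigned to the first label even though its length is much larger than the others.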

**Classification With Generative Models**

1. Using a Text-to-Text Transfer Transformer (T5): classifying with the Flan-T5 model

In [None]:
from transformers import pipeline

pipe=pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    device="cuda:0"
)

In [None]:
# Trailing space keeps the prompt from running into the review text.
prompt="Is the following sentence positive or negative? "
data=data.map(lambda example:{"t5": prompt+example['text']})
data

In [None]:
y_pred=[]
for output in tqdm(pipe(KeyDataset(data["test"],"t5")), total=len(data["test"])):
    text=output[0]["generated_text"]
    y_pred.append(0 if text=="negative" else 1)
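
The comparison above treats any output other than the exact string "negative" as positive. One possible way to make this stricter is sketched below; `parse_sentiment` is a hypothetical helper, and the example strings are made up rather than taken from the model.

```python
def parse_sentiment(generated_text):
    """Return 0 for negative, 1 for positive, or None if the text starts with neither word."""
    cleaned = generated_text.strip().lower()
    if cleaned.startswith("negative"):
        return 0
    if cleaned.startswith("positive"):
        return 1
    return None

# Made-up examples of what a generative model might return.
examples = ["negative", " Positive.", "neutral"]
parsed = [parse_sentiment(t) for t in examples]
print(parsed)
```

Returning `None` for unexpected outputs makes it possible to count how often the model fails to answer with one of the two labels before deciding how to score those cases.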

In [None]:
evaluate_performane(data["test"]["label"], y_pred)

**ChatGPT for Classification**