<a href="https://colab.research.google.com/github/Zuhair0000/fine_tuning_course/blob/main/fine_tuning_BERT_for_multi_class_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Load Data

In [None]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/refs/heads/master/twitter_multi_class_sentiment.csv')

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.describe()

In [None]:
import matplotlib.pyplot as plt

In [None]:
label_counts = df['label_name'].value_counts(ascending=True)
label_counts.plot.barh()
plt.title("Frequency of classes")

In [None]:
df['words per tweet'] = df['text'].str.split().apply(len)
df.boxplot('words per tweet', by='label_name')

In [None]:
df.head()

# 1. AutoTokenizer (The Translator)

Models cannot work with raw text. They only operate on numbers.

* The Job: This tool takes your sentence ("I love AI") and chops it into pieces (tokens). It then looks up each piece in a fixed vocabulary (about 30,000 entries for bert-base-uncased) and replaces it with an ID number (e.g., [101, 234, 567...]).

* Why "Auto"? Every model family (BERT, RoBERTa, GPT) has its own vocabulary and splitting rules, so you can't use a GPT tokenizer with a BERT model. AutoTokenizer looks at your checkpoint name and automatically loads the matching tokenizer.
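To make the lookup step concrete, here is a minimal sketch of a toy tokenizer. It is not how BERT's real WordPiece tokenizer works (that one splits unknown words into sub-word pieces). The vocabulary `TOY_VOCAB` and the function `toy_encode` are hypothetical names made up for illustration; only the special-token IDs match bert-base-uncased.

```python
# Toy vocabulary: the special-token IDs match bert-base-uncased,
# the word IDs (1, 2, 3) are made up for illustration.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "i": 1, "love": 2, "ai": 3}

def toy_encode(sentence):
    """Lowercase, split on whitespace, and map each piece to an ID."""
    tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    # Anything missing from the vocabulary falls back to [UNK].
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    return tokens, ids

tokens, ids = toy_encode("I love AI")
print(tokens)  # ['[CLS]', 'i', 'love', 'ai', '[SEP]']
print(ids)     # [101, 1, 2, 3, 102]
```

The real tokenizer adds the same `[CLS]` and `[SEP]` markers, which is why the encoded output in the next cell starts with 101 and ends with 102.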

In [None]:
from transformers import AutoTokenizer

model_ckpt = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

text = 'I love machine learning! Tokenization is awesome'

encode_text = tokenizer(text)
print(encode_text)

In [None]:
tokenizer.vocab_size

# Data Loader and Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.3, stratify=df['label_name'])
test, validation = train_test_split(test, test_size=1/3, stratify=test['label_name'])

train.shape, test.shape, validation.shape
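The two calls above first hold out 30% of the rows, then split that holdout again. A quick sketch of the arithmetic, using exact fractions, shows which share each split ends up with. The variable names here are illustrative only.

```python
from fractions import Fraction

first_holdout = Fraction(3, 10)     # test_size=0.3 in the first call
validation_share = Fraction(1, 3)   # test_size=1/3 in the second call

train_frac = 1 - first_holdout
validation_frac = first_holdout * validation_share
test_frac = first_holdout - validation_frac

print(train_frac, test_frac, validation_frac)  # 7/10 1/5 1/10
```

So the data ends up roughly 70% train, 20% test, and 10% validation.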

# **Converting to Hugging Face Format**
Pandas is convenient for exploring data, but the Hugging Face Trainer works with the library's own Dataset objects. They are backed by Apache Arrow, which makes batched operations such as map fast and memory-efficient.


1. Dataset.from_pandas: Converts the Pandas DataFrame into a Hugging Face Dataset.

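2. preserve_index=False: After train_test_split, each DataFrame keeps the shuffled row numbers of the original data as its index. Without this flag, that index would be stored as an extra column (`__index_level_0__`) in the Dataset. We don't need it cluttering our data, so we throw it away.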

3. DatasetDict: A container that holds all three splits (train, test, validation) in one variable, making it easy to access them later (e.g., dataset['train']).
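One detail worth seeing in miniature: a Dataset stores data by column, so indexing with a single number gives you one example as a dict, while slicing gives you a dict of lists. The sketch below imitates that behavior in plain Python. The rows and the helpers `get_row` and `get_batch` are hypothetical stand-ins, not part of the datasets library.

```python
# Hypothetical rows, shaped like the DataFrame's text/label columns.
rows = [
    {"text": "i feel great", "label": 1},
    {"text": "i feel awful", "label": 0},
    {"text": "what a day", "label": 1},
]

# Column-oriented storage: one list per column name.
columns = {key: [row[key] for row in rows] for key in rows[0]}

def get_row(i):
    """Analogue of dataset['train'][i]: one example as a dict."""
    return {key: values[i] for key, values in columns.items()}

def get_batch(start, stop):
    """Analogue of dataset['train'][start:stop]: a dict of lists."""
    return {key: values[start:stop] for key, values in columns.items()}

print(get_row(0))       # {'text': 'i feel great', 'label': 1}
print(get_batch(0, 2))  # {'text': ['i feel great', 'i feel awful'], 'label': [1, 0]}
```

This is why a batched tokenize function can read `batch['text']` and receive a list of strings.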



In [None]:
from datasets import Dataset, DatasetDict

dataset = DatasetDict(
    {'train': Dataset.from_pandas(train, preserve_index=False),
     'test': Dataset.from_pandas(test, preserve_index=False),
     'validation': Dataset.from_pandas(validation, preserve_index=False)
     }
)
dataset

In [None]:
dataset['train'][0]

# Tokenization

In [None]:
def tokenize(batch):
  temp = tokenizer(batch['text'], padding=True, truncation=True)
  return temp

print(tokenize(dataset['train'][:2]))

In [None]:
emotion_encoded = dataset.map(tokenize, batched=True, batch_size=None)

In [None]:
emotion_encoded

In [None]:
label2id = {x['label_name']:x['label'] for x in dataset['train']}
id2label = {v:k for k, v in label2id.items()}

label2id, id2label

# 2. AutoModel (The Brain)

* The Job: This loads the raw, pre-trained BERT brain. It was pre-trained on English Wikipedia and a large corpus of books, so it has already picked up grammar, context, and synonyms.

* The Catch: It only understands language. It doesn't know you want to do "Sentiment Analysis." If you feed it a sentence, it just spits out a mathematical representation of that sentence. It doesn't give you a label like "Happy" or "Sad."

In [None]:
from transformers import AutoModel
import torch

In [None]:
model = AutoModel.from_pretrained(model_ckpt)

In [None]:
model

# 3. AutoConfig (The Blueprint)
* **The Job**: This is a settings file. It tells the code: "How many labels do we have?" (In your case, you have 6 emotions). It also remembers which number corresponds to which emotion (e.g., 0 = Sadness, 1 = Joy).

* **Why we need it**: We need to inject this map (label2id) into the model so it knows that it is choosing between 6 specific options, not 2 or 100.
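As a rough sketch of what these two maps look like, the snippet below builds them from a handful of (name, id) pairs. The pairs are hypothetical sample rows and the `toy_` names are made up for this example. Sorting by id is an extra step added here so the dictionary order is the same on every run.

```python
# Hypothetical (label_name, label) pairs as they might appear in the rows.
pairs = [("joy", 1), ("sadness", 0), ("joy", 1), ("sadness", 0)]

# Deduplicate, then sort by id so the mapping order is deterministic.
toy_label2id = dict(sorted(set(pairs), key=lambda pair: pair[1]))
toy_id2label = {idx: name for name, idx in toy_label2id.items()}

print(toy_label2id)  # {'sadness': 0, 'joy': 1}
print(toy_id2label)  # {0: 'sadness', 1: 'joy'}
```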

# 4. AutoModelForSequenceClassification (The Specialist)
* **The Job**: This is the most important one for you. It takes the AutoModel (the raw brain) and glues a strictly defined "Head" on top of it.

* **The Head**: This is a final layer of math that takes the brain's deep thoughts and forces them into one of your 6 categories.
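Under the hood, the head is essentially one linear layer: it multiplies the brain's summary vector by a weight matrix and adds a bias, giving one score per label. The sketch below shows that computation with toy sizes (hidden size 4, 3 labels). The numbers and the function name `linear_head` are made up for illustration; the real BERT-base head maps a 768-dimensional vector to your 6 labels and starts from random weights that training then adjusts.

```python
def linear_head(pooled, weights, bias):
    """One score per label: (weights @ pooled) + bias."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]

# Toy sizes: hidden size 4 and 3 labels.
pooled = [0.5, -1.0, 2.0, 0.0]
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0]]
bias = [0.0, 0.5, -1.0]

scores = linear_head(pooled, weights, bias)
print(scores)  # [0.5, -0.5, 1.0]

# The predicted label is the index of the highest score.
predicted = max(range(len(scores)), key=scores.__getitem__)
print(predicted)  # 2
```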

# **Difference**:

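* **AutoModel**: Outputs one vector of numbers per token (the "hidden state"), with no notion of your labels.

* **AutoModelForSequenceClassification**: Outputs one raw score (a logit) per label. Applying softmax turns these scores into probabilities (e.g., Joy: 95%, Sadness: 2%, etc.).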
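The label scores come out of the model as raw values (logits), not as percentages. A softmax converts them into probabilities that sum to 1. Here is a minimal pure-Python sketch; the three scores are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the max keeps exp() from overflowing on large scores.
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 0.5, -1.0]   # hypothetical scores for three labels
probs = softmax(logits)
print([round(p, 3) for p in probs])
print(sum(probs))
```

Softmax never changes which label is largest, so taking the argmax of the logits directly gives the same prediction.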

In [None]:
from transformers import AutoModelForSequenceClassification, AutoConfig

num_labels = len(label2id)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
config = AutoConfig.from_pretrained(model_ckpt, label2id=label2id, id2label=id2label)
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, config=config).to(device)

In [None]:
device

In [None]:
from transformers import TrainingArguments

batch_size = 64
training_dir = 'bert_base_train_dir'
training_args = TrainingArguments(output_dir = training_dir,
                                  # overwrite_output_dir = True,
                                  num_train_epochs=2,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size = batch_size,
                                  per_device_eval_batch_size = batch_size,
                                  weight_decay = 0.01,
                                  eval_strategy = 'epoch',
                                  disable_tqdm = False
                                  )



In [None]:
!pip install evaluate

In [None]:
import evaluate
import numpy as np

accuracy = evaluate.load('accuracy')

def compute_metrics_evaluate(eval_pred):
  predictions, labels = eval_pred
  predictions = np.argmax(predictions, axis=1)
  return accuracy.compute(predictions=predictions, references=labels)



In [None]:
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)

  f1 = f1_score(labels, preds, average='weighted')
  acc = accuracy_score(labels, preds)

  return{
      'accuracy': acc,
      "f1": f1
  }
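To see what accuracy and weighted F1 actually measure, here is a hand-rolled sketch of both on a tiny made-up example. The function names and the label lists are hypothetical; in the notebook itself the sklearn functions do this work.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with each class weighted by its true count."""
    total = 0.0
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        support = tp + fn          # how many true examples of class c
        total += f1 * support
    return total / len(y_true)

toy_true = [0, 0, 1, 1, 1, 2]   # made-up labels
toy_pred = [0, 1, 1, 1, 0, 2]   # made-up predictions

print(round(accuracy(toy_true, toy_pred), 3))     # 0.667
print(round(weighted_f1(toy_true, toy_pred), 3))  # 0.667
```

The two numbers happen to coincide on this toy data. On imbalanced data they can differ, which is the reason to report both.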

# Build Model and Trainer

# **The Trainer (The Manager)**
Before the Trainer API, you had to write the training loop yourself: a for loop that fed batches to the model, calculated the error, updated the weights, and repeated. That was often around 50 lines of boilerplate code. The Trainer takes care of all of it for you.

* ### **TrainingArguments**
This is just a configuration list. You are telling the manager:

num_train_epochs=2: "Read the entire textbook (dataset) 2 times."

per_device_train_batch_size=64 (set through the batch_size variable): "Study 64 flashcards at a time before taking a break to update your brain."

output_dir: "Save your progress here."
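Epochs and batch size together determine how many weight updates happen. A small sketch of the arithmetic, assuming a single device, no gradient accumulation, and a hypothetical training set of 10,000 rows:

```python
import math

num_examples = 10_000   # hypothetical training-set size
batch_size = 64
num_epochs = 2

# The last batch of an epoch may be smaller, so round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 157 314
```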

* ### **Trainer**
This is the magic wrapper. You give it:

1. The Model (The student)

2. The Args (The schedule)

3. The Data (The textbooks)

4. compute_metrics (The exam)

The Trainer handles all the looping, the GPU management, and the progress bars for you.
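For contrast, here is a minimal sketch of the kind of manual loop the Trainer replaces. It fits a single weight `w` to toy data with plain gradient descent. The data, learning rate, and loop are illustrative assumptions only; the real Trainer uses a more sophisticated optimizer and handles devices, logging, and evaluation on top of this.

```python
# Toy data for y = 2x; the "model" is a single weight w.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0
learning_rate = 0.01
batch_size = 2
num_epochs = 20

for epoch in range(num_epochs):                   # one pass over the data per epoch
    for start in range(0, len(xs), batch_size):   # one batch at a time
        xb = xs[start:start + batch_size]
        yb = ys[start:start + batch_size]
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xb, yb)) / len(xb)
        # Weight update (plain gradient descent).
        w -= learning_rate * grad

print(round(w, 2))  # 2.0
```

Every piece of the analogy shows up here: epochs are the outer loop, batches are the inner loop, and the "brain update" is the last line of the loop body.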



In [None]:
from transformers import Trainer, DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(model=model,
                  args=training_args,
                  compute_metrics=compute_metrics,
                  train_dataset=emotion_encoded['train'],
                  eval_dataset=emotion_encoded['validation'],
                  data_collator=data_collator
                  )

In [None]:
trainer.train()

# Model Evaluation

In [None]:
preds_output = trainer.predict(emotion_encoded['test'])
preds_output.metrics

In [None]:
y_pred = np.argmax(preds_output.predictions, axis=1)
y_true = emotion_encoded['test'][:]['label']

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

In [None]:
cm = confusion_matrix(y_true, y_pred)

plt.figure()
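# confusion_matrix orders rows/columns by class id, so build the tick labels in the same order
class_names = [id2label[i] for i in sorted(id2label)]
sns.heatmap(cm, annot=True, xticklabels=class_names, yticklabels=class_names, fmt='d', cbar=False)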

In [None]:
text = 'I am super sad today'

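def get_prediction(text):
  # Tokenize the text and move the tensors to the same device as the model
  input_encoded = tokenizer(text, return_tensors='pt').to(device)

  model.eval()  # make sure dropout is disabled for inference
  with torch.no_grad():
    outputs = model(**input_encoded)

  # Pick the label with the highest logit
  logits = outputs.logits
  pred = torch.argmax(logits, dim=1).item()
  return id2label[pred]

In [None]:
get_prediction("I love you")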

In [None]:
trainer.save_model('bert-base-uncased-sentiment-model')

In [None]:
from transformers import pipeline

# Pass the tokenizer explicitly in case it was not saved alongside the model
classifier = pipeline('text-classification', model='bert-base-uncased-sentiment-model', tokenizer=tokenizer)

classifier(text)

In [None]:
classifier('i love you')