<a href="https://colab.research.google.com/github/bilal-najar/git_practice/blob/master/Assignment_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 5: Implementing and Evaluating a Large Language Model (LLM) for Text Classification - Bilal Najar

In [None]:
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
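from transformers import BertTokenizer, BertForSequenceClassification
# transformers.AdamW is deprecated; torch.optim.AdamW is the replacement
# (note: its default weight_decay is 0.01, whereas transformers.AdamW defaulted to 0.0)
from torch.optim import AdamW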

# Load and preprocess the dataset
file_path = '/content/complaints_processed.csv'
complaints_df = pd.read_csv(file_path)

# Remove unnecessary index column if it exists
complaints_df = complaints_df.drop(columns=['Unnamed: 0'], errors='ignore')

# Split dataset into training and test sets
train_df, test_df = train_test_split(complaints_df, test_size=0.2, random_state=27, stratify=complaints_df['product'])

# Map product labels to numerical values
label_to_id = {label: idx for idx, label in enumerate(train_df['product'].unique())}
train_labels = train_df['product'].map(label_to_id).values
test_labels = test_df['product'].map(label_to_id).values

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Function to tokenize the narratives for BERT input
def tokenize_data(texts, labels, max_length=32):
    encoding = tokenizer.batch_encode_plus(
        texts.tolist(),
        truncation=True,
        padding='max_length',
        max_length=max_length,
        return_tensors='pt'
    )
    inputs = encoding['input_ids']
    attention_masks = encoding['attention_mask']
    labels = torch.tensor(labels)
    return inputs, attention_masks, labels

# Tokenize training and test datasets
train_inputs, train_masks, train_labels = tokenize_data(train_df['narrative'].astype(str), train_labels)
test_inputs, test_masks, test_labels = tokenize_data(test_df['narrative'].astype(str), test_labels)



# Load pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(label_to_id))

# Move model to GPU on Colab
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Prepare optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Create DataLoader for batch processing
from torch.utils.data import DataLoader, TensorDataset, RandomSampler, SequentialSampler

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=64)

test_data = TensorDataset(test_inputs, test_masks, test_labels)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=64)

# Training function
def train(model, dataloader, optimizer, device):
    model.train()
    total_loss = 0
    for batch in dataloader:
        batch_inputs, batch_masks, batch_labels = tuple(t.to(device) for t in batch)

        model.zero_grad()
        outputs = model(batch_inputs, attention_mask=batch_masks, labels=batch_labels)
        loss = outputs.loss
        total_loss += loss.item()

        loss.backward()
        optimizer.step()

    avg_loss = total_loss / len(dataloader)
    return avg_loss

# Evaluation function
def evaluate(model, dataloader, device):
    model.eval()
    preds, true_labels = [], []

    with torch.no_grad():
        for batch in dataloader:
            batch_inputs, batch_masks, batch_labels = tuple(t.to(device) for t in batch)
            outputs = model(batch_inputs, attention_mask=batch_masks)
            logits = outputs.logits
            preds.append(logits.argmax(dim=1).cpu().numpy())
            true_labels.append(batch_labels.cpu().numpy())

    preds = [item for sublist in preds for item in sublist]
    true_labels = [item for sublist in true_labels for item in sublist]
    return preds, true_labels

# Training loop
epochs = 1
for epoch in range(epochs):
    avg_train_loss = train(model, train_dataloader, optimizer, device)
    print(f'Epoch {epoch+1}/{epochs} - Loss: {avg_train_loss:.4f}')

# Evaluate the model
preds, true_labels = evaluate(model, test_dataloader, device)

# Calculate evaluation metrics
accuracy = accuracy_score(true_labels, preds)
precision, recall, f1, _ = precision_recall_fscore_support(true_labels, preds, average='weighted')

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1-score: {f1:.4f}')


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/1 - Loss: 0.4901
Accuracy: 0.8605
Precision: 0.8642
Recall: 0.8605
F1-score: 0.8611


In [None]:
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
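from transformers import BertTokenizer, BertForSequenceClassification
# transformers.AdamW is deprecated; torch.optim.AdamW is the replacement
# (note: its default weight_decay is 0.01, whereas transformers.AdamW defaulted to 0.0)
from torch.optim import AdamW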

# Load and preprocess the dataset
file_path = '/content/complaints_processed.csv'
complaints_df = pd.read_csv(file_path)

# Remove unnecessary index column if it exists
complaints_df = complaints_df.drop(columns=['Unnamed: 0'], errors='ignore')

# Split dataset into training and test sets
train_df, test_df = train_test_split(complaints_df, test_size=0.2, random_state=27, stratify=complaints_df['product'])

# Map product labels to numerical values
label_to_id = {label: idx for idx, label in enumerate(train_df['product'].unique())}
train_labels = train_df['product'].map(label_to_id).values
test_labels = test_df['product'].map(label_to_id).values

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Function to tokenize the narratives for BERT input
def tokenize_data(texts, labels, max_length=128):
    encoding = tokenizer.batch_encode_plus(
        texts.tolist(),
        truncation=True,
        padding='max_length',
        max_length=max_length,
        return_tensors='pt'
    )
    inputs = encoding['input_ids']
    attention_masks = encoding['attention_mask']
    labels = torch.tensor(labels)
    return inputs, attention_masks, labels

# Tokenize training and test datasets
train_inputs, train_masks, train_labels = tokenize_data(train_df['narrative'].astype(str), train_labels)
test_inputs, test_masks, test_labels = tokenize_data(test_df['narrative'].astype(str), test_labels)



# Load pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(label_to_id))

# Move model to GPU on Colab
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Prepare optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Create DataLoader for batch processing
from torch.utils.data import DataLoader, TensorDataset, RandomSampler, SequentialSampler

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=64)

test_data = TensorDataset(test_inputs, test_masks, test_labels)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=64)

# Training function
def train(model, dataloader, optimizer, device):
    model.train()
    total_loss = 0
    for batch in dataloader:
        batch_inputs, batch_masks, batch_labels = tuple(t.to(device) for t in batch)

        model.zero_grad()
        outputs = model(batch_inputs, attention_mask=batch_masks, labels=batch_labels)
        loss = outputs.loss
        total_loss += loss.item()

        loss.backward()
        optimizer.step()

    avg_loss = total_loss / len(dataloader)
    return avg_loss

# Evaluation function
def evaluate(model, dataloader, device):
    model.eval()
    preds, true_labels = [], []

    with torch.no_grad():
        for batch in dataloader:
            batch_inputs, batch_masks, batch_labels = tuple(t.to(device) for t in batch)
            outputs = model(batch_inputs, attention_mask=batch_masks)
            logits = outputs.logits
            preds.append(logits.argmax(dim=1).cpu().numpy())
            true_labels.append(batch_labels.cpu().numpy())

    preds = [item for sublist in preds for item in sublist]
    true_labels = [item for sublist in true_labels for item in sublist]
    return preds, true_labels

# Training loop
epochs = 3
for epoch in range(epochs):
    avg_train_loss = train(model, train_dataloader, optimizer, device)
    print(f'Epoch {epoch+1}/{epochs} - Loss: {avg_train_loss:.4f}')

# Evaluate the model
preds, true_labels = evaluate(model, test_dataloader, device)

# Calculate evaluation metrics
accuracy = accuracy_score(true_labels, preds)
precision, recall, f1, _ = precision_recall_fscore_support(true_labels, preds, average='weighted')

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1-score: {f1:.4f}')



Epoch 1/3 - Loss: 0.4296
Epoch 2/3 - Loss: 0.3012
Epoch 3/3 - Loss: 0.2408
Accuracy: 0.8953
Precision: 0.8954
Recall: 0.8953
F1-score: 0.8951


In [None]:
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
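from transformers import BertTokenizer, BertForSequenceClassification
# transformers.AdamW is deprecated; torch.optim.AdamW is the replacement
# (note: its default weight_decay is 0.01, whereas transformers.AdamW defaulted to 0.0)
from torch.optim import AdamW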

# Load and preprocess the dataset
file_path = '/content/complaints_processed.csv'
complaints_df = pd.read_csv(file_path)

# Remove unnecessary index column if it exists
complaints_df = complaints_df.drop(columns=['Unnamed: 0'], errors='ignore')

# Split dataset into training and test sets
train_df, test_df = train_test_split(complaints_df, test_size=0.2, random_state=27, stratify=complaints_df['product'])

# Map product labels to numerical values
label_to_id = {label: idx for idx, label in enumerate(train_df['product'].unique())}
train_labels = train_df['product'].map(label_to_id).values
test_labels = test_df['product'].map(label_to_id).values

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Function to tokenize the narratives for BERT input
def tokenize_data(texts, labels, max_length=128):
    encoding = tokenizer.batch_encode_plus(
        texts.tolist(),
        truncation=True,
        padding='max_length',
        max_length=max_length,
        return_tensors='pt'
    )
    inputs = encoding['input_ids']
    attention_masks = encoding['attention_mask']
    labels = torch.tensor(labels)
    return inputs, attention_masks, labels

# Tokenize training and test datasets
train_inputs, train_masks, train_labels = tokenize_data(train_df['narrative'].astype(str), train_labels)
test_inputs, test_masks, test_labels = tokenize_data(test_df['narrative'].astype(str), test_labels)



# Load pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(label_to_id))

# Move model to GPU on Colab
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Prepare optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Create DataLoader for batch processing
from torch.utils.data import DataLoader, TensorDataset, RandomSampler, SequentialSampler

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=64)

test_data = TensorDataset(test_inputs, test_masks, test_labels)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=64)

# Training function
def train(model, dataloader, optimizer, device):
    model.train()
    total_loss = 0
    for batch in dataloader:
        batch_inputs, batch_masks, batch_labels = tuple(t.to(device) for t in batch)

        model.zero_grad()
        outputs = model(batch_inputs, attention_mask=batch_masks, labels=batch_labels)
        loss = outputs.loss
        total_loss += loss.item()

        loss.backward()
        optimizer.step()

    avg_loss = total_loss / len(dataloader)
    return avg_loss

# Evaluation function
def evaluate(model, dataloader, device):
    model.eval()
    preds, true_labels = [], []

    with torch.no_grad():
        for batch in dataloader:
            batch_inputs, batch_masks, batch_labels = tuple(t.to(device) for t in batch)
            outputs = model(batch_inputs, attention_mask=batch_masks)
            logits = outputs.logits
            preds.append(logits.argmax(dim=1).cpu().numpy())
            true_labels.append(batch_labels.cpu().numpy())

    preds = [item for sublist in preds for item in sublist]
    true_labels = [item for sublist in true_labels for item in sublist]
    return preds, true_labels

# Training loop
epochs = 5
for epoch in range(epochs):
    avg_train_loss = train(model, train_dataloader, optimizer, device)
    print(f'Epoch {epoch+1}/{epochs} - Loss: {avg_train_loss:.4f}')

# Evaluate the model
preds, true_labels = evaluate(model, test_dataloader, device)

# Calculate evaluation metrics
accuracy = accuracy_score(true_labels, preds)
precision, recall, f1, _ = precision_recall_fscore_support(true_labels, preds, average='weighted')

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1-score: {f1:.4f}')



Epoch 1/5 - Loss: 0.4279
Epoch 2/5 - Loss: 0.3018
Epoch 3/5 - Loss: 0.2408
Epoch 4/5 - Loss: 0.1853
Epoch 5/5 - Loss: 0.1406
Accuracy: 0.8957
Precision: 0.8975
Recall: 0.8957
F1-score: 0.8963


# Analysis

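I fine-tuned a pre-trained BERT model (`bert-base-uncased`) on the “Consumer Complaints” dataset in three experiments, each evaluated on the same stratified 20% test split. The first run used minimal settings (maximum sequence length of 32 tokens, 1 epoch) and achieved an accuracy of 0.8605, a weighted precision of 0.8642, and a weighted F1-score of 0.8611. The second run increased the sequence length to 128 and trained for 3 epochs, which improved accuracy to 0.8953 and the F1-score to 0.8951, with the average training loss decreasing from 0.4296 in the first epoch to 0.2408 in the third. The third run kept the same settings but trained for 5 epochs; it yielded only slight improvements, with an accuracy of 0.8957 and an F1-score of 0.8963, while the training loss decreased further to 0.1406.

These results show that a longer sequence length combined with more epochs leads to better performance, although the second run changed both settings at once, so their individual contributions cannot be separated. The gain from 3 to 5 epochs is marginal (+0.0004 accuracy, +0.0012 F1), which suggests diminishing returns: the training loss keeps falling while the test metrics stay nearly flat, so the extra epochs appear to fit the training data more closely without generalizing noticeably better. Running these experiments on the A100 GPU in Google Colab was essential for efficient processing given the model’s computational demands.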
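
To make the diminishing-returns comparison concrete, the run-to-run differences can be computed directly from the metrics printed by the three cells. The snippet below is only a small illustrative sketch: the `runs` list and the `metric_gains` helper are names introduced here for illustration, and the numbers are copied by hand from the outputs above rather than read from the trained models.

```python
# Accuracy and weighted F1 copied from the printed output of each run.
runs = [
    {"name": "run 1 (max_length=32, 1 epoch)", "accuracy": 0.8605, "f1": 0.8611},
    {"name": "run 2 (max_length=128, 3 epochs)", "accuracy": 0.8953, "f1": 0.8951},
    {"name": "run 3 (max_length=128, 5 epochs)", "accuracy": 0.8957, "f1": 0.8963},
]


def metric_gains(runs, key):
    """Return the change in `key` between each pair of consecutive runs."""
    return [round(curr[key] - prev[key], 4) for prev, curr in zip(runs, runs[1:])]


accuracy_gains = metric_gains(runs, "accuracy")
f1_gains = metric_gains(runs, "f1")

# Print one line per transition, e.g. run 1 -> run 2.
for prev, curr, d_acc, d_f1 in zip(runs, runs[1:], accuracy_gains, f1_gains):
    print(f"{prev['name']} -> {curr['name']}: accuracy {d_acc:+.4f}, F1 {d_f1:+.4f}")
```

The first transition changes two settings at once, so its gain reflects the combined effect of sequence length and epochs, while the second transition isolates the effect of two additional epochs.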