# Lab

##### Objective: The main purpose of this lab is to become familiar with NLP language models using the PyTorch library.

## Part 3 : BERT

### Step 1: Establish the Model

In [1]:
!pip install transformers torch pandas scikit-learn nltk bert-score



In [69]:
import pandas as pd
import gzip
import json
import numpy as np
import torch
import re

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments, default_data_collator
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix
from nltk.translate.bleu_score import sentence_bleu
from bert_score import score

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Step 2: Prepare the Data

Dataset: https://nijianmo.github.io/amazon/index.html

In [58]:
# Yield one review (a dict) per line of the gzipped JSON Lines file
def parse(path):
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            yield json.loads(line)

# Load all reviews into a DataFrame
def get_df(path):
    return pd.DataFrame(list(parse(path)))

# Load the Amazon Fashion 5-core review file
df = get_df('AMAZON_FASHION_5.json.gz')

# Keep only necessary columns and drop rows with missing values
df = df[['reviewText', 'overall']].dropna()

# Display the first few rows of the DataFrame
df.head()

Unnamed: 0,reviewText,overall
0,Great product and price!,5.0
1,Great product and price!,5.0
2,Great product and price!,5.0
3,Great product and price!,5.0
4,Great product and price!,5.0


In [59]:
df = df.drop_duplicates()
df

Unnamed: 0,reviewText,overall
0,Great product and price!,5.0
5,Waaay too small. Will use for futur children!,3.0
6,Stays vibrant after many washes,5.0
8,My son really likes the pink. Ones which I was...,5.0
9,Waaay too small. Will use for future child.,3.0
...,...,...
2380,"I wear these everyday to work, the gym, etc.",5.0
3115,Very comfortable and fits perfectly,5.0
3116,Super.,5.0
3117,"Largely my fault for not reading carefully, bu...",4.0


In [60]:
df['overall'].value_counts()

5.0    293
4.0     66
3.0     49
1.0     17
2.0     15
Name: overall, dtype: int64
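
The counts above show a strong skew toward 5-star reviews. As a rough sketch of how that skew could be quantified, the snippet below computes "balanced" class weights as `n_samples / (n_classes * count)`, the same heuristic scikit-learn applies for `class_weight='balanced'`. The helper name `balanced_class_weights` is made up for illustration and is not part of the notebook.

```python
# Rating counts copied from the value_counts() output above.
rating_counts = {5.0: 293, 4.0: 66, 3.0: 49, 1.0: 17, 2.0: 15}

def balanced_class_weights(counts):
    """Return n_samples / (n_classes * count) per class (hypothetical helper)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

weights = balanced_class_weights(rating_counts)
for rating in sorted(weights):
    print(f"{rating:.0f} stars: weight {weights[rating]:.3f}")
```

The rarest ratings get the largest weights, which is the property a class-weighted loss would rely on.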

In [61]:
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters, punctuation, and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove stopwords
    text = ' '.join(word for word in text.split() if word not in stop_words)
    return text

df['reviewText'] = df['reviewText'].apply(preprocess_text)
df

Unnamed: 0,reviewText,overall
0,great product price,5.0
5,waaay small use futur children,3.0
6,stays vibrant many washes,5.0
8,son really likes pink ones nervous,5.0
9,waaay small use future child,3.0
...,...,...
2380,wear everyday work gym etc,5.0
3115,comfortable fits perfectly,5.0
3116,super,5.0
3117,largely fault reading carefully high synthetic...,4.0
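
One side effect of this preprocessing is worth checking: NLTK's English stop-word list contains negations such as "not", so removing stop words can erase sentiment cues before the text reaches BERT. The sketch below repeats the same three steps with a small hand-picked stop-word set (`SAMPLE_STOP_WORDS`, an assumption standing in for the full NLTK list) so that it runs without any download.

```python
import re

# Hand-picked subset standing in for NLTK's English stop words (assumption).
SAMPLE_STOP_WORDS = {"not", "too", "the", "was", "it", "is", "at", "all"}

def preprocess_text(text, stop_words=SAMPLE_STOP_WORDS):
    text = text.lower()                       # lowercase
    text = re.sub(r'[^a-z\s]', '', text)      # drop punctuation and digits
    return ' '.join(w for w in text.split() if w not in stop_words)

# Two reviews with opposite meanings.
negative = preprocess_text("Not good at all, the fit was too small!")
positive = preprocess_text("Good, the fit was not small at all.")
print(negative)
print(positive)
```

Both reviews collapse to the same string, so it may be worth comparing results with and without stop-word removal.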


In [62]:
# Split the data into training and testing sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Display the size of training and testing sets
print(f"Training set size: {train_df.shape[0]}")
print(f"Testing set size: {test_df.shape[0]}")

Training set size: 352
Testing set size: 88
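
With only 15 two-star and 17 one-star reviews, a purely random split can leave very few minority examples in the test set. `train_test_split` accepts a `stratify` argument (for example `stratify=df['overall']`) to keep class proportions equal in both splits. The pure-Python sketch below illustrates the idea; `stratified_split` is a hypothetical helper, and the toy labels only reproduce the class counts shown earlier.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so each class sends about test_size of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with the same class counts as the value_counts() output.
labels = [5] * 293 + [4] * 66 + [3] * 49 + [1] * 17 + [2] * 15
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sorted({labels[i] for i in test_idx}))
```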


### Step 3: Fine-tune and Train the Model

#### Tokenization and DataLoader

In [65]:
import os
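# Silence the Hugging Face Hub symlink warning on Windows
# (this variable is read when huggingface_hub is first imported)
os.environ['HF_HUB_DISABLE_SYMLINKS_WARNING'] = "1"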

# Tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=5)

# Dataset class for BERT
class ReviewDataset(Dataset):
    def __init__(self, reviews, labels, tokenizer, max_length):
        self.reviews = reviews
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, item):
        review = str(self.reviews[item])
        # Star ratings 1-5 (stored as floats) become integer class ids 0-4
        label = int(self.labels[item]) - 1
        encoding = self.tokenizer.encode_plus(
            review,
            add_special_tokens=True,
            max_length=self.max_length,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'review_text': review,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

train_dataset = ReviewDataset(
    reviews=train_df.reviewText.to_numpy(),
    labels=train_df.overall.to_numpy(),
    tokenizer=tokenizer,
    max_length=160
)

test_dataset = ReviewDataset(
    reviews=test_df.reviewText.to_numpy(),
    labels=test_df.overall.to_numpy(),
    tokenizer=tokenizer,
    max_length=160
)

train_data_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_data_loader = DataLoader(test_dataset, batch_size=16)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
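
To make the `padding='max_length'` and `return_attention_mask=True` settings concrete, here is a toy sketch of what the tokenizer returns for each review: the ids are wrapped in `[CLS]` and `[SEP]`, truncated if needed, padded to a fixed length, and paired with a mask that marks real tokens. The special-token ids are those of `bert-base-uncased`; the word-piece ids and the `pad_and_mask` helper are made up for illustration.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD] in bert-base-uncased

def pad_and_mask(token_ids, max_length):
    """Add special tokens, truncate, pad to max_length, and build the attention mask."""
    body = token_ids[: max_length - 2]            # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)         # 1 marks a real token
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

# Hypothetical word-piece ids for a three-token review.
input_ids, attention_mask = pad_and_mask([2307, 4031, 3976], max_length=8)
print(input_ids)
print(attention_mask)
```

Most reviews in this dataset are short, so with `max_length=160` most positions in each row are padding that the attention mask tells the model to ignore.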


#### Model Training

In [66]:
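training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    # Note: this run has only 110 optimizer steps (22 per epoch x 5 epochs),
    # so a 1000-step warmup never completes.
    warmup_steps=1000,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    # Newer transformers releases name this argument `eval_strategy`.
    evaluation_strategy="steps",
    # Note: eval_steps=1000 exceeds the 110 total steps, so no evaluation
    # runs during training and the results table below stays empty.
    eval_steps=1000,
)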

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer
)

trainer.train()

Step,Training Loss,Validation Loss


TrainOutput(global_step=110, training_loss=1.342120326649059, metrics={'train_runtime': 2483.2417, 'train_samples_per_second': 0.709, 'train_steps_per_second': 0.044, 'total_flos': 144714978355200.0, 'train_loss': 1.342120326649059, 'epoch': 5.0})
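
The `global_step=110` in the output follows from the training-set size and batch size, and it helps explain the empty results table. The sketch below reproduces that arithmetic and evaluates a linear warmup schedule at the final step. It assumes a single device, no gradient accumulation, the default `learning_rate` of 5e-5, and the default linear scheduler; none of these are set explicitly in the notebook.

```python
import math

train_size, batch_size, epochs = 352, 16, 5    # values from the cells above
warmup_steps, eval_steps = 1000, 1000          # values from TrainingArguments
base_lr = 5e-5                                 # assumed: the TrainingArguments default

steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs

def linear_warmup_lr(step):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

print(f"total optimizer steps: {total_steps}")
print(f"evaluations during training: {total_steps // eval_steps}")
print(f"learning rate at the last step: {linear_warmup_lr(total_steps):.2e}")
```

Under these assumptions the learning rate is still ramping up when training ends, so a warmup of around 10% of the total steps and an `eval_steps` value below 110 would be reasonable things to try.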

### Step 4: Evaluate the Model

In [70]:
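# Metric function in the format expected by Trainer(compute_metrics=...).
# It is not passed to the Trainer above; the metrics below are computed directly.
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1) + 1  # shift class ids 0-4 back to ratings 1-5
    labels = p.label_ids + 1  # shift class ids 0-4 back to ratings 1-5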
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average='weighted')
    precision = precision_score(labels, preds, average='weighted')
    recall = recall_score(labels, preds, average='weighted')
    cm = confusion_matrix(labels, preds)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall, "confusion_matrix": cm}

# Predict and compute metrics
predictions, labels, _ = trainer.predict(test_dataset)
preds = np.argmax(predictions, axis=1)

# Calculate metrics
acc = accuracy_score(labels, preds)
f1 = f1_score(labels, preds, average='weighted')

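# BLEU is a text-generation metric; here it compares the sequence of predicted
# class ids with the sequence of true class ids, so treat it with caution
bleu_score = sentence_bleu([labels.tolist()], preds.tolist())

# BERTScore is also a text-generation metric; here both "sentences" are the
# class ids joined into strings (candidates first, references second)
P, R, F1 = score([" ".join(str(p) for p in preds)], [" ".join(str(l) for l in labels)], lang='en')
bertscore_results = {"precision": P.mean(), "recall": R.mean(), "f1": F1.mean()}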

print(f'Accuracy: {acc}')
print(f'F1 Score: {f1}')
print(f'BLEU Score: {bleu_score}')
print(f'BERTScore: {bertscore_results}')


Accuracy: 0.6477272727272727
F1 Score: 0.5092476489028214
BLEU Score: 0.37511372399251636
BERTScore: {'precision': tensor(0.9208), 'recall': tensor(0.8859), 'f1': tensor(0.9030)}
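
A useful sanity check for these numbers is the majority-class baseline: the score a classifier would get by always predicting the most common rating. The sketch below computes its accuracy and weighted F1. The count of 57 five-star reviews in the test set is an assumption (57 / 88 equals the reported accuracy), not a value printed by the notebook, and `majority_baseline` is a hypothetical helper.

```python
def majority_baseline(n_majority, n_total):
    """Accuracy and weighted F1 of a classifier that always predicts the majority class."""
    accuracy = n_majority / n_total
    precision, recall = accuracy, 1.0             # for the majority class
    f1_majority = 2 * precision * recall / (precision + recall)
    # Every other class is never predicted, so its F1 is 0 and only the
    # majority class contributes, weighted by its share of the test set.
    weighted_f1 = f1_majority * accuracy
    return accuracy, weighted_f1

# Assumption: 57 of the 88 test reviews are 5-star.
accuracy, weighted_f1 = majority_baseline(57, 88)
print(f"accuracy: {accuracy:.4f}, weighted F1: {weighted_f1:.4f}")
```

Under that assumption the baseline matches the reported accuracy and F1, which suggests printing the confusion matrix to check whether the fine-tuned model predicts anything other than 5 stars.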


### Step 5: Conclusion

Fine-tuning a pre-trained BERT model for text classification reuses the contextual representations that BERT learned during pre-training on a large text corpus. This approach can outperform traditional machine learning models that rely on simpler feature extraction, but the outcome depends heavily on the data. In this run the dataset is small (440 unique reviews) and heavily imbalanced (293 of them are 5-star), and the test accuracy of 0.648 with a weighted F1 of 0.509 is consistent with a model that predicts the majority class for nearly every review. Techniques such as class weighting or oversampling may therefore be necessary for the model to perform well across all classes. The choice of hyperparameters and the size of the fine-tuning dataset are also critical to the model's effectiveness.

BERT remains a powerful tool for complex text classification tasks, provided the fine-tuning setup accounts for challenging datasets such as this one.
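
As one possible illustration of the oversampling idea mentioned in the conclusion, the sketch below duplicates minority-class rows at random until every class is as large as the biggest one. `random_oversample` is a hypothetical helper and the rows are placeholders, not real reviews.

```python
import random
from collections import Counter, defaultdict

def random_oversample(rows, label_of, seed=42):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Toy (text, rating) rows; the texts are placeholders.
rows = [("good", 5)] * 6 + [("ok", 3)] * 2 + [("bad", 1)]
balanced = random_oversample(rows, label_of=lambda row: row[1])
print(Counter(rating for _, rating in balanced))
```

Oversampling should be applied to the training split only, so that the test set keeps the original class distribution.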