# Goal and Overview
- This project performs sentiment analysis on Amazon product reviews. The objective is to classify each review as positive or negative by fine-tuning BERT (Bidirectional Encoder Representations from Transformers), a pre-trained transformer model.

- Last updated: Jan 3, 2025

# Table of Contents
1. [Importing Libraries and Preparing the Environment](amazon_sentiment.ipynb#1-importing-libraries-and-preparing-the-environment)
2. [Downloading and Decompressing the Dataset](amazon_sentiment.ipynb#2-downloading-and-decompressing-the-dataset)
3. [Previewing the Data](amazon_sentiment.ipynb#3-previewing-the-data)
4. [Data Preprocessing](amazon_sentiment.ipynb#4-data-preprocessing)
5. [Exploring Dataset Labels](amazon_sentiment.ipynb#5-exploring-dataset-labels)
6. [Configuring the BERT Model](amazon_sentiment.ipynb#6-configuring-the-bert-model)
7. [Fine-Tuning BERT for Sentiment Analysis](amazon_sentiment.ipynb#7-fine-tuning-bert-for-sentiment-analysis)
8. [Evaluation on Test Data](amazon_sentiment.ipynb#8-evaluation-on-test-data)
9. [Future Direction](amazon_sentiment.ipynb#9-future-direction)
    

## 1. Importing Libraries and Preparing the Environment
In this step, we import the necessary Python libraries to build our sentiment classification model. These libraries include PyTorch, transformers, and others required for data processing and model training.

In [1]:
import re
import os

import numpy as np
import torch
from torch.nn import CrossEntropyLoss
from torch.utils.data import DataLoader
from torch.utils.data import RandomSampler
from tqdm import tqdm, trange
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation
from transformers import get_linear_schedule_with_warmup
from transformers import BertConfig, BertForSequenceClassification, BertTokenizer



## 2. Downloading and Decompressing the Dataset
The dataset comes from [Kaggle](https://www.kaggle.com/datasets/bittlingmayer/amazonreviews) and contains Amazon product reviews labeled as positive or negative. It was downloaded with the Kaggle API (`kagglehub`) and decompressed using the following script:

```
import bz2
import shutil

import kagglehub

# Download the dataset; kagglehub returns the local path of the downloaded files
path = kagglehub.dataset_download("bittlingmayer/amazonreviews")
# Move that directory into the working directory (the paths below assume it is named '7')
shutil.move(path, '.')

def decompress_bz2(file_path, output_path):
    """Decompress a .bz2 text file, streaming it to keep memory use low."""
    with bz2.open(file_path, 'rt', encoding='utf-8') as file:
        with open(output_path, 'w', encoding='utf-8') as out_file:
            shutil.copyfileobj(file, out_file)

# Decompress the files
decompress_bz2('7/test.ft.txt.bz2', 'test.txt')
decompress_bz2('7/train.ft.txt.bz2', 'train.txt')
```

## 3. Previewing the Data
Here, we preview the first few lines of the train and test files to get a sense of the data format. Each line in the dataset contains a sentiment label and a product review.

In [2]:
# Preview the first few lines of the train file
with open('train.txt', 'r', encoding='utf-8') as train_file:
    for _ in range(5):  # Print the first 5 lines
        print(train_file.readline())

# Preview the first few lines of the test file
with open('test.txt', 'r', encoding='utf-8') as test_file:
    for _ in range(5):  # Print the first 5 lines
        print(test_file.readline())

__label__2 Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^

__label__2 The best soundtrack ever to anything.: I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny.

__label__2 Amazing!: This soundtrack is

## 4. Data Preprocessing
Before feeding the data into the model, we preprocess the text so that BERT can interpret it. This step cleans each line (lowercasing, removing HTML remnants such as `&nbsp;` and `<br>`, collapsing repeated spaces), converts the text into token IDs, and pads or truncates each sequence to a fixed length of 256.

In [3]:
# Function to pad sequences
def rpad(array, n):
    current_len = len(array)
    if current_len > n:
        return array[:n]
    extra = n - current_len
    return array + ([0] * extra)

# Parse line and extract label and text
def parse_line_with_label(line):
    line = line.strip().lower()
    line = line.replace("&nbsp;", " ")
    line = re.sub(r'<br(\s\/)?>', ' ', line)
    line = re.sub(r' +', ' ', line)  # Merge multiple spaces into one

    # Extract label and text
    match = re.match(r'__label__(\d+)\s(.+)', line)
    if match:
        label = int(match.group(1))  # Extract label (e.g., 2)
        text = match.group(2)       # Extract text after the label
        return text, label
    return None, None

# Read dataset and parse each line
def read_labeled_data(filename):
    data = []
    labels = []
    with open(filename, 'r', encoding="utf-8") as file:
        for line in file:
            text, label = parse_line_with_label(line)
            if text and label is not None:
                label = label - 1
                data.append(text)
                labels.append(label)
    return data, labels

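# Tokenize each sentence and convert it to fixed-length token IDs
# (despite the name, this yields input IDs, not embedding vectors)
def convert_to_embedding(tokenizer, sentences_with_labels):
    for sentence, label in sentences_with_labels:
        tokens = tokenizer.tokenize(sentence)
        tokens = tokens[:250]
        # Use the tokenizer's special tokens ("[CLS]", "[SEP]"); the bare strings "CLS" and "SEP" are not BERT's special tokens
        tokens = [tokenizer.cls_token] + tokens + [tokenizer.sep_token]
        bert_sent = rpad(tokenizer.convert_tokens_to_ids(tokens), n=256)
        yield torch.tensor(bert_sent), torch.tensor(label, dtype=torch.int64)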

# Prepare the dataloader
def prepare_dataloader(tokenizer, sampler=RandomSampler, train=False):
    filename = 'sample_train.txt' if train else 'sample_test.txt'

    data, labels = read_labeled_data(filename)
    sentences_with_labels = zip(data, labels)

    dataset = list(convert_to_embedding(tokenizer, sentences_with_labels))

    sampler_func = sampler(dataset) if sampler is not None else None
    dataloader = DataLoader(dataset, sampler=sampler_func, batch_size=32)  # hard-coded; BATCH_SIZE from the configuration cell is not used here

    return dataloader
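
The training log later in this notebook warns that no `attention_mask` is passed even though the inputs are padded. Below is a minimal sketch of how a mask could be derived from the padded IDs, assuming padding uses ID 0 as in `rpad` above. `build_attention_mask` is a hypothetical helper, not part of the original notebook:

```python
def rpad(array, n, pad_id=0):
    """Truncate or right-pad a list of token IDs to exactly n entries."""
    return array[:n] + [pad_id] * max(0, n - len(array))

def build_attention_mask(token_ids, pad_id=0):
    """Return 1 for real tokens and 0 for padding positions."""
    return [0 if t == pad_id else 1 for t in token_ids]

# 101 and 102 are the [CLS] and [SEP] IDs in bert-base-uncased; the middle IDs are arbitrary
ids = rpad([101, 2307, 3185, 102], n=8)
mask = build_attention_mask(ids)
print(ids)   # [101, 2307, 3185, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

With tensors, the equivalent would be `attention_mask = (input_ids != 0).long()`, passed to the model as `model(input_ids, attention_mask=attention_mask)`.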

## 5. Exploring Dataset Labels
Before training, we examine the distribution of labels in the dataset. This helps ensure the dataset is balanced and provides insights into the classification task.

In [4]:
# Read the train and test data
train_data, train_labels = read_labeled_data("train.txt")
test_data, test_labels = read_labeled_data("test.txt")

# Count the occurrences of each label
from collections import Counter

train_label_counts = Counter(train_labels)
test_label_counts = Counter(test_labels)

print("Train label counts:", train_label_counts)
print("Test label counts:", test_label_counts)

Train label counts: Counter({1: 1800000, 0: 1800000})
Test label counts: Counter({1: 200000, 0: 200000})


Training on the full 3.6 million training reviews is not computationally feasible here, so we sample the original dataset. The sampled dataset contains:
- Train Dataset: 1800 samples
- Test Dataset: 200 samples

Lines are sampled uniformly at random, which keeps the label distribution close to, but not exactly equal to, the original 50/50 split.

In [5]:
import random

def create_sample_file(input_file, output_file, sample_size=100):
    """
    Create a sample file from the input data.

    Args:
        input_file (str): Path to the original dataset file.
        output_file (str): Path to the output sample file.
        sample_size (int): Number of lines to sample.
    """
    with open(input_file, 'r', encoding='utf-8') as infile:
        lines = infile.readlines()

    sample = random.sample(lines, min(sample_size, len(lines)))

    with open(output_file, 'w', encoding='utf-8') as outfile:
        outfile.writelines(sample)

# Generate sample training and testing files
create_sample_file('train.txt', 'sample_train.txt', sample_size=1800)
create_sample_file('test.txt', 'sample_test.txt', sample_size=200)

After sampling, it's important to confirm that the label distribution remains similar to the original dataset to ensure balanced training. Below are the label counts for the sampled data:

In [6]:
# Read the train and test data
train_data, train_labels = read_labeled_data("sample_train.txt")
test_data, test_labels = read_labeled_data("sample_test.txt")

# Count the occurrences of each label
from collections import Counter

train_label_counts = Counter(train_labels)
test_label_counts = Counter(test_labels)

print("Train label counts:", train_label_counts)
print("Test label counts:", test_label_counts)

Train label counts: Counter({1: 934, 0: 866})
Test label counts: Counter({1: 101, 0: 99})
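
The sampled training split above is slightly imbalanced (934 positive vs. 866 negative). If exact balance matters, one option is stratified sampling, which draws the same number of lines per label. The following is a sketch only: `stratified_sample` is a hypothetical helper, and it assumes every line starts with its `__label__N` token as in `train.txt`.

```python
import random
from collections import defaultdict

def stratified_sample(lines, sample_size, seed=42):
    """Sample an equal number of lines per label (label = first whitespace-separated token)."""
    rng = random.Random(seed)  # local generator so the sample is reproducible
    by_label = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        by_label[line.split(maxsplit=1)[0]].append(line)

    per_label = sample_size // len(by_label)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        sample.extend(rng.sample(group, min(per_label, len(group))))
    rng.shuffle(sample)  # avoid grouping the output by label
    return sample

# Toy input in the same "__label__N text" format as train.txt (made-up reviews)
toy_lines = [f"__label__{1 + i % 2} review number {i}\n" for i in range(100)]
sample = stratified_sample(toy_lines, sample_size=20)
counts = {label: sum(s.startswith(label + " ") for s in sample)
          for label in ("__label__1", "__label__2")}
print(counts)  # {'__label__1': 10, '__label__2': 10}
```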


## 6. Configuring the BERT Model
We initialize the hyperparameters and configurations for fine-tuning the BERT model. This includes batch size, learning rates, warm-up steps, and other critical parameters.

In [7]:
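PAD_TOKEN_LABEL_ID = CrossEntropyLoss().ignore_index  # -100, the default ignore index
BATCH_SIZE = 16  # not used by prepare_dataloader, which hard-codes batch_size=32
LEARNING_RATE_MODEL = 1e-5  # learning rate for the pre-trained BERT encoder
LEARNING_RATE_CLASSIFIER = 1e-3  # learning rate for the newly initialized classification head
WARMUP_STEPS = 0  # no learning-rate warmup
GRADIENT_ACCUMULATION_STEPS = 1  # update weights after every batch
MAX_GRAD_NORM = 1.0  # gradient-clipping threshold
SEED = 42
NO_CUDA = True  # force CPU even if a GPU is available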

## 7. Fine-Tuning BERT for Sentiment Analysis
In this section, we define the process for training the BERT model on the sentiment analysis task. The model will be fine-tuned using the preprocessed Amazon reviews dataset. The training loop includes loss calculation, backpropagation, and gradient clipping to ensure stable training.

In [8]:
class Transformers:
    model = None

    def __init__(self, tokenizer):
        self.pad_token_label_id = PAD_TOKEN_LABEL_ID
        self.device = torch.device("cuda" if torch.cuda.is_available() and not NO_CUDA else "cpu")
        self.tokenizer = tokenizer

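    def predict(self, sentence):
        if self.model is None or self.tokenizer is None:
            self.load()

        # Label -1 is a placeholder; only the token IDs are used for prediction
        embeddings = list(convert_to_embedding(self.tokenizer, [(sentence, -1)]))
        # Wrap in a DataLoader so the model receives a batch dimension
        preds = self._predict_tags_batched(DataLoader(embeddings, batch_size=1))
        return preds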

    def evaluate(self, dataloader):
        from sklearn.metrics import classification_report
        y_pred = self._predict_tags_batched(dataloader)
        # y_true = np.append(np.zeros(50), np.ones(50))
        y_true = []
        for _, labels in dataloader:
            y_true.extend(labels.cpu().numpy())

        score = classification_report(y_true, y_pred)
        return score

    def _predict_tags_batched(self, dataloader):
        preds = []
        self.model.eval()
        for batch in tqdm(dataloader, desc="Computing NER tags"):
            batch = tuple(t.to(self.device) for t in batch)

            with torch.no_grad():
                outputs = self.model(batch[0])
                # outputs[0] holds the logits; argmax gives the class (0 = negative, 1 = positive)
                _, pred_labels = torch.max(outputs[0], 1)
                preds.extend(pred_labels.cpu().numpy())

        return preds

    def train(self, dataloader, model, epochs):
        assert self.model is None  # make sure we are not training after load() command
        model.to(self.device)
        self.model = model

        # Total number of optimizer updates (iterations) over all epochs
        t_total = len(dataloader) // GRADIENT_ACCUMULATION_STEPS * epochs

        # Prepare optimizer and schedule (linear warmup and decay)
        optimizer_grouped_parameters = [
            {"params": model.bert.parameters(), "lr": LEARNING_RATE_MODEL},
            {"params": model.classifier.parameters(), "lr": LEARNING_RATE_CLASSIFIER}
        ]
        optimizer = AdamW(optimizer_grouped_parameters)
        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=t_total)

        # Train!
        print("***** Running training *****")
        # len(dataloader) counts batches, not individual examples
        print("Training on %d examples" % len(dataloader))
        print("Num Epochs = %d" % epochs)
        print("Total optimization steps = %d" % t_total)
        
        global_step = 0
        tr_loss, logging_loss = 0.0, 0.0
        model.zero_grad()
        train_iterator = trange(epochs, desc="Epoch")
        self._set_seed()
        for _ in train_iterator:
            epoch_iterator = tqdm(dataloader, desc="Iteration")
            for step, batch in enumerate(epoch_iterator):
                model.train()
                batch = tuple(t.to(self.device) for t in batch)
                outputs = model(batch[0], labels=batch[1])
                loss = outputs[0]  # model outputs are always tuple in pytorch-transformers (see doc)

                if GRADIENT_ACCUMULATION_STEPS > 1:
                    loss = loss / GRADIENT_ACCUMULATION_STEPS

                loss.backward()

                tr_loss += loss.item()
                if (step + 1) % GRADIENT_ACCUMULATION_STEPS == 0:
                    torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
                    optimizer.step()
                    scheduler.step()  # Update learning rate schedule after the optimizer step
                    model.zero_grad()
                    global_step += 1

        self.model = model

        return global_step, tr_loss / global_step

    def _set_seed(self):
        torch.manual_seed(SEED)
        # self.device is a torch.device, so compare its type rather than the object itself
        if self.device.type == 'cuda':
            torch.cuda.manual_seed_all(SEED)

    def load(self, model_dir='weights/'):
        self.tokenizer = BertTokenizer.from_pretrained(model_dir)
        self.model = BertForSequenceClassification.from_pretrained(model_dir)
        self.model.to(self.device)

def train(epochs=20, output_dir="weights/"):
    num_labels = 2  # negative and positive reviews
    config = BertConfig.from_pretrained('bert-base-uncased', num_labels=num_labels)
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased', config=config)

    dataloader = prepare_dataloader(tokenizer, train=True)
    predictor = Transformers(tokenizer)
    predictor.train(dataloader, model, epochs)

    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

def evaluate(model_dir="weights/"):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

    dataloader = prepare_dataloader(tokenizer, train=False, sampler=None)
    predictor = Transformers(tokenizer)
    predictor.load(model_dir=model_dir)
    out = predictor.evaluate(dataloader)
    return out
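
The training loop clips gradients with `torch.nn.utils.clip_grad_norm_` and `MAX_GRAD_NORM = 1.0`. The idea can be illustrated in plain Python as a sketch of global-norm clipping. `clip_by_global_norm` is a hypothetical helper written for illustration, and it operates on flat lists of floats rather than tensors:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients so that their combined L2 norm is at most max_norm.

    grads is a list of lists of floats (one inner list per parameter).
    Returns the clipped gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for param in grads for g in param))
    # Only shrink gradients; never scale them up
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in param] for param in grads]
    return clipped, total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]  # global norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for param in clipped for g in param))
print(norm_before)           # 5.0
print(round(norm_after, 4))  # 1.0
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its magnitude is limited.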

- Train for 3 epochs:

In [9]:
path = './'
os.makedirs(path, exist_ok=True)
train(epochs=3, output_dir=path)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


***** Running training *****
Training on 57 examples
Num Epochs = 3
Total optimization steps = 171


Epoch:   0%|          | 0/3 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
Iteration: 100%|██████████| 57/57 [52:25<00:00, 55.19s/it]
Iteration: 100%|██████████| 57/57 [50:37<00:00, 53.28s/it]
Iteration: 100%|██████████| 57/57 [47:24<00:00, 49.91s/it]
Epoch: 100%|██████████| 3/3 [2:30:27<00:00, 3009.29s/it]
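
The numbers in the log above follow from the dataset size and batch size. Note that the line "Training on 57 examples" reports `len(dataloader)`, which counts batches rather than examples. A quick check, assuming all 1800 sampled reviews parse correctly and the batch size of 32 set in `prepare_dataloader`:

```python
import math

num_examples = 1800      # lines in sample_train.txt
batch_size = 32          # value hard-coded in prepare_dataloader
epochs = 3
grad_accum_steps = 1     # GRADIENT_ACCUMULATION_STEPS

batches_per_epoch = math.ceil(num_examples / batch_size)  # the last batch is partial
total_steps = batches_per_epoch // grad_accum_steps * epochs

print(batches_per_epoch)  # 57
print(total_steps)        # 171
```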


## 8. Evaluation on Test Data
After training, we evaluate the model on the sampled test set. The classification report gives per-class precision, recall, and F1-score along with overall accuracy.

In [10]:
out = evaluate(model_dir=path)

Computing NER tags: 100%|██████████| 7/7 [01:37<00:00, 13.97s/it]


In [11]:
print(out)

              precision    recall  f1-score   support

           0       0.83      0.80      0.81        99
           1       0.81      0.84      0.83       101

    accuracy                           0.82       200
   macro avg       0.82      0.82      0.82       200
weighted avg       0.82      0.82      0.82       200
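
For reference, the per-class figures in such a report can be reproduced by hand. The sketch below computes precision, recall, and F1 for one class from made-up toy labels (not the notebook's predictions). `precision_recall_f1` is a hypothetical helper:

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy data for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

p, r, f1 = precision_recall_f1(y_true, y_pred, positive=1)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.6 0.75 0.67
```

The "macro avg" row of the report is the unweighted mean of these per-class values, while "weighted avg" weights each class by its support.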



## 9. Future Direction

### 9.1 Model Optimization

- Experiment with Alternative Transformer Architectures: Explore models like RoBERTa, XLNet, or DistilBERT to compare performance, efficiency, and suitability for the sentiment classification task.

- Hyperparameter Tuning: Optimize parameters like batch size, learning rate, number of epochs, and more using automated tools such as:
    - Optuna: A hyperparameter optimization library for efficient searches.
    - Ray Tune: A scalable framework for distributed hyperparameter tuning.

- Gradient Accumulation: Simulate larger batch sizes on memory-limited devices (e.g., CPUs) by accumulating gradients over multiple steps before performing an update (GRADIENT_ACCUMULATION_STEPS > 1).

- Learning Rate Optimization
    - Adjust learning rates dynamically during training for better convergence using learning rate schedules:
        - StepLR: Decreases the learning rate by a fixed factor every few epochs.
        - ExponentialLR: Exponentially reduces the learning rate.
        - ReduceLROnPlateau: Decreases the learning rate when the monitored metric stops improving.
    - Learning Rate Warmup: Gradually increase the learning rate during initial training steps to stabilize optimization and prevent large parameter updates early on. Experiment with non-zero WARMUP_STEPS (e.g., 10% of the total steps).

- Mixed Precision Training: Use mixed precision (combining 16-bit and 32-bit floating point operations) to speed up training and reduce memory consumption.
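
To make the warmup suggestion concrete, the sketch below computes the learning-rate multiplier for a linear warmup followed by linear decay, the shape of schedule that `get_linear_schedule_with_warmup` provides in the training code. `linear_warmup_decay` is a hypothetical helper written for illustration, and the 10% warmup fraction is the example value mentioned above:

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup
        return step / max(1, warmup_steps)
    # Then decay linearly from 1 to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total_steps = 171                      # 57 batches x 3 epochs, as in the training log
warmup_steps = int(0.1 * total_steps)  # 17 steps

for step in (0, 17, 94, 171):
    print(step, round(linear_warmup_decay(step, warmup_steps, total_steps), 3))
# 0 0.0
# 17 1.0
# 94 0.5
# 171 0.0
```

With `WARMUP_STEPS = 0`, as in the current configuration, the multiplier starts at 1 and only the linear decay applies.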

### 9.2 Multi-Class Sentiment Analysis

Extend the project to handle multi-class sentiment analysis (e.g., "positive," "neutral," "negative") rather than binary classification.
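
One way this could look on the label side is sketched below. It assumes a data source with 1 to 5 star ratings, which the binary-labeled files used in this notebook do not provide. `stars_to_class` and the chosen thresholds are hypothetical:

```python
LABELS = ["negative", "neutral", "positive"]

def stars_to_class(stars):
    """Map a 1-5 star rating to a class index: 0 negative, 1 neutral, 2 positive."""
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating out of range: {stars}")
    if stars <= 2:
        return 0
    if stars == 3:
        return 1
    return 2

for stars in (1, 3, 5):
    print(stars, LABELS[stars_to_class(stars)])
# 1 negative
# 3 neutral
# 5 positive
```

On the model side, the corresponding change would be setting `num_labels = len(LABELS)` where the `BertConfig` is created in `train`.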

### 9.3 Scalability

- Implement distributed training using tools like PyTorch's DistributedDataParallel or Hugging Face Accelerate to handle larger datasets efficiently.
- Use libraries like Dask or Apache Spark for parallelized data preprocessing.