<a href="https://colab.research.google.com/github/csbanon/bert-product-rating-predictor/blob/master/Star_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers


In [None]:
import glob
import numpy as np
import os
import pandas as pd
import sys
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, BertModel, BertForSequenceClassification

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Get Preprocessed Review Data


In [None]:
github_url = 'https://raw.githubusercontent.com/csbanon/bert-product-rating-predictor/master/data/reviews_comments_stars.csv'
df = pd.read_csv(github_url)
df = df[['comment', 'stars']]
df

index,comment,stars
0,I could sit here and write all about the specs...,5
1,A very reasonably priced laptop for basic comp...,4
2,"This is the best laptop deal you can get, full...",5
3,A few months after the purchase....It is still...,5
4,BUYER BE AWARE: This computer has Microsoft 10...,1
...,...,...
195760,I have not tried this camera without the SD ca...,5
195761,"Hello, I bought this item months ago and I tho...",1
195762,This is an incredible camera for the money!! ...,5
195763,Great cameras. Purchased some for my mother af...,5


## Define the neural model for fine-tuning
Given a review as an input sequence, we want to predict its star rating. This is a multi-class sequence classification task.

For our model, we will use BertForSequenceClassification and set the num_labels argument to the number of unique Amazon star ratings (five classes, for 1 to 5 stars).

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [None]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels = len(df['stars'].unique()), # number of unique labels for our multi-class classification problem
    output_attentions = False,
    output_hidden_states = False,
)
model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

## Define the Reviews Dataset
Each item in the dataset is a dictionary with three tensors:

*   input_ids: the token ids of the review, padded or truncated to max_length
*   attn_mask: the attention mask, which marks real tokens as 1 and padding as 0
*   label: the target star rating, zero-indexed (stars - 1)
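
To make the first two fields concrete, here is a toy sketch of truncation, padding, and the matching attention mask. It is not the tokenizer's implementation: the helper name `pad_and_mask` and the example ids are made up for illustration, and special-token handling is ignored.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Toy helper (hypothetical): truncate or pad token ids and build the attention mask."""
    ids = list(token_ids[:max_length])      # truncate to max_length
    mask = [1] * len(ids)                   # 1 marks a real token
    n_pad = max_length - len(ids)
    # pad on the right; padded positions get mask value 0
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# Arbitrary example ids, not taken from a real review
input_ids, attn_mask = pad_and_mask([101, 2023, 2003, 102], max_length=6)
print(input_ids)  # [101, 2023, 2003, 102, 0, 0]
print(attn_mask)  # [1, 1, 1, 1, 0, 0]

# Labels are shifted so that 1..5 stars map to class indices 0..4
label = 5 - 1
print(label)      # 4
```

The mask lets the model ignore the padded positions, so short and long reviews can share one fixed-length batch.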

In [None]:
class ReviewsDataset(Dataset):
    def __init__(self, df, max_length=512):
        self.df = df
        self.tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
        self.max_length = max_length 
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        # input=review, label=stars
        review = self.df.loc[idx, 'comment']
        # labels are 0-indexed
        label = int(self.df.loc[idx, 'stars']) - 1
        
        encoded = self.tokenizer(
            review,                      # review to encode
            add_special_tokens=True,
            max_length=self.max_length,  # Truncate all segments to max_length
            padding='max_length',        # pad all reviews with the [PAD] token to the max_length
            return_attention_mask=True,  # Construct attention masks.
            truncation=True
        )
        
        input_ids = encoded['input_ids']
        attn_mask = encoded['attention_mask']
        
        return {
            'input_ids': torch.tensor(input_ids),
            'attn_mask': torch.tensor(attn_mask), 
            'label': torch.tensor(label)
        }

In [None]:
train_dataset, test_dataset = train_test_split(df, test_size=0.2, random_state=1)

In [None]:
test_dataset[:100].to_csv('/content/drive/My Drive/BERT project/gold.csv')

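Define the constants used in the rest of the notebook: sequence length, split sizes, batch sizes, training hyperparameters, and the paths for checkpoints and saved models.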

In [None]:
MAX_LEN = 256
TEST_SIZE = 0.2
VAL_SIZE = 0.125
TRAIN_BATCH_SIZE = 32
TEST_BATCH_SIZE = 32

CHECKPOINT_FILE = 'checkpoint.dat'
CHECKPOINT_FOLDER = 'Checkpoint'
EPOCHS = 4
LEARNING_RATE = 2e-05
PROJECT_FOLDER = '/content/drive/My Drive/BERT project/'
MODEL_FOLDER = 'Model_V3'
SAVE_EVERY = 100
NUM_WORKERS = 4

## Create Datasets / DataLoaders
Create the train, validation, and test datasets and dataloaders for the neural network.

In [None]:
train_dataset, test_dataset = train_test_split(df, test_size=TEST_SIZE, random_state=1)
train_dataset, val_dataset = train_test_split(train_dataset, test_size=VAL_SIZE, random_state=1)

train_dataset = train_dataset.reset_index(drop=True)
val_dataset = val_dataset.reset_index(drop=True)
test_dataset = test_dataset.reset_index(drop=True)

train_set = ReviewsDataset(train_dataset, MAX_LEN)
val_set = ReviewsDataset(val_dataset, MAX_LEN)
test_set = ReviewsDataset(test_dataset, MAX_LEN)

print("# of samples in train set: {}".format(len(train_set)))
print("# of samples in val set: {}".format(len(val_set)))
print("# of samples in test set: {}".format(len(test_set)))

# of samples in train set: 137035
# of samples in val set: 19577
# of samples in test set: 39153
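
The counts printed above can be reproduced with a little arithmetic. The sketch below assumes the held-out share is rounded up to a whole number of rows and the remainder goes to the other split; the helper `split_sizes` is hypothetical and only mirrors that rounding.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_remaining, n_held_out), assuming the held-out share is rounded up."""
    n_held_out = math.ceil(n_samples * test_size)
    return n_samples - n_held_out, n_held_out


n_total = 137035 + 19577 + 39153                      # total rows, from the printed counts
n_train_val, n_test = split_sizes(n_total, 0.2)       # TEST_SIZE
n_train, n_val = split_sizes(n_train_val, 0.125)      # VAL_SIZE

print(n_train, n_val, n_test)  # 137035 19577 39153
```

Because the validation split takes 12.5% of the remaining 80%, the overall proportions are roughly 70% train, 10% validation, and 20% test.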


In [None]:
train_params = {
                'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': NUM_WORKERS
                }
val_params = train_params

test_params = {
                'batch_size': TEST_BATCH_SIZE,
                'shuffle': False,
                'num_workers': NUM_WORKERS
              }

train_loader = DataLoader(train_set, **train_params)
val_loader = DataLoader(val_set, **val_params)
test_loader = DataLoader(test_set, **test_params)
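
As a rough sizing check, the number of batches per epoch follows from the training-set size and the batch size. This sketch assumes the DataLoader keeps the final, smaller batch (its default behaviour) and that progress is reported whenever the batch index is a multiple of SAVE_EVERY, as in the training loop below.

```python
import math

n_train = 137035   # training samples, from the printed counts
batch_size = 32    # TRAIN_BATCH_SIZE
save_every = 100   # SAVE_EVERY

steps_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size
# batch indices 0, 100, 200, ... trigger a report and a checkpoint
reports_per_epoch = len(range(0, steps_per_epoch, save_every))

print(steps_per_epoch, last_batch_size, reports_per_epoch)  # 4283 11 43
```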


## Fine-Tuning the Model on the Training Set

In [None]:
# Class weights for the weighted cross-entropy loss:
# errors on less frequent star ratings are penalized more heavily
star_groups = df.groupby('stars')
star_distribution = []
for i in range(len(df['stars'].unique())):
    star_distribution.append(len(star_groups.groups[i+1])/len(df))

star_distribution = torch.tensor(star_distribution, dtype=torch.float32)

# V3
weights = 1.0 / star_distribution
weights = weights / weights.sum()

# V4
# weights = 1.0 - star_distribution

print('{:<20}: {}'.format('Star distribution', star_distribution.tolist()))
print('{:<20}: {}'.format('Weights', weights.tolist()))

Star distribution   : [0.09381656348705292, 0.04584067687392235, 0.05710418149828911, 0.11441779881715775, 0.6888207793235779]
Weights             : [0.17712825536727905, 0.3625069260597229, 0.29100432991981506, 0.1452358216047287, 0.02412465587258339]
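
The V3 weighting can be checked by hand. The sketch below repeats the inverse-frequency calculation in plain Python, using the star distribution rounded to four decimals from the output above, so the results match the printed weights only approximately.

```python
# Share of each star rating (1 to 5), rounded from the printed distribution
star_distribution = [0.0938, 0.0458, 0.0571, 0.1144, 0.6888]

# V3: inverse frequency, normalized so the weights sum to 1
inverse = [1.0 / p for p in star_distribution]
total = sum(inverse)
weights = [w / total for w in inverse]

for stars, w in enumerate(weights, start=1):
    print(f"{stars} stars: weight {w:.4f}")

# How much more a 2-star error counts than a 5-star error
ratio = weights[1] / weights[4]
print(f"2-star vs 5-star weight ratio: {ratio:.1f}")
```

The rarest class (2 stars) ends up weighted about 15 times more than the dominant 5-star class, which is the ratio of their frequencies.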


In [None]:
# Define the weighted loss function and the optimizer
loss_function = torch.nn.CrossEntropyLoss(weight=weights.to(device), reduction='mean')
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)
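
With class weights and reduction='mean', the batch loss is a weighted average: each example's loss is multiplied by the weight of its true class, and the sum is divided by the sum of those weights. The plain-Python sketch below illustrates that formula on made-up logits; `weighted_cross_entropy` is a hypothetical helper, not the PyTorch implementation.

```python
import math


def weighted_cross_entropy(logits, targets, weights):
    """Weighted mean cross entropy for a batch of plain-Python logits."""
    numerator, denominator = 0.0, 0.0
    for row, y in zip(logits, targets):
        # negative log-likelihood of the true class under softmax
        log_sum_exp = math.log(sum(math.exp(z) for z in row))
        nll = log_sum_exp - row[y]
        numerator += weights[y] * nll
        denominator += weights[y]
    return numerator / denominator


# Two made-up examples where the model favours class index 4 (5 stars)
logits = [[0.0, 0.0, 0.0, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0, 2.0]]
targets = [4, 1]  # first prediction is right, second is a missed 2-star review

class_weights = [0.1771, 0.3625, 0.2910, 0.1452, 0.0241]  # rounded V3 weights
uniform = weighted_cross_entropy(logits, targets, [1.0] * 5)
weighted = weighted_cross_entropy(logits, targets, class_weights)
print(uniform, weighted)
```

In this example the weighted loss is larger than the unweighted one, because the mistake falls on a rare class while the correct prediction falls on the common one.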

In [None]:
# Count the correct predictions in a batch
# (the caller divides by the number of examples to get an accuracy)
def calculate_accuracy(big_idx, targets):
    n_correct = (big_idx==targets).sum().item()
    return n_correct

In [None]:
# For validation
def validate(model, data_loader):
    model.eval()
    n_correct = 0 
    nb_test_steps = 0
    nb_test_examples = 0
    test_loss = 0
    y_pred = []
    y_true = []

    with torch.no_grad():
        for _, data in enumerate(data_loader, 0):
            input_ids = data['input_ids'].to(device)
            mask = data['attn_mask'].to(device)
            labels = data['label'].to(device)

            outputs = model(input_ids, mask)
            loss = loss_function(outputs[0], labels)
            test_loss += loss.item()

            # big_val: largest logit per example; big_idx: index of the predicted class
            big_val, big_idx = torch.max(outputs[0].data, dim=1)
            n_correct += calculate_accuracy(big_idx, labels)

            preds = (big_idx + 1).cpu().tolist()
            gold = (labels + 1).cpu().tolist()
            y_pred.extend(preds)
            y_true.extend(gold)

            nb_test_steps += 1
            nb_test_examples += labels.size(0)
            
    epoch_loss = test_loss/nb_test_steps
    epoch_accu = (n_correct*100)/nb_test_examples
    print(f"Validation Loss: {epoch_loss}")
    print(f"Validation Accuracy: {epoch_accu}\n")
    
    return y_true, y_pred, epoch_accu

In [None]:
# Training loop
def train(epoch):
    # number of batches run by model
    nb_tr_steps = 0
    # number of training examples run by model
    nb_tr_examples = 0
    # number of examples classified correctly by model
    n_correct = 0
    tr_loss = 0
    model.train()

    for batch, data in enumerate(train_loader):
        input_ids = data['input_ids'].to(device)
        mask = data['attn_mask'].to(device)
        labels = data['label'].to(device)

        outputs = model(input_ids, mask)
        loss = loss_function(outputs[0], labels)
        tr_loss += loss.item()

        # big_val: largest logit per example; big_idx: index of the predicted class
        big_val, big_idx = torch.max(outputs[0].data, dim=1)
        n_correct += calculate_accuracy(big_idx, labels)

        nb_tr_steps += 1
        nb_tr_examples+=labels.size(0)
        
        if batch % SAVE_EVERY == 0:
            loss_step = tr_loss/nb_tr_steps
            accu_step = (n_correct*100)/nb_tr_examples 
            print("Batch {} of epoch {} complete.".format(batch, epoch+1))
            print(f"Training Loss: {loss_step}   Training Accuracy: {accu_step}")

            # Create the checkpoint folder inside the project folder, where the file is written
            checkpoint_dir = os.path.join(PROJECT_FOLDER, CHECKPOINT_FOLDER)
            os.makedirs(checkpoint_dir, exist_ok=True)

            # A single epoch can take hours, so save a checkpoint at every progress report
            checkpoint_path = os.path.join(checkpoint_dir, CHECKPOINT_FILE)
            torch.save(model.state_dict(), checkpoint_path)
            print("Saving checkpoint at", checkpoint_path)

        # Clear old gradients, backpropagate, and update the weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print('\n*****\n')
    print(f'The Total Accuracy for Epoch {epoch+1}: {(n_correct*100)/nb_tr_examples}')
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Training Loss: {epoch_loss}")
    print(f"Training Accuracy: {epoch_accu}\n")

    # Evaluate model after training it on this epoch
    validate(model, val_loader)

    torch.save(model.state_dict(), os.path.join(PROJECT_FOLDER, CHECKPOINT_FOLDER, CHECKPOINT_FILE))
    model.save_pretrained(os.path.join(PROJECT_FOLDER, MODEL_FOLDER, str(epoch+1)))
    print("Saving checkpoint at ", os.path.join(PROJECT_FOLDER, CHECKPOINT_FOLDER, CHECKPOINT_FILE))
    print("Saving model at ", os.path.join(PROJECT_FOLDER, MODEL_FOLDER, str(epoch+1)), '\n\n================================================\n')

    return

In [None]:
# # Training without weighted loss
# for epoch in range(EPOCHS):
#     train(epoch)

Batch 0 of epoch 1 complete.
Training Loss: 1.754623532295227   Training Accuracy: 9.375
Saving checkpoint at /content/drive/My Drive/BERT project/Checkpoint/checkpoint.dat
Batch 100 of epoch 1 complete.
Training Loss: 0.9618283035141406   Training Accuracy: 68.28589108910892
Saving checkpoint at /content/drive/My Drive/BERT project/Checkpoint/checkpoint.dat
Batch 200 of epoch 1 complete.
Training Loss: 0.8432419965812816   Training Accuracy: 71.51741293532338
Saving checkpoint at /content/drive/My Drive/BERT project/Checkpoint/checkpoint.dat
Batch 300 of epoch 1 complete.
Training Loss: 0.7857411857261214   Training Accuracy: 73.14161129568106
Saving checkpoint at /content/drive/My Drive/BERT project/Checkpoint/checkpoint.dat
Batch 400 of epoch 1 complete.
Training Loss: 0.7539933294875366   Training Accuracy: 73.87001246882792
Saving checkpoint at /content/drive/My Drive/BERT project/Checkpoint/checkpoint.dat
Batch 500 of epoch 1 complete.
Training Loss: 0.7263684099662804   Training

In [None]:
# Training with weighted loss
for epoch in range(EPOCHS):
    train(epoch)

Batch 0 of epoch 1 complete.
Training Loss: 1.7537753582000732   Training Accuracy: 0.0
Saving checkpoint at /content/drive/My Drive/BERT project/Checkpoint/checkpoint.dat
Batch 100 of epoch 1 complete.
Training Loss: 1.5132590898192755   Training Accuracy: 45.88490099009901
Saving checkpoint at /content/drive/My Drive/BERT project/Checkpoint/checkpoint.dat
Batch 200 of epoch 1 complete.
Training Loss: 1.3825425769559188   Training Accuracy: 54.57089552238806
Saving checkpoint at /content/drive/My Drive/BERT project/Checkpoint/checkpoint.dat
Batch 300 of epoch 1 complete.
Training Loss: 1.3075979821309696   Training Accuracy: 58.409468438538205
Saving checkpoint at /content/drive/My Drive/BERT project/Checkpoint/checkpoint.dat
Batch 400 of epoch 1 complete.
Training Loss: 1.2695915419561905   Training Accuracy: 59.77244389027432
Saving checkpoint at /content/drive/My Drive/BERT project/Checkpoint/checkpoint.dat
Batch 500 of epoch 1 complete.
Training Loss: 1.2476821829696854   Training

In [None]:
# Training with weighted loss (1-dist)
for epoch in range(EPOCHS):
    train(epoch)

Batch 0 of epoch 1 complete.
Training Loss: 1.486339807510376   Training Accuracy: 46.875
Saving checkpoint at /content/drive/My Drive/BERT project/Checkpoint/checkpoint.dat
Batch 100 of epoch 1 complete.
Training Loss: 1.2299827673647663   Training Accuracy: 69.46163366336634
Saving checkpoint at /content/drive/My Drive/BERT project/Checkpoint/checkpoint.dat
Batch 200 of epoch 1 complete.
Training Loss: 1.1006911972268898   Training Accuracy: 71.93718905472637
Saving checkpoint at /content/drive/My Drive/BERT project/Checkpoint/checkpoint.dat
Batch 300 of epoch 1 complete.
Training Loss: 1.0376355314373573   Training Accuracy: 73.04817275747509
Saving checkpoint at /content/drive/My Drive/BERT project/Checkpoint/checkpoint.dat
Batch 400 of epoch 1 complete.
Training Loss: 1.0143284614098043   Training Accuracy: 73.31670822942644
Saving checkpoint at /content/drive/My Drive/BERT project/Checkpoint/checkpoint.dat
Batch 500 of epoch 1 complete.
Training Loss: 0.9967816003901278   Trainin

In [None]:
# Evaluation on test set
for epoch in range(1, EPOCHS+1):
  model = BertForSequenceClassification.from_pretrained(os.path.join(PROJECT_FOLDER, MODEL_FOLDER, str(epoch))).to(device)
  print(f'Running validation on model trained on {epoch} epochs')

  validate(model, test_loader)

Running validation on model trained on 1 epochs
Validation Loss: 0.5471496991326217
Validation Accuracy: 79.90703138967639

Running validation on model trained on 2 epochs
Validation Loss: 0.5360850396328697
Validation Accuracy: 80.1445610808878

Running validation on model trained on 3 epochs
Validation Loss: 0.574592362602357
Validation Accuracy: 80.08837126146145

Running validation on model trained on 4 epochs
Validation Loss: 0.6106200944327937
Validation Accuracy: 79.32214645110209
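
The per-epoch numbers above can be compared programmatically. The sketch below copies the printed losses and accuracies (rounded) into a dictionary and picks the best epoch under each criterion; the variable names are illustrative only.

```python
# Loss and accuracy per epoch, rounded from the output above
results = {
    1: {"loss": 0.5471, "accuracy": 79.907},
    2: {"loss": 0.5361, "accuracy": 80.145},
    3: {"loss": 0.5746, "accuracy": 80.088},
    4: {"loss": 0.6106, "accuracy": 79.322},
}

best_by_accuracy = max(results, key=lambda epoch: results[epoch]["accuracy"])
best_by_loss = min(results, key=lambda epoch: results[epoch]["loss"])
print(best_by_accuracy, best_by_loss)  # 2 2
```

Both criteria point to the model saved after epoch 2; the loss rises from epoch 3 onward, which suggests the model starts to overfit. Making this choice on the validation split instead would keep the test numbers untouched for the final report.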



In [None]:
model = BertForSequenceClassification.from_pretrained(os.path.join(PROJECT_FOLDER, MODEL_FOLDER, '2')).to(device)
print('Running validation on model trained on 2 epochs')

y_true, y_pred, epoch_acc = validate(model, test_loader)

Running validation on model trained on 2 epochs
Validation Loss: 0.9980481928275302
Validation Accuracy: 70.70977958266289
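
Since about 69% of the reviews have 5 stars, overall accuracy can hide weak performance on the rarer ratings. The y_true and y_pred lists returned by validate can be broken down per class, for example as recall per star rating. The sketch below uses a small made-up pair of lists, not the notebook's results, and `per_class_recall` is a hypothetical helper.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Fraction of reviews of each star rating that were predicted correctly."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {star: correct[star] / total[star] for star in sorted(total)}


# Made-up labels for illustration only
y_true_toy = [5, 5, 5, 5, 1, 1, 2, 3]
y_pred_toy = [5, 5, 5, 4, 1, 2, 1, 3]

recall = per_class_recall(y_true_toy, y_pred_toy)
accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print(recall)    # {1: 0.5, 2: 0.0, 3: 1.0, 5: 0.75}
print(accuracy)  # 0.625
```

Comparing per-class recall between the unweighted and weighted runs would show whether the class weights actually help the rare ratings.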

