## 1. Import Statements

---



In [1]:
%%capture
!pip install transformers

In [2]:
import torch
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, BertModel, BertForSequenceClassification

In [3]:
# Use the GPU if one is available; otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

## 2. Load the Data

---


The original code in this section is located in `bert-training.ipynb`. It is included here so that the `get_star_predictions()` function works.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
github_url = 'https://raw.githubusercontent.com/csbanon/bert-product-rating-predictor/master/data/reviews_comments_stars.csv'
df = pd.read_csv(github_url)
df = df[['comment', 'stars']]
df

| | comment | stars |
|---|---|---|
| 0 | I could sit here and write all about the specs... | 5 |
| 1 | A very reasonably priced laptop for basic comp... | 4 |
| 2 | This is the best laptop deal you can get, full... | 5 |
| 3 | A few months after the purchase....It is still... | 5 |
| 4 | BUYER BE AWARE: This computer has Microsoft 10... | 1 |
| ... | ... | ... |
| 195760 | I have not tried this camera without the SD ca... | 5 |
| 195761 | Hello, I bought this item months ago and I tho... | 1 |
| 195762 | This is an incredible camera for the money!! ... | 5 |
| 195763 | Great cameras. Purchased some for my mother af... | 5 |


In [6]:
train_dataset, test_dataset = train_test_split(df, test_size=0.2, random_state=1)
test_dataset = test_dataset.reset_index(drop=True)
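
The `reset_index(drop=True)` call matters because `ReviewsDataset` (defined in Section 5) looks rows up by label with `.loc[idx, ...]`, where `idx` runs from 0 to `len(df) - 1`. A shuffled split keeps the original row labels, so those lookups can miss. The following is a minimal sketch of that behaviour on a made-up four-row DataFrame (`toy` and `subset` are hypothetical names, not part of the project):

```python
import pandas as pd

# Hypothetical toy data standing in for the reviews DataFrame.
toy = pd.DataFrame({'comment': ['a', 'b', 'c', 'd'], 'stars': [5, 4, 1, 3]})

# A shuffled subset keeps the original row labels, as a random split would.
subset = toy.iloc[[3, 0, 2]]
labels_before = list(subset.index)

# Label 1 is not part of the subset, so a label-based lookup fails.
try:
    subset.loc[1, 'comment']
    lookup_failed = False
except KeyError:
    lookup_failed = True

# After resetting, the labels run from 0 to len(subset) - 1 again.
subset = subset.reset_index(drop=True)
labels_after = list(subset.index)
second_comment = subset.loc[1, 'comment']

print(labels_before, lookup_failed, labels_after, second_comment)
```

In this sketch, label 1 refers to the second row of the subset only after the reset.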

## 3. Define the BERT Model

---



The original code in this section is located in `bert-training.ipynb`. It is included here so that the `get_star_predictions()` function works. The output is suppressed to make the notebook easier to read.

In [7]:
%%capture
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels = len(df['stars'].unique()), # Number of unique labels for our multi-class classification problem.
    output_attentions = False,
    output_hidden_states = False,
)
model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## 4. Load the Trained Model

---

Here, we load the `trained_model.bin` file, which contains the trained weights.

In [8]:
# Load the trained weights, mapping them onto the current device.
model.load_state_dict(torch.load('drive/My Drive/CAP 5610/trained_model.bin', map_location=device))
model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element
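
Loading works in two steps: the architecture is rebuilt first (Section 3), and the saved weights are then copied into it. The sketch below illustrates that round trip with a tiny `nn.Linear` layer and an in-memory buffer instead of the BERT model and `trained_model.bin`; both substitutions are assumptions made to keep the example small.

```python
import io

import torch
from torch import nn

# Hypothetical tiny model standing in for the fine-tuned BERT classifier.
source = nn.Linear(4, 2)
buffer = io.BytesIO()
torch.save(source.state_dict(), buffer)  # stores the weights only, not the class

# Rebuild the same architecture, then copy the saved weights into it.
buffer.seek(0)
state = torch.load(buffer, map_location='cpu')  # remap tensors to the CPU
restored = nn.Linear(4, 2)
restored.load_state_dict(state)
restored.eval()  # switch layers such as dropout to inference behaviour

weights_match = torch.equal(source.weight, restored.weight)
print(weights_match)
```

Passing `map_location` is useful when the weights were saved on a GPU but the notebook runs on a CPU-only machine.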

## 5. Define the Reviews Dataset

---



The original code in this section is located in `star_prediction.ipynb`. It is included here so that the `get_star_predictions()` function works.

In [9]:
class ReviewsDataset(Dataset):
    def __init__(self, df, max_length=512):
        self.df = df
        self.tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
        self.max_length = max_length 
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        # input=review, label=stars
        review = self.df.loc[idx, 'comment']
        # labels are 0-indexed
        label = int(self.df.loc[idx, 'stars']) - 1
        
        encoded = self.tokenizer(
            review,                      # Review to encode.
            add_special_tokens=True,
            max_length=self.max_length,  # Truncate all segments to max_length.
            padding='max_length',        # Pad all reviews with the [PAD] token to the max_length.
            return_attention_mask=True,  # Construct attention masks.
            truncation=True
        )
        
        input_ids = encoded['input_ids']
        attn_mask = encoded['attention_mask']
        
        return {
            'input_ids': torch.tensor(input_ids),
            'attn_mask': torch.tensor(attn_mask), 
            'label': torch.tensor(label)
        }
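
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a simplified pure-Python sketch of what happens to a single sequence of token ids. It is an illustration only, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper and the input ids are made up. The special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are those of the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased


def pad_or_truncate(token_ids, max_length):
    """Simplified stand-in for padding='max_length' with truncation=True."""
    # Reserve two positions for [CLS] and [SEP], then cut the rest.
    body = token_ids[:max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Fill the remaining positions with [PAD]; the mask marks them with 0.
    pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * pad, attention_mask + [0] * pad


# Made-up token ids: one short review and one that exceeds max_length.
short_ids, short_mask = pad_or_truncate([2023, 2003, 2307], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every item returned by the dataset therefore has the same length, which lets the `DataLoader` stack the items into a batch.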

## 6. Predict the Star Ratings

---

The following code takes a DataFrame containing reviews and returns their predicted star ratings with an accuracy score.

In [10]:
def calculate_accuracy(predictions, targets):
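  """
  Count the correct predictions in a batch. The caller turns the running
  count into an accuracy percentage.

  :predictions: the predicted class indices (0-indexed star ratings).
  :targets: the ground-truth labels (0-indexed star ratings).
  """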

  num_correct = (predictions == targets).sum().item()
  
  return num_correct
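
Despite its name, `calculate_accuracy()` returns a count; `get_star_predictions()` accumulates these counts across batches and divides by the number of examples at the end. A minimal pure-Python sketch of that bookkeeping follows, using made-up predictions and a hypothetical `count_correct` helper in place of the tensor version:

```python
def count_correct(predictions, targets):
    """Count positions where the prediction equals the target."""
    return sum(int(p == t) for p, t in zip(predictions, targets))


# Two made-up batches of (predicted, ground-truth) 0-indexed labels.
batches = [([4, 3, 0], [4, 2, 0]), ([1, 1], [1, 0])]

num_correct = 0
num_examples = 0
for preds, targets in batches:
    num_correct += count_correct(preds, targets)
    num_examples += len(targets)

# Convert the running count into a percentage once all batches are seen.
accuracy = num_correct * 100 / num_examples
print(accuracy)  # 60.0
```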

In [11]:
def get_star_predictions(df, model):
  """
  Uses the given fine-tuned model to predict star ratings based on
  reviews. Returns an array with the original reviews and their star
  predictions, as well as the accuracy of the predictions as a percentage.

  :df: DataFrame containing reviews ('comment') and their labels ('stars'),
       indexed from 0 to len(df) - 1.
  :model: loaded fine-tuned BERT model.
  """

  # Define the prediction parameters.
  MAX_LEN = 256
  TEST_BATCH_SIZE = 16
  NUM_WORKERS = 4

  test_params = {'batch_size': TEST_BATCH_SIZE,
              'shuffle': False,
              'num_workers': NUM_WORKERS}

  # Define the dataset and dataloader.
  dataset = ReviewsDataset(df, MAX_LEN)
  data_loader = DataLoader(dataset, **test_params)

  num_batches = 0
  num_examples = 0
  num_correct = 0
  total_examples = len(df)
  predictions = np.zeros([total_examples, 2], dtype=object)

  # Disable gradient tracking; it is not needed for inference.
  with torch.no_grad():
    for data in data_loader:

      # Get the tokenization values.
      input_ids = data['input_ids'].to(device)
      mask = data['attn_mask'].to(device)
      labels = data['label'].to(device)

      # Make predictions with the trained model.
      outputs = model(input_ids, attention_mask=mask)

      # Get the star ratings (class indices are 0-indexed, stars are 1-indexed).
      _, big_idx = torch.max(outputs[0], dim=1)
      star_predictions = (big_idx + 1).cpu().numpy()
      reviews = df['comment'].values[num_examples:num_examples + labels.size(0)]
      batch_rows = np.vstack((reviews, star_predictions)).T

      # Update the output.
      predictions[num_examples:num_examples + labels.size(0)] = batch_rows

      num_correct += calculate_accuracy(big_idx, labels)

      num_batches += 1
      num_examples += labels.size(0)

      if num_batches % 10 == 0:
        print("Batch #{}: {}/{}".format(num_batches, num_examples, total_examples))

  accuracy = (num_correct * 100) / num_examples

  print("Finished predictions! Accuracy:", accuracy)

  return predictions, accuracy
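
The step that turns model outputs into star ratings can be sketched without PyTorch. The model emits one score (logit) per class for each review, the index of the largest score is the predicted class, and adding 1 maps the 0-indexed class back to a 1-to-5 star rating. The logits below are made up and `logits_to_stars` is a hypothetical helper used only for illustration:

```python
def logits_to_stars(logits_batch):
    """Map each row of class scores to a 1-indexed star rating."""
    stars = []
    for row in logits_batch:
        # Index of the largest score, i.e. the predicted 0-indexed class.
        class_idx = max(range(len(row)), key=row.__getitem__)
        stars.append(class_idx + 1)
    return stars


# Made-up logits for two reviews, five classes each.
example_logits = [
    [-1.2, 0.3, 0.1, 2.5, 4.0],
    [3.1, 0.2, -0.5, -1.0, -2.2],
]
stars = logits_to_stars(example_logits)
print(stars)  # [5, 1]
```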

In [12]:
# Get the star predictions.
predictions, accuracy = get_star_predictions(df, model)

Batch #10: 160/195765
Batch #20: 320/195765
Batch #30: 480/195765
Batch #40: 640/195765
Batch #50: 800/195765
Batch #60: 960/195765
Batch #70: 1120/195765
Batch #80: 1280/195765
Batch #90: 1440/195765
Batch #100: 1600/195765
Batch #110: 1760/195765
Batch #120: 1920/195765
Batch #130: 2080/195765
Batch #140: 2240/195765
Batch #150: 2400/195765
Batch #160: 2560/195765
Batch #170: 2720/195765
Batch #180: 2880/195765
Batch #190: 3040/195765
Batch #200: 3200/195765
Batch #210: 3360/195765
Batch #220: 3520/195765
Batch #230: 3680/195765
Batch #240: 3840/195765
Batch #250: 4000/195765
Batch #260: 4160/195765
Batch #270: 4320/195765
Batch #280: 4480/195765
Batch #290: 4640/195765
Batch #300: 4800/195765
Batch #310: 4960/195765
Batch #320: 5120/195765
Batch #330: 5280/195765
Batch #340: 5440/195765
Batch #350: 5600/195765
Batch #360: 5760/195765
Batch #370: 5920/195765
Batch #380: 6080/195765
Batch #390: 6240/195765
Batch #400: 6400/195765
Batch #410: 6560/195765
Batch #420: 6720/195765
Batch #

In [13]:
# Show the predictions.
print(predictions)

[["I could sit here and write all about the specs on this computer, but they are already in the description, and If you are like me... you don't really understand it anyways.So I am going to tell you what I LOVE about this computer and what I use it for. I am a full time college student as well as a single mother who stays busy. I have previously used a HP All In one computer that I bought brand new a year ago and I hate that thing... It is so slow!!! When I first opened this item, I was just hoping that it would be a little faster! What I got instead was an amazing computer that is faster than I could have ever imagined. Now I don't use this thing for much more than amazon reviews, school work, and papers. But this is exactly what I needed."
  5]
 ['A very reasonably priced laptop for basic computing needs. The specs that stick out to me for describing this as "basic needs" is 4GB of RAM, and 128GB M.2 SSD. Both are at the bare minimum in today\'s needs. Cell phones now come with thos

## 7. Save the Results in a CSV File

---



In [14]:
# Save the predictions as a dataframe.
pred_df = pd.DataFrame(data=predictions, columns=['review', 'prediction'])

In [15]:
# Save the results to a CSV file.
pred_df.to_csv('predictions.csv')
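
One detail worth knowing when reading the file back: by default, `to_csv` also writes the row index as an unnamed first column, which `read_csv` then reports as `Unnamed: 0`. The sketch below shows the difference with a made-up two-row DataFrame and in-memory buffers instead of `predictions.csv`; whether to pass `index=False` is a design choice that depends on whether the row numbers are needed downstream.

```python
import io

import pandas as pd

# Hypothetical predictions standing in for pred_df.
pred = pd.DataFrame({'review': ['good', 'bad'], 'prediction': [5, 1]})

# Default behaviour: the row index is written as an unnamed first column.
buf = io.StringIO()
pred.to_csv(buf)
buf.seek(0)
cols_with_index = list(pd.read_csv(buf).columns)

# With index=False only the data columns are written.
buf = io.StringIO()
pred.to_csv(buf, index=False)
buf.seek(0)
cols_without_index = list(pd.read_csv(buf).columns)

print(cols_with_index)
print(cols_without_index)
```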