## 1. Import Statements

---



In [1]:
%%capture
!pip install transformers

In [2]:
import torch
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, BertModel, BertForSequenceClassification

In [3]:
# Set up the GPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

## 2. Load the Data

---


The original code in this section is located in `star_prediction.ipyn`. It is included here to make the `get_star_predictions()` function to work.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
github_url = 'https://raw.githubusercontent.com/csbanon/bert-product-rating-predictor/master/data/reviews_comments_stars.csv'
df = pd.read_csv(github_url)
df = df[['comment', 'stars']]
df

Unnamed: 0,comment,stars
0,I could sit here and write all about the specs...,5
1,A very reasonably priced laptop for basic comp...,4
2,"This is the best laptop deal you can get, full...",5
3,A few months after the purchase....It is still...,5
4,BUYER BE AWARE: This computer has Microsoft 10...,1
...,...,...
195760,I have not tried this camera without the SD ca...,5
195761,"Hello, I bought this item months ago and I tho...",1
195762,This is an incredible camera for the money!! ...,5
195763,Great cameras. Purchased some for my mother af...,5


In [6]:
train_dataset, test_dataset = train_test_split(df, test_size=0.2, random_state=1)
test_dataset = test_dataset.reset_index(drop=True)

## 3. Define the BERT Model

---



The original code in this section is located in `star_prediction.ipyn`. It is included here to make the `get_star_predictions()` function to work. The output is suppressed to make the notebook easier to read.

In [7]:
%%capture
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels = len(df['stars'].unique()), # number of unique labels for our multi-class classification problem
    output_attentions = False,
    output_hidden_states = False,
)
model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## 4. Load the Trained Model

---

Here, we load the `pytorch_model.bin` file, which contains the trained weights.

In [8]:
# Load the trained model.
model.load_state_dict(torch.load('drive/My Drive/pytorch_model.bin'))
model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

## 5. Define the Reviews Dataset

---



The original code in this section is located in `star_prediction.ipyn`. It is included here to make the `get_star_predictions()` function to work.

In [9]:
class ReviewsDataset(Dataset):
    def __init__(self, df, max_length=512):
        self.df = df
        self.tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
        self.max_length = max_length 
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        # input=review, label=stars
        review = self.df.loc[idx, 'comment']
        # labels are 0-indexed
        label = int(self.df.loc[idx, 'stars']) - 1
        
        encoded = self.tokenizer(
            review,                      # review to encode
            add_special_tokens=True,
            max_length=self.max_length,  # Truncate all segments to max_length
            padding='max_length',        # pad all reviews with the [PAD] token to the max_length
            return_attention_mask=True,  # Construct attention masks.
            truncation=True
        )
        
        input_ids = encoded['input_ids']
        attn_mask = encoded['attention_mask']
        
        return {
            'input_ids': torch.tensor(input_ids),
            'attn_mask': torch.tensor(attn_mask), 
            'label': torch.tensor(label)
        }

## 6. Predict the Star Ratings

---

The following code takes a DataFrame containing reviews and returns their predicted star ratings with an accuracy score.

In [10]:
def calculate_accuracy(predictions, targets):
  """
  Calculate the accuracy of the predictions.

  :predictions: the predicted star ratings.
  :targets: the ground-truth labels.
  """

  num_correct = (predictions == targets).sum().item()
  
  return num_correct

In [11]:
def get_star_predictions(df, model):
  """
  Uses the given pretrained model to predict star ratings based on
  reviews. Returns an array with the original reviews and their star
  predictions, as well as the accuracy of the predictions.

  :df: DataFrame containing reviews and their labels.
  :model: loaded pretrained BERT model.
  """

  # Define the prediction parameters.
  MAX_LEN = 256
  TEST_BATCH_SIZE = 16
  NUM_WORKERS = 4

  test_params = {'batch_size': TEST_BATCH_SIZE,
              'shuffle': True,
              'num_workers': NUM_WORKERS}

  # Define the dataset and dataloader.
  dataset = ReviewsDataset(df, MAX_LEN)
  data_loader = DataLoader(dataset, **test_params)

  num_batches = 0
  num_examples = 0
  num_correct = 0
  total_examples = len(df)
  predictions = np.zeros([total_examples, 2], dtype=object)

  for batch, data in enumerate(data_loader):
    
    # Get the tokenization values.
    input_ids = data['input_ids'].to(device)
    mask = data['attn_mask'].to(device)
    labels = data['label'].to(device)

    # Make predictions with the trained model
    outputs = model(input_ids, mask)
    
    # Get the star ratings.
    big_val, big_idx = torch.max(outputs[0].data, dim=1)
    star_predictions = (big_idx + 1).cpu().numpy()
    reviews = df['comment'].values[num_examples:num_examples + labels.size(0)]
    batch = np.vstack((reviews, star_predictions)).T

    # Update the output.
    predictions[num_examples:num_examples + labels.size(0)] = batch

    num_correct += calculate_accuracy(big_idx, labels)

    num_batches += 1
    num_examples += labels.size(0)

    print("Batch #{}: {}/{}".format(num_batches, num_examples, total_examples))

  accuracy = (num_correct * 100) / num_examples

  print("Finished predictions! Accuracy:", accuracy)

  return predictions, accuracy

In [12]:
# Get the star predictions.
predictions, accuracy = get_star_predictions(test_dataset, model)

Batch #1: 16/39153
Batch #2: 32/39153
Batch #3: 48/39153
Batch #4: 64/39153
Batch #5: 80/39153
Batch #6: 96/39153
Batch #7: 112/39153
Batch #8: 128/39153
Batch #9: 144/39153
Batch #10: 160/39153
Batch #11: 176/39153
Batch #12: 192/39153
Batch #13: 208/39153
Batch #14: 224/39153
Batch #15: 240/39153
Batch #16: 256/39153
Batch #17: 272/39153
Batch #18: 288/39153
Batch #19: 304/39153
Batch #20: 320/39153
Batch #21: 336/39153
Batch #22: 352/39153
Batch #23: 368/39153
Batch #24: 384/39153
Batch #25: 400/39153
Batch #26: 416/39153
Batch #27: 432/39153
Batch #28: 448/39153
Batch #29: 464/39153
Batch #30: 480/39153
Batch #31: 496/39153
Batch #32: 512/39153
Batch #33: 528/39153
Batch #34: 544/39153
Batch #35: 560/39153
Batch #36: 576/39153
Batch #37: 592/39153
Batch #38: 608/39153
Batch #39: 624/39153
Batch #40: 640/39153
Batch #41: 656/39153
Batch #42: 672/39153
Batch #43: 688/39153
Batch #44: 704/39153
Batch #45: 720/39153
Batch #46: 736/39153
Batch #47: 752/39153
Batch #48: 768/39153
Batch #

In [13]:
# Show the predictions.
print(predictions)

[["Simple to place in not a fan of the pull arm style of fan but it does the job when upgrading CPU's. You also gotta update your bios before installation of new CPU"
  5]
 ['Its nice.  A small little dock for two controllers, plugged it in and with no controllers on it both lights came on green.  Not discouraged I plugged in both controllers hoping to see them turn red.  Didnt happen.  Powered on one controller in the dock, battery indicator was not blinking indicating charing.  Unplugged usb from base bad plugged into controller its self low and behold....it charges....try the dock again and nothing.  Why would a company continue to make something THEY KNOW DOES NOT WORK?  Extremely disappointed I wasted my young nephews (9) xmas gift to me on this.  He was very proud to give me the 10 dollar amazon gift card he worked himself doing chores for his neighbors.'
  5]
 ['I really like these, purchased one for each garage door and works great on my Android cell phone.  Have now had these 