# Text Classification: BERT Base Uncased with PyTorch

##### This repository includes a Jupyter Notebook that builds a sentiment analysis model on the BERT Base Uncased architecture, described in: Devlin, Jacob, et al. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv, Cornell University, 11 Oct. 2018, https://arxiv.org/abs/1810.04805. Accessed 30 June 2024.

<img src='https://nlp.gluon.ai/_images/bert-sentence-pair.png' width='800'>

## About

##### In this project, I use Hugging Face's BERT Base Uncased encoder to train a sentiment analysis model on an Amazon product review dataset. A model like this could let business functions at Amazon, such as marketing or product, analyze customer sentiment about the products they sell quickly and accurately, so that stakeholders can focus on decision making.

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import pyarrow as pa
import torch
import evaluate
import opendatasets as od
import matplotlib.pyplot as plt
from datasets import Dataset
from sklearn.model_selection import train_test_split
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

In [2]:
od.download("https://www.kaggle.com/datasets/mahmudulhaqueshawon/amazon-product-reviews")

Skipping, found downloaded files in ".\amazon-product-reviews" (use force=True to force download)


In [3]:
df = pd.read_csv("amazon-product-reviews/amazon.csv")
df.head()

| Unnamed: 0 | Text | label |
|---|---|---|
| 0 | This is the best apps acording to a bunch of ... | 1 |
| 1 | This is a pretty good version of the game for ... | 1 |
| 2 | this is a really . there are a bunch of levels... | 1 |
| 3 | This is a silly game and can be frustrating, b... | 1 |
| 4 | This is a terrific game on any pad. Hrs of fun... | 1 |


## Process the data

In [5]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

In [6]:
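def process_data(row):
    # Normalize the review text: cast to str and collapse repeated whitespace
    text = str(row['Text'])
    text = ' '.join(text.split())

    # Tokenize to a fixed length of 128 tokens (pad short reviews, truncate long ones)
    encodings = tokenizer(text, padding="max_length", truncation=True, max_length=128)

    # Binary target: 1 for positive reviews, 0 otherwise
    encodings['label'] = 1 if row['label'] == 1 else 0
    encodings['Text'] = text

    return encodings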
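
The `' '.join(text.split())` step in `process_data` collapses runs of whitespace (spaces, tabs, newlines) into single spaces and trims both ends. A minimal sketch on a made-up review string; the helper name `normalize_whitespace` and the sample text are illustrative and not part of the notebook:

```python
def normalize_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace
    # and ignores leading/trailing whitespace.
    return ' '.join(str(text).split())


sample = "  Great   game!\n\tHours of fun.  "  # hypothetical review text
cleaned = normalize_whitespace(sample)
print(cleaned)
```

Casting with `str(...)` first also means a non-string cell is tokenized as its string form instead of raising an error.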

In [7]:
print(process_data({
    'Text': 'this is a sample review of a movie.',
    'label': 1
}))

{'input_ids': [101, 2023, 2003, 1037, 7099, 3319, 1997, 1037, 3185, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
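
In the output above, the 11 non-zero `input_ids` are the real tokens (starting with 101 for `[CLS]` and ending with 102 for `[SEP]`), and the remaining positions are padding with id 0 up to `max_length=128`. The attention mask marks real tokens with 1 and padding with 0. The sketch below rebuilds that layout by hand from the printed ids; `pad_to_max_length` is a hypothetical helper for illustration, not the tokenizer's implementation:

```python
MAX_LENGTH = 128
PAD_ID = 0  # padding id as it appears in the printed input_ids

# Token ids copied from the printed example: [CLS] ... [SEP]
token_ids = [101, 2023, 2003, 1037, 7099, 3319, 1997, 1037, 3185, 1012, 102]


def pad_to_max_length(ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    # Simplified truncation: the real tokenizer also keeps the final [SEP]
    ids = ids[:max_length]
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad
    return input_ids, attention_mask


input_ids, attention_mask = pad_to_max_length(token_ids)
print(len(input_ids), sum(attention_mask))
```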

In [8]:
# Process only the first 1,000 reviews
processed_data = [process_data(row) for _, row in df.head(1000).iterrows()]

## Generate the dataset

In [10]:
new_df = pd.DataFrame(processed_data)

train_df, valid_df = train_test_split(
    new_df,
    test_size=0.2,
    random_state=2022
)

In [11]:
train_hg = Dataset(pa.Table.from_pandas(train_df))
valid_hg = Dataset(pa.Table.from_pandas(valid_df))

In [12]:
accuracy = evaluate.load("accuracy")

In [13]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
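
`compute_metrics` receives the raw logits and the true labels, takes the argmax over the class axis, and compares the result with the labels. A plain-Python sketch of the same computation on made-up logits; the values and the helper `accuracy_from_logits` are illustrative assumptions, whereas the notebook itself relies on `np.argmax` and the `evaluate` accuracy metric:

```python
def accuracy_from_logits(logits, labels):
    # Index of the largest logit in each row (what np.argmax(..., axis=1) returns)
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for four reviews, columns = [class 0, class 1]
logits = [[-1.2, 2.3], [0.8, -0.4], [-0.1, 0.5], [1.5, 1.1]]
labels = [1, 0, 0, 0]

result = accuracy_from_logits(logits, labels)
print(result)
```

Here three of the four predictions match the labels, so the sketch reports an accuracy of 0.75.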

## Create a model

In [15]:
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
training_args = TrainingArguments(
    output_dir="./result",
    eval_strategy="epoch",
    num_train_epochs=4,
    learning_rate=2e-5,
    weight_decay=0.01,
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_hg,
    eval_dataset=valid_hg,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

## Train and Evaluate the model

In [31]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.3431 | 0.142146 | 0.9 |
| 2 | 0.1546 | 0.062358 | 0.985 |
| 3 | 0.0525 | 0.07518 | 0.985 |
| 4 | 0.0375 | 0.07536 | 0.985 |


TrainOutput(global_step=400, training_loss=0.1469484955072403, metrics={'train_runtime': 274.8339, 'train_samples_per_second': 11.643, 'train_steps_per_second': 1.455, 'total_flos': 210488844288000.0, 'train_loss': 0.1469484955072403, 'epoch': 4.0})
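
The figures in `TrainOutput` can be cross-checked by hand. The sketch below uses the 1,000 processed rows and the 80/20 split from the notebook, and additionally assumes `Trainer`'s default per-device train batch size of 8 and a single device; neither of those two is printed in the log, so treat them as assumptions:

```python
import math

n_rows = 1000             # rows processed in the notebook
n_valid = n_rows // 5     # test_size=0.2 holds out one fifth
n_train = n_rows - n_valid
batch_size = 8            # assumed: default per_device_train_batch_size, one device
num_epochs = 4
train_runtime = 274.8339  # seconds, copied from TrainOutput

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = n_train * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps, round(samples_per_second, 3), round(steps_per_second, 3))
```

Under these assumptions the step count and both throughput figures agree with `global_step=400`, `train_samples_per_second` and `train_steps_per_second` in the log.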

In [32]:
trainer.evaluate()

{'eval_loss': 0.062357522547245026,
 'eval_accuracy': 0.985,
 'eval_runtime': 5.1033,
 'eval_samples_per_second': 39.19,
 'eval_steps_per_second': 4.899,
 'epoch': 4.0}
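
The `eval_loss` reported here matches epoch 2 rather than epoch 4, which is consistent with `load_best_model_at_end=True`. Assuming no `metric_for_best_model` is set, `Trainer` ranks checkpoints by validation loss (lower is better) and reloads the best one once training ends. A small sketch of that selection using the per-epoch values from the training table; the selection logic is a simplified stand-in, not the library's code:

```python
# Validation loss per epoch, copied from the training table
validation_loss = {1: 0.142146, 2: 0.062358, 3: 0.07518, 4: 0.07536}

# Pick the epoch with the lowest validation loss (lower is better)
best_epoch = min(validation_loss, key=validation_loss.get)
print(best_epoch, validation_loss[best_epoch])
```

Accuracy is tied at 0.985 for epochs 2 to 4, so under this assumption it is the loss that decides which checkpoint is kept.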

## Save the model

In [35]:
model.save_pretrained('./model/')

## Load the model

In [37]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

new_model = AutoModelForSequenceClassification.from_pretrained('./model/').to(device)

In [39]:
new_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

## Get predictions

In [41]:
def get_prediction(text):
    encoding = new_tokenizer(text, return_tensors="pt", padding="max_length", truncation=True, max_length=128)
    # Move inputs to the device of the model that runs inference (new_model, not trainer.model)
    encoding = {k: v.to(new_model.device) for k, v in encoding.items()}

    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = new_model(**encoding)

    # Per-logit sigmoid scores; these are not normalized across the two classes
    probs = torch.sigmoid(outputs.logits.squeeze()).cpu().numpy()
    label = int(np.argmax(probs, axis=-1))

    return {
        'label': label,
        'probability': probs[label]
    }
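
`get_prediction` scores the two logits with a sigmoid, which treats each logit independently, so the two scores need not sum to 1. For a single-label, two-class head, a softmax over the logits gives normalized class probabilities instead. The sketch below compares the two on made-up logits using only the standard library; it is an illustration, not a change to the notebook's function:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    # Subtract the max before exponentiating for numerical stability
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-2.5, 3.0]  # hypothetical [class 0, class 1] logits

sigmoid_scores = [sigmoid(v) for v in logits]
softmax_probs = softmax(logits)

print(sigmoid_scores, sum(sigmoid_scores))
print(softmax_probs, sum(softmax_probs))
```

Both functions are monotonic in the logits, so the predicted label is the same either way; only the reported score changes.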

# Print predictions for the first 10 validation rows
max_entries = 10

for _, row in valid_df.head(max_entries).iterrows():
    input_string = row['Text']
    actual_label = row['label']
    result = get_prediction(input_string)
    print(f"Input: {input_string}")
    print(f"Actual Label: {actual_label}")
    print(f"Prediction: {result['label']} with probability {result['probability']:.4f}")
    print()

Input: me and the kids love this and play it everywhere waiting for our food while i'm shopping its a must have to keep everyone busy
Actual Label: 1
Prediction: 1 with probability 0.9556

Input: By boys promised I would like Angry Birds, and thy are right .It's A very fun game It's hard to put down once you get started.and easy to lose track of time when your playing.best game ever
Actual Label: 1
Prediction: 1 with probability 0.9508

Input: I downloaded this right after Christmas and have a hard time putting down, it's so addictive! My daughter has all of the Angry birds games on both the pc and my laptop, but she's not allowed to play on my Kindle. Why does this game draw me in so much? Th
Actual Label: 1
Prediction: 1 with probability 0.9498

Input: Who doesn't like angry birds? It's free and after you finish each episode, the next challenge is to do them all perfectly. It's harder than you might think.
Actual Label: 1
Prediction: 1 with probability 0.9505

Input: I think that my 