# Project Aim

Use DistilBERT to capture users' sentiment toward a product or service from the text of their reviews.

**Core Problem:** Understanding product sentiment from star ratings alone (out of 5) is difficult, because users who give a product the same 3-star rating may not share the same sentiment toward it.

This difference arises because most of us read 3 stars as a 'Decent' product with a few minor complaints or issues. In practice, some users give 3 stars even though their description clearly points to a 'Bad' product with major complaints or issues, which should not be grouped with 'Decent' 3-star products.

**Solution:** Fine-tune our NLP model on the review descriptions left by users, and map the ratings (out of 5) to 3 target classes:
- 1-2 star == Negative Class
- 3 star == Neutral Class
- 4-5 star == Positive Class

Although the ground truth may be distorted because the Neutral class is not consistent, we can still train our model to get a fair understanding of the Neutral class as a 'Decent' product with minor complaints or issues. This is possible because most Neutral-rated samples are of the 'Decent' type, so the model can still learn that distinction, helped by what it learns about the 'Negative' and 'Positive' classes.

# Connecting Drive and Libraries


In [None]:
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
!pip install langdetect

In [None]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from transformers import DistilBertTokenizer, DistilBertModel
from transformers import AutoTokenizer
from tqdm.auto import tqdm
from langdetect import detect  # used by detect_language() in the language-filtering cell
from sklearn.preprocessing import LabelEncoder
from torch.optim.lr_scheduler import StepLR


# Importing dataset

In [None]:
data_path = "/content/drive/MyDrive/Datasets/amazon_reviews/amazon_reviews.tsv"

In [None]:
# extracting data into df

df = pd.read_csv(data_path, sep='\t', on_bad_lines='skip')
print(df.head())

## Data Preprocessing


In [None]:
# extracting relevant columns
df_relevant = df[['star_rating', 'review_body']]

# removing nan values
df_noNAN = df_relevant.dropna(subset=['star_rating','review_body'])

# removing duplicates (explicit copy, so the column assignments below act on an independent DataFrame)
df_no_duplicates = df_noNAN.drop_duplicates(subset=['review_body'], keep='first').copy()

# converting reviews rating column to integer
df_no_duplicates['star_rating'] = df_no_duplicates['star_rating'].astype('int32')

# limiting review length for computational efficiency and more uniform inputs
df_no_duplicates['review_length'] = df_no_duplicates['review_body'].apply(len)

threshold_length = 100

# explicit copy, so adding 'reviews.category' later does not trigger a SettingWithCopyWarning
df_filtered = df_no_duplicates[df_no_duplicates['review_length'] <= threshold_length].copy()

# Converting ratings into positive, neutral and negative...
def categorize_rating(rating):
    if rating in [4, 5]:
        return 'positive'
    elif rating == 3:
        return 'neutral'
    else:
        return 'negative'

df_filtered['reviews.category'] = df_filtered['star_rating'].map(categorize_rating)

# Using undersampling to deal with imbalanced data
sample_size = 1000

# Sampling from each rating category
positive_samples = df_filtered[df_filtered['reviews.category'] == 'positive'].sample(n=sample_size, random_state=42)
neutral_samples = df_filtered[df_filtered['reviews.category'] == 'neutral'].sample(n=sample_size, random_state=42)
negative_samples = df_filtered[df_filtered['reviews.category'] == 'negative'].sample(n=sample_size, random_state=42)

df_sampled = pd.concat([positive_samples, neutral_samples, negative_samples], ignore_index=True)

In [None]:
# Keeping only english language reviews

tqdm.pandas()

def detect_language(text):
    try:
        return detect(text)
    except Exception:
        # langdetect raises on empty or non-linguistic text; treat those as unknown
        return None


df_sampled['language'] = df_sampled['review_body'].progress_apply(detect_language)


df_english = df_sampled[df_sampled['language'] == 'en']

In [None]:
## saving/ retrieving the csv at our checkpoint

# df_english.to_csv('df_english.csv')

df_english = pd.read_csv('/kaggle/input/amazon-9k/df_english_9000.csv')
df_english.drop('Unnamed: 0', axis = 1, inplace = True)

In [None]:
# Creating our labels

label_encoder = LabelEncoder()
df_english['label'] = label_encoder.fit_transform(df_english['reviews.category'])

## Output: {0: 'negative', 1: 'neutral', 2: 'positive'}

labels = df_english['label'].tolist()


In [None]:
#Splitting Data into train and test

X = df_english['review_body']
y = df_english['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
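
Because the split above is not stratified, the three classes can end up in slightly different proportions in the train and test sets. A minimal sketch of a sanity check, using only the standard library; `class_distribution` is a hypothetical helper, and the toy label list stands in for `y_train.tolist()` or `y_test.tolist()`.

```python
from collections import Counter

def class_distribution(labels):
    """Return the fraction of samples per class label, keyed by label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}

# Toy labels standing in for y_train.tolist() (0=negative, 1=neutral, 2=positive).
toy_labels = [0, 0, 1, 2, 2, 2, 1, 0]
print(class_distribution(toy_labels))
```

If the proportions drift noticeably, passing `stratify=y` to `train_test_split` is one option to consider.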


## Tokenizing Input Data


In [None]:
# Tokenizing our inputs

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_data(texts):
    return tokenizer.batch_encode_plus(
        texts.tolist(),
        padding=True,
        truncation=True,
        max_length=256,
        return_attention_mask=True,
        return_tensors="pt"
    )

train_encodings = tokenize_data(X_train)
test_encodings = tokenize_data(X_test)

## Loading data into PyTorch Dataset

In [None]:
#transforming our tokenized data and labels into dataset objects for PyTorch

class SentimentData(Dataset):
    def __init__(self, encodings, labels, sentences):
        self.encodings = encodings
        self.labels = labels
        self.sentences = sentences

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        item['sentence'] = self.sentences[idx]
        return item


In [None]:
# Converting labels to list
y_traini = y_train.tolist()
y_testi = y_test.tolist()

# Converting the actual sentences into lists

X_traini = X_train.tolist()
X_testi = X_test.tolist()

# Create the datasets for training and testing
train_dataset = SentimentData(train_encodings, y_traini, X_traini)
test_dataset = SentimentData(test_encodings, y_testi, X_testi)

In [None]:
# Creating DataLoader for training and testing

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)
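
The number of batches per epoch follows from the dataset size and `batch_size`: with the default `drop_last=False`, a `DataLoader` yields ceil(N / batch_size) batches. A quick check, assuming the 1702 test samples shown in the classification report later in this notebook; `num_batches` is a hypothetical helper.

```python
import math

def num_batches(num_samples, batch_size):
    """Batches a DataLoader yields per epoch when drop_last=False."""
    return math.ceil(num_samples / batch_size)

# 1702 test samples at batch_size=16 (the last batch is only partly filled).
print(num_batches(1702, 16))  # 107, matching "Batch ... of 107" in the evaluation log
```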

## Initializing our DistilBert Model ( base - uncased )

In [None]:
# Initializing our model

distilbert = DistilBertModel.from_pretrained("distilbert-base-uncased")

In [None]:
# Creating our model...

class Bert_Flow(nn.Module):

    def __init__(self, bert_model):

        super(Bert_Flow, self).__init__()

        self.bert = bert_model

        self.drop1 = nn.Dropout(0.1)
        self.linear1 = nn.Linear(768,512)
        self.relu1 =  nn.ReLU()
        self.drop2 = nn.Dropout(0.1)
        self.linear2 = nn.Linear(512,3)


    def forward(self, input_ids, attention_mask):

        outputs = self.bert(input_ids, attention_mask)
        last_hidden_state = outputs.last_hidden_state
        cls_hs = last_hidden_state[:, 0, :]

        x = self.drop1(cls_hs)
        x = self.linear1(x)
        x = self.relu1(x)
        x = self.drop2(x)
        x = self.linear2(x)

        return x

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# pass the pre-trained BERT to our custom architecture
model = Bert_Flow(distilbert)

# Pushing model into GPU if exists
model = model.to(device)

# Instantiating the Adam optimizer

optimizer = optim.Adam(model.parameters(), lr=1e-6)

# Instantiating the cross-entropy loss

CEloss = nn.CrossEntropyLoss()
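
`nn.CrossEntropyLoss` expects raw logits and integer class labels and applies log-softmax internally, which is why `Bert_Flow.forward` returns logits without a softmax layer. A pure-Python sketch of the per-sample computation, for illustration only (`cross_entropy` is a hypothetical helper, not the PyTorch implementation):

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Equal logits mean a uniform guess over 3 classes, so the loss is ln(3).
print(round(cross_entropy([0.0, 0.0, 0.0], target=1), 4))  # 1.0986
```

This gives a useful baseline: a loss near 1.10 means the model is doing no better than chance on three classes, which is roughly where the first training epoch sits (1.054).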

## Creating Train, Helper, and Test Functions

In [None]:
# Creating our Training function

def train():
    model.train()
    total_loss, total_accuracy = 0, 0
    total_preds = []
    all_labels = []

    for step, batch in enumerate(train_loader):
        if step % 50 == 0 and not step == 0:
            print(f'Batch {step} of {len(train_loader)}.')


        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()

        preds = model(input_ids, attention_mask)

        loss = CEloss(preds, labels)

        total_loss += loss.item()

        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()

        preds = preds.detach().cpu().numpy()

        total_preds.append(preds)

        all_labels.extend(labels.cpu().numpy())

    avg_loss = total_loss / len(train_loader)
    total_preds = np.concatenate(total_preds, axis=0)

    # Calculate accuracy

    predicted_labels = np.argmax(total_preds, axis=1)
    all_labels = np.array(all_labels)
    accuracy = np.mean(predicted_labels == all_labels)

    return avg_loss, accuracy


In [None]:
# This is our helper function that helps create our final results DataFrame

def analyze_neutral_predictions():

    # Initializing an empty list to store the data for the DataFrame
    data = []

    # Iterating through the data
    for i in range(len(final_preds)):
        sentence = sentences_list[i]
        predicted_class = final_preds[i]
        true_class = final_labels[i]

        # Mapping the true and predicted class indices to their text labels
        class_names = ['negative', 'neutral', 'positive']
        true_class_label = class_names[true_class]
        class_label = class_names[predicted_class]

        # Appending the sentence and its class label to the data
        data.append((sentence, class_label, true_class_label))

    # Creating a pandas DataFrame with the collected data
    df = pd.DataFrame(data, columns=['Sentences', 'Predicted Class', 'True Class'])

    return df

In [None]:
# Creating our Evaluation function

def evaluate():
    print("\nEvaluating...")

    model.eval()

    total_loss = 0
    total_preds = []
    all_labels = []
    sentences_list = []

    for step, batch in enumerate(test_loader):
        if step % 50 == 0 and not step == 0:

            print(f'  Batch {step} of {len(test_loader)}')


        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        sentences = batch['sentence']

        all_labels.extend(labels.cpu().numpy())
        sentences_list.extend(sentences)


        with torch.no_grad():

            preds = model(input_ids, attention_mask)

            loss = CEloss(preds, labels)
            total_loss += loss.item()


            preds = preds.detach().cpu().numpy()
            total_preds.append(preds)


    avg_loss = total_loss / len(test_loader)


    all_labels = np.array(all_labels)                                    # These are the Actual Labels corresponding to each prediction
    total_preds = np.concatenate(total_preds, axis=0)                    # These are the Logits predicted
    predicted_labels = np.argmax(total_preds, axis=1)                    # Converting the logits into labels
    probabilities = torch.softmax(torch.tensor(total_preds), dim = 1)    # These are predicted probabilities
    accuracy = np.mean(predicted_labels == all_labels)





    return avg_loss, accuracy, predicted_labels, total_preds ,all_labels, sentences_list, probabilities
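
`evaluate` turns logits into labels with `argmax` and into probabilities with `softmax`. Because softmax preserves the ordering of the logits, both pick the same class; the probabilities only add a confidence estimate, which may help when inspecting borderline Neutral reviews. A small stdlib sketch with made-up logits (the `softmax` helper and the numbers are illustrative):

```python
import math

LABELS = ['negative', 'neutral', 'positive']  # index order produced by LabelEncoder

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.2, 1.1, 0.9]  # made-up logits for a single review
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
print(LABELS[predicted], [round(p, 3) for p in probs])
```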



## Running our model...

In [None]:
# running the model...

# Setting initial loss to infinite
best_valid_loss = float('inf')

train_losses = []
valid_losses = []

final_preds = 0
final_labels = 0

epochs = 10

for epoch in tqdm(range(epochs)):

    print(f'\nEpoch {epoch + 1} / {epochs}')

    train_loss, train_accuracy = train()

    valid_loss, valid_accuracy, pred_labels, total_preds ,all_labels, sentences_list, probabilities = evaluate()

    # Saving the best model based on validation loss
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')

    train_losses.append(train_loss)
    valid_losses.append(valid_loss)

    if epoch == epochs - 1:

        # final_preds are the predicted labels (NumPy array)
        # final_labels are the actual labels (NumPy array)
        # final_logits are the predicted logits
        final_preds, final_labels, final_logits = pred_labels, all_labels, total_preds
        data_results = analyze_neutral_predictions()


    # Printing losses and accuracies

    print(f'\nTraining Loss: {train_loss:.3f}, Training Accuracy: {train_accuracy * 100:.2f}%')
    print(f'Validation Loss: {valid_loss:.3f}, Validation Accuracy: {valid_accuracy * 100:.2f}%')




Epoch 1 / 10
Batch 50 of 426.
Batch 100 of 426.
Batch 150 of 426.
Batch 200 of 426.
Batch 250 of 426.
Batch 300 of 426.
Batch 350 of 426.
Batch 400 of 426.

Evaluating...
  Batch 50 of 107
  Batch 100 of 107

Training Loss: 1.054, Training Accuracy: 48.75%
Validation Loss: 0.941, Validation Accuracy: 67.04%

Epoch 2 / 10
Batch 50 of 426.
Batch 100 of 426.
Batch 150 of 426.
Batch 200 of 426.
Batch 250 of 426.
Batch 300 of 426.
Batch 350 of 426.
Batch 400 of 426.

Evaluating...
  Batch 50 of 107
  Batch 100 of 107

Training Loss: 0.845, Training Accuracy: 66.64%
Validation Loss: 0.752, Validation Accuracy: 69.98%

Epoch 3 / 10
Batch 50 of 426.
Batch 100 of 426.
Batch 150 of 426.
Batch 200 of 426.
Batch 250 of 426.
Batch 300 of 426.
Batch 350 of 426.
Batch 400 of 426.

Evaluating...
  Batch 50 of 107
  Batch 100 of 107

Training Loss: 0.733, Training Accuracy: 70.43%
Validation Loss: 0.691, Validation Accuracy: 70.86%

Epoch 4 / 10
Batch 50 of 426.
Batch 100 of 426.
Batch 150 of 426.
Bat

In [None]:
# Saving our DataFrame / retrieving our DataFrame

#data_results.to_csv('neutral.csv')

#data_results = pd.read_csv('/kaggle/input/data-results/data_results.csv')
#data_results.drop('Unnamed: 0', axis = 1, inplace = True)

# Evaluating our data ( 9000 samples )

### Basic evaluation metrics...

In [None]:
# Precision, Recall, and F1 Scores for each class.

print(classification_report(final_labels, final_preds, target_names=['Negative', 'Neutral', 'Positive']))

              precision    recall  f1-score   support

    Negative       0.73      0.77      0.75       541
     Neutral       0.65      0.57      0.61       571
    Positive       0.78      0.84      0.80       590

    accuracy                           0.73      1702
   macro avg       0.72      0.73      0.72      1702
weighted avg       0.72      0.73      0.72      1702
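
The F1 column is the harmonic mean of precision and recall, and the macro average is the unweighted mean across the three classes. A quick sketch that recomputes these from the rounded values in the report (small differences from the printed numbers are expected, since the report rounds to two decimals); `f1_score_from` and `macro_average` are hypothetical helpers.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_average(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)

# (precision, recall) per class, copied from the report above.
report = {'Negative': (0.73, 0.77), 'Neutral': (0.65, 0.57), 'Positive': (0.78, 0.84)}

for name, (p, r) in report.items():
    print(f'{name}: F1 ~ {f1_score_from(p, r):.3f}')
print(f'macro recall ~ {macro_average([r for _, r in report.values()]):.3f}')
```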



### Critical Observation...

- Here we observe that our model does well on both Positive and Negative samples, but its performance on Neutral samples is noticeably weaker (F1 of 0.61, versus 0.75 for Negative and 0.80 for Positive).

This may be because for 'Positive' (rated 4 or 5 out of 5) and 'Negative' (rated 1 or 2 out of 5) samples the reviewer's sentiment is very clear, whereas 'Neutral' samples (rated 3 out of 5) are not all rated with the same sentiment in mind.

This means that the Neutral label may not capture the true sentiment of the reviewers in that category. Let us look at some examples of 'Neutral' samples that were rated by people on Amazon...

Set1 (Neutral rated samples)-
- 'Story concluded to quickly and was anti climatic, but a good read.'  # MODEL DECIDED IT'S NEUTRAL (minor complaint)

- 'Not bad, good solid tour of Italy. Ending could have been more exciting....'  # MODEL DECIDED IT'S NEUTRAL (minor complaint)

- 'I liked the book but I believe it could use more action and details! BOOM SAUCE!  ;) ;) ;) :) :) :)'  # MODEL DECIDED IT'S NEUTRAL (minor complaint)

- 'Good book, loved it, just dragged on a tiny bit'  # MODEL DECIDED IT'S POSITIVE

- 'As good as her other book, The Husband's Secret. A good read!'  # MODEL DECIDED IT'S POSITIVE


Set2 (Neutral rated samples)-
- 'Keeps shutting down. Not please with this app.'  # MODEL DECIDED IT'S NEGATIVE (major complaint)

- 'Takes too long for energy to recharge'  # MODEL DECIDED IT'S NEGATIVE (major complaint)

- 'Not what I really wanted didn't fit my ear! But cost time and Money 2 send back!'  # MODEL DECIDED IT'S NEGATIVE (major complaint)

- 'I can not really rate the movie.  The very poor streaming of it made it unbearable to watch.'  # MODEL DECIDED IT'S NEGATIVE (major complaint)




It looks as if Set1 is more like 'Decent': each review has a small hiccup that the reviewer does not mind. Set2 is more like 'Bad', with a much stronger level of criticism.

Yet both sets of users gave a Neutral (3-star) rating, which goes against the intuitive notion of a 'Neutral' rating as 'Decent' with a few minor complaints or issues.
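
Set1 and Set2 are Neutral-rated reviews grouped by what the model predicted. One way to pull such groups out of `data_results` is sketched below. It assumes the rows are available as dictionaries with the `Sentences`, `Predicted Class`, and `True Class` columns built by `analyze_neutral_predictions` (for example via `data_results.to_dict('records')`); `group_neutral_by_prediction` is a hypothetical helper, and the two sample rows reuse reviews quoted above.

```python
from collections import defaultdict

def group_neutral_by_prediction(records):
    """Group reviews whose true class is 'neutral' by the predicted class."""
    groups = defaultdict(list)
    for row in records:
        if row['True Class'] == 'neutral':
            groups[row['Predicted Class']].append(row['Sentences'])
    return dict(groups)

# Two rows in the shape of data_results.to_dict('records').
records = [
    {'Sentences': 'Good book, loved it, just dragged on a tiny bit',
     'Predicted Class': 'positive', 'True Class': 'neutral'},
    {'Sentences': 'Takes too long for energy to recharge',
     'Predicted Class': 'negative', 'True Class': 'neutral'},
]
for predicted, sentences in group_neutral_by_prediction(records).items():
    print(predicted, len(sentences))
```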


# Conclusion

It turns out that our model does in fact do quite well at separating these cases: it classifies Set1 mostly as Positive (and sometimes Neutral) and Set2 mostly as Negative (and sometimes Neutral).

So if we define Neutral as 'Decent' with a few minor complaints or issues, and Negative as involving a more major complaint or issue, the model does quite well.

Because the ground truth for the Neutral class is distorted, the evaluation scores understate the model's performance: judged against the sentiment actually expressed in the review descriptions, its per-class accuracy is likely higher than the reported numbers suggest.