# Goodreads Books Review Rating Prediction
(https://www.kaggle.com/competitions/goodreads-books-reviews-290312/)
---

> Reviews are a good way to judge the quality of any product, whether it's books, clothes, technology, or anything else. When you want to buy something online these days, the first thing that comes to mind is the reviews from past buyers and the overall rating the product has received.

> Reader feedback, whether positive or negative, five stars or one star, will encourage the product owner to make improvements.

> Reader connection and engagement will be encouraged by book reviews, whether they be left on Amazon, Goodreads, or social media. Readers must determine whether or not other readers are enjoying the book.




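The dataset has the following columns:

*   user_id - ID of the reviewer
*   book_id - ID of the book
*   review_id - ID of the review
*   rating - rating from 0 to 5
*   review_text - text of the review
*   date_added - date the review was added
*   date_updated - date the review was last updated
*   read_at - date the book was read
*   started_at - date the reader started the book
*   n_votes - number of votes on the review
*   n_comments - number of comments on the review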


**Model (BERT and Fine Tuning)**

- BERT is not a finished model. It's designed to be fine-tuned to perform specific tasks like our sentiment analysis.

- Fine-tuning takes the already pre-trained model and makes it perform a similar task, called a downstream task. Such downstream tasks need no architectural modification to the BERT model.

- Google did the pre-training of BERT. It reportedly cost around $7,000 and ran for five days on 16 TPUs. We can take advantage of that by using this pre-trained model.

- Pre-trained models help to achieve better results in a shorter time and at lower cost.

- We use DistilBERT, a smaller and faster version of BERT, through the Hugging Face Transformers package, a Python library providing pre-trained NLP models.

## Author
Firda Puspita Devi


## Install Huggingface Transformers

In [None]:
!pip install transformers



## Dependencies

In [None]:
import sys
sys.path.append('/content/drive/MyDrive/Bootcamp/Day 33 - Checkpoint 1')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import os, sys
sys.path.append('../')
os.chdir('../')

import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from wordcloud import WordCloud
from collections import Counter
import requests
import re
import torch
from torch import optim
import torch.nn.functional as F
from tqdm import tqdm
from sklearn.model_selection import train_test_split


from transformers import BertForSequenceClassification, BertConfig, BertTokenizer
from nltk.tokenize import TweetTokenizer

from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
from scipy.special import softmax

from transformers import DistilBertTokenizerFast
from transformers import TFDistilBertForSequenceClassification

import tensorflow as tf
import json
from io import StringIO

In [None]:
pip install wordcloud matplotlib pandas

In [None]:
!pip install requests

## Configuration

First, we'll need to enable GPUs for the notebook:

- Navigate to Edit → Notebook Settings.
- Select GPU from the Hardware Accelerator drop-down.

In [None]:
num_gpus_available = len(tf.config.experimental.list_physical_devices('GPU'))
print("Num GPUs Available: ", num_gpus_available)
assert num_gpus_available > 0

## Import data from Kaggle


In [None]:
!pip install kaggle
!mkdir -p ~/.kaggle
!cp "/content/drive/MyDrive/Bootcamp/Day 33 - Checkpoint 1/kaggle.json" ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle competitions download -c goodreads-books-reviews-290312

In [None]:
!unzip goodreads-books-reviews-290312.zip -d "/content/drive/MyDrive/Bootcamp/Day 33 - Checkpoint 1"

## Exploratory Data Analysis (EDA)

### Statistics summary

#### Limit to get only 10k records of train data or 1% of each rating

In [None]:
data = pd.read_csv("/content/drive/MyDrive/Bootcamp/Day 33 - Checkpoint 1/goodreads_train.csv")

total_records_needed = 10000

# Calculate the total number of records for each rating
rating_counts = data['rating'].value_counts()

# Calculate 1% of each rating's total number of records
rating_sample_sizes = (rating_counts * 0.01).astype(int)

# Scale the sample sizes down proportionally if their sum exceeds 10,000
if rating_sample_sizes.sum() > total_records_needed:
    scaling_factor = total_records_needed / rating_sample_sizes.sum()
    rating_sample_sizes = (rating_sample_sizes * scaling_factor).astype(int)


# Sampling data according to calculated sample sizes
samples = []
for rating, size in rating_sample_sizes.items():
    rating_samples = data[data['rating'] == rating].sample(n=size, random_state=1)
    samples.append(rating_samples)

# Concatenate all samples into a new DataFrame
data = pd.concat(samples)


print(rating_sample_sizes)
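
The sizing rule in the cell above (1% per rating, then a proportional scale-down when the total exceeds the cap) can be traced by hand. The sketch below uses plain Python and made-up rating counts, which are an assumption for illustration only; the real counts come from `data['rating'].value_counts()`.

```python
# Hypothetical per-rating record counts (NOT the real dataset values)
rating_counts = {0: 60_000, 1: 40_000, 2: 100_000, 3: 300_000, 4: 500_000, 5: 400_000}
total_records_needed = 10_000

# Step 1: take 1% of each rating, truncated to an integer
sample_sizes = {rating: int(count * 0.01) for rating, count in rating_counts.items()}

# Step 2: if the 1% samples exceed the cap, shrink every rating by the same factor
total = sum(sample_sizes.values())
if total > total_records_needed:
    scaling_factor = total_records_needed / total
    sample_sizes = {rating: int(size * scaling_factor) for rating, size in sample_sizes.items()}

print(sample_sizes)
print("total sampled:", sum(sample_sizes.values()))
```

Because every size is truncated rather than rounded, the final total can land slightly below the cap; the class proportions stay close to those of the full dataset.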

In [None]:
# Validate the number of records and proportions
print("Total records sampled:", data.shape[0])
print(data['rating'].value_counts(normalize=True))  # Check the percentage distribution

In [None]:
data.info()

In [None]:
data_sum_stats = data.describe(include='all').T.drop('count', axis=1)
data_sum_stats

In [None]:
data.columns

In [None]:
col_outlier = ['rating','n_votes','n_comments']

for col in col_outlier:
    plt.figure(figsize=(16, 4))

    # histogram
    plt.subplot(1, 3, 1)
    sns.histplot(data[col], bins=30)
    plt.title('Histogram')

    # plot Q-Q
    plt.subplot(1, 3, 2)
    stats.probplot(data[col], dist="norm", plot=plt)
    plt.ylabel('Variable quantiles')

    # box plot
    plt.subplot(1, 3, 3)
    sns.boxplot(y=data[col])
    plt.title('Boxplot')

    plt.show()

In [None]:
# Select only numeric columns for correlation calculation
numeric_train = data.select_dtypes(include=[np.number])

# Calculate the correlation matrix on just the numeric data
corrMatrix = numeric_train.corr()

# Create a heatmap
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(corrMatrix, annot=True)
plt.show()

### Visualizations

In [None]:
# Set the three ID columns as a MultiIndex; calling set_index three times in a row
# would keep only the last column as index and discard the other two
data.set_index(["user_id", "book_id", "review_id"], inplace=True)

#### Rating and Number of Votes Distribution

In [None]:
# Calculate the average number of votes and comments per rating
avg_data = data.groupby('rating')[['n_votes', 'n_comments']].mean().reset_index()

In [None]:
# Bar plot for train data
plt.figure(figsize=(10, 6))
sns.barplot(x='rating', y='n_votes', data=avg_data)
plt.title('Average Number of Votes per Rating (Train Data)')
plt.xlabel('Rating')
plt.ylabel('Average Number of Votes')
plt.show()

In [None]:
# Bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='rating', y='n_comments', data=avg_data)
plt.title('Average Number of Comments per Rating')
plt.xlabel('Rating')
plt.ylabel('Average Number of Comments')
plt.show()

#### Number of Votes and Number of Comments Distributions

In [None]:
# Check if there are any duplicate indices
print(data.index.duplicated().sum())

# If there are duplicates, reset the index
if data.index.duplicated().any():
    data.reset_index(drop=True, inplace=True)

In [None]:
print("Max n_votes:", data['n_votes'].max())
print("Max n_comments:", data['n_comments'].max())

In [None]:
votes_bins = [0, 50, 100, 500, 1000, data['n_votes'].max() + 1]
comments_bins = [0, 10, 50, 100, 500, data['n_comments'].max() + 1]

print("Votes bins:", votes_bins)
print("Comments bins:", comments_bins)


In [None]:
# Check differences between bins to ensure they are all positive
print("Votes bins differences:", np.diff(votes_bins))
print("Comments bins differences:", np.diff(comments_bins))

In [None]:
# Rebuild the bins so the last edge is larger than both the last predefined edge
# and the largest value in the data (edges stay increasing and no value is left unbinned)
votes_bins = [0, 50, 100, 500, 1000]
comments_bins = [0, 10, 50, 100, 500]

votes_bins.append(max(votes_bins[-1] + 500, data['n_votes'].max() + 1))
comments_bins.append(max(comments_bins[-1] + 500, data['n_comments'].max() + 1))

In [None]:
# Apply binning again with corrected bins
data['votes_bin'] = pd.cut(data['n_votes'], bins=votes_bins, right=False)
data['comments_bin'] = pd.cut(data['n_comments'], bins=comments_bins, right=False)
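
With `right=False`, `pd.cut` builds half-open intervals `[left, right)`, and any value outside the outer edges gets no bin (NaN). The following sketch mimics that rule in plain Python to make the edge cases visible; `assign_bin` is a hypothetical helper written for this illustration, not part of pandas.

```python
from bisect import bisect_right

def assign_bin(value, edges):
    """Return the half-open interval [left, right) that contains value, or None."""
    if value < edges[0] or value >= edges[-1]:
        return None  # pd.cut would produce NaN for such a value
    i = bisect_right(edges, value) - 1
    return (edges[i], edges[i + 1])

edges = [0, 50, 100, 500, 1000, 1500]
for v in (0, 49, 50, 999, 1500):
    print(v, assign_bin(v, edges))
```

A value equal to an inner edge (50) belongs to the bin that starts there, while a value equal to the last edge (1500) falls outside every bin, which is why the last edge has to exceed the largest value in the data.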

In [None]:
# Check new differences between bins to ensure all are positive
print("New Votes bins differences:", np.diff(votes_bins))
print("New Comments bins differences:", np.diff(comments_bins))

In [None]:
avg_comments_per_vote_bin = data.groupby('votes_bin')['n_comments'].mean().reset_index()

plt.figure(figsize=(10, 6))
sns.barplot(x='votes_bin', y='n_comments', data=avg_comments_per_vote_bin)
plt.title('Average Number of Comments per Vote Bins')
plt.xlabel('Number of Votes (binned)')
plt.ylabel('Average Number of Comments')
plt.xticks(rotation=45)
plt.show()

#### WordCloud

In [None]:
text = ' '.join(review for review in data['review_text'])

# Generate a word cloud image
wordcloud = WordCloud(width = 800, height = 400, background_color ='white').generate(text)

# Display the word cloud image:
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

#### Top N Words

In [None]:
# Tokenize the text
words = text.split()

# Get frequencies of each word
word_counts = Counter(words)

# Determine the number of top words to display, e.g., top 10
top_n = 10
top_words = word_counts.most_common(top_n)
top_words_df = pd.DataFrame(top_words, columns=['Word', 'Frequency'])

# Plot
plt.figure(figsize=(10, 5))
plt.bar(top_words_df['Word'], top_words_df['Frequency'], color ='blue')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.title('Top N Words in the Dataset')
plt.xticks(rotation=45)
plt.show()
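
Counting raw `text.split()` tokens treats `Great`, `great` and `great.` as different words and lets common function words dominate the chart. A minimal sketch of a normalized count is shown below; the `top_words` helper, the sample reviews and the tiny stopword set are assumptions made up for the example.

```python
from collections import Counter

def top_words(texts, stop_words, n=10):
    """Count lowercased, punctuation-stripped words, skipping stopwords."""
    counts = Counter()
    for text in texts:
        for word in text.lower().split():
            word = word.strip('.,!?;:"\'()[]')
            if word and word not in stop_words:
                counts[word] += 1
    return counts.most_common(n)

# Hypothetical mini-corpus and stopword list
reviews = ["The plot was great.", "Great characters, great plot!", "The ending was slow."]
stop = {"the", "was", "a", "and"}
result = top_words(reviews, stop, n=2)
print(result)
```

The resulting list of `(word, count)` pairs can be passed to `pd.DataFrame(..., columns=['Word', 'Frequency'])` in the same way as `top_words` in the cell above.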

## Sentiment Analysis

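### **Sentiment label annotation**

To add labels such as 'positive', 'neutral', or 'negative' to the DataFrame, we would typically go through a sentiment analysis process. In this notebook the labels are instead derived from the star rating (see `assign_sentiment_from_rating` in the next cell). In general, labeling can be done either: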


**1. Manually by reading through each text and assigning a label based on the sentiment conveyed**

For manual labeling, we could create a new column in the DataFrame and insert the labels directly.


**2. Automatically using a sentiment analysis tool or model**

For automatic labeling, we'd typically use a pre-trained sentiment analysis model from a library like NLTK, TextBlob, or through a service like Google Cloud Natural Language API.
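
To make the automatic route concrete, here is a deliberately tiny lexicon-based labeler. It is only a sketch: the word lists and the `lexicon_sentiment` function are hypothetical, and it stands in for, rather than reproduces, what NLTK, TextBlob or a cloud API would do.

```python
import re

# Hypothetical toy lexicons; real tools ship far larger, weighted vocabularies
POSITIVE = {"great", "loved", "fun", "engaging", "wonderful"}
NEGATIVE = {"boring", "weak", "shallow", "disappointing", "slow"}

def lexicon_sentiment(text):
    """Label a text by counting positive and negative lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return 'positive'
    if score < 0:
        return 'negative'
    return 'neutral'

for review in ["A fun and engaging read", "Boring plot, weak science", "I read it in two nights"]:
    print(review, '->', lexicon_sentiment(review))
```

A labeler like this is cheap but brittle (it ignores negation and context), which is one reason a rating-based label can be a reasonable proxy when ratings are available.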

In [None]:
def assign_sentiment_from_rating(rating):
    # Assign sentiment based on rating
    if rating in [4, 5]:
        return 'positive'
    elif rating in [0, 1, 2]:
        return 'negative'
    elif rating == 3:
        return 'neutral'
    else:
        return 'undefined'

In [None]:
data['sentiment'] = data['rating'].apply(assign_sentiment_from_rating)

In [None]:
print(data[['rating', 'sentiment']].tail())

### Preprocess text

In [None]:
data.columns

In [None]:
# Specify columns to delete
columns_to_delete = ['date_added', 'date_updated', 'read_at', 'started_at',
                     'n_votes', 'n_comments', 'votes_bin', 'comments_bin',
                     'rating']

data = data.drop(columns=columns_to_delete)

data.head(10)

In [None]:
###
# common functions
###
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)

def count_param(module, trainable=False):
    if trainable:
        return sum(p.numel() for p in module.parameters() if p.requires_grad)
    else:
        return sum(p.numel() for p in module.parameters())

def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group['lr']

def metrics_to_string(metric_dict):
    string_list = []
    for key, value in metric_dict.items():
        string_list.append('{}:{:.2f}'.format(key, value))
    return ' '.join(string_list)

In [None]:
# Set random seed
set_seed(20052024)

In [None]:
def cleaning_text(text):
    # remove url
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    text = url_pattern.sub(r'', text)

    # remove hashtags
    # only removing the hash # sign from the word
    text = re.sub(r'#', '', text)

    # remove mention handle user (@)
    text = re.sub(r'@[\w]*', ' ', text)

    # remove emojis
    emoji_pattern = re.compile(
        '['
        '\U0001F600-\U0001F64F'  # emoticons
        '\U0001F300-\U0001F5FF'  # symbols & pictographs
        '\U0001F680-\U0001F6FF'  # transport & map symbols
        '\U0001F700-\U0001F77F'  # alchemical symbols
        '\U0001F780-\U0001F7FF'  # Geometric Shapes Extended
        '\U0001F800-\U0001F8FF'  # Supplemental Arrows-C
        '\U0001F900-\U0001F9FF'  # Supplemental Symbols and Pictographs
        '\U0001FA00-\U0001FA6F'  # Chess Symbols
        '\U0001FA70-\U0001FAFF'  # Symbols and Pictographs Extended-A
        '\U00002702-\U000027B0'  # Dingbats
        '\U000024C2-\U0001F251'
        ']+',
        flags=re.UNICODE
    )
    text = emoji_pattern.sub(r'', text)

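    # remove punctuation: map every punctuation character to a space in one pass
    # (raw string avoids the invalid "\," escape warning)
    punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    text = text.translate(str.maketrans(punctuations, ' ' * len(punctuations)))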

    # remove extra whitespace
    text = ' '.join(text.split())

    # lowercase
    text = text.lower()
    return text

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')

# CONSTRUCT STOPWORDS
alir3z4_stopword = "https://github.com/Alir3z4/stop-words/blob/master/english.txt"
iso_stopword = "https://github.com/stopwords-iso/stopwords-en/blob/master/stopwords-en.txt"
bbalet_stopword = "https://github.com/bbalet/stopwords/blob/master/_stopwords.txt"
igorbrigadir_stopword = "https://github.com/igorbrigadir/stopwords/blob/master/en/terrier.txt"
naimdjon_stopword = "https://github.com/naimdjon/stopwords/blob/master/stopwords.txt"
saurabbhsp_stopword = "https://github.com/saurabbhsp/stopwords/blob/master/English.txt"
sanjaalcorps_stopword = "https://github.com/sanjaalcorps/EnglishStopWords/blob/master/stop_words_eng.csv"
cihanhelin_stopword = "https://github.com/cihanhelin/NLTK-s-list-of-english-stopwords/blob/main/NLTK-s-list-of-english-stopwords"
machouz1_stopword = "https://github.com/machouz/stopwords/blob/master/stopwords/stop-words_english_1_en.txt"
machouz2_stopword = "https://github.com/machouz/stopwords/blob/master/stopwords/stop-words_english_2_en.txt"
machouz3_stopword = "https://github.com/machouz/stopwords/blob/master/stopwords/stop-words_english_3_en.txt"
machouz4_stopword = "https://github.com/machouz/stopwords/blob/master/stopwords/stop-words_english_4_google_en.txt"
machouz5_stopword = "https://github.com/machouz/stopwords/blob/master/stopwords/stop-words_english_5_en.txt"
machouz6_stopword = "https://github.com/machouz/stopwords/blob/master/stopwords/stop-words_english_6_en.txt"
nltk_stopword = stopwords.words('english')

# create path url for each stopword
path_stopwords = [alir3z4_stopword, iso_stopword, bbalet_stopword, igorbrigadir_stopword,
                  naimdjon_stopword, saurabbhsp_stopword, sanjaalcorps_stopword, cihanhelin_stopword,
                  machouz1_stopword, machouz2_stopword, machouz3_stopword, machouz4_stopword,
                  machouz5_stopword, machouz6_stopword]

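# combine stopwords
stopwords_l = list(nltk_stopword)
for path in path_stopwords:
    # GitHub "blob" URLs return an HTML page; request the raw file instead
    raw_url = path.replace('https://github.com/', 'https://raw.githubusercontent.com/').replace('/blob/', '/')
    response = requests.get(raw_url, timeout=30)
    if response.ok:
        stopwords_l += response.text.split('\n')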

custom_st = '''
A fun, fast paced science fiction thriller. I read it in 2 nights and couldn't put it down.
The book is about the quantum theory of many worlds which states that all decisions we make
throughout our lives basically create branches,
and that each possible path through the decision tree can be thought of as a parallel world.
And in this book, someone invents a way to switch between these worlds.
This was nicely alluded to/foreshadowed in this quote: \n
"I think about all the choices we've made that created this moment.
Us sitting here together at this beautiful table.
Then I think of all the possible events that could have stopped this moment from ever happening,
and it all feels, I don't know..." "What?" "So fragile."
Now he becomes thoughtful for a moment. He says finally,
"It's terrifying when you consider that every thought we have, every choice we could possibly make,
branches into a new world." \n (view spoiler)
[This book can't be discussed without spoilers. It is a book about choice and regret.
Ever regret not chasing the girl of your dreams so you can focus on your career?
Well Jason2 made that choice and then did regret it. Clearly the author is trying to tell us to optimize for happiness -
to be that second rate physics teacher at a community college if it means you can have a happy life.
I'm being snarky because while there is certainly something to that, you also have to have meaning in your life that comes from within.
I thought the book was a little shallow on this dimension. In fact, all the characters were fairly shallow.
Daniela was the perfect wife. Ryan the perfect antithesis of Jason. Amanda the perfect loyal traveling companion, etc.
This, plus the fact that the book was weak on the science are what led me to take a few stars off -
but I'd still read it again if I could go back in time - was a very fun and engaging read. \n
If you want to really minimize regret, you have to live your life to avoid it in the first place.
Regret can't be hacked, which is kind of the point of the book.
My favorite book about regret is Remains of the Day. I do really like the visualization of the decision tree though - that is a powerful concept. \n
"Every moment, every breath, contains a choice. But life is imperfect. We make the wrong choices. So we end up living in a state of perpetual regret,
and is there anything worse? I built something that could actually eradicate regret. Let you find worlds where you made the right choice."
Daniela says, "Life doesn't work that way. You live with your choices and learn. You don't cheat the system.
'''

# create dictionary with unique stopword
st_words = set(stopwords_l)
custom_stopword = set(custom_st.split())

# result stopwords
stop_words = st_words | custom_stopword
print(f'Stopwords: {list(stop_words)[:5]}')
# remove stopwords
from nltk import word_tokenize, sent_tokenize

def remove_stopword(text, stop_words=stop_words):
    word_tokens = word_tokenize(text)
    filtered_sentence = [w for w in word_tokens if not w in stop_words]
    return ' '.join(filtered_sentence)

In [None]:
# pipeline preprocess
def preprocess(text):
    # cleaning text and lowercase
    output = cleaning_text(text)

    # remove stopwords
    output = remove_stopword(output)

    return output
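
A quick way to sanity-check a cleaning pipeline like this is to run its regex steps on one hand-written string and inspect each stage. The sketch below is a simplified, self-contained stand-in for `cleaning_text` (no emoji or stopword handling), and the sample sentence is made up.

```python
import re

sample = "Loved it!! See https://example.com/review #mustread @bookclub"

no_url = re.sub(r'https?://\S+|www\.\S+', '', sample)   # drop URLs
no_hash = no_url.replace('#', '')                        # keep the hashtag word, drop '#'
no_mention = re.sub(r'@\w*', ' ', no_hash)               # drop @mentions
no_punct = re.sub(r'[^\w\s]', ' ', no_mention)           # punctuation -> space
cleaned = ' '.join(no_punct.split()).lower()             # collapse whitespace, lowercase

print(cleaned)
```

Checking intermediate variables such as `no_url` or `no_mention` makes it easy to see which step is responsible when the final text looks wrong.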

In [None]:
# implement preprocessing

# Copy the datasets
preprocessed_data = data.copy()

# Apply the preprocessing function
preprocessed_data['review_text'] = data['review_text'].map(preprocess)

print(preprocessed_data['review_text'].head())

In [None]:
# Define file paths
csv_file_path = '/content/drive/MyDrive/Bootcamp/Day 33 - Checkpoint 1/preprocessed_data.csv'

# Save the train dataset
df = pd.DataFrame(preprocessed_data)
df.to_csv(csv_file_path, sep=';', index=False, header=True)
print(f'Preprocessed data has been saved to {csv_file_path}')

In [None]:
# load processed dataset into pandas
preprocessed_data = pd.read_csv('/content/drive/MyDrive/Bootcamp/Day 33 - Checkpoint 1/preprocessed_data.csv', sep=';')
preprocessed_data.tail(10)

In [None]:
preprocessed_data[preprocessed_data['sentiment'] == "neutral"]

In [None]:
texts = preprocessed_data['review_text'].tolist()
labels = preprocessed_data['sentiment'].tolist()

### Modeling

##### Load distilbert-base-uncased

In [None]:
# Load the tokenizer and a DistilBERT model with a 3-class classification head
# (a masked-language-model head is not what sentiment classification needs)
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=3)

# Check if GPU is available and move the model to GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)


In [None]:
model

In [None]:
count_param(model)

##### Preparing the Input and Making Predictions

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
file_path = '/content/drive/MyDrive/Bootcamp/Day 33 - Checkpoint 1/preprocessed_data.csv'
df = pd.read_csv(file_path, sep=';')  # the file was saved with sep=';'

# Split the dataset into train, test, and validation sets
train_df, test_valid_df = train_test_split(df, test_size=0.2, random_state=42)
test_df, valid_df = train_test_split(test_valid_df, test_size=0.5, random_state=42)

# Save the splits as TSV files
train_df.to_csv('/content/drive/MyDrive/Bootcamp/Day 33 - Checkpoint 1/train.tsv', sep='\t', index=False)
test_df.to_csv('/content/drive/MyDrive/Bootcamp/Day 33 - Checkpoint 1/test.tsv', sep='\t', index=False)
valid_df.to_csv('/content/drive/MyDrive/Bootcamp/Day 33 - Checkpoint 1/valid.tsv', sep='\t', index=False)

In [None]:
train_dataset_path = '/content/drive/MyDrive/Bootcamp/Day 33 - Checkpoint 1/train.tsv'
valid_dataset_path = '/content/drive/MyDrive/Bootcamp/Day 33 - Checkpoint 1/valid.tsv'
test_dataset_path = '/content/drive/MyDrive/Bootcamp/Day 33 - Checkpoint 1/test.tsv'

##### Set up label mappings

In [None]:
LABEL2INDEX = {'negative': 0, 'neutral': 1, 'positive': 2}
INDEX2LABEL = {0: 'negative', 1: 'neutral', 2: 'positive'}
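
Writing both dictionaries by hand works, but the two can drift apart if a class is added later. One way to avoid that, sketched here with the same three labels, is to derive the reverse mapping from the forward one and round-trip a few labels as a check.

```python
LABEL2INDEX = {'negative': 0, 'neutral': 1, 'positive': 2}

# Derive the inverse mapping instead of typing it a second time
INDEX2LABEL = {index: label for label, index in LABEL2INDEX.items()}

labels = ['positive', 'negative', 'neutral', 'positive']
encoded = [LABEL2INDEX[label] for label in labels]   # text -> class index
decoded = [INDEX2LABEL[index] for index in encoded]  # class index -> text

print(encoded)
print(decoded)
```

The number of entries in `LABEL2INDEX` is also the value the classification head needs for `num_labels`.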

In [None]:
# Load the data from TSV files into pandas DataFrames (the splits were saved with sep='\t')
train_df = pd.read_csv(train_dataset_path, delimiter='\t')
valid_df = pd.read_csv(valid_dataset_path, delimiter='\t')
test_df = pd.read_csv(test_dataset_path, delimiter='\t')

In [None]:
# Convert textual labels to numeric labels using LABEL2INDEX
train_df['label'] = train_df['sentiment'].map(LABEL2INDEX).astype(int)
valid_df['label'] = valid_df['sentiment'].map(LABEL2INDEX).astype(int)
test_df['label'] = test_df['sentiment'].map(LABEL2INDEX).astype(int)

In [None]:
import pandas as pd
from torch.utils.data import Dataset, DataLoader
import torch
from transformers import AutoTokenizer, DistilBertForSequenceClassification
from torch.optim import AdamW  # the AdamW re-export in transformers is deprecated
from tqdm import tqdm
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        # Ensure text is a string and handle NaN values
        if not isinstance(text, str):
            text = str(text)
        if pd.isna(label):
            label = 0  # Default label

        # Tokenize the text
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }


# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# 'label' already holds the integer class index, so it must not be mapped through
# LABEL2INDEX again (that would turn every label into NaN and then into the default 0)
train_dataset = SentimentDataset(train_df['review_text'].tolist(), train_df['label'].tolist(), tokenizer, max_len=512)
valid_dataset = SentimentDataset(valid_df['review_text'].tolist(), valid_df['label'].tolist(), tokenizer, max_len=512)
test_dataset = SentimentDataset(test_df['review_text'].tolist(), test_df['label'].tolist(), tokenizer, max_len=512)

# Dataloaders
train_loader = DataLoader(train_dataset, batch_size=6, shuffle=True, num_workers=4)  # Reduced batch size
valid_loader = DataLoader(valid_dataset, batch_size=6, shuffle=False, num_workers=4)
test_loader = DataLoader(test_dataset, batch_size=6, shuffle=False, num_workers=4)

In [None]:
w2i, i2w = LABEL2INDEX, INDEX2LABEL
print(w2i)
print(i2w)

##### Test model on sample sentences

In [None]:
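# Initialize the tokenizer and model with three output classes (negative, neutral, positive)
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)

# The classification head is still randomly initialized, so these predictions are only a smoke test
model.eval()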

# Example text
text = '1 star highlander historical romance.'

# Tokenize the text
subwords = tokenizer.encode(text, return_tensors='pt').to(model.device)

# Get the logits from the model
output = model(subwords)
logits = output.logits

# Print the shape of logits for debugging
print(f"Shape of logits: {logits.shape}")

# Get the top label and its confidence
topk = torch.topk(logits, k=1, dim=-1)
label = topk[1].squeeze(dim=-1).item()
confidence = F.softmax(logits, dim=-1).squeeze()[label] * 100

# Print the result
print(f'Text: {text} | Label : {i2w[label]} ({confidence:.3f}%)')
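
The confidence printed above is the softmax probability of the highest-scoring class. The sketch below reproduces that arithmetic in plain Python on made-up logits (an assumption for illustration, not model output). Note that before fine-tuning the classification head is randomly initialized, so the reported confidence carries little meaning.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-0.3, 0.1, 1.2]                 # hypothetical scores for negative, neutral, positive
probs = softmax(logits)
label = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
confidence = probs[label] * 100

print(f"label index: {label}, confidence: {confidence:.3f}%")
```

Taking the arg-max of the logits and of the softmax probabilities always selects the same class, since softmax preserves ordering.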

In [None]:
text = 'started interesting thriller woman run mysterious pass story took strange twists turns ends talky talky reveal.'

# Tokenize the text
subwords = tokenizer.encode(text, return_tensors='pt').to(model.device)

# Get the logits from the model
output = model(subwords)
logits = output.logits

# Print the shape of logits for debugging
print(f"Shape of logits: {logits.shape}")

# Get the top label and its confidence
topk = torch.topk(logits, k=1, dim=-1)
label = topk[1].squeeze(dim=-1).item()
confidence = F.softmax(logits, dim=-1).squeeze()[label] * 100

# Print the result
print(f'Text: {text} | Label : {i2w[label]} ({confidence:.3f}%)')


### Fine tuning & evaluation

In [None]:
import shutil

def save_ckp(state, is_best, checkpoint_path, best_model_path):
    f_path = checkpoint_path
    # save checkpoint data to the path given, checkpoint_path
    torch.save(state, f_path)
    # if it is a best model, min validation loss
    if is_best:
        best_fpath = best_model_path
        # copy that checkpoint file to best path given, best_model_path
        shutil.copyfile(f_path, best_fpath)

In [None]:
!nvidia-smi

In [None]:
import torch
torch.cuda.is_available()

In [None]:
import torch
torch.cuda.empty_cache()

In [None]:
optimizer = optim.Adam(model.parameters(), lr=3e-6)
model = model.cuda()

#### Training DistilBERT (distilbert-base-uncased) for Sentiment Analysis

In [None]:
train_df.dropna(subset=['label'], inplace=True)

In [None]:
# Check for NaN values in the DataFrame
print("NaN in texts:", train_df['review_text'].isna().sum())
print("NaN in labels:", train_df['label'].isna().sum())

# Drop rows where any of the necessary columns are NaN
train_df.dropna(subset=['review_text', 'label'], inplace=True)


##### Load the Model and Optimizer

In [None]:
from transformers import AutoTokenizer, DistilBertForSequenceClassification
from torch.optim import AdamW
import torch
from tqdm import tqdm

# Load tokenizer and model (three classes: negative, neutral, positive)
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)
model.to('cuda')

# Initialize the optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Function to get learning rate
def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group['lr']

##### Define forward_sequence_classification Function

In [None]:
# Define the forward function
def forward_sequence_classification(model, batch_data, i2w, device):
    input_ids = batch_data['input_ids'].to(device)
    attention_mask = batch_data['attention_mask'].to(device)
    labels = batch_data['labels'].to(device)

    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs.loss
    logits = outputs.logits

    batch_hyp = torch.argmax(logits, dim=1).cpu().numpy()
    batch_label = labels.cpu().numpy()

    return loss, batch_hyp, batch_label

##### Define metrics calculation functions


In [None]:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, DistilBertForSequenceClassification
from torch.utils.data import DataLoader
from tqdm import tqdm
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import pandas as pd

# Define the metrics calculation functions
def document_sentiment_metrics_fn(predictions, labels):
    accuracy = accuracy_score(labels, predictions)
    # Three classes: average the per-class scores ('binary' only supports two labels)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='macro', zero_division=0)

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
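
With three sentiment classes, precision, recall and F1 are computed per class and then averaged. The sketch below spells out a macro average in plain Python on a four-item toy example; `macro_metrics` is a hypothetical helper for illustration, and a class with no predictions or no true samples is scored as 0.

```python
def macro_metrics(labels, predictions, classes=(0, 1, 2)):
    """Accuracy plus macro-averaged precision, recall and F1."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(labels, predictions))
    for c in classes:
        tp = sum(1 for y, p in pairs if y == c and p == c)
        fp = sum(1 for y, p in pairs if y != c and p == c)
        fn = sum(1 for y, p in pairs if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        'accuracy': sum(1 for y, p in pairs if y == p) / len(pairs),
        'precision': sum(precisions) / n,
        'recall': sum(recalls) / n,
        'f1': sum(f1s) / n,
    }

# Toy example: the single neutral review (class 1) is misclassified as positive (class 2)
metrics = macro_metrics(labels=[0, 1, 2, 2], predictions=[0, 2, 2, 2])
print(metrics)
```

Accuracy here is 0.75, yet macro F1 is only 0.6 because the missed neutral class counts as much as the others; on an imbalanced rating distribution this gap is what makes macro averages informative.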


metrics to string functions

In [None]:
def metrics_to_string(metrics):
    return "Acc: {:.4f}, Prec: {:.4f}, Rec: {:.4f}, F1: {:.4f}".format(
        metrics['accuracy'],
        metrics['precision'],
        metrics['recall'],
        metrics['f1']
    )


##### Training and evaluation loop

In [None]:
# Training loop
n_epochs = 10
for epoch in range(n_epochs):
    model.train()
    torch.set_grad_enabled(True)

    total_train_loss = 0
    list_hyp, list_label = [], []

    train_pbar = tqdm(train_loader, leave=True, total=len(train_loader))
    for i, batch_data in enumerate(train_pbar):
        # Forward model
        loss, batch_hyp, batch_label = forward_sequence_classification(model, batch_data, i2w=i2w, device='cuda')

        # Update model
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        tr_loss = loss.item()
        total_train_loss += tr_loss

        # Calculate metrics
        list_hyp += list(batch_hyp)
        list_label += list(batch_label)

        train_pbar.set_description("(Epoch {}) TRAIN LOSS:{:.4f} LR:{:.8f}".format(
            (epoch+1), total_train_loss/(i+1), get_lr(optimizer)
        ))

    # Calculate train metric
    metrics = document_sentiment_metrics_fn(list_hyp, list_label)
    print("(Epoch {}) TRAIN LOSS:{:.4f} {} LR:{:.8f}".format(
        (epoch+1), total_train_loss/(i+1), metrics_to_string(metrics), get_lr(optimizer)
    ))

    # Evaluate on validation
    model.eval()
    torch.set_grad_enabled(False)

    total_loss, total_correct, total_labels = 0, 0, 0
    list_hyp, list_label = [], []

    pbar = tqdm(valid_loader, leave=True, total=len(valid_loader))
    for i, batch_data in enumerate(pbar):
        loss, batch_hyp, batch_label = forward_sequence_classification(model, batch_data, i2w=i2w, device='cuda')

        # Calculate total loss
        valid_loss = loss.item()
        total_loss += valid_loss

        # Calculate evaluation metrics
        list_hyp += list(batch_hyp)
        list_label += list(batch_label)
        metrics = document_sentiment_metrics_fn(list_hyp, list_label)

        pbar.set_description("VALID LOSS:{:.4f} {}".format(total_loss/(i+1), metrics_to_string(metrics)))

    metrics = document_sentiment_metrics_fn(list_hyp, list_label)
    print("(Epoch {}) VALID LOSS:{:.4f} {}".format((epoch+1), total_loss/(i+1), metrics_to_string(metrics)))


In [None]:
# Evaluate on test
model.eval()
torch.set_grad_enabled(False)

list_hyp = []

pbar = tqdm(test_loader, leave=True, total=len(test_loader))
for i, batch_data in enumerate(pbar):
    _, batch_hyp, _ = forward_sequence_classification(model, batch_data, i2w=i2w, device='cuda')
    list_hyp += list(batch_hyp)

# Save predictions to a file
df = pd.DataFrame({'label': list_hyp}).reset_index()
df.to_csv('pred.txt', index=False)

print(df)

In [None]:
torch.save(model, '/content/drive/MyDrive/Bootcamp/Day 33 - Checkpoint 1/distilbert_base_multilingual_model.pth')

##### Test fine-tuned model on sample sentences

In [None]:
# Reuse the fine-tuned model that is still in memory; reloading 'distilbert-base-uncased'
# from the Hub here would discard the fine-tuned weights
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

model.eval()

# Example text
text = '1 star highlander historical romance.'

# Tokenize the text
subwords = tokenizer.encode(text, return_tensors='pt').to(model.device)

# Get the logits from the model
output = model(subwords)
logits = output.logits

# Print the shape of logits for debugging
print(f"Shape of logits: {logits.shape}")

# Get the top label and its confidence
topk = torch.topk(logits, k=1, dim=-1)
label = topk[1].squeeze(dim=-1).item()
confidence = F.softmax(logits, dim=-1).squeeze()[label] * 100

# Print the result
print(f'Text: {text} | Label : {i2w[label]} ({confidence:.3f}%)')

In [None]:
text = 'started interesting thriller woman run mysterious pass story took strange twists turns ends talky talky reveal.'

# Tokenize the text
subwords = tokenizer.encode(text, return_tensors='pt').to(model.device)

# Get the logits from the model
output = model(subwords)
logits = output.logits

# Print the shape of logits for debugging
print(f"Shape of logits: {logits.shape}")

# Get the top label and its confidence
topk = torch.topk(logits, k=1, dim=-1)
label = topk[1].squeeze(dim=-1).item()
confidence = F.softmax(logits, dim=-1).squeeze()[label] * 100

# Print the result
print(f'Text: {text} | Label : {i2w[label]} ({confidence:.3f}%)')


In [None]:
text = 'An interesting retelling of Snow White.'

# Tokenize the text
subwords = tokenizer.encode(text, return_tensors='pt').to(model.device)

# Get the logits from the model
output = model(subwords)
logits = output.logits

# Print the shape of logits for debugging
print(f"Shape of logits: {logits.shape}")

# Get the top label and its confidence
topk = torch.topk(logits, k=1, dim=-1)
label = topk[1].squeeze(dim=-1).item()
confidence = F.softmax(logits, dim=-1).squeeze()[label] * 100

# Print the result
print(f'Text: {text} | Label : {i2w[label]} ({confidence:.3f}%)')
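
The confidence printed in the three cells above is the softmax probability of the highest logit, scaled to a percentage. As a minimal, dependency-free sketch of that arithmetic (the helper name `top_label_with_confidence` and the example logits are illustrative assumptions, not output of the fine-tuned model):

```python
import math

def top_label_with_confidence(logits):
    """Return (index of the largest logit, its softmax probability in percent)."""
    # Subtract the maximum before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the most probable class (first one wins on ties).
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label] * 100

# Hypothetical logits for a two-class head, chosen only for illustration.
label, confidence = top_label_with_confidence([1.0, 3.0])
print(f"Label index: {label} ({confidence:.3f}%)")
```

In the notebook the same values come from `F.softmax(logits, dim=-1)` and `torch.topk`; the index still has to be mapped through `i2w` to obtain the label.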


#### Hyperparameter Optimization

In [None]:
!pip install optuna
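
The search in the next cell draws the learning rate on a log scale between 1e-5 and 5e-5. As a rough illustration of what log-scale sampling means, here is a standard-library sketch. It is not Optuna's sampler (Optuna's default is TPE rather than independent random draws), and `sample_log_uniform` is an illustrative name:

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniformly distributed on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(1e-5, 5e-5, rng) for _ in range(1000)]

# On a log scale, about half of the draws fall below the geometric mean of the bounds.
geometric_mean = math.sqrt(1e-5 * 5e-5)
share_below = sum(s < geometric_mean for s in samples) / len(samples)
print(f"min={min(samples):.2e} max={max(samples):.2e} share below geometric mean={share_below:.2f}")
```

Sampling this way spends as many trials between 1e-5 and roughly 2.2e-5 as between 2.2e-5 and 5e-5, which suits learning rates because their effect tends to scale multiplicatively.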

In [None]:
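import optuna

# Define the objective function for Optuna
def objective(trial):
    # Define hyperparameters to be tuned
    # (suggest_loguniform is deprecated; suggest_float with log=True is its replacement)
    lr = trial.suggest_float('lr', 1e-5, 5e-5, log=True)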
    batch_size = trial.suggest_categorical('batch_size', [4, 6, 8])

    # Initialize the model
    model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
    model.to('cuda')

    # Initialize the optimizer
    optimizer = AdamW(model.parameters(), lr=lr)

    # Dataloaders with the new batch size
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=4)
    valid_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False, num_workers=4)

    # Training loop
    n_epochs = 3  # Use a smaller number of epochs for hyperparameter tuning
    for epoch in range(n_epochs):
        model.train()
        torch.set_grad_enabled(True)

        total_train_loss = 0
        list_hyp, list_label = [], []

        train_pbar = tqdm(train_loader, leave=True, total=len(train_loader))
        for i, batch_data in enumerate(train_pbar):
            # Forward model
            loss, batch_hyp, batch_label = forward_sequence_classification(model, batch_data, i2w=i2w, device='cuda')

            # Update model
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            tr_loss = loss.item()
            total_train_loss += tr_loss

            # Calculate metrics
            list_hyp += list(batch_hyp)
            list_label += list(batch_label)

            train_pbar.set_description("(Epoch {}) TRAIN LOSS:{:.4f} LR:{:.8f}".format(
                (epoch+1), total_train_loss/(i+1), lr
            ))

        # Evaluate on validation
        model.eval()
        torch.set_grad_enabled(False)

        total_loss, total_correct, total_labels = 0, 0, 0
        list_hyp, list_label = [], []

        pbar = tqdm(valid_loader, leave=True, total=len(valid_loader))
        for i, batch_data in enumerate(pbar):
            loss, batch_hyp, batch_label = forward_sequence_classification(model, batch_data, i2w=i2w, device='cuda')

            # Calculate total loss
            valid_loss = loss.item()
            total_loss += valid_loss

            # Calculate evaluation metrics
            list_hyp += list(batch_hyp)
            list_label += list(batch_label)
            metrics = document_sentiment_metrics_fn(list_hyp, list_label)

            pbar.set_description("VALID LOSS:{:.4f} {}".format(total_loss/(i+1), metrics_to_string(metrics)))

    # Return the main metric to optimize
    metrics = document_sentiment_metrics_fn(list_hyp, list_label)
    return metrics['f1']

# Set up the Optuna study
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=10)

# Print best hyperparameters
print("Best hyperparameters: ", study.best_params)

# Save the study
study.trials_dataframe().to_csv('/content/drive/MyDrive/Bootcamp/Day 33 - Checkpoint 1/optuna_study.csv')

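# Rebuild a model and optimizer with the best hyperparameters.
# Note: Optuna does not keep the weights trained inside each trial, so best_model
# holds only the pre-trained DistilBERT weights at this point. Re-run the training
# loop with study.best_params before saving, or the saved file will contain a model
# that has not been fine-tuned.
best_model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
best_model.to('cuda')
best_optimizer = AdamW(best_model.parameters(), lr=study.best_params['lr'])
torch.save(best_model, '/content/drive/MyDrive/Bootcamp/Day 33 - Checkpoint 1/best_distilbert_model.pth')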
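
Because each trial builds and trains its own model inside `objective`, the trained weights are gone once the trial returns. One way to hold on to them, sketched here under assumptions, is to copy the weights whenever a trial beats the best score so far. The `BestStateKeeper` class is hypothetical and not part of Optuna or this notebook; plain dictionaries stand in for `model.state_dict()` so the sketch runs without a GPU, and the scores are made up.

```python
import copy

class BestStateKeeper:
    """Hypothetical helper: keeps a deep copy of the highest-scoring state."""

    def __init__(self):
        self.best_score = float("-inf")
        self.best_state = None

    def update(self, score, state):
        """Store a copy of `state` if `score` beats the best so far; return True if stored."""
        if score > self.best_score:
            self.best_score = score
            # Deep copy so later training steps cannot mutate the stored weights.
            self.best_state = copy.deepcopy(state)
            return True
        return False

keeper = BestStateKeeper()

# Stand-ins for (validation F1, model.state_dict()) from three trials; values are made up.
fake_trials = [(0.61, {"w": [0.1]}), (0.74, {"w": [0.2]}), (0.70, {"w": [0.3]})]
for score, state in fake_trials:
    keeper.update(score, state)

print(keeper.best_score, keeper.best_state)
```

Inside `objective`, the call would be `keeper.update(metrics['f1'], model.state_dict())` just before the `return`. Calling `torch.save(keeper.best_state, path)` after `study.optimize` would then write fine-tuned weights rather than a freshly loaded model.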
