# **RoBERTa base model for movie spoiler text classification**
<br>

## **References**

[1] RoBERTa base model https://huggingface.co/roberta-base <br>

[2] RoBERTa: A Robustly Optimized BERT Pretraining Approach https://arxiv.org/abs/1907.11692

## **Dataset**

[IMDB Spoiler Dataset](https://www.kaggle.com/datasets/rmisra/imdb-spoiler-dataset).

The description provided on Kaggle:

*This dataset is collected from IMDB. It contains meta-data about items as well as user reviews with information regarding whether a review contains a spoiler or not.*

### In this notebook we evaluate the RoBERTa model on a lightly processed, equally sampled 10% subset of the original data


[1] https://huggingface.co/docs/transformers/model_doc/roberta

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

These cells import the third-party libraries scikit-learn, pandas, and NumPy, along with os and shutil from the Python standard library.



In [None]:
import os
import shutil

The code below imports several libraries for natural language processing (NLP) and data visualization.
  * The **'transformers'** library provides pre-trained models and tokenizers for NLP tasks.
  * From **'transformers'** we import **'AutoTokenizer'** to tokenize the text and **'TFAutoModelForSequenceClassification'** to load a pre-trained model with a sequence classification head.
  * The **'json'** library reads and writes JSON data.
  * The **'matplotlib.pyplot'** library is imported as **'plt'** to create plots and charts.
  * The **'random'** library generates random numbers and shuffles data.
  * The **'seaborn'** library is imported as **'sn'** for statistical plots built on top of 'matplotlib'.
  * The **'tensorflow'** library is imported as **'tf'** to build and train deep learning models.


In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import json
import matplotlib.pyplot as plt
import random
import seaborn as sn
import tensorflow as tf

To load the processed data, we mount Google Drive in the notebook and read the **lightly processed, equally sampled 10% subset of the original data**. The data is then shuffled.

In [None]:
data = pd.read_json("drive/MyDrive/filtered_reviews.json")
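Row order matters here because the validation split further down takes the first rows of the dataset. The following is a minimal sketch of a shuffle that keeps each text paired with its label, using only the standard library. The toy lists and the seed are assumptions for illustration, not the notebook's data.

```python
import random

# Toy stand-ins for the review texts and spoiler flags (hypothetical values).
toy_texts = ["review a", "review b", "review c", "review d"]
toy_labels = [True, False, True, False]

# Shuffle one list of row positions, then apply it to both columns so that
# every text stays paired with its own label.
rng = random.Random(42)  # fixed seed (an assumption) for reproducibility
order = list(range(len(toy_texts)))
rng.shuffle(order)

shuffled_texts = [toy_texts[i] for i in order]
shuffled_labels = [toy_labels[i] for i in order]
```

On a pandas DataFrame, `data.sample(frac=1, random_state=42)` would achieve a comparable row shuffle.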

The AutoTokenizer class from the Hugging Face transformers library is used to create a tokenizer. Specifically, we load the tokenizer that belongs to the pretrained roberta-base checkpoint.

Calling AutoTokenizer.from_pretrained("roberta-base") initializes the tokenizer with the vocabulary and settings of that checkpoint. The tokenizer transforms text inputs into sequences of integer token ids that the model can process.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("roberta-base") #Tokenizer

In [None]:
texts = list(data["review_text"])

In [None]:
texts

["This movie is so cliché, melodramatic, and cheesy. The characters are so terribly one dimensional. Also, they are not at all steeped in reality. Joe Fox is supposed to be some cutthroat businessman, yet you never see even an inkling of that in Hanks' performance. Meg Ryan is over the top cutesy to the point of inducing nausea. This is a business savvy New Yorker? She seems to barely have the wherewithal to tie her shoes let alone own a store. The supporting characters are weak and completely waste the talents of Parker Posey, Greg Kinnear and Steve Zahn (all awesome actors). Even Dave Chapelle sucked, but to be fair he had nothing to work with and had no real character at all. What a waste!Perhaps it is the on screen charisma of Meg Ryan and Tom Hanks that made me watch this inanity to the end. It's bizarre to me that Meg Ryan was nominated for a Golden Globe for this. Her acting strength lies in When a Man Loves and Woman. Anyway, this movie makes for a decent rainy Sunday. I wouldn

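We then apply the tokenizer to the review texts and store the result as TensorFlow tensors. With padding=True every review is padded to the length of the longest one in the batch, and with truncation=True reviews longer than the model's maximum input length are cut off.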

In [None]:
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='tf') #Tokenized text
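To make the effect of `padding=True` and `truncation=True` concrete, here is a toy sketch on plain lists of token ids. It is not the Hugging Face implementation: the helper `pad_and_truncate` and the token ids are hypothetical, the pad id of 1 is assumed to match roberta-base, and the real tokenizer also preserves the closing special token when truncating, which this sketch does not.

```python
# Toy illustration of padding and truncation on lists of token ids.
PAD_ID = 1  # assumed pad token id for roberta-base


def pad_and_truncate(batch, max_length):
    """Truncate each sequence to max_length, then pad to the longest one."""
    # Simplification: a plain slice, without re-attaching the end token.
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[0, 713, 2], [0, 713, 1569, 16, 2], [0, 100, 2]]  # made-up ids
encoded = pad_and_truncate(toy_batch, max_length=4)
```

The attention mask is what lets the model tell real tokens from padding, which is why the tokenizer returns it alongside the ids.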

We convert the True/False labels to 1/0 so that they can be used as numeric training targets.

In [None]:
labels = list(data["is_spoiler"])
categories=sorted(list(set(labels))) #set will return the unique different entries
n_categories=len(categories)

def indicize_labels(labels):
    """Transforms labels into integer indices (their position in `categories`)"""
    indices=[]
    for j in range(len(labels)):
        for i in range(n_categories):
            if labels[j]==categories[i]:
                indices.append(i)
    return indices
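The nested loop above compares every label against every category. A dictionary lookup gives the same mapping in a single pass. This is a minimal sketch of that alternative, with a hypothetical helper `build_label_index` and made-up labels.

```python
def build_label_index(labels):
    """Map each distinct label to its position in sorted order."""
    categories = sorted(set(labels))
    return {category: index for index, category in enumerate(categories)}


toy_labels = [True, False, False, True]  # hypothetical spoiler flags
label_to_index = build_label_index(toy_labels)
toy_indices = [label_to_index[label] for label in toy_labels]
```

Because `False` sorts before `True`, non-spoilers map to 0 and spoilers to 1, the same result the loop produces.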

In [None]:
batch_size=8

In [None]:
indices = indicize_labels(labels)

We create a TensorFlow dataset from the inputs and indices, then split it into **30% validation** and **70% training** subsets: take assigns the first 30% of the examples to validation and skip assigns the rest to training. Both subsets are batched with **batch size 8**, and prefetching is applied to the training subset.




In [None]:
dataset=tf.data.Dataset.from_tensor_slices((dict(inputs), indices)) #Create a tensorflow dataset
#train/validation split, we use the first 30% of the data for validation
val_data_size=int(0.3*len(indices))
val_ds=dataset.take(val_data_size).batch(batch_size, drop_remainder=True) 
train_ds=dataset.skip(val_data_size).batch(batch_size, drop_remainder=True)
train_ds = train_ds.prefetch(buffer_size=tf.data.AUTOTUNE)
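The split arithmetic can be checked without TensorFlow. The sketch below mirrors the `take`/`skip` sizes and the effect of `drop_remainder=True`. The helper `split_sizes` and the example count of 1000 are assumptions for illustration, not the size of the actual dataset.

```python
def split_sizes(n_examples, val_fraction, batch_size):
    """Return example and batch counts for a take/skip split."""
    val_size = int(val_fraction * n_examples)
    train_size = n_examples - val_size
    return {
        "val_examples": val_size,
        "train_examples": train_size,
        # drop_remainder=True discards the final partial batch.
        "val_batches": val_size // batch_size,
        "train_batches": train_size // batch_size,
    }


sizes = split_sizes(n_examples=1000, val_fraction=0.3, batch_size=8)
```

With `drop_remainder=True`, up to `batch_size - 1` examples at the end of each subset are never seen, which is negligible for large datasets but worth knowing for small ones.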

We initialize a model for sequence classification from the pretrained **"roberta-base"** checkpoint. With num_labels=1 the classification head outputs a single logit per review.


---


RoBERTa (Robustly Optimized BERT Pretraining Approach) is a language model that builds upon the architecture and pretraining methods of BERT (Bidirectional Encoder Representations from Transformers). It was introduced by Facebook AI Research in 2019 as a refinement of BERT to achieve better performance on a wide range of natural language processing (NLP) tasks.

RoBERTa addresses some of the limitations of BERT by optimizing its training methodology. It uses a larger training corpus and removes the next sentence prediction objective, which helps improve the model's generalization capabilities. RoBERTa also introduces dynamic masking during pretraining: a new masking pattern is sampled each time a sequence is fed to the model, instead of fixing the masks once during preprocessing, which leads to better representation learning.
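Dynamic masking can be sketched in a few lines. This is an illustration, not the pretraining code. It assumes the 15% selection rate with the 80/10/10 replacement rule from the BERT paper. The mask id and vocabulary size are assumed to match roberta-base, the token ids are made up, and `-100` is used as a common convention for positions the loss ignores.

```python
import random

MASK_ID = 50264     # assumed <mask> id in the roberta-base vocabulary
VOCAB_SIZE = 50265  # assumed roberta-base vocabulary size


def dynamic_mask(token_ids, rng, mask_prob=0.15):
    """Return (masked_ids, labels) with a freshly sampled masking pattern."""
    masked = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 marks positions the loss ignores
    for i, token in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue
        labels[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID  # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: leave the token unchanged
    return masked, labels


rng = random.Random(0)
sentence = list(range(100, 130))  # made-up token ids
# Each call draws a new pattern, which is what "dynamic" means here; static
# masking would compute the pattern once and reuse it in every epoch.
epoch_1, labels_1 = dynamic_mask(sentence, rng)
epoch_2, labels_2 = dynamic_mask(sentence, rng)
```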

In [None]:
model = TFAutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=1)

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We use **BinaryCrossentropy** as the loss and **BinaryAccuracy** as the accuracy metric. Because the model outputs raw logits, the loss is created with from_logits=True.



In [None]:
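loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
# The model outputs raw logits, so the decision threshold is 0 (sigmoid(0) = 0.5)
# rather than the default 0.5, which assumes probabilities.
metrics = tf.keras.metrics.BinaryAccuracy(threshold=0.0)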
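For intuition about what the loss computes on a single logit, here is a minimal pure-Python sketch of binary cross-entropy from logits in its numerically stable form. The helper names and input values are illustrative, and this is not the Keras source.

```python
import math


def bce_from_logit(logit, label):
    """Numerically stable binary cross-entropy for one logit and a 0/1 label."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def sigmoid(logit):
    """Convert a logit into a probability."""
    return 1.0 / (1.0 + math.exp(-logit))


confident_right = bce_from_logit(4.0, 1)  # small loss
confident_wrong = bce_from_logit(4.0, 0)  # large loss
undecided = bce_from_logit(0.0, 1)        # equals log(2)
```

A logit of 0 corresponds to a probability of 0.5, so when predictions are logits the natural decision boundary for accuracy is 0 rather than 0.5.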

The model is compiled with the Adam optimizer at a learning rate of 1e-5, gradient clipping (a clipnorm of 1.0), the BinaryCrossentropy loss, and the BinaryAccuracy metric.

We then train the model for 6 epochs on the training and validation datasets defined above.

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5, clipnorm=1.), loss=loss, metrics=metrics)
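The `clipnorm=1.` argument rescales any gradient tensor whose L2 norm exceeds 1 and leaves smaller gradients untouched. This is a minimal sketch of the rule on a plain list, with a hypothetical helper `clip_by_norm` and made-up gradient values.

```python
import math


def clip_by_norm(gradient, clipnorm):
    """Rescale a gradient vector so its L2 norm does not exceed clipnorm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= clipnorm:
        return list(gradient)
    scale = clipnorm / norm
    return [g * scale for g in gradient]


large = clip_by_norm([3.0, 4.0], clipnorm=1.0)  # norm 5, rescaled to norm 1
small = clip_by_norm([0.3, 0.4], clipnorm=1.0)  # norm 0.5, left unchanged
```

Clipping preserves the direction of the update and only limits its size, which helps keep fine-tuning stable at the start when the new classifier head produces large gradients.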

In [None]:
history=model.fit(train_ds, validation_data=val_ds, epochs=6, verbose=1)

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6

Training hit the 12-hour timeout on Google Colab Pro, so the model could not be saved.

However, we can infer that the model was overfitting: validation accuracy did not improve even as training accuracy rose. We conclude that the data was likely insufficient for training and that we ran into compute limits while using the RoBERTa model.
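Stalled validation accuracy is the signal that early stopping acts on. The following is a minimal sketch of the patience rule in plain Python. The accuracy values are made up for illustration and are not results from this notebook, and `early_stopping_epoch` is a hypothetical helper. Keras provides comparable behavior through `tf.keras.callbacks.EarlyStopping`.

```python
def early_stopping_epoch(val_accuracy, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    epochs_without_gain = 0
    for epoch, score in enumerate(val_accuracy, start=1):
        if score > best:
            best = score
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                return epoch
    return None


# Made-up validation accuracies, one per epoch (not measured values).
made_up_val_accuracy = [0.60, 0.64, 0.65, 0.65, 0.64, 0.63]
stop_at = early_stopping_epoch(made_up_val_accuracy, patience=2)
```

Stopping as soon as validation accuracy plateaus also shortens the run, which matters when a session has a hard time limit.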

In [None]:
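# model.save() writes a SavedModel directory at the path exactly as given
model.save("robert_unprocessed_5_epochs.tf")
import shutil
# Archive the directory created above (the root_dir must include the ".tf" suffix)
shutil.make_archive("robert_5_epochs", 'zip', "robert_unprocessed_5_epochs.tf")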
