# MLOps Assignment 1
# Text Classification - **Training**

## 1. Data Loading 

### Load the Dataset from Hugging Face using the `datasets` library

In [None]:
# Load the IMDB Movie Reviews dataset
from datasets import load_dataset
dataset = load_dataset('imdb')

In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [3]:
dataset['train'][:5]

{'text': ['I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far b

In [4]:
dataset['test'][24995:25000]

{'text': ['Just got around to seeing Monster Man yesterday. It had been a long wait and after lots of anticipation and build up, I\'m glad to say that it came through and met my expectations on every level. True, you really can\'t expect too much from hearing the plot rundown, but after reading some of the reviews for it, I was ecstatic. I mean, what trash fan wouldn\'t want to see a gore flick about a deranged inbred hick mowing people down with his make-shift monster truck? I went in expecting a cross between Road Trip and The Hills Have Eyes and got so much more. This was a horror comedy that actually worked. The film makers got it right when it came to making you squirm and making you howl with laughter at the same time. Kudos to Michael Davis for going all out with the gore and pushing the envelope with the sickass humor. Let me list just a few reasons why I love this movie so much: First off is the story. It\'s been done to death in so many other flicks. A college guy gets wind t

### Split the dataset into training, validation, and test sets 

The dataset already ships with `train` and `test` splits, plus an `unsupervised` split that this notebook does not use. We hold out 20% of `train` as a `validation` set and leave `test` untouched.

In [5]:
# Split the dataset into training, validation, and test sets
from sklearn.model_selection import train_test_split

X = dataset['train']['text']
y = dataset['train']['label']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
X_test = dataset['test']['text']
y_test = dataset['test']['label']

# Print sizes of the splits
print(f"Train set size: {len(X_train)}")
print(f"Validation set size: {len(X_val)}")
print(f"Test set size: {len(X_test)}")

train = {'text': X_train, 'labels': y_train}
val = {'text': X_val, 'labels': y_val}
test = {'text': X_test, 'labels': y_test}

Train set size: 20000
Validation set size: 5000
Test set size: 25000
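
`train_test_split` as called above draws a random, non-stratified sample, so the class proportions in the train and validation sets can drift slightly from the 50/50 balance of the original IMDB `train` split. A minimal sketch for checking this follows; `label_fractions` is a hypothetical helper that is not part of the notebook, and the toy label list stands in for `y_train` or `y_val`.

```python
from collections import Counter

def label_fractions(labels):
    """Return the share of each class label as a dict keyed by label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}

# Toy labels standing in for y_train / y_val (assumption: labels are 0 or 1)
toy_labels = [0, 1, 1, 0, 1, 1, 0, 1]
print(label_fractions(toy_labels))
```

If the fractions differ noticeably between splits, passing `stratify=y` to `train_test_split` keeps the class ratio the same in both.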


## 2. Data Preprocessing

### Preprocess the text data

In [None]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')

In [7]:
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Function to remove punctuation from text
def remove_punctuation(text):
    # Escape the punctuation characters so each one is matched literally inside the character class
    return re.sub(f"[{re.escape(string.punctuation)}]", "", str(text))

# Function to remove URLs from text
def remove_urls(text):
    return re.sub(r'http[s]?://\S+', '', text)

# Function to convert the text into lower case
def lower_case(text):
    return text.lower()

# Function to lemmatize text
wordnet_lemmatizer = WordNetLemmatizer()  # created once and reused across calls

def lemmatize(text):
    tokens = word_tokenize(text)
    # Lemmatize each token and append a space after every word (including the last one)
    return ''.join(wordnet_lemmatizer.lemmatize(w) + ' ' for w in tokens)

In [8]:
import pandas as pd

# Apply preprocessing steps to 'text' column in train
series = pd.Series(train['text'])
series = series.apply(remove_urls)
series = series.apply(remove_punctuation)
series = series.apply(lower_case)
series = series.apply(lemmatize)

train['text'] = series.to_list()

In [9]:
train['text'][:5]

['i borrowed this movie despite it extremely low rating because i wanted to see how the crew manages to animate the presence of multiple world a a matter of fact they didnt at least so it seems some cameo appearance cut rather clumsily into the movie thats it this is what the majority of viewer think however the surprise come at the end and unfortunately then when probably most of the viewer have already stopped this movie i wa also astonished when i saw that the brazilianportuguese title of this movie mean voyage into death this is the spoilerbr br that this movie is about a young girl who go alone onto this boat on reason that are completely unclear you understand only in the last 5 minute when you start the movie with the english title haunted boat in your head you clearly think that the cameo appearance of strange figure are the ghost but in reality this movie is not like most other horror movie told from the distant writerwatcher perspective who can at almost any time differentiat
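
The sample above contains fragments such as `spoilerbr br`: the raw reviews include `<br />` line-break tags, and removing punctuation leaves the letters `br` behind as spurious tokens. One possible fix, sketched here under the assumption that `<br>` variants are the only markup in the reviews, is to replace them with a space before `remove_punctuation` runs. `remove_html_breaks` is a hypothetical helper and is not part of the pipeline that produced the outputs in this notebook.

```python
import re

# Matches <br>, <br/>, and <br /> in any letter case
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

def remove_html_breaks(text):
    """Replace each HTML line-break tag with a single space."""
    return BR_TAG.sub(" ", text)

sample = "this is the spoiler<br /><br />that this movie"
print(remove_html_breaks(sample))
```

In the notebook's pipeline this step would go first, for example `series.apply(remove_html_breaks)` ahead of `series.apply(remove_urls)`.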

In [10]:
# Apply preprocessing steps to 'text' column in val
series = pd.Series(val['text'])
series = series.apply(remove_urls)
series = series.apply(remove_punctuation)
series = series.apply(lower_case)
series = series.apply(lemmatize)

val['text'] = series.to_list()

In [11]:
val['text'][:5]

['dumb is a dumb doe in this thoroughly uninteresting supposed black comedy essentially what start out a chris klein trying to maintain a low profile eventually morphs into an uninspired version of the three amigo only without any laugh in order for black comedy to work it must be outrageous which play dead is not in order for black comedy to work it can not be mean spirited which play dead is what play dead really is is a town full of nut job fred dunst doe however do a pretty fair imitation of billy bob thornton character from a simple plan while jake busey doe a pretty fair imitation of well jake busey merk ',
 'i dug out from my garage some old musical and this is another one of my favorite it wa written by jay alan lerner and directed by vincent minelli it won two academy award for best picture of 1951 and best screenplay the story of an american painter in paris who try to make it big nina foch is a sophisticated lady of mean and is very interested in helping him but soon find sh

In [12]:
# Apply preprocessing steps to 'text' column in test
series = pd.Series(test['text'])
series = series.apply(remove_urls)
series = series.apply(remove_punctuation)
series = series.apply(lower_case)
series = series.apply(lemmatize)

test['text'] = series.to_list()

In [13]:
test['text'][:5]

['i love scifi and am willing to put up with a lot scifi moviestv are usually underfunded underappreciated and misunderstood i tried to like this i really did but it is to good tv scifi a babylon 5 is to star trek the original silly prosthetics cheap cardboard set stilted dialogue cg that doesnt match the background and painfully onedimensional character can not be overcome with a scifi setting im sure there are those of you out there who think babylon 5 is good scifi tv it not it clichéd and uninspiring while u viewer might like emotion and character development scifi is a genre that doe not take itself seriously cf star trek it may treat important issue yet not a a serious philosophy it really difficult to care about the character here a they are not simply foolish just missing a spark of life their action and reaction are wooden and predictable often painful to watch the maker of earth know it rubbish a they have to always say gene roddenberrys earth otherwise people would not conti

### Tokenize the text data

In [14]:
from transformers import AutoTokenizer

# Load the BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [15]:
# Tokenize the text data
def tokenize_data(data):
    return tokenizer(data['text'], return_tensors='pt', padding=True, truncation=True)

train_tokenized = tokenize_data(train)
val_tokenized = tokenize_data(val)
test_tokenized = tokenize_data(test)

In [16]:
print("train_tokenized[0] =", train_tokenized[0])
print("val_tokenized[0] =", val_tokenized[0])
print("test_tokenized[0] =", test_tokenized[0])

train_tokenized[0] = Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
val_tokenized[0] = Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
test_tokenized[0] = Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
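
Each encoding above has 512 tokens because `truncation=True` cuts long reviews to the model's maximum length and `padding=True` pads shorter ones to the longest sequence in the batch. The toy sketch below mimics that behaviour on plain Python lists to show how `input_ids` and `attention_mask` relate. It is an illustration, not the tokenizer's implementation: `pad_and_truncate` and the small `max_len` are made up for the example, while the pad ID `0` matches the zeros visible in the tensors printed in this notebook.

```python
def pad_and_truncate(batch, max_len, pad_id=0):
    """Truncate each sequence to max_len, then pad all to the longest remaining length."""
    truncated = [ids[:max_len] for ids in batch]
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy token IDs; real BERT IDs come from the tokenizer vocabulary.
# Unlike the real tokenizer, this toy truncation does not keep the closing [SEP] (102).
batch = [[101, 7, 8, 9, 10, 11, 102], [101, 5, 102]]
encoded = pad_and_truncate(batch, max_len=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```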


### Convert the tokenized data into a format suitable for training

In [None]:
import torch
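# Convert tokenized data into a plain dict of tensors.
# The tokenizer already returns PyTorch tensors (return_tensors='pt'), so copy them
# with clone().detach() instead of re-wrapping them with torch.tensor()
def convert_to_tensors(data):
    return {key: tensor.clone().detach() for key, tensor in data.items()}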

train_tensors = convert_to_tensors(train_tokenized)
val_tensors = convert_to_tensors(val_tokenized)
test_tensors = convert_to_tensors(test_tokenized)

In [18]:
train_tensors

{'input_ids': tensor([[  101,   178, 12214,  ...,  1804,  1139,   102],
         [  101,  1170,  1103,  ...,     0,     0,     0],
         [  101,  1113,  1103,  ..., 22523,  5892,   102],
         ...,
         [  101,   178,  1138,  ...,     0,     0,     0],
         [  101,  1355,  1106,  ...,     0,     0,     0],
         [  101,   178,  1148,  ...,     0,     0,     0]]),
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 1, 1, 1],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]])}

In [19]:
val_tensors

{'input_ids': tensor([[  101, 14908,  1110,  ...,     0,     0,     0],
         [  101,   178,  8423,  ...,     0,     0,     0],
         [  101,  1170,  2903,  ...,     0,     0,     0],
         ...,
         [  101,  1115,  1110,  ...,     0,     0,     0],
         [  101,  1142,  1110,  ...,     0,     0,     0],
         [  101,  1165,   178,  ...,     0,     0,     0]]),
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]])}

In [20]:
test_tensors

{'input_ids': tensor([[ 101,  178, 1567,  ...,    0,    0,    0],
         [ 101, 3869, 1103,  ...,    0,    0,    0],
         [ 101, 1122,  170,  ...,    0,    0,    0],
         ...,
         [ 101,  178, 1400,  ...,    0,    0,    0],
         [ 101, 1421, 2517,  ...,    0,    0,    0],
         [ 101,  178, 2347,  ...,    0,    0,    0]]),
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]])}

### Create data loaders for the training, validation, and test sets

In [21]:
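from torch.utils.data import DataLoader, TensorDataset

# Wrap each split in a TensorDataset so the loader indexes examples (rows) rather than
# dict keys, and so every batch carries its labels
def make_dataset(tensors, labels):
    return TensorDataset(tensors['input_ids'], tensors['attention_mask'], torch.tensor(labels))

# Create data loaders
train_loader = DataLoader(make_dataset(train_tensors, train['labels']), batch_size=32, shuffle=True)
val_loader = DataLoader(make_dataset(val_tensors, val['labels']), batch_size=32)
test_loader = DataLoader(make_dataset(test_tensors, test['labels']), batch_size=32)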

## 3. Model Training

### Naive Bayes Classifier

In [22]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score

# Convert PyTorch tensors to NumPy arrays
X_train_numpy = train_tensors['input_ids'].numpy()
y_train = train['labels']

X_val_numpy = val_tensors['input_ids'].numpy()
y_val = val['labels']

# Reshape the data to match Naive Bayes' input requirements
X_train_flattened = X_train_numpy.reshape(X_train_numpy.shape[0], -1)
X_val_flattened = X_val_numpy.reshape(X_val_numpy.shape[0], -1)

# Initialize and train the Naive Bayes classifier
naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(X_train_flattened, y_train)

# Predictions
y_pred_train = naive_bayes_classifier.predict(X_train_flattened)
y_pred_val = naive_bayes_classifier.predict(X_val_flattened)

# Calculate accuracy and weighted F1 score
train_accuracy = accuracy_score(y_train, y_pred_train)
val_accuracy = accuracy_score(y_val, y_pred_val)
train_f1_score = f1_score(y_train, y_pred_train, average='weighted')
val_f1_score = f1_score(y_val, y_pred_val, average='weighted')

print(f"Train Accuracy: {train_accuracy}")
print(f"Validation Accuracy: {val_accuracy}")
print(f"Train F1 Score: {train_f1_score}")
print(f"Validation F1 Score: {val_f1_score}")

Train Accuracy: 0.5139
Validation Accuracy: 0.5028
Train F1 Score: 0.48023454074045346
Validation F1 Score: 0.4728513912854243


## 4. Model Evaluation

### Evaluate the trained model on the test set

In [23]:
# Convert test data to NumPy arrays
X_test_numpy = test_tensors['input_ids'].numpy()
y_test = test['labels']

# Reshape the test data
X_test_flattened = X_test_numpy.reshape(X_test_numpy.shape[0], -1)

# Predictions on the test set
y_pred_test = naive_bayes_classifier.predict(X_test_flattened)

# Calculate accuracy and weighted F1 score on the test set
test_accuracy = accuracy_score(y_test, y_pred_test)
test_f1_score = f1_score(y_test, y_pred_test, average='weighted')

print(f"Test Accuracy: {test_accuracy}")
print(f"Test F1 Score: {test_f1_score}")

Test Accuracy: 0.50316
Test F1 Score: 0.46676893255214535


## 5. Model Deployment

### Save the trained model (Pickle)

In [24]:
import pickle

# Define the file path where you want to save the trained model
model_file_path = "naive_bayes_model.pkl"

# Save the trained Naive Bayes classifier to a file
with open(model_file_path, 'wb') as file:
    pickle.dump(naive_bayes_classifier, file)

print(f"Trained model saved to {model_file_path}")

Trained model saved to naive_bayes_model.pkl
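
To use the saved model later (for example in a serving script), it can be loaded back with `pickle.load`. The sketch below round-trips a placeholder object through a temporary file so that it runs on its own; in the notebook, the object would be `naive_bayes_classifier` and the path `naive_bayes_model.pkl`. `save_model` and `load_model` are hypothetical helper names.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    """Serialize a Python object (e.g. a fitted classifier) to disk."""
    with open(path, "wb") as file:
        pickle.dump(model, file)

def load_model(path):
    """Load a pickled object. Only unpickle files that come from a trusted source."""
    with open(path, "rb") as file:
        return pickle.load(file)

# Placeholder standing in for the fitted classifier (assumption for this sketch)
placeholder_model = {"name": "naive_bayes", "classes": [0, 1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "naive_bayes_model.pkl")
    save_model(placeholder_model, model_path)
    restored_model = load_model(model_path)

print(restored_model == placeholder_model)
```

Pickled scikit-learn models are tied to the library version that wrote them, so loading with the same scikit-learn version used for training is the safer choice.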
