In [228]:
import pandas as pd
import numpy as np

In [229]:
df = pd.read_csv("IMDB Dataset.csv", chunksize=5)

In [230]:
df = df.get_chunk()

In [231]:
df.shape

(5, 2)

In [232]:
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


### Preprocessing (Tokenization & Vocabulary Building)

In [233]:
import nltk
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

In [234]:
# Download required resources
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kumar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kumar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [235]:
# Initialize stopwords and stemmer
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

In [236]:
# tokenize
def tokenize(text):
  text = text.lower()
  text = text.replace('?','')
  text = text.replace("'","")
  return text.split()


def preprocess(text):
    # Convert text to lowercase
    text = text.lower()
    
    # Remove non-alphabetic characters (punctuation, etc.)
    text = re.sub(r'\W', ' ', text)
    
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # Remove stopwords and apply stemming
    tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]
    
    return tokens

In [237]:
text1 = df["review"][1]

In [238]:
text1

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [239]:
text1.lower()

'a wonderful little production. <br /><br />the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />the actors are extremely well chosen- michael sheen not only "has got all the polari" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master\'s of comedy and his life. <br /><br />the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell\'s murals decorating every surface) are terribly well d

In [240]:
text1.replace('?','')

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [241]:
text1.replace("'","")

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great masters of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional dream techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwells murals decorating every surface) are terribly well done.'

In [242]:
text1.split()

['A',
 'wonderful',
 'little',
 'production.',
 '<br',
 '/><br',
 '/>The',
 'filming',
 'technique',
 'is',
 'very',
 'unassuming-',
 'very',
 'old-time-BBC',
 'fashion',
 'and',
 'gives',
 'a',
 'comforting,',
 'and',
 'sometimes',
 'discomforting,',
 'sense',
 'of',
 'realism',
 'to',
 'the',
 'entire',
 'piece.',
 '<br',
 '/><br',
 '/>The',
 'actors',
 'are',
 'extremely',
 'well',
 'chosen-',
 'Michael',
 'Sheen',
 'not',
 'only',
 '"has',
 'got',
 'all',
 'the',
 'polari"',
 'but',
 'he',
 'has',
 'all',
 'the',
 'voices',
 'down',
 'pat',
 'too!',
 'You',
 'can',
 'truly',
 'see',
 'the',
 'seamless',
 'editing',
 'guided',
 'by',
 'the',
 'references',
 'to',
 "Williams'",
 'diary',
 'entries,',
 'not',
 'only',
 'is',
 'it',
 'well',
 'worth',
 'the',
 'watching',
 'but',
 'it',
 'is',
 'a',
 'terrificly',
 'written',
 'and',
 'performed',
 'piece.',
 'A',
 'masterful',
 'production',
 'about',
 'one',
 'of',
 'the',
 'great',
 "master's",
 'of',
 'comedy',
 'and',
 'his',
 'li

In [243]:
tokenize(text1)

['a',
 'wonderful',
 'little',
 'production.',
 '<br',
 '/><br',
 '/>the',
 'filming',
 'technique',
 'is',
 'very',
 'unassuming-',
 'very',
 'old-time-bbc',
 'fashion',
 'and',
 'gives',
 'a',
 'comforting,',
 'and',
 'sometimes',
 'discomforting,',
 'sense',
 'of',
 'realism',
 'to',
 'the',
 'entire',
 'piece.',
 '<br',
 '/><br',
 '/>the',
 'actors',
 'are',
 'extremely',
 'well',
 'chosen-',
 'michael',
 'sheen',
 'not',
 'only',
 '"has',
 'got',
 'all',
 'the',
 'polari"',
 'but',
 'he',
 'has',
 'all',
 'the',
 'voices',
 'down',
 'pat',
 'too!',
 'you',
 'can',
 'truly',
 'see',
 'the',
 'seamless',
 'editing',
 'guided',
 'by',
 'the',
 'references',
 'to',
 'williams',
 'diary',
 'entries,',
 'not',
 'only',
 'is',
 'it',
 'well',
 'worth',
 'the',
 'watching',
 'but',
 'it',
 'is',
 'a',
 'terrificly',
 'written',
 'and',
 'performed',
 'piece.',
 'a',
 'masterful',
 'production',
 'about',
 'one',
 'of',
 'the',
 'great',
 'masters',
 'of',
 'comedy',
 'and',
 'his',
 'life

In [244]:
preprocess(text1)

['wonder',
 'littl',
 'product',
 'br',
 'br',
 'film',
 'techniqu',
 'unassum',
 'old',
 'time',
 'bbc',
 'fashion',
 'give',
 'comfort',
 'sometim',
 'discomfort',
 'sens',
 'realism',
 'entir',
 'piec',
 'br',
 'br',
 'actor',
 'extrem',
 'well',
 'chosen',
 'michael',
 'sheen',
 'got',
 'polari',
 'voic',
 'pat',
 'truli',
 'see',
 'seamless',
 'edit',
 'guid',
 'refer',
 'william',
 'diari',
 'entri',
 'well',
 'worth',
 'watch',
 'terrificli',
 'written',
 'perform',
 'piec',
 'master',
 'product',
 'one',
 'great',
 'master',
 'comedi',
 'life',
 'br',
 'br',
 'realism',
 'realli',
 'come',
 'home',
 'littl',
 'thing',
 'fantasi',
 'guard',
 'rather',
 'use',
 'tradit',
 'dream',
 'techniqu',
 'remain',
 'solid',
 'disappear',
 'play',
 'knowledg',
 'sens',
 'particularli',
 'scene',
 'concern',
 'orton',
 'halliwel',
 'set',
 'particularli',
 'flat',
 'halliwel',
 'mural',
 'decor',
 'everi',
 'surfac',
 'terribl',
 'well',
 'done']

In [245]:
## Tokenizing the Reviews

# Apply preprocessing to the review column
df['tokens'] = df['review'].apply(preprocess)

In [246]:
df.head()

Unnamed: 0,review,sentiment,tokens
0,One of the other reviewers has mentioned that ...,positive,"[one, review, mention, watch, 1, oz, episod, h..."
1,A wonderful little production. <br /><br />The...,positive,"[wonder, littl, product, br, br, film, techniq..."
2,I thought this was a wonderful way to spend ti...,positive,"[thought, wonder, way, spend, time, hot, summe..."
3,Basically there's a family where a little boy ...,negative,"[basic, famili, littl, boy, jake, think, zombi..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"[petter, mattei, love, time, money, visual, st..."


In [247]:
## Building the Vocabulary

vocab = {'<UNK>': 0}  # Add <UNK> for unknown words

# Function to build the vocabulary
def build_vocab(row):
    tokens = row['tokens']# Get tokenized review
    
    for token in tokens:  # Iterate through tokens in the review
        if token not in vocab:  # If token is not already in vocab
            vocab[token] = len(vocab)  # Assign it a unique index

    # Add the sentiment labels ("positive" and "negative") to the vocabulary
    # sentiment_labels = ["positive", "negative"]
    # Add the sentiment label to the vocab (it's a single label, not a list of labels)
    # sentiment = row["sentiment"]
    # if sentiment not in vocab:
    #     vocab[sentiment] = len(vocab)

In [248]:
# Apply the function to each row of the dataframe
df.apply(build_vocab, axis=1)

0    None
1    None
2    None
3    None
4    None
dtype: object

In [249]:
vocab

{'<UNK>': 0,
 'one': 1,
 'review': 2,
 'mention': 3,
 'watch': 4,
 '1': 5,
 'oz': 6,
 'episod': 7,
 'hook': 8,
 'right': 9,
 'exactli': 10,
 'happen': 11,
 'br': 12,
 'first': 13,
 'thing': 14,
 'struck': 15,
 'brutal': 16,
 'unflinch': 17,
 'scene': 18,
 'violenc': 19,
 'set': 20,
 'word': 21,
 'go': 22,
 'trust': 23,
 'show': 24,
 'faint': 25,
 'heart': 26,
 'timid': 27,
 'pull': 28,
 'punch': 29,
 'regard': 30,
 'drug': 31,
 'sex': 32,
 'hardcor': 33,
 'classic': 34,
 'use': 35,
 'call': 36,
 'nicknam': 37,
 'given': 38,
 'oswald': 39,
 'maximum': 40,
 'secur': 41,
 'state': 42,
 'penitentari': 43,
 'focus': 44,
 'mainli': 45,
 'emerald': 46,
 'citi': 47,
 'experiment': 48,
 'section': 49,
 'prison': 50,
 'cell': 51,
 'glass': 52,
 'front': 53,
 'face': 54,
 'inward': 55,
 'privaci': 56,
 'high': 57,
 'agenda': 58,
 'em': 59,
 'home': 60,
 'mani': 61,
 'aryan': 62,
 'muslim': 63,
 'gangsta': 64,
 'latino': 65,
 'christian': 66,
 'italian': 67,
 'irish': 68,
 'scuffl': 69,
 'death': 

In [250]:
"""
Each word in the dataset will be assigned a unique index, and the <UNK> token is reserved for words
that were not seen during training (helpful for dealing with out-of-vocabulary words during prediction).
"""

'\nEach word in the dataset will be assigned a unique index, and the <UNK> token is reserved for words\nthat were not seen during training (helpful for dealing with out-of-vocabulary words during prediction).\n'

In [251]:
len(vocab)

382

In [252]:
## Convert Text to Indices Using

def text_to_indices(text, vocab):
    indexed_text = []  # Initialize an empty list to store numerical indices

    # Tokenize the input text
    for token in preprocess(text):  # Iterate over each token (word) in the tokenized text
        if token in vocab:  # If the token is in the vocabulary
            indexed_text.append(vocab[token])  # Append the corresponding index from vocab
        else:  # If the token is not in the vocabulary
            indexed_text.append(vocab['<UNK>'])  # Append the index for the <UNK> token

    return indexed_text  # Return the list of indices

In [253]:
# Apply the function to the 'review' column
df['encoded_review'] = df['review'].apply(lambda x: text_to_indices(x, vocab))

In [254]:
df

Unnamed: 0,review,sentiment,tokens,encoded_review
0,One of the other reviewers has mentioned that ...,positive,"[one, review, mention, watch, 1, oz, episod, h...","[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 13..."
1,A wonderful little production. <br /><br />The...,positive,"[wonder, littl, product, br, br, film, techniq...","[136, 137, 138, 12, 12, 139, 140, 141, 142, 14..."
2,I thought this was a wonderful way to spend ti...,positive,"[thought, wonder, way, spend, time, hot, summe...","[200, 136, 201, 202, 143, 203, 204, 205, 206, ..."
3,Basically there's a family where a little boy ...,negative,"[basic, famili, littl, boy, jake, think, zombi...","[264, 265, 137, 266, 267, 268, 269, 270, 271, ..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"[petter, mattei, love, time, money, visual, st...","[304, 305, 238, 143, 306, 307, 308, 139, 4, 30..."


In [255]:
text_to_indices("one way ajay", vocab)

[1, 201, 0]

In [256]:
mapping_dict = {'positive': 1, 'negative': 0}
df['Sentiment'] = df['sentiment'].map(mapping_dict)

In [257]:
df

Unnamed: 0,review,sentiment,tokens,encoded_review,Sentiment
0,One of the other reviewers has mentioned that ...,positive,"[one, review, mention, watch, 1, oz, episod, h...","[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 13...",1
1,A wonderful little production. <br /><br />The...,positive,"[wonder, littl, product, br, br, film, techniq...","[136, 137, 138, 12, 12, 139, 140, 141, 142, 14...",1
2,I thought this was a wonderful way to spend ti...,positive,"[thought, wonder, way, spend, time, hot, summe...","[200, 136, 201, 202, 143, 203, 204, 205, 206, ...",1
3,Basically there's a family where a little boy ...,negative,"[basic, famili, littl, boy, jake, think, zombi...","[264, 265, 137, 266, 267, 268, 269, 270, 271, ...",0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"[petter, mattei, love, time, money, visual, st...","[304, 305, 238, 143, 306, 307, 308, 139, 4, 30...",1


In [258]:
len(df["encoded_review"][1])

92

In [259]:
len(df["encoded_review"][2])

89

In [260]:
from torch.utils.data import Dataset, DataLoader

In [278]:
class MovieDataset(Dataset):

  def __init__(self, df, vocab):
    self.df = df
    self.vocab = vocab

  def __len__(self):
    return self.df.shape[0]

  def __getitem__(self, index):
    numerical_review = text_to_indices(self.df.iloc[index]['review'], self.vocab)
    sentiment_label  = self.df.iloc[index]['Sentiment']

    # Map the sentiment labels to numeric values (assuming 'positive' -> 1 and 'negative' -> 0)
    if sentiment_label == "positive":
        sentiment = torch.tensor([1])  # Positive sentiment
    else:
        sentiment = torch.tensor([0])  # Negative sentiment

    return torch.tensor(numerical_review), sentiment

In [279]:
dataset = MovieDataset(df, vocab)

In [280]:
dataset

<__main__.MovieDataset at 0x25c12119c70>

In [281]:
dataloader = DataLoader(dataset, batch_size=1, shuffle=True)

In [282]:
for review, sentiment in dataloader:
  print(review, sentiment)

tensor([[200, 136, 201, 202, 143, 203, 204, 205, 206, 207, 208, 209,   4, 210,
          26, 176, 211, 212, 213, 214, 215, 216, 217, 117, 218, 219, 220, 221,
         127, 222, 223, 224, 225, 226, 227, 228, 200, 229, 230, 231, 232, 233,
         234, 235,  61, 236, 237, 238,  12,  12, 239,   1, 230, 176, 240,  86,
          80, 241,  76, 242, 243, 244, 245, 246, 247, 248, 249,   9, 250, 251,
         252, 253,  12,  12, 127, 254, 255, 256, 257, 258, 259, 260, 261, 262,
         175, 176,  22, 162, 263]]) tensor([[0]])
tensor([[264, 265, 137, 266, 267, 268, 269, 270, 271, 272, 143,  12,  12, 273,
         274, 275, 276, 277, 267, 278, 128, 279, 114, 269,  12,  12, 280,  13,
          22, 281, 139, 282, 278, 283, 284, 284, 273, 285, 271, 286, 287, 288,
         289, 177, 267, 270, 290, 291, 139, 292, 162, 293, 294, 273, 295,   4,
         284, 296, 283, 297,  12,  12, 298, 299, 117, 187, 271, 300, 301, 302,
         267, 303]]) tensor([[0]])
tensor([[304, 305, 238, 143, 306, 307, 308, 13

In [283]:
import torch.nn as nn

In [284]:
class SimpleRNN(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        
        self.embeddings = nn.Embedding(vocab_size, embedding_dim=50)
        self.rnn = nn.RNN(50, 64, batch_first=True)
        self.fc = nn.Linear(64, 2)

    def forward(self, review):
        embedding_review = self.embeddings(review)
        hidden, final = self.rnn(embedding_review)
        output = self.fc(final.squeeze(0))

        return output

In [285]:
x = nn.Embedding(384, embedding_dim=50)
y = nn.RNN(50, 64, batch_first=True)
z = nn.Linear(64, 2)

a = dataset[1][0].unsqueeze(0)
print("shape of a:", a.shape)
b = x(a)
print("shape of b:", b.shape)
c, d = y(b)
print("shape of c:", c.shape)
print("shape of d:", d.shape)

e = z(d.squeeze(0))

print("shape of e:", e.shape)

shape of a: torch.Size([1, 92])
shape of b: torch.Size([1, 92, 50])
shape of c: torch.Size([1, 92, 64])
shape of d: torch.Size([1, 1, 64])
shape of e: torch.Size([1, 2])


In [286]:
learning_rate = 0.001
epochs = 20

In [287]:
model = SimpleRNN(len(vocab))

In [288]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [289]:
# training loop

for epoch in range(epochs):

  total_loss = 0

  for review, sentiment in dataloader:

    sentiment = sentiment.squeeze(0).long()

    print(f"Sentiment: {sentiment}, Shape: {sentiment.shape}")

    print("Unique sentiment labels:", sentiment.unique())

    optimizer.zero_grad()

    # forward pass
    output = model(review)

    print("Output shape:", output.shape)

    # loss -> output shape (1,324) - (1)
    loss = criterion(output, sentiment)

    # gradients
    loss.backward()

    # update
    optimizer.step()

    total_loss = total_loss + loss.item()

  print(f"Epoch: {epoch+1}, Loss: {total_loss:4f}")

Sentiment: tensor([0]), Shape: torch.Size([1])
Unique sentiment labels: tensor([0])
Output shape: torch.Size([1, 2])
Sentiment: tensor([0]), Shape: torch.Size([1])
Unique sentiment labels: tensor([0])
Output shape: torch.Size([1, 2])
Sentiment: tensor([0]), Shape: torch.Size([1])
Unique sentiment labels: tensor([0])
Output shape: torch.Size([1, 2])
Sentiment: tensor([0]), Shape: torch.Size([1])
Unique sentiment labels: tensor([0])
Output shape: torch.Size([1, 2])
Sentiment: tensor([0]), Shape: torch.Size([1])
Unique sentiment labels: tensor([0])
Output shape: torch.Size([1, 2])
Epoch: 1, Loss: 2.699081
Sentiment: tensor([0]), Shape: torch.Size([1])
Unique sentiment labels: tensor([0])
Output shape: torch.Size([1, 2])
Sentiment: tensor([0]), Shape: torch.Size([1])
Unique sentiment labels: tensor([0])
Output shape: torch.Size([1, 2])
Sentiment: tensor([0]), Shape: torch.Size([1])
Unique sentiment labels: tensor([0])
Output shape: torch.Size([1, 2])
Sentiment: tensor([0]), Shape: torch.Si

In [301]:
def predict(model, question, threshold=1.2):

  # convert question to numbers
  numerical_question = text_to_indices(question, vocab)

  # tensor
  question_tensor = torch.tensor(numerical_question).unsqueeze(0)

  # send to model
  output = model(question_tensor)

  # convert logits to probs
  probs = torch.nn.functional.softmax(output, dim=1)

  # find index of max prob
  value, index = torch.max(probs, dim=1)

  if value < threshold:
    print("I don't know")

  print(list(vocab.keys())[index])

In [302]:
predict(model, "One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.")

I don't know
<UNK>
