### Importing Dependencies 
pip install -r requirements.txt
`Sequential` is used to build deep learning models: it is the API for stacking layers one after another in a linear stack.

model = Sequential([
    Dense(64, activation='relu', input_shape=(32,)),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])

In [53]:
import os 
import json

from zipfile import ZipFile
import pandas as pd 
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential 
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input


- `Dense` is the fully connected layer.
- `Embedding` will be the first layer in the LSTM network.
- `LSTM` is the LSTM layer itself.

The input gate controls the flow of new information into the cell state, while the forget gate controls the flow of information that is no longer relevant. The output gate controls the flow of information from the cell state to the output of the unit.
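
The gate behaviour described above can be sketched for a single scalar unit. This is a minimal illustration in plain Python, not the Keras implementation: the function name `lstm_step` and the weight values are assumptions chosen only to make the arithmetic visible.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step for a single scalar unit.

    `w` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        wx, wh, b = w[name]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to admit
    g = math.tanh(pre("candidate"))   # candidate values to write into the cell
    o = sigmoid(pre("output"))        # how much of the cell state to expose

    c = f * c_prev + i * g            # updated cell state
    h = o * math.tanh(c)              # output of the unit
    return h, c


# Hypothetical weights, chosen only for illustration.
weights = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.6, 0.2, 0.0),
    "candidate": (0.9, 0.3, 0.0),
    "output": (0.4, 0.1, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, weights)
    print(f"x={x:+.2f}  h={h:+.4f}  c={c:+.4f}")
```

In the real layer each gate is a vector of 128 values and the weights are matrices, but the update rule has the same shape.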

### Data collection - Kaggle API 

Note: The following steps outline how to download the file directly from Kaggle. Alternatively, you can use the zip file uploaded in the datasets folder.

In [54]:
# Use a context manager so the credentials file is closed after reading
with open("datasets/kaggle.json") as f:
    kaggle_dict = json.load(f)

In [55]:
kaggle_dict.keys()

dict_keys(['username', 'key'])

Set up the Kaggle credentials as environment variables.


In [56]:
os.environ["KAGGLE_USERNAME"] = kaggle_dict["username"]
os.environ["KAGGLE_KEY"] = kaggle_dict["key"]
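
The credential steps above can also be wrapped in one small helper that checks the file has the expected fields before exporting them. This is a sketch under assumptions: `load_kaggle_credentials` is a hypothetical name, and the demo writes placeholder values into a temporary file and a plain dictionary instead of touching real credentials.

```python
import json
import os
import tempfile
from pathlib import Path


def load_kaggle_credentials(path, environ=os.environ):
    """Read a kaggle.json-style file and export its values into `environ`."""
    with Path(path).open(encoding="utf-8") as f:
        creds = json.load(f)

    # Fail early with a clear message if the file is incomplete.
    missing = {"username", "key"} - creds.keys()
    if missing:
        raise KeyError(f"missing fields in {path}: {sorted(missing)}")

    environ["KAGGLE_USERNAME"] = creds["username"]
    environ["KAGGLE_KEY"] = creds["key"]
    return creds["username"]


# Demo with placeholder values; a plain dict stands in for os.environ.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "kaggle.json"
    demo_path.write_text(json.dumps({"username": "demo-user", "key": "demo-key"}))
    demo_env = {}
    user = load_kaggle_credentials(demo_path, environ=demo_env)

print(user, sorted(demo_env))
```

In the notebook the call would be `load_kaggle_credentials("datasets/kaggle.json")`, which writes to `os.environ` by default.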

In [57]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews -p datasets

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
imdb-dataset-of-50k-movie-reviews.zip: Skipping, found more recently modified local copy (use --force to force download)


Unzip the dataset file.

In [58]:
zip_file_path = "datasets/imdb-dataset-of-50k-movie-reviews.zip"
extract_to_folder = "datasets"  

with ZipFile(zip_file_path, "r") as zip_file:
    zip_file.extractall(extract_to_folder)


### Load the dataset

In [59]:
df = pd.read_csv("datasets/IMDB Dataset.csv")

In [60]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [61]:
df.shape


(50000, 2)

In [62]:
df['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

The data is perfectly balanced, so we don't have to worry about class imbalance.

#### Copy data 

In [63]:
df_copy = df.copy()

#### Encode sentiment 

In [64]:
df_copy.replace({"sentiment": {"positive": 1, "negative": 0}}, inplace=True)



In [65]:
df_copy.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |


#### Split data into training and test data 

In [66]:
train_data, test_data = train_test_split(df_copy, test_size=0.2, random_state=42)

### Data Preprocessing 

#### Tokenize text data 
- `Tokenizer(num_words=5000)` restricts the vocabulary to the most frequent words in the training data (Keras keeps the top `num_words - 1`). Rarer words are dropped when texts are converted to sequences, which reduces the dimensionality of the input and keeps the focus on the most relevant words.
- `fit_on_texts` learns the vocabulary of the training dataset and assigns an integer index to each unique word, starting from 1 (0 is reserved for padding).
- `pad_sequences` makes all sequences the same length. Here, sequences longer than 200 tokens are truncated and shorter ones are padded with zeros; by default, both truncation and padding happen at the beginning of the sequence.

In [67]:
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(train_data["review"])
X_train = pad_sequences(tokenizer.texts_to_sequences(train_data["review"]), maxlen=200)
X_test = pad_sequences(tokenizer.texts_to_sequences(test_data["review"]), maxlen=200)
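
To make the tokenize-and-pad step concrete, here is a simplified plain-Python sketch of the same idea. It is not the Keras implementation: the helper names are hypothetical, and it only lowercases and splits on whitespace, whereas the Keras `Tokenizer` also filters punctuation.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent, 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


def pad_pre(sequences, maxlen):
    """Left-pad with zeros; long sequences keep only their last `maxlen` tokens."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in sequences]


train_texts = ["the movie was great", "the movie was bad", "great acting"]
word_index = fit_word_index(train_texts)

# With num_words=5 only indices 1..4 survive, so "bad" and "acting" are dropped.
sequences = texts_to_sequences(["the acting was great", "bad movie"], word_index, num_words=5)
padded = pad_pre(sequences, maxlen=4)
print(padded)  # [[0, 1, 3, 4], [0, 0, 0, 2]]
```

The leading zeros in the printed `X_train` and `X_test` arrays come from the same left-padding behaviour.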

In [68]:
print(X_train)

[[1935    1 1200 ...  205  351 3856]
 [   3 1651  595 ...   89  103    9]
 [   0    0    0 ...    2  710   62]
 ...
 [   0    0    0 ... 1641    2  603]
 [   0    0    0 ...  245  103  125]
 [   0    0    0 ...   70   73 2062]]


In [69]:
print(X_test)

[[   0    0    0 ...  995  719  155]
 [  12  162   59 ...  380    7    7]
 [   0    0    0 ...   50 1088   96]
 ...
 [   0    0    0 ...  125  200 3241]
 [   0    0    0 ... 1066    1 2305]
 [   0    0    0 ...    1  332   27]]


In [70]:
Y_train = train_data["sentiment"]
Y_test = test_data["sentiment"]

In [71]:
print(Y_train)

39087    0
30893    0
45278    1
16398    0
13653    0
        ..
11284    1
44732    1
38158    0
860      1
15795    1
Name: sentiment, Length: 40000, dtype: int64


In [72]:
print(Y_test)

33553    1
9427     1
199      0
12447    1
39489    0
        ..
28567    0
25079    1
18707    1
15200    0
5857     1
Name: sentiment, Length: 10000, dtype: int64


### LSTM - Long Short-Term Memory

#### Build the model
- The `Embedding` layer maps each word index to a dense 128-dimensional vector, so the input is represented in vector form.
- The `Dense` layer is the output layer: it receives the 128 outputs of the LSTM layer and produces a single value through a sigmoid activation.

In [73]:
model = Sequential()
model.add(Input(shape=(200,)))  # Specify the input shape
model.add(Embedding(input_dim=5000, output_dim=128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation="sigmoid"))

In [74]:
model.summary()
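
The summary output is not reproduced here, but the trainable parameter counts can be worked out by hand from the layer sizes. The sketch below assumes the standard LSTM parameterization with a bias term (four gate blocks, each with input weights, recurrent weights, and a bias vector).

```python
vocab_size, embed_dim, units = 5000, 128, 128

# One 128-dimensional vector per vocabulary index.
embedding_params = vocab_size * embed_dim

# Four gate blocks, each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * (units * (embed_dim + units) + units)

# One weight per LSTM output plus a single bias.
dense_params = units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 640000 131584 129 771713
```

Most of the parameters sit in the embedding table, which is why the vocabulary cap of 5,000 words matters for model size.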

In [75]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
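
For intuition about the `binary_crossentropy` loss, here is a minimal plain-Python sketch of the formula. The function name and the clipping constant `eps` are assumptions for illustration, not the Keras internals.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0]
confident_right = binary_crossentropy(labels, [0.9, 0.1])
confident_wrong = binary_crossentropy(labels, [0.1, 0.9])
print(f"right: {confident_right:.4f}  wrong: {confident_wrong:.4f}")
```

Confident correct predictions give a loss near zero, while confident wrong predictions are penalized heavily, which is what pushes the sigmoid output toward the true label during training.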

#### Training the model 

In [76]:
model.fit(X_train, Y_train, epochs=5, batch_size=64, validation_split=0.2)

Epoch 1/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 63s 125ms/step - accuracy: 0.7074 - loss: 0.5529 - val_accuracy: 0.8374 - val_loss: 0.3891
Epoch 2/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 63s 125ms/step - accuracy: 0.8484 - loss: 0.3635 - val_accuracy: 0.8453 - val_loss: 0.3558
Epoch 3/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 66s 132ms/step - accuracy: 0.8510 - loss: 0.3518 - val_accuracy: 0.7975 - val_loss: 0.4326
Epoch 4/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 76s 152ms/step - accuracy: 0.8870 - loss: 0.2827 - val_accuracy: 0.8571 - val_loss: 0.3371
Epoch 5/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 72s 143ms/step - accuracy: 0.8956 - loss: 0.2575 - val_accuracy: 0.8755 - val_loss: 0.3241


<keras.src.callbacks.history.History at 0x7fd8c60458e0>

#### Model Evaluation 

In [77]:
loss, accuracy = model.evaluate(X_test, Y_test)
print(f"Test Loss: {loss}")
print(f"Test Accuracy: {accuracy}")

313/313 ━━━━━━━━━━━━━━━━━━━━ 8s 25ms/step - accuracy: 0.8759 - loss: 0.3115
Test Loss: 0.3073367178440094
Test Accuracy: 0.8794000148773193
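
The step counts in the training and evaluation logs (500 and 313) follow from the dataset sizes and batch sizes. The short calculation below assumes that `evaluate` uses its default batch size of 32, since no `batch_size` is passed in the call above.

```python
import math

train_rows, test_rows = 40_000, 10_000

# validation_split=0.2 holds out 20% of the training rows for validation.
fit_rows = train_rows * 80 // 100

steps_per_epoch = math.ceil(fit_rows / 64)   # batch_size=64 in model.fit
eval_steps = math.ceil(test_rows / 32)       # assumed default batch size in model.evaluate

print(fit_rows, steps_per_epoch, eval_steps)  # 32000 500 313
```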


#### Building a predictive system

In [78]:
def predict_sentiment(review):
  # tokenize and pad the review
  sequence = tokenizer.texts_to_sequences([review])
  padded_sequence = pad_sequences(sequence, maxlen=200)
  prediction = model.predict(padded_sequence)
  sentiment = "positive" if prediction[0][0] > 0.5 else "negative"
  return sentiment
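
`predict_sentiment` reduces the sigmoid output to a label with a fixed 0.5 threshold. If the strength of the prediction is also of interest, the mapping can be separated into a small helper like the sketch below; `label_with_confidence` is a hypothetical name, and the probabilities in the demo are made-up values, not model outputs.

```python
def label_with_confidence(prob, threshold=0.5):
    """Map a sigmoid probability to (label, confidence in that label)."""
    if prob > threshold:
        return "positive", prob
    return "negative", 1.0 - prob


# Made-up probabilities, for illustration only.
for p in [0.93, 0.51, 0.08]:
    label, confidence = label_with_confidence(p)
    print(f"p={p:.2f} -> {label} ({confidence:.2f})")
```

A score close to the threshold, such as 0.51, still produces a label but with low confidence, which is useful to surface for mixed reviews.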

In [79]:
# example usage
new_review = "This movie was fantastic. I loved it."
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 178ms/step
The sentiment of the review is: positive


In [80]:
# example usage
new_review = "This movie was not that good"
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step
The sentiment of the review is: negative


In [81]:
# example usage
new_review = "This movie was ok but not that good."
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step
The sentiment of the review is: negative

