# Data loading and cleaning

**Import dependencies**

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences



**Load data**

In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")

print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/imdb-dataset-of-50k-movie-reviews


In [3]:
data = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')

**Discover and clean data**

In [4]:
data.shape

(50000, 2)

In [5]:
data.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [6]:
data['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

The data is balanced: 25,000 positive and 25,000 negative reviews.

In [7]:
# Encode the labels as integers: positive -> 1, negative -> 0
data['sentiment'] = data['sentiment'].map({'positive': 1, 'negative': 0})


In [8]:
data.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |


In [9]:
data['sentiment'].value_counts()

sentiment
1    25000
0    25000
Name: count, dtype: int64

**Split data**

In [10]:
train_data, test_data = train_test_split(data, random_state = 42, test_size=0.2)

In [11]:
train_data.shape, test_data.shape

((40000, 2), (10000, 2))

**Data Preprocessing**

In [12]:
# Tokenize the reviews and pad them to a fixed length
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(train_data['review'])
X_train = pad_sequences(tokenizer.texts_to_sequences(train_data['review']), maxlen=200)
X_test = pad_sequences(tokenizer.texts_to_sequences(test_data['review']), maxlen=200)

The Tokenizer converts words to integers or vectors. In our case, we convert them to integers.

**What happens exactly?**

When you call fit_on_texts(), the tokenizer scans all the text data and builds a word index (mapping each unique word to an integer, sorted by frequency).

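By setting num_words=5000, you tell it:
👉 “Keep only the most frequent words in the training data. Ignore the rest.”

The limit is applied when texts are converted to sequences: Keras keeps word indices below num_words, and index 0 is reserved for padding, so in practice this means the 4,999 most frequent words.

Words outside this list will either:

- be skipped (ignored), which is the default and is what happens in this notebook

- or be replaced by an out-of-vocabulary token (<OOV>) if you set oov_token when creating the tokenizer.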
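The behaviour described above can be mimicked in plain Python. This is a simplified sketch, not the Keras implementation: `build_word_index` and `to_sequence` are hypothetical helpers, and the tokenization is a bare lowercase-and-split. It follows the Keras indexing convention, where index 0 is reserved for padding, the OOV token (if any) takes index 1, and only indices below `num_words` are kept.

```python
from collections import Counter


def build_word_index(texts, oov_token=None):
    """Map each word to an integer rank: 1 = most frequent (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending frequency; ties keep first-seen order.
    vocab = [word for word, _ in counts.most_common()]
    if oov_token is not None:
        vocab = [oov_token] + vocab  # the OOV token takes index 1
    return {word: i for i, word in enumerate(vocab, start=1)}


def to_sequence(text, word_index, num_words=None, oov_token=None):
    """Convert a text to word IDs, keeping only IDs below num_words."""
    oov_id = word_index.get(oov_token) if oov_token is not None else None
    sequence = []
    for word in text.lower().split():
        idx = word_index.get(word)
        in_vocab = idx is not None and (num_words is None or idx < num_words)
        if in_vocab:
            sequence.append(idx)
        elif oov_id is not None:
            sequence.append(oov_id)  # rare or unseen word mapped to the OOV token
        # otherwise the word is silently skipped
    return sequence


texts = ["the movie was great", "the movie was bad", "the plot was great"]

# Without an OOV token: "plot" has index 6, which is not below 5, so it is dropped.
word_index = build_word_index(texts)
print(to_sequence("the plot was great", word_index, num_words=5))
# [1, 2, 4]

# With an OOV token: "plot" is replaced by index 1 instead of being dropped.
oov_index = build_word_index(texts, oov_token="<OOV>")
print(to_sequence("the plot was great", oov_index, num_words=6, oov_token="<OOV>"))
# [2, 1, 3, 5]
```

Note that the word index itself always contains every word; the `num_words` cut-off only takes effect when sequences are produced.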

**Why is this useful?**

- Reduce memory usage – text datasets can have tens of thousands of unique words, but many are rare or irrelevant. Limiting vocabulary avoids huge embeddings.

- Prevent overfitting – rare words don’t add much value but increase noise.

- Improve training speed – smaller vocabulary → smaller embedding matrix → faster training.

**Usual choice of num_words**

There’s no universal fixed number; it depends on your dataset and resources. These are common practices:

- Small datasets / simple tasks (e.g. sentiment analysis on IMDB reviews):
num_words = 5,000 – 20,000 is often enough.
(IMDB tutorial in Keras usually uses 10,000).

- Medium datasets (millions of tokens):
num_words = 30,000 – 50,000.

- Large datasets (news corpora, Wikipedia, translation):
num_words = 100k or more (if GPU/memory allows).
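One way to ground the choice is to measure how much of the training text a given vocabulary size actually covers. The helper below is a hypothetical sketch (the name `coverage` and the whitespace tokenization are assumptions, not part of the notebook): it returns the fraction of all word occurrences that fall within the `k` most frequent words.

```python
from collections import Counter


def coverage(texts, k):
    """Fraction of word occurrences covered by the k most frequent words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    covered = sum(count for _, count in counts.most_common(k))
    return covered / total if total else 0.0


sample = ["the movie was great", "the movie was bad", "the plot was great"]
for k in (1, 2, 4, 6):
    print(k, round(coverage(sample, k), 2))
```

On the real data this could be called as `coverage(train_data['review'], 5000)`. If the coverage is already high, a larger vocabulary is unlikely to help much.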


**Integers vs. Vectors**

🔹 Integers (tokenizer outputs like [15, 27, 3, 99])

- These are word IDs (indexes in the vocabulary).

- They don’t capture meaning by themselves, but they’re necessary because ML models can’t process raw text.

🔹 Vectors (embeddings like [0.12, -0.45, 0.67, ...])

- These are dense, continuous representations of words.

- They capture semantic meaning (e.g., “king” and “queen” will be close in vector space).

- Usually generated by an Embedding layer in Keras (learned during training or initialized with pretrained embeddings like GloVe/Word2Vec).
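Conceptually, an embedding layer is a lookup table: row `i` holds the vector for word ID `i`. The sketch below illustrates this with a tiny random table in plain Python. The sizes (`vocab_size = 10`, `dim = 4`) and the helper name `embed` are illustrative assumptions, and a real `Embedding` layer would learn the values during training.

```python
import random

random.seed(0)
vocab_size, dim = 10, 4

# One row of `dim` floats per word ID; row 0 corresponds to the padding ID.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(dim)]
    for _ in range(vocab_size)
]


def embed(sequence):
    """Replace each integer word ID with its vector from the table."""
    return [embedding_table[word_id] for word_id in sequence]


vectors = embed([0, 0, 3, 7])
print(len(vectors), len(vectors[0]))  # 4 4
```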

**When to use integers?**

- Right after tokenization.

- You keep them as integers to feed into:

    - Embedding layers (Embedding(input_dim=num_words, output_dim=128)) → turns them into vectors automatically.

    - Or classical ML approaches (bag-of-words, TF-IDF), which turn the word IDs into count or weight vectors and don’t need dense embeddings.

**When to use vectors?**

- When feeding text into models like LSTMs, GRUs, Transformers, CNNs, etc.

- You either:

    - Let the Embedding layer learn them (most common).

    - Or load pretrained word vectors if you want semantic knowledge from large corpora.

- pad_sequences(maxlen=200) → makes all sequences exactly length 200:
    - If a sequence is shorter than 200 → it gets padded with zeros (by default at the start).

    - If a sequence is longer than 200 → it gets truncated (by default from the start, so the last 200 tokens are kept).

- Fitting the tokenizer only on the training data is correct, because it keeps information from the test set out of the vocabulary.

    - Test data is transformed with the same word index.

    - Unknown words in test → ignored or mapped to <OOV>.
 
So at this stage, your review looks like:

    'This movie was great'

    → [14, 57, 92, 8]
    
    → after padding to maxlen=200 → [0, 0, 0, …, 14, 57, 92, 8]
 
These are integers, not vectors.
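The default padding and truncation behaviour can be reproduced in a few lines of plain Python. `pad_pre` is a hypothetical helper written for illustration; it mirrors the defaults of `pad_sequences` (`padding='pre'`, `truncating='pre'`) but is not the Keras implementation, and it assumes `maxlen` is at least 1.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    if len(sequence) >= maxlen:
        return sequence[-maxlen:]  # truncate from the start
    return [value] * (maxlen - len(sequence)) + sequence  # pad at the start


print(pad_pre([14, 57, 92, 8], maxlen=6))  # [0, 0, 14, 57, 92, 8]
print(pad_pre([14, 57, 92, 8], maxlen=3))  # [57, 92, 8]
```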

In [13]:
X_train, X_test

(array([[1935,    1, 1200, ...,  205,  351, 3856],
        [   3, 1651,  595, ...,   89,  103,    9],
        [   0,    0,    0, ...,    2,  710,   62],
        ...,
        [   0,    0,    0, ..., 1641,    2,  603],
        [   0,    0,    0, ...,  245,  103,  125],
        [   0,    0,    0, ...,   70,   73, 2062]], dtype=int32),
 array([[   0,    0,    0, ...,  995,  719,  155],
        [  12,  162,   59, ...,  380,    7,    7],
        [   0,    0,    0, ...,   50, 1088,   96],
        ...,
        [   0,    0,    0, ...,  125,  200, 3241],
        [   0,    0,    0, ..., 1066,    1, 2305],
        [   0,    0,    0, ...,    1,  332,   27]], dtype=int32))

In [14]:
Y_train = train_data['sentiment']
Y_test = test_data['sentiment']

In [15]:
Y_train.shape, Y_test.shape

((40000,), (10000,))

In [16]:
Y_train

39087    0
30893    0
45278    1
16398    0
13653    0
        ..
11284    1
44732    1
38158    0
860      1
15795    1
Name: sentiment, Length: 40000, dtype: int64

# LSTM model (Long Short-Term Memory)

**LSTM = Long Short-Term Memory**, a special kind of Recurrent Neural Network (RNN: a type of artificial neural network designed to process sequential data, such as text or time series).

- Normal RNNs process sequences word by word and pass information forward, but they struggle to learn long-term dependencies because gradients shrink as they are propagated back through many time steps (the vanishing gradient problem).

- LSTMs solve this by adding a “memory cell” with gates:

    - Forget gate: decide what old info to drop.

    - Input gate: decide what new info to store.

    - Output gate: decide what info to pass to the next step.

👉 This lets them remember important info for a long time (like context in a sentence).
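The gate mechanics can be sketched for a single time step with scalar state, which keeps the arithmetic visible. This uses the standard LSTM equations; the function name `lstm_step`, the parameter layout, and the weight values are illustrative assumptions, and a real layer works with vectors and learned weight matrices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step with scalar input, hidden state and cell state.

    params maps each of "f", "i", "o", "g" to a tuple (w_x, w_h, bias).
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))    # forget gate: how much of the old cell state to keep
    i = sigmoid(pre("i"))    # input gate: how much new information to store
    o = sigmoid(pre("o"))    # output gate: how much of the cell state to expose
    g = math.tanh(pre("g"))  # candidate values for the cell state
    c = f * c_prev + i * g   # updated memory cell
    h = o * math.tanh(c)     # hidden state passed to the next step
    return h, c


# Forget gate biased wide open, input gate biased shut:
# the cell state is carried forward almost unchanged.
params = {
    "f": (0.0, 0.0, 10.0),
    "i": (0.0, 0.0, -10.0),
    "o": (0.0, 0.0, 0.0),
    "g": (1.0, 1.0, 0.0),
}
h, c = 0.0, 1.0
for x in [0.5, -0.3, 0.8]:
    h, c = lstm_step(x, h, c, params)
print(round(c, 3), round(h, 3))
```

With these hand-picked weights the cell state stays close to its initial value of 1.0 regardless of the inputs, which is the "remembering" behaviour the gates make possible.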

**Advantages over simple feedforward networks**

- Sequence awareness:
A dense feedforward NN treats its input as a fixed-size vector of numbers and has no built-in notion of word order.
An LSTM reads the tokens in order and carries context from one step to the next.

- Memory of context:
They can “remember” earlier parts of the text when analyzing later parts.

- Better at long sequences:
Unlike vanilla RNNs, LSTMs handle long sentences without easily forgetting early words.
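A small illustration of why order matters: the two sentences below contain exactly the same words, so a bag-of-words count cannot tell them apart, while the ordered sequences differ. The sentences are made up for this example.

```python
from collections import Counter

a = "the film was not good it was bad".split()
b = "the film was not bad it was good".split()

print(Counter(a) == Counter(b))  # True: identical word counts
print(a == b)                    # False: different word order
```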

**So, rule of thumb:**

- If the sequence is short or simple and you want a quick baseline → a simple RNN is okay. Examples: predicting the next character in a short word ("hel" → "hello"), small toy tasks or academic examples that show how recurrence works, and very small datasets where model simplicity helps prevent overfitting.

- If the sequence is longer or context really matters (which is the case for most NLP problems) → an LSTM (or GRU, or Transformer) is better.

In [17]:
model = Sequential()
# input_dim = vocabulary size (num_words), output_dim = size of each word vector,
# input_length = length of each padded sequence (recent Keras versions deprecate this argument)
model.add(Embedding(input_dim=5000, output_dim=128, input_length=200))
# dropout drops a fraction of the layer's inputs. recurrent_dropout is specific to
# recurrent layers (LSTM/GRU): it drops connections in the hidden state between time steps.
# Both help prevent the model from memorizing sequences too tightly and overfitting.
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
# Single sigmoid unit: the estimated probability that the review is positive
model.add(Dense(1, activation='sigmoid'))

I0000 00:00:1758750945.712576      36 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15513 MB memory:  -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0


In [18]:
model.summary()

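In TensorFlow 2.x / Keras, some layers remain “unbuilt” until the model actually sees input data or you explicitly build it, so calling summary() at this point does not yet report output shapes or parameter counts.
Even though input_length was given, Keras doesn’t finalize shapes until the model is built, for example by calling model.build(input_shape=(None, 200)) or by the first call to fit().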
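Once the model is built, the summary reports the parameter count of each layer. Those counts can be worked out by hand from the layer sizes used above. The sketch below assumes the standard parameterization: one weight per vocabulary entry and dimension for the embedding, and four gates with input weights, recurrent weights and a bias for the LSTM.

```python
vocab_size, embed_dim, units = 5000, 128, 128

# Embedding: one vector of size embed_dim per vocabulary entry.
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (units * (embed_dim + units) + units)

# Dense(1): one weight per LSTM unit plus a bias.
dense_params = units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 640000 131584 129 771713
```

Most of the parameters sit in the embedding table, which is why the choice of `num_words` has a direct effect on model size.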

In [19]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [20]:
model.fit(X_train, Y_train, epochs=10, batch_size=64, validation_split=0.2)

Epoch 1/10
500/500 ━━━━━━━━━━━━━━━━━━━━ 161s 307ms/step - accuracy: 0.7199 - loss: 0.5433 - val_accuracy: 0.8367 - val_loss: 0.3767
Epoch 2/10
500/500 ━━━━━━━━━━━━━━━━━━━━ 152s 303ms/step - accuracy: 0.8552 - loss: 0.3528 - val_accuracy: 0.8646 - val_loss: 0.3300
Epoch 3/10
500/500 ━━━━━━━━━━━━━━━━━━━━ 153s 306ms/step - accuracy: 0.8722 - loss: 0.3115 - val_accuracy: 0.8610 - val_loss: 0.3196
Epoch 4/10
500/500 ━━━━━━━━━━━━━━━━━━━━ 152s 304ms/step - accuracy: 0.8880 - loss: 0.2783 - val_accuracy: 0.8734 - val_loss: 0.3141
Epoch 5/10
500/500 ━━━━━━━━━━━━━━━━━━━━ 153s 305ms/step - accuracy: 0.9085 - loss: 0.2372 - val_accuracy: 0.8745 - val_loss: 0.3095
Epoch 6/10
500/500 ━━━━━━━━━━━━━━━━━━━━ 152s 305ms/step - accuracy: 0.9136 - loss: 0.2184 - val_accuracy: 0.8776 - val_loss: 0.3122
Epoc

<keras.src.callbacks.history.History at 0x7a66b004d050>
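In the log above, the validation loss flattens out after a few epochs while the training loss keeps falling, which is the usual sign that further epochs mainly overfit. The stopping rule typically used in that situation can be sketched in plain Python on the logged values. `best_epoch_with_patience` is a hypothetical helper, and the `patience` value is an arbitrary choice for illustration. Only the six epochs visible in the log are used.

```python
def best_epoch_with_patience(val_losses, patience=1):
    """Return (best_epoch, stop_epoch), both 1-based.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; stop_epoch is None if that never happens.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, None


logged_val_loss = [0.3767, 0.3300, 0.3196, 0.3141, 0.3095, 0.3122]
print(best_epoch_with_patience(logged_val_loss, patience=1))  # (5, 6)
```

In Keras the same idea is available as the `EarlyStopping` callback, passed to `fit` through its `callbacks` argument.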

**Model Evaluation**

In [21]:
loss, accuracy = model.evaluate(X_test, Y_test)

313/313 ━━━━━━━━━━━━━━━━━━━━ 31s 97ms/step - accuracy: 0.8788 - loss: 0.3330

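The step counts shown in the training and evaluation progress bars follow directly from the dataset sizes and batch sizes. The arithmetic below assumes that `evaluate` uses a batch size of 32, which is the Keras default when none is passed.

```python
import math

train_rows, test_rows = 40000, 10000

# fit(): validation_split=0.2 holds out the last 20% of the training rows.
val_rows = train_rows * 20 // 100
fit_rows = train_rows - val_rows
fit_steps = math.ceil(fit_rows / 64)  # batch_size=64

# evaluate(): no batch_size given, so the default of 32 is assumed.
eval_steps = math.ceil(test_rows / 32)

print(fit_rows, val_rows, fit_steps, eval_steps)  # 32000 8000 500 313
```

So each epoch trains on 32,000 reviews and validates on 8,000, and the reported test accuracy is computed over all 10,000 held-out reviews.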

**Predictive system**

In [22]:
def predict_sentiment(review):
    sequence = tokenizer.texts_to_sequences([review])  # wrap in a list: the tokenizer expects a batch of texts
    padded_sequence = pad_sequences(sequence, maxlen=200)
    prediction = model.predict(padded_sequence)
    sentiment = 'positive' if prediction[0][0] > 0.5 else 'negative'
    return sentiment
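Because the function above returns only a label, borderline predictions look the same as confident ones. A variant that also returns the raw score makes the difference visible. The sketch below is written against stand-in objects so that it runs on its own: `predict_with_score`, `FakeTokenizer`, `FakeModel` and `fake_pad` are hypothetical names, and in the notebook the real `tokenizer`, `model` and `pad_sequences` would be passed instead.

```python
def predict_with_score(review, tokenizer, model, pad, maxlen=200, threshold=0.5):
    """Return (label, score), where score is the model's sigmoid output."""
    sequence = tokenizer.texts_to_sequences([review])  # batch of one review
    padded = pad(sequence, maxlen=maxlen)
    score = float(model.predict(padded)[0][0])
    label = "positive" if score > threshold else "negative"
    return label, score


# Stand-ins so the sketch is self-contained (not the real Keras objects).
class FakeTokenizer:
    def texts_to_sequences(self, texts):
        return [[len(word) for word in text.split()] for text in texts]


class FakeModel:
    def predict(self, batch):
        return [[0.53] for _ in batch]


def fake_pad(sequences, maxlen):
    return [([0] * maxlen + seq)[-maxlen:] for seq in sequences]


print(predict_with_score("a fine film", FakeTokenizer(), FakeModel(), fake_pad))
# ('positive', 0.53)
```

A score close to 0.5, as in this stand-in example, signals that the label should be treated with caution.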

In [23]:
new_review = 'This movie is really good, I loved it'
sentiment = predict_sentiment(new_review)
print("The sentiment of the review is : ", sentiment)

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 563ms/step
The sentiment of the review is :  positive


In [24]:
new_review = "The movie didn't catch me, it is basic"
sentiment = predict_sentiment(new_review)
print("The sentiment of the review is : ", sentiment)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 128ms/step
The sentiment of the review is :  positive


In [25]:
new_review = "The movie was not that good"
sentiment = predict_sentiment(new_review)
print("The sentiment of the review is : ", sentiment)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 128ms/step
The sentiment of the review is :  negative


In [26]:
new_review = "It is a lovely movie"
sentiment = predict_sentiment(new_review)
print("The sentiment of the review is : ", sentiment)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 125ms/step
The sentiment of the review is :  positive


In [27]:
model.export("sentiment_analysis_model")

Saved artifact at 'sentiment_analysis_model'. The following endpoints are available:

* Endpoint 'serve'
  args_0 (POSITIONAL_ONLY): TensorSpec(shape=(None, 200), dtype=tf.float32, name='keras_tensor')
Output Type:
  TensorSpec(shape=(None, 1), dtype=tf.float32, name=None)
Captures:
  134579791712080: TensorSpec(shape=(), dtype=tf.resource, name=None)
  134579791718608: TensorSpec(shape=(), dtype=tf.resource, name=None)
  134579791711504: TensorSpec(shape=(), dtype=tf.resource, name=None)
  134581945963664: TensorSpec(shape=(), dtype=tf.resource, name=None)
  134579791711888: TensorSpec(shape=(), dtype=tf.resource, name=None)
  134581945963472: TensorSpec(shape=(), dtype=tf.resource, name=None)
  134581945961744: TensorSpec(shape=(), dtype=tf.resource, name=None)


In [28]:
import shutil

# Zip the folder
shutil.make_archive("/kaggle/working/sentiment_analysis_model", 'zip', "/kaggle/working/sentiment_analysis_model")

'/kaggle/working/sentiment_analysis_model.zip'

**Saving the tokenizer**

In [29]:
import pickle

with open("tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer, f)
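At inference time the pickled tokenizer has to be loaded back before any text can be converted, so that the same word index is used as during training. The round trip below uses a plain dictionary as a stand-in for the tokenizer and a temporary directory, so it runs without the notebook's objects; both are assumptions for illustration. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted tokenizer (the real object would be pickled the same way).
stand_in_tokenizer = {"word_index": {"the": 1, "movie": 2}, "num_words": 5000}

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "tokenizer.pkl")

    # Save, as in the cell above.
    with open(pkl_path, "wb") as f:
        pickle.dump(stand_in_tokenizer, f)

    # Load it back, as a separate inference script would.
    with open(pkl_path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_tokenizer)  # True
```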