## Build a Movie Recommender System

In [7]:
import tensorflow as tf
from tensorflow.keras.datasets import imdb

In [8]:
vocab_size = 10000
(x_train, y_train),(x_test, y_test) = imdb.load_data(num_words= vocab_size)

print(f"Training samples: {len(x_train)}, Testing samples: {len(x_test)}")
print("Sample review (numerical format):", x_train[0])

Training samples: 25000, Testing samples: 25000
Sample review (numerical format): [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 1

In [9]:
#To understand the data better, let's decode the numerical review back into text.

word_index = imdb.get_word_index()
reverse_word_index = {v: k for k, v in word_index.items()}

# Decode the first review
decoded_review = " ".join([reverse_word_index.get(i - 3, "?") for i in x_train[0]])
print(decoded_review)

? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you thi

In [10]:
x_train.shape

(25000,)

## Why Is Padding Important?
Ensures uniform input shape:

Neural networks require fixed-size inputs.
Padding ensures batch processing is efficient.
Prevents information loss (when using "post" padding):

If we truncate long reviews from the start, we might lose important context.
Improves computation efficiency:

Most GPU-based deep learning frameworks work best with fixed-size tensors.


In [11]:
from os import truncate
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_length = 200
x_train_padded = pad_sequences(x_train, maxlen = max_length, truncating = "post", padding = "post")
x_test_padded = pad_sequences(x_test, maxlen = max_length, truncating = "post", padding = "post")

print("Padded Review Shape: ", x_train_padded.shape)

Padded Review Shape:  (25000, 200)


## What is LSTM (Long Short-Term Memory)?
LSTM (Long Short-Term Memory) is a type of Recurrent Neural Network (RNN) designed to handle sequential data (e.g., time series, text, speech). Unlike traditional RNNs, LSTMs can remember long-term dependencies using gates to control the flow of information.

Why is LSTM Used?
Solves vanishing gradient problem (RNNs struggle with long sequences).
Maintains memory over long sequences (important for NLP & time-series).
Effective for text classification, speech recognition, and stock price prediction.

How Does LSTM Work?

LSTM has three key gates:

Forget Gate: Decides what information to discard from memory.

Input Gate: Updates memory with new input.

Output Gate: Decides what to output.


Step 1: Load GloVe Pre-trained Embeddings
python
Copy
Edit
import numpy as np
from tensorflow.keras.layers import Embedding
numpy is used to handle arrays efficiently.
Embedding is the TensorFlow layer for word embeddings.
Step 2: Read the GloVe File


```
embedding_index = {}  # Dictionary to store word embeddings
with open("glove.6B.100d.txt", encoding="utf-8") as f:  # Open GloVe file
    for line in f:
        values = line.split()  # Split line into word + numbers
        word = values[0]  # First part is the word
        vector = np.array(values[1:], dtype="float32")  # Remaining numbers are vector values
        embedding_index[word] = vector  # Store in dictionary
```


✅ How does this work?

The GloVe file is a text file where each line looks like:
```
the 0.418 0.24968 -0.41242 0.1217 ... (100 numbers in total)
apple 0.1223 -0.314 0.5403 ... (100 numbers)

```
This loads all word embeddings into a dictionary:

```
embedding_index["the"] = [0.418, 0.24968, -0.41242, 0.1217, ...]  # 100-d vector
embedding_index["apple"] = [0.1223, -0.314, 0.5403, ...]
```
✅ Why do we need this matrix?

```
embedding_matrix = np.zeros((vocab_size, 100))  # Create a matrix of zeros
for word, i in word_index.items():  # word_index comes from tokenizer
    if i < vocab_size:
        vector = embedding_index.get(word)  # Get the pre-trained vector
        if vector is not None:
            embedding_matrix[i] = vector  # Assign to the correct row

```
word_index.items() maps words to their unique integer IDs (from tokenizer).

The embedding matrix is a lookup table where each row corresponds to a word in our vocabulary.
Example:
If "apple" is at index 42 in word_index, then:

embedding_matrix[42] = [0.1223, -0.314, 0.5403, ...]  # GloVe vector for "apple"
Words not found in GloVe remain as zeros.

Step 4: Define the TensorFlow Embedding Layer

embedding_layer = Embedding(input_dim=vocab_size, output_dim=100,
                            weights=[embedding_matrix], trainable=False)
✅ What happens here?

input_dim=vocab_size: The total number of words in your vocabulary.
output_dim=100: The embedding size (from glove.6B.100d.txt).
weights=[embedding_matrix]: Use pre-trained embeddings instead of learning from scratch.
trainable=False: Prevents model from updating embeddings during training (keeps GloVe vectors unchanged).

In [12]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
import numpy as np

embedding_index = {}  # Dictionary to store word embeddings
with open("/content/glove.6B.100d.txt", encoding="utf-8") as f:  # Open GloVe file
    for line in f:
        values = line.split()  # Split line into word + numbers
        word = values[0]  # First part is the word
        vector = np.array(values[1:], dtype="float32")  # Remaining numbers are vector values
        embedding_index[word] = vector  # Store in dictionary

In [13]:
embedding_index["the"]

array([-0.038194, -0.24487 ,  0.72812 , -0.39961 ,  0.083172,  0.043953,
       -0.39141 ,  0.3344  , -0.57545 ,  0.087459,  0.28787 , -0.06731 ,
        0.30906 , -0.26384 , -0.13231 , -0.20757 ,  0.33395 , -0.33848 ,
       -0.31743 , -0.48336 ,  0.1464  , -0.37304 ,  0.34577 ,  0.052041,
        0.44946 , -0.46971 ,  0.02628 , -0.54155 , -0.15518 , -0.14107 ,
       -0.039722,  0.28277 ,  0.14393 ,  0.23464 , -0.31021 ,  0.086173,
        0.20397 ,  0.52624 ,  0.17164 , -0.082378, -0.71787 , -0.41531 ,
        0.20335 , -0.12763 ,  0.41367 ,  0.55187 ,  0.57908 , -0.33477 ,
       -0.36559 , -0.54857 , -0.062892,  0.26584 ,  0.30205 ,  0.99775 ,
       -0.80481 , -3.0243  ,  0.01254 , -0.36942 ,  2.2167  ,  0.72201 ,
       -0.24978 ,  0.92136 ,  0.034514,  0.46745 ,  1.1079  , -0.19358 ,
       -0.074575,  0.23353 , -0.052062, -0.22044 ,  0.057162, -0.15806 ,
       -0.30798 , -0.41625 ,  0.37972 ,  0.15006 , -0.53212 , -0.2055  ,
       -1.2526  ,  0.071624,  0.70565 ,  0.49744 , 

In [14]:
embedding_matrix = np.zeros((vocab_size, 100))  # Create a matrix of zeros
for word, i in word_index.items():  # word_index comes from tokenizer
    if i < vocab_size:
        vector = embedding_index.get(word)  # Get the pre-trained vector
        if vector is not None:
            embedding_matrix[i] = vector  # Assign to the correct row


In [15]:
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=100, weights=[embedding_matrix], trainable=False),
    LSTM(64, return_sequences=True),
    LSTM(32),
    Dense(32, activation="relu"),
    Dropout(0.5),
    Dense(1, activation="sigmoid")  # Binary classification (positive/negative)
])

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()

In [16]:
history = model.fit(x_train_padded, y_train, epochs=5, batch_size=64, validation_data=(x_test_padded, y_test))

Epoch 1/5
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m18s[0m 31ms/step - accuracy: 0.4971 - loss: 0.6944 - val_accuracy: 0.5017 - val_loss: 0.6932
Epoch 2/5
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m15s[0m 27ms/step - accuracy: 0.5020 - loss: 0.6931 - val_accuracy: 0.5152 - val_loss: 0.6915
Epoch 3/5
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m19s[0m 24ms/step - accuracy: 0.5138 - loss: 0.6917 - val_accuracy: 0.5230 - val_loss: 0.6899
Epoch 4/5
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 31ms/step - accuracy: 0.5281 - loss: 0.6897 - val_accuracy: 0.4946 - val_loss: 0.6932
Epoch 5/5
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 28ms/step - accuracy: 0.5013 - loss: 0.6928 - val_accuracy: 0.5132 - val_loss: 0.6925


In [17]:
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np

def preprocess_review(review):
    tokenizer = Tokenizer(num_words=vocab_size)
    tokenizer.fit_on_texts([review])
    sequence = tokenizer.texts_to_sequences([review])
    padded_sequence = pad_sequences(sequence, maxlen=max_length, padding="post", truncating="post")
    return padded_sequence

# Example review
new_review = "The movie was fantastic! I really enjoyed it."
processed_review = preprocess_review(new_review)

# Predict sentiment
prediction = model.predict(processed_review)
sentiment = "Positive" if prediction > 0.5 else "Negative"
print(f"Review Sentiment: {sentiment} ({prediction[0][0]:.4f})")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 221ms/step
Review Sentiment: Negative (0.4992)


In [18]:
import pandas as pd

movies_df = pd.read_csv("/content/movies_metadata.csv")
movies_df = movies_df[["title", "overview"]].dropna()
movies_df.head()

  movies_df = pd.read_csv("/content/movies_metadata.csv")


Unnamed: 0,title,overview
0,Toy Story,"Led by Woody, Andy's toys live happily in his ..."
1,Jumanji,When siblings Judy and Peter discover an encha...
2,Grumpier Old Men,A family wedding reignites the ancient feud be...
3,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom..."
4,Father of the Bride Part II,Just when George Banks has recovered from his ...


In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(movies_df["overview"])

print("TF-IDF Matrix Shape:", tfidf_matrix.shape)

TF-IDF Matrix Shape: (44506, 75827)


In [None]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

In [None]:
def get_recommendations(title, cosine_sim=cosine_sim):
    idx = movies_df[movies_df["title"] == title].index[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:6]  # Top 5 recommendations
    movie_indices = [i[0] for i in sim_scores]
    return movies_df["title"].iloc[movie_indices]

# Example: Recommend movies similar to "The Matrix"
print(get_recommendations("The Matrix"))
