# Movie Recommendation System

This project implements a hybrid movie recommendation system that combines content-based and collaborative filtering to generate personalized recommendations. The content-based component applies TF-IDF vectorization to each movie's genres and title and compares movies by cosine similarity. The collaborative filtering component is a neural network that learns user and movie embeddings and predicts ratings.

The system combines the two by averaging the content similarity scores with the predicted ratings, with the aim of producing more accurate and more diverse recommendations than either method alone. The model is trained on the MovieLens dataset (`ml-latest-small`), and recommendations are assessed with a simple genre match rate: the share of recommended movies that have at least one genre in common with the input movies.

The MovieLens dataset provides:
- **User Ratings**: The ratings users have given to movies, on a scale of 0.5 to 5.0 in half-star increments.
- **Movie Metadata**: Details about the movies, namely titles (with the release year in parentheses at the end of the title) and genres.
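
Because the release year is embedded in the title string rather than stored in its own column, it has to be parsed out if it is needed as a feature. The snippet below is a minimal sketch of one way to do that with a regular expression; the helper name `split_title_year` is hypothetical and not part of this notebook.

```python
import re

# Matches a trailing four-digit year in parentheses, e.g. "Toy Story (1995)".
YEAR_SUFFIX = re.compile(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$")

def split_title_year(raw_title):
    """Split a MovieLens-style title into (title, year); year is None if absent."""
    match = YEAR_SUFFIX.match(raw_title)
    if match is None:
        return raw_title.strip(), None
    return match.group("title"), int(match.group("year"))

print(split_title_year("Toy Story (1995)"))
print(split_title_year("Father of the Bride Part II (1995)"))
```

With pandas, the same pattern could be applied to the whole `title` column through `Series.str.extract`.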

### Hybrid Approach Overview

- **Collaborative Filtering**: This technique identifies patterns by analyzing user-item interactions. In this notebook it is implemented as a neural network that learns an embedding for every user and every movie and predicts the rating a user would give to a movie.
- **Content-Based Filtering**: This method focuses on the characteristics of the movies themselves. In this notebook each movie is represented by a TF-IDF vector built from its genres and title, and movies are recommended when they are similar to the input movies.

By combining these two techniques, the hybrid approach aims to improve the accuracy of predictions, especially when one method has limitations. For example, collaborative filtering might struggle with new or unpopular movies, while content-based filtering may have difficulty recommending diverse options.
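
One practical detail when combining the two signals is that they live on different scales: cosine similarities fall in [0, 1], while predicted ratings are roughly in the 0.5 to 5 range. The sketch below shows one possible way to blend them, assuming min-max normalization and a weight `alpha`; the function names and the toy numbers are illustrative only.

```python
import numpy as np

def min_max(scores):
    """Rescale scores to [0, 1]; a constant vector maps to all zeros."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

def blend(content_scores, cf_ratings, alpha=0.5):
    """Weighted average of normalized content and collaborative scores."""
    return alpha * min_max(content_scores) + (1 - alpha) * min_max(cf_ratings)

# Toy scores for four candidate movies (made-up values).
content = [0.9, 0.1, 0.4, 0.7]   # cosine similarities
cf = [3.0, 4.5, 4.0, 4.2]        # predicted ratings
hybrid = blend(content, cf, alpha=0.5)
ranking = np.argsort(hybrid)[::-1]  # best candidate first
print(hybrid, ranking)
```

Setting `alpha` closer to 1 favors content similarity, and closer to 0 favors the predicted ratings.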

In this notebook, we will:

- Preprocess and analyze the data.
- Implement both collaborative filtering and content-based filtering methods.
- Build and evaluate the hybrid model to generate movie recommendations.
- Fine-tune the system to enhance recommendation quality.


## Downloading the data

In [None]:
# Download MovieLens dataset
!wget https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
!unzip ml-latest-small.zip

--2025-01-02 18:06:49--  https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2025-01-02 18:06:52 (773 KB/s) - ‘ml-latest-small.zip’ saved [978202/978202]

Archive:  ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


## Importing necessary libraries

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

## Identifying the structure of the dataset

In [None]:
# Load the datasets
movies_df = pd.read_csv('ml-latest-small/movies.csv')
ratings_df = pd.read_csv('ml-latest-small/ratings.csv')
links_df = pd.read_csv('ml-latest-small/links.csv')

# Display the first few rows of each dataframe
print("Movies Dataset:")
print(movies_df.head())

print("\nRatings Dataset:")
print(ratings_df.head())

print("\nLinks Dataset:")
print(links_df.head())

Movies Dataset:
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  

Ratings Dataset:
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931

Links Dataset:
   movieId  imdbId   tmdbId
0        1  114709    862.0
1        2  113497   8844.0
2        3  113228  15602.0
3        4  114

## Preprocess movie data

In [None]:
# Creating a combined text feature for content-based filtering
movies_df['content_features'] = movies_df['genres'] + ' ' + movies_df['title']


## Content-Based Filtering Component

### TF-IDF Vectorization

In [None]:
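# Vectorize the combined genre and title text, ignoring English stop words
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies_df['content_features'])

# Compute content-based similarity matrix
# (one row and one column per row of movies_df, in the same order)
content_sim_matrix = cosine_similarity(tfidf_matrix)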
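
To make the two steps above more concrete, here is a small pure-Python sketch of TF-IDF weighting followed by cosine similarity on three toy documents. It is a simplified stand-in for `TfidfVectorizer` and `cosine_similarity`, not their actual implementation: it tokenizes on whitespace only and skips stop-word removal, though it does use the smoothed IDF form that scikit-learn applies by default.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalized {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        weights = {t: freq * idf[t] for t, freq in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def cosine(u, v):
    """Dot product of two L2-normalized sparse vectors."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())

# Toy "genres + title" strings (pipes already replaced by spaces).
docs = [
    "adventure animation children comedy fantasy toy story",
    "adventure children fantasy jumanji",
    "comedy romance grumpier old men",
]
vecs = tfidf_vectors(docs)
sims = [[cosine(a, b) for b in vecs] for a in vecs]
print(sims)
```

Documents that share no terms get a similarity of exactly 0, and every document has a similarity of 1 with itself.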

## Collaborative Filtering Preprocessing

In [None]:
# Encode users and movies
user_encoder = LabelEncoder()
movie_encoder = LabelEncoder()

ratings_df['user_id_encoded'] = user_encoder.fit_transform(ratings_df['userId'])
ratings_df['movie_id_encoded'] = movie_encoder.fit_transform(ratings_df['movieId'])
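
The encoding step maps the sorted unique raw IDs to contiguous integers starting at 0, which is what an embedding layer expects. The pure-Python sketch below mirrors that behavior on a few `movieId` values; it is an illustration of the idea, not scikit-learn's code.

```python
raw_movie_ids = [1, 3, 6, 47, 50, 3, 1]

# Sorted unique IDs play the role of LabelEncoder.classes_.
classes = sorted(set(raw_movie_ids))
to_index = {movie_id: i for i, movie_id in enumerate(classes)}

encoded = [to_index[m] for m in raw_movie_ids]  # like transform()
decoded = [classes[i] for i in encoded]         # like inverse_transform()

print(encoded)
print(decoded == raw_movie_ids)
```

Note that an encoded index is a position among the *rated* movies, not a row number in `movies_df`. To get back to a title, an encoded index first has to be converted to its raw `movieId` (for example through `movie_encoder.classes_` or `movie_encoder.inverse_transform`).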


### User-Movie Rating Matrix

In [None]:
user_movie_matrix = ratings_df.pivot(
    index='user_id_encoded',
    columns='movie_id_encoded',
    values='rating'
).fillna(0)

### Collaborative Filtering Neural Network Model

In [None]:
def create_hybrid_model(num_users, num_movies, embedding_size=50):
    # User input
    user_input = tf.keras.layers.Input(shape=(1,), name='user_input')

    # Movie input
    movie_input = tf.keras.layers.Input(shape=(1,), name='movie_input')

    # Embedding layers
    # (the deprecated `input_length` argument is omitted; the Input layers
    # above already fix the sequence length to 1)
    user_embedding = tf.keras.layers.Embedding(
        num_users, embedding_size,
        embeddings_initializer='he_normal',
        name='user_embedding'
    )(user_input)
    movie_embedding = tf.keras.layers.Embedding(
        num_movies, embedding_size,
        embeddings_initializer='he_normal',
        name='movie_embedding'
    )(movie_input)

    # Flatten embeddings
    user_vector = tf.keras.layers.Flatten()(user_embedding)
    movie_vector = tf.keras.layers.Flatten()(movie_embedding)

    # Concatenate user and movie embeddings
    concatenated = tf.keras.layers.Concatenate()([user_vector, movie_vector])

    # Deep layers
    dense1 = tf.keras.layers.Dense(64, activation='relu')(concatenated)
    dense2 = tf.keras.layers.Dense(32, activation='relu')(dense1)
    output = tf.keras.layers.Dense(1, activation='linear')(dense2)

    # Create model
    model = tf.keras.Model(inputs=[user_input, movie_input], outputs=output)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss='mean_squared_error'
    )

    return model
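
To get a feel for where the capacity of this architecture sits, the following sketch counts trainable parameters by hand for the layer sizes used above (two embeddings of size 50, dense layers of 64, 32 and 1 units). The helper `count_parameters` and the round user and movie counts are hypothetical and chosen only for illustration.

```python
def count_parameters(num_users, num_movies, embedding_size=50, hidden=(64, 32)):
    """Return (embedding_params, dense_params) for the architecture above."""
    # One embedding row per user and per movie.
    embedding_params = (num_users + num_movies) * embedding_size
    dense_params = 0
    fan_in = 2 * embedding_size  # concatenated user + movie vectors
    for units in hidden + (1,):  # the final layer has a single output unit
        dense_params += fan_in * units + units  # weights + biases
        fan_in = units
    return embedding_params, dense_params

# Illustrative sizes, not the exact dataset counts.
emb, dense = count_parameters(num_users=600, num_movies=9000)
print(emb, dense)
```

Almost all parameters belong to the embedding tables, which is one reason a model like this can memorize the training ratings quickly.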

## Training the Model

In [None]:
# Prepare data for training
X_users = ratings_df['user_id_encoded'].values
X_movies = ratings_df['movie_id_encoded'].values
y = ratings_df['rating'].values

# Split data
X_users_train, X_users_test, X_movies_train, X_movies_test, y_train, y_test = train_test_split(
    X_users, X_movies, y, test_size=0.2, random_state=42
)

# Get number of unique users and movies
num_users = len(np.unique(X_users))
num_movies = len(np.unique(X_movies))

### Train Collaborative Filtering Model

In [None]:
cf_model = create_hybrid_model(num_users, num_movies)
history = cf_model.fit(
    [X_users_train, X_movies_train],
    y_train,
    epochs=10,
    batch_size=64,
    validation_split=0.2,
    verbose=1
)

Epoch 1/10
1009/1009 ━━━━━━━━━━━━━━━━━━━━ 5s 4ms/step - loss: 2.8831 - val_loss: 0.7982
Epoch 2/10
1009/1009 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 0.6878 - val_loss: 0.7798
Epoch 3/10
1009/1009 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 0.6168 - val_loss: 0.7815
Epoch 4/10
1009/1009 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 0.5410 - val_loss: 0.8101
Epoch 5/10
1009/1009 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - loss: 0.4703 - val_loss: 0.8280
Epoch 6/10
1009/1009 ━━━━━━━━━━━━━━━━━━━━ 5s 5ms/step - loss: 0.4029 - val_loss: 0.8513
Epoch 7/10
1009/1009 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - loss: 0.3465 - val_loss: 0.8914
Epoch 8/10
1009/1009 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 0.2979 - val_loss: 0.9082
Epoch 9/10
1009/1009

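In the log above the training loss keeps falling while the validation loss is lowest at epoch 2 and rises afterwards, which suggests overfitting. A common remedy is early stopping, for example Keras's `tf.keras.callbacks.EarlyStopping` with `monitor='val_loss'` and `restore_best_weights=True`. The sketch below replays that rule on the logged validation losses in plain Python; the helper `best_epoch` and the choice of `patience=2` are assumptions for illustration.

```python
# Validation losses for epochs 1-8, copied from the training log above.
val_loss = [0.7982, 0.7798, 0.7815, 0.8101, 0.8280, 0.8513, 0.8914, 0.9082]

def best_epoch(losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-based.

    stop_epoch is the epoch at which training would halt after `patience`
    epochs without improvement, or None if that never happens.
    """
    best_idx = 0
    for i, loss in enumerate(losses):
        if loss < losses[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx + 1, i + 1
    return best_idx + 1, None

print(best_epoch(val_loss))
```

Under these assumptions training would stop after epoch 4 and keep the weights from epoch 2.
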
## Hybrid Recommendation system

In [None]:
# Hybrid Recommendation Function
def get_hybrid_recommendations(input_movies, top_n=5):
    """
    Generate hybrid recommendations combining content and collaborative filtering

    Args:
    input_movies (list): Input movie names
    top_n (int): Number of recommendations to return

    Returns:
    list: Top recommended movies
    """
    try:
        # Find movie indices for input movies
        input_movie_indices = []
        for movie in input_movies:
            idx = movies_df[movies_df['title'] == movie].index
            if len(idx) > 0:
                input_movie_indices.append(idx[0])
            else:
                print(f"Warning: {movie} not found in the dataset.")
                return []

        # Content-based recommendations
        content_scores = np.mean([content_sim_matrix[idx] for idx in input_movie_indices], axis=0)

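        # Collaborative filtering predictions
        sample_user_id = X_users[0]  # Using first user as example
        all_movie_ids = np.arange(num_movies)
        user_inputs = np.full(len(all_movie_ids), sample_user_id)

        # Align the two score vectors: encoded movie id i corresponds to the raw
        # movieId movie_encoder.classes_[i], which is generally NOT row i of
        # movies_df, so look up each rated movie's row explicitly
        row_of_movie_id = pd.Series(movies_df.index, index=movies_df['movieId'])
        movie_rows = row_of_movie_id.loc[movie_encoder.classes_].to_numpy()
        content_scores = content_scores[movie_rows]

        # Predict ratings using collaborative filtering
        cf_ratings = cf_model.predict([user_inputs, all_movie_ids]).flatten()

        # Rescale predicted ratings to [0, 1] so they are comparable with the
        # cosine similarities instead of dominating the average
        cf_scores = (cf_ratings - cf_ratings.min()) / (cf_ratings.max() - cf_ratings.min() + 1e-9)

        # Combine content and collaborative filtering scores
        hybrid_scores = 0.5 * content_scores + 0.5 * cf_scores

        # Sort and get top recommendations
        top_indices = np.argsort(hybrid_scores)[::-1]

        # Filter out input movies and stop once top_n titles are collected
        recommendations = []
        for idx in top_indices:
            if len(recommendations) >= top_n:
                break
            rec_movie_title = movies_df.loc[movie_rows[idx], 'title']

            if rec_movie_title not in input_movies:
                recommendations.append(rec_movie_title)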

        return recommendations

    except Exception as e:
        print(f"An error occurred: {e}")
        return []
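
The function above requires an exact title match, including the year suffix, so a query such as `'Toy Story'` would trigger the "not found" warning. The sketch below shows one possible way to make the lookup more forgiving with the standard library's `difflib`; the helper `resolve_title` and the cutoff value are hypothetical and would need tuning on the real title list.

```python
import difflib

def resolve_title(query, known_titles, cutoff=0.6):
    """Return the closest known title, or None when nothing is similar enough."""
    if query in known_titles:
        return query
    matches = difflib.get_close_matches(query, known_titles, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# A few titles taken from the movies.csv preview shown earlier.
titles = ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"]

print(resolve_title("Toy Story", titles))
print(resolve_title("Zzzz", titles))
```

In the notebook, `known_titles` could be `movies_df['title'].tolist()`, and the resolved title would then be passed to `get_hybrid_recommendations`.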


## Evaluation Metrics

In [None]:
def evaluate_recommendations(test_movies, top_n=5):
    """
    Evaluate recommendation quality

    Args:
    test_movies (list): Movies to test
    top_n (int): Number of recommendations

    Returns:
    dict: Evaluation metrics
    """
    recommendations = get_hybrid_recommendations(test_movies, top_n)

    # Genre match calculation
    input_genres = set()
    for movie in test_movies:
        genre = movies_df[movies_df['title'] == movie]['genres'].values[0]
        input_genres.update(genre.split('|'))

    # Check genre match for recommendations
    genre_matches = sum(
        any(genre in input_genres for genre in
            movies_df[movies_df['title'] == rec]['genres'].values[0].split('|'))
        for rec in recommendations
    )

    # Calculate metrics
    metrics = {
        'recommendations': recommendations,
        'genre_match_rate': (genre_matches / len(recommendations)) * 100 if recommendations else 0,
        'total_recommendations': len(recommendations)
    }

    return metrics
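
The genre match rate counts a recommendation as a hit if it shares at least one genre with the input movies, so broad genres such as Drama can inflate it. A stricter alternative, sketched below as an assumption rather than part of this notebook, is the Jaccard overlap between genre sets; the helper name `genre_jaccard` is hypothetical.

```python
def genre_jaccard(genres_a, genres_b):
    """Jaccard overlap of two pipe-separated genre strings, between 0 and 1."""
    a = set(genres_a.split('|'))
    b = set(genres_b.split('|'))
    return len(a & b) / len(a | b)

# Genre strings taken from the movies.csv preview shown earlier.
toy_story = "Adventure|Animation|Children|Comedy|Fantasy"
jumanji = "Adventure|Children|Fantasy"
grumpier = "Comedy|Romance"

print(genre_jaccard(toy_story, jumanji))
print(genre_jaccard(toy_story, grumpier))
```

Averaging this value over the recommended movies gives a graded score instead of a yes/no match per recommendation.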

## Usage and Evaluation

In [None]:
print("\nRecommendations for 'Inception (2010)':")
inception_recommendations = get_hybrid_recommendations(['Inception (2010)'])
print(inception_recommendations)

print("\nEvaluation for 'Inception (2010)':")
evaluation_results = evaluate_recommendations(['Inception (2010)'])
print(f"Genre Match Rate: {evaluation_results['genre_match_rate']:.2f}%")



Recommendations for 'Inception (2010)':
304/304 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step
['Sandpiper, The (1965)', 'Secret of Roan Inish, The (1994)', 'GLOW: The Story of the Gorgeous Ladies of Wrestling (2012)', 'Hoop Dreams (1994)', 'Thirteen Days (2000)']

Evaluation for 'Inception (2010)':
304/304 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Genre Match Rate: 60.00%
