# Movie Recommendation System

**By Akash Chakraborty**

## Objective
The primary goal of this project is to develop a movie recommendation system leveraging Python and Pandas. This system combines data exploration and feature engineering to suggest movies based on their similarity to a user-selected title. The analysis integrates key movie attributes and machine learning techniques to provide tailored recommendations.

## Dataset Description
This project utilizes four datasets containing comprehensive movie details:

### Movies.csv
- **id**: Unique identifier for each movie.
- **title**: The title of the movie.
- **genres**: A list of genres associated with the movie.
- **language**: The primary language of the movie.
- **user_score**: The average user rating for the movie.
- **runtime_hour**: The runtime of the movie in hours.
- **runtime_min**: The runtime of the movie in minutes.
- **release_date**: The release date of the movie.
- **vote_count**: The total number of votes received by the movie.

### FilmDetails.csv
- **id**: Unique identifier corresponding to the Movies table.
- **director**: Name(s) of the director(s).
- **top_billed**: A list of top-billed actors or actresses.
- **budget_usd**: Budget allocated for producing the movie (in USD).
- **revenue_usd**: Total revenue generated by the movie (in USD).

### MoreInfo.csv
- **id**: Unique identifier corresponding to FilmDetails.
- **runtime**: Total runtime formatted as "Xh Ym".
- **budget**: Budget formatted with a dollar sign (e.g., "$20,000,000").
- **revenue**: Revenue formatted with a dollar sign (e.g., "$28,341,469").
- **film_id**: Foreign key linking back to the FilmDetails table.
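
The formatted strings in this table cannot be used numerically until they are parsed. A minimal sketch of how that could be done, assuming the values follow the "Xh Ym" and "$20,000,000" patterns described above. The helper names `parse_runtime_minutes` and `parse_usd` are hypothetical and not part of the original notebook:

```python
import re

import pandas as pd


def parse_runtime_minutes(text):
    """Convert a string such as '2h 49m' to total minutes; None if unparseable."""
    if not isinstance(text, str):
        return None
    match = re.fullmatch(r"\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*", text)
    if match is None or not any(match.groups()):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes


def parse_usd(text):
    """Convert a string such as '$20,000,000' to an int; None if unparseable."""
    if not isinstance(text, str):
        return None
    digits = text.replace("$", "").replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


# Toy values for illustration only (the runtimes are made up)
runtimes = pd.Series(["2h 49m", "45m", None])
print(runtimes.map(parse_runtime_minutes).tolist())
print(parse_usd("$20,000,000"), parse_usd("$28,341,469"))
```

Returning `None` for anything that does not match keeps malformed cells visible as missing values instead of raising in the middle of a column-wide `map`.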

### PosterPath.csv
- **id**: Unique identifier corresponding to Movies.
- **poster_path**: URL link to the movie's poster image.
- **backdrop_path**: URL link to the movie's backdrop image.

## Methodology and Tools
Using Python as the primary programming language, this project employs various libraries for data analysis, visualization, and recommendation:

- **Data Analysis**: Pandas, NumPy
- **Visualization**: Matplotlib, Seaborn
- **Machine Learning**: Scikit-learn for similarity computation

## Analysis Focus
The project focuses on:

- Combining datasets to create a comprehensive movie dataset.
- Identifying key features such as genres, director, and top-billed actors for building similarity metrics.
- Recommending movies using a similarity matrix based on cosine similarity of extracted features.

**1. Importing Libraries**

We begin by importing the required libraries for data manipulation and analysis:
- **Pandas**: For handling and processing structured data.
- **NumPy**: For numerical computations and array handling.
- **CountVectorizer**: To transform textual data into numerical feature vectors.
- **cosine_similarity**: To compute similarity between feature vectors, crucial for recommendation logic.
- **IPython.display**: To display images like movie posters for enhanced visualization.

In [1]:
## Main Execution Flow

### Load and Merge Data

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from IPython.display import Image, display

**2. Loading Datasets**

Next, we load the datasets into separate pandas DataFrames. Each dataset serves a specific purpose:
- **Movies.csv**: Contains core movie information, including titles, genres, and user scores.
- **FilmDetails.csv**: Provides additional details such as directors and budget.
- **MoreInfo.csv**: Includes formatted budget, revenue, and runtime information.
- **PosterPath.csv**: Contains URLs for movie posters and backdrops.

In [2]:
# Load datasets
movies = pd.read_csv("Movies.csv")
film_details = pd.read_csv("FilmDetails.csv")
more_info = pd.read_csv("MoreInfo.csv")
poster_path = pd.read_csv("PosterPath.csv")

**3. Merging Datasets**

After loading the datasets, we merge them sequentially on the common 'id' column. This creates a unified DataFrame that combines all relevant movie information, ensuring consistency and ease of access for further processing.


In [3]:
# Merge datasets
data = pd.merge(movies, film_details, on="id", how="left")
data = pd.merge(data, more_info, on="id", how="left")
data = pd.merge(data, poster_path, on="id", how="left")
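
A left merge silently multiplies rows when the right-hand table repeats a key, and any duplicated row would later appear as a duplicate recommendation. The sketch below shows one way to guard against that using the `validate` and `indicator` parameters of `pd.merge`. The toy frames are invented for illustration, and whether the real CSV files contain duplicate ids is not known from the text:

```python
import pandas as pd

# Toy frames (invented); the right-hand frame repeats id 2 and lacks id 3.
movies_toy = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
details_toy = pd.DataFrame({"id": [1, 2, 2], "director": ["X", "Y", "Y"]})

# Drop repeated keys on the right so the left merge cannot add rows.
details_unique = details_toy.drop_duplicates(subset="id")

merged = pd.merge(
    movies_toy,
    details_unique,
    on="id",
    how="left",
    validate="one_to_one",  # raises MergeError if either side repeats an id
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Row count is unchanged, and _merge shows which movies found no match.
print(len(merged) == len(movies_toy))
print(merged["_merge"].value_counts())
```

Rows marked `left_only` are movies with no matching record in the right-hand table; their added columns are NaN, which is what the missing-value handling in the next step deals with.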

In this section, the dataset is prepared for building a recommendation system by addressing missing values and creating a composite feature for analysis.

**Summary:**
- **Handling Missing Values:** Missing values in the columns 'genres', 'director', 'top_billed', and 'user_score' are filled with default values so that the string concatenation and vectorization steps do not fail.
  - 'genres', 'director', and 'top_billed' are filled with "Unknown" to indicate the absence of information. Movies that share this placeholder also share a token, so they will look slightly more similar to each other than they really are.
  - 'user_score' is filled with 0 as a placeholder. Because 0 reads as a very low rating rather than a missing one, it should not be treated as a genuine score; 'user_score' is not part of the similarity features.
- **Feature Combination:** A new column, 'features', is created by concatenating the text-based columns 'genres', 'director', and 'top_billed'. This combined string is the input to the vectorizer in the next step.

In [4]:
### Data Cleaning and Feature Engineering
# Fill missing values
data["genres"] = data["genres"].fillna("Unknown")
data["director"] = data["director"].fillna("Unknown")
data["top_billed"] = data["top_billed"].fillna("Unknown")
data["user_score"] = data["user_score"].fillna(0)

# Combine relevant columns into a single string for feature extraction
data["features"] = data["genres"] + " " + data["director"] + " " + data["top_billed"]
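
One consequence of plain concatenation is that `CountVectorizer` splits on whitespace, so "Christopher Nolan" becomes the two tokens "christopher" and "nolan", and two unrelated people named Christopher would count as a match. A common variation, sketched here, is to collapse each name into a single token first. It assumes the list-like columns are stored as comma-separated strings, which the text does not confirm, and `to_tokens` is a hypothetical helper:

```python
import pandas as pd


def to_tokens(cell):
    """Turn 'Christopher Nolan, Emma Thomas' into 'christophernolan emmathomas'."""
    if not isinstance(cell, str):
        return "unknown"  # NaN or other non-string input
    names = [name.strip() for name in cell.split(",") if name.strip()]
    if not names:
        return "unknown"
    # Lower-case and remove inner spaces so each name is one token
    return " ".join(name.lower().replace(" ", "") for name in names)


# Toy row for illustration; the column format is an assumption
toy = pd.DataFrame({
    "genres": ["Science Fiction, Drama"],
    "director": ["Christopher Nolan"],
    "top_billed": ["Matthew McConaughey, Anne Hathaway"],
})

toy["features"] = (
    toy["genres"].map(to_tokens)
    + " " + toy["director"].map(to_tokens)
    + " " + toy["top_billed"].map(to_tokens)
)
print(toy["features"].iloc[0])
```

Whether this improves the recommendations depends on the data; it trades partial-name matches for exact-person matches.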

### Build Similarity Matrix

This step transforms the combined features into a numeric form suitable for computing movie similarity. Using a bag-of-words representation of each movie's feature string, we build a similarity matrix that quantifies how closely related each pair of movies is.

**Summary:**
- **Feature Vectorization:** The `CountVectorizer` is used to convert the text in the 'features' column into a matrix of token counts. This step transforms textual data into a structured numerical representation.
- **Cosine Similarity Computation:** The `cosine_similarity` function is then applied to this matrix to calculate the pairwise similarity between movies. The resulting similarity matrix will serve as the backbone for the recommendation algorithm.



In [5]:
### Build Similarity Matrix
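# Vectorize the combined features; fit_transform returns a sparse matrix of token counts
feature_matrix = CountVectorizer().fit_transform(data["features"])
# Pairwise cosine similarity between all movies (square matrix, one row per movie)
similarity_matrix = cosine_similarity(feature_matrix)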
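
To make the numbers in the similarity matrix concrete, here is a small worked example on three invented feature strings. Cosine similarity is the dot product of two count vectors divided by the product of their lengths, so two movies that share two of their three tokens score 2 / (√3 · √3) = 2/3, and movies with no tokens in common score 0:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented feature strings, three tokens each
toy_features = [
    "drama nolan bale",    # movie 0
    "drama nolan caine",   # movie 1: shares two tokens with movie 0
    "comedy smith jones",  # movie 2: shares no tokens with the others
]

count_matrix = CountVectorizer().fit_transform(toy_features)
sim = cosine_similarity(count_matrix)

print(sim.shape)            # one row and one column per movie
print(round(sim[0, 1], 4))  # two shared tokens out of three
print(sim[0, 2])            # no shared tokens
```

The matrix is symmetric with ones on the diagonal, because every movie is identical to itself.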

### Recommendation Function

The recommendation function identifies and returns a list of movies similar to the input movie title. It utilizes the similarity matrix computed earlier and is designed to handle both valid and invalid movie title inputs gracefully.

**Summary:**
- **Input Validation:** The function first checks whether the input movie title exists in the dataset. If it does not, the function returns a one-element list containing an error message.
- **Similarity Scoring:** For a valid title, the function locates its index in the dataset and retrieves the similarity scores from the similarity matrix.
- **Sorting and Selection:** The similarity scores are sorted in descending order to identify the most similar movies. The top five recommendations are selected and returned, excluding the input movie itself.



In [6]:
### Recommendation Function
def recommend(movie_title, data, similarity_matrix):
    """
    Recommend movies based on similarity to the input movie title.

    Parameters:
    - movie_title (str): Title of the movie to base recommendations on.
    - data (DataFrame): The dataset containing movie information.
    - similarity_matrix (ndarray): Precomputed similarity matrix.

    Returns:
    - list: A list of recommended movie titles.
    """
    if movie_title not in data["title"].values:
        return ["Movie title not found in dataset."]

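    # Positional index of the first row whose title matches
    idx = np.flatnonzero((data["title"] == movie_title).to_numpy())[0]
    similarity_scores = list(enumerate(similarity_matrix[idx]))
    sorted_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Exclude the input movie explicitly rather than assuming it sorts first
    recommended_movies = [data.iloc[i]["title"] for i, _ in sorted_scores if i != idx][:5]
    return recommended_movies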
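
The lookup, sort, and select steps can also be written with NumPy's `argsort`, which avoids building a Python list of every score. The following is a sketch of that variation on toy data, not the notebook's own function. The name `recommend_top_k`, the `k` parameter, and the choice to return scores alongside titles are all assumptions made for illustration:

```python
import numpy as np
import pandas as pd


def recommend_top_k(movie_title, data, similarity_matrix, k=5):
    """Return up to k (title, score) pairs, most similar first."""
    matches = np.flatnonzero((data["title"] == movie_title).to_numpy())
    if matches.size == 0:
        return []  # unknown title: empty result instead of a message
    pos = matches[0]
    scores = np.asarray(similarity_matrix[pos], dtype=float)
    # Stable sort on negated scores gives descending order, ties by position
    order = np.argsort(-scores, kind="stable")
    order = order[order != pos][:k]  # drop the query movie itself
    return [(data["title"].iloc[i], float(scores[i])) for i in order]


# Toy data, invented for illustration
toy = pd.DataFrame({"title": ["A", "B", "C", "D"]})
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])

print(recommend_top_k("A", toy, toy_sim, k=2))
print(recommend_top_k("Missing", toy, toy_sim))
```

Returning an empty list for an unknown title lets the caller distinguish "no such movie" from a real recommendation without inspecting the strings.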

### Display Posters

The poster display function provides a visual representation of recommended movies by showing their respective posters. This enhances user engagement by associating the recommendations with their visual identifiers.

**Summary:**
- **Poster Retrieval:** For each movie in the recommendation list, the function fetches the corresponding poster URL from the dataset.
- **Poster Display:** Using the `Image` class from `IPython.display`, the posters are displayed within the notebook. If a poster URL is missing, a fallback message is printed instead.



In [7]:
### Display Posters
def show_posters(movie_list, data):
    """
    Display posters of recommended movies.

    Parameters:
    - movie_list (list): List of recommended movie titles.
    - data (DataFrame): Dataset containing movie information and poster links.
    """
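    for movie in movie_list:
        matches = data.loc[data['title'] == movie, 'poster_path']
        # Titles absent from the dataset yield no rows; treat them as missing
        poster_url = matches.iloc[0] if not matches.empty else None
        # Missing URLs load as NaN, which is truthy, so test with pd.notna
        if pd.notna(poster_url) and poster_url:
            display(Image(url=poster_url))
        else:
            print(f"Poster not available for {movie}.")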
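
Displaying posters one at a time stacks them vertically. As an optional variation, the retrieval and fallback logic can instead build a single HTML string that lays the posters out side by side. The sketch below only constructs the string, so it runs outside a notebook too; in a notebook it could be passed to `IPython.display.HTML`. The function name `build_poster_gallery`, the `width` parameter, and the example URL are hypothetical:

```python
import html

import pandas as pd


def build_poster_gallery(movie_list, data, width=150):
    """Return an HTML snippet with one poster (or fallback note) per title."""
    # One poster URL per title; the first row wins if a title repeats
    lookup = data.drop_duplicates(subset="title").set_index("title")["poster_path"]
    cells = []
    for title in movie_list:
        url = lookup.get(title)  # None when the title is not in the dataset
        safe_title = html.escape(str(title))
        if isinstance(url, str) and url.strip():
            cells.append(
                '<figure style="display:inline-block;margin:4px">'
                f'<img src="{html.escape(url, quote=True)}" width="{width}" alt="{safe_title}">'
                f"<figcaption>{safe_title}</figcaption></figure>"
            )
        else:
            # Covers NaN, empty strings, and unknown titles
            cells.append(f"<p>Poster not available for {safe_title}.</p>")
    return "\n".join(cells)


# Toy data with a placeholder URL and one missing poster
toy_posters = pd.DataFrame({
    "title": ["A", "B"],
    "poster_path": ["https://example.com/a.jpg", float("nan")],
})
gallery = build_poster_gallery(["A", "B", "Z"], toy_posters)
print(gallery)
```

Escaping titles and URLs with `html.escape` keeps unusual characters in the data from breaking the generated markup.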

This section demonstrates the end-to-end execution of the recommendation system. By specifying a movie title, we generate and display recommendations along with their posters.

**Summary:**
- **Input Movie Title:** The title of a movie is specified to base recommendations on. Replace "Interstellar" with any valid movie title from your dataset.
- **Generate Recommendations:** The `recommend` function is called to retrieve a list of similar movies.
- **Display Results:** Recommended movie titles are printed, and their posters are displayed using the `show_posters` function.

In [10]:
## Example Execution
movie_to_recommend = "Interstellar"  # Replace with a valid movie title from your dataset
recommended_movies = recommend(movie_to_recommend, data, similarity_matrix)

print("Recommendations for:", movie_to_recommend)
print(recommended_movies)

# Display posters of recommended movies
show_posters(recommended_movies, data)

Recommendations for: Interstellar
['The Prestige', 'The Martian', 'Light of My Life', 'The Man Who Would Be King', 'Gerry']


## Summary
This system recommends movies by vectorizing each title's genres, director, and top-billed cast and ranking all other titles by cosine similarity to the selected movie. Displaying poster images alongside the recommended titles makes the results easier to browse within the notebook.