# Bollywood Movies Recommendation System

This repository contains code for building a recommendation system for Bollywood movies. The system utilizes a combination of collaborative filtering and genre-based recommendations.

## Table of Contents
1. [Introduction](#introduction)
2. [Setup](#setup)
3. [Data Preprocessing](#data-preprocessing)
    - [Loading the Dataset](#loading-the-dataset)
    - [Displaying Dataset Statistics](#displaying-dataset-statistics)
4. [Exploratory Data Analysis (EDA)](#exploratory-data-analysis)
    - [Setting a Custom Color Palette](#setting-a-custom-color-palette)
    - [Histogram for Rating Distribution](#histogram-for-rating-distribution)
    - [Pie Chart for Rating Distribution](#pie-chart-for-rating-distribution)
    - [Bar Chart for Number of Movies Released Each Year](#bar-chart-for-number-of-movies-released-each-year)
    - [Line Chart for Relationship between Year and Average Rating](#line-chart-for-relationship-between-year-and-average-rating)
    - [Boxplot for Ratings Over the Years](#boxplot-for-ratings-over-the-years)
5. [Further Data Preprocessing](#further-data-preprocessing)
6. [TF-IDF Vectorization](#tf-idf-vectorization)
7. [Collaborative Filtering using Surprise](#collaborative-filtering-using-surprise)
8. [Genre-Based Recommendations](#genre-based-recommendations)

## Introduction <a name="introduction"></a>

This recommendation system is built for Bollywood movies, incorporating collaborative filtering and genre-based recommendations. The code includes data preprocessing, exploratory data analysis, and the implementation of a recommendation model using the Surprise library.

## Setup <a name="setup"></a>

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import nltk
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from surprise.dataset import DatasetAutoFolds
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from nltk.sentiment import SentimentIntensityAnalyzer
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
# Alias the PyTorch Dataset so it does not shadow surprise.Dataset imported above
from torch.utils.data import Dataset as TorchDataset, DataLoader



## Data Preprocessing <a name="data-preprocessing"></a>

1. **Loading the Dataset**
   - Load the dataset using pandas from a CSV file.

2. **Displaying Dataset Statistics**
   - Examine dataset summary statistics and identify missing values.
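The missing-value check described above can be sketched on a toy frame. The `df.info()` output for the real dataset reports 7066 non-null `year` values out of 7419 rows, so 353 years are missing. The frame `toy` and its values below are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with the same columns as the dataset (values are illustrative)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": ["A", "B", "C", "D"],
    "year": [2023.0, None, 2019.0, 2009.0],
    "rating": [7.3, 0.0, 5.5, 0.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Share of rows without a release year
missing_year_share = toy["year"].isna().mean()

# Ratings of exactly 0 are worth counting separately, since they may
# be placeholders rather than genuine scores (an assumption)
zero_ratings = int((toy["rating"] == 0).sum())
```

The same three expressions should work unchanged on `df` once it is loaded.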


In [2]:
#Loading the dataset:
dataset_path = '/kaggle/input/bollywood-movies-dataset/bollywood_movies.csv'
df = pd.read_csv(dataset_path)

In [3]:
df.describe()

| | id | year | rating |
|---|---|---|---|
| count | 7419.0 | 7066.0 | 7419.0 |
| mean | 539832.7 | 1997.956977 | 3.29969 |
| std | 349253.7 | 22.052941 | 3.329836 |
| min | 480.0 | 1913.0 | 0.0 |
| 25% | 276092.5 | 1983.25 | 0.0 |
| 50% | 485383.0 | 2004.0 | 3.5 |
| 75% | 832671.5 | 2017.0 | 6.2 |
| max | 1218602.0 | 2026.0 | 10.0 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7419 entries, 0 to 7418
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   id      7419 non-null   int64  
 1   title   7419 non-null   object 
 2   year    7066 non-null   float64
 3   rating  7419 non-null   float64
dtypes: float64(2), int64(1), object(1)
memory usage: 232.0+ KB


In [5]:
df.head(3)

| | id | title | year | rating |
|---|---|---|---|---|
| 0 | 872906 | Jawan | 2023.0 | 7.3 |
| 1 | 554600 | Uri: The Surgical Strike | 2019.0 | 7.2 |
| 2 | 781732 | Animal | 2023.0 | 7.4 |


In [6]:
df.tail(3)

| | id | title | year | rating |
|---|---|---|---|---|
| 7416 | 54098 | Soch Lo | 2010.0 | 5.5 |
| 7417 | 46402 | Kisse Pyaar Karoon | 2009.0 | 0.0 |
| 7418 | 21757 | Toss | 2009.0 | 0.0 |


## Exploratory Data Analysis (EDA) <a name="exploratory-data-analysis"></a>

1. **Setting a Custom Color Palette**
   - Set a custom color palette using seaborn.

2. **Histogram for Rating Distribution**
   - Visualize the distribution of movie ratings with a histogram.

3. **Pie Chart for Rating Distribution**
   - Display a pie chart illustrating the distribution of movie ratings.

4. **Bar Chart for Number of Movies Released Each Year**
   - Create a bar chart showing the number of movies released each year.

5. **Line Chart for Relationship between Year and Average Rating**
   - Generate a line chart depicting the relationship between release year and average rating.

6. **Boxplot for Ratings Over the Years**
   - Create a boxplot to visualize the distribution of ratings over the years.
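The `df.describe()` output shows a 25th-percentile rating of 0.0, so at least a quarter of the rows carry a rating of 0. If those zeros mark unrated titles (an assumption; the source does not say), they pull the per-year averages down. The sketch below compares the yearly mean with and without zero ratings on made-up `(year, rating)` pairs; `mean_rating_by_year` is a hypothetical helper.

```python
from collections import defaultdict

# Illustrative (year, rating) pairs; not taken from the dataset
rows = [(2009, 0.0), (2009, 6.0), (2010, 5.5),
        (2023, 7.3), (2023, 7.4), (2023, 0.0)]

def mean_rating_by_year(pairs, skip_zero=False):
    """Return {year: mean rating}, optionally ignoring ratings equal to 0."""
    totals = defaultdict(lambda: [0.0, 0])  # year -> [sum, count]
    for year, rating in pairs:
        if skip_zero and rating == 0:
            continue
        totals[year][0] += rating
        totals[year][1] += 1
    return {year: s / n for year, (s, n) in sorted(totals.items())}

with_zeros = mean_rating_by_year(rows)
without_zeros = mean_rating_by_year(rows, skip_zero=True)
```

In this toy data the 2009 average doubles once the zero is excluded, which is the kind of shift worth checking before reading trends off the line chart or boxplot.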


### Setting a Custom Color Palette <a name="setting-a-custom-color-palette"></a>

In [7]:
# Set a custom color palette
custom_palette = sns.color_palette("pastel")

### Histogram for Rating Distribution: <a name="histogram-for-rating-distribution"></a>

In [8]:
import plotly.express as px

# Distribution of Ratings
fig1 = px.histogram(df, x='rating', nbins=30, title='Distribution of Ratings', labels={'rating': 'Rating'})
fig1.show()

### Pie Chart for Rating Distribution: <a name="pie-chart-for-rating-distribution"></a>

In [9]:
# Pie chart for the distribution of movie ratings
ratings_distribution = df['rating'].value_counts().reset_index()
ratings_distribution.columns = ['Rating', 'Count']
fig2 = px.pie(ratings_distribution, names='Rating', values='Count', title='Distribution of Movie Ratings', 
              color_discrete_sequence=px.colors.qualitative.Pastel)
fig2.update_layout(height=1000, width=1000)  # Adjust the height and width as needed
fig2.show()
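Because `value_counts()` on a continuous rating produces one slice per distinct rating value, the pie chart can become crowded. One option is to group ratings into fixed-width buckets first. This is a minimal sketch with illustrative values; `rating_bucket` and the bucket width of 2 are assumptions, not part of the notebook.

```python
from collections import Counter

# Illustrative ratings on the 0-10 scale; not taken from the dataset
ratings = [0.0, 0.0, 3.5, 5.5, 5.7, 6.2, 7.2, 7.3, 7.4, 10.0]

def rating_bucket(rating, width=2):
    """Map a 0-10 rating to a label such as '6-8'; 10 falls in the top bucket."""
    low = min(int(rating // width) * width, 10 - width)
    return f"{low}-{low + width}"

# Number of movies in each bucket
bucket_counts = Counter(rating_bucket(r) for r in ratings)
```

The bucket labels and counts can then be used as the `names` and `values` of the pie chart in place of the raw rating values.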


### Bar Chart for Number of Movies Released Each Year: <a name="bar-chart-for-number-of-movies-released-each-year"></a>


In [10]:
# Count movies per year; reset_index turns the counts into a two-column DataFrame
df_year_counts = df['year'].value_counts().reset_index()
df_year_counts.columns = ['Year', 'Number of Movies']

# Number of Movies Released Each Year
fig3 = px.bar(df_year_counts, x='Year', y='Number of Movies',
              title='Number of Movies Released Each Year',
              color='Number of Movies',
              color_continuous_scale=px.colors.sequential.Plasma)

fig3.update_xaxes(tickangle=45)
fig3.show()


### Line Chart for Relationship between Year and Average Rating: <a name="line-chart-for-relationship-between-year-and-average-rating"></a>

In [11]:
# Relationship between Year and Average Rating
fig4 = px.line(df.groupby('year')['rating'].mean().reset_index(), x='year', y='rating', 
               title='Relationship between Year and Average Rating', labels={'year': 'Year', 'rating': 'Average Rating'},
               line_shape='linear', line_dash_sequence=["solid"])
fig4.update_xaxes(tickangle=45)
fig4.show()

### Boxplot for Ratings Over the Years: <a name="boxplot-for-ratings-over-the-years"></a>

In [12]:
# Boxplot of Ratings Over the Years
fig5 = px.box(df, x='year', y='rating', title='Boxplot of Ratings Over the Years', 
              labels={'year': 'Year', 'rating': 'Rating'}, color='year', color_discrete_sequence=px.colors.qualitative.Pastel)
fig5.show()

## Further Data Preprocessing <a name="further-data-preprocessing"></a>

1. **Dropping Duplicate Rows**
   - Remove duplicate rows from the dataset.

2. **Handling Missing Values**
   - Fill missing values in the 'year' column with the median and convert to integer.

3. **Text Representation for TF-IDF**
   - Create a 'text' column for TF-IDF vectorization. The dataset has no 'overview' column, so the text is the movie title alone.
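The three preprocessing steps above can be tried on a small frame before running them on the full dataset. The frame `toy` is illustrative. The `reset_index(drop=True)` call is an addition not present in the notebook; it keeps index labels equal to row positions after duplicates are dropped.

```python
import pandas as pd

# Illustrative rows: one exact duplicate and one missing year
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 4],
    "title": ["A", "A", "B", "C", "D"],
    "year": [2023.0, 2023.0, None, 2019.0, 2009.0],
})

# 1. Drop exact duplicate rows
toy = toy.drop_duplicates().reset_index(drop=True)

# 2. Fill missing years with the median (NaN is skipped when computing it),
#    then convert the column to integers
median_year = toy["year"].median()
toy["year"] = toy["year"].fillna(median_year).astype(int)

# 3. Text feature for TF-IDF: the title alone
toy["text"] = toy["title"]
```

Median imputation keeps the column usable, but the filled rows all land on a single year, which will show up as a spike in any per-year chart drawn after this step.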


In [13]:
# Dropping duplicate rows
df.drop_duplicates(inplace=True)

In [14]:
# Handling missing values in the 'year' column by filling with the median and converting to integer
df['year'] = df['year'].fillna(df['year'].median()).astype(int)

In [15]:
# 'year' is already an integer after the previous cell; this cast is kept as a safeguard
df['year'] = df['year'].astype(int)

In [16]:
# The dataset has no 'overview' column, so the text feature is the title alone
# If an 'overview' column becomes available, concatenate it with 'title' here
df['text'] = df['title']

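## TF-IDF Vectorization <a name="tf-idf-vectorization"></a>

1. **Text Vectorization**
   - Convert movie titles to TF-IDF vectors and compute pairwise similarity between them. Because `TfidfVectorizer` L2-normalizes each row by default, the `linear_kernel` dot product equals cosine similarity.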
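To make the computation concrete, here is a dependency-free sketch of TF-IDF weighting followed by cosine similarity. It uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalization, which correspond to scikit-learn's default settings, but it simplifies tokenization to a lowercase whitespace split and removes no stop words. The helper names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter

# Three titles that appear in the notebook's outputs
titles = ["Shuddh Desi Romance", "Midnight Romance", "Toss"]

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed IDF
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())

vecs = tfidf_vectors(titles)
sim = [[cosine(a, b) for b in vecs] for a in vecs]
```

The two titles containing "Romance" get a positive similarity, while "Toss" shares no terms with either and scores 0. With short titles most pairs share no words, so the similarity matrix is mostly zeros.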


In [17]:
# TF-IDF Vectorization
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(df['text'])

In [18]:
# Calculate similarity scores between movies
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [19]:
# Create a Reader object for collaborative filtering
reader = Reader(rating_scale=(0, 10))

In [20]:
# The dataset has no per-user ratings, so every row is assigned to one placeholder user.
# With a single user, SVD cannot learn patterns across users; substitute real user IDs if available.
df['user_id'] = 1

## Collaborative Filtering using Surprise <a name="collaborative-filtering-using-surprise"></a>

1. **Loading Data for Surprise**
   - Load the dataset for collaborative filtering using the Surprise library.

2. **SVD for Collaborative Filtering**
   - Implement collaborative filtering using Singular Value Decomposition (SVD).

3. **Making Movie Recommendations**
   - Get top N collaborative filtering recommendations based on predicted ratings.
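Surprise's `SVD` predicts a rating as the global mean plus a user bias, an item bias, and the dot product of user and item factor vectors, all learned by stochastic gradient descent. The sketch below is a simplified pure-Python version of that idea, not Surprise's implementation. The function `train_mf`, its hyperparameters, and the multi-user ratings are all illustrative assumptions.

```python
import math
import random

# Illustrative (user, movie, rating) triples with several users; not from the dataset
ratings = [
    ("u1", "m1", 8.0), ("u1", "m2", 3.0),
    ("u2", "m1", 7.0), ("u2", "m3", 9.0),
    ("u3", "m2", 2.0), ("u3", "m3", 8.0),
]

def train_mf(triples, n_factors=2, n_epochs=200, lr=0.005, reg=0.02, seed=0):
    """Fit biases and latent factors with SGD; return a predict(user, item) function."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in triples) / len(triples)  # global mean rating
    users = sorted({u for u, _, _ in triples})
    items = sorted({i for _, i, _ in triples})
    bu = {u: 0.0 for u in users}  # user biases
    bi = {i: 0.0 for i in items}  # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in triples:
            err = r - predict(u, i)
            # Regularized gradient steps on biases and factors
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return predict

def rmse(predict, triples):
    """Root mean squared error of predict over the given triples."""
    return math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in triples) / len(triples))

global_mean = sum(r for _, _, r in ratings) / len(ratings)
baseline_rmse = rmse(lambda u, i: global_mean, ratings)
model = train_mf(ratings)
trained_rmse = rmse(model, ratings)
```

The factors are useful only when several users rate overlapping movies. With one user and one rating per movie, there is nothing for the user and item factors to generalize from.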


In [21]:
from surprise import Dataset

# Load the dataset for collaborative filtering using Surprise
data = Dataset.load_from_df(df[['user_id', 'id', 'rating']], reader)

In [22]:
# Use SVD (Singular Value Decomposition) for collaborative filtering
svd = SVD()
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7826ef57aad0>

In [23]:
# Get all movie IDs
all_movie_ids = df['id'].unique()


In [24]:
# Make predictions for all movies
testset = [[1, movie_id, 0] for movie_id in all_movie_ids]  # Assuming user ID 1
predictions = svd.test(testset)

In [25]:
# Sort the predictions based on estimated ratings
predicted_ratings = [(prediction.iid, prediction.est) for prediction in predictions]
predicted_ratings.sort(key=lambda x: x[1], reverse=True)

In [26]:
# Get top N collaborative filtering recommendations
n_recommendations = 5
collaborative_filtering_recommendations = predicted_ratings[:n_recommendations]


In [27]:
# Display the collaborative filtering recommendations
print("\nCollaborative Filtering Recommendations:")
for i, (movie_id, rating) in enumerate(collaborative_filtering_recommendations):
    movie_title = df[df['id'] == movie_id]['title'].values[0]
    print(f"{i + 1}. {movie_title}: Predicted Rating - {rating:.2f}")



Collaborative Filtering Recommendations:
1. Ek Aasha: Predicted Rating - 9.63
2. Khoon Ki Pukaar: Predicted Rating - 9.63
3. Aar Paar: Predicted Rating - 9.62
4. Nirmal Anand Ki Puppy: Predicted Rating - 9.62
5. Chuskit: Predicted Rating - 9.62


## Genre-Based Recommendations <a name="genre-based-recommendations"></a>

The dataset has no genre column, so this section approximates genre-based recommendations by searching movie titles for a genre keyword.

1. **Genre-Based Recommendations Function**
   - Define a function to get top N movie recommendations based on user preferences (genres).

2. **Example: User Preferences for Romance**
   - Get genre-based recommendations for a user expressing a preference for romantic movies.

3. **Display Recommendations**
   - Display collaborative filtering and genre-based movie recommendations.
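The keyword matching behind this approach can be sketched without pandas. The example also shows one way to exclude already-seen movies by comparing movie IDs with movie IDs, rather than mixing IDs with row positions. The helper `keyword_matches`, the `seen_ids` parameter, and the small integer IDs are hypothetical; the titles and ratings come from the notebook's outputs.

```python
# Titles and ratings from the notebook's outputs; the IDs are made up for the example
movies = [
    {"id": 1, "title": "Shuddh Desi Romance", "rating": 5.7},
    {"id": 2, "title": "Midnight Romance", "rating": 0.0},
    {"id": 3, "title": "Toss", "rating": 0.0},
    {"id": 4, "title": "Jawan", "rating": 7.3},
]

def keyword_matches(movies, keywords, seen_ids=frozenset()):
    """Return movies whose title contains any keyword, skipping IDs in seen_ids."""
    lowered = [k.lower() for k in keywords]
    return [
        m for m in movies
        if m["id"] not in seen_ids
        and any(k in m["title"].lower() for k in lowered)
    ]

matches = keyword_matches(movies, ["Romance"])
unseen = keyword_matches(movies, ["Romance"], seen_ids={1})
```

Title matching only finds movies that name the genre in their title, so a romantic film titled "Jawan" or "Toss" would never be returned. A real genre column would be needed for genuine genre-based recommendations.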


In [28]:
# Function to get top N movie recommendations from preference keywords (a stand-in for genres)
def get_genre_based_recommendations(user_preferences, n_recommendations=5):
    # Collect row positions of movies whose title text contains any preference keyword
    similar_movies = set()
    for genre in user_preferences:
        genre_indices = [i for i, text in enumerate(df['text']) if genre.lower() in text.lower()]
        similar_movies.update(genre_indices)

    # No "already rated" filter is applied: every row belongs to the placeholder user,
    # so filtering out rated movies would remove every candidate

    # Score each candidate by its total similarity to all movies (row sum of cosine_sim)
    movie_scores = [(index, sum(cosine_sim[index])) for index in similar_movies]

    # Get top N recommendations based on the summed similarity scores
    user_recommendations = sorted(movie_scores, key=lambda x: x[1], reverse=True)[:n_recommendations]

    # Convert row positions back to movie titles and ratings
    recommended_movies = [(df.iloc[index]['title'], df.iloc[index]['rating']) for index, _ in user_recommendations]

    return recommended_movies

In [29]:
# Example: User expresses a preference for romantic movies
user_preferences = ['Romance']


In [30]:
# Get genre-based recommendations for the user
genre_based_recommendations = get_genre_based_recommendations(user_preferences)


In [31]:
# Display the collaborative filtering and genre-based recommendations
print("\nCollaborative Filtering Recommendations:")
for i, (movie_id, rating) in enumerate(collaborative_filtering_recommendations):
    movie_title = df[df['id'] == movie_id]['title'].values[0]
    print(f"{i + 1}. {movie_title}: Predicted Rating - {rating:.2f}")

print(f"\nTop {n_recommendations} Genre-Based Recommended Movies for {', '.join(user_preferences)}:")
for i, (movie, rating) in enumerate(genre_based_recommendations):
    print(f"{i + 1}. {movie}: Rating - {rating:.2f}")



Collaborative Filtering Recommendations:
1. Ek Aasha: Predicted Rating - 9.63
2. Khoon Ki Pukaar: Predicted Rating - 9.63
3. Aar Paar: Predicted Rating - 9.62
4. Nirmal Anand Ki Puppy: Predicted Rating - 9.62
5. Chuskit: Predicted Rating - 9.62

Top 5 Genre-Based Recommended Movies for Romance:
1. Untitled Sunil Pandey/Junaid Khan Supernatural Romance: Rating - 0.00
2. Shuddh Desi Romance: Rating - 5.70
3. Midnight Romance: Rating - 0.00
4. Romance: Rating - 0.00


## Conclusion and Takeaways

- Collaborative filtering is implemented with Singular Value Decomposition (SVD) from the Surprise library. Because the dataset has no per-user ratings and every row is assigned to one placeholder user, the model sees a single rating per movie and cannot learn patterns across users. Real user-item ratings are needed before its effectiveness can be judged.
- Genre-based recommendations are approximated by keyword matching on movie titles, because the dataset has no genre column. Candidates are ranked by their summed TF-IDF cosine similarity to other titles.
- Exploratory Data Analysis (EDA) offers valuable insights into the distribution of movie ratings, annual release patterns, and the relationship between release year and average rating.
- Data preprocessing steps include quality assurance through handling missing values and removing duplicate records. Feature engineering involves converting the release year to integers for analysis.
- Continuous model evaluation, fine-tuning, and potential feature expansion, such as incorporating user demographics or movie genres, can further improve the recommendation algorithm's performance.
- The Bollywood Movies Recommendation System lays the groundwork for personalized movie recommendation platforms, adaptable to diverse user preferences.
- Ongoing refinement is essential for maintaining the system's relevance and accuracy in providing movie suggestions to users.
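The evaluation mentioned above could start from a hold-out split and two error metrics. This is a minimal sketch on made-up rating triples, scoring a global-mean baseline that any trained model should beat. The helpers `holdout_split` and `rmse_and_mae` are hypothetical names and are not part of Surprise.

```python
import math
import random

def holdout_split(triples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the triples and split it into (train, test)."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def rmse_and_mae(predict, triples):
    """Return (RMSE, MAE) of predict(user, item) over the given triples."""
    errors = [r - predict(u, i) for u, i, r in triples]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae

# Illustrative (user, movie, rating) triples; not from the dataset
ratings = [
    ("u1", "m1", 8.0), ("u1", "m2", 3.0), ("u2", "m1", 7.0), ("u2", "m3", 9.0),
    ("u3", "m2", 2.0), ("u3", "m3", 8.0), ("u4", "m1", 6.0), ("u4", "m2", 4.0),
]

train, test = holdout_split(ratings)

# Baseline: always predict the mean rating of the training split
train_mean = sum(r for _, _, r in train) / len(train)
baseline_rmse, baseline_mae = rmse_and_mae(lambda u, i: train_mean, test)
```

Scoring on ratings the model never saw is what separates a useful recommender from one that only memorizes its training data.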
