<a href="https://colab.research.google.com/github/arghads9177/recommendation-system-imdb-movies/blob/master/imdb_movies_recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Recommendation System

## About Dataset

IMDb is one of the main sources people use to judge a movie or show, and its ratings play an important role in what many people choose to watch. I watched The Shawshank Redemption after finding out that it is at the top of the IMDb list.

The IMDb Top 250 Movies dataset provides a comprehensive overview of some of the best-rated movies of all time, as per IMDb ratings. This dataset includes a variety of attributes that offer a detailed description of each movie and the reviews provided by users.

## Data Dictionary

* **rank**: The rank of the movie according to IMDb ratings.
* **movie_id**: A unique identifier for each movie.
* **title**: The name of the movie.
* **year**: The year the movie was released.
* **link**: The URL link to the movie's IMDb page.
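* **imbd_votes**: The number of votes the movie has received on IMDb (the column name is spelled `imbd_votes` in the dataset).
* **imbd_rating**: The rating of the movie as per IMDb (the column name is spelled `imbd_rating` in the dataset).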
* **certificate**: The certification rating of the movie (e.g., PG-13, R).
* **duration**: The duration of the movie, stored as a string such as `2h 22m`.
* **genre**: The genre(s) of the movie.
* **cast_id**: A unique identifier for each cast member.
* **cast_name**: The name of the cast member.
* **director_id**: A unique identifier for the director.
* **director_name**: The name of the director.
* **writer_id**: A unique identifier for the writer.
* **writer_name**: The name of the writer.
* **storyline**: A brief summary of the movie's plot.
* **user_id**: Unique identifiers of the users who wrote reviews, stored as a comma-separated string per movie.
* **user_name**: The names of the users who wrote the reviews, comma-separated.
* **review_id**: Unique identifiers of the reviews, comma-separated.
* **review_title**: Short titles summarizing each review, comma-separated.
* **review_content**: The full content of the review.

## Problem Statement

In today's digital age, users are overwhelmed with the sheer volume of movie choices available across various streaming platforms. This abundance of options often leads to a paradox of choice, where users find it difficult to decide which movie to watch next. A personalized recommendation system can significantly enhance the user experience by suggesting movies that align with their tastes and preferences.

#### Objective:

Develop a movie recommendation system using the IMDb Top 250 Movies dataset that leverages the rich information provided, such as movie genres, cast, directors, user reviews, and ratings. The system should provide personalized recommendations to users based on their past viewing history and preferences.

#### Challenges:

* **Data Integration**: Combining various attributes like genre, cast, directors, and user reviews to create a comprehensive user profile and movie profile.
* **Similarity Calculation**: Using advanced similarity measures like cosine similarity to find movies that are similar to those the user has liked in the past.
* **Personalization**: Taking into account user reviews and ratings to tailor recommendations that are highly relevant to individual users.
* **Scalability**: Ensuring the system can handle a large number of users and movies without compromising on performance.

#### Proposed Solution:

* **Data Preprocessing**: Clean and preprocess the data to handle missing values and ensure consistency.
* **Feature Engineering**: Create a combined feature space for each movie that includes genres, cast and directors.
* **Cosine Similarity**: Compute the cosine similarity between movies to find those that are similar in terms of content and user reviews.
* **Recommendation Algorithm**: Develop an algorithm that recommends movies based on the computed similarities and user preferences.
* **Efficiency and Accuracy**: Measure the efficiency and accuracy of the recommendation algorithm.
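
The cosine-similarity step in the list above can be illustrated on toy vectors. This is a minimal sketch in plain Python, not the notebook's implementation (the notebook uses `cosine_similarity` from scikit-learn). The helper name `cosine` and the three user vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical users: each position is one movie, 0 means "no review".
user_a = [9.0, 0.0, 8.4]
user_b = [9.0, 0.0, 0.0]
user_c = [0.0, 8.1, 0.0]

sim_ab = cosine(user_a, user_b)  # one movie in common: strictly between 0 and 1
sim_ac = cosine(user_a, user_c)  # no movie in common: exactly 0.0
```

Two users with no reviewed movie in common always score 0, which is why a sparse user-by-movie matrix produces a similarity matrix dominated by zeros.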

## Load Necessary Libraries

In [3]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

## Load Dataset

In [4]:
import os
folder_path = "drive/MyDrive/Colab Notebooks/dscourse/data"
file_path = os.path.join(folder_path, "imdb_movies.csv")
df = pd.read_csv(file_path)

## Get Information About Dataset and Data

In [5]:
# Get the top 5 rows to understand what kind of data is present in the dataset
df.head()

Unnamed: 0,rank,movie_id,title,year,link,imbd_votes,imbd_rating,certificate,duration,genre,...,director_id,director_name,writer_id,writer_name,storyline,user_id,user_name,review_id,review_title,review_content
0,1,tt0111161,The Shawshank Redemption,1994,https://www.imdb.com/title/tt0111161,2711075,9.3,R,2h 22m,Drama,...,nm0001104,Frank Darabont,"nm0000175,nm0001104","Stephen King,Frank Darabont","Over the course of several years, two convicts...","ur16161013,ur15311310,ur0265899,ur16117882,ur1...","hitchcockthelegend,Sleepin_Dragon,EyeDunno,ale...","rw2284594,rw6606154,rw1221355,rw1822343,rw1288...","Some birds aren't meant to be caged.,An incred...",The Shawshank Redemption is written and direct...
1,2,tt0068646,The Godfather,1972,https://www.imdb.com/title/tt0068646,1882829,9.2,R,2h 55m,"Crime,Drama",...,nm0000338,Francis Ford Coppola,"nm0701374,nm0000338","Mario Puzo,Francis Ford Coppola",The aging patriarch of an organized crime dyna...,"ur24740649,ur86182727,ur15794099,ur15311310,ur...","CalRhys,andrewburgereviews,gogoschka-1,Sleepin...","rw3038370,rw4756923,rw4059579,rw6568526,rw1897...","The Pinnacle Of Flawless Films!,An offer so go...",'The Godfather' is the pinnacle of flawless fi...
2,3,tt0468569,The Dark Knight,2008,https://www.imdb.com/title/tt0468569,2684051,9.0,PG-13,2h 32m,"Action,Crime,Drama",...,nm0634240,Christopher Nolan,"tt0468569,nm0634300,nm0634240,nm0275286,tt0468569","Writers,Jonathan Nolan,Christopher Nolan,David...",When the menace known as the Joker wreaks havo...,"ur87850731,ur1293485,ur129557514,ur12449122,ur...","MrHeraclius,Smells_Like_Cheese,dseferaj,little...","rw5478826,rw1914442,rw6606026,rw1917099,rw5170...","The Dark Knight,The Batman of our dreams! So m...","Confidently directed, dark, brooding, and pack..."
3,4,tt0071562,The Godfather Part II,1974,https://www.imdb.com/title/tt0071562,1285350,9.0,R,3h 22m,"Crime,Drama",...,nm0000338,Francis Ford Coppola,"nm0000338,nm0701374","Francis Ford Coppola,Mario Puzo",The early life and career of Vito Corleone in ...,"ur0176092,ur0688559,ur92260614,ur0200644,ur117...","Nazi_Fighter_David,tfrizzell,umunir-36959,DanB...","rw0135607,rw0135487,rw5049900,rw0135526,rw0135...",Breathtaking in its scope and tragic grandeur....,"Coppola's masterpiece is rivaled only by ""The ..."
4,5,tt0050083,12 Angry Men,1957,https://www.imdb.com/title/tt0050083,800954,9.0,Approved,1h 36m,"Crime,Drama",...,nm0001486,Sidney Lumet,nm0741627,Reginald Rose,The jury in a New York City murder trial is fr...,"ur1318549,ur0643062,ur0688559,ur20552756,ur945...","uds3,tedg,tfrizzell,TheLittleSongbird,henrique...","rw0060044,rw0060025,rw0060034,rw2262425,rw5448...","The over-used term ""classic movie"" really come...",This once-in-a-generation masterpiece simply h...


In [6]:
# Get Number of rows and columns in the dataset
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

Number of rows: 250
Number of columns: 22


#### Checking Data Types

Checking the data type of each column lets us identify the categorical and numerical columns in the dataset.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 22 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   rank            250 non-null    int64  
 1   movie_id        250 non-null    object 
 2   title           250 non-null    object 
 3   year            250 non-null    int64  
 4   link            250 non-null    object 
 5   imbd_votes      250 non-null    object 
 6   imbd_rating     250 non-null    float64
 7   certificate     249 non-null    object 
 8   duration        250 non-null    object 
 9   genre           250 non-null    object 
 10  cast_id         250 non-null    object 
 11  cast_name       250 non-null    object 
 12  director_id     250 non-null    object 
 13  director_name   250 non-null    object 
 14  writer_id       250 non-null    object 
 15  writer_name     250 non-null    object 
 16  storyline       250 non-null    object 
 17  user_id         250 non-null    obj
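
The `df.info()` output shows `duration` and `imbd_votes` stored as `object` (string) columns. If they were needed as numbers, a conversion along the following lines might work. This is a sketch on a toy frame: the helper `duration_to_minutes` is hypothetical, and the thousands separators and the `n/a` entry are assumptions made for illustration, since the source does not show why `imbd_votes` is non-numeric.

```python
import re
import pandas as pd

# Toy frame mimicking the string formats seen in df.head(); values are illustrative.
toy = pd.DataFrame({
    "duration": ["2h 22m", "1h 36m", "3h", "45m"],
    "imbd_votes": ["2711075", "800954", "1,285,350", "n/a"],
})

def duration_to_minutes(text):
    """Convert strings such as '2h 22m', '3h' or '45m' to minutes; None if unparseable."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*", str(text))
    if match is None or not any(match.groups()):
        return None
    hours, minutes = match.groups()
    return int(hours or 0) * 60 + int(minutes or 0)

toy["duration_min"] = toy["duration"].map(duration_to_minutes)

# Strip assumed thousands separators, then coerce anything non-numeric to NaN.
toy["votes"] = pd.to_numeric(
    toy["imbd_votes"].str.replace(",", "", regex=False), errors="coerce"
)
```

Neither column is used by the similarity computations later in this notebook, so the conversion only matters if these features are added to the model.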

#### Missing Value Detection

Missing value detection is essential for checking the quality of the data. If missing values are present, impute them with appropriate values so that data quality is maintained for robust statistical analysis.

In [8]:
df.isnull().sum()

rank              0
movie_id          0
title             0
year              0
link              0
imbd_votes        0
imbd_rating       0
certificate       1
duration          0
genre             0
cast_id           0
cast_name         0
director_id       0
director_name     0
writer_id         0
writer_name       0
storyline         0
user_id           0
user_name         0
review_id         0
review_title      0
review_content    0
dtype: int64

#### Observations

* There is only 1 null value in the dataset and that is in the certificate feature.
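
One way to handle the single missing `certificate` is to fill it with a placeholder category. The sketch below uses a toy frame, and the label `"Not Rated"` is a hypothetical choice rather than something taken from the dataset.

```python
import pandas as pd

# Toy frame with one missing certificate, mirroring the situation above.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "certificate": ["R", None, "PG-13"],
})

# "Not Rated" is an assumed placeholder; any explicit category would do.
toy["certificate"] = toy["certificate"].fillna("Not Rated")

missing_after = int(toy["certificate"].isnull().sum())
```

Because `certificate` is not used in the similarity computations that follow, this imputation does not change the recommendations produced later.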

In [9]:
# Get Number of unique movies
df["title"].nunique()

250

In [10]:
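# Count distinct user_id strings. Each row holds a comma-separated list of reviewers,
# so this yields one value per movie, not the number of individual users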
df["user_id"].nunique()

250
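
The count of 250 matches the number of rows because each `user_id` cell is a single comma-joined string. A small sketch with hypothetical data shows the difference between counting those strings and counting individual reviewers:

```python
import pandas as pd

# Hypothetical data in the same shape as the dataset: one row per movie.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "user_id": ["ur1,ur2,ur3", "ur2,ur4"],
})

# Counts distinct strings, i.e. one per row here.
distinct_strings = toy["user_id"].nunique()

# Split on commas and explode to one reviewer per row before counting.
distinct_reviewers = toy["user_id"].str.split(",").explode().nunique()
```

Here `distinct_strings` is 2 while `distinct_reviewers` is 4, since `ur2` reviewed both movies.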

In [11]:
# Get number of unique directors
df["director_id"].nunique()

160

In [12]:
# Get number of unique genre combinations (genres are comma-separated per movie)
df["genre"].nunique()

104

In [13]:
# Split the comma-separated user_id and user_name strings into lists of values
df["user_id"] = df["user_id"].str.split(',')
df["user_name"] = df["user_name"].str.split(',')
df.head()

Unnamed: 0,rank,movie_id,title,year,link,imbd_votes,imbd_rating,certificate,duration,genre,...,director_id,director_name,writer_id,writer_name,storyline,user_id,user_name,review_id,review_title,review_content
0,1,tt0111161,The Shawshank Redemption,1994,https://www.imdb.com/title/tt0111161,2711075,9.3,R,2h 22m,Drama,...,nm0001104,Frank Darabont,"nm0000175,nm0001104","Stephen King,Frank Darabont","Over the course of several years, two convicts...","[ur16161013, ur15311310, ur0265899, ur16117882...","[hitchcockthelegend, Sleepin_Dragon, EyeDunno,...","rw2284594,rw6606154,rw1221355,rw1822343,rw1288...","Some birds aren't meant to be caged.,An incred...",The Shawshank Redemption is written and direct...
1,2,tt0068646,The Godfather,1972,https://www.imdb.com/title/tt0068646,1882829,9.2,R,2h 55m,"Crime,Drama",...,nm0000338,Francis Ford Coppola,"nm0701374,nm0000338","Mario Puzo,Francis Ford Coppola",The aging patriarch of an organized crime dyna...,"[ur24740649, ur86182727, ur15794099, ur1531131...","[CalRhys, andrewburgereviews, gogoschka-1, Sle...","rw3038370,rw4756923,rw4059579,rw6568526,rw1897...","The Pinnacle Of Flawless Films!,An offer so go...",'The Godfather' is the pinnacle of flawless fi...
2,3,tt0468569,The Dark Knight,2008,https://www.imdb.com/title/tt0468569,2684051,9.0,PG-13,2h 32m,"Action,Crime,Drama",...,nm0634240,Christopher Nolan,"tt0468569,nm0634300,nm0634240,nm0275286,tt0468569","Writers,Jonathan Nolan,Christopher Nolan,David...",When the menace known as the Joker wreaks havo...,"[ur87850731, ur1293485, ur129557514, ur1244912...","[MrHeraclius, Smells_Like_Cheese, dseferaj, li...","rw5478826,rw1914442,rw6606026,rw1917099,rw5170...","The Dark Knight,The Batman of our dreams! So m...","Confidently directed, dark, brooding, and pack..."
3,4,tt0071562,The Godfather Part II,1974,https://www.imdb.com/title/tt0071562,1285350,9.0,R,3h 22m,"Crime,Drama",...,nm0000338,Francis Ford Coppola,"nm0000338,nm0701374","Francis Ford Coppola,Mario Puzo",The early life and career of Vito Corleone in ...,"[ur0176092, ur0688559, ur92260614, ur0200644, ...","[Nazi_Fighter_David, tfrizzell, umunir-36959, ...","rw0135607,rw0135487,rw5049900,rw0135526,rw0135...",Breathtaking in its scope and tragic grandeur....,"Coppola's masterpiece is rivaled only by ""The ..."
4,5,tt0050083,12 Angry Men,1957,https://www.imdb.com/title/tt0050083,800954,9.0,Approved,1h 36m,"Crime,Drama",...,nm0001486,Sidney Lumet,nm0741627,Reginald Rose,The jury in a New York City murder trial is fr...,"[ur1318549, ur0643062, ur0688559, ur20552756, ...","[uds3, tedg, tfrizzell, TheLittleSongbird, hen...","rw0060044,rw0060025,rw0060034,rw2262425,rw5448...","The over-used term ""classic movie"" really come...",This once-in-a-generation masterpiece simply h...


In [14]:
# Explode the user_id and user_name columns to create a new row for each user_id and user_name pair
df_explode = df.explode(['user_id', 'user_name'])
df_explode.head()

Unnamed: 0,rank,movie_id,title,year,link,imbd_votes,imbd_rating,certificate,duration,genre,...,director_id,director_name,writer_id,writer_name,storyline,user_id,user_name,review_id,review_title,review_content
0,1,tt0111161,The Shawshank Redemption,1994,https://www.imdb.com/title/tt0111161,2711075,9.3,R,2h 22m,Drama,...,nm0001104,Frank Darabont,"nm0000175,nm0001104","Stephen King,Frank Darabont","Over the course of several years, two convicts...",ur16161013,hitchcockthelegend,"rw2284594,rw6606154,rw1221355,rw1822343,rw1288...","Some birds aren't meant to be caged.,An incred...",The Shawshank Redemption is written and direct...
0,1,tt0111161,The Shawshank Redemption,1994,https://www.imdb.com/title/tt0111161,2711075,9.3,R,2h 22m,Drama,...,nm0001104,Frank Darabont,"nm0000175,nm0001104","Stephen King,Frank Darabont","Over the course of several years, two convicts...",ur15311310,Sleepin_Dragon,"rw2284594,rw6606154,rw1221355,rw1822343,rw1288...","Some birds aren't meant to be caged.,An incred...",The Shawshank Redemption is written and direct...
0,1,tt0111161,The Shawshank Redemption,1994,https://www.imdb.com/title/tt0111161,2711075,9.3,R,2h 22m,Drama,...,nm0001104,Frank Darabont,"nm0000175,nm0001104","Stephen King,Frank Darabont","Over the course of several years, two convicts...",ur0265899,EyeDunno,"rw2284594,rw6606154,rw1221355,rw1822343,rw1288...","Some birds aren't meant to be caged.,An incred...",The Shawshank Redemption is written and direct...
0,1,tt0111161,The Shawshank Redemption,1994,https://www.imdb.com/title/tt0111161,2711075,9.3,R,2h 22m,Drama,...,nm0001104,Frank Darabont,"nm0000175,nm0001104","Stephen King,Frank Darabont","Over the course of several years, two convicts...",ur16117882,alexkolokotronis,"rw2284594,rw6606154,rw1221355,rw1822343,rw1288...","Some birds aren't meant to be caged.,An incred...",The Shawshank Redemption is written and direct...
0,1,tt0111161,The Shawshank Redemption,1994,https://www.imdb.com/title/tt0111161,2711075,9.3,R,2h 22m,Drama,...,nm0001104,Frank Darabont,"nm0000175,nm0001104","Stephen King,Frank Darabont","Over the course of several years, two convicts...",ur1898687,kaspen12,"rw2284594,rw6606154,rw1221355,rw1822343,rw1288...","Some birds aren't meant to be caged.,An incred...",The Shawshank Redemption is written and direct...


In [15]:
df_explode.shape

(6235, 22)

In [16]:
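# Create a pivot table with user_id as rows, movie title as columns and imbd_rating as values.
# Note: imbd_rating is the movie's overall IMDb rating, not a rating given by the individual user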
df_pivot = df_explode.pivot_table(index="user_id", columns= "title", values="imbd_rating")

In [17]:
df_pivot

user_id,12 Angry Men,12 Years a Slave,1917,2001: A Space Odyssey,3 Idiots,A Beautiful Mind,A Clockwork Orange,A Separation,Aladdin,Alien,...,V for Vendetta,Vertigo,WALL·E,Warrior,Whiplash,Wild Strawberries,Wild Tales,Witness for the Prosecution,Yojimbo,Your Name.
ur0003136,,,,,,,,,,,...,,,,,,,,,,
ur0005435,,,,,,,,,,,...,,,,,,8.1,,,,
ur0011596,,,,,,,,,,,...,,,,,,,,,,
ur0011667,,,,,,,,,,,...,,,,,,,,,,
ur0011762,,,,,8.4,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ur9972457,,,,,,,,,,,...,,,,,,,,,,
ur9974595,,,,,,,,,,,...,,,,,,,,,,
ur99782462,,,,,,,,,,,...,,,,,,,,,,
ur9978719,,,,,,,,,,,...,,,,,,,,,,


In [18]:
# Replace missing entries with 0 so that cosine similarity can be computed
df_pivot_filled = df_pivot.fillna(0)

In [19]:
df_pivot_filled

user_id,12 Angry Men,12 Years a Slave,1917,2001: A Space Odyssey,3 Idiots,A Beautiful Mind,A Clockwork Orange,A Separation,Aladdin,Alien,...,V for Vendetta,Vertigo,WALL·E,Warrior,Whiplash,Wild Strawberries,Wild Tales,Witness for the Prosecution,Yojimbo,Your Name.
ur0003136,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0005435,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,8.1,0.0,0.0,0.0,0.0
ur0011596,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0011667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0011762,0.0,0.0,0.0,0.0,8.4,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ur9972457,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur9974595,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur99782462,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur9978719,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
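
Since `imbd_rating` is a per-movie value, every reviewer of a title gets the same number in that title's column, and the filled matrix mostly encodes which user reviewed which movie. An alternative is to build that interaction matrix explicitly. The sketch below uses `pd.crosstab` on a hypothetical stand-in for `df_explode`; it is one possible formulation, not the notebook's method.

```python
import pandas as pd

# Hypothetical stand-in for df_explode: one row per (user, movie) review.
toy_explode = pd.DataFrame({
    "user_id": ["ur1", "ur1", "ur2", "ur3"],
    "title": ["Movie A", "Movie B", "Movie A", "Movie C"],
    "imbd_rating": [9.3, 9.2, 9.3, 9.0],
})

# 1 where the user reviewed the title, 0 otherwise.
# clip(upper=1) guards against a user appearing twice for the same title.
interactions = pd.crosstab(toy_explode["user_id"], toy_explode["title"]).clip(upper=1)
```

With a binary matrix, cosine similarity between two users depends only on how many titles they have both reviewed, relative to how many each reviewed overall.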


In [20]:
# Compute the user-user cosine similarity matrix (each row of df_pivot_filled is a user)
cosine_sim_matrix = cosine_similarity(df_pivot_filled)

In [21]:
cosine_sim_matrix

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

## Top N Similar Users to Recommend a Movie

In [22]:
# Create a DataFrame from the user cosine similarity matrix, labelled by user_id on both axes
cosine_sim_user_df = pd.DataFrame(cosine_sim_matrix, index= df_pivot_filled.index, columns= df_pivot_filled.index)

In [23]:
cosine_sim_user_df

user_id,ur0003136,ur0005435,ur0011596,ur0011667,ur0011762,ur0011817,ur0012098,ur0012815,ur0016134,ur0018365,...,ur9919264,ur99443912,ur99519886,ur99671420,ur99708365,ur9972457,ur9974595,ur99782462,ur9978719,ur9992388
ur0003136,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0005435,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0011596,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0011667,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0011762,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ur9972457,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.513513,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
ur9974595,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
ur99782462,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
ur9978719,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [24]:
def top_n_similar_users(user_id, n= 10):
  # Get similarity score for the given user
  similarity_scores = cosine_sim_user_df[user_id]

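  # Sort the scores in descending order and take the top n, skipping the first row.
  # This assumes the first row is the user itself; with ties at 1.0 that may not hold
  similar_users = similarity_scores.sort_values(ascending=False).head(n + 1).iloc[1:]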
  return similar_users


In [25]:
userid ='ur0005435'
n= 10
similar_users_10 = top_n_similar_users(userid, n)
print(f"Top {n} similar users of {userid}:")
top_n_similar_userids = list(similar_users_10.index)
similar_users = []
for user_id in top_n_similar_userids:
  similar_users.append(df_explode[df_explode["user_id"] == user_id][["user_id", "user_name"]])
similar_users_df = pd.concat(similar_users)
similar_users_df

Top 10 similar users of ur0005435:


Unnamed: 0,user_id,user_name
185,ur7836253,wiseowl-5
185,ur3479720,fred3f
185,ur6503856,disinterested_spectator
185,ur5732819,vitaleralphlouis
185,ur2167075,SanTropez_Couch
185,ur0005435,iam-1
185,ur0675293,jonr-3
185,ur2706684,sol-kay
185,ur11935902,thxrvg
185,ur13017201,Vincentiu
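
The list printed above includes `ur0005435` itself. A likely cause is that `head(n + 1).iloc[1:]` drops whichever row sorts first, and when several users tie at a similarity of 1.0 that row is not necessarily the query user. A sketch of a variant that removes the query user by label instead, shown on a toy similarity matrix (the function name and data are hypothetical):

```python
import pandas as pd

# Toy similarity matrix with a tie at 1.0 between ur1 and ur2.
ids = ["ur1", "ur2", "ur3"]
toy_sim = pd.DataFrame(
    [[1.0, 1.0, 0.2],
     [1.0, 1.0, 0.2],
     [0.2, 0.2, 1.0]],
    index=ids, columns=ids,
)

def top_n_similar_users_excluding_self(sim_df, user_id, n=10):
    """Drop the query user by label, then return the n largest remaining scores."""
    scores = sim_df[user_id].drop(labels=user_id)
    return scores.nlargest(n)

neighbours = top_n_similar_users_excluding_self(toy_sim, "ur1", n=2)
```

Dropping by label makes the result independent of how ties are ordered by the sort.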


In [26]:
def recomend_movie_to_users(movie_title, n= 10):

  # Get the users who rated the movie
  rated_users = df_explode[df_explode["title"] == movie_title]["user_id"]

  # Find the similar users for those who rated the movie
  similar_user_list = []
  for user_id in rated_users:
    similar_users = top_n_similar_users(user_id)
    similar_user_list = similar_user_list + list(similar_users.index)

  # Convert the list into a Series so that value_counts can be applied
  similar_users_series = pd.Series(similar_user_list)

  # Count the number of occurrences of each user and keep the top n
  top_n_users = similar_users_series.value_counts().head(n)

  # Return the list of user_ids of the top n users to recommend the movie to
  return list(top_n_users.index)


In [27]:
movie_title = "3 Idiots"
n = 10
top_n_users = recomend_movie_to_users(movie_title, n)
top_n_to_recomend = []
for user_id in top_n_users:
  top_n_to_recomend.append(df_explode[df_explode["user_id"] == user_id][["user_id", "user_name"]])
top_n_recomend_df = pd.concat(top_n_to_recomend).reset_index(drop=True)

print(f"Top {n} users to recommend the movie {movie_title} to:")
top_n_recomend_df

Top 10 users to recommend the movie 3 Idiots to:


Unnamed: 0,user_id,user_name
0,ur19482100,Glock_Boy
1,ur87058213,ShaukathX
2,ur0369748,donutpizza
3,ur20658364,sumanbarthakursmailbox
4,ur18334751,vvv832
5,ur19501058,ankityadav
6,ur3446880,Kicino
7,ur0011762,cherold
8,ur4174655,ahujarajiv
9,ur16379678,ahmedn32004
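
The output above lists `ur0011762`, who already has an entry for 3 Idiots in the pivot table shown earlier, so the function can suggest the movie to people who have reviewed it. A sketch of a variant that skips existing reviewers follows. The function name, the toy data, and the `toy_neighbours` dictionary (standing in for calls to `top_n_similar_users`) are all hypothetical.

```python
from collections import Counter
import pandas as pd

# Hypothetical stand-in for df_explode.
toy_explode = pd.DataFrame({
    "user_id": ["ur1", "ur2", "ur2", "ur3", "ur4"],
    "title": ["Movie A", "Movie A", "Movie B", "Movie B", "Movie C"],
})

# Hypothetical neighbour lists standing in for top_n_similar_users().
toy_neighbours = {"ur1": ["ur2", "ur3"], "ur2": ["ur1", "ur3"], "ur3": ["ur2"], "ur4": []}

def recommend_movie_to_new_users(movie_title, explode_df, neighbours, n=10):
    """Count how often each user is a neighbour of the movie's reviewers,
    skipping users who already reviewed the movie."""
    reviewers = set(explode_df.loc[explode_df["title"] == movie_title, "user_id"])
    counts = Counter(
        candidate
        for reviewer in reviewers
        for candidate in neighbours.get(reviewer, [])
        if candidate not in reviewers
    )
    return [user for user, _ in counts.most_common(n)]

targets = recommend_movie_to_new_users("Movie A", toy_explode, toy_neighbours)
```

In this toy example `ur1` and `ur2` reviewed Movie A, so only `ur3` remains as a target.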


## Top N Similar Movies to Recommend to a User

In [28]:
# Compute movie-movie cosine similarity by transposing the user-by-title matrix
cosine_sim_movies = cosine_similarity(df_pivot_filled.T)

In [30]:
cosine_sim_movies_df = pd.DataFrame(cosine_sim_movies, index = df_pivot_filled.columns, columns= df_pivot_filled.columns)

In [31]:
cosine_sim_movies_df

title,12 Angry Men,12 Years a Slave,1917,2001: A Space Odyssey,3 Idiots,A Beautiful Mind,A Clockwork Orange,A Separation,Aladdin,Alien,...,V for Vendetta,Vertigo,WALL·E,Warrior,Whiplash,Wild Strawberries,Wild Tales,Witness for the Prosecution,Yojimbo,Your Name.
12 Angry Men,1.00,0.08,0.12,0.08,0.12,0.12,0.12,0.12,0.20,0.08,...,0.08,0.12,0.12,0.12,0.08,0.08,0.08,0.08,0.12,0.16
12 Years a Slave,0.08,1.00,0.04,0.00,0.04,0.00,0.00,0.04,0.04,0.04,...,0.00,0.08,0.00,0.00,0.28,0.04,0.12,0.04,0.04,0.04
1917,0.12,0.04,1.00,0.04,0.04,0.08,0.04,0.08,0.08,0.08,...,0.08,0.08,0.08,0.04,0.12,0.08,0.00,0.12,0.08,0.12
2001: A Space Odyssey,0.08,0.00,0.04,1.00,0.00,0.12,0.12,0.08,0.08,0.00,...,0.04,0.12,0.08,0.04,0.04,0.08,0.00,0.04,0.08,0.04
3 Idiots,0.12,0.04,0.04,0.00,1.00,0.04,0.00,0.04,0.12,0.08,...,0.08,0.08,0.04,0.04,0.04,0.04,0.00,0.04,0.08,0.08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wild Strawberries,0.08,0.04,0.08,0.08,0.04,0.08,0.16,0.08,0.04,0.04,...,0.08,0.28,0.08,0.04,0.12,1.00,0.04,0.16,0.16,0.08
Wild Tales,0.08,0.12,0.00,0.00,0.00,0.00,0.00,0.08,0.04,0.04,...,0.04,0.08,0.08,0.08,0.08,0.04,1.00,0.04,0.08,0.00
Witness for the Prosecution,0.08,0.04,0.12,0.04,0.04,0.16,0.08,0.12,0.12,0.08,...,0.08,0.40,0.12,0.04,0.12,0.16,0.04,1.00,0.16,0.08
Yojimbo,0.12,0.04,0.08,0.08,0.08,0.12,0.08,0.12,0.08,0.08,...,0.08,0.24,0.08,0.04,0.08,0.16,0.08,0.16,1.00,0.08


In [44]:
def top_n_similar_movies(movie_title, n= 10):
  # Get similarity score for the movie
  similarity_score = cosine_sim_movies_df[movie_title]

  # Sort the scores in descending order and take the top n, skipping the movie itself
  sim_movies_n = similarity_score.sort_values(ascending= False).head(n + 1).iloc[1:]

  return sim_movies_n

In [45]:
movie_title = "A Beautiful Mind"
n  = 10
top_n_movies = top_n_similar_movies(movie_title, n)
top_n_movies

title
Double Indemnity             0.240000
Singin' in the Rain          0.240000
Full Metal Jacket            0.204124
Reservoir Dogs               0.204124
Persona                      0.204124
In the Name of the Father    0.204124
Cinema Paradiso              0.200000
North by Northwest           0.200000
Million Dollar Baby          0.200000
On the Waterfront            0.200000
Name: A Beautiful Mind, dtype: float64

In [46]:
def recomend_top_n_movies(user_id, n= 10):
  # Get the movies reviewed by the user
  rated_movies = df_explode[df_explode["user_id"] == user_id]["title"]

  # Find the similar movies for these movies
  similar_movies_list = []
  for movie_title in rated_movies:
    similar_movies = top_n_similar_movies(movie_title)
    similar_movies_list = similar_movies_list + list(similar_movies.index)

  # Convert the list into series to use value_counts
  similar_movies_series = pd.Series(similar_movies_list)

  # Count the number of occurrences of each movie and get the top n
  top_n_movies = similar_movies_series.value_counts().head(n)

  return list(top_n_movies.index)


In [47]:
user_id = "ur87058213"
n= 10
recomended_movies = recomend_top_n_movies(user_id, n)
recomended_movies

['Coco',
 'The Father',
 'Unforgiven',
 'Dangal',
 'My Neighbor Totoro',
 'The General',
 'The Boat',
 'The Seventh Seal',
 'Django Unchained',
 'Groundhog Day']
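
`recomend_top_n_movies` ranks candidates by how often they appear in the neighbour lists and does not explicitly filter out titles the user has already reviewed. One possible refinement is to rank by summed similarity and drop the reviewed titles. The sketch below works on a toy similarity matrix; the function name and all values are hypothetical.

```python
import pandas as pd

# Toy movie-movie similarity matrix (symmetric, 1.0 on the diagonal).
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
toy_movie_sim = pd.DataFrame(
    [[1.0, 0.6, 0.1, 0.3],
     [0.6, 1.0, 0.2, 0.5],
     [0.1, 0.2, 1.0, 0.0],
     [0.3, 0.5, 0.0, 1.0]],
    index=titles, columns=titles,
)

def recommend_unseen_movies(seen_titles, sim_df, n=10):
    """Sum each candidate's similarity to the seen titles, then drop the seen titles."""
    scores = sim_df.loc[seen_titles].sum(axis=0)
    scores = scores.drop(labels=seen_titles)
    return list(scores.nlargest(n).index)

picks = recommend_unseen_movies(["Movie A", "Movie B"], toy_movie_sim, n=2)
```

For a user who reviewed Movie A and Movie B, Movie D scores 0.3 + 0.5 and Movie C scores 0.1 + 0.2, so Movie D is ranked first. Summing similarities uses the magnitude of each score, whereas counting appearances treats every neighbour equally.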