<a href="https://colab.research.google.com/github/arghads9177/recommendation-system_imdb_shows/blob/master/imdb_shows_recomendation_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Show Recommendation System

## About Dataset

IMDb is one of the main sources people use to judge a show, and its ratings influence what many viewers decide to watch. I watched The Shawshank Redemption after finding out that it's at the top of the list on IMDb.

The IMDb Top 250 shows dataset provides a comprehensive overview of some of the best-rated shows of all time, as per IMDb ratings. This dataset includes a variety of attributes that offer a detailed description of each show and the reviews provided by users.

## Data Dictionary

* **rank**: The rank of the show according to IMDb ratings.
* **show_id**: A unique identifier for each show.
* **title**: The name of the show.
* **year**: The year the show was released.
* **link**: The URL link to the show's IMDb page.
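* **imdb_votes**: The number of votes the show has received on IMDb. In the loaded CSV this column is spelled `imbd_votes`.
* **imdb_rating**: The rating of the show as per IMDb. In the loaded CSV this column is spelled `imbd_rating`.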
* **certificate**: The certification rating of the show (e.g., PG-13, R).
* **duration**: The duration of the show, stored as a string such as `4h 58m` or `49m`.
* **genre**: The genre(s) of the show.
* **cast_id**: A unique identifier for each cast member.
* **cast_name**: The name of the cast member.
* **director_id**: A unique identifier for the director.
* **director_name**: The name of the director.
* **writer_id**: A unique identifier for the writer.
* **writer_name**: The name of the writer.
* **storyline**: A brief summary of the show's plot.
* **user_id**: A unique identifier for the user who wrote a review.
* **user_name**: The name of the user who wrote the review.
* **review_id**: A unique identifier for the review.
* **review_title**: A short title summarizing the review.
* **review_content**: The full content of the review.
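
In the data preview, `duration` appears as strings such as `4h 58m` and `49m` rather than as a number. A minimal sketch of converting those strings to total minutes is shown below. The helper name `duration_to_minutes` is hypothetical and not part of the notebook, and the accepted format is assumed from the preview rows only.

```python
import re

def duration_to_minutes(text):
    """Convert a duration string such as '4h 58m' or '49m' to total minutes.

    Returns None for missing or unrecognised values.
    """
    if not isinstance(text, str):
        return None  # e.g. NaN for the row with a missing duration
    match = re.fullmatch(r"\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*", text)
    if match is None or not any(match.groups()):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes

print(duration_to_minutes("4h 58m"))
print(duration_to_minutes("49m"))
```

Once `df` is loaded, the helper could be applied with `df["duration"].map(duration_to_minutes)`.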

## Problem Statement

In today's digital age, users are overwhelmed with the sheer volume of show choices available across various streaming platforms. This abundance of options often leads to a paradox of choice, where users find it difficult to decide which show to watch next. A personalized recommendation system can significantly enhance the user experience by suggesting shows that align with their tastes and preferences.

#### Objective:

Develop a show recommendation system using the IMDb Top 250 shows dataset that leverages the rich information provided, such as show genres, cast, directors, user reviews, and ratings. The system should provide personalized recommendations to users based on their past viewing history and preferences.

#### Challenges:

* **Data Integration**: Combining various attributes like genre, cast, directors, and user reviews to create a comprehensive user profile and show profile.
* **Similarity Calculation**: Using advanced similarity measures like cosine similarity to find shows that are similar to those the user has liked in the past.
* **Personalization**: Taking into account user reviews and ratings to tailor recommendations that are highly relevant to individual users.
* **Scalability**: Ensuring the system can handle a large number of users and shows without compromising on performance.

#### Proposed Solution:

* **Data Preprocessing**: Clean and preprocess the data to handle missing values and ensure consistency.
* **Feature Engineering**: Create a combined feature space for each show that includes genres, cast and directors.
* **Cosine Similarity**: Compute the cosine similarity between shows to find those that are similar in terms of content and user reviews.
* **Recommendation Algorithm**: Develop an algorithm that recommends shows based on the computed similarities and user preferences.
* **Efficiency and Accuracy**: Measure the efficiency and accuracy of the recommendation algorithm.
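
The feature-engineering and cosine-similarity steps above can be sketched on a handful of rows. This is an illustration under stated assumptions, not the notebook's implementation: it uses only the `genre` strings visible in the data preview, encodes them as multi-hot vectors, and computes cosine similarity with NumPy. The names `shows`, `features` and `content_sim` are hypothetical. Cast and director names could be encoded the same way and concatenated column-wise.

```python
import numpy as np
import pandas as pd

# Genre strings copied from the first five rows of the data preview.
shows = pd.DataFrame({
    "title": ["Planet Earth II", "Breaking Bad", "Planet Earth",
              "Band of Brothers", "Chernobyl"],
    "genre": ["Documentary", "Crime,Drama,Thriller", "Documentary",
              "Drama,History,War", "Drama,History,Thriller"],
}).set_index("title")

# One binary column per genre token (multi-hot encoding).
features = shows["genre"].str.get_dummies(sep=",")

# Cosine similarity is the dot product of L2-normalised rows.
X = features.to_numpy(dtype=float)
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
content_sim = pd.DataFrame(X_unit @ X_unit.T,
                           index=features.index, columns=features.index)

print(content_sim.round(2))
```

Shows with identical genre sets score 1.0, and shows with no genre in common score 0.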

## Load Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

## Load Dataset

In [2]:
import os
folder_path = "drive/MyDrive/Colab Notebooks/dscourse/data"
file_path = os.path.join(folder_path, "shows.csv")
df = pd.read_csv(file_path)

## Get Information About Dataset and Data

In [3]:
# Show the first 5 rows
df.head()

Unnamed: 0,rank,show_id,title,year,link,imbd_votes,imbd_rating,certificate,duration,genre,...,director_id,director_name,writer_id,writer_name,storyline,user_id,user_name,review_id,review_title,review_content
0,1,tt5491994,Planet Earth II,2016,https://www.imdb.com/title/tt5491994,145597,9.5,TV-G,4h 58m,Documentary,...,"nm1565498,nm3482115,nm4830788,nm1769336,nm2139...","Justin Anderson,Ed Charles,Fredi Devas,Chadden...",nm2357454,Elizabeth White,David Attenborough returns with a new wildlife...,"ur0362356,ur33816519,ur64238818,ur69264448,ur2...","Wentloog,john-m-madsen,thespookybuz,pjdickinso...","rw3575992,rw3576144,rw3578121,rw3576211,rw3577...","At once awe-inspiring and terrifying!,Yet anot...",I have just finished watching the first episod...
1,2,tt0903747,Breaking Bad,2008,https://www.imdb.com/title/tt0903747,1881190,9.5,TV-MA,49m,"Crime,Drama,Thriller",...,"nm0533713,nm0002835,nm0319213,nm0118778,nm0806...","Michelle MacLaren,Adam Bernstein,Vince Gilliga...","nm0319213,nm0332467,nm2297407,nm1028558,nm0909...","Vince Gilligan,Peter Gould,George Mastras,Sam ...",A chemistry teacher diagnosed with inoperable ...,"ur128165243,ur6387867,ur158768880,ur20552756,u...","FiRE010,Supermanfan-13,Lukasmj,TheLittleSongbi...","rw7088846,rw7530139,rw8672131,rw3856786,rw8725...","Really Great,Damn near perfect!,A show you nee...",I have never watched a show that is as consist...
2,3,tt0795176,Planet Earth,2006,https://www.imdb.com/title/tt0795176,210164,9.4,TV-PG,8h 58m,Documentary,...,"nm0288144,nm1768412","Alastair Fothergill,Mark Linfield","nm0041003,nm1761192,nm0288144,nm0662263","David Attenborough,Vanessa Berlowitz,Alastair ...",Each 50 minute episode features a global overv...,"ur4445210,ur1002035,ur4344459,ur14156906,ur141...","ccthemovieman-1,bob the moo,bs3dc,robert-kamer...","rw2002220,rw1356723,rw1574512,rw1594404,rw1723...","In A Word: Amazing,A visually impressive and m...","Thankfully, I caught a couple of these episode..."
3,4,tt0185906,Band of Brothers,2001,https://www.imdb.com/title/tt0185906,469081,9.4,TV-MA,9h 54m,"Drama,History,War",...,"nm0291205,nm0004121,nm0000158,nm0500896,nm0518...","David Frankel,Mikael Salomon,Tom Hanks,David L...","nm0024421,nm0096897,nm0296861,nm0000158,nm0420...","Stephen Ambrose,Erik Bork,E. Max Frye,Tom Hank...",The story of Easy Company of the U.S. Army 101...,"ur0312444,ur3922673,ur1019294,ur6387867,ur2467...","rbverhoef,philip_vanderveken,bsmith5552,Superm...","rw0626026,rw0626132,rw0625888,rw8123519,rw3248...","Excellent,This series is so unbelievably reali...",This week I saw three things based on WW-II no...
4,5,tt7366338,Chernobyl,2019,https://www.imdb.com/title/tt7366338,751884,9.4,TV-MA,5h 30m,"Drama,History,Thriller",...,nm0719307,Johan Renck,nm0563301,Craig Mazin,"In April 1986, an explosion at the Chernobyl n...","ur0482513,ur71468234,ur6387867,ur115536310,ur1...","Leofwine_draca,jfirebug,Supermanfan-13,DiCapri...","rw5285929,rw4875873,rw8325723,rw8574390,rw8521...","Exemplary,Incredible,Brilliant!,Must Watch!,Pa...",CHERNOBYL is an excellent depiction of the inf...


In [4]:
# Get number of rows and columns in the dataset
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

Number of rows: 250
Number of columns: 22


#### Checking Data Types

Checking the data type of each column identifies which columns in the dataset are categorical and which are numerical.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 22 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   rank            250 non-null    int64  
 1   show_id         250 non-null    object 
 2   title           250 non-null    object 
 3   year            250 non-null    int64  
 4   link            250 non-null    object 
 5   imbd_votes      250 non-null    object 
 6   imbd_rating     250 non-null    float64
 7   certificate     246 non-null    object 
 8   duration        249 non-null    object 
 9   genre           250 non-null    object 
 10  cast_id         250 non-null    object 
 11  cast_name       250 non-null    object 
 12  director_id     250 non-null    object 
 13  director_name   250 non-null    object 
 14  writer_id       250 non-null    object 
 15  writer_name     250 non-null    object 
 16  storyline       250 non-null    object 
 17  user_id         250 non-null    obj

#### Missing Value Detection

Missing value detection is essential for checking the quality of the data. Any missing values found should be imputed with suitable values so that the data remains reliable for analysis.

In [6]:
df.isnull().sum()

rank              0
show_id           0
title             0
year              0
link              0
imbd_votes        0
imbd_rating       0
certificate       4
duration          1
genre             0
cast_id           0
cast_name         0
director_id       0
director_name     0
writer_id         0
writer_name       0
storyline         0
user_id           0
user_name         0
review_id         0
review_title      0
review_content    0
dtype: int64

### Observations

* There is only 1 null value in the duration feature.
* There are 4 null values in the certificate feature.
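
One possible way to handle these two columns is sketched below on a toy frame. The rows are illustrative and not taken from the dataset, and the choices of an `"Unknown"` label and a `duration_missing` flag are assumptions rather than the notebook's method. Neither column feeds the similarity computation later in the notebook, so this step is optional here.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the two columns with gaps; values are illustrative only.
toy = pd.DataFrame({
    "title": ["Show A", "Show B", "Show C"],
    "certificate": ["TV-MA", np.nan, "TV-PG"],
    "duration": ["49m", "5h 30m", np.nan],
})

# Label missing certificates explicitly instead of guessing a rating.
toy["certificate"] = toy["certificate"].fillna("Unknown")

# Flag the missing duration rather than inventing a value.
toy["duration_missing"] = toy["duration"].isna()

print(toy.isna().sum())
```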

In [7]:
# Convert the string to array of values for each comma separated user_id and user_name by splitting it on comma
df["user_id"] = df["user_id"].str.split(",")
df["user_name"] = df["user_name"].str.split(",")

In [8]:
df.head()

Unnamed: 0,rank,show_id,title,year,link,imbd_votes,imbd_rating,certificate,duration,genre,...,director_id,director_name,writer_id,writer_name,storyline,user_id,user_name,review_id,review_title,review_content
0,1,tt5491994,Planet Earth II,2016,https://www.imdb.com/title/tt5491994,145597,9.5,TV-G,4h 58m,Documentary,...,"nm1565498,nm3482115,nm4830788,nm1769336,nm2139...","Justin Anderson,Ed Charles,Fredi Devas,Chadden...",nm2357454,Elizabeth White,David Attenborough returns with a new wildlife...,"[ur0362356, ur33816519, ur64238818, ur69264448...","[Wentloog, john-m-madsen, thespookybuz, pjdick...","rw3575992,rw3576144,rw3578121,rw3576211,rw3577...","At once awe-inspiring and terrifying!,Yet anot...",I have just finished watching the first episod...
1,2,tt0903747,Breaking Bad,2008,https://www.imdb.com/title/tt0903747,1881190,9.5,TV-MA,49m,"Crime,Drama,Thriller",...,"nm0533713,nm0002835,nm0319213,nm0118778,nm0806...","Michelle MacLaren,Adam Bernstein,Vince Gilliga...","nm0319213,nm0332467,nm2297407,nm1028558,nm0909...","Vince Gilligan,Peter Gould,George Mastras,Sam ...",A chemistry teacher diagnosed with inoperable ...,"[ur128165243, ur6387867, ur158768880, ur205527...","[FiRE010, Supermanfan-13, Lukasmj, TheLittleSo...","rw7088846,rw7530139,rw8672131,rw3856786,rw8725...","Really Great,Damn near perfect!,A show you nee...",I have never watched a show that is as consist...
2,3,tt0795176,Planet Earth,2006,https://www.imdb.com/title/tt0795176,210164,9.4,TV-PG,8h 58m,Documentary,...,"nm0288144,nm1768412","Alastair Fothergill,Mark Linfield","nm0041003,nm1761192,nm0288144,nm0662263","David Attenborough,Vanessa Berlowitz,Alastair ...",Each 50 minute episode features a global overv...,"[ur4445210, ur1002035, ur4344459, ur14156906, ...","[ccthemovieman-1, bob the moo, bs3dc, robert-k...","rw2002220,rw1356723,rw1574512,rw1594404,rw1723...","In A Word: Amazing,A visually impressive and m...","Thankfully, I caught a couple of these episode..."
3,4,tt0185906,Band of Brothers,2001,https://www.imdb.com/title/tt0185906,469081,9.4,TV-MA,9h 54m,"Drama,History,War",...,"nm0291205,nm0004121,nm0000158,nm0500896,nm0518...","David Frankel,Mikael Salomon,Tom Hanks,David L...","nm0024421,nm0096897,nm0296861,nm0000158,nm0420...","Stephen Ambrose,Erik Bork,E. Max Frye,Tom Hank...",The story of Easy Company of the U.S. Army 101...,"[ur0312444, ur3922673, ur1019294, ur6387867, u...","[rbverhoef, philip_vanderveken, bsmith5552, Su...","rw0626026,rw0626132,rw0625888,rw8123519,rw3248...","Excellent,This series is so unbelievably reali...",This week I saw three things based on WW-II no...
4,5,tt7366338,Chernobyl,2019,https://www.imdb.com/title/tt7366338,751884,9.4,TV-MA,5h 30m,"Drama,History,Thriller",...,nm0719307,Johan Renck,nm0563301,Craig Mazin,"In April 1986, an explosion at the Chernobyl n...","[ur0482513, ur71468234, ur6387867, ur115536310...","[Leofwine_draca, jfirebug, Supermanfan-13, DiC...","rw5285929,rw4875873,rw8325723,rw8574390,rw8521...","Exemplary,Incredible,Brilliant!,Must Watch!,Pa...",CHERNOBYL is an excellent depiction of the inf...


In [9]:
# Explode the user_id and user_name columns to create new rows for each user_id and user_name pair
df_explode = df.explode(["user_id", "user_name"]).reset_index(drop=True)
df_explode.head()

Unnamed: 0,rank,show_id,title,year,link,imbd_votes,imbd_rating,certificate,duration,genre,...,director_id,director_name,writer_id,writer_name,storyline,user_id,user_name,review_id,review_title,review_content
0,1,tt5491994,Planet Earth II,2016,https://www.imdb.com/title/tt5491994,145597,9.5,TV-G,4h 58m,Documentary,...,"nm1565498,nm3482115,nm4830788,nm1769336,nm2139...","Justin Anderson,Ed Charles,Fredi Devas,Chadden...",nm2357454,Elizabeth White,David Attenborough returns with a new wildlife...,ur0362356,Wentloog,"rw3575992,rw3576144,rw3578121,rw3576211,rw3577...","At once awe-inspiring and terrifying!,Yet anot...",I have just finished watching the first episod...
1,1,tt5491994,Planet Earth II,2016,https://www.imdb.com/title/tt5491994,145597,9.5,TV-G,4h 58m,Documentary,...,"nm1565498,nm3482115,nm4830788,nm1769336,nm2139...","Justin Anderson,Ed Charles,Fredi Devas,Chadden...",nm2357454,Elizabeth White,David Attenborough returns with a new wildlife...,ur33816519,john-m-madsen,"rw3575992,rw3576144,rw3578121,rw3576211,rw3577...","At once awe-inspiring and terrifying!,Yet anot...",I have just finished watching the first episod...
2,1,tt5491994,Planet Earth II,2016,https://www.imdb.com/title/tt5491994,145597,9.5,TV-G,4h 58m,Documentary,...,"nm1565498,nm3482115,nm4830788,nm1769336,nm2139...","Justin Anderson,Ed Charles,Fredi Devas,Chadden...",nm2357454,Elizabeth White,David Attenborough returns with a new wildlife...,ur64238818,thespookybuz,"rw3575992,rw3576144,rw3578121,rw3576211,rw3577...","At once awe-inspiring and terrifying!,Yet anot...",I have just finished watching the first episod...
3,1,tt5491994,Planet Earth II,2016,https://www.imdb.com/title/tt5491994,145597,9.5,TV-G,4h 58m,Documentary,...,"nm1565498,nm3482115,nm4830788,nm1769336,nm2139...","Justin Anderson,Ed Charles,Fredi Devas,Chadden...",nm2357454,Elizabeth White,David Attenborough returns with a new wildlife...,ur69264448,pjdickinson,"rw3575992,rw3576144,rw3578121,rw3576211,rw3577...","At once awe-inspiring and terrifying!,Yet anot...",I have just finished watching the first episod...
4,1,tt5491994,Planet Earth II,2016,https://www.imdb.com/title/tt5491994,145597,9.5,TV-G,4h 58m,Documentary,...,"nm1565498,nm3482115,nm4830788,nm1769336,nm2139...","Justin Anderson,Ed Charles,Fredi Devas,Chadden...",nm2357454,Elizabeth White,David Attenborough returns with a new wildlife...,ur24219677,arjanhylkema,"rw3575992,rw3576144,rw3578121,rw3576211,rw3577...","At once awe-inspiring and terrifying!,Yet anot...",I have just finished watching the first episod...


In [10]:
# Get number of rows and columns in exploded dataset
print(f"Number of rows: {df_explode.shape[0]}")
print(f"Number of columns: {df_explode.shape[1]}")

Number of rows: 6138
Number of columns: 22
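
The split-then-explode steps can be seen more easily on a toy frame. The sketch below uses made-up ids and names, and adds a length check before exploding, since a multi-column `explode` needs the lists in each row to have matching lengths. A user name that itself contains a comma would break the pairing, so the check is a cheap safeguard.

```python
import pandas as pd

# Two toy shows with comma-separated reviewer ids and names (illustrative values).
toy = pd.DataFrame({
    "title": ["Show A", "Show B"],
    "user_id": ["u1,u2,u3", "u2,u4"],
    "user_name": ["ann,bob,cy", "bob,dee"],
})

# Split both columns into lists.
toy["user_id"] = toy["user_id"].str.split(",")
toy["user_name"] = toy["user_name"].str.split(",")

# Verify that every row has as many ids as names before exploding.
lengths_match = toy["user_id"].str.len().eq(toy["user_name"].str.len())
assert lengths_match.all(), "user_id and user_name lists differ in length"

# One output row per (show, reviewer) pair.
toy_exploded = toy.explode(["user_id", "user_name"]).reset_index(drop=True)
print(toy_exploded)
```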


In [11]:
# Create a pivot table of user id by show title. The values are the show's overall IMDb rating (imbd_rating), not a rating given by that user
df_pivot = df_explode.pivot_table(index= "user_id", columns= "title", values= "imbd_rating")
df_pivot.head()

title,1883,Adventure Time,Africa,Alfred Hitchcock Presents,Anne with an E,Apocalypse: The Second World War,Arcane,Archer,Arrested Development,As If,...,What We Do in the Shadows,When They See Us,Whose Line Is It Anyway?,X-Men: The Animated Series,Yeh Meri Family,Yellowstone,Yes Minister,"Yes, Prime Minister",Young Justice,Your Lie in April
user_id,,,,,,,,,,,,,,,,,,,,,
ur0005879,,,,,,9.0,,,,,...,,,,,,,,,,
ur0009605,,,,,,,,,,,...,,,,,,,,,,
ur0010745,,,,,,,,,,,...,,,,,,,,,,
ur0011762,,,,,,,,,,,...,,,,,,,,,,
ur0013944,,,,,,,,,,,...,,,,,,,,,,
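
Because `imbd_rating` is a show-level value repeated for every reviewer, each non-empty cell in a column of this pivot table holds the same number. The matrix is therefore a "reviewed / not reviewed" indicator scaled by the show's rating. The sketch below, on toy rows with hypothetical names, compares it with a plain binary matrix built with `pd.crosstab`. Cosine similarity between show columns is unaffected by that scaling, while user-to-user similarity weights each show by its rating.

```python
import pandas as pd

# Toy exploded data: one row per (reviewer, show); ratings are show-level.
toy_exploded = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u2", "u4"],
    "title": ["Show A", "Show A", "Show A", "Show B", "Show B"],
    "imbd_rating": [9.5, 9.5, 9.5, 9.4, 9.4],
})

# Same call shape as the notebook: cells hold the show's rating where a review exists.
ratings = toy_exploded.pivot_table(
    index="user_id", columns="title", values="imbd_rating"
).fillna(0)

# Binary alternative: 1 if the user reviewed the show, else 0.
reviewed = pd.crosstab(toy_exploded["user_id"], toy_exploded["title"]).clip(upper=1)

# The non-zero pattern is identical in both matrices.
same_pattern = (ratings > 0).to_numpy() == (reviewed > 0).to_numpy()
print(reviewed)
print(same_pattern.all())
```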


In [12]:
df_pivot_filled= df_pivot.fillna(0)

In [13]:
df_pivot_filled

title,1883,Adventure Time,Africa,Alfred Hitchcock Presents,Anne with an E,Apocalypse: The Second World War,Arcane,Archer,Arrested Development,As If,...,What We Do in the Shadows,When They See Us,Whose Line Is It Anyway?,X-Men: The Animated Series,Yeh Meri Family,Yellowstone,Yes Minister,"Yes, Prime Minister",Young Justice,Your Lie in April
user_id,,,,,,,,,,,,,,,,,,,,,
ur0005879,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0009605,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0010745,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0011762,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0013944,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ur99782462,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur99814311,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur9987967,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur99901788,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
# Compute Cosine similarity matrix
cosine_sim_matrix = cosine_similarity(df_pivot_filled)

In [15]:
cosine_sim_matrix

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])
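
The matrix is mostly zeros because two users who share no reviewed show have a dot product of 0. As a worked illustration of the formula, cos(u, v) = (u · v) / (‖u‖ ‖v‖), here is a dependency-free sketch. The function name `cosine` and the three toy vectors are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm_u * norm_v)

# Each vector is one user's row: a show's rating where reviewed, 0 otherwise.
user_a = [9.5, 0.0, 9.4]
user_b = [9.5, 0.0, 0.0]  # shares the first show with user_a
user_c = [0.0, 8.8, 0.0]  # shares no show with user_a

print(round(cosine(user_a, user_b), 4))
print(cosine(user_a, user_c))
```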

## Top N Similar Users to Recommend a Show

In [16]:
# Create a dataframe for the user-to-user cosine similarity matrix
cosine_sim_user_df = pd.DataFrame(cosine_sim_matrix, columns= df_pivot_filled.index, index= df_pivot_filled.index)

In [17]:
cosine_sim_user_df

user_id,ur0005879,ur0009605,ur0010745,ur0011762,ur0013944,ur0017740,ur0020866,ur0023870,ur0026270,ur0029708,...,ur99519886,ur99572075,ur99604822,ur99660650,ur99705413,ur99782462,ur99814311,ur9987967,ur99901788,ur99926384
user_id,,,,,,,,,,,,,,,,,,,,,
ur0005879,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0009605,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0010745,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0011762,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ur0013944,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ur99782462,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.455464,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
ur99814311,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
ur9987967,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
ur99901788,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [22]:
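def top_n_similar_users(user_id, n=10):
  # Get similarity scores between this user and every user
  similarity_scores = cosine_sim_user_df[user_id]

  # Sort in descending order, then skip the first entry, which is assumed to be
  # the user itself (self-similarity 1.0). If another user also scores 1.0, the
  # tie may place the user elsewhere, and it can then appear in the result.
  similar_users = similarity_scores.sort_values(ascending=False).head(n + 1).iloc[1:]

  return similar_users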

In [19]:
user_id = "ur0009605"
n = 10
similar_users = top_n_similar_users(user_id, n)

print(f"Top {n} similar users for user {user_id}")
top_n_similar_userids = list(similar_users.index)
similar_users_n = []
# Use a separate loop variable so the queried user_id is not overwritten
for similar_user_id in top_n_similar_userids:
  similar_users_n.append(df_explode[df_explode["user_id"] == similar_user_id][["user_id", "user_name"]])
similar_users_n_df = pd.concat(similar_users_n)
similar_users_n_df

Top 10 similar users for user ur0009605


Unnamed: 0,user_id,user_name
1858,ur30300958,yann-pastor
1861,ur7752337,PoisonKeyblade
1848,ur2400652,eamon-hennedy
1859,ur22761043,bbgrl93
1855,ur9130005,aabonander
1851,ur0009605,TuckMN
1847,ur1224804,Bgb217
1845,ur5032964,stonedonkies
1846,ur1492550,plumberguy66
1854,ur139059812,Legendddd
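
In the output above, the queried user `ur0009605` (TuckMN) appears among its own neighbours. A likely cause is ties: users with identical review vectors also score 1.0, so after sorting the queried user is not guaranteed to be the first entry that `.iloc[1:]` skips. A minimal sketch of a variant that removes the user by label instead is shown below. The helper name `top_n_excluding_self` and the toy matrix are hypothetical.

```python
import pandas as pd

def top_n_excluding_self(sim_df, key, n=10):
    """Return the n highest-scoring labels for `key`, never including `key` itself."""
    scores = sim_df[key].drop(labels=key)  # remove the self-similarity entry by label
    return scores.nlargest(n)

# Toy similarity matrix in which u1 and u2 are tied at 1.0.
users = ["u1", "u2", "u3"]
toy_sim = pd.DataFrame(
    [[1.0, 1.0, 0.2],
     [1.0, 1.0, 0.0],
     [0.2, 0.0, 1.0]],
    index=users, columns=users,
)

print(top_n_excluding_self(toy_sim, "u2", n=2))
```

The same helper would work for the show-to-show matrix, since it only relies on the label being present in both the index and the columns.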


In [29]:
def recomend_shows_to_users(show_title, n= 10):
  # Get the users who rated this show
  rated_users = df_explode[df_explode["title"] == show_title]["user_id"]

  # Find the users similar to those who rated the show
  similar_users = []
  for user_id in rated_users:
    sim_users = top_n_similar_users(user_id)
    similar_users = similar_users + list(sim_users.index)
  # Convert the list to series so that value_counts can be applied
  similar_users_series = pd.Series(similar_users)

  # Get top n similar users
  top_similar_users = similar_users_series.value_counts().head(n)

  return list(top_similar_users.index)

In [31]:
show_title = "Adventure Time"
n = 10
users_to_recomend = recomend_shows_to_users(show_title, n)

print(f"Top {n} similar users for show {show_title}:")
n_users_to_recomend = []
for user_id in users_to_recomend:
  n_users_to_recomend.append(df_explode[df_explode["user_id"] == user_id][["user_id", "user_name"]])
n_users_to_recomend_df = pd.concat(n_users_to_recomend)
n_users_to_recomend_df

Top 10 similar users for show Adventure Time:


Unnamed: 0,user_id,user_name
3461,ur105082429,patrickfilbeck
3483,ur23933362,he_is_sparticus
3468,ur50615829,Charons_Nightmare
3465,ur90494228,danielgeng-44221
3463,ur66519838,sproutman
3469,ur2021461,bregund
3474,ur98337912,saraaorabi
3466,ur23537571,Nemsi
3470,ur11391828,poke_a_polk
3473,ur18183678,Nnnk-1
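
The function above ranks candidate users by how often they appear in the neighbour lists of the show's reviewers. That voting step can be sketched with `collections.Counter`, which avoids building one long list and converting it to a Series. The helper name `most_common_neighbours` and the toy neighbour lists are hypothetical.

```python
from collections import Counter

def most_common_neighbours(neighbour_lists, n=10):
    """Rank users by how many neighbour lists they appear in."""
    votes = Counter()
    for neighbours in neighbour_lists:
        votes.update(neighbours)  # one vote per appearance
    return [user for user, _ in votes.most_common(n)]

# One neighbour list per reviewer of the show (toy values).
neighbour_lists = [["u2", "u3"], ["u3", "u4"], ["u3", "u2"]]
print(most_common_neighbours(neighbour_lists, n=2))
```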


## Top N Similar Shows to Recommend to a User

In [33]:
cosine_sim_show_matrix = cosine_similarity(df_pivot_filled.T)

In [34]:
# Create a dataframe for similarity matrix for shows
cosine_similar_show_df = pd.DataFrame(cosine_sim_show_matrix, columns = df_pivot_filled.columns, index = df_pivot_filled.columns)

In [35]:
cosine_similar_show_df

title,1883,Adventure Time,Africa,Alfred Hitchcock Presents,Anne with an E,Apocalypse: The Second World War,Arcane,Archer,Arrested Development,As If,...,What We Do in the Shadows,When They See Us,Whose Line Is It Anyway?,X-Men: The Animated Series,Yeh Meri Family,Yellowstone,Yes Minister,"Yes, Prime Minister",Young Justice,Your Lie in April
title,,,,,,,,,,,,,,,,,,,,,
1883,1.000000,0.00,0.000000,0.000000,0.040825,0.000000,0.000000,0.000000,0.081650,0.0,...,0.081650,0.0,0.000000,0.000000,0.0,0.163299,0.000000,0.000000,0.000000,0.000000
Adventure Time,0.000000,1.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.040000,0.000000,0.0,...,0.000000,0.0,0.000000,0.040000,0.0,0.000000,0.000000,0.000000,0.040000,0.000000
Africa,0.000000,0.00,1.000000,0.000000,0.000000,0.057735,0.000000,0.000000,0.057735,0.0,...,0.000000,0.0,0.000000,0.057735,0.0,0.000000,0.000000,0.066227,0.057735,0.000000
Alfred Hitchcock Presents,0.000000,0.00,0.000000,1.000000,0.000000,0.000000,0.000000,0.040825,0.000000,0.0,...,0.000000,0.0,0.000000,0.000000,0.0,0.040825,0.000000,0.000000,0.000000,0.000000
Anne with an E,0.040825,0.00,0.000000,0.000000,1.000000,0.000000,0.000000,0.000000,0.000000,0.0,...,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Yellowstone,0.163299,0.00,0.000000,0.040825,0.000000,0.000000,0.000000,0.000000,0.080000,0.0,...,0.080000,0.0,0.000000,0.000000,0.0,1.000000,0.000000,0.000000,0.000000,0.000000
Yes Minister,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.041703,0.000000,0.000000,0.0,...,0.040000,0.0,0.000000,0.000000,0.0,0.000000,1.000000,0.321182,0.000000,0.040000
"Yes, Prime Minister",0.000000,0.00,0.066227,0.000000,0.000000,0.045883,0.047836,0.000000,0.045883,0.0,...,0.045883,0.0,0.046829,0.000000,0.0,0.000000,0.321182,1.000000,0.000000,0.045883
Young Justice,0.000000,0.04,0.057735,0.000000,0.000000,0.000000,0.000000,0.040000,0.040000,0.0,...,0.000000,0.0,0.000000,0.120000,0.0,0.000000,0.000000,0.000000,1.000000,0.000000


In [38]:
def top_n_similar_shows(show_title, n = 10):
  # Get similarity score for the show
  similarity_score = cosine_similar_show_df[show_title]

  # Sort the score in descending order and get top n
  similar_shows = similarity_score.sort_values(ascending=False).head(n + 1).iloc[1:]

  return similar_shows

In [39]:
show_title = "Yellowstone"
n= 10

similar_shows = top_n_similar_shows(show_title, n)

print(f"Top {n} similar shows for show {show_title}")
similar_shows

Top 10 similar shows for show Yellowstone


title
1883                                 0.163299
The Mandalorian                      0.160000
The Crown                            0.120000
Spartacus: Gods of the Arena         0.120000
Southland                            0.120000
Stranger Things                      0.120000
Justified                            0.120000
True Detective                       0.120000
It's Always Sunny in Philadelphia    0.120000
Seinfeld                             0.120000
Name: Yellowstone, dtype: float64

In [58]:
def recomend_n_shows_to_user(user_id, n= 10):
  # Get the shows that the user rated
  rated_shows= df_explode[df_explode["user_id"] == user_id]["title"]

  # Find the similar shows for these shows
  similar_shows= []
  for show_title in rated_shows:
    sim_shows = top_n_similar_shows(show_title)
    similar_shows = similar_shows + list(sim_shows.index)

  # Convert the list into a Series so that value_counts can be applied
  similar_shows_series = pd.Series(similar_shows)

  # Get top n similar shows
  top_n_sim_shows = similar_shows_series.value_counts().head(n)

  return list(top_n_sim_shows.index)

In [59]:
user_id = "ur105082429"
n= 10

recomend_shows = recomend_n_shows_to_user(user_id, n)

print(f"Top {n} similar shows for user {user_id}")
recomend_shows

Top 10 similar shows for user ur105082429


['Over the Garden Wall',
 'Samurai Jack',
 'Dragon Ball Z',
 'Last Week Tonight with John Oliver',
 'Code Geass',
 'South Park',
 'Regular Show',
 'Berserk',
 'Vinland Saga',
 'Young Justice']
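
The function above does not filter out shows the user has already reviewed, so those can in principle be recommended back to them. A minimal sketch of the same voting idea with that filter added is shown below. The helper name `recommend_unseen`, the `similar_lookup` dictionary and the toy titles are all hypothetical. In the notebook, the lookup would come from `top_n_similar_shows`.

```python
from collections import Counter

def recommend_unseen(reviewed_shows, similar_lookup, n=10):
    """Vote for shows similar to the reviewed ones, skipping shows already reviewed."""
    seen = set(reviewed_shows)
    votes = Counter()
    for show in reviewed_shows:
        for candidate in similar_lookup.get(show, []):
            if candidate not in seen:
                votes[candidate] += 1
    return [show for show, _ in votes.most_common(n)]

# Toy mapping from a show to its most similar shows.
similar_lookup = {
    "Show A": ["Show B", "Show C"],
    "Show B": ["Show A", "Show C", "Show D"],
}
print(recommend_unseen(["Show A", "Show B"], similar_lookup, n=2))
```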