# MOVIE RECOMMENDER SYSTEM

<b>Certificate Project</b>

<b><i>Domain: OTT Platform</i></b>

<b><u>Context</u></b>
Over the past two decades, there has been a monumental shift in how people access and consume video content. With near-universal access to broadband internet, platforms such as YouTube, Netflix, and HBO Go emerged and steadily grew to prominence.
Although not a household name in itself, OTT is the technology that made the streaming revolution possible.
OTT stands for “Over The Top” and refers to any video streaming service that delivers content to users over the internet. Platforms such as Prime Video, Netflix, Hotstar, Zee5, and SonyLIV typically charge a subscription fee.
But choosing your next movie to watch can still be a daunting task, even if you have access to all the platforms.

<b><u>Business Requirement:</u></b>
“MyNextMovie” is a budding startup in the recommendation space that suggests to its customers which movie to watch next across various OTT platforms.
Its core business is a recommendation layer on top of these OTT platforms. Since the company is still in research mode, it wants to experiment with open-source data first to understand what kinds of models it can deliver.
The data for this exercise is open-source data collected and made available by the MovieLens website (http://movielens.org), a part of GroupLens Research. The data sets were collected over various periods of time, depending on the size of the set.
You have recently joined “MyNextMovie” as a Data Scientist and plan to help the existing team set up a recommendation platform.


<b><u>Objective:</u></b>
1. Create a popularity-based recommender system at a genre level. The user inputs a genre (g), a minimum reviews threshold (t) for a movie, and the number of recommendations (N). The system should recommend the top N movies that are most popular within genre (g), ordered by rating in descending order, where each movie has at least (t) reviews.


<ul>
<li>Genre (g): Comedy</li>
<li>Minimum reviews threshold (t): 100</li>
<li>Num recommendations (N): 5</li>
</ul>

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [2]:
movies=pd.read_csv('movies.csv')
ratings=pd.read_csv('ratings.csv')

In [3]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10329 entries, 0 to 10328
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  10329 non-null  int64 
 1   title    10329 non-null  object
 2   genres   10329 non-null  object
dtypes: int64(1), object(2)
memory usage: 242.2+ KB


In [4]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105339 entries, 0 to 105338
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     105339 non-null  int64  
 1   movieId    105339 non-null  int64  
 2   rating     105339 non-null  float64
 3   timestamp  105339 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.2 MB


In [5]:
movies.shape

(10329, 3)

In [6]:
ratings.shape

(105339, 4)

Average rating and Total movies at genre level.

In [7]:
ratings.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,105339.0,105339.0,105339.0,105339.0
mean,364.924539,13381.312477,3.51685,1130424000.0
std,197.486905,26170.456869,1.044872,180266000.0
min,1.0,1.0,0.5,828565000.0
25%,192.0,1073.0,3.0,971100800.0
50%,383.0,2497.0,3.5,1115154000.0
75%,557.0,5991.0,4.0,1275496000.0
max,668.0,149532.0,5.0,1452405000.0


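From the above table we can conclude that:

The average rating is about 3.5, and the minimum and maximum ratings are 0.5 and 5 respectively.
The largest userId is 668 and the largest movieId is 149532. These are identifier values rather than counts: movies.csv lists only 10,329 movies, so movieId values are not contiguous.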
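
The difference between the largest identifier and the number of distinct identifiers can be checked directly. A minimal sketch on a toy frame; the values below are made up for illustration and are not taken from ratings.csv:

```python
import pandas as pd

# Toy ratings frame; the values are illustrative only.
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 668],
    "movieId": [16, 149532, 16, 24],
    "rating": [4.0, 3.5, 5.0, 1.5],
})

# max() reports the largest identifier; nunique() counts distinct identifiers.
max_movie_id = toy_ratings["movieId"].max()
n_movies = toy_ratings["movieId"].nunique()
n_users = toy_ratings["userId"].nunique()

print(max_movie_id, n_movies, n_users)
```

Running the same two `nunique()` calls on the real `ratings` frame would give the actual number of users and rated movies.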

In [8]:
df=pd.merge(ratings,movies, how='left',on='movieId')
df.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,16,4.0,1217897793,Casino (1995),Crime|Drama
1,1,24,1.5,1217895807,Powder (1995),Drama|Sci-Fi
2,1,32,4.0,1217896246,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller
3,1,47,4.0,1217896556,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,4.0,1217896523,"Usual Suspects, The (1995)",Crime|Mystery|Thriller


In [9]:
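# Sum of all ratings per title; this rewards titles with many ratings, not only highly rated ones
df1=df.groupby(['title'])[['rating']].sum()
# Keep the 20 titles with the largest rating totals
high_rated=df1.nlargest(20,'rating')
high_rated.head()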

Unnamed: 0_level_0,rating
title,Unnamed: 1_level_1
"Shawshank Redemption, The (1994)",1372.0
Pulp Fiction (1994),1352.0
Forrest Gump (1994),1287.0
"Silence of the Lambs, The (1991)",1216.5
Star Wars: Episode IV - A New Hope (1977),1143.5
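
Summing ratings per title, as in the cell above, mixes how often a movie is rated with how well it is rated. A small sketch, using a made-up two-title frame, shows how the sum and the mean can rank titles differently:

```python
import pandas as pd

# Toy data: "A" has many average ratings, "B" has few excellent ones.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "A", "B", "B"],
    "rating": [3.0, 3.0, 3.0, 3.0, 5.0, 5.0],
})

# Compute count, mean, and sum side by side for each title.
summary = toy.groupby("title")["rating"].agg(["count", "mean", "sum"])

by_sum = summary["sum"].idxmax()    # title with the largest rating total
by_mean = summary["mean"].idxmax()  # title with the highest average rating

print(summary)
print(by_sum, by_mean)
```

Keeping the count next to the mean makes it possible to apply a minimum-reviews threshold before ranking by average rating.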


In [10]:
# Number of ratings per title
df2=df.groupby('title')[['rating']].count()
# Keep the 20 most frequently rated titles
rating_count_20=df2.nlargest(20,'rating')
rating_count_20.head()

Unnamed: 0_level_0,rating
title,Unnamed: 1_level_1
Pulp Fiction (1994),325
Forrest Gump (1994),311
"Shawshank Redemption, The (1994)",308
Jurassic Park (1993),294
"Silence of the Lambs, The (1991)",290


Unique genres considered

In [11]:
cv=TfidfVectorizer()
tfidf_matrix=cv.fit_transform(movies['genres'])
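
`TfidfVectorizer` is used here with its default settings, so it lowercases the text and tokenizes with its documented default pattern `(?u)\b\w\w+\b`. The sketch below applies that pattern with the standard `re` module to show how a pipe-separated genre string is split. The second pattern is an assumption, shown only as one possible way to keep hyphenated genres whole:

```python
import re

# Default token pattern documented for scikit-learn's text vectorizers.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"
# Assumption for illustration: treat each pipe-separated genre as one token.
PIPE_TOKEN_PATTERN = r"[^|]+"

# The vectorizer lowercases its input by default, so do the same here.
genres = "Mystery|Sci-Fi|Thriller".lower()

default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, genres)
pipe_tokens = re.findall(PIPE_TOKEN_PATTERN, genres)

print(default_tokens)
print(pipe_tokens)
```

With the default pattern, "Sci-Fi" becomes two separate features. Passing `token_pattern=r"[^|]+"` to `TfidfVectorizer` would be one way to avoid that, if whole-genre features are preferred.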

In [12]:
movie_user = df.pivot_table(index='userId',columns='title',values='rating')
movie_user.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Til There Was You (1997),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...And Justice for All (1979),10 (1979),...,[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),a/k/a Tommy Chong (2005),eXistenZ (1999),loudQUIETloud: A Film About the Pixies (2006),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


In [13]:
# Merge the two datasets
df = pd.merge(movies, ratings, on='movieId')

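# Filter the data based on the genre and minimum rating threshold.
# Note: str.contains treats the input as a regular expression, so "|" means "or";
# "Mystery|Sci-Fi|Thriller" matches any movie that has at least one of these genres.
# Note: ratings below min_rating are dropped before counting and averaging, so
# min_reviews counts only ratings >= min_rating and the mean covers only those ratings.
genre = input("Enter a genre: ")
min_rating = float(input("Enter the minimum rating threshold: "))
min_reviews = int(input("Enter the minimum number of reviews: "))
filtered_data = df[(df['genres'].str.contains(genre)) & (df['rating'] >= min_rating)]
filtered_data = filtered_data.groupby('title').filter(lambda x: len(x) >= min_reviews)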

# Compute the mean rating for each movie and sort by descending order
mean_ratings = filtered_data.groupby('title')['rating'].mean().sort_values(ascending=False)

# Recommend the top N movies
N = int(input("Enter the number of recommendations: "))
top_N = mean_ratings.head(N)

print(f"Top {N} recommended movies for {genre} genre with at least {min_reviews} reviews and rating threshold of {min_rating}:")
print(top_N)


Enter a genre: Mystery|Sci-Fi|Thriller
Enter the minimum rating threshold: 4.0
Enter the minimum number of reviews: 200
Enter the number of recommendations: 5
Top 5 recommended movies for Mystery|Sci-Fi|Thriller genre with at least 200 reviews and rating threshold of 4.0:
title
Matrix, The (1999)                           4.582160
Pulp Fiction (1994)                          4.569721
Star Wars: Episode IV - A New Hope (1977)    4.563679
Silence of the Lambs, The (1991)             4.504310
Name: rating, dtype: float64
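
The objective defines (t) as a minimum number of reviews and ranks by average rating over all of a movie's ratings. A minimal sketch of that reading follows. The function name `popular_in_genre` and the toy data are hypothetical, and exact genre matching (rather than a regular-expression match) is an assumption about the intended behaviour:

```python
import pandas as pd

def popular_in_genre(df, genre, min_reviews, n):
    """Top-n titles in `genre` by mean rating, among titles with >= min_reviews ratings."""
    # Exact genre match: split the pipe-separated string and test membership.
    in_genre = df["genres"].str.split("|").apply(lambda gs: genre in gs)
    stats = (
        df[in_genre]
        .groupby("title")["rating"]
        .agg(num_reviews="count", avg_rating="mean")
    )
    # Apply the review-count threshold after counting all ratings.
    stats = stats[stats["num_reviews"] >= min_reviews]
    return stats.sort_values(["avg_rating", "num_reviews"], ascending=False).head(n)

# Toy merged frame (illustrative values only).
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B", "C"],
    "genres": ["Comedy|Drama"] * 3 + ["Comedy"] * 2 + ["Drama"],
    "rating": [4.0, 5.0, 3.0, 5.0, 4.0, 5.0],
})

result = popular_in_genre(toy, "Comedy", min_reviews=2, n=5)
print(result)
```

Because no rating is discarded before aggregation, the average reflects every review a movie received, and the threshold counts all of them.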


In [14]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
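
`linear_kernel` computes plain dot products. It can stand in for cosine similarity here because `TfidfVectorizer` L2-normalises each row by default (`norm="l2"`). The NumPy sketch below, on a made-up matrix, illustrates that the two quantities coincide for unit-length rows:

```python
import numpy as np

# Toy feature matrix (illustrative values only).
X = np.array([[1.0, 2.0, 0.0],
              [0.0, 3.0, 4.0]])

# L2-normalise each row so every row has length 1.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / row_norms

# Dot products of unit rows: what a linear kernel returns for normalised input.
dot_products = X_unit @ X_unit.T

# Cosine similarity computed from the raw rows, for comparison.
cosine = (X @ X.T) / (row_norms @ row_norms.T)

print(np.allclose(dot_products, cosine))
```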

2. Create a content-based recommender system that recommends the top N movies whose genres are similar to those of a given movie (m).


In [15]:
#2.Create a content-based recommender system that recommends top N movies based on similar movie(m) genres.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Compute the TF-IDF vectors for each movie genre
tfidf = TfidfVectorizer(stop_words='english', max_df=0.8)
genres = tfidf.fit_transform(movies['genres'])

# Compute the cosine similarity between each pair of movies
cosine_sim = cosine_similarity(genres, genres)

# Get the index of the input movie
title = input("Enter a movie title: ")
idx = movies[movies['title'] == title].index[0]

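# Get the top N similar movies.
# Caveat: the [1:N+1] slice assumes the input movie sorts first. Movies with
# identical genres all tie at similarity 1.0, so another movie can take the first
# slot and the input movie itself can appear in the results, as in the output below.
N = int(input("Enter the number of recommendations: "))
sim_scores = list(enumerate(cosine_sim[idx]))
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:N+1]
movie_indices = [i[0] for i in sim_scores]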

# Print the recommended movies
print(f"Top {N} recommended movies based on similar genres to {title}:")
print(movies['title'].iloc[movie_indices])


Enter a movie title: Star Wars: Episode IV - A New Hope (1977)
Enter the number of recommendations: 5
Top 5 recommended movies based on similar genres to Star Wars: Episode IV - A New Hope (1977):
230            Star Wars: Episode IV - A New Hope (1977)
277                                      Stargate (1994)
390                                Demolition Man (1993)
958    Star Wars: Episode V - The Empire Strikes Back...
971    Star Wars: Episode VI - Return of the Jedi (1983)
Name: title, dtype: object
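
The output above lists the input movie among its own recommendations, because several movies tie at the top similarity score. One way to avoid this is to drop the query movie explicitly before ranking. The sketch below is an illustration under stated assumptions: `similar_by_genre` is a hypothetical name, the data is made up, and binary genre indicators are used in place of TF-IDF weights to keep the example free of extra dependencies.

```python
import numpy as np
import pandas as pd

def similar_by_genre(movies, title, n):
    """Return the n titles whose genre vectors are most similar to `title`."""
    movies = movies.reset_index(drop=True)
    # One 0/1 column per genre (binary indicators, not TF-IDF weights).
    indicators = movies["genres"].str.get_dummies(sep="|").to_numpy(dtype=float)
    norms = np.linalg.norm(indicators, axis=1, keepdims=True)
    unit = indicators / np.where(norms == 0, 1.0, norms)

    matches = movies.index[movies["title"] == title]
    if len(matches) == 0:
        raise ValueError(f"Title not found: {title!r}")
    pos = matches[0]

    # Cosine similarity of every movie to the query movie.
    scores = pd.Series(unit @ unit[pos], index=movies.index)
    scores = scores.drop(pos)  # exclude the query movie itself
    top = scores.sort_values(ascending=False, kind="mergesort").head(n)
    return movies.loc[top.index, "title"]

# Toy catalogue: "Twin" shares all genres with "Q" and appears before it.
toy_movies = pd.DataFrame({
    "title": ["Twin", "Q", "Half", "Other"],
    "genres": ["Action|Sci-Fi", "Action|Sci-Fi", "Action", "Comedy"],
})

result = similar_by_genre(toy_movies, "Q", n=2)
print(result.tolist())
```

Dropping the query row by label does not depend on where ties happen to fall in the sort order.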


3. Create a collaborative-based recommender system that recommends the top N movies based on the “K” users most similar to a target user “u”.


In [16]:
#3. Create a collaborative based recommender system which recommends top N movies based on “K” similar users for a target user “u”
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Load the ratings dataset
ratings = pd.read_csv('ratings.csv')

# Filter the data based on the user ID
user_id = int(input("Enter a user ID: "))
user_ratings = ratings[ratings['userId'] == user_id]

# Compute the cosine similarity between each pair of users
user_similarity = cosine_similarity(ratings.pivot_table(index=['userId'], columns=['movieId'], values='rating', fill_value=0))

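# Get the top K similar users.
# Caveat: user_similarity is a NumPy array indexed by row position (0-based), not
# by userId. Row position 0 holds userId 1, so user_similarity[user_id] reads the
# row of a different user, and the positions in user_indices are not userId labels.
K = int(input("Enter the number of similar users: "))
sim_users = list(enumerate(user_similarity[user_id]))
sim_users = sorted(sim_users, key=lambda x: x[1], reverse=True)[1:K+1]
user_indices = [i[0] for i in sim_users]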

# Filter the data based on the similar users and compute the mean rating for each movie
similar_users_ratings = ratings[ratings['userId'].isin(user_indices)]
mean_ratings = similar_users_ratings.groupby('movieId')['rating'].mean().sort_values(ascending=False)

# Exclude the movies that the target user has already rated
already_rated = user_ratings['movieId'].tolist()
mean_ratings = mean_ratings[~mean_ratings.index.isin(already_rated)]

# Get the top N recommended movies
N = int(input("Enter the number of recommendations: "))
top_N = mean_ratings.head(N)

# Print the recommended movies
print(f"Top {N} recommended movies for user {user_id} based on {K} similar users:")
print(pd.merge(top_N, pd.read_csv('movies.csv'), on='movieId')['title'])


Enter a user ID: 2
Enter the number of similar users: 20
Enter the number of recommendations: 4
Top 4 recommended movies for user 2 based on 20 similar users:
0                     Little Big Man (1970)
1    Superman/Batman: Public Enemies (2009)
2                         Women, The (1939)
3                  One for the Money (2012)
Name: title, dtype: object
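
Since the pivot table is labelled by userId while a NumPy similarity matrix is indexed by row position, it can help to keep the userId labels attached to the similarity scores. The following is a sketch of one way to do that, not the notebook's method. `recommend_from_neighbours` is a hypothetical name and the ratings are made up, with non-contiguous user IDs chosen on purpose.

```python
import numpy as np
import pandas as pd

def recommend_from_neighbours(ratings, user_id, k, n):
    """Mean rating by the k most similar users, for movies user_id has not rated."""
    # Rows are labelled by userId, columns by movieId; unrated cells become 0.
    matrix = ratings.pivot_table(index="userId", columns="movieId",
                                 values="rating", fill_value=0)
    values = matrix.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    unit = values / np.where(norms == 0, 1.0, norms)

    # Cosine similarity to the target user, indexed by userId label.
    target_row = unit[matrix.index.get_loc(user_id)]
    sims = pd.Series(unit @ target_row, index=matrix.index)
    neighbours = sims.drop(user_id).nlargest(k).index

    # Keep neighbours' ratings for movies the target user has not rated.
    seen = ratings.loc[ratings["userId"] == user_id, "movieId"]
    candidates = ratings[ratings["userId"].isin(neighbours)
                         & ~ratings["movieId"].isin(seen)]
    return (candidates.groupby("movieId")["rating"].mean()
            .sort_values(ascending=False).head(n))

# Toy ratings with non-contiguous user IDs (illustrative values only).
toy_ratings = pd.DataFrame({
    "userId":  [10, 10, 20, 20, 20, 30, 30],
    "movieId": [1, 2, 1, 2, 3, 1, 4],
    "rating":  [5.0, 4.0, 5.0, 4.0, 5.0, 1.0, 2.0],
})

result = recommend_from_neighbours(toy_ratings, user_id=10, k=1, n=5)
print(result)
```

Here user 20 is the closest neighbour of user 10, so the only candidate is movie 3, which user 10 has not rated.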


In [17]:
import ipywidgets as widgets
from IPython.display import display


In [18]:
recommendation_type = widgets.Dropdown(
    options=['Popularity-based', 'Content-based', 'Collaborative-based'],
    value='Popularity-based',
    description='Recommendation type:',
)


In [19]:
search_input = widgets.Text(
    placeholder='Enter movie or genre name',
    description='Search:',
)


In [20]:
num_recommendations = widgets.IntSlider(
    value=5,
    min=1,
    max=10,
    step=1,
    description='Number of recommendations:',
    orientation='horizontal',
)


In [21]:
recommend_button = widgets.Button(
    description='Generate recommendations',
    button_style='success',
)


In [22]:
def generate_recommendations(sender):
    # Get the selected recommendation type
    recommendation_type_value = recommendation_type.value
    
    # Get the search input value
    search_input_value = search_input.value
    
    # Get the number of recommendations to display
    num_recommendations_value = num_recommendations.value
    
    # TODO: Call the appropriate recommendation function based on the selected recommendation type and display the results.
    
    # Clear the search input
    search_input.value = ''
    
# Register the generate_recommendations function to be called when the button is clicked
recommend_button.on_click(generate_recommendations)

# Display the widgets
display(recommendation_type, search_input, num_recommendations, recommend_button)


Dropdown(description='Recommendation type:', options=('Popularity-based', 'Content-based', 'Collaborative-base…

Text(value='', description='Search:', placeholder='Enter movie or genre name')

IntSlider(value=5, description='Number of recommendations:', max=10, min=1)

Button(button_style='success', description='Generate recommendations', style=ButtonStyle())
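
The TODO in `generate_recommendations` could be filled with a lookup from the dropdown value to a recommender function. The sketch below shows only that dispatch step in plain Python, without widgets. `dispatch_recommendation` and the three stand-in handlers are hypothetical; the notebook does not define its recommenders as functions.

```python
def dispatch_recommendation(kind, query, n, handlers):
    """Look up the handler registered for `kind` and call it with the query and n."""
    try:
        handler = handlers[kind]
    except KeyError:
        raise ValueError(f"Unknown recommendation type: {kind!r}")
    return handler(query, n)

# Hypothetical stand-ins; real handlers would wrap the three recommenders.
def popularity_stub(query, n):
    return [f"popular in {query} #{i + 1}" for i in range(n)]

def content_stub(query, n):
    return [f"similar to {query} #{i + 1}" for i in range(n)]

def collaborative_stub(query, n):
    return [f"for user {query} #{i + 1}" for i in range(n)]

# Keys match the dropdown options defined earlier in the notebook.
handlers = {
    "Popularity-based": popularity_stub,
    "Content-based": content_stub,
    "Collaborative-based": collaborative_stub,
}

result = dispatch_recommendation("Content-based", "Casino (1995)", 2, handlers)
print(result)
```

Inside the button callback, the three arguments would come from `recommendation_type.value`, `search_input.value`, and `num_recommendations.value`. The collaborative option needs a user ID rather than a title or genre, so the search box text would have to be interpreted accordingly.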