<div style="text-align: center; font-size: 40px; font-weight: bold; color: orange;">
     Hybrid Recommender System
</div>

Make 10 movie recommendations for the given user ID using the item-based and user-based recommender methods.


The dataset is provided by MovieLens, a movie recommendation service. It contains 20,000,263 ratings for 27,278 movies, given by 138,493 users between January 09, 1995 and March 31, 2015. The dataset was created on October 17, 2016. Users were selected at random, and every selected user had rated at least 20 movies.


**movie.csv**

| **Column** | **Description**      |
|------------|----------------------|
| movieId    | Unique movie ID.  |
| title      | Movie title.          |
| genres     | Genre.                |


**rating.csv**

| **Column**  | **Description**                              |
|-------------|----------------------------------------------|
| userId      | Unique user ID (UniqueID).               |
| movieId     | Unique movie ID (UniqueID).              |
| rating      | The rating given to the movie by the user.   |
| timestamp   | The date of the rating.                      |


<div style="text-align: center; font-size: 24px; font-weight: bold; color: green;">
    User Based Recommendation
</div>

In [1]:
# import pandas as pd
# print(pd.__version__)

Importing pandas and setting the display options.

In [2]:
import pandas as pd
pd.options.display.max_columns=10
pd.options.display.max_rows=20
pd.options.display.float_format = '{:.3f}'.format
pd.options.display.width = 1000
pd.set_option('display.expand_frame_repr', False)

Loading the datasets

In [3]:
movie = pd.read_csv('/kaggle/input/movielens-20m-dataset/movie.csv')
movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
movie.shape

(27278, 3)

In [5]:
rating = pd.read_csv('/kaggle/input/movielens-20m-dataset/rating.csv')
rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40


In [6]:
rating.shape

(20000263, 4)

Merging the datasets

In [7]:
df = pd.merge(movie,rating, how="inner", on="movieId")
df.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3,4.0,1999-12-11 13:36:47
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,6,5.0,1997-03-13 17:50:52
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8,4.0,1996-06-05 13:37:51
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,10,4.0,1999-11-25 02:44:47
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11,4.5,2009-01-02 01:13:41


In [8]:
df.shape

(20000263, 6)

In [9]:
df.isnull().sum()

movieId      0
title        0
genres       0
userId       0
rating       0
timestamp    0
dtype: int64

Calculating the total number of ratings for each movie and storing the counts in a dataframe

In [10]:
comment_counts = pd.DataFrame(df["title"].value_counts())
comment_counts.head()

Unnamed: 0_level_0,count
title,Unnamed: 1_level_1
Pulp Fiction (1994),67310
Forrest Gump (1994),66172
"Shawshank Redemption, The (1994)",63366
"Silence of the Lambs, The (1991)",63299
Jurassic Park (1993),59715


We keep the titles of movies with fewer than 1000 total ratings in `rare_movies`, and the rows for all remaining movies (1000 or more ratings) in `common_movies`.

In [11]:
rare_movies = comment_counts[comment_counts["count"] < 1000].index
print(rare_movies)

Index(['Rosewood (1997)', 'One Night at McCool's (2001)', 'Ted (2012)', 'Bear, The (Ours, L') (1988)', 'Marked for Death (1990)', 'Adam's Rib (1949)', 'Three to Tango (1999)', 'Stakeout (1987)', 'I Now Pronounce You Chuck and Larry (2007)', 'Someone Like You (2001)',
       ...
       'Expert, The (1995)', 'Goliath Awaits (1981)', 'Short Eyes (1977)', 'Restless Souls (Bag det stille ydre) (2005)', 'Cold Trail (Köld slóð) (2006)', 'Easy Wheels (1989)', 'Ditirambo (1969)', 'Scorching Winds (Garm Hava) (Garam Hawa) (1974)', 'Serrallonga (2008)', 'Innocence (2014)'], dtype='object', name='title', length=23570)


In [12]:
common_movies = df[~df["title"].isin(rare_movies)]
common_movies.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3,4.0,1999-12-11 13:36:47
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,6,5.0,1997-03-13 17:50:52
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8,4.0,1996-06-05 13:37:51
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,10,4.0,1999-11-25 02:44:47
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11,4.5,2009-01-02 01:13:41


In [13]:
common_movies["title"].nunique()

3159

Creating a pivot table for the dataframe with userIDs in the index, movie names in the columns, and ratings as values.

In [14]:
#user_movie_df = common_movies.pivot_table(index=["userId"], columns=["title"], values="rating")
user_movie_df = common_movies.groupby(["userId","title"])["rating"].mean().unstack()
user_movie_df.head(20)

title,"'burbs, The (1989)",(500) Days of Summer (2009),*batteries not included (1987),...And Justice for All (1979),10 Things I Hate About You (1999),...,Zulu (1964),[REC] (2007),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,,,,,,...,,,,,
2,,,,,,...,,,,,
3,,,,,,...,,,,,
4,,,,,,...,,,,,
5,,,,,,...,,,,,
6,,,,,,...,,,,,
7,,,,,,...,,,,,2.0
8,,,,,,...,,,,,
9,,,,,,...,,,,,
10,,,,,,...,,,,,
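
As a small illustration of the reshaping step, the sketch below builds the same kind of user-by-title matrix from a toy ratings table (the values are made up, not taken from MovieLens). It assumes only pandas and shows both the groupby/unstack form used above and the commented-out `pivot_table` form, which should give the same matrix because `pivot_table` averages duplicates by default.

```python
import pandas as pd

# Toy ratings (hypothetical data, not from MovieLens).
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "title":  ["A", "B", "A", "C", "B"],
    "rating": [4.0, 3.0, 5.0, 2.0, 1.0],
})

# Form used in the cell above: group, average, then move titles into columns.
by_groupby = toy.groupby(["userId", "title"])["rating"].mean().unstack()

# Commented-out alternative: pivot_table, whose default aggregation is the mean.
by_pivot = toy.pivot_table(index="userId", columns="title", values="rating")

print(by_groupby)
print(by_pivot)
```

Movies a user has not rated appear as NaN in both results, which is what the later correlation steps rely on.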


Functionalize all the operations done above

In [15]:
# def create_user_movie_df():
#     import pandas as pd
#     movie = pd.read_csv('recommender_systems/datasets/movie_lens_dataset/movie.csv')
#     rating = pd.read_csv('recommender_systems/datasets/movie_lens_dataset/rating.csv')
#     df = movie.merge(rating, how="inner", on="movieId")
#     comment_counts = pd.DataFrame(df["title"].value_counts())
#     rare_movies = comment_counts[comment_counts["count"] < 1000].index
#     common_movies = df[~df["title"].isin(rare_movies)]
#     user_movie_df = common_movies.pivot_table(index=["userId"], columns=["title"], values="rating")
#     return user_movie_df

# user_movie_df = create_user_movie_df()
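
A possible runnable variant of the helper is sketched below. It is an assumption-laden rewrite rather than the notebook's own function: it takes the two dataframes as arguments instead of reading fixed paths, exposes the threshold as a hypothetical `min_ratings` parameter, and filters on the `value_counts()` Series directly so that it does not depend on whether the counts column is called `title` or `count` in the installed pandas version.

```python
import pandas as pd

def create_user_movie_df(movie, rating, min_ratings=1000):
    """Build a user x title rating matrix from titles with at least min_ratings ratings."""
    df = movie.merge(rating, how="inner", on="movieId")
    counts = df["title"].value_counts()                  # ratings per title
    common_titles = counts[counts >= min_ratings].index  # titles to keep
    common = df[df["title"].isin(common_titles)]
    return common.groupby(["userId", "title"])["rating"].mean().unstack()

# Toy data (hypothetical) to exercise the function with a small threshold.
movie_toy = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
rating_toy = pd.DataFrame({
    "userId":  [1, 2, 3, 1, 2, 3],
    "movieId": [1, 1, 1, 2, 2, 3],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.0, 1.0],
})

toy_matrix = create_user_movie_df(movie_toy, rating_toy, min_ratings=2)
print(toy_matrix)
```

With the real data, the call would be `create_user_movie_df(movie, rating)` using the dataframes loaded earlier.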


Choosing a random user id

In [16]:
#random_user = int(pd.Series(user_movie_df.index).sample(1, random_state=45).values[0])
random_user = user_movie_df.sample(1,random_state=45).index[0]
print(random_user)

28941


Creating a new dataframe named random_user_df consisting of observation units belonging to the selected user.

In [17]:
random_user_df = user_movie_df[user_movie_df.index == random_user]
random_user_df.head()

title,"'burbs, The (1989)",(500) Days of Summer (2009),*batteries not included (1987),...And Justice for All (1979),10 Things I Hate About You (1999),...,Zulu (1964),[REC] (2007),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
28941,,,,,,...,,,,,


Assigning the movies rated by the selected user to a list named movies_watched.

In [18]:
movies_watched = random_user_df.columns[random_user_df.notna().any()].tolist()
print(movies_watched)

['Ace Ventura: Pet Detective (1994)', 'Ace Ventura: When Nature Calls (1995)', 'Aladdin (1992)', 'American President, The (1995)', 'Apollo 13 (1995)', 'Babe (1995)', 'Bullets Over Broadway (1994)', 'Clueless (1995)', 'Disclosure (1994)', 'Forrest Gump (1994)', 'Four Weddings and a Funeral (1994)', 'Home Alone (1990)', 'Jurassic Park (1993)', 'Like Water for Chocolate (Como agua para chocolate) (1992)', 'Little Women (1994)', "Mr. Holland's Opus (1995)", 'Mrs. Doubtfire (1993)', 'Much Ado About Nothing (1993)', "Muriel's Wedding (1994)", 'Nine Months (1995)', 'Operation Dumbo Drop (1995)', 'Piano, The (1993)', 'Postman, The (Postino, Il) (1994)', 'Ready to Wear (Pret-A-Porter) (1994)', 'Remains of the Day, The (1993)', 'Sabrina (1995)', "Schindler's List (1993)", 'Secret Garden, The (1993)', 'Sense and Sensibility (1995)', 'Shadowlands (1993)', 'Silence of the Lambs, The (1991)', 'Star Trek: Generations (1994)', 'Stargate (1994)']


In [19]:
len(movies_watched)

33

Selecting the columns of the movies watched by the selected user from user_movie_df and creating a new dataframe named movies_watched_df.

In [20]:
movies_watched_df = user_movie_df[movies_watched]
movies_watched_df.head()

title,Ace Ventura: Pet Detective (1994),Ace Ventura: When Nature Calls (1995),Aladdin (1992),"American President, The (1995)",Apollo 13 (1995),...,Sense and Sensibility (1995),Shadowlands (1993),"Silence of the Lambs, The (1991)",Star Trek: Generations (1994),Stargate (1994)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,,,,,,...,,,3.5,,
2,,,,,,...,,,,,
3,,,,,,...,,,5.0,5.0,5.0
4,,3.0,,,,...,,,,3.0,
5,,,5.0,5.0,5.0,...,3.0,,3.0,,4.0


Creating a new dataframe named user_movie_count that records how many of the selected user's movies each user has watched, then resetting the index and renaming the columns.



In [21]:
#user_movie_count = movies_watched_df.T.notnull().sum()
user_movie_count = movies_watched_df.notnull().sum(axis=1)
user_movie_count.head()

userId
1     1
2     2
3     4
4     6
5    11
dtype: int64

In [22]:
user_movie_count.max()

33

In [23]:
user_movie_count.sort_values(ascending=False).head(20)

userId
100399    33
8405      33
46663     33
76630     33
81218     33
81596     33
15919     33
83090     33
118205    33
28941     33
41389     33
94231     33
125912    33
137391    33
13938     33
124052    33
130986    33
112939    32
121956    32
88604     32
dtype: int64

In [24]:
user_movie_count = user_movie_count.reset_index()
user_movie_count.head()

Unnamed: 0,userId,0
0,1,1
1,2,2
2,3,4
3,4,6
4,5,11


In [25]:
user_movie_count.columns = ["userId", "movie_count"]
user_movie_count.head()

Unnamed: 0,userId,movie_count
0,1,1
1,2,2
2,3,4
3,4,6
4,5,11


We consider similar users to be those who have watched more than 60 percent of the movies rated by the selected user (here, more than 20 of the 33 movies). We create a list named users_same_movies from the IDs of these users.

In [26]:
perc = round(len(movies_watched) * 60 / 100)
users_same_movies = user_movie_count[user_movie_count["movie_count"] > perc]["userId"].tolist()

#users_same_movies = user_movie_count[user_movie_count > (movies_watched_df.shape[1] * 60 ) / 100].index
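
The cutoff arithmetic in the cell above can be checked by hand. The sketch below uses the 33 movies of the selected user and a handful of made-up overlap counts (the user labels `u1` to `u4` are hypothetical) to show which users pass a strict `>` comparison and which would pass an inclusive `>=` one.

```python
n_watched = 33                      # len(movies_watched) for the selected user
perc = round(n_watched * 60 / 100)  # 33 * 0.6 = 19.8, which rounds to 20

# Hypothetical overlap counts per user (toy values, not from the dataset).
overlap = {"u1": 19, "u2": 20, "u3": 21, "u4": 33}

strict = [u for u, c in overlap.items() if c > perc]      # comparison used above
inclusive = [u for u, c in overlap.items() if c >= perc]  # alternative cutoff

print(perc, strict, inclusive)
```

This prints `20 ['u3', 'u4'] ['u2', 'u3', 'u4']`: with the strict comparison, a user who shares exactly 20 movies is left out.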

Filtering the movies_watched_df dataframe to keep only the rows of the similar users whose IDs are in the users_same_movies list.

In [27]:
final_df = movies_watched_df[movies_watched_df.index.isin(users_same_movies)]

final_df.head()

title,Ace Ventura: Pet Detective (1994),Ace Ventura: When Nature Calls (1995),Aladdin (1992),"American President, The (1995)",Apollo 13 (1995),...,Sense and Sensibility (1995),Shadowlands (1993),"Silence of the Lambs, The (1991)",Star Trek: Generations (1994),Stargate (1994)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
130,4.0,3.0,,3.0,3.0,...,,3.0,5.0,,3.0
156,3.0,,,5.0,5.0,...,4.0,4.0,5.0,3.0,4.0
158,2.0,1.0,4.0,4.0,3.0,...,4.0,5.0,5.0,,
184,2.0,3.0,3.0,4.0,4.0,...,,4.0,5.0,3.0,4.0
295,,,3.0,3.0,3.0,...,4.0,,4.0,3.0,2.0


Creating a new corr_df dataframe that will contain the correlations between users.

In [28]:
# final_df.T.corr()
# corr_df = final_df.T.corr().unstack().sort_values().drop_duplicates()

corr_df = final_df.T.corr().unstack()
corr_df.head()

userId  userId
130     130      1.000
        156      0.129
        158      0.261
        184      0.149
        295      0.597
dtype: float64
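
To make the transpose-then-correlate step concrete, here is a minimal sketch on a toy matrix (made-up ratings, three hypothetical users). `DataFrame.corr` computes Pearson correlation column by column over the entries both columns have, so transposing first yields user-to-user correlations. The toy values are chosen so that user 20 always rates one point below user 10 and user 30 rates in the opposite direction.

```python
import pandas as pd

# Rows are users, columns are movies (toy values, not from MovieLens).
toy = pd.DataFrame(
    {
        "A": [5.0, 4.0, 1.0],
        "B": [3.0, 2.0, 3.0],
        "C": [4.0, None, 2.0],
        "D": [2.0, 1.0, 4.0],
    },
    index=pd.Index([10, 20, 30], name="userId"),
)

# Transpose so users become columns, then correlate the columns pairwise.
user_corr = toy.T.corr()

# Long format: both index levels are named "userId", so rename them
# before reset_index to avoid a column-name clash.
long_corr = user_corr.unstack()
long_corr.index.names = ["user_id_1", "user_id_2"]
long_corr = long_corr.rename("corr").reset_index()

print(user_corr.round(3))
print(long_corr.head())
```

Users 10 and 20 come out with a correlation of about 1 and users 10 and 30 with about -1, even though user 20 has no rating for movie C.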

In [29]:
corr_df = pd.DataFrame(corr_df, columns=["corr"])
corr_df.index.names = ['user_id_1', 'user_id_2']
corr_df = corr_df.reset_index()

corr_df.head()

Unnamed: 0,user_id_1,user_id_2,corr
0,130,130,1.0
1,130,156,0.129
2,130,158,0.261
3,130,184,0.149
4,130,295,0.597


Creating a new dataframe named top_users by keeping only the users whose correlation with the selected user is 0.65 or higher.

In [30]:
top_users = corr_df[(corr_df["user_id_1"] == random_user) & (corr_df["corr"] >= 0.65)][
    ["user_id_2", "corr"]].reset_index(drop=True)

top_users = top_users.sort_values(by='corr', ascending=False)

top_users.rename(columns={"user_id_2": "userId"}, inplace=True)

top_users.head()

Unnamed: 0,userId,corr
12,28941,1.0
15,45158,0.801
30,101628,0.79
3,7542,0.772
36,127259,0.764


Merging the top_users dataframe with the rating dataset and removing the selected user

In [31]:
top_users_ratings = top_users.merge(rating[["userId", "movieId", "rating"]], how='inner')
top_users_ratings = top_users_ratings[top_users_ratings["userId"] != random_user]
top_users_ratings.sort_values(by='corr', ascending=False)

Unnamed: 0,userId,corr,movieId,rating
33,45158,0.801,1,1.500
366,45158,0.801,3173,1.500
379,45158,0.801,3263,2.000
378,45158,0.801,3261,3.500
377,45158,0.801,3256,1.000
...,...,...,...,...
17082,82666,0.655,273,2.000
17083,82666,0.655,277,3.000
17084,82666,0.655,281,4.000
17085,82666,0.655,296,4.000


Creating a new variable named weighted_rating, which is the product of the corr and rating values of each user.

In [32]:
top_users_ratings['weighted_rating'] = top_users_ratings['corr'] * top_users_ratings['rating']
top_users_ratings.sort_values(by="weighted_rating",ascending=False)

Unnamed: 0,userId,corr,movieId,rating,weighted_rating
167,45158,0.801,1136,5.000,4.004
68,45158,0.801,265,5.000,4.004
465,45158,0.801,4973,5.000,4.004
473,45158,0.801,5135,5.000,4.004
483,45158,0.801,5617,5.000,4.004
...,...,...,...,...,...
15151,103998,0.663,1981,0.500,0.332
15787,103998,0.663,6157,0.500,0.332
16894,94379,0.656,4247,0.500,0.328
16599,94379,0.656,2178,0.500,0.328


Creating a new dataframe named recommendation_df that contains each movie ID and the mean weighted rating that movie received from the similar users.

In [33]:

recommendation_df = top_users_ratings.groupby('movieId').agg({"weighted_rating": "mean"})

recommendation_df = recommendation_df.reset_index()

recommendation_df[["movieId"]].nunique()

recommendation_df.sort_values(by="weighted_rating" ,ascending=False)

Unnamed: 0,movieId,weighted_rating
46,53,3.952
1751,2504,3.764
1737,2485,3.764
1422,2057,3.764
1437,2077,3.764
...,...,...
3552,5864,0.352
1362,1981,0.349
3651,6157,0.332
3395,5471,0.328
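
The weighting and aggregation can be followed on a toy table (made-up users, correlations, and ratings). The sketch below reproduces the notebook's score, the mean of `corr * rating` per movie, and, as a clearly separate variation that the notebook does not use, a normalized version that divides by the sum of correlations so the result stays on the original rating scale.

```python
import pandas as pd

# Toy ratings from two similar users (hypothetical values).
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2],
    "corr":    [0.8, 0.8, 0.7, 0.7],
    "movieId": [10, 20, 10, 30],
    "rating":  [5.0, 3.0, 4.0, 2.0],
})

# Same definition as above: rating scaled by the user's correlation.
toy["weighted_rating"] = toy["corr"] * toy["rating"]

# Notebook's score: plain mean of the weighted ratings per movie.
scores = toy.groupby("movieId")["weighted_rating"].mean()

# Variation (not used in this notebook): correlation-normalized average.
sums = toy.groupby("movieId")[["weighted_rating", "corr"]].sum()
normalized = sums["weighted_rating"] / sums["corr"]

print(scores)
print(normalized)
```

For movie 10 the notebook-style score is (4.0 + 2.8) / 2 = 3.4, while the normalized variant gives 6.8 / 1.5, roughly 4.53.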


Selecting the movies with a weighted rating greater than 3.5 in recommendation_df and sorting them by weighted rating.

In [34]:
movies_to_be_recommend = recommendation_df[recommendation_df["weighted_rating"] > 3.5].sort_values("weighted_rating", ascending=False)
recommendation_df.sort_values(by="weighted_rating" ,ascending=False)

Unnamed: 0,movieId,weighted_rating
46,53,3.952
1751,2504,3.764
1737,2485,3.764
1422,2057,3.764
1437,2077,3.764
...,...,...
3552,5864,0.352
1362,1981,0.349
3651,6157,0.332
3395,5471,0.328


Listing the names of 5 recommended movies.

In [35]:
movies_to_be_recommend = movies_to_be_recommend.merge(movie[["movieId", "title"]])
movies_to_be_recommend.head()

Unnamed: 0,movieId,weighted_rating,title
0,53,3.952,Lamerica (1994)
1,2504,3.764,200 Cigarettes (1999)
2,3910,3.764,Dancer in the Dark (2000)
3,3118,3.764,Tumbleweeds (1999)
4,1922,3.764,Whatever (1998)


<div style="text-align: center; font-size: 24px; font-weight: bold; color: red;">
    Item Based Recommendation
</div>

Making item-based recommendations starting from the movie that the user most recently gave the highest rating to.

user ID = 28941

Reusing the datasets loaded and merged above

In [36]:
#movie = pd.read_csv('/kaggle/input/movielens-20m-dataset/movie.csv')
movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [37]:
#rating = pd.read_csv('/kaggle/input/movielens-20m-dataset/rating.csv')
rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40


In [38]:
# df = rating.merge(movie, how="inner", on="movieId")
# df.head()

Getting the ID of the most recently rated movie among those the selected user gave 5 points to.

In [39]:
movie_id = df[(df["userId"] == random_user) & (df["rating"] == 5.0)].sort_values(by='timestamp', ascending=False)["movieId"].iloc[0]
print(movie_id)

7


Looking up the title for the selected movie ID and selecting that movie's column from the user_movie_df dataframe created in the user-based recommendation section.

In [40]:
movie_name = df[df["movieId"] == movie_id]["title"].values[0]
print(movie_name)

Sabrina (1995)


In [41]:
movie_name = user_movie_df[movie_name]

Using the filtered dataframe, find the correlation between the selected movie and other movies and rank them.

In [42]:
corr_df = user_movie_df.corrwith(movie_name).sort_values(ascending=False).head(10)

corr_df = corr_df.reset_index()

corr_df.columns = ["title", "corr"]

corr_df.head()

Unnamed: 0,title,corr
0,Sabrina (1995),1.0
1,Intouchables (2011),0.503
2,Father of the Bride (1991),0.5
3,Anna and the King (1999),0.494
4,Runaway Bride (1999),0.484
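
The item-based ranking can be illustrated on a toy user-by-movie matrix (made-up ratings; the movie labels `X`, `Y`, `Z` are hypothetical). `DataFrame.corrwith` correlates every column with the target column over the users who rated both. The sketch also drops the target by label rather than by position, which is an alternative to slicing and does not assume the target sorts first.

```python
import pandas as pd

# Rows are users, columns are movies (toy values, not from MovieLens).
toy = pd.DataFrame(
    {
        "X": [5.0, 4.0, 2.0, 1.0],
        "Y": [5.0, 3.0, 2.0, None],  # user 4 has not rated Y
        "Z": [1.0, 2.0, 4.0, 5.0],   # mirrors X: always 6 - X
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)

target = toy["X"]  # ratings of the selected movie

# Correlation of every movie with the target, highest first.
item_corr = toy.corrwith(target).sort_values(ascending=False)

# Remove the target itself by label before recommending.
candidates = item_corr.drop("X")

print(item_corr.round(3))
print(candidates.round(3))
```

Here `Y` ranks first among the candidates and `Z` last, since its ratings move in the opposite direction to those of `X`.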


List the first 5 movies as suggestions (exclude the selected movie itself)

In [43]:
movies_to_recommend = corr_df.sort_values(by='corr', ascending=False)[1:6]
print(movies_to_recommend)

                              title  corr
1               Intouchables (2011) 0.503
2        Father of the Bride (1991) 0.500
3          Anna and the King (1999) 0.494
4              Runaway Bride (1999) 0.484
5  Phantom of the Opera, The (2004) 0.476


<div style="text-align: center; font-size: 24px; font-weight: bold; color: green;">
    Hybrid Recommendation
</div>

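**weighted_rating** is the mean, over the similar users, of each user's rating of the movie multiplied by that user's correlation with the selected user. It is the user-based score.

**corr** is, in this section, the correlation between each movie and the selected movie, Sabrina (1995), taken from the item-based step. It is the item-based score.

We compute a **hybrid score** that combines the user-based and item-based results by multiplying the two metrics (corr and weighted_rating), so that the final recommendation score balances both and is more robust and personalized than either one alone.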

In [44]:
movies_ordered_by_rating = pd.merge(recommendation_df,movie,how="inner",on="movieId")[["movieId","weighted_rating","title"]]
movies_ordered_by_rating.head()

Unnamed: 0,movieId,weighted_rating,title
0,1,2.424,Toy Story (1995)
1,2,1.749,Jumanji (1995)
2,3,1.431,Grumpier Old Men (1995)
3,4,1.691,Waiting to Exhale (1995)
4,5,1.425,Father of the Bride Part II (1995)


In [45]:
merged = pd.merge(corr_df,movies_ordered_by_rating,how="inner", on="title")
merged.head()

Unnamed: 0,title,corr,movieId,weighted_rating
0,Sabrina (1995),1.0,7,2.44
1,Father of the Bride (1991),0.5,6944,2.446
2,Anna and the King (1999),0.494,3155,2.515
3,Runaway Bride (1999),0.484,2724,1.263
4,Mrs. Winterbourne (1996),0.474,691,1.272


In [46]:
merged["hybrid"] = merged["corr"] * merged["weighted_rating"]
merged[["title","hybrid"]].sort_values(by="hybrid", ascending=False, ignore_index=True)[1:11]

Unnamed: 0,title,hybrid
1,Anna and the King (1999),1.243
2,Father of the Bride (1991),1.224
3,Two Weeks Notice (2002),1.111
4,Sweet Home Alabama (2002),0.839
5,You've Got Mail (1998),0.831
6,Runaway Bride (1999),0.611
7,Mrs. Winterbourne (1996),0.604
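
The hybrid scoring can be traced on a toy example (made-up titles and scores). The sketch below joins an item-based table and a user-based table on the title and multiplies the two scores, as in the cells above. It also shows a side effect of the inner join: only titles present in both tables survive, which is likely why the final list above has fewer than ten rows.

```python
import pandas as pd

# Item-based scores: correlation with the selected movie (toy values).
item_scores = pd.DataFrame({"title": ["A", "B", "C"], "corr": [0.5, 0.4, 0.3]})

# User-based scores: mean weighted rating from similar users (toy values).
user_scores = pd.DataFrame({"title": ["A", "B", "D"], "weighted_rating": [2.0, 3.0, 4.0]})

# Inner join keeps only titles scored by both methods (C and D are dropped).
merged = item_scores.merge(user_scores, how="inner", on="title")

# Hybrid score: product of the two metrics.
merged["hybrid"] = merged["corr"] * merged["weighted_rating"]

ranked = merged.sort_values("hybrid", ascending=False, ignore_index=True)
print(ranked)
```

Title `B` ranks first (0.4 * 3.0 = 1.2) ahead of `A` (0.5 * 2.0 = 1.0), although `A` has the higher item-based correlation.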
