***Making predictions for a given userID using Item-Based and User-Based Collaborative Filtering methods***

In [1]:
import pandas as pd

pd.options.display.max_columns=10
pd.options.display.max_rows=20
pd.options.display.float_format = '{:.3f}'.format
pd.options.display.width = 1000

In [2]:
movie = pd.read_csv("/kaggle/input/movie-rating/movie.csv")
rating = pd.read_csv("/kaggle/input/movie-rating/rating.csv")

movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40


**The ratings dataframe does not contain the movie titles and genres variables.**

Adding these two variables from the movies df to the ratings df:

In [4]:
# Merging two dataframes
df = pd.merge(movie,rating, how="inner", on="movieId")
df.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3,4.0,1999-12-11 13:36:47
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,6,5.0,1997-03-13 17:50:52
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8,4.0,1996-06-05 13:37:51
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,10,4.0,1999-11-25 02:44:47
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11,4.5,2009-01-02 01:13:41


*The number of ratings for each movie*

In [5]:
rating_counts = df["title"].value_counts()
print(rating_counts)

title
Pulp Fiction (1994)                                67310
Forrest Gump (1994)                                66172
Shawshank Redemption, The (1994)                   63366
Silence of the Lambs, The (1991)                   63299
Jurassic Park (1993)                               59715
                                                   ...  
Easy Wheels (1989)                                     1
Ditirambo (1969)                                       1
Scorching Winds (Garm Hava) (Garam Hawa) (1974)        1
Serrallonga (2008)                                     1
Innocence (2014)                                       1
Name: count, Length: 26729, dtype: int64


**Assigning movies with a low number of ratings (less than or equal 1000) to a variable**

In [6]:
rare_movies = rating_counts[rating_counts <= 1000].index
print(rare_movies)

Index(['Rosewood (1997)', 'One Night at McCool's (2001)', 'Ted (2012)', 'Bear, The (Ours, L') (1988)', 'Marked for Death (1990)', 'Adam's Rib (1949)', 'Three to Tango (1999)', 'Stakeout (1987)', 'I Now Pronounce You Chuck and Larry (2007)', 'Someone Like You (2001)',
       ...
       'Expert, The (1995)', 'Goliath Awaits (1981)', 'Short Eyes (1977)', 'Restless Souls (Bag det stille ydre) (2005)', 'Cold Trail (Köld slóð) (2006)', 'Easy Wheels (1989)', 'Ditirambo (1969)', 'Scorching Winds (Garm Hava) (Garam Hawa) (1974)', 'Serrallonga (2008)', 'Innocence (2014)'], dtype='object', name='title', length=23570)


Assigning movies with a high number of ratings (more than 1000) to a variable

In [7]:
common_movies = df[~df["title"].isin(rare_movies)]

**USER-BASED RECOMMENDATION**

**Creating user-movie matrix**

In [8]:
user_movie_df = common_movies.pivot_table(index="userId",columns="title",values="rating")
user_movie_df.head()

title,"'burbs, The (1989)",(500) Days of Summer (2009),*batteries not included (1987),...And Justice for All (1979),10 Things I Hate About You (1999),...,Zulu (1964),[REC] (2007),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,,,,,,...,,,,,
2,,,,,,...,,,,,
3,,,,,,...,,,,,
4,,,,,,...,,,,,
5,,,,,,...,,,,,


*Creating random user*

In [9]:
random_user = user_movie_df.sample(1,random_state=45).index[0]
print(random_user)

28941


*Creating random user DF*

In [10]:
random_user_df = user_movie_df[user_movie_df.index==random_user]
random_user_df.head()

title,"'burbs, The (1989)",(500) Days of Summer (2009),*batteries not included (1987),...And Justice for All (1979),10 Things I Hate About You (1999),...,Zulu (1964),[REC] (2007),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
28941,,,,,,...,,,,,


*Creating a DataFrame list of the movies watched by a random user*

In [11]:
movies_watched = random_user_df.dropna(axis=1).columns.tolist()
print(movies_watched)

['Ace Ventura: Pet Detective (1994)', 'Ace Ventura: When Nature Calls (1995)', 'Aladdin (1992)', 'American President, The (1995)', 'Apollo 13 (1995)', 'Babe (1995)', 'Bullets Over Broadway (1994)', 'Clueless (1995)', 'Disclosure (1994)', 'Forrest Gump (1994)', 'Four Weddings and a Funeral (1994)', 'Home Alone (1990)', 'Jurassic Park (1993)', 'Like Water for Chocolate (Como agua para chocolate) (1992)', 'Little Women (1994)', "Mr. Holland's Opus (1995)", 'Mrs. Doubtfire (1993)', 'Much Ado About Nothing (1993)', "Muriel's Wedding (1994)", 'Nine Months (1995)', 'Operation Dumbo Drop (1995)', 'Piano, The (1993)', 'Postman, The (Postino, Il) (1994)', 'Ready to Wear (Pret-A-Porter) (1994)', 'Remains of the Day, The (1993)', 'Sabrina (1995)', "Schindler's List (1993)", 'Secret Garden, The (1993)', 'Sense and Sensibility (1995)', 'Shadowlands (1993)', 'Silence of the Lambs, The (1991)', 'Star Trek: Generations (1994)', 'Stargate (1994)']


*The movies that are watched by random user*

In [12]:
random_user_watched_ids = common_movies["title"][(common_movies["userId"]==random_user)]
print(random_user_watched_ids)

126205                            Sabrina (1995)
174539            American President, The (1995)
225036              Sense and Sensibility (1995)
250835     Ace Ventura: When Nature Calls (1995)
417968                               Babe (1995)
                           ...                  
3559898                       Shadowlands (1993)
3776293                        Home Alone (1990)
3836848                           Aladdin (1992)
4025787         Silence of the Lambs, The (1991)
4392992              Operation Dumbo Drop (1995)
Name: title, Length: 33, dtype: object


In [13]:
movies_watched_df = user_movie_df[movies_watched]
movies_watched_df.head()

title,Ace Ventura: Pet Detective (1994),Ace Ventura: When Nature Calls (1995),Aladdin (1992),"American President, The (1995)",Apollo 13 (1995),...,Sense and Sensibility (1995),Shadowlands (1993),"Silence of the Lambs, The (1991)",Star Trek: Generations (1994),Stargate (1994)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,,,,,,...,,,3.5,,
2,,,,,,...,,,,,
3,,,,,,...,,,5.0,5.0,5.0
4,,3.0,,,,...,,,,3.0,
5,,,5.0,5.0,5.0,...,3.0,,3.0,,4.0


*IDs of the users who watched the movies that a random user has watched, and the number of movies they have watched*

In [14]:
user_movie_count = movies_watched_df.notnull().sum(axis=1)
user_movie_count.sort_values(ascending=False).head()

userId
100399    33
8405      33
46663     33
76630     33
81218     33
dtype: int64

*Index and ID information of users who have watched at least 60% of the movies that a random user has watched*

In [15]:
users_same_movies = user_movie_count[user_movie_count > (movies_watched_df.shape[1] * 60 ) / 100].index
print(users_same_movies)

Index([    91,    130,    156,    158,    160,    184,    294,    295,    298,    309,
       ...
       137885, 137949, 137976, 138019, 138162, 138208, 138279, 138382, 138415, 138483], dtype='int64', name='userId', length=4139)


*Filtering the movies_watched_df for selecting similar users*

In [16]:
final_df = movies_watched_df[movies_watched_df.index.isin(users_same_movies)]

*Creating Correlation DF*

In [17]:
corr_df = final_df.T.corr().unstack()
corr_df[random_user].sort_values(ascending=False)

userId
28941     1.000
13477     0.802
45158     0.801
101628    0.790
7542      0.772
          ...  
103594   -0.569
22122    -0.586
92616    -0.618
126388   -0.676
92960    -0.691
Length: 4139, dtype: float64

*Creating Top users/High correlated users df*

In [18]:
top_users = pd.DataFrame(corr_df[random_user][corr_df[random_user] > 0.65], columns=["corr"])

*Merging top_users with rating*

In [19]:
top_users_ratings = pd.merge(top_users, rating[["userId", "movieId", "rating"]], how='inner', on="userId")

top_users_ratings = top_users_ratings[~(top_users_ratings["movieId"].isin(random_user_watched_ids))]

*Calculating Weighted Average Recommendation Score*

In [20]:
top_users_ratings['weighted_rating'] = top_users_ratings['corr'] * top_users_ratings['rating']

top_users_ratings.sort_values(by="corr",ascending=False)

Unnamed: 0,userId,corr,movieId,rating,weighted_rating
3433,28941,1.000,534,5.000,5.000
3429,28941,1.000,509,5.000,5.000
3422,28941,1.000,344,3.000,3.000
3423,28941,1.000,348,4.000,4.000
3424,28941,1.000,356,3.000,3.000
...,...,...,...,...,...
19160,105474,0.652,2302,5.000,3.259
19161,105474,0.652,2312,3.000,1.955
19162,105474,0.652,2316,5.000,3.259
19163,105474,0.652,2321,4.000,2.607


*Creating DF includes movie Ids and all user's weighted average ratings*

In [21]:
recommendation_df = top_users_ratings.pivot_table(values="weighted_rating", index="movieId", aggfunc="mean")
recommendation_df.sort_values(by="weighted_rating" ,ascending=False)

Unnamed: 0_level_0,weighted_rating
movieId,Unnamed: 1_level_1
53,3.952
2057,3.764
2485,3.764
1922,3.764
3118,3.764
...,...
1981,0.349
8912,0.329
7193,0.329
5471,0.328


*Filtering and Sorting recommendation_df*

In [22]:
movies_to_be_recommend = recommendation_df[recommendation_df["weighted_rating"] > 3.5].sort_values(by="weighted_rating", ascending=False).head(5)

*Returning 5 recommended movies*

In [23]:
pd.merge(movies_to_be_recommend,movie,how="inner",on="movieId")[["movieId","weighted_rating","title","genres"]]

Unnamed: 0,movieId,weighted_rating,title,genres
0,53,3.952,Lamerica (1994),Adventure|Drama
1,1922,3.764,Whatever (1998),Drama
2,2057,3.764,"Incredible Journey, The (1963)",Adventure|Children
3,2485,3.764,She's All That (1999),Comedy|Romance
4,3118,3.764,Tumbleweeds (1999),Drama


**ITEM-BASED RECOMMENDATION**

**Making an item-based recommendation based on the last movie watched and rated the highest by the random user**

*Selecting the random user's last watched and highest rated movie ID*

In [24]:
pick = rating[(rating["rating"] == 5) & (rating["userId"]==random_user)].sort_values(by="timestamp", ascending=False).iloc[0]["movieId"]
print(pick)

7


*Filtering user_movie_df with selected movie ID*

In [25]:
picked_movie_name = movie["title"][movie["movieId"]==pick].iloc[0]
print(picked_movie_name)

Sabrina (1995)


In [26]:
final = user_movie_df[picked_movie_name]
final[final.notna()]

userId
6        5.000
7        3.000
12       3.000
14       3.500
19       5.000
          ... 
138382   3.000
138387   3.000
138404   3.500
138408   3.000
138432   3.000
Name: Sabrina (1995), Length: 12961, dtype: float64

*Creating Correlation DF*

In [27]:
users_wo_random = user_movie_df.drop(random_user,axis=0).drop(movies_watched,axis=1)
final_wo_lucky = final.drop(random_user,axis=0)

movies_similarity = users_wo_random.corrwith(final_wo_lucky).sort_values(ascending=False).reset_index()

movies_similarity.columns=["title","similarity"]

*Returning 5 recommended movies without picked movie*

In [28]:
movies_similarity.sort_values(by="similarity",ascending=False).head(5)

Unnamed: 0,title,similarity
0,Intouchables (2011),0.503
1,Father of the Bride (1991),0.5
2,Anna and the King (1999),0.494
3,Runaway Bride (1999),0.484
4,"Phantom of the Opera, The (2004)",0.476


**HYBRID RECOMMENDER SYSTEM - User-based & Item-based**

**Calculating hybrid movie recommendation based on similarity scores and weighted rating**

*Combining movies highly similar to the ones the random user rated 5 stars, with movies that similar users have rated an average of 3.5 or higher*

In [29]:
movies_ordered_by_rating = pd.merge(recommendation_df,movie,how="inner",on="movieId")[["movieId","weighted_rating","title"]]

merged = pd.merge(movies_similarity,movies_ordered_by_rating,how="inner", on="title")

*Multiplying the similarity score by the average rating*

In [30]:
merged["hybrid"] = merged["similarity"] * merged["weighted_rating"]

*Recommendation based on the ranking*

In [31]:
merged[["title","hybrid"]].sort_values(by="hybrid", ascending=False).head(10)

Unnamed: 0,title,hybrid
116,She's All That (1999),1.419
15,Doc Hollywood (1991),1.232
29,Never Been Kissed (1999),1.212
67,Picture Perfect (1997),1.205
1,Anna and the King (1999),1.197
43,Hitch (2005),1.187
221,Angela's Ashes (1999),1.185
82,"Definitely, Maybe (2008)",1.165
10,Mona Lisa Smile (2003),1.151
0,Father of the Bride (1991),1.122
