# Practice PS06: Recommendation engines (interactions-based)

For this assignment we will build and apply item-based and model-based collaborative filtering recommenders for movies.

Author: <font color="blue">Aniol Petit Cabarrocas</font>

E-mail: <font color="blue">aniol.petit01@estudiant.upf.edu</font>

Date: <font color="blue">06/11/2024</font>

<font size="+2" color="blue">Additional results: surprise library</font>

# 1. The Movies dataset

## 1.1. Load the input files

In [1]:
# Leave this code as-is

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from math import*
from scipy.sparse.linalg import svds
from sklearn.metrics.pairwise import linear_kernel

In [2]:
# Leave this code as-is

FILENAME_MOVIES = "movies-2000s.csv"
FILENAME_RATINGS = "ratings-2000s.csv"
FILENAME_TAGS = "tags-2000s.csv"

In [3]:
# Leave this code as-is

movies = pd.read_csv(FILENAME_MOVIES, 
                    sep=',', 
                    engine='python', 
                    encoding='latin-1',
                    names=['movie_id', 'title', 'genres'])
display(movies.head(5))

ratings_raw = pd.read_csv(FILENAME_RATINGS, 
                    sep=',', 
                    encoding='latin-1',
                    engine='python',
                    names=['user_id', 'movie_id', 'rating'])
display(ratings_raw.head(5))

Unnamed: 0,movie_id,title,genres
0,2769,"Yards, The (2000)",Crime|Drama
1,3177,Next Friday (2000),Comedy
2,3190,Supernova (2000),Adventure|Sci-Fi|Thriller
3,3225,Down to You (2000),Comedy|Romance
4,3228,Wirey Spindell (2000),Comedy


Unnamed: 0,user_id,movie_id,rating
0,4,1,3.0
1,4,260,3.5
2,4,296,4.0
3,4,541,4.5
4,4,589,4.0


## 1.2. Merge the data into a single dataframe

<font size="+1" color="red">Replace this cell with your code from the previous practice that joined these three dataframes using "merge" into a single dataframe named "ratings". Print the first 5 rows of the resulting dataframe, which should contain columns "user_id", "movie_id", "rating", "title", and "genres".</font>

In [4]:
ratings = pd.merge(movies, ratings_raw, how='inner', on='movie_id')
display(ratings.head(5))

Unnamed: 0,movie_id,title,genres,user_id,rating
0,2769,"Yards, The (2000)",Crime|Drama,1115,4.0
1,2769,"Yards, The (2000)",Crime|Drama,1209,2.0
2,2769,"Yards, The (2000)",Crime|Drama,2004,3.0
3,2769,"Yards, The (2000)",Crime|Drama,2502,4.0
4,2769,"Yards, The (2000)",Crime|Drama,2827,4.0


<font size="+1" color="red">Replace this cell with your code from the previous practice for "find_movies" that list movies containing a keyword</font>

In [5]:
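def find_movies(keyword, df):
    # Print the id and title of every movie whose title contains the keyword
    for _, row in df.iterrows():
        title = row["title"]
        if keyword in title:
            # Single quotes inside the f-string keep this valid before Python 3.12
            print(f"movie_id: {row['movie_id']}, title: {title}")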

In [6]:
# LEAVE AS-IS

# For testing, this should print 9 movies
find_movies("Spider-Man", movies)

movie_id: 5349, title: Spider-Man (2002)
movie_id: 8636, title: Spider-Man 2 (2004)
movie_id: 52722, title: Spider-Man 3 (2007)
movie_id: 76709, title: Spider-Man: The Ultimate Villain Showdown (2002)
movie_id: 95510, title: Amazing Spider-Man, The (2012)
movie_id: 110553, title: The Amazing Spider-Man 2 (2014)
movie_id: 122926, title: Untitled Spider-Man Reboot (2017)
movie_id: 195159, title: Spider-Man: Into the Spider-Verse (2018)
movie_id: 201773, title: Spider-Man: Far from Home (2019)


In [7]:
# LEAVE AS-IS

def get_title(movie_id, movies):
    return movies[movies['movie_id'] == movie_id].title.iloc[0]

In [8]:
# LEAVE AS-IS

# For testing, should print "Spider-Man 2 (2004)"
print(get_title(8636, movies))

Spider-Man 2 (2004)


## 1.3. Count unique registers

<font size="+1" color="red">Replace this cell with your own code to indicate the number of unique users and unique movies in the "ratings" variable.</font>

In [9]:
print(f"Number of users who have rated a movie: {len(ratings.user_id.unique())}")
print(f"Number of movies that have been rated: {len(ratings.movie_id.unique())}")
print(f"Total number of movies: {len(movies.movie_id.unique())}")

Number of users who have rated a movie: 12676
Number of movies that have been rated: 2049
Total number of movies: 33168


# 2. Item-based Collaborative Filtering

## 2.1. Data pre-processing

<font size="+1" color="red">Replace this cell with your code to generate "rated_movies" and print the first ten rows. This should have columns user_id, movie_id, rating, title</font>

In [10]:
rated_movies = ratings.drop(columns="genres")
display(rated_movies.head(10))

Unnamed: 0,movie_id,title,user_id,rating
0,2769,"Yards, The (2000)",1115,4.0
1,2769,"Yards, The (2000)",1209,2.0
2,2769,"Yards, The (2000)",2004,3.0
3,2769,"Yards, The (2000)",2502,4.0
4,2769,"Yards, The (2000)",2827,4.0
5,2769,"Yards, The (2000)",6629,1.0
6,2769,"Yards, The (2000)",12435,4.0
7,2769,"Yards, The (2000)",13873,3.0
8,2769,"Yards, The (2000)",14799,3.0
9,2769,"Yards, The (2000)",15691,2.5


<font size="+1" color="red">Replace this cell with your code to generate "ratings_summary" and print the first 10 rows.</font>

In [11]:
ratings_summary = rated_movies[["movie_id", "title"]].groupby("movie_id").first()
ratings_mean = rated_movies.groupby("movie_id")["rating"].mean()
ratings_count = rated_movies.groupby("movie_id")["rating"].count()
ratings_summary["ratings_mean"] = ratings_mean
ratings_summary["ratings_count"] = ratings_count
display(ratings_summary.head(10))

Unnamed: 0_level_0,title,ratings_mean,ratings_count
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2769,"Yards, The (2000)",3.122549,102
3177,Next Friday (2000),2.824,125
3190,Supernova (2000),2.395683,139
3225,Down to You (2000),2.577273,110
3228,Wirey Spindell (2000),2.5,2
3239,Isn't She Great? (2000),1.947368,19
3273,Scream 3 (2000),2.444664,759
3275,"Boondock Saints, The (2000)",3.870682,1071
3276,Gun Shy (2000),3.33871,31
3279,Knockout (2000),2.0,2


To select from dataframe A those having column C larger or equal to N, you can do `A[A.C >= N]`.

To sort dataframe A by decreasing values of column C, you can do `A.sort_values(by='C', ascending=False)`.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to print the top 5 highest rated movies, considering only movies receiving at least 100 ratings.</font>

In [12]:
more_100_ratings = ratings_summary[ratings_summary["ratings_count"] >= 100]
more_100_ratings_sorted = more_100_ratings.sort_values(by="ratings_mean", ascending=False)
display(more_100_ratings_sorted.head(5))

Unnamed: 0_level_0,title,ratings_mean,ratings_count
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
5618,Spirited Away (Sen to Chihiro no kamikakushi) ...,4.215216,2458
6016,City of God (Cidade de Deus) (2002),4.186592,2133
4226,Memento (2000),4.158512,4476
7156,Fog of War: Eleven Lessons from the Life of Ro...,4.112013,308
4973,"Amelie (Fabuleux destin d'Amélie Poulain, Le)...",4.097234,3687


<font size="+1" color="red">Repeat this, but this time consider movies receiving at least 3 ratings.</font>

In [13]:
more_3_ratings = ratings_summary[ratings_summary["ratings_count"] >= 3]
more_3_ratings_sorted = more_3_ratings.sort_values(by="ratings_mean", ascending=False)
display(more_3_ratings_sorted.head(5))

Unnamed: 0_level_0,title,ratings_mean,ratings_count
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
5082,"Rumor of Angels, A (2000)",4.666667,6
27764,2LDK (2003),4.5,3
31954,Beautiful City (Shah-re ziba) (2004),4.4,5
5224,Promises (2001),4.388889,18
6775,Life and Debt (2001),4.333333,3


<font size="+1" color="red">Replace this cell with a brief commentary, in your own words, on what happens when the number of ratings is set to a small value.</font>

##### Comparing the two lists, when the minimum number of ratings is small the top-rated movies are ones with very few ratings (between 3 and 18 here). With so few ratings the mean is unreliable: a handful of enthusiastic viewers is enough to produce a very high average, and niche movies tend to be rated mostly by the people who sought them out. With hundreds or thousands of ratings the audience is broader, so low ratings are more likely to appear and pull the average down. This is why the top means are higher with the 3-rating threshold (4.33 to 4.67) than with the 100-rating threshold (4.10 to 4.22).
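One common way to keep tiny samples from dominating such a ranking is a damped mean, which pulls each movie's average toward the global average in proportion to how few ratings it has. This is not part of the assignment: the sketch below uses made-up toy data, and the damping strength `k` is a hypothetical parameter chosen only for illustration.

```python
import pandas as pd

# Toy ratings: "A" has 2 perfect ratings, "B" has 50 very good ones,
# "C" has 50 poor ones (all values are invented for illustration).
toy = pd.DataFrame({
    "movie_id": ["A"] * 2 + ["B"] * 50 + ["C"] * 50,
    "rating": [5.0] * 2 + [4.5] * 50 + [2.0] * 50,
})

summary = toy.groupby("movie_id")["rating"].agg(
    ratings_mean="mean", ratings_count="count"
)

global_mean = toy["rating"].mean()
k = 10  # hypothetical: number of "virtual" ratings placed at the global mean

# Damped mean = (sum of real ratings + k * global_mean) / (count + k)
summary["damped_mean"] = (
    summary["ratings_mean"] * summary["ratings_count"] + k * global_mean
) / (summary["ratings_count"] + k)

print(summary.sort_values("damped_mean", ascending=False))
```

With the plain mean, movie "A" ranks first on the strength of two ratings; with the damped mean, "B" overtakes it because its 50 ratings outweigh the virtual ones.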

## 2.2. Compute the user-movie matrix

<font size="+1" color="red">Replace this cell with code to generate a "user_movie" matrix by calling "pivot_table" on "rated_movies". Print the first 5 rows. It might take about one minute to compute, depending on your computer.</font>

In [14]:
user_movie = rated_movies.pivot_table(index="user_id", columns="movie_id", values="rating")
display(user_movie.head(5))

movie_id,2769,3177,3190,3225,3228,3239,3273,3275,3276,3279,...,33138,33145,33148,33150,33152,33154,33158,33162,33164,33166
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4,,,,,,,,,,,...,,,,,,,,,,
33,,,,,,,,,,,...,,,,,,,,,,
62,,,,,,,,4.5,,,...,,,,,,,,,,3.5
63,,,,,,,,,,,...,,,,,,,,,,
95,,,,,,,,3.5,,,...,,,,,,,,,,


<font size="+1" color="red">Replace this cell with a brief commentary indicating why you think the "user_movie" matrix has so many "NaN" values. What do we call this characteristic of user ratings in recommender systems?</font>

##### So many null values are expected: there are more than 2,000 rated films, and any given user has seen and rated only a very small fraction of them, whether for lack of interest or lack of time. Every user-movie pair without a rating appears as NaN in the matrix. This characteristic is known as <b>sparsity</b>.
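Sparsity can be quantified as the fraction of empty cells in the user-movie matrix. The sketch below shows the calculation on a small invented table (the names `toy_ratings` and `toy_user_movie` are illustrative); the same three lines at the end should work on the real `user_movie` matrix.

```python
import pandas as pd

# Invented ratings: 4 users, 4 movies, 6 ratings in total
toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3, 4],
    "movie_id": [10, 20, 10, 30, 40, 20],
    "rating":   [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})
toy_user_movie = toy_ratings.pivot_table(
    index="user_id", columns="movie_id", values="rating"
)

n_cells = toy_user_movie.size                          # users x movies
n_observed = int(toy_user_movie.notna().sum().sum())   # cells holding a rating
sparsity = 1 - n_observed / n_cells                    # fraction of NaN cells

print(f"{n_observed} ratings in {n_cells} cells, sparsity = {sparsity:.1%}")
```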


## 2.3. Explore some correlations in the user-movie matrix

<font size="+1" color="red">Replace this cell with code to compute and display the first 10 rows of the "ratings3" table as described above.</font>

In [15]:
find_movies("Lord of the Rings", movies) #we need the real title in our dataset for the first film, which is not exactly as described in the instructions

movie_id: 4993, title: Lord of the Rings: The Fellowship of the Ring, The (2001)
movie_id: 5952, title: Lord of the Rings: The Two Towers, The (2002)
movie_id: 7153, title: Lord of the Rings: The Return of the King, The (2003)


In [16]:
id_pivot = movies[movies["title"] == "Lord of the Rings: The Fellowship of the Ring, The (2001)"].movie_id.iloc[0]
id_m1 = movies[movies["title"] == "Finding Nemo (2003)"].movie_id.iloc[0]
id_m2 = movies[movies["title"] == "Talk to Her (Hable con Ella) (2002)"].movie_id.iloc[0]

s1 = user_movie[id_pivot].dropna()
s2 = user_movie[id_m1].dropna()
s3 = user_movie[id_m2].dropna()

ratings3 = pd.concat([s1, s2, s3], axis=1).dropna()
display(ratings3.head(10))

Unnamed: 0_level_0,4993,6377,5878
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
859,3.0,4.0,5.0
1229,4.0,4.0,4.5
1281,3.0,2.5,3.0
1722,5.0,4.5,4.0
2004,4.5,3.0,3.5
4590,4.0,4.0,2.0
5052,2.0,4.0,4.0
5144,5.0,5.0,5.0
6497,3.5,3.5,3.5
8369,3.0,4.0,4.5


<font size="+1" color="red">Replace this cell with code to compute all correlations between these three movies, as described above.</font>

In [17]:
def print_similarity(movie_id1, movie_id2, movies):
    # Pearson correlation over the users in ratings3 (those who rated all three movies)
    similarity = ratings3[movie_id1].corr(ratings3[movie_id2])
    title1 = get_title(movie_id1, movies)
    title2 = get_title(movie_id2, movies)
    print(f"Similarity between {title1} and {title2}: {similarity: .2f}")

print_similarity(4993, 6377, movies)
print_similarity(4993, 5878, movies)
print_similarity(6377, 5878, movies)

Similarity between Lord of the Rings: The Fellowship of the Ring, The (2001) and Finding Nemo (2003):  0.38
Similarity between Lord of the Rings: The Fellowship of the Ring, The (2001) and Talk to Her (Hable con Ella) (2002):  0.16
Similarity between Finding Nemo (2003) and Talk to Her (Hable con Ella) (2002):  0.20


<font size="+1" color="red">Replace this cell with a brief commentary on the correlations you find.</font>

<b>The correlation between Lord of the Rings and Finding Nemo (0.38) is moderate, which suggests that users who enjoyed the first tended to enjoy the second as well. Both were popular mainstream releases with wide appeal, which may explain this correlation.</b>

<b>The correlation between Lord of the Rings and Talk to Her (a Spanish drama with a niche audience) is low (0.16): the two are only weakly related. The difference in genre and target audience likely contributes to this lower similarity.</b>

<b>The correlation between Finding Nemo and Talk to Her is also low (0.20). "Finding Nemo" is a family-oriented animated film, while "Talk to Her" is an adult drama, so they likely appeal to different audiences.</b>
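The similarities above are Pearson correlations, computed only over users who rated both movies. As a minimal sketch on invented ratings, the snippet below checks the formula by hand against `Series.corr`, and confirms that pandas ignores pairs where either rating is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings given by five users to two movies
x = pd.Series([3.0, 4.0, 3.0, 5.0, 4.5])
y = pd.Series([4.0, 4.0, 2.5, 4.5, 3.0])

# Pearson correlation: covariance of the centered ratings over the product of their norms
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_pandas = x.corr(y)
print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")

# With missing ratings, Series.corr uses only the pairs where both values exist
x_nan = pd.Series([3.0, 4.0, np.nan, 5.0, 4.5, 2.0])
y_nan = pd.Series([4.0, 4.0, 2.5, 4.5, 3.0, np.nan])
both = x_nan.notna() & y_nan.notna()
r_pairwise = x_nan.corr(y_nan)
r_complete = x_nan[both].corr(y_nan[both])
print(f"pairwise: {r_pairwise:.4f}, complete cases only: {r_complete:.4f}")
```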





<font size="+1" color="red">Replace this cell with code to create a "similarity_to_pivot" series that contains the computed correlations, dropping the NaNs in the series.</font>

In [18]:
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)
pivot_ratings = pd.DataFrame(user_movie[id_pivot].dropna()).rename(columns={id_pivot: "rating"})
correlations = user_movie.corrwith(pivot_ratings['rating'])
similarity_to_pivot = pd.DataFrame({
    'movie_id': correlations.index,
    'corr_with_pivot': correlations.values
})
similarity_to_pivot = similarity_to_pivot.dropna()
display(similarity_to_pivot)

Unnamed: 0,movie_id,corr_with_pivot
0,2769,-0.127515
1,3177,0.093221
2,3190,0.041206
3,3225,0.126600
5,3239,0.338378
...,...,...
2044,33154,0.318255
2045,33158,0.228214
2046,33162,0.285377
2047,33164,0.037130


<font size="+1" color="red">Replace this cell with code to create a "corr_with_pivot" dataframe as specified above, and to print the 20 movies (rated 500 times or more) with the highest correlation with the selected movie.</font>

In [19]:
corr_with_pivot = pd.merge(similarity_to_pivot, ratings_summary, how='inner', on='movie_id')
# "Rated 500 times or more" means the threshold itself is included
corr_with_pivot = corr_with_pivot[corr_with_pivot["ratings_count"] >= 500]
display(corr_with_pivot.sort_values("corr_with_pivot", ascending=False).head(10))

Unnamed: 0,movie_id,corr_with_pivot,title,ratings_mean,ratings_count
481,4993,1.0,"Lord of the Rings: The Fellowship of the Ring,...",4.09253,5944
808,5952,0.892103,"Lord of the Rings: The Two Towers, The (2002)",4.083869,5449
1178,7153,0.892073,"Lord of the Rings: The Return of the King, The...",4.08396,5449
987,6539,0.377599,Pirates of the Caribbean: The Curse of the Bla...,3.779241,3950
1340,8368,0.340934,Harry Potter and the Prisoner of Azkaban (2004),3.809971,2397
55,3578,0.337667,Gladiator (2000),3.95105,4811
86,3793,0.329686,X-Men (2000),3.556436,3535
451,4896,0.31918,Harry Potter and the Sorcerer's Stone (a.k.a. ...,3.678509,2843
68,3624,0.307471,Shanghai Noon (2000),3.297443,1017
1775,31658,0.303898,Howl's Moving Castle (Hauru no ugoku shiro) (2...,4.064417,1141


<font size="+1" color="red">Replace this cell with a brief commentary about the movies you see on this list. What happens if you set the condition on *ratings_count* to a much larger value? What happens if you set it to a much smaller value?</font>

In [20]:
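# Note: corr_with_pivot was already filtered by ratings_count in the previous cell,
# so a threshold below that one (such as > 10) cannot bring low-count movies back.
corr_with_pivot_large = corr_with_pivot[corr_with_pivot["ratings_count"] > 3000]
display(corr_with_pivot_large.sort_values("corr_with_pivot", ascending=False).head(10))

corr_with_pivot_low = corr_with_pivot[corr_with_pivot["ratings_count"] > 10]
display(corr_with_pivot_low.sort_values("corr_with_pivot", ascending=False).head(10))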

Unnamed: 0,movie_id,corr_with_pivot,title,ratings_mean,ratings_count
481,4993,1.0,"Lord of the Rings: The Fellowship of the Ring,...",4.09253,5944
808,5952,0.892103,"Lord of the Rings: The Two Towers, The (2002)",4.083869,5449
1178,7153,0.892073,"Lord of the Rings: The Return of the King, The...",4.08396,5449
987,6539,0.377599,Pirates of the Caribbean: The Curse of the Bla...,3.779241,3950
55,3578,0.337667,Gladiator (2000),3.95105,4811
86,3793,0.329686,X-Men (2000),3.556436,3535
592,5349,0.302174,Spider-Man (2002),3.457931,3209
294,4306,0.296144,Shrek (2001),3.768787,4591
959,6377,0.268611,Finding Nemo (2003),3.862284,3765
444,4886,0.264137,"Monsters, Inc. (2001)",3.850066,3775


Unnamed: 0,movie_id,corr_with_pivot,title,ratings_mean,ratings_count
481,4993,1.0,"Lord of the Rings: The Fellowship of the Ring,...",4.09253,5944
808,5952,0.892103,"Lord of the Rings: The Two Towers, The (2002)",4.083869,5449
1178,7153,0.892073,"Lord of the Rings: The Return of the King, The...",4.08396,5449
987,6539,0.377599,Pirates of the Caribbean: The Curse of the Bla...,3.779241,3950
1340,8368,0.340934,Harry Potter and the Prisoner of Azkaban (2004),3.809971,2397
55,3578,0.337667,Gladiator (2000),3.95105,4811
86,3793,0.329686,X-Men (2000),3.556436,3535
451,4896,0.31918,Harry Potter and the Sorcerer's Stone (a.k.a. ...,3.678509,2843
68,3624,0.307471,Shanghai Noon (2000),3.297443,1017
1775,31658,0.303898,Howl's Moving Castle (Hauru no ugoku shiro) (2...,4.064417,1141


<b>The movies most similar to the pivot ("The Lord of the Rings: The Fellowship of the Ring") are those that share its fan base, most clearly the other two Lord of the Rings films (correlation 0.89): users who enjoyed one tended to enjoy the rest. The other movies in the list are fantasy, adventure, and action films, which are close in genre or theme to the pivot.</b>

<b>Raising the threshold to 3,000 ratings changes only the lower part of the top 10: the two Harry Potter films, Shanghai Noon, and Howl's Moving Castle drop out and are replaced by Spider-Man, Shrek, Finding Nemo, and Monsters, Inc., while the top entries stay the same. The list for the small threshold (10) is identical to the list for 500, but only because "corr_with_pivot" had already been restricted to movies with roughly 500 or more ratings, so the lower threshold had no effect. In general a very small threshold has to be used with care: correlations computed from a handful of common ratings are unreliable and can push obscure movies to the top of the list.</b>
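To see why correlations based on few common ratings are unreliable, it helps to count how many users rated both movies in each pair. The sketch below uses an invented three-movie matrix (`toy_user_movie` is illustrative, not the real data) and compares `DataFrame.corr` with and without `min_periods`.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Invented matrix: movies A and B share five raters, movie C has only two ratings
toy_user_movie = pd.DataFrame(
    {
        "A": [5.0, 4.0, 3.0, 2.0, 4.5],
        "B": [4.5, 4.0, 2.5, 2.0, 5.0],
        "C": [nan, nan, 1.0, 5.0, nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

# Entry (i, j) counts the users who rated both movie i and movie j
rated = toy_user_movie.notna().astype(int)
common_raters = rated.T @ rated

corr_all = toy_user_movie.corr()                 # any overlap is accepted
corr_min3 = toy_user_movie.corr(min_periods=3)   # pairs with < 3 common raters become NaN

print(common_raters)
print(corr_all.round(2))
print(corr_min3.round(2))
```

Here A and C get a correlation of -1 from just two common raters, which says little about the movies; with `min_periods=3` that pair is reported as NaN instead.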

## 2.4. Implement the item-based recommendations

<font size="+1" color="red">Replace this cell with your code to compute all correlations between columns (movies) in the matrix user_movie. Store this in "item_similarity", and print the first 10 rows.</font>

In [21]:
item_similarity = user_movie.corr()

In [22]:
display(item_similarity.head(10))

movie_id,2769,3177,3190,3225,3228,3239,3273,3275,3276,3279,...,33138,33145,33148,33150,33152,33154,33158,33162,33164,33166
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2769,1.0,0.115068,0.033721,-0.232268,,-0.5,0.197011,0.199514,0.250873,,...,0.37998,0.87831,,,,0.248126,0.1806095,-0.08557,-0.408248,0.105671
3177,0.115068,1.0,0.30382,0.559533,,,0.331191,0.167918,1.0,,...,0.546119,0.735767,-1.0,,,-0.221382,0.3174747,0.014735,0.661989,0.185654
3190,0.033721,0.30382,1.0,0.636361,,-0.014315,0.146042,0.394293,-0.290397,,...,0.246183,0.632026,,,,0.378181,0.1709261,0.022444,-0.07336,-0.054114
3225,-0.232268,0.559533,0.636361,1.0,,0.578414,0.347716,0.263671,-0.250313,,...,-0.300376,0.318377,,,,0.480173,0.7503063,0.536828,0.753141,0.098748
3228,,,,,1.0,,,,,,...,,,,,,,,,,
3239,-0.5,,-0.014315,0.578414,,1.0,0.180846,1.0,,,...,,,,,,1.0,,1.0,0.636285,0.8882
3273,0.197011,0.331191,0.146042,0.347716,,0.180846,1.0,0.105735,0.154371,,...,0.006774,0.409968,1.0,,,0.088405,0.07516779,0.143492,0.466705,0.084202
3275,0.199514,0.167918,0.394293,0.263671,,1.0,0.105735,1.0,0.485071,,...,-0.011426,0.279624,,,,0.075827,0.2994603,0.187713,0.285584,0.225317
3276,0.250873,1.0,-0.290397,-0.250313,,,0.154371,0.485071,1.0,,...,,0.29277,,,,0.0,-6.885311000000001e-17,-0.45553,0.5,-0.138013
3279,,,,,,,,,,1.0,...,,,,,,,,,,


<font size="+1" color="red">Replace this cell with your code to compute all correlations between columns (movies) in the matrix user_movie, but considering only movies having at least 100 ratings in common. Store this in "item_similarity_min_ratings"</font>

In [None]:
item_similarity_min_ratings = user_movie.corr(min_periods=100)

In [None]:
display(item_similarity_min_ratings.head(5))

<font size="+1" color="red">Replace this cell with your code to find the userids of two example users: user_id_super (the one who liked the three superhero movies), and user_id_drama (the one who liked the three dramas)</font>

In [None]:
like_spiderman = rated_movies[((rated_movies['movie_id'] == 5349) & (rated_movies['rating'] > 4.5))]
like_xmen = rated_movies[((rated_movies['movie_id'] == 3793) & (rated_movies['rating'] > 4.5))]
like_hulk = rated_movies[((rated_movies['movie_id'] == 6534) & (rated_movies['rating'] > 4.5))]
first_merge = pd.merge(like_spiderman, like_xmen, how="inner", on="user_id")
user_id_super = pd.merge(first_merge, like_hulk, how="inner", on="user_id")["user_id"].iloc[0]
print(f"Selected user who liked three superhero movies: {user_id_super}")

In [None]:
like_mysticriver = rated_movies[((rated_movies['movie_id'] == 6870) & (rated_movies['rating'] > 4.5))]
like_pianist = rated_movies[((rated_movies['movie_id'] == 5995) & (rated_movies['rating'] > 4.5))]
like_u571 = rated_movies[((rated_movies['movie_id'] == 3555) & (rated_movies['rating'] > 4.5))]
first_merge = pd.merge(like_mysticriver, like_pianist, how="inner", on="user_id")
user_id_drama = pd.merge(first_merge, like_u571, how="inner", on="user_id")["user_id"].iloc[0]
print(f"Selected user who liked three drama movies: {user_id_drama}")

In [None]:
# Leave this code as-is

# Gets a list of watched movies for a user_id
def get_watched_movies(user_id, user_movie):
    return list(user_movie.loc[user_id].dropna().sort_values(ascending=False).index)
    
# Gets the rating a user_id has given to a movie_id
def get_rating(user_id, movie_id, user_movie):
    return user_movie[movie_id][user_id]

# Print watched movies
def print_watched_movies(user_id, user_movie, movies):
    for movie_id in get_watched_movies(user_id, user_movie):
        print("%d %.1f %s " %
          (movie_id, get_rating(user_id, movie_id, user_movie), get_title(movie_id, movies)))


In [None]:
# LEAVE AS-IS (TESTING CODE)

print_watched_movies(user_id_super, user_movie, movies)

In [None]:
# LEAVE AS-IS (TESTING CODE)

print_watched_movies(user_id_drama, user_movie, movies)

<font size="+1" color="red">Replace this cell with your code for "get_movies_relevance"</font>

In [None]:
def get_movies_relevance(user_id, user_movie, item_similarity_matrix):
    # Create an empty series
    movies_relevance = pd.Series(dtype='float64')
    
    # Iterate through the movies the user has watched
    for watched_movie in get_watched_movies(user_id, user_movie):
        
        # Obtain the rating given
        rating_given = get_rating(user_id, watched_movie, user_movie)
        
        # Obtain the vector containing the similarities of watched_movie
        # with all other movies in item_similarity_matrix
        similarities = item_similarity_matrix[watched_movie]
        
        # Multiply this vector by the given rating
        weighted_similarities = similarities*rating_given
        
        # Append these terms to movies_relevance
        movies_relevance = pd.concat([movies_relevance, weighted_similarities])
    
    # Compute the sum for each movie
    movies_relevance = movies_relevance.groupby(movies_relevance.index).sum()
    
    # Convert to a dataframe
    movies_relevance_df = pd.DataFrame(movies_relevance, columns=['relevance'])
    movies_relevance_df['movie_id'] = movies_relevance_df.index
    
    return movies_relevance_df
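The loop in "get_movies_relevance" computes, for each movie, the sum over watched movies of (similarity × rating given). As a sketch of the same calculation without the explicit loop, the snippet below uses an invented 3×3 similarity matrix and one hypothetical user; `toy_similarity` and `user_ratings` are illustrative names, not part of the assignment.

```python
import numpy as np
import pandas as pd

movie_ids = [10, 20, 30]
# Invented item-item similarities; NaN means "not enough common ratings"
toy_similarity = pd.DataFrame(
    [[1.0, 0.5, np.nan],
     [0.5, 1.0, 0.2],
     [np.nan, 0.2, 1.0]],
    index=movie_ids, columns=movie_ids,
)

# Ratings given by one hypothetical user (movie 30 has not been watched)
user_ratings = pd.Series({10: 4.0, 20: 2.0})

# Keep the similarity columns of the watched movies, scale each column by the
# rating given to that movie, then sum across columns (NaN terms are skipped)
weighted = toy_similarity[user_ratings.index].mul(user_ratings, axis="columns")
relevance = weighted.sum(axis="columns")

print(relevance.sort_values(ascending=False))
```

For movie 10 this gives 1.0 × 4.0 + 0.5 × 2.0 = 5.0, the same value the loop would accumulate; skipping NaN terms matches the behaviour of `groupby(...).sum()` in the function above.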

<font size="+1" color="red">Replace this cell with your code to obtain the 5 most relevant movies for the users user_id_super (who likes superhero movies) and user_id_drama (who likes dramas)</font>

In [None]:
relevance_super = get_movies_relevance(user_id_super, user_movie, item_similarity)
relevance_super = relevance_super.merge(movies[['movie_id', 'title']], on='movie_id')
print("Most relevant movies for user who likes superhero movies: ")
display(relevance_super.sort_values(by='relevance', ascending=False).head(5))

relevance_drama = get_movies_relevance(user_id_drama, user_movie, item_similarity)
relevance_drama = relevance_drama.merge(movies[['movie_id', 'title']], on='movie_id')
print("Most relevant movies for user who likes drama movies: ")
display(relevance_drama.sort_values(by='relevance', ascending=False).head(5))

<font size="+1" color="red">Replace this cell with a brief commentary on the movies you see on these lists. How many of them look relevant for the intended users? Feel free to use IMDB or Wikipedia to get info on these movies.</font>

<font size="-1" color="gray">All those trivial facts you learned about 1980s and 1990s pop culture were supposed to be useful one day; that day has arrived :-)</font>

<b>For the superhero user:</b>

1- I, Robot (2004): This is a sci-fi action film, which aligns well with superhero and action fans, featuring themes of technology, futuristic action, and an iconic performance by Will Smith.

2- Men in Black II (2002): Another sci-fi action movie with Will Smith, known for humor and action, and definitely relevant for fans of the superhero genre.

3- The Patriot (2000): While this film is more of a historical war drama than a sci-fi or superhero film, it has a lot of action sequences and epic battles, which may still appeal to an action-loving audience.

4- The Day After Tomorrow (2004): This is a disaster film with intense action sequences and high-stakes tension, appealing to audiences who enjoy thrilling action scenes.

5- Pearl Harbor (2001): Although this is primarily a war drama, it features significant action sequences, which might appeal to an action-oriented viewer.

Most of these recommendations are indeed action-packed or sci-fi films, which aligns reasonably well with the preferences of a superhero/action fan. Although films like The Patriot and Pearl Harbor are not directly superhero or sci-fi, their action-oriented themes make them somewhat relevant.


<b>For the drama user:</b>

1- Ray (2004): A biographical drama about Ray Charles, which is highly relevant for someone who enjoys character-driven, dramatic stories.

2- Finding Forrester (2000): This is a drama film that focuses on a young writer’s mentorship and growth, well-suited for a drama enthusiast.

3- Seabiscuit (2003): A biographical sports drama about a racehorse during the Great Depression. The inspirational story and emotional depth make it appealing to drama fans.

4- A Beautiful Mind (2001): A biographical drama based on the life of mathematician John Nash, which delves into mental health struggles and personal growth, aligning well with a drama lover’s taste.

5- I Am Sam (2001): This emotional drama explores themes of parenting and disability, making it a strong recommendation for someone interested in human-centered drama. 

All the recommendations for this user align very well with a preference for drama films. Each movie is heavily character-driven, emotional, and deals with serious themes, making them highly relevant for someone who enjoys dramas.

<font size="+1" color="red">Replace this cell with your code implementing "get_recommended_movies"</font>

In [None]:
def get_recommended_movies(user_id, user_movie, item_similarity_matrix, movies):
    relevant_movies = get_movies_relevance(user_id, user_movie, item_similarity_matrix)
    relevant_movies.set_index('movie_id', inplace=True)
    watched_movies = get_watched_movies(user_id, user_movie)
    relevant_movies = relevant_movies.drop(watched_movies, errors='ignore')
    recommended_movies = relevant_movies.merge(movies[['movie_id', 'title']], on='movie_id')
    recommended_movies = recommended_movies.sort_values(by='relevance', ascending=False)
    return recommended_movies    

<font size="+1" color="red">Replace this cell with your code to obtain the 10 most recommended movies for the users user_id_super and user_id_drama</font>

In [None]:
recommendations_super = get_recommended_movies(user_id_super, user_movie, item_similarity, movies)
display(recommendations_super.head(10))
recommendations_drama = get_recommended_movies(user_id_drama, user_movie, item_similarity, movies)
display(recommendations_drama.head(10))

<font size="+1" color="red">Replace this cell with a brief commentary on these recommendations. Do you think they are relevant? Why or why not? After removing the movies the user has already watched, are the relevance scores of the remaining items comparable to the previous lists that contained all relevant movies?</font>

The recommendations appear to be again reasonably aligned with the interests of each user. Let's review them:

<b>User Interested in Superhero Movies:</b>

This user has high relevance scores for titles like "The Matrix Reloaded", "xXx (2002)", and "The Italian Job". These movies feature action-packed plots, adventure, or high-intensity sequences that would appeal to someone who enjoys superhero or action-oriented content.
Other recommended movies such as "Fast and the Furious", "Ocean's Eleven", and "The Matrix Revolutions" also fit well within this genre, providing action, suspense, and thrills.

Overall, the recommendations seem relevant, as they capture the action/superhero appeal this user would likely enjoy.

<b>User Interested in Dramas:</b>

The recommendations for this user lean toward deeper, story-driven films, including "Ray", "Seabiscuit", and "I Am Sam", which align with the emotional and dramatic themes this user prefers.
Other recommendations like "We Were Soldiers", "Man on Fire", and "Enemy at the Gates" are also drama-centric, focusing on intense, character-driven narratives and historical or biographical elements.

These selections appear relevant to a user who has shown a preference for drama movies, as they include impactful storytelling and strong emotional themes.

<b>Comparison of Relevance Scores After Filtering Watched Movies</b>

After filtering out the movies that the users have already watched, the relevance scores of the remaining recommendations appear slightly lower but still comparable to those in the original lists. This makes sense: removing watched movies only drops those entries from the ranking, and unwatched movies that are similar in genre or theme to the ones the user rated highly can also reach a high relevance.

# EXTRA POINTS 

In [None]:
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import KNNBasic

In [None]:
ratings_data = user_movie.reset_index().melt(id_vars='user_id', var_name='movie_id', value_name='rating').dropna()
ratings_data['movie_id'] = ratings_data['movie_id'].astype(str)

# Define a reader with the appropriate rating scale
# (ratings use half-star steps; 0.5 is assumed here to be the lowest possible value)
reader = Reader(rating_scale=(0.5, 5))

data = Dataset.load_from_df(ratings_data[['user_id', 'movie_id', 'rating']], reader)

trainset, testset = train_test_split(data, test_size=0.25, random_state=42)

algo = SVD()  # Singular Value Decomposition model

algo.fit(trainset)

def get_surprise_recommendations(user_id, algo, user_movie, movies, n_recommendations=10):
    # Get a list of all movies
    all_movie_ids = user_movie.columns.astype(str)
    
    # Get the movies the user has already rated, converting each to a string
    watched_movies = set(str(movie_id) for movie_id in get_watched_movies(user_id, user_movie))
    
    # Generate predictions for all movies the user hasn't seen
    recommendations = []
    for movie_id in all_movie_ids:
        if movie_id not in watched_movies:
            pred = algo.predict(user_id, movie_id)
            recommendations.append((movie_id, pred.est))  # pred.est is the predicted rating
    
    # Sort by predicted rating in descending order and select the top-n
    recommendations.sort(key=lambda x: x[1], reverse=True)
    top_recommendations = recommendations[:n_recommendations]
    
    # Convert to dataframe for better readability
    recommended_movies_df = pd.DataFrame(top_recommendations, columns=['movie_id', 'predicted_rating'])
    recommended_movies_df['movie_id'] = recommended_movies_df['movie_id'].astype(int)
    
    # Merge with movie titles
    recommended_movies_df = recommended_movies_df.merge(movies, on='movie_id', how='left')
    
    return recommended_movies_df[['movie_id', 'title', 'predicted_rating']]

recommendations_super = get_surprise_recommendations(user_id_super, algo, user_movie, movies)
recommendations_drama = get_surprise_recommendations(user_id_drama, algo, user_movie, movies)

print("Recommendations for user who likes superhero movies:")
print(recommendations_super)

print("\nRecommendations for user who likes dramas:")
print(recommendations_drama)

<b>Recommendations for the User Who Likes Superhero Movies</b>

The top recommendations for this user include films like "Snatch", "Ocean's Eleven", and "Bloody Sunday", which, while popular, are not directly related to superhero themes: they are heist, crime, or drama films rather than action-based superhero content. They may share some action or thriller elements, but they do not align well with the superhero genre this user prefers. A likely reason is that the model predicts ratings, so movies that most users rate highly rank near the top regardless of this user's specific genre preferences.

<b>Recommendations for the User Who Likes Dramas</b>

For the user who prefers drama movies, the recommendations are better aligned. The list includes "Amelie", "Howl's Moving Castle", "Man Without a Past", "Amores Perros", and "Spirited Away", which are dramas or otherwise thought-provoking films. These films are generally well regarded for their narrative depth and emotional storytelling, which makes them suitable for a drama enthusiast. "Spirited Away" and "Howl's Moving Castle" are animated, which adds some diversity while remaining relevant.
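The tilt toward generally well-rated movies is consistent with how the Surprise `SVD` model forms a prediction: global mean + user bias + item bias + a latent-factor term. The sketch below reproduces only the first three terms on invented data, using plain mean offsets as biases; this is a simplification (Surprise learns regularized biases by stochastic gradient descent), and `baseline` is a hypothetical helper, not a library function.

```python
import pandas as pd

# Invented ratings: "hit" is liked by everyone, "niche" and "other" are not
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3, 3],
    "movie_id": ["hit", "niche", "hit", "niche", "hit", "other"],
    "rating":   [5.0, 3.0, 4.5, 2.5, 5.0, 3.0],
})

mu = toy["rating"].mean()                                     # global mean
item_bias = toy.groupby("movie_id")["rating"].mean() - mu     # how far each movie sits from mu
user_bias = toy.groupby("user_id")["rating"].mean() - mu      # how generous each user is

def baseline(user_id, movie_id):
    """Hypothetical helper: prediction without any latent-factor term."""
    return mu + user_bias.get(user_id, 0.0) + item_bias.get(movie_id, 0.0)

for user_id in [1, 2, 3]:
    ranked = sorted(item_bias.index, key=lambda m: baseline(user_id, m), reverse=True)
    print(user_id, ranked)
```

Because the user bias shifts all of a user's predictions by the same amount, the baseline ranks movies identically for every user; only the latent-factor term can personalize the list, so when that term is small the top recommendations are dominated by movies with high average ratings.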

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>