# Individual Recommender Systems - Tutorial 1 (Lab 1)


In this tutorial, we will implement some of the strategies for individual recommender systems covered in the first lectures of the course. Along the way, we will introduce some Python libraries that are useful for machine learning applications. After completing this tutorial, you will be able to: 
- Create a non-personalized recommender system in the domain of movies for individual users.
- Create a personalized recommender system using either a content based or a collaborative filtering approach.
- Evaluate the performance of the recommender systems, using different evaluation protocols.


#### Summary

1. Dataset
2. Non personalized recommendations: ordering by average rating
3. Towards personalization: Filtering by genre
4. Content Based Recommender System
5. Collaborative Filtering Recommender System
6. Evaluation of Recommender Systems
    1. Evaluation of a CB recommender
    2. Evaluation of a CF recommender



#### 1. Dataset


We will use the preprocessed dataset provided in the folder *preprocessed_dataset*. The dataset has been obtained from the 
[Movielens latest small](https://www.kaggle.com/grouplens/movielens-latest-small) dataset (providing users' ratings) and the [Wikipedia Movie Plots](https://www.kaggle.com/jrobischon/wikipedia-movie-plots) dataset (providing information about the movies). Check the notebook **Dataset Preprocessing** for details on how the datasets have been preprocessed and merged.

The dataset contains two CSV files, following the structure of the Movielens dataset:
1. ratings.csv: contains the users' feedback on the movies. Each row contains the id of a user, the id of a movie, and the rating that the user gave to the movie. It also contains the timestamp, but we will not use it in this tutorial.
2. movies.csv: contains the information about the movies. In particular, we will use the title and the genres.



We start by using the [pandas](https://pandas.pydata.org/) library to read the content of the CSV files. Let us look at the first 10 rows of each file to check what the *ratings.csv* and *movies.csv* files provide. To do so, we use the *read_csv* function, which returns a pandas [Dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) object containing the information read from the CSV file. This object provides several methods to manipulate its data; we will use the [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) method, which returns the first *n* rows of the Dataframe. We also use the [display](https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html#IPython.display.display) function, which prints information nicely in a Jupyter notebook.

In [4]:


import pandas as pd

In [5]:
ratings_df = pd.read_csv("ratings.csv")
display(ratings_df.head(10))

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580
5,1,112,3.5,1094785740
6,1,151,4.0,1094785734
7,1,223,4.0,1112485573
8,1,253,4.0,1112484940
9,1,260,4.0,1112484826


In [7]:
movies_df = pd.read_csv("movies.csv", index_col="movieId")
display(movies_df.head(10))

Unnamed: 0_level_0,title,genres
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller


Let us have a look at the number of ratings we have for each movie. First, we select the columns *userId* and *movieId*. Then, we use the [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) method to group the rows by the value of a specific column. The method returns a [Groupby](https://pandas.pydata.org/docs/reference/groupby.html) object, which provides methods to aggregate the values of each group. We group by *movieId* to count the number of users that rated each movie. To compute this, we use the [count](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.count.html) method, which returns a Dataframe object with one row per movie and the number of users who rated that movie in the *userId* column.

Then, we can visualize the distribution of the number of ratings per movie using the [seaborn](https://seaborn.pydata.org/) library. We use the [displot](https://seaborn.pydata.org/generated/seaborn.histplot.html#seaborn.displot) function, specifying `x='userId'`. The plotting call is commented out in the cell below; uncomment it to draw the histogram.

In [11]:
import seaborn as sns

count_ratings_df = ratings_df[['userId','movieId']].groupby(['movieId']).count()
display(count_ratings_df)
# sns.displot(data=count_ratings_df, x='userId', binwidth=1)

Unnamed: 0_level_0,userId
movieId,Unnamed: 1_level_1
1,49695
2,22243
3,12735
4,2756
5,12161
...,...
131254,1
131256,1
131258,1
131260,1
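
To see what `groupby(...).count()` does on a small scale, here is a minimal sketch on a hypothetical toy Dataframe. The values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy ratings, invented for illustration only
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 10],
})

# One row per movie; the userId column now holds the number of raters
toy_counts = toy_ratings.groupby("movieId").count()
print(toy_counts)
```

After the aggregation, *movieId* is the index of the result, which is why the real output above shows it in the first position.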


Note that in this plot the horizontal axis shows the number of users that rated a movie, and the vertical axis shows the number of movies having that number of ratings. We might instead want the movies on the horizontal axis and the number of users who rated each of them on the vertical axis. There are different ways to do this; we simply sort the previous Dataframe by *userId* (i.e. the number of ratings for the corresponding movie) in descending order, using the [sort_values](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) method of the Dataframe object, and then reset the index twice to obtain a progressive id for the movies (each call to [reset_index](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html) moves the current index into a column and creates a new index that assigns a progressive identifier to each row). Then, we plot the line with *index* on the horizontal axis and *userId* on the vertical axis. 

In [12]:
count_ratings_df_ord = count_ratings_df.sort_values(by="userId", ascending=False).reset_index().reset_index()
display(count_ratings_df_ord)
#sns.lineplot(data=count_ratings_df_ord, x='index', y='userId')

Unnamed: 0,index,movieId,userId
0,0,296,67310
1,1,356,66172
2,2,318,63366
3,3,593,63299
4,4,480,59715
...,...,...,...
26739,26739,110807,1
26740,26740,110811,1
26741,26741,78984,1
26742,26742,110818,1
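
The effect of calling `reset_index` twice can be checked on a small example. The sketch below uses a hypothetical Dataframe of counts (made-up values): the first call turns *movieId* into a regular column, and the second call turns the fresh 0..n-1 index into a column named *index*, which acts as the popularity rank.

```python
import pandas as pd

# Hypothetical counts indexed by movieId (made-up values)
toy_counts = pd.DataFrame(
    {"userId": [3, 7, 1]},
    index=pd.Index([10, 20, 30], name="movieId"),
)

ranked = (
    toy_counts.sort_values(by="userId", ascending=False)
    .reset_index()   # movieId moves from the index into a column
    .reset_index()   # the new 0..n-1 index becomes a column named "index"
)
print(ranked)
```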


Is this what you expected? Does this remind you of something?

![title](img/The-long-tail-dinosaur.png)

#### 2. Non personalized recommendations: ordering by average rating

A first idea to provide recommendations for our users can be to use the ratings provided and determine the highest rated movies. To do so, we first need to compute the average rating for each movie, and then determine the top rated. Let's use again the functions provided by the Dataframe object. We can use the [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) again. To compute the average ratings, we use the [mean](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.mean.html) method which returns a Dataframe object, in which we will have a row for each movie, and the average rating will be in the *rating* column. Finally, we use the [sort_values](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) method to sort the movies by rating, and we print the first 10 movies.

In [13]:
average_ratings_df = ratings_df.groupby(['movieId']).mean()
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False)

display(sorted_avg_ratings_df.head(10))

Unnamed: 0_level_0,userId,rating,timestamp
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
95517,135399.0,5.0,1341366000.0
105846,68026.0,5.0,1382644000.0
89133,29086.0,5.0,1425168000.0
105187,83331.0,5.0,1380285000.0
105191,83331.0,5.0,1380286000.0
88488,30317.0,5.0,1399737000.0
126945,28916.0,5.0,1422623000.0
105526,117718.0,5.0,1381669000.0
105529,117718.0,5.0,1381670000.0
105841,68026.0,5.0,1382644000.0


Let's use the information in *movies_df* to associate a title with each movie, and then print the list of recommended movies. We will use the [join](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html) method of the Dataframe object. Remember that the values identified by the *on* parameter are matched against the index of the second Dataframe, which is why *movies_df* was loaded with *movieId* as its index.

In [15]:
joined_df = sorted_avg_ratings_df.join(movies_df, on='movieId')
display(joined_df.head(10))

Unnamed: 0_level_0,userId,rating,timestamp,title,genres
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
95517,135399.0,5.0,1341366000.0,"Barchester Chronicles, The (1982)",Drama
105846,68026.0,5.0,1382644000.0,Only Daughter (2013),Drama
89133,29086.0,5.0,1425168000.0,Boys (Drenge) (1977),Drama
105187,83331.0,5.0,1380285000.0,Linotype: The Film (2012),Documentary
105191,83331.0,5.0,1380286000.0,Rocaterrania (2009),Documentary|Fantasy
88488,30317.0,5.0,1399737000.0,"Summer Wishes, Winter Dreams (1973)",Drama
126945,28916.0,5.0,1422623000.0,Small Roads (2011),(no genres listed)
105526,117718.0,5.0,1381669000.0,Year Zero: The Silent Death of Cambodia (1979),Documentary
105529,117718.0,5.0,1381670000.0,Stealing a Nation (2004),Documentary
105841,68026.0,5.0,1382644000.0,B-Side (2013),Comedy|Romance
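
As a small illustration of how `join` with the *on* parameter matches rows, the sketch below uses two hypothetical Dataframes (names and values invented for this example). The values of the `movieId` column on the left are looked up in the index on the right.

```python
import pandas as pd

# Hypothetical data for illustration only
toy_avg = pd.DataFrame({"movieId": [2, 1], "rating": [4.5, 3.0]})
toy_movies = pd.DataFrame(
    {"title": ["Movie A", "Movie B"]},
    index=pd.Index([1, 2], name="movieId"),
)

# Values in toy_avg["movieId"] are looked up in the index of toy_movies;
# the row order of the left Dataframe is preserved
toy_joined = toy_avg.join(toy_movies, on="movieId")
print(toy_joined)
```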


The movies we recommend are not very famous. Why is this happening? Let's check how many people rated each of these movies. We use the *groupby* method again, and then the [agg](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.agg.html) method of the Groupby object to specify different aggregations: for *userId*, we count the number of users in the group using the *'size'* aggregation, while for the rating we want the average value, so we specify *'mean'*. The rest of the operations are the same as before:

In [16]:
average_ratings_df = ratings_df.groupby(['movieId']).agg(count=('userId', 'size'), rating=('rating', 'mean')).reset_index()
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False)
joined_df = sorted_avg_ratings_df.join(movies_df, on='movieId')
display(joined_df[['title', 'count', 'rating']].head(10))

Unnamed: 0,title,count,rating
19152,"Barchester Chronicles, The (1982)",1,5.0
21842,Only Daughter (2013),1,5.0
17703,Boys (Drenge) (1977),1,5.0
21656,Linotype: The Film (2012),1,5.0
21658,Rocaterrania (2009),1,5.0
17556,"Summer Wishes, Winter Dreams (1973)",1,5.0
25936,Small Roads (2011),1,5.0
21762,Year Zero: The Silent Death of Cambodia (1979),1,5.0
21763,Stealing a Nation (2004),1,5.0
21840,B-Side (2013),1,5.0
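
The named aggregation used above can be tried on a toy example. This is a minimal sketch with made-up ratings: it shows how a movie with a single 5.0 rating ends up ahead of a movie with three ratings when sorting by the mean alone.

```python
import pandas as pd

# Hypothetical ratings, invented for illustration only
toy_ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 1],
    "movieId": [10, 10, 10, 20],
    "rating":  [4.0, 3.0, 5.0, 5.0],
})

# Each keyword defines an output column as (input column, aggregation)
toy_stats = toy_ratings.groupby("movieId").agg(
    count=("userId", "size"),
    rating=("rating", "mean"),
)
print(toy_stats.sort_values(by="rating", ascending=False))
```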


We can see that all the movies we selected have been rated by only one user. Let's try to recommend only movies that have been rated by more than 20 users. Note that *movieId* is the index of the Dataframe objects obtained using the *mean* and *count* methods. We will use the [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) attribute of the Dataframe object, which gives access to a group of rows and columns by label(s) or by a boolean array. In particular, we build a boolean Series: the expression `rating_counts_df['rating'] > minimum_ratings` evaluates the condition for each row of the *rating* column, which after *count* holds the number of ratings of each movie. Hence, *loc* only returns the rows for which the condition is true.

In [17]:
minimum_ratings = 20
average_ratings_df = ratings_df.groupby(['movieId']).mean() #compute the average ratings for each movie
rating_counts_df = ratings_df.groupby(['movieId']).count() #compute the number of evaluations for each movie
average_ratings_df = average_ratings_df.loc[rating_counts_df['rating'] > minimum_ratings] #select the movies having the required number of evaluations
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False) #sort the movies according to the average ratings
joined_df = sorted_avg_ratings_df.join(movies_df['title'], on='movieId') #associate the title
display(joined_df[['title','rating']].head(10))

Unnamed: 0_level_0,title,rating
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
318,"Shawshank Redemption, The (1994)",4.44699
858,"Godfather, The (1972)",4.364732
50,"Usual Suspects, The (1995)",4.334372
527,Schindler's List (1993),4.310175
101850,Death on the Staircase (Soupçons) (2004),4.285714
1221,"Godfather: Part II, The (1974)",4.275641
2019,Seven Samurai (Shichinin no samurai) (1954),4.27418
904,Rear Window (1954),4.271334
7502,Band of Brothers (2001),4.263182
912,Casablanca (1942),4.258327


We can see now that the recommendation contains more famous movies.
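
The boolean filtering step can be sketched on a few made-up rows. Note the difference between `>` (strictly more than the threshold, as in the cell above) and `>=` (the threshold included); which one is appropriate is a design choice.

```python
import pandas as pd

# Hypothetical per-movie statistics (made-up values)
toy_stats = pd.DataFrame(
    {"count": [1, 20, 35], "rating": [5.0, 4.2, 3.9]},
    index=pd.Index([10, 20, 30], name="movieId"),
)
minimum_ratings = 20

mask = toy_stats["count"] > minimum_ratings             # strictly more than 20
mask_inclusive = toy_stats["count"] >= minimum_ratings  # 20 or more

print(toy_stats.loc[mask])
print(toy_stats.loc[mask_inclusive])
```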

#### 3. Towards personalization: filtering by genre

Now, let us take a first step towards a personalized recommender system. One possible solution is to suggest the movies with the highest ratings for a specific genre. We again select only movies rated by more than 20 users, using the same methods as in the previous example. To select the movies by genre, we need to check whether the value in the *genres* column contains the specified genre. We can write the condition as `average_ratings_df['genres'].str.contains(genre)`. 

In [18]:
genre = 'Action'
minimum_ratings = 20

average_ratings_df = ratings_df.groupby(['movieId']).mean() #compute the average ratings for each movie
rating_counts_df = ratings_df.groupby(['movieId']).count() #compute the number of evaluations for each movie
average_ratings_df = average_ratings_df.loc[rating_counts_df['rating'] > minimum_ratings] #select the movies having the required number of evaluations
average_ratings_df = average_ratings_df.join(movies_df['genres'], on='movieId')
average_ratings_df = average_ratings_df.loc[average_ratings_df['genres'].str.contains(genre)]
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False) #sort the movies according to the average ratings
joined_df = sorted_avg_ratings_df.join(movies_df['title'], on='movieId') #associate the title
display(joined_df[['title','rating','genres']].head(10))

Unnamed: 0_level_0,title,rating,genres
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2019,Seven Samurai (Shichinin no samurai) (1954),4.27418,Action|Adventure|Drama
7502,Band of Brothers (2001),4.263182,Action|Drama|War
6016,City of God (Cidade de Deus) (2002),4.23541,Action|Adventure|Crime|Drama|Thriller
908,North by Northwest (1959),4.233538,Action|Adventure|Mystery|Romance|Thriller
2959,Fight Club (1999),4.227123,Action|Crime|Drama|Thriller
58559,"Dark Knight, The (2008)",4.220129,Action|Crime|Drama|IMAX
1198,Raiders of the Lost Ark (Indiana Jones and the...,4.219009,Action|Adventure
3030,Yojimbo (1961),4.211717,Action|Adventure
260,Star Wars: Episode IV - A New Hope (1977),4.190672,Action|Adventure|Sci-Fi
1196,Star Wars: Episode V - The Empire Strikes Back...,4.188202,Action|Adventure|Sci-Fi


We can see here the genres associated with the chosen movies. A movie can belong to multiple genres, some of which a user likes and some of which he/she dislikes. 
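
A note on the genre filter: `str.contains` performs a substring search and, by default, interprets the pattern as a regular expression. The sketch below, on hypothetical genre strings, shows two possible variations: a literal match with `regex=False`, and an exact match obtained by splitting the string on `|`. Both are offered as alternatives under these assumptions, not as what the cell above does.

```python
import pandas as pd

# Hypothetical genre strings in the same "A|B|C" format as movies.csv
toy_genres = pd.Series(
    ["Action|Sci-Fi", "Comedy|Romance", "(no genres listed)"],
    index=pd.Index([1, 2, 3], name="movieId"),
)

# Literal substring match: regex=False disables regular-expression parsing
is_scifi = toy_genres.str.contains("Sci-Fi", regex=False)

# Exact genre match: split on "|" and test membership in the resulting list
has_action = toy_genres.str.split("|").apply(lambda genres: "Action" in genres)

print(is_scifi)
print(has_action)
```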

#### 4. Content Based Recommender System

Let us implement one of the strategies seen during the lectures to provide more personalized recommendations. We start with a content-based (CB) approach. In the next steps, we will implement a CB recommender for a specific user and generate a recommendation list from the movies the user did not rate. We will go through the following steps:
- Selection of a random user
- Retrieving the information of the rated and unrated movies for the selected user
- Compute a feature representation for the movie's contents (TF-IDF)
- Train a regressor, compute the predicted ratings for unrated movies, and recommend the movies with higher predicted ratings

##### 4.1 Selection of a random user


For simplicity, we select a random user from the *ratings_df* Dataframe having more than 200 ratings. First, we select all the users with more than 200 ratings, and then we use the [sample](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html) method to extract a random row. Finally, we extract the *userId* column and read its content with the [iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) indexer.

In [19]:
users_ratings = ratings_df.groupby(['userId']).count()

selected = users_ratings['rating'] > 200
selected_users = users_ratings.loc[selected]
random_selected = selected_users.sample() # sample() returns a random row as a one-row dataframe; pass a number to select more than one row
select_column_df = random_selected.reset_index()['userId'] # reset_index() creates a new index and userId becomes a column, which we can then select by name
selected_user = select_column_df.iloc[0] # iloc selects by position; our result has a single row, so we read position 0
print("Selected user: " + str(selected_user))

Selected user: 84873
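
Because `sample()` is random, each run of the notebook selects a different user. If reproducible results are preferred, one option is to pass a `random_state`. The sketch below uses hypothetical per-user counts, and the seed value 42 is an arbitrary choice; it also reads the user id directly from the index instead of resetting it.

```python
import pandas as pd

# Hypothetical number of ratings per user (made-up values)
toy_counts = pd.DataFrame(
    {"rating": [250, 12, 310, 205]},
    index=pd.Index([1, 2, 3, 4], name="userId"),
)

eligible = toy_counts.loc[toy_counts["rating"] > 200]

# random_state fixes the seed, so the same row is drawn on every run
picked_user = eligible.sample(n=1, random_state=42).index[0]
print("Selected user: " + str(picked_user))
```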


##### 4.2 Retrieving the information of the rated and unrated movies for the selected user

Now, we need to select the user's ratings, which we will use to train our recommender. We print the number of ratings that we have for the selected user, and the extracted dataset, ordered by movie id.

In [20]:
selected_user_ratings = ratings_df.loc[ratings_df['userId'] == selected_user]
selected_user_ratings = selected_user_ratings.sort_values(by='movieId', ascending=True)
print("Rated movies: " + str(selected_user_ratings.shape[0]))
display(selected_user_ratings.head(10))

Rated movies: 422


Unnamed: 0,userId,movieId,rating,timestamp
12288349,84873,1,3.5,1128881437
12288350,84873,5,2.0,1128883050
12288351,84873,17,4.5,1128880206
12288352,84873,25,4.0,1128880235
12288353,84873,29,4.5,1120098065
12288354,84873,31,3.0,1128877916
12288355,84873,32,4.0,1112509185
12288356,84873,34,4.5,1128878812
12288357,84873,45,3.5,1128881991
12288358,84873,46,2.0,1128880577


Now, we need to select the content of the rated movies, from which we will compute the features that our model will use. We do it by selecting the rows of the *movies_df* Dataframe whose *movieId* appears in the *movieId* column of the *selected_user_ratings* Dataframe, which only contains the ratings of the selected user. One way to do this is the [isin](https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html) method. However, since *movieId* is the index of the *movies_df* Dataframe, we can directly pass the list of *movieId* values we want to select to *loc*, and it will return the rows whose index is in that list. Finally, we only keep the columns *title* and *genres*.

In [23]:
rated_movies_df = movies_df.loc[list(selected_user_ratings['movieId'])]
rated_movies_df = rated_movies_df[['title', 'genres']]
print("Rated movies: " + str(rated_movies_df.shape[0]))
display(rated_movies_df.head(10))

Rated movies: 422


Unnamed: 0_level_0,title,genres
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
5,Father of the Bride Part II (1995),Comedy
17,Sense and Sensibility (1995),Drama|Romance
25,Leaving Las Vegas (1995),Drama|Romance
29,"City of Lost Children, The (Cité des enfants p...",Adventure|Drama|Fantasy|Mystery|Sci-Fi
31,Dangerous Minds (1995),Drama
32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller
34,Babe (1995),Children|Drama
45,To Die For (1995),Comedy|Drama|Thriller
46,How to Make an American Quilt (1995),Drama|Romance


We use a similar approach to select the unrated movies. We compute the set of indexes of the unrated movies using the set subtraction operator: we access the index of the *movies_df* and the *rated_movies_df* Dataframes, create two sets containing these indexes, compute the difference set *diff*, and select the movies whose index is in the *diff* set.

In [24]:
diff = set(movies_df.index) - set(rated_movies_df.index)
unrated_movies_df = movies_df.loc[list(diff)] # convert to a list: recent pandas versions reject a set as an indexer
unrated_movies_df = unrated_movies_df[['title', 'genres']]
print("Unrated movies: " + str(unrated_movies_df.shape[0]))
display(unrated_movies_df.head(10))

Unrated movies: 26856


Unnamed: 0_level_0,title,genres
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
131072,Jesus liebt mich (2012),Comedy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller
11,"American President, The (1995)",Comedy|Drama|Romance
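
The same rated/unrated split can also be written with `isin`. The following is a minimal sketch on a hypothetical toy Dataframe; unlike the set difference, the boolean mask keeps the rows in their original order.

```python
import pandas as pd

# Hypothetical movies indexed by movieId (made-up values)
toy_movies = pd.DataFrame(
    {"title": ["A", "B", "C", "D"]},
    index=pd.Index([1, 2, 3, 4], name="movieId"),
)
rated_ids = [2, 4]  # ids the hypothetical user has rated

rated_mask = toy_movies.index.isin(rated_ids)  # boolean array, one entry per row
toy_rated = toy_movies.loc[rated_mask]
toy_unrated = toy_movies.loc[~rated_mask]      # ~ negates the mask

print(toy_rated)
print(toy_unrated)
```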


Finally, we can add the rating so that we only need to work with a single Dataframe. We do it using the *join* method. We set *movieId* as the index of the *selected_user_ratings* Dataframe and select its *rating* column, so that the join can match on the index.

In [25]:
rated_movies_df = rated_movies_df.join(selected_user_ratings.set_index('movieId')['rating'], on='movieId')
print("Rated movies: " + str(rated_movies_df.shape[0]))
display(rated_movies_df.head(10))

Rated movies: 422


Unnamed: 0_level_0,title,genres,rating
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.5
5,Father of the Bride Part II (1995),Comedy,2.0
17,Sense and Sensibility (1995),Drama|Romance,4.5
25,Leaving Las Vegas (1995),Drama|Romance,4.0
29,"City of Lost Children, The (Cité des enfants p...",Adventure|Drama|Fantasy|Mystery|Sci-Fi,4.5
31,Dangerous Minds (1995),Drama,3.0
32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller,4.0
34,Babe (1995),Children|Drama,4.5
45,To Die For (1995),Comedy|Drama|Thriller,3.5
46,How to Make an American Quilt (1995),Drama|Romance,2.0


Now, we have all the data we need to train a content-based recommender system for our user.

##### 4.3 Compute a feature representation for the movie's contents (TF-IDF)

We will use the well-known Python library [sklearn](https://scikit-learn.org/stable/) to train our recommender. First, we need to extract features from our content (here, the genres of the movies). Remember that we need a vector representing each document in order to compare them. We use the TF-IDF model, hence we will use the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class, which allows us to easily compute TF-IDF vectors for our movies. We need to instantiate a *TfidfVectorizer* object and then call its [fit_transform](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit_transform) method. The method fits (trains) the feature model and then transforms our data, returning the TF-IDF matrix representation. We use the [shape](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.shape.html#numpy.ndarray.shape) attribute of the matrix to check the dimensions of our feature matrix.

In [26]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(rated_movies_df['genres'])
print(X.shape)

(422, 19)
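
With its default settings, *TfidfVectorizer* splits text on punctuation, so a genre such as `Sci-Fi` becomes two separate tokens. The sketch below illustrates this on a hypothetical list of genre strings and shows one possible alternative: a custom `token_pattern` that keeps everything between `|` separators as a single token. The toy strings and the alternative pattern are assumptions for illustration, not part of the notebook above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical genre strings in the same "A|B" format as the genres column
toy_genres = ["Action|Sci-Fi", "Comedy|Romance", "Action|Comedy"]

# Default tokenizer: words of two or more characters, split on punctuation
default_vec = TfidfVectorizer()
default_vec.fit(toy_genres)
print(sorted(default_vec.vocabulary_))

# Assumption: treat everything between "|" separators as one token
genre_vec = TfidfVectorizer(token_pattern=r"[^|]+")
X_toy = genre_vec.fit_transform(toy_genres)
print(sorted(genre_vec.vocabulary_))
print(X_toy.shape)
```

Either tokenization yields a valid feature matrix; the second one simply gives one feature per genre.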


##### 4.4 Train a regressor, compute the predicted ratings for unrated movies, and recommend the movies with higher predicted ratings

Now, we have our features. In this example, we will train a regressor that estimates, for each unrated movie, the rating that the user would assign to it. Of course, we could use a different approach and treat the problem as a classification problem. In that case, we would need to transform our ratings to obtain labels. We will see that later today. For now, for simplicity, we will work with the ratings. 

For the regressor, we will use the [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) class of the sklearn library. We need to create an instance of *KNeighborsRegressor*. Remember that the KNN algorithm uses the similarity between items to compute the predictions. In the case of regression, the rating assigned to a new item is computed as the mean of the ratings of its nearest neighbors.

In the constructor, we specify the number of neighbors to use. We set it to 5. Finally, we can train our model using the [fit](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor.fit) method, passing as arguments the feature matrix *X* and the corresponding ratings vector *y*.

In [27]:
from sklearn.neighbors import KNeighborsRegressor

y = rated_movies_df['rating']
neigh = KNeighborsRegressor(n_neighbors=5)
neigh.fit(X, y)

KNeighborsRegressor()
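
To check that the prediction really is the mean of the neighbors' ratings, here is a minimal sketch on made-up one-dimensional data. It assumes the default settings of *KNeighborsRegressor* (uniform weights, Euclidean distance) and uses 2 neighbors to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical one-feature items with made-up ratings
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([1.0, 2.0, 3.0, 5.0])

knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X_toy, y_toy)

query = np.array([[0.4]])
pred = knn.predict(query)[0]

# The two nearest points to 0.4 are 0.0 and 1.0, with ratings 1.0 and 2.0
_, nbr_idx = knn.kneighbors(query)
manual = y_toy[nbr_idx[0]].mean()

print(pred, manual)
```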

We can now use the trained model to compute our recommendations. First, we need to extract the features from the genres of our unrated movies. We can do it using the previously trained *TfidfVectorizer* *vectorizer*, passing the *genres* column of the *unrated_movies_df* Dataframe to its [transform](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.transform) method. 

Then, we use the trained *KNeighborsRegressor* *neigh*, passing the feature matrix of the unrated movies *X_unrated* to its [predict](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor.predict) method, which returns the predicted ratings *y_unrated*.

In [28]:
X_unrated = vectorizer.transform(unrated_movies_df['genres'])
print(X_unrated.shape)
y_unrated = neigh.predict(X_unrated)
print(y_unrated.shape)

(26856, 19)
(26856,)


Finally, we can combine the predicted ratings with the information in the unrated_movies Dataframe, order the movies by predicted ratings, and determine the best 10 movies to recommend to the user.

In [29]:
unrated_movies_df['predicted_ratings_KNN'] = y_unrated
unrated_movies_df_KNN_ordering = unrated_movies_df.sort_values(by='predicted_ratings_KNN', ascending=False)
display(unrated_movies_df_KNN_ordering.head(10))

Unnamed: 0_level_0,title,genres,predicted_ratings_KNN
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
5736,Faces of Death 3 (1985),Documentary|Horror,4.4
107266,I Heart Monster Movies (2012),Documentary|Horror,4.4
49359,Celine and Julie Go Boating (Celine et Julie v...,Drama|Fantasy|Mystery,4.4
120871,A Night at the Movies: The Horrors of Stephen ...,Documentary|Horror,4.4
84442,Dark Forces (Harlequin) (1980),Drama|Fantasy|Mystery,4.4
37955,Innocence (2004),Drama|Fantasy|Mystery,4.4
50705,Neverwas (2005),Drama|Fantasy|Mystery,4.4
96653,"Vampir (Cuadecuc, vampir) (1971)",Documentary|Horror,4.4
5734,Faces of Death 2 (1981),Documentary|Horror,4.4
5735,Faces of Death (1978),Documentary|Horror,4.4


#### EXERCISE

Following the example above, implement a second content-based recommender system using the [SVR](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) regressor of the Sklearn library (use a *linear* kernel and the default hyperparameters).

##### Solution

In [30]:
from sklearn.svm import SVR

y = rated_movies_df['rating']
svr = SVR(kernel="linear")
svr.fit(X, y)

SVR(kernel='linear')

In [31]:
X_unrated = vectorizer.transform(unrated_movies_df['genres'])
print(X_unrated.shape)
y_unrated = svr.predict(X_unrated)
print(y_unrated.shape)

(26856, 19)
(26856,)


Finally, we can combine the predicted ratings with the information in the unrated_movies Dataframe, order the movies by predicted ratings, and determine the best 10 movies to recommend to the user.

In [32]:
unrated_movies_df['predicted_ratings_SVR'] = y_unrated
unrated_movies_df_SVR_ordering = unrated_movies_df.sort_values(by='predicted_ratings_SVR', ascending=False)
display(unrated_movies_df_SVR_ordering.head(10))

Unnamed: 0_level_0,title,genres,predicted_ratings_KNN,predicted_ratings_SVR
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
80478,"Man Vanishes, A (Ningen Johatsu) (1967)",Documentary|Drama|Mystery,3.8,4.380545
61236,Waltz with Bashir (Vals im Bashir) (2008),Animation|Documentary|Drama|War,3.9,4.378523
98193,"Invisible War, The (2012)",Crime|Documentary|Drama|War,4.0,4.354369
53737,"Night of the Generals, The (1967)",Crime|Drama|Mystery|Thriller|War,3.5,4.352557
27397,Joint Security Area (Gongdong gyeongbi guyeok ...,Crime|Drama|Mystery|Thriller|War,3.5,4.352557
125523,Time Limit (1957),Drama|Mystery|War,4.3,4.340669
86781,Incendies (2010),Drama|Mystery|War,4.3,4.340669
111100,Three Faces East (1930),Drama|Mystery|War,4.3,4.340669
127011,Under the Flag of the Rising Sun (1972),Drama|Mystery|War,4.3,4.340669
61991,Miracle at St. Anna (2008),Drama|Mystery|War,4.3,4.340669


#### 5. Collaborative Filtering Recommender System

As we highlighted in the lectures, content-based approaches suffer from some limitations (do you remember some?). Collaborative filtering approaches use the "wisdom of the crowd" to recommend items, following the assumption that users with similar tastes in the past will probably have similar tastes in the future. Hence, without the need for external information, and only by looking at the interactions of similar users (or movies), we can predict the rating of an unseen movie for a user.
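
To make the idea concrete, the sketch below predicts one missing rating with a simplified user-user scheme: ratings are mean-centered per user, users are compared with cosine similarity, and the prediction is the target user's mean plus a similarity-weighted average of the neighbors' centered ratings. The ratings matrix, the function name `predict_user_user`, and the choice of `k` are all hypothetical, and this is an illustration of the general technique rather than the implementation of any particular library.

```python
import numpy as np

# Hypothetical ratings matrix: rows are users, columns are items, NaN = unrated
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, 5.0, 4.0,    1.0],
    [1.0, 2.0, 1.0,    5.0],
])

def predict_user_user(R, user, item, k=2):
    """Predict R[user, item] from the k most similar users who rated the item."""
    means = np.nanmean(R, axis=1)  # per-user mean rating
    centered = np.where(np.isnan(R), 0.0, R - means[:, None])
    norms = np.linalg.norm(centered, axis=1)
    # Cosine similarity between the target user and every user
    sims = centered @ centered[user] / (norms * norms[user] + 1e-12)
    # Candidates: other users with positive similarity who rated the item
    candidates = [v for v in range(R.shape[0])
                  if v != user and sims[v] > 0 and not np.isnan(R[v, item])]
    neighbors = sorted(candidates, key=lambda v: sims[v], reverse=True)[:k]
    if not neighbors:
        return means[user]  # fall back to the user's own mean
    weights = sims[neighbors]
    return means[user] + weights @ centered[neighbors, item] / weights.sum()

prediction = predict_user_user(R, user=0, item=2)
print(prediction)
```

In this toy matrix, user 0 agrees with user 1 and disagrees with user 2, so only user 1 contributes to the prediction for the unrated item.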

In this case too, we will not implement the strategy ourselves, but we will use a library. In particular, we are going to use the [Lenskit](https://lkpy.readthedocs.io/en/stable/) library. Note that this library requires Python 3.6. We will focus on the user-user recommender, while, as an exercise, you will have to use the item-item strategy.

##### 5.1 Collaborative user-user recommender with Lenskit

We use the user-based collaborative filtering algorithm [UserUser](https://lkpy.readthedocs.io/en/stable/knn.html#module-lenskit.algorithms.user_knn), which relies on the nearest neighbors algorithm. The constructor takes as inputs the maximum and minimum number of neighbors to consider (we set them to 15 and 3). Then we use the [adapt](https://lkpy.readthedocs.io/en/stable/interfaces.html?highlight=Recommender#lenskit.Recommender.adapt) method of the [Recommender](https://lkpy.readthedocs.io/en/stable/interfaces.html?highlight=Recommender#lenskit.Recommender) class to obtain a *Recommender* object. Finally, we use the [fit](https://lkpy.readthedocs.io/en/stable/interfaces.html?highlight=Recommender#lenskit.Algorithm.fit) method to train the model. 

In [35]:
ratings_df

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580
...,...,...,...,...
20000258,138493,68954,4.5,1258126920
20000259,138493,69526,4.5,1259865108
20000260,138493,69644,3.0,1260209457
20000261,138493,70286,5.0,1258126944


In [36]:
# Lenskit expects the ratings Dataframe to have columns named user, item and rating
ratings_df.rename(columns={"userId": "user", "movieId":"item"}, inplace=True)

In [38]:
from lenskit.algorithms import Recommender
from lenskit.algorithms.user_knn import UserUser

# We use the collaborative user algorithm UserUser, that use the nearest neighbors 
num_recs = 10  # Number of recommendations to generate
user_user = UserUser(15, min_nbrs=3)  # Minimum (3) and maximum (15) number of neighbors to consider
recsys = Recommender.adapt(user_user)
recsys.fit(ratings_df)

<lenskit.algorithms.ranking.TopN at 0x27a5495c2e0>

Now, we can provide the recommendations for a specific user. Let's generate 10 recommendations for the user selected for the previous example. We use the [recommend](https://lkpy.readthedocs.io/en/stable/interfaces.html?highlight=Recommender#lenskit.Recommender.recommend) method of the *Recommender* class to generate the recommendations. Then, we add the title information and display the recommendations.

In [41]:
selected_movies_useruser

Unnamed: 0,item,score
0,80478,5.141823
1,3226,5.131763
2,912,5.128495
3,3435,4.957709
4,116155,4.85084
5,858,4.83466
6,89083,4.823506
7,5618,4.823481
8,2019,4.806858
9,5147,4.774738


In [42]:
movies_df["title"]

movieId
1                           Toy Story (1995)
2                             Jumanji (1995)
3                    Grumpier Old Men (1995)
4                   Waiting to Exhale (1995)
5         Father of the Bride Part II (1995)
                         ...                
131254          Kein Bund für's Leben (2007)
131256         Feuer, Eis & Dosenbier (2002)
131258                    The Pirates (2014)
131260                   Rentun Ruusu (2001)
131262                      Innocence (2014)
Name: title, Length: 27278, dtype: object

In [43]:
selected_movies_useruser = recsys.recommend(selected_user, 10) # generate 10 recommendations for the selected user
selected_movies_useruser = pd.merge(selected_movies_useruser, movies_df["title"], left_on="item", right_on="movieId", how="left")
#selected_movies_useruser = selected_movies_useruser.join(movies_df['title'], on='movieId')
display(selected_movies_useruser)

Unnamed: 0,item,score,title
0,80478,5.141823,"Man Vanishes, A (Ningen Johatsu) (1967)"
1,3226,5.131763,Hellhounds on My Trail (1999)
2,912,5.128495,Casablanca (1942)
3,3435,4.957709,Double Indemnity (1944)
4,116155,4.85084,Still Life (2013)
5,858,4.83466,"Godfather, The (1972)"
6,89083,4.823506,"Great White Silence, The (1924)"
7,5618,4.823481,Spirited Away (Sen to Chihiro no kamikakushi) ...
8,2019,4.806858,Seven Samurai (Shichinin no samurai) (1954)
9,5147,4.774738,Wild Strawberries (Smultronstället) (1957)


We can compare these recommendations with the ones previously computed with the content-based approach.

In [44]:
display(unrated_movies_df_KNN_ordering.head(10))

movieId,title,genres,predicted_ratings_KNN
5736,Faces of Death 3 (1985),Documentary|Horror,4.4
107266,I Heart Monster Movies (2012),Documentary|Horror,4.4
49359,Celine and Julie Go Boating (Celine et Julie v...,Drama|Fantasy|Mystery,4.4
120871,A Night at the Movies: The Horrors of Stephen ...,Documentary|Horror,4.4
84442,Dark Forces (Harlequin) (1980),Drama|Fantasy|Mystery,4.4
37955,Innocence (2004),Drama|Fantasy|Mystery,4.4
50705,Neverwas (2005),Drama|Fantasy|Mystery,4.4
96653,"Vampir (Cuadecuc, vampir) (1971)",Documentary|Horror,4.4
5734,Faces of Death 2 (1981),Documentary|Horror,4.4
5735,Faces of Death (1978),Documentary|Horror,4.4
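
A simple way to quantify the comparison is to measure how many movies the two top-10 lists share. The sketch below copies the movie ids from the two outputs displayed above and computes their intersection and Jaccard overlap; the variable names and the `jaccard` helper are introduced here only for illustration.

```python
# Movie ids copied from the two top-10 lists displayed above
cf_items = [80478, 3226, 912, 3435, 116155, 858, 89083, 5618, 2019, 5147]
cb_items = [5736, 107266, 49359, 120871, 84442, 37955, 50705, 96653, 5734, 5735]

def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two lists of ids."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

common = set(cf_items) & set(cb_items)
overlap = jaccard(cf_items, cb_items)
print("Movies in common:", common)
print("Jaccard overlap:", overlap)
```

For this user the two approaches do not share any recommended movie: the content-based list is driven by plot similarity, while the collaborative one depends only on the ratings of similar users.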


#### EXERCISE

Following the example above, implement a second collaborative filtering recommender system using the [ItemItem](https://lkpy.readthedocs.io/en/stable/knn.html#lenskit.algorithms.item_knn.ItemItem) item-based collaborative filtering algorithm provided by the LensKit library.

##### Solution

In [None]:
from lenskit.algorithms.item_knn import ItemItem

# We use the item-based collaborative filtering algorithm ItemItem, which relies on nearest neighbors
num_recs = 10  # Number of recommendations to generate
item_item = ItemItem(15, min_nbrs=3)  # Minimum (3) and maximum (15) number of neighbors to consider
recsys = Recommender.adapt(item_item)
recsys.fit(ratings_df)

selected_movies_itemitem = recsys.recommend(selected_user, 10)  # generate 10 recommendations for the selected user
selected_movies_itemitem = selected_movies_itemitem.join(movies_df['title'], on='item')
display(selected_movies_itemitem)


#### 6. Evaluation of Recommender Systems

We have seen different approaches, but how can we evaluate the goodness of the proposed recommendations? We are now going to see some examples, focusing on the two personalized approaches (content based and collaborative filtering).

The content based recommender system is specific for each user; hence, we would need to evaluate a model for each user, and provide an average evaluation for the approach. On the contrary, the collaborative filtering recommender trains a single model for all the users. Hence, the evaluation procedure is a bit different, and we will see an example for each. 

In the examples, we will use the holdout evaluation strategy. As an exercise, you will have to implement a K-Fold evaluation.

Furthermore, in the previous examples we estimated the user's ratings for unseen movies using a regressor. Now we will treat the problem as a classification problem: we consider as relevant for a user all the movies that the user rated higher than 3.

#### 6.1 Evaluation of a Content-based Recommender System

We implemented our Content-based Recommender System using the sklearn library. The same library offers a wide range of tools to evaluate regressors and classifiers, which we will use (and describe). First, let's define a support function which allows us to select the information we need for a particular user.

The function is called *get_user_rated_movies_plots* and takes as input the user and the two Dataframes *ratings_df* and *movies_df*. It first selects the ratings of the considered user. Then, it selects the movie information for the movies rated by the user. Finally, it joins the information in a single Dataframe and adds a column containing the relevance of each movie, computed from the corresponding rating: if the user gave a rating higher than 3, the relevance is set to 1; otherwise, it is set to 0. 

To do this, we use the apply method (seen previously). Since the operation to perform is simple, we don't define a named function for it. Instead, we define a lambda function (you can see it as a small anonymous function). The lambda function is applied to each element of the object on which the apply method is called. In this case, we call it on the *'rating'* column of the *rated_movies_df* Dataframe. Hence, x is the rating of a specific movie.

The lambda function then computes the relevance value through a [ternary operator](https://book.pythontips.com/en/latest/ternary_operators.html).
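
As a quick illustration of the lambda and the ternary operator outside pandas, the sketch below applies the same expression to a plain list of made-up ratings (the values are hypothetical and only serve as an example).

```python
ratings = [4.5, 3.0, 2.0, 3.5, 5.0]  # made-up ratings, for illustration only

# Same lambda used with apply: 1 if the rating is higher than 3, 0 otherwise
relevance = list(map(lambda x: 1 if x > 3 else 0, ratings))
print(relevance)
```

Note that a rating of exactly 3 is not considered relevant, since the comparison is strict.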

In [None]:
def get_user_rated_movies_plots(user, ratings_df, movies_df):
    """
    Return title, plot, rating and binary relevance of the movies rated by the user.
    """
    selected_user_ratings = ratings_df.loc[ratings_df['user'] == user] # select the ratings of the user
    rated_movies_df = movies_df.loc[list(selected_user_ratings['item'])] # select the movie information for the movies rated by the user
    rated_movies_df = rated_movies_df[['title', 'plot']] # select only the information we need
    selected_user_ratings = selected_user_ratings.set_index('item') # index the ratings by movie id for the next join
    rated_movies_df = rated_movies_df.join(selected_user_ratings['rating']) # join the two dataframes on their indices (the movie ids)
    rated_movies_df['relevant'] = rated_movies_df['rating'].apply(lambda x: 1 if x>3 else 0) # compute the relevance values for the user
    return rated_movies_df

We can test the function by retrieving the information for the user selected for the previous evaluations.

In [None]:
user_plots_ratings_df = get_user_rated_movies_plots(selected_user, ratings_df, movies_df)
display(user_plots_ratings_df)

Now, since we have a different recommender for each user, we need a function that evaluates the recommender for a specific user. We will use the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class from the sklearn library to compute the TF-IDF representations of the movies' plots, as we did before. But, since we are addressing the problem as a classification problem, we use the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) instead of a regressor. 

Unlike in the previous example, we want to evaluate the performance of our classifier; hence, we need to test it on a set of movies for which we know the correct relevance for this user. In other words, we need to split the information we have about the user ratings in two sets:
1. Training set: we will use this data to train our classifier.
2. Test set: we will apply our classifier to this data and compare the predictions with the actual relevance values.

To split the data we will use the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) method, provided by the sklearn library.
Finally, once we have the classifier predictions, we use the [precision_recall_fscore_support](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html) method, provided by the sklearn library, to compute the information retrieval metrics *precision*, *recall* and *fscore*. Our function returns these values for the considered user.
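
To recall what the three metrics measure, the sketch below computes them by hand for binary relevance labels, using the standard library only. The function name `precision_recall_fscore` and the example labels are made up for this illustration; undefined metrics are reported as 0, which is one possible convention.

```python
def precision_recall_fscore(y_true, y_pred):
    """Precision, recall and F1 score of the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # relevant and predicted relevant
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not relevant but predicted relevant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # relevant but missed
    precision = tp / (tp + fp) if tp + fp else 0.0  # undefined when nothing is predicted relevant
    recall = tp / (tp + fn) if tp + fn else 0.0      # undefined when nothing is relevant
    if precision + recall:
        fscore = 2 * precision * recall / (precision + recall)
    else:
        fscore = 0.0
    return precision, recall, fscore

y_true_example = [1, 1, 1, 0, 0, 1]  # actual relevance (made-up values)
y_pred_example = [1, 0, 1, 1, 0, 1]  # predicted relevance (made-up values)

precision, recall, fscore = precision_recall_fscore(y_true_example, y_pred_example)
print(precision, recall, fscore)
```

In this example there are 3 true positives, 1 false positive and 1 false negative, so precision, recall and fscore are all equal to 0.75.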

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

def train_and_test_recsys_for_user(user):
    user_plots_ratings_df = get_user_rated_movies_plots(user, ratings_df, movies_df) # retrieve user info with the previously defined method
    
    X_plots = user_plots_ratings_df['plot'] # select the plot column, from which we will compute our TF-IDF features
    y = user_plots_ratings_df['relevant'] # select the relevant column, which will be used as label
    
    X_plots_train, X_plots_test, y_train, y_test = train_test_split(X_plots, y, test_size=0.2) # randomly split the data into train and test sets; 20% of the data goes into the test set

    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(X_plots_train) # fit our TF-IDF model on the training plots and compute their features
    
    neigh = KNeighborsClassifier(n_neighbors=5)
    neigh.fit(X_train, y_train) # train our classifier
    
    X_test = vectorizer.transform(X_plots_test) # compute the features of the test plots with the fitted TF-IDF model
    y_pred = neigh.predict(X_test) # compute the predictions of the classifier
    
    return precision_recall_fscore_support(y_test, y_pred, average="binary", zero_division=0) # compare the real relevance values with the predicted ones, and return precision, recall, and fscore

We can test our method on the selected user. 
Note that the results are affected by the random split (you can check it by executing the method several times: the results change!)

How can we obtain more reliable results?

In [None]:
train_and_test_recsys_for_user(selected_user)


However, we keep this example simple and now address the final problem: merging the results for all the users.
We will iterate over all the users and compute the precision, recall and fscore values of the recommender for each specific user. For simplicity, we will exclude the users for which we have 100 evaluations or fewer (what could be the problem for such users?).

Finally, we compute the means and standard deviations, and we print them.

In [None]:
import statistics

user_list = list(set(ratings_df['user']))

precision_list = list()
recall_list = list()
fscore_list = list()
print("evaluating recommender on " + str(len(user_list)) + " users")
for user in user_list:
    user_plots_ratings_df = get_user_rated_movies_plots(user, ratings_df, movies_df)
    
    if user_plots_ratings_df.shape[0] > 100:
        precision, recall, fscore, _ = train_and_test_recsys_for_user(user)
        if precision>0 : 
            precision_list.append(precision)
        if recall > 0:
            recall_list.append(recall)
        if fscore > 0:
            fscore_list.append(fscore)
            
# Compute average results and standard deviations
results = {
    'avg_precision' : statistics.mean(precision_list),
    'stdv_precision' : statistics.stdev(precision_list),
    'avg_recall' : statistics.mean(recall_list),
    'stdv_recall' : statistics.stdev(recall_list),
    'avg_fscore' : statistics.mean(fscore_list),
    'stdv_fscore' : statistics.stdev(fscore_list)
}

display(results)

You may see some warnings. By default, the *precision_recall_fscore_support* method gives a warning when a metric cannot be computed. This happens, for instance, when the system does not retrieve any element as relevant: in this case precision and fscore are undefined. Since we pass `zero_division=0`, such metrics are reported as 0, and the loop above leaves zero values out of the averages. In such cases we should check what the problem is for the specific user. For this tutorial, we will just ignore them.

#### EXERCISE

Extend the example above implementing a K-Fold cross-validation strategy and evaluate the previous recommender system using 10 folds for the selected user.

HINTS:
1. To generate the folds you can use the [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) class provided by the sklearn library.
2. The example on the sklearn web page works with NumPy arrays, while here we work with pandas objects. The [split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold.split) method returns positional indices; hence, you cannot use the suggested notation `X_train = X[train_index]`, and you should use the [iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) indexer instead: `X_train = X.iloc[train_index]`.

##### Solution

We can implement our solution starting from the *train_and_test_recsys_for_user* method. We will define a similar method called *K_fold_eval_recsys_for_user*, in which we implement our K-fold evaluation for a single user. The function also takes K as input.
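
The structure of such a function can be sketched with the standard library only. This is not the full solution: the helper names (`k_fold_indices`, `k_fold_eval`, `majority_accuracy`) and the labels are made up for this illustration, and the placeholder model simply predicts the most frequent training label. In the actual solution, the folds would come from sklearn's KFold and each fold would train the TF-IDF and KNeighborsClassifier pipeline, selecting rows with `iloc`.

```python
import statistics

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) lists of positions for k contiguous folds."""
    # the first n_samples % k folds get one extra element
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def majority_accuracy(y_train, y_test):
    """Placeholder model: always predict the most frequent training label."""
    majority = 1 if sum(y_train) * 2 >= len(y_train) else 0
    return sum(1 for y in y_test if y == majority) / len(y_test)

def k_fold_eval(labels, k, score_fold):
    """Score each fold separately, then return mean and standard deviation."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        y_train = [labels[i] for i in train_idx]  # positional selection, like iloc
        y_test = [labels[i] for i in test_idx]
        scores.append(score_fold(y_train, y_test))
    return statistics.mean(scores), statistics.stdev(scores)

# Made-up relevance labels for one user
labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1]
mean_acc, std_acc = k_fold_eval(labels, 10, majority_accuracy)
print(mean_acc, std_acc)
```

Every element ends up in the test set exactly once, so the averaged score depends much less on a single lucky or unlucky split than the holdout estimate does.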

#### 6.2 Evaluation of a Collaborative Filtering Recommender System

We now see an example of evaluation of a collaborative filtering approach. We will use the Lenskit library again for this.
Since we have a single model for all the users, the evaluation is simpler: we can just consider all the ratings we have and split them into a training set and a test set; then, we train the recommender on the training set and use it to predict the ratings in the test set. We will do it using the [predict](https://lkpy.readthedocs.io/en/stable/interfaces.html#lenskit.Predictor.predict) method. Finally, we compute the relevance values from the original ratings and from the predicted ratings, and we compare them to obtain precision, recall and fscore values.

Note that we will split our dataset using again the *train_test_split* method. In this case, we pass a Dataframe as argument. The method will return two dataframes, randomly splitting the rows between them.
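
One detail is worth keeping in mind when turning predicted ratings into relevance values: a neighborhood-based predictor may be unable to score some (user, movie) pairs, for instance when too few neighbors rated the movie. The sketch below assumes that such cases show up as NaN in the predictions (an assumption of this illustration, to be checked against the library documentation) and uses made-up ratings to show what the threshold does with them.

```python
import math

actual = [4.0, 2.5, 5.0, 3.0, 4.5]              # made-up true ratings
predicted = [3.8, 3.4, float("nan"), 2.9, 4.1]  # made-up predictions, one missing

def to_relevance(rating, threshold=3):
    """1 if the rating is strictly higher than the threshold, 0 otherwise."""
    return 1 if rating > threshold else 0

y_true_cf = [to_relevance(r) for r in actual]
y_pred_cf = [to_relevance(p) for p in predicted]
n_missing = sum(1 for p in predicted if math.isnan(p))

print(y_true_cf)
print(y_pred_cf)
print("missing predictions:", n_missing)
```

Any comparison with NaN is false, so a missing prediction is silently counted as "not relevant". This lowers recall when the movie was actually relevant, which is something to remember when reading the results.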

In [None]:
from lenskit.algorithms import Recommender
from lenskit.algorithms.user_knn import UserUser
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

train_df, test_df = train_test_split(ratings_df, test_size=0.2)

user_user = UserUser(15, min_nbrs=3)  # Minimum (3) and maximum (15) number of neighbors to consider
recsys = Recommender.adapt(user_user)
recsys.fit(train_df)

test_df['predicted_rating'] = recsys.predict(test_df)

test_df['relevant'] = test_df['rating'].apply(lambda x: 1 if x>3 else 0)
test_df['predicted_relevant'] = test_df['predicted_rating'].apply(lambda x: 1 if x>3 else 0)

y_test = list(test_df['relevant'])
y_pred = list(test_df['predicted_relevant'])
precision, recall, fscore, _ = precision_recall_fscore_support(y_test, y_pred, average="binary")

print("Precision:\t" + str(precision) + 
      "\nRecall:\t" + str(recall) + 
      "\nFscore:\t" + str(fscore))


We can ignore the warnings. 

Again, we have an evaluation that is strictly related to the split performed. A proper evaluation should consider a more reliable validation strategy (any ideas?). For timing reasons, we used the simplest approach. Try to implement some other approaches and check the differences (let us know if you have any problems with this).

#### EXERCISE

Extend the example above implementing a k-Fold cross-validation strategy and evaluate the previous recommender system using 10 folds.

HINTS:
1. To generate the folds you can use the [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) class provided by the sklearn library.
2. You might need to use the shuffle parameter of the KFold constructor.
3. Again, note that the *split* method returns positional indices, so you should use the *iloc* indexer to filter the Dataframe rows.
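
The second hint matters because the rows of *ratings_df* are sorted by user, as the output displayed earlier shows. The sketch below, written with the standard library only and with made-up data, shows what can happen with contiguous folds: all the ratings of a user may fall into the same test fold, so the model is trained without any information about that user. The helper names `contiguous_folds` and `users_missing_from_training` are hypothetical.

```python
import random

# Made-up ratings sorted by user: 5 users with 4 rated items each, as (user, item) pairs
rows = [(user, item) for user in range(1, 6) for item in range(4)]

def contiguous_folds(n_samples, k, shuffle=False, seed=0):
    """Split positions into k folds; assumes n_samples is divisible by k for brevity."""
    positions = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(positions)
    fold_size = n_samples // k
    return [positions[i * fold_size:(i + 1) * fold_size] for i in range(k)]

def users_missing_from_training(rows, test_positions):
    """Users that appear in the test fold but have no rating left for training."""
    test_set = set(test_positions)
    train_users = {rows[i][0] for i in range(len(rows)) if i not in test_set}
    test_users = {rows[i][0] for i in test_positions}
    return test_users - train_users

unshuffled = contiguous_folds(len(rows), 5)
missing_unshuffled = [users_missing_from_training(rows, fold) for fold in unshuffled]
print(missing_unshuffled)

shuffled = contiguous_folds(len(rows), 5, shuffle=True)
missing_shuffled = [users_missing_from_training(rows, fold) for fold in shuffled]
print(missing_shuffled)
```

Without shuffling, each fold of this toy dataset contains exactly the ratings of one user, who is therefore unknown to the model trained on the other folds. Shuffling the rows before splitting spreads the ratings of each user across the folds and makes this situation much less likely.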

##### Solution