# Individual Recommender Systems - Tutorial 1 (Lab 1)


In this tutorial, we implement some of the strategies for individual recommender systems presented in the first lectures of the course. Along the way, we introduce some Python libraries that are useful for machine learning applications. We will use the preprocessed dataset provided in the folder *preprocessed_dataset*. The dataset has been obtained from the [Movielens latest small](https://www.kaggle.com/grouplens/movielens-latest-small) dataset (providing users' ratings) and the [Wikipedia Movie Plots](https://www.kaggle.com/jrobischon/wikipedia-movie-plots) dataset (providing movies' information). Check the notebook **Dataset Preprocessing** for details on how the datasets have been preprocessed and merged.


#### 1. Dataset

The dataset contains several CSV files. We start by using the [pandas](https://pandas.pydata.org/) library to read them. Let's look at the first 10 rows of the *ratings.csv* and *movies.csv* files to see what they contain. To do so, we use the *read_csv* function, which returns a pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) object containing the information read from the CSV file. This object provides several methods to manipulate its data; here we use the [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) method, which returns the first *n* rows of the DataFrame. We use the [display](https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html#IPython.display.display) function, which prints information nicely in a Jupyter notebook.

In [47]:
preprocessed_dataset_folder = "../preprocessed_dataset"

import pandas as pd

In [48]:
ratings_df = pd.read_csv(preprocessed_dataset_folder+"/ratings.csv") 
display(ratings_df.head(10))

,user,item,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,70,3.0,964982400
4,1,101,5.0,964980868
5,1,110,4.0,964982176
6,1,151,5.0,964984041
7,1,157,5.0,964984100
8,1,163,5.0,964983650
9,1,216,5.0,964981208


In [49]:
movies_df = pd.read_csv(preprocessed_dataset_folder+"/movies.csv", index_col="item")
display(movies_df.head(10))

item,title,year,genres,plot
1,toy story,1995,Adventure|Animation|Children|Comedy|Fantasy,In a world where toys are living things who pr...
2,jumanji,1995,Adventure|Children|Fantasy,"In 1869, near Brantford, New Hampshire, two br..."
3,grumpier old men,1995,Comedy|Romance,The feud between Max (Walter Matthau) and John...
4,waiting to exhale,1995,Comedy|Drama|Romance,"""Friends are the People who let you be yoursel..."
5,father of the bride part ii,1995,Comedy,The film begins five years after the events of...
6,heat,1995,Action|Crime|Thriller,"Neil McCauley, a career criminal, hires Waingr..."
7,sabrina,1995,Comedy|Romance,Sabrina Fairchild is the young daughter of the...
8,tom and huck,1995,Adventure|Children,The movie opens with Injun Joe (Eric Schweig) ...
9,sudden death,1995,Action,Darren McCord (Jean-Claude Van Damme) is a Fre...
10,goldeneye,1995,Action|Adventure|Thriller,"In 1986, at Arkhangelsk, MI6 agents James Bond..."


The *ratings.csv* file contains the ratings provided by a user for a movie, and the corresponding timestamp. The *movies.csv* file provides the title, year, genres and plot of each movie.
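
As a small illustration of how *read_csv*, `index_col` and *head* interact, the sketch below reads a hypothetical three-row miniature of *movies.csv* from an in-memory string. The CSV text and the name `toy_movies_df` are made up for this example; the real file has more columns and rows.

```python
import io

import pandas as pd

# Hypothetical miniature of movies.csv, used only for illustration.
csv_text = """item,title,year
1,toy story,1995
2,jumanji,1995
3,grumpier old men,1995
"""

# index_col="item" turns the item column into the row index while reading.
toy_movies_df = pd.read_csv(io.StringIO(csv_text), index_col="item")

print(toy_movies_df.head(2))          # only the first two rows
print(toy_movies_df.loc[2, "title"])  # label-based lookup through the index
```

Because *item* is the index, a movie can be looked up directly by its id with *loc*, which is what the rest of the tutorial relies on.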

#### 2. Recommending highest rated movies

A first idea for providing recommendations is to use the available ratings to determine the highest rated movies. To do so, we first compute the average rating of each movie and then select the top rated ones. Let's use again the methods provided by the DataFrame object. In particular, we can use the [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) method to group the rows that share the same value in a given column. The method returns a [GroupBy](https://pandas.pydata.org/docs/reference/groupby.html) object, which provides methods to aggregate the values of each group. To compute the average ratings, we use the [mean](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.mean.html) method, which returns a DataFrame with one row per movie and the average rating in the *rating* column (the *user* and *timestamp* columns are averaged as well, but those averages are meaningless and we ignore them). Finally, we use the [sort_values](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) method to sort the movies by rating, and we print the first 10 movies.

In [54]:
average_ratings_df = ratings_df.groupby(['item']).mean()
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False)

display(sorted_avg_ratings_df.head(10))

item,user,rating,timestamp
4116,275.0,5.0,1049077000.0
2196,594.0,5.0,1108975000.0
3096,156.0,5.0,946799200.0
164367,534.0,5.0,1487278000.0
114265,598.0,5.0,1430617000.0
33138,187.0,5.0,1161955000.0
91386,89.0,5.0,1520409000.0
42556,377.0,5.0,1340343000.0
45503,432.0,5.0,1316391000.0
3939,546.0,5.0,973588700.0


Let's use the information in *movies_df* to associate a title with each movie, and then print the list of recommended movies. We will use the [join](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html) method of the DataFrame object. The join matches the *item* values against the index of *movies_df*, so *item* must be the index of that DataFrame. We already ensured this by passing `index_col="item"` to *read_csv*; on an already loaded DataFrame, the [set_index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html) method achieves the same result.

In [56]:
joined_df = sorted_avg_ratings_df.join(movies_df, on='item')
display(joined_df['title'].head(10))

item
4116                hollywood shuffle
2196                        knock off
3096                   my man godfrey
164367    the girl with all the gifts
114265                        laggies
33138                     palindromes
91386                  happy feet two
42556              7 faces of dr. lao
45503                peaceful warrior
3939        slumber party massacre ii
Name: title, dtype: object
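
To see what *join* does with the `on` argument, here is a minimal sketch on made-up data. The names `ratings_part` and `titles_part` are hypothetical. The values of the *item* column on the left are matched against the index of the DataFrame on the right, and, since *join* performs a left join by default, an item without a matching title is kept with a missing value.

```python
import pandas as pd

# Left side: "item" is a regular column (toy values).
ratings_part = pd.DataFrame({"item": [20, 10, 30], "rating": [5.0, 3.5, 4.0]})

# Right side: "item" is the index, as in movies_df.
titles_part = pd.DataFrame(
    {"title": ["heat", "babe"]},
    index=pd.Index([10, 20], name="item"),
)

# Each value of the left "item" column is looked up in the right index.
joined = ratings_part.join(titles_part, on="item")
print(joined)  # item 30 has no title, so its title is NaN
```

The row order of the left DataFrame is preserved, which is why the sorted ranking stays sorted after the titles are attached.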

The movies we recommend are not very famous. Why is this happening? Let's check how many people rated each of these movies. We use the *groupby* method again, and then the [agg](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.agg.html) method of the GroupBy object to specify a different aggregation for each column: for the *user* column, we count the number of users in the group with the *'size'* aggregation, while for the *rating* column we want the average value, so we specify *'mean'*. The rest of the operations are the same as before:

In [57]:
average_ratings_df = ratings_df.groupby(['item']).agg(count=('user', 'size'), rating=('rating', 'mean')).reset_index()
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False)
joined_df = sorted_avg_ratings_df.join(movies_df, on='item')
display(joined_df[['title', 'count', 'rating']].head(10))

,title,count,rating
1818,hollywood shuffle,1,5.0
974,knock off,1,5.0
1384,my man godfrey,1,5.0
4678,the girl with all the gifts,1,5.0
4411,laggies,1,5.0
3232,palindromes,1,5.0
4098,happy feet two,1,5.0
3342,7 faces of dr. lao,1,5.0
3401,peaceful warrior,1,5.0
1733,slumber party massacre ii,1,5.0


We can see that all the movies we selected have been rated by only one user. Let's try to recommend only movies that have been rated by more than 20 users. Note that *item* is the index of the DataFrame objects obtained with the *mean* and *count* methods. We will use the [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) indexer of the DataFrame object, which gives access to a group of rows and columns by label(s) or by a boolean array. In particular, we build a boolean Series: the instruction `rating_counts_df['rating'] > minimum_ratings` evaluates the condition for each row of the *rating* column. Hence, *loc* only returns the rows for which the condition is true.

In [58]:
minimum_ratings = 20
average_ratings_df = ratings_df.groupby(['item']).mean() #compute the average ratings for each movie
rating_counts_df = ratings_df.groupby(['item']).count() #compute the number of evaluations for each movie
average_ratings_df = average_ratings_df.loc[rating_counts_df['rating'] > minimum_ratings] #select the movies having the required number of evaluations
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False) #sort the movies according to the average ratings
joined_df = sorted_avg_ratings_df.join(movies_df['title'], on='item') #associate the title
display(joined_df[['title','rating']].head(10))

item,title,rating
1204,lawrence of arabia,4.3
1235,harold and maude,4.288462
168252,logan,4.28
2959,fight club,4.272936
1276,cool hand luke,4.27193
750,dr. strangelove or: how i learned to stop worr...,4.268041
904,rear window,4.261905
1213,goodfellas,4.25
912,casablanca,4.24
926,all about eve,4.229167


We can see now that the recommendations contain more famous movies.
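
The filtering step can be sketched on a toy example as follows. The data and the threshold of 2 are made up for illustration. The boolean Series shares the *item* index with the averages, so *loc* keeps exactly the rows where the mask is true. Note that `>` is a strict comparison: a movie with exactly `minimum_ratings` evaluations is excluded, while `>=` would keep it.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "user": [1, 2, 3, 1, 2, 1],
    "item": [10, 10, 10, 20, 20, 30],
    "rating": [4.0, 5.0, 3.0, 5.0, 4.0, 5.0],
})
minimum_ratings = 2  # hypothetical threshold for this toy data

means = toy_ratings.groupby("item")["rating"].mean()
counts = toy_ratings.groupby("item")["rating"].count()

mask = counts > minimum_ratings  # boolean Series indexed by item
strictly_more = means.loc[mask]
at_least = means.loc[counts >= minimum_ratings]

print(mask)
print(strictly_more)  # only item 10 has more than 2 ratings
print(at_least)       # items 10 and 20 have at least 2 ratings
```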

#### 3. Recommending by genre

We want to make the recommendations more personalized. One possible solution is to suggest the movies with the highest ratings for a specific genre. As before, we only select movies evaluated by more than 20 users, using the same methods as in the previous example. To select the movies by genre, we check whether the value in the *genres* column contains the specified genre. We can write the condition as `average_ratings_df['genres'].str.contains(genre)`.

In [60]:
genre = 'Action'
minimum_ratings = 20

average_ratings_df = ratings_df.groupby(['item']).mean() #compute the average ratings for each movie
rating_counts_df = ratings_df.groupby(['item']).count() #compute the number of evaluations for each movie
average_ratings_df = average_ratings_df.loc[rating_counts_df['rating'] > minimum_ratings] #select the movies having the required number of evaluations
average_ratings_df = average_ratings_df.join(movies_df['genres'], on='item')
average_ratings_df = average_ratings_df.loc[average_ratings_df['genres'].str.contains(genre)]
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False) #sort the movies according to the average ratings
joined_df = sorted_avg_ratings_df.join(movies_df['title'], on='item') #associate the title
display(joined_df[['title','rating']].head(10))

item,title,rating
168252,logan,4.28
2959,fight club,4.272936
1208,apocalypse now,4.219626
908,north by northwest,4.184211
2028,saving private ryan,4.146277
68157,inglourious basterds,4.136364
541,blade runner,4.100806
79132,inception,4.066434
54997,3:10 to yuma,4.06
112852,guardians of the galaxy,4.050847
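
A few details of `str.contains` are worth knowing when filtering by genre. The sketch below uses made-up values and hypothetical variable names. By default the pattern is interpreted as a regular expression, so passing `regex=False` makes the match literal, and `na=False` treats missing genres as non-matching. Since the match is a substring match, splitting on the `|` separator is one possible way to require an exact genre name.

```python
import pandas as pd

toy_genres = pd.Series(
    ["Action|Sci-Fi", "Comedy|Romance", "Action|Crime|Thriller", None],
    index=pd.Index([1, 2, 3, 4], name="item"),
)

# Literal substring match; rows with a missing genre count as False.
is_action = toy_genres.str.contains("Action", regex=False, na=False)

# A substring match also accepts partial names such as "Sci".
partial = toy_genres.str.contains("Sci", regex=False, na=False)

# Exact match on the pipe-separated list of genres.
exact = toy_genres.fillna("").str.split("|").apply(lambda parts: "Sci" in parts)

print(is_action.tolist())
print(partial.tolist())
print(exact.tolist())
```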


#### 4. Content Based Recommender

Let's try to implement some of the strategies seen during the lectures to provide more personalized recommendations. We start with a content based approach, and we implement the recommender for a specific user. For simplicity, we select a random user from the *ratings_df* DataFrame with more than 200 evaluations. First, we select all the users with more than 200 evaluations, and then we use the [sample](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) method to extract a random row. Finally, we extract the *user* column and read its content with the [iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) indexer.

In [107]:
users_ratings = ratings_df.groupby(['user']).count()

selected = users_ratings['rating'] > 200
selected_users = users_ratings.loc[selected]
random_selected = selected_users.sample() # sample() returns a one-row dataframe with a randomly chosen row; pass a number to select more than one row
select_column_df = random_selected.reset_index()['user'] # reset_index() creates a new index, so the user id becomes a regular column that we can select by name
selected_user = select_column_df.iloc[0] # iloc selects by position; since there is only one row, we read position 0
print("Selected user: " + str(selected_user))

Selected user: 534
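
Since *sample* draws a different user at every execution, the outputs of the following cells change from run to run. If reproducibility is desired, one option is the `random_state` argument of *sample*, as in this sketch on made-up counts (the user ids and values are hypothetical). It also shows that the user id can be read directly from the index, without resetting it.

```python
import pandas as pd

# Hypothetical number of ratings per user, indexed by user id.
toy_counts = pd.DataFrame(
    {"rating": [250, 310, 205, 420]},
    index=pd.Index([7, 42, 99, 534], name="user"),
)

# The same random_state yields the same "random" row on every run.
first_draw = toy_counts.sample(n=1, random_state=0)
second_draw = toy_counts.sample(n=1, random_state=0)

# The user id is the index label of the sampled row.
toy_selected_user = first_draw.index[0]
print("Selected user: " + str(toy_selected_user))
```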


Now, we need to select the evaluations of the user, which we will use to train our recommender. We print the number of ratings available for the considered user, and the extracted dataset, ordered by item.

In [108]:
selected_user_ratings = ratings_df.loc[ratings_df['user'] == selected_user]
selected_user_ratings = selected_user_ratings.sort_values(by='item', ascending=True)
print("Rated movies: " + str(selected_user_ratings.shape[0]))
display(selected_user_ratings.head(10))

Rated movies: 357


,user,item,rating,timestamp
54496,534,1,4.0,1459787997
54497,534,2,4.5,1459787996
54498,534,10,4.0,1459787996
54499,534,19,4.0,1459787996
54500,534,31,3.5,1459788707
54501,534,34,4.0,1459793302
54502,534,44,4.0,1459788735
54503,534,48,4.0,1459792580
54504,534,104,4.0,1459787997
54505,534,153,3.5,1459787996


Now, we need to select the content of the rated movies, from which we will compute the features used by our model. We do it by selecting the rows of *movies_df* whose *item* appears in the *item* column of the *selected_user_ratings* DataFrame, which only contains the evaluations of the considered user. One way to do this is the [isin](https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html) method. However, since *item* is the index of the *movies_df* DataFrame, we can directly pass the list of *item* values to *loc*, which returns the rows whose index is in that list. Finally, we only select the columns *title* and *plot*.

In [121]:
rated_movies_df = movies_df.loc[list(selected_user_ratings['item'])]
rated_movies_df = rated_movies_df[['title', 'plot']]
print("Rated movies: " + str(rated_movies_df.shape[0]))
display(rated_movies_df.head(10))

Rated movies: 371


item,title,plot
1,toy story,In a world where toys are living things who pr...
2,jumanji,"In 1869, near Brantford, New Hampshire, two br..."
10,goldeneye,"In 1986, at Arkhangelsk, MI6 agents James Bond..."
19,ace ventura: when nature calls,"In the Himalayas, after a failed rescue missio..."
31,dangerous minds,"LouAnne Johnson (Michelle Pfeiffer), a dischar..."
34,babe,"Babe, an orphaned piglet, is chosen for a ""gue..."
44,mortal kombat,Mortal Kombat is a fighting tournament between...
48,pocahontas,"In 1607, the Susan Constant sails to the New W..."
104,happy gilmore,Happy Gilmore is an aspiring ice hockey player...
153,batman forever,"In Gotham City, the crime fighter Batman defus..."


We use a similar approach to select the unrated movies. We compute the set of indexes of the movies that are not rated using the set subtraction operator. We access the index of the *movies_df* and *rated_movies_df* DataFrames, create two sets containing these indexes, compute the difference set *diff*, and select the movies whose index is in the *diff* set.

In [122]:
diff = set(movies_df.index) - set(rated_movies_df.index)
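unrated_movies_df = movies_df.loc[list(diff)] # loc expects a list-like; recent pandas versions reject a set as indexer
unrated_movies_df = unrated_movies_df[['title', 'plot']].copy() # work on a copy so that new columns can be added safely later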
print("Unrated movies: " + str(unrated_movies_df.shape[0]))
display(unrated_movies_df.head(10))

Unrated movies: 4507


item,title,plot
3,grumpier old men,The feud between Max (Walter Matthau) and John...
4,waiting to exhale,"""Friends are the People who let you be yoursel..."
5,father of the bride part ii,The film begins five years after the events of...
6,heat,"Neil McCauley, a career criminal, hires Waingr..."
7,sabrina,Sabrina Fairchild is the young daughter of the...
8,tom and huck,The movie opens with Injun Joe (Eric Schweig) ...
9,sudden death,Darren McCord (Jean-Claude Van Damme) is a Fre...
180231,daddy's home 2,After finally becoming friends at the end of t...
12,dracula: dead and loving it,Solicitor Thomas Renfield travels all the way ...
13,balto,"In New York City, an elderly woman, her grandd..."
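
The set subtraction can also be expressed with pandas alone. The sketch below, on made-up data with hypothetical names, shows two alternatives: `Index.difference`, which returns the remaining labels in sorted order, and a negated `isin` mask, which keeps the original row order.

```python
import pandas as pd

toy_movies = pd.DataFrame(
    {"title": ["movie a", "movie b", "movie c", "movie d"]},
    index=pd.Index([1, 2, 3, 4], name="item"),
)
rated_items = [2, 4]  # hypothetical ids of the movies rated by the user

# Option 1: labels of toy_movies that are not in rated_items (sorted).
unrated_index = toy_movies.index.difference(rated_items)
unrated_by_difference = toy_movies.loc[unrated_index]

# Option 2: boolean mask on the index, negated with ~.
unrated_by_mask = toy_movies.loc[~toy_movies.index.isin(rated_items)]

print(unrated_by_difference)
print(unrated_by_mask)
```

Both options avoid the arbitrary iteration order of a Python set, which is why the unrated movies in the output above are not fully sorted by *item*.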


Finally, we can add the ratings so that we only need to work with a single DataFrame. We do it using the *join* method. We set *item* as the index of the *selected_user_ratings* DataFrame to proceed with the join, and then we select the *rating* column.

In [123]:
rated_movies_df = rated_movies_df.join(selected_user_ratings.set_index('item')['rating'], on='item')
print("Rated movies: " + str(rated_movies_df.shape[0]))
display(rated_movies_df.head(10))

Rated movies: 371


item,title,plot,rating
1,toy story,In a world where toys are living things who pr...,4.0
2,jumanji,"In 1869, near Brantford, New Hampshire, two br...",4.5
10,goldeneye,"In 1986, at Arkhangelsk, MI6 agents James Bond...",4.0
19,ace ventura: when nature calls,"In the Himalayas, after a failed rescue missio...",4.0
31,dangerous minds,"LouAnne Johnson (Michelle Pfeiffer), a dischar...",3.5
34,babe,"Babe, an orphaned piglet, is chosen for a ""gue...",4.0
44,mortal kombat,Mortal Kombat is a fighting tournament between...,4.0
48,pocahontas,"In 1607, the Susan Constant sails to the New W...",4.0
104,happy gilmore,Happy Gilmore is an aspiring ice hockey player...,4.0
153,batman forever,"In Gotham City, the crime fighter Batman defus...",3.5


Now, we have all the data we need to train a content based recommender system for our user.

##### TF-IDF computation with sklearn

We will use the well-known Python library [sklearn](https://scikit-learn.org/stable/) to train our recommender. First, we need to extract features from our content (the plots of the movies). We will use the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class, which allows us to easily compute TF-IDF vectors for our movies. We need to instantiate a *TfidfVectorizer* object and then call its [fit_transform](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit_transform) method. The method fits (trains) the feature model and then transforms our data, returning the TF-IDF matrix representation. We use the [shape](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.shape.html#numpy.ndarray.shape) attribute of the matrix to check the dimensions of our feature matrix.

In [124]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(rated_movies_df['plot'])
print(X.shape)

(371, 18736)
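
The shape above means one row per rated movie and one column per distinct term found in the plots. To make the computation less opaque, here is a minimal pure-Python sketch of TF-IDF on three made-up sentences. It follows what I understand to be the default weighting of *TfidfVectorizer* (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, rows scaled to unit Euclidean length), but it uses a naive whitespace tokenizer, so it is an illustration rather than a reimplementation of the library.

```python
import math
from collections import Counter

# Made-up mini corpus standing in for the movie plots.
docs = ["toys come alive", "toys fight crime", "crime drama"]
tokenized = [doc.split() for doc in docs]  # naive tokenizer (assumption)

vocabulary = sorted({term for doc in tokenized for term in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in tokenized for term in set(doc))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_vector(tokens):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

matrix = [tfidf_vector(doc) for doc in tokenized]
print((len(matrix), len(vocabulary)))  # (documents, distinct terms)
```

Terms that appear in fewer documents, such as "alive", receive a larger idf weight than terms shared by several documents, such as "toys".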


##### Training a KNN Regressor with sklearn

Now, we have our features. To train our model, we also need the ratings. In this example, we use the ratings directly and train a regressor that estimates, for each unrated movie, the rating that the user would assign to it. Of course, we could use a different approach and treat the problem as a classification problem. In that case, we would need to transform our ratings into labels. For simplicity, we work with the ratings. Hence, we need to prepare a vector containing the rating of each movie, which we obtain by simply selecting the *rating* column of our dataset.

In [125]:
y = rated_movies_df['rating']
print(y.shape)

(371,)


Then, we train our regressor. We use the [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) class of the sklearn library. We need to create an instance of *KNeighborsRegressor*. In the constructor, we specify the number of neighbors to use. We set it at 5. Finally, we can train our model, using the [fit](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor.fit) method, passing as arguments the features matrix *X* and the corresponding ratings vector *y*.

In [126]:
from sklearn.neighbors import KNeighborsRegressor

neigh = KNeighborsRegressor(n_neighbors=5)
neigh.fit(X, y)

KNeighborsRegressor()
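
Conceptually, a nearest neighbors regressor predicts the rating of a new movie by averaging the ratings of the most similar rated movies. The sketch below illustrates the idea in plain Python on made-up two-dimensional features, assuming Euclidean distance and an unweighted average (which, to my understanding, correspond to the defaults of *KNeighborsRegressor*). The function `knn_predict` is a hypothetical helper, not part of sklearn.

```python
import math

def knn_predict(train_X, train_y, query, k):
    """Average the targets of the k training points closest to query."""
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Made-up features (two dimensions instead of thousands of TF-IDF terms).
toy_X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]]
toy_y = [4.0, 5.0, 3.0, 1.0]

near_origin = knn_predict(toy_X, toy_y, [0.1, 0.1], k=3)  # mean of 4, 5 and 3
far_away = knn_predict(toy_X, toy_y, [4.0, 4.0], k=1)     # closest point only
print(near_origin, far_away)
```

Since the prediction is an average of existing ratings, it always stays within the range of the ratings given by the user.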

We can now use the trained model to compute our recommendations. First, we need to extract the features from the plots of our unrated movies. We can do it using the previously trained *TfidfVectorizer* *vectorizer*, passing the *plot* column of the *unrated_movies_df* DataFrame to its [transform](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.transform) method.

Then, we use the trained *KNeighborsRegressor* *neigh*, passing the feature matrix of the unrated movies *X_unrated* to its [predict](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor.predict) method, which determines the predicted ratings *y_unrated*.

In [127]:
X_unrated = vectorizer.transform(unrated_movies_df['plot'])
print(X_unrated.shape)
y_unrated = neigh.predict(X_unrated)
print(y_unrated.shape)

(4507, 18736)
(4507,)


Finally, we can combine the predicted ratings with the information in the *unrated_movies_df* DataFrame, order the movies by predicted rating, and determine the best 10 movies to recommend to the user.

In [128]:
unrated_movies_df['predicted_ratings'] = y_unrated
unrated_movies_df = unrated_movies_df.sort_values(by='predicted_ratings', ascending=False)
display(unrated_movies_df.head(10))

item,title,plot,predicted_ratings
96283,diary of a wimpy kid: dog days,The Heffley family attends a pool party at the...,4.6
180297,the disaster artist,"San Francisco, 1998: 19-year-old Greg Sestero ...",4.6
129737,unfinished business,After an argument with his boss Chuck Portnoy ...,4.5
1870,"dancer, texas pop. 81","Set in the small, fictional American town of D...",4.4
87234,submarine,Unpopular Oliver Tate (Craig Roberts) is a 15-...,4.4
58806,smart people,Carnegie Mellon English Professor Lawrence Wet...,4.4
74228,triangle,While preparing to take her autistic son Tommy...,4.4
3628,flying tigers,Jim Gordon (John Wayne in his first war film) ...,4.4
2541,cruel intentions,"In an upscale New York City mansion, wealthy a...",4.4
72395,precious,"In 1987, 16-year-old Precious (Gabourey Sidibe...",4.4


#### 4.1 EXERCISE

Following the example above, implement a second content-based recommender system using the [SVR](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) regressor of the Sklearn library (use a *linear* kernel and the default hyperparameters).

#### 5. Collaborative Filtering Recommender

We will now implement a different approach for our recommender system: collaborative filtering. For this, we are going to use the [Lenskit](https://lkpy.readthedocs.io/en/stable/) library. Note that this library requires Python 3.6.

##### Collaborative user-user recommender with Lenskit

We use the user-based collaborative filtering algorithm [UserUser](https://lkpy.readthedocs.io/en/stable/knn.html#module-lenskit.algorithms.user_knn), which relies on the nearest neighbors algorithm. The constructor takes as inputs the maximum and minimum number of neighbors to consider (we set them to 15 and 3). Then we use the [adapt](https://lkpy.readthedocs.io/en/stable/interfaces.html?highlight=Recommender#lenskit.Recommender.adapt) method of the [Recommender](https://lkpy.readthedocs.io/en/stable/interfaces.html?highlight=Recommender#lenskit.Recommender) class to obtain a *Recommender* object. Finally, we use the [fit](https://lkpy.readthedocs.io/en/stable/interfaces.html?highlight=Recommender#lenskit.Algorithm.fit) method to train the model.

In [132]:
from lenskit.algorithms import Recommender
from lenskit.algorithms.user_knn import UserUser

# We use the user-based collaborative filtering algorithm UserUser, which relies on nearest neighbors
num_recs = 10  # Number of recommendations to generate
user_user = UserUser(15, min_nbrs=3)  # Maximum (15) and minimum (3) number of neighbors to consider
recsys = Recommender.adapt(user_user)
recsys.fit(ratings_df)

<lenskit.algorithms.basic.TopN at 0x1dae58062b0>

Now, we can provide the recommendations for a specific user. Let's generate 10 recommendations for the user selected for the previous example. We use the [recommend](https://lkpy.readthedocs.io/en/stable/interfaces.html?highlight=Recommender#lenskit.Recommender.recommend) method of the *Recommender* class to generate the recommendations. Then, we add the title information and display the recommendations.

In [133]:
selected_movies = recsys.recommend(selected_user, num_recs) # generate num_recs (10) recommendations for the selected user
selected_movies = selected_movies.join(movies_df['title'], on='item')
display(selected_movies)

,item,score,title
0,3814,5.694308,love and death
1,3030,5.675545,yojimbo
2,71899,5.4464,mary and max
3,5747,5.424213,gallipoli
4,7121,5.344701,adam's rib
5,85,5.26502,angels and insects
6,4021,5.252647,before night falls
7,3494,5.242296,true grit
8,4117,5.154325,hope and glory
9,2936,5.115676,sullivan's travels
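
To give an intuition of how a user-based neighborhood method can produce such scores, here is a simplified pure-Python sketch on made-up ratings. It is not Lenskit's implementation: it assumes mean-centered ratings, cosine similarity between users, and a prediction equal to the user's mean plus the similarity-weighted average of the neighbors' deviations. The names `toy_ratings`, `mean_center`, `cosine` and `predict` are hypothetical.

```python
import math

# Made-up ratings: user -> {item: rating}.
toy_ratings = {
    "ann": {"a": 5, "b": 3, "c": 4},
    "bob": {"a": 4, "b": 2, "c": 3, "d": 5},
    "cat": {"a": 1, "b": 5, "d": 2},
}

def mean_center(user_ratings):
    """Return the user's mean rating and the deviations from it."""
    mean = sum(user_ratings.values()) / len(user_ratings)
    return mean, {item: r - mean for item, r in user_ratings.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[i] * v[i] for i in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict(user, item, max_nbrs=15):
    """Predict a rating from the most similar users who rated the item."""
    user_mean, user_dev = mean_center(toy_ratings[user])
    candidates = []
    for other, other_ratings in toy_ratings.items():
        if other == user or item not in other_ratings:
            continue
        _, other_dev = mean_center(other_ratings)
        sim = cosine(user_dev, other_dev)
        if sim > 0:  # keep only positively correlated neighbors
            candidates.append((sim, other_dev[item]))
    neighbors = sorted(candidates, reverse=True)[:max_nbrs]
    if not neighbors:
        return None
    weighted = sum(sim * dev for sim, dev in neighbors)
    return user_mean + weighted / sum(sim for sim, _ in neighbors)

ann_on_d = predict("ann", "d")
print(ann_on_d)
```

In this toy case the only useful neighbor rated item "d" 1.5 points above their own mean, so the prediction is 1.5 points above ann's mean of 4. A scheme of this kind is not bounded by the rating scale, which is one plausible explanation for scores above 5.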


We can compare these recommendations with the ones previously computed with the content based approach.

In [134]:
display(unrated_movies_df.head(10))

item,title,plot,predicted_ratings
96283,diary of a wimpy kid: dog days,The Heffley family attends a pool party at the...,4.6
180297,the disaster artist,"San Francisco, 1998: 19-year-old Greg Sestero ...",4.6
129737,unfinished business,After an argument with his boss Chuck Portnoy ...,4.5
1870,"dancer, texas pop. 81","Set in the small, fictional American town of D...",4.4
87234,submarine,Unpopular Oliver Tate (Craig Roberts) is a 15-...,4.4
58806,smart people,Carnegie Mellon English Professor Lawrence Wet...,4.4
74228,triangle,While preparing to take her autistic son Tommy...,4.4
3628,flying tigers,Jim Gordon (John Wayne in his first war film) ...,4.4
2541,cruel intentions,"In an upscale New York City mansion, wealthy a...",4.4
72395,precious,"In 1987, 16-year-old Precious (Gabourey Sidibe...",4.4


#### 5.1 EXERCISE

Following the example above, implement a second collaborative filtering recommender system using the item-based [ItemItem](https://lkpy.readthedocs.io/en/stable/knn.html#lenskit.algorithms.item_knn.ItemItem) algorithm provided by the Lenskit library.

### Evaluation of Recommender Systems

We have seen two approaches, but how can we evaluate the goodness of the proposed recommendations?
We are going to see two examples, one for each approach.

#### Evaluation of a Content-based Recommender System

We implemented our Content-based Recommender System using the sklearn library. The same library offers a wide range of tools to evaluate regressors and classifiers, which we will use (and describe).

In the previous example, we estimated the user's ratings on unseen movies. Now, we treat the problem as a classification problem. We consider as relevant, for the user, all the movies rated higher than 3.

First, let's define a support function which allows us to select the information we need for a particular user.

The function is called *get_user_rated_movies_plots* and takes as input the user and the two DataFrames *ratings_df* and *movies_df*. It first selects the ratings of the considered user. Then, it selects the information about the movies evaluated by the user. Finally, it joins the information in a single DataFrame and adds a column containing the relevance of each movie, computed from the corresponding rating: if the user gave a rating higher than 3, the relevance is set to 1; otherwise, it is set to 0.

To do this, we use the apply method (seen previously). In this case, since the operation to perform is simple, we don't define a named function for it. Instead, we define a lambda function (you can see it as a small anonymous function). The lambda function is applied to each element of the object on which the apply method is called. In this case, we call it on the *'rating'* column of the *rated_movies_df* DataFrame. Hence, x is the rating of a specific movie.

The lambda function then computes the relevance value through a [ternary operator](https://book.pythontips.com/en/latest/ternary_operators.html).

In [143]:
def get_user_rated_movies_plots(user, ratings_df, movies_df):
    """
    extract from the dataframe ratings_df the ratings of the user
    """
    selected_user_ratings = ratings_df.loc[ratings_df['user'] == user] # select the ratings of the user
    rated_movies_df = movies_df.loc[list(selected_user_ratings['item'])] # select the movie information for the movies rated by the user
    rated_movies_df = rated_movies_df[['title', 'plot']] # select only the information we need
    selected_user_ratings = selected_user_ratings.set_index('item') # set the index for the next join
    rated_movies_df = rated_movies_df.join(selected_user_ratings['rating'], on='item') # join the two dataframes
    rated_movies_df['relevant'] = rated_movies_df['rating'].apply(lambda x: 1 if x>3 else 0) # compute the relevance values for the user
    return rated_movies_df

We can test the function by retrieving the information for the user selected for the previous evaluations.

In [144]:
user_plots_ratings_df = get_user_rated_movies_plots(selected_user, ratings_df, movies_df)
display(user_plots_ratings_df)

item,title,plot,rating,relevant
1,toy story,In a world where toys are living things who pr...,4.0,1
2,jumanji,"In 1869, near Brantford, New Hampshire, two br...",4.5,1
10,goldeneye,"In 1986, at Arkhangelsk, MI6 agents James Bond...",4.0,1
19,ace ventura: when nature calls,"In the Himalayas, after a failed rescue missio...",4.0,1
31,dangerous minds,"LouAnne Johnson (Michelle Pfeiffer), a dischar...",3.5,1
...,...,...,...,...
134853,inside out,Riley Andersen is born in Minnesota. Within he...,3.5,1
135885,absolutely anything,"Decades after being launched into space, a spa...",3.0,0
136864,batman v superman: dawn of justice,Eighteen months after the battle between Super...,1.5,0
138036,the man from u.n.c.l.e.,"In 1963, at the height of the Cold War, profes...",3.5,1
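
The relevance column can be checked on a few made-up ratings, as in the sketch below. It also shows a vectorized alternative to the lambda: comparing the whole column at once and converting the booleans to integers gives the same labels. Note that a rating of exactly 3 is not considered relevant, because the comparison is strict.

```python
import pandas as pd

toy_rating_column = pd.Series([4.0, 3.0, 1.5, 3.5], name="rating")

# Element-wise lambda, as in get_user_rated_movies_plots.
relevant_apply = toy_rating_column.apply(lambda x: 1 if x > 3 else 0)

# Vectorized alternative: boolean comparison, then cast to int.
relevant_vectorized = (toy_rating_column > 3).astype(int)

print(relevant_apply.tolist())
print(relevant_vectorized.tolist())
```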


Now, since we have a different recommender for each user, we need a function that evaluates the recommender for a specific user. We will use the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class from the sklearn library to compute the TF-IDF representations of the movies' plots, as we did before. However, since we are addressing the problem as a classification problem, we use the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) instead of a regressor.

Unlike in the previous example, we want to evaluate the performance of our classifier; hence, we need to test it on a set of movies for which we know the correct relevance for this user. In other words, we need to split the information we have about the user ratings into two sets:
1. Training set: we will use this data to train our classifier.
2. Test set: we will apply our classifier to this data and compare the predictions with the actual relevance values.

To split the data we will use the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function, provided by the sklearn library.
Finally, once we have the classifier predictions, we use the [precision_recall_fscore_support](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html) function, provided by the sklearn library, to compute the information retrieval metrics *precision*, *recall* and *fscore*. Our function returns these values for the considered user.

In [147]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

def train_and_test_recsys_for_user(user):
    user_plots_ratings_df = get_user_rated_movies_plots(user, ratings_df, movies_df) # retrieve the user's rated movies with the previously defined function
    
    X_plots = user_plots_ratings_df['plot'] # select the plot column, from which we will compute our TF-IDF features
    y = user_plots_ratings_df['relevant'] # select the relevant column, which will be used as the label
    
    X_plots_train, X_plots_test, y_train, y_test = train_test_split(X_plots, y, test_size=0.2) # randomly split the data into train and test sets; 20% of the data goes into the test set

    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(X_plots_train) # fit our TF-IDF model on the training plots only and compute their features
    
    neigh = KNeighborsClassifier(n_neighbors=5)
    neigh.fit(X_train, y_train) # train our classifier
    
    X_test = vectorizer.transform(X_plots_test) # compute the test features with the vocabulary learned on the training set
    y_pred = neigh.predict(X_test) # predict the relevance of the test movies
    
    return precision_recall_fscore_support(y_test, y_pred, average="binary") # compare the real relevance values with the predicted ones, and return precision, recall, fscore and support (None here)

We can test our function on the selected user. 
Note that the results are affected by the random split (you can check this by executing the cell several times: the results change!)

How can we obtain more reliable results?

In [149]:
train_and_test_recsys_for_user(selected_user)


(0.7887323943661971, 0.9655172413793104, 0.8682170542635659, None)
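
As a side note, a run can be made repeatable by fixing the seed of the split through the `random_state` parameter of *train_test_split*. The sketch below uses made-up toy data (`items`, `labels`) rather than the tutorial's dataset. Fixing the seed only makes the result reproducible; it does not by itself make the estimate more reliable.

```python
from sklearn.model_selection import train_test_split

items = list(range(20))          # toy stand-in for a user's rated movies
labels = [i % 2 for i in items]  # toy relevance labels

# Two calls with the same random_state produce exactly the same split
_, test_a, _, _ = train_test_split(items, labels, test_size=0.2, random_state=42)
_, test_b, _, _ = train_test_split(items, labels, test_size=0.2, random_state=42)

print(test_a == test_b)
```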

However, we keep this example simple and now address the final problem: merging the results for all the users.
We will iterate over all the users and compute the precision, recall and fscore values of the recommender for each specific user. For simplicity, we will evaluate only the users with more than 100 rated movies (what could be the problem for the other users?).

This might take some time, so we print the current user on each iteration to keep track of the progress of our computation.

Finally, we compute the means and standard deviations of the collected values and print them.

In [151]:
import statistics

user_list = list(set(ratings_df['user']))

precision_list = list()
recall_list = list()
fscore_list = list()
print("evaluating recommender on " + str(len(user_list)) + " users")
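for user in user_list:
    user_plots_ratings_df = get_user_rated_movies_plots(user, ratings_df, movies_df)
    print("user " + str(user) + " / " + str(user_plots_ratings_df.shape[0])) # user id / number of rated movies
    if user_plots_ratings_df.shape[0] > 100: # evaluate only the users with more than 100 rated movies
        precision, recall, fscore, _ = train_and_test_recsys_for_user(user)
        # sklearn reports undefined metrics as 0 (with a warning), so we skip zero values;
        # note that this also discards genuine zeros, which makes the averages optimistic
        if precision > 0:
            precision_list.append(precision)
        if recall > 0:
            recall_list.append(recall)
        if fscore > 0:
            fscore_list.append(fscore)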
avg_precision = statistics.mean(precision_list)
stdv_precision = statistics.stdev(precision_list)
avg_recall = statistics.mean(recall_list)
stdv_recall = statistics.stdev(recall_list)
avg_fscore = statistics.mean(fscore_list)
stdv_fscore = statistics.stdev(fscore_list)
    
print("Precision:\t" + str(avg_precision) + " +/- " + str(stdv_precision) + 
      "\nRecall:\t" + str(avg_recall) + " +/- " + str(stdv_recall) +
      "\nFscore:\t" + str(avg_fscore) + " +/- " + str(stdv_fscore))
            
            

evaluating recommender on 610 users
user 1 / 156
user 2 / 19
user 3 / 29
user 4 / 130
user 5 / 30
user 6 / 213
user 7 / 99
user 8 / 32
user 9 / 29
user 10 / 96
user 11 / 51
user 12 / 28
user 13 / 21
user 14 / 35
user 15 / 99
user 16 / 49
user 17 / 57
user 18 / 337
user 19 / 497
user 20 / 152
user 21 / 299
user 22 / 80
user 23 / 53
user 24 / 72
user 25 / 14
user 26 / 15
user 27 / 86
user 28 / 368
user 29 / 57
user 30 / 19
user 31 / 31
user 32 / 67
user 33 / 94
user 34 / 49
user 35 / 18
user 36 / 39
user 37 / 15
user 38 / 56
user 39 / 63
user 40 / 65
user 41 / 159
user 42 / 312
user 43 / 84
user 44 / 32
user 45 / 285
user 46 / 29
user 47 / 98
user 48 / 21
user 49 / 19
user 50 / 153
user 51 / 257


  _warn_prf(average, modifier, msg_start, len(result))


user 52 / 87
user 53 / 12
user 54 / 23
user 55 / 19
user 56 / 29
user 57 / 318
user 58 / 79
user 59 / 61
user 60 / 11
user 61 / 26
user 62 / 257
user 63 / 166
user 64 / 333
user 65 / 27
user 66 / 228
user 67 / 21
user 68 / 873
user 69 / 28
user 70 / 38
user 71 / 26
user 72 / 22
user 73 / 166
user 74 / 101
user 75 / 45
user 76 / 73
user 77 / 14
user 78 / 44
user 79 / 30
user 80 / 120
user 81 / 17
user 82 / 165
user 83 / 74
user 84 / 192
user 85 / 15
user 86 / 47
user 87 / 10
user 88 / 31
user 89 / 253
user 90 / 29
user 91 / 373
user 92 / 17
user 93 / 70
user 94 / 42
user 95 / 116
user 96 / 49
user 97 / 23
user 98 / 66
user 99 / 45
user 100 / 116
user 101 / 45
user 102 / 42
user 103 / 269
user 104 / 185
user 105 / 344
user 106 / 25
user 107 / 23
user 108 / 51
user 109 / 89
user 110 / 31
user 111 / 444
user 112 / 43
user 113 / 108
user 114 / 21
user 115 / 77
user 116 / 62
user 117 / 111
user 118 / 16
user 119 / 145


  _warn_prf(average, modifier, msg_start, len(result))


user 120 / 14
user 121 / 46
user 122 / 203
user 123 / 40
user 124 / 36
user 125 / 227
user 126 / 26
user 127 / 16
user 128 / 23
user 129 / 87
user 130 / 21
user 131 / 42
user 132 / 222
user 133 / 22
user 134 / 25
user 135 / 198
user 136 / 79
user 137 / 79
user 138 / 10
user 139 / 144
user 140 / 404


  _warn_prf(average, modifier, msg_start, len(result))


user 141 / 112
user 142 / 30
user 143 / 54
user 144 / 89
user 145 / 18
user 146 / 19
user 147 / 17
user 148 / 34
user 149 / 37
user 150 / 17
user 151 / 44
user 152 / 37
user 153 / 122
user 154 / 28
user 155 / 35
user 156 / 255
user 157 / 19
user 158 / 16
user 159 / 76
user 160 / 297
user 161 / 22
user 162 / 30
user 163 / 18
user 164 / 22
user 165 / 42
user 166 / 116
user 167 / 114
user 168 / 54
user 169 / 195
user 170 / 38
user 171 / 47
user 172 / 15
user 173 / 22
user 174 / 51
user 175 / 18
user 176 / 24
user 177 / 633
user 178 / 46
user 179 / 47
user 180 / 11
user 181 / 91
user 182 / 618
user 183 / 34
user 184 / 72
user 185 / 28
user 186 / 147
user 187 / 126
user 188 / 31
user 189 / 9
user 190 / 43
user 191 / 56
user 192 / 16
user 193 / 19
user 194 / 14
user 195 / 134
user 196 / 21
user 197 / 22
user 198 / 145
user 199 / 246
user 200 / 253
user 201 / 72
user 202 / 246
user 203 / 31
user 204 / 45
user 205 / 21
user 206 / 18
user 207 / 17
user 208 / 13
user 209 / 23
user 210 / 74
user 

  _warn_prf(average, modifier, msg_start, len(result))


user 372 / 163
user 373 / 70
user 374 / 20
user 375 / 24
user 376 / 76
user 377 / 74
user 378 / 29
user 379 / 27
user 380 / 791
user 381 / 314
user 382 / 212
user 383 / 22
user 384 / 51
user 385 / 134
user 386 / 50
user 387 / 593
user 388 / 18
user 389 / 23
user 390 / 54
user 391 / 235
user 392 / 15
user 393 / 84
user 394 / 18
user 395 / 47
user 396 / 18
user 397 / 14
user 398 / 19
user 399 / 20
user 400 / 23
user 401 / 56
user 402 / 34
user 403 / 27
user 404 / 41
user 405 / 72
user 406 / 14
user 407 / 10
user 408 / 103
user 409 / 85
user 410 / 99
user 411 / 93
user 412 / 63
user 413 / 38
user 414 / 1804
user 415 / 55
user 416 / 27
user 417 / 40
user 418 / 72
user 419 / 98
user 420 / 83
user 421 / 22
user 422 / 63
user 423 / 15
user 424 / 93
user 425 / 206
user 426 / 64
user 427 / 62
user 428 / 216
user 429 / 47
user 430 / 46
user 431 / 15
user 432 / 164
user 433 / 13
user 434 / 157
user 435 / 30
user 436 / 73
user 437 / 80
user 438 / 457
user 439 / 10
user 440 / 27
user 441 / 28
user 

We can see some warnings in the output. The *precision_recall_fscore_support* function warns us when a metric cannot be computed. This happens, for instance, when the classifier does not predict any element as relevant: precision (and therefore fscore) is then undefined, because its denominator, the number of elements predicted as relevant, is 0. sklearn sets such values to 0, which is why the loop above skips them. In such cases we should check what the problem is for the specific user. For this tutorial, we will just ignore them.
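
The situation can be reproduced on a tiny example. The sketch below uses made-up labels and the `zero_division` parameter of *precision_recall_fscore_support* (available in recent sklearn versions), which states explicitly which value to return for an undefined metric and silences the warning.

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 0, 1, 0]  # toy "real" relevance values
y_pred = [0, 0, 0, 0]  # the classifier never predicts "relevant"

# No element is predicted as relevant, so precision is 0/0 (undefined);
# zero_division=0 returns 0 for it without emitting the warning
precision, recall, fscore, support = precision_recall_fscore_support(
    y_true, y_pred, average="binary", zero_division=0)

print(precision, recall, fscore, support)
```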

#### EXERCISE

Extend the example above by implementing an N-fold cross-validation strategy, and evaluate the previous recommender system using 10 folds.
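
As a possible starting point (not a full solution), the sketch below shows how the [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) class from sklearn produces the train and test indices of each fold. The arrays `X` and `y` are toy stand-ins for the plots and relevance labels of a user.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20)      # toy stand-in for the rated movies
y = np.arange(20) % 2  # toy relevance labels

kf = KFold(n_splits=10, shuffle=True, random_state=0)
test_counts = np.zeros(len(X), dtype=int)  # how often each element is tested

for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # here one would fit the model on the training part
    # and compute the metrics on the test part
    test_counts[test_idx] += 1
    print("fold", fold, "train size", len(train_idx), "test size", len(test_idx))
```

Each element ends up in the test part of exactly one fold. With a pandas Series or DataFrame, the positional indices can be applied through `.iloc`.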

#### Evaluation of a Collaborative Filtering Recommender System

We now see an example of the evaluation of a collaborative filtering approach. We will again use the Lenskit library for this.
Since we have a single model for all the users, the evaluation is simpler: we can consider all the ratings we have and split them into a training set and a test set; then, we train the recommender on the training set and use it to predict the ratings in the test set. We will do this using the [predict](https://lkpy.readthedocs.io/en/stable/interfaces.html#lenskit.Predictor.predict) method. Finally, we compute the relevance values from the original ratings and from the predicted ratings, and we compare them to obtain the precision, recall and fscore values.

Note that we will again split our dataset using the *train_test_split* function. In this case, we pass a DataFrame as argument. The function returns two DataFrames, with the rows randomly split between them.

In [152]:
from lenskit.algorithms import Recommender
from lenskit.algorithms.user_knn import UserUser
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

train_df, test_df = train_test_split(ratings_df, test_size=0.2)

user_user = UserUser(15, min_nbrs=3)  # Minimum (3) and maximum (15) number of neighbors to consider
recsys = Recommender.adapt(user_user)
recsys.fit(train_df)

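test_df['predicted_rating'] = recsys.predict(test_df) # predict a rating for each (user, item) pair in the test set

# a movie is relevant if its rating is greater than 3; a missing prediction (NaN), if any, counts as not relevant because NaN > 3 is False
test_df['relevant'] = test_df['rating'].apply(lambda x: 1 if x>3 else 0)
test_df['predicted_relevant'] = test_df['predicted_rating'].apply(lambda x: 1 if x>3 else 0)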

y_test = list(test_df['relevant'])
y_pred = list(test_df['predicted_relevant'])
precision, recall, fscore, _ = precision_recall_fscore_support(y_test, y_pred, average="binary")

print("Precision:\t" + str(precision) + 
      "\nRecall:\t" + str(recall) + 
      "\nFscore:\t" + str(fscore))


Precision:	0.7092449592449592
Recall:	0.8528501418622646
Fscore:	0.7744466565171566


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':


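We can ignore the warnings: pandas raises them because we add new columns to *test_df*, which was obtained as a slice of *ratings_df*. The columns are still added correctly to *test_df*.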
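
If one prefers to avoid these messages rather than ignore them, a common option is to work on an explicit copy of the test set before adding columns. The sketch below illustrates this on a made-up `toy_ratings` DataFrame, not on the tutorial's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy ratings with the same columns used in the tutorial
toy_ratings = pd.DataFrame({
    "user": [1, 1, 2, 2, 3],
    "item": [10, 20, 10, 30, 20],
    "rating": [4.0, 2.5, 5.0, 3.0, 3.5],
})

train_df, test_df = train_test_split(toy_ratings, test_size=0.4, random_state=0)
test_df = test_df.copy()  # explicit copy: the new columns belong to test_df only

# vectorized version of the relevance threshold (rating greater than 3)
test_df["relevant"] = (test_df["rating"] > 3).astype(int)
print(test_df)
```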

Again, this evaluation depends strongly on the particular split performed. A proper evaluation should use a more reliable validation strategy (any ideas?). For reasons of time, we used the simplest approach. Try to implement some other approaches and check the differences (let us know if you have any problems with this).

#### EXERCISE

Extend the example above by implementing an N-fold cross-validation strategy, and evaluate the previous recommender system using 10 folds.