# Recommender Systems - Tutorial 1 (Lab 1)
### Individual Recommender Systems

In this tutorial, we implement some of the strategies for individual recommender systems covered in the first lectures of the course. Along the way, we introduce some Python libraries that are useful for machine learning applications.

#### 1. Dataset

First, we need some data. In this tutorial, we will use the [MovieLens Latest Small](https://www.kaggle.com/grouplens/movielens-latest-small) dataset. Other datasets can be found on Canvas, in the [Project Resources](https://canvas.maastrichtuniversity.nl/courses/7954/pages/project-resources?module_item_id=203858) module. We download the dataset and place its files in the *dataset* folder. The dataset contains several CSV files. We can use the [pandas](https://pandas.pydata.org/) library to read their content and work with it.

Let's look at the first 10 rows to see what the *ratings.csv* and *movies.csv* files contain. To do so, we use the *read_csv* function, which returns a pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) object containing the data read from the CSV file. This object provides several methods to manipulate its data; here we use the *head(N)* method, which returns the first *N* rows of the DataFrame.

In [9]:
dataset_folder = "./dataset"

import pandas as pd

In [28]:
ratings_df = pd.read_csv(dataset_folder+"/ratings.csv") 
print(ratings_df.head(10))

   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931
5       1       70     3.0  964982400
6       1      101     5.0  964980868
7       1      110     4.0  964982176
8       1      151     5.0  964984041
9       1      157     5.0  964984100


In [29]:
movies_df = pd.read_csv(dataset_folder+"/movies.csv", index_col='movieId')
print(movies_df.head(10))

                                      title  \
movieId                                       
1                          Toy Story (1995)   
2                            Jumanji (1995)   
3                   Grumpier Old Men (1995)   
4                  Waiting to Exhale (1995)   
5        Father of the Bride Part II (1995)   
6                               Heat (1995)   
7                            Sabrina (1995)   
8                       Tom and Huck (1995)   
9                       Sudden Death (1995)   
10                         GoldenEye (1995)   

                                              genres  
movieId                                               
1        Adventure|Animation|Children|Comedy|Fantasy  
2                         Adventure|Children|Fantasy  
3                                     Comedy|Romance  
4                               Comedy|Drama|Romance  
5                                             Comedy  
6                              Action|Crime|Thriller
7                                     Comedy|Romance
8                                 Adventure|Children
9                                             Action
10                         Action|Adventure|Thriller

The *ratings.csv* file contains the rating given by a user to a movie, together with the corresponding timestamp. The *movies.csv* file provides the title and the genres of each movie.
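
As a small illustration of these columns, the sketch below builds two toy frames with a few rows copied from the output above, converts the *timestamp* column to dates, and splits the *genres* string into a list. It assumes that the timestamps are Unix times in seconds; the names `ratings_sample`, `movies_sample`, `rated_at` and `genre_list` are made up for this example.

```python
import pandas as pd

# Toy rows in the same shape as ratings.csv and movies.csv
ratings_sample = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [1, 3],
    "rating": [4.0, 4.0],
    "timestamp": [964982703, 964981247],
})
movies_sample = pd.DataFrame(
    {
        "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
        "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Comedy|Romance"],
    },
    index=pd.Index([1, 3], name="movieId"),
)

# Assumption: timestamps are seconds since the Unix epoch
ratings_sample["rated_at"] = pd.to_datetime(ratings_sample["timestamp"], unit="s")

# Split the pipe-separated genres string into one Python list per movie
movies_sample["genre_list"] = movies_sample["genres"].str.split("|")

print(ratings_sample[["userId", "movieId", "rated_at"]])
print(movies_sample["genre_list"])
```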

#### 2. Recommending highest rated movies

A first idea for providing recommendations is to use the available ratings to determine the highest rated movies. To do so, we first compute the average rating of each movie and then select the top rated ones. Let's again use the methods provided by the DataFrame object. In particular, the *groupby* method groups the rows that share the same value in a given column, so that we can aggregate the values of each group (here, with *mean*). Then we use the *sort_values* method to sort the movies by average rating, and we print the first 10 movies.

In [38]:
average_ratings_df = ratings_df.groupby(['movieId']).mean()
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False)

print(sorted_avg_ratings_df.head(10))

         userId  rating     timestamp
movieId                              
88448     483.0     5.0  1.315438e+09
100556    462.0     5.0  1.456151e+09
143031     89.0     5.0  1.520409e+09
143511    105.0     5.0  1.526207e+09
143559     89.0     5.0  1.520410e+09
6201      474.0     5.0  1.100120e+09
102217     63.0     5.0  1.443200e+09
102084    380.0     5.0  1.493422e+09
6192      182.0     5.0  1.063275e+09
145994    105.0     5.0  1.526207e+09


Let's use the information in *movies_df* to associate a title with each movie, and then print the list of recommended movies.

In [39]:
joined_df = sorted_avg_ratings_df.join(movies_df, on='movieId')
print(joined_df['title'].head(10))

movieId
88448         Paper Birds (Pájaros de papel) (2010)
100556                   Act of Killing, The (2012)
143031                              Jump In! (2007)
143511                                 Human (2015)
143559                          L.A. Slasher (2015)
6201                               Lady Jane (1986)
102217               Bill Hicks: Revelations (1993)
102084                 Justice League: Doom (2012) 
6192      Open Hearts (Elsker dig for evigt) (2002)
145994                       Formula of Love (1984)
Name: title, dtype: object


The movies we recommend are not very well known. Why is this happening? Let's check how many users rated each of these movies. We use the *groupby* method again, this time together with *agg*, as follows:

In [44]:
average_ratings_df = ratings_df.groupby(['movieId']).agg(count=('userId', 'size'), rating=('rating', 'mean')).reset_index()
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False)
joined_df = sorted_avg_ratings_df.join(movies_df, on='movieId')
print(joined_df[['title', 'count', 'rating']].head(10))

                                          title  count  rating
7638      Paper Birds (Pájaros de papel) (2010)      1     5.0
8089                 Act of Killing, The (2012)      1     5.0
9065                            Jump In! (2007)      1     5.0
9076                               Human (2015)      1     5.0
9078                        L.A. Slasher (2015)      1     5.0
4245                           Lady Jane (1986)      1     5.0
8136             Bill Hicks: Revelations (1993)      1     5.0
8130               Justice League: Doom (2012)       1     5.0
4240  Open Hearts (Elsker dig for evigt) (2002)      1     5.0
9104                     Formula of Love (1984)      1     5.0


We can see that all the movies we selected have been rated by only one user. Let's try to recommend only movies that have been rated by more than 20 users. Note that the *movieId* column is the index of the DataFrame objects returned by the *mean* and *count* methods.

In [48]:
minimum_ratings = 20
average_ratings_df = ratings_df.groupby(['movieId']).mean() #compute the average ratings for each movie
rating_counts_df = ratings_df.groupby(['movieId']).count() #compute the number of evaluations for each movie
average_ratings_df = average_ratings_df.loc[rating_counts_df['rating'] > minimum_ratings] #select the movies having the required number of evaluations
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False) #sort the movies according to the average ratings
joined_df = sorted_avg_ratings_df.join(movies_df['title'], on='movieId') #associate the title
print(joined_df[['title','rating']].head(10))

                                                 title    rating
movieId                                                         
318                   Shawshank Redemption, The (1994)  4.429022
922      Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)  4.333333
898                     Philadelphia Story, The (1940)  4.310345
475                   In the Name of the Father (1993)  4.300000
1204                         Lawrence of Arabia (1962)  4.300000
246                                 Hoop Dreams (1994)  4.293103
858                              Godfather, The (1972)  4.289062
1235                           Harold and Maude (1971)  4.288462
168252                                    Logan (2017)  4.280000
2959                                 Fight Club (1999)  4.272936


We can see that the recommendations now contain more famous movies.
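
The steps of this section can be wrapped in a small helper. The following is only a sketch: the function name `top_rated`, its parameters, and the toy data (including the movie titles) are assumptions made for illustration. The filter keeps the strict `>` comparison used in the cell above.

```python
import pandas as pd

def top_rated(ratings, movies, min_count=20, n=10):
    """Return the n movies with the highest mean rating among those with more than min_count ratings."""
    # Compute mean rating and number of ratings per movie in a single pass
    stats = ratings.groupby("movieId")["rating"].agg(["mean", "count"])
    # Keep only the movies rated by more than min_count users
    popular = stats[stats["count"] > min_count]
    ranked = popular.sort_values("mean", ascending=False).head(n)
    # Both frames are indexed by movieId, so a plain index join adds the title
    return ranked.join(movies["title"])

# Hypothetical toy data: movie 2 has the best average but only one rating
ratings_sample = pd.DataFrame({
    "userId": [1, 2, 3, 1, 1, 2, 3],
    "movieId": [1, 1, 1, 2, 3, 3, 3],
    "rating": [5.0, 4.0, 3.0, 5.0, 4.5, 4.5, 5.0],
})
movies_sample = pd.DataFrame(
    {"title": ["Movie A", "Movie B", "Movie C"]},
    index=pd.Index([1, 2, 3], name="movieId"),
)

result = top_rated(ratings_sample, movies_sample, min_count=2, n=2)
print(result)
```

On the real data, the equivalent call would be `top_rated(ratings_df, movies_df)`.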

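#### 3. Recommending by genre

We can also restrict the recommendations to a single genre. The code below repeats the previous steps and additionally keeps only the movies whose *genres* string contains the chosen genre (here, *Action*).
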
In [52]:
genre = 'Action'
minimum_ratings = 20

average_ratings_df = ratings_df.groupby(['movieId']).mean() #compute the average ratings for each movie
rating_counts_df = ratings_df.groupby(['movieId']).count() #compute the number of evaluations for each movie
average_ratings_df = average_ratings_df.loc[rating_counts_df['rating'] > minimum_ratings] #select the movies having the required number of evaluations
average_ratings_df = average_ratings_df.join(movies_df['genres'], on='movieId')
average_ratings_df = average_ratings_df.loc[average_ratings_df['genres'].str.contains(genre)]
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False) #sort the movies according to the average ratings
joined_df = sorted_avg_ratings_df.join(movies_df['title'], on='movieId') #associate the title
print(joined_df[['title','rating']].head(10))

                                                     title    rating
movieId                                                             
168252                                        Logan (2017)  4.280000
2959                                     Fight Club (1999)  4.272936
58559                              Dark Knight, The (2008)  4.238255
1197                            Princess Bride, The (1987)  4.232394
260              Star Wars: Episode IV - A New Hope (1977)  4.231076
3275                           Boondock Saints, The (2000)  4.220930
1208                                 Apocalypse Now (1979)  4.219626
1196     Star Wars: Episode V - The Empire Strikes Back...  4.215640
1233                          Boot, Das (Boat, The) (1981)  4.212500
1198     Raiders of the Lost Ark (Indiana Jones and the...  4.207500
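
Note that *str.contains* performs substring matching (and by default treats its argument as a regular expression), so a partial label would also match. A possible alternative, sketched below on a few toy rows, is to split the *genres* string and test for the exact label. The helper name `has_genre` is hypothetical.

```python
import pandas as pd

movies_sample = pd.DataFrame(
    {
        "title": ["Heat (1995)", "Sabrina (1995)", "GoldenEye (1995)"],
        "genres": ["Action|Crime|Thriller", "Comedy|Romance", "Action|Adventure|Thriller"],
    },
    index=pd.Index([6, 7, 10], name="movieId"),
)

def has_genre(genres, genre):
    """Boolean mask: True where the pipe-separated genres string contains exactly this label."""
    return genres.str.split("|").apply(lambda labels: genre in labels)

action_mask = has_genre(movies_sample["genres"], "Action")
print(movies_sample.loc[action_mask, "title"])

# A partial label such as "Act" matches as a substring but not as an exact label
substring_matches = int(movies_sample["genres"].str.contains("Act").sum())
exact_matches = int(has_genre(movies_sample["genres"], "Act").sum())
print(substring_matches, exact_matches)
```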


#### 4. Content Based Recommender


In [69]:
# use sklearn KNN

167
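
The cell above only hints at using the KNN implementation of scikit-learn. One possible way to do this, sketched here under the assumption that movies are described only by their genres, is to encode each movie as a binary genre vector and use *NearestNeighbors* with cosine distance to find the most similar movies. The toy frame (the first 10 movies shown earlier) and the helper `similar_movies` are assumptions made for this example.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

movies_sample = pd.DataFrame(
    {
        "title": [
            "Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)",
            "Waiting to Exhale (1995)", "Father of the Bride Part II (1995)",
            "Heat (1995)", "Sabrina (1995)", "Tom and Huck (1995)",
            "Sudden Death (1995)", "GoldenEye (1995)",
        ],
        "genres": [
            "Adventure|Animation|Children|Comedy|Fantasy", "Adventure|Children|Fantasy",
            "Comedy|Romance", "Comedy|Drama|Romance", "Comedy",
            "Action|Crime|Thriller", "Comedy|Romance", "Adventure|Children",
            "Action", "Action|Adventure|Thriller",
        ],
    },
    index=pd.Index(range(1, 11), name="movieId"),
)

# One binary column per genre (multi-hot encoding of the pipe-separated string)
genre_matrix = movies_sample["genres"].str.get_dummies(sep="|")

# Brute-force nearest neighbor search with cosine distance over the genre vectors
knn = NearestNeighbors(metric="cosine", algorithm="brute")
knn.fit(genre_matrix.values)

def similar_movies(movie_id, k=3):
    """Return the k movies whose genre vectors are closest to the given movie."""
    query = genre_matrix.loc[[movie_id]].values
    # Ask for k + 1 neighbors because the query movie is usually its own nearest neighbor
    distances, positions = knn.kneighbors(query, n_neighbors=k + 1)
    neighbor_ids = genre_matrix.index[positions[0]]
    result = pd.DataFrame({"distance": distances[0]}, index=neighbor_ids)
    # Remove the query movie itself (if present) and keep the k closest remaining movies
    result = result.drop(index=movie_id, errors="ignore").head(k)
    return result.join(movies_sample["title"])

neighbors_of_goldeneye = similar_movies(10)
print(neighbors_of_goldeneye)
```

To recommend for a user rather than for a single movie, one option is to run the same lookup for the movies that the user rated highly and merge the results.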


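#### 5. Collaborative Filtering Recommender

We first pick a random user who has provided more than 100 ratings, and then generate recommendations for that user with a user-based collaborative filtering algorithm.
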
In [95]:
users_ratings = ratings_df.groupby(['userId']).count()

selected = users_ratings['rating'] > 100
selected_users = users_ratings.loc[selected]
random_selected = selected_users.sample() # sample() returns one random row as a one-row dataframe; pass a number as argument to select more than one row
select_column_df = random_selected.reset_index()['userId'] # reset_index() creates a new index and turns userId into a column, which we can then select by name
selected_user = select_column_df.iloc[0] # iloc selects by position; since there is only one row, we read it at position 0
print("Selected user: " + str(selected_user))

Selected user: 495
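
Because *sample()* draws a different user on every run, the selected user (and therefore the recommendations) changes between executions. A minimal sketch of a reproducible selection, assuming a fixed seed is acceptable, passes *random_state* to *sample*. The toy data, the lowered threshold and the name `min_user_ratings` are only for illustration.

```python
import pandas as pd

ratings_sample = pd.DataFrame({
    "userId": [1, 1, 1, 2, 3, 3],
    "movieId": [1, 3, 6, 1, 3, 6],
    "rating": [4.0, 4.0, 4.0, 5.0, 3.0, 2.0],
})

min_user_ratings = 1  # the tutorial uses 100; lowered here for the toy data

# Number of ratings per user, then keep only the sufficiently active users
ratings_per_user = ratings_sample.groupby("userId")["rating"].count()
active_users = ratings_per_user[ratings_per_user > min_user_ratings]

# random_state fixes the seed, so the same user is picked on every run
selected_user = active_users.sample(n=1, random_state=42).index[0]
print("Selected user: " + str(selected_user))
```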


In [88]:
from lenskit.algorithms import Recommender
from lenskit.algorithms.user_knn import UserUser

# We need to rename the columns of our dataframe, lenskit needs the columns 'user', 'item' and 'rating'
rating_lenskit_df = ratings_df.rename(columns={'userId': 'user', 'movieId': 'item'})
# rating_lenskit_df = pd.read_csv(dataset_folder+"/ratings.csv", header=0, names=['user', 'item','rating', 'timestamp']) 
display(rating_lenskit_df.head(10))

   user  item  rating  timestamp
0     1     1     4.0  964982703
1     1     3     4.0  964981247
2     1     6     4.0  964982224
3     1    47     5.0  964983815
4     1    50     5.0  964982931
5     1    70     3.0  964982400
6     1   101     5.0  964980868
7     1   110     4.0  964982176
8     1   151     5.0  964984041
9     1   157     5.0  964984100


In [91]:
movies_lenskit_df = movies_df.copy(deep=True)
movies_lenskit_df.index.names = ['item']
display(movies_lenskit_df.head(10))

                                   title                                       genres
item
1                       Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
2                         Jumanji (1995)                   Adventure|Children|Fantasy
3                Grumpier Old Men (1995)                               Comedy|Romance
4               Waiting to Exhale (1995)                         Comedy|Drama|Romance
5     Father of the Bride Part II (1995)                                       Comedy
6                            Heat (1995)                        Action|Crime|Thriller
7                         Sabrina (1995)                               Comedy|Romance
8                    Tom and Huck (1995)                           Adventure|Children
9                    Sudden Death (1995)                                       Action
10                      GoldenEye (1995)                    Action|Adventure|Thriller


In [96]:
# We use the user-based collaborative filtering algorithm UserUser, which relies on the nearest neighbors
num_recs = 10  # Number of recommendations to generate
user_user = UserUser(15, min_nbrs=3)  # Minimum (3) and maximum (15) number of neighbors to consider
algo = Recommender.adapt(user_user)
algo.fit(rating_lenskit_df)
selected_movies = algo.recommend(selected_user, num_recs) # generate num_recs recommendations for the selected user
display(selected_movies)

    item     score
0   3030  5.746660
1   8542  5.588559
2   2511  5.583618
3   3494  5.581411
4   5075  5.486634
5   2936  5.458198
6   5747  5.431861
7   3606  5.419078
8  27156  5.404012
9   3508  5.340424


In [97]:
selected_movies.join(movies_lenskit_df['title'], on='item')

    item     score                                              title
0   3030  5.746660                                     Yojimbo (1961)
1   8542  5.588559                         Day at the Races, A (1937)
2   2511  5.583618                           Long Goodbye, The (1973)
3   3494  5.581411                                   True Grit (1969)
4   5075  5.486634                                 Waydowntown (2000)
5   2936  5.458198                          Sullivan's Travels (1941)
6   5747  5.431861                                   Gallipoli (1981)
7   3606  5.419078                                 On the Town (1949)
8  27156  5.404012  Neon Genesis Evangelion: The End of Evangelion...
9   3508  5.340424                     Outlaw Josey Wales, The (1976)
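
To give an intuition of what a user-based nearest neighbor algorithm computes, here is a simplified sketch in plain pandas and NumPy. It is not LensKit's implementation: it assumes mean-centered ratings, cosine similarity between users, and a similarity-weighted average of the neighbors' offsets, and all names and toy ratings are made up. With mean-centering, a predicted score is not bounded by the rating scale, which may explain why the scores shown above are larger than 5.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings, using the column names that lenskit expects
ratings_sample = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "item": [1, 2, 3, 1, 2, 4, 1, 2, 4],
    "rating": [5.0, 4.0, 5.0, 4.0, 2.0, 5.0, 2.0, 5.0, 1.0],
})

# User x item matrix; NaN marks a missing rating
matrix = ratings_sample.pivot(index="user", columns="item", values="rating")
user_means = matrix.mean(axis=1)
# Subtract each user's mean rating; missing ratings become 0 (no deviation)
centered = matrix.sub(user_means, axis=0).fillna(0.0)

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def predict(user, item, k=2):
    """Predict a rating from the k most similar users who rated the item."""
    rated_item = matrix[item].notna().to_numpy()
    raters = matrix.index[rated_item & (matrix.index != user)]
    sims = pd.Series(
        {r: cosine(centered.loc[user].values, centered.loc[r].values) for r in raters},
        dtype=float,
    )
    # Keep only positively correlated neighbors, at most k of them
    sims = sims[sims > 0].nlargest(k)
    if sims.empty:
        return float("nan")
    offsets = centered.loc[sims.index, item]
    # User's own mean plus the similarity-weighted average of the neighbors' offsets
    return float(user_means[user] + (sims * offsets).sum() / sims.sum())

prediction = predict("a", 4)
print(prediction)  # about 6.0 on this toy data, above the 5-star maximum
```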
