# Recommender Systems - Tutorial 1 (Lab 1)
### Individual Recommender Systems

In this tutorial, we implement some of the strategies for building individual recommender systems presented in the first lectures of the course. Along the way, we introduce some Python libraries that are useful for machine learning applications.

#### 1. Dataset

First, we need some data. In this tutorial, we will use the [MovieLens latest small](https://www.kaggle.com/grouplens/movielens-latest-small) dataset. Other datasets can be found on Canvas, in the [Project Resources](https://canvas.maastrichtuniversity.nl/courses/7954/pages/project-resources?module_item_id=203858) module. We download the dataset and place its files in the folder *movielens_dataset_small*. The dataset contains several CSV files. We start by using the [pandas](https://pandas.pydata.org/) library to read the content of the CSV files and work with it.

Let's look at the first 10 rows to check what the *ratings.csv* and *movies.csv* files contain. To do so, we use the *read_csv* function, which returns a pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) object containing the information read from the CSV file. This object provides several methods to manipulate its data; here we use the [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) method, which returns the first *n* rows of the DataFrame.

In [51]:
movielens_dataset_folder = "./movielens_dataset_small"

import pandas as pd

In [52]:
ratings_df = pd.read_csv(movielens_dataset_folder+"/ratings.csv") 
print(ratings_df.head(10))

   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931
5       1       70     3.0  964982400
6       1      101     5.0  964980868
7       1      110     4.0  964982176
8       1      151     5.0  964984041
9       1      157     5.0  964984100


In [58]:
movies_df = pd.read_csv(movielens_dataset_folder+"/movies.csv")
print(movies_df.head(10))

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   
5        6                         Heat (1995)   
6        7                      Sabrina (1995)   
7        8                 Tom and Huck (1995)   
8        9                 Sudden Death (1995)   
9       10                    GoldenEye (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
5                        Action|Crime|Thriller  
6                               Comedy|Romance  
7                           Adventure|Children  
8                                       Action  
9                    Action|Adventure|Thriller  

The *ratings.csv* file contains the ratings given by each user to a movie, together with the corresponding timestamp. The *movies.csv* file provides the title and the genres of each movie.
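
Before moving on, it can help to run a few sanity checks on a freshly loaded table. The following is a minimal sketch on a three-row toy copy of the ratings shown above (read from an in-memory string instead of the real file, so the names `csv_text` and `toy_ratings` are only illustrative). It also converts the *timestamp* column, which in MovieLens stores seconds since the Unix epoch, into readable dates.

```python
import io
import pandas as pd

# Toy stand-in for ratings.csv: the first three rows of the output shown above.
csv_text = """userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
"""
toy_ratings = pd.read_csv(io.StringIO(csv_text))

# Basic checks: table size, column types, number of distinct users and movies.
print(toy_ratings.shape)
print(toy_ratings.dtypes)
n_users = toy_ratings['userId'].nunique()
n_movies = toy_ratings['movieId'].nunique()
print(n_users, n_movies)

# Convert the timestamps (seconds since 1970-01-01 UTC) into datetime values.
toy_ratings['rated_at'] = pd.to_datetime(toy_ratings['timestamp'], unit='s')
print(toy_ratings[['movieId', 'rating', 'rated_at']])
```

The same calls should work unchanged on the full *ratings_df*.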

#### 2. Recommending highest rated movies

A first idea for recommending movies to our users is to use the available ratings to find the highest rated movies. To do so, we first compute the average rating of each movie, and then select the top rated ones. Let's use the methods provided by the DataFrame object again. In particular, we can use the [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) method to group the rows that share the same value in a given column. The method returns a [GroupBy](https://pandas.pydata.org/docs/reference/groupby.html) object, which provides methods to aggregate the values of each group. To compute the average ratings, we use the [mean](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.mean.html) method, which returns a DataFrame with one row per movie and the average rating in the *rating* column. Finally, we use the [sort_values](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) method to sort the movies by rating, and we print the first 10 movies.

In [59]:
average_ratings_df = ratings_df.groupby(['movieId']).mean()
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False)

print(sorted_avg_ratings_df.head(10))

         userId  rating     timestamp
movieId                              
88448     483.0     5.0  1.315438e+09
100556    462.0     5.0  1.456151e+09
143031     89.0     5.0  1.520409e+09
143511    105.0     5.0  1.526207e+09
143559     89.0     5.0  1.520410e+09
6201      474.0     5.0  1.100120e+09
102217     63.0     5.0  1.443200e+09
102084    380.0     5.0  1.493422e+09
6192      182.0     5.0  1.063275e+09
145994    105.0     5.0  1.526207e+09
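
Note that the output above also averages the *userId* and *timestamp* columns, which has no useful meaning. As a small sketch on made-up toy data (the values below are invented for illustration), selecting the *rating* column before calling *mean* yields only the average rating per movie:

```python
import pandas as pd

# Invented toy ratings: three movies rated by up to three users.
toy_ratings = pd.DataFrame({
    'userId':  [1, 2, 3, 1, 2, 3],
    'movieId': [10, 10, 10, 20, 20, 30],
    'rating':  [4.0, 5.0, 3.0, 2.0, 3.0, 5.0],
})

# Group by movie, keep only the rating column, then average within each group.
avg_rating = toy_ratings.groupby('movieId')['rating'].mean()

# Sort from the highest to the lowest average rating.
top = avg_rating.sort_values(ascending=False)
print(top.head(10))
```

Here movie 30 comes first even though it has a single rating, which is the same effect that shows up on the real data.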


Let's use the information in *movies_df* to associate a title with each movie, and then print the list of recommended movies. We will use the [join](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html) method of the DataFrame object. Since *join* matches against the index of the other DataFrame, we first need to set the *movieId* column of *movies_df* as its index. We use the [set_index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html) method for this.

In [60]:
movies_df = movies_df.set_index('movieId')
joined_df = sorted_avg_ratings_df.join(movies_df, on='movieId')
print(joined_df['title'].head(10))

movieId
88448         Paper Birds (Pájaros de papel) (2010)
100556                   Act of Killing, The (2012)
143031                              Jump In! (2007)
143511                                 Human (2015)
143559                          L.A. Slasher (2015)
6201                               Lady Jane (1986)
102217               Bill Hicks: Revelations (1993)
102084                 Justice League: Doom (2012) 
6192      Open Hearts (Elsker dig for evigt) (2002)
145994                       Formula of Love (1984)
Name: title, dtype: object
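
To see the mechanics of *join* in isolation, here is a minimal sketch on invented toy data (the titles and ids are placeholders). It also shows an equivalent formulation with *merge*, which matches on a column instead of the index.

```python
import pandas as pd

# Average ratings indexed by movieId, already sorted by rating (toy values).
avg = pd.DataFrame({'rating': [5.0, 4.0]},
                   index=pd.Index([30, 10], name='movieId'))

# Movie titles with movieId as a regular column (placeholder titles).
titles = pd.DataFrame({'movieId': [10, 20, 30],
                       'title': ['A (1990)', 'B (1991)', 'C (1992)']})

# join matches the index of `avg` against the index of the other DataFrame,
# so movieId must first become the index of `titles`.
joined = avg.join(titles.set_index('movieId'))
print(joined)

# Equivalent result with merge, matching on the movieId column.
merged = avg.reset_index().merge(titles, on='movieId', how='left')
print(merged)
```

Both variants keep the row order of the left DataFrame, so the sorting by rating is preserved.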


The movies we recommend are not very well known. Why is this happening? Let's check how many people rated each of these movies. We use the *groupby* method again, and then the [agg](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.agg.html) method of the GroupBy object to specify a different aggregation for each column: for *userId*, we count the number of rows in the group using *'size'*, while for *rating* we want the average value, so we specify *'mean'*. The rest of the operations are the same as before:

In [61]:
average_ratings_df = ratings_df.groupby(['movieId']).agg(count=('userId', 'size'), rating=('rating', 'mean')).reset_index()
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False)
joined_df = sorted_avg_ratings_df.join(movies_df, on='movieId')
print(joined_df[['title', 'count', 'rating']].head(10))

                                          title  count  rating
7638      Paper Birds (Pájaros de papel) (2010)      1     5.0
8089                 Act of Killing, The (2012)      1     5.0
9065                            Jump In! (2007)      1     5.0
9076                               Human (2015)      1     5.0
9078                        L.A. Slasher (2015)      1     5.0
4245                           Lady Jane (1986)      1     5.0
8136             Bill Hicks: Revelations (1993)      1     5.0
8130               Justice League: Doom (2012)       1     5.0
4240  Open Hearts (Elsker dig for evigt) (2002)      1     5.0
9104                     Formula of Love (1984)      1     5.0
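
The named-aggregation syntax used above, `new_column=(source_column, function)`, can be tried on a small invented dataset. This is only a sketch with toy values; sorting on both columns is an extra step that is not part of the cell above.

```python
import pandas as pd

# Invented toy ratings: movie 10 has three ratings, movie 20 two, movie 30 one.
toy_ratings = pd.DataFrame({
    'userId':  [1, 2, 3, 1, 2, 3],
    'movieId': [10, 10, 10, 20, 20, 30],
    'rating':  [4.0, 5.0, 3.0, 2.0, 3.0, 5.0],
})

# One output column per keyword: the number of rows and the mean rating per movie.
stats = toy_ratings.groupby('movieId').agg(
    count=('userId', 'size'),
    rating=('rating', 'mean'),
)

# Sort by average rating, breaking ties with the number of ratings.
stats = stats.sort_values(by=['rating', 'count'], ascending=False)
print(stats)
```

Having the count next to the average makes it easy to spot top positions that rest on very few ratings.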


We can see that all the movies we selected have been rated by only one user. Let's try to recommend only movies that have been rated by more than 20 users. Note that *movieId* is the index of the DataFrame objects obtained with the *mean* and *count* methods, so the two results are aligned on the same labels. We will use the [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) attribute of the DataFrame object, which gives access to a group of rows and columns by label(s) or by a boolean array. In particular, we build a boolean Series: the expression `rating_counts_df['rating'] > minimum_ratings` evaluates the condition for each row of the *rating* column, which after *count* holds the number of ratings of each movie. Hence, *loc* only returns the rows for which the condition is true.

In [64]:
minimum_ratings = 20
average_ratings_df = ratings_df.groupby(['movieId']).mean() #compute the average ratings for each movie
rating_counts_df = ratings_df.groupby(['movieId']).count() #compute the number of evaluations for each movie
average_ratings_df = average_ratings_df.loc[rating_counts_df['rating'] > minimum_ratings] #select the movies having the required number of evaluations
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False) #sort the movies according to the average ratings
joined_df = sorted_avg_ratings_df.join(movies_df['title'], on='movieId') #associate the title
print(joined_df[['title','rating']].head(10))

                                                 title    rating
movieId                                                         
318                   Shawshank Redemption, The (1994)  4.429022
922      Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)  4.333333
898                     Philadelphia Story, The (1940)  4.310345
475                   In the Name of the Father (1993)  4.300000
1204                         Lawrence of Arabia (1962)  4.300000
246                                 Hoop Dreams (1994)  4.293103
858                              Godfather, The (1972)  4.289062
1235                           Harold and Maude (1971)  4.288462
168252                                    Logan (2017)  4.280000
2959                                 Fight Club (1999)  4.272936
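
The filtering step relies on a boolean Series that shares its index with the DataFrame being filtered. A minimal sketch on invented values (with a threshold of 2 instead of 20) shows the mechanism, and also the difference between `>` ("more than") and `>=` ("at least"):

```python
import pandas as pd

# Toy per-movie statistics indexed by movieId (invented values).
stats = pd.DataFrame({'count': [3, 2, 1], 'rating': [4.0, 2.5, 5.0]},
                     index=pd.Index([10, 20, 30], name='movieId'))
minimum_ratings = 2

# Boolean Series with the same index as `stats`: True where the condition holds.
mask = stats['count'] > minimum_ratings
print(mask)

# loc keeps only the rows whose mask value is True.
more_than = stats.loc[mask].sort_values(by='rating', ascending=False)

# With >= the movies having exactly `minimum_ratings` ratings are kept as well.
at_least = stats.loc[stats['count'] >= minimum_ratings]
print(more_than)
print(at_least)
```

Which of the two comparisons is appropriate depends on how the threshold is meant to be read.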


We can see that the recommendations now contain better-known movies.

#### 3. Recommending by genre

We want more personalized recommendations. One possible solution is to suggest the movies with the highest ratings for a specific genre. Again, we only select movies rated by more than 20 users, using the same methods as in the previous example. To select the movies by genre, we need to check whether the value in the *genres* column contains the specified genre. We can write the condition as `average_ratings_df['genres'].str.contains(genre)`.

In [66]:
genre = 'Action'
minimum_ratings = 20

average_ratings_df = ratings_df.groupby(['movieId']).mean() #compute the average ratings for each movie
rating_counts_df = ratings_df.groupby(['movieId']).count() #compute the number of evaluations for each movie
average_ratings_df = average_ratings_df.loc[rating_counts_df['rating'] > minimum_ratings] #select the movies having the required number of evaluations
average_ratings_df = average_ratings_df.join(movies_df['genres'], on='movieId')
average_ratings_df = average_ratings_df.loc[average_ratings_df['genres'].str.contains(genre)]
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False) #sort the movies according to the average ratings
joined_df = sorted_avg_ratings_df.join(movies_df['title'], on='movieId') #associate the title
print(joined_df[['title','rating']].head(10))

                                                     title    rating
movieId                                                             
168252                                        Logan (2017)  4.280000
2959                                     Fight Club (1999)  4.272936
58559                              Dark Knight, The (2008)  4.238255
1197                            Princess Bride, The (1987)  4.232394
260              Star Wars: Episode IV - A New Hope (1977)  4.231076
3275                           Boondock Saints, The (2000)  4.220930
1208                                 Apocalypse Now (1979)  4.219626
1196     Star Wars: Episode V - The Empire Strikes Back...  4.215640
1233                          Boot, Das (Boat, The) (1981)  4.212500
1198     Raiders of the Lost Ark (Indiana Jones and the...  4.207500
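
One caveat of `str.contains` is that it looks for a substring (and, by default, interprets the pattern as a regular expression). As a hedged sketch on toy data, where the second and third movies are invented placeholders, splitting the *genres* string on `|` allows matching whole genre names only:

```python
import pandas as pd

# Toy movies; only the first row is taken from the dataset shown earlier.
toy_movies = pd.DataFrame({
    'title': ['Heat (1995)', 'Placeholder Noir (1950)', 'Placeholder Comedy (1995)'],
    'genres': ['Action|Crime|Thriller', 'Crime|Film-Noir', 'Comedy|Romance'],
})

def has_genre(genres_column, genre):
    # Split 'A|B|C' into ['A', 'B', 'C'] and test for an exact list member.
    return genres_column.str.split('|').apply(lambda names: genre in names)

# Substring search: 'Film' matches 'Film-Noir' even though it is not a genre.
substring_match = toy_movies['genres'].str.contains('Film', regex=False)

# Exact search: no movie has a genre called exactly 'Film'.
exact_film = has_genre(toy_movies['genres'], 'Film')
exact_crime = has_genre(toy_movies['genres'], 'Crime')
print(substring_match.tolist(), exact_film.tolist(), exact_crime.tolist())
```

For a genre such as *Action* the two approaches select the same movies, so the results above are not affected.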


#### 4. Content Based Recommender

Let's try to implement some of the strategies seen during the lectures to provide more personalized recommendations. We start with a content-based approach. Of course, we need some content to process, and the dataset we are using only provides the genres of each movie. So, let's start by retrieving the movie plots. There are different options for this. Several online APIs provide this information, but they require many separate calls and each API has a daily request limit. Hence, to save some time, we will use a dataset that already provides this information. The [Wikipedia Movie Plots](https://www.kaggle.com/jrobischon/wikipedia-movie-plots) dataset provides descriptions of 34,886 movies from around the world. Let's open it and check its content.

Note that, from here on, we will stop using the *print* function to show the DataFrames. Instead, we will use the [display](https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html#IPython.display.display) function, which renders them nicely in a Jupyter notebook.

In [71]:
wiki_plots_dataset_folder = "./wiki_movie_plots_deduped"
plots_df = pd.read_csv(wiki_plots_dataset_folder+"/wiki_movie_plots_deduped.csv") 
display(plots_df.head(10))

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...
5,1903,Alice in Wonderland,American,Cecil Hepworth,May Clark,unknown,https://en.wikipedia.org/wiki/Alice_in_Wonderl...,"Alice follows a large white rabbit down a ""Rab..."
6,1903,The Great Train Robbery,American,Edwin S. Porter,,western,https://en.wikipedia.org/wiki/The_Great_Train_...,The film opens with two bandits breaking into ...
7,1904,The Suburbanite,American,Wallace McCutcheon,,comedy,https://en.wikipedia.org/wiki/The_Suburbanite,The film is about a family who move to the sub...
8,1905,The Little Train Robbery,American,Edwin Stanton Porter,,unknown,https://en.wikipedia.org/wiki/The_Little_Train...,The opening scene shows the interior of the ro...
9,1905,The Night Before Christmas,American,Edwin Stanton Porter,,unknown,https://en.wikipedia.org/wiki/The_Night_Before...,Scenes are introduced using lines of the poem....


The movie plots are contained in the *Plot* column. But how can we combine this information with our previous dataset? Let's have a look at the information in our *movies_df* DataFrame.

In [72]:
movies_df.head(10)

Unnamed: 0_level_0,title,genres
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller


Since the ids are different in the two datasets, we need to use the *title* to combine the information. Furthermore, since different movies might have the same title, we will also use the year. So, to start, we need to separate these two pieces of information, which are stored in the same column of our *movies_df* dataset. To do so, we will use the [apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) method of the DataFrame object. This method takes a function as input. In the code below, where we pass `axis=1`, the function is called on each row of the DataFrame on which *apply* is invoked.

Hence, we define a function `compute_title_year` that takes a row as input. This function checks whether the *title* of the movie contains the year (for some of the movies the year is not present). If so, the *year* is extracted from the *title*, and a new cleaned title is also computed. We also use some methods of the [Text Sequence Type - str](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str). The comments explain what each instruction does; for a more detailed reference of each method, follow this [link](https://docs.python.org/3/library/stdtypes.html#string-methods).

In [73]:
def compute_title_year(row):
    # read the title of the specified row and cast it to the str type,
    # then use the strip() method to remove spaces from the beginning and the end of the string
    title = str(row['title']).strip()

    # select the four characters before the closing parenthesis:
    # from the fifth-to-last character (included) to the last one (excluded)
    year = title[-5:-1]

    # use the isdigit() method to check whether the selected substring is a number
    if year.isdigit():
        # remove the year from the title:
        # select the substring from the beginning up to the sixth-to-last character (excluded),
        # then use the strip() method to remove spaces at the beginning and at the end of the string
        # finally, use the lower() method to obtain the title in lower case
        title_clean = title[:-6].strip().lower()

        # return the year and the cleaned title
        return int(year), title_clean
    else:
        # no year found: return 0 as the year and the title in lower case,
        # so that title_clean is lower case for every row
        return 0, title.lower()

Now, we use the *apply* method to execute the *compute_title_year* function on all the rows of our DataFrame. We assign the two resulting columns to the new columns *year* and *title_clean*.

In [74]:
movies_df = movies_df.reset_index()
movies_df[['year', 'title_clean']] = movies_df.apply(compute_title_year, axis=1, result_type ='expand')
display(movies_df.head(10))

Unnamed: 0,movieId,title,genres,year,title_clean
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,toy story
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,jumanji
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,grumpier old men
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,waiting to exhale
4,5,Father of the Bride Part II (1995),Comedy,1995,father of the bride part ii
5,6,Heat (1995),Action|Crime|Thriller,1995,heat
6,7,Sabrina (1995),Comedy|Romance,1995,sabrina
7,8,Tom and Huck (1995),Adventure|Children,1995,tom and huck
8,9,Sudden Death (1995),Action,1995,sudden death
9,10,GoldenEye (1995),Action|Adventure|Thriller,1995,goldeneye
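
As an alternative to a row-wise *apply*, the same split can be sketched with the vectorized string methods of pandas. The example below is a hedged variant, not the approach used in this tutorial: it uses a regular expression that requires the year to be enclosed in parentheses at the end of the title, which is slightly stricter than the character-position check above. The third title is a made-up example without a year.

```python
import pandas as pd

titles = pd.Series([
    "Toy Story (1995)",
    "Justice League: Doom (2012) ",  # trailing space, as in the data shown earlier
    "Untitled Movie",                # made-up title without a year
])

stripped = titles.str.strip()

# Two named groups: everything before "(YYYY)" and the four-digit year itself.
# Rows that do not match the pattern get NaN in both columns.
parts = stripped.str.extract(r'^(?P<title_clean>.*)\((?P<year>\d{4})\)$')

# Missing years become 0, mirroring the convention of compute_title_year.
year = pd.to_numeric(parts['year']).fillna(0).astype(int)

# Fall back to the full title when no year was found, then clean and lower-case.
title_clean = parts['title_clean'].fillna(stripped).str.strip().str.lower()

result = pd.DataFrame({'title': titles, 'year': year, 'title_clean': title_clean})
print(result)
```

On large tables the vectorized version is usually faster, while the explicit function is easier to read and to adapt.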


Now we need to compute a lowercase version of the titles in the *plots_df* DataFrame, in order to make our comparisons easier. We can do it by operating directly on the *Title* column. We select the column, obtaining a [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) object. On that, we use the [str](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html) accessor, and then the *strip* method to remove spaces from the beginning and the end of each string. The result is a new Series object, so we need to use the *str* accessor again before we can apply the *lower* method. At the end, we assign the final Series object to the new column *title_clean*.

In [77]:
plots_df['title_clean'] = plots_df['Title'].str.strip().str.lower()
display(plots_df.head(10))

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot,title_clean
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr...",kansas saloon smashers
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov...",love by the light of the moon
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed...",the martyred presidents
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...,"terrible teddy, the grizzly king"
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...,jack and the beanstalk
5,1903,Alice in Wonderland,American,Cecil Hepworth,May Clark,unknown,https://en.wikipedia.org/wiki/Alice_in_Wonderl...,"Alice follows a large white rabbit down a ""Rab...",alice in wonderland
6,1903,The Great Train Robbery,American,Edwin S. Porter,,western,https://en.wikipedia.org/wiki/The_Great_Train_...,The film opens with two bandits breaking into ...,the great train robbery
7,1904,The Suburbanite,American,Wallace McCutcheon,,comedy,https://en.wikipedia.org/wiki/The_Suburbanite,The film is about a family who move to the sub...,the suburbanite
8,1905,The Little Train Robbery,American,Edwin Stanton Porter,,unknown,https://en.wikipedia.org/wiki/The_Little_Train...,The opening scene shows the interior of the ro...,the little train robbery
9,1905,The Night Before Christmas,American,Edwin Stanton Porter,,unknown,https://en.wikipedia.org/wiki/The_Night_Before...,Scenes are introduced using lines of the poem....,the night before christmas


Now we have all the information needed to merge the two DataFrames. In this case we use the [merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function, which works similarly to a database join. We specify the two datasets, then `how='inner'` to perform an inner join (hence we only keep the movies for which we have the corresponding plot), and then, in matching order, the join attributes for the first DataFrame (`left_on=['title_clean','year']`) and for the second DataFrame (`right_on = ['title_clean','Release Year']`).

We also use the [shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) attribute to print the number of movies for which we have the plot.

In [81]:
joined_plot_df = pd.merge(movies_df, plots_df,  how='inner', left_on=['title_clean','year'], right_on = ['title_clean','Release Year'])
print("Number of movies: " + str(joined_plot_df.shape[0]))      
display(joined_plot_df.head(10))

Number of movies: 4878


Unnamed: 0,movieId,title,genres,year,title_clean,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,toy story,1995,Toy Story,American,John Lasseter,"Tim Allen, Tom Hanks (voices)",animated film,https://en.wikipedia.org/wiki/Toy_Story,In a world where toys are living things who pr...
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,jumanji,1995,Jumanji,American,Joe Johnston,"Robin Williams, Bonnie Hunt, Kirsten Dunst, Br...","family, fantasy",https://en.wikipedia.org/wiki/Jumanji_(film),"In 1869, near Brantford, New Hampshire, two br..."
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,grumpier old men,1995,Grumpier Old Men,American,Howard Deutch,"Jack Lemmon, Walter Matthau, Ann-Margret, Soph...",comedy,https://en.wikipedia.org/wiki/Grumpier_Old_Men,The feud between Max (Walter Matthau) and John...
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,waiting to exhale,1995,Waiting to Exhale,American,Forest Whitaker,"Whitney Houston, Angela Bassett, Loretta Devin...",drama,https://en.wikipedia.org/wiki/Waiting_to_Exhale,"""Friends are the People who let you be yoursel..."
4,5,Father of the Bride Part II (1995),Comedy,1995,father of the bride part ii,1995,Father of the Bride Part II,American,Charles Shyer,"Steve Martin, Diane Keaton, Martin Short, Kimb...",comedy,https://en.wikipedia.org/wiki/Father_of_the_Br...,The film begins five years after the events of...
5,6,Heat (1995),Action|Crime|Thriller,1995,heat,1995,Heat,American,Michael Mann,"Al Pacino, Robert De Niro, Val Kilmer, Ashley ...","crime drama, action",https://en.wikipedia.org/wiki/Heat_(1995_film),"Neil McCauley, a career criminal, hires Waingr..."
6,7,Sabrina (1995),Comedy|Romance,1995,sabrina,1995,Sabrina,American,Sydney Pollack,"Harrison Ford, Julia Ormond, Greg Kinnear, Nan...",romantic comedy,https://en.wikipedia.org/wiki/Sabrina_(1995_film),Sabrina Fairchild is the young daughter of the...
7,8,Tom and Huck (1995),Adventure|Children,1995,tom and huck,1995,Tom and Huck,American,Peter Hewitt,"Jonathan Taylor Thomas, Brad Renfro","comedy, family",https://en.wikipedia.org/wiki/Tom_and_Huck,The movie opens with Injun Joe (Eric Schweig) ...
8,9,Sudden Death (1995),Action,1995,sudden death,1995,Sudden Death,American,Peter Hyams,"Jean-Claude Van Damme, Powers Boothe, Dorian H...",action,https://en.wikipedia.org/wiki/Sudden_Death_(19...,Darren McCord (Jean-Claude Van Damme) is a Fre...
9,10,GoldenEye (1995),Action|Adventure|Thriller,1995,goldeneye,1995,GoldenEye,British,Martin Campbell,"Pierce Brosnan, Sean Bean, Famke Janssen",spy/action,https://en.wikipedia.org/wiki/GoldenEye,"In 1986, at Arkhangelsk, MI6 agents James Bond..."
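
To illustrate why the year is part of the join key, and how to inspect the movies that find no plot, here is a small sketch on invented toy tables (the ids and plots are placeholders). It uses the `indicator=True` option of *merge*, which adds a `_merge` column describing where each row comes from.

```python
import pandas as pd

# Toy movie table: two different movies share the title "sabrina".
toy_movies = pd.DataFrame({
    'movieId': [100, 101, 102],
    'title_clean': ['heat', 'sabrina', 'sabrina'],
    'year': [1995, 1995, 1954],
})

# Toy plot table: two different movies share the title "heat".
toy_plots = pd.DataFrame({
    'title_clean': ['heat', 'sabrina', 'heat'],
    'Release Year': [1995, 1995, 1986],
    'Plot': ['plot A', 'plot B', 'plot C'],
})

# Inner join on (title, year): only exact matches on both keys survive.
inner = pd.merge(toy_movies, toy_plots, how='inner',
                 left_on=['title_clean', 'year'],
                 right_on=['title_clean', 'Release Year'])

# Joining on the title alone pairs movies with plots of their namesakes.
title_only = pd.merge(toy_movies, toy_plots, on='title_clean')

# A left join with indicator=True reveals the movies left without a plot.
left = pd.merge(toy_movies, toy_plots, how='left', indicator=True,
                left_on=['title_clean', 'year'],
                right_on=['title_clean', 'Release Year'])
missing = left.loc[left['_merge'] == 'left_only', ['movieId', 'title_clean', 'year']]
print(len(inner), len(title_only))
print(missing)
```

Inspecting the unmatched rows in this way can suggest further title clean-up steps.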


Now that we have all the information we need, let's start implementing the recommender for a specific user. For simplicity, we select a random user from the *ratings_df* DataFrame with more than 200 ratings. First, we select all the users with more than 200 ratings, and then we use the [sample](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) method to extract a random row. Finally, we extract the *userId* column and read its content with the [iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) indexer.

In [82]:
users_ratings = ratings_df.groupby(['userId']).count()

selected = users_ratings['rating'] > 200
selected_users = users_ratings.loc[selected]
random_selected = selected_users.sample() # sample() returns one random row as a single-row DataFrame; pass n to draw more than one row
select_column_df = random_selected.reset_index()['userId'] # reset_index() creates a new index and turns userId back into a column, which we can then select by name
selected_user = select_column_df.iloc[0] # iloc selects by position; the Series has a single element, so we read position 0
print("Selected user: " + str(selected_user))

Selected user: 560
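
Because *sample* draws at random, running the cell again will generally select a different user. If reproducible results are preferred, a seed can be passed through the `random_state` parameter. The sketch below uses invented toy ratings and a lower threshold; the seed value 42 is arbitrary.

```python
import pandas as pd

# Invented toy ratings: user 1 has 3 ratings, user 2 has 1, user 3 has 4.
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 3, 3, 3, 3],
    'movieId': [10, 20, 30, 10, 10, 20, 30, 40],
    'rating':  [4.0, 3.0, 5.0, 2.0, 4.5, 3.5, 1.0, 5.0],
})

# Number of ratings per user, as a Series indexed by userId.
counts = toy_ratings.groupby('userId').size()

# Keep only the users with more than 2 ratings.
eligible = counts[counts > 2]

# Draw one user; the same random_state always yields the same choice.
selected_user = eligible.sample(n=1, random_state=42).index[0]
selected_again = eligible.sample(n=1, random_state=42).index[0]
print("Selected user: " + str(selected_user))
```

Working on the Series returned by *size* also avoids the *reset_index* step, since the user id is read directly from the index.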


Now we need to select the ratings of this user, which we will use to train our recommender. We print the number of ratings available for the selected user, and the extracted dataset ordered by *movieId*.

In [83]:
selected_user_ratings = ratings_df.loc[ratings_df['userId'] == selected_user]
selected_user_ratings = selected_user_ratings.sort_values(by='movieId', ascending=True)
print("Rated movies: " + str(selected_user_ratings.shape[0]))
display(selected_user_ratings.head(10))

Rated movies: 458


Unnamed: 0,userId,movieId,rating,timestamp
86124,560,1,3.0,1469653413
86125,560,32,3.5,1469647882
86126,560,34,2.5,1469647264
86127,560,47,4.0,1469648029
86128,560,48,2.5,1469654312
86129,560,50,4.0,1469647239
86130,560,70,4.0,1469653546
86131,560,97,4.0,1469653182
86132,560,110,4.0,1469652439
86133,560,150,4.0,1469648079


Now we need to select the content of the rated movies, from which we will compute the features used by our model. We do so by selecting the rows of our *joined_plot_df* DataFrame whose *movieId* is in the list of *movieId* values of the *selected_user_ratings* DataFrame, which only contains the ratings of the selected user. We do this using the [isin](https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html) method. Finally, we only keep the columns *movieId*, *title_clean* and *Plot*, and we set the *movieId* column as the index of the DataFrame. Note that fewer movies remain than the user rated, because not every rated movie has a plot in *joined_plot_df*.

In [84]:
rated_movies = joined_plot_df.loc[joined_plot_df['movieId'].isin(list(selected_user_ratings['movieId']))]
rated_movies = rated_movies[['movieId', 'title_clean', 'Plot']].set_index('movieId')
print("Rated movies: " + str(rated_movies.shape[0]))
display(rated_movies.head(10))

Rated movies: 284


Unnamed: 0_level_0,title_clean,Plot
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,toy story,In a world where toys are living things who pr...
34,babe,"Babe, an orphaned piglet, is chosen for a ""gue..."
48,pocahontas,"In 1607, the Susan Constant sails to the New W..."
70,from dusk till dawn,Fugitive bank robbers Seth and Richie Gecko ho...
110,braveheart,"In 1280, King Edward ""Longshanks"" invades and ..."
150,apollo 13,"In July 1969, astronaut Jim Lovell hosts a hou..."
163,desperado,"At the Tarasco bar in Mexico, an American man ..."
208,waterworld,Long after the melting of the polar ice caps i...
223,clerks,"Dante Hicks, a 22-year-old retail clerk at the..."
288,natural born killers,Mickey Knox and his wife Mallory stop at a din...


We use a similar approach to select the unrated movies. The only difference is that we need to select the rows of our *joined_plot_df* DataFrame whose *movieId* is **NOT** in the list of *movieId* values of the *selected_user_ratings* DataFrame. We can do it by simply negating the condition used in the previous example, with the `~` operator.

In [86]:
unrated_movies = joined_plot_df.loc[~joined_plot_df['movieId'].isin(list(selected_user_ratings['movieId']))]
unrated_movies = unrated_movies[['movieId', 'title_clean', 'Plot']].set_index('movieId')
print("Unrated movies: " + str(unrated_movies.shape[0]))
display(unrated_movies.head(10))

Unrated movies: 4594


Unnamed: 0_level_0,title_clean,Plot
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
2,jumanji,"In 1869, near Brantford, New Hampshire, two br..."
3,grumpier old men,The feud between Max (Walter Matthau) and John...
4,waiting to exhale,"""Friends are the People who let you be yoursel..."
5,father of the bride part ii,The film begins five years after the events of...
6,heat,"Neil McCauley, a career criminal, hires Waingr..."
7,sabrina,Sabrina Fairchild is the young daughter of the...
8,tom and huck,The movie opens with Injun Joe (Eric Schweig) ...
9,sudden death,Darren McCord (Jean-Claude Van Damme) is a Fre...
10,goldeneye,"In 1986, at Arkhangelsk, MI6 agents James Bond..."
12,dracula: dead and loving it,Solicitor Thomas Renfield travels all the way ...
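
The split into rated and unrated movies can be checked on a small invented example. The sketch below (toy catalog and ratings, placeholder names) builds the boolean mask once and reuses it with and without negation; *isin* also accepts a Series directly, so the conversion to a list is optional.

```python
import pandas as pd

# Toy catalog of movies for which a plot is available (placeholder titles).
catalog = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'title_clean': ['movie a', 'movie b', 'movie c', 'movie d'],
})

# Toy ratings of one user; movie 99 has no plot, so it is not in the catalog.
user_ratings = pd.DataFrame({
    'movieId': [1, 3, 99],
    'rating': [3.0, 4.0, 5.0],
})

# True for the catalog rows whose movieId appears among the user's ratings.
is_rated = catalog['movieId'].isin(user_ratings['movieId'])

rated = catalog.loc[is_rated]      # movies the user rated
unrated = catalog.loc[~is_rated]   # ~ flips True and False
print(rated['movieId'].tolist(), unrated['movieId'].tolist())
```

The two subsets always partition the catalog, which matches the counts above: 284 rated plus 4594 unrated movies give the 4878 movies with a plot.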


Finally, let's add the rating information to our *rated_movies* DataFrame. We do it using the *join* method.

In [87]:
selected_user_ratings = selected_user_ratings.set_index('movieId')
rated_movies = rated_movies.join(selected_user_ratings['rating'], on='movieId')
print("Rated movies: " + str(rated_movies.shape[0]))
display(rated_movies.head(10))

Rated movies: 284


Unnamed: 0_level_0,title_clean,Plot,rating
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,toy story,In a world where toys are living things who pr...,3.0
34,babe,"Babe, an orphaned piglet, is chosen for a ""gue...",2.5
48,pocahontas,"In 1607, the Susan Constant sails to the New W...",2.5
70,from dusk till dawn,Fugitive bank robbers Seth and Richie Gecko ho...,4.0
110,braveheart,"In 1280, King Edward ""Longshanks"" invades and ...",4.0
150,apollo 13,"In July 1969, astronaut Jim Lovell hosts a hou...",4.0
163,desperado,"At the Tarasco bar in Mexico, an American man ...",3.5
208,waterworld,Long after the melting of the polar ice caps i...,3.5
223,clerks,"Dante Hicks, a 22-year-old retail clerk at the...",4.0
288,natural born killers,Mickey Knox and his wife Mallory stop at a din...,3.5
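
The join step above can be tried in isolation on a tiny example. The sketch below uses hypothetical toy Dataframes (`toy_movies`, `toy_user_ratings`), not the tutorial data. It assumes, as in the cell above, that the left Dataframe is indexed by *movieId* and that `on='movieId'` refers to that index level.

```python
import pandas as pd

# Hypothetical movies table, indexed by movieId like rated_movies
toy_movies = pd.DataFrame(
    {"title_clean": ["toy story", "babe", "heat"]},
    index=pd.Index([1, 34, 6], name="movieId"),
)

# Hypothetical ratings of one user, indexed by movieId like selected_user_ratings
toy_user_ratings = pd.DataFrame(
    {"movieId": [1, 34], "rating": [3.0, 2.5]}
).set_index("movieId")

# join performs a left join by default: every row of toy_movies is kept,
# and movies without a rating get NaN in the new column
joined = toy_movies.join(toy_user_ratings["rating"], on="movieId")
print(joined)
```

In the tutorial, *rated_movies* only contains movies the user has rated, so no missing values appear after the join.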


Now we have all the data we need to train a content-based recommender system for our user.

##### TF-IDF computation with sklearn

We will use the well-known Python library [sklearn](https://scikit-learn.org/stable/) to train our recommender. First, we need to extract features from our content (the plots of the movies). We will use the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class, which allows us to easily compute TF-IDF vectors for our movies. We instantiate a *TfidfVectorizer* object and then call its [fit_transform](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit_transform) method. The method fits (trains) the feature model on the plots, learning the vocabulary and the IDF weights, and then transforms our data, returning the TF-IDF matrix with one row per movie and one column per vocabulary term. We use the [shape](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.shape.html#numpy.ndarray.shape) attribute of the matrix to check the dimensions of our feature matrix.

In [88]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(rated_movies['Plot'])
print(X.shape)

(284, 16949)
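
To make the roles of *fit_transform* and *transform* more concrete, here is a minimal sketch on a hypothetical toy corpus (the three short plots below are invented for illustration and do not come from the dataset). It shows that the number of columns equals the size of the learned vocabulary, and that *transform* reuses this vocabulary for new text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy plots, not taken from the dataset
toy_plots = [
    "a cowboy toy and a space ranger toy compete",
    "a space crew is stranded on the way to the moon",
    "a bank robber plans one last heist",
]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_plots)  # sparse matrix, one row per plot

# vocabulary_ maps each learned term to its column index
print(toy_X.shape, len(toy_vectorizer.vocabulary_))

# transform() reuses the learned vocabulary: terms never seen
# during fitting ("steals" here) are simply ignored
new_X = toy_vectorizer.transform(["a robber steals a toy"])
print(new_X.shape, new_X.nnz)
```

This is why the unrated movies must later be passed through the same fitted vectorizer: their feature matrix needs the same columns as the one used for training.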


##### Training a KNN Regressor with sklearn

Now we have our features. To train our model, we also need the ratings. In this example, we use the ratings directly and train a regressor, which estimates, for each unrated movie, the rating that the user would assign to it. Of course, we could take a different approach and treat the problem as a classification problem. In that case, we would need to transform our ratings into labels. For simplicity, we will work with the ratings. Hence, we need to prepare a vector containing the rating of each movie, and we do so simply by selecting the *rating* column of our dataset.

In [89]:
y = rated_movies['rating']
print(y.shape)

(284,)
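
For the classification alternative mentioned above, the ratings would first have to be turned into labels. A minimal sketch, assuming a binary like/dislike split with an arbitrary threshold of 3.5 (both the threshold and the toy ratings are assumptions made for illustration):

```python
import pandas as pd

# Hypothetical ratings, standing in for the rating column of rated_movies
toy_ratings = pd.Series([3.0, 2.5, 4.0, 3.5, 5.0, 1.0])

# Assumption: a rating of 3.5 or higher counts as "liked" (label 1)
like_threshold = 3.5
toy_labels = (toy_ratings >= like_threshold).astype(int)
print(toy_labels.tolist())  # [0, 0, 1, 1, 1, 0]
```

Such labels could then be used to train a classifier, for example sklearn's *KNeighborsClassifier*, instead of the regressor used in this tutorial.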


Then, we train our regressor. We use the [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) class of the sklearn library. We create an instance of *KNeighborsRegressor*, specifying in the constructor the number of neighbors to use, which we set to 5. Finally, we train our model using the [fit](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor.fit) method, passing as arguments the features matrix *X* and the corresponding ratings vector *y*.

In [90]:
from sklearn.neighbors import KNeighborsRegressor

neigh = KNeighborsRegressor(n_neighbors=5)
neigh.fit(X, y)

KNeighborsRegressor()
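
To see what the regressor does at prediction time, the sketch below uses hypothetical one-dimensional features so that the neighbors are easy to identify by eye. With the default uniform weights, the prediction is the plain average of the ratings of the nearest training items; the toy data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical 1-D features and ratings, only to make the neighbors easy to see
toy_X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
toy_y = np.array([4.0, 5.0, 3.0, 1.0, 2.0])

toy_knn = KNeighborsRegressor(n_neighbors=3)
toy_knn.fit(toy_X, toy_y)

query = np.array([[1.2]])
# kneighbors returns the distances to, and the indices of, the nearest training items
distances, indices = toy_knn.kneighbors(query)

# Average the ratings of those neighbors by hand and compare with predict()
manual_prediction = toy_y[indices[0]].mean()
print(sorted(indices[0].tolist()))                   # [0, 1, 2]
print(manual_prediction, toy_knn.predict(query)[0])  # 4.0 4.0
```

With 5 neighbors and ratings given in half-star steps, such averages are always multiples of 0.1, which is consistent with predicted ratings like 4.1 or 3.9.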

We can now use the trained model to compute our recommendations. First, we need to extract the features from the plots of our unrated movies. We can do it using the previously trained *TfidfVectorizer* *vectorizer*, passing to its [transform](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.transform) method the *Plot* column of the *unrated_movies* Dataframe.

Then, we use the trained *KNeighborsRegressor* *neigh*, passing the features matrix of the unrated movies *X_unrated* to its [predict](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor.predict) method, to obtain the predicted ratings *y_unrated*.

In [92]:
X_unrated = vectorizer.transform(unrated_movies['Plot'])
print(X_unrated.shape)
y_unrated = neigh.predict(X_unrated)
print(y_unrated.shape)

(4594, 16949)
(4594,)


Finally, we can combine the predicted ratings with the information in the unrated_movies Dataframe, order the movies by predicted ratings, and determine the best 10 movies to recommend to the user.

In [93]:
unrated_movies['predicted_ratings'] = y_unrated
unrated_movies = unrated_movies.sort_values(by='predicted_ratings', ascending=False)
display(unrated_movies.head(10))

movieId,title_clean,Plot,predicted_ratings
4583,disorganized crime,The story begins in a small town in western Mo...,4.1
3766,missing in action,Colonel James Braddock is a US military office...,4.1
2477,firewalker,Norris and Gossett play Max Donigan and Leo Po...,4.1
57951,fool's gold,"""Finn"" Finnegan is a treasure hunter searching...",4.0
619,ed,"Jack ""Deuce"" Cooper (Matt LeBlanc) is a farm b...",4.0
6464,good burger,"On the first day of summer, slacker high schoo...",4.0
6436,this boy's life,"Nomadic, flaky, Caroline Wolff, wants to settl...",4.0
2457,running scared,Ray Hughes (Hines) and Danny Costanzo (Crystal...,4.0
6870,mystic river,"Three boys, Jimmy Markum, Sean Devine, and Dav...",4.0
1057,everyone says i love you,The emotions of an extended upper-class family...,3.9


#### 5. Collaborative Filtering Recommender

We will now implement a different approach for our recommender system, based on collaborative filtering. For this, we are going to use the [Lenskit](https://lkpy.readthedocs.io/en/stable/) library. Note that this library requires Python 3.6.

We start again from our initial Dataframes *ratings_df* and *movies_df*. We need to rename the columns, since Lenskit expects the ratings Dataframe to have the columns *user* and *item* (the *rating* column already has the expected name). We also need an *item* column in the *movies_df* Dataframe, which we set as its index.

In [98]:
rating_lenskit_df = ratings_df.rename(columns={'userId': 'user', 'movieId': 'item'})
display(rating_lenskit_df.head(10))

,user,item,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
5,1,70,3.0,964982400
6,1,101,5.0,964980868
7,1,110,4.0,964982176
8,1,151,5.0,964984041
9,1,157,5.0,964984100


In [99]:
movies_lenskit_df = movies_df.rename(columns={'movieId': 'item'})
movies_lenskit_df = movies_lenskit_df.set_index('item')
display(movies_lenskit_df.head(10))

item,title,genres,year,title_clean
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,toy story
2,Jumanji (1995),Adventure|Children|Fantasy,1995,jumanji
3,Grumpier Old Men (1995),Comedy|Romance,1995,grumpier old men
4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,waiting to exhale
5,Father of the Bride Part II (1995),Comedy,1995,father of the bride part ii
6,Heat (1995),Action|Crime|Thriller,1995,heat
7,Sabrina (1995),Comedy|Romance,1995,sabrina
8,Tom and Huck (1995),Adventure|Children,1995,tom and huck
9,Sudden Death (1995),Action,1995,sudden death
10,GoldenEye (1995),Action|Adventure|Thriller,1995,goldeneye


##### Collaborative user-user recommender with Lenskit

We use the user-based collaborative filtering algorithm [UserUser](https://lkpy.readthedocs.io/en/stable/knn.html#module-lenskit.algorithms.user_knn), which is based on nearest neighbors. The constructor takes as inputs the maximum and minimum number of neighbors to consider (we set them to 15 and 3). Then we use the [adapt](https://lkpy.readthedocs.io/en/stable/interfaces.html?highlight=Recommender#lenskit.Recommender.adapt) method of the [Recommender](https://lkpy.readthedocs.io/en/stable/interfaces.html?highlight=Recommender#lenskit.Recommender) class to obtain a *Recommender* object. Finally, we use the [fit](https://lkpy.readthedocs.io/en/stable/interfaces.html?highlight=Recommender#lenskit.Algorithm.fit) method to train the model.

In [101]:
from lenskit.algorithms import Recommender
from lenskit.algorithms.user_knn import UserUser

# We use the user-based collaborative filtering algorithm UserUser, which uses nearest neighbors
num_recs = 10  # Number of recommendations to generate
user_user = UserUser(15, min_nbrs=3)  # Minimum (3) and maximum (15) number of neighbors to consider
recsys = Recommender.adapt(user_user)
recsys.fit(rating_lenskit_df)

<lenskit.algorithms.basic.TopN at 0x2a80fdd5da0>
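
To give an intuition of what a user-user approach computes, here is a simplified sketch of the general idea: predict a rating as the user's mean rating plus a similarity-weighted average of how much similar users deviated from their own means on that item. This is not Lenskit's implementation; the toy ratings, the `predict` helper, and the choice of cosine similarity on mean-centered ratings are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings in the same long format as rating_lenskit_df
toy = pd.DataFrame({
    "user":   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "item":   [10, 20, 30, 10, 20, 40, 10, 30, 40],
    "rating": [5.0, 3.0, 4.0, 4.0, 2.0, 5.0, 1.0, 5.0, 2.0],
})

# User x item matrix; NaN marks a missing rating
matrix = toy.pivot(index="user", columns="item", values="rating")
user_means = matrix.mean(axis=1)
# Subtract each user's mean rating; missing ratings become 0 (no deviation)
centered = matrix.sub(user_means, axis=0).fillna(0.0)

def predict(user, item, k=2):
    """Hypothetical helper: user mean + weighted average of neighbors' deviations."""
    target = centered.loc[user].to_numpy()
    # Candidate neighbors: other users who rated the item
    mask = matrix[item].notna().to_numpy() & (matrix.index.to_numpy() != user)
    sims = []
    for other in matrix.index[mask]:
        vec = centered.loc[other].to_numpy()
        denom = np.linalg.norm(target) * np.linalg.norm(vec)
        sim = float(target @ vec / denom) if denom > 0 else 0.0
        if sim > 0:  # keep only positively correlated users
            sims.append((sim, other))
    top = sorted(sims)[-k:]  # the k most similar neighbors
    if not top:
        return float(user_means.loc[user])
    weighted = sum(s * centered.loc[u, item] for s, u in top)
    return float(user_means.loc[user] + weighted / sum(s for s, _ in top))

print(round(predict(1, 40), 3))  # 5.333
```

Note that the prediction in this toy example is above 5: a mean-centered predictor of this kind is not bounded by the rating scale, which is one plausible explanation for recommendation scores higher than the maximum rating.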

Now, we can provide the recommendations for a specific user. Let's generate 10 recommendations for the user selected for the previous example. We use the [recommend](https://lkpy.readthedocs.io/en/stable/interfaces.html?highlight=Recommender#lenskit.Recommender.recommend) method of the *Recommender* class to generate the recommendations. Then, we add the title information and display the recommendations.

In [104]:
selected_movies = recsys.recommend(selected_user, num_recs)  # generate 10 recommendations for the selected user
selected_movies = selected_movies.join(movies_lenskit_df['title_clean'], on='item')
display(selected_movies)

,item,score,title_clean
0,25771,5.647198,"andalusian dog, an (chien andalou, un)"
1,69524,5.154576,raiders of the lost ark: the adaptation
2,27156,5.025301,neon genesis evangelion: the end of evangelion...
3,3404,4.96183,titanic
4,3030,4.932982,yojimbo
5,4117,4.847822,hope and glory
6,187,4.834099,party girl
7,3606,4.82574,on the town
8,4406,4.825562,"man who shot liberty valance, the"
9,2511,4.810279,"long goodbye, the"


We can compare these recommendations with the ones previously computed with the content-based approach.

In [105]:
display(unrated_movies.head(10))

movieId,title_clean,Plot,predicted_ratings
4583,disorganized crime,The story begins in a small town in western Mo...,4.1
3766,missing in action,Colonel James Braddock is a US military office...,4.1
2477,firewalker,Norris and Gossett play Max Donigan and Leo Po...,4.1
57951,fool's gold,"""Finn"" Finnegan is a treasure hunter searching...",4.0
619,ed,"Jack ""Deuce"" Cooper (Matt LeBlanc) is a farm b...",4.0
6464,good burger,"On the first day of summer, slacker high schoo...",4.0
6436,this boy's life,"Nomadic, flaky, Caroline Wolff, wants to settl...",4.0
2457,running scared,Ray Hughes (Hines) and Danny Costanzo (Crystal...,4.0
6870,mystic river,"Three boys, Jimmy Markum, Sean Devine, and Dav...",4.0
1057,everyone says i love you,The emotions of an extended upper-class family...,3.9


We can notice that the recommendations are quite different. Which one is the best recommender?