# Recommender Systems - Tutorial 1 (Lab 1)
### Individual Recommender Systems

In this tutorial, we will see some implementations for the strategies to implement individual recommender systems seen in the first lectures of the course. At the same time, we will introduce some libraries useful for machine learning applications in python.

#### 1. Dataset

First, we need to provide some data. In this tutorial, we will use the 
[Movielens latest small](https://www.kaggle.com/grouplens/movielens-latest-small). Other datasets can be found on Canvas, in the [Project Resources](https://canvas.maastrichtuniversity.nl/courses/7954/pages/project-resources?module_item_id=203858) module. We download the dataset and we insert the files in the folder *dataset*. The dataset contains several files CSV. We can start using the [pandas](https://pandas.pydata.org/) library to read the content of the CSV files and use it.

Let's see the first 10 rows to check which contents are provided by the *ratings.csv* and the *movies.csv* files. To do so, we use the *read_csv* method which returns a pandas [Dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) object containing the information read from the csv file. Such object provides several methods to manipulate the data in it, and we will use the [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) method, which returns the first *n* rows of the dataframe.

In [2]:
movielens_dataset_folder = "./movielens_dataset_small"

import pandas as pd

In [3]:
ratings_df = pd.read_csv(movielens_dataset_folder+"/ratings.csv") 
print(ratings_df.head(10))

   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931
5       1       70     3.0  964982400
6       1      101     5.0  964980868
7       1      110     4.0  964982176
8       1      151     5.0  964984041
9       1      157     5.0  964984100


In [4]:
movies_df = pd.read_csv(movielens_dataset_folder+"/movies.csv")
print(movies_df.head(10))

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   
5        6                         Heat (1995)   
6        7                      Sabrina (1995)   
7        8                 Tom and Huck (1995)   
8        9                 Sudden Death (1995)   
9       10                    GoldenEye (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
5                        Action|Crime|Thriller  
6                               Comedy|Romance  
7                           Adventure|Children  
8       

The *ratings.csv* file contains ratings provided by an user for a movie, and the corresponding timestamp. The *movies.csv* provides information about the title and the genres of each movie.

#### 2. Recommending highest rated movies

A first idea to provide recommendations for our users can be to use the ratings provided and determine the highest rated movies. To do so, we first need to compute the average rating for each movie, and then determine the top rated. Let's use again the functions provided by the Dataframe object. In particular, we can use the [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) method to group the columns given a specific column value. The method returns a [Groupby](https://pandas.pydata.org/docs/reference/groupby.html) object, which provides methods to aggregate the values of each group. To compute the average ratings, we use the [mean](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.mean.html) method which returns a Dataframe object, in which we will have a row for each movie, and the average rating will be in the *rating* column. Finally, we use the [sort_values](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) method to sort the movies by rating, and we print the first 10 movies.

In [5]:
average_ratings_df = ratings_df.groupby(['movieId']).mean()
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False)

print(sorted_avg_ratings_df.head(10))

         userId  rating     timestamp
movieId                              
88448     483.0     5.0  1.315438e+09
100556    462.0     5.0  1.456151e+09
143031     89.0     5.0  1.520409e+09
143511    105.0     5.0  1.526207e+09
143559     89.0     5.0  1.520410e+09
6201      474.0     5.0  1.100120e+09
102217     63.0     5.0  1.443200e+09
102084    380.0     5.0  1.493422e+09
6192      182.0     5.0  1.063275e+09
145994    105.0     5.0  1.526207e+09


Let's use the information in the movies_df to associate the title on each movie, and then print the list of recommended movies. We will use the [join](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html) method of the Dataframe object. To do so, we need to set the *movieId* column of the *movies_df* object as the index of the Dataframe. We use the [set_index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html) method for this.

In [6]:
movies_df = movies_df.set_index('movieId')
joined_df = sorted_avg_ratings_df.join(movies_df, on='movieId')
print(joined_df['title'].head(10))

movieId
88448         Paper Birds (Pájaros de papel) (2010)
100556                   Act of Killing, The (2012)
143031                              Jump In! (2007)
143511                                 Human (2015)
143559                          L.A. Slasher (2015)
6201                               Lady Jane (1986)
102217               Bill Hicks: Revelations (1993)
102084                 Justice League: Doom (2012) 
6192      Open Hearts (Elsker dig for evigt) (2002)
145994                       Formula of Love (1984)
Name: title, dtype: object


The movie we recommend are not so famous. Why is this happening? Let's check how many people rated each of these movies. We use again the *group_by* method, and then we use the [agg](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.agg.html) method of the Groupby onject to specify different aggregations: for the userId, we are gonna count the number of users in the group, using the *'size'* input, while for the rating we want the average value, so we specify *'mean'*. The rest of the operations are the same as before:

In [7]:
average_ratings_df = ratings_df.groupby(['movieId']).agg(count=('userId', 'size'), rating=('rating', 'mean')).reset_index()
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False)
joined_df = sorted_avg_ratings_df.join(movies_df, on='movieId')
print(joined_df[['title', 'count', 'rating']].head(10))

                                          title  count  rating
7638      Paper Birds (Pájaros de papel) (2010)      1     5.0
8089                 Act of Killing, The (2012)      1     5.0
9065                            Jump In! (2007)      1     5.0
9076                               Human (2015)      1     5.0
9078                        L.A. Slasher (2015)      1     5.0
4245                           Lady Jane (1986)      1     5.0
8136             Bill Hicks: Revelations (1993)      1     5.0
8130               Justice League: Doom (2012)       1     5.0
4240  Open Hearts (Elsker dig for evigt) (2002)      1     5.0
9104                     Formula of Love (1984)      1     5.0


We can see that all the movies we selected have been rated by only one user. Let's try to recommend only movies that have been rated by at least 20 users. Note that the *movieId* column is the index for the Dataframe objects obtained using the *mean* and *count* methods. We will use the [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) attribute of the Dataframe object, which allows to access a group of rows and columns by label(s) or a boolean array. In particular, we will determine a Series of boolean: The instructon `rating_counts_df['rating'] > minimum_ratings` evaluate the condition for each row of the *rating* column. Hence, *loc* will only return the rows for which such condition is true.

In [8]:
minimum_ratings = 20
average_ratings_df = ratings_df.groupby(['movieId']).mean() #compute the average ratings for each movie
rating_counts_df = ratings_df.groupby(['movieId']).count() #compute the number of evaluations for each movie
average_ratings_df = average_ratings_df.loc[rating_counts_df['rating'] > minimum_ratings] #select the movies having the required number of evaluations
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False) #sort the movies according to the average ratings
joined_df = sorted_avg_ratings_df.join(movies_df['title'], on='movieId') #associate the title
print(joined_df[['title','rating']].head(10))

                                                 title    rating
movieId                                                         
318                   Shawshank Redemption, The (1994)  4.429022
922      Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)  4.333333
898                     Philadelphia Story, The (1940)  4.310345
475                   In the Name of the Father (1993)  4.300000
1204                         Lawrence of Arabia (1962)  4.300000
246                                 Hoop Dreams (1994)  4.293103
858                              Godfather, The (1972)  4.289062
1235                           Harold and Maude (1971)  4.288462
168252                                    Logan (2017)  4.280000
2959                                 Fight Club (1999)  4.272936


We can see now that the recommendation contains more famous movies.

#### 3. Recommending by genre

We want more personalized the recommendations. One possible solution is to suggest movies having the highest ratings for a specific genre. We also want to only select movies evaluated by at least 20 users. We use the same methods used in the previous example. To select the movies by genre, we need to check if the value in the column *genre* contains the specified genre. We can write the conditions as `average_ratings_df['genres'].str.contains(genre)`. 

In [9]:
genre = 'Action'
minimum_ratings = 20

average_ratings_df = ratings_df.groupby(['movieId']).mean() #compute the average ratings for each movie
rating_counts_df = ratings_df.groupby(['movieId']).count() #compute the number of evaluations for each movie
average_ratings_df = average_ratings_df.loc[rating_counts_df['rating'] > minimum_ratings] #select the movies having the required number of evaluations
average_ratings_df = average_ratings_df.join(movies_df['genres'], on='movieId')
average_ratings_df = average_ratings_df.loc[average_ratings_df['genres'].str.contains(genre)]
sorted_avg_ratings_df = average_ratings_df.sort_values(by="rating", ascending=False) #sort the movies according to the average ratings
joined_df = sorted_avg_ratings_df.join(movies_df['title'], on='movieId') #associate the title
print(joined_df[['title','rating']].head(10))

                                                     title    rating
movieId                                                             
168252                                        Logan (2017)  4.280000
2959                                     Fight Club (1999)  4.272936
58559                              Dark Knight, The (2008)  4.238255
1197                            Princess Bride, The (1987)  4.232394
260              Star Wars: Episode IV - A New Hope (1977)  4.231076
3275                           Boondock Saints, The (2000)  4.220930
1208                                 Apocalypse Now (1979)  4.219626
1196     Star Wars: Episode V - The Empire Strikes Back...  4.215640
1233                          Boot, Das (Boat, The) (1981)  4.212500
1198     Raiders of the Lost Ark (Indiana Jones and the...  4.207500


#### 4. Content Based Recommender

Lets'try to implement some of the strategies seen during the lectures to provide more personalized recommendations. We start from a content based approach. Of course, we need some content to process, and the dataset we are using only provides the genre of each movie. So, let's start retriving the movies plots. There are different options for this. Several API online provide this information, but they require to perform different calls and each API has some day limit. Hence, to save some time, we will use a dataset providing such information. The [Wikipedia Movie Plots](https://www.kaggle.com/jrobischon/wikipedia-movie-plots) dataset provides descriptions of 34,886 movies from around the world. Let's open it and check the provided content.

Note that, from here, we will stop using the print method to print the Dataframes. We will start using the [display](https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html#IPython.display.display) method, that allows to print information nicely in a jupyter notebook.

In [10]:
wiki_plots_dataset_folder = "./wiki_movie_plots_deduped"
plots_df = pd.read_csv(wiki_plots_dataset_folder+"/wiki_movie_plots_deduped.csv") 
display(plots_df.head(10))

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...
5,1903,Alice in Wonderland,American,Cecil Hepworth,May Clark,unknown,https://en.wikipedia.org/wiki/Alice_in_Wonderl...,"Alice follows a large white rabbit down a ""Rab..."
6,1903,The Great Train Robbery,American,Edwin S. Porter,,western,https://en.wikipedia.org/wiki/The_Great_Train_...,The film opens with two bandits breaking into ...
7,1904,The Suburbanite,American,Wallace McCutcheon,,comedy,https://en.wikipedia.org/wiki/The_Suburbanite,The film is about a family who move to the sub...
8,1905,The Little Train Robbery,American,Edwin Stanton Porter,,unknown,https://en.wikipedia.org/wiki/The_Little_Train...,The opening scene shows the interior of the ro...
9,1905,The Night Before Christmas,American,Edwin Stanton Porter,,unknown,https://en.wikipedia.org/wiki/The_Night_Before...,Scenes are introduced using lines of the poem....


The movies plot are contained in the *Plot* column. But how can we combine this information with our previous dataset? Let's have a look at the information in our *movies_df* Dataframe.

In [11]:
movies_df.head(10)

Unnamed: 0_level_0,title,genres
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller


Since the id are different in the two datasets, we need to use the *title* to combine the information. Furthermore, since different movies might have the same title, we will also use the year. So, to start, we need to separate such informations, that are stored in the same column in our *movie_df* dataset. To do so, we will use the [apply]() method of the Dataframe object. Such method takes as input a function. In the code above, the function is called on each row of the Dataframe on which the function is called.

Hence, we define a function `compute_title_year` that has a row as input. This function should check if the *title* of this movie contains the year (since for some of the movies the year is not present). If so, the *year* is extracted from the *title*, and a new cleaned title is also computed. We are gonna using also the some functions of the [Text Sequence Type - str](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str). You can see in the comments what each instruction do, while for a more detailed reference of each method follow this [link](https://docs.python.org/3/library/stdtypes.html#string-methods).

In [12]:
def compute_title_year(row):
    # read the title of the specified row, and cast it to the str type,
    # then use the strip() method to eliminate the spaces from the beginning and the end of the string
    title = str(row['title']).strip() 
    
    # select the substring from the fifth-to-last (included) character to the last (excluded)
    year = title[-5:-1] 
    
    # use the isdigit method to determine if the substring selected is a number
    if year.isdigit():
        # remove the year from the title
        # select the substring from the beninning to the sixth-to-last (excluded)
        # then use the strip() method to remove spaces at the beginning and at the end of the string
        # finally, use the lower() method to obtain the title in lower case
        title_clean = title[:-6].strip().lower()
        
        # return the year and the title
        return int(year), title_clean
    else:
        # nothing to do, return 0 as year and the title
        return 0, title

Now, we use the *apply* method to execute the *compute_title_year* function on all the rows in our dataframe. We assign the resulting columns in two new columns *year* and *title_clean*.

In [13]:
movies_df = movies_df.reset_index()
movies_df[['year', 'title_clean']] = movies_df.apply(compute_title_year, axis=1, result_type ='expand')
display(movies_df.head(10))

Unnamed: 0,movieId,title,genres,year,title_clean
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,toy story
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,jumanji
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,grumpier old men
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,waiting to exhale
4,5,Father of the Bride Part II (1995),Comedy,1995,father of the bride part ii
5,6,Heat (1995),Action|Crime|Thriller,1995,heat
6,7,Sabrina (1995),Comedy|Romance,1995,sabrina
7,8,Tom and Huck (1995),Adventure|Children,1995,tom and huck
8,9,Sudden Death (1995),Action,1995,sudden death
9,10,GoldenEye (1995),Action|Adventure|Thriller,1995,goldeneye


Now we need to compute a lowercase version of the titles in the *plots_df* Dataframe, in order to make our comparisons easier. We can do it by operating directly on the *Title* column. We select the column, obtaining a [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) object. On that, we use the [str](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html) function, than we use the *strip* method to remove spaces from the beginning and the end of the string. We obtain a new Series object, hence we need to use again the *str* function before we can apply the *lower* function. At the end, we assign the final Series object to the new column *title_clean*.

In [14]:
plots_df['title_clean'] = plots_df['Title'].str.strip().str.lower()
display(plots_df.head(10))

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot,title_clean
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr...",kansas saloon smashers
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov...",love by the light of the moon
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed...",the martyred presidents
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...,"terrible teddy, the grizzly king"
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...,jack and the beanstalk
5,1903,Alice in Wonderland,American,Cecil Hepworth,May Clark,unknown,https://en.wikipedia.org/wiki/Alice_in_Wonderl...,"Alice follows a large white rabbit down a ""Rab...",alice in wonderland
6,1903,The Great Train Robbery,American,Edwin S. Porter,,western,https://en.wikipedia.org/wiki/The_Great_Train_...,The film opens with two bandits breaking into ...,the great train robbery
7,1904,The Suburbanite,American,Wallace McCutcheon,,comedy,https://en.wikipedia.org/wiki/The_Suburbanite,The film is about a family who move to the sub...,the suburbanite
8,1905,The Little Train Robbery,American,Edwin Stanton Porter,,unknown,https://en.wikipedia.org/wiki/The_Little_Train...,The opening scene shows the interior of the ro...,the little train robbery
9,1905,The Night Before Christmas,American,Edwin Stanton Porter,,unknown,https://en.wikipedia.org/wiki/The_Night_Before...,Scenes are introduced using lines of the poem....,the night before christmas


Now, we have all the information to merge the two Dataframes. We use the [merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function in this case, that works similarly to a database join. We specify the two datasets, than `how='inner'` specifies that we will perform an inner join (hence we will only keep movies for which we have the corresponding plot), and then, i the correct order, the join attibutes for the first Dataframe (`left_on=['title_clean','year']`) and for the second Dataframe (`right_on = ['title_clean','Release Year']`). 

We also use the [shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) attribute to print the number of movies for which we have the plot.

In [15]:
joined_plot_df = pd.merge(movies_df, plots_df,  how='inner', left_on=['title_clean','year'], right_on = ['title_clean','Release Year'])
print("Number of movies: " + str(joined_plot_df.shape[0]))      
display(joined_plot_df.head(10))

Number of movies: 4878


Unnamed: 0,movieId,title,genres,year,title_clean,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,toy story,1995,Toy Story,American,John Lasseter,"Tim Allen, Tom Hanks (voices)",animated film,https://en.wikipedia.org/wiki/Toy_Story,In a world where toys are living things who pr...
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,jumanji,1995,Jumanji,American,Joe Johnston,"Robin Williams, Bonnie Hunt, Kirsten Dunst, Br...","family, fantasy",https://en.wikipedia.org/wiki/Jumanji_(film),"In 1869, near Brantford, New Hampshire, two br..."
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,grumpier old men,1995,Grumpier Old Men,American,Howard Deutch,"Jack Lemmon, Walter Matthau, Ann-Margret, Soph...",comedy,https://en.wikipedia.org/wiki/Grumpier_Old_Men,The feud between Max (Walter Matthau) and John...
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,waiting to exhale,1995,Waiting to Exhale,American,Forest Whitaker,"Whitney Houston, Angela Bassett, Loretta Devin...",drama,https://en.wikipedia.org/wiki/Waiting_to_Exhale,"""Friends are the People who let you be yoursel..."
4,5,Father of the Bride Part II (1995),Comedy,1995,father of the bride part ii,1995,Father of the Bride Part II,American,Charles Shyer,"Steve Martin, Diane Keaton, Martin Short, Kimb...",comedy,https://en.wikipedia.org/wiki/Father_of_the_Br...,The film begins five years after the events of...
5,6,Heat (1995),Action|Crime|Thriller,1995,heat,1995,Heat,American,Michael Mann,"Al Pacino, Robert De Niro, Val Kilmer, Ashley ...","crime drama, action",https://en.wikipedia.org/wiki/Heat_(1995_film),"Neil McCauley, a career criminal, hires Waingr..."
6,7,Sabrina (1995),Comedy|Romance,1995,sabrina,1995,Sabrina,American,Sydney Pollack,"Harrison Ford, Julia Ormond, Greg Kinnear, Nan...",romantic comedy,https://en.wikipedia.org/wiki/Sabrina_(1995_film),Sabrina Fairchild is the young daughter of the...
7,8,Tom and Huck (1995),Adventure|Children,1995,tom and huck,1995,Tom and Huck,American,Peter Hewitt,"Jonathan Taylor Thomas, Brad Renfro","comedy, family",https://en.wikipedia.org/wiki/Tom_and_Huck,The movie opens with Injun Joe (Eric Schweig) ...
8,9,Sudden Death (1995),Action,1995,sudden death,1995,Sudden Death,American,Peter Hyams,"Jean-Claude Van Damme, Powers Boothe, Dorian H...",action,https://en.wikipedia.org/wiki/Sudden_Death_(19...,Darren McCord (Jean-Claude Van Damme) is a Fre...
9,10,GoldenEye (1995),Action|Adventure|Thriller,1995,goldeneye,1995,GoldenEye,British,Martin Campbell,"Pierce Brosnan, Sean Bean, Famke Janssen",spy/action,https://en.wikipedia.org/wiki/GoldenEye,"In 1986, at Arkhangelsk, MI6 agents James Bond..."


Now that we have all the information we need, let's start implementing the recommnender for a specific user. For simplicity, we select a random user from the *ratings_df* Dataframe having more than 200 evaluations. First, we select all the users with more than 200 evaluations, and then we use the [sample](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html) method to extract a random row. Finally, we extract the *userId* column and we read the content with the [iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) method.

In [16]:
users_ratings = ratings_df.groupby(['userId']).count()

selected = users_ratings['rating'] > 200
selected_users = users_ratings.loc[selected]
random_selected = selected_users.sample() # sample() returns a random row from the dataframe. The returned object is a dataframe with one row. If you pass a number as argument you specify to select more than one row.
select_column_df = random_selected.reset_index()['userId'] # reset_index() create a new index, and the userId became a column. Then, we can filter using the column name
selected_user = select_column_df.iloc[0] # iloc select by index, since our dataframe only has one row we read it from the index 0
print("Selected user: " + str(selected_user))

Selected user: 610


Now, we need to select the user evaluations, that we need to train our recommender. We print the number of ratings that we have for the considered user, and the extracted dataset, ordered by ratings.

In [17]:
selected_user_ratings = ratings_df.loc[ratings_df['userId'] == selected_user]
selected_user_ratings = selected_user_ratings.sort_values(by='movieId', ascending=True)
print("Rated movies: " + str(selected_user_ratings.shape[0]))
display(selected_user_ratings.head(10))

Rated movies: 1302


Unnamed: 0,userId,movieId,rating,timestamp
99534,610,1,5.0,1479542900
99535,610,6,5.0,1493850345
99536,610,16,4.5,1479542171
99537,610,32,4.5,1479543331
99538,610,47,5.0,1479545853
99539,610,50,4.0,1493844757
99540,610,70,4.0,1495959282
99541,610,95,3.5,1479542004
99542,610,110,4.5,1479545829
99543,610,111,5.0,1479542162


Now, we need the select content of the rated movies, that we will need to compute the features that our model will use. We do it by selecting the rows from our *joined_plot_df* dataframe having a *movieId* that is in the list of *movieId* of the *selected_user_ratings* Dataframe, which only contains the evaluations of the considered user. We do this using the [isin](https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html) method. Finally, we only select the columns *movieId*, *title_clean* and *Plot*, and we set the *movieId* column as the index of the Dataframe.

In [18]:
rated_movies = joined_plot_df.loc[joined_plot_df['movieId'].isin(list(selected_user_ratings['movieId']))]
rated_movies = rated_movies[['movieId', 'title_clean', 'Plot']].set_index('movieId')
print("Rated movies: " + str(rated_movies.shape[0]))
display(rated_movies.head(10))

Rated movies: 785


Unnamed: 0_level_0,title_clean,Plot
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,toy story,In a world where toys are living things who pr...
6,heat,"Neil McCauley, a career criminal, hires Waingr..."
16,casino,"In 1973, sports handicapper and Mafia associat..."
70,from dusk till dawn,Fugitive bank robbers Seth and Richie Gecko ho...
95,broken arrow,Major Vic Deakins (John Travolta) and Captain ...
110,braveheart,"In 1280, King Edward ""Longshanks"" invades and ..."
111,taxi driver,"Travis Bickle, a 26-year-old honorably dischar..."
153,batman forever,"In Gotham City, the crime fighter Batman defus..."
159,clockers,"In a Brooklyn, New York, New York City, New Yo..."
194,smoke,The film follows the lives of multiple charact...


We use a similar approach to select the unrated movies. The only difference is that we need o select the rows from our *joined_plot_df* dataframe having a *movieId* that is **NOT** in the list of *movieId* of the *selected_user_ratings* Dataframe. We can do it by simply negating the condition we used in the previous example, using the `~` symbol.

In [19]:
unrated_movies = joined_plot_df.loc[~joined_plot_df['movieId'].isin(list(selected_user_ratings['movieId']))]
unrated_movies = unrated_movies[['movieId', 'title_clean', 'Plot']].set_index('movieId')
print("Unrated movies: " + str(unrated_movies.shape[0]))
display(unrated_movies.head(10))

Unrated movies: 4093


Unnamed: 0_level_0,title_clean,Plot
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
2,jumanji,"In 1869, near Brantford, New Hampshire, two br..."
3,grumpier old men,The feud between Max (Walter Matthau) and John...
4,waiting to exhale,"""Friends are the People who let you be yoursel..."
5,father of the bride part ii,The film begins five years after the events of...
7,sabrina,Sabrina Fairchild is the young daughter of the...
8,tom and huck,The movie opens with Injun Joe (Eric Schweig) ...
9,sudden death,Darren McCord (Jean-Claude Van Damme) is a Fre...
10,goldeneye,"In 1986, at Arkhangelsk, MI6 agents James Bond..."
12,dracula: dead and loving it,Solicitor Thomas Renfield travels all the way ...
13,balto,"In New York City, an elderly woman, her grandd..."


Finally, let's combine the information about the rating in our *rated_movies* Dataframe. We do it using the *join* method.

In [20]:
selected_user_ratings = selected_user_ratings.set_index('movieId')
rated_movies = rated_movies.join(selected_user_ratings['rating'], on='movieId')
print("Rated movies: " + str(rated_movies.shape[0]))
display(rated_movies.head(10))

Rated movies: 785


Unnamed: 0_level_0,title_clean,Plot,rating
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,toy story,In a world where toys are living things who pr...,5.0
6,heat,"Neil McCauley, a career criminal, hires Waingr...",5.0
16,casino,"In 1973, sports handicapper and Mafia associat...",4.5
70,from dusk till dawn,Fugitive bank robbers Seth and Richie Gecko ho...,4.0
95,broken arrow,Major Vic Deakins (John Travolta) and Captain ...,3.5
110,braveheart,"In 1280, King Edward ""Longshanks"" invades and ...",4.5
111,taxi driver,"Travis Bickle, a 26-year-old honorably dischar...",5.0
153,batman forever,"In Gotham City, the crime fighter Batman defus...",3.0
159,clockers,"In a Brooklyn, New York, New York City, New Yo...",3.5
194,smoke,The film follows the lives of multiple charact...,5.0


Now, we have all the data we need to train a content based recomemnder system for our user.

##### TF-IDF computation with sklearn

We will use the well-known pythion library [sklearn](https://scikit-learn.org/stable/) to train our recommender. First, we need to extract features from our content (the plots of the movies). We will use the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class, which allows us to easily compute TF-IDF vectors for our movies. We need to instantiate the a *TfidfVectorizer* object, and then we need to use the [fit_transform](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit_transform) method. The method fits (trains) the features model, and then transform our data, returning the TF-IDF matrix representation. We use the [shape](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.shape.html#numpy.ndarray.shape) attribute of the matrix to check the dimensions of our features matrix.

In [21]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(rated_movies['Plot'])
print(X.shape)

(785, 26533)


##### Training a KNN Regressor with sklearn

Now, we have our features. To train our model, we need also the ratings. In this example, we are gonna using the ratings, and we will train a regressor, that is able to determine, for each unrated movie, an extimation of the rating that the user would assign to it. Of course, we could use a different approach, treating the problem as a classification problem. In that case, we would need to transform our ratings to obtain labels. For simplicity, we will work with the ratings. Hence, we need to prepare a vector containing the ratings for each movie, and we do it simply selecting the *rating* column of our dataset.

In [22]:
y = rated_movies['rating']
print(y.shape)

(785,)


Then, we train our regressor. We use the [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) class of the sklearn library. We need to create an instance of *KNeighborsRegressor*. In the constructor, we specify the number of neighbors to use. We set it at 5. Finally, we can train our model, using the [fit](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor.fit) method, passing as arguments the features matrix *X* and the corresponding ratings vector *y*.

In [23]:
from sklearn.neighbors import KNeighborsRegressor

neigh = KNeighborsRegressor(n_neighbors=5)
neigh.fit(X, y)

KNeighborsRegressor()

We can now use the trained model to compute our recommendations. First, we need to extract the features from our unrated movies plots. We can do it using the previously trained *TfidfVectorizer* *vectorizer*, passing to its method [transform](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.transform) the *Plot* column if the *unrated_movies* Dataframe. 

Then, we use the trained *KNeighborsRegressor* *neigh*, passing the features matrix of the unrated movies *x_unrated* to its [predict](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor.predict) method, determining the predicted ratings *y_unrated*.

In [24]:
X_unrated = vectorizer.transform(unrated_movies['Plot'])
print(X_unrated.shape)
y_unrated = neigh.predict(X_unrated)
print(y_unrated.shape)

(4093, 26533)
(4093,)


Finally, we can combine the predicted ratings with the information in the unrated_movies Dataframe, order the movies by predicted ratings, and determine the best 10 movies to recommend to the user.

In [25]:
unrated_movies['predicted_ratings'] = y_unrated
unrated_movies = unrated_movies.sort_values(by='predicted_ratings', ascending=False)
display(unrated_movies.head(10))

Unnamed: 0_level_0,title_clean,Plot,predicted_ratings
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
3628,flying tigers,Jim Gordon (John Wayne in his first war film) ...,4.7
2041,condorman,"Woodrow ""Woody"" Wilkins is an imaginative, yet...",4.7
55805,before the devil knows you're dead,Note: The story is explained here in its chron...,4.6
1992,child's play 2,Two years after Chucky was destroyed by the Ba...,4.6
2884,dog park,The main character Andy is a writer of newspap...,4.6
4389,lost and delirious,Mary is a new student at the all girls' boardi...,4.5
1791,twilight,"Aging private detective Harry Ross (Newman), a...",4.5
5786,paid in full,Ace (Wood Harris) is a young man from Harlem s...,4.5
1701,deconstructing harry,"One night, Lucy (Judy Davis) gets a taxi to th...",4.5
1993,child's play 3,Eight years after Chucky's second demise in th...,4.5


#### 5. Collaborative Filtering Recommender

We will now implement a different approach for our recommender system. We want to use collaborative filtering strategies. For this, we are gonna use the [Lenskit](https://lkpy.readthedocs.io/en/stable/) library. Note that such library requires the python 3.6 version.

We start again from our initial Dataframes *ratings_df* and *movies_df*. We need to rename the columns, since lenskit needs that the ratings Dataframe has the columns *user* and *item*, and we would also need the column *item* in the *movies_df* Dataframe, setting it as index.

In [26]:
rating_lenskit_df = ratings_df.rename(columns={'userId': 'user', 'movieId': 'item'})
display(rating_lenskit_df.head(10))

Unnamed: 0,user,item,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
5,1,70,3.0,964982400
6,1,101,5.0,964980868
7,1,110,4.0,964982176
8,1,151,5.0,964984041
9,1,157,5.0,964984100


In [27]:
movies_lenskit_df = movies_df.rename(columns={'movieId': 'item'})
movies_lenskit_df = movies_lenskit_df.set_index('item')
display(movies_lenskit_df.head(10))

Unnamed: 0_level_0,title,genres,year,title_clean
item,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,toy story
2,Jumanji (1995),Adventure|Children|Fantasy,1995,jumanji
3,Grumpier Old Men (1995),Comedy|Romance,1995,grumpier old men
4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,waiting to exhale
5,Father of the Bride Part II (1995),Comedy,1995,father of the bride part ii
6,Heat (1995),Action|Crime|Thriller,1995,heat
7,Sabrina (1995),Comedy|Romance,1995,sabrina
8,Tom and Huck (1995),Adventure|Children,1995,tom and huck
9,Sudden Death (1995),Action,1995,sudden death
10,GoldenEye (1995),Action|Adventure|Thriller,1995,goldeneye


##### Collaborative user-user recommender with Lenskit

We use the collaborative user algorithm [UserUser](https://lkpy.readthedocs.io/en/stable/knn.html#module-lenskit.algorithms.user_knn), that uses the nearest neighbors algorithm. The constructor takes as inputs the maximum and minumum number of neighbors to consider (we set them to 15 and 3). Then we use the [adapt](https://lkpy.readthedocs.io/en/stable/interfaces.html?highlight=Recommender#lenskit.Recommender.adapt) method of the [Recommender](https://lkpy.readthedocs.io/en/stable/interfaces.html?highlight=Recommender#lenskit.Recommender) class, to obtain a *Recommender* object. Then, we can use the [fit](https://lkpy.readthedocs.io/en/stable/interfaces.html?highlight=Recommender#lenskit.Algorithm.fit) method to train the model. 

In [28]:
from lenskit.algorithms import Recommender
from lenskit.algorithms.user_knn import UserUser

# We use the collaborative user algorithm UserUser, that use the nearest neighbors 
num_recs = 10  # Number of recommendations to generate
user_user = UserUser(15, min_nbrs=3)  # Minimum (3) and maximum (15) number of neighbors to consider
recsys = Recommender.adapt(user_user)
recsys.fit(rating_lenskit_df)

<lenskit.algorithms.basic.TopN at 0x1ef9fe9d3c8>

Now, we can provide the recommendations for a specific user. Let's generate 10 recommendations for the user selected for the previous example. We use the [recommend](https://lkpy.readthedocs.io/en/stable/interfaces.html?highlight=Recommender#lenskit.Recommender.recommend) method of the *Recommender* class to generate the recommendations. Then, we add the title information and display the recommendations.

In [30]:
selected_movies = recsys.recommend(selected_user, 10) # generate 10 recommendations for the selected user 
selected_movies = selected_movies.join(movies_lenskit_df['title_clean'], on='item')
display(selected_movies)

Unnamed: 0,item,score,title_clean
0,334,5.164433,vanya on 42nd street
1,25771,5.153802,"andalusian dog, an (chien andalou, un)"
2,69524,5.139986,raiders of the lost ark: the adaptation
3,27156,5.130459,neon genesis evangelion: the end of evangelion...
4,3030,5.069795,yojimbo
5,65642,4.967804,"timecrimes (cronocrímenes, los)"
6,665,4.936815,underground
7,5137,4.927986,scratch
8,8261,4.903164,3 women (three women)
9,3266,4.896502,man bites dog (c'est arrivé près de chez vous)


We can compare such recommendations with the one previously computed with the content based approach.

In [31]:
display(unrated_movies.head(10))

Unnamed: 0_level_0,title_clean,Plot,predicted_ratings
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
3628,flying tigers,Jim Gordon (John Wayne in his first war film) ...,4.7
2041,condorman,"Woodrow ""Woody"" Wilkins is an imaginative, yet...",4.7
55805,before the devil knows you're dead,Note: The story is explained here in its chron...,4.6
1992,child's play 2,Two years after Chucky was destroyed by the Ba...,4.6
2884,dog park,The main character Andy is a writer of newspap...,4.6
4389,lost and delirious,Mary is a new student at the all girls' boardi...,4.5
1791,twilight,"Aging private detective Harry Ross (Newman), a...",4.5
5786,paid in full,Ace (Wood Harris) is a young man from Harlem s...,4.5
1701,deconstructing harry,"One night, Lucy (Judy Davis) gets a taxi to th...",4.5
1993,child's play 3,Eight years after Chucky's second demise in th...,4.5


### Evaluation of Recommender Systems

We have seen two approaches, but how can we evaluate the godness of the proposed recommnendations?
We are gonna see two examples, one for each approach.

#### Evaluation of a Content-based Recommender System

We implemented our Content-based Recommender System using the lenskit library. The same library offers a wide range of tools to evaluate regressors and classifiers, that we will use (and describe).

In the previous example, we estimated user's ratings on unseen movies. Now, we will see the problem treated as a classification problem. We will consider as relevant, for the user, all the movies rated higher than 3.

First, let's define a support function which allows us to select the information we need for a particular user.

The function is called *get_user_rated_movies_plots* and takes as input the user, and the two Dataframe *ratings_df* and *plot_df*. It first selects the ratings of the considered user. Then, it selects the movies information for the movies evaluated by the user. Finally, it joins the information in a single Dataframe, and adds a column in which a relevance for each movie is computed, according to the correlated rating: if the user gave a rating higher than 3, the relevance is set to 1; on the contrary, the relevance is 0. 

To do this, we use the apply function of the Dataframe class (seen previously). In this case, since the operation to perform is easy, we don't define a function for it. Instead, we define a lambda function (you can see it as a small anonymous function). The lambda function is applied to each row of the dataframe on which the apply method is called. In this case, we call it on the *'rating'* column of the *rated_movies* Dataframe. Hence, x is the rating of a specfic movie.

The lambda function then computes the relevance value through a [ternary operator](https://book.pythontips.com/en/latest/ternary_operators.html).

In [186]:
def get_user_rated_movies_plots(user, ratings_df, plot_df):
    """
    extract from the dataframe ratings_df the ratings of the user
    """
    selected_user_ratings = ratings_df.loc[ratings_df['userId'] == user] # select the ratings of the user
    rated_movies = plot_df.loc[plot_df['movieId'].isin(list(selected_user_ratings['movieId']))] # select the movie information for the movies rated by the user
    rated_movies = rated_movies[['movieId', 'title_clean', 'Plot']].set_index('movieId') # select only the information we need and set the index to prepare the Dataframe for the next join
    selected_user_ratings = selected_user_ratings.set_index('movieId') # set the index for the next join
    rated_movies = rated_movies.join(selected_user_ratings['rating'], on='movieId') # join the two dataframes
    rated_movies['relevant'] = rated_movies['rating'].apply(lambda x: 1 if x>3 else 0) # compute the relevance values for the user
    return rated_movies

We can try the function retrieving the information for the user selected for the previous evaluations

In [187]:
user_plots_ratings_df = get_user_rated_movies_plots(selected_user, ratings_df, joined_plot_df)
display(user_plots_ratings_df)

Unnamed: 0_level_0,title_clean,Plot,rating,relevant
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,toy story,In a world where toys are living things who pr...,5.0,1
6,heat,"Neil McCauley, a career criminal, hires Waingr...",5.0,1
16,casino,"In 1973, sports handicapper and Mafia associat...",4.5,1
70,from dusk till dawn,Fugitive bank robbers Seth and Richie Gecko ho...,4.0,1
95,broken arrow,Major Vic Deakins (John Travolta) and Captain ...,3.5,1
...,...,...,...,...
163937,blair witch,"In 2014, James Donahue finds a video on YouTub...",3.5,1
166534,split,Casey Cooke is an emotionally withdrawn teenag...,4.0,1
168250,get out,African-American photographer Chris Washington...,5.0,1
168252,logan,"In 2029, no mutants have been born in 25 years...",5.0,1


Now, since we have a different recommender for each user, we need a function that evaluates the recommender for a specific user. We will use the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class from the sklearn library to compute the TD-IDF representations of the movies' plots, as we did before. But, since we are addressing the problem as a classification problem, we use the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) instead of a regressor. 

Differently from the previous example, we want to evaluate the performances of our classifier, hence we need to test it on a set of movies for which we have the correct relevance for this user. In other words, we need to split the information we have about the user ratings in two sets:
1. Training set: we will use this data to train our classifier.
2. Test set: we will apply our classifier to this data and compare the predictions with the actual relevance values.

To split the data we will use the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) method, provided by the sklearn library.
Finally, once we have the classifier predictions, we use the [precision_recall_fscore_support](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html) method, provided by the sklearn library, to compute the information retievalmetrics *precision*, *recall* and *fscore*. The function returns such values for the considered user.

In [188]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

def train_and_test_recsys_for_user(user):
    user_plots_ratings_df = get_user_rated_movies_plots(user, ratings_df, joined_plot_df) # retrieve user info with the previously defined method
    
    X_plots = user_plots_ratings_df['Plot'] # select the Plot column, from which we will compute ourTF-IDF features
    y = user_plots_ratings_df['relevant'] # select the elevant column, that will be used as label
    
    X_plots_train, X_plots_test, y_train, y_test = train_test_split(X_plots, y, test_size=0.2) # randomly splits the data in train and test, we specify that 20% of the data will go into the test set

    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(X_plots_train) # Trains our TF-IDF model and computes the features
    
    neigh = KNeighborsClassifier(n_neighbors=5)
    neigh.fit(X_train, y_train) # train our cassifier
    
    X_test = vectorizer.transform(X_plots_test)
    y_pred = neigh.predict(X_test) # evaluates the predictions of the classifier
    
    return precision_recall_fscore_support(y_test, y_pred, average="binary") # compare the real relevance values with the predicted one, and return precision, recall, and fscore

We can test our method on the selected user. 
Note that the results are affected from the random split (you can check it by executing the method several times: the results change!)

How can we obtain more reliable results?

In [189]:
train_and_test_recsys_for_user(selected_user)


(0.7328767123287672, 0.9553571428571429, 0.8294573643410854, None)

However, we keep this example simple, and now we address the final problem: merging the results for all the customers.
We will iterate over all the users and compute precision, recall and fscore values for the recommender for each specfic customer. For simplicity, we will exclude the customers for which we have less than 100 evaluations (what could be the problem fo such users?).

This might take some time, hence we print the current user on each iteration to keep track of the progresses of our computation.

Finally, we compute at the end the means and standard deviations, and we print it.

In [190]:
import statistics

user_list = list(set(ratings_df['userId']))

precision_list = list()
recall_list = list()
fscore_list = list()
print("evaluating recommender on " + str(len(user_list)) + " users")
for user in user_list:

    print("user " + str(user) + " / " + str(user_plots_ratings_df.shape[0]))
    if user_plots_ratings_df.shape[0] > 100:
        precision, recall, fscore, _ = train_and_test_recsys_for_user(user)
        if precision>0 : 
            precision_list.append(precision)
        if recall > 0:
            recall_list.append(recall)
        if fscore > 0:
            fscore_list.append(fscore)
avg_precision = statistics.mean(precision_list)
stdv_precision = statistics.stdev(precision_list)
avg_recall = statistics.mean(recall_list)
stdv_recall = statistics.stdev(recall_list)
avg_fscore = statistics.mean(fscore_list)
stdv_fscore = statistics.stdev(fscore_list)
    
print("Precision:\t" + str(avg_precision) + " +/- " + str(stdv_precision) + 
      "\nRecall:\t" + str(avg_recall) + " +/- " + str(stdv_recall) +
      "\nFscore:\t" + str(avg_fscore) + " +/- " + str(stdv_fscore))
            
            

evaluating recommender on 610 users
user 1 / 785
user 2 / 785
user 3 / 785
user 4 / 785
user 5 / 785
user 6 / 785
user 7 / 785
user 8 / 785
user 9 / 785
user 10 / 785
user 11 / 785
user 12 / 785
user 13 / 785
user 14 / 785
user 15 / 785
user 16 / 785
user 17 / 785
user 18 / 785
user 19 / 785
user 20 / 785
user 21 / 785
user 22 / 785
user 23 / 785
user 24 / 785
user 25 / 785
user 26 / 785
user 27 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


user 28 / 785
user 29 / 785
user 30 / 785
user 31 / 785
user 32 / 785
user 33 / 785
user 34 / 785
user 35 / 785
user 36 / 785
user 37 / 785
user 38 / 785
user 39 / 785
user 40 / 785
user 41 / 785
user 42 / 785
user 43 / 785
user 44 / 785
user 45 / 785
user 46 / 785
user 47 / 785
user 48 / 785
user 49 / 785
user 50 / 785
user 51 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 52 / 785
user 53 / 785
user 54 / 785
user 55 / 785
user 56 / 785
user 57 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


user 58 / 785
user 59 / 785
user 60 / 785
user 61 / 785
user 62 / 785
user 63 / 785
user 64 / 785
user 65 / 785
user 66 / 785
user 67 / 785
user 68 / 785
user 69 / 785
user 70 / 785
user 71 / 785
user 72 / 785
user 73 / 785
user 74 / 785
user 75 / 785
user 76 / 785
user 77 / 785
user 78 / 785
user 79 / 785
user 80 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


user 81 / 785
user 82 / 785
user 83 / 785
user 84 / 785
user 85 / 785
user 86 / 785
user 87 / 785
user 88 / 785
user 89 / 785
user 90 / 785
user 91 / 785
user 92 / 785
user 93 / 785
user 94 / 785
user 95 / 785
user 96 / 785
user 97 / 785
user 98 / 785
user 99 / 785
user 100 / 785
user 101 / 785
user 102 / 785
user 103 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 104 / 785
user 105 / 785
user 106 / 785
user 107 / 785
user 108 / 785
user 109 / 785
user 110 / 785
user 111 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 112 / 785
user 113 / 785
user 114 / 785
user 115 / 785
user 116 / 785
user 117 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


user 118 / 785
user 119 / 785
user 120 / 785
user 121 / 785
user 122 / 785
user 123 / 785
user 124 / 785
user 125 / 785
user 126 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 127 / 785
user 128 / 785
user 129 / 785
user 130 / 785
user 131 / 785
user 132 / 785
user 133 / 785
user 134 / 785
user 135 / 785
user 136 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 137 / 785
user 138 / 785
user 139 / 785
user 140 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 141 / 785
user 142 / 785
user 143 / 785
user 144 / 785
user 145 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


user 146 / 785
user 147 / 785
user 148 / 785
user 149 / 785
user 150 / 785
user 151 / 785
user 152 / 785
user 153 / 785
user 154 / 785
user 155 / 785
user 156 / 785
user 157 / 785
user 158 / 785
user 159 / 785
user 160 / 785
user 161 / 785
user 162 / 785
user 163 / 785
user 164 / 785
user 165 / 785
user 166 / 785
user 167 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


user 168 / 785
user 169 / 785
user 170 / 785
user 171 / 785
user 172 / 785
user 173 / 785
user 174 / 785
user 175 / 785
user 176 / 785
user 177 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 178 / 785
user 179 / 785
user 180 / 785
user 181 / 785
user 182 / 785
user 183 / 785
user 184 / 785
user 185 / 785
user 186 / 785
user 187 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 188 / 785
user 189 / 785
user 190 / 785
user 191 / 785
user 192 / 785
user 193 / 785
user 194 / 785
user 195 / 785
user 196 / 785
user 197 / 785
user 198 / 785
user 199 / 785
user 200 / 785
user 201 / 785
user 202 / 785
user 203 / 785
user 204 / 785
user 205 / 785
user 206 / 785
user 207 / 785
user 208 / 785
user 209 / 785
user 210 / 785
user 211 / 785
user 212 / 785
user 213 / 785
user 214 / 785
user 215 / 785
user 216 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 217 / 785
user 218 / 785
user 219 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 220 / 785
user 221 / 785
user 222 / 785
user 223 / 785
user 224 / 785
user 225 / 785
user 226 / 785
user 227 / 785
user 228 / 785
user 229 / 785
user 230 / 785
user 231 / 785
user 232 / 785
user 233 / 785
user 234 / 785
user 235 / 785
user 236 / 785
user 237 / 785
user 238 / 785
user 239 / 785
user 240 / 785
user 241 / 785
user 242 / 785
user 243 / 785
user 244 / 785
user 245 / 785
user 246 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


user 247 / 785
user 248 / 785
user 249 / 785
user 250 / 785
user 251 / 785
user 252 / 785
user 253 / 785
user 254 / 785
user 255 / 785
user 256 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


user 257 / 785
user 258 / 785
user 259 / 785
user 260 / 785
user 261 / 785
user 262 / 785
user 263 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 264 / 785
user 265 / 785
user 266 / 785
user 267 / 785
user 268 / 785
user 269 / 785
user 270 / 785
user 271 / 785
user 272 / 785
user 273 / 785
user 274 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 275 / 785
user 276 / 785
user 277 / 785
user 278 / 785
user 279 / 785
user 280 / 785
user 281 / 785
user 282 / 785
user 283 / 785
user 284 / 785
user 285 / 785
user 286 / 785
user 287 / 785
user 288 / 785
user 289 / 785
user 290 / 785
user 291 / 785
user 292 / 785
user 293 / 785
user 294 / 785
user 295 / 785
user 296 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


user 297 / 785
user 298 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


user 299 / 785
user 300 / 785
user 301 / 785
user 302 / 785
user 303 / 785
user 304 / 785
user 305 / 785
user 306 / 785
user 307 / 785
user 308 / 785
user 309 / 785
user 310 / 785
user 311 / 785
user 312 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


user 313 / 785
user 314 / 785
user 315 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


user 316 / 785
user 317 / 785
user 318 / 785
user 319 / 785
user 320 / 785
user 321 / 785
user 322 / 785
user 323 / 785
user 324 / 785
user 325 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 326 / 785
user 327 / 785
user 328 / 785
user 329 / 785
user 330 / 785
user 331 / 785
user 332 / 785
user 333 / 785
user 334 / 785
user 335 / 785
user 336 / 785
user 337 / 785
user 338 / 785
user 339 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 340 / 785
user 341 / 785
user 342 / 785
user 343 / 785
user 344 / 785
user 345 / 785
user 346 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 347 / 785
user 348 / 785
user 349 / 785
user 350 / 785
user 351 / 785
user 352 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 353 / 785
user 354 / 785
user 355 / 785
user 356 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 357 / 785
user 358 / 785
user 359 / 785
user 360 / 785
user 361 / 785
user 362 / 785
user 363 / 785
user 364 / 785
user 365 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


user 366 / 785
user 367 / 785
user 368 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 369 / 785
user 370 / 785
user 371 / 785
user 372 / 785
user 373 / 785
user 374 / 785
user 375 / 785
user 376 / 785
user 377 / 785
user 378 / 785
user 379 / 785
user 380 / 785
user 381 / 785
user 382 / 785
user 383 / 785
user 384 / 785
user 385 / 785
user 386 / 785
user 387 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


user 388 / 785
user 389 / 785
user 390 / 785
user 391 / 785
user 392 / 785
user 393 / 785
user 394 / 785
user 395 / 785
user 396 / 785
user 397 / 785
user 398 / 785
user 399 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


user 400 / 785
user 401 / 785
user 402 / 785
user 403 / 785
user 404 / 785
user 405 / 785
user 406 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 407 / 785
user 408 / 785
user 409 / 785
user 410 / 785
user 411 / 785
user 412 / 785
user 413 / 785
user 414 / 785
user 415 / 785
user 416 / 785
user 417 / 785
user 418 / 785
user 419 / 785
user 420 / 785
user 421 / 785
user 422 / 785
user 423 / 785
user 424 / 785
user 425 / 785
user 426 / 785
user 427 / 785
user 428 / 785
user 429 / 785
user 430 / 785
user 431 / 785
user 432 / 785
user 433 / 785
user 434 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 435 / 785
user 436 / 785
user 437 / 785
user 438 / 785
user 439 / 785
user 440 / 785
user 441 / 785
user 442 / 785
user 443 / 785
user 444 / 785
user 445 / 785
user 446 / 785
user 447 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


user 448 / 785
user 449 / 785
user 450 / 785
user 451 / 785
user 452 / 785
user 453 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 454 / 785
user 455 / 785
user 456 / 785
user 457 / 785
user 458 / 785
user 459 / 785
user 460 / 785
user 461 / 785
user 462 / 785
user 463 / 785
user 464 / 785
user 465 / 785
user 466 / 785
user 467 / 785
user 468 / 785
user 469 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 470 / 785
user 471 / 785
user 472 / 785
user 473 / 785
user 474 / 785
user 475 / 785
user 476 / 785
user 477 / 785
user 478 / 785
user 479 / 785
user 480 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 481 / 785
user 482 / 785
user 483 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 484 / 785
user 485 / 785
user 486 / 785
user 487 / 785
user 488 / 785
user 489 / 785
user 490 / 785
user 491 / 785
user 492 / 785
user 493 / 785
user 494 / 785
user 495 / 785
user 496 / 785
user 497 / 785
user 498 / 785
user 499 / 785
user 500 / 785
user 501 / 785
user 502 / 785
user 503 / 785
user 504 / 785
user 505 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


user 506 / 785
user 507 / 785
user 508 / 785
user 509 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 510 / 785
user 511 / 785
user 512 / 785
user 513 / 785
user 514 / 785
user 515 / 785
user 516 / 785
user 517 / 785
user 518 / 785
user 519 / 785
user 520 / 785
user 521 / 785
user 522 / 785
user 523 / 785
user 524 / 785
user 525 / 785
user 526 / 785
user 527 / 785
user 528 / 785
user 529 / 785
user 530 / 785
user 531 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


user 532 / 785
user 533 / 785
user 534 / 785
user 535 / 785
user 536 / 785
user 537 / 785
user 538 / 785
user 539 / 785
user 540 / 785
user 541 / 785
user 542 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


user 543 / 785
user 544 / 785
user 545 / 785
user 546 / 785
user 547 / 785
user 548 / 785
user 549 / 785
user 550 / 785
user 551 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 552 / 785
user 553 / 785
user 554 / 785
user 555 / 785
user 556 / 785
user 557 / 785
user 558 / 785
user 559 / 785
user 560 / 785
user 561 / 785
user 562 / 785
user 563 / 785
user 564 / 785
user 565 / 785
user 566 / 785
user 567 / 785
user 568 / 785
user 569 / 785
user 570 / 785
user 571 / 785
user 572 / 785
user 573 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 574 / 785
user 575 / 785
user 576 / 785
user 577 / 785
user 578 / 785
user 579 / 785
user 580 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 581 / 785
user 582 / 785
user 583 / 785
user 584 / 785
user 585 / 785
user 586 / 785
user 587 / 785
user 588 / 785
user 589 / 785
user 590 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 591 / 785
user 592 / 785
user 593 / 785
user 594 / 785
user 595 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 596 / 785
user 597 / 785
user 598 / 785
user 599 / 785
user 600 / 785
user 601 / 785
user 602 / 785
user 603 / 785
user 604 / 785
user 605 / 785
user 606 / 785


  _warn_prf(average, modifier, msg_start, len(result))


user 607 / 785
user 608 / 785
user 609 / 785
user 610 / 785


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Precision:	0.7211606163358072 +/- 0.21224569422976247
Recall:	0.8231089631594488 +/- 0.2585975087536438
Fscore:	0.7390784747655503 +/- 0.21644513443714947


We can see some warnings. The *precision_recall_fscore_support* method gives us a warning when the metrics cannot be computed. This happens, for instance, when the system does not retrieve any element as relevant. In this case, for instance, precision and fscore are undefined. In such cases we should check what is the problem for the specific user. For this tutorial, we will just ignore them.

#### Evaluation of a Collaborative Filtering Recommender System

We see now an example of evaluation of a collaborative filtering approach. We will use again the Lenskit library for this.
Since we have one single model for all the users, the evaluation is simpler: we can just consider all the ratings we have and split them into training and test set.; then, we train the recommender on the trainig set, and we use it to predict the results on the test set. We will do it using the [predict](https://lkpy.readthedocs.io/en/stable/interfaces.html#lenskit.Predictor.predict) method. Finally, we compute the relevance values from the original ratings and from the predicted ratings, and we compare them to obtain precision, recall and fscore values.

Note that we will split our dataset using again the *train_test_split* method. In this case, we pass a Dataframe as argument. The method will return two dataframes, randomly splitting the rows between them.

In [193]:
from lenskit.algorithms import Recommender
from lenskit.algorithms.user_knn import UserUser
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

train_df, test_df = train_test_split(rating_lenskit_df, test_size=0.2)

user_user = UserUser(15, min_nbrs=3)  # Minimum (3) and maximum (15) number of neighbors to consider
recsys = Recommender.adapt(user_user)
recsys.fit(train_df)

test_df['predicted_rating'] = recsys.predict(test_df)

test_df['relevant'] = test_df['rating'].apply(lambda x: 1 if x>3 else 0)
test_df['predicted_relevant'] = test_df['predicted_rating'].apply(lambda x: 1 if x>3 else 0)

y_test = list(test_df['relevant'])
y_pred = list(test_df['predicted_relevant'])
precision, recall, fscore, _ = precision_recall_fscore_support(y_test, y_pred, average="binary")

print("Precision:\t" + str(precision) + 
      "\nRecall:\t" + str(recall) + 
      "\nFscore:\t" + str(fscore))


Precision:	0.7233366294674389
Recall:	0.8290293855743545
Fscore:	0.7725849647316209


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  from ipykernel import kernelapp as app


We can ignore the warnings. 

Again, we have an evaluation that is strictly related to the split performed. A proper evaluation should consider more reliable validation strategy (any ideas?). For timing reasons, we used the simplest approach. Try to implement some other approaches and check the differences (let us know if you have any problems in this).

### Group Recommender Systems

Now, let's focus on Group Recommendations. For this, we need a group! For simplicity, we will create a random group by selecting 5 users in our dataset. We will select 5 users for which we have at least 200 evaluations.

In [201]:
users_ratings = rating_lenskit_df.groupby(['user']).count()
selected = users_ratings['rating'] > 200
selected_users = users_ratings.loc[selected]
random_selected = selected_users.sample(n=5) # sample() returns now n random rows from the dataframe. The returned object is a dataframe with five rows. 
select_column_df = random_selected.reset_index()['user'] # reset_index() create a new index, and the userId became a column. Then, we can filter using the column name
group_users = list(select_column_df) # iloc select by index, since our dataframe only has one row we read it from the index 0
print(group_users)

[517, 104, 603, 330, 294]


Let's assume we want to recommend to this group a list of 10 movies that nobody in the group has seen yet. We first need to determine the list of possible candidates. For simplicity, we will only consider movies for which we more then 10 evaluations.

In [202]:
group_ratings = rating_lenskit_df.loc[rating_lenskit_df['user'].isin(group_users)]
all_movies = set(movies_lenskit_df.index.tolist())
num_ratings_df = rating_lenskit_df.groupby(['item']).count()
considered_movies = set(num_ratings_df.loc[num_ratings_df['user'] > 10].reset_index()['item'])

group_seen_movies = set(group_ratings['item'].tolist())
group_unseen_movies = considered_movies - group_seen_movies

print(len(all_movies))
print(len(considered_movies))
print(len(group_seen_movies))
print(len(group_unseen_movies))

9742
2121
1690
1068


Now, we need to evaluate individuals' preverences for the unseen movies. To do so, we use the Lenskit library. We will use the same CF recommender used in the previous example. To generate the Dataframe with user-item pairs to pass as input in the *predict* function, we use the [product](https://docs.python.org/3/library/itertools.html#itertools.product) method of the itertools library, which takes as imput two lists and returns all the possible combinations between elements of the two lists. This is passed as input for the Dataframe constructor, which will then generate a Dataframe containing a pair on each row.

In [203]:
import itertools

recsys = Recommender.adapt(user_user)
recsys.fit(rating_lenskit_df)
group_unseen_df = pd.DataFrame(list(itertools.product(group_users, group_unseen_movies)), columns=['user', 'item'])
group_unseen_df['predicted_rating'] = recsys.predict(group_unseen_df)
display(group_unseen_df)

Unnamed: 0,user,item,predicted_rating
0,517,122882,2.489024
1,517,5,2.069946
2,517,122886,2.331179
3,517,7,2.245272
4,517,9,1.767936
...,...,...,...
5335,294,4085,3.030146
5336,294,106487,2.842899
5337,294,106489,2.712835
5338,294,55290,3.119811


We have now our predicted ratings.
We can apply an aggregation strategy to generate the group recommendations.

In [204]:
# Additive strategy

additive_df = group_unseen_df.groupby('item').sum()
additive_df = additive_df.join(movies_lenskit_df['title_clean'], on='item')
additive_df = additive_df.sort_values(by="predicted_rating", ascending=False).reset_index()[['item', 'title_clean', 'predicted_rating']]
display(additive_df.head(10))

Unnamed: 0,item,title_clean,predicted_rating
0,3451,guess who's coming to dinner,21.530129
1,720,wallace & gromit: the best of aardman animation,20.628406
2,1178,paths of glory,20.487025
3,3030,yojimbo,20.271477
4,1411,hamlet,20.130886
5,741,ghost in the shell (kôkaku kidôtai),19.855466
6,951,his girl friday,19.738923
7,905,it happened one night,19.614492
8,1277,cyrano de bergerac,19.489525
9,1243,rosencrantz and guildenstern are dead,19.479057


In [205]:
# least misery

least_misery_df = group_unseen_df.groupby('item').min()
least_misery_df = least_misery_df.join(movies_lenskit_df['title_clean'], on='item')
least_misery_df = least_misery_df.sort_values(by="predicted_rating", ascending=False).reset_index()[['item', 'title_clean', 'predicted_rating']]
display(least_misery_df.head(10))

Unnamed: 0,item,title_clean,predicted_rating
0,3451,guess who's coming to dinner,3.617363
1,720,wallace & gromit: the best of aardman animation,3.561745
2,3198,papillon,3.404474
3,1277,cyrano de bergerac,3.288948
4,64716,seven pounds,3.259776
5,6283,cowboy bebop: the movie (cowboy bebop: tengoku...,3.253743
6,78499,toy story 3,3.233489
7,3328,ghost dog: the way of the samurai,3.228569
8,1411,hamlet,3.220505
9,2542,"lock, stock & two smoking barrels",3.189256


In [209]:
# most pleasure

most_pleasure_df = group_unseen_df.groupby('item').max()
most_pleasure_df = most_pleasure_df.join(movies_lenskit_df['title_clean'], on='item')
most_pleasure_df = most_pleasure_df.sort_values(by="predicted_rating", ascending=False).reset_index()[['item', 'title_clean', 'predicted_rating']]
display(most_pleasure_df.head(10))

Unnamed: 0,item,title_clean,predicted_rating
0,3451,guess who's coming to dinner,4.894335
1,3030,yojimbo,4.857929
2,1178,paths of glory,4.705606
3,106100,dallas buyers club,4.6787
4,1411,hamlet,4.674026
5,2360,"celebration, the (festen)",4.661257
6,720,wallace & gromit: the best of aardman animation,4.551402
7,905,it happened one night,4.540263
8,77455,exit through the gift shop,4.520232
9,140110,the intern,4.49255


In [210]:

group_unseen_df.loc[group_unseen_df['item']==140110]

Unnamed: 0,user,item,predicted_rating
496,517,140110,2.608021
1564,104,140110,4.49255
2632,603,140110,3.275559
3700,330,140110,3.616939
4768,294,140110,2.115635
