# Baseline model
## Recommendations on Popularity
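In this notebook we build a recommendation list based on the popularity of items.
We rank the items by the number of ratings they received and by their median rating. Spearman's rank correlation, which measures how well two rankings agree, is prepared near the end to compare these two signals.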


## Load data

In [152]:
# import libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

from scipy.stats import spearmanr

RSEED = 42

In [153]:
# load the data
df_movies = pd.read_csv("../data/ml-latest-small/movies.csv")
df_ratings = pd.read_csv("../data/ml-latest-small/ratings.csv")

In [154]:
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [155]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [156]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


## Popularity based on interaction counts

In [157]:
# count the ratings per movie (value_counts sorts in descending order) and keep the movie IDs as an array
top_counts = df_ratings \
    .movieId.value_counts() \
    .index.values

In [158]:
# slice the top10
top10_counts = top_counts[:10]
top10_counts

array([ 356,  318,  296,  593, 2571,  260,  480,  110,  589,  527])

In [159]:
def get_recommendations(df_movies, top_series):
    '''Match the movie IDs in top_series to the titles in df_movies and return a dataframe.

    Rows come back in df_movies order (sorted by movieId), not in rank order.'''
    rec_df = df_movies[df_movies["movieId"].isin(top_series)][["movieId", "title"]]
    return rec_df

In [160]:
# look up the titles of the top-10 movie IDs in df_movies
get_recommendations(df_movies, top10_counts)

Unnamed: 0,movieId,title
97,110,Braveheart (1995)
224,260,Star Wars: Episode IV - A New Hope (1977)
257,296,Pulp Fiction (1994)
277,318,"Shawshank Redemption, The (1994)"
314,356,Forrest Gump (1994)
418,480,Jurassic Park (1993)
461,527,Schindler's List (1993)
507,589,Terminator 2: Judgment Day (1991)
510,593,"Silence of the Lambs, The (1991)"
1939,2571,"Matrix, The (1999)"
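Because `isin` keeps the row order of `df_movies`, the table above is sorted by `movieId` rather than by popularity rank. A minimal sketch of an order-preserving lookup follows. The helper name `get_ranked_recommendations` and the toy frame are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def get_ranked_recommendations(df_movies, top_ids):
    """Return movieId and title for top_ids, keeping the order of top_ids.

    Hypothetical helper: a left merge preserves the key order of the left frame.
    """
    order = pd.DataFrame({"movieId": list(top_ids)})
    return order.merge(df_movies[["movieId", "title"]], on="movieId", how="left")

# toy stand-in for df_movies (four of the titles shown above)
toy_movies = pd.DataFrame({
    "movieId": [110, 296, 318, 356],
    "title": [
        "Braveheart (1995)",
        "Pulp Fiction (1994)",
        "Shawshank Redemption, The (1994)",
        "Forrest Gump (1994)",
    ],
})

# IDs listed from most to least rated
ranked = get_ranked_recommendations(toy_movies, [356, 318, 296, 110])
print(ranked)
```

On the real data, the same call with `df_movies` and `top10_counts` should list the titles from most to least rated.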



## Popularity based on positive ratings

We define the popularity of a movie as its average rating.
As the average we use the median rather than the mean, because it is more robust to outliers.
+ Overall Mean (ratings): 3.5
+ Overall Median (ratings): 3.5
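To illustrate the robustness argument, here is a small sketch on made-up ratings (not taken from the dataset): a single very low rating pulls the mean down, while the median stays where most ratings are.

```python
import numpy as np

# toy ratings for one movie; the last value is an outlier
ratings = np.array([4.0, 4.0, 4.5, 4.0, 0.5])

mean_rating = ratings.mean()        # shifted towards the outlier
median_rating = np.median(ratings)  # unaffected by the single outlier

print(mean_rating, median_rating)
```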


In [161]:
# calculate the median for every movie
top_rated = df_ratings[["movieId", "rating"]] \
    .groupby("movieId") \
    .median() \
    .sort_values(by="rating", ascending=False) \
    .reset_index()

In [162]:
top_rated.head()

Unnamed: 0,movieId,rating
0,3942,5.0
1,147250,5.0
2,115122,5.0
3,86237,5.0
4,1151,5.0


In [163]:
# slice the top10
top10_rated = top_rated.movieId[:10]
top10_rated

0      3942
1    147250
2    115122
3     86237
4      1151
5    146662
6    114265
7    146684
8      3851
9      5416
Name: movieId, dtype: int64

### Top 10 highest rated movies

In [164]:
# look up the titles of the top-10 rated movie IDs in df_movies
get_recommendations(df_movies, top10_rated)

Unnamed: 0,movieId,title
870,1151,Lesson Faust (1994)
2880,3851,I'm the One That I Want (2000)
2939,3942,Sorority House Massacre II (1990)
3852,5416,Cherish (2002)
7581,86237,Connections (1978)
8516,114265,Laggies (2014)
8536,115122,What We Do in the Shadows (2014)
9129,146662,Dragons: Gift of the Night Fury (2011)
9131,146684,Cosmic Scrat-tastrophe (2015)
9138,147250,The Adventures of Sherlock Holmes and Doctor W...


__*OBS*__: The top 10 rated movies also include movies that were rated by only a few users. To get popular movies, the number of ratings has to be taken into account, too.
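One simple way to take the number of ratings into account is to require a minimum count before a movie can be ranked by its median. The sketch below uses made-up data, and the threshold `MIN_RATINGS` is a hypothetical value chosen for illustration. It is an alternative to, not a description of, the sorting approach used in the next section.

```python
import pandas as pd

# toy ratings: movie 2 has the highest rating but only a single vote
toy_ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 3, 3],
    "rating":  [4.0, 4.5, 4.0, 5.0, 3.0, 3.5],
})

MIN_RATINGS = 2  # hypothetical threshold, to be tuned on real data

# number of ratings and median rating per movie
stats = (
    toy_ratings.groupby("movieId")["rating"]
    .agg(n_ratings="count", median_rating="median")
    .reset_index()
)

# keep only movies with enough ratings, then rank by median
filtered = (
    stats[stats["n_ratings"] >= MIN_RATINGS]
    .sort_values("median_rating", ascending=False)
    .reset_index(drop=True)
)
print(filtered)
```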

## Popularity based on positive ratings and interaction counts

In [165]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [166]:
top_pop = df_ratings[["movieId", "rating"]] \
    .groupby("movieId") \
    .count() \
    .sort_values(by="rating", ascending=False) \
    .reset_index()

In [168]:
# merge the rating counts with the medians, then sort by number of ratings (ties broken by median rating)
top_pop = top_pop \
    .merge(top_rated, on="movieId", how="inner") \
    .rename(columns={'rating_x':'n_ratings', 'rating_y':'median_ratings'}) \
    .sort_values(["n_ratings", "median_ratings"], ascending=False)
top_pop

Unnamed: 0,movieId,n_ratings,median_ratings
0,356,329,4.0
1,318,317,4.5
2,296,307,4.5
3,593,279,4.0
4,2571,278,4.5
...,...,...,...
9516,60363,1,0.5
9550,65350,1,0.5
9600,54934,1,0.5
9663,57326,1,0.5


In [169]:
top10_pop = top_pop.movieId[:10]
top10_pop

0     356
1     318
2     296
3     593
4    2571
5     260
6     480
7     110
8     589
9     527
Name: movieId, dtype: int64

In [170]:
get_recommendations(df_movies, top10_pop)

Unnamed: 0,movieId,title
97,110,Braveheart (1995)
224,260,Star Wars: Episode IV - A New Hope (1977)
257,296,Pulp Fiction (1994)
277,318,"Shawshank Redemption, The (1994)"
314,356,Forrest Gump (1994)
418,480,Jurassic Park (1993)
461,527,Schindler's List (1993)
507,589,Terminator 2: Judgment Day (1991)
510,593,"Silence of the Lambs, The (1991)"
1939,2571,"Matrix, The (1999)"


## Spearman correlation

In [186]:
n_ratings = np.array(top_pop.n_ratings)
median_ratings = np.array(top_pop.median_ratings)

In [None]:
#spearmanr()
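The call above is left unfinished in the notebook. As a sketch of how `scipy.stats.spearmanr` could be applied, the example below uses only the first five `n_ratings` and `median_ratings` values shown in the `top_pop` output. The variable names with the `_toy` suffix are assumptions, and the result on five rows says nothing about the full dataset.

```python
import numpy as np
from scipy.stats import spearmanr

# first five rows of top_pop as displayed earlier in the notebook
n_ratings_toy = np.array([329, 317, 307, 279, 278])
median_ratings_toy = np.array([4.0, 4.5, 4.5, 4.0, 4.5])

# rho is the rank correlation in [-1, 1]; p_value tests the null hypothesis of no association
rho, p_value = spearmanr(n_ratings_toy, median_ratings_toy)
print(rho, p_value)
```

On the full data, the analogous call would presumably be `spearmanr(n_ratings, median_ratings)` with the arrays defined in the previous cell.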

## Calculate the error of the baseline model

We predict that every user rates a movie with that movie's median rating. Based on this, we calculate the error between the predicted rating (median) and the true rating.

In [178]:
error = df_ratings \
    .merge(top_rated, on="movieId", how="inner") \
    .rename(columns={'rating_x':'true_rating', 'rating_y':'pred_rating'}) 

In [187]:
error.head()

Unnamed: 0,userId,movieId,true_rating,timestamp,pred_rating
0,1,1,4.0,964982703,4.0
1,5,1,4.0,847434962,4.0
2,7,1,4.5,1106635946,4.0
3,15,1,2.5,1510577970,4.0
4,17,1,4.5,1305696483,4.0


In [180]:
# check out one user
error.loc[error["userId"] == 101]

Unnamed: 0,userId,movieId,true_rating,timestamp,pred_rating
1278,101,223,4.0,968440895,4.00
6591,101,1127,4.0,968440828,3.50
7594,101,1210,4.0,968440698,4.00
12132,101,2395,5.0,968441198,3.50
12397,101,2492,1.0,968440828,4.00
...,...,...,...,...,...
85451,101,2318,5.0,968441029,4.25
87634,101,2600,4.0,968443749,4.00
87847,101,233,4.0,968440983,3.50
87859,101,2337,3.0,968443714,3.25


## Evaluation
Calculate the RMSE, MSE, and MAE between the true and the predicted ratings.

In [181]:
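# RMSE, computed as the square root of the MSE
# (the `squared=False` argument is deprecated in newer scikit-learn releases)
y_true = error.true_rating
y_pred = error.pred_rating
np.sqrt(mean_squared_error(y_true, y_pred))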

0.9031690748700261

In [182]:
# MSE
mean_squared_error(y_true, y_pred)

0.8157143778015788

In [183]:
# MAE
mean_absolute_error(y_true, y_pred)

0.6401037327938435
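These errors are measured on the same ratings that were used to compute the medians. The notebook imports `train_test_split` but does not use it. A minimal sketch of a held-out evaluation on made-up data follows; falling back to the global median for movies that are missing from the training part is an assumption of this sketch, not something the notebook specifies.

```python
import numpy as np
import pandas as pd

# toy split: medians are fitted on `train` only and scored on `test`
train = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2],
    "rating":  [4.0, 4.0, 5.0, 3.0, 2.0],
})
test = pd.DataFrame({
    "movieId": [1, 2, 3],  # movie 3 never appears in train
    "rating":  [4.5, 3.0, 4.0],
})

movie_median = train.groupby("movieId")["rating"].median()
global_median = train["rating"].median()  # assumed fallback for unseen movies

# predict the per-movie median, or the global median if the movie is unseen
pred = test["movieId"].map(movie_median).fillna(global_median)

rmse = float(np.sqrt(((test["rating"] - pred) ** 2).mean()))
print(rmse)
```

On the real data, the two frames could come from `train_test_split(df_ratings, random_state=RSEED)`, which reuses the seed defined at the top of the notebook.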