# Recommender System based on MovieLens dataset
---

This notebook implements a recommender system based on the MovieLens dataset, which is available at http://files.grouplens.org/datasets/movielens/ml-latest-small.zip.

The dataset contains four files: `links.csv`, `movies.csv`, `ratings.csv`, and `tags.csv`.

The structure of the ratings, tags, and movies files is described below.

#### Ratings Data File Structure (ratings.csv)

All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

    userId, movieId, rating, timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
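As a small illustration of the format above, the following sketch parses a few rows in the `ratings.csv` layout and converts the `timestamp` column to UTC datetimes. The inline sample and the `rated_at` column name are assumptions introduced for this example only.

```python
import io

import pandas as pd

# Three sample rows in the ratings.csv layout (header row included)
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,964982703\n"
    "1,3,4.0,964981247\n"
    "1,6,4.0,964982224\n"
)
sample_ratings = pd.read_csv(sample_csv)

# Timestamps are seconds since the Unix epoch, so unit="s" applies;
# utc=True makes the resulting datetimes timezone-aware.
# "rated_at" is a hypothetical column name used only in this sketch.
sample_ratings["rated_at"] = pd.to_datetime(
    sample_ratings["timestamp"], unit="s", utc=True
)

# Check that every rating lies on the half-star scale 0.5, 1.0, ..., 5.0
valid_ratings = sample_ratings["rating"].between(0.5, 5.0) & (
    (sample_ratings["rating"] * 2) % 1 == 0
)

print(sample_ratings)
print(valid_ratings.all())
```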

#### Tags Data File Structure (tags.csv)

All tags are contained in the file `tags.csv`. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:

    userId, movieId, tag, timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

#### Movies Data File Structure (movies.csv)

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId, title, genres

Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.

Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children's
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)
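The two conventions described above (a release year in parentheses and pipe-separated genres) can be unpacked roughly as follows. This is a minimal sketch: the rows are made up for illustration, and the `genre_list` and `year` column names are assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical rows in the movies.csv layout (not taken from the dataset)
sample_movies = pd.DataFrame(
    {
        "movieId": [1, 2, 3],
        "title": ["First Example (1995)", "Second Example (2001)", "Untitled Example"],
        "genres": ["Action|Crime|Thriller", "(no genres listed)", "Drama"],
    }
)

# Split the pipe-separated genres into Python lists
sample_movies["genre_list"] = sample_movies["genres"].str.split("|")

# Extract a four-digit year in parentheses at the end of the title.
# Titles without one yield NaN, since inconsistencies may exist.
sample_movies["year"] = pd.to_numeric(
    sample_movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
)

print(sample_movies[["title", "genre_list", "year"]])
```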

## Data preprocessing

We start by reading the `ratings.csv` file into `ratings_df`, a pandas DataFrame.
From it we construct the `utility_matrix` (one row per user, one column per movie), which we use below to build the recommender system.

In [1]:
# Import ALL the needed libraries for the project
import pandas as pd 
import numpy as np

In [2]:
ratings_df = pd.read_csv('data/ratings.csv')
ratings_df.head()

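|   | userId | movieId | rating | timestamp |
|---|--------|---------|--------|-----------|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |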


In [3]:
utility_matrix = ratings_df.pivot(index='userId', columns='movieId', values='rating')
utility_matrix.head()

| userId (rows) / movieId (columns) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 193565 | 193567 | 193571 | 193573 | 193579 | 193581 | 193583 | 193585 | 193587 | 193609 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.0 | NaN | 4.0 | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
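To make the shape of the utility matrix concrete, here is a toy sketch of the same `pivot` call on a handful of made-up ratings, together with the fraction of missing cells. The `toy_ratings` frame and its values are assumptions for illustration. Note that `DataFrame.pivot` raises a `ValueError` if the same (userId, movieId) pair occurs more than once.

```python
import pandas as pd

# Hypothetical ratings: three users, three movies, five ratings in total
toy_ratings = pd.DataFrame(
    {
        "userId": [1, 1, 2, 3, 3],
        "movieId": [10, 20, 10, 20, 30],
        "rating": [4.0, 5.0, 3.0, 2.5, 4.5],
    }
)

# One row per user, one column per movie; unrated pairs become NaN
toy_utility = toy_ratings.pivot(index="userId", columns="movieId", values="rating")

# Fraction of (user, movie) cells that have no rating
sparsity = toy_utility.isna().to_numpy().mean()

print(toy_utility)
print(f"sparsity: {sparsity:.2f}")
```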


Since we want to find similar users, we need a similarity measure.

In this case we apply the `Pearson Correlation` (a.k.a. Centered Cosine), which allows us to treat missing ratings as "average" and to handle both "tough raters" (users who tend to give low ratings) and "easy raters" (users who tend to give high ratings).
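The link between Pearson correlation and centered cosine can be checked numerically. The sketch below assumes two hypothetical users who rated the same five movies, so there are no missing values; `cosine_similarity` is a helper defined here, not a library function. With missing ratings filled in as "average", the centered cosine is instead taken over all movies rather than only the co-rated ones.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical ratings of two users on the same five movies
user_a = np.array([5.0, 4.0, 4.5, 3.0, 5.0])  # "easy rater"
user_b = np.array([3.0, 2.0, 2.5, 1.0, 3.0])  # "tough rater": same tastes, 2 stars lower

# Centering removes each user's personal offset before comparing
centered_cosine = cosine_similarity(user_a - user_a.mean(), user_b - user_b.mean())
pearson = float(np.corrcoef(user_a, user_b)[0, 1])
raw_cosine = cosine_similarity(user_a, user_b)

print(centered_cosine, pearson, raw_cosine)
```

On complete vectors the centered cosine and the Pearson correlation coincide, while the raw cosine is lower because it is affected by the difference in rating levels.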

To do that, we compute the mean rating of each row (user) and assign this value to that user's missing ratings. Then we subtract the mean from each of the user's ratings, obtaining what we call a "centered rating" (centered around 0). Negative values are ratings below the user's average, while positive values are ratings above it.
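The centering steps can be traced on a toy matrix. This is a sketch on made-up values, and the `toy_`-prefixed names are assumptions. Subtracting the mean first and then filling the gaps with 0 gives the same result as filling with the user's mean and then subtracting it.

```python
import numpy as np
import pandas as pd

# Hypothetical 3 x 3 utility matrix with one missing rating per user
toy_utility = pd.DataFrame(
    [[5.0, np.nan, 4.0], [np.nan, 2.0, 1.0], [3.0, 3.0, np.nan]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)

# Per-user mean over the observed ratings only: 4.5, 1.5, 3.0
toy_row_mean = toy_utility.mean(axis=1)

# Subtract each user's mean row-wise; a missing rating counts as "average",
# which is exactly 0 after centering
toy_centered = toy_utility.sub(toy_row_mean, axis=0).fillna(0.0)

print(toy_centered)
```

User 1's ratings of 5.0 and 4.0 become 0.5 and -0.5, and the missing rating becomes 0.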

In [4]:
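# Mean rating of each user (row); NaN entries are skipped
row_mean = utility_matrix.mean(axis=1)
# Subtract each user's mean from their ratings, then set the missing ratings
# to 0, which is the value the user's mean takes after centering
centered_utility_matrix = utility_matrix.sub(row_mean, axis=0).fillna(0)
centered_utility_matrix.head()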


The `centered_utility_matrix` now has no missing values, and each user's ratings are centered around zero.

The per-user mean ratings stored in `row_mean` are:

userId
1      4.366379
2      3.948276
3      2.435897
4      3.555556
5      3.636364
         ...   
606    3.657399
607    3.786096
608    3.134176
609    3.270270
610    3.688556
Length: 610, dtype: float64