# Machine Learning - Project 2 - 2018
## Fatine Benhsain - Tabish Qureshi - Ayyoub El Amrani
### Recommender System

# 1. Introduction

The goal of the project is to create a recommendation system for movies based on user rating data.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.base import TransformerMixin
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import datasets
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, RobustScaler, QuantileTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge, RidgeCV, Lasso
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer
from sklearn.ensemble import RandomForestRegressor
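# sklearn.grid_search was removed in scikit-learn 0.20; model_selection provides GridSearchCV
from sklearn import model_selection as grid_search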
from sklearn.metrics import r2_score
from sklearn.model_selection import ShuffleSplit
from sklearn.ensemble import GradientBoostingRegressor

import warnings
warnings.filterwarnings('ignore')



The pandas library is used to preprocess and manipulate the data. It provides very useful tools for this kind of tabular work.

# 2. Data Importation & Preprocessing

This step imports the data and prepares it before a machine learning method is applied. Preprocessing includes steps such as exploration and data wrangling, and can also be used for feature engineering.

## 2.1 Training Set Importation

In [2]:
raw_data = pd.read_csv('data_train.csv')

Before anything else, the data is explored in order to guide data wrangling and feature engineering.

In [3]:
raw_data['Prediction'].value_counts()

5    435237
4    324700
3    274327
2     99180
1     43508
Name: Prediction, dtype: int64
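The counts above can be summarized as shares and an overall mean rating. The following is a minimal sketch that copies the printed counts into a `Series` by hand (the variable names are illustrative and not part of the original notebook):

```python
import pandas as pd

# Rating counts copied from the value_counts() output above.
counts = pd.Series({5: 435237, 4: 324700, 3: 274327, 2: 99180, 1: 43508})

total = counts.sum()                  # total number of ratings
shares = (counts / total).round(3)    # fraction of ratings per score

# Weighted mean: sum(score * count) / total
mean_rating = (counts.index.to_numpy() * counts.to_numpy()).sum() / total

print(total)
print(shares)
print(round(mean_rating, 3))
```

The distribution is skewed towards high scores: ratings of 4 and 5 together account for well over half of the entries, which is worth keeping in mind when comparing models against a constant baseline.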

In [4]:
raw_data.head()

|   | Id | Prediction |
|---|---|---|
| 0 | r44_c1 | 4 |
| 1 | r61_c1 | 3 |
| 2 | r67_c1 | 4 |
| 3 | r72_c1 | 3 |
| 4 | r86_c1 | 5 |


The data are stacked into two columns, with an Id of the form rX_cY, where X identifies a user and Y identifies a movie.
To analyze the data properly, the ratings (Prediction) must be grouped by user (same X) and by movie (Y).
This requires two steps:
1. Split the Id to separate X and Y.
2. Arrange the users (X) as rows and the movies (Y) as columns, with the rating as the value of each cell.
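The two steps can be sketched on a toy frame before applying them to the full dataset. This is an illustration only: the four rows below are made up, and `toy`, `ids` and `toy_table` are hypothetical names. It assumes every Id strictly follows the `rX_cY` pattern.

```python
import pandas as pd

# Toy data in the same rX_cY layout (values are made up for illustration).
toy = pd.DataFrame({
    'Id': ['r1_c1', 'r1_c2', 'r2_c1', 'r3_c2'],
    'Prediction': [4, 5, 3, 2],
})

# Step 1: extract the user (X) and movie (Y) numbers from the Id.
ids = toy['Id'].str.extract(r'r(\d+)_c(\d+)').astype(int)
ids.columns = ['User', 'Movie']
ids['eval'] = toy['Prediction']

# Step 2: one row per user, one column per movie, ratings as cell values.
toy_table = ids.pivot(index='User', columns='Movie', values='eval')
print(toy_table)
```

User/movie pairs without a rating show up as `NaN` in the pivoted table, which is why the rating columns become floats.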

In [5]:
# Split the Id (format rX_cY) into the user number X and the movie number Y.
# str.extract keeps only the two captured groups, so no column selection is needed.
split = raw_data['Id'].str.extract(r'r(\d+)_c(\d+)')
split.columns = ['User', 'Movie']
split.head()

|   | User | Movie |
|---|---|---|
| 0 | 44 | 1 |
| 1 | 61 | 1 |
| 2 | 67 | 1 |
| 3 | 72 | 1 |
| 4 | 86 | 1 |


In [6]:
split['eval']=raw_data['Prediction']

In [7]:
split['User'] = split['User'].astype(int)
split['Movie'] = split['Movie'].astype(int)
split.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1176952 entries, 0 to 1176951
Data columns (total 3 columns):
User     1176952 non-null int64
Movie    1176952 non-null int64
eval     1176952 non-null int64
dtypes: int64(3)
memory usage: 26.9 MB


Now that the splitting is done, a table is needed to match users with the movies they rated:

In [8]:
rating_table = split.pivot(index = 'User', columns = 'Movie', values = 'eval')

In [9]:
rating_table.head()

| User (rows) / Movie (columns) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 | 1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | 3.0 | NaN | 5.0 | NaN | 4.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 | 3.0 | 3.0 |
| 3 | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | 2.0 | NaN | NaN | NaN | 5.0 | NaN | 3.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN |


<b>The dataset is now more readable, and the analysis can start.</b>
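Most cells of the pivoted table are empty, and the fraction of filled cells can be measured directly. The sketch below defines a hypothetical helper, `rating_density`, and tries it on a toy table. The estimate for the real table assumes 10000 users and 1000 movies, which is an assumption to check against `rating_table.shape`.

```python
import numpy as np
import pandas as pd

def rating_density(table: pd.DataFrame) -> float:
    """Fraction of (user, movie) cells that hold a rating."""
    return table.notnull().sum().sum() / table.size

# Toy table: 2 users x 3 movies, 2 ratings in total.
toy_table = pd.DataFrame(
    [[4.0, np.nan, np.nan], [np.nan, 3.0, np.nan]],
    index=pd.Index([1, 2], name='User'),
    columns=pd.Index([1, 2, 3], name='Movie'),
)
print(rating_density(toy_table))

# Rough estimate for the real table, ASSUMING 10000 users x 1000 movies.
n_ratings = 1176952              # row count reported by split.info()
assumed_cells = 10000 * 1000     # assumption, not verified here
print(n_ratings / assumed_cells)
```

Under that assumption, roughly 12% of the cells would hold a rating, so any method applied to this table has to cope with a large majority of missing values.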

## 2.2 Data Preprocessing

### 2.2.1 Data Exploration

Exploring the data first gives an overview of the dataset.

For example, it gives an idea of:
* The most/least rated movies
* The best/worst rated movies
* ...

In [10]:
# Top 10 most-rated movies ('freq' counts the missing ratings, so the smallest values come first):
rating_table.isnull().describe().transpose().sort_values('freq').head(10)

| Movie | count | unique | top | freq |
|---|---|---|---|---|
| 178 | 10000 | 2 | True | 5410 |
| 608 | 10000 | 2 | True | 5510 |
| 594 | 10000 | 2 | True | 5597 |
| 6 | 10000 | 2 | True | 5653 |
| 156 | 10000 | 2 | True | 5687 |
| 596 | 10000 | 2 | True | 5723 |
| 46 | 10000 | 2 | True | 5742 |
| 668 | 10000 | 2 | True | 5798 |
| 256 | 10000 | 2 | True | 5894 |
| 60 | 10000 | 2 | True | 5906 |


In [11]:
# Top 10 least-rated movies (largest number of missing ratings first):
rating_table.isnull().describe().transpose().sort_values('freq', ascending = False).head(10)

| Movie | count | unique | top | freq |
|---|---|---|---|---|
| 955 | 10000 | 2 | True | 9992 |
| 928 | 10000 | 2 | True | 9981 |
| 468 | 10000 | 2 | True | 9973 |
| 784 | 10000 | 2 | True | 9973 |
| 946 | 10000 | 2 | True | 9967 |
| 709 | 10000 | 2 | True | 9962 |
| 758 | 10000 | 2 | True | 9959 |
| 243 | 10000 | 2 | True | 9955 |
| 957 | 10000 | 2 | True | 9954 |
| 41 | 10000 | 2 | True | 9952 |
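In the two `describe()` outputs above, `freq` is the number of missing ratings only because `top` is `True` for every listed movie, meaning more than half of the users did not rate it. For movie 178, for instance, this gives 10000 - 5410 = 4590 ratings. A more direct alternative is to count the non-null cells per column. The sketch below shows this on a made-up toy table; it is not the original notebook's code.

```python
import numpy as np
import pandas as pd

# Toy table: 4 users (rows) x 3 movies (columns), NaN = not rated.
toy_table = pd.DataFrame(
    {
        1: [4.0, np.nan, 5.0, 3.0],
        2: [np.nan, np.nan, 2.0, np.nan],
        3: [1.0, 2.0, np.nan, np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name='User'),
)
toy_table.columns.name = 'Movie'

# Number of ratings per movie, independent of which value is most frequent.
ratings_per_movie = toy_table.notnull().sum()

most_rated = ratings_per_movie.sort_values(ascending=False).head(2)
least_rated = ratings_per_movie.sort_values().head(2)
print(most_rated)
print(least_rated)
```

On the real data, the same pattern with `rating_table` in place of `toy_table` and `head(10)` would list actual rating counts rather than counts of missing values.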


In [20]:
rating_table[1].value_counts()

3.0    119
4.0     98
5.0     56
2.0     53
1.0     14
Name: 1, dtype: int64
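These counts are enough to recover the number of ratings and the mean rating of movie 1. The sketch below copies the printed counts by hand (the variable names are illustrative):

```python
import pandas as pd

# Counts copied from the value_counts() output for movie 1 above.
movie_1_counts = pd.Series({3.0: 119, 4.0: 98, 5.0: 56, 2.0: 53, 1.0: 14})

n_ratings = movie_1_counts.sum()
# Weighted mean: sum(score * count) / number of ratings
mean_rating = (
    movie_1_counts.index.to_numpy() * movie_1_counts.to_numpy()
).sum() / n_ratings

print(n_ratings)
print(round(mean_rating, 2))
```

On the full table, `rating_table[1].mean()` should give the same value, since pandas skips `NaN` entries by default; `rating_table.mean()` extends this to every movie and is one way to look at the best and worst rated titles.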