## Building a user-based recommendation model for Amazon

### Analysis Task

- Exploratory Data Analysis:
  - Which movies have maximum views/ratings?
  - What is the average rating for each movie? Define the top 5 movies with the maximum ratings.
  - Define the top 5 movies with the least audience.
- Recommendation Model: Some of the movies haven’t been watched and are therefore not rated by the users. Netflix would like to take this as an opportunity and build a machine learning recommendation algorithm which provides the ratings for each of the users.
  - Divide the data into training and test data
  - Build a recommendation model on training data
  - Make predictions on the test data

In [1]:
# import libraries
import pandas as pd
import numpy as np
from surprise.model_selection import train_test_split
from surprise import Dataset
from surprise import Reader
from surprise import SVD
from surprise import accuracy

In [2]:
ratings = pd.read_csv("~/Amazon - Movies and TV Ratings.csv")

In [3]:
ratings.head()

Unnamed: 0,user_id,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,A3R5OBKS7OM2IR,5.0,5.0,,,,,,,,...,,,,,,,,,,
1,AH3QC2PC1VTGP,,,2.0,,,,,,,...,,,,,,,,,,
2,A3LKP6WPMP9UKX,,,,5.0,,,,,,...,,,,,,,,,,
3,AVIY68KEPQ5ZD,,,,5.0,,,,,,...,,,,,,,,,,
4,A1CV1WROP5KTTW,,,,,5.0,,,,,...,,,,,,,,,,


In [4]:
ratings.dtypes

user_id      object
Movie1      float64
Movie2      float64
Movie3      float64
Movie4      float64
             ...   
Movie202    float64
Movie203    float64
Movie204    float64
Movie205    float64
Movie206    float64
Length: 207, dtype: object

In [5]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4848 entries, 0 to 4847
Columns: 207 entries, user_id to Movie206
dtypes: float64(206), object(1)
memory usage: 7.7+ MB


In [6]:
ratings.shape

(4848, 207)

In [7]:
# Call the method: without the parentheses pandas only echoes the bound method and the raw frame
ratings.describe()

In [8]:
# Call transpose() so that each movie becomes a row of summary statistics
ratings.describe().transpose()

          count      mean       std  min  25%  50%  75%  max
Movie1      1.0  5.000000       NaN  5.0  5.0  5.0  5.0  5.0
Movie2      1.0  5.000000       NaN  5.0  5.0  5.0  5.0  5.0
Movie3      1.0  2.000000       NaN  2.0  2.0  2.0  2.0  2.0
Movie4      2.0  5.000000  0.000000  5.0  5.0  5.0  5.0  5.0
Movie5     29.0  4.103448  1.496301  1.0  4.0  5.0  5.0  5.0
Movie6      1.0  4.000000       NaN  4.0  4.0  4.0  4.0  4.0
Movie7      1.0  5.000000       NaN  5.0  5.0  5.0  5.0  5.0
Movie8      1.0  5.000000       NaN  5.0  5.0  5.0  5.0  5.0
...

## Exploratory Data Analysis

### 1. Which movies have maximum views/ratings?

In [9]:
ratings.count().sort_values(ascending=False)

user_id     4848
Movie127    2313
Movie140     578
Movie16      320
Movie103     272
            ... 
Movie73        1
Movie74        1
Movie75        1
Movie77        1
Movie100       1
Length: 207, dtype: int64

Insight
- The counts above show that Movie127 has the most ratings from users (2,313), well ahead of Movie140 (578).
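
The `user_id` row at the top of the count is the identifier column, not a movie. A minimal sketch of the same count with that column dropped first, run on a small made-up frame (the `toy` values and movie names are illustrative assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wide ratings table (hypothetical values).
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "MovieA": [5.0, np.nan, 4.0, 3.0],
    "MovieB": [np.nan, 2.0, np.nan, np.nan],
    "MovieC": [1.0, 5.0, np.nan, np.nan],
})

# count() ignores NaN, so it returns the number of ratings per movie.
rating_counts = toy.drop(columns="user_id").count().sort_values(ascending=False)

# idxmax() returns the label of the movie with the most ratings.
most_rated = rating_counts.idxmax()
print(rating_counts)
print("Most rated:", most_rated)
```

Dropping the identifier column keeps it out of the ranking, so the first entry is always a movie.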

### 2. What is the average rating for each movie? Define the top 5 movies with the maximum ratings.

In [10]:
# Average rating per movie (numeric_only=True skips the non-numeric user_id column)
ratings.mean(numeric_only=True).sort_values(ascending=False)

Movie1      5.0
Movie55     5.0
Movie131    5.0
Movie132    5.0
Movie133    5.0
           ... 
Movie60     1.0
Movie58     1.0
Movie45     1.0
Movie67     1.0
Movie144    1.0
Length: 206, dtype: float64

### Top 5 rated movies

In [11]:
ratings.mean(numeric_only=True).sort_values(ascending=False).head(5)

Movie1      5.0
Movie55     5.0
Movie131    5.0
Movie132    5.0
Movie133    5.0
dtype: float64
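
The summary statistics earlier show that Movie1 has a single rating, so its perfect 5.0 average rests on one user. One possible refinement, sketched here on made-up data, is to rank by mean rating only among movies with a minimum number of ratings. The `toy` frame and the `MIN_RATINGS` threshold are assumptions for illustration, not part of the original analysis:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wide ratings table (hypothetical values).
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "MovieA": [5.0, 4.0, 4.0, 3.0],           # 4 ratings, mean 4.0
    "MovieB": [np.nan, 5.0, np.nan, np.nan],  # 1 rating, mean 5.0
    "MovieC": [4.0, 5.0, 5.0, np.nan],        # 3 ratings, mean about 4.67
})

movies = toy.drop(columns="user_id")
summary = pd.DataFrame({
    "n_ratings": movies.count(),    # number of non-missing ratings
    "mean_rating": movies.mean(),   # mean over non-missing ratings
})

MIN_RATINGS = 3  # hypothetical threshold; tune it for the real data
reliable = summary[summary["n_ratings"] >= MIN_RATINGS]
top = reliable.sort_values("mean_rating", ascending=False)
print(top)
```

With the threshold in place, the single-rating MovieB no longer outranks movies whose averages are backed by several users.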

### 3. Define the top 5 movies with the least audience.

In [14]:
ratings.count().sort_values(ascending=True)

Movie100       1
Movie77        1
Movie75        1
Movie74        1
Movie73        1
            ... 
Movie103     272
Movie16      320
Movie140     578
Movie127    2313
user_id     4848
Length: 207, dtype: int64

In [15]:
ratings.count().sort_values(ascending=True).head(5)

Movie100    1
Movie77     1
Movie75     1
Movie74     1
Movie73     1
dtype: int64
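
All five movies listed have exactly one rating, so `head(5)` picks an arbitrary five out of every movie tied at that count. A small sketch of how the ties could be made visible, using hypothetical counts rather than the real ones:

```python
import pandas as pd

# Hypothetical per-movie rating counts.
rating_counts = pd.Series(
    {"MovieA": 1, "MovieB": 1, "MovieC": 1, "MovieD": 7, "MovieE": 2}
)

# keep="all" returns every movie tied at the cutoff, even if that exceeds n.
least_two = rating_counts.nsmallest(2, keep="all")

# Number of movies that were rated exactly once.
n_single = int((rating_counts == 1).sum())

print(least_two)
print("Movies with exactly one rating:", n_single)
```

Reporting the number of single-rating movies alongside the list makes clear that the "bottom 5" is not a unique set.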

## Recommendation Model

Some of the movies haven’t been watched and are therefore not rated by the users. Netflix would like to take this as an opportunity and build a machine learning recommendation algorithm which provides the ratings for each of the users.

In [21]:
# Reshape from wide (one column per movie) to long (one row per user-movie pair)
transformed = ratings.melt(
    id_vars="user_id",
    var_name="Movie_Name",
    value_name="Movie_Rating",
)
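
To make the reshape concrete, here is a minimal sketch of `melt` on a two-user, two-movie frame. The `wide` frame is made up for illustration. The final `dropna` step is an alternative to the `fillna(0)` used in the next cell, shown as an assumption about what one might prefer, not as what the notebook does:

```python
import numpy as np
import pandas as pd

# Toy wide table: one row per user, one column per movie (hypothetical values).
wide = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "MovieA": [5.0, np.nan],
    "MovieB": [np.nan, 2.0],
})

# melt() produces one row per (user, movie) pair, including the unrated pairs.
long_df = wide.melt(id_vars="user_id", var_name="Movie_Name", value_name="Movie_Rating")

# Keeping only the observed ratings leaves the pairs a user actually rated.
observed = long_df.dropna(subset=["Movie_Rating"])

print(long_df)
print(observed)
```

The long table has users times movies rows, while `observed` keeps only the genuine ratings.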

### 1. Training and test data split

In [22]:
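# Caution: fillna(0) turns every unrated user-movie pair into an explicit rating of 0,
# and the (-1, 10) scale is wider than the 1 to 5 ratings observed in the data.
reader = Reader(rating_scale=(-1, 10))
data = Dataset.load_from_df(transformed.fillna(0), reader=reader)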

In [23]:
train, test = train_test_split(data, test_size=.25)

### 2. Build a recommendation model on training data

In [24]:
svd = SVD()

In [25]:
svd.fit(train)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f3f01366a90>

### 3. Make predictions on the test data

In [26]:
predictions = svd.test(test)

In [27]:
accuracy.rmse(predictions)

RMSE: 0.2824


0.28239637961591985

#### The RMSE is low (0.28), but most test entries are zeros introduced by `fillna(0)`, so it says little about the model's accuracy on genuine ratings
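
An RMSE computed after `fillna(0)` averages over the filled zeros as well as the genuine ratings. The sketch below uses made-up numbers (98 filled entries, 2 real ratings, and a model that always predicts 0) to show how the overall score can look good while the error on real ratings is large. The `rmse` helper is defined here for illustration and is not the Surprise function:

```python
import math

def rmse(pairs):
    """Root mean squared error over (true, predicted) pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))

# Hypothetical test set: 98 missing entries filled with 0, plus 2 genuine ratings.
filled = [(0.0, 0.0)] * 98
genuine = [(5.0, 0.0), (4.0, 0.0)]  # a model that always predicts 0

overall = rmse(filled + genuine)   # dominated by the trivially correct zeros
on_genuine = rmse(genuine)         # error on the ratings users actually gave

print(round(overall, 2), round(on_genuine, 2))
```

In this toy case the overall RMSE is about 0.64 while the RMSE on the genuine ratings is about 4.53, so it may be worth evaluating on observed ratings only.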

In [28]:
user_id = 'A3R5OBKS7OM2IR'
movie_id = 'Movie5121'
r_ui = 5.0
svd.predict(user_id, movie_id, r_ui=r_ui, verbose=True)

user: A3R5OBKS7OM2IR item: Movie5121  r_ui = 5.00   est = 0.09   {'was_impossible': False}


Prediction(uid='A3R5OBKS7OM2IR', iid='Movie5121', r_ui=5.0, est=0.09126411642326442, details={'was_impossible': False})

In [29]:
user_id = 'A3R5OBKS79M2IR'
movie_id = 'Movie7892'
r_ui = 5.0
svd.predict(user_id, movie_id, r_ui=r_ui, verbose=True)

user: A3R5OBKS79M2IR item: Movie7892  r_ui = 5.00   est = 0.02   {'was_impossible': False}


Prediction(uid='A3R5OBKS79M2IR', iid='Movie7892', r_ui=5.0, est=0.021743193736849412, details={'was_impossible': False})
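
Neither `Movie5121` nor `Movie7892` appears among the 206 movie columns, and the second user ID differs by one character from the ID in the first row of the data, so these calls probably involve items (and possibly a user) the model never saw. Surprise's documentation describes the SVD estimate as `mu + b_u + b_i + q_i · p_u`, with the bias and factors of an unknown user or item taken as zero. A minimal sketch of that rule follows. The helper `svd_estimate` and all parameter values are hypothetical, and the clipping to the rating scale that Surprise applies is omitted:

```python
def svd_estimate(mu, user_bias, item_bias, user_factors, item_factors, user, item):
    """Estimate mu + b_u + b_i + q_i . p_u, dropping terms for unseen users or items."""
    est = mu
    known_user = user in user_bias
    known_item = item in item_bias
    if known_user:
        est += user_bias[user]
    if known_item:
        est += item_bias[item]
    if known_user and known_item:
        # Dot product of the user's and the item's latent factors.
        est += sum(p * q for p, q in zip(user_factors[user], item_factors[item]))
    return est

# Hypothetical learned parameters, for illustration only.
mu = 0.1
user_bias = {"u1": 0.05}
item_bias = {"MovieA": -0.02}
user_factors = {"u1": [0.2, -0.1]}
item_factors = {"MovieA": [0.5, 0.3]}

params = (mu, user_bias, item_bias, user_factors, item_factors)
known = svd_estimate(*params, "u1", "MovieA")
unseen_item = svd_estimate(*params, "u1", "MovieX")   # falls back to mu + b_u
unseen_both = svd_estimate(*params, "uX", "MovieX")   # falls back to mu alone
print(known, unseen_item, unseen_both)
```

Under this rule an estimate near the global mean, such as the values close to 0 above, is what one would expect for an unseen movie when most training entries are zeros.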