# Project 2: Building user-based recommendation model for Amazon

### <u>DESCRIPTION</u>

The dataset provided contains movie reviews given by Amazon customers. Reviews were given between May 1996 and July 2014 (...)

### <u>DATA CONSIDERATIONS</u>
- Not every user has watched every movie, so not every movie is rated by every user. These missing values are represented by NA.
- Ratings are on a scale of -1 to 10 where -1 is the least rating and 10 is the best.

### <u>ANALYSIS TASK</u>
- 1.) Which movies have maximum views/ratings?
- 2.) What is the average rating for each movie? 
- 3.) Define the top 5 movies with the maximum ratings.
- 4.) Define the top 5 movies with the least audience.

### <u>RECOMMENDATION MODEL:</u>

Some of the movies haven’t been watched and therefore are not rated by the users. Amazon would like to take this as an opportunity to build a machine learning recommendation algorithm that predicts the missing ratings for each user.

- 1.) Divide the data into training and test data
- 2.) Build a recommendation model on training data
- 3.) Make predictions on the test data

--------------------------------------------


In [130]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [131]:
import numpy as np
import pandas as pd

file_path = "/movie_tv_ratings.csv"
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,user_id,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,A3R5OBKS7OM2IR,5.0,5.0,,,,,,,,...,,,,,,,,,,
1,AH3QC2PC1VTGP,,,2.0,,,,,,,...,,,,,,,,,,
2,A3LKP6WPMP9UKX,,,,5.0,,,,,,...,,,,,,,,,,
3,AVIY68KEPQ5ZD,,,,5.0,,,,,,...,,,,,,,,,,
4,A1CV1WROP5KTTW,,,,,5.0,,,,,...,,,,,,,,,,


*Not checking for null values: as the data considerations state, not every user has watched all 206 movies, so null values are unavoidable here.*

In [132]:
# for exploration purposes, we don't need "user_id" 

df_for_explore = df.drop('user_id',axis=1)
df_for_explore.head()

Unnamed: 0,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,Movie10,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,5.0,5.0,,,,,,,,,...,,,,,,,,,,
1,,,2.0,,,,,,,,...,,,,,,,,,,
2,,,,5.0,,,,,,,...,,,,,,,,,,
3,,,,5.0,,,,,,,...,,,,,,,,,,
4,,,,,5.0,,,,,,...,,,,,,,,,,


---------------------------------------------

In [133]:
df_for_explore.describe().T["count"].sort_values(ascending = False)[0:10]

Movie127    2313.0
Movie140     578.0
Movie16      320.0
Movie103     272.0
Movie29      243.0
Movie91      128.0
Movie92      101.0
Movie89       83.0
Movie158      66.0
Movie108      54.0
Name: count, dtype: float64

#### Question #1: Which movies have maximum views/ratings?

#### **Answer #1: Movie127 has the most ratings (2313), followed by Movie140, Movie16, Movie103, and Movie29.**
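
As a side note, the same per-movie counts can be obtained without going through `describe()`. The sketch below uses a small made-up matrix (the names `toy` and `ratings_per_movie` are illustrative and not part of the notebook) to show how `DataFrame.count()` tallies the non-null ratings in each column.

```python
import numpy as np
import pandas as pd

# Toy user-by-movie matrix with made-up values; NaN means "not rated".
toy = pd.DataFrame({
    "MovieA": [5.0, np.nan, 4.0, 3.0],
    "MovieB": [np.nan, np.nan, 2.0, np.nan],
    "MovieC": [1.0, 5.0, np.nan, np.nan],
})

# count() returns the number of non-null entries per column,
# i.e. how many users rated each movie.
ratings_per_movie = toy.count().sort_values(ascending=False)
print(ratings_per_movie)
```

Unlike the `describe()` route, `count()` returns integer counts and skips the other summary statistics.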

------------------------------------------

In [134]:
df_mean = df_for_explore.mean().sort_values(ascending=False)

df_mean

Movie1      5.0
Movie66     5.0
Movie76     5.0
Movie75     5.0
Movie74     5.0
           ... 
Movie58     1.0
Movie60     1.0
Movie154    1.0
Movie45     1.0
Movie144    1.0
Length: 206, dtype: float64

In [135]:
df_mean.shape

(206,)

In [136]:
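# Average the per-movie means (unweighted: every movie counts equally,
# regardless of how many ratings it received).
total = 0
for movie_mean in df_mean:
    total += movie_mean

average_rating = total / len(df_mean)
average_rating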

4.448436665448387

#### Question #2: What is the average rating for each movie?

#### **Answer #2: The average rating per movie is shown in the table above (not all 206 values are listed). The unweighted average of these per-movie means is about 4.45 out of 5.**
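
The 4.45 figure is a mean of per-movie means, which is not necessarily the same as the mean over all individual ratings. The sketch below, on made-up data with illustrative variable names, shows how the two can differ when movies have very different numbers of ratings.

```python
import numpy as np
import pandas as pd

# Made-up ratings: MovieA has four ratings, MovieB has only one.
toy = pd.DataFrame({
    "MovieA": [5.0, 5.0, 4.0, 4.0],
    "MovieB": [np.nan, np.nan, 1.0, np.nan],
})

per_movie_mean = toy.mean()            # NaN is skipped by default
mean_of_means = per_movie_mean.mean()  # (4.5 + 1.0) / 2

# Mean over every individual rating: total score / number of ratings.
overall_mean = toy.sum().sum() / toy.count().sum()  # 19 / 5

print(mean_of_means, overall_mean)
```

In the toy data the single low rating pulls the mean of means well below the overall mean, because MovieB counts as much as MovieA despite having one rating.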

------------------------------------------

In [137]:
df_for_explore.sum().sort_values(ascending=False).head()

Movie127    9511.0
Movie140    2794.0
Movie16     1446.0
Movie103    1241.0
Movie29     1168.0
dtype: float64

#### Question #3: Define the top 5 movies with the maximum ratings.

#### **Answer #3: By sum of ratings, the top 5 are Movie127, Movie140, Movie16, Movie103, and Movie29.**

------------------------------------------

In [138]:
df_for_explore = df_for_explore.fillna(0)

df_for_explore.describe().T['mean'].sort_values(ascending=True)[:5]

Movie67     0.000206
Movie154    0.000206
Movie58     0.000206
Movie60     0.000206
Movie45     0.000206
Name: mean, dtype: float64

#### Question #4: Define the top 5 movies with the least audience.

#### **Answer #4: Movie67, Movie154, Movie58, Movie60, Movie45**
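
The ranking above uses the mean after filling missing values with 0, so it mixes audience size with rating value. A possible alternative, sketched here on made-up data and not the method used in this notebook, is to rank by the count of non-null ratings. All names in the block are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up ratings; NaN means "not rated".
toy = pd.DataFrame({
    "MovieA": [5.0, np.nan, 4.0, 3.0],        # three viewers
    "MovieB": [np.nan, np.nan, 5.0, np.nan],  # one viewer, high rating
    "MovieC": [1.0, 1.0, np.nan, np.nan],     # two viewers, low ratings
})

# Audience size = number of users who rated the movie.
audience = toy.count()
least_watched = audience.nsmallest(2)

# Zero-filled mean, as used above: depends on both count and rating value.
zero_filled_mean = toy.fillna(0).mean()

print(least_watched)
print(zero_filled_mean.sort_values())
```

In this toy case the count puts MovieB last (one viewer), while the zero-filled mean puts MovieC last because its ratings are low.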

----------------------------

## <u>Recommendation Model</u>

### Step one: divide the data into training and test data

In [139]:
%pip install scikit-surprise




In [140]:
import surprise
from surprise import Reader
from surprise import Dataset
from surprise import SVD

df.head()

Unnamed: 0,user_id,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,A3R5OBKS7OM2IR,5.0,5.0,,,,,,,,...,,,,,,,,,,
1,AH3QC2PC1VTGP,,,2.0,,,,,,,...,,,,,,,,,,
2,A3LKP6WPMP9UKX,,,,5.0,,,,,,...,,,,,,,,,,
3,AVIY68KEPQ5ZD,,,,5.0,,,,,,...,,,,,,,,,,
4,A1CV1WROP5KTTW,,,,,5.0,,,,,...,,,,,,,,,,


In [141]:
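# The wide user-by-movie layout is hard to work with, and surprise expects
# long format (user, item, rating), so reshape it with melt.
pretty_df = df.melt(
    id_vars=df.columns[0],
    value_vars=df.columns[1:],
    var_name='Movie',
    value_name='rating',
)
# Unrated entries become 0, so the model will treat them as real ratings of 0.
pretty_df = pretty_df.fillna(0)
pretty_df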

Unnamed: 0,user_id,Movie,rating
0,A3R5OBKS7OM2IR,Movie1,5.0
1,AH3QC2PC1VTGP,Movie1,0.0
2,A3LKP6WPMP9UKX,Movie1,0.0
3,AVIY68KEPQ5ZD,Movie1,0.0
4,A1CV1WROP5KTTW,Movie1,0.0
...,...,...,...
998683,A1IMQ9WMFYKWH5,Movie206,5.0
998684,A1KLIKPUF5E88I,Movie206,5.0
998685,A5HG6WFZLO10D,Movie206,5.0
998686,A3UU690TWXCG1X,Movie206,5.0
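
The reshaping step can be seen more easily on a tiny example. The sketch below uses made-up data and illustrative names (`wide`, `long_all`, `long_observed`). It also shows one possible alternative to `fillna(0)`, which is to keep only observed ratings with `dropna`. That alternative is an assumption about what might be preferable, not what this notebook does, and it would change every metric reported further down.

```python
import numpy as np
import pandas as pd

# Made-up wide table: one row per user, one column per movie.
wide = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "Movie1": [5.0, np.nan, np.nan],
    "Movie2": [np.nan, 2.0, 4.0],
})

# Long format: one row per (user, movie) pair, including unrated pairs.
long_all = wide.melt(id_vars="user_id", var_name="Movie", value_name="rating")

# Alternative to fillna(0): keep only the pairs that were actually rated.
long_observed = long_all.dropna(subset=["rating"]).reset_index(drop=True)

print(long_all)
print(long_observed)
```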


In [142]:
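# scikit-learn's splitter works directly on the long-format DataFrame
from sklearn.model_selection import train_test_split

train, test = train_test_split(pretty_df, test_size=0.25)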

--------------------------

### Step two: build a recommendation model on training data

**as per the requirements: "Ratings are on a scale of -1 to 10 where -1 is the least rating and 10 is the best"**

In [143]:
reader = surprise.Reader(rating_scale=(-1, 10))

data = Dataset.load_from_df(pretty_df, reader=reader)

trainset_model_surprise = surprise.Dataset.load_from_df(train, reader).build_full_trainset()
testset_model_surprise = surprise.Dataset.load_from_df(test, reader).build_full_trainset()

In [144]:
# train SVD with n_factors=100 (an arbitrary choice, which also matches the library default)
svd = SVD(n_factors=100)
svd.fit(trainset_model_surprise)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fad4aed30d0>
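
For context, the model that `surprise` documents for its `SVD` class predicts a rating as a global mean plus user and item biases plus a dot product of latent factor vectors. The notation below follows that documentation and is not taken from this notebook; `n_factors` sets the length of the vectors $p_u$ and $q_i$.

$$
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u
$$

Here $\mu$ is the mean of all training ratings, $b_u$ and $b_i$ are the user and item biases, and $p_u$ and $q_i$ are the learned user and item factor vectors.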

--------------------------------

### Step three: Make predictions on the test data

In [145]:
testset_surprise = testset_model_surprise.build_testset()

predict = svd.test(testset_surprise)

pd.DataFrame(predict).sort_values(by='est', ascending=False)

Unnamed: 0,uid,iid,r_ui,est,details
101514,A9U0X1KU1M88Z,Movie127,5.0,3.170904,{'was_impossible': False}
247747,A2T2HPQ16WP4XD,Movie127,0.0,3.108508,{'was_impossible': False}
99992,A1NMUSODK9RHAT,Movie127,3.0,3.078917,{'was_impossible': False}
101875,A3C9KX3AD7EAJE,Movie127,2.0,3.037674,{'was_impossible': False}
64275,A12ZUB1BA0CID1,Movie127,0.0,3.009622,{'was_impossible': False}
...,...,...,...,...,...
36578,A1YUZHRF1K8M24,Movie16,0.0,-0.696349,{'was_impossible': False}
69560,A2UUJ5JMS5FGTP,Movie16,0.0,-0.715759,{'was_impossible': False}
229663,A1OUJZ6NKNA7DX,Movie16,0.0,-0.735206,{'was_impossible': False}
30265,A1EKOYM2H8EYIR,Movie103,0.0,-0.759015,{'was_impossible': False}


In [146]:
from surprise.model_selection import cross_validate

cross_validate(svd,data,measures=['RMSE','MAE'],cv=3,verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.2784  0.2842  0.2848  0.2825  0.0029  
MAE (testset)     0.0424  0.0435  0.0439  0.0433  0.0006  
Fit time          44.55   43.70   44.19   44.15   0.35    
Test time         5.48    4.71    3.78    4.66    0.69    


{'fit_time': (44.54787039756775, 43.69886493682861, 44.193177700042725),
 'test_mae': array([0.04243182, 0.04347419, 0.04387658]),
 'test_rmse': array([0.27842821, 0.28420937, 0.28480909]),
 'test_time': (5.475965738296509, 4.706968545913696, 3.7840657234191895)}
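
To make the two reported measures concrete, here is a minimal sketch of how RMSE and MAE are computed from (true, predicted) pairs. The pairs are made up for illustration and are not taken from the model output above.

```python
import math

# Hypothetical (true rating, predicted rating) pairs, made up for illustration.
pairs = [(5.0, 4.5), (0.0, 0.1), (0.0, 0.0), (3.0, 1.0)]

errors = [true - est for true, est in pairs]

# MAE: mean of the absolute errors.
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the mean squared error; penalizes large misses more.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"MAE={mae:.4f}  RMSE={rmse:.4f}")
```

Since the zero-filled rows in `pretty_df` far outnumber the actual ratings, the averages reported above are likely dominated by how well the model predicts 0, which may explain why MAE is so much smaller than RMSE.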

In [151]:
# Predict a rating for a movie by a specific user
# (we will check how the last user might feel about the most popular movie):

svd.predict('A2OHIBSF4XKTDC','Movie127')

Prediction(uid='A2OHIBSF4XKTDC', iid='Movie127', r_ui=None, est=4.575814835999572, details={'was_impossible': False})