# Machine Learning 2 Project NYX
### Group: Stats ML, led by David Stroud
### Yang Zhang
### 12/18/2020

## Problem Definition
1. Problem Statement
2. Ideal Problem Solution
3. Understanding of and insight into the problem
4. Technical requirements
## Research
1. Data Structure and source
2. Model architecture
3. Algorithm research
4. Hardware requirements
5. Software requirements
## Model Exploration
1. Establish baselines for model performance
2. Start with a simple model using initial data pipeline
3. Stay nimble and try many parallel (isolated) ideas 
## Model Refinement
1. Perform model-specific optimizations
2. Iteratively debug models as complexity is added


In [1]:
import pandas as pd  
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# For multiple line outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [88]:
# Read the MovieLens movie metadata (ml-latest-small); the ml-25m path is kept for reference
#movieData = pd.read_csv('C:/Users/taniat470s/Desktop/SMU_course/DS7335/Project_NYX/ml-25m/movies.csv')
movieData = pd.read_csv('C:/Users/taniat470s/Desktop/SMU_course/DS7335/ml-latest-small/movies.csv')
 
movieData.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
# Read the MovieLens ratings (ml-latest-small); the ml-25m path is kept for reference
#ratingData = pd.read_csv('C:/Users/taniat470s/Desktop/SMU_course/DS7335/Project_NYX/ml-25m/ratings.csv')
ratingData = pd.read_csv('C:/Users/taniat470s/Desktop/SMU_course/DS7335/ml-latest-small/ratings.csv')
      
ratingData.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [94]:
ratingData

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [95]:
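from surprise import Dataset, Reader
# Assumption: the original `reader` definition is not shown; MovieLens ratings span 0.5 to 5.0
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratingData[["userId", "movieId", "rating"]], reader)  # load the ratings DataFrame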

#### Test with KNN first

In [96]:
from surprise import KNNWithMeans

# To use item-based cosine similarity
sim_options = {
    "name": "cosine",
    "user_based": False,  # Compute  similarities between items
}
algo = KNNWithMeans(sim_options=sim_options)

In [97]:
trainingSet = data.build_full_trainset()

algo.fit(trainingSet)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x26ca6f3dc48>

In [100]:
prediction = algo.predict(1, 1)
prediction.est

4.563359297316269

In [101]:
prediction = algo.predict(1, 3)
prediction.est

4.158043591329903
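
The `"cosine"` option with `"user_based": False` asks for similarities between items. As a minimal sketch of the idea (not Surprise's actual implementation), the cosine similarity between two items can be computed over the users who rated both. The toy `ratings` dictionary and the `item_cosine` helper are hypothetical names used only for illustration.

```python
from math import sqrt

# Toy ratings {user: {item: rating}}; the values are illustrative, not from the dataset.
ratings = {
    "u1": {"A": 4.0, "B": 4.0, "C": 1.0},
    "u2": {"A": 5.0, "B": 5.0},
    "u3": {"A": 1.0, "C": 5.0},
}

def item_cosine(ratings, i, j):
    """Cosine similarity between items i and j over users who rated both."""
    common = [(r[i], r[j]) for r in ratings.values() if i in r and j in r]
    if not common:
        return 0.0  # no co-rating users: treat the items as unrelated
    dot = sum(a * b for a, b in common)
    norm_i = sqrt(sum(a * a for a, _ in common))
    norm_j = sqrt(sum(b * b for _, b in common))
    return dot / (norm_i * norm_j)

sim_ab = item_cosine(ratings, "A", "B")  # rated identically by u1 and u2
sim_ac = item_cosine(ratings, "A", "C")  # rated in opposite directions by u1 and u3
print(sim_ab, sim_ac)
```

Items A and B receive identical ratings from their common users, so their similarity is higher than that of A and C, which are rated in opposite directions.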

#### Test with SVD

In [103]:
from surprise import SVD
algo = SVD()

In [104]:
algo.fit(trainingSet)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x26cabf275c8>

In [105]:
prediction = algo.predict(1, 3)
prediction.est

3.929339295776256

In [106]:
prediction = algo.predict(1, 1)
prediction.est

4.841726875546885
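
For context, a biased matrix-factorization model of this kind estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors. The sketch below illustrates that formula with made-up numbers; `svd_estimate` is a hypothetical helper, and the 0.5 to 5.0 clipping range is an assumption based on the MovieLens rating scale.

```python
def svd_estimate(mu, b_u, b_i, p_u, q_i, low=0.5, high=5.0):
    """Estimate r_ui = mu + b_u + b_i + q_i . p_u, clipped to the rating scale."""
    raw = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    return min(high, max(low, raw))

# Illustrative values only; a fitted model learns these from the training set.
est = svd_estimate(mu=3.5, b_u=0.3, b_i=0.2, p_u=[0.5, -0.2], q_i=[0.4, 0.1])
print(round(est, 2))
```

The biases alone already give a reasonable estimate; the factor term adjusts it for how well this particular user's tastes match this particular item.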

### Benchmark Different Algorithms

In [116]:
from surprise import KNNBasic
from surprise import SVDpp
from surprise import SlopeOne
from surprise import NMF
from surprise import NormalPredictor
from surprise import KNNBaseline
from surprise import KNNWithZScore
from surprise import BaselineOnly
from surprise import CoClustering
from surprise.model_selection import cross_validate

In [117]:
benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Perform 3-fold cross-validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)

    # Average the per-fold metrics and record the algorithm's class name
    # (Series.append was removed in pandas 2.0, so assign the label directly)
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp['Algorithm'] = type(algorithm).__name__
    benchmark.append(tmp)

pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...


Algorithm,test_rmse,fit_time,test_time
SVDpp,0.869016,712.082028,23.08818
BaselineOnly,0.875843,0.494188,1.144993
SVD,0.881471,6.199813,0.730093
KNNBaseline,0.881752,1.303142,11.926233
KNNWithZScore,0.902673,0.563582,10.990214
KNNWithMeans,0.903495,0.406015,5.735572
SlopeOne,0.908998,8.397802,22.18684
NMF,0.932728,17.45897,0.901461
CoClustering,0.951079,10.005873,0.602783
KNNBasic,0.955309,0.343917,5.996226


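From the results above, considering both RMSE and running time, we pick the two algorithms listed below. SVDpp has the lowest RMSE (0.869) but takes about 712 seconds to fit, whereas BaselineOnly (RMSE 0.876, fit time 0.49 s) and SVD (RMSE 0.881, fit time 6.2 s) come close at a fraction of the cost: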
- BaselineOnly
- SVD

### Use BaselineOnly

In [118]:
print('Using ALS')
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
algo = BaselineOnly(bsl_options=bsl_options)
cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)

Using ALS
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...


{'test_rmse': array([0.87000228, 0.86774573, 0.87471032]),
 'fit_time': (0.19148659706115723, 0.18894147872924805, 0.18350863456726074),
 'test_time': (0.43857622146606445, 0.7972283363342285, 0.5600166320800781)}
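
The `'als'` method estimates a bias for each user and each item by alternating regularized averages of the residuals. The sketch below shows that idea on toy data, reusing the `n_epochs`, `reg_u`, and `reg_i` values from `bsl_options`. It is a simplified illustration, not Surprise's code; `als_baselines` and the toy ratings are hypothetical.

```python
from collections import defaultdict

def als_baselines(ratings, n_epochs=5, reg_u=12, reg_i=5):
    """ratings: list of (user, item, rating). Returns (mu, b_u, b_i)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    b_u, b_i = defaultdict(float), defaultdict(float)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    for _ in range(n_epochs):
        # Item biases: shrunken mean residual with user biases held fixed
        for i, rows in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rows) / (reg_i + len(rows))
        # User biases: shrunken mean residual with item biases held fixed
        for u, rows in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rows) / (reg_u + len(rows))
    return mu, dict(b_u), dict(b_i)

# Toy data, illustrative only
toy = [("u1", "A", 5.0), ("u1", "B", 4.0), ("u2", "A", 4.0),
       ("u2", "B", 2.0), ("u3", "A", 3.0)]
mu, b_u, b_i = als_baselines(toy)
estimate = mu + b_u["u3"] + b_i["B"]  # baseline estimate for an unseen (user, item) pair
print(mu, b_u, b_i, estimate)
```

Larger `reg_u` and `reg_i` values shrink the biases toward zero, which matters most for users and items with few ratings.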

In [123]:
from surprise.model_selection import train_test_split
from surprise import accuracy

In [124]:
trainset, testset = train_test_split(data, test_size=0.25)
algo = BaselineOnly(bsl_options=bsl_options)
predictions = algo.fit(trainset).test(testset)
accuracy.rmse(predictions)

Estimating biases using als...
RMSE: 0.8678


0.8678010694890607
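
The RMSE reported here is the square root of the mean squared difference between true and estimated ratings on the test set. A minimal sketch of that computation, using made-up rating pairs rather than the notebook's predictions:

```python
from math import sqrt

def rmse(pairs):
    """pairs: iterable of (true_rating, estimated_rating) tuples."""
    pairs = list(pairs)
    return sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

# Illustrative (true, estimated) pairs
score = rmse([(4.0, 3.5), (5.0, 4.0), (3.0, 3.5)])
print(score)
```

Because the errors are squared before averaging, a single estimate that is off by a full star contributes as much as four estimates that are off by half a star.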

In [136]:
# SVD takes hyperparameters as keyword arguments; `param_grid` is an argument of GridSearchCV, not of SVD
algo_SVD = SVD(n_epochs=10, lr_all=0.005, reg_all=0.4)
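
The `SVD` constructor does not accept a `param_grid` argument; in Surprise, a parameter grid is passed to `GridSearchCV` together with the algorithm class. The sketch below shows one way this could look. The second value in each list is a hypothetical alternative added for illustration, and `tune_svd` is an assumed helper name, not part of the original notebook.

```python
from itertools import product

# Hypothetical grid: the first value in each list is the one tried in this notebook,
# the second is an illustrative alternative.
param_grid = {
    "n_epochs": [10, 20],
    "lr_all": [0.005, 0.01],
    "reg_all": [0.4, 0.1],
}

# Number of hyperparameter combinations the search would evaluate
n_combinations = len(list(product(*param_grid.values())))

def tune_svd(data, cv=3):
    """Grid-search SVD on a Surprise dataset; returns (best_params, best_rmse)."""
    # Imported lazily so the grid above can be inspected without Surprise installed
    from surprise import SVD
    from surprise.model_selection import GridSearchCV

    search = GridSearchCV(SVD, param_grid, measures=["rmse"], cv=cv)
    search.fit(data)
    return search.best_params["rmse"], search.best_score["rmse"]
```

Calling `best_params, best_rmse = tune_svd(data)` would run the search, after which `SVD(**best_params)` builds a model with the selected settings. Each combination is cross-validated, so the run time grows with the product of the list lengths.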

### Use SVD

In [126]:
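# Single-point grid with the values tried earlier; it is not used below,
# because SVD() does not accept a param_grid argument
param_grid = {
    "n_epochs": [10],
    "lr_all": [0.005],
    "reg_all": [0.4]
}

# Cross-validate SVD with its default hyperparameters
algo_SVD = SVD()
cross_validate(algo_SVD, data, measures=['RMSE'], cv=3, verbose=False)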

{'test_rmse': array([0.878015  , 0.88054097, 0.88205691]),
 'fit_time': (10.885946989059448, 10.659837484359741, 14.825446367263794),
 'test_time': (1.0708503723144531, 0.7110385894775391, 1.0574126243591309)}
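
To compare the two cross-validation runs, it helps to reduce each set of fold scores to a mean and a standard deviation. The sketch below does this with the per-fold `test_rmse` values copied from the two outputs in this notebook. Note that BaselineOnly was run with custom `bsl_options` while SVD used its defaults, and that the folds are drawn randomly for each run.

```python
from statistics import mean, stdev

# Per-fold test RMSE copied from the two cross_validate outputs in this notebook
fold_rmse = {
    "BaselineOnly": [0.87000228, 0.86774573, 0.87471032],
    "SVD": [0.878015, 0.88054097, 0.88205691],
}

# Mean and sample standard deviation of the fold scores for each algorithm
summary = {name: (mean(v), stdev(v)) for name, v in fold_rmse.items()}
for name, (m, s) in summary.items():
    print(f"{name}: mean RMSE = {m:.4f}, std = {s:.4f}")
```

The gap between the two means is under 0.01 rating points, so the difference in fit time is the more decisive factor between these two configurations.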

In [127]:
predictions = algo_SVD.fit(trainset).test(testset)
accuracy.rmse(predictions)

RMSE: 0.8752


0.8752230251275583

### Prediction Test

In [130]:
ratingData[ratingData['userId'] == 1]

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
227,1,3744,4.0,964980694
228,1,3793,5.0,964981855
229,1,3809,4.0,964981220
230,1,4006,4.0,964982903


In [131]:
# `algo` is the BaselineOnly model fitted on `trainset` above
prediction = algo.predict(1, 1994)
prediction.est

4.1593106576377075

In [132]:
prediction = algo_SVD.predict(1, 1994)
prediction.est

4.225427634548112