### User-to-User (Collaborative Filtering algorithm)

To predict the missing values in the movie_rating.xlsx dataset, we use user-to-user collaborative filtering. First, we define the function colaborative_filtering to do the prediction. Given the whole movie_rating matrix and the number of neighbors, k, the function returns a data frame containing a predicted value for every cell of the original matrix.
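The prediction rule that colaborative_filtering applies to each cell can be written compactly. The symbols are introduced here for illustration only: $\bar r_i$ is user $i$'s mean over the items they rated, $w_{in}$ is the Pearson correlation between users $i$ and $n$, and $N_k(i,j)$ is the subset of user $i$'s $k$ most correlated users who have rated item $j$.

$$
\hat r_{ij} = \bar r_i + \frac{\sum_{n \in N_k(i,j)} w_{in}\,\left(r_{nj} - \bar r_n\right)}{\sum_{n \in N_k(i,j)} \left|w_{in}\right|}
$$

When the denominator is zero (no selected neighbor has rated the item, or all their correlations are zero), the prediction falls back to $\bar r_i$.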

Since k is a parameter to be tuned, we then develop the cross_validation function to calculate the RMSE for each k. Inside cross_validation we also use the functions pred and rmse_df. The former returns the predicted values for a given list of target user_id and movie_id pairs, and the latter computes the RMSE between the actual values and the predicted values.
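As a rough illustration of the error measure, the sketch below computes an RMSE over only those entries whose actual rating is known. The function name rmse_known and the toy numbers are hypothetical and not part of the notebook; it is a plain-Python stand-in for what rmse_df does on data frames.

```python
import math

def rmse_known(actual, predicted):
    """RMSE over the pairs whose actual rating is known (None marks a missing rating)."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a is not None]
    squared_error = sum((a - p) ** 2 for a, p in pairs)
    return math.sqrt(squared_error / len(pairs))

# Hypothetical toy data: the second rating is missing and is ignored
actual = [4.0, None, 2.0, 5.0]
predicted = [3.0, 3.5, 2.0, 3.0]
print(rmse_known(actual, predicted))
```

Only three of the four pairs contribute here, so the squared errors are averaged over three ratings rather than four.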

Given the RMSE dictionary, the k_optimal function returns the k with the lowest RMSE on the test set.
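The selection step amounts to taking the key with the smallest value in a dictionary. A minimal sketch follows; the RMSE values are made up for illustration and are not the notebook's results.

```python
# Hypothetical RMSE values keyed by k (illustrative only)
rmse_by_k = {10: 1.72, 14: 1.65, 19: 1.64}

# min() iterates over the keys and compares them by their RMSE value;
# if several k tie for the lowest RMSE, the first one encountered is returned
best_k = min(rmse_by_k, key=rmse_by_k.get)
print(best_k)
```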

With all the functions defined, we first split the whole movie_rating dataset into a training set and a test set in an 80/20 ratio. We then use cross_validation to find the RMSE for each k from 10 to 19. Since k = 19 gives the smallest RMSE on the test set, we finally use colaborative_filtering with k = 19 to predict the whole dataset.
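The split is done on individual ratings rather than on whole users, so held-out ratings become missing cells in the training matrix. The sketch below shows the idea in plain Python under stated assumptions: the toy ratings, the names, and the use of the standard random module are all hypothetical stand-ins for the pandas melt/pivot and train_test_split calls used in the notebook.

```python
import random

# Hypothetical toy ratings in long format: (user, movie_id, rating) triples
ratings = [(u, m, (u * 3 + ord(m)) % 5 + 1) for u in range(1, 6) for m in "ABCD"]

rng = random.Random(1)            # fixed seed so the split is reproducible
shuffled = ratings[:]
rng.shuffle(shuffled)

n_test = int(round(0.2 * len(shuffled)))   # 20% of the ratings are held out
test, train = shuffled[:n_test], shuffled[n_test:]

# Rebuild the user x movie matrix from the training triples only;
# held-out cells stay missing (None), so they cannot leak into the model
users = sorted({u for u, _, _ in ratings})
movies = sorted({m for _, m, _ in ratings})
train_matrix = {u: {m: None for m in movies} for u in users}
for u, m, r in train:
    train_matrix[u][m] = r

print(len(train), len(test))
```

Each test triple keeps its user and movie_id, which is what lets a function like pred look up the corresponding predicted cell afterwards.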

In [15]:
# import packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [16]:
# read the data as a dataframe, observations (rows) are users, and attributes (columns) are items
xls = pd.read_excel('movie_ratings_inclass.xlsx', sheet_name='Sheet1')
df = xls.set_index('User')

In [17]:
# Define the collaborative filtering method, which uses only the k nearest neighbours (ranked by correlation) for the prediction
def colaborative_filtering(df,k):# inputs: the rating matrix to be filled and the number of neighbours k
    cor = df.T.corr()
    cor = cor.fillna(0) #fill the NaN cells of the correlation table with 0
    mean_user = df.mean(axis = 1).tolist() #user means, computed before the NaNs in the original matrix are filled with 0
    new_df = df.fillna(0)#rebuild the matrix
    for i in range(df.shape[0]):#user
        for j in range(df.shape[1]):#item
            numerator = 0
            denominator = 0
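            #note: the slice drops the last sorted position (normally the user itself); if k+1 exceeds the number of users, the start index is negative and counts from the end
            neighbour = (np.argsort(cor).iloc[i,cor.shape[1]-k-1:cor.shape[1]-1]).tolist() #choose the neighbours, ranked by correlation
            for n in neighbour:#neighbour
                if i!=n and (not pd.isnull(df.iloc[n,j])):#only count the weight when the neighbour has rated the item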
                    denominator = denominator + abs(cor.iloc[i,n]) 
                    numerator = numerator + cor.iloc[i,n]*(df.iloc[n,j]-mean_user[n])
            if denominator ==0:
                new_df.iloc[i,j] =float(mean_user[i])
            else:  
                new_df.iloc[i,j] =float(mean_user[i]) + (float(numerator)/denominator)
    return new_df
 


In [18]:
# extract the values we want from the full predicted matrix to form the predictions
def pred(df_pred,df_test):
    prediction = []
    for i in range(df_test.shape[0]):
        pred = df_pred.loc[df_test.iloc[i,0],df_test.iloc[i,1]]
        prediction.append(pred)
    return prediction

In [19]:
#compare the values in the given matrix with the values we predict
def rmse_df(df, df_pred):
    err = df.sub(df_pred)
    n = np.sum(1-np.isnan(df)).sum()
    se = np.power(err,2).sum().sum()
    rmse = np.power(se/n,0.5)
    return rmse

In [20]:
#use the training set and test set to find the optimal number of neighbours, k, for the collaborative filtering method
def cross_validation(df_train, df_test,k_range):
    rmse_dic = {}
    for k in k_range:
        df_pred= colaborative_filtering(df_train,k)
        prediction = pd.DataFrame({'rating':pred(df_pred,df_test)})
        rmse_dic[k] = rmse_df(df_test[['rating']],prediction)
    return rmse_dic

In [21]:
#return the optimal k related to the lowest rmse 
def k_optimal(rmse_dic):
    return min(rmse_dic, key = rmse_dic.get)  

In [22]:
#split the whole dataset into a training set and a test set
melt = pd.melt(xls, id_vars='User', 
               value_vars=list(df.columns[0:]),
               var_name='movie_id', 
               value_name='rating')

np.random.seed(1)  
train2,test2 = train_test_split(melt,test_size = 0.2)
df_train = train2.pivot(index = 'User', columns = 'movie_id', values = 'rating')
df_test = test2.reset_index()[['User','movie_id','rating']] 

In [23]:
#do cross validation to find optimal k
k_range = range(10,df.shape[1])          
rmse_dictionary =cross_validation(df_train,df_test,k_range)
k_op = k_optimal(rmse_dictionary)

In [24]:
rmse_dictionary  # the optimal k is 19, with the rmse being the lowest.

{10: 1.7236967819610547,
 11: 1.7128829348636185,
 12: 1.7119369348474973,
 13: 1.6754964360996423,
 14: 1.6459935932464638,
 15: 1.6499851195518964,
 16: 1.6622488339525201,
 17: 1.695511802858092,
 18: 1.665066818326314,
 19: 1.635576003194916}

In [25]:
colaborative_filtering(df,k_op)# the full matrix we predict

User,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T
1,2.125887,3.809875,2.468465,2.997126,3.664587,3.102324,2.606409,2.673367,2.317369,3.521447,3.044063,2.934637,4.040158,2.998709,2.57367,4.267142,1.761662,4.332319,3.593288,2.319217
2,3.988588,2.915932,3.115639,2.133721,3.057883,2.432871,2.279495,2.580919,3.000151,3.63267,3.096949,2.527106,2.173434,2.532439,2.473829,3.200056,3.763495,1.881988,2.68988,1.752868
3,1.912954,2.094736,2.369515,3.683575,1.987374,2.466174,3.281397,1.978337,3.130212,1.539215,3.272523,2.279263,2.603694,2.982542,2.18674,1.321762,2.126969,2.364964,1.98824,3.365319
4,3.32694,2.339844,2.388143,2.70054,2.575861,1.627711,2.875391,2.793358,3.117819,2.315239,2.864993,3.63611,1.807446,2.924626,3.392369,2.389743,3.752023,1.487925,2.007544,3.683318
5,3.578885,2.5014,3.047309,2.395425,3.392122,2.622037,3.08719,3.083817,3.72331,2.617868,3.964756,3.011088,3.098395,3.409873,3.955804,2.317867,3.497197,3.571934,3.677112,3.248757
6,2.901324,4.029951,3.009891,3.663975,3.521451,3.323299,3.00793,2.218895,2.993365,4.031558,3.873461,2.121943,4.000037,2.838593,1.655562,3.619244,2.748993,3.299288,2.454697,2.453458
7,2.567399,1.529158,3.11306,2.17578,2.513765,3.18591,2.801474,3.567647,2.323389,2.179307,1.930943,2.658001,2.259856,2.175,2.778515,1.777193,2.325736,2.792869,3.540116,2.824637
8,3.514328,3.031754,1.988014,1.799454,2.775932,2.795472,2.251844,3.030631,2.282562,2.746412,2.828054,2.144753,2.837773,2.326532,3.466282,3.619402,1.919753,3.713291,4.156083,2.86003
9,3.65076,1.960428,3.351168,3.515065,2.621152,3.134144,3.795093,3.731547,3.544526,2.412741,2.995182,3.572298,2.399526,3.780511,4.530988,1.999423,3.485569,2.459768,3.110284,4.022902
10,3.895273,2.232604,3.870131,4.047578,2.657119,3.249615,3.713518,3.436299,4.137508,3.085499,3.17607,3.475341,2.360364,3.587308,3.287793,2.135742,4.102399,1.827758,2.260609,3.852409


In [26]:
rmse_df(df,colaborative_filtering(df,k_op)) #the rmse of the prediction

0.86429700227436157

## Question 2: Part B

### Predicting the Netflix Dataset

For the Netflix dataset, we began by researching how previous teams had achieved accurate predictions. We found that the best approach was not to create a single model, but to train several models and then blend them together.

We first split the training set (75% training, 25% validation) and then chose our models. We found the Surprise package extremely useful, as it is dedicated to building recommender systems. From Surprise, we trained a selection of 7 models, including SVD and KNN variants. We trained all 7 models on the training set and then used them to predict the ratings in the validation set. The RMSE of these initial predictions ranged from 0.844 (SVDpp) to 0.9 (NMF). During this process we also used each model to make predictions for the 2000 observations in the test set.


With these validation-set and test-set predictions, we began blending our models. To do this, we created a new dataset from the predictions on our validation set, with each model's predictions as a feature, and split it into two subsets. We trained a linear regression on one subset, with the model predictions as our X variables and the real ratings as our Y variable, and then used it to predict the ratings of the other subset. This gave us an estimated RMSE of 0.84.
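The blending step can be sketched on synthetic data as follows. Everything here is an assumption made for illustration: the two noisy "models", the sample sizes, and the use of numpy's least-squares solver in place of scikit-learn's LinearRegression. It is not the code used to produce our results.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
true = rng.uniform(1, 5, size=n)                 # synthetic "real" ratings

# Two hypothetical base models with different noise levels
model_a = true + rng.normal(0, 0.5, size=n)
model_b = true + rng.normal(0, 1.0, size=n)

# Features are the base-model predictions; the column of ones acts as the intercept
X = np.column_stack([model_a, model_b, np.ones(n)])

half = n // 2
# Fit the blending weights on the first half of the validation predictions
weights, *_ = np.linalg.lstsq(X[:half], true[:half], rcond=None)

# Blend the held-out half and clip to the valid rating range
blend = np.clip(X[half:] @ weights, 1, 5)

def rmse(pred, target):
    return float(np.sqrt(np.mean((pred - target) ** 2)))

print("model_a:", rmse(model_a[half:], true[half:]))
print("model_b:", rmse(model_b[half:], true[half:]))
print("blend:  ", rmse(blend, true[half:]))
```

Fitting on one half and scoring on the other keeps the RMSE estimate of the blend from being flattered by the data the weights were fitted on.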

At this stage, we tried several other ways to blend the models. Although we tried more advanced blenders, such as a simple neural network, a Random Forest and Elastic Net, linear regression was either the most accurate or similarly accurate at far lower computational cost.

Finally, we re-trained the linear regression on the predictions for the entire validation set and used this model to predict the ratings in the hold-out test set. Please see Netflix_Predictions.txt for our final predictions.


In [None]:
# -*- coding: utf-8 -*-
"""
Created on Fri Jun  8 11:49:50 2018

@author: Mark
"""
#Load core packages
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
#Load Surprise helpers
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import train_test_split
from surprise import accuracy
#Get all Models
from surprise.prediction_algorithms import SVDpp
from surprise.prediction_algorithms import SVD
from surprise.prediction_algorithms import KNNWithMeans
from surprise.prediction_algorithms import KNNWithZScore
from surprise.prediction_algorithms import NMF
from surprise.prediction_algorithms import KNNBaseline
from surprise.prediction_algorithms import SlopeOne
from surprise.prediction_algorithms import CoClustering


#For final calculation
def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())


#Load in datasets
path = r'C:/Users/Mark/Desktop/Marketing Analytics/Homework/Homework 3/netflix_HW3/Netflix_HW3_training.txt'
df = pd.read_table(path, delimiter=",",
                               names=['itemID', 'userID', 'ratings'],
                               dtype={'itemID':np.uint32, 'userID':np.uint32, 
                                      'ratings':np.float64})
    

order = ['userID', 'itemID', 'ratings']
df = df[order]
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df, reader)
del df

#Split train test
trainset, testset = train_test_split(data, test_size=.25)

path2 = r'C:/Users/Mark/Desktop/Marketing Analytics/Homework/Homework 3/netflix_HW3/Netflix_HW3_test.txt'
df2 = pd.read_table(path2, delimiter=",",
                               names=['itemID', 'userID'],
                                dtype={'itemID':np.uint32, 'userID':np.uint32})
    

order2 = ['userID', 'itemID']
df2 = df2[order2]


#Dicts for storing results
accuracies=dict() #Accuracy of each model
predictions_dict=dict() #Validation-set predictions for each model
final_predictions_dict=dict() #Test-set predictions for each model


#Create List of Models
labels =['SVD','SVDpp','CoClustering','KNNBaseline','KNNWithMeans','NMF','SlopeOne']
algos=dict() #Match each algo name to its model
#set up algos
algos[labels[0]]=  SVD(n_factors =30,n_epochs= 10, lr_all= 0.007, reg_all= 0.01,verbose=True)
algos[labels[1]]=  SVDpp(verbose=True)
algos[labels[2]]=  CoClustering(verbose=True)
algos[labels[3]]=  KNNBaseline(verbose=True)
algos[labels[4]]=  KNNWithMeans(verbose=True)
algos[labels[5]]=  NMF(verbose=True)
algos[labels[6]]=  SlopeOne()

#Train all models
for label,algo1 in algos.items():
    #Fit algos
    print("Start",label)
    algo=algo1
    algo.fit(trainset)
    print("Training done!",label)
    
    prediction=algo.test(testset)
    
    
    #Save accuracy
    accuracies[label]=accuracy.rmse(prediction)
    
    #Save predictions for validation set
    preds=[]
    for a,b in enumerate(prediction):
        preds.append(b[3])
    predictions_dict[label]=preds
    preds=[]
    print("Validation Predicted",label)
    
    
    #Save predictions for test set
    final_prediction=[]
    for index, row in df2.iterrows():
        final_prediction.append(algo.predict(row['userID'],row['itemID']).est)
    final_predictions_dict[label]=final_prediction
    final_prediction=[]
    print("Test Predicted",label)
    

#validation set real values
real=[]
for i in range(0,len(testset)): 
    real.append(testset[i][2])


#-----------------------------------------------------------------------------------------------------------------
#Dataframe of all predictions
predictions=pd.DataFrame(predictions_dict)
predictions['Real']=real  
split_index = int(len(predictions)*0.75)

predTrain = predictions.iloc[0:split_index].copy()
predTest = predictions.iloc[split_index:len(predictions)].copy()  # copy so a 'Blended' column can be added later without a SettingWithCopyWarning


#Set up blending on validation set


# Find real labels: y_test is the subset used to fit the blender, y_pred is the held-out subset
y_test = np.array(predTrain['Real'])
y_pred = np.array(predTest['Real'])


X_test = predTrain.drop(['Real'], axis=1)
X_test=np.array(X_test)

X_pred = predTest.drop(['Real'], axis=1)
X_pred=np.array(X_pred)




# Find blending weights
linreg = LinearRegression()
linreg.fit(X_test, y_test)

# Create dictionary of weights
weights = dict(zip(labels, linreg.coef_))

# Predict final ratings
final_predictions = np.clip(linreg.predict(X_pred), 1, 5)

print('Blending Weights: ')
print(weights, end='\n\n')

print('RMSE on ProbeTrain: %f' % rmse(y_test, linreg.predict(X_test)))
print('RMSE on ProbeTest: %f' % rmse(y_pred, final_predictions))
predTest['Blended']=final_predictions





#Predict on testset
# Find real labels
y_test2 = np.array(predictions['Real'])

X_test2 = predictions.drop(['Real'], axis=1)
X_test2=np.array(X_test2)

X_pred2=pd.DataFrame(final_predictions_dict)
X_pred2=np.array(X_pred2)



# Find blending weights
linreg = LinearRegression()
linreg.fit(X_test2, y_test2)

# Create dictionary of weights
weights = dict(zip(labels, linreg.coef_))

# Predict final ratings
final_predictions2 = np.clip(linreg.predict(X_pred2), 1, 5)

print('Blending Weights: ')
print(weights, end='\n\n')

print('RMSE on ProbeTrain: %f' % rmse(y_test2, linreg.predict(X_test2)))
df2['Rating']=final_predictions2

#Export results
export_order = ['itemID', 'userID', 'Rating']
df2 = df2[export_order]
df2.to_csv('Predictions.txt', sep=',', index=False,header=False)



#Neural Networks regression using Keras package
'''

import numpy

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# define base model
def baseline_model():
	# create model
	model = Sequential()
	model.add(Dense(7, input_dim=7, kernel_initializer='normal', activation='relu'))
	model.add(Dense(1, kernel_initializer='normal'))
	# Compile model
	model.compile(loss='mean_squared_error', optimizer='adam')
	return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# evaluate model with standardized dataset
estimator = KerasRegressor(build_fn=baseline_model, epochs=100, batch_size=5, verbose=1)

estimator.fit(X_test, y_test)
keras_prediction = estimator.predict(X_pred)
rmse(y_pred, keras_prediction)





# Find real labels
y_test2 = np.array(predictions['Real'])

X_test2 = predictions.drop(['Real'], axis=1)
X_test2=np.array(X_test2)

X_pred2=pd.DataFrame(final_predictions_dict)
X_pred2=np.array(X_pred2)


seed = 7
numpy.random.seed(seed)
# evaluate model with standardized dataset
estimator = KerasRegressor(build_fn=baseline_model, epochs=100, batch_size=5, verbose=1)

estimator.fit(X_test2, y_test2)
keras_prediction = estimator.predict(X_pred2,batch_size=5)
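rmse(y_test2, estimator.predict(X_test2))  # training RMSE of the blender (compare labels with predictions, not with the feature matrix)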
'''