# Transition from R to Python

After many failed attempts working with R's recosystem, I was able to get better results using Python's surprise library. One of the main issues using R's recosystem was that ratings would exceed the range of 1-5, while this library retained this constraint.

## Submission 12

I approached this with SVD, taking some of the better results from the R attempts. I used 20 latent factors, and a learning rate of 0.01.

In [1]:
import pandas
from surprise import *

In [2]:
data = Dataset.load_from_file("train.txt", reader=Reader(line_format='user item rating', sep=' '))
test_set=pandas.read_csv("test.txt", sep=' ', header=None, names=['id','user','book'])

In [3]:
train_set=data.build_full_trainset()

In [None]:
algo=SVD(n_factors=20,n_epochs=10,lr_all=0.01)

In [None]:
algo.train(train_set)

In [None]:
ratings=[]

for i in range(0,test_set.shape[0]):
    ratings.append(algo.predict(str(test_set['user'][i]),test_set['book'][i])[3])

In [None]:
output = pandas.DataFrame({"id": test_set['id'], "rating":ratings})

In [None]:
output.to_csv("py_output1.csv",sep=",",index=False, index_label=False)

## Submission 13

Based on my findings before, I noticed the cost for items was smaller than the cost for users. In terms of bias, maybe this can translate to larger bias term in users. Rather than tuning (which takes a long time), I try to make an educated guess of better learning rate and regularization, increasing those for users, but decreasing it for items.

In both python submissions, I noticed the minimum rating was around 1.2. This is something to adjust for, maybe applying a bigger cost parameter.

This performed slightly worse. I think it may be also due to the fact that I reduced n_epochs to 10, which I should have kept at 15. However, I do not think changing n_epochs from 10 to 15 would change the output as much.

In [None]:
algo=SVD(n_factors=20, n_epochs=10, lr_bu=0.02, reg_bu=0.05, lr_bi=0.001, reg_bi=0.01)

In [None]:
algo.train(train_set)

In [None]:
ratings2=[]

for i in range(0,test_set.shape[0]):
    ratings2.append(algo.predict(str(test_set['user'][i]),test_set['book'][i])[3])

In [None]:
output2 = pandas.DataFrame({"id": test_set['id'], "rating":ratings2})
output2.to_csv("py_output2.csv",sep=",",index=False, index_label=False)

In [None]:
min(ratings2)

## Submission 14

I will see if SVD++ will show if maybe the bias is less important than the implicit rating aspect. I will use the same parameters as submission 12 so I can make a useful comparison. I tried to use the same learning rate, but I do not think it converged (2 hours with no result). Unfortunately, this would require a couple of days to run and verify, so SVD is probably our best bet.

I decided to double the number of latent factors from 20 to 40. This improved rmse by about 0.01

In [None]:
algo=SVD(n_factors=40, n_epochs=10)
algo.train(train_set)

In [None]:
ratings3=[]

for i in range(0,test_set.shape[0]):
    ratings3.append(algo.predict(str(test_set['user'][i]),test_set['book'][i])[3])

In [None]:
ratings

In [None]:
output3 = pandas.DataFrame({"id": test_set['id'], "rating":ratings3})
output3.to_csv("py_output3.csv",sep=",",index=False, index_label=False)

## Submission 15

I will test flooring ratings that are less than 1.5 to 1 to see if that improves the rmse. It barely performed any better or worse, so it is not a major issue.

In [None]:
ratings4=ratings3

In [None]:
for i in range(0,len(ratings4)):
    if (ratings4[i]<1.5):
        ratings4[i]=1

In [None]:
output4 = pandas.DataFrame({"id": test_set['id'], "rating":ratings4})
output4.to_csv("py_output4.csv",sep=",",index=False, index_label=False)

## Submission 16

In similar vein of submission 13, I will adjust the learning rate and regularization rate, but this time increasing on user and decreasing on item. I will follow submission 13's parameters of 40 factors. The minimum is 2.4, which I thought would translate to a worse rmse, but did slightly better again with another 0.01 deduction in the rmse. However, this may be overfitting, and I would go with submission 14.

In [None]:
algo=SVD(n_factors=40, n_epochs=10, lr_bu=0.001, reg_bu=0.01, lr_bi=0.03, reg_bi=0.02)

In [None]:
algo.train(train_set)

In [None]:
ratings5=[]

for i in range(0,test_set.shape[0]):
    ratings5.append(algo.predict(str(test_set['user'][i]),test_set['book'][i])[3])

print(min(ratings5),max(ratings5))

In [None]:
output5 = pandas.DataFrame({"id": test_set['id'], "rating":ratings5})
output5.to_csv("py_output5.csv",sep=",",index=False, index_label=False)

## Submission 17 & 18

I wanted to try a different model besides SVD. I attempted KNN, but the similarity matrix used up too much memory to construct for this package. Co-clustering seemed to only give me the mean rating on every test entry. I decided to take a step back and look at the baseline algorithm, which does not include the user-book interaction. Then I use SVD on the utility matrix (user-item matrix), using only 5 and 20 factors as an initial test. I got a comparable rmse to svd.

I don't understand how to get lower rmse without overfitting.

In [15]:
#algo=KNNWithMeans(k=400,min_k=10,sim_options={'name': 'pearson_baseline', 'user_based': True, 'min_support': 100, 'shrinkage':50})

#algo=CoClustering(n_cltr_u=5, n_cltr_i=5, n_epochs=10,verbose=True)
algo=BaselineOnly(bsl_options={'method':'sgd'})

In [16]:
algo.train(train_set)

Estimating biases using sgd...


In [17]:
ratings6=[]

for i in range(0,test_set.shape[0]):
    ratings6.append(algo.predict(str(test_set['user'][i]),test_set['book'][i])[3])

print(min(ratings6),max(ratings6))

1.26122590988 5


In [25]:
outputbase = pandas.DataFrame({"id": test_set['id'], "rating":ratings6})
outputbase.to_csv("py_outputbase.csv",sep=",",index=False, index_label=False)

In [3]:
import numpy as np
from scipy.sparse import csr_matrix
train_dat=pandas.read_csv("train.txt", header=None, names=['user','book','rating'], sep=' ')
train_dat
ratingsmat=csr_matrix((train_dat['rating'],(train_dat['user'],train_dat['book'])),(train_dat.shape[0],train_dat.shape[0]))
#from sklearn.metrics.pairwise import cosine_similarity

In [4]:
#ratingsmat

<20256439x20256439 sparse matrix of type '<class 'numpy.int64'>'
	with 20256439 stored elements in Compressed Sparse Row format>

In [None]:
#https://stackoverflow.com/questions/31523575/get-u-sigma-v-matrix-from-truncated-svd-in-scikit-learn
from sklearn.utils.extmath import randomized_svd

U, Sigma, VT = randomized_svd(ratingsmat, 
                              n_components=500,
                              n_iter=1,
                              random_state=None)

In [28]:
useriteminteract=[]

for i in range(0,test_set.shape[0]):
    useriteminteract.append(sum(U[test_set['user'][i],:] * Sigma * VT[:,test_set['book'][i]]))

print(min(useriteminteract),max(useriteminteract))

-0.0422788570491 0.166395263091


In [16]:
base_ratings=pandas.read_csv("py_outputbase.csv")

In [29]:
new_ratings=pandas.DataFrame({"id": base_ratings['id'], "rating": base_ratings['rating'] + useriteminteract})

In [42]:
min(new_ratings['rating'])

1.2612259098825829

In [43]:
new_ratings.to_csv("py_output7.csv",sep=",",index=False, index_label=False)