In [2]:
import numpy as np
import pandas as pd

In [3]:
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=column_names)

In [4]:
df.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


We are still missing the movie titles.

In [5]:
movie_titles = pd.read_csv("Movie_Id_Titles")
movie_titles.head()

Unnamed: 0,item_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


The two tables are joined with a merge on `item_id`.

In [6]:
df = pd.merge(df,movie_titles,on='item_id')
df.head()

Unnamed: 0,user_id,item_id,rating,timestamp,title
0,196,242,3,881250949,Kolya (1996)
1,63,242,3,875747190,Kolya (1996)
2,226,242,5,883888671,Kolya (1996)
3,154,242,3,879138235,Kolya (1996)
4,306,242,5,876503793,Kolya (1996)
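
As a small illustration of what the merge does, here is a sketch on two toy tables (the values are made up and are not from the MovieLens files). Each rating row picks up the title of its `item_id`, so a title repeats once per rating. The `validate="many_to_one"` argument is an optional safeguard that raises an error if the titles table contains duplicate `item_id` values.

```python
import pandas as pd

# Toy stand-ins for the ratings and titles tables (illustrative values only).
ratings = pd.DataFrame({
    "user_id": [1, 2, 3],
    "item_id": [10, 10, 20],
    "rating": [5, 3, 4],
})
titles = pd.DataFrame({
    "item_id": [10, 20],
    "title": ["Movie A", "Movie B"],
})

# Default inner join on item_id; validate raises if titles repeats an item_id.
merged = pd.merge(ratings, titles, on="item_id", validate="many_to_one")
print(merged)
```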


Let's see how many unique users and movies we have.

In [7]:
n_users = df.user_id.nunique()
n_items = df.item_id.nunique()

print('Num. of Users: '+ str(n_users))
print('Num of Movies: '+str(n_items))

Num. of Users: 943
Num of Movies: 1682


## Data Partition

We need to partition the data we will work with; in this case we use 80% for training and 20% for testing.

The goal is to hold out data for testing the rating predictions obtained through SVD.

In [8]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.2)
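
The split above is not seeded, so every run yields a different partition and slightly different error figures; passing `random_state` to `train_test_split` makes it repeatable. To show the idea behind an 80/20 split, here is a minimal NumPy-only sketch. The helper `split_rows` is hypothetical and is not how scikit-learn implements it.

```python
import numpy as np

def split_rows(n_rows, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) as disjoint, shuffled row positions."""
    rng = np.random.default_rng(seed)       # fixed seed -> repeatable split
    shuffled = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_size))
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = split_rows(100000, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))
```

With a DataFrame, `df.iloc[train_idx]` and `df.iloc[test_idx]` would then select the two partitions.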



Example of a user-item matrix:
<img class="aligncenter size-thumbnail img-responsive" src="BLOG_CCA_8.png" alt="blog8"/>

Now we need to convert the data into a matrix with users as rows and movies as columns, as mentioned at the beginning.

In [9]:
#Create two user-item matrices, one for training and another for testing
train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    train_data_matrix[line[1]-1, line[2]-1] = line[3]  

test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1]-1, line[2]-1] = line[3]
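
The same matrices can be filled without an explicit Python loop by using NumPy integer-array indexing. The sketch below assumes, like the loop above, that user and item IDs are 1-based and contiguous; the helper name `build_user_item_matrix` is hypothetical.

```python
import numpy as np

def build_user_item_matrix(user_ids, item_ids, ratings, n_users, n_items):
    """Fill a dense (n_users, n_items) matrix; unrated cells stay 0."""
    matrix = np.zeros((n_users, n_items))
    # IDs are assumed to be 1-based, as in the loop above, hence the -1.
    matrix[np.asarray(user_ids) - 1, np.asarray(item_ids) - 1] = ratings
    return matrix

toy = build_user_item_matrix([1, 1, 3], [1, 2, 2], [5, 3, 4], n_users=3, n_items=2)
print(toy)
```

For the training set this would be called with `train_data.user_id.values`, `train_data.item_id.values` and `train_data.rating.values`.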

Now that we have the matrix, we can apply SVD.

In [10]:
train_data_matrix

array([[ 5.,  3.,  4., ...,  0.,  0.,  0.],
       [ 4.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 5.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  5.,  0., ...,  0.,  0.,  0.]])

### SVD
A well-known matrix factorization method is **singular value decomposition (SVD)**. Collaborative filtering can be formulated as approximating a matrix `X` with its singular value decomposition. The winning team in the Netflix Prize competition used SVD matrix factorization models to produce product recommendations; for more information, see [Netflix Recommendations: Beyond the 5 stars](http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html) and [Netflix Prize and SVD](http://buzzard.ups.edu/courses/2014spring/420projects/math420-UPS-spring-2014-gower-netflix-SVD.pdf).
The general equation can be expressed as follows:
<img src="https://latex.codecogs.com/gif.latex?X=USV^T" title="X=USV^T" />


Given an `m x n` matrix `X`:
* *`U`* is an *`(m x r)`* matrix with orthonormal columns
* *`S`* is an *`(r x r)`* diagonal matrix with non-negative real numbers on the diagonal
* *`V^T`* is an *`(r x n)`* matrix with orthonormal rows

Elements on the diagonal of `S` are known as the *singular values of `X`*.
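
These properties can be checked numerically. The sketch below uses an arbitrary random toy matrix and NumPy's reduced SVD (`full_matrices=False`), for which `r = min(m, n)`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 6))  # arbitrary toy matrix, m=4, n=6

# full_matrices=False returns the reduced factors with r = min(m, n).
U, s, Vt = np.linalg.svd(X, full_matrices=False)

print(U.shape, s.shape, Vt.shape)
print(np.allclose(U.T @ U, np.eye(4)))           # columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(4)))         # rows of V^T are orthonormal
print(np.all(s >= 0), np.all(np.diff(s) <= 0))   # non-negative, sorted descending
print(np.allclose(U @ np.diag(s) @ Vt, X))       # X = U S V^T
```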


Matrix *`X`* can be factorized into *`U`*, *`S`* and *`V`*. The *`U`* matrix holds the feature vectors of the users in the hidden feature space, and the *`V`* matrix holds the feature vectors of the items in the same hidden feature space.
<img class="aligncenter size-thumbnail img-responsive" style="max-width:100%; width: 50%; max-width: none" src="BLOG_CCA_4.png"/>

Now you can make a prediction by taking the dot product of *`U`*, *`S`* and *`V^T`*.

<img class="aligncenter size-thumbnail img-responsive" style="max-width:100%; width: 50%; max-width: none" src="BLOG_CCA_5.png"/>
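
To make the prediction step concrete, here is a minimal sketch that keeps only the `k` largest singular values before multiplying the factors back together. The toy ratings are made up, and `rank_k_approximation` is a hypothetical helper, not part of any library. The reconstruction error shrinks as `k` grows and vanishes at full rank.

```python
import numpy as np

def rank_k_approximation(X, k):
    """Reconstruct X from its k largest singular values and vectors."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Toy ratings (0 = unrated); values are made up for illustration.
X = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 5.0, 1.0, 0.0],
              [0.0, 1.0, 5.0, 4.0],
              [1.0, 0.0, 4.0, 5.0]])

for k in (1, 2, 4):
    error = np.linalg.norm(X - rank_k_approximation(X, k))
    print(k, round(error, 3))
```

A small `k` forces the model to generalize across users and items, which is what fills the unrated cells with non-trivial values.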

We define root mean squared error (RMSE) as the metric for evaluating rating predictions.

In [11]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten() 
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))
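
As a sanity check of the masking logic, here is a NumPy-only variant applied to a tiny example. The name `rmse_on_rated` and the toy arrays are assumptions made for illustration. Only the two rated cells contribute, so the errors are -1 and 2 and the result is the square root of 2.5.

```python
import numpy as np

def rmse_on_rated(prediction, ground_truth):
    """RMSE restricted to cells where ground_truth is nonzero (i.e. rated)."""
    mask = ground_truth != 0
    diff = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

toy_truth = np.array([[5.0, 0.0], [0.0, 3.0]])
toy_pred = np.array([[4.0, 2.0], [1.0, 5.0]])
toy_rmse = rmse_on_rated(toy_pred, toy_truth)
print(toy_rmse)
```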

We compute the SVD with 10 factors using SciPy.

In [12]:
import scipy.sparse as sp
from scipy.sparse.linalg import svds

#get SVD components from train matrix. Choose k.
u, s, vt = svds(train_data_matrix,k=10)
s_diag_matrix=np.diag(s)
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)


In [13]:
print('SVD-based CF RMSE: ' + str(rmse(X_pred, test_data_matrix)))

SVD-based CF RMSE: 2.613260499060906


Fitting only the relatively few known entries without care is highly prone to overfitting. SVD can also be slow and computationally expensive. More recent work minimizes the squared error with alternating least squares or stochastic gradient descent and uses regularization terms to prevent overfitting. Alternating least squares and stochastic gradient descent methods for CF will be covered in the next tutorials.


In [14]:
X_pred

array([[  3.13904642e+00,   1.29351185e+00,   1.00254449e+00, ...,
          0.00000000e+00,   1.72484194e-02,   5.49856132e-02],
       [  1.22604708e+00,  -1.51966931e-01,   1.59979126e-01, ...,
          0.00000000e+00,  -3.20125785e-03,  -1.13917974e-03],
       [ -2.57855837e-02,   5.55473032e-02,   5.65704435e-02, ...,
          0.00000000e+00,  -3.73799467e-03,  -1.06724791e-02],
       ..., 
       [  1.61596099e+00,  -1.36794668e-01,   2.16129428e-01, ...,
          0.00000000e+00,  -6.36693768e-03,  -4.86716967e-03],
       [  1.20405903e+00,   2.73101983e-01,  -2.72790470e-01, ...,
          0.00000000e+00,  -2.91938972e-04,  -2.65697078e-02],
       [  9.97818290e-01,   1.72865581e+00,   7.28208692e-01, ...,
          0.00000000e+00,   2.71038868e-02,   2.84539183e-02]])

In [15]:
train_data_matrix

array([[ 5.,  3.,  4., ...,  0.,  0.,  0.],
       [ 4.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 5.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  5.,  0., ...,  0.,  0.,  0.]])

In [31]:
np.shape(X_pred)

(943, 1682)

In [22]:
np.savetxt('train_data_matrix.txt', train_data_matrix)

In [20]:
np.shape(train_data_matrix)

(943, 1682)

In [35]:
# np.linalg.svd returns V already transposed, so V below holds V^T
U, s, V = np.linalg.svd(train_data_matrix)

In [42]:
np.shape(V)

(1682, 1682)

In [45]:
pred=np.dot(np.dot(U,np.diag(s)),V[0:943,:])

In [46]:
pred

array([[ -2.65456060e-14,   3.00000000e+00,   4.00000000e+00, ...,
         -2.31422954e-16,   2.42861287e-16,   8.67361738e-19],
       [  4.00000000e+00,   2.96637714e-13,  -2.49800181e-15, ...,
          4.85722573e-16,   3.20923843e-16,  -8.74300632e-16],
       [ -1.72778458e-14,  -3.77961551e-14,   1.72847847e-14, ...,
         -4.73579509e-16,  -9.36750677e-17,   4.64905892e-16],
       ..., 
       [  5.00000000e+00,  -5.09314813e-15,   2.61943245e-15, ...,
         -1.73472348e-16,   4.51028104e-16,  -9.88792381e-17],
       [ -1.36214824e-14,  -1.03875242e-14,  -3.81747585e-15, ...,
         -3.02709247e-16,  -4.55364912e-18,  -6.76542156e-16],
       [ -3.85715765e-14,   5.00000000e+00,  -5.17554749e-15, ...,
          1.16226473e-16,   4.16333634e-17,   5.22151766e-16]])

In [48]:
print('SVD-based CF RMSE: ' + str(rmse(X_pred, test_data_matrix)))

SVD-based CF RMSE: 3.66313582407
