## <font color='green'>MovieLens - Singular Value Decomposition (Sparse Matrix)</font>

### <font color='green'> 1. Description</font>

Recommendation using singular value decomposition.

The data is taken from: https://grouplens.org/datasets/movielens/25m/
It contains 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users. 

We use singular value decomposition (SVD) for the recommendation task. From the ratings, we build a user-item matrix whose entries are the ratings each user gave to each movie. This matrix is very sparse. We decompose it into the product of three matrices: the left singular vectors (U), the singular values (s), and the transposed right singular vectors (V^T). Keeping only the largest singular values (and the corresponding columns of U and rows of V^T) gives a low-rank approximation of the matrix. Multiplying the truncated U, s, and V^T then yields estimates for the unrated entries of the original matrix. This is a well-known collaborative filtering method.
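
As a minimal sketch of the decomposition described above, the following uses `scipy.sparse.linalg.svds` on a small toy matrix. The matrix values and the choice `k = 2` are assumptions made for illustration only; they are not taken from the MovieLens data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy user-item matrix (rows: users, columns: movies); 0 means "not rated".
ratings = csr_matrix(np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
]))

k = 2  # number of singular values to keep (assumed for this toy example)
U, s, Vt = svds(ratings, k=k)  # svds requires k < min(ratings.shape)

# Rank-k approximation: X ~= U * diag(s) * V^T
approx = (U * s) @ Vt

# Entries that were 0 (unrated) now hold inferred scores.
print(np.round(approx, 2))
```

In this toy case the inferred score for user 0 and movie 2 comes out low, which matches the low rating that user gave to the similar movie 3.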

In the example result below, Japanese animated films are recommended to a user who rated such films highly.

### <font color='green'> 2. Data Preprocessing</font>

In [1]:
# Uncomment the lines below to download and unzip the dataset.
#!wget -N http://files.grouplens.org/datasets/movielens/ml-25m.zip
#!unzip -o ml-25m.zip
# create datasets/ first so that the files end up under datasets/ml-25m/
#!mkdir -p datasets
#!mv ml-25m datasets

In [2]:
# prepare data
import pandas as pd
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse import csr_matrix

df = pd.read_csv("datasets/ml-25m/ratings.csv").drop(['timestamp'], axis=1)
# userId and movieId start from 1; they are used as indices without subtracting 1
# so that row/column numbers match the original IDs (row 0 and column 0 stay empty)
userId = df['userId'].values
movieId = df['movieId'].values
rating = df['rating'].values
mat = coo_matrix((rating, (userId, movieId))).tocsr()
print ("shape of the matrix is {}".format(mat.shape))

shape of the matrix is (162542, 209172)
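
The reported shape is one larger than the largest ID in each dimension because the 1-based IDs are used directly as indices. A small sketch with made-up IDs (hypothetical values, not from `ratings.csv`) shows the effect:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy ratings with 1-based IDs, mirroring the ratings.csv columns.
user_ids = np.array([1, 1, 2, 3])
movie_ids = np.array([1, 4, 2, 4])
values = np.array([5.0, 3.0, 4.0, 2.0])

toy = coo_matrix((values, (user_ids, movie_ids))).tocsr()

# The shape is inferred as (max ID + 1) in each dimension.
print(toy.shape)        # (4, 5)
print(toy[0, :].nnz)    # 0: row 0 is an unused placeholder
print(toy[1, 4])        # 3.0: rating of user 1 for movie 4
```

The benefit of keeping the offset is that `mat[userId, movieId]` can be read with the original IDs, at the cost of one empty row and one empty column.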


### <font color='green'> 3. Implementation using Frovedis</font>

In [3]:
# train
import os, time
from frovedis.decomposition import TruncatedSVD as frovTruncatedSVD
from frovedis.exrpc.server import FrovedisServer
FrovedisServer.initialize("mpirun -np 8 {}".format(os.environ['FROVEDIS_SERVER']))

svd = frovTruncatedSVD(n_components=300, algorithm='arpack')
t1 = time.time()
frov_transformed = svd.fit_transform(mat)
t2 = time.time()
print ("train time: {:.3f} sec".format(t2-t1))

train time: 6.595 sec


In [4]:
# predict
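movies = pd.read_csv("datasets/ml-25m/movies.csv")
# to_string() accepts max_colwidth only from pandas 1.0.0 onward;
# distutils (LooseVersion) was removed in Python 3.12, so parse the major version directly
PANDAS_MAJOR = int(pd.__version__.split('.')[0])
def print_movies(print_df):
    # attach titles to the given movieId/rate pairs and print them by descending rate
    merged = pd.merge(movies, print_df, on = 'movieId').sort_values('rate',ascending=False)
    to_print = merged[['title','rate']]
    if PANDAS_MAJOR >= 1:
        print (to_print.to_string(index=False, max_colwidth=50))
    else:
        print (to_print.to_string(index=False))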

testId = 282

# X ~= USV^T, transformed = US, svd.components_ = V^T
# so one row of US times V^T gives the inferred ratings of that user for all movies
infer = np.matmul(frov_transformed[testId,:], svd.components_)
movieId = np.arange(infer.shape[0])
infer_df = pd.DataFrame({'rate_infer' : infer, 'movieId' : movieId})

rated = mat[testId,:]
rated_data = rated.data
rated_indices = rated.indices
rated_df = pd.DataFrame({'rate_rated' : rated_data, 'movieId': rated_indices})
rated_sorted = rated_df.sort_values('rate_rated',ascending=False)
to_print = rated_sorted.rename(columns={'rate_rated': 'rate'})
print("highly rated movies of user {}:".format(testId))
print_movies(to_print.head(20))

# exclude already rated movies
tmp_df = pd.merge(infer_df, rated_df, on='movieId', how='outer')
not_rated_df = tmp_df[pd.isnull(tmp_df['rate_rated'])]
not_rated_sorted = not_rated_df.sort_values('rate_infer',ascending=False)
to_print = not_rated_sorted.rename(columns={'rate_infer': 'rate'})
print("")
print("recommended movies:")
print_movies(to_print.head(20))

FrovedisServer.shut_down()

highly rated movies of user 282:
                                             title  rate
                                      Akira (1988)   5.0
                            Iron Giant, The (1999)   5.0
                             Boxtrolls, The (2014)   5.0
                  Grand Budapest Hotel, The (2014)   5.0
 From Up on Poppy Hill (Kokuriko-zaka kara) (2011)   5.0
                                       Hugo (2011)   5.0
 Secret World of Arrietty, The (Kari-gurashi no...   5.0
                                   Coraline (2009)   5.0
                Ponyo (Gake no ue no Ponyo) (2008)   5.0
                                     WALL·E (2008)   5.0
 Girl Who Leapt Through Time, The (Toki o kaker...   5.0
          Tekkonkinkreet (Tekkon kinkurîto) (2006)   5.0
                         Paprika (Papurika) (2006)   5.0
                                 MirrorMask (2005)   5.0
 Howl's Moving Castle (Hauru no ugoku shiro) (2...   5.0
 Porco Rosso (Crimson Pig) (Kurenai no buta) (1...   5.
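
The prediction cell excludes already rated movies with an outer merge in pandas. As an alternative sketch, the same selection can be done directly on NumPy arrays by masking the rated indices. The helper name `recommend_top_n` and all numbers below are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def recommend_top_n(user_factors, components, rated_indices, n=2):
    """Hypothetical helper: indices of the n highest-scoring unrated items."""
    scores = user_factors @ components   # one inferred score per item
    scores[rated_indices] = -np.inf      # exclude items the user already rated
    order = np.argsort(scores)[::-1]     # highest score first
    return order[:n]

# Made-up factors: one user row of US (length 2) and V^T (2 x 5 items).
user_factors = np.array([2.0, 1.0])
components = np.array([
    [0.5, 0.1, 0.4, 0.2, 0.3],
    [0.1, 0.6, 0.2, 0.5, 0.0],
])
rated_indices = np.array([0, 3])  # items this user has already rated

top = recommend_top_n(user_factors, components, rated_indices, n=2)
print(top)
```

With the real data, `user_factors` would correspond to `frov_transformed[testId, :]`, `components` to `svd.components_`, and `rated_indices` to `mat[testId, :].indices`.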

### <font color='green'> 4. Implementation using scikit-learn</font>

In [5]:
# train
from sklearn.decomposition import TruncatedSVD as skTruncatedSVD

svd = skTruncatedSVD(n_components=300, algorithm='arpack')
t1 = time.time()
sk_transformed = svd.fit_transform(mat)
t2 = time.time()
print ("train time: {:.3f} sec".format(t2-t1))

train time: 104.333 sec
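
To make the relation between `fit_transform` and `components_` used in the prediction step concrete, here is a small sketch with scikit-learn on a toy matrix. The matrix values and `n_components=2` are assumptions chosen for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy user-item matrix (4 users x 5 movies); 0 means "not rated".
toy = csr_matrix(np.array([
    [5.0, 3.0, 0.0, 1.0, 0.0],
    [4.0, 0.0, 0.0, 1.0, 2.0],
    [1.0, 1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 5.0, 4.0, 0.0],
]))

# With algorithm='arpack', n_components must be smaller than the number of columns.
svd = TruncatedSVD(n_components=2, algorithm='arpack')
transformed = svd.fit_transform(toy)   # shape (4, 2), corresponds to U * s

# Each column of U has unit length, so the column norms of U * s
# should match the singular values.
print(np.linalg.norm(transformed, axis=0))
print(svd.singular_values_)

# Inferred scores of user 0 for all 5 movies: one row of U * s times V^T.
infer = transformed[0, :] @ svd.components_
print(infer.shape)
```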


In [6]:
# predict

# X ~= USV^T, transformed = US, svd.components_ = V^T
# so one row of US times V^T gives the inferred ratings of that user for all movies
infer = np.matmul(sk_transformed[testId,:], svd.components_)
movieId = np.arange(infer.shape[0])
infer_df = pd.DataFrame({'rate_infer' : infer, 'movieId' : movieId})

rated = mat[testId,:]
rated_data = rated.data
rated_indices = rated.indices
rated_df = pd.DataFrame({'rate_rated' : rated_data, 'movieId': rated_indices})
rated_sorted = rated_df.sort_values('rate_rated',ascending=False)
to_print = rated_sorted.rename(columns={'rate_rated': 'rate'})
print("highly rated movies of user {}:".format(testId))
print_movies(to_print.head(20))

# exclude already rated movies
tmp_df = pd.merge(infer_df, rated_df, on='movieId', how='outer')
not_rated_df = tmp_df[pd.isnull(tmp_df['rate_rated'])]
not_rated_sorted = not_rated_df.sort_values('rate_infer',ascending=False)
to_print = not_rated_sorted.rename(columns={'rate_infer': 'rate'})
print("")
print("recommended movies:")
print_movies(to_print.head(20))

highly rated movies of user 282:
                                             title  rate
                                      Akira (1988)   5.0
                            Iron Giant, The (1999)   5.0
                             Boxtrolls, The (2014)   5.0
                  Grand Budapest Hotel, The (2014)   5.0
 From Up on Poppy Hill (Kokuriko-zaka kara) (2011)   5.0
                                       Hugo (2011)   5.0
 Secret World of Arrietty, The (Kari-gurashi no...   5.0
                                   Coraline (2009)   5.0
                Ponyo (Gake no ue no Ponyo) (2008)   5.0
                                     WALL·E (2008)   5.0
 Girl Who Leapt Through Time, The (Toki o kaker...   5.0
          Tekkonkinkreet (Tekkon kinkurîto) (2006)   5.0
                         Paprika (Papurika) (2006)   5.0
                                 MirrorMask (2005)   5.0
 Howl's Moving Castle (Hauru no ugoku shiro) (2...   5.0
 Porco Rosso (Crimson Pig) (Kurenai no buta) (1...   5.