In [1]:
import numpy as np
import pandas as pd

from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

from polara import get_movielens_data

# Prepare data

This notebook uses the Movielens-1M dataset ([Movielens](https://movielens.org/) is a movie recommendation system created by researchers).

In [2]:
ratings, movies = get_movielens_data(get_genres=True, split_genres=False)

The dataset includes movie descriptions:

In [3]:
movies = movies.set_index('movieid')
movies.head()

movieid,movienm,genres
1,Toy Story (1995),Animation|Children's|Comedy
2,Jumanji (1995),Adventure|Children's|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama
5,Father of the Bride Part II (1995),Comedy


and the ratings data:

In [4]:
ratings.head(10)

,userid,movieid,rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5
5,1,1197,3
6,1,1287,5
7,1,2804,5
8,1,594,4
9,1,919,4


Total number of ratings: 

In [5]:
ratings.shape[0]

1000209

Number of users and items:

In [6]:
ratings[['userid', 'movieid']].apply(pd.Series.nunique)

userid     6040
movieid    3706
dtype: int64

Data density (the fraction of user-movie pairs that have a rating; sparsity is one minus this value):

In [7]:
ratings.shape[0] / np.prod(ratings[['userid', 'movieid']].apply(pd.Series.nunique))

0.044683625622312845
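
The ratio above is the share of observed entries in the user-movie matrix. A minimal sketch of the same arithmetic, with the counts reported above hard-coded so that it runs without the dataset, also derives the complementary share of missing entries:

```python
# Counts copied from the notebook outputs above.
n_ratings = 1000209
n_users = 6040
n_movies = 3706

density = n_ratings / (n_users * n_movies)  # fraction of observed entries
sparsity = 1.0 - density                    # fraction of missing entries

print(f"density:  {density:.4f}")
print(f"sparsity: {sparsity:.4f}")
```

More than 95% of the matrix is unobserved, which is why a sparse storage format is used later on.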

Select favorite movies (recommendations will be generated based on them):

In [8]:
movies.loc[movies.movienm.str.contains('hackers', case=False)]  # case-insensitive title search

movieid,movienm,genres
170,Hackers (1995),Action|Crime|Thriller


In [9]:
favorite_movies_ids = [170] 

Check the selection:

In [10]:
movies.loc[favorite_movies_ids]

movieid,movienm,genres
170,Hackers (1995),Action|Crime|Thriller


# Recsys model in 3 lines of code

#### 1 Build sparse matrix from ratings data

In [11]:
data_matrix = csr_matrix((ratings.rating.values.astype('f8'), (ratings.userid.values, ratings.movieid.values)))
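
The call above uses the coordinate form `csr_matrix((values, (rows, cols)))`, with raw user and movie ids serving directly as row and column indices. Because the ids start at 1, row 0 and column 0 stay empty, as do the columns of any ids that never occur. A toy sketch with made-up ratings (the `toy` frame is hypothetical, not Movielens data) shows the resulting shape:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy ratings, not taken from Movielens.
toy = pd.DataFrame({
    'userid':  [1, 1, 2, 3],
    'movieid': [2, 5, 5, 1],
    'rating':  [5, 3, 4, 2],
})

toy_matrix = csr_matrix(
    (toy.rating.values.astype('f8'),
     (toy.userid.values, toy.movieid.values))
)

print(toy_matrix.shape)      # inferred as (max userid + 1, max movieid + 1)
print(toy_matrix.toarray())  # row 0 and column 0 contain only zeros
```

Keeping the raw ids as indices wastes a little space but means a column index can be used directly as a `movieid` when looking up titles.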

#### 2 Compute sparse SVD

In [12]:
_, S, Vt = svds(data_matrix, k=50, return_singular_vectors='vh')
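
Only the right singular vectors are needed for the recommendations, which is why the first output of `svds` is discarded. A small sketch on a random sparse matrix (the sizes, density, and seed are arbitrary assumptions) checks the shapes of the outputs and the orthonormality of the rows of `vt`:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Arbitrary toy sizes: 200 "users", 80 "items", 5% of entries filled in.
toy = sparse_random(200, 80, density=0.05, format='csr', random_state=0)

k = 10
_, s, vt = svds(toy, k=k, return_singular_vectors='vh')

print(s.shape)   # one singular value per latent feature
print(vt.shape)  # k rows, one column per item

# Rows of vt are orthonormal, so vt @ vt.T is close to the identity.
print(np.allclose(vt @ vt.T, np.eye(k), atol=1e-6))
```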

#### 3 Generate top-$n$ recommendations based on the known user preferences

In [13]:
movies.loc[np.argsort(np.dot(-Vt.T, Vt[:, favorite_movies_ids].sum(axis=1)))[:15]] # assuming binary preference vector

movieid,movienm,genres
293,"Professional, The (a.k.a. Leon: The Profession...",Crime|Drama|Romance|Thriller
353,"Crow, The (1994)",Action|Romance|Thriller
170,Hackers (1995),Action|Crime|Thriller
1479,"Saint, The (1997)",Action|Romance|Thriller
2692,Run Lola Run (Lola rennt) (1998),Action|Crime|Romance
1918,Lethal Weapon 4 (1998),Action|Comedy|Crime|Drama
165,Die Hard: With a Vengeance (1995),Action|Thriller
70,From Dusk Till Dawn (1996),Action|Comedy|Crime|Horror|Thriller
1527,"Fifth Element, The (1997)",Action|Sci-Fi
163,Desperado (1995),Action|Romance|Thriller
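
The one-liner above packs several steps together. Unpacked, it sums the latent representations of the favorite movies, maps the result back to item space, and sorts items by descending score. A sketch with a hypothetical helper `top_n` and a random orthonormal factor standing in for `Vt` (both are illustrative assumptions, not part of the original notebook):

```python
import numpy as np

def top_n(vt, favorite_ids, n=10):
    """Return indices of the n highest-scoring items for a binary preference vector."""
    # Build the binary preference vector p explicitly.
    p = np.zeros(vt.shape[1])
    p[favorite_ids] = 1.0
    # Project into the latent space, then back to item space: r = V V^T p.
    latent = vt @ p
    scores = vt.T @ latent
    # Sort by descending score and keep the first n indices.
    return np.argsort(-scores)[:n]

# Stand-in for Vt: 3 orthonormal rows over 8 items (random, for illustration only).
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((8, 3)))
vt_toy = q.T

favorites = [2]
recs = top_n(vt_toy, favorites, n=5)

# The compact expression from the notebook, applied to the same toy factor.
compact = np.argsort(np.dot(-vt_toy.T, vt_toy[:, favorites].sum(axis=1)))[:5]
print(recs, compact)
```

Note that nothing in this procedure excludes the favorite movies themselves from the ranking, which is consistent with Hackers appearing in the list above.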


# What has just happened?

The truncated SVD of the ratings matrix (with missing entries imputed with zeros):

$$
A \approx U \Sigma V^T
$$

gives a compact *representation of users and movies in terms of some hidden (latent) features*, encoded by $U$ and $V$ respectively.  
Recommendations are defined by an *orthogonal projection of a user's preferences onto the latent feature space of movies*:

$$
\boldsymbol{r} = VV^T \boldsymbol{p},
$$

where $\boldsymbol{r}$ is a vector of predicted relevance scores for all movies and $\boldsymbol{p}$ is a vector of user preferences.  
Top-$n$ recommendations are the $n$ movies with the highest scores in $\boldsymbol{r}$:

$$\text{arg}\max_n\,\boldsymbol{r}$$

The model is known as *PureSVD*; see Cremonesi, P., Koren, Y., and Turrin, R., [*Performance of recommender algorithms on top-n recommendation tasks*](https://dl.acm.org/citation.cfm?id=1864721), Proceedings of the Fourth ACM Conference on Recommender Systems, 2010.
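
As a sanity check on the projection interpretation, the matrix $VV^T$ built from orthonormal right singular vectors should be symmetric and idempotent. A minimal sketch on a small random dense matrix (sizes and seed are arbitrary assumptions), using `numpy.linalg.svd` instead of `svds` for brevity:

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.random((20, 8))          # toy "ratings" matrix: 20 users, 8 items

_, _, vt = np.linalg.svd(a, full_matrices=False)
v = vt[:3].T                     # keep the 3 leading right singular vectors

proj = v @ v.T                   # candidate orthogonal projector V V^T

print(np.allclose(proj, proj.T))        # symmetric
print(np.allclose(proj @ proj, proj))   # idempotent: projecting twice changes nothing

p = np.zeros(8)
p[[1, 4]] = 1.0                  # binary preferences for items 1 and 4
r = proj @ p                     # predicted relevance scores
print(np.allclose(proj @ r, r))  # r already lies in the latent subspace
```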