# Collaborative filtering with side information
** *
This IPython notebook illustrates the usage of the [cmfrec](https://github.com/david-cortes/cmfrec) Python package for collective matrix factorization using the [MovieLens-100k data](https://grouplens.org/datasets/movielens/100k/), consisting of ratings from users about movies + user demographic information + movie genres, and also using the MovieLens-1M data + the movie tag genome.

Collective matrix factorization is a technique for collaborative filtering with additional information about the users and items, based on low-rank joint factorization of different matrices with shared factors – for more details see the paper [_Singh, A. P., & Gordon, G. J. (2008, August). Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 650-658). ACM._](http://ra.adm.cs.cmu.edu/anon/usr/ftp/ml2008/CMU-ML-08-109.pdf).

**Small note: if the TOC here is not clickable or the math symbols don't show properly, try viewing this same notebook on nbviewer by following [this link](http://nbviewer.jupyter.org/github/david-cortes/cmfrec/blob/master/example/cmfrec_movielens_sideinfo.ipynb).**
** *
## Sections

[1. Basic model - only movie ratings](#p1)
* [1.1 Loading the ratings data](#p11)
* [1.2 Train and test split](#p12)
* [1.3 Fitting and evaluating the model](#p13)

[2. Adding movie genres](#p2)
* [2.1 Loading the movie genres info](#p21)
* [2.2 Fitting and evaluating model with genres](#p22)

[3. Adding user demographic info](#p3)
* [3.1 Loading the user demographic info](#p31)
* [3.2 Getting user region from zip codes](#p32)
* [3.3 Fitting and evaluating the full model](#p33)

[4. Comparing recommendations](#p4)

[5. Larger example - MovieLens-1M and movie tag genome](#p5)
* [5.1 Loading the data](#p51)
* [5.2 Model with no side info](#p52)
* [5.3 Model with movie tags and user demographics](#p53)

** *

<a id="p1"></a>
## 1. Basic model - only movie ratings
** *
As a starting point, I'll first try the basic low-rank factorization model using ratings data alone - that is, trying to minimize the following function:

$$ Loss\:(U, V) = \frac{\lVert X - UV^T \rVert^2}{n} + \:\lambda\: (\lVert U \rVert^2 + \lVert V \rVert^2) $$

Where $U$ and $V$ are lower-dimensional matrices mapping users and items into a latent space - this is the classic model popularized by Funk. The predicted rating from this model for a given user $i$ and movie $j$ can be calculated as $U[i,:]*V[j,:]^T$
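
To make the formula concrete, here is a minimal NumPy sketch of the loss and of the prediction rule on toy data. It is not how the package computes things internally: the function names are hypothetical, and I'm assuming that the squared error is taken only over the observed (non-missing) ratings and that $n$ is the number of observed ratings.

```python
import numpy as np

def mf_loss(X, U, V, lam):
    """Regularized squared error of X ~ U V^T over the observed entries of X.

    Assumption: missing ratings are encoded as NaN and are ignored,
    and n is the count of observed ratings.
    """
    observed = ~np.isnan(X)
    residual = np.where(observed, X - U @ V.T, 0.0)
    n = observed.sum()
    return (residual ** 2).sum() / n + lam * ((U ** 2).sum() + (V ** 2).sum())

def predict_rating(U, V, i, j):
    """Predicted rating of user i for movie j: dot product of their factor rows."""
    return float(U[i, :] @ V[j, :])

# Toy data: 2 users, 3 movies, 2 latent factors
rng = np.random.default_rng(0)
X_toy = np.array([[5.0, np.nan, 3.0],
                  [np.nan, 4.0, 1.0]])
U_toy = rng.normal(scale=0.1, size=(2, 2))
V_toy = rng.normal(scale=0.1, size=(3, 2))

toy_loss = mf_loss(X_toy, U_toy, V_toy, lam=1e-3)
toy_pred = predict_rating(U_toy, V_toy, 0, 2)
```

An optimizer would then search for the `U` and `V` that make this value as small as possible.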

<a id="p11"></a>
### 1.1 Loading the ratings data

Here I'll load the MovieLens-100k ratings data, which can be downloaded from the link presented at the beginning:

In [1]:
import pandas as pd, time
from datetime import datetime

ratings=pd.read_table('D:\\Downloads\\movielens\\ml-100k\\ml-100k\\u.data',sep='\t',engine='python',names=['UserId','ItemId','Rating','Timestamp'])
ratings['Timestamp']=ratings.Timestamp.map(lambda x: datetime(*time.localtime(x)[:6])).map(lambda x: pd.to_datetime(x))
ratings=ratings.sort_values(['UserId','ItemId']).reset_index(drop=True)
ratings.head()

Unnamed: 0,UserId,ItemId,Rating,Timestamp
0,1,1,5,1997-09-23 01:02:38
1,1,2,3,1997-10-15 08:26:11
2,1,3,4,1997-11-03 09:42:40
3,1,4,3,1997-10-15 08:25:19
4,1,5,3,1998-03-13 03:15:12


<a id="p12"></a>
### 1.2 Train and test split

In order to evaluate the model, I'll create a train and test set split to use throughout the whole notebook. As this kind of model can only recommend items that were in the training set to users who also were in the training set, I'll make the test set contain only elements that were present in the train set.

In order to make this more realistic, I'll make it a temporal split, i.e. splitting the ratings into those that were submitted before and after a certain time cutoff.

In [2]:
time_cutoff='1998-01-01'
train=ratings.loc[ratings.Timestamp<=time_cutoff]
test=ratings.loc[ratings.Timestamp>time_cutoff]
users_train=set(list(train.UserId))
items_train=set(list(train.ItemId))
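# keep only test ratings whose user and movie both appear in the training set
# (.copy() so that adding a 'Predicted' column later doesn't write into a slice)
test=test.loc[test.UserId.isin(users_train) & test.ItemId.isin(items_train)].copy()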
print(train.shape)
print(test.shape)

(52884, 4)
(5835, 4)


Note that this is a **very** small sample; in a typical setting you would have 3 or 4 orders of magnitude more data. Nevertheless, this smallish data is enough to see a difference between models.

In [3]:
print(len(users_train))
print(len(items_train))

529
1493


<a id="p13"></a>
### 1.3 Fitting and evaluating the model

Traditionally, recommendations have been evaluated by their cross-validated RMSE (root mean squared error), but this is not really a good metric, as a lower RMSE might not translate into better-liked recommendations. There are many additional metrics that can be used, but to keep this example simple, I'll look at the ratings that users actually gave to the Top-5 recommendations from this model (among the movies they rated in the test set) and compare this to recommendations by best average rating and to random recommendations (the average rating given by each user).
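
As a rough illustration of this Top-5 metric, the sketch below ranks each user's test movies by some score column and averages the ratings the users actually gave to the top few. The helper name `mean_rating_of_top_n` and the toy data are made up for this example; like the evaluation cells further down, it pools the selected rows across all users rather than averaging per-user means.

```python
import pandas as pd

def mean_rating_of_top_n(df, score_col, n=5, rating_col="Rating", user_col="UserId"):
    """Average actual rating of each user's n highest-scored rows, pooled over users."""
    ranked = df.sort_values([user_col, score_col], ascending=[True, False])
    return ranked.groupby(user_col)[rating_col].head(n).mean()

# Toy test set: two users, three rated movies each
toy = pd.DataFrame({
    "UserId":    [1, 1, 1, 2, 2, 2],
    "Rating":    [5, 3, 1, 4, 2, 5],
    "Predicted": [4.8, 2.0, 3.5, 1.0, 4.0, 4.5],
})

top2_by_model = mean_rating_of_top_n(toy, "Predicted", n=2)  # model's picks
top2_oracle = mean_rating_of_top_n(toy, "Rating", n=2)       # best achievable
```

Passing the actual rating as the score gives the upper bound ("top-5 rated by each user"), which is a useful reference when reading the model's number.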

In [4]:
from cmfrec import CMF
import numpy as np

# Number of latent factors
k=40

# Regularization parameter
reg=1e-3

# Fitting the model
rec=CMF(k=k, reg_param=reg)
rec.fit(train, random_seed=12345)

# Making predictions
test['Predicted']=test.apply(lambda x: rec.predict(x['UserId'],x['ItemId']),axis=1)

INFO:tensorflow:Optimization terminated with:
  Message: b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
  Objective function value: 3.123490
  Number of iterations: 62
  Number of functions evaluations: 66


Evaluating Hold-Out RMSE (the hyperparameters had already been somewhat tuned by cross-validation)

In [5]:
np.sqrt(np.mean((test.Predicted-test.Rating)**2))

1.5022877488223678

Basic evaluation of this model:

In [6]:
avg_ratings=train.groupby('ItemId')['Rating'].mean().to_frame().rename(columns={"Rating":"AvgRating"})
test2=pd.merge(test,avg_ratings,left_on='ItemId',right_index=True,how='left')

print('Average movie rating:',test2.groupby('UserId')['Rating'].mean().mean())
print('Average rating for top-5 rated by each user:',test2.sort_values(['UserId','Rating'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for bottom-5 rated by each user:',test2.sort_values(['UserId','Rating'],ascending=True).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for top-5 recommendations of best-rated movies:',test2.sort_values(['UserId','AvgRating'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('----------------------')
print('Average rating for top-5 recommendations from this model:',test2.sort_values(['UserId','Predicted'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for bottom-5 (non-)recommendations from this model:',test2.sort_values(['UserId','Predicted'],ascending=True).groupby('UserId')['Rating'].head(5).mean())

Average movie rating: 3.5602718818211856
Average rating for top-5 rated by each user: 4.5298621745788665
Average rating for bottom-5 rated by each user: 2.246554364471669
Average rating for top-5 recommendations of best-rated movies: 4.029096477794793
----------------------
Average rating for top-5 recommendations from this model: 3.9555895865237365
Average rating for bottom-5 (non-)recommendations from this model: 3.2128637059724348


The recommendations from this model are not bad, but the average rating of its Top-5 (3.96) doesn't manage to beat simply recommending the best-rated movies (4.03)! This is not surprising given the small size of the ratings data, though.

<a id="p2"></a>
## 2. Adding movie genres
** *
The previous model can be extended by adding some additional information about the movies - this can be done by also factorizing the movie-genre matrix and sharing the item-factor matrix in the factorization of the user-item ratings. Now the model becomes:

$$ Loss\:(U, V, Z) = \frac{\lVert X - UV^T \rVert^2}{n} + \frac{\lVert M - VZ^T \rVert^2}{m} + \:\lambda\: (\lVert U \rVert^2 + \lVert V \rVert^2 + \lVert Z \rVert^2) $$

Where $U$, $V$ and $Z$ are lower-dimensional matrices mapping users, items and genres into a latent space. The predicted rating from this model for a given user $i$ and movie $j$ is still calculated the same as before: $U[i,:]*V[j,:]^T$. However, we can intuitively think that an item-factor matrix that also represents genres might be better than one that does not, and less likely to overfit, as these factors are not so free.
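
A small NumPy sketch of this joint loss on toy data may help to see how the item factors $V$ enter both terms. This is only an illustration under my own assumptions (missing ratings stored as NaN and skipped, $n$ = number of observed ratings, $m$ = number of entries in the genre matrix); `cmf_loss` is a hypothetical name, not a function from the package.

```python
import numpy as np

def cmf_loss(X, M, U, V, Z, lam):
    """Joint loss: ratings X ~ U V^T and item attributes M ~ V Z^T share V."""
    observed = ~np.isnan(X)
    err_x = np.where(observed, X - U @ V.T, 0.0)  # ratings reconstruction error
    err_m = M - V @ Z.T                           # genre reconstruction error
    penalty = (U ** 2).sum() + (V ** 2).sum() + (Z ** 2).sum()
    return ((err_x ** 2).sum() / observed.sum()
            + (err_m ** 2).sum() / M.size
            + lam * penalty)

# Toy data: 2 users, 2 movies, 3 genres, 2 latent factors
X_toy = np.array([[5.0, np.nan],
                  [np.nan, 3.0]])
M_toy = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
rng = np.random.default_rng(1)
U_toy = rng.normal(scale=0.1, size=(2, 2))
V_toy = rng.normal(scale=0.1, size=(2, 2))
Z_toy = rng.normal(scale=0.1, size=(3, 2))

joint_loss = cmf_loss(X_toy, M_toy, U_toy, V_toy, Z_toy, lam=1e-3)
```

Because `V` has to reconstruct the genres as well as the ratings, it cannot be fitted to the ratings alone, which is the regularizing effect described above.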

The matrix $V$, however, doesn't need to be exactly the same in both terms - we can also add some additional factors that appear in only one factorization, resulting in the following formula:

$$ Loss\:(U, V, Z) = \frac{\lVert X - UV_{main}^T \rVert^2}{n} + \frac{\lVert M - V_{sec}Z^T \rVert^2}{m} + \:\lambda\: (\lVert U \rVert^2 + \lVert V \rVert^2 + \lVert Z \rVert^2) $$

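Where $V_{main} = V_{[\cdot,\:1\:to\:k_{main} + k_{shared}]}$ and $V_{sec} = V_{[\cdot,\:k_{main}+1\:to\:k_{main} + k_{shared} + k_{sec}]}$, taking the factors to be the columns of $V$ (so that $V[j,:]$ is the factor vector of movie $j$, as in the prediction formula). That is, the first $k_{main}$ factors are used only for the ratings, the next $k_{shared}$ are used in both factorizations, and the last $k_{sec}$ are used only for the genres.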
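
The partition of the factors can be sketched with plain NumPy slicing. I'm assuming here that the factors are stored along the columns of $V$ (one row per movie); the names `k_main`, `k_shared` and `k_sec` follow the formula and don't necessarily correspond one-to-one to the package's arguments.

```python
import numpy as np

k_main, k_shared, k_sec = 2, 3, 1
n_items = 4

# One row per movie, one column per latent factor
V = np.arange(n_items * (k_main + k_shared + k_sec), dtype=float).reshape(n_items, -1)

V_main = V[:, :k_main + k_shared]   # factors used to reconstruct the ratings
V_sec = V[:, k_main:]               # factors used to reconstruct the genres

# The shared block is the same set of columns seen from both sides
shared_from_main = V_main[:, k_main:]
shared_from_sec = V_sec[:, :k_shared]
```

Only the shared block ties the two factorizations together; the remaining columns are free to fit one matrix without affecting the other.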


<a id="p21"></a>
### 2.1 Loading the movie genres info

The MovieLens-100k data also comes with a file containing movie information that we can use to enhance the model - note that the package requires the item side information to have a column named _ItemId_ when you pass it to the API. If your data doesn't require any reindexing, you can also pass it as a numpy array and set the option reindex to False.

In [7]:
colnames=['ItemId','Title','ReleaseDate','Sep','Link']+['genre'+str(i) for i in range(19)]
genres=pd.read_table('D:\\Downloads\\movielens\\ml-100k\\ml-100k\\u.item',sep="|",engine='python',names=colnames)

# will save the movie titles for later
movie_id_to_title={i.ItemId:i.Title for i in genres.itertuples()}

genres=genres[['ItemId']+['genre'+str(i) for i in range(19)]]
genres.head()

Unnamed: 0,ItemId,genre0,genre1,genre2,genre3,genre4,genre5,genre6,genre7,genre8,genre9,genre10,genre11,genre12,genre13,genre14,genre15,genre16,genre17,genre18
0,1,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,4,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
4,5,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0


<a id="p22"></a>
### 2.2 Fitting and evaluating model with genres

These hyperparameters (number of factors and regularization) were also somewhat tuned beforehand:

In [8]:
# Number of latent factors
k=30
k_main=10
k_sec=10

# Regularization parameter
reg=1e-3

# Fitting the model
rec2=CMF(k=k, k_main=k_main, k_item=k_sec, w_main=2, reg_param=reg)
rec2.fit(train, genres, random_seed=10000)

# Making predictions
test['Predicted']=test.apply(lambda x: rec2.predict(x['UserId'],x['ItemId']),axis=1)

INFO:tensorflow:Optimization terminated with:
  Message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
  Objective function value: 4.007323
  Number of iterations: 182
  Number of functions evaluations: 205


RMSE now:

In [9]:
np.sqrt(np.mean((test.Predicted-test.Rating)**2))

1.3038670132348265

Same evaluation as before:

In [10]:
test2=pd.merge(test,avg_ratings,left_on='ItemId',right_index=True,how='left')

print('Average movie rating:',test2.groupby('UserId')['Rating'].mean().mean())
print('Average rating for top-5 rated by each user:',test2.sort_values(['UserId','Rating'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for bottom-5 rated by each user:',test2.sort_values(['UserId','Rating'],ascending=True).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for top-5 recommendations of best-rated movies:',test2.sort_values(['UserId','AvgRating'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('----------------------')
print('Average rating for top-5 recommendations from this model:',test2.sort_values(['UserId','Predicted'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for bottom-5 (non-)recommendations from this model:',test2.sort_values(['UserId','Predicted'],ascending=True).groupby('UserId')['Rating'].head(5).mean())

Average movie rating: 3.5602718818211856
Average rating for top-5 rated by each user: 4.5298621745788665
Average rating for bottom-5 rated by each user: 2.246554364471669
Average rating for top-5 recommendations of best-rated movies: 4.029096477794793
----------------------
Average rating for top-5 recommendations from this model: 4.027565084226646
Average rating for bottom-5 (non-)recommendations from this model: 3.119448698315467


Now we see a huge improvement in terms of RMSE (from 1.50 to 1.30), and a bit of an improvement in the ratings of the Top-5 (from 3.96 to 4.03) - it's not too large, but it's nevertheless an improvement.

Knowing these generic genres shouldn't be a complete game changer, so this is expected.

<a id="p3"></a>
## 3. Adding user demographic info
** *
The previous model can be extended to incorporate user information in the same way as it added movie genres:


$$ Loss\:(U, V, Z, P) = \frac{\lVert X - UV^T \rVert^2}{N_x} + \frac{\lVert M - VZ^T \rVert^2}{N_m} + \frac{\lVert Q - UP^T \rVert^2}{N_q} + \:\lambda\: (\lVert U \rVert^2 + \lVert V \rVert^2 + \lVert Z \rVert^2 + \lVert P \rVert^2) $$

Where $Q$ is the user attribute matrix and $P$ is the new attribute-factor matrix - same as before, some of the factors can be shared and some be specific to one factorization.

Intuitively, since in a typical setting there are usually more users than items (not in this particular example though), and each user has on average fewer rated movies than movies have users rating them, it would be logical to assume that detailed user information should be more valuable than detailed item information.

<a id="p31"></a>
### 3.1 Loading the user demographic info

The MovieLens-100k data also comes with user demographic information - same as before, the data frame passed to the package API should have a column named _UserId_:

In [11]:
user_info=pd.read_table('D:\\Downloads\\movielens\\ml-100k\\ml-100k\\u.user',sep="|",engine='python',
                        names=['UserId','Age','Gender','Occupation','Zipcode'])
user_info.head()

Unnamed: 0,UserId,Age,Gender,Occupation,Zipcode
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


This time, unfortunately, not all the information can be used as it is in the file. The zip code can still provide valuable information if we can link it to a broader geographical area. As these are mostly US users, I'll try to link it to US regions here.

In order to do so, I’m using a [publicly available table](http://federalgovernmentzipcodes.us/) mapping zip codes to states, [another one](http://www.fonz.net/blog/archives/2008/04/06/csv-of-states-and-state-abbreviations/) mapping state names to their abbreviations, and finally classifying the states into regions according to [usual definitions](https://www.infoplease.com/us/states/sizing-states).
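
The chain of lookups (zip code → state → region) can be sketched with two small dictionaries before doing it on the full tables. The dictionaries below are tiny hand-made excerpts for illustration, not the actual files, and `region_for_zip` is a hypothetical helper; the fallback labels mirror the ones used in the next cell.

```python
# Hand-made excerpts for illustration only (the real tables are far larger)
zip_to_state = {"85711": "AZ", "94043": "CA", "15213": "PA", "20001": "DC"}
state_to_region = {"AZ": "Southwest", "CA": "West", "PA": "Middle Atlantic"}

def region_for_zip(zipcode):
    """Map a zip code to a US region, with fallbacks for unmatched values."""
    state = zip_to_state.get(str(zipcode).strip())
    if state is None:
        # zip code not found in the table (e.g. a non-US postal code)
        return "UnknownOrNonUS"
    # found in the table, but the state is not in any of the listed regions
    return state_to_region.get(state, "UsOther")

regions = [region_for_zip(z) for z in ["85711", "94043", "15213", "20001", "T8H1N"]]
```

Keeping the two fallbacks separate makes it possible to tell apart users outside the US from US users in places the region list doesn't cover.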

<a id="p32"></a>
### 3.2 Getting user region from zip codes

In [12]:
import re

zipcode_abbs=pd.read_csv("D:\\Downloads\\movielens\\zips\\states.csv")
zipcode_abbs_dct={z.State:z.Abbreviation for z in zipcode_abbs.itertuples()}
us_regs_table=[
    ('New England', 'Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, Vermont'),
    ('Middle Atlantic', 'Delaware, Maryland, New Jersey, New York, Pennsylvania'),
    ('South', 'Alabama, Arkansas, Florida, Georgia, Kentucky, Louisiana, Mississippi, Missouri, North Carolina, South Carolina, Tennessee, Virginia, West Virginia'),
    ('Midwest', 'Illinois, Indiana, Iowa, Kansas, Michigan, Minnesota, Nebraska, North Dakota, Ohio, South Dakota, Wisconsin'),
    ('Southwest', 'Arizona, New Mexico, Oklahoma, Texas'),
    ('West', 'Alaska, California, Colorado, Hawaii, Idaho, Montana, Nevada, Oregon, Utah, Washington, Wyoming')
    ]
us_regs_table=[(x[0],[i.strip() for i in x[1].split(",")]) for x in us_regs_table]
us_regs_dct=dict()
for r in us_regs_table:
    for s in r[1]:
        us_regs_dct[zipcode_abbs_dct[s]]=r[0]
        
zipcode_info=pd.read_csv("D:\\Downloads\\movielens\\free-zipcode-database.csv")
zipcode_info=zipcode_info.groupby('Zipcode').first().reset_index()
# assign through a single .loc call (chained assignment may silently write to a copy)
zipcode_info.loc[zipcode_info.Country!="US",'State']='UnknownOrNonUS'
zipcode_info['Region']=zipcode_info['State'].copy()
zipcode_info.loc[zipcode_info.Country=="US",'Region']=zipcode_info.loc[zipcode_info.Country=="US",'Region'].map(lambda x: us_regs_dct[x] if x in us_regs_dct else 'UsOther')
zipcode_info=zipcode_info[['Zipcode', 'Region']]
zipcode_info.head()

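def process_zip(zp):
    # non-numeric zip codes become missing values
    # (np.int was removed from recent NumPy versions; the built-in int is equivalent)
    try:
        return int(zp)
    except (TypeError, ValueError):
        return np.nan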

user_info["Zipcode"]=user_info.Zipcode.map(process_zip)
user_info=pd.merge(user_info,zipcode_info,on='Zipcode',how='left')
user_info['Region']=user_info.Region.fillna('UnknownOrNonUS')

user_info=pd.get_dummies(user_info[['UserId','Age','Gender','Occupation','Region']])
users_w_side_info=set(list(user_info.UserId))
ratings=ratings.loc[ratings.UserId.map(lambda x: x in users_w_side_info)]

user_info.head()


Unnamed: 0,UserId,Age,Gender_F,Gender_M,Occupation_administrator,Occupation_artist,Occupation_doctor,Occupation_educator,Occupation_engineer,Occupation_entertainment,...,Occupation_technician,Occupation_writer,Region_Middle Atlantic,Region_Midwest,Region_New England,Region_South,Region_Southwest,Region_UnknownOrNonUS,Region_UsOther,Region_West
0,1,24,0,1,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
1,2,53,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,3,23,0,1,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
3,4,24,0,1,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
4,5,33,1,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


<a id="p33"></a>
### 3.3 Fitting and evaluating the full model

Adding explicit information gives the latent factors a more solid base, so fewer of them are needed the more side info there is available.

In [13]:
# Number of latent factors
k=30
k_main=5
k_genre=5
k_demo=5

# Regularization parameter
reg=1e-3

# This time I'll weight the ratings matrix higher
w_main=4

# Fitting the model
rec3=CMF(k=k, k_main=k_main, k_item=k_genre, k_user=k_demo, reg_param=reg, w_main=w_main)
rec3.fit(train, genres, user_info, random_seed=32545)

# Making predictions
test['Predicted']=test.apply(lambda x: rec3.predict(x['UserId'],x['ItemId']),axis=1)

INFO:tensorflow:Optimization terminated with:
  Message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
  Objective function value: 5.232346
  Number of iterations: 455
  Number of functions evaluations: 491


Same metrics as before:

In [14]:
np.sqrt(np.mean((test.Predicted-test.Rating)**2))

1.1927529497583915

In [15]:
test2=pd.merge(test,avg_ratings,left_on='ItemId',right_index=True,how='left')

print('Average movie rating:',test2.groupby('UserId')['Rating'].mean().mean())
print('Average rating for top-5 rated by each user:',test2.sort_values(['UserId','Rating'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for bottom-5 rated by each user:',test2.sort_values(['UserId','Rating'],ascending=True).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for top-5 recommendations of best-rated movies:',test2.sort_values(['UserId','AvgRating'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('----------------------')
print('Average rating for top-5 recommendations from this model:',test2.sort_values(['UserId','Predicted'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for bottom-5 (non-)recommendations from this model:',test2.sort_values(['UserId','Predicted'],ascending=True).groupby('UserId')['Rating'].head(5).mean())

Average movie rating: 3.5602718818211856
Average rating for top-5 rated by each user: 4.5298621745788665
Average rating for bottom-5 rated by each user: 2.246554364471669
Average rating for top-5 recommendations of best-rated movies: 4.029096477794793
----------------------
Average rating for top-5 recommendations from this model: 3.992343032159265
Average rating for bottom-5 (non-)recommendations from this model: 3.104134762633997


This time the improvement in RMSE is also large (from 1.30 to 1.19) after adding just the most basic demographic information, although the Top-5 recommendations don't seem to have improved (3.99 vs. 4.03 for the previous model). Nonetheless, this sample is rather small, and perhaps not very representative, as it spans a whole year of ratings of old movies.

<a id="p4"></a>
## 4. Comparing recommendations
** *

Now let's see what each of these models recommends to a randomly picked user, along with the overall item popularity:

In [16]:
# aggregate statistics
avg_movie_rating=train.groupby('ItemId')['Rating'].mean()
num_ratings_per_movie=train.groupby('ItemId')['Rating'].size()

# function to print recommended lists more nicely
def print_reclist(reclist):
    list_w_info=[str(m+1)+") - "+movie_id_to_title[reclist[m]]+\
        " - Average Rating: "+str(np.round(avg_movie_rating[reclist[m]],2))+\
        " - Number of ratings: "+str(num_ratings_per_movie[reclist[m]]) for m in range(len(reclist))]
    print("\n".join(list_w_info))

In [17]:
# user 943
reclist1=rec.top_n(UserId=943, n=20)
reclist2=rec2.top_n(UserId=943, n=20)
reclist3=rec3.top_n(UserId=943, n=20)

print('Recommendations from ratings-only model:')
print_reclist(reclist1)
print("------")
print('Recommendations from ratings + genre model:')
print_reclist(reclist2)
print("------")
print('Recommendations from ratings + genre + demographics model:')
print_reclist(reclist3)

Recommendations from ratings-only model:
1) - Schindler's List (1993) - Average Rating: 4.53 - Number of ratings: 167
2) - Full Monty, The (1997) - Average Rating: 4.06 - Number of ratings: 135
3) - Boot, Das (1981) - Average Rating: 4.22 - Number of ratings: 114
4) - Contact (1997) - Average Rating: 3.91 - Number of ratings: 245
5) - L.A. Confidential (1997) - Average Rating: 4.08 - Number of ratings: 128
6) - Sense and Sensibility (1995) - Average Rating: 4.18 - Number of ratings: 160
7) - Empire Strikes Back, The (1980) - Average Rating: 4.24 - Number of ratings: 201
8) - Air Force One (1997) - Average Rating: 3.71 - Number of ratings: 189
9) - English Patient, The (1996) - Average Rating: 3.76 - Number of ratings: 257
10) - Toy Story (1995) - Average Rating: 3.92 - Number of ratings: 267
11) - Braveheart (1995) - Average Rating: 4.22 - Number of ratings: 164
12) - Close Shave, A (1995) - Average Rating: 4.54 - Number of ratings: 68
13) - Casablanca (1942) - Average Rating: 4.46 - Number of ratings: 137
14) - Silence of th

<a id="p5"></a>
## 5. Larger example - MovieLens-1M and movie tag genome
** *

The example above was rather small, comprising only about 50k ratings evaluated on another 5k ratings, with not very informative side information about items and users. Fortunately, the latest MovieLens datasets also come with the so-called _tag genomes_, as described in [Vig, J., Sen, S., & Riedl, J. (2012). The tag genome: Encoding community knowledge to support novel interaction. ACM Transactions on Interactive Intelligent Systems (TiiS), 2(3), 13.](http://dl.acm.org/citation.cfm?id=2362395).

This time I'll fit a model to the [MovieLens-1M](https://grouplens.org/datasets/movielens/1m/) dataset using these very detailed movie tags as side information + the same user information. The MovieLens-1M itself doesn't come with the tag genome data, but it can be obtained by matching the movies by title to the newer [MovieLens-20M](https://grouplens.org/datasets/movielens/20m/) release. Unfortunately, this time the age comes only as a categorical feature rather than numeric as it was in the MovieLens-100k, and further releases such as the MovieLens-20M no longer include user demographic information.

<a id="p51"></a>
### 5.1 Loading the data


In [18]:
import numpy as np, pandas as pd, time, re
from datetime import datetime
from cmfrec import CMF

ratings=pd.read_table('D:\\Downloads\\movielens\\ml-1m\\ml-1m\\ratings.dat',sep='::',engine='python',names=['UserId','ItemId','Rating','Timestamp'])
ratings['Timestamp']=ratings.Timestamp.map(lambda x: datetime(*time.localtime(x)[:6])).map(lambda x: pd.to_datetime(x))
ratings=ratings.sort_values(['UserId','ItemId']).reset_index(drop=True)
ratings.head()

Unnamed: 0,UserId,ItemId,Rating,Timestamp
0,1,1,5,2001-01-07 01:37:48
1,1,48,5,2001-01-07 01:39:11
2,1,150,5,2001-01-01 00:29:37
3,1,260,4,2001-01-01 00:12:40
4,1,527,5,2001-01-07 01:36:35


Temporal train-test split - this time it's not necessary to discard so much data for the split:

In [19]:
time_cutoff='2002-01-01'
train=ratings.loc[ratings.Timestamp<=time_cutoff]
test=ratings.loc[ratings.Timestamp>time_cutoff]
users_train=set(list(train.UserId))
items_train=set(list(train.ItemId))
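# keep only test ratings whose user and movie both appear in the training set
# (.copy() so that adding a 'Predicted' column later doesn't write into a slice)
test=test.loc[test.UserId.isin(users_train) & test.ItemId.isin(items_train)].copy()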
print(train.shape)
print(test.shape)

(972814, 4)
(27103, 4)


Movie tags taken from a different dataset and joined by title:

In [20]:
movie_titles=pd.read_table('D:\\Downloads\\movielens\\ml-1m\\ml-1m\\movies.dat',sep='::',engine='python',header=None)
movie_titles.columns=['ItemId','title','genres']
movie_titles=movie_titles[['ItemId','title']]

# will save the movie titles for later
movie_id_to_title={i.ItemId:i.title for i in movie_titles.itertuples()}

movies=pd.read_csv('D:\\Downloads\\movielens\\ml-latest\\ml-latest\\movies.csv')
movies=movies[['movieId','title']]
movies=pd.merge(movies,movie_titles)
movies=movies[['movieId','ItemId']]

tags=pd.read_csv('D:\\Downloads\\movielens\\ml-latest\\ml-latest\\genome-scores.csv')
tags_wide=tags.pivot(index='movieId', columns='tagId', values='relevance')
tags_wide=tags_wide.fillna(0)
tags_wide.columns=["tag"+str(i) for i in tags_wide.columns.values]

movies=pd.merge(movies,tags_wide,how='inner',left_on='movieId',right_index=True)
del movies['movieId']
movies.head()

Unnamed: 0,ItemId,tag1,tag2,tag3,tag4,tag5,tag6,tag7,tag8,tag9,...,tag1119,tag1120,tag1121,tag1122,tag1123,tag1124,tag1125,tag1126,tag1127,tag1128
0,1,0.024,0.024,0.05475,0.092,0.14825,0.215,0.06625,0.27025,0.2605,...,0.0365,0.018,0.04525,0.03275,0.1245,0.04175,0.02,0.03475,0.0835,0.02525
1,2,0.038,0.04175,0.037,0.04875,0.11075,0.07325,0.0495,0.10775,0.102,...,0.039,0.01925,0.01725,0.02425,0.13425,0.02225,0.016,0.0145,0.096,0.02025
2,3,0.042,0.0525,0.02725,0.07975,0.05625,0.07025,0.05975,0.18275,0.05175,...,0.0395,0.02625,0.02725,0.0345,0.16925,0.03525,0.01725,0.01875,0.09925,0.02
3,4,0.036,0.0385,0.035,0.03125,0.071,0.045,0.02475,0.083,0.0515,...,0.05375,0.033,0.02275,0.04025,0.196,0.057,0.0155,0.01475,0.06625,0.014
4,5,0.04075,0.05125,0.058,0.03675,0.07575,0.12675,0.02975,0.08175,0.03075,...,0.04,0.0285,0.021,0.0265,0.15475,0.0205,0.017,0.01575,0.11275,0.01975
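
The reshaping done in the cell above (long tag scores → one column per tag, then matched to the rating IDs) can be checked on toy data. This is just a sketch with made-up values: the column names mirror the ones used above, but the numbers and the ID mapping are invented for illustration.

```python
import pandas as pd

# Long format: one row per (movie, tag) pair, as in the genome scores file
scores_long = pd.DataFrame({
    "movieId":   [10, 10, 20, 20, 30],
    "tagId":     [1, 2, 1, 2, 1],
    "relevance": [0.9, 0.1, 0.2, 0.8, 0.5],
})

# Wide format: one row per movie, one column per tag, missing pairs set to 0
wide = scores_long.pivot(index="movieId", columns="tagId", values="relevance").fillna(0)
wide.columns = ["tag" + str(c) for c in wide.columns]

# Made-up mapping between the IDs of the two datasets (as obtained by title match)
id_map = pd.DataFrame({"movieId": [10, 20, 40], "ItemId": [1, 2, 4]})

# Inner join: movies without tags (40) or without a title match (30) are dropped
side_info = (id_map
             .merge(wide, how="inner", left_on="movieId", right_index=True)
             .drop(columns="movieId"))
```

Note that the inner join silently drops movies missing from either side, so it's worth comparing the number of rows before and after to see how much of the catalog ends up with side information.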


Just like before, I'll also try to get the users' geographical region from the same zip code database:

In [21]:
zipcode_abbs=pd.read_csv("D:\\Downloads\\movielens\\zips\\states.csv")
zipcode_abbs_dct={z.State:z.Abbreviation for z in zipcode_abbs.itertuples()}
us_regs_table=[
    ('New England', 'Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, Vermont'),
    ('Middle Atlantic', 'Delaware, Maryland, New Jersey, New York, Pennsylvania'),
    ('South', 'Alabama, Arkansas, Florida, Georgia, Kentucky, Louisiana, Mississippi, Missouri, North Carolina, South Carolina, Tennessee, Virginia, West Virginia'),
    ('Midwest', 'Illinois, Indiana, Iowa, Kansas, Michigan, Minnesota, Nebraska, North Dakota, Ohio, South Dakota, Wisconsin'),
    ('Southwest', 'Arizona, New Mexico, Oklahoma, Texas'),
    ('West', 'Alaska, California, Colorado, Hawaii, Idaho, Montana, Nevada, Oregon, Utah, Washington, Wyoming')
    ]
us_regs_table=[(x[0],[i.strip() for i in x[1].split(",")]) for x in us_regs_table]
us_regs_dct=dict()
for r in us_regs_table:
    for s in r[1]:
        us_regs_dct[zipcode_abbs_dct[s]]=r[0]
        
zipcode_info=pd.read_csv("D:\\Downloads\\movielens\\free-zipcode-database.csv")
zipcode_info=zipcode_info.groupby('Zipcode').first().reset_index()
# assign through a single .loc call (chained assignment may silently write to a copy)
zipcode_info.loc[zipcode_info.Country!="US",'State']='UnknownOrNonUS'
zipcode_info['Region']=zipcode_info['State'].copy()
zipcode_info.loc[zipcode_info.Country=="US",'Region']=zipcode_info.loc[zipcode_info.Country=="US",'Region'].map(lambda x: us_regs_dct[x] if x in us_regs_dct else 'UsOther')
zipcode_info=zipcode_info[['Zipcode', 'Region']]


users=pd.read_table('D:\\Downloads\\movielens\\ml-1m\\ml-1m\\users.dat',sep='::',names=["UserId","Gender","Age","Occupation","Zipcode"], engine='python')
# strip the ZIP+4 suffix; the built-in int replaces np.int, which recent NumPy versions removed
users["Zipcode"]=users.Zipcode.map(lambda x: int(re.sub("-.*","",x)))
users=pd.merge(users,zipcode_info,on='Zipcode',how='left')
users['Region']=users.Region.fillna('UnknownOrNonUS')

users['Occupation']=users.Occupation.map(lambda x: str(x))
users['Age']=users.Age.map(lambda x: str(x))
users=pd.get_dummies(users[['UserId','Gender','Age','Occupation','Region']])
users.head()


Unnamed: 0,UserId,Gender_F,Gender_M,Age_1,Age_18,Age_25,Age_35,Age_45,Age_50,Age_56,...,Occupation_8,Occupation_9,Region_Middle Atlantic,Region_Midwest,Region_New England,Region_South,Region_Southwest,Region_UnknownOrNonUS,Region_UsOther,Region_West
0,1,1,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,2,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
2,3,0,1,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,4,0,1,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
4,5,0,1,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [22]:
items_w_side_info=set(list(movies.ItemId))
users_w_side_info=set(list(users.UserId))
# keep only test ratings for which both item and user side information is available
test=test.loc[test.ItemId.isin(items_w_side_info) & test.UserId.isin(users_w_side_info)].copy()

<a id="p52"></a>
### 5.2 Model with no side info

In [23]:
from cmfrec import CMF

rec1m_basic=CMF(k=50, reg_param=5e-5)
rec1m_basic.fit(train)
test['Predicted']=test.apply(lambda x: rec1m_basic.predict(x['UserId'],x['ItemId']),axis=1)
np.sqrt(np.mean((test.Predicted-test.Rating)**2))

INFO:tensorflow:Optimization terminated with:
  Message: b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
  Objective function value: 1.394558
  Number of iterations: 254
  Number of functions evaluations: 266


1.0324476780685568

In [24]:
avg_ratings=train.groupby('ItemId')['Rating'].mean().to_frame().rename(columns={"Rating":"AvgRating"})
test2=pd.merge(test,avg_ratings,left_on='ItemId',right_index=True,how='left')

print('Average movie rating:',test2.groupby('UserId')['Rating'].mean().mean())
print('Average rating for top-5 rated by each user:',test2.sort_values(['UserId','Rating'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for bottom-5 rated by each user:',test2.sort_values(['UserId','Rating'],ascending=True).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for top-5 recommendations of best-rated movies:',test2.sort_values(['UserId','AvgRating'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('----------------------')
print('Average rating for top-5 recommendations from this model:',test2.sort_values(['UserId','Predicted'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for bottom-5 (non-)recommendations from this model:',test2.sort_values(['UserId','Predicted'],ascending=True).groupby('UserId')['Rating'].head(5).mean())

Average movie rating: 3.568925405023363
Average rating for top-5 rated by each user: 4.471820258948972
Average rating for bottom-5 rated by each user: 2.423076923076923
Average rating for top-5 recommendations of best-rated movies: 3.9897182025894895
----------------------
Average rating for top-5 recommendations from this model: 4.036938309215537
Average rating for bottom-5 (non-)recommendations from this model: 3.0304645849200305


<a id="p53"></a>
### 5.3 Model with movie tags and user demographics

In [25]:
%%time
rec1m_basic=CMF(k=40, k_main=10, k_item=15, k_user=0, w_main=3, reg_param=1e-4)
rec1m_basic.fit(train, movies, users)
test['Predicted']=test.apply(lambda x: rec1m_basic.predict(x['UserId'],x['ItemId']),axis=1)

INFO:tensorflow:Optimization terminated with:
  Message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
  Objective function value: 3.240518
  Number of iterations: 568
  Number of functions evaluations: 610
Wall time: 8min 24s


In [26]:
np.sqrt(np.mean((test.Predicted-test.Rating)**2))

1.0203154742497973

In [27]:
test2=pd.merge(test,avg_ratings,left_on='ItemId',right_index=True,how='left')

print('Average movie rating:',test2.groupby('UserId')['Rating'].mean().mean())
print('Average rating for top-5 rated by each user:',test2.sort_values(['UserId','Rating'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for bottom-5 rated by each user:',test2.sort_values(['UserId','Rating'],ascending=True).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for top-5 recommendations of best-rated movies:',test2.sort_values(['UserId','AvgRating'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('----------------------')
print('Average rating for top-5 recommendations from this model:',test2.sort_values(['UserId','Predicted'],ascending=False).groupby('UserId')['Rating'].head(5).mean())
print('Average rating for bottom-5 (non-)recommendations from this model:',test2.sort_values(['UserId','Predicted'],ascending=True).groupby('UserId')['Rating'].head(5).mean())

Average movie rating: 3.568925405023363
Average rating for top-5 rated by each user: 4.471820258948972
Average rating for bottom-5 rated by each user: 2.423076923076923
Average rating for top-5 recommendations of best-rated movies: 3.9897182025894895
----------------------
Average rating for top-5 recommendations from this model: 4.025133282559025
Average rating for bottom-5 (non-)recommendations from this model: 3.0144706778370143
