Weasyl Collaborative Filtering Example
-----------------------------------------

This notebook contains an example of user-to-user collaborative filtering on Weasyl. I did this successfully a few years ago, but not in a performant manner: it took tens of seconds to find recommendations for a single user.

This is an attempt to revisit the problem with more performant Python libraries.

Before running this notebook, you'll want to export all the favorites to a .csv file from within PostgreSQL as follows:
```
weasyl=# COPY (SELECT userid, targetid FROM favorite WHERE type='s') TO '/tmp/favorites.csv' DELIMITER ',' CSV HEADER;
COPY 4389453
```
or unzip a favorites.csv.gz that I provide.

You'll also need to install `numpy`, `pandas`, `scipy` and `ipython[notebook]` inside your virtualenv to run this code.

I consulted a few sources for this to see what other people have done. The most useful resource was http://blog.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/ which I've tried to modify here to use a sparse scipy matrix.

In [1]:
import time

import numpy as np
import pandas as pd
import scipy.sparse

Last time I did this, we used unary ratings (i.e. every submission was either favorited by a user or not). As such it made sense to use the [Jaccard Index](https://en.wikipedia.org/wiki/Jaccard_index) and treat everything in terms of sets.
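
For the unary case described above, a minimal sketch of the Jaccard index on two sets of favorited submission ids might look like this. The function name `jaccard_index` and the sample ids are made up for illustration; this is not the code from the earlier attempt.

```python
def jaccard_index(a, b):
    """Jaccard index of two sets: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # Convention chosen here for two empty sets (an assumption).
    return len(a & b) / float(len(union))

# Hypothetical favorites: each user is just a set of submission ids.
alice = {23, 24, 25, 29}
bob = {25, 29, 31}

print(jaccard_index(alice, bob))  # 2 shared ids out of 5 distinct ids
```

Because the index only looks at set membership, it has no way to tell a favorite from a like or a dislike.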

However, going forward we want to support favorites, likes, and dislikes, so we treat all favorites as a score of `2` (likes will eventually be `1` and dislikes `-1`).

In [2]:
df = pd.read_csv('favorites.csv')
df['rating'] = 2  # Everything is favorites currently
df.head()

Unnamed: 0,userid,targetid,rating
0,3,23,2
1,3,24,2
2,3,25,2
3,3,29,2
4,10,25,2


Now that we've created a dataframe, we want to build a sparse matrix from it. Our userids and targetids are not contiguous: the lowest userid that has favorited anything is 3, not every user has favorites, many favorites are anonymized when we export the db, and so on. So we'll map each id to a new contiguous index.

In [3]:
user_map = {x[1]: x[0] for x in enumerate(df.userid.unique())}
item_map = {x[1]: x[0] for x in enumerate(df.targetid.unique())}

# Show a few items
{k: user_map[k] for k in list(user_map)[:10]}

{3: 0,
 6: 3784,
 10: 1,
 12: 43,
 13: 1242,
 15: 3,
 17: 2,
 20: 158,
 21: 266,
 131083: 37171}
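
As an aside, pandas can build the same kind of id-to-index mapping in one call. The sketch below uses `pd.factorize` on a small made-up dataframe (the name `toy` and its ids are assumptions for illustration). Like the `enumerate(unique())` approach, it numbers ids in order of first appearance.

```python
import pandas as pd

# Toy stand-in for the favorites dataframe; the ids are made up.
toy = pd.DataFrame({'userid': [3, 3, 10, 17, 10],
                    'targetid': [23, 24, 25, 23, 29]})

# factorize returns (codes, uniques): codes[i] is the contiguous index for
# row i's id, and uniques[code] recovers the original id.
user_codes, user_ids = pd.factorize(toy['userid'])
item_codes, item_ids = pd.factorize(toy['targetid'])

print(user_codes.tolist())  # [0, 0, 1, 2, 1]
print(list(user_ids))       # [3, 10, 17]
```

The codes can be passed straight to the sparse matrix constructor as row and column indices, and `user_ids` / `item_ids` act as the reverse maps.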

Now construct our sparse matrix. This will take a while.

In [4]:
n_users = df.userid.unique().shape[0]
n_items = df.targetid.unique().shape[0]
assert n_users == len(user_map)
assert n_items == len(item_map)

print("{} users.".format(n_users))
print("{} items.".format(n_items))

38903 users.
857382 items.


In [5]:
ratings = scipy.sparse.csr_matrix((df['rating'], (df.userid.map(user_map), df.targetid.map(item_map))),
                                  shape=(n_users, n_items))
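
To make the `(data, (row, col))` constructor form used above concrete, here is a small sketch on made-up data (3 users, 4 items; all names and values are illustrative). It also shows one thing worth knowing: when the same (row, col) pair appears more than once, the values are summed, so a duplicated favorite in the CSV would end up with a rating of 4 rather than 2.

```python
import numpy as np
import scipy.sparse

# Made-up data: 3 users, 4 items, every rating is 2 (a favorite).
rows = np.array([0, 0, 1, 2, 2])   # user index of each favorite
cols = np.array([0, 1, 1, 2, 3])   # item index of each favorite
vals = np.full(len(rows), 2)

toy_ratings = scipy.sparse.csr_matrix((vals, (rows, cols)), shape=(3, 4))
print(toy_ratings.toarray())
print(toy_ratings.nnz)  # Number of stored (non-zero) entries.

# The same (row, col) pair given twice: the two values are added together.
dup = scipy.sparse.csr_matrix(([2, 2], ([0, 0], [0, 0])), shape=(1, 1))
print(dup[0, 0])
```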

The function below was adapted from the blog post linked above. See: http://blog.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/#Collaborative-filtering

I've made three changes:
 * I no longer use epsilon because scipy dies when I try to add a scalar to a sparse matrix. This shouldn't matter because all users in the matrix have at least one favorite, so no norm is zero.
 * I use `sim.diagonal()` instead of `np.diag()` because sparse matrices complain mightily when it comes to `np.diag()`.
 * Things die if we try to do regular division. Instead, we construct a diagonal matrix with the
   reciprocals of everything we would have divided by and both pre-multiply (to scale every row by the first user's norm) and post-multiply (to scale every column by the second user's norm).

In [6]:
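def fast_similarity(ratings, kind='user', epsilon=1e-9):
    # epsilon is unused here; it is kept only so the signature matches the
    # blog post, where it guards against divide-by-zero errors.
    if kind == 'user':
        sim = ratings.dot(ratings.T)
    elif kind == 'item':
        sim = ratings.T.dot(ratings)
    else:
        raise ValueError("kind must be 'user' or 'item'")
    # The diagonal holds each row's squared norm, so its square root is the norm.
    norms = np.sqrt(sim.diagonal())
    # Pre- and post-multiplying by diag(1/norms) divides entry (i, j) by norms[i] * norms[j].
    norms_sparse_diag = scipy.sparse.diags(1 / norms, format='csr')
    return norms_sparse_diag * sim * norms_sparse_diag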
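
As a sanity check on the diagonal-scaling trick, the sketch below compares it against a plain dense cosine similarity on a tiny made-up matrix. The variable names are illustrative, and the check assumes no row is all zeros (as is the case for the real data, where every user has at least one favorite).

```python
import numpy as np
import scipy.sparse

toy = scipy.sparse.csr_matrix(np.array([[2., 2., 0., 0.],
                                        [0., 2., 0., 0.],
                                        [0., 0., 2., 2.]]))

# Sparse route: Gram matrix, then scale rows and columns by 1 / norm.
gram = toy.dot(toy.T)
inv_norms = scipy.sparse.diags(1.0 / np.sqrt(gram.diagonal()), format='csr')
sparse_sim = inv_norms.dot(gram).dot(inv_norms).toarray()

# Dense reference: divide each dot product by the product of the two norms.
dense = toy.toarray()
norms = np.linalg.norm(dense, axis=1)
dense_sim = dense.dot(dense.T) / np.outer(norms, norms)

print(np.allclose(sparse_sim, dense_sim))
```

Users 0 and 1 share one item, so their similarity comes out as 1/sqrt(2); user 2 shares nothing with the others and gets 0.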

In [11]:
# Now calculate user-user similarities. This used to take quite a while but with the fixes above it's pretty fast.
before = time.time()
user_similarity = fast_similarity(ratings, kind='user')
print("User similarities calculated in {} seconds.".format(time.time() - before))

User similarities calculated in 2.99600410461 seconds.


We now have user-user similarities (i.e. the cosine similarity between all pairs of users). Use them to generate recommendations.

Here we will again use the code from the blog post above. However, we must adapt it to sparse matrices.

I've found that either due to a memory leak in my version of scipy (0.19 on Ubuntu 64-bit) or due to the number of non-zero elements in the full recommendation array, we can't calculate all the recommendations at once.

So for now we'll write a function that calculates recommendations for one user at a time, as needed. It uses only the k most similar users to make its suggestions.

In practice this usually creates better results, but we still have to run cross validation to confirm that.

In [13]:
rev_item_map = {v: k for k, v in item_map.items()}

def recs_for_user(userid, similarity, ratings, k=20, count=10):
    """
    Generates recommendations for a weasyl user.
    
    Args:
        userid (int): The weasyl userid.
        similarity (matrix): The user to user similarity matrix.
        ratings (matrix): The user x item rating matrix.
        k (int, optional): How many closest other users to use. Defaults to 20.
        count (int, optional): How many items to return for the user. Defaults to 10.
    
    Returns:
        An array of weasyl submission ids for the user.
    """
    if userid not in user_map:
        print("User has not favorited anything.")
        return []
    user_index = user_map[userid]
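    # Take the k + 1 most similar users. argpartition does not order them, so
    # we cannot assume that we ourselves sit in the last slot.
    sims = similarity[:, user_index].toarray().ravel()
    top_friends = np.argpartition(sims, -(k + 1))[-(k + 1):]
    top_friends = top_friends[top_friends != user_index][:k]  # Don't use ourselves for recommendations.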
    # TODO: Don't include columns for items we've rated ourselves.
    preds = similarity[user_index, top_friends].dot(ratings[top_friends, :])
    # TODO: Use argpartition to speed this up
    return [rev_item_map[x] for x in np.argsort(preds.toarray().ravel())[-count:]]
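
The "use argpartition" TODO above could be approached roughly as follows. This is a sketch, not the notebook's code: `top_n_indices` is a hypothetical helper, and unlike the `argsort(...)[-count:]` call above (which returns the best item last) it returns the best item first.

```python
import numpy as np

def top_n_indices(scores, n):
    """Indices of the n largest scores, best first."""
    scores = np.asarray(scores)
    n = min(n, scores.size)
    if n == 0:
        return np.array([], dtype=int)
    # argpartition puts the n largest values in the last n slots, unordered.
    candidates = np.argpartition(scores, -n)[-n:]
    # Only those n candidates need a full sort.
    return candidates[np.argsort(scores[candidates])[::-1]]

scores = np.array([0.1, 0.9, 0.0, 0.5, 0.7])
print(top_n_indices(scores, 3))  # [1 4 3]
```

With roughly 850,000 items and a `count` of 10 or 20, this sorts only a handful of values instead of the whole prediction vector.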

Now we can use this function to generate recommendations. Filter out things we've already favorited ourselves.
In the future we should do that before calculating recommendations.
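
One possible way to do that filtering up front is to mask the user's own items before ranking. The sketch below uses a made-up rating matrix and made-up prediction scores; it assumes the matrix is in CSR format and stores no explicit zeros, so that a row's `indices` are exactly the items the user has rated.

```python
import numpy as np
import scipy.sparse

toy_ratings = scipy.sparse.csr_matrix(np.array([[2, 2, 0, 0],
                                                [0, 2, 2, 0],
                                                [2, 0, 2, 2]]))
user_index = 0
# Hypothetical predicted scores for user 0 over the four items.
preds = np.array([1.4, 1.9, 1.1, 0.6])

# Column indices of the items this user has already rated.
seen = toy_ratings[user_index].indices
masked = preds.copy()
masked[seen] = -np.inf  # Seen items can never outrank unseen ones.

print(np.argsort(masked)[::-1][:2])  # [2 3]
```

Masking before taking the top `count` means the caller always gets `count` new items, rather than fewer after its own favorites are dropped.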

In [16]:
before = time.time()
recs = recs_for_user(2061, user_similarity, ratings, count=20)
print("Recommendations generated in {} seconds".format(time.time() - before))
for x in recs:
    if ratings[user_map[2061], item_map[x]]:
        continue  # Don't include our own favorites.
    print("https://www.weasyl.com/submission/{}".format(x))

Recommendations generated in 0.0863590240479 seconds
https://www.weasyl.com/submission/1114995
https://www.weasyl.com/submission/1364444
https://www.weasyl.com/submission/880203
https://www.weasyl.com/submission/883949
https://www.weasyl.com/submission/689866
https://www.weasyl.com/submission/1239404
https://www.weasyl.com/submission/882610
