## Recommender Systems - Following Grus's "Data Science from Scratch" (with some changes in the code)

## 1. User-based collaborative filtering

We have a set of users, each of whom lists their interests. For example:

In [1]:
users_interests = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],
    ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
    ["statistics", "probability", "mathematics", "theory"],
    ["machine learning", "scikit-learn", "Mahout", "neural networks"],
    ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
    ["Hadoop", "Java", "MapReduce", "Big Data"],
    ["statistics", "R", "statsmodels"],
    ["C++", "deep learning", "artificial intelligence", "probability"],
    ["pandas", "R", "Python"],
    ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
    ["libsvm", "regression", "support vector machines"]
]

Each user's list can be read as a row of a user/item matrix. Next, we need a measure of similarity between two users. We use cosine similarity, which is the dot product of two vectors divided by the product of their norms:

In [13]:
import numpy as np
def cosine_similarity(v, w):
    return np.dot(v,w)/(np.linalg.norm(v)*np.linalg.norm(w))

Each user is represented by an n-component vector. The dimension n of the vector space is the number of unique interests, so we first find the set of unique interests and assign one vector component to each:

In [141]:
# Sorted list of every distinct interest; its index defines the vector component.
unique_interests = sorted({interest for user in users_interests for interest in user})
n = len(unique_interests)
print(n)

36


Now, with the set of unique interests, we can define a function that converts each user's interest list into an n-dimensional (here 36-dimensional) vector of 0s and 1s, which we can pass to cosine_similarity:

In [142]:
def get_num_vector(user):
    out = [1 if interest in user else 0 for interest in unique_interests]
    return out

Now, we can calculate the similarity matrix between users:

In [143]:
# Encode each user once, then compare every pair of users.
user_vectors = [get_num_vector(user) for user in users_interests]
user_similarities = [[cosine_similarity(vector_i, vector_j)
                      for vector_j in user_vectors] for vector_i in user_vectors]
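
As an alternative to the nested list comprehension, the same matrix can be computed with a single matrix product after scaling each row to unit length. The following is only a sketch under that assumption: the three-user dataset is made up for brevity, and the names `toy_interests`, `vocab`, `M` and `similarity_matrix` are illustrative and do not come from the book.

```python
import numpy as np

# Hypothetical toy data (not from the book), kept small for readability.
toy_interests = [
    ["Python", "R"],
    ["Python", "Java"],
    ["Java"],
]

# One column per unique interest, in sorted order.
vocab = sorted({interest for user in toy_interests for interest in user})

# Rows are users, columns are interests; entries are 0.0 or 1.0.
M = np.array([[1.0 if interest in user else 0.0 for interest in vocab]
              for user in toy_interests])

def similarity_matrix(matrix):
    """Cosine similarity between all pairs of rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_rows = matrix / norms  # assumes no row is all zeros
    return np.dot(unit_rows, unit_rows.T)

toy_similarities = similarity_matrix(M)
print(np.round(toy_similarities, 3))
```

Users 0 and 1 share one of their two interests each, so their similarity is 0.5. Users 0 and 2 share nothing, so theirs is 0. The sketch assumes every user lists at least one interest; an empty list would give a zero norm and a division by zero.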

Now that we can quantify the similarity between users, we can use it to make recommendations. We start by defining a function that finds the users most similar to a given user (user_id), sorted from most to least similar, excluding the user itself and any user with zero similarity:

In [144]:
def most_similar_users_to(user_id):
    return sorted([(user, score) for (user, score) in enumerate(user_similarities[user_id])
           if user_id != user and score >0], key=lambda x: x[1], reverse=True)

In [145]:
most_similar_users_to(0)

[(9, 0.56694670951384085),
 (1, 0.33806170189140655),
 (8, 0.1889822365046136),
 (13, 0.16903085094570328),
 (5, 0.15430334996209191)]
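
The top score can be checked by hand. For vectors of 0s and 1s, the dot product is the number of shared interests and each norm is the square root of the number of interests, so cosine similarity reduces to a ratio of set sizes. The sketch below applies that shortcut to users 0 and 9; the helper name `binary_cosine` is an illustrative assumption, not part of the original code.

```python
from math import sqrt

# Interests of users 0 and 9, copied from users_interests above.
user_0 = {"Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"}
user_9 = {"Hadoop", "Java", "MapReduce", "Big Data"}

def binary_cosine(a, b):
    """Cosine similarity of two 0/1 vectors, given as sets of active items."""
    return len(a & b) / sqrt(len(a) * len(b))

score = binary_cosine(user_0, user_9)  # 3 shared / sqrt(7 * 4)
print(round(score, 4))                 # 0.5669
```

The two users share three interests (Hadoop, Java and Big Data) out of 7 and 4 respectively, which gives 3 / sqrt(28), in agreement with the score reported for user 9.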

With these similarities we can make recommendations. For each interest, we add up the similarity scores of the other users who list it, and rank the interests by this total:

In [146]:
def user_based_suggestions(user_id, include_current_interests = False):
    recommendations = {interest: 0 for interest in unique_interests}  #Create dict of interests/scores to be populated
    for other_user, score in most_similar_users_to(user_id):
        for interest in users_interests[other_user]:
            if include_current_interests:
                recommendations[interest] += score
            else:
                if interest not in users_interests[user_id]:
                    recommendations[interest] += score
    return sorted([(key, value) for key, value in recommendations.items()], key = lambda x:x[1], reverse=True)

For user 0 we get the following. Interests that user 0 already lists, or that no similar user lists, keep a score of 0:

In [148]:
user_based_suggestions(0)

[('MapReduce', 0.56694670951384085),
 ('Postgres', 0.50709255283710986),
 ('MongoDB', 0.50709255283710986),
 ('NoSQL', 0.33806170189140655),
 ('neural networks', 0.1889822365046136),
 ('deep learning', 0.1889822365046136),
 ('artificial intelligence', 0.1889822365046136),
 ('MySQL', 0.16903085094570328),
 ('databases', 0.16903085094570328),
 ('programming languages', 0.15430334996209191),
 ('Python', 0.15430334996209191),
 ('C++', 0.15430334996209191),
 ('R', 0.15430334996209191),
 ('Haskell', 0.15430334996209191),
 ('Java', 0),
 ('Hadoop', 0),
 ('Mahout', 0),
 ('Storm', 0),
 ('regression', 0),
 ('statistics', 0),
 ('scipy', 0),
 ('mathematics', 0),
 ('Spark', 0),
 ('numpy', 0),
 ('pandas', 0),
 ('theory', 0),
 ('libsvm', 0),
 ('probability', 0),
 ('HBase', 0),
 ('decision trees', 0),
 ('Big Data', 0),
 ('scikit-learn', 0),
 ('machine learning', 0),
 ('statsmodels', 0),
 ('support vector machines', 0),
 ('Cassandra', 0)]
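
To see how a single score is accumulated, and as one possible way to avoid the long tail of zero-score entries, the sketch below sums scores into a `defaultdict(float)`, so only interests that actually receive a score appear in the result. It is a simplified variant, not the function defined above: it uses only two of user 0's neighbours, with similarity scores rounded from the earlier output, and the names `neighbours`, `already_liked` and `ranked` are illustrative.

```python
from collections import defaultdict

# Two neighbours of user 0: (rounded similarity, interests copied from users_interests).
neighbours = [
    (0.338, ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"]),   # user 1
    (0.169, ["databases", "HBase", "Postgres", "MySQL", "MongoDB"]),   # user 13
]
already_liked = {"Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"}

# Interests start at 0.0 on first access, so nothing is pre-filled.
scores = defaultdict(float)
for similarity, interests in neighbours:
    for interest in interests:
        if interest not in already_liked:
            scores[interest] += similarity

ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

Postgres and MongoDB are listed by both neighbours, so each collects roughly 0.338 + 0.169 = 0.507, which matches the full output up to rounding. HBase and Cassandra are skipped because user 0 already lists them.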

## 2. Item-based collaborative filtering

As discussed by Grus, when the number of items is large the vector space becomes high-dimensional and the similarity between users becomes hard to quantify (the curse of dimensionality): even the users most similar to a given user may have little in common with them. An alternative is item-based collaborative filtering, which computes the similarity between items directly from the user/item matrix. We start by building the user/interest matrix (which we did not build explicitly before) and transposing it, so that each row corresponds to an interest and each column to a user:

In [149]:
user_interest_matrix = [get_num_vector(user) for user in users_interests]
interest_user_matrix = [[user_interest_vector[j] for user_interest_vector in user_interest_matrix] for j in range(len(unique_interests))]
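
The transpose can also be written with `zip`, which groups the elements of each row by position. A minimal sketch on a made-up 3-user, 2-interest matrix (the names `toy_user_interest` and `toy_interest_user` are illustrative):

```python
# Hypothetical matrix: rows are users, columns are interests.
toy_user_interest = [
    [1, 0],
    [1, 1],
    [0, 1],
]

# zip(*rows) yields the columns; each column becomes one row per interest.
toy_interest_user = [list(column) for column in zip(*toy_user_interest)]
print(toy_interest_user)  # [[1, 1, 0], [0, 1, 1]]
```

Row j of the result lists, for interest j, which users have it, which is the layout the item-based similarity computation needs.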

Now we proceed as before, using cosine similarity:

In [150]:
interest_similarities = [[cosine_similarity(user_vector_i, user_vector_j) for user_vector_i in interest_user_matrix] 
                         for user_vector_j in interest_user_matrix]

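We then reuse the pattern from the user-based case to find the interests most similar to a given interest. Because unique_interests is sorted, interest 0 is "Big Data":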

In [151]:
def most_similar_interests_to(interest_id):
    return sorted([(unique_interests[interest], score) for (interest, score) in enumerate(interest_similarities[interest_id])
           if interest_id != interest and score >0], key=lambda x: x[1], reverse=True)

In [152]:
most_similar_interests_to(0)

[('Hadoop', 0.81649658092772592),
 ('Java', 0.66666666666666674),
 ('MapReduce', 0.57735026918962584),
 ('Spark', 0.57735026918962584),
 ('Storm', 0.57735026918962584),
 ('Cassandra', 0.40824829046386296),
 ('artificial intelligence', 0.40824829046386296),
 ('deep learning', 0.40824829046386296),
 ('neural networks', 0.40824829046386296),
 ('HBase', 0.33333333333333337)]
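
These item scores can be reproduced by counting users. Each row of the interest/user matrix is a 0/1 vector over users, so the similarity of two interests is the number of users who list both, divided by the square root of the product of how many users list each. The sketch below hard-codes the user indices read off from users_interests; the names `users_with` and `item_cosine` are illustrative assumptions.

```python
from math import sqrt

# Indices (into users_interests) of the users who list each interest.
users_with = {
    "Big Data": {0, 8, 9},
    "Hadoop": {0, 9},
    "Java": {0, 5, 9},
}

def item_cosine(a, b):
    """Cosine similarity between two interests, from the sets of users listing them."""
    shared = len(users_with[a] & users_with[b])
    return shared / sqrt(len(users_with[a]) * len(users_with[b]))

print(round(item_cosine("Big Data", "Hadoop"), 4))  # 0.8165
print(round(item_cosine("Big Data", "Java"), 4))    # 0.6667
```

Both values agree with the first two entries returned for interest 0.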

Now, for every user we can build recommendations from these similarity scores by summing, for each candidate interest, its similarity to each of the user's own interests:

In [168]:
def item_based_suggestions(user_id, include_current_interests = False):
    recommendations = {interest: 0 for interest in unique_interests}  #Create dict of interests/scores to be populated
    for interests in users_interests[user_id]:
        for similar_interests, score in most_similar_interests_to(unique_interests.index(interests)):
            if include_current_interests:
                recommendations[similar_interests] += score
            else:
                if similar_interests not in users_interests[user_id]:
                    recommendations[similar_interests] += score
    return sorted([(key, value) for key, value in recommendations.items()], key = lambda x:x[1], reverse=True)

In [169]:
item_based_suggestions(0)

[('MapReduce', 1.8618073195657989),
 ('Postgres', 1.3164965809277258),
 ('MongoDB', 1.3164965809277258),
 ('NoSQL', 1.2844570503761732),
 ('MySQL', 0.57735026918962584),
 ('programming languages', 0.57735026918962584),
 ('Haskell', 0.57735026918962584),
 ('databases', 0.57735026918962584),
 ('neural networks', 0.40824829046386296),
 ('deep learning', 0.40824829046386296),
 ('artificial intelligence', 0.40824829046386296),
 ('C++', 0.40824829046386296),
 ('Python', 0.28867513459481292),
 ('R', 0.28867513459481292),
 ('Java', 0),
 ('Hadoop', 0),
 ('Mahout', 0),
 ('Storm', 0),
 ('regression', 0),
 ('statistics', 0),
 ('scipy', 0),
 ('mathematics', 0),
 ('Spark', 0),
 ('numpy', 0),
 ('pandas', 0),
 ('theory', 0),
 ('libsvm', 0),
 ('probability', 0),
 ('HBase', 0),
 ('decision trees', 0),
 ('Big Data', 0),
 ('scikit-learn', 0),
 ('machine learning', 0),
 ('statsmodels', 0),
 ('support vector machines', 0),
 ('Cassandra', 0)]