# Chapter 22. Recommender Systems

In [117]:
from __future__ import division
import math, random
from collections import defaultdict, Counter
from linear_algebra import dot

Another common data challenge is producing [recommendations](https://en.wikipedia.org/wiki/Recommender_system) of some sort.  
Netflix recommends movies that you might want to watch, Amazon recommends products that you might want to buy, Twitter recommends followers, and so on.  
In this chapter, we'll examine a few different ways to use data to make recommendations.  
In particular, we'll look at the data set of `users_interests` that we've used before:

In [118]:
users_interests = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],
    ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
    ["statistics", "probability", "mathematics", "theory"],
    ["machine learning", "scikit-learn", "Mahout", "neural networks"],
    ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
    ["Hadoop", "Java", "MapReduce", "Big Data"],
    ["statistics", "R", "statsmodels"],
    ["C++", "deep learning", "artificial intelligence", "probability"],
    ["pandas", "R", "Python"],
    ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
    ["libsvm", "regression", "support vector machines"]
]

We'll use this data to address the problem of recommending new interests to a user based on her currently specified interests.

## Manual Curation

Given DataSciencester's limited number of users and interests, you could probably just spend an afternoon manually recommending interests for each user.  
However, this method doesn't scale very well, and it's limited by your personal knowledge and imagination.  
Instead, let's think about what we can do with our data.

## Recommending What's Popular

One easy approach is to simply recommend what's popular:

In [119]:
popular_interests = Counter(interest
                            for user_interests in users_interests
                            for interest in user_interests).most_common()

popular_interests

[('Python', 4),
 ('R', 4),
 ('Java', 3),
 ('regression', 3),
 ('statistics', 3),
 ('probability', 3),
 ('HBase', 3),
 ('Big Data', 3),
 ('neural networks', 2),
 ('Hadoop', 2),
 ('deep learning', 2),
 ('pandas', 2),
 ('artificial intelligence', 2),
 ('libsvm', 2),
 ('C++', 2),
 ('Postgres', 2),
 ('MongoDB', 2),
 ('scikit-learn', 2),
 ('machine learning', 2),
 ('statsmodels', 2),
 ('Cassandra', 2),
 ('NoSQL', 1),
 ('Mahout', 1),
 ('Storm', 1),
 ('MySQL', 1),
 ('programming languages', 1),
 ('Haskell', 1),
 ('mathematics', 1),
 ('Spark', 1),
 ('numpy', 1),
 ('theory', 1),
 ('decision trees', 1),
 ('MapReduce', 1),
 ('scipy', 1),
 ('databases', 1),
 ('support vector machines', 1)]

Having computed this, we can just suggest to a user the most popular interests that she hasn't already specified:

In [120]:
def most_popular_new_interests(user_interests, max_results=5):
    suggestions = [(interest, frequency)
                    for interest, frequency in popular_interests
                    if interest not in user_interests]
    return suggestions[:max_results]

So, if you are user 1, with interests `["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"]`, then you would be recommended:

In [121]:
most_popular_new_interests(users_interests[1], 5)

[('Python', 4), ('R', 4), ('Java', 3), ('regression', 3), ('statistics', 3)]

If you are user 3, who has already specified many of those interests listed above, you would instead be recommended:

In [122]:
most_popular_new_interests(users_interests[3], 5)

[('Java', 3),
 ('HBase', 3),
 ('Big Data', 3),
 ('neural networks', 2),
 ('Hadoop', 2)]

User 8, whose interests (`["neural networks", "deep learning", "Big Data", "artificial intelligence"]`) have nothing in common with user 1's, nonetheless gets the same 5 recommendations:

In [123]:
most_popular_new_interests(users_interests[8], 5)

[('Python', 4), ('R', 4), ('Java', 3), ('regression', 3), ('statistics', 3)]

This technique can be somewhat useful, but "lots of people are interested in Python so you should be too" is not the most compelling sales pitch.  
However, if someone is brand new to our site and we know nothing about them, that might be the best we can do.  
Let's see how we can do better by basing each user's recommendations on her particular interests.

## User-Based Collaborative Filtering

One way of taking a user's interests into account is to look for users who are somehow similar to her, and then suggest the things that those users are interested in.  
In order to do this, we'll need a way to measure how similar two users are.  
Here we'll use a metric called [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).  
Given two vectors, `v` and `w`, cosine similarity is defined as:

In [124]:
def cosine_similarity(v, w):
    return dot(v, w) / math.sqrt(dot(v, v) * dot(w, w))

Our function measures the cosine of the "angle" between `v` and `w`.  
If `v` and `w` point in the same direction, then the numerator and denominator are equal, and their cosine similarity equals 1.  
If `v` and `w` point in opposite directions, then their cosine similarity equals -1.  
If `v` is 0 whenever `w` is not (and vice versa) then `dot(v, w)` is 0 and so the cosine similarity will be 0.  
We'll apply `cosine_similarity()` to vectors of 0s and 1s, with each vector `v` representing one user's interests.  
`v[i]` will be 1 if the user has specified the $i$th interest, and 0 otherwise.  
Accordingly, "similar users" will mean "users whose interest vectors most nearly point in the same direction."  
Users with identical interests will have a similarity of 1, and users with no common interests will have similarity 0.  
Otherwise the similarity will fall in between, with numbers closer to 1 indicating "more similar" and numbers closer to 0 meaning "less similar."
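
To make those cases concrete, here is a minimal sketch that evaluates the function on a few hand-picked vectors. The `dot` defined here is a stand-in written for this example (an assumption about what `linear_algebra.dot` does: the sum of componentwise products), and the test vectors are made up for illustration.

```python
import math

def dot(v, w):
    """Stand-in for linear_algebra.dot: sum of componentwise products."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def cosine_similarity(v, w):
    return dot(v, w) / math.sqrt(dot(v, v) * dot(w, w))

# Same direction (w is just a scaled copy of v): similarity 1
same = cosine_similarity([1, 1, 0], [2, 2, 0])

# Opposite directions: similarity -1
opposite = cosine_similarity([1, 1, 0], [-1, -1, 0])

# v is 0 wherever w is not, and vice versa: similarity 0
disjoint = cosine_similarity([1, 1, 0, 0], [0, 0, 1, 1])

# 0/1 vectors sharing one of their two nonzero entries: 1 / sqrt(2 * 2) = 0.5
partial = cosine_similarity([1, 1, 0], [1, 0, 1])

print(same, opposite, disjoint, partial)
```

Note that vectors of 0s and 1s can never produce a negative dot product, which is why the similarities between users will always land between 0 and 1.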

A good place to begin is by collecting the known interests and implicitly assigning indices to them.  
We can do this by using a [set comprehension](https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Comprehensions.html#set-comprehensions) to find the unique interests, putting them in a list, and then sorting them.  
The first interest in the resulting list will be interest 0, and so on:

In [125]:
unique_interests = sorted(list({ interest 
                                 for user_interests in users_interests 
                                 for interest in user_interests}))

unique_interests

['Big Data',
 'C++',
 'Cassandra',
 'HBase',
 'Hadoop',
 'Haskell',
 'Java',
 'Mahout',
 'MapReduce',
 'MongoDB',
 'MySQL',
 'NoSQL',
 'Postgres',
 'Python',
 'R',
 'Spark',
 'Storm',
 'artificial intelligence',
 'databases',
 'decision trees',
 'deep learning',
 'libsvm',
 'machine learning',
 'mathematics',
 'neural networks',
 'numpy',
 'pandas',
 'probability',
 'programming languages',
 'regression',
 'scikit-learn',
 'scipy',
 'statistics',
 'statsmodels',
 'support vector machines',
 'theory']

Next we want to produce an "interest" vector of zeroes and ones for each user.  
We just need to iterate over the `unique_interests` list, substituting a 1 if the user has that interest, and a 0 if not:

In [126]:
def make_user_interest_vector(user_interests):
    """given a list of interests, produce a vector whose ith element is 1
    if unique_interests[i] is in the list, 0 otherwise"""
    return [1 if interest in user_interests else 0
            for interest in unique_interests]

After that, we can create a matrix of user interests simply by `map`-ping this function against the list of lists of interests:

In [127]:
# this will create a list of 15 vectors, one for each user
# each vector will have a length of 36, one entry per unique interest
# (list(...) keeps the result reusable where map returns an iterator)
user_interest_matrix = list(map(make_user_interest_vector, users_interests))
print(user_interest_matrix)

[[1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0,

Now `user_interest_matrix[i][j]` equals 1 if user $i$ specified interest $j$, 0 otherwise.
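
Since the full matrix is hard to read, a toy version may help. The sketch below uses three made-up users (`toy_interests` and the helper `make_vector` are hypothetical names for this example, not part of the chapter's code) and checks two properties that should also hold for the real matrix: each row sums to the number of interests that user listed, and each column sums to the number of users who share that interest.

```python
# Three hypothetical users, invented for this illustration.
toy_interests = [["Python", "R"],
                 ["R", "statistics"],
                 ["Python"]]

# Sorted unique interests; position in this list is the interest's index.
vocabulary = sorted({interest
                     for user_interests in toy_interests
                     for interest in user_interests})

def make_vector(user_interests, vocabulary):
    """1 where the user has the interest at that index, 0 otherwise."""
    return [1 if interest in user_interests else 0
            for interest in vocabulary]

toy_matrix = [make_vector(user_interests, vocabulary)
              for user_interests in toy_interests]

# Row sums: how many interests each user specified.
row_sums = [sum(row) for row in toy_matrix]

# Column sums: how many users specified each interest (its popularity).
column_sums = [sum(column) for column in zip(*toy_matrix)]

print(vocabulary)
print(toy_matrix)
print(row_sums, column_sums)
```

The column sums are the same counts that the popularity ranking earlier in the chapter was built from, just indexed by position instead of by name.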

Since we are using a small data set, it's relatively simple to compute the pairwise similarities between all of our users:

In [128]:
user_similarities = [[cosine_similarity(interest_vector_i, interest_vector_j)
                      for interest_vector_j in user_interest_matrix]
                      for interest_vector_i in user_interest_matrix]

print(user_similarities)

[[1.0, 0.3380617018914066, 0.0, 0.0, 0.0, 0.1543033499620919, 0.0, 0.0, 0.1889822365046136, 0.5669467095138409, 0.0, 0.0, 0.0, 0.1690308509457033, 0.0], [0.3380617018914066, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6, 0.0], [0.0, 0.0, 1.0, 0.18257418583505536, 0.0, 0.16666666666666666, 0.0, 0.20412414523193154, 0.0, 0.0, 0.23570226039551587, 0.0, 0.47140452079103173, 0.0, 0.0], [0.0, 0.0, 0.18257418583505536, 1.0, 0.22360679774997896, 0.3651483716701107, 0.4472135954999579, 0.0, 0.0, 0.0, 0.5163977794943222, 0.22360679774997896, 0.5163977794943222, 0.0, 0.2581988897471611], [0.0, 0.0, 0.0, 0.22360679774997896, 1.0, 0.0, 0.0, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5773502691896258], [0.1543033499620919, 0.0, 0.16666666666666666, 0.3651483716701107, 0.0, 1.0, 0.0, 0.0, 0.0, 0.20412414523193154, 0.23570226039551587, 0.20412414523193154, 0.47140452079103173, 0.0, 0.0], [0.0, 0.0, 0.0, 0.4472135954999579, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.2886751345948129, 0.25, 0.0, 0.0, 

Now `user_similarities[i][j]` gives us the similarity between users $i$ and $j$.
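
Because the interest vectors contain only 0s and 1s, `dot(v, w)` simply counts the interests two users share, and `dot(v, v)` counts how many interests a user has. The similarity can therefore be worked out by hand from the interest lists alone. The sketch below does this with sets; `binary_cosine_similarity` is a hypothetical helper introduced only for this check, and it assumes neither list is empty.

```python
import math

def binary_cosine_similarity(interests_a, interests_b):
    """Cosine similarity of two 0/1 vectors, computed from the item sets:
    (number shared) / sqrt(number in a * number in b)."""
    a, b = set(interests_a), set(interests_b)
    return len(a & b) / math.sqrt(len(a) * len(b))

# Interest lists copied from users_interests (users 0, 8, and 9).
user_0 = ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"]
user_8 = ["neural networks", "deep learning", "Big Data", "artificial intelligence"]
user_9 = ["Hadoop", "Java", "MapReduce", "Big Data"]

sim_0_9 = binary_cosine_similarity(user_0, user_9)  # 3 shared: 3 / sqrt(7 * 4)
sim_0_8 = binary_cosine_similarity(user_0, user_8)  # 1 shared: 1 / sqrt(7 * 4)

print(sim_0_9, sim_0_8)
```

These agree with the corresponding entries of `user_similarities` printed above.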

For example, `user_similarities[0][9]` is 0.57 because those two users share interests in Hadoop, Java, and Big Data.  
On the other hand, `user_similarities[0][8]` is only 0.19 because those two users only share an interest in Big Data.  
In particular, `user_similarities[i]` is the vector of user $i$'s similarities to every other user.  
We can use this to write a function that finds the most similar users to a given user.  
While doing so, we'll make sure *not* to include the user herself, nor any users with zero similarity.  
The results will be sorted from most similar to least similar:

In [129]:
def most_similar_users_to(user_id):
    # find other users with nonzero similarity
    pairs = [(other_user_id, similarity)
              for other_user_id, similarity in enumerate(user_similarities[user_id])
              if user_id != other_user_id and similarity > 0]
    # sort them, most similar first
    return sorted(pairs,
                  key=lambda pair: pair[1],  # sort on the similarity
                  reverse=True)

most_similar_users_to(0)

[(9, 0.5669467095138409),
 (1, 0.3380617018914066),
 (8, 0.1889822365046136),
 (13, 0.1690308509457033),
 (5, 0.1543033499620919)]

Now how can we use this to suggest new interests to a user?  
For each interest, we can just add up the similarities (taken from `user_similarities`) of the other users who have specified that interest:

In [130]:
def user_based_suggestions(user_id, include_current_interests=False):
    # sum up the similarities
    suggestions = defaultdict(float)
    for other_user_id, similarity in most_similar_users_to(user_id):
        for interest in users_interests[other_user_id]:
            suggestions[interest] += similarity
    # convert those suggestions to a sorted list 
    suggestions = sorted(suggestions.items(),
                         key=lambda pair: pair[1],  # sort on the weight
                         reverse=True)
    # and (if needed) exclude existing interests
    if include_current_interests:
        return suggestions
    else:
        return [(suggestion, weight)
                 for suggestion, weight in suggestions
                 if suggestion not in users_interests[user_id]]
    
user_based_suggestions(0)

[('MapReduce', 0.5669467095138409),
 ('MongoDB', 0.50709255283711),
 ('Postgres', 0.50709255283711),
 ('NoSQL', 0.3380617018914066),
 ('neural networks', 0.1889822365046136),
 ('deep learning', 0.1889822365046136),
 ('artificial intelligence', 0.1889822365046136),
 ('databases', 0.1690308509457033),
 ('MySQL', 0.1690308509457033),
 ('programming languages', 0.1543033499620919),
 ('Python', 0.1543033499620919),
 ('Haskell', 0.1543033499620919),
 ('C++', 0.1543033499620919),
 ('R', 0.1543033499620919)]
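
As a sanity check on where these weights come from, the sketch below reproduces two of them by hand. Among the users similar to user 0, only users 1 and 13 list MongoDB, so only those two are included; the similarity values and interest lists are copied from the outputs above, and the variable names are made up for this example.

```python
# Similarities to user 0, copied from the most_similar_users_to(0) output.
similarity_to_user_0 = {1: 0.3380617018914066,
                        13: 0.1690308509457033}

# Interests of those two users, copied from users_interests.
interests = {
    1: ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    13: ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
}

def weight_for(interest):
    """Sum the similarities of the listed users who have this interest."""
    return sum(similarity
               for user_id, similarity in similarity_to_user_0.items()
               if interest in interests[user_id])

mongodb_weight = weight_for("MongoDB")  # both users contribute
nosql_weight = weight_for("NoSQL")      # only user 1 contributes

print(mongodb_weight, nosql_weight)
```

MongoDB outranks NoSQL because two moderately similar users both vouch for it, even though neither of them is user 0's closest match.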

These seem like pretty decent suggestions for someone whose stated interests are "Big Data" and database-related topics.  
The weights aren't intrinsically meaningful; we just use them for comparison and ordering.  

This approach doesn't work as well when the number of items (interests, in our case) grows very large.  
Recall the curse of dimensionality from Chapter 12: in high-dimensional vector spaces most vectors are very far apart and therefore point in very different directions. As a result, when there are many interests, the "most similar users" to a given user may not be very similar at all.  
Imagine a site like Amazon.com and a customer named Jeff who has bought thousands of items over the course of the past couple of decades.  
Amazon's recommender system could try to find similar users to Jeff based on buying patterns, but most likely there is no one else in the world who has an even remotely similar purchase history to Jeff.  
Whoever Jeff's "most similar" shopper is, he probably isn't very similar at all and would almost certainly make for lousy recommendations.  
Fortunately, there are other approaches.
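
The sparsity problem described above can be illustrated with a rough simulation. This is only a sketch under strong assumptions: the users are synthetic, each one picks the same number of items uniformly at random (real purchase data is nothing like uniform), and the function name and parameters are invented for this example. It measures, on average, how similar each user's best match is as the catalog of items grows.

```python
import math
import random

def binary_cosine(a, b):
    """Cosine similarity of two 0/1 vectors represented as sets of item ids."""
    return len(a & b) / math.sqrt(len(a) * len(b))

def average_best_similarity(num_items, num_users=200, items_per_user=5, seed=0):
    """Average, over all users, of the similarity to their most similar user."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    users = [frozenset(rng.sample(range(num_items), items_per_user))
             for _ in range(num_users)]
    best = [max(binary_cosine(user, other)
                for j, other in enumerate(users) if j != i)
            for i, user in enumerate(users)]
    return sum(best) / len(best)

small_catalog = average_best_similarity(num_items=20)
large_catalog = average_best_similarity(num_items=20000)

print(small_catalog, large_catalog)
```

With a small catalog nearly everyone has a close neighbor, while with a large one the typical "most similar" user shares little or nothing, which is the situation Jeff finds himself in.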

## Item-Based Collaborative Filtering