> **Jupyter slideshow:** This notebook can be displayed as slides. To view it as a slideshow in your browser, type in the console:


> `> jupyter nbconvert [this_notebook.ipynb] --to slides --post serve`


> To toggle off the slideshow cell formatting, click the `CellToolbar` button, then `View --> Cell Toolbar --> None`

```python
import pandas as pd
import numpy as np
```

<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# Recommendation Engines

_Author: Alex Combs (NYC)_

---

<a id="learning-objectives"></a>
### Learning Objectives
*After this lesson, you will be able to:*
- Explain what a recommendation engine is
- Explain the math behind recommendation engines
- Explain the types of recommendation engines and their pros and cons

### Lesson Guide
- [What is a recommendation engine?](#what-is-a-recommendation-engine)
    - [Why bother?](#why-bother)
    - [Who uses recommendation systems?](#who-uses-recommendation-systems)
    - [Explicit data vs Implicit data](#explicit-data-vs-implicit-data)
    - [Two classical recommendation methods](#two-classical-recommendation-methods)


- [User-based Collaborative Filtering](#user-based-collaborative-filtering)
    - [So, let's see how we construct it](#so-lets-see-how-we-construct-it)
    - [Formula](#formula)
    - [Cosine similarity using scikit-learn](#cosine-similarity-using-sci-kit-learn)
    - [The problem with zero](#the-problem-with-zero)
    - [Exercise: Find the similarity between X and Y and X and Z for the following.](#exercise-find-the-similarity-between-x-and-y-and-x-and-z-for-the-following)


- [Item-based Collaborative Filtering](#item-based-collaborative-filtering)
    - [Exercise: Center the values by row and find the cosine similarity for each row vs. row 5 (S5)](#exercise-center-the-values-by-row-and-find-the-cosine-similarity-for-each-row-vs-row--s)


- [Content-based Filtering](#content-based-filtering)
    - [Example](#example)


- [Independent Exercise](#independent-exercise)
- [Conclusion](#conclusion)
- [Extra Practice](#extra-practice)
- [Additional Resources](#additional-resources)


<a id="what-is-a-recommendation-engine"></a>
## What is a recommendation engine?
---

At its most basic: A system designed to match users to things that they will like.

- The "things" can be products, brands, media, or even other people. 
- Ideally, they should be things the user doesn't know about. 
- **The goal is to rank all the possible things that are available to the user and to only present the top items**

<a id="why-bother"></a>
### Why bother?

- Between 1/4 and 1/3 of consumer choices at Amazon are driven by personalized recommendations
- Netflix says its recommendation engine reduces churn, saving the company in excess of $1 billion a year
- Hulu [has shown](http://tech.hulu.com/blog/2011/09/19/recommendation-system.html) that showing recommended TV shows results in over 3x more clicks than only showing the most popular TV shows.

<a id="who-uses-recommendation-systems"></a>
### Who uses recommendation systems?

- Netflix
- Pandora
- Hulu
- Tinder
- Facebook
- Barnes & Noble (receipts recommend other books)
- Target (sent directed ads based on motherhood predictions)

<a id="the-data-for-recommendations"></a>
### The data for recommendations


To make a prediction about what someone will like, we need to have data. 

<a id="explicit-data-vs-implicit-data"></a>
### Explicit data vs Implicit data

#### Explicit
- Explicitly given/proactively acquired
- Clear signals
- Cost associated with acquisition (time/cognitive)
- Limited and sparse data because of this


#### Implicit
- Provided/collected passively (digital exhaust)
- Signals can be difficult to interpret
- Enormous quantities

<a id="if-you-have-the-data-you-can-build-it"></a>
### If you have the data, you can build it....

But how?

<a id="two-classical-recommendation-methods"></a>
### Two classical recommendation methods

- **Similar people**
    - If you like the same 5 movies as someone else, you'll likely enjoy other movies they like.
    - There are two main types: (a) Find users who are similar and recommend what they like (**user-based**), or (b) recommend items that are similar to already-liked items (**item-based**).
   

- **Similar items**
    - If you enjoy certain characteristics of movies (e.g. certain actors, genre, etc.), you'll enjoy other movies with those characteristics.
    - Note this can easily be done using machine learning methods! Each movie can be decomposed into features. Then, for each user, we fit a model: either a binary classifier (e.g. "LIKE"/"DISLIKE") or a regression model (e.g. predicting a star rating).

- The first is called **Collaborative Filtering**
- The second is called **Content-based Filtering**

<a id="user-based-collaborative-filtering"></a>
## User-based Collaborative Filtering
---

We'll first look at user-based filtering. The idea behind this method is finding your taste **doppelgänger**. This is the person who is most similar to you based upon the ratings both of you have given to a mix of products.

<a id="so-lets-see-how-we-construct-it"></a>
## So, let's see how we construct it

We begin with what's called a utility matrix. This is a **user** (rows) x **product** (columns) matrix.
<img src="http://i.imgur.com/Ce838dV.png">

***Check:*** If we want to find the most similar users, what do we need?

If we want to find the users most similar to user A, we need a similarity metric.

One metric we can use is **cosine similarity**. Cosine similarity uses the cosine between two vectors to compute a scalar value that represents how closely related these vectors are. 

- Angle of $0^{\circ}$ (same direction): $\cos(0^{\circ}) = 1$. Perfectly similar.
- Angle of $90^{\circ}$ (orthogonal): $\cos(90^{\circ}) = 0$. Perfectly dissimilar.
- Angle of $180^{\circ}$ (opposite direction): $\cos(180^{\circ}) = -1$. Perfectly opposite.


Doesn't this sound a lot like the correlation coefficient? It turns out that cosine similarity is identical to the **uncentered correlation coefficient**! As a bonus, if the vectors are mean-centered first, the same formula gives the **Pearson correlation coefficient**.
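The link to Pearson correlation can be checked numerically. The sketch below assumes two fully observed rating vectors with made-up values (`u` and `v` are illustrative, not taken from the lesson's data) and compares the cosine of the mean-centered vectors against NumPy's `np.corrcoef`.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two hypothetical, fully observed rating vectors (illustrative values only)
u = np.array([4.0, 1.0, 5.0, 3.0, 5.0])
v = np.array([2.0, 3.0, 2.0, 1.0, 1.0])

uncentered = cosine(u, v)                       # plain cosine similarity
centered = cosine(u - u.mean(), v - v.mean())   # cosine after mean-centering
pearson = float(np.corrcoef(u, v)[0, 1])        # NumPy's Pearson correlation

print(uncentered, centered, pearson)
```

The second and third numbers should agree up to floating-point error, while the first one generally differs.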

<a id="formula"></a>
### Formula
You may be familiar with the Euclidean dot product formula from trigonometry:

$$\mathbf{A} \cdot \mathbf{B} = \|\mathbf{A}\|\|\mathbf{B}\|\cos{\theta}$$

Let's rewrite it a bit to give our new similarity measure:

$$\text{similarity} = \cos{\theta} = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\|\mathbf{B}\|}$$

Keep in mind that $\frac{\mathbf{A}}{\|\mathbf{A}\|}$ is the unit vector along the direction of $\mathbf{A}$.
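As a minimal sketch, the formula translates almost directly into NumPy. The helper name `cosine_sim` and the toy vectors are assumptions made for illustration; the function is undefined when either vector is all zeros, since the denominator would be 0.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine_sim([1, 2], [2, 4])          # same direction (0 degrees)
orthogonal = cosine_sim([1, 0], [0, 3])    # right angle (90 degrees)
opposite = cosine_sim([1, 2], [-1, -2])    # opposite direction (180 degrees)
print(same, orthogonal, opposite)
```

Up to floating-point error, the three values should come out as 1, 0, and -1, matching the three angles listed earlier.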

<a id="cosine-similarity-using-sci-kit-learn"></a>
## Cosine similarity using scikit-learn

With that, let's calculate the cosine similarity of A against all other users. We'll start with B. We have a sparse matrix so let's just fill in 0 for the missing values.

<a id="a-vs-b"></a>
### A vs B
```python
from sklearn.metrics.pairwise import cosine_similarity

# Ratings from users A and B, with missing ratings filled in as 0
a = np.array([4, 0, 5, 3, 5, 0, 0]).reshape(1, -1)
b = np.array([0, 4, 0, 4, 0, 5, 0]).reshape(1, -1)
cosine_similarity(a, b)
```
This gives us a cosine similarity of **0.1835**.

This is a low similarity, which makes sense since they have only one rated product in common.

Let's run it for user A and C now.

<a id="a-vs-c"></a>
### A vs C
```python
# Ratings from users A and C, again with missing ratings filled in as 0
a = np.array([4, 0, 5, 3, 5, 0, 0]).reshape(1, -1)
c = np.array([2, 0, 2, 0, 1, 0, 0]).reshape(1, -1)
cosine_similarity(a, c)
```

This gives us a cosine similarity of **0.8853**.

#### This indicates these users are very similar. But are they?

<a id="the-problem-with-zero"></a>
### The problem with zero

By imputing 0 for the missing values, we have indicated strong negative sentiment for the missing ratings and thus agreement where there is none. We should instead represent a missing rating with a neutral value. We can do this by **mean centering** the values at zero. Let's see how that works.

We add up all the ratings for user A and then divide by the number of ratings. In this case that is 17/4, or 4.25. We then subtract 4.25 from each of A's ratings. We do the same for all other users, subtracting each user's mean rating from each of their ratings. Missing ratings stay at 0, which now corresponds to the user's average rating. <br><br>That gives us the following table:

<img src="http://i.imgur.com/QuM7xsa.png">
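The centering step can be sketched in a few lines of NumPy. The helper name `mean_center` is hypothetical, and the sketch assumes that a 0 in the input marks a missing rating, as in the vectors used earlier.

```python
import numpy as np

def mean_center(ratings):
    """Subtract the mean of the rated entries; leave missing entries (0) at 0."""
    ratings = np.asarray(ratings, dtype=float)
    rated = ratings != 0                      # assumes 0 marks "no rating"
    centered = np.zeros_like(ratings)
    centered[rated] = ratings[rated] - ratings[rated].mean()
    return centered

# User A's ratings: the mean of the four rated entries is 17/4 = 4.25
A = mean_center([4, 0, 5, 3, 5, 0, 0])
print(A)
```

For user A this should give -0.25, 0, 0.75, -1.25, 0.75, 0, 0, which is the centered vector used in the comparisons that follow.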


<a id="a-vs-b"></a>
### A vs B
```python
# Mean-centered ratings for users A and B (values rounded to two decimals)
a = np.array([-.25, 0, .75, -1.25, .75, 0, 0]).reshape(1, -1)
b = np.array([0, -.33, 0, -.33, 0, .66, 0]).reshape(1, -1)
cosine_similarity(a, b)
```

The new figure for A vs B is **0.3077**.


<a id="a-vs-c"></a>
### A vs C
```python
# Mean-centered ratings for users A and C (values rounded to two decimals)
a = np.array([-.25, 0, .75, -1.25, .75, 0, 0]).reshape(1, -1)
c = np.array([.33, 0, .33, 0, -.66, 0, 0]).reshape(1, -1)
cosine_similarity(a, c)
```
The new figure for A vs C is **-0.246**.

<a id="so-what-happened"></a>
## So what happened?

So A and B became more similar and A and C moved further apart, which is what we'd hope to see. Centering has another benefit: easy and hard raters are put on the same basis.
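To illustrate the point about easy and hard raters, here is a small sketch with two hypothetical users (the ratings are made up for illustration). They rank three products the same way, but one rates everything two stars higher than the other.

```python
import numpy as np

# Hypothetical ratings of the same three products (illustrative values only)
easy_rater = np.array([5.0, 4.0, 5.0])   # rates everything high
hard_rater = np.array([3.0, 2.0, 3.0])   # same preferences, two stars lower

easy_centered = easy_rater - easy_rater.mean()
hard_centered = hard_rater - hard_rater.mean()

# After centering, only each rating's offset from the user's own mean remains
print(easy_centered, hard_centered)
print(np.allclose(easy_centered, hard_centered))
```

After centering, the two vectors coincide, so the difference in rating scale no longer affects the comparison.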

<a id="exercise-find-the-similarity-between-x-and-y-and-x-and-z-for-the-following"></a>
## Exercise: Find the similarity between X and Y and X and Z for the following.

| User | Snarky's Potato Chips | SoSo Smoth Lotion | Duffly Beer | BetterTap Water | XXLargeLivin' Football Jersey | Snowy Cotton Ballas | Disposos Diapers |
|:-:|---|---|---|---|---|---|---|
| X | | 4 | | 3 | | 4 | |
| Y | | 3.5 | | 2.5 | | 4 | 4 |
| Z | | 4 | | 3.5 | | 4.5 | 4.5 |

```python
from sklearn.metrics.pairwise import cosine_similarity

import numpy as np

# 1. Vectorize the data (one entry per product column; 0 marks a missing rating)
X = np.array([0.0, 4.0, 0.0, 3.0, 0.0, 4.0, 0.0])
Y = np.array([0.0, 3.5, 0.0, 2.5, 0.0, 4.0, 4.0])
Z = np.array([0.0, 4.0, 0.0, 3.5, 0.0, 4.5, 4.5])

# 2. Mean-center the rated entries only
X[X!=0] -= np.mean(X[X!=0])
Y[Y!=0] -= np.mean(Y[Y!=0])
Z[Z!=0] -= np.mean(Z[Z!=0])

# 3. Cosine similarity
print("similarity(X, Y) = " + str(cosine_similarity(X.reshape(1,-1), Y.reshape(1,-1))))
print("similarity(Y, Z) = " + str(cosine_similarity(Y.reshape(1,-1), Z.reshape(1,-1))))
print("similarity(X, Z) = " + str(cosine_similarity(X.reshape(1,-1), Z.reshape(1,-1))))
```

```
similarity(X, Y) = [[0.83333333]]
similarity(Y, Z) = [[0.98473193]]
similarity(X, Z) = [[0.73854895]]
```


<a id="but-how-do-we-predict-the-rating-of-an-item-for-a-user"></a>
## But how do we predict the rating of an item for a user?

| User |Snarky's Potato Chips	| SoSo Smoth Lotion	|Duffly Beer	|BetterTap Water |XXLargeLivin' Football Jersey	|Snowy Cotton Ballas	|Disposos Diapers|
| - |:-------:| --:| --- | --- | --- | --- | --- |
| X| &nbsp; |4  | &nbsp; | 3  | &nbsp; | 4  | &nbsp;  |
| Y| &nbsp; |3.5| &nbsp; | 2.5| &nbsp; | 4  | 4  |
| Z| &nbsp; |4  | &nbsp; | 3.5| &nbsp; | 4.5| 4.5|

What rating can we predict User X will give Disposos Diapers?

We'll find the expected rating for User X for Disposos Diapers using the similarity-weighted ratings of the two closest users, Y and Z. (We only have two other users here; normally we would choose a value of k and use the k most similar users.)

We multiply each neighbor's rating by that neighbor's similarity to X, add up the results, and divide by the sum of the similarities to arrive at our rating.

For k of 2:<br>
**(1st closest cosine sim x their product rating + 2nd closest cosine sim x their product rating) / (sum of 1st and 2nd's cosine sims)**

$$\frac{0.83333333 \cdot (4) + 0.73854895 \cdot (4.5)}{0.83333333 + 0.73854895} = 4.23$$
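The same calculation can be sketched as a small helper. The function name `predict_rating` is hypothetical, and the sketch assumes the similarities are positive, as they are here; negative or near-zero weights would need separate handling.

```python
import numpy as np

def predict_rating(similarities, ratings):
    """Similarity-weighted average of the neighbors' ratings."""
    similarities = np.asarray(similarities, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    return float(np.dot(similarities, ratings) / similarities.sum())

# Similarities of Y and Z to X, and their ratings of Disposos Diapers
prediction = predict_rating([0.83333333, 0.73854895], [4.0, 4.5])
print(round(prediction, 2))
```

Rounded to two decimals, this reproduces the 4.23 from the formula above.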

#### Check: What might be some problems with user-based filtering?

- Frequently-liked items will necessarily have users who like all kinds of other items. So, recommendations based on frequently-liked items may be inaccurate.

- User-based filtering also suffers from the **cold-start problem**. If a new user joins and has very few likes, then it is difficult to pair them with a similar user.

- Lastly, suppose that a user with few likes adds a new like. This may significantly change the recommendations. Hence, as users add likes, the recommendations must be continually and quickly updated.

In practice, there is a type of collaborative filtering that can perform much better than user-based filtering: **item-based filtering**.

<a id="item-based-collaborative-filtering"></a>
## Item-based Collaborative Filtering

Let's take a look at an example ratings table. Here we have songs on the left and users on the top.

<img src="http://i.imgur.com/JoBHXcG.png">

In item-based filtering, we are trying to find similarities across items rather than users.<br>
Just as in user-based filtering, we need to center our values by row.

<a id="exercise-center-the-values-by-row-and-find-the-cosine-similarity-for-each-row-vs-row--s"></a>
## Exercise: Center the values by row and find the cosine similarity for each row vs. row 5 (S5)

The nearest songs should have been S1 and S3. To calculate the rating for our target song, S5, for U3, using a k of 2, we have the following equation:

For k of 2:<br>
**(cosine sim of S1 x User 3's rating of S1 + cosine sim of S3 x User 3's rating of S3) / (sum of the two cosine sims)**

$$\frac{0.98 \cdot (4) + 0.72 \cdot (5)}{0.98 + 0.72} = 4.42$$

Therefore, based on this item-to-item collaborative filtering, U3 is likely to rate S5 very highly, at about 4.42.
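The steps above can be sketched end to end with NumPy. This is one possible approach, not a definitive implementation. It assumes the ratings pictured in the table are the same values as the utility matrix given in the Independent Exercise, that only songs the target user has already rated can serve as neighbors, and that a song whose centered row is all zeros gets a similarity of 0. The names `cosine_sim`, `target_song`, and `target_user` are illustrative.

```python
import numpy as np

# Utility matrix: rows are songs S1..S5, columns are users U1..U5; NaN = no rating.
# Assumption: these values match the ratings table pictured above.
ratings = np.array([
    [2, np.nan, 4, np.nan, 5],        # S1
    [np.nan, 3, np.nan, 3, np.nan],   # S2
    [1, np.nan, 5, np.nan, 4],        # S3
    [np.nan, 4, 4, 4, np.nan],        # S4
    [3, np.nan, np.nan, np.nan, 5],   # S5
], dtype=float)

# Center each row on the mean of its observed ratings, then fill gaps with 0
row_means = np.nanmean(ratings, axis=1, keepdims=True)
centered = np.nan_to_num(ratings - row_means, nan=0.0)

def cosine_sim(a, b):
    """Cosine similarity, defined here as 0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

target_song, target_user, k = 4, 2, 2   # S5, U3, two nearest songs (0-based)

# Similarity of S5 to every other song that the target user has rated
sims = {
    song: cosine_sim(centered[song], centered[target_song])
    for song in range(len(ratings))
    if song != target_song and not np.isnan(ratings[song, target_user])
}
nearest = sorted(sims, key=sims.get, reverse=True)[:k]

# Similarity-weighted average of the target user's ratings of the nearest songs
prediction = float(
    sum(sims[s] * ratings[s, target_user] for s in nearest)
    / sum(sims[s] for s in nearest)
)
print(nearest, round(prediction, 2))
```

Under these assumptions the nearest songs are rows 0 and 2 (S1 and S3), and the prediction rounds to the 4.42 computed above.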

<a id="content-based-filtering"></a>
## Content-based Filtering

Finally, there is another method called content-based filtering. In content-based filtering, the items are broken down into "feature baskets". These are the characteristics that represent the item. The idea is that if you like the features of song X, then finding a song that has similar characteristics will tell us that you're likely to like it as well.


The quintessential example of this is Pandora with its musical genome. Each song is rated on ~450 characteristics by a trained musicologist.

<a id="example"></a>
## Example 
Content-based filtering begins by mapping each item into a feature space. Both users and items are represented by vectors in this space. Item vectors measure the degree to which the item is described by each feature, and user vectors measure a user’s preferences for each feature. Ratings are generated by taking dot products of user and item vectors.

<img src="http://i.imgur.com/NzHksKK.png">
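A minimal sketch of the dot-product idea follows. The feature names, song names, and all numbers are hypothetical and are not taken from the figure; they only show how a user vector and item vectors combine into scores that can be ranked.

```python
import numpy as np

# Hypothetical feature space for songs: [acoustic, vocals, tempo], each in [0, 1]
item_vectors = {
    "song_a": np.array([0.9, 0.8, 0.2]),
    "song_b": np.array([0.1, 0.3, 0.9]),
}

# Hypothetical user vector: how much this user cares about each feature
user_vector = np.array([1.0, 0.5, 0.0])

# Predicted affinity is the dot product of the user and item vectors
scores = {name: float(np.dot(user_vector, vec)) for name, vec in item_vectors.items()}

# Rank items from highest to lowest score
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Because this user weights acoustic and vocal features and ignores tempo, `song_a` scores higher than `song_b`.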

<a id="independent-exercise"></a>
## Independent Exercise
---

Write a function that takes in a utility matrix with songs along the index and users along the columns, as in the item-to-item filtering example above. The function should accept a target user and a target song (as strings) and return a predicted rating for that pair.

Use the following as your utility matrix:

```python
df = pd.DataFrame({'U1': [2, None, 1, None, 3],
                   'U2': [None, 3, None, 4, None],
                   'U3': [4, None, 5, 4, None],
                   'U4': [None, 3, None, 4, None],
                   'U5': [5, None, 4, None, 5]},
                  index=['S1', 'S2', 'S3', 'S4', 'S5'])

df
```

| | U1 | U2 | U3 | U4 | U5 |
|---|---|---|---|---|---|
| S1 | 2.0 | NaN | 4.0 | NaN | 5.0 |
| S2 | NaN | 3.0 | NaN | 3.0 | NaN |
| S3 | 1.0 | NaN | 5.0 | NaN | 4.0 |
| S4 | NaN | 4.0 | 4.0 | 4.0 | NaN |
| S5 | 3.0 | NaN | NaN | NaN | 5.0 |


<a id="conclusion"></a>
## Conclusion
---

We have looked at the major types of recommender systems in this lesson. Let's quickly wrap up by looking at the pros and cons of each.

#### Collaborative Filtering 

Pros:
- No need to hand craft features

Cons:
- Needs a large existing set of ratings (cold-start problem)
- Sparsity occurs when the number of items far exceeds what a person could purchase

#### Content-based Filtering

Pros:
- No need for a large number of users

Cons:
- Lacks serendipity
- May be difficult to generate the right features
- Hard to create cross-content recommendations (different feature spaces)

In fact, the best solution, and the one most likely in use in any large-scale production system, is a combination of both of these. This is known as a **hybrid system**. By combining the two systems, you can get the best of both worlds.

<a id="extra-practice"></a>
## Extra Practice
---

Using the [MovieLens dataset](https://grouplens.org/datasets/movielens/100k/), experiment with building a recommender system. Check the "Additional Resources" for more information and some considerations on how to evaluate these systems.

<a id="additional-resources"></a>
## Additional Resources
---

- [Wharton Study of Recommender Systems](http://knowledge.wharton.upenn.edu/article/recommended-for-you-how-well-does-personalized-marketing-work/)
- [Netflix Recommendations](https://www.rtinsights.com/netflix-recommendations-machine-learning-algorithms/)
- [Netflix Paper](http://dl.acm.org/citation.cfm?id=2843948)
- [NY Times Rec System](https://open.blogs.nytimes.com/2015/08/11/building-the-next-new-york-times-recommendation-engine)
- [Evaluating Rec Systems](https://www.quora.com/How-do-you-measure-and-evaluate-the-quality-of-recommendation-engines)