# Collaborative filtering

**I like what you like**

We are going to start our exploration of <span style="color:red">recommendation systems</span>.   

Recommendation systems are everywhere, from `Amazon`: 

<img src="figures/00-chapter2-amazon.png" width="50%">

to `last.fm` recommending music or concerts:

<img src="figures/00-chapter2-last.fm.png" width="50%">

## Amazon example
Amazon combines two bits of information to make a recommendation. 
* The first is that I viewed ``The Lotus Sutra`` translated by Gene Reeves; 
* the second, that customers who viewed that translation of the Lotus Sutra also viewed several other translations. 

The **recommendation method** we are looking at is called **collaborative filtering**. 
* It's called <font style="color:red">collaborative</font> because it makes recommendations **based on other people**
* It searches among other users to find one who is similar to me
* Once it finds that similar person, it can see what she likes and recommend those items to me

## How do I find someone who is similar?

A simple 2D explanation
* Suppose **users** rate **books** on a 5 star system
* We restrict our ratings to **two books** (2D case): 
  - Neal Stephenson's *Snow Crash* 
  - Stieg Larsson's *The Girl with the Dragon Tattoo*.

First, here's a table showing **3 users** who rated these books:


<img src="figures/00-chapter2-table-rating.png" width="60%">

<img src="figures/00-chapter2-similarity-2D.png" width="40%">

## Find the most similar person 
I would like to **recommend a book** to the mysterious **Ms. X** who rated 
* Snow Crash 4 stars 
* The Girl with the Dragon Tattoo 2 stars. 

To make a recommendation, I first need to find the person who is **most similar, or closest**, to Ms. X.  
I do this by computing <font style="color:red">distance</font>.

### Manhattan Distance
The easiest distance measure to compute is what is called Manhattan Distance.

The distance $d_{1}$ between two vectors $\mathbf {p}$ and $\mathbf {q}$ in an $n$-dimensional real vector space is
$$d_{1}(\mathbf {p} ,\mathbf {q} )=\|\mathbf {p} -\mathbf {q} \|_{1}=\sum _{i=1}^{n}|p_{i}-q_{i}|$$
where $\mathbf {p} =(p_{1},p_{2},\dots ,p_{n})$ and $\mathbf {q} =(q_{1},q_{2},\dots ,q_{n})$.

In the 2D case
* $(x_1, y_1)$ might be Amy 
* $(x_2, y_2)$ might be the elusive Ms. X. 

 Manhattan Distance is then calculated by
 $$|x_1 - x_2| + |y_1 - y_2|$$

So the Manhattan Distance for **Amy** and **Ms. X** is 4:

<img src="figures/00-chapter2-similarity-2D-2.png" width="40%">

Computing the distance between Ms. X and all three people gives us:


|Person |Distance from Ms. X|
|----|---|
|Amy| 4|
|Bill| 5|
|Jim| 5|
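
The 2D calculation can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `manhattan_2d` and the variable names are hypothetical. Ms. X's point (4, 2) comes from her ratings above, while Amy's point (5, 5) is an assumption inferred from the distances quoted in this section. Bill's and Jim's ratings appear only in the figure, so they are left out.

```python
def manhattan_2d(p, q):
    """Manhattan distance between two 2D points given as (x, y) tuples."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# Each point is (Snow Crash rating, The Girl with the Dragon Tattoo rating)
ms_x = (4, 2)
amy = (5, 5)  # assumed: inferred from the distances quoted in the text

print(manhattan_2d(amy, ms_x))  # 4
```

The result matches the first row of the table: |5 - 4| + |5 - 2| = 1 + 3 = 4.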

### The closest match
**Amy** is the closest match. 
* We can **look in her history** and see, for example, that 
  * she gave five stars to Paolo Bacigalupi's *The Windup Girl* 
* We would **recommend** that book to Ms. X.

* One benefit of **Manhattan Distance** is that it is **fast** to compute. 
* If we are Facebook and are trying to find the most similar person among one million users, **fast is good**.

### Euclidean Distance

If $\mathbf{p} = (p_1, p_2,..., p_n)$ and $\mathbf{q} = (q_1, q_2,..., q_n)$ are two points in Euclidean $n$-space, then the distance ($d_2$) from $\mathbf{p}$ to $\mathbf{q}$, or from $\mathbf{q}$ to $\mathbf{p}$ is given by the <font style="color:red">Pythagorean formula</font>:


$$
d_2(\mathbf{q} ,\mathbf{p} ) = \sqrt{(q_{1}-p_{1})^{2}+(q_{2}-p_{2})^{2}+\cdots +(q_{n}-p_{n})^{2}} ={\sqrt {\sum _{i=1}^{n}(q_{i}-p_{i})^{2}}}.
$$

### Pythagorean Theorem

<img src="figures/00-chapter2-similarity-2D-pitagora.png" width="40%">

### Euclidean distance between Ms. X and Amy
$$\sqrt{(5 - 2)^2 + (5 - 4)^2} = \sqrt{3^2 + 1^2} = \sqrt{10} \approx 3.16$$

Computing the rest of the distances, we get:

| Person |Distance from Ms. X|
|----|---|
|Amy| 3.16|
|Bill| 3.61|
|Jim| 3.61|
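
The same 2D example can be checked with a short Python sketch. The function name `euclidean_2d` is hypothetical, and Amy's point (5, 5) is an assumption inferred from the calculation above. Ms. X's point (4, 2) comes from her ratings.

```python
import math

def euclidean_2d(p, q):
    """Euclidean distance between two 2D points given as (x, y) tuples."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Each point is (Snow Crash rating, The Girl with the Dragon Tattoo rating)
ms_x = (4, 2)
amy = (5, 5)  # assumed: inferred from the calculation in the text

print(round(euclidean_2d(amy, ms_x), 2))  # 3.16
```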

## N-dimensional thinking

Suppose we work for an online streaming music service and we want to make the experience more compelling by **recommending bands**.  
Let's say users can rate bands on a scale of 0 to 5 stars and can give **half star ratings** (for example, you can give a band 2.5 stars).  
The following chart shows **8 users** and their ratings of **8 bands**.

| Band | Angelica| Bill | Chan | Dan | Hailey | Jordyn | Sam | Veronica |
|----|---|---|---|---|---|---|---|---|
| Blues Traveler | 3.5 |2| 5| 3| - |-| 5 |3|
| Broken Bells | 2| 3.5| 1| 4 |4 |4.5| 2 |-|
| Deadmau5 | - |4 |1| 4.5 |1 |4 |-| -|
| Norah Jones | 4.5| - |3 |-| 4| 5| 3| 5|
| Phoenix | 5 | 2 | 5 | 3| -| 5| 5| 4|
| Slightly Stoopid | 1.5| 3.5| 1| 4.5| -| 4.5| 4 |2.5|
| The Strokes | 2.5 |- |- |4 |4 |4 |5| 3|
| Vampire Weekend | 2| 3| -| 2| 1| 4| -| -|

### The Manhattan Distance

|Band|Angelica| Bill| Difference|
|----|---|---|---|
|Blues Traveler| 3.5| 2| 1.5|
|Broken Bells| 2| 3.5| 1.5|
|Deadmau5| - | 4 | - |
|Norah Jones| 4.5| -| -|
|Phoenix |5| 2| 3|
|Slightly Stoopid |1.5| 3.5| 2|
|The Strokes |2.5 |-| -|
|Vampire Weekend |2 |3| 1|
|**Manhattan Distance**:|- | -|9|

### The Euclidean Distance

|Band|Angelica| Bill| Difference|Difference$^2$|
|----|---|---|---|---|
|Blues Traveler| 3.5| 2| 1.5| 2.25|
|Broken Bells| 2| 3.5| 1.5| 2.25|
|Deadmau5| - | 4 | - | - |
|Norah Jones| 4.5| -|-|- |
|Phoenix |5| 2| 3| 9|
|Slightly Stoopid |1.5| 3.5| 2|4|
|The Strokes |2.5 |-| -|-|
|Vampire Weekend |2 |3| 1|1|
|Sum of squares|-|-|-|18.5|
|**Euclidean Distance**:|-|- |-|4.3|

To parse that out a bit more:
$$
\begin{aligned}
\text{Euclidean} &= \sqrt{(3.5-2)^2+(2-3.5)^2+(5-2)^2+(1.5-3.5)^2+(2-3)^2} \\
&= \sqrt{1.5^2+(-1.5)^2+3^2+(-2)^2+(-1)^2} \\
&= \sqrt{2.25+2.25+9+4+1} \\
&= \sqrt{18.5} \\
&\approx 4.3
\end{aligned}
$$
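
One possible way to express this calculation in Python is sketched below. The function name `euclidean` and the choice to return `None` when two users share no bands are assumptions made for this sketch. The two dictionaries copy Angelica's and Bill's ratings from the chart above, and only bands rated by both users enter the sum.

```python
import math

def euclidean(rating1, rating2):
    """Euclidean distance over the bands that both users rated.

    Returns None when the users have no bands in common
    (a choice made for this sketch).
    """
    common = [band for band in rating1 if band in rating2]
    if not common:
        return None
    return math.sqrt(sum((rating1[b] - rating2[b]) ** 2 for b in common))

angelica = {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5,
            "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5,
            "Vampire Weekend": 2.0}
bill = {"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0,
        "Phoenix": 2.0, "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0}

print(round(euclidean(angelica, bill), 1))  # 4.3
```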

### A flaw: missing values

It looks like we discovered a flaw with using these distance measures.  
* Computing the distance between **Hailey** and **Veronica**, we noticed they **only rated two bands in common** (Norah Jones and The Strokes), 
* whereas when we computed the distance between **Hailey** and **Jordyn**, we noticed they rated **five bands in common**. 
* This seems to skew our distance measurement, since the Hailey-Veronica distance is in 2 dimensions while the Hailey-Jordyn distance is in 5 dimensions. 

* Manhattan Distance and Euclidean Distance work best when there are **no missing values**. 
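
To see how many dimensions each comparison actually uses, we can list the bands two users have in common. This is a small sketch with a hypothetical helper name, `common_bands`. The dictionaries copy Hailey's, Veronica's, and Jordyn's ratings from the chart above.

```python
def common_bands(rating1, rating2):
    """Return the sorted list of bands that both users rated."""
    return sorted(set(rating1) & set(rating2))

hailey = {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0,
          "The Strokes": 4.0, "Vampire Weekend": 1.0}
veronica = {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0,
            "Slightly Stoopid": 2.5, "The Strokes": 3.0}
jordyn = {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0,
          "Phoenix": 5.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0,
          "Vampire Weekend": 4.0}

print(common_bands(hailey, veronica))     # ['Norah Jones', 'The Strokes']
print(len(common_bands(hailey, jordyn)))  # 5
```

Because a distance summed over two bands has fewer terms than one summed over five, the two numbers are not directly comparable.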

## A generalization of distance

We can generalize Manhattan Distance and Euclidean Distance to what is called the 
**Minkowski Distance** Metric:
$$d(x, y) = \left(\sum_{k=1}^n | x_k - y_k |^r\right)^{1/r}$$

When
- r = 1, the formula is the Manhattan Distance.
- r = 2, the formula is the Euclidean Distance.
- r = ∞, the formula is the Supremum Distance, the largest single difference $|x_k - y_k|$.
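
A minimal sketch of the Minkowski distance for rating dictionaries might look like the following. The function name `minkowski`, the handling of `r = ∞` through `math.inf`, and the `None` return value for users with no bands in common are all assumptions of this sketch. The sample ratings are Angelica's and Bill's from the chart above.

```python
import math

def minkowski(rating1, rating2, r):
    """Minkowski distance of order r over the bands both users rated.

    r=1 gives the Manhattan distance, r=2 the Euclidean distance, and
    r=math.inf the supremum distance (the largest single difference).
    Returns None when there are no bands in common (a sketch choice).
    """
    diffs = [abs(rating1[b] - rating2[b]) for b in rating1 if b in rating2]
    if not diffs:
        return None
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

angelica = {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5,
            "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5,
            "Vampire Weekend": 2.0}
bill = {"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0,
        "Phoenix": 2.0, "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0}

print(minkowski(angelica, bill, 1))            # 9.0
print(round(minkowski(angelica, bill, 2), 1))  # 4.3
print(minkowski(angelica, bill, math.inf))     # 3.0
```

The first two values agree with the Manhattan and Euclidean tables for Angelica and Bill. The third is their largest single disagreement, which is on Phoenix.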

## Some code

We can represent the data as a Python dictionary that maps each user to a dictionary of that user's ratings:

In [None]:
users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0, 
                      "Norah Jones": 4.5, "Phoenix": 5.0, 
                      "Slightly Stoopid": 1.5, 
                      "The Strokes": 2.5, "Vampire Weekend": 2.0}, 
         "Bill":     {"Blues Traveler": 2.0, "Broken Bells": 3.5, 
                      "Deadmau5": 4.0, 
                      "Phoenix": 2.0, "Slightly Stoopid": 3.5, 
                      "Vampire Weekend": 3.0}, 
         "Chan":     {"Blues Traveler": 5.0, "Broken Bells": 1.0, 
                      "Deadmau5": 1.0, "Norah Jones": 3.0, 
                      "Phoenix": 5, "Slightly Stoopid": 1.0}, 
         "Dan":      {"Blues Traveler": 3.0, "Broken Bells": 4.0, 
                      "Deadmau5": 4.5, "Phoenix": 3.0, 
                      "Slightly Stoopid": 4.5, "The Strokes": 4.0, 
                      "Vampire Weekend": 2.0}, 
         "Hailey":   {"Broken Bells": 4.0, "Deadmau5": 1.0, 
                      "Norah Jones": 4.0, "The Strokes": 4.0, 
                      "Vampire Weekend": 1.0}, 
         "Jordyn":   {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0, 
                      "Phoenix": 5.0, "Slightly Stoopid": 4.5, 
                      "The Strokes": 4.0, "Vampire Weekend": 4.0}, 
         "Sam":      {"Blues Traveler": 5.0, "Broken Bells": 2.0, 
                      "Norah Jones": 3.0, "Phoenix": 5.0, 
                      "Slightly Stoopid": 4.0, "The Strokes": 5.0}, 
         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0, 
                      "Phoenix": 4.0, "Slightly Stoopid": 2.5, 
                      "The Strokes": 3.0}}

We can get the ratings of a particular user as follows:

In [None]:
users["Veronica"]

### The code to compute Manhattan distance

I'd like to write a function that computes the Manhattan distance as follows:

In [None]:
def manhattan(rating1, rating2):
    """Computes the Manhattan distance. Both rating1 and rating2 are
    dictionaries of the form {'The Strokes': 3.0, 'Slightly Stoopid': 2.5}.
    Only bands rated by both users contribute to the distance."""
    distance = 0
    commonRatings = False

    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key])
            commonRatings = True
    if commonRatings:
        return distance
    else:
        # No ratings in common: treat the users as infinitely far apart,
        # so that sorting by distance never ranks them as the closest
        return float("inf")

To test the function:

In [None]:
manhattan(users['Hailey'], users['Veronica']) 

In [None]:
manhattan(users['Hailey'], users['Jordyn'])

### Find the closest person
A function that returns a sorted list with the closest person first:

In [None]:
def compute_nearest_neighbor(username, users):
    """
    creates a sorted list of users based on their distance to username
    """ 
    distances = [] 
    for user in users: 
        if user != username: 
            distance = manhattan(users[user], users[username]) 
            distances.append((distance, user)) 
            print("Distance(%s, %s) = %f"%(username,user,distance))
    # sort based on distance -- closest first
    distances.sort() 
    return distances

In [None]:
compute_nearest_neighbor('Hailey', users)

###  Make recommendations
* We find `Hailey`'s nearest neighbor (`Veronica` in this case). 
* We then **find** the bands that Veronica has rated but Hailey has not. 
* For example, 
  * Hailey has **not rated** the great band `Phoenix`. 
  * Veronica has **rated** `Phoenix` a '4' so we will assume Hailey is likely to enjoy the band as well. 
  
Here is a function to make recommendations.

In [None]:
def recommend(username, users):
    """
    Give list of recommendations
    """
    # first find nearest neighbor
    nearest = compute_nearest_neighbor(username, users)[0][1] 
    print("nearest neighbor:", nearest)
    recommendations = [] 
    # now find bands neighbor rated that user didn't 
    neighborRatings = users[nearest] 
    userRatings = users[username] 
    for artist in neighborRatings: 
        if artist not in userRatings:
            recommendations.append((artist, neighborRatings[artist]))
    recommendations.sort(key=lambda artistTuple: artistTuple[1], reverse=True)
    return recommendations

In [None]:
recommend('Hailey', users)

That fits with our expectations. 
* As we saw above, `Hailey`'s nearest neighbor was `Veronica` and Veronica gave `Phoenix` a '**4**'. 

Let's try a few more:

In [None]:
recommend('Chan', users) 

We think `Chan` will like `The Strokes`

In [None]:
recommend('Sam', users) 

We predict that `Sam` will not like `Deadmau5`.

In [None]:
recommend('Angelica', users) 

Hmm. For Angelica we got back an **empty list**, meaning we have **no recommendations** for her.  
Let us see what went wrong:

In [None]:
compute_nearest_neighbor('Angelica', users) 

Angelica's nearest neighbor is Veronica.  
When we look at their ratings:

* We see that **Angelica rated every band that Veronica did**. 
* We have no new ratings, so no recommendations. 
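
We can confirm this directly. The short sketch below copies Angelica's and Veronica's ratings from the data above and lists the bands Veronica rated that Angelica has not. The variable name `unrated` is hypothetical.

```python
angelica = {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5,
            "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5,
            "Vampire Weekend": 2.0}
veronica = {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0,
            "Slightly Stoopid": 2.5, "The Strokes": 3.0}

# Bands Veronica rated that Angelica has not: the only candidates
# a single nearest neighbor can contribute
unrated = [band for band in veronica if band not in angelica]

print(unrated)                         # []
print(set(veronica) <= set(angelica))  # True
```

Every band Veronica rated is already in Angelica's ratings, so a single nearest neighbor has nothing new to offer her.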