# Notebook 17: Collaborative Filtering
***

During lecture, we discussed collaborative filtering, a common method for making recommendations and for estimating the ratings users would give to items they have not yet rated or tried. An important step in collaborative filtering is to determine the $k$ items in our data set most similar to a given item; this is the $k$-nearest neighbors problem. To solve it efficiently, we will introduce and practice using the `NearestNeighbors` class from the ubiquitous Scikit-learn (`sklearn`) package.

We'll need NumPy, Pandas and Scikit-learn's `NearestNeighbors` class for this notebook, so let's load them.

In [1]:
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

<br>

### Exercise 1: Meeting your neighbors

Suppose we have a data set consisting of four users and their ratings (0 = hate it, 10 = love it) for three candy bars as follows:

In [2]:
dfCandy = pd.DataFrame({"Jeanne" : [3, 5, 10],
                        "George" : [0, 9, 7],
                        "Kathy" : [5, 6, 1],
                        "Rich" : [9, 4, 4]})

# These are the candy bars corresponding to the rows of the data frame
treats = ["Whatchamacallit", "Milky Way", "Almond Joy"]
dfCandy.rename({0 : treats[0], 1 : treats[1], 2 : treats[2]}, inplace=True)

dfCandy.head()

| | Jeanne | George | Kathy | Rich |
|---|---|---|---|---|
| Whatchamacallit | 3 | 0 | 5 | 9 |
| Milky Way | 5 | 9 | 6 | 4 |
| Almond Joy | 10 | 7 | 1 | 4 |


Finish the following function to compute the distances from a given user in a DataFrame to each other user, and return the indices of the $k$ users with the lowest distances to the given user. You can either write your own function to compute Euclidean distance or use the `numpy.linalg.norm` function.

As a brief test, the distances from Jeanne to each of the users are: `[0, 5.83, 9.27, 8.54]`

In [3]:
def neighbors(df, user, k=1):
    '''
    Loop over all users and compute the distance to the input user.
    Return the indices/names of the k nearest other users as a list.
    '''
    distances = [0, 0, 0, 0] # TODO
    # TODO -- probably some other calculations
    return nearest

In [4]:
# SOLUTION:

def neighbors(df, user, k=1):
    '''
    Loop over all users and compute the distance to the input user.
    Return the indices/names of the k nearest other users as a list.
    '''
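    # Euclidean distance from the input user's ratings to every user's ratings
    distances = np.array([np.linalg.norm(df[user] - df[u]) for u in df.columns])
    # Column positions sorted by increasing distance, skipping the input user
    # (argsort avoids the index() lookup, which breaks when distances are tied)
    order = [int(i) for i in np.argsort(distances) if df.columns[i] != user]
    nearest = order[:k]
    return nearest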
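
As a quick sanity check on the test values quoted above, the same distances can be computed without an explicit loop. This is a minimal sketch rather than the intended solution; the names `candy`, `dist_from_jeanne` and `ranked` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same ratings as dfCandy above; rows are candy bars, columns are users
candy = pd.DataFrame({"Jeanne": [3, 5, 10],
                      "George": [0, 9, 7],
                      "Kathy": [5, 6, 1],
                      "Rich": [9, 4, 4]})

# Subtract Jeanne's column from every column (aligned on the row index),
# then take the Euclidean norm down each column
diffs = candy.sub(candy["Jeanne"], axis=0)
dist_from_jeanne = np.sqrt((diffs ** 2).sum(axis=0))

# Drop Jeanne herself and sort ascending; the first label is her nearest neighbor
ranked = dist_from_jeanne.drop("Jeanne").sort_values()

print(dist_from_jeanne.round(2).tolist())
print(ranked.index.tolist())
```

The first printed list should agree with the test distances given above, and the second ranks the other users from nearest to farthest.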

Suppose a new user, Elizabeth, is considering trying a Whatchamacallit bar, and has 

<br>

### Exercise 2: Item-item collaborative filtering

Our homemade code for finding the $k$ nearest neighbors from **Exercise 1** is good and all, but, as with most things, there are some nice Python packages that can take care of this more efficiently. The homemade approach will become less and less efficient as the size of our utility matrix grows, so let's explore using Scikit-learn's [`NearestNeighbors`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors) class.

Suppose we have a set of movies A-F and 12 users given by the utility matrix below (and from lecture).

In [5]:
movies = [[1,0,3,0,0,5,0,0,5,0,4,0],
          [0,0,5,4,0,0,4,0,0,2,1,3],
          [2,4,0,1,2,0,3,0,4,3,5,0],
          [0,2,4,0,5,0,0,4,0,0,2,0],
          [0,0,4,3,4,2,0,0,0,0,2,5],
          [1,0,3,0,3,0,0,2,0,0,4,0]]

dfM = pd.DataFrame(movies)
dfM.rename({0:"A",1:"B",2:"C",3:"D",4:"E",5:"F"}, inplace=True)
dfM.head(6)


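| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 1 | 0 | 3 | 0 | 0 | 5 | 0 | 0 | 5 | 0 | 4 | 0 |
| B | 0 | 0 | 5 | 4 | 0 | 0 | 4 | 0 | 0 | 2 | 1 | 3 |
| C | 2 | 4 | 0 | 1 | 2 | 0 | 3 | 0 | 4 | 3 | 5 | 0 |
| D | 0 | 2 | 4 | 0 | 5 | 0 | 0 | 4 | 0 | 0 | 2 | 0 |
| E | 0 | 0 | 4 | 3 | 4 | 2 | 0 | 0 | 0 | 0 | 2 | 5 |
| F | 1 | 0 | 3 | 0 | 3 | 0 | 0 | 2 | 0 | 0 | 4 | 0 |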


We can start by setting up a `NearestNeighbors` object to find the two nearest neighbors to a given data point.

In [15]:
# Set up a NearestNeighbors object to find the two nearest neighbors
neigh = NearestNeighbors(n_neighbors=2)

Then we need to fit the model to the movies data set. Without suppressing the output, we can also see some of the specifics of the NearestNeighbors object that was fit.

In [16]:
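# Fit the model based on the movies data
neigh.fit(movies)
# Note: we fit on the raw `movies` list here rather than the DataFrame.
# When a query point is passed to kneighbors explicitly, a point that is in the
# fitted data is returned as its own nearest neighbor; only kneighbors() called
# with no query excludes each point from its own neighbors.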


NearestNeighbors(n_neighbors=2)

We can quickly check whether our model is working properly by finding the nearest neighbors of Movie A. The first array in the output contains the distances to the neighbors, and the second array provides the indices within the data set of the nearest points. If we ask for the nearest points to a movie that's actually in the data set, what should the closest neighbor always be? Let's check!

In [18]:
# What movie is most similar to movie A?
print(dfM.loc["A",:])
# kneighbors expects a 2D array of shape (n_queries, n_features), so convert
# the row to a NumPy array before reshaping it into a single query
print(neigh.kneighbors(dfM.loc["A",:].to_numpy().reshape(1,-1)))

0     1
1     0
2     3
3     0
4     0
5     5
6     0
7     0
8     5
9     0
10    4
11    0
Name: A, dtype: int64
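
The indices returned by `kneighbors` are row positions in the fitted data, so they still need to be mapped back to movie labels. The sketch below shows one way to do that and also contrasts an explicit query with a call that passes no query; the names `labels` and `nn` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Utility matrix from above: rows are movies A-F, columns are the 12 users
movies = [[1,0,3,0,0,5,0,0,5,0,4,0],
          [0,0,5,4,0,0,4,0,0,2,1,3],
          [2,4,0,1,2,0,3,0,4,3,5,0],
          [0,2,4,0,5,0,0,4,0,0,2,0],
          [0,0,4,3,4,2,0,0,0,0,2,5],
          [1,0,3,0,3,0,0,2,0,0,4,0]]
labels = ["A", "B", "C", "D", "E", "F"]

nn = NearestNeighbors(n_neighbors=2).fit(movies)

# Explicit query with movie A: A itself comes back first, at distance 0
dist, idx = nn.kneighbors([movies[0]])
print([labels[i] for i in idx[0]], dist[0])

# No query: neighbors of every fitted point are returned, and a point
# is not counted as its own neighbor
dist_all, idx_all = nn.kneighbors(n_neighbors=1)
print(labels[idx_all[0][0]])
```

With the default Euclidean metric, the second neighbor of A in the first call and the only neighbor in the second call are the same movie.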

Now that was using all of the data. Modify the code above to fit the `NearestNeighbors` model using only the movies that User 5 (column index 4) has also rated. The `loc` method for Pandas DataFrames will be useful here.

In [11]:
# SOLUTION:

# Set up a NearestNeighbors object to find the two nearest neighbors
neigh = NearestNeighbors(n_neighbors=2)

# Fit the model based on the movies data
neigh.fit(dfM.loc[dfM.iloc[:,4] != 0,:])

# What movie is most similar to movie A?
# (a Series has no reshape method, so convert it to a NumPy array first)
print(neigh.kneighbors(dfM.loc["A",:].to_numpy().reshape(1,-1)))

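In the previous steps, the distances were computed with the default "[minkowski](https://en.wikipedia.org/wiki/Minkowski_distance)" metric. Its exponent `p` selects the distance: `p=1` gives Manhattan distance and `p=2`, the default, gives Euclidean distance. Recent versions of Scikit-learn print only the non-default parameters of an estimator, so we use `get_params` to see the full settings.

In [None]:
print(neigh.get_params())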
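
To make the role of the exponent concrete, here is a small sketch that evaluates the Minkowski distance by hand for `p=1` and `p=2` and compares it with what `NearestNeighbors` reports. The vectors `x` and `y` and the helper `minkowski` are made up for this illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

x = np.array([1.0, 0.0, 3.0])
y = np.array([4.0, 4.0, 3.0])

def minkowski(a, b, p):
    # (sum_i |a_i - b_i|^p)^(1/p)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

manhattan = minkowski(x, y, 1)   # |1-4| + |0-4| + |3-3| = 7
euclidean = minkowski(x, y, 2)   # sqrt(9 + 16 + 0) = 5

# NearestNeighbors exposes the same exponent through its `p` parameter
d1, _ = NearestNeighbors(n_neighbors=1, p=1).fit([y]).kneighbors([x])
d2, _ = NearestNeighbors(n_neighbors=1, p=2).fit([y]).kneighbors([x])

print(manhattan, d1[0][0])
print(euclidean, d2[0][0])
```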

Neither of these is what we used in lecture, which was the cosine distance. Check out the [Scikit-learn documentation on `NearestNeighbors`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors) and make a modification to use the cosine distance instead. Note that the tree-based algorithm options (`"ball_tree"` and `"kd_tree"`) do not support the cosine distance, so set `algorithm="brute"`.
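
As a reminder of what the cosine distance measures, the sketch below computes it by hand as one minus the cosine similarity. The vectors and the helper `cosine_distance` are illustrative only; the point is that the cosine distance depends on direction and not on length, unlike the Euclidean distance.

```python
import numpy as np

def cosine_distance(u, v):
    # 1 - cos(angle between u and v)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, twice as long
c = np.array([0.0, 0.0, 3.0])   # orthogonal to a

# Cosine distance next to Euclidean distance for each pair
print(cosine_distance(a, b), np.linalg.norm(a - b))
print(cosine_distance(a, c), np.linalg.norm(a - c))
```

Vectors pointing the same way have cosine distance (numerically) zero even though their Euclidean distance is not zero, and orthogonal vectors have cosine distance one.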

In [12]:
# SOLUTION:

# Set up a NearestNeighbors object to find the two nearest neighbors
neigh = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")

# Fit the model based on the movies data
neigh.fit(dfM.loc[dfM.iloc[:,4] != 0,:])

# What movie is most similar to movie A?
# (a Series has no reshape method, so convert it to a NumPy array first)
print(neigh.kneighbors(dfM.loc["A",:].to_numpy().reshape(1,-1)))

**Question:** It's close, but why doesn't this match what we saw in class? Perform a calculation that accounts for the difference and recompute the two nearest neighbors to Movie A and their distances. After that, they should match what we saw in lecture, and the world will be complete.

**Solution:**

We need to ***center*** the ratings for each item first by subtracting from each row the item's mean rating (not including 0s).
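
Worked by hand for Movie A, centering looks like the sketch below. The variable names are illustrative; the ratings are row A of the utility matrix above.

```python
import numpy as np

# Row A of the utility matrix; 0 means "not rated"
row_a = np.array([1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0], dtype=float)

rated = row_a != 0
mean_a = row_a[rated].mean()     # (1 + 3 + 5 + 5 + 4) / 5 = 3.6

# Shift only the rated entries; unrated entries stay at 0
centered_a = row_a.copy()
centered_a[rated] -= mean_a

print(mean_a)
print(centered_a)
```

After centering, the rated entries of the row sum to zero (up to rounding), and an unrated entry now sits at the item's average instead of below every real rating.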

In [None]:
# SOLUTION:

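def center(dfRow):
    # Work on a float copy so the original ratings in dfM are left untouched
    row = dfRow.astype(float)
    # Only the rated (nonzero) entries enter the mean and get shifted
    rated = row != 0
    row[rated] -= row[rated].mean()
    return row

dfM_centered = dfM.apply(center, axis=1)
dfM_centered.head(6)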

In [None]:
# SOLUTION:

# Set up a NearestNeighbors object to find the two nearest neighbors
neigh = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")

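# Fit the model on the centered ratings of the movies that User 5 rated
# (use the original dfM for the mask, since a centered rating can be 0)
neigh.fit(dfM_centered.loc[dfM.iloc[:,4] != 0,:])

# What movie is most similar to movie A? Query with the centered row for A.
print(neigh.kneighbors(dfM_centered.loc["A",:].to_numpy().reshape(1,-1)))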
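
The centered cosine similarities can also be checked by hand with plain NumPy. This is a sketch, not part of the notebook: the helper `cos_sim` and the variable names are introduced here, and the final similarity-weighted average is one common way to turn the neighbors into a rating estimate, assumed for illustration.

```python
import numpy as np

movies = np.array([[1,0,3,0,0,5,0,0,5,0,4,0],
                   [0,0,5,4,0,0,4,0,0,2,1,3],
                   [2,4,0,1,2,0,3,0,4,3,5,0],
                   [0,2,4,0,5,0,0,4,0,0,2,0],
                   [0,0,4,3,4,2,0,0,0,0,2,5],
                   [1,0,3,0,3,0,0,2,0,0,4,0]], dtype=float)
labels = ["A", "B", "C", "D", "E", "F"]

# Center each row on the mean of its rated (nonzero) entries
rated = movies != 0
means = movies.sum(axis=1) / rated.sum(axis=1)
centered = np.where(rated, movies - means[:, None], 0.0)

def cos_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

user = 4                      # column index of User 5
target = labels.index("A")

# Only movies that User 5 has rated are candidates
candidates = [i for i in range(len(labels)) if rated[i, user]]
sims = {labels[i]: cos_sim(centered[target], centered[i]) for i in candidates}
top2 = sorted(sims, key=sims.get, reverse=True)[:2]

# Similarity-weighted average of User 5's raw ratings of the two neighbors
# (assumed estimator, not taken from this notebook)
estimate = (sum(sims[m] * movies[labels.index(m), user] for m in top2)
            / sum(sims[m] for m in top2))

print(top2, [round(sims[m], 2) for m in top2], round(estimate, 2))
```

The cosine distances reported by `NearestNeighbors` should be one minus these similarities.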