
# Recommendations with MovieTweetings: Collaborative Filtering
_**Andrew Leung - April 04, 2022**_

## Recommendations with MovieTweetings: Collaborative Filtering

One of the most popular methods for making recommendations is **collaborative filtering**.  In collaborative filtering, the ratings that many users have given to many items are used together to make new recommendations.

There are two main methods of performing collaborative filtering:

1. **Neighborhood-Based Collaborative Filtering**, which is based on the idea that we can either correlate items that are similar to provide recommendations or we can correlate users to one another to provide recommendations.

2. **Model Based Collaborative Filtering**, which is based on the idea that we can use machine learning and other mathematical models to understand the relationships that exist amongst items and users to predict ratings and provide recommendations.


In this notebook, you will be working on performing **neighborhood-based collaborative filtering**.  There are two main methods for performing neighborhood-based collaborative filtering:

1. **User-based collaborative filtering:** In this type of recommendation, users related to the user you would like to make recommendations for are used to create a recommendation.

2. **Item-based collaborative filtering:** In this type of recommendation, first you need to find the items that are most related to each other item (based on similar ratings).  Then you can use the ratings of an individual on those similar items to understand if a user will like the new item.

In this notebook you will be implementing **user-based collaborative filtering**.  However, it is easy to extend this approach to make recommendations using **item-based collaborative filtering**.  First, let's read in our data and necessary libraries.

**NOTE**: Because of the size of the datasets, some of your code cells here will take a while to execute, so be patient!

In [39]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tests as t
from scipy.sparse import csr_matrix
from IPython.display import HTML


%matplotlib inline

# Read in the datasets
movies = pd.read_csv('movies_clean.csv')
reviews = pd.read_csv('reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']

print(reviews.head())

   user_id  movie_id  rating   timestamp                 date  month_1  \
0        1     68646      10  1381620027  2013-10-12 23:20:27        0   
1        1    113277      10  1379466669  2013-09-18 01:11:09        0   
2        2    422720       8  1412178746  2014-10-01 15:52:26        0   
3        2    454876       8  1394818630  2014-03-14 17:37:10        0   
4        2    790636       7  1389963947  2014-01-17 13:05:47        0   

   month_2  month_3  month_4  month_5  ...  month_9  month_10  month_11  \
0        0        0        0        0  ...        0         1         0   
1        0        0        0        0  ...        0         0         0   
2        0        0        0        0  ...        0         1         0   
3        0        0        0        0  ...        0         0         0   
4        0        0        0        0  ...        0         0         0   

   month_12  year_2013  year_2014  year_2015  year_2016  year_2017  year_2018  
0         0          1  

### Measures of Similarity

When using **neighborhood** based collaborative filtering, it is important to understand how to measure the similarity of users or items to one another.  

There are a number of ways in which we might measure the similarity between two vectors (which might be two users or two items).  In this notebook, we will look specifically at two measures used to compare vectors:

* **Pearson's correlation coefficient**

Pearson's correlation coefficient is a measure of the strength and direction of a linear relationship. The value for this coefficient is a value between -1 and 1 where -1 indicates a strong, negative linear relationship and 1 indicates a strong, positive linear relationship. 

If we have two vectors x and y, we can define the correlation between the vectors as:


$$\text{CORR}(x, y) = \frac{\text{COV}(x, y)}{\text{STDEV}(x)\,\text{STDEV}(y)}$$

where 

$$\text{STDEV}(x) = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

and 

$$\text{COV}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

where n is the length of the vector, which must be the same for both x and y and $\bar{x}$ is the mean of the observations in the vector.  

We can use the correlation coefficient to indicate how alike two vectors are to one another, where the closer to 1 the coefficient, the more alike the vectors are to one another.  There are some potential downsides to using this metric as a measure of similarity.  You will see some of these throughout this workbook.
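
As a small sketch of the formulas above, the coefficient can be computed term by term and compared against `np.corrcoef`. The helper name `pearson_corr` and the toy rating vectors are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson correlation built from the COV and STDEV definitions above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Sample covariance and standard deviations, all with the 1/(n-1) factor
    cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)
    std_x = np.sqrt(((x - x.mean()) ** 2).sum() / (n - 1))
    std_y = np.sqrt(((y - y.mean()) ** 2).sum() / (n - 1))
    return cov_xy / (std_x * std_y)

# Hypothetical ratings two users gave to the same four movies
x = [8, 10, 8, 6]
y = [7, 9, 9, 5]

manual = pearson_corr(x, y)
reference = np.corrcoef(x, y)[0, 1]
print(manual, reference)
```

The 1/(n-1) factors in the numerator and denominator cancel, so the result is the same whether sample or population statistics are used.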


* **Euclidean distance**

Euclidean distance is a measure of the straight-line distance from one vector to another.  Because this is a measure of distance, larger values indicate that two vectors are more different from one another (the opposite of Pearson's correlation coefficient, where larger values indicate more similar vectors).

Specifically, the euclidean distance between two vectors x and y is measured as:

$$ \text{EUCL}(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$$

Unlike the correlation coefficient, Euclidean distance has no denominator that rescales the result.  Therefore, you need to make sure all of your data are on the same scale when using this metric.

**Note:** Because measuring similarity is often based on looking at the distance between vectors, it is important in these cases to scale your data or to have all data be in the same scale.  If some measures are on a 5 point scale, while others are on a 100 point scale, you are likely to have non-optimal results due to the difference in variability of your features.  In this case, we will not need to scale data because they are all on a 10 point scale, but it is always something to keep in mind!
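
To illustrate the scaling point, here is a minimal sketch with two made-up users whose second feature is recorded on a 100 point scale. The helper `euclidean_dist`, the vectors, and the rescaling factors are all assumptions chosen for illustration.

```python
import numpy as np

def euclidean_dist(x, y):
    """Straight-line distance between two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sqrt(((x - y) ** 2).sum())

# Hypothetical users: the first entry is on a 10 point scale,
# the second entry is the same kind of opinion on a 100 point scale
user_a = np.array([8.0, 80.0])
user_b = np.array([6.0, 60.0])

raw = euclidean_dist(user_a, user_b)

# Rescale the second feature to the 10 point scale before comparing
scale = np.array([1.0, 0.1])
rescaled = euclidean_dist(user_a * scale, user_b * scale)
print(raw, rescaled)
```

Before rescaling, the 100 point feature contributes almost all of the distance; after rescaling, both features contribute equally.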

------------

### User-Item Matrix

In order to calculate the similarities, it is common to put values in a matrix.  In this matrix, users are identified by each row, and items are represented by columns.  


![alt text](images/userxitem.png "User Item Matrix")


In the above matrix, you can see that **User 1** and **User 2** both used **Item 1**, and **User 2**, **User 3**, and **User 4** all used **Item 2**.  However, there are also a large number of missing values in the matrix for users who haven't used a particular item.  A matrix with many missing values (like the one above) is considered **sparse**.

Our first goal for this notebook is to create the above matrix with the **reviews** dataset.  However, instead of 1 values in each cell, you should have the actual rating.  

The users will indicate the rows, and the movies will exist across the columns. To create the user-item matrix, we only need the first three columns of the **reviews** dataframe, which you can see by running the cell below.

In [40]:
user_items = reviews[['user_id', 'movie_id', 'rating']]
user_items.head()

Unnamed: 0,user_id,movie_id,rating
0,1,68646,10
1,1,113277,10
2,2,422720,8
3,2,454876,8
4,2,790636,7


### Creating the User-Item Matrix

In order to create the user-items matrix (like the one above), I personally started by using a [pivot table](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html). 

However, I quickly ran into a memory error (a common theme throughout this notebook).  I will help you navigate around many of the errors I had, and achieve useful collaborative filtering results! 

_____

`1.` Create a matrix where the users are the rows, the movies are the columns, and the ratings exist in each cell, or a NaN exists in cells where a user hasn't rated a particular movie. If you get a memory error (like I did), [this link here](https://stackoverflow.com/questions/39648991/pandas-dataframe-pivot-memory-error) might help you!

In [41]:
# Create user-by-item matrix
# user_items.pivot_table(index="user_id", columns="movie_id", aggfunc="max")

user_by_movie = user_items.groupby(by=["user_id", "movie_id"])["rating"].max().unstack()
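
The same pattern can be checked on a toy dataframe, where the result is small enough to read. This is a sketch under assumptions: `toy_items` and its values are made up, and the density and byte counts only illustrate why the dense matrix gets expensive as users and movies grow.

```python
import numpy as np
import pandas as pd

# Toy ratings in the same long format as user_items (values are made up)
toy_items = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3],
    "movie_id": [10, 20, 20, 10, 30],
    "rating":   [10, 8, 7, 9, 6],
})

# Same pattern as the cell above: one row per user, one column per movie
toy_matrix = toy_items.groupby(["user_id", "movie_id"])["rating"].max().unstack()

# The pivot_table route gives the same layout on data this small
toy_pivot = toy_items.pivot_table(index="user_id", columns="movie_id",
                                  values="rating", aggfunc="max")

# Share of cells that hold a rating, and the size of the dense float64 block
n_cells = toy_matrix.shape[0] * toy_matrix.shape[1]
density = toy_matrix.notna().sum().sum() / n_cells
dense_bytes = n_cells * np.dtype("float64").itemsize
print(toy_matrix)
print(density, dense_bytes)
```

The dense block grows as users times movies, so on a sparse dataset most of that memory holds NaN cells.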

Check your results below to make sure your matrix is ready for the upcoming sections.

In [42]:
assert movies.shape[0] == user_by_movie.shape[1], "Oh no! Your matrix should have {} columns, and yours has {}!".format(movies.shape[0], user_by_movie.shape[1])
assert reviews.user_id.nunique() == user_by_movie.shape[0], "Oh no! Your matrix should have {} rows, and yours has {}!".format(reviews.user_id.nunique(), user_by_movie.shape[0])
print("Looks like you are all set! Proceed!")
HTML('<img src="images/greatjob.webp">')

Looks like you are all set! Proceed!


In [43]:
user_by_movie.tail(10)

user_id \ movie_id,8,10,12,25,91,417,439,443,628,833,...,8144778,8144868,8206708,8289196,8324578,8335880,8342748,8342946,8402090,8439854
53959,,,,,,,,,,,...,,,,,,,,,,
53960,,,,,,,,,,,...,,,,,,,,,,
53961,,,,,,,,,,,...,,,,,,,,,,
53962,,,,,,,,,,,...,,,,,,,,,,
53963,,,,,,,,,,,...,,,,,,,,,,
53964,,,,,,,,,,,...,,,,,,,,,,
53965,,,,,,,,,,,...,,,,,,,,,,
53966,,,,,,,,,,,...,,,,,,,,,,
53967,,,,,,,,,,,...,,,,,,,,,,
53968,,,,,,,,,,,...,,,,,,,,,,


`2.` Now that you have a matrix of users by movies, use this matrix to create a dictionary where the key is each user and the value is an array of the movies each user has rated.

In [48]:
# Create a dictionary with users and corresponding movies seen

def movies_watched(user_id):
    '''
    INPUT:
    user_id - the user_id of an individual as int
    OUTPUT:
    movies - an array of movies the user has watched
    '''
    # Keep only the movies with a non-null rating for this user
    watched = user_by_movie.loc[user_id].dropna().index.array
    
    return watched


def create_user_movie_dict():
    '''
    INPUT: None
    OUTPUT: movies_seen - a dictionary where each key is a user_id and the value is an array of movie_ids
    
    Creates the movies_seen dictionary
    '''
    movies_seen= {}
    
    for i in user_by_movie.index:
        movies_seen[i]= movies_watched(i)
    
    
    return movies_seen


# Use your function to return dictionary
movies_seen = create_user_movie_dict()

In [47]:
#testing the user/movies seen dictionary
create_user_movie_dict()

{1: <PandasArray>
 [68646, 113277]
 Length: 2, dtype: int64,
 2: <PandasArray>
 [ 422720,  454876,  790636,  816711, 1091191, 1103275, 1322269, 1390411,
  1398426, 1431045, 1433811, 1454468, 1535109, 1675434, 1798709, 2017038,
  2024544, 2294629, 2361509, 2381249, 2726560, 2883512, 3079380]
 Length: 23, dtype: int64,
 3: <PandasArray>
 [1790864, 2170439, 2203939]
 Length: 3, dtype: int64,
 4: <PandasArray>
 [1300854]
 Length: 1, dtype: int64,
 5: <PandasArray>
 [54953, 120863]
 Length: 2, dtype: int64,
 6: <PandasArray>
 [2103281]
 Length: 1, dtype: int64,
 7: <PandasArray>
 [1764234, 1790885, 2053463]
 Length: 3, dtype: int64,
 8: <PandasArray>
 [385002, 1220198, 1462900, 1512685, 1631707, 1986994, 1999995]
 Length: 7, dtype: int64,
 9: <PandasArray>
 [65207, 363163, 985699]
 Length: 3, dtype: int64,
 10: <PandasArray>
 [1253863]
 Length: 1, dtype: int64,
 11: <PandasArray>
 [3294634]
 Length: 1, dtype: int64,
 12: <PandasArray>
 [1255953]
 Length: 1, dtype: int64,
 13: <PandasArray>
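
Looking up one row of the wide matrix per user can be slow. As an alternative sketch (not the approach used in this notebook), a similar dictionary can be built straight from the long-format ratings. `toy_items` is a made-up stand-in for `user_items`; note that the movie ids come back in row order rather than in sorted column order, and the sketch assumes every row holds a real rating.

```python
import pandas as pd

# Toy stand-in for user_items (values are made up)
toy_items = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3, 3],
    "movie_id": [10, 20, 20, 10, 30, 40],
    "rating":   [10, 8, 7, 9, 6, 8],
})

# Iterating a grouped column yields (user_id, Series of movie_ids) pairs
toy_movies_seen = {
    user_id: movie_ids.to_numpy()
    for user_id, movie_ids in toy_items.groupby("user_id")["movie_id"]
}
print(toy_movies_seen)
```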


`3.` If a user hasn't rated more than 2 movies, we consider these users "too new".  Create a new dictionary that only contains users who have rated more than 2 movies.  This dictionary will be used for all the final steps of this workbook.

In [52]:
# Remove individuals who have watched 2 or fewer movies - don't have enough data to make recs

def create_movies_to_analyze(movies_seen, lower_bound=2):
    '''
    INPUT:  
    movies_seen - a dictionary where each key is a user_id and the value is an array of movie_ids
    lower_bound - (an int) a user must have more movies seen than the lower bound to be added to the movies_to_analyze dictionary

    OUTPUT: 
    movies_to_analyze - a dictionary where each key is a user_id and the value is an array of movie_ids
    
    The movies_seen and movies_to_analyze dictionaries should be the same except that the output dictionary has removed
    users who have rated lower_bound or fewer movies
    
    '''
    
    # Keep only the users with more than lower_bound rated movies
    movies_to_analyze = {}
    
    for (key, value) in movies_seen.items():
        if len(value) > lower_bound:
            movies_to_analyze[key] = value
    
    return movies_to_analyze


# Use your function to return your updated dictionary
movies_to_analyze = create_movies_to_analyze(movies_seen)

In [53]:
# Run the tests below to check that your movies_to_analyze matches the solution
assert len(movies_to_analyze) == 23512, "Oops!  It doesn't look like your dictionary has the right number of individuals."
assert len(movies_to_analyze[2]) == 23, "Oops!  User 2 didn't match the number of movies we thought they would have."
assert len(movies_to_analyze[7])  == 3, "Oops!  User 7 didn't match the number of movies we thought they would have."
print("If this is all you see, you are good to go!")

If this is all you see, you are good to go!


### Calculating User Similarities

Now that you have set up the **movies_to_analyze** dictionary, it is time to take a closer look at the similarities between users.  Below is the pseudocode for how I thought about determining the similarity between users:

```
for user1 in movies_to_analyze
    for user2 in movies_to_analyze
        see how many movies match between the two users
        if more than two movies in common
            pull the overlapping movies
            compute the distance/similarity metric between ratings on the same movies for the two users
            store the users and the distance metric
```

However, this took a very long time to run, and other methods of performing these operations did not fit on the workspace memory!

Therefore, rather than creating a dataframe with all possible pairings of users in our data, your task for this question is to look at a few specific examples of the correlation between ratings given by two users.  For this question consider you want to compute the [correlation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html) between users.
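
For reference, the pseudocode above can be sketched on toy data. `toy_ratings` and `toy_to_analyze` are made-up stand-ins for `user_by_movie` and `movies_to_analyze`, and euclidean distance is used as the metric here purely as an example. With the 23,512 users kept in **movies_to_analyze**, the same loop would visit roughly 276 million pairs, which is why it is so slow on the full data.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Toy stand-ins for user_by_movie and movies_to_analyze (values are made up)
toy_ratings = pd.DataFrame(
    {10: [8, 9, np.nan], 20: [6, 7, 5], 30: [9, 10, 8], 40: [np.nan, 4, 7]},
    index=pd.Index([1, 2, 3], name="user_id"),
)
toy_to_analyze = {u: row.dropna().index.to_numpy() for u, row in toy_ratings.iterrows()}

records = []
# combinations visits each unordered pair once instead of looping over both orders
for user1, user2 in combinations(toy_to_analyze, 2):
    shared = np.intersect1d(toy_to_analyze[user1], toy_to_analyze[user2])
    if len(shared) > 2:
        x = toy_ratings.loc[user1, shared].to_numpy()
        y = toy_ratings.loc[user2, shared].to_numpy()
        records.append((user1, user2, np.linalg.norm(x - y)))

pairs = pd.DataFrame(records, columns=["user1", "user2", "eucl_dist"])
print(pairs)
```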

`4.` Using the **movies_to_analyze** dictionary and **user_by_movie** dataframe, create a function that computes the correlation between two users' ratings of the movies they have both rated.  Then use your function to compare your results to ours using the tests below.  

In [54]:
#find shared elements/movies
shared = np.intersect1d(movies_to_analyze[2],movies_to_analyze[66])
#subset with index label (user id) and select only the shared movies (returns a series that is turned to an array)
user_by_movie.loc[2,shared.tolist()].array

<PandasArray>
[8.0, 10.0, 8.0]
Length: 3, dtype: float64

In [55]:
#pair-wise correlation value between the two arrays of ratings for the same movies

'''
can create pearson correlation like last module or as follows

def my_corrcoef1( x, y ):    
    mean_x = np.mean( x )
    mean_y = np.mean( y )
    std_x  = np.std ( x )
    std_y  = np.std ( y )
    n      = len    ( x )
    return (( x - mean_x ) * ( y - mean_y )).sum() / n / ( std_x * std_y )
'''
#np.corrcoef returns a 2x2 correlation matrix for two input arrays; entry [0][1] is the correlation
#between the first array and the second
np.corrcoef(user_by_movie.loc[2,shared.tolist()].array,user_by_movie.loc[66,shared.tolist()].array)[0][1]

0.7559289460184543

In [56]:
def compute_correlation(user1, user2):
    '''
    INPUT
    user1 - int user_id
    user2 - int user_id
    OUTPUT
    the correlation between the matching ratings between the two users
    '''
    shared = np.intersect1d(movies_to_analyze[user1],movies_to_analyze[user2])
    
    x  = user_by_movie.loc[user1,shared.tolist()].array
    y  = user_by_movie.loc[user2,shared.tolist()].array
    
    corr = np.corrcoef(x,y)[0][1]
    
    return corr #return the correlation

In [57]:
# Test your function against the solution
assert compute_correlation(2,2) == 1.0, "Oops!  The correlation between a user and itself should be 1.0."
assert round(compute_correlation(2,66), 2) == 0.76, "Oops!  The correlation between user 2 and 66 should be about 0.76."
assert np.isnan(compute_correlation(2,104)), "Oops!  The correlation between user 2 and 104 should be a NaN."

print("If this is all you see, then it looks like your function passed all of our tests!")

If this is all you see, then it looks like your function passed all of our tests!




### Why the NaN's?

If the function you wrote passed all of the tests, then you have correctly set up your function to calculate the correlation between any two users.  

`5.` But one question is, why are we still obtaining **NaN** values?  As you can see in the code cell above, users 2 and 104 have a correlation of **NaN**. Why?

Think and write your ideas here about why these NaNs exist, and use the cells below to do some coding to validate your thoughts. You can check other pairs of users and see that there are actually many NaNs in our data - 2,526,710 of them in fact. These NaN's ultimately make the correlation coefficient a less than optimal measure of similarity between two users.




In [58]:
# Which movies did both user 2 and user 104 see?
np.intersect1d(movies_to_analyze[2],movies_to_analyze[104])

array([ 454876,  816711, 1454468, 1535109], dtype=int64)

In [59]:
# What were the ratings for each user for those movies?

#user 2 gave every shared movie the same rating, so the sd is 0 and dividing by 0 in the correlation formula gives the NaN
user_by_movie.loc[[2,104],np.intersect1d(movies_to_analyze[2],movies_to_analyze[104]) ]

user_id \ movie_id,454876,816711,1454468,1535109
2,8.0,8.0,8.0,8.0
104,9.0,7.0,7.0,9.0
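
The zero-variance explanation can be checked directly with the ratings from the table above. This is a minimal sketch; the variable names are illustrative, and `np.errstate` is used only to silence the divide warnings that NumPy raises for this case.

```python
import numpy as np

# Ratings from the table above for the four movies both users rated
user_2 = np.array([8.0, 8.0, 8.0, 8.0])
user_104 = np.array([9.0, 7.0, 7.0, 9.0])

# User 2 never varies, so the standard deviation in the denominator is 0
std_user_2 = user_2.std(ddof=1)

# Silence the divide warnings NumPy raises for the 0/0 inside corrcoef
with np.errstate(invalid="ignore", divide="ignore"):
    corr = np.corrcoef(user_2, user_104)[0, 1]

print(std_user_2, corr)
```

Any pair of users where one of them gave identical ratings to all shared movies will produce a NaN in the same way.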


`6.` Because the correlation coefficient proved to be less than optimal for relating user ratings to one another, we could instead calculate the euclidean distance between the ratings.  I found [this post](https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy) particularly helpful when I was setting up my function.  This function should be very similar to your previous function.  When you feel confident with your function, test it against our results.

In [60]:
# Which movies did both user 2 and user 104 see?
shared = np.intersect1d(movies_to_analyze[2],movies_to_analyze[104])

In [61]:
x  = user_by_movie.loc[2,shared.tolist()].array
y  = user_by_movie.loc[104,shared.tolist()].array

In [62]:
#for a 1-D array, np.linalg.norm returns the 2-norm by default, which is the euclidean distance
#(the Minkowski distance with p=2); the Frobenius norm is the default for 2-D arrays
np.linalg.norm(x-y)

2.0
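
The value 2.0 can be checked by hand from the ratings shown in question 5, where user 2 rated all four shared movies 8 and user 104 rated them 9, 7, 7, and 9:

$$\text{EUCL}(x, y) = \sqrt{(8-9)^2 + (8-7)^2 + (8-7)^2 + (8-9)^2} = \sqrt{4} = 2$$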

In [63]:
def compute_euclidean_dist(user1, user2):
    '''
    INPUT
    user1 - int user_id
    user2 - int user_id
    OUTPUT
    the euclidean distance between user1 and user2
    '''
    shared = np.intersect1d(movies_to_analyze[user1],movies_to_analyze[user2])
    
    x  = user_by_movie.loc[user1,shared.tolist()].array
    y  = user_by_movie.loc[user2,shared.tolist()].array
    
    dist = np.linalg.norm(x-y)
    return dist #return the euclidean distance

In [64]:
# Read in solution euclidean distances
import pickle
df_dists = pd.read_pickle("data/Term2/recommendations/lesson1/data/dists.p")

In [65]:
# Test your function against the solution
assert compute_euclidean_dist(2,2) == df_dists.query("user1 == 2 and user2 == 2")['eucl_dist'][0], "Oops!  The distance between a user and itself should be 0.0."
assert round(compute_euclidean_dist(2,66), 2) == round(df_dists.query("user1 == 2 and user2 == 66")['eucl_dist'][1], 2), "Oops!  The distance between user 2 and 66 should be about 2.24."
assert np.isnan(compute_euclidean_dist(2,104)) == np.isnan(df_dists.query("user1 == 2 and user2 == 104")['eucl_dist'][4]), "Oops!  The distance between user 2 and 104 should be 2."

print("If this is all you see, then it looks like your function passed all of our tests!")

If this is all you see, then it looks like your function passed all of our tests!


### Using the Nearest Neighbors to Make Recommendations

In the previous question, you read in **df_dists**. Therefore, you have a measure of distance between each user and every other user. This dataframe holds every possible pairing of users, as well as the corresponding euclidean distance.

Because of the **NaN** values that exist within the correlations of the matching ratings for many pairs of users, as we discussed above, we will proceed using **df_dists**. You will want to find the users that are 'nearest' each user.  Then you will want to find the movies the closest neighbors have liked to recommend to each user.

I made use of the following objects:

* df_dists (to obtain the neighbors)
* user_items (to obtain the movies the neighbors and users have rated)
* movies (to obtain the names of the movies)

`7.` Complete the functions below, which allow you to find the recommendations for any user.  There are five functions which you will need:

* **find_closest_neighbors** - this returns a list of user_ids from closest neighbor to farthest neighbor using euclidean distance


* **movies_liked** - returns an array of movie_ids


* **movie_names** - takes the output of movies_liked and returns a list of movie names associated with the movie_ids


* **make_recommendations** - takes a user id and goes through closest neighbors to return a list of movie names as recommendations


* **all_recommendations** - loops through every user and returns a dictionary with the key as a user_id and the value as a list of movie recommendations

In [66]:
#finding closest neighbours - after looking at df_dists - use user1 as lookup individual
#remove self as a neighbour - example with user id == 2
user_subset = df_dists[(df_dists["user1"]==2) & (df_dists["user2"]!=2)].sort_values(by=["eucl_dist"], ascending=True)

In [67]:
#find the neighbours of the individual requested - use user2 column and turn to array
user_subset["user2"].array

<PandasArray>
[22915, 34706, 33207, 30884, 12856, 20390, 32951, 39417, 35310,  2138,
 ...
 14927, 15981, 30036,  3064,  9914, 49739,  3514, 36807, 32494, 52737]
Length: 2808, dtype: int64

In [68]:
#array of movies for requested user that they have watched and liked (above minimum threshold)
user_items[(user_items["user_id"]==2)&(user_items["rating"]>=7)]["movie_id"].array

<PandasArray>
[ 422720,  454876,  790636,  816711, 1091191, 1103275, 1322269, 1390411,
 1398426, 1431045, 1433811, 1454468, 1535109, 1675434, 1798709, 2017038,
 2024544, 2294629, 2361509, 2726560, 2883512, 3079380]
Length: 22, dtype: int64

In [69]:
#movie titles from ids
movie_ids_wanted = user_items[(user_items["user_id"]==2)&(user_items["rating"]>=7)]["movie_id"].array
movies[movies["movie_id"].isin(movie_ids_wanted)]["movie"].tolist()

['Marie Antoinette (2006)',
 'Life of Pi (2012)',
 'Dallas Buyers Club (2013)',
 'World War Z (2013)',
 'Lone Survivor (2013)',
 'Two Lovers (2008)',
 'August: Osage County (2013)',
 'In the Heart of the Sea (2015)',
 'Straight Outta Compton (2015)',
 'Deadpool (2016)',
 'Disconnect (2012)',
 'Gravity (2013)',
 'Captain Phillips (2013)',
 'The Intouchables (2011)',
 'Her (2013)',
 'All Is Lost (2013)',
 '12 Years a Slave (2013)',
 'Frozen (2013)',
 'The Intern (2015)',
 'The Longest Ride (2015)',
 'Chef (2014)',
 'Spy (2015)']

### Using the Nearest Neighbors to Make Recommendations

In the following section, I take a look at how the sort order changes between my subset and sort method and Udacity's approach. In this first example, I looked at filtering the user_items dataframe to grab movies above a certain rating.

In [70]:
#make recommendations by gathering movies that neighbours watched and rated 7 or higher
#this is my original method, where I planned to do the groupby here and remove all the movies already watched
#but this method only implicitly takes the order of the neighbours into account - recommendations would not necessarily
#come from the closest neighbours if the record order is shuffled afterwards
#it might work with additional tweaks

neighbours = user_subset["user2"].array
user_watched = user_items[(user_items["user_id"]==2)&(user_items["rating"]>=7)]["movie_id"].array

user_items[(user_items["user_id"].isin(neighbours)) & (user_items["rating"]>=7) & ~(user_items["movie_id"].isin(user_watched))]


Unnamed: 0,user_id,movie_id,rating
559,66,33467,8
560,66,33870,9
561,66,36775,9
562,66,37008,7
563,66,37017,7
...,...,...,...
712301,53966,5362988,8
712303,53966,5442430,10
712307,53966,5503688,7
712326,53966,6000478,7


#### Udacity  subset and sort methods

Using user_id == 2, the methods used by Udacity to return users sorted by distance produce the following order:

In [71]:
df = df_dists[(df_dists["user1"]==2)].sort_values(by=["eucl_dist"], ascending=True)
df

Unnamed: 0,user1,user2,eucl_dist
0,2,2,0.000000
35,2,755,0.000000
1161,2,22915,0.000000
1836,2,35310,0.000000
1808,2,34706,0.000000
...,...,...,...
2575,2,49739,10.344080
165,2,3514,10.440307
1915,2,36807,10.862780
1699,2,32494,11.532563


#### Implementing my subset and sort solution

In my solution, instead of using .iloc after the sort to remove user2 == 2 (the user paired with itself), I used & (df_dists["user2"]!=2) as part of a boolean selection. I successfully remove user2 == 2 but, as you can see, the records at the top change compared to the solution. All of the records that move are tied at a distance of 0:

In [73]:
df_dists[(df_dists["user1"]==2) & (df_dists["user2"]!=2)].sort_values(by=["eucl_dist"], ascending=True)

Unnamed: 0,user1,user2,eucl_dist
1161,2,22915,0.000000
1808,2,34706,0.000000
1729,2,33207,0.000000
1590,2,30884,0.000000
656,2,12856,0.000000
...,...,...,...
2575,2,49739,10.344080
165,2,3514,10.440307
1915,2,36807,10.862780
1699,2,32494,11.532563


#### Solution 3 - Split implementation

Two more variants are shown below. The first cell sorts first and then drops the self-pairing with .iloc[1:], as in the Udacity solution. The second cell filters user2 == 2 out of the already sorted df and sorts again. The bottom records match, but the tied records at the top are again ordered differently:

In [74]:
df_dists[df_dists['user1']==2].sort_values(by='eucl_dist').iloc[1:]

Unnamed: 0,user1,user2,eucl_dist
35,2,755,0.000000
1161,2,22915,0.000000
1836,2,35310,0.000000
1808,2,34706,0.000000
1729,2,33207,0.000000
...,...,...,...
2575,2,49739,10.344080
165,2,3514,10.440307
1915,2,36807,10.862780
1699,2,32494,11.532563


In [72]:
df[df["user2"]!=2].sort_values(by=["eucl_dist"], ascending=True)

Unnamed: 0,user1,user2,eucl_dist
35,2,755,0.000000
2053,2,39417,0.000000
1590,2,30884,0.000000
92,2,2138,0.000000
2800,2,53793,0.000000
...,...,...,...
2575,2,49739,10.344080
165,2,3514,10.440307
1915,2,36807,10.862780
1699,2,32494,11.532563
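
A likely explanation for the changing order (an assumption, since the notebook does not investigate it) is that `sort_values` uses quicksort by default, which is not a stable sort, so rows tied on `eucl_dist` can come back in a different order depending on which rows are passed in. The sketch below shows one way to make the order reproducible: drop the self-pairing explicitly and break ties on `user2`. The helper `closest_neighbors_stable` and the toy distances are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_dists with several neighbours tied at distance 0 (values are made up)
toy_dists = pd.DataFrame({
    "user1": [2, 2, 2, 2, 2],
    "user2": [2, 755, 22915, 3514, 35310],
    "eucl_dist": [0.0, 0.0, 0.0, 10.44, 0.0],
})

def closest_neighbors_stable(dists, user):
    """Neighbours of `user`, closest first, with ties broken by user2 id."""
    # Drop the self-pairing explicitly rather than relying on it sorting first
    subset = dists[(dists["user1"] == user) & (dists["user2"] != user)]
    # Sorting on (distance, id) leaves no ties, so the order is reproducible
    ordered = subset.sort_values(by=["eucl_dist", "user2"])
    return ordered["user2"].to_numpy()

print(closest_neighbors_stable(toy_dists, 2))
```

Passing `kind="stable"` to `sort_values` is another option when the original row order should decide ties. Either way, relying on `.iloc[1:]` assumes the self-pairing sorts first, which is not guaranteed when other users are also at distance 0.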


## Final Implementation of Collaborative Filtering Recommender

In [75]:
def find_closest_neighbors(user):
    '''
    INPUT:
        user - (int) the user_id of the individual you want to find the closest users
    OUTPUT:
        closest_neighbors - an array of the id's of the users sorted from closest to farthest away
    '''
    # I treated ties as arbitrary and just kept whichever was easiest to keep using the head method
    # You might choose to do something less hand wavy - order the neighbors
    
    #the commented-out version below selects the same users as the udacity answer, but tied distances come back in a different order
    
#     user_subset = df_dists[(df_dists["user1"]==user) & (df_dists["user2"]!=user)].sort_values(by=["eucl_dist"], ascending=True)
#     closest_neighbors = user_subset["user2"].array
    closest_users = df_dists[df_dists['user1']==user].sort_values(by='eucl_dist').iloc[1:]['user2']
    closest_neighbors = np.array(closest_users)
    
    return closest_neighbors
    
    
    
def movies_liked(user_id, min_rating=7):
    '''
    INPUT:
    user_id - the user_id of an individual as int
    min_rating - the minimum rating a movie needs to count as a "like" rather than a "dislike"
    OUTPUT:
    movies_liked - an array of movies the user has watched and liked
    '''
    
    movies_liked  = user_items[(user_items["user_id"]==user_id)&(user_items["rating"]>=min_rating)]["movie_id"].array
    
    return movies_liked


def movie_names(movie_ids):
    '''
    INPUT
    movie_ids - a list of movie_ids
    OUTPUT
    movie_lst - a list of movie names associated with the movie_ids
    '''
    # isin keeps the row order of the movies dataframe, not the order of movie_ids
    movie_lst = movies[movies["movie_id"].isin(movie_ids)]["movie"].tolist()

    return movie_lst
    
    
def make_recommendations(user, num_recs=10):
    '''
    INPUT:
        user - (int) a user_id of the individual you want to make recommendations for
        num_recs - (int) number of movies to return
    OUTPUT:
        recommendations - a list of movies - if there are "num_recs" recommendations return this many
                          otherwise return the total number of recommendations available for the "user"
                          which may just be an empty list
    '''
    # Movies the user has already watched, and the other users ordered from closest to farthest
    movies_seen = movies_watched(user)
    neighbours = find_closest_neighbors(user)

    recommended = np.array([])

    for n in neighbours:
        neighbour_likes = movies_liked(n)

        # Keep only the neighbour's liked movies that the user has not seen yet
        new_recs = np.setdiff1d(neighbour_likes, movies_seen, assume_unique=True)

        # The Udacity answer used an np.unique as well
        recommended = np.concatenate([recommended, new_recs], axis=0)

        # A neighbour's whole batch is added before this check, so the result can exceed num_recs
        if len(recommended) >= num_recs:
            break

    recommendations = movie_names(recommended)
    return recommendations

def all_recommendations(num_recs=10):
    '''
    INPUT
        num_recs - (int) the (max) number of recommendations for each user
    OUTPUT
        all_recs - a dictionary where each key is a user_id and the value is a list of recommended movie titles
    '''
    users = np.unique(df_dists["user1"])

    # Make the recommendations for each user
    all_recs = {}

    for u in users:
        print(u)  # progress indicator: one line per user processed
        all_recs[u] = make_recommendations(u, num_recs)

    return all_recs

all_recs = all_recommendations(10)

2
3
7
8
9
...

In [76]:
#examine one example
make_recommendations(2)

['Philadelphia (1993)',
 'Training Day (2001)',
 'About Schmidt (2002)',
 'Insomnia (2002)',
 'The United States of Leland (2003)',
 'Shattered Glass (2003)',
 'Man on Fire (2004)',
 'Flipped (2010)',
 'Silver Linings Playbook (2012)',
 'Lawless (2012)',
 '50/50 (2011)',
 'Crazy, Stupid, Love. (2011)',
 'The Perks of Being a Wallflower (2012)',
 'Before I Go to Sleep (2014)',
 'Zero Dark Thirty (2012)',
 'American Hustle (2013)',
 'Django Unchained (2012)',
 'Side Effects (2013)',
 'Gone Girl (2014)',
 'Enough Said (2013)',
 'Nightcrawler (2014)']
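
The list above holds 21 titles although `num_recs` defaults to 10: the loop adds every liked movie of a neighbour before it checks the length. A minimal sketch of capping the result follows, written as a hypothetical helper `collect_recs` that works on plain lists of movie ids rather than the notebook's dataframes. The ids in the example call are made up.

```python
def collect_recs(neighbor_likes, movies_seen, num_recs=10):
    """Hypothetical helper: gather unseen movie ids neighbour by neighbour, capped at num_recs.

    neighbor_likes - iterable of lists of movie ids, closest neighbour first
    movies_seen    - movie ids the target user has already watched
    """
    seen = set(movies_seen)   # also tracks ids already recommended, to avoid duplicates
    recommended = []
    for likes in neighbor_likes:
        for movie_id in likes:
            if len(recommended) >= num_recs:
                return recommended
            if movie_id not in seen:
                recommended.append(movie_id)
                seen.add(movie_id)
    return recommended

# Made-up ids: the user has seen movie 2, and movie 3 is liked by both neighbours.
capped = collect_recs([[1, 2, 3], [3, 4, 5, 6]], movies_seen=[2], num_recs=4)
print(capped)
```

Checking the cap before each append keeps the closest neighbours' movies first and stops as soon as the list is full.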

In [80]:
# This loads our solution dictionary so you can compare results
all_recs_sol = pd.read_pickle("data/Term2/recommendations/lesson1/data/all_recs.p")

In [81]:
assert all_recs[2] == make_recommendations(2), "Oops!  Your recommendations for user 2 didn't match ours."
assert all_recs[26] == make_recommendations(26), "Oops!  It actually wasn't possible to make any recommendations for user 26."
assert all_recs[1503] == make_recommendations(1503), "Oops! Looks like your solution for user 1503 didn't match ours."
print("If you made it here, you now have recommendations for many users using collaborative filtering!")
HTML('<img src="images/greatjob.webp">')


### Now What?

If you made it this far, you have successfully implemented a solution to making recommendations using collaborative filtering. 

`8.` Let's do a quick recap of the steps taken to obtain recommendations using collaborative filtering.  
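
As part of that recap, here is a small sketch of the two similarity measures on made-up ratings (`user_a` and `user_b` are hypothetical, not rows from the dataset). It illustrates why a correlation can come out as NaN when one user's ratings have no spread, while a Euclidean distance can still be computed:

```python
import numpy as np
import pandas as pd

# Made-up ratings on three shared movies; user_a gives every movie the same score.
user_a = pd.Series([8, 8, 8])
user_b = pd.Series([6, 9, 7])

# Pearson's correlation divides by each user's standard deviation, which is 0 for user_a.
pearson = user_a.corr(user_b)

# Euclidean distance involves no such division, so it stays defined.
euclid = np.linalg.norm(user_a.to_numpy() - user_b.to_numpy())

print(pearson, euclid)
```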

In [82]:
# Check your understanding of the results by correctly filling in the dictionary below
a = "pearson's correlation and spearman's correlation"
b = 'item based collaborative filtering'
c = "there were too many ratings to get a stable metric"
d = 'user based collaborative filtering'
e = "euclidean distance and pearson's correlation coefficient"
f = "manhattan distance and euclidean distance"
g = "spearman's correlation and euclidean distance"
h = "the spread in some ratings was zero"
i = 'content based recommendation'

sol_dict = {
    'The type of recommendation system implemented here was a ...': d,# letter here,
    'The two methods used to estimate user similarity were: ': e,# letter here,
    'There was an issue with using the correlation coefficient.  What was it?':h # letter here
}

t.test_recs(sol_dict)

"That's right! All of your solutions look good!"

Additionally, let's take a closer look at some of the results.  You read in two solution files to check your results, and they gave you these objects:

* **df_dists** - a dataframe of user1, user2, euclidean distance between the two users
* **all_recs_sol** - a dictionary of all recommendations (key = user, value = list of recommendations)  

`9.` Use these two objects along with the cells below to correctly fill in the dictionary below and complete this notebook!

In [84]:
a = 567
b = 1503
c = 1319
d = 1325
e = 2526710
f = 0
g = 'Use another method to make recommendations - content based, knowledge based, or model based collaborative filtering'

sol_dict2 = {
    'For how many pairs of users were we not able to obtain a measure of similarity using correlation?':e, # letter here,
    'For how many pairs of users were we not able to obtain a measure of similarity using euclidean distance?':f, # letter here,
    'For how many users were we unable to make any recommendations for using collaborative filtering?':c, # letter here,
    'For how many users were we unable to make 10 recommendations for using collaborative filtering?':d, # letter here,
    'What might be a way for us to get 10 recommendations for every user?':g # letter here   
}

t.test_recs2(sol_dict2)

"That's right! All of your solutions look good!"

In [None]:
# Use the cells below for any work you need to do!

In [79]:
# Users without recs

count = 0

for k,v in all_recs.items():
    
    if len(v) == 0:
        
        count+= 1

print(count)

1319


In [70]:
# Rows with a NaN euclidean distance: the selection is empty, so each column sums to 0.0
df_dists[df_dists["eucl_dist"].isnull()].sum()

user1        0.0
user2        0.0
eucl_dist    0.0
dtype: float64
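
Counting the NaN entries directly may be a more explicit way to answer the same question. The sketch below uses a made-up `toy_dists` frame (not the real `df_dists`) that contains one missing distance:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_dists with one missing distance, for illustration only.
toy_dists = pd.DataFrame({
    "user1": [2, 2, 3],
    "user2": [3, 7, 7],
    "eucl_dist": [1.0, np.nan, 2.5],
})

# isnull() marks each missing value as True; summing the booleans counts them.
n_missing = int(toy_dists["eucl_dist"].isnull().sum())
print(n_missing)
```

Applied to `df_dists`, the same expression would return a single count instead of one sum per column.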

In [82]:
# Users with fewer than 10 recs (counted on the solution dictionary)

count = 0

for k,v in all_recs_sol.items():
    
    if len(v) < 10:
        
        count+= 1

print(count)


1325
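
Both counting loops follow the same pattern, so they could be folded into one helper. A sketch, assuming a hypothetical function `count_users_below` and a toy dictionary in place of `all_recs`:

```python
def count_users_below(recs_by_user, min_recs):
    """Count users whose recommendation list has fewer than min_recs entries."""
    return sum(1 for recs in recs_by_user.values() if len(recs) < min_recs)

# Toy stand-in for all_recs; the titles are placeholders.
toy_recs = {2: ["Movie A", "Movie B", "Movie C"], 26: [], 1503: ["Movie D"]}

no_recs = count_users_below(toy_recs, 1)           # users with no recommendations at all
fewer_than_three = count_users_below(toy_recs, 3)  # users with fewer than 3 recommendations
print(no_recs, fewer_than_three)
```

With the real dictionaries, `count_users_below(all_recs, 1)` and `count_users_below(all_recs_sol, 10)` would correspond to the two loops above.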
