# Collaborative Filtering

# Imports

In [14]:
import pandas as pd
import numpy as np

# Data

In [15]:
user_ratings = pd.read_csv('user_ratings.csv', index_col=False)
user_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


# Exercises

> # 1. Pivoting your data

Transform the user_ratings DataFrame to a DataFrame containing ratings with one row per user and one column per movie and call it user_ratings_table.

In [16]:
user_ratings = user_ratings[['userId', 'title', 'rating']]
user_ratings.sample(15)

Unnamed: 0,userId,title,rating
50328,251,Toy Story 3 (2010),5.0
38732,232,Ray (2004),4.0
36995,425,Harry Potter and the Chamber of Secrets (2002),3.5
77741,600,"Bachelor, The (1999)",2.5
50325,232,Toy Story 3 (2010),5.0
55468,65,Into the Wild (2007),5.0
64294,186,Hollow Man (2000),4.0
92056,177,To Have and Have Not (1944),2.5
77812,274,Behind Enemy Lines (2001),2.5
68947,28,"French Connection, The (1971)",4.0


In [17]:
#user_ratings_pivot = user_ratings.pivot(index='userId',
#                                       columns='title',
#                                       values='rating')
# This raised an error; the fix is described at the link below:
# https://www.statology.org/valueerror-index-contains-duplicate-entries-cannot-reshape/#:~:text=How%20to%20Fix%3A%20ValueError%3A%20Index%20contains%20duplicate%20entries%2C%20cannot%20reshape,-One%20error%20you&text=This%20error%20usually%20occurs%20when,share%20the%20same%20index%20values.

In [20]:
user_ratings_pivot = user_ratings.pivot_table(index='userId', columns='title', values='rating', aggfunc='mean')
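
The commented-out `pivot` call fails when the same user and title pair appears more than once, because `pivot` has no rule for combining the duplicates. `pivot_table` aggregates them instead. A minimal sketch on toy data (the `ratings` frame and its values are made up for illustration):

```python
import pandas as pd

# Toy ratings (hypothetical values): user 1 rated "Movie A" twice
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "title": ["Movie A", "Movie A", "Movie B"],
    "rating": [4.0, 5.0, 3.0],
})

# pivot() cannot decide which of the duplicate ratings to keep
try:
    ratings.pivot(index="userId", columns="title", values="rating")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates duplicates, here by taking their mean
table = ratings.pivot_table(index="userId", columns="title",
                            values="rating", aggfunc="mean")
print(pivot_failed)
print(table)
```

Taking the mean is only one option; `aggfunc='last'` or dropping duplicates beforehand may suit data where a later rating should override an earlier one.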

In [21]:
user_ratings_pivot.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


> # 2. Finding similar users

Collaborative filtering is built around the premise that users who have rated items similarly in the past have similar tastes, and are therefore likely to rate new items in a similar fashion.

A subset of the movies dataset has been loaded as user_ratings_subset. The DataFrame contains user ratings with a row for each user and a column for each movie.

Examine user_ratings_subset. Which user is most similar to User A?

![](subset_user_ratings.png)

User_B is most similar to User_A because they gave similar ratings to the same movies, Pulp Fiction and The Matrix.

> # 3. Challenges with missing values

You may have noticed that the pivoted DataFrames you have been working with often have missing data. This is to be expected since users rarely see all movies, and most movies are not seen by everyone, resulting in gaps in the user-rating matrix.

In this exercise, you will explore another subset of the user ratings table user_ratings_subset that has missing values and observe how different approaches in dealing with missing data may impact its usability.

- Fill the gaps in the user_ratings_subset with zeros.
- Print and inspect the results.

In [5]:
user_ratings_subset = pd.read_csv('user_ratings_subset2.csv', index_col=0)

In [6]:
user_ratings_subset

Unnamed: 0,Forrest Gump,Pulp Fiction,Toy Story,The Matrix
User_A,10,9,7,
User_B,10,9,7,0.0
User_C,10,9,7,8.0


In [7]:
# Fill in missing values with 0
user_ratings_table_filled = user_ratings_subset.fillna(0)

# Inspect the result
print(user_ratings_table_filled)

                   Forrest Gump   Pulp Fiction   Toy Story   The Matrix
User_A                        10              9           7         0.0
User_B                        10              9           7         0.0
User_C                        10              9           7         8.0


### Question

Based on this user_ratings_table_filled, who now looks most similar to User_A?

Possible Answers

- Both User B and User C

- User B ✔️

- User C
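
To see why the zero fill misleads, the rows of `user_ratings_table_filled` can be compared numerically. The sketch below uses a small hand-written `cosine` helper (not part of the exercise) on the three rows printed above:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rows of user_ratings_table_filled:
# Forrest Gump, Pulp Fiction, Toy Story, The Matrix
user_a = np.array([10, 9, 7, 0.0])   # The Matrix was missing, filled with 0
user_b = np.array([10, 9, 7, 0.0])   # The Matrix was genuinely rated 0
user_c = np.array([10, 9, 7, 8.0])

sim_ab = cosine(user_a, user_b)
sim_ac = cosine(user_a, user_c)
print(round(sim_ab, 3), round(sim_ac, 3))
```

User_A and User_B become identical vectors, even though User_A never rated The Matrix and User_B disliked it. The fill value is being read as an opinion.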

> # 4. Compensating for incomplete data

For most datasets, the majority of users will have rated only a small number of items. As you saw in the last exercise, how you deal with users who do not have ratings for an item can greatly influence the validity of your models.

In this exercise, you will fill in missing data with information that should not bias the data that you do have.

You'll get the average score each user has given across all their ratings, and then use this average to center the users' scores around zero. Finally, you'll be able to fill in the empty values with zeros, which is now a neutral score, minimizing the impact on their overall profile, but still allowing the comparison of users.


- Find the average of the ratings given by each user in user_ratings_table and store them as avg_ratings.
- Subtract the row averages from each row in user_ratings_table, and store it as user_ratings_table_centered.
- Fill the empty values in the newly created user_ratings_table_centered with zeros.

In [8]:
# Get the average rating for each user 
avg_ratings = user_ratings_subset.mean(axis=1)

# Center each user's ratings around 0
user_ratings_table_centered = user_ratings_subset.sub(avg_ratings, axis=0)

# Fill in the missing data with 0s
user_ratings_table_normed = user_ratings_table_centered.fillna(0)

In [9]:
user_ratings_table_normed

Unnamed: 0,Forrest Gump,Pulp Fiction,Toy Story,The Matrix
User_A,1.333333,0.333333,-1.666667,0.0
User_B,3.5,2.5,0.5,-6.5
User_C,1.5,0.5,-1.5,-0.5
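
As a rough check of what centering buys, the sketch below rebuilds the same three-user table, centers it, and compares User_A with the other two. The `cosine` helper is hand-written for illustration:

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame(
    {"Forrest Gump": [10, 10, 10], "Pulp Fiction": [9, 9, 9],
     "Toy Story": [7, 7, 7], "The Matrix": [np.nan, 0.0, 8.0]},
    index=["User_A", "User_B", "User_C"],
)

# Center each row on that user's own mean (NaN is skipped), then fill gaps with 0
centered = ratings.sub(ratings.mean(axis=1), axis=0).fillna(0)

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

user_a = centered.loc["User_A"].to_numpy()
sim_ab = cosine(user_a, centered.loc["User_B"].to_numpy())
sim_ac = cosine(user_a, centered.loc["User_C"].to_numpy())
print(round(sim_ab, 3), round(sim_ac, 3))
```

After centering, User_C comes out closer to User_A than User_B does, because the filled zero now means "no opinion" rather than "lowest possible score".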


> # User-based to item-based

By now you have a dataset with no empty values that is primed for use.

In the preceding video, you learned about both user-based recommendations and item-based recommendations. User-based recommendations compare amongst users, and item-based recommendations compare different items.

In other words, you could use user-based data to find similar users based on how they rated different movies, while you could use item-based data to find similar movies based on how they have been rated by the users.

In this exercise, you will switch between the two and compare their values.

user_ratings_subset, a subset of the user-based DataFrame you have been working with, has been loaded for you.

**Question**

> Based on the data in user_ratings_subset, which user is most similar to User_A?

Possible Answers

A) User_B ✔️

B) User_C

C) User_D

In [10]:
# Data 
my_dict = {
    "The Sandlot Oceans": [1, 1, 4, 4],
    "Eleven": [4, 5, 2, 1],
    "The Lion King": [1, 1, 5, 4],
    "John Wick": [5, 4, 2, 2]
}

indexes = ['User_A', 'User_B', 'User_C', 'User_D']

In [11]:
user_ratings_subset = pd.DataFrame(my_dict, index=indexes)
user_ratings_subset

Unnamed: 0,The Sandlot Oceans,Eleven,The Lion King,John Wick
User_A,1,4,1,5
User_B,1,5,1,4
User_C,4,2,5,2
User_D,4,1,4,2


- Transpose the `user_ratings_subset` table so that it is indexed by the movies and store the result as `movie_ratings_subset`.

In [12]:
# Transpose the user_ratings_subset DataFrame
movie_ratings_subset = user_ratings_subset.T

movie_ratings_subset

Unnamed: 0,User_A,User_B,User_C,User_D
The Sandlot Oceans,1,1,4,4
Eleven,4,5,2,1
The Lion King,1,1,5,4
John Wick,5,4,2,2


**Question**

Based on this new transposed data, what movie appears most similar to The Sandlot?

Possible Answers

A) Pulp Fiction

B) The Lion King ✔️

C) John Wick
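
The answer can also be checked numerically. The sketch below transposes the table and computes item-to-item cosine similarities with NumPy; the titles are copied exactly as they appear in the table above, and the variable names are illustrative:

```python
import numpy as np
import pandas as pd

# Titles and values copied from the table above; rows are users
user_ratings = pd.DataFrame(
    {"The Sandlot Oceans": [1, 1, 4, 4], "Eleven": [4, 5, 2, 1],
     "The Lion King": [1, 1, 5, 4], "John Wick": [5, 4, 2, 2]},
    index=["User_A", "User_B", "User_C", "User_D"],
)
movie_ratings = user_ratings.T  # rows are now movies

# Scale each movie vector to unit length; dot products are then cosine similarities
values = movie_ratings.to_numpy(dtype=float)
unit = values / np.linalg.norm(values, axis=1, keepdims=True)
similarity = pd.DataFrame(unit @ unit.T,
                          index=movie_ratings.index,
                          columns=movie_ratings.index)

target = "The Sandlot Oceans"
# Drop the movie itself before looking for the best match
most_similar = similarity[target].drop(target).idxmax()
print(most_similar)
```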

> # Similar and different movie ratings

Some types of movies might be liked by one group of people, but hated by another. This might reflect the type of movie far more than its quality. Take, for example, horror movies — many people absolutely love them, while others hate them.

By understanding which movies were reviewed in a similar way, we can often find very similar movies.

In this exercise, you will compare movies and see whether they have received similar reviewing patterns.

The DataFrame `movie_ratings_centered` has been loaded with a row per movie, and the centered ratings it received as the values.

1. Assign the values for Star Wars: Episode IV and Star Wars: Episode V to sw_IV and sw_V.
Find their cosine similarity.

2. Find the cosine similarity between the ratings for Jurassic Park (jurassic_park) and Pulp Fiction (pulp_fiction).


```python
from sklearn.metrics.pairwise import cosine_similarity

# Assign the arrays to variables
sw_IV = movie_ratings_centered.loc['Star Wars: Episode IV - A New Hope (1977)', :].values.reshape(1, -1)
sw_V = movie_ratings_centered.loc['Star Wars: Episode V - The Empire Strikes Back (1980)', :].values.reshape(1, -1)

# Find the similarity between two Star Wars movies
similarity_A = cosine_similarity(sw_IV, sw_V)
print(similarity_A)
```

```python
# Assign the arrays to variables
jurassic_park = movie_ratings_centered.loc['Jurassic Park (1993)', :].values.reshape(1, -1)
pulp_fiction = movie_ratings_centered.loc['Pulp Fiction (1994)', :].values.reshape(1, -1)

# Find the similarity between Pulp Fiction and Jurassic Park
similarity_B = cosine_similarity(jurassic_park, pulp_fiction)
print(similarity_B)
```
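
For reference, the standard cosine similarity between two rating vectors $a$ and $b$ (here, the centered ratings of two movies across all users) is:

```latex
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
           = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}
```

The value lies between -1 and 1. With centered ratings, a positive value suggests the two movies tend to be rated above or below each user's average together, and a negative value suggests the opposite. The `reshape(1, -1)` calls turn each 1-D row into a single-row 2-D array, which is the shape `cosine_similarity` expects.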

> ## As an exercise, redo this part using other movies and our original `user_ratings` dataset.

> # Finding similarly liked movies

Just like you calculated the similarity between two movies, you can calculate it across all users to find the most similar movie to another based on how users have rated them.

The approach is similar to how you worked with content-based filtering.

You will find the similarity scores between all movies and then drill down on the movie of interest by isolating and sorting the column containing its similarity scores.

`movie_ratings_centered` has once again been loaded, containing each movie as a row, and their centered ratings stored as the values.


- Calculate the similarity matrix between all movies in movie_ratings_centered and store it as similarities.
- Wrap the similarities matrix in a DataFrame, with the indices of movie_ratings_centered as the columns and rows.

```python
from sklearn.metrics.pairwise import cosine_similarity

# Generate the similarity matrix
similarities = cosine_similarity(movie_ratings_centered)

# Wrap the similarities in a DataFrame
cosine_similarity_df = pd.DataFrame(similarities, index=movie_ratings_centered.index, 
                                    columns=movie_ratings_centered.index)

# Find the similarity values for a specific movie
cosine_similarity_series = cosine_similarity_df.loc['Star Wars: Episode IV - A New Hope (1977)']

# Sort these values highest to lowest
ordered_similarities = cosine_similarity_series.sort_values(ascending=False)

print(ordered_similarities)
```

**Output**
```
 title
    Star Wars: Episode IV - A New Hope (1977)                                         1.000e+00
    Star Wars: Episode V - The Empire Strikes Back (1980)                             5.357e-01
    Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)    7.840e-02
    Lord of the Rings: The Fellowship of the Ring, The (2001)                         2.058e-02
    Schindler's List (1993)                                                           1.360e-02
    Terminator 2: Judgment Day (1991)                                                -9.149e-04
    Shawshank Redemption, The (1994)                                                 -2.903e-02
    Usual Suspects, The (1995)                                                       -3.002e-02
    Matrix, The (1999)                                                               -3.638e-02
    Apollo 13 (1995)                                                                 -4.985e-02
    Silence of the Lambs, The (1991)                                                 -5.016e-02
    Jurassic Park (1993)                                                             -7.222e-02
    Pulp Fiction (1994)                                                              -8.387e-02
    American Beauty (1999)                                                           -8.889e-02
    Toy Story (1995)                                                                 -1.321e-01
    Forrest Gump (1994)                                                              -1.431e-01
    Braveheart (1995)                                                                -1.481e-01
    Seven (a.k.a. Se7en) (1995)                                                      -1.521e-01
    Fight Club (1999)                                                                -2.036e-01
    Independence Day (a.k.a. ID4) (1996)                                             -2.364e-01
 ```
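
The first entry in the sorted output is always the movie itself, with a similarity of 1. A small helper can drop it and return only the top matches. This is a sketch on made-up data: `top_similar` is a hypothetical function, and the similarity matrix is built with NumPy so the block stays self-contained:

```python
import numpy as np
import pandas as pd

# Hypothetical centered ratings: rows are movies, columns are users
movie_ratings_centered = pd.DataFrame(
    [[1.0, -1.0, 0.5], [0.8, -0.9, 0.4], [-1.0, 1.0, 0.0]],
    index=["Movie A", "Movie B", "Movie C"],
    columns=["u1", "u2", "u3"],
)

# Unit-normalize the rows so that dot products equal cosine similarities
values = movie_ratings_centered.to_numpy()
unit = values / np.linalg.norm(values, axis=1, keepdims=True)
cosine_similarity_df = pd.DataFrame(unit @ unit.T,
                                    index=movie_ratings_centered.index,
                                    columns=movie_ratings_centered.index)

def top_similar(title, n=2):
    """Return the n most similar titles, excluding the title itself."""
    scores = cosine_similarity_df.loc[title].drop(title)
    return scores.sort_values(ascending=False).head(n)

top = top_similar("Movie A", n=1)
print(top)
```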

# Using K-nearest neighbors

- Find similar items, even if an item is not similar to any other item the user has rated. 

## User-user similarity

```python
similarities = cosine_similarity(user_ratings_pivot)
cosine_similarity_df = pd.DataFrame(similarities, 
                                    index=user_ratings_pivot.index,
                                    columns=user_ratings_pivot.index)
cosine_similarity_df.head()
```

![](img/user_user_cosine_similarity.png)

## Step by step KNN

```python
user_similarity_series = user_similarities.loc['user_001']
ordered_similarities = user_similarity_series.sort_values(ascending=False)
nearest_neighbors = ordered_similarities[1:4].index
print(nearest_neighbors)
```

**Output**
```python
user_007
user_042
user_003
```

```python
# Ratings
neighbor_ratings = user_ratings_table.reindex(nearest_neighbors)
neighbor_ratings['Catch-22'].mean()
```

**Output**
```
3.2
```

- This is the rating user_001 is likely to give the movie Catch-22, based on the ratings of their nearest neighbors.
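
The steps above can be tried end to end on toy data. In this sketch all similarity scores and ratings are invented, and the target user is removed by label instead of by position, which avoids depending on it sorting first:

```python
import numpy as np
import pandas as pd

# Hypothetical similarity scores between user_001 and everyone (including itself)
user_similarity_series = pd.Series(
    {"user_001": 1.0, "user_002": 0.1, "user_003": 0.6,
     "user_007": 0.9, "user_042": 0.8}
)
# Hypothetical raw ratings; NaN means the user has not rated the movie
user_ratings_table = pd.DataFrame(
    {"Catch-22": [np.nan, 1.0, 3.0, 4.0, np.nan]},
    index=["user_001", "user_002", "user_003", "user_007", "user_042"],
)

# Drop the user itself by label, then keep the three most similar users
ordered = user_similarity_series.drop("user_001").sort_values(ascending=False)
nearest_neighbors = ordered.head(3).index

neighbor_ratings = user_ratings_table.reindex(nearest_neighbors)
# mean() skips NaN, so only neighbors who actually rated the movie contribute
predicted = neighbor_ratings["Catch-22"].mean()
n_raters = neighbor_ratings["Catch-22"].count()
print(predicted, n_raters)
```

Checking `n_raters` is worthwhile: when few of the neighbors have rated the movie, the average rests on very little evidence.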

## Using scikit-learn's KNN

```python
user_ratings_pivot.drop("Catch-22", axis=1, inplace=True)
target_user_x = user_ratings_pivot.loc[["user_001"]]
print(target_user_x)
```

<img src="img/user_001.png" alt="drawing" width="700"/>


```python
other_users_y = user_ratings_table["Catch-22"]
print(other_users_y)
```

<img src="img/other_users_y.png" alt="drawing" width="300"/>

```python
other_users_x = user_ratings_pivot[other_users_y.notnull()]
print(other_users_x)
```

<img src="img/other_users_x.png" alt="drawing" width="600"/>

```python
other_users_y.dropna(inplace=True) 
print(other_users_y)
```

<img src="img/other_users_y2.png" alt="drawing" width="300"/>



```python
from sklearn.neighbors import KNeighborsRegressor
user_knn = KNeighborsRegressor(metric='cosine', n_neighbors=3)
user_knn.fit(other_users_x, other_users_y)
user_user_pred = user_knn.predict(target_user_x)
print(user_user_pred)
```

**Output**
3.3
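
To make the regressor less of a black box, here is a hand-rolled sketch of roughly what a KNN regressor with a cosine metric and uniform weights does: compute cosine distances (1 minus cosine similarity), pick the `k` closest rows, and average their targets. `knn_cosine_predict` and the data are hypothetical, and this is not scikit-learn's actual implementation:

```python
import numpy as np

def knn_cosine_predict(X, y, x_target, k=3):
    """Average the targets of the k rows of X closest to x_target in cosine distance."""
    X = np.asarray(X, dtype=float)
    x_target = np.asarray(x_target, dtype=float)
    sims = X @ x_target / (np.linalg.norm(X, axis=1) * np.linalg.norm(x_target))
    distances = 1.0 - sims                   # cosine distance
    nearest = np.argsort(distances)[:k]      # indices of the k smallest distances
    return float(np.mean(np.asarray(y, dtype=float)[nearest]))

# Hypothetical centered ratings for four users (rows) on three other movies
other_users_x = [[1.0, -1.0, 0.5], [0.9, -0.8, 0.4],
                 [-1.0, 1.0, -0.5], [0.5, -0.5, 1.0]]
# Hypothetical raw ratings those users gave the movie being predicted
other_users_y = [4.0, 5.0, 1.0, 3.0]
target_user_x = [1.0, -0.9, 0.5]

prediction = knn_cosine_predict(other_users_x, other_users_y, target_user_x, k=2)
print(prediction)
```

The sketch assumes no row is all zeros; a user whose centered ratings are all zero has no direction, and the division would produce NaN.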

## 💻 Exercises

### 01. Stepping through K-nearest neighbors

You have just seen how K-nearest neighbors can be used to infer how someone might rate an item based on the wisdom of a (similar) crowd. In this exercise, you will step through this process yourself to ensure a good understanding of how it works.

To get you started, as you have generated similarity matrices many times before, that step has been done for you with the user similarity matrix wrapped in a DataFrame loaded as user_similarities.

This has each user as the rows and columns, and where they meet the corresponding similarity score.

In this exercise, you will be working with user_001's similarity scores, find their nearest neighbors, and based on the ratings those neighbors gave a movie, infer what rating user_001 might give it if they saw it.

- Find the IDs of user_001's 10 nearest neighbors by extracting the top 10 users in ordered_similarities and storing them as nearest_neighbors.
- Extract the ratings the users in nearest_neighbors gave from user_ratings_table as neighbor_ratings.
- Calculate the average rating these users gave to the movie Apollo 13 (1995) to infer what user_001 might give it if they had seen it.

**Preparing the data**

In [35]:
user_ratings_pivot.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


- As we can see, the `user_ratings_pivot` table contains a large number of NaN values.
- As we saw earlier in this chapter, we center each user's ratings around zero so that zero becomes a neutral score.
- Once the gaps are filled with zeros, the cosine similarity can be computed on the full table.

In [42]:
# Center the scores
# Get the average rating for each user
avg_user_score_ratings = user_ratings_pivot.mean(axis=1)
# Center each user's ratings around 0
user_ratings_table_centered = user_ratings_pivot.sub(avg_user_score_ratings, axis=0)
# Fill in the missing data with 0s
user_ratings_table_normed = user_ratings_table_centered.fillna(0)
user_ratings_table_normed.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.366379,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [43]:
# Create the user_similarities table
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(user_ratings_table_normed)
similarities

array([[ 1.00000000e+00,  1.26451574e-03,  5.52577176e-04, ...,
         7.52238457e-02, -2.57125541e-02,  1.09323166e-02],
       [ 1.26451574e-03,  1.00000000e+00,  0.00000000e+00, ...,
        -6.00082818e-03, -6.00909967e-02,  2.49992083e-02],
       [ 5.52577176e-04,  0.00000000e+00,  1.00000000e+00, ...,
        -1.30006374e-02,  0.00000000e+00,  1.95499646e-02],
       ...,
       [ 7.52238457e-02, -6.00082818e-03, -1.30006374e-02, ...,
         1.00000000e+00,  5.07144903e-02,  5.44538770e-02],
       [-2.57125541e-02, -6.00909967e-02,  0.00000000e+00, ...,
         5.07144903e-02,  1.00000000e+00, -1.24714266e-02],
       [ 1.09323166e-02,  2.49992083e-02,  1.95499646e-02, ...,
         5.44538770e-02, -1.24714266e-02,  1.00000000e+00]])

In [45]:
cosine_similarity_df = pd.DataFrame(similarities, 
                                    index=user_ratings_table_normed.index,
                                    columns=user_ratings_table_normed.index)
cosine_similarity_df.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.001265,0.000553,0.048419,0.021847,-0.045497,-0.0062,0.047013,0.01951,-0.008754,...,0.018127,-0.017172,-0.015221,-0.037059,-0.029121,0.012016,0.055261,0.075224,-0.025713,0.010932
2,0.001265,1.0,0.0,-0.017164,0.021796,-0.021051,-0.011114,-0.048085,0.0,0.003012,...,-0.050551,-0.031581,-0.001688,0.0,0.0,0.006226,-0.020504,-0.006001,-0.060091,0.024999
3,0.000553,0.0,1.0,-0.01126,-0.031539,0.0048,0.0,-0.032471,0.0,0.0,...,-0.004904,-0.016117,0.017749,0.0,-0.001431,-0.037289,-0.007789,-0.013001,0.0,0.01955
4,0.048419,-0.017164,-0.01126,1.0,-0.02962,0.013956,0.058091,0.002065,-0.005874,0.05159,...,-0.037687,0.063122,0.02764,-0.013782,0.040037,0.02059,0.014628,-0.037569,-0.017884,-0.000995
5,0.021847,0.021796,-0.031539,-0.02962,1.0,0.009111,0.010117,-0.012284,0.0,-0.033165,...,0.015964,0.012427,0.027076,0.012461,-0.036272,0.026319,0.031896,-0.001751,0.093829,-0.000278


In [46]:
user_similarities = cosine_similarity_df

In [49]:
# Isolate the similarity scores for user_1 and sort
user_similarity_series = user_similarities.loc[1]
ordered_similarities = user_similarity_series.sort_values(ascending=False)

In [51]:
# Find the top 10 most similar users
nearest_neighbors = ordered_similarities[1:11].index
nearest_neighbors

Int64Index([301, 597, 414, 477, 57, 369, 206, 535, 590, 418], dtype='int64', name='userId')

In [56]:
# Extract the ratings of the neighbors
neighbor_ratings = user_ratings_pivot.reindex(nearest_neighbors)

In [53]:
# Calculate the mean rating given by the user's nearest neighbors
print(neighbor_ratings['(500) Days of Summer (2009)'].mean())

4.5


### 02. Getting KNN data in shape

Now that you understand the ins and outs of how K-nearest neighbors works, you can leverage scikit-learn's implementation of KNN while recognizing what it is doing underneath the hood.

In the next two exercises, you will step through how to prepare your data for scikit-learn's KNN model, and then use it to make inferences about what rating a user might give a movie they haven't seen.

For consistency, you will once again be working with User_1 and the rating they would give Apollo 13 (1995) if they saw it.

The users_to_ratings DataFrame has again been loaded for you. This contains each user with its own row and each rating they gave as the values.

Similarly, user_ratings_table has been loaded, which contains the raw rating values (pre-centering and filling with zeros).

- Drop the column corresponding to the movie you are predicting for (Apollo 13 (1995)) from the users_to_ratings DataFrame in place.
- Extract the ratings for user_001 from the resulting users_to_ratings table and store them as target_user_x.
- Get the raw ratings for Apollo 13 (1995) from the user_ratings_table and store it as other_users_y.
- Select only the users from users_to_ratings that have rated the movie and store it as other_users_x.
- Drop the rows from the other_users_y target that have not rated the movie.

In [64]:
users_to_ratings = user_ratings_table_normed.copy() # normalized data (copied so the in-place drop below does not alter the original)
user_ratings_table = user_ratings_pivot # raw data

In [63]:
# Drop the column you are trying to predict
users_to_ratings.drop("(500) Days of Summer (2009)", axis=1, inplace=True)

# Get the data for the user you are predicting for
target_user_x = users_to_ratings.loc[[1]]
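
A point worth keeping in mind with `inplace=True`: plain assignment of a DataFrame does not copy it, so an in-place `drop` through the new name also changes the table it was assigned from. A small sketch with made-up data:

```python
import pandas as pd

def make_table():
    """Build a small hypothetical table of centered ratings."""
    return pd.DataFrame({"Movie A": [0.5, -0.5], "Movie B": [0.0, 1.0]},
                        index=[1, 2])

# Plain assignment: both names point at the same object
normed = make_table()
alias = normed
alias.drop("Movie A", axis=1, inplace=True)
alias_changed_original = "Movie A" not in normed.columns

# Explicit copy: the original table keeps its column
normed = make_table()
independent = normed.copy()
independent.drop("Movie A", axis=1, inplace=True)
copy_kept_original = "Movie A" in normed.columns

print(alias_changed_original, copy_kept_original)
```

If the normalized table is needed again later, for example to predict a different movie, working on a copy keeps it intact.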

In [65]:
# Get the target data from user_ratings_table
other_users_y = user_ratings_table["(500) Days of Summer (2009)"]
other_users_y

userId
1      NaN
2      NaN
3      NaN
4      NaN
5      NaN
      ... 
606    NaN
607    NaN
608    NaN
609    NaN
610    3.5
Name: (500) Days of Summer (2009), Length: 610, dtype: float64

In [66]:
# Get the data for only those that have seen the movie
other_users_x = users_to_ratings[other_users_y.notnull()]

# Remove those that have not seen the movie from the target
other_users_y.dropna(inplace=True)

In [68]:
other_users_y.head()

userId
15    4.0
18    4.0
22    0.5
41    3.5
62    4.5
Name: (500) Days of Summer (2009), dtype: float64
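
The model needs the feature rows and the target values to line up user by user. A quick sketch of that check on toy data (all values and names here are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical centered features for four users
users_to_ratings = pd.DataFrame({"Movie A": [0.5, -0.5, 0.0, 1.0]},
                                index=[1, 2, 3, 4])
# Hypothetical raw ratings for the movie being predicted; NaN means not rated
target = pd.Series([np.nan, 4.0, np.nan, 3.5], index=[1, 2, 3, 4], name="Movie B")

has_rating = target.notnull()
other_users_x = users_to_ratings[has_rating]   # features of users who rated it
other_users_y = target[has_rating]             # same rows as target.dropna()

# Both objects should now describe exactly the same users, in the same order
aligned = other_users_x.index.equals(other_users_y.index)
print(aligned, list(other_users_y.index))
```

Using the same boolean mask for both selections makes the alignment explicit and avoids modifying the target Series in place.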

### KNN predictions

With the data in the correct shape from the last exercise, you can now use it to infer how `user_001` feels about Apollo 13 (1995).

As a reminder, the data you prepared in the last exercise (and which has been loaded into this one) is:

- `target_user_x` - Centered ratings that user_001 has given to the movies they have seen.
- `other_users_x` - Centered ratings for all other users and the movies they have rated excluding the movie Apollo 13.
- `other_users_y` - Raw ratings that all other users have given the movie Apollo 13.

You will use `other_users_x` and `other_users_y` to fit a `KNeighborsRegressor` from `scikit-learn` and use it to predict what `user_001` might have rated Apollo 13 (1995).


- Import `KNeighborsRegressor` from scikit-learn.
- Instantiate the regressor as user_knn with the metric specified as cosine and n_neighbors set to 10.

- Fit the user_knn regressor on the other_users_x and other_users_y data.
- Using the trained model, predict what user_001 (whose ratings are stored in target_user_x) would have given the movie.

In [69]:
# Import the regressor
from sklearn.neighbors import KNeighborsRegressor

# Instantiate the user KNN model
user_knn = KNeighborsRegressor(metric='cosine', n_neighbors=10)

In [70]:
# Fit the model and predict the target user
user_knn.fit(other_users_x, other_users_y)
user_user_pred = user_knn.predict(target_user_x)

print(user_user_pred)

[3.95]


> Perfect! One advantage of using a library like scikit-learn for these steps is that you are able to iterate easily. For example, you can try the above again, but this time with a different n_neighbors value, or even try to replace KNeighborsRegressor with KNeighborsClassifier to find the most common neighbors' rating as opposed to their average.

In [81]:
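# Import the regressor and the classifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import KNeighborsClassifier

# Instantiate the user KNN model with 25 neighbors
# (no metric is passed, so the default Minkowski/Euclidean distance is used instead of cosine)
user_knn = KNeighborsRegressor(n_neighbors=25)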
# Fit the model and predict the target user
user_knn.fit(other_users_x, other_users_y)
user_user_pred = user_knn.predict(target_user_x)
print(user_user_pred)

[3.9]
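
The difference between the regressor's average and the classifier's "most common rating" can be seen without any model. The neighbor ratings below are invented for illustration:

```python
from collections import Counter
from statistics import mean

# Hypothetical ratings given by a user's nearest neighbors
neighbor_ratings = [4.0, 4.0, 3.5, 5.0, 4.0, 2.0]

# Regressor-style estimate: the average of the neighbors' ratings
average_rating = mean(neighbor_ratings)

# Classifier-style estimate: the rating that occurs most often
most_common_rating, votes = Counter(neighbor_ratings).most_common(1)[0]

print(average_rating, most_common_rating, votes)
```

One caveat, stated as an assumption to verify: scikit-learn classifiers expect discrete class labels, so half-star ratings stored as floats may need converting (for example to strings) before being passed to `KNeighborsClassifier`.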
