# Collaborative Filtering
  
Discover new items to recommend to users by finding others with similar tastes. Learn to make user-based and item-based recommendations—and in what context they should be used. Use k-nearest neighbors models to leverage the wisdom of the crowd and predict how someone might rate an item they haven’t yet encountered.

## Resources
  
**Notebook Syntax**
  
<span style='color:#7393B3'>NOTE:</span>  
- Denotes additional information deemed to be *contextually* important
- Colored in blue, HEX #7393B3
  
<span style='color:#E74C3C'>WARNING:</span>  
- Significant information that is *functionally* critical  
- Colored in red, HEX #E74C3C
  
---
  
**Links**
  
[NumPy Documentation](https://numpy.org/doc/stable/user/index.html#user)  
[Pandas Documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)  
[Matplotlib Documentation](https://matplotlib.org/stable/index.html)  
[Seaborn Documentation](https://seaborn.pydata.org)  
  
---
  
**Notable Functions**
  
<table>
  <tr>
    <th>Index</th>
    <th>Function</th>
    <th>Use</th>
  </tr>
  <tr>
    <td>1</td>
    <td>pandas.crosstab()</td>
    <td>Compute a cross-tabulation of two or more factors.</td>
  </tr>
  <tr>
    <td>2</td>
    <td>sklearn.metrics.jaccard_score()</td>
    <td>Compute the Jaccard similarity coefficient score between two sets.</td>
  </tr>
  <tr>
    <td>3</td>
    <td>DataFrame.column.str.get_dummies()</td>
    <td>Create dummy/indicator variables for categorical variables.</td>
  </tr>
  <tr>
    <td>4</td>
    <td>DataFrame.set_index()</td>
    <td>Set the DataFrame index (row labels) using one or more existing columns.</td>
  </tr>
  <tr>
    <td>5</td>
    <td>scipy.spatial.distance.pdist()</td>
    <td>Compute pairwise distances between observations in a dataset.</td>
  </tr>
  <tr>
    <td>6</td>
    <td>scipy.spatial.distance.squareform()</td>
    <td>Convert the output of <code>pdist</code> to a square, symmetric distance matrix.</td>
  </tr>
  <tr>
    <td>7</td>
    <td>DataFrame.sort_values()</td>
    <td>Sort a DataFrame by one or more columns.</td>
  </tr>
  <tr>
    <td>8</td>
    <td>sklearn.feature_extraction.text.TfidfVectorizer()</td>
    <td>Convert a collection of raw documents to a matrix of TF-IDF features.</td>
  </tr>
  <tr>
    <td>9</td>
    <td>TfidfVectorizer.fit_transform()</td>
    <td>Learn vocabulary and idf, and return the document-term matrix.</td>
  </tr>
  <tr>
    <td>10</td>
    <td>TfidfVectorizer.get_feature_names_out()</td>
    <td>Get feature names for the output of <code>TfidfVectorizer</code>.</td>
  </tr>
  <tr>
    <td>11</td>
    <td>sklearn.metrics.pairwise.cosine_similarity()</td>
    <td>Compute cosine similarity between samples in two arrays or dataframes.</td>
  </tr>
  <tr>
    <td>12</td>
    <td>DataFrame.index.duplicated()</td>
    <td>Check for duplicated indices in a DataFrame.</td>
  </tr>
  <tr>
    <td>13</td>
    <td>DataFrame.reindex()</td>
    <td>Change to a new index using specified indexers.</td>
  </tr>
  <tr>
    <td>14</td>
    <td>DataFrame.drop()</td>
    <td>Delete specified labels from rows or columns in a DataFrame.</td>
  </tr>
  <tr>
    <td>15</td>
    <td>DataFrame.values.reshape(1,-1)</td>
    <td>Reshape the values in a DataFrame into a new shape.</td>
  </tr>
</table>
  
---
  
**Language and Library Information**  
  
Python 3.11.0  
  
Name: numpy  
Version: 1.24.3  
Summary: Fundamental package for array computing in Python  
  
Name: pandas  
Version: 2.0.3  
Summary: Powerful data structures for data analysis, time series, and statistics  
  
Name: matplotlib  
Version: 3.7.2  
Summary: Python plotting package  
  
Name: seaborn  
Version: 0.12.2  
Summary: Statistical data visualization  
  
Name: scikit-learn  
Version: 1.3.0  
Summary: A set of python modules for machine learning and data mining  
  
Name: scipy  
Version: 1.10.1  
Summary: Fundamental algorithms for scientific computing in Python  
  
---
  
**Miscellaneous Notes**
  
<span style='color:#7393B3'>NOTE:</span>  
  
`python3.11 -m IPython` : Runs an interactive IPython shell under Python 3.11 in the terminal.
  
`nohup ./relo_csv_D2S.sh > ./output/relo_csv_D2S.log &` : Runs the CSV data pipeline script in the background, writing its output to a log file.  
  
`print(inspect.getsourcelines(test))` : Prints the source lines of the user-defined function `test` (requires `import inspect`).  
  
<span style='color:#7393B3'>NOTE:</span>  
  
Snippet to plot all built-in matplotlib styles :
  
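```python
import numpy as np
import matplotlib.pyplot as plt

# Sample cubic curve to draw in every style
x = np.arange(-2, 8, .1)
y = 0.1 * x ** 3 - x ** 2 + 3 * x + 2
fig = plt.figure(dpi=100, figsize=(10, 20), tight_layout=True)
# 'default' is not listed in plt.style.available, so add it explicitly
available = ['default'] + plt.style.available
for i, style in enumerate(available):
    # Apply the style only while this subplot is created and drawn
    with plt.style.context(style):
        ax = fig.add_subplot(10, 3, i + 1)
        ax.plot(x, y)
    ax.set_title(style)
```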
  

In [1]:
import numpy as np                  # Numerical Python:         Arrays and linear algebra
import pandas as pd                 # Panel Datasets:           Dataset manipulation
import matplotlib.pyplot as plt     # MATLAB Plotting Library:  Visualizations
import seaborn as sns               # Seaborn:                  Visualizations

# Setting a standard figure size
plt.rcParams['figure.figsize'] = (8, 8)

# Setting a standard style
plt.style.use('ggplot')

# Set the maximum number of columns to be displayed
pd.set_option('display.max_columns', 50)

## Collaborative filtering
  
In the last chapter, we used the items a customer liked to suggest other similar items. This works well when we have a lot of information about the items, but not much data on how people feel about them. In this chapter, we will find the users whose preferences are most similar to those of the user we are making recommendations for, and make suggestions based on that group's preferences.
  
**Collaborative filtering**
  
This form of recommendation is called collaborative filtering. Collaborative filtering is the name given to the prediction, or filtering, of items that might interest a user based on the preferences of similar users. It works on the premise that if person A has similar tastes to persons B and C,
  
<center><img src='../_images/collaborative-filtering-rec-sys.png' alt='img' width='740'></center>
  
**Collaborative filtering**
  
and both person B and C also like a certain item,
  
<center><img src='../_images/collaborative-filtering-rec-sys1.png' alt='img' width='740'></center>
  
**Collaborative filtering**
  
then it is likely that person A would also like that new item.
  
<center><img src='../_images/collaborative-filtering-rec-sys2.png' alt='img' width='740'></center>
  
**Finding similar users**
  
But how do we go about programmatically finding users with similar interests? Rating data is often difficult to compare between users. Even here it is not immediately clear how User_1 and User_2 compare.
  
<center><img src='../_images/collaborative-filtering-rec-sys3.png' alt='img' width='740'></center>
  
**Finding similar users**
  
We need to get this data into a matrix of users and the items they rated. Now we can see which items both users have rated. Based on this matrix, we can compare across users: here it is apparent that User_1 and User_3 have more similar preferences than User_1 and User_2.
  
<center><img src='../_images/collaborative-filtering-rec-sys4.png' alt='img' width='740'></center>
  
**Working with real data**
  
Time for some real data! We will continue working with the book ratings dataset from the previous chapters containing each user, the book they rated, and the rating score.
  
<center><img src='../_images/collaborative-filtering-rec-sys5.png' alt='img' width='740'></center>
  
**Pivoting our data**
  
As the data is in a DataFrame, pandas' `.pivot()` method can be used to reshape the data around specified columns. We want the users as the `index=`, the `columns=` representing the books, and the ratings as the corresponding `values=` like you see here.
  
<center><img src='../_images/collaborative-filtering-rec-sys6.png' alt='img' width='740'></center>
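  
As a minimal sketch of this reshaping, the toy example below pivots a small long-format table. The `ratings` DataFrame, its column names (`user`, `book`, `rating`), and its values are made up for illustration and are not the chapter's dataset.
  
```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, book) pair
ratings = pd.DataFrame({
    'user': ['User_1', 'User_1', 'User_2', 'User_2', 'User_3'],
    'book': ['The Hobbit', 'Dune', 'The Hobbit', 'Emma', 'Dune'],
    'rating': [5, 4, 3, 5, 4],
})

# Users become the index, books the columns, and ratings the cell values
ratings_matrix = ratings.pivot(index='user', columns='book', values='rating')
print(ratings_matrix)
```
  
Note that `.pivot()` raises a `ValueError` if the same user and book pair appears more than once, whereas `.pivot_table()` aggregates such duplicates (with the mean by default).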
  
**Data sparsity**
  
The first thing that may become apparent after this transform is the number of missing entries, shown by the NaN values. This is expected: a user will rarely have rated every item, and it is similarly rare for an item to have been rated by every person. This is an issue, as most similarity metrics do not handle missing data well. How can we deal with this? We cannot simply drop every row and column that has missing data because, with data this sparse, that could be the whole DataFrame!
  
<center><img src='../_images/collaborative-filtering-rec-sys7.png' alt='img' width='740'></center>
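  
To make the sparsity problem concrete, the sketch below measures the share of empty cells in a small, made-up ratings matrix and shows what dropping incomplete rows would leave behind. The `ratings_matrix` values and the `sparsity` variable are illustrative assumptions.
  
```python
import numpy as np
import pandas as pd

# Hypothetical user-by-book matrix; NaN marks an item the user has not rated
ratings_matrix = pd.DataFrame(
    {'Dune': [4, np.nan, 4],
     'Emma': [np.nan, 5, np.nan],
     'The Hobbit': [5, 3, np.nan]},
    index=['User_1', 'User_2', 'User_3'],
)

# Share of cells that are missing across the whole matrix
sparsity = ratings_matrix.isna().sum().sum() / ratings_matrix.size
print(f'{sparsity:.0%} of the cells are empty')

# Every user has at least one gap, so dropping incomplete rows removes them all
print(ratings_matrix.dropna().shape)
```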
  
**Filling the missing values**
  
Similarly, you might suggest filling the empty values with 0s, which might be valid for some machine learning models, but can create issues with recommendation engines. Take for example the second user here. They loved Catcher in the Rye, and enjoyed Fifty Shades of Grey, but have not rated The Great Gatsby. If we were to fill this NaN with a 0, we would be incorrectly implying they greatly disliked the book compared to the others, which we can't say for sure.
  
<center><img src='../_images/collaborative-filtering-rec-sys8.png' alt='img' width='740'></center>
  
**Filling the missing values**
  
One alternative is to center each user's ratings around 0 by deducting the row average and then fill in the missing values with 0. This means the missing data is replaced with neutral scores.
  
<center><img src='../_images/collaborative-filtering-rec-sys9.png' alt='img' width='740'></center>
  
**Filling the missing values**
  
We first find the row means, then subtract each mean from the values in its row. You can see the rows centered around 0 here.
  
<center><img src='../_images/collaborative-filtering-rec-sys10.png' alt='img' width='740'></center>
  
**Filling the missing values**
  
We then fill the NaNs with 0s. This is not a perfect solution, as the values lose some of their interpretability and should not be used as predictions in themselves, but they suffice when comparing between users.
  
<center><img src='../_images/collaborative-filtering-rec-sys11.png' alt='img' width='740'></center>
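  
The centering and filling steps can be sketched on a toy matrix as follows. The data and the names `row_means`, `centered`, and `filled` are assumptions chosen for illustration.
  
```python
import numpy as np
import pandas as pd

# Hypothetical user-by-book matrix with gaps
ratings_matrix = pd.DataFrame(
    {'Dune': [4.0, np.nan, 2.0],
     'Emma': [np.nan, 5.0, 4.0],
     'The Hobbit': [2.0, 3.0, np.nan]},
    index=['User_1', 'User_2', 'User_3'],
)

# Average rating per user; NaN values are skipped by default
row_means = ratings_matrix.mean(axis=1)

# Subtract each user's mean from that user's row (axis=0 aligns on the index)
centered = ratings_matrix.sub(row_means, axis=0)

# Missing ratings become 0, which is now each user's neutral score
filled = centered.fillna(0)
print(filled)
```
  
After centering, a positive value means the user rated the item above their own average and a negative value means below it, so a filled-in 0 reads as neither liked nor disliked.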
  
**Let's practice!**
  
We can now calculate similarities between users and we will get to that soon, but first let's work through shaping the data!

### Pivoting your data
  
In this chapter, you will go one step further in generating personalized recommendations — you will find items that users, similar to the one you are making recommendations for, have liked.
  
The first step you will need to start with is formatting your data. You begin with a dataset containing users and their ratings as individual rows with the following columns:
  
- `userId`: User ID
- `title`: Title of the movie
- `rating`: Rating the user gave the movie
  
You will need to transform the DataFrame into a user rating matrix where each row represents a user, and each column represents the movies on the platform. This will allow you to easily compare users and their preferences.
  
---
  
1. Inspect the first five rows of the `user_ratings` DataFrame to observe which columns would be most appropriate to pivot the data around.
2. Which column from `user_ratings` should become the index of the pivoted DataFrame?
   - [x] userId
   - [ ] title
   - [ ] rating
3. Transform the `user_ratings` DataFrame to a DataFrame containing ratings with one row per user and one column per movie and call it `user_ratings_table`.

In [2]:
user_ratings = pd.read_csv('../_datasets/user_ratings.csv')
print(user_ratings.shape)
user_ratings.head()

(100836, 6)


Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [3]:
# Getting a list of columns
print(user_ratings.columns)

Index(['userId', 'movieId', 'rating', 'timestamp', 'title', 'genres'], dtype='object')


In [4]:
# Dropping columns that are not required for this exercise
user_ratings = user_ratings.drop(['movieId', 'timestamp', 'genres'], axis=1)

# Inspect the first 5 rows of user_ratings
user_ratings.head()

Unnamed: 0,userId,rating,title
0,1,4.0,Toy Story (1995)
1,5,4.0,Toy Story (1995)
2,7,4.5,Toy Story (1995)
3,15,2.5,Toy Story (1995)
4,17,4.5,Toy Story (1995)


In [5]:
user_ratings = user_ratings.drop_duplicates(keep='first')
print(user_ratings.shape)

(100833, 3)


In [6]:
# Transform the table
user_ratings_table = user_ratings.pivot_table(index='userId', columns='title', values='rating')
# Inspect the transformed table
user_ratings_table.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...All the Marbles (1981),...And Justice for All (1979),00 Schneider - Jagd auf Nihil Baxter (1994),1-900 (06) (1994),10 (1979),10 Cent Pistol (2015),10 Cloverfield Lane (2016),10 Items or Less (2006),10 Things I Hate About You (1999),10 Years (2011),"10,000 BC (2008)",100 Girls (2000),100 Streets (2016),101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),...,Zipper (2015),Zodiac (2007),Zombeavers (2014),Zombie (a.k.a. Zombie 2: The Dead Are Among Us) (Zombi 2) (1979),Zombie Strippers! (2008),Zombieland (2009),Zone 39 (1997),"Zone, The (La Zona) (2007)",Zookeeper (2011),Zoolander (2001),Zoolander 2 (2016),Zoom (2006),Zoom (2015),Zootopia (2016),Zulu (1964),Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId
1,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,4.0,
2,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,3.0,,,,,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,
5,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,


Good work! With this data in a matrix, you will be able to compare between users much more easily.

### Challenges with missing values
  
You may have noticed that the pivoted DataFrames you have been working with often have missing data. This is to be expected since users rarely see all movies, and most movies are not seen by everyone, resulting in gaps in the user-rating matrix.
  
In this exercise, you will explore another subset of the user ratings table, `user_ratings_subset`, that has missing values and observe how different approaches to dealing with missing data may impact its usability.
  
---
  
1. Take a look at the `user_ratings_subset` that has been loaded for you. A missing value (`None`, displayed as `NaN`) represents a situation where a user has not made a rating. Based on the table, which user is most similar to `User_A`?
  
Possible answers
  
- [x] Both User_B and User_C
- [ ] User_B
- [ ] User_C
2. Fill the gaps in the `user_ratings_subset` with zeros.
3. Print and inspect the results.
4. Based on this `user_ratings_table_filled`, who now looks most similar to `User_A`?
  
Possible answers
  
- [ ] Both User_B and User_C
- [x] User_B
- [ ] User_C

In [7]:
data = {
    'Forrest Gump': [10, 10, 10],
    'Pulp Fiction': [9, 9, 9],
    'Toy Story': [7, 7, 7],
    'The Matrix': [None, 0, 8]
}

index = ['User_A', 'User_B', 'User_C']

user_ratings_subset = pd.DataFrame(data, index=index)
user_ratings_subset.index.name = 'User'
user_ratings_subset.columns.name = 'Movie'

print(user_ratings_subset)

Movie   Forrest Gump  Pulp Fiction  Toy Story  The Matrix
User                                                     
User_A            10             9          7         NaN
User_B            10             9          7         0.0
User_C            10             9          7         8.0


In [8]:
# Fill in missing values with 0
user_ratings_table_filled = user_ratings_subset.fillna(0)

# Inspect the result
print(user_ratings_table_filled)

Movie   Forrest Gump  Pulp Fiction  Toy Story  The Matrix
User                                                     
User_A            10             9          7         0.0
User_B            10             9          7         0.0
User_C            10             9          7         8.0


True, `User_B` now looks a lot more similar to `User_A` when you fill in the missing values with zero, but you know from the unfilled data this should not be the case. Merely filling in gaps with zeros without adjusting the data otherwise can cause issues by skewing the reviews more negative and should not be done.
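  
One way to see this skew numerically is to compare the users with cosine similarity after the zero fill. The sketch below rebuilds the same small table; using `cosine_similarity()` here is an illustrative choice, and the names `subset`, `filled`, and `sims` are assumptions.
  
```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Same ratings as in the exercise; User_A has not rated The Matrix
subset = pd.DataFrame(
    {'Forrest Gump': [10, 10, 10],
     'Pulp Fiction': [9, 9, 9],
     'Toy Story': [7, 7, 7],
     'The Matrix': [np.nan, 0, 8]},
    index=['User_A', 'User_B', 'User_C'],
)

# Naive approach: treat every missing rating as a 0
filled = subset.fillna(0)

# Pairwise cosine similarity between all users, labeled by user
sims = pd.DataFrame(
    cosine_similarity(filled), index=filled.index, columns=filled.index
)
print(sims.loc['User_A'].round(3))
```
  
After the fill, `User_A` and `User_B` have identical rows, so they appear to match perfectly, while `User_C` is pushed away purely because of a rating `User_A` never gave.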

### Compensating for incomplete data
  
For most datasets, the majority of users will have rated only a small number of items. As you saw in the last exercise, how you deal with users who do not have ratings for an item can greatly influence the validity of your models.
  
In this exercise, you will fill in missing data with information that should not bias the data that you do have.
  
You'll get the average score each user has given across all their ratings, and then use this average to center the users' scores around zero. Finally, you'll be able to fill in the empty values with zeros, which is now a neutral score, minimizing the impact on their overall profile, but still allowing the comparison of users.
  
`user_ratings_table` with a row per user has been loaded for you.
  
---
  
1. Find the average of the ratings given by each user in `user_ratings_table` and store them as `avg_ratings`.
2. Subtract the row averages from each row in `user_ratings_table`, and store it as `user_ratings_table_centered`.
3. Fill the empty values in the newly created `user_ratings_table_centered` with zeros, and store the result as `user_ratings_table_normed`.

In [9]:
# Get the average rating for each user 
avg_ratings = user_ratings_table.mean(axis=1)

# Center each user's ratings around 0
user_ratings_table_centered = user_ratings_table.sub(avg_ratings, axis=0)

# Fill in the missing data with 0s
user_ratings_table_normed = user_ratings_table_centered.fillna(0)

In [10]:
user_ratings_table_normed.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...All the Marbles (1981),...And Justice for All (1979),00 Schneider - Jagd auf Nihil Baxter (1994),1-900 (06) (1994),10 (1979),10 Cent Pistol (2015),10 Cloverfield Lane (2016),10 Items or Less (2006),10 Things I Hate About You (1999),10 Years (2011),"10,000 BC (2008)",100 Girls (2000),100 Streets (2016),101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),...,Zipper (2015),Zodiac (2007),Zombeavers (2014),Zombie (a.k.a. Zombie 2: The Dead Are Among Us) (Zombi 2) (1979),Zombie Strippers! (2008),Zombieland (2009),Zone 39 (1997),"Zone, The (La Zona) (2007)",Zookeeper (2011),Zoolander (2001),Zoolander 2 (2016),Zoom (2006),Zoom (2015),Zootopia (2016),Zulu (1964),Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.366379,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,-0.948276,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Great work! You will now be able to compare between rows without adding an unnecessary bias to the data when values are missing.

## Finding similarities
  
We have been focusing on finding similar users so far in this chapter. This is called user-based collaborative filtering. Comparing items instead, known as item-based collaborative filtering, is also possible.
  
**Item-based recommendations**
  
It assumes if Item A and B receive similar reviews, either positive or negative,
  
<center><img src='../_images/finding-similarities-rec-sys.png' alt='img' width='740'></center>
  
**Item-based recommendations**
  
Then however other people feel about A,
  
<center><img src='../_images/finding-similarities-rec-sys1.png' alt='img' width='740'></center>
  
**Item-based recommendations**
  
They should feel the same way about B.
  
<center><img src='../_images/finding-similarities-rec-sys2.png' alt='img' width='740'></center>
  
**User-based to item-based**
  
If we have our data prepped as we did in the last few exercises we can switch between these two approaches,
  
<center><img src='../_images/finding-similarities-rec-sys3.png' alt='img' width='740'></center>
  
**User-based to item-based**
  
by transposing the matrices giving us the items as rows and the users as columns.
  
<center><img src='../_images/finding-similarities-rec-sys4.png' alt='img' width='740'></center>
  
**User-based to item-based**
  
This can be achieved in pandas by taking the book rating DataFrame (`user_ratings_pivot`) we generated previously, shown here, and calling `.T` on it to get its transposed version. We can see the user-based matrix on top, and the corresponding item-based matrix on the bottom. We will discuss in more depth which matrix is preferred later in this chapter, but the high-level answer, as with many questions in data science, is that it depends on the data. We will focus on item-based filtering for now, as the items can be a little more relatable.
  
<center><img src='../_images/finding-similarities-rec-sys5.png' alt='img' width='740'></center>
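  
A minimal sketch of the switch, using a made-up centered matrix (the names `user_based` and `item_based` and the values are assumptions):
  
```python
import pandas as pd

# Hypothetical centered user-based matrix: one row per user
user_based = pd.DataFrame(
    {'The Hobbit': [1.0, -1.0], 'Dune': [0.0, 1.0], 'Emma': [-1.0, 0.0]},
    index=['User_1', 'User_2'],
)

# .T swaps rows and columns, giving one row per item
item_based = user_based.T
print(item_based)
```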
  
**Cosine similarities**
  
With the item-based matrix containing a row per book, shown here, we can calculate the similarities between items in the dataset, as we did with our content-based recommendations in the last chapter. We'll continue to use cosine similarity, but it is worth noting that, as we have centered the data around zero, the cosine values can now range from -1 to 1, with 1 being the most similar and -1 the least. This does not have any impact on the process, so don't be concerned if you see some negative cosine values!
  
<center><img src='../_images/finding-similarities-rec-sys6.png' alt='img' width='740'></center>
  
**Cosine similarities**
  
Even though the range of the output can be different, the way we calculate similarities is the same. Let's compare two books from our dataset, The Lord of the Rings and The Hobbit, using cosine similarity. As a reminder, `cosine_similarity()` expects 2-D arrays, so we need to do some reshaping first.
  
We first get the rows we want to compare.
  
Then we turn them into NumPy arrays with `.values`.
  
And reshape each of them into a 2-D array with a single row using `.reshape(1, -1)`. As you can see, the two books are found to be quite similar (remember the values are between -1 and 1). This is expected as they are by the same author, but if we repeat it with two very different books, we might even get a negative value.
  
<center><img src='../_images/finding-similarities-rec-sys7.png' alt='img' width='740'></center>
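  
The row selection and reshaping can be sketched as below. The `book_ratings` matrix is a made-up, already centered item-based table, not the chapter's data, so the printed similarities are illustrative only.
  
```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical centered item-based matrix: one row per book, one column per user
book_ratings = pd.DataFrame(
    {'User_1': [1.0, 0.5, -1.0],
     'User_2': [-1.0, -0.5, 1.0],
     'User_3': [0.5, 1.0, 0.0]},
    index=['The Hobbit', 'The Lord of the Rings', 'Emma'],
)

# Select a row, take its NumPy values, and reshape to a single-row 2-D array
hobbit = book_ratings.loc['The Hobbit', :].values.reshape(1, -1)
lotr = book_ratings.loc['The Lord of the Rings', :].values.reshape(1, -1)
emma = book_ratings.loc['Emma', :].values.reshape(1, -1)
print(hobbit.shape)  # one row, one column per user

# Books rated in a similar pattern score close to 1
print(cosine_similarity(hobbit, lotr))

# Books rated in opposite patterns give a negative score
print(cosine_similarity(hobbit, emma))
```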
  
**Cosine similarities**
  
Comparing items is all well and good, but of course you want to start making recommendations. Let's do so by finding the most similar items overall! To do this, we need to find the similarities between all the items at once. Just like we did with content-based recommendations, we can call `cosine_similarity()` on the full dataset, resulting in a similarity matrix between all items. Wrapping this in a DataFrame with the item names as both the index and the columns gets us a usable lookup table with the similarities for all items.
  
<center><img src='../_images/finding-similarities-rec-sys8.png' alt='img' width='740'></center>
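  
A small sketch of building that lookup table, assuming a made-up centered item-based matrix named `book_ratings` (the names `similarities` and `similarity_df` are also illustrative):
  
```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical centered item-based matrix: one row per book
book_ratings = pd.DataFrame(
    {'User_1': [1.0, 0.5, -1.0],
     'User_2': [-1.0, -0.5, 1.0],
     'User_3': [0.5, 1.0, 0.0]},
    index=['The Hobbit', 'The Lord of the Rings', 'Emma'],
)

# With a single argument, every row is compared against every other row
similarities = cosine_similarity(book_ratings)

# Label both axes with the book titles so values can be looked up by name
similarity_df = pd.DataFrame(
    similarities, index=book_ratings.index, columns=book_ratings.index
)
print(similarity_df.round(2))
```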
  
**Cosine similarities**
  
With this matrix calculated, we can make recommendations by finding the items that have been rated most similarly to one a user liked: select the item you want to compare against and sort its similarities. Here you can see that the item rated most similarly to The Hobbit was Lord of the Rings, which makes sense as they share characters and an author.
  
<center><img src='../_images/finding-similarities-rec-sys9.png' alt='img' width='740'></center>
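  
The lookup itself can be sketched as follows. The data is made up, and dropping the item's own entry before sorting is an assumption about how one might avoid recommending the item to itself.
  
```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical centered item-based matrix: one row per book
book_ratings = pd.DataFrame(
    {'User_1': [1.0, 0.5, -1.0, 0.0],
     'User_2': [-1.0, -0.5, 1.0, 0.5],
     'User_3': [0.5, 1.0, 0.0, 0.5]},
    index=['The Hobbit', 'The Lord of the Rings', 'Emma', 'Dune'],
)

# Item-by-item similarity lookup table
similarity_df = pd.DataFrame(
    cosine_similarity(book_ratings),
    index=book_ratings.index,
    columns=book_ratings.index,
)

# Select one book's similarities, drop the book itself, and sort high to low
hobbit_similar = (
    similarity_df.loc['The Hobbit']
    .drop('The Hobbit')
    .sort_values(ascending=False)
)
print(hobbit_similar)
```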
  
**Let's practice!**
  
Let us try this with the movies dataset you have been working on!

### User-based to item-based
  
By now you have a dataset with no empty values that is primed for use.
  
In the preceding section, you learned about both user-based recommendations and item-based recommendations. User-based recommendations compare users with one another, and item-based recommendations compare different items.
  
In other words, you could use user-based data to find similar users based on how they rated different movies, while you could use item-based data to find similar movies based on how they have been rated by the users.
  
In this exercise, you will switch between the two and compare their values.
  
`user_ratings_subset`, a subset of the user-based DataFrame you have been working with, has been loaded for you.
  
---
  
1. Based on the data in user_ratings_subset, which user is most similar to User_A?
  
```python
In [1]:
user_ratings_subset
Out[1]:

        The Sandlot  Ocean's Eleven  The Lion King  John Wick
User_A            1               4              1          5
User_B            1               5              1          4
User_C            4               2              5          2
User_D            4               1              4          2
```
  
Possible answers
  
- [x] User_B
- [ ] User_C
- [ ] User_D
2. Transpose the `user_ratings_subset` table so that it is indexed by the movies and store the result as `movie_ratings_subset`.
3. Based on this new transposed data, what movie appears most similar to The Sandlot?
  
```python
print(movie_ratings_subset)
                User_A  User_B  User_C  User_D
The Sandlot          1       1       4       4
Ocean's Eleven       4       5       2       1
The Lion King        1       1       5       4
John Wick            5       4       2       2
```
  
Possible answers
  
- [ ] Pulp Fiction
- [x] The Lion King
- [ ] John Wick
  
Awesome! You are now able to switch between the data needed for user-based models and item-based models. This will allow you to build recommendations using both kinds of data to see which suits your use case the best.
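  
As a rough numerical check of the two answers above, the sketch below rebuilds the small table and compares rows with cosine similarity. The helper `most_similar` is hypothetical, and the ratings here are the raw, uncentered values from the exercise.
  
```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# The subset shown in the exercise: one row per user
user_ratings_subset = pd.DataFrame(
    {'The Sandlot': [1, 1, 4, 4],
     "Ocean's Eleven": [4, 5, 2, 1],
     'The Lion King': [1, 1, 5, 4],
     'John Wick': [5, 4, 2, 2]},
    index=['User_A', 'User_B', 'User_C', 'User_D'],
)


def most_similar(table, label):
    """Hypothetical helper: row label with the highest cosine similarity to `label`."""
    sims = pd.DataFrame(
        cosine_similarity(table), index=table.index, columns=table.index
    )
    # Exclude the row itself before picking the best match
    return sims.loc[label].drop(label).idxmax()


# User-based view: compare users by how they rated the movies
print(most_similar(user_ratings_subset, 'User_A'))

# Item-based view: transpose, then compare movies by how users rated them
movie_ratings_subset = user_ratings_subset.T
print(most_similar(movie_ratings_subset, 'The Sandlot'))
```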

In [11]:
# Transpose the user_ratings_table_normed DataFrame
movie_ratings_subset = user_ratings_table_normed.T

movie_ratings_subset.head()

userId,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,...,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610
title
'71 (2014),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.311444
'Hellboy': The Seeds of Creation (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Salem's Lot (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Til There Was You (1997),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Similar and different movie ratings
  
Some types of movies might be liked by one group of people, but hated by another. This might reflect the type of movie far more than its quality. Take, for example, horror movies — many people absolutely love them, while others hate them.
  
By understanding which movies were reviewed in a similar way, we can often find very similar movies.
  
In this exercise, you will compare movies and see whether they have received similar reviewing patterns.
  
The DataFrame `movie_ratings_centered` has been loaded with a row per movie, and the centered ratings it received as the values.
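
For reference, a table shaped like `movie_ratings_centered` could be produced roughly as follows. This is a minimal sketch on made-up ratings; the toy data and the name `movie_ratings_centered_toy` are assumptions for illustration, and subtracting each user's mean rating is one plausible way the centering may have been done, not a confirmed detail of this dataset.

```python
import pandas as pd

# Toy long-format ratings (made-up values, for illustration only)
ratings_long = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3],
    "title": ["Movie A", "Movie B", "Movie A", "Movie C", "Movie B", "Movie C"],
    "rating": [5.0, 3.0, 4.0, 2.0, 1.0, 5.0],
})

# One row per user, one column per movie; unrated movies are NaN
user_ratings_table = ratings_long.pivot_table(
    index="userId", columns="title", values="rating"
)

# Subtract each user's mean rating so that 0 means "average for this user"
user_means = user_ratings_table.mean(axis=1)
user_ratings_centered = user_ratings_table.sub(user_means, axis=0)

# Treat missing ratings as neutral (0), then transpose to one row per movie
movie_ratings_centered_toy = user_ratings_centered.fillna(0).T
print(movie_ratings_centered_toy)
```

Filling missing values with 0 only makes sense after centering, because 0 then stands for a neutral rating rather than a very low one.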
  
---
  
1. Assign the values for `Star Wars: Episode IV` and `Star Wars: Episode V` to `sw_IV` and `sw_V`.
Find their cosine similarity.
2. Find the cosine similarity between the ratings for Jurassic Park (`jurassic_park`) and Pulp Fiction (`pulp_fiction`).

In [12]:
# Renamed for the exercise
movie_ratings_centered = movie_ratings_subset.copy()

In [13]:
from sklearn.metrics.pairwise import cosine_similarity

# Assign the arrays to variables
sw_IV = movie_ratings_centered.loc['Star Wars: Episode IV - A New Hope (1977)', :].values.reshape(1, -1)
sw_V = movie_ratings_centered.loc['Star Wars: Episode V - The Empire Strikes Back (1980)', :].values.reshape(1, -1)

# Find the similarity between two Star Wars movies
similarity_A = cosine_similarity(sw_IV, sw_V)
print(similarity_A)

[[0.69423917]]


In [14]:
# Assign the arrays to variables
jurassic_park = movie_ratings_centered.loc['Jurassic Park (1993)', :].values.reshape(1, -1)
pulp_fiction = movie_ratings_centered.loc['Pulp Fiction (1994)', :].values.reshape(1, -1)

# Find the similarity between Pulp Fiction and Jurassic Park
similarity_B = cosine_similarity(pulp_fiction, jurassic_park)
print(similarity_B)

[[0.03407117]]


Great work! As you can see, the two Star Wars movies produced a much higher similarity score than Jurassic Park and Pulp Fiction. This is expected: although they are all award-winning movies, users who like one Star Wars movie are very likely to like the other, while quite different users may like Jurassic Park and Pulp Fiction.
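
To make the score itself less of a black box, the sketch below computes cosine similarity by hand with NumPy and compares it with scikit-learn's result. The vectors are made up for illustration and the helper `cosine` is a hypothetical name, not part of the exercise.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up centered rating vectors (one value per user)
a = np.array([1.0, -1.0, 0.0, 2.0])
b = np.array([2.0, -2.0, 0.0, 4.0])    # points in the same direction as a
c = np.array([-1.0, 1.0, 0.0, -2.0])   # points in the opposite direction to a

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

manual_ab = cosine(a, b)
manual_ac = cosine(a, c)

# scikit-learn expects 2-D inputs, hence the reshape
sklearn_ab = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]
print(manual_ab, manual_ac, sklearn_ab)
```

With centered ratings, a score near 1 means the same users tended to rate both movies above (or below) their own average, while a score near -1 means users who rated one movie above their average tended to rate the other below it.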

### Finding similarly liked movies
  
Just as you calculated the similarity between two movies, you can calculate it between every pair of movies to find the movies most similar to a given one, based on how users have rated them.
  
The approach is similar to how you worked with content-based filtering.
  
You will find the similarity scores between all movies and then drill down on the movie of interest by isolating and sorting the column containing its similarity scores.
  
`movie_ratings_centered` has once again been loaded, containing each movie as a row, and their centered ratings stored as the values.
  
---
  
1. Calculate the similarity matrix between all movies in `movie_ratings_centered` and store it as `similarities`.
2. Wrap the `similarities` matrix in a DataFrame, with the index of `movie_ratings_centered` as both the columns and the rows.

In [15]:
from sklearn.metrics.pairwise import cosine_similarity

# Generate the similarity matrix
similarities = cosine_similarity(movie_ratings_centered)

# Wrap the similarities in a DataFrame
cosine_similarity_df = pd.DataFrame(similarities, index=movie_ratings_centered.index, columns=movie_ratings_centered.index)

# Find the similarity values for a specific movie
cosine_similarity_series = cosine_similarity_df.loc['Star Wars: Episode IV - A New Hope (1977)']

# Sort these values highest to lowest
ordered_similarities = cosine_similarity_series.sort_values(ascending=False)

print(ordered_similarities)

title
Star Wars: Episode IV - A New Hope (1977)                                         1.000000
Star Wars: Episode V - The Empire Strikes Back (1980)                             0.694239
Star Wars: Episode VI - Return of the Jedi (1983)                                 0.646330
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)    0.449865
Indiana Jones and the Last Crusade (1989)                                         0.404551
                                                                                    ...   
Angels in the Outfield (1994)                                                    -0.299073
Superman III (1983)                                                              -0.302429
Bio-Dome (1996)                                                                  -0.313563
Rambo III (1988)                                                                 -0.316152
Batman & Robin (1997)                                                            -0.

Fantastic! As you can see, the most similar movie to Star Wars: Episode IV was Star Wars: Episode V, followed by Star Wars: Episode VI and then two Indiana Jones movies, which are action-packed movies from the same era.

## Using K-nearest neighbors
  
You are now able to find similar items based on how the users in your dataset have rated them.
  
**Beyond similar items**
  
But what if we wanted not only to find similarly rated items, but also to predict how a user might rate an item, even if it is not similar to any item they have seen? One approach is to find similar users using a K-nearest neighbors model and see how they liked the item.
  
<img src='../_images/using-knn-recomendation-sys.png' alt='img' width='720'>
  
**K-nearest neighbors**
  
As a reminder, K-NN finds the k users that are closest to the user in question, as measured by a specified metric. It then averages the ratings those users gave the item we are trying to predict a rating for. In this example, k equals 3, so it finds the 3 nearest users and gets their ratings. This allows us to predict how a user might feel about an item, even if they haven't seen it before. Scikit-learn has a pre-built KNN model we will use later, but it is valuable to understand how it works by going through the process step by step first.
  
<img src='../_images/using-knn-recomendation-sys1.png' alt='img' width='720'>
  
**User-user similarity**
  
We continue with our book rating DataFrame, this time predicting what rating User_1 might give the book "Catch-22" which they have not read. We previously generated the similarity scores between all items, in the item-based DataFrame. As we are now looking to find similar users, we repeat the process, but on the user-based DataFrame, and assign the users as columns and indices.
  
<img src='../_images/using-knn-recomendation-sys2.png' alt='img' width='720'>
  
**Understanding the similarity matrix**
  
Examining the output, we see a grid of all users as rows and columns, and where they meet, their similarity score.
  
<img src='../_images/using-knn-recomendation-sys3.png' alt='img' width='720'>
  
So User_1 and User_3 here are quite similar.
  
<img src='../_images/using-knn-recomendation-sys4.png' alt='img' width='720'>
  
While User_1 and User_2 are not.
  
<img src='../_images/using-knn-recomendation-sys5.png' alt='img' width='720'>
  
**Step by step KNN**
  
Let's set k to 3 and find the nearest neighbors of User_1. We select User_1's similarity values, then order them to find the 3 most similar users, getting just their names using `.index`.
  
<img src='../_images/using-knn-recomendation-sys6.png' alt='img' width='720'>
  
We then find the ratings these users gave to the book from our original ratings DataFrame and get the mean. This rating represents the rating the user would likely give to Catch-22 based on the ratings users similar to them gave it.
  
<img src='../_images/using-knn-recomendation-sys7.png' alt='img' width='720'>
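
The step-by-step process described above can be sketched as follows. The book ratings are made up for illustration (only "Catch-22" and `User_1` come from the text; the other titles, users, and values are assumptions), and `User_1` is dropped by label rather than by position so that the result does not depend on sort order.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Made-up book ratings; NaN means "not read". All values are illustrative.
user_ratings_table = pd.DataFrame(
    {
        "Catch-22": [np.nan, 2.0, 5.0, 4.0, 3.0],
        "Dune": [5.0, 1.0, 4.0, 5.0, 2.0],
        "Emma": [1.0, 5.0, 2.0, 1.0, 4.0],
    },
    index=["User_1", "User_2", "User_3", "User_4", "User_5"],
)

# Center each user's ratings on their own mean; missing ratings become 0
centered = user_ratings_table.sub(user_ratings_table.mean(axis=1), axis=0).fillna(0)

# User-user similarity matrix with users as both rows and columns
user_similarities = pd.DataFrame(
    cosine_similarity(centered), index=centered.index, columns=centered.index
)

# Order User_1's similarities (excluding User_1) and keep the k most similar
k = 3
ordered = user_similarities.loc["User_1"].drop("User_1").sort_values(ascending=False)
nearest_neighbors = ordered.iloc[:k].index

# Average the raw ratings those neighbors gave the target book
predicted = user_ratings_table.loc[nearest_neighbors, "Catch-22"].mean()
print(list(nearest_neighbors), predicted)
```

Note that with so few users, the third neighbor here is actually quite dissimilar to `User_1`; a plain average treats all k neighbors equally regardless of how close they are.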
  
**Using scikit-learn's KNN**
  
Let's look at how this can be done using scikit-learn. For this, we need two datasets: the centered user-based rating DataFrame, with a row per user, a column per item, and the ratings centered around 0 as values; and the original `user_ratings_table`, with uncentered scores and missing values.
  
<img src='../_images/using-knn-recomendation-sys8.png' alt='img' width='720'>
  
We drop the Catch-22 column, as that will be our target, and separate out the user we are predicting for. Note that we use double brackets to keep this as a DataFrame. The original raw ratings for the item we are predicting are then extracted. Think of these as the y values in your model.
  
<img src='../_images/using-knn-recomendation-sys9.png' alt='img' width='720'>
  
As we only care about neighbors that have read the book, we keep only the users that have actually rated it, and likewise drop the empty rows from the ratings. Think of `other_users_x` and `other_users_y` as your x and y training values, while `target_users_x` is the data you are predicting from.
  
<img src='../_images/using-knn-recomendation-sys10.png' alt='img' width='720'>
  
We can then import and instantiate the `KNeighborsRegressor` model from sklearn, specifying cosine as the metric. We fit it the same way we fit any model and predict for the target user's values.
  
<img src='../_images/using-knn-recomendation-sys11.png' alt='img' width='720'>
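
Putting the preparation and modelling steps together, a minimal sketch might look like this. The ratings are made up for illustration, the variable names follow the ones mentioned in the text, and `n_neighbors=3` is an assumption matching the k used earlier.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Made-up book ratings; NaN means "not read". All values are illustrative.
user_ratings_table = pd.DataFrame(
    {
        "Catch-22": [np.nan, 2.0, 5.0, 4.0, 3.0],
        "Dune": [5.0, 1.0, 4.0, 5.0, 2.0],
        "Emma": [1.0, 5.0, 2.0, 1.0, 4.0],
    },
    index=["User_1", "User_2", "User_3", "User_4", "User_5"],
)

# Centered ratings with missing values filled with 0
centered = user_ratings_table.sub(user_ratings_table.mean(axis=1), axis=0).fillna(0)

# Features: every book except the target one
users_to_ratings = centered.drop("Catch-22", axis=1)

# Double brackets keep the target user as a one-row DataFrame
target_users_x = users_to_ratings.loc[["User_1"]]

# Labels: raw (uncentered) ratings of the target book
other_users_y = user_ratings_table["Catch-22"]

# Keep only the users who actually rated the target book
has_rated = other_users_y.notnull()
other_users_x = users_to_ratings[has_rated]
other_users_y = other_users_y[has_rated]

# Fit a KNN regressor with cosine distance and predict for the target user
user_knn = KNeighborsRegressor(metric="cosine", n_neighbors=3)
user_knn.fit(other_users_x, other_users_y)
prediction = user_knn.predict(target_users_x)
print(prediction)
```

In scikit-learn, `metric="cosine"` refers to cosine distance (1 minus cosine similarity), so the smallest distances correspond to the most similar users.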
  
An advantage of using the sklearn approach is that you can quickly change parameters, or even try out classification instead of regression, where the most common rating is predicted rather than the average, as seen here.
  
<img src='../_images/using-knn-recomendation-sys12.png' alt='img' width='720'>
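
The difference between the two model types can be seen in a small sketch. All values below are made up for illustration, and the ratings are whole numbers because a classifier treats each distinct rating as a class label.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Made-up centered ratings of two other books, used as features
other_users_x = pd.DataFrame(
    {"Dune": [1.5, 1.0, 2.0, -1.0, -2.0], "Emma": [-1.5, -1.2, -1.8, 1.0, 2.0]},
    index=["User_2", "User_3", "User_4", "User_5", "User_6"],
)

# Made-up raw ratings of the target book, used as labels
other_users_y = pd.Series([5, 5, 2, 1, 1], index=other_users_x.index, name="Catch-22")

# The user we want a prediction for
target_users_x = pd.DataFrame({"Dune": [2.0], "Emma": [-2.0]}, index=["User_1"])

# Regression: the mean rating of the 3 nearest users
reg = KNeighborsRegressor(metric="cosine", n_neighbors=3)
reg.fit(other_users_x, other_users_y)
reg_pred = reg.predict(target_users_x)

# Classification: the most common rating among the 3 nearest users
clf = KNeighborsClassifier(metric="cosine", n_neighbors=3)
clf.fit(other_users_x, other_users_y)
clf_pred = clf.predict(target_users_x)

print(reg_pred, clf_pred)
```

Here the three nearest users rated the book 5, 2, and 5, so the regressor returns their mean while the classifier returns the majority rating.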
  
**Let's practice!**
  
Now it's time to try this yourself.

### Stepping through K-nearest neighbors
  
You have just seen how K-nearest neighbors can be used to infer how someone might rate an item based on the wisdom of a (similar) crowd. In this exercise, you will step through this process yourself to ensure a good understanding of how it works.
  
To get you started, as you have generated similarity matrices many times before, that step has been done for you with the user similarity matrix wrapped in a DataFrame loaded as `user_similarities`.
  
This has each user as both the rows and the columns, with the corresponding similarity score where they meet.
  
In this exercise, you will work with `user_001`'s similarity scores, find their nearest neighbors, and, based on the ratings those neighbors gave a movie, infer what rating `user_001` might give it if they saw it.
  
---
  
1. Find the IDs of `user_001`'s 10 nearest neighbors by extracting the top 10 users in `ordered_similarities` and storing them as `nearest_neighbors`.
2. Extract the ratings the users in `nearest_neighbors` gave from `user_ratings_table` as `neighbor_ratings`.
3. Calculate the average rating these users gave to the movie Apollo 13 (1995) to infer what `user_001` might give it if they had seen it.

In [22]:
user_ratings.head()

Unnamed: 0,userId,rating,title
0,1,4.0,Toy Story (1995)
1,5,4.0,Toy Story (1995)
2,7,4.5,Toy Story (1995)
3,15,2.5,Toy Story (1995)
4,17,4.5,Toy Story (1995)


In [25]:
user_ratings.columns

Index(['userId', 'rating', 'title'], dtype='object')

In [30]:
user_similarities = user_ratings.pivot(index='userId', columns='userId', values='rating')

ValueError: The name userId occurs multiple times, use a level number
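
The `pivot` call above fails because `userId` is used for both the index and the columns; a user similarity matrix is not a reshaping of the ratings but something computed from them. One way it could be derived from long-format data with `userId`, `rating`, and `title` columns is sketched below. The ratings are made up for illustration, the IDs are integers as in `user_ratings` rather than labels like `user_001`, and centering on each user's mean is an assumption.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy long-format ratings with the same columns as user_ratings (made-up values)
user_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3],
    "rating": [5.0, 3.0, 4.0, 2.0, 1.0, 5.0],
    "title": ["Movie A", "Movie B", "Movie A", "Movie B", "Movie A", "Movie B"],
})

# Reshape to one row per user and one column per title (not userId by userId)
user_ratings_table = user_ratings.pivot_table(
    index="userId", columns="title", values="rating"
)

# Center on each user's mean rating and fill missing values with 0
centered = user_ratings_table.sub(user_ratings_table.mean(axis=1), axis=0).fillna(0)

# Compute user-user cosine similarities and label rows and columns with user IDs
user_similarities = pd.DataFrame(
    cosine_similarity(centered), index=centered.index, columns=centered.index
)
print(user_similarities)
```

`pivot_table` is used instead of `pivot` here because it aggregates duplicates (by default with the mean) if a user has rated the same title more than once.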

In [None]:
# Isolate the similarity scores for user_001 and sort
user_similarity_series = user_similarities.loc['user_001']
ordered_similarities = user_similarity_series.sort_values(ascending=False)

# Find the top 10 most similar users (position 0 is user_001 itself)
nearest_neighbors = ordered_similarities.iloc[1:11].index

# Extract the ratings of the neighbors
neighbor_ratings = user_ratings_table.reindex(nearest_neighbors)

# Calculate the mean rating given by the user's nearest neighbors
print(neighbor_ratings['Apollo 13 (1995)'].mean())