*Note: You are currently reading this in Google Colaboratory, a cloud-hosted version of Jupyter Notebook. A notebook is a document containing both text cells for documentation and runnable code cells. If you are unfamiliar with Jupyter Notebook, watch this 3-minute introduction before starting this challenge: https://www.youtube.com/watch?v=inN8seMm7UI*

---

In this challenge, you will create a book recommendation algorithm using **K-Nearest Neighbors**.

You will use the [Book-Crossings dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/). This dataset contains 1.1 million ratings (scale of 1-10) of 270,000 books by 90,000 users. 

After importing and cleaning the data, use `NearestNeighbors` from `sklearn.neighbors` to develop a model that shows books that are similar to a given book. The Nearest Neighbors algorithm measures distance to determine the “closeness” of instances.
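
As a rough illustration of what "closeness" means here, the sketch below fits `NearestNeighbors` on a tiny invented matrix (three books rated by three users; the values are made up and not taken from the dataset) and queries the neighbors of the first row. The cosine metric is an assumption chosen for the illustration; other metrics are possible.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy rating matrix: one row per book, one column per user (invented values)
ratings = np.array([
    [5.0, 0.0, 3.0],   # book A
    [4.0, 0.0, 3.0],   # book B, rated much like book A
    [0.0, 5.0, 0.0],   # book C, rated by a different user only
])

model = NearestNeighbors(metric="cosine", algorithm="brute").fit(ratings)

# Ask for the 3 nearest rows to book A (book A itself is included)
distances, indices = model.kneighbors(ratings[[0]], n_neighbors=3)
print(indices[0])     # row order from nearest to farthest
print(distances[0])   # matching cosine distances
```

Book B ends up much closer to book A than book C does, because its ratings point in nearly the same direction.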

Create a function named `get_recommends` that takes a book title (from the dataset) as an argument and returns a list of 5 similar books with their distances from the book argument.

This code:

`get_recommends("The Queen of the Damned (Vampire Chronicles (Paperback))")`

should return:

```
[
  'The Queen of the Damned (Vampire Chronicles (Paperback))',
  [
    ['Catch 22', 0.793983519077301], 
    ['The Witching Hour (Lives of the Mayfair Witches)', 0.7448656558990479], 
    ['Interview with the Vampire', 0.7345068454742432],
    ['The Tale of the Body Thief (Vampire Chronicles (Paperback))', 0.5376338362693787],
    ['The Vampire Lestat (Vampire Chronicles, Book II)', 0.5178412199020386]
  ]
]
```

Notice that the data returned from `get_recommends()` is a list. The first element in the list is the book title passed in to the function. The second element in the list is a list of five more lists. Each of the five lists contains a recommended book and the distance from the recommended book to the book passed in to the function.
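
To make the nested structure concrete, here is a small sketch that unpacks a result of that shape. The list is copied from the expected output above; the variable names are only illustrative.

```python
# Expected output from the example above, reproduced as a literal
result = [
    'The Queen of the Damned (Vampire Chronicles (Paperback))',
    [
        ['Catch 22', 0.793983519077301],
        ['The Witching Hour (Lives of the Mayfair Witches)', 0.7448656558990479],
        ['Interview with the Vampire', 0.7345068454742432],
        ['The Tale of the Body Thief (Vampire Chronicles (Paperback))', 0.5376338362693787],
        ['The Vampire Lestat (Vampire Chronicles, Book II)', 0.5178412199020386],
    ],
]

# First element: the queried title; second element: five [title, distance] pairs
query_title, recommendations = result

for title, distance in recommendations:
    print(f"{distance:.3f}  {title}")

# The pair with the smallest distance is the most similar book
closest_title = min(recommendations, key=lambda pair: pair[1])[0]
```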

If you graph the dataset (optional), you will notice that most books are not rated frequently. To ensure statistical significance, remove from the dataset users with fewer than 200 ratings and books with fewer than 100 ratings.
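
The filtering rule can be sketched on a toy table. The thresholds of 2 below are stand-ins for 200 and 100, the column names `user` and `isbn` are assumptions, and mapping counts back onto rows is just one possible way to express the filter.

```python
import pandas as pd

# Toy ratings table (invented data)
ratings = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 3],
    "isbn": ["a", "b", "c", "a", "b", "a"],
})

MIN_USER_RATINGS = 2   # stands in for 200
MIN_BOOK_RATINGS = 2   # stands in for 100

# How many ratings each user gave and each book received
user_counts = ratings["user"].value_counts()
book_counts = ratings["isbn"].value_counts()

# Keep a row only if both its user and its book meet the thresholds
kept = ratings[
    ratings["user"].map(user_counts).ge(MIN_USER_RATINGS)
    & ratings["isbn"].map(book_counts).ge(MIN_BOOK_RATINGS)
]
print(kept)
```

Here user 3 (one rating) and book `c` (one rating) drop out, leaving four rows.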

The first three cells import libraries you may need and the data to use. The final cell is for testing. Write all your code in between those cells.

In [1]:
# import libraries (you may add additional imports but you may not have to)
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
#from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt



In [2]:
!apt install wget
!apt install unzip

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  wget
0 upgraded, 1 newly installed, 0 to remove and 15 not upgraded.
Need to get 348 kB of archives.
After this operation, 1012 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 wget amd64 1.20.3-1ubuntu2 [348 kB]
Fetched 348 kB in 1s (407 kB/s)
debconf: delaying package configuration, since apt-utils is not installed

Selecting previously unselected package wget.
(Reading database ... 53381 files and directories currently installed.)
Preparing to unpack .../wget_1.20.3-1ubuntu2_amd64.deb ...
Unpacking wget (1.20.3-1ubuntu2) ...

In [3]:
# get data files
!wget https://cdn.freecodecamp.org/project-data/books/book-crossings.zip

!unzip book-crossings.zip


--2022-02-13 22:27:42--  https://cdn.freecodecamp.org/project-data/books/book-crossings.zip
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 172.67.70.149, 104.26.2.33, 104.26.3.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|172.67.70.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26085508 (25M) [application/zip]
Saving to: ‘book-crossings.zip’


2022-02-13 22:27:46 (7.26 MB/s) - ‘book-crossings.zip’ saved [26085508/26085508]

Archive:  book-crossings.zip
  inflating: BX-Book-Ratings.csv     
  inflating: BX-Books.csv            
  inflating: BX-Users.csv            


In [2]:

books_filename = 'BX-Books.csv'
ratings_filename = 'BX-Book-Ratings.csv'

In [152]:
# import csv data into dataframes
df_books = pd.read_csv(
    books_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author'],
    usecols=['isbn', 'title', 'author'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})

In [153]:
# Count the occurrences of each user and each book (isbn)
c1 = df_ratings['user'].value_counts()
c2 = df_ratings['isbn'].value_counts()

# Remove users with fewer than 200 ratings and books with fewer than 100 ratings
df_ratings = df_ratings[~df_ratings['user'].isin(c1[c1 < 200].index)]
df_ratings = df_ratings[~df_ratings['isbn'].isin(c2[c2 < 100].index)]

# Merge the dataframes on isbn
df = pd.merge(right=df_ratings, left=df_books, on='isbn')

# Remove duplicates
df = df.drop_duplicates(['title', 'user'])

# Create a pivot
df_pivot = df.pivot(index = 'title', columns = 'user', values = 'rating').fillna(0)
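
A small sketch of why the duplicate removal comes before the pivot: `DataFrame.pivot` raises a `ValueError` when the same (index, column) pair occurs more than once, which can happen here when one user rated several editions (ISBNs) that share a title. The toy data below is invented.

```python
import pandas as pd

# Invented data: user 1 rated "Book A" twice (e.g. two editions, same title)
toy = pd.DataFrame({
    "title":  ["Book A", "Book A", "Book B", "Book A"],
    "user":   [1, 2, 1, 1],
    "rating": [8.0, 0.0, 5.0, 9.0],
})

# Pivoting with duplicate (title, user) pairs fails
try:
    toy.pivot(index="title", columns="user", values="rating")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# After dropping duplicates (first occurrence kept), the pivot succeeds;
# missing ratings become 0
deduped = toy.drop_duplicates(["title", "user"])
matrix = deduped.pivot(index="title", columns="user", values="rating").fillna(0)
print(matrix)
```

Each row of `matrix` is one book's rating vector across users, which is the shape the nearest-neighbors model works on.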

In [154]:
# Create a sparse row matrix
df_csr = csr_matrix(df_pivot.values)

In [155]:
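# Create the KNN model (brute-force search with cosine distance).
# The p parameter only applies to the Minkowski metric, so it is omitted here.
nbrs = NearestNeighbors(metric='cosine', algorithm='brute').fit(df_csr)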

In [156]:
# For usage in the get_recommends function
titles = list(df_pivot.index.values)

In [157]:
# function to return recommended books - this will be tested
def get_recommends(book = ""):
  if not book:
    return 'Please enter a book title'

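  # Query 6 neighbors: the nearest is normally the book itself (distance 0)
  distances, indices = nbrs.kneighbors(df_pivot.loc[book].values.reshape(1, -1), n_neighbors=6)
  # Skip position 0 (the book itself) and list the other five from farthest to nearest
  recommended_books = [book, [[titles[indices[0][i]], distances[0][i]] for i in range(5, 0, -1)]]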

  return recommended_books

In [158]:
get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")

["Where the Heart Is (Oprah's Book Club (Paperback))",
 [["I'll Be Seeing You", 0.8016211],
  ['The Weight of Water', 0.77085835],
  ['The Surgeon', 0.7699411],
  ['I Know This Much Is True', 0.7677075],
  ['The Lovely Bones: A Novel', 0.7234864]]]

Use the cell below to test your function. The `test_book_recommendation()` function will inform you if you passed the challenge or need to keep trying.

In [159]:
books = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
print(books)

def test_book_recommendation():
  test_pass = True
  recommends = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
  if recommends[0] != "Where the Heart Is (Oprah's Book Club (Paperback))":
    test_pass = False
  recommended_books = ["I'll Be Seeing You", 'The Weight of Water', 'The Surgeon', 'I Know This Much Is True']
  recommended_books_dist = [0.8, 0.77, 0.77, 0.77]
  for i in range(2): 
    if recommends[1][i][0] not in recommended_books:
      test_pass = False
    if abs(recommends[1][i][1] - recommended_books_dist[i]) >= 0.05:
      test_pass = False
  if test_pass:
    print("You passed the challenge! 🎉🎉🎉🎉🎉")
  else:
    print("You haven't passed yet. Keep trying!")

test_book_recommendation()

["Where the Heart Is (Oprah's Book Club (Paperback))", [["I'll Be Seeing You", 0.8016211], ['The Weight of Water', 0.77085835], ['The Surgeon', 0.7699411], ['I Know This Much Is True', 0.7677075], ['The Lovely Bones: A Novel', 0.7234864]]]
You passed the challenge! 🎉🎉🎉🎉🎉


___
### Failed attempts below (excluded from posted solution)

In [None]:
# add your code here - consider creating a new cell for each section of code

# Filter out users that have placed fewer than 200 ratings
df_r_u_1 = df_ratings['user'].value_counts().to_frame('count')
user_to_keep = df_r_u_1[df_r_u_1.loc[:, 'count'] >= 200].index.values.tolist()

In [None]:
# Filter out books (isbn) that have fewer than 100 ratings
df_r_b_1 = df_ratings['isbn'].value_counts().to_frame('count')
books_to_keep = df_r_b_1[df_r_b_1.loc[:, 'count'] >= 100].index.values.tolist()

In [None]:
# Create two new DFs, one which contains the books to keep and the other with the users to keep.
df_r_b = df_ratings[df_ratings['isbn'].isin(books_to_keep)]
df_r_u = df_ratings[df_ratings['user'].isin(user_to_keep)]

# Store the unaltered (original) index values from both lists and remove duplicates.
df_r_index = list(set(df_r_b.index.values.tolist() + df_r_u.index.values.tolist()))

# Finally, create a new ratings DF without the users and ratings as defined earlier.
df_r = df_ratings[df_ratings.index.isin(df_r_index)].reset_index()
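
One possible reason this attempt drifts from the working solution: combining the two index lists with `set(... + ...)` produces a union, so a row survives when either its user or its book passes the threshold, whereas the working cells apply both filters in sequence. The sketch below shows the difference on invented data; `users_to_keep` and `books_to_keep` are hypothetical stand-ins for the lists built above.

```python
import pandas as pd

# Invented ratings table
ratings = pd.DataFrame({
    "user":   [1, 1, 2, 3],
    "isbn":   ["a", "b", "a", "c"],
    "rating": [5.0, 3.0, 4.0, 2.0],
})
users_to_keep = [1]      # hypothetical: users that pass the threshold
books_to_keep = ["a"]    # hypothetical: books that pass the threshold

keep_user = ratings["user"].isin(users_to_keep)
keep_book = ratings["isbn"].isin(books_to_keep)

either = ratings[keep_user | keep_book]   # union of the two conditions
both = ratings[keep_user & keep_book]     # intersection of the two conditions

print(len(either), len(both))
```

With the union, ratings by infrequent users and ratings of rarely rated books both remain in the table.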

In [None]:
# Create a new books dataframe where only the books listed in books_to_keep are present.
df_b = df_books[df_books['isbn'].isin(books_to_keep)].reset_index()

In [None]:
# Merge the dataframes
df_br = pd.merge(df_r, df_b, on='isbn')

In [None]:
df_br.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 137423 entries, 0 to 137422
Data columns (total 7 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   index_x  137423 non-null  int64  
 1   user     137423 non-null  int32  
 2   isbn     137423 non-null  object 
 3   rating   137423 non-null  float32
 4   index_y  137423 non-null  int64  
 5   title    137423 non-null  object 
 6   author   137423 non-null  object 
dtypes: float32(1), int32(1), int64(2), object(3)
memory usage: 7.3+ MB


In [None]:
df_br_no_duplicates = df_br.drop_duplicates(['title', 'user'], keep='first')
df_br_pivot = df_br_no_duplicates.pivot(index = 'title', columns = 'user', values = 'rating').fillna(0)
df_br_csr = csr_matrix(df_br_pivot.values)

In [None]:
nbrs = NearestNeighbors(n_neighbors=5, algorithm='brute', metric='cosine').fit(df_br_csr)

In [None]:
#r = np.array(df_r['rating'].tolist()) # All ratings
#b_str = df_r['isbn'].tolist()
#b_int = {v: k for k, v in enumerate(set(b_str))}
#b_reverse = {v: k for k, v in b_int.items()} # Use this dict to reverse the book integer created on the next line back to its original string
#b = np.array([b_int[x] for x in b_str]) # All book ISBNs as integers
#u = np.array(df_r['user'].tolist())
#u_int = {k: v for v, k in enumerate(set(u_str))} # Use this dict to reverse user integers created on the next line back to 'user'
#u = np.array([u_int[x] for x in u_str]) # All users as integers

In [None]:
#bur = np.column_stack((b, u, r))

In [None]:
#data = csr_matrix(bur)

In [None]:
#nbrs = NearestNeighbors(n_neighbors=5, algorithm='auto', metric='cosine').fit(data)

In [None]:
#isbn = df_b.loc[df_b['title'] == "Where the Heart Is (Oprah's Book Club (Paperback))" , 'isbn'].iloc[0]
#
## get the ratings for that isbn
##ratings = np.array(df_r.loc[df_r['isbn'] == isbn, 'rating'].tolist())
#selection = df_r.loc[df_r['isbn'] == isbn]
#r_rat = np.array(selection['rating'].tolist())
#b_rat = (np.ones((1, len(r_rat))) * b_int[selection['isbn'].iloc[0]]).flatten() # Transforming the ISBN string to its corresponding value in 'y_int' from before.
#u_rat = np.array(selection['user'].tolist())

In [None]:
X = df_br_pivot[df_br_pivot.index == "Where the Heart Is (Oprah's Book Club (Paperback))"]
X = X.to_numpy().reshape(1, -1).flatten()

0.07407636

In [None]:
distances, indices = nbrs.kneighbors(X, len(df_br_pivot.index.values), True)

In [None]:
distances.flatten()

array([0.        , 0.90959823, 0.9172571 , 0.9198669 , 0.9210188 ,
       0.92478853, 0.9278776 , 0.9285197 , 0.9293628 , 0.9294567 ,
       0.92954266, 0.92961556, 0.931002  , 0.9314079 , 0.9316913 ,
       0.9318649 , 0.93216383, 0.9322732 , 0.93348557, 0.9343451 ,
       0.9347042 , 0.93550926, 0.9383493 , 0.9393388 , 0.94028395,
       0.9404569 , 0.94047284, 0.9409793 , 0.94120157, 0.9414102 ,
       0.94165516, 0.9419774 , 0.9420757 , 0.9434027 , 0.9435293 ,
       0.944297  , 0.9444123 , 0.9446945 , 0.94528985, 0.94533575,
       0.9454994 , 0.9454994 , 0.9455485 , 0.9458667 , 0.94603056,
       0.9468189 , 0.9469409 , 0.9473244 , 0.94751686, 0.9475685 ,
       0.9477997 , 0.94792914, 0.94835633, 0.9486094 , 0.94861627,
       0.9490222 , 0.9491406 , 0.9492118 , 0.94925433, 0.949293  ,
       0.94963634, 0.94964415, 0.94994247, 0.9501943 , 0.9502621 ,
       0.95041597, 0.9509487 , 0.95115525, 0.95129293, 0.9514584 ,
       0.9519501 , 0.95227563, 0.9523064 , 0.95241606, 0.95261

In [None]:
indices.flatten()

array([654, 539, 558, 240, 614])

In [None]:
df_br_pivot.index[indices.flatten()[4]]

'The Weight of Water'

In [None]:
X = np.column_stack((b_rat, u_rat, r_rat))

In [None]:
X

array([[4.36540e+04, 2.77901e+05, 7.00000e+00, ..., 4.36540e+04,
        2.76641e+05, 0.00000e+00]])

In [None]:
distances, indices = nbrs.kneighbors(X, n_neighbors=5)

ValueError: X has 1755 features, but NearestNeighbors is expecting 3 features as input.
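
The message above appears to come from querying a model fitted on a three-column array with a much wider vector. As a minimal sketch on random data (not the notebook's variables), a query must have the same number of columns as the array passed to `fit`:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Fit on 10 samples with 3 features each (random data for illustration)
train = np.random.default_rng(0).random((10, 3))
model = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute").fit(train)

# A query with 3 features works and returns 2 neighbors
good_query = np.ones((1, 3))
distances, indices = model.kneighbors(good_query)

# A query with a different number of features is rejected
bad_query = np.ones((1, 5))
try:
    model.kneighbors(bad_query)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(indices.shape, mismatch_rejected)
```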

In [None]:
indices

array([[  1201,   1228, 237642, 446102, 116345],
       [  1228, 446102,   1201, 410691, 237642],
       [  1329, 163843, 485325, 192261, 546497],
       ...,
       [614555, 441683, 388627, 467139, 274909],
       [614580, 532749, 311013, 444083, 451608],
       [616184, 612644, 556230, 539628, 301685]])

In [None]:
df_b.loc[df_b['isbn'] == b_reverse[indices.flatten()[1]], 'title'].iloc[0]

IndexError: single positional indexer is out-of-bounds