*Note: You are currently reading this using Google Colaboratory which is a cloud-hosted version of Jupyter Notebook. This is a document containing both text cells for documentation and runnable code cells. If you are unfamiliar with Jupyter Notebook, watch this 3-minute introduction before starting this challenge: https://www.youtube.com/watch?v=inN8seMm7UI*

---

In this challenge, you will create a book recommendation algorithm using **K-Nearest Neighbors**.

You will use the [Book-Crossings dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/). This dataset contains 1.1 million ratings (scale of 1-10) of 270,000 books by 90,000 users. 

After importing and cleaning the data, use `NearestNeighbors` from `sklearn.neighbors` to develop a model that shows books that are similar to a given book. The Nearest Neighbors algorithm measures distance to determine the “closeness” of instances.

Create a function named `get_recommends` that takes a book title (from the dataset) as an argument and returns a list of 5 similar books with their distances from the book argument.

This code:

`get_recommends("The Queen of the Damned (Vampire Chronicles (Paperback))")`

should return:

```
[
  'The Queen of the Damned (Vampire Chronicles (Paperback))',
  [
    ['Catch 22', 0.793983519077301], 
    ['The Witching Hour (Lives of the Mayfair Witches)', 0.7448656558990479], 
    ['Interview with the Vampire', 0.7345068454742432],
    ['The Tale of the Body Thief (Vampire Chronicles (Paperback))', 0.5376338362693787],
    ['The Vampire Lestat (Vampire Chronicles, Book II)', 0.5178412199020386]
  ]
]
```

Notice that the data returned from `get_recommends()` is a list. The first element in the list is the book title passed in to the function. The second element in the list is a list of five more lists. Each of the five lists contains a recommended book and the distance from the recommended book to the book passed in to the function.

If you graph the dataset (optional), you will notice that most books are not rated frequently. To ensure statistical significance, remove from the dataset users with fewer than 200 ratings and books with fewer than 100 ratings.

The first three cells import libraries you may need and the data to use. The final cell is for testing. Write all your code in between those cells.

In [None]:
# Import libraries (you may or may not need more).
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

In [None]:
# Download the data files (the datasets).
!wget https://cdn.freecodecamp.org/project-data/books/book-crossings.zip

!unzip book-crossings.zip

books_filename = 'BX-Books.csv'
ratings_filename = 'BX-Book-Ratings.csv'

--2021-01-15 17:10:44--  https://cdn.freecodecamp.org/project-data/books/book-crossings.zip
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 172.67.70.149, 104.26.3.33, 104.26.2.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|172.67.70.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘book-crossings.zip’

book-crossings.zip      [        <=>         ]  24.88M  1.03MB/s    in 24s     

2021-01-15 17:11:09 (1.03 MB/s) - ‘book-crossings.zip’ saved [26085508]

Archive:  book-crossings.zip
  inflating: BX-Book-Ratings.csv     
  inflating: BX-Books.csv            
  inflating: BX-Users.csv            


In [None]:
# Load the '.csv' files into dataframes.
df_books = pd.read_csv(
    books_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author'],
    usecols=['isbn', 'title', 'author'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})
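
The `read_csv` options used above can be tried on a tiny in-memory sample. The sketch below is only an illustration: the three-column layout, the header names and the two rows are made up, and `toy_books` is a hypothetical name. It shows that `header=0` combined with `names` replaces the original header row, and that reading `isbn` as `str` keeps leading zeros.

```python
import io

import pandas as pd

# Made-up sample in a ';'-separated layout (not taken from the real files).
sample = io.StringIO(
    "ISBN;Book-Title;Book-Author\n"
    "0000000001;Example Book A;Author A\n"
    "0000000002;Example Book B;Author B\n"
)

toy_books = pd.read_csv(
    sample,
    sep=";",
    header=0,                            # the first line is a header row...
    names=["isbn", "title", "author"],   # ...and these names replace it
    usecols=["isbn", "title", "author"],
    dtype={"isbn": "str", "title": "str", "author": "str"},
)

# Because 'isbn' is read as a string, the leading zeros survive.
print(toy_books)
```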

First, we check for missing values in both dataframes:

In [None]:
df_books.isnull().sum() 

isbn      0
title     0
author    1
dtype: int64

In [None]:
df_ratings.isnull().sum()

user      0
isbn      0
rating    0
dtype: int64

In the `books` dataframe there is one null value in the `author` column, so we drop that row and keep the remaining valid entries in the same variable:

In [None]:
df_books.dropna(inplace=True) # Drop the rows with missing values, modifying the dataframe in place.

Now that we have dropped that row, we check the `books` dataframe for missing values again. It should now have none, just like the `ratings` dataframe:

In [None]:
df_books.isnull().sum()

isbn      0
title     0
author    0
dtype: int64
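
The check-drop-check sequence above can be reproduced on a small made-up frame. This is a minimal sketch with hypothetical data and names (`toy`, `cleaned`); unlike the cell above, it uses `dropna()` without `inplace=True`, which returns a new dataframe and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Made-up frame with a single missing author.
toy = pd.DataFrame({
    "isbn": ["A", "B", "C"],
    "title": ["Book A", "Book B", "Book C"],
    "author": ["Author A", np.nan, "Author C"],
})

missing_before = toy.isnull().sum()    # missing values per column
cleaned = toy.dropna()                 # new frame without the incomplete row
missing_after = cleaned.isnull().sum()

print(missing_before)
print(missing_after)
```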

Now that neither dataframe has missing values, it's time to remove users with fewer than 200 ratings. First, we print the shape of both dataframes:

In [None]:
df_books.shape # Number of rows and columns.

(271378, 3)

In [None]:
df_ratings.shape # Number of rows and columns.

(1149780, 3)

As we can see, the `ratings` dataframe has 1,149,780 rows and 3 columns, which correspond to `user`, `isbn` and `rating`. Next, we count how many ratings each unique user has given, using the `user` column and the `value_counts` method:

In [None]:
ratings = df_ratings['user'].value_counts() # Count the ratings given by each unique user, using the 'user' column of the dataframe.
ratings.sort_values(ascending=False).head() # Sort the counts in descending order and show the first five.

11676     13602
198711     7550
153662     6109
98391      5891
35859      5850
Name: user, dtype: int64

Now we can check the total number of unique users that have fewer than 200 ratings:

In [None]:
len(ratings[ratings < 200]) # Number of unique users with fewer than 200 book ratings.

104378

Next, we count how many rows of the dataframe belong to those users. `isin` flags each row whose `user` is among the users with fewer than 200 ratings (taken from the `ratings` variable created before), and `sum` counts the flagged rows:

In [None]:
df_ratings['user'].isin(ratings[ratings < 200].index).sum() # 'isin' flags each row whose user has fewer than 
                                                            # 200 ratings; 'sum' counts the flagged (True) rows 
                                                            # across the whole dataframe.

622224

Now that we have counted the rows that belong to users with fewer than 200 ratings, we remove those rows and store the remaining ones (only the users with at least 200 ratings) in a new variable:

In [None]:
df_ratings_rm = df_ratings[~df_ratings['user'].isin(ratings[ratings < 200].index)] # ~ negates the boolean mask element-wise, so 
                                                                                   # True becomes False and vice versa.
df_ratings_rm.shape

(527556, 3)

As the `shape` attribute shows, 527,556 rows remain: the 1,149,780 initial rows of the dataframe minus the 622,224 rows from users with fewer than 200 ratings.
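
The count, flag and filter steps above can be checked on a few made-up rows. In this sketch the data and the names `toy_ratings`, `min_ratings`, `rare_users`, `rows_removed` and `kept` are hypothetical, and the threshold is 2 instead of 200 so the effect is visible on six rows.

```python
import pandas as pd

# Made-up ratings: user 1 has three ratings, user 3 has two, user 2 has one.
toy_ratings = pd.DataFrame({
    "user": [1, 1, 1, 2, 3, 3],
    "isbn": ["A", "B", "C", "A", "A", "B"],
    "rating": [5.0, 0.0, 8.0, 7.0, 9.0, 6.0],
})

min_ratings = 2  # the notebook uses 200

counts = toy_ratings["user"].value_counts()          # ratings per user
rare_users = counts[counts < min_ratings].index      # users below the threshold
rows_removed = int(toy_ratings["user"].isin(rare_users).sum())
kept = toy_ratings[~toy_ratings["user"].isin(rare_users)]

# Same arithmetic as in the text: remaining rows = initial rows - removed rows.
print(len(toy_ratings), rows_removed, len(kept))
```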

With that result stored in a variable, it's time to remove the books with fewer than 100 ratings.

First, we count how many ratings each unique book has in the `ratings` dataframe, using the `isbn` column and the `value_counts` method:

In [None]:
ratings = df_ratings['isbn'].value_counts() # Count the ratings of each unique book, using the 'isbn' column of the dataframe.
ratings.sort_values(ascending=False).head() # Sort the counts in descending order and show the first five.

0971880107    2502
0316666343    1295
0385504209     883
0060928336     732
0312195516     723
Name: isbn, dtype: int64

Now we can check the total number of unique books that have fewer than 100 ratings in the `ratings` dataframe:

In [None]:
len(ratings[ratings < 100]) # Number of unique books with fewer than 100 ratings.

339825

Next, we count how many rows belong to books with fewer than 100 ratings, this time in the dataframe created before, which only contains users with at least 200 ratings:

In [None]:
df_ratings_rm['isbn'].isin(ratings[ratings < 100].index).sum() # 'isin' flags each row whose book has fewer 
                                                               # than 100 ratings; 'sum' counts the flagged (True) 
                                                               # rows across the whole dataframe.

477775

Now that we have counted the rows that belong to books with fewer than 100 ratings, we remove those rows and store the remaining ones (only the books with at least 100 ratings) in the same variable that holds the rows of users with at least 200 ratings:

In [None]:
df_ratings_rm = df_ratings_rm[~df_ratings_rm['isbn'].isin(ratings[ratings < 100].index)] # ~ negates the boolean mask element-wise, so 
                                                                                         # True becomes False and vice versa.
df_ratings_rm.shape

(49781, 3)

As a result, 49,781 rows remain: the 527,556 rows from users with at least 200 ratings minus the 477,775 rows that belong to books with fewer than 100 ratings.

So, those 49,781 rows are what is left after removing users with fewer than 200 ratings and books with fewer than 100 ratings.
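
Both filters can also be wrapped in one helper. The function below is a hypothetical sketch (`filter_ratings` and its parameters are not part of the notebook). Like the cells above, it counts ratings per user and per book on the unfiltered dataframe and then keeps only the rows that pass both thresholds.

```python
import pandas as pd


def filter_ratings(ratings, min_user_ratings=200, min_book_ratings=100):
    """Keep rows whose user and book both reach the given rating counts.

    Both counts are computed on the unfiltered input.
    """
    user_counts = ratings["user"].value_counts()
    book_counts = ratings["isbn"].value_counts()
    keep_users = user_counts[user_counts >= min_user_ratings].index
    keep_books = book_counts[book_counts >= min_book_ratings].index
    mask = ratings["user"].isin(keep_users) & ratings["isbn"].isin(keep_books)
    return ratings[mask]


# Made-up data with small thresholds so the result is easy to follow.
toy = pd.DataFrame({
    "user": [1, 1, 1, 2, 3, 3],
    "isbn": ["A", "B", "C", "A", "A", "B"],
    "rating": [5.0, 0.0, 8.0, 7.0, 9.0, 6.0],
})
filtered = filter_ratings(toy, min_user_ratings=2, min_book_ratings=2)
print(filtered)
```

On the toy data, user 2 (one rating) and book C (one rating) are dropped, which leaves four rows.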

Now that we have removed both users with fewer than 200 ratings and books with fewer than 100 ratings, as the challenge requires, we are ready to prepare the dataset for the KNN (K-Nearest Neighbors) algorithm.

First, we create a spreadsheet-style pivot table from the final dataframe of 49,781 rows, with `user` as the index, `isbn` as the columns and `rating` as the values. We then fill the missing entries with 0 and transpose the table, so that each row is a book and each column is a user.

In [None]:
df = df_ratings_rm.pivot_table(index=['user'],columns=['isbn'],values='rating').fillna(0).T # Create a spreadsheet-style pivot 
                                                                                            # table.
                                                                                            # fillna(0): Replaces the missing 
                                                                                            # (NaN) values with 0s.
                                                                                            # T: Transposes the table, swapping 
                                                                                            # rows and columns.
df.head()

user,254,2276,2766,2977,3363,4017,4385,6242,6251,6323,6543,6563,6575,7158,7286,7346,7915,8067,8245,8681,8936,9856,10447,10819,11601,11676,11993,12538,12824,12982,13082,13273,13552,13850,14422,14521,15408,15418,15957,16106,...,264317,264321,264637,265115,265313,265595,265889,266056,266226,266753,266865,266866,267635,268030,268032,268110,268330,268622,268932,269566,269719,269728,269890,270713,270820,271195,271284,271448,271705,273979,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
isbn
002542730X,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0
0060008032,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,6.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0060096195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
006016848X,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0060173289,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
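
The pivot, fill and transpose chain is easier to follow on a handful of made-up ratings. In this sketch the data and the names `toy` and `book_user` are hypothetical; the result has one row per book and one column per user, with 0 where a user has not rated a book.

```python
import pandas as pd

# Made-up long-format ratings: one row per (user, book) pair.
toy = pd.DataFrame({
    "user": [1, 1, 2, 3],
    "isbn": ["A", "B", "A", "B"],
    "rating": [8.0, 5.0, 10.0, 7.0],
})

book_user = (
    toy.pivot_table(index="user", columns="isbn", values="rating")  # users x books
       .fillna(0)   # no rating -> 0
       .T           # books x users
)
print(book_user)
```

Note that `pivot_table` aggregates repeated (user, book) pairs with the mean by default, so duplicated ratings would be averaged rather than raise an error.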


Now we index the `books` dataframe by its `isbn` column and join it to the pivot table, whose index is also the ISBN. We then replace the index of the pivot table with the joined `title` column, so the table shows the titles of the books instead of the long ISBN numbers that identify them:

In [None]:
df.index = df.join(df_books.set_index('isbn'))['title'] # @join: Joins the columns of two dataframes on their index.
                                                        # @set_index: Sets the dataframe index using existing 
                                                        # columns.

And after that we sort the index of the dataframe, so the titles of the books appear in alphabetical order:

In [None]:
df = df.sort_index() # @sort_index: Sorts the dataframe by its row index in ascending order (digits first, then A-Z).
df.head()

user,254,2276,2766,2977,3363,4017,4385,6242,6251,6323,6543,6563,6575,7158,7286,7346,7915,8067,8245,8681,8936,9856,10447,10819,11601,11676,11993,12538,12824,12982,13082,13273,13552,13850,14422,14521,15408,15418,15957,16106,...,264317,264321,264637,265115,265313,265595,265889,266056,266226,266753,266865,266866,267635,268030,268032,268110,268330,268622,268932,269566,269719,269728,269890,270713,270820,271195,271284,271448,271705,273979,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
title
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
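
In the sorted output above, `1st to Die: A Novel` and `2nd Chance` each appear twice, presumably because different ISBNs share a title. The sketch below, on made-up data with hypothetical names, shows one way to list such titles and what `loc` returns for them: a single row (a Series) for a unique title, but several rows (a DataFrame) for a repeated one. Code that expects exactly one row per title would need extra handling for the repeated case.

```python
import pandas as pd

# Made-up book x user table in which one title appears twice.
toy = pd.DataFrame(
    [[9.0, 0.0], [0.0, 8.0], [10.0, 0.0]],
    index=pd.Index(["Book X", "Book Y", "Book Y"], name="title"),
    columns=[1, 2],
)

# Titles that occur more than once in the index.
dup_titles = toy.index[toy.index.duplicated()].unique().tolist()

unique_row = toy.loc["Book X"]  # one match -> Series
dup_rows = toy.loc["Book Y"]    # two matches -> DataFrame

print(dup_titles)
print(type(unique_row).__name__, type(dup_rows).__name__)
```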


If we want to see the values of a row of this dataframe, we can select it with the pandas `loc` indexer. For example:

In [None]:
df.loc["1984"][:5] # @loc: Selects values from the dataframe by label. Here it selects the row with the title of 
                   # the book '1984'.
                   # @[:5]: Shows the first 5 values (those of the first 5 users).

user
254     9.0
2276    0.0
2766    0.0
2977    0.0
3363    0.0
Name: 1984, dtype: float32

Now we are ready to build the model that implements the neighbor search. We keep all the default options except the distance metric, which we set to `cosine`:

In [None]:
model = NearestNeighbors(metric='cosine') # @NearestNeighbors: Unsupervised learner that implements neighbor 
                                          # searches.
model.fit(df.values) # @fit: Fits the model on the values (ratings) of the dataframe we created.

NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)
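
With `metric='cosine'`, the distance between two rating vectors is one minus their cosine similarity, so vectors that point in the same direction are close even if their magnitudes differ. The sketch below recomputes that distance by hand and compares it with what `NearestNeighbors` reports; the rating matrix and the helper `cosine_distance` are made up for the illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def cosine_distance(a, b):
    """1 - cos(angle between a and b); assumes neither vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Made-up book x user ratings: rows 0 and 1 share raters, row 2 shares none.
toy_ratings = np.array([
    [8.0, 0.0, 10.0, 0.0],
    [9.0, 0.0, 7.0, 0.0],
    [0.0, 6.0, 0.0, 9.0],
])

toy_model = NearestNeighbors(metric="cosine").fit(toy_ratings)
dist, idx = toy_model.kneighbors(toy_ratings[:1], n_neighbors=3)

print(dist[0])  # distances from row 0 to itself, row 1 and row 2
print(cosine_distance(toy_ratings[0], toy_ratings[1]))
```

Rows with no rater in common have a dot product of 0 and therefore a cosine distance of 1.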

Now it's time to create the `get_recommends()` function that takes a book title (from the dataset) as an argument and returns a list of 5 similar books with their distances to the book given in the argument.


First, we look at the shape of a single row to learn a bit more about the dataframe. For example, the shape of the first row:

In [None]:
df.iloc[0].shape # @iloc: Selects values from the dataframe by position. Here it selects 
                 # the first row (0).

(888,)

As expected, the shape is `(888,)` because every row of the dataframe has 888 columns, one per remaining user.

We can now do the same with the title of the book given in the challenge statement:

In [None]:
title = 'The Queen of the Damned (Vampire Chronicles (Paperback))'
df.loc[title].shape

(888,)

As expected, the shape is the same, because the row for that title also has 888 columns.

Now we use that title to find its k nearest neighbors. To do so, we query the model with all the values (ratings) of that book, that is, the 888 values of its row in the dataframe:

In [None]:
distance, indice = model.kneighbors([df.loc[title].values], n_neighbors=6) # @kneighbors: Finds the k nearest 
                                                                           # neighbors. Returns the distances to and the indexes 
                                                                           # of those neighbors.
                                                                           # @n_neighbors: Number of neighbors to return (here 
                                                                           # 6: the book itself + 5 neighbors).

print(distance)
print(indice)

[[0.         0.51784116 0.53763384 0.73450685 0.74486566 0.7939835 ]]
[[612 660 648 272 667 110]]


As you can see, this returns the distances to the neighbors of the book and their indexes. 
A smaller distance means the neighbor is nearer (more similar) to the query book. The first result, at distance 0, is the query book itself, which is why we ask for six neighbors. The indexes are the row positions in the dataframe of the query book and of the five books that are most similar to it.
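
The same pattern can be tried on a small made-up table: ask for one neighbor more than needed, because the nearest result is the query row itself. The data and names below (`toy`, `toy_model`, `neighbor_titles`) are hypothetical.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Made-up book x user ratings with titles as the index.
toy = pd.DataFrame(
    [[8.0, 0.0, 10.0, 0.0],
     [9.0, 0.0, 7.0, 0.0],
     [0.0, 6.0, 1.0, 9.0],
     [7.0, 1.0, 9.0, 0.0]],
    index=["Book A", "Book B", "Book C", "Book D"],
)

toy_model = NearestNeighbors(metric="cosine").fit(toy.values)

# 2 neighbors wanted, so request 3: the query book plus its 2 neighbors.
distances, indices = toy_model.kneighbors([toy.loc["Book A"].values], n_neighbors=3)

# Map the returned row positions back to titles.
neighbor_titles = toy.index[indices[0]].tolist()
print(neighbor_titles)
print(distances[0])
```

Here `Book D` comes right after the query book because its ratings point in almost the same direction as those of `Book A`.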

Now that we have stored the distances and the indexes of the neighbor books in two variables, we show the titles of those books:

In [None]:
df.iloc[indice[0]].index.values # @iloc[indice[0]]: Selects the rows at the positions stored in 'indice', that is, all the 
                                # indexes in the array.
                                # @index.values: Returns an array with the index labels (titles) of the selected rows.


array(['The Queen of the Damned (Vampire Chronicles (Paperback))',
       'The Vampire Lestat (Vampire Chronicles, Book II)',
       'The Tale of the Body Thief (Vampire Chronicles (Paperback))',
       'Interview with the Vampire',
       'The Witching Hour (Lives of the Mayfair Witches)', 'Catch 22'],
      dtype=object)

Now we build a dataframe with the query book and its five neighbors, ordered by their distance to the query book.

In [None]:
pd.DataFrame({ # @DataFrame: pandas class used to build a dataframe.
    'title'   : df.iloc[indice[0]].index.values, # 1st column: the index labels of the selected rows 
                                                 # (the book titles).
    'distance': distance[0] # 2nd column: the distance from each neighbor book to the query 
                            # book.
}) \
.sort_values(by='distance', ascending=True) # @sort_values: Sorts the neighbor books by ascending 
                                            # distance to the query book.

,title,distance
0,The Queen of the Damned (Vampire Chronicles (P...,0.0
1,"The Vampire Lestat (Vampire Chronicles, Book II)",0.517841
2,The Tale of the Body Thief (Vampire Chronicles...,0.537634
3,Interview with the Vampire,0.734507
4,The Witching Hour (Lives of the Mayfair Witches),0.744866
5,Catch 22,0.793984


Now we define the function that returns the list with the five recommended books when it receives the title of a book that appears in the dataframe. To do so, we again run the k-nearest neighbors search with all the values (ratings) in the row of the book passed as the argument.

In [None]:
# Function that returns the recommended books.
def get_recommends(title = ""): # Takes the title of a book in the dataframe, as a string.
  try:
    book = df.loc[title] # @loc: Selects values from the dataframe by label. Here it selects the row with the title of 
                         # the given book.
  except KeyError as e:
    print('The given book', e, 'does not exist')
    return

  distance, indice = model.kneighbors([book.values], n_neighbors=6) # @kneighbors: Finds the nearest neighbors of the given 
                                                                    # book. Returns the distances to and the indexes of 
                                                                    # 6 books: the book itself plus its 5 neighbors.

  recommended_books = pd.DataFrame({ # @DataFrame: pandas class used to build a dataframe.
      'title'   : df.iloc[indice[0]].index.values, # 1st column: the index labels of the selected rows 
                                                   # (the book titles).
      'distance': distance[0] # 2nd column: the distance from each neighbor book to the query 
                              # book.
    }) \
    .sort_values(by='distance', ascending=False) \
    .head(5).values # @head: Keeps the first 5 rows; after the descending sort this leaves out the query book (distance 0).

  return [title, recommended_books] # Returns the title of the query book and a NumPy array with the titles of the 
                                    # neighbor books and their distances to the query book.

Now that we have defined the function, we can test it with the book `The Queen of the Damned (Vampire Chronicles (Paperback))` that we used before and check that we get the same results.

In [None]:
get_recommends("The Queen of the Damned (Vampire Chronicles (Paperback))")

['The Queen of the Damned (Vampire Chronicles (Paperback))',
 array([['Catch 22', 0.793983519077301],
        ['The Witching Hour (Lives of the Mayfair Witches)',
         0.7448656558990479],
        ['Interview with the Vampire', 0.7345068454742432],
        ['The Tale of the Body Thief (Vampire Chronicles (Paperback))',
         0.5376338362693787],
        ['The Vampire Lestat (Vampire Chronicles, Book II)',
         0.5178411602973938]], dtype=object)]

As we can see, we get an array with the same neighbor books and the same distances as before, now listed from farthest to nearest and without the query book itself. So the function reproduces the results we obtained step by step.
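
The challenge statement describes the second element of the result as a list of lists, while the function above returns a NumPy object array in that position. The sketch below is a hypothetical variant, not the notebook's function: `recommend_as_lists` takes the ratings table and the fitted model as arguments, drops the query book by title instead of relying on the sort order, and returns plain Python lists ordered from farthest to nearest. It assumes the title is unique in the index. The toy data is made up.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def recommend_as_lists(title, ratings_df, fitted_model, k=5):
    """Return [title, [[neighbor_title, distance], ...]] using plain lists."""
    query = ratings_df.loc[title].values  # assumes a unique title (one row)
    distances, indices = fitted_model.kneighbors([query], n_neighbors=k + 1)
    pairs = [
        [ratings_df.index[i], float(d)]
        for d, i in zip(distances[0], indices[0])
        if ratings_df.index[i] != title   # leave out the query book itself
    ][:k]
    pairs.reverse()                        # farthest first, nearest last
    return [title, pairs]


# Made-up book x user ratings.
toy = pd.DataFrame(
    [[8.0, 0.0, 10.0, 0.0],
     [9.0, 0.0, 7.0, 0.0],
     [0.0, 6.0, 1.0, 9.0],
     [7.0, 1.0, 9.0, 0.0]],
    index=["Book A", "Book B", "Book C", "Book D"],
)
toy_model = NearestNeighbors(metric="cosine").fit(toy.values)

result = recommend_as_lists("Book A", toy, toy_model, k=2)
print(result)
```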

Now we are going to use the cell below to test the function. The `test_book_recommendation()` function will inform you if you passed the challenge or need to keep trying:

In [None]:
books = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
print(books)

def test_book_recommendation():
  test_pass = True
  recommends = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
  if recommends[0] != "Where the Heart Is (Oprah's Book Club (Paperback))":
    test_pass = False
  recommended_books = ["I'll Be Seeing You", 'The Weight of Water', 'The Surgeon', 'I Know This Much Is True']
  recommended_books_dist = [0.8, 0.77, 0.77, 0.77]
  for i in range(2): 
    if recommends[1][i][0] not in recommended_books:
      test_pass = False
    if abs(recommends[1][i][1] - recommended_books_dist[i]) >= 0.05:
      test_pass = False
  if test_pass:
    print("You passed the challenge! 🎉🎉🎉🎉🎉")
  else:
    print("You haven't passed yet. Keep trying!")

test_book_recommendation()

["Where the Heart Is (Oprah's Book Club (Paperback))", array([["I'll Be Seeing You", 0.8016210794448853],
       ['The Weight of Water', 0.7708583474159241],
       ['The Surgeon', 0.7699410915374756],
       ['I Know This Much Is True', 0.7677075266838074],
       ['The Lovely Bones: A Novel', 0.7234864234924316]], dtype=object)]
You passed the challenge! 🎉🎉🎉🎉🎉
