<a href="https://colab.research.google.com/github/alanpirotta/freecodecamp_certif/blob/main/fcc_book_recommendation_knn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Note: You are currently reading this using Google Colaboratory which is a cloud-hosted version of Jupyter Notebook. This is a document containing both text cells for documentation and runnable code cells. If you are unfamiliar with Jupyter Notebook, watch this 3-minute introduction before starting this challenge: https://www.youtube.com/watch?v=inN8seMm7UI*

---

In this challenge, you will create a book recommendation algorithm using **K-Nearest Neighbors**.

You will use the [Book-Crossings dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/). This dataset contains 1.1 million ratings (scale of 1-10) of 270,000 books by 90,000 users. 

After importing and cleaning the data, use `NearestNeighbors` from `sklearn.neighbors` to develop a model that shows books that are similar to a given book. The Nearest Neighbors algorithm measures distance to determine the “closeness” of instances.
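
As a rough illustration of what "distance" means here, the sketch below compares a hand-computed cosine distance with the value `NearestNeighbors` returns. The tiny rating matrix and the helper `cosine_distance` are made up for this example; the choice of the cosine metric is an assumption taken from the solution later in this notebook, not a requirement of the challenge.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy item-by-user rating matrix (rows = items, columns = users); values are invented.
ratings = np.array([
    [5.0, 0.0, 3.0],   # item 0
    [4.0, 0.0, 3.0],   # item 1: rated much like item 0
    [0.0, 5.0, 0.0],   # item 2: rated only by a different user
])

def cosine_distance(a, b):
    """Hypothetical helper: cosine distance = 1 - cosine similarity."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Brute-force search using the cosine metric
model = NearestNeighbors(metric="cosine", algorithm="brute").fit(ratings)
distances, indices = model.kneighbors(ratings[[0]], n_neighbors=3)

# Distance from item 0 to item 1, computed by hand for comparison
manual = cosine_distance(ratings[0], ratings[1])
print(indices[0], distances[0], manual)
```

Items rated by the same users in a similar way end up with a distance close to 0, while items with no raters in common end up at distance 1.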

Create a function named `get_recommends` that takes a book title (from the dataset) as an argument and returns a list of 5 similar books with their distances from the book argument.

This code:

`get_recommends("The Queen of the Damned (Vampire Chronicles (Paperback))")`

should return:

```
[
  'The Queen of the Damned (Vampire Chronicles (Paperback))',
  [
    ['Catch 22', 0.793983519077301], 
    ['The Witching Hour (Lives of the Mayfair Witches)', 0.7448656558990479], 
    ['Interview with the Vampire', 0.7345068454742432],
    ['The Tale of the Body Thief (Vampire Chronicles (Paperback))', 0.5376338362693787],
    ['The Vampire Lestat (Vampire Chronicles, Book II)', 0.5178412199020386]
  ]
]
```

Notice that the data returned from `get_recommends()` is a list. The first element in the list is the book title passed in to the function. The second element in the list is a list of five more lists. Each of the five lists contains a recommended book and the distance from the recommended book to the book passed in to the function.

If you graph the dataset (optional), you will notice that most books are not rated frequently. To ensure statistical significance, remove from the dataset users with fewer than 200 ratings and books with fewer than 100 ratings.

The first three cells import libraries you may need and the data to use. The final cell is for testing. Write all your code in between those cells.

In [None]:
# import libraries (you may add additional imports but you may not have to)
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

In [None]:
# get data files
!wget https://cdn.freecodecamp.org/project-data/books/book-crossings.zip

!unzip book-crossings.zip

books_filename = 'BX-Books.csv'
ratings_filename = 'BX-Book-Ratings.csv'

In [None]:
# import csv data into dataframes
df_books = pd.read_csv(
    books_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author'],
    usecols=['isbn', 'title', 'author'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})

In [None]:
# add your code here - consider creating a new cell for each section of code

In [None]:
print(f'Books: {len(df_books)}')
df_books.head()

Books: 271379


Unnamed: 0,isbn,title,author
0,0195153448,Classical Mythology,Mark P. O. Morford
1,0002005018,Clara Callan,Richard Bruce Wright
2,0060973129,Decision in Normandy,Carlo D'Este
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata
4,0393045218,The Mummies of Urumchi,E. J. W. Barber


In [None]:
print(f'User ratings: {len(df_ratings)}')
df_ratings.head()

User ratings: 1149780


Unnamed: 0,user,isbn,rating
0,276725,034545104X,0.0
1,276726,0155061224,5.0
2,276727,0446520802,0.0
3,276729,052165615X,3.0
4,276729,0521795028,6.0


### Reduce the datasets by removing users with fewer than 200 ratings and books with fewer than 100 ratings
First, get the series of users and books that meet the criteria.

In [None]:
high_count_rating_users = df_ratings.groupby(['user'])['user'].count()
high_count_rating_users = high_count_rating_users.sort_values()
print(f'total users: {len(high_count_rating_users)}')
high_count_rating_users = high_count_rating_users[ high_count_rating_users >= 200]
print(f'Users with at least 200 ratings: {len(high_count_rating_users)}')

total users: 105283
Users with at least 200 ratings: 905


In [None]:
high_count_rating_books = df_ratings.groupby(['isbn'])['isbn'].count()
high_count_rating_books = high_count_rating_books.sort_values()
print(f'total books: {len(high_count_rating_books)}')
high_count_rating_books = high_count_rating_books[ high_count_rating_books >= 100]
print(f'Books with at least 100 ratings: {len(high_count_rating_books)}')

total books: 340556
Books with at least 100 ratings: 731


Second, check that a user who should be dropped is absent from the filtered series, and that a user who should stay is present.

In [None]:
df_ratings.groupby(['user'])['user'].count().sort_values()

In [None]:
high_count_rating_users

In [None]:
# `in` on a Series tests the index labels, so compare against `.values` instead
print('Dropped user:')
print(f'User 276725 is in original DF? {276725 in df_ratings.user.values}')
print(f'User 276725 is in filtered DF? {276725 in high_count_rating_users.index}')
print("")
print('Ok user:')
print(f'User 36554 is in original DF? {36554 in df_ratings.user.values}')
print(f'User 36554 is in filtered DF? {36554 in high_count_rating_users.index}')

Dropped user:
User 276725 is in original DF? True
User 276725 is in filtered DF? False

Ok user:
User 36554 is in original DF? True
User 36554 is in filtered DF? True
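
One detail worth keeping in mind for membership checks like the one above: on a pandas Series, the `in` operator tests the index labels rather than the stored values. The sketch below uses a made-up three-row Series to show the difference; the variable names are only illustrative.

```python
import pandas as pd

# Toy Series with the default index 0, 1, 2
users = pd.Series([276725, 276726, 276727], name="user")

label_hit = 1 in users                  # 1 is an index label
value_as_label = 276725 in users        # 276725 is a value, not a label
value_hit = 276725 in users.values      # checks the underlying values
value_hit_alt = users.eq(276725).any()  # same check, staying in pandas

print(label_hit, value_as_label, value_hit, value_hit_alt)
```

On a large DataFrame with a default integer index, a user id can happen to be a valid row label, so a values-based check is the safer way to test whether a user is present.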


Third, filter the original datasets, dropping the users and books with too few ratings.

In [None]:
f_df_books = df_books[df_books['isbn'].isin(high_count_rating_books.index)]
f_df_ratings = df_ratings[(df_ratings['isbn'].isin(high_count_rating_books.index)) & (df_ratings['user'].isin(high_count_rating_users.index))]
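
The same filtering idea can be tried on a toy table to see exactly which rows survive. This is only a sketch: the data and the thresholds `MIN_USER_RATINGS` and `MIN_BOOK_RATINGS` are invented (2 stands in for 200 and 100), and `value_counts()` is used here as an alternative to the `groupby(...).count()` calls above.

```python
import pandas as pd

# Invented ratings: user 3 and book "C" each appear only once
toy = pd.DataFrame({
    "user":   [1, 1, 1, 2, 2, 3],
    "isbn":   ["A", "B", "C", "A", "B", "A"],
    "rating": [5.0, 0.0, 7.0, 8.0, 0.0, 9.0],
})

MIN_USER_RATINGS = 2  # hypothetical stand-in for 200
MIN_BOOK_RATINGS = 2  # hypothetical stand-in for 100

# Count ratings per user and per book on the unfiltered data
user_counts = toy["user"].value_counts()
book_counts = toy["isbn"].value_counts()

keep_users = user_counts[user_counts >= MIN_USER_RATINGS].index
keep_books = book_counts[book_counts >= MIN_BOOK_RATINGS].index

# Keep a row only if both its user and its book pass the threshold
filtered = toy[toy["user"].isin(keep_users) & toy["isbn"].isin(keep_books)]
print(filtered)
```

Both counts are taken from the unfiltered data, as in the cells above, so a book can pass the threshold even if some of its ratings come from users who are later removed.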

### Join the two dataframes into one

In [None]:
data = f_df_ratings.merge(right=f_df_books, on='isbn')
data.head()

Unnamed: 0,user,isbn,rating,title,author
0,277427,002542730X,10.0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner
1,3363,002542730X,0.0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner
2,11676,002542730X,6.0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner
3,12538,002542730X,10.0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner
4,13552,002542730X,0.0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner


Check whether the two DataFrames have the same number of rows.
They don't, so I extract the ISBNs from the filtered ratings that found no match in the join and check whether they appear in `df_books`.
Conclusion: those 4 ISBNs aren't in the `df_books` DataFrame. Since their titles are unknown, it's fine for them to be dropped.

*If the test fails, I can keep these rows by changing the join method in `df.merge`.*
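
A possible way to keep and inspect unmatched rows is a left join with `indicator=True`, which adds a `_merge` column saying where each row came from. The sketch below uses invented toy frames; it is not what this notebook does, only an illustration of the alternative join method.

```python
import pandas as pd

# Toy data: ISBN "Z" has ratings but no entry in the books table
ratings = pd.DataFrame({"user": [1, 2, 3], "isbn": ["A", "B", "Z"], "rating": [5.0, 7.0, 9.0]})
books = pd.DataFrame({"isbn": ["A", "B"], "title": ["Book A", "Book B"]})

# how="left" keeps every rating row; indicator=True adds the `_merge` column
merged = ratings.merge(books, on="isbn", how="left", indicator=True)

# Rows marked "left_only" had no matching book
unmatched_isbns = merged.loc[merged["_merge"] == "left_only", "isbn"].unique().tolist()
print(unmatched_isbns)
```

With a left join the unmatched ratings stay in the result, but their `title` is NaN, so they would still need a placeholder title before pivoting on `title`.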

In [None]:
print(len(data))
print(len(f_df_ratings))
missing_books = f_df_ratings[ ~(f_df_ratings['isbn'].isin(data['isbn']))].isbn.unique()

49517
49781


In [None]:
for book in missing_books:
    print(f'Book {book} values in original df_books:\n {df_books.isbn.isin([book]).value_counts()}\n')
print("")
for book in missing_books:
    print(f'Book {book} occurrences in original df_ratings:\n {df_ratings.isbn.isin([book]).value_counts()[True]}\n')

Book 0679781587 values in original df_books:
 False    271379
Name: isbn, dtype: int64

Book 0749397543 values in original df_books:
 False    271379
Name: isbn, dtype: int64

Book 0552124753 values in original df_books:
 False    271379
Name: isbn, dtype: int64

Book 0091867770 values in original df_books:
 False    271379
Name: isbn, dtype: int64


Book 0679781587 occurrences in original df_ratings:
 639

Book 0749397543 occurrences in original df_ratings:
 160

Book 0552124753 occurrences in original df_ratings:
 127

Book 0091867770 occurrences in original df_ratings:
 112



### Several checks to see if the data is ok to use
**First:** Check that the same ISBN maps to the same title in the original and merged DataFrames. The titles match, so the merge is OK.

In [None]:
df_books[ df_books['isbn'] == "0140067477"]['title']

73    The Tao of Pooh
Name: title, dtype: object

In [None]:
data[ data['isbn'] == "0140067477"].iloc[0,3]

'The Tao of Pooh'

**Second:** Check for NaN values in the `user` column. There aren't any.

In [None]:
data.user.isnull().value_counts()

False    49517
Name: user, dtype: int64
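
To cover every column in one pass rather than one column at a time, a check along these lines could be used. The toy frame below is invented and deliberately contains missing values so the counts are visible.

```python
import numpy as np
import pandas as pd

# Invented frame with one missing rating and one missing title
toy = pd.DataFrame({
    "user": [1, 2, 3],
    "rating": [5.0, np.nan, 7.0],
    "title": ["Book A", "Book B", None],
})

# Number of missing values in each column
nan_per_column = toy.isnull().sum()

# Single yes/no answer for the whole frame
any_nan = bool(toy.isnull().values.any())
print(nan_per_column, any_nan)
```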

**Third:** check if any rating is below 0 or above 10. There aren't any.

In [None]:
data[ (data['rating'] < 0) | (data['rating'] > 10)]

Unnamed: 0,user,isbn,rating,title,author


**Fourth:** Check for duplicates (the same user rating the same title two or more times).

Some titles appear under two ISBNs, so a user can have several rows for one title. I'll drop the duplicates.

In [None]:
data[data[['user','title']].duplicated(keep=False)].sort_values(by='user')

Unnamed: 0,user,isbn,rating,title,author
3432,254,0439064872,9.0,Harry Potter and the Chamber of Secrets (Book 2),J. K. Rowling
3541,254,0439136369,9.0,Harry Potter and the Prisoner of Azkaban (Book 3),J. K. Rowling
17985,254,0439136350,9.0,Harry Potter and the Prisoner of Azkaban (Book 3),J. K. Rowling
17923,254,0439064864,9.0,Harry Potter and the Chamber of Secrets (Book 2),J. K. Rowling
31476,6251,0679444815,0.0,Timeline,Michael Crichton
...,...,...,...,...,...
3757,278418,0440225701,0.0,The Street Lawyer,JOHN GRISHAM
11745,278418,0385490992,0.0,The Street Lawyer,John Grisham
15161,278418,0451181379,0.0,The Door to December,Dean R. Koontz
13591,278418,044023722X,0.0,A Painted House,John Grisham


In [None]:
data = data.drop_duplicates(['user','title'])
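
The effect of `duplicated(keep=False)` and `drop_duplicates` can be seen on a small invented example. Note that `drop_duplicates` keeps the first row of each group by default, so if the duplicate rows carried different ratings, the later ones would be discarded.

```python
import pandas as pd

# Invented data: one user rated "Same Title" under two different ISBNs
toy = pd.DataFrame({
    "user":   [254, 254, 254],
    "isbn":   ["X1", "X2", "Y1"],
    "rating": [9.0, 8.0, 7.0],
    "title":  ["Same Title", "Same Title", "Other Title"],
})

# keep=False marks every member of a duplicate group, not just the repeats
dupes = toy[toy[["user", "title"]].duplicated(keep=False)]

# Default keep="first": the row with isbn "X1" survives, "X2" is dropped
deduped = toy.drop_duplicates(["user", "title"])
print(dupes)
print(deduped)
```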

In [None]:
data[ (data.title == 'Harry Potter and the Prisoner of Azkaban (Book 3)') & (data.user == 254)]

Unnamed: 0,user,isbn,rating,title,author
3541,254,0439136369,9.0,Harry Potter and the Prisoner of Azkaban (Book 3),J. K. Rowling


### Create the pivot table (title × user matrix) for the NearestNeighbors model, then fit the model

In [None]:
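# rows = titles, columns = users; title/user pairs without a rating become 0
user_title_matrix = data.pivot(index='title', columns='user', values='rating').fillna(0)
matrix_values = user_title_matrix.values
titles_list = list(user_title_matrix.index.values)
# brute-force search with cosine distance, fitted on the plain NumPy array used for queries
nbrs = NearestNeighbors(n_neighbors=2, algorithm='brute', metric='cosine').fit(matrix_values)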
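
Since `csr_matrix` is imported at the top of the notebook but the dense matrix is what gets fitted, here is a small sketch of the sparse variant on invented data. Whether it is worth using for the real matrix is an assumption left to the reader; most entries of the real matrix are 0, which is the case a sparse format is meant for.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Invented ratings: titles A and B share raters, C does not
toy = pd.DataFrame({
    "user":   [1, 2, 1, 2, 3],
    "title":  ["A", "A", "B", "B", "C"],
    "rating": [10.0, 8.0, 9.0, 8.0, 7.0],
})

# rows = titles, columns = users; missing ratings become 0
matrix = toy.pivot(index="title", columns="user", values="rating").fillna(0)
sparse = csr_matrix(matrix.values)

model = NearestNeighbors(metric="cosine", algorithm="brute").fit(sparse)

# Query with the row for title "A"; the first neighbor is "A" itself
distances, indices = model.kneighbors(sparse[0], n_neighbors=2)
nearest_title = matrix.index[indices[0][1]]
print(nearest_title, distances[0])
```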

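### Create the function
The function works on the pivot table built from the `data` DataFrame and on the model fitted in the previous cell.

It queries the model for 6 neighbors, drops the nearest one (normally the queried book itself, at distance 0), and returns the remaining 5 ordered from farthest to nearest.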

In [None]:
# function to return recommended books - this will be tested
def get_recommends(book = ""):
  # look up the row of ratings for the requested title
  title_index = titles_list.index(book)
  title_ratings = matrix_values[title_index]
  # ask for 6 neighbors: the nearest is normally the book itself (distance 0)
  distances, title_indexes = nbrs.kneighbors(X=np.reshape(title_ratings,(1,-1)), n_neighbors=6)
  results=[]
  for distance, neighbor_index in zip(distances[0], title_indexes[0]):
    # insert at the front so the farthest neighbor ends up first
    results.insert(0,[titles_list[neighbor_index],distance])
  # the last entry is now the nearest neighbor (the queried book); drop it
  results.pop(-1)
  recommended_books = [ book, results]
  return recommended_books

Use the cell below to test your function. The `test_book_recommendation()` function will inform you if you passed the challenge or need to keep trying.

In [None]:
get_recommends(book = "Where the Heart Is (Oprah's Book Club (Paperback))")

["Where the Heart Is (Oprah's Book Club (Paperback))",
 [["I'll Be Seeing You", 0.8016211],
  ['The Weight of Water', 0.77085835],
  ['The Surgeon', 0.7699411],
  ['I Know This Much Is True', 0.7677075],
  ['The Lovely Bones: A Novel', 0.7234864]]]

In [None]:
books = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
print(books)

def test_book_recommendation():
  test_pass = True
  recommends = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
  if recommends[0] != "Where the Heart Is (Oprah's Book Club (Paperback))":
    test_pass = False
  recommended_books = ["I'll Be Seeing You", 'The Weight of Water', 'The Surgeon', 'I Know This Much Is True']
  recommended_books_dist = [0.8, 0.77, 0.77, 0.77]
  for i in range(2): 
    if recommends[1][i][0] not in recommended_books:
      test_pass = False
    if abs(recommends[1][i][1] - recommended_books_dist[i]) >= 0.05:
      test_pass = False
  if test_pass:
    print("You passed the challenge! 🎉🎉🎉🎉🎉")
  else:
    print("You haven't passed yet. Keep trying!")

test_book_recommendation()

["Where the Heart Is (Oprah's Book Club (Paperback))", [["I'll Be Seeing You", 0.8016211], ['The Weight of Water', 0.77085835], ['The Surgeon', 0.7699411], ['I Know This Much Is True', 0.7677075], ['The Lovely Bones: A Novel', 0.7234864]]]
You passed the challenge! 🎉🎉🎉🎉🎉
