*Note: You are currently reading this using Google Colaboratory, which is a cloud-hosted version of Jupyter Notebook. This is a document containing both text cells for documentation and runnable code cells. If you are unfamiliar with Jupyter Notebook, watch this 3-minute introduction before starting this challenge: https://www.youtube.com/watch?v=inN8seMm7UI*

---

In this challenge, you will create a book recommendation algorithm using **K-Nearest Neighbors**.

You will use the [Book-Crossings dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/). This dataset contains 1.1 million ratings (scale of 1-10) of 270,000 books by 90,000 users. 

After importing and cleaning the data, use `NearestNeighbors` from `sklearn.neighbors` to develop a model that shows books that are similar to a given book. The Nearest Neighbors algorithm measures distance to determine the “closeness” of instances.
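
As a minimal sketch of what "closeness" means here, the following fits `NearestNeighbors` on a tiny, invented book-by-user rating matrix and looks up the nearest neighbor of one book. The titles, the ratings, and the choice of `metric="cosine"` with `algorithm="brute"` are assumptions made for illustration; the challenge text does not prescribe them.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy matrix: one row per book, one column per user (invented values).
titles = ["Book A", "Book B", "Book C", "Book D"]
ratings = np.array([
    [10, 8, 0, 0],
    [9, 7, 0, 1],
    [0, 0, 10, 9],
    [0, 1, 8, 10],
], dtype=float)
sparse_ratings = csr_matrix(ratings)

model = NearestNeighbors(metric="cosine", algorithm="brute")
model.fit(sparse_ratings)

# Ask for 2 neighbors of "Book A": the closest is the book itself,
# so the second entry is its nearest *other* book.
distances, indices = model.kneighbors(sparse_ratings[0], n_neighbors=2)
nearest_title = titles[indices[0][1]]
nearest_distance = distances[0][1]
print(nearest_title, round(float(nearest_distance), 4))
```

A smaller cosine distance means the two books were rated in a more similar pattern by the same users.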

Create a function named `get_recommends` that takes a book title (from the dataset) as an argument and returns a list of 5 similar books with their distances from the book argument.

This code:

`get_recommends("The Queen of the Damned (Vampire Chronicles (Paperback))")`

should return:

```
[
  'The Queen of the Damned (Vampire Chronicles (Paperback))',
  [
    ['Catch 22', 0.793983519077301], 
    ['The Witching Hour (Lives of the Mayfair Witches)', 0.7448656558990479], 
    ['Interview with the Vampire', 0.7345068454742432],
    ['The Tale of the Body Thief (Vampire Chronicles (Paperback))', 0.5376338362693787],
    ['The Vampire Lestat (Vampire Chronicles, Book II)', 0.5178412199020386]
  ]
]
```

Notice that the data returned from `get_recommends()` is a list. The first element in the list is the book title passed in to the function. The second element in the list is a list of five more lists. Each of the five lists contains a recommended book and the distance from the recommended book to the book passed in to the function.
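
To make that structure concrete, here is a short sketch that unpacks the expected result shown above. The variable names are illustrative only and are not required by the challenge.

```python
# The expected result from the example above, copied as a literal.
result = [
    'The Queen of the Damned (Vampire Chronicles (Paperback))',
    [
        ['Catch 22', 0.793983519077301],
        ['The Witching Hour (Lives of the Mayfair Witches)', 0.7448656558990479],
        ['Interview with the Vampire', 0.7345068454742432],
        ['The Tale of the Body Thief (Vampire Chronicles (Paperback))', 0.5376338362693787],
        ['The Vampire Lestat (Vampire Chronicles, Book II)', 0.5178412199020386],
    ],
]

# First element: the queried title. Second element: five [title, distance] pairs.
title, recommendations = result
listed_distances = [distance for _, distance in recommendations]

for recommended_title, distance in recommendations:
    print(f"{distance:.3f}  {recommended_title}")
```

In this example the pairs are listed from the largest distance to the smallest, so the most similar book comes last.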

If you graph the dataset (optional), you will notice that most books are not rated frequently. To ensure statistical significance, remove from the dataset users with fewer than 200 ratings and books with fewer than 100 ratings.
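
One possible way to apply such thresholds with pandas is sketched below on an invented toy table. The column names match the ones used later in this notebook, but the data and the lowered thresholds (2 instead of 200 and 100) are assumptions chosen so the example stays small.

```python
import pandas as pd

# Invented toy ratings; the real thresholds are 200 (users) and 100 (books).
toy_ratings = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 3],
    "isbn": ["A", "B", "C", "A", "B", "A"],
    "rating": [5, 7, 9, 6, 8, 10],
})
MIN_USER_RATINGS = 2  # stand-in for 200
MIN_BOOK_RATINGS = 2  # stand-in for 100

# Count how many ratings each user gave and each book received.
user_counts = toy_ratings["user"].value_counts()
book_counts = toy_ratings["isbn"].value_counts()

keep_users = user_counts[user_counts >= MIN_USER_RATINGS].index
keep_books = book_counts[book_counts >= MIN_BOOK_RATINGS].index

# Keep only rows whose user AND book both meet their threshold.
filtered = toy_ratings[
    toy_ratings["user"].isin(keep_users) & toy_ratings["isbn"].isin(keep_books)
]
print(filtered)
```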

The first three cells import libraries you may need and the data to use. The final cell is for testing. Write all your code in between those cells.

In [1]:
# import libraries (you may add additional imports but you may not have to)
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

In [2]:
# not the suggested library
from sklearn.neighbors import KNeighborsClassifier

In [3]:
# get data files
!wget https://cdn.freecodecamp.org/project-data/books/book-crossings.zip

!unzip book-crossings.zip

books_filename = 'BX-Books.csv'
ratings_filename = 'BX-Book-Ratings.csv'

--2022-09-12 21:46:23--  https://cdn.freecodecamp.org/project-data/books/book-crossings.zip
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 172.67.70.149, 104.26.2.33, 104.26.3.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|172.67.70.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26085508 (25M) [application/zip]
Saving to: ‘book-crossings.zip.4’


2022-09-12 21:46:23 (67.8 MB/s) - ‘book-crossings.zip.4’ saved [26085508/26085508]

Archive:  book-crossings.zip
replace BX-Book-Ratings.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


In [4]:
# import csv data into dataframes
df_books = pd.read_csv(
    books_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author'],
    usecols=['isbn', 'title', 'author'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})

In [5]:
# ignoring case differences that should not exist
def create_duplicates_to_modify_from_existing_dataframes(df_books,df_ratings):
    return df_books.copy(), df_ratings.copy()

def remove_case_differences(df_books,df_ratings):
  df_books.isbn = df_books.isbn.str.upper()
  df_books.author = df_books.author.str.title()
  df_ratings.isbn = df_ratings.isbn.str.upper()



In [6]:
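# removing duplicates
def remove_duplicate_books(df_books):
  # drop repeated ISBNs in place (keeping the first occurrence) so that the
  # caller's dataframe is actually modified; rebinding the local name had no effect
  df_books.drop_duplicates(subset="isbn", inplace=True)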


In [7]:
# merging everything together
def creating_one_joined_table(df_ratings, df_books):
  df_all = df_ratings.join(df_books.set_index("isbn"),on="isbn")
  return df_all

def ensuring_less_memory_usage(df_all):
  # rating as integer 8
  df_all.rating = df_all.rating.astype("int8")
  df_all.user = df_all.user.astype("int32")

In [8]:
def create_base_table_modifications():
  df_books_copy, df_ratings_copy = create_duplicates_to_modify_from_existing_dataframes(df_books,df_ratings)
  # remove_case_differences(df_books_copy,df_ratings_copy)
  # remove_duplicate_books(df_books_copy)
  df_all = creating_one_joined_table(df_ratings_copy,df_books_copy)
  ensuring_less_memory_usage(df_all)
  # adding a better unique identifier
  df_all["author_title"] = df_all["author"] + df_all["title"]
  return df_all



In [9]:
# removing sparse user and book data (users that have rated fewer than 200 books and books rated fewer than 100 times)
def limit_by_user_and_book_simultaneous(df_all,book_identifier_column):
  user_groupings = df_all.groupby("user").count()
  users_to_include_first = user_groupings[user_groupings[book_identifier_column] > 199].index

  book_groupings = df_all.groupby(book_identifier_column).count()
  books_to_include_first = book_groupings[book_groupings["user"] > 99].index

  simulataneous_filtering = df_all[(df_all["user"].isin(users_to_include_first)) & (df_all[book_identifier_column].isin(books_to_include_first))]
  
  user_first_filtering = df_all[(df_all["user"].isin(users_to_include_first))]
  book_first_filtering = df_all[(df_all[book_identifier_column].isin(books_to_include_first))]
  
  # creating the second filtering
  user_groupings = book_first_filtering.groupby("user").count()
  users_to_include_second = user_groupings[user_groupings[book_identifier_column] > 199].index

  book_groupings = user_first_filtering.groupby(book_identifier_column).count()
  books_to_include_second = book_groupings[book_groupings["user"] > 99].index

  user_first_filtering = user_first_filtering[user_first_filtering[book_identifier_column].isin(books_to_include_second)]
  book_first_filtering = book_first_filtering[book_first_filtering["user"].isin(users_to_include_second)]

  return simulataneous_filtering, user_first_filtering, book_first_filtering
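
The function above returns three variants because the order of filtering can change which rows survive: once sparse users are removed, some books may fall below their threshold, and vice versa. The sketch below shows this on invented data with both thresholds lowered to 2. `keep_frequent` is a hypothetical helper written for this illustration and is not part of the notebook.

```python
import pandas as pd

# Invented toy ratings: (user, isbn) pairs.
toy = pd.DataFrame({
    "user": [1, 1, 2, 3, 3],
    "isbn": ["A", "B", "A", "B", "C"],
})

def keep_frequent(df, column, minimum):
    """Keep rows whose value in `column` occurs at least `minimum` times in df."""
    counts = df[column].value_counts()
    return df[df[column].isin(counts[counts >= minimum].index)]

# Simultaneous: both counts are taken on the full table.
users_ok = toy["user"].map(toy["user"].value_counts()) >= 2
books_ok = toy["isbn"].map(toy["isbn"].value_counts()) >= 2
simultaneous = toy[users_ok & books_ok]

# Sequential: the second count is taken on the already-filtered table.
user_first = keep_frequent(keep_frequent(toy, "user", 2), "isbn", 2)
book_first = keep_frequent(keep_frequent(toy, "isbn", 2), "user", 2)

print(len(simultaneous), len(user_first), len(book_first))
```

Here the simultaneous filter keeps more rows than either sequential order, because the sequential versions recount after the first pass.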

In [10]:
def reindex_by_row_number(df):
    df["index"] = list(range(len(df)))
    df.set_index("index",inplace=True)


In [11]:
def okay_data(model):
  """Return True if the filtered data still contains the books the checks rely on."""
  output = False
  necessary_books = set(["The Queen of the Damned (Vampire Chronicles (Paperback))",'Catch 22','The Witching Hour (Lives of the Mayfair Witches)','Interview with the Vampire','The Tale of the Body Thief (Vampire Chronicles (Paperback))','The Vampire Lestat (Vampire Chronicles, Book II)',"Where the Heart Is (Oprah's Book Club (Paperback))"])
  two_more_necessary_books_among_maybe_others = set(["I'll Be Seeing You", 'The Weight of Water', 'The Surgeon', 'I Know This Much Is True'])
  if necessary_books.difference(model.title) != set():
    print("There are some books that are supposed to be suggested that are missing from this data.")
  else:
    two_book_difference = two_more_necessary_books_among_maybe_others.difference(model.title)
    # at least two of the four candidate books must be present
    if len(two_book_difference) < 3:
      output = True
      if len(two_book_difference) == 0:
        print("Every book that might be necessary is present; this data is great.")
  # return on every path (previously the missing-books branch returned None)
  return output
  

In [12]:

def create_nearest_neighbors_model(a_dataframe,number_of_neighbors,number_of_recommendations,index_column="author_title"):
    """Note that the first table inserted into the model needs to be the one that will be used as an index"""
    my_basic_model = a_dataframe[([index_column] + ["user","rating"])]
    my_basic_model = my_basic_model.groupby([index_column,"user"]).mean().reset_index()
    my_basic_model = my_basic_model.pivot(index=index_column, columns='user', values='rating').fillna(0).astype("int8")
    my_model = csr_matrix(my_basic_model.values)

    # making the nearest neighbors model
    N_predicted_neighbours = number_of_neighbors
    KNN = NearestNeighbors(metric='cosine', n_neighbors=N_predicted_neighbours, n_jobs=-1)
    KNN.fit(my_model)
    distances, indices = KNN.kneighbors(my_basic_model)
    # return distances, indices
    def get_recommends(book = ""):
        index_items = list(set(a_dataframe[a_dataframe.title == book][index_column].to_list()))
        full_neighbors_for_all_isbns = []
        for index in index_items:
          # look up the row of the current identifier (previously always index_items[0])
          testing = np.where(my_basic_model.index==index)[0][0]
          isbns_of_related_books = my_basic_model.index[indices[testing][:(number_of_recommendations + 1)]]
          book_names_of_related_books = [a_dataframe[a_dataframe[index_column] == x]["title"].to_list()[0] for x in isbns_of_related_books]
          distances_of_related_books = distances[testing][:(number_of_recommendations + 1)]
          # collect the neighbors of every matching identifier, not only the last one
          full_neighbors_for_all_isbns += [[item[0],item[1]] for item in list(zip(book_names_of_related_books,distances_of_related_books))]
        full_neighbors_for_all_isbns.sort(key=(lambda x: x[1]))

        # removing duplicates if there are any
        already_found = []
        unique_suggestions_list = []
        for pair in full_neighbors_for_all_isbns:
          current_suggestion = pair[0]
          if current_suggestion not in already_found:
            unique_suggestions_list.append(pair)
            already_found.append(current_suggestion)

        returned_suggestions = unique_suggestions_list[1:(number_of_recommendations + 1)] 
        returned_suggestions.reverse()

        return [book,returned_suggestions]
    return get_recommends
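
The reshaping step inside `create_nearest_neighbors_model` (long table, to wide book-by-user table, to sparse matrix) can be sketched on its own as follows. The toy data and variable names are invented for illustration. Note that after `fillna(0)` a missing rating and a rating of 0 look the same to the model.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Invented long-format ratings: one row per (title, user) pair.
toy = pd.DataFrame({
    "title": ["Book A", "Book A", "Book B", "Book C"],
    "user": [1, 2, 1, 3],
    "rating": [10, 8, 9, 7],
})

# Wide format: one row per title, one column per user, 0 where no rating exists.
wide = toy.pivot(index="title", columns="user", values="rating").fillna(0)

# The sparse matrix stores only the non-zero entries.
sparse = csr_matrix(wide.values)
print(wide.shape, sparse.nnz)
```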



In [13]:

# creating various models
df_all = create_base_table_modifications()
simulataneous_filtering, user_first_filtering, book_first_filtering = limit_by_user_and_book_simultaneous(df_all,"author_title")
simulataneous_filtering_isbn, user_first_filtering_isbn, book_first_filtering_isbn = limit_by_user_and_book_simultaneous(df_all,"isbn")

# testing models (but no longer)
# possible_models = [simulataneous_filtering, user_first_filtering, book_first_filtering,simulataneous_filtering_isbn, user_first_filtering_isbn, book_first_filtering_isbn]
# okay_data(book_first_filtering)

# creating the actual model used
get_recommends = create_nearest_neighbors_model(simulataneous_filtering_isbn,6,5,index_column="author_title")

In [14]:
simulataneous_filtering_isbn

Unnamed: 0,user,isbn,rating,title,author,author_title
1456,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,James Finn GarnerPolitically Correct Bedtime S...
1469,277427,0060930535,0,The Poisonwood Bible: A Novel,Barbara Kingsolver,Barbara KingsolverThe Poisonwood Bible: A Novel
1471,277427,0060934417,0,Bel Canto: A Novel,Ann Patchett,Ann PatchettBel Canto: A Novel
1474,277427,0061009059,9,One for the Money (Stephanie Plum Novels (Pape...,Janet Evanovich,Janet EvanovichOne for the Money (Stephanie Pl...
1484,277427,0140067477,0,The Tao of Pooh,Benjamin Hoff,Benjamin HoffThe Tao of Pooh
...,...,...,...,...,...,...
1147304,275970,0804111359,0,Secret History,DONNA TARTT,DONNA TARTTSecret History
1147436,275970,140003065X,0,A Fine Balance,Rohinton Mistry,Rohinton MistryA Fine Balance
1147439,275970,1400031346,0,The No. 1 Ladies' Detective Agency,Alexander McCall Smith,Alexander McCall SmithThe No. 1 Ladies' Detect...
1147440,275970,1400031354,0,Tears of the Giraffe (No.1 Ladies Detective Ag...,Alexander McCall Smith,Alexander McCall SmithTears of the Giraffe (No...


In [15]:
# comparing against the instructions (not an ideal check, as I finished this before optimizing the model)
get_recommends("The Queen of the Damned (Vampire Chronicles (Paperback))")

# [
#   'The Queen of the Damned (Vampire Chronicles (Paperback))',
#   [
#     ['Catch 22', 0.793983519077301], 
#     ['The Witching Hour (Lives of the Mayfair Witches)', 0.7448656558990479], 
#     ['Interview with the Vampire', 0.7345068454742432],
#     ['The Tale of the Body Thief (Vampire Chronicles (Paperback))', 0.5376338362693787],
#     ['The Vampire Lestat (Vampire Chronicles, Book II)', 0.5178412199020386]
#   ]
# ]

['The Queen of the Damned (Vampire Chronicles (Paperback))',
 [['Catch 22', 0.7939835419270879],
  ['The Witching Hour (Lives of the Mayfair Witches)', 0.7448657003312193],
  ['Interview with the Vampire', 0.7345068863988313],
  ['The Tale of the Body Thief (Vampire Chronicles (Paperback))',
   0.5376338446489461],
  ['The Vampire Lestat (Vampire Chronicles, Book II)', 0.5178411864186413]]]

In [16]:
books = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
print(books)

def test_book_recommendation():
  test_pass = True
  recommends = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
  if recommends[0] != "Where the Heart Is (Oprah's Book Club (Paperback))":
    test_pass = False
  recommended_books = ["I'll Be Seeing You", 'The Weight of Water', 'The Surgeon', 'I Know This Much Is True']
  recommended_books_dist = [0.8, 0.77, 0.77, 0.77]
  for i in range(2): 
    if recommends[1][i][0] not in recommended_books:
      test_pass = False
    if abs(recommends[1][i][1] - recommended_books_dist[i]) >= 0.05:
      test_pass = False
  if test_pass:
    print("You passed the challenge! 🎉🎉🎉🎉🎉")
  else:
    print("You haven't passed yet. Keep trying!")

test_book_recommendation()

["Where the Heart Is (Oprah's Book Club (Paperback))", [["I'll Be Seeing You", 0.8016210581447822], ['The Weight of Water', 0.7708583572697412], ['The Surgeon', 0.7699410973804288], ['I Know This Much Is True', 0.7677075092617776], ['The Lovely Bones: A Novel', 0.7234864549790632]]]
You passed the challenge! 🎉🎉🎉🎉🎉
