### Simplest Book Recommendation System ###
### This is a very basic ML model that recommends similar books when given a particular book title.

##### Example link: https://thecleverprogrammer.com/2021/01/17/book-recommendation-system/

In [1]:
import pandas as pd

In [2]:
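# on_bad_lines='warn' (pandas >= 1.3) replaces the deprecated error_bad_lines=False: malformed rows are skipped with a warning.
df = pd.read_csv("D:/abir/ai_ml_projects/recommender/book_recommender/data/books.csv", sep=',', on_bad_lines='warn')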

b'Skipping line 3350: expected 12 fields, saw 13\nSkipping line 4704: expected 12 fields, saw 13\nSkipping line 5879: expected 12 fields, saw 13\nSkipping line 8981: expected 12 fields, saw 13\n'
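
The skipped rows above have one field too many. As a minimal sketch of that behaviour, the snippet below parses a small in-memory CSV (made-up data, not the real `books.csv`) in which one row carries an extra field. It assumes pandas 1.3 or later, where the `on_bad_lines` parameter is available; `'skip'` drops the malformed row silently, while `'warn'` would also report it.

```python
import io

import pandas as pd

# Hypothetical three-column CSV; the second data row has a stray fourth field.
raw = (
    "bookID,title,authors\n"
    "1,Book A,Author A\n"
    "2,Book B,Author B,extra\n"
    "3,Book C,Author C\n"
)

# Rows with too many fields are dropped instead of raising a parser error.
sample = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(sample)
```

Only the two well-formed rows survive, so it is worth checking how many rows were lost before trusting any statistics computed on the result.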


In [3]:
df.isnull().sum()

bookID                0
title                 0
authors               0
average_rating        0
isbn                  0
isbn13                0
language_code         0
  num_pages           0
ratings_count         0
text_reviews_count    0
publication_date      0
publisher             0
dtype: int64

#### Find duplicate books by title

In [4]:
### Duplicate entries ###
from collections import Counter

# Count how often each title occurs in a single pass over the column.
counter = Counter(df['title'])

# Titles ordered from most to least frequent.
dict(counter.most_common())

{'The Iliad': 9,
 'The Brothers Karamazov': 9,
 'Anna Karenina': 8,
 'The Odyssey': 8,
 "'Salem's Lot": 8,
 "Gulliver's Travels": 8,
 'The Picture of Dorian Gray': 7,
 "A Midsummer Night's Dream": 7,
 'Treasure Island': 6,
 'Collected Stories': 6,
 'The Histories': 6,
 'Robinson Crusoe': 6,
 'The Secret Garden': 6,
 'The Great Gatsby': 6,
 'Macbeth': 6,
 'Jane Eyre': 6,
 'The Scarlet Letter': 6,
 'Sense and Sensibility': 6,
 'Romeo and Juliet': 6,
 'War and Peace': 5,
 'Atlas Shrugged': 5,
 'Memoirs of a Geisha': 5,
 'Pride and Prejudice': 5,
 'A Tale of Two Cities': 5,
 'The House of Mirth': 5,
 'Don Quixote': 5,
 'One Hundred Years of Solitude': 5,
 'The Shining': 5,
 "Charlotte's Web": 5,
 'The Idiot': 5,
 'King Lear': 5,
 'Much Ado about Nothing': 5,
 'The Return of the King (The Lord of the Rings  #3)': 5,
 'Paradise Lost': 5,
 'Dracula': 5,
 'Frankenstein': 5,
 'The Enemy (Jack Reacher  #8)': 5,
 'Eugene Onegin': 5,
 'The Communist Manifesto': 5,
 "The Hitchhiker's Guide to the G

In [5]:
df[df.title == 'The Iliad'].head(2)

| | bookID | title | authors | average_rating | isbn | isbn13 | language_code | num_pages | ratings_count | text_reviews_count | publication_date | publisher |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 403 | 1371 | The Iliad | Homer/Robert Fagles/Bernard Knox | 3.86 | 140275363 | 9780140275360 | eng | 683 | 288792 | 3423 | 4/29/1999 | Penguin Classics |
| 405 | 1374 | The Iliad | Homer/Robert Fitzgerald/Andrew Ford | 3.86 | 374529051 | 9780374529055 | en-US | 588 | 692 | 81 | 4/3/2004 | Farrar Straus and Giroux |


A repeated title does not mean a duplicated record. The two rows above are different editions of The Iliad, with different contributors, ISBNs, and publishers. All such entries are genuine, so no rows need to be dropped.
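
One way to check this distinction programmatically is to compare duplicates on the title alone with duplicates across whole rows. The sketch below rebuilds the two Iliad rows shown above as a toy frame (only four of the columns are kept, for brevity) and counts both kinds.

```python
import pandas as pd

# Toy frame with the two Iliad rows from the output above (subset of columns).
toy = pd.DataFrame({
    "title": ["The Iliad", "The Iliad"],
    "authors": [
        "Homer/Robert Fagles/Bernard Knox",
        "Homer/Robert Fitzgerald/Andrew Ford",
    ],
    "isbn13": [9780140275360, 9780374529055],
    "publisher": ["Penguin Classics", "Farrar Straus and Giroux"],
})

# Rows whose title repeats an earlier row.
same_title = int(toy.duplicated(subset="title").sum())
# Rows that repeat an earlier row in every column.
exact_copies = int(toy.duplicated().sum())

print(same_title, exact_copies)
```

A non-zero title count together with a zero whole-row count is the pattern described here: shared titles, but no true duplicate records.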

#### Feature Creation

In [6]:
## Conversion of rating from numeric to categorical

df2 = df.copy()

df2.loc[ (df2['average_rating'] >= 0) & (df2['average_rating'] <= 1) , 'rating_between'] = "between 0 and 1"
df2.loc[ (df2['average_rating'] > 1) & (df2['average_rating'] <= 2), 'rating_between'] = "between 1 and 2"
df2.loc[ (df2['average_rating'] > 2) & (df2['average_rating'] <= 3), 'rating_between'] = "between 2 and 3"
df2.loc[ (df2['average_rating'] > 3) & (df2['average_rating'] <= 4), 'rating_between'] = "between 3 and 4"
df2.loc[ (df2['average_rating'] > 4) & (df2['average_rating'] <= 5), 'rating_between'] = "between 4 and 5"
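
The five assignments above implement fixed-width binning. As an alternative sketch (not the notebook's method), `pd.cut` can express the same bins in one call; the sample ratings below are made up for illustration. With the default `right=True` each bin includes its upper edge, and `include_lowest=True` makes the first bin include 0, matching the comparisons above.

```python
import pandas as pd

# Made-up ratings, including values that sit exactly on bin edges.
ratings = pd.Series([0.0, 1.0, 2.5, 3.86, 4.57, 5.0], name="average_rating")

edges = [0, 1, 2, 3, 4, 5]
labels = [f"between {lo} and {hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# Bins are [0, 1], (1, 2], (2, 3], (3, 4], (4, 5].
rating_between = pd.cut(ratings, bins=edges, labels=labels, include_lowest=True)
print(rating_between.tolist())
```

Note that `pd.cut` returns a categorical column, so a later `pd.get_dummies` call would create a column for every label, including bins that no book falls into.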

In [7]:
## One hot encoding ##
rating_df   = pd.get_dummies(df2['rating_between'])
language_df = pd.get_dummies(df2['language_code'])
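
To make the one-hot step concrete, here is a small sketch on a made-up `language_code` column. Passing `dtype=int` is an assumption added for this example: it forces 0/1 integers, whereas recent pandas versions return booleans by default.

```python
import pandas as pd

# Made-up language codes for four books.
codes = pd.Series(["eng", "en-US", "eng", "spa"], name="language_code")

# One column per distinct code; each row has a single 1 marking its language.
one_hot = pd.get_dummies(codes, dtype=int)
print(one_hot)
```

Each row sums to 1, which is what lets the later distance computation treat "same language" as closeness along exactly one axis.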

In [8]:
features = pd.concat([rating_df, language_df, df['ratings_count'], df['average_rating']], axis=1)

In [19]:
features.head(5)

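| | between 0 and 1 | between 1 and 2 | between 2 and 3 | between 3 and 4 | between 4 and 5 | ale | ara | en-CA | en-GB | en-US | ... | por | rus | spa | srp | swe | tur | wel | zho | ratings_count | average_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2095690 | 4.57 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2153167 | 4.49 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6333 | 4.42 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2339585 | 4.56 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 41428 | 4.78 |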


In [20]:
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler

In [27]:
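# Scale every column to [0, 1] so that ratings_count does not dominate the distances.
min_max_scaler = MinMaxScaler()
# fit_transform returns a NumPy array, so features is no longer a DataFrame after this line.
features = min_max_scaler.fit_transform(features)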
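
The reason for scaling can be seen on the two numeric columns alone. The sketch below applies the min-max formula `(x - min) / (max - min)` by hand with NumPy, which is what `MinMaxScaler` computes per column with its default `feature_range` of (0, 1). The five rows are the `ratings_count` and `average_rating` values from the `features.head(5)` output; the sketch assumes no column is constant.

```python
import numpy as np

# Columns: ratings_count, average_rating (first five rows of the feature table).
sample = np.array([
    [2095690, 4.57],
    [2153167, 4.49],
    [6333, 4.42],
    [2339585, 4.56],
    [41428, 4.78],
])

col_min = sample.min(axis=0)
col_max = sample.max(axis=0)
# Assumes col_max > col_min for every column; a constant column would divide by zero.
scaled = (sample - col_min) / (col_max - col_min)

# Before scaling, the spread of ratings_count dwarfs that of average_rating.
print(col_max - col_min)
# After scaling, both columns span exactly [0, 1].
print(scaled.min(axis=0), scaled.max(axis=0))
```

Without this step, a Euclidean distance between two books would be decided almost entirely by their rating counts, and the one-hot and rating columns would have next to no influence.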

In [28]:
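# Ask for 6 neighbours: a book's own row is normally its closest match, leaving 5 other titles.
model = NearestNeighbors(n_neighbors=6, algorithm='ball_tree')
model.fit(features)
# dist[i] and idlist[i] hold the distances and row positions of the neighbours of row i.
dist, idlist = model.kneighbors(features)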
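
To illustrate what the two arrays returned by `kneighbors` contain, the following sketch computes the same kind of result by brute force with NumPy on five made-up 2-D points. It assumes Euclidean distance, which is the default metric of `NearestNeighbors`; the names `neighbour_ids` and `neighbour_dist` are illustrative stand-ins for `idlist` and `dist`, and a ball tree would reach the same neighbours without comparing every pair.

```python
import numpy as np

# Five made-up points standing in for scaled feature rows.
points = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [0.9, 1.0],
    [1.0, 1.0],
    [0.5, 0.4],
])
k = 3  # neighbours per point, the point itself included

# Pairwise Euclidean distances, shape (5, 5).
diffs = points[:, None, :] - points[None, :, :]
pairwise = np.sqrt((diffs ** 2).sum(axis=-1))

# For each row, the positions of the k closest points, nearest first.
neighbour_ids = np.argsort(pairwise, axis=1, kind="stable")[:, :k]
neighbour_dist = np.take_along_axis(pairwise, neighbour_ids, axis=1)

print(neighbour_ids)
print(neighbour_dist)
```

The first column of `neighbour_ids` is each point's own position at distance 0, which is why asking for 6 neighbours yields 5 recommendations besides the queried book.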

In [54]:
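def suggest_books(new_book):
    """Return the titles of the nearest neighbours of new_book, the book itself included."""
    # Several rows can share one title; use the first match.
    matches = df2[df2.title == new_book].index
    if len(matches) == 0:
        raise ValueError(f"No book titled {new_book!r} in the dataset")
    book_id = matches[0]
    # idlist holds row positions, which equal df2's index labels (default RangeIndex).
    return [df2.loc[new_id].title for new_id in idlist[book_id]]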

In [55]:
suggest_books('Harry Potter and the Half-Blood Prince (Harry Potter  #6)')

['Harry Potter and the Half-Blood Prince (Harry Potter  #6)',
 'Harry Potter and the Order of the Phoenix (Harry Potter  #5)',
 'The Fellowship of the Ring (The Lord of the Rings  #1)',
 'Harry Potter and the Chamber of Secrets (Harry Potter  #2)',
 'Harry Potter and the Prisoner of Azkaban (Harry Potter  #3)',
 'The Lightning Thief (Percy Jackson and the Olympians  #1)']
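
The list above starts with the queried title itself. If only the other books are wanted, the lookup can filter out the query row. The sketch below is a hypothetical variant, `suggest_other_books`, run on made-up titles and a made-up neighbour table so that it is self-contained; it also returns an empty list for an unknown title instead of failing.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the title column and for the idlist array.
toy_titles = pd.Series(["Book A", "Book B", "Book C", "Book D"], name="title")
toy_idlist = np.array([
    [0, 1, 2],
    [1, 0, 3],
    [2, 3, 0],
    [3, 2, 1],
])

def suggest_other_books(title, titles, idlist):
    """Return neighbour titles for `title`, leaving out the queried row itself."""
    positions = np.flatnonzero((titles == title).to_numpy())
    if len(positions) == 0:
        return []  # unknown title: nothing to recommend
    book_pos = int(positions[0])  # several rows may share a title; use the first
    return [titles.iloc[i] for i in idlist[book_pos] if i != book_pos]

print(suggest_other_books("Book A", toy_titles, toy_idlist))
print(suggest_other_books("Missing Book", toy_titles, toy_idlist))
```

Filtering by row position rather than by title keeps other editions of the same book in the result, which may or may not be desirable depending on the use case.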