<a href="https://colab.research.google.com/github/abyanjan/Recommender-Systems-with-Python/blob/master/NearestNeighbor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Recommending Products Based on Nearest Neighbors Model

**Book Recommendation**

In [1]:
!pip install -q surprise



### Data

The data used here is the book-crossing dataset available at http://www2.informatik.uni-freiburg.de/~cziegler/BX/

In [2]:
# downloading the data
!wget http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip

--2021-03-26 09:16:01--  http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip
Resolving www2.informatik.uni-freiburg.de (www2.informatik.uni-freiburg.de)... 132.230.105.133
Connecting to www2.informatik.uni-freiburg.de (www2.informatik.uni-freiburg.de)|132.230.105.133|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26085508 (25M) [application/zip]
Saving to: ‘BX-CSV-Dump.zip’


2021-03-26 09:16:03 (16.1 MB/s) - ‘BX-CSV-Dump.zip’ saved [26085508/26085508]



In [3]:
# unzipping the data
import zipfile
with zipfile.ZipFile('BX-CSV-Dump.zip', 'r') as zip_ref:
    zip_ref.extractall('data')

In [4]:
# check the list of data files
%ls 'data'

BX-Book-Ratings.csv  BX-Books.csv  BX-Users.csv


The dataset contains three files:
- BX-Users: contains information on users, including demographic data if available
- BX-Books: contains information on books, identified by their ISBN, with data on 'Book-Title', 'Book-Author', 'Year-Of-Publication', and 'Publisher'
- BX-Book-Ratings: contains the book ratings; explicit ratings are on a scale from 1 to 10 (higher values denoting higher appreciation), and a rating of 0 marks an implicit interaction
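
The ratings file mixes two kinds of rows, which matters for everything that follows. The sketch below separates them in plain Python, assuming that a rating of 0 marks an implicit interaction rather than a very low score. The sample rows are the first five rows of the ratings file as displayed in this notebook; the variable names are only illustrative.

```python
# First five rows of BX-Book-Ratings.csv as (user, isbn, rating).
sample_ratings = [
    (276725, "034545104X", 0),
    (276726, "0155061224", 5),
    (276727, "0446520802", 0),
    (276729, "052165615X", 3),
    (276729, "0521795028", 6),
]

# Assumption: 0 is an implicit interaction, 1-10 is an explicit rating.
explicit = [row for row in sample_ratings if row[2] > 0]
implicit = [row for row in sample_ratings if row[2] == 0]

print(len(explicit), "explicit,", len(implicit), "implicit")
```

This notebook keeps both kinds of rows, so the zeros take part in the distance and averaging steps later on.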

In [5]:
import pandas as pd
import numpy as np
import scipy

In [6]:
# reading ratings data
data = pd.read_csv("data/BX-Book-Ratings.csv", sep=';', header=0, names=['user','isbn','rating'],encoding='latin-1')
data.head()

Unnamed: 0,user,isbn,rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [7]:
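# reading books data
# note: error_bad_lines was removed in pandas 2.0; on newer versions use on_bad_lines='skip' instead
books = pd.read_csv("data/BX-Books.csv", sep=';', header=0,error_bad_lines=False, usecols=[0,1,2],index_col=0,
                   names=['isbn','title','author'],encoding='latin-1')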
books.head()

isbn,title,author
0195153448,Classical Mythology,Mark P. O. Morford
0002005018,Clara Callan,Richard Bruce Wright
0060973129,Decision in Normandy,Carlo D'Este
0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata
0393045218,The Mummies of Urumchi,E. J. W. Barber


In [8]:
# setting up a function to get metadata on any book by its isbn number
def bookMeta(isbn):
  title = books.loc[isbn,'title']
  author = books.loc[isbn,'author']
  return title, author

In [9]:
# testing the bookMeta function
bookMeta('0195153448')

('Classical Mythology', 'Mark P. O. Morford')

In [11]:
# setting up a function to get the top N favourite books for a user
def favBooks(user, N):
  # keeping only the ratings given by the specified user
  userdata = data[data['user'] == user]
  # sorting by rating in descending order and selecting the top N rated books
  # (.copy() avoids pandas' SettingWithCopyWarning when the column is added below)
  sorted_ratings = userdata.sort_values('rating', ascending=False)[:N].copy()
  # adding book metadata as a (title, author) tuple
  sorted_ratings['title'] = sorted_ratings['isbn'].apply(bookMeta)
  return sorted_ratings

There may be ratings given to books that we may not have information about in the books data. So, we will make sure that the ratings data only contains the books that we have information about.

In [12]:
# making sure that we only have the ratings for the books that we have information about, that is stored in books dataframe
data = data[data['isbn'].isin(books.index)]

In [13]:
# checking favBooks function
favBooks(204622,5)

Unnamed: 0,user,isbn,rating,title
844955,204622,0967560500,10,"(Natural Hormonal Enhancement, Rob Faigin)"
844935,204622,0671027360,10,"(Angels & Demons, Dan Brown)"
844926,204622,0385504209,10,"(The Da Vinci Code, Dan Brown)"
844958,204622,097173660X,9,"(Life After School Explained, Cap & Compass)"
844920,204622,0060935464,9,"(To Kill a Mockingbird, Harper Lee)"


### Ratings Matrix

Here, we will construct a rating matrix that has users as rows and books as columns, with the values filled in by the corresponding ratings. So, each cell holds the rating given by a user to a book.
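
As a minimal illustration of this layout, the sketch below builds the same kind of matrix from a handful of made-up (user, isbn, rating) triples using plain dictionaries. The ids and ratings are hypothetical; the notebook itself uses `pd.pivot_table` for this step.

```python
from collections import defaultdict

# Hypothetical (user, isbn, rating) triples, for illustration only.
triples = [
    ("u1", "b1", 8),
    ("u1", "b2", 5),
    ("u2", "b2", 7),
    ("u3", "b3", 10),
]

# Rows are users, columns are books; a missing key means "no rating".
matrix = defaultdict(dict)
for user, isbn, rating in triples:
    matrix[user][isbn] = rating

# Print each user's row over all books seen, with None for missing ratings.
books_seen = sorted({isbn for _, isbn, _ in triples})
for user in sorted(matrix):
    row = [matrix[user].get(isbn) for isbn in books_seen]
    print(user, row)
```

Most entries in each row are `None`, which is the small-scale version of the sparsity seen in the real data.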

Before constructing the rating matrix, we can check the number of unique books and users in our data.

In [14]:
# number of users per isbn - unique books
user_per_ISBN = data.isbn.value_counts()
print(f'Number of unique isbn: {len(user_per_ISBN)}')
print()
user_per_ISBN.head(10)

Number of unique isbn: 270170



0971880107    2502
0316666343    1295
0385504209     883
0060928336     732
0312195516     723
044023722X     647
0142001740     615
067976402X     614
0671027360     586
0446672211     585
Name: isbn, dtype: int64

In [15]:
# number of books users have read - unique number of users
ISBN_per_user = data.user.value_counts()
print(f'Number of unique users: {len(ISBN_per_user)}')
print()
ISBN_per_user.head(10)

Number of unique users: 92107



11676     11144
198711     6456
153662     5814
98391      5779
35859      5646
212898     4290
278418     3996
76352      3329
110973     2971
235105     2943
Name: user, dtype: int64

Here, we have 92107 unique users and 270170 unique books, so the ratings matrix would be of size 92107 x 270170. Not only would the matrix be large, it would also be very sparse, because each user has rated only a small fraction of the books and many books have only a few ratings.

So, to reduce the sparsity, we will keep only books that have been rated by more than 10 users and users who have read more than 10 books.
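
To get a feel for the scale, the following sketch works out the number of cells in the unfiltered matrix from the counts printed above (92107 users, 270170 books). The `density` helper and the figure of 100 ratings per user are assumptions made purely for illustration, not numbers taken from the data.

```python
def density(n_ratings, n_users, n_items):
    """Fraction of cells in the user-by-item matrix that hold a rating."""
    return n_ratings / (n_users * n_items)

# Dimensions printed by the cells above, before any filtering.
n_users, n_items = 92107, 270170
n_cells = n_users * n_items
print(f"{n_cells:,} cells")

# Even if every user had rated 100 books (a made-up figure),
# only a tiny fraction of the cells would be filled.
print(f"{density(100 * n_users, n_users, n_items):.6f}")
```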

In [16]:
# only select books with more than 10 ratings
data = data[data['isbn'].isin(user_per_ISBN[user_per_ISBN > 10].index)]

# only select users with more than 10 book reads
data = data[data['user'].isin(ISBN_per_user[ISBN_per_user > 10].index)]

In [17]:
data.shape

(405709, 3)

In [18]:
# creating the rating matrix
user_item_rating_matrix = pd.pivot_table(data, values='rating',index=['user'], columns=['isbn'])
user_item_rating_matrix.head()

isbn,0002005018,0002251760,0002259834,0002558122,0006480764,000648302X,0006485200,000649840X,000651202X,0006512062,0006543545,0006546684,0006547834,0006550576,0006550649,0006550789,0006550924,0007106572,0007110928,0007122039,0007141076,0007154615,000716226X,0007170866,0020125305,0020125607,0020198817,0020198906,0020199600,0020264763,002026478X,0020264801,0020360754,002040400X,0020418809,0020427859,0020442009,0020442106,0020442203,0020442300,...,8495501090,8495501112,8495501198,849550152X,8495618605,8804342838,8804375914,880449509X,8806116053,8806142100,8806143042,8806163698,8807809907,880781000X,8807810212,880781076X,880781210X,8807813025,8807813823,8817106100,8817106119,8817106259,8817125539,8817131628,881787017X,8838918600,8845205118,8845247414,8845407039,884590184X,8845906884,8845915611,8878188212,8885989403,9074336329,9074336469,950491036X,9681500830,9681500954,9871138016
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
8,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
99,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
242,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
243,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
254,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


### User Similarity
Here we compute the similarity between users, based on the ratings they have given to books, using the Hamming distance.

In [19]:
# computing distance between two users
user1 = 204622
user2 = 255489

In [20]:
user1_ratings = user_item_rating_matrix.transpose()[user1]
user1_ratings.head()

isbn
0002005018   NaN
0002251760   NaN
0002259834   NaN
0002558122   NaN
0006480764   NaN
Name: 204622, dtype: float64

In [21]:
user2_ratings = user_item_rating_matrix.transpose()[user2]
user2_ratings.head()

isbn
0002005018   NaN
0002251760   NaN
0002259834   NaN
0002558122   NaN
0006480764   NaN
Name: 255489, dtype: float64

In [22]:
# computing the distance with the Hamming distance metric
# Hamming distance measures the disagreement between two vectors

from scipy.spatial.distance import hamming
hamming(user1_ratings, user2_ratings)

0.9999352792699502

The Hamming distance measures the disagreement between two data points: it is the fraction of positions at which the two vectors differ. So, a higher value means that the data points are farther from each other.
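
One detail worth seeing on a small example: the comparison is element-wise, and NaN never compares equal to NaN, so every book that neither user has rated also counts as a disagreement. That appears to be why the distances in this notebook are so close to 1. The sketch below is a plain-Python version of the same idea (`hamming_fraction` is a hypothetical helper, not the SciPy function), with made-up rating rows.

```python
import math

def hamming_fraction(u, v):
    """Fraction of positions where u and v differ (NaN never equals NaN)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v)) / len(u)

nan = math.nan
# Hypothetical rating rows over five books; nan means "not rated".
user_a = [nan, 10.0, nan, 8.0, nan]
user_b = [nan, 10.0, nan, nan, 5.0]

# Only the second position agrees, so four of five positions differ.
print(hamming_fraction(user_a, user_b))  # 0.8
```

With this metric, two users are close only when they gave exactly the same rating to the same books.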

In [23]:
# setting up the function to calculate the distance between any two users
def distance(user1, user2):
  try:
    # .loc selects a user's row directly, without transposing the whole matrix
    user1_ratings = user_item_rating_matrix.loc[user1]
    user2_ratings = user_item_rating_matrix.loc[user2]
    dist = hamming(user1_ratings, user2_ratings)
  except KeyError:
    # one of the users is not in the rating matrix
    dist = np.nan
  return dist

In [24]:
# checking the distance function
distance(user1, user2)

0.9999352792699502

### Finding Top N Nearest Neighbors

Once we can compute the distance between users, we can find the N nearest neighbors of a user, that is, the N users with the smallest distance to that user.
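
The selection step itself is small. As a sketch under made-up data, the snippet below picks the closest users from a dictionary of distances with `heapq.nsmallest`; the user ids and distance values are hypothetical, and the notebook does the equivalent with `sort_values` on a DataFrame.

```python
import heapq

def k_nearest(distances, k):
    """Return the k user ids with the smallest distance, closest first."""
    return heapq.nsmallest(k, distances, key=distances.get)

# Hypothetical distances from the active user to four other users.
distances = {"u1": 0.9998, "u2": 0.9991, "u3": 1.0, "u4": 0.9995}
print(k_nearest(distances, 2))  # ['u2', 'u4']
```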

In [25]:
# taking an example for a user
user = 204622

In [26]:
# getting all the users
all_users = pd.DataFrame(user_item_rating_matrix.index)

# remove the current user from the all_users list
all_users = all_users[all_users.user!=user]
all_users.head()

Unnamed: 0,user
0,8
1,99
2,242
3,243
4,254


In [27]:
# distance between all users and the active user
all_users['distance'] = all_users['user'].apply(lambda x: distance(user,x))

In [28]:
all_users.head()

Unnamed: 0,user,distance
0,8,1.0
1,99,1.0
2,242,0.999935
3,243,0.999935
4,254,1.0


In [29]:
# finding k nearest neighbor
k = 10
k_nearest_users = all_users.sort_values('distance')['user'][:k]
k_nearest_users

3201     82893
3368     87555
2624     68555
1813     48046
5401    140036
7584    198711
565      16795
8866    232131
239       7346
9693    251422
Name: user, dtype: int64

In [30]:
# putting everything to a function
def nearest_negihbors(user, k =10):
  all_users = pd.DataFrame(user_item_rating_matrix.index)
  all_users = all_users[all_users.user!=user]
  all_users['distance'] = all_users['user'].apply(lambda x: distance(user,x))
  k_nearest_users = all_users.sort_values('distance')['user'][:k]
  return k_nearest_users

### Recommending the Books

Once we have found the top N nearest neighbors of a user, we can estimate the rating the user would give to a new book as the average of the ratings that the nearest neighbors have given to that book.
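
A minimal sketch of this averaging step, using made-up neighbor ratings stored as dictionaries (`predict_from_neighbors` is a hypothetical helper). Neighbors who have not rated the book are skipped, and a book that no neighbor has rated gets no estimate.

```python
from statistics import mean

def predict_from_neighbors(neighbor_ratings, isbn):
    """Average the ratings the neighbors gave to `isbn`, ignoring missing ones."""
    given = [ratings[isbn] for ratings in neighbor_ratings if isbn in ratings]
    return mean(given) if given else None

# Hypothetical ratings of three nearest neighbors (isbn -> rating).
neighbors = [
    {"b1": 8.0, "b2": 0.0},
    {"b1": 6.0},
    {"b2": 3.0, "b3": 10.0},
]
print(predict_from_neighbors(neighbors, "b1"))  # 7.0
print(predict_from_neighbors(neighbors, "b2"))  # 1.5
print(predict_from_neighbors(neighbors, "b4"))  # None
```

Note that a rating of 0 enters the average like any other value, which pulls the estimate for `b2` down to 1.5.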

In [31]:
# getting ratings for the nearest users
nn_ratings = user_item_rating_matrix[user_item_rating_matrix.index.isin(k_nearest_users)]
nn_ratings

isbn,0002005018,0002251760,0002259834,0002558122,0006480764,000648302X,0006485200,000649840X,000651202X,0006512062,0006543545,0006546684,0006547834,0006550576,0006550649,0006550789,0006550924,0007106572,0007110928,0007122039,0007141076,0007154615,000716226X,0007170866,0020125305,0020125607,0020198817,0020198906,0020199600,0020264763,002026478X,0020264801,0020360754,002040400X,0020418809,0020427859,0020442009,0020442106,0020442203,0020442300,...,8495501090,8495501112,8495501198,849550152X,8495618605,8804342838,8804375914,880449509X,8806116053,8806142100,8806143042,8806163698,8807809907,880781000X,8807810212,880781076X,880781210X,8807813025,8807813823,8817106100,8817106119,8817106259,8817125539,8817131628,881787017X,8838918600,8845205118,8845247414,8845407039,884590184X,8845906884,8845915611,8878188212,8885989403,9074336329,9074336469,950491036X,9681500830,9681500954,9871138016
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
7346,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.0,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16795,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
48046,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
68555,,,,,,,,,,,,,,,,,,,,,,3.0,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
82893,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
87555,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,,,0.0,,,,,,,,0.0,,0.0,0.0,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
140036,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.0,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
198711,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,,,,,0.0,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
232131,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
251422,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [32]:
# taking the average ratings of the nearest neighbors for the books
avg_rating = nn_ratings.mean(skipna=True).dropna()
avg_rating.head()

isbn
0007154615    1.5
0020125305    0.0
0020125607    0.0
0020198817    0.0
0020198906    8.0
dtype: float64

In [33]:
# finding books that have been already read by current user
books_aleardy_read = user_item_rating_matrix.transpose()[user].dropna().index
books_aleardy_read

Index(['006016848X', '0060935464', '0140042598', '0140178724', '0142004278',
       '0380732238', '0385504209', '0425109720', '0425152898', '0440136482',
       '0440241162', '0451191145', '0451197127', '0553096060', '0671027360',
       '0671027387', '0671666258', '0688174574', '0743225708', '076790592X',
       '0785264280', '0786868716', '0802131867', '0802132952', '0971880107',
       '1853260045', '1853260126', '1853260207', '185326041X', '1878424114'],
      dtype='object', name='isbn')

In [34]:
# keep the ratings only for the books that have not been read by the user
ratings = avg_rating[~ avg_rating.index.isin(books_aleardy_read)]

In [None]:
ratings

isbn
0007154615    1.5
0020125305    0.0
0020125607    0.0
0020198817    0.0
0020198906    8.0
             ... 
1883473004    0.0
1885171080    0.0
1885211066    0.0
1885211279    6.0
193156146X    0.0
Length: 4738, dtype: float64

In [35]:
# get top 3 books having highest average ratings
N = 3
top_n_isbns = ratings.sort_values(ascending=False).index[:N]
top_n_isbns

Index(['0553802976', '0618002235', '0590353403'], dtype='object', name='isbn')

In [36]:
# extracting metadata for the isbns
pd.Series(top_n_isbns).apply(bookMeta)

0              (Love, Greg & Lauren, Greg Manning)
1    (The Two Towers (The Lord of the Rings, Part 2...
2    (Harry Potter and the Sorcerer's Stone (Book 1...
Name: isbn, dtype: object

In [37]:
# putting everything into a function
def topN(user, N=3):
  k_nearest_users = nearest_negihbors(user)
  nn_ratings = user_item_rating_matrix[user_item_rating_matrix.index.isin(k_nearest_users)]
  avg_rating = nn_ratings.mean(skipna=True).dropna()
  books_aleardy_read = user_item_rating_matrix.transpose()[user].dropna().index
  avg_ratings = avg_rating[~ avg_rating.index.isin(books_aleardy_read)]
  top_n_isbns = avg_ratings.sort_values(ascending = False).index[:N]
  return pd.Series(top_n_isbns).apply(bookMeta)

**Recommending book to a user**

In [38]:
# printing favorite books for the given user
pd.set_option('display.max_colwidth', None)
user = 204813
favBooks(user, 10)

Unnamed: 0,user,isbn,rating,title
845417,204813,0399149848,10,"(Birthright, Nora Roberts)"
845407,204813,0385504209,10,"(The Da Vinci Code, Dan Brown)"
845382,204813,0373218036,10,"(Truly, Madly Manhattan, Nora Roberts)"
845359,204813,0142001805,10,"(The Eyre Affair: A Novel, Jasper Fforde)"
845431,204813,0446527793,10,"(The Guardian, Nicholas Sparks)"
845416,204813,0399149392,10,"(Chesapeake Blue (Quinn Brothers (Hardcover)), Nora Roberts)"
845432,204813,0446531332,9,"(Nights in Rodanthe, Nicholas Sparks)"
845434,204813,0446606243,9,"(The Tenth Justice, Brad Meltzer)"
845451,204813,0671027360,9,"(Angels & Demons, Dan Brown)"
845433,204813,0446532452,9,"(The Wedding, Nicholas Sparks)"


In [39]:
# getting top 10 recommendations for the user
top_recommendation = topN(user, 10)

In [40]:
pd.set_option('display.max_colwidth', None)
book = top_recommendation.apply(lambda x: x[0])
author = top_recommendation.apply(lambda x: x[1])

In [44]:
# recommendation
pd.DataFrame({'Book':book, 'Author':author})


Unnamed: 0,Book,Author
0,Waiting For Nick (Silhouette Special Edition),Nora Roberts
1,Wringer (Trophy Newbery),Jerry Spinelli
2,"The Star Wars Trilogy: Star Wars, the Empire Strikes Back, Return of the Jedi",George Lucas
3,"One, Two, Buckle My Shoe",Agatha Christie
4,On the Road,Jack Kerouac
5,Dead Poets Society,N.H. Kleinbaum
6,Go Ask Alice (Avon/Flare Book),James Jennings
7,Carolina Moon,Nora Roberts
8,Illusions: The Adventures of a Reluctant Messiah,Richard Bach
9,You Just Don't Duct Tape a Baby!: True Tales and Sensible Suggestions from a Veteran Pediatrician,Norman Weinberger


Among the recommended books there are two by Nora Roberts, an author who also appears three times in the user's list of favourite books. This suggests that the recommendations are reasonable.

We can try again for some other user.

In [45]:
pd.set_option('display.max_colwidth', None)
user = 48046
favBooks(user, 10)

Unnamed: 0,user,isbn,rating,title
207955,48046,0060199652,10,"(Prodigal Summer, Barbara Kingsolver)"
207957,48046,0060391626,10,"(I Know This Much Is True (Oprah's Book Club), Wally Lamb)"
208075,48046,0609609521,10,"(When the Elephants Dance : A Novel, TESS URIZA HOLTHE)"
208027,48046,0446391301,10,"(Geek Love, Katherine Dunn)"
208097,48046,068486441X,10,"(Eating The Cheshire Cat: A Novel, Helen Ellis)"
208014,48046,0385504209,10,"(The Da Vinci Code, Dan Brown)"
207979,48046,0156027321,10,"(Life of Pi, Yann Martel)"
208037,48046,0451187849,10,"(We the Living, Ayn Rand)"
208103,48046,0743467523,9,"(Dreamcatcher, Stephen King)"
208061,48046,0553580191,9,"(Seize the Night, DEAN KOONTZ)"


In [46]:
# getting top 10 recommendations for the user
top_recommendation = topN(user, 10)

In [47]:
book = top_recommendation.apply(lambda x: x[0])
author = top_recommendation.apply(lambda x: x[1])
# recommendation
pd.DataFrame({'Book':book, 'Author':author})

Unnamed: 0,Book,Author
0,Riding Shotgun,Rita Mae Brown
1,Skyward,Mary Alice Monroe
2,The Odyssey,Robert Fagles
3,Falling Leaves Brit Edition,Adeline Yen Mah
4,Welcome to the Monkey House,Kurt Vonnegut
5,Bed & Breakfast,Lois Battle
6,Illusions: The Adventures of a Reluctant Messiah,Richard Bach
7,Fast Forward,Judy Mercer
8,Walking After Midnight,KAREN ROBARDS
9,The Tortilla Curtain,T. Coraghessan Boyle


## Applying Nearest Neighbor for Recommendation with Surprise Library

Surprise is an easy-to-use Python scikit for recommender systems.

In [48]:
from surprise import Dataset, Reader, accuracy
from surprise.model_selection import cross_validate
from surprise import KNNBasic

The Surprise library requires a dataframe with exactly three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings, in this order.

In [49]:
data.head()

Unnamed: 0,user,isbn,rating
31,276762,034544003X,0
33,276762,0380711524,5
34,276762,0451167317,0
89,276798,3423084049,0
97,276798,3548603203,6


Here, in our data, 'user' corresponds to the user ids, 'isbn' to the item ids, and 'rating' is the rating given to an item by a user. So, the data is already in the order that the Surprise library requires.

In [50]:
data.shape

(405709, 3)

In [51]:
# create the data to use with surprise library
# specify the rating scale
# note: the data still contains ratings of 0, which lie outside the (1, 10) scale given here
reader = Reader(rating_scale=(1, 10))
data_surp = Dataset.load_from_df(df = data, reader=reader)

In [None]:
#data_surp.raw_ratings

We can use the 'KNNBasic' algorithm from the Surprise library for nearest-neighbor recommendations.
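
For reference, as described in the Surprise documentation, a user-based `KNNBasic` estimates the rating of user $u$ for item $i$ as a similarity-weighted average over $N_i^k(u)$, the (at most $k$) nearest neighbors of $u$ that have rated $i$. With the MSD option, the similarity is derived from the mean squared difference over $I_{uv}$, the items rated by both users. The symbols below follow that documentation and are not defined elsewhere in this notebook.

```latex
\hat{r}_{ui} = \frac{\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)\, r_{vi}}
                    {\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)}

\mathrm{msd}(u, v) = \frac{1}{|I_{uv}|} \sum_{i \in I_{uv}} (r_{ui} - r_{vi})^2,
\qquad
\mathrm{sim}(u, v) = \frac{1}{\mathrm{msd}(u, v) + 1}
```

Compared with the plain average used in the first part of this notebook, neighbors that are more similar to the user contribute more to the estimate.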

In [53]:
# select the algorithm
algo = KNNBasic(k=60, min_k=1, sim_options={'name':'MSD', 'user_based':True})

In [54]:
# splitting data into train and test data
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(data_surp, test_size=.25)

In [55]:
trainset.n_users

10646

In [66]:
# fit the model on train data
algo.fit(trainset=trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7f1ed4600490>

In [67]:
# prediction for test data
predictions = algo.test(testset)

In [None]:
# predictions

In [69]:
# evaluation with root mean squared error (RMSE)
accuracy.rmse(predictions=predictions)

RMSE: 3.8928


3.892813412173743

We have an RMSE of 3.89, which means that the predicted ratings are off by roughly 3.9 points (in the root-mean-square sense) on a 10-point scale. This is not very accurate, but it gives us a baseline to work from.
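
To make the metric concrete, here is a small worked example of how an RMSE is computed. The true and predicted ratings are made up, and `rmse` is a hypothetical stand-alone helper rather than the function from Surprise.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    if len(actual) != len(predicted):
        raise ValueError("lists must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical true and predicted ratings: errors are 3, -4, 0 and 0.
actual = [10, 0, 5, 8]
predicted = [7, 4, 5, 8]
print(rmse(actual, predicted))  # 2.5
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.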

In [70]:
# get a prediction for specific users and items.
uid = 48046
iid = '0060199652'
r_ui = 10
pred = algo.predict(uid, iid, r_ui=r_ui, verbose=True)

user: 48046      item: 0060199652 r_ui = 10.00   est = 7.90   {'actual_k': 60, 'was_impossible': False}


So, here the actual rating given by the user for the book was 10, and the model predicted the rating to be 7.90.

**Recommending books for a user**

In [71]:
# recommending books for a user
user = 48046

In [72]:
# select books that have been rated by the user
read_books = data[data.user==user]['isbn'].tolist()
#print(read_books)

In [73]:
# create a list of books that the user has not read
not_read_books = [book for book in data.isbn.unique() if book not in read_books]

In [74]:
len(not_read_books)

15330

In [75]:
# predicting ratings for each of the not read books
user = 48046
pred_ratings = {}
for book in not_read_books:
  pred = algo.predict(uid=user, iid=book)
  # extract only the predicted rating
  est_rating = pred.est
  # add the predictions
  pred_ratings.update({book:est_rating})


In [79]:
# sort the ratings
#dict(sorted(x.items(), key=lambda item: item[1]))
sorted_ratings = sorted(pred_ratings.items(), key=lambda item: item[1], reverse=True)

In [80]:
# select top 10 
sorted_ratings = sorted_ratings[:10]
sorted_ratings

[('0385497288', 10),
 ('351836605X', 10),
 ('2277241202', 10),
 ('3404149114', 10),
 ('3404130014', 10),
 ('033035034X', 10),
 ('0618219064', 10),
 ('3596154049', 10),
 ('3596200261', 10),
 ('0785268839', 10)]
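
All ten estimates above are tied at 10, so the estimate alone does not determine their relative order. One possible way to break such ties, which this notebook does not do, is to prefer estimates that are backed by more neighbors. The sketch below uses made-up (isbn, estimate, neighbor count) triples; in Surprise the neighbor count is reported as `details['actual_k']` on a prediction, as in the verbose output shown earlier.

```python
# Hypothetical (isbn, estimated rating, number of neighbors used) triples.
candidates = [
    ("b1", 10.0, 1),
    ("b2", 10.0, 7),
    ("b3", 9.4, 30),
    ("b4", 10.0, 3),
]

# Rank by estimate first, then by how many neighbors support it.
ranked = sorted(candidates, key=lambda c: (c[1], c[2]), reverse=True)
print([isbn for isbn, _, _ in ranked])  # ['b2', 'b4', 'b1', 'b3']
```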

In [86]:
# get the book name and author
book  = [bookMeta(info[0])[0] for info in sorted_ratings]
author = [bookMeta(info[0])[1] for info in sorted_ratings]
pd.DataFrame({'Book':book, 'Author':author})

Unnamed: 0,Book,Author
0,The Unknown Errors of Our Lives: Stories,Chitra Banerjee Divakaruni
1,Stiller,Max Frisch
2,L' Alchimiste,Paul Coelho
3,Das Zweite Gedächtnis.,Ken Follett
4,Feuerkind. Thriller.,Stephen King
5,Death Is Now My Neighbour,Colin Dexter
6,The Wind Done Gone: A Novel,Alice Randall
7,Der Besuch des Leibarztes.,Per Olov Enquist
8,"Fischer Taschenbücher, Bd.26, Schöne neue Welt",Aldous Huxley
9,Wild at Heart: Discovering the Secret of a Man's Soul,John Eldredge


In [88]:
# creating a function for recommending books
def recommend_books(user_id):
  # select books that have been rated by the user
  read_books = data[data.user==user_id]['isbn'].tolist()
  # create a list of books that the user has not read
  not_read_books = [book for book in data.isbn.unique() if book not in read_books]
  pred_ratings = {}
  for book in not_read_books:
    pred = algo.predict(uid=user_id, iid=book)
    # extract only the predicted rating
    est_rating = pred.est
    # add the predictions
    pred_ratings.update({book:est_rating})

  # take top 10 books with highest ratings
  sorted_ratings = sorted(pred_ratings.items(), key=lambda item: item[1], reverse=True)[:10]
  book  = [bookMeta(info[0])[0] for info in sorted_ratings]
  author = [bookMeta(info[0])[1] for info in sorted_ratings]
  return pd.DataFrame({'Book':book, 'Author':author})

In [89]:
user = 48046
recommend_books(user_id=user)

Unnamed: 0,Book,Author
0,The Unknown Errors of Our Lives: Stories,Chitra Banerjee Divakaruni
1,Stiller,Max Frisch
2,L' Alchimiste,Paul Coelho
3,Das Zweite Gedächtnis.,Ken Follett
4,Feuerkind. Thriller.,Stephen King
5,Death Is Now My Neighbour,Colin Dexter
6,The Wind Done Gone: A Novel,Alice Randall
7,Der Besuch des Leibarztes.,Per Olov Enquist
8,"Fischer Taschenbücher, Bd.26, Schöne neue Welt",Aldous Huxley
9,Wild at Heart: Discovering the Secret of a Man's Soul,John Eldredge
