Package location on my system:

Works in an isolated virtual environment for SharpestMinds with Python 3.10.

## Collaborative Filtering

A memory-based book recommender system built on book ratings.

In [1]:
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix

We will create a recommender engine based on Item-Based Collaborative Filtering (IBCF), which searches for the most similar books based on user ratings. We can download the data from [here](https://drive.google.com/file/d/1WvTmAfO09TCX7xp7uu06__ziic7JnrL5/view?usp=sharing).
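
To make the idea of item-based similarity concrete, here is a minimal sketch on made-up data. The `ratings` array and the helper `cosine_similarity_matrix` are hypothetical and are not part of the notebook; the sketch only assumes that each book is represented by the vector of ratings it received from users, with 0 meaning "not rated".

```python
import numpy as np

# Hypothetical toy data: rows are books (items), columns are users, 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 0],   # book A
    [4, 5, 0, 0],   # book B
    [0, 0, 3, 5],   # book C
], dtype=float)

def cosine_similarity_matrix(matrix):
    """Pairwise cosine similarity between the rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for books nobody rated
    normalized = matrix / norms
    return normalized @ normalized.T

similarity = cosine_similarity_matrix(ratings)
# Books A and B were rated alike by the same users; book C shares no raters with them.
print(np.round(similarity, 3))
```

Books A and B come out as highly similar, while book C has similarity 0 to both because no user rated it together with A or B.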

In [2]:
book_ratings = pd.read_csv('BX-CSV-Dump\\BX-Book-Ratings.csv', sep=";", encoding="latin")
# on_bad_lines replaces error_bad_lines, which was deprecated in pandas 1.3 and removed in 2.0
books = pd.read_csv('BX-CSV-Dump\\BX-Books.csv', sep=";", encoding="latin", on_bad_lines="warn")

Skipping line 6452: expected 8 fields, saw 9
Skipping line 43667: expected 8 fields, saw 10
Skipping line 51751: expected 8 fields, saw 9

Skipping line 92038: expected 8 fields, saw 9
Skipping line 104319: expected 8 fields, saw 9
Skipping line 121768: expected 8 fields, saw 9

Skipping line 144058: expected 8 fields, saw 9
Skipping line 150789: expected 8 fields, saw 9
Skipping line 157128: expected 8 fields, saw 9
Skipping line 180189: expected 8 fields, saw 9
Skipping line 185738: expected 8 fields, saw 9

Skipping line 209388: expected 8 fields, saw 9
Skipping line 220626: expected 8 fields, saw 9
Skipping line 227933: expected 8 fields, saw 11
Skipping line 228957: expected 8 fields, saw 10
Skipping line 245933: expected 8 fields, saw 9
Skipping line 251296: expected 8 fields, saw 9
Skipping line 259941: expected 8 fields, saw 9
Skipping line 261529: expected 8 fields, saw 9

* Explore both datasets

In [3]:
book_ratings.head(5)

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [4]:
books.head(5)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [5]:
books_info = pd.merge(book_ratings, books, on='ISBN')

In [6]:
df_book_ratings = books_info[['ISBN', 'User-ID', 'Book-Rating', 'Book-Title']]

In [7]:
df_book_ratings.head(5)

Unnamed: 0,ISBN,User-ID,Book-Rating,Book-Title
0,034545104X,276725,0,Flesh Tones: A Novel
1,034545104X,2313,5,Flesh Tones: A Novel
2,034545104X,6543,0,Flesh Tones: A Novel
3,034545104X,8680,5,Flesh Tones: A Novel
4,034545104X,10314,9,Flesh Tones: A Novel


* Create a DataFrame named 'df_book_features' from book_ratings that has `ISBN` as the index, `User-ID` as the columns, and `Book-Rating` as the values.
    - The dataset is quite large, so it's OK to use only a sample if your PC has limited RAM.
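
The reshaping step can be sketched on a tiny made-up table. The names `toy_ratings`, `toy_pivot`, and `toy_sparse` are hypothetical. This sketch uses `pivot_table` rather than `pivot`, an assumption made here because `pivot` raises an error when the same (`ISBN`, `User-ID`) pair appears twice, while `pivot_table` aggregates the duplicates.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy ratings in the same long format as df_book_ratings.
toy_ratings = pd.DataFrame({
    'ISBN': ['A', 'A', 'B', 'C'],
    'User-ID': [1, 2, 1, 3],
    'Book-Rating': [5, 3, 4, 9],
})

# One row per book, one column per user; missing ratings become 0.
toy_pivot = toy_ratings.pivot_table(
    index='ISBN', columns='User-ID', values='Book-Rating',
    aggfunc='mean', fill_value=0,
)

# Store only the non-zero ratings to save memory.
toy_sparse = csr_matrix(toy_pivot.values)
print(toy_pivot)
print(toy_sparse.shape, toy_sparse.nnz)
```

Most entries of a real book-by-user matrix are zero, which is why converting it to a sparse matrix before fitting the model keeps memory use manageable.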


In [8]:
df_book_ratings_big_chunk = df_book_ratings.sample(frac=0.12, replace=False)

In [9]:
df_book_ratings_pivot = df_book_ratings_big_chunk.pivot(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)

In [10]:
df_book_features_ratings_matrix = csr_matrix(df_book_ratings_pivot)

* create the instance of the NearestNeighbors class

In [11]:
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', radius=1)

* fit the NearestNeighbors model using 'df_book_features'

In [12]:
model_knn.fit(df_book_features_ratings_matrix)
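
To see what the fitted model returns, here is a small sketch on a made-up matrix. The names `toy_matrix` and `toy_knn` are hypothetical. With `metric='cosine'`, the reported distance is 1 minus the cosine similarity, so 0 means identical rating patterns and 1 means the two books share no raters.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy item-by-user matrix (rows are books, columns are users).
toy_matrix = csr_matrix(np.array([
    [5, 0, 0],
    [5, 5, 0],
    [0, 0, 4],
], dtype=float))

toy_knn = NearestNeighbors(metric='cosine', algorithm='brute')
toy_knn.fit(toy_matrix)

# Query with the first book and ask for all three rows back, nearest first.
distances, indices = toy_knn.kneighbors(toy_matrix[0], n_neighbors=3)
print(indices.flatten())
print(distances.flatten())
```

The first book is its own nearest neighbor, the second book is at distance 1 - 1/sqrt(2) (about 0.29) because it shares one rater with the query, and the third book is at distance 1 because it shares none.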

* create a function that returns the top 5 most similar books (according to the KNN model) for a selected ISBN
    * the input will be Book-Title from the DataFrame books
    * the output will be the Book-Titles of the top 5 most similar books.
    * for every book in the top 5 most similar books, print also the distance from the selected book (ISBN we chose as input to the function)
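
One possible shape for a title-based version of this function is sketched below on made-up data. Everything here (`toy`, `pivot`, `titles`, `knn`, `recommend_by_title`) is hypothetical and separate from the notebook's own variables. The sketch assumes each title maps to one ISBN and removes the query book from the results by ISBN rather than by position, since ties at distance 0 can reorder the nearest neighbors.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data; titles and ratings are made up for illustration.
toy = pd.DataFrame({
    'ISBN': ['A', 'A', 'B', 'B', 'C', 'D'],
    'User-ID': [1, 2, 1, 2, 3, 1],
    'Book-Rating': [5, 3, 4, 4, 9, 2],
    'Book-Title': ['Alpha', 'Alpha', 'Beta', 'Beta', 'Gamma', 'Delta'],
})
pivot = toy.pivot(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
titles = toy.drop_duplicates('ISBN').set_index('ISBN')['Book-Title']
knn = NearestNeighbors(metric='cosine', algorithm='brute').fit(pivot.values)

def recommend_by_title(title, top_n=5):
    """Return (title, distance) pairs for the books most similar to `title`."""
    isbn = titles[titles == title].index[0]  # first ISBN carrying this title
    query = pivot.loc[[isbn]].values
    # Ask for one extra neighbor because the query book itself is usually returned.
    k = min(top_n + 1, len(pivot))
    distances, indices = knn.kneighbors(query, n_neighbors=k)
    results = []
    for idx, dist in zip(indices.flatten(), distances.flatten()):
        neighbor_isbn = pivot.index[idx]
        if neighbor_isbn != isbn:  # drop the query book by ISBN, not by position
            results.append((titles[neighbor_isbn], float(dist)))
    return results[:top_n]

for rank, (name, dist) in enumerate(recommend_by_title('Alpha'), start=1):
    print(f'{rank}: {name}, with distance: {dist:.4f}')
```

Returning a list of pairs instead of printing inside the function keeps the lookup reusable; the caller decides how to display the titles and distances.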

In [13]:
def recommender(query_index, n_neighbors=11):
    # Look up the query ISBN here rather than reading a global ISBN variable.
    isbn = df_book_ratings_pivot.index[query_index]
    query_vector = df_book_ratings_pivot.iloc[query_index, :].values.reshape(1, -1)
    distances, indices = model_knn.kneighbors(query_vector, n_neighbors=n_neighbors, return_distance=True)
    for i, (idx, distance) in enumerate(zip(indices.flatten(), distances.flatten())):
        neighbor_isbn = df_book_ratings_pivot.index[idx]
        title = df_book_ratings_big_chunk.loc[df_book_ratings_big_chunk.ISBN == neighbor_isbn, 'Book-Title'].values[0]
        if i == 0:
            # The nearest neighbor is usually the query book itself, but ties at distance 0 can put another book first.
            print(f'Recommendations for {isbn}:{title} \n')
        else:
            print(f'{i}:{title}, with distance:{distance}')

* Apply the function to a book of your choice

In [44]:
query_index = np.random.choice(df_book_ratings_pivot.shape[0])
ISBN = df_book_ratings_pivot.index[query_index]

In [45]:
recommender(query_index)

Recommendations for 0349102147:Warchild (Star Trek Deep Space Nine, No 7) 

1:Espedair Street, with distance:0.0
2:The Toughest Indian in the World, with distance:0.29289321881345254
3:Complicity, with distance:0.3356361611700802
4:Life of Pi, with distance:0.8713713320610403
5:The Lovely Bones: A Novel, with distance:0.9186361203475397
6:Late Bloomer (Michaels, Fern), with distance:1.0
7:Larger Than Life, with distance:1.0
8:Blood on the Tongue, with distance:1.0
9:Crown Jewel : A Novel, with distance:1.0
10:Magnolia Moon (Callahan Brothers Trilogy), with distance:1.0
