## Collaborative Filtering

In [1]:
# import pandas
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity   

We will create a recommender engine based on Item-Based Collaborative Filtering (IBCF), which searches for the most similar books based on user ratings. The data can be downloaded from [here](https://drive.google.com/file/d/1WvTmAfO09TCX7xp7uu06__ziic7JnrL5/view?usp=sharing).
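
As a rough illustration of the idea, each book can be treated as a vector of user ratings, and two books are compared with cosine similarity. The sketch below uses a tiny invented ratings matrix (the books and ratings are made up for the example, not taken from the dataset):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data: rows are books, columns are users, 0 = no rating.
ratings = np.array([
    [5, 4, 0, 0],   # book A
    [4, 5, 0, 1],   # book B
    [0, 0, 5, 4],   # book C
])

# Pairwise cosine similarity between book rows (3 x 3 matrix).
item_sim = cosine_similarity(ratings)

# Most similar *other* book to book A: mask the diagonal, then take argmax.
sim_to_a = item_sim[0].copy()
sim_to_a[0] = -1.0
most_similar_to_a = int(np.argmax(sim_to_a))
```

Books A and B are rated by the same users, so B comes out as the closest match to A, while C shares no raters with A.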

In [2]:
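# Local directory that holds the downloaded CSV files
url = '/home/henri/Documents/Lighthouse-lab/Databases/w10-d2-db/'
book_ratings = pd.read_csv(url+'BX-Book-Ratings.csv',sep=";", encoding="latin")
# error_bad_lines=False skips malformed rows (deprecated in pandas 1.3, removed in pandas 2.0)
books = pd.read_csv(url+'BX-Books.csv',sep=";", encoding="latin", error_bad_lines=False)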

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
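
The `error_bad_lines=False` argument was deprecated in pandas 1.3 and removed in pandas 2.0. On newer versions, a minimal equivalent might look like the sketch below, which assumes pandas 1.3 or later and uses an invented in-memory CSV rather than the real file:

```python
import io
import pandas as pd

# Hypothetical three-field CSV; the second data row has a stray extra field.
raw = "ISBN;Title;Author\n111;Book One;Ann\n222;Book;Two;Bob\n333;Book Three;Cy\n"

# on_bad_lines="skip" drops rows with too many fields instead of raising an error.
sample = pd.read_csv(io.StringIO(raw), sep=";", on_bad_lines="skip")
```

Only the two well-formed rows remain in `sample`; the row with four fields is dropped.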


* Explore both datasets

In [3]:
book_ratings.head() # rating 0-10

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [4]:
book_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149780 non-null  int64 
 1   ISBN         1149780 non-null  object
 2   Book-Rating  1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


In [5]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


* Create a DataFrame named `df_book_features` from `book_ratings` with `ISBN` as the index, `User-ID` as the columns, and `Book-Rating` as the values.


In [6]:
books = books.drop(['Image-URL-S','Image-URL-M'],axis=1)

In [7]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 6 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271359 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
 5   Image-URL-L          271357 non-null  object
dtypes: object(6)
memory usage: 12.4+ MB


In [8]:
book_ratings.groupby('ISBN')['Book-Rating'].mean()

ISBN
 0330299891    3.0
 0375404120    1.5
 0586045007    0.0
 9022906116    3.5
 9032803328    0.0
              ... 
cn113107       0.0
ooo7156103     7.0
§423350229     0.0
´3499128624    8.0
Ô½crosoft      7.0
Name: Book-Rating, Length: 340556, dtype: float64

In [9]:
book_ratings.groupby('ISBN')['Book-Rating'].count()

ISBN
 0330299891    2
 0375404120    2
 0586045007    1
 9022906116    2
 9032803328    1
              ..
cn113107       1
ooo7156103     1
§423350229     1
´3499128624    1
Ô½crosoft      1
Name: Book-Rating, Length: 340556, dtype: int64

In [10]:
books = books.merge(book_ratings.groupby('ISBN')['Book-Rating'].mean(), on='ISBN')
books = books.merge(book_ratings.groupby('ISBN')['Book-Rating'].count(),on='ISBN')

In [11]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-L,Book-Rating_x,Book-Rating_y
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,0.0,1
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,4.928571,14
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,5.0,3
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,4.272727,11
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,0.0,1


In [16]:
books = books.rename(columns={'Book-Rating_x': 'Avg Rating', 'Book-Rating_y': 'Review Count'})
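
The `_x`/`_y` suffixes appear because both merged Series share the name `Book-Rating`. One way to avoid the rename step is to compute both statistics in a single named aggregation and merge once. The sketch below uses invented toy frames (the `toy_` names are hypothetical):

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "ISBN": ["a", "a", "b"],
    "Book-Rating": [4, 8, 0],
})
toy_books = pd.DataFrame({"ISBN": ["a", "b"], "Book-Title": ["Alpha", "Beta"]})

# Named aggregation: output column name -> (input column, aggregation function).
stats = toy_ratings.groupby("ISBN").agg(
    **{
        "Avg Rating": ("Book-Rating", "mean"),
        "Review Count": ("Book-Rating", "count"),
    }
).reset_index()

# A single merge now brings in both columns with their final names.
toy_books = toy_books.merge(stats, on="ISBN")
```

The dictionary unpacking is only needed because the target column names contain spaces.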

In [17]:
books_clean = books[(books['Avg Rating'] > 0) & (books['Review Count']>1)]
books_clean.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-L,Avg Rating,Review Count
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,4.928571,14
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,5.0,3
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,4.272727,11
5,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,4.212121,33
6,425176428,What If?: The World's Foremost Military Histor...,Robert Cowley,2000,Berkley Publishing Group,http://images.amazon.com/images/P/0425176428.0...,1.6,5


In [22]:
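# Keep only ratings from users who have rated more than 70 books
book_ratings = book_ratings[book_ratings['User-ID'].map(book_ratings['User-ID'].value_counts()) > 70]

# Then keep only books with more than 50 ratings among the remaining users
book_ratings = book_ratings[book_ratings['ISBN'].map(book_ratings['ISBN'].value_counts()) > 50]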
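
The filtering idiom in this cell maps each row's key to the number of times that key occurs, then keeps the rows above a threshold. A small sketch on invented data, with the threshold lowered to fit the toy example:

```python
import pandas as pd

toy = pd.DataFrame({
    "User-ID": [1, 1, 1, 2, 3, 3],
    "ISBN":    ["a", "b", "c", "a", "a", "b"],
})

# value_counts() gives the number of ratings per user;
# map() broadcasts that count back onto every row of that user.
per_row_user_count = toy["User-ID"].map(toy["User-ID"].value_counts())

# Keep only rows from users with more than 2 ratings (only user 1 here).
active = toy[per_row_user_count > 2]
```

Because the second filter in the notebook is computed after the first, the per-book counts there reflect only ratings from the users that survived the first filter.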

In [23]:
# Goal: a DataFrame named 'df_book_features' with ISBN as index, User-ID as columns and Book-Rating as values.
# Note: set_index() only moves ISBN into the index; the frame stays in long format (one row per rating).
df_book_features = book_ratings.set_index(['ISBN'])
df_book_features

Unnamed: 0_level_0,User-ID,Book-Rating
ISBN,Unnamed: 1_level_1,Unnamed: 2_level_1
002542730X,277427,10
0060930535,277427,0
0060934417,277427,0
0061009059,277427,9
014029628X,277427,0
...,...,...
0451203771,275970,0
0553268880,275970,0
0553275976,275970,0
0804111359,275970,0
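
`set_index` keeps the data in long format, with one row per rating. A possible way to obtain the wide ISBN-by-User-ID matrix described in the task is `pivot_table`. The sketch below uses invented toy ratings and assumes that missing ratings should be filled with 0:

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["a", "b", "a", "b"],
    "Book-Rating": [10, 0, 7, 9],
})

# Rows = ISBN, columns = User-ID, cells = Book-Rating; unrated pairs become 0.
toy_features = toy_ratings.pivot_table(
    index="ISBN", columns="User-ID", values="Book-Rating", fill_value=0
)
```

Note that 0 also occurs as a value in `Book-Rating`, so this fill does not distinguish an unrated book from one rated 0.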


* Create an instance of the NearestNeighbors class that uses cosine similarity

In [24]:
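# Note: with no arguments, NearestNeighbors uses the Minkowski metric with p=2 (Euclidean distance)
knn = NearestNeighbors()

In [25]:
# Note: this fits on a row-by-row similarity matrix of the long-format frame,
# not on an ISBN x User-ID ratings matrix
knn.fit(cosine_similarity(df_book_features))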

NearestNeighbors()
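
For reference, `NearestNeighbors()` with no arguments uses the Minkowski metric with `p=2`, which is Euclidean distance. A minimal sketch of a cosine-based model, fitted directly on an invented item-by-user matrix rather than on a precomputed similarity matrix, could look like this:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical item-by-user ratings (rows = books, columns = users).
toy_matrix = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
])

# metric="cosine" makes kneighbors() return cosine distance (1 - cosine similarity).
toy_knn = NearestNeighbors(metric="cosine", algorithm="brute")
toy_knn.fit(toy_matrix)

# Query with the first book; its nearest neighbor is itself at distance ~0.
distances, indices = toy_knn.kneighbors(toy_matrix[[0]], n_neighbors=2)
```

Since the query row is part of the fitted data, the first neighbor returned is the query itself, so real lookups usually request one extra neighbor and discard it.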

* Fit the NearestNeighbors model using `df_book_features`

* Create a function that returns the top 5 most similar books (according to the KNN model) for a selected book
    * the input will be a Book-Title from the DataFrame `books`
    * the output will be the Book-Titles of the top 5 most similar books
    * also print the distance of each from the selected book

In [None]:
def similarMovie():
    # Left unimplemented in the original notebook
    raise NotImplementedError

* Apply the function
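
One possible shape for such a function is sketched below on invented toy data. The name `similar_books`, its parameters, and the toy frames are assumptions made for illustration, not the notebook's implementation. It looks up the ISBN for a title, queries a cosine-based `NearestNeighbors` model fitted on an ISBN-by-User-ID matrix, and returns the neighbor titles with their distances.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy stand-ins for `books` and a pivoted `df_book_features`.
toy_books = pd.DataFrame({
    "ISBN": ["a", "b", "c"],
    "Book-Title": ["Alpha", "Beta", "Gamma"],
})
toy_features = pd.DataFrame(
    [[5, 4, 0, 0], [4, 5, 0, 1], [0, 0, 5, 4]],
    index=pd.Index(["a", "b", "c"], name="ISBN"),
)
toy_knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy_features.values)


def similar_books(title, books, features, model, k=5):
    """Return (title, cosine distance) pairs for the k books nearest to `title`."""
    isbn = books.loc[books["Book-Title"] == title, "ISBN"].iloc[0]
    # Ask for one extra neighbor because the query book is returned as well.
    n = min(k + 1, len(features))
    dist, idx = model.kneighbors(features.loc[[isbn]].values, n_neighbors=n)
    titles = books.set_index("ISBN")["Book-Title"]
    results = []
    for d, i in zip(dist[0], idx[0]):
        neighbor_isbn = features.index[i]
        if neighbor_isbn != isbn:  # skip the query book itself
            results.append((titles.get(neighbor_isbn, neighbor_isbn), float(d)))
    return results[:k]


# Apply the function and print each neighbor with its distance.
for name, d in similar_books("Alpha", toy_books, toy_features, toy_knn, k=2):
    print(f"{name}: distance {d:.3f}")
```

The sketch raises an `IndexError` when the title is not found and assumes each ISBN appears once in `books`; a fuller version would handle both cases explicitly.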