## Collaborative Filtering

In [1]:
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix

We will build a recommender engine based on Item-Based Collaborative Filtering (IBCF), which finds the books most similar to a given book according to user ratings. The data can be downloaded from [here](https://drive.google.com/file/d/1WvTmAfO09TCX7xp7uu06__ziic7JnrL5/view?usp=sharing).
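
To make "most similar books based on the user ratings" concrete, here is a minimal sketch of the cosine distance between item rating vectors, the metric the KNN model uses later in this notebook. The `ratings` array and the `cosine_distance` helper are hypothetical toy constructs, not part of the dataset or of any library.

```python
import numpy as np

# Hypothetical toy ratings: rows are books (items), columns are users, 0 = no rating.
ratings = np.array([
    [5.0, 4.0, 0.0],   # book A
    [4.0, 5.0, 0.0],   # book B
    [0.0, 0.0, 3.0],   # book C
])

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero rating vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

dist_ab = cosine_distance(ratings[0], ratings[1])  # rated by the same users
dist_ac = cosine_distance(ratings[0], ratings[2])  # no users in common
print(f"A vs B: {dist_ab:.3f}")
print(f"A vs C: {dist_ac:.3f}")
```

Books rated similarly by the same users end up close to each other (small distance), while books with no raters in common are at distance 1.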

In [2]:
# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0;
# on newer versions use on_bad_lines="warn" (or "skip") instead.
book_ratings = pd.read_csv('BX-CSV-Dump\\BX-Book-Ratings.csv', sep=";", encoding="latin")
books = pd.read_csv('BX-CSV-Dump\\BX-Books.csv', sep=";", encoding="latin", error_bad_lines=False)
books_users = pd.read_csv('BX-CSV-Dump\\BX-Users.csv', sep=";", encoding="latin", error_bad_lines=False)



b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: 
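
The "Skipping line" messages above come from rows that contain more fields than the header declares. As a small sketch of the same behaviour on newer pandas versions (1.3 or later, where `on_bad_lines` exists), the example below parses a hypothetical in-memory file; the `raw` string and the `toy_books` name are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data; the third data row has one field too many.
raw = "ISBN;Title\n111;Book A\n222;Book B\n333;Book;C\n444;Book D\n"

# on_bad_lines="skip" drops rows with too many fields without raising an error.
# dtype=str keeps ISBN-like identifiers as text.
toy_books = pd.read_csv(io.StringIO(raw), sep=";", dtype=str, on_bad_lines="skip")
print(toy_books)
```

Using `on_bad_lines="warn"` instead would also report each skipped row, which is closer to the output shown above.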

* Explore the datasets

In [3]:
book_ratings

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6
...,...,...,...
1149775,276704,1563526298,9
1149776,276706,0679447156,0
1149777,276709,0515107662,10
1149778,276721,0590442449,10


In [4]:
books

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...
...,...,...,...,...,...,...,...,...
271355,0440400988,There's a Bat in Bunk Five,Paula Danziger,1988,Random House Childrens Pub (Mm),http://images.amazon.com/images/P/0440400988.0...,http://images.amazon.com/images/P/0440400988.0...,http://images.amazon.com/images/P/0440400988.0...
271356,0525447644,From One to One Hundred,Teri Sloat,1991,Dutton Books,http://images.amazon.com/images/P/0525447644.0...,http://images.amazon.com/images/P/0525447644.0...,http://images.amazon.com/images/P/0525447644.0...
271357,006008667X,Lily Dale : The True Story of the Town that Ta...,Christine Wicker,2004,HarperSanFrancisco,http://images.amazon.com/images/P/006008667X.0...,http://images.amazon.com/images/P/006008667X.0...,http://images.amazon.com/images/P/006008667X.0...
271358,0192126040,Republic (World's Classics),Plato,1996,Oxford University Press,http://images.amazon.com/images/P/0192126040.0...,http://images.amazon.com/images/P/0192126040.0...,http://images.amazon.com/images/P/0192126040.0...


In [5]:
books_users

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",
...,...,...,...
278853,278854,"portland, oregon, usa",
278854,278855,"tacoma, washington, united kingdom",50.0
278855,278856,"brampton, ontario, canada",
278856,278857,"knoxville, tennessee, usa",


In [6]:
books_info = pd.merge(books, book_ratings, on='ISBN')

In [7]:
book_ratings_count = pd.DataFrame(books_info.groupby('ISBN')['Book-Rating'].count())
book_ratings_count

Unnamed: 0_level_0,Book-Rating
ISBN,Unnamed: 1_level_1
0000913154,1
0001010565,2
0001046438,1
0001046713,1
000104687X,1
...,...
B000234N76,1
B000234NC6,1
B00029DGGO,1
B0002JV9PY,1


* Create a DataFrame named `df_book_features` from `book_ratings` with `ISBN` as the index, `User-ID` as the columns, and `Book-Rating` as the values.
    - The data are quite large, so it is fine to use only a sample if your PC has limited RAM.
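
As a rough sketch of the reshaping this task asks for, the example below pivots a hypothetical long-format table into an item-by-user matrix. The `toy_ratings` and `toy_features` names and their values are invented for illustration; only the column names match the dataset.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as book_ratings.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3, 3],
    "ISBN": ["A", "B", "A", "B", "C"],
    "Book-Rating": [5, 3, 4, 0, 8],
})

# Optional: work on a reproducible random sample if memory is limited.
# toy_ratings = toy_ratings.sample(frac=0.5, random_state=0)

# One row per ISBN, one column per user; missing ratings become 0.
toy_features = toy_ratings.pivot(
    index="ISBN", columns="User-ID", values="Book-Rating"
).fillna(0)
print(toy_features)
```

`pivot` requires every (ISBN, User-ID) pair to be unique; with duplicated pairs it raises an error, so duplicates have to be removed or aggregated first.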


In [8]:
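# Keep only the first row per ISBN (a single rating per book); this also
# guarantees unique (ISBN, User-ID) pairs for the pivot further below.
books_info.drop_duplicates(['ISBN'], inplace=True)
df_book_ratings = books_info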
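
Dropping duplicates on `ISBN` alone leaves one rating per book. A possible alternative, shown here only as a sketch and not what this notebook does, is to drop duplicates on the (`ISBN`, `User-ID`) pair, which keeps every user's rating while still satisfying the uniqueness that `pivot` needs. The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: book B was rated twice by user 1 (a true duplicate).
toy = pd.DataFrame({
    "ISBN": ["A", "A", "B", "B", "B"],
    "User-ID": [1, 2, 1, 1, 3],
    "Book-Rating": [5, 4, 3, 3, 7],
})

one_per_book = toy.drop_duplicates(["ISBN"])             # one row per book
one_per_pair = toy.drop_duplicates(["ISBN", "User-ID"])  # one row per (book, user)

print(len(one_per_book), "rows when deduplicating on ISBN")
print(len(one_per_pair), "rows when deduplicating on (ISBN, User-ID)")
```

The second variant produces a much larger pivot on the real data, so it may need the sampling mentioned above.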

* Create an instance of the NearestNeighbors class

In [9]:
df_book_ratings.drop(columns=['Image-URL-S', 'Image-URL-M', 'Image-URL-L', 'Book-Author', 'Year-Of-Publication', 'Publisher'], inplace=True)

In [10]:
df_book_ratings.reset_index(drop=True, inplace=True)
df_book_ratings

Unnamed: 0,ISBN,Book-Title,User-ID,Book-Rating
0,0195153448,Classical Mythology,2,0
1,0002005018,Clara Callan,8,5
2,0060973129,Decision in Normandy,8,0
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,8,0
4,0393045218,The Mummies of Urumchi,8,0
...,...,...,...,...
270146,0440400988,There's a Bat in Bunk Five,276463,7
270147,0525447644,From One to One Hundred,276579,4
270148,006008667X,Lily Dale : The True Story of the Town that Ta...,276680,0
270149,0192126040,Republic (World's Classics),276680,0


In [11]:
df_book_ratings_big_chunk = df_book_ratings[:100000]

In [12]:
df_book_ratings_big_chunk

Unnamed: 0,ISBN,Book-Title,User-ID,Book-Rating
0,0195153448,Classical Mythology,2,0
1,0002005018,Clara Callan,8,5
2,0060973129,Decision in Normandy,8,0
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,8,0
4,0393045218,The Mummies of Urumchi,8,0
...,...,...,...,...
99995,0373191960,Family Of The Year (Harlequin Silhouette Roman...,278418,0
99996,0373192339,"Bachelor'S Baby (Silhouette Romance, No 1233)",51883,0
99997,3312001994,Der Zeitreisende. Die Visionen des Henry Dunant.,214202,0
99998,037319045X,Bachelor At The Wedding (Wedding Wager) (Silho...,51883,0


In [13]:
df_book_ratings_pivot = df_book_ratings_big_chunk.pivot(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)

In [14]:
df_book_features_ratings_matrix = csr_matrix(df_book_ratings_pivot.values)

In [15]:
model_knn = NearestNeighbors(metric='cosine', algorithm='brute')  # optional arguments: n_neighbors=20, n_jobs=-1

In [16]:
model_knn.fit(df_book_features_ratings_matrix)

* Fit the NearestNeighbors model using `df_book_features`
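
The fit-and-query pattern can be checked on a matrix small enough to verify by hand. This is a minimal sketch with hypothetical data (`toy_matrix`, `toy_knn`); it uses the same `NearestNeighbors` settings as the cells above.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical item-by-user matrix: 4 books (rows), 4 users (columns).
toy_matrix = csr_matrix(np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 1.0, 4.0, 0.0],
]))

toy_knn = NearestNeighbors(metric="cosine", algorithm="brute")
toy_knn.fit(toy_matrix)

# Query with the first book; the closest neighbor is normally the book itself.
toy_distances, toy_indices = toy_knn.kneighbors(toy_matrix[0], n_neighbors=2)
print(toy_indices, toy_distances)
```

Because the query row is part of the fitted data, one extra neighbor is requested and the first hit is treated as the query itself.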

In [17]:
query_index = np.random.choice(df_book_ratings_pivot.shape[0])
print(query_index)

5978


In [18]:
df_book_ratings_pivot.iloc[query_index,:].values.reshape(1,-1)

array([[0., 0., 0., ..., 0., 0., 0.]])

In [19]:
distances, indices = model_knn.kneighbors(df_book_ratings_pivot.iloc[query_index,:].values.reshape(1, -1), n_neighbors=6)

In [20]:
for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:'.format(df_book_ratings_pivot.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, df_book_ratings_pivot.index[indices.flatten()[i]], distances.flatten()[i]))

Recommendations for 0140065520:
1: 0345359860, with distance of 0.0:
2: 0553239317, with distance of 0.0:
3: 0140065520, with distance of 0.0:
4: 0380674548, with distance of 0.0:
5: 0553254251, with distance of 0.0:
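
Every distance in the output above is 0.0, and the query book itself appears as the third result. One plausible explanation, given that cell In [8] keeps a single rating per ISBN, is that these books were all rated by the same single user: vectors with one non-zero entry in the same column point in the same direction, so their cosine distance is 0 regardless of the rating value. The sketch below illustrates this with hypothetical rows.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical rows: three books, each rated only by the user in column 1.
single_user = np.array([
    [0.0, 5.0, 0.0],
    [0.0, 9.0, 0.0],
    [0.0, 7.0, 0.0],
])

# Pairwise cosine distances between all rows.
tie_distances = cosine_distances(single_user)
print(tie_distances)
```

With ties like these, the order of the neighbors carries no information about similarity.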


* Create a function that returns the top 5 most similar books (according to the KNN model) for a selected book
    * the input will be a Book-Title from the DataFrame books
    * the output will be the Book-Titles of the top 5 most similar books
    * for every book in the top 5, also print its distance from the selected book (the one passed to the function)

In [21]:
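def recommender(title, top_items=5):
    # Row label(s) in df_book_ratings_big_chunk whose title matches.
    # Caveat: the label is used below as a position in df_book_ratings_pivot, whose rows
    # are sorted by ISBN, so the queried row is not guaranteed to be the requested book.
    query_index = df_book_ratings_big_chunk[df_book_ratings_big_chunk['Book-Title'] == title].index
    # Ask for one extra neighbor because the first hit is treated as the query itself.
    distances, indices = model_knn.kneighbors(df_book_ratings_pivot.iloc[query_index,:].values.reshape(1, -1), n_neighbors=top_items + 1)

    for i in range(0, len(distances.flatten())):
        print(indices.flatten()[i])  # position of the neighbor in the pivot
        if i == 0:
            print('Recommendations for {0}:'.format(df_book_ratings_big_chunk[df_book_ratings_big_chunk.index == query_index[0]]['Book-Title']))
        else:
            print('{0}: {1}, with distance of {2}:'.format(i, df_book_ratings_big_chunk[df_book_ratings_big_chunk.ISBN == df_book_ratings_pivot.index[indices.flatten()[i]]]['Book-Title'], distances.flatten()[i]))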

In [22]:
recommender('Holy the Firm')

91902
Recommendations for 91902    Holy the Firm
Name: Book-Title, dtype: object:
66668
1: 41861    Eyewitnesses to Massacre: American Missionarie...
Name: Book-Title, dtype: object, with distance of 1.0:
66669
2: 22236    The Place Where You Are Standing Is Holy: A Je...
Name: Book-Title, dtype: object, with distance of 1.0:
66666
3: 8635    The Da Vinci Legacy
Name: Book-Title, dtype: object, with distance of 1.0:
66667
4: 84401    People of the Wolf Special Intro Edition (Firs...
Name: Book-Title, dtype: object, with distance of 1.0:
66664
5: 2251    The Hunt Begins: The Great Hunt, Part 1 (The W...
Name: Book-Title, dtype: object, with distance of 1.0:
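
The `recommender` function takes the row label of the title in `df_book_ratings_big_chunk` and uses it as a position in the pivot, although the pivot rows are ordered by ISBN. A possible variant, sketched here on hypothetical toy data, maps the title to its ISBN first and then to the row position with `Index.get_loc`. The names `toy`, `toy_pivot`, `toy_titles`, `toy_model` and `recommend_by_title` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data: one row per (book, user) rating.
toy = pd.DataFrame({
    "ISBN": ["111", "111", "222", "222", "333", "444", "444"],
    "Book-Title": ["Alpha", "Alpha", "Beta", "Beta", "Gamma", "Delta", "Delta"],
    "User-ID": [1, 2, 1, 2, 3, 2, 3],
    "Book-Rating": [5, 4, 4, 5, 3, 1, 4],
})

toy_pivot = toy.pivot(index="ISBN", columns="User-ID", values="Book-Rating").fillna(0)
toy_titles = toy.drop_duplicates("ISBN").set_index("ISBN")["Book-Title"]
toy_model = NearestNeighbors(metric="cosine", algorithm="brute")
toy_model.fit(csr_matrix(toy_pivot.values))

def recommend_by_title(title, top_items=2):
    """Return (title, distance) pairs for the books closest to `title`."""
    isbn = toy_titles[toy_titles == title].index[0]  # first ISBN with this title
    position = toy_pivot.index.get_loc(isbn)         # row position in the pivot
    query = toy_pivot.iloc[[position]].values        # shape (1, n_users)
    n_neighbors = min(top_items + 1, len(toy_pivot))
    distances, indices = toy_model.kneighbors(query, n_neighbors=n_neighbors)
    results = []
    for dist, idx in zip(distances.flatten(), indices.flatten()):
        if idx == position:
            continue  # skip the query book itself
        results.append((toy_titles[toy_pivot.index[idx]], float(dist)))
    return results[:top_items]

for rec_title, rec_dist in recommend_by_title("Alpha"):
    print(f"{rec_title}: distance {rec_dist:.3f}")
```

Returning plain title strings instead of printing whole Series also avoids the `Name: Book-Title, dtype: object` lines seen in the output above. A title that is not present raises an `IndexError` in this sketch.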


* Apply the function to a book of your choice

In [23]:
recommender('Der Fluch der Kaiserin. Ein Richter- Di- Roman.')

6828
Recommendations for 33    Der Fluch der Kaiserin. Ein Richter- Di- Roman.
Name: Book-Title, dtype: object:
78486
1: 53433    Our Nell : A Scrapbook Biography of Nellie L. ...
Name: Book-Title, dtype: object, with distance of 0.0:
60202
2: 53412    Noah's Flood : The New Scientific Discoveries ...
Name: Book-Title, dtype: object, with distance of 0.0:
48813
3: 53428    Redneck Heaven: Portrait of a Vanishing Culture
Name: Book-Title, dtype: object, with distance of 0.0:
67145
4: 53446    From the Cop Shop: Weird and Wonderful Tales f...
Name: Book-Title, dtype: object, with distance of 0.0:
53375
5: 53435    Solomon Gursky was here
Name: Book-Title, dtype: object, with distance of 0.0:
