<a href="https://colab.research.google.com/github/abyanjan/Recommender-Systems-with-Python/blob/master/Recommendation_with_matrix_factorization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Recommendation Systems with Matrix Factorization

**Book Recommendation**

In [None]:
!pip install -q surprise



### Data

The data used here is the book-crossing dataset available at http://www2.informatik.uni-freiburg.de/~cziegler/BX/

In [None]:
# downloading the data
!wget http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip

--2021-03-27 13:02:56--  http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip
Resolving www2.informatik.uni-freiburg.de (www2.informatik.uni-freiburg.de)... 132.230.105.133
Connecting to www2.informatik.uni-freiburg.de (www2.informatik.uni-freiburg.de)|132.230.105.133|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26085508 (25M) [application/zip]
Saving to: ‘BX-CSV-Dump.zip’


2021-03-27 13:03:01 (6.29 MB/s) - ‘BX-CSV-Dump.zip’ saved [26085508/26085508]



In [None]:
# unzipping the data
import zipfile
with zipfile.ZipFile('BX-CSV-Dump.zip', 'r') as zip_ref:
    zip_ref.extractall('data')

In [None]:
# check the list of data files
%ls 'data'

BX-Book-Ratings.csv  BX-Books.csv  BX-Users.csv


The dataset contains three files:
- BX-Users: contains information on users, including demographic data if available
- BX-Books: contains information on books, identified by their ISBN, with data on 'Book-Title', 'Book-Author', 'Year-Of-Publication' and 'Publisher'
- BX-Book-Ratings: contains the book rating information; explicit ratings are on a scale from 1 to 10 (higher values denoting higher appreciation), and a rating of 0 denotes an implicit rating

In [None]:
import pandas as pd
import numpy as np
import scipy

In [None]:
# reading ratings data
data = pd.read_csv("data/BX-Book-Ratings.csv", sep=';', header=0, names=['user','isbn','rating'],encoding='latin-1')
data.head()

Unnamed: 0,user,isbn,rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [None]:
# reading books data
# on_bad_lines='skip' (pandas >= 1.3) replaces the deprecated error_bad_lines=False
books = pd.read_csv("data/BX-Books.csv", sep=';', header=0, on_bad_lines='skip', usecols=[0,1,2], index_col=0,
                   names=['isbn','title','author'], encoding='latin-1')
books.head()

isbn,title,author
0195153448,Classical Mythology,Mark P. O. Morford
0002005018,Clara Callan,Richard Bruce Wright
0060973129,Decision in Normandy,Carlo D'Este
0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata
0393045218,The Mummies of Urumchi,E. J. W. Barber


In [None]:
# setting up a function to get metadata on any book by its isbn number
def bookMeta(isbn):
  title = books.loc[isbn,'title']
  author = books.loc[isbn,'author']
  return title, author

In [None]:
# testing the bookMeta function
bookMeta('0195153448')

('Classical Mythology', 'Mark P. O. Morford')

In [None]:
# setting a function to get top N favourite books for a user
def favBooks(user, N):
  # filtering out ratings for the specified user only
  userdata = data[data['user'] == user]
  # sorting by descending rating and keeping the top N rated books;
  # .copy() avoids pandas' SettingWithCopyWarning when the column is added
  sorted_ratings = userdata.sort_values('rating', ascending=False)[:N].copy()
  # adding book metadata as a (title, author) tuple
  sorted_ratings['title'] = sorted_ratings['isbn'].apply(bookMeta)
  return sorted_ratings

Some ratings may refer to books that are missing from the books data. So, we keep only the ratings for books that we have information about.

In [None]:
# making sure that we only have the ratings for the books that we have information about, that is stored in books dataframe
data = data[data['isbn'].isin(books.index)]
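
The effect of this filter is easiest to see on a toy example. The sketch below uses made-up frames (`toy_ratings` and `toy_books` are hypothetical names and values, not part of the dataset) and counts how many ratings are dropped because their ISBN is unknown.

```python
import pandas as pd

# Toy ratings: the ISBN 'B3' has no entry in the toy books table.
toy_ratings = pd.DataFrame({
    'user': [1, 1, 2],
    'isbn': ['B1', 'B3', 'B2'],
    'rating': [8, 5, 10],
})
toy_books = pd.DataFrame(
    {'title': ['First Book', 'Second Book']},
    index=pd.Index(['B1', 'B2'], name='isbn'),
)

# Keep only the ratings whose ISBN appears in the books index.
known = toy_ratings[toy_ratings['isbn'].isin(toy_books.index)]
dropped = len(toy_ratings) - len(known)

print(known)
print(f"dropped {dropped} rating(s) with unknown ISBN")
```

Comparing `len(data)` before and after the real filter in the same way shows how many ratings are lost.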

In [None]:
# checking favBooks function
favBooks(204622,5)

Unnamed: 0,user,isbn,rating,title
844955,204622,0967560500,10,"(Natural Hormonal Enhancement, Rob Faigin)"
844935,204622,0671027360,10,"(Angels & Demons, Dan Brown)"
844926,204622,0385504209,10,"(The Da Vinci Code, Dan Brown)"
844958,204622,097173660X,9,"(Life After School Explained, Cap & Compass)"
844920,204622,0060935464,9,"(To Kill a Mockingbird, Harper Lee)"


We will use the Surprise library to perform matrix factorization for the recommender system. Surprise is an easy-to-use Python scikit for recommender systems.

In [None]:
from surprise import Dataset, Reader, accuracy
from surprise.model_selection import cross_validate, KFold
from surprise import SVD, SVDpp

Surprise requires a dataframe with exactly three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings, in this order.

In [None]:
data.head()

Unnamed: 0,user,isbn,rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In our data, 'user' holds the user ids, 'isbn' holds the item ids, and 'rating' is the rating given to an item by a user. So, the columns are already in the order Surprise requires.

In [None]:
# create the data to use with surprise library
# specify the rating scale
reader = Reader(rating_scale=(1, 10))
data_surp = Dataset.load_from_df(df = data, reader=reader)
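
The `head()` output above contains ratings of 0, which lie outside the declared `rating_scale` of (1, 10). In the Book-Crossing data a 0 marks an implicit rating rather than a score. A minimal sketch of separating the two kinds, assuming that only explicit ratings should be modelled; `toy`, `explicit` and `share_implicit` are illustrative names, and the rows are the first four from the output above:

```python
import pandas as pd

# Toy frame in the same (user, isbn, rating) layout as `data`.
toy = pd.DataFrame({
    'user': [276725, 276726, 276727, 276729],
    'isbn': ['034545104X', '0155061224', '0446520802', '052165615X'],
    'rating': [0, 5, 0, 3],
})

# Assumption: 0 is an implicit rating, so only 1-10 are kept as explicit.
explicit = toy[toy['rating'] > 0]
share_implicit = (toy['rating'] == 0).mean()

print(explicit)
print(f"share of implicit ratings: {share_implicit:.2f}")
```

Whether to drop the implicit ratings is a modelling choice. The results in this notebook were produced without this filter.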

### **Using SVD**

In [None]:
# setting the algorithm
algo = SVD()
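
Surprise's `SVD` predicts a rating as a global mean plus a user bias, an item bias and the dot product of a user and an item factor vector, and learns these by stochastic gradient descent. The sketch below illustrates that idea on toy data. It is not Surprise's implementation: `fit_mf`, `predict`, `rmse` and `toy_ratings` are hypothetical names, and the hyperparameters are chosen for the toy example, not copied from Surprise's defaults.

```python
import numpy as np

def fit_mf(ratings, n_users, n_items, n_factors=2, n_epochs=200,
           lr=0.01, reg=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + q_i . p_u by stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])        # global mean rating
    bu = np.zeros(n_users)                          # user biases
    bi = np.zeros(n_items)                          # item biases
    P = rng.normal(0, 0.1, (n_users, n_factors))    # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))    # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + Q[i] @ P[u])
            # move each parameter along its regularised gradient
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu_old = P[u].copy()
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu_old - reg * Q[i])
    return mu, bu, bi, P, Q

def predict(model, u, i):
    mu, bu, bi, P, Q = model
    return mu + bu[u] + bi[i] + Q[i] @ P[u]

def rmse(model, ratings):
    errors = [(r - predict(model, u, i)) ** 2 for u, i, r in ratings]
    return float(np.sqrt(np.mean(errors)))

# (user index, item index, rating) triples, made up for illustration
toy_ratings = [(0, 0, 9), (0, 1, 3), (1, 0, 8), (1, 2, 2), (2, 1, 4), (2, 2, 1)]

# Baseline that always predicts the global mean (all biases and factors zero)
baseline = (np.mean([r for _, _, r in toy_ratings]),
            np.zeros(3), np.zeros(3), np.zeros((3, 2)), np.zeros((3, 2)))
model = fit_mf(toy_ratings, n_users=3, n_items=3)

print("baseline RMSE:", rmse(baseline, toy_ratings))
print("fitted RMSE:  ", rmse(model, toy_ratings))
```

Both errors here are measured on the training triples, so the comparison only shows that the updates reduce the fit error, not how well the model generalises.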

In [None]:
# perform cross validation 
kf = KFold(n_splits=3, random_state=1)

RMSE = []
for train, test in kf.split(data_surp):
    # train and test the algorithm
    algo.fit(train)
    predictions = algo.test(test)

    # Compute and print Root Mean Squared Error
    RMSE.append(accuracy.rmse(predictions, verbose=True))

print(f"Mean RMSE score: {np.mean(RMSE)}")

RMSE: 3.4961
RMSE: 3.4976
RMSE: 3.4978
Mean RMSE score: 3.497152051197096


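Here, we get a cross-validated RMSE of about 3.50, which is poor for a rating scale of 1 to 10. RMSE squares the errors before averaging, so it describes a typical error size that weights large misses more heavily than a plain average would. The ratings data also contains 0 values, which fall outside the declared scale and may inflate this error.  
We can try to improve the performance of the SVD model with fine-tuning.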
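
To make the RMSE figure easier to interpret, the sketch below computes RMSE and the mean absolute error (MAE) by hand on made-up numbers. The functions `rmse` and `mae` and the values in `actual` and `predicted` are illustrative only.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square the errors, average, take the root."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: the plain average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [8, 7, 9, 10]
predicted = [7, 6, 8, 3]   # three small misses and one large one

print("RMSE:", rmse(actual, predicted))
print("MAE: ", mae(actual, predicted))
```

The single large miss pulls RMSE well above MAE, which is why the two measures are often reported together.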

**Fine Tuning SVD**

In [None]:
from surprise.model_selection import GridSearchCV

In [None]:
# tuning the number of epochs and the learning rate
# set param grid for both hyperparameters
param_grid = {'n_epochs':[20, 50, 80],
              'lr_all':[0.002, 0.005]}

gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=2)
gs.fit(data_surp)
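
A grid search tries every combination of the listed values. The sketch below only enumerates those combinations with the standard library, to show how much work the search implies. It does not call Surprise, and `combos` and `names` are illustrative names.

```python
from itertools import product

param_grid = {'n_epochs': [20, 50, 80],
              'lr_all': [0.002, 0.005]}

# Every combination of the listed values is one candidate configuration.
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

cv = 2  # number of folds used in the grid search above
for params in combos:
    print(params)
print(f"{len(combos)} candidates x {cv} folds = {len(combos) * cv} model fits")
```

The number of fits grows multiplicatively with each added hyperparameter, so small grids and few folds keep the search affordable on a dataset of this size.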

In [None]:
# best RMSE score
print(gs.best_score['rmse'])

# best parameters that gave the best RMSE score
print(gs.best_params['rmse'])

3.461282842274363
{'n_epochs': 20, 'lr_all': 0.002}


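Fine-tuning brought only a marginal improvement: the best RMSE is 3.46, compared with about 3.50 before. The two scores come from different numbers of folds, so they are only roughly comparable.

We can now fit the algorithm with the best parameters on the whole data.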

In [None]:
# select the best algo
algo = gs.best_estimator['rmse']
algo.fit(data_surp.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fab3b72f5d0>

**Making Prediction**

In [None]:
# get a prediction for specific users and items.
uid = 276798
iid = '3548603203'
r_ui = 6
pred = algo.predict(uid, iid, r_ui=r_ui, verbose=True)

user: 276798     item: 3548603203 r_ui = 6.00   est = 3.06   {'was_impossible': False}


We get the prediction for a book by a user, where 'r_ui' is the actual rating and 'est' is the predicted rating.

Now, we can make predictions for ratings for the books that a user has not read and recommend top N books with the highest ratings.

In [None]:
# creating a function for recommending books
def recommend_books(user_id, algo, N):
  # select books that have been rated by the user (a set gives fast lookups)
  read_books = set(data.loc[data.user == user_id, 'isbn'])
  # create a list of books that the user has not read
  not_read_books = [book for book in data.isbn.unique() if book not in read_books]
  # predict a rating for every unread book, keeping only the estimate
  pred_ratings = {book: algo.predict(uid=user_id, iid=book).est for book in not_read_books}

  # sort according to highest ratings
  sorted_ratings = sorted(pred_ratings.items(), key=lambda item: item[1], reverse=True)

  # if there are more than N books with a rating of 10, take N of them at
  # random; this must be counted before the list is cut down to N entries
  books_10 = [item for item in sorted_ratings if item[1] == 10]

  if len(books_10) > N:
    # sample positions, since np.random.choice needs a 1-D input
    chosen = np.random.choice(len(books_10), N, replace=False)
    top_books = [books_10[i] for i in chosen]
  else:
    top_books = sorted_ratings[:N]

  book = [bookMeta(isbn)[0] for isbn, _ in top_books]
  author = [bookMeta(isbn)[1] for isbn, _ in top_books]
  return pd.DataFrame({'Book': book, 'Author': author})
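
Selecting the top N does not require sorting every prediction. As a sketch of an alternative, `heapq.nlargest` from the standard library keeps only the N best candidates. The helper `top_n` and the values in `toy_est` are hypothetical and stand in for the estimates returned by `algo.predict`.

```python
import heapq

def top_n(candidates, score, n):
    """Return the n candidates with the highest score, best first."""
    return heapq.nlargest(n, candidates, key=score)

# Hypothetical predicted ratings standing in for algo.predict(...).est
toy_est = {'A': 7.2, 'B': 9.1, 'C': 4.0, 'D': 8.5, 'E': 9.1}
already_read = {'B'}

# Only books the user has not read are candidates for recommendation.
unread = [isbn for isbn in toy_est if isbn not in already_read]
print(top_n(unread, toy_est.get, 3))
```

With several hundred thousand candidate books and a small N, this avoids building and sorting the full list of predictions.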

In [None]:
# print recommendation for a user
user=276798
recommend_books(user_id=user,algo=algo, N=10)

Unnamed: 0,Book,Author
0,Free,Paul Vincent
1,The Red Tent (Bestselling Backlist),Anita Diamant
2,Ender's Game (Ender Wiggins Saga (Paperback)),Orson Scott Card
3,Harry Potter and the Chamber of Secrets Postca...,J. K. Rowling
4,Falling Up,Shel Silverstein
5,1984,George Orwell
6,The Godfather,Mario Puzo
7,Harry Potter and the Goblet of Fire (Book 4),J. K. Rowling
8,The Pact: A Love Story,Jodi Picoult
9,Anne Frank: The Diary of a Young Girl,ANNE FRANK
