## Introduction
A book recommender system based on collaborative filtering, built with PySpark.
- Create spark session and load data into spark dataframe
- Feature engineering
    - Convert the string ISBN column to numeric indices
- Model
    - Alternating Least Squares (ALS) model for collaborative filtering from Spark MLlib
    - Fit model to train set
    - Predict on test set and evaluate root mean squared error (RMSE)
- Generate recommendations
    - Predict ratings on unrated books for each user, using fitted model
    - Recommend top-n books

In [1]:
# install pyspark
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853644 sha256=df4d044b783fec92a8ce4070542244043d21257c2df65e92d34df72d191565d7
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected pa

### Imports

In [2]:
# core
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import sklearn
import random, os
# spark & ML
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import col
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

In [3]:
# create spark session
spark = SparkSession.builder.appName('recsys').getOrCreate()

## Data and preprocessing

In [4]:
# load data into spark dataframe
ratings_df = spark.read.csv('../input/books-dataset/books_data/ratings.csv', sep=';',
                            inferSchema=True,header=True)
ratings_df.show()

+-------+----------+-----------+
|User-ID|      ISBN|Book-Rating|
+-------+----------+-----------+
| 276725|034545104X|          0|
| 276726|0155061224|          5|
| 276727|0446520802|          0|
| 276729|052165615X|          3|
| 276729|0521795028|          6|
| 276733|2080674722|          0|
| 276736|3257224281|          8|
| 276737|0600570967|          6|
| 276744|038550120X|          7|
| 276745| 342310538|         10|
| 276746|0425115801|          0|
| 276746|0449006522|          0|
| 276746|0553561618|          0|
| 276746|055356451X|          0|
| 276746|0786013990|          0|
| 276746|0786014512|          0|
| 276747|0060517794|          9|
| 276747|0451192001|          0|
| 276747|0609801279|          0|
| 276747|0671537458|          9|
+-------+----------+-----------+
only showing top 20 rows



In [5]:
# show schema
ratings_df.printSchema()

root
 |-- User-ID: integer (nullable = true)
 |-- ISBN: string (nullable = true)
 |-- Book-Rating: integer (nullable = true)



In [6]:
# load books data into spark dataframe
books_df = spark.read.csv('../input/books-dataset/books_data/books.csv', sep=';', inferSchema=True, header=True)
books_df = books_df.drop('Image-URL-S', 'Image-URL-M', 'Image-URL-L')
books_df.show()

+----------+--------------------+--------------------+-------------------+--------------------+
|      ISBN|          Book-Title|         Book-Author|Year-Of-Publication|           Publisher|
+----------+--------------------+--------------------+-------------------+--------------------+
|0195153448| Classical Mythology|  Mark P. O. Morford|               2002|Oxford University...|
|0002005018|        Clara Callan|Richard Bruce Wright|               2001|HarperFlamingo Ca...|
|0060973129|Decision in Normandy|        Carlo D'Este|               1991|     HarperPerennial|
|0374157065|Flu: The Story of...|    Gina Bari Kolata|               1999|Farrar Straus Giroux|
|0393045218|The Mummies of Ur...|     E. J. W. Barber|               1999|W. W. Norton &amp...|
|0399135782|The Kitchen God's...|             Amy Tan|               1991|    Putnam Pub Group|
|0425176428|What If?: The Wor...|       Robert Cowley|               2000|Berkley Publishin...|
|0671870432|     PLEADING GUILTY|       

In [7]:
# index ISBN strings as numeric IDs for ALS (StringIndexer outputs doubles, e.g. 1637.0)
stringToInt = StringIndexer(inputCol='ISBN', outputCol='ISBN_int').fit(ratings_df)
ratings_df = stringToInt.transform(ratings_df)
ratings_df.show()

+-------+----------+-----------+--------+
|User-ID|      ISBN|Book-Rating|ISBN_int|
+-------+----------+-----------+--------+
| 276725|034545104X|          0|  1637.0|
| 276726|0155061224|          5| 89067.0|
| 276727|0446520802|          0|   568.0|
| 276729|052165615X|          3|205984.0|
| 276729|0521795028|          6|206014.0|
| 276733|2080674722|          0| 80774.0|
| 276736|3257224281|          8| 43132.0|
| 276737|0600570967|          6|216574.0|
| 276744|038550120X|          7|   232.0|
| 276745| 342310538|         10|135627.0|
| 276746|0425115801|          0|   445.0|
| 276746|0449006522|          0|   606.0|
| 276746|0553561618|          0|   424.0|
| 276746|055356451X|          0|   286.0|
| 276746|0786013990|          0| 27579.0|
| 276746|0786014512|          0| 15790.0|
| 276747|0060517794|          9|  1413.0|
| 276747|0451192001|          0|   937.0|
| 276747|0609801279|          0|  6511.0|
| 276747|0671537458|          9|   914.0|
+-------+----------+-----------+--
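
With its default `stringOrderType='frequencyDesc'`, `StringIndexer` gives index 0 to the most frequent label and writes the result as a double, which is why `ISBN_int` holds values such as `1637.0`. The pure-Python sketch below imitates that ordering on a toy list. The helper `index_labels` and the toy data are illustrative only and are not part of Spark; the alphabetical tie-break is assumed to match Spark 3.x behavior.

```python
from collections import Counter

def index_labels(labels):
    """Map each distinct label to a float index, most frequent first.

    Hypothetical helper imitating StringIndexer's default
    'frequencyDesc' ordering; ties are broken alphabetically here.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# toy data: three distinct ISBNs with different frequencies
toy_isbns = ["0446520802", "034545104X", "0446520802",
             "0155061224", "034545104X", "0446520802"]
mapping = index_labels(toy_isbns)
indexed = [mapping[isbn] for isbn in toy_isbns]
print(mapping)
```

Because the indices depend only on label frequencies in the data the indexer was fitted on, ISBNs that were never seen during fitting have no index.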

In [8]:
# split data into training and test datasets (randomSplit also accepts a seed for a reproducible split)
train_df, test_df = ratings_df.randomSplit([0.8,0.2])

## Model

In [9]:
# ALS model
rec_model = ALS(maxIter=10, regParam=0.01, userCol='User-ID', itemCol='ISBN_int',
                ratingCol='Book-Rating', nonnegative=True, coldStartStrategy="drop")

rec_model = rec_model.fit(train_df)
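
To show what ALS does during `fit`, here is a minimal NumPy sketch of the alternating idea: hold the item factors fixed and solve a small ridge regression per user, then swap roles. It is a sketch under assumptions, not Spark's implementation. The function `als_sketch`, the toy matrix, and the use of 0 to mark a missing rating are all hypothetical, and the sketch leaves out the `nonnegative=True` constraint and any regularization scaling that Spark may apply.

```python
import numpy as np

def als_sketch(R, mask, rank=2, reg=0.01, iters=10, seed=0):
    """Approximate R by U @ V.T using only entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    eye = np.eye(rank)
    for _ in range(iters):
        # fix V, solve a regularized least-squares problem for each user
        for u in range(n_users):
            seen = mask[u]
            if seen.any():
                Vs = V[seen]
                U[u] = np.linalg.solve(Vs.T @ Vs + reg * eye, Vs.T @ R[u, seen])
        # fix U, solve a regularized least-squares problem for each item
        for i in range(n_items):
            seen = mask[:, i]
            if seen.any():
                Us = U[seen]
                V[i] = np.linalg.solve(Us.T @ Us + reg * eye, Us.T @ R[seen, i])
    return U, V

# toy user x item matrix; 0 marks a missing rating in this example only
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0
U, V = als_sketch(R, mask)
pred = U @ V.T
train_rmse = np.sqrt(np.mean((pred[mask] - R[mask]) ** 2))
print(train_rmse)
```

Each inner solve involves only a `rank x rank` system, which is what makes the method easy to parallelize over users and items.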

In [10]:
# make predictions on the test set
predicted_ratings = rec_model.transform(test_df)

## Evaluation

In [11]:
# calculate RMSE
evaluator = RegressionEvaluator(metricName='rmse', predictionCol='prediction', labelCol='Book-Rating')
rmse = evaluator.evaluate(predicted_ratings)
rmse

4.742314506977701
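
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings, so it is expressed in rating points. The sample rows above show ratings from 0 to 10. Because the model uses `coldStartStrategy="drop"`, test rows whose user or book was unseen in training are excluded from the metric. The sketch below computes RMSE by hand on made-up numbers; the function `rmse` and the values are illustrative only.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# made-up ratings and predictions, for illustration only
actual = [5, 0, 8, 10]
predicted = [4.0, 1.0, 6.0, 7.0]
score = rmse(actual, predicted)  # squared errors are 1, 1, 4, 9
print(score)
```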

## Recommendation

In [12]:
# function to recommend top-n books for a user using the trained model
def recommend_for_user(user_id, n):
    # rows for this user; a rating of 0 is treated as "not yet rated"
    ratings_user = ratings_df.filter(col('User-ID') == user_id)
    unrated_user = ratings_user.filter(col('Book-Rating') == 0)
    # predict ratings for those books, then attach book metadata
    pred_ratings_user = rec_model.transform(unrated_user)
    recs_user = books_df.join(pred_ratings_user.select('ISBN', 'prediction'), on='ISBN')
    # keep the n books with the highest predicted rating
    recs_user = recs_user.sort('prediction', ascending=False).limit(n).drop('prediction')
    return recs_user

In [13]:
recs_user = recommend_for_user(31987, 5)
recs_user.show()

+----------+--------------------+----------------+-------------------+----------------+
|      ISBN|          Book-Title|     Book-Author|Year-Of-Publication|       Publisher|
+----------+--------------------+----------------+-------------------+----------------+
|0671873210|Tarnished Gold (L...|    V.C. Andrews|               1996|          Pocket|
|0804117942|Spontaneous Heali...|Andrew Weil M.D.|               2000|Ballantine Books|
|0671759345|       Ruby (Landry)|    V.C. Andrews|               1994|          Pocket|
|0671670689|       Dawn (Cutler)|    V.C. Andrews|               1990|          Pocket|
|0060915153|Why Do Clocks Run...|   David Feldman|               1988|       Perennial|
+----------+--------------------+----------------+-------------------+----------------+
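
Note that `recommend_for_user` only scores books that already appear in `ratings_df` for that user with a rating of 0. A more general approach scores every book and masks the ones the user has already rated. The NumPy sketch below illustrates that idea on toy latent factors; `top_n_unrated` and all values are hypothetical. In Spark, the fitted `ALSModel` also offers `recommendForAllUsers` and `recommendForUserSubset`, which score all items but, as far as I know, do not filter out books the user has already rated.

```python
import numpy as np

def top_n_unrated(user_vec, item_factors, rated_items, n):
    """Return indices of the n highest-scoring items the user has not rated."""
    scores = (item_factors @ user_vec).astype(float)
    rated = np.array(sorted(set(rated_items)), dtype=int)
    scores[rated] = -np.inf           # mask items that were already rated
    order = np.argsort(-scores)       # highest predicted score first
    n_available = len(scores) - len(rated)
    return order[:min(n, n_available)].tolist()

# toy latent factors: 5 items and one user, rank 2
item_factors = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [0.5, 0.5],
                         [0.9, 0.1],
                         [0.2, 0.8]])
user_vec = np.array([2.0, 1.0])
recs = top_n_unrated(user_vec, item_factors, rated_items={0}, n=2)
print(recs)
```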

