## Book Recommendation Engine

### Problem statement

Many online businesses rely on customer reviews and ratings. Explicit feedback is especially important in major B2B sectors. Consumer product sectors such as CPG, telecom, and entertainment depend on consumer ratings, and use this feedback together with historical transactions to build recommendations that are personalized and relevant to each user.

### Data Collection

Book ratings from individual Goodreads users.

The data is collected from Goodreads and is at the user-book level. It contains 50k+ user ratings for 10k+ books.

The feedback is explicit: users rate the books directly.

## Flow of problem solving

    1) EDA and understanding the data
    2) Traditional method and its shortcomings
    3) Scaling it up with PySpark
    4) Improving the model with deep learning
    5) Further improvements and underlying biases

### EDA

In [None]:
%matplotlib inline

import pandas as pd

r = pd.read_csv( 'data/ratings.csv' )
b = pd.read_csv( 'data/books.csv' )

In [None]:
r.rating.hist( bins = 5 )

In [None]:
# Number of ratings each book received
reviews_per_book = r.groupby( 'book_id' ).size()
reviews_per_book.describe()

In [None]:
reviews_per_book.sort_values().head( 10 )

### Traditional method

We do not use the user-based similarity method, because it takes too much time when there are many users.

Instead, we try the item-based similarity method.

Here we build the item-item similarity matrix. We first build a list of dictionaries, one per book. In each dictionary the key is a user_id and the value is the rating that user gave the book.

In [None]:
# Build one {user_id: rating} dictionary per book, and keep maps between
# the row position of a book in the list and its book_id.
listOfDictonaries = []
indexMap = {}          # row position -> book_id
reverseIndexMap = {}   # book_id -> row position
for ptr, (groupKey, groupDF) in enumerate(r.groupby('book_id')):
    listOfDictonaries.append(dict(zip(groupDF['user_id'], groupDF['rating'])))
    indexMap[ptr] = groupKey
    reverseIndexMap[groupKey] = ptr

We then use sklearn's DictVectorizer to turn each dictionary into a sparse vector. This creates a vector space with one dimension per user, in which each book is a point. A book's coordinate along a user's dimension is the rating that user gave it (0 if the user did not rate it). We then calculate the similarity between books in this vector space.
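
To make the vector-space picture concrete, here is a minimal pure-Python sketch of the same idea on made-up data. It is not how `DictVectorizer` is implemented; the ratings and the helper name `dicts_to_matrix` are illustrative assumptions.

```python
# Toy data (made up): one {user_id: rating} dictionary per book.
book_dicts = [
    {"u1": 5, "u2": 3},   # book 0
    {"u2": 4, "u3": 1},   # book 1
    {"u1": 4, "u3": 2},   # book 2
]

def dicts_to_matrix(dicts):
    """Return (sorted user ids, dense rows), using 0 for missing ratings."""
    users = sorted({user for d in dicts for user in d})
    rows = [[d.get(user, 0) for user in users] for d in dicts]
    return users, rows

users, matrix = dicts_to_matrix(book_dicts)
print(users)        # ['u1', 'u2', 'u3']
for row in matrix:  # one row per book, one column per user
    print(row)
```

Each row is a book and each column is a user, which is the layout the similarity computation works on.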

In [None]:
from sklearn.feature_extraction import DictVectorizer
dictVectorizer = DictVectorizer(sparse=True)
vector = dictVectorizer.fit_transform(listOfDictonaries)

Finally, we use sklearn's cosine_similarity function to calculate the pairwise similarity matrix.
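
As a reminder of what is being computed, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below works this out by hand for made-up book vectors; the helper `cosine` and the data are illustrative assumptions, not sklearn code.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)

# Made-up vectors: each position holds one user's rating (0 = not rated).
book_a = [3, 4, 0]
book_b = [4, 3, 0]   # rated by the same users as book_a
book_c = [0, 0, 5]   # rated by a different user only

print(round(cosine(book_a, book_b), 2))  # 0.96
print(round(cosine(book_a, book_c), 2))  # 0.0
```

Books rated by the same users in a similar way score close to 1, while books with no raters in common score 0.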

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
pairwiseSimilarity = cosine_similarity(vector)

In [None]:
import numpy as np

def printBookDetails(bookID):
    # Look up the book's metadata in the books DataFrame `b`
    print("Title:", b[b['id']==bookID]['original_title'].values[0])
    print("Author:",b[b['id']==bookID]['authors'].values[0])
    print("Printing Book-ID:",bookID)
    print("=================++++++++++++++=========================")


def getTopRecommandations(bookID, n=5):
    row = reverseIndexMap[bookID]
    print("------INPUT BOOK--------")
    printBookDetails(bookID)
    print("-------RECOMMENDATIONS----------")
    # Rank all books by similarity to the input book, most similar first,
    # then skip the input book itself and keep the top n
    ranked = np.argsort(pairwiseSimilarity[row])[::-1]
    for i in [j for j in ranked if j != row][:n]:
        printBookDetails(indexMap[i])

In [None]:
getTopRecommandations(1245)

### Improvements

There are better distance/similarity measures for this kind of problem. We could also try k-nearest neighbours (KNN).
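
One commonly used alternative is adjusted cosine similarity, which subtracts each user's mean rating before comparing two books, so that generous and harsh raters are put on the same scale. The following is a minimal sketch on made-up ratings; the data layout and the function name `adjusted_cosine` are illustrative assumptions.

```python
import math

# Made-up ratings: ratings[user][book] = rating
ratings = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 3, "B": 2, "C": 1},
    "u3": {"A": 4, "B": 5, "C": 2},
}

def adjusted_cosine(ratings, book_x, book_y):
    """Cosine similarity over users who rated both books, computed
    after subtracting each user's own mean rating."""
    num = den_x = den_y = 0.0
    for user_ratings in ratings.values():
        if book_x in user_ratings and book_y in user_ratings:
            mean = sum(user_ratings.values()) / len(user_ratings)
            dx = user_ratings[book_x] - mean
            dy = user_ratings[book_y] - mean
            num += dx * dy
            den_x += dx * dx
            den_y += dy * dy
    if den_x == 0 or den_y == 0:
        return 0.0
    return num / math.sqrt(den_x * den_y)

print(adjusted_cosine(ratings, "A", "B"))  # positive: rated above average together
print(adjusted_cosine(ratings, "A", "C"))  # negative: C is rated below average
```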

Traditional recommenders are still widely used and give decent results. We could also include content (e.g. the book description) or other information to build a hybrid model.

This approach also suffers from the cold-start problem: for a user with no history, we can fall back to recommending the most popular items.

Such a hybrid recommender would combine the rankings from the collaborative and content-based recommenders.
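
The popularity fallback for cold-start users can be sketched as follows. This is an illustration on made-up data, and `personalised` is a hypothetical stand-in for the item-based recommender, not a function defined in this notebook.

```python
from collections import Counter

# Made-up (user_id, book_id, rating) triples.
ratings = [
    (1, "A", 5), (1, "B", 4),
    (2, "A", 4), (2, "C", 3),
    (3, "A", 5), (3, "B", 2),
]

def most_popular(ratings, n=2):
    """Book ids ordered by number of ratings, most-rated first."""
    counts = Counter(book for _, book, _ in ratings)
    return [book for book, _ in counts.most_common(n)]

def personalised(user_id):
    """Hypothetical placeholder for the item-based recommender."""
    return ["C", "B"]

def recommend(user_id, ratings, n=2):
    """Use the personalised recommender when the user has a history,
    otherwise fall back to the most popular books."""
    has_history = any(user == user_id for user, _, _ in ratings)
    if has_history:
        return personalised(user_id)[:n]
    return most_popular(ratings, n)

print(recommend(1, ratings))   # ['C', 'B']  (known user)
print(recommend(99, ratings))  # ['A', 'B']  (new user, popularity fallback)
```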

## Using PySpark to scale to big data

The recommendation system can also run on PySpark, the Python API for Apache Spark, which is a popular framework for big data analysis.

We use the Alternating Least Squares (ALS) algorithm, constrained to non-negative factors, to factorize the user-book rating matrix.

The product of the two factor matrices approximates the original matrix, which lets us predict the blank cells (books the user has not read yet).
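
To illustrate the idea of alternating between the two factor matrices, here is a small NumPy sketch on a made-up matrix. It is not Spark's implementation: it uses plain regularized least squares without the non-negativity constraint, and the function name `als_sketch` and all parameter values are assumptions chosen for the example.

```python
import numpy as np

def als_sketch(R, mask, rank=2, reg=0.1, n_iter=50, seed=0):
    """Factorize R ~ U @ V.T using only the observed cells (mask == True)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)
    for _ in range(n_iter):
        # Fix V and solve a small ridge regression for each user's factors.
        for u in range(n_users):
            Vu = V[mask[u]]                    # factors of books rated by user u
            U[u] = np.linalg.solve(Vu.T @ Vu + ridge, Vu.T @ R[u, mask[u]])
        # Fix U and solve for each book's factors.
        for i in range(n_items):
            Ui = U[mask[:, i]]                 # factors of users who rated book i
            V[i] = np.linalg.solve(Ui.T @ Ui + ridge, Ui.T @ R[mask[:, i], i])
    return U, V

# Made-up 4 users x 3 books matrix; 0 marks "not rated".
R = np.array([[5, 4, 0],
              [4, 0, 1],
              [1, 1, 5],
              [0, 2, 4]], dtype=float)
mask = R > 0

U, V = als_sketch(R, mask)
pred = U @ V.T                                 # filled-in matrix, blanks included
rmse = np.sqrt(np.mean((pred[mask] - R[mask]) ** 2))
print(np.round(pred, 2))
print("RMSE on observed cells:", round(float(rmse), 3))
```

The entries of `pred` at positions where `R` is 0 are the predicted ratings for books the user has not rated.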

In [None]:
! pip install pyspark

In [None]:
# PySpark
from pyspark import SparkContext
from pyspark.sql import SparkSession

# Exploratory Data Analysis (EDA)
from pyspark.sql.functions import col, min, max, avg, lit

# Machine Learning (ML)
from pyspark.ml.recommendation import ALS # Alternating Least Squares model
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator # Cross-Validation
from pyspark.ml.evaluation import RegressionEvaluator # Performance metric

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_colwidth', 400)
from matplotlib import rcParams
sns.set(context='notebook', style='whitegrid', rc={'figure.figsize': (18,4)})
rcParams['figure.figsize'] = 18,4

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Setting random seed for reproducibility
SEED = 42
np.random.seed(SEED)


In [None]:
sc = SparkContext(appName = "Book-Recommendation")
print(sc)

In [None]:
spark = SparkSession.builder.getOrCreate()
print(spark)

In [None]:
# Read csv into Spark DataFrame
ratings = spark.read.csv('data/ratings.csv',
                         header = True,
                         inferSchema=True)
print(type(ratings))

In [None]:
to_read = spark.read.csv('data/to_read.csv',
                         header = True,
                         inferSchema=True)
print(type(to_read))

To keep the run time manageable, we work on a small sample of the ratings.

In [None]:
ratings = ratings.sample(withReplacement = False, 
                         fraction = 0.01, # 1% of observation
                         seed = 2019)
print(ratings.count())

In [None]:
# Convert the columns to the proper data types
ratings = ratings.select(ratings.user_id,
                         ratings.book_id,
                         ratings.rating.cast("double"))

In [None]:
# Create a generic ALS model with default hyperparameters
als = ALS(userCol="user_id", itemCol="book_id", ratingCol="rating", 
          nonnegative = True, # Constrain the factors to be non-negative
          coldStartStrategy = "drop", # Drop predictions for users/books not seen in training (they would be NaN)
          implicitPrefs = False) # Ratings are explicit feedback

In [None]:
# Create test and train set
(train, test) = ratings.randomSplit([0.8, 0.2], 
                                    seed = 1234)
print(type(train))

In [None]:
# Add hyperparameters and their respective values to param_grid
param_grid = ParamGridBuilder() \
            .addGrid(als.rank, [10, 50, 100]) \
            .addGrid(als.maxIter, [5, 50, 100]) \
            .addGrid(als.regParam, [.01, .05, .1]) \
            .build()

In [None]:
# Define evaluator as RMSE
evaluator = RegressionEvaluator(metricName = "rmse", 
                                labelCol = "rating", 
                                predictionCol = "prediction")
# Print the number of hyperparameter combinations in the grid
print("Num models to be tested: ", len(param_grid))

In [None]:
# Build cross validation using CrossValidator
cv = CrossValidator(estimator = als, 
                    estimatorParamMaps = param_grid, 
                    evaluator = evaluator, 
                    numFolds = 5)
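
To get a feel for the cost of this grid search, the sketch below counts the model fits it implies, using the same candidate values as `param_grid`. It is plain Python for illustration only and does not call Spark; the variable names are assumptions.

```python
from itertools import product

# Same candidate values as in param_grid.
ranks = [10, 50, 100]
max_iters = [5, 50, 100]
reg_params = [0.01, 0.05, 0.1]
num_folds = 5

combinations = list(product(ranks, max_iters, reg_params))
fits_in_cv = len(combinations) * num_folds

print(len(combinations))  # 27 hyperparameter combinations
print(fits_in_cv)         # 135 model fits across the 5 folds
```

Calling `cv.fit(train)` would therefore train 135 ALS models, plus one final refit of the best combination, which is worth keeping in mind before running it on the full data.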

In [None]:
# Fit the generic (untuned) model to the 'train' dataset; the cross-validator `cv` is not fitted here
als_mod = als.fit(train)

In [None]:
test_pred = als_mod.transform(test)

In [None]:
# Calculate and print the RMSE of test_predictions
print(evaluator.evaluate(test_pred))
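
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. A minimal hand computation on made-up numbers (the helper `rmse` and the data are illustrative assumptions):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [5, 3, 4, 1]      # made-up true ratings
predicted = [5, 3, 2, 3]   # made-up model predictions

print(round(rmse(actual, predicted), 3))  # 1.414
```

An RMSE of about 1.4 means predictions are off by roughly 1.4 stars on average, with large errors weighted more heavily than small ones.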

Possible improvements:

Use the learned latent factors to uncover unobserved features that reflect user preferences.

Add more information about the products.

Next, we will try deep learning to improve the RMSE score.