# Capstone Project: Books recommender system

## Problem Statement

Readers often want a recommendation when picking their next book. With so many titles available, choosing one takes considerable time and effort, whether through social media, online browsing or referrals from others [[1]](https://www.nlb.gov.sg/Portals/0/Docs/AboutUs/2018%20NATIONAL%20READING%20HABITS%20STUDY%20ON%20ADULTS%20-%20REPORT.pdf). Libraries currently lack a book recommendation system. As data scientists at the national library, we aim to save readers time in deciding what to read by developing a book recommendation system that matches their preferences. The team will explore popularity-based, content-based and collaborative filtering-based systems for predicting the ratings that readers would give to books. Each model will be evaluated using the root mean square error (RMSE) between the actual ratings and the predicted ratings. The closer the predicted rating is to the actual rating, the better the books selected for recommendation, which will enhance the reader's experience.
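
The evaluation metric named above is the square root of the mean of the squared differences between actual and predicted ratings. The following is a minimal sketch of that calculation. The function name `rmse` and the sample ratings are hypothetical and are not taken from the project's data; the project itself may compute the metric with a library instead.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length rating sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one rating is required")
    # Square each prediction error, average the squares, then take the root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings on a 1-5 scale, for illustration only.
actual_ratings = [4, 3, 5, 2]
predicted_ratings = [3, 3, 4, 4]
print(rmse(actual_ratings, predicted_ratings))
```

A value of 0 would mean every predicted rating matched the actual rating exactly; larger values mean the predictions are further off on average.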

### Overall Contents:
- [Background](#1.-Background) **(In this notebook)**
- [Data Collection](#2.-Data-Collection) **(In this notebook)**
- Data Cleaning Booklist
- Data Cleaning Book Interactions
- Exploratory Data Analysis
- Modeling 1 Popularity-based system
- Modeling 2 Content-based system
- Modeling 3 Collaborative-based system
- Evaluation
- Conclusion and Recommendation

## 1. Background

### 1.1 Datasets

Goodreads is one of the world's largest sites for readers and book recommendations, so we use the Goodreads datasets from the [University of California San Diego Book Graph](https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home?authuser=0) to develop the book recommendation system [1].
We assume that Goodreads readers are a random sample from around the world, so the data is representative of readers in national libraries.
The datasets contain book metadata and user-book interactions.

The datasets obtained are as follows:

Metadata of books:
* goodreads_books 
* goodreads_book_authors
* goodreads_book_series
* goodreads_book_genres_initial

User-book interactions:
* goodreads_interactions
* book_id_map

For more details, please refer to the data_dictionary.ipynb.

## 2. Data Collection

### 2.1 Libraries Import

In [1]:
#pip install pandas pyarrow

In [2]:
import json
import gzip
import os
import sys
import re
import numpy as np
import pandas as pd
from IPython.display import clear_output

### 2.2 Data Collection

**Overview**

The `goodreads_books.json.gz` file is gzip-compressed and about 2 GB in size. It could not be loaded directly, so it is read in batches and each batch is saved as a separate subfile.
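
The batching idea described above can be sketched with the standard library alone. This is a minimal illustration and not the notebook's own implementation: the helper name `iter_json_batches` and the `batch_size` parameter are assumptions, and the sketch reads the archive in a single pass instead of reopening it for each batch.

```python
import gzip
import json
from itertools import islice

def iter_json_batches(filepath, batch_size):
    """Yield lists of parsed records from a gzipped JSON-lines file.

    Hypothetical helper: each yielded list holds at most `batch_size` records.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Text mode ("rt") decodes each line as it is read. The file is streamed,
    # so only one batch is held in memory at a time.
    with gzip.open(filepath, "rt", encoding="utf-8") as file:
        while True:
            # islice consumes the next batch_size lines and resumes where
            # the previous batch stopped.
            batch = [json.loads(line) for line in islice(file, batch_size)]
            if not batch:
                break
            yield batch
```

Each yielded batch could then be converted with `pd.DataFrame(batch)` and saved before the next batch is read.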

### 2.2.1 Define function for data collection from json.gz (large files)

In [3]:
# book_total_line = sum(1 for line in gzip.open('./datasets/goodreads_books.json.gz'))
# print(f"Total line in goodreads book json is : {book_total_line}")

In [4]:
# ((book_total_line/5)%5)

In [5]:
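def data_collection_json(filepath, min_count, max_count):
    """Return the parsed records on lines min_count+1 to max_count of a gzipped JSON-lines file."""
    datalist = []

    # Stream the archive line by line so the whole file never sits in memory.
    with gzip.open(filepath) as file:
        for count, line in enumerate(file, start=1):
            if count > max_count:
                # Past the requested batch: stop reading.
                break

            if count > min_count:
                # Parse only the lines that belong to this batch.
                datalist.append(json.loads(line))
                clear_output(wait=True)
                print(f'progress: {count}/{max_count}')

    return datalist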

### 2.2.2 Collection from goodreads_books.json.gz

In [6]:
#booklist_first = data_collection_json('./datasets/goodreads_books.json.gz', 0, (book_total_line/5))
#booklist_second = data_collection_json('./datasets/goodreads_books.json.gz', (book_total_line/5), ((book_total_line/5)*2))
#booklist_third = data_collection_json('./datasets/goodreads_books.json.gz', ((book_total_line/5)*2), ((book_total_line/5)*3))
#booklist_fourth = data_collection_json('./datasets/goodreads_books.json.gz', ((book_total_line/5)*3), ((book_total_line/5)*4))
#booklist_fifth = data_collection_json('./datasets/goodreads_books.json.gz', ((book_total_line/5)*4), (((book_total_line/5)*5)+100))
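
The calls above divide `book_total_line` by 5 with `/`, which yields floats, and pad the last batch by 100 lines. One possible alternative, sketched here on the assumption that near-equal integer batches are wanted, computes the boundaries with integer arithmetic. The function name `batch_bounds` is hypothetical, and the line count in the usage line is a made-up figure, not the size of the real file.

```python
def batch_bounds(total_lines, n_batches):
    """Return (min_count, max_count) pairs that cover lines 1..total_lines.

    Each pair is meant for a call such as
    data_collection_json(filepath, min_count, max_count), which keeps the
    lines with min_count < line number <= max_count.
    """
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    size, remainder = divmod(total_lines, n_batches)
    bounds = []
    start = 0
    for i in range(n_batches):
        # Give one extra line to the first `remainder` batches so that
        # batch sizes differ by at most 1 and no line is left out.
        end = start + size + (1 if i < remainder else 0)
        bounds.append((start, end))
        start = end
    return bounds

# Made-up line count, for illustration only.
print(batch_bounds(12, 5))
```

Because the last pair always ends at `total_lines`, no padding is needed to reach the final record.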

In [7]:
#booklist_first = pd.DataFrame(booklist_first)
#booklist_second = pd.DataFrame(booklist_second)
#booklist_third = pd.DataFrame(booklist_third)
#booklist_fourth = pd.DataFrame(booklist_fourth)
#booklist_fifth = pd.DataFrame(booklist_fifth)

### 2.3 Data exportation into other formats

Parquet is a columnar file format that stores large datasets in compact files. To simplify analysis across the different datasets, each one is exported to Parquet.

In [8]:
booklist_authors = pd.read_json("./datasets/goodreads_book_authors.json.gz", lines=True)
booklist_series = pd.read_json("./datasets/goodreads_book_series.json.gz", lines=True)
booklist_works = pd.read_json("./datasets/goodreads_book_works.json.gz", lines=True)
booklist_genres = pd.read_json("./datasets/goodreads_book_genres_initial.json.gz", lines=True)
booklist_interactions = pd.read_csv("./datasets/goodreads_interactions.csv")
book_id_map = pd.read_csv("./datasets/book_id_map.csv")
user_id_map = pd.read_csv("./datasets/user_id_map.csv")

### 2.4 Summary

* The goodreads_books.json.gz file has been successfully collected and separated into five subfiles. These subfiles will be exported as Parquet files.
* The other datasets have been imported and will be exported as Parquet files to be cleaned in the next section.

## Exporting Data

**From goodreads_books.json.gz**

In [9]:
# Commented out to avoid re-running the export
# booklist_first.to_csv("./data/booklist_first.csv", index=False)
# booklist_second.to_csv("./data/booklist_second.csv", index=False)
# booklist_third.to_csv("./data/booklist_third.csv", index=False)
# booklist_fourth.to_csv("./data/booklist_fourth.csv", index=False)
# booklist_fifth.to_csv("./data/booklist_fifth.csv", index=False)

In [10]:
#booklist_first.to_parquet('./data/booklist_first.parquet', compression='gzip')
#booklist_second.to_parquet('./data/booklist_second.parquet', compression='gzip')
#booklist_third.to_parquet('./data/booklist_third.parquet', compression='gzip')
#booklist_fourth.to_parquet('./data/booklist_fourth.parquet', compression='gzip')
#booklist_fifth.to_parquet('./data/booklist_fifth.parquet', compression='gzip')

**From the other datasets**

In [41]:
booklist_authors.to_parquet('./data/booklist_authors.parquet', compression='gzip')
booklist_series.to_parquet('./data/booklist_series.parquet', compression='gzip')
booklist_works.to_parquet('./data/booklist_works.parquet', compression='gzip')
booklist_genres.to_parquet('./data/booklist_genres.parquet', compression='gzip')
booklist_interactions.to_parquet('./data/booklist_interactions.parquet', compression='gzip')
book_id_map.to_parquet('./data/book_id_map.parquet', compression='gzip')
user_id_map.to_parquet('./data/user_id_map.parquet', compression='gzip')

## References

[1] M. Wan and J. McAuley, "Item Recommendation on Monotonic Behavior Chains," *Proceedings of the 12th ACM Conference on Recommender Systems, RecSys 2018*, September 2018, pp. 86-94. doi: 10.1145/3240323.3240369 [Online]. Available: https://dl.acm.org/doi/10.1145/3240323.3240369 [Accessed: May 10, 2021].