# Capstone Project: Books recommender system

### Overall Contents:
- Background
- Data Collection
- [Data Cleaning Booklist](#3.-Data-Cleaning-Booklist) **(In this notebook)**
- Data Cleaning Book Interactions
- Exploratory Data Analysis
- Modeling 1 Popularity-based system
- Modeling 2 Content-based system
- Modeling 3 Collaborative-based system
- Evaluation
- Conclusion and Recommendation

### Datasets

The datasets are obtained from the [University of California San Diego Book Graph](https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home?authuser=0).

The datasets contain meta-data of books and user-book interactions.

Meta-data of books:-
* goodreads_books 
* goodreads_book_authors
* goodreads_book_series
* goodreads_book_genres_initial

User-book interactions:-
* goodreads_interactions
* book_id_map

For more details on the datasets, please refer to the data_dictionary.ipynb.

## 3. Data Cleaning Booklist

### 3.1 Libraries Import

In [1]:
import numpy as np
import pandas as pd
import re
from IPython.display import clear_output

# Raise the display limits for column width and number of rows
pd.options.display.max_colwidth = 2000
pd.options.display.max_rows = 2000

### 3.2 Data Import

In [2]:
booklist_authors = pd.read_parquet("./data/booklist_authors.parquet")
booklist_first = pd.read_parquet("./data/booklist_first.parquet")
booklist_second = pd.read_parquet("./data/booklist_second.parquet")
booklist_third = pd.read_parquet("./data/booklist_third.parquet")
booklist_fourth = pd.read_parquet("./data/booklist_fourth.parquet")
booklist_fifth = pd.read_parquet("./data/booklist_fifth.parquet")
booklist_works = pd.read_parquet("./data/booklist_works.parquet")

In [3]:
# Compilation of booklist_first to booklist_fifth
booklist_compiled = pd.concat([booklist_first, booklist_second, booklist_third, booklist_fourth, booklist_fifth], axis = 0).reset_index(drop=True)

In [4]:
print(f"This booklist_authors has a shape of {booklist_authors.shape}")
print(f"This booklist_compiled has a shape of {booklist_compiled.shape}")
print(f"This booklist_works has a shape of {booklist_works.shape}")

This booklist_authors has a shape of (829529, 5)
This booklist_compiled has a shape of (2360655, 29)
This booklist_works has a shape of (1521962, 16)


### 3.3 booklist_authors

In [5]:
booklist_authors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 829529 entries, 0 to 829528
Data columns (total 5 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   average_rating      829529 non-null  float64
 1   author_id           829529 non-null  int64  
 2   text_reviews_count  829529 non-null  int64  
 3   name                829529 non-null  object 
 4   ratings_count       829529 non-null  int64  
dtypes: float64(1), int64(3), object(1)
memory usage: 31.6+ MB


In [6]:
booklist_authors = booklist_authors.rename({"name":"author_name"}, axis = 1)
booklist_authors.head()

Unnamed: 0,average_rating,author_id,text_reviews_count,author_name,ratings_count
0,3.98,604031,7,Ronald J. Fields,49
1,4.08,626222,28716,Anita Diamant,546796
2,3.92,10333,5075,Barbara Hambly,122118
3,3.68,9212,36262,Jennifer Weiner,888522
4,3.82,149918,96,Nigel Pennick,1740


In [7]:
booklist_authors[booklist_authors.author_name == ""]

Unnamed: 0,average_rating,author_id,text_reviews_count,author_name,ratings_count
89637,4.03,6555766,353,,1277
150514,4.0,6421865,70,,388
176864,3.03,16507023,59,,330
800027,4.67,6925419,2,,3
800029,4.67,6925420,2,,3


In [8]:
booklist_authors[booklist_authors.author_name == "Robert   Innes"]

Unnamed: 0,average_rating,author_id,text_reviews_count,author_name,ratings_count
107129,3.94,16062024,156,Robert Innes,930


In [9]:
booklist_authors[booklist_authors.author_name == "Robert Innes"]

Unnamed: 0,average_rating,author_id,text_reviews_count,author_name,ratings_count
765574,4.22,1015284,4,Robert Innes,36


In [10]:
reg_exp = r'\s{2,}'
symbols = booklist_authors['author_name'].apply(lambda x:re.findall(reg_exp,x))
[(symbols[i], i) for i in range(len(symbols)) if symbols[i]!=[]][0:5]

[(['  '], 36), (['  '], 37), (['  '], 39), (['  '], 121), (['  '], 124)]

In [11]:
reg_exp = r'\s{2,}'
booklist_authors['author_name'] = booklist_authors['author_name'].apply(lambda x:re.sub(reg_exp," ",x))
symbols_check = booklist_authors['author_name'].apply(lambda x:re.findall(reg_exp,x))
[(symbols_check[i], i) for i in range(len(symbols_check)) if symbols_check[i]!=[]]

[]
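
The whitespace clean-up above can also be written with pandas' vectorised string methods. The following is a minimal sketch on made-up author names, not the code used in this notebook; `names`, `cleaned` and `remaining` are illustrative variables only.

```python
import pandas as pd

# Toy author names (hypothetical data) with irregular internal spacing
names = pd.Series(["Robert   Innes", "Anita Diamant", "J.K.  Rowling"])

# Collapse every run of two or more whitespace characters into one space
cleaned = names.str.replace(r"\s{2,}", " ", regex=True)

# Count the names that still contain a run of whitespace
remaining = cleaned.str.contains(r"\s{2,}", regex=True).sum()

print(cleaned.tolist(), remaining)
```

Note that after the clean-up, two different author_id values can share the same author_name (as with "Robert Innes" above), so author_id remains the safer key for merging.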

In [12]:
authors_list = booklist_authors[["author_id", "author_name"]]
authors_list.head()

Unnamed: 0,author_id,author_name
0,604031,Ronald J. Fields
1,626222,Anita Diamant
2,10333,Barbara Hambly
3,9212,Jennifer Weiner
4,149918,Nigel Pennick


### 3.4 booklist_compiled

### 3.4.1 Check on dtypes and missing values

In [13]:
booklist_compiled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2360655 entries, 0 to 2360654
Data columns (total 29 columns):
 #   Column                Dtype 
---  ------                ----- 
 0   isbn                  object
 1   text_reviews_count    object
 2   series                object
 3   country_code          object
 4   language_code         object
 5   popular_shelves       object
 6   asin                  object
 7   is_ebook              object
 8   average_rating        object
 9   kindle_asin           object
 10  similar_books         object
 11  description           object
 12  format                object
 13  link                  object
 14  authors               object
 15  publisher             object
 16  num_pages             object
 17  publication_day       object
 18  isbn13                object
 19  publication_month     object
 20  edition_information   object
 21  publication_year      object
 22  url                   object
 23  image_url             object
 24

In [14]:
# Count null values per column (missing values stored as empty strings are not counted here)
booklist_compiled.isnull().sum()

isbn                    0
text_reviews_count      0
series                  0
country_code            0
language_code           0
popular_shelves         0
asin                    0
is_ebook                0
average_rating          0
kindle_asin             0
similar_books           0
description             0
format                  0
link                    0
authors                 0
publisher               0
num_pages               0
publication_day         0
isbn13                  0
publication_month       0
edition_information     0
publication_year        0
url                     0
image_url               0
book_id                 0
ratings_count           0
work_id                 0
title                   0
title_without_series    0
dtype: int64
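
A zero count of nulls does not mean the data is complete, because an empty string is not a null value in pandas. The sketch below illustrates the difference on a made-up frame with string columns only; `toy`, `null_counts` and `empty_counts` are illustrative names, and the comparison with `""` is assumed to be applied to string columns rather than to the list-valued columns.

```python
import pandas as pd

# Toy frame (hypothetical data) where missing values are empty strings
toy = pd.DataFrame({
    "average_rating": ["4.00", "", "3.50"],
    "publisher": ["", "", "Yearling"],
})

# True nulls per column: none, because "" is a valid string
null_counts = toy.isnull().sum()

# Empty strings per column: this is where the missing data shows up
empty_counts = (toy == "").sum()

print(null_counts)
print(empty_counts)
```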

### 3.4.2 Selecting columns

Removing columns:- 'url', 'image_url', 'link', 'popular_shelves', 'country_code', 'publication_day', 'publication_month','title_without_series','isbn', 'isbn13'

In [15]:
booklist_compiled = booklist_compiled.drop(['url', 'image_url', 'link', 'popular_shelves', 'country_code', 'publication_day', 'publication_month','title_without_series', 'isbn', 'isbn13'], axis = 1)

### 3.4.3 Removing books with missing average_rating or ratings_count

In [16]:
booklist_compiled.shape

(2360655, 19)

In [17]:
print(f'The number of books with no ratings information is {booklist_compiled[(booklist_compiled.average_rating == "") | (booklist_compiled.ratings_count == "")].shape[0]}')
booklist_compiled[(booklist_compiled.average_rating == "") | (booklist_compiled.ratings_count == "")].head()

The number of books with no ratings information is 524


Unnamed: 0,text_reviews_count,series,language_code,asin,is_ebook,average_rating,kindle_asin,similar_books,description,format,authors,publisher,num_pages,edition_information,publication_year,book_id,ratings_count,work_id,title
1687,,[],,,,,,[],,,[],,,,,23699819,,,Infinity Man and the Forever People #1 (The New 52)
10658,,[],,,,,,[],,,[],,,,,2597774,,,Wade of Aquitaine
14340,,[],,,,,,[],,,[],,,,,18521522,,,Sugar Baby Lies
17992,,[],,,,,,[],,,[],,,,,28253116,,,Justice League United #16
21756,,[],,,,,,[],,,[],,,,,17796597,,,Batman Incorporated #6 (Batman Incorporated New 52 #6)


**Analysis: These books are missing most of their information.**

Important fields, including text_reviews_count, series, average_rating, similar_books, authors, ratings_count and work_id, are missing for all of these books. These books will therefore be removed from booklist_compiled.

In [18]:
# Listing the bookid removed from the booklist_compiled
bookid_removed = booklist_compiled[(booklist_compiled.average_rating == "") | (booklist_compiled.ratings_count == "")][["book_id"]].reset_index(drop=True)
bookid_removed.book_id = bookid_removed.book_id.astype(int)

# Removing these books from the booklist_compiled
booklist_compiled = booklist_compiled[(booklist_compiled.average_rating != "") & (booklist_compiled.ratings_count != "")].reset_index(drop = True)

### 3.4.4 Changing numerical columns to integers/float

**Convert the numerical columns below to integer/float**
* text_reviews_count, average_rating, book_id, ratings_count, work_id

**Assign 0 for false/not-present and 1 for true/available**
* asin, is_ebook, kindle_asin

In [19]:
booklist_compiled.text_reviews_count = booklist_compiled.text_reviews_count.astype(int)
booklist_compiled.average_rating = booklist_compiled.average_rating.astype(float)
booklist_compiled.book_id = booklist_compiled.book_id.astype(int)
booklist_compiled.ratings_count = booklist_compiled.ratings_count.astype(int)
booklist_compiled.work_id = booklist_compiled.work_id.astype(int)

In [20]:
booklist_compiled.is_ebook.unique()

array(['false', 'true'], dtype=object)

In [21]:
booklist_compiled.is_ebook = booklist_compiled.is_ebook.map({"false" : 0, "true" : 1})
booklist_compiled.asin = booklist_compiled.asin.apply(lambda x: 0 if x == "" else 1)
booklist_compiled.kindle_asin = booklist_compiled.kindle_asin.apply(lambda x: 0 if x == "" else 1)
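
The conversions in this subsection can be tried out on a small frame first. This is a minimal sketch with made-up values (the ASIN string is invented for illustration), assuming every numeric string is clean once the rows with empty ratings have been removed.

```python
import pandas as pd

# Toy frame (hypothetical data) mimicking the string-typed columns
toy = pd.DataFrame({
    "ratings_count": ["3", "6229", "141"],
    "average_rating": ["4.00", "3.23", "4.03"],
    "is_ebook": ["false", "true", "false"],
    "asin": ["", "B00ABC1234", ""],
})

# Numeric strings to integer/float
toy["ratings_count"] = toy["ratings_count"].astype(int)
toy["average_rating"] = toy["average_rating"].astype(float)

# 'false'/'true' to 0/1
toy["is_ebook"] = toy["is_ebook"].map({"false": 0, "true": 1})

# Presence flag: 1 when an identifier is available, 0 when it is empty
toy["asin"] = (toy["asin"] != "").astype(int)

print(toy.dtypes)
```

The boolean comparison followed by `astype(int)` gives the same 0/1 flag as the `apply` with a lambda, without calling a Python function per row.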

### 3.4.5 Authors

### 3.4.5.1 Author roles

In [22]:
author_roles = []

# Collect the role of every author entry across all books
for authors in booklist_compiled['authors']:
    for author in authors:
        author_roles.append(author['role'])

In [23]:
list(set(author_roles))[0:20]

['',
 'contributing illustrator',
 'Translator, Editor',
 'Editor. Translator',
 'trjm@ mHmd zydn ',
 'Publisher',
 'contributor with Solidi-chapter',
 'Neville Jason, Narrator',
 'orignally published as',
 'Contri',
 'Compliador',
 'Opening',
 'Pencils/Inks',
 'Notes',
 'Editor (credited as)',
 'online fiction writer',
 'Editor / Author',
 'Editor, and Narrator',
 'Dessin et couleur',
 'Colorer']

### 3.4.5.2 Creating a dataframe of multiple authors and roles

In [24]:
# Listing all authors for each book_id
multiple_authors_list = []

# One output row per (book, author) pair
for work_id, book_id, title, authors in zip(booklist_compiled["work_id"], booklist_compiled["book_id"], booklist_compiled["title"], booklist_compiled["authors"]):
    for author in authors:
        multiple_authors_list.append([work_id, book_id, title, author['author_id'], author['role']])

In [25]:
# Create a dataframe listing all authors for each book_id
multiple_authors_df = pd.DataFrame(multiple_authors_list, columns = ["work_id", "book_id","title","author_id", "role"])
multiple_authors_df.head()

Unnamed: 0,work_id,book_id,title,author_id,role
0,5400751,5333265,W.C. Fields: A Life on Film,604031,
1,1323437,1333909,Good Harbor,626222,
2,8948723,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, #1-2)",10333,
3,6243154,6066819,Best Friends Forever,9212,
4,278577,287140,Runic Astrology: Starcraft and Timekeeping in the Northern Tradition,149918,


In [26]:
multiple_authors_df.dtypes

work_id       int64
book_id       int64
title        object
author_id    object
role         object
dtype: object

In [27]:
# To change the author_id to integer
multiple_authors_df['author_id'] = multiple_authors_df['author_id'].astype(int)
multiple_authors_df.author_id.dtypes

dtype('int32')

In [28]:
multiple_authors_name_df = pd.merge(multiple_authors_df, authors_list, how = 'left', on = 'author_id')
multiple_authors_name_df.head()

Unnamed: 0,work_id,book_id,title,author_id,role,author_name
0,5400751,5333265,W.C. Fields: A Life on Film,604031,,Ronald J. Fields
1,1323437,1333909,Good Harbor,626222,,Anita Diamant
2,8948723,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, #1-2)",10333,,Barbara Hambly
3,6243154,6066819,Best Friends Forever,9212,,Jennifer Weiner
4,278577,287140,Runic Astrology: Starcraft and Timekeeping in the Northern Tradition,149918,,Nigel Pennick


In [29]:
multiple_authors_name_df.isnull().value_counts()

work_id  book_id  title  author_id  role   author_name
False    False    False  False      False  False          3323621
dtype: int64
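
The long-format table of authors per book can also be produced with `DataFrame.explode`. The following is a minimal sketch on made-up data, offered as an alternative to the explicit loops above rather than as the method used in this notebook; `books` and `exploded` are illustrative names.

```python
import pandas as pd

# Toy frame (hypothetical data): each book holds a list of author dicts
books = pd.DataFrame({
    "book_id": [1, 2, 3],
    "authors": [
        [{"author_id": "10", "role": ""}, {"author_id": "11", "role": "Illustrator"}],
        [{"author_id": "12", "role": ""}],
        [],  # a book with no authors listed
    ],
})

# One row per (book, author); books with an empty list become NaN and are dropped
exploded = books.explode("authors").dropna(subset=["authors"]).reset_index(drop=True)

# Pull the fields out of each dict and convert author_id to integer
exploded["author_id"] = exploded["authors"].apply(lambda a: int(a["author_id"]))
exploded["role"] = exploded["authors"].apply(lambda a: a["role"])
exploded = exploded.drop(columns="authors")

print(exploded)
```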

### 3.4.5.3 Generate a first_author for the main dataframe

#### A) Formulate a table with book_id and author_id from the main dataframe

In [30]:
booklist_compiled.head(1)

Unnamed: 0,text_reviews_count,series,language_code,asin,is_ebook,average_rating,kindle_asin,similar_books,description,format,authors,publisher,num_pages,edition_information,publication_year,book_id,ratings_count,work_id,title
0,1,[],,0,0,4.0,0,[],,Paperback,"[{'author_id': '604031', 'role': ''}]",St. Martin's Press,256,,1984,5333265,3,5400751,W.C. Fields: A Life on Film


In [31]:
# Check if there are books with no authors listed in the booklist_compiled
no_authors_index = []

for index, value in enumerate (booklist_compiled['authors']):
    if len(value)==0:
        no_authors_index.extend([[index, booklist_compiled['book_id'][index]]])

In [32]:
# Create a dataframe listing books with no authors listed in the booklist compiled
no_authors_booklist_df = pd.DataFrame(no_authors_index, columns = ['index', 'book_id'])
print(f"The number of books that do not have authors listed in the booklist_compiled is {no_authors_booklist_df.shape[0]}")
no_authors_booklist_df.head()

The number of books that do not have authors listed in the booklist_compiled is 13


Unnamed: 0,index,book_id
0,327014,711979
1,665742,7520314
2,818768,694332
3,902769,18247372
4,1013124,6033275


In [33]:
# Listing the book_id and first author_id
first_author_list = []

for index, value in enumerate (booklist_compiled['authors']):
    if len(value)!=0:
        first_author_list.extend([[booklist_compiled['book_id'][index], value[0]["author_id"]]])

In [34]:
# Create a dataframe listing the book_id and first author_id
first_author_df = pd.DataFrame(first_author_list, columns = ["book_id", 'author_id'])
print(f"The first_author_df rows and columns are {first_author_df.shape}")
first_author_df.head()

The first_author_df rows and columns are (2360118, 2)


Unnamed: 0,book_id,author_id
0,5333265,604031
1,1333909,626222
2,7327624,10333
3,6066819,9212
4,287140,149918


In [35]:
first_author_df.dtypes

book_id       int64
author_id    object
dtype: object

In [36]:
# To change the author_id to integer
first_author_df['author_id'] = first_author_df['author_id'].astype(int)
first_author_df.author_id.dtypes

dtype('int32')

#### B) Obtain the first_author_name from the author_list

In [37]:
first_author_name_df = pd.merge(first_author_df, authors_list, how = 'left', on = 'author_id')
first_author_name_df = first_author_name_df.drop(["author_id"], axis = 1)
first_author_name_df.head()

Unnamed: 0,book_id,author_name
0,5333265,Ronald J. Fields
1,1333909,Anita Diamant
2,7327624,Barbara Hambly
3,6066819,Jennifer Weiner
4,287140,Nigel Pennick
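
When looking up names with a left merge, it can help to confirm that each author_id maps to at most one name and to see which rows found no match. A minimal sketch on made-up data, using the `validate` and `indicator` parameters of `DataFrame.merge`:

```python
import pandas as pd

# Toy tables (hypothetical data): author 99 has no entry in the lookup
first_author = pd.DataFrame({"book_id": [1, 2, 3], "author_id": [10, 11, 99]})
authors = pd.DataFrame({"author_id": [10, 11], "author_name": ["Author A", "Author B"]})

merged = first_author.merge(
    authors,
    how="left",
    on="author_id",
    validate="many_to_one",  # raises if author_id is duplicated in the lookup
    indicator=True,          # adds a _merge column describing each row's origin
)

# Books whose first author has no name in the lookup table
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

Rows flagged `left_only` end up with a null author_name, which is why the nulls need to be handled after merging into the main dataframe.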


#### C) Combine with the main dataframe

In [38]:
booklist_compiled = pd.merge(booklist_compiled, first_author_name_df, how = 'left', on = 'book_id')
booklist_compiled = booklist_compiled.drop(["authors"], axis = 1)
booklist_compiled.head(1)

Unnamed: 0,text_reviews_count,series,language_code,asin,is_ebook,average_rating,kindle_asin,similar_books,description,format,publisher,num_pages,edition_information,publication_year,book_id,ratings_count,work_id,title,author_name
0,1,[],,0,0,4.0,0,[],,Paperback,St. Martin's Press,256,,1984,5333265,3,5400751,W.C. Fields: A Life on Film,Ronald J. Fields


In [39]:
booklist_compiled.author_name = booklist_compiled.author_name.fillna("")
print(f'The number of books with no author name in booklist_compiled is {len(booklist_compiled[booklist_compiled.author_name == ""])}')

The number of books with no author name in booklist_compiled is 15


In [40]:
print(f'The number of books with no title in booklist_compiled is {len(booklist_compiled[booklist_compiled.title == ""])}')
booklist_compiled[booklist_compiled.title == ""].head(1)

The number of books with no title in booklist_compiled is 7


Unnamed: 0,text_reviews_count,series,language_code,asin,is_ebook,average_rating,kindle_asin,similar_books,description,format,publisher,num_pages,edition_information,publication_year,book_id,ratings_count,work_id,title,author_name
896296,3,[],,0,0,3.63,0,"[121457, 575252, 2849, 7793505, 809653, 299360, 138200, 1163099, 218201, 1219437, 81814, 15798667, 534963, 428050, 297736, 503558, 1300476, 233331]","Ben has always been content to be brilliant at chemistry and to live apart with his outspoken familiar (George), but recently he has begun to want approval and friendship from other people. But George keeps asking why work so hard and miss so much else, and what is that doubtful character William up to working overtime in the Chemistry Lab?\nAnother humorous, original book by the author of Jennifer, Hecate, Macbeth and Me and From the Mixed-up Files of Mrs Basil E. Frankweiler.",Paperback,Yearling,160,,1985,2433394,8,2440582,,E.L. Konigsburg


In [41]:
booklist_compiled_full = booklist_compiled[(booklist_compiled.author_name != "") & (booklist_compiled.title != "")]
booklist_compiled_missing = booklist_compiled[(booklist_compiled.author_name == "")| (booklist_compiled.title == "")]

In [42]:
print(f"The total booklist_compiled observations are {booklist_compiled.shape[0]}")
print(f"The total booklist_compiled observations without missing titles or author names are {booklist_compiled_full.shape[0]}")
print(f"The total booklist_compiled observations with either missing titles or author names are {booklist_compiled_missing.shape[0]}")

The total booklist_compiled observations are 2360131
The total booklist_compiled observations without missing titles or author names are 2360109
The total booklist_compiled observations with either missing titles or author names are 22


In [43]:
# assign() returns a new dataframe, so no value is set on a slice of booklist_compiled
booklist_compiled_full = booklist_compiled_full.assign(first_author_title = booklist_compiled_full.author_name + ' - ' + booklist_compiled_full.title)
booklist_compiled = pd.concat([booklist_compiled_full,booklist_compiled_missing], axis = 0)
booklist_compiled.first_author_title = booklist_compiled.first_author_title.fillna("")
booklist_compiled = booklist_compiled.sort_index()
booklist_compiled.head(1)


Unnamed: 0,text_reviews_count,series,language_code,asin,is_ebook,average_rating,kindle_asin,similar_books,description,format,publisher,num_pages,edition_information,publication_year,book_id,ratings_count,work_id,title,author_name,first_author_title
0,1,[],,0,0,4.0,0,[],,Paperback,St. Martin's Press,256,,1984,5333265,3,5400751,W.C. Fields: A Life on Film,Ronald J. Fields,Ronald J. Fields - W.C. Fields: A Life on Film
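
The combined column can also be built in one step, without splitting and re-concatenating the dataframe. This is a minimal sketch on made-up data using `Series.where`; `df` and `complete` are illustrative names.

```python
import pandas as pd

# Toy frame (hypothetical data): the second row lacks an author, the third a title
df = pd.DataFrame({
    "author_name": ["Author A", "", "Author C"],
    "title": ["Title 1", "Title 2", ""],
})

# Rows where both the author name and the title are present
complete = (df["author_name"] != "") & (df["title"] != "")

# Keep the combined string where the row is complete, otherwise use ""
df["first_author_title"] = (df["author_name"] + " - " + df["title"]).where(complete, "")

print(df["first_author_title"].tolist())
```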


### 3.4.6 Creating a dataframe of list of books in the series

In [44]:
# Listing all series of each book_id
series_list = []

for series_index, series_values in enumerate (booklist_compiled['series']):
    for i in range (len(series_values)):
        series_list.extend([[booklist_compiled["work_id"][series_index], booklist_compiled["book_id"][series_index], booklist_compiled["series"][series_index][i]]])

In [45]:
# Create a dataframe for the series of each book_id
series_list_df = pd.DataFrame(series_list, columns = ["work_id", "book_id","series"])
series_list_df.head()

Unnamed: 0,work_id,book_id,series
0,8948723,7327624,189911
1,701117,6066812,151854
2,6243149,6066814,169353
3,54143148,33394837,1052227
4,41333541,89371,1070125


In [46]:
series_list_df.dtypes

work_id     int64
book_id     int64
series     object
dtype: object

In [47]:
series_list_df['series'] = series_list_df['series'].astype(int)
series_list_df.series.dtypes

dtype('int32')

In [48]:
# To indicate whether the book is part of a series or not (1 for yes, 0 for no)
booklist_compiled.series = booklist_compiled.series.apply(lambda x: 1 if len(x)!=0 else 0)
booklist_compiled.head(1)

Unnamed: 0,text_reviews_count,series,language_code,asin,is_ebook,average_rating,kindle_asin,similar_books,description,format,publisher,num_pages,edition_information,publication_year,book_id,ratings_count,work_id,title,author_name,first_author_title
0,1,0,,0,0,4.0,0,[],,Paperback,St. Martin's Press,256,,1984,5333265,3,5400751,W.C. Fields: A Life on Film,Ronald J. Fields,Ronald J. Fields - W.C. Fields: A Life on Film


### 3.4.7 Creating a dataframe of list of books with its similar books

In [49]:
# Listing all similar books for each book_id
similar_books_list = []

for similar_index, similar_values in enumerate (booklist_compiled['similar_books']):
    for i in range (len(similar_values)):
        similar_books_list.extend([[booklist_compiled["book_id"][similar_index], booklist_compiled["similar_books"][similar_index][i]]])
        clear_output(wait=True)
        print(f'progress: {similar_index}/{len(booklist_compiled.similar_books)}')

progress: 2360130/2360131


In [51]:
# Creating a dataframe of similar books for each book_id
similar_books_list_df = pd.DataFrame(similar_books_list, columns = ["book_id","similar_books"])
work_book = booklist_compiled[["work_id","book_id"]]
similar_books_list_df = pd.merge(similar_books_list_df, work_book, on = "book_id", how = 'left')
similar_books_list_df.head()

Unnamed: 0,book_id,similar_books,work_id
0,1333909,8709549,1323437
1,1333909,17074050,1323437
2,1333909,28937,1323437
3,1333909,158816,1323437
4,1333909,228563,1323437


In [52]:
similar_books_list_df.dtypes

book_id           int64
similar_books    object
work_id           int32
dtype: object

In [53]:
# To change the similar_books_id to integer
similar_books_list_df['similar_books'] = similar_books_list_df['similar_books'].astype(int)
similar_books_list_df.similar_books.dtypes

dtype('int32')

In [54]:
# Remove the similar books column from booklist_compiled
booklist_compiled = booklist_compiled.drop(["similar_books"], axis = 1)

### 3.4.8 Values in language_code

In [55]:
print(f"The number of different languages in the booklist_compiled is {booklist_compiled.language_code.nunique()}")
booklist_compiled.language_code.unique()

The number of different languages in the booklist_compiled is 227


array(['', 'eng', 'ger', 'spa', 'en-US', 'ita', 'per', 'en-GB', 'tur',
       'ind', 'mon', 'fre', 'por', 'ara', 'en-CA', 'tha', 'lav', 'jpn',
       'pol', 'swe', 'kor', 'fin', 'msa', 'bul', 'nl', 'gre', 'slo',
       'nor', 'heb', 'hun', 'ben', 'scr', 'zho', 'fil', 'rus', 'lit',
       'rum', 'cze', 'dan', 'slv', 'nno', 'pes', 'hye', 'nob', 'cat',
       'en', 'vie', 'nep', 'mar', 'srp', 'urd', 'guj', 'est', 'sqi',
       'ukr', 'afr', 'mul', 'grc', 'kat', 'mkd', 'hin', 'tam', 'mus',
       '--', 'bos', 'enm', 'gla', 'isl', 'glg', 'mal', 'kur', 'wel',
       'pt-BR', 'crh', 'tel', 'es-MX', 'kan', 'mya', 'fao', 'aze', 'ota',
       'arw', 'pra', 'tgl', 'lat', 'dum', 'eus', 'sin', 'mlt', 'ada',
       'apa', 'udm', 'peo', 'bel', 'iro', 'nld', 'ori', 'smn', 'amh',
       'tut', 'frs', 'arg', 'ang', 'abk', 'epo', 'snd', 'pan', 'egy',
       'dut', 'vls', 'jav', 'tlh', 'din', 'gle', 'alg', 'gsw', 'nah',
       'her', 'aus', 'aka', 'chm', 'ace', 'oci', 'ast', 'kok', 'tib',
       'frm', 'i

**Analysis: As '--' and 'Select...' are not valid language codes, we will replace them with "".**

In [56]:
booklist_compiled.language_code = booklist_compiled.language_code.replace("--", "")
booklist_compiled.language_code = booklist_compiled.language_code.replace("Select...", "")
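
Both placeholders can be replaced in a single call by passing a list to `Series.replace`. A minimal sketch on made-up values:

```python
import pandas as pd

# Toy language codes (hypothetical data) including the two placeholders
codes = pd.Series(["eng", "--", "en-US", "Select...", ""])

# Values that carry no language information
placeholders = ["--", "Select..."]

# Replace every placeholder with an empty string in one call
cleaned = codes.replace(placeholders, "")

print(cleaned.tolist())
```

Keep in mind that `nunique()` counts the empty string as a value, so the number of real languages is one less than the count it reports whenever "" is present.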

In [57]:
booklist_compiled.dtypes

text_reviews_count       int32
series                   int64
language_code           object
asin                     int64
is_ebook                 int64
average_rating         float64
kindle_asin              int64
description             object
format                  object
publisher               object
num_pages               object
edition_information     object
publication_year        object
book_id                  int32
ratings_count            int32
work_id                  int32
title                   object
author_name             object
first_author_title      object
dtype: object

### 3.5 booklist_works

### 3.5.1 Selecting columns in booklist_works

In [60]:
booklist_works.head()

Unnamed: 0,books_count,reviews_count,original_publication_month,default_description_language_code,text_reviews_count,best_book_id,original_publication_year,original_title,rating_dist,default_chaptering_book_id,original_publication_day,original_language_id,ratings_count,media_type,ratings_sum,work_id
0,1,6,8.0,,1,5333265,1984,W. C. Fields: A Life on Film,5:1|4:1|3:1|2:0|1:0|total:3,,,,3,book,12,5400751
1,22,10162,,,741,25717,2001,Good Harbor,5:517|4:1787|3:2763|2:966|1:196|total:6229,,,,6229,book,20150,1323437
2,2,268,,,7,7327624,1987,,5:49|4:58|3:26|2:5|1:3|total:141,,,,141,book,568,8948723
3,38,89252,7.0,,3504,6066819,2009,Best Friends Forever,5:9152|4:16855|3:19507|2:6210|1:1549|total:53273,,14.0,,53273,book,185670,6243154
4,2,49,,,5,287140,1990,Runic Astrology: Starcraft and Timekeeping in the Northern Tradition,5:6|4:1|3:3|2:3|1:2|total:15,,,,15,book,51,278577


In [61]:
booklist_works.default_description_language_code.unique()

array([''], dtype=object)

In [62]:
booklist_works.original_language_id.unique()

array([''], dtype=object)

In [63]:
# Selecting columns in the booklist_works
booklist_works = booklist_works.drop(["original_publication_month", "default_description_language_code", "original_publication_day", "original_language_id", "default_chaptering_book_id"], axis = 1)

In [64]:
# Create an average_rating column
booklist_works["average_rating"] = booklist_works.ratings_sum/booklist_works.ratings_count
booklist_works.head()

Unnamed: 0,books_count,reviews_count,text_reviews_count,best_book_id,original_publication_year,original_title,rating_dist,ratings_count,media_type,ratings_sum,work_id,average_rating
0,1,6,1,5333265,1984,W. C. Fields: A Life on Film,5:1|4:1|3:1|2:0|1:0|total:3,3,book,12,5400751,4.0
1,22,10162,741,25717,2001,Good Harbor,5:517|4:1787|3:2763|2:966|1:196|total:6229,6229,book,20150,1323437,3.234869
2,2,268,7,7327624,1987,,5:49|4:58|3:26|2:5|1:3|total:141,141,book,568,8948723,4.028369
3,38,89252,3504,6066819,2009,Best Friends Forever,5:9152|4:16855|3:19507|2:6210|1:1549|total:53273,53273,book,185670,6243154,3.485255
4,2,49,5,287140,1990,Runic Astrology: Starcraft and Timekeeping in the Northern Tradition,5:6|4:1|3:3|2:3|1:2|total:15,15,book,51,278577,3.4


In [65]:
booklist_works[booklist_works.average_rating.isnull()].head()

Unnamed: 0,books_count,reviews_count,text_reviews_count,best_book_id,original_publication_year,original_title,rating_dist,ratings_count,media_type,ratings_sum,work_id,average_rating
27,1,7,1,16037548,2012.0,Untold Secrets: Fire & Ice,5:0|4:0|3:0|2:0|1:0|total:0,0,book,0,21811539,
215,2,3,1,28669888,,,5:0|4:0|3:0|2:0|1:0|total:0,0,,0,45394186,
262,1,1,1,2471303,,Tolstoy-V2,5:0|4:0|3:0|2:0|1:0|total:0,0,,0,2478500,
326,1,3,2,36442770,,,5:0|4:0|3:0|2:0|1:0|total:0,0,book,0,58144514,
952,1,7,1,15774736,2013.0,,5:0|4:0|3:0|2:0|1:0|total:0,0,book,0,21485593,


In [66]:
# Works with no ratings give 0/0 = NaN above; set their average_rating to 0
booklist_works.loc[booklist_works.ratings_sum == 0, "average_rating"] = 0

In [67]:
booklist_works.loc[[27],]

Unnamed: 0,books_count,reviews_count,text_reviews_count,best_book_id,original_publication_year,original_title,rating_dist,ratings_count,media_type,ratings_sum,work_id,average_rating
27,1,7,1,16037548,2012,Untold Secrets: Fire & Ice,5:0|4:0|3:0|2:0|1:0|total:0,0,book,0,21811539,0.0


In [68]:
booklist_works.isnull().sum()[["average_rating"]]

average_rating    0
dtype: int64
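
The average rating and the zero-ratings correction can be computed together. The following is a minimal sketch on made-up data; it keys the correction on ratings_count rather than ratings_sum, which is an assumption that the two are zero for the same works.

```python
import pandas as pd

# Toy works table (hypothetical data): the second work has no ratings
works = pd.DataFrame({"ratings_sum": [12, 0, 51], "ratings_count": [3, 0, 15]})

# Plain division gives NaN for 0/0
raw = works["ratings_sum"] / works["ratings_count"]

# Keep the ratio where there is at least one rating, otherwise use 0
works["average_rating"] = raw.where(works["ratings_count"] > 0, 0.0)

print(works)
```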

### 3.5.2 Placing title and author_name in booklist_works

### 3.5.2.1  Check for matching book_id and work_id values between booklist_works and booklist_compiled

#### Book_id and work_id of booklist_works in booklist_compiled

In [69]:
# Count of booklist_works best_book_id values found (1) or not found (0) in booklist_compiled
booklist_works.best_book_id.isin(booklist_compiled.book_id).astype(int).value_counts()

1    1439794
0      82168
Name: best_book_id, dtype: int64

In [70]:
# Count of booklist_works work_id values found (1) or not found (0) in booklist_compiled
booklist_works.work_id.isin(booklist_compiled.work_id).astype(int).value_counts()

1    1521962
Name: work_id, dtype: int64

#### Book_id and work_id of booklist_compiled in booklist_works

In [71]:
# Count of booklist_compiled book_id values found (1) or not found (0) in booklist_works
booklist_compiled.book_id.isin(booklist_works.best_book_id).astype(int).value_counts()

1    1439794
0     920337
Name: book_id, dtype: int64

In [73]:
# Count of booklist_compiled work_id values found (1) or not found (0) in booklist_works
booklist_compiled.work_id.isin(booklist_works.work_id).astype(int).value_counts()

1    2360131
Name: work_id, dtype: int64

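**Analysis: Every work_id in booklist_works appears in booklist_compiled and vice versa. By contrast, only 1,439,794 of the 1,521,962 best_book_id values in booklist_works match a book_id in booklist_compiled.**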
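
To express the overlap as a proportion rather than a count, the mean of the boolean `isin` result can be used. A minimal sketch on made-up ids:

```python
import pandas as pd

# Toy tables (hypothetical data)
works = pd.DataFrame({"best_book_id": [1, 2, 3, 4], "work_id": [100, 200, 300, 400]})
books = pd.DataFrame({"book_id": [1, 2, 5, 6, 7], "work_id": [100, 200, 300, 400, 400]})

# Boolean masks: does each id in works appear in books?
book_match = works["best_book_id"].isin(books["book_id"])
work_match = works["work_id"].isin(books["work_id"])

# The mean of a boolean mask is the share of True values
print(book_match.mean(), work_match.mean())
```

In this toy example only half of the best_book_id values match, while every work_id does, which mirrors the pattern seen in the counts above.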

### 3.5.2.2  Filling the missing values in the booklist_works original_title column and including author_name

1. Match using book_id
2. Match using work_id, taking the information from the book with the maximum ratings_count in booklist_compiled
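
The two steps listed above can be sketched on made-up data as follows. This is an illustration under stated assumptions, not the notebook's implementation: the column names `matched_title` and `author_name_by_work` are hypothetical, and the second step assumes that the most-rated edition of a work is a reasonable source for its title and author.

```python
import pandas as pd

# Toy tables (hypothetical data): work 1 points to a best_book_id that is absent
books = pd.DataFrame({
    "work_id": [1, 1, 2],
    "book_id": [10, 11, 20],
    "ratings_count": [5, 50, 7],
    "title": ["A (first edition)", "A (second edition)", "B"],
    "author_name": ["Author X", "Author X", "Author Y"],
})
works = pd.DataFrame({"work_id": [1, 2], "best_book_id": [99, 20]})

# Step 1: match on book_id
by_book = books.rename(columns={"book_id": "best_book_id", "title": "matched_title"})
by_book = by_book[["best_book_id", "matched_title", "author_name"]]
step1 = works.merge(by_book, on="best_book_id", how="left")

# Step 2: for each work_id, take the row with the highest ratings_count
top_idx = books.groupby("work_id")["ratings_count"].idxmax()
top_rows = books.loc[top_idx, ["work_id", "title", "author_name"]]
step2 = step1.merge(top_rows, on="work_id", how="left", suffixes=("", "_by_work"))

# Fill only the gaps left by step 1
step2["matched_title"] = step2["matched_title"].fillna(step2["title"])
step2["author_name"] = step2["author_name"].fillna(step2["author_name_by_work"])

print(step2[["work_id", "matched_title", "author_name"]])
```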

**1. Match using book_id**

In [74]:
# Build a lookup of title and author_name keyed by book_id from booklist_compiled
book_title_df = booklist_compiled[["book_id", "title", "author_name"]]
book_title_df = book_title_df.rename({"book_id":"best_book_id", "title": "original_title"}, axis = 1)
book_title_df = book_title_df[book_title_df.original_title != ""]
book_title_df.head()

Unnamed: 0,best_book_id,original_title,author_name
0,5333265,W.C. Fields: A Life on Film,Ronald J. Fields
1,1333909,Good Harbor,Anita Diamant
2,7327624,"The Unschooled Wizard (Sun Wolf and Starhawk, #1-2)",Barbara Hambly
3,6066819,Best Friends Forever,Jennifer Weiner
4,287140,Runic Astrology: Starcraft and Timekeeping in the Northern Tradition,Nigel Pennick


In [75]:
# Separate the books in booklist_works that have titles from those with missing titles
booklist_works_missing_title = booklist_works[booklist_works.original_title == ""]
booklist_works_ori_title = booklist_works[booklist_works.original_title != ""]
booklist_works_missing_title.head()

Unnamed: 0,books_count,reviews_count,text_reviews_count,best_book_id,original_publication_year,original_title,rating_dist,ratings_count,media_type,ratings_sum,work_id,average_rating
2,2,268,7,7327624,1987.0,,5:49|4:58|3:26|2:5|1:3|total:141,141,book,568,8948723,4.028369
8,2,14,4,34883016,,,5:2|4:2|3:3|2:0|1:0|total:7,7,book,27,56135087,3.857143
11,5,893,65,33394837,,,5:141|4:96|3:30|2:7|1:2|total:276,276,book,1195,54143148,4.32971
12,22,25527,553,89369,1992.0,,5:4686|4:4938|3:3944|2:1274|1:356|total:15198,15198,book,57918,41333541,3.810896
19,2,162,24,21401188,2014.0,,5:27|4:32|3:18|2:3|1:0|total:80,80,book,323,40699074,4.0375


In [76]:
# Fill the information of the missing title with mapping of book_id
booklist_works_missing_title = booklist_works_missing_title.drop(["original_title"], axis = 1)
booklist_works_missing_title = pd.merge(booklist_works_missing_title, book_title_df, on = 'best_book_id', how = 'left')
booklist_works_missing_title.head()

Unnamed: 0,books_count,reviews_count,text_reviews_count,best_book_id,original_publication_year,rating_dist,ratings_count,media_type,ratings_sum,work_id,average_rating,original_title,author_name
0,2,268,7,7327624,1987.0,5:49|4:58|3:26|2:5|1:3|total:141,141,book,568,8948723,4.028369,"The Unschooled Wizard (Sun Wolf and Starhawk, #1-2)",Barbara Hambly
1,2,14,4,34883016,,5:2|4:2|3:3|2:0|1:0|total:7,7,book,27,56135087,3.857143,Playmaker: A Venom Series Novella,V.L. Locey
2,5,893,65,33394837,,5:141|4:96|3:30|2:7|1:2|total:276,276,book,1195,54143148,4.32971,The House of Memory (Pluto's Snitch #2),Carolyn Haines
3,22,25527,553,89369,1992.0,5:4686|4:4938|3:3944|2:1274|1:356|total:15198,15198,book,57918,41333541,3.810896,The Te of Piglet,Benjamin Hoff
4,2,162,24,21401188,2014.0,5:27|4:32|3:18|2:3|1:0|total:80,80,book,323,40699074,4.0375,Glimmering Light,Margot Hovley
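
A left merge like the one above silently adds rows if the lookup table repeats a key, and leaves NaN where no key matches. The following is a minimal sketch, on toy data with illustrative values, of how the `validate` and `indicator` parameters of `pd.merge` could be used to guard against the first case and count the second. The frames `works` and `titles` are hypothetical stand-ins, not the notebook's dataframes.

```python
import pandas as pd

# Toy stand-ins for booklist_works_missing_title and book_title_df (illustrative values)
works = pd.DataFrame({"best_book_id": [1, 2, 3], "ratings_count": [10, 20, 30]})
titles = pd.DataFrame({
    "best_book_id": [1, 2],
    "original_title": ["Title A", "Title B"],
    "author_name": ["Author A", "Author B"],
})

# validate="many_to_one" raises a MergeError if best_book_id repeats in `titles`,
# which would otherwise duplicate rows in the merged result
merged = pd.merge(
    works, titles, on="best_book_id", how="left",
    validate="many_to_one", indicator=True,
)

# The "_merge" column marks each row as "both" (matched) or "left_only" (no match found)
unmatched = int((merged["_merge"] == "left_only").sum())
print(f"Rows without a match: {unmatched}")
```

Rows flagged `left_only` correspond to the books that still have a missing title after this step.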


In [77]:
# Add the author name to the books in booklist_works that already have a title
book_author_df = book_title_df.drop(["original_title"], axis = 1)
booklist_works_ori_title = pd.merge(booklist_works_ori_title, book_author_df, on = 'best_book_id', how = 'left')
booklist_works_ori_title.head()

Unnamed: 0,books_count,reviews_count,text_reviews_count,best_book_id,original_publication_year,original_title,rating_dist,ratings_count,media_type,ratings_sum,work_id,average_rating,author_name
0,1,6,1,5333265,1984,W. C. Fields: A Life on Film,5:1|4:1|3:1|2:0|1:0|total:3,3,book,12,5400751,4.0,Ronald J. Fields
1,22,10162,741,25717,2001,Good Harbor,5:517|4:1787|3:2763|2:966|1:196|total:6229,6229,book,20150,1323437,3.234869,Anita Diamant
2,38,89252,3504,6066819,2009,Best Friends Forever,5:9152|4:16855|3:19507|2:6210|1:1549|total:53273,53273,book,185670,6243154,3.485255,Jennifer Weiner
3,2,49,5,287140,1990,Runic Astrology: Starcraft and Timekeeping in the Northern Tradition,5:6|4:1|3:3|2:3|1:2|total:15,15,book,51,278577,3.4,Nigel Pennick
4,21,154,8,287141,1908,The Aeneid for Boys and Girls,5:29|4:21|3:14|2:3|1:0|total:67,67,book,277,278578,4.134328,Alfred J. Church


In [78]:
# Merge the two lists into a single dataframe
booklist_works = pd.concat([booklist_works_missing_title, booklist_works_ori_title], axis = 0)
booklist_works = booklist_works.reset_index(drop=True)
booklist_works.head()

Unnamed: 0,books_count,reviews_count,text_reviews_count,best_book_id,original_publication_year,rating_dist,ratings_count,media_type,ratings_sum,work_id,average_rating,original_title,author_name
0,2,268,7,7327624,1987.0,5:49|4:58|3:26|2:5|1:3|total:141,141,book,568,8948723,4.028369,"The Unschooled Wizard (Sun Wolf and Starhawk, #1-2)",Barbara Hambly
1,2,14,4,34883016,,5:2|4:2|3:3|2:0|1:0|total:7,7,book,27,56135087,3.857143,Playmaker: A Venom Series Novella,V.L. Locey
2,5,893,65,33394837,,5:141|4:96|3:30|2:7|1:2|total:276,276,book,1195,54143148,4.32971,The House of Memory (Pluto's Snitch #2),Carolyn Haines
3,22,25527,553,89369,1992.0,5:4686|4:4938|3:3944|2:1274|1:356|total:15198,15198,book,57918,41333541,3.810896,The Te of Piglet,Benjamin Hoff
4,2,162,24,21401188,2014.0,5:27|4:32|3:18|2:3|1:0|total:80,80,book,323,40699074,4.0375,Glimmering Light,Margot Hovley


In [79]:
print(f"The number of books with missing title is {booklist_works[booklist_works.original_title.isnull()].shape[0]}")
booklist_works[booklist_works.original_title.isnull()].head()

The number of books with missing title is 41401


Unnamed: 0,books_count,reviews_count,text_reviews_count,best_book_id,original_publication_year,rating_dist,ratings_count,media_type,ratings_sum,work_id,average_rating,original_title,author_name
9,6,356,25,35494833,,5:54|4:63|3:34|2:30|1:21|total:202,202,book,705,49305010,3.490099,,
12,4,240,1,23373156,2014.0,5:5|4:1|3:1|2:0|1:1|total:8,8,book,33,42749946,4.125,,
50,11,79,16,30973860,,5:5|4:26|3:20|2:4|1:0|total:55,55,book,197,51592180,3.581818,,
85,5,43,2,17354319,2012.0,5:2|4:3|3:5|2:3|1:1|total:14,14,,44,22291818,3.142857,,
128,3,119,7,31084622,,5:29|4:31|3:7|2:1|1:0|total:68,68,book,292,51688433,4.294118,,


In [80]:
print(f"The number of books with missing author name is {booklist_works[booklist_works.author_name.isnull()].shape[0]}")
booklist_works[booklist_works.author_name.isnull()].head()

The number of books with missing author name is 82173


Unnamed: 0,books_count,reviews_count,text_reviews_count,best_book_id,original_publication_year,rating_dist,ratings_count,media_type,ratings_sum,work_id,average_rating,original_title,author_name
9,6,356,25,35494833,,5:54|4:63|3:34|2:30|1:21|total:202,202,book,705,49305010,3.490099,,
12,4,240,1,23373156,2014.0,5:5|4:1|3:1|2:0|1:1|total:8,8,book,33,42749946,4.125,,
50,11,79,16,30973860,,5:5|4:26|3:20|2:4|1:0|total:55,55,book,197,51592180,3.581818,,
85,5,43,2,17354319,2012.0,5:2|4:3|3:5|2:3|1:1|total:14,14,,44,22291818,3.142857,,
128,3,119,7,31084622,,5:29|4:31|3:7|2:1|1:0|total:68,68,book,292,51688433,4.294118,,


**Analysis: Some books have a title but no author name, while others are missing both the title and the author name.**

**Missing values for both title and author**

These will be filled using work_id as the reference. 
The information will be taken from the book edition in booklist_compiled that has the highest ratings count for that work_id.
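
The fill described above can be sketched on toy data as follows. This is an illustrative alternative that uses `groupby(...).idxmax()` and `Series.map`, not the merge-based cells that follow; the frames `works` and `editions` and all their values are hypothetical.

```python
import pandas as pd

# Toy works table: two of the three works have no title
works = pd.DataFrame({
    "work_id": [10, 20, 30],
    "title": ["Known Title", None, None],
})

# Toy editions table: work 20 has two editions with different ratings counts
editions = pd.DataFrame({
    "work_id": [20, 20, 30],
    "ratings_count": [5, 50, 7],
    "title": ["Minor Edition", "Popular Edition", "Only Edition"],
})

# idxmax returns the row label of the highest ratings_count within each work_id
top_rows = editions.loc[editions.groupby("work_id")["ratings_count"].idxmax()]
title_by_work = top_rows.set_index("work_id")["title"]

# Only missing titles are replaced; existing titles are left untouched
works["title"] = works["title"].fillna(works["work_id"].map(title_by_work))
print(works)
```

Work 20 receives the title of its most-rated edition, and work 10 keeps the title it already had.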

**2. Match on work_id using the book edition with the highest ratings_count in booklist_compiled**

In [81]:
# Split booklist_works into books with a missing title and books with a title
booklist_works = booklist_works.rename({"original_title" : "title"}, axis = 1)
works_missing_title = booklist_works[(booklist_works.title.isnull())]
works_no_missing_title = booklist_works[booklist_works.title.notnull()]

**Books from booklist_works with both missing title and author names**

In [82]:
# Extracting the book edition details having the highest ratings count for each work_id from booklist_compiled 
booklist_compiled_max_work_rating = booklist_compiled.groupby(['work_id'])['ratings_count'].max().reset_index()
booklist_compiled_book_work_rating = booklist_compiled[['book_id', 'work_id', 'ratings_count', 'title', 'author_name']]
booklist_compiled_work_info = pd.merge(booklist_compiled_max_work_rating,booklist_compiled_book_work_rating, on = ["work_id", 'ratings_count'], how = 'left')
booklist_compiled_work_info.head()

Unnamed: 0,work_id,ratings_count,book_id,title,author_name
0,40,615,3730,The Hidden Persuaders,Vance Packard
1,62,888,3402,Kiffe Kiffe Tomorrow,Faiza Guene
2,81,215,7918,Five Little Peppers Abroad,Margaret Sidney
3,84,1315,7932,Baby Island,Carol Ryrie Brink
4,87,31717,8155,A Woman of Substance (Emma Harte Saga #1),Barbara Taylor Bradford


In [83]:
booklist_compiled_work_info[booklist_compiled_work_info.work_id.duplicated()].head()

Unnamed: 0,work_id,ratings_count,book_id,title,author_name
2702,13080,26,13560337,Cilappatikaram: The Tale of an Anklet,Ilankovatikal
7330,31982,1,2808224,The Philosophy of Nietzsche,Friedrich Nietzsche
8534,36916,4,21202617,Minos,Edwin Page
12863,54823,9,27036588,Angel Claws: Coffee Table Book,Alejandro Jodorowsky
15986,69330,1,32137953,Le Genie Du Christianisme,Francois-Rene de Chateaubriand


In [84]:
booklist_compiled_work_info[booklist_compiled_work_info.work_id == 13080]

Unnamed: 0,work_id,ratings_count,book_id,title,author_name
2701,13080,26,10364,The Cilappatikaram of Iḷaṅkō Aṭikaḷ: An Epic of South India,Ilankovatikal
2702,13080,26,13560337,Cilappatikaram: The Tale of an Anklet,Ilankovatikal


In [85]:
# Drop duplicates where editions of the same work_id share the same ratings_count
booklist_compiled_work_info = booklist_compiled_work_info.drop_duplicates(subset = "work_id")
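
When editions tie on ratings_count, `drop_duplicates` keeps whichever row happens to come first. A minimal sketch of making that choice explicit is shown below: sort first, then drop duplicates. Breaking ties by the lowest book_id is an assumption made here for illustration, not a rule taken from the notebook. The rows reuse the work_id, book_id and ratings_count values displayed in the outputs above.

```python
import pandas as pd

# Editions taken from the outputs above: work_id 13080 has two editions tied at 26 ratings
editions = pd.DataFrame({
    "book_id": [13560337, 10364, 3730],
    "work_id": [13080, 13080, 40],
    "ratings_count": [26, 26, 615],
})

# Sort so that, within each work_id, the highest ratings_count comes first
# and ties are broken by the lowest book_id (assumed tie-break rule)
top = (
    editions
    .sort_values(["work_id", "ratings_count", "book_id"], ascending=[True, False, True])
    .drop_duplicates(subset="work_id", keep="first")
    .reset_index(drop=True)
)
print(top)
```

With an explicit sort, the selected edition no longer depends on the row order of booklist_compiled.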

In [86]:
# Fill the books having both missing title and author name in booklist_works
booklist_compiled_work_info = booklist_compiled_work_info.drop(["book_id", "ratings_count"], axis = 1)
works_missing_title = works_missing_title.drop(["title", "author_name"], axis = 1)
works_missing_title = pd.merge(works_missing_title,booklist_compiled_work_info, on = "work_id", how = 'left')
print(f"The works_missing_title rows and columns are {works_missing_title.shape}")
works_missing_title.head()

The works_missing_title rows and columns are (41401, 13)


Unnamed: 0,books_count,reviews_count,text_reviews_count,best_book_id,original_publication_year,rating_dist,ratings_count,media_type,ratings_sum,work_id,average_rating,title,author_name
0,6,356,25,35494833,,5:54|4:63|3:34|2:30|1:21|total:202,202,book,705,49305010,3.490099,The Slaughtered Virgin of Zenopolis (Inspector Capstan #1),David Blake
1,4,240,1,23373156,2014.0,5:5|4:1|3:1|2:0|1:1|total:8,8,book,33,42749946,4.125,The Switchblade Mamma,Lindsey Schussman
2,11,79,16,30973860,,5:5|4:26|3:20|2:4|1:0|total:55,55,book,197,51592180,3.581818,Svart stjärna,Jesper Ersgard
3,5,43,2,17354319,2012.0,5:2|4:3|3:5|2:3|1:1|total:14,14,,44,22291818,3.142857,Itsy Bitsy Spider,Charles Reasoner
4,3,119,7,31084622,,5:29|4:31|3:7|2:1|1:0|total:68,68,book,292,51688433,4.294118,Sleuthing at Sweet Springs (Sleuth Sisters #4),Maggie Pill


**Books from booklist_works that have a title, some with and some without an author name**

In [87]:
# Separate books having titles with and without missing author names
works_no_missing_title_author = works_no_missing_title[works_no_missing_title.author_name.notnull()]
works_no_missing_title_missing_author = works_no_missing_title[works_no_missing_title.author_name.isnull()]

# Fill in the author names using the work_id from booklist_compiled
works_no_missing_title_missing_author = works_no_missing_title_missing_author.drop(["author_name"], axis = 1)
booklist_compiled_work_info = booklist_compiled_work_info.drop(["title"], axis = 1)
works_no_missing_title_missing_author = pd.merge(works_no_missing_title_missing_author,booklist_compiled_work_info, on = 'work_id', how = 'left')
works_no_missing_title_missing_author.head()

Unnamed: 0,books_count,reviews_count,text_reviews_count,best_book_id,original_publication_year,rating_dist,ratings_count,media_type,ratings_sum,work_id,average_rating,title,author_name
0,20,164,7,7216124,1975,5:17|4:29|3:27|2:8|1:1|total:82,82,book,299,1040345,3.646341,Des yeux de soie,Francoise Sagan
1,3,126,5,1419534,1981,5:8|4:16|3:16|2:6|1:0|total:46,46,,164,1409928,3.565217,"Eighteenth Century Europe: Tradition and Progress, 1715-1789 (Norton History of Modern Europe)",Isser Woloch
2,2,17,3,1802481,2006,5:2|4:2|3:2|2:1|1:0|total:7,7,,26,1801666,3.714286,Cahier de gribouillages pour adultes qui s'ennuient au bureau,Claire Fay
3,3,24,2,1890255,1990,5:3|4:4|3:3|2:0|1:0|total:10,10,,40,1891601,4.0,An Historical Introduction to American Education,Gerald L. Gutek
4,7,49,5,13412895,1888,5:8|4:16|3:7|2:2|1:0|total:33,33,book,129,18687450,3.909091,Ogni,Anton Chekhov


In [88]:
# Combine all the books into booklist_works
booklist_works = pd.concat([works_no_missing_title_author,works_no_missing_title_missing_author,works_missing_title ], axis = 0)
booklist_works = booklist_works.reset_index(drop=True)
print(f"The booklist_works rows and columns are {booklist_works.shape}")

# Create a column combining first author name and book title
booklist_works = booklist_works.rename({"author_name":"first_author_name"}, axis = 1)
booklist_works["first_author_title"] = booklist_works["first_author_name"] + ' - ' + booklist_works["title"]
booklist_works.head()

The booklist_works rows and columns are (1521962, 13)


Unnamed: 0,books_count,reviews_count,text_reviews_count,best_book_id,original_publication_year,rating_dist,ratings_count,media_type,ratings_sum,work_id,average_rating,title,first_author_name,first_author_title
0,2,268,7,7327624,1987.0,5:49|4:58|3:26|2:5|1:3|total:141,141,book,568,8948723,4.028369,"The Unschooled Wizard (Sun Wolf and Starhawk, #1-2)",Barbara Hambly,"Barbara Hambly - The Unschooled Wizard (Sun Wolf and Starhawk, #1-2)"
1,2,14,4,34883016,,5:2|4:2|3:3|2:0|1:0|total:7,7,book,27,56135087,3.857143,Playmaker: A Venom Series Novella,V.L. Locey,V.L. Locey - Playmaker: A Venom Series Novella
2,5,893,65,33394837,,5:141|4:96|3:30|2:7|1:2|total:276,276,book,1195,54143148,4.32971,The House of Memory (Pluto's Snitch #2),Carolyn Haines,Carolyn Haines - The House of Memory (Pluto's Snitch #2)
3,22,25527,553,89369,1992.0,5:4686|4:4938|3:3944|2:1274|1:356|total:15198,15198,book,57918,41333541,3.810896,The Te of Piglet,Benjamin Hoff,Benjamin Hoff - The Te of Piglet
4,2,162,24,21401188,2014.0,5:27|4:32|3:18|2:3|1:0|total:80,80,book,323,40699074,4.0375,Glimmering Light,Margot Hovley,Margot Hovley - Glimmering Light


In [89]:
booklist_works.isnull().sum()[['title', 'first_author_name']]

title                0
first_author_name    0
dtype: int64
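
Because the fill works by splitting the dataframe, merging each part and concatenating the parts again, it can be worth confirming that no rows were lost or duplicated along the way. The sketch below shows one way such a check might look on toy data; the frames `original` and `lookup` are hypothetical and only mirror the split, merge and concat pattern used in this section.

```python
import pandas as pd

# Toy data: two works have a title and two do not
original = pd.DataFrame({"work_id": [1, 2, 3, 4], "title": ["A", None, "C", None]})
lookup = pd.DataFrame({"work_id": [2, 4], "title": ["B", "D"]})

# Same pattern as above: split, fill the missing part with a left merge, recombine
missing = original[original["title"].isnull()]
present = original[original["title"].notnull()]
filled = pd.merge(missing.drop(columns="title"), lookup, on="work_id", how="left")
recombined = pd.concat([present, filled], axis=0).reset_index(drop=True)

# The recombined frame should hold exactly the same works, each appearing once
assert len(recombined) == len(original)
assert recombined["work_id"].is_unique
assert set(recombined["work_id"]) == set(original["work_id"])

remaining = int(recombined["title"].isnull().sum())
print(f"Titles still missing: {remaining}")
```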

### 3.6 Summary

* Handled data values, including converting to integers/floats, removing spaces and renaming values for consistency
* Selected columns for booklist_compiled and booklist_works
* Generated new dataframes for easier reference, including authors_list, multiple_authors_name_df, series_list_df, similar_books_list_df and bookid_removed
* Removed books for which the majority of the information is missing
* Generated the new columns first_author_name and first_author_title in both booklist_works and booklist_compiled
* Filled in missing titles and author names in booklist_works

## Exporting Data

In [90]:
# Prefix the export lines below with # to refrain from executing them
# Main booklist information
booklist_compiled.to_parquet("./data/booklist_compiled_clean.parquet", compression = 'gzip')
booklist_works.to_parquet("./data/booklist_works_clean.parquet", compression = 'gzip')
booklist_authors.to_parquet("./data/booklist_authors_clean.parquet", compression = 'gzip')

#Authors
authors_list.to_parquet("./data/authors_list.parquet", compression = 'gzip')
multiple_authors_name_df.to_parquet("./data/multiple_authors_name_df.parquet", compression = 'gzip') 

#List of series and similar books
series_list_df.to_parquet("./data/series_list_df.parquet", compression = 'gzip') 
similar_books_list_df.to_parquet("./data/similar_books_list_df.parquet", compression = 'gzip')

# Dataframe of bookid removed
bookid_removed.to_parquet("./data/bookid_removed.parquet", compression = 'gzip')