# Scraping Data from Goodreads.com

In [1]:
import pandas as pd
import numpy as np

from RawDataCollector import GoodReadsScraper

First, I run another Jupyter notebook to get the list of ISBNs and the book metadata.

In [2]:
# Importing df_final from other notebook
%run "..\raw\bestsellers_generating.ipynb"


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8555 entries, 0 to 8554
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   isbn13   8552 non-null   object
 1   title1   8555 non-null   object
 2   author1  8555 non-null   object
dtypes: object(3)
memory usage: 200.6+ KB

(8555, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8555 entries, 0 to 8554
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   isbn13   8555 non-null   object
 1   title1   8555 non-null   object
 2   author1  8555 non-null   object
dtypes: object(3)
memory usage: 200.6+ KB


author_title_isbn13
gillian flynn gone girl 9780307588388               112
george rr martin a game of thrones 9780553897845     77
kristin hannah the nightingale 9781466850606         74
e l james fifty shades of grey 9781612130293         71
e l james fifty shades darker 9781612130590          70
                    

With the list of ISBNs and the dataframe of book metadata available, I start:

### Web Scraping

First we check that the _bestsellers_ dataframe with the _ISBNs_ is defined and not empty.
Otherwise we raise a `ValueError`.

In [3]:
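# df_final should have been created by the %run cell above
if 'df_final' in globals() and df_final is not None and not df_final.empty:
    df1 = df_final
    isbns = df_final['isbn13']
else:
    raise ValueError('df_final is missing or empty')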
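
Since every ISBN costs one request, it may help to drop missing, malformed, and duplicate values before scraping. The sketch below is not part of the original notebook: `is_valid_isbn13` and `clean_isbns` are hypothetical helpers, and the check uses the standard ISBN-13 checksum (digits weighted alternately by 1 and 3, total divisible by 10).

```python
from typing import Iterable, List


def is_valid_isbn13(isbn: str) -> bool:
    """Return True if `isbn` is 13 ASCII digits and passes the ISBN-13 checksum."""
    if len(isbn) != 13 or not (isbn.isascii() and isbn.isdigit()):
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(isbn))
    return total % 10 == 0


def clean_isbns(raw: Iterable[object]) -> List[str]:
    """Keep valid ISBN-13 strings, in first-seen order, without duplicates."""
    seen = set()
    cleaned = []
    for value in raw:
        # None and NaN become strings that fail validation and are skipped.
        isbn = str(value).strip().replace("-", "")
        if is_valid_isbn13(isbn) and isbn not in seen:
            seen.add(isbn)
            cleaned.append(isbn)
    return cleaned
```

Something like `isbns = clean_isbns(df_final['isbn13'])` could then replace the raw column, assuming `GoodReadsScraper` accepts a plain list (the notebook does not show this).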

If everything is in order, we proceed to scrape additional data from Goodreads.com.

In [4]:
scrapped = GoodReadsScraper(isbns)
print(scrapped)

You just scrapped 2935 books from the Goodreads.com!


I instantiate `GoodReadsScraper`, a class from my own `RawDataCollector` module, which takes the ISBN numbers, scrapes the page for each book, and stores the HTML text in a list.

This takes some time, as we are making thousands of requests.
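
With that many requests, pausing between calls and retrying failures is usually worthwhile. The notebook does not show how `GoodReadsScraper` does this internally, so the following is only a sketch under assumptions: `fetch_all` is a hypothetical helper, and `fetch` stands for any callable that downloads the HTML for one ISBN.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def fetch_all(
    isbns: Iterable[str],
    fetch: Callable[[str], str],
    delay: float = 1.0,
    max_retries: int = 3,
) -> Dict[str, Optional[str]]:
    """Call `fetch(isbn)` for each ISBN, pausing `delay` seconds between books.

    An ISBN is tried up to `max_retries` times and stored as None if every
    attempt raises.
    """
    pages: Dict[str, Optional[str]] = {}
    for isbn in isbns:
        pages[isbn] = None
        for attempt in range(1, max_retries + 1):
            try:
                pages[isbn] = fetch(isbn)
                break
            except Exception:
                # Wait a little longer after each failed attempt.
                time.sleep(delay * attempt)
        time.sleep(delay)  # pause before the next book
    return pages
```

Passing the download function in as an argument keeps the loop testable without network access. Storing `None` for failed books matches the way missing values show up later in the merged dataframe.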

After scraping is done, we extract the **number of pages, edition, genres, cover picture URL**, and **rating data** and convert them into `pandas` dataframes.

In [5]:
df2 = scrapped.data_converter()
print(df2.shape)

(2935, 3)


In [6]:
df3 = scrapped.genre_converter()
print(df3.shape)

(2935, 2)


In [7]:
df4 = scrapped.cover_url_converter()
print(df4.shape)

(2935, 2)


In [8]:
df5 = scrapped.pop_converter()
print(df5.shape)

(2935, 3)


After all necessary methods are executed, we have four new dataframes, which we join with the book metadata on the _ISBN13_ number and merge into one.

In [9]:
result = df1.merge(df2, on='isbn13').merge(df3, on='isbn13').merge(df4, on='isbn13').merge(df5, on='isbn13')
result.tail()

| | title1 | author1 | isbn13 | pages | edition | genres | cover_url | rating | count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2930 | redhead by the side of the road | anne tyler | 9780525658429 | 192 pages | ebook | [[Fiction], [Contemporary], [Literary Fiction]] | https://i.gr-assets.com/images/S/compressed.ph... | \n 3.80\n | \n 6,416\n ratings\n |
| 2931 | revenge | james patterson and andrew holmes | 9781538700723 | | | [] | | | |
| 2932 | everything i never told you | celeste ng | 9780143127550 | 292 pages | Paperback | [[Fiction], [Contemporary], [Mystery]] | https://i.gr-assets.com/images/S/compressed.ph... | \n 3.86\n | \n 301,065\n ratings\n |
| 2933 | the book of lost friends | lisa wingate | 9781984819895 | | Unknown Binding | [[Historical], [Historical Fiction], [Fiction]] | | \n 4.27\n | \n 5,903\n ratings\n |
| 2934 | texas outlaw | james patterson and andrew bourelle | 9780316428163 | 448 pages | Hardcover | [[Mystery], [Fiction], [Westerns]] | https://i.gr-assets.com/images/S/compressed.ph... | \n 4.43\n | \n 1,980\n ratings\n |


In [10]:
result.shape

(2935, 9)
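
The chained merge can also be written as a fold over a list of dataframes, which makes it easier to guard against accidental row duplication. This is a sketch, not the notebook's code: `merge_on_isbn` is a hypothetical helper, the toy frames hold made-up values, and it defaults to `how="left"` where the notebook uses the default inner join.

```python
from functools import reduce

import pandas as pd


def merge_on_isbn(frames, how="left"):
    """Merge dataframes left to right on the `isbn13` column.

    validate="one_to_one" makes pandas raise a MergeError if an ISBN is
    duplicated on either side, instead of silently multiplying rows.
    """
    return reduce(
        lambda left, right: left.merge(
            right, on="isbn13", how=how, validate="one_to_one"
        ),
        frames,
    )


# Toy frames standing in for df1 ... df5 (values are made up).
meta = pd.DataFrame({"isbn13": ["a", "b"], "title1": ["t1", "t2"]})
pages = pd.DataFrame({"isbn13": ["a"], "pages": ["192 pages"]})
merged = merge_on_isbn([meta, pages])
```

With a left join, book `b` is kept with a missing `pages` value. An inner join would drop it. When every frame covers the same ISBNs, both give the same rows.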

In [19]:
result.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2935 entries, 0 to 2934
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title1     2935 non-null   object
 1   author1    2935 non-null   object
 2   isbn13     2935 non-null   object
 3   pages      2599 non-null   object
 4   edition    2795 non-null   object
 5   genres     2935 non-null   object
 6   cover_url  2713 non-null   object
 7   rating     2799 non-null   object
 8   count      2799 non-null   object
dtypes: object(9)
memory usage: 229.3+ KB


Some columns have missing values (`pages`, `edition`, `cover_url`, `rating`, and `count`), but that's OK.
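
The scraped columns are also still raw strings: ratings look like `\n 3.80\n`, counts like `\n 6,416\n ratings\n`, and pages like `192 pages`. A later cleaning step could convert them to numbers along these lines. `parse_rating` and `parse_int` are hypothetical helpers, not part of `GoodReadsScraper`, and they return `None` for missing values.

```python
import re
from typing import Optional


def parse_rating(raw: object) -> Optional[float]:
    r"""Extract the first decimal number, e.g. '\n 3.80\n' -> 3.8."""
    if not isinstance(raw, str):
        return None  # covers None and NaN
    match = re.search(r"\d+(?:\.\d+)?", raw)
    return float(match.group()) if match else None


def parse_int(raw: object) -> Optional[int]:
    r"""Extract the first integer, ignoring thousands separators.

    Works for '\n 6,416\n ratings\n' (-> 6416) and '192 pages' (-> 192).
    """
    if not isinstance(raw, str):
        return None  # covers None and NaN
    match = re.search(r"\d[\d,]*", raw)
    return int(match.group().replace(",", "")) if match else None
```

They could be applied column by column, for example `result['rating'].map(parse_rating)`.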

Finally, I save the data as a CSV file; the pickle export below is left commented out.

In [20]:
result.to_csv('bestsellers_merged.csv', index=False)

In [16]:
# result.to_pickle('complete_bestsellers.pkl')
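
One thing to keep in mind with the CSV export: list-valued cells such as `genres` are written as their string representation, so they come back as plain strings when the file is read again. The sketch below shows one way to restore them, assuming the cells are nested lists of strings as the output above suggests. `parse_genres` is a hypothetical helper.

```python
import ast
from typing import List


def parse_genres(cell: str) -> List[str]:
    """Turn "[['Fiction'], ['Mystery']]" back into ['Fiction', 'Mystery']."""
    nested = ast.literal_eval(cell)  # safely evaluates Python literals only
    flat: List[str] = []
    for item in nested:
        if isinstance(item, (list, tuple)):
            flat.extend(item)  # unwrap the inner one-element lists
        else:
            flat.append(item)
    return flat
```

When loading the file, this could be passed as `pd.read_csv('bestsellers_merged.csv', dtype={'isbn13': str}, converters={'genres': parse_genres})`. The `dtype` entry keeps the ISBNs as text rather than integers.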