# Scraping Data from Goodreads.com

In [1]:
import pandas as pd
import numpy as np

from RawDataCollector import GoodReadsScraper

First, I run another Jupyter notebook to load the list of ISBNs and the metadata about the books.

In [2]:
# Importing df_final from other notebook
%run "..\raw\bestsellers_generating.ipynb"


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8555 entries, 0 to 8554
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   isbn13   8552 non-null   object
 1   title1   8555 non-null   object
 2   author1  8555 non-null   object
 3   weeks    8555 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 267.5+ KB

(8555, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8555 entries, 0 to 8554
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   isbn13   8555 non-null   object
 1   title1   8555 non-null   object
 2   author1  8555 non-null   object
 3   weeks    8555 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 267.5+ KB


author_title_isbn13
gillian flynn gone girl 9780307588388               112
george rr martin a game of thrones 9780553897845     77
kristin hannah the nightingale 9781466850606         74
e l james fifty shades of grey 9781612

With the list of ISBNs and the dataframe of book metadata available, I start:
### Web Scraping

First we check that the _bestsellers_ dataframe with _ISBNs_ is defined and not empty.
Otherwise we raise a `ValueError`.

In [3]:
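# df_final is created by the notebook run above; it must exist and contain rows.
if globals().get('df_final') is not None and not df_final.empty:
    df1 = df_final
    isbns = df_final['isbn13']
    print('All cool!')
else:
    raise ValueError('variables not found...')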

All cool!
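
Since every ISBN turns into a request later on, it may be worth filtering out malformed values before scraping. The snippet below is only a minimal sketch: `is_valid_isbn13` is a hypothetical helper of mine, not part of `RawDataCollector`, and it assumes the ISBNs are stored as strings.

```python
def is_valid_isbn13(value):
    """Return True if value is a 13-digit string with a correct ISBN-13 check digit."""
    # Missing values (None, NaN) and non-string types are rejected outright.
    if not isinstance(value, str):
        return False
    digits = value.strip()
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # ISBN-13 weights alternate 1, 3, 1, 3, ... across all 13 digits;
    # the weighted sum must be divisible by 10.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0
```

It could then be applied as `isbns = [i for i in df_final['isbn13'] if is_valid_isbn13(i)]`.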


If everything is in order, we proceed to scrape additional data from Goodreads.com.

In [4]:
scrapped = GoodReadsScraper(isbns)
html_list = scrapped.store_html()
print(scrapped)

You just scrapped 5 books from the Goodreads.com!


I instantiate `GoodReadsScraper`, a personal module that takes the ISBN numbers, scrapes the data for each book, and stores the HTML text in a list.

This takes some time, as we are making thousands of requests.
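
The internals of `store_html` are not shown here, so the following is only a generic sketch of how a long run like this could be paced and made tolerant of failed requests. The names `fetch_all`, `fetch`, `delay`, `retries` and `sleep` are all hypothetical; `fetch` stands for whatever function downloads one page.

```python
import time

def fetch_all(isbns, fetch, delay=1.0, retries=2, sleep=time.sleep):
    """Call fetch(isbn) for every ISBN, pausing between calls.

    Returns (pages, failed): a dict of isbn -> html and a list of ISBNs
    that still failed after all retries.
    """
    pages, failed = {}, []
    for isbn in isbns:
        for attempt in range(retries + 1):
            try:
                pages[isbn] = fetch(isbn)
                break
            except Exception:
                # Wait a little longer after each failed attempt.
                sleep(delay * (attempt + 1))
        else:
            # The inner loop never hit `break`, so every attempt failed.
            failed.append(isbn)
        # Pause before moving on to the next book.
        sleep(delay)
    return pages, failed
```

Returning the failed ISBNs separately means a second pass can retry only those, instead of repeating the whole run.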

Once scraping is done, we extract the **number of pages, edition, cover picture URL** and **genres** and convert them into a `pandas` dataframe.

In [5]:
df2 = scrapped.data_converter()
print(df2.shape)

(5, 4)


In [7]:
df3 = scrapped.cover_url_converter()
print(df3.shape)

(5, 2)


In [8]:
df4 = scrapped.pop_converter()
print(df4.shape)

(5, 3)


In [9]:
df6 = scrapped.description()
df6

| | isbn13 | description |
|---|---|---|
| 0 | 9780345541444 | \nAt nearly one hundred years old, Thalia Mars... |


After all necessary methods have run, we have four dataframes, which we join on the _ISBN13_ number and merge into one.

In [11]:
result = df1.merge(df2, on='isbn13').merge(df3, on='isbn13').merge(df4, on='isbn13')
result.tail()

| | title1 | author1 | isbn13 | pages | released | edition | cover_url | rating | count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | never never | james patterson and candice fox | 9780316433174 | 363 pages | \n (first published August 25th 2... | Hardcover | https://i.gr-assets.com/images/S/compressed.ph... | \n 3.62\n | \n 13,917\n ratings\n |
| 1 | devil in spring | lisa kleypas | 9780062371904 | 384 pages | \n —\n 37 likes\n | ebook | https://i.gr-assets.com/images/S/compressed.ph... | \n 4.08\n | \n 22,036\n ratings\n |
| 2 | aftermath:: empire's end | chuck wendig | 9781101966969 | 423 pages | \n —\n 7 likes\n | Hardcover | https://i.gr-assets.com/images/S/compressed.ph... | \n 3.79\n | \n 9,247\n ratings\n |
| 3 | echoes in death | j d robb | 9781250123145 | 400 pages | \n —\n 12 likes\n | ebook | https://i.gr-assets.com/images/S/compressed.ph... | \n 4.41\n | \n 19,576\n ratings\n |
| 4 | heartbreak hotel | jonathan kellerman | 9780345541444 | 325 pages | \n —\n 0 likes\n | Nook | https://i.gr-assets.com/images/S/compressed.ph... | \n 3.84\n | \n 11,852\n ratings\n |
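
A note on the chained `merge` calls: `merge` defaults to an inner join, so a book missing from any one of the dataframes silently drops out of the result. The sketch below uses two made-up toy frames (not the real data) to show the difference from `how="left"`, plus the `validate` argument, which raises an error if an ISBN appears more than once on either side.

```python
import pandas as pd

# Toy frames standing in for df1 and df2; the values are made up.
books = pd.DataFrame({"isbn13": ["111", "222", "333"],
                      "title1": ["a", "b", "c"]})
details = pd.DataFrame({"isbn13": ["111", "333"],
                        "pages": ["363 pages", "400 pages"]})

# Default inner join: "222" has no match in `details`, so it is dropped.
inner = books.merge(details, on="isbn13")

# Left join keeps every book; validate raises MergeError if an ISBN
# is duplicated in either frame.
left = books.merge(details, on="isbn13", how="left", validate="one_to_one")

print(len(inner), len(left))       # row counts of the two joins
print(left["pages"].isna().sum())  # books without scraped details
```

Whether a left join is preferable depends on whether books without scraped details should stay in the final dataset.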


In [12]:
result.shape

(5, 9)

In [13]:
result.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title1     5 non-null      object
 1   author1    5 non-null      object
 2   isbn13     5 non-null      object
 3   pages      5 non-null      object
 4   released   5 non-null      object
 5   edition    5 non-null      object
 6   cover_url  5 non-null      object
 7   rating     5 non-null      object
 8   count      5 non-null      object
dtypes: object(9)
memory usage: 400.0+ bytes
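
Every column is still of `object` dtype because the scraped values are raw strings such as `363 pages` or a rating count wrapped in newlines. A minimal sketch of how they could be turned into numbers is below; the helper names are hypothetical and the patterns are based only on the rows printed above, so other pages may need different handling.

```python
import re

def _first_match(pattern, text):
    """Return the first regex match in text, or None for non-strings (e.g. NaN)."""
    if not isinstance(text, str):
        return None
    match = re.search(pattern, text)
    return match.group() if match else None

def parse_pages(text):
    # '363 pages' -> 363
    found = _first_match(r"\d+", text)
    return int(found) if found is not None else None

def parse_rating(text):
    # a rating such as 3.62 surrounded by whitespace -> 3.62
    found = _first_match(r"\d+(?:\.\d+)?", text)
    return float(found) if found is not None else None

def parse_count(text):
    # '13,917 ratings' surrounded by whitespace -> 13917
    found = _first_match(r"\d[\d,]*", text)
    return int(found.replace(",", "")) if found is not None else None
```

Each helper can be applied column-wise, for example `result['pages'].map(parse_pages)`.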


I have some missing values, but that's OK.

Finally, I save the data as a pickle file.

In [None]:
# result.to_csv('bestsellers_merged.csv', index=False)

In [None]:
# result.to_pickle('complete_bestsellers.pkl')
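
One reason to prefer the pickle over the CSV is that it keeps column dtypes. The sketch below uses a one-row toy frame and a temporary directory (both just for illustration) to show that an ISBN stored as a string comes back as an integer from a plain `read_csv`, unless the dtype is given explicitly.

```python
import os
import tempfile

import pandas as pd

# Toy frame; the ISBN is deliberately stored as a string.
df = pd.DataFrame({"isbn13": ["9780316433174"], "pages": [363]})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "books.pkl")
    csv_path = os.path.join(tmp, "books.csv")

    df.to_pickle(pkl_path)
    df.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pkl_path)   # dtypes preserved
    from_csv = pd.read_csv(csv_path)         # isbn13 is parsed as an integer
    from_csv_str = pd.read_csv(csv_path, dtype={"isbn13": str})  # forced back to text

print(from_pickle.equals(df))
print(from_csv["isbn13"].dtype)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.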