# Scraping Data from Goodreads.com

In [None]:
import requests
import time
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
from GoodReadsScraper import GoodReadsScraper

First, I run another Jupyter notebook to load the list of ISBNs and the book metadata.

In [None]:
# Importing df_final from other notebook
%run "..\data\\isbn13_list.ipynb"

With the list of ISBNs and the dataframe of book metadata available, I can start scraping.

### Web Scraping

First, we check that the _bestsellers_ dataframe with the _ISBNs_ is defined and not empty.
Otherwise, we raise a `ValueError`.

In [None]:
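# Proceed only if the dataframe exists and actually contains rows
if df_final is not None and not df_final.empty:
    df1 = df_final
    isbns = df_final['isbn13']
else:
    raise ValueError('df_final is missing or empty; run isbn13_list.ipynb first')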

If everything is in order, we proceed to scrape additional data from Goodreads.com.

In [None]:
scrapped = GoodReadsScraper(isbns)
print(scrapped)

I instantiate `GoodReadsScraper`, a personal module that takes the ISBNs, scrapes the page for each book, and stores the HTML text in a list.

This will take some time as we are performing thousands of calls.
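
The internals of `GoodReadsScraper` are not shown in this notebook, so the following is only a minimal sketch of what such a fetch loop could look like, not the module's actual code. The URL pattern, the `fetch_html` and `fetch_all` names, and the one-second delay are all assumptions made for illustration. The pause between requests is the main reason thousands of calls take a while.

```python
import time
from typing import Callable, Iterable, List, Optional
from urllib.request import Request, urlopen

# Assumed URL pattern for an ISBN lookup; not taken from GoodReadsScraper.
BASE_URL = "https://www.goodreads.com/book/isbn/{isbn}"


def fetch_html(isbn: str, timeout: float = 10.0) -> Optional[str]:
    """Download one book page; return None on a network or HTTP error."""
    request = Request(BASE_URL.format(isbn=isbn),
                      headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:  # URLError, HTTPError and timeouts all derive from OSError
        return None


def fetch_all(isbns: Iterable,
              fetch: Callable[[str], Optional[str]] = fetch_html,
              delay: float = 1.0) -> List[Optional[str]]:
    """Fetch every ISBN in order, pausing between requests to stay polite."""
    pages = []
    for isbn in isbns:
        pages.append(fetch(str(isbn)))
        time.sleep(delay)
    return pages
```

Because `fetch` is a parameter, the loop can be tried offline with a stand-in function, for example `fetch_all(isbns[:3], fetch=lambda isbn: "<html></html>", delay=0)`. Failed downloads stay in the list as `None`, so the list keeps the same order and length as the ISBNs.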

After the scraping is done, we extract the **number of pages, edition, cover picture URL** and **genres** and convert them into `pandas` dataframes.

In [None]:
df2 = scrapped.data_converter()
print(df2.shape)

In [None]:
df3 = scrapped.genre_converter()
print(df3.shape)

In [None]:
df4 = scrapped.cover_url_converter()
print(df4.shape)

After all necessary methods have run, we have four different dataframes, which we join on the _ISBN13_ number and merge into one.

In [None]:
result = df1.merge(df2, on='isbn13').merge(df3, on='isbn13').merge(df4, on='isbn13')
result.tail()
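
As a side note on the join: `merge` defaults to an inner join, so a book missing from any one of the scraped dataframes drops out of the result. The sketch below uses made-up rows (not the bestsellers data) to show the difference between the default and a left join. The `how`, `validate`, and `indicator` arguments are standard `pandas` options, but whether a left join suits this dataset is an assumption.

```python
import pandas as pd

# Made-up rows standing in for df1 (metadata) and df2 (scraped details)
books = pd.DataFrame({
    "isbn13": ["9780000000001", "9780000000002", "9780000000003"],
    "title": ["A", "B", "C"],
})
details = pd.DataFrame({
    "isbn13": ["9780000000001", "9780000000003"],
    "pages": [320, 208],
})

# Default inner join: the book without scraped details disappears
inner = books.merge(details, on="isbn13")

# Left join: every book is kept, and missing details become NaN.
# validate raises MergeError if an ISBN is duplicated on either side;
# indicator adds a _merge column that shows where each row came from.
left = books.merge(details, on="isbn13", how="left",
                   validate="one_to_one", indicator=True)

print(inner.shape)
print(left[["isbn13", "pages", "_merge"]])
```

Comparing `len(inner)` with `len(books)` is a quick way to see how many books were lost in the join.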

In [None]:
result.shape

In [None]:
result.info()

The result has some missing values, but that is acceptable here.

Finally, I save the data as a pickle file.

In [None]:
# result.to_csv('complete_bestsellers.csv', index=False)
result.to_pickle('complete_bestsellers.pkl')
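
For reference, a pickle file can be loaded back with `pd.read_pickle`, and the column dtypes survive the round trip. The small sketch below uses a made-up dataframe and a temporary directory rather than `complete_bestsellers.pkl`. Pickle files should only be loaded from a trusted source.

```python
import os
import tempfile

import pandas as pd

# Made-up row; the real file would be complete_bestsellers.pkl
frame = pd.DataFrame({"isbn13": ["9780000000001"], "pages": [320]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pkl")
    frame.to_pickle(path)            # write the dataframe to disk
    restored = pd.read_pickle(path)  # read it back with dtypes intact

print(restored.equals(frame))
```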