# Data sourcing using web scraping and the Goodreads API

This notebook scrapes the LibraryThing website and queries the Goodreads API to collect additional book information. The LibraryThing dataset downloaded from Prof. McAuley's <a href="https://cseweb.ucsd.edu/~jmcauley/datasets.html#social_data" target="_blank">website</a> (see the data_preprocessing notebook in the extra_info directory) does not contain book titles, author names or ISBNs. Here, I collect this information for a subset of the 5000 books with the most reviews. Having the ISBNs for these books will allow me to get book genre information from the Goodreads API.

### Imports

In [1]:
import pandas as pd
from modules.book_info_extractor import extract_book_details, clean_up_dataframe, generate_clean_isbn_and_id_lists
from modules.scraper import write_htmls_to_csv
from modules.goodreads_api_functions import acquire_goodreads_id, get_book_titles, get_book_shelves
from modules.book_genre_extractor import extract_book_genre_info

### Configs

In [2]:
RAW_HTML_OUT_FOLDER_PATH = '/Users/ddefforey1/work/dsi-course/capstone_datasets'
TOP_5K_MOST_REVIEWED_BOOKS_PATH = '/Users/ddefforey1/work/dsi-course/capstone_datasets/top_5K_books.csv'
CLEAN_BOOK_GENRES_PATH = '/Users/ddefforey1/work/dsi-course/capstone_datasets/clean_book_genres.csv'

### Web Requests and Writing Output to a CSV

In [3]:
# importing the book ids for 5K books with the most reviews
top_5K_books = pd.read_csv(TOP_5K_MOST_REVIEWED_BOOKS_PATH)

In [4]:
# make a list of book IDs
books_list = list(top_5K_books.book_id)

In [5]:
raw_htmls_path = write_htmls_to_csv(books_list=books_list, path=RAW_HTML_OUT_FOLDER_PATH)

Pages scraped incorrectly: []
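
The `write_htmls_to_csv` helper lives in `modules.scraper` and its code is not shown in this notebook. As a rough illustration of what such a step involves, here is a minimal sketch using only the standard library. The URL pattern, the names `fetch_html` and `scrape_to_csv`, and the one-second pause are my assumptions for illustration and not the module's actual implementation.

```python
import csv
import time
import urllib.request
from typing import Callable, Iterable, List

# Assumed URL pattern for a LibraryThing work page (not taken from modules.scraper).
WORK_URL = "https://www.librarything.com/work/{book_id}"


def fetch_html(book_id: int, timeout: float = 30.0) -> str:
    """Download one work page and return its HTML as text."""
    url = WORK_URL.format(book_id=book_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_to_csv(
    book_ids: Iterable[int],
    out_path: str,
    fetch: Callable[[int], str] = fetch_html,
    delay_seconds: float = 1.0,
) -> List[int]:
    """Write (book_id, raw_html) rows to out_path and return the IDs that failed."""
    failed: List[int] = []
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["book_id", "raw_html"])
        for book_id in book_ids:
            try:
                writer.writerow([book_id, fetch(book_id)])
            except OSError:
                # URLError, HTTPError and socket timeouts are all OSError subclasses.
                # Record the failure and keep going instead of aborting the run.
                failed.append(book_id)
            time.sleep(delay_seconds)  # pause between requests
    return failed
```

Passing the fetch function as a parameter keeps the loop testable without network access, and returning the failed IDs mirrors the "Pages scraped incorrectly" list printed above.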


### Parsing Data with BeautifulSoup

In [6]:
# loading the csv containing the raw htmls
raw_data = pd.read_csv(raw_htmls_path)
raw_data.head()

Unnamed: 0,book_id,raw_html
0,4979986,<!DOCTYPE html><html>\n<head><title>The Hunger...
1,8384326,<!DOCTYPE html><html>\n<head><title>Twilight b...
2,1541442,<!DOCTYPE html><html>\n<head><title>The Girl w...
3,393681,<!DOCTYPE html><html>\n<head><title>The Book T...
4,8662515,<!DOCTYPE html><html>\n<head><title>Catching F...


In [7]:
book_info = raw_data.raw_html.apply(extract_book_details).copy()

In [8]:
book_info = clean_up_dataframe(df=book_info.copy(), books_list=books_list)
book_info.head()

Unnamed: 0,id,book_title,author,isbn
0,4979986,The Hunger Games,Suzanne Collins,439023483
1,8384326,Twilight (2005),Stephenie Meyer,316015849
2,1541442,The Girl with the Dragon Tattoo (2005),Stieg Larsson,307454541
3,393681,The Book Thief (2007),Markus Zusak,375842209
4,8662515,Catching Fire,Suzanne Collins,439023491
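
The parsing itself happens inside `extract_book_details`, which is not shown here. The `raw_html` preview above suggests that each page's `<title>` starts with the book title followed by the author, so a title-only extraction could look roughly like the sketch below. It uses the standard library's `HTMLParser` rather than BeautifulSoup to stay dependency-free, and the assumed layout `"<title> by <author> | LibraryThing"` as well as the names `TitleParser` and `title_and_author` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional, Tuple


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def title_and_author(raw_html: str) -> Tuple[Optional[str], Optional[str]]:
    """Split the page title into (book title, author); None where unavailable."""
    parser = TitleParser()
    parser.feed(raw_html)
    parser.close()
    # Assumed layout: "<book title> by <author> | LibraryThing"
    text = parser.title.strip().split(" | ")[0]
    # Split on the last " by " so titles that contain " by " stay intact.
    title, separator, author = text.rpartition(" by ")
    if not separator:
        return (text or None), None
    return title.strip(), author.strip()
```

With BeautifulSoup, `soup.title.string` would typically return the same title text, and the ISBN would need a separate lookup elsewhere in the page.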


In [9]:
book_info.shape

(3638, 4)

Some books are missing ISBNs because their LibraryThing pages do not list one. For now, I continue with the books that have ISBNs; the project could later be expanded to include the books without them.
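
The ISBNs in the preview above are nine digits long, which suggests that a leading zero was dropped when the values were stored as integers (this is my reading of the output, not something the notebook states). A minimal sketch of how an ISBN-10 could be restored and checked before it is sent to an API follows; `normalize_isbn10` and `is_valid_isbn10` are hypothetical helper names and are not part of `modules.book_info_extractor`.

```python
def normalize_isbn10(value) -> str:
    """Return a 10-character ISBN-10 string, restoring dropped leading zeros."""
    text = str(value).strip().replace("-", "").upper()
    return text.zfill(10)


def is_valid_isbn10(isbn: str) -> bool:
    """Check the ISBN-10 check digit: weights 10..1, sum divisible by 11."""
    if len(isbn) != 10:
        return False
    total = 0
    for position, char in enumerate(isbn):
        if char == "X" and position == 9:
            digit = 10  # "X" stands for 10 and is only allowed as the check digit
        elif char in "0123456789":
            digit = int(char)
        else:
            return False
        total += (10 - position) * digit
    return total % 11 == 0


print(normalize_isbn10(439023483))                   # 0439023483
print(is_valid_isbn10(normalize_isbn10(439023483)))  # True
```

Reading the column as strings in the first place, for example with `pd.read_csv(..., dtype={'isbn': str})`, would avoid losing the zeros.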

### Collect Goodreads Book Identifiers using Goodreads API

Having Goodreads book IDs will allow me to collect book genre information, the last piece of information needed for the topic model.

In [10]:
isbns_list, id_list = generate_clean_isbn_and_id_lists(df=book_info.copy())

In [11]:
goodreads_ids = acquire_goodreads_id(isbns_list)

In [12]:
# making a dataframe with ISBNs and their corresponding Goodreads IDs
goodreads_book_details = pd.DataFrame({
    'id': id_list,
    'isbn': isbns_list,
    'goodreads_id': goodreads_ids
})

### Collect book genre information using the Python interface for Goodreads

In [13]:
# collect book titles from Goodreads to confirm that they match those from the LibraryThing dataset
goodreads_book_details['goodreads_book_titles'] = get_book_titles(goodreads_book_details.goodreads_id)

In [14]:
# collect goodreads shelves (from which genre info will be derived)
goodreads_book_details['goodreads_shelves'] = get_book_shelves(goodreads_book_details.goodreads_id)

In [15]:
goodreads_book_details.head()

Unnamed: 0,id,isbn,goodreads_id,goodreads_book_titles,goodreads_shelves
0,4979986,439023483,2767052,"The Hunger Games (The Hunger Games, #1)","[{'@name': 'to-read', '@count': '962270'}, {'@..."
1,8384326,316015849,41865,"Twilight (Twilight, #1)","[{'@name': 'to-read', '@count': '715584'}, {'@..."
2,1541442,307454541,5291539,"The Girl with the Dragon Tattoo (Millennium, #1)","[{'@name': 'to-read', '@count': '903225'}, {'@..."
3,393681,375842209,39395800,The Book Thief,"[{'@name': 'to-read', '@count': '1180327'}, {'..."
4,8662515,439023491,6148028,"Catching Fire (The Hunger Games, #2)","[{'@name': 'to-read', '@count': '250547'}, {'@..."
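
The Goodreads titles carry series suffixes such as "(The Hunger Games, #1)", while the LibraryThing titles sometimes end in a year such as "(2005)", so a direct string comparison would flag many false mismatches. One possible way to compare them is to strip trailing parentheticals and ignore case, as sketched below. The helper names `normalize_title` and `titles_match` are hypothetical, and this is not necessarily how the titles were checked in this project.

```python
import re

# Trailing parenthetical such as "(2005)" or "(The Hunger Games, #1)"
_TRAILING_PARENS = re.compile(r"\s*\([^()]*\)\s*$")


def normalize_title(title: str) -> str:
    """Lowercase a title, strip trailing parentheticals and collapse whitespace."""
    cleaned = title.strip()
    while True:
        stripped = _TRAILING_PARENS.sub("", cleaned)
        if stripped == cleaned:
            break
        cleaned = stripped
    return " ".join(cleaned.lower().split())


def titles_match(librarything_title: str, goodreads_title: str) -> bool:
    """Return True when both titles are equal after normalization."""
    return normalize_title(librarything_title) == normalize_title(goodreads_title)


print(titles_match("Twilight (2005)", "Twilight (Twilight, #1)"))  # True
```

Rows where `titles_match` returns False are candidates for manual review, since an ISBN can resolve to a different edition or an unrelated book.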


### Extract book genre information

In [16]:
# removing an academic textbook (row label 566) and renumbering the index
goodreads_book_details = goodreads_book_details.drop(index=566).reset_index(drop=True)

In [17]:
# extract book genres from goodreads shelves
goodreads_book_details['book_genres'] = goodreads_book_details.goodreads_shelves.map(extract_book_genre_info)
goodreads_book_details.head()

Unnamed: 0,id,goodreads_id,isbn,goodreads_book_titles,goodreads_shelves,book_genres
0,4979986,2767052,439023483,"The Hunger Games (The Hunger Games, #1)","[{'@name': 'to-read', '@count': '870985'}, {'@...",[science-fiction]
1,8384326,41865,316015849,"Twilight (Twilight, #1)","[{'@name': 'to-read', '@count': '599873'}, {'@...",[fantasy]
2,1541442,5291539,307454541,"The Girl with the Dragon Tattoo (Millennium, #1)","[{'@name': 'to-read', '@count': '745270'}, {'@...","[mystery, fiction, thriller]"
3,393681,39395800,375842209,The Book Thief,"[{'@name': 'to-read', '@count': '1013995'}, {'...","[non-fiction, fiction]"
4,8662515,6148028,439023491,"Catching Fire (The Hunger Games, #2)","[{'@name': 'to-read', '@count': '221799'}, {'@...",[science-fiction]
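
The logic of `extract_book_genre_info` is not shown in this notebook. Each shelves entry is a list of dictionaries with `'@name'` and `'@count'` keys, so one plausible approach is to keep only the shelf names that appear in a fixed list of genres. The sketch below illustrates that idea; the `KNOWN_GENRES` set, the `min_count` threshold and the function name `genres_from_shelves` are hypothetical, and its output will not necessarily match the `book_genres` column above.

```python
import ast
from typing import Dict, FrozenSet, List, Union

# Hypothetical whitelist; the set used by extract_book_genre_info is not shown.
KNOWN_GENRES = frozenset({
    "fiction", "non-fiction", "fantasy", "science-fiction",
    "mystery", "thriller", "romance", "horror",
})


def genres_from_shelves(
    shelves: Union[str, List[Dict[str, str]]],
    known_genres: FrozenSet[str] = KNOWN_GENRES,
    min_count: int = 1,
) -> List[str]:
    """Return shelf names that are known genres, in shelf order, without repeats."""
    if isinstance(shelves, str):
        # A list column saved to CSV comes back as its string representation.
        shelves = ast.literal_eval(shelves)
    genres: List[str] = []
    for shelf in shelves:
        name = shelf.get("@name", "")
        count = int(shelf.get("@count", 0))
        if name in known_genres and count >= min_count and name not in genres:
            genres.append(name)
    return genres
```

Shelves such as `to-read` describe reading status rather than genre, which is why a whitelist (or an equivalent blacklist) is needed before the names can serve as genre labels.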


In [18]:
# saving the dataframe for future use
goodreads_book_details.to_csv(CLEAN_BOOK_GENRES_PATH, index=False)