<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

#  Book Reviews Capstone Project: Web Scraping


*Delphine Defforey*

___


<font color=navy>
    The purpose of this notebook is to scrape the LibraryThing website for additional information on books. The dataset in the first notebook does not contain book titles, author names or ISBNs, which I need for my analysis. In this notebook, I collect that information for a subset of the data: the 5,000 books with the most reviews. Having the ISBNs for these books will allow me to get book genre information from the Goodreads API (see notebook #3 for more details).
    </font>

### Imports and Settings

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import time
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
from tqdm import tqdm_notebook

import csv
import datetime

In [2]:
plt.style.use('ggplot')
sns.set(font_scale=1.5)
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### Configs

In [3]:
raw_html_out_folder = '/Users/ddefforey1/work/capstone_datasets'
top_5K_most_reviewed_books_path = '/Users/ddefforey1/work/capstone_datasets/top_5K_books.csv'
raw_book_info_path = '/Users/ddefforey1/work/capstone_datasets/raw_book_info.csv'
cleaner_book_info_path = '/Users/ddefforey1/work/capstone_datasets/cleaner_book_info2.csv'
missing_ibns_path = '/Users/ddefforey1/work/capstone_datasets/missing_isbns.csv'

### Web Requests and Writing Output to a CSV

In [4]:
def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text, otherwise return None.
    """
    try:
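        # the timeout (30 s is an arbitrary choice) keeps a stalled request from hanging the scrape
        with closing(get(url, timeout=30)) as resp: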
            if is_good_response(resp):
                return resp.text
            else:
                return None

    except RequestException as e:
        print('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    # .get() avoids a KeyError when the server sends no Content-Type header
    content_type = resp.headers.get('Content-Type', '').lower()
    return (resp.status_code == 200
            and 'html' in content_type)


In [5]:
# importing the book ids for 5K books with the most reviews
top_5K_books = pd.read_csv(top_5K_most_reviewed_books_path)

In [6]:
top_5K_books.head()

Unnamed: 0,book_id,n_comments
0,4979986,2255
1,8384326,1661
2,1541442,1232
3,393681,1139
4,8662515,1098


In [7]:
# make a list of book IDs
books_list = list(top_5K_books.book_id)

In [10]:
# writing the scraped HTML pages to a csv
# the file name includes a date-time stamp to avoid accidentally overwriting earlier runs
# a one-second pause at the end of each iteration reduces the risk of getting blocked

failed_book_ids = []

timestamp = datetime.datetime.now().strftime('%y-%m-%dT%H:%M:%S')
raw_htmls_path = f'{raw_html_out_folder}/scraped_raw_html_librarything_{timestamp}.csv'
with open(raw_htmls_path, 'w') as csv_file:
    fieldnames = ['book_id', 'raw_html']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()

    # scrape book info!
    for book_id in tqdm_notebook(books_list):
        url = 'https://www.librarything.com/work/{}'.format(book_id)
        scraped_raw_html = simple_get(url)        
        if scraped_raw_html is not None:
            writer.writerow({'book_id': book_id, 'raw_html': scraped_raw_html})
        else:
            failed_book_ids.append(book_id)
        time.sleep(1)


In [11]:
# checking if there were any pages that weren't scraped properly
failed_book_ids

[]
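
The list above came back empty for this run, but a longer or less lucky scrape may leave some IDs behind. A minimal sketch of a retry pass is shown below. The function name `retry_failed` and its parameters are hypothetical (they are not part of this notebook); `fetch` is assumed to behave like `simple_get`, returning the page HTML or `None`.

```python
import time


def retry_failed(book_ids, fetch, max_attempts=3, pause=1.0):
    """Re-request each ID in `book_ids` up to `max_attempts` times.

    `fetch` is any callable that takes a URL and returns the page HTML,
    or None on failure (the same contract as `simple_get`).
    Returns a tuple (recovered, still_failed).
    """
    recovered = {}
    still_failed = []
    for book_id in book_ids:
        url = 'https://www.librarything.com/work/{}'.format(book_id)
        html = None
        for attempt in range(max_attempts):
            html = fetch(url)
            if html is not None:
                break
            # wait a little longer after each failed attempt
            time.sleep(pause * (attempt + 1))
        if html is None:
            still_failed.append(book_id)
        else:
            recovered[book_id] = html
    return recovered, still_failed
```

With the names used in this notebook, the call would look like `recovered, still_failed = retry_failed(failed_book_ids, simple_get)`. Passing the fetch function in as an argument also makes the loop easy to try out with a stand-in function before hitting the real site.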

### Parsing Data with BeautifulSoup

In [12]:
raw_htmls_path = '/Users/ddefforey1/work/capstone_datasets/scraped_raw_html_librarything_19-05-02T13:40:13.csv'

In [13]:
# loading the csv containing the raw htmls
raw_data = pd.read_csv(raw_htmls_path)

In [14]:
raw_data.head()

Unnamed: 0,book_id,raw_html
0,4979986,<!DOCTYPE html><html>\n<head><title>The Hunger...
1,8384326,<!DOCTYPE html><html>\n<head><title>Twilight b...
2,1541442,<!DOCTYPE html><html>\n<head><title>The Girl w...
3,393681,<!DOCTYPE html><html>\n<head><title>The Book T...
4,8662515,<!DOCTYPE html><html>\n<head><title>Catching F...


In [15]:
raw_data.shape

(5000, 2)

In [16]:
def extract_book_title(entry):
    """ 
    Returns a book title for a given html
    """
    try:
        return entry.find('div', attrs={'class':'headsummary'}).find('h1').text.strip()
    except AttributeError:
        # the expected element is missing from the page
        return np.nan

In [17]:
def extract_book_author(entry):
    """ 
    Returns the name of the author of a book for a given html
    """
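    try:
        author = entry.find('div', attrs={'class':'headsummary'}).find('h2').text.strip()
        # drop only the leading "by " so that names containing "by " stay intact
        return author[3:] if author.startswith('by ') else author
    except AttributeError:
        # the expected element is missing from the page
        return np.nan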

In [18]:
def extract_book_isbn(entry):
    """ 
    Returns a book's International Standard Book Number (ISBN) for a given html
    """
    try:
        return entry.find('div', attrs={'class':'description'}).find('h4').text.strip()
    except AttributeError:
        # the expected element is missing from the page
        return np.nan

In [19]:
def extract_book_details(entry):
    """
    Passes entry into BeautifulSoup, then passes the output into three functions that extract book titles,
    author names and ISBNs from each html. The function returns a pandas series with these three book attributes.
    """
    soup = BeautifulSoup(entry, 'html.parser')
    title = extract_book_title(soup)
    author = extract_book_author(soup)
    isbn = extract_book_isbn(soup)
    return pd.Series([title,author,isbn], index=['book_title', 'author', 'isbn'])
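
To make the selectors used above easier to follow, here is a small sketch that runs them on a toy page. The HTML is a hypothetical, heavily simplified stand-in for a LibraryThing work page, not a copy of the real markup; only the `headsummary` and `description` class names are taken from the functions above.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for a LibraryThing work page.
sample_html = """
<html><body>
  <div class="headsummary">
    <h1>The Hunger Games</h1>
    <h2>by Suzanne Collins</h2>
  </div>
  <div class="description">
    <h4>Amazon.com Product Description (ISBN 0439023483, Hardcover)</h4>
  </div>
</body></html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Title and author sit in the same "headsummary" block.
head = soup.find('div', attrs={'class': 'headsummary'})
title = head.find('h1').text.strip()
author = head.find('h2').text.strip()
if author.startswith('by '):
    author = author[3:]  # drop only the leading "by "

# The ISBN is buried in a longer description string at this stage.
isbn_text = soup.find('div', attrs={'class': 'description'}).find('h4').text.strip()

# On a page without the expected block, find() returns None, so chaining
# another .find() on the result raises AttributeError.
empty_soup = BeautifulSoup('<html><body></body></html>', 'html.parser')
missing = empty_soup.find('div', attrs={'class': 'description'})

print(title, '|', author, '|', isbn_text, '|', missing)
```

The last part shows why the extraction functions need a `try`/`except`: a missing element surfaces as `None`, and the next chained call fails.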

In [20]:
book_info = raw_data.raw_html.apply(extract_book_details)

In [21]:
# saving a copy of the raw book information after it was parsed
book_info.to_csv(raw_book_info_path, index=False)

In [22]:
book_info = pd.read_csv(raw_book_info_path)

In [23]:
book_info.head()

Unnamed: 0,book_title,author,isbn
0,The Hunger Games,Suzanne Collins,Amazon.com Product Description (ISBN 043902348...
1,Twilight (2005),Stephenie Meyer,"Amazon.com Amazon.com Review (ISBN 0316015849,..."
2,The Girl with the Dragon Tattoo (2005),Stieg Larsson,"Amazon.com Amazon.com Review (ISBN 0307454541,..."
3,The Book Thief (2007),Markus Zusak,Amazon.com Product Description (ISBN 037584220...
4,Catching Fire,Suzanne Collins,Amazon.com Product Description (ISBN 043902349...


In [24]:
book_info.shape

(5000, 3)

In [25]:
# checking for missing values
book_info.isnull().sum()

book_title       0
author           1
isbn          1363
dtype: int64

In [26]:
# using a regex to extract the first ISBN from each description string
book_info['isbn'] = book_info.isbn.str.extract(r'ISBN (\d+)\D', expand=False)
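
The pattern above captures a run of digits, so a 10-character ISBN whose check digit is `X` would be cut short. The sketch below compares it with a stricter alternative using Python's `re` module. The sample strings are made up to resemble the scraped descriptions, and `strict_pattern` is an assumption of mine, not something used elsewhere in this project.

```python
import re

# Pattern used in this notebook: digits after "ISBN ", followed by a non-digit.
notebook_pattern = re.compile(r'ISBN (\d+)\D')

# Assumed alternative: either 13 digits, or 9 digits plus a final digit or X.
strict_pattern = re.compile(r'ISBN (\d{13}|\d{9}[\dX])\b')

# Made-up description strings in the style of the scraped data.
samples = [
    'Amazon.com Product Description (ISBN 0439023483, Hardcover)',
    'Amazon.com Amazon.com Review (ISBN 080442957X, Paperback)',
    'Amazon.com Product Description (ISBN 9780521656139, Paperback)',
    'No identifier on this page',
]


def first_isbn(pattern, text):
    """Return the first captured ISBN in `text`, or None if there is no match."""
    match = pattern.search(text)
    return match.group(1) if match else None


for text in samples:
    print(first_isbn(notebook_pattern, text), first_isbn(strict_pattern, text))
```

For the second sample the notebook pattern returns the nine digits before the `X`, while the stricter pattern keeps all ten characters. Whether this matters depends on how many of the scraped ISBNs end in `X`, which I have not checked here.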

In [27]:
book_info['id'] = books_list

In [28]:
book_info = book_info[['id', 'book_title', 'author', 'isbn']]
book_info.head(5)

Unnamed: 0,id,book_title,author,isbn
0,4979986,The Hunger Games,Suzanne Collins,439023483
1,8384326,Twilight (2005),Stephenie Meyer,316015849
2,1541442,The Girl with the Dragon Tattoo (2005),Stieg Larsson,307454541
3,393681,The Book Thief (2007),Markus Zusak,375842209
4,8662515,Catching Fire,Suzanne Collins,439023491


In [29]:
# filling in the information for the book missing its author
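# .loc assigns directly into the dataframe (chained .iloc assignment may act on a copy);
# the ISBN is stored as a string to match the rest of the column
book_info.loc[4801, 'author'] = 'Jeremy Harmer'
book_info.loc[4801, 'isbn'] = '9780521656139'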

In [30]:
# making a copy of the dataframe to look into missing ISBNs
book_subset = book_info.copy()

In [31]:
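# converting the ISBN column to strings to avoid losing leading zeroes
# note: missing values become the string 'nan', and the zeroes only survive a CSV
# round trip if the file is read back with dtype={'isbn': str}
book_info['isbn'] = book_info.isbn.astype(str)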
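
Since a CSV file carries no type information, the leading zeroes can still be lost when the file is loaded again. The sketch below illustrates this with a tiny in-memory CSV; the two rows are illustrative and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Tiny in-memory CSV: one ISBN with a leading zero, one missing value.
csv_text = "id,isbn\n4979986,0439023483\n9279041,\n"

# Default parsing infers a numeric column, which drops the leading zero.
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str keeps the ISBN exactly as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'isbn': str})

print(as_numbers['isbn'].tolist())
print(as_text['isbn'].tolist())
```

Missing values stay missing in both cases, so `isnull()` keeps working on the text version.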

In [32]:
# saving the dataframe as a csv
book_info.to_csv(cleaner_book_info_path, index=False)

In [33]:
book_info.dtypes

id             int64
book_title    object
author        object
isbn          object
dtype: object

In [34]:
book_subset.isnull().sum()

id               0
book_title       0
author           0
isbn          1362
dtype: int64

In [35]:
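# assigning the result back avoids relying on in-place modification through attribute access
book_subset['isbn'] = book_subset.isbn.fillna(0)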

In [36]:
# identifying the book entries missing ISBNs
missing_isbns = book_subset[book_subset.isbn == 0]
missing_isbns.head()

Unnamed: 0,id,book_title,author,isbn
7,9279041,Mockingjay,Suzanne Collins,0
9,1222607,The Road (2006),Cormac McCarthy,0
10,522063,Water for Elephants: A Novel (2006),Sara Gruen,0
13,5403381,Harry Potter and the Sorcerer's Stone,J. K. Rowling,0
23,5197633,Life of Pi (2001),Yann Martel,0
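
The same rows can also be selected without a placeholder value by masking on missing values directly. This is a minimal sketch on a three-row toy dataframe (IDs borrowed from the tables in this notebook); it is an alternative way to write the filter, not the approach used in the cells above.

```python
import numpy as np
import pandas as pd

# Toy dataframe: one book with an ISBN, two without.
books = pd.DataFrame({
    'id': [4979986, 9279041, 1222607],
    'isbn': ['0439023483', np.nan, np.nan],
})

# isna() builds a boolean mask, so no sentinel such as 0 is needed.
missing = books[books['isbn'].isna()]
print(missing)
```

This keeps the `isbn` column free of mixed types, at the cost of the saved CSV showing empty cells rather than zeroes.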


In [37]:
book_info.to_csv(cleaner_book_info_path, index=False)

<font color=navy>
    Of the 5,000 books, 1,362 are missing ISBNs because this information was not included on their LibraryThing page. For now, I will continue with the books that have ISBNs, but I will also save a copy of the dataframe with the books missing ISBNs. If I need more reviews to improve my model, it will be worth coming back to those and labelling them manually.
    </font>

In [38]:
missing_isbns.to_csv(missing_ibns_path, index=False)