# Web Scraping of Books to Scrape Using Selenium Take 2
### David Lowe
### August 28, 2020

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the Selenium module.

INTRODUCTION: Books to Scrape is a fictional bookstore that, according to its site owner, desperately wants to be scraped. It is a safe place for beginners learning web scraping and for developers validating their scraping technologies. This iteration of the script automatically traverses the book listing and detail web pages to capture the descriptive data about each book and stores it in a CSV output file.

Starting URL: http://books.toscrape.com/

## Task 1. Prepare Environment

In [1]:
import os
import sys
import pandas as pd
import shutil
import re
import boto3
from datetime import datetime, date
from random import randint
from time import sleep
from dotenv import load_dotenv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

In [2]:
# Begin the timer for the script processing
startTimeScript = datetime.now()

# Set up the verbose and debug flags to print detailed messages for debugging (setting True will activate!)
verbose = False
debug = False

# Set up the flag to send status emails (setting to True will send the status emails!)
notifyStatus = False

# # Set up the parent directory location for loading the dotenv files
# useColab = False
# if useColab:
#     # Mount Google Drive locally for storing files
#     from google.colab import drive
#     drive.mount('/content/gdrive')
#     gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
#     env_path = '/content/gdrive/My Drive/Colab Notebooks/'
#     dotenv_path = env_path + "python_script.env"
#     load_dotenv(dotenv_path=dotenv_path)

# # Set up the dotenv file for retrieving environment variables
# useLocalPC = False
# if useLocalPC:
#     env_path = "/Users/david/PycharmProjects/"
#     dotenv_path = env_path + "python_script.env"
#     load_dotenv(dotenv_path=dotenv_path)

# Set up the flag to write the output to a CSV file (setting True will create the file!)
writeOutput = True

# Set up the executeDownload flag to download files (setting True will download!)
executeDownload = False

In [3]:
# Set up the email notification function
def status_notify(msg_text):
    access_key = os.environ.get('SNS_ACCESS_KEY')
    secret_key = os.environ.get('SNS_SECRET_KEY')
    aws_region = os.environ.get('SNS_AWS_REGION')
    topic_arn = os.environ.get('SNS_TOPIC_ARN')
    if (access_key is None) or (secret_key is None) or (aws_region is None) or (topic_arn is None):
        sys.exit("Incomplete notification setup info. Script Processing Aborted!!!")
    sns = boto3.client('sns', aws_access_key_id=access_key, aws_secret_access_key=secret_key, region_name=aws_region)
    response = sns.publish(TopicArn=topic_arn, Message=msg_text)
    if response['ResponseMetadata']['HTTPStatusCode'] != 200 :
        print('Status notification not OK with HTTP status code:', response['ResponseMetadata']['HTTPStatusCode'])
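
The guard at the top of `status_notify` could also report which settings are missing. The following is a minimal sketch, assuming a hypothetical helper named `missing_settings` that is not part of the original notebook. It accepts any mapping (such as `os.environ`) and returns the required keys that are absent or empty.

```python
# Hypothetical helper (not in the original notebook): list the required
# notification settings that are missing from a mapping such as os.environ.
REQUIRED_SNS_KEYS = ('SNS_ACCESS_KEY', 'SNS_SECRET_KEY', 'SNS_AWS_REGION', 'SNS_TOPIC_ARN')

def missing_settings(env, required=REQUIRED_SNS_KEYS):
    # A key counts as missing when it is absent or set to an empty string
    return [key for key in required if not env.get(key)]

# Example with a plain dict standing in for os.environ (made-up values)
sample_env = {'SNS_ACCESS_KEY': 'abc', 'SNS_SECRET_KEY': 'xyz', 'SNS_AWS_REGION': ''}
print(missing_settings(sample_env))  # ['SNS_AWS_REGION', 'SNS_TOPIC_ARN']
```

Passing the mapping in as an argument keeps the check easy to exercise without touching real credentials.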

In [4]:
if (notifyStatus): status_notify("Task 1 Prepare Environment completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Task 2. Perform the Scraping and Processing

In [5]:
if (notifyStatus): status_notify("Task 2 Perform the Scraping and Processing has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [6]:
# To scrape the detail information of each individual book
def capture_details(item_page_url):
    # Initialize the web browser
    second_browser_options = Options()
    second_browser_options.headless = True
    detail_page_browser = webdriver.Firefox(options=second_browser_options)
    if verbose: print('Attempting to access the web page:', item_page_url)
    try:
        detail_page_browser.get(item_page_url)
        print('Successfully accessed the web page:', item_page_url)
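    except Exception as err:
        print('The server could not serve up the web page!', err)
        # Close the browser before aborting so no Firefox process is left behind
        detail_page_browser.quit()
        sys.exit('Script processing cannot continue!!!')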

    product_container = detail_page_browser.find_element(By.CLASS_NAME, "product_page")
    item_image = product_container.find_element(By.TAG_NAME, "img").get_attribute("src")
    item_description = product_container.find_elements(By.TAG_NAME, "p")[3].text
    table_rows = product_container.find_elements(By.TAG_NAME, "tr")
    item_upc = table_rows[0].find_element(By.TAG_NAME, "td").text
    item_type = table_rows[1].find_element(By.TAG_NAME, "td").text
    item_tax_text = table_rows[4].find_element(By.TAG_NAME, "td").text
    # re.search returns None when nothing matches, so test the match instead of catching an exception
    tax_match = re.search(r"\d+(?:\.\d+)?", item_tax_text)
    item_tax = float(tax_match[0]) if tax_match else 0.00
    item_reviews = int(table_rows[6].find_element(By.TAG_NAME, "td").text)
    item_quantity_text = table_rows[5].find_element(By.TAG_NAME, "td").text
    quantity_match = re.search(r"\d+", item_quantity_text)
    item_quantity = int(quantity_match[0]) if quantity_match else 0

    detail_page_browser.quit()
    return item_upc, item_type, item_tax, item_quantity, item_reviews, item_image, item_description
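
The tax, quantity, and price fields all follow the same pattern: search a text fragment for a number and fall back to a default. A minimal sketch of that pattern follows, assuming a hypothetical helper named `first_number` that the notebook does not define. The sample strings are illustrative stand-ins for the text found on the pages.

```python
import re

# Hypothetical helper (not in the original notebook): pull the first number
# out of a text fragment and fall back to a default when none is present.
def first_number(text, default=0.0):
    match = re.search(r"\d+(?:\.\d+)?", text)
    # re.search returns None on no match; subscripting None would raise TypeError
    return float(match[0]) if match else default

print(first_number("£51.77"))                   # 51.77
print(first_number("In stock (22 available)"))  # 22.0
print(first_number("Out of stock"))             # 0.0
```

Testing the match object directly avoids relying on an exception handler, which is easy to get wrong because `None[0]` raises `TypeError` rather than `AttributeError`.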

In [7]:
# Setting up a dataframe to capture the records
df = pd.DataFrame(columns=['book_title', 'detail_url', 'thumnail_url', 'book_price', 'inventory_status', 'upc_code', 'product_type', 'required_tax', 'stock_quantity', 'number_reviews', 'image_url', 'product_description'])
num_entries = 0

In [8]:
# Specifying the URL of the desired web page to be scraped
website_url = "http://books.toscrape.com"
web_page_url = website_url + "/index.html"

In [9]:
# Initialize the web browser
firefox_options = Options()
firefox_options.headless = False
home_page_browser = webdriver.Firefox(options=firefox_options)

In [10]:
current_page = 1
max_pages = 999
done = False

while (current_page <= max_pages) and (not done):
    # Adding random wait time so we do not hammer the website needlessly
    waitTime = randint(2,4)
    print("Waiting", waitTime, "seconds to retrieve the items on page", web_page_url)
    sleep(waitTime)
    print('Attempting to access the web page:', web_page_url)
    try:
        home_page_browser.get(web_page_url)
        print('Successfully accessed the web page:', web_page_url)
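    except Exception as err:
        print('The server could not serve up the web page!', err)
        # Close the browser before aborting so no Firefox process is left behind
        home_page_browser.quit()
        sys.exit('Script processing cannot continue!!!')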

    book_section = home_page_browser.find_element(By.TAG_NAME, "ol")
    book_listing = book_section.find_elements(By.TAG_NAME, "li")
    if verbose: print(book_listing)

    for book_item in book_listing:
        thumnail_container = book_item.find_element(By.CLASS_NAME, "image_container")
        detail_url = thumnail_container.find_element(By.TAG_NAME, "a").get_attribute("href")
        thumnail_url = thumnail_container.find_element(By.TAG_NAME, "img").get_attribute("src")
        book_title = book_item.find_element(By.TAG_NAME, "h3").find_element(By.TAG_NAME, "a").get_attribute("title")
        book_price_text = book_item.find_element(By.CLASS_NAME, "price_color").text
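        # re.search returns None when nothing matches, so test the match instead of catching an exception
        price_match = re.search(r"\d+(?:\.\d+)?", book_price_text)
        book_price = float(price_match[0]) if price_match else 0.00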
        inventory_status = book_item.find_element(By.CLASS_NAME, "availability").text

        upc_code, product_type, required_tax, stock_quantity, number_reviews, image_url, product_description = capture_details(detail_url)

        df.loc[num_entries] = [book_title, detail_url, thumnail_url, book_price, inventory_status, upc_code, product_type, required_tax, stock_quantity, number_reviews, image_url, product_description]
        num_entries = num_entries + 1

    try:
        next_button_element = home_page_browser.find_element(By.CLASS_NAME, "next")
        current_page = current_page + 1
        web_page_url = next_button_element.find_element(By.TAG_NAME, "a").get_attribute("href")
    except Exception:
        # No "next" link was found, so this was the last listing page
        done = True

Waiting 2 seconds to retrieve the items on page http://books.toscrape.com/index.html
Attempting to access the web page: http://books.toscrape.com/index.html
Successfully accessed the web page: http://books.toscrape.com/index.html
Successfully accessed the web page: http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
Successfully accessed the web page: http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
Successfully accessed the web page: http://books.toscrape.com/catalogue/soumission_998/index.html
Successfully accessed the web page: http://books.toscrape.com/catalogue/sharp-objects_997/index.html
Successfully accessed the web page: http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html
Successfully accessed the web page: http://books.toscrape.com/catalogue/the-requiem-red_995/index.html
Successfully accessed the web page: http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/in
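
The loop above keeps following the "next" link until a listing page has none. That traversal can be sketched independently of Selenium. The generator below is a hypothetical illustration, not the notebook's code: `iter_pages` and `get_next_url` are assumed names, and a plain dictionary stands in for the website.

```python
# Hypothetical sketch (names are illustrative): follow "next" links until
# a page has none, or until max_pages is reached.
def iter_pages(start_url, get_next_url, max_pages=999):
    url = start_url
    visited = 0
    while url is not None and visited < max_pages:
        yield url
        visited += 1
        url = get_next_url(url)  # expected to return None on the last page

# A dict stands in for the site: each page maps to its "next" page
fake_site = {'page-1': 'page-2', 'page-2': 'page-3', 'page-3': None}
pages = list(iter_pages('page-1', fake_site.get))
print(pages)  # ['page-1', 'page-2', 'page-3']
```

Separating the traversal from the browser calls makes the stopping conditions (no next link, or the page cap) testable without network access.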

In [11]:
home_page_browser.quit()

In [12]:
if (notifyStatus): status_notify("Task 2 Perform the Scraping and Processing completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Task 3. Finalize the Output

In [13]:
if (notifyStatus): status_notify("Task 3 Finalize the Output has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [14]:
# Spot-checking the dataframe before writing to file
df.head()

Unnamed: 0,book_title,detail_url,thumnail_url,book_price,inventory_status,upc_code,product_type,required_tax,stock_quantity,number_reviews,image_url,product_description
0,A Light in the Attic,http://books.toscrape.com/catalogue/a-light-in...,http://books.toscrape.com/media/cache/2c/da/2c...,51.77,In stock,a897fe39b1053632,Books,0.0,22,0,http://books.toscrape.com/media/cache/fe/72/fe...,It's hard to imagine a world without A Light i...
1,Tipping the Velvet,http://books.toscrape.com/catalogue/tipping-th...,http://books.toscrape.com/media/cache/26/0c/26...,53.74,In stock,90fa61229261140a,Books,0.0,20,0,http://books.toscrape.com/media/cache/08/e9/08...,"""Erotic and absorbing...Written with starling ..."
2,Soumission,http://books.toscrape.com/catalogue/soumission...,http://books.toscrape.com/media/cache/3e/ef/3e...,50.1,In stock,6957f44c3847a760,Books,0.0,20,0,http://books.toscrape.com/media/cache/ee/cf/ee...,"Dans une France assez proche de la nôtre, un h..."
3,Sharp Objects,http://books.toscrape.com/catalogue/sharp-obje...,http://books.toscrape.com/media/cache/32/51/32...,47.82,In stock,e00eb4fd7b871a48,Books,0.0,20,0,http://books.toscrape.com/media/cache/c0/59/c0...,"WICKED above her hipbone, GIRL across her hear..."
4,Sapiens: A Brief History of Humankind,http://books.toscrape.com/catalogue/sapiens-a-...,http://books.toscrape.com/media/cache/be/a5/be...,54.23,In stock,4165285e1663650f,Books,0.0,20,0,http://books.toscrape.com/media/cache/ce/5f/ce...,From a renowned historian comes a groundbreaki...


In [15]:
# Spot-checking the dataframe before writing to file
df.tail()

Unnamed: 0,book_title,detail_url,thumnail_url,book_price,inventory_status,upc_code,product_type,required_tax,stock_quantity,number_reviews,image_url,product_description
995,Alice in Wonderland (Alice's Adventures in Won...,http://books.toscrape.com/catalogue/alice-in-w...,http://books.toscrape.com/media/cache/96/ee/96...,55.53,In stock,cd2a2a70dd5d176d,Books,0.0,1,0,http://books.toscrape.com/media/cache/99/df/99...,
996,"Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)",http://books.toscrape.com/catalogue/ajin-demi-...,http://books.toscrape.com/media/cache/09/7c/09...,57.06,In stock,bfd5e1701c862ac3,Books,0.0,1,0,http://books.toscrape.com/media/cache/30/98/30...,High school student Kei Nagai is struck dead i...
997,A Spy's Devotion (The Regency Spies of London #1),http://books.toscrape.com/catalogue/a-spys-dev...,http://books.toscrape.com/media/cache/1b/5f/1b...,16.97,In stock,19fec36a1dfb4c16,Books,0.0,1,0,http://books.toscrape.com/media/cache/f9/6b/f9...,"In England’s Regency era, manners and elegance..."
998,1st to Die (Women's Murder Club #1),http://books.toscrape.com/catalogue/1st-to-die...,http://books.toscrape.com/media/cache/2b/41/2b...,53.98,In stock,f684a82adc49f011,Books,0.0,1,0,http://books.toscrape.com/media/cache/f6/8e/f6...,"James Patterson, bestselling author of the Ale..."
999,"1,000 Places to See Before You Die",http://books.toscrape.com/catalogue/1000-place...,http://books.toscrape.com/media/cache/d7/0f/d7...,26.08,In stock,228ba5e7577e1d49,Books,0.0,1,0,http://books.toscrape.com/media/cache/9e/10/9e...,"Around the World, continent by continent, here..."


In [18]:
if (writeOutput):
    out_file = df.to_csv(index=False)
    with open('web_scraping_py_selenium_books_to_scrape_take2.csv', 'w', newline = '\n') as f:
        f.write(out_file)
    print("Number of records written to file:", len(df))
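
Product descriptions may contain commas, quotation marks, or line breaks, so counting lines in the output file is not a reliable way to confirm the record count. The sketch below is a hypothetical check that is not part of the original notebook. It uses the standard `csv` module, which parses quoted fields, and demonstrates the idea on a small temporary file with made-up rows.

```python
import csv
import os
import tempfile

# Hypothetical check (not in the original notebook): count data records in a
# CSV file with the csv module, which handles quoted multi-line fields.
def count_csv_records(path):
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)

# Demonstration on a small temporary file with made-up rows
rows = [['book_title', 'product_description'],
        ['Book A', 'Line one\nline two, with a comma'],
        ['Book B', 'Says "hello"']]
with tempfile.NamedTemporaryFile('w', newline='', suffix='.csv',
                                 delete=False, encoding='utf-8') as tmp:
    csv.writer(tmp).writerows(rows)
    tmp_path = tmp.name

record_count = count_csv_records(tmp_path)
os.remove(tmp_path)
print(record_count)  # 2
```

The same function could be pointed at the notebook's output file and compared with `len(df)`.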

In [19]:
if (notifyStatus): status_notify("Task 3 Finalize the Output completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [20]:
print('Total time for the script:', datetime.now() - startTimeScript)

Total time for the script: 2:28:32.287031
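
As a rough worked calculation, the total run time can be turned into an average cost per book. The sketch assumes 1,000 records, which is what the final index of 999 in the `df.tail()` output implies.

```python
from datetime import timedelta

# Values taken from the run above: total elapsed time and an assumed 1,000 records
total_time = timedelta(hours=2, minutes=28, seconds=32, microseconds=287031)
num_books = 1000

seconds_per_book = total_time.total_seconds() / num_books
print(round(seconds_per_book, 2))  # 8.91
```

At roughly nine seconds per book, much of the run time is likely spent in `capture_details`, which starts and quits a separate Firefox instance for every detail page. The random 2 to 4 second waits apply only once per listing page.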
