# Web Scraping of Books to Scrape Using Selenium Take 1
### David Lowe
### August 21, 2020

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the Selenium module.

INTRODUCTION: Books to Scrape is a fictional bookstore that, according to the site owner, desperately wants to be scraped. It is a safe place for beginners learning web scraping and for developers validating their scraping tools. This iteration of the script automatically traverses the book listing pages (about 50 pages and 1,000 items), captures the basic data about each book, and stores the records in a CSV output file.

Starting URL: http://books.toscrape.com/

## Task 1. Prepare Environment

In [1]:
import os
import sys
import pandas as pd
import shutil
import boto3
from datetime import datetime, date
from random import randint
from time import sleep
from dotenv import load_dotenv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import presence_of_element_located
from selenium.webdriver.support.select import Select
from selenium.webdriver.firefox.options import Options

In [2]:
# Begin the timer for the script processing
startTimeScript = datetime.now()

# Set up the verbose and debug flags to print detailed messages for debugging (setting True will activate!)
verbose = False
debug = False

# Set up the flag to send status emails (setting to True will send the status emails!)
notifyStatus = False

# # Set up the parent directory location for loading the dotenv files
# useColab = False
# if useColab:
#     # Mount Google Drive locally for storing files
#     from google.colab import drive
#     drive.mount('/content/gdrive')
#     gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
#     env_path = '/content/gdrive/My Drive/Colab Notebooks/'
#     dotenv_path = env_path + "python_script.env"
#     load_dotenv(dotenv_path=dotenv_path)

# # Set up the dotenv file for retrieving environment variables
# useLocalPC = False
# if useLocalPC:
#     env_path = "/Users/david/PycharmProjects/"
#     dotenv_path = env_path + "python_script.env"
#     load_dotenv(dotenv_path=dotenv_path)

# Set up the flag to write the output to a CSV file (setting to True will create the file!)
writeOutput = True

# Set up the executeDownload flag to download files (setting True will download!)
executeDownload = False

In [3]:
# Set up the email notification function
def status_notify(msg_text):
    access_key = os.environ.get('SNS_ACCESS_KEY')
    secret_key = os.environ.get('SNS_SECRET_KEY')
    aws_region = os.environ.get('SNS_AWS_REGION')
    topic_arn = os.environ.get('SNS_TOPIC_ARN')
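    # All four settings are required; topic_arn is passed to sns.publish() below
    if (access_key is None) or (secret_key is None) or (aws_region is None) or (topic_arn is None):
        sys.exit("Incomplete notification setup info. Script Processing Aborted!!!")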
    sns = boto3.client('sns', aws_access_key_id=access_key, aws_secret_access_key=secret_key, region_name=aws_region)
    response = sns.publish(TopicArn=topic_arn, Message=msg_text)
    if response['ResponseMetadata']['HTTPStatusCode'] != 200 :
        print('Status notification not OK with HTTP status code:', response['ResponseMetadata']['HTTPStatusCode'])

In [4]:
if (notifyStatus): status_notify("Task 1 Prepare Environment completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Task 2. Perform the Scraping and Processing

In [5]:
if (notifyStatus): status_notify("Task 2 Perform the Scraping and Processing has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [6]:
# Setting up a dataframe to capture the records
df = pd.DataFrame(columns=['book_title', 'detail_url', 'image_url', 'book_price', 'inventory_status'])
num_entries = 0

In [7]:
# Specifying the URL of the desired web page to be scraped
website_url = "http://books.toscrape.com"
web_page_url = website_url + "/index.html"

In [8]:
# Initialize the web browser
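firefox_options = Options()
# Pass the headless flag as a browser argument; the Options.headless
# attribute is deprecated in newer Selenium 4 releases
firefox_options.add_argument("-headless")
home_page_browser = webdriver.Firefox(options=firefox_options)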

In [9]:
current_page = 1
max_pages = 999
done = False

while (current_page <= max_pages) and (not done):
    # Adding random wait time so we do not hammer the website needlessly
    waitTime = randint(2,4)
    print("Waiting", waitTime, "seconds to retrieve the items on page", web_page_url)
    sleep(waitTime)
    print('Attempting to access the web page:', web_page_url)
    try:
        home_page_browser.get(web_page_url)
        print('Successfully accessed the web page:', web_page_url)
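    except Exception as err:
        # Catch Exception rather than using a bare except so Ctrl-C still interrupts the run
        print('The server could not serve up the web page!', err)
        sys.exit('Script processing cannot continue!!!')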

    book_section = home_page_browser.find_element(By.TAG_NAME, "ol")
    book_listing = book_section.find_elements(By.TAG_NAME, "li")
    if verbose: print(book_listing)

    for book_item in book_listing:
        image_container = book_item.find_element(By.CLASS_NAME, "image_container")
        detail_url = image_container.find_element(By.TAG_NAME, "a").get_attribute("href")
        image_url = image_container.find_element(By.TAG_NAME, "img").get_attribute("src")
        book_title = book_item.find_element(By.TAG_NAME, "h3").find_element(By.TAG_NAME, "a").get_attribute("title")
        book_price = book_item.find_element(By.CLASS_NAME, "price_color").text
        inventory_status = book_item.find_element(By.CLASS_NAME, "availability").text
        df.loc[num_entries] = [book_title, detail_url, image_url, book_price, inventory_status]
        num_entries = num_entries + 1

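    try:
        # Follow the "next" pagination link when the page has one
        next_button_element = home_page_browser.find_element(By.CLASS_NAME, "next")
        current_page = current_page + 1
        web_page_url = next_button_element.find_element(By.TAG_NAME, "a").get_attribute("href")
    except Exception:
        # No "next" link was found, so this was the last listing page
        done = True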

Waiting 4 seconds to retrieve the items on page http://books.toscrape.com/index.html
Attempting to access the web page: http://books.toscrape.com/index.html
Successfully accessed the web page: http://books.toscrape.com/index.html
Waiting 2 seconds to retrieve the items on page http://books.toscrape.com/catalogue/page-2.html
Attempting to access the web page: http://books.toscrape.com/catalogue/page-2.html
Successfully accessed the web page: http://books.toscrape.com/catalogue/page-2.html
Waiting 2 seconds to retrieve the items on page http://books.toscrape.com/catalogue/page-3.html
Attempting to access the web page: http://books.toscrape.com/catalogue/page-3.html
Successfully accessed the web page: http://books.toscrape.com/catalogue/page-3.html
Waiting 2 seconds to retrieve the items on page http://books.toscrape.com/catalogue/page-4.html
Attempting to access the web page: http://books.toscrape.com/catalogue/page-4.html
Successfully accessed the web page: http://books.toscrape.com/cat
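
The log shows that every "next" link arrives as an absolute URL, which is what the browser reports for the `href` attribute. If the same pagination were followed without a browser, relative links would need to be resolved against the current page. The sketch below illustrates that step with the standard library; the relative `href` values are assumptions chosen to reproduce the URLs in the log above, and `resolve_next_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin

def resolve_next_url(current_page_url, next_href):
    """Resolve a (possibly relative) "next" link against the page it appears on."""
    return urljoin(current_page_url, next_href)

# Assumed relative hrefs; the resolved results match the URLs printed in the log
first = resolve_next_url("http://books.toscrape.com/index.html", "catalogue/page-2.html")
second = resolve_next_url(first, "page-3.html")
print(first)
print(second)
```

Resolving against the current page rather than the site root matters here because the listing pages after the first one live under `/catalogue/`.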

In [10]:
home_page_browser.quit()

In [11]:
if (notifyStatus): status_notify("Task 2 Perform the Scraping and Processing completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Task 3. Finalize the Output

In [12]:
if (notifyStatus): status_notify("Task 3 Finalize the Output has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [13]:
# Spot-checking the dataframe before writing to file
df.head()

,book_title,detail_url,image_url,book_price,inventory_status
0,A Light in the Attic,http://books.toscrape.com/catalogue/a-light-in...,http://books.toscrape.com/media/cache/2c/da/2c...,£51.77,In stock
1,Tipping the Velvet,http://books.toscrape.com/catalogue/tipping-th...,http://books.toscrape.com/media/cache/26/0c/26...,£53.74,In stock
2,Soumission,http://books.toscrape.com/catalogue/soumission...,http://books.toscrape.com/media/cache/3e/ef/3e...,£50.10,In stock
3,Sharp Objects,http://books.toscrape.com/catalogue/sharp-obje...,http://books.toscrape.com/media/cache/32/51/32...,£47.82,In stock
4,Sapiens: A Brief History of Humankind,http://books.toscrape.com/catalogue/sapiens-a-...,http://books.toscrape.com/media/cache/be/a5/be...,£54.23,In stock
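
The `book_price` column is stored as text with a leading pound sign, so any numeric analysis needs a conversion step first. The following is a minimal sketch of one way to do that, using the five prices shown above; `parse_price` is a hypothetical helper that is not part of the original script, and it assumes every price uses the `£` prefix.

```python
from decimal import Decimal

def parse_price(price_text):
    """Strip the leading pound sign and return the amount as a Decimal."""
    return Decimal(price_text.strip().lstrip("£"))

# Prices copied from the df.head() output above
prices = ["£51.77", "£53.74", "£50.10", "£47.82", "£54.23"]
amounts = [parse_price(p) for p in prices]
print(sum(amounts))
```

`Decimal` is used instead of `float` so that currency amounts add up without binary rounding error.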


In [14]:
# Spot-checking the dataframe before writing to file
df.tail()

,book_title,detail_url,image_url,book_price,inventory_status
995,Alice in Wonderland (Alice's Adventures in Won...,http://books.toscrape.com/catalogue/alice-in-w...,http://books.toscrape.com/media/cache/96/ee/96...,£55.53,In stock
996,"Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)",http://books.toscrape.com/catalogue/ajin-demi-...,http://books.toscrape.com/media/cache/09/7c/09...,£57.06,In stock
997,A Spy's Devotion (The Regency Spies of London #1),http://books.toscrape.com/catalogue/a-spys-dev...,http://books.toscrape.com/media/cache/1b/5f/1b...,£16.97,In stock
998,1st to Die (Women's Murder Club #1),http://books.toscrape.com/catalogue/1st-to-die...,http://books.toscrape.com/media/cache/2b/41/2b...,£53.98,In stock
999,"1,000 Places to See Before You Die",http://books.toscrape.com/catalogue/1000-place...,http://books.toscrape.com/media/cache/d7/0f/d7...,£26.08,In stock


In [15]:
if (writeOutput):
    out_file = df.to_csv(index=False)
    with open('web_scraping_py_selenium_books_to_scrape_take1.csv', 'w', newline = '\n', encoding = "utf-8") as f:
        f.write(out_file)
    print("Number of records written to file:", len(df))

Number of records written to file: 1000
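
Several titles in the data contain commas (for example, the last two shown in the `df.tail()` output), so the CSV output depends on quoting to keep each title in a single field. The sketch below uses the standard `csv` module rather than pandas to show the effect on two of those rows; pandas applies comparable minimal quoting by default, but this snippet is only an illustration and not the code that produced the output file.

```python
import csv
import io

# Two rows taken from the df.tail() output, reduced to two columns
rows = [
    ["book_title", "book_price"],
    ["1,000 Places to See Before You Die", "£26.08"],
    ["Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)", "£57.06"],
]

# Write to an in-memory buffer so no file is created
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original fields, commas included
round_trip = list(csv.reader(io.StringIO(csv_text)))
```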


In [16]:
if (notifyStatus): status_notify("Task 3 Finalize the Output completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [17]:
print('Total time for the script:', (datetime.now() - startTimeScript))

Total time for the script: 0:06:40.215070
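
A rough estimate shows how much of that run time comes from the deliberate delay between requests. The sketch assumes exactly 50 listing pages (the introduction says "about 50") and uses the fact that `randint(2, 4)` picks 2, 3, or 4 with equal probability; the result is an expectation, not a measurement.

```python
from datetime import timedelta

pages = 50                            # assumed from "about 50 pages" in the introduction
mean_wait_seconds = (2 + 3 + 4) / 3   # average of the values randint(2, 4) can return
expected_wait = timedelta(seconds=pages * mean_wait_seconds)

# Total run time reported by the script above
total = timedelta(minutes=6, seconds=40.215070)
remaining = total - expected_wait

print("Expected time spent waiting:", expected_wait)
print("Time left for page loads and parsing:", remaining)
```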
