# Web Scraping of Quotes from Famous People Using Selenium
## Version 1 - Detail Page
### David Lowe
### November 22, 2019

SUMMARY: The purpose of this project is to practice web scraping by gathering specific pieces of information from a website. The web scraping code was written in Python and leveraged the Selenium module.

INTRODUCTION: A demo website, created by Scrapinghub, lists quotes from famous people. Its endpoints present the quotes in different ways, and each one poses a different scraping challenge. In this iteration, the Python script follows the link from each quote to the author's page and scrapes the author's details.

Starting URLs: http://quotes.toscrape.com/

## Section 0. Prepare Environment

In [1]:
import numpy as np
import pandas as pd
import os
import sys
import shutil
import smtplib
import pymysql
from email.message import EmailMessage
from datetime import datetime
from random import randint
from time import sleep
import requests  # required by download_file() below
from selenium import webdriver

In [2]:
# Begin the timer for the script processing
startTimeScript = datetime.now()

# Set up the verbose flag to print detailed messages for debugging (setting to True will activate)
verbose = False

# Set up the flag to send progress emails (setting to True will send status emails!)
notifyStatus = False

# Set up the writeJSON flag to capture the output in JSON (setting True will write the JSON file!)
writeJSON = False

# Set up the executeDownload flag to download files (setting True will download!)
executeDownload = False

In [3]:
def email_notify(msg_text):
    sender = os.environ.get('MAIL_SENDER')
    receiver = os.environ.get('MAIL_RECEIVER')
    gateway = os.environ.get('SMTP_GATEWAY')
    smtpuser = os.environ.get('SMTP_USERNAME')
    password = os.environ.get('SMTP_PASSWORD')
    # Abort if any of the required mail settings is missing from the environment
    if None in (sender, receiver, gateway, smtpuser, password):
        sys.exit("Incomplete email setup info. Script Processing Aborted!!!")
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python/Selenium Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    server = smtplib.SMTP(gateway, 587)
    server.starttls()
    server.login(smtpuser, password)
    server.send_message(msg)
    server.quit()
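
The helper above builds the message and sends it in one step, so it cannot be checked without a mail server. The following is a minimal sketch, assuming a hypothetical helper named `build_notification` (not part of the original notebook) and placeholder addresses, that separates message construction so it can be inspected offline:

```python
from email.message import EmailMessage

def build_notification(msg_text, sender, receiver,
                       subject='Notification from Python/Selenium Web Scraping Script'):
    """Build the notification message without sending it (hypothetical helper)."""
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = subject
    msg['From'] = sender
    msg['To'] = receiver
    return msg

# The addresses below are placeholders, not values from the original notebook
demo_msg = build_notification("The web scraping process has begun!",
                              "sender@example.com", "receiver@example.com")
print(demo_msg['Subject'])
print(demo_msg.get_content().strip())
```

With this split, `email_notify` would only need to pass the returned message to `server.send_message()`.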

In [4]:
def download_file(doc_path):
    # Use the last path segment of the URL as the local file name
    local_file = doc_path.split('/')[-1]
    gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
    dest_file = gdrivePrefix + local_file
    # Stream the response body straight to disk instead of loading it into memory
    with requests.get(doc_path, stream=True) as r:
        with open(dest_file, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    print('Downloaded file: ' + dest_file)
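
Taking the text after the last `/` works for plain URLs, but it keeps any query string and leaves percent-escapes in the file name. A minimal sketch of a more defensive alternative, assuming a hypothetical helper named `local_name_from_url` and an illustrative example URL (neither is from the original notebook):

```python
import posixpath
from urllib.parse import urlsplit, unquote

def local_name_from_url(doc_url):
    """Return the last path segment of a URL, without query string or fragment."""
    path = urlsplit(doc_url).path              # drops '?query' and '#fragment'
    return unquote(posixpath.basename(path))   # decodes escapes such as %20

# Illustrative URL only; it does not appear in the original notebook
example_url = 'http://example.com/files/report%202019.pdf?download=1'
print(local_name_from_url(example_url))
print(example_url.split('/')[-1])  # what the simple split would produce
```

The helper returns an empty string when the URL ends with `/`, so a caller may want to supply a fallback name in that case.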

In [5]:
if (notifyStatus): email_notify("The web scraping process has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Section 1. Perform the Scraping and Processing

In [6]:
# Specifying the URL of the web page to be scraped
websiteURL = "http://quotes.toscrape.com"
startingURL = websiteURL + "/"

# Defining a User-Agent header (kept for reference; the Selenium calls below do not use it)
uastring = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"
headers={'User-Agent': uastring}

In [7]:
if (notifyStatus): email_notify("The page loading and item extraction process has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [8]:
browser = webdriver.Firefox()
try:
    browser.get(startingURL)
    print('Successfully accessed the web page: ' + browser.title)
except Exception:
    print('The web page could not be reached for some reason!')
    sys.exit("Script processing cannot continue!!!")

Successfully accessed the web page: Quotes to Scrape


In [9]:
# Setting up a dataframe to capture the records
df = pd.DataFrame(columns=['author_name','author_birthday','author_location','author_bio','author_link'])
pageNum = 1
i = 0

In [10]:
# Iterate through the quote pages to gather author names and links
done = False

while not done :
    quote_listing = browser.find_elements_by_class_name("quote")
    if (verbose): print(quote_listing)
    
    for quote_item in quote_listing :
        if (verbose): print(quote_item.text)
        author_name = "[Not Found]"
        author_birthday = "[Not Found]"
        author_location = "[Not Found]"
        author_bio = "[Not Found]"
        author_link = "[Not Found]"

        author_name = quote_item.find_element_by_class_name("author").text
        author_link = quote_item.find_element_by_tag_name('a').get_attribute("href")
        # Compare names exactly; str.contains() would treat the name as a regex and match substrings
        if not (df['author_name'] == author_name).any() :
            if (verbose): print(author_name, '|', author_birthday, '|', author_location, '|', author_bio, '|', author_link)
            df.loc[i] = [author_name, author_birthday, author_location, author_bio, author_link]
            i = i + 1

    if ((pageNum % 5)==0) :
        if (notifyStatus): email_notify("Finished parsing page: " + browser.current_url + " at "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
    pageNum = pageNum + 1

    try:
        next_page = browser.find_element_by_class_name('next')
    except Exception:
        print("No more quote page to retrieve. The processing has completed!")
        done = True

    if not done:
        next_page_url = next_page.find_element_by_tag_name('a').get_attribute("href")
        print("The URL for the next page is: ", next_page_url)
        # Adding random wait time so we do not hammer the website needlessly
        waitTime = randint(2,5)
        print("Waiting " + str(waitTime) + " seconds to process next page...")
        sleep(waitTime)
        try:
            browser.get(next_page_url)
        except Exception:
            print("Received error when trying to access a URL. The script will stop!")
            done = True

The URL for the next page is:  http://quotes.toscrape.com/page/2/
Waiting 2 seconds to process next page...
The URL for the next page is:  http://quotes.toscrape.com/page/3/
Waiting 2 seconds to process next page...
The URL for the next page is:  http://quotes.toscrape.com/page/4/
Waiting 2 seconds to process next page...
The URL for the next page is:  http://quotes.toscrape.com/page/5/
Waiting 2 seconds to process next page...
The URL for the next page is:  http://quotes.toscrape.com/page/6/
Waiting 2 seconds to process next page...
The URL for the next page is:  http://quotes.toscrape.com/page/7/
Waiting 2 seconds to process next page...
The URL for the next page is:  http://quotes.toscrape.com/page/8/
Waiting 4 seconds to process next page...
The URL for the next page is:  http://quotes.toscrape.com/page/9/
Waiting 3 seconds to process next page...
The URL for the next page is:  http://quotes.toscrape.com/page/10/
Waiting 5 seconds to process next page...
No more quote page to retri
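
The loop above has to decide whether an author has already been recorded. A regex or substring test can report a match for a name that is merely contained in another one, whereas an exact comparison cannot. The sketch below uses only the standard library and hypothetical sample data to illustrate the difference, together with a hypothetical `unique_authors` helper that keeps the first link seen for each exact name:

```python
import re

# Hypothetical sample data, for illustration only
existing = ['Steve Martin', 'Bob Marley']
candidate = 'Martin'

# Regex/substring search: reports a hit even though 'Martin' is a different name
regex_hit = any(re.search(candidate, name) for name in existing)
# Exact comparison: only identical names count as duplicates
exact_hit = candidate in set(existing)
print(regex_hit, exact_hit)

def unique_authors(pairs):
    """Keep the first link seen for each exact author name, preserving order."""
    authors = {}
    for name, link in pairs:
        authors.setdefault(name, link)
    return authors

pairs = [
    ('Albert Einstein', 'http://quotes.toscrape.com/author/Albert-Einstein'),
    ('J.K. Rowling', 'http://quotes.toscrape.com/author/J-K-Rowling'),
    ('Albert Einstein', 'http://quotes.toscrape.com/author/Albert-Einstein'),
]
print(list(unique_authors(pairs)))
```

A dictionary keyed by name also avoids rescanning the whole DataFrame for every quote, which matters little on ten pages but keeps the lookup cost constant as the list grows.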

In [11]:
# Iterate through the author pages to gather author biographical information
for i, row in df.iterrows():
    # Adding random wait time so we do not hammer the website needlessly
    waitTime = randint(2,5)
    print("Waiting " + str(waitTime) + " seconds to process author page: " + row['author_link'])
    sleep(waitTime)
    try:
        browser.get(row['author_link'])
    except Exception:
        print("Unable to retrieve the author detail page. The script will skip this author!")
    else:
        print('Successfully accessed the author page: ' + row['author_link'])
        author_birthday = browser.find_element_by_class_name("author-born-date").text
        author_location = browser.find_element_by_class_name("author-born-location").text
        author_bio = browser.find_element_by_class_name("author-description").text
        df.at[i,'author_birthday'] = author_birthday
        df.at[i,'author_location'] = author_location
        df.at[i,'author_bio'] = author_bio

print("No more pages to retrieve. The processing has completed!")

Waiting 3 seconds to process author page: http://quotes.toscrape.com/author/Albert-Einstein
Successfully accessed the author page: http://quotes.toscrape.com/author/Albert-Einstein
Waiting 4 seconds to process author page: http://quotes.toscrape.com/author/J-K-Rowling
Successfully accessed the author page: http://quotes.toscrape.com/author/J-K-Rowling
Waiting 5 seconds to process author page: http://quotes.toscrape.com/author/Jane-Austen
Successfully accessed the author page: http://quotes.toscrape.com/author/Jane-Austen
Waiting 2 seconds to process author page: http://quotes.toscrape.com/author/Marilyn-Monroe
Successfully accessed the author page: http://quotes.toscrape.com/author/Marilyn-Monroe
Waiting 5 seconds to process author page: http://quotes.toscrape.com/author/Andre-Gide
Successfully accessed the author page: http://quotes.toscrape.com/author/Andre-Gide
Waiting 5 seconds to process author page: http://quotes.toscrape.com/author/Thomas-A-Edison
Successfully accessed the autho

In [12]:
# Close the browsing session
browser.quit()

In [13]:
if (notifyStatus): email_notify("The page loading and item extraction process completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Section 2. Organizing Data and Producing Outputs

In [14]:
print('Some of the final set of records captured are:')
df.head(10)

Some of the final set of records captured are:


| | author_name | author_birthday | author_location | author_bio | author_link |
|---|---|---|---|---|---|
| 0 | Albert Einstein | March 14, 1879 | in Ulm, Germany | In 1879, Albert Einstein was born in Ulm, Germ... | http://quotes.toscrape.com/author/Albert-Einstein |
| 1 | J.K. Rowling | July 31, 1965 | in Yate, South Gloucestershire, England, The U... | See also: Robert GalbraithAlthough she writes ... | http://quotes.toscrape.com/author/J-K-Rowling |
| 2 | Jane Austen | December 16, 1775 | in Steventon Rectory, Hampshire, The United Ki... | Jane Austen was an English novelist whose work... | http://quotes.toscrape.com/author/Jane-Austen |
| 3 | Marilyn Monroe | June 01, 1926 | in The United States | Marilyn Monroe (born Norma Jeane Mortenson; Ju... | http://quotes.toscrape.com/author/Marilyn-Monroe |
| 4 | André Gide | November 22, 1869 | in Paris, France | André Paul Guillaume Gide was a French author ... | http://quotes.toscrape.com/author/Andre-Gide |
| 5 | Thomas A. Edison | February 11, 1847 | in Milan, Ohio, The United States | Thomas Alva Edison was an American inventor, s... | http://quotes.toscrape.com/author/Thomas-A-Edison |
| 6 | Eleanor Roosevelt | October 11, 1884 | in The United States | Anna Eleanor Roosevelt was an American politic... | http://quotes.toscrape.com/author/Eleanor-Roos... |
| 7 | Steve Martin | August 14, 1945 | in Waco, Texas, The United States | Stephen Glenn "Steve" Martin is an American ac... | http://quotes.toscrape.com/author/Steve-Martin |
| 8 | Bob Marley | February 06, 1945 | in Nine Mile, Saint Ann, Jamaica | Robert "Bob" Nesta Marley OM was a Jamaican si... | http://quotes.toscrape.com/author/Bob-Marley |
| 9 | Dr. Seuss | March 02, 1904 | in Springfield, MA, The United States | Theodor Seuss Geisel was born 2 March 1904 in ... | http://quotes.toscrape.com/author/Dr-Seuss |
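
The records above store the birthday as free text and the location with a leading "in ". If typed values are wanted downstream, a post-processing step could normalize both fields. This is a minimal sketch, assuming hypothetical helpers `parse_birthday` and `clean_location` (not part of the original notebook) and an English locale for the month names:

```python
from datetime import datetime

def parse_birthday(text):
    """Convert a string such as 'March 14, 1879' into a date object."""
    # %B expects the full month name in the current locale (English assumed here)
    return datetime.strptime(text, '%B %d, %Y').date()

def clean_location(text):
    """Drop the leading 'in ' that the author page places before the birthplace."""
    prefix = 'in '
    return text[len(prefix):] if text.startswith(prefix) else text

print(parse_birthday('March 14, 1879'))
print(clean_location('in Ulm, Germany'))
```

The "[Not Found]" placeholder used during scraping does not match the date format, so a caller would need to handle the resulting `ValueError` for authors whose page could not be retrieved.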


In [15]:
if (writeJSON):
    out_file = df.to_json(orient='records')
    with open('web-scraping-py-bsoup-simple-pagination.json', 'w') as f:
        f.write(out_file)
print('Total number of records processed:', len(df))

Total number of records processed: 50
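
With `orient='records'`, the JSON output is a list with one object per author. The sketch below reproduces that shape with the standard `json` module and reads the file back to confirm that non-ASCII names survive the round trip. The file name `authors.json` and the temporary directory are assumptions made for illustration, and only one sample record from the output above is included:

```python
import json
import os
import tempfile

# One sample record, taken from the output shown earlier
records = [
    {'author_name': 'André Gide',
     'author_birthday': 'November 22, 1869',
     'author_location': 'in Paris, France',
     'author_link': 'http://quotes.toscrape.com/author/Andre-Gide'},
]

# Write to a temporary directory so the sketch leaves the working folder untouched
out_path = os.path.join(tempfile.mkdtemp(), 'authors.json')
with open(out_path, 'w', encoding='utf-8') as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Read the file back and compare it with what was written
with open(out_path, encoding='utf-8') as f:
    loaded = json.load(f)
print(loaded == records)
```

Opening the file with an explicit `encoding='utf-8'` keeps the result independent of the platform's default encoding.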


In [16]:
if (notifyStatus): email_notify("The web scraping process has completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [17]:
print('Total time for the script:',(datetime.now() - startTimeScript))

Total time for the script: 0:04:23.720655
