# Web Scraping of Quotes from Famous People Using Selenium
## Version 1 - Login Form
### David Lowe
### November 22, 2019

SUMMARY: The purpose of this project is to practice web scraping by gathering specific pieces of information from a website. The web scraping code was written in Python and leveraged the Selenium module.

INTRODUCTION: A demo website, created by Scrapinghub, lists quotes from famous people. It exposes several endpoints that present the quotes in different ways, and each endpoint poses a different scraping challenge. In this iteration, the Python script submits the login form and then scrapes the Goodreads link from each quote. The Goodreads links appear only after a successful authentication.

Starting URLs: http://quotes.toscrape.com/login

## Section 0. Prepare Environment

In [1]:
import numpy as np
import pandas as pd
import os
import sys
import shutil
import smtplib
import pymysql
from email.message import EmailMessage
from datetime import datetime
from random import randint
from time import sleep
import requests
from selenium import webdriver

In [2]:
# Begin the timer for the script processing
startTimeScript = datetime.now()

# Set up the verbose flag to print detailed messages for debugging (setting to True will activate)
verbose = False

# Set up the flag to control progress emails (setting to True will send status emails!)
notifyStatus = False

# Set up the writeJSON flag to capture the output in JSON (setting True will write the JSON file!)
writeJSON = False

# Set up the executeDownload flag to download files (setting True will download!)
executeDownload = False

In [3]:
def email_notify(msg_text):
    sender = os.environ.get('MAIL_SENDER')
    receiver = os.environ.get('MAIL_RECEIVER')
    gateway = os.environ.get('SMTP_GATEWAY')
    smtpuser = os.environ.get('SMTP_USERNAME')
    password = os.environ.get('SMTP_PASSWORD')
    if any(value is None for value in (sender, receiver, gateway, smtpuser, password)):
        sys.exit("Incomplete email setup info. Script Processing Aborted!!!")
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python/Selenium Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    server = smtplib.SMTP(gateway, 587)
    server.starttls()
    server.login(smtpuser, password)
    server.send_message(msg)
    server.quit()
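
The function above aborts when any of the five environment variables is missing, and it builds the message only just before sending. The following is a minimal sketch, not the notebook's own code, that separates those two steps so they can be exercised without an SMTP server. The helper names `missing_mail_settings` and `build_notification` are hypothetical, as are the example addresses.

```python
import os
from email.message import EmailMessage

# Environment variable names taken from email_notify above
REQUIRED_VARS = ('MAIL_SENDER', 'MAIL_RECEIVER', 'SMTP_GATEWAY',
                 'SMTP_USERNAME', 'SMTP_PASSWORD')

def missing_mail_settings(environ=os.environ):
    """Return the required variable names that are not set (hypothetical helper)."""
    return [name for name in REQUIRED_VARS if environ.get(name) is None]

def build_notification(msg_text, sender, receiver):
    """Build the notification message without sending it (hypothetical helper)."""
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python/Selenium Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    return msg

# Example with an explicit mapping instead of the real environment
print(missing_mail_settings({'MAIL_SENDER': 'a@example.com'}))
message = build_notification('hello', 'a@example.com', 'b@example.com')
print(message['To'])
```

Reporting which variables are missing, rather than only that the setup is incomplete, tends to make a misconfigured environment quicker to diagnose.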

In [4]:
def download_file(doc_path):
#    local_file = os.path.basename(doc_path)
    local_file = doc_path.split('/')[-1]
    gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
    dest_file = gdrivePrefix + local_file
    with requests.get(doc_path, stream=True) as r:
        with open(dest_file, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    print('Downloaded file: ' + dest_file)
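
Taking the last `/`-separated piece of the URL keeps any query string in the local file name. The following is a small sketch of an alternative that parses the URL first, using only the standard library. The function name `local_name_from_url`, the fallback name, and the example URLs are assumptions for illustration.

```python
import posixpath
from urllib.parse import urlparse, unquote

def local_name_from_url(doc_url, default='download.bin'):
    """Derive a local file name from the path part of a URL (hypothetical helper)."""
    path = urlparse(doc_url).path               # drops the query string and fragment
    name = posixpath.basename(unquote(path))    # decodes %20 and similar escapes
    return name or default                      # fall back when the path ends in '/'

print(local_name_from_url('http://example.com/files/report.pdf?version=2'))
print(local_name_from_url('http://example.com/'))
```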

In [5]:
if (notifyStatus): email_notify("The web scraping process has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Section 1. Perform the Scraping and Processing

In [6]:
# Specifying the URL of the desired web page to be scraped
websiteURL = "http://quotes.toscrape.com"
startingURL = websiteURL + "/login"

# Defining a browser-like User-Agent header (not used by the Selenium session below)
uastring = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0"
headers={'User-Agent': uastring}

In [7]:
if (notifyStatus): email_notify("The page loading and item extraction process has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [8]:
browser = webdriver.Firefox()
try:
    browser.get(startingURL)
    print('Successfully accessed the web page: ' + browser.title)
except Exception:
    print('The web page could not be reached for some reason!')
    sys.exit("Script processing cannot continue!!!")

Successfully accessed the web page: Quotes to Scrape


In [10]:
# Login to the website using the supplied credentials
userid_element = browser.find_element_by_id("username")
userid_element.clear()
userid_element.send_keys("abc")
passwd_element = browser.find_element_by_id("password")
passwd_element.clear()
passwd_element.send_keys("abc")
browser.find_element_by_class_name("btn-primary").click()

In [11]:
# Setting up a dataframe to capture the records
df = pd.DataFrame(columns=['author_name','quote_text','quote_tags','author_link'])
pageNum = 1
i = 0

In [12]:
done = False

while not done :
    quote_listing = browser.find_elements_by_class_name("quote")
    if (verbose): print(quote_listing)

    for quote_item in quote_listing :
        if (verbose): print(quote_item.text)
        author_name = "[Not Found]"
        quote_text = "[Not Found]"
        quote_tags = ""
        author_link = "[Not Found]"

        author_name = quote_item.find_element_by_class_name("author").text
        quote_text = quote_item.find_element_by_class_name("text").text
        author_link = quote_item.find_element_by_tag_name('a').get_attribute("href")
        tag_listing = quote_item.find_elements_by_class_name("tag")
        if len(tag_listing) > 0 :
            for each_tag in tag_listing :
                quote_tags = quote_tags + "#" + each_tag.text

        if (verbose): print(author_name, '|', quote_text, '|', quote_tags, '|', author_link)
        df.loc[i] = [author_name, quote_text, quote_tags, author_link]
        i = i + 1

    if ((pageNum % 5)==0) :
        if (notifyStatus): email_notify("Finished parsing page: " + browser.current_url + " at "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
    pageNum = pageNum + 1

    try:
        next_page = browser.find_element_by_class_name('next')
    except Exception:
        print("No more page to retrieve. The processing has completed!")
        done = True

    if not done:
        next_page_url = next_page.find_element_by_tag_name('a').get_attribute("href")
        print("The URL for the next page is: ", next_page_url)
        # Adding random wait time so we do not hammer the website needlessly
        waitTime = randint(2,5)
        print("Waiting " + str(waitTime) + " seconds to process next page...")
        sleep(waitTime)
        try:
            browser.get(next_page_url)
        except Exception:
            print("Received an error when trying to access a URL. The script will stop!")
            done = True

The URL for the next page is:  http://quotes.toscrape.com/page/2/
Waiting 3 seconds to process next page...
The URL for the next page is:  http://quotes.toscrape.com/page/3/
Waiting 2 seconds to process next page...
The URL for the next page is:  http://quotes.toscrape.com/page/4/
Waiting 4 seconds to process next page...
The URL for the next page is:  http://quotes.toscrape.com/page/5/
Waiting 4 seconds to process next page...
The URL for the next page is:  http://quotes.toscrape.com/page/6/
Waiting 4 seconds to process next page...
The URL for the next page is:  http://quotes.toscrape.com/page/7/
Waiting 5 seconds to process next page...
The URL for the next page is:  http://quotes.toscrape.com/page/8/
Waiting 3 seconds to process next page...
The URL for the next page is:  http://quotes.toscrape.com/page/9/
Waiting 2 seconds to process next page...
The URL for the next page is:  http://quotes.toscrape.com/page/10/
Waiting 2 seconds to process next page...
No more page to retrieve. The processing has completed!
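
The loop above interleaves three concerns: extracting records from the current page, pausing a random 2 to 5 seconds, and following the "next" link until none is left. The following is a minimal sketch of the same control flow with the browser replaced by a hypothetical in-memory `FAKE_SITE`, so it runs without Selenium or network access. The names `scrape_all`, `fetch`, and `pause` are assumptions for illustration, not part of the original script.

```python
from random import randint
from time import sleep

# Hypothetical stand-in for the website:
# each URL maps to (quotes on that page, URL of the next page or None)
FAKE_SITE = {
    '/page/1/': ([('Author A', 'Quote 1', ['life', 'love'])], '/page/2/'),
    '/page/2/': ([('Author B', 'Quote 2', [])], None),
}

def scrape_all(start_url, fetch, pause=sleep):
    """Follow 'next' links from start_url and collect one record per quote."""
    records = []
    url = start_url
    while url is not None:
        quotes, next_url = fetch(url)
        for author, text, tags in quotes:
            records.append({
                'author_name': author,
                'quote_text': text,
                # Same '#tag1#tag2' format the notebook builds by concatenation
                'quote_tags': ''.join('#' + tag for tag in tags),
            })
        if next_url is not None:
            pause(randint(2, 5))  # random wait so the site is not hammered
        url = next_url
    return records

# The pause is replaced with a no-op here so the example finishes instantly
records = scrape_all('/page/1/', FAKE_SITE.get, pause=lambda seconds: None)
print(len(records), records[0]['quote_tags'])
```

Collecting plain dictionaries in a list and passing the list to `pd.DataFrame` once at the end is a common alternative to growing a dataframe row by row with `df.loc[i]`.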

In [13]:
# Close the browsing session
browser.quit()

In [14]:
if (notifyStatus): email_notify("The page loading and item extraction process completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Section 2. Organizing Data and Producing Outputs

In [15]:
print('Some of the final set of records captured are:')
df.head(10)

Some of the final set of records captured are:


| | author_name | quote_text | quote_tags | author_link |
|---|---|---|---|---|
| 0 | Albert Einstein | “The world as we have created it is a process ... | #change#deep-thoughts#thinking#world | http://quotes.toscrape.com/author/Albert-Einstein |
| 1 | J.K. Rowling | “It is our choices, Harry, that show what we t... | #abilities#choices | http://quotes.toscrape.com/author/J-K-Rowling |
| 2 | Albert Einstein | “There are only two ways to live your life. On... | #inspirational#life#live#miracle#miracles | http://quotes.toscrape.com/author/Albert-Einstein |
| 3 | Jane Austen | “The person, be it gentleman or lady, who has ... | #aliteracy#books#classic#humor | http://quotes.toscrape.com/author/Jane-Austen |
| 4 | Marilyn Monroe | “Imperfection is beauty, madness is genius and... | #be-yourself#inspirational | http://quotes.toscrape.com/author/Marilyn-Monroe |
| 5 | Albert Einstein | “Try not to become a man of success. Rather be... | #adulthood#success#value | http://quotes.toscrape.com/author/Albert-Einstein |
| 6 | André Gide | “It is better to be hated for what you are tha... | #life#love | http://quotes.toscrape.com/author/Andre-Gide |
| 7 | Thomas A. Edison | “I have not failed. I've just found 10,000 way... | #edison#failure#inspirational#paraphrased | http://quotes.toscrape.com/author/Thomas-A-Edison |
| 8 | Eleanor Roosevelt | “A woman is like a tea bag; you never know how... | #misattributed-eleanor-roosevelt | http://quotes.toscrape.com/author/Eleanor-Roos... |
| 9 | Steve Martin | “A day without sunshine is like, you know, nig... | #humor#obvious#simile | http://quotes.toscrape.com/author/Steve-Martin |
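
The `author_link` values in the output above point to the site's own author pages, because `find_element_by_tag_name('a')` returns the first anchor inside each quote. The following is a sketch of one way to pick out a Goodreads anchor instead, assuming it can be recognised by `goodreads.com` in its `href`. The markup in `SAMPLE_BYLINE` and its URLs are an assumed illustration, not copied from the site, and the names `LinkCollector` and `goodreads_link` are hypothetical.

```python
from html.parser import HTMLParser

# Assumed markup of one quote's byline after login; the real page may differ
SAMPLE_BYLINE = (
    '<span>by <small class="author">Example Author</small> '
    '<a href="/author/Example-Author">(about)</a> - '
    '<a href="http://goodreads.com/author/show/0000.Example_Author">(Goodreads page)</a>'
    '</span>'
)

class LinkCollector(HTMLParser):
    """Collect the href of every anchor tag in an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

def goodreads_link(html_fragment):
    """Return the first href containing 'goodreads.com', or None if absent."""
    collector = LinkCollector()
    collector.feed(html_fragment)
    for href in collector.links:
        if 'goodreads.com' in href:
            return href
    return None

print(goodreads_link(SAMPLE_BYLINE))
```

Within Selenium, the same idea would amount to iterating over all anchors of a quote and testing each `href`, rather than taking the first anchor found.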


In [16]:
if (writeJSON):
    out_file = df.to_json(orient='records')
    with open('web-scraping-py-bsoup-simple-pagination.json', 'w') as f:
        f.write(out_file)
print('Total number of records processed:', len(df))

Total number of records processed: 100
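
The JSON step above serializes the dataframe with `orient='records'`, which yields a JSON array with one object per row. The following is a rough sketch of the same output shape using only the standard library, with a round trip to confirm the file reads back unchanged. The sample record, the file name `quotes.json`, and the helper names are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical sample record in the same shape as one dataframe row
records = [
    {'author_name': 'André Gide',
     'quote_text': '“A sample quote.”',
     'quote_tags': '#life#love'},
]

def write_records(rows, path):
    """Write the rows as a UTF-8 JSON array, keeping non-ASCII characters readable."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)

def read_records(path):
    """Load the JSON array back into a list of dictionaries."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'quotes.json')
    write_records(records, out_path)
    restored = read_records(out_path)

print(restored == records)
```

Setting `ensure_ascii=False` keeps the curly quotes and accented names legible in the file, which is a matter of preference rather than correctness.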


In [17]:
if (notifyStatus): email_notify("The web scraping process has completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [18]:
print('Total time for the script:', (datetime.now() - startTimeScript))

Total time for the script: 0:12:41.095772
