# Web Scraping of Quotes from Famous People Using BeautifulSoup
### David Lowe
### May 19, 2019

SUMMARY: The purpose of this project is to practice web scraping by gathering specific pieces of information from a website. The web scraping code was written in Python and leveraged the BeautifulSoup module.

INTRODUCTION: A demo website, created by Scrapinghub, lists quotes from famous people. It has many endpoints that present the quotes in different ways, and each endpoint poses a different scraping challenge. For this Take2 iteration, the Python script follows the link from each quote to its author page and scrapes the author information.

Starting URLs: http://quotes.toscrape.com/
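
The link-following step described above can be sketched offline against a small HTML fragment. The markup below is a hand-written stand-in modeled on the selectors this script relies on (`div.quote`, `small.author`, and the first `a` link), not a copy of the live page. The helper name `extract_author_links` is hypothetical, and `html.parser` is used only so the sketch does not depend on lxml.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like one quote block; not copied from the live site
SAMPLE_HTML = """
<div class="quote">
  <span class="text">An example quote.</span>
  <span>by <small class="author">Jane Austen</small>
  <a href="/author/Jane-Austen">(about)</a></span>
</div>
"""

def extract_author_links(html, base_url):
    """Return (author_name, absolute_author_url) pairs found in the quote blocks."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for quote in soup.find_all("div", class_="quote"):
        name = quote.find("small", class_="author").get_text(strip=True)
        href = quote.find("a").get("href")
        # urljoin resolves the relative href against the site root
        pairs.append((name, urljoin(base_url, href)))
    return pairs

pairs = extract_author_links(SAMPLE_HTML, "http://quotes.toscrape.com/")
print(pairs)
```

Using `urljoin` instead of string concatenation is a design choice for the sketch: it handles both relative and absolute `href` values without doubling slashes.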

## Loading Libraries and Packages

In [1]:
import numpy as np
import pandas as pd
import os
import shutil
import smtplib
import sys
from email.message import EmailMessage
from datetime import datetime
import requests
from bs4 import BeautifulSoup
from random import randint
from time import sleep

startTimeScript = datetime.now()

## Setting up the email notification function

In [2]:
def email_notify(msg_text):
    sender = os.environ.get('MAIL_SENDER')
    receiver = os.environ.get('MAIL_RECEIVER')
    gateway = os.environ.get('SMTP_GATEWAY')
    smtpuser = os.environ.get('SMTP_USERNAME')
    password = os.environ.get('SMTP_PASSWORD')
    # Abort early if any of the required environment variables is missing
    if any(value is None for value in (sender, receiver, gateway, smtpuser, password)):
        sys.exit("Incomplete email setup info. Script Processing Aborted!!!")
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    server = smtplib.SMTP(gateway, 587)
    server.starttls()
    server.login(smtpuser, password)
    server.send_message(msg)
    server.quit()
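
Because `email_notify` both builds and sends the message, it cannot be exercised without a live SMTP gateway. As a minimal sketch, assuming one wants to inspect the message on its own, the construction step could be split into a separate helper. The name `build_notification` and the example addresses are placeholders, not part of the original script.

```python
from email.message import EmailMessage

def build_notification(msg_text, sender, receiver):
    """Assemble the notification without opening an SMTP connection."""
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    return msg

# Placeholder addresses for illustration only
demo = build_notification("Scrape started", "sender@example.com", "receiver@example.com")
print(demo['Subject'])
print(demo.get_content().strip())
```

The returned object could then be handed to `server.send_message(...)` exactly as in the function above.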

In [3]:
email_notify("The web scraping process has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Setting up the necessary parameters

In [4]:
# Specifying the URL of desired web page to be scraped
website_url = "http://quotes.toscrape.com"
starting_url = website_url + "/"

# Creating an html document from the URL
uastring = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36"
headers={'User-Agent': uastring}

try:
    s = requests.Session()
    resp = s.get(website_url, headers=headers)
    # requests only raises HTTPError for 4xx/5xx responses when asked to check the status
    resp.raise_for_status()
#     print(resp.text)
except requests.exceptions.HTTPError as e:
    print('The server could not serve up the web page!')
    sys.exit("Script Processing Aborted!!!")
except requests.exceptions.ConnectionError as e:
    print('The server could not be reached!')
    sys.exit("Script Processing Aborted!!!")

try:
    webpage = BeautifulSoup(resp.text, 'lxml')
except AttributeError as e:
    print('Page title could not be found - Might indicate problems!')
    sys.exit("Script Processing Aborted!!!")
else:
    print('Successfully accessed the web page: ' + starting_url)

Successfully accessed the web page: http://quotes.toscrape.com/
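
One detail worth keeping in mind for the error handling in this step: `requests` returns a response object even for 4xx and 5xx status codes, and `HTTPError` is raised only when `raise_for_status()` is called. The sketch below shows that behavior without touching the network. The responses are built by hand purely for illustration, and `status_is_ok` is a hypothetical helper name.

```python
import requests

def status_is_ok(resp):
    """Return True when the response passes raise_for_status(), False on an HTTP error."""
    try:
        resp.raise_for_status()
    except requests.exceptions.HTTPError:
        return False
    return True

# Hand-built responses for illustration; a real one would come from Session.get()
missing = requests.Response()
missing.status_code = 404
found = requests.Response()
found.status_code = 200

print(status_is_ok(missing), status_is_ok(found))
```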


## Performing the Scraping and Processing

In [5]:
email_notify("The page loading and item extraction process has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [6]:
# Setting up a dataframe to capture the records
df = pd.DataFrame(columns=['Author_Name','Author_Birthday','Author_Location','Author_Bio','Author_Link'])
pageNum = 1
i = 0

In [7]:
done = False

while not done :
    quote_listing = webpage.find_all("div", class_="quote")
#     print(quote_listing)

    for quote_item in quote_listing :
        author_name = "[Not Found]"
        author_birthday = "[Not Found]"
        author_location = "[Not Found]"
        author_bio = "[Not Found]"
        author_link = "[Not Found]"

        author_name = quote_item.find("small", class_="author").string
        author_link = website_url + quote_item.find('a').get('href')

        # Exact-match check; str.contains would treat the name as a regex and match substrings
        if author_name not in df['Author_Name'].values :
            # Adding random wait time so we do not hammer the website needlessly
            waitTime = randint(2,5)
            print("Waiting " + str(waitTime) + " seconds to process next page...")
            sleep(waitTime)
            try:
                resp = s.get(author_link, headers=headers)
                # requests only raises HTTPError when asked to check the status code
                resp.raise_for_status()
            except requests.exceptions.RequestException as e:
                # Keep the "[Not Found]" placeholders for this author
                print("Unable to retrieve the author detail page!")
            else:
                try:
                    authorPage = BeautifulSoup(resp.text, 'lxml')
                    # Extract the fields here so a failed request never reuses a stale authorPage
                    author_birthday = authorPage.find("span", class_="author-born-date").string
                    author_location = authorPage.find("span", class_="author-born-location").string
                    author_bio = authorPage.find("div", class_="author-description").string
                except AttributeError as e:
                    # find() returned None, so an expected element is missing from the page
                    print('Author details could not be found - Might indicate problems!')
                    sys.exit("Script Processing Aborted!!!")
                else:
                    print('Successfully accessed the web page: ' + author_link)

#             print(author_name, author_birthday, author_location, author_bio[0:25], author_link)
            df.loc[i] = [author_name, author_birthday, author_location, author_bio, author_link]
            i = i + 1

    if ((pageNum % 5)==0) :
        email_notify("Finished parsing page: " + next_page_url + " at "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
    pageNum = pageNum + 1

    next_page = webpage.find("li", class_="next")
    if next_page != None :
        next_page_url = website_url + next_page.find('a').get('href')
#         print(next_page_url)

        # Adding random wait time so we do not hammer the website needlessly
        waitTime = randint(2,5)
        print("Waiting " + str(waitTime) + " seconds to process next page...")
        sleep(waitTime)
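        try:
            resp = s.get(next_page_url, headers=headers)
            # Turn 4xx/5xx responses into HTTPError; requests does not raise them by default
            resp.raise_for_status()
        except requests.exceptions.HTTPError as e:
            print("No more page to retrieve. The processing has completed!")
            done = True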
        else:
            try:
                webpage = BeautifulSoup(resp.text, 'lxml')
            except AttributeError as e:
                print('Page title could not be found - Might indicate problems!')
                sys.exit("Script Processing Aborted!!!")
            else:
                print('Successfully accessed the web page: ' + next_page_url)
    else :
        done = True

Waiting 5 seconds to process next page...
Successfully accessed the web page: http://quotes.toscrape.com/author/Albert-Einstein
Waiting 4 seconds to process next page...
Successfully accessed the web page: http://quotes.toscrape.com/author/J-K-Rowling
Waiting 2 seconds to process next page...
Successfully accessed the web page: http://quotes.toscrape.com/author/Jane-Austen
Waiting 4 seconds to process next page...
Successfully accessed the web page: http://quotes.toscrape.com/author/Marilyn-Monroe
Waiting 5 seconds to process next page...
Successfully accessed the web page: http://quotes.toscrape.com/author/Andre-Gide
Waiting 5 seconds to process next page...
Successfully accessed the web page: http://quotes.toscrape.com/author/Thomas-A-Edison
Waiting 5 seconds to process next page...
Successfully accessed the web page: http://quotes.toscrape.com/author/Eleanor-Roosevelt
Waiting 2 seconds to process next page...
Successfully accessed the web page: http://quotes.toscrape.com/author/Stev
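
The loop above requests each author page only once, even though the same author can appear in many quotes. A minimal sketch of that first-seen filtering, using a plain set with exact matching rather than the DataFrame lookup, is shown below. The function name `first_seen_authors` and the sample pairs are hypothetical and serve only as illustration.

```python
def first_seen_authors(pairs):
    """Yield each (name, link) pair the first time its author name appears."""
    seen = set()
    for name, link in pairs:
        if name in seen:
            continue  # this author's detail page was already handled
        seen.add(name)
        yield name, link

# Illustrative input: the same author quoted twice on a listing page
sample = [
    ("Albert Einstein", "/author/Albert-Einstein"),
    ("J.K. Rowling", "/author/J-K-Rowling"),
    ("Albert Einstein", "/author/Albert-Einstein"),
]
unique = list(first_seen_authors(sample))
print(unique)
```

A set lookup compares whole names, so an author whose name contains punctuation such as periods is matched literally.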

## Organizing Data and Producing Outputs

In [8]:
out_file = df.to_json(orient='records')
with open('web-scraping-py-bsoup-detail-page.json', 'w') as f:
    f.write(out_file)
print('Total number of records written to file:', len(df))
email_notify("The web scraping process has completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
print ('Total time for the script:', (datetime.now() - startTimeScript))

Total number of records written to file: 50
Total time for the script: 0:04:14.221953
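
For readers who want to consume the output file, `to_json(orient='records')` serializes the DataFrame as a JSON array with one object per row, keyed by column name. The sketch below illustrates that shape with made-up rows and a subset of the columns; the values are placeholders, not scraped data.

```python
import json
import pandas as pd

# Made-up rows for illustration; the real values come from the scraped author pages
demo_df = pd.DataFrame(
    [["Author A", "January 1, 1900", "in Somewhere"],
     ["Author B", "February 2, 1901", "in Elsewhere"]],
    columns=['Author_Name', 'Author_Birthday', 'Author_Location'],
)

# Parse the JSON string back into Python objects to inspect its structure
records = json.loads(demo_df.to_json(orient='records'))
print(len(records), sorted(records[0].keys()))
```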
