# Web Scraping of Quotes from Famous People Using BeautifulSoup
### David Lowe
### May 19, 2019

SUMMARY: The purpose of this project is to practice web scraping by gathering specific pieces of information from a website. The web scraping code was written in Python and leveraged the BeautifulSoup module.

INTRODUCTION: A demo website, created by Scrapinghub, lists quotes from famous people. It exposes many endpoints that display the quotes in different ways, and each endpoint presents a different scraping challenge. For this Take3 iteration, the Python script scrapes the quote information displayed on an infinite-scrolling page.

Note: For this iteration, the website returns the data as JSON when queried through its API URL. As a result, the BeautifulSoup module is not needed to parse the pages in this iteration.

Starting URLs: http://quotes.toscrape.com/
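
Because the API responds with JSON, each page can be handled as a plain Python dictionary. The sketch below parses a hand-written sample payload. The field names (`quotes`, `has_next`, `text`, `tags`, `author.name`, `author.goodreads_link`) are the ones the script reads later on, but the sample values are made up for illustration, and the real response may carry additional fields.

```python
import json

# Hypothetical sample payload; the values are illustrative, not real API output.
sample_payload = """
{
  "has_next": true,
  "quotes": [
    {
      "text": "An example quote.",
      "tags": ["example", "demo"],
      "author": {"name": "Jane Doe", "goodreads_link": "/author/show/0.Jane_Doe"}
    }
  ]
}
"""

# json.loads turns the response text into nested dicts and lists
page = json.loads(sample_payload)
first_quote = page["quotes"][0]

print(first_quote["author"]["name"])
print(first_quote["tags"])
print(page["has_next"])
```

With `requests`, the same dictionary comes from calling `.json()` on the response, so no HTML parser is involved.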

## Loading Libraries and Packages

In [1]:
import numpy as np
import pandas as pd
import os
import shutil
import smtplib
import sys
from email.message import EmailMessage
from datetime import datetime
import requests
from bs4 import BeautifulSoup
from random import randint
from time import sleep

startTimeScript = datetime.now()

## Setting up the email notification function

In [2]:
def email_notify(msg_text):
    sender = os.environ.get('MAIL_SENDER')
    receiver = os.environ.get('MAIL_RECEIVER')
    gateway = os.environ.get('SMTP_GATEWAY')
    smtpuser = os.environ.get('SMTP_USERNAME')
    password = os.environ.get('SMTP_PASSWORD')
    # Abort early if any of the required environment variables is missing
    if sender is None or receiver is None or gateway is None or smtpuser is None or password is None:
        sys.exit("Incomplete email setup info. Script Processing Aborted!!!")
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    # The context manager sends QUIT and closes the connection, even on errors
    with smtplib.SMTP(gateway, 587) as server:
        server.starttls()
        server.login(smtpuser, password)
        server.send_message(msg)

In [3]:
email_notify("The web scraping process has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
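
The message-building part of `email_notify` can be exercised without an SMTP server. In the sketch below, `build_notification` is a hypothetical helper introduced only for illustration: it constructs the `EmailMessage` and leaves sending to the caller. The addresses are placeholders.

```python
from datetime import datetime
from email.message import EmailMessage


def build_notification(msg_text, sender, receiver):
    """Build the notification email without sending it (hypothetical helper)."""
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    return msg


# Placeholder addresses, for illustration only
stamp = datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p')
msg = build_notification("The web scraping process has begun! " + stamp,
                         "sender@example.com", "receiver@example.com")
print(msg['Subject'])
print(msg.get_content())
```

Separating construction from delivery makes it possible to check the subject, recipients, and timestamp format before any credentials are configured.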

## Setting up the necessary parameters

In [4]:
# Specifying the URL of the desired web page to be scraped
api_url = 'http://quotes.toscrape.com/api/quotes?page='
pageNum = 1
website_url = api_url + str(pageNum)

# Requesting the first page of JSON data from the API
uastring = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36"
headers={'User-Agent': uastring}

try:
    s = requests.Session()
    resp = s.get(website_url, headers=headers)
    # requests does not raise on 4xx/5xx responses unless asked to
    resp.raise_for_status()
#     print(resp.text)
except requests.exceptions.HTTPError as e:
    print('The server could not serve up the web page!')
    sys.exit("Script Processing Aborted!!!")
except requests.exceptions.ConnectionError as e:
    print('The server could not be reached!')
    sys.exit("Script Processing Aborted!!!")

webpage = resp.json()
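
The request-and-check pattern above is repeated later for every subsequent page, so it could be factored into one function. The sketch below is one possible shape, not the notebook's own code: the name `fetch_json`, the 10-second timeout, and the choice to return `None` on failure are all assumptions made for illustration.

```python
import requests


def fetch_json(session, url, headers=None, timeout=10):
    """Return the decoded JSON body of `url`, or None if the request fails.

    Hypothetical helper; the timeout value is an assumption.
    """
    try:
        resp = session.get(url, headers=headers, timeout=timeout)
        # Turn 4xx/5xx responses into requests.exceptions.HTTPError
        resp.raise_for_status()
    except requests.exceptions.HTTPError:
        print('The server could not serve up the web page!')
        return None
    except requests.exceptions.ConnectionError:
        print('The server could not be reached!')
        return None
    except requests.exceptions.Timeout:
        print('The server took too long to respond!')
        return None
    return resp.json()
```

A caller would write something like `webpage = fetch_json(s, website_url, headers=headers)` and decide what to do when the result is `None`, for example calling `sys.exit` as the notebook does.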

## Performing the Scraping and Processing

In [5]:
email_notify("The page loading and item extraction process has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [6]:
# Setting up a dataframe to capture the records
df = pd.DataFrame(columns=['Author_Name','Quote_Text','Quote_Tags','Author_Link'])
i = 0

In [7]:
done = False

while not done :
    print('Parsing web page for quotes:',website_url)
    for quote_item in webpage['quotes']:
        author_name = "[Not Found]"
        quote_text = "[Not Found]"
        quote_tags = ""
        author_link = "[Not Found]"

        author_name = quote_item['author']['name']
        quote_text = quote_item['text']
        tag_listing = quote_item['tags']
        # Prefix each tag with "#" and concatenate; an empty list yields ""
        quote_tags = "".join("#" + each_tag for each_tag in tag_listing)
        author_link = "https://www.goodreads.com" + quote_item['author']['goodreads_link']
#         print(author_name, quote_text, quote_tags, author_link)

        df.loc[i] = [author_name, quote_text, quote_tags, author_link]
        i = i + 1

    if ((pageNum % 5)==0) :
        email_notify("Finished parsing page: " + website_url + " at "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

    if webpage['has_next'] :
        pageNum = pageNum + 1
        website_url = api_url + str(pageNum)
        # Adding random wait time so we do not hammer the website needlessly
        waitTime = randint(2,5)
        print("Waiting " + str(waitTime) + " seconds to process next page...")
        sleep(waitTime)

        try:
            resp = s.get(website_url, headers=headers)
            # requests does not raise on 4xx/5xx responses unless asked to
            resp.raise_for_status()
        except requests.exceptions.HTTPError as e:
            print('The server could not serve up the web page!')
            sys.exit("Script Processing Aborted!!!")
        except requests.exceptions.ConnectionError as e:
            print('The server could not be reached!')
            sys.exit("Script Processing Aborted!!!")

        webpage = resp.json()
    else :
        done = True

Parsing web page for quotes: http://quotes.toscrape.com/api/quotes?page=1
Waiting 2 seconds to process next page...
Parsing web page for quotes: http://quotes.toscrape.com/api/quotes?page=2
Waiting 2 seconds to process next page...
Parsing web page for quotes: http://quotes.toscrape.com/api/quotes?page=3
Waiting 4 seconds to process next page...
Parsing web page for quotes: http://quotes.toscrape.com/api/quotes?page=4
Waiting 4 seconds to process next page...
Parsing web page for quotes: http://quotes.toscrape.com/api/quotes?page=5
Waiting 5 seconds to process next page...
Parsing web page for quotes: http://quotes.toscrape.com/api/quotes?page=6
Waiting 4 seconds to process next page...
Parsing web page for quotes: http://quotes.toscrape.com/api/quotes?page=7
Waiting 4 seconds to process next page...
Parsing web page for quotes: http://quotes.toscrape.com/api/quotes?page=8
Waiting 3 seconds to process next page...
Parsing web page for quotes: http://quotes.toscrape.com/api/quotes?page=
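
The loop above follows the `has_next` flag from page to page and flattens each quote into one row. The same logic can be sketched as a generator that is independent of the network, which makes it easy to try against made-up data. Here `iter_quote_records` and the `fetch_page` callable (page number in, decoded JSON dictionary out) are hypothetical names, and the two fake pages are invented for illustration. The random wait between pages is left out of this sketch.

```python
def iter_quote_records(fetch_page):
    """Yield one flat record per quote, following `has_next` across pages.

    `fetch_page` is a hypothetical callable: page number -> decoded JSON dict.
    """
    page_num = 1
    while True:
        page = fetch_page(page_num)
        for item in page['quotes']:
            yield {
                'Author_Name': item['author']['name'],
                'Quote_Text': item['text'],
                # Same "#tag1#tag2" format as the notebook's loop
                'Quote_Tags': "".join("#" + tag for tag in item['tags']),
                'Author_Link': "https://www.goodreads.com" + item['author']['goodreads_link'],
            }
        if not page['has_next']:
            break
        page_num += 1


# Two made-up pages standing in for the API responses
fake_pages = {
    1: {'has_next': True, 'quotes': [
        {'text': 'First.', 'tags': ['a', 'b'],
         'author': {'name': 'Jane Doe', 'goodreads_link': '/author/show/0.Jane_Doe'}}]},
    2: {'has_next': False, 'quotes': [
        {'text': 'Second.', 'tags': [],
         'author': {'name': 'John Roe', 'goodreads_link': '/author/show/1.John_Roe'}}]},
}

records = list(iter_quote_records(lambda n: fake_pages[n]))
print(len(records))
print(records[0]['Quote_Tags'])
```

A list of dictionaries like `records` can be passed to `pd.DataFrame(records)` in one step, which avoids growing the dataframe row by row with `df.loc[i]`.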

## Organizing Data and Producing Outputs

In [8]:
out_file = df.to_json(orient='records')
with open('web-scraping-py-bsoup-infinite-scrolling.json', 'w') as f:
    f.write(out_file)
print('Total number of records written to file:', len(df))
email_notify("The web scraping process has completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
print ('Total time for the script:', (datetime.now() - startTimeScript))

Total number of records written to file: 100
Total time for the script: 0:00:43.273316
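
A quick way to sanity-check the output is to read the file back and confirm the record count and field names. The sketch below does this against a temporary file holding one made-up record in the same array-of-objects shape that `df.to_json(orient='records')` produces. The file name `sample-quotes.json` and the record values are placeholders, not the notebook's real output.

```python
import json
import os
import tempfile

# Made-up record in the same shape as the notebook's JSON output
records = [
    {"Author_Name": "Jane Doe", "Quote_Text": "First.", "Quote_Tags": "#a#b",
     "Author_Link": "https://www.goodreads.com/author/show/0.Jane_Doe"},
]

# Write to a temporary directory so the sketch does not touch the real output file
path = os.path.join(tempfile.mkdtemp(), 'sample-quotes.json')
with open(path, 'w') as f:
    json.dump(records, f)

# Read the file back and check every record has exactly the expected fields
with open(path) as f:
    loaded = json.load(f)

expected_fields = {'Author_Name', 'Quote_Text', 'Quote_Tags', 'Author_Link'}
bad_rows = [i for i, rec in enumerate(loaded) if set(rec) != expected_fields]

print('Records read back:', len(loaded))
print('Records with unexpected fields:', len(bad_rows))
```

Pointing `path` at `web-scraping-py-bsoup-infinite-scrolling.json` would apply the same check to the file written by the notebook.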
