# Web Scraping of SAS Global Forum 2019 Proceedings Using BeautifulSoup
### David Lowe
### June 2, 2019

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code uses the BeautifulSoup module.

INTRODUCTION: Occasionally we need to download a batch of documents from a single web page without clicking each download link one at a time. This web scraping script automatically traverses the entire web page, collects all links to the PDF documents, and downloads those documents as part of the scraping process.

Starting URL: https://www.sas.com/en_us/events/sas-global-forum/program/proceedings.html

## Loading Libraries and Packages

In [1]:
import numpy as np
import pandas as pd
import os
import shutil
import smtplib
import sys
from email.message import EmailMessage
from datetime import datetime
import requests
from requests.exceptions import HTTPError
from requests.exceptions import ConnectionError
import urllib.request
from bs4 import BeautifulSoup
from random import randint
from time import sleep

startTimeScript = datetime.now()

## Setting up the email notification function

In [2]:
def email_notify(msg_text):
    sender = os.environ.get('MAIL_SENDER')
    receiver = os.environ.get('MAIL_RECEIVER')
    gateway = os.environ.get('SMTP_GATEWAY')
    smtpuser = os.environ.get('SMTP_USERNAME')
    password = os.environ.get('SMTP_PASSWORD')
    if None in (sender, receiver, gateway, smtpuser, password):
        sys.exit("Incomplete email setup info. Script Processing Aborted!!!")
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    server = smtplib.SMTP(gateway, 587)
    server.starttls()
    server.login(smtpuser, password)
    server.send_message(msg)
    server.quit()
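
The notification function above builds the message and delivers it in one step. As a minimal sketch, assuming it is acceptable to split those two steps, the message can be built without touching the network and the connection can be closed automatically by a `with` block. `build_notification` and `send_notification` are hypothetical helper names, not part of the original script, and the addresses are placeholders; port 587 with STARTTLS is carried over from the function above.

```python
import smtplib
from email.message import EmailMessage


def build_notification(msg_text, sender, receiver):
    """Build the notification email without opening any connection."""
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    return msg


def send_notification(msg, gateway, smtpuser, password, port=587):
    """Send the message; the with-block closes the connection even on errors."""
    with smtplib.SMTP(gateway, port) as server:
        server.starttls()
        server.login(smtpuser, password)
        server.send_message(msg)


# Placeholder addresses (hypothetical values); nothing is sent here
example = build_notification("The web scraping process has begun!",
                             "sender@example.com", "receiver@example.com")
print(example['Subject'])
```

Keeping the message construction separate makes it possible to check the subject and recipients without an SMTP gateway.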

In [3]:
email_notify("The web scraping process has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Setting up the necessary parameters

In [4]:
# Set up the verbose flag to print detailed messages for debugging (only YES will activate!)
verbose = "!YES"

In [5]:
# Specifying the URL of desired web page to be scraped
startingURL = "https://www.sas.com/en_us/events/sas-global-forum/program/proceedings.html"
homeURL = "https://www.sas.com"

# Setting a browser-like User-Agent header to send with each HTTP request
uastring = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0"
headers={'User-Agent': uastring}

## Performing the Scraping and Processing

In [6]:
email_notify("The web page loading and item extraction process has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [7]:
# Adding random wait time so we do not hammer the website needlessly
websiteURL = startingURL
waitTime = randint(2,5)
print("Waiting " + str(waitTime) + " seconds to process URL: " + websiteURL)
sleep(waitTime)

try:
    s = requests.Session()
    resp = s.get(websiteURL, headers=headers)
    # Raise HTTPError on 4xx/5xx responses so the handler below can catch them
    resp.raise_for_status()
    if (verbose=='YES'): print(resp.text)
    print('Successfully accessed the web page: ' + websiteURL)
    webPage = BeautifulSoup(resp.text, 'lxml')
except HTTPError as e:
    print('The server could not serve up the web page!')
    sys.exit("Script processing cannot continue!!!")
except ConnectionError as e:
    print('The server could not be reached due to connection issues!')
    sys.exit("Script processing cannot continue!!!")

Waiting 5 seconds to process URL: https://www.sas.com/en_us/events/sas-global-forum/program/proceedings.html
Successfully accessed the web page: https://www.sas.com/en_us/events/sas-global-forum/program/proceedings.html
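
The same request-and-check pattern is used again for the data URL further down. One possible way to share it, sketched here under the assumption that returning `None` on failure suits the caller, is a small helper. `fetch_html` and its `timeout` default are hypothetical and not part of the original script; `raise_for_status()` is the standard `requests` call that turns 4xx/5xx responses into `HTTPError`.

```python
import requests


def fetch_html(session, url, headers=None, timeout=30):
    """Return the page text, or None if the request fails."""
    try:
        resp = session.get(url, headers=headers, timeout=timeout)
        # Without this call a 404 or 500 response would not raise anything
        resp.raise_for_status()
    except requests.exceptions.HTTPError as e:
        print('The server could not serve up the web page:', e)
        return None
    except (requests.exceptions.ConnectionError,
            requests.exceptions.Timeout) as e:
        print('The server could not be reached:', e)
        return None
    return resp.text
```

A call might look like `fetch_html(requests.Session(), websiteURL, headers=headers)`, with the caller deciding whether a `None` result should stop the script.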


In [8]:
outerSection = webPage.find(id="tabcontent_all-papers")
if (verbose=='YES'): print(outerSection)

In [9]:
dataURL = homeURL + outerSection.find(class_="async-list")['data-url']
if (verbose=='YES'): print(dataURL)
try:
    s = requests.Session()
    resp = s.get(dataURL, headers=headers)
    # Raise HTTPError on 4xx/5xx responses so the handler below can catch them
    resp.raise_for_status()
    print('Successfully accessed the data URL at: ' + dataURL)
    innerSection = BeautifulSoup(resp.text, 'lxml')
except HTTPError as e:
    print('The server could not serve up the web page!')
    sys.exit("Script processing cannot continue!!!")
except ConnectionError as e:
    print('The server could not be reached due to connection issues!')
    sys.exit("Script processing cannot continue!!!")

Successfully accessed the data URL at: https://www.sas.com/content/sascom/en_us/events/sas-global-forum/program/proceedings/jcr:content/par/styledcontainer_1306083555/par/tabwrapper/tabwrapperpar/tab/tabpar/styledcontainer/par/listgrouppdf.ajaxlist.html
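
The data URL above is built by concatenating `homeURL` with the `data-url` attribute, which works as long as that attribute is a site-relative path. A hedged alternative is `urllib.parse.urljoin` from the standard library, which also copes with an attribute that already holds an absolute URL. The paths below are made-up placeholders, not the real attribute value.

```python
from urllib.parse import urljoin

homeURL = "https://www.sas.com"

# A site-relative path (hypothetical value) is resolved against the host
relative_result = urljoin(homeURL, "/content/example/list.ajaxlist.html")
print(relative_result)

# An already absolute URL is returned as-is instead of being doubled up
absolute_result = urljoin(homeURL, "https://www.sas.com/content/example/list.ajaxlist.html")
print(absolute_result)
```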


In [10]:
collection = innerSection.find_all('a')
if (verbose=='YES'): print(collection)
i = 0
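
To illustrate what the link collection step yields, here is a small sketch that runs `find_all` on an inline HTML fragment rather than on the live page. The fragment and its URLs are invented for illustration, and the built-in `html.parser` is swapped in for `lxml` so the example needs no extra parser package.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the downloaded list of papers
sample_html = """
<ul>
  <li><a href="https://www.example.com/papers/0001-2019.pdf">Paper one</a></li>
  <li><a href="https://www.example.com/about.html">About</a></li>
  <li><a name="no-link">Anchor without href</a></li>
  <li><a href="https://www.example.com/papers/0002-2019.PDF">Paper two</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# href=True skips anchors that have no href attribute at all;
# lower() makes the extension check case-insensitive
pdf_links = [a['href'] for a in soup.find_all('a', href=True)
             if a['href'].lower().endswith('.pdf')]
print(pdf_links)
```

Filtering while collecting keeps the later download loop free of anchors that would never be downloaded.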

In [11]:
for item in collection:
    # Anchors without an href attribute yield an empty string and are skipped
    doc_path = item.get('href', '')
    if (verbose=='YES'): print('Found the document link : ' + doc_path)
    if doc_path.endswith('.pdf'):
        # Adding random wait time so we do not hammer the website needlessly
        waitTime = randint(3,8)
        print("Waiting " + str(waitTime) + " seconds to retrieve the next document.")
        sleep(waitTime)
        i = i + 1
        dest_file = os.path.basename(doc_path)
        print('Downloading document: ' + doc_path + " as " + dest_file)
        with urllib.request.urlopen(doc_path) as in_resp, open(dest_file, 'wb') as out_file:
            shutil.copyfileobj(in_resp, out_file)

Waiting 5 seconds to retrieve the next document.
Downloading document: https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2019/3115-2019.pdf as 3115-2019.pdf
Waiting 8 seconds to retrieve the next document.
Downloading document: https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2019/3497-2019.pdf as 3497-2019.pdf
Waiting 8 seconds to retrieve the next document.
Downloading document: https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2019/4057-2019.pdf as 4057-2019.pdf
Waiting 8 seconds to retrieve the next document.
Downloading document: https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2019/3238-2019.pdf as 3238-2019.pdf
Waiting 5 seconds to retrieve the next document.
Downloading document: https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2019/3163-2019.pdf as 3163-2019.pdf
Waiting 3 seconds to retrieve the next document.
Downloading document: https://www.sas.com/
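
The loop above derives the local file name by applying `os.path.basename` to the whole link, which is fine for the links shown in the output. As a sketch of a slightly more defensive variant, assuming links might someday carry a query string or an upper-case extension, the path can be isolated with `urllib.parse.urlparse` first. `pdf_dest_name` is a hypothetical helper, and the `example.com` URLs are invented for illustration.

```python
import os
from urllib.parse import urlparse


def pdf_dest_name(url):
    """Return the local file name for a PDF link, or None for other links."""
    path = urlparse(url).path  # drops any ?query or #fragment part
    name = os.path.basename(path)
    return name if name.lower().endswith('.pdf') else None


print(pdf_dest_name("https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2019/3115-2019.pdf"))
print(pdf_dest_name("https://www.example.com/docs/paper.pdf?download=1"))
print(pdf_dest_name("https://www.example.com/index.html"))
```

A `None` result can be used to skip the link, playing the same role as the `endswith('.pdf')` test in the loop.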

## Organizing Data and Producing Outputs

In [12]:
print('Number of documents processed:', i)
email_notify("The web scraping process has completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
print('Total time for the script:', (datetime.now() - startTimeScript))

Number of documents processed: 367
Total time for the script: 0:41:16.108046
