# Web Scraping of O'Reilly Software Architecture Conference New York 2019
### David Lowe
### June 30, 2019

SUMMARY: This project practices web scraping by extracting specific pieces of information from a website. The Python scraping code uses the BeautifulSoup module.

INTRODUCTION: Occasionally we need to download a batch of documents from a single web page without clicking each download link one at a time. This web scraping script traverses the entire web page, collects the links to the attached documents (PDF, PPTX, and ZIP files), and downloads them as part of the scraping process.

Starting URL: https://conferences.oreilly.com/software-architecture/sa-ny-2019/public/schedule/proceedings
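
As a rough illustration of the link-collection step described above, the sketch below parses a small inline HTML snippet instead of the live page. The sample markup and the `collect_links` helper are hypothetical; only the `attach` class name comes from the script later in this notebook. It uses Python's built-in `html.parser` backend, so it should run without `lxml` installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the proceedings page
SAMPLE_HTML = """
<ul>
  <li><a class="attach" href="https://example.com/files/talk-one.pdf">Slides</a></li>
  <li><a class="attach" href="https://example.com/files/talk-two.pptx">Slides</a></li>
  <li><a href="https://example.com/about">About</a></li>
</ul>
"""

def collect_links(html):
    """Return the href of every <a class="attach"> element that has one."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.find_all("a", class_="attach")
    return [a["href"] for a in anchors if a.has_attr("href")]

links = collect_links(SAMPLE_HTML)
print(links)
```

The third link is skipped because it does not carry the `attach` class.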

## Loading Libraries and Packages

In [1]:
# Refresh package repositories and set up additional Linux and Python packages
!apt-get update
!apt install chromium-chromedriver
!pip install -q pymysql selenium

Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Hit:2 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Hit:3 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:4 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Ign:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:6 http://ppa.launchpad.net/marutter/c2d4u3.5/ubuntu bionic InRelease [15.4 kB]
Ign:7 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:8 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
Hit:9 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:10 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/ InRelease [3,626 B]
Get:11 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:13 http://ppa.launchpad.net/marutter/c2d4u3.5/ubuntu bionic/main Sources [1,647

In [0]:
import numpy as np
import pandas as pd
import os
import shutil
import smtplib
import sys
from email.message import EmailMessage
from datetime import datetime
import requests
from requests.exceptions import HTTPError
from requests.exceptions import ConnectionError
from bs4 import BeautifulSoup
from random import randint
from time import sleep
from selenium import webdriver
import pymysql

startTimeScript = datetime.now()

## Setting up the basic functions

In [0]:
def email_notify(msg_text):
    sender = os.environ.get('MAIL_SENDER')
    receiver = os.environ.get('MAIL_RECEIVER')
    gateway = os.environ.get('SMTP_GATEWAY')
    smtpuser = os.environ.get('SMTP_USERNAME')
    password = os.environ.get('SMTP_PASSWORD')
    if sender is None or receiver is None or gateway is None or smtpuser is None or password is None:
        sys.exit("Incomplete email setup info. Script Processing Aborted!!!")
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    server = smtplib.SMTP(gateway, 587)
    server.starttls()
    server.login(smtpuser, password)
    server.send_message(msg)
    server.quit()
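
One possible variation on the function above, shown only as a sketch: separating the settings check and the message construction from the SMTP calls makes both easy to try without a mail server. The helper names `missing_mail_settings` and `build_message`, and the sample addresses, are assumptions made for illustration; the environment variable names and the subject line are taken from `email_notify`.

```python
import os
from email.message import EmailMessage

# Environment variables that email_notify expects to be set
REQUIRED_VARS = ("MAIL_SENDER", "MAIL_RECEIVER", "SMTP_GATEWAY",
                 "SMTP_USERNAME", "SMTP_PASSWORD")

def missing_mail_settings(env=os.environ):
    """Return the names of required settings that are absent or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

def build_message(msg_text, sender, receiver):
    """Build the notification message without sending it."""
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg["Subject"] = "Notification from Python Web Scraping Script"
    msg["From"] = sender
    msg["To"] = receiver
    return msg

# Hypothetical, partially filled environment for demonstration
fake_env = {"MAIL_SENDER": "me@example.com", "MAIL_RECEIVER": "you@example.com"}
missing = missing_mail_settings(fake_env)
print("Missing settings:", missing)

msg = build_message("Scraping started", "me@example.com", "you@example.com")
print(msg["Subject"])
```

Reporting which settings are missing, rather than only that something is missing, can make a misconfigured environment quicker to diagnose.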

In [0]:
def download_file(doc_path):
    # Use the last segment of the URL as the local file name
    local_file = doc_path.split('/')[-1]
    gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
    dest_file = gdrivePrefix + local_file
    # Stream the response to disk so large files are not held in memory
    with requests.get(doc_path, stream=True) as r:
        with open(dest_file, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    print('Downloaded file: ' + dest_file)
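
Because the function above takes the last URL segment verbatim, any percent-encoding in the URL (such as `%20` for a space) ends up in the saved file name. If decoded names are preferred, a minimal sketch using only the standard library might look like the following. The helper name `local_name_from_url` and the sample URL are hypothetical and are not part of the original script.

```python
import posixpath
from urllib.parse import urlparse, unquote

def local_name_from_url(doc_url):
    """Derive a readable file name from a document URL.

    The query string and fragment are ignored, and percent-encoded
    characters in the last path segment are decoded.
    """
    path = urlparse(doc_url).path
    return unquote(posixpath.basename(path))

# Hypothetical URL used only for demonstration
name = local_name_from_url(
    "https://cdn.example.com/assets/7%20Years%20of%20DDD%20Presentation.pdf?download=1"
)
print(name)
```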

## Setting up the necessary parameters

In [5]:
# Mount Google Drive locally

from google.colab import drive
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/gdrive


In [0]:
# Set up the verbose flag to print detailed messages for debugging (setting True will activate!)
verbose = False

# Set up the sendNotification flag to send progress emails (setting True will send emails!)
sendNotification = False

# Set up the executeDownload flag to download files (setting True will download!)
executeDownload = True

In [0]:
if (sendNotification): email_notify("The web scraping process has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [0]:
# Specifying the URL of the web page to be scraped
startingURL = "https://conferences.oreilly.com/software-architecture/sa-ny-2019/public/schedule/proceedings"

# Setting up the request headers with a browser-like User-Agent string
uastring = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0"
headers={'User-Agent': uastring}

## Performing the Scraping and Processing

In [0]:
if (sendNotification): email_notify("The web page loading and item extraction process has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [10]:
try:
    s = requests.Session()
    resp = s.get(startingURL, headers=headers)
    # requests only raises HTTPError for 4xx/5xx responses when asked to
    resp.raise_for_status()
    if (verbose): print(resp.text)
except HTTPError as e:
    print('The server could not serve up the web page!')
    sys.exit("Script processing cannot continue!!!")
except ConnectionError as e:
    print('The server could not be reached due to connection issues!')
    sys.exit("Script processing cannot continue!!!")

if (resp.status_code==requests.codes.ok):
    print('Successfully accessed the web page: ' + startingURL)
    webPage = BeautifulSoup(resp.text, 'lxml')

Successfully accessed the web page: https://conferences.oreilly.com/software-architecture/sa-ny-2019/public/schedule/proceedings


In [11]:
# Gather all links to the documents
collection = webPage.find_all("a", class_="attach")
i = 0

for item in collection:
    if (verbose): print(item)
    docPath = item['href']
    if docPath.endswith((".pdf", ".pptx", ".zip")):
        i = i + 1
        # Adding random wait time so we do not hammer the website needlessly
        waitTime = randint(2,5)
        print("Waiting " + str(waitTime) + " seconds to retrieve " + docPath)
        sleep(waitTime)
        if (executeDownload): download_file(docPath)

print('Finished finding all available documents on the web page!')

Waiting 2 seconds to retrieve https://cdn.oreillystatic.com/en/assets/1/event/289/7%20Years%20of%20DDD_%20Tackling%20Complexity%20in%20Large-Scale%20Marketing%20Systems%20Presentation.pdf
Downloaded file: /content/gdrive/My Drive/Colab_Downloads/7%20Years%20of%20DDD_%20Tackling%20Complexity%20in%20Large-Scale%20Marketing%20Systems%20Presentation.pdf
Waiting 3 seconds to retrieve https://cdn.oreillystatic.com/en/assets/1/event/289/A%20service%20mesh%20is%20easy%20to%20swallow%20in%20small%20pieces%20_sponsored%20by%20Aspen%20Mesh_%20Presentation.pdf
Downloaded file: /content/gdrive/My Drive/Colab_Downloads/A%20service%20mesh%20is%20easy%20to%20swallow%20in%20small%20pieces%20_sponsored%20by%20Aspen%20Mesh_%20Presentation.pdf
Waiting 2 seconds to retrieve https://cdn.oreillystatic.com/en/assets/1/event/289/An%20architect_s%20guiding%20principles%20for%20leadership%20Presentation.pdf
Downloaded file: /content/gdrive/My Drive/Colab_Downloads/An%20architect_s%20guiding%20principles%20for%20lea

In [12]:
print('Number of documents processed:', i)

Number of documents processed: 31
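
The extension check in the loop above compares the end of the raw `href` string. A slightly more tolerant variation, sketched here as an assumption rather than as what the script does, looks only at the URL path and ignores case, so links with a query string or upper-case extension are still matched. The helper name `is_wanted` and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse

# Same three extensions the scraping loop looks for
WANTED_EXTENSIONS = (".pdf", ".pptx", ".zip")

def is_wanted(doc_url, extensions=WANTED_EXTENSIONS):
    """Return True when the URL path ends with one of the wanted extensions."""
    path = urlparse(doc_url).path.lower()
    # str.endswith accepts a tuple and checks each suffix in turn
    return path.endswith(extensions)

# Hypothetical links used only for demonstration
urls = [
    "https://example.com/a.pdf",
    "https://example.com/B.PPTX?x=1",
    "https://example.com/c.html",
    "https://example.com/d.zip",
]
selected = [u for u in urls if is_wanted(u)]
print('Number of documents selected:', len(selected))
```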


In [0]:
if (sendNotification): email_notify("The web scraping process has completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [14]:
print('Total time for the script:', datetime.now() - startTimeScript)

Total time for the script: 0:03:52.917116
