# Web Scraping of AWS re:Invent 2019 Using Python and BeautifulSoup
### David Lowe
### January 10, 2020

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping Python code leverages the BeautifulSoup module.

INTRODUCTION: AWS re:Invent is a learning conference featuring keynote announcements, training and certification opportunities, a partner expo, and access to more than 2,500 technical sessions. This web scraping script steps through the paginated event-content listing, renders each page in a headless browser, and collects the links to the PDF, PPTX, and ZIP documents it finds. The script also downloads the documents as part of the scraping process.

Starting URL: https://aws.amazon.com/events/events-content/?awsf.filter-series=event-series%23reinvent
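
The `%23` in the starting URL is a percent-encoded `#`. As a small sketch using only the standard library (not part of the original script), the query string can be decoded to see the filter value the page receives. The variable names here are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

starting_url = ("https://aws.amazon.com/events/events-content/"
                "?awsf.filter-series=event-series%23reinvent")

# Split the URL and decode its query string into a dict of lists
query = parse_qs(urlsplit(starting_url).query)
filter_value = query["awsf.filter-series"][0]
print(filter_value)
```

The decoded value is `event-series#reinvent`, which is the series filter that limits the listing to re:Invent content.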

## Section 0. Prepare Environment

In [1]:
# Colab-Specific Setup - Refresh Linux package repositories and set up additional Linux and Python tools
# !apt-get update
# !apt install chromium-chromedriver
# !pip install -q pymysql selenium

In [2]:
import numpy as np
import pandas as pd
import os
import shutil
import smtplib
import sys
from email.message import EmailMessage
from datetime import datetime
import requests
from requests.exceptions import HTTPError
from requests.exceptions import ConnectionError
from bs4 import BeautifulSoup
from random import randint
from time import sleep
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

In [3]:
# Begin the timer for the script processing
startTimeScript = datetime.now()

# Set up the verbose and debug flags to print detailed messages for debugging (setting True will activate!)
verbose = False
debug = False

# Set up the flag to send status emails (setting to True will send the status emails!)
notifyStatus = False

# Set up the mountStorage flag to mount G Drive for storing files (setting True will mount the drive!)
mountStorage = False

# Set up the executeDownload flag to download files (setting True will download!)
executeDownload = True

In [4]:
# Colab-Specific Setup - Mount Google Drive for storing downloaded files
if (mountStorage):
    from google.colab import drive
    drive.mount('/content/gdrive')

In [5]:
# Set up the email notification function
def email_notify(msg_text):
    sender = os.environ.get('MAIL_SENDER')
    receiver = os.environ.get('MAIL_RECEIVER')
    gateway = os.environ.get('SMTP_GATEWAY')
    smtpuser = os.environ.get('SMTP_USERNAME')
    password = os.environ.get('SMTP_PASSWORD')
    if None in (sender, receiver, gateway, smtpuser, password):
        sys.exit("Incomplete email setup info. Script Processing Aborted!!!")
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    server = smtplib.SMTP(gateway, 587)
    server.starttls()
    server.login(smtpuser, password)
    server.send_message(msg)
    server.quit()
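
The function above does two separable things: it checks that the mail settings exist and it assembles the message. A minimal sketch of those two steps in isolation, which can be exercised without an SMTP server, might look like the following. The helper names `missing_mail_settings` and `build_message` and the example addresses are hypothetical, and unlike the original check this sketch also treats empty strings as missing.

```python
import os
from email.message import EmailMessage

REQUIRED_VARS = ('MAIL_SENDER', 'MAIL_RECEIVER', 'SMTP_GATEWAY',
                 'SMTP_USERNAME', 'SMTP_PASSWORD')

def missing_mail_settings(env=os.environ):
    """Return the names of required settings that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

def build_message(sender, receiver, msg_text):
    """Assemble the notification email without sending it."""
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python Web Scraping Script'
    msg['From'] = sender
    msg['To'] = receiver
    return msg

# Usage with a stand-in environment (placeholder addresses)
fake_env = {'MAIL_SENDER': 'sender@example.com',
            'MAIL_RECEIVER': 'receiver@example.com'}
missing = missing_mail_settings(fake_env)
print('Missing settings:', missing)

msg = build_message('sender@example.com', 'receiver@example.com',
                    'Phase 0 Prepare Environment completed!')
print(msg['Subject'])
```

Reporting which settings are missing makes a misconfigured environment easier to diagnose than a single generic abort message.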

In [6]:
def access_url(url):
    # Seconds to wait for the browser to finish rendering the page
    wait_time = 10
    firefox_options = Options()
    # Run Firefox without a visible window
    firefox_options.add_argument('-headless')
    print('Attempting to access the web page: ' + url)
    browser = webdriver.Firefox(options=firefox_options)
    try:
        browser.get(url)
        sleep(wait_time)
        innerHTML = browser.execute_script("return document.body.innerHTML")
    finally:
        # Always close the browser, even if the page load fails
        browser.quit()
    bsoup_obj = BeautifulSoup(innerHTML, 'lxml')
    if (debug): print(bsoup_obj.prettify())
    return bsoup_obj

In [7]:
def download_to_local(doc_path):
    local_file = doc_path.split('/')[-1]
    # Skip files that are already on disk before spending time waiting
    if os.path.isfile(local_file):
        print('Skipped existing file: ' + local_file)
        return
    # Adding random wait time so we do not hammer the website needlessly
    waitTime = randint(2,5)
    print("Waiting " + str(waitTime) + " seconds to retrieve " + doc_path)
    sleep(waitTime)
    with requests.get(doc_path, stream=True) as r:
        # Do not save an error page to disk under the document's name
        if r.status_code != 200:
            print('Failed to retrieve (HTTP ' + str(r.status_code) + '): ' + doc_path)
            return
        with open(local_file, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    print('Downloaded file: ' + local_file)
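
`download_to_local` takes everything after the last `/` as the local file name, which works for the links in this run. As a hedged alternative, the sketch below derives the name from the URL's path component only, so a query string or fragment would not end up in the file name. The helper `filename_from_url` and the `example.com` link are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit, unquote

def filename_from_url(url):
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = urlsplit(url).path
    return unquote(posixpath.basename(path))

doc_path = ("https://d1.awsstatic.com/events/reinvent/2019/"
            "AWS_China_Gateway_Power_your_business_in_China_by_working_with_AWS_CHN202.pdf")
print(filename_from_url(doc_path))

# A made-up link with a query string, for comparison
print(filename_from_url("https://example.com/docs/slides.pptx?version=2"))
```

For the first URL the result matches what `doc_path.split('/')[-1]` produces; for the second, the plain split would keep `?version=2` in the name while this helper drops it.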

In [8]:
def download_to_gdrive(doc_path):
    # Adding random wait time so we do not hammer the website needlessly
    waitTime = randint(2,5)
    print("Waiting " + str(waitTime) + " seconds to retrieve " + doc_path)
    sleep(waitTime)
    local_file = doc_path.split('/')[-1]
    gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
    dest_file = gdrivePrefix + local_file
    with requests.get(doc_path, stream=True) as r:
        # Do not save an error page to the drive under the document's name
        if r.status_code != 200:
            print('Failed to retrieve (HTTP ' + str(r.status_code) + '): ' + doc_path)
            return
        with open(dest_file, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    print('Downloaded file: ' + dest_file)
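
Both download functions hand the response stream to `shutil.copyfileobj`, which copies in chunks rather than loading the whole document into memory. The sketch below illustrates that copy step on its own, with an in-memory `io.BytesIO` buffer standing in for `r.raw` and a temporary directory standing in for the destination folder. The payload and file name are made up for illustration.

```python
import io
import os
import shutil
import tempfile

# Stand-in for the raw response stream (r.raw in the functions above)
payload = b"example document bytes " * 1000
source = io.BytesIO(payload)

dest_dir = tempfile.mkdtemp()
dest_file = os.path.join(dest_dir, "example.pdf")

# copyfileobj reads and writes one chunk at a time, so the whole
# file never has to sit in memory at once
with open(dest_file, "wb") as f:
    shutil.copyfileobj(source, f, length=16 * 1024)

copied_size = os.path.getsize(dest_file)
print('Copied', copied_size, 'bytes')

# Clean up the temporary directory
shutil.rmtree(dest_dir)
```

The same pattern is why `stream=True` is passed to `requests.get`: without it, the response body would be read into memory before the copy starts.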

In [9]:
if (notifyStatus): email_notify("Phase 0 Prepare Environment completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Section 1. Perform the Scraping and Processing

In [10]:
if (notifyStatus): email_notify("Phase 1 Perform the Scraping and Processing has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [11]:
# Specify the URL of the web page to be scraped
website_url = "https://aws.amazon.com/events/events-content/?awsf.filter-series=event-series%23reinvent&awsm.page-cards="
card_number = 1
max_card = 117
starting_url = website_url + str(card_number)
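
The listing is paginated through the `awsm.page-cards` query parameter, and the script visits pages 1 through `max_card`. A minimal sketch of generating the full list of page URLs up front, as an alternative to incrementing a counter by hand, could look like this. The name `card_urls` is an assumption introduced for illustration.

```python
website_url = ("https://aws.amazon.com/events/events-content/"
               "?awsf.filter-series=event-series%23reinvent&awsm.page-cards=")
max_card = 117

# One URL per listing page, numbered 1 through max_card inclusive
card_urls = [website_url + str(n) for n in range(1, max_card + 1)]

print(len(card_urls))
print(card_urls[0])
```

Having the URLs in a list makes it straightforward to resume a long run from a given page by slicing the list.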

In [12]:
i = 0
while (card_number <= max_card):
    card_url = website_url + str(card_number)
    web_page = access_url(card_url)
    # Gather all links to the documents
    collection = web_page.find_all('h2', class_='m-headline')
    for item in collection:
        if (verbose): print('Found item:', item)
        doc_link = item.a
        # Skip headlines that carry no link
        if doc_link is None or not doc_link.has_attr('href'): continue
        doc_path = doc_link['href']
        # str.endswith accepts a tuple of suffixes to test
        if doc_path.lower().endswith((".pdf", ".pptx", ".zip")):
            i = i + 1
            if (executeDownload): download_to_local(doc_path)
    card_number = card_number + 1

print('Finished finding all available documents on the web pages!')
print('Total presentation documents downloaded:', i)

Attempting to access the web page: https://aws.amazon.com/events/events-content/?awsf.filter-series=event-series%23reinvent&awsm.page-cards=1
Waiting 3 seconds to retrieve https://d1.awsstatic.com/events/reinvent/2019/AWS_China_Gateway_Power_your_business_in_China_by_working_with_AWS_CHN202.pdf
Downladed file: AWS_China_Gateway_Power_your_business_in_China_by_working_with_AWS_CHN202.pdf
Waiting 4 seconds to retrieve https://d1.awsstatic.com/events/reinvent/2019/REPEAT_1_Performance_tune_SQL_Server_using_AWS_Machine_Learning_WIN407-R1.pdf
Downladed file: REPEAT_1_Performance_tune_SQL_Server_using_AWS_Machine_Learning_WIN407-R1.pdf
Waiting 2 seconds to retrieve https://d1.awsstatic.com/events/reinvent/2019/REPEAT_1_Build_multi-region_SQL_Server_Always_On_availability_groups_WIN406-R1.pdf
Downladed file: REPEAT_1_Build_multi-region_SQL_Server_Always_On_availability_groups_WIN406-R1.pdf
Waiting 4 seconds to retrieve https://d1.awsstatic.com/events/reinvent/2019/REPEAT_1_Active_Directory_de
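
The loop keeps a single running count of matching links. As a rough sketch of how the same links could be tallied by file type instead, the example below applies `collections.Counter` to the three complete URLs shown in the output above. The helper `extension_of` is hypothetical and not part of the original script.

```python
from collections import Counter
from posixpath import splitext
from urllib.parse import urlsplit

doc_paths = [
    "https://d1.awsstatic.com/events/reinvent/2019/AWS_China_Gateway_Power_your_business_in_China_by_working_with_AWS_CHN202.pdf",
    "https://d1.awsstatic.com/events/reinvent/2019/REPEAT_1_Performance_tune_SQL_Server_using_AWS_Machine_Learning_WIN407-R1.pdf",
    "https://d1.awsstatic.com/events/reinvent/2019/REPEAT_1_Build_multi-region_SQL_Server_Always_On_availability_groups_WIN406-R1.pdf",
]

def extension_of(url):
    """Return the lower-cased file extension of the URL's path."""
    return splitext(urlsplit(url).path)[1].lower()

# Count how many links were found for each file type
counts = Counter(extension_of(p) for p in doc_paths)
print(counts)
```

A per-type tally makes it easier to spot whether any PPTX or ZIP files were picked up alongside the PDFs.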

In [13]:
if (notifyStatus): email_notify("Phase 1 Perform the Scraping and Processing completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [14]:
print('Total time for the script:', datetime.now() - startTimeScript)

Total time for the script: 2:50:29.939369
