# Web Scraping of AWS Documentation Using Python and BeautifulSoup
### David Lowe
### April 28, 2019

SUMMARY: The purpose of this project is to practice web scraping by extracting specific information from a website. The script then uses the extracted information to complete a further task (downloading files, in this case). The Python code uses the BeautifulSoup module to parse the pages.

INTRODUCTION: On occasion, we need to download a batch of documents from a set of web pages without clicking each download link one at a time. This web scraping script automatically traverses the necessary web pages and collects all links to documents in PDF format. The script can also download the PDF documents as part of the scraping process.

The script requires the Selenium browser automation software and one of its WebDrivers (Firefox in this case). Selenium loads each page in a headless browser so that the rendered HTML can be handed to BeautifulSoup.

Starting URL: https://docs.aws.amazon.com/

## Loading Libraries and Packages

In [1]:
import os
import sys
import shutil
import smtplib
from email.message import EmailMessage
from datetime import datetime
import urllib.request
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup
from random import randint
from time import sleep
from selenium import webdriver

startTimeScript = datetime.now()

## Setting up the necessary parameters

In [2]:
# Specify the URL of the web page to be scraped
starting_url = "https://docs.aws.amazon.com/"
website_url = "https://docs.aws.amazon.com"

In [3]:
from selenium.webdriver.firefox.options import Options

firefox_options = Options()
firefox_options.headless = True
browser = webdriver.Firefox(options=firefox_options)
browser.get(starting_url)
sleep(10)
innerHTML = browser.execute_script("return document.body.innerHTML")
sleep(10)
web_page = BeautifulSoup(innerHTML, 'lxml')
browser.quit()
# print(web_page.prettify())
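
The browser start-up, wait, and teardown steps in the cell above are repeated later for every service page. A minimal sketch of how they could be wrapped in one helper follows. The names `fetch_rendered_html` and `make_browser` are hypothetical and not part of the original notebook; `make_browser` is assumed to be any zero-argument callable that returns a Selenium WebDriver. The sketch keeps only the pause before the HTML is read, and the `try`/`finally` is an addition so the browser is closed even if the page load raises.

```python
from time import sleep


def fetch_rendered_html(url, make_browser, wait_seconds=10):
    """Load `url` in a fresh browser and return the rendered <body> HTML.

    `make_browser` is a hypothetical zero-argument factory returning a
    Selenium WebDriver (or any object with get/execute_script/quit).
    """
    browser = make_browser()
    try:
        browser.get(url)
        # Fixed pause, as in the notebook, to let client-side scripts render
        sleep(wait_seconds)
        return browser.execute_script("return document.body.innerHTML")
    finally:
        # Always release the browser process, even if loading failed
        browser.quit()
```

With the notebook's objects, a call might look like `innerHTML = fetch_rendered_html(starting_url, lambda: webdriver.Firefox(options=firefox_options))`, after which `BeautifulSoup(innerHTML, 'lxml')` works as before.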

## Performing the Scraping and Processing

In [4]:
# Gather all service documentation links from the landing page
service_collection = web_page.find_all('awsdocs-service-link')
print('Number of potential service documentation groups found:', len(service_collection))

Number of potential service documentation groups found: 200
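
Each `awsdocs-service-link` element carries a site-relative `href` ending in the `?id=docs_gateway` query string, which has to become an absolute URL before the page can be visited. The sketch below shows one way to do that with `urllib.parse` instead of string slicing. `clean_service_url` is a hypothetical helper name, and unlike the notebook's own check it drops any query string rather than requiring that exact suffix.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def clean_service_url(href, base="https://docs.aws.amazon.com"):
    """Return an absolute URL without query or fragment, or None to skip."""
    # Keep only site-relative paths such as "/ec2/?id=docs_gateway";
    # skip empty values, absolute URLs, and protocol-relative "//host" links
    if not href or not href.startswith("/") or href.startswith("//"):
        return None
    parts = urlsplit(urljoin(base, href))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(clean_service_url("/ec2/?id=docs_gateway"))
# → https://docs.aws.amazon.com/ec2/
```

Returning `None` for anything that is not a site-relative path lets the calling loop skip external links with a single `if` test.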


In [5]:
# Running count of PDF links processed
i = 0
for svc in service_collection:
    doc_url = svc['href']
    start_string = "/"
    end_string = "?id=docs_gateway"
    if doc_url.endswith(end_string) and doc_url.startswith(start_string):
        doc_url = website_url + doc_url
        doc_url = doc_url[:-len(end_string)]
        print('Accessing the doc page: ' + doc_url)
        browser = webdriver.Firefox(options=firefox_options)
        browser.get(doc_url)
        sleep(10)
        innerHTML = browser.execute_script("return document.body.innerHTML")
        sleep(10)
        doc_page = BeautifulSoup(innerHTML, 'lxml')
        browser.quit()
        doc_collection = doc_page.find_all('awsdocs-link')

        for doc in doc_collection:
            if doc['label'] == "PDF":
                # Adding random wait time so we do not hammer the website needlessly
                waitTime = randint(2,5)
                print("Waiting " + str(waitTime) + " seconds to retrieve the next document.")
                sleep(waitTime)
                test_path = website_url + doc['href']
                if test_path.find(".pdf#") > 0:
                    doc_path = test_path.split("#")[0]
                else:
                    doc_path = test_path
                dest_file = os.path.basename(doc_path)
                print('Downloading document: ' + doc_path + " as " + dest_file)
                # Uncomment the following two lines to actually download the PDF document
                # with urllib.request.urlopen(doc_path) as in_resp, open(dest_file, 'wb') as out_file:
                #     shutil.copyfileobj(in_resp, out_file)
                i = i + 1

Accessing the doc page: https://docs.aws.amazon.com/ec2/
Waiting 2 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-ug.pdf as ec2-ug.pdf
Waiting 3 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/ec2-wg.pdf as ec2-wg.pdf
Waiting 2 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/ec2-api.pdf as ec2-api.pdf
Waiting 5 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-dg.pdf as as-dg.pdf
Waiting 2 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/autoscaling/ec2/APIReference/as-api.pdf as as-api.pdf
Waiting 5 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/vm-import/latest/userguide/vm-import-ug.pdf as vm-import-ug.pdf
Accessing the doc page: https://d

Downlading document: https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/dynamodb-api.pdf as dynamodb-api.pdf
Accessing the doc page: https://docs.aws.amazon.com/elasticache/
Waiting 4 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/redis-ug.pdf as redis-ug.pdf
Waiting 5 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/AmazonElastiCache/latest/mem-ug/memcached-ug.pdf as memcached-ug.pdf
Waiting 2 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/AmazonElastiCache/latest/APIReference/elasticache-api.pdf as elasticache-api.pdf
Accessing the doc page: https://docs.aws.amazon.com/neptune/
Waiting 2 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/neptune/latest/userguide/neptune-ug.pdf as neptune-ug.pdf
Accessing the doc page: https://docs.aws.amazon.com/rds/
Waiting 5 seconds to retrieve the n

Waiting 4 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/guardduty/latest/ug/guardduty-ug.pdf as guardduty-ug.pdf
Waiting 4 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/guardduty/latest/ug/guardduty-ug.pdf as guardduty-ug.pdf
Accessing the doc page: https://docs.aws.amazon.com/inspector/
Waiting 2 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/inspector/latest/userguide/inspector-ug.pdf as inspector-ug.pdf
Waiting 4 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/inspector/latest/APIReference/inspector.pdf as inspector.pdf
Accessing the doc page: https://docs.aws.amazon.com/macie/
Waiting 4 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/macie/latest/userguide/macie-ug.pdf as macie-ug.pdf
Waiting 4 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com

Downlading document: https://docs.aws.amazon.com/forecast/latest/dg/forecast.dg.pdf as forecast.dg.pdf
Accessing the doc page: https://docs.aws.amazon.com/lex/
Waiting 2 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/lex/latest/dg/lex-dg.pdf as lex-dg.pdf
Accessing the doc page: https://docs.aws.amazon.com/machine-learning/
Waiting 5 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/machine-learning/latest/dg/machinelearning-dg.pdf as machinelearning-dg.pdf
Waiting 3 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/machine-learning/latest/APIReference/amazonml-api.pdf as amazonml-api.pdf
Accessing the doc page: https://docs.aws.amazon.com/personalize/
Waiting 2 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/personalize/latest/dg/personalize-dg.pdf as personalize-dg.pdf
Accessing the doc page: https://docs.aws.amazon.com/polly/
Wa

Downlading document: https://docs.aws.amazon.com/servicecatalog/latest/dg/service-catalog-dg.pdf as service-catalog-dg.pdf
Accessing the doc page: https://docs.aws.amazon.com/systems-manager/
Waiting 2 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-ug.pdf as systems-manager-ug.pdf
Waiting 4 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/systems-manager/latest/APIReference/systems-manager-api.pdf as systems-manager-api.pdf
Waiting 3 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/ARG/latest/userguide/resgrps-ug.pdf as resgrps-ug.pdf
Waiting 4 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/ARG/latest/APIReference/resource-groups.pdf as resource-groups.pdf
Accessing the doc page: https://docs.aws.amazon.com/powershell/
Waiting 3 seconds to retrieve the next document.
Downlading do

Waiting 2 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/cloud-map/latest/dg/cloud-map-dg.pdf as cloud-map-dg.pdf
Waiting 4 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/cloud-map/latest/api/cloud-map-api.pdf as cloud-map-api.pdf
Accessing the doc page: https://docs.aws.amazon.com/cloudfront/
Waiting 3 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AmazonCloudFront_DevGuide.pdf as AmazonCloudFront_DevGuide.pdf
Waiting 4 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/cloudfront/latest/APIReference/cloudfront-api.pdf as cloudfront-api.pdf
Accessing the doc page: https://docs.aws.amazon.com/directconnect/
Waiting 4 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/directconnect/latest/UserGuide/dc-ug.pdf as dc-ug.pdf
Waiting 5 seconds to retrieve the 

Downlading document: https://docs.aws.amazon.com/workspaces/latest/userguide/workspaces-ug.pdf as workspaces-ug.pdf
Waiting 5 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/workspaces/latest/api/workspaces-api.pdf as workspaces-api.pdf
Accessing the doc page: https://docs.aws.amazon.com/appstream2/
Waiting 3 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/appstream2/latest/developerguide/appstream2-dg.pdf as appstream2-dg.pdf
Waiting 5 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/appstream2/latest/APIReference/appstream2-api.pdf as appstream2-api.pdf
Accessing the doc page: https://docs.aws.amazon.com/workdocs/
Waiting 4 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/workdocs/latest/adminguide/workdocs-ag.pdf as workdocs-ag.pdf
Waiting 4 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com

Downlading document: https://docs.aws.amazon.com/sns/latest/dg/sns-dg.pdf as sns-dg.pdf
Waiting 2 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/sns/latest/api/sns-api.pdf as sns-api.pdf
Accessing the doc page: https://docs.aws.amazon.com/sqs/
Waiting 4 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dg.pdf as sqs-dg.pdf
Waiting 4 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/sqs-api.pdf as sqs-api.pdf
Accessing the doc page: https://docs.aws.amazon.com/step-functions/
Waiting 2 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/step-functions/latest/dg/step-functions-dg.pdf as step-functions-dg.pdf
Waiting 4 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/step-functions/latest/apireference/step-fu

Waiting 2 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/thingsgraph/latest/ug/amazon-things-graph.pdf as amazon-things-graph.pdf
Waiting 3 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/thingsgraph/latest/APIReference/iot-thingsgraph-api.pdf as iot-thingsgraph-api.pdf
Accessing the doc page: https://docs.aws.amazon.com/iot-1-click/
Waiting 5 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/iot-1-click/latest/developerguide/iot-1-click-dg.pdf as iot-1-click-dg.pdf
Waiting 2 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/iot-1-click/latest/projects-apireference/1click-papi.pdf as 1click-papi.pdf
Waiting 3 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/iot-1-click/1.0/devices-apireference/1click-dapi.pdf as 1click-dapi.pdf
Accessing the doc page: https://docs.aws.amazon.com/connect/
Waiti

Accessing the doc page: https://docs.aws.amazon.com/account-billing/
Waiting 2 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/awsaccountbilling-aboutv2.pdf as awsaccountbilling-aboutv2.pdf
Waiting 4 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/aws-cost-management/latest/APIReference/awsbilling-api.pdf as awsbilling-api.pdf
Waiting 5 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/pricing-calculator/latest/userguide/aws-pc.pdf as aws-pc.pdf
Accessing the doc page: https://docs.aws.amazon.com/aws-support/
Waiting 4 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/awssupport/latest/user/support-ug.pdf as support-ug.pdf
Waiting 4 seconds to retrieve the next document.
Downlading document: https://docs.aws.amazon.com/awssupport/latest/APIReference/support-api.pdf as support-api.pdf
Accessing the do
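
The log above lists `guardduty-ug.pdf` twice, which suggests that a page can expose the same PDF link more than once. A minimal sketch of how the loop could normalize each link and skip repeats is shown below. `pdf_target` and the `seen` set are hypothetical additions, not part of the original script, and the `#what-is` fragment is made up for the example.

```python
import posixpath
from urllib.parse import urldefrag, urljoin, urlsplit


def pdf_target(href, base="https://docs.aws.amazon.com"):
    """Return (url, filename) for a PDF link, with any #fragment removed."""
    url, _fragment = urldefrag(urljoin(base, href))
    filename = posixpath.basename(urlsplit(url).path)
    return url, filename


seen = set()
hrefs = [
    "/guardduty/latest/ug/guardduty-ug.pdf",
    "/guardduty/latest/ug/guardduty-ug.pdf#what-is",  # hypothetical fragment
]
for href in hrefs:
    url, filename = pdf_target(href)
    if url in seen:
        continue  # same document already handled
    seen.add(url)
    print(url, "->", filename)
```

Keying the set on the fragment-free URL means two links that differ only in their `#anchor` count as one document, so the file would be fetched once.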

In [6]:
print('Number of documents processed:', i)

Number of documents processed: 392
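
This count reflects PDF links processed rather than files saved, because the two download lines in the loop are commented out. If they are re-enabled, a single bad link would stop the whole run. The sketch below wraps the same `urlopen` and `copyfileobj` calls with the `HTTPError` and `URLError` classes that the notebook already imports. The function name `download_file`, its `timeout` default, and its return convention are assumptions made for illustration.

```python
import shutil
import urllib.request
from urllib.error import HTTPError, URLError


def download_file(url, dest_file, timeout=30):
    """Save `url` to `dest_file`; return True on success, False on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as in_resp, \
                open(dest_file, "wb") as out_file:
            shutil.copyfileobj(in_resp, out_file)
        return True
    except HTTPError as err:
        # The server answered with an error status (e.g. 403 or 404)
        print("HTTP error", err.code, "for", url)
    except URLError as err:
        # The request never completed (DNS failure, refused connection, ...)
        print("Could not reach", url, "-", err.reason)
    return False
```

`HTTPError` is caught first because it is a subclass of `URLError`. Inside the loop, the counter could then be incremented only when `download_file(doc_path, dest_file)` returns `True`.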


In [7]:
print ('Total time for the script:',(datetime.now() - startTimeScript))

Total time for the script: 1:50:33.121529
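
Much of that runtime is probably spent in deliberate pauses rather than in network transfer. The rough estimate below uses only figures that appear in this notebook: each page load sleeps twice for 10 seconds, and each PDF link waits `randint(2, 5)` seconds, which averages 3.5. The page count is an upper bound, since not every one of the 200 service links necessarily passes the `?id=docs_gateway` filter.

```python
from datetime import timedelta

PAGE_SLEEP = 10 + 10          # two sleep(10) calls per page load
MAX_PAGES = 1 + 200           # landing page plus at most 200 service pages
DOCS = 392                    # PDF links processed in the run above
MEAN_DOC_WAIT = (2 + 5) / 2   # expected value of randint(2, 5)

page_sleep_total = PAGE_SLEEP * MAX_PAGES    # upper bound, in seconds
doc_wait_total = DOCS * MEAN_DOC_WAIT        # expected value, in seconds
recorded = timedelta(hours=1, minutes=50, seconds=33.121529)

print("Page sleeps (upper bound):", timedelta(seconds=page_sleep_total))
print("Per-document waits (expected):", timedelta(seconds=doc_wait_total))
share = (page_sleep_total + doc_wait_total) / recorded.total_seconds()
print(f"Share of recorded runtime: {share:.0%}")
```

Under these assumptions the pauses account for up to roughly four fifths of the recorded time, so a condition-based wait (for example Selenium's `WebDriverWait`) in place of the fixed page sleeps would likely shorten the run far more than any change to the download step.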
