# Web Scraping of AWS Open Data Registry Using Selenium
### David Lowe
### September 4, 2020

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the Selenium module.

INTRODUCTION: The Registry of Open Data on AWS makes datasets publicly available through AWS services. When data is shared on AWS, anyone can analyze it and build services on top of it using a broad range of compute and data analytics products. Sharing data in the cloud also lets data users spend more time on data analysis rather than data acquisition. The script automatically traverses the dataset listing and captures the descriptive data for each dataset, storing it in a CSV output file.

Starting URLs: https://registry.opendata.aws/

## Task 1. Prepare Environment

In [1]:
import os
import sys
import pandas as pd
import shutil
import re
import boto3
from datetime import datetime, date
from random import randint
from time import sleep
from dotenv import load_dotenv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

In [2]:
# Begin the timer for the script processing
startTimeScript = datetime.now()

# Set up the verbose and debug flags to print detailed messages for debugging (setting True will activate!)
verbose = False
debug = False

# Set up the flag to send status emails (setting to True will send the status emails!)
notifyStatus = False

# # Set up the parent directory location for loading the dotenv files
# useColab = False
# if useColab:
#     # Mount Google Drive locally for storing files
#     from google.colab import drive
#     drive.mount('/content/gdrive')
#     gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
#     env_path = '/content/gdrive/My Drive/Colab Notebooks/'
#     dotenv_path = env_path + "python_script.env"
#     load_dotenv(dotenv_path=dotenv_path)

# # Set up the dotenv file for retrieving environment variables
# useLocalPC = False
# if useLocalPC:
#     env_path = "/Users/david/PycharmProjects/"
#     dotenv_path = env_path + "python_script.env"
#     load_dotenv(dotenv_path=dotenv_path)

# Set up the flag to write the output to a CSV file (setting to True will create the file!)
writeOutput = True

# Set up the executeDownload flag to download files (setting True will download!)
executeDownload = False

In [3]:
# Set up the email notification function
def status_notify(msg_text):
    access_key = os.environ.get('SNS_ACCESS_KEY')
    secret_key = os.environ.get('SNS_SECRET_KEY')
    aws_region = os.environ.get('SNS_AWS_REGION')
    topic_arn = os.environ.get('SNS_TOPIC_ARN')
    if (access_key is None) or (secret_key is None) or (aws_region is None) or (topic_arn is None):
        sys.exit("Incomplete notification setup info. Script Processing Aborted!!!")
    sns = boto3.client('sns', aws_access_key_id=access_key, aws_secret_access_key=secret_key, region_name=aws_region)
    response = sns.publish(TopicArn=topic_arn, Message=msg_text)
    if response['ResponseMetadata']['HTTPStatusCode'] != 200 :
        print('Status notification not OK with HTTP status code:', response['ResponseMetadata']['HTTPStatusCode'])
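
The notification function reads four environment variables and aborts when the setup is incomplete. As a minimal sketch, a small helper could report exactly which names are missing before any AWS call is made. The names `REQUIRED_SNS_VARS` and `missing_settings` are hypothetical and not part of the original notebook; the variable names themselves are the ones read by `status_notify`.

```python
import os

# Environment variable names read by status_notify in the notebook
REQUIRED_SNS_VARS = ("SNS_ACCESS_KEY", "SNS_SECRET_KEY", "SNS_AWS_REGION", "SNS_TOPIC_ARN")

def missing_settings(env=None, required=REQUIRED_SNS_VARS):
    """Return the required setting names that are unset or empty (hypothetical helper)."""
    if env is None:
        env = os.environ
    # env.get() returns None for a missing key; an empty string is also treated as missing
    return [name for name in required if not env.get(name)]
```

Calling `missing_settings()` with no arguments inspects the real environment, while passing a plain dictionary makes the check easy to try out without touching any credentials.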

In [4]:
if (notifyStatus): status_notify("Task 1 Prepare Environment completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Task 2. Perform the Scraping and Processing

In [5]:
if (notifyStatus): status_notify("Task 2 Perform the Scraping and Processing has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [6]:
# To scrape the detail information of each individual dataset
def capture_details(item_detail_url, df):
    # Adding random wait time so we do not hammer the website needlessly
    waitTime = randint(2,4)
    print("Waiting", waitTime, "seconds to retrieve the items on page", item_detail_url)
    sleep(waitTime)
    # Initialize the web browser
    second_browser_options = Options()
    second_browser_options.headless = True
    detail_page_browser = webdriver.Firefox(options=second_browser_options)
    if verbose: print('Attempting to access the web page:', item_detail_url)
    try:
        detail_page_browser.get(item_detail_url)
        print('Successfully accessed the web page:', item_detail_url)
    except Exception:
        print('The server could not serve up the web page!')
        sys.exit('Script processing cannot continue!!!')

    dataset_name = detail_page_browser.find_element(By.TAG_NAME, "h1").text
    if verbose: print('Found dataset name:', dataset_name)
    # Pre-fill the 20 tag slots so every row has the same number of columns
    dataset_tags = ["N/A"] * 20
    tags_on_page = detail_page_browser.find_element(By.TAG_NAME, "p").find_elements(By.CLASS_NAME, "tag")
    # Keep at most 20 tags so a heavily tagged dataset cannot overflow the slots
    for j, tag_element in enumerate(tags_on_page[:len(dataset_tags)]):
        dataset_tags[j] = tag_element.text
        if verbose: print('Found tag:', dataset_tags[j])
    dataset_description = detail_page_browser.find_element(By.CLASS_NAME, "col-md-6").find_element(By.TAG_NAME, "p").text

    resource_container = detail_page_browser.find_element(By.CLASS_NAME, "col-md-5").find_elements(By.TAG_NAME, "dl")
    for resource_item in resource_container:
        resource_dd = resource_item.find_elements(By.TAG_NAME, "dd")
        resource_desc = resource_dd[0].text
        if debug: print('Found resource description:', resource_desc)
        resource_type = resource_dd[1].text
        if verbose: print('Found resource type:', resource_type)
        resource_name = resource_dd[2].find_element(By.TAG_NAME, "code").text
        if verbose: print('Found Amazon Resource Name:', resource_name)
        resource_region = resource_dd[3].find_element(By.TAG_NAME, "code").text
        if verbose: print('Found resource region:', resource_region)

        last_entry = len(df)
        print('Inserting record number', last_entry, 'into the dataframe.')
        df.loc[last_entry] = [dataset_name, item_detail_url, dataset_description, resource_name, resource_type, resource_region, resource_desc, dataset_tags[0], dataset_tags[1], dataset_tags[2], dataset_tags[3], dataset_tags[4], dataset_tags[5], dataset_tags[6], dataset_tags[7], dataset_tags[8], dataset_tags[9], dataset_tags[10], dataset_tags[11], dataset_tags[12], dataset_tags[13], dataset_tags[14], dataset_tags[15], dataset_tags[16], dataset_tags[17], dataset_tags[18], dataset_tags[19]]

    detail_page_browser.quit()
    return df
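
Each row in the dataframe carries exactly 20 tag columns, with "N/A" standing in for unused slots. The sketch below shows that padding step as a standalone function, which makes it easy to check in isolation. The name `pad_tags` and its parameters are hypothetical and do not appear in the original notebook.

```python
def pad_tags(tags, width=20, filler="N/A"):
    """Return exactly `width` entries: the first `width` tags, then filler values."""
    # Truncate first so a dataset with more tags than columns cannot overflow the row
    trimmed = list(tags)[:width]
    return trimmed + [filler] * (width - len(trimmed))
```

Inside `capture_details`, the same idea could be applied as `pad_tags(element.text for element in tags_on_page)`, since the function accepts any iterable of strings.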

In [7]:
# Setting up a dataframe to capture the records
df = pd.DataFrame(columns=['dataset_name', 'detail_url', 'dataset_description', 'resource name', 'resource_type', 'resource_region', 'resource_desc', 'tag01', 'tag02', 'tag03', 'tag04', 'tag05', 'tag06', 'tag07', 'tag08', 'tag09', 'tag10', 'tag11', 'tag12', 'tag13', 'tag14', 'tag15', 'tag16', 'tag17', 'tag18', 'tag19', 'tag20'])
num_entries = 0

In [8]:
# Specifying the URL of the desired web page to be scraped
website_url = "https://registry.opendata.aws"
web_page_url = website_url + "/"

In [9]:
# Initialize the web browser
firefox_options = Options()
firefox_options.headless = False
home_page_browser = webdriver.Firefox(options=firefox_options)

In [10]:
print('Attempting to access the web page:', web_page_url)
try:
    home_page_browser.get(web_page_url)
    print('Successfully accessed the web page:', web_page_url)
except Exception:
    print('The server could not serve up the web page!')
    sys.exit('Script processing cannot continue!!!')

dataset_listing = home_page_browser.find_elements(By.CLASS_NAME, "dataset")
if debug: print(dataset_listing)

for dataset_item in dataset_listing:
    detail_url = dataset_item.find_element(By.TAG_NAME, "a").get_attribute("href")
    if verbose: print('Found dataset URL:', detail_url)
    df = capture_details(detail_url, df)

Attempting to access the web page: https://registry.opendata.aws/
Successfully accessed the web page: https://registry.opendata.aws/
Waiting 3 seconds to retrieve the items on page https://registry.opendata.aws/tcga/
Successfully accessed the web page: https://registry.opendata.aws/tcga/
Inserting record number 0 into the dataframe.
Inserting record number 1 into the dataframe.
Waiting 2 seconds to retrieve the items on page https://registry.opendata.aws/target/
Successfully accessed the web page: https://registry.opendata.aws/target/
Inserting record number 2 into the dataframe.
Waiting 2 seconds to retrieve the items on page https://registry.opendata.aws/kids-first/
Successfully accessed the web page: https://registry.opendata.aws/kids-first/
Inserting record number 3 into the dataframe.
Inserting record number 4 into the dataframe.
Inserting record number 5 into the dataframe.
Inserting record number 6 into the dataframe.
Inserting record number 7 into the dataframe.
Inserting recor
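
The cells above stop the whole script the first time a page fails to load, which can be costly in a run that visits hundreds of pages. One possible alternative, sketched here under the assumption that most failures are transient, is to retry a page a few times with a growing pause. The name `fetch_with_retry` and all of its parameters are hypothetical; the function takes any callable, so it does not depend on Selenium itself.

```python
from time import sleep

def fetch_with_retry(fetch, url, attempts=3, base_delay=2.0, sleep_fn=sleep):
    """Call fetch(url) up to `attempts` times, pausing longer after each failure."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # with Selenium this would typically be a WebDriver error
            last_error = error
            if attempt < attempts:
                # Linear back-off: base_delay, then 2 * base_delay, and so on
                sleep_fn(base_delay * attempt)
    raise RuntimeError(f"Giving up on {url} after {attempts} attempts") from last_error
```

With a Selenium browser object, a call might look like `fetch_with_retry(detail_page_browser.get, item_detail_url)`. The `sleep_fn` argument exists only so the waiting behavior can be replaced while experimenting.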

In [11]:
home_page_browser.quit()

In [12]:
if (notifyStatus): status_notify("Task 2 Perform the Scraping and Processing completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Task 3. Finalize the Output

In [13]:
if (notifyStatus): status_notify("Task 3 Finalize the Output has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [14]:
# Spot-checking the dataframe before writing to file
df.head()

Unnamed: 0,dataset_name,detail_url,dataset_description,resource name,resource_type,resource_region,resource_desc,tag01,tag02,tag03,...,tag11,tag12,tag13,tag14,tag15,tag16,tag17,tag18,tag19,tag20
0,The Cancer Genome Atlas,https://registry.opendata.aws/tcga/,"The Cancer Genome Atlas (TCGA), a collaboratio...",arn:aws:s3:::tcga-2-open,S3 Bucket,us-east-1,"Clinical Supplement, Biospecimen Supplement, R...",cancer,genomic,life sciences,...,,,,,,,,,,
1,The Cancer Genome Atlas,https://registry.opendata.aws/tcga/,"The Cancer Genome Atlas (TCGA), a collaboratio...",arn:aws:s3:::tcga-2-controlled,S3 Bucket Controlled Access,us-east-1,"WXS/RNA-Seq/miRNA-Seq/ATAC-Seq Aligned Reads, ...",cancer,genomic,life sciences,...,,,,,,,,,,
2,Therapeutically Applicable Research to Generat...,https://registry.opendata.aws/target/,Therapeutically Applicable Research to Generat...,arn:aws:s3:::gdc-target-phs000218-2-open,S3 Bucket,us-east-1,"Clinical Supplement, Biospecimen Supplement, R...",cancer,genomic,life sciences,...,,,,,,,,,,
3,Gabriella Miller Kids First Pediatric Research...,https://registry.opendata.aws/kids-first/,The NIH Common Fund's Gabriella Miller Kids Fi...,arn:aws:s3:::kf-study-us-east-1-prd-sd-46sk55a3,S3 Bucket Controlled Access,us-east-1,Kids First: Pediatric Research Project on the ...,cancer,genetic,genomic,...,,,,,,,,,,
4,Gabriella Miller Kids First Pediatric Research...,https://registry.opendata.aws/kids-first/,The NIH Common Fund's Gabriella Miller Kids Fi...,arn:aws:s3:::kf-study-us-east-1-prd-sd-preasa7s,S3 Bucket Controlled Access,us-east-1,"National Heart, Lung, and Blood Institute (NHL...",cancer,genetic,genomic,...,,,,,,,,,,


In [15]:
# Spot-checking the dataframe before writing to file
df.tail()

Unnamed: 0,dataset_name,detail_url,dataset_description,resource name,resource_type,resource_region,resource_desc,tag01,tag02,tag03,...,tag11,tag12,tag13,tag14,tag15,tag16,tag17,tag18,tag19,tag20
310,Swiss Public Transport Stops,https://registry.opendata.aws/schweizer-haltes...,The basic geo-data set for public transport st...,arn:aws:s3:::data.geo.admin.ch/ch.bav.halteste...,S3 Bucket,eu-west-1,"data files ESRI FGDB, CSV , MapInfo, Interlis",cities,geospatial,infrastructure,...,,,,,,,,,,
311,COVID-19 Molecular Structure and Therapeutics Hub,https://registry.opendata.aws/molssi-covid19-hub/,Aggregating critical information to accelerate...,arn:aws:s3:::molssi-bioexcel-covid-19-structur...,S3 Bucket,us-east-1,Data storage of for the MolSSI and BioExcel CO...,bioinformatics,biology,coronavirus,...,,,,,,,,,,
312,DigitalGlobe Open Data Program,https://registry.opendata.aws/digital-globe-op...,Pre and post event high-resolution satellite i...,opendata.digitalglobe.com,CloudFront Distribution,us-east-1,Imagery and metadata,disaster response,earth observation,geospatial,...,,,,,,,,,,
313,Multiview Extended Video with Activities (MEVA),https://registry.opendata.aws/mevadata/,The Multiview Extended Video with Activities (...,arn:aws:s3:::mevadata-public-01,S3 Bucket,us-east-1,AVI video clips and collection site map,computer vision,urban,us,...,,,,,,,,,,
314,University of British Columbia Sunflower Genom...,https://registry.opendata.aws/ubc-sunflower-ge...,This dataset captures Sunflower's genetic dive...,arn:aws:s3:::ubc-sunflower-genome,S3 Bucket,us-west-2,UBC Sunflower Genome Data 1,agriculture,biodiversity,bioinformatics,...,,,,,,,,,,


In [16]:
if (writeOutput):
    out_file = df.to_csv(index=False)
    with open('web_scraping_py_selenium_aws_opendata_registry.csv', 'w', newline = '\n') as f:
        f.write(out_file)
    print("Number of records written to file:", len(df))

Number of records written to file: 315
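
The count printed above comes from the dataframe in memory rather than from the file on disk. As a rough sketch, the written file could be read back with the standard `csv` module to confirm that it holds the same number of records. The helper name `count_csv_records` is hypothetical, and the sketch assumes the file is UTF-8 encoded.

```python
import csv

def count_csv_records(path):
    """Count the data rows in a CSV file, excluding the header row."""
    # newline="" lets the csv module handle line breaks inside quoted fields
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, if the file has one
        return sum(1 for _ in reader)
```

A CSV parser is used instead of counting lines because long text fields, such as dataset descriptions, may contain embedded line breaks. The result of `count_csv_records('web_scraping_py_selenium_aws_opendata_registry.csv')` would be expected to match `len(df)`.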


In [17]:
if (notifyStatus): status_notify("Task 3 Finalize the Output completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [18]:
print('Total time for the script:', (datetime.now() - startTimeScript))

Total time for the script: 0:27:53.334061
