# Web Scraping of Metro Ridership Statistics Using Python and Selenium
### David Lowe
### August 14, 2020

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the Selenium module.

INTRODUCTION: Metro is a transportation planner and coordinator, designer, builder and operator for one of the country’s largest, most populous counties, Los Angeles. More than 9.6 million people, nearly one-third of California’s residents, live, work and play within its 1,433-square-mile service area. The purpose of this exercise is to practice web scraping by gathering the bus ridership statistics from the agency's web pages. This iteration of the script automatically traverses the monthly web pages (from January 2009 to June 2020) to capture all bus ridership entries and store the information in a CSV output file.
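
The traversal range described above can be enumerated up front. The following is a minimal sketch using only the standard library; the function name `monthly_periods` is hypothetical and not part of the original script. It yields the first day of each month from January 2009 through June 2020.

```python
from datetime import date

def monthly_periods(start, end):
    """Yield the first day of every month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        # Roll over to January of the next year after December
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)

periods = list(monthly_periods(date(2009, 1, 1), date(2020, 6, 1)))
print(len(periods), periods[0], periods[-1])  # 138 2009-01-01 2020-06-01
```

The count of 138 periods agrees with the number of records the script reports writing at the end of the run.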

Starting URL: http://isotp.metro.net/MetroRidership/Index.aspx

## Task 1. Prepare Environment

In [1]:
import os
import sys
import pandas as pd
import shutil
import boto3
from datetime import datetime, date
from random import randint
from time import sleep
from dotenv import load_dotenv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import presence_of_element_located
from selenium.webdriver.support.select import Select
from selenium.webdriver.firefox.options import Options

In [2]:
# Begin the timer for the script processing
startTimeScript = datetime.now()

# Set up the verbose and debug flags to print detailed messages for debugging (setting True will activate!)
verbose = False
debug = False

# Set up the flag to send status emails (setting to True will send the status emails!)
notifyStatus = False

# # Set up the parent directory location for loading the dotenv files
# useColab = False
# if useColab:
#     # Mount Google Drive locally for storing files
#     from google.colab import drive
#     drive.mount('/content/gdrive')
#     gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
#     env_path = '/content/gdrive/My Drive/Colab Notebooks/'
#     dotenv_path = env_path + "python_script.env"
#     load_dotenv(dotenv_path=dotenv_path)

# # Set up the dotenv file for retrieving environment variables
# useLocalPC = False
# if useLocalPC:
#     env_path = "/Users/david/PycharmProjects/"
#     dotenv_path = env_path + "python_script.env"
#     load_dotenv(dotenv_path=dotenv_path)

# Set up the flag to write the output to a CSV file (setting True will create the file!)
writeOutput = True

# Set up the executeDownload flag to download files (setting True will download!)
executeDownload = False

In [3]:
# Set up the email notification function
def status_notify(msg_text):
    access_key = os.environ.get('SNS_ACCESS_KEY')
    secret_key = os.environ.get('SNS_SECRET_KEY')
    aws_region = os.environ.get('SNS_AWS_REGION')
    topic_arn = os.environ.get('SNS_TOPIC_ARN')
    if (access_key is None) or (secret_key is None) or (aws_region is None) or (topic_arn is None):
        sys.exit("Incomplete notification setup info. Script Processing Aborted!!!")
    sns = boto3.client('sns', aws_access_key_id=access_key, aws_secret_access_key=secret_key, region_name=aws_region)
    response = sns.publish(TopicArn=topic_arn, Message=msg_text)
    if response['ResponseMetadata']['HTTPStatusCode'] != 200 :
        print('Status notification not OK with HTTP status code:', response['ResponseMetadata']['HTTPStatusCode'])
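
As one possible refinement, the configuration check could report which settings are missing rather than only that something is missing. The sketch below is illustrative: the helper `missing_settings` and the constant `REQUIRED_SETTINGS` are hypothetical names, and the helper accepts the environment as a plain mapping so that it can be exercised without real AWS credentials.

```python
import os

# The four environment variables that status_notify reads
REQUIRED_SETTINGS = ('SNS_ACCESS_KEY', 'SNS_SECRET_KEY', 'SNS_AWS_REGION', 'SNS_TOPIC_ARN')

def missing_settings(env=None):
    """Return the names of required settings that are absent or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]

# Example with a partial, made-up configuration
example_env = {'SNS_ACCESS_KEY': 'dummy', 'SNS_AWS_REGION': 'us-west-2'}
print(missing_settings(example_env))  # ['SNS_SECRET_KEY', 'SNS_TOPIC_ARN']
```

Listing the missing names in the abort message would make a misconfigured environment file quicker to diagnose.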

In [4]:
if (notifyStatus): status_notify("Task 1 Prepare Environment completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Task 2. Perform the Scraping and Processing

In [5]:
if (notifyStatus): status_notify("Task 2 Perform the Scraping and Processing has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [6]:
# Setting up a dataframe to capture the records
df = pd.DataFrame(columns=['idx', 'values'])
num_entries = 0

In [7]:
# Specifying the URL of the web page to be scraped
website_url = "http://isotp.metro.net/MetroRidership"
dataset_page_url = website_url + "/Index.aspx"

In [8]:
# Initialize the web browser
firefox_options = Options()
firefox_options.headless = False
home_page_browser = webdriver.Firefox(options=firefox_options)

In [9]:
# Open the ridership page, then step through each monthly period and capture the bus ridership figure
print('Attempting to access the web page:', dataset_page_url)
try:
    home_page_browser.get(dataset_page_url)
    print('Successfully accessed the web page:', dataset_page_url)
except Exception:  # a bare except would also trap KeyboardInterrupt and SystemExit
    print('The server could not serve up the web page!')
    sys.exit('Script processing cannot continue!!!')

last_period = date(2020, 6, 1)
data_years = range(2009, 2021)
data_months = range(1, 13)

for year in data_years:
    for month in data_months:
        current_period = date(year, month, 1)
        if current_period > last_period: break

        # Adding random wait time so we do not hammer the website needlessly
        waitTime = randint(2,5)
        print("Waiting", waitTime, "seconds to retrieve year", year, "and month", month)
        sleep(waitTime)

        select_element = home_page_browser.find_element(By.NAME, "ctl00$ContentPlaceHolder1$ddlYear")
        select_year = Select(select_element)
        select_year.select_by_value(str(year))

        select_element = home_page_browser.find_element(By.NAME, "ctl00$ContentPlaceHolder1$ddlPeriod")
        select_month = Select(select_element)
        select_month.select_by_value(str(month))

        select_element = home_page_browser.find_element(By.NAME, "ctl00$ContentPlaceHolder1$btnSubmit")
        select_element.click()
        
        select_element = home_page_browser.find_element(By.ID, "ContentPlaceHolder1_rpAllBus_gvAllBus")
        table_rows = select_element.find_elements(By.TAG_NAME, "tr")
        # The ridership figure is read by position: fifth row, fifth cell (zero-based index 4)
        target_row = table_rows[4]
        ridership = int(target_row.find_elements(By.TAG_NAME, 'td')[4].text.replace(',',''))
        print("Captured bus ridership of", ridership, "for the period of", current_period)

        df.loc[num_entries] = [current_period, ridership]
        num_entries = num_entries + 1

Attempting to access the web page: http://isotp.metro.net/MetroRidership/Index.aspx
Successfully accessed the web page: http://isotp.metro.net/MetroRidership/Index.aspx
Waiting 2 seconds to retrieve year 2009 and month 1
Captured bus ridership of 30580085 for the period of 2009-01-01
Waiting 3 seconds to retrieve year 2009 and month 2
Captured bus ridership of 28778916 for the period of 2009-02-01
Waiting 2 seconds to retrieve year 2009 and month 3
Captured bus ridership of 32963302 for the period of 2009-03-01
Waiting 2 seconds to retrieve year 2009 and month 4
Captured bus ridership of 31471512 for the period of 2009-04-01
Waiting 3 seconds to retrieve year 2009 and month 5
Captured bus ridership of 31665879 for the period of 2009-05-01
Waiting 5 seconds to retrieve year 2009 and month 6
Captured bus ridership of 31242816 for the period of 2009-06-01
Waiting 4 seconds to retrieve year 2009 and month 7
Captured bus ridership of 31233738 for the period of 2009-07-01
Waiting 3 seconds t
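
The ridership figure arrives as display text with thousands separators, which the loop converts with `replace` and `int`. A slightly more defensive variant might look like the sketch below; `parse_ridership` is a hypothetical helper, not part of the original script, and it assumes the cell holds a non-negative whole number.

```python
def parse_ridership(cell_text):
    """Convert table text such as '30,580,085' into an int.

    Raises ValueError when the text is not a plain whole number, so a
    changed page layout fails loudly instead of storing a bad value.
    """
    cleaned = cell_text.strip().replace(',', '')
    if not cleaned.isdigit():
        raise ValueError(f"Unexpected ridership text: {cell_text!r}")
    return int(cleaned)

print(parse_ridership(' 30,580,085 '))  # 30580085
```

Because the table cell is selected by position, a check of this kind is one way to notice when the page layout shifts and a different cell is picked up.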

In [10]:
home_page_browser.quit()

In [11]:
if (notifyStatus): status_notify("Task 2 Perform the Scraping and Processing completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

## Task 3. Finalize the Output

In [12]:
if (notifyStatus): status_notify("Task 3 Finalize the Output has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [13]:
# Spot-checking the dataframe before writing to file
df.head()

| | idx | values |
|---|---|---|
| 0 | 2009-01-01 | 30580085 |
| 1 | 2009-02-01 | 28778916 |
| 2 | 2009-03-01 | 32963302 |
| 3 | 2009-04-01 | 31471512 |
| 4 | 2009-05-01 | 31665879 |


In [14]:
# Spot-checking the dataframe before writing to file
df.tail()

| | idx | values |
|---|---|---|
| 133 | 2020-02-01 | 21767867 |
| 134 | 2020-03-01 | 15323805 |
| 135 | 2020-04-01 | 7448380 |
| 136 | 2020-05-01 | 9096081 |
| 137 | 2020-06-01 | 10679063 |


In [15]:
if (writeOutput):
    out_file = df.to_csv(index=False)
    with open('web_scraping_py_selenium_metro_ridership_statistics.csv', 'w', newline = '\n') as f:
        f.write(out_file)
    print("Number of records written to file:", len(df))

Number of records written to file: 138
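
To double-check the file after writing, the CSV can be read back and its rows counted. This is a minimal sketch using the standard `csv` module; the helper name `count_csv_records` is an assumption for illustration, and the demonstration writes a small temporary file in the same two-column layout rather than touching the real output.

```python
import csv
import os
import tempfile

def count_csv_records(path):
    """Return the number of data rows in a CSV file, excluding the header."""
    with open(path, newline='') as f:
        reader = csv.DictReader(f)
        return sum(1 for _ in reader)

# Demonstrate on a small temporary file that mimics the output layout
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(sample_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['idx', 'values'])
    writer.writerow(['2009-01-01', 30580085])
    writer.writerow(['2009-02-01', 28778916])

print(count_csv_records(sample_path))  # 2
```

Run against the real output file, the count would be expected to match the 138 records reported above.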


In [16]:
if (notifyStatus): status_notify("Task 3 Finalize the Output completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

In [17]:
print ('Total time for the script:',(datetime.now() - startTimeScript))

Total time for the script: 0:09:33.272139
