Zach Tretter

June 2020

--------


# Step 1 - Generate the Data

Workbook Scope: Scrape the historical campground fill-time data for Glacier National Park from the NPS website. Use requests.post because the URL does not change when selecting a given month/year combination. The output will be a CSV file. There are 13 campgrounds, 20 years, and 5 months available.
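To illustrate why a POST is needed, the sketch below builds (but never sends) a form submission for one month/year combination. It uses `urllib` from the standard library rather than `requests`, purely so the request can be inspected offline. The field names `selectmm` and `selectyy` are the ones used later in this workbook; the July 2015 values are arbitrary placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

# One campground's status page; the URL is identical for every month/year
campground_url = 'https://www.nps.gov/applications/glac/cgstatus/camping_detail.cfm?cg=Apgar'

# Hypothetical selection: July 2015
form_fields = {'selectmm': '7', 'selectyy': '2015'}

# The selection travels in the request body, not in the URL
body = urlencode(form_fields).encode('ascii')
request = Request(campground_url, data=body)

print(request.get_method())  # a Request with data defaults to POST
print(request.full_url)      # unchanged by the month/year selection
print(request.data)          # the encoded form fields
```

Because the month and year live in the body, two different selections share the same URL, which is why a plain GET of the page cannot reach the historical months.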

In [57]:
import requests
from bs4 import BeautifulSoup

import time
import datetime as dt
import calendar

import os
import pandas as pd
import numpy as np

## Webscrape Parent Page for Campground Status
1. Create a Beautiful Soup for the URL
2. Extract tags with campground names
3. Build the list of names and list of links
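
The three steps above can be tried offline on a small HTML fragment. The sketch below is an illustration only: it swaps BeautifulSoup for the standard library's `html.parser` so it runs with no third-party packages, and `SAMPLE_HTML` and `TableLinkCollector` are made-up names, not part of the NPS page or of this workbook. It does not handle nested tables.

```python
from html.parser import HTMLParser

# Made-up fragment: the second table (index 1) holds the campground links
SAMPLE_HTML = """
<table><tr><td>Navigation</td></tr></table>
<table>
  <tr><td><a href="camping_detail.cfm?cg=Apgar">Apgar</a></td></tr>
  <tr><td><a href="camping_detail.cfm?cg=Bowman Lake">Bowman Lake</a></td></tr>
</table>
"""

class TableLinkCollector(HTMLParser):
    """Collect link text and href values from the table at a given index."""

    def __init__(self, table_index):
        super().__init__()
        self.table_index = table_index
        self.current_table = -1   # index of the most recently opened table
        self.in_target = False    # True while inside the table of interest
        self.in_anchor = False    # True while inside an <a> in that table
        self.names = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.current_table += 1
            self.in_target = (self.current_table == self.table_index)
        elif tag == 'a' and self.in_target:
            self.in_anchor = True
            self.names.append('')
            self.links.append(dict(attrs).get('href'))

    def handle_endtag(self, tag):
        if tag == 'table':
            self.in_target = False
        elif tag == 'a':
            self.in_anchor = False

    def handle_data(self, data):
        # Accumulate the visible text of the current link
        if self.in_anchor:
            self.names[-1] += data

collector = TableLinkCollector(table_index=1)
collector.feed(SAMPLE_HTML)
print(collector.names)
print(collector.links)
```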

#### Instantiate Soup Object

In [2]:
# URL for Glacier National Park Campground Status
base_url = 'https://www.nps.gov/applications/glac/cgstatus/cgstatus.cfm'

# Create a request to the URL
res = requests.get(base_url)

# Create a beautiful soup object
soup = BeautifulSoup(res.content, 'lxml')

#### Extract Tags of Interest

In [3]:
# The table at index 3 (the fourth table on the page) contains the list of campgrounds
table = soup.find_all('table')[3]

# All campground names are in anchor (<a>) tags
campgrounds = table.find_all('a')

#### Build Lists of Names and Links

In [4]:
# Build a list of campground names
names = []

# Build a list of links to the campground status
status_links = []

# Iterate through the campground links to build these lists
for link in campgrounds:
    names.append(link.getText())
    status_links.append(link.get('href'))

#### Verify Output

In [5]:
# Verify the names and status links
for i in range(len(names)):
    print(names[i])
    print(status_links[i])
    print("\n")

Apgar
camping_detail.cfm?cg=Apgar


Avalanche
camping_detail.cfm?cg=Avalanche


Bowman Lake
camping_detail.cfm?cg=Bowman Lake


Cut Bank
camping_detail.cfm?cg=Cut Bank


Fish Creek
camping_detail.cfm?cg=Fish Creek


Kintla Lake
camping_detail.cfm?cg=Kintla Lake


Logging Creek
camping_detail.cfm?cg=Logging Creek


Many Glacier
camping_detail.cfm?cg=Many Glacier


Quartz Creek
camping_detail.cfm?cg=Quartz Creek


Rising Sun
camping_detail.cfm?cg=Rising Sun


Sprague Creek
camping_detail.cfm?cg=Sprague Creek


St. Mary
camping_detail.cfm?cg=St. Mary


Two Medicine
camping_detail.cfm?cg=Two Medicine




## Webscrape Each Campground

Reference on scraping pages that require a form submission:
http://jonathansoma.com/lede/foundations/classes/friday%20sessions/advanced-scraping-form-submissions-completed/

#### Build List of Full Status Link

In [7]:
# The base url for each campground's status page
status_url = 'https://www.nps.gov/applications/glac/cgstatus/'

# Create the list of links to status
fulllink_campgroundstatus_byname = []

for i in status_links:
    fulllink_campgroundstatus_byname.append(status_url + i)

# View Links
fulllink_campgroundstatus_byname

['https://www.nps.gov/applications/glac/cgstatus/camping_detail.cfm?cg=Apgar',
 'https://www.nps.gov/applications/glac/cgstatus/camping_detail.cfm?cg=Avalanche',
 'https://www.nps.gov/applications/glac/cgstatus/camping_detail.cfm?cg=Bowman Lake',
 'https://www.nps.gov/applications/glac/cgstatus/camping_detail.cfm?cg=Cut Bank',
 'https://www.nps.gov/applications/glac/cgstatus/camping_detail.cfm?cg=Fish Creek',
 'https://www.nps.gov/applications/glac/cgstatus/camping_detail.cfm?cg=Kintla Lake',
 'https://www.nps.gov/applications/glac/cgstatus/camping_detail.cfm?cg=Logging Creek',
 'https://www.nps.gov/applications/glac/cgstatus/camping_detail.cfm?cg=Many Glacier',
 'https://www.nps.gov/applications/glac/cgstatus/camping_detail.cfm?cg=Quartz Creek',
 'https://www.nps.gov/applications/glac/cgstatus/camping_detail.cfm?cg=Rising Sun',
 'https://www.nps.gov/applications/glac/cgstatus/camping_detail.cfm?cg=Sprague Creek',
 'https://www.nps.gov/applications/glac/cgstatus/camping_detail.cfm?cg=S
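
Several of these links contain spaces (for example `cg=Bowman Lake`). `requests` generally percent-encodes such URLs itself when it prepares a request, so the simple string concatenation above works in practice. As an alternative sketch, the standard library can resolve the relative link and encode it explicitly; `full_status_link` is a hypothetical helper, not part of the original workbook.

```python
from urllib.parse import quote, urljoin, urlsplit, urlunsplit

status_url = 'https://www.nps.gov/applications/glac/cgstatus/'

def full_status_link(base, href):
    """Resolve a relative href against base and percent-encode its query string."""
    parts = urlsplit(urljoin(base, href))
    # Keep '=' and '&' so the key=value structure of the query survives
    encoded_query = quote(parts.query, safe='=&')
    return urlunsplit(parts._replace(query=encoded_query))

print(full_status_link(status_url, 'camping_detail.cfm?cg=Apgar'))
print(full_status_link(status_url, 'camping_detail.cfm?cg=Bowman Lake'))
```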

#### Identify range of parameters for requests.post

In [8]:
# Parameters for requests.post
available_years = [str(i) for i in np.arange(2000,2020)]
available_months = [str(i) for i in np.arange(5,10)]

# Lower bound on runtime from the pause alone (network time is extra).
# Note: retrieve_historic sleeps 0.5 seconds per request, so the pause alone takes twice this estimate.
total_calls = len(names) * len(available_years) * len(available_months)
print(f'Total requests to NPS website will be {total_calls}')
print(f'Using a time.sleep of 0.25 seconds this will require at least \n {round(total_calls/4 / 60,1)} minutes to run')

Total requests to NPS website will be 1300
Using a time.sleep of 0.25 seconds this will require at least 
 5.4 minutes to run
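
The estimate above is simple arithmetic: requests multiplied by the pause per request. The sketch below reproduces it for the 0.25-second pause assumed in the printout and for the 0.5-second pause that `retrieve_historic` actually uses. `min_runtime_minutes` is a hypothetical helper introduced only for this illustration, and the figures count the pause only, not network time.

```python
def min_runtime_minutes(n_campgrounds, n_years, n_months, sleep_seconds):
    """Return (total requests, minimum minutes spent sleeping between them)."""
    total_calls = n_campgrounds * n_years * n_months
    return total_calls, round(total_calls * sleep_seconds / 60, 1)

calls, minutes_quarter = min_runtime_minutes(13, 20, 5, 0.25)
_, minutes_half = min_runtime_minutes(13, 20, 5, 0.5)

print(calls)            # 1300 requests
print(minutes_quarter)  # 5.4 minutes at 0.25 s per request
print(minutes_half)     # 10.8 minutes at 0.5 s per request
```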


#### Define Function to Remove Escape Characters

In [9]:
def strip_escape_chars(text):
    r'''
    Calling getText() on a cell in the 'Current and Historic Campground Fill Times' table returns text like this:
    '\n3\n\r\n\t\t\t\t\t\t\t11:00am\r\n\t\t\t\t\t\t\t\n\n'
    This function cleans it to: '3 11:00am'
    A date with no fill-up time is reduced to just a number (the day of the month).
    '''
    return text.strip().replace("\n", "").replace("\t", "").replace("\r", " ")
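
As a quick check of the behavior described in the docstring, the sketch below repeats the function so that it runs on its own and applies it to the docstring's sample cell, plus a made-up cell for a day with no fill time.

```python
def strip_escape_chars(text):
    # Same logic as the workbook's function, repeated so this block is self-contained
    return text.strip().replace("\n", "").replace("\t", "").replace("\r", " ")

# Sample cell from the docstring: day 3, filled at 11:00am
filled_cell = '\n3\n\r\n\t\t\t\t\t\t\t11:00am\r\n\t\t\t\t\t\t\t\n\n'

# Made-up cell for a day on which the campground did not fill
unfilled_cell = '\n17\n'

print(repr(strip_escape_chars(filled_cell)))    # '3 11:00am'
print(repr(strip_escape_chars(unfilled_cell)))  # '17'

# Splitting on the space separates the day from the fill time
print(strip_escape_chars(filled_cell).split(" "))  # ['3', '11:00am']
```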

#### Define Function to Extract Historic Fill Times

In [152]:
def retrieve_historic(link_to_campground,
                     months = available_months,
                     years = available_years):
    
    start_time = time.time()
    
    fill_times = []
    
    # Grab the Campground Name
    req = requests.get(link_to_campground)   
    soup = BeautifulSoup(req.text,'html.parser')
    title = soup.find('title').getText()
    name = title.split(" - ")[1].replace(" Campground Information","")

    # For each possible year, 2000 to 2019
    for specific_year in years:
        
        print(f'Examine {name} in {specific_year}')
        
        # For each possible month, May to September
        for specific_month in months:  
            
            # Pause between each loop to not trigger an adverse response from NPS
            time.sleep(0.5)
        
            # Declare the parameters for this particular request
            params = {'selectmm': specific_month,
                      'selectyy': specific_year}
            
            # Create the POST request, sending the month/year selection as form data
            response = requests.post(link_to_campground, data=params)
            
            # Create the Soup output
            soup = BeautifulSoup(response.text,'html.parser')
            
            # The third table contains the data of interest
            content = soup.find_all('table')[2]
            
            # Month-Year is the first cell
            month_text, year_text = strip_escape_chars(content.find_all('td')[0].getText()).split(" - ")
            month_num = dt.datetime.strptime(month_text, "%B").month
            year = int(year_text)
            
            # Entries in the table start on the 9th cell
            # The first is the header "month - year"
            # The next seven are the days of the week

            for index, i in enumerate(content.find_all('td')[8:]):
                clean = strip_escape_chars(i.getText()).split(" ")

                # Skip empty cells (e.g. Wed when the month starts on a Fri)
                if clean == ['']:
                    # An empty cell past index 30 can only be padding after the last day, so stop
                    if index > 30:
                        break
                    continue

                # For valid table entries, record the fill time, or None if the campground did not fill
                else:
                    day = int(clean[0])
                    date = dt.date(year, month_num, day)
                    
                    if len(clean)==1:
                        did_fill = 0
                        fill_time = None
                        
                    else:
                        did_fill = 1
                        fill_time = clean[1]
                    
                    entry = {
                        'cg_name': name,
                        'date': date,
                        'did_fill': did_fill,
                        'fill_time': fill_time,
                    }

                    fill_times.append(entry)
            
    print(f'Elapsed time {round(time.time() - start_time,0)} secs') 
                              
    return pd.DataFrame(fill_times)
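
The per-cell logic inside `retrieve_historic` can be exercised without touching the NPS website. The sketch below is a simplified, offline restatement of that logic, not a replacement for the function: `parse_header` and `parse_cell` are hypothetical helpers, and the sample strings (including the "July - 2019" header) are assumed from the formats the code above expects.

```python
import datetime as dt

def strip_escape_chars(text):
    # Same cleaning logic as the workbook's function
    return text.strip().replace("\n", "").replace("\t", "").replace("\r", " ")

def parse_header(raw_text):
    """Turn a header such as 'July - 2019' into (year, month number)."""
    month_text, year_text = strip_escape_chars(raw_text).split(" - ")
    month_num = dt.datetime.strptime(month_text, "%B").month
    return int(year_text), month_num

def parse_cell(raw_text, year, month_num, cg_name):
    """Return one record for a calendar cell, or None for an empty cell."""
    clean = strip_escape_chars(raw_text).split(" ")
    if clean == ['']:
        return None
    filled = len(clean) > 1
    return {
        'cg_name': cg_name,
        'date': dt.date(year, month_num, int(clean[0])),
        'did_fill': 1 if filled else 0,
        'fill_time': clean[1] if filled else None,
    }

year, month_num = parse_header('\nJuly - 2019\n')
filled = parse_cell('\n3\n\r\n\t\t\t\t\t\t\t11:00am\r\n\t\t\t\t\t\t\t\n\n', year, month_num, 'Apgar')
unfilled = parse_cell('\n4\n', year, month_num, 'Apgar')
empty = parse_cell('\n\n', year, month_num, 'Apgar')

print(filled)
print(unfilled)
print(empty)
```

Keeping the parsing in small functions like these makes it possible to test the date and fill-time handling separately from the network calls.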

### Build the Dataframe

In [176]:
# Collect one DataFrame per campground
frames = []

for campground in fulllink_campgroundstatus_byname:
    frames.append(retrieve_historic(campground))

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
df = pd.concat(frames)

df

Examine Apgar in 2000
Elapsed time 6.0
Examine Apgar in 2001
Elapsed time 10.0
Examine Apgar in 2002
Elapsed time 14.0
Examine Apgar in 2003
Elapsed time 20.0
Examine Apgar in 2004
Elapsed time 25.0
Examine Apgar in 2005
Elapsed time 30.0
Examine Apgar in 2006
Elapsed time 35.0
Examine Apgar in 2007
Elapsed time 40.0
Examine Apgar in 2008
Elapsed time 45.0
Examine Apgar in 2009
Elapsed time 51.0
Examine Apgar in 2010
Elapsed time 55.0
Examine Apgar in 2011
Elapsed time 59.0
Examine Apgar in 2012
Elapsed time 64.0
Examine Apgar in 2013
Elapsed time 69.0
Examine Apgar in 2014
Elapsed time 73.0
Examine Apgar in 2015
Elapsed time 78.0
Examine Apgar in 2016
Elapsed time 82.0
Examine Apgar in 2017
Elapsed time 87.0
Examine Apgar in 2018
Elapsed time 91.0
Examine Apgar in 2019
Elapsed time 95.0
Examine Avalanche in 2000
Elapsed time 4.0
Examine Avalanche in 2001
Elapsed time 8.0
Examine Avalanche in 2002
Elapsed time 12.0
Examine Avalanche in 2003
Elapsed time 16.0
Examine Avalanche in 2004
E

Elapsed time 15.0
Examine Rising Sun in 2003
Elapsed time 19.0
Examine Rising Sun in 2004
Elapsed time 23.0
Examine Rising Sun in 2005
Elapsed time 28.0
Examine Rising Sun in 2006
Elapsed time 32.0
Examine Rising Sun in 2007
Elapsed time 37.0
Examine Rising Sun in 2008
Elapsed time 41.0
Examine Rising Sun in 2009
Elapsed time 46.0
Examine Rising Sun in 2010
Elapsed time 50.0
Examine Rising Sun in 2011
Elapsed time 54.0
Examine Rising Sun in 2012
Elapsed time 60.0
Examine Rising Sun in 2013
Elapsed time 65.0
Examine Rising Sun in 2014
Elapsed time 70.0
Examine Rising Sun in 2015
Elapsed time 74.0
Examine Rising Sun in 2016
Elapsed time 79.0
Examine Rising Sun in 2017
Elapsed time 83.0
Examine Rising Sun in 2018
Elapsed time 87.0
Examine Rising Sun in 2019
Elapsed time 92.0
Examine Sprague Creek in 2000
Elapsed time 5.0
Examine Sprague Creek in 2001
Elapsed time 9.0
Examine Sprague Creek in 2002
Elapsed time 14.0
Examine Sprague Creek in 2003
Elapsed time 19.0
Examine Sprague Creek in 20

Unnamed: 0,cg_name,date,did_fill,fill_time
0,Apgar,2000-05-01,0,
1,Apgar,2000-05-02,0,
2,Apgar,2000-05-03,0,
3,Apgar,2000-05-04,0,
4,Apgar,2000-05-05,1,9:25am
...,...,...,...,...
3055,Two Medicine,2019-09-26,0,
3056,Two Medicine,2019-09-27,0,
3057,Two Medicine,2019-09-28,0,
3058,Two Medicine,2019-09-29,0,


### Export to CSV

In [180]:
df.to_csv('../01_filltimes_raw.csv')