# US National Parks
This notebook is structured based on a [project scoping guide](http://www.datasciencepublicpolicy.org/our-work/tools-guides/data-science-project-scoping-guide/) provided by Carnegie Mellon University.<br><br>
**GOALS:** (***1***) Characterize recreational use of public lands in the United States; and (***2***) learn how to scrape and visualize data from a webpage.<br>
**DATA:** Publicly available data on recreational use of the US national parks will be scraped from the [National Park Service (NPS)](https://irma.nps.gov/STATS/) website.<br>
**ANALYSIS:** Exploratory data analysis to gain insights into the dataset.<br>
**ETHICAL CONSIDERATIONS:** There are no apparent issues with privacy, transparency, discrimination/equity, or accountability in the available data. Whether access to the national parks is equitable across communities in the US should be considered further. The NPS has begun administering a [survey](https://www.nps.gov/subjects/socialscience/socioeconomic-monitoring-visitor-surveys.htm) to understand who accesses the parks, and whether access differs as a function of demographic and economic factors. I'd like to incorporate data from that survey into this notebook when it becomes publicly available.<br>
**ADDITIONAL CONSIDERATIONS:** None.

## Load libraries

In [1]:
import pandas as pd
import requests as rq
from bs4 import BeautifulSoup

## Gather list of US national parks
The first step is to compile a list of the national parks in the United States. For practice, I will scrape this information from this [wiki page](https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States).

In [2]:
# url to scrape information from
wiki = 'https://en.wikipedia.org/wiki/List_of_national_parks_of_the_United_States'

In [3]:
# convert webpage to text
page = rq.get(wiki).text

# convert into BeautifulSoup object
soup = BeautifulSoup(page, 'html.parser')
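
The request above assumes the page loads successfully. A minimal sketch of a more defensive fetch follows; the helper name `fetch_html`, the timeout value, and the injectable `get` argument are assumptions added for illustration and are not part of the original notebook.

```python
import requests as rq


def fetch_html(url, get=rq.get, timeout=30):
    """Return the page body as text, raising if the request failed.

    `get` is injectable (an assumption for illustration) so the helper
    can be exercised without touching the network.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, the cell above could read `page = fetch_html(wiki)`, so a failed download stops the notebook early instead of handing an error page to the parser.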

In [4]:
# pull the table tag that matches our class name
table = soup.find('table', class_='wikitable sortable plainrowheaders')

# find <tr> tags in our specified table, ignoring the labels row
parks_table = table.find_all('tr')[1:]

In [5]:
# create empty list to store park names
park_names = []

# extract park names from <a> tags of our table
for park in parks_table:
    name = park.find('a').get('title')
    park_names.append(name)
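
The loop above assumes every row contains a link with a `title` attribute; `park.find('a')` returns `None` when a row has no link, and calling `.get` on it would raise an `AttributeError`. The sketch below shows one way to guard against that. The function name `extract_titles` and the miniature `sample_html` table are hypothetical and exist only to make the example self-contained.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, for illustration only
sample_html = """
<table class="wikitable sortable plainrowheaders">
  <tr><th>Name</th><th>Location</th></tr>
  <tr><th scope="row"><a href="/wiki/Acadia_National_Park" title="Acadia National Park">Acadia</a></th><td>Maine</td></tr>
  <tr><th scope="row">No link in this row</th><td>n/a</td></tr>
  <tr><th scope="row"><a href="/wiki/Arches_National_Park" title="Arches National Park">Arches</a></th><td>Utah</td></tr>
</table>
"""


def extract_titles(html):
    """Return the title of the first link in each data row of the table."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', class_='wikitable sortable plainrowheaders')
    if table is None:
        # the page layout changed or the class name no longer matches
        return []
    titles = []
    for row in table.find_all('tr')[1:]:  # skip the header row
        link = row.find('a')
        # skip rows without a link or without a title attribute
        if link is not None and link.get('title'):
            titles.append(link.get('title'))
    return titles


print(extract_titles(sample_html))
```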

In [6]:
# print list of park names
print(park_names)

['Acadia National Park', 'National Park of American Samoa', 'Arches National Park', 'Badlands National Park', 'Big Bend National Park', 'Biscayne National Park', 'Black Canyon of the Gunnison National Park', 'Bryce Canyon National Park', 'Canyonlands National Park', 'Capitol Reef National Park', 'Carlsbad Caverns National Park', 'Channel Islands National Park', 'Congaree National Park', 'Crater Lake National Park', 'Cuyahoga Valley National Park', 'Death Valley National Park', 'Denali National Park and Preserve', 'Dry Tortugas National Park', 'Everglades National Park', 'Gates of the Arctic National Park and Preserve', 'Gateway Arch National Park', 'Glacier National Park (U.S.)', 'Glacier Bay National Park and Preserve', 'Grand Canyon National Park', 'Grand Teton National Park', 'Great Basin National Park', 'Great Sand Dunes National Park and Preserve', 'Great Smoky Mountains National Park', 'Guadalupe Mountains National Park', 'Haleakalā National Park', 'Hawaiʻi Volcanoes National Par

## Access data from NPS
Data for this project is publicly available on the NPS website. I'm going to try a different web-scraping method and use the `read_html` function from **pandas**. The rationale for the approach I use here is described in [this tutorial](https://www.youtube.com/watch?v=ooj84UP3r6M).

In [7]:
# define url template; {} is replaced with a park abbreviation
nps = 'https://irma.nps.gov/STATS/SSRSReports/Park%20Specific%20Reports/Recreation%20Visitors%20By%20Month%20(1979%20-%20Last%20Calendar%20Year)?Park={}'

Now we can loop through a list of NPS-defined park abbreviations to access data for each specific national park.

In [8]:
# parks list by NPS-defined abbreviations
parks = ['ACAD', 'ARCH', 'BADL', 'BIBE', 'BISC', 'BLCA', 'BRCA', 'CANY', 'CARE', 'CAVE', 
         'CHIS', 'CONG', 'CRLA', 'CUVA', 'DEVA', 'DENA', 'DRTO', 'EVER', 'GAAR', 'JEFF', 
         'GLBA', 'GLAC', 'GRCA', 'GRTE', 'GRBA', 'GRSA', 'GRSM', 'GUMO', 'HALE', 'HAVO', 
         'HOSP', 'INDU', 'ISRO', 'JOTR', 'KATM', 'KEFJ', 'KICA', 'KOVA', 'LACL', 'LAVO', 
         'MACA', 'MEVE', 'MORA', 'NERI', 'NPSA', 'NOCA', 'OLYM', 'PEFO', 'PINN', 'REDW', 
         'ROMO', 'SAGU', 'SEQU', 'SHEN', 'THRO', 'VIIS', 'VOYA', 'WHSA', 'WICA', 'WRST', 
         'YELL', 'YOSE', 'ZION']
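
Because the abbreviations are typed by hand, a quick sanity check before building URLs can catch duplicates and malformed entries. This is only a sketch: `check_park_codes` is a hypothetical helper, and the four-uppercase-letter pattern is an assumption drawn from the codes listed above rather than a documented NPS rule. It cannot tell whether a well-formed code actually exists on the NPS site.

```python
from collections import Counter


def check_park_codes(codes):
    """Return a list of human-readable problems found in `codes`."""
    problems = []
    # report any code that was entered more than once
    for code, count in Counter(codes).items():
        if count > 1:
            problems.append(f'{code} appears {count} times')
    for code in codes:
        # Assumption: every code is exactly four uppercase ASCII letters
        well_formed = (len(code) == 4 and code.isascii()
                       and code.isalpha() and code.isupper())
        if not well_formed:
            problems.append(f'{code} is not four uppercase letters')
    return problems


print(check_park_codes(['ACAD', 'ARCH', 'ZION']))
```

An empty list means no problems were found; in the notebook the call would be `check_park_codes(parks)`.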

In [9]:
parks_data = []

for park in parks:
    url = nps.format(park)
    print(url)
    # add table extraction here

https://irma.nps.gov/STATS/SSRSReports/Park%20Specific%20Reports/Recreation%20Visitors%20By%20Month%20(1979%20-%20Last%20Calendar%20Year)?Park=ACAD
https://irma.nps.gov/STATS/SSRSReports/Park%20Specific%20Reports/Recreation%20Visitors%20By%20Month%20(1979%20-%20Last%20Calendar%20Year)?Park=ARCH
https://irma.nps.gov/STATS/SSRSReports/Park%20Specific%20Reports/Recreation%20Visitors%20By%20Month%20(1979%20-%20Last%20Calendar%20Year)?Park=BADL
https://irma.nps.gov/STATS/SSRSReports/Park%20Specific%20Reports/Recreation%20Visitors%20By%20Month%20(1979%20-%20Last%20Calendar%20Year)?Park=BIBE
https://irma.nps.gov/STATS/SSRSReports/Park%20Specific%20Reports/Recreation%20Visitors%20By%20Month%20(1979%20-%20Last%20Calendar%20Year)?Park=BISC
https://irma.nps.gov/STATS/SSRSReports/Park%20Specific%20Reports/Recreation%20Visitors%20By%20Month%20(1979%20-%20Last%20Calendar%20Year)?Park=BLCA
https://irma.nps.gov/STATS/SSRSReports/Park%20Specific%20Reports/Recreation%20Visitors%20By%20Month%20(1979%20-%

Looks good! The next step will be to read the data tables into the notebook, followed by general data tidying.<br><br>***To be continued***
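
One possible shape for that table-reading step is sketched here, under several assumptions: the helper name `collect_park_tables`, the added `Park` column, and the choice of the first table on each page are all hypothetical. Whether the NPS report pages serve their tables as static HTML that `pandas.read_html` can parse has not been verified here.

```python
import pandas as pd


def collect_park_tables(codes, url_template, read_tables=pd.read_html):
    """Read one table per park code and stack them into a single DataFrame.

    `read_tables` is injectable (an assumption for illustration) so the
    loop can be tried out without contacting the NPS site.
    """
    frames = []
    for code in codes:
        try:
            tables = read_tables(url_template.format(code))
        except ValueError:
            # pandas.read_html raises ValueError when it finds no tables
            continue
        if not tables:
            continue
        df = tables[0].copy()  # assumption: the first table is the one wanted
        df['Park'] = code      # remember which park each row came from
        frames.append(df)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

In the notebook this would be called as `parks_data = collect_park_tables(parks, nps)`. Parks whose pages yield no table are skipped silently in this sketch, so comparing `parks_data['Park'].nunique()` against `len(parks)` would show whether any were dropped.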