Harvest data from Web APIs using the Python Requests library
====================================================

Amanda Devine  
25 July 2019  
SI Carpentries Brown Bag


***

GitHub Repository: https://github.com/amdevine/cbb-python-requests  
Detailed Jupyter notebook: https://github.com/amdevine/cbb-python-requests/blob/master/harvest-data-apis-python-requests.ipynb 

Definitions
------------------

- **(REST) API**: Application Programming Interface. A set of URLs on a website that return structured data for use by other programs and applications.


- **GET Request**: An HTTP request that retrieves code or data from a URL without changing anything on the server.


- **JSON**: JavaScript Object Notation. A common text format for structuring data, analogous to a Python dictionary.


- **Base URL**: The "home" website URL for all API data.
    > NPS Base URL: `https://developer.nps.gov/api/v1`

- **Endpoint**: The specific URL where the API page can be found. 
    > Parks Endpoint: `https://developer.nps.gov/api/v1/parks`


- **Parameter**: An additional criterion appended to the endpoint URL to filter the data returned.
    > parkCode, stateCode, and limit parameters: `https://developer.nps.gov/api/v1/parks?parkCode=yell&stateCode=WY&limit=5`
    

- **API Key**: A string of characters assigned by the website to identify the user requesting data via the API. 
    > National Parks API Key: `https://developer.nps.gov/api/v1/parks?api_key=1mdaBewB37R0kUA2ZtfA6URe7PeUsig6jLQmSXyx` (not a real key!)

NPS Data API
-----------------

Official source of data about natural areas managed by the National Park Service
- park information
- campground information
- alerts, events, news, educational resources, etc.


NPS API Keys: https://www.nps.gov/subjects/developer/get-started.htm

NPS Data API documentation: https://www.nps.gov/subjects/developer/api-documentation.htm  

Python Requests library
-----------------------------

Sample GET Request:

    import requests
    url = 'https://baseurl.com/endpoint'
    params = {
        'field1': 'value1',
        'field2': 'value2',
    }
    r = requests.get(url, params).json()
    
Quickstart documentation: https://2.python-requests.org/en/master/user/quickstart/
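
To show what happens to the `params` dictionary, the sketch below builds the example URL from the Parameter definition by hand. It uses the standard library's `urllib.parse.urlencode` as a stand-in; Requests performs a comparable encoding step when given `params`, though its internals are not shown here. The variable names are illustrative.

```python
from urllib.parse import urlencode

# Illustrative names; the values come from the NPS examples in the Definitions.
base_url = 'https://developer.nps.gov/api/v1'
endpoint = base_url + '/parks'
params = {'parkCode': 'yell', 'stateCode': 'WY', 'limit': 5}

# urlencode() turns the dictionary into 'key=value' pairs joined by '&'.
full_url = endpoint + '?' + urlencode(params)
print(full_url)
# https://developer.nps.gov/api/v1/parks?parkCode=yell&stateCode=WY&limit=5
```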

Setup
--------

Import the `requests` and `pandas` libraries.

In [None]:
import requests
import pandas as pd

Save the API key as a constant or read it from a local file.

In [None]:
# API_KEY = '1mdaBewB37R0kUA2ZtfA6URe7PeUsig6jLQmSXyx'
with open('api_key_file.txt', 'r') as f:
    API_KEY = f.read().strip()
print("API Key: {}".format("API_KEY")) # Remove quotes to display actual API_KEY
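
A third option, sketched here as an assumption rather than part of the original talk, is to read the key from an environment variable and fall back to the local file. The function name `load_api_key` and the variable name `NPS_API_KEY` are hypothetical choices for illustration.

```python
import os

def load_api_key(env_var='NPS_API_KEY', path='api_key_file.txt'):
    """Return the key from an environment variable, else from a local file.

    Both default names are hypothetical; use whatever fits your setup.
    """
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    # Fall back to the file-based approach shown above.
    with open(path, 'r') as f:
        return f.read().strip()
```

Calling `API_KEY = load_api_key()` would then replace the `with open(...)` lines, and the key stays out of both the notebook and the repository.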

Make a GET request to the API to retrieve data
--------------------------------------------------------

This request returns data on up to 100 parks in Washington DC, Maryland, and Virginia.

In [None]:
url = 'https://developer.nps.gov/api/v1/parks'
params = {
    'api_key': API_KEY,
    'stateCode': 'DC,MD,VA', # Per the API documentation, separate multiple values with commas
    'fields': 'entranceFees',
    'limit': 100
}
r = requests.get(url, params)

`api_key` is a required parameter for all NPS Data API requests. `stateCode` filters parks by two-letter US state abbreviation. `fields` requests extra fields beyond those returned by default. `limit` sets the maximum number of results to return.
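
When the state codes start out as a Python list, the comma-separated string can be built with `','.join()`. The helper below is a small sketch; `build_parks_params` is a hypothetical name, and it assumes `fields` accepts the same comma-separated convention as `stateCode`.

```python
def build_parks_params(api_key, states, fields=None, limit=100):
    """Build a params dictionary, joining list values with commas.

    Hypothetical helper; assumes `fields` is comma-separated like `stateCode`.
    """
    params = {
        'api_key': api_key,
        'stateCode': ','.join(states),
        'limit': limit,
    }
    if fields:
        params['fields'] = ','.join(fields)
    return params

# 'YOUR_API_KEY' is a placeholder, not a real key.
example_params = build_parks_params('YOUR_API_KEY', ['DC', 'MD', 'VA'],
                                    fields=['entranceFees'])
print(example_params['stateCode'])  # DC,MD,VA
```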

`requests.get()` returns a `Response` object that holds a variety of information about the retrieved page, including the status code, the final URL, and the body text.

In [None]:
print("The response code is: {}".format(r.status_code))
print("\nThe retrieved URL is: {}".format("r.url")) # Remove quotes to display the URL (it includes the API key)
print("\nThe first 300 characters of the retrieved text are:\n{}".format(r.text[:300]))
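
A status code of 200 means the request succeeded. Rather than checking the code by eye, a script can call `raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx responses. The wrapper below is a minimal sketch; `get_json` and its `session` argument are hypothetical names, and the 10-second timeout is an arbitrary choice.

```python
import requests

def get_json(url, params=None, session=None, timeout=10):
    """GET a URL and return the parsed JSON, raising on HTTP error codes.

    Hypothetical wrapper; `session` may be a requests.Session, else the
    requests module itself is used.
    """
    http = session or requests
    response = http.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.json()
```

With this in place, `parks_data = get_json(url, params)` would stop with a clear error if, for example, the API key were rejected.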

## Work with retrieved data

### Convert the response object to a dictionary

In [None]:
parks_data = r.json()

print("Top level keys:", list(parks_data))
print("\nAvailable keys in each entry:", list(parks_data['data'][0]))
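
`r.json()` parses the response text much as the standard library's `json.loads` parses a string. The sketch below runs without a network connection; the sample text is made up and far smaller than a real response, and only the `data` key is taken from the code above.

```python
import json

# Made-up miniature response; a real one contains many more fields.
sample_text = (
    '{"total": "2", "data": ['
    '{"parkCode": "aaaa", "fullName": "Example Park A"}, '
    '{"parkCode": "bbbb", "fullName": "Example Park B"}]}'
)

sample = json.loads(sample_text)      # JSON object -> Python dict
print(type(sample).__name__)          # dict
print(list(sample))                   # ['total', 'data']
print(sample['data'][0]['fullName'])  # Example Park A
```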

### Create a DataFrame

This code selects each park's code, name, designation, states, and latitude/longitude coordinates from the retrieved data.

In [None]:
parks_df = pd.DataFrame(parks_data['data'])
locations_df = parks_df[['parkCode', 'fullName', 'designation', 'states', 'latLong']]
locations_df.head(10)

### Restructure/flatten data

View the retrieved JSON data for one park's multiple entrance fees.

In [None]:
parks_data['data'][2]['entranceFees']

For each park in the dataset, and for each entrance fee in that park, add selected park and fee values as a dictionary to a new `entry_fees_data` list.

In [None]:
entry_fees_data = []
for park in parks_data['data']:
    for fee in park['entranceFees']:
        entry_fees_data.append({
            'parkCode': park['parkCode'],
            'fullName': park['fullName'],
            'designation': park['designation'],
            'fee_usd': fee['cost'],
            'fee_type': fee['title'],
            'fee_description': fee['description']
        })
print(entry_fees_data[:3])

Convert `entry_fees_data` to a DataFrame.

In [None]:
entry_fees_df = pd.DataFrame(entry_fees_data)
entry_fees_df = entry_fees_df[['parkCode', 'fullName', 'designation', 'fee_usd', 'fee_type']]
entry_fees_df['fee_usd'] = entry_fees_df['fee_usd'].astype(float)
entry_fees_df.head(10)
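
As an alternative to the nested loop, pandas offers `pd.json_normalize` (available since pandas 1.0), which can expand a nested list into rows while carrying parent fields along. The sketch below uses made-up sample data shaped like `parks_data['data']`; the park names and fee values are placeholders, not real NPS data.

```python
import pandas as pd

# Made-up sample shaped like parks_data['data'] above.
sample_parks = [
    {'parkCode': 'aaaa', 'fullName': 'Example Park A',
     'entranceFees': [{'cost': '15.00', 'title': 'Per Vehicle'},
                      {'cost': '7.00', 'title': 'Per Person'}]},
    {'parkCode': 'bbbb', 'fullName': 'Example Park B',
     'entranceFees': [{'cost': '0.00', 'title': 'No Fee'}]},
]

# One row per entrance fee; `meta` copies the park-level fields onto each row.
fees_df = pd.json_normalize(
    sample_parks,
    record_path='entranceFees',
    meta=['parkCode', 'fullName'],
)
fees_df['cost'] = fees_df['cost'].astype(float)
print(fees_df)
```

The explicit loop gives more control over column names, while `json_normalize` is shorter when the nested keys can be used as they are.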

## Export data as a tabular file

CSV file: `df_name.to_csv('output_file_name.csv', index=False)`

TSV file: `df_name.to_csv('output_file_name.tsv', sep='\t', index=False)`

In [None]:
locations_df.to_csv('parks_data.tsv', sep='\t', index=False)
entry_fees_df.to_csv('parks_entry_fees.tsv', sep='\t', index=False)
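
To check what an export contains without opening the file, the data can be written to an in-memory buffer and read back. This is a small sketch with a made-up two-row DataFrame; it shows that `index=False` leaves out the row index, so the header holds only the data columns.

```python
import io
import pandas as pd

# Made-up stand-in for entry_fees_df.
df = pd.DataFrame({'parkCode': ['aaaa', 'bbbb'], 'fee_usd': [15.0, 0.0]})

# Write tab-separated text to memory instead of to a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, sep='\t', index=False)
text = buffer.getvalue()
header = text.splitlines()[0]
print(header.split('\t'))  # ['parkCode', 'fee_usd']

# Read the text back in, using the same separator.
restored = pd.read_csv(io.StringIO(text), sep='\t')
print(restored)
```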

Additional API Resources
-------------------------------

Full Requests documentation: https://2.python-requests.org/en/master/

List of US Federal Government APIs: https://catalog.data.gov/dataset?res_format=API

Repository of APIs: https://www.programmableweb.com/