# Lecture 7: FAIR data and environmental research
ENVR 890-001: Python for Environmental Research, Fall 2020

October 9, 2020

By Andrew Hamilton. Some material adapted from Matthew Huber & Ashley Dicks (Purdue University).

Thanks to Drs. Venkatesh Merwade, Matthew Huber, Carol Song, and Lan Zhao at Purdue University, and the FACT Cyber Training Fellowship (funded under NSF Award #1829764), for training me in FAIR data and science best practices. 

### Summary
The use and production of data is fundamental to all research, including environmental science, engineering, and health. Data are stored in a wide variety of formats and locations, from curated online databases to individual spreadsheets to PDF tables. In this lesson, we will learn how to download data from a number of online repositories for use in our research. We will also learn about FAIR data standards and best practices for making your own data available to other researchers.

### Major public environmental data sources
1. [USGS National Water Information System](https://waterdata.usgs.gov/nwis)
    - Stream flow, water quality, water use, etc.
    - Usually easier to search for a specific location (e.g., [Bolin Creek](https://waterdata.usgs.gov/nwis/uv?site_no=0209734440))
1. [Natural Resources Conservation Service (NRCS)](https://www.wcc.nrcs.usda.gov/snow/)
    - Snow depth, water supply forecasts, etc.
1. [NOAA National Centers for Environmental Information](https://www.ncdc.noaa.gov/data-access)
    - Historical weather data, severe weather database, etc.
1. [EPA Air Data](https://www.epa.gov/outdoor-air-quality-data)
    - Current & historical air quality data
    - Automated plot generation & data downloads
1. [EPA Environmental Dataset Gateway](https://edg.epa.gov/metadata/catalog/main/home.page)
    - Many different datasets on climate change, locations of Superfund sites, environmental justice, etc.
1. [CDC National Center for Environmental Health](https://www.cdc.gov/nceh/data.htm)
    - Many datasets about asthma, lead poisoning, nutrition, etc.
1. [Organisation for Economic Co-operation and Development (OECD) Data](https://data.oecd.org/)
    - Country-scale data on environment (air pollution, water withdrawals, CO2 emissions, etc.)
    - Energy, healthcare, development, etc.
1. [The National Map (USGS)](https://viewer.nationalmap.gov/basic/)
    - National Hydrography Dataset
    - Digital Elevation Models
    - Place names, transportation networks
1. [Multi-Resolution Land Characteristics Consortium (MRLC)](https://www.mrlc.gov/data?f%5B0%5D=category%3Aland%20cover)
    - Land use/land cover datasets
1. [National Center for Atmospheric Research (NCAR) Research Data Archive](https://rda.ucar.edu/)
    - Tons of gridded oceanic & atmospheric datasets & reanalyses

### 4 ways to load data into Python
1. Download csv/xlsx, and use ``pd.read_csv()``, ``pd.read_excel()``

1. Copy online table into Excel, then follow #1
    - e.g., USGS "Tab-separated" output format
    
1. Use a data provider's API from Python
    - [EPA Envirofacts Data Service API](https://www.epa.gov/enviro/web-services)
    - These APIs can be tricky to use, and each one is different. They often require special Python packages and/or an account with the provider. But if you need to download many different datasets, or get updated data regularly, it may be worth the effort to learn.
    - [Here](https://techrando.com/2019/07/04/how-to-use-the-environmental-protection-agencys-epas-api-to-pull-data/) is an example of how to use the EPA Envirofacts API, but I haven't actually used it.
    - [Here](http://kapadia.github.io/usgs/) is a package for interfacing with the USGS API, but I haven't used this either.
    
1. Query the online table directly using its URL (often the most convenient)

We will use USGS streamflow data from Bolin Creek in Chapel Hill, NC. If you click on the Bolin Creek link above, the USGS website shows the options for downloading data from this location. One of the output options for any query is tab-separated format, which opens a new browser window with a plain-text table of data. For example, [here](https://nwis.waterdata.usgs.gov/nwis/uv?cb_00060=on&cb_00065=on&format=rdb&site_no=0209734440&period=&begin_date=2019-10-02&end_date=2020-10-09). Pandas can read this page directly from its URL, so we do not have to first save a file to our computer:

```python
import pandas as pd
import numpy as np
```

```python
### Use USGS Bolin Creek website to get tab-separated data. Fill in address, header, and delimiter
data_address = 'https://nwis.waterdata.usgs.gov/nwis/uv?cb_00060=on&cb_00065=on&format=rdb&site_no=0209734440&period=&begin_date=2019-10-02&end_date=2020-10-09'
# Row (counting from 0) that holds the column names; the lines above it are descriptive notes
header = 30
# Columns are separated by tabs, not commas
delimiter = '\t'
df = pd.read_csv(data_address, header=header, delimiter=delimiter)
```

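```python
# The first row after the column names holds field-width codes (e.g., 5s, 15s), not data, so drop it
df = df.iloc[1:, :]
df
```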

| | agency_cd | site_no | datetime | tz_cd | 89527_00065 | 89527_00065_cd | 89528_00060 | 89528_00060_cd |
|---|---|---|---|---|---|---|---|---|
| 1 | USGS | 0209734440 | 2019-10-02 00:00 | EDT | 1.21 | A | 0.00 | A:R |
| 2 | USGS | 0209734440 | 2019-10-02 00:15 | EDT | 1.21 | A | 0.00 | A:R |
| 3 | USGS | 0209734440 | 2019-10-02 00:30 | EDT | 1.21 | A | 0.00 | A:R |
| 4 | USGS | 0209734440 | 2019-10-02 00:45 | EDT | 1.20 | A | 0.00 | A:R |
| 5 | USGS | 0209734440 | 2019-10-02 01:00 | EDT | 1.21 | A | 0.00 | A:R |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 35395 | USGS | 0209734440 | 2020-10-09 10:15 | EDT | 1.62 | P | 1.19 | P |
| 35396 | USGS | 0209734440 | 2020-10-09 10:30 | EDT | 1.62 | P | 1.19 | P |
| 35397 | USGS | 0209734440 | 2020-10-09 10:45 | EDT | 1.61 | P | 1.10 | P |
| 35398 | USGS | 0209734440 | 2020-10-09 11:00 | EDT | 1.62 | P | 1.19 | P |
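
Every column in this table is likely stored as text, because the field-width row that follows the column names stops pandas from recognizing the numbers. Below is a minimal sketch of one way to tidy such a table. It uses a small hand-typed sample in the same tab-separated layout rather than a live download. The names `sample_rdb`, `raw`, `tidy`, `gage_height_ft`, and `discharge_cfs` are my own choices for illustration (USGS parameter code 00065 is gage height in feet, and 00060 is discharge in cubic feet per second). Passing `comment='#'` is an alternative to counting header lines by hand.

```python
import io
import pandas as pd

# Hand-typed sample that mimics the USGS tab-separated layout (not a real download)
sample_rdb = (
    "# descriptive notes at the top of the file start with a hash sign\n"
    "agency_cd\tsite_no\tdatetime\ttz_cd\t89527_00065\t89527_00065_cd\t89528_00060\t89528_00060_cd\n"
    "5s\t15s\t20d\t6s\t14n\t10s\t14n\t10s\n"
    "USGS\t0209734440\t2019-10-02 00:00\tEDT\t1.21\tA\t0.00\tA:R\n"
    "USGS\t0209734440\t2019-10-02 00:15\tEDT\t1.21\tA\t0.00\tA:R\n"
    "USGS\t0209734440\t2019-10-02 00:30\tEDT\t1.21\tA\t0.00\tA:R\n"
)

# comment='#' skips the notes, so there is no need to count header lines.
# dtype=str keeps the leading zero in site_no.
raw = pd.read_csv(io.StringIO(sample_rdb), sep='\t', comment='#', dtype=str)

# Drop the field-width row (5s, 15s, ...) that follows the column names
tidy = raw.iloc[1:].copy()

# Give the measurement columns readable names (chosen here for illustration)
tidy = tidy.rename(columns={'89527_00065': 'gage_height_ft',
                            '89528_00060': 'discharge_cfs'})

# Convert text to timestamps and numbers
tidy['datetime'] = pd.to_datetime(tidy['datetime'])
for col in ['gage_height_ft', 'discharge_cfs']:
    tidy[col] = pd.to_numeric(tidy[col], errors='coerce')

# Index by time so the series is ready for plotting or resampling
tidy = tidy.set_index('datetime')
print(tidy[['gage_height_ft', 'discharge_cfs']])
```

With numeric columns and a datetime index, operations such as `tidy['discharge_cfs'].mean()` or `tidy.resample('D').mean(numeric_only=True)` become possible.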


Once you have an example URL, you can often figure out how to automatically get data for new dates or locations. For example, how would you change the query to download data from January 1, 2015, to the present?
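
One possible answer is sketched below: build the query string from its parts, so that the site and dates become function arguments. The function name `build_usgs_url` and the constant `BASE_URL` are hypothetical helpers, not part of any USGS package. The parameter names are copied from the example URL above. I have not checked whether the server will return such a long record in a single request, or whether the number of header lines stays the same.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = 'https://nwis.waterdata.usgs.gov/nwis/uv'

def build_usgs_url(site_no, begin_date, end_date):
    """Return a query URL in the same form as the Bolin Creek example."""
    params = {
        'cb_00060': 'on',    # discharge
        'cb_00065': 'on',    # gage height
        'format': 'rdb',     # tab-separated output
        'site_no': site_no,  # keep as a string to preserve the leading zero
        'period': '',
        'begin_date': begin_date,
        'end_date': end_date,
    }
    return BASE_URL + '?' + urlencode(params)

# January 1, 2015, to today
url = build_usgs_url('0209734440', '2015-01-01', date.today().isoformat())
print(url)
```

The resulting `url` can be passed to `pd.read_csv()` in place of `data_address`.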

```python
## To save the data for later, use pandas to_csv()
df.to_csv('bolin_creek.csv', sep=',', index=False)
```
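
One thing to watch for when reading a saved file back in: by default pandas guesses each column's type, so an identifier such as `site_no` can lose its leading zero. The sketch below shows this with a tiny hand-typed stand-in for the Bolin Creek table. The names `df_small`, `guessed`, `explicit`, and `discharge_cfs`, and the temporary file name, are illustrative only.

```python
import os
import tempfile
import pandas as pd

# Tiny stand-in for the Bolin Creek DataFrame (values typed by hand)
df_small = pd.DataFrame({
    'site_no': ['0209734440', '0209734440'],
    'datetime': ['2019-10-02 00:00', '2019-10-02 00:15'],
    'discharge_cfs': [0.00, 0.00],
})

# Save to a temporary folder so the example does not overwrite real files
path = os.path.join(tempfile.mkdtemp(), 'bolin_creek_example.csv')
df_small.to_csv(path, index=False)

# Default read: pandas treats site_no as an integer and drops the leading zero
guessed = pd.read_csv(path)

# Explicit read: keep site_no as text and parse the timestamps
explicit = pd.read_csv(path, dtype={'site_no': str}, parse_dates=['datetime'])

print(guessed['site_no'].iloc[0])
print(explicit['site_no'].iloc[0])
```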

### FAIR research
<img src="stall_fair.PNG" style="width: 400px;" />(Image cred: Stall, 2018)

**FAIR** data is:
- **F**indable: The datasets and resources should be easily located by humans and computers
- **A**ccessible: After the dataset is found, the user needs to be able to easily access the datasets
- **I**nteroperable: The datasets need to be in a format that others (and their software) can use
- **R**eusable: The datasets need to be usable by a variety of people, and therefore must have clear metadata
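
As a small illustration of "clear metadata", the sketch below writes a JSON file that could sit next to `bolin_creek.csv`. This is just one simple approach, not a formal metadata standard. The field names and the file name `bolin_creek_metadata.json` are my own choices. The units come from the USGS parameter codes (00065 is gage height in feet, 00060 is discharge in cubic feet per second).

```python
import json
from datetime import date

# Describe where the data came from and what each column means
metadata = {
    'title': 'Bolin Creek gage height and discharge',
    'source': 'USGS National Water Information System',
    'source_url': 'https://waterdata.usgs.gov/nwis/uv?site_no=0209734440',
    'site_no': '0209734440',
    'date_retrieved': date.today().isoformat(),
    'data_file': 'bolin_creek.csv',
    'variables': {
        '89527_00065': {'description': 'gage height', 'units': 'feet'},
        '89528_00060': {'description': 'discharge', 'units': 'cubic feet per second'},
    },
    'notes': 'Columns ending in _cd hold USGS qualification codes '
             '(for example, A for approved and P for provisional).',
}

# Save the description alongside the data file
with open('bolin_creek_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)
```

Repositories such as those listed later in this lesson typically ask for similar information when you deposit a dataset.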

<img src="wilkinson2016_box2.PNG" style="width: 800px;" />(Image cred: Wilkinson et al., 2016)

<img src="huber_fair.PNG" style="width: 600px;" />(Image cred: Matthew Huber et al., [*MyGeoHub*](https://mygeohub.org/cybertraining))

<img src="stall_dataChallenges.PNG" style="width: 800px;" />(Image cred: Stall, 2018)

<img src="rosenberg2020_fig1.PNG" style="width: 600px;" />(Image cred: Rosenberg et al., 2020)

<img src="hutton2016_fig1.PNG" style="width: 800px;" />(Image cred: Hutton et al., 2016)

<img src="stall_dataEcosystem.PNG" style="width: 600px;" />(Image cred: Stall, 2018)

[Coalition for Publishing Data in the Earth and Space Sciences, Enabling FAIR Data Project](https://copdess.org/enabling-fair-data-project/)
- Scientific orgs: American Geophysical Union (AGU), European Geosciences Union (EGU), etc.
- Publishers: AGU, PNAS, Nature, Science, Elsevier, Wiley, etc.
- Repositories & Data Infrastructure

**More reading on FAIR/open data and science:**
- AGU FAIR data working group presentation ([Stall, 2018, Big Data Interagency Working Group](https://www.nitrd.gov/nitrdgroups/images/0/02/Enabling-FAIR-Data-ESES-ShelleyStall.pdf))
- FAIR guiding principles for scientific data management and stewardship ([Wilkinson et al., 2016, *Scientific Data*](https://www.nature.com/articles/sdata201618))
- MyGeoHub description of FAIR principles ([Merwade, Huber, Song, Huang, Zhao, *MyGeoHub*](https://mygeohub.org/cybertraining/fair))
- Most computational hydrology is not reproducible, so is it really science? ([Hutton et al., 2016, *Water Resources Research*](https://agupubs.onlinelibrary.wiley.com/doi/pdf/10.1002/2016WR019285))
- History, promises, and challenges of open science/open data for public health research ([Huston et al., 2019, *Canada Communicable Disease Report*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6781855/))
- Best practices for performing reproducible research, focused on complex computer modeling workflows ([Rosenberg et al., 2020, *Journal of Water Resources Planning and Management*](https://ascelibrary.org/doi/full/10.1061/%28ASCE%29WR.1943-5452.0001215))
- Great lecture on how scientists can improve reproducibility/reusability by learning from the open-source software community ([McElreath, 2020, *YouTube*](https://www.youtube.com/watch?v=zwRdO9_GGhY&t=0s&ab_channel=RichardMcElreath))

### Sharing your research products
Repositories for sharing research products (data and/or code)
1. [HydroShare](https://www.hydroshare.org/landingPage/)
1. [Nature Scientific Data list of repositories](https://www.nature.com/sdata/policies/repositories#climate)
1. GitHub + Zenodo
    - e.g., my [GitHub repository](https://github.com/ahamilton144/hamilton-2020-managing-financial-risk-tradeoffs-for-hydropower) for code associated with a research paper. See "Tags" for snapshot versions associated with each submission. Each snapshot is downloadable on Zenodo and has a permanent DOI.