# Exploratory Data Analysis Exercise with Pandas and Matplotlib

In this exercise, you will develop a data pipeline to ingest and analyze multi-state streamflow records from CSV files in our Canvas class. This exercise will directly assist with HW #1. Filepath for the data:

    files -> Data -> NWIS_Streamflow -> <STATE>

You will download the data from Canvas and place it in a folder you create called "streamflow_data". Once the data are in the repo, you will load them into this Python notebook and perform exploratory data analysis. After cleaning the data and aligning the time series with Pandas, you will move on to developing Matplotlib visualizations. The core of the assignment emphasizes the Matplotlib philosophy, challenging you to use its plotting tools to link, overlay, and explore discharge trends across Idaho, Utah, and Wyoming.

The [USGS NWIS Mapper](https://apps.usgs.gov/nwismapper/) provides an interactive map for locating sites and their respective metadata.

## Task 1: Select, download, and bring the data into your notebook session

Use the [USGS NWIS Mapper](https://apps.usgs.gov/nwismapper/) to locate one site below a reservoir, one site in a headwater catchment, and one site near a river's terminus at the Great Salt Lake. Using these site IDs, find the site data in the Canvas NWIS_Streamflow data folder, download it to your computer, then upload it to this repo into a folder named "streamflow_data". In the code block below, load the data into a Pandas DataFrame and inspect it as we previously did in the Pandas exercises (.head(), .describe()). Write down what you notice. Remove any outliers, NaN values, and -999 placeholder values.

In [22]:
# load data sets from streamflow_data folder
import pandas as pd
streamflow_headwater = pd.read_csv("streamflow_data/10023000_1980_2020.csv")
streamflow_headwater.head(20)
streamflow_headwater.describe()
# remove NaN values from the data set
streamflow_headwater = streamflow_headwater.dropna()
streamflow_headwater.head()





| | Datetime | USGS_flow | variable | USGS_ID | measurement_unit | qualifiers | series |
|---|---|---|---|---|---|---|---|
| 0 | 1986-12-24 | 30.0 | streamflow | 10023000 | ft3/s | ['A'] | 0 |
| 1 | 1999-10-01 | 27.941177 | streamflow | 10023000 | ft3/s | ['A', '[91]'] | 0 |
| 2 | 1999-10-02 | 27.583334 | streamflow | 10023000 | ft3/s | ['A', '[91]'] | 0 |
| 3 | 1999-10-03 | 27.208334 | streamflow | 10023000 | ft3/s | ['A', '[91]'] | 0 |
| 4 | 1999-10-04 | 27.041666 | streamflow | 10023000 | ft3/s | ['A', '[91]'] | 0 |


In [17]:
# load data set for reservoir
streamflow_reservoir = pd.read_csv("streamflow_data/10140100_1980_2020.csv")
# remove NaN values from the data set
streamflow_reservoir = streamflow_reservoir.dropna()
streamflow_reservoir.describe()

| | USGS_flow | USGS_ID | series |
|---|---|---|---|
| count | 10349.0 | 10349.0 | 10349.0 |
| mean | 107.399452 | 10140100.0 | 0.0 |
| std | 215.219657 | 0.0 | 0.0 |
| min | 3.397188 | 10140100.0 | 0.0 |
| 25% | 10.654166 | 10140100.0 | 0.0 |
| 50% | 19.883333 | 10140100.0 | 0.0 |
| 75% | 103.75 | 10140100.0 | 0.0 |
| max | 1550.8334 | 10140100.0 | 0.0 |


In [18]:
# load data set for gsl
streamflow_gsl = pd.read_csv("streamflow_data/10141000_1980_2020.csv")
# remove NaN values from the data set
streamflow_gsl = streamflow_gsl.dropna()
streamflow_gsl.describe()

| | USGS_flow | USGS_ID | series |
|---|---|---|---|
| count | 11489.0 | 11489.0 | 11489.0 |
| mean | 328.125434 | 10141000.0 | 0.0 |
| std | 565.55416 | 0.0 | 0.0 |
| min | 3.957083 | 10141000.0 | 0.0 |
| 25% | 73.91458 | 10141000.0 | 0.0 |
| 50% | 119.46875 | 10141000.0 | 0.0 |
| 75% | 277.35416 | 10141000.0 | 0.0 |
| max | 5024.5835 | 10141000.0 | 0.0 |
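
Task 1 also asks for -999 placeholders and outliers to be removed, which `dropna()` alone does not do. The following is a minimal sketch of one way to handle all three, assuming the `USGS_flow` column shown above. The helper name `clean_flow`, the synthetic sample values, and the choice of an interquartile-range (IQR) fence with factor `k` as the outlier rule are illustrative assumptions, not part of the assignment.

```python
import numpy as np
import pandas as pd


def clean_flow(df, flow_col="USGS_flow", sentinel=-999, k=3.0):
    """Drop sentinel values, NaNs, and values outside a k*IQR fence."""
    out = df.copy()
    # Treat the sentinel as missing so one dropna() handles both cases
    out[flow_col] = out[flow_col].replace(sentinel, np.nan)
    out = out.dropna(subset=[flow_col])
    # IQR fence: an assumed outlier rule; real flood peaks may exceed it
    q1, q3 = out[flow_col].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = out[flow_col].between(q1 - k * iqr, q3 + k * iqr)
    return out[keep]


# Synthetic sample (not real gauge data) to show the behavior
sample = pd.DataFrame({"USGS_flow": [27.9, 27.6, -999, np.nan, 27.2, 27.0, 5000.0]})
cleaned = clean_flow(sample)
print(cleaned)
```

Streamflow is strongly right-skewed, so a statistical fence like this can discard genuine high-flow events. Whether a large value is an error or a flood is a judgment call worth noting in your write-up.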


## Task 2: Slicing and Dicing

We are interested in examining the data from 2000-2010. Slice the data accordingly and save it to a new Pandas DataFrame.

In [1]:
# set index to Datetime for all data sets
# (run the Task 1 cells first so the three DataFrames exist)
import pandas as pd  # re-imported so this cell also works after a kernel restart

streamflow_headwater["Datetime"] = pd.to_datetime(streamflow_headwater["Datetime"])
streamflow_headwater = streamflow_headwater.set_index("Datetime")
streamflow_reservoir["Datetime"] = pd.to_datetime(streamflow_reservoir["Datetime"])
streamflow_reservoir = streamflow_reservoir.set_index("Datetime")
streamflow_gsl["Datetime"] = pd.to_datetime(streamflow_gsl["Datetime"])
streamflow_gsl = streamflow_gsl.set_index("Datetime")
streamflow_headwater.head()

In [33]:
# slice each data set to 2000-2010 (inclusive) with a single boolean mask
streamflow_headwater_2000_2010 = streamflow_headwater[
    (streamflow_headwater.index >= "2000-01-01") & (streamflow_headwater.index <= "2010-12-31")
]
streamflow_reservoir_2000_2010 = streamflow_reservoir[
    (streamflow_reservoir.index >= "2000-01-01") & (streamflow_reservoir.index <= "2010-12-31")
]
streamflow_gsl_2000_2010 = streamflow_gsl[
    (streamflow_gsl.index >= "2000-01-01") & (streamflow_gsl.index <= "2010-12-31")
]
# only the last expression in a cell is displayed
streamflow_gsl_2000_2010.tail()

| Datetime | USGS_flow | variable | USGS_ID | measurement_unit | qualifiers | series |
|---|---|---|---|---|---|---|
| 2010-12-27 | 376.58334 | streamflow | 10141000 | ft3/s | ['A'] | 0 |
| 2010-12-28 | 359.28125 | streamflow | 10141000 | ft3/s | ['A'] | 0 |
| 2010-12-29 | 399.59375 | streamflow | 10141000 | ft3/s | ['A'] | 0 |
| 2010-12-30 | 613.36456 | streamflow | 10141000 | ft3/s | ['A'] | 0 |
| 2010-12-31 | 501.79166 | streamflow | 10141000 | ft3/s | ['A'] | 0 |
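
With a `DatetimeIndex` in place, the same date window can also be taken with label-based slicing through `.loc`. The sketch below uses a synthetic daily series (the values and the variable name `flow` are placeholders, not gauge data) to show that both endpoints are included.

```python
import numpy as np
import pandas as pd

# Synthetic daily series spanning 1998-2012 (not real gauge data)
dates = pd.date_range("1998-01-01", "2012-12-31", freq="D")
flow = pd.DataFrame({"USGS_flow": np.linspace(10.0, 50.0, len(dates))}, index=dates)
flow.index.name = "Datetime"

# Sorting first keeps label slicing well defined; both endpoints are included
flow = flow.sort_index()
flow_2000_2010 = flow.loc["2000-01-01":"2010-12-31"]

print(flow_2000_2010.index.min(), flow_2000_2010.index.max())
```

Unlike positional slicing, a `.loc` date slice keeps the end date, so 2010-12-31 is part of the result.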


## Task 3: Create plots for each DataFrame using the df.plot() function

Use the built in functionality of Pandas to plot the time series of each stream.

In [34]:
# using pandas, create a plot for each of the data sets for the years 2000-2010
# the discharge column in these files is named "USGS_flow"
streamflow_headwater_2000_2010.plot(y="USGS_flow", title="Headwater Streamflow 2000-2010")
streamflow_reservoir_2000_2010.plot(y="USGS_flow", title="Reservoir Streamflow 2000-2010")
streamflow_gsl_2000_2010.plot(y="USGS_flow", title="GSL Streamflow 2000-2010")
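
`DataFrame.plot()` needs the name of a column that exists in the frame, which here is `USGS_flow`. It also returns a Matplotlib `Axes`, so axis labels can be added after the call. A small sketch on a synthetic seasonal signal (the data and the variable name `flow` are placeholders, not gauge records):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2000-01-01", "2010-12-31", freq="D")
# Synthetic seasonal signal standing in for discharge (not real gauge data)
season = np.sin(2 * np.pi * dates.dayofyear.to_numpy() / 365.25)
flow = pd.DataFrame({"USGS_flow": 100 + 80 * season}, index=dates)

# DataFrame.plot returns a Matplotlib Axes, so labels can be set afterwards
ax = flow.plot(y="USGS_flow", title="Headwater Streamflow 2000-2010", legend=False)
ax.set_xlabel("Date")
ax.set_ylabel("Discharge (ft3/s)")
```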

## Task 4: Join/Merge Pandas DataFrames

Create a single DataFrame named All_Streams that combines all of the streamflow monitoring data. Hint: set your index to the date. Create custom labels for each monitoring station to communicate its location within the watershed (e.g., headwater, below reservoir, GSL Terminus). Print All_Streams.head() to demonstrate that it is complete.
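
One possible approach, sketched on small synthetic frames rather than the real files, is to rename each site's flow column to its watershed label and let `pd.concat` align the series on the shared date index. The variable names `headwater`, `reservoir`, and `gsl`, and the choice of an inner join, are assumptions made for illustration.

```python
import pandas as pd

dates = pd.date_range("2000-01-01", periods=5, freq="D", name="Datetime")

# Synthetic stand-ins for the three cleaned, date-indexed DataFrames
headwater = pd.DataFrame({"USGS_flow": [27.9, 27.6, 27.2, 27.0, 26.8]}, index=dates)
reservoir = pd.DataFrame({"USGS_flow": [10.6, 10.9, 11.2, 11.0]}, index=dates[:4])
gsl = pd.DataFrame({"USGS_flow": [119.5, 120.1, 118.7, 121.3, 122.0]}, index=dates)

# Rename each flow column to a descriptive label, then align on the date index
All_Streams = pd.concat(
    [
        headwater["USGS_flow"].rename("Headwater"),
        reservoir["USGS_flow"].rename("Below Reservoir"),
        gsl["USGS_flow"].rename("GSL Terminus"),
    ],
    axis=1,
    join="inner",  # keep only dates present at all three sites
)

print(All_Streams.head())
```

Using `join="outer"` instead would keep every date and fill gaps with NaN, which may be preferable when the records cover different periods.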

## Task 5: Demonstrate your Prowess with Matplotlib

Create four separate figures, each showing all three streams:

* Figure 1 should be a single plot with all three stream monitoring locations
* Figure 2 should be a single figure with subplots for each stream monitoring location. The subplots should be 2 rows and 2 columns
* Figure 3 should be a single figure with subplots for each stream monitoring location. The subplots should be 3 rows and 1 column
* Figure 4 should be a single figure with subplots for each stream monitoring location. The subplots should be 1 row and 3 columns

Make sure your plots have the correct axes, labeled axes, a title, and a legend. Create custom labels for each monitoring station to communicate its location within the watershed (e.g., headwater, below reservoir, GSL Terminus).
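
The four layouts differ only in the grid passed to `plt.subplots`, so one helper can cover the three subplot figures. The sketch below is one way to organize this, not the required solution. It runs on a synthetic stand-in for All_Streams, and the helper name `plot_grid`, the figure sizes, and the titles are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for All_Streams (not real gauge data)
dates = pd.date_range("2000-01-01", "2010-12-31", freq="D", name="Datetime")
season = np.sin(2 * np.pi * dates.dayofyear.to_numpy() / 365.25)
All_Streams = pd.DataFrame(
    {
        "Headwater": 30 + 20 * season,
        "Below Reservoir": 100 + 60 * season,
        "GSL Terminus": 300 + 200 * season,
    },
    index=dates,
)


def plot_grid(df, nrows, ncols, title):
    """Draw one panel per column of df on an nrows x ncols grid."""
    fig, axes = plt.subplots(nrows, ncols, figsize=(12, 8), squeeze=False)
    flat = axes.ravel()
    for ax, name in zip(flat, df.columns):
        ax.plot(df.index, df[name], label=name)
        ax.set_xlabel("Date")
        ax.set_ylabel("Discharge (ft3/s)")
        ax.legend()
    # Hide panels left over when the grid has more cells than streams (2 x 2)
    for ax in flat[len(df.columns):]:
        ax.set_visible(False)
    fig.suptitle(title)
    fig.tight_layout()
    return fig


# Figure 1: all three sites overlaid on a single Axes
fig1, ax1 = plt.subplots(figsize=(10, 5))
for name in All_Streams.columns:
    ax1.plot(All_Streams.index, All_Streams[name], label=name)
ax1.set(xlabel="Date", ylabel="Discharge (ft3/s)", title="Streamflow 2000-2010")
ax1.legend()

# Figures 2-4: the same data on 2 x 2, 3 x 1, and 1 x 3 grids
fig2 = plot_grid(All_Streams, 2, 2, "Streamflow by site (2 x 2)")
fig3 = plot_grid(All_Streams, 3, 1, "Streamflow by site (3 x 1)")
fig4 = plot_grid(All_Streams, 1, 3, "Streamflow by site (1 x 3)")
```

Passing `squeeze=False` makes `plt.subplots` always return a 2-D array of Axes, so the same loop works for every grid shape. With three streams on a 2 x 2 grid, the fourth panel is simply hidden.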