In [1]:
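# urllib's submodules are not guaranteed to load with a bare
# `import urllib`, so import the two we use explicitly
import urllib.parse
import urllib.request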
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# A quick detour on 'f-strings'
Python has a really nice way to insert the values of
variables into a string. This task is referred to in
programming as string formatting. Older approaches exist
in Python (`%`-formatting and `str.format`), but f-strings,
added in Python 3.6, are now generally seen as the best
way to do this. For an overview see:
https://realpython.com/python-f-strings/


In [2]:
random_numbers = np.random.random(5)
print('The average of my random numbers is', np.mean(random_numbers))
print(f'The average of my random numbers is {np.mean(random_numbers):0.2f}')


The average of my random numbers is 0.34930435553825073
The average of my random numbers is 0.35
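
The part after the colon inside the braces (`0.2f` above) is a format specification. As a small sketch, here are a few other specifications that tend to be useful for numeric output. The variable names and values are made up for illustration.

```python
# Illustrative values only; these are not taken from any USGS record.
flow_cfs = 227000.0
fraction = 0.34930435553825073
site_name = 'Niagara River'

two_decimals = f'{fraction:0.2f}'   # fixed-point with 2 decimal places
with_commas = f'{flow_cfs:,.0f}'    # thousands separators, no decimals
as_percent = f'{fraction:.1%}'      # multiply by 100 and append '%'
padded = f'{site_name:>20}'         # right-align in a 20-character field

print(two_decimals)
print(with_commas)
print(as_percent)
print(repr(padded))
```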


# Access streamflow data for Niagara River!
With that out of the way, let's work towards grabbing data on the fly from the USGS website. First we define the site ID for the Niagara River, along with start and end dates for the data we want. Defining these clearly up front makes it much easier for someone else to understand what you are trying to do.

## 1. Define site-specific information

In [5]:
args = {
    'site_no': '04216000',
    'begin_date': '2022-09-01',
    'end_date': '2023-09-01'
}

In [6]:
query = urllib.parse.urlencode(args)

In [7]:
query

'site_no=04216000&begin_date=2022-09-01&end_date=2023-09-01'

## 2. Create the url and access the data using `urllib`

Now we can use an f-string to insert the query into the URL, which points to the same website that we saw in the lecture portion.
You can verify this by copying the URL into your web browser.

In [8]:
verde_url = (
    f'https://waterdata.usgs.gov/nwis/dv?'
    f'cb_00060=on&format=rdb&referred_module=sw&{query}'
)
print(verde_url)

https://waterdata.usgs.gov/nwis/dv?cb_00060=on&format=rdb&referred_module=sw&site_no=04216000&begin_date=2022-09-01&end_date=2023-09-01
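
If you want to fetch several sites or date ranges, the steps above can be wrapped in a small function. This is only a sketch: `build_nwis_dv_url` is a hypothetical helper name, and it assumes the fixed parameters (`cb_00060`, `format`, `referred_module`) stay the same as in the URL above.

```python
import urllib.parse


def build_nwis_dv_url(site_no, begin_date, end_date):
    """Build a USGS daily-values URL (hypothetical helper, for illustration)."""
    params = {
        # Fixed parameters copied from the URL used in this notebook
        'cb_00060': 'on',
        'format': 'rdb',
        'referred_module': 'sw',
        # Site-specific parameters
        'site_no': site_no,
        'begin_date': begin_date,
        'end_date': end_date,
    }
    # urlencode joins the pairs with '&' and escapes special characters
    return 'https://waterdata.usgs.gov/nwis/dv?' + urllib.parse.urlencode(params)


url = build_nwis_dv_url('04216000', '2022-09-01', '2023-09-01')
print(url)

# Round trip: split the query string back into a dictionary of lists
parsed = urllib.parse.parse_qs(urllib.parse.urlsplit(url).query)
print(parsed['site_no'])
```

Putting every parameter through `urlencode` keeps the escaping in one place, and `parse_qs` is a quick way to check that the URL contains what you expect.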


## 3. Read the data using `pandas`

In [32]:
# Next we need to download the data and get it into pandas.
# To download the data we'll use the `urllib` module, which is
# part of the Python "standard library" that you get for free
# when you install Python. The `urllib.request.urlopen` function
# simply opens a connection to the URL, just like going to the
# URL in your web browser. We can then pass the `response` to
# `pd.read_table`. We pass quite a few parameters to this function,
# which is very common when you pull data directly from the
# internet because formats vary.

response = urllib.request.urlopen(verde_url)

# Let's walk through the parameters:
#  - comment='#': Lines beginning with a '#' are comments that pandas should ignore
#  - sep=r'\s+': Columns are separated by whitespace (the raw string avoids an invalid-escape warning)
#  - names: The names of the columns. I set these because the USGS ones are cryptic
#  - index_col=2: Set the 3rd column as the index (that is, "date")
#  - parse_dates=True: Try to parse the index as dates. It didn't work here, likely because
#    the first two rows are not dates, but it's a good idea
#  - date_format='%Y-%m-%d': The strftime-style format to use when parsing the dates
#  - engine='python': The Python engine is currently more feature-complete

df = pd.read_table(
    response,
    comment='#',
    sep=r'\s+',
    names=['agency', 'site', 'date', 'streamflow', 'quality_flag'],
    index_col=2,
    parse_dates=True,
    date_format='%Y-%m-%d',
    engine='python'
)

In [33]:
df

date,agency,site,streamflow,quality_flag
datetime,agency_cd,site_no,217733_00060_00003,217733_00060_00003_cd
20d,5s,15s,14n,10s
2022-09-01,USGS,04216000,227000,A
2022-09-02,USGS,04216000,222000,A
2022-09-03,USGS,04216000,223000,A
...,...,...,...,...
2023-08-28,USGS,04216000,233000,A
2023-08-29,USGS,04216000,235000,A
2023-08-30,USGS,04216000,239000,A
2023-08-31,USGS,04216000,230000,A
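
The first two rows in the output above (`datetime,...` and `20d,...`) are the USGS column codes and format specifiers, not data, and they are why the columns come back as strings. One alternative is to drop those rows before pandas ever sees them. The sketch below uses a short hand-written sample that imitates the layout of the downloaded file (assumed here to be tab-separated) so that it runs without a network connection. `sample_rdb` and `sample_df` are illustrative names.

```python
import io

import pandas as pd

# Hand-written imitation of the downloaded text; not a real download.
sample_rdb = (
    "# Comment lines at the top of the file start with '#'\n"
    "agency_cd\tsite_no\tdatetime\t217733_00060_00003\t217733_00060_00003_cd\n"
    "5s\t15s\t20d\t14n\t10s\n"
    "USGS\t04216000\t2022-09-01\t227000\tA\n"
    "USGS\t04216000\t2022-09-02\t222000\tA\n"
    "USGS\t04216000\t2022-09-03\t223000\tA\n"
)

# Drop comment lines, then the column-code row and the format row
lines = [ln for ln in sample_rdb.splitlines() if not ln.startswith('#')]
data_lines = lines[2:]

sample_df = pd.read_table(
    io.StringIO('\n'.join(data_lines)),
    sep='\t',
    names=['agency', 'site', 'date', 'streamflow', 'quality_flag'],
    dtype={'site': str, 'streamflow': float},  # keep the leading zero in the site id
    index_col='date',
    parse_dates=True,
)

print(sample_df)
print(type(sample_df.index))
```

With only data rows left, `parse_dates=True` can convert the index and `streamflow` can be read as a float directly, so no cleanup is needed afterwards.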


In [34]:
# Discard the first two rows, which hold the USGS column codes
# and format specifiers rather than data. Taking an explicit copy
# avoids a SettingWithCopyWarning when we assign to a column below.
df = df.iloc[2:].copy()

In [35]:
# Now convert the streamflow data to floats and
# the index to datetimes. When processing raw data
# it's common to have to do some extra postprocessing.
df['streamflow'] = df['streamflow'].astype(np.float64)
df.index = pd.DatetimeIndex(df.index)
df.head()


date,agency,site,streamflow,quality_flag
2022-09-01,USGS,04216000,227000.0,A
2022-09-02,USGS,04216000,222000.0,A
2022-09-03,USGS,04216000,223000.0,A
2022-09-04,USGS,04216000,219000.0,A
2022-09-05,USGS,04216000,213000.0,A
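
Converting with `astype(np.float64)` raises an error if any value cannot be read as a number. Raw records sometimes contain blanks or text in the value column, so a more forgiving option is `pd.to_numeric` with `errors='coerce'`, which turns anything unparseable into `NaN`. The sketch below uses invented values to show the behaviour.

```python
import pandas as pd

# Invented values for illustration: two valid numbers, one text entry, one blank
raw = pd.Series(['227000', '222000', 'Ice', ''], name='streamflow')

# Unparseable entries become NaN instead of raising an error
clean = pd.to_numeric(raw, errors='coerce')
n_missing = int(clean.isna().sum())

print(clean)
print(f'{n_missing} values could not be converted')
```

Counting the `NaN` values afterwards is a cheap check on how much of the record was lost in the conversion.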
