# Nenana Ice Classic Data Gathering - River Flow
This notebook gathers the river flow data used in this project.
## Data Source
* USGS Water Services website (https://waterservices.usgs.gov/).

In [1]:
# imports
import numpy as np
import pandas as pd
import requests
import json
import datetime as dt
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import gc

## Getting info from USGS Water Services website

USGS water resources data; Tanana River at Nenana; discharge, cubic ft/sec
https://waterdata.usgs.gov/nwis/dvstat/?referred_module=sw&site_no=15515500&por_15515500_1106=624121,00060,1106,1962-05-01,2019-10-12&start_dt=1989-01-01&end_dt=1989-12-31&stat_cds=p50_va&format=rdb&date_format=YYYY-MM-DD&rdb_compression=value&submitted_form=parameter_selection_list

https://waterservices.usgs.gov/nwis/dv/?site=15515500&startDT=1989-03-01&endDT=1989-03-31&format=json&parameterCd=00060

In [2]:
# date formats: YYYY-MM-DD; this cell is example format for the url
# url_string = f'http://waterservices.usgs.gov/nwis/dv/?site=15515500&startDT={start_date}&endDT={end_date}&format=json&parameterCd=00060'
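
The same URL can also be assembled from a dict of query parameters rather than an f-string. This is a minimal sketch using only the standard library; `build_dv_url` and `BASE_URL` are names made up for illustration, and the defaults are the site and parameter codes from the example URLs above.

```python
from urllib.parse import urlencode

# Base endpoint of the USGS daily-values service, taken from the URLs above.
BASE_URL = 'https://waterservices.usgs.gov/nwis/dv/'


def build_dv_url(start_date, end_date, site='15515500', parameter_cd='00060'):
    """Return a daily-values query URL for dates given as YYYY-MM-DD strings."""
    params = {
        'site': site,
        'startDT': start_date,
        'endDT': end_date,
        'format': 'json',
        'parameterCd': parameter_cd,
    }
    # urlencode joins the pairs with '&' and escapes any unsafe characters.
    return f'{BASE_URL}?{urlencode(params)}'


print(build_dv_url('1989-03-01', '1989-03-31'))
```

Keeping the parameters in a dict makes it easier to change the site or parameter code later without editing the URL template.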

### DataFrame Creation
Create a DataFrame to hold the results of all the web queries.

In [3]:
df = pd.DataFrame(data = None, columns = ['value', 'qualifiers', 'dateTime'])

### Initialize Variables

Create a list of years data will be retrieved for.

In [2]:
retrieval_years = [str(x) for x in range(1989, 2020)]
retrieval_years

['1989',
 '1990',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015',
 '2016',
 '2017',
 '2018',
 '2019']

Create a list of tuples for the monthly start and end dates. I chose to ignore leap years because one missing day every four years should not significantly affect any trends in this data.

In [3]:
retrieval_dates = [('01-01', '01-31'), ('02-01', '02-28'), ('03-01', '03-31'), ('04-01', '04-30'), ('05-01', '05-31')]
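
If Feb 29 were wanted after all, the month-end dates could be computed per year instead of hard-coded. The following is only a sketch of that idea, not what this notebook does; `month_ranges` is a hypothetical helper, and it relies on `calendar.monthrange` from the standard library.

```python
import calendar


def month_ranges(year, months=range(1, 6)):
    """Return (start, end) YYYY-MM-DD strings for each month, leap years included."""
    ranges = []
    for month in months:
        # monthrange returns (weekday of the 1st, number of days in the month)
        last_day = calendar.monthrange(year, month)[1]
        ranges.append((f'{year}-{month:02d}-01', f'{year}-{month:02d}-{last_day:02d}'))
    return ranges


print(month_ranges(1992)[1])  # February of a leap year
```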

Create a list of URLs to be scraped.

In [6]:
url_list = []
for year in retrieval_years:
    # each tuple holds the MM-DD start and end of one month
    for start, end in retrieval_dates:
        start_date = f'{year}-{start}'
        end_date = f'{year}-{end}'
        new_url = f'http://waterservices.usgs.gov/nwis/dv/?site=15515500&startDT={start_date}&endDT={end_date}&format=json&parameterCd=00060'
        url_list.append(new_url)

Set up Selenium options.

In [4]:
chrome_path = '/Users/davidwalkup/Downloads/chromedriver-2'
options = Options()

Perform test call and examine returned data.

In [18]:
target_url = url_list[0]
driver = webdriver.Chrome(chrome_path, 
                          options=options)
driver.set_window_size(1400,1000)
driver.get(target_url)
page_source = driver.page_source
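# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page_source, 'html.parser')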
print(soup.prettify())

<html>
 <head>
 </head>
 <body>
  <pre style="word-wrap: break-word; white-space: pre-wrap;">{"name":"ns1:timeSeriesResponseType","declaredType":"org.cuahsi.waterml.TimeSeriesResponseType","scope":"javax.xml.bind.JAXBElement$GlobalScope","value":{"queryInfo":{"queryURL":"http://waterservices.usgs.gov/nwis/dv/site=15515500&amp;startDT=1989-01-01&amp;endDT=1989-01-31&amp;format=json&amp;parameterCd=00060","criteria":{"locationParam":"[ALL:15515500]","variableParam":"[00060]","timeParam":{"beginDateTime":"1989-01-01T00:00:00.000","endDateTime":"1989-01-31T00:00:00.000"},"parameter":[]},"note":[{"value":"[ALL:15515500]","title":"filter:sites"},{"value":"[mode=RANGE, modifiedSince=null] interval={INTERVAL[1989-01-01T00:00:00.000-05:00/1989-01-31T00:00:00.000-05:00]}","title":"filter:timeRange"},{"value":"methodIds=[ALL]","title":"filter:methodId"},{"value":"2020-03-22T16:23:00.278Z","title":"requestDT"},{"value":"65b96740-6c59-11ea-95a2-6cae8b6642f6","title":"requestId"},{"value":"Provision

In [19]:
driver.quit()

Parse out where the data I need is located in the returned JSON.

In [20]:
data = json.loads(soup.body.text)['value']['timeSeries'][0]['values'][0]['value']
data

[{'value': '7600', 'qualifiers': ['A'], 'dateTime': '1989-01-01T00:00:00.000'},
 {'value': '7600', 'qualifiers': ['A'], 'dateTime': '1989-01-02T00:00:00.000'},
 {'value': '7600', 'qualifiers': ['A'], 'dateTime': '1989-01-03T00:00:00.000'},
 {'value': '7600', 'qualifiers': ['A'], 'dateTime': '1989-01-04T00:00:00.000'},
 {'value': '7600', 'qualifiers': ['A'], 'dateTime': '1989-01-05T00:00:00.000'},
 {'value': '7600', 'qualifiers': ['A'], 'dateTime': '1989-01-06T00:00:00.000'},
 {'value': '7600', 'qualifiers': ['A'], 'dateTime': '1989-01-07T00:00:00.000'},
 {'value': '7600', 'qualifiers': ['A'], 'dateTime': '1989-01-08T00:00:00.000'},
 {'value': '7600', 'qualifiers': ['A'], 'dateTime': '1989-01-09T00:00:00.000'},
 {'value': '7600', 'qualifiers': ['A'], 'dateTime': '1989-01-10T00:00:00.000'},
 {'value': '7600', 'qualifiers': ['A'], 'dateTime': '1989-01-11T00:00:00.000'},
 {'value': '7600', 'qualifiers': ['A'], 'dateTime': '1989-01-12T00:00:00.000'},
 {'value': '7600', 'qualifiers': ['A'], 

The data I need is a list of dicts. Make a temporary DataFrame from each dict and combine them into the main DataFrame. Once complete, convert the value and dateTime fields to the correct data types.

In [11]:
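# Build one small DataFrame per entry and concatenate them in a single step
# (DataFrame.append was removed in pandas 2.0).
df = pd.concat([pd.DataFrame.from_dict(entry) for entry in data], ignore_index = True)
# The API returns values and timestamps as strings; convert them to proper dtypes.
df['value'] = pd.to_numeric(df['value'], downcast = 'integer')
df['dateTime'] = pd.to_datetime(df['dateTime'], yearfirst = True)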
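
As a sketch of an alternative, assuming every entry has the three keys shown in the output above, the whole list can be passed to `pd.DataFrame` at once. The sample below is hand-written: the first entry is copied from the output above, and the second is invented to show an entry carrying two qualifiers. Joining the qualifiers into a single string is my assumption about how such entries could be kept on one row; it is not what the cell above does.

```python
import pandas as pd

# Hand-written sample in the same shape as the USGS entries shown above.
# The second entry is invented to illustrate a two-qualifier case.
sample = [
    {'value': '7600', 'qualifiers': ['A'], 'dateTime': '1989-01-01T00:00:00.000'},
    {'value': '7800', 'qualifiers': ['A', 'e'], 'dateTime': '1989-01-02T00:00:00.000'},
]

flow = pd.DataFrame(sample)                              # one row per entry
flow['value'] = pd.to_numeric(flow['value'])             # strings -> integers
flow['dateTime'] = pd.to_datetime(flow['dateTime'])      # strings -> timestamps
flow['qualifiers'] = flow['qualifiers'].str.join(',')    # ['A', 'e'] -> 'A,e'
print(flow)
```

Built this way, each date stays on a single row regardless of how many qualifiers it has.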

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   value       31 non-null     int16         
 1   qualifiers  31 non-null     object        
 2   dateTime    31 non-null     datetime64[ns]
dtypes: datetime64[ns](1), int16(1), object(1)
memory usage: 686.0+ bytes


In [13]:
df.head()

| | value | qualifiers | dateTime |
|---|---|---|---|
| 0 | 7600 | A | 1989-01-01 |
| 1 | 7600 | A | 1989-01-02 |
| 2 | 7600 | A | 1989-01-03 |
| 3 | 7600 | A | 1989-01-04 |
| 4 | 7600 | A | 1989-01-05 |


Now that I've worked out how to get the information from the webpage and into a DataFrame, perform the actual data retrieval for all years needed.

In [5]:
df = pd.DataFrame(data = None, columns = ['value', 'qualifiers', 'dateTime'])
target = f'http://waterservices.usgs.gov/nwis/dv/?site=15515500&startDT=1989-01-01&endDT=2019-05-31&format=json&parameterCd=00060'
driver = webdriver.Chrome(chrome_path, 
                          options=options)
driver.set_window_size(1400,1000)
driver.get(target)
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
data = json.loads(soup.body.text)['value']['timeSeries'][0]['values'][0]['value']
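# Build one small DataFrame per entry and concatenate them in a single step
# (DataFrame.append was removed in pandas 2.0).
df = pd.concat([pd.DataFrame.from_dict(entry) for entry in data], ignore_index = True)
driver.quit()
# The API returns values and timestamps as strings; convert them to proper dtypes.
df['value'] = pd.to_numeric(df['value'], downcast = 'integer')
df['dateTime'] = pd.to_datetime(df['dateTime'], yearfirst = True)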

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17276 entries, 0 to 17275
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   value       17276 non-null  int32         
 1   qualifiers  17276 non-null  object        
 2   dateTime    17276 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int32(1), object(1)
memory usage: 337.5+ KB


In [7]:
df.sample(10)

| | value | qualifiers | dateTime |
|---|---|---|---|
| 13737 | 7000 | A | 2013-02-21 |
| 10020 | 52400 | A | 2006-08-27 |
| 9769 | 7200 | A | 2006-03-03 |
| 2401 | 9400 | A | 1993-04-15 |
| 13489 | 17000 | A | 2012-10-20 |
| 11583 | 6800 | A | 2009-04-19 |
| 12705 | 7100 | A | 2011-04-14 |
| 15309 | 8400 | A | 2015-12-27 |
| 12506 | 7300 | e | 2011-01-04 |
| 6767 | 8200 | e | 2000-12-13 |


Originally, I looped through the URLs in the url_list to get river flow data per month, but that resulted in "connection refused" errors from the USGS website. In order to get the data from the USGS, I ended up doing one large query, starting with Jan 1, 1989 and ending at May 31, 2019.
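
For reference, one common way to make a loop of requests gentler on a server is to pause and retry after a failure. The sketch below is untested against the USGS service, and I cannot say whether it would have avoided the errors above. `fetch_json` and `fetch_with_retry` are hypothetical helpers that use only the standard library, and the retry count and delays are arbitrary.

```python
import json
import time
import urllib.request


def fetch_json(url, timeout=30):
    """Download a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))


def fetch_with_retry(url, fetch=fetch_json, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url), waiting longer after each failure (exponential backoff)."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError:
            # URLError and ConnectionRefusedError are both OSError subclasses.
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # 2 s, then 4 s, ...
```

Each URL in `url_list` could be passed to `fetch_with_retry` in turn, ideally with a short pause between calls. The single large query used here sidesteps the issue by making only one request.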

As a result, I needed to remove all the data for each year outside the window of Jan 1 - May 31.

In [8]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same view
pd.concat([df.head(), df.tail()])

| | value | qualifiers | dateTime |
|---|---|---|---|
| 0 | 7600 | A | 1989-01-01 |
| 1 | 7600 | A | 1989-01-02 |
| 2 | 7600 | A | 1989-01-03 |
| 3 | 7600 | A | 1989-01-04 |
| 4 | 7600 | A | 1989-01-05 |
| 17271 | 28200 | A | 2019-05-27 |
| 17272 | 30100 | A | 2019-05-28 |
| 17273 | 30100 | A | 2019-05-29 |
| 17274 | 29200 | A | 2019-05-30 |
| 17275 | 28400 | A | 2019-05-31 |


In [9]:
df['dateTime'].dt.month

0        1
1        1
2        1
3        1
4        1
        ..
17271    5
17272    5
17273    5
17274    5
17275    5
Name: dateTime, Length: 17276, dtype: int64

In [10]:
# index labels of every row dated June through December
drop_list = df.index[df['dateTime'].dt.month > 5].tolist()
drop_list[:5]

[151, 152, 153, 154, 155]

In [11]:
df.drop(index = drop_list, inplace = True)

In [12]:
df.sample(10)

| | value | qualifiers | dateTime |
|---|---|---|---|
| 7988 | 7800 | e | 2003-01-21 |
| 9660 | 7800 | e | 2006-01-07 |
| 14916 | 7400 | A | 2015-03-15 |
| 4545 | 6000 | e | 1997-01-23 |
| 6850 | 7700 | A | 2001-01-24 |
| 9171 | 6000 | A | 2005-02-19 |
| 7008 | 7800 | A | 2001-04-13 |
| 652 | 9300 | A | 1990-04-20 |
| 14437 | 28500 | e | 2014-04-26 |
| 5827 | 5600 | e | 1999-03-26 |


In [13]:
df.shape

(8476, 3)

In [14]:
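# Entries with more than one qualifier became one row per qualifier when the
# DataFrame was built, so keep a single row (the first) for each date.
df.drop_duplicates(subset = 'dateTime', inplace = True)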

In [15]:
df.shape

(4688, 3)
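
As a quick sanity check on the final row count, the number of calendar days from Jan 1 through May 31 across 1989 to 2019 can be computed independently. A minimal sketch using only the standard library (the name `expected_rows` is mine):

```python
from datetime import date

# Days from Jan 1 through May 31 of each year, leap days included:
# the gap between Jan 1 and Jun 1 is 151 days, or 152 in a leap year.
expected_rows = sum(
    (date(year, 6, 1) - date(year, 1, 1)).days
    for year in range(1989, 2020)
)
print(expected_rows)  # 4688
```

This matches the 4688 rows left after dropping duplicates, which suggests there is one row for every date in the window, including Feb 29 in leap years.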

### Save Data To File

In [16]:
df.to_csv('../data/river_flow_data.csv', index_label = 'dateTime')
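
Note that `index_label = 'dateTime'` writes the integer index under a header that matches the existing dateTime column, so the file ends up with two columns sharing that header. If that is not intended, one option is to leave the index out. The sketch below round-trips two rows copied from the head of the data above through an in-memory buffer that stands in for the file; it is an illustration under that assumption, not a change to the file this notebook wrote.

```python
import io

import pandas as pd

# Two rows copied from the head of the cleaned data shown earlier.
toy = pd.DataFrame({
    'value': [7600, 7600],
    'qualifiers': ['A', 'A'],
    'dateTime': pd.to_datetime(['1989-01-01', '1989-01-02']),
})

buffer = io.StringIO()               # stands in for the CSV file on disk
toy.to_csv(buffer, index = False)    # skip the integer index entirely
buffer.seek(0)

# parse_dates restores the datetime dtype that plain CSV text loses
restored = pd.read_csv(buffer, parse_dates = ['dateTime'])
print(restored.dtypes)
```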