# D5 - Example Data Sources

Feedback: https://forms.gle/Le3RAsMEcYqEyswEA

**Index**:
* API Keys
* Water Data
  * CIMIS
  * NWIS Waterdata via web
  * NWIS data via module


## API Keys
When you get an API key for a website, it's tied to your account, so it's not something you want to share.  That means it's **not** a good idea to keep API keys in your scripts if you're going to share them, post them on GitHub, etc.  

A good strategy is to put the key in an environment variable and read it from there inside your script.  In the examples below you'll see checks against `os.environ`; these look for an environment variable that holds your key.

In Linux, you can set an environment variable by adding a line like the following to your ~/.bashrc file and then logging back in (or running `source ~/.bashrc`):

    export CIMIS_API_KEY="your key here"

And in Windows, you can run the following in a cmd window (the variable is only visible in cmd windows opened afterwards):

    setx CIMIS_API_KEY "your key here"

There's more info on this stuff here: https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety

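As a minimal sketch of the environment-variable approach, the helper below reads a key and stops with a clear message when the variable is missing, instead of silently falling back to a placeholder. The name `get_api_key` is made up for this example and is not part of any library; the key value is fake.

```python
import os

def get_api_key(var_name, env=None):
    """Return the API key stored in the environment variable var_name."""
    # env defaults to the real process environment; passing a dict makes testing easy
    env = os.environ if env is None else env
    key = env.get(var_name, '').strip()
    if not key:
        raise RuntimeError(f'Set the {var_name} environment variable before running this script')
    return key

# Example with a stand-in environment (the key value here is fake)
print(get_api_key('CIMIS_API_KEY', env={'CIMIS_API_KEY': 'abc123'}))
```

In a real script you would call `get_api_key('CIMIS_API_KEY')` with no `env` argument so the value comes from `os.environ`.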
## CIMIS Data
Create an account at https://cimis.water.ca.gov.  Once you log in, click the "Account" button in the top right corner. Scroll down and you'll see an API key that you'll need for queries.

We can query the CIMIS data from: https://et.water.ca.gov...  
* There are example queries at: https://et.water.ca.gov/Rest/Index

Here's some example code for querying it:

In [10]:
import json
import os
import requests
import pandas as pd

if 'CIMIS_API_KEY' in os.environ:
    api_key = os.environ['CIMIS_API_KEY']
else:
    api_key = 'your_api_key_here'
url_base = 'https://et.water.ca.gov/api'
options = [f'appKey={api_key}', 'startDate=2010-01-01', 'endDate=2010-01-05', 'targets=2,8,127']
options = '&'.join(options)
url = f'{url_base}/data?{options}'

print("Request url is: ", url)
response = requests.get(url)

Request url is:  https://et.water.ca.gov/api/data?appKey=your_api_key_here&startDate=2010-01-01&endDate=2010-01-05&targets=2,8,127


In [None]:
response.raise_for_status()  # stop here if the request failed (e.g. a bad API key)
response_data = json.loads(response.text)
# for record in response_data['Data']['Providers'][0]['Records']:
#     print(record)
df = pd.DataFrame(response_data['Data']['Providers'][0]['Records'])
# print(df.info())
# print(df.head())

# The values all have this dictionary format:
# DayAirTmpAvg: {'Value': '39', 'Qc': ' ', 'Unit': '(F)'}

# So we need to break them out into separate columns:
value_cols = [c for c in df.columns if c.startswith('Day')]
for c in value_cols:
    df[f'{c}_Units'] = df[c].apply(lambda x: x['Unit'])
    # You may also want to preserve the Qc value
    df[c] = df[c].apply(lambda x: x['Value'])

print(df.head())

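The cell above keeps the values as strings and drops the `Qc` flag. The sketch below is one way to keep the flag and convert the values to numbers. The two records are invented, following the `{'Value', 'Qc', 'Unit'}` shape shown in the comments above; the `Date` field, the `'M'` flag, and the function name `split_value_columns` are assumptions for illustration only.

```python
import pandas as pd

# Invented records in the {'Value', 'Qc', 'Unit'} shape described above
records = [
    {'Date': '2010-01-01', 'DayAirTmpAvg': {'Value': '39', 'Qc': ' ', 'Unit': '(F)'}},
    {'Date': '2010-01-02', 'DayAirTmpAvg': {'Value': None, 'Qc': 'M', 'Unit': '(F)'}},
]
df = pd.DataFrame(records)

def split_value_columns(df, prefix='Day'):
    """Split each dictionary column into numeric value, units, and Qc columns."""
    out = df.copy()
    # Build the list first so the new columns aren't picked up while looping
    value_cols = [c for c in out.columns if c.startswith(prefix)]
    for c in value_cols:
        out[f'{c}_Units'] = out[c].apply(lambda x: x['Unit'])
        out[f'{c}_Qc'] = out[c].apply(lambda x: x['Qc'].strip())
        # errors='coerce' turns missing or non-numeric values into NaN
        out[c] = pd.to_numeric(out[c].apply(lambda x: x['Value']), errors='coerce')
    return out

clean = split_value_columns(df)
print(clean)
```

With numeric columns you can go straight to plotting or aggregating, and the `_Qc` columns let you filter out flagged readings first.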
## USGS NWIS Waterdata via Web
https://waterdata.usgs.gov/

Example site data: https://waterdata.usgs.gov/nwis/uv?site_no=05056241&legacy=1
There's a newer version of the page at: https://waterdata.usgs.gov/monitoring-location/05056241/
The newer page has a download data button. Click it, select the primary time series, and click retrieve; this opens another page with this URL:
https://waterservices.usgs.gov/nwis/iv/?sites=05056241&startDT=2024-09-19T20:57:15.477-05:00&endDT=2024-09-26T20:57:15.477-05:00&parameterCd=00065&format=rdb

Change `format=rdb` to `format=json` and we get data that's easy to work with.  View it in the browser; there's usually a pretty-print check box at the top of the browser window for JSON data like this that makes it easier to read.  

The data returned has a lot of metadata plus the time series data we're interested in, which is at: 

    data['value']['timeSeries'][0]['values'][0]['value']

Look through the metadata, as some of it is useful: the time zone, query info, and the meaning of the pcodes in the output data:

        "variable": {
          "variableCode": [
            {
              "value": "00065",
              "network": "NWIS",
              "vocabulary": "NWIS:UnitValues",
              "variableID": 45807202,
              "default": true
            }
          ],
          "variableName": "Gage height, ft",
          "variableDescription": "Gage height, feet",
          "valueType": "Derived Value",
          "unit": {
            "unitCode": "ft"
          },
          "options": {
            "option": [
              {
                "name": "Statistic",
                "optionCode": "00000"
              }

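To show how the metadata and the readings sit side by side, here is a small sketch that walks a heavily trimmed stand-in for the response. The two readings are invented, and the per-reading keys (`value`, `qualifiers`, `dateTime`) are an assumption about the response format, so check them against what you see in the browser. `summarize_series` is a made-up helper name.

```python
import json

# Heavily trimmed stand-in for a real response; the two readings are invented
sample = json.loads('''
{"value": {"timeSeries": [{
    "variable": {
        "variableCode": [{"value": "00065"}],
        "variableName": "Gage height, ft",
        "unit": {"unitCode": "ft"}
    },
    "values": [{"value": [
        {"value": "12.31", "qualifiers": ["P"], "dateTime": "2024-09-19T00:00:00.000-05:00"},
        {"value": "12.30", "qualifiers": ["P"], "dateTime": "2024-09-19T00:15:00.000-05:00"}
    ]}]
}]}}
''')

def summarize_series(data):
    """Return (pcode, name, unit, readings) for every time series in the response."""
    out = []
    for series in data['value']['timeSeries']:
        var = series['variable']
        pcode = var['variableCode'][0]['value']
        # Each reading becomes a (timestamp string, float value) pair
        readings = [(r['dateTime'], float(r['value'])) for r in series['values'][0]['value']]
        out.append((pcode, var['variableName'], var['unit']['unitCode'], readings))
    return out

summary = summarize_series(sample)
for pcode, name, unit, readings in summary:
    print(pcode, name, unit, len(readings))
```

Looping over `timeSeries` rather than indexing `[0]` also handles queries that return more than one parameter or site.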


In [None]:
#https://waterservices.usgs.gov/nwis/iv/?sites=05056241&startDT=2024-09-19T20:57:15.477-05:00&endDT=2024-09-26T20:57:15.477-05:00&parameterCd=00065&format=rdb
# midnight to midnight
startDT = '2024-09-19T00:00:00-05:00'
endDT = '2024-09-26T00:00:00-05:00'
# The trailing -05:00 is the UTC offset (ISO 8601 format)
sites = '05056241' # presumably this could be a comma separated list
parameterCd = '00065' # gage height in feet (discharge in cubic feet per second is 00060)
format = 'json' # note: this name shadows Python's built-in format()
base_url = 'https://waterservices.usgs.gov/nwis/iv'
params = [f'sites={sites}', f'startDT={startDT}', f'endDT={endDT}', f'parameterCd={parameterCd}', f'format={format}']
params = '&'.join(params)
url = f'{base_url}/?{params}'
print(url)
response = requests.get(url)
data = json.loads(response.text)
# print(response.text)
df = pd.DataFrame(data['value']['timeSeries'][0]['values'][0]['value'])
print(df.head())
print(df.info())

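Joining the query string by hand works for the values above, but a `+` in a time zone offset (for example `+05:30`) has to be percent-encoded or it will be read as a space. As a sketch, the standard library's `urlencode` can build the same URL and handle the escaping. `build_iv_url` is a made-up helper name; the base URL and parameter names are the ones used above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://waterservices.usgs.gov/nwis/iv/'

def build_iv_url(sites, start_dt, end_dt, parameter_cd, fmt='json'):
    """Build an instantaneous-values query URL from its parts."""
    query = {
        'sites': sites,
        'startDT': start_dt,
        'endDT': end_dt,
        'parameterCd': parameter_cd,
        'format': fmt,
    }
    # safe=':,' leaves colons and commas readable; other unsafe characters are percent-encoded
    return f'{BASE_URL}?{urlencode(query, safe=":,")}'

url = build_iv_url('05056241', '2024-09-19T00:00:00-05:00',
                   '2024-09-26T00:00:00-05:00', '00065')
print(url)
```

Using `fmt` as the argument name also avoids shadowing Python's built-in `format`.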

Often, you'll be limited in the date range, number of sites, or number of parameters you can query at once.  Experiment to see what works and what gives an error; the JSON output should have an error code or explanation.  Once you know the limits, use a loop to step through the dates and collect all of the data needed:

In [None]:
from datetime import datetime, timedelta

chunk_size_days = 7
start_date = datetime(2024, 1, 1)
end_date = datetime(2024, 3, 31)
delta = timedelta(days=chunk_size_days)
working_date = start_date
dataframes = []
while working_date < end_date:
    # Clip the last chunk so it doesn't run past end_date
    end_working_date = min(working_date + delta, end_date)
    startDT = working_date.strftime('%Y-%m-%dT00:00:00-05:00')
    endDT = end_working_date.strftime('%Y-%m-%dT00:00:00-05:00')
    params = [f'sites={sites}', f'startDT={startDT}', f'endDT={endDT}', f'parameterCd={parameterCd}', f'format={format}']
    params = '&'.join(params)
    url = f'{base_url}/?{params}'
    print(url)
    response = requests.get(url)
    data = json.loads(response.text)
    df = pd.DataFrame(data['value']['timeSeries'][0]['values'][0]['value'])
    dataframes.append(df)
    # Start the next chunk where this one ended; adding an extra day here would skip a day of data
    working_date = end_working_date
# Consecutive chunks share their boundary timestamp, so drop any repeated rows
df = pd.concat(dataframes, ignore_index=True).drop_duplicates(subset='dateTime')

df.to_csv(f'site_{sites}.csv', index=False)
print(df.info())
print(df.head())

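One way to keep the chunk boundaries easy to check is to generate them separately from the requests. The generator below is a sketch with a made-up name (`date_chunks`). Each chunk ends exactly where the next one starts, so no time is skipped; rows that fall on a shared boundary may come back twice and should be de-duplicated after concatenating.

```python
from datetime import datetime, timedelta

def date_chunks(start, end, chunk_days):
    """Yield (chunk_start, chunk_end) pairs that cover start..end with no gaps."""
    step = timedelta(days=chunk_days)
    chunk_start = start
    while chunk_start < end:
        # Clip the final chunk so it never runs past the requested end
        chunk_end = min(chunk_start + step, end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end  # next chunk begins where this one ended

chunks = list(date_chunks(datetime(2024, 1, 1), datetime(2024, 3, 31), 7))
print(len(chunks), chunks[0], chunks[-1])
```

Printing the chunks before sending any requests is a cheap way to confirm the loop covers the whole range.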

## NWIS Data
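This data uses PCodes (USGS parameter codes, such as the `00065` used above).  You'll need to look up the codes for the parameters you're interested in.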

In [None]:
import pandas as pd 
%pip install hydrofunctions
import hydrofunctions as hf

#Example Sites:
# 'CCH':'11455350',
# 'CCH41':'11455385',
# 'CFL':'11455508',
# 'DEC':'11455478',

startDT = '2022-01-01'
endDT = '2024-02-02'
site_name = 'CCH41'
site_code = '11455385'

NWIS_request = hf.NWIS(site_code,'iv',startDT,endDT)
        
df = NWIS_request.df()

headers = {}
lines = str(NWIS_request).split('\n') # treats NWIS_request as a string and splits it into lines describing the original columns
for line in lines[1:-2]: # ignore the first line (the site name) and the last two lines (the start and end dates)
    line = line.strip() # remove white space
    identifier = line.split(':')[0].strip() 
    headers[identifier] = line # store the line in the dictionary under its identifier

cols = list(df.columns)  # creates a list of original column names
new_cols = []

# Note: qual, scol, and new_cols are not used further in this cell
for column in cols:
    qual = 'qualifier' in column # True if the column is a qualifier column, as opposed to a data column
    scol = column.split(':')

df['site_code'] = site_code
df['site_name'] = site_name

df = df.reset_index()

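df['datetimeUTC'] = pd.to_datetime(df['datetimeUTC'], format='%Y-%m-%d %H:%M:%S')
# Format the UTC times as plain strings, then parse them back and subtract 8 hours.
# This treats PST as a fixed UTC-8 offset, so daylight saving time is not applied.
df['TS Timestamp (PST)'] = df['datetimeUTC'].dt.strftime('%Y-%m-%d %H:%M:%S')
df['TS Timestamp (PST)'] = pd.to_datetime(df['TS Timestamp (PST)'], format='%Y-%m-%d %H:%M:%S') - pd.Timedelta(hours = 8)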

df = df.set_index('TS Timestamp (PST)')

df.columns = [col.split(':')[2] if ':' in col else col for col in df.columns]
df.columns = [col.split('-')[0] if '-' in col else col for col in df.columns]

print(df.columns)
print(df.head())

#fname = f'TS_{field_id}_{site}.csv'   
fname = f'TS_{site_code}_{site_name}.csv'
print('saving to', fname)
df.to_csv(fname)
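
The cell above shifts UTC to PST by formatting the timestamps as strings and subtracting eight hours. As a sketch of an alternative, pandas can do the same fixed-offset shift with its time zone tools. The sample timestamps are made up, and like the cell above this treats PST as UTC-8 all year, with no daylight saving adjustment.

```python
from datetime import timedelta, timezone
import pandas as pd

# Fixed UTC-8 offset, labeled PST; it does not switch to daylight time in summer
PST = timezone(timedelta(hours=-8), name='PST')

# Made-up UTC timestamps standing in for the datetimeUTC column
utc_index = pd.to_datetime(['2022-01-01 08:00:00', '2022-07-01 08:00:00'], utc=True)

# Convert to the fixed offset, then drop the time zone info to get plain local times
pst_index = utc_index.tz_convert(PST).tz_localize(None)
print(pst_index)
```

Both sample times land on midnight PST, including the July one, which shows that no daylight saving shift is applied.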