# Download Missing Modules
If the required modules are not already installed in the notebook environment, the code will not run. Modules such as 'lxml' and 'suds' are not installed by default in the Jupyter kernel. To install them manually, run the following cell. Once this cell has been run, the rest of the code blocks will be able to import these modules and complete the workflow.

In [0]:
!pip install lxml
!pip install suds-jurko
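
To confirm whether the install step is needed at all, the kernel can be asked which modules it can already find. The following is a minimal sketch using only the standard library; the helper name find_missing is an illustration, not part of the original workflow. Note that the check uses import names, so the suds-jurko package is looked up as 'suds'.

```python
import importlib.util


def find_missing(module_names):
    """Return the names from module_names that this kernel cannot import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Use import names, not pip package names: suds-jurko is imported as "suds".
missing = find_missing(["lxml", "suds"])
print("Missing modules:", missing)
```

An empty list means both modules are already available and the install cell can be skipped.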


# Access SPT Data
Another way to access the SPT data is through the API. The following script can be used to access the MHS data for a single station, or for a list of stations.  

To use this script, the user must specify the following variables:
1. watershed = the name of the watershed within the SPT
2. subbasin = the name of the subbasin within the SPT
3. spt_id = the ID number of the streamreach where the station is located
4. tethys_token = this token is available through the settings of the Tethys portal
5. file_location = file location to save the MHS data as a .csv file

In [0]:
import requests
from io import StringIO
import pandas as pd


#Define watershed parameters
watershed='South America'
subbasin='Continental'
spt_id=[177442]
tethys_token='6cf48ff8aa834c2b923ba84137d0f34fdbd845a2'
#Raw string so the backslashes in the Windows path are not read as escape sequences
file_location=r'D:\Jackson\Streamflow Prediction\Data Analysis\Python Stats' + '\\'

for i in spt_id:
    #Request the data for one stream reach at a time (reach_id=i, not the whole list)
    request_params=dict(watershed_name=watershed, subbasin_name=subbasin, reach_id=i, return_format='csv')
    request_headers = dict(Authorization='Token '+tethys_token)
    res = requests.get('http://tethys-staging.byu.edu/apps/streamflow-prediction-tool/api/GetHistoricData/', params=request_params, headers=request_headers)
    #Stop with an error if the server did not return a successful response
    res.raise_for_status()
    csv=res.content
    csv=csv.decode('utf-8')
    csvfile=file_location +str(i)+'.csv'
    data=StringIO(csv)
    #Skip the header row returned by the API and use the first column as the index
    df_data=pd.read_csv(data, sep=',', header=None, names=['predicted streamflow'], index_col=0, skiprows=1)
    df_data.to_csv(csvfile,sep=',', index_label='Datetime')
    print(df_data)
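
The request made inside the loop can be separated from the network call so that its pieces are easy to inspect. The sketch below is an illustration under stated assumptions: the helper names build_request and output_path, the second reach ID, the 'YOUR_TOKEN' placeholder, and the 'spt_data' folder are all hypothetical. It only builds the parameters, headers, and output file name for each reach ID and does not contact the server.

```python
from pathlib import Path


def build_request(watershed, subbasin, reach_id, token):
    """Build the query parameters and headers for one stream reach."""
    params = {
        "watershed_name": watershed,
        "subbasin_name": subbasin,
        "reach_id": reach_id,        # a single ID, not the whole list
        "return_format": "csv",
    }
    headers = {"Authorization": "Token " + token}
    return params, headers


def output_path(folder, reach_id):
    """Name the output file after the reach ID, e.g. 177442.csv."""
    return Path(folder) / f"{reach_id}.csv"


# The second ID and the token below are placeholders for illustration only.
reach_ids = [177442, 177443]
for reach_id in reach_ids:
    params, headers = build_request(
        "South America", "Continental", reach_id, "YOUR_TOKEN")
    print(params["reach_id"], output_path("spt_data", reach_id).name)
```

Keeping the token in a placeholder or an environment variable, rather than in the notebook itself, also avoids sharing it by accident.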

# Access Observed Data from a Hydroserver
The following code block can be used to access observed data from a Hydroserver, parse the WaterML response, and create a .csv file that can then be merged with the SPT data for analysis. This workflow consists of two functions, get_hydroserver() and parse_waterml(), which allow the user to access, parse, and download observed data from a Hydroserver.

To use this workflow, the user must input:
1. url = the url endpoint for the Hydroserver where the observed data is stored
2. site_code = the hydroserver-specific code that corresponds to the streamreach in question
3. variable_code = the data-specific code that corresponds to the data (such as discharge) that the user wants to download. 
4. start_date = the beginning of the timeframe to be downloaded. Must be in the format 'YYYY-MM-DD'
5. end_date = the end of the timeframe to be downloaded. Must be in the format 'YYYY-MM-DD'
6. csv_file = the file location where the downloaded data will be saved as a .csv file
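
Because start_date and end_date are passed to the server as plain strings, it can help to check them before making the request. The following is a minimal sketch; the helper name check_date_range is hypothetical and not part of the original workflow. It rejects strings that do not parse as year-month-day and ranges where the start falls after the end.

```python
from datetime import datetime


def check_date_range(start_date, end_date):
    """Parse both dates as year-month-day and confirm start is not after end."""
    start = datetime.strptime(start_date, "%Y-%m-%d")  # raises ValueError if malformed
    end = datetime.strptime(end_date, "%Y-%m-%d")
    if start > end:
        raise ValueError("start_date must not be later than end_date")
    return start, end


start, end = check_date_range("2014-03-01", "2014-03-05")
print((end - start).days, "days requested")
```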



In [23]:
from lxml import etree as ET
from suds.client import Client
import pandas as pd


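def get_hydroserver(url, site_code, variable_code, start_date, end_date, auth_token):
    # Connect to the Hydroserver SOAP endpoint. Re-raise on failure, because
    # the GetValues call below cannot run without a client.
    try:
        client = Client(url)
    except Exception:
        print('could not connect')
        raise
    # Request the time series for the given site, variable, and date range
    response = client.service.GetValues(site_code,
                                        variable_code,
                                        start_date,
                                        end_date,
                                        auth_token)
    return response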


def parse_waterml(waterml_string):
    root = ET.fromstring(waterml_string)
    # Start with empty lists so the result is usable even if no series is found
    x = []
    y = []
    print('parsing waterml data')
    time_series = root.findall(
        './/{http://www.cuahsi.org/waterML/1.1/}timeSeries')
    nodata = root.findtext(
        './/{http://www.cuahsi.org/waterML/1.1/}noDataValue')
    variable = root.findtext(
        './/{http://www.cuahsi.org/waterML/1.1/}variableName')
    # If the response holds several series, only the last one is returned
    for series in time_series:
        x = []
        y = []
        values = series.findall(
            './/{http://www.cuahsi.org/waterML/1.1/}value')

        for element in values:
            date = element.attrib['dateTime']
            x.append(date)
            v = element.text
            # Compare numerically so that, for example, '9' is not mistaken
            # for the no-data value '-9999'
            if v is None or (nodata is not None and float(v) == float(nodata)):
                y.append(None)
            else:
                y.append(float(v))

        if variable is None:
            variable = ''
        if y == []:
            variable = 'no data'
    waterml_data = {
        'dates': x,
        'values': y,
    }

    return waterml_data


#Declare variables
url = 'http://brasilia.essi-lab.eu/hsl-br/index.php/default/services/cuahsi_1_1.asmx?WSDL'
site_code = 'hsl-br:60781000'
variable_code = 'hsl-br:Discharge'
start_date = '2014-03-01'
end_date = '2014-03-05'
csv_file=r'D:\Jackson\Streamflow Prediction\Data Analysis\Python Stats\Peru\Observed Data\BRAZIL_DATA.csv'

# Most servers don't need an auth_token
hydro_string = get_hydroserver(url, site_code, variable_code, start_date, end_date, auth_token=None)
hydro_values = parse_waterml(hydro_string)

#Write to .csv file
df=pd.DataFrame.from_dict(hydro_values)
df.to_csv(csv_file, index=False)
print(csv_file)
print(df)

parsing waterml data
D:\Jackson\Streamflow Prediction\Data Analysis\Python Stats\Peru\Observed Data\BRAZIL_DATA.csv
                   dates  values
0    2014-03-01T00:00:00   172.0
1    2014-03-01T00:15:00   172.0
2    2014-03-01T00:30:00   172.0
3    2014-03-01T00:45:00   172.0
4    2014-03-01T01:00:00   172.0
5    2014-03-01T01:15:00   172.0
6    2014-03-01T01:30:00   172.0
7    2014-03-01T01:45:00   172.0
8    2014-03-01T02:00:00   172.0
9    2014-03-01T02:15:00   172.0
10   2014-03-01T02:30:00   172.0
11   2014-03-01T02:45:00   172.0
12   2014-03-01T03:00:00   172.0
13   2014-03-01T03:15:00   172.0
14   2014-03-01T03:30:00   172.0
15   2014-03-01T03:45:00   172.0
16   2014-03-01T04:00:00   171.0
17   2014-03-01T04:15:00   171.0
18   2014-03-01T04:30:00   171.0
19   2014-03-01T04:45:00   171.0
20   2014-03-01T05:00:00   171.0
21   2014-03-01T05:15:00   171.0
22   2014-03-01T05:30:00   171.0
23   2014-03-01T05:45:00   171.0
24   2014-03-01T06:00:00   171.0
25   2014-03-01T06:15:00  
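
The parsing step can be tried without a server connection. The sketch below is an illustration only: the XML string is a small made-up sample that uses the same WaterML 1.1 namespace as the code above, not a real server response, and it uses the standard library's xml.etree.ElementTree in place of lxml. The helper name parse_sample is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.cuahsi.org/waterML/1.1/}"

# Made-up sample for illustration; real responses contain many more fields.
SAMPLE = """<timeSeriesResponse xmlns="http://www.cuahsi.org/waterML/1.1/">
  <timeSeries>
    <variable>
      <variableName>Discharge</variableName>
      <noDataValue>-9999</noDataValue>
    </variable>
    <values>
      <value dateTime="2014-03-01T00:00:00">172.0</value>
      <value dateTime="2014-03-01T00:15:00">-9999</value>
    </values>
  </timeSeries>
</timeSeriesResponse>"""


def parse_sample(xml_text):
    """Collect dates and values, replacing the no-data value with None."""
    root = ET.fromstring(xml_text)
    nodata = root.findtext(f".//{NS}noDataValue")
    dates, values = [], []
    for element in root.iter(f"{NS}value"):
        dates.append(element.attrib["dateTime"])
        number = float(element.text)
        is_missing = nodata is not None and number == float(nodata)
        values.append(None if is_missing else number)
    return {"dates": dates, "values": values}


parsed = parse_sample(SAMPLE)
print(parsed)
```

The second reading equals the no-data value, so it is stored as None and appears as an empty cell once the dictionary is written to a .csv file.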

# Merge Data Workflow
This workflow merges two time series, an observed and a predicted series, into one .csv file, which can then be used as an input for the Correlation Analysis workflow, and the Lag Analysis workflow. The workflow consists of a function merge_data(), and then the application of that function to produce the merged data file. 

To use this workflow, the user must specify:
1. recorded_dir = directory where the .csv files with the observed flow data are saved
2. interim_dir = directory where the .csv files from the SPT are saved
3. merged_dir = directory where the merged .csv files will be saved. This directory should be one level up from where the observed and SPT data are saved.
4. locations = list of stations to be merged


In [0]:
import os
import pandas as pd
from os import listdir

def merge_data(recorded_data,interim_data,location,merged_dir):
    #Importing data into a dataframe
    df_recorded = pd.read_csv(recorded_data, delimiter=",", header=None, names=['recorded streamflow'], index_col=0, skiprows=1)
    df_predicted = pd.read_csv(interim_data, delimiter=",", header=None, names=['predicted streamflow'], index_col=0, skiprows=1)
    #Converting the index to datetime type
    df_recorded.index = pd.to_datetime(df_recorded.index)
    df_predicted.index = pd.to_datetime(df_predicted.index)
    #Joining the two dataframes on their datetime index and dropping rows that lack either value
    df_merged = df_predicted.join(df_recorded).dropna()
    df_merged.to_csv(os.path.join(merged_dir, location + "_merged.csv"),sep=",",index_label="Datetime")

# Specify variables
recorded_dir=r"C:\Users\Owner\Documents\School\Research\Peru\Peru\Observed Data"
interim_dir=r'C:\Users\Owner\Documents\School\Research\Peru\Peru\SPT Data'
merged_dir=r'C:\Users\Owner\Documents\School\Research\Peru\Peru'
locations = ['Chazuta','Requena', 'San Regis']

# listdir() returns files in arbitrary order, so sort both lists.
# The sorted files must line up with the order of the locations list.
recorded_list = sorted(listdir(recorded_dir))
interim_list = sorted(listdir(interim_dir))

print(recorded_list)
print(interim_list)
print(locations)

for i,j,k in zip(recorded_list,interim_list,locations):
    print(i)
    i=os.path.join(recorded_dir, i)
    j=os.path.join(interim_dir, j)
    merge_data(i,j,k,merged_dir)
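
The loop above pairs files with stations purely by position, so an extra or missing file in either directory would silently pair the wrong data. One possible safeguard is to match files to stations by name. The sketch below assumes that each file name contains the station name, which the original workflow does not state; the helper name match_files and the example file names are hypothetical.

```python
def match_files(filenames, locations):
    """Map each location to the first file whose name contains it (case-insensitive)."""
    matches = {}
    for location in locations:
        key = location.lower()
        hits = [name for name in sorted(filenames) if key in name.lower()]
        if not hits:
            raise FileNotFoundError(f"no file found for station {location!r}")
        matches[location] = hits[0]
    return matches


# Example file names are made up; in practice pass the result of listdir().
recorded_files = ["San Regis.csv", "notes.txt", "Chazuta.csv", "Requena.csv"]
print(match_files(recorded_files, ["Chazuta", "Requena", "San Regis"]))
```

A station with no matching file raises an error immediately instead of producing a merged file built from the wrong inputs.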

# Python Modules to Import
Before running the analysis workflows, the following modules must be imported so that Python can complete the necessary calculations.

In [0]:
#import necessary modules
import pandas as pd
import numpy as np
import scipy
from scipy import stats
from tqdm import tqdm
import glob
import os
import matplotlib.pyplot as plt
import hydrostats as hs