## NHS Post processing Library

This notebook provides examples of how to carry out data evaluation, visualization, and analysis using the post_processing Python library. Be sure to go through the [Quick Start](https://nhs-postprocessing.readthedocs.io/en/latest/QuickStart.html) section of the [documentation](https://nhs-postprocessing.readthedocs.io/en/latest/index.html) for instructions on how to access and import the library and its packages.

##### Note:
The library is still under active development; empty sections will be completed in due time.

### Table of contents
- Requirements
- Data types (CSV, DataFrame, etc.)
- Organizations (MESH, TB EVAL, EOD, etc.)
- Analysis
- Visualizations


### Requirements

The conda environment contains all the libraries that the post-processing library depends on. After setting up the conda environment, you only have to import the `metrics` and `visuals` modules from `postprocessinglib.evaluation`.

In [1]:
### Remove and modify these later.
import ipywidgets as widgets
from ipywidgets import interactive, ToggleButtons
import datetime
import sys
sys.path.append("../../../../postprocessing")

In [2]:
from postprocessinglib.evaluation import metrics, visuals

From this point on, the functions you use depend on the project you are working on and on the type of data you have.

### MESH CSV files

Assume you have a CSV file with observed and simulated values for a list of stations, laid out as follows:

| Some datetime  | station1_obs | station1_sim | station2_obs | station2_sim |
| -------------- | ------------ | ------------ | ------------ | ------------ |

Pass it to the `generate_dataframes` function as shown below:


In [3]:
# passing a controlled csv file for testing
path = "MESH_output_streamflow.csv"

# assuming the simulation model needs 365 days to warm up and account for errors during the learning phase.
observed, simulated = metrics.generate_dataframes(csv_fpath=path, warm_up=365)

This generates two dataframes: one for the observed (measured) values and one for the values simulated by the model.
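
To make the split concrete, here is a minimal standard-library sketch of the idea. It is not the library's implementation. It assumes a single datetime column followed by alternating observed and simulated columns, and that `warm_up` simply drops the first N rows. The helper `split_obs_sim` and the sample values are hypothetical.

```python
import csv
import io

# Hypothetical sample in the layout described above; the values are made up.
SAMPLE = """date,station1_obs,station1_sim,station2_obs,station2_sim
2000-01-01,1.0,1.1,2.0,2.2
2000-01-02,1.5,1.4,2.5,2.4
2000-01-03,1.2,1.3,2.1,2.0
"""

def split_obs_sim(text, warm_up=0):
    """Split a wide CSV into observed and simulated rows, skipping warm-up rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = list(reader)[warm_up:]  # drop the first `warm_up` time steps
    # After the datetime column, columns alternate: observed, simulated, ...
    obs_idx = range(1, len(header), 2)
    sim_idx = range(2, len(header), 2)
    observed_rows = [[float(row[i]) for i in obs_idx] for row in rows]
    simulated_rows = [[float(row[i]) for i in sim_idx] for row in rows]
    return observed_rows, simulated_rows

observed_rows, simulated_rows = split_obs_sim(SAMPLE, warm_up=1)
print(observed_rows)
print(simulated_rows)
```

Each inner list holds one time step, with one value per station, which mirrors the one-column-per-station layout of the dataframes.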

In [4]:
print("The Observed dataframe: \n")
print(observed)
print("\nThe Simulated dataframe: \n")
print(simulated)

The Observed dataframe: 

           QOMEAS_05BB001  QOMEAS_05BA001
YEAR JDAY                                
1980 366            10.20            -1.0
1981 1               9.85            -1.0
     2              10.20            -1.0
     3              10.00            -1.0
     4              10.10            -1.0
...                   ...             ...
2017 361            -1.00            -1.0
     362            -1.00            -1.0
     363            -1.00            -1.0
     364            -1.00            -1.0
     365            -1.00            -1.0

[13515 rows x 2 columns]

The Simulated dataframe: 

           QOSIM_05BB001  QOSIM_05BA001
YEAR JDAY                              
1980 366        2.530770       1.006860
1981 1          2.518999       1.001954
     2          2.507289       0.997078
     3          2.495637       0.992233
     4          2.484073       0.987417
...                  ...            ...
2017 361        4.418050       1.380227
     362      

You now have two dataframes, one containing the observed values and the other the simulated values. Both are required by the analysis calculations that follow.

In [5]:
# You are also able to split the data into their respective stations in the cases where you might want to do so
for station in metrics.station_dataframe(observed=observed, simulated=simulated, station_num=2):
    print(station)

           QOMEAS_05BB001  QOSIM_05BB001
YEAR JDAY                               
1980 366            10.20       2.530770
1981 1               9.85       2.518999
     2              10.20       2.507289
     3              10.00       2.495637
     4              10.10       2.484073
...                   ...            ...
2017 361            -1.00       4.418050
     362            -1.00       4.393084
     363            -1.00       4.368303
     364            -1.00       4.343699
     365            -1.00       4.319275

[13515 rows x 2 columns]
           QOMEAS_05BA001  QOSIM_05BA001
YEAR JDAY                               
1980 366             -1.0       1.006860
1981 1               -1.0       1.001954
     2               -1.0       0.997078
     3               -1.0       0.992233
     4               -1.0       0.987417
...                   ...            ...
2017 361             -1.0       1.380227
     362             -1.0       1.372171
     363             -1.0      

<mark>Note: At present, this is the only kind of data the library can handle. Future releases will support more formats, including NetCDF files, shapefiles, arrays, and more.</mark>

### Analysis

Because the library is in active development, features are added and removed regularly. As a rule of thumb, check what it can do at the time of use.

In [6]:
# We do this by calling on its available metrics
for metric in metrics.available_metrics():
    print(metric)

MSE
RMSE
MAE
NSE
NegNSE
LogNSE
NegLogNSE
KGE
NegKGE
KGE 2012
BIAS
AbsBIAS
TTP
TTCoM
SPOD


Now that we know what it can currently do, we know what we want to ask for.

##### Mean Square Error

In [7]:
# Mean square error for the first station in the data we were given
print(metrics.mse(observed=observed, simulated=simulated, num_stations=1))

[1656.685638835447]


##### Root Mean Square Error

In [8]:
# Root mean square error for the first station in the data we were given
print(metrics.rmse(observed=observed, simulated=simulated, num_stations=1))

[40.702403354537275]
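
As a sanity check on the two results above, RMSE is the square root of MSE, and 40.7024 squared is indeed about 1656.69. The sketch below shows the textbook definitions on made-up values. It is an illustration, not the library's code, and the function names are hypothetical.

```python
import math

def mse(obs, sim):
    """Mean of the squared differences between paired values."""
    return sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs)

def rmse(obs, sim):
    """Square root of the MSE, in the same units as the data."""
    return math.sqrt(mse(obs, sim))

obs = [2.0, 4.0, 6.0]
sim = [1.0, 5.0, 8.0]
print(mse(obs, sim))   # squared errors are 1, 1, 4
print(rmse(obs, sim))
```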


##### Mean Absolute Error

In [9]:
# Similarly, the mean absolute error for the first station in the data we were given will look like 
print(metrics.mae(observed=observed, simulated=simulated, num_stations=1))

[22.12912878335626]
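
The textbook definition of MAE averages the absolute differences, so large errors weigh less than they do in RMSE. A minimal sketch on made-up values, with a hypothetical function name:

```python
def mae(obs, sim):
    """Mean of the absolute differences between paired values."""
    return sum(abs(s - o) for o, s in zip(obs, sim)) / len(obs)

obs = [2.0, 4.0, 6.0]
sim = [1.0, 5.0, 8.0]
print(mae(obs, sim))  # absolute errors are 1, 1, 2
```

MAE can never exceed RMSE for the same data. In the results above, the gap between 22.13 and 40.70 suggests that a relatively small number of large errors dominates the RMSE.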


##### Nash-Sutcliffe Efficiency

In [10]:
# Similarly, the Nash-Sutcliffe Efficiency for the first station in the data we were given will look like 
print(metrics.nse(observed=observed, simulated=simulated, num_stations=1))

[0.0021806971124596064]
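
For intuition, the standard NSE formula compares the model's squared errors against those of a baseline that always predicts the observed mean. A value of 1 is a perfect fit and a value of 0 means the model does no better than that baseline. The sketch below uses made-up values and a hypothetical function name. It is not the library's implementation.

```python
def nse(obs, sim):
    """1 minus the ratio of model squared error to the variance of the observations."""
    mean_obs = sum(obs) / len(obs)
    model_error = sum((s - o) ** 2 for o, s in zip(obs, sim))
    baseline_error = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - model_error / baseline_error

obs = [1.0, 2.0, 3.0, 4.0]
print(nse(obs, obs))         # perfect simulation
print(nse(obs, [2.5] * 4))   # simulation equal to the observed mean
```

Read this way, the value of about 0.002 above indicates that, on this test file, the simulation is only marginally better than the observed mean.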


##### Kling-Gupta Efficiency

In [11]:
# Similarly, the Kling-Gupta Efficiency for the first station in the data we were given will look like 
print(metrics.kge(observed=observed, simulated=simulated, num_stations=1))

[0.49061085454963205]
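
The original (2009) KGE formulation combines three components: linear correlation, the ratio of standard deviations, and the ratio of means. The following sketch implements that published formula on made-up values. The function names are hypothetical, and whether `metrics.kge` matches it term for term is an assumption.

```python
import math
from statistics import mean, pstdev

def pearson_r(obs, sim):
    """Linear correlation coefficient between two equally long sequences."""
    mo, ms = mean(obs), mean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)

def kge_2009(obs, sim):
    """1 minus the Euclidean distance of (r, alpha, beta) from the ideal (1, 1, 1)."""
    r = pearson_r(obs, sim)
    alpha = pstdev(sim) / pstdev(obs)  # variability ratio
    beta = mean(sim) / mean(obs)       # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

obs = [1.0, 2.0, 3.0, 4.0]
doubled = [2.0 * o for o in obs]
print(kge_2009(obs, obs))      # perfect simulation
print(kge_2009(obs, doubled))  # r = 1, alpha = 2, beta = 2
```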


##### The Updated Kling-Gupta Efficiency (2012)

In [12]:
# Similarly, the updated (2012) Kling-Gupta Efficiency for the first station in the data we were given will look like
print(metrics.kge_2012(observed=observed, simulated=simulated, num_stations=1))

[0.27812840065858213]
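
For reference, the 2012 modification of KGE, as commonly defined, replaces the variability ratio with a ratio of coefficients of variation, so that the bias and variability terms are not cross-correlated. Whether `kge_2012` follows exactly this form is an assumption. Here $r$, $\mu$, and $\sigma$ denote the linear correlation, mean, and standard deviation, with subscripts $s$ and $o$ for simulated and observed:

$$
\mathrm{KGE}' = 1 - \sqrt{(r-1)^2 + (\beta-1)^2 + (\gamma-1)^2}, \qquad \beta = \frac{\mu_s}{\mu_o}, \qquad \gamma = \frac{\sigma_s/\mu_s}{\sigma_o/\mu_o}
$$

For example, a simulation that is exactly twice the observations has $r = 1$, $\beta = 2$, and $\gamma = 1$, giving $\mathrm{KGE}' = 1 - \sqrt{1} = 0$. The original formulation, which uses $\alpha = \sigma_s/\sigma_o = 2$ in place of $\gamma$, gives $1 - \sqrt{2} \approx -0.41$ for the same data.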


##### Percentage Bias

In [13]:
# Similarly, the Percentage Bias for the first station in the data we were given will look like 
print(metrics.bias(observed=observed, simulated=simulated, num_stations=1))

# Note that it returns a percentage rather than a fraction, e.g. -27.05 rather than -0.2705

[-27.052012466427488]
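
Percent bias has more than one sign convention in the literature. The sketch below assumes the convention in which a negative value means the simulation underestimates the observed total. That choice and the function name are assumptions, not something confirmed for `metrics.bias`.

```python
def percent_bias(obs, sim):
    """Total simulated minus observed volume, as a percentage of the observed volume."""
    total_difference = sum(s - o for o, s in zip(obs, sim))
    return 100.0 * total_difference / sum(obs)

obs = [10.0, 20.0, 30.0]
sim = [8.0, 16.0, 24.0]  # every value is 20% too low
print(percent_bias(obs, sim))
```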


##### Time to Peak

In [14]:
# Similarly, the Time to Peak for the simulated data from the first station in the data we were given will look like 
print(metrics.time_to_peak(df=simulated, num_stations=1))

# The time to peak for the observed data for the same station looks like:-
print(metrics.time_to_peak(df=observed, num_stations=1))

[186.61111111111111]
[167.375]
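
The fractional results above are consistent with a day of year averaged over several years. The sketch below assumes that definition: find the day on which each year's flow peaks, then average those days. The function name and sample data are hypothetical, and the library may define the metric differently.

```python
from statistics import mean

def time_to_peak(flows_by_year):
    """Average day of year (1-based) on which each year's maximum flow occurs."""
    peak_days = []
    for flows in flows_by_year.values():
        peak_index = max(range(len(flows)), key=flows.__getitem__)
        peak_days.append(peak_index + 1)  # convert 0-based index to day of year
    return mean(peak_days)

# Made-up daily flows for two very short "years".
sample = {2000: [1.0, 5.0, 3.0], 2001: [2.0, 2.0, 9.0, 4.0]}
print(time_to_peak(sample))  # peaks fall on day 2 and day 3
```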


##### Time to Centre of Mass

In [15]:
# Similarly, the Time to Centre of Mass for the simulated data from the first station in the data we were given will look like
print(metrics.time_to_centre_of_mass(df=simulated, num_stations=1))

# The Time to Centre of Mass for the observed data for the same station looks like:
print(metrics.time_to_centre_of_mass(df=observed, num_stations=1))

[192.78635358563776]
[193.911419943451]
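
A common way to define the centre of mass of a hydrograph is the flow-weighted mean day of year. The sketch below computes it per year and then averages across years. Both the definition and the function names are assumptions made for illustration.

```python
from statistics import mean

def centre_of_mass_day(flows):
    """Flow-weighted mean day (1-based) for a single year of daily flows."""
    weighted_days = sum(day * q for day, q in enumerate(flows, start=1))
    return weighted_days / sum(flows)

def time_to_centre_of_mass(flows_by_year):
    """Average of the yearly centre-of-mass days."""
    return mean([centre_of_mass_day(flows) for flows in flows_by_year.values()])

# Made-up daily flows for two very short "years".
sample = {2000: [1.0, 2.0, 1.0], 2001: [0.0, 0.0, 4.0]}
print(time_to_centre_of_mass(sample))  # yearly values are 2.0 and 3.0
```

Unlike the time to peak, this measure uses the whole hydrograph, so a single extreme day moves it less.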


##### Spring Pulse Onset Delay

In [None]:
# Similarly, the Spring Pulse Onset for the simulated data from the first station in the data we were given will look like
print(metrics.SpringPulseOnset(df=simulated, num_stations=1))

# The Spring Pulse Onset for the observed data for the same station looks like:
print(metrics.SpringPulseOnset(df=observed, num_stations=1))
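
One common definition of the spring pulse onset is the day on which the cumulative departure of daily flow from the annual mean is most negative, that is, the last day before flows rise persistently above the mean. The sketch below assumes that definition for a single year. The function name and data are hypothetical, and this is not confirmed to be what the library computes.

```python
from itertools import accumulate

def spring_pulse_onset(flows):
    """Day (1-based) at which cumulative departure from the mean flow is lowest."""
    mean_flow = sum(flows) / len(flows)
    cumulative = list(accumulate(q - mean_flow for q in flows))
    return cumulative.index(min(cumulative)) + 1

# Made-up year: three low-flow days followed by a sharp rise.
print(spring_pulse_onset([1.0, 1.0, 1.0, 5.0, 8.0]))
```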

#### Multiple metrics

In [None]:
# You can also calculate all available metrics at once
for key, value in metrics.calculate_all_metrics(observed=observed, simulated=simulated, num_stations=2).items():
    print(f"{key}: {value}")

Note: There will be an option to have this written to a text file for easier inspection in the coming days.

In [None]:
# You can also calculate a few metrics at a time by putting them in a list and passing it to the function,
# as shown below
metrices = ["MSE", "RMSE", "MAE", "NSE", "NegNSE", "LogNSE"]
for key, value in metrics.calculate_metrics(observed=observed, simulated=simulated, num_stations=1, metrices=metrices).items():
    print(f"{key}: {value}")
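
Selecting metrics by name, as above, is typically done with a lookup table that maps each name to a function. The sketch below illustrates that pattern with three textbook metrics. The registry, the helper `calculate_selected`, and the sample values are hypothetical and do not reflect the library's internals.

```python
import math

def mse(obs, sim):
    return sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs)

def rmse(obs, sim):
    return math.sqrt(mse(obs, sim))

def mae(obs, sim):
    return sum(abs(s - o) for o, s in zip(obs, sim)) / len(obs)

# Hypothetical registry mapping metric names to the functions above.
REGISTRY = {"MSE": mse, "RMSE": rmse, "MAE": mae}

def calculate_selected(obs, sim, names):
    """Return {name: value} for the requested metrics, rejecting unknown names."""
    unknown = [name for name in names if name not in REGISTRY]
    if unknown:
        raise ValueError(f"Unknown metrics: {unknown}")
    return {name: REGISTRY[name](obs, sim) for name in names}

results = calculate_selected([2.0, 4.0, 6.0], [1.0, 5.0, 8.0], ["MSE", "MAE"])
for key, value in results.items():
    print(f"{key}: {value}")
```

Failing fast on an unknown name is a design choice in this sketch. It surfaces typos in the metric list instead of silently skipping them.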

### Visualizations

<mark>Note: The visualization modules are not yet ready for public use.</mark>