## Basic statistics and WAPOR data

#### Introduction

The waterpip package includes a set of statistical functions (tools) for analysing one or more rasters that the user provides. These functions can be found in the script: <br>

*waterpip\scripts\support\statistics.py* <br>

- NOTE: These functions can be used on any file of the correct type, however it is easier to use them on files retrieved using the **WaporRetrieval** class. <br>

In this notebook we will walk you through how to run two of the available functions.

### **Steps**:<br>

1. Import the modules and functions needed<br><br> 

2. Run the function *raster_count_statistics*: counts the unique values found in a raster and calculates the percentage of non-NaN cells they make up as well as the area each value covers. <br><br> 

3. Run the function *calc_field_statistics*: calculates per field statistics from a raster or set of rasters using a shapefile to determine the fields/areas.  <br><br> 

4. Export the calculated field statistics to a shapefile<br><br> 

5. Visualise the data<br><br> 

6. Rinse and Repeat<br><br> 

NOTE: If this is your first time running this please read the instructions below and follow the steps, otherwise feel free to use the notebook as you wish.
***

## 1. Import modules/libraries

In [None]:
import os
from datetime import datetime
from waterpip.scripts.support.statistics import raster_count_statistics, calc_field_statistics
from waterpip.scripts.support.vector import records_to_shapefile

print('scripts imported successfully, you are at the starting line')

***
## 2. Count the raster values in a categorical raster (land cover classification)

As a first step carry out a count of the different values that exist in a categorical raster such as a land cover classification. This can be done using the function *raster_count_statistics*. To do this you can either provide your own categorical raster or you can use the WAPOR land cover classification raster. <br>
 
- *raster_count_statistics*: counts the unique values found in a raster and calculates the percentage of non nan cells they make up as well as the area each value covers. <br><br>

- NOTE: The WAPOR land cover classification raster is retrieved using the datacomponent code = ['LCC'] and the notebook *waterpip\tutorials\01_Basics\01A_downloads\01A_waterpip_download_basics.ipynb* <br>

Once you have a raster you can run the function using the inputs below:

#### Required Inputs:<br>

- **input_raster_path**: path to the input raster holding the values to count<br>

#### Optional Inputs:<br>

- **output_csv**: if a path to an output csv is provided, the calculated output is written to both a csv and an excel file<br><br>

- **categories_dict**: if a dict of categories is provided uses the dict to assign names/categories 
to the values found.<br>

    - NOTE: the categories_dict has to be formatted so that the dictionary keys are the categories (names) 
and the values are the values found in the raster that the categories/names have to match<br><br>

- **category_name**: if a categories dictionary is provided this is used to name the new key in the output 
dictionary and/or the column in the output dataframe/csv. This is the column/key that will hold the category names
(category dictionary keys). Defaults to 'categories'<br><br>

- **out_dict**: boolean input. If True the function outputs a dictionary, if False it outputs a dataframe. Defaults to False<br> 
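
To illustrate the *categories_dict* layout described above, here is a minimal sketch. The category names and raster values are invented for illustration and are not the real WAPOR LCC codes, and the inverted lookup is only one plausible way of mapping raster values back to names, not necessarily how *raster_count_statistics* does it.

```python
# Hypothetical categories: keys are the category names, values are the raster values.
# These names and codes are made up and are NOT the WAPOR LCC codes.
categories_dict = {
    "cropland": 1,
    "forest": 2,
    "water": 3,
}

# Inverting the mapping gives a lookup from raster value to category name,
# which is the direction needed when labelling the values counted in a raster.
value_to_category = {value: name for name, value in categories_dict.items()}

print(value_to_category[2])
```

Note that the inversion only works cleanly if every raster value appears once in the dictionary.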

#### Outputs: <br>

The function returns a tuple holding the dataframe/dict and the path to the csv (if one was provided on input). Both contain 
the same information on the values counted in the raster. 
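
To make the three reported quantities concrete (count, percentage of non-NaN cells and area), the sketch below computes them for a tiny in-memory grid using only the standard library. It is not the waterpip implementation: the helper name `count_raster_values`, the nested-list grid and the `cell_area` argument are all assumptions made for illustration.

```python
import math
from collections import Counter


def count_raster_values(cells, cell_area):
    """Count unique non-NaN values, their share of the valid cells and the area covered."""
    # Flatten the grid and drop NaN cells (NaN stands in for nodata here).
    valid = [
        v for row in cells for v in row
        if not (isinstance(v, float) and math.isnan(v))
    ]
    counts = Counter(valid)
    total = len(valid)
    return {
        value: {
            "count": n,
            "percentage": 100.0 * n / total,
            "area": n * cell_area,
        }
        for value, n in sorted(counts.items())
    }


nan = float("nan")
grid = [
    [1, 1, 2],
    [2, 2, nan],
    [3, nan, 1],
]

# Hypothetical cell size of 10 m x 10 m, so each cell covers 100 square metres.
summary = count_raster_values(grid, cell_area=100.0)
for value, stats in summary.items():
    print(value, stats)
```

The percentages are taken over the seven valid cells only, so the two NaN cells do not dilute them.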


***
### 2.1 Retrieving WAPOR land cover classification categories dict

NOTE: this step is only applicable if carrying out *raster_count_statistics* on the land cover classification raster (LCC) retrieved from the WAPOR portal. If analysing your own categorical raster you can skip this step.

To add categories to the WAPOR LCC we provide a WAPOR LCC categories dict. It can be retrieved from the script: 

*waterpip\scripts\retrieval\wapor_land_cover_classification_codes.py*

using the following function: *wapor_lcc* 

To use it, import the function and when running it provide the WAPOR level (1, 2 or 3) matching the level you used when retrieving the WAPOR LCC raster.

In [None]:
# retrieve the wapor LCC categories dict (OPTIONAL)

from waterpip.scripts.retrieval.wapor_land_cover_classification_codes import wapor_lcc

categories_dict = wapor_lcc(wapor_level=3)

In [None]:
# carry out the raster statistics count
count_data, count_csv = raster_count_statistics(
    input_raster_path=r"path to the categorical raster you want to analyse goes here",
    output_csv=r"path where you want to output a csv to goes here",
    categories_dict=categories_dict,
    category_name='landcover',
    out_dict=False
    )
    
print(count_data)

print(count_csv)


***
## 3. Calculate field based statistics from a raster or a set of rasters

In the waterpip package we provide a set of statistical tools that you can use to analyse rasters. One of these is the 
function *calc_field_statistics*. It allows you to carry out zonal statistics on a single raster or a set of rasters using a shapefile to determine the fields/zones/geometries for which to calculate statistics.<br>

- NOTE: When running the function for a single raster the name of each column is taken from the statistic being calculated. However, in the case of multiple rasters this is not feasible, so the name of each input raster or vrt band in combination with the statistic calculated is taken as the column name.<br>

    - WARNING: For csv and excel the names generated above are fine, however shapefiles only accept short column names (the dBASE attribute table limits field names to 10 characters). So before outputting to shapefile, csvs/excels may need editing. The input option *waterpip_files* attempts to automate this by keeping the most important parts of the names when running the script for raster files with the standardised waterpip file names. (**IMPORTANT**) <br>
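
If you need to shorten column names yourself before writing a shapefile, the sketch below shows one simple approach: truncate each name and add a numeric suffix when two truncated names collide. This is not how the *waterpip_files* option works. The helper `shorten_column_names` and the example column names are hypothetical, and `max_length` defaults to 10 (the dBASE field name limit) but can be set to whatever limit applies in your workflow.

```python
def shorten_column_names(names, max_length=10):
    """Truncate names to max_length, adding a numeric suffix when truncation collides."""
    shortened = {}
    used = set()
    for name in names:
        candidate = name[:max_length]
        counter = 1
        # Keep trying numbered variants until the shortened name is unique.
        while candidate in used:
            suffix = str(counter)
            candidate = name[: max_length - len(suffix)] + suffix
            counter += 1
        used.add(candidate)
        shortened[name] = candidate
    return shortened


# Hypothetical column names, similar in spirit to "<raster name>_<statistic>".
columns = ["evapotranspiration_mean", "evapotranspiration_max", "mean"]
mapping = shorten_column_names(columns)
for original, short in mapping.items():
    print(original, "->", short)
```

Keeping the returned mapping (for example alongside the csv) makes it possible to trace each shortened column back to its full name.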

To run the function you need to provide the following inputs:<br>

**Required Inputs**:<br>

- **fields_shapefile_path**: path to the shapefile containing the fields used to designate the zones of analysis. <br><br>

    - NOTE: if working with wapor data it is recommended to use the mask shapefile made when running the function *create_raster_mask_from_shapefile* or *create_raster_mask_from_wapor_lcc* from **WaporRetrieval** or **WaporAnalysis**. However any correctly formatted shapefile is acceptable.<br><br>

- **input_rasters**: list of paths to the rasters that are to be analyzed. For one raster just provide a list of one raster<br>

    - NOTE: *calc_field_statistics* accepts multiple rasters and/or vrts and will calculate the field statistics for each raster provided, automatically generating names for the columns of output produced to distinguish them. Also works with a single raster provided in list format. <br>

**Optional Inputs**:<br>

- **csv_file_path**: path to the csv where the calculated statistics are written, if provided<br><br>

- **field_stats**: list of field statistics to calculate (checked against a list of accepted keywords), if not provided uses the default set: ['min', 'max', 'mean', 'sum', 'stddev']<br><br>

- **id_key**: identifies the column in the *fields_shapefile_path* used to mark/identify each field in the shapefile. This input defaults to 'wpid' on the assumption that you are using a mask shapefile produced using *create_raster_mask_from_shapefile* or *create_raster_mask_from_wapor_lcc*. <br>

    - WARNING: the id has to be unique per field and has to exist in the shapefile (**IMPORTANT**)<br><br>

- **out_dict**: boolean option if set to True outputs the data in a dict instead of a dataframe.<br><br>

- **waterpip_files**: boolean option if set to True assumes rasters with standardised waterpip file paths have 
been provided as input and deconstructs the file name to provide automatic column names.<br>

    - NOTE: only relevant when running for multiple rasters <br>
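
Because the *id_key* column has to exist and be unique per field, it can be worth checking this before running the statistics. The sketch below is a minimal check, assuming the shapefile attributes have already been read into a list of dictionaries (one per field). The helper name `find_duplicate_ids` and the sample records are hypothetical.

```python
from collections import Counter


def find_duplicate_ids(records, id_key="wpid"):
    """Return the sorted id values that occur more than once in the attribute records."""
    for index, record in enumerate(records):
        if id_key not in record:
            raise KeyError(f"record {index} has no '{id_key}' column")
    counts = Counter(record[id_key] for record in records)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical attribute rows: field 2 appears twice, which would be a problem.
attribute_rows = [
    {"wpid": 1, "crop": "maize"},
    {"wpid": 2, "crop": "wheat"},
    {"wpid": 2, "crop": "rice"},
]

duplicates = find_duplicate_ids(attribute_rows)
print(duplicates)
```

An empty list means every field has its own id.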

**Output**:

The output of the function *calc_field_statistics* is a tuple containing the dataframe/dict produced and the path to the .csv where the dataframe is stored, if a path is provided.
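
To show what per field (zonal) statistics mean in practice, the sketch below computes the default set of statistics on two small in-memory grids: one holding a field id per cell and one holding the raster values. It is not the waterpip implementation. The helper `field_statistics` and the grids are hypothetical, and the population standard deviation is used here as an assumption, since this notebook does not say which variant *calc_field_statistics* reports.

```python
import math
import statistics


def field_statistics(zones, values):
    """Group cell values by field id and compute min, max, mean, sum and stddev per field."""
    grouped = {}
    for zone_row, value_row in zip(zones, values):
        for zone_id, value in zip(zone_row, value_row):
            # Skip cells outside any field (None) and nodata cells (NaN).
            if zone_id is None or math.isnan(value):
                continue
            grouped.setdefault(zone_id, []).append(value)
    return {
        zone_id: {
            "min": min(vals),
            "max": max(vals),
            "mean": statistics.fmean(vals),
            "sum": math.fsum(vals),
            "stddev": statistics.pstdev(vals),  # population stddev (assumption)
        }
        for zone_id, vals in grouped.items()
    }


nan = float("nan")
zone_grid = [
    [1, 1, 2],
    [1, 2, 2],
    [None, 2, None],
]
value_grid = [
    [2.0, 4.0, 1.0],
    [6.0, 3.0, nan],
    [9.0, 5.0, 7.0],
]

stats_per_field = field_statistics(zone_grid, value_grid)
for field_id, stats in stats_per_field.items():
    print(field_id, stats)
```

Each key of the result corresponds to one field id, which is the role the *id_key* column plays in the real function.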

In [None]:
field_stats = calc_field_statistics(
    fields_shapefile_path=r"path to the shapefile containing the fields/polygons you want to analyse goes here",
    input_rasters=[r"list of paths to rasters you want to analyse goes here"],
    output_csv_path=r"path you want to output a csv to goes here (optional)",
    field_stats=['min', 'max', 'mean', 'sum', 'stddev'],
    id_key='wpid',
    out_dict=False
)

# access the tuple and print the dataframe and path to the .csv
field_stats_df = field_stats[0]

field_stats_csv_path = field_stats[1]

print(field_stats_df)

print(field_stats_csv_path)

***
## 4. Output to shapefile

As a last step we can output the calculated field statistics to a shapefile so that they can be visualised in QGIS or ArcGIS as the user wants.<br>

**Required Inputs**:<br>

- **records**: the dictionary or dataframe containing the records/info to be written to the shapefile.<br><br>

- **output_shapefile_path**: path to output the created shapefile to<br><br>

- **fields_shapefile_path**: path to the shapefile holding the reference fields/geometries to which the data should be attached. For example the input shapefile used to generate the data, or the reference shapefile generated by the crop mask function of wapor analysis.<br><br> 

- **union_key**: identifies the column in the *fields_shapefile_path* and in the records used to combine the two. If working with a shapefile generated by the crop mask script, 'wpid' is suggested. Otherwise another column/key can also be used.<br>

**Optional Inputs**:<br>

- **output_crs**: if provided warps the shapefile to match this crs<br><br>

WARNING: long column names (like those currently autogenerated when creating PAI csvs/excels) will be truncated. Use the csv to match which column is which, or edit the csv to have shorter column names.
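
The role of *union_key* can be pictured as a simple join between the reference fields and the calculated records. The sketch below shows the idea with plain dictionaries. It is not the code inside *records_to_shapefile*: the helper `attach_records`, the placeholder geometry strings and the choice to keep fields that have no matching record are all assumptions made for illustration.

```python
def attach_records(fields, records, union_key="wpid"):
    """Attach each record to the field that shares the same union_key value."""
    by_key = {record[union_key]: record for record in records}
    joined = []
    for field in fields:
        # Fields without a matching record keep only their own attributes.
        match = by_key.get(field[union_key], {})
        joined.append({**field, **match})
    return joined


# Hypothetical reference fields (geometries shown as placeholder strings).
reference_fields = [
    {"wpid": 1, "geometry": "POLYGON A"},
    {"wpid": 2, "geometry": "POLYGON B"},
]
# Hypothetical statistics records, keyed by the same id column.
stat_records = [{"wpid": 1, "mean": 4.0}]

joined_fields = attach_records(reference_fields, stat_records)
for row in joined_fields:
    print(row)
```

This is also why the key has to exist in both the shapefile and the records: without a shared column there is nothing to join on.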


In [None]:
records_to_shapefile(
    records=field_stats_df, # stats to output to shapefile go here
    output_shapefile_path=r"path to output a shapefile to",
    fields_shapefile_path=r"path to the template shapefile",
    union_key="wpid")

***
## 5. Visualise the data

You can check the data using a program such as Excel, QGIS or ArcGIS, or any other tool you prefer.

***
## 6. Rinse and Repeat  

Now that you know how to retrieve and analyse data, feel free to repeat the notebooks *01A_waterpip_download_basics* and *01B_waterpip_statistics_basics* and play around with the parameters. If you feel like it you can even get into the code itself and see what you can code, run, retrieve and analyse! 

***
## The next step: Performance Assessment Indicators (PAIs)

If you feel like it you can also take a look at notebook *02A_waterpip_analysis_PAIs.ipynb* where we walk you through the process of producing more difficult statistics: *Performance Assessment Indicators (PAIs)* for an area from download to analysis.