# NAWA FRACHT dataset extraction

Author: Thiago Nascimento (thiago.nascimento@eawag.ch)

This notebook is used to retrieve and concatenate the NAWA FRACHT (NADUF) dataset. 

The output is one file per catchment (similar to CAMELS-CH), with 40 columns:

- date_start
- date_end 
- alk
- As
- Ba
- Br
- Cd
- Ca
- Cl
- Cr
- Cu
- doc
- drp
- ec25_sensor
- ec20_lab
- F
- Fe
- Pb
- Mg
- q_mean_sensor
- Hg
- Ni
- NO3_N
- O2C_sensor
- O2S_sensor
- pH_lab
- pH_sensor
- K
- H4SiO4
- Na
- Sr
- SO4
- tfp
- th
- tn
- toc
- tp
- tss
- temp_sensor
- Zn

## Requirements
**Python:**

* Python>=3.9 (required by pandas 2.1)
* Jupyter
* geopandas=0.10.2
* numpy
* os
* pandas=2.1.3
* scipy=1.9.0
* tqdm

Check the GitHub repository for an environment.yml (for conda environments) or requirements.txt (pip) file.

**Files:**

* naduf_data_1981-2020_v6.xlsx


**Directory:**

* Clone the GitHub repository locally.
* Place any third-party data in their respective directory.
* ONLY update the "PATH" variable in the section "Configurations" with the relative path to the EStreams directory.


## References
* NADUF. National River Monitoring and Survey Programme, https://www.bafu.admin.ch/bafu/en/home/topics/water/state/water--monitoring-networks/national-surface-water-quality-monitoring-programme--nawa-/national-river-monitoring-and-survey-programme--naduf-.html (last access: 20 Sep 2024).

## Observations
* None

# Import modules

In [None]:
import pandas as pd
import tqdm
import os
import warnings

# Configurations

In [None]:
# Only editable variables:
# Relative path to your local directory
PATH = "../.."

# Suppress all warnings
warnings.filterwarnings("ignore")

# Path to where the data are stored
path_naduf = r"C:\Users\nascimth\Documents\data\CAMELS_CH_Chem\\"

* #### Users should NOT change anything in the code below this point.

In [None]:
# Non-editable variables:
PATH_OUTPUT = r"results\Dataset\stream_water_chemistry\interval_samples"

# Set the directory:
os.chdir(PATH)

# Import data

In [None]:
# Full dataset of interval (time-series)
dataset_naduf = pd.read_excel(path_naduf+r"data\NADUF\naduf_data_1981-2020_v6.xlsx")
dataset_naduf

- Network

In [None]:
# Network NADUF
network_naduf = pd.read_excel(path_naduf+r"data\CAMELS_CH_chem_stations_short_v3.xlsx", sheet_name='naduf')
network_naduf

In [None]:
len(dataset_naduf.naduf_id.unique())

In [None]:
dataset_naduf.naduf_id.unique()

Observations
- The naduf_id 1827 is listed in the network but is not present in the dataset.
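
The check behind this observation can be sketched as a set difference between the IDs listed in the network sheet and those found in the time series. The two small frames below are hypothetical stand-ins for `network_naduf` and `dataset_naduf`; only the ID 1827 comes from this notebook, the other IDs are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for network_naduf and dataset_naduf
# (IDs other than 1827 are invented for illustration)
network_demo = pd.DataFrame({"naduf_id": [1001, 1002, 1827]})
dataset_demo = pd.DataFrame({"naduf_id": [1001, 1001, 1002]})

# IDs listed in the network but without any record in the dataset
missing_ids = sorted(set(network_demo["naduf_id"]) - set(dataset_demo["naduf_id"]))
print(missing_ids)
```

Running the same expression on the real tables should list every station that will produce an empty output file.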

### Renaming the columns

In [None]:
dataset_naduf.columns

In [None]:
column_rename_dict = {
    'naduf_id': 'nawafracht_id', 
    'status_number': 'status_number', 
    'remark':'remark' , 
    'year':'year', 
    'date_end':'date_end', 
    'duration':'duration',
    'mean_discharge': 'q_mean_sensor',
    'total_discharge': 'total_discharge(Miom3)',
    'temperature_BAFU': 'temp_sensor',
    'pH_BAFU': 'pH_sensor',
    'conductivity_25C_BAFU': 'ec25_sensor',
    'oxygen': 'O2C_sensor',
    'oxygen_saturation': 'O2S_sensor',
    'pH_lab': 'pH_lab',
    'conductivity_20C_lab': 'ec20_lab',
    'total_hardness': 'th',
    'alkalinity': 'alk',
    'calcium': 'Ca',
    'magnesium': 'Mg',
    'nitrate': 'NO3_N',
    'total_nitrogen': 'tn',
    'DRP': 'drp',
    'total_phosphorus': 'tp',
    'total_phosphorus_filtered': 'tfp',
    'chloride': 'Cl',
    'fluoride': 'F',
    'bromide': 'Br',
    'silicate': 'H4SiO4',
    'sulphate': 'SO4',
    'sodium': 'Na',
    'potassium': 'K',
    'iron': 'Fe',
    'TOC': 'toc',
    'DOC': 'doc',
    'suspended_material': 'tss',
    'chromium': 'Cr',
    'zinc': 'Zn',
    'copper': 'Cu',
    'cadmium': 'Cd',
    'lead': 'Pb',
    'nickel': 'Ni',
    'mercury': 'Hg',
    'barium': 'Ba',
    'strontium': 'Sr',
    'arsenic': 'As',
    'manganese': 'Mn'
}

In [None]:
# Rename columns based on the dictionary
dataset_naduf.rename(columns=column_rename_dict, inplace=True)
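
`DataFrame.rename` skips mapping keys that match no column and does not raise an error by default, so a misspelled key in the dictionary would go unnoticed. A minimal sketch of a sanity check, using a hypothetical two-column table and a deliberately misspelled column name (`chlorid`):

```python
import pandas as pd

# Hypothetical miniature of the raw table; 'chlorid' is a deliberate misspelling
raw_demo = pd.DataFrame({"naduf_id": [1], "chlorid": [2.5]})
rename_demo = {"naduf_id": "nawafracht_id", "chloride": "Cl"}

# Keys of the mapping that match no column are skipped without an error
unmatched = [key for key in rename_demo if key not in raw_demo.columns]
renamed = raw_demo.rename(columns=rename_demo)

print(unmatched)
print(list(renamed.columns))
```

An empty `unmatched` list on the real data would indicate that every key in `column_rename_dict` was applied.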

In [None]:
# Shortest sampling interval in days (duration is treated as hours below)
(dataset_naduf.duration/24).min()

In [None]:
# Convert to datetime:
dataset_naduf["date_end"] = pd.to_datetime(dataset_naduf["date_end"], format='%Y-%m-%d')

# Subtract hours (duration_column) from datetime_column
dataset_naduf['date_start'] = dataset_naduf['date_end'] - pd.to_timedelta(dataset_naduf['duration'], unit='h')

# Round the datetime to the nearest second
dataset_naduf['date_start'] = dataset_naduf['date_start'].dt.round('S')

dataset_naduf
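
The derivation of `date_start` can be illustrated on a single hypothetical sample, assuming (as the cell above does) that `duration` is expressed in hours. The values below are invented and are not taken from the NADUF file.

```python
import pandas as pd

# Hypothetical sample: a 14-day (336 h) interval ending on 2020-01-15
demo = pd.DataFrame({"date_end": ["2020-01-15"], "duration": [336.0]})
demo["date_end"] = pd.to_datetime(demo["date_end"], format="%Y-%m-%d")

# Start of the interval = end of the interval minus its duration in hours
demo["date_start"] = demo["date_end"] - pd.to_timedelta(demo["duration"], unit="h")
print(demo["date_start"].iloc[0])
```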

In [None]:
#dataset_naduf = dataset_naduf[['naduf_id', 'date',
#       'mean_discharge(m3/s)',
#       'temperature(°C)', 'pH(-)', 'conductivity_25C(µS/cm)',
#       'oxygen(mg/l)', 'oxygen_saturation(%)', 'pH_lab(-)',
#       'conductivity_20C_lab(µS/cm)', 'total_hardness(mmol/l)',
#       'alkalinity(mmol/l)', 'calcium(mg/l)', 'magnesium(mg/l)',
#       'nitrate(mgN/l)', 'total_nitrogen(mgN/l)', 'DRP(mgP/l)',
#       'total_phosphorus(mgP/l)', 'total_phosphorus_filtered(mgP/l)',
#       'chloride(mg/l)', 'fluoride(mg/l)', 'bromide(mg/l)',
#       'silicate(mgH4SiO4/l)', 'sulphate(mgSO4/l)', 'sodium(mg/l)',
#       'potassium(mg/l)', 'iron(mg/l)', 'TOC(mgC/l)', 'DOC(mgC/l)',
#       'suspended_material(mg/l)', 'chromium(µg/l)', 'zinc(µg/l)',
#       'copper(µg/l)', 'cadmium(µg/l)', 'lead(µg/l)', 'nickel(µg/l)',
#       'mercury(µg/l)', 'barium(µg/l)', 'strontium(µg/l)', 'arsenic(µg/l)',
#       'manganese(µg/l)']]
#dataset_naduf

In [None]:
dataset_naduf = dataset_naduf[[
    'nawafracht_id', 
    'date_start',
    'date_end', 
    'alk',
    'As',
    'Ba',
    'Br',
    'Cd',
    'Ca',
    'Cl',
    'Cr',
    'Cu',
    'doc',
    'drp',
    'ec25_sensor',
    'ec20_lab',
    'F',
    'Fe',
    'Pb',
    'Mg',
    'q_mean_sensor',
    'Hg',
    'Ni',
    'NO3_N',
    'O2C_sensor',
    'O2S_sensor',
    'pH_lab',
    'pH_sensor',
    'K',
    'H4SiO4',
    'Na',
    'Sr',
    'SO4',
    'tfp',
    'th',
    'tn',
    'toc',
    'tp',
    'tss',
    'temp_sensor',
    'Zn',
    ]]
dataset_naduf

In [None]:
# Function to round numbers and preserve symbols
def round_values(val):
    if isinstance(val, str):  # Handle string values with symbols
        if val.startswith('>') or val.startswith('<'):
            symbol = val[0]  # Extract the symbol ('>' or '<')
            try:
                number = float(val[1:])  # Convert the rest to a float
                return f"{symbol}{round(number, 4)}"
            except ValueError:  # Handle cases where conversion might fail
                return val
        else:
            try:
                return str(round(float(val), 4))  # Round plain string numbers
            except ValueError:
                return val  # Return original value if conversion fails
    elif isinstance(val, (int, float)):  # Handle numeric values
        return round(val, 4)
    return val  # Return unchanged if it's neither string nor numeric

In [None]:
# Network CAMELS_CH_Chem
network_camels_ch_chem = pd.read_excel(path_naduf+r"data\CAMELS_CH_chem_stations_short_v3.xlsx", sheet_name='all_5')
#network_camels_ch_chem.set_index("basin_id", inplace=True)
network_camels_ch_chem

In [None]:
for code in tqdm.tqdm(network_naduf.naduf_id):
    
    # Copy the station's rows so the in-place edits below do not act on a view of dataset_naduf
    dataset = dataset_naduf[dataset_naduf["nawafracht_id"] == code].copy()
    dataset.set_index("date_start", inplace = True)
    dataset.drop(["nawafracht_id"], axis=1, inplace = True)
    
    dataset.index.name = "date_start"
    
    # Round every cell to 4 decimals, keeping any leading '<' or '>' marker
    dataset = dataset.applymap(round_values)

    # There are some non-numeric things in the columns, instead of NaNs
    #dataset = dataset.apply(pd.to_numeric, errors='coerce')
    
    # Here we take out the > or < before converting to a numeric value:
    #dataset = dataset.applymap(lambda x: str(x).replace('<', '') if isinstance(x, str) else x)
    #dataset = dataset.applymap(lambda x: str(x).replace('>', '') if isinstance(x, str) else x)

    # There are some non-numeric things in the columns, instead of NaNs
    #dataset = dataset.apply(pd.to_numeric, errors='coerce')

    #dataset = dataset.round(4)
    basin_id_name = str(network_camels_ch_chem[network_camels_ch_chem.naduf_id == code].loc[:, "basin_id"].values[0])

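    # Build the output path with os.path.join to avoid invalid escape sequences such as "\c"
    dataset.to_csv(os.path.join(PATH_OUTPUT, "nawa_fracht", "camels_ch_chem_nawafracht_" + basin_id_name + ".csv"), encoding='latin')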
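
Because the exported files keep the `<` and `>` markers as text, any column containing them is read back as strings. The sketch below shows one possible way to separate the numeric value from the marker when loading a file. The helper `split_censored` and the CSV excerpt are hypothetical and are not part of this notebook.

```python
import io
import pandas as pd

# Hypothetical excerpt of an exported file (values invented for illustration)
csv_text = (
    "date_start,date_end,Cd,Zn\n"
    "2020-01-01,2020-01-15,<0.01,1.2\n"
    "2020-01-15,2020-01-29,0.03,>50.0\n"
)
demo = pd.read_csv(io.StringIO(csv_text), index_col="date_start")

def split_censored(column):
    """Return (numeric values, marker) for a column that may hold '<x' or '>x'."""
    text = column.astype(str).str.strip()
    marker = text.str.extract(r"^([<>])", expand=False)  # '<', '>' or NaN
    values = pd.to_numeric(text.str.lstrip("<>"), errors="coerce")
    return values, marker

cd_values, cd_marker = split_censored(demo["Cd"])
print(cd_values.tolist(), cd_marker.tolist())
```

How the marked values are then treated (kept, halved, discarded) is an analysis decision that this notebook leaves to the user.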

Observations
- We have 24 stations in total (one is empty: 1827)
- So far, the intervals are variable (not resampled)
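
Since the samples are not resampled, the length of each interval can be recovered from the two date columns. A minimal sketch on a hypothetical excerpt (dates invented for illustration):

```python
import pandas as pd

# Hypothetical excerpt with one 7-day and one 14-day sample
demo = pd.DataFrame({
    "date_start": pd.to_datetime(["2020-01-01", "2020-01-08"]),
    "date_end": pd.to_datetime(["2020-01-08", "2020-01-22"]),
})

# Interval length in days for each sample
interval_days = (demo["date_end"] - demo["date_start"]).dt.total_seconds() / 86400
print(interval_days.tolist())
```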

# End