# QA/QC
 
### What is QA/QC:
QA/QC (quality assurance and quality control) is the task of annotating each collected observation with a quality flag:

- GOOD
- BAD
- SUSPECT
- UNKNOWN
    
### Why QA/QC is needed:

Conditions in the natural environment vary, so observations collected by sensors may not be reliable.  
Quality assurance and control of sensor data is therefore essential to ensure that the data is usable.

### What are methods to annotate data with QA/QC flags:

[IOOS](https://ioos.github.io/ioos_qc/resources.html) has defined standard statistical methods for quality-checking ocean time-series data. For each essential ocean variable (EoV), a different set of statistical tests is recommended to ensure the quality of the collected data. Please read the [manual](https://github.com/ioos/ioos_qc/blob/main/resources/argo-quality-control-manual.pdf) and the [in-situ manual](https://cdn.ioos.noaa.gov/media/2019/08/QARTOD_Currents_Update_Second_Final.pdf).


### Why thresholds are necessary

[IOOS](https://ioos.github.io/ioos_qc/resources.html) developed the IOOS_QC [QARTOD](https://ioos.github.io/ioos_qc/examples/Qartod_Single_Test_Example.html) Python package, in which different statistical functions are implemented. Each statistical function requires a series of observations (time-series data) as well as additional parameters, which are referred to as thresholds.

For example:

The most basic test (QARTOD function) for each EoV is the `range test`, where the natural range of the variable is used to validate each observation (data value). E.g., for sea surface temperature the global range is between -2.5 and 40.0: the lowest recorded temperature is -2.5 and the highest recorded temperature is 40.0. However, temperatures rarely reach these extremes, which is why the `range test` function also takes a suspect threshold that defines which values are unusual enough to be flagged as suspect.
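The idea of a range test with a fail span and a suspect span can be sketched in plain NumPy. This is a minimal illustration, not the IOOS_QC implementation: the function name `range_test` and the suspect span `(0.0, 25.0)` are made up for the example, while the fail span is the global range quoted above.

```python
import numpy as np

# Flag codes follow the QARTOD convention (GOOD=1, SUSPECT=3, FAIL=4, MISSING=9).
GOOD, SUSPECT, FAIL, MISSING = 1, 3, 4, 9


def range_test(values, fail_span, suspect_span):
    """Flag each value as GOOD, SUSPECT, FAIL or MISSING based on two ranges."""
    values = np.asarray(values, dtype=float)
    flags = np.full(values.shape, GOOD, dtype=int)
    # values outside the narrower suspect span are unusual
    flags[(values < suspect_span[0]) | (values > suspect_span[1])] = SUSPECT
    # values outside the wider fail span are outside the natural range
    flags[(values < fail_span[0]) | (values > fail_span[1])] = FAIL
    # missing observations cannot be evaluated
    flags[np.isnan(values)] = MISSING
    return flags


# fail_span is the global range from the text; suspect_span is a hypothetical choice
flags = range_test([12.3, 27.0, -3.1, np.nan],
                   fail_span=(-2.5, 40.0),
                   suspect_span=(0.0, 25.0))
print(flags.tolist())  # [1, 3, 4, 9]
```

The fail check is applied after the suspect check so that a value outside both spans ends up flagged as FAIL.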

### What current datasets are available with QA/QC annotations

In the CIOOS Atlantic data repository, CMAR is the only partner whose datasets are annotated with QA/QC flags. Here is the list of CMAR datasets:

1. Annapolis County
2. Antigonish County
3. Cape Breton County
4. Colchester County
5. Digby County
6. Inverness County
7. Lunenburg County
8. Pictou County
9. Queens County
10. Richmond County
11. Shelburne County
12. Victoria County
13. Yarmouth County

A description of the [CMAR QC Test & Threshold](https://dempsey-cmar.github.io/cmp-data-governance/pages/qc_tests.html#fn1) is available. The EoVs annotated in these datasets are:

- dissolved oxygen
- sea surface temperature
- salinity
- depth check


### What we are trying to estimate

With the help of domain experts, CMAR has defined [thresholds](res/2024-10-24_cmar_water_quality_thresholds.csv) for each EoV with respect to QARTOD function and month. Our objective is to learn these thresholds from the existing annotated data. Estimating the thresholds from known data will let us predict QA/QC flags for data that has not been annotated.  

In this empirical study, we are focusing on `sea surface temperature`.

## Step#1: Download the data as csv
The first row of the csv file contains the field names and the second row contains the unit of each field.
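
Step#3 reads the file with `skiprows=[1]`, which drops the units line that follows the header. The sketch below shows the effect on a tiny in-memory sample. The column names are the ones used later in this document, but the unit strings and data values are invented for illustration.

```python
import io

import pandas as pd

# Made-up sample: header line, units line, then data lines.
raw = io.StringIO(
    "time,temperature,qc_flag_temperature\n"
    "UTC,degree_C,\n"
    "2024-01-01T00:00:00Z,4.2,Pass\n"
    "2024-01-01T01:00:00Z,4.1,Pass\n"
)

# skiprows=[1] drops the units line (file line index 1) and keeps the header (index 0)
df = pd.read_csv(raw, skiprows=[1], parse_dates=["time"])
print(df.dtypes)
```

Without `skiprows=[1]`, the units line would be read as a data row and `temperature` would be parsed as strings instead of floats.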

## Step#2: Replacing string flags with integer flags
By default, the raw data stores flags as strings. To make processing easier, we replace each string with an integer code and save the result as a new .csv file.

In [None]:
import os.path

import pandas as pd
import math

class QartodFlags:
    """Primary flags for QARTOD."""
    GOOD = 1
    UNKNOWN = 2
    SUSPECT = 3
    FAIL = 4
    MISSING = 9


# TARGET EoV is SEA SURFACE TEMPERATURE
eov_range = (-2.0, 40)
eov_col_name = 'temperature'
eov_flag_name = 'qc_flag_temperature'
window_hour = 12
min_rows_in_a_chunk = 6
minimum_rows_for_each_group = 50


################# REPLACE FLAGS FROM STRING TO INT FROM CSV##################
def custom_replacement(value):
    if value == 'Not Evaluated':
        return QartodFlags.UNKNOWN
    elif value == 'Pass':
        return QartodFlags.GOOD
    elif value == 'Suspect/Of Interest':
        return QartodFlags.SUSPECT
    elif value == 'Fail':
        return QartodFlags.FAIL
    elif pd.isna(value):  # float(value) would raise ValueError on an unexpected string
        return -1
    else:
        print(f"Unknown [{value}]")

    return value

csv_name = "D://CIOOS-Full-Data/Annapolis County Water Quality Data.csv"
out_name = csv_name.replace(".csv", "_FlagCode.csv")
# the output is written in append mode, so remove any file left over from a previous run
if os.path.exists(out_name):
    os.remove(out_name)

# the file may be big, so process it in chunks
df_chunks = pd.read_csv(csv_name, chunksize=10000)
columns_ = None
header_written = False
for df in df_chunks:
    if columns_ is None:
        # columns whose name contains `qc` are flag columns
        columns_ = [col for col in df.columns if "qc" in col.lower()]
    for col in columns_:
        df[col] = df[col].apply(custom_replacement)

    # write the header only once, with the first chunk
    df.to_csv(out_name, index=False, mode='a', header=not header_written)
    header_written = True

# output: Annapolis County Water Quality Data_FlagCode.csv
#######################################################
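
As an alternative to calling a Python function on every cell, the same replacement can be sketched with a dictionary and `Series.map`. This is only an illustration: `FLAG_CODES` and `flags_to_codes` are hypothetical names, and unlike `custom_replacement` this version maps unrecognised labels to -1 instead of returning them unchanged.

```python
import pandas as pd

# Same label-to-code mapping as custom_replacement
FLAG_CODES = {
    "Pass": 1,                 # QartodFlags.GOOD
    "Not Evaluated": 2,        # QartodFlags.UNKNOWN
    "Suspect/Of Interest": 3,  # QartodFlags.SUSPECT
    "Fail": 4,                 # QartodFlags.FAIL
}


def flags_to_codes(column: pd.Series) -> pd.Series:
    """Map string flags to integers; missing or unrecognised labels become -1."""
    return column.map(FLAG_CODES).fillna(-1).astype(int)


sample = pd.Series(["Pass", "Fail", None, "Suspect/Of Interest", "Not Evaluated"])
codes = flags_to_codes(sample)
print(codes.tolist())  # [1, 4, -1, 3, 2]
```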

## Step#3: Grouping the data
The data is collected at different stations, and each station may have one or more sensors installed. Thus we need to group the data by station and sensor.

In the code below, the csv `dataframe` is grouped by `station` and `sensor_serial_number`. Within each group, we also check that the data was collected in the same geographical area. To do so, the coordinates are clustered with a tolerance of `eps=0.001`, and a group that spans several locations is split into one file per location.  

In [None]:
import numpy as np
from sklearn.cluster import DBSCAN

def group_with_dbscan(values, eps=0.001): 
    # Convert to 2D array as required by DBSCAN
    X = np.array(values).reshape(-1, 1)
    db = DBSCAN(eps=eps, min_samples=1)
    labels = db.fit_predict(X)
    # Group values by cluster label
    groups = []
    for label in sorted(set(labels)):
        group = [v for v, l in zip(values, labels) if l == label]
        groups.append(group)
    return groups


csv_name = "D://CIOOS-Full-Data/Annapolis County Water Quality Data_FlagCode.csv"
# lets make subdirectory
dir__ = os.path.dirname(csv_name) # D://CIOOS-Full-Data/
filename_ = os.path.basename(csv_name) #Annapolis County Water Quality Data_FlagCode.csv
new_directory_name = filename_.split(" ")[0]  #Annapolis
save_dir_ = os.path.join(dir__, new_directory_name)
os.makedirs(save_dir_, exist_ok=True) # D://CIOOS-Full-Data/Annapolis/ (no error if it already exists)
save_name =  os.path.join( save_dir_, filename_.replace("_FlagCode.csv","")) 


df_ = pd.read_csv(csv_name, parse_dates=['time'], skiprows=[1])
df_['latitude'] = df_['latitude'].astype(np.float32).round(4)
df_['longitude'] = df_['longitude'].astype(np.float32).round(4)

groups_ = df_.groupby(by=['station', 'sensor_serial_number'])
id_ = 1
for grp_, chunk in groups_:
    if chunk.shape[0] < minimum_rows_for_each_group: 
        continue
    # check if the data of more than 1 day is collected
    d_threshold = (pd.to_datetime(chunk['time'].max()) - pd.to_datetime(chunk['time'].min())).days > 1
    if not d_threshold:
        continue

    # checking if the data from same sensor is collected from different geographical area
    lat_uni_ = chunk['latitude'].unique()
    lon_uni_ = chunk['longitude'].unique()
    a_ = lat_uni_.std()
    b_ = lon_uni_.std()
    if a_ > 0.001:
        rows__ = chunk.shape[0]
        total_rows__ = 0
        lat_groups_ = group_with_dbscan(values=lat_uni_)
        for j, lat_grp in enumerate(lat_groups_):
            sub_chunk_ = chunk[ (chunk['latitude'] >= min(lat_grp)) &  (chunk['latitude'] <= max(lat_grp)) ]
            sub_b_ = sub_chunk_['longitude'].astype(np.float32).unique()
            assert sub_b_.std() <= 0.001, f"many longitudes {sub_b_}"
            sub_chunk_.to_csv(save_name + f"-{id_}-{j}.csv", index=False)
            total_rows__ += sub_chunk_.shape[0]
            print(save_name + f"-{id_}-{j}.csv")
        assert total_rows__ == rows__, f"LAT sub chunk rows not equal to main chunk [{lat_grp}] - [{lat_uni_}]"

    else:
        if b_ > 0.001:
            rows__ = chunk.shape[0]
            total_rows__ = 0
            lon_groups_ = group_with_dbscan(values=lon_uni_)
            for j, lon_grp in enumerate(lon_groups_):
                sub_chunk_ = chunk[(chunk['longitude'] >= min(lon_grp)) & (chunk['longitude'] <= max(lon_grp))]
                sub_chunk_.to_csv(save_name + f"-{id_}-{j}.csv", index=False)
                print(save_name + f"-{id_}-{j}.csv")
                total_rows__ += sub_chunk_.shape[0]

            assert total_rows__ == rows__, f"LON sub chunk rows not equal to main chunk [{lon_groups_}] - [{lon_uni_}]"
        else:
            chunk.to_csv(save_name+f"-{id_}.csv", index=False)
    id_+=1
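
To see what the coordinate grouping does, note that with `min_samples=1` every point is a core point, so on one-dimensional data DBSCAN simply splits the sorted values wherever two neighbours are more than `eps` apart. The sketch below reproduces that behaviour with NumPy only. `group_by_gap` is a hypothetical helper and the latitudes are made up.

```python
import numpy as np


def group_by_gap(values, eps=0.001):
    """Split 1-D values into groups wherever sorted neighbours differ by more than eps."""
    ordered = np.sort(np.asarray(values, dtype=float))
    if ordered.size == 0:
        return []
    # positions in the sorted array where a new group starts
    breaks = np.where(np.diff(ordered) > eps)[0] + 1
    return [part.tolist() for part in np.split(ordered, breaks)]


# made-up latitudes: two readings at one site, one reading at another
groups = group_by_gap([44.7004, 44.75, 44.7])
print(groups)  # [[44.7, 44.7004], [44.75]]
```

Each resulting group corresponds to one location, which is why the code above writes one file per group.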

![Time-window](res/final_report_diagram.jpg)

## Current Feature Set:
1)	Rolling Standard Deviation
2)	Past-window mean – current value
3)	Future-window mean – current value
4)	Future-window mean – Past-window mean
5)	current value – ( (lead value – lag value)  / 2)
6)	Month Average Hourly Change – (lag Value – current Value)
7)	Month Average Hourly Change – (lead Value – current Value)
8)	(Current Value – q_997) if current value > q_997 else 0
9)	(Current Value – q_003) if current value < q_003 else 0
10)	(Current Value – fwq_997) if current value > fwq_997 else 0
11)	(Current Value – fwq_003) if current value < fwq_003 else 0
12)	Month (1 - 12)
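
A few of these features (1 to 5 and 12) can be sketched with pandas as follows. This is a minimal illustration under assumptions: the samples are evenly spaced so that a window of `window` rows covers a fixed time span, the rolling standard deviation uses a trailing window, and the function and column names are hypothetical. Features 6 to 11 are left out because they depend on monthly averages and quantiles that are not defined here.

```python
import pandas as pd


def window_features(series: pd.Series, window: int = 12) -> pd.DataFrame:
    """Compute window-based features for a series indexed by timestamp."""
    # mean of up to `window` samples strictly before the current one
    past_mean = series.shift(1).rolling(window, min_periods=1).mean()
    # mean of up to `window` samples strictly after the current one:
    # reverse the series, take the same "past" mean, then reverse back
    future_mean = series[::-1].shift(1).rolling(window, min_periods=1).mean()[::-1]
    lag, lead = series.shift(1), series.shift(-1)
    feats = pd.DataFrame({
        "rolling_std": series.rolling(window, min_periods=2).std(),
        "past_mean_minus_value": past_mean - series,
        "future_mean_minus_value": future_mean - series,
        "future_mean_minus_past_mean": future_mean - past_mean,
        "value_minus_half_lead_lag_diff": series - (lead - lag) / 2,
    })
    feats["month"] = series.index.month.to_numpy()
    return feats


# made-up hourly temperatures
index = pd.date_range("2024-01-01", periods=5, freq=pd.Timedelta(hours=1))
demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=index)
feats = window_features(demo, window=2)
print(feats)
```

The first and last rows contain NaN wherever a past or future neighbour does not exist, so those rows need to be dropped or filled before training a model.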
