# fDOM Data Merging
The main goal of these merging files is to combine all labeled timeline data into a single source. This makes data augmentation easier and allows a single classifier to detect all types of anomalies, rather than needing one classifier per anomaly type. This specific file merges the fDOM labeled data into a single file.

In [None]:
# Imports
import pandas as pd

The following functions are helpers for the rest of the merging process.

In [None]:
def print_entire_df(dataframe):
    """Print the entire contents of a dataframe; useful for spotting differences. (WARNING: the entire output will go into GitHub if the file is committed with cell output.)"""
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        print(dataframe)

## Load in data

In [None]:
# Load in all of the datasets:
fDOM_PLP_path = '../Data/labeled_data/ground_truths/fDOM/fDOM_PLP/julian_time/fDOM_PLP_0k-300k.csv'
fDOM_SKP_path = '../Data/labeled_data/ground_truths/fDOM/fDOM_SKP/julian_time/fDOM_SKP_0k-300k.csv'
fDOM_PP_path = '../Data/labeled_data/ground_truths/fDOM/fDOM_PP/julian_time/fDOM_PP_0k-300k.csv'
fDOM_FPT_path = '../Data/labeled_data/ground_truths/fDOM/fDOM_FPT/julian_time/fDOM_FPT_0k-300k.csv'
fDOM_FSK_path = '../Data/labeled_data/ground_truths/fDOM/fDOM_FSK/julian_time/fDOM_FSK_0k-300k.csv'

# Load in dataframes
fDOM_PLP_df = pd.read_csv(fDOM_PLP_path)
fDOM_SKP_df = pd.read_csv(fDOM_SKP_path)
fDOM_PP_df = pd.read_csv(fDOM_PP_path)
fDOM_FPT_df = pd.read_csv(fDOM_FPT_path)
fDOM_FSK_df = pd.read_csv(fDOM_FSK_path)

# update indices to use timestamp
fDOM_PLP_df.set_index('timestamp_of_peak', inplace=True)
fDOM_SKP_df.set_index('timestamp_of_peak', inplace=True)
fDOM_PP_df.set_index('timestamp_of_peak', inplace=True)
fDOM_FPT_df.set_index('timestamp_of_peak', inplace=True)
fDOM_FSK_df.set_index('timestamp_of_peak', inplace=True)

In [None]:
# Visualize these dataframes
print("PLP Head:")
print(fDOM_PLP_df.head())

print("\nSKP Head:")
print(fDOM_SKP_df.head())

print("\nPP Head:")
print(fDOM_PP_df.head())

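## Peak Precedence
The following code block sets the order of precedence of the peak types. When two labeled datasets contain the same timestamp, the label from the higher-precedence dataset is kept.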

In [None]:
# SET PRECEDENCE OF PEAKS (highest first, matching the assignments below)
# phantom > skyrocketing > plummeting > flat plateau > flat sink
TOP = fDOM_PP_df
SECOND = fDOM_SKP_df
THIRD = fDOM_PLP_df
FOURTH = fDOM_FPT_df
FIFTH = fDOM_FSK_df


## Merge Data
We concatenate all five dataframes in precedence order, stable sort by timestamp, and then drop duplicate timestamps, keeping the first occurrence.
Following this, we rename all labels starting with N to NAP (not anomaly peaks).

Because the sort is stable, rows that share a timestamp keep the order in which they were concatenated, so dropping duplicates preserves our peak precedence whenever labels overlap.
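As a rough illustration of why the stable sort matters, here is a minimal sketch on two toy dataframes. The `label` column name and the label strings are assumptions made for this example and are not taken from the real CSV files; the sketch also uses `sort_index`, since the timestamp is the index here.

```python
import pandas as pd

# Hypothetical toy data: the 'label' column and its values are illustrative only.
top = pd.DataFrame(
    {"timestamp_of_peak": [2.0, 5.0], "label": ["PP", "NPP"]}
).set_index("timestamp_of_peak")
second = pd.DataFrame(
    {"timestamp_of_peak": [2.0, 3.0], "label": ["NSKP", "SKP"]}
).set_index("timestamp_of_peak")

# Concatenate in precedence order: rows from `top` come first.
merged = pd.concat([top, second])

# A stable sort keeps tied timestamps in their concatenation order.
merged = merged.sort_index(kind="stable")

# Keep only the first row per timestamp, i.e. the higher-precedence label.
merged = merged[~merged.index.duplicated(keep="first")]

# Rename labels that start with N to NAP; '^' anchors the match to the start.
merged = merged.replace(to_replace=r"^N.*", value="NAP", regex=True)

print(merged)
```

In this sketch, timestamp 2.0 appears in both frames, and the row from `top` survives because it was concatenated first.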

In [None]:
# concat all five dataframes, in precedence order
df = pd.concat([TOP, SECOND, THIRD, FOURTH, FIFTH])

# sort values
df = df.sort_values(by=['timestamp_of_peak'], kind='stable')

# remove duplicates
df = df[~df.index.duplicated(keep='first')]

# rename all labels that start with N to be NAP using regex
# ('^' anchors the match so only values beginning with N are replaced)
final_data = df.replace(to_replace=r'^N.*', value="NAP", regex=True)


## Output to CSV
The following code block exports the newly created timeline to a CSV file.

In [None]:
# specify the name and path of the file
filename = '../Data/labeled_data/ground_truths/fDOM/fDOM_all_julian_0k-300k.csv'

# write to csv
final_data.to_csv(filename)