# fDOM Data Merging
This file's main goal is to merge all labeled timeline data into a single source. A single timeline makes data augmentation easier and allows one classifier to detect all types of anomalies, rather than one classifier per anomaly type. This specific file merges the fDOM labeled data into a single file.

In [1]:
# Imports
import pandas as pd

The following functions are helpers for the rest of the merging process.

In [2]:
def print_entire_df(dataframe):
    """Print the entire contents of a dataframe; useful if you need to see differences.

    WARNING: the entire output will go into GitHub if the file is committed with cell output.
    """
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        print(dataframe)
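
The helper relies on `pd.option_context`, which changes display options only inside the `with` block. The following is a minimal sketch of that behavior, not part of the merging pipeline; the variable names `before`, `inside`, and `after` are illustrative only.

```python
import pandas as pd

# Record the display limit before entering the context manager.
before = pd.get_option("display.max_rows")

# Inside the block the row limit is lifted (None means "no limit").
with pd.option_context("display.max_rows", None):
    inside = pd.get_option("display.max_rows")

# After the block the previous setting is restored automatically.
after = pd.get_option("display.max_rows")

print(before, inside, after)
```

Because the setting is restored on exit, calling `print_entire_df` should not affect how later cells display dataframes.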

## Load in data

In [3]:
# Load in all of the datasets:
fDOM_PLP_path = '../Data/labeled_data/ground_truths/fDOM/fDOM_PLP/julian_time/fDOM_PLP_0k-300k.csv'
fDOM_SKP_path = '../Data/labeled_data/ground_truths/fDOM/fDOM_SKP/julian_time/fDOM_SKP_0k-300k.csv'
fDOM_PP_path = '../Data/labeled_data/ground_truths/fDOM/fDOM_PP/julian_time/fDOM_PP_0k-300k.csv'

# Load in dataframes
fDOM_PLP_df = pd.read_csv(fDOM_PLP_path)
fDOM_SKP_df = pd.read_csv(fDOM_SKP_path)
fDOM_PP_df = pd.read_csv(fDOM_PP_path)

# update indices to use timestamp
fDOM_PLP_df.set_index('timestamp_of_peak', inplace=True)
fDOM_SKP_df.set_index('timestamp_of_peak', inplace=True)
fDOM_PP_df.set_index('timestamp_of_peak', inplace=True)

In [4]:
# Visualize these dataframes
print("PLP Head:")
print(fDOM_PLP_df.head())

print("\nSKP Head:")
print(fDOM_SKP_df.head())

print("\nPP Head:")
print(fDOM_PP_df.head())

PLP Head:
                   value_of_peak label_of_peak  idx_of_peak
timestamp_of_peak                                          
2.456064e+06           112.40602          NPLP         2083
2.456077e+06           113.10874          NPLP         3270
2.456077e+06            84.50452          NPLP         3276
2.456077e+06            90.15410          NPLP         3294
2.456077e+06            96.68559          NPLP         3300

SKP Head:
                   value_of_peak label_of_peak  idx_of_peak
timestamp_of_peak                                          
2.456049e+06            28.46222          NSKP          616
2.456056e+06            38.09339          NSKP         1318
2.456063e+06            38.94278          NSKP         1993
2.456064e+06            43.10656          NSKP         2091
2.456077e+06            20.55849          NSKP         3269

PP Head:
                   value_of_peak label_of_peak  idx_of_peak
timestamp_of_peak                                          
2.456045e

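## Peak Precedence
The following code block sets the order of precedence of the peaks: skyrocketing peaks (SKP) first, then phantom peaks (PP), then plummeting peaks (PLP). This order decides which row is kept when the same timestamp appears in more than one dataset.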

In [5]:
# SET PRECEDENCE OF PEAKS
# highest to lowest: skyrocketing (SKP), phantom (PP), plummeting (PLP)
TOP = fDOM_SKP_df
TOP_ACRO = "SKP"
TOP_NO_ACRO = "NSKP"

SECOND = fDOM_PP_df
SECOND_ACRO = "PP"
SECOND_NO_ACRO = "NPP"

THIRD = fDOM_PLP_df
THIRD_ACRO = "PLP"
THIRD_NO_ACRO = "NPLP"


## Merge Data
We concatenate all three dataframes into a single dataframe, stable-sort it by timestamp, and then drop all rows with duplicate timestamps.
Following this, we rename all labels starting with N to NAP (not an anomaly peak).

A stable sort keeps rows with equal timestamps in the order in which they were concatenated, so when we drop duplicates and keep the first occurrence, our peak precedence is preserved should any peaks overlap.

In [6]:
# concat all three dataframes
df = pd.concat([TOP, SECOND, THIRD])

# sort values
df = df.sort_values(by=['timestamp_of_peak'], kind='stable')

# remove duplicates
df = df[~df.index.duplicated(keep='first')]

# rename all labels that start with N to NAP; the regex is anchored so only a leading N matches
final_data = df.replace(to_replace=r'^N.*', value="NAP", regex=True)
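
To make the merge steps concrete, here is a small sketch on toy data. The three frames `skp`, `pp`, and `plp` and all of their values are made up for illustration; only the column names and the sequence of operations mirror the cell above.

```python
import pandas as pd


def toy_frame(timestamps, values, labels, idxs):
    """Build a small frame shaped like the labeled fDOM data (hypothetical helper)."""
    return pd.DataFrame(
        {"value_of_peak": values, "label_of_peak": labels, "idx_of_peak": idxs},
        index=pd.Index(timestamps, name="timestamp_of_peak"),
    )


# Made-up rows: timestamp 100.0 appears in all three frames, 300.0 in two of them.
skp = toy_frame([100.0, 300.0], [10.0, 20.0], ["SKP", "NSKP"], [1, 3])
pp = toy_frame([100.0, 200.0], [11.0, 15.0], ["NPP", "PP"], [1, 2])
plp = toy_frame([100.0, 300.0], [12.0, 21.0], ["PLP", "PLP"], [1, 3])

# Concatenate in precedence order, highest first.
merged = pd.concat([skp, pp, plp])

# A stable sort keeps rows with equal timestamps in concatenation order.
merged = merged.sort_values(by=["timestamp_of_peak"], kind="stable")

# Keep only the first (highest-precedence) row for each timestamp.
merged = merged[~merged.index.duplicated(keep="first")]

# Collapse every label that begins with N into NAP.
merged = merged.replace(to_replace=r"^N.*", value="NAP", regex=True)

print(merged)
```

With these toy rows the merged labels are `SKP`, `PP`, and `NAP`. Note that precedence is applied per source frame rather than per label: at timestamp 300.0 the `NSKP` row from the top frame wins over the `PLP` row from the third frame, so the result there is `NAP`. It may be worth checking whether that is the intended outcome for real overlaps.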


## Output to CSV
The following code block exports the newly created timeline to a CSV file.

In [7]:
# specify the name and path of the file
filename = '../Data/labeled_data/ground_truths/fDOM/fDOM_all_julian_0k-300k.csv'

# write to csv
final_data.to_csv(filename)
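
When the exported file is loaded again, the timestamp index has to be restored explicitly. Below is a minimal round-trip sketch, assuming a toy frame with made-up values and an in-memory buffer in place of the real file path. `float_precision="round_trip"` is one option if exact equality of the Julian timestamps matters after reloading.

```python
import io

import pandas as pd

# Toy merged timeline (made-up values), indexed by Julian timestamp like final_data.
toy = pd.DataFrame(
    {
        "value_of_peak": [10.0, 15.0],
        "label_of_peak": ["SKP", "NAP"],
        "idx_of_peak": [1, 2],
    },
    index=pd.Index([2456064.25, 2456077.5], name="timestamp_of_peak"),
)

# Write to an in-memory buffer instead of a file so the sketch has no side effects.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col restores the timestamp index; float_precision="round_trip" asks the
# C parser to reproduce the written floats exactly.
reloaded = pd.read_csv(
    buffer, index_col="timestamp_of_peak", float_precision="round_trip"
)

print(reloaded.index.tolist() == toy.index.tolist())
print(reloaded["label_of_peak"].tolist())
```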