# Example: Dataset CEAP-360VR with pandas and scikit-learn

This notebook loads and preprocesses the `CEAP-360VR` dataset ([GitHub repo](https://github.com/luiseduve/CEAP-360VR-Dataset)), described in the paper *CEAP-360VR: A Continuous Physiological and Behavioral Emotion Annotation Dataset for 360 VR Videos* ([DOI](https://doi.org/10.1109/TMM.2021.3124080)).

*Description:*

1. A class was created to load the individual JSON files in a structured way through the index file `data_tree_index.json`. Demographics and video stimuli information are stored in two `.csv` files. The `Frame` data was used as the main data source.
2. The sampling frequency differs among data modalities, so all signals were normalized to 30Hz for all videos (Video1 was originally at 25Hz). Moreover, the `Raw` IBI was loaded to generate a new signal `IBI_R_Peaks`, which marks with a 1 each sample where a heartbeat was detected. This information is useful for HRV analysis.
3. Finally, the data groups are combined to avoid missing values, and the class label is appended according to the paper.

In [None]:
import ceap_loader

# Import data science libs
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt

---
# Setup

In [None]:
# All the files generated from this notebook are in a subfolder with this name
STR_DATASET = "ceap_example_notebookCEAP/"

In [None]:
def gen_path_temp(filename, subfolders="", extension=".csv"):
    # Function to generate temporary files easier
    TEMP_FOLDER_NAME = "./temp/"
    return ceap_loader.generate_complete_path(filename, \
                                        main_folder=TEMP_FOLDER_NAME, \
                                        subfolders=STR_DATASET+subfolders, \
                                        file_extension=extension)

---
# Load full dataset
---

## Generate index

The class `DatasetCEAP` generates an index file `data_tree_index.json` that is used to look up the folder path containing a specific type of data for a specific participant. The class constructor also loads the description of the stimuli and the demographics data (*not the questionnaires*).

In [None]:
# Root folder of the downloaded CEAP-360VR dataset (adjust to your local path)
dataset_root_folder = "../../CEAP-360VR/"
print(dataset_root_folder)

In [None]:
data_loader_ceap = ceap_loader.DatasetCEAP(dataset_root_folder)

In [None]:
data_loader_ceap.stimuli

In [None]:
data_loader_ceap.demographics.head(2)

### Plotting a single loaded file

Test loading a single file and plotting all the data contained in it.

In [None]:
def plot_CEAP_df(df, time_colname="TimeStamp", y_colname="VideoID"):
    """
    This function plots a single dataframe from the CEAP dataset.
    `df` is the result of applying the function `load_data_from_participant()`.

    A single file from the dataset contains many features, which may be sampled
    at different frequencies. This function creates a grid of subplots with
    one row per numeric feature and one column per value of `y_colname`
    (e.g., per video), each plotted against `time_colname`.
    """
    # Get the numeric columns and remove `time_colname`, which is used as index later
    cols = list(df.select_dtypes([np.number]).columns)
    if time_colname not in cols: raise ValueError(f"The dataframe does not contain a numeric column for the index time_colname={time_colname}")
    if y_colname not in df.columns: raise ValueError(f"The dataframe does not contain column with y_colname={y_colname}")
    cols.remove(time_colname)
    if y_colname in cols: cols.remove(y_colname)

    NUM_ROWS = len(cols)
    colnames_labels = df[y_colname].unique()
    NUM_COLS = len(colnames_labels)

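    cmap = plt.get_cmap("tab20")  # `matplotlib` itself is never imported, so use the pyplot accessor
    # `squeeze=False` keeps `axes` two-dimensional even with a single row or column
    fig,axes = plt.subplots(NUM_ROWS, NUM_COLS, sharex=True, squeeze=False, figsize=(6*NUM_COLS, 2*NUM_ROWS))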
    for i in range(NUM_ROWS):
        for j in range(NUM_COLS):
            ax = axes[i,j]

            # Filter the data that has to do with this column label
            df_ax = df[ df[y_colname] == colnames_labels[j] ]
            df_ax = df_ax[[time_colname, cols[i]]].set_index(time_colname).dropna(axis=0)
            timestamps = df_ax.index.values
            data = df_ax[cols[i]].values
            ax.plot(timestamps, data, label=cols[i], color=cmap.colors[i % len(cmap.colors)] ) # Cycle colors if there are more features than colors
            if(i==0): ax.set_title(f"{y_colname}: {colnames_labels[j]}") # Titles for first row
            if(j==0): ax.set_ylabel(cols[i])    # Ylabel for first column
            if(i==NUM_ROWS-1): ax.set_xlabel(time_colname)
    plt.tight_layout()
    return

In [None]:
# Parameters of data to load
pid = 1         # 1-32
typ = "Physio"  # ["Annotations", "Behavior", "Physio"]
prep = "Frame"    # ["Raw", "Transformed", "Frame"]

# Load data 
data_loaded = data_loader_ceap.load_data_from_participant(pid,typ,prep)
data_loaded

In [None]:
### Was any data lost when applying `merge` to all features of a single dataframe?
## Count how many samples were loaded per feature for a specific video
ts_colname = "TimeStamp"    #data_loader_ceap.K_TIMESTAMP
video_colname = "VideoID"   #data_loader_ceap.K_VIDEO
participant_colname = "ParticipantID"

cols = list(data_loaded.select_dtypes([np.number]).columns)
cols.remove(ts_colname)
CHECK_VIDEO_ID = 2
for c in cols:
    df_ax = data_loaded[ data_loaded[video_colname] == CHECK_VIDEO_ID ]
    df_ax = df_ax[[ts_colname, c]].set_index(ts_colname).dropna(axis=0)
    print(f"C:{c} \t{df_ax.shape}")

#### Desired output from this cell should show:
# - The same number of samples for all features (except IBI) if preprocessing type = "Frame"
# - Different number of samples per feature if preprocessing type = "Raw" because every sensor has different sampling freq

## CONCLUSION
# Merging is working fine!
# - IBI has fewer samples and needs to be resampled.
# - VideoID 1 was resampled at 25Hz, and the rest at 30Hz, because Video1 had a lower FPS
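
The same check can be sketched more compactly with a `groupby`. The toy dataframe below is made up for illustration (it is not taken from the dataset); it only mimics the merged layout, where a feature is NaN at timestamps where it has no sample.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the merged layout: one row per (VideoID, TimeStamp),
# with NaN where a feature has no sample at that timestamp.
toy = pd.DataFrame({
    "VideoID":   [1, 1, 1, 2, 2, 2],
    "TimeStamp": [0.0, 0.04, 0.08, 0.0, 1 / 30, 2 / 30],
    "HR_HR":     [70.0, 71.0, 72.0, 65.0, 66.0, 67.0],
    "IBI_IBI":   [np.nan, 0.85, np.nan, np.nan, np.nan, 0.9],
})

# count() ignores NaN, so each cell holds the number of valid samples
# of that feature for that video.
samples_per_feature = toy.groupby("VideoID")[["HR_HR", "IBI_IBI"]].count()
print(samples_per_feature)
```

A feature whose count is lower than the others (here the toy `IBI_IBI`) is sampled irregularly and needs separate treatment.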

In [None]:
plot_CEAP_df(data_loaded)

### Plot the data from all files in the dataset

**Uncomment if needed.** The cell below takes around **130 minutes** to plot the whole dataset, one figure per loaded file.

In [None]:
# ## Generate plots per data type and to visualize all the data per participant
# for typ in data_loader_ceap.LIST_DATA_TYPES:
#     for prep in data_loader_ceap.LIST_PROCESSING_LEVELS:
#         for pid in range(1,33):
#             data_loaded = data_loader_ceap.load_data_from_participant(pid,typ,prep)
#             plot_CEAP_df(data_loaded)
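#             # NOTE: `gen_path_plot` is not defined in this notebook; `gen_path_temp` is used instead, with an assumed subfolder name "plots/"
#             save_path_plot = gen_path_temp(f"{prep}/Participant{pid}_{typ}", subfolders="plots/", extension=".png")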
#             plt.savefig(save_path_plot)
#             plt.close()

### Generate a CSV with data of interest

Creating a CSV for the whole dataset produces a file of `~800MB`.

Thus, we decided to use only the data with preprocessing level `Frame`. These files proved to be as accurate as the `Raw` data, but they are already normalized and resampled at `30Hz`.

The `Raw` data is used to extract the `IBI` and calculate new HRV data from it.

Finally, the dataset of interest comprises:
- `Annotations`, `Behavior`, and `Physio` as found in the folder `Frame`.
- `IBI` is excluded from the `Frame` data because it is sampled irregularly.
- New `IBI` and `HRV` data are calculated from the folder `Raw`.

In [None]:
# Participants IDS
PARTICIPANTS_IDS = np.arange(1,33)
# Load data Annotations, Behavior, and Physio
DATA_GROUPS = data_loader_ceap.LIST_DATA_TYPES
# Load the Raw, Transformed, or Frame (resampled) data processing
DATA_PROCESSING_LEVELS = ["Frame"]#data_loader_ceap.LIST_PROCESSING_LEVELS
print(f"DATA_GROUPS={DATA_GROUPS}, DATA_PROCESSING_LEVELS={DATA_PROCESSING_LEVELS}")

In [None]:
# Where the compiled dataset will be stored
DATASET_POSTPROCESSED_FILENAME = gen_path_temp("Dataset_CEAP_resampled_by_Frame", extension=".csv")

# Load or create the dataframe with the compiled dataset
data_postprocessed = None

### INPUTS / OUTPUTS
"""EDIT CUSTOM FILENAMES"""
input_files = [DATASET_POSTPROCESSED_FILENAME]

# Try to load or create files
for tries in range(3): # One for loading, and max 2 for creating+loading
    try:
        ### LOAD FILE
        print(f"Try # {tries+1} to load files: {input_files}")
        
        #vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
        ##          Custom functions to READ from files
        data_postprocessed = pd.read_csv(input_files[0])
        print(f"File {input_files[0]} was successfully loaded")

        #############################################################
        break
    except Exception as e:
        ### CREATE FILE
        print(f"File not found. Creating again! {e}")

        #vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
        ##          Custom functions to WRITE to disk

        # Load all data resampled by frame
        for pid in PARTICIPANTS_IDS:
            for dttype in DATA_GROUPS:
                for prep in DATA_PROCESSING_LEVELS:
                    df_single_file = data_loader_ceap.load_data_from_participant(pid,dttype,prep)
                    data_postprocessed = df_single_file if (data_postprocessed is None) else pd.concat([data_postprocessed, df_single_file], axis=0)
            
        # Saving .csv
        data_postprocessed.to_csv( input_files[0], index=False)

        #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        print(f"\n\tFinished creating files {input_files}")

In [None]:
data_postprocessed
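
The load-or-create pattern used in the cell above (and repeated later in the notebook) could also be factored into a small helper. This is only a sketch: `load_or_create_csv` and its arguments are hypothetical names that do not exist in `ceap_loader`, and it checks for the file explicitly instead of catching a generic exception.

```python
import os
import tempfile

import pandas as pd


def load_or_create_csv(path, create_fn):
    """Return the DataFrame stored at `path`; build and save it with `create_fn` if the file is missing."""
    if os.path.isfile(path):
        return pd.read_csv(path)
    df = create_fn()
    # Create the parent folder if needed before writing the cache file
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    df.to_csv(path, index=False)
    return df


# Usage with a throwaway folder and trivial builder functions
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "cache", "demo.csv")
first = load_or_create_csv(demo_path, lambda: pd.DataFrame({"a": [1, 2]}))
# The file exists now, so the second builder is never called
second = load_or_create_csv(demo_path, lambda: pd.DataFrame({"a": [99]}))
```

With such a helper, the expensive loop over participants would live inside `create_fn` and would only run when the cached CSV is absent.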

---
# 2. Preprocessing stages
---

1. Delete the data corresponding to `IBI` because it has a different sampling frequency than the rest of the dataset. The `IBI` will be replaced later with a signal indicating the peaks.
2. Upsample the `Frame` data from `VideoID=1` from 25 to 30Hz. *Reason:* All data in `Frame` is resampled to the FPS (frames per second) of the video where the data was collected. However, `VideoID=1` is the only one at 25Hz, whereas all other videos (`VideoID=2..8`) are resampled at 30Hz. Upsampling `V1` makes the sampling rate consistent across videos.
3. Extract heart-rate variability data from the `IBI` column in the type `Raw`. *Reason:* The `IBI` samples are irregular, so they are converted into a regular array of R-peaks, which is useful for HRV analysis.

In [None]:
# Constant sampling frequency to be applied to data from Video1 and to transform IBI to peaks.
RESAMPLING_FREQUENCY = 30       # What is the sampling frequency of the peaks array?

# Name of the column containing IBI data (this column will be removed and replaced by R-peaks)
ibi_colname = "IBI_IBI"
r_peaks_colname = "IBI_R_Peaks" # Name that will be used after transforming IBI into R-peaks

## Remove IBI data

Remove the `IBI` column from the dataset and drop the rows that are left with only NaN. This cleaning should lead to dataframes with the same number of samples per participant, video, and data group (i.e., 1800 rows, except for Video1).

In [None]:
# Remove IBI colnames. 
# However, we also need to delete the rows full of NaN, that only had values on IBI.
data_postprocessed = data_postprocessed.drop([ibi_colname], axis=1)
data_postprocessed

In [None]:
# Identify the non-numeric `basic` colnames from the `data` colnames containing the relevant time-series.
# The rows whose `data_colnames` are all NaN will be removed. Because it was a row created by IBI.
basic_cols = ["data_type","processing_level", participant_colname, video_colname, ts_colname]

# Difference between sets of colnames (converted to a list, because recent pandas versions do not accept a set as indexer)
data_colnames = list(set(data_postprocessed.columns).difference(set(basic_cols)))
# Remove all the rows that are empty, after removing IBI data
remaining_data_index = data_postprocessed[data_colnames].dropna(axis=0, how="all").index
# New post_processed data should have `1800` samples per time-series (except Video1)
data_postprocessed = data_postprocessed.loc[remaining_data_index]

# After removing IBI, all arrays have 1800 samples
df_filtered_single_ts = data_postprocessed[ (data_postprocessed.VideoID == 5) # Do not use VideoID=1 here!
                    & (data_postprocessed.ParticipantID == 1)
                    & (data_postprocessed.data_type == "Physio") ]
df_filtered_single_ts
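
The row-cleaning logic can be checked on a toy dataframe. The values below are invented for illustration; the sketch uses the `subset` argument of `dropna`, which is an alternative to computing the index of the remaining rows by hand.

```python
import numpy as np
import pandas as pd

# Toy data: two rows exist only because IBI had a sample at that timestamp
toy = pd.DataFrame({
    "TimeStamp": [0.0, 0.01, 1 / 30, 0.05],
    "HR_HR":     [70.0, np.nan, 71.0, np.nan],
    "IBI_IBI":   [np.nan, 0.8, np.nan, 0.9],
})
id_cols = ["TimeStamp"]

# Drop the irregular column, then drop the rows whose data columns are all NaN
toy = toy.drop(columns=["IBI_IBI"])
data_cols = [c for c in toy.columns if c not in id_cols]
toy_clean = toy.dropna(axis=0, how="all", subset=data_cols)
print(toy_clean)
```

Only the rows that carry a value in at least one data column survive, which is why every remaining time-series ends up with the same number of samples.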

## Upsample `Frame` data from VideoID=1

In [None]:
def upsample_df_with_interpolation(df, new_timestamps, index_name="TimeStamp"):
    """
    Function used to upsample the data corresponding to VideoID=1 
    from 25Hz to 30Hz, so that it matches the sampling frequency of the
    data in the rest of the videos.

    Given a dataframe `df` indexed by timestamp, it upsamples the numeric
    columns with linear interpolation, using an interpolation function fitted
    on those columns. Non-numeric columns are filled with the value of the
    first row in the original `df`.
    
    Returns another pandas DataFrame, with the same columns as the input
    `df`, but whose index corresponds to `new_timestamps`.
    """
    import scipy.interpolate

    # Create new dataframe with the resampled version
    df_resampled = pd.DataFrame(index=pd.Index(new_timestamps, name=index_name), columns=df.columns)

    # Find the numeric columns that can be interpolated
    cols_numeric = list(df.select_dtypes([np.number]).columns)
    cols_non_numeric = list(set(df.columns.values).difference(set(cols_numeric)))

    # Non-numeric columns are filled with the first value in the original dataframe
    for col in cols_non_numeric:
        df_resampled[col] = df[col].iloc[0]

    # Apply interpolation
    x = df.index.values#.total_seconds().values
    y = df[cols_numeric].values
    f_interpolation = scipy.interpolate.interp1d(x,y,axis=0)
    df_resampled[cols_numeric] = f_interpolation(new_timestamps)
    return df_resampled
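
As a minimal illustration of the idea behind the function above, the sketch below upsamples a synthetic signal from 25Hz to 30Hz. It uses `np.interp` on a single 1-D array instead of `scipy.interpolate.interp1d`, and the signal is invented. The new timestamps are kept inside the original time range, because linear interpolation is only defined between existing samples.

```python
import numpy as np

fs_old, fs_new, duration = 25, 30, 2           # Hz, Hz, seconds
t_old = np.arange(fs_old * duration) / fs_old  # 50 timestamps: 0.00, 0.04, ..., 1.96
signal_old = 2.0 * t_old + 1.0                 # a straight line, so interpolation is exact

# New timestamps at 30Hz that do not exceed the last original timestamp
n_new = int(t_old[-1] * fs_new) + 1
t_new = np.arange(n_new) / fs_new

# Linear interpolation of the old samples at the new timestamps
signal_new = np.interp(t_new, t_old, signal_old)
print(len(t_old), len(t_new))
```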

In [None]:
# Extract a reference TimeStamp array to be used in the resampled version of VideoID=1
timestamps_reference = df_filtered_single_ts.set_index(ts_colname).dropna(axis=1, how="all")
timestamps_reference

In [None]:
# Use a dataframe from another video as reference for the timestamps required after resampling
timestamps_reference = timestamps_reference.index.values
timestamps_reference

In [None]:
print(f"Size original dataset:{data_postprocessed.shape}")

# Subset of data with values 1 and remove it from original dataset
data_video_1 = data_postprocessed[ data_postprocessed.VideoID ==1 ].copy(deep=True)
print(f"Size data video 1:{data_video_1.shape}")

# Delete from main dataset, we will input new values with proper resampling.
data_postprocessed = data_postprocessed[ data_postprocessed.VideoID != 1 ]
print(f"Size after removing video 1:{data_postprocessed.shape}")

In [None]:
# New dataframe for video 1
data_video_1_resampled = None

# Extract each time series (per participant, and per data group)
for pid in data_video_1.ParticipantID.unique():
    for dgroup in data_video_1.data_type.unique():
        Q = ( (data_video_1.ParticipantID == pid) 
                & (data_video_1.data_type == dgroup) )
        # Define timestamps as the index, and delete
        # columns that are not relevant to the datagroup.
        # Chained calls return a new dataframe instead of modifying a slice in place.
        df_filter = data_video_1[ Q ].set_index(ts_colname).dropna(axis=1, how="all")

        # Resample the df_filter with the same timestamps as the reference.
        df_resampled = upsample_df_with_interpolation(df_filter, timestamps_reference, index_name=df_filter.index.name)
        df_resampled.reset_index(inplace=True)

        data_video_1_resampled = df_resampled if (data_video_1_resampled is None) else pd.concat([data_video_1_resampled, df_resampled], axis=0, ignore_index=True)
        # print(f"Original = {df_filter.shape} - Resampled = {df_resampled.shape}")
        # break
    # break
print(f"Size data video 1 resampled: {data_video_1_resampled.shape}\n\tEnd")

In [None]:
# Attach the resampled data from video 1 to the original dataset
data_postprocessed = pd.concat([data_postprocessed, data_video_1_resampled], axis=0, ignore_index=True)
print(f"Size after inserting video 1 resampled:{data_postprocessed.shape}")

In [None]:
df_filter.reset_index().plot(subplots=True, figsize=(9,5))
df_resampled.reset_index().plot(subplots=True, figsize=(9,5))

## Convert Raw IBI to R-peaks array and attach to dataset

In [None]:
# Where the compiled dataset will be stored
DATASET_RAW_IBI_FILENAME = gen_path_temp("Dataset_CEAP_Physio_Raw", extension=".csv")

# Load or create the dataframe with the raw Physio data (it contains the raw IBI)
data_IBI_raw = None

### INPUTS / OUTPUTS
"""EDIT CUSTOM FILENAMES"""
input_files = [DATASET_RAW_IBI_FILENAME]

# Try to load or create files
for tries in range(3): # One for loading, and max 2 for creating+loading
    try:
        ### LOAD FILE
        print(f"Try # {tries+1} to load files: {input_files}")
        
        #vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
        ##          Custom functions to READ from files
        data_IBI_raw = pd.read_csv(input_files[0])
        print(f"File {input_files[0]} was successfully loaded")

        #############################################################
        break
    except Exception as e:
        ### CREATE FILE
        print(f"File not found. Creating again! {e}")

        #vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
        ##          Custom functions to WRITE to disk

        # Load all raw IBI to generate HRV data
        for pid in PARTICIPANTS_IDS:
            df_single_file = data_loader_ceap.load_data_from_participant(pid,"Physio","Raw")
            data_IBI_raw = df_single_file if (data_IBI_raw is None) else pd.concat([data_IBI_raw, df_single_file], axis=0)
            
        # Saving .csv
        data_IBI_raw.to_csv( input_files[0], index=False)

        #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        print(f"\n\tFinished creating files {input_files}")

In [None]:
def extract_peaks_from_IBI(df, FS = 30, ibi_colname="IBI_IBI", output_colname="IBI_R_Peaks"):
    """
    Given a dataframe `df` indexed by timestamp (in seconds) that contains
    irregularly sampled interbeat intervals in the column `ibi_colname`,
    this function returns another dataframe containing 60 seconds of
    data at the same sampling rate as the rest of the dataset
    preprocessed as `Frame`.

    The returned series contains the position of the `peaks` as `1`, and
    the rest of the array contains zeros. The returned dataframe can be
    used directly in the neurokit2 package to extract HRV features:
     - `neurokit2.hrv(peaks, sampling_rate=FS)`
    """
    # Work on a copy so that the caller's dataframe is not modified
    df = df.copy()

    # The first IBI allows regenerating the beat that occurred one interval before the first timestamp
    first_beat_time = df.index[0] - df[ibi_colname].iloc[0]
    if(first_beat_time>=0):
        df.loc[first_beat_time] = 0
    df.sort_index(inplace=True)
    
    # Generate a zero-array that will contain the R-peaks as 1's at the sampling frequency `FS`
    MAXIMUM_TIME_SECS = 60
    ts_index_resampled = np.linspace(0, MAXIMUM_TIME_SECS, MAXIMUM_TIME_SECS * FS) # The way used by the authors of the dataset. I would use `np.arange(0,60,1/FS)`
    df_peaks = pd.DataFrame(data=np.zeros(ts_index_resampled.size, dtype=int), index=ts_index_resampled, columns=[ibi_colname])

    # Match the IBI times to the closest timestamp in the array containing the peaks
    closest_times_to_peaks = df_peaks.index.get_indexer(df.index.values, method="nearest")
    closest_index_values = df_peaks.index[closest_times_to_peaks] # Get index values from the positions
    df_peaks.loc[closest_index_values] = 1

    # The column needs to be called `R_Peaks` to extract HRV with neurokit
    df_peaks = df_peaks.rename({ibi_colname:output_colname}, axis=1)
    
    return df_peaks
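
The conversion from irregular interbeat intervals to a regular binary peak array can be illustrated with plain NumPy. This is a simplified sketch with invented IBI values: beat times are reconstructed here with a cumulative sum, whereas the function above relies on the timestamps stored with each IBI sample.

```python
import numpy as np

FS = 30                                  # target sampling rate in Hz
ibi = np.array([0.8, 0.9, 1.0, 0.84])    # hypothetical interbeat intervals, in seconds
beat_times = np.cumsum(ibi)              # time of each beat, in seconds

n_samples = 5 * FS                       # regular grid covering 5 seconds
peaks = np.zeros(n_samples, dtype=int)

# Snap each beat to the nearest sample of the regular grid
idx = np.round(beat_times * FS).astype(int)
idx = idx[idx < n_samples]               # ignore beats beyond the window
peaks[idx] = 1

print(np.flatnonzero(peaks))
```

Each beat becomes a single `1` in an otherwise zero array, which has the same length and sampling rate as the other `Frame` signals.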

In [None]:
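### Testing extraction of R_peaks from individual IBI

# Minimum number of IBI samples to flag an instance as valid.
# ASSUMPTION: this constant is not defined anywhere in the notebook; 10 is a placeholder value, adjust as needed.
THRESHOLD_SAMPLES_IBI = 10

# Find the participants with short amount of IBI data 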
for pid in data_IBI_raw.ParticipantID.unique():
    for vid in data_IBI_raw.VideoID.unique():
        Q = ( (data_IBI_raw.ParticipantID == pid) 
            & (data_IBI_raw.VideoID == vid)
        )
        data_single_instance = data_IBI_raw[ Q ][ [ts_colname, ibi_colname] ].set_index(ts_colname).dropna(axis=0, how="all")
        if( ibi_colname in data_single_instance.columns and data_single_instance[ibi_colname].size > 0): # If contains the column, and the column has data
            data_peaks_resampled = extract_peaks_from_IBI(data_single_instance, 
                                                        FS=RESAMPLING_FREQUENCY, 
                                                        ibi_colname=ibi_colname,
                                                        output_colname=r_peaks_colname)
        else:
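            # Dataframe full of zeros (no peaks), with the same column name returned by `extract_peaks_from_IBI`
            data_peaks_resampled = pd.DataFrame({
                ts_colname: timestamps_reference,
                r_peaks_colname: np.zeros(timestamps_reference.size, dtype=int),
            }).set_index(ts_colname)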
        
        print(f"P{pid}, V{vid}: SIZE:{data_single_instance.size} \t\tVALID: {data_single_instance.size >= THRESHOLD_SAMPLES_IBI} \t R_peaks: {data_peaks_resampled.values.sum()}")


In [None]:
### Example on how the peaks can be used in the feature extraction stage
import neurokit2 as nk
hrv_indices = nk.hrv(data_peaks_resampled, sampling_rate=RESAMPLING_FREQUENCY)#, show=True)
hrv_indices
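
To make explicit what kind of information a binary peak array carries, the sketch below derives RR intervals and two common time-domain measures (mean heart rate and RMSSD) with plain NumPy. The peak positions are invented, and this is not a replacement for `neurokit2.hrv`, only a sanity check of the representation.

```python
import numpy as np

FS = 30
peaks = np.zeros(10 * FS, dtype=int)
peaks[[0, 24, 51, 81, 105]] = 1                  # hypothetical beat positions, in samples

peak_samples = np.flatnonzero(peaks)             # sample index of each beat
rr_ms = np.diff(peak_samples) / FS * 1000.0      # RR intervals in milliseconds

mean_hr_bpm = 60000.0 / rr_ms.mean()             # mean heart rate in beats per minute
rmssd_ms = np.sqrt(np.mean(np.diff(rr_ms) ** 2)) # root mean square of successive differences

print(rr_ms, mean_hr_bpm, rmssd_ms)
```

Note that at 30Hz each sample spans about 33 ms, so beat times are quantized to that resolution, which limits the precision of any HRV measure computed from the array.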

In [None]:
# Where the compiled dataset will be stored
DATASET_POSTPROCESSED_WITH_RPEAKS_FILENAME = gen_path_temp("Dataset_CEAP_replacing_IBI_with_RPeaks", extension=".csv")

# Load or create the dataframe where IBI is replaced by R-peaks
data_postprocessed_with_Rpeaks = None

### INPUTS / OUTPUTS
"""EDIT CUSTOM FILENAMES"""
input_files = [DATASET_POSTPROCESSED_WITH_RPEAKS_FILENAME]

# Try to load or create files
for tries in range(3): # One for loading, and max 2 for creating+loading
    try:
        ### LOAD FILE
        print(f"Try # {tries+1} to load files: {input_files}")
        
        #vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
        ##          Custom functions to READ from files
        data_postprocessed_with_Rpeaks = pd.read_csv(input_files[0])
        print(f"File {input_files[0]} was successfully loaded")

        #############################################################
        break
    except Exception as e:
        ### CREATE FILE
        print(f"File not found. Creating again! {e}")

        #vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
        ##          Custom functions to WRITE to disk

        # Copy the dataframe so that `data_postprocessed` is not modified through an alias
        data_postprocessed_with_Rpeaks = data_postprocessed.copy()

        # Add empty R_Peaks to the whole dataset
        data_postprocessed_with_Rpeaks[r_peaks_colname] = np.nan

        # Iterate over participants and videos to add the respective R_peaks
        for pid in np.sort(data_postprocessed_with_Rpeaks.ParticipantID.unique()):
            for vid in np.sort(data_postprocessed_with_Rpeaks.VideoID.unique()):
                #######
                # Query to filter subset of IBI data
                Q = ( (data_IBI_raw.ParticipantID == pid) 
                    & (data_IBI_raw.VideoID == vid)
                    & (data_IBI_raw.data_type == "Physio"))
                # Extract the R_peaks from the corresponding Raw IBI
                data_single_instance = data_IBI_raw[Q][[ts_colname, ibi_colname]].set_index(ts_colname).dropna(axis=0, how="all")
                # If contains the column, and the column has data
                if(ibi_colname in data_single_instance.columns and data_single_instance[ibi_colname].size > 0):
                    data_peaks_resampled = extract_peaks_from_IBI(data_single_instance,
                                                            FS=RESAMPLING_FREQUENCY,
                                                            ibi_colname=ibi_colname,
                                                            output_colname=r_peaks_colname)
                else:
                    # Dataframe full of zeros but without peaks, to compensate for those samples without IBI data.
                    data_peaks_resampled = pd.DataFrame({
                        ts_colname: timestamps_reference,
                        r_peaks_colname: np.zeros(timestamps_reference.size, dtype=int),
                    }).set_index(ts_colname)

                #######
                # Replace the relevant subsection of the postprocessed data
                # Query to filter subset of big dataframe
                Q = ((data_postprocessed_with_Rpeaks.ParticipantID == pid)
                    & (data_postprocessed_with_Rpeaks.VideoID == vid)
                    & (data_postprocessed_with_Rpeaks.data_type == "Physio"))
                idx_to_replace = data_postprocessed_with_Rpeaks[Q].index
                data_postprocessed_with_Rpeaks.loc[idx_to_replace, r_peaks_colname ] = data_peaks_resampled[r_peaks_colname].values
                print(f"P{pid} - V{vid} - #R-Peaks:{data_peaks_resampled[r_peaks_colname].values.sum()}")

                #######
                # # Testing size of the dataset
                # df_instance = data_postprocessed_with_Rpeaks[Q]
                # df_instance[r_peaks_colname] = data_peaks_resampled.values

                # # Get numeric and non-numeric colnames
                # cols_numeric = list(df_instance.select_dtypes([np.number]).columns)
                # cols_non_numeric = list( set(df_instance.columns.values).difference(set(cols_numeric)) )

                # # print(f"P:{pid} \tV:{vid} \tShape:{df_instance.shape} \tCols:{cols_numeric}")

        # All the individual instances should be the same shape (1800) even after adding IBI_R_peaks
        print(data_postprocessed_with_Rpeaks.shape)
        
        ## Saving .csv
        data_postprocessed_with_Rpeaks.to_csv( input_files[0], index=False)

        #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        print(f"\n\tFinished creating files {input_files}")

---
# 3. Final dataset ready for feature extraction
---

The final dataset is produced by the following steps:

1. Remove NaN by merging the time-series of the different `data_type` groups. They already share the same `ParticipantID`, `VideoID` and `TimeStamp`, so the column that indicates the type of data (*Annotations, Behavior, Physio*) can be removed and the groups joined, so that the whole dataframe does not contain missing values.
2. Include the class labels for high/low valence/arousal according to the paper: `[HVHA, HVLA, LVHA, LVLA]`
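
The two steps can be sketched on a toy dataframe. The values are invented, and the class label is assigned with `Series.map`, which is one possible alternative to vectorizing the dictionary lookup.

```python
import numpy as np
import pandas as pd

keys = ["ParticipantID", "VideoID", "TimeStamp"]

# Long layout: each data group only fills its own columns, the rest is NaN
long_df = pd.DataFrame({
    "ParticipantID": [1, 1, 1, 1],
    "VideoID":       [2, 2, 2, 2],
    "TimeStamp":     [0.0, 1 / 30, 0.0, 1 / 30],
    "data_type":     ["Annotations", "Annotations", "Physio", "Physio"],
    "Valence":       [5.0, 5.5, np.nan, np.nan],
    "HR_HR":         [np.nan, np.nan, 70.0, 71.0],
})

# Step 1: join the groups side by side on the shared keys
wide = None
for _, group in long_df.groupby("data_type"):
    part = group.drop(columns="data_type").set_index(keys).dropna(axis=1, how="all")
    wide = part if wide is None else wide.join(part)
wide = wide.reset_index()

# Step 2: map each video to its class label (only the entry needed by the toy data)
video_to_class = {2: "HVLA"}
wide["class_VA"] = wide["VideoID"].map(video_to_class)
print(wide)
```

After the join there is one row per timestamp and no missing values, because every data group contributes its own columns to the same keys.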

In [None]:
dgroup_colname = "data_type"        # Existing column to be removed
class_label_colname = "class_VA"    # Class column name to be created

# These columns are used as index to join df
basic_colnames = [participant_colname, video_colname, ts_colname]
basic_colnames

In [None]:
# Mapping used according to the paper's information in Table 1
# doi: 10.1109/TMM.2021.3124080
MAPPING_VIDEO_TO_CLASS = {
    1: "HVHA",
    2: "HVLA",
    3: "LVHA",
    4: "LVLA",
    5: "HVHA",
    6: "HVLA",
    7: "LVHA",
    8: "LVLA",
}
# Create a function from the dictionary to apply on the final array
mapper_videoid_to_classes = np.vectorize(MAPPING_VIDEO_TO_CLASS.get)

In [None]:
# Where the compiled dataset will be stored
DATASET_POSTPROCESSED_WITHOUT_NAN = gen_path_temp("Dataset_CEAP_postprocessed", extension=".csv")

# Load or create the final dataframe without missing values
dataset_postprocessed_no_nan = None

### INPUTS / OUTPUTS
"""EDIT CUSTOM FILENAMES"""
input_files = [DATASET_POSTPROCESSED_WITHOUT_NAN]

# Try to load or create files
for tries in range(3): # One for loading, and max 2 for creating+loading
    try:
        ### LOAD FILE
        print(f"Try # {tries+1} to load files: {input_files}")
        
        #vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
        ##          Custom functions to READ from files
        dataset_postprocessed_no_nan = pd.read_csv(input_files[0])
        print(f"File {input_files[0]} was successfully loaded")

        #############################################################
        break
    except Exception as e:
        ### CREATE FILE
        print(f"File not found. Creating again! {e}")

        #vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
        ##          Custom functions to WRITE to disk

        # Delete preprocessing level info (Full of labels saying `Frame`)
        data_postprocessed_with_Rpeaks.drop(["processing_level"], axis=1, inplace=True)

        # Merge data from different groups to remove Nan values
        for pid in np.sort(data_postprocessed_with_Rpeaks[participant_colname].unique()):
            for vid in np.sort(data_postprocessed_with_Rpeaks[video_colname].unique()):
                # Stores the different data groups per time-series instance.
                df_instance = None
                for dg in DATA_GROUPS:
                    print(f"P{pid} V{vid} G:{dg}")
                    Q = ( (data_postprocessed_with_Rpeaks[participant_colname] == pid) 
                        & (data_postprocessed_with_Rpeaks[video_colname] == vid)
                        & (data_postprocessed_with_Rpeaks[dgroup_colname] == dg))
                    selection_idx = data_postprocessed_with_Rpeaks[Q].index
                    data_per_group = data_postprocessed_with_Rpeaks.loc[selection_idx].copy()
                    # Keep only the relevant columns of this data group, i.e., those that are not entirely missing
                    data_per_group.drop(dgroup_colname, axis=1, inplace=True)
                    data_per_group.set_index(basic_colnames, inplace=True)
                    data_per_group.dropna(axis=1, how="all", inplace=True)
                    # Add specific data group to time series
                    df_instance = data_per_group if (df_instance is None) else df_instance.join(data_per_group)

                # Add joined dataset to general one
                df_instance.reset_index(inplace=True)
                dataset_postprocessed_no_nan = df_instance if (dataset_postprocessed_no_nan is None) else pd.concat([dataset_postprocessed_no_nan, df_instance], axis=0, ignore_index=True)
        print("\tEnd")

        # Map each video to the corresponding Class label
        video_id_array = dataset_postprocessed_no_nan[video_colname]
        dataset_postprocessed_no_nan[class_label_colname] = mapper_videoid_to_classes(video_id_array)

        print(dataset_postprocessed_no_nan.shape)

        # Saving .csv
        dataset_postprocessed_no_nan.to_csv( input_files[0], index=False)

        #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        print(f"\n\tFinished creating files {input_files}")

# `TODO`

- [X] Load the big dataset with preprocessing `Frame`
- [X] Creating summary plots for the whole dataset
- [X] Transform the data from videoId 1 from 25Hz to 30Hz to be consistent with the rest.
- [X] Extract the irregular `IBI` and transform it into regular array @30Hz containing peaks.
- [X] Delete NaN from the columns by joining them by `data_type`
- [X] Add class labels for Valence-Arousal according to the paper.
- [X] Save the dataset with time-series @30Hz with index: `[ParticipantID, VideoID, DataGroup]` and values corresponding to features in the groups: `[Annotations, Behavior, Physio]`

In [None]:
# The dictionary below can be used to recover the column names per data type
{
    'Annotations': ['Valence', 'Arousal'],
    'Behavior': ['HM_Pitch', 'HM_Yaw', 'EM_Pitch', 'EM_Yaw', 'LEM_Pitch', 'LEM_Yaw', 'REM_Pitch', 'REM_Yaw', 'LPD_PD', 'RPD_PD'],
    'Physio': ['ACC_ACC_X', 'ACC_ACC_Y', 'ACC_ACC_Z', 'SKT_SKT', 'EDA_EDA', 'BVP_BVP', 'HR_HR', 'IBI_R_Peaks']
}
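
As a usage sketch, such a dictionary can be used to select the feature columns of one data type from the final dataframe. The variable name `COLUMNS_PER_DATA_TYPE`, the abbreviated column lists, and the toy values below are assumptions made for illustration.

```python
import pandas as pd

# Abbreviated version of the dictionary above, assigned to a hypothetical variable name
COLUMNS_PER_DATA_TYPE = {
    "Annotations": ["Valence", "Arousal"],
    "Physio": ["HR_HR", "IBI_R_Peaks"],
}
id_cols = ["ParticipantID", "VideoID", "TimeStamp"]

# Toy stand-in for the final dataset
toy = pd.DataFrame({
    "ParticipantID": [1, 1],
    "VideoID": [2, 2],
    "TimeStamp": [0.0, 1 / 30],
    "Valence": [5.0, 5.5],
    "Arousal": [4.0, 4.2],
    "HR_HR": [70.0, 71.0],
    "IBI_R_Peaks": [0, 1],
})

# Keep the identifier columns plus the features of a single data type
physio = toy[id_cols + COLUMNS_PER_DATA_TYPE["Physio"]]
print(physio.columns.tolist())
```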

In [None]:
print(">> FINISHED WITHOUT ERRORS!!")