# Exploratory Data Analysis for Korean Emotion Multimodal Database in 2020 (KEMDy20)

### Data Information

- KEMDy20 is a multimodal emotion dataset containing:
    - Speech data
    - Text data transcribed from the speech
    - Electrodermal activity (EDA)
    - Inter-beat interval (IBI)
    - Wrist skin temperature

### Data Characteristics

| Characteristic | Description |
|---|---|
| Number of participants | 80 |
| Age of participants | 19-30 |
| Language | Korean |
| Device | Empatica E4 |
| Stimulus | 6 videos (~5 minutes each) + conversation |
| Annotation | 10 external annotators (to be confirmed) |
| Labels | 7 emotions ("angry", "sad", "happy", "disgust", "fear", "surprise", "neutral"); arousal and valence (1-5 scale) |

### Initial Observation

- Data was collected during conversations.
- The videos watched before the conversation can be considered stimulus videos, but the conversation itself is the primary stimulus, since a participant's emotions may depend on the emotions of the other person in the conversation.
- The recordings were labeled by external annotators. The dataset therefore captures observed emotions rather than the emotions the participants actually experienced, so the contribution of the bio-signal modalities may not be crucial.
- Annotations were aggregated solely by majority vote of the external annotators. This may introduce ambiguity when annotators disagree and highly contrasting emotion labels are assigned to the same sample.
- Video was recorded during the conversation.
- The external annotators **tagged the speech segments while listening to the recorded utterances**. It is not clear whether they also had access to the participants' facial expressions. If they did watch the videos, the labels could be biased relative to the released data, because the dataset contains no facial information.
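To make the majority-vote concern concrete, here is a minimal sketch of how a tie between annotators leaves a segment without a single winning label. The votes are hypothetical, and `majority_vote` is an illustrative helper, not the aggregation procedure actually used to build the dataset.

```python
from collections import Counter

def majority_vote(votes):
    """Return (winning_labels, agreement) for the votes on one segment.

    winning_labels lists every label tied for the highest count (sorted);
    agreement is the share of annotators who chose one such label.
    """
    counts = Counter(votes)
    top = max(counts.values())
    winners = sorted(label for label, n in counts.items() if n == top)
    return winners, top / len(votes)

# Hypothetical votes from 10 annotators for two segments
clear_votes = ["happy"] * 7 + ["neutral"] * 3
tied_votes = ["happy"] * 5 + ["sad"] * 5

clear_result = majority_vote(clear_votes)
tied_result = majority_vote(tied_votes)
print(clear_result)
print(tied_result)
```

A low agreement value or more than one winning label flags a segment whose aggregated label should be treated with caution.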

### EDA for Bio-signal Data

### Questions to be explored
- Is the data ready for ML/DL-based analysis, or is some data cleansing needed?
- What is the size of the data? Is any data excluded or missing?
- What is the distribution of the different labels? Are there any outliers?
- Are any special considerations needed before building a classification/prediction model?

##### Answers:
- The dataset is well organized into folders for the different modalities. However, the number of samples does not match across modalities (i.e. some modalities have missing samples).
- This discrepancy in the number of samples may have various causes, including faulty data due to device malfunction or a participant's decision to opt out.

# Data organization

- The data for each modality is organized into 40 folders, each representing a session
- Each session folder contains data from 6 conversations between two participants

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from dataprep.eda import plot_missing, plot, plot_correlation
import warnings
import shutil
warnings.filterwarnings('ignore')
from tqdm import tqdm
%matplotlib inline

In [None]:
DATA_DIR = "F:/sudarshan/ABAW/ETRI"

## Check Annotations

- Annotations are organized in CSV files for each session

In [None]:
annotaiton_file = pd.read_csv(f"{DATA_DIR}/annotation/Sess01_eval.csv")

In [None]:
# Check whether any column has missing data.
# The original loop overwrote isNull on every iteration, so it reported only the last column.
isNull = annotaiton_file.isnull().any().any()
print(isNull)

In [None]:
annotaiton_file.describe()

### Observations
- The annotation file contains:
    - the annotations by all 10 evaluators
    - the total evaluation (average)
    - segment times for the WAV file (possibly not relevant, since the WAV files are already split into the corresponding segments)
    - labels for both emotion and arousal/valence

- We are interested in the overall emotion, arousal, and valence labels (total evaluation), so we can discard the other columns

### Select required columns only

In [None]:
annotaiton_data_ = annotaiton_file.iloc[1:, [3,4,5,6]]
annotaiton_data_.columns = ["segment_id", 'emotion', 'arousal', 'valence']
annotaiton_data_

In [None]:
annotaiton_data_['emotion'].value_counts()

In [None]:
# cleanup the dataframe to include only the columns required for training
def prepare_cleaned_dataframe(session_num):
    annot_df = pd.read_csv(f"{DATA_DIR}/annotation/Sess{session_num:02}_eval.csv") 
    annot_df = annot_df.iloc[1:, [3,4,5,6]]
    annot_df.columns = ["segment_id", 'emotion', 'arousal', 'valence'] 
    return annot_df

In [None]:
# DataFrame.append was removed in pandas 2.0, so collect the sessions and concatenate once
session_dfs = [prepare_cleaned_dataframe(session_num) for session_num in range(1, 41)]
full_df = pd.concat(session_dfs, ignore_index=True)
print(full_df.shape)

In [None]:
full_df['emotion'].value_counts()

#### Some samples were found to have multiple emotion labels.

In this case, we can select the label with the fewest occurrences, which reduces class imbalance to some extent.
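The rule above can also be applied programmatically. The following is a minimal sketch on toy labels; `resolve_to_rarest` is a hypothetical helper, and it assumes label frequencies are estimated from single-label rows only. It is not the mapping used in this notebook, which is written out by hand.

```python
import pandas as pd

def resolve_to_rarest(labels: pd.Series, sep: str = ";") -> pd.Series:
    """Replace each multi-label entry with its least frequent component."""
    # Frequencies come from rows that carry exactly one label.
    single = labels[~labels.str.contains(sep, regex=False)]
    freq = single.value_counts()

    def pick(label):
        parts = label.split(sep)
        # A component never seen on its own counts as 0 and therefore wins.
        return min(parts, key=lambda p: freq.get(p, 0))

    return labels.map(pick)

# Toy labels: neutral x3, happy x2, sad x1, plus three multi-label rows
toy = pd.Series([
    "neutral", "neutral", "neutral", "happy", "happy", "sad",
    "happy;neutral", "neutral;sad", "happy;sad",
])
resolved = resolve_to_rarest(toy)
print(resolved.tolist())
```

Misspelled labels such as the dataset's `disqust` would still need a separate rename, and ties between equally rare components are broken by position here, which may not be the desired behavior.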

## Fix labels using the label with the fewest occurrences

In [None]:
replace_map = {"happy;neutral": "happy", 
              "angry;neutral": "angry",
               "neutral;sad": "sad",
               "surprise;neutral": "surprise",
               "neutral;disqust": "disgust",
               "neutral;fear": "fear",
               "happy;surprise": "surprise",
               "angry;disqust": "angry",
               "happy;fear": "fear",
               "angry;neutral;disqust": "disgust",
               "neutral;disqust;sad": "disgust",
               "angry;neutral;disqust;fear;sad": "fear",
               "happy;sad": "sad",
               "happy;surprise;neutral": "surprise",
               "happy;angry;neutral": "angry",
               "happy;neutral;disqust": "disgust",
               "happy;neutral;fear": "fear", 
               "disqust": "disgust"
              }

In [None]:
full_df['emotion'] = full_df['emotion'].replace(replace_map)
full_df

In [None]:
full_df['emotion'].value_counts()

In [None]:
full_df.shape

In [None]:
full_df.to_csv("labels.csv")

# Electrodermal Activity (EDA)
- EDA data is organized into folders for the 40 sessions
- Each session has data from 6 paired conversations, i.e. 12 recordings

#### Note that each file has three columns; the third column shows which segment the data belongs to

In [None]:
df_eda = pd.read_csv(f"{DATA_DIR}/EDA/Session01/Sess01_script01_User001F.csv", skiprows=2, header=None, names=["eda_value", "time_stamp", "segment_id"])
print(df_eda)

In [None]:
df_eda['eda_value'].plot()

#### Select only the rows that belong to target segments

In [None]:
df_target_segments = df_eda.dropna()
print(df_target_segments.shape)
df_target_segments.head(20)

#### As the recording contains different segments, let's separate them based on segment ID
- The number of segments may vary, so find the unique segments
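As a small illustration of the splitting step, the sketch below groups a toy recording by `segment_id`. The values and the segment names `seg_a` and `seg_b` are made up; only the column names follow the EDA files read above.

```python
import numpy as np
import pandas as pd

# Toy recording: rows outside any target segment have NaN in segment_id.
toy_eda = pd.DataFrame({
    "eda_value": [0.10, 0.11, 0.12, 0.13, 0.14, 0.15],
    "time_stamp": [0.0, 0.25, 0.5, 0.75, 1.0, 1.25],
    "segment_id": [np.nan, "seg_a", "seg_a", np.nan, "seg_b", "seg_b"],
})

# Keep only rows that belong to a segment, then split by segment ID.
labelled = toy_eda.dropna(subset=["segment_id"])
segments = {
    seg_id: grp["eda_value"].to_numpy()
    for seg_id, grp in labelled.groupby("segment_id")
}
print(sorted(segments))
print(segments["seg_a"].tolist())
```

Passing `subset=["segment_id"]` drops only rows without a segment, whereas a plain `dropna()` also drops rows whose signal value is missing; which behavior is preferable depends on how gaps in the signal should be handled.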

### Extract all EDA segments

In [None]:
seg_ids_annot = full_df['segment_id'].values
len(seg_ids_annot)

In [None]:
total_segments = 0
annotated_ids = set(seg_ids_annot)  # set lookup is faster than scanning the array
for sess__ in [f'Session{sess:02}' for sess in range(1,41)]:
    # one output folder per session
    out_dir_ = f"{DATA_DIR}/preprocessed/EDA/{sess__}"
    os.makedirs(out_dir_, exist_ok=True)

    for eda_file in os.listdir(f"{DATA_DIR}/EDA/{sess__}"):
        df_eda_ = pd.read_csv(f"{DATA_DIR}/EDA/{sess__}/{eda_file}", skiprows=2, header=None, names=["eda_value", "time_stamp", "segment_id"])
        # rows without a segment_id do not belong to any target segment
        df_target_segments_ = df_eda_.dropna()
        groups = df_target_segments_.groupby("segment_id")["eda_value"]

        # write one CSV per segment, keeping only annotated segments
        for seg_id, seg_values in groups:
            if seg_id in annotated_ids:
                seg_values.to_csv(f"{out_dir_}/{seg_id}.csv", index=False)
                total_segments += 1
            else:
                print(f"{seg_id} has no annotation available.")
print(f"Extracted {total_segments} segments")

#### There were 13415 segments in the EDA data.

However, not all of them are present in the annotation file, so only the segments that have annotations were selected.

# IBI Data

IBI data is organized in a similar way: 40 session folders containing the segments.

### Extract all IBI segments

In [None]:
total_segments = 0
annotated_ids = set(seg_ids_annot)  # set lookup is faster than scanning the array
for sess__ in [f'Session{sess:02}' for sess in range(1,41)]:
    # one output folder per session
    out_dir_ = f"{DATA_DIR}/preprocessed/IBI/{sess__}"
    os.makedirs(out_dir_, exist_ok=True)

    for ibi_file in os.listdir(f"{DATA_DIR}/IBI/{sess__}"):
        df_ibi = pd.read_csv(f"{DATA_DIR}/IBI/{sess__}/{ibi_file}", skiprows=1, header=None, names=["time_diff", "ibi_value", "time_stamp", "segment_id"])
        # rows without a segment_id do not belong to any target segment
        df_target_segments_ = df_ibi.dropna()
        groups = df_target_segments_.groupby("segment_id")["ibi_value"]

        # write one CSV per segment, keeping only annotated segments
        for seg_id, seg_values in groups:
            if seg_id in annotated_ids:
                seg_values.to_csv(f"{out_dir_}/{seg_id}.csv", index=False)
                total_segments += 1
            else:
                print(f"{seg_id} has no annotation available.")
print(f"Extracted {total_segments} segments")

#### Only 10204 segments were found, so some IBI data may be missing

# Temperature data

In [None]:
total_segments = 0
annotated_ids = set(seg_ids_annot)  # set lookup is faster than scanning the array
for sess__ in [f'Session{sess:02}' for sess in range(1,41)]:
    # one output folder per session
    out_dir_ = f"{DATA_DIR}/preprocessed/TEMP/{sess__}"
    os.makedirs(out_dir_, exist_ok=True)

    for temp_file in os.listdir(f"{DATA_DIR}/TEMP/{sess__}"):
        df_temp = pd.read_csv(f"{DATA_DIR}/TEMP/{sess__}/{temp_file}", skiprows=2, header=None, names=["temp_value", "time_stamp", "segment_id"])
        # rows without a segment_id do not belong to any target segment
        df_target_segments_ = df_temp.dropna()
        groups = df_target_segments_.groupby("segment_id")["temp_value"]

        # write one CSV per segment, keeping only annotated segments
        for seg_id, seg_values in groups:
            if seg_id in annotated_ids:
                seg_values.to_csv(f"{out_dir_}/{seg_id}.csv", index=False)
                total_segments += 1
            else:
                print(f"{seg_id} has no annotation available.")
print(f"Extracted {total_segments} segments")
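The EDA, IBI, and TEMP extraction loops differ only in the modality folder, the number of skipped header rows, and the column names, so they could share one helper. The sketch below is one possible shape for it. `extract_segments` and its parameters are hypothetical, and the demonstration runs on a made-up file in a temporary directory rather than on the dataset.

```python
import os
import tempfile
import pandas as pd

def extract_segments(src_file, out_dir, columns, value_col, skiprows, annotated_ids):
    """Write one CSV per annotated segment found in src_file; return the count."""
    df = pd.read_csv(src_file, skiprows=skiprows, header=None, names=columns)
    df = df.dropna()  # rows without a segment_id are outside target segments
    os.makedirs(out_dir, exist_ok=True)
    written = 0
    for seg_id, values in df.groupby("segment_id")[value_col]:
        if seg_id in annotated_ids:
            values.to_csv(os.path.join(out_dir, f"{seg_id}.csv"), index=False)
            written += 1
    return written

# Toy demonstration (file contents and segment names are made up).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "toy_recording.csv")
    with open(src, "w") as fh:
        fh.write("header line 1\nheader line 2\n")
        fh.write("0.10,0.00,\n0.11,0.25,seg_a\n0.12,0.50,seg_a\n0.13,0.75,seg_b\n")
    n_written = extract_segments(
        src, os.path.join(tmp, "out"),
        columns=["eda_value", "time_stamp", "segment_id"],
        value_col="eda_value", skiprows=2, annotated_ids={"seg_a"},
    )
    written_files = sorted(os.listdir(os.path.join(tmp, "out")))
print(n_written, written_files)
```

With such a helper, each modality would reduce to a call with its own `columns`, `value_col`, and `skiprows` values.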

# Audio & Text 

- The dataset also contains the audio recordings and the transcripts of the conversations

In [None]:
WAV_DIR = "F:/sudarshan/ABAW/ETRI/wav"

In [None]:
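files = []
for sess__ in [f'Session{sess:02}' for sess in range(1,41)]: 
    files.extend(os.listdir(f"{WAV_DIR}/{sess__}"))

# each segment should have one .wav and one .txt file
print(len(files)/2)
# valid_segment_ids was never defined in this notebook;
# assumption: the annotated segment IDs were intended
valid_segment_ids = set(seg_ids_annot)
count_valid = 0
for file____ in files:
    if file____.endswith(".txt"):
        if file____[:-4] in valid_segment_ids:
            count_valid = count_valid + 1 
        else:
            # transcript without a matching annotation
            print(file____)
print(count_valid)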

### We have 13462 samples with both audio and text

# WAV and text modalities are complete, so check how many samples are missing relative to them
### First, take all the segments from audio, as audio and text have matching files for every segment

In [None]:
# WAV files
WAV_DIR = "F:/sudarshan/ABAW/ETRI/wav"
files = []
for sess__ in [f'Session{sess:02}' for sess in range(1,41)]: 
    files.extend(os.listdir(f"{WAV_DIR}/{sess__}"))
segment_ids = [f_[:-4] for f_ in files if '.wav' in f_]
print(len(segment_ids))

In [None]:
# Annotated samples
seg_ids_annot = full_df['segment_id'].values
len(seg_ids_annot)

- This number exactly matches the number of annotated samples
- Now let's check the EDA and TEMP samples (IBI is ignored because it has fewer samples)

In [None]:
data_dir = "F:/sudarshan/ABAW/ETRI/preprocessed"
files_EDA = []
files_TEMP = []
count  = 0
for sess__ in [f'Session{sess:02}' for sess in range(1,41)]: 
    files_EDA.extend(os.listdir(f"{data_dir}/EDA/{sess__}"))
    files_TEMP.extend(os.listdir(f"{data_dir}/TEMP/{sess__}"))

In [None]:
print(f"EDA: {len(files_EDA)}, TEMP: {len(files_TEMP)}, Annotations: {len(seg_ids_annot)}")   

### Find the missing ones

In [None]:
13462-12715 

In [None]:
missing_in_eda = []
for segment_id_ in seg_ids_annot:
    if segment_id_+".csv" not in files_EDA:
        missing_in_eda.append(segment_id_)
len(missing_in_eda)

In [None]:
missing_in_temp = []
for segment_id_ in seg_ids_annot:
    if segment_id_+".csv" not in files_TEMP:
        missing_in_temp.append(segment_id_)
len(missing_in_temp) 

In [None]:
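# Check whether the same segments are missing in both modalities
print(set(missing_in_eda) == set(missing_in_temp))
# List any segments that are missing in EDA but present in TEMP
for eda_miss_ in missing_in_eda:
    if eda_miss_ not in missing_in_temp:
        print(f"{eda_miss_} is missing in EDA only")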
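The same comparison can be expressed with set operations, which also show in which direction two modalities differ. This is a minimal sketch on made-up file names; `missing_ids` is a hypothetical helper.

```python
import os

def missing_ids(annotated_ids, file_names):
    """Return the annotated segment IDs that have no file in file_names."""
    available = {os.path.splitext(name)[0] for name in file_names}
    return set(annotated_ids) - available

# Made-up example: 4 annotated segments, incomplete EDA and TEMP folders
annotated = {"seg_a", "seg_b", "seg_c", "seg_d"}
eda_files = ["seg_a.csv", "seg_b.csv"]
temp_files = ["seg_a.csv", "seg_b.csv", "seg_c.csv"]

missing_eda = missing_ids(annotated, eda_files)
missing_temp = missing_ids(annotated, temp_files)

same_gaps = missing_eda == missing_temp       # identical sets of missing IDs?
only_eda_missing = missing_eda - missing_temp  # missing in EDA but not in TEMP
only_temp_missing = missing_temp - missing_eda # missing in TEMP but not in EDA
print(same_gaps, sorted(only_eda_missing), sorted(only_temp_missing))
```

Set membership is also much faster than testing each ID against a list, which matters with more than ten thousand segments.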

### Get the final annotation file based on the 12715 samples available in the EDA data

In [None]:
segment_id_in_eda = [seg_id[:-4] for seg_id in files_EDA] # removed extension

In [None]:
len(segment_id_in_eda)

In [None]:
full_df 

In [None]:
final_annot_df = full_df[full_df['segment_id'].isin(segment_id_in_eda)]
final_annot_df.to_csv("annotation_final.csv", index=False)

##### Now select the matching text and WAV files (for a text-audio bimodal model we can later use the full dataset; for the multimodal model we use only the 12715 samples)

In [None]:
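# shutil.copy does not create destination folders, so make sure they exist first
for sub_dir in ("WAV", "TXT"):
    os.makedirs(f"{DATA_DIR}/KEMDy20/{sub_dir}", exist_ok=True)
segment_ids_in_eda_set = set(segment_id_in_eda)  # faster membership checks
for sess__ in [f'Session{sess:02}' for sess in range(1,41)]: 
    for wav_or_txt in os.listdir(f"{WAV_DIR}/{sess__}"):
        if wav_or_txt[:-4] in segment_ids_in_eda_set: 
            src_file = f"{WAV_DIR}/{sess__}/{wav_or_txt}"
            dest_dir = f"{DATA_DIR}/KEMDy20/WAV" if wav_or_txt.endswith(".wav") else f"{DATA_DIR}/KEMDy20/TXT"
            shutil.copy(src_file, dest_dir)        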