# Part 1. DEAP Dataset

In this series of tutorials, we will work on basic EEG analysis using Python with MNE and PyTorch.  The case study is the DEAP dataset, a benchmark dataset for EEG emotion recognition.  In this Part 1, we focus on exploring the dataset.

This set of tutorials assumes that:

1.  You already have a basic understanding of Python
2.  You have some experience with scikit-learn and some knowledge of machine learning
3.  You have a bit of experience with PyTorch and some knowledge of deep learning

In this dataset, there are 32 participants in total, and each participant watches 40 one-minute videos.  Thus <code>s01.dat</code> holds 40 trials, one per video (the code comments below also call them batches or samples).  The whole dataset therefore contains 40*32=1280 trials.

Each dat file (e.g., <code>s01.dat</code>) contains the data and the labels:
- Data: 40 x 40 x 8064 [ video/trial x channel x sample ]
- Labels: 40 x 4 [ video/trial x rating ]
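To make this layout concrete, here is a minimal sketch that writes a synthetic stand-in with the same structure (a dict with <code>data</code> and <code>labels</code> keys, the keys used by the loader later in this tutorial) and reads it back. The file name <code>s_demo.dat</code> and the contents are made up for illustration; they are not DEAP data.

```python
import os
import pickle
import tempfile

import numpy as np

# Synthetic stand-in for one participant file (zeros and random ratings, not real DEAP data).
rng = np.random.default_rng(0)
fake_subject = {
    "data": np.zeros((40, 40, 8064), dtype=np.float32),  # trial x channel x sample
    "labels": rng.uniform(1, 9, size=(40, 4)),           # trial x rating
}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "s_demo.dat")  # hypothetical file name
    with open(demo_path, "wb") as f:
        pickle.dump(fake_subject, f)

    # encoding='latin1' matters when reading pickles written by Python 2;
    # it is harmless for this Python 3 stand-in.
    with open(demo_path, "rb") as f:
        subject = pickle.load(f, encoding="latin1")

print(sorted(subject.keys()))
print(subject["data"].shape)
print(subject["labels"].shape)
```

Opening the file in a <code>with</code> block also makes sure it is closed again once it has been read.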

Of the 40 channels, 32 are EEG and the remaining 8 come from other sensors such as EOG (see Section 6.1 of the original paper).  We shall extract only the first 32 channels.  As for the 8064 samples: the data is downsampled to 128 Hz, so a one-minute video corresponds to 60 x 128 = 7680 samples.  The remaining 384 samples amount to another 3 seconds (8064 = 128 Hz x 63 seconds).  The paper does not discuss this in detail, but the dataset description mentions a 3-second pre-trial baseline, which most likely accounts for them.
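The arithmetic can be checked directly. The constant names below are only for illustration:

```python
SAMPLING_RATE_HZ = 128      # sampling rate after downsampling
SAMPLES_PER_TRIAL = 8064    # length of the last axis of the data array

trial_seconds = SAMPLES_PER_TRIAL / SAMPLING_RATE_HZ
video_samples = 60 * SAMPLING_RATE_HZ
extra_samples = SAMPLES_PER_TRIAL - video_samples

print(trial_seconds)                     # 63.0
print(video_samples)                     # 7680
print(extra_samples / SAMPLING_RATE_HZ)  # 3.0
```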

The four labels correspond to valence, arousal, dominance, and liking, in this order.  We will only use valence and arousal, so only indices 0 and 1 of the labels will be extracted.
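As a small sketch of that extraction, the ratings below are made up, and the threshold of 5 is the one the dataset class later in this tutorial uses to turn a rating into a binary class. Slicing with <code>[:, :1]</code> rather than <code>[:, 0]</code> keeps the column axis, so the arrays can be stacked with <code>np.vstack</code> afterwards.

```python
import numpy as np

# Made-up ratings for three trials; columns 0 and 1 are valence and arousal,
# the remaining two columns are ignored.
labels = np.array([
    [7.1, 3.2, 6.0, 5.5],
    [4.9, 6.8, 2.0, 3.0],
    [5.0, 5.1, 8.0, 7.0],
])

valence = labels[:, :1]    # slicing keeps the column axis: shape (3, 1)
arousal = labels[:, 1:2]   # shape (3, 1)

# Binarise: ratings strictly greater than 5 become 1.0, everything else 0.0.
high_valence = (valence > 5).astype(float)
high_arousal = (arousal > 5).astype(float)

print(valence.shape)          # (3, 1)
print(high_valence.ravel())   # [1. 0. 0.]
print(high_arousal.ravel())   # [0. 1. 1.]
```

Note that a rating of exactly 5 falls into class 0 because the comparison is strict.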

In [1]:
import torch

import os
import pickle
import numpy as np

Set the device: use CUDA if a GPU is available, otherwise fall back to the CPU.

In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Configured device: ", device)

Configured device:  cuda


## 1. Loading dataset

Let's first create a simple dataset loader.   The code is explained using comments and is quite self-explanatory.

In [3]:
class Dataset(torch.utils.data.Dataset):
    
    def __init__(self, path, stim):
        _, _, filenames = next(os.walk(path))
        filenames = sorted(filenames)
        all_data = []
        all_label = []
        for dat in filenames:
            #encoding='latin1' is needed to read Python 2 pickles in Python 3
            with open(os.path.join(path, dat), 'rb') as f:
                temp = pickle.load(f, encoding='latin1')
            all_data.append(temp['data'])
            
            if stim == "Valence":
                all_label.append(temp['labels'][:,:1])   #column 0 is valence
            elif stim == "Arousal":
                all_label.append(temp['labels'][:,1:2])  #column 1 is arousal
            else:
                raise ValueError(f"unknown stim: {stim!r}")
                
        self.data = np.vstack(all_data)   #shape: (1280, 40, 8064) ==> 32 participants x 40 trials = 1280 trials
        self.label = np.vstack(all_label) #shape: (1280, 1) ==> one rating per trial (which one depends on the param "stim")
        
        del temp, all_data, all_label

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        single_data  = self.data[idx]
        single_label = (self.label[idx] > 5).astype(float)   #convert the scale to either 0 or 1 (to classification problem)
        
        batch = {
            'data': torch.Tensor(single_data),
            'label': torch.Tensor(single_label)
        }
        
        return batch

Let's try loading the dataset.

In [4]:
path = "data"  #create a folder "data" and put s01.dat, ..., s32.dat from the preprocessed folder of the DEAP dataset inside it
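Since the loader reads every file it finds in this folder, it may be worth checking the folder's contents first. The helper below is a hypothetical sketch (<code>missing_deap_files</code> is not part of any library) that assumes the files are named <code>s01.dat</code> to <code>s32.dat</code>; it is demonstrated on a temporary folder rather than on real data.

```python
import os
import tempfile

def missing_deap_files(path, n_participants=32):
    """Return the expected sXX.dat names that are not present in `path`."""
    expected = [f"s{i:02d}.dat" for i in range(1, n_participants + 1)]
    present = set(os.listdir(path))
    return [name for name in expected if name not in present]

# Demonstration on a temporary folder that only contains s01.dat and s02.dat.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("s01.dat", "s02.dat"):
        open(os.path.join(tmp, name), "wb").close()   # create an empty file
    missing = missing_deap_files(tmp)

print(len(missing))   # 30
print(missing[:2])    # ['s03.dat', 's04.dat']
```

On the real folder, an empty list from <code>missing_deap_files("data")</code> would mean all 32 files are in place.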

In [5]:
dataset_valence = Dataset(path, "Valence")
dataset_arousal = Dataset(path, "Arousal")

We can look at one sample by indexing the dataset.  Indexing is automatically mapped to the <code>__getitem__</code> function of the <code>Dataset</code> class.

In [6]:
dataset_valence[0]

{'data': tensor([[ 9.4823e-01,  1.6533e+00,  3.0137e+00,  ..., -2.8265e+00,
          -4.4772e+00, -3.6769e+00],
         [ 1.2471e-01,  1.3901e+00,  1.8351e+00,  ..., -2.9870e+00,
          -6.2878e+00, -4.4743e+00],
         [-2.2165e+00,  2.2920e+00,  2.7464e+00,  ..., -2.6371e+00,
          -7.4065e+00, -6.7559e+00],
         ...,
         [ 2.3078e+02,  6.9672e+02,  1.1951e+03,  ...,  1.0108e+03,
           1.2831e+03,  1.5200e+03],
         [-1.5418e+03, -1.6180e+03, -1.6927e+03,  ..., -1.5784e+04,
          -1.5782e+04, -1.5781e+04],
         [ 6.3905e-03,  6.3905e-03,  6.3905e-03,  ..., -9.7608e-02,
          -9.7608e-02, -9.7608e-02]]),
 'label': tensor([1.])}

In [7]:
print("Shape of data: ", dataset_valence[0]['data'].shape)  #40 channels of data, 8064 samples (63 seconds at 128 Hz)
print("Shape of label: ", dataset_valence[0]['label'].shape) #just 1 single label; 0 or 1

Shape of data:  torch.Size([40, 8064])
Shape of label:  torch.Size([1])
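Because <code>__getitem__</code> simply indexes the underlying NumPy arrays, it accepts slices as well as integers, which is what makes an expression such as <code>dataset_valence[:]</code> work. The toy class below is a hypothetical stand-in with random data (not the class defined above) that shows this, and also how a <code>DataLoader</code> stacks individual items into mini-batches.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Hypothetical stand-in: 6 trials, 4 channels, 10 samples of random data."""

    def __init__(self):
        rng = np.random.default_rng(0)
        self.data = rng.standard_normal((6, 4, 10)).astype(np.float32)
        self.label = rng.uniform(1, 9, size=(6, 1))

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        # idx may be an int or a slice; NumPy indexing handles both.
        return {
            "data": torch.from_numpy(self.data[idx]),
            "label": torch.from_numpy((self.label[idx] > 5).astype(np.float32)),
        }

toy = ToyDataset()
print(toy[0]["data"].shape)   # torch.Size([4, 10])
print(toy[:]["data"].shape)   # torch.Size([6, 4, 10])

# A DataLoader calls __getitem__ with integer indices and stacks the results.
loader = DataLoader(toy, batch_size=4, shuffle=False)
first = next(iter(loader))
print(first["data"].shape)    # torch.Size([4, 4, 10])
print(first["label"].shape)   # torch.Size([4, 1])
```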


Let's try to look at our data and label distribution.

In [8]:
data = dataset_valence[:]['data']
label = dataset_valence[:]['label']

In [9]:
#so we got 1280 trials (40 videos * 32 participants = 1280), each with 40 channels of data and 8064 samples per channel
data.shape  

torch.Size([1280, 40, 8064])

In [10]:
#so we got 1280 labels, i.e., one label per video
label.shape  

torch.Size([1280, 1])

Let's count how many 0s and 1s there are in the valence dataset, to see whether the classes are imbalanced.

In [11]:
cond_1 = label == 1
cond_0 = label == 0

print("Labels 1 in valence dataset: ", len(label[cond_1]))
print("Labels 0 in valence dataset: ", len(label[cond_0]))

Labels 1 in valence dataset:  708
Labels 0 in valence dataset:  572
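One way to read these counts: a classifier that always predicts the majority class already reaches a certain accuracy, and any model we train should beat it. A minimal sketch using the two counts printed above:

```python
n_high, n_low = 708, 572   # valence counts printed above
total = n_high + n_low

# Accuracy of always predicting the more frequent class.
majority_baseline = max(n_high, n_low) / total

print(total)                        # 1280
print(round(majority_baseline, 4))  # 0.5531
```

So the imbalance is mild, and roughly 55% accuracy is the figure to beat for valence.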


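Let's also count the labels in the arousal dataset, to see whether it is imbalanced as well.

In [12]:
label_arousal = dataset_arousal[:]['label']   #take the labels from the arousal dataset, not the valence one

cond_1 = label_arousal == 1
cond_0 = label_arousal == 0

print("Labels 1 in arousal dataset: ", len(label_arousal[cond_1]))
print("Labels 0 in arousal dataset: ", len(label_arousal[cond_0]))

(The output recorded for this cell, 708 and 572, came from a run that counted the valence labels a second time, so it is not shown here; re-run the cell to obtain the arousal counts.)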


To confirm that the first 32 channels are EEG and the remaining 8 come from other sensors, let's check the median value of each channel to see whether there is a pattern.

In [13]:
for i in range(40):
    print(f"Median of {i} data: {torch.median(data[:, i, :])}")

Median of 0 data: 0.05827333778142929
Median of 1 data: 0.024529436603188515
Median of 2 data: -0.019204378128051758
Median of 3 data: 0.033645644783973694
Median of 4 data: -0.033030420541763306
Median of 5 data: -0.016304221004247665
Median of 6 data: -0.008036154322326183
Median of 7 data: 0.09355251491069794
Median of 8 data: -0.00792337954044342
Median of 9 data: 0.021872472018003464
Median of 10 data: 0.004741182550787926
Median of 11 data: -0.02171526849269867
Median of 12 data: -0.011923680081963539
Median of 13 data: -0.04902170971035957
Median of 14 data: -0.04108745604753494
Median of 15 data: 0.033856555819511414
Median of 16 data: 0.05146871879696846
Median of 17 data: 0.03564863279461861
Median of 18 data: -0.017957160249352455
Median of 19 data: 0.007688858546316624
Median of 20 data: 0.043062545359134674
Median of 21 data: 0.019127536565065384
Median of 22 data: -0.0017579937120899558
Median of 23 data: -0.006185607053339481
Median of 24 data: 0.015526460483670235
...  (output truncated)

As we can see, channels 0 to 31 are clearly EEG, while the channels from index 32 onward are not.
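The same check can be done without a Python loop by taking the median over the trial and sample axes at once. The sketch below uses synthetic data, and the large offset on the last 8 channels is invented purely so that the two groups are easy to tell apart; real non-EEG channels need not look like this.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_channels, n_samples = 8, 40, 256

# Synthetic stand-in: 32 zero-centred channels followed by 8 channels with a large offset.
data = rng.standard_normal((n_trials, n_channels, n_samples))
data[:, 32:, :] += 1000.0

# Median over trials (axis 0) and samples (axis 2) leaves one value per channel.
medians = np.median(data, axis=(0, 2))

print(medians.shape)                      # (40,)
print(np.abs(medians[:32]).max() < 1.0)   # first 32 channels sit near zero
print(medians[32:].min() > 900.0)         # offset channels stand out
```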

### Summary

Looking at the dataset this way surfaces two things we need to fix in how we process it:
1. First, make sure we take only the **32 EEG channels**.  Of course, feel free to play around with the other channels as well, but this tutorial focuses on EEG.
2. **Data segmentation** is the process of cutting each trial into shorter segments.  For example, each 8064-sample trial can be divided into 12 contiguous segments of 672 samples (5.25 seconds at 128 Hz), which multiplies the number of samples by 12 and usually helps training. The steps are:
         a. Reshape so that (1280, 32, 8064) becomes (1280, 32, 12, 672); putting the segment axis before the sample axis keeps each segment contiguous in time
         b. Then permute (1280, 32, 12, 672) to (1280, 12, 32, 672)
         c. Then reshape to (1280*12, 32, 672)
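The segmentation steps can be checked on a tiny array whose values are the sample indices themselves. The sizes here are made up (2 trials, 3 channels, 12 samples, 4 segments). The order in which the time axis is split matters: splitting it as (segments, samples per segment) keeps each segment contiguous, whereas splitting it as (samples per segment, segments) interleaves the samples.

```python
import numpy as np

n_trials, n_channels, n_samples, n_segments = 2, 3, 12, 4
seg_len = n_samples // n_segments   # 3 samples per segment

# Every trial/channel holds the sample indices 0..11, so segments are easy to read.
data = np.tile(np.arange(n_samples), (n_trials, n_channels, 1))

# Split the time axis as (segments, samples per segment): segments stay contiguous.
segmented = (
    data.reshape(n_trials, n_channels, n_segments, seg_len)
        .transpose(0, 2, 1, 3)   # (trials, segments, channels, seg_len)
        .reshape(n_trials * n_segments, n_channels, seg_len)
)
print(segmented.shape)    # (8, 3, 3)
print(segmented[0, 0])    # [0 1 2]
print(segmented[1, 0])    # [3 4 5]

# Splitting as (samples per segment, segments) interleaves the samples instead.
interleaved = (
    data.reshape(n_trials, n_channels, seg_len, n_segments)
        .transpose(0, 3, 1, 2)
        .reshape(n_trials * n_segments, n_channels, seg_len)
)
print(interleaved[0, 0])  # [0 4 8]
```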

Note that since the data has already been preprocessed by the authors, we don't have to do anything more here.  With raw recordings, however, preprocessing steps such as min-max normalization, notch filters, and band-pass filters would normally come first.
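As an illustration of the simplest of these steps, here is a minimal sketch of min-max normalization applied per channel and per trial. The function name <code>min_max_per_channel</code> and the <code>eps</code> guard are choices made for this example, and the input is synthetic noise rather than EEG.

```python
import numpy as np

def min_max_per_channel(x, eps=1e-12):
    """Rescale each channel of each trial to [0, 1] along the time axis.

    x has shape (trials, channels, samples); eps guards against flat channels.
    """
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    return (x - lo) / (hi - lo + eps)

rng = np.random.default_rng(0)
fake_eeg = rng.standard_normal((4, 32, 672)) * 20.0   # synthetic, not real EEG
scaled = min_max_per_channel(fake_eeg)

print(scaled.shape)
print(float(scaled.min()), float(scaled.max()))
```

Whether to normalize per channel, per trial, or per participant is a design decision; this sketch only shows one option.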

## 2. Loading dataset (version 2)

In [14]:
class Dataset(torch.utils.data.Dataset):
    
    def __init__(self, path, stim):
        _, _, filenames = next(os.walk(path))
        filenames = sorted(filenames)
        all_data = []
        all_label = []
        for dat in filenames:
            #encoding='latin1' is needed to read Python 2 pickles in Python 3
            with open(os.path.join(path, dat), 'rb') as f:
                temp = pickle.load(f, encoding='latin1')
            all_data.append(temp['data'])
            
            if stim == "Valence":
                all_label.append(temp['labels'][:,:1])   #column 0 is valence
            elif stim == "Arousal":
                all_label.append(temp['labels'][:,1:2])  #column 1 is arousal
            else:
                raise ValueError(f"unknown stim: {stim!r}")
                
        self.data = np.vstack(all_data)[:, :32, :]   #shape: (1280, 32, 8064) --> take only the first 32 channels
        
        shape = self.data.shape
        
        #perform segmentation=====
        segments = 12
        
        #split the time axis into (segments, samples per segment) so that each segment is contiguous in time
        self.data = self.data.reshape(shape[0], shape[1], segments, shape[2] // segments)
        #data shape: (1280, 32, 12, 672)

        self.data = self.data.transpose(0, 2, 1, 3)
        #data shape: (1280, 12, 32, 672)

        self.data = self.data.reshape(shape[0] * segments, shape[1], -1)
        #data shape: (1280*12, 32, 672)
        #==========================
        
        self.label = np.vstack(all_label) #(1280, 1)  ==> one rating per trial
        self.label = np.repeat(self.label, segments, axis=0)  #repeat each trial's rating once per segment; axis=0 keeps the shape (1280*12, 1)
        
        del temp, all_data, all_label

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        single_data  = self.data[idx]
        single_label = (self.label[idx] > 5).astype(float)   #convert the scale to either 0 or 1 (to classification problem)
        
        batch = {
            'data': torch.Tensor(single_data),
            'label': torch.Tensor(single_label)
        }
        
        return batch

Now let's try to load the dataset and see the shape.

In [15]:
path = "data"  #create a folder "data" and put s01.dat, ..., s32.dat from the preprocessed folder of the DEAP dataset inside it

In [16]:
dataset = Dataset(path, "Valence")

data  = dataset[:]['data']
label = dataset[:]['label']

print("Data shape: " , data.shape)  #15360 = 32 participants * 40 trials * 12 segments; 32 EEG channels; 672 samples per segment
print("Label shape: ", label.shape)  #one binary valence label (0 or 1) per segment

Data shape:  torch.Size([15360, 32, 672])
Label shape:  torch.Size([15360, 1])
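The label shape matches the data because each trial's rating is copied once per segment. The small sketch below, with two made-up ratings, shows how <code>np.repeat</code> behaves: without <code>axis</code> it flattens its input, so the column axis has to be restored afterwards, while with <code>axis=0</code> the column axis is kept.

```python
import numpy as np

segments = 3
labels = np.array([[7.0], [2.0]])   # two made-up trial ratings, shape (2, 1)

flat = np.repeat(labels, segments)           # axis omitted: input is flattened first
kept = np.repeat(labels, segments, axis=0)   # repeats rows, keeps the column axis

print(flat.shape)     # (6,)
print(kept.shape)     # (6, 1)
print(kept.ravel())   # [7. 7. 7. 2. 2. 2.]
```

Either way, the repeated labels come out trial by trial, which is the same order in which the segments are laid out after the final reshape.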


Let's look at the label distribution of the dataset.

In [17]:
lv = label == 0
hv = label == 1

assert len(label[lv]) + len(label[hv]) == label.shape[0]  #simple unit test
print("count of low valence: ", len(label[lv]))
print("count of high valence: ", len(label[hv]))

count of low valence:  6864
count of high valence:  8496


Let's look at the median of the EEG in each group (computing the standard deviation is left as an exercise).

In [18]:
lv_mask = lv.squeeze()   #drop the trailing axis so the mask has shape (15360,)
hv_mask = hv.squeeze()

print("Median of low valence",  np.median(data[lv_mask, :, :]))
print("Median of high valence", np.median(data[hv_mask, :, :]))

Median of low valence 0.009302851
Median of high valence 0.0034587365
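For the standard deviation, the same boolean-mask pattern applies. The sketch below runs on synthetic segments with made-up labels, so its numbers say nothing about DEAP; it only shows the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((10, 32, 672))            # synthetic segments
label = np.array([0, 1] * 5, dtype=float)[:, None]   # made-up labels, shape (10, 1)

low_mask = (label == 0).squeeze()    # boolean mask over segments, shape (10,)
high_mask = (label == 1).squeeze()

for name, mask in (("low", low_mask), ("high", high_mask)):
    group = data[mask]               # (segments in group, 32, 672)
    print(name, group.shape, float(np.median(group)), float(group.std()))
```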


The two medians differ slightly, which could be due to a few peaks, but a difference this small in raw voltage is hard to interpret on its own.  In the next tutorial, we shall look at the power spectrum, which shows the power at different frequencies.