# PROBLEM STATEMENT & MOTIVATION
-----------------
-----------------

&nbsp;

Brain-Computer Interfaces (BCIs) can be extremely empowering for people with disabilities who choose to use them. A few commercially available EEG headsets, such as the Emotiv EPOC (pictured below), provide less cost-prohibitive, noninvasive ways to convert electrical signals from the brain into computer commands, giving users a form of input that might otherwise be inaccessible to them.

&nbsp;

![Image of the Emotiv EPOC+ headset on a white background, next to a schematic of the 10-20 electrode placement system](https://d2z0k1elb7rxgj.cloudfront.net/uploads/2018/11/a-Emotiv-EPOC-headset-b-Spatial-mapping-of-the-electrodes-on-the-scalp.jpg)


&nbsp;


This project uses techniques from data science to explore the feasibility of classifying EEG signals captured by a low-cost, dry electrode system such as the Emotiv EPOC+. The data was collected over nearly two years (2014-15) and is [curated and hosted by the subject of the readings, David Vivancos](http://mindbigdata.com/opendb/index.html). Although four datasets recorded with four different devices are available, for this project I analyzed the one with the most channels, especially since it was the only one with electrode channels over the occipital lobe, which contains the visual cortex. Lower-cost options are also available, such as using a development board or microcontroller (\\$5 - \\$50) with an amplifier such as [this one](https://biosignals.berndporr.me.uk/#build_your_own_bio-amplifier) and electrodes from any supplier, which can cost only [a few dollars](https://www.alibaba.com/product-detail/Colorful-Reusable-Gold-Cup-Electrodes-Cable_1600592681920.html).

In [None]:
# Import the necessary packages

import pandas as pd
import numpy as np
import gc
import itertools
from functools import reduce  # used to merge the per-channel dataframes
import scipy
from scipy.signal import hilbert, savgol_filter, periodogram
from sklearn.decomposition import FastICA
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, GlobalMaxPooling2D, MaxPooling2D, Flatten
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# DATA PROCESSING
-----------------
-----------------

Uncomment and run the cell below to load and process the data for the model.

In [None]:
# Load in original raw data and give it column names
cols = ['id', 'event', 'device', 'channel', 'code', 'size', 'data']
emotiv = pd.read_csv('../../fulldata/EP1.txt', delimiter='\t', names=cols)
emotiv.drop(['device', 'id'], inplace=True, axis=1)

# Optional: break up full df into sub-dfs by channel
# occO1 = emotiv[(emotiv['channel'] == 'O1')]
# occO2 = emotiv[(emotiv['channel'] == 'O2')]
# fefF3 = emotiv[(emotiv['channel'] == 'F3')]
# fefF4 = emotiv[(emotiv['channel'] == 'F4')]
# fefF7 = emotiv[(emotiv['channel'] == 'F7')]
# fefF8 = emotiv[(emotiv['channel'] == 'F8')]
# temT7 = emotiv[(emotiv['channel'] == 'T7')]
# temT8 = emotiv[(emotiv['channel'] == 'T8')]
# pfcAF3 = emotiv[(emotiv['channel'] == 'AF3')]
# pfcAF4 = emotiv[(emotiv['channel'] == 'AF4')]
# motFC5 = emotiv[(emotiv['channel'] == 'FC5')]
# motFC6 = emotiv[(emotiv['channel'] == 'FC6')]
# parP7 = emotiv[(emotiv['channel'] == 'P7')]
# parP8 = emotiv[(emotiv['channel'] == 'P8')]

# # Delete and garbage collect the full df so computer doesn't run out of RAM and freeze
# del emotiv
# gc.collect()

def dataProcessor(df):
    '''
    Cleans the data column by splitting each string into floats, clipping every
    vector to 256 samples, and resetting the index.

    i: Dataframe for a single channel
    o: Processed Series of float vectors; prints the types and the shortest and
       longest vector lengths before and after clipping as a check
    '''

    col = df['data'].apply(lambda x: list(map(float, x.split(','))))
    print(type(col), type(col.iloc[0]), type(col.iloc[0][0]))

    # Shortest vector before clipping
    print(col.apply(len).min())

    # Clip every vector to 256 samples
    col = col.apply(lambda v: v[:256])

    # Longest vector after clipping; should be at most 256
    print(col.apply(len).max())
    return col.reset_index(drop=True)

# Choose  which channels to include
dfs = [occO1, occO2, fefF3,
       fefF4, fefF7, fefF8, temT7,temT8,
        pfcAF3, pfcAF4, motFC5, motFC6,
        parP7, parP8]

# Init blank dataframe for processed channels to be added to
df = pd.DataFrame()

#  select columns by name by grabbing channel name value string from the 'channel' column
# then running dataProcessor on each/any channel dataframes
for x in dfs:
    name = x['channel'].iloc[0]
    df[name] = dataProcessor(x) 

# Add code column from any channel df
df['code'] = occO1['code'].reset_index(drop=True)
print(df.head())
print(type(df), type(df.iloc[0]), type(df.iloc[0][0]))

# Delete original dfs, then garbage collect to conserve RAM
del (occO1, occO2, fefF3, fefF4, fefF7, fefF8, temT7, temT8,
     pfcAF3, pfcAF4, motFC5, motFC6, parP7, parP8)
gc.collect()

# # Save resulting dataframe to csv
# df.to_csv('../data/df.csv', sep=';', quoting=None)
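
The parsing and clipping steps in `dataProcessor` can be tried on a toy example before running them on the full file. The sketch below is illustrative only: the two rows are made-up values laid out in the tab-separated column order used above, and it clips to the shortest vector rather than to the fixed 256 samples used in the cell.

```python
import io
import pandas as pd

# Two made-up rows in the column order used above:
# id, event, device, channel, code, size, data (comma-separated samples)
raw = (
    "0\t100\tEP\tO1\t7\t4\t4395.3,4382.5,4377.4,4387.6\n"
    "1\t100\tEP\tO2\t7\t3\t4120.0,4118.9,4125.1\n"
)
cols = ['id', 'event', 'device', 'channel', 'code', 'size', 'data']
toy = pd.read_csv(io.StringIO(raw), delimiter='\t', names=cols)

# Split each data string into a list of floats
vectors = toy['data'].apply(lambda s: [float(v) for v in s.split(',')])

# Clip every vector to the shortest length so all rows line up
shortest = vectors.apply(len).min()
clipped = vectors.apply(lambda v: v[:shortest])

print(shortest)
print(clipped.tolist())
```

Clipping to a common length is what later allows the vectors to be stacked into a rectangular array.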

In [None]:
#Loading each channel then merging to a single dataframe

#Note: these names are 
occO0 = pd.read_csv('../data/occ0Exp.csv', delimiter=',')
occO1 = pd.read_csv('../data/occ1Exp.csv', delimiter=',')
fefF3 = pd.read_csv('../data/fefF3Exp.csv', delimiter=',')
fefF4 = pd.read_csv('../data/fefF4Exp.csv', delimiter=',')
fefF7 = pd.read_csv('../data/fefF7Exp.csv', delimiter=',')
fefF8 = pd.read_csv('../data/fefF8Exp.csv', delimiter=',')
temT7 = pd.read_csv('../data/temT7Exp.csv', delimiter=',')
temT8 = pd.read_csv('../data/temT8Exp.csv', delimiter=',')
pfcAF3 = pd.read_csv('../data/pfcAF3Exp.csv', delimiter=',')
pfcAF4 = pd.read_csv('../data/pfcAF4Exp.csv', delimiter=',')
motFC5 = pd.read_csv('../data/motFC5Exp.csv', delimiter=',')
motFC6 = pd.read_csv('../data/motFC6Exp.csv', delimiter=',')
parP7 = pd.read_csv('../data/parP7Exp.csv', delimiter=',')
parP8 = pd.read_csv('../data/parP8Exp.csv', delimiter=',')

# OPTIONAL: load same datasets expanded with filter processing, as shown below (saving process in working code)

# occO0 = pd.read_csv('../data/occ0Proc.csv', delimiter=',')
# occO1 = pd.read_csv('../data/occ1Proc.csv', delimiter=',') 
# fefF3 = pd.read_csv('../data/fefF3Proc.csv', delimiter=',')
# fefF4 = pd.read_csv('../data/fefF4Proc.csv', delimiter=',')
# fefF7 = pd.read_csv('../data/fefF7Proc.csv', delimiter=',')
# fefF8 = pd.read_csv('../data/fefF8Proc.csv', delimiter=',') 
# temT7 = pd.read_csv('../data/temT7Proc.csv', delimiter=',')
# temT8 = pd.read_csv('../data/temT8Proc.csv', delimiter=',')
# pfcAF3 = pd.read_csv('../data/pfcAF3Proc.csv', delimiter=',')
# pfcAF4 = pd.read_csv('../data/pfcAF4Proc.csv', delimiter=',')
# motFC5 = pd.read_csv('../data/motFC5Proc.csv', delimiter=',')
# motFC6 = pd.read_csv('../data/motFC6Proc.csv', delimiter=',')
# parP7 = pd.read_csv('../data/parP7Proc.csv', delimiter=',')
# parP8 = pd.read_csv('../data/parP8Proc.csv', delimiter=',')


#Rename all the columns so they are unique when concatenated 
occO0.rename(columns={'data': 'occ0Data'}, inplace=True)
occO1.rename(columns={'data': 'occ1Data'}, inplace=True)
fefF3.rename(columns={'data': 'fefF3Data'}, inplace=True)
fefF4.rename(columns={'data': 'fefF4Data'}, inplace=True)
fefF7.rename(columns={'data': 'fefF7Data'}, inplace=True)
fefF8.rename(columns={'data': 'fefF8Data'}, inplace=True)
temT7.rename(columns={'data': 'temT7Data'}, inplace=True)
temT8.rename(columns={'data': 'temT8Data'}, inplace=True)
pfcAF3.rename(columns={'data': 'pfcAF3Data'}, inplace=True)
pfcAF4.rename(columns={'data': 'pfcAF4Data'}, inplace=True)
motFC5.rename(columns={'data': 'motFC5Data'}, inplace=True)
motFC6.rename(columns={'data': 'motFC6Data'}, inplace=True)
parP7.rename(columns={'data': 'parP7Data'}, inplace=True)
parP8.rename(columns={'data': 'parP8Data'}, inplace=True)


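# Process the channels and average the data vectors by event
dfs = [occO0, occO1, fefF3, fefF4, fefF7, fefF8, temT7, temT8,
       pfcAF3, pfcAF4, motFC5, motFC6, parP7, parP8]

averaged = []
for n, chan in enumerate(dfs):
    chan = chan.drop(columns='Unnamed: 0')
    # The data column was renamed above, so find it by its 'Data' suffix
    dataCol = next(c for c in chan.columns if c.endswith('Data'))
    chan[dataCol] = chan[dataCol].astype(float)
    # Keep the 'code' label from the first channel only to avoid duplicate columns
    keep = [dataCol, 'code'] if n == 0 else [dataCol]
    averaged.append(chan.groupby('event', as_index=False)[keep].mean())
    print(averaged[-1].columns)

# Merge all DataFrames into one, joined on event
df = reduce(lambda left, right: pd.merge(left, right, on='event', how='outer'),
            averaged)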


# DATA SETUP
-----------------
-----------------

In [None]:
# Setting up X and y
X = df.drop('code', axis=1)
y = df['code']

In [None]:
# Run this cell to convert the stringified vectors to numerical type before signal processing.
import ast
X = X.applymap(ast.literal_eval) # takes a while; literal_eval only parses literals, unlike eval
print(X.iloc[0], type(y.iloc[0]), X.shape, y.shape)

In [None]:
# train-test split for "df"
XTrain, XTest, yTrain, yTest = train_test_split(X, y)

In [None]:
XTrain.shape, XTest.shape, yTrain.shape, yTest.shape

In [None]:
XTrain.head()

In [None]:
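# Stack the per-channel vectors into one float array: (samples, 14 channels, 256 time steps)
XTrainNP = np.array(XTrain.to_numpy().tolist(), dtype=np.float32)
# Reorder to the (samples, 256, 14, 1) layout that the CNN's input_shape expects
XTrainNP = XTrainNP.transpose(0, 2, 1)[..., np.newaxis]
XTrainNP.shape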

In [None]:
# One-Hot Encode the y values
ohe = OneHotEncoder(sparse_output=False)  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
yTrainOHE = ohe.fit_transform(yTrain.to_numpy().reshape(-1,1))
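
As a small illustration of what the encoding produces, the sketch below one-hot encodes a handful of made-up labels. It assumes the label set is the digits 0-9 plus a -1 code for non-digit trials, which would match the 11-unit softmax layer used in the CNN; the label values themselves are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up labels; the assumed label set is -1 plus the digits 0-9 (11 classes)
codes = np.array([-1, 0, 3, 9, 3]).reshape(-1, 1)

# Fixing the categories keeps the column order stable even if a class is missing
encoder = OneHotEncoder(categories=[list(range(-1, 10))])
onehot = encoder.fit_transform(codes).toarray()

print(onehot.shape)
print(onehot[2])
```

Each row contains a single 1, placed in the column of its class, so the row for code 3 has its 1 at index 4.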


# SIGNAL PREPROCESSING
-----------------
-----------------

## Savitzky-Golay Filter

![Gif of how the filter smoothly approximates the curve at discrete time steps](https://upload.wikimedia.org/wikipedia/commons/8/89/Lissage_sg3_anim.gif)

"The idea of Savitzky-Golay filters is simple – for each sample in the filtered sequence, take its direct neighborhood of N neighbors and fit a polynomial to it. Then just evaluate the polynomial at its center (and the center of the neighborhood), point 0, and continue with the next neighborhood. "


-- https://bartwronski.com/2021/11/03/study-of-smoothing-filters-savitzky-golay-filters/
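
As a rough check of the description quoted above, the sketch below smooths a synthetic noisy sine with SciPy's `savgol_filter` and then reproduces one output sample by hand, fitting a polynomial to that sample's neighborhood and evaluating it at the center. The signal, window length, and polynomial order are arbitrary choices for illustration, not values tuned for the EEG data.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic 256-sample signal: a sine wave plus Gaussian noise
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256)
noisy = np.sin(2 * np.pi * 5 * t) + 0.3 * rng.standard_normal(t.size)

window, order = 11, 2  # illustrative values only
smoothed = savgol_filter(noisy, window_length=window, polyorder=order)

# Reproduce one interior output sample by hand:
# fit a polynomial to the neighborhood and evaluate it at the center (offset 0)
centre = 100
half = window // 2
offsets = np.arange(-half, half + 1)
coeffs = np.polyfit(offsets, noisy[centre - half:centre + half + 1], order)
by_hand = np.polyval(coeffs, 0)

print(np.isclose(by_hand, smoothed[centre]))
```

Away from the edges of the signal, the filter output and the hand-fitted value agree, which is the neighborhood-fit idea in the quote.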

In [None]:
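channelCols = ['O1', 'O2', 'F3', 'F4', 'F7', 'F8', 'T7', 'T8', 'AF3', 'AF4', 'FC5',
               'FC6', 'P7', 'P8']
# savgol_filter(x, window_length, polyorder) needs window_length <= len(x), i.e. at most 256 here;
# 11 is an assumed, untuned window
XSGF = X[channelCols].applymap(lambda v: savgol_filter(v, window_length=11, polyorder=1))
#dfSGF.concatenate(df['code'])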

## Fast ICA

![Logo of the name of the algorithm from its project website from the University of Aalto, Finland ](https://research.ics.aalto.fi/ica/fastica/FastICA.gif)

"Independent component analysis (ICA) is a statistical and computational technique for revealing hidden factors that underlie sets of random variables, measurements, or signals.

ICA defines a generative model for the observed multivariate data, which is typically given as a large database of samples. In the model, the data variables are assumed to be linear mixtures of some unknown latent variables, and the mixing system is also unknown. The latent variables are assumed nongaussian and mutually independent, and they are called the independent components of the observed data. These independent components, also called sources or factors, can be found by ICA. "


-- https://www.cs.helsinki.fi/u/ahyvarin/whatisica.shtml
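
To make the generative model in the quote concrete, the sketch below mixes two synthetic non-Gaussian sources with a made-up mixing matrix and asks scikit-learn's `FastICA` to recover them. Everything here (the sources, the matrix `A`, the sample count) is an assumption for illustration and is not derived from the EEG data.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two independent, non-Gaussian latent sources
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)              # smooth sine source
s2 = np.sign(np.sin(3 * t))     # square-wave source
S = np.c_[s1, s2]

# Made-up mixing matrix: each "electrode" records a linear mixture of both sources
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])
Xmixed = S @ A.T

# Estimate the independent components from the mixtures alone
ica = FastICA(n_components=2, random_state=0, max_iter=500)
recovered = ica.fit_transform(Xmixed)

# ICA recovers sources only up to order and sign, so compare absolute correlations
corr = np.abs(np.corrcoef(S.T, recovered.T)[:2, 2:])
best_match = corr.max(axis=1)
print(best_match.round(2))
```

Note that the input is arranged as (samples, channels): ICA needs several simultaneous mixtures to unmix, so a single-column input cannot yield more than one component.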

In [None]:
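# Stack to (time points, channels) so ICA unmixes the 14 channels
signals = np.concatenate([np.array(row, dtype=float).T for row in X.to_numpy().tolist()])
fICA = FastICA(n_components=5, whiten='unit-variance')
XfICA = fICA.fit_transform(signals)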

# MODELING
-----------------
-----------------

## Convolutional Neural Network

In [None]:
# Initiate and setup the model

model = Sequential()
model.add(Conv2D(filters=64, kernel_size=1, activation='relu', input_shape=(256,14,1)))
model.add(MaxPooling2D(pool_size=(1,1)))
model.add(Conv2D(filters=64, kernel_size=1, activation='relu'))
model.add(MaxPooling2D(pool_size=(1,1)))
model.add(Conv2D(filters=64, kernel_size=1, activation='relu'))
model.add(MaxPooling2D(pool_size=(1,1)))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(11, activation='softmax'))

# Print summary
model.summary()

Most recent model summary:

```
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv2d (Conv2D)             (None, 256, 14, 64)       128       
                                                                 
 max_pooling2d (MaxPooling2D  (None, 256, 14, 64)      0         
 )                                                               
                                                                 
 conv2d_1 (Conv2D)           (None, 256, 14, 64)       4160      
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 256, 14, 64)      0         
 2D)                                                             
                                                                 
 conv2d_2 (Conv2D)           (None, 256, 14, 64)       4160      
                                                                 
 max_pooling2d_2 (MaxPooling  (None, 256, 14, 64)      0         
 2D)                                                             
                                                                 
 flatten (Flatten)           (None, 229376)            0         
                                                                 
 dense (Dense)               (None, 32)                7340064   
                                                                 
 dense_1 (Dense)             (None, 32)                1056      
                                                                 
 dense_2 (Dense)             (None, 11)                363       
                                                                 
=================================================================
Total params: 7,349,931
Trainable params: 7,349,931
Non-trainable params: 0
```

In [None]:
# Run compile and fit model
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
results = model.fit(XTrainNP, yTrainOHE, batch_size=20, epochs=30, verbose=1)

## Support Vector Classifier

In [None]:
# To improve the score, the vectors of each row of the data column are expanded out into columns
XO1 = pd.DataFrame(X['O1'].tolist())
XO2 = pd.DataFrame(X['O2'].tolist())
XF3 = pd.DataFrame(X['F3'].tolist())
XF4 = pd.DataFrame(X['F4'].tolist())
XF7 = pd.DataFrame(X['F7'].tolist())
XF8 = pd.DataFrame(X['F8'].tolist())
XT7 = pd.DataFrame(X['T7'].tolist())
XT8 = pd.DataFrame(X['T8'].tolist())
XAF3 = pd.DataFrame(X['AF3'].tolist())
XAF4 = pd.DataFrame(X['AF4'].tolist())
XFC5 = pd.DataFrame(X['FC5'].tolist())
XFC6 = pd.DataFrame(X['FC6'].tolist())
XP7 = pd.DataFrame(X['P7'].tolist())
XP8 = pd.DataFrame(X['P8'].tolist())

XWide = pd.concat([XO1, XO2, XF3, XF4, XF7, XF8, XT7, XT8, XAF3,
                    XAF4, XFC5, XFC6, XP7, XP8], axis=1)
del (X, XO1, XO2, XF3, XF4, XF7, XF8, XT7, XT8,
     XAF3, XAF4, XFC5, XFC6, XP7, XP8)
gc.collect()

In [None]:
# New train-test split using XWide. 

XTrainWide, XTestWide, yTrain, yTest = train_test_split(XWide, y)

# Create a simple Support Vector Classification as an alternative model
# The following code takes a long time to run (~6 hrs on my computer)
svc = SVC()
svc.fit(XTrainWide, yTrain)

svc.score(XTestWide, yTest)

# RESULTS
---------------
---------------

The first iterations of the model used data that was averaged per event. This effectively erased the time data, which is apparently crucial for neural coding (see this [video lecture by Earl Miller for more information on why](https://www.youtube.com/watch?v=Kqyhr9fTUjs)). The loss is likely made worse because the raw data is a measurement of changes in amplitude over time, even with filters applied. The model might perform better with frequency information from a Fourier transform, but a mean frequency without time would still lose a lot of information, other than perhaps the dominant frequency band of the channel.
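
The point about averaging can be illustrated with a synthetic signal. The sketch below assumes a 128 Hz sampling rate and a 256-sample vector (about two seconds), and uses a made-up 10 Hz oscillation plus noise rather than real EEG. It shows that the mean of the vector carries essentially no trace of the oscillation, while a periodogram still recovers its dominant frequency.

```python
import numpy as np
from scipy.signal import periodogram

fs = 128                      # assumed sampling rate in Hz
t = np.arange(256) / fs       # one 256-sample vector, about 2 s
rng = np.random.default_rng(1)

# Made-up signal: a 10 Hz oscillation plus a little Gaussian noise
signal = np.sin(2 * np.pi * 10 * t) + 0.2 * rng.standard_normal(t.size)

# Averaging the whole vector collapses the oscillation to roughly zero
print(round(signal.mean(), 3))

# The periodogram keeps the frequency content; its peak is the dominant frequency
freqs, power = periodogram(signal, fs=fs)
dominant = freqs[np.argmax(power)]
print(dominant)
```

Even so, a single dominant frequency per channel says nothing about when the activity occurred, which is the remaining loss described above.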


>
>Best accuracy for the unprocessed signal with all channels was about .1020
> &nbsp;
>
>In 6/14 channels, adding SGF data was about the same
>
> &nbsp;
>
>In 6/14 channels, adding SGF and dropping raw data resulted in really low accuracy and NaN loss
>
> &nbsp;
>
>In 6/14 channels, ICA only had an accuracy of about .1000
>
> &nbsp;
>
>6/14 channels, unprocessed+SGF+ICA, 22,059 params: NaN loss again, what's wrong with the data? woe be the futility of mine ways
>
> &nbsp;
>
>6/14 channels, SGF only, 9,771 params, accuracy still around .1000
>
> &nbsp;
>
>6/14 channels, raw, 9,771 params, accuracy about .1000, so the .0020 increase was only from the extra channels. Processing doesn't seem to change the accuracy at all.
>
> &nbsp;
>
 ----------

 &nbsp;


Realizing that temporal coding is crucial, I ran the model using vector data rows. The first iterations of this process indicated that the TF CNN cannot run on matrices whose values are a mix of scalars and vectors. To correct this, I attempted to expand the vectors into columns of the dataframes, but the result was too large for the CNN, which gave the following error:

```
ResourceExhaustedError: {{function_node __wrapped__StatelessRandomUniformV2_device_/job:localhost/replica:0/task:0/device:CPU:0}} OOM when allocating tensor with shape[799129600,32] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [Op:StatelessRandomUniformV2]
```

However, a Support Vector Classifier was able to run with this input. Unfortunately, it gave only a slightly better accuracy score of 0.112 (+0.01). This was on the raw data, and the score might improve with denoising and/or source separation techniques, but other projects on the same dataset do not seem to get much higher than around 20% accuracy. Though this project answered modeling questions and taught me a lot, a different dataset might be a better option for creating a deployable model for BCI applications.
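
The size of the allocation in the error message above can be estimated with quick arithmetic. The sketch below assumes the failing tensor is the weight matrix of the first `Dense(32)` layer (flattened inputs by units) stored as 4-byte float32 values, and repeats the calculation for the model summarized earlier as a sanity check.

```python
# Shape reported in the error: (flattened inputs, units), presumably the Dense kernel
flat_inputs, units = 799_129_600, 32
bytes_per_float32 = 4

oom_bytes = flat_inputs * units * bytes_per_float32
print(f"{oom_bytes / 1e9:.1f} GB")

# Same calculation for the model summarized above: 256 x 14 positions x 64 filters
flat_current = 256 * 14 * 64
dense_params = flat_current * units + units  # weights plus one bias per unit
print(flat_current, dense_params)
```

The first figure comes to roughly 102 GB for a single weight matrix, far beyond typical workstation RAM, while the second reproduces the 229,376 flattened inputs and 7,340,064 parameters listed for the first dense layer in the summary.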

# BIBLIOGRAPHY
---------------
---------------
- "Deep learning for electroencephalogram (EEG) classification tasks: a review", Alexander Craik, et al, 2019, J. Neural Eng. 16 031001, doi:10.1088/1741-2552/ab0ab5

&nbsp;

- "Denoising Source Separation", Jaakko Särelä & Harri Valpola, 2005, J. Machine Learning Res. 6, pp. 233-272, doi:10.5555/1046920.1058110

&nbsp;


- "Frequency Band and PCA Feature Comparison for EEG Signal Classification", I Wayan Pio Pratama, et al, 2021, Lontar Komputer Vol. 12 No. 1, doi:10.24843/LKJITI.2021.v12.i01.p01

&nbsp;


- "PIEEG: Turn a Raspberry Pi into a Brain-Computer-Interface to measure biosignals", Ildar Rakhmatulin & Sebastian Volkl, 2022, arXiv:2201.02228

&nbsp;


- "Progress in Brain Computer Interface: Challenges and Opportunities", Simanto Saha, et al, 2021, Front. Syst. Neurosci., doi:10.3389/fnsys.2021.578875

&nbsp;


- "Supply and demand analysis of the current and future US neurology workforce", Timothy M. Dall, et al, 2013, Neurology 81(5), doi:10.1212/WNL.0b013e318294b1cf

&nbsp;


- "Toward Direct Brain-Computer Communication", Jacques J. Vidal, 1973, Ann. Rev Biophysics & Bioengineering, Vol. 2, pp. 157-180, doi:10.1146/annurev.bb.02.060173.001105

&nbsp;


- "What is a Savitzky-Golay Filter?", Ronald W. Schafer, 2011, IEEE Sig. Proc. Mag July 2011, pp. 111-117, doi:10.1109/MSP.2011.941097

&nbsp;


- Github repo listing many public EEG Datasets, including the one used in this project  https://github.com/meagmohit/EEG-Datasets

&nbsp;


- List of papers that use/reference the MindBigData digits dataset on its website http://mindbigdata.com/opendb/index.html

&nbsp;


- "meegkit: EEG and MEG denoising in Python" https://nbara.github.io/python-meegkit/ -- Unused but interesting EEG/MEG specific library

&nbsp;


- "MNE, MEG + EEG Analysis & Visualization"  https://mne.tools/dev/index.html -- Another, much larger, neuro signal analysis library

&nbsp;


- SciPy's signal processing library documentation https://docs.scipy.org/doc/scipy/reference/signal.html

&nbsp;


- Website for Emotiv EPOC headset https://www.emotiv.com/epoc/ 
