# Machine Learning for EEG Dataset: DEAP

http://www.eecs.qmul.ac.uk/mmv/datasets/deap/


## Training Models and Obtaining Accuracies for Subject Independent Data Classification

### Dataset (per Subject)

nFeatures = 18 for each frequency band <br>
nFeatures for each trial = 32 (electrode channels) x 5 (frequency bands) x 18 features <br>
= 2880 features per trial per subject

This is reshaped into:
- trials (40) x electrode channels (32), frequency bands (5) x nFeatures (18)
- Shape is (1280, 90)
- 1280 samples and 90 features per sample

<b>Now there are 32 such files (one per subject), which are combined to give a final dataset shape of:</b>
- (40960, 90)
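The shape arithmetic above can be checked with a small NumPy sketch. The random array is a hypothetical stand-in for one subject's feature file, and the axis order (trials, channels, bands, features) is an assumption; the real files may be laid out differently.

```python
import numpy as np

# Hypothetical stand-in for one subject's features:
# 40 trials x 32 channels x 5 frequency bands x 18 features.
rng = np.random.default_rng(0)
subject_feats = rng.normal(size=(40, 32, 5, 18))

# Merge (trial, channel) into the sample axis and (band, feature) into the feature axis.
subject_2d = subject_feats.reshape(40 * 32, 5 * 18)
print(subject_2d.shape)  # (1280, 90)

# Stack 32 per-subject arrays row-wise (the same array is reused here for brevity).
all_subjects = np.concatenate([subject_2d] * 32, axis=0)
print(all_subjects.shape)  # (40960, 90)
```

Each row of the result is one (trial, channel) pair, and its 90 columns are the 18 features of each of the 5 bands laid end to end.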

## Model

### Input Data --> Reshape --> Impute NaN Values --> Scale --> Feature Elimination --> Classification --> Output

For the feature elimination and classification steps, there are multiple options, as follows:

#### Feature Elimination:
- LDA
- RFE
- PCA
- FA

#### Classification:
- SVM
  - Linear kernel
  - RBF kernel
  - Polynomial kernel
  - Sigmoid kernel
- Decision Tree
- Logistic Regression
- Gaussian Naive Bayes
- K Nearest Neighbours (KNN)


# Subject Independent

I recommend trying these out only on Google Colab. The subject-independent dataset is too large and the training too heavy for a typical local machine, and having a GPU won't help either, since scikit-learn runs these models on the CPU.

## Installations

In [None]:
#If not in Colab, run these commands in a terminal (without the leading "!") and restart Jupyter Notebook
!pip install tornado==5 distributed==2.4.0 dask-ml[complete]
!python -m pip install dask[dataframe] --upgrade
#Restart the runtime once after running this cell

In [1]:
#If not in Colab, run these commands in a terminal (without the leading "!") and restart Jupyter Notebook
!pip install -U scikit-learn
!pip install openpyxl

Collecting scikit-learn
  Downloading https://files.pythonhosted.org/packages/f3/74/eb899f41d55f957e2591cde5528e75871f817d9fb46d4732423ecaca736d/scikit_learn-0.24.1-cp37-cp37m-manylinux2010_x86_64.whl (22.3MB)
     |████████████████████████████████| 22.3MB 1.4MB/s
Collecting threadpoolctl>=2.0.0
  Downloading https://files.pythonhosted.org/packages/f7/12/ec3f2e203afa394a149911729357aa48affc59c20e2c1c8297a60f33f133/threadpoolctl-2.1.0-py3-none-any.whl
Installing collected packages: threadpoolctl, scikit-learn
  Found existing installation: scikit-learn 0.22.2.post1
    Uninstalling scikit-learn-0.22.2.post1:
      Successfully uninstalled scikit-learn-0.22.2.post1
Successfully installed scikit-learn-0.24.1 threadpoolctl-2.1.0


In [None]:
#Check the installed sklearn version
import sklearn
sklearn.__version__
#This notebook was run with 0.24.1

'0.24.1'

## Imports

In [None]:
import copy
import pickle
import numpy as np
import pandas as pd

from scipy import sparse
from shutil import rmtree

import warnings
warnings.filterwarnings("ignore")

In [None]:
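#If you don't want to use dask-ml, replace dask_ml with sklearn in the imports below: SimpleImputer and StandardScaler have the same names there (sklearn has no wrappers module, so drop that import)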
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFE
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import  KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

import joblib
import dask.array as da
from dask.distributed import Client
from dask_ml.impute import SimpleImputer
from dask_ml.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from dask_ml.wrappers import Incremental, ParallelPostFit

from openpyxl.styles import Font
from openpyxl import load_workbook, Workbook
from openpyxl.styles.fills import PatternFill
from openpyxl.styles.borders import Border, Side
from openpyxl.utils.dataframe import dataframe_to_rows

In [None]:
#Only run this cell if importing the features data and data generator from Google Drive
#If you're using Google Colab, like I am, you will have to import everything from Google Drive.
#Uploading from your local machine takes too long, and you'd have to upload again every time you open the notebook or restart the runtime
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import sys
sys.path.insert(1, '/content/drive/MyDrive/Upwork/') #Path to folder containing DataGenerator File
from DataGenerator import DataGenerator

## Functions for Saving to Excel file and formatting (optional)
These helpers only format the results workbook; they do not affect model training or scoring.

In [None]:
#The following 2 variables are used in the formatting of the excel file
border = Border(left=Side(style='thin'), right=Side(style='thin'), 
                     top=Side(style='thin'), bottom=Side(style='thin'))
col = dict(zip(range(1, 11), 'ABCDEFGHIJ')) #Maps column numbers 1-10 to column letters A-J

#Function that applies the below formatting function to all worksheets in the workbook 
def apply_formatting(wb, metric):
    if wb.worksheets[0].title=="Sheet":
        wb.remove(wb.worksheets[0])
    for ws in wb.worksheets:
        ws['A1'] = metric
        A = pd.DataFrame(ws.values).iloc[1:, 1:].to_numpy()
        bold_idx = np.where(A == np.nanmax(A))
        bold_cells = [col[i+2] + str(j+2) for i,j in zip(bold_idx[1], bold_idx[0])]
        bold_colour(ws, bold_cells)
        apply_border(ws)


#Functions for formatting the excel sheet - to apply borders
def apply_border(sheet):
    cells = [i + '6' for i in 'BCDEFGHIJ']
    cells = cells + ['J' + str(i) for i in range(2,7)] + [i + '1' for i in 'ABCDEFGHIJ'] + ['A' + str(i) for i in range(2,7)]
    for cell in cells:
        sheet[cell].border = border

#Functions for formatting the excel sheet - to highlight the cells of max accuracy        
def bold_colour(sheet, cells):
    for cell in cells + [i + '1' for i in 'ABCDEFGHIJ'] + ['A' + str(i) for i in range(2,7)]:
        sheet[cell].font = Font(bold=True)
    CC = [c for c in cells if c[0]!='J' and c[1]!='6'] 
    my_fill = PatternFill(patternType='solid', fgColor='b6d7a8')
    for cell in CC + ['A1']:
        sheet[cell].fill = my_fill

# Function that converts a pandas dataframe to an excel worksheet
def df2sheet(df, sheet):
    for r in dataframe_to_rows(df, index=True, header=True):
        sheet.append(r)
    if not sheet['A2'].value:
        sheet.delete_rows(2)


## Import Data

In [None]:
#Replace datapath with the path to the folder containing the final features, with "/feats" added to the end
#Replace metapath with the path to "participant_questionnaire.csv"
D = DataGenerator(datapath="/content/drive/MyDrive/Upwork/Final_features/feats", metapath="/content/drive/MyDrive/Upwork/participant_questionnaire.csv")

data, labels = D.gen_data(mode='s_indep') #Data mode: s_indep : gives all the data from all patients in concatenated form

In [None]:
data.shape, labels.shape

((40960, 90), (40960, 3))

## Create necessary variables
- Open the Multiprocessing client
- Open the excel sheets to save data to
- Create variables containing all the required Models
- Create the pipeline for final model


In [None]:
#Dask client for parallel processing, just to speed up the model training and scoring
#Reduce the memory limit to 2GB if you are not using Google Colab
#But for this Subject Independent model, over 40GB of memory is needed, which most ordinary machines don't have (the limit below is set to 25GB)
client = Client(processes=False, memory_limit='25GB')

In [None]:
#Open a new Excel file to save all the data
acc = Workbook()
f1 = Workbook()

# If you want to append sheets to an existing excel file use the following code
# acc = load_workbook('path to acc excel file')
# f1 = load_workbook('path to f1 excel file')

In [None]:
#Dictionary containing the Feature Reduction Functions for 75% and 25% features 
FeatureScalers = {
        '0.75':{'LDA': LDA(), 
                'RFE': RFE(svm.SVC(kernel='linear', cache_size=7000),n_features_to_select=0.75), 
                'PCA': PCA(n_components=int(0.75*90)), 
                'FA': FactorAnalysis(n_components=int(0.75*90))},
        
        '0.25':{'LDA': LDA(), 
                'RFE': RFE(svm.SVC(kernel='linear', cache_size=7000),n_features_to_select=0.25), 
                'PCA': PCA(n_components=int(0.25*90)), 
                'FA': FactorAnalysis(n_components=int(0.25*90))}
                } 
    
#Dictionary containing the Final Classifiers to be used 
Classifiers = {
        'SVM_linear': svm.LinearSVC(), 
        'SVM_rbf': svm.SVC(kernel='rbf', cache_size=7000),
        'SVM_poly': svm.SVC(kernel='poly', cache_size=7000), 
        'SVM_sigmoid': svm.SVC(kernel='sigmoid', cache_size=7000),
        'DecTree': DecisionTreeClassifier(), 
        'LogReg': LogisticRegression(solver='liblinear'), 
        'GNB': GaussianNB(), 
        'KNN': KNeighborsClassifier()
        }



In [None]:
#Cache storage to speed up calculations - requires a lot of space
cache = "/content/Cache"

#Form one path of pipeline to use as input estimator
pipe = Pipeline(steps=[
                  ('Impute', SimpleImputer(missing_values=np.nan, strategy='mean', fill_value=0)),
                  ('Scale', StandardScaler()),
                  ('Feature_Elim', PCA()),
                  ('Classifier', svm.LinearSVC())], 
                memory = cache)

### Note: Pipelines
- A pipeline is a series of processing steps and a final model that are executed one after another.
- Recall that our model has multiple stages:
  - Data Extraction
  - Imputing NaN Values
  - Scaling
  - Feature Reduction
  - Classification
- A pipeline can hold an individual estimator for each of these steps
  - It creates an overall model that ensures all of the steps are executed sequentially
  - There's no hassle of executing each step individually
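As a rough illustration of the idea, the sketch below builds a small pipeline with the same step names on synthetic data. It uses plain scikit-learn (not dask-ml), made-up data, and a logistic regression classifier purely for speed; none of these choices come from the notebook's actual run.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 samples, 90 features, binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 90))
y_demo = (X_demo[:, 0] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.01] = np.nan  # sprinkle in some missing values

demo_pipe = Pipeline(steps=[
    ('Impute', SimpleImputer(strategy='mean')),
    ('Scale', StandardScaler()),
    ('Feature_Elim', PCA(n_components=10)),
    ('Classifier', LogisticRegression()),
])

# One fit call runs impute -> scale -> reduce -> classify in order.
demo_pipe.fit(X_demo, y_demo)
preds = demo_pipe.predict(X_demo)
print(preds.shape)  # (200,)

# A whole step can be swapped by name, which is what a parameter grid
# keyed on 'Feature_Elim' or 'Classifier' relies on.
demo_pipe.set_params(Feature_Elim=PCA(n_components=5))
```

Because every step has a name, a grid search can treat the step itself as a hyperparameter and try different estimators in that slot.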


## Model Execution and Scoring

### 75% Features, Label 0 - Val

In [None]:
'''Change the score variable accordingly: 
  0 - Valence
  1 - Arousal
  2 - Dominance
'''
score = 0 #Choosing only the Valence scores
X = data
y = labels[:, score] 

In [None]:
# Create a grid of options corresponding to the different classifiers and feature reduction methods
# RFE is left out of the grid (as in the printed grid below); its row is filled with 0 when the results are collected
# !!!! Change Feature value to 0.25 for 25% Features
param_grid = {'Classifier':list(Classifiers.values()), 
              'Feature_Elim':[fe for name, fe in FeatureScalers['0.75'].items() if name != 'RFE'] #!!!! Change Feature value HERE to 0.25 for 25% Features
              }


In [None]:
# This grid is used in the cross-validation scheme to get all the scores of all the models at once
param_grid

{'Classifier': [LinearSVC(),
  SVC(cache_size=7000),
  SVC(cache_size=7000, kernel='poly'),
  SVC(cache_size=7000, kernel='sigmoid'),
  DecisionTreeClassifier(),
  LogisticRegression(solver='liblinear'),
  GaussianNB(),
  KNeighborsClassifier()],
 'Feature_Elim': [LinearDiscriminantAnalysis(),
  PCA(n_components=67),
  FactorAnalysis(n_components=67)]}

In [None]:
scoring = ['accuracy', 'f1']

splits = 10 #Change this to 5 if taking too much time/space

grid_search = GridSearchCV(copy.copy(pipe), #Input estimator - our pipeline - take only a copy of it to ensure no overlapping errors
                           param_grid=param_grid, #Applies cross-val for all the combinations of the param_grid of the pipeline
                           cv=splits, #Uses StratifiedKFold cross validation
                           return_train_score=False, 
                           refit=False,
                           verbose=2, #Reduce this if you don't want to see intermediate outputs; Increase if you want to see more outputs
                           scoring=scoring, #Multi-metric - Accuracy and F1 scores are calculated and returned together
                           n_jobs=-1) #Use all available cpu cores in the system for training
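To see what such a search returns before committing to the full run, here is a much smaller sketch on synthetic data. The data, the reduced grid, and `cv=3` are illustrative assumptions only; the step names match the pipeline above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with binary labels (needed for the 'f1' scorer).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 20))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

demo_pipe = Pipeline(steps=[('Feature_Elim', PCA()), ('Classifier', GaussianNB())])

# Each key names a pipeline step; each value lists estimators to try in that step.
demo_grid = {
    'Classifier': [GaussianNB(), LogisticRegression(), DecisionTreeClassifier(random_state=0)],
    'Feature_Elim': [PCA(n_components=5), PCA(n_components=10)],
}

search = GridSearchCV(demo_pipe, param_grid=demo_grid, cv=3,
                      scoring=['accuracy', 'f1'], refit=False)
search.fit(X_demo, y_demo)

results = search.cv_results_
print(len(results['params']))  # 6 combinations: 3 classifiers x 2 reducers
print(results['mean_test_accuracy'].shape, results['mean_test_f1'].shape)
```

With multi-metric scoring, each metric gets its own `mean_test_<name>` entry, with one value per combination in the grid.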

In [None]:

with joblib.parallel_backend('dask'): #Run the joblib jobs on the dask client opened above to speed up calculations
    grid_search.fit(X, y)

In [None]:
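#Get the results
CV = grid_search.cv_results_
#24 scores = 3 feature reduction methods x 8 classifiers; order='F' puts the methods in rows and the classifiers in columns
acc_results = np.round(CV['mean_test_accuracy'], 4).reshape(3, 8, order='F')
#Insert a row of zeros at index 1 for RFE, which was not part of the grid
acc_results = np.insert(acc_results, 1, 0, axis=0)
f1_results = np.round(CV['mean_test_f1'], 4).reshape(3, 8, order='F')
f1_results = np.insert(f1_results, 1, 0, axis=0)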
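The reshape step is easy to get wrong, so here is a small NumPy sketch of it with made-up scores. It assumes the flat results come in the order used for this grid (for each classifier in turn, one score per feature reduction method); the encoded values `10 * c + f` are purely illustrative.

```python
import numpy as np

n_feature_elims, n_classifiers = 3, 8

# Hypothetical flat scores: classifier 0 with each feature reduction method,
# then classifier 1 with each method, and so on. The value 10*c + f encodes
# which (classifier c, method f) pair a score belongs to.
flat = np.array([10 * c + f
                 for c in range(n_classifiers)
                 for f in range(n_feature_elims)])

# Column-major fill: every run of 3 consecutive scores becomes one column,
# so rows are feature reduction methods and columns are classifiers.
table = flat.reshape(n_feature_elims, n_classifiers, order='F')
print(table[2, 5])  # 52 -> classifier 5, method 2

# Add a placeholder row of zeros at index 1 for the method that was skipped.
table_with_gap = np.insert(table, 1, 0, axis=0)
print(table_with_gap.shape)  # (4, 8)
```

With the default row-major order, the same call would mix scores from different classifiers within a row, which is why `order='F'` is used.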

In [None]:
# Put the results into a dataframe and view
A_df = pd.DataFrame(acc_results, index = FeatureScalers['0.75'].keys(), columns = Classifiers.keys())
F_df = pd.DataFrame(f1_results, index = FeatureScalers['0.75'].keys(), columns = Classifiers.keys())

In [None]:
A_df

In [None]:
F_df

In [None]:
#Calculate the maximum across both axes
A_df['Max'] = A_df.max(axis=1)
A_df.loc['Max'] = A_df.max()

F_df['Max'] = F_df.max(axis=1)
F_df.loc['Max'] = F_df.max()

In [None]:
A_df

In [None]:
F_df

In [None]:
#Create a sheet in the open Workbooks with the correct name
#!!!! Change the name for different sheets
wA = acc.create_sheet("75% Val")
wF = f1.create_sheet("75% Val")

In [None]:
# Apply formatting to the sheets
df2sheet(A_df, wA)
df2sheet(F_df, wF)

apply_formatting(acc, 'Acc')
apply_formatting(f1, 'F1')

In [None]:
#Save Workbook to excel file 
# !!! Remember to change names as and when you save
path = "/content/drive/MyDrive/Upwork/Results/Subject Independent/"

acc.save(path + 'S_Indep_Acc.xlsx')
f1.save(path + 'S_Indep_F1.xlsx')

**Similarly, change `score` to 1 or 2 to get the Arousal and Dominance results, and change the feature fraction to 0.25 for 25% of the features.**
<br>
Change the values and sheet names accordingly to get the 6 different sheet combinations (3 labels x 2 feature fractions) and save them to the Excel file.

Subject Independent takes a really long time: <br>
**8 hours or more** per sheet, so I advise against running it in a local Jupyter Notebook.