# Machine Learning for EEG Dataset: DEAP

http://www.eecs.qmul.ac.uk/mmv/datasets/deap/


## Training Models and Obtaining Accuracies for Subject Dependent Data Classification

### Dataset (per Subject)

nFeatures = 18 for each frequency band <br>
nFeatures for each trial = 32 (electrode channels) x 5 (frequency bands) x 18 features <br>
<b>= 2880 features per trial per patient</b>

Each subject's data is reshaped into:
- (trials (40) x electrode channels (32), frequency bands (5) x nFeatures (18))
- <b>Final shape is (1280, 90)</b>
- 1280 samples and 90 features per sample
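The reshape described above can be sketched with NumPy. This is only an illustration on random data: it assumes the per-subject array has the layout (trials, channels, bands, features) and that a plain row-major reshape is used, which may differ from what the DataGenerator actually does.

```python
import numpy as np

# Synthetic stand-in for one subject's features (assumed layout):
# 40 trials x 32 channels x 5 frequency bands x 18 features.
rng = np.random.default_rng(0)
features = rng.normal(size=(40, 32, 5, 18))

# Merge trials and channels into the sample axis, and bands and
# features into the feature axis.
samples = features.reshape(40 * 32, 5 * 18)
print(samples.shape)
```

With this layout, row `t * 32 + c` of `samples` holds the 90 values of trial `t`, channel `c`.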

## Model

### Input Data --> Reshape --> Impute NaN Values --> Feature Elimination --> Classification --> Output

For the last 3 steps, there are multiple options as follows,

#### Impute NaN values:
- Use constant 0 value substitution
- Use mean of the column to substitute
- Use median of the column to substitute
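As a rough illustration of the three imputation options, the sketch below applies scikit-learn's `SimpleImputer` to a tiny made-up array. The array `X` is purely illustrative and is not taken from the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with one missing value in each column.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 10.0],
              [8.0, 1.0]])

imputers = {
    "constant": SimpleImputer(strategy="constant", fill_value=0),
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
}

# Each imputer replaces NaN column by column.
filled = {name: imp.fit_transform(X) for name, imp in imputers.items()}
for name, values in filled.items():
    print(name)
    print(values)
```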

#### Feature Elimination:
- LDA
- RFE
- PCA
- Factor Analysis (FA)
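PCA and RFE reduce the feature count in different ways: PCA projects the data onto new axes, while RFE keeps a subset of the original columns. The sketch below shows both on synthetic data; the dataset and the 25% fraction are assumptions chosen for illustration. Passing a float to `n_features_to_select` requires scikit-learn 0.24 or later.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic data standing in for the EEG feature matrix.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# PCA takes an absolute number of components.
pca = PCA(n_components=int(0.25 * X.shape[1]))
X_pca = pca.fit_transform(X)

# RFE accepts a fraction of features to keep and needs the labels,
# because it ranks features with the wrapped estimator.
rfe = RFE(SVC(kernel="linear"), n_features_to_select=0.25)
X_rfe = rfe.fit_transform(X, y)

print(X_pca.shape, X_rfe.shape)
```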

#### Classification:
- SVM
- Decision Tree
- Logistic Regression
- Gaussian Naive Bayes
- K Nearest Neighbours (KNN)
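One possible combination of these steps is sketched below as a scikit-learn `Pipeline` evaluated with stratified cross-validation. The synthetic data, the 1% missing-value rate, the choice of PCA with an RBF SVM, and the 5 folds are all assumptions for illustration, not the configuration used for the reported results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with 90 features per sample, like the reshaped input.
X, y = make_classification(n_samples=300, n_features=90,
                           n_informative=10, random_state=0)

# Knock out roughly 1% of the values to mimic NaNs in the features.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.01] = np.nan

pipe = Pipeline(steps=[
    ("Impute", SimpleImputer(strategy="mean")),
    ("Scale", StandardScaler()),
    ("Feature_Elim", PCA(n_components=int(0.75 * X.shape[1]))),
    ("Classifier", SVC(kernel="rbf")),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_validate(pipe, X, y, cv=cv,
                        scoring={"accuracy": "accuracy", "f1": "f1"})
mean_acc = float(np.mean(scores["test_accuracy"]))
mean_f1 = float(np.mean(scores["test_f1"]))
print(round(mean_acc, 4), round(mean_f1, 4))
```

Because imputation, scaling and feature reduction sit inside the pipeline, they are refitted on the training part of every fold, so no information from the held-out fold leaks into preprocessing.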


## Installations

In [None]:
#If not in colab - run this in cmd and restart jupyter notebook 
!pip install -U scikit-learn
!pip install openpyxl

Collecting scikit-learn
  Downloading https://files.pythonhosted.org/packages/f3/74/eb899f41d55f957e2591cde5528e75871f817d9fb46d4732423ecaca736d/scikit_learn-0.24.1-cp37-cp37m-manylinux2010_x86_64.whl (22.3MB)
     |████████████████████████████████| 22.3MB 1.5MB/s
Collecting threadpoolctl>=2.0.0
  Downloading https://files.pythonhosted.org/packages/f7/12/ec3f2e203afa394a149911729357aa48affc59c20e2c1c8297a60f33f133/threadpoolctl-2.1.0-py3-none-any.whl
Installing collected packages: threadpoolctl, scikit-learn
  Found existing installation: scikit-learn 0.22.2.post1
    Uninstalling scikit-learn-0.22.2.post1:
      Successfully uninstalled scikit-learn-0.22.2.post1
Successfully installed scikit-learn-0.24.1 threadpoolctl-2.1.0


In [None]:
#Check the installed scikit-learn version
import sklearn
sklearn.__version__
#This notebook was run with 0.24.1

'0.24.1'
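The version matters because `RFE` accepts a fractional `n_features_to_select` only from scikit-learn 0.24 onwards. A small programmatic check might look like the sketch below; the variable names are illustrative.

```python
import re

import sklearn

# Parse "major.minor" from the version string (ignores suffixes like ".post1").
match = re.match(r"(\d+)\.(\d+)", sklearn.__version__)
installed = (int(match.group(1)), int(match.group(2)))

# Fractional n_features_to_select in RFE needs scikit-learn >= 0.24.
supports_fractional_rfe = installed >= (0, 24)
print(sklearn.__version__, supports_fractional_rfe)
```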

## Imports

In [None]:
import copy
import pickle
import numpy as np
import pandas as pd

from scipy import sparse
from shutil import rmtree

import warnings
warnings.filterwarnings("ignore")

In [None]:
from sklearn import svm, datasets
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import RFE
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsTransformer, KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

from openpyxl import load_workbook, Workbook
from openpyxl.styles import Font
from openpyxl.styles.fills import PatternFill
from openpyxl.styles.borders import Border, Side
from openpyxl.utils.dataframe import dataframe_to_rows

In [None]:
#Only run if importing features data and data generator from google drive
#If you're using Google colab, like I am, you will have to import everything from google drive.
#Uploading locally will take too much time, and you'll have to upload every time you open the notebook or restart runtime
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import sys
sys.path.insert(1, '/content/drive/MyDrive/Upwork/') #Path to folder containing DataGenerator File
from DataGenerator import DataGenerator

## Custom Model
The `Model` class computes cross-validated accuracy and F1 scores for every combination of feature elimination method and classifier, and saves them to two Excel files (one for accuracy, one for F1) at the designated path under the required name.
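The core idea, a grid of scores with one row per feature reduction method and one column per classifier, can be sketched on a much smaller scale. The data, the two reducers, the two classifiers and the 5 folds below are assumptions chosen to keep the example fast; the class in the next cell covers more combinations.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data standing in for the EEG features.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

reducers = {"LDA": LinearDiscriminantAnalysis(), "PCA": PCA(n_components=5)}
classifiers = {"GNB": GaussianNB(), "KNN": KNeighborsClassifier()}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# One row per reducer, one column per classifier.
table = pd.DataFrame(index=list(reducers), columns=list(classifiers), dtype=float)
for r_name, reducer in reducers.items():
    for c_name, clf in classifiers.items():
        pipe = Pipeline([("Scale", StandardScaler()),
                         ("Reduce", reducer),
                         ("Classifier", clf)])
        # cross_validate clones the pipeline, so the shared estimators stay unfitted.
        scores = cross_validate(pipe, X, y, cv=cv, scoring="accuracy")
        table.loc[r_name, c_name] = round(scores["test_score"].mean(), 4)

# Add the best score per row and per column.
table["Max"] = table.max(axis=1)
table.loc["Max"] = table.max()
print(table)
```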

In [None]:
'''
In a class, the word "self" refers to an instance of this class.
- Functions and variables that belong to the class are saved as "self.someVariable" and "someFunction(self, ...)"
- The functions inside the class are called by self.someFunction(...)

- Outside the class, when running in normal code, the class functions are called as follows:
  - Obj = ClassName(...) [Obj is now an instance or Object of the class]
  - Obj.someFunction(...) [Obj has all the functions - execute the functions by calling them as part of the object]
  - Obj.someVariable [Any variable of the class can be accessed like this]

'''
class Model():
    '''The init function is the initializing function of the class. This function automatically runs and need not be called separately
    This function initializes all the required variables:
      - DataGen - is the DataGenerator that must be passed to the class 
      - pid - list of Patient IDs depending on the mode
      - n_splits: Cross_validation splits, default = 10
      - path: path to save the Score Excel sheets
      - mode: Mode of data to be obtained from the DataGenerator
      - cache: path to cache storage folder
      - name: name used for the output files. If not given, the files are named after pid

    This function also creates the following variables while initializing the class:
      - data, target: Data and labels obtained from calling the DataGenerator in the requested mode and pid
      - feat_opts: Feature Reduction options of 75% and 25%
      - metrics: scoring metrics of accuracy and f1
      - scoring: a dictionary containing the scoring metrics
      - order: Order of the score sheets to be saved in the excel file
      - border, col : used in formatting the excel file
      - run_dict: miscellaneous variable to keep track of the feature reduction and label - val/aro/dom
      - Classifiers: Dictionary containing all the Final Classifiers of the model
      - FeatureScalers: Dictionary containing all the Feature Reduction Methods of the model
    
    All these variables are used in the model to help automate the entire process
    '''
    def __init__(self, DataGen, pid, n_splits=10, path='', mode='s_dep', cache='', name='chuck'):
        self.path = path 
        self.cache = cache
        self.mode = mode
        self.pid = pid
        self.n_splits = n_splits
        self.D = DataGen
        self.data, self.target = self.D.gen_data(self.mode, self.pid)
        if isinstance(self.pid, list):
            self.pid = '_'.join(self.pid)
        if name != 'chuck':
            self.pid = name
        print(self.data.shape, self.target.shape)
        self.feat_opts = [0.75, 0.25]
        self.metrics = ['accuracy', 'f1']
        self.scoring = dict(zip(self.metrics,self.metrics))
        self.order = ['75% Val', '25% Val', '75% Aro', '25% Aro', '75% Dom', '25% Dom']
        self.border = Border(left=Side(style='thin'), right=Side(style='thin'), 
                     top=Side(style='thin'), bottom=Side(style='thin'))
        self.col = dict(zip(range(1, 11), 'ABCDEFGHIJ'))  # Column number -> Excel column letter
        self.run_dict = dict((zip(self.order, [(f, l) for l in range(3) for f in self.feat_opts] )))
        self.Classifiers = {
                'SVM_linear': svm.SVC(kernel='linear', cache_size=7000), 'SVM_rbf': svm.SVC(kernel='rbf', cache_size=7000),
                'SVM_poly': svm.SVC(kernel='poly', cache_size=7000), 'SVM_sigmoid': svm.SVC(kernel='sigmoid', cache_size=7000),
                'DecTree': DecisionTreeClassifier(), 'LogReg': LogisticRegression(solver='liblinear'), 
                'GNB': GaussianNB(), 'KNN': KNeighborsClassifier()}
        self.FeatureScalers = {
                'LDA': LDA(), 'RFE': RFE(svm.SVC(kernel='linear', cache_size=7000)), 'PCA': PCA(), 'FA': FactorAnalysis()}

    #________________________________________________________________________________________________________


    
    '''
    Function to create an initial pipeline. 
    Input:
      - fe: Name of Feature Elimination method
      - clf: Name of Classifier
      - n_features: 0.75 or 0.25 feature reduction
    Output:
      - Returns a full pipeline of the model with the input specifications
    '''
    def create_pipe(self, fe, clf, n_features):
        fe_model = copy.deepcopy(self.FeatureScalers[fe])
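        if fe=='RFE':
            # A float is read as the fraction of features to keep (scikit-learn >= 0.24)
            fe_model.n_features_to_select = n_features
        elif fe=='LDA':
            # LDA yields at most n_classes - 1 components, so n_components keeps its default
            pass
        else:
            # PCA and FA take an absolute number of components
            fe_model.n_components = int(n_features * self.data.shape[1])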

        pipe = Pipeline(steps=[
                          ('Impute', SimpleImputer(missing_values=np.nan, strategy='mean')),
                          ('Scale', StandardScaler()),
                          ('Feature_Elim', fe_model),
                          ('Classifier', copy.deepcopy(self.Classifiers[clf]))], 
                        memory = self.cache)
        return pipe
    #________________________________________________________________________________________________________

    
    '''
    Function to get the cross_validation scores of the model
    Input:
      - model: Model or estimator to be fitted for cross_validation
      - label: 0,1,2 for Val/Aro/Dom respectively
    Output:
      - returns the mean accuracy and mean f1 score after cross validation.
    '''
    def fit_score(self, model, label):
        estimator = copy.deepcopy(model)  # Work on a copy so the caller's pipeline stays unfitted
        scores = cross_validate(estimator, self.data, self.target[:, label], scoring=self.scoring,
                                 cv=StratifiedKFold(n_splits=self.n_splits, shuffle=True, random_state=1), n_jobs=-1)
        return round(np.mean(scores['test_accuracy']),4), round(np.mean(scores['test_f1']),4)
    #________________________________________________________________________________________________________

    '''
    Function to create an empty dataframe with the Column Names and Row Names as required
    Returns an empty DataFrame as follows:
        | SVM1 | SVM2 | SVM3 | SVM4 | DT | LogReg | GNB | KNN |
    LDA |                                                     |     
    RFE |                                                     |
    PCA |                                                     |
    FA  |                                                     |
    '''
    def create_df(self):
        #Create an empty DataFrame with the feature elimination methods as rows and the classifiers as columns
        return pd.DataFrame(index = self.FeatureScalers.keys(), columns = self.Classifiers.keys())
    #________________________________________________________________________________________________________
    '''
    Function to calculate scores for each model combination of feature elimination and classifier

    Input:
      - n_features: Feature Elimination 0.75 or 0.25 
      - label: 0,1,2 for Val/Aro/Dom respectively
    Output:
      - Returns 2 dataframes for Accuracy and F1 scores respectively.
    
    The returned DataFrames are of this format

    A/F | SVM1 | SVM2 | SVM3 | SVM4 | DT | LogReg | GNB | KNN | Max |
    _________________________________________________________________
    LDA |                                                           |     
    RFE |                                                           |
    PCA |                                                           |
    FA  |                                                           |
    _________________________________________________________________
    Max |                                                           |
    _________________________________________________________________

    '''
    def calc_scores(self, n_features, label):
        #Create a dataframe to save the values
        A_df = self.create_df()
        F_df = self.create_df()
        
        #Iterate over the FeatureScalers - Feature Elimination Methods
        for fred in self.FeatureScalers.keys():
            print(" ", end=" ")
            print(fred)
            #Iterate over the Classifiers - Final Classifier Methods
            for clf in self.Classifiers.keys():
                print(clf, end="\t")
                #Create a pipeline of the required Feature Elimination Method and final Classifier
                Model = self.create_pipe(fred, clf, n_features)
                #Calculate the cross_val score of the model
                [a, f] = self.fit_score(Model, label)
                #Save the score to the specific column and row of their respective dataframes 
                A_df.loc[fred, clf] = a
                F_df.loc[fred, clf] = f
            print()
        
        #Calculate the max columns
        A_df['Max'] = A_df.max(axis=1)
        A_df.loc['Max'] = A_df.max()
        
        F_df['Max'] = F_df.max(axis=1)
        F_df.loc['Max'] = F_df.max()

        #Completed message
        print('\n----> DF created: '+ str(n_features) + ' Features, Target - '+str(label))
        return [A_df, F_df]

    #The following 4 functions are helpers: they only format the data in the excel sheet and save it
    #________________________________________________________________________________________________________
    #Functions for formatting the excel sheet
    def apply_formatting(self, wb, metric):
        if wb.worksheets[0].title=="Sheet":
            wb.remove(wb.worksheets[0])
        for ws in wb.worksheets:
            ws['A1'] = metric
            A = pd.DataFrame(ws.values).iloc[1:, 1:].to_numpy()
            bold_idx = np.where(A == np.nanmax(A))
            bold_cells = [self.col[i+2] + str(j+2) for i,j in zip(bold_idx[1], bold_idx[0])]
            self.bold_colour(ws, bold_cells)
            self.apply_border(ws)
    
    #Functions for formatting the excel sheet - to apply borders
    def apply_border(self, sheet):
        cells = [i + '6' for i in 'BCDEFGHIJ']
        cells = cells + ['J' + str(i) for i in range(2,7)] + [i + '1' for i in 'ABCDEFGHIJ'] + ['A' + str(i) for i in range(2,7)]
        for cell in cells:
            sheet[cell].border = self.border
    
    #Functions for formatting the excel sheet - to highlight the cells of max accuracy        
    def bold_colour(self, sheet, cells):
        for cell in cells + [i + '1' for i in 'ABCDEFGHIJ'] + ['A' + str(i) for i in range(2,7)]:
            sheet[cell].font = Font(bold=True)
        CC = [c for c in cells if c[0]!='J' and c[1]!='6'] 
        my_fill = PatternFill(patternType='solid', fgColor='b6d7a8')
        for cell in CC + ['A1']:
            sheet[cell].fill = my_fill
    
    # Function that converts a pandas dataframe to an excel worksheet
    def df2sheet(self, df, sheet):
        for r in dataframe_to_rows(df, index=True, header=True):
            sheet.append(r)
        if not sheet['A2'].value:
            sheet.delete_rows(2)
    #________________________________________________________________________________________________________

    '''
    Main Function to Run

    Function that Generates 2 Excel Files(Accuracy and F1) each with 6 sheets- 
    - '75% Val'
    - '25% Val'
    - '75% Aro'
    - '25% Aro'
    - '75% Dom'
    - '25% Dom'
    and saves it to the designated path.

    '''
    def run(self):
        print('>>> Running Model for S' + self.pid)
        acc = Workbook() #Opens new Excel workbook to save all the data
        f1 = Workbook() #Opens new Excel workbook to save all the data
        
        #This for loop iterates over the sheet names
        for name, value in self.run_dict.items():
            '''Here, name is the sheet-name - obtained from the variable "self.order"
            Value is (n_features, label) corresponding to the sheet name
            For example: 
              - Sheetname - "75% Val"
              - Value - (0.75, 0)
            '''
            #Create sheets in the acc and f1 workbooks for the sheetname
            wA = acc.create_sheet(name)
            wF = f1.create_sheet(name)

            df = self.calc_scores(value[0], value[1]) # Calculates the scores of all the models and returns them as dataframes
            self.df2sheet(df[0], wA) #Convert the dataframe to the required sheet
            self.df2sheet(df[1], wF) #Convert the dataframe to the required sheet

        #Apply formatting to the entire workbook - bold, max values and highlight
        self.apply_formatting(acc, 'Acc') 
        self.apply_formatting(f1, 'F1')

        #Save the workbook at the required path
        acc.save(self.path + 'accuracy/s' + self.pid + '.xlsx')
        f1.save(self.path + 'f1/s' + self.pid + '.xlsx')
        print('>>> Completed')
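The mapping from sheet name to (feature fraction, label index) that `run` iterates over can be reproduced in isolation. The sketch below rebuilds it with illustrative variable names (`labels`, `run_plan`) that are not part of the class.

```python
# Feature fractions and label names, in the order used for the sheets.
feat_opts = [0.75, 0.25]
labels = ["Val", "Aro", "Dom"]

# Outer loop over labels, inner loop over fractions, so both fractions
# of one label are adjacent in the workbook.
run_plan = {
    f"{int(frac * 100)}% {name}": (frac, idx)
    for idx, name in enumerate(labels)
    for frac in feat_opts
}

for sheet, (frac, idx) in run_plan.items():
    print(sheet, frac, idx)
```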
    

## Subject Dependent Splits

In [None]:
#Replace datapath with the path to the folder containing the final features, followed by "/feats"
#Replace metapath with the path to "participant_questionnaire.csv"

#Create the DataGenerator Instance
D = DataGenerator(datapath="/content/drive/MyDrive/Upwork/Final_features/feats", 
                  metapath="/content/drive/MyDrive/Upwork/participant_questionnaire.csv")

In [None]:
# Change the path here - give the path of the folder you want to save the files to (it must end with "/").
# !!!Note: the folder must already contain empty "accuracy" and "f1" subfolders.
path = "/content/drive/MyDrive/Upwork/Results/Subject Dependent/"

#Change cache path as required
cache_path = "/content/Cache"
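Since `Model.run` saves to `path + 'accuracy/...'` and `path + 'f1/...'`, both subfolders have to exist beforehand. One way to create them is sketched below; it uses a temporary directory as a stand-in for the real results folder, so `results_root` is a hypothetical location.

```python
import os
import tempfile

# Hypothetical results folder; replace with the real path when running the notebook.
results_root = os.path.join(tempfile.mkdtemp(), "Results", "Subject Dependent")

# Create the two subfolders the model writes into; exist_ok avoids errors on reruns.
for sub in ("accuracy", "f1"):
    os.makedirs(os.path.join(results_root, sub), exist_ok=True)

# The model concatenates strings, so the path passed to it should end with "/".
path_for_model = results_root + "/"
print(sorted(os.listdir(results_root)))
```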

In [None]:
#Run the model for a single pid
model = Model(D, pid='07', mode = 's_dep',
              path=path, cache=cache_path)
model.run()

In [None]:
#Or use a for loop to run the model for all subjects
for pid in ['01', '02', '03']:
    model = Model(D, pid=pid, mode = 's_dep',
                path=path, cache=cache_path)
    model.run()
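To cover all 32 DEAP participants rather than three, the zero-padded IDs can be generated instead of typed out. This is a small sketch; the commented loop assumes `Model`, `D`, `path` and `cache_path` are defined as in the cells above.

```python
# Zero-padded subject IDs '01' .. '32', matching the pid format used above.
subject_ids = [f"{i:02d}" for i in range(1, 33)]
print(subject_ids[:3], "...", subject_ids[-1])

# With Model, D, path and cache_path defined as in the cells above:
# for pid in subject_ids:
#     Model(D, pid=pid, mode='s_dep', path=path, cache=cache_path).run()
```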

## Random Group Splits

In [None]:
#Replace datapath with the path to the folder containing the final features, followed by "/feats"
#Replace metapath with the path to "participant_questionnaire.csv"

#Create the DataGenerator Instance
D = DataGenerator(datapath="/content/drive/MyDrive/Upwork/Final_features/feats", metapath="/content/drive/MyDrive/Upwork/participant_questionnaire.csv")

In [None]:
#Run the DataGenerator at the required mode - s08/s16 to get the randomly selected groupings
f,l = D.gen_data(mode='s08')

The Split Groupings are: 
- ['30', '23', '10', '05']
- ['31', '04', '29', '16']
- ['08', '09', '12', '24']
- ['22', '17', '14', '13']
- ['32', '06', '27', '01']
- ['02', '11', '15', '25']
- ['03', '26', '21', '19']
- ['07', '18', '20', '28']
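The DataGenerator's grouping code is not shown here, so the following is only a hypothetical sketch of how 32 subjects could be split at random into 8 groups of 4. The seed and variable names are assumptions, and the groups it produces will not match the list above.

```python
import random

# All 32 subject IDs, zero-padded like the groupings above.
subject_ids = [f"{i:02d}" for i in range(1, 33)]

# Shuffle a copy with a fixed seed so the split is reproducible.
rng = random.Random(42)
shuffled = subject_ids[:]
rng.shuffle(shuffled)

# Cut the shuffled list into consecutive chunks of 4 subjects.
group_size = 4
groups = [shuffled[i:i + group_size] for i in range(0, len(shuffled), group_size)]
for group in groups:
    print(group)
```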

In [None]:
# Change the path here - give the path of the folder you want to save the files to (it must end with "/").
# !!!Note: the folder must already contain empty "accuracy" and "f1" subfolders.
random_path_s08 = "/content/drive/MyDrive/Upwork/Results/Random Group Split/Splits_8/"

#Change cache path as required
cache_path = "/content/Cache"

In [None]:
#Run the model for each split

#For each run of the model - Change the input of the pid values and name as required. 
model = Model(D, pid=['30', '23', '10', '05'], mode = 's_dep', name='Group0',
              path=random_path_s08, cache=cache_path) #Mode = s_dep: The model will concatenate all the 

(5120, 90) (5120, 3)


In [None]:
model.run()

>>> Running Model for SGroup0
  LDA
SVM_linear	SVM_rbf	SVM_poly	SVM_sigmoid	DecTree	LogReg	GNB	KNN	
  RFE
SVM_linear	SVM_rbf	SVM_poly	SVM_sigmoid	DecTree	LogReg	GNB	KNN	
  PCA
SVM_linear	SVM_rbf	SVM_poly	SVM_sigmoid	DecTree	LogReg	GNB	KNN	
  FA
SVM_linear	SVM_rbf	SVM_poly	SVM_sigmoid	DecTree	LogReg	GNB	KNN	

----> DF created: 0.75 Features, Target - 0
  LDA
SVM_linear	SVM_rbf	SVM_poly	SVM_sigmoid	DecTree	LogReg	GNB	KNN	
  RFE
SVM_linear	SVM_rbf	SVM_poly	SVM_sigmoid	DecTree	LogReg	GNB	KNN	
  PCA
SVM_linear	SVM_rbf	SVM_poly	SVM_sigmoid	DecTree	LogReg	GNB	KNN	
  FA
SVM_linear	SVM_rbf	SVM_poly	SVM_sigmoid	DecTree	LogReg	GNB	KNN	

----> DF created: 0.25 Features, Target - 0
  LDA
SVM_linear	SVM_rbf	SVM_poly	SVM_sigmoid	DecTree	LogReg	GNB	KNN	
  RFE
SVM_linear	SVM_rbf	SVM_poly	SVM_sigmoid	DecTree	LogReg	GNB	KNN	
  PCA
SVM_linear	SVM_rbf	SVM_poly	SVM_sigmoid	DecTree	LogReg	GNB	KNN	
  FA
SVM_linear	SVM_rbf	SVM_poly	SVM_sigmoid	DecTree	LogReg	GNB	KNN	

----> DF created: 0.75 Features, Target - 

In [None]:
#For each run of the model - Change the input of the pid values and name as required. 
model = Model(D, pid=['31', '04', '29', '16'], mode = 's_dep', name='Group1',
              path="/content/drive/MyDrive/Upwork/Results/Random Group Split/Splits_8/", cache="/content/Cache")
model.run()

(5120, 90) (5120, 3)
>>> Running Model for SGroup1
  LDA
SVM_linear	SVM_rbf	SVM_poly	SVM_sigmoid	DecTree	LogReg	GNB	KNN	
  RFE
SVM_linear	SVM_rbf	SVM_poly	SVM_sigmoid	DecTree	LogReg	GNB	KNN	
  PCA
SVM_linear	SVM_rbf	SVM_poly	SVM_sigmoid	DecTree	LogReg	GNB	KNN	
  FA
SVM_linear	SVM_rbf	SVM_poly	SVM_sigmoid	DecTree	LogReg	GNB	KNN	

----> DF created: 0.75 Features, Target - 0
  LDA
SVM_linear	SVM_rbf	SVM_poly	SVM_sigmoid	DecTree	LogReg	GNB	KNN	
  RFE
SVM_linear	SVM_rbf	SVM_poly	SVM_sigmoid	DecTree	LogReg	GNB	KNN	
  PCA
SVM_linear	SVM_rbf	

**Similarly, run the model for each remaining random split to get all the files**

In [1]:
import glob
import pickle
import numpy as np
import pandas as pd


In [9]:
with open(r'C:/Users/Aditi/Documents/Machine Learning/Upwork Project/data_preprocessed_python/data_preprocessed_python/Final_features/feats01.pickle', 'rb') as handle:
    data = pickle.load(handle)

#Inspect the shape: trials x electrode channels x frequency bands x features
np.shape(data)

(40, 32, 5, 18)