#  Project summary 

Project title : Mental State Classification using EEG-based MUSE Brain-Machine Interface

|Project Goal|:

    -Create a machine learning model to predict the state of the brain between "Normal" and "Fatigue"

|Use cases|:

    -High-risk or dangerous jobs, such as construction work and piloting.

|Data preprocessing|:

    -We preprocessed the data with multiple methods to see which gave the best results.

    -Generated 4 different datasets:
    
        data_preprocessing sub-folder: Gaussian-based outlier detection and imputation of missing values applied to the original dataset.

        data_preprocessing_with_PCA sub-folder: 5 principal components (PCA) added to each dataset in the data_preprocessing sub-folder.

        data_preprocessing_with_FastICA sub-folder: 5 independent components (FastICA) added to each dataset in the data_preprocessing sub-folder.

        data_preprocessing_with_FastICA_Temporal_Freq sub-folder: 120 temporal features and 100 frequency features added to each dataset in the data_preprocessing_with_FastICA sub-folder, for a total of 245 features per dataset.


|AutoML Library|: 

    -Used an AutoML library (`supervised.automl`) to identify the best-performing model, then built our model around it to save time and improve accuracy


|Results|:

    -Used the AutoML library to try different ML models on the 4 preprocessed datasets and determine which algorithm gives the highest accuracy. Models considered: Baseline, Linear, Decision Tree, Random Forest, XGBoost, Neural Network.

    -The highest accuracy was achieved on the data_preprocessing_with_PCA subfolder (99.1% with Neural Network)

|Challenges|:

    -Differentiating between fatigue and normal data

    -Misalignment of the Muse device (headband) leading to null values

# Converting the Temporal Raw Data to an Aggregated Data Format

## Importing the Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime, timedelta

## Importing the Dataset 

In [None]:
dataset = pd.read_csv('datasets/Mohamed_Normal_1.csv')

type(dataset)


## Showing the Dataset in a Table

In [None]:
pd.DataFrame(dataset)

##  A Quick Review of the Data

In [None]:
# Number of rows in the dataset
len(dataset.index)

In [None]:
# column data types
dataset.dtypes

## Converting column 'TimeStamp' to type datetime64[ns]


In [None]:
dataset['TimeStamp'] = pd.to_datetime(dataset['TimeStamp'])

In [None]:
type(dataset['TimeStamp'][0])

## Dropping rows related to eye blinking or jaw clenching

In [None]:
# Keep only the rows with at least 3 non-NA values and keep the DataFrame with valid entries in the same variable(i.e. dataset).
dataset.dropna(thresh=3,axis=0, inplace=True)
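To see what `thresh=3` does, here is a minimal sketch on a made-up three-row table. The column subset, the values, and the event label in `Elements` are illustrative assumptions, not data from the recordings. The idea is that an event row (blink or jaw clench) has almost every sensor column empty, so requiring at least three non-NA values removes it.

```python
import numpy as np
import pandas as pd

# Toy frame: the middle row mimics an event marker with only two filled cells.
toy = pd.DataFrame({
    "TimeStamp": ["2021-01-01 10:00:00.000",
                  "2021-01-01 10:00:00.050",
                  "2021-01-01 10:00:00.100"],
    "Delta_TP9": [0.51, np.nan, 0.47],
    "Theta_TP9": [0.12, np.nan, 0.15],
    "Elements": [np.nan, "/muse/elements/blink", np.nan],  # illustrative label
})

# Keep only the rows that have at least 3 non-NA values.
kept = toy.dropna(thresh=3, axis=0)
print(kept)
```

Rows 0 and 2 each have three filled cells and survive; row 1 has only two and is dropped.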

In [None]:
pd.DataFrame(dataset)


## Discretising time steps and creating an empty dataset

In [None]:
'''
time_step_size: time step size (i.e. granularity level) used to discretise the time axis
columns: the column labels of the DataFrame
index: the index (row labels) of the DataFrame
'''
time_step_size = 100
unit = 'ms'  # if the unit changes, also change it wherever a timedelta is built from time_step_size

timestamps = pd.date_range(dataset['TimeStamp'].min(), dataset['TimeStamp'].max(), freq=str(time_step_size) + unit)

cols = ['Delta_TP9', 'Delta_AF7', 'Delta_AF8', 'Delta_TP10',
        'Theta_TP9', 'Theta_AF7', 'Theta_AF8', 'Theta_TP10',
        'Alpha_TP9', 'Alpha_AF7', 'Alpha_AF8', 'Alpha_TP10',
        'Beta_TP9', 'Beta_AF7', 'Beta_AF8', 'Beta_TP10',
        'Gamma_TP9', 'Gamma_AF7', 'Gamma_AF8', 'Gamma_TP10',
        'RAW_TP9', 'RAW_AF7', 'RAW_AF8', 'RAW_TP10']
new_dataset = pd.DataFrame(index=timestamps, columns=cols)

### Showing the empty dataset

In [None]:
new_dataset

## Deriving the values of the empty dataset 
Next, we derive the value of each column at each discrete time step (i.e. each row of the new, empty dataset).

In [None]:
'''
Row selection uses boolean masks to filter the data.
The element-wise operators are | (or), & (and), and ~ (not); each condition must be wrapped in parentheses.
Source: Stack Overflow

Each row in the new dataset summarises the values observed in the interval that starts at its
time step and ends just before the next one, i.e. [t, t + Δt). ("Chp2 of the book")

Aggregation type: average
'''
step = timedelta(milliseconds=time_step_size)

# Iterate over the rows of the new (discretised) dataset, not over the original one
for t in new_dataset.index:

    # Rows of the original dataset whose timestamp falls in [t, t + step)
    selected_rows = dataset[(dataset['TimeStamp'] >= t) & (dataset['TimeStamp'] < t + step)]

    if len(selected_rows) > 0:
        print(len(selected_rows))
        for col in cols:
            new_dataset.loc[t, col] = np.average(selected_rows[col])

    else:
        print("selected_rows was empty")
        for col in cols:
            new_dataset.loc[t, col] = np.nan
     
     




## Showing the final dataset

In [None]:
new_dataset

# Joining Datasets 

Sources:

[1] https://www.datacamp.com/community/tutorials/joining-dataframes-pandas

[2] https://www.geeksforgeeks.org/how-to-drop-one-or-multiple-columns-in-pandas-dataframe/

[3] https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/

## Dataset combination and mental state classification

In [None]:
import os
import pandas as pd
import numpy as np

'''
fileNames is a list with the names of the csv files within the 'datasets' path
'''

fileNames = []
for file in os.listdir("datasets"):
    if file.endswith(".csv"):
        fileNames.append(file)


print(fileNames)
print( 'Total number of csv files to be combined is : {}'.format(len(fileNames)))

'''
function that reads the file from the fileNames list, converts it to a dataFrame, and adds
a mental state column to the dataframe: 0 for "Fatigue" and 1 for "Normal"
'''
def getFile(fn):
    location = 'datasets/' + fn
    df = pd.read_csv(location)
    if "Normal" in fn:
        df['State'] = 1
    else:
        # Fatigue-typed csv file
        df['State'] = 0
    return df


'''
List comprehension to create the final dataframe

'''
dfs = [getFile(file) for file in fileNames]
df_final = pd.concat(dfs,ignore_index=True)
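Note that `getFile` labels every file whose name does not contain "Normal" as fatigue, including any stray or misnamed CSV. A stricter variant is sketched below. It assumes that fatigue recordings carry the word "Fatigue" in their file name, which this notebook does not confirm; `label_from_name` and the second file name are hypothetical.

```python
def label_from_name(file_name: str) -> int:
    """Return 1 for a "Normal" recording and 0 for a "Fatigue" recording.

    Raises ValueError for names that match neither keyword, so a misnamed
    file is not silently labelled as fatigue.
    """
    if "Normal" in file_name:
        return 1
    if "Fatigue" in file_name:  # assumption: fatigue files are named this way
        return 0
    raise ValueError(f"Cannot infer the mental state from file name: {file_name!r}")


names = ["Mohamed_Normal_1.csv", "Mohamed_Fatigue_1.csv"]  # second name is hypothetical
labels = {name: label_from_name(name) for name in names}
print(labels)
```

Inside `getFile`, the `if`/`else` block could then be replaced by `df['State'] = label_from_name(fn)`.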

## Showing the Final Dataset in a Table

In [None]:
df_final

## Dropping rows related to eye blinking or jaw clenching

In [None]:
# Keep only the rows with at least 4 non-NA values and keep the DataFrame with valid entries in the same variable(i.e. df_final).
df_final.dropna(thresh=4,axis=0, inplace=True)

In [None]:
df_final

## Dropping other irrelevant columns

In [None]:
# Remove the columns at positions 25 to 38 (the slice end, 39, is exclusive)
df_final.drop(columns=df_final.columns[25:39], inplace=True)
# Remove the timestamp column
df_final.drop(columns=['TimeStamp'], inplace=True)

In [None]:
df_final

# Data preprocessing 

In [None]:
import pandas as pd
import numpy as np

from Visualization import Visualization
from OutlierDetection import DistributionBasedOutlierDetection
from ImputationMissingValues import ImputationMissingValues
from KalmanFilters import KalmanFilters


import os
from pathlib import Path
import copy

In [None]:
# instances of our classes
outlier_distr = DistributionBasedOutlierDetection()
imputator=ImputationMissingValues()
data_visualizer= Visualization()

In [None]:
start_folder='datasets'
end_folder ='data_preprocessing'

Path(end_folder).mkdir(exist_ok=True, parents=True)


for file in os.scandir(start_folder):
   
        file_path = file.path
        file_name=file.name
        print("current file_path : ",file_path)
        print("current file_name : ",file_name)
        
        if  file.name != '.DS_Store': 
            # index_col indicates what column of the csv to use as the indexes (row labels) of the dataframe
            dataset = pd.read_csv(file_path,index_col=0)

            

            # Dropping rows related to eye blinking or jaw clenching by keeping only the rows with at least 3 non-NaN values
            dataset.dropna(thresh=3,axis=0, inplace=True)

            # Remove all columns between column index 20 to 39
            dataset.drop(dataset.iloc[:, 20:39], inplace = True, axis = 1)


            # Outlier detection using a Gaussian mixture model
            for col in [col for col in dataset.columns if col != "State"]: 
                print('current column: {}'.format(col))
                dataset = outlier_distr.mixture_model(dataset, col, 3)




                print('Number of outliers with density threshold less than 0.0005 for column ' + col + ': ' + str(dataset[col+'_mixture'][dataset[col+'_mixture'] < 0.0005].count()))
                dataset.loc[dataset['{}_mixture'.format(col)] < 0.0005, col] = np.nan # creating a missing value
                del dataset[col + '_mixture']
                
                # Imputing missing values
                dataset= imputator.impute_interpolate(dataset,col)
                print('NaNs total: {}'.format(dataset[col].isna().sum())  )
               
                


            # Adding a mental state column to the dataset: 0 for "Fatigue" and 1 for "Normal"
            if "Normal" in file_name:
                dataset['State'] = 1
            else:
                # Fatigue-typed csv file
                dataset['State'] = 0
                
            #Dataset visualization :
            data_visualizer.plot_dataset(dataset, ['Gamma_','Beta_', 'Alpha_', 'Theta_', 'Delta_'],
                    ['like', 'like', 'like', 'like', 'like', 'like'],
                    ['line', 'line', 'line', 'line', 'line'],file_name.split('.')[0])
            # save the new file in the new folder named end_folder
            dataset.to_csv(end_folder +'/' + file.name)
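The helper classes `DistributionBasedOutlierDetection` and `ImputationMissingValues` are imported from local modules that are not shown here. The sketch below is our guess at the idea behind the two steps, not their actual implementation: fit a Gaussian mixture to one column, turn every value whose density falls below a threshold into NaN, then fill the gaps by interpolation. It uses scikit-learn's `GaussianMixture`; `mask_low_density` and the synthetic data are hypothetical. The loop above passes `3` to `mixture_model`, presumably the number of components. The toy example uses one component so that the single injected outlier cannot end up with a component of its own.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture


def mask_low_density(series: pd.Series, n_components: int, threshold: float) -> pd.Series:
    """Return a copy of `series` with low-density values replaced by NaN."""
    valid = series.dropna()
    X = valid.to_numpy().reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(X)
    # score_samples returns the log-density of each sample under the fitted mixture.
    density = np.exp(gmm.score_samples(X))
    out = series.copy()
    out.loc[valid.index[density < threshold]] = np.nan
    return out


rng = np.random.default_rng(0)
values = rng.normal(loc=0.0, scale=1.0, size=500)
values[250] = 25.0  # injected outlier
col = pd.Series(values)

masked = mask_low_density(col, n_components=1, threshold=0.0005)
# Linear interpolation fills interior gaps; back-fill covers a gap at the very start.
imputed = masked.interpolate().bfill()
print("flagged:", int(masked.isna().sum()), "remaining NaNs:", int(imputed.isna().sum()))
```

With more components, an isolated extreme value can receive its own narrow component and a high density, so it is worth checking the printed outlier counts per column rather than assuming the threshold caught everything.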
            

# Feature engineering

In [None]:
from DataTransformation import PrincipalComponentAnalysis,IndependentComponentAnalysis

from FrequencyAbstraction import FourierTransformation
from TemporalAbstraction import TemporalNumericalAggregation

PCA = PrincipalComponentAnalysis()
FastICA = IndependentComponentAnalysis()

# Create an instance of the frequency abstraction (FA) class
freq_transformer = FourierTransformation()

# Create an instance of the temporal abstraction (TA) class
num_aggregator = TemporalNumericalAggregation()

## PCA

In [None]:
start_folder='data_preprocessing'
end_folder ='data_preprocessing_with_PCA'

Path(end_folder).mkdir(exist_ok=True, parents=True)

# find number_components for PCA


# Finding explained variance ratios for each dataset
PCA_all_evs=[]
for file in os.scandir(start_folder):
   
        file_path = file.path
        file_name=file.name
        print("current file_path : ",file_path)
        print("current file_name : ",file_name)
        
        if  file.name != '.DS_Store': 
            # index_col indicates what column of the csv to use as the indexes (row labels) of the dataframe
            dataset = pd.read_csv(file_path,index_col=0)
            cols=[ col for col in dataset.columns if col != "State" ]
            PCA_ev = PCA.determine_pc_explained_variance(dataset, cols)
            PCA_all_evs.append(PCA_ev)

# Plotting explained variance ratios for each dataset

for i in range (len(PCA_all_evs)):
     data_visualizer.plot_xy(x=[range(1, len(cols)+1)], y=[PCA_all_evs[i]],
                        xlabel='principal component number', ylabel='explained variance ratio',
                        ylim=[0, 1], line_styles=['b-'])

The 8 figures above (one per dataset) let us apply the elbow method to pick the number of principal components:

- Figure 1:   4
- Figure 2:   3
- Figure 3:   5
- Figure 4:   5
- Figure 5:   4
- Figure 6:   5
- Figure 7:   4
- Figure 8:   4

In the next part, we can use either 4 or 5 for the `number_components` variable, since they are the two most common values. We use 5.
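For readers without access to the `PrincipalComponentAnalysis` helper, the quantity plotted above can be sketched with NumPy alone. This is an illustration of what an explained variance ratio is, run on synthetic data. It is not the helper's implementation, and it skips any normalisation the helper may apply. `explained_variance_ratio` is a hypothetical name.

```python
import numpy as np


def explained_variance_ratio(X: np.ndarray) -> np.ndarray:
    """Share of the total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)                    # centre each column
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, in descending order
    var = s ** 2                               # proportional to the component variances
    return var / var.sum()


rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))             # two underlying signals
mixing = rng.normal(size=(2, 6))
# Six correlated columns built from the two signals, plus a little noise.
X = latent @ mixing + 0.01 * rng.normal(size=(300, 6))

ratios = explained_variance_ratio(X)
print(np.round(ratios, 3))
```

Because the synthetic columns are driven by two signals, almost all of the variance sits in the first two components and the curve flattens after them. That flattening point is the "elbow" read off the figures.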

In [None]:

number_components=5 


# Loop through all the files
for file in os.scandir(start_folder):
   
        file_path = file.path
        file_name=file.name
        print("current file_path : ",file_path)
        print("current file_name : ",file_name)
        
        if  file.name != '.DS_Store': 
            # index_col indicates what column of the csv to use as the indexes (row labels) of the dataframe
            dataset = pd.read_csv(file_path,index_col=0)
            
            # Applying PCA
            cols=[ col for col in dataset.columns if col != "State" ]
            dataset = PCA.apply_pca(copy.deepcopy(dataset),cols, number_components)
            
            # moving "State" column to the end
            col_State_copy = dataset['State'].copy()
            dataset=dataset.drop(['State'], axis = 1)
            dataset['State']=col_State_copy
            
            #Dataset visualization :
            data_visualizer.plot_dataset(dataset, ['Gamma_','Beta_', 'Alpha_', 'Theta_', 'Delta_','pca_'],
                    ['like', 'like', 'like', 'like', 'like', 'like','like'],
                    ['line', 'line', 'line', 'line', 'line','line'],file_name.split('.')[0])
            # save the new file in the new folder named end_folder
            dataset.to_csv(end_folder +'/' + file.name)
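The cell above adds the principal components as extra columns and then moves `State` back to the end. A minimal NumPy and pandas sketch of those two steps follows. `append_pca`, the toy column names, and the `pca_1`, `pca_2` column names are assumptions. Only the `pca_` prefix appears in this notebook (in the plotting call), and the real `apply_pca` may normalise the data differently.

```python
import numpy as np
import pandas as pd


def append_pca(df: pd.DataFrame, cols: list, k: int) -> pd.DataFrame:
    """Append the first k principal-component scores of `cols`, keeping 'State' last."""
    out = df.copy()
    X = out[cols].to_numpy(dtype=float)
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T
    for i in range(k):
        out[f"pca_{i + 1}"] = scores[:, i]
    # pop() removes the column; re-assigning it appends it at the end.
    out["State"] = out.pop("State")
    return out


rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
toy["State"] = 1

result = append_pca(toy, ["a", "b", "c", "d"], k=2)
print(result.columns.tolist())
```

`out.pop("State")` followed by re-assignment does the same job as the copy, drop, and re-add sequence used in the cell above.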
        

## ICA 

In [None]:
start_folder='data_preprocessing'
end_folder ='data_preprocessing_with_FastICA'
Path(end_folder).mkdir(exist_ok=True, parents=True)



In [None]:
# Use the same number of components for FastICA as for PCA
number_components_ICA=5
# Loop through all the files
for file in os.scandir(start_folder):
   
        file_path = file.path
        file_name=file.name
        print("current file_path : ",file_path)
        print("current file_name : ",file_name)
        
        if  file.name != '.DS_Store': 
            # index_col indicates what column of the csv to use as the indexes (row labels) of the dataframe
            dataset = pd.read_csv(file_path,index_col=0)
            
            # Applying Fast ICA
            #https://stackoverflow.com/questions/27673231/why-should-i-make-a-copy-of-a-data-frame-in-pandas
            cols=[ col for col in dataset.columns if col != "State" ]
            dataset = FastICA.apply_fast_ica(copy.deepcopy(dataset),cols, number_components_ICA)
            
            # moving "State" column to the end
            col_State_copy = dataset['State'].copy()
            dataset=dataset.drop(['State'], axis = 1)
            dataset['State']=col_State_copy
            
            #Dataset visualization :
            data_visualizer.plot_dataset(dataset, ['Gamma_','Beta_', 'Alpha_', 'Theta_', 'Delta_','FastICA_'],
                    ['like', 'like', 'like', 'like', 'like', 'like','like'],
                    ['line', 'line', 'line', 'line', 'line','line'],file_name.split('.')[0])
            # save the new file in the new folder named end_folder
            dataset.to_csv(end_folder +'/' + file.name)
        

## FastICA & Temporal & Frequency

In [None]:
# Instances used in this section (created in the "Feature engineering" cell):
# freq_transformer
# num_aggregator

In [None]:
start_folder='data_preprocessing_with_FastICA'
end_folder ='data_preprocessing_with_FastICA_Temporal_Freq'
Path(end_folder).mkdir(exist_ok=True, parents=True)


In [None]:
# Loop through all the files
for file in os.scandir(start_folder):
   
        file_path = file.path
        file_name=file.name
        print("current file_path : ",file_path)
        print("current file_name : ",file_name)
        
        if  file.name != '.DS_Store': 
            # index_col indicates what column of the csv to use as the indexes (row labels) of the dataframe
            dataset = pd.read_csv(file_path,index_col=0)
            dataset.index = pd.to_datetime(dataset.index)
            #milliseconds_per_instance =((dataset.index[1] - dataset.index[0]).total_seconds() * 1000)
            milliseconds_per_instance=1000
           
            
            # Window size of 2 s for the frequency and time domains, expressed in number of instances
            window_size = int(float(2000)/milliseconds_per_instance)
            # Sampling rate passed to the Fourier transformation (evaluates to 200 with the value above)
            fs = float(200000)/milliseconds_per_instance
            
            # Applying temporal numerical aggregation
            # NOTE: `cols` is reused from the FastICA cell above (the non-"State" columns of the data_preprocessing files)
            dataset = num_aggregator.abstract_numerical(dataset, cols, window_size, ['mean', 'median', 'max', 'min', 'std', 'slope'])
            
            # Applying the Fourier transformation
            dataset = freq_transformer.abstract_frequency(dataset, cols,window_size, fs)
            
            
            # An overlap of 50%
            window_overlap = 0.5
            # Step between retained rows; a step of 1 keeps every row. max(1, ...) guards against a zero step.
            skip_points = max(1, int((1-window_overlap) * window_size))
            dataset = dataset.iloc[::skip_points,:]
            
            
            # Moving "State" column to the end
            col_State_copy = dataset['State'].copy()
            dataset=dataset.drop(['State'], axis = 1)
            dataset['State']=col_State_copy
            
            #Dataset visualization :
            data_visualizer.plot_dataset(dataset, ['Gamma_','Beta_', 'Alpha_', 'Theta_', 'Delta_','FastICA_'],
                    ['like', 'like', 'like', 'like', 'like', 'like','like'],
                    ['line', 'line', 'line', 'line', 'line','line'],file_name.split('.')[0])
            # save the new file in the new folder named end_folder
            dataset.to_csv(end_folder +'/' + file.name)
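The temporal aggregation and the overlap step can be illustrated with plain pandas. The sketch below is not the `TemporalNumericalAggregation` implementation: it covers four of the six aggregations on one toy column, and the generated column names are hypothetical. It uses a window of 4 rows so that the 50% overlap visibly drops rows. With the notebook's settings (`window_size = 2`), the step evaluates to 1 and every row is kept.

```python
import pandas as pd

toy = pd.DataFrame({"Alpha_TP9": [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]})
window_size = 4

# One rolling view over the trailing `window_size` rows, reused for each aggregation.
rolled = toy["Alpha_TP9"].rolling(window_size)
for agg in ("mean", "max", "min", "std"):
    toy[f"Alpha_TP9_temp_{agg}_ws_{window_size}"] = getattr(rolled, agg)()

# 50% overlap: consecutive retained windows share half of their rows.
window_overlap = 0.5
step = max(1, int((1 - window_overlap) * window_size))
subsampled = toy.iloc[::step, :]
print(subsampled)
```

The first `window_size - 1` rows have no complete window, so their aggregated values are NaN. This is one reason the joined dataset for this folder is later filtered with `dropna`.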
            

# ML Models

## data_preprocessing folder 

### Joining datasets

In [None]:
import os
import pandas as pd
import numpy as np

'''
fileNames is a list with the names of the csv files within the specified 
folder path (i.e. data_preprocessing, data_preprocessing_with_PCA, data_preprocessing_with_FastICA etc.. )
'''

fileNames = []
for file in os.listdir("data_preprocessing"):
    if file.endswith(".csv"):
        fileNames.append(file)


print(fileNames)
print( 'Total number of csv files to be combined is : {}'.format(len(fileNames)))

'''
function that reads the file from the fileNames list and converts it to a dataFrame
'''
def getFile(fn):
    location = 'data_preprocessing/' + fn
    df = pd.read_csv(location)
    return df


'''
List comprehension to create the final dataframe

'''
dfs = [getFile(file) for file in fileNames]
df_final = pd.concat(dfs,ignore_index=True)


In [None]:
# Showing the final dataset in a table
df_final

In [None]:
df_final['State'].unique()

In [None]:
df_final[df_final.columns[1:-1]]

### AutoML

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML

df= df_final
X_train, X_test, y_train, y_test = train_test_split(df[df.columns[1:-1]], df["State"], test_size=0.25)

automl = AutoML(eval_metric="accuracy")
automl.fit(X_train, y_train)

predictions = automl.predict(X_test)
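The cell above stores `predictions` but does not score them against `y_test`. One way to do so with NumPy and pandas only is sketched below. `score_predictions` and the toy labels are hypothetical, and this is not necessarily how the accuracy quoted in the project summary was measured.

```python
import numpy as np
import pandas as pd


def score_predictions(y_true, y_pred):
    """Return (accuracy, confusion table) for two equally long label sequences."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = float((y_true == y_pred).mean())
    confusion = pd.crosstab(
        pd.Series(y_true, name="actual"),
        pd.Series(y_pred, name="predicted"),
    )
    return accuracy, confusion


# Toy labels: 0 = "Fatigue", 1 = "Normal"
y_test_toy = [0, 0, 1, 1, 1, 0, 1, 0]
pred_toy = [0, 1, 1, 1, 0, 0, 1, 0]

accuracy, confusion = score_predictions(y_test_toy, pred_toy)
print(accuracy)
print(confusion)
```

In the notebook this would be called as `score_predictions(y_test, predictions)`. The confusion table shows whether the errors are concentrated in one of the two states, which a single accuracy figure hides.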


## data_preprocessing_with_PCA folder 

### Joining datasets

In [None]:
import os
import pandas as pd
import numpy as np

'''
fileNames is a list with the names of the csv files within the specified 
folder path (i.e. data_preprocessing, data_preprocessing_with_PCA, data_preprocessing_with_FastICA etc.. )
'''

fileNames = []
for file in os.listdir("data_preprocessing_with_PCA"):
    if file.endswith(".csv"):
        fileNames.append(file)


print(fileNames)
print( 'Total number of csv files to be combined is : {}'.format(len(fileNames)))

'''
function that reads the file from the fileNames list and converts it to a dataFrame
'''
def getFile(fn):
    location = 'data_preprocessing_with_PCA/' + fn
    df = pd.read_csv(location)
    return df


'''
List comprehension to create the final dataframe

'''
dfs = [getFile(file) for file in fileNames]
df_final = pd.concat(dfs,ignore_index=True)

In [None]:
df_final

In [None]:
df_final['State'].unique()

In [None]:
df_final[df_final.columns[1:-1]]

### AutoML

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML

df= df_final
X_train, X_test, y_train, y_test = train_test_split(df[df.columns[1:-1]], df["State"], test_size=0.25)

automl = AutoML(eval_metric="accuracy")
automl.fit(X_train, y_train)

predictions = automl.predict(X_test)


## data_preprocessing_with_FastICA folder 

### Joining datasets

In [None]:
import os
import pandas as pd
import numpy as np

'''
fileNames is a list with the names of the csv files within the specified 
folder path (i.e. data_preprocessing, data_preprocessing_with_PCA, data_preprocessing_with_FastICA etc.. )
'''

fileNames = []
for file in os.listdir("data_preprocessing_with_FastICA"):
    if file.endswith(".csv"):
        fileNames.append(file)


print(fileNames)
print( 'Total number of csv files to be combined is : {}'.format(len(fileNames)))

'''
function that reads the file from the fileNames list and converts it to a dataFrame
'''
def getFile(fn):
    location = 'data_preprocessing_with_FastICA/' + fn
    df = pd.read_csv(location)
    return df


'''
List comprehension to create the final dataframe

'''
dfs = [getFile(file) for file in fileNames]
df_final = pd.concat(dfs,ignore_index=True)

In [None]:
df_final

In [None]:
df_final['State'].unique()

In [None]:
df_final[df_final.columns[1:-1]]

### AutoML

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML

df= df_final
X_train, X_test, y_train, y_test = train_test_split(df[df.columns[1:-1]], df["State"], test_size=0.25)

automl = AutoML(eval_metric="accuracy")
automl.fit(X_train, y_train)

predictions = automl.predict(X_test)


## data_preprocessing_with_FastICA_Temporal_Freq  folder 

### Joining datasets

In [None]:
import os
import pandas as pd
import numpy as np

'''
fileNames is a list with the names of the csv files within the specified 
folder path (i.e. data_preprocessing, data_preprocessing_with_PCA, data_preprocessing_with_FastICA etc.. )
'''

fileNames = []
for file in os.listdir("data_preprocessing_with_FastICA_Temporal_Freq"):
    if file.endswith(".csv"):
        fileNames.append(file)


print(fileNames)
print( 'Total number of csv files to be combined is : {}'.format(len(fileNames)))

'''
function that reads the file from the fileNames list and converts it to a dataFrame
'''
def getFile(fn):
    location = 'data_preprocessing_with_FastICA_Temporal_Freq/' + fn
    df = pd.read_csv(location)
    # Dropping rows with empty values: keep only the rows with at least 246 non-NaN values.
    # This step was added because Toma's original data has a lot of zeros. Otherwise, we would use
    # 2 for "thresh" rather than 246.
    df.dropna(thresh=246,axis=0, inplace=True)
    return df


'''
List comprehension to create the final dataframe

'''
dfs = [getFile(file) for file in fileNames]
df_final = pd.concat(dfs,ignore_index=True)

In [None]:
df_final

In [None]:
df_final['State'].unique()

In [None]:
df_final[df_final.columns[1:-1]]

### AutoML

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML

df= df_final
X_train, X_test, y_train, y_test = train_test_split(df[df.columns[1:-1]], df["State"], test_size=0.25)

automl = AutoML(eval_metric="accuracy")
automl.fit(X_train, y_train)

predictions = automl.predict(X_test)
