# Epileptic seizure prediction based on intracranial electroencephalography (EEG) data

***
***

## Summary

### I - Feature generation

<ol>
<li> <strong><em> Setup </em></strong>  </li>
    <ol>
        <li> Imports </li>
        <li> Parameters </li>
        <li> Logging </li>
    </ol>
<br>

<li> <strong><em> Helper functions </em></strong> </li>
    <ol>
        <li> Reading data from .mat files </li>
        <li> Timeseries features </li>
        <li> Entropy features </li>
    </ol>    
<br>

<li> <strong><em> Feature generation loop </em></strong> </li>
    <ol>
        <li> Feature generation function </li>
        <li> Feature generation in loop over patients </li>
    </ol>
<br>

<li> <strong><em> EDA of the generated dataset </em></strong> </li>
    <ol>
        <li> zero values and NANs </li>
        <li> Feature distibutions </li>
    </ol>
</ol>


### II - Machine learning model

<ol>
<li> <strong><em> Setup </em></strong> </li>
    <ol>
        <li> Imports </li>
        <li> Metrics function </li>
        <li> Create test and train datasets  </li>
    </ol>
<br>

<li> <strong><em> Model </em></strong> </li>
    <ol>
        <li> Choosing type of model </li>
        <li> Hyperparameter tuning with genetic algo </li>
        <li> ROC Curve and choosing threshold </li>
    </ol>  
</ol>

***

# Abstract

Epilepsy is one of the most common brain disorders, [according to the CDC](http://www.ur.ac.rw) 1.2% of the population has active epilepsy. About 40% of them have seizures that are not controlled by medication ([Kwan and Brodie 2000](https://pubmed.ncbi.nlm.nih.gov/11034869/)). The unpredictable nature of epileptic seizures can have a significant impact on their lives, making certain common activities such as driving or swimming potentially life-threatening. Being able to predict the onset of seizures could drastically improve the quality of life of those affected.  
The point of this work is to contribute to the [collective effort](https://academic.oup.com/brain/article/141/9/2619/5066003) guided by [the Epilepsy Ecosystem](epilepsyecosystem.org).  
A classifier will be designed and trained to maximise the AUC for 3 patients. It will be trained on features generated from the 16 channel EEG data from the NeuroVista trials for these same patients and then evaluated independently by the Epilepsy Ecosystem.

# I - Feature generation

## Setup

### Imports

In [1]:
from tqdm import tqdm
import sklearn.preprocessing as preprocessing
import numpy as np
from scipy.integrate import simps
import scipy.io as sio
import scipy.stats
import scipy.signal
from os import listdir
from os.path import isfile, join
from datetime import datetime, date
import warnings
warnings.filterwarnings('ignore')
import logging

### Parameters

In [2]:
DATA_PATH = 'C:/Users/gijsb/OneDrive/Documents/epilepsy_neurovista_data/'
TRAIN_PATHS = [f'Pat{i}Train' for i in [1, 2, 3]]
TEST_PATHS = [f'Pat{i}Test' for i in [1, 2, 3]]
FEATURE_SAVE_PATH = DATA_PATH # Path where output feature arrays will be saved

SAMPLING_FREQUENCY = 400
DOWNSAMPLING_RATIO = 5
CHANNELS = range(0,16)
BANDS = [0.1,1,4,8,12,30,70]
HIGHRES_BANDS = [0.1,1,4,8,12,30,70,180]

### Logging

In [3]:
#today_string = str(datetime.now())[0:19].replace('-', '_').replace(':', '_').replace(' ', '_')
now = datetime.now()
today_string = now.strftime("%d_%m_%Y__%H_%M_%S")
log_filename = f'feature_generation_log_{today_string}.log'
logging.basicConfig(level=logging.DEBUG, 
                    filename=log_filename, 
                    format='%(asctime)s.%(msecs)03d %(levelname)s {%(module)s} [%(funcName)s] %(message)s', 
                    datefmt='%Y-%m-%d,%H:%M:%S')

## Helper functions

#### Reading .mat files

In [4]:
def load_mat(mat_file_path):
    """
    input : filepath string
    output : numpy array, returns zeros if it cannot read the file
    """
    try:
        data = (sio.loadmat(mat_file_path)['data']).T
        logging.debug('data loaded')
        return data

    except Exception:
        warnings.warn(f'error reading .mat file {mat_file_path}')
        return np.zeros((16, 240000))

#### Features on timeseries

In [23]:
def zero_crossings(data):
    pos = data > 0
    return (pos[:-1] & ~pos[1:]).nonzero()[0].shape[0]


def band_energy(f, psd, low_f, high_f):
    # Find intersecting values in frequency vector
    idx_delta = np.logical_and(f >= low_f, f <= high_f)
    # The frequency resolution is the size of each frequency bin
    freq_res = f[1] - f[0]
    # Compute the absolute power by approximating the integral
    return simps(psd[idx_delta], dx=freq_res)


def total_energy(segment_downsampled):
    # From litterature the lowest frequencies of interest in a EEG is 0.5Hz so we need to keep our resolution at 0.25Hz hence a 4 second window cf.Nyquist
    window = SAMPLING_FREQUENCY / DOWNSAMPLING_RATIO * 4
    f, psd = scipy.signal.welch(
        segment_downsampled, fs=SAMPLING_FREQUENCY/DOWNSAMPLING_RATIO, nperseg=window)
    return psd.sum()


def highres_total_energy(segment):
    window = SAMPLING_FREQUENCY * 4
    f, psd = scipy.signal.welch(segment, fs=SAMPLING_FREQUENCY, nperseg=window)
    return psd.sum()

#### SVD Entropy

In [6]:
def _embed(x, order=3, delay=1):  # credits to raphaelvallat
    """Time-delay embedding.
    Parameters
    ----------
    x : 1d-array
        Time series, of shape (n_times)
    order : int
        Embedding dimension (order).
    delay : int
        Delay.
    Returns
    -------
    embedded : ndarray
        Embedded time-series, of shape (n_times - (order - 1) * delay, order)
    """
    N = len(x)
    if order * delay > N:
        raise ValueError("Error: order * delay should be lower than x.size")
    if delay < 1:
        raise ValueError("Delay has to be at least 1.")
    if order < 2:
        raise ValueError("Order has to be at least 2.")
    Y = np.zeros((order, N - (order - 1) * delay))
    for i in range(order):
        Y[i] = x[i * delay:i * delay + Y.shape[1]]
    return Y.T


def svd_entropy(x, order=3, delay=1, normalize=False):
    x = np.array(x)
    mat = _embed(x, order=order, delay=delay)
    W = np.linalg.svd(mat, compute_uv=False)
    # Normalize the singular values
    W /= sum(W)
    svd_e = -np.multiply(W, np.log2(W)).sum()
    if normalize:
        svd_e /= np.log2(order)
    return svd_e

## Feature generation loop
N.B. This loop assumes that data is stored in different folders for each patient and for train and test sets (this is how the data from Seer is provided so i made use of it)

In [40]:
def generate_features(patient_number, data_path, is_training_data, save_to_disk = True):
    
    filenames = [f for f in listdir(data_path) if isfile(join(data_path, f))]
    filelist = [join(data_path, f) for f in listdir(data_path) if isfile(join(data_path, f))]


    logging.debug(f'generated filelist of length {len(filelist)} for patient {patient_number}; is_training_data = {is_training_data}')
    
    counter = 0
    for filename in tqdm(filenames[0:20]):
        
        # Lists that will contain feature names and values, we will stack these to make X_train
        index = []
        features = []

        #Load file & normalise
        data = load_mat(join(data_path, filename))
        data = preprocessing.scale(data, axis=1, with_std=True)
        data_downsampled = scipy.signal.decimate(data, 5, zero_phase=True)
        
        logging.debug(f'starting feature generation file:{counter}')
        
        # ID features
        index.append('Patient')
        features.append(patient_number)
        
        index.append('filenumber')
        features.append(filename[filename.find('_')+1:-6])
        
        #accross channels features on full data
        correlation_matrix = np.corrcoef(data)
        correlation_matrix = np.nan_to_num(correlation_matrix)
        # take only values in upper triangle to avoid redundancy
        triup_index = np.triu_indices(16, k=1)
        for i, j in zip(triup_index[0], triup_index[1]):
            features.append(correlation_matrix[i][j])
            index.append(f'correlation_{i}-{j}')

        eigenvals = np.linalg.eigvals(correlation_matrix)
        eigenvals = np.nan_to_num(eigenvals)
        eigenvals = np.real(eigenvals)
        for i in CHANNELS:
            features.append(eigenvals[i])
            index.append(f'eigenval_{i}')
            
        # summed across all channels and frequencies
        summed_energy = total_energy(data_downsampled)
        features.append(summed_energy)
        index.append('summed_energy')
        
        logging.debug('general features generated')
        
        #Per channel features
        #TODO work on all channels in parrallel as one matrix, vectorise all of it
        for c in CHANNELS:
            
            logging.debug(f'starting feature generation file:{counter}, channel:{c}')
            
            # Create necessary functions
            data_channel = data_downsampled[c]
            diff1 = np.diff(data_channel, n=1)
            diff2 = np.diff(data_channel, n=2)

            ## Simple features
            std = np.std(data_channel)
            features.append(std)
            index.append(f'std_{c}')

            skew = scipy.stats.skew(data_channel)
            features.append(skew)
            index.append(f'skew_{c}')

            kurt = scipy.stats.kurtosis(data_channel)
            features.append(kurt)
            index.append(f'kurt_{c}')

            zeros = zero_crossings(data_channel)
            features.append(zeros)
            index.append(f'zeros_{c}')
            
            logging.debug('simple features generated')

            #RMS = np.sqrt(data_channel**2.mean())

            ## Differential features
            mobility = np.std(diff1)/np.std(data_channel)
            features.append(mobility)
            index.append(f'mobility_{c}')

            complexity = (np.std(diff2) * np.std(diff2)) / np.std(diff1)
            features.append(complexity)
            index.append(f'complexity_{c}')

            zeros_diff1 = zero_crossings(diff1)
            features.append(zeros_diff1)
            index.append(f'zeros_diff1_{c}')

            zeros_diff2 = zero_crossings(diff2)
            features.append(zeros_diff2)
            index.append(f'zeros_diff2_{c}')

            std_diff1 = np.std(diff1)
            features.append(std_diff1)
            index.append(f'std_diff1_{c}')

            std_diff2 = np.std(diff2)
            features.append(std_diff2)
            index.append(f'std_diff2_{c}')
            
            logging.debug('differential features generated')

            # Frequency features

            ## Use welch method to approcimate energies per frequency subdivision
            # From litterature the lowest frequencies of interest in a EEG is 0.5Hz so we need to keep our resolution at 0.25Hz hence a 4 second window cf.Nyquist
            window = (SAMPLING_FREQUENCY / DOWNSAMPLING_RATIO) * 4
            f, psd = scipy.signal.welch(data_channel, fs=80, nperseg=window)
            psd = np.nan_to_num(psd)

            ## Total summed energy
            channel_energy = band_energy(f, psd, 0.1, 40)
            features.append(channel_energy)
            index.append(f'channel_{c}_energy')

            ## Normalised summed energy
            normalised_energy = channel_energy / summed_energy
            features.append(normalised_energy)
            index.append(f'normalised_energy_{c}')

            ## Peak frequency
            peak_frequency = f[np.argmax(psd)]
            features.append(peak_frequency)
            index.append(f'peak_frequency_{c}')

            ## Normalised_summed energy per band
            for k in range(len(BANDS)-1):
                energy = band_energy(f, psd, BANDS[k], BANDS[k+1])
                normalised_band_energy = energy / channel_energy
                features.append(normalised_band_energy)
                index.append(f'normalised_band_energy_{c}_{k}')
                
            logging.debug('lowres frequency features generated')

            ## Spectral entropy
            psd_norm = np.divide(psd, psd.sum())
            spectral_entropy = -np.multiply(psd_norm, np.log2(psd_norm)).sum()
            #spectral_entropy /= np.log2(psd_norm.size) #uncomment to normalise entropy
            features.append(spectral_entropy)
            index.append(f'spectral_entropy_{c}')

            ## SVD entropy
            entropy = svd_entropy(data_channel, order=3,
                                  delay=1, normalize=False)
            features.append(entropy)
            index.append(f'svd_entropy_{c}')
            
            logging.debug('entropy features generated')

            # Highres features : energy per frequency band in 1min segements
            highres_channel_energy = highres_total_energy(data)
            features.append(highres_channel_energy)
            index.append(f'total_channel_energy_{c}')
            
            f, psd = scipy.signal.welch(data, fs=400, nperseg=SAMPLING_FREQUENCY*4)
            psd = np.nan_to_num(psd)
            full_psd_sum = psd.sum()/10  # for normalisation purposed
            # TODO add band energy divided by full_psd_sum as feature
            
            # j allows us to iterate over 1min segments with 30s overlap
            for j in range(19):
                data_segment = data_channel[j*30*SAMPLING_FREQUENCY: (j+1)*30*SAMPLING_FREQUENCY]
                f_segment, psd_segment = scipy.signal.welch(
                    data_segment, fs=SAMPLING_FREQUENCY, nperseg=SAMPLING_FREQUENCY*4)
                psd_segment = np.nan_to_num(psd_segment)

                for k in range(len(HIGHRES_BANDS)-1):
                    window_band_energy = psd_segment[(f_segment > HIGHRES_BANDS[k]) & (
                        f_segment < HIGHRES_BANDS[k+1])].sum()
                    features.append(window_band_energy)
                    index.append(f'windowed_band_energy_{c}_{k}_{j}')
                    normalised_window_band_energy = window_band_energy/full_psd_sum
                    features.append(normalised_window_band_energy)
                    index.append(f'normalised_window_band_energy_{c}_{k}_{j}')
                    #TODO check if normalised feature is redundant
                    
            logging.debug('highres frequency features generated')
            
            #logging.debug(f'finished feature generation file:{counter}, channel:{c}')

        # Save generated features to X_train
        if counter == 0:
            #X_train = np.zeros((1, len(features)))
            X = np.array(features)
            logging.debug('X created and updated')   
        else:
            X = np.vstack((X, np.array(features)))
            logging.debug(f'features for file:{counter} added to array ; X.shape = {X.shape}')
        
        # Save label to y_train
        if is_training_data:
            label = filename[-5 : -4] #last char excluding .mat
            if counter == 0:
                y = np.array(label).astype('int')
                logging.debug(' y created and updated')
            else :
                y = np.vstack((y, np.array(label))).astype('int')
                logging.debug(f'label stacked onto y, y.shape = {y.shape}')
        
        counter += 1

        #TODO add logging

    # Save X_train to file before moving on to next patient data
    X = np.nan_to_num(X)
    y = np.nan_to_num(y)
       
    if is_training_data:
        if save_to_disk:
            np.save(join(FEATURE_SAVE_PATH, f'neurovista_X_train_pat{patient_number}.npy'), X)
            np.save(join(FEATURE_SAVE_PATH, f'neurovista_y_train_pat{patient_number}.npy'), y)
            logging.info('features and labels saved to disk')
        return (X, y, index)
    
    if is_training_data == False:
        if save_to_disk:
            np.save(join(FEATURE_SAVE_PATH, f'neurovista_X_test_pat{patient_number}.npy'), X)
            logging.info('features saved to disk')
        return (X, index)

#### Loop over our patients and call feature generation function

In [None]:
data_dict = {}

for p in [1, 2, 3]:  #[1, 2, 3]:  # iterating over patients 1, 2, 3

    logging.info(f'Entering loop to generate train features for patient {p}')
    patient_path = join(DATA_PATH, TRAIN_PATHS[p-1])
    
    data_dict[f'X_train_pat{p}'], data_dict[f'y_train_pat{p}'], index = generate_features(patient_number = p,
                                                                                   data_path = patient_path,
                                                                                   is_training_data = True,
                                                                                   save_to_disk = True)

## EDA of the generated training dataset

### Create training set dataframe for analysis

#### Load data

In [None]:
X_pat = {}
y_pat = {}

for p in [1,2,3]:
    X_pat[f'pat{p}'] = np.load(f'neurovista_X_train_pat{p}.npy').astype('float32')
    X_pat[f'pat{p}'] = np.nan_to_num(X_pat[f'pat{p}'])
    
    y_pat[f'pat{p}'] = np.load(f'neurovista_y_train_pat{p}.npy').astype('float32')
    y_pat[f'pat{p}'] = np.nan_to_num(y_pat[f'pat{p}'])
    
logging.debug('X and y loaded into dictionary')

X = np.vstack(tuple(X_pat.values()))
y = np.vstack(tuple(y_pat.values()))

In [None]:
# Create mini training data for test
X, y, index = generate_features(patient_number = 1, 
                                data_path = join(DATA_PATH, TRAIN_PATHS[0]),
                                is_training_data = True,
                                save_to_disk = False)

 40%|█████████████████████████████████▏                                                 | 8/20 [00:41<01:07,  5.63s/it]

#### Generate index

In [31]:
_, _, index = generate_features(patient_number = 1,data_path = join(DATA_PATH, TRAIN_PATHS[0]),is_training_data = True,save_to_disk = False)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.89s/it]


#### Convert to dataframe for high level manipulation

In [34]:
X_df = pd.DataFrame(data = X, columns = index) 
X_df.head()

Unnamed: 0,Patient,filenumber,correlation_0-1,correlation_0-2,correlation_0-3,correlation_0-4,correlation_0-5,correlation_0-6,correlation_0-7,correlation_0-8,...,windowed_band_energy_15_2_18,normalised_window_band_energy_15_2_18,windowed_band_energy_15_3_18,normalised_window_band_energy_15_3_18,windowed_band_energy_15_4_18,normalised_window_band_energy_15_4_18,windowed_band_energy_15_5_18,normalised_window_band_energy_15_5_18,windowed_band_energy_15_6_18,normalised_window_band_energy_15_6_18
0,1.0,1009.0,0.428712,0.239288,-0.029995,-0.049008,-0.108267,-0.221111,-0.259864,0.220531,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,100.0,0.362774,0.176893,0.057885,0.033084,0.071525,-0.327772,0.006972,0.304759,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,100.0,-0.014333,-0.340122,-0.036219,0.138175,-0.008025,-0.56697,0.276999,0.141704,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,1010.0,0.615629,0.34125,0.073276,0.178729,0.130639,-0.31314,-0.244894,0.44443,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,1011.0,0.587197,0.352734,0.103082,0.179288,0.129077,-0.313195,-0.250147,0.378431,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Basic analysis to validate feature generation

In [35]:
X_df.describe()

Unnamed: 0,Patient,filenumber,correlation_0-1,correlation_0-2,correlation_0-3,correlation_0-4,correlation_0-5,correlation_0-6,correlation_0-7,correlation_0-8,...,windowed_band_energy_15_2_18,normalised_window_band_energy_15_2_18,windowed_band_energy_15_3_18,normalised_window_band_energy_15_3_18,windowed_band_energy_15_4_18,normalised_window_band_energy_15_4_18,windowed_band_energy_15_5_18,normalised_window_band_energy_15_5_18,windowed_band_energy_15_6_18,normalised_window_band_energy_15_6_18
count,5047.0,5047.0,5047.0,5047.0,5047.0,5047.0,5047.0,5047.0,5047.0,5047.0,...,5047.0,5047.0,5047.0,5047.0,5047.0,5047.0,5047.0,5047.0,5047.0,5047.0
mean,2.26491,954.449158,0.238063,0.043888,-0.043385,-0.096788,-0.11627,-0.165793,-0.134419,0.01649,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
std,0.722606,665.897156,0.198666,0.139083,0.076619,0.110189,0.110527,0.106412,0.093296,0.196551,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
min,1.0,1.0,-0.293432,-0.613305,-0.480398,-0.424388,-0.508449,-0.870395,-0.624882,-0.332272,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,331.5,0.064269,-0.057817,-0.093495,-0.157638,-0.2001,-0.207418,-0.192806,-0.178084,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2.0,906.0,0.182777,0.020732,-0.050954,-0.118022,-0.149075,-0.171789,-0.149634,0.063644,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,3.0,1516.5,0.427678,0.153395,-0.001668,-0.050567,-0.042323,-0.106461,-0.083293,0.162918,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,3.0,2396.0,0.833619,0.765953,0.419973,0.642662,0.512602,0.199429,0.686623,0.649834,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [38]:
unique_count = X_df.nunique(axis=0, dropna=True)
unique_count.value_counts()

1       3360
4983     775
4982     334
4981     101
4980      21
        ... 
3260       1
4971       1
4955       1
4935       1
66         1
Length: 92, dtype: int64

On first run some features seemed off and i corrected the feature generation code

In [39]:
unique_count[unique_count == 1] # Singles out the columns that have only 1 value meaning feature generation failed

windowed_band_energy_0_0_4               1
normalised_window_band_energy_0_0_4      1
windowed_band_energy_0_1_4               1
normalised_window_band_energy_0_1_4      1
windowed_band_energy_0_2_4               1
                                        ..
normalised_window_band_energy_15_4_18    1
windowed_band_energy_15_5_18             1
normalised_window_band_energy_15_5_18    1
windowed_band_energy_15_6_18             1
normalised_window_band_energy_15_6_18    1
Length: 3360, dtype: int64

# II - Machine learning model

## Setup

### Imports

In [8]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score, f1_score
import sklearn.metrics
import logging
from sklearn.metrics import make_scorer
from evolutionary_search import EvolutionaryAlgorithmSearchCV #
from sklearn.model_selection import StratifiedKFold

# Import many classifiers to choose the one that performs best
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import RidgeClassifier
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
import lightgbm

#Ignore future warning for legibility
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

In [9]:
TEST_SET_SIZE = 0.33

### Metrics functions

In [10]:
def compute_metrics(clf, X_test, y_test):
    """
    return a dict containing auc, f1score, accuracy, balanced_accuracy, recall
    """
    y_pred = clf.predict(X_test)
    y_pred_proba = clf.predict_proba(X_test)[:, [1]]
    
    auc = roc_auc_score(y_test, y_pred_proba)
    f1score = f1_score(y_test, y_pred)
    accuracy = sklearn.metrics.average_precision_score(y_test, y_pred)
    balanced_accuracy = sklearn.metrics.balanced_accuracy_score(y_test, y_pred, adjusted = True)
    recall = sklearn.metrics.recall_score(y_test, y_pred)
    
    metrics_dict = {}
    metrics_dict['auc'] = auc
    metrics_dict['f1score'] = f1score
    metrics_dict['accuracy'] = accuracy
    metrics_dict['balanced_accuracy'] = balanced_accuracy
    metrics_dict['recall'] = recall
    
    return metrics_dict

In [11]:
def auc_patient_cv(clf, X_pat, y_pat):
    """
    Computes AUC while fitting to 2 of our 3 patients and evaluating on the patient it has never seen
    ______
    input : classifier, X_pat dictionary, y_pat dictionary
    output : dictionary containing the 3 AUC metrics (one for each fit)
    """
    
    auc_dict = {}
    
    # A - train = 1 & 2 ; test = 3
    X_train = np.vstack((X_pat['pat1'], X_pat['pat2']))
    y_train = np.vstack((y_pat['pat1'], y_pat['pat2']))
    X_test = X_pat['pat3']
    y_test = y_pat['pat3']
    
    clf.fit(X_train, y_train)
    y_pred_proba = clf.predict_proba(X_test)[:, [1]]
    auc_dict['train 1_2'] = roc_auc_score(y_test, y_pred_proba)
    
    # B - train = 1 & 3 ; test = 2
    X_train = np.vstack((X_pat['pat1'], X_pat['pat3']))
    y_train = np.vstack((y_pat['pat1'], y_pat['pat3']))
    X_test = X_pat['pat2']
    y_test = y_pat['pat2']
    
    clf.fit(X_train, y_train)
    y_pred_proba = clf.predict_proba(X_test)[:, [1]]
    auc_dict['train 1_3'] = roc_auc_score(y_test, y_pred_proba)    

    # C - train = 2 & 3 ; test = 1
    X_train = np.vstack((X_pat['pat2'], X_pat['pat3']))
    y_train = np.vstack((y_pat['pat2'], y_pat['pat3']))
    X_test = X_pat['pat1']
    y_test = y_pat['pat1']
    
    clf.fit(X_train, y_train)
    y_pred_proba = clf.predict_proba(X_test)[:, [1]]
    auc_dict['train 2_3'] = roc_auc_score(y_test, y_pred_proba)     
    
    return auc_dict

### Create train and test datasets

#### Load data

In [None]:
X_pat = {}
y_pat = {}

for p in [1,2,3]:
    X_pat[f'pat{p}'] = np.load(f'neurovista_X_train_pat{p}.npy').astype('float32')
    X_pat[f'pat{p}'] = np.nan_to_num(X_pat[f'pat{p}'])
    
    y_pat[f'pat{p}'] = np.load(f'neurovista_y_train_pat{p}.npy').astype('float32')
    y_pat[f'pat{p}'] = np.nan_to_num(y_pat[f'pat{p}'])
    
logging.debug('X and y loaded into dictionary')

#### Simple random test-train split

In [12]:
X = np.vstack(tuple(X_pat.values()))
y = np.vstack(tuple(y_pat.values()))

X_train_rand, X_test_rand, y_train_rand, y_test_rand = train_test_split(X, y, test_size=0.33, random_state=42)

#### Splitting per patient 
to guarantee test and train have equal proportions of data from each patient

In [13]:
X_train_dict = {}
X_test_dict = {}
y_train_dict = {}
y_test_dict = {}

for p in [1, 2, 3]:
    X_train_dict[f'pat{p}'], X_test_dict[f'pat{p}'], y_train_dict[f'pat{p}'], y_test_dict[f'pat{p}'] = train_test_split(X_pat[f'pat{p}'], y_pat[f'pat{p}'], test_size=0.4, random_state=42)

X_train_pat = np.vstack(tuple(X_train_dict.values()))
X_test_pat = np.vstack(tuple(X_test_dict.values()))
y_train_pat = np.vstack(tuple(y_train_dict.values()))
y_test_pat = np.vstack(tuple(y_test_dict.values()))

## Machine learning model

### Test different classifiers without hyperparameter tuning
To find which kind of classifier perform best for this task, i will fit them to the data and evaluate them whithout tuning hyperparameters.

In [14]:
classifiers = {'ExtraTrees' : ExtraTreesClassifier(n_jobs = -1), 
               'Kneighbours' : KNeighborsClassifier(3, n_jobs = -1),
#               'Gaussian process' : GaussianProcessClassifier(1.0 * RBF(1.0), n_jobs = -1),
#               'Decision tree' : DecisionTreeClassifier(),
#               'Random forest' : RandomForestClassifier(n_estimators=500, n_jobs = -1),
#               'Neural network' : MLPClassifier(alpha=1, max_iter=1000),
#               'Adaboost' : AdaBoostClassifier(),
#               'Gaussain NB' : GaussianNB(),
#               'Quadratic discriminant' : QuadraticDiscriminantAnalysis(),
#               'Hist Gradient Boosting Classifier (LGBM-like)' : HistGradientBoostingClassifier(),
               'LGBM' : lightgbm.LGBMClassifier(n_estimators = 100, objective = 'binary')}

performance_dict = {}
parformance_df = pd.DataFrame()

for clf_name, clf in classifiers.items():
    # random split
    clf.fit(X_train_rand, y_train_rand)
    metrics = compute_metrics(clf, X_test_rand, y_test_rand)
    print(f'{clf_name}')
    print(f"random split : {metrics}")
    metrics['split'] = 'random'
    metrics['classifier'] = f'{clf_name}'
    performance_dict[f'{clf_name}_rand'] = metrics
    
    # carefull per patient split
    clf.fit(X_train_pat, y_train_pat)
    metrics = compute_metrics(clf, X_test_pat, y_test_pat)
    print(f"patient split : {metrics}")
    metrics['split'] = 'per patient'
    metrics['classifier'] = f'{clf_name}'
    performance_dict[f'{clf_name}_pat'] = metrics
    
    # Train on 2 patients test on the other
#    auc_dict = auc_patient_cv(clf, X_pat, y_pat)
#    print(f'{clf_name} trained on 2 out of 3 : {auc_dict}\n')
#    performance_dict[f'{clf_name}_2of3'] = auc_dict

ExtraTrees
random split : {'auc': 0.9444435028248587, 'f1score': 0.5056818181818182, 'accuracy': 0.407266106442577, 'balanced_accuracy': 0.3468192090395479, 'recall': 0.356}
patient split : {'auc': 0.9264022682243631, 'f1score': 0.4911392405063291, 'accuracy': 0.3975681623760937, 'balanced_accuracy': 0.3316353801204843, 'recall': 0.33797909407665505}
Kneighbours
random split : {'auc': 0.9097401129943502, 'f1score': 0.6901408450704225, 'accuracy': 0.5529383662555931, 'balanced_accuracy': 0.5675197740112994, 'recall': 0.588}
patient split : {'auc': 0.8902077330214725, 'f1score': 0.6388308977035491, 'accuracy': 0.49111870546543523, 'balanced_accuracy': 0.5106096958151984, 'recall': 0.5331010452961672}
LGBM
random split : {'auc': 0.9982937853107344, 'f1score': 0.9350104821802936, 'accuracy': 0.8924884209190285, 'balanced_accuracy': 0.8891751412429378, 'recall': 0.892}
patient split : {'auc': 0.9966925077060953, 'f1score': 0.9239332096474953, 'accuracy': 0.876069870146439, 'balanced_accurac

In [15]:
print(performance_dict)

{'ExtraTrees_rand': {'auc': 0.9444435028248587, 'f1score': 0.5056818181818182, 'accuracy': 0.407266106442577, 'balanced_accuracy': 0.3468192090395479, 'recall': 0.356, 'split': 'random', 'classifier': 'ExtraTrees'}, 'ExtraTrees_pat': {'auc': 0.9264022682243631, 'f1score': 0.4911392405063291, 'accuracy': 0.3975681623760937, 'balanced_accuracy': 0.3316353801204843, 'recall': 0.33797909407665505, 'split': 'per patient', 'classifier': 'ExtraTrees'}, 'Kneighbours_rand': {'auc': 0.9097401129943502, 'f1score': 0.6901408450704225, 'accuracy': 0.5529383662555931, 'balanced_accuracy': 0.5675197740112994, 'recall': 0.588, 'split': 'random', 'classifier': 'Kneighbours'}, 'Kneighbours_pat': {'auc': 0.8902077330214725, 'f1score': 0.6388308977035491, 'accuracy': 0.49111870546543523, 'balanced_accuracy': 0.5106096958151984, 'recall': 0.5331010452961672, 'split': 'per patient', 'classifier': 'Kneighbours'}, 'LGBM_rand': {'auc': 0.9982937853107344, 'f1score': 0.9350104821802936, 'accuracy': 0.8924884209

In [20]:
performance_df = pd.DataFrame.from_dict(performance_dict, orient = 'index')
performance_df.head()

Unnamed: 0,auc,f1score,accuracy,balanced_accuracy,recall,split,classifier
ExtraTrees_rand,0.944444,0.505682,0.407266,0.346819,0.356,random,ExtraTrees
ExtraTrees_pat,0.926402,0.491139,0.397568,0.331635,0.337979,per patient,ExtraTrees
Kneighbours_rand,0.90974,0.690141,0.552938,0.56752,0.588,random,Kneighbours
Kneighbours_pat,0.890208,0.638831,0.491119,0.51061,0.533101,per patient,Kneighbours
LGBM_rand,0.998294,0.93501,0.892488,0.889175,0.892,random,LGBM


### Hyperparameter tuning
We are going to use an evolutionary algorithm to tune hyperparameters for the best performing classifiers

In [7]:
from sklearn.metrics import make_scorer
from evolutionary_search import EvolutionaryAlgorithmSearchCV
from sklearn.model_selection import StratifiedKFold

#### Extra trees classifier

In [None]:
# Create model 
extra_trees_clf = ExtraTreesClassifier(n_jobs = -1)

# Optimise hyperparameters

scorer_object = make_scorer(roc_auc_score, greater_is_better = True, needs_proba = True)

distributions = dict(criterion = ['gini', 'entropy'],
                     n_estimators = range(200, 8000), 
                     max_depth = [50, 100, 150, 200, 1000, None],
                     min_samples_split = range(20, 100),
                     min_samples_leaf = range(10, 200),
                     max_features = ['auto', 'sqrt', 'log2'],
                     max_leaf_nodes = [10, 100, 500, 1000, None],
#                   min_impurity_decrease = [0, 0.01, 0.25, 1, 10],
                     bootstrap = [False, True],
                     class_weight = ['balanced', 'balanced_subsample', None],
                     ccp_alpha = np.linspace(0, 0.08, 20) ,
                     max_samples = np.linspace(0.001, 0.5, 20).astype('float'))

# pool = Pool(4)

clf = EvolutionaryAlgorithmSearchCV(extra_trees_clf, 
                                    distributions, 
                                    population_size = 100,
                                    gene_mutation_prob = 0.10,
                                    gene_crossover_prob = 0.5,
                                    tournament_size = 3,
                                    generations_number = 6,
                                    cv = StratifiedKFold(n_splits = 5),
                                    verbose = 1,
                                    scoring = scorer_object,
                                    n_jobs = 1)
                                    #pmap = pool.map

search = clf.fit(X_train, y_train) # TODO test removing search

#### LightGBM
For better accuracy :
- Use large max_bin (may be slower)
- Use small learning_rate with large num_iterations
- Use large num_leaves (may cause over-fitting)

To avoid overfitting an LGBM classifier :
- Use small max_bin
- Use small num_leaves
- Use min_data_in_leaf and min_sum_hessian_in_leaf
- Use bagging by set bagging_fraction and bagging_freq
- Use feature sub-sampling by set feature_fraction
- Try lambda_l1, lambda_l2 and min_gain_to_split for regularization
- Try max_depth to avoid growing deep tree
- Try extra_trees
- Try increasing path_smooth

In [10]:
# Instantiate lgbm with basic hyperparameters that speed up training so we can cover a larger space of hyperparameters
lgbm_clf = lightgbm.LGBMClassifier(objective = 'binary', num_threads = 3, 
                                   feature_pre_filter = True, 
                                   metric = 'auc', 
                                   force_col_wise = True)

scorer_object = make_scorer(roc_auc_score, 
                            greater_is_better = True, 
                            needs_proba = True)

distributions = dict(num_leaves = [100, 500, 2500, 5000],
                    min_data_in_leaf = [5, 10, 30, 100, 200],
                    bagging_fraction = [0.9, 0.8, 0.7, 0.6, 0.5],
                    bagging_freq = [5, 10, 20, 50],
                    feature_fraction = [1, 0.9, 0.8, 0.7, 0.6, 0.5],
                    max_depth = [10, 50, 100, 200, 500],
                    num_iterations = [50, 100, 250, 500],
                    is_unbalance = [True, False],
#                    early_stopping_rounds = [5, 10, 15],
                    extra_trees = [True, False],
                    lambda_l1 = [0.0, 0.1, 0.5, 1, 10],
#                    lambda_l2 = [0.0, 0.1, 0.5, 1, 10])
                    )

clf = EvolutionaryAlgorithmSearchCV(lgbm_clf, 
                                    distributions, 
                                    population_size = 50,
                                    gene_mutation_prob = 0.10,
                                    gene_crossover_prob = 0.5,
                                    tournament_size = 3,
                                    generations_number = 3,
#                                    cv = StratifiedKFold(n_splits = 5),
                                    verbose = 1,
                                    scoring = scorer_object,
                                    n_jobs = 1)

clf.fit(X_train_pat, y_train_pat)

Types [1, 1, 2, 1, 1, 1, 1, 1, 1, 2] and maxint [3, 4, 4, 3, 5, 4, 3, 1, 1, 4] detected
--- Evolve in 960000 possible combinations ---






























gen	nevals	avg    	min     	max     	std      
0  	50    	0.95188	0.853129	0.971181	0.0194449
















1  	32    	0.962643	0.948028	0.971181	0.00501631
















2  	31    	0.966412	0.956633	0.974058	0.00355384












3  	26    	0.969282	0.964557	0.974058	0.00255279
Best individual is: {'num_leaves': 100, 'min_data_in_leaf': 30, 'bagging_fraction': 0.8, 'bagging_freq': 5, 'feature_fraction': 0.6, 'max_depth': 500, 'num_iterations': 500, 'is_unbalance': True, 'extra_trees': False, 'lambda_l1': 0.1}
with fitness: 0.9740581171539514
