# Detecting Epileptic Seizures through EEG Data: Part 1

In this project, we'll be trying to figure out whether a person is experiencing a seizure from their EEG reading.

## Goals for today:
1. Understand what epilepsy is
2. Understand what an EEG is
3. Visualize an EEG as a time series
4. Try out naive ML models (KNNs, logistic regression, etc.) for detecting epilepsy
5. Try out a simple neural net

In [None]:
#@title ##Import libraries and create helper functions!

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn import model_selection
from sklearn.model_selection import train_test_split

import tensorflow as tf

import keras
from keras.models import Sequential
from keras.layers import Activation, MaxPooling2D, Dropout, Flatten, Reshape, Dense, Conv2D, GlobalAveragePooling2D
from keras.wrappers.scikit_learn import KerasClassifier
import keras.optimizers as optimizers
from keras.callbacks import ModelCheckpoint
monitor = ModelCheckpoint('./model.hdf5', 
                          monitor='val_accuracy', 
                          verbose=0, 
                          save_best_only=True, 
                          save_weights_only=False, 
                          mode='auto', 
                          save_freq='epoch')

import gdown

## Utils function to combine 23 chunks from the same patient into one big chunk
# @author Siyi Tang
def prepare_data(eeg_df):
  file_names = eeg_df['Unnamed: 0'].tolist()

  subject_ids = []
  chunk_ids = []
  for fn in file_names:
    subject_ids.append(fn.split('.')[-1])
    chunk_ids.append(fn.split('.')[0])
  subject_ids = list(set(subject_ids))
  assert len(subject_ids) == 500

  sub2ind = {}
  for ind, sub in enumerate(subject_ids):
    sub2ind[sub] = ind

  eeg_combined = np.zeros((500, int(178*23)))
  labels_combined = np.zeros(500)
  labels_chunks = np.zeros((500, 23))
  labels_dict = {}
  for i in range(len(eeg_df)):
    fn = eeg_df.iloc[i]['Unnamed: 0']
    subject_id = fn.split('.')[-1]
    subject_ind = sub2ind[subject_id]

    chunk_id = int(fn.split('.')[0].split('X')[-1])
    start_idx = (chunk_id - 1) * 178
    end_idx = start_idx + 178
    eeg_combined[subject_ind, start_idx:end_idx] = eeg_df.iloc[i].values[1:-1]

    if subject_id not in labels_dict:
      labels_dict[subject_id] = []
    labels_dict[subject_id].append(eeg_df.iloc[i].values[-1])

  for sub_id, labels in labels_dict.items():
    sub_ind = sub2ind[sub_id]
    is_seizure = int(np.any(np.array(labels) == 1))
    labels_combined[sub_ind] = is_seizure
    labels = np.array(labels)
    labels = np.where(labels>1, 0, labels)
    labels_chunks[sub_ind,:] = labels

  return eeg_combined, labels_combined, labels_chunks


def plot_acc(history, ax = None, xlabel = 'Epoch #'):
  # Plot training and validation accuracy per epoch and mark the epoch with the best validation accuracy.
  history = history.history
  history.update({'epoch':list(range(len(history['val_accuracy'])))})
  history = pd.DataFrame.from_dict(history)

  best_epoch = history.sort_values(by = 'val_accuracy', ascending = False).iloc[0]['epoch']

  if ax is None:
    f, ax = plt.subplots(1,1)
  sns.lineplot(x = 'epoch', y = 'val_accuracy', data = history, label = 'Validation', ax = ax)
  sns.lineplot(x = 'epoch', y = 'accuracy', data = history, label = 'Training', ax = ax)
  ax.axhline(0.5, linestyle = '--',color='red', label = 'Chance')
  ax.axvline(x = best_epoch, linestyle = '--', color = 'green', label = 'Best Epoch')  
  ax.legend(loc = 1)    
  ax.set_ylim([0.4, 1])

  ax.set_xlabel(xlabel)
  ax.set_ylabel('Accuracy (Fraction)')
  
  plt.show()


In [None]:
#@title ## Download our data set!
data_path = 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/Deep%20Dives/AI%20%2B%20Healthcare/Projects%20(Session%206%2B)/Seizure%20Prediction%20/data.csv'
uci_epilepsy = './uci_epilepsy'
gdown.download(data_path, uci_epilepsy, False)

## What is Epilepsy?
Epilepsy is a condition in which someone experiences chronic, uncontrolled seizures.

What's a seizure? Read through the **Overview**, **What is a Seizure?**, and **What Causes Seizures** sections of [this article](https://mayfieldclinic.com/pe-seizure.htm).

Afterwards, read through the **Overview** and **What is Epilepsy?** sections of [this article](https://mayfieldclinic.com/pe-epilepsy.htm). 

If you have more time, you can continue reading both of these articles to get a better understanding of seizures and epilepsy!

## What is an EEG (an electroencephalogram)?



First, watch this quick [video](https://www.youtube.com/watch?v=tZcKT4l_JZk) on EEG.

If you have more time, start reading through [this](https://www.ncbi.nlm.nih.gov/books/NBK390346/) (slightly more technical) introduction to EEG. The second and fourth paragraphs especially provide some information that will be relevant to us!

Discuss with your group:

* What makes EEG an appealing diagnostic tool?
* What are some things that we need to be wary of when we use EEG?


## Our goal: Use EEGs to detect seizures

In order for physicians to treat patients, it's important for them to know if their patients are experiencing epileptic seizures! 

Can we find seizure-specific patterns in EEGs?

# First, let's take a look at our data set!

We'll be using the [UCI Epileptic Seizure Recognition Data Set](https://archive.ics.uci.edu/ml/datasets/Epileptic+Seizure+Recognition#)\*. Click on the link and read the section titled **Attribute Information**. Discuss with your group. What is the input to our machine learning pipeline? What will be the output?

Now go ahead and run the cell below to take a look at your data. Check in with your group - do you all understand what each row and each column of the DataFrame represents? What are the units associated with the values?

<sub>\* This data set is a modified version of the data set generated by the authors of [this paper](https://pubmed.ncbi.nlm.nih.gov/11736210/).</sub>

In [None]:
EEG = pd.read_csv(uci_epilepsy)
EEG.head()


If you read over the UCI documentation, you'll know that each patient provided 23 of these rows, because the UCI researchers took every EEG sample that they had and split it into 23 chunks. For reasons we'll get into later, we would prefer to have a single EEG sample per patient.

We'll use the helper function *prepare_data* (defined in the Import libraries and create helper functions cell at the beginning of this notebook!) to do just that. Run the code below!

In [None]:
eeg, labels, __ = prepare_data(EEG)
print("eeg: ")
print(eeg)
print("labels: ")
print(labels)
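
To make the stitching step more concrete, here is a small sketch of the idea behind `prepare_data`. It assumes row names follow an `X<chunk>.V1.<subject>` pattern (for example `X21.V1.791`); the toy names, the chunk length of 3, and the fake values are all made up for illustration, and the subjects are sorted here only so that the output is predictable.

```python
import numpy as np

# Hypothetical row names in the assumed "X<chunk>.V1.<subject>" format.
row_names = ["X2.V1.791", "X1.V1.791", "X1.V1.924", "X2.V1.924"]
chunk_len = 3  # the real data uses 178 samples per chunk
n_chunks = 2   # the real data uses 23 chunks per patient

# Fake chunk values: row i is filled with the number i + 1.
rows = np.array([[i + 1] * chunk_len for i in range(len(row_names))], dtype=float)

# Give every subject its own row index in the combined array.
subjects = sorted({name.split(".")[-1] for name in row_names})
sub2ind = {sub: ind for ind, sub in enumerate(subjects)}

combined = np.zeros((len(subjects), chunk_len * n_chunks))
for name, values in zip(row_names, rows):
    subject = name.split(".")[-1]                       # e.g. "791"
    chunk_id = int(name.split(".")[0].split("X")[-1])   # e.g. 2
    start = (chunk_id - 1) * chunk_len                  # where this chunk belongs in time
    combined[sub2ind[subject], start:start + chunk_len] = values

print(combined)
```

Notice that the chunk number, not the order of the rows in the file, decides where each chunk lands: subject `791`'s chunk 1 ends up before its chunk 2 even though chunk 2 was listed first.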

#### Discussion:

eeg and labels are NumPy arrays, not DataFrames, so they don't format quite as prettily. But this is a good format to have them in for our future work! Before we move on, check your understanding with your classmates:

What does each row of the eeg array represent? What does each column represent? What are the units of the measurements?

What does each entry in the labels array represent?

In [None]:
#@title Solution
# 1 row of eeg = 1 patient's eeg
# 1 column of eeg = the measurement at one particular time point, across all patients
# each entry of labels = the classification of the eeg: 1 for seizure, 0 for non-seizure

## Let's visualize one of these EEG samples

First, use [slicing](https://numpy.org/doc/stable/reference/arrays.indexing.html) to extract just one of the EEG samples.

In [None]:
print("EEG: ")
eeg1 = None ### FILL ME IN ###
print(eeg1)


In [None]:
#@title Solution

print("EEG: ")
eeg1 = eeg[0, :]
print(eeg1)


#### Class Discussion:

What are the units of each of the numbers in the EEG array?

#### It's kind of hard to understand what's going on when we're just looking at a long list of numbers!

Let's try plotting the EEG as a **time series** waveform. A time series is a sequence of data points listed in chronological order. Alternatively, we can say that our EEG data exists in the **time domain**. All of our EEG data has been uploaded in time series form!

(In our second notebook, we will discuss another way of representing waveforms -- a frequency domain representation.)

In [None]:
plt.figure(figsize=(18,6))
plt.plot(eeg1)


### What are the units of the x axis of this plot? What are the units of the y axis?

(Hint: You'll want to review the information given to you here: [UCI Epileptic Seizure Recognition Data Set](https://archive.ics.uci.edu/ml/datasets/Epileptic+Seizure+Recognition#))




In [None]:
x_units = '' #@param {type:"string"}
y_units = '' #@param {type:"string"}

print("The y axis units are microvolts, and each step along the x axis is one sample, about 1/178 of a second.")


**We want to change the units of the x axis to seconds**

To accomplish that, we can use a numpy function called [linspace](https://numpy.org/doc/stable/reference/generated/numpy.linspace.html). Click on the link to look at linspace's documentation! Use the documentation to fill in the code below.

In [None]:
x = np.linspace(### FILL ME IN ###)

plt.figure(figsize=(18,6))
plt.plot(x,eeg1)
plt.xlabel('Seconds')
plt.ylabel('Microvolts')
plt.title('EEG time series')

In [None]:
#@title Solution
# Each EEG = 23.6 seconds
# 4094 data points per EEG

x = np.linspace(0,23.6,4094) 

# EEGs tend to be measured in microvolts
plt.figure(figsize=(18,6))
plt.plot(x,eeg1)
plt.xlabel('Seconds')
plt.ylabel('Microvolts')
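
As a quick sanity check on the time axis, the sketch below looks at what `np.linspace` actually produces. It reuses the 23.6 seconds and 4094 points from the solution above; the variable names `duration_s`, `n_samples`, and `step` are just illustrative.

```python
import numpy as np

duration_s = 23.6   # length of one stitched EEG, taken from the solution above
n_samples = 4094    # 23 chunks x 178 samples

x = np.linspace(0, duration_s, n_samples)

step = x[1] - x[0]        # seconds between neighbouring samples
print(x[0], x[-1])        # first and last time stamps
print(step, 1 / step)     # spacing and the implied number of samples per second
```

`linspace` includes both endpoints, so the 4094 points are separated by 4093 equal gaps. The implied rate comes out at a little over 173 samples per second.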


Now that we understand what our data looks like, we can move on to creating classification models!

# Build and Evaluate Naive Models

We'll start by using some simple models. When we do any type of machine learning, we follow a standard machine learning pipeline. To remind yourself of our workflow, correctly order the pipeline steps below!

In [None]:
#@title Machine Learning Pipeline

one = 'Collect input and output data' #@param ["Collect input and output data", "Fit the model to the training data", "Create the model", "Split the data into training and testing data sets", "Test the model on the testing data"]
two = 'Collect input and output data' #@param ["Collect input and output data", "Fit the model to the training data", "Create the model", "Split the data into training and testing data sets", "Test the model on the testing data"]
three = 'Collect input and output data' #@param ["Collect input and output data", "Fit the model to the training data", "Create the model", "Split the data into training and testing data sets", "Test the model on the testing data"]
four = 'Collect input and output data' #@param ["Collect input and output data", "Fit the model to the training data", "Create the model", "Split the data into training and testing data sets", "Test the model on the testing data"]
five = 'Collect input and output data' #@param ["Collect input and output data", "Fit the model to the training data", "Create the model", "Split the data into training and testing data sets", "Test the model on the testing data"]

In [None]:
#@title Solution
print("1. Collect input and output data")
print("2. Split the data into training and testing data sets")
print("3. Create the model")
print("4. Fit the model to the training data")
print("5. Test the model on the testing data")

### First, we need input and output arrays for our ML pipeline. 

Luckily, our eeg and labels arrays are almost ready. We'll just make one or two modifications.

In [None]:
x = eeg.astype('float')
y = labels.astype('float')

### Next, split the data into training and testing sets

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=1)
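
If you're curious what a split with `test_size=0.4` means for 500 patients, here is a NumPy-only sketch of the general idea: shuffle the row indices, then cut them into two groups. This is not sklearn's actual implementation, and it will not reproduce the exact rows that `train_test_split` picks; `n_patients`, `order`, and the index names are illustrative.

```python
import numpy as np

n_patients = 500
test_size = 0.4

rng = np.random.default_rng(1)         # fixed seed so the shuffle is repeatable
order = rng.permutation(n_patients)    # shuffled row indices 0..499

n_test = int(round(test_size * n_patients))
test_idx, train_idx = order[:n_test], order[n_test:]

print(len(train_idx), len(test_idx))   # 300 200
```

Every patient ends up in exactly one of the two groups, which is what lets the test set act as "unseen" data.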

### Let's build some models!

Here are some models in sklearn that we learned about last week! 
* `knn = KNeighborsClassifier(n_neighbors = 5)`
* `log = LogisticRegression()`
* `dt = DecisionTreeClassifier(max_depth = 2)`

Try all 3 models, and try varying the parameter value for the KNN and DT models. What is the highest accuracy you can achieve on the test set with `accuracy_score`? 

In [None]:
#@title Solution

## BEST ACCURACIES:
## KNN, n_neighbors=1, acc=~85%
## DT, max_depth=10, acc=~89%


knn = KNeighborsClassifier(n_neighbors=10)  # one example setting; vary n_neighbors to compare with the results above
knn.fit(x_train, y_train)
predictions = knn.predict(x_test)
acc = accuracy_score(y_test, predictions)
print(acc)


### Discussion: How well did we do?

Some things to consider:
* How well would we have done if we had just classified each EEG randomly?
* What type of accuracy is needed for this problem?
* Are false positives or false negatives more alarming?
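
To ground the first question, the sketch below compares two "no-skill" baselines on a made-up label array: always guessing the most common class, and guessing with a coin flip. The labels are hypothetical and do not reflect the real class balance of our data set.

```python
import numpy as np

# Hypothetical test labels: 1 = seizure, 0 = non-seizure (not the real data).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Baseline 1: always predict the most common class.
majority = np.bincount(y_true).argmax()
majority_acc = np.mean(y_true == majority)

# Baseline 2: guess with a fair coin flip, averaged over many tries.
rng = np.random.default_rng(0)
coin_accs = [np.mean(rng.integers(0, 2, size=y_true.size) == y_true)
             for _ in range(2000)]

print(majority_acc)          # 0.8
print(np.mean(coin_accs))    # close to 0.5
```

When one class is much more common than the other, "always guess the common class" can look accurate while missing every seizure, so it is worth comparing your model against that baseline as well as against chance.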

In [None]:
# Compute the confusion matrix for the test data using confusion_matrix function
conf_mat = confusion_matrix(y_test, predictions)
cm_display = ConfusionMatrixDisplay(conf_mat, display_labels=np.arange(2)).plot()
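
One way to read a confusion matrix is to turn it into rates. The sketch below uses a made-up 2x2 matrix (the counts are hypothetical) in the layout that sklearn's `confusion_matrix` returns: rows are the true classes, columns are the predicted classes.

```python
import numpy as np

# Hypothetical counts, laid out as [[TN, FP], [FN, TP]].
toy_cm = np.array([[150, 10],
                   [8, 32]])

tn, fp, fn, tp = toy_cm.ravel()

sensitivity = tp / (tp + fn)          # fraction of real seizures that were caught
specificity = tn / (tn + fp)          # fraction of non-seizures correctly cleared
accuracy = (tp + tn) / toy_cm.sum()   # fraction of all predictions that were right

print(sensitivity, specificity, accuracy)
```

Sensitivity and specificity speak directly to the false negative and false positive questions from the discussion, which a single accuracy number hides.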

# Build a Simple Neural Net

Let's try out a model that's a little more complicated. Neural networks look something like this: 

![A 2 layer neural network](https://cdn-images-1.medium.com/max/1600/1*DW0Ccmj1hZ0OvSXi7Kz5MQ.jpeg)

But how can we create this model in Python? Here's some code that will create the neural network pictured above. We're going to use the tensorflow and keras libraries to create our neural network!

In [None]:
# grab tools from our tensorflow and keras toolboxes!
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Activation, Dropout, Flatten, Dense
from keras import optimizers

# create our model by specifying and compiling it
model = Sequential()
model.add(Dense(7, input_shape=(5,),activation = 'relu'))
model.add(Dense(7, activation = 'relu'))
model.add(Dense(4, activation = 'linear'))
model.compile(loss='binary_crossentropy',
                optimizer='adam',
                metrics=['mean_squared_error'])

**Can you explain what each step does here?**
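
If it helps, here is a rough NumPy sketch of what the three `Dense` layers in the example compute: each one multiplies its input by a weight matrix, adds a bias, and applies an activation. The helper `dense` and the random weights are purely illustrative (Keras chooses and trains its own weights), but the shapes and parameter counts follow from the layer sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(inputs, n_out, activation):
    """One fully connected layer with freshly drawn random weights (illustration only)."""
    weights = rng.normal(size=(inputs.shape[1], n_out))
    bias = np.zeros(n_out)
    z = inputs @ weights + bias
    out = np.maximum(z, 0) if activation == "relu" else z   # ReLU or linear
    return out, weights.size + bias.size

batch = rng.normal(size=(3, 5))        # 3 made-up examples with 5 features each
h1, p1 = dense(batch, 7, "relu")       # like Dense(7, input_shape=(5,), activation='relu')
h2, p2 = dense(h1, 7, "relu")          # like Dense(7, activation='relu')
out, p3 = dense(h2, 4, "linear")       # like Dense(4, activation='linear')

print(h1.shape, h2.shape, out.shape)   # (3, 7) (3, 7) (3, 4)
print(p1, p2, p3, p1 + p2 + p3)        # 42 56 32 130
```

The `compile` step does not change these shapes; it only tells Keras which loss to minimize, which optimizer to use, and which metrics to report during training.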


### Your Assignment:
Make a neural network with two hidden layers that have ReLU activation:

* The first hidden layer should have 128 neurons.
* The second hidden layer should have 64 neurons.

Then you should have one output layer with sigmoid activation.

Use the sample neural net created above as a starting place! You can use the same loss and optimizer in your compile command, but set `metrics=['accuracy']` so that the accuracy plot in the training cell has something to draw.

In [None]:
nn = Sequential()
### TODO: YOUR CODE HERE! ###

In [None]:
#@title Solution

out_activation = 'sigmoid' 
loss_fxn = 'binary_crossentropy'

nn = Sequential()
nn.add(Dense(128, input_shape = (x.shape[1],), activation = 'relu'))
nn.add(Dense(64, activation = 'relu'))
nn.add(Dense(units = 1, activation = out_activation))
nn.compile(loss= loss_fxn,
           optimizer = 'adam', 
           metrics = ['accuracy'])


Now run the code below to train your network.

In [None]:
nn.fit(x_train, y_train, epochs = 100, validation_data = (x_test, y_test), shuffle = True, callbacks = [monitor])
plot_acc(nn.history)

**What problem is your model experiencing?** How could we address it?

### Dropout

One way to address overfitting is by introducing [dropout](https://machinelearningmastery.com/how-to-reduce-overfitting-with-dropout-regularization-in-keras/). Read the linked article to understand what dropout is. 
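
As a companion to the article, here is a minimal NumPy sketch of the idea, assuming the common "inverted dropout" formulation: during training, a random fraction `rate` of a layer's outputs is set to zero and the survivors are scaled up by `1 / (1 - rate)`; at prediction time nothing is dropped. The function `dropout` below is an illustration, not the Keras implementation.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero out roughly `rate` of the values and rescale the rest (inverted dropout)."""
    if not training or rate == 0:
        return activations
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
hidden = np.ones((1000, 64))                   # made-up hidden-layer outputs

dropped = dropout(hidden, rate=0.2, rng=rng)
print((dropped == 0).mean())                   # close to 0.2
print(dropped.mean())                          # close to 1.0, thanks to the rescaling
print(np.array_equal(dropout(hidden, 0.2, rng, training=False), hidden))  # True
```

Because a different random set of neurons is silenced on every training step, the network cannot lean too heavily on any single neuron, which is why dropout tends to reduce overfitting.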

Then copy your model from above into the cell below, and add dropout to one or more of your layers! Train your model and plot its accuracy. Play around with different dropout values. Were you able to reduce overfitting? By how much?

In [None]:
## TODO: YOUR CODE HERE ## 
# Copy your model from above! Then add dropout.

In [None]:
nn.fit(x_train, y_train, epochs = 100, validation_data = (x_test, y_test), shuffle = True, callbacks = [monitor])
plot_acc(nn.history)

In [None]:
#@title Solution
# I have gotten 85% acc with 0.2 dropout added after both hidden layers
# Sometimes this setup also gets as low as 74% acc - discuss with students!
out_activation = 'sigmoid' 
loss_fxn = 'binary_crossentropy'
dropout = 0.2

nn = Sequential()
nn.add(Dense(128, input_shape = (x.shape[1],), activation = 'relu'))
nn.add(Dropout(dropout))
nn.add(Dense(64, activation = 'relu'))
nn.add(Dropout(dropout))
nn.add(Dense(units = 1, activation = out_activation))
nn.compile(loss= loss_fxn,
           optimizer = 'adam', 
           metrics = ['accuracy'])


nn.fit(x_train, y_train, epochs = 100, validation_data = (x_test, y_test), shuffle = True, callbacks = [monitor])
plot_acc(nn.history)

#### You may have also noticed that your accuracy sometimes changes when you retrain the same neural net. Why does this happen? Discuss with your classmates.
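
One ingredient worth discussing is randomness: a neural net starts from randomly drawn weights (and dropout and data shuffling add more randomness during training). The sketch below illustrates just the first part with NumPy; `init_weights` and the tiny 5-by-3 layer are made up for the example.

```python
import numpy as np

def init_weights(seed=None):
    """Draw starting weights for a tiny made-up layer (5 inputs, 3 neurons)."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(5, 3))

# No seed: each call draws different starting weights.
print(np.allclose(init_weights(), init_weights()))           # False

# Same seed: the starting point is reproduced exactly.
print(np.array_equal(init_weights(42), init_weights(42)))    # True
```

Different starting points can lead training to different final weights, and therefore to different accuracies. In recent TensorFlow versions, calling `tf.keras.utils.set_random_seed` before building the model plays a similar role to the seed here, although some sources of variation may remain.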

# You've finished this notebook 😊 Congrats!