We will be using the <b>"Epileptic Seizure Recognition Data Set"</b> (2017) from the <it>UC Irvine Machine Learning Repository</it>, available for download here: http://archive.ics.uci.edu/ml/datasets/Epileptic+Seizure+Recognition# <br> 
(For Linux or Mac users, you can open a terminal, copy this link, and use the command: <br>
        wget http://archive.ics.uci.edu/ml/machine-learning-databases/00388/data.csv 
to download the dataset into your working directory) <br>

<b>An aside about the dataset:</b> the data file is in CSV, or comma-separated value, format and contains a numerical representation of EEG data, which is recorded as a time series (the frequencies of brainwaves as they change over time). To analyze this data, it is helpful to "sample" the time series and process it into an easier-to-use format using the Fast Fourier Transform (FFT) algorithm - a super useful and powerful mathematical tool which you don't need to know about to use this data. <br>

The dataset was constructed by measuring brainwave activity from a total of 500 individuals who were each recorded for 23.5 seconds, then the time series were transformed using the FFT such that the 23.5 second measurements sampled into 4097 data points. The 500 patients' 23 second recordings results in 11500 samples in the dataset, each with a total of 179 features + a label indicating epiliptic activity. <br>

That's a whole lot of data! One nice way to think of how to represent this in our program is in a grid-like structure with rows and columns (or a matrix, if you are already familiar), in which each row is a sample corresponding to a patient's brain data and each column representing one feature of the EEG data. Therefore, we should have 11500 rows and 179+1 columns in our grid. <br>

Now, let's load this data into our grid - a multidimensional Numpy array!

In [1]:
# Import some helpful libraries
import numpy as np  # Numpy - we'll use this to store and preprocess the data
import sklearn      # scikit learn - we'll take advantage of data visualization tools as well as an easy to use, off-the-shelf SVM implementation

# The 1st row in the dataset is a header, which we will exclude using the skiprows parameter, 
# as well as the first column, which "names" the specific example based on patient and brainwave sample

extract_cols = range(1, 180) # Keep the brain activity features and corresponding label
seizure_data = np.loadtxt("seizure_data.csv", delimiter=",", skiprows=1, usecols=extract_cols) # Load in the data

Each row in the dataset has a label with values 1-5: a label of '1' indicates epileptic seizure, while labels '2', '3', '4', and '5' represent subjects who did not have a seizure. Most papers which have analyzed this data have used this for binary classification, which is what we'll also do as a slight simplification and for more meaningful results (since we're assuming that you haven't come to this tutorial to learn about neuroscience). <br>

We call this process "binarizing" the dataset in a <b>"one-against-all"</b> manner (either the patient has epileptic seizure or doesn't), so we consider all rows with label '1' to be part of the <b>"positive class"</b>, and all other labels will be '0' and part of the <b>"negative class"</b>. 

In [2]:
print("Before binarizing:", seizure_data[:10, -1])

# Binarize the labels of the all samples/rows in the dataset
for i in range(len(seizure_data)):
    # If the sample doesn't have a positive label, consider it in the negative class
    if seizure_data[i, -1] != 1:
        seizure_data[i, -1] = 0
        
print("After binarizing:", seizure_data[:10, -1])

('Before binarizing:', array([4., 1., 5., 5., 5., 5., 4., 2., 1., 4.]))
('After binarizing:', array([0., 1., 0., 0., 0., 0., 0., 0., 1., 0.]))


How that we have our data ready to go, we want to get some sense of "what it looks like". If our data were two dimensional we could simply plot it and see plainly if the two types or classes were mixed or very far apart. To give a silly example, let's say we wanted to classify German shepherds and tabby kittens based on their weight and height. It would be pretty easy to see that plotting the weights along the x-axis, heights along the y-axis, that kittens would be close to the origin while the shepherds would be so much taller and heavier that they would be far away on the other side of the plot. 

Or, for example, if we wanted to classify individual cherries and watermelons based on the color and weight of each, we would have red and small weights corresponding to cherries, and green and big weights corresponding to watermelons, which would be easy to visualize. We'd have a bunch of red dots close to the origin, and a bunch of green dots far away from the origin. <br>



### 

In [29]:
from PIL import Image
import matplotlib.pyplot as plt
dog_img = Image.open("german-shepherd.jpg")
cat_img = Image.open("kitten.jpg")
dog_img.show()
cat_img.show()

ImportError: No module named matplotlib.pyplot

Once we begin dealing with data with more than 3 or form dimensions - the dimension of space we live in - it's nearly impossible to have an intuition on how data "looks". So for us, unless we do something special with our data, we won't be able to have a visual sense of the form of our data. Herein, we look at two special algorithms: <b>atPCA</b> (Principal Components Analysis) and <b>t-SNE</b> (t-distributed Stochastic Neighbor Embedding). <br>

## Principal Components Analysis:
If you've had some exposure to Linear Algebra, then you may enjoy this next portion; otherwise, feel free to read about the intuition behind it and skip down to t-SNE (no hard feelings). <> <br>

<b>PCA</b> is a special procedure which takes a set of examples/samples/<it>observations</it> (the rows of our data matrix), and their corresponding features/attributes/<it>variables</it>: in statistical terms, it takes the observations and their possibly correlated, or dependent variables and processes them in a way to return a minimal set of variables which are <b>linearly uncorrelated</b>. This minimal set of uncorrelated variables are where the algorithm gets its name; these are the <b>principal components</b>. In layman terms, PCA takes your data in a high dimension we'll represent with the letter $d$ and aims to transform it into a lower dimension we'll call $b$ (with $b < d$) such that you keep the dimensions which encapsulate the most information about our data. lose the a small but minimal  <br>






In [27]:
from sklearn.decomposition import PCA

print("Before PCA, the dataset has:", seizure_data.shape[0], "samples and", seizure_data.shape[1], "features.")

# Instantiate the object which will reduce our dataset down from 178 to to having 2 features
reduce_dimensionality = PCA(n_components=2)

# "Fit" the PCA model with the seizure_data (but not the labels!)
reduced_data = reduce_dimensionality.fit_transform(seizure_data[:, :-2])
# Add the column of labels to the reduced data matrix
reduced_data = np.hstack((reduced_data, seizure_data[:, -1].reshape(-1, 1)))

print("After PCA, the dataset has:", reduced_data.shape[0], "samples and", reduced_data.shape[1], "features.")

('Before PCA, the dataset has:', 11500, 'samples and', 179, 'features.')
('After PCA, the dataset has:', 11500, 'samples and', 3, 'features.')
[[ -88.44502706 -200.96990243    0.        ]
 [-278.64704183 -446.40244851    1.        ]
 [  88.60747079  -41.85030192    0.        ]
 ...
 [ -46.39832059  -21.16020693    0.        ]
 [ -33.60284095 -177.79396568    0.        ]
 [  19.1257944    51.50416       0.        ]]


Now that we've done some successful dimensionality reduction, we will plot the data and color the points black if they are negative (no seizure activity) samples and red if they are positive (seizure activity). For this, we'll use the <b>matplotlib</b> library to use the pyplot.scatter function to produce a nice scatterplot.

In [28]:
plt.scatter(reduced_data[:, -2], reduced_data[:, -1], c=['b, r'])
plt.show()

NameError: name 'plt' is not defined

In [4]:
# Toy example with classifying individual cherries from watermelons

grape_weights = np.random.randn()
print grape_weights


[-0.26186441]
