<a href="https://colab.research.google.com/github/bdandersen-berkeley/mids/blob/master/data_management.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MIDS W207 Group 9 - Final Project - Data Management

## Colaboratory Notebook Preconditions

**This Colab notebook requires access to a Google Drive.**  Project data is retrieved from a folder on the Google Drive associated with the Google account with which this notebook is executing.

This data folder may exist anywhere beneath the Google Drive.  Unfortunately, mounted Google Drives do not have access to folders and files shared by others.

If the project data files do not reside on the Google Drive of the current Google account, please contact any of the following Group 9 members for assistance:

* Brad Andersen - bdandersen@berkeley.edu
* Stephanie Mather - stephanie.mather@berkeley.edu
* Sonal Thakkar - sonalthakkar@berkeley.edu

In [1]:
import sys
print("Python runtime:", sys.version_info)
assert 3 == sys.version_info.major and 6 == sys.version_info.minor, "Python runtime must be version 3.6.x"

Python runtime: sys.version_info(major=3, minor=6, micro=8, releaselevel='final', serial=0)


### Identifying the Pathname for Project Resources

This Colab notebook will mount the current user's Google Drive beneath `/content/drive`.   Please specify the pathname of the project resource folder as it is maintained on the current user's Google Drive.

*This is the folder beneath which subfolders **data** and **py** are located.*

In [0]:
gdrive_project_pathname = '/My Drive/MIDS/W207/Final Project'  #@param {type: "string"}

In [4]:
import os.path
from google.colab import drive

# Mount the current user's Google Drive
GOOGLE_DRIVE_MOUNT_POINT = "/content/drive"
print("Mounting Google Drive beneath %s" % GOOGLE_DRIVE_MOUNT_POINT)
drive.mount(GOOGLE_DRIVE_MOUNT_POINT, force_remount = True)

# Build the pathname to the project's folder residing beneath the current user's Google Drive
if not gdrive_project_pathname.startswith("/"):
    gdrive_project_pathname = "/" + gdrive_project_pathname
abs_project_pathname = GOOGLE_DRIVE_MOUNT_POINT + gdrive_project_pathname
print("Project folder: %s" % gdrive_project_pathname)

# Check that the subdirectories anticipated beneath the Google Drive project folder exist
# by checking for the presence of the DO_NOT_DELETE.txt file
for subfolder in ["data", "py"]:
    if not os.path.exists(abs_project_pathname + "/" + subfolder + "/DO_NOT_DELETE.txt"):
        raise FileNotFoundError("Required subfolder '" + subfolder + "' does not exist beneath the Google Drive project folder")
print("Project subfolders successfully verified")

abs_data_pathname = abs_project_pathname + "/data"
abs_py_pathname = abs_project_pathname + "/py"

Mounting Google Drive beneath /content/drive
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive
Project folder: /My Drive/MIDS/W207/Final Project
Project subfolders successfully verified


## EDA of Seizure Dataset 

## Dependencies

In [7]:
import IPython

! pip install -U mne

sys.path.append(abs_py_pathname)
import data
import eda
import ml_utl

Collecting mne
  Downloading https://files.pythonhosted.org/packages/42/ec/08afc26ea6204473031f786d0f3034119a5a138d40062b37fbf578c81c01/mne-0.18.2.tar.gz (6.3MB)
     |████████████████████████████████| 6.3MB 4.9MB/s
Building wheels for collected packages: mne
  Building wheel for mne (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/71/40/30/fb9a9bdeac02c6b3b144be66ac345c5b5587a7d7610564535b
Successfully built mne
Installing collected packages: mne
Successfully installed mne-0.18.2


## Accessing Seizure Data
The data used in this project, `clips.tar.gz`, is available from **UPenn and Mayo Clinic's Seizure Detection Challenge**: https://www.kaggle.com/c/seizure-detection/overview

Note: `clips.tar.gz` has been split into one zip file per subject for storage on Google Drive.

Example of Data Retrieval for Patient 1
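
A minimal sketch of such a retrieval, assuming each subject's clips sit in a zip file inside the **data** folder. The helper name `extract_subject_clips`, the archive name `Patient_1.zip`, and the destination folder are hypothetical and are not taken from the project's `data` module.

```python
import zipfile
from pathlib import Path


def extract_subject_clips(zip_path, dest_dir):
    """Extract every .mat clip from one subject's zip file and return their paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # Only the .mat clips are of interest; ignore anything else in the archive
        members = [name for name in archive.namelist() if name.endswith(".mat")]
        archive.extractall(dest, members=members)
    return sorted(dest / name for name in members)


# Hypothetical usage (archive name and destination are assumptions):
# clips = extract_subject_clips(abs_data_pathname + "/Patient_1.zip", "/content/clips")
```

Extracting to local Colab storage rather than back onto the mounted Drive is a design choice in this sketch, not something the project prescribes.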


## Data Structure
The dataset is from the UPenn and Mayo Clinic's Seizure Detection Challenge to detect seizures in intracranial EEG recordings.

The data is made up of 1-second EEG clips from 4 dogs and 7 patients. Each test subject has three types of clip:

* "Ictal" for seizure data segments
* "Interictal" for non-seizure data segments
* "test" for test data segments

Each clip is stored as a .mat file, the binary data format written by MATLAB.

The Kaggle competition reported that each clip contained the following information:

* **data:** a matrix of EEG sample values arranged row x column as electrode x time.
* **data_length_sec:** the time duration of each data row (1 second for all data in this case).
* **latency:** the time in seconds between the expert-marked seizure onset and the first data point in the data  segment (in ictal training segments only).
* **sampling_frequency:** the number of data samples representing 1 second of EEG data. (Non-integer values represent an average across the full data record and may reflect missing EEG samples).
* **channels**: a list of electrode names corresponding to the rows in the data field.



A summary of the data is shown below:

In [8]:
eda.clip_summary(abs_data_pathname, "Patient_2_ictal_segment_1.mat")

Filename:           Patient_2_ictal_segment_1.mat
Subject:            Patient_2
Data class:         ictal
Segment:            1
Data:
  Shape:            (16, 5000)
  Min (volts):      -313.1382
  Max (volts):      351.8618
  Duration (sec):   unknown
Latency (sec):      0.0000
Samples:            unknown
Electrodes:         16
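
The subject, data class and segment number reported above can all be read from the filename. The sketch below shows one way to do that with a regular expression. It is not the implementation inside `eda.clip_summary`, and the `Dog_` prefix and the exact set of class names are assumptions based on the dataset description.

```python
import re

# Assumed naming pattern: <Dog|Patient>_<n>_<class>_segment_<m>.mat
CLIP_NAME = re.compile(
    r"^(?P<subject>(?:Dog|Patient)_\d+)_"
    r"(?P<data_class>ictal|interictal|test)_"
    r"segment_(?P<segment>\d+)\.mat$"
)


def parse_clip_filename(filename):
    """Split a clip filename into subject, data class and segment number."""
    match = CLIP_NAME.match(filename)
    if match is None:
        raise ValueError("Unrecognised clip filename: %r" % filename)
    return {
        "subject": match.group("subject"),
        "data_class": match.group("data_class"),
        "segment": int(match.group("segment")),
    }


print(parse_clip_filename("Patient_2_ictal_segment_1.mat"))
```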


Further exploration of the .mat file revealed that the data structure described by the Kaggle competition is slightly inaccurate. The loadmat function used to bring the data into Python for the EDA also adds some metadata of its own.

The resulting data structure is as follows:

* **data:** a matrix of EEG sample values arranged row x column as electrode x time.
* **data_length_sec:** *missing*. The competition preamble states that all clips are 1 second long.
* **latency:** the time in seconds between the expert-marked seizure onset and the first data point in the data segment (in ictal training segments only)
* **freq:** (documented by the competition as **sampling_frequency**) the number of data samples representing 1 second of EEG data. (Non-integer values represent an average across the full data record and may reflect missing EEG samples)
* **channels**: a list of electrode names corresponding to the rows in the data field
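
Because `data_length_sec` is absent, the duration of a clip can be estimated from the number of samples per channel and `freq`. The sketch below is an illustration only. The helper name is hypothetical, and the inputs are the sample count and sampling frequency of the `Patient_1_ictal_segment_1.mat` clip examined in this notebook.

```python
def clip_duration_seconds(n_samples, sampling_frequency_hz):
    """Estimate clip duration as samples per channel divided by samples per second."""
    if sampling_frequency_hz <= 0:
        raise ValueError("Sampling frequency must be positive")
    return n_samples / sampling_frequency_hz


# 500 samples per channel at roughly 499.9 Hz is very close to 1 second
duration = clip_duration_seconds(500, 499.906994)
print("Duration (sec): %.4f" % duration)
```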



In [9]:
clip_example = eda.clip_load(abs_data_pathname, "Patient_1_ictal_segment_1.mat")

clip_example.keys()

dict_keys(['__header__', '__version__', '__globals__', 'data', 'freq', 'channels', 'latency'])
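
The keys wrapped in double underscores are the bookkeeping entries added by loadmat. The remaining keys are the clip's own fields. A small sketch of separating the two follows, using a stand-in dictionary with placeholder values rather than the real clip. The helper name is hypothetical.

```python
def payload_fields(clip):
    """Return only the clip's own fields, dropping loadmat's '__name__' bookkeeping keys."""
    return {
        key: value
        for key, value in clip.items()
        if not (key.startswith("__") and key.endswith("__"))
    }


# Stand-in for the loaded clip: same keys as above, placeholder values
example_clip = {
    "__header__": b"", "__version__": "1.0", "__globals__": [],
    "data": None, "freq": None, "channels": None, "latency": None,
}
print(sorted(payload_fields(example_clip)))
```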

Some of the data structures created by loadmat() were difficult to unpack, especially the channel data. For now, the channel names have been extracted as a list of strings; however, this may have lost some positional information needed when integrating the clip file with traditional EEG visualisation tools.
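
One way to avoid hard-coding the depth of the nesting is to walk the structure recursively and collect every string. The sketch below mimics the nesting with plain lists and tuples. Its behaviour on the actual arrays returned by loadmat is an assumption that should be checked against a real clip.

```python
def flatten_strings(nested):
    """Recursively collect every string found in a nested sequence, in order."""
    if isinstance(nested, str):
        # Stop here so that a name is not split into single characters
        return [nested]
    names = []
    for item in nested:
        names.extend(flatten_strings(item))
    return names


# Stand-in for clip["channels"]: one outer row, one record, one-element entries
nested_channels = [[(["LFG1"], ["LFG10"], ["LFG11"], ["LFS4"])]]
print(flatten_strings(nested_channels))
```

`scipy.io.loadmat` also accepts a `squeeze_me=True` argument that removes unit dimensions, which may reduce the nesting; whether it simplifies this particular field has not been verified here.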

In [10]:
clip_example["channels"]

array([[(array(['LFG1'], dtype='<U4'), array(['LFG10'], dtype='<U5'), array(['LFG11'], dtype='<U5'), array(['LFG12'], dtype='<U5'), array(['LFG13'], dtype='<U5'), array(['LFG14'], dtype='<U5'), array(['LFG15'], dtype='<U5'), array(['LFG16'], dtype='<U5'), array(['LFG17'], dtype='<U5'), array(['LFG18'], dtype='<U5'), array(['LFG19'], dtype='<U5'), array(['LFG2'], dtype='<U4'), array(['LFG20'], dtype='<U5'), array(['LFG21'], dtype='<U5'), array(['LFG22'], dtype='<U5'), array(['LFG23'], dtype='<U5'), array(['LFG24'], dtype='<U5'), array(['LFG25'], dtype='<U5'), array(['LFG26'], dtype='<U5'), array(['LFG27'], dtype='<U5'), array(['LFG28'], dtype='<U5'), array(['LFG29'], dtype='<U5'), array(['LFG3'], dtype='<U4'), array(['LFG30'], dtype='<U5'), array(['LFG31'], dtype='<U5'), array(['LFG32'], dtype='<U5'), array(['LFG33'], dtype='<U5'), array(['LFG34'], dtype='<U5'), array(['LFG35'], dtype='<U5'), array(['LFG36'], dtype='<U5'), array(['LFG37'], dtype='<U5'), array(['LFG38'], dtype='<U5'), ar

In [11]:
print("The raw EEG data is:\n",clip_example["data"][0][0])
print("The shape of the raw EEG data is:\n",clip_example["data"].shape)

print("The sampling frequency is:\n",clip_example["freq"])


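# The channel names are nested (array -> record -> one-element array per electrode),
# so index down to the string itself, once for each row of the data matrix
n_channels = clip_example["data"].shape[0]
print("The channels are:\n",[clip_example["channels"][0][0][i][0] for i in range(n_channels)])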

The raw EEG data is:
 204.98399999999998
The shape of the raw EEG data is:
 (68, 500)
The sampling frequency is:
 [499.906994]
The channels are:
 ['LFG1', 'LFG10', 'LFG11', 'LFG12', 'LFG13', 'LFG14', 'LFG15', 'LFG16', 'LFG17', 'LFG18', 'LFG19', 'LFG2', 'LFG20', 'LFG21', 'LFG22', 'LFG23', 'LFG24', 'LFG25', 'LFG26', 'LFG27', 'LFG28', 'LFG29', 'LFG3', 'LFG30', 'LFG31', 'LFG32', 'LFG33', 'LFG34', 'LFG35', 'LFG36', 'LFG37', 'LFG38', 'LFG39', 'LFG4', 'LFG40', 'LFG41', 'LFG42', 'LFG43', 'LFG44', 'LFG45', 'LFG46', 'LFG47', 'LFG48', 'LFG49', 'LFG5', 'LFG50', 'LFG51', 'LFG52', 'LFG53', 'LFG54', 'LFG55', 'LFG56', 'LFG57', 'LFG58', 'LFG59', 'LFG6', 'LFG60', 'LFG61', 'LFG62', 'LFG63', 'LFG64', 'LFG7', 'LFG8', 'LFG9', 'LFS1', 'LFS2', 'LFS3', 'LFS4']
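
The channel names above are in lexicographic order (`LFG1`, `LFG10`, `LFG11`, ...) rather than numeric order. If the numeric suffix reflects an electrode's position on the grid, which is an assumption, a natural sort gives the row order needed to line the channels up by number. The helper names below are hypothetical.

```python
import re


def natural_key(name):
    """Split a name such as 'LFG10' into ('LFG', 10) so suffixes sort as numbers."""
    match = re.match(r"^([A-Za-z_]+)(\d+)$", name)
    if match is None:
        return (name, -1)  # names without a numeric suffix sort by name alone
    return (match.group(1), int(match.group(2)))


def natural_order(channel_names):
    """Return the row indices that put the channels in natural order."""
    return sorted(range(len(channel_names)),
                  key=lambda i: natural_key(channel_names[i]))


names = ["LFG1", "LFG10", "LFG11", "LFG2", "LFS1"]
order = natural_order(names)
print([names[i] for i in order])
```

Returning indices rather than sorted names means the same `order` can be used to reorder the rows of the data matrix so that names and signals stay aligned.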


In [12]:
clip_data_df = eda.create_clip_df(abs_data_pathname, "Patient_1_ictal_segment_1.mat")
clip_data_df[0].head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499
LFG1,204.984,213.984,217.984,197.984,186.984,177.984,174.984,173.984,156.984,141.984,143.984,128.984,119.984,102.984,99.984,92.984,80.984,97.984,112.984,143.984,161.984,183.984,178.984,175.984,187.984,180.984,162.984,152.984,137.984,144.984,145.984,149.984,165.984,178.984,192.984,190.984,158.984,112.984,54.984,46.984,...,-280.016,-292.016,-303.016,-295.016,-264.016,-254.016,-246.016,-228.016,-221.016,-212.016,-226.016,-248.016,-240.016,-262.016,-239.016,-205.016,-190.016,-177.016,-167.016,-169.016,-156.016,-154.016,-152.016,-113.016,-79.016,-58.016,-53.016,-75.016,-88.016,-85.016,-119.016,-112.016,-105.016,-120.016,-115.016,-142.016,-179.016,-195.016,-216.016,-214.016
LFG10,-138.314,-127.314,-111.314,-148.314,-179.314,-201.314,-210.314,-198.314,-210.314,-211.314,-230.314,-260.314,-285.314,-290.314,-300.314,-272.314,-223.314,-173.314,-135.314,-117.314,-127.314,-131.314,-139.314,-156.314,-175.314,-207.314,-236.314,-237.314,-237.314,-238.314,-236.314,-199.314,-127.314,-72.314,-35.314,-23.314,-8.314,-1.314,-5.314,9.686,...,-250.314,-259.314,-273.314,-275.314,-249.314,-236.314,-235.314,-240.314,-223.314,-204.314,-163.314,-141.314,-118.314,-125.314,-127.314,-143.314,-153.314,-152.314,-159.314,-154.314,-161.314,-166.314,-159.314,-156.314,-161.314,-158.314,-158.314,-145.314,-118.314,-81.314,-37.314,14.686,68.686,93.686,117.686,105.686,63.686,17.686,-17.314,-40.314
LFG11,-64.03,-57.03,-28.03,-50.03,-117.03,-162.03,-196.03,-189.03,-216.03,-222.03,-229.03,-240.03,-243.03,-255.03,-247.03,-233.03,-226.03,-214.03,-222.03,-213.03,-233.03,-231.03,-232.03,-220.03,-215.03,-220.03,-231.03,-266.03,-301.03,-315.03,-307.03,-270.03,-240.03,-225.03,-189.03,-174.03,-165.03,-169.03,-191.03,-195.03,...,-119.03,-129.03,-158.03,-182.03,-167.03,-128.03,-62.03,-40.03,-45.03,-47.03,-69.03,-85.03,-59.03,-74.03,-72.03,-91.03,-93.03,-79.03,-99.03,-101.03,-88.03,-61.03,-14.03,-13.03,-37.03,-41.03,-62.03,-70.03,-65.03,-46.03,-60.03,-47.03,-34.03,-52.03,-39.03,-40.03,-46.03,-55.03,-83.03,-66.03
LFG12,-167.764,-152.764,-158.764,-187.764,-195.764,-220.764,-259.764,-299.764,-332.764,-373.764,-420.764,-435.764,-436.764,-442.764,-423.764,-388.764,-368.764,-350.764,-339.764,-368.764,-415.764,-424.764,-440.764,-460.764,-451.764,-415.764,-363.764,-325.764,-349.764,-365.764,-362.764,-349.764,-324.764,-271.764,-193.764,-176.764,-197.764,-216.764,-247.764,-245.764,...,-160.764,-213.764,-284.764,-320.764,-265.764,-218.764,-185.764,-152.764,-170.764,-170.764,-127.764,-111.764,-78.764,-29.764,45.236,127.236,159.236,187.236,191.236,261.236,270.236,240.236,237.236,193.236,156.236,141.236,147.236,152.236,157.236,182.236,180.236,175.236,171.236,154.236,174.236,172.236,196.236,234.236,257.236,293.236
LFG13,-139.808,-118.808,-87.808,-102.808,-94.808,-116.808,-126.808,-147.808,-177.808,-196.808,-201.808,-219.808,-267.808,-316.808,-330.808,-298.808,-231.808,-151.808,-85.808,-45.808,-54.808,-78.808,-118.808,-142.808,-118.808,-72.808,-40.808,-20.808,-6.808,-6.808,-25.808,-37.808,-72.808,-67.808,-43.808,-48.808,-46.808,-49.808,-65.808,-42.808,...,-89.808,-114.808,-155.808,-170.808,-151.808,-141.808,-118.808,-106.808,-97.808,-74.808,-54.808,-42.808,-12.808,3.192,57.192,77.192,100.192,140.192,171.192,207.192,213.192,195.192,190.192,176.192,151.192,154.192,152.192,143.192,143.192,139.192,147.192,176.192,205.192,204.192,220.192,217.192,211.192,228.192,248.192,286.192
