### Helper Classes

First, we import our helper modules. The 'prepare_EMG' module prepares the EMG data for phoneme recognition. The 'prepare_outputs' module prepares our target labels and aligns them with the EMG data. The 'prepare_data' module reads data from CSV into a dataframe. Finally, 'vis' helps visualize EMG data in both the time and frequency domains.

In [55]:
%load_ext autoreload
%autoreload 2

import prepare_EMG, prepare_outputs, prepare_data, vis
# autodetector = Output_Prep.detector
EMG_Prep = prepare_EMG.EMG_preparer(window_size=50.0)
# Output_Prep = prepare_outputs.output_preparer(subvocal_detector = autodetector, window_size=30.0)
Output_Prep = prepare_outputs.output_preparer(window_size=50.0,do_grid_search=False)

Data_Prep = prepare_data.data_preparer()



The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


  y = column_or_1d(y, warn=True)


Training Score: 0.89906312737


In [None]:
print('detector:',Output_Prep.detector,'cv results:',Output_Prep.cv_results)



### Labeling the Data

First, we need to visualize a few EMG voltage graphs to find sections that most likely contain no subvocalization, and then regions that almost certainly do. These two classes of EMG readouts are used to train a detector that helps us automatically label EMG windows with phonemes. The model used here will most likely be an SVC inside 'prepare_outputs'. It processes each EMG window in order, and when it finds one that most likely contains subvocalization, it assigns the next phoneme as that window's label.
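The labeling pass described above can be sketched as follows. This is only an illustration of the idea, not the code in 'prepare_outputs': the function name 'label_windows', the boolean 'is_subvocal' flags (standing in for the detector's per-window predictions), and the 'silence_label' placeholder are all assumptions.

```python
from typing import List, Optional, Sequence


def label_windows(
    is_subvocal: Sequence[bool],
    phonemes: Sequence[str],
    silence_label: Optional[str] = None,
) -> List[Optional[str]]:
    """Assign the next unused phoneme to each window flagged as subvocal.

    Hypothetical helper: windows flagged False get `silence_label`, and
    flagged windows beyond the end of `phonemes` are also left as silence.
    """
    labels: List[Optional[str]] = []
    next_phoneme = 0  # index of the next phoneme waiting to be assigned
    for flagged in is_subvocal:
        if flagged and next_phoneme < len(phonemes):
            labels.append(phonemes[next_phoneme])
            next_phoneme += 1
        else:
            labels.append(silence_label)
    return labels


# Made-up detector output for six windows and a three-phoneme transcript.
flags = [False, True, True, False, True, False]
window_labels = label_windows(flags, ["HH", "AH", "L"])
print(window_labels)
```

The quality of the resulting labels depends entirely on the detector: one false positive shifts every later phoneme by one window.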

In [56]:
data_1 = Data_Prep.load('Sat Mar  4 00:44:23 2017')
data_2 = Data_Prep.load('Sat Mar  4 00:45:02 2017')
data_3 = Data_Prep.load('Sat Mar  4 00:45:47 2017')
data_4 = Data_Prep.load('Sat Mar  4 00:47:01 2017')
data_5 = Data_Prep.load('Sat Mar  4 00:47:36 2017')
data_6 = Data_Prep.load('Sat Mar  4 00:48:09 2017')
data_7 = Data_Prep.load('Sat Mar  4 00:49:05 2017')
data_8 = Data_Prep.load('Sat Mar  4 00:49:41 2017')
data_9 = Data_Prep.load('Sat Mar  4 00:50:22 2017')
data_10 = Data_Prep.load('Sat Mar  4 00:51:17 2017')
data_11 = Data_Prep.load('Sat Mar  4 00:52:02 2017')
data_12 = Data_Prep.load('Sat Mar  4 00:52:38 2017')
data_13 = Data_Prep.load('Sat Mar  4 00:53:24 2017')
data_14 = Data_Prep.load('Sat Mar  4 00:53:51 2017')
data_15 = Data_Prep.load('Sat Mar  4 00:54:25 2017')
data_16 = Data_Prep.load('Sat Mar  4 00:54:57 2017')
data_17 = Data_Prep.load('Sat Mar  4 00:56:01 2017')
data_18 = Data_Prep.load('Sat Mar  4 00:56:35 2017')
data_19 = Data_Prep.load('Sat Mar  4 00:57:21 2017')
data_20 = Data_Prep.load('Sat Mar  4 00:57:49 2017')
data_21 = Data_Prep.load('Sat Mar  4 00:58:59 2017')
data_22 = Data_Prep.load('Sat Mar  4 00:59:53 2017')



data_list = [data_1, data_2, data_3, data_4, data_5, data_6, data_7, data_8, data_9, data_10, data_11, data_12, data_13, data_14, data_15, data_16, data_17, data_18, data_19, data_20, data_21, data_22]

In [57]:
import pandas
%autoreload 2


num_files = len(data_list)
# One row of labels per recording, in the same order as data_list
labels_frame = pandas.read_csv('austen_subvocal.csv')
# print(labels_frame.iloc[0][0])
# Process the first file separately so the loop below has frames to extend
trans_labels = Output_Prep.transform(labels_frame.iloc[0][0])
data_1_proc = EMG_Prep.process(data_1)
aligned_data, trans_labels = Output_Prep.zip(data_1_proc, trans_labels, repeat=3)


for file in range(1, num_files):
    trans_labels_iter = Output_Prep.transform(labels_frame.iloc[file][0])
    data_proc_iter = EMG_Prep.process(data_list[file])
    aligned_data_iter, trans_labels_iter = Output_Prep.zip(data_proc_iter, trans_labels_iter, repeat=3)

    # DataFrame.append was removed in pandas 2.0; pandas.concat stacks rows the same way
    aligned_data = pandas.concat([aligned_data, aligned_data_iter])
    trans_labels = pandas.concat([trans_labels, trans_labels_iter])
    
# print("This is trans_labels:",trans_labels,'This is trans_labels length:',len(trans_labels))
# print('this is data_1_proc:',data_1_proc)
# print('this is aligned_data', aligned_data)
# aligned_data = aligned_data_1.append([aligned_data_2, aligned_data_3])
# trans_labels = trans_labels_1.append([trans_labels_2, trans_labels_3])
print('Aligned Data shape:',aligned_data.shape,'Trans labels shape:',trans_labels.shape)

Aligned Data shape: (5300, 50) Trans labels shape: (5300, 4)


### AF Extractor Models

These models will be optimized for extracting AFs (the manner, place, height, and vowel labels used below) from the data, before passing those AFs on to an MLPC that identifies the most likely phoneme.

In [58]:
# Prepare lists of parameters for our GridSearch
# First, our layer sizes: `depth` hidden layers, each `width` units wide
layer_sizes = [
    (width,) * depth
    for depth in range(2, 3)
    for width in range(30, 180, 30)
]
print('number layer sizes:',len(layer_sizes),'here be layer sizes',layer_sizes)

# Next, our alpha values
alphas = [0.0000001,1,1000]




number layer sizes: 5 here be layer sizes [(30, 30), (60, 60), (90, 90), (120, 120), (150, 150)]


In [59]:
from sklearn.neural_network import MLPClassifier as MLPC
# Import other models to try for feature extraction
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GridSearchCV
import copy

param_grid = {
    'solver':['adam'],
    'hidden_layer_sizes':layer_sizes,
    'activation':['logistic'],
    'alpha': alphas,
    'max_iter':[200],
    'beta_1':[((10000-n)/10000) for n in range(2,2000,400)],
    'beta_2':[((10000-n)/10000) for n in range(2,80,20)]
}
grid_search = GridSearchCV(MLPC(), param_grid, n_jobs=-1)
# vectorizer = DictVectorizer()
# trans_aligned_labels = vectorizer.fit_transform([item[1] for item in aligned_labels.iterrows()])
# print(trans_aligned_labels)
# TODO: Implement feature vectorization to use multioutput classification of MLPC
manner_classifier = MLPC(solver='adam',hidden_layer_sizes=(21,21),random_state=3)
manner_classifier.fit(aligned_data, trans_labels['manner'])
m_score = manner_classifier.score(aligned_data, trans_labels['manner'])

place_classifier = MLPC(solver='adam',hidden_layer_sizes=(30,30),random_state=6)
place_classifier.fit(aligned_data, trans_labels['place'])
p_score = place_classifier.score(aligned_data, trans_labels['place'])

height_classifier = MLPC(solver='adam',hidden_layer_sizes=(30,30),random_state=9)
height_classifier.fit(aligned_data, trans_labels['height'])
h_score = height_classifier.score(aligned_data, trans_labels['height'])

vowel_classifier = MLPC(solver='adam',hidden_layer_sizes=(30,30),random_state=12)
vowel_classifier.fit(aligned_data, trans_labels['vowel'])
v_score = vowel_classifier.score(aligned_data, trans_labels['vowel'])

print('manner score:',m_score,'place score:',p_score,'height score:',h_score,'vowel score:',v_score)
print(data_1_proc.head(50), trans_labels['manner'].head(50))

manner score: 0.389433962264 place score: 0.280754716981 height score: 0.504905660377 vowel score: 0.609811320755
         0.00      0.02      0.02      0.04      0.04      0.06      0.06  \
0   78.065625  3.629224  0.839573  3.174895  1.251957  2.208909  1.631892   
1   81.932813  0.103984  0.130759  0.309920  0.064390  0.066917  0.193844   
2   81.610547  0.000919  0.088216  0.155386  0.019537  0.417452  0.116381   
3   81.507422  0.102635  0.009653  0.043715  0.095052  0.314954  0.227756   
4   80.850000  0.155503  0.197546  0.229949  0.048328  0.000701  0.013273   
5   81.236719  0.112354  0.056141  0.004062  0.008119  0.054205  0.101333   
6   80.772656  0.272261  0.141355  0.000662  0.071712  0.215915  0.154497   
7   81.339844  0.121409  0.035416  0.103136  0.015159  0.099398  0.092555   
8   81.417187  0.109337  0.133520  0.164354  0.112874  0.137802  0.280535   
9   81.584766  0.035747  0.029262  0.073996  0.024613  0.107165  0.467060   
10  81.816797  0.082701  0.026714  0.06
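All four scores above are computed on the same rows the classifiers were fitted on, so they are training accuracies. A minimal sketch of scoring on held-out rows instead is given below. It uses random stand-in data with the same number of columns as 'aligned_data' (50); the names 'X', 'y', 'clf' and the 25% test split are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in data: 300 windows x 50 features, 3 made-up classes.
rng = np.random.RandomState(0)
X = rng.rand(300, 50)
y = rng.randint(0, 3, size=300)

# Hold out a quarter of the windows; stratify keeps class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = MLPClassifier(solver="adam", hidden_layer_sizes=(30, 30),
                    max_iter=200, random_state=6)
clf.fit(X_train, y_train)

train_score = clf.score(X_train, y_train)  # accuracy on rows seen during fitting
test_score = clf.score(X_test, y_test)     # accuracy on unseen rows
print("train score:", train_score, "held-out score:", test_score)
```

With real recordings, neighbouring windows from one file are likely to be correlated, so holding out whole files may give a more honest estimate than a random row split.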

In [60]:
manner_classifier2 = grid_search
manner_classifier2.fit(aligned_data, trans_labels['manner'])
m_score2 = manner_classifier2.score(aligned_data, trans_labels['manner'])

print('manner score:',m_score2)



KeyboardInterrupt: 
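The interrupted run is not surprising given the size of the search. As a rough check, the sketch below counts the parameter combinations in a grid shaped like 'param_grid' above, using only the standard library. The 'layer_sizes' and 'alphas' values are copied from the earlier cells; the fold count 'n_folds' is an assumption, since the default number of cross-validation folds depends on the scikit-learn version.

```python
from itertools import product

layer_sizes = [(30, 30), (60, 60), (90, 90), (120, 120), (150, 150)]
alphas = [0.0000001, 1, 1000]

grid = {
    "solver": ["adam"],
    "hidden_layer_sizes": layer_sizes,
    "activation": ["logistic"],
    "alpha": alphas,
    "max_iter": [200],
    "beta_1": [(10000 - n) / 10000 for n in range(2, 2000, 400)],
    "beta_2": [(10000 - n) / 10000 for n in range(2, 80, 20)],
}

# Every combination of one value per parameter is one candidate model.
candidates = list(product(*grid.values()))
n_candidates = len(candidates)

n_folds = 3  # assumed; adjust to the cv setting actually in use
n_fits = n_candidates * n_folds
print("candidates:", n_candidates, "model fits:", n_fits)
```

For this grid that is 300 candidates, each fitted once per fold. Narrowing 'beta_1' and 'beta_2' to one or two values, or sampling the grid with scikit-learn's RandomizedSearchCV, would reduce the cost considerably.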

In [74]:
place_classifier2 = MLPC(solver='adam',hidden_layer_sizes=(1,),random_state=6, max_iter=300)
place_classifier2.fit(aligned_data, trans_labels['place'])
p_score2 = place_classifier2.score(aligned_data, trans_labels['place'])

print('place score:',p_score2)

place score: 0.280566037736


In [83]:
height_classifier2 = MLPC(solver='lbfgs',hidden_layer_sizes=(60,60),random_state=9)
height_classifier2.fit(aligned_data, trans_labels['height'])
h_score2 = height_classifier2.score(aligned_data, trans_labels['height'])

print('height score:',h_score2)

height score: 0.532051282051


In [84]:
vowel_classifier2 = MLPC(solver='lbfgs',hidden_layer_sizes=(240,240,240),random_state=12)
vowel_classifier2.fit(aligned_data, trans_labels['vowel'])
v_score2 = vowel_classifier2.score(aligned_data, trans_labels['vowel'])

print('vowel score:',v_score2)

vowel score: 0.647435897436


In [66]:
print(aligned_data.head(),trans_labels['manner'].head())

phoneme_inputs = pandas.concat([aligned_data,trans_labels['manner'],trans_labels['place'],trans_labels['height'],trans_labels['vowel']],axis=1,join='outer')
phoneme_labels = trans_labels.axes[0]
phoneme_classifier = MLPC(solver='adam',hidden_layer_sizes=(90,90),random_state=6, max_iter=300)
phoneme_classifier.fit(phoneme_inputs, phoneme_labels)

    0.000000  0.033333  0.033333  0.066667  0.066667  0.100000  0.100000  \
0  48.133594  0.184898  0.521387  0.658988  0.484811  0.207412  0.830928   
1  48.391406  0.434197  0.334258  0.323602  0.734820  0.048160  0.231146   
2  48.017578  0.087168  0.253308  0.201288  1.038683  0.474576  0.374263   
3  48.417187  0.001295  0.219229  0.280249  0.731593  0.085533  0.095184   
4  48.120703  0.063312  0.076657  0.783632  0.326545  0.120802  0.117231   

   0.133333  0.133333  0.166667    ...     0.333333  0.366667  0.366667  \
0  0.343484  0.126331  0.251367    ...     0.256763  0.126939  0.047097   
1  0.099682  0.464614  0.045117    ...     0.100472  0.059356  0.001035   
2  0.594198  0.037230  0.315820    ...     0.033491  0.175025  0.098646   
3  0.330538  0.124166  0.154687    ...     0.044654  0.026136  0.052012   
4  0.628709  0.007656  0.296484    ...     0.111636  0.072937  0.092085   

   0.400000  0.400000  0.433333  0.433333  0.466667  0.466667  0.500000  
0  0.047357  0.119

InvalidIndexError: Reindexing only valid with uniquely valued Index objects
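This error usually means that a column-wise concat had to align objects whose row indexes contain repeated values. That is plausible here, because stacking the per-file frames one after another keeps each file's own row numbering. A minimal sketch of one possible fix, renumbering the rows before the column-wise concat, is shown below; the small frames and the names 'part_a', 'part_b', 'labels_a', 'labels_b' are made up for illustration.

```python
import pandas as pd

# Two small frames standing in for per-file results; each starts its index at 0.
part_a = pd.DataFrame({"f0": [1.0, 2.0], "f1": [3.0, 4.0]})
part_b = pd.DataFrame({"f0": [5.0, 6.0], "f1": [7.0, 8.0]})
labels_a = pd.DataFrame({"manner": ["stop", "nasal"]})
labels_b = pd.DataFrame({"manner": ["vowel", "stop"]})

# Stacking rows keeps the original indexes, so 0 and 1 each appear twice.
features = pd.concat([part_a, part_b])
labels = pd.concat([labels_a, labels_b])
has_duplicates = not features.index.is_unique

# Renumber rows 0..n-1 on both objects so they align one-to-one by position.
features = features.reset_index(drop=True)
labels = labels.reset_index(drop=True)

combined = pd.concat([features, labels["manner"]], axis=1)
print(combined)
```

Passing ignore_index=True to the row-wise concat would have the same effect. Note also that 'trans_labels.axes[0]' is the row index of the frame rather than a column of phoneme labels, so the target passed to 'phoneme_classifier.fit' probably needs to come from elsewhere.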

In [72]:
# Experiment with PCA here

from sklearn.decomposition import PCA
dim_red = PCA(n_components=40,random_state=18)
reduced_data = dim_red.fit_transform(aligned_data)

place_classifier3 = MLPC(solver='adam',hidden_layer_sizes=(30,30),random_state=6)
place_classifier3.fit(reduced_data, trans_labels['place'])
p_score3 = place_classifier3.score(reduced_data, trans_labels['place'])
print('place classifier score:',p_score3)

vowel_classifier3 = MLPC(solver='adam',hidden_layer_sizes=(120,120),random_state=12)
vowel_classifier3.fit(reduced_data, trans_labels['vowel'])
v_score3 = vowel_classifier3.score(reduced_data, trans_labels['vowel'])

print('vowel score:',v_score3)



place classifier score: 0.349811320755
vowel score: 0.951698113208
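The cell above fixes n_components=40 without checking how much variance those components retain. A minimal sketch of that check is given below, on random stand-in data with 50 columns; the 90% threshold and the names 'X', 'pca', 'cumulative', and 'n_needed' are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 200 windows x 50 features.
rng = np.random.RandomState(18)
X = rng.rand(200, 50)

pca = PCA(n_components=40, random_state=18)
X_reduced = pca.fit_transform(X)

# Running total of the fraction of variance explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

threshold = 0.90  # assumed target; choose to suit the task
if cumulative[-1] >= threshold:
    # Index of the first component at which the running total reaches the threshold.
    n_needed = int(np.argmax(cumulative >= threshold)) + 1
else:
    n_needed = None  # the kept components do not reach the threshold

print("variance kept by 40 components:", cumulative[-1])
print("components needed for threshold:", n_needed)
```

As with the classifiers, fitting the PCA on training rows only and applying transform to held-out rows keeps the evaluation free of information from the test set.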


In [64]:
p_score3


0.28075471698113208