# Summary

&emsp; This notebook contains code to evaluate the performance of a logistic regression model in classifying upper limb position from electromyography data. Performance is assessed within each subject and across subjects. In the 'across subject' case, the model is trained on data from one subject and data from all other subjects is used as the test data. Overall, gesture classification performance on held-out data is quite high when the training and test data come from the same subject (mean test F1 of about 0.97), but drops substantially when the test data comes from other subjects (mean test F1 of about 0.56).

The following notebooks in this repo contain useful data and analysis pipeline details:
- data_exploration_and_quality_check_demo.ipynb
- single_subject_classification_demo.ipynb

Logistic regression model performance is compared with RNN model performance in:
- compare_model_performance_within_and_across_subjects.ipynb

&emsp; The folder containing this notebook is expected to contain a utils.py script (with custom functions for data wrangling and analysis) and the EMG_data folder (downloaded from: http://archive.ics.uci.edu/ml/datasets/EMG+data+for+gestures#).
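
A quick way to confirm this layout before running the cells below is sketched here. The helper name `missing_inputs` is hypothetical and is not part of utils.py; it only checks that the two expected entries exist next to the notebook.

```python
from pathlib import Path


def missing_inputs(notebook_dir="."):
    """Return the expected entries that are absent from notebook_dir."""
    root = Path(notebook_dir)
    missing = []
    # utils.py must be a regular file
    if not (root / "utils.py").is_file():
        missing.append("utils.py")
    # EMG_data must be a directory (one sub-folder per subject)
    if not (root / "EMG_data").is_dir():
        missing.append("EMG_data")
    return missing


# An empty list means both inputs were found
print(missing_inputs("."))
```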

In [1]:
#import necessary packages

#our workhorses
import numpy as np
import pandas as pd
import scipy

#to visualize
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
#style params for figures
sns.set(font_scale = 2)
plt.style.use('seaborn-white')
plt.rc("axes", labelweight="bold")
from IPython.display import display, HTML

#to load files
import os
import h5py

#load custom functions
from utils import *


### Within-subject performance

&emsp; The code blocks below train and evaluate a simple logistic regression model that classifies limb position from the pattern of signals across electrodes. To put the performance in context, it's useful to also measure classifier performance after randomly shuffling the class labels (i.e., erasing the relationship between signal and class).
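
The idea behind the shuffled-label baseline can be illustrated with a toy example. How `permute = True` is implemented inside utils.py is not shown in this notebook, so the snippet below is only an assumption about the general technique: permuting the labels keeps the class counts intact while breaking the pairing between each feature row and its class.

```python
import numpy as np

rng = np.random.default_rng(1)  # seed chosen only for reproducibility

# Toy labels: 3 classes, 4 windows each (illustrative, not the EMG data)
labels = np.repeat([1, 2, 3], 4)

# A random permutation reorders the labels without changing how many
# windows belong to each class
shuffled = rng.permutation(labels)

same_counts = np.array_equal(np.bincount(labels), np.bincount(shuffled))
print(same_counts)
```

A classifier trained on `shuffled` can still fit noise in the training set, but its held-out score should sit near chance, which is the reference point for the unshuffled results.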

Some details on data preparation and the model:
+ Values are standardized across samples within each feature dimension.
+ Logistic regression is trained without regularization (a simple model is good enough).
+ Cross-entropy is used as the loss function.
+ Model performance is assessed on the held-out set with stratified k-fold cross-validation, which keeps the class balance across train/test splits.
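
The standardization and stratified splitting steps listed above can be sketched with plain NumPy. The function names `stratified_folds` and `standardize` are hypothetical and are not the utils.py implementation; fitting the scaling on the training rows only is also an assumption on my part.

```python
import numpy as np


def stratified_folds(labels, n_splits, rng):
    """Assign each sample a fold index so every fold keeps the class balance."""
    labels = np.asarray(labels)
    fold_of = np.empty(labels.size, dtype=int)
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        # deal the samples of this class round-robin across the folds
        fold_of[idx] = np.arange(idx.size) % n_splits
    return fold_of


def standardize(train_X, test_X):
    """Z-score each feature using statistics from the training rows only."""
    mean = train_X.mean(axis=0)
    std = train_X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (train_X - mean) / std, (test_X - mean) / std


rng = np.random.default_rng(1)
X = rng.normal(size=(60, 16))        # toy [nsamples, nfeatures] matrix
y = np.repeat(np.arange(1, 7), 10)   # six toy classes, 10 windows each

fold_of = stratified_folds(y, n_splits=3, rng=rng)
for k in range(3):
    test = fold_of == k
    X_train, X_test = standardize(X[~test], X[test])
    # per-class window counts in the held-out fold
    print(k, np.bincount(y[test])[1:])
```

Each held-out fold ends up with nearly the same number of windows per class, which is the property the stratified split is meant to guarantee.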


&emsp; Model performance is assessed for each subject individually using different train/test splits of the data, and the results are written to a file.
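
The results are stored as F1 scores. How utils.py averages F1 across classes is not shown here, so the sketch below assumes macro averaging (an unweighted mean of the per-class scores) purely for illustration; `macro_f1` is a hypothetical helper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 = 2TP / (2TP + FP + FN); the denominator is positive because
        # every class in `classes` occurs in at least one of the two lists
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


# Toy example: one window of class 2 is misclassified as class 3
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 2, 3, 3, 3]
print(round(macro_f1(y_true, y_pred), 3))
```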

In [2]:

#define where the data files are located
data_folder = './EMG_data/'

nsubjects = 36

#User-defined parameters

lo_freq = 20 #lower bound of bandpass filter
hi_freq = 450 #upper bound of bandpass filter

win_size = 200 #define window size over which to compute time-domain features
step = win_size #keeping this parameter in case we want to re-run later with some overlap

#Initialize empty lists
lr_results_df = []
#subject_id = 1

for subject_id in range(1,nsubjects+1):

    subject_folder = os.path.join(data_folder,'%02d'%(subject_id))
    print(subject_folder)

    # Process data and get features 



    feature_matrix, target_labels, window_tstamps, \
    block_labels, series_labels = get_subject_data_for_classification(subject_folder, lo_freq, hi_freq, \
                                                                      win_size, step)

    # resulting feature_matrix has dimension [nsamples, nfeatures]
    nsamples, nfeat = feature_matrix.shape
    print(feature_matrix.shape)


    # Set seed for replicability
    np.random.seed(1)

    # Repeat analysis over multiple repetitions to take into account stochasticity of experiment
    nreps = 10
    for rep in range(nreps):

        # Feed data into function for training and evaluation
        train_f1, test_f1, prob_class_rep = log_reg_on_labeled_data(feature_matrix, target_labels,window_tstamps,\
                                                                              series_labels, nsplits = 3, \
                                                                          penalty = 'none', \
                                                                          multiclass = 'multinomial')

        # Put results in dataframe
        lr_results_df.append(pd.DataFrame({'F1_score':train_f1,\
                                        'Rep':[rep+1 for x in range(train_f1.size)],\
                                        'Fold': np.arange(train_f1.size)+1,\
                                   'Shuffled':[False for x in range(train_f1.size)],\
                                   'Type':['Train' for x in range(train_f1.size)],\
                                          'Train_Subject':[subject_id for x in range(train_f1.size)],\
                                        'Test_Subject':[subject_id for x in range(train_f1.size)],\
                                          }))

        lr_results_df.append(pd.DataFrame({'F1_score':test_f1,\
                                        'Rep':[rep+1 for x in range(test_f1.size)],\
                                        'Fold': np.arange(test_f1.size)+1,\
                                   'Shuffled':[False for x in range(test_f1.size)],\
                                   'Type':['Test' for x in range(test_f1.size)],\
                                           'Train_Subject':[subject_id for x in range(test_f1.size)],\
                                        'Test_Subject':[subject_id for x in range(test_f1.size)],\
                                          }))
        
    # Run classifier with shuffled data (null hypothesis)
    for rep in range(nreps):
        #train and evaluate a classifier after shuffling the class labels
        train_f1, test_f1, dummy = log_reg_on_labeled_data(feature_matrix, target_labels, window_tstamps,\
                                                                                series_labels,  nsplits = 3, \
                                                                          penalty = 'none', \
                                                                          multiclass = 'multinomial', permute = True)
        # Put results in dataframe
        lr_results_df.append(pd.DataFrame({'F1_score':train_f1,\
                                        'Rep':[rep+1 for x in range(train_f1.size)],\
                                        'Fold': np.arange(train_f1.size)+1,\
                                   'Shuffled':[True for x in range(train_f1.size)],\
                                   'Type':['Train' for x in range(train_f1.size)],\
                                           'Train_Subject':[subject_id for x in range(train_f1.size)],\
                                            'Test_Subject':[subject_id for x in range(train_f1.size)],\
                                          }))

        lr_results_df.append(pd.DataFrame({'F1_score':test_f1,\
                                        'Rep':[rep+1 for x in range(test_f1.size)],\
                                        'Fold': np.arange(test_f1.size)+1,\
                                   'Shuffled':[True for x in range(test_f1.size)],\
                                   'Type':['Test' for x in range(test_f1.size)],\
                                           'Train_Subject':[subject_id for x in range(test_f1.size)],\
                                            'Test_Subject':[subject_id for x in range(test_f1.size)],\
                                          }))

#concatenate all dataframes
lr_results_df = pd.concat(lr_results_df, axis =0)

#write within subject results to file
results_folder =  os.path.join(data_folder,'..','results_data','log_reg')
results_fn = 'all_subjects_within_subject_results.h5'
lr_results_df.to_hdf(os.path.join(results_folder,results_fn), key='results_df', mode='w')

./EMG_data/01
(604, 16)
./EMG_data/02
(681, 16)
./EMG_data/03
(522, 16)
./EMG_data/04
(571, 16)
./EMG_data/05
(530, 16)
./EMG_data/06
(499, 16)
./EMG_data/07
(656, 16)
./EMG_data/08
(582, 16)
./EMG_data/09
(640, 16)
./EMG_data/10
(619, 16)
./EMG_data/11
(735, 16)
./EMG_data/12
(632, 16)
./EMG_data/13
(780, 16)
./EMG_data/14
(495, 16)
./EMG_data/15
(519, 16)
./EMG_data/16
(523, 16)
./EMG_data/17
(661, 16)
./EMG_data/18
(645, 16)
./EMG_data/19
(541, 16)
./EMG_data/20
(641, 16)
./EMG_data/21
(599, 16)
./EMG_data/22
(589, 16)
./EMG_data/23
(573, 16)
./EMG_data/24
(578, 16)
./EMG_data/25
(577, 16)
./EMG_data/26
(533, 16)
./EMG_data/27
(526, 16)
./EMG_data/28
(483, 16)
./EMG_data/29
(523, 16)
./EMG_data/30
(734, 16)
./EMG_data/31
(452, 16)
./EMG_data/32
(610, 16)
./EMG_data/33
(554, 16)
./EMG_data/34
(723, 16)
./EMG_data/35
(487, 16)
./EMG_data/36
(504, 16)


In [3]:
#average over different train/test splits of the data
lr_results_df = lr_results_df.groupby(['Shuffled','Type','Train_Subject','Test_Subject'],as_index = False)\
.mean()\
.drop(columns = ['Fold','Rep'])

# Output summary
display(HTML(lr_results_df.groupby(['Shuffled','Type']).mean().drop(columns = ['Train_Subject','Test_Subject']).to_html()))


| Shuffled | Type | F1_score |
|---|---|---|
| False | Test | 0.969033 |
| False | Train | 0.9999 |
| True | Test | 0.163066 |
| True | Train | 0.411994 |


### Across-subject performance

&emsp; The code block below assesses model performance across subjects. The classifier is trained on data from one subject and tested on data from all other subjects. I exclude unlabeled timepoints (class 0) as well as timepoints with a label that was not collected for all subjects (class 7). This avoids further complications when comparing model performance across subjects.
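
The class exclusion and the train/test pairing can be illustrated on toy data. The helper `drop_classes` is hypothetical; the actual filtering happens inside `log_reg_xsubject_labeled_data` via its `exclude` argument, whose implementation is not shown in this notebook.

```python
import numpy as np
from itertools import permutations


def drop_classes(features, labels, exclude=(0, 7)):
    """Keep only the windows whose label is not in `exclude`."""
    labels = np.asarray(labels)
    keep = ~np.isin(labels, exclude)
    return features[keep], labels[keep]


# Toy data: 8 windows, 16 features; labels 0 (unlabeled) and 7 get removed
X = np.arange(8 * 16, dtype=float).reshape(8, 16)
y = np.array([0, 1, 2, 7, 3, 0, 6, 7])
X_kept, y_kept = drop_classes(X, y)
print(X_kept.shape, y_kept)

# Ordered (train subject, test subject) pairs for 36 subjects, train != test
pairs = list(permutations(range(1, 37), 2))
print(len(pairs))
```

With 36 subjects there are 36 * 35 ordered pairs, so each subject serves as the training source for 35 separate test evaluations.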

Results are written to an hdf5 file.

In [4]:
# Set seed for replicability
np.random.seed(1)


lr_xsubj_results_df = []


for src_subject_id in range(1,nsubjects+1):


    subject_folder = os.path.join(data_folder,'%02d'%(src_subject_id))
    print('Source Subject :%s'%(subject_folder))

    # Process data and get features 
    feature_matrix0, target_labels0, window_tstamps0, \
    block_labels0, series_labels0 = get_subject_data_for_classification(subject_folder, lo_freq, hi_freq, \
                                                                      win_size, step)
    # resulting feature_matrix has dimension [nsamples, nfeatures]
    print(feature_matrix0.shape)

    train_f1_scores = np.empty((0,))
    test_f1_scores = np.empty((0,))
    train_f1_scores_perm = np.empty((0,))
    test_f1_scores_perm = np.empty((0,))
    
    targ_subject_list = []
    src_subject_list = []

    for targ_subject_id in range(1,nsubjects+1):
        if targ_subject_id != src_subject_id:
            
            #define folder
            subject_folder = os.path.join(data_folder,'%02d'%(targ_subject_id))
            print('Target Subject :%s'%(subject_folder))

            # Process data and get features from target subject
            feature_matrix1, target_labels1, window_tstamps1, \
            block_labels1, series_labels1 = get_subject_data_for_classification(subject_folder, lo_freq, hi_freq, \
                                                                              win_size, step)

            # resulting feature_matrix has dimension [nsamples, nfeatures]
            print(feature_matrix1.shape)


            # train classifier model on source subject data and test on target subject data
            train_f1, test_f1 = log_reg_xsubject_labeled_data(feature_matrix0, target_labels0, feature_matrix1, target_labels1,\
                                                              exclude = [0,7],penalty = 'none', multiclass = 'multinomial',permute = False)
            train_f1_scores = np.hstack((train_f1_scores,train_f1))
            test_f1_scores = np.hstack((test_f1_scores,test_f1))
            
            src_subject_list.append(src_subject_id)
            targ_subject_list.append(targ_subject_id)
            
            #repeat classification with permuted data
            train_f1, test_f1 = log_reg_xsubject_labeled_data(feature_matrix0, target_labels0, feature_matrix1, target_labels1,\
                                                              exclude = [0,7],penalty = 'none', multiclass = 'multinomial',permute = True)
            train_f1_scores_perm = np.hstack((train_f1_scores_perm,train_f1))
            test_f1_scores_perm = np.hstack((test_f1_scores_perm,test_f1))

    #put results in a dataframe
    lr_xsubj_results_df.append(pd.DataFrame({'F1_score':test_f1_scores,\
                                             'Shuffled':[False for x in range(test_f1_scores.size)],\
                                             'Type':['Test' for x in range(test_f1_scores.size)],\
                                             'Train_Subject':src_subject_list,\
                                             'Test_Subject':targ_subject_list,\
                                              }))
    
    lr_xsubj_results_df.append(pd.DataFrame({'F1_score':test_f1_scores_perm,\
                                             'Shuffled':[True for x in range(test_f1_scores_perm.size)],\
                                             'Type':['Test' for x in range(test_f1_scores_perm.size)],\
                                             'Train_Subject':src_subject_list,\
                                             'Test_Subject':targ_subject_list,\
                                              }))
    
    
#concatenate all dataframes
lr_xsubj_results_df = pd.concat(lr_xsubj_results_df, axis =0)

#write cross-subject results to file
results_folder =  os.path.join(data_folder,'..','results_data','log_reg')
results_fn = 'all_subjects_across_subject_results.h5'
lr_xsubj_results_df.to_hdf(os.path.join(results_folder,results_fn), key='results_df', mode='w')



Source Subject :./EMG_data/01
(604, 16)
Target Subject :./EMG_data/02
(681, 16)
Target Subject :./EMG_data/03
(522, 16)
Target Subject :./EMG_data/04
(571, 16)
Target Subject :./EMG_data/05
(530, 16)
Target Subject :./EMG_data/06
(499, 16)
Target Subject :./EMG_data/07
(656, 16)
Target Subject :./EMG_data/08
(582, 16)
Target Subject :./EMG_data/09
(640, 16)
Target Subject :./EMG_data/10
(619, 16)
Target Subject :./EMG_data/11
(735, 16)
Target Subject :./EMG_data/12
(632, 16)
Target Subject :./EMG_data/13
(780, 16)
Target Subject :./EMG_data/14
(495, 16)
Target Subject :./EMG_data/15
(519, 16)
Target Subject :./EMG_data/16
(523, 16)
Target Subject :./EMG_data/17
(661, 16)
Target Subject :./EMG_data/18
(645, 16)
Target Subject :./EMG_data/19
(541, 16)
Target Subject :./EMG_data/20
(641, 16)
Target Subject :./EMG_data/21
(599, 16)
Target Subject :./EMG_data/22
(589, 16)
Target Subject :./EMG_data/23
(573, 16)
Target Subject :./EMG_data/24
(578, 16)
Target Subject :./EMG_data/25
(577, 16)


(533, 16)
Target Subject :./EMG_data/27
(526, 16)
Target Subject :./EMG_data/28
(483, 16)
Target Subject :./EMG_data/29
(523, 16)
Target Subject :./EMG_data/30
(734, 16)
Target Subject :./EMG_data/31
(452, 16)
Target Subject :./EMG_data/32
(610, 16)
Target Subject :./EMG_data/33
(554, 16)
Target Subject :./EMG_data/34
(723, 16)
Target Subject :./EMG_data/35
(487, 16)
Target Subject :./EMG_data/36
(504, 16)
Source Subject :./EMG_data/07
(656, 16)
Target Subject :./EMG_data/01
(604, 16)
Target Subject :./EMG_data/02
(681, 16)
Target Subject :./EMG_data/03
(522, 16)
Target Subject :./EMG_data/04
(571, 16)
Target Subject :./EMG_data/05
(530, 16)
Target Subject :./EMG_data/06
(499, 16)
Target Subject :./EMG_data/08
(582, 16)
Target Subject :./EMG_data/09
(640, 16)
Target Subject :./EMG_data/10
(619, 16)
Target Subject :./EMG_data/11
(735, 16)
Target Subject :./EMG_data/12
(632, 16)
Target Subject :./EMG_data/13
(780, 16)
Target Subject :./EMG_data/14
(495, 16)
Target Subject :./EMG_data/15


(519, 16)
Target Subject :./EMG_data/16
(523, 16)
Target Subject :./EMG_data/17
(661, 16)
Target Subject :./EMG_data/18
(645, 16)
Target Subject :./EMG_data/19
(541, 16)
Target Subject :./EMG_data/20
(641, 16)
Target Subject :./EMG_data/21
(599, 16)
Target Subject :./EMG_data/22
(589, 16)
Target Subject :./EMG_data/23
(573, 16)
Target Subject :./EMG_data/24
(578, 16)
Target Subject :./EMG_data/25
(577, 16)
Target Subject :./EMG_data/26
(533, 16)
Target Subject :./EMG_data/27
(526, 16)
Target Subject :./EMG_data/28
(483, 16)
Target Subject :./EMG_data/29
(523, 16)
Target Subject :./EMG_data/30
(734, 16)
Target Subject :./EMG_data/31
(452, 16)
Target Subject :./EMG_data/32
(610, 16)
Target Subject :./EMG_data/33
(554, 16)
Target Subject :./EMG_data/34
(723, 16)
Target Subject :./EMG_data/35
(487, 16)
Target Subject :./EMG_data/36
(504, 16)
Source Subject :./EMG_data/13
(780, 16)
Target Subject :./EMG_data/01
(604, 16)
Target Subject :./EMG_data/02
(681, 16)
Target Subject :./EMG_data/03


(522, 16)
Target Subject :./EMG_data/04
(571, 16)
Target Subject :./EMG_data/05
(530, 16)
Target Subject :./EMG_data/06
(499, 16)
Target Subject :./EMG_data/07
(656, 16)
Target Subject :./EMG_data/08
(582, 16)
Target Subject :./EMG_data/09
(640, 16)
Target Subject :./EMG_data/10
(619, 16)
Target Subject :./EMG_data/11
(735, 16)
Target Subject :./EMG_data/12
(632, 16)
Target Subject :./EMG_data/13
(780, 16)
Target Subject :./EMG_data/14
(495, 16)
Target Subject :./EMG_data/15
(519, 16)
Target Subject :./EMG_data/16
(523, 16)
Target Subject :./EMG_data/17
(661, 16)
Target Subject :./EMG_data/19
(541, 16)
Target Subject :./EMG_data/20
(641, 16)
Target Subject :./EMG_data/21
(599, 16)
Target Subject :./EMG_data/22
(589, 16)
Target Subject :./EMG_data/23
(573, 16)
Target Subject :./EMG_data/24
(578, 16)
Target Subject :./EMG_data/25
(577, 16)
Target Subject :./EMG_data/26
(533, 16)
Target Subject :./EMG_data/27
(526, 16)
Target Subject :./EMG_data/28
(483, 16)
Target Subject :./EMG_data/29


(523, 16)
Target Subject :./EMG_data/30
(734, 16)
Target Subject :./EMG_data/31
(452, 16)
Target Subject :./EMG_data/32
(610, 16)
Target Subject :./EMG_data/33
(554, 16)
Target Subject :./EMG_data/34
(723, 16)
Target Subject :./EMG_data/35
(487, 16)
Target Subject :./EMG_data/36
(504, 16)
Source Subject :./EMG_data/24
(578, 16)
Target Subject :./EMG_data/01
(604, 16)
Target Subject :./EMG_data/02
(681, 16)
Target Subject :./EMG_data/03
(522, 16)
Target Subject :./EMG_data/04
(571, 16)
Target Subject :./EMG_data/05
(530, 16)
Target Subject :./EMG_data/06
(499, 16)
Target Subject :./EMG_data/07
(656, 16)
Target Subject :./EMG_data/08
(582, 16)
Target Subject :./EMG_data/09
(640, 16)
Target Subject :./EMG_data/10
(619, 16)
Target Subject :./EMG_data/11
(735, 16)
Target Subject :./EMG_data/12
(632, 16)
Target Subject :./EMG_data/13
(780, 16)
Target Subject :./EMG_data/14
(495, 16)
Target Subject :./EMG_data/15
(519, 16)
Target Subject :./EMG_data/16
(523, 16)
Target Subject :./EMG_data/17


(661, 16)
Target Subject :./EMG_data/18
(645, 16)
Target Subject :./EMG_data/19
(541, 16)
Target Subject :./EMG_data/20
(641, 16)
Target Subject :./EMG_data/21
(599, 16)
Target Subject :./EMG_data/22
(589, 16)
Target Subject :./EMG_data/23
(573, 16)
Target Subject :./EMG_data/24
(578, 16)
Target Subject :./EMG_data/25
(577, 16)
Target Subject :./EMG_data/26
(533, 16)
Target Subject :./EMG_data/27
(526, 16)
Target Subject :./EMG_data/28
(483, 16)
Target Subject :./EMG_data/30
(734, 16)
Target Subject :./EMG_data/31
(452, 16)
Target Subject :./EMG_data/32
(610, 16)
Target Subject :./EMG_data/33
(554, 16)
Target Subject :./EMG_data/34
(723, 16)
Target Subject :./EMG_data/35
(487, 16)
Target Subject :./EMG_data/36
(504, 16)
Source Subject :./EMG_data/30
(734, 16)
Target Subject :./EMG_data/01
(604, 16)
Target Subject :./EMG_data/02
(681, 16)
Target Subject :./EMG_data/03
(522, 16)
Target Subject :./EMG_data/04
(571, 16)
Target Subject :./EMG_data/05
(530, 16)
Target Subject :./EMG_data/06


(499, 16)
Target Subject :./EMG_data/07
(656, 16)
Target Subject :./EMG_data/08
(582, 16)
Target Subject :./EMG_data/09
(640, 16)
Target Subject :./EMG_data/10
(619, 16)
Target Subject :./EMG_data/11
(735, 16)
Target Subject :./EMG_data/12
(632, 16)
Target Subject :./EMG_data/13
(780, 16)
Target Subject :./EMG_data/14
(495, 16)
Target Subject :./EMG_data/15
(519, 16)
Target Subject :./EMG_data/16
(523, 16)
Target Subject :./EMG_data/17
(661, 16)
Target Subject :./EMG_data/18
(645, 16)
Target Subject :./EMG_data/19
(541, 16)
Target Subject :./EMG_data/20
(641, 16)
Target Subject :./EMG_data/21
(599, 16)
Target Subject :./EMG_data/22
(589, 16)
Target Subject :./EMG_data/23
(573, 16)
Target Subject :./EMG_data/24
(578, 16)
Target Subject :./EMG_data/25
(577, 16)
Target Subject :./EMG_data/26
(533, 16)
Target Subject :./EMG_data/27
(526, 16)
Target Subject :./EMG_data/28
(483, 16)
Target Subject :./EMG_data/29
(523, 16)
Target Subject :./EMG_data/30
(734, 16)
Target Subject :./EMG_data/31


In [5]:
#average over test subjects not used for training
lr_xsubj_results_df = lr_xsubj_results_df.groupby(['Shuffled','Type','Train_Subject'],as_index = False)\
.mean()\
.drop(columns = ['Test_Subject'])

# Output summary
display(HTML(lr_xsubj_results_df.groupby(['Shuffled','Type']).mean().drop(columns = ['Train_Subject']).to_html()))

| Shuffled | Type | F1_score |
|---|---|---|
| False | Test | 0.561763 |
| True | Test | 0.144078 |
