# Virtual Piano Tutor

## Project Description

This notebook tracks my progress on the Virtual Piano Tutor (VPT) project...

- Best Classifier as of 11/30
    - SVM {'C': 100, 'gamma': 0.1, 'kernel': 'rbf'}

## TODO List

- TODAY
    - Decide on RDF model to keep for rest of project
    - Work on RDF data and annotations
    - Add results to file
    - Rewrite RDF for GridSearchCV
        - Extend RDF
    - Work on ideas for paper
        - Visualizations
    - Play with CAE
    - How to automate this...
    - Windowing/Summarizing
    
- DONE
    - ~~Organize RDF data~~
    - ~~Generate data from already extracted hands...~~
    - ~~Get notebook running on Compute Canada~~
    - ~~Get data on Compute Canada~~
    - ~~Set up CAE to deal with hand images~~
    - ~~Set up data for training autoencoder on LH and RH~~
    - ~~Train autoencoder for LH and RH~~
    

- Bad Segmentation
    - p3c - left hand (not terrible)
    - p1s - right hand (shouldn't use)
    - p5a - Both could use some work, but it still captures most of the left hand (RH not so good...)
    - p5c - not good (left hand passable...)
    
- Add noise to CAE
    - http://scikit-image.org/docs/dev/api/skimage.util.html#random-noise
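
The noise idea above could be prototyped before touching the CAE itself. The sketch below is a NumPy-only stand-in for the Gaussian mode of the `random_noise` function linked above. It assumes the hand images are floats scaled to [0, 1]; `add_gaussian_noise` and the default `var` are illustrative choices, not part of the VPT code.

```python
import numpy as np

def add_gaussian_noise(images, var=0.01, seed=None):
    """Return a noisy copy of `images` (floats in [0, 1]), clipped back to [0, 1]."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=np.sqrt(var), size=images.shape)
    return np.clip(images + noise, 0.0, 1.0)

# Hypothetical batch: four 180x180 crops containing a bright square.
clean = np.zeros((4, 180, 180), dtype=np.float64)
clean[:, 60:120, 60:120] = 1.0
noisy = add_gaussian_noise(clean, var=0.01, seed=0)
```

For a denoising setup, the autoencoder would be fed `noisy` as input and `clean` as the reconstruction target.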
    
- ~~Multiple Participants~~
    - ~~have one holdout set participant~~
        - ~~Test with p1&2 training p3 testing, then p1&3...~~
    - ~~have one holdout set exercise~~

- Test with RH too

- Windowing data
    - Summarize data for classification
    - Majority Voting (or with probabilities)
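
One way the "with probabilities" variant might look: average the per-frame class probabilities inside each window and take the argmax. This is only a sketch. It assumes the classifier exposes probabilities (for an `SVC` that means `probability=True`, which the pipelines in this notebook do not set), and `soft_vote` plus the sample values are made up for illustration.

```python
import numpy as np

def soft_vote(proba, window_size):
    """Average class probabilities over consecutive windows and pick the argmax."""
    votes = []
    for start in range(0, len(proba), window_size):
        window = proba[start:start + window_size]
        votes.append(int(np.argmax(window.mean(axis=0))))
    return np.array(votes)

# Made-up probabilities for 6 frames and 3 classes (not real model output).
proba = np.array([
    [0.6, 0.3, 0.1],
    [0.4, 0.5, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
])
window_labels = soft_vote(proba, window_size=3)
```

Compared with a hard majority vote, a single confident frame can outweigh several uncertain ones, which may or may not be desirable here.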

- Look for other features
    - Others??
    - ~~Autoencoder features~~
    - ~~HONV~~
    
- Work on hand segmentation
    - See p1e for bad examples
    - How to validate segmentation?
        - Statistical analysis on length and width ratios
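
A rough sketch of the ratio idea: measure the bounding-box height/width ratio of each segmented hand mask and flag ratios that fall far outside the bulk of the data. The function names, the IQR rule and the factor `k=1.5` are assumptions chosen for illustration, not a validated procedure.

```python
import numpy as np

def bbox_aspect_ratio(mask):
    """Height/width ratio of the bounding box around the nonzero pixels of `mask`."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        return np.nan  # empty mask: nothing was segmented
    height = rows[-1] - rows[0] + 1
    width = cols[-1] - cols[0] + 1
    return height / width

def flag_outliers(ratios, k=1.5):
    """Mark ratios outside [Q1 - k*IQR, Q3 + k*IQR]; NaN ratios are flagged too."""
    ratios = np.asarray(ratios, dtype=float)
    q1, q3 = np.nanpercentile(ratios, [25, 75])
    iqr = q3 - q1
    outside = (ratios < q1 - k * iqr) | (ratios > q3 + k * iqr)
    return outside | np.isnan(ratios)

# Hypothetical mask: a 100-row by 40-column blob in a 180x180 frame.
mask = np.zeros((180, 180), dtype=bool)
mask[40:140, 70:110] = True
ratio = bbox_aspect_ratio(mask)

# Hypothetical ratios for five frames; the last one looks suspicious.
flags = flag_outliers([2.4, 2.5, 2.6, 2.5, 0.4])
```

Flagged frames would still need a visual check, since an unusual ratio can also come from a legitimate hand posture.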
       
- Visualize !!!
    - Input 
    - Results !!!
        - F Scores
        - Accuracy
        - Try weighted instead of macro
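
To see how much the averaging choice matters, the sketch below computes macro and support-weighted F1 directly from a confusion matrix. It uses the LH confusion matrix reported in the Single Classifier section of this notebook; `f1_summary` is a helper written for this illustration.

```python
def f1_summary(cm):
    """Per-class, macro and support-weighted F1 from a confusion matrix.

    Rows of `cm` are true classes and columns are predicted classes.
    """
    n = len(cm)
    support = [sum(row) for row in cm]
    predicted = [sum(cm[r][c] for r in range(n)) for c in range(n)]
    per_class = []
    for k in range(n):
        denom = support[k] + predicted[k]
        # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (support + predicted)
        per_class.append(2 * cm[k][k] / denom if denom else 0.0)
    macro = sum(per_class) / n
    weighted = sum(f * s for f, s in zip(per_class, support)) / sum(support)
    return per_class, macro, weighted

# LH confusion matrix from the Single Classifier results.
cm_lh = [[0, 28, 1],
         [35, 279, 157],
         [0, 7, 8]]
per_class, macro, weighted = f1_summary(cm_lh)
print("macro: {:.3f}  weighted: {:.3f}".format(macro, weighted))
```

On this matrix the macro average is roughly 0.27 while the weighted average is roughly 0.65, because class 1 dominates the support. The weighted score looks better but hides the failure on classes 0 and 2.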




- Finish Project Description

- ~~Turn into functions~~
    
- ~~Verify Segmentation~~
    - have only done basic verification
    
- ~~FIRST THING: Test by ignoring training data (p1s) and then using train_test_split on recordings~~
    - ~~Data should be ready for splitting~~
    
- ~~Remove data from testing to find culprit~~
    
- ~~Track my progress better !!! (duh through notebooks!)~~

# Setup

## Libraries

In [1]:
import os
import re
import warnings

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import cv2

from vpt.features.features import *
import vpt.utils.image_processing as ip
import vpt.settings as s
import vpt.hand_detection.depth_context_features as dcf

%load_ext autoreload
%autoreload 2

## Some helper functions

### Load/Save Data Set

In [2]:
def load_data(testing_p, M, radius, feature_type, data_type):
    base = "data/posture/extracted/"
    data_path = os.path.join(base, "{}_M{}_rad{:0.2f}_{}_".format(testing_p, M, radius, feature_type))
    data = np.load(data_path + data_type + "_data_combined.npz")    
    return data
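
The heading mentions saving, but only the loader is shown. A matching writer might look like the sketch below. `save_data` is hypothetical: it reuses the filename pattern from `load_data`, takes the base directory as an argument instead of hard-coding it, and stores whatever arrays are passed as keywords.

```python
import os
import tempfile

import numpy as np

def save_data(base, testing_p, M, radius, feature_type, data_type, **arrays):
    """Write arrays to an .npz file named with the pattern that load_data reads."""
    prefix = "{}_M{}_rad{:0.2f}_{}_".format(testing_p, M, radius, feature_type)
    path = os.path.join(base, prefix + data_type + "_data_combined.npz")
    os.makedirs(base, exist_ok=True)
    np.savez_compressed(path, **arrays)
    return path

# Round trip in a temporary directory with small dummy arrays.
with tempfile.TemporaryDirectory() as tmp:
    path = save_data(tmp, "p3", 5, 0.15, "hog", "train",
                     X_lh=np.zeros((2, 1089)), y_lh=np.array([0, 1]))
    file_name = os.path.basename(path)
    with np.load(path) as data:
        loaded_shape = data["X_lh"].shape
```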

## Project Setup

### Generate or Load Data

In [50]:
M = 5
radius = .15
feature_type = "hog"
testing_p = "p3"

In [23]:
#### Load the combined data set for all participants
data = load_data("all_participants", M, radius, feature_type, "train")
print("X LH", data["X_lh"].shape, "y LH", data["y_lh"].shape, data["vis_lhs"].shape)
print("X RH", data["X_rh"].shape, "y RH", data["y_rh"].shape, data["vis_rhs"].shape)
print("Filenames", data["filenames"].shape)

X LH (15818, 1089) y LH (15818,) (15818, 180, 180)
X RH (15818, 1089) y RH (15818,) (15818, 180, 180)
Filenames (15818,)


In [83]:
r = re.compile(testing_p)
vmatch = np.vectorize(lambda x:bool(r.search(x)))
val_p = vmatch(data['filenames'])

X_lh = data['X_lh'][val_p]
y_lh = data['y_lh'][val_p]
X_rh = data['X_rh'][val_p]
y_rh = data['y_rh'][val_p]
filenames = data['filenames'][val_p]

X_comb = np.vstack((X_lh, X_rh))
y_comb = np.hstack((y_lh, y_rh))
filenames_comb = np.hstack((filenames, filenames))

print("{} Data:".format(testing_p))
print("LH:", X_lh.shape, y_lh.shape)
print("RH:", X_rh.shape, y_rh.shape)
print(filenames.shape)
print("Comb:", X_comb.shape, y_comb.shape)
print(filenames_comb.shape)

p3 Data:
LH: (4572, 1089) (4572,)
RH: (4572, 1089) (4572,)
(4572,)
Comb: (9144, 1089) (9144,)
(9144,)
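
The regex mask built with `np.vectorize` is repeated in several cells of this notebook. A small helper along these lines could replace those copies. `filename_mask` and the example filenames are illustrative; the real naming scheme may differ.

```python
import re

import numpy as np

def filename_mask(filenames, pattern):
    """Boolean mask that is True where `pattern` (a regex) matches a filename."""
    regex = re.compile(pattern)
    return np.array([bool(regex.search(str(name))) for name in filenames], dtype=bool)

# Hypothetical filenames.
names = np.array(["p1a/000.bin", "p3s/001.bin", "p3e/002.bin", "p3b/003.bin"])
is_p3 = filename_mask(names, r"p3")
is_static = filename_mask(names, r"p\ds")
is_exercise_e = filename_mask(names, r"p\de")
```

Unlike `np.vectorize` without `otypes`, the list comprehension also copes with an empty filename array.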


# Classification

### Libraries

In [44]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, LeaveOneGroupOut
from imblearn.pipeline import Pipeline

from sklearn.decomposition import PCA

from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

## Load Data for Classification

In [84]:
## find and remove all "static" data so we can ignore for now
# r = re.compile('{}s'.format(testing_p))
r = re.compile(r'p[\d]s')

# remove p#s data
vmatch = np.vectorize(lambda x:bool(r.search(x)))
rem_static = vmatch(filenames)
rem_static_comb = vmatch(filenames_comb)

X_lh, y_lh, filenames = X_lh[~rem_static], y_lh[~rem_static], filenames[~rem_static]
X_rh, y_rh = X_rh[~rem_static], y_rh[~rem_static]
X_comb, y_comb, filenames_comb = X_comb[~rem_static_comb], y_comb[~rem_static_comb], filenames_comb[~rem_static_comb]

In [85]:
## Train test split on exercise e
r = re.compile(r'p[\d]e')

# select the exercise "e" recordings (p#e) as the held-out test set
vmatch = np.vectorize(lambda x:bool(r.search(x)))
split = vmatch(filenames)
split_comb = vmatch(filenames_comb)

X_lh_train, y_lh_train, filenames_train = X_lh[~split], y_lh[~split], filenames[~split]
X_rh_train, y_rh_train, filenames_train = X_rh[~split], y_rh[~split], filenames[~split]
X_comb_train, y_comb_train, filenames_comb_train = X_comb[~split_comb], y_comb[~split_comb], filenames_comb[~split_comb]

X_lh_test, y_lh_test, filenames_test = X_lh[split], y_lh[split], filenames[split]
X_rh_test, y_rh_test, filenames_test = X_rh[split], y_rh[split], filenames[split]
X_comb_test, y_comb_test, filenames_comb_test = X_comb[split_comb], y_comb[split_comb], filenames_comb[split_comb]

In [86]:
print("Training Data")
print("LH:", X_lh_train.shape, y_lh_train.shape)
print(np.unique(y_lh_train, return_counts=True))
print("RH:", X_rh_train.shape, y_rh_train.shape)
print(filenames_train.shape)
print(np.unique(y_rh_train, return_counts=True))
print("Combined:", X_comb_train.shape, y_comb_train.shape)
print(filenames_comb_train.shape)
print(np.unique(y_comb_train, return_counts=True))
print()
print()
print("Test Data")
print("LH:", X_lh_test.shape, y_lh_test.shape)
print(np.unique(y_lh_test, return_counts=True))
print("RH:", X_rh_test.shape, y_rh_test.shape)
print(np.unique(y_rh_test, return_counts=True))
print("Combined:", X_comb_test.shape, y_comb_test.shape)
print(np.unique(y_comb_test, return_counts=True))
print(filenames_comb_test.shape)

Training Data
LH: (2135, 1089) (2135,)
(array([0, 1, 2]), array([1238,  865,   32]))
RH: (2135, 1089) (2135,)
(2135,)
(array([0]), array([2135]))
Combined: (4270, 1089) (4270,)
(4270,)
(array([0, 1, 2]), array([3373,  865,   32]))


Test Data
LH: (515, 1089) (515,)
(array([0, 1, 2]), array([ 29, 471,  15]))
RH: (515, 1089) (515,)
(array([0]), array([515]))
Combined: (1030, 1089) (1030,)
(array([0, 1, 2]), array([544, 471,  15]))
(1030,)
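
The class counts above are heavily skewed (only 32 LH training samples of class 2). This notebook handles that with SMOTE. Purely as a point of comparison, the sketch below computes the weights that the common "balanced" heuristic, `n_samples / (n_classes * count)`, would assign to the LH training counts. It is an alternative, not what the pipelines here do.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

# LH training counts printed above.
lh_train_counts = {0: 1238, 1: 865, 2: 32}
weights = balanced_class_weights(lh_train_counts)
for label, weight in sorted(weights.items()):
    print(label, round(weight, 2))
```

Class 2 ends up with a weight of about 22, compared with roughly 0.57 and 0.82 for classes 0 and 1.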


## Model Testing

### SVM

### Single Classifier

In [55]:
# steps = [('PCA', PCA(n_components=1500)), ('SMOTE', SMOTE(kind="borderline2")), ("SVC", SVC(C=10, gamma=.00001, kernel='rbf', probability=False))]
steps = [('SMOTE', SMOTE(kind="borderline2")), ("SVC", SVC(C=1, gamma=.001, kernel='rbf', decision_function_shape='ovo', probability=False))]

In [56]:
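from sklearn.base import clone

# Pipeline does not copy its steps, so three pipelines built from the same
# `steps` list would share one SMOTE and one SVC instance. Cloning gives each
# pipeline its own unfitted estimators.
print("Training LH Classifier")
pipeline_lh = clone(Pipeline(steps))
pipeline_lh.fit(X_lh_train, y_lh_train)

print("Training RH Classifier")
pipeline_rh = clone(Pipeline(steps))
pipeline_rh.fit(X_rh_train, y_rh_train)

print("Training Combined Classifier")
pipeline_comb = clone(Pipeline(steps))
pipeline_comb.fit(X_comb_train, y_comb_train)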

Training LH Classifier
Training RH Classifier
Training Combined Classifier


Pipeline(memory=None,
     steps=[('SMOTE', SMOTE(k=None, k_neighbors=5, kind='borderline2', m=None, m_neighbors=10,
   n_jobs=1, out_step=0.5, random_state=None, ratio='auto',
   svm_estimator=None)), ('SVC', SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovo', degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

In [57]:
print("Predicting LH..")
y_lh_true, y_lh_pred = y_lh_test, pipeline_lh.predict(X_lh_test)
print("LH Validation Score:", accuracy_score(y_lh_true, y_lh_pred))
print("LH Confusion Matrix:\n", confusion_matrix(y_lh_true, y_lh_pred))
print(classification_report(y_lh_true, y_lh_pred))

print("Predicting RH...")
y_rh_true, y_rh_pred = y_rh_test, pipeline_rh.predict(X_rh_test)
print("RH Validation Score:", accuracy_score(y_rh_true, y_rh_pred))
print("RH Confusion Matrix:\n", confusion_matrix(y_rh_true, y_rh_pred))
print(classification_report(y_rh_true, y_rh_pred))

print("Predicting Comb...")
y_comb_true, y_comb_pred = y_comb_test, pipeline_comb.predict(X_comb_test)
print("Comb Validation Score:", accuracy_score(y_comb_true, y_comb_pred))
print("Comb Confusion Matrix:\n", confusion_matrix(y_comb_true, y_comb_pred))
print(classification_report(y_comb_true, y_comb_pred))

Predicting LH..
LH Validation Score: 0.557281553398
LH Confusion Matrix:
 [[  0  28   1]
 [ 35 279 157]
 [  0   7   8]]
             precision    recall  f1-score   support

          0       0.00      0.00      0.00        29
          1       0.89      0.59      0.71       471
          2       0.05      0.53      0.09        15

avg / total       0.81      0.56      0.65       515

Predicting RH...
RH Validation Score: 0.988349514563
RH Confusion Matrix:
 [[509   6]
 [  0   0]]
             precision    recall  f1-score   support

          0       1.00      0.99      0.99       515
          2       0.00      0.00      0.00         0

avg / total       1.00      0.99      0.99       515

Predicting Comb...


Comb Validatation Score: 0.772815533981
Comb Confustion Matrix:
 [[509  28   7]
 [ 35 279 157]
 [  0   7   8]]
             precision    recall  f1-score   support

          0       0.94      0.94      0.94       544
          1       0.89      0.59      0.71       471
          2       0.05      0.53      0.09        15

avg / total       0.90      0.77      0.82      1030



In [58]:
window_size = 15
y_comb_true_maj = []
for i in range(0, y_comb_true.size, window_size):
    u, counts = np.unique(y_comb_true[i:i+window_size], return_counts=True)
    pred = u[np.argmax(counts)]
    y_comb_true_maj.append(pred)
    
y_comb_true_maj = np.array(y_comb_true_maj)

In [59]:
y_comb_pred_maj = []
for i in range(0, y_comb_pred.size, window_size):
    u, counts = np.unique(y_comb_pred[i:i+window_size], return_counts=True)
    pred = u[np.argmax(counts)]
    y_comb_pred_maj.append(pred)
    
y_comb_pred_maj = np.array(y_comb_pred_maj)
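
The two windowing loops above are identical apart from their input, so they could share one helper. The sketch below keeps the same tie-breaking behaviour (smallest label wins, as with `np.unique` followed by `np.argmax`); `majority_vote` is a name introduced here.

```python
def majority_vote(labels, window_size):
    """Most common label in each consecutive, non-overlapping window."""
    labels = list(labels)
    voted = []
    for start in range(0, len(labels), window_size):
        window = labels[start:start + window_size]
        # Iterating the candidates in sorted order makes max() return the
        # smallest label when several labels are equally frequent.
        best = max(sorted(set(window)), key=window.count)
        voted.append(best)
    return voted

example = majority_vote([0, 0, 1, 1, 1, 2, 2], window_size=3)
tie = majority_vote([1, 0], window_size=2)
```

Windows follow the order of the input array, so when LH and RH data are stacked a window can straddle the boundary between hands or recordings. Voting per recording would avoid that.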

In [60]:
print("Comb Maj Validation Score:", accuracy_score(y_comb_true_maj, y_comb_pred_maj))
print("Comb Maj Confusion Matrix:\n", confusion_matrix(y_comb_true_maj, y_comb_pred_maj))
print(classification_report(y_comb_true_maj, y_comb_pred_maj))

rH Validatation Score: 0.840579710145
rH Confustion Matrix:
 [[35  2  0]
 [ 2 23  6]
 [ 0  1  0]]
             precision    recall  f1-score   support

          0       0.95      0.95      0.95        37
          1       0.88      0.74      0.81        31
          2       0.00      0.00      0.00         1

avg / total       0.90      0.84      0.87        69



### Cross Validation

In [61]:
from sklearn.model_selection import cross_val_score, cross_validate

In [87]:
groups_cv = np.ones_like(filenames_comb, dtype=int)*-1

group_names = ["{}{}".format(testing_p, ex) for ex in ["a", "b", "c", "d", "e"]]
print(group_names)

for i, p in enumerate(group_names):
    p_num = i
    groups_cv[np.where(np.char.find(filenames_comb, p) != -1)] = p_num
    
print(groups_cv.shape)
print(np.unique(groups_cv, return_counts=True))
print(groups_cv[:])

['p3a', 'p3b', 'p3c', 'p3d', 'p3e']
(5300,)
(array([0, 1, 2, 3, 4]), array([ 880, 1398,  858, 1134, 1030]))
[0 0 0 ..., 4 4 4]
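
The group ids could also be derived from the filenames in a single pass instead of looping over a fixed list of names. The sketch assumes each filename contains a tag such as `p3a` (participant digit plus exercise letter). `exercise_groups`, the regex and the sample filenames are illustrative.

```python
import re

def exercise_groups(filenames, pattern=r"p\d[a-e]"):
    """Map each filename to an integer group id based on its recording tag.

    Filenames without a tag get -1, mirroring the -1 initialisation above.
    """
    regex = re.compile(pattern)
    tags = []
    for name in filenames:
        match = regex.search(str(name))
        tags.append(match.group(0) if match else None)
    names = sorted({tag for tag in tags if tag is not None})
    index = {tag: i for i, tag in enumerate(names)}
    groups = [index.get(tag, -1) for tag in tags]
    return groups, names

# Hypothetical filenames.
groups, names = exercise_groups(["p3a/0.bin", "p3b/1.bin", "p3a/2.bin", "misc/3.bin"])
```

The returned list can be wrapped in `np.array` and passed as `groups` to `LeaveOneGroupOut.split`, after dropping or handling any `-1` entries.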


In [88]:
steps = [('SMOTE', SMOTE(kind="borderline2")), ("SVC", SVC(C=1, gamma=.001, kernel='rbf', probability=False))]
pipeline = Pipeline(steps)
scoring = ("f1_macro", "accuracy")
logo = LeaveOneGroupOut()
scores = cross_validate(pipeline, X_comb, y_comb, cv=logo.split(X_comb, y_comb, groups=groups_cv), scoring=scoring, n_jobs=2, verbose=10)

[CV]  ................................................................
[CV]  ................................................................


[CV]  , f1_macro=0.28887988109714907, accuracy=0.7646638054363376, total=  43.1s
[CV]  ................................................................


[Parallel(n_jobs=2)]: Done   1 tasks      | elapsed:  1.1min


[CV]  , f1_macro=0.39142461964038733, accuracy=0.6431818181818182, total= 1.2min
[CV]  ................................................................


[CV]  , f1_macro=0.9906751800027169, accuracy=0.9906759906759907, total= 1.4min
[CV]  ................................................................


[Parallel(n_jobs=2)]: Done   3 out of   5 | elapsed:  3.5min remaining:  2.3min


[CV]  , f1_macro=0.3618298272774185, accuracy=0.7945326278659612, total= 1.1min
[CV] ...... , f1_macro=0.5317005822678431, accuracy=0.8, total= 1.2min


[Parallel(n_jobs=2)]: Done   5 out of   5 | elapsed:  5.4min remaining:    0.0s
[Parallel(n_jobs=2)]: Done   5 out of   5 | elapsed:  5.4min finished


In [89]:
for k, v in scores.items():
    print(k, v)

fit_time [ 62.86090612  34.10787606  74.74845695  54.09495902  62.15966296]
score_time [  9.10481787   9.04074192  10.34519887  10.97806787  10.70870805]
test_f1_macro [ 0.39142462  0.28887988  0.99067518  0.36182983  0.53170058]
train_f1_macro [ 0.72207101  0.81161267  0.73108645  0.80671811  0.6966369 ]
test_accuracy [ 0.64318182  0.76466381  0.99067599  0.79453263  0.8       ]
train_accuracy [ 0.85475113  0.89800103  0.85772175  0.91238598  0.86135831]
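
A quick summary of the fold scores printed above makes the spread and the train/test gap easier to see. The numbers are copied from that output; `summarize` is a throwaway helper and `stdev` is the sample standard deviation.

```python
from statistics import mean, stdev

# Values copied from the cross_validate output above.
test_f1_macro = [0.39142462, 0.28887988, 0.99067518, 0.36182983, 0.53170058]
train_f1_macro = [0.72207101, 0.81161267, 0.73108645, 0.80671811, 0.6966369]

def summarize(name, values):
    """Print mean and sample standard deviation, and return the mean."""
    avg = mean(values)
    print("{}: {:.3f} +/- {:.3f}".format(name, avg, stdev(values)))
    return avg

test_mean = summarize("test_f1_macro", test_f1_macro)
train_mean = summarize("train_f1_macro", train_f1_macro)
gap = train_mean - test_mean
print("train/test gap: {:.3f}".format(gap))
```

The mean test macro F1 is about 0.51 against about 0.75 in training, and a single fold (0.99) pulls the test mean up considerably.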


In [91]:
steps = [('SMOTE', SMOTE(kind="borderline2")), ("SVC", SVC(C=1, gamma=.001, kernel='rbf', probability=False))]
pipeline = Pipeline(steps)

logo = LeaveOneGroupOut()
for i, (train_idxs, test_idxs) in enumerate(logo.split(X_comb, y_comb, groups=groups_cv)):
    
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        print("#### CV #: {} - Testing Group #: {} ####".format(i, np.unique(groups_cv[test_idxs])))
        print("Training...")
        pipeline.fit(X_comb[train_idxs], y_comb[train_idxs])
        print("Predicting...")
        y_true, y_pred = y_comb[test_idxs], pipeline.predict(X_comb[test_idxs])
        print("Comb Validation Score:", accuracy_score(y_true, y_pred))
        print("Comb Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
        print(classification_report(y_true, y_pred))

#### CV #: 0 - Testing Group #: [0] ####
Training...
Predicting...
Comb Validation Score: 0.655681818182
Comb Confusion Matrix:
 [[577 303]
 [  0   0]]
             precision    recall  f1-score   support

          0       1.00      0.66      0.79       880
          1       0.00      0.00      0.00         0

avg / total       1.00      0.66      0.79       880

#### CV #: 1 - Testing Group #: [1] ####
Training...
Predicting...
Comb Validation Score: 0.764663805436
Comb Confusion Matrix:
 [[1069   29    0]
 [ 268    0    0]
 [  32    0    0]]
             precision    recall  f1-score   support

          0       0.78      0.97      0.87      1098
          1       0.00      0.00      0.00       268
          2       0.00      0.00      0.00        32

avg / total       0.61      0.76      0.68      1398

#### CV #: 2 - Testing Group #: [2] ####
Training...
Predicting...
Comb Validation Score: 0.993006993007
Comb Confusion Matrix:
 [[429   0]
 [  6 423]]
             precision    rec
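
Some folds above print a 2x2 confusion matrix and others a 3x3 one, because a fold only reports the classes it happens to see. Fixing the label set keeps the matrices comparable across folds. scikit-learn's `confusion_matrix` accepts a `labels` argument for this; the dependency-free sketch below shows the same idea with made-up predictions.

```python
def confusion_matrix_fixed(y_true, y_pred, labels):
    """Confusion matrix with one row and column per entry of `labels`, even if unseen."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Made-up fold where only class 0 occurs in the ground truth.
y_true = [0, 0, 0, 0]
y_pred = [0, 1, 0, 1]
cm = confusion_matrix_fixed(y_true, y_pred, labels=[0, 1, 2])
for row in cm:
    print(row)
```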

### Hyper Tuning

In [92]:
## Parameters for SVMs
# steps = [('SMOTE', SMOTE()), ("SVC", SVC())]
# param_grid = [
# #   {'SVC__C': [1, 10, 100], 'SVC__kernel': ['linear'], 'SMOTE__kind': ['regular', 'borderline1', 'borderline2', 'svm']},
# #   {'SVC__C': [1, 10, 100], 'SVC__gamma': [.0001, .001, .01, .1], 'SVC__kernel': ['rbf'], 'SMOTE__kind': ['regular', 'borderline1', 'borderline2', 'svm']},
# #   {'SVC__C': [10, 100], 'SVC__gamma': [.000005, .00001, .00005,], 'SVC__kernel': ['rbf'], 'SMOTE__kind': ['regular', 'borderline1', 'borderline2']},
#  ]

steps = [('SMOTE', SMOTEENN()), ("SVC", SVC())]
param_grid = [
  {'SVC__C': [1, 10, 100, 1000], 'SVC__kernel': ['linear'], 'SMOTE__smote': [SMOTE(kind='regular'), SMOTE(kind='borderline1'), SMOTE(kind='borderline2'), SMOTE(kind='svm')]},
  {'SVC__C': [1, 10, 100, 1000], 'SVC__gamma': [.0001, .001, .01, .1], 'SVC__kernel': ['rbf'], 'SMOTE__smote': [SMOTE(kind='regular'), SMOTE(kind='borderline1'), SMOTE(kind='borderline2'), SMOTE(kind='svm')]},
 ]

pipeline = Pipeline(steps)

scores = ['f1', 'accuracy']
logo = LeaveOneGroupOut()

In [None]:
# Hyper Parameter Tuning
for score in scores:
    
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        print("## Tuning hyper-parameters for {}".format(score))
        print()

        if score == "accuracy":
            scoring = score
        else:
            scoring = '{}_macro'.format(score)      
        
        #### TRAIN COMBINED LH & RH
        clf_comb = GridSearchCV(pipeline, param_grid, cv=logo.split(X_comb, y_comb, groups=groups_cv), scoring=scoring, n_jobs=2, verbose=10)
        clf_comb.fit(X_comb, y_comb)

        print("Best Combined Parameters set found on data set:")
        print()
        print(clf_comb.best_params_)
        print()
        print("Grid scores on data set:")
        print()
        means = clf_comb.cv_results_['mean_test_score']
        stds  = clf_comb.cv_results_['std_test_score']
        for mean, std, params in zip(means, stds, clf_comb.cv_results_['params']):
            print("%0.3f (+/-%0.3f) for %r" % (mean, std, params))
        print()

## Tuning hyper-parameters for f1

Fitting 5 folds for each of 80 candidates, totalling 400 fits
[CV] SMOTE__smote=SMOTE(k=None, k_neighbors=5, kind='regular', m=None, m_neighbors=10, n_jobs=1,
   out_step=0.5, random_state=None, ratio='auto', svm_estimator=None), SVC__C=1, SVC__kernel=linear 
[CV] SMOTE__smote=SMOTE(k=None, k_neighbors=5, kind='regular', m=None, m_neighbors=10, n_jobs=1,
   out_step=0.5, random_state=None, ratio='auto', svm_estimator=None), SVC__C=1, SVC__kernel=linear 
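
The candidate count in the log follows directly from the shape of `param_grid`. The sketch below reproduces the arithmetic with plain Python, using strings as stand-ins for the `SMOTE` objects (an assumption made so the example has no dependencies); `count_candidates` is a helper written for this illustration.

```python
from math import prod

# Same grid shape as above, with strings standing in for the SMOTE objects.
smote_kinds = ["regular", "borderline1", "borderline2", "svm"]
param_grid = [
    {"SVC__C": [1, 10, 100, 1000], "SVC__kernel": ["linear"],
     "SMOTE__smote": smote_kinds},
    {"SVC__C": [1, 10, 100, 1000], "SVC__gamma": [0.0001, 0.001, 0.01, 0.1],
     "SVC__kernel": ["rbf"], "SMOTE__smote": smote_kinds},
]

def count_candidates(grid_list):
    """Sum, over the sub-grids, the product of the option counts per parameter."""
    return sum(prod(len(values) for values in grid.values()) for grid in grid_list)

n_candidates = count_candidates(param_grid)
n_folds = 5  # one fold per exercise group, p3a to p3e
print(n_candidates, n_candidates * n_folds)
```

This prints 80 and 400, matching the "Fitting 5 folds for each of 80 candidates, totalling 400 fits" line. The linear sub-grid contributes 16 candidates and the RBF sub-grid 64. Since the tuning loop runs once per entry in `scores`, the full search amounts to twice that many fits.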
