# Virtual Piano Tutor

## Project Description

This notebook tracks my progress on the Virtual Piano Tutor (VPT) project.

- Best Classifier as of 11/30
    - SVM {'C': 100, 'gamma': 0.1, 'kernel': 'rbf'}

## TODO List

- TODAY
    - Decide on RDF model to keep for rest of project
    - Work on RDF data and annotations
    - Add results to file
    - Rewrite RDF for GridSearchCV
        - Extend RDF
    - Work on ideas for paper
        - Visualizations
    - Play with CAE
    - How to automate this...
    - Windowing/Summarizing
    
- DONE
    - ~~Organize RDF data~~
    - ~~Generate data from already extracted hands...~~
    - ~~Get notebook running on Compute Canada~~
    - ~~Get data on Compute Canada~~
    - ~~Setup CAE to deal with hand images~~
    - ~~setup data for training autoencoder on LH and RH~~
    - ~~Train Autoencoder for LH and RH~~
    

- Bad Segmentation
    - p3c - left hand (not terrible)
    - p1s - right hand (shouldn't use)
    - p5a - Both could use some work but still captures most of the left hand (RH not so good...)
    - p5c - not good (left hand passable...)
    
- Add noise to CAE
    - http://scikit-image.org/docs/dev/api/skimage.util.html#random-noise
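
A minimal sketch of the noise idea, assuming hand images scaled to [0, 1] and zero-mean Gaussian noise. It is a NumPy-only stand-in for the Gaussian mode of the `random_noise` helper linked above, not that function itself; `add_gaussian_noise`, the variance of 0.01, and the toy 180x180 batch are illustrative assumptions.

```python
import numpy as np

def add_gaussian_noise(images, var=0.01, seed=None):
    """Return a noisy copy of `images` (floats in [0, 1]), clipped back to [0, 1]."""
    rng = np.random.default_rng(seed)
    images = np.asarray(images, dtype=np.float64)
    # zero-mean noise with the requested variance, one sample per pixel
    noise = rng.normal(loc=0.0, scale=np.sqrt(var), size=images.shape)
    return np.clip(images + noise, 0.0, 1.0)

# toy stand-in for a batch of hand images (hypothetical data, 180x180 like vis_lhs)
clean = np.full((4, 180, 180), 0.5)
noisy = add_gaussian_noise(clean, var=0.01, seed=0)
print(noisy.shape, noisy.min() >= 0.0, noisy.max() <= 1.0)
```

For a denoising setup, the autoencoder would take `noisy` as input and `clean` as the reconstruction target.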
    
- ~~Multiple Participants~~
    - ~~have one holdout set participant~~
        - ~~Test with p1&2 training p3 testing, then p1&3...~~
    - ~~have one holdout set exercise~~

- Test with RH too

- Windowing data
    - Summarize data for classification
    - Majority Voting (or with probabilities)

- Look for other features
    - Others??
    - ~~Autoencoder features~~
    - ~~HONV~~
    
- Work on hand segmentation
   - See p1e for bad examples
   - How to validate segmentation?
       - Statistical analysis on length and width ratios
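
One possible way to run the length/width check, sketched under assumptions: each segmented hand is available as a binary mask, the ratio is taken from the tight bounding box, and outliers are flagged with a median/MAD rule. The function names, the threshold, and the toy ratios are all hypothetical.

```python
import numpy as np

def bbox_aspect_ratio(mask):
    """Height/width ratio of the tight bounding box around nonzero pixels (NaN if empty)."""
    mask = np.asarray(mask)
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        return np.nan
    height = rows[-1] - rows[0] + 1
    width = cols[-1] - cols[0] + 1
    return height / width

def flag_outliers(ratios, z_thresh=3.0):
    """Flag ratios whose robust z-score (median/MAD based) exceeds z_thresh."""
    ratios = np.asarray(ratios, dtype=float)
    median = np.nanmedian(ratios)
    mad = np.nanmedian(np.abs(ratios - median))
    if mad == 0:
        # no spread to compare against; only empty masks are flagged
        return np.isnan(ratios)
    robust_z = 0.6745 * (ratios - median) / mad
    return np.isnan(ratios) | (np.abs(robust_z) > z_thresh)

# toy mask: a 60-pixel-tall, 30-pixel-wide blob inside a 180x180 frame
mask = np.zeros((180, 180), dtype=bool)
mask[50:110, 70:100] = True
ratio = bbox_aspect_ratio(mask)

# made-up ratios for five frames; the last one mimics a bad segmentation
flags = flag_outliers([2.0, 2.1, 1.9, 2.0, 6.0])
print(ratio, flags)
```

Frames flagged this way would still need a visual check, since an unusual ratio can also come from a legitimate posture.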
       
- Visualize !!!
    - Input 
    - Results !!!
        - F Scores
        - Accuracy
        - Try weighted instead of macro




- Finish Project Description

- ~~Turn into functions~~
    
- ~~Verify Segmentation~~
    - have only done basic verification
    
- ~~FIRST THING: Test by ignoring training data (p1s) and then using train_test_split on recordings~~
    - ~~Data should be ready for splitting~~
    
- ~~Remove data from testing to find culprit~~
    
- ~~Track my progress better !!! (duh through notebooks!)~~

# Setup

## Libraries

In [1]:
import os
import re

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import cv2

from vpt.features.features import *
import vpt.utils.image_processing as ip
import vpt.settings as s
import vpt.hand_detection.depth_context_features as dcf

%load_ext autoreload
%autoreload 2

## Some helper functions

### Load/Save Data Set

In [2]:
def load_data(testing_p, M, radius, feature_type, data_type):
    base = "data/posture/extracted/"
    data_path = os.path.join(base, "{}_M{}_rad{:0.2f}_{}_".format(testing_p, M, radius, feature_type))
    data = np.load(data_path + data_type + "_data_combined.npz")    
    return data
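
To make the file naming explicit, here is a small sketch that rebuilds the path the same way `load_data` does, without reading anything from disk. `data_path_for` is a hypothetical helper added only for illustration; the parameter values are the ones used later in this notebook.

```python
import os

def data_path_for(testing_p, M, radius, feature_type, data_type):
    """Build the .npz path with the same format string load_data uses."""
    base = "data/posture/extracted/"
    prefix = os.path.join(base, "{}_M{}_rad{:0.2f}_{}_".format(testing_p, M, radius, feature_type))
    return prefix + data_type + "_data_combined.npz"

path = data_path_for("all_participants", 5, .15, "hog", "train")
print(path)
```

The radius is always written with two decimals, so `.15` and `0.150` resolve to the same file.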

## Project Setup

### Generate or Load Data

In [17]:
#### Load the combined data for all participants
#### (M, radius, and feature_type come from the parameter cell below)
data = load_data("all_participants", M, radius, feature_type, "train")
print("X LH", data["X_lh"].shape, "y LH", data["y_lh"].shape, data["vis_lhs"].shape)
print("X RH", data["X_rh"].shape, "y RH", data["y_rh"].shape, data["vis_rhs"].shape)
print("Filenames", data["filenames"].shape)

X LH (15818, 1089) y LH (15818,) (15818, 180, 180)
X RH (15818, 1089) y RH (15818,) (15818, 180, 180)
Filenames (15818,)


In [118]:
M = 5
radius = .15
feature_type = "hog"
testing_p = "p3"

In [119]:
## hold out testing_p (p3 here) for validation
r = re.compile(testing_p)
vmatch = np.vectorize(lambda x:bool(r.search(x)))
val_p = vmatch(data['filenames'])

X_lh = data['X_lh'][~val_p]
y_lh = data['y_lh'][~val_p]
X_rh = data['X_rh'][~val_p]
y_rh = data['y_rh'][~val_p]
filenames = data['filenames'][~val_p]


X_lh_test = data['X_lh'][val_p]
y_lh_test = data['y_lh'][val_p]
X_rh_test = data['X_rh'][val_p]
y_rh_test = data['y_rh'][val_p]
filenames_test = data['filenames'][val_p]

print("Cross Val Data:")
print(X_lh.shape, y_lh.shape)
print(X_rh.shape, y_rh.shape)
print(filenames.shape)
print()
print("Validation Data")
print(X_lh_test.shape, y_lh_test.shape)
print(X_rh_test.shape, y_rh_test.shape)
print(filenames_test.shape)

Cross Val Data:
(11246, 1089) (11246,)
(11246, 1089) (11246,)
(11246,)

Validation Data
(4572, 1089) (4572,)
(4572, 1089) (4572,)
(4572,)


In [120]:
groups = np.zeros_like(filenames, dtype=int)

for p in ["p1", "p3", "p4", "p6"]:
    p_num = int(p[1])
    groups[np.where(np.char.find(filenames, p) != -1)] = p_num
    
print(groups.shape)
print(np.unique(groups))
print(groups)

(11246,)
[1 4 6]
[1 1 1 ..., 6 6 6]
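
The group ids could also be derived without a hard-coded participant list. This is a sketch under the assumption that every filename contains a `p` followed by a single digit; the toy filenames are made up and do not reflect the real naming scheme.

```python
import re
import numpy as np

def participant_groups(filenames):
    """Map each filename to the digit after the first 'p<digit>' match (0 if none)."""
    pattern = re.compile(r"p(\d)")
    groups = np.zeros(len(filenames), dtype=int)
    for i, name in enumerate(filenames):
        match = pattern.search(str(name))
        if match:
            groups[i] = int(match.group(1))
    return groups

# hypothetical filenames, for illustration only
toy_filenames = np.array(["p1e/000.bin", "p1s/001.bin", "p4a/000.bin", "p6c/002.bin"])
toy_groups = participant_groups(toy_filenames)
print(np.unique(toy_groups))
```

Like the `np.char.find` loop above, this matches anywhere in the string, so a directory name that happens to contain `p` plus a digit would be picked up first.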


# Classification

### Libraries

In [121]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, LeaveOneGroupOut
from imblearn.pipeline import Pipeline

from sklearn.decomposition import PCA

from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

## Load Data for Classification

In [122]:
## find all "static" data so we can ignore for now
# r = re.compile('{}s'.format(testing_p))
r = re.compile(r'p\dssfg')

# remove p#s data
vmatch = np.vectorize(lambda x:bool(r.search(x)))
rem_static = vmatch(filenames)
rem_static_test = vmatch(filenames_test)

X_lh_train, y_lh_train, filenames_train, groups_train = X_lh[~rem_static], y_lh[~rem_static], filenames[~rem_static], groups[~rem_static]
X_rh_train, y_rh_train, filenames_train, groups_train = X_rh[~rem_static], y_rh[~rem_static], filenames[~rem_static], groups[~rem_static]

In [123]:
X_lh_val, y_lh_val, filenames_val = X_lh_test[~rem_static_test], y_lh_test[~rem_static_test], filenames_test[~rem_static_test]
X_rh_val, y_rh_val, filenames_val = X_rh_test[~rem_static_test], y_rh_test[~rem_static_test], filenames_test[~rem_static_test]

In [124]:
X_comb_train = np.vstack((X_lh_train, X_rh_train))
y_comb_train = np.hstack((y_lh_train, y_rh_train))
groups_comb_train = np.hstack((groups_train, groups_train))
filenames_comb_train = np.hstack((filenames_train, filenames_train))

X_comb_val = np.vstack((X_lh_val, X_rh_val))
y_comb_val = np.hstack((y_lh_val, y_rh_val))
filenames_comb_val = np.hstack((filenames_val, filenames_val))

In [125]:
print("Cross Validation Data")
print("LH:", X_lh_train.shape, y_lh_train.shape)
print(np.unique(y_lh_train, return_counts=True))
print("RH:", X_rh_train.shape, y_rh_train.shape)
print(filenames_train.shape)
print(groups_train.shape)
print(np.unique(y_rh_train, return_counts=True))
print("Combined:", X_comb_train.shape, y_comb_train.shape)
print(filenames_comb_train.shape)
print(groups_comb_train.shape)
print(np.unique(y_comb_train, return_counts=True))
print()
print()
print("Validation Data")
print("LH:", X_lh_val.shape, y_lh_val.shape)
print(np.unique(y_lh_val, return_counts=True))
print("RH:", X_rh_val.shape, y_rh_val.shape)
print(np.unique(y_rh_val, return_counts=True))
print("Combined:", X_comb_val.shape, y_comb_val.shape)
print(np.unique(y_comb_val, return_counts=True))
print(filenames_comb_val.shape)

Cross Validation Data
LH: (11246, 1089) (11246,)
(array([0, 1, 2]), array([7579, 2219, 1448]))
RH: (11246, 1089) (11246,)
(11246,)
(11246,)
(array([0, 1, 2]), array([8185, 2607,  454]))
Combined: (22492, 1089) (22492,)
(22492,)
(22492,)
(array([0, 1, 2]), array([15764,  4826,  1902]))


Validation Data
LH: (4572, 1089) (4572,)
(array([0, 1, 2]), array([1882, 1982,  708]))
RH: (4572, 1089) (4572,)
(array([0, 1, 2]), array([3305,  632,  635]))
Combined: (9144, 1089) (9144,)
(array([0, 1, 2]), array([5187, 2614, 1343]))
(9144,)
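
The class counts printed above can be turned into shares to see how skewed the labels are, which is the motivation for the SMOTE step used later. A short worked calculation using only the combined counts from the output:

```python
import numpy as np

# combined class counts copied from the printed output (classes 0, 1, 2)
train_counts = np.array([15764, 4826, 1902])
val_counts = np.array([5187, 2614, 1343])

def class_shares(counts):
    """Fraction of samples that fall into each class."""
    return counts / counts.sum()

train_shares = class_shares(train_counts)
val_shares = class_shares(val_counts)
# how many majority-class samples there are per minority-class sample
imbalance_ratio = train_counts.max() / train_counts.min()

print(np.round(train_shares, 3), np.round(val_shares, 3), round(imbalance_ratio, 2))
```

Class 0 makes up roughly 70% of the cross-validation data but only about 57% of the held-out participant's data, so the two sets also differ in class balance.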


## Model Testing

### SVM

### Single Classifier

In [126]:
## split the held-out participant's data: recordings whose name contains
## "<testing_p>e" (e.g. "p3e") become the test set, the rest the training set
r = re.compile('{}e'.format(testing_p))
vmatch = np.vectorize(lambda x:bool(r.search(x)))

# rebuild the combined validation arrays before masking them
X_comb_val = np.vstack((X_lh_val, X_rh_val))
y_comb_val = np.hstack((y_lh_val, y_rh_val))
filenames_comb_val = np.hstack((filenames_val, filenames_val))

split = vmatch(filenames_comb_val)

X_comb_train, y_comb_train, filenames_comb_train = X_comb_val[~split], y_comb_val[~split], filenames_comb_val[~split]
X_comb_test, y_comb_test, filenames_comb_test = X_comb_val[split], y_comb_val[split], filenames_comb_val[split]

In [None]:
print("Train Data")
print(X_comb_train.shape, y_comb_train.shape)
print(np.unique(y_comb_train, return_counts=True))
print(filenames_comb_train.shape)
print()
print("Test Data")
print(X_comb_test.shape, y_comb_test.shape)
print(np.unique(y_comb_test, return_counts=True))
print(filenames_comb_test.shape)

Train Data
(8114, 1089) (8114,)
(array([0, 1, 2]), array([4643, 2143, 1328]))
(8114,)

Test Data
(1030, 1089) (1030,)
(array([0, 1, 2]), array([544, 471,  15]))
(1030,)


In [130]:
# steps = [('PCA', PCA(n_components=1500)), ('SMOTE', SMOTE(kind="borderline2")), ("SVC", SVC(C=10, gamma=.00001, kernel='rbf', probability=False))]
steps = [('SMOTE', SMOTE(kind="borderline2")), ("SVC", SVC(C=1, gamma=.001, kernel='rbf', decision_function_shape='ovo', probability=False))]
pipeline = Pipeline(steps)
pipeline.fit(X_comb_train, y_comb_train)

Pipeline(memory=None,
     steps=[('SMOTE', SMOTE(k=None, k_neighbors=5, kind='borderline2', m=None, m_neighbors=10,
   n_jobs=1, out_step=0.5, random_state=None, ratio='auto',
   svm_estimator=None)), ('SVC', SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovo', degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

In [131]:
y_comb_true, y_comb_pred = y_comb_test, pipeline.predict(X_comb_test)
print("Comb Validation Score:", accuracy_score(y_comb_true, y_comb_pred))
print("Comb Confusion Matrix:\n", confusion_matrix(y_comb_true, y_comb_pred))
print(classification_report(y_comb_true, y_comb_pred))

Comb Validation Score: 0.778640776699
Comb Confusion Matrix:
 [[510  27   7]
 [ 38 284 149]
 [  0   7   8]]
             precision    recall  f1-score   support

          0       0.93      0.94      0.93       544
          1       0.89      0.60      0.72       471
          2       0.05      0.53      0.09        15

avg / total       0.90      0.78      0.82      1030
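
The TODO list mentions trying weighted instead of macro averaging. The difference can be worked out directly from the confusion matrix printed above; this NumPy sketch recomputes the per-class F1 by hand rather than calling scikit-learn.

```python
import numpy as np

# confusion matrix from the output above (rows = true class, columns = predicted class)
cm = np.array([[510,  27,   7],
               [ 38, 284, 149],
               [  0,   7,   8]])

tp = np.diag(cm).astype(float)
support = cm.sum(axis=1)      # true samples per class
predicted = cm.sum(axis=0)    # predicted samples per class

# F1 = 2*TP / (2*TP + FP + FN), and 2*TP + FP + FN equals predicted + support
f1 = 2 * tp / (predicted + support)

macro_f1 = f1.mean()                            # every class counts equally
weighted_f1 = np.average(f1, weights=support)   # classes weighted by support

print(np.round(f1, 2), round(macro_f1, 2), round(weighted_f1, 2))
```

The weighted value agrees with the `avg / total` row of the report, while the macro value is much lower because class 2, with only 15 samples, counts as much as the other two.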



In [115]:
window_size = 15
y_comb_true_maj = []
for i in range(0, y_comb_true.size, window_size):
    u, counts = np.unique(y_comb_true[i:i+window_size], return_counts=True)
    pred = u[np.argmax(counts)]
    y_comb_true_maj.append(pred)
    
y_comb_true_maj = np.array(y_comb_true_maj)

In [116]:
y_comb_pred_maj = []
for i in range(0, y_comb_pred.size, window_size):
    u, counts = np.unique(y_comb_pred[i:i+window_size], return_counts=True)
    pred = u[np.argmax(counts)]
    y_comb_pred_maj.append(pred)
    
y_comb_pred_maj = np.array(y_comb_pred_maj)
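
The two loops above do the same thing for the true and the predicted labels, so they could be folded into one helper. A minimal sketch; `majority_vote` is a hypothetical name and the toy labels are made up.

```python
import numpy as np

def majority_vote(labels, window_size):
    """Replace each non-overlapping window of labels by its most frequent label."""
    labels = np.asarray(labels)
    voted = []
    for start in range(0, labels.size, window_size):
        values, counts = np.unique(labels[start:start + window_size], return_counts=True)
        # argmax returns the first maximum, so ties go to the smallest label
        voted.append(values[np.argmax(counts)])
    return np.array(voted)

toy_pred = np.array([0, 0, 1, 1, 1, 2, 0, 0])
toy_voted = majority_vote(toy_pred, window_size=3)
print(toy_voted)

# number of windows for 1030 frames with window_size = 15 (last window is shorter)
n_windows = majority_vote(np.zeros(1030, dtype=int), 15).size
print(n_windows)
```

This assumes consecutive rows belong to consecutive frames of the same hand and recording. Because the combined arrays stack LH and RH data, a window that crosses such a boundary mixes unrelated frames.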

In [117]:
print("Comb Majority-Vote Validation Score:", accuracy_score(y_comb_true_maj, y_comb_pred_maj))
print("Comb Majority-Vote Confusion Matrix:\n", confusion_matrix(y_comb_true_maj, y_comb_pred_maj))
print(classification_report(y_comb_true_maj, y_comb_pred_maj))

Comb Majority-Vote Validation Score: 0.826086956522
Comb Majority-Vote Confusion Matrix:
 [[35  2  0]
 [ 9 22  0]
 [ 0  1  0]]
             precision    recall  f1-score   support

          0       0.80      0.95      0.86        37
          1       0.88      0.71      0.79        31
          2       0.00      0.00      0.00         1

avg / total       0.82      0.83      0.82        69





### Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score, cross_validate

In [None]:
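# X_comb_train was overwritten in the Single Classifier section with a subset of
# X_comb_val, so rebuild the full set from the per-hand arrays of all participants
X_comb_all_cv = np.vstack((X_lh_train, X_rh_train, X_lh_val, X_rh_val))
y_comb_all_cv = np.hstack((y_lh_train, y_rh_train, y_lh_val, y_rh_val))
filenames_comb_all_cv = np.hstack((filenames_train, filenames_train, filenames_val, filenames_val))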

print("Combined:", X_comb_all_cv.shape, y_comb_all_cv.shape)
print(np.unique(y_comb_all_cv, return_counts=True))
print(filenames_comb_all_cv.shape)

In [None]:
groups_cv = np.zeros_like(filenames_comb_all_cv, dtype=int)

for p in ["p1", "p3", "p4", "p6"]:
    p_num = int(p[1])
    groups_cv[np.where(np.char.find(filenames_comb_all_cv, p) != -1)] = p_num
    
print(groups_cv.shape)
print(np.unique(groups_cv))
print(groups_cv)

In [None]:
steps = [('SMOTE', SMOTE(kind="borderline2")), ("SVC", SVC(C=1, gamma=.001, kernel='rbf', probability=False))]
pipeline = Pipeline(steps)
scoring = ("f1_macro", "accuracy")
logo = LeaveOneGroupOut()
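# pass the splitter and the groups; logo.get_n_splits() only returns an int,
# which cross_validate would treat as a plain (stratified) k-fold count
scores = cross_validate(pipeline, X_comb_all_cv, y_comb_all_cv, groups=groups_cv, cv=logo, scoring=scoring, n_jobs=2, verbose=2)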

In [None]:
print(scores)

In [None]:
for k, v in scores.items():
    print(k, v)

In [None]:
steps = [('SMOTE', SMOTE(kind="borderline2")), ("SVC", SVC(C=1, gamma=.001, kernel='rbf', probability=False))]
pipeline = Pipeline(steps)
scoring = ("f1_macro", "accuracy")
logo = LeaveOneGroupOut()
scores = cross_validate(pipeline, X_comb_all_cv, y_comb_all_cv, cv=logo.split(X_comb_all_cv, y_comb_all_cv, groups=groups_cv), scoring=scoring, n_jobs=2, verbose=3)

In [None]:
for k, v in scores.items():
    print(k, v)

In [None]:
logo.split(X_comb_all_cv, y_comb_all_cv, groups=groups_cv)
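
For reference, leave-one-group-out produces one fold per participant, with that participant's rows as the test set. The sketch below illustrates the idea in plain NumPy; it is not scikit-learn's implementation, and `leave_one_group_out` and the toy group array are assumptions for illustration.

```python
import numpy as np

def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx), one fold per unique group."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        test_mask = groups == g
        yield g, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# toy group labels standing in for groups_cv (participants 1, 3, 4, 6)
toy_groups = np.array([1, 1, 3, 3, 3, 4, 6, 6])
folds = list(leave_one_group_out(toy_groups))

for g, train_idx, test_idx in folds:
    print("held out p{}: train={} test={}".format(g, train_idx.size, test_idx.size))
```

With four participants this gives four folds, so each score array returned by `cross_validate` should have four entries when the splitter is used.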

## Other Classifiers

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
# steps = [('PCA', PCA(n_components=1500)), ('SMOTE', SMOTE(kind="borderline2")), ("SVC", SVC(C=10, gamma=.00001, kernel='rbf', probability=False))]
# max_iter is not set, so SVC keeps its default (-1, no iteration limit)
steps2 = [('SMOTE', SMOTE(kind="borderline2")), ("CLF", AdaBoostClassifier(base_estimator=SVC(C=1, gamma=.001, kernel='rbf', probability=True), n_estimators=25))]
# steps2 = [('SMOTE', SMOTE(kind="borderline2")), ("CLF", AdaBoostClassifier(base_estimator=RandomForestClassifier(n_estimators=10, max_depth=20), n_estimators=50))]
pipeline2 = Pipeline(steps2)
pipeline2.fit(X_comb_train, y_comb_train)

In [None]:
# evaluate on the held-out test split; X_comb_val contains the rows pipeline2 was trained on
y_comb_true, y_comb_pred = y_comb_test, pipeline2.predict(X_comb_test)
print("Comb Validation Score:", accuracy_score(y_comb_true, y_comb_pred))
print("Comb Confusion Matrix:\n", confusion_matrix(y_comb_true, y_comb_pred))
print(classification_report(y_comb_true, y_comb_pred))