### Synthetic Minority Oversampling TechniquE (SMOTE)
This notebook tries SMOTE (from the `imblearn` library) to balance the classes and improve the performance of the classifier.

In [1]:
%matplotlib inline
import pandas as pd
import datetime
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import numpy as np
from tsfresh import extract_features
from tsfresh import extract_relevant_features
import sklearn
import sklearn.naive_bayes
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.preprocessing import normalize, minmax_scale
from imblearn.over_sampling import SMOTE
from collections import Counter

#### Extracting features using `tsfresh`

In [2]:
data = pd.read_csv('../data/data_only_tsfresh_compatible.csv', names = ['x_acc', 'y_acc', 'z_acc', 'id'])
labels = pd.read_csv('../data/labels_only.csv', names = ['Blocking', 'Dodging', 'Inactive', 'Moving', 'Sprinting'])

In [3]:
extracted_features = extract_features(data, column_id = "id", column_sort = None, column_kind = None, column_value = None)
print(extracted_features.shape)

Feature Extraction: 100%|██████████| 10/10 [06:10<00:00, 31.40s/it]


(1068, 2382)


In [4]:
label_arr = labels.values
label_arr = np.argmax(label_arr, axis = 1)

y_features = np.zeros(extracted_features.shape[0])
for i in range(len(label_arr)) : 
    if i % 150 == 0 : 
        y_features[i // 150] = label_arr[i]
        
# Also converting into Pandas Series for use in extracting relevant features using tsfresh
y = pd.Series(y_features, dtype = int)
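
The loop above keeps the label of the first row in every block of 150 rows, so each `id` ends up with exactly one label. A minimal sketch of the same selection using slicing, assuming (as the constant `150` suggests) that every window spans exactly 150 consecutive rows. The toy `label_arr` and the names `WINDOW`, `y_loop` and `y_slice` are made up for illustration:

```python
import numpy as np

WINDOW = 150  # rows per id, taken from the loop above (assumed constant for every id)

# Toy per-row labels: three windows labelled 3, 0 and 4 (illustrative values only)
label_arr = np.repeat([3, 0, 4], WINDOW)
n_windows = len(label_arr) // WINDOW

# Loop version, as in the cell above
y_loop = np.zeros(n_windows)
for i in range(len(label_arr)):
    if i % WINDOW == 0:
        y_loop[i // WINDOW] = label_arr[i]

# Slicing version: take every 150th label, starting at row 0
y_slice = label_arr[::WINDOW]

print(y_loop, y_slice)
```

The slice also keeps the integer dtype of `label_arr`, whereas the `np.zeros` buffer stores the labels as floats.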

In [None]:
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute

extracted_features = impute(extracted_features)
features_filtered = select_features(extracted_features, y)

In [19]:
x_features = np.asarray(extracted_features)
print(x_features.shape)
x_features_relevant = np.asarray(features_filtered)
print(x_features_relevant.shape)
# Gives the number of examples per label
print(Counter(y_features))

(1068, 2382)
(1068, 693)
Counter({3.0: 411, 2.0: 213, 4.0: 196, 0.0: 129, 1.0: 119})


#### Shuffle and split into train/test datasets, and normalize the datasets

In [33]:
x_train, x_test, y_f, y_test = train_test_split(x_features_relevant, y_features)
print(x_train.shape)
print(x_test.shape)
print(y_f.shape)
print(y_test.shape)
x_f = minmax_scale(x_train)
x_test_norm = minmax_scale(x_test)
print(Counter(y_f))

(801, 693)
(267, 693)
(801,)
(267,)
Counter({3.0: 317, 2.0: 159, 4.0: 136, 0.0: 97, 1.0: 92})
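
Note that `minmax_scale` is applied to the training and test sets independently above, so each set is scaled by its own minima and maxima and the same raw value can map to different scaled values in the two sets. A common alternative is to compute the scaling statistics on the training set only and reuse them for the test set. The sketch below does this by hand with NumPy on made-up data; the helper names `fit_minmax` and `apply_minmax` are hypothetical, and `sklearn.preprocessing.MinMaxScaler` offers the same fit/transform pattern.

```python
import numpy as np

def fit_minmax(x):
    """Return per-column minimum and range of the training data (hypothetical helper)."""
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns: avoid division by zero
    return col_min, col_range

def apply_minmax(x, col_min, col_range):
    """Scale x with statistics computed elsewhere (hypothetical helper)."""
    return (x - col_min) / col_range

# Made-up data standing in for x_train / x_test
x_train_demo = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
x_test_demo = np.array([[5.0, 20.0], [8.0, 26.0]])

# Statistics come from the training data only and are reused for the test data
col_min, col_range = fit_minmax(x_train_demo)
x_train_scaled = apply_minmax(x_train_demo, col_min, col_range)
x_test_scaled = apply_minmax(x_test_demo, col_min, col_range)

print(x_test_scaled)
```

Scaled on its own, the same test array would map to 0 and 1 in both columns, which no longer matches the scale the classifier was trained on.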


#### Use Synthetic Minority Oversampling to equalize all classes

In [34]:
sm = SMOTE(random_state = 33)
x_train_norm, y_train = sm.fit_resample(x_f, y_f)
print(Counter(y_train))
print(y_train.dtype)
y_train = y_train.astype(int)
print(y_train.dtype)
y_test = y_test.astype(int)

Counter({2.0: 317, 3.0: 317, 0.0: 317, 4.0: 317, 1.0: 317})
float64
int64
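
For intuition, SMOTE creates each synthetic point by interpolating between a minority-class sample and one of its nearest minority-class neighbours. The sketch below is a simplified NumPy illustration of that idea on made-up 2-D data, not the `imblearn` implementation; the function name `smote_like_samples` and its parameters `n_new`, `k` and `seed` are hypothetical.

```python
import numpy as np

def smote_like_samples(x_min, n_new, k=2, seed=0):
    """Interpolate between minority samples and their k nearest neighbours.

    Simplified illustration only; not the imblearn implementation.
    """
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(x_min[:, None, :] - x_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, x_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(x_min))    # pick a minority sample
        nn = rng.choice(neighbours[i])  # pick one of its nearest neighbours
        lam = rng.random()              # interpolation factor in [0, 1)
        synthetic[j] = x_min[i] + lam * (x_min[nn] - x_min[i])
    return synthetic

# Made-up minority-class points
x_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(x_minority, n_new=6)
print(new_points.shape)
```

Because every synthetic point lies on a segment between two existing samples of the same class, oversampling of this kind reuses the existing labels and may not help if those labels are noisy.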


In [35]:
# Random forest trained on the SMOTE-balanced training set
rf_clf = RandomForestClassifier(n_estimators = 100)
rf_clf.fit(x_train_norm, y_train)
# Evaluate on the held-out test set
y_pred = rf_clf.predict(x_test_norm)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Evaluate on the training set to gauge overfitting
y_pred = rf_clf.predict(x_train_norm)
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))

[[ 2  1  9 12  8]
 [ 2  7  0 14  4]
 [ 3  2 22 21  6]
 [ 8 10 10 49 17]
 [ 4  8  2 20 26]]
              precision    recall  f1-score   support

           0       0.11      0.06      0.08        32
           1       0.25      0.26      0.25        27
           2       0.51      0.41      0.45        54
           3       0.42      0.52      0.47        94
           4       0.43      0.43      0.43        60

    accuracy                           0.40       267
   macro avg       0.34      0.34      0.34       267
weighted avg       0.39      0.40      0.39       267

[[317   0   0   0   0]
 [  0 317   0   0   0]
 [  1   0 315   1   0]
 [  0   1   0 315   1]
 [  0   1   2   1 313]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       317
           1       0.99      1.00      1.00       317
           2       0.99      0.99      0.99       317
           3       0.99      0.99      0.99       317
           4       1.00      0.99

Even with SMOTE, test accuracy did not improve (0.40 on the test set against nearly 1.00 on the training set), nor did the bias due to class imbalance shrink significantly. Another option is to repeat the feature extraction and class balancing above on data preprocessed with a running window rather than on the raw data.

The running-window variant is done in a separate notebook.
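
The remaining class bias can be checked directly from the test confusion matrix printed above (the first matrix in the `In [35]` output). The sketch below recomputes per-class recall and counts how often each class is predicted; the matrix is copied from that output, and the variable names are illustrative.

```python
import numpy as np

# Test confusion matrix from the output above (rows: true class, columns: predicted class)
cm = np.array([
    [2, 1, 9, 12, 8],
    [2, 7, 0, 14, 4],
    [3, 2, 22, 21, 6],
    [8, 10, 10, 49, 17],
    [4, 8, 2, 20, 26],
])

recall = np.diag(cm) / cm.sum(axis=1)  # correct predictions per true class
predicted_counts = cm.sum(axis=0)      # how often each class is predicted
accuracy = np.trace(cm) / cm.sum()     # overall fraction of correct predictions

print(np.round(recall, 2))
print(predicted_counts)
print(round(accuracy, 2))
```

Class 3 (`Moving`) accounts for 116 of the 267 predictions although only 94 test examples belong to it, so the classifier still leans toward the majority class.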

#### Check whether `sklearn` requires one-hot encoded labels (it does not, but labels should generally be `int` rather than `float` or another type)

In [30]:
# Check if this is necessary
from sklearn.preprocessing import LabelBinarizer
# Fit the encoder on the training labels and reuse it for the test labels
lb = LabelBinarizer()
y_train = lb.fit_transform(y_train)
y_test = lb.transform(y_test)
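
How `sklearn` interprets a target depends on its shape and values, which can be inspected with `sklearn.utils.multiclass.type_of_target`. The sketch below uses made-up labels: a 1-D integer vector should be reported as a multiclass target, while its one-hot encoding should be reported as a multilabel indicator matrix. The variable names `y_int` and `y_onehot` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer
from sklearn.utils.multiclass import type_of_target

# Made-up labels covering the five classes
y_int = np.array([0, 1, 2, 3, 4, 3, 2])
y_onehot = LabelBinarizer().fit_transform(y_int)  # shape (7, 5), one column per class

print(type_of_target(y_int))
print(type_of_target(y_onehot))
```

With one-hot labels each column is handled as a separate binary problem, so a row can receive an all-zero prediction. This is consistent with the near-zero test recall and the `micro avg` / `samples avg` rows in the report below.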

In [31]:
# Random forest trained on the one-hot encoded labels
rf_clf = RandomForestClassifier(n_estimators = 300)
rf_clf.fit(x_train_norm, y_train)
# Evaluate on the held-out test set
y_pred = rf_clf.predict(x_test_norm)
# print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Evaluate on the training set to gauge overfitting
y_pred = rf_clf.predict(x_train_norm)
# print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))


              precision    recall  f1-score   support

           0       0.00      0.00      0.00        36
           1       0.00      0.00      0.00        20
           2       0.00      0.00      0.00        62
           3       0.00      0.00      0.00        98
           4       0.50      0.04      0.07        51

   micro avg       0.40      0.01      0.01       267
   macro avg       0.10      0.01      0.01       267
weighted avg       0.10      0.01      0.01       267
 samples avg       0.01      0.01      0.01       267

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       313
           1       1.00      1.00      1.00       313
           2       1.00      0.98      0.99       313
           3       1.00      0.98      0.99       313
           4       1.00      0.99      0.99       313

   micro avg       1.00      0.99      0.99      1565
   macro avg       1.00      0.99      0.99      1565
weighted avg       1.00


### Using alternate examples only

In [2]:
data = pd.read_csv('../data/alt_data_only_tsfresh_compatible.csv', names = ['x_acc', 'y_acc', 'z_acc', 'id'])
labels = pd.read_csv('../data/alt_labels_only.csv', names = ['Blocking', 'Dodging', 'Inactive', 'Moving', 'Sprinting'])

In [3]:
extracted_features = extract_features(data, column_id = "id", column_sort = None, column_kind = None, column_value = None)
print(extracted_features.shape)

Feature Extraction: 100%|██████████| 10/10 [03:07<00:00, 15.12s/it]


(533, 2382)


In [4]:
label_arr = labels.values
label_arr = np.argmax(label_arr, axis = 1)

y_features = np.zeros(extracted_features.shape[0])
for i in range(len(label_arr)) : 
    if i % 150 == 0 : 
        y_features[i // 150] = label_arr[i]
        
# Also converting into Pandas Series for use in extracting relevant features using tsfresh
y = pd.Series(y_features, dtype = int)

In [5]:
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute

extracted_features = impute(extracted_features)
features_filtered = select_features(extracted_features, y)

 'x_acc__fft_coefficient__coeff_76__attr_"angle"'
 'x_acc__fft_coefficient__coeff_76__attr_"imag"'
 'x_acc__fft_coefficient__coeff_76__attr_"real"'
 'x_acc__fft_coefficient__coeff_77__attr_"abs"'
 'x_acc__fft_coefficient__coeff_77__attr_"angle"'
 'x_acc__fft_coefficient__coeff_77__attr_"imag"'
 'x_acc__fft_coefficient__coeff_77__attr_"real"'
 'x_acc__fft_coefficient__coeff_78__attr_"abs"'
 'x_acc__fft_coefficient__coeff_78__attr_"angle"'
 'x_acc__fft_coefficient__coeff_78__attr_"imag"'
 'x_acc__fft_coefficient__coeff_78__attr_"real"'
 'x_acc__fft_coefficient__coeff_79__attr_"abs"'
 'x_acc__fft_coefficient__coeff_79__attr_"angle"'
 'x_acc__fft_coefficient__coeff_79__attr_"imag"'
 'x_acc__fft_coefficient__coeff_79__attr_"real"'
 'x_acc__fft_coefficient__coeff_80__attr_"abs"'
 'x_acc__fft_coefficient__coeff_80__attr_"angle"'
 'x_acc__fft_coefficient__coeff_80__attr_"imag"'
 'x_acc__fft_coefficient__coeff_80__attr_"real"'
 'x_acc__fft_coefficient__coeff_81__attr_"abs"'
 'x_acc__fft_coeffic

In [6]:
x_features = np.asarray(extracted_features)
print(x_features.shape)
x_features_relevant = np.asarray(features_filtered)
print(x_features_relevant.shape)
# Gives the number of examples per label
print(Counter(y_features))

(533, 2382)
(533, 584)
Counter({3.0: 204, 2.0: 108, 4.0: 95, 0.0: 67, 1.0: 59})


#### Shuffle and split into train/test datasets, and normalize the datasets

In [13]:
x_train, x_test, y_f, y_test = train_test_split(x_features_relevant, y_features)
print(x_train.shape)
print(x_test.shape)
print(y_f.shape)
print(y_test.shape)
x_f = minmax_scale(x_train)
x_test_norm = minmax_scale(x_test)
print(Counter(y_f))

(399, 584)
(134, 584)
(399,)
(134,)
Counter({3.0: 153, 2.0: 88, 4.0: 68, 0.0: 46, 1.0: 44})


#### Use Synthetic Minority Oversampling to equalize all classes

In [14]:
sm = SMOTE(random_state = 12)
x_train_norm, y_train = sm.fit_resample(x_f, y_f)
print(Counter(y_train))
print(y_train.dtype)
y_train = y_train.astype(int)
print(y_train.dtype)
y_test = y_test.astype(int)

Counter({4.0: 153, 3.0: 153, 2.0: 153, 0.0: 153, 1.0: 153})
float64
int64


In [15]:
# Random forest trained on the SMOTE-balanced training set
rf_clf = RandomForestClassifier(n_estimators = 100)
rf_clf.fit(x_train_norm, y_train)
# Evaluate on the held-out test set
y_pred = rf_clf.predict(x_test_norm)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Evaluate on the training set to gauge overfitting
y_pred = rf_clf.predict(x_train_norm)
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))

[[ 2  1  3  9  6]
 [ 0  4  1  9  1]
 [ 0  1  7  8  4]
 [ 1  3  3 29 15]
 [ 0  4  3  7 13]]
              precision    recall  f1-score   support

           0       0.67      0.10      0.17        21
           1       0.31      0.27      0.29        15
           2       0.41      0.35      0.38        20
           3       0.47      0.57      0.51        51
           4       0.33      0.48      0.39        27

    accuracy                           0.41       134
   macro avg       0.44      0.35      0.35       134
weighted avg       0.45      0.41      0.39       134

[[153   0   0   0   0]
 [  0 153   0   0   0]
 [  0   0 153   0   0]
 [  0   0   0 153   0]
 [  0   0   0   0 153]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       153
           1       1.00      1.00      1.00       153
           2       1.00      1.00      1.00       153
           3       1.00      1.00      1.00       153
           4       1.00      1.00

Using only the alternate examples does not improve performance either (test accuracy 0.41, against 0.40 on the full set), most likely because the examples remain mislabelled overall. The next step is to re-annotate the data and, optionally, collect more.