# Supervised learning pipeline

This notebook takes a supervised approach to predicting the activities in the data. Each record is classified on its own: the model predicts the activity for the same time window as its input features, i.e., a cross-sectional approach. The model used in this notebook is `LogisticRegression`.

We build a few conveniences into the code:
* A `Pipeline` to combine data transformation and model
* A `PCA` model to reduce the dimensionality of the data
* Grouped cross-validation (by subject) for training
* A `GridSearchCV` to tune hyperparameters

Finally, we test the model on the test data set, as defined by the dataset authors.

In [1]:
%cd ..

/project


In [2]:
from src.data import *

## Prepare the data

Load the feature data and then prepare the train / test objects. We use the authors' definition of the training and testing datasets. The training data also serves as validation data, through cross-validation.

In [3]:
activities = load_activity_names()
features_df = load_feature_data() \
    .merge(activities) \
    .drop('activity_id', axis=1) \
    .sort_values(['subject_id', 'time_window_s']) \
    .reset_index(drop=True)
features_df.shape

(7352, 564)

Only the feature columns go into the model, so we drop the subject, time, and activity label columns.

In [4]:
X_train = features_df.drop(['subject_id', 'time_window_s', 'activity_name'], axis=1)
y_train = features_df.activity_name

In [5]:
features_test_df = load_feature_data('test') \
    .merge(activities) \
    .drop('activity_id', axis=1) \
    .sort_values(['subject_id', 'time_window_s']) \
    .reset_index(drop=True)
features_test_df.shape

(2947, 564)

## Model set-up and fitting / searching

Since we have many features (561 per record), we reduce the dimensionality before classifying. Some notes:
* We use a `Pipeline` to chain scaling, dimensionality reduction via `PCA`, and the classifier, so the fitted parameters of all three steps are kept together
* We use a grouped cross-validation strategy based on subjects
* The `GridSearchCV` exhaustively tries every combination of hyperparameters in the grid
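One common way to judge how many components `PCA` needs is the cumulative explained variance. The sketch below uses synthetic data (three hypothetical latent factors mixed into 30 features), not the activity data, and the 95% threshold is an arbitrary assumption for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 3 latent factors mixed into 30 observed features, plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 30))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 30))

# Scale first, as the pipeline does, then fit PCA with all components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Cumulative share of variance explained by the first k components
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching the (assumed) 95% threshold
n_needed = int(np.searchsorted(cum_var, 0.95) + 1)
print(n_needed)
```

In the notebook itself, `n_components` is treated as a hyperparameter and chosen by the grid search rather than by a variance threshold.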

In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GroupKFold, GridSearchCV

### Model training / searching setup

When we cross-validate, we don't want one individual's data points split across the training and validation folds. Otherwise that individual's unique behavior leaks from the validation data into the training data, and the validation score becomes optimistic. We therefore use a `GroupKFold` object, whose `.split` method partitions the data by a group label; here the label is `subject_id`, which we store as `subjects_train`.

In [7]:
cv_group = GroupKFold(n_splits=5)
subjects_train = features_df.subject_id
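To see what the grouped split guarantees, here is a small sketch on toy data. The four subjects and their rows are made up; only `GroupKFold` and its `.split` method come from the notebook.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 12 rows from 4 hypothetical subjects, 3 rows each
X = np.arange(24).reshape(12, 2)
y = np.tile([0, 1, 2], 4)
groups = np.repeat([1, 2, 3, 4], 3)

# For every fold, count subjects that appear in both train and validation
overlaps = []
for train_idx, val_idx in GroupKFold(n_splits=4).split(X, y, groups=groups):
    shared = set(groups[train_idx]) & set(groups[val_idx])
    overlaps.append(len(shared))

print(overlaps)
```

Every entry of `overlaps` is zero: no subject contributes rows to both sides of a fold.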

We search over the `C` parameter of `LogisticRegression` and `n_components` of the `PCA` model. The keys follow the `<step name>__<parameter>` convention, so `lr__C` targets the `C` argument of the pipeline step named `lr`. Other parameters and ranges can be added in the same way; this grid, as written, tunes a single pipeline with one classifier.

In [8]:
hyperparam_dict = {'lr__C': [0.001, 0.01, 0.1, 1., 10., 100., 1000.],
                   'pca__n_components': [150, 200, 250]}
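The size of the search follows directly from the grid. As a quick check, `ParameterGrid` (the helper scikit-learn uses to enumerate combinations) can count the candidates; the fold count of 5 is taken from the `GroupKFold(n_splits=5)` defined earlier.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook
hyperparam_dict = {'lr__C': [0.001, 0.01, 0.1, 1., 10., 100., 1000.],
                   'pca__n_components': [150, 200, 250]}

grid = ParameterGrid(hyperparam_dict)
n_candidates = len(grid)      # 7 values of C times 3 values of n_components
n_fits = n_candidates * 5     # each candidate is fitted once per fold
print(n_candidates, n_fits)
```

So the search fits 21 candidate pipelines on each of 5 folds, 105 fits in total, before the final refit of the best candidate.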

In [9]:
# Scale, reduce dimensionality, then classify
model_pipe = Pipeline([('ss', StandardScaler()),
                       ('pca', PCA()),
                       ('lr', LogisticRegression(solver='liblinear'))])
# Search the grid using subject-grouped folds; n_jobs=-1 uses all CPU cores
search_pipe = GridSearchCV(estimator=model_pipe,
                           param_grid=hyperparam_dict,
                           cv=cv_group.split(X_train, y_train, groups=subjects_train),
                           n_jobs=-1)
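Another way to wire grouped cross-validation into the search is to pass the splitter itself as `cv` and supply `groups` to `.fit`; this should produce the same folds as pre-computing the splits. The sketch below runs on synthetic data: the data set, the six hypothetical subjects, and the small grid are made up for illustration, and it uses the default solver rather than `liblinear`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem with 6 hypothetical subjects, 20 rows each
X, y = make_classification(n_samples=120, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
groups = np.repeat(np.arange(6), 20)

pipe = Pipeline([('ss', StandardScaler()),
                 ('pca', PCA()),
                 ('lr', LogisticRegression(max_iter=1000))])

# Pass the splitter object as cv; the groups are supplied at fit time
search = GridSearchCV(estimator=pipe,
                      param_grid={'lr__C': [0.1, 1.0],
                                  'pca__n_components': [5, 10]},
                      cv=GroupKFold(n_splits=3))
search.fit(X, y, groups=groups)

print(search.best_params_)
```

Passing the splitter keeps the search object reusable, since a generator of pre-computed splits is tied to one particular data set.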

### Model training / searching

`GridSearchCV` behaves like any `Pipeline` or individual model ("estimator"): its `.fit` method "trains" it by running the model search, and its `.predict` method applies the chosen model to data.

Once fitted, the chosen `Pipeline` is available through the `.best_estimator_` attribute. With the default `refit=True`, this pipeline is refitted on the full training data using the best hyperparameters.

In [10]:
search_pipe_fit = search_pipe.fit(X_train, y_train)
search_pipe_fit.best_estimator_

Pipeline(steps=[('ss', StandardScaler()), ('pca', PCA(n_components=250)),
                ('lr', LogisticRegression(solver='liblinear'))])

Separately, the hyperparameters can be found using the `.best_params_` attribute.

In [11]:
search_pipe_fit.best_params_

{'lr__C': 1.0, 'pca__n_components': 250}

### Evaluate on the test data

We evaluate the model on the test data defined by the authors. `sklearn.metrics` provides `accuracy_score` for overall accuracy and `classification_report` for per-class precision, recall, and F1-score; both take the true labels first and the predictions second.

In [12]:
y_test_hat = search_pipe_fit \
    .predict(features_test_df.drop(['subject_id', 'time_window_s', 'activity_name'], axis=1))

In [13]:
from sklearn.metrics import accuracy_score, classification_report
# True labels go first, predictions second
print(classification_report(features_test_df.activity_name, y_test_hat))

                    precision    recall  f1-score   support

            LAYING       1.00      0.98      0.99       537
           SITTING       0.95      0.87      0.91       491
          STANDING       0.88      0.96      0.92       532
           WALKING       0.96      1.00      0.98       496
WALKING_DOWNSTAIRS       1.00      0.98      0.99       420
  WALKING_UPSTAIRS       0.99      0.96      0.97       471

          accuracy                           0.96      2947
         macro avg       0.96      0.96      0.96      2947
      weighted avg       0.96      0.96      0.96      2947
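Argument order matters for these metrics: scikit-learn's metric functions expect `y_true` first and `y_pred` second, and swapping them exchanges precision and recall (and reports support for the predicted rather than the true classes). A toy sketch with made-up labels `'a'` and `'b'`:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: three true 'a', one true 'b'; one 'a' is misclassified as 'b'
y_true = ['a', 'a', 'a', 'b']
y_pred = ['a', 'a', 'b', 'b']

# Correct order: (y_true, y_pred)
precision_a = precision_score(y_true, y_pred, pos_label='a')  # 2 of 2 predicted 'a' are right
recall_a = recall_score(y_true, y_pred, pos_label='a')        # 2 of 3 true 'a' are found

# Swapped order: the "precision" returned is really the recall
precision_swapped = precision_score(y_pred, y_true, pos_label='a')

print(precision_a, recall_a, precision_swapped)
```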



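When we cross-tabulate the actual labels with the classified ones, we see a nearly diagonal matrix. Laying is close to perfect: 524 of 537 laying windows are classified correctly, and only one window from another activity (sitting) is labelled as laying. Most of the confusion is between sitting and standing.

Here we will want to validate whether the errors made are acceptable. For example, the errors for walking downstairs are mostly walking or walking upstairs (three each, plus one sitting). It may be important to continue tuning parameters such that this activity is never (or less commonly) misclassified as walking upstairs.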

In [14]:
import pandas as pd  # in case `pd` is not already exposed by the star import above

pd.crosstab(features_test_df.activity_name.values,
            y_test_hat,
            rownames=['True'],
            colnames=['Classified'])

| True (rows) / Classified (columns) | LAYING | SITTING | STANDING | WALKING | WALKING_DOWNSTAIRS | WALKING_UPSTAIRS |
|---|---|---|---|---|---|---|
| LAYING | 524 | 0 | 13 | 0 | 0 | 0 |
| SITTING | 1 | 429 | 59 | 0 | 0 | 2 |
| STANDING | 0 | 20 | 512 | 0 | 0 | 0 |
| WALKING | 0 | 2 | 0 | 494 | 0 | 0 |
| WALKING_DOWNSTAIRS | 0 | 1 | 0 | 3 | 413 | 3 |
| WALKING_UPSTAIRS | 0 | 0 | 0 | 18 | 1 | 452 |
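
Per-class recall, precision, and overall accuracy can be cross-checked directly from the counts in the table above. The sketch below hard-codes those counts (rows are true labels, columns are classified labels) rather than recomputing them from the model.

```python
import numpy as np

labels = ['LAYING', 'SITTING', 'STANDING',
          'WALKING', 'WALKING_DOWNSTAIRS', 'WALKING_UPSTAIRS']

# Counts copied from the cross-tabulation: rows = true, columns = classified
cm = np.array([[524,   0,  13,   0,   0,   0],
               [  1, 429,  59,   0,   0,   2],
               [  0,  20, 512,   0,   0,   0],
               [  0,   2,   0, 494,   0,   0],
               [  0,   1,   0,   3, 413,   3],
               [  0,   0,   0,  18,   1, 452]])

recall = cm.diagonal() / cm.sum(axis=1)     # correct / number of true cases per class
precision = cm.diagonal() / cm.sum(axis=0)  # correct / number of classified cases per class
accuracy = cm.trace() / cm.sum()            # all correct / all cases

for name, r, p in zip(labels, recall, precision):
    print(f'{name:>18}  recall={r:.3f}  precision={p:.3f}')
print(f'accuracy={accuracy:.3f}')
```

With these counts, recall for laying is 524/537 ≈ 0.976 and overall accuracy is 2824/2947 ≈ 0.958.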
