# 0 - Parse Data
The datasets in the _datasets_ folder are pickled dictionaries of lists that contain NumPy objects and other lists.

Use the *parse_mts_dataset* function in *mts_dataset_parser.py* to import them.

You can also use your own dataset, but it is advised that you bring it into the following format:

* **mts_list:** a list of MTS objects, each of which is an MxT numpy array.
* **dmts_list (optional):** first differences of the MTS objects in mts_list, in the same order.
* **labels_list:** labels for the given MTS objects in the same order
* **train_index:** list of indices to extract train instances from mts_list or dmts_list or labels_list
* **test_index:** same as above, just the test indices
* **num_attributes:** number of attributes in any given MTS object.

Note that *train_index* and *test_index* encode the predefined train-test split that is commonly used for research purposes. The whole dataset can also be randomly shuffled to obtain another train-test split of the desired sizes.
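As a rough illustration of the format above, the sketch below builds a toy dataset and draws a random train-test split. It is not the output of *parse_mts_dataset*: the data is synthetic, and it assumes that the rows of each array are the attributes and the columns are the time steps, so the first difference is taken along axis 1.

```python
import numpy as np

rng = np.random.default_rng(0)
num_attributes = 3

# Ten toy MTS objects of varying length (assumed layout: attributes x time)
mts_list = [
    rng.normal(size=(num_attributes, int(rng.integers(20, 30))))
    for _ in range(10)
]

# First difference along the assumed time axis; each result has one column fewer
dmts_list = [np.diff(mts, axis=1) for mts in mts_list]

# One label per MTS object, in the same order
labels_list = np.array([0, 1] * 5)

# Random shuffle for a custom 7/3 train-test split
perm = rng.permutation(len(mts_list))
train_index, test_index = perm[:7], perm[7:]

labels_train = labels_list[train_index]
mts_train = [mts_list[i] for i in train_index]  # plain lists need a comprehension
```

Whether the provided datasets store differences with the same number of columns as the original series is not stated here, so check the shapes of your own *dmts_list* against the bundled ones before mixing them.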

In [2]:
# import required packages
import os
import pickle
import numpy as np
from mts_dataset_parser import parse_mts_dataset

"""
Change the "file" variable to try a different dataset from the selection,
or import your own dataset externally.
Leave the "dirname" variable unchanged when working locally.
"""
dirname = "../data"
file = "scaled_Uwave_mts.p"

with open( os.path.join( dirname , file ) , 'rb' ) as infile:
    data = pickle.load( infile )
( mts_list , dmts_list , train_index , test_index , labels_list , num_attributes) = parse_mts_dataset( data )
labels_train = labels_list[train_index]
labels_test = labels_list[test_index]

# 1 - Import IMPHD and Extract Features
The *im_phd.py* file contains the *IMPHD* class, which mainly wraps the feature extractor of the method.

First, define the lambda and beta values for which you want to extract features in the variables *list_lambdas* and *list_betas*. Then call the *extract_features* method, which extracts and stores the features of every MTS in *mts_list* for the parameter grid formed by the values in *list_lambdas* and *list_betas*.

If *dmts_list* is not provided, the method computes it automatically. In this example, however, it is precomputed.

In [4]:
from im_phd import IMPHD
list_lambdas = [ 5 , 10 , 15 ]
list_betas = [ 0 , 0.025 , 0.05 ]
imphd = IMPHD()
imphd.extract_features( mts_list=mts_list, num_attributes=num_attributes , list_lambdas=list_lambdas
                       , list_betas=list_betas, dmts_list=dmts_list )

Full combination set is being used
**** -------------------------------------- ****
IM features are being computed for the following lambda values:
[5, 10, 15]
**** -------------------------------------- ****
PHD features are being computed for the following beta values:
[0, 0.025, 0.05]
**** -------------------------------------- ****


# 2 - Find the Best Parameter Set Based on Out-of-Bag Scores

A parameter grid is already defined by the chosen lambda and beta values. Because the classifier is a random forest, the out-of-bag (OOB) score can be used to select the best parameter set without a separate validation split. The *ParameterGrid* utility of scikit-learn can be used to set up the parameter search. At each point in the grid (i.e. for each parameter pair) a classifier is fit, and its score is compared with the best score so far to find the best-performing pair. The pair that performs best on the training split may not perform best on the test split. The thesis shows, however, that training errors over the parameter grid are consistent with the actual errors on the test data.
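To make the grid concrete, the sketch below enumerates the same nine (lambda, beta) pairs using only the standard library. It only illustrates what the grid contains and is not a replacement for *ParameterGrid*; the variable and key names mirror the ones used in this notebook.

```python
from itertools import product

list_lambdas = [5, 10, 15]
list_betas = [0, 0.025, 0.05]

# Cartesian product of the two value lists, one dict per grid point
grid = [
    {"num_intervals": lam, "radius_cut": beta}
    for lam, beta in product(list_lambdas, list_betas)
]

print(len(grid))  # 9
print(grid[0])    # {'num_intervals': 5, 'radius_cut': 0}
```

Each dict can be unpacked directly into a call that accepts these two keywords, which is how the grid points are consumed in the search loop.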

You can also use another classifier, such as an SVM or GradientBoostingClassifier, since the features are readily available in the imphd instance. In that case, parameter optimization requires a different procedure, such as K-fold cross-validation.
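A minimal sketch of such a cross-validated search with an SVM is given below. The feature sets here are synthetic stand-ins keyed by a (lambda, beta) tuple; in this notebook they would instead come from the imphd instance for the training indices. The choice of `SVC` with default settings and of five stratified folds is an assumption made for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): one feature matrix per parameter pair
labels = np.repeat([0, 1], 30)
feature_sets = {
    (5, 0.0): rng.normal(size=(60, 4)),                             # pure noise
    (10, 0.025): rng.normal(size=(60, 4)) + 3.0 * labels[:, None],  # class-dependent shift
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
best_score, best_params = -1.0, None
for params, features in feature_sets.items():
    # Mean validation accuracy over the folds replaces the OOB score
    score = cross_val_score(SVC(), features, labels, cv=cv).mean()
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```

After the search, the chosen classifier still has to be refit on the full training split with the selected parameters, since `cross_val_score` does not return a fitted model.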

At any time, use the *get_feature_set* method to retrieve the feature set for a specific parameter tuple. It takes two keyword parameters: *num_intervals* (the lambda value) and *radius_cut* (the beta value).

In [6]:
import copy
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid

# Initialize random forest parameters and the classifier
n_trees = 100
n_jobs = -1
random_seed = 42
rf = RandomForestClassifier( oob_score = True , n_estimators = n_trees , n_jobs = n_jobs , random_state = random_seed )

# Create a parameter grid based on the list of lambdas and betas provided, this step can be adjusted 
# as long as the features are readily extracted in the IMPHD class
pg = ParameterGrid( {'num_intervals':list_lambdas , 'radius_cut':list_betas} )
best_score = -1
for params in list( pg ):
    # Use get_feature_set method to retrieve specific feature set per parameter tuple
    feature_set = imphd.get_feature_set( **params )[train_index]
    rf.fit( feature_set , labels_train )
    # Update the best model if the incoming score is better
    if rf.oob_score_ > best_score:
        # Deep copy so that later fits of rf cannot alter the stored model
        rf_best = copy.deepcopy(rf)
        params_best = params
        best_score = rf.oob_score_


# 3 - Predict Test Indices and Report Results
In order to predict labels for the incoming mts objects, simply use the predict method of the trained classifier as the features are readily available from the feature extraction section.

In [7]:
features_test = imphd.get_feature_set( **params_best )[test_index]
predicted_labels = rf_best.predict( features_test )

from sklearn.metrics import accuracy_score, classification_report
print( classification_report( labels_test, predicted_labels ) )
print("The overall accuracy is:")
print( accuracy_score( labels_test , predicted_labels ) )

             precision    recall  f1-score   support

        1.0       0.95      0.98      0.97       437
        2.0       0.98      1.00      0.99       452
        3.0       0.97      0.98      0.97       454
        4.0       0.97      0.95      0.96       450
        5.0       0.91      0.97      0.94       433
        6.0       0.97      0.88      0.92       449
        7.0       0.98      0.98      0.98       447
        8.0       0.98      0.99      0.98       460

avg / total       0.97      0.97      0.97      3582

The overall accuracy is:
0.965661641541


# Miscellaneous
### Extract features for a single instance
The *extract_features_for_single_instance* method works much like *extract_features*, except that it operates on a single instance and, rather than only storing the features internally, returns them immediately. This method is useful in production.

### Random reduced combination generator
You can provide the model with a custom list of combinations. For a quick trial, however, the *random_reduced_combination_generator* method divides the *num_attributes* dimensions into groups of size *num_reduced_subsets* and returns only the 2-combinations within each group.

In [3]:
# Run this immediately after parts 1 and 2 as a cross-check
assert np.all( imphd.get_feature_set( **params_best )[0] == imphd.extract_features_for_single_instance( mts_list[0] , **params_best ) ), "Oops! The method is not working properly"

# Print a sample reduced combination
imphd.random_reduced_combination_generator( num_attributes=12 , num_reduced_subsets=3 , random_seed=10 )

[(9, 0),
 (9, 6),
 (0, 6),
 (10, 1),
 (10, 2),
 (1, 2),
 (7, 11),
 (7, 5),
 (11, 5),
 (4, 8),
 (4, 3),
 (8, 3)]
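
The printed sample is consistent with shuffling the attribute indices, cutting them into consecutive groups of three, and listing the pairs inside each group. The sketch below reproduces that idea with a hypothetical helper, `reduced_combinations`; it is not the library's implementation and uses a different random number generator, so the exact pairs will differ from the ones printed above.

```python
import random
from itertools import combinations

def reduced_combinations(num_attributes, group_size, seed=None):
    """Shuffle attribute indices, chunk them into groups, return within-group pairs."""
    rng = random.Random(seed)
    attributes = list(range(num_attributes))
    rng.shuffle(attributes)
    pairs = []
    for start in range(0, num_attributes, group_size):
        group = attributes[start:start + group_size]
        pairs.extend(combinations(group, 2))
    return pairs

pairs = reduced_combinations(num_attributes=12, group_size=3, seed=10)
print(len(pairs))  # 12 pairs, versus 66 for all 2-combinations of 12 attributes
```

The reduction grows with the number of attributes, which is why this kind of grouping is convenient for a quick trial on wide datasets.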