# Two-step classification benchmark
This notebook benchmarks two-step classification against one-step classification. An advantage of the two-step approach is that most classifiers (especially SVM) have significantly shorter training times. The benchmark therefore evaluates how precision behaves under both approaches and looks for the best classifier for predicting the final return quantity.

In [1]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix, hstack
import process as p
import dmc

In [2]:
df = p.processed_data()
# Drop every column whose name contains 'Prob' before benchmarking.
for c in [col for col in df.columns if 'Prob' in col]:
    df = df.drop(columns=c)

Function that runs all classifiers except the neural network and returns the precision and the cost of each.

In [3]:
def predict_return_quantity_direct(df, tr_size, te_size):
    results = []
    X, Y = dmc.transformation.transform(df, scaler=dmc.normalization.scale_features,
                                        binary_target=False)
    train = X[:tr_size], Y[:tr_size]
    test = X[tr_size:tr_size + te_size], Y[tr_size:tr_size + te_size]
    for classifier in p.basic[:-1]:
        clf = classifier(train[0], train[1])
        res = clf(test[0])
        precision = dmc.evaluation.precision(res, test[1])
        cost = dmc.evaluation.dmc_cost(res, test[1])
        results.append((precision, cost))
    return np.array([r[0] for r in results]), np.array([r[1] for r in results])
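
The two metrics returned above come from `dmc.evaluation`, whose implementation is not shown in this notebook. As a rough illustration of what such numbers could mean, the sketch below assumes that precision is the share of exactly matching labels and that cost is the summed absolute difference between predicted and actual quantity. Both definitions and both function names are hypothetical and may differ from what `dmc.evaluation.precision` and `dmc.evaluation.dmc_cost` actually compute.

```python
import numpy as np


def exact_match_rate(predicted, actual):
    """Assumed stand-in for precision: share of rows predicted exactly right."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return float(np.mean(predicted == actual))


def absolute_error_cost(predicted, actual):
    """Assumed stand-in for cost: summed absolute error in return quantity."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return int(np.sum(np.abs(predicted - actual)))


# Tiny made-up example, not taken from the benchmark data.
actual = np.array([0, 1, 0, 2, 0])
predicted = np.array([0, 0, 0, 1, 1])

rate = exact_match_rate(predicted, actual)
cost = absolute_error_cost(predicted, actual)
print(rate, cost)
```

Under these assumed definitions a higher rate and a lower cost are better, which matches the sign convention used later in the benchmark.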

Function that runs all classifiers except the neural network and returns the precision and the cost of each, but applies every classifier twice. The chain first classifies whether a row has a return at all and then predicts the exact label representing the return quantity, with the binary result appended as an additional feature.

In [6]:
def predict_return_quantity_twostep(df, tr_size, te_size):
    results = []
    X, Y = dmc.transformation.transform(df, scaler=dmc.normalization.scale_features,
                                        binary_target=True)
    Y_fin = dmc.transformation.transform_target_vector(df, binary=False)
    train = X[:tr_size], Y[:tr_size]
    test = X[tr_size:tr_size + te_size], Y[tr_size:tr_size + te_size]
    for classifier in p.basic[:-1]:
        clf = classifier(train[0], train[1])
        res = clf(test[0])
        Y_csr, res_csr = csr_matrix(Y).T, csr_matrix(res).T
        train_fin = hstack([train[0], Y_csr[:tr_size]]), Y_fin[:tr_size]
        test_fin = hstack([test[0], res_csr]), Y_fin[tr_size:tr_size + te_size]
        clf_fin = classifier(train_fin[0], train_fin[1])
        res_fin = clf_fin(test_fin[0])
        precision = dmc.evaluation.precision(res_fin, test_fin[1])
        cost = dmc.evaluation.dmc_cost(res_fin, test_fin[1])
        results.append((precision, cost))
    return np.array([r[0] for r in results]), np.array([r[1] for r in results])
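
The chaining idea can be tried in isolation with the minimal sketch below. It is not the `dmc` implementation: the `NearestCentroid` class is a hypothetical stand-in that only mimics the call pattern used above (train in the constructor, predict when called), the data is synthetic, and dense NumPy arrays replace the sparse matrices for brevity.

```python
import numpy as np


class NearestCentroid:
    """Hypothetical stand-in classifier: trains on construction, predicts on call."""

    def __init__(self, X, Y):
        self.labels = np.unique(Y)
        self.centroids = np.array([X[Y == label].mean(axis=0)
                                   for label in self.labels])

    def __call__(self, X):
        # Distance of every row to every centroid; pick the closest label.
        dists = np.linalg.norm(X[:, None, :] - self.centroids[None, :, :], axis=2)
        return self.labels[dists.argmin(axis=1)]


rng = np.random.RandomState(0)
n, tr_size = 600, 400
X = rng.rand(n, 5)                       # synthetic features
y_quantity = rng.randint(0, 3, size=n)   # synthetic return quantity: 0, 1 or 2
y_binary = (y_quantity > 0).astype(int)  # 1 if anything was returned

# Step 1: predict whether a row has a return at all.
step1 = NearestCentroid(X[:tr_size], y_binary[:tr_size])
pred_binary = step1(X[tr_size:])

# Step 2: append the binary label as an extra feature column.
# Training uses the true binary label, testing uses the step 1 prediction.
X_train_fin = np.column_stack([X[:tr_size], y_binary[:tr_size]])
X_test_fin = np.column_stack([X[tr_size:], pred_binary])
step2 = NearestCentroid(X_train_fin, y_quantity[:tr_size])
pred_quantity = step2(X_test_fin)

print(X_test_fin.shape, pred_quantity.shape)
```

As in the function above, the second classifier sees the true binary label during training but only the predicted one at test time, so errors from the first step carry over into the second.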

In [4]:
def benchmark_prediction_target(df, tr_size, te_size, samplings=10):
    df_res = pd.DataFrame(index=p.basic[:-1])
    for i in range(samplings):
        df = p.shuffle(df)
        dfc = df[:te_size + tr_size].copy()
        res_dir = predict_return_quantity_direct(dfc, tr_size, te_size)
        res_two = predict_return_quantity_twostep(dfc, tr_size, te_size)
        # Positive values favor the two-step chain: higher precision, lower cost.
        df_res[str(i) + '_precision'] = res_two[0] - res_dir[0]
        df_res[str(i) + '_cost'] = res_dir[1] - res_two[1]
    return df_res

The following table shows the gain in precision and in DMC cost from using a two-step classification chain. Precision gain is computed as two-step minus direct and cost gain as direct minus two-step, so positive numbers are desirable in both cases and indicate a benefit from chaining two classifiers. Negative numbers indicate that the single-target classifier is stronger. The result below was created from 5 random subsamples of 24,000 rows each, using 4,000 rows as the training set and 20,000 as the test set.

In [5]:
benchmark_prediction_target(df, 4000, 20000, 5)

classifier,0_precision,0_cost,1_precision,1_cost,2_precision,2_cost,3_precision,3_cost,4_precision,4_cost
<class 'dmc.classifiers.DecisionTree'>,-0.0031,-33,0.00265,80,0.00045,42,0.00575,137,0.00715,168
<class 'dmc.classifiers.Forest'>,-0.00185,-20,-0.0002,19,0.0038,91,0.00165,43,-5e-05,9
<class 'dmc.classifiers.NaiveBayes'>,0.0011,35,0.00155,52,0.0011,38,0.00115,31,0.0015,41
<class 'dmc.classifiers.SVM'>,0.0005,27,0.0021,64,0.0013,45,-0.00015,7,0.00035,19
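
To read the table at a glance, the per-sampling gains can be averaged per classifier. The sketch below copies the values from the table above by hand into two small frames. The shortened row labels and the column names of `summary` are choices made for this illustration only.

```python
import pandas as pd

# Values copied from the table above; one list entry per sampling (0 to 4).
precision_gain = pd.DataFrame({
    'DecisionTree': [-0.0031, 0.00265, 0.00045, 0.00575, 0.00715],
    'Forest': [-0.00185, -0.0002, 0.0038, 0.00165, -0.00005],
    'NaiveBayes': [0.0011, 0.00155, 0.0011, 0.00115, 0.0015],
    'SVM': [0.0005, 0.0021, 0.0013, -0.00015, 0.00035],
}).T
cost_gain = pd.DataFrame({
    'DecisionTree': [-33, 80, 42, 137, 168],
    'Forest': [-20, 19, 91, 43, 9],
    'NaiveBayes': [35, 52, 38, 31, 41],
    'SVM': [27, 64, 45, 7, 19],
}).T

summary = pd.DataFrame({
    'mean_precision_gain': precision_gain.mean(axis=1),
    'mean_cost_gain': cost_gain.mean(axis=1),
    # Number of samplings in which the two-step chain had the lower cost.
    'samplings_with_lower_cost': (cost_gain > 0).sum(axis=1),
})
print(summary)
```

On these five samplings the mean gain is positive for every classifier, but the precision differences stay well below one percentage point, so more samplings would be needed before drawing firm conclusions.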
