
# Quad-split
recursive subspace split

## How it works
1. For every feature, consider each observed value as a candidate split point. For example, given samples `[[1, 2], [3, 4]]` there are two features: F1 with points 1 and 3, and F2 with points 2 and 4.
2. For every candidate point, split the feature space into "left" and "right" subspaces. Only splits that fulfil the `minimal_split_percentage` criterion are considered (e.g. if `minimal_split_percentage` is 0.1, each side must contain at least 10% of the samples). For every subspace, calculate complexity using the complexity metrics from https://arxiv.org/abs/1808.03591, with a one-vs-one (OVO) approach if there are multiple classes. As a result, every point gets the sum of the complexities of both subspaces.
3. Select the point that offers the lowest complexity after the split.
4. Repeat the process recursively for the left and right subspaces until there are no more valid splitting points (due to `minimal_split_percentage`) or `min_samples` is reached.
5. In each subspace, `base_clf` is trained. If a subspace is "pure" (contains samples of only one class), a `DummyClassifier` is used instead, and all samples are assigned to that class.
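
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: Gini impurity stands in for the complexity metrics from the paper, one-vs-one scores are averaged over class pairs, samples with a value less than or equal to the split point go "left", and recursion also stops on pure subspaces or when a subspace has `min_samples` or fewer samples. The names `gini`, `ovo_complexity`, `best_split`, and `split_recursively` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def gini(labels):
    """Stand-in complexity (assumption): Gini impurity, 0.0 for a pure list."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def ovo_complexity(labels, measure=gini):
    """Average a two-class measure over every pair of classes (one-vs-one)."""
    classes = sorted(set(labels))
    if len(classes) < 2:
        return 0.0
    scores = []
    for a, b in combinations(classes, 2):
        pair = [label for label in labels if label in (a, b)]
        scores.append(measure(pair))
    return sum(scores) / len(scores)


def best_split(X, y, minimal_split_percentage=0.1):
    """Return (feature, point, score) with the lowest summed complexity, or None."""
    n = len(X)
    # Each side needs the required share of samples and at least one sample.
    required = max(1, minimal_split_percentage * n)
    best = None
    for feature in range(len(X[0])):
        for point in sorted({row[feature] for row in X}):
            left = [label for row, label in zip(X, y) if row[feature] <= point]
            right = [label for row, label in zip(X, y) if row[feature] > point]
            if min(len(left), len(right)) < required:
                continue
            score = ovo_complexity(left) + ovo_complexity(right)
            if best is None or score < best[2]:
                best = (feature, point, score)
    return best


def split_recursively(X, y, minimal_split_percentage=0.1, min_samples=2):
    """Return the final subspaces as a list of (rows, labels) pairs."""
    is_pure = len(set(y)) <= 1
    split = None
    if not is_pure and len(X) > min_samples:
        split = best_split(X, y, minimal_split_percentage)
    if split is None:
        return [(X, y)]
    feature, point, _ = split
    leaves = []
    for goes_left in (True, False):
        rows = [r for r in X if (r[feature] <= point) == goes_left]
        labels = [l for r, l in zip(X, y) if (r[feature] <= point) == goes_left]
        leaves += split_recursively(rows, labels, minimal_split_percentage, min_samples)
    return leaves


X = [[1, 2], [2, 1], [8, 9], [9, 8]]
y = [0, 0, 1, 1]
print(best_split(X, y, minimal_split_percentage=0.25))  # (0, 2, 0.0)
print(len(split_recursively(X, y, 0.25, min_samples=1)))  # 2
```

Training `base_clf` (or a dummy classifier for pure subspaces) on each returned subspace, as in step 5, is left out of the sketch.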


## Processing

In [1]:
import mlflow
from mlflow import MlflowClient
import numpy as np
from pandas import DataFrame, Series
from IPython.display import display, Markdown, HTML
from mlutils.mlflow.utils import create_runs_for_params, get_unfinished_run_ids, get_runs, get_unfinished_runs, get_run_params, experiment_name_to_id

In [2]:
def display_md(val):
    return display(Markdown(val))

def display_df(df):
    if isinstance(df, Series):
        df = df.to_frame()
    return display(HTML(df.to_html()))

mlflow.set_tracking_uri("sqlite:///experiments.db")
client = MlflowClient(tracking_uri="sqlite:///experiments.db")

In [3]:
# v6_runs = get_runs(experiment_name_to_id("linear-params-more-metrics", client=client))
v6_runs = get_runs(experiment_name_to_id("new_features_v2", client=client))
base_runs = get_runs(experiment_name_to_id("base", client=client))

In [4]:
v6_runs.columns

Index(['run_id', 'experiment_id', 'status', 'artifact_uri', 'start_time',
       'end_time', 'metrics.actual_min_split_percentage',
       'metrics.simple_areas', 'metrics.acc', 'metrics.statements_size',
       'metrics.no_statements', 'params.min_samples',
       'params.neighbors_in_learning', 'params.base_clf',
       'params.min_split_percentage', 'params.train_path',
       'params.oversampling_in_splitting', 'params.complexity_measure',
       'tags.mlflow.runName', 'tags.exception'],
      dtype='object')

In [5]:
merged = v6_runs.merge(base_runs, on="params.train_path", suffixes=('', '_base'))[
    ['status',
     'metrics.actual_min_split_percentage',
     'metrics.simple_areas', 'metrics.statements_size', 'metrics.acc',
     'metrics.rf_acc', 'metrics.dt_acc', 'metrics.perceptron_acc',
     'metrics.no_statements', 'params.train_path', 'params.min_samples',
     'params.base_clf', 'params.min_split_percentage',
     'params.neighbors_in_learning', 'params.complexity_measure',
     'params.oversampling_in_splitting', 'tags.exception']
]

In [6]:
merged['better_equal_rf'] = merged['metrics.acc'] >= merged['metrics.rf_acc']
merged['better_than_dt'] = merged['metrics.acc'] > merged['metrics.dt_acc']
merged['better_than_perceptron'] = merged['metrics.acc'] > merged['metrics.perceptron_acc']

In [7]:
display_df(merged.status.value_counts())

| status | count |
| --- | --- |
| FINISHED | 2562 |
| FAILED | 318 |


In [8]:
merged.columns

Index(['status', 'metrics.actual_min_split_percentage', 'metrics.simple_areas',
       'metrics.statements_size', 'metrics.acc', 'metrics.rf_acc',
       'metrics.dt_acc', 'metrics.perceptron_acc', 'metrics.no_statements',
       'params.train_path', 'params.min_samples', 'params.base_clf',
       'params.min_split_percentage', 'params.neighbors_in_learning',
       'params.complexity_measure', 'params.oversampling_in_splitting',
       'tags.exception', 'better_equal_rf', 'better_than_dt',
       'better_than_perceptron'],
      dtype='object')

In [9]:
merged = merged.query("status == 'FINISHED'")

## Visualization
### Better or equal RF 
[(with default params - 100 estimators)](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

In [10]:
display_md("**All across**")
display(
    merged['better_equal_rf'].mean()
)

display_md("**Better or equal to RF - grouped by complexity_measure and base_clf**")
display_df(
    merged.groupby(["params.complexity_measure", "params.base_clf"])\
    ['better_equal_rf'] \
    .apply(lambda x: np.sum(x)/len(x)) \
    .unstack()
)

display_md("**Better or equal to RF - grouped by min_samples allowed in split/subspace**")
display_df(
    merged.groupby(["params.base_clf", "params.min_samples"]) \
    ['better_equal_rf'] \
    .apply(lambda x: np.sum(x)/len(x)) \
    .unstack()
)


display_md("**Better or equal to RF - grouped by min_split_percentage for the point to be considered as split**")
display_df(
    merged.groupby(["params.base_clf", "params.min_split_percentage"]) \
        ['better_equal_rf'] \
        .apply(lambda x: np.sum(x)/len(x))\
        .unstack()
)

display_md("**Better or equal to RF - grouped by base_clf trained in the split**")
display_df(
    merged.groupby(["params.base_clf"]) \
        ['better_equal_rf'] \
        .apply(lambda x: np.sum(x)/len(x))
)

**All across**

0.23145979703356753

**Better or equal to RF - grouped by complexity_measure and base_clf**

| params.complexity_measure | params.base_clf=dt | params.base_clf=svm |
| --- | --- | --- |
| l1 | 0.221294 | 0.240625 |
| l2 | 0.225 | 0.239264 |
| l3 | 0.220833 | 0.247379 |


**Better or equal to RF - grouped by min_samples allowed in split/subspace**

| params.base_clf | params.min_samples=10 | params.min_samples=25 |
| --- | --- | --- |
| dt | 0.22114 | 0.223611 |
| svm | 0.232975 | 0.253097 |


**Better or equal to RF - grouped by min_split_percentage for the point to be considered as split**

| params.base_clf | params.min_split_percentage=0.1 | params.min_split_percentage=0.4 |
| --- | --- | --- |
| dt | 0.398611 | 0.045897 |
| svm | 0.35 | 0.136767 |


**Better or equal to RF - grouped by base_clf trained in the split**

| params.base_clf | better_equal_rf |
| --- | --- |
| dt | 0.222377 |
| svm | 0.243099 |


### Better than DT [(with default params - no max depth)](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) 

In [11]:
display_md("**All across**")
display(
    merged['better_than_dt'].mean()
)


display_md("**Better than DT - grouped by complexity_measure and base_clf**")
display_df(
    merged.groupby(["params.base_clf", "params.complexity_measure"]) \
        ['better_than_dt'] \
        .apply(lambda x: np.sum(x)/len(x))\
        .unstack()
)

display_md("**Better than DT - grouped by min_samples allowed in split/subspace**")
display_df(
    merged.groupby(["params.base_clf", "params.min_samples"]) \
        ['better_than_dt'] \
        .apply(lambda x: np.sum(x)/len(x)) \
        .unstack()
)


display_md("**Better than DT - grouped by min_split_percentage for the point to be considered as split**")
display_df(
    merged.groupby(["params.base_clf", "params.min_split_percentage"]) \
        ['better_than_dt'] \
        .apply(lambda x: np.sum(x)/len(x))\
        .unstack()
)

display_md("**Better than DT - grouped by base_clf trained in the split**")
display_df(
    merged.groupby(["params.base_clf"]) \
        ['better_than_dt'] \
        .apply(lambda x: np.sum(x)/len(x))
)

**All across**

0.156128024980484

**Better than DT - grouped by complexity_measure and base_clf**

| params.base_clf | params.complexity_measure=l1 | params.complexity_measure=l2 | params.complexity_measure=l3 |
| --- | --- | --- | --- |
| dt | 0.056367 | 0.060417 | 0.06875 |
| svm | 0.275 | 0.285276 | 0.272537 |


**Better than DT - grouped by min_samples allowed in split/subspace**

| params.base_clf | params.min_samples=10 | params.min_samples=25 |
| --- | --- | --- |
| dt | 0.058414 | 0.065278 |
| svm | 0.259857 | 0.293805 |


**Better than DT - grouped by min_split_percentage for the point to be considered as split**

| params.base_clf | params.min_split_percentage=0.1 | params.min_split_percentage=0.4 |
| --- | --- | --- |
| dt | 0.0 | 0.123783 |
| svm | 0.35 | 0.204263 |


**Better than DT - grouped by base_clf trained in the split**

| params.base_clf | better_than_dt |
| --- | --- |
| dt | 0.061849 |
| svm | 0.276937 |


### Better than perceptron

In [12]:
display_md("**All across**")
display(
    merged['better_than_perceptron'].mean()
)


display_md("**Better than perceptron - grouped by complexity_measure and base_clf**")
display_df(
    merged.groupby(["params.base_clf", "params.complexity_measure"]) \
        ['better_than_perceptron'] \
        .apply(lambda x: np.sum(x)/len(x)) \
        .unstack()
)

display_md("**Better than perceptron - grouped by min_samples allowed in split/subspace**")
display_df(
    merged.groupby(["params.base_clf", "params.min_samples"]) \
        ['better_than_perceptron'] \
        .apply(lambda x: np.sum(x)/len(x)) \
        .unstack()
)


display_md("**Better than perceptron - grouped by min_split_percentage for the point to be considered as split**")
display_df(
    merged.groupby(["params.base_clf", "params.min_split_percentage"]) \
        ['better_than_perceptron'] \
        .apply(lambda x: np.sum(x)/len(x)) \
        .unstack()
)

display_md("**Better than perceptron - grouped by base_clf trained in the split**")
display_df(
    merged.groupby(["params.base_clf"]) \
        ['better_than_perceptron'] \
        .apply(lambda x: np.sum(x)/len(x))
)

**All across**

0.7591725214676034

**Better than perceptron - grouped by complexity_measure and base_clf**

| params.base_clf | params.complexity_measure=l1 | params.complexity_measure=l2 | params.complexity_measure=l3 |
| --- | --- | --- | --- |
| dt | 0.776618 | 0.775 | 0.78125 |
| svm | 0.7375 | 0.733129 | 0.735849 |


**Better than perceptron - grouped by min_samples allowed in split/subspace**

| params.base_clf | params.min_samples=10 | params.min_samples=25 |
| --- | --- | --- |
| dt | 0.773296 | 0.781944 |
| svm | 0.727599 | 0.743363 |


**Better than perceptron - grouped by min_split_percentage for the point to be considered as split**

| params.base_clf | params.min_split_percentage=0.1 | params.min_split_percentage=0.4 |
| --- | --- | --- |
| dt | 0.797222 | 0.757997 |
| svm | 0.7 | 0.77087 |


**Better than perceptron - grouped by base_clf trained in the split**

| params.base_clf | better_than_perceptron |
| --- | --- |
| dt | 0.777623 |
| svm | 0.73553 |


In [13]:
display_md("**Accuracy**")


display_df(
    merged.groupby(["params.neighbors_in_learning"]) \
        ['metrics.acc'] \
        .apply(np.mean)
)


display_df(
    merged.groupby(["params.oversampling_in_splitting"]) \
        ['metrics.acc'] \
        .apply(np.mean) 
)


display_df(
    merged.groupby(["params.complexity_measure", "params.oversampling_in_splitting"]) \
        ['metrics.acc'] \
        .apply(np.mean) \
        .unstack()
)

display_df(
    merged.groupby(["params.complexity_measure", "params.neighbors_in_learning"]) \
        ['metrics.acc'] \
        .apply(np.mean) \
        .unstack()
)



**Accuracy**

| params.neighbors_in_learning | metrics.acc |
| --- | --- |
| 20 | 0.764623 |
| 5 | 0.763488 |
| None | 0.779999 |


| params.oversampling_in_splitting | metrics.acc |
| --- | --- |
| None | 0.767662 |
| SMOTE | 0.768426 |


| params.complexity_measure | params.oversampling_in_splitting=None | params.oversampling_in_splitting=SMOTE |
| --- | --- | --- |
| l1 | 0.769359 | 0.769993 |
| l2 | 0.769914 | 0.771562 |
| l3 | 0.76437 | 0.764456 |


| params.complexity_measure | params.neighbors_in_learning=20 | params.neighbors_in_learning=5 | params.neighbors_in_learning=None |
| --- | --- | --- | --- |
| l1 | 0.763814 | 0.763279 | 0.794499 |
| l2 | 0.76441 | 0.763681 | 0.796491 |
| l3 | 0.765644 | 0.763504 | 0.764086 |


In [14]:
v6_runs['dataset_name'] = v6_runs['params.train_path'].str.split('/').str[-1].str.split("-").str[0]

In [15]:
merged['dataset_name'] = merged['params.train_path'].str.split('/').str[-1].str.split("-").str[0]
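
The chained `.str` accessors in the two cells above keep the last path component and then the text before its first hyphen. A plain-Python equivalent for a single path might look like the sketch below. The helper name `dataset_name` and the example path are hypothetical, since the actual layout of `params.train_path` is not shown here.

```python
def dataset_name(train_path: str) -> str:
    """Keep the last '/'-separated component, then the part before its first '-'."""
    file_name = train_path.split("/")[-1]
    return file_name.split("-")[0]


# Hypothetical path, for illustration only.
print(dataset_name("data/banana-5-1tra.dat"))  # banana
```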


In [16]:
from scipy.stats import wilcoxon

In [17]:
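def calculate_wilcoxon(df, significance=0.05):
    # Paired Wilcoxon signed-rank test on the accuracy differences to DT;
    # returns True when the p-value is below `significance`.
    return wilcoxon(df['metrics.acc'] - df['metrics.dt_acc'], zero_method='zsplit')[1] < significance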

In [18]:
def mean_better_than_df(df):
    return df['metrics.acc'].mean() > df['metrics.dt_acc'].mean()

In [19]:
def percent_better_than_dt(df):
    return ((df['metrics.acc'] > df['metrics.dt_acc']).sum() / df['metrics.dt_acc'].count()) * 100

In [20]:
def diff_to_dt(df):
    return (df['metrics.acc'] - df['metrics.dt_acc']).mean()

In [21]:
display_md("Diff to DT (svm)")
display_df(merged.query('`params.base_clf` == "svm"').groupby(['dataset_name', 'params.complexity_measure']).apply(diff_to_dt).unstack())

display_md("Diff to DT (dt)")
display_df(merged.query('`params.base_clf` == "dt"').groupby(['dataset_name', 'params.complexity_measure']).apply(diff_to_dt).unstack())

Diff to DT (svm)

| dataset_name | params.complexity_measure=l1 | params.complexity_measure=l2 | params.complexity_measure=l3 |
| --- | --- | --- | --- |
| automobile | -0.3419118 | -0.345588 | -0.319853 |
| balance | 0.069 | 0.069 | 0.074 |
| banana | -0.3183962 | -0.323585 | -0.293829 |
| breast | -0.004464286 | -0.006696 | -0.002232 |
| car | -0.2833333 | -0.283333 | -0.283092 |
| chess | -0.03169014 | -0.03169 | -0.032603 |
| ecoli | -0.2182836 | -0.211589 | -0.215796 |
| haberman | 0.008196721 | 0.007172 | 0.00888 |
| heart | -0.1944444 | -0.177083 | -0.177469 |
| mammographic | 0.04411765 | 0.052941 | 0.056863 |


Diff to DT (dt)

| dataset_name | params.complexity_measure=l1 | params.complexity_measure=l2 | params.complexity_measure=l3 |
| --- | --- | --- | --- |
| automobile | -0.063725 | -0.06617647 | -0.068627 |
| balance | -0.002 | -0.0003333333 | 0.001 |
| banana | -0.019615 | -0.01875 | -0.017138 |
| breast | -0.002232 | 1.387779e-17 | -0.000744 |
| car | -0.042271 | -0.04227053 | -0.042271 |
| chess | -0.019301 | -0.01930099 | -0.019301 |
| ecoli | -0.069652 | -0.07089552 | -0.074005 |
| haberman | -0.012295 | -0.01140413 | -0.001426 |
| heart | -0.04784 | -0.0547504 | -0.048611 |
| mammographic | 0.002046 | -0.002205882 | 0.001961 |


In [22]:
display_md("Percentage of runs better than DT")
display_df(merged.groupby(["params.base_clf", "params.min_split_percentage"]).apply(percent_better_than_dt).unstack())

Percentage of runs better than DT

| params.base_clf | params.min_split_percentage=0.1 | params.min_split_percentage=0.4 |
| --- | --- | --- |
| dt | 0.0 | 12.378303 |
| svm | 35.0 | 20.426288 |


In [23]:
display_md("Mean acc better than DT (svm)")
display_df(merged.query('`params.base_clf` == "svm"').groupby(['dataset_name', 'params.complexity_measure']).apply(mean_better_than_df).unstack())

Mean acc better than DT (svm)

| dataset_name | params.complexity_measure=l1 | params.complexity_measure=l2 | params.complexity_measure=l3 |
| --- | --- | --- | --- |
| automobile | False | False | False |
| balance | True | True | True |
| banana | False | False | False |
| breast | False | False | False |
| car | False | False | False |
| chess | False | False | False |
| ecoli | False | False | False |
| haberman | True | True | True |
| heart | False | False | False |
| mammographic | True | True | True |


In [24]:
display_md("Mean acc better than DT (dt)")
display_df(merged.query('`params.base_clf` == "dt"').groupby(['dataset_name', 'params.min_split_percentage']).apply(mean_better_than_df).unstack())

Mean acc better than DT (dt)

| dataset_name | params.min_split_percentage=0.1 | params.min_split_percentage=0.4 |
| --- | --- | --- |
| automobile | False | False |
| balance | False | False |
| banana | False | False |
| breast | False | False |
| car | False | False |
| chess | False | False |
| ecoli | False | False |
| haberman | False | False |
| heart | False | False |
| mammographic | False | True |


In [25]:
display_md("Wilcoxon < 0.05")
display_df(merged.groupby(['dataset_name', 'params.base_clf', 'params.min_split_percentage']).apply(calculate_wilcoxon).unstack())

Wilcoxon < 0.05



| dataset_name | params.base_clf | params.min_split_percentage=0.1 | params.min_split_percentage=0.4 |
| --- | --- | --- | --- |
| automobile | dt | False | True |
| automobile | svm | True | True |
| balance | dt | False | False |
| balance | svm | True | True |
| banana | dt | False | True |
| banana | svm | True | True |
| breast | dt | False | False |
| breast | svm | True | True |
| car | dt | False | True |
| car | svm | True | True |
