
# Quad-split
recursive subspace split

## How it works:
1. For every feature, consider every observed value as a candidate split point (e.g. for samples `[[1,2], [3,4]]` there are two features: F1 with points 1, 3 and F2 with points 2, 4).
2. For every candidate point, split the feature space into "left" and "right" subspaces. Only splits that fulfil the `minimal_split_percentage` criterion are considered (e.g. if `minimal_split_percentage` is 0.1, each side must contain at least 10% of the samples). For every subspace, calculate its complexity using complexity metrics (https://arxiv.org/abs/1808.03591), with an OVO (one-vs-one) approach if there are multiple classes. As a result, every candidate point is scored by the sum of the complexities of its two subspaces.
3. Select the point that gives the lowest complexity after the split.
4. Repeat the process recursively for the left and right subspaces until there are no more valid split points (due to `minimal_split_percentage`) or `min_samples` is reached.
5. In each subspace, `base_clf` is trained. If a subspace is "pure" (contains samples of only one class), a DummyClassifier is used instead: all samples are assigned to the pure class.
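
Steps 1 to 4 can be sketched roughly as below. This is a minimal illustration, not the project's implementation. The function names (`nn_error`, `ovo_complexity`, `best_split`, `split_recursively`) are hypothetical. The leave-one-out 1-NN error rate is only a stand-in for the real complexity measures. Averaging over class pairs, using `<=` for the left side, and stopping on pure subspaces or on fewer than `min_samples` samples are all assumptions.

```python
from itertools import combinations

import numpy as np


def nn_error(X, y):
    """Leave-one-out 1-NN error rate (stand-in complexity score, an assumption)."""
    if len(X) < 2:
        return 0.0
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a sample may not be its own neighbour
    return float(np.mean(y[dist.argmin(axis=1)] != y))


def ovo_complexity(X, y):
    """Average the two-class score over every pair of classes (one-vs-one)."""
    pairs = list(combinations(np.unique(y), 2))
    if not pairs:  # pure (or empty) subspace: nothing to separate
        return 0.0
    scores = []
    for a, b in pairs:
        mask = (y == a) | (y == b)
        scores.append(nn_error(X[mask], y[mask]))
    return float(np.mean(scores))


def best_split(X, y, min_split_percentage):
    """Return (score, feature, threshold) of the best valid split, or None."""
    n = len(X)
    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            left = X[:, feature] <= threshold
            n_left = int(left.sum())
            # Both sides must hold at least the required share of samples.
            if min(n_left, n - n_left) < min_split_percentage * n:
                continue
            score = (ovo_complexity(X[left], y[left])
                     + ovo_complexity(X[~left], y[~left]))
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best


def split_recursively(X, y, min_samples=10, min_split_percentage=0.1, idx=None):
    """Return a list of index arrays, one per final subspace."""
    if idx is None:
        idx = np.arange(len(X))
    Xs, ys = X[idx], y[idx]
    # Assumed stopping rules: too few samples, or the subspace is already pure.
    if len(idx) < min_samples or len(np.unique(ys)) == 1:
        return [idx]
    split = best_split(Xs, ys, min_split_percentage)
    if split is None:  # no candidate satisfies min_split_percentage
        return [idx]
    _, feature, threshold = split
    left = Xs[:, feature] <= threshold
    return (split_recursively(X, y, min_samples, min_split_percentage, idx[left])
            + split_recursively(X, y, min_samples, min_split_percentage, idx[~left]))


# Toy data: the class changes between x0 = 19 and x0 = 20; x1 is constant.
X = np.column_stack([np.arange(40.0), np.zeros(40)])
y = np.array([0] * 20 + [1] * 20)
leaves = split_recursively(X, y)
print([len(leaf) for leaf in leaves])
```

The sketch stops at the partition. Fitting `base_clf` (or a DummyClassifier for pure subspaces) on each returned index array, as in step 5, is left out.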


## Processing

In [1]:
import mlflow
from mlflow import MlflowClient
import numpy as np
from pandas import DataFrame, Series
from IPython.display import display, Markdown, HTML
from mlutils.mlflow.utils import create_runs_for_params, get_unfinished_run_ids, get_runs, get_unfinished_runs, get_run_params, experiment_name_to_id

In [2]:
def display_md(val):
    return display(Markdown(val))

def display_df(df):
    # Series have no HTML table view of their own, so convert them first
    if isinstance(df, Series):
        df = df.to_frame()
    return display(HTML(df.to_html()))

mlflow.set_tracking_uri("sqlite:///experiments.db")
client = MlflowClient(tracking_uri="sqlite:///experiments.db")

In [3]:
v6_runs = get_runs(experiment_name_to_id("v8", client=client))
base_runs = get_runs(experiment_name_to_id("base", client=client))

In [4]:
merged = v6_runs.merge(base_runs, on="params.train_path", suffixes=('', '_base'))[
    ['status', 'params.train_path', 'params.complexity_measure', 'params.base_clf', 'metrics.dt_acc', 'metrics.rf_acc', 'metrics.perceptron_acc', 'metrics.acc', 'params.min_samples', 'params.min_split_percentage']
]

In [5]:
merged['better_equal_rf'] = merged['metrics.acc'] >= merged['metrics.rf_acc']
merged['better_than_dt'] = merged['metrics.acc'] > merged['metrics.dt_acc']

In [6]:
display_df(merged.status.value_counts())

status,count
FINISHED,1424
RUNNING,16


In [7]:
display_df(merged.groupby(["params.complexity_measure", "status"]).count()['params.train_path'])

params.complexity_measure,status,params.train_path
f2,FINISHED,480
l3,FINISHED,480
n3,FINISHED,464
n3,RUNNING,16


In [8]:
# Keep only finished runs; copy so that columns added later do not write into a view
merged = merged.query("status == 'FINISHED'").copy()

## Visualization
### Better than or equal to RF [(with default params - 100 estimators)](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

In [9]:
# The mean of a boolean column is the share of runs where the flag is True
display_md("**All across**")
display(merged['better_equal_rf'].mean())

display_md("**Grouped by complexity_measure and base_clf**")
display_df(
    merged.groupby(["params.complexity_measure", "params.base_clf"])['better_equal_rf'].mean()
)

display_md("**Grouped by min_samples allowed in split/subspace**")
display_df(merged.groupby(["params.min_samples"])['better_equal_rf'].mean())

display_md("**Grouped by min_split_percentage for the point to be considered as split**")
display_df(merged.groupby(["params.min_split_percentage"])['better_equal_rf'].mean())

display_md("**Grouped by base_clf trained in the split**")
display_df(merged.groupby(["params.base_clf"])['better_equal_rf'].mean())

**All across**

0.18117977528089887

**Grouped by complexity_measure and base_clf**

params.complexity_measure,params.base_clf,better_equal_rf
f2,dt,0.233333
f2,knn,0.2
f2,nb,0.166667
f2,perceptron,0.116667
l3,dt,0.241667
l3,knn,0.2
l3,nb,0.183333
l3,perceptron,0.1
n3,dt,0.232759
n3,knn,0.206897


**Grouped by min_samples allowed in split/subspace**

params.min_samples,better_equal_rf
10,0.182584
25,0.179775


**Grouped by min_split_percentage for the point to be considered as split**

params.min_split_percentage,better_equal_rf
0.1,0.260417
0.3,0.154661
0.4,0.127119


**Grouped by base_clf trained in the split**

params.base_clf,better_equal_rf
dt,0.235955
knn,0.202247
nb,0.179775
perceptron,0.106742


### Better than DT [(with default params - no max depth)](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) 

In [10]:
# The mean of a boolean column is the share of runs where the flag is True
display_md("**All across**")
display(merged['better_than_dt'].mean())

display_md("**Grouped by complexity_measure and base_clf**")
display_df(
    merged.groupby(["params.complexity_measure", "params.base_clf"])['better_than_dt'].mean()
)

display_md("**Grouped by min_samples allowed in split/subspace**")
display_df(merged.groupby(["params.min_samples"])['better_than_dt'].mean())

display_md("**Grouped by min_split_percentage for the point to be considered as split**")
display_df(merged.groupby(["params.min_split_percentage"])['better_than_dt'].mean())

display_md("**Grouped by base_clf trained in the split**")
display_df(merged.groupby(["params.base_clf"])['better_than_dt'].mean())

**All across**

0.2654494382022472

**Grouped by complexity_measure and base_clf**

params.complexity_measure,params.base_clf,better_than_dt
f2,dt,0.166667
f2,knn,0.383333
f2,nb,0.3
f2,perceptron,0.183333
l3,dt,0.183333
l3,knn,0.416667
l3,nb,0.333333
l3,perceptron,0.158333
n3,dt,0.163793
n3,knn,0.413793


**Grouped by min_samples allowed in split/subspace**

params.min_samples,better_than_dt
10,0.265449
25,0.265449


**Grouped by min_split_percentage for the point to be considered as split**

params.min_split_percentage,better_than_dt
0.1,0.225
0.3,0.286017
0.4,0.286017


**Grouped by base_clf trained in the split**

params.base_clf,better_than_dt
dt,0.171348
knn,0.404494
nb,0.320225
perceptron,0.16573


In [11]:
v6_runs['dataset_name'] = v6_runs['params.train_path'].str.split('/').str[-1].str.split("-").str[0]

In [12]:
merged['dataset_name'] = merged['params.train_path'].str.split('/').str[-1].str.split("-").str[0]


In [13]:
merged

index,status,params.train_path,params.complexity_measure,params.base_clf,metrics.dt_acc,metrics.rf_acc,metrics.perceptron_acc,metrics.acc,params.min_samples,params.min_split_percentage,better_equal_rf,better_than_dt,dataset_name
0,FINISHED,../datasets/notebooks/processed/chess-5-5tra.csv,l3,nb,0.998435,0.996870,0.945227,0.627543,25,0.1,False,False,chess
1,FINISHED,../datasets/notebooks/processed/chess-5-5tra.csv,l3,nb,0.998435,0.996870,0.945227,0.671362,25,0.3,False,False,chess
2,FINISHED,../datasets/notebooks/processed/chess-5-5tra.csv,l3,nb,0.998435,0.996870,0.945227,0.599374,25,0.4,False,False,chess
3,FINISHED,../datasets/notebooks/processed/chess-5-5tra.csv,l3,nb,0.998435,0.996870,0.945227,0.599374,10,0.4,False,False,chess
4,FINISHED,../datasets/notebooks/processed/chess-5-5tra.csv,l3,nb,0.998435,0.996870,0.945227,0.671362,10,0.3,False,False,chess
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1435,FINISHED,../datasets/notebooks/processed/breast-5-2tra.csv,f2,perceptron,0.696429,0.732143,0.696429,0.589286,25,0.3,False,False,breast
1436,FINISHED,../datasets/notebooks/processed/breast-5-2tra.csv,f2,perceptron,0.696429,0.732143,0.696429,0.696429,25,0.1,False,False,breast
1437,FINISHED,../datasets/notebooks/processed/breast-5-2tra.csv,f2,perceptron,0.696429,0.732143,0.696429,0.589286,10,0.4,False,False,breast
1438,FINISHED,../datasets/notebooks/processed/breast-5-2tra.csv,f2,perceptron,0.696429,0.732143,0.696429,0.589286,10,0.3,False,False,breast


In [14]:
merged.groupby('dataset_name')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x103199d60>

In [15]:
from scipy.stats import wilcoxon

In [16]:
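def calculate_wilcoxon(df):
    """Two-sided Wilcoxon signed-rank p-value for the paired accuracy differences."""
    # zero_method='zsplit' keeps zero differences and splits their ranks between both signs
    return wilcoxon(df['metrics.acc'] - df['metrics.dt_acc'], zero_method='zsplit').pvalue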

In [17]:
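def mean_better_than_df(df):
    """Return True if the group's mean accuracy exceeds the mean accuracy of the baseline decision tree."""
    return df['metrics.acc'].mean() > df['metrics.dt_acc'].mean()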

In [18]:
merged.groupby(['dataset_name', 'params.base_clf']).apply(calculate_wilcoxon) 



dataset_name  params.base_clf
automobile    dt                 1.000000
              knn                0.000008
              nb                 0.000008
              perceptron         0.000008
balance       dt                 0.000973
                                   ...   
wisconsin     perceptron         0.351360
yeast         dt                 0.007167
              knn                0.000008
              nb                 0.000008
              perceptron         0.000008
Length: 80, dtype: float64
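
A small p-value from the two-sided test only says that the paired accuracies differ, not which method is better. The sketch below illustrates this on made-up numbers: `acc` and `dt_acc` are hypothetical values, not results from these experiments. It shows the same `wilcoxon` call next to a one-sided variant via SciPy's `alternative` parameter, and a mean comparison as a cheap way to read the direction.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired accuracies for one (dataset, base_clf) group
acc = np.array([0.71, 0.69, 0.73, 0.70, 0.68, 0.72, 0.66, 0.74, 0.67, 0.75])
dt_acc = np.array([0.80, 0.79, 0.84, 0.82, 0.81, 0.86, 0.81, 0.90, 0.84, 0.93])
diff = acc - dt_acc  # negative everywhere: the baseline wins on every pair

# Same call as in calculate_wilcoxon: two-sided, direction-agnostic
p_two_sided = wilcoxon(diff, zero_method='zsplit').pvalue

# One-sided test of "acc tends to be lower than dt_acc"
p_less = wilcoxon(diff, zero_method='zsplit', alternative='less').pvalue

# Direction read from the means
mean_is_better = acc.mean() > dt_acc.mean()

print(p_two_sided, p_less, mean_is_better)
```

Here the difference is significant and the mean comparison is `False`, so the significant result counts against the method, not for it.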

In [19]:
display_df(merged.groupby(['dataset_name', 'params.base_clf']).apply(calculate_wilcoxon) < 0.05)



dataset_name,params.base_clf,0
automobile,dt,False
automobile,knn,True
automobile,nb,True
automobile,perceptron,True
balance,dt,True
balance,knn,True
balance,nb,True
balance,perceptron,True
banana,dt,True
banana,knn,True


In [20]:
display_df(merged.groupby(['dataset_name', 'params.base_clf']).apply(mean_better_than_df))

dataset_name,params.base_clf,0
automobile,dt,False
automobile,knn,False
automobile,nb,False
automobile,perceptron,False
balance,dt,True
balance,knn,True
balance,nb,True
balance,perceptron,True
banana,dt,True
banana,knn,True


In [21]:
display_df(merged.groupby(['dataset_name', 'params.complexity_measure']).apply(mean_better_than_df))

dataset_name,params.complexity_measure,0
automobile,f2,False
automobile,l3,False
automobile,n3,False
balance,f2,True
balance,l3,True
balance,n3,True
banana,f2,False
banana,l3,False
banana,n3,False
breast,f2,False


In [22]:
display_df(merged.groupby(['dataset_name', 'params.min_split_percentage']).apply(mean_better_than_df))

dataset_name,params.min_split_percentage,0
automobile,0.1,False
automobile,0.3,False
automobile,0.4,False
balance,0.1,True
balance,0.3,True
balance,0.4,True
banana,0.1,False
banana,0.3,False
banana,0.4,False
breast,0.1,True
