
# Quad-split
recursive subspace split

## How it works:
1. For every feature, consider each observed value as a candidate split point (e.g. for samples `[[1, 2], [3, 4]]` we consider two features: F1 with points 1 and 3, and F2 with points 2 and 4).
2. For every candidate point, split the feature space into "left" and "right" subspaces. Only splits that fulfil the `minimal_split_percentage` criterion are considered (e.g. if `minimal_split_percentage` is 0.1, each side must contain at least 10% of the samples). For every subspace, calculate its complexity using complexity metrics (https://arxiv.org/abs/1808.03591), with a one-vs-one (OVO) approach if there are multiple classes. As a result, every candidate point gets the sum of the complexities of its two subspaces.
3. Select the point that gives the lowest complexity after the split.
4. Repeat the process recursively for the left and right subspaces until there are no more valid splitting points (due to `minimal_split_percentage`) or `min_samples` is reached.
5. Train `base_clf` in each resulting subspace. If a subspace is "pure" (contains samples of only one class), use a `DummyClassifier` instead, which assigns every sample to that class.
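
The split search and recursion described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's implementation: `gini` is a stand-in for the complexity measures from the paper (it is not f2, l3, or n3), the OVO handling is omitted, and the names `best_split` and `build` are hypothetical. The sketch also assumes that samples with a value `<=` the split point go left, that "`min_samples` reached" means the subspace holds at most `min_samples` samples, and that pure subspaces are not split further. Leaves only record their size and purity; this is where `base_clf` or a `DummyClassifier` would be fitted.

```python
from collections import Counter


def gini(labels):
    """Stand-in complexity (Gini impurity); NOT one of the paper's measures."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def partition(X, y, feature, threshold):
    """Split samples into left (value <= threshold) and right (value > threshold)."""
    left = [(row, label) for row, label in zip(X, y) if row[feature] <= threshold]
    right = [(row, label) for row, label in zip(X, y) if row[feature] > threshold]
    return left, right


def best_split(X, y, minimal_split_percentage=0.1, complexity=gini):
    """Return (feature, threshold, score) of the lowest-complexity valid split, or None."""
    # Each side needs the required share of samples, and at least one sample.
    min_side = max(1, minimal_split_percentage * len(y))
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            left, right = partition(X, y, feature, threshold)
            if min(len(left), len(right)) < min_side:
                continue
            # Score of a candidate point: sum of both subspace complexities.
            score = (complexity([label for _, label in left])
                     + complexity([label for _, label in right]))
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


def build(X, y, min_samples=2, minimal_split_percentage=0.1, complexity=gini):
    """Recursively split until no valid split is left or the subspace is small or pure."""
    pure = len(set(y)) == 1
    split = None
    if not pure and len(y) > min_samples:
        split = best_split(X, y, minimal_split_percentage, complexity)
    if split is None:
        # A real implementation would fit base_clf (or a dummy classifier) here.
        return {"leaf": True, "n": len(y), "pure": pure}
    feature, threshold, _ = split
    left, right = partition(X, y, feature, threshold)
    args = (min_samples, minimal_split_percentage, complexity)
    return {
        "leaf": False,
        "feature": feature,
        "threshold": threshold,
        "left": build([r for r, _ in left], [l for _, l in left], *args),
        "right": build([r for r, _ in right], [l for _, l in right], *args),
    }


# Toy data: only the first feature separates the two classes.
X_demo = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
y_demo = [0, 0, 1, 1]
tree = build(X_demo, y_demo, min_samples=2, minimal_split_percentage=0.25)
print(tree)
```

On this toy data the root splits on the first feature at 1.0, and both children are pure leaves with two samples each. The second feature never yields a candidate because all of its values are equal, so one side of any split on it would be empty.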


## Processing

In [3]:
import mlflow
from mlflow import MlflowClient
import numpy as np
from pandas import DataFrame, Series
from IPython.display import display, Markdown, HTML
from mlutils.mlflow.utils import create_runs_for_params, get_unfinished_run_ids, get_runs, get_unfinished_runs, get_run_params, experiment_name_to_id

In [4]:
def display_md(val):
    return display(Markdown(val))

def display_df(df):
    if isinstance(df, Series):
        df = df.to_frame()
    return display(HTML(df.to_html()))

mlflow.set_tracking_uri("sqlite:///../experiments.db")
client = MlflowClient(tracking_uri="sqlite:///../experiments.db")

In [5]:
v6_runs = get_runs(experiment_name_to_id("v6", client=client))
base_runs = get_runs(experiment_name_to_id("base", client=client))

In [6]:
merged = v6_runs.merge(base_runs, on="params.train_path", suffixes=('', '_base'))[
    ['status', 'params.train_path', 'params.complexity_measure', 'params.base_clf', 'metrics.dt_acc', 'metrics.rf_acc', 'metrics.perceptron_acc', 'metrics.acc', 'params.min_samples', 'params.min_split_percentage']
]

In [7]:
merged['better_equal_rf'] = merged['metrics.acc'] >= merged['metrics.rf_acc']
merged['better_than_dt'] = merged['metrics.acc'] > merged['metrics.dt_acc']

In [8]:
merged.status.value_counts()

status
RUNNING     781
FAILED      657
FINISHED    482
Name: count, dtype: int64

In [9]:
merged = merged.query("status == 'FINISHED'")

## Visualization
### Better than or equal to RF
[(with default params - 100 estimators)](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

In [10]:
display_md("**All across**")
display(
    merged['better_equal_rf'].mean()
)


display_md("**Grouped by complexity_measure and base_clf**")
display_df(
    merged.groupby(["params.complexity_measure", "params.base_clf"])\
    ['better_equal_rf'] \
    .apply(lambda x: np.sum(x)/len(x))
)

display_md("**Grouped by min_samples allowed in split/subspace**")
display_df(
    merged.groupby(["params.min_samples"]) \
    ['better_equal_rf'] \
    .apply(lambda x: np.sum(x)/len(x))
)


display_md("**Grouped by min_split_percentage for the point to be considered as split**")
display_df(
    merged.groupby(["params.min_split_percentage"]) \
        ['better_equal_rf'] \
        .apply(lambda x: np.sum(x)/len(x))
)

display_md("**Grouped by base_clf trained in the split**")
display_df(
    merged.groupby(["params.base_clf"]) \
        ['better_equal_rf'] \
        .apply(lambda x: np.sum(x)/len(x))
)

**All across**

0.2012448132780083

**Grouped by complexity_measure and base_clf**

| params.complexity_measure | params.base_clf | better_equal_rf |
|---|---|---|
| f2 | knn | 0.255814 |
| f2 | nb | 0.113636 |
| l3 | dt | 0.189189 |
| l3 | knn | 0.195122 |
| l3 | nb | 0.142857 |
| n3 | knn | 0.296296 |
| n3 | nb | 0.27451 |


**Grouped by min_samples allowed in split/subspace**

| params.min_samples | better_equal_rf |
|---|---|
| 10 | 0.205357 |
| 25 | 0.197674 |


**Grouped by min_split_percentage for the point to be considered as split**

| params.min_split_percentage | better_equal_rf |
|---|---|
| 0.1 | 0.206731 |
| 0.3 | 0.162162 |
| 0.5 | 0.346154 |


**Grouped by base_clf trained in the split**

| params.base_clf | better_equal_rf |
|---|---|
| dt | 0.189189 |
| knn | 0.243243 |
| nb | 0.161435 |


### Better than DT [(with default params - no max depth)](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) 

In [11]:
display_md("**All across**")
display(
    merged['better_than_dt'].mean()
)


display_md("**Grouped by complexity_measure and base_clf**")
display_df(
    merged.groupby(["params.complexity_measure", "params.base_clf"]) \
        ['better_than_dt'] \
        .apply(lambda x: np.sum(x)/len(x))
)

display_md("**Grouped by min_samples allowed in split/subspace**")
display_df(
    merged.groupby(["params.min_samples"]) \
        ['better_than_dt'] \
        .apply(lambda x: np.sum(x)/len(x))
)


display_md("**Grouped by min_split_percentage for the point to be considered as split**")
display_df(
    merged.groupby(["params.min_split_percentage"]) \
        ['better_than_dt'] \
        .apply(lambda x: np.sum(x)/len(x))
)

display_md("**Grouped by base_clf trained in the split**")
display_df(
    merged.groupby(["params.base_clf"]) \
        ['better_than_dt'] \
        .apply(lambda x: np.sum(x)/len(x))
)

**All across**

0.3941908713692946

**Grouped by complexity_measure and base_clf**

| params.complexity_measure | params.base_clf | better_than_dt |
|---|---|---|
| f2 | knn | 0.418605 |
| f2 | nb | 0.363636 |
| l3 | dt | 0.324324 |
| l3 | knn | 0.353659 |
| l3 | nb | 0.369048 |
| n3 | knn | 0.481481 |
| n3 | nb | 0.470588 |


**Grouped by min_samples allowed in split/subspace**

| params.min_samples | better_than_dt |
|---|---|
| 10 | 0.392857 |
| 25 | 0.395349 |


**Grouped by min_split_percentage for the point to be considered as split**

| params.min_split_percentage | better_than_dt |
|---|---|
| 0.1 | 0.365385 |
| 0.3 | 0.36036 |
| 0.5 | 0.653846 |


**Grouped by base_clf trained in the split**

| params.base_clf | better_than_dt |
|---|---|
| dt | 0.324324 |
| knn | 0.40991 |
| nb | 0.390135 |
