# Examples for ML2DAC

In this notebook, we show examples of how to use our approach, in particular how to set its parameters and apply it to a custom dataset. Note that we use the MetaKnowledgeRepository (MKR) that we created with the LearningPhase.py script. Have a look at that script to see how to build or extend the MKR.

In [1]:
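import warnings
from pathlib import Path
import numpy as np
from MetaLearning.ApplicationPhase import ApplicationPhase
from MetaLearning import MetaFeatureExtractor
try:
    from pandas.core.common import SettingWithCopyWarning  # pandas < 1.5
except ImportError:
    from pandas.errors import SettingWithCopyWarning  # pandas >= 1.5 moved it here
# Silence warnings that would otherwise clutter the notebook output
warnings.filterwarnings(category=RuntimeWarning, action="ignore")
warnings.filterwarnings(category=SettingWithCopyWarning, action="ignore")
# Fix the NumPy seed for reproducibility
np.random.seed(0)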
# Specify where to find our MKR
# TODO: How to fix the path issue?
mkr_path = Path("/home/licari/AutoMLExperiments/ml2dac/src/MetaKnowledgeRepository/")

# Specify meta-feature set to use. This is the set General+Stats+Info 
mf_set = MetaFeatureExtractor.meta_feature_sets[4]

ModuleNotFoundError: No module named 'MetaLearning'
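
The hard-coded `mkr_path` and the `ModuleNotFoundError` above both depend on where the notebook is started from. One possible way around this, sketched below, is to search upward from the current working directory for the source folder and put it on `sys.path`. The helper `find_upwards` is hypothetical and not part of ML2DAC, and the layout `src/MetaLearning` next to `src/MetaKnowledgeRepository` is an assumption derived from the path used above.

```python
import sys
from pathlib import Path
from typing import Optional


def find_upwards(start: Path, relative: str) -> Optional[Path]:
    """Return the first existing `ancestor / relative`, searching from `start` upward."""
    for root in [start, *start.parents]:
        candidate = root / relative
        if candidate.exists():
            return candidate
    return None


# Assumed layout (not confirmed by the source): <repo>/src/MetaLearning
package_dir = find_upwards(Path.cwd().resolve(), "src/MetaLearning")

if package_dir is not None:
    src_root = package_dir.parent
    # Make `import MetaLearning` work regardless of the working directory
    if str(src_root) not in sys.path:
        sys.path.insert(0, str(src_root))
    # Derive the MKR location from the same root instead of hard-coding it
    mkr_path = src_root / "MetaKnowledgeRepository"
else:
    print("Could not locate src/MetaLearning; set mkr_path manually.")
```

If the repository is laid out differently, only the relative path passed to `find_upwards` needs to change.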

## Example on a simple synthetic dataset

First create a simple synthetic dataset.

In [2]:
# Create simple synthetic dataset
from sklearn.datasets import make_blobs
# We expect the data as numpy arrays
X,y = make_blobs(n_samples=1000, n_features=10, random_state=0)

# We also use a name to describe/identify this dataset
dataset_name = "simple_blobs_n1000_f10"
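
Before running the optimization, it can help to check what the synthetic data looks like. The sketch below only uses scikit-learn and NumPy. With the default `centers` argument, `make_blobs` generates three clusters, so `y` contains three distinct ground-truth labels.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same call as above; `centers` is left at its default of three clusters
X, y = make_blobs(n_samples=1000, n_features=10, random_state=0)

print(X.shape)        # (1000, 10): 1000 samples with 10 features each
print(np.unique(y))   # [0 1 2]: three ground-truth clusters
```

Note that only `X` is passed to the optimizer later on; `y` is kept solely for evaluating the final clustering.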

Specify some parameter settings of our approach.

In [3]:
# Parameters of our approach. These can be customized.
n_warmstarts = 5 # Number of warmstart configurations (must be smaller than n_loops)
n_loops = 10 # Total number of optimizer loops, i.e., n_loops = n_warmstarts + x additional loops
limit_cs = True # Restricts the search space to suitable algorithms, depending on the warmstart configurations
time_limit = 120 * 60 # Time limit of the overall optimization --> aborts early if time_limit is reached before n_loops are finished
cvi = "predict" # Predict a suitable CVI (cluster validity index) based on our meta-knowledge
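
The comments above imply a constraint between the two loop parameters. As a minimal sketch, the hypothetical helper below (not part of ML2DAC) checks that constraint and returns the number of loops that remain after the warmstart configurations have been evaluated.

```python
def remaining_loops(n_warmstarts: int, n_loops: int, time_limit: float) -> int:
    """Validate the budget settings and return x in n_loops = n_warmstarts + x."""
    if not 0 <= n_warmstarts < n_loops:
        raise ValueError("n_warmstarts must be non-negative and smaller than n_loops")
    if time_limit <= 0:
        raise ValueError("time_limit must be positive")
    return n_loops - n_warmstarts


# With the settings above, five loops are left after the warmstarts
print(remaining_loops(n_warmstarts=5, n_loops=10, time_limit=120 * 60))  # 5
```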

Instantiate our ML2DAC approach.

In [7]:
ML2DAC = ApplicationPhase(mkr_path=mkr_path, mf_set=mf_set)

Run the optimization procedure.

In [None]:
optimizer_result, additional_info = ML2DAC.optimize_with_meta_learning(X, n_warmstarts=n_warmstarts,
                                                                       n_optimizer_loops=n_loops, 
                                                                       limit_cs=limit_cs,
                                                                       cvi=cvi, time_limit=time_limit,
                                                                       dataset_name=dataset_name)

The result contains two parts: (1) optimizer_result, which holds the history of the executed configurations in execution order, together with their runtimes and their scores for the selected CVI, and (2) additional_info, which holds basic information about our meta-learning procedure, i.e., how long the meta-feature extraction took, the selected CVI, the algorithms used in the configuration space, and the datasets from the MKR that were most similar to the new dataset.

In [9]:
optimizer_result.get_runhistory_df()

Unnamed: 0,runtime,CH,config,labels
0,0.061319,-870.9795,"{'algorithm': 'ward', 'n_clusters': 10}","[6, 5, 5, 1, 4, 5, 5, 8, 9, 7, 0, 0, 1, 1, 4, ..."
1,0.049947,-1315.794,"{'algorithm': 'dbscan', 'eps': 0.9536790514390...","[0, 0, 0, 1, 2, 0, 0, 1, 0, 2, 1, 1, 1, 1, 2, ..."
2,0.057392,-908.266,"{'algorithm': 'ward', 'n_clusters': 2}","[0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, ..."
3,0.0549,-1164.188,"{'algorithm': 'dbscan', 'eps': 0.9059743386946...","[0, 0, 0, 1, 2, 0, 0, 1, 0, 2, 1, 1, 1, 1, 2, ..."
4,0.05179,-1072.988,"{'algorithm': 'dbscan', 'eps': 0.8878634391450...","[0, 0, 0, 1, 2, 0, 0, 1, 0, 2, -1, 1, 1, 1, 2,..."
5,0.05129,-954.2423,"{'algorithm': 'dbscan', 'eps': 0.9703480015377...","[0, 0, 0, 1, 2, 0, -1, 1, -1, 2, -1, 1, 1, 1, ..."
6,0.051499,-818.1264,"{'algorithm': 'dbscan', 'eps': 0.96, 'min_samp...","[0, 0, 0, 1, 2, 0, 0, 1, 0, 2, 1, 1, 1, 1, 2, ..."
7,0.050556,-133.5773,"{'algorithm': 'dbscan', 'eps': 0.96, 'min_samp...","[-1, -1, -1, 1, 0, -1, -1, -1, -1, 0, -1, 1, 1..."
8,0.046074,2147484000.0,"{'algorithm': 'dbscan', 'eps': 0.79, 'min_samp...","[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -..."
9,0.048987,-664.9217,"{'algorithm': 'dbscan', 'eps': 0.95, 'min_samp...","[0, 0, 0, 2, 1, 0, -1, 2, -1, 1, -1, 2, 2, 2, ..."
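
The run history can be inspected with ordinary pandas operations. The sketch below rebuilds a small excerpt of the table by hand (rows 0, 1, 2 and 8) rather than calling ML2DAC. Two points are assumptions: the scores appear to be negated so that lower is better, which matches the incumbent reported further down, and the very large value in row 8 is treated as a penalty for an unusable clustering, since its displayed labels are all -1. The config of row 8 is truncated in the output, so only its algorithm is included here.

```python
import pandas as pd

# Hand-copied excerpt of the run history shown above
history = pd.DataFrame(
    {
        "runtime": [0.061319, 0.049947, 0.057392, 0.046074],
        "CH": [-870.9795, -1315.794, -908.266, 2147484000.0],
        "config": [
            {"algorithm": "ward", "n_clusters": 10},
            {"algorithm": "dbscan", "eps": 0.9536790514390626, "min_samples": 3},
            {"algorithm": "ward", "n_clusters": 2},
            {"algorithm": "dbscan"},  # remaining keys are truncated in the output
        ],
    },
    index=[0, 1, 2, 8],
)

# Assumption: huge scores are penalties, so drop them before comparing runs
valid = history[history["CH"] < 1e9]

# Assumption: lower (more negative) score is better
best = valid.loc[valid["CH"].idxmin()]
print(best["config"], best["CH"])

# Best score per algorithm
algorithm = valid["config"].map(lambda c: c["algorithm"])
print(valid.groupby(algorithm)["CH"].min())
```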


In [10]:
additional_info

{'dataset': 'simple_blobs_n1000_f10',
 'mf time': 1.1253306865692139,
 'similar dataset': ['type=gaussian-k=10-n=1000-d=10-noise=0',
  'type=varied-k=10-n=1000-d=10-noise=0'],
 'cvi': 'CH',
 'algorithms': ['ward', 'dbscan']}

Now we retrieve the best configuration with its predicted clustering labels and compare it against the ground-truth clustering.

In [11]:
best_config_stats = optimizer_result.get_incumbent_stats()
best_config_stats



{'runtime': 0.04994654655456543,
 'CH': -1315.7939310607555,
 'config': {'algorithm': 'dbscan',
  'eps': 0.9536790514390626,
  'min_samples': 3},
 'labels': array([ 0,  0,  0,  1,  2,  0,  0,  1,  0,  2,  1,  1,  1,  1,  2,  0,  0,
         0, -1, -1,  1,  1,  2,  1,  1,  1,  1,  0,  0,  2,  0,  2,  1,  2,
         0,  2,  0,  1,  2,  1,  2,  1,  0,  2,  1,  1,  2,  2,  0,  2,  0,
         2,  0,  0,  0,  2,  0,  2,  2,  0,  0,  1,  2,  1,  2,  0,  0,  1,
         1,  2,  2,  0, -1,  1,  1,  1,  2,  1,  1,  1,  2,  0,  0,  0,  0,
         1,  0,  1,  2,  0,  2,  1,  2,  1,  2,  0,  1,  0,  2,  2,  1,  1,
         0, -1,  0,  0,  1,  1,  2,  1,  1,  1,  2,  0,  0,  0,  2,  1,  2,
         1,  1,  0,  1,  2,  0, -1,  0,  0,  1,  2,  0,  1,  2,  1,  1,  0,
         2,  0,  2,  2,  0,  0,  0,  2,  0,  0,  2,  2,  2, -1,  0,  1,  1,
         2,  1,  2,  2,  1,  0,  2,  0,  0,  1,  2,  0,  1,  1,  0,  1,  1,
         1,  2,  2,  2,  0,  2,  1,  1,  1,  1,  0,  1,  2,  0,  1,  2,  1,
        

In [12]:
predicted_labels = best_config_stats["labels"]
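
A quick way to summarize such a label array is to count the clusters and the noise points, which DBSCAN marks with -1. The sketch below uses only the first 34 labels printed above as a stand-in for the full array, so the counts describe that excerpt and not the whole clustering.

```python
import numpy as np

# First two printed rows of the incumbent's labels (excerpt, not the full array)
labels = np.array([
    0, 0, 0, 1, 2, 0, 0, 1, 0, 2, 1, 1, 1, 1, 2, 0, 0,
    0, -1, -1, 1, 1, 2, 1, 1, 1, 1, 0, 0, 2, 0, 2, 1, 2,
])

values, counts = np.unique(labels, return_counts=True)

# Cluster sizes without the noise label -1
cluster_sizes = {int(v): int(c) for v, c in zip(values, counts) if v != -1}
n_clusters = len(cluster_sizes)
n_noise = int(np.sum(labels == -1))

print(n_clusters, n_noise, cluster_sizes)
```

The same lines work unchanged on `predicted_labels`.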

In [13]:
from sklearn.metrics import adjusted_rand_score
# Argument order is (labels_true, labels_pred); the score is symmetric, so the value is unaffected
adjusted_rand_score(y, predicted_labels)

0.9170231805482065
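
Two properties of the adjusted Rand index are worth keeping in mind when reading this score. It ignores how the clusters are numbered, and it treats the DBSCAN noise label -1 like any other cluster. The toy example below illustrates both with made-up labels that are unrelated to the dataset above.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

y_true = np.array([0, 0, 0, 1, 1, 1])

# Same partition with swapped cluster ids: still a perfect score
permuted = adjusted_rand_score(y_true, np.array([1, 1, 1, 0, 0, 0]))

# One point marked as noise (-1) counts as a separate cluster and lowers the score
y_pred = np.array([2, 2, 2, 0, 0, -1])
with_noise = adjusted_rand_score(y_true, y_pred)

# Evaluating only the non-noise points gives a perfect score again
mask = y_pred != -1
without_noise = adjusted_rand_score(y_true[mask], y_pred[mask])

print(permuted, with_noise, without_noise)
```

Whether noise points should be included in the evaluation depends on the use case. The score reported above includes them.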