In [1]:
import os

# change working directory, run this cell once
os.chdir("../../")

# suppress warnings
import warnings
warnings.filterwarnings("ignore")

This notebook demonstrates how to use Bayesian Optimization with warmstarting to cluster the handwritten digits dataset.

### Import packages

In [5]:
import autocluster
from autocluster import AutoCluster, get_evaluator, MetafeatureMapper
from sklearn import datasets
from collections import Counter
from sklearn.metrics.cluster import v_measure_score
import pandas as pd

%load_ext autoreload
%autoreload 2



In [6]:
df = pd.DataFrame(datasets.load_digits(n_class=6)['data'])
df.head(5)

,0,1,2,3,4,5,6,7,8,9,...,54,55,56,57,58,59,60,61,62,63
0,0.0,0.0,5.0,13.0,9.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,6.0,13.0,10.0,0.0,0.0,0.0
1,0.0,0.0,0.0,12.0,13.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,11.0,16.0,10.0,0.0,0.0
2,0.0,0.0,0.0,4.0,15.0,12.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,0.0,3.0,11.0,16.0,9.0,0.0
3,0.0,0.0,7.0,15.0,13.0,1.0,0.0,0.0,0.0,8.0,...,9.0,0.0,0.0,0.0,7.0,13.0,13.0,9.0,0.0,0.0
4,0.0,0.0,0.0,1.0,11.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,2.0,16.0,4.0,0.0,0.0


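We will be using the handwritten digits dataset from ``sklearn.datasets``, restricted to the first 6 classes (digits 0 to 5). This leaves 1083 samples, each with 64 numeric pixel features.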

### Fitting a clustering model using Bayesian Optimization (SMAC) + Warmstarting (Metalearning)

In [None]:
cluster = AutoCluster()
fit_params = {
    "df": df, 
    "cluster_alg_ls": [
        'KMeans', 'GaussianMixture', 'Birch', 
        'MiniBatchKMeans', 'AgglomerativeClustering', 'SpectralClustering'
    ], 
    "dim_reduction_alg_ls": [
        'TSNE', 'PCA', 'IncrementalPCA', 
        'KernelPCA', 'FastICA', 'TruncatedSVD'
    ],
    "optimizer": 'smac',
    "n_evaluations": 40,
    "run_obj": 'quality',
    "seed": 27,
    "cutoff_time": 60,
    "preprocess_dict": {
        "numeric_cols": list(range(64)),
        "categorical_cols": [],
        "ordinal_cols": [],
        "y_col": []
    },
    "evaluator": get_evaluator(evaluator_ls = ['silhouetteScore'], 
                               weights = [], clustering_num = None, 
                               min_proportion = .01),
    "n_folds": 3,
    "warmstart": True,
    "warmstart_datasets_dir": 'experiments/metaknowledge/benchmark_silhouette',
    "warmstart_metafeatures_table_path": 'experiments/metaknowledge/benchmark_silhouette_metafeatures_table.csv',
    "warmstart_n_neighbors": 10,
    "warmstart_top_n": 3,
    "general_metafeatures": MetafeatureMapper.getGeneralMetafeatures(),
    "numeric_metafeatures": MetafeatureMapper.getNumericMetafeatures(),
    "categorical_metafeatures": [],
    "verbose_level": 1,
}
result_dict = cluster.fit(**fit_params)

871/1083 datapoints remaining after outlier removal
Found 26 relevant intial configurations from warmstarter.
Truncated n_evaluations: 40
Fitting configuration: 
{'branching_factor___Birch': 291, 'dim_reduction_choice': 'TruncatedSVD', 'n_clusters___Birch': 45, 'clustering_choice': 'Birch', 'algorithm___TruncatedSVD': 'randomized', 'n_components___TruncatedSVD': 9}
Score obtained by this configuration: inf


Target Algorithm returned NaN or inf as quality. Algorithm run is treated as CRASHED, cost is set to 2147483647.0 for quality scenarios. (Change value through "cost_for_crash"-option.)


Fitting configuration: 
{'branching_factor___Birch': 74, 'dim_reduction_choice': 'TruncatedSVD', 'n_clusters___Birch': 71, 'clustering_choice': 'Birch', 'algorithm___TruncatedSVD': 'randomized', 'n_components___TruncatedSVD': 6}
Score obtained by this configuration: inf


Target Algorithm returned NaN or inf as quality. Algorithm run is treated as CRASHED, cost is set to 2147483647.0 for quality scenarios. (Change value through "cost_for_crash"-option.)


Fitting configuration: 
{'branching_factor___Birch': 55, 'dim_reduction_choice': 'TruncatedSVD', 'n_clusters___Birch': 78, 'clustering_choice': 'Birch', 'algorithm___TruncatedSVD': 'randomized', 'n_components___TruncatedSVD': 8}
Score obtained by this configuration: inf


Target Algorithm returned NaN or inf as quality. Algorithm run is treated as CRASHED, cost is set to 2147483647.0 for quality scenarios. (Change value through "cost_for_crash"-option.)


Fitting configuration: 
{'branching_factor___Birch': 814, 'dim_reduction_choice': 'KernelPCA', 'n_components___KernelPCA': 7, 'n_clusters___Birch': 40, 'kernel___KernelPCA': 'linear', 'clustering_choice': 'Birch'}
Score obtained by this configuration: inf


Target Algorithm returned NaN or inf as quality. Algorithm run is treated as CRASHED, cost is set to 2147483647.0 for quality scenarios. (Change value through "cost_for_crash"-option.)


Fitting configuration: 
{'branching_factor___Birch': 227, 'dim_reduction_choice': 'TruncatedSVD', 'n_clusters___Birch': 46, 'clustering_choice': 'Birch', 'algorithm___TruncatedSVD': 'randomized', 'n_components___TruncatedSVD': 7}
Score obtained by this configuration: inf


Target Algorithm returned NaN or inf as quality. Algorithm run is treated as CRASHED, cost is set to 2147483647.0 for quality scenarios. (Change value through "cost_for_crash"-option.)


Fitting configuration: 
{'branching_factor___Birch': 149, 'dim_reduction_choice': 'TruncatedSVD', 'n_clusters___Birch': 38, 'clustering_choice': 'Birch', 'algorithm___TruncatedSVD': 'randomized', 'n_components___TruncatedSVD': 5}
Score obtained by this configuration: inf


Target Algorithm returned NaN or inf as quality. Algorithm run is treated as CRASHED, cost is set to 2147483647.0 for quality scenarios. (Change value through "cost_for_crash"-option.)


Fitting configuration: 
{'branching_factor___Birch': 64, 'dim_reduction_choice': 'KernelPCA', 'n_components___KernelPCA': 8, 'n_clusters___Birch': 25, 'kernel___KernelPCA': 'linear', 'clustering_choice': 'Birch'}
Score obtained by this configuration: inf


Target Algorithm returned NaN or inf as quality. Algorithm run is treated as CRASHED, cost is set to 2147483647.0 for quality scenarios. (Change value through "cost_for_crash"-option.)


Fitting configuration: 
{'branching_factor___Birch': 857, 'dim_reduction_choice': 'IncrementalPCA', 'batch_size___IncrementalPCA': 748, 'n_clusters___Birch': 59, 'clustering_choice': 'Birch', 'n_components___IncrementalPCA': 3}
Score obtained by this configuration: inf


Target Algorithm returned NaN or inf as quality. Algorithm run is treated as CRASHED, cost is set to 2147483647.0 for quality scenarios. (Change value through "cost_for_crash"-option.)


Fitting configuration: 
{'dim_reduction_choice': 'KernelPCA', 'linkage___AgglomerativeClustering': 'complete', 'affinity___AgglomerativeClustering': 'euclidean', 'kernel___KernelPCA': 'rbf', 'clustering_choice': 'AgglomerativeClustering', 'n_clusters___AgglomerativeClustering': 7, 'n_components___KernelPCA': 6}
Score obtained by this configuration: 0.27981443645008397
Fitting configuration: 
{'dim_reduction_choice': 'KernelPCA', 'linkage___AgglomerativeClustering': 'single', 'affinity___AgglomerativeClustering': 'euclidean', 'kernel___KernelPCA': 'rbf', 'clustering_choice': 'AgglomerativeClustering', 'n_clusters___AgglomerativeClustering': 6, 'n_components___KernelPCA': 4}
Score obtained by this configuration: inf


Target Algorithm returned NaN or inf as quality. Algorithm run is treated as CRASHED, cost is set to 2147483647.0 for quality scenarios. (Change value through "cost_for_crash"-option.)


Fitting configuration: 
{'dim_reduction_choice': 'KernelPCA', 'linkage___AgglomerativeClustering': 'single', 'affinity___AgglomerativeClustering': 'l2', 'kernel___KernelPCA': 'rbf', 'clustering_choice': 'AgglomerativeClustering', 'n_clusters___AgglomerativeClustering': 7, 'n_components___KernelPCA': 4}
Score obtained by this configuration: inf


Target Algorithm returned NaN or inf as quality. Algorithm run is treated as CRASHED, cost is set to 2147483647.0 for quality scenarios. (Change value through "cost_for_crash"-option.)


Fitting configuration: 
{'branching_factor___Birch': 91, 'dim_reduction_choice': 'TSNE', 'early_exaggeration___TSNE': 12.0, 'n_clusters___Birch': 6, 'perplexity___TSNE': 84.03630134401256, 'clustering_choice': 'Birch', 'n_components___TSNE': 2}
Score obtained by this configuration: 0.25396819909413654
Fitting configuration: 
{'dim_reduction_choice': 'TSNE', 'linkage___AgglomerativeClustering': 'average', 'affinity___AgglomerativeClustering': 'cityblock', 'early_exaggeration___TSNE': 7.471032321220759, 'perplexity___TSNE': 175.48390215534172, 'clustering_choice': 'AgglomerativeClustering', 'n_clusters___AgglomerativeClustering': 9, 'n_components___TSNE': 2}
Score obtained by this configuration: inf


Target Algorithm returned NaN or inf as quality. Algorithm run is treated as CRASHED, cost is set to 2147483647.0 for quality scenarios. (Change value through "cost_for_crash"-option.)


Fitting configuration: 
{'init_params___GaussianMixture': 'kmeans', 'dim_reduction_choice': 'TSNE', 'early_exaggeration___TSNE': 8.203390083753744, 'perplexity___TSNE': 231.40494920107596, 'clustering_choice': 'GaussianMixture', 'n_components___GaussianMixture': 6, 'covariance_type___GaussianMixture': 'full', 'n_components___TSNE': 2}
Score obtained by this configuration: 0.27213286856810254
Fitting configuration: 
{'init_params___GaussianMixture': 'kmeans', 'dim_reduction_choice': 'TSNE', 'early_exaggeration___TSNE': 8.3742023674892, 'perplexity___TSNE': 231.45241399045284, 'clustering_choice': 'GaussianMixture', 'n_components___GaussianMixture': 6, 'covariance_type___GaussianMixture': 'full', 'n_components___TSNE': 2}
Score obtained by this configuration: inf


Target Algorithm returned NaN or inf as quality. Algorithm run is treated as CRASHED, cost is set to 2147483647.0 for quality scenarios. (Change value through "cost_for_crash"-option.)


Fitting configuration: 
{'dim_reduction_choice': 'TSNE', 'linkage___AgglomerativeClustering': 'complete', 'affinity___AgglomerativeClustering': 'manhattan', 'early_exaggeration___TSNE': 11.993398613338442, 'perplexity___TSNE': 192.4356661604625, 'clustering_choice': 'AgglomerativeClustering', 'n_clusters___AgglomerativeClustering': 6, 'n_components___TSNE': 2}
Score obtained by this configuration: inf


Target Algorithm returned NaN or inf as quality. Algorithm run is treated as CRASHED, cost is set to 2147483647.0 for quality scenarios. (Change value through "cost_for_crash"-option.)


Fitting configuration: 
{'dim_reduction_choice': 'TSNE', 'n_clusters___MiniBatchKMeans': 7, 'early_exaggeration___TSNE': 13.048424472811831, 'batch_size___MiniBatchKMeans': 715, 'perplexity___TSNE': 278.53955190071804, 'clustering_choice': 'MiniBatchKMeans', 'n_components___TSNE': 2}
Score obtained by this configuration: inf


Target Algorithm returned NaN or inf as quality. Algorithm run is treated as CRASHED, cost is set to 2147483647.0 for quality scenarios. (Change value through "cost_for_crash"-option.)


Fitting configuration: 
{'dim_reduction_choice': 'TSNE', 'n_clusters___MiniBatchKMeans': 7, 'early_exaggeration___TSNE': 10.194958770285634, 'batch_size___MiniBatchKMeans': 967, 'perplexity___TSNE': 190.13168469295235, 'clustering_choice': 'MiniBatchKMeans', 'n_components___TSNE': 2}


Important parameters to take note of:
- ``warmstart_datasets_dir``: The directory containing the metalearning results used for warmstarting. Don't change this unless you have run metalearning on your own datasets.
- ``warmstart_metafeatures_table_path``: The path to a CSV table containing the metafeatures of all datasets used for warmstarting. Don't change this unless you have run metalearning on your own datasets.
- ``warmstart_n_neighbors``: During warmstarting, the ``N`` datasets closest to the input dataset in the metafeature space are chosen, and initial configurations for Bayesian Optimization are retrieved from them. ``warmstart_n_neighbors`` refers to ``N``.
- ``warmstart_top_n``: During warmstarting, the ``K`` best configurations are retrieved from each of the chosen 'similar' datasets. ``warmstart_top_n`` refers to ``K``.
- ``general_metafeatures``: General metafeatures used for computing the distance or 'similarity' between two datasets.
- ``numeric_metafeatures``: Numeric metafeatures used for computing the distance or 'similarity' between two datasets. Numeric here means that the metafeatures are computed using only the numeric columns of a dataset.
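
The interaction between ``warmstart_n_neighbors`` and ``warmstart_top_n`` can be illustrated with a minimal sketch. This is not autocluster's implementation: the function names, the toy metafeature vectors, the use of plain Euclidean distance, and the duplicate filtering are all assumptions made for illustration.

```python
import math

def nearest_datasets(target, metafeatures_table, n_neighbors):
    """Return the names of the n_neighbors datasets closest to `target`.

    Assumption: similarity is plain Euclidean distance on raw metafeatures.
    """
    def distance(name):
        return math.dist(target, metafeatures_table[name])
    return sorted(metafeatures_table, key=distance)[:n_neighbors]

def warmstart_configs(target, metafeatures_table, best_configs, n_neighbors, top_n):
    """Collect up to top_n stored configurations from each neighbouring dataset."""
    initial = []
    for name in nearest_datasets(target, metafeatures_table, n_neighbors):
        for config in best_configs[name][:top_n]:
            if config not in initial:  # assumption: duplicates are dropped
                initial.append(config)
    return initial

# Hypothetical metafeature vectors for three benchmark datasets.
metafeatures_table = {
    "ds_a": [0.1, 0.9],
    "ds_b": [0.2, 0.8],
    "ds_c": [5.0, 5.0],
}
# Hypothetical best configurations per dataset, ordered best first.
best_configs = {
    "ds_a": [{"clustering_choice": "KMeans"}, {"clustering_choice": "Birch"}],
    "ds_b": [{"clustering_choice": "Birch"}, {"clustering_choice": "GaussianMixture"}],
    "ds_c": [{"clustering_choice": "SpectralClustering"}],
}

neighbors = nearest_datasets([0.0, 1.0], metafeatures_table, n_neighbors=2)
initial = warmstart_configs([0.0, 1.0], metafeatures_table, best_configs,
                            n_neighbors=2, top_n=2)
print(neighbors)
print([c["clustering_choice"] for c in initial])
```

Under this scheme the number of initial configurations is at most ``N * K``. With ``warmstart_n_neighbors=10`` and ``warmstart_top_n=3`` that bound is 30, which is compatible with the 26 configurations reported in the log above, although the notebook does not say why fewer than 30 were found.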

In [None]:
result_dict['metafeatures'][0]

In [None]:
predictions = cluster.predict(df, save_plot=False)

In [None]:
Counter(predictions)

In [None]:
# ground truth labels first, predicted labels second (the score itself is symmetric)
v_measure_score(datasets.load_digits(n_class=6)['target'], predictions)

The V-measure score is reasonably good, given that the model never sees the labels during fitting. Refer to [sklearn's page](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.v_measure_score.html) for further explanation of how to interpret the V-measure.
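
To make the interpretation more concrete, here is a minimal pure-Python sketch of the V-measure, written from its standard definition as the harmonic mean of homogeneity and completeness (with equal weighting). It is not sklearn's code, and the helper names are chosen for illustration only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(a, b):
    """Conditional entropy H(a | b) in nats."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((c / n) * math.log(c / b_counts[bv])
                for (_, bv), c in joint.items())

def v_measure(labels_true, labels_pred):
    """Harmonic mean of homogeneity and completeness."""
    h_true, h_pred = entropy(labels_true), entropy(labels_pred)
    # Homogeneity: each cluster contains members of a single class.
    homogeneity = 1.0 if h_true == 0 else (
        1.0 - conditional_entropy(labels_true, labels_pred) / h_true)
    # Completeness: all members of a class fall into the same cluster.
    completeness = 1.0 if h_pred == 0 else (
        1.0 - conditional_entropy(labels_pred, labels_true) / h_pred)
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

truth = [0, 0, 1, 1, 2, 2]
renamed = [2, 2, 0, 0, 1, 1]   # same grouping, different cluster ids
merged = [0, 0, 0, 0, 1, 1]    # classes 0 and 1 merged into one cluster

print(v_measure(truth, renamed))
print(v_measure(truth, merged))
```

The score depends only on how the points are grouped, not on the cluster ids, so a correct grouping with different label names still scores perfectly. Merging two classes keeps completeness at 1 but lowers homogeneity, which pulls the score below 1.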

In [None]:
cluster.plot_convergence()

In [None]:
cluster.get_trajectory()