# MLBox

#### Author's description:

MLBox is a powerful Automated Machine Learning python library. It provides the following features:

* Fast reading and distributed data preprocessing/cleaning/formatting
* Highly robust feature selection and leak detection
* Accurate hyper-parameter optimization in high-dimensional space
* State-of-the-art predictive models for classification and regression (Deep Learning, Stacking, LightGBM, ...)
* Prediction with models interpretation

#### Useful links:

[home](https://pypi.org/project/mlbox/),
[tutorial](https://www.analyticsvidhya.com/blog/2017/07/mlbox-library-automated-machine-learning/),
[manual](https://mlbox.readthedocs.io/en/latest/),
[git](https://github.com/AxeldeRomblay/MLBox),
[more examples](https://mlbox.readthedocs.io/en/latest/introduction.html)

## Install and import

In [1]:
!sudo -H pip install jsonschema==2.6

Collecting jsonschema==2.6
  Downloading jsonschema-2.6.0-py2.py3-none-any.whl (39 kB)
Installing collected packages: jsonschema
  Attempting uninstall: jsonschema
    Found existing installation: jsonschema 3.2.0
    Uninstalling jsonschema-3.2.0:
      Successfully uninstalled jsonschema-3.2.0
Successfully installed jsonschema-2.6.0
You should consider upgrading via the '/usr/local/anaconda/bin/python -m pip install --upgrade pip' command.


In [2]:
!sudo -H pip install mlbox==0.8.2

Collecting mlbox==0.8.2
  Downloading mlbox-0.8.2.tar.gz (30 kB)
Collecting numpy==1.17.0
  Downloading numpy-1.17.0-cp36-cp36m-manylinux1_x86_64.whl (20.4 MB)
Collecting scipy==1.3.0
  Downloading scipy-1.3.0-cp36-cp36m-manylinux1_x86_64.whl (25.2 MB)
Collecting matplotlib==3.0.3
  Downloading matplotlib-3.0.3-cp36-cp36m-manylinux1_x86_64.whl (13.0 MB)
Collecting hyperopt==0.1.2
  Downloading hyperopt-0.1.2-py3-none-any.whl (115 kB)
Collecting Keras==2.2.4
  Downloading Keras-2.2.4-py2.py3-none-any.whl (312 kB)
Collecting pandas==0.25.0
  Downloading pandas-0.25.0-cp36-cp36m-manylinux1_x86_64.whl (10.5 MB)

#### The MLBox main package contains 3 sub-packages: preprocessing, optimisation and prediction. They are aimed, respectively, at reading and preprocessing data, testing or optimising a wide range of learners, and predicting the target on a test dataset.

In [3]:
from mlbox.preprocessing import *
from mlbox.optimisation import *
from mlbox.prediction import *
import numpy as np
import pandas as pd
import sklearn
import subprocess

Using TensorFlow backend.


In [4]:
# these notes were written for v 0.8.2
!pip show mlbox

Name: mlbox
Version: 0.8.2
Summary: A powerful Automated Machine Learning python library.
Home-page: https://github.com/AxeldeRomblay/mlbox
Author: Axel ARONIO DE ROMBLAY
Author-email: axelderomblay@gmail.com
License: BSD-3
Location: /usr/local/anaconda/lib/python3.6/site-packages
Requires: tables, tensorflow, pandas, joblib, xlrd, numpy, scipy, scikit-learn, hyperopt, Keras, lightgbm, matplotlib
Required-by: 


In [5]:
!rm -rf ../results/joblib

## A few pointers to keep in mind

#### Importing data
MLBox seems to prefer csv files. Otherwise you have to build your own dictionary. The dictionary structure is not overly complicated, but it introduces another chance for syntax or type errors. It might be wise to just use csv if saving and loading as csv is not too expensive.
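
For reference, here is a minimal sketch of the hand-built dictionary mentioned above. It assumes the layout that `Reader.train_test_split()` is documented to return: a `train` DataFrame without the target, a `test` DataFrame, and a `target` Series. The column names and values are made up for illustration, and MLBox's own reader also cleans the data and encodes the target, which this sketch does not do.

```python
import pandas as pd

# Toy data; the column names and values are hypothetical
train_df = pd.DataFrame({"age": [63, 37, 41], "chol": [233, 250, 204], "target": [1, 1, 0]})
test_df = pd.DataFrame({"age": [56, 57], "chol": [236, 354]})

# Assumed layout of the dictionary MLBox passes between steps:
# features and target are stored separately, and the test set carries no target
manual_data = {
    "train": train_df.drop(columns=["target"]),
    "test": test_df,
    "target": train_df["target"],
}

print(manual_data["train"].shape, manual_data["test"].shape, len(manual_data["target"]))
```

Every key has to be spelled correctly and hold the right type, which is where the extra chance for errors comes from.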

#### Documentation
MLBox documentation is high-level, and putting it into practice is more difficult. I could not find anything on deep learning.

## Heart Disease

#### A note on importing data
csv files for the two datasets in this project are saved at **/mnt/data/raw/**

#### A note on the train & test function
MLBox has a function called **train_test_split()**. It does not behave like the scikit-learn function of the same name, and it can take a little getting used to. It helps to imagine that the authors built MLBox as a tool for Kaggle competitions: the training set needs to have **y** in it and the test set should not. You're on your own for measuring accuracy against the test set, as MLBox assumes you'll find out the real answers later from an external source that is not part of its flow.
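
Since scoring against the held-out labels is left to you, here is a minimal sketch of that step. It assumes you have the true labels and the predicted labels as two equal-length lists in the same row order. The function name `accuracy` and the sample values are illustrative and not part of MLBox.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set of labels")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Hypothetical labels: the held-out targets and the model's predictions
held_out = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 1]
print(accuracy(held_out, predicted))
```

In this notebook the true labels would come from the test file that still contains the target, and the predictions from whatever MLBox writes to the results directory.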

#### A note on categorical fields
MLBox tries to infer which columns are categorical. From what I can tell, it only looks at data type when doing so. This is a little annoying. Below, I had to take the extra step of mapping numeric values to text for each of the numeric/categorical columns so that MLBox will treat them as categorical.
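
To make that behaviour concrete, here is a small sketch of type-based inference as I understand it from the description above. It is not MLBox's code: the helper `looks_categorical` is hypothetical and simply treats a column as categorical when any value fails to parse as a number.

```python
def looks_categorical(values):
    """Return True if any value cannot be parsed as a number."""
    for v in values:
        try:
            float(v)
        except (TypeError, ValueError):
            return True
    return False

# Chest pain type stored as numeric codes: seen as numerical
cp_codes = [1, 2, 3, 4]
# The same codes with a text prefix: seen as categorical
cp_text = ["text_" + str(v) for v in cp_codes]

print(looks_categorical(cp_codes), looks_categorical(cp_text))
```

Under a rule like this, a coded category such as `cp` is indistinguishable from a measurement, which is why the text prefix is needed.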

#### Load the heart disease dataset

The raw data can be found in the project files at /mnt/data/raw/heart.csv

Attribute documentation:

      age: age in years
      sex: sex (1 = male; 0 = female)
      cp: chest pain type
        -- Value 1: typical angina
        -- Value 2: atypical angina
        -- Value 3: non-anginal pain
        -- Value 4: asymptomatic
     trestbps: resting blood pressure (in mm Hg on admission to the 
        hospital)
     chol: serum cholesterol in mg/dl
     fbs: (fasting blood sugar > 120 mg/dl)  (1 = true; 0 = false)
     restecg: resting electrocardiographic results
        -- Value 0: normal
        -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST 
                    elevation or depression of > 0.05 mV)
        -- Value 2: showing probable or definite left ventricular hypertrophy
                    by Estes' criteria
     thalach: maximum heart rate achieved
     exang: exercise induced angina (1 = yes; 0 = no)
     oldpeak: ST depression induced by exercise relative to rest
     slope: the slope of the peak exercise ST segment
        -- Value 1: upsloping
        -- Value 2: flat
        -- Value 3: downsloping
     ca: number of major vessels (0-3) colored by fluoroscopy
     thal: 
         3 = normal; 
         6 = fixed defect; 
         7 = reversible defect
     target: diagnosis of heart disease (angiographic disease status)
        -- Value 0: < 50% diameter narrowing
        -- Value 1: > 50% diameter narrowing

In [6]:
# column names
names = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang', \
         'oldpeak','slope','ca','thal','target']

# load data from Domino project directory
hd_data = pd.read_csv("/mnt/data/raw/heart.csv", header=None, names=names)

In [7]:
# in case some data comes in as string, convert to numeric and coerce errors to NaN
for col in hd_data.columns:  # Iterate over columns
    hd_data[col] = pd.to_numeric(hd_data[col], errors='coerce')
    
# drop nulls
hd_data.dropna(inplace=True)

In [8]:
# function to force non-numeric data for categorical columns
def force_non_numeric(data, cols):
    for c in cols:
        data[c] = 'text_' + data[c].map(str)  
    return data

In [9]:
cat_cols = ['cp', 'restecg', 'slope', 'ca', 'thal']
hd_data = force_non_numeric(hd_data, cat_cols)

In [11]:
# create MLBox random samples for train and test
hd_data_train = hd_data.sample(frac=0.7, replace=False, random_state=1)
# test set = every row whose index was not sampled into the training set
hd_data_test = hd_data.drop(hd_data_train.index)
hd_data_test_wo_target = hd_data_test.drop('target', axis=1)

hd_data_train.to_csv('/mnt/data/processed/hd_data_train.csv', index=False)
hd_data_test.to_csv('/mnt/data/processed/hd_data_test.csv', index=False)
hd_data_test_wo_target.to_csv('/mnt/data/processed/hd_data_test_wo_target.csv', index=False)

In [12]:
# the list of paths to your train datasets and test datasets
paths_hd = ["/mnt/data/processed/hd_data_train.csv", \
         "/mnt/data/processed/hd_data_test_wo_target.csv"]

# the name of the target you try to predict (classification or regression)
target_hd = "target"

#### Process the data

Pass the training set (with the target) and the test set (without the target) to the **train_test_split()** function. This automatically cleans both datasets.

Use **to_path** to keep your outputs organized. In this project I want everything in the results directory, so I use **/mnt/results**.

Note that after adding text to the numeric/categorical columns, MLBox now recognizes them as categorical.

In [13]:
# to read and preprocess your files
mlb_data_hd = Reader(sep=",", to_path = '/mnt/results').train_test_split(paths_hd, target_hd)


reading csv : hd_data_train.csv ...
cleaning data ...
CPU time: 0.05018115043640137 seconds

reading csv : hd_data_test_wo_target.csv ...
cleaning data ...
CPU time: 0.019963741302490234 seconds

> Number of common features : 13

gathering and crunching for train and test datasets ...
reindexing for train and test datasets ...
dropping training duplicates ...
dropping constant variables on training set ...

> Number of categorical features: 5
> Number of numerical features: 8
> Number of training samples : 211
> Number of test samples : 91

> You have no missing values on train set...

> Task : classification
1.0    109
0.0    102
Name: target, dtype: int64

encoding target ...


#### Last processing note

After building the dictionary, we process the data as below. MLBox advertises automatically dropping ids and [drifting variables](https://github.com/AxeldeRomblay/MLBox/blob/master/docs/webinars/features.pdf) between the train and test datasets. In my experience it does not automatically drop ids: the source code only seems to detect drift, and a randomly generated id field shows none.
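
Here is a rough sketch of why a random id shows no drift. As I understand it, MLBox fits a classifier per variable to tell train rows from test rows and derives the drift from its AUC. The version below is a simplification under that assumption: it uses the raw feature value as the score, the `auc` and `drift` helpers are hypothetical, and the numbers will not match MLBox's.

```python
def auc(train_values, test_values):
    """Probability that a random test value exceeds a random train value (ties count half)."""
    wins = 0.0
    for te in test_values:
        for tr in train_values:
            if te > tr:
                wins += 1.0
            elif te == tr:
                wins += 0.5
    return wins / (len(train_values) * len(test_values))

def drift(train_values, test_values):
    """0 means train and test look alike on this feature, 1 means fully separable."""
    return 2 * abs(auc(train_values, test_values) - 0.5)

# A feature whose distribution shifts between train and test
print(drift([1, 2, 3, 4], [5, 6, 7, 8]))
# An id-like column whose values are interleaved across train and test
print(drift([1, 4, 5, 8], [2, 3, 6, 7]))
```

An id carries no signal about which split a row landed in, so a drift threshold alone has no reason to remove it.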

In [14]:
# drop IDs and useless columns
mlb_data_hd = Drift_thresholder(to_path='/mnt/results').fit_transform(mlb_data_hd)


computing drifts ...
CPU time: 0.27405881881713867 seconds

> Top 10 drifts

('age', 0.16854601525233193)
('sex', 0.1267822961834446)
('restecg', 0.10845849013199294)
('ca', 0.08890731408778985)
('oldpeak', 0.08875589758280489)
('trestbps', 0.08502309117977713)
('thalach', 0.07370459349540548)
('thal', 0.07113500527364369)
('fbs', 0.06026993241923506)
('exang', 0.05645467921924041)

> Deleted variables : []
> Drift coefficients dumped into directory : /mnt/results


#### Build the modeling routine

#### Defining the search criteria

MLBox gives you good control over which modeling algorithms and parameter settings to try.

You define a space dictionary and pass it to the **optimise()** method of an **Optimiser** object.

Then you pass the best hyper-parameters it returns, together with the data dictionary, to the **fit_predict()** method of a **Predictor** object.
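
To show how a space dictionary of this kind can be read, here is a minimal random sampler. It is a sketch only: MLBox hands the search to hyperopt (listed in its requirements), not to this function, and treating a missing `search` key as `choice` is my assumption. The `sample_space` helper and `demo_space` are hypothetical.

```python
import random

def sample_space(space, rng):
    """Draw one configuration from a {'param': {'search': ..., 'space': [...]}} dictionary."""
    config = {}
    for name, spec in space.items():
        search = spec.get("search", "choice")  # assumption: default to a discrete choice
        values = spec["space"]
        if search == "choice":
            config[name] = rng.choice(values)
        elif search == "uniform":
            low, high = values
            config[name] = rng.uniform(low, high)
        else:
            raise ValueError("unsupported search type: " + search)
    return config

demo_space = {
    "est__strategy": {"space": ["LightGBM", "RandomForest"]},
    "est__max_depth": {"search": "choice", "space": [5, 10, 20]},
    "est__subsample": {"search": "uniform", "space": [0.6, 0.7]},
}

rng = random.Random(0)
print(sample_space(demo_space, rng))
```

The key prefixes select the pipeline step being tuned. Judging by the optimiser log, `ne__` is the NA encoder, `ce__` the categorical encoder, `fs__` the feature selector and `est__` the estimator.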

In [15]:
space = {

        'ne__numerical_strategy' : {"space" : [0, 'mean']},

        'ce__strategy' : {"space" : ["label_encoding", "random_projection", \
                                     "entity_embedding"]},

        'fs__strategy' : {"space" : ["variance", "rf_feature_importance"]},
        'fs__threshold': {"search" : "choice", "space" : [0.1, 0.2, 0.3]},

        'est__strategy' : {"space" : ["LightGBM", "RandomForest", "ExtraTrees",\
                                      "Linear"]},
        'est__max_depth' : {"search" : "choice", "space" : [5,10,20]},
        'est__subsample' : {"search" : "uniform", "space" : [0.6,0.7]}

        }

In [16]:
%%time

best_hd = Optimiser(to_path = '../results').optimise(space, mlb_data_hd, max_evals = 10)



##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'random_projection'}
>>> FEATURE SELECTOR :{'strategy': 'rf_feature_importance', 'threshold': 0.1}
>>> ESTIMATOR :{'strategy': 'ExtraTrees', 'max_depth': 5, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state': 0, 'verbose': 0, 'warm_start': False}
  0%|          | 0/10 [00:00<?, ?it/s, best loss: ?]

MEAN SCORE : neg_log_loss = -0.45016076541582595
VARIANCE : 0.0364766212935074 (fold 1 = -0.4136841441223185, fold 2 = -0.4866373867093333)
CPU time: 1.2122752666473389 seconds
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'entity_embedding'}
>>> FEATURE SELECTOR :{'strategy': 'rf_feature_importance', 'threshold': 0.3}
>>> ESTIMATOR :{'strategy': 'ExtraTrees', 'max_depth': 20, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state': 0, 'verbose': 0, 'warm_start': False}




MEAN SCORE : neg_log_loss = -0.47903456453258
VARIANCE : 0.0493714517947538 (fold 1 = -0.42966311273782615, fold 2 = -0.5284060163273337)
CPU time: 3.9905571937561035 seconds
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'label_encoding'}
>>> FEATURE SELECTOR :{'strategy': 'rf_feature_importance', 'threshold': 0.2}
>>> ESTIMATOR :{'strategy': 'ExtraTrees', 'max_depth': 10, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state': 0, 'verbo

MEAN SCORE : neg_log_loss = -0.4577304565518751
VARIANCE : 0.04399679251668459 (fold 1 = -0.4137336640351905, fold 2 = -0.5017272490685597)
CPU time: 1.1930012702941895 seconds
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'label_encoding'}
>>> FEATURE SELECTOR :{'strategy': 'rf_feature_importance', 'threshold': 0.3}
>>> ESTIMATOR :{'strategy': 'LightGBM', 'max_depth': 5, 'subsample': 0.6073569057269794, 'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.8, 'importance_type': 'split', 'learning_rate': 0.05, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 500, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': None, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample_for_bin': 200000, 'subsample_freq': 

MEAN SCORE : neg_log_loss = -0.5506958092288694
VARIANCE : 0.047523267304907024 (fold 1 = -0.5031725419239623, fold 2 = -0.5982190765337764)
CPU time: 0.2979397773742676 seconds
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 0, 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'random_projection'}
>>> FEATURE SELECTOR :{'strategy': 'variance', 'threshold': 0.3}
>>> ESTIMATOR :{'strategy': 'RandomForest', 'max_depth': 20, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state': 0, 'verbose': 0, 'warm_start': False}
 50%|█████     | 5/10 [00:09<00:07,  1.45s/it, best loss: 0.45016076541582595

MEAN SCORE : neg_log_loss = -0.49375991086701737
VARIANCE : 0.08285445396939733 (fold 1 = -0.41090545689762004, fold 2 = -0.5766143648364147)
CPU time: 1.0109946727752686 seconds
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 0, 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'random_projection'}
>>> FEATURE SELECTOR :{'strategy': 'rf_feature_importance', 'threshold': 0.1}
>>> ESTIMATOR :{'strategy': 'Linear', 'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'ovr', 'n_jobs': -1, 'penalty': 'l2', 'random_state': 0, 'solver': 'liblinear', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
MEAN SCORE : neg_log_loss = -0.5415426855277767
VARIANCE : 0.14420497505711996 (fold 1 = -0.39733771047065675, fold 2 = -0.6857476605848967)
CPU time: 0.1106445789337158

MEAN SCORE : neg_log_loss = -0.45016076541582595
VARIANCE : 0.0364766212935074 (fold 1 = -0.4136841441223185, fold 2 = -0.4866373867093333)
CPU time: 0.9656825065612793 seconds
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 0, 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'label_encoding'}
>>> FEATURE SELECTOR :{'strategy': 'variance', 'threshold': 0.1}
>>> ESTIMATOR :{'strategy': 'RandomForest', 'max_depth': 20, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state': 0, 'verbose': 0, 'warm_start': False}
 80%|████████  | 8/10 [00:11<00:01,  1.04it/s, best loss: 0.45016076541582595]

MEAN SCORE : neg_log_loss = -0.48315759810843484
VARIANCE : 0.07098589911583106 (fold 1 = -0.4121716989926038, fold 2 = -0.5541434972242659)
CPU time: 1.003422498703003 seconds
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 0, 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'random_projection'}
>>> FEATURE SELECTOR :{'strategy': 'variance', 'threshold': 0.3}
>>> ESTIMATOR :{'strategy': 'RandomForest', 'max_depth': 20, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state': 0, 'verbose': 0, 'warm_start': False}
 90%|█████████ | 9/10 [00:12<00:00,  1.02it/s, best loss: 0.45016076541582595]

MEAN SCORE : neg_log_loss = -0.49375991086701737
VARIANCE : 0.08285445396939733 (fold 1 = -0.41090545689762004, fold 2 = -0.5766143648364147)
CPU time: 0.9843924045562744 seconds
100%|██████████| 10/10 [00:13<00:00,  1.31s/it, best loss: 0.45016076541582595]


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ BEST HYPER-PARAMETERS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

{'ce__strategy': 'random_projection', 'est__max_depth': 5, 'est__strategy': 'ExtraTrees', 'est__subsample': 0.6116126711910101, 'fs__strategy': 'rf_feature_importance', 'fs__threshold': 0.1, 'ne__numerical_strategy': 'mean'}
CPU times: user 12.6 s, sys: 236 ms, total: 12.8 s
Wall time: 13.2 s
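
The `MEAN SCORE` and `VARIANCE` lines in the log above can be checked by hand from the two fold scores. The sketch below uses the fold values of the best run. It suggests that the figure labelled `VARIANCE` is the population standard deviation of the fold scores, which is my reading of the numbers and not something the log states.

```python
import math
import statistics

# Fold scores (neg_log_loss) copied from the best run in the log
folds = [-0.4136841441223185, -0.4866373867093333]

mean_score = statistics.mean(folds)
spread = statistics.pstdev(folds)  # population standard deviation

print(mean_score, spread)

# Compare with the values MLBox reported for the same run
assert math.isclose(mean_score, -0.45016076541582595, rel_tol=1e-9)
assert math.isclose(spread, 0.0364766212935074, rel_tol=1e-9)
```

The progress bar's `best loss` is the same mean score with the sign flipped, since the search minimises a loss.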


In [17]:
Predictor(to_path='/mnt/results').fit_predict(best_hd,mlb_data_hd)

fitting the pipeline ...
CPU time: 0.6410982608795166 seconds

> Feature importances dumped into directory : /mnt/results

predicting ...
CPU time: 0.10602092742919922 seconds

> Overview on predictions : 

        0.0       1.0  target_predicted
0  0.331040  0.668960                 1
1  0.072041  0.927959                 1
2  0.231957  0.768043                 1
3  0.139823  0.860177                 1
4  0.398015  0.601985                 1
5  0.250361  0.749639                 1
6  0.326035  0.673965                 1
7  0.095133  0.904867                 1
8  0.291248  0.708752                 1
9  0.390119  0.609881                 1

dumping predictions into directory : /mnt/results ...


<mlbox.prediction.predictor.Predictor at 0x7f5d44c635f8>

## Breast Cancer

#### Load the breast cancer dataset

In [18]:
'''
Attribute Information:

1) ID number 
2) Diagnosis (M = malignant, B = benign) 
3-32) 

Ten real-valued features are computed for each cell nucleus: 

a) radius (mean of distances from center to points on the perimeter) 
b) texture (standard deviation of gray-scale values) 
c) perimeter 
d) area 
e) smoothness (local variation in radius lengths) 
f) compactness (perimeter^2 / area - 1.0) 
g) concavity (severity of concave portions of the contour) 
h) concave points (number of concave portions of the contour) 
i) symmetry 
j) fractal dimension ("coastline approximation" - 1)
'''

#column names
names = ['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', \
         'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', \
         'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean', \
         'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', \
         'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se', \
         'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', \
         'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', \
         'concave_points_worst', 'symmetry_worst', 'fractal_dimension_worst']

#load data from Domino project directory
bc_data = pd.read_csv("/mnt/data/raw/breast_cancer.csv", index_col=False, header=0, names=names)

#create MLBox random samples for train and test
bc_data_train = bc_data.sample(frac=0.7, replace=False, random_state=1)
# test set = every row whose index was not sampled into the training set
bc_data_test = bc_data.drop(bc_data_train.index)
bc_data_test_wo_target = bc_data_test.drop('diagnosis', axis=1)

bc_data_train.to_csv('/mnt/data/processed/bc_data_train.csv', index=False)
bc_data_test.to_csv('/mnt/data/processed/bc_data_test.csv', index=False)
bc_data_test_wo_target.to_csv('/mnt/data/processed/bc_data_test_wo_target.csv', index=False)

In [19]:
# the list of paths to your train datasets and test datasets
paths_bc = ["/mnt/data/processed/bc_data_train.csv", \
         "/mnt/data/processed/bc_data_test_wo_target.csv"]

# the name of the target you try to predict (classification or regression)
target_bc = "diagnosis"

#### Process the data

In [20]:
# to read and preprocess your files
mlb_data_bc = Reader(sep=",", to_path = '/mnt/results').train_test_split(paths_bc, target_bc)


reading csv : bc_data_train.csv ...
cleaning data ...
CPU time: 0.039377689361572266 seconds

reading csv : bc_data_test_wo_target.csv ...
cleaning data ...
CPU time: 0.031555891036987305 seconds

> Number of common features : 31

gathering and crunching for train and test datasets ...
reindexing for train and test datasets ...
dropping training duplicates ...
dropping constant variables on training set ...

> Number of categorical features: 0
> Number of numerical features: 31
> Number of training samples : 398
> Number of test samples : 171

> You have no missing values on train set...

> Task : classification
B    249
M    149
Name: diagnosis, dtype: int64

encoding target ...


In [21]:
# drop IDs and useless columns
mlb_data_bc = Drift_thresholder(to_path='/mnt/results').fit_transform(mlb_data_bc)


computing drifts ...
CPU time: 0.42824673652648926 seconds

> Top 10 drifts

('smoothness_se', 0.12057311179701502)
('concavity_se', 0.10426929448886013)
('perimeter_se', 0.06487189710522512)
('texture_mean', 0.06449965284699832)
('texture_se', 0.06002275398882251)
('symmetry_se', 0.05570224583931971)
('perimeter_mean', 0.054076813616646735)
('fractal_dimension_mean', 0.04680344265788583)
('texture_worst', 0.044321126838020586)
('radius_worst', 0.041591679326866915)

> Deleted variables : []
> Drift coefficients dumped into directory : /mnt/results


#### Optimise the space and fit the model

In [22]:
%%time

best_bc = Optimiser(to_path = '/mnt/results').optimise(space, mlb_data_bc, max_evals = 10)



##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'label_encoding'}
>>> FEATURE SELECTOR :{'strategy': 'variance', 'threshold': 0.2}
>>> ESTIMATOR :{'strategy': 'RandomForest', 'max_depth': 10, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state': 0, 'verbose': 0, 'warm_start': False}
  0%|          | 0/10 [00:00<?, ?it/s, best loss: ?]

MEAN SCORE : neg_log_loss = -0.15344310487726132
VARIANCE : 0.014118581089442568 (fold 1 = -0.13932452378781876, fold 2 = -0.1675616859667039)
CPU time: 1.1181492805480957 seconds
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 0, 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'label_encoding'}
>>> FEATURE SELECTOR :{'strategy': 'rf_feature_importance', 'threshold': 0.1}
>>> ESTIMATOR :{'strategy': 'Linear', 'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'ovr', 'n_jobs': -1, 'penalty': 'l2', 'random_state': 0, 'solver': 'liblinear', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
 10%|█         | 1/10 [00:01<00:10,  1.13s/it, best loss: 0.15344310487726132]

MEAN SCORE : neg_log_loss = -0.6698380210671552
VARIANCE : 0.0006972148707201642 (fold 1 = -0.6705352359378753, fold 2 = -0.669140806196435)
CPU time: 0.329059362411499 seconds
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 0, 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'random_projection'}
>>> FEATURE SELECTOR :{'strategy': 'variance', 'threshold': 0.2}
>>> ESTIMATOR :{'strategy': 'Linear', 'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'ovr', 'n_jobs': -1, 'penalty': 'l2', 'random_state': 0, 'solver': 'liblinear', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
MEAN SCORE : neg_log_loss = -0.6698380210671342
VARIANCE : 0.0006972148707410919 (fold 1 = -0.6705352359378753, fold 2 = -0.6691408061963932)
CPU time: 0.06641840934753418 seconds
####

MEAN SCORE : neg_log_loss = -0.6698380210671342
VARIANCE : 0.0006972148707411474 (fold 1 = -0.6705352359378753, fold 2 = -0.669140806196393)
CPU time: 0.1796419620513916 seconds
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 0, 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'entity_embedding'}
>>> FEATURE SELECTOR :{'strategy': 'variance', 'threshold': 0.3}
>>> ESTIMATOR :{'strategy': 'Linear', 'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'ovr', 'n_jobs': -1, 'penalty': 'l2', 'random_state': 0, 'solver': 'liblinear', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
MEAN SCORE : neg_log_loss = -0.6698380210671342
VARIANCE : 0.0006972148707412029 (fold 1 = -0.6705352359378753, fold 2 = -0.6691408061963929)
CPU time: 0.12563180923461914 seconds
####

MEAN SCORE : neg_log_loss = -0.14987048638522532
VARIANCE : 0.006325396645336351 (fold 1 = -0.14354508973988897, fold 2 = -0.15619588303056167)
CPU time: 1.252192497253418 seconds
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'label_encoding'}
>>> FEATURE SELECTOR :{'strategy': 'rf_feature_importance', 'threshold': 0.1}
>>> ESTIMATOR :{'strategy': 'RandomForest', 'max_depth': 5, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state': 0, 'verbose': 0, 'warm_start': False}
 60%|██████    | 6/10 [00:03<00:02,  1.36it/s, best loss: 0.1

MEAN SCORE : neg_log_loss = -0.15231232453080193
VARIANCE : 0.008258411079460348 (fold 1 = -0.14405391345134158, fold 2 = -0.16057073561026228)
CPU time: 1.1310806274414062 seconds
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 0, 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'label_encoding'}
>>> FEATURE SELECTOR :{'strategy': 'variance', 'threshold': 0.3}
>>> ESTIMATOR :{'strategy': 'LightGBM', 'max_depth': 10, 'subsample': 0.680876860337814, 'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.8, 'importance_type': 'split', 'learning_rate': 0.05, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 500, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': None, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'nthread': 

MEAN SCORE : neg_log_loss = -0.14987048638522532
VARIANCE : 0.006325396645336351 (fold 1 = -0.14354508973988897, fold 2 = -0.15619588303056167)
CPU time: 1.1349759101867676 seconds
##################################################### testing hyper-parameters... #####################################################
>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}
>>> CA ENCODER :{'strategy': 'entity_embedding'}
>>> FEATURE SELECTOR :{'strategy': 'variance', 'threshold': 0.3}
>>> ESTIMATOR :{'strategy': 'ExtraTrees', 'max_depth': 5, 'bootstrap': True, 'class_weight': None, 'criterion': 'gini', 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': -1, 'oob_score': False, 'random_state': 0, 'verbose': 0, 'warm_start': False}
MEAN SCORE : neg_log_loss = -0.16601760645184827
VARIANCE : 0.000573284414337516 (fold 1 = -0.16544432203751075, fold 2 = -0.16659089086618578)
CPU time: 0.9411675930023193 seconds
100%|██████████| 10/10 [00:08<00:00,  1.23it/s, best loss: 0.14987048638522532]


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ BEST HYPER-PARAMETERS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

{'ce__strategy': 'label_encoding', 'est__max_depth': 10, 'est__strategy': 'RandomForest', 'est__subsample': 0.657658568660771, 'fs__strategy': 'rf_feature_importance', 'fs__threshold': 0.3, 'ne__numerical_strategy': 0}
CPU times: user 7.62 s, sys: 180 ms, total: 7.8 s
Wall time: 8.2 s
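
The summary lines in the log can be checked by hand. The sketch below takes the two fold scores printed for one of the trials above and recomputes the reported figures. It suggests that the value labelled `VARIANCE` matches the population standard deviation of the fold scores, and that `best loss` in the progress bar is the negated mean score. This is inferred from the printed numbers only, not from the MLBox source, so treat it as an assumption.

```python
import numpy as np

# Fold scores copied from one trial in the log above (neg_log_loss, 2 folds).
fold_scores = np.array([-0.14354508973988897, -0.15619588303056167])

mean_score = fold_scores.mean()
# Population standard deviation (ddof=0); assumed to be what the log labels "VARIANCE".
spread = fold_scores.std()

print("MEAN SCORE :", mean_score)
print("VARIANCE   :", spread)
print("best loss  :", -mean_score)  # the optimizer minimizes the negated score
```

Because `neg_log_loss` is a "higher is better" score, the trial with the smallest `best loss` is the one with the mean score closest to zero.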


In [23]:
Predictor(to_path='/mnt/results').fit_predict(best_bc,mlb_data_bc)

fitting the pipeline ...
CPU time: 0.7908856868743896 seconds

> Feature importances dumped into directory : /mnt/results

predicting ...
CPU time: 0.02890777587890625 seconds

> Overview on predictions : 

        B       M diagnosis_predicted
0  0.0000  1.0000                   M
1  0.0875  0.9125                   M
2  0.1875  0.8125                   M
3  0.0175  0.9825                   M
4  0.9900  0.0100                   B
5  0.0750  0.9250                   M
6  0.0075  0.9925                   M
7  0.0175  0.9825                   M
8  0.0000  1.0000                   M
9  0.3025  0.6975                   M

dumping predictions into directory : /mnt/results ...


<mlbox.prediction.predictor.Predictor at 0x7f5ccd4fb0f0>
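
In the ten rows of the overview, the `diagnosis_predicted` column coincides with whichever class column (`B` or `M`) holds the larger probability. A minimal sketch of that relationship, using four rows copied from the overview; whether MLBox derives the label exactly this way is an assumption, and `preview` and `labels` are names made up for illustration.

```python
import pandas as pd

# Four rows copied from the prediction overview above (rows 0, 1, 4 and 9).
preview = pd.DataFrame(
    {"B": [0.0000, 0.0875, 0.9900, 0.3025],
     "M": [1.0000, 0.9125, 0.0100, 0.6975]}
)

# Pick, for each row, the class column with the highest probability.
labels = preview[["B", "M"]].idxmax(axis=1)
print(labels.tolist())
```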

## Print Accuracy and Save to Domino Stats File

Saving stats to this file [allows Domino to track and trend them in the Experiment Manager](https://support.dominodatalab.com/hc/en-us/articles/204348169-Diagnostic-statistics-with-dominostats-json) when this notebook is run as a batch or scheduled job.

In [24]:
# this predictions file is the output of the Predictor.fit_predict call above
bc_pred = pd.read_csv('/mnt/results/diagnosis_predictions.csv')
y_bc_pred = bc_pred['diagnosis_predicted']

# these are the answers from the file stored in the project
bc_test = pd.read_csv('/mnt/data/processed/bc_data_test.csv')
y_bc_test = bc_test['diagnosis']

# this predictions file is the output of the corresponding Predictor call for the hd data
hd_pred = pd.read_csv('/mnt/results/target_predictions.csv')
y_hd_pred = hd_pred['target_predicted']

# these are the answers from the file stored in the project
hd_test = pd.read_csv('/mnt/data/processed/hd_data_test.csv')
y_hd_test = hd_test['target']

In [25]:
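# `import sklearn` alone does not guarantee that the `metrics` submodule is loaded,
# so import the function explicitly
from sklearn.metrics import accuracy_score

# accuracy_score expects (y_true, y_pred)
hd_acc = accuracy_score(y_hd_test, y_hd_pred)
bc_acc = accuracy_score(y_bc_test, y_bc_pred)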

print('bc ', bc_acc)
print('hd ', hd_acc)

bc  0.9766081871345029
hd  0.8901098901098901
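
With its default settings, `accuracy_score` returns the fraction of predictions that match the true labels. The sketch below reproduces that calculation with plain pandas on a handful of hypothetical labels (invented here for illustration, not taken from either test set).

```python
import pandas as pd

# Hypothetical labels, invented for illustration only.
y_true = pd.Series(["M", "B", "B", "M", "B"])
y_pred = pd.Series(["M", "B", "M", "M", "B"])

# Element-wise comparison gives booleans; their mean is the share of matches.
accuracy = (y_true == y_pred).mean()
print(accuracy)
```

Accuracy alone hides which class the errors fall in, so for a diagnosis task it may be worth inspecting the per-class errors as well.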


#### Save to Domino

In [58]:
import json

# write the accuracies to the stats file that Domino reads
with open('/mnt/dominostats.json', 'w') as f:
    json.dump({"HD_ACC": hd_acc, "BC_ACC": bc_acc}, f)