# MiraiML usage example

## The dataset

Let's use the CSV file `pulsar_stars.csv` to explore what MiraiML can do. The dataset was downloaded from Kaggle ([Predicting a Pulsar Star](https://www.kaggle.com/pavanraj159/predicting-a-pulsar-star)).

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

data = pd.read_csv('pulsar_stars.csv')
data.head()

| | Mean of the integrated profile | Standard deviation of the integrated profile | Excess kurtosis of the integrated profile | Skewness of the integrated profile | Mean of the DM-SNR curve | Standard deviation of the DM-SNR curve | Excess kurtosis of the DM-SNR curve | Skewness of the DM-SNR curve | target_class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 140.5625 | 55.683782 | -0.234571 | -0.699648 | 3.199833 | 19.110426 | 7.975532 | 74.242225 | 0 |
| 1 | 102.507812 | 58.88243 | 0.465318 | -0.515088 | 1.677258 | 14.860146 | 10.576487 | 127.39358 | 0 |
| 2 | 103.015625 | 39.341649 | 0.323328 | 1.051164 | 3.121237 | 21.744669 | 7.735822 | 63.171909 | 0 |
| 3 | 136.75 | 57.178449 | -0.068415 | -0.636238 | 3.642977 | 20.95928 | 6.896499 | 53.593661 | 0 |
| 4 | 88.726562 | 40.672225 | 0.600866 | 1.123492 | 1.17893 | 11.46872 | 14.269573 | 252.567306 | 0 |


In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17898 entries, 0 to 17897
Data columns (total 9 columns):
 Mean of the integrated profile                  17898 non-null float64
 Standard deviation of the integrated profile    17898 non-null float64
 Excess kurtosis of the integrated profile       17898 non-null float64
 Skewness of the integrated profile              17898 non-null float64
 Mean of the DM-SNR curve                        17898 non-null float64
 Standard deviation of the DM-SNR curve          17898 non-null float64
 Excess kurtosis of the DM-SNR curve             17898 non-null float64
 Skewness of the DM-SNR curve                    17898 non-null float64
target_class                                     17898 non-null int64
dtypes: float64(8), int64(1)
memory usage: 1.2 MB


It's a clean, simple binary classification dataset with no missing values, and the target column is called `target_class`. Let's suppose that we have a labeled training dataset and a testing dataset for which we don't have labels.

In [3]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(data, stratify=data['target_class'], test_size=0.2, random_state=0)
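
To see what `stratify` buys us, here is a small self-contained sketch on synthetic data. The toy frame and its 9% positive rate are made up for illustration and are not taken from the pulsar file. With stratification, the positive rate should be nearly identical in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced dataset (made-up 9% positive rate).
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    'feature': rng.normal(size=1000),
    'target_class': np.repeat([0, 1], [910, 90]),
})

# Same call pattern as above: stratify on the target column.
toy_train, toy_test = train_test_split(
    toy, stratify=toy['target_class'], test_size=0.2, random_state=0
)

# The mean of a 0/1 column is the share of positives.
train_rate = toy_train['target_class'].mean()
test_rate = toy_test['target_class'].mean()
print(f'positive rate: train={train_rate:.3f}, test={test_rate:.3f}')
```

Without `stratify`, a rare class can end up noticeably over- or under-represented in the smaller part just by chance.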

## The Engine

### `MiraiLayout`

First, let's define our list of `MiraiLayout` objects.

In [4]:
from core import MiraiLayout

mirai_layouts = []

#### Random Forest and Extra Trees

Let's use the same search space for both of them.

In [5]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
import numpy as np

parameters = {
    'n_estimators': np.arange(5, 30),
    'max_depth': np.arange(2, 20),
    'min_samples_split': np.arange(0.1, 1.1, 0.1),
    'min_weight_fraction_leaf': np.arange(0, 0.6, 0.1),
    'random_state': [0]
}

mirai_layouts += [
    MiraiLayout(model_class=RandomForestClassifier, id='Random Forest', parameters_values=parameters),
    MiraiLayout(ExtraTreesClassifier, 'Extra Trees', parameters),
]
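
Each entry of the dictionary maps a parameter name to the candidate values it may take. As a rough mental model, one set of parameters could be drawn from such a dictionary as sketched below. Note that `sample_parameters` is a hypothetical helper written for this example and that uniform sampling is an assumption; MiraiML's actual search strategy may differ.

```python
import numpy as np

def sample_parameters(parameters_values, rng):
    """Pick one candidate value per parameter, uniformly at random."""
    return {
        name: values[rng.randint(len(values))]
        for name, values in parameters_values.items()
    }

# A reduced copy of the search space above, just for the demonstration.
toy_space = {
    'n_estimators': np.arange(5, 30),
    'max_depth': np.arange(2, 20),
    'random_state': [0],
}

toy_rng = np.random.RandomState(0)
for _ in range(3):
    print(sample_parameters(toy_space, toy_rng))
```

A single-element list such as `'random_state': [0]` simply pins that parameter to one value.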

#### Gradient Boosting

Let's use a similar search space, adding `learning_rate` and `subsample`.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

mirai_layouts.append(MiraiLayout(
    GradientBoostingClassifier, 'Gradient Boosting', {
        'n_estimators': np.arange(10, 130),
        'learning_rate': np.arange(0.05, 0.15, 0.01),
        'subsample': np.arange(0.5, 1, 0.01),
        'max_depth': np.arange(2, 20),
        'min_weight_fraction_leaf': np.arange(0, 0.6, 0.1),
        'random_state': [0]
    }
))

#### Logistic Regression

Let's try something new here: let's constrain a parameter with a `parameters_rules` function.

In [None]:
from sklearn.linear_model import LogisticRegression

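def logistic_regression_parameters_rules(parameters):
    # 'newton-cg', 'sag' and 'lbfgs' do not support the L1 penalty,
    # so force 'l2' whenever one of them is chosen (modifies the dict in place).
    if parameters['solver'] in ['newton-cg', 'sag', 'lbfgs']:
        parameters['penalty'] = 'l2'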

mirai_layouts.append(MiraiLayout(LogisticRegression, 'Logistic Regression', {
        'penalty': ['l1', 'l2'],
        'C': np.arange(0.1, 2, 0.1),
        'max_iter': np.arange(50, 300),
        'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
        'random_state': [0]
    },
    parameters_rules=logistic_regression_parameters_rules
))
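
To make the effect of the rule concrete, here it is applied by hand to two hand-written parameter sets. The function is repeated so that the snippet runs on its own; how and when MiraiML calls it internally is not shown here.

```python
def logistic_regression_parameters_rules(parameters):
    # Same rule as in the cell above: fix the penalty for L2-only solvers.
    if parameters['solver'] in ['newton-cg', 'sag', 'lbfgs']:
        parameters['penalty'] = 'l2'

# An invalid combination: 'lbfgs' cannot be used with 'l1'.
invalid = {'penalty': 'l1', 'solver': 'lbfgs', 'C': 1.0}
logistic_regression_parameters_rules(invalid)
print(invalid['penalty'])  # l2

# A valid combination is left untouched.
untouched = {'penalty': 'l1', 'solver': 'liblinear', 'C': 1.0}
logistic_regression_parameters_rules(untouched)
print(untouched['penalty'])  # l1
```

The function returns nothing and edits the dictionary in place, so the caller sees the corrected values directly.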

#### Gaussian Naive Bayes

No parameters here. The engine will just search for an interesting set of features.

In [None]:
from sklearn.naive_bayes import GaussianNB

mirai_layouts.append(MiraiLayout(GaussianNB, 'Gaussian NB'))
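
What could searching for a set of features look like? Here is a minimal sketch, assuming that random non-empty subsets of the column names are tried. `sample_features` is a hypothetical helper and the feature names are shortened placeholders; MiraiML's own strategy may differ.

```python
import random

def sample_features(all_features, rng):
    """Pick a random non-empty subset of the feature names."""
    size = rng.randint(1, len(all_features))  # both bounds are inclusive
    return sorted(rng.sample(all_features, size))

# Placeholder names standing in for the real column names.
toy_features = ['mean_ip', 'std_ip', 'kurt_ip', 'skew_ip', 'mean_dm', 'std_dm']

toy_feature_rng = random.Random(0)
for _ in range(3):
    print(sample_features(toy_features, toy_feature_rng))
```

Each candidate subset would then be scored by cross-validation, keeping the best one found so far.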

#### K-Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

mirai_layouts.append(MiraiLayout(KNeighborsClassifier, 'K-NN', {
    'n_neighbors': np.arange(1, 15),
    'weights': ['uniform', 'distance'],
    'p': np.arange(1, 5)
}))

Alright. Good enough for now.

### `MiraiConfig`

Now we define the general behavior of the engine.

In [None]:
from sklearn.metrics import roc_auc_score
from core import MiraiConfig

config = MiraiConfig(dict(
    n_folds=5,
    problem_type='classification',
    stratified=True,
    score_function=roc_auc_score,
    mirai_layouts=mirai_layouts,
    ensemble_id='Ensemble',
    n_ensemble_cycles=40,
    report=False
))
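
To make `n_folds`, `stratified` and `score_function` more tangible, here is a sketch of stratified 5-fold out-of-fold scoring written with plain scikit-learn on synthetic data. It only illustrates the general technique and makes no claim about how MiraiML implements it; the `toy_` names and the choice of model are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced binary problem (about 90% negatives).
toy_X, toy_y = make_classification(
    n_samples=500, n_features=8, weights=[0.9], random_state=0
)

# Every row gets exactly one prediction, made by a model that never saw it.
toy_oof = np.zeros(len(toy_y))
toy_folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fit_idx, val_idx in toy_folds.split(toy_X, toy_y):
    model = LogisticRegression(max_iter=200).fit(toy_X[fit_idx], toy_y[fit_idx])
    toy_oof[val_idx] = model.predict_proba(toy_X[val_idx])[:, 1]

# Score all out-of-fold predictions at once with the chosen metric.
toy_cv_score = roc_auc_score(toy_y, toy_oof)
print(f'out-of-fold ROC AUC: {toy_cv_score:.3f}')
```

Stratified folds keep the class ratio roughly constant across folds, which matters for an imbalanced target.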

Ok, that was easy.

### `MiraiML`

Now we will see it working.

In [None]:
from core import MiraiML

mirai_ml = MiraiML(config)

Let's load the training and testing datasets.

In [None]:
mirai_ml.update_data(train_data, test_data, target='target_class')

Ready to roll. To keep this notebook clean, let's print the scores every 10 seconds, three times in total, and then interrupt the engine.

In [None]:
from time import sleep

mirai_ml.restart()

for _ in range(3):
    sleep(10)
    mirai_ml.report()

mirai_ml.interrupt()
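
The loop above suggests that the engine keeps working in the background between `restart()` and `interrupt()`, which is why we can sleep and ask for a report in the meantime. That start, poll and stop pattern can be sketched with a plain thread. `ToyEngine` is a hypothetical stand-in, not MiraiML's implementation, and the sleeps are shortened.

```python
import threading
import time

class ToyEngine:
    """Hypothetical stand-in for a background search loop (not MiraiML)."""

    def __init__(self):
        self._stop = threading.Event()
        self._thread = None
        self.iterations = 0

    def restart(self):
        # Reset the state and start working in a background thread.
        self._stop.clear()
        self.iterations = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Keep "searching" until someone asks us to stop.
        while not self._stop.is_set():
            self.iterations += 1
            time.sleep(0.01)

    def interrupt(self):
        # Signal the loop to stop and wait for the thread to finish.
        self._stop.set()
        self._thread.join()

toy_engine = ToyEngine()
toy_engine.restart()
for _ in range(3):
    time.sleep(0.05)
    print('iterations so far:', toy_engine.iterations)
toy_engine.interrupt()
```

Because the work happens in another thread, the notebook stays responsive while the search is running.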

We can also request the predictions for the testing data anytime we want:

In [None]:
test_predictions = mirai_ml.request_predictions()
test_predictions

Out of curiosity, let's see how well we performed on the testing data.

In [None]:
roc_auc_score(test_data['target_class'], test_predictions)
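
As a reminder of what this number means: ROC AUC is the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counting as half. Here is a tiny pure-Python sketch. `pairwise_auc` is a hypothetical helper, quadratic in the number of samples and meant for illustration only; use `roc_auc_score` in practice.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC computed by comparing every positive with every negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    # A correctly ordered pair counts as 1, a tie as 0.5, otherwise 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Three of the four (positive, negative) pairs are ordered correctly.
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A score of 0.5 corresponds to random ordering and 1.0 to a perfect ranking.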

That's it for now. There's more to come!