# Example notebook

This notebook walks through a regression problem using scikit-learn's *California Housing* dataset.

In [None]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

X, y = fetch_california_housing(data_home='miraiml_local', return_X_y=True)
data = pd.DataFrame(X)
data['target'] = y

Let's split the data into training and testing sets. In a real-world scenario, we'd only have labels for the training data.

In [None]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(data, test_size=0.2)
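
Note that `train_test_split` shuffles the rows, so the split above differs from run to run, and so will the scores reported later. A minimal sketch of a reproducible split, assuming a fixed `random_state` is acceptable for the experiment; the small `toy` frame and the seed `0` are stand-ins and not part of the original notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for `data`: 10 rows, one feature column and a target
toy = pd.DataFrame({'feature': range(10), 'target': range(10, 20)})

# Fixing random_state makes the shuffle, and therefore the split, repeatable
first_train, first_test = train_test_split(toy, test_size=0.2, random_state=0)
second_train, second_test = train_test_split(toy, test_size=0.2, random_state=0)

print(len(first_train), len(first_test))
print(first_test.index.equals(second_test.index))
```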

## Building the search spaces

Let's compare (and ensemble) a `KNeighborsRegressor` and a pipeline composed of a `StandardScaler` and a `LinearRegression`.

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

from miraiml import SearchSpace
from miraiml.pipeline import compose

Pipeline = compose(
    [('scaler', StandardScaler), ('lin_reg', LinearRegression)]
)

search_spaces = [
    SearchSpace(
        id='k-NN Regressor',
        model_class=KNeighborsRegressor,
        parameters_values=dict(
            n_neighbors=range(2, 9),
            weights=['uniform', 'distance'],
            p=range(2, 5)
        )
    ),
    SearchSpace(
        id='Pipeline',
        model_class=Pipeline,
        parameters_values=dict(
            scaler__with_mean=[True, False],
            scaler__with_std=[True, False],
            lin_reg__fit_intercept=[True, False]
        )
    )
]
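
To get a feel for how large each search space is, the candidate values can be counted by hand. The sketch below only restates the `parameters_values` dictionaries from the cell above and multiplies the number of options per parameter. `grid_size` is a helper introduced here for illustration, not part of MiraiML, and it says nothing about how the Engine actually samples these values:

```python
from math import prod

def grid_size(parameters_values):
    """Count the distinct parameter combinations in a dict of candidate values."""
    return prod(len(values) for values in parameters_values.values())

knn_values = dict(
    n_neighbors=range(2, 9),
    weights=['uniform', 'distance'],
    p=range(2, 5)
)
pipeline_values = dict(
    scaler__with_mean=[True, False],
    scaler__with_std=[True, False],
    lin_reg__fit_intercept=[True, False]
)

print(grid_size(knn_values))       # 7 * 2 * 3 = 42
print(grid_size(pipeline_values))  # 2 * 2 * 2 = 8
```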

## Configuring the Engine

For this test, let's use `r2_score` to evaluate our models.

In [None]:
from sklearn.metrics import r2_score

from miraiml import Config

config = Config(
    local_dir='miraiml_local',
    problem_type='regression',
    score_function=r2_score,
    search_spaces=search_spaces,
    ensemble_id='Ensemble'
)
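
As a reminder of what `r2_score` measures, the coefficient of determination compares the model's squared error with that of always predicting the mean of the targets. A minimal pure-Python sketch of that formula follows; the `r2` helper and the toy numbers are illustrative only, and the notebook itself relies on scikit-learn's implementation:

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

print(r2(y_true, y_true))                # perfect predictions -> 1.0
print(r2(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean -> 0.0
```

Higher is better: 1.0 is a perfect fit, 0.0 matches a constant mean prediction, and the value turns negative when the model does worse than that baseline.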

## Triggering the Engine

Let's also print the scores every time the Engine finds a better solution for any base model.

In [None]:
from miraiml import Engine

def on_improvement(status):
    scores = status.scores
    for key in sorted(scores.keys()):
        print('{}: {}'.format(key, round(scores[key], 3)), end='; ')
    print()

engine = Engine(config=config, on_improvement=on_improvement)
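
To see what the callback prints, it can be called by hand with a stand-in object. The sketch below assumes only what the callback itself uses, namely that `status.scores` behaves like a dictionary mapping ids to numeric scores. The `SimpleNamespace` and the score values are made up for illustration and are not real Engine output:

```python
from types import SimpleNamespace

def on_improvement(status):
    scores = status.scores
    for key in sorted(scores.keys()):
        print('{}: {}'.format(key, round(scores[key], 3)), end='; ')
    print()

# Hypothetical scores, only to exercise the formatting
fake_status = SimpleNamespace(scores={
    'k-NN Regressor': 0.68812,
    'Pipeline': 0.60149,
    'Ensemble': 0.71234,
})

on_improvement(fake_status)
# Ensemble: 0.712; Pipeline: 0.601; k-NN Regressor: 0.688;
```

The ids are sorted as plain strings, so capitalized ids come before lowercase ones.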

Now we're ready to load the data.

In [None]:
engine.load_train_data(train_data, 'target')
engine.load_test_data(test_data)

Let's leave it running for 2 minutes and then interrupt it.

In [None]:
from time import sleep

engine.restart()

sleep(120)

engine.interrupt()
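
If the cell is stopped by hand or an exception is raised while sleeping, the `engine.interrupt()` call above is skipped. A hedged sketch of a small helper that always interrupts, using only the `restart()` and `interrupt()` calls shown above; `run_for` and the `FakeEngine` stand-in are hypothetical names introduced here so the snippet runs without MiraiML:

```python
from time import sleep

def run_for(engine, seconds):
    """Restart the engine, wait, and interrupt it even if the wait is cut short."""
    engine.restart()
    try:
        sleep(seconds)
    finally:
        engine.interrupt()

class FakeEngine:
    """Stand-in that records calls; a real miraiml.Engine would go here."""
    def __init__(self):
        self.calls = []

    def restart(self):
        self.calls.append('restart')

    def interrupt(self):
        self.calls.append('interrupt')

fake_engine = FakeEngine()
run_for(fake_engine, 0.01)
print(fake_engine.calls)  # ['restart', 'interrupt']
```

With the Engine built earlier, the equivalent call would be `run_for(engine, 120)`.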

## Status analysis

In [None]:
status = engine.request_status()

Let's see the status report.

In [None]:
print(status.build_report(include_features=True))

How does the k-NN Regressor's score change with `n_neighbors`, on average?

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

knn_history = status.histories['k-NN Regressor']

# Select 'score' before averaging so non-numeric columns are left out of the mean
knn_history\
    .groupby('n_neighbors__(hyperparameter)')['score'].mean()\
    .reset_index()\
    .plot.scatter(x='n_neighbors__(hyperparameter)', y='score')

plt.show()
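
The aggregation behind the plot can be checked on a tiny table. The frame below is a made-up stand-in for `status.histories['k-NN Regressor']`, assuming only the two column names used above; the scores are invented to make the arithmetic easy to follow:

```python
import pandas as pd

# Hypothetical history: two attempts with n_neighbors=2, two with n_neighbors=3
toy_history = pd.DataFrame({
    'n_neighbors__(hyperparameter)': [2, 2, 3, 3],
    'score': [0.50, 0.70, 0.60, 0.80],
})

# One row per n_neighbors value, holding the mean score of its attempts
mean_scores = (
    toy_history
    .groupby('n_neighbors__(hyperparameter)')['score']
    .mean()
    .reset_index()
)
print(mean_scores)
```

Each point in the scatter plot corresponds to one row of such a table: the x value is the `n_neighbors` setting and the y value is the mean score over all attempts that used it.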

Again, in practice we wouldn't have labels for `test_data`, but how would the Engine perform on the test dataset?

In [None]:
r2_score(test_data['target'], status.test_predictions['Ensemble'])