In [None]:
# setup

import sys
from pathlib import Path

try:
    dirpath = Path(globals()['_dh'][0]).parent
except KeyError:
    dirpath = Path(__file__).parent.parent
sys.path.append(str(dirpath))

import logging
import warnings

logging.basicConfig(level=logging.ERROR)
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)


## Installation

We recommend creating a virtual environment to run ngAutonML. To do so with conda, run:
```
conda create -n env-name python=3.9
conda activate env-name
```

Then install the package with `pip`:
```
pip install ngautonml
```

ngAutonML is designed to run on Python 3.9 and above.

In [None]:
%%capture
import os
if 'RUNNING_IN_TEST' not in os.environ:
    %pip install ngautonml

## Select a Dataset

For this example, we will use the [sklearn breast cancer dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer), a binary classification problem. 
ngAutonML also supports regression, multiclass classification, image classification and time series forecasting.

In [None]:
from sklearn import datasets
breast_cancer = datasets.load_breast_cancer(as_frame=True)
df = breast_cancer.frame
df

## Split into Train and Test Sets

The train set is used to fit and rank pipelines. The test set is optional: it holds the data that the trained pipelines will predict on. It should not contain the target column (ngAutonML will remove the column if it finds it).

In [None]:
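# The dataset has 569 rows: use the first 549 for training and the last 20 for testing.
train = df.head(549)
# Drop the target column from the test set without modifying df in place.
test = df.tail(20).drop(columns=['target'])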
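The split above takes the first 549 rows and the last 20 in their original order. A shuffled, stratified split is a common alternative. The following is a minimal sketch using scikit-learn's `train_test_split`, which is not part of ngAutonML; the names `train_alt` and `test_alt` are chosen here only to avoid overwriting `train` and `test`.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

df = datasets.load_breast_cancer(as_frame=True).frame

# Hold out 20 shuffled rows while preserving the class proportions
# of the target column (an alternative to head/tail, not required).
train_alt, test_alt = train_test_split(
    df, test_size=20, stratify=df["target"], random_state=0
)

# The test set should not carry the target column.
test_alt = test_alt.drop(columns=["target"])

print(train_alt.shape, test_alt.shape)
```

To use this split in the rest of the notebook, you would assign the results to `train` and `test` instead.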

## Create a Problem Definition

A [ProblemDefinition](https://autonlab.gitlab.io/ngautonml/_autosummary/ngautonml.problem_def.problem_def.ProblemDefinition.html) fully defines a machine learning problem for ngAutonML to solve. In this example, we create one in Python, but you can also read one from a `problem_def.json` file by supplying a path to the file as the first argument.

Setting `"config" : "memory"` indicates that the data is stored in memory as a pandas DataFrame (alternatively, set `"config" : "local"` for a dataset stored in a CSV file). The `"column_roles"` clause specifies the name of the target column (in this case, just `"target"`). The `"metrics"` clause specifies which metrics to use for scoring pipelines, in this case `accuracy_score` and `roc_auc_score`.

### Caution

Setting `"hyperparams" : ["disable_grid_search"]` makes ngAutonML run considerably faster at the cost of worse results. Remove this setting when running production problems.

In [None]:
from ngautonml.problem_def.problem_def import ProblemDefinition
pdef_dict = {
    "dataset" : {
        "config" : "memory",
        "params" : {
            "train_data": "train",
            "test_data": "test"
        },
        "column_roles": {
            "target": {
                "name": "target"
            }
        }
    },
    "problem_type" : {
        "data_type": "tabular",
        "task": "binary_classification"
    },
    "metrics" : {
        "accuracy_score": {},
        "roc_auc_score": {}
    },
    "hyperparams" : [
        "disable_grid_search"
    ]
}
pdef = ProblemDefinition(pdef_dict)
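Because the problem definition contains only strings, lists and nested dictionaries, it can be written to a JSON file with the standard `json` module. The sketch below round-trips a copy of the dictionary through a temporary `problem_def.json`; the variable names and the temporary directory are illustrative. The final `ProblemDefinition` call is left commented out, since the exact path type it accepts is an assumption here.

```python
import json
import tempfile
from pathlib import Path

# Illustrative copy of the problem definition used in this notebook.
pdef_example = {
    "dataset": {
        "config": "memory",
        "params": {"train_data": "train", "test_data": "test"},
        "column_roles": {"target": {"name": "target"}},
    },
    "problem_type": {"data_type": "tabular", "task": "binary_classification"},
    "metrics": {"accuracy_score": {}, "roc_auc_score": {}},
    "hyperparams": ["disable_grid_search"],
}

# Write the definition to a temporary problem_def.json file.
pdef_path = Path(tempfile.mkdtemp()) / "problem_def.json"
pdef_path.write_text(json.dumps(pdef_example, indent=4))

# Read it back to confirm that nothing was lost in serialization.
loaded = json.loads(pdef_path.read_text())
print(loaded == pdef_example)

# Assumption: the path is passed as the first argument, as described above.
# pdef_from_file = ProblemDefinition(str(pdef_path))
```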

Create a [Wrangler](https://autonlab.gitlab.io/ngautonml/_autosummary/ngautonml.wrangler.wrangler.Wrangler.html) object and run `fit_predict_rank`.

ngAutonML generates a set of **pipelines** to solve the problem, and evaluates **metrics** on them to determine how well they perform. A **pipeline** is essentially a sequence of **algorithms** applied to the dataset to yield predictions. A **metric** is a function that takes both predictions and **ground truth** as input, and returns a number representing how good the predictions are.
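To make these terms concrete, here is a toy illustration in plain Python. It does not use ngAutonML: the steps `center` and `threshold`, the helper `run_pipeline`, and the hand-written `accuracy` function are all hypothetical stand-ins for the idea of a pipeline as a sequence of steps and a metric as a function of ground truth and predictions.

```python
def center(values):
    """Toy step: subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def threshold(values):
    """Toy step: predict class 1 for positive values, else class 0."""
    return [1 if v > 0 else 0 for v in values]


def run_pipeline(steps, data):
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        data = step(data)
    return data


def accuracy(ground_truth, predictions):
    """Toy metric: fraction of predictions that match the ground truth."""
    correct = sum(t == p for t, p in zip(ground_truth, predictions))
    return correct / len(ground_truth)


features = [1.0, 5.0, 2.0, 6.0, 4.0]
ground_truth = [0, 1, 0, 1, 0]

predictions = run_pipeline([center, threshold], features)
score = accuracy(ground_truth, predictions)
print(predictions, score)  # [0, 1, 0, 1, 1] 0.8
```

ngAutonML builds and scores many such pipelines automatically, using the metrics named in the problem definition.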

In [None]:
from ngautonml.wrangler.wrangler import Wrangler

wrangler = Wrangler(
    problem_definition=pdef)

got = wrangler.fit_predict_rank()
print(got.rankings)