## Run Gordo from only a config file

This is a high-level example of how Gordo works: it trains a model from a single config file.

---
**Note:** This differs slightly from how Gordo _actually_ works. Config parsing normally relies on resources from `gordo-infrastructure`, which are not yet available here.

---

In [None]:
import tempfile
import yaml
from pprint import pprint
from dateutil.parser import isoparse

from gordo_components import serializer
from gordo_components.builder import build_model
from gordo_components.data_provider.providers import DataLakeProvider

## Define a config file:

In [None]:
config = \
"""
machines:

  - name: asgb-morvin
    dataset:
      tags: #list of tags
        - asgb.19ZT3950%2FY%2FPRIM
        - asgb.19PST3925%2FDispMeasOut%2FPRIM
        - asgb.19PT3905%2FY%2F5mMID
      train_start_date: 2014-07-01T00:10:00+00:00
      train_end_date: 2015-01-01T00:00:00+00:00
    metadata:
      my-special-metadata-attr: is awesome!

model:
  sklearn.pipeline.Pipeline:
    steps:
      - sklearn.preprocessing.data.MinMaxScaler
      - gordo_components.model.models.KerasAutoEncoder:
          kind: feedforward_hourglass
"""
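The `model` section above names each class by its dotted import path, optionally followed by keyword arguments. As a rough illustration of that idea only (this is not Gordo's actual implementation, and it does not handle nested definitions such as the `steps` list), the sketch below resolves a dotted path with `importlib` and instantiates the class. The helper names `resolve_dotted_path` and `build_from_definition` are hypothetical, and standard-library classes stand in for the scikit-learn and Gordo classes.

```python
import importlib
from typing import Any, Union


def resolve_dotted_path(path: str) -> Any:
    """Import and return the object named by a dotted path, e.g. 'collections.Counter'."""
    module_name, _, attr_name = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


def build_from_definition(definition: Union[str, dict]) -> Any:
    """Instantiate from a bare dotted path or a single-key {path: kwargs} mapping."""
    if isinstance(definition, str):
        # Bare path: call the class with its default arguments
        return resolve_dotted_path(definition)()
    # Mapping form: exactly one key (the path) whose value holds the kwargs
    ((path, kwargs),) = definition.items()
    return resolve_dotted_path(path)(**(kwargs or {}))


# Hypothetical stand-ins for entries such as `MinMaxScaler` or `KerasAutoEncoder: {kind: ...}`
counter = build_from_definition("collections.Counter")
frac = build_from_definition({"fractions.Fraction": {"numerator": 3, "denominator": 4}})
```

The two accepted shapes mirror the two kinds of step in the config: a plain string for a class with default arguments, and a single-key mapping when keyword arguments are supplied.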

## Simulate how Gordo extracts the required information from a config file

##### Note:
This is _not_ exactly how it is actually done. Gordo normally uses resources from `gordo-infrastructure` that are not available from `gordo-components`.

In [None]:
# Load into a normal dict
config = yaml.load(config, Loader=yaml.BaseLoader)

# Model configuration
model_config = config['model']

# In this case, we only build a model for a single machine
machine_config = config['machines'][0]

# TODO: This is the ugliest portion, as we normally use resources [`Machine`] found in `gordo-infrastructure`
data_config = {
    "type": "TimeSeriesDataset",  # Dataset type used for data acquisition
    "from_ts": isoparse(machine_config['dataset']['train_start_date']),
    "to_ts": isoparse(machine_config['dataset']['train_end_date']),
    "tag_list": machine_config['dataset']['tags'],
    "data_provider": DataLakeProvider(storename="dataplatformdlsprod", interactive=True),
}
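The cell above handles only the first machine. A minimal sketch of how the same extraction might be repeated for every entry under `machines` is shown below. It assumes all values are plain strings (as they are when loaded with `yaml.BaseLoader`), uses the standard-library `datetime.fromisoformat` in place of `isoparse`, and deliberately leaves out `data_provider`, which would still need to be added before calling `build_model`. The function name `dataset_configs` and the `example_config` dict are hypothetical.

```python
from datetime import datetime
from typing import Any, Dict


def dataset_configs(config: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Return one dataset config per machine, keyed by machine name (no data provider)."""
    configs = {}
    for machine in config["machines"]:
        dataset = machine["dataset"]
        configs[machine["name"]] = {
            "type": "TimeSeriesDataset",
            # Timestamps arrive as ISO 8601 strings and are parsed into aware datetimes
            "from_ts": datetime.fromisoformat(dataset["train_start_date"]),
            "to_ts": datetime.fromisoformat(dataset["train_end_date"]),
            "tag_list": list(dataset["tags"]),
        }
    return configs


# Hypothetical, already-parsed config with the same shape as the YAML above
example_config = {
    "machines": [
        {
            "name": "asgb-morvin",
            "dataset": {
                "tags": ["asgb.19ZT3950%2FY%2FPRIM"],
                "train_start_date": "2014-07-01T00:10:00+00:00",
                "train_end_date": "2015-01-01T00:00:00+00:00",
            },
        }
    ]
}
per_machine = dataset_configs(example_config)
```

Keying the result by machine name keeps the name available for the `name` argument of `build_model` without mixing it into the dataset config itself.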

### Build model from data and model configs

`build_model` also optionally takes a metadata `dict` and returns it updated with information about the model-building process.

In [None]:
pipe, metadata = build_model(
    name=machine_config['name'],  # the machine's name, not the model config
    model_config=model_config,
    data_config=data_config,
    metadata=machine_config['metadata']
)

---

### The trained model/pipeline:

In [6]:
pipe

Pipeline(memory=None,
     steps=[('step_0', MinMaxScaler(copy=True, feature_range=(0, 1))), ('step_1', <gordo_components.model.models.KerasAutoEncoder object at 0x7f7ac48ea908>)])

### Metadata from the model and build process

In [7]:
print(yaml.dump(metadata))

dataset:
  resolution: 10T
  tag_list: [asgb.19ZT3950%2FY%2FPRIM, asgb.19PST3925%2FDispMeasOut%2FPRIM, asgb.19PT3905%2FY%2F5mMID]
  train_end_date: 2015-01-01 00:00:00+00:00
  train_start_date: 2014-07-01 00:10:00+00:00
model:
  data_query_duration_sec: 21.27293300628662
  model_builder_version: 0.9.1.dev3+g9da7836.d20190225
  model_config:
    sklearn.pipeline.Pipeline:
      steps:
      - sklearn.preprocessing.data.MinMaxScaler
      - gordo_components.model.models.KerasAutoEncoder: {kind: feedforward_symetric}
  model_creation_date: '2019-02-25 12:21:38.595111+01:00'
  model_training_duration_sec: 3.5922129154205322
user-defined: {my-special-metadata-attr: is awesome!}

