# Machine Learning Pipeline

**WORK IN PROGRESS** This guide isn't complete, but the code examples may be useful as is.

This guide assumes you are familiar with all the content in the [Introductory Guide](intro-guide.ipynb).

A typical machine learning pipeline consists of loading data, extracting features, training models, and storing the models for later use in a production system or for further analysis. In some cases the feature extraction process is quick and the features are transitory, so there is no need to save them independently of the finished trained model. Other times the features are a representation of the data that you wish to reuse in different settings, e.g. in a dashboard explaining predictions, in ad-hoc analysis, or in further model development.

In the end, a good deal of plumbing is required to wire up an app or service with the latest models and features in such a way that API calls can be traced back to the originating model, features, and even data sources. `provenance` abstracts much of this plumbing so you can focus on writing parsimonious pythonic pipelines™ with plain old functions.

In [1]:
%load_ext yamlmagic

In [13]:
%%yaml basic_config
blobstores:
    disk:
        type: disk
        cachedir: /tmp/ml-artifacts
        read: True
        write: True
        delete: True
artifact_repos:
    local:
        type: postgres
        db: postgresql://localhost/provenance-ml-guide
        store: 'disk'
        read: True
        write: True
        delete: True
        # this option will create the database if it doesn't exist
        create_db: True
default_repo: local


In [14]:
import provenance as p


p.load_config(basic_config)

INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO  [alembic.runtime.migration] Will assume transactional DDL.
INFO  [alembic.runtime.migration] Running stamp_revision  -> e0317ab07ba4


<provenance.repos.Config at 0x1141279b0>
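The `%%yaml basic_config` cell binds the parsed YAML to a Python variable, which is then passed to `p.load_config`. Outside a notebook, the same settings could presumably be written as a plain dict. The sketch below mirrors the YAML above key for key; the `load_config` call is left commented out because it needs a reachable Postgres instance.

```python
# Plain-dict equivalent of the `%%yaml basic_config` cell.
basic_config = {
    "blobstores": {
        "disk": {
            "type": "disk",
            "cachedir": "/tmp/ml-artifacts",
            "read": True,
            "write": True,
            "delete": True,
        }
    },
    "artifact_repos": {
        "local": {
            "type": "postgres",
            "db": "postgresql://localhost/provenance-ml-guide",
            "store": "disk",
            "read": True,
            "write": True,
            "delete": True,
            # this option will create the database if it doesn't exist
            "create_db": True,
        }
    },
    "default_repo": "local",
}

# Sanity checks: the default repo exists and its store names a blobstore.
default_repo = basic_config["artifact_repos"][basic_config["default_repo"]]
assert default_repo["store"] in basic_config["blobstores"]

# import provenance as p
# p.load_config(basic_config)  # requires a running Postgres server
```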

In [15]:
import numpy as np
import pandas as pd
import time
from sklearn.utils import check_random_state
import toolz as t

In [16]:

@p.provenance()
def load_data(query):
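    # fetch something from the DB in real life...
    # NOTE: str hashes are salted per process unless PYTHONHASHSEED is set,
    # so this fake data differs between interpreter sessions.
    random_state = check_random_state(abs(hash(query)) // (10**10))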
    return random_state.uniform(0, 10, 10)


@p.provenance()
def extract_features_a(data, hyperparam_a=5):
    time.sleep(2)
    rs = check_random_state(hyperparam_a)
    return data[0:5] + 1 + rs.rand(5)


@p.provenance()
def extract_features_b(data, hyperparam_x=10):
    time.sleep(2)
    rs = check_random_state(hyperparam_x)
    return data[5:] + 1 + rs.rand(5)


@p.provenance()
def build_model(features_a, features_b, num_trees=100):
    return {'whatever': 'special model with {} trees'.format(num_trees)}


@p.provenance()
def evaluate(model, data):
    return {'some_metric': 0.5, 'another_metric': 0.4}



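def pipeline(train_query='some query', valid_query='another query', hyperparam_a=5, hyperparam_x=10):
    data = load_data(train_query)
    features_a = extract_features_a(data, hyperparam_a)
    features_b = extract_features_b(data, hyperparam_x)
    # NOTE: build_model's signature is (features_a, features_b, num_trees), so this
    # positional call binds features_b to num_trees. The recorded model output at
    # the end of this guide reflects that; build_model(features_a, features_b)
    # is presumably the intended call.
    model = build_model(data, features_a, features_b)

    validation_data = load_data(valid_query)
    evaluation = evaluate(model, validation_data)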

    return {'features_a': features_a, 'features_b': features_b,
            'model': model, 'evaluation': evaluation}

@p.provenance()
def make_decision(model, request):
    # make some sort of prediction, classification, with the model
    # to help make a 'decision' and return it as the result
    return {'prediction': 0.5, 'model': model.artifact.id}

**TODO** Explain everything, including the concept of artifact sets and how they simplify the building and deployment of models.

In [17]:
def run_production_pipeline():
    with p.capture_set('production'):
        return pipeline()

In [18]:
res = run_production_pipeline()

In [19]:
res = p.get_set_by_name('production')

In [20]:
res

ArtifactSet(id='50e1bcafc5713c94451422bf7612b2f6a14207fd', artifact_ids=frozenset({'e4350270457a9cf6d7f4f644e4e73a9b9ed602e8', 'd3c930d243d6ec4d7be481ddd1f4c3e9277d5f09', 'a647c6d0a30b704beae12005d026d748ec52bce4', '7099ea498cc7e3e3ea2acb1fea4bbe6a6510b7c4', '96c47ddbeff008e2b3a27913611c9648c3e74aa2', 'c67310b3e50456ef54728370f2db5d08a28870d4'}), created_at=datetime.datetime(2017, 4, 30, 23, 4, 22, 352814), name='production')

In [24]:
build_artifacts = res.proxy_dict(group_artifacts_of_same_name=True)

In [25]:
build_artifacts.keys()

dict_keys(['__main__.evaluate', '__main__.extract_features_a', '__main__.extract_features_b', '__main__.build_model', '__main__.load_data'])

In [26]:
model = build_artifacts['__main__.build_model']

In [27]:
model

<provenance.ArtifactProxy(7099ea498cc7e3e3ea2acb1fea4bbe6a6510b7c4) {'whatever': 'special model with [  3.79122989   9.15575636  11.36408233   8.27737009   6.72846267] trees'} >
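The array inside the model's string above comes from the positional call `build_model(data, features_a, features_b)` in `pipeline`: the third positional argument lands in `num_trees`. The standalone sketch below reproduces the effect with an undecorated copy of `build_model` and placeholder strings in place of the real arrays.

```python
def build_model(features_a, features_b, num_trees=100):
    # Undecorated copy of the guide's function, for illustration only.
    return {'whatever': 'special model with {} trees'.format(num_trees)}


# Three positional arguments: the third one is bound to num_trees.
shifted = build_model('data', 'features_a', 'features_b')

# Two positional arguments: num_trees keeps its default of 100.
intended = build_model('features_a', 'features_b')

print(shifted['whatever'])   # special model with features_b trees
print(intended['whatever'])  # special model with 100 trees
```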