# Databolt Flow
d6tflow is a Python library for data scientists and data engineers that makes building complex data science workflows easy, fast and intuitive.

https://github.com/d6t/d6tflow

## Benefits of using d6tflow

[4 Reasons Why Your Machine Learning Code is Probably Bad](https://medium.com/@citynorman/4-reasons-why-your-machine-learning-code-is-probably-bad-c291752e4953)

# Example Usage For a Machine Learning Workflow

Below is an example of a typical machine learning workflow: you retrieve data, preprocess it, train a model and evaluate the model output.

In this example you will:
* Build a machine learning workflow made up of individual tasks
* Check task dependencies and their execution status
* Execute the model training task including dependencies
* Save intermediate task output to Parquet, pickle and memory
* Load task output to pandas dataframe and model object for model evaluation
* Intelligently rerun workflow after changing a preprocessing parameter


In [7]:
import d6tflow
import luigi
import sklearn, sklearn.datasets, sklearn.svm, sklearn.preprocessing, sklearn.metrics
import pandas as pd

# define workflow
class TaskGetData(d6tflow.tasks.TaskPqPandas):  # save dataframe as parquet

    def run(self):
        iris = sklearn.datasets.load_iris()
        df_train = pd.DataFrame(iris.data,columns=['feature{}'.format(i) for i in range(4)])
        df_train['y'] = iris.target
        self.save(df_train) # quickly save dataframe

class TaskPreprocess(d6tflow.tasks.TaskCachePandas):  # save data in memory
    do_preprocess = luigi.BoolParameter(default=True) # parameter for preprocessing yes/no

    def requires(self):
        return TaskGetData() # define dependency

    def run(self):
        df_train = self.input().load() # quickly load required data
        if self.do_preprocess:
            df_train.iloc[:,:-1] = sklearn.preprocessing.scale(df_train.iloc[:,:-1])
        self.save(df_train)

class TaskTrain(d6tflow.tasks.TaskPickle): # save output as pickle
    do_preprocess = luigi.BoolParameter(default=True)

    def requires(self):
        return TaskPreprocess(do_preprocess=self.do_preprocess)

    def run(self):
        df_train = self.input().load()
        model = sklearn.svm.SVC()
        model.fit(df_train.iloc[:,:-1], df_train['y'])
        self.save(model)


In [8]:
# Check task dependencies and their execution status
d6tflow.preview(TaskTrain())



└─--[TaskTrain-{'do_preprocess': 'True'} (PENDING)]
   └─--[TaskPreprocess-{'do_preprocess': 'True'} (PENDING)]
      └─--[TaskGetData-{} (PENDING)]
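
To make the preview output easier to read, here is a minimal sketch of how a dependency tree with status labels can be rendered recursively. It is an illustration only, not d6tflow's implementation: `SketchTask` and `render_tree` are hypothetical names, and the `done` flag stands in for whatever completion check the real library performs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SketchTask:
    """Hypothetical stand-in for a workflow task (not a d6tflow class)."""
    name: str
    params: Dict[str, str] = field(default_factory=dict)
    done: bool = False
    deps: List["SketchTask"] = field(default_factory=list)


def render_tree(task: SketchTask, depth: int = 0) -> List[str]:
    """Return one line per task, indenting each dependency level by 3 spaces."""
    status = "COMPLETE" if task.done else "PENDING"
    indent = " " * (3 * depth)
    lines = ["{}└─--[{}-{} ({})]".format(indent, task.name, task.params, status)]
    for dep in task.deps:
        lines.extend(render_tree(dep, depth + 1))
    return lines


# Mirror the three-task chain from the workflow above.
get_data = SketchTask("TaskGetData")
preprocess = SketchTask("TaskPreprocess", {"do_preprocess": "True"}, deps=[get_data])
train = SketchTask("TaskTrain", {"do_preprocess": "True"}, deps=[preprocess])

tree_lines = render_tree(train)
print("\n".join(tree_lines))
```

Each task appears one level below the task that requires it, so the deepest line is the first task that has to run.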


In [9]:
# Execute the model training task including dependencies
d6tflow.run(TaskTrain())


INFO: Informed scheduler that task   TaskTrain_True_e00389f8b2   has status   PENDING
INFO: Informed scheduler that task   TaskPreprocess_True_e00389f8b2   has status   PENDING
INFO: Informed scheduler that task   TaskGetData__99914b932b   has status   PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
INFO: [pid 18452] Worker Worker(salt=074772785, workers=1, host=DESKTOP-5ER1139, username=deepmind, pid=18452) running   TaskGetData()
INFO: [pid 18452] Worker Worker(salt=074772785, workers=1, host=DESKTOP-5ER1139, username=deepmind, pid=18452) done      TaskGetData()
INFO: Informed scheduler that task   TaskGetData__99914b932b   has status   DONE
INFO: [pid 18452] Worker Worker(salt=074772785, workers=1, host=DESKTOP-5ER1139, username=deepmind, pid=18452) running   TaskPreprocess(do_preprocess=True)
INFO: [pid 18452] Worker Worker(salt=074772785, workers=1, host=DESKTOP-5ER1139, username=deepmind, pid=18452) done      TaskPreprocess(do_preprocess=True)
INFO: Info

True
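
The task identifiers in the log, such as `TaskGetData__99914b932b`, combine the task name, a summary of the parameter values and a short hash. The sketch below approximates that scheme with the standard library. It rests on an assumption about how Luigi derives these IDs (MD5 of the parameters serialized as compact, key-sorted JSON, truncated to 10 hex digits), and it skips details such as truncating long values or sanitizing special characters. `sketch_task_id` is a hypothetical helper, not part of Luigi or d6tflow.

```python
import hashlib
import json


def sketch_task_id(task_family, params):
    """Approximate a Luigi-style task ID: family, parameter summary, short hash.

    Assumption: the hash is the first 10 hex digits of the MD5 of the
    parameters serialized as compact, key-sorted JSON.
    """
    param_str = json.dumps(params, separators=(",", ":"), sort_keys=True)
    param_hash = hashlib.md5(param_str.encode("utf-8")).hexdigest()
    # Summarize the parameter values (sorted by parameter name) for readability.
    param_summary = "_".join(str(params[key]) for key in sorted(params))
    return "{}_{}_{}".format(task_family, param_summary, param_hash[:10])


print(sketch_task_id("TaskGetData", {}))
print(sketch_task_id("TaskTrain", {"do_preprocess": "True"}))
print(sketch_task_id("TaskPreprocess", {"do_preprocess": "True"}))
```

Because the hash depends only on the parameters, two different tasks called with the same parameters share the same suffix, which is consistent with `TaskTrain` and `TaskPreprocess` ending in the same characters in the log.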

In [10]:
# Load task output to pandas dataframe and model object for model evaluation
# Note: the accuracy below is measured on the same data the model was trained on
model = TaskTrain().output().load()
df_train = TaskPreprocess().output().load()
print(sklearn.metrics.accuracy_score(df_train['y'],model.predict(df_train.iloc[:,:-1])))


0.9733333333333334


In [11]:
# Intelligently rerun workflow after changing a preprocessing parameter
d6tflow.preview(TaskTrain(do_preprocess=False))



└─--[TaskTrain-{'do_preprocess': 'False'} (PENDING)]
   └─--[TaskPreprocess-{'do_preprocess': 'False'} (COMPLETE)]
      └─--[TaskGetData-{} (COMPLETE)]


In [12]:
d6tflow.run(TaskTrain(do_preprocess=False)) # execute with new parameter


INFO: Informed scheduler that task   TaskTrain_False_57897150ee   has status   PENDING
INFO: Informed scheduler that task   TaskPreprocess_False_57897150ee   has status   DONE
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
INFO: [pid 18452] Worker Worker(salt=335940119, workers=1, host=DESKTOP-5ER1139, username=deepmind, pid=18452) running   TaskTrain(do_preprocess=False)
INFO: [pid 18452] Worker Worker(salt=335940119, workers=1, host=DESKTOP-5ER1139, username=deepmind, pid=18452) done      TaskTrain(do_preprocess=False)
INFO: Informed scheduler that task   TaskTrain_False_57897150ee   has status   DONE
INFO: Worker Worker(salt=335940119, workers=1, host=DESKTOP-5ER1139, username=deepmind, pid=18452) was stopped. Shutting down Keep-Alive thread
INFO: 
===== Luigi Execution Summary =====

Scheduled 2 tasks of which:
* 1 complete ones were encountered:
    - 1 TaskPreprocess(do_preprocess=False)
* 1 ran successfully:
    - 1 TaskTrain(do_preprocess=False)

This progres

True
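
The rerun behavior can be illustrated with a small, self-contained sketch: outputs are stored under a key built from the task name and its parameters, and a task runs only when no output exists for that key. This is a conceptual model under stated assumptions, not how d6tflow or Luigi is implemented. `run_task`, `store` and `executed` are hypothetical names, and the dictionary stands in for files on disk or in memory.

```python
from typing import Callable, Dict, List, Tuple

# Completed outputs keyed by (task name, parameters); stands in for saved files.
OutputKey = Tuple[str, Tuple[Tuple[str, object], ...]]
store: Dict[OutputKey, object] = {}
executed: List[str] = []  # names of tasks that actually ran


def run_task(name: str, params: dict, deps: List[Callable[[], object]],
             compute: Callable[..., object]) -> object:
    """Run `compute` only if no output exists for this name/parameter pair."""
    key = (name, tuple(sorted(params.items())))
    if key in store:  # already complete: reuse the stored output
        return store[key]
    inputs = [dep() for dep in deps]  # resolve dependencies first
    executed.append(name)
    store[key] = compute(*inputs)
    return store[key]


def get_data():
    return run_task("get_data", {}, [], lambda: [1.0, 2.0, 3.0])


def preprocess(do_preprocess: bool):
    def compute(data):
        return [x / max(data) for x in data] if do_preprocess else list(data)
    return run_task("preprocess", {"do_preprocess": do_preprocess}, [get_data], compute)


def train(do_preprocess: bool):
    return run_task("train", {"do_preprocess": do_preprocess},
                    [lambda: preprocess(do_preprocess)], lambda data: sum(data))


train(True)
first_run = list(executed)
executed.clear()
train(False)
second_run = list(executed)
print(first_run)   # ['get_data', 'preprocess', 'train']
print(second_run)  # ['preprocess', 'train']
```

The sketch starts from an empty store, so changing the parameter reruns both parameter-dependent steps while the data retrieval step is reused. In the run shown above, `TaskPreprocess(do_preprocess=False)` was already complete, so only `TaskTrain` had to run.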

# Next steps: Transition code to d6tflow

See https://d6tflow.readthedocs.io/en/latest/transition.html