Your Name docs fcfc63d Apr 3, 2019
4 Reasons Why Your Machine Learning Code is Probably Bad

Your current workflow probably chains several functions together like in the example below. While quick, it likely has many problems:

  • it doesn't scale well as you add complexity
  • you have to manually keep track of which functions were run with which parameters as you iterate through your workflow
  • you have to manually keep track of where data is saved
  • it's difficult for others to read
import pandas as pd
import sklearn.svm, sklearn.metrics

# download_data() and clean_data() are placeholders for your own code
def get_data():
    data = download_data()
    data.to_pickle('data.pkl')  # output location is hardcoded

def preprocess(data):
    data = clean_data(data)
    return data

# flow parameters
do_preprocess = True

# run workflow
get_data()
df_train = pd.read_pickle('data.pkl')
if do_preprocess:
    df_train = preprocess(df_train)
model = sklearn.svm.SVC()
model.fit(df_train.iloc[:,:-1], df_train['y'])

What to do about it?

Instead of linearly chaining functions, data science code is better written as a set of tasks with dependencies between them. That is, your data science workflow should be a DAG (directed acyclic graph).
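To make the idea concrete, here is a minimal sketch of a workflow expressed as a DAG using only the Python standard library. The task names are illustrative labels chosen for this example, and `graphlib` requires Python 3.9 or later. It shows only the ordering idea, not how any particular workflow library implements it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a task; its value is the set of tasks it depends on.
# The task names are illustrative labels, not real task objects.
dag = {
    "get_data": set(),
    "preprocess": {"get_data"},
    "train": {"preprocess"},
}

# static_order() yields every task only after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Because the dependencies are declared as data, the execution order is derived from the graph rather than from the order in which you happened to write the function calls.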

d6tflow is a free open-source library that makes it easy to build highly effective data science workflows.

Instead of writing a function that does:

def process_data(df, parameter):
    if parameter:
        df = do_stuff(df)
    return df

dataraw = download_data()
data = process_data(dataraw, parameter=True)

You can write tasks that you can chain together as a DAG:

import d6tflow

class TaskGetData(d6tflow.tasks.TaskPqPandas):

    def run(self):
        data = download_data()
        self.save(data) # save output data

class TaskProcess(d6tflow.tasks.TaskPqPandas):

    def requires(self):
        return TaskGetData() # define dependency

    def run(self):
        data = self.input().load() # load input data
        data = do_stuff(data) # process data
        self.save(data) # save output data

d6tflow.run(TaskProcess()) # execute task including dependencies
data = TaskProcess().output().load() # load output data

The benefits of doing this are:

  • All tasks follow the same pattern no matter how complex your workflow gets
  • Inputs are declared in requires() and processing lives in run(), a split that scales as the workflow grows
  • You can quickly load and save data without having to hardcode filenames
  • If the input task is not complete it will automatically run
  • If input data or parameters change, the function will automatically rerun
  • It’s much easier for others to read and understand the workflow
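Two of the points above, not hardcoding filenames and rerunning when parameters change, can be illustrated with a small standard-library sketch. It assumes a simple scheme in which the output path is derived from the task name and its parameters; `target_path`, `run_if_needed` and the JSON storage are hypothetical choices for this example, not how d6tflow stores data.

```python
import json
import tempfile
from pathlib import Path

DATA_DIR = Path(tempfile.mkdtemp())  # throwaway directory for this demo
run_log = []  # records which (task, params) combinations actually executed


def target_path(task_name, params):
    """Build a file path from the task name and its sorted parameters."""
    suffix = "_".join(f"{k}-{v}" for k, v in sorted(params.items()))
    return DATA_DIR / f"{task_name}__{suffix}.json"


def run_if_needed(task_name, params, compute):
    """Run compute(**params) only when no saved output exists for these parameters."""
    path = target_path(task_name, params)
    if not path.exists():
        run_log.append((task_name, dict(params)))
        path.write_text(json.dumps(compute(**params)))
    return json.loads(path.read_text())


def preprocess(do_preprocess):
    # Stand-in computation: scale the values when preprocessing is enabled.
    values = [1.0, 2.0, 3.0]
    return [v / 3.0 for v in values] if do_preprocess else values


first = run_if_needed("preprocess", {"do_preprocess": True}, preprocess)   # executes
again = run_if_needed("preprocess", {"do_preprocess": True}, preprocess)   # loaded from disk
other = run_if_needed("preprocess", {"do_preprocess": False}, preprocess)  # new parameter, executes
print(len(run_log))
```

Since the parameter value is part of the output location, a changed parameter points at a target that does not exist yet, so the work runs again, while an unchanged parameter finds the saved result and skips the computation.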

An example machine learning DAG

Below is a stylized example of a machine learning flow which is expressed as a DAG. In the end you just need to run TaskTrain() and it will automatically know which dependencies to run. For a full example see

import pandas as pd
import sklearn, sklearn.svm, sklearn.metrics
import d6tflow
import luigi

# download_data(), clean_data() and preprocess() are placeholders for your own code

# define workflow
class TaskGetData(d6tflow.tasks.TaskPqPandas):  # save dataframe as parquet

    def run(self):
        data = download_data()
        data = clean_data(data)
        self.save(data) # quickly save dataframe

class TaskPreprocess(d6tflow.tasks.TaskCachePandas):  # save data in memory
    do_preprocess = luigi.BoolParameter(default=True) # parameter for preprocessing yes/no

    def requires(self):
        return TaskGetData() # define dependency

    def run(self):
        df_train = self.input().load() # quickly load required data
        if self.do_preprocess:
            df_train = preprocess(df_train)
        self.save(df_train)

class TaskTrain(d6tflow.tasks.TaskPickle): # save output as pickle
    do_preprocess = luigi.BoolParameter(default=True)

    def requires(self):
        return TaskPreprocess(do_preprocess=self.do_preprocess)

    def run(self):
        df_train = self.input().load()
        model = sklearn.svm.SVC()
        model.fit(df_train.iloc[:,:-1], df_train['y'])
        self.save(model)

# Check task dependencies and their execution status
d6tflow.preview(TaskTrain())

└─--[TaskTrain-{'do_preprocess': 'True'} (PENDING)]
   └─--[TaskPreprocess-{'do_preprocess': 'True'} (PENDING)]
      └─--[TaskGetData-{} (PENDING)]

# Execute the model training task including dependencies
d6tflow.run(TaskTrain())

===== Luigi Execution Summary =====

Scheduled 3 tasks of which:
* 3 ran successfully:
    - 1 TaskGetData()
    - 1 TaskPreprocess(do_preprocess=True)
    - 1 TaskTrain(do_preprocess=True)

# Load task output to pandas dataframe and model object for model evaluation
model = TaskTrain().output().load()
df_train = TaskPreprocess().output().load()
print(sklearn.metrics.accuracy_score(df_train['y'], model.predict(df_train.iloc[:,:-1])))
# 0.9733333333333334
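
The dependency status tree shown earlier in this example can be approximated with a short recursive function. This is a simplified sketch, assuming tasks are represented as plain `(name, dependencies)` tuples and that completion is tracked in a set; `render_tree` is a hypothetical helper, and the output format is deliberately simpler than the real preview.

```python
def render_tree(task, completed, depth=0):
    """Return one line per task, indented by depth, marking its status."""
    name, deps = task
    status = "COMPLETE" if name in completed else "PENDING"
    lines = ["   " * depth + f"[{name} ({status})]"]
    for dep in deps:
        lines.extend(render_tree(dep, completed, depth + 1))
    return lines


# (name, [dependencies]) tuples; hypothetical stand-ins for task objects
get_data = ("TaskGetData", [])
preprocess = ("TaskPreprocess", [get_data])
train = ("TaskTrain", [preprocess])

before = render_tree(train, completed=set())
after = render_tree(train, completed={"TaskGetData"})
print("\n".join(before))
print("\n".join(after))
```

Walking the graph from the final task downward is what lets you ask for TaskTrain alone and still see, or run, everything it depends on.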


Writing machine learning code as a linear series of functions likely creates many workflow problems. Because of the complex dependencies between different ML tasks, it is better to write them as a DAG. d6tflow makes this very easy. Alternatively, you can use luigi or airflow, but they are more optimized for ETL than for data science.
