# Make For Data Science

* David Marx @DigThatData
* http://dmarx.github.io

Demo available at: https://github.com/dmarx/make_for_datascience

# Motivation

## How We Hurt Ourselves

* Unnecessary code reproduction
* Circular dependencies
* Unreproducible results
* Non-portable code
* Lack of documentation
* No standard project layout

## Use Cases/Pain Points

* Evaluating competing models
* After modifying models/data, refreshing only the downstream dependencies
* Navigating dependencies
* Reducing "spaghetti" projects
* Defining clear entry-point into the code base
* On-boarding overhead
* Data provenance overhead (i.e. debugging)
* Reproducibility/portability

## Principles of a Solution

* Low level of effort (LOE) to adopt
* Minimal external dependencies
* OS agnostic
* Tool agnostic
* Encourage project standardization
* Pipeline as a DAG
* Pipeline is version-controllable
* Pipeline as documentation



# Pipeline Management (build tools)

## Why should I use a pipeline manager?

* Eliminates circular dependencies
* Makes it simple to rebuild only the objects whose dependencies have been modified
* Defines a clear entry-point into the project
* You're going to author a pipeline anyway (sometimes called a "master script", e.g. `read_data_and_fit_models.r`)

## Popular tools

* GNU Make
* Spotify Luigi
* Apache Airflow
* Apache Ant
* Waf (Python)

... https://en.wikipedia.org/wiki/List_of_build_automation_software

# A Brief Intro To GNU Make

https://www.gnu.org/software/make/manual/html_node/index.html#SEC_Contents

## Why Make?

* Pre-installed on \*NIX
  * Using it on Windows requires Cygwin (or some other Unix emulator) or a Windows port like GnuWin32
* Easy to use (if you KISS)
* Automatically identifies modified dependencies
* Long history of use, lots of tutorials/examples/documentation

## The Makefile

* Rules
* Special variables
* Functions

# Make Rules

## Simple Rule

```
target: dependency
	recipe
```

NB: Spaces != Tabs. Recipe lines must be indented with a tab character.

## Pattern Rule

```
%.rdata: %.csv
	Rscript read_data.r
```

## Pattern Rule: Semantic Folder Structure

```
models/%.rdata: data/%.rdata
	Rscript fit_model.r
```

# Make Variables

## Automatic Variables

* \$@: The target of the rule (the file being built)

```
models/%.rdata: train_and_save_model.r data/basetable.rdata code/models/%.r
	Rscript train_and_save_model.r $@
```

* \$<: The first prerequisite

```
models/%.rdata: train_and_save_model.r data/basetable.rdata code/models/%.r
	Rscript $< $@
```
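
Putting the pieces so far together, a small pipeline might be sketched as follows. The model names (`logistic`, `random_forest`), `build_basetable.r`, and `data/raw.csv` are hypothetical placeholders, and the sketch assumes each R script accepts its output path as its first argument.

```make
# Sketch only: file and model names below are hypothetical placeholders.
.PHONY: all

# Default goal: the first regular target is the entry point into the pipeline.
all: models/logistic.rdata models/random_forest.rdata

# Assumes build_basetable.r accepts the output path as its first argument.
data/basetable.rdata: build_basetable.r data/raw.csv
	Rscript $< $@

# One rule covers every model; % matches "logistic", "random_forest", ...
models/%.rdata: train_and_save_model.r data/basetable.rdata code/models/%.r
	Rscript $< $@
```

Under these assumptions, running `make` with no arguments builds `all`. After editing `code/models/logistic.r`, only `models/logistic.rdata` is out of date, so only that model is refit.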

## Declared Variables

* '=' : Recursive (i.e. lazy) expansion
  * Values are defined verbatim via text substitution
  * If assigning from another variable, that variable is not evaluated at assignment: the text referencing it is what gets stored, and it is expanded each time the variable is used

Running the following Makefile

```
foo = $(bar)
bar = $(ugh)
ugh = Huh?

all:
	echo $(foo)
```
will echo "Huh?": 

`$(foo)` expands to "\$(bar)" which expands to "\$(ugh)" which finally expands to "Huh?".

* **:=** 
  * Simple expansion 
  * About as close to "immediate" as we can get
  * Applies value at time of assignment (if you're careful)
  * Really, nothing is for sure until the Makefile has been fully read: recipes are only expanded when they run

```
x := foo
y := $(x) bar
x := baz
```

is equivalent to

```
y := foo bar
x := baz
```
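
For contrast, here is a small sketch of the same three assignments written with `=`. Because `y` stores the text `$(x) bar` rather than its value, it picks up whatever `x` holds when `y` is finally used. The `show` target is just an illustrative name.

```make
x = foo
y = $(x) bar
x = baz

# y is expanded here, when the recipe runs, so it sees the final value of x.
show:
	@echo $(y)
```

Running `make show` prints `baz bar`, whereas the `:=` version above leaves `y` as `foo bar`.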

# HERE THERE BE DRAGONS

```
x := foo

test::
	echo FIRST $(x)
    
x := bar

test::
	echo SECOND $(x)
```

will not echo:

```
FIRST foo
SECOND bar
```

but rather:

```
FIRST bar
SECOND bar
```

> ## 3.7 How make Reads a Makefile

> GNU make does its work in **two distinct phases**. During the first phase it reads all the makefiles, included makefiles, etc. and internalizes all the variables and their values, implicit and explicit rules, and constructs a dependency graph of all the targets and their prerequisites. During the second phase, make uses these internal structures to determine what targets will need to be rebuilt and to invoke the rules necessary to do so.

> It’s important to understand this two-phase approach because it has a direct impact on how variable and function expansion happens; **this is often a source of some confusion** when writing makefiles.

-- https://www.gnu.org/software/make/manual/html_node/Reading-Makefiles.html#Reading-Makefiles
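
One way to see the two phases for yourself is the sketch below. It uses GNU Make's `$(info ...)` function, which prints its argument at the moment it is expanded. The `demo` target name is an arbitrary choice.

```make
# Expanded during phase one, while the makefile is being read.
$(info reading: before the rule)

demo:
	@echo running: inside the recipe

# Still phase one, even though it appears after the rule.
$(info reading: after the rule)
```

Running `make demo` prints both `reading:` lines first and the `running:` line last, because the recipe is only expanded and executed in the second phase.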

## Functions

There are a ton of different functions. Here are a few that I find especially useful:

* $(**patsubst** pattern,replacement,text)
  * Pattern substitution
  * `$(patsubst foo/%.bar,%.baz,foo/fname.bar)` -> `fname.baz`
  
* $(**wildcard** pattern)
  * Return a space-separated list of filenames matching pattern
  * `$(wildcard data/*.csv)` -> `data/raw.csv data/results.csv data/errors.csv`
  
* $(**dir** names…)
  * Directory extraction
  * `$(dir path/to/file.txt foo/bar.csv)` -> `path/to/ foo/`
  
* $(**notdir** names…)
  * Non-directory part of file name
  * `$(notdir path/to/file.txt foo/bar.csv)` -> `file.txt bar.csv`
  
* $(**shell** statement)
  * Lets you run shell commands anywhere in the makefile (i.e. outside of recipes, where you already can)
  
* $(**eval** statement)
  * Evaluate makefile syntax
  * Expanded twice: first by the eval function, *then again* when the result is parsed as makefile syntax
  * Allows us to define and invoke Make variables (actually) immediately inside recipes
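
As a rough sketch of how `wildcard` and `patsubst` can work together, the fragment below derives the list of targets from whatever CSV files happen to exist in `data/`. The variable names `CSVS` and `RDATA` are illustrative, and the sketch assumes `read_data.r` takes the input and output paths as arguments.

```make
# All raw CSVs currently on disk, e.g. data/raw.csv data/results.csv
CSVS := $(wildcard data/*.csv)

# Map each data/NAME.csv to the data/NAME.rdata we want to build from it
RDATA := $(patsubst data/%.csv,data/%.rdata,$(CSVS))

.PHONY: all
all: $(RDATA)

# Assumes read_data.r takes the input and output paths as arguments.
data/%.rdata: data/%.csv
	Rscript read_data.r $< $@
```

Dropping a new CSV into `data/` is then enough to add it to the pipeline; no rule has to be edited.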
  

## The \$(eval ...) Trick

```
x := who

v1 = first/path
$(v1)_x := foo

test::
	$(eval x := $($(v1)_x))
	echo FIRST $(x)

x := cares

v2 = second/path
$(v2)_x := bar

test::
	$(eval x := $($(v2)_x))
	echo SECOND $(x)
```


will echo:

```
FIRST foo
SECOND bar
```
  
rather than:
    
```
FIRST cares
SECOND cares
```


# "The System"

## Pipeline entities

* raw data "getters"
* raw data
* features
* analytic base tables
* modeling tasks
* models
* model evaluation strategies