- Make & Makefile Overview
- ETL Styleguide
AKA "Makefile 301"
This page defines the DataMade styleguide for processing data, a collection of "best practices" for writing Makefiles.
To make use of this page, you should already have a good understanding of what Make is and why we use it. You should also be comfortable with bash syntax and the Unix filesystem.
Some loose notes on best practices:
- Some transformations, especially those chaining Unix tools, are obscure. Consider printing the purpose of the transformation (as in:
@echo "Downcasing the header of this csv"
) - Do not suppress output to the console. Being able to see commands as they execute helps users debug errors.
- Avoid piping stderr to dev/null.
- When possible, list recipes in rough order of processing steps (from start to finish).
Stay DRY
- To limit verbosity, use arg flags and automatic variables.
- Define 'all' and 'clean' phony targets for keeping the repo clean.
- Prefer implicit patterns over explicit recipes. Implicit patterns are DRY, and files created by implicit patterns will automatically be cleaned up.
- When implicit patterns are not attractive, set intermediate build files as dependencies of the
.INTERMEDIATE
target. Make will clean up these files for you. - Makefile directives go at the top of the file, followed by variables, followed by 'all' and 'clean' targets.
It's best to use implicit rules and .INTERMEDIATE targets to have Make clean up intermediate files for you. Sometimes, though, this is not so easy. In particular, if a step in the build process emits multiple files, only some of which will be dependencies, it's not very easy to get Make to clean up everything. This happens frequently when working with ESRI shapefiles which include a .prj, .dbf, .xml, and other files in addition to the .shp file.
To handle these types of issues, define a .PHONY cleanup target for everything that Make misses, and document the cleanup target in the README.
Variables are names defined in a Makefile to refer to files, directories, targets, or just about anything that you can represent with text. They are defined in ALL_CAPS
, and referred to with the same syntax as $(BASH_VARIABLES)
.
A few common variables we define:
variable | description |
---|---|
GENERATED_FILES |
A list of all the final output targets that the Makefile can build. This is used as a shorthand way of calling everything in the all phony target. |
DATA_DIRS |
If there is a master Makefile with imported sub-Makefiles, this includes all of the sub-directories. It is useful for the all phony target in the master Makefile. |
DIR |
Points to the directory that contains the master Makefile. |
PYTHON_BIN |
If a Python Virtual Environment needs to be created, this points to the bin/ directory inside that virtual environment. |
PROCESSOR_DIR |
If the workflow requires processors, this points at the directory containing the processors. |
If you have a master Makefile and multiple sub-Makefiles, you should define GENERATED_FILES
in each sub-Makefile, and the other variables above in the master Makefile.
When processing a target requires more than can be accomplished with our standard toolkit, a processor (i.e. a script for a single operation, often written in Python) can be written.
For the sake of easier reuse, each processor should be modular, only handling one operation on a file. Each processor should be configured to accept input on STDIN
and write output to STDOUT
, so that it's easy to chain processors and operations.
All processors should live in a processors/
directory in the root of the repository. To make processors available to all Makefiles, define the path to processors/
in the PROCESSORS
variable.
Some examples of single-purpose processors:
- excel date column -> ISO formatted date column
- 'NA' or 'N/A' -> None
- delete empty rows from a csv
- strip whitespace in a csv
- For fetching content on the web, use wget. Use the
--no-use-server-timestamps
flag so you don't download it every time Make runs; use the-O
flag to define a custom filepath for the output. - For manipulating geo files use GDAL/OGR.
- For concatenating, cropping, inspecting, or extracting single values from PDF files use the appropriate Poppler (
brew install poppler
orapt-get poppler-utils
) utility. - For scraping entire PDFs, use tabula-java. (See get-the-lead-out for an example.)
- Use CSVkit for manipulating spreadsheets, or things that can be made into spreadsheets. In particular:
- For simple SQL-like queries use CSVkit; for more complicated queries use postgreSQL.
- For geospatial queries use postgis.
- For text manipulation use perl (like sed but without portability issues), unless it's substantially easier to do it with awk.
- unzip, gzip, and tar for decompressing files. If you are compressing files and have the option, use tar zcvf.
- For custom transform code, write Python scripts.
In general, prefer simple Unix tools to custom scripts (processors). When a processor is absolutely necessary, try to follow the Unix philosophy ("do one thing and do it well").
In the case that a project has multiple separate data components, you can define a master Makefile at the root of the repository, along with sub-directories that each have a sub-Makefile at the root. When using this type of nested structure, all data processing/transformation should be defined in the sub-Makefiles - the master Makefile should only handle setting up the environment, defining variables/targets used by multiple sub-Makefiles, and calling sub-Makefiles.
|-- Makefile # the master makefile
|-- README.md
|-- data/
| |-- <sub-directory for a data processing component>
| | |-- Makefile # a sub-makefile
| | |-- README.md # documents the data source & how data gets processed
| | |-- finished/ # a directory for finished files
| |-- <sub-directory for another data processing component>
| | |-- Makefile # a sub-makefile
| | `-- < ... etc ... >
|-- processors
| |-- <processor_name>.py
| `-- <another_processor_name>.sh
`-- requirements.txt # lists install requirements for the pipeline