Where to start


Best practice

Example pipelines

Conference tracks and workshops

Big data on a single machine / on the command line


Managing building and deploying models

  • kubeflow Machine Learning Toolkit for Kubernetes (kubeflow)
  • ModelDB A system to manage machine learning models (MIT)
  • mlflow Open source platform for the complete machine learning lifecycle (Databricks)
  • datmo Open source model tracking tool for data scientists

Managing building models

  • Luigi is a Python module that helps you build complex pipelines of batch jobs. (Spotify)
  • Airflow is a platform to programmatically author, schedule, and monitor workflows (Airbnb)
  • Azkaban workflow manager (LinkedIn)
  • Pinball is a scalable workflow manager (Pinterest)

Deploying models

  • Serving A flexible, high-performance serving system for machine learning models (Google)
  • deepdetect Deep Learning API and Server in C++11 with Python bindings and support for Caffe, TensorFlow, XGBoost and t-SNE (deepdetect)
  • clipper A low-latency prediction-serving system (Berkeley)
  • MLeap Deploy Spark Pipelines to Production (combust.ml)
  • openscoring REST web service for true real-time scoring (<1 ms) of R, Scikit-Learn and Apache Spark models (openscoring)
  • mxnet-model-server Model Server for Apache MXNet is a tool for serving neural net models for inference (AWS)
  • hydro-serving ML FaaS - Machine Learning Serving cluster (hydrosphere.io)

Serialising and transpiling models

Monitoring models

  • Knowledge Repo A next-generation curated knowledge sharing platform for data scientists and other technical professions.


Amazon Web Services

  • Data Pipeline "is a web service that you can use to automate the movement and transformation of data"
  • Glue "is a fully managed ETL (extract, transform, and load) service"
  • Simple Workflow "makes it easy to build applications that coordinate work across distributed components"
  • Batch "enables you to run batch computing workloads on the AWS Cloud"
  • Machine Learning "cloud-based service that makes it easy for developers of all skill levels to use machine learning technology"
  • SageMaker "is a fully managed machine learning service"

Google Cloud

  • Dataflow "is a unified programming model and a managed service for developing and executing a wide variety of data processing patterns"
  • ML Engine "brings the power and flexibility of TensorFlow, scikit-learn and XGBoost to the cloud"


Microsoft Azure

  • Batch AI "helps you experiment with your AI models using any framework and then train them at scale across GPU and CPU clusters"
  • Machine Learning services "enable building, deploying, and managing machine learning and AI models using any Python tools and libraries"
  • Machine Learning Studio "is a collaborative, drag-and-drop tool you can use to build, test, and deploy predictive analytics solutions on your data"

