something about all these pipelining tools #86

Open
ajschumacher opened this issue Sep 7, 2015 · 10 comments

@ajschumacher
Owner

Collecting items from a Slack conversation with @lauralorenz, etc.:

Framing question(s):

Are there common frameworks out there that people use to manage larger data science software projects? Something like Django, but with a data spin? Or are there any best practices for how to manage a large data science software project?

I guess another way to structure my question is: are there frameworks or best practices out there that enforce any sort of convention on the data pipeline?

Follow-up question:

Why are there so many of these??? And yet I don't know/like any of them very much? (Maybe that's part of the answer to the first part...)

A tour through several different angles on this:


sklearn has its Pipeline stuff, but it's just for sklearn-compliant parts, and naturally it's Python-only: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
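
To make that concrete, a minimal sketch of the sklearn style (the step names, estimators, and toy data below are just placeholders; any fit/transform-compliant pieces would do):

```python
# Minimal sklearn Pipeline sketch: chain a scaler and a classifier so they
# can be fit, cross-validated, and grid-searched as a single estimator.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target

pipe = Pipeline([
    ("scale", StandardScaler()),    # any transformer with fit/transform
    ("clf", LogisticRegression()),  # final estimator with fit/predict
])

pipe.fit(X, y)
print(pipe.predict(X[:5]))
```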


make: the classic (as espoused by Kaggle, even!) http://blog.kaggle.com/2012/10/15/make-for-data-scientists/

https://www.gnu.org/software/make/


drake: https://github.com/Factual/drake

A data workflow tool, like a "Make for data"

Kind of neat; and the people at Factual get to play with Clojure!


http://bubbles.databrewery.org/ Bubbles is a Python framework for data processing and data quality measurement. Basic concepts are abstract data objects, operations, and dynamic operation dispatch.


http://keystone-ml.org/ "KeystoneML is software from the UC Berkeley AMPLab designed to simplify the construction of large-scale end-to-end machine learning pipelines."


https://github.com/spotify/luigi "Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in."
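
A minimal sketch of what Luigi tasks look like (the file paths and task logic here are made up; the point is the requires/output/run convention):

```python
# Minimal Luigi sketch: two tasks, where CountLines depends on Fetch's output.
# Run with: python pipeline.py CountLines --local-scheduler
import luigi


class Fetch(luigi.Task):
    def output(self):
        return luigi.LocalTarget("data/raw.txt")  # hypothetical path

    def run(self):
        with self.output().open("w") as f:
            f.write("line one\nline two\n")


class CountLines(luigi.Task):
    def requires(self):
        return Fetch()  # Luigi runs Fetch first if its output doesn't exist

    def output(self):
        return luigi.LocalTarget("data/line_count.txt")

    def run(self):
        with self.input().open() as f:
            n = sum(1 for _ in f)
        with self.output().open("w") as out:
            out.write(str(n))


if __name__ == "__main__":
    luigi.run()
```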


https://github.com/airbnb/airflow Airflow is a system to programmatically author, schedule and monitor data pipelines.
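
And a minimal Airflow DAG sketch for comparison (import paths and the schedule parameter have moved around across Airflow versions; this follows the newer layout, and the DAG/task names and commands are placeholders):

```python
# Minimal Airflow DAG sketch: two bash tasks scheduled daily, one after the other.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_etl",                 # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                    # `schedule_interval` in older versions
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> load  # dependency: load runs after extract succeeds
```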


https://github.com/pinterest/pinball Pinball is a scalable workflow manager


http://oozie.apache.org/ Oozie; you have to specify stuff with a bunch of XML (Steph presented on this)

Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.

Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.

Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).

Oozie is a scalable, reliable and extensible system.


fluentd / logstash / kafka are sort of related in that they are about routing messages around and making sure those "pipelines" work without fail. Add on elasticsearch/kibana for even more data do-stuffery.


http://www.treasuredata.com/ PAAS: Probably As A Service - you can use this thing to set up fluentd stuff?


https://civisanalytics.com/products/platform/ the Civis Data Science Platform https://www.youtube.com/watch?v=nMalICUv1UM a layer of fancy on top of Redshift?


http://www.blendo.co/ "Create one Single Source of Truth for your data." and host your data in el cloudo


something on versioning databases http://enterprisecraftsmanship.com/2015/08/10/database-versioning-best-practices/
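
For the relational side, a migration in alembic is roughly a Python script with paired upgrade/downgrade functions; a minimal sketch (the table, column, and revision identifiers are made up):

```python
# Minimal Alembic migration sketch: add a column, with a matching downgrade.
# Files like this normally live in alembic/versions/ and are generated by
# `alembic revision`.
import sqlalchemy as sa
from alembic import op

# revision identifiers used by Alembic (hypothetical values)
revision = "a1b2c3d4e5f6"
down_revision = None


def upgrade():
    op.add_column("users", sa.Column("signup_source", sa.String(50), nullable=True))


def downgrade():
    op.drop_column("users", "signup_source")
```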


What am I missing? What are the best and coolest tools?

@ajschumacher
Owner Author

Harlan has been working on a thing: https://twitter.com/HarlanH/status/641026432349675521

@tlevine

tlevine commented Sep 7, 2015

I manage to fit a lot into ordinary Python. I have been working on documentation of this approach, but the only decent documentation I have so far is this.

@ajschumacher
Owner Author

Thanks @tlevine!

@karlhigley

Spark's MLlib now supports pipelines.
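
The Python API looks a lot like sklearn's Pipeline; a minimal sketch (assumes a SparkSession and a tiny toy DataFrame, so the column names, data, and model choice are just placeholders):

```python
# Minimal spark.ml Pipeline sketch: tokenize text, hash to features, fit a model.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()
train = spark.createDataFrame(
    [("spark pipelines are neat", 1.0), ("boring text here", 0.0)],
    ["text", "label"],
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    LogisticRegression(maxIter=10),
])

model = pipeline.fit(train)
model.transform(train).select("text", "prediction").show()
```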

@ajschumacher
Owner Author

Thanks @karlhigley!

@lauralorenz

More stuff I found:
Joblib, which I've used before for parallel processing (a nice wrapper of the standard multiprocessing library), but which can actually also manage checkpointing of large computational pipelines with an easy decorator (see the sketch after this list).
Spyre, a nascent data application framework built on CherryPy and Jinja2, with convenience classes for data wrangling and data visualization.
CubicWeb, a "semantic web framework" that builds from the data model upwards. Useful tools for building, observing, and updating RDBMS schemas out of the box.
Cubes, a framework for describing data models and auto-building APIs on top of them.
Dispel4py, a framework for abstractly defining distributed data workflows, with supported backends such as Apache Storm. See also the release paper.
Luigi, a Python library for job pipelining that comes with a web management console to track tasks.
alembic and south for relational database versioning
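
The Joblib checkpointing mentioned above looks roughly like this; a minimal sketch (note that the Memory constructor's first argument has been renamed across joblib versions, to `location` in newer releases):

```python
# Minimal joblib.Memory sketch: cache an expensive step to disk so re-running
# the pipeline skips any work whose inputs haven't changed.
import time

from joblib import Memory

memory = Memory("cache_dir", verbose=1)  # first arg is `location` in newer joblib


@memory.cache
def expensive_transform(xs):
    time.sleep(2)  # stand-in for a slow computation
    return [x * x for x in xs]


print(expensive_transform([1, 2, 3]))  # computed and written to cache_dir/
print(expensive_transform([1, 2, 3]))  # loaded from the on-disk cache
```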

@ajschumacher
Owner Author

Awesome! Thanks @lauralorenz!

@ajschumacher
Owner Author

More thoughtful thoughts:

I've separated out our list of concerns, with our current solution in parentheses. I'd love suggestions for:

  1. some mega framework this all fits into nicely
  2. suggestions at a more modular level, e.g. ETL, ML, visualization
  3. suggestions of better libraries/tools to use in the pipeline we've grown with to date

To clarify what I mean by "large": something that was once a collection of scripts, grew organically/without structure, and now is too big to handle. So not large in the sense of needing distributed infrastructure, as so far we've been able to cope by just going up hardware classes.

ETL
  • Performs data ingestion and wrangling from diverse APIs into a data store (pandas, regular ol' Python)
  • Supports smart rollback and reporting when data is corrupted (some function-bound commit/rollback with psycopg and try/excepts, "logging" with print statements, not that wide-reaching or easy to trace back)
  • Supports distributing the ETL tasks nightly onto on-demand large instances (cron/boto; see the sketch after this list)

ML
  • Supports distributing ML tasks (e.g. train, predict) on weekly/nightly schedules onto on-demand large instances (cron/boto)

Visualization/Output
  • Interactive reporting of data from the ETL for use by business users (Shiny)
  • Construction of a data API with endpoints for raw data, filtered/queried data, ML results, or to trigger a prediction ()

Overall
  • Supports versioning of database schemas, preferably with upgrade/downgrade capabilities a la Django/south/alembic, for relational and graph databases ()
  • Preserves secret keys safely (localsettings anti-pattern)
  • Supports seamless relative imports even when running convoluted/dependent file hierarchies as scripts, i.e. however Django does that (a lot of mumbo jumbo a la PEP 328)
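
For the cron/boto bullets above, the pattern is basically "cron runs a small launcher script"; a minimal sketch with boto3 (the successor to the boto library mentioned above), where the AMI, instance type, key name, and ETL entry point are all placeholders:

```python
# Minimal boto3 sketch of the cron/boto pattern: cron runs this script, which
# launches an on-demand instance that runs the ETL via user data and then
# shuts itself down. All identifiers below are placeholders.
import boto3

USER_DATA = """#!/bin/bash
python /opt/etl/run_nightly_etl.py   # hypothetical ETL entry point baked into the AMI
shutdown -h now
"""

ec2 = boto3.resource("ec2", region_name="us-east-1")
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="m5.2xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="etl-keypair",             # placeholder key pair
    UserData=USER_DATA,
    InstanceInitiatedShutdownBehavior="terminate",  # instance goes away when done
)
print("Launched:", instances[0].id)
```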

@ajschumacher
Owner Author

I had another open issue called "something about Luigi" (#78) which just contained a note that an "old email thread with Caroline Alexiou is relevant".

@ajschumacher
Owner Author

Also, Bazel is a DAG-based thing, specifically for builds.
