something about all these pipelining tools #86
Comments
Harlan has been working on a thing: https://twitter.com/HarlanH/status/641026432349675521
I manage to fit a lot into ordinary Python. I have been working on documentation of this approach, but the only decent documentation I have so far is this.
Thanks @tlevine!
Spark's MLlib now supports pipelines.
Thanks @karlhigley!
More stuff I found:
Awesome! Thanks @lauralorenz!
More thoughtful thoughts:
I had another open issue called "something about Luigi" (#78) which just contained a note that an "old email thread with Caroline Alexiou is relevant". |
also Bazel is a DAG-based thing, specifically for builds |
Collecting items from a slack with @lauralorenz, etc.:
Framing question(s):
Follow-up question:
A tour through several different angles on this:
sklearn has its pipeline stuff, but it's just for sklearn-compliant parts, and naturally just in Python: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
make: the classic (as espoused by kaggle, even!) http://blog.kaggle.com/2012/10/15/make-for-data-scientists/ https://www.gnu.org/software/make/
drake: Kind of neat; and people at Factual get to play with Clojure!
http://bubbles.databrewery.org/ Bubbles is a Python framework for data processing and data quality measurement. Basic concepts are abstract data objects, operations and dynamic operation dispatch.
http://keystone-ml.org/ "KeystoneML is software from the UC Berkeley AMPLab designed to simplify the construction of large-scale end-to-end machine learning pipelines."
https://github.com/spotify/luigi "Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in."
https://github.com/airbnb/airflow AirFlow is a system to programmatically author, schedule and monitor data pipelines.
https://github.com/pinterest/pinball Pinball is a scalable workflow manager
http://oozie.apache.org/ Oozie; you have to specify stuff with a bunch of XML (Steph presented on this)
fluentd / logstash / kafka are sort of related in that they are about routing messages around and making sure those "pipelines" work without fail. Add on elasticsearch/kibana for even more data do-stuffery.
http://www.treasuredata.com/ PAAS: Probably As A Service - you can use this thing to set up fluentd stuff?
https://civisanalytics.com/products/platform/ the Civis Data Science Platform https://www.youtube.com/watch?v=nMalICUv1UM a layer of fancy on top of redshift?
http://www.blendo.co/ "Create one Single Source of Truth for your data." and host your data in el cloudo
something on versioning databases http://enterprisecraftsmanship.com/2015/08/10/database-versioning-best-practices/
What am I missing? What are the best and coolest tools?
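For reference, the idea several of the tools above share (make, drake, Luigi, Airflow, Bazel) is running steps in dependency order over a DAG. Here is a minimal sketch of just that core idea, using only the Python standard library (`graphlib`, Python 3.9+). The step names and the `PIPELINE` / `run_order` names are made up for illustration; this is not how any of the listed tools is actually implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step maps to the set of steps it depends on.
PIPELINE = {
    "download": set(),
    "clean": {"download"},
    "features": {"clean"},
    "train": {"features"},
    "report": {"train", "clean"},
}


def run_order(deps):
    """Return the steps in an order where every dependency comes first.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    # A real tool would run a job here; this sketch only prints the step name.
    for step in run_order(PIPELINE):
        print("running", step)
```

Real workflow managers layer scheduling, retries, caching of already-built outputs, and monitoring on top of this ordering step, which is roughly where the tools start to differ.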