
Add draft of institutional FAQ #5214

Merged
merged 7 commits into dask:master from mrocklin:institutional-faq Aug 12, 2019

Conversation

@mrocklin
Copy link
Member

commented Aug 2, 2019

Fixes #5016

  • Tests added / passed
  • Passes black dask / flake8 dask
@mrocklin

Member Author

commented Aug 2, 2019

Soliciting @mmccarty @datametrician @ericdill @hussainsultan @dharhas for additional questions

@hussainsultan

Contributor

commented Aug 2, 2019

How about something along the lines of "How do I convince my IT team to use Dask?" Things like security, an open license, prerequisites for the requester to self-service, and accessibility of admin logs might be included for convincing IT teams. Thoughts?

@mrocklin

Member Author

commented Aug 2, 2019

@mmccarty

Member

commented Aug 3, 2019

Is Dask Resilient? What happens when a machine goes down?

Today I demoed killing a worker during a large matrix operation on a k8s cluster. Many minds blown!
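A demo like the one described above can be sketched roughly as follows, assuming `dask.distributed` is installed. The in-process cluster and the `retries` count are illustrative, not the actual k8s setup from the demo:

```python
# Rough sketch of a resilience demo, assuming dask.distributed is
# installed. An in-process cluster stands in for a real k8s cluster;
# retries=2 is an illustrative choice.
from dask.distributed import Client

client = Client(processes=False, n_workers=2, threads_per_worker=1)

# retries=2 asks the scheduler to rerun a task if its worker dies,
# which is what makes killing a worker mid-computation survivable.
futures = client.map(lambda x: x ** 2, range(100), retries=2)
total = client.submit(sum, futures).result()
print(total)  # prints 328350

client.close()
```

In a real demo you would kill one of the worker processes while the computation runs; the scheduler reschedules that worker's tasks elsewhere and the result is unchanged.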

@hussainsultan

Contributor

commented Aug 3, 2019

I am hoping we can crowdsource some of the things that may matter to an IT team. In my mind, we should start off with how to build a business case for dask, a set of actions that a dask user may ask an IT team to perform, and a set of best practices around operations.

  1. Business case: real-world use cases, team efficiency?, ROI?, new capabilities
  2. Actions the IT team needs to perform to enable operation (creating TLS certs, creating YARN queues, new cloud roles, ...)
  3. Best practices to avoid added risks (directories filling up during operation, hanging jobs, clean-up, logging, ...)

What else matters to an IT organization to build comfort in accepting a new distributed service?

@mmccarty

Member

commented Aug 3, 2019

The #1 question I get is how Dask compares to Spark. The article in the docs is great; unfortunately, execs don't have time to absorb it. It would be great to have a concise explanation (or picture?) outlining the differences.

@hussainsultan

Contributor

commented Aug 3, 2019

@mmccarty I agree, that should be part of the business case to answer a question like: "Why should I maintain a new service when I already maintain Spark?"

@mmccarty

Member

commented Aug 3, 2019

I'm working on an executive-level deck to explain "what is Dask and why do I care?" I've been asked to keep it to 5 - 6 slides. It is currently 23 slides. :(

@mmccarty

Member

commented Aug 3, 2019

How would I set up Dask on institutional hardware?

Or "How would I set up Dask under my company's cloud governance policies?"

A few random topics come to mind:

  • Behind a firewall that cannot access outside package repos. Most haven't figured out how to internally mirror conda and pip.
  • My IT team is just now allowing me to run containers in production accounts. No K8s, Fargate, EKS, GKE, etc.
  • My users don't have direct access to cloud provisioning APIs. This limits SpecCluster implementations. Perhaps making dask-gateway pluggable?
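On the mirroring bullet above: both pip and conda can be pointed at an internal mirror with a couple of config files. The `mirror.internal.example` host below is hypothetical:

```
# ~/.config/pip/pip.conf  -- pip; hypothetical internal mirror URL
[global]
index-url = https://mirror.internal.example/pypi/simple

# ~/.condarc  -- conda; hypothetical internal channel URL
channels:
  - https://mirror.internal.example/conda/main
```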
@ericdill


commented Aug 3, 2019

Thanks for pinging me on this @mrocklin

Questions that are at the forefront of my thinking, coming from a background in data engineering on Hadoop / Spark / YARN:

  • How do I deploy a Dask job into production? Things that I care about for "production":
    • Deploying the runtime environment
    • Hooking in to alerting systems
    • Finding execution logs
    • Monitoring the job over time:
      • Performance: Runtime, accuracy, number of rows / columns, summary statistics
      • Resource Usage: Monitoring CPU-hours and GB-hours (memory) over time is a common pattern for determining which steps in your pipeline need to be updated because the performance / resource usage is out of the acceptable bounds.
    • The Spark ecosystem has the Spark History Server and a tool from LinkedIn for creating historical trends called Dr. Elephant. Is there anything similar for Dask?
  • Tuning a Spark job for optimal performance is notoriously difficult. How does Dask compare (easier / harder / different)?
  • Spark can easily interoperate with a number of data formats like Parquet, ORC, Avro, JSON, CSV, etc. How about Dask?
  • Spark can easily interoperate with a number of data sources like HDFS, S3, Hive, and any database that supports JDBC. How about Dask?
  • Spark has a SQL interface via spark.sql (this is how you load in data from Hive), and people who are comfortable with SQL can start their big data journey by using this module. Is there anything analogous in the Dask ecosystem?
  • Can I run Dask on my existing Hadoop cluster(s)? What are the infrastructure requirements for using Dask in this kind of setting? (links to skein, dask-yarn, dask-gateway are all probably relevant here)

And some generic questions / statements to consider

  • My team has a ton of experience writing / debugging / owning code written in .

    • What are the kinds of problems that we should start to solve with Dask?
    • How should we think about solving problems with Dask?
    • Is it well-suited for data engineering type tasks (data ingest, data transformations)?
    • How much data does it scale to (millions / billions / trillions of rows by tens / hundreds / thousands of columns)?
  • We use a workflow / pipeline tool like Oozie, Airflow, Tivoli, or we rolled our own workflow scheduler (probably something like cron plus SQL plus a web interface for monitoring). How would Dask fit into this kind of tool? Can we start sprinkling Dask jobs in as single nodes in these workflow management tools?
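On the workflow-tool bullet above: one way to sprinkle Dask into an existing scheduler is to wrap the computation in a plain entry-point function that Airflow, Oozie, or cron invokes as a single node. A sketch with hypothetical task names, using only `dask.delayed` and the threaded scheduler:

```python
# Sketch of a Dask job wrapped as a single workflow-tool task.
# The functions and pipeline shape are hypothetical.
import dask

@dask.delayed
def load(i):
    return list(range(i))

@dask.delayed
def transform(data):
    return [x * 2 for x in data]

def run_pipeline():
    """Entry point a workflow tool (Airflow, cron, ...) would call."""
    parts = [transform(load(i)) for i in range(5)]
    total = dask.delayed(sum)([dask.delayed(sum)(p) for p in parts])
    # scheduler="threads" keeps the sketch self-contained; a production
    # job might connect a distributed Client here instead.
    return total.compute(scheduler="threads")

print(run_pipeline())  # prints 20
```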

cc @mariusvniekerk who can provide some additional color / context / ideas / questions to this thread.

@mrocklin

Member Author

commented Aug 4, 2019

I'm working on an executive-level deck to explain "what is Dask and why do I care?" I've been asked to keep it to 5 - 6 slides. It is currently 23 slides. :(

I suspect that this is internal-specific, but if a generic version of that existed I'm sure that it would see some use. @mmccarty

@mrocklin

Member Author

commented Aug 4, 2019

Thanks all. These responses have been great. I'm now thinking about how to organize this information into a concise set of questions for users to navigate, and that I (and hopefully others?) can write. To start off, I'm personally thinking of writing a short paragraph or two for around 20 questions. Should we be thinking bigger than this? If so, would you all be interested in helping as a group? If not, how do we select a small set of questions from those that are listed above, and how do we organize them?

@mariusvniekerk


commented Aug 4, 2019

I'd probably structure them around high-level topics like:

  • Deploying Dask at scale
  • Monitoring Dask
  • Debugging Dask
  • Securing Dask
  • Interoperability of Dask with other tools

@mmccarty

Member

commented Aug 5, 2019

@mrocklin Happy to share a generic version of the deck once it is ready. Also, happy to help out with this document.

@dharhas


commented Aug 5, 2019

Need a section that addresses the things security folks will be concerned about, e.g., which ports does Dask use, do any ports need to be opened, etc.

@mrocklin mrocklin force-pushed the mrocklin:institutional-faq branch from 306495a to 8c1208a Aug 8, 2019

@mrocklin

Member Author

commented Aug 8, 2019

OK, I've pushed up some content. Currently it is organized under three common roles that I often find in presentation rooms. Here is the current structure:

  • For Management
    • Briefly, what problem does Dask solve for us?
    • Is Dask Mature? Why should we trust it?
    • Who else uses Dask?
    • How does Dask compare with Apache Spark?
  • For IT
    • How would I set up Dask on institutional hardware?
    • Are these long running, or ephemeral?
    • Is Dask Secure?
    • Do I need to purchase a new Cluster?
    • How do I manage users?
    • How do I manage software environments?
  • For Technical Leads
    • Will Dask “just work” on our existing code?
    • How well does Dask scale? What are Dask’s limitations?
    • Is Dask Resilient? What happens when a machine goes down?
    • Is the API exactly the same as NumPy/Pandas/Scikit-Learn?
    • How much performance tuning does Dask require?
    • What Data formats does Dask support?
    • Does Dask have a SQL interface?

There were many questions above I didn't have time to answer. My hope is that this provides a firm enough base that we can add things to it as necessary.

@mrocklin

Member Author

commented Aug 8, 2019

I also plan to cull this list in a bit after I rest and clear out my memory of it.


If Dask's centralized scheduler goes down then you would need to resubmit the
computation. This is a fairly standard level of resiliency today, shared with
other tooling like Apache Spark, Flink, and others.


mmccarty Aug 9, 2019

Member

It might be good to mention distributed vs centralized schedulers here? Ray and Spark both have a distributed scheduler, but I don't know much about them.

A concern I've heard is over the centralized scheduler going down, but how likely is this given that the Dask scheduler isn't doing any of the work itself? It only has one job.


mrocklin Aug 9, 2019

Author Member

From a brief look, tangram seems to be a distributed system that is happy to launch Spark jobs, rather than an alternative distributed Spark driver.

Ray is genuinely a different architecture. I'm curious: I know that it uses Redis for metadata communication. Does it use distributed Redis for this or the more common single-node Redis server? If the former, how resilient is distributed Redis?


mmccarty Aug 9, 2019

Member

Sorry, I was misinformed about tangram and Spark. I don't see a native distributed Spark driver. My internal expert is double-checking to see if this is natively possible. Going back to Spark scheduler resiliency, it is achievable with a resource manager like YARN, which also transfers state.

From this discussion on Ray:

specifications of the tasks are stored in a sharded in-memory database, currently we are not resilient to the failure of this database, but we're prototyping a fault tolerance scheme based on chain replication for it

I believe this is the Redis metadata store you mention, so it is indeed no different from a resilience perspective.

Is it safe to say that scheduler resilience is left to a resource manager, relying on the user to resubmit computations?

This fact-checking was useful since this topic recently came up at C1. I think what you have is fine, but other readers may want to dig in on this as well.


mrocklin Aug 9, 2019

Author Member

Right, in normal operation relying on a resource manager like YARN or Kubernetes seems to solve the problem for most institutions. That said, highly available schedulers have been requested a few times and could be done. It's a non-trivial effort, though, and despite the interest I haven't seen a case where one was actually necessary.


mmccarty Aug 9, 2019

Member

Agreed. I haven't either, yet people seem to be concerned about it. I'll keep an eye on it.


mrocklin Aug 9, 2019

Author Member

The stock answer here is that "Dask has the same resilience properties as Spark"

The *vast majority* of institutional users though do not reach this limit.
For more information you may want to peruse our :doc:`best practices
<best-practices>`


mmccarty Aug 9, 2019

Member

One question I get. How does Dask move data around?


mmccarty Aug 9, 2019

Member

Sending this around has been helpful. Perhaps it would be good to list out the serializers here.
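As a sketch of the serialization machinery under discussion (not an exhaustive list of serializers): the distributed protocol tries Dask's own serializers first and falls back to pickle/cloudpickle. A round trip through `distributed.protocol`, assuming the `distributed` package is installed:

```python
# Round-trip sketch through Dask's serialization machinery, assuming
# the distributed package is installed.
from distributed.protocol import deserialize, serialize

payload = {"rows": [1, 2, 3]}
# header + frames together describe how to reconstruct the object.
header, frames = serialize(payload)
restored = deserialize(header, frames)
print(restored == payload)  # prints True
```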

@mmccarty

Member

commented Aug 9, 2019

Good stuff! A few more thoughts above. Let me know if you want help.

@mrocklin

Member Author

commented Aug 9, 2019

Good stuff! A few more thoughts above. Let me know if you want help.

That would be welcome

@mrocklin

Member Author

commented Aug 9, 2019

Thanks Tom

@mrocklin

Member Author

commented Aug 12, 2019

I plan to merge this in later today if there are no further comments. There is still plenty of work to do here, but hopefully this can provide a base from which to expand.

@TomAugspurger TomAugspurger merged commit 876f4ef into dask:master Aug 12, 2019

@TomAugspurger

Member

commented Aug 12, 2019

Thanks @mrocklin. We'll do additional items as followup PRs.

@mrocklin mrocklin deleted the mrocklin:institutional-faq branch Aug 12, 2019
