Implemented prototype of vector UDFs #103
Conversation
To continue our discussion, I see several possible options for implementing Julia workers.

**Option 1. Julia + libraries**

Many people look at Spark.jl as a way to run custom Julia code in a distributed environment. Currently it's kind of supported via the RDD API, but the implementation is buggy, sometimes slow, and produces cryptic error messages, so I don't really want to bring it to the UDFs. Also, with the current implementation we cannot use 3rd-party libraries unless they are already installed on the worker nodes. If we want to be really flexible about the supported operations, we should have a way to bring dependencies to the executors. How can we do this?
All in all, Apache Spark looks like a terrible environment for running such workflows, and even making a custom solution based e.g. on pure Kubernetes seems way easier (see some of my musings on this topic). There's also a number of specially designed frameworks for different tasks (e.g. training huge ML models) which we could wrap.

**Option 2. Julia only**

If we drop the requirement to support custom libraries, we neither need to pack the environment nor install the libraries during initialization. However, the Julia startup time is still pretty high (Julia 1.6 takes ~1.3 seconds on my machine), so launching it for every batch still looks like a huge overhead. Also, without custom libraries it's unclear how much value we bring. Yes, we would be able to apply simple transformations like adding two fields of a dataframe or e.g. computing their sine, but definitely not things like parsing a custom string format or running MCMC.

**Option 3. Julia-to-Java compiler**

In case we really only need simple transformations, instead of creating Julia processes on worker nodes we could actually compile Julia functions to equivalent Java UDFs and use them directly. A huge advantage of this approach is that we get 100% of the performance and all the flexibility of a native Java/Scala implementation. The main disadvantage, of course, is that we lose most of the value of Julia: Spark.jl then becomes a thin and convenient interface to modern Spark, but not more than that. Technically, creating a simple Julia-to-Java compiler doesn't seem too complicated. I already experimented with generating Java classes at runtime here, and generating Java code from a Julia function can be achieved using a tracer (e.g. via Ghost.jl) or by directly analyzing the IR code (e.g. via IRTools.jl).
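To illustrate the starting point for option 3 (a sketch, not part of this PR — the function name is made up): Julia's built-in reflection already exposes the lowered IR of a function, which a hypothetical Julia-to-Java translator could walk to emit equivalent Java expressions, without needing any third-party package.

```julia
# A simple UDF we might want to translate to a Java UDF.
add_and_scale(x, y) = (x + y) * 2.0

# Base reflection returns the lowered IR of the method.
# A translator could pattern-match these statements (calls to +, *, etc.)
# and emit the corresponding Java source for each one.
ir = code_lowered(add_and_scale)[1]
for (i, stmt) in enumerate(ir.code)
    println(i, ": ", stmt)
end
```

Ghost.jl (tracing) or IRTools.jl would give a richer, typed view of the same information, but even this lowered form shows the kind of flat statement list a simple compiler would consume.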
I think option 1 is the only reasonable way forward. Option 2, as you say, makes this somewhat pointless, and option 3 is a research project that will not be feasible for many years to come. I think leaning into what Julia already provides is the best option going forward. This basically means using the Manifest.toml as the mechanism to transfer and activate an environment. From what I can see, a manifest with a package server provides the same functionality as

This does not solve the latency problem, but for batch/"production" jobs, using a startup script to instantiate/precompile is very reasonable. For interactive jobs, the precompilation latency remains, but that's not too different from using the REPL, is it?

I suppose we'll need some way for the user to say "use this manifest" to run my job. Maybe model it similarly to how pyspark specifies its environments?
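As a sketch of the instantiate/precompile startup step mentioned above (the directory name is an assumption, not something decided in this thread), an executor-side bootstrap could look roughly like:

```julia
using Pkg

# Hypothetical directory where the driver shipped the user's
# Project.toml + Manifest.toml to this executor.
project_dir = joinpath(pwd(), "julia_env")

# Activate the shipped environment, then fetch exactly the package
# versions pinned in Manifest.toml (e.g. from a package server).
Pkg.activate(project_dir)
Pkg.instantiate()

# Precompile up front so the latency is paid once per executor,
# not once per batch.
Pkg.precompile()
```

For batch jobs this runs once before the first task; for interactive jobs the same cost is simply the familiar first-use latency of a fresh REPL session.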
The hard part is how to do this inside Spark executors. Spark doesn't provide any guarantees about the executor environment: we only know that the process will run inside some directory, but this directory may be on a local disk, in a YARN container, a Kubernetes pod, etc. We also don't know how long this directory will live and whether it's safe to store anything (e.g. downloaded packages) outside of it. Now consider several options:
Maybe we can speed up (1) by writing a flag file to indicate that the project is already instantiated on this executor, but all in all the absence of any guarantees about executor lifetime makes me really sad. We also should be very careful with
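The flag-file idea could look something like the following sketch (the flag name and environment path are hypothetical):

```julia
using Pkg

project_dir = joinpath(pwd(), "julia_env")    # hypothetical shipped env
flag = joinpath(project_dir, ".instantiated") # hypothetical flag file

Pkg.activate(project_dir)
if !isfile(flag)
    # First task on this executor: resolve and precompile once.
    Pkg.instantiate()
    Pkg.precompile()
    touch(flag)
end
# Later batches on the same executor skip straight to work.
```

Of course, this only helps while the executor's working directory survives, which, as noted above, Spark does not guarantee.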
I'd still like to see Julia UDFs, but it's not happening in this form, so I'll close this PR.
This is just a rebase of PR exyi#1, a first small step towards implementing Julia UDFs.