
Allow using Distributed #300

Closed
fonsp opened this issue Aug 16, 2020 · 24 comments · Fixed by #2240
Labels
backend (Concerning the julia server and runtime), enhancement (New feature or request)

Comments

@fonsp
Owner

fonsp commented Aug 16, 2020

Pluto uses Distributed to create worker processes and to send Julia data structures between them. It works really well! And Distributed is pleasant to work with.

However, it means that Pluto notebooks cannot use Distributed themselves, because your notebook's code is executed on a worker process: you can't create new processes from there, and you could accidentally control other running notebooks.

One solution is to run all your notebooks in the master process by setting the parameter:

import Pluto; Pluto.run(workspace_use_distributed=false)

But this makes the notebook server unresponsive while any notebook is running code, and the stop button is disabled.

Solutions

What would a solution be? Should Pluto implement its own Distributed? This seems silly - we would get the most robust implementation by copying Distributed's source code directly. Maybe there is a way to internally use a copy of Distributed? Copy the contents of julia/stdlib/Distributed into /tmp, rename Distributed to DistributedCopy and import that? But is the session state completely contained inside the package?

Is there a way to use Distributed, but with "multiple global sessions"?

Can we create a thin wrapper around Distributed and make sure that notebook processes use this instead? For example, Distributed.addprocs() would become RealDistributed.remotecall_eval(Main, 1, :(RealDistributed.addprocs())). Will that also work for the packages that you import inside your notebook that depend on Distributed?
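
Roughly, such a forwarding wrapper might look like the sketch below (the ProxyDistributed name and the choice of forwarded functions are made up for illustration; Distributed.remotecall_eval, Distributed.addprocs and Distributed.workers are the real functions):

# Hypothetical sketch: a proxy module that notebook code would load instead of
# Distributed, forwarding cluster management to the master process (PID 1).
module ProxyDistributed

import Distributed  # the real stdlib, already loaded on the notebook process

# Ask PID 1 to create the workers, since a worker process cannot add them itself.
addprocs(n::Integer) =
    Distributed.remotecall_eval(Main, 1, :(Distributed.addprocs($n)))

# Report the worker list as seen from the master process.
workers() =
    Distributed.remotecall_eval(Main, 1, :(Distributed.workers()))

end # module

Whether packages that internally call Distributed.addprocs would pick up such a proxy is exactly the open question above.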

@fonsp changed the title from "Allow Distributed" to "Allow using Distributed" Aug 16, 2020
@fonsp added the backend (Concerning the julia server and runtime), enhancement (New feature or request), and help welcome (If you are experienced in this topic - let us know!) labels Aug 16, 2020
@marius311
Contributor

Is there a reason why Pluto's workers need to be processes and can't just be threads?

(Two other upsides of threads would be reduced memory usage and the ability to work with non-serializable objects.)

@fonsp
Owner Author

fonsp commented Sep 7, 2020

Hmm. Can one thread interrupt another thread stuck in while true end? What happens when one thread segfaults?

@fonsp
Owner Author

fonsp commented Sep 7, 2020

Can different threads use separate package environments? Can they load different versions of the same package?

@fonsp
Owner Author

fonsp commented Sep 7, 2020

(Those aren't rhetorical questions 🙃 it's just that I have never used threads in Julia)

@Moelf
Contributor

Moelf commented Sep 13, 2020

Jupyter notebooks can do this because the kernel is one process and the Jupyter server runs in an entirely separate (Python) process.

I see two outs:

  1. mimic Jupyter and make each notebook a separate process, with the web UI tied to the main session where Pluto.run() happened, so we can send SIGINT to each notebook process (a rough sketch follows below)

  2. use the main session as a worker broker, i.e. when adding a worker in a notebook, add it at the main process and 'shadow'/forward it to the notebook worker via a socket file or something
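
A minimal sketch of what option 1 could look like at the OS-process level (notebook_runner.jl is a hypothetical entry point, not an actual Pluto file; run and kill are standard Base functions):

# launch a notebook as its own OS process, without blocking the main session
notebook_proc = run(`julia --project=. notebook_runner.jl`; wait=false)

# interrupt whatever the notebook is currently computing
kill(notebook_proc, 2)   # 2 == SIGINT on POSIX systems

# shut the notebook process down entirely (kill defaults to SIGTERM)
kill(notebook_proc)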

@carlocab

One solution is to run all your notebooks in the master process by setting an environment variable:

import Pluto; withenv(Pluto.run, "PLUTO_WORKSPACE_USE_DISTRIBUTED" => "false")

This workaround doesn't work for me, unfortunately. I'm still getting the same errors, like workspace3 not defined, which is what happens if I run Pluto normally but insist on using Distributed.

I guess it's not a big deal since this isn't the intended usage, but I thought it might be useful for you to know.

Thanks for the work you've put into Pluto!

@fonsp
Owner Author

fonsp commented Oct 21, 2020

Some motivational words:

https://www.youtube.com/watch?v=nwdGsz4rc3Q

@lukeburns

The workaround also fails for me. I get @everywhere not defined, and Distributed not defined when I try to access it via Distributed.@everywhere.

This behavior is unexpected:

[two screenshots of the unexpected behavior, 2021-04-22]

Why is this happening?

@lukeburns

lukeburns commented Apr 24, 2021

Quick and dirty workaround for addprocs and @everywhere, in case it's helpful to others.

### A Pluto.jl notebook ###
# v0.14.3

using Markdown
using InteractiveUtils

# ╔═╡ 797267f8-c7e6-4cb3-81d9-3ccc12956f56
begin
	# forward `@everywhere procs ex` to the real macro in this process's Main
	macro everywhere(procs, ex)
		return esc(:(Main.@everywhere $procs $ex))
	end
	# worker list without this notebook's own process id
	workers() = filter(pid -> pid != Main.myid(), Main.workers())
	macro everywhere(ex)
		# run on the workers, then have pluto handle evaluation on the workspace process
		return esc(:(@everywhere workers() $ex; eval($(Expr(:quote, ex)))))
	end
end

# ╔═╡ cfa09121-2457-42b1-9d20-e2518e7474e0
begin
	# load Distributed into Main on pid 1 and forward addprocs/rmprocs there,
	# since the notebook's own worker process cannot manage the cluster
	@everywhere 1 using Distributed
	addprocs(args...; kwargs...) = @everywhere 1 addprocs($args...; $kwargs...)
	rmprocs(args...; kwargs...) = @everywhere 1 rmprocs($args...; $kwargs...)
end

# ╔═╡ 6c0e0a11-3cc5-4ebe-b6f5-8df3e409cd05
@everywhere a = 2

# ╔═╡ ef45e5cc-cd10-4ad4-811a-9b767670dbf4
a^2

# ╔═╡ Cell order:
# ╠═797267f8-c7e6-4cb3-81d9-3ccc12956f56
# ╠═cfa09121-2457-42b1-9d20-e2518e7474e0
# ╠═6c0e0a11-3cc5-4ebe-b6f5-8df3e409cd05
# ╠═ef45e5cc-cd10-4ad4-811a-9b767670dbf4

@fonsp
Owner Author

fonsp commented May 18, 2021

Hey @r-acad !

Can you remove this question here and open a new Discussion?

@Oblynx

Oblynx commented Dec 30, 2021

I wonder why Distributed can't nest. Is there any ongoing discussion with upstream?

@Oblynx

Oblynx commented Jan 7, 2022

You mention that

  1. you can't create processes
  2. you can accidentally control other running notebooks

If (2) is undesirable, we can use https://docs.julialang.org/en/v1/manual/distributed-computing/#Specifying-Network-Topology-(Experimental) like in #1812
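
For example (a sketch using only the documented addprocs keyword; the worker count is arbitrary):

using Distributed

# with the :master_worker topology, workers only talk to the master process,
# so they cannot reach (or control) each other directly
addprocs(2; topology=:master_worker)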

For (1), I wonder if it would be possible to launch independent Julia processes from each worker, instead of running the notebooks inside the processes created by Distributed.addprocs. This kind of encapsulation might be possible with a custom Distributed.ClusterManager.
Pluto uses the default LocalManager; however, by overriding the Distributed.launch method, we could possibly do this (see the skeleton below).
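
To make that concrete, here is a bare skeleton of the ClusterManager interface (NotebookManager is a made-up name; launch and manage are the two methods the Distributed documentation asks for, and the bodies below are placeholders rather than a working launcher):

using Distributed

# Hypothetical skeleton: a custom manager that would launch each notebook as an
# independent julia process instead of a plain LocalManager worker.
struct NotebookManager <: Distributed.ClusterManager
    nworkers::Int
end

function Distributed.launch(manager::NotebookManager, params::Dict,
                            launched::Array, launch_ntfy::Condition)
    # Start `manager.nworkers` julia processes (compare LocalManager's launch),
    # push a Distributed.WorkerConfig for each onto `launched`,
    # and notify(launch_ntfy) as they come up.
end

function Distributed.manage(manager::NotebookManager, id::Integer,
                            config::Distributed.WorkerConfig, op::Symbol)
    # React to :register / :interrupt / :deregister events for worker `id`.
end

# Workers would then be added with:
# addprocs(NotebookManager(2))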

On second thought, without having seen the internals of Distributed but after checking the ClusterManager API docs: there might be shared state between worker and master that contains a map pid => worker_id, maybe belonging to the ClusterManager.

In such a case, I wonder if it is simply a matter of explicitly creating a new LocalManager instance to use in the notebooks.

@Oblynx

Oblynx commented Jan 10, 2022

I believe this issue is important because Distributed is part of the Julia stdlib and supports a core paradigm of modern programming. It arguably has high teaching value and can help "beginner" programmers understand how to make use of modern computing infrastructure in a very high-level way. IIUC, Pluto aims to facilitate prototyping of code that can be used in practice, and spawning processes or using Dagger is usually best considered from the beginning; it's not just an optimization.

What do the Pluto maintainers think, is this still interesting?

@fonsp
Owner Author

fonsp commented Jan 11, 2022

Hi @Oblynx ! Thanks so much for your input, we really want this fixed! I fully agree that Distributed is essential to Julia's ecosystem, and also to beginners.

We have not posted much to this issue, but we have been discussing it regularly for a long time now. @dralletje has made a prototype of Pluto without Distributed, but I think the performance hit was too big.

I did not bring this up upstream at julia itself because I am quite intimidated by the problem, since I have little experience with distributed computing. I am also worried that the API of Distributed is not designed to handle a nested tree structure.

Your approach sounds very promising! Going through the Distributed codebase, I felt like a good approach would be to override some globals, simulating the PID=1 context on notebook processes. Creating a new ClusterManager sounds even better!

@dralletje
Collaborator

[image attachment]

😏

@Oblynx

Oblynx commented Jan 26, 2022

🐙 ! But how?

@fonsp linked a pull request Jan 26, 2022 that will close this issue
@fonsp linked a pull request Feb 4, 2022 that will close this issue
@fonsp
Owner Author

fonsp commented Jun 15, 2022

Good news! We have a GSoC student working on this issue this summer! @savq

@fonsp removed the help welcome (If you are experienced in this topic - let us know!) label Jun 15, 2022
@schlichtanders

Awesome to see progress on replacing Distributed!

Now that GSoC is over, is there a roadmap with the next steps planned?

@fonsp
Owner Author

fonsp commented Feb 1, 2023

Lots of progress happening in https://github.com/JuliaPluto/Malt.jl ! Take a look at https://github.com/JuliaPluto/Malt.jl/milestone/1

@fonsp
Owner Author

fonsp commented Sep 18, 2023

We fixed it! 🎉

The fix is in #2240, thanks to @savq (GSoC), @Pangoraw, @habemus-papadum and @pankgeorg! Also thanks to @dralletje for previous work in #1854 and #1896.

Test release

The upcoming Pluto release will start a testing period where Pluto still uses Distributed by default, but you can enable Malt with:

Pluto.run(workspace_use_distributed_stdlib=false)

Please try it out (] add Pluto#main) and give us your feedback!
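
Putting the two steps together in a fresh Julia session (this just combines the commands above):

# in the Pkg REPL (press `]`):
#   add Pluto#main

import Pluto

# during the testing period, opt in to the Malt.jl backend:
Pluto.run(workspace_use_distributed_stdlib=false)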

@dralletje
Collaborator

O MY GODDDDD

@schlichtanders

schlichtanders commented Oct 11, 2023

Now that Distributed can be loaded, it seems to run into the serialization problem reported in #1030.

So there is still no real use of Distributed until serialization is handled?

@fonsp
Owner Author

fonsp commented Oct 18, 2023

@schlichtanders Can you give an example?

@schlichtanders

I added it to the other open issue mentioned above: #1030 (comment)

Now that Pluto supports the use of Distributed via Malt.jl, this issue appears again.

A simple remote call like

remotecall_fetch(() -> readchomp(`hostname`), pid) 

already gives an error like

UndefVarError: `workspace#72` not defined

I opened an issue upstream for this: JuliaLang/Distributed.jl#1
