# Multiprocessing and Distributed Computing

**What**
* **single Julia process -> multiple Julia processes** that coordinate to perform certain computations

**Why**
* **Scaling things up**: run computations on multiple CPU cores, potentially even on different machines, e.g. nodes of a supercomputer or a local cluster of desktop machines.
* Effectively increase your memory: process a large dataset, which wouldn't fit into local memory, in parallel across multiple machines with separate dedicated RAM.
* (Typically avoids some multithreading / shared memory computing issues: e.g. multiple GCs.)

**Julia provides two fundamental implementations and paradigms**
* Julia's built-in [Distributed.jl](https://docs.julialang.org/en/v1/stdlib/Distributed/) standard library
* [Message Passing Interface (MPI)](https://www.mpi-forum.org/) through [MPI.jl](https://github.com/JuliaParallel/MPI.jl)

 

## Distributed.jl (standard library) vs MPI

**Distributed.jl**
* convenient for **"ad-hoc" distributed computing** (e.g. data processing)
* intuitive **master-worker model** (often naturally aligns with the structure of scientific computations)
* **interactivity**, e.g. in a REPL / in Jupyter
* built-in, no external setup necessary
* higher overhead than MPI and doesn't scale as well (by default it doesn't utilize InfiniBand/Omni-Path)

**MPI**
* **de-facto industry standard** for massively parallel computing, e.g. large scale distributed computing
* **known to scale well** up to thousands of compute nodes
* Single Program Multiple Data (SPMD) programming model can be more challenging (at least at first)
* **No (or very poor) interactivity** (see [MPIClusterManagers.jl](https://github.com/JuliaParallel/MPIClusterManagers.jl))


The focus of this notebook is on **Distributed.jl** (MPI → later).

## Distributed.jl (standard library)

Julia's Distributed.jl follows a master-worker paradigm for its native distributed parallelism: **One master process coordinates all the worker processes, which perform the actual computations.**

In [1]:
using Distributed

In [2]:
nprocs()

1

In [3]:
nworkers() # the master is considered a worker as long as there are no real workers

1

To increase the number of workers, i.e. Julia processes, from within a Julia session we can dynamically call **`addprocs`**.

Alternatively, when starting Julia from the command line, one can use the `-p` option up front. For example,

```
julia -p 4
```

will start Julia with 5 processes, 1 master and 4 workers.

In [4]:
addprocs(4)

4-element Vector{Int64}:
 2
 3
 4
 5

Every process has a Julia-internal `pid` (process id). The master always has pid 1. You can get the workers' pids from `workers()`.

In [5]:
workers()

4-element Vector{Int64}:
 2
 3
 4
 5

Note that the four workers' pids aren't necessarily 2, 3, 4, and 5, so one shouldn't rely on those literal values. Let's remove the processes and add them once more.

In [6]:
rmprocs(workers())

Task (done) @0x000000017377fb70

In [7]:
nworkers() # only the master is left

1

In [8]:
addprocs(4)

4-element Vector{Int64}:
 6
 7
 8
 9

In [9]:
workers()

4-element Vector{Int64}:
 6
 7
 8
 9

### One master to rule them all - `@spawn`, `@spawnat`, `@fetch`, `@fetchfrom`, `@everywhere`...

To execute commands and start computations on workers we can use the following macros

* `@spawn`: run a command or a code block on any worker and return a `Future` (a remote reference) to its result.
* `@spawnat`: same as `@spawn` but one can choose a specific worker by providing its pid.

Note the conceptual similarity between `Threads.@spawn` (task → thread) and `Distributed.@spawn` (task → process) and also `@async`.

In [10]:
@spawn 3+3

Future(6, 1, 10, ReentrantLock(nothing, 0x00000000, 0x00, Base.GenericCondition{Base.Threads.SpinLock}(Base.IntrusiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), (0, 0, 429749394650)), nothing)

In [11]:
result = @spawn 3+3

Future(7, 1, 11, ReentrantLock(nothing, 0x00000000, 0x00, Base.GenericCondition{Base.Threads.SpinLock}(Base.IntrusiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), (6, 8830587504640, 4294967298)), nothing)

In [12]:
fetch(result)

6

Because the combination of spawning and fetching is so common, there is `@fetch` (blocking), which combines them.

In [13]:
@fetch 3+3

6

In [15]:
@fetch rand(3,3)

3×3 Matrix{Float64}:
 0.871188  0.313584  0.990036
 0.651916  0.524621  0.811801
 0.394714  0.270683  0.363036

Which worker did the work?

In [16]:
@fetch begin
    println(myid());
    3+3
end

      From worker 7:	7


6

Using `@spawnat` and `@fetchfrom` we can delegate the work to a specific worker.

In [17]:
@fetchfrom 7 begin
    println(myid());
    3+3
end

      From worker 7:	7


6
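
`@spawnat` is the non-blocking counterpart of `@fetchfrom`: it returns a `Future` right away, and `fetch` retrieves the value later. A minimal sketch, assuming two local workers are added if the session has none (the names `w`, `fut`, and `fut_any` are only illustrative):

```julia
using Distributed

# Assumption: add two local workers if this session has none yet.
nprocs() == 1 && addprocs(2)

w = first(workers())            # pick a pid from workers() instead of hard-coding it
fut = @spawnat w myid()         # returns a Future immediately
remote_id = fetch(fut)          # blocks until worker `w` has finished

fut_any = @spawnat :any myid()  # `:any` lets Julia choose the worker, like `@spawn`
chosen = fetch(fut_any)
```

Taking the pid from `workers()` keeps the code valid even after workers have been removed and re-added.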

We can use `@sync` as a blocker to wait for all workers to complete their tasks.

In [19]:
@sync begin
    pids = workers()
    @spawn (sleep(2); println("Today is reverse day!"))
    @spawn (sleep(1); println(" Stuttgart!"))
    @spawn println("Hello")
end;
println("Done!")

      From worker 9:	Hello
      From worker 8:	 Stuttgart!
      From worker 7:	Today is reverse day!
Done!
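
`@sync` only waits for completion; it does not collect return values. To gather results from several workers, one option is to keep the `Future`s and `fetch` them afterwards. A sketch, again assuming two local workers are added if none exist (`futures` and `ids` are illustrative names):

```julia
using Distributed

# Assumption: add two local workers if this session has none yet.
nprocs() == 1 && addprocs(2)

# Launch one task per worker; each call returns a Future without blocking.
futures = map(workers()) do p
    @spawnat p myid()
end

# Fetch blocks on each Future in turn and collects the results on the master.
ids = fetch.(futures)
```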


Ok, now that we understood all that, let's delegate a *complicated* calculation

In [20]:
using Random

function complicated_calculation()
    sleep(1) # so complex that it takes a long time :)
    randexp(5)
end

@fetch complicated_calculation()

LoadError: On worker 6:
UndefVarError: `#complicated_calculation` not defined
Stacktrace:
  [1] [0m[1mdeserialize_datatype[22m
[90m    @[39m [90m~/.julia/juliaup/julia-1.9.3+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Serialization/src/[39m[90m[4mSerialization.jl:1385[24m[39m
  [2] [0m[1mhandle_deserialize[22m
[90m    @[39m [90m~/.julia/juliaup/julia-1.9.3+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Serialization/src/[39m[90m[4mSerialization.jl:869[24m[39m
  [3] [0m[1mdeserialize[22m
[90m    @[39m [90m~/.julia/juliaup/julia-1.9.3+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Serialization/src/[39m[90m[4mSerialization.jl:816[24m[39m
  [4] [0m[1mhandle_deserialize[22m
[90m    @[39m [90m~/.julia/juliaup/julia-1.9.3+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Serialization/src/[39m[90m[4mSerialization.jl:876[24m[39m
  [5] [0m[1mdeserialize[22m
[90m    @[39m [90m~/.julia/juliaup/julia-1.9.3+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Serialization/src/[39m[90m[4mSerialization.jl:816[24m[39m[90m [inlined][39m
  [6] [0m[1mdeserialize_global_from_main[22m
[90m    @[39m [90m~/.julia/juliaup/julia-1.9.3+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Distributed/src/[39m[90m[4mclusterserialize.jl:160[24m[39m
  [7] [0m[1m#5[22m
[90m    @[39m [90m~/.julia/juliaup/julia-1.9.3+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Distributed/src/[39m[90m[4mclusterserialize.jl:72[24m[39m[90m [inlined][39m
  [8] [0m[1mforeach[22m
[90m    @[39m [90m./[39m[90m[4mabstractarray.jl:3075[24m[39m
  [9] [0m[1mdeserialize[22m
[90m    @[39m [90m~/.julia/juliaup/julia-1.9.3+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Distributed/src/[39m[90m[4mclusterserialize.jl:72[24m[39m
 [10] [0m[1mhandle_deserialize[22m
[90m    @[39m [90m~/.julia/juliaup/julia-1.9.3+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Serialization/src/[39m[90m[4mSerialization.jl:962[24m[39m
 [11] [0m[1mdeserialize[22m
[90m    @[39m [90m~/.julia/juliaup/julia-1.9.3+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Serialization/src/[39m[90m[4mSerialization.jl:816[24m[39m
 [12] [0m[1mhandle_deserialize[22m
[90m    @[39m [90m~/.julia/juliaup/julia-1.9.3+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Serialization/src/[39m[90m[4mSerialization.jl:873[24m[39m
 [13] [0m[1mdeserialize[22m
[90m    @[39m [90m~/.julia/juliaup/julia-1.9.3+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Serialization/src/[39m[90m[4mSerialization.jl:816[24m[39m
 [14] [0m[1mhandle_deserialize[22m
[90m    @[39m [90m~/.julia/juliaup/julia-1.9.3+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Serialization/src/[39m[90m[4mSerialization.jl:876[24m[39m
 [15] [0m[1mdeserialize[22m
[90m    @[39m [90m~/.julia/juliaup/julia-1.9.3+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Serialization/src/[39m[90m[4mSerialization.jl:816[24m[39m[90m [inlined][39m
 [16] [0m[1mdeserialize_msg[22m
[90m    @[39m [90m~/.julia/juliaup/julia-1.9.3+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Distributed/src/[39m[90m[4mmessages.jl:87[24m[39m
 [17] [0m[1m#invokelatest#2[22m
[90m    @[39m [90m./[39m[90m[4messentials.jl:819[24m[39m[90m [inlined][39m
 [18] [0m[1minvokelatest[22m
[90m    @[39m [90m./[39m[90m[4messentials.jl:816[24m[39m[90m [inlined][39m
 [19] [0m[1mmessage_handler_loop[22m
[90m    @[39m [90m~/.julia/juliaup/julia-1.9.3+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Distributed/src/[39m[90m[4mprocess_messages.jl:176[24m[39m
 [20] [0m[1mprocess_tcp_streams[22m
[90m    @[39m [90m~/.julia/juliaup/julia-1.9.3+0.aarch64.apple.darwin14/share/julia/stdlib/v1.9/Distributed/src/[39m[90m[4mprocess_messages.jl:133[24m[39m
 [21] [0m[1m#103[22m
[90m    @[39m [90m./[39m[90m[4mtask.jl:514[24m[39m

What happened?

**Every worker is a separate Julia process.** (Think of having multiple Julia REPLs open at once.)

We only defined `complicated_calculation()` on the master process. The function doesn't exist on any of the workers yet.

The macro `@everywhere` allows us to perform steps on all processes (master and workers). This is particularly useful for loading packages, defining functions, etc.

In [21]:
@everywhere begin # execute this block on all processes (master and workers)
    using Random
    
    function complicated_calculation()
        sleep(1)
        randexp(5) # lives in Random
    end
end

In [22]:
@fetch complicated_calculation()

5-element Vector{Float64}:
 0.0865544577944237
 0.9688157400923854
 1.66180698451168
 0.7621178809337334
 1.9397115722497218

### Data movement

There is a crucial difference between the following two pieces of code. Can you guess what it is? (without reading on 😉)

In [23]:
function method1()
    A = rand(100,100)
    B = rand(100,100)
    C = @fetch A^2 * B^2
end

method1 (generic function with 1 method)

In [24]:
function method2()
    C = @fetch rand(100,100)^2 * rand(100,100)^2
end

method2 (generic function with 1 method)

Let's benchmark them.

In [25]:
using BenchmarkTools
@btime method1();
@btime method2();

  463.750 μs (90 allocations: 237.76 KiB)
  344.292 μs (72 allocations: 80.98 KiB)


Method 1 is slower, because `A` and `B` are created on the master process, transferred to a worker, and squared and multiplied on the worker process before the result is finally transferred back to the master.

Method 2, on the other hand, creates, squares, and multiplies the random matrices entirely on the worker process and only sends the result back to the master.

Hence, `method1` is **transferring 3x as much data** between the master and the worker!
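
The factor of three can be checked with a rough estimate. The sketch below uses `Base.summarysize`, which reports the in-memory size of an object; this is only an approximation of what actually goes over the wire, since it ignores serialization overhead and the closure itself. The variable names are illustrative.

```julia
A = rand(100, 100)
B = rand(100, 100)
C = A^2 * B^2

# method1 ships A and B to the worker and C back to the master
bytes_method1 = Base.summarysize(A) + Base.summarysize(B) + Base.summarysize(C)

# method2 only ships C back to the master
bytes_method2 = Base.summarysize(C)

ratio = bytes_method1 / bytes_method2
println("method1 moves roughly $(ratio)x the data of method2")
```

All three matrices have the same element type and shape, so the estimate is the same for each of them.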

In this toy example, it's rather easy to identify the faster method.

In a real program, however, understanding data movement does require more thought and likely some measurement.

For example, if the first process needs matrix `A` in a follow-up computation then the first method might be better in this case. Or, if computing `A` is expensive and only the current process has it, then moving it to another process might be unavoidable.

#### Computer latency at a human scale

To understand why thinking about data is important it's instructive to look at the time scales involved in data access.

<img src="./imgs/latency_human_scales.png" width=900px>

(taken from https://www.prowesscorp.com/computer-latency-at-a-human-scale/)

**Efficient data movement is crucial for efficient distributed computing!** (!)

Note that there are more tools for explicit data movement like `Channel`/`RemoteChannel` and [ParallelDataTransfer.jl](https://github.com/ChrisRackauckas/ParallelDataTransfer.jl/). For the sake of brevity, we will not cover them here.

#### Avoid globals (once more)

In [26]:
myglobal = 4

4

In [27]:
function whohas(s::String)
    @everywhere begin
        var = Symbol($s)
        if isdefined(Main, var)
            println("$var exists.")
        else
            println("Doesn't exist.")
        end
    end
    nothing
end

whohas (generic function with 1 method)

In [28]:
whohas("myglobal")

myglobal exists.
      From worker 6:	Doesn't exist.
      From worker 7:	Doesn't exist.
      From worker 8:	Doesn't exist.
      From worker 9:	Doesn't exist.


In [29]:
@fetchfrom 6 myglobal+2

6

In [30]:
whohas("myglobal")

myglobal exists.
      From worker 6:	myglobal exists.
      From worker 7:	Doesn't exist.
      From worker 8:	Doesn't exist.
      From worker 9:	Doesn't exist.


Globals get copied to workers and continue to exist as globals even after the call.

This could lead to **memory accumulation** if many globals are used (just as it would in a single Julia session). → avoid them as much as possible
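
One way to avoid leaving globals behind on the workers is to pass the value through a function argument, so that it travels as a local variable captured by the closure. A minimal sketch, assuming two local workers are added if none exist; `payload` and `add_two_on` are hypothetical names introduced for illustration:

```julia
using Distributed

# Assumption: add two local workers if this session has none yet.
nprocs() == 1 && addprocs(2)

w = first(workers())
payload = 4   # hypothetical global on the master

# `x` is a local variable of this function, so its value is shipped inside
# the closure and no global named `payload` is created on the worker.
add_two_on(pid, x) = @fetchfrom pid x + 2

result = add_two_on(w, payload)

# Check whether the worker ended up with a global called `payload`.
leaked = @fetchfrom w isdefined(Main, :payload)
```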

## High-level tools: `@distributed` and `pmap`

So far we have seen some of the fundamental building blocks for distributed computing with Distributed.jl. However, in practice, one wants to think as little as possible about how to distribute the work and explicitly spawn tasks.

Fortunately, many useful parallel computations do not require (much) data movement at all. A common example is a direct Monte Carlo simulation, where multiple processes can handle independent simulation trials simultaneously. (→ **montecarlo_pi** exercise)

Julia provides **high-level convenience** tools to
 * distribute `map` $\quad$ → $\quad$ [`pmap`](https://docs.julialang.org/en/v1/stdlib/Distributed/#Distributed.pmap)
 * distribute loops $\quad$ → $\quad$ [`@distributed`](https://docs.julialang.org/en/v1/stdlib/Distributed/#Distributed.@distributed)

### Parallel map: `pmap`

Very often, one simply wants to parallelize a `map` operation where a function `f` is applied to all elements of a collection. This is a typical instance of **data parallelism**, which covers a vast class of compute-intensive programs.

In [33]:
map(x->x^2, 1:10)

10-element Vector{Int64}:
   1
   4
   9
  16
  25
  36
  49
  64
  81
 100

Such a pattern can be parallelized in Julia via the high-level function `pmap` ("parallel map").

**Remark:** The function `f` must not mutate (non-local) data structures, since each worker would only modify its own copy!

#### Example: Singular values of multiple matrices

In [34]:
using Distributed, BenchmarkTools; rmprocs(workers()); addprocs(4); nworkers() # clean reset :)

4

In [35]:
@everywhere using LinearAlgebra

M = Matrix{Float64}[rand(200,200) for i = 1:10];

In [36]:
svdvals(rand(2,2))

2-element Vector{Float64}:
 0.9548868891807812
 0.1056781577053824

In [37]:
map(svdvals, M)

10-element Vector{Vector{Float64}}:
 [100.29927777182766, 7.990316604241419, 7.8371029575689555, 7.7843950520762295, 7.662012327809352, 7.603372854609421, 7.457614802948028, 7.35365819790247, 7.322883810374362, 7.250512223832819  …  0.2968023424574505, 0.2538544229110964, 0.22602966105410705, 0.1958812482110839, 0.176911430431651, 0.16220287713481782, 0.1210078382201315, 0.10152893341424117, 0.026572839002506943, 0.01823374857093899]
 [100.12554088524026, 7.991193997597216, 7.818119393025698, 7.717962036055705, 7.646255466942014, 7.569593876839441, 7.55521002331126, 7.391370146889184, 7.351788124030589, 7.270813845213522  …  0.3203466390456625, 0.2607704410176586, 0.24581478576906313, 0.23946715071544775, 0.21577788347688648, 0.17051192361726872, 0.12762826521200285, 0.07714555748641064, 0.032857210373608385, 0.007578189795361163]
 [99.90710008663977, 8.06965032045116, 7.94032456743014, 7.8715585696199835, 7.721531429258713, 7.647060667105083, 7.561582860711206, 7.4706168672497535, 7.4

In [38]:
pmap(svdvals, M)

10-element Vector{Vector{Float64}}:
 [100.29927777182766, 7.990316604241419, 7.8371029575689555, 7.7843950520762295, 7.662012327809352, 7.603372854609421, 7.457614802948028, 7.35365819790247, 7.322883810374362, 7.250512223832819  …  0.2968023424574505, 0.2538544229110964, 0.22602966105410705, 0.1958812482110839, 0.176911430431651, 0.16220287713481782, 0.1210078382201315, 0.10152893341424117, 0.026572839002506943, 0.01823374857093899]
 [100.12554088524026, 7.991193997597216, 7.818119393025698, 7.717962036055705, 7.646255466942014, 7.569593876839441, 7.55521002331126, 7.391370146889184, 7.351788124030589, 7.270813845213522  …  0.3203466390456625, 0.2607704410176586, 0.24581478576906313, 0.23946715071544775, 0.21577788347688648, 0.17051192361726872, 0.12762826521200285, 0.07714555748641064, 0.032857210373608385, 0.007578189795361163]
 [99.90710008663977, 8.06965032045116, 7.94032456743014, 7.8715585696199835, 7.721531429258713, 7.647060667105083, 7.561582860711206, 7.4706168672497535, 7.4

Let's check that this indeed utilized multiple workers.

In [39]:
pmap(M) do m
    println(myid());
    svdvals(m)
end;

      From worker 17:	17
      From worker 14:	14
      From worker 15:	15
      From worker 16:	16
      From worker 17:	17
      From worker 15:	15
      From worker 14:	14
      From worker 16:	16
      From worker 17:	17
      From worker 15:	15


In [40]:
@btime map($svdvals, $M);
@btime pmap($svdvals, $M);

  34.796 ms (81 allocations: 4.22 MiB)
  13.161 ms (510 allocations: 37.11 KiB)


### Distributed loops (`@distributed`)

In [41]:
using Distributed, BenchmarkTools; rmprocs(workers()); addprocs(4); nworkers() # clean reset :)

4

#### Example: Reduction

Task: Counting heads in a series of coin tosses.

In [46]:
function count_heads_loop(n)
    c = 0
    for i = 1:n
        c += rand((0,1))
    end
    return c
end

N = 200_000_000
@btime count_heads_loop($N);

  250.096 ms (0 allocations: 0 bytes)


Note that these kinds of computations are called **reductions** (with `+` being the **reducer function**).

In [47]:
count_heads_reduce(n) = mapreduce(i -> rand((0,1)), +, 1:n)
@btime count_heads_reduce($N)

  248.884 ms (0 allocations: 0 bytes)


100006669

We can parallelize this reduction using
* `@distributed (reducer function) for ...`.

Note that this has **blocking** character and returns the result.

In [48]:
function count_heads_distributed_loop(n)
    c = @distributed (+) for i in 1:n
        rand((0,1))
    end
    return c
end

count_heads_distributed_loop (generic function with 1 method)

In [49]:
@btime count_heads_distributed_loop($N);

  67.515 ms (302 allocations: 12.80 KiB)


The distributed version is about **4x faster**, which is all we could hope for.

With `@distributed` the work is **evenly distributed** between the workers.

In [50]:
function count_heads_distributed_verbose(n)
    c = @distributed (+) for i in 1:n
        x = rand((0,1))
        println(x);
        x
    end
    c
end

count_heads_distributed_verbose(8);

      From worker 20:	0
      From worker 20:	1
      From worker 19:	0
      From worker 19:	1
      From worker 18:	1
      From worker 18:	0
      From worker 21:	0
      From worker 21:	1


#### A common mistake

In [51]:
function g(n)
    a = 0
    @distributed (+) for i in 1:n
        a += 1
    end
    a
end

a = g(10);

What do you expect the value of `a` to be?

In [52]:
a

0
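
The result is `0` because each worker increments its own copy of `a`, while the master's `a` is never touched; in addition, the reduced value returned by `@distributed (+)` is thrown away. One possible fix is to let the loop body produce the value to be reduced and to assign the macro's result. A sketch (`g_fixed` is a hypothetical name), assuming two local workers are added if none exist:

```julia
using Distributed

# Assumption: add two local workers if this session has none yet.
nprocs() == 1 && addprocs(2)

function g_fixed(n)
    # The value of the last expression in the loop body is fed to the reducer;
    # the macro returns the reduced result, so we assign it.
    a = @distributed (+) for i in 1:n
        1
    end
    return a
end

total = g_fixed(10)
```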

#### Example: Array mutation

Apart from `@distributed (reducer) ...` there is also a `@distributed for ...` form. The latter is **non-blocking** and returns a `Task`. (You can think of it as a distributed version of `@spawn` for all the iterations.)

However, since the loop body will be executed on different processes, one must be careful to operate on **data structures that are available on all processes** (similar to the mistake highlighted above).

In [53]:
function square_broken()
    A = collect(1:10)
    @sync @distributed for i in eachindex(A)
        A[i] = A[i]^2
    end
    return A
end

square_broken (generic function with 1 method)

In [54]:
square_broken()

10-element Vector{Int64}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

To actually make all processes operate on the same array, one can use a `SharedArray`. For this to work, the **processes need to live on the same host**.

In [55]:
@everywhere using SharedArrays # must be loaded everywhere

In [56]:
A = rand(3,2)

3×2 Matrix{Float64}:
 0.254347  0.451978
 0.625417  0.0150406
 0.216238  0.68484

In [57]:
S = SharedArray(A)

3×2 SharedMatrix{Float64}:
 0.254347  0.451978
 0.625417  0.0150406
 0.216238  0.68484

In [59]:
function square!(X)
    for i in eachindex(X)
        sleep(0.001) # mimic some computational cost
        X[i] = X[i]^2
    end
end

function square_distributed!(X)
    @sync @distributed for i in eachindex(X)
        sleep(0.001) # mimic some computational cost
        X[i] = X[i]^2
    end
end

A = rand(10,10)
S = SharedArray(A)

@btime square!($A);
@btime square_distributed!($S);

  228.113 ms (508 allocations: 15.31 KiB)
  57.689 ms (543 allocations: 25.12 KiB)


Note that the functions `square!` and `square_distributed!` could have also been written with `map` and `pmap`, respectively. (data parallelism)

### `@distributed` vs `pmap`: When to choose which?

Julia's `pmap` is designed for the case where

* one wants to apply **a function to a collection**,
* one needs **load-balancing** (e.g. workload is non-uniform), and/or
* each function call does enough work to amortize the overhead. 

On the other hand, `@distributed` is good for

* **reductions** (can't be written as `map`/`pmap`),
* loops where each iteration **takes about the same time** (uniform workload), and/or
* loops where the workload of each iteration is (very) small.
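
If the individual calls are cheap but `pmap` is still the preferred interface, its `batch_size` keyword groups several elements into a single remote call, which can reduce the per-element communication overhead. A sketch, where the batch size of 25 is an arbitrary illustrative choice rather than a tuned value, and two local workers are added if none exist:

```julia
using Distributed

# Assumption: add two local workers if this session has none yet.
nprocs() == 1 && addprocs(2)

xs = 1:100

# Each remote call processes a batch of up to 25 elements instead of one.
squares = pmap(x -> x^2, xs; batch_size = 25)
```

Whether batching pays off depends on the cost of `f` relative to the communication cost, so it is worth measuring with `@btime` for the workload at hand.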

## High-level array abstractions: [DistributedArrays.jl](https://github.com/JuliaParallel/DistributedArrays.jl)

In a `DArray`, each process has local access to just a chunk of the data, and no two processes share the same chunk. Processes can be on different hosts.

Distributed arrays are for example useful if

* Expensive calculations should be performed in parallel on parts of the array on different hosts.
* The data doesn't fit into the local machine's memory (e.g. when loading big files in parallel).

In [None]:
using Distributed, BenchmarkTools; rmprocs(workers());

In [None]:
# make sure that all workers use the same Julia environment
addprocs(4; exeflags="--project")

In [None]:
# check
@everywhere @show Base.active_project()

In [None]:
@everywhere using DistributedArrays, LinearAlgebra

In [None]:
M = Matrix{Float64}[rand(200,200) for i = 1:10];

In [None]:
D = distribute(M)

Which workers hold parts of D?

In [None]:
procs(D)

In [None]:
workers()

Which parts do they hold?

In [None]:
localpart(D) # the master doesn't hold anything

In [None]:
# Which parts do they hold?
for p in workers()
    println(@fetchfrom p DistributedArrays.localindices(D))
end

In [None]:
@btime map($svdvals, $M);
@btime map($svdvals, $D);

In [None]:
@btime pmap($svdvals, $M);

## *Actual* distributed computing with Distributed.jl (some comments)

### Creating workers on other machines

So far we have worked with multiple processes on the same system, because we simply used `addprocs(::Integer)`. To put worker processes on other machines, e.g. nodes of a cluster, we need to modify the initial `addprocs` call appropriately.

In Julia, starting worker processes is handled by [ClusterManagers](https://docs.julialang.org/en/v1/manual/distributed-computing/#ClusterManagers).

* The default one is `LocalManager`. It is automatically used when running `addprocs(i::Integer)` and we have implicitly used it already.
* Another important one is `SSHManager`. It is automatically used when running `addprocs(hostnames::Array)`, e.g. `addprocs(["node123", "node456"])`. The only requirement is **passwordless SSH access** to all specified hosts.
* Cluster managers for SLURM, PBS, and others are provided in [ClusterManagers.jl](https://github.com/JuliaParallel/ClusterManagers.jl). For SLURM, this will make `addprocs` use `srun` under the hood.

*Example*

```julia
using Distributed

addprocs(["node123", "node123"])

@everywhere println(gethostname())
```

One can also start multiple processes on different machines:
```julia
addprocs([("node123", 2), ("node456", 3)]) # starts 2 workers on node123 and 3 workers on node456

# Use :auto instead of a number to start as many workers as the host has CPU threads,
# e.g. ("node123", :auto)
```

**Be aware of different paths:**
* By default, `addprocs` expects to find the julia executable on the remote machines under the same path as on the host (master).
* It will also try to `cd` to the same folder (set the working directory).


As you can see from `?addprocs`, `addprocs` takes a bunch of keyword arguments, two of which are of particular importance in this regard:

* `dir`: working directory for the worker processes
* `exename`: path to julia executable (potentially augmented with pre-commands) for the worker processes
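
The two keyword arguments can be tried out locally before moving to a cluster. The sketch below passes the current Julia binary and working directory as stand-ins so that it runs on a single machine; on a real cluster one would instead pass paths that are valid on the remote hosts (the paths in the comments are hypothetical).

```julia
using Distributed

# Local stand-ins so the sketch runs anywhere. On a cluster these would be
# paths valid on the remote machines, e.g. (hypothetical)
# exename = "/opt/julia/bin/julia" and dir = "/scratch/myproject".
julia_bin = joinpath(Sys.BINDIR, Base.julia_exename())
workdir = pwd()

new_pids = addprocs(2; exename = julia_bin, dir = workdir)

# Ask each new worker which directory it is running in.
worker_dirs = map(new_pids) do p
    @fetchfrom p pwd()
end
println(worker_dirs)
```

The same keywords can be combined with a list of hostnames, as in the `SSHManager` examples above.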

In [61]:
# cleanup
rmprocs(workers())

Task (done) @0x000000010952f210