# Exercise: Parallel Monte Carlo (Threads)

**Note: You should use multiple Julia threads for this exercise!**

In [None]:
using Base.Threads
@assert Threads.nthreads() > 1
Threads.nthreads()

Calculate the value of $\pi$ through parallel direct Monte Carlo.

A unit circle is inscribed in a square with side length 2 (from -1 to 1). The area of the circle is $\pi$, the area of the square is 4, and the ratio of the two is $\pi/4$. This means that, if you throw $N$ darts uniformly at random at the square, approximately $M=N\pi/4$ of those darts will land inside the unit circle.
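
As a rough guide to how accurate the estimate can be, here is a short derivation of its statistical error. It assumes the darts are independent and uniformly distributed, so that the hit count $M$ is binomial with success probability $p = \pi/4$; the symbols $\hat\pi$, $p$, and $\sigma_{\hat\pi}$ are introduced here for illustration only.

$$
\hat\pi = \frac{4M}{N}, \qquad M \sim \mathrm{Binomial}(N, p), \qquad p = \frac{\pi}{4}
$$

$$
\mathrm{Var}(\hat\pi) = \frac{16\,p\,(1-p)}{N} = \frac{\pi\,(4-\pi)}{N}, \qquad \sigma_{\hat\pi} = \sqrt{\frac{\pi\,(4-\pi)}{N}} \approx \frac{1.64}{\sqrt{N}}
$$

Under these assumptions, $N = 10^7$ darts give a typical error of about $5 \times 10^{-4}$, so only the first three or so digits of the estimate should be expected to match $\pi$.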

Throw darts randomly at the square and count how many of them ($M$) land inside the unit circle. Approximate $\pi \approx 4M/N$. Visualization:

In [None]:
using Plots
using Distributions

# plot circle
circlepts = Plots.partialcircle(0, 2π, 100)
plot(circlepts, aspect_ratio=:equal, xlims=(-1, 1), ylims=(-1, 1), legend=false, lw=3, grid=false, frame=:box)

# plot darts
N = 400
d = Uniform(-1, 1)
scatter!(rand(d, N), rand(d, N), ms=2.5, color=:black)

### Basic Julia Implementation

In [None]:
function compute_pi(N)
    M = 0 # number of darts that landed in the circle
    for i in 1:N
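        # rand() samples from [0, 1), so the darts land in the first quadrant only;
        # by symmetry the fraction of hits is still π/4
        if sqrt(rand()^2 + rand()^2) < 1.0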
            M += 1
        end
    end
    return 4 * M / N
end

In [None]:
compute_pi(10_000_000)

### Tasks

1. Based on `compute_pi`, write a parallel version `compute_pi_parallel(N::Int)` that divides the work into `Threads.nthreads()` parallel tasks. The final estimate for π should be the average of the per-task estimates.
    1. **Hint:** Be aware of false sharing: make sure that every task operates on local data and only shares its local result at the end.
    2. **Hint:** You may call the serial `compute_pi` in your code.
    3. **Hint:** If you want, implement two versions, one based on `@tasks` and one using `@spawn`. (Bonus: write one using `tmapreduce` from OhMyThreads).

2. Benchmark and compare the serial and parallel variants.
    1. **Hint:** A reasonable value for $N$ could be `N = 10_000_000`.
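
One possible shape for the `@spawn` variant of task 1 is sketched below. It is not the reference solution: the name `compute_pi_spawn_sketch` and the `ntasks` keyword are made up for this illustration, and `compute_pi` is repeated (without the redundant `sqrt`) only so that the block runs on its own. Each task counts its hits in a local variable and returns a single number, which is what keeps false sharing out of the picture.

```julia
using Base.Threads: @spawn, nthreads

# Serial estimator, repeated here so the block is self-contained.
function compute_pi(N)
    M = 0 # number of darts that landed in the circle
    for _ in 1:N
        if rand()^2 + rand()^2 < 1.0
            M += 1
        end
    end
    return 4 * M / N
end

# Hypothetical helper: split the darts into `ntasks` equal chunks,
# run one task per chunk, and average the per-task estimates.
function compute_pi_spawn_sketch(N::Int; ntasks::Int = nthreads())
    chunk = cld(N, ntasks) # darts per task, rounded up (total may slightly exceed N)
    tasks = map(1:ntasks) do _
        @spawn compute_pi(chunk) # each task only touches its own local counter
    end
    return sum(fetch, tasks) / ntasks
end

compute_pi_spawn_sketch(10_000_000)
```

Averaging the per-task estimates is only an unbiased combination because every task throws the same number of darts; with unequal chunks the estimates would have to be weighted by chunk size.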

In [None]:
using BenchmarkTools
using OhMyThreads
using Base.Threads

In [None]:
# TODO...

# @btime compute_pi_parallel(10_000_000) samples=5 evals=2

3. Write a function `compute_pi_multiple(Ns::Vector{Int})` which takes in a collection of values for $N$ (`Ns`) and **in serial** computes $\pi$ for all these values. The function should be **entirely serial** and based on `compute_pi`. Benchmark and compare to the previous variants.

In [None]:
some_Ns = [10, 100, 1000, 10_000, 100_000, 1_000_000, 2_000_000, 3_000_000, 4_000_000]

# TODO...

# @btime compute_pi_multiple($some_Ns) samples=5 evals=2

4. Write a function `compute_pi_multiple_parallel(Ns::Vector{Int})` which takes in a collection of values for $N$ (`Ns`) and **in parallel** computes $\pi$ for all these values. The function should still be based on the serial `compute_pi`. Benchmark and compare to the previous variants.
    1. **Hint:** You shouldn't use `@tasks` with default configuration (`ntasks=nthreads()`) here, because the workload is non-uniform. Either use `@set ntasks=length(Ns)` or write a version using `@spawn`. (Bonus: write a variant using `tmap` from OhMyThreads.)
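
The `@spawn` route mentioned in the hint could look roughly like the sketch below, which starts one task per entry of `Ns` so that the scheduler can balance the uneven workloads across threads. The function name `compute_pi_multiple_spawn_sketch` is hypothetical, and `compute_pi` is repeated (without the redundant `sqrt`) only to keep the block self-contained.

```julia
using Base.Threads: @spawn

# Serial estimator, repeated here so the block is self-contained.
function compute_pi(N)
    M = 0 # number of darts that landed in the circle
    for _ in 1:N
        if rand()^2 + rand()^2 < 1.0
            M += 1
        end
    end
    return 4 * M / N
end

# Hypothetical helper: one task per N, results returned in the order of `Ns`.
function compute_pi_multiple_spawn_sketch(Ns::Vector{Int})
    tasks = map(Ns) do N
        @spawn compute_pi(N) # small and large N each get their own task
    end
    return fetch.(tasks)
end

compute_pi_multiple_spawn_sketch([10, 1000, 100_000, 1_000_000])
```

With one task per `N`, a thread that finishes a small workload can pick up the next pending task instead of idling, which is the point of not using a fixed split into `nthreads()` chunks here.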

In [None]:
some_Ns = [10, 100, 1000, 10_000, 100_000, 1_000_000, 2_000_000, 3_000_000, 4_000_000]

# TODO...

# @btime compute_pi_multiple_parallel($some_Ns) samples=5 evals=2

5. Calculate $\pi$ estimates for the following $N$ values: `Ns = ceil.(Int, exp10.(range(1, stop=8, length=50)))`. Plot $\pi$ vs $N$ on a semi-log plot.

In [None]:
# N values (nothing to do here)
Ns = ceil.(Int, exp10.(range(1, stop=8, length=50)));

In [None]:
# Important: the resulting pi estimates should be stored in a variable named: pis
# TODO...

In [None]:
# Plotting (nothing to do here)
plot(Ns, pis, color=:black, marker=:circle, lw=1, label="MC", xscale=:log10, frame=:box)
plot!(x -> π, label="π", xscale=:log10, linestyle=:dash, color=:red, lw=2)
ylabel!("π estimate")
xlabel!("number of dart throws N")

6. **Bonus:** Try to write a function `compute_pi_multiple_nested_parallel(Ns::Vector{Int})` which computes $\pi$ for all given $N$ values using nested multithreading: both the outer computation ("for each N in Ns") and the inner computation ("compute pi for a given N") should be parallelized. Benchmark and compare to the previous variants.
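
A minimal sketch of the nested structure, assuming plain `@spawn` at both levels, is shown below. All names ending in `_sketch` are hypothetical, and `compute_pi` is repeated (without the redundant `sqrt`) only so the block runs on its own. Because Julia tasks are scheduled onto a fixed pool of threads, spawning tasks from within tasks does not create additional threads.

```julia
using Base.Threads: @spawn, nthreads

# Serial estimator, repeated here so the block is self-contained.
function compute_pi(N)
    M = 0 # number of darts that landed in the circle
    for _ in 1:N
        if rand()^2 + rand()^2 < 1.0
            M += 1
        end
    end
    return 4 * M / N
end

# Inner level (hypothetical): split a single N into `ntasks` equal chunks.
function compute_pi_inner_sketch(N::Int; ntasks::Int = nthreads())
    chunk = cld(N, ntasks) # rounded up, so the total may slightly exceed N
    tasks = map(1:ntasks) do _
        @spawn compute_pi(chunk)
    end
    return sum(fetch, tasks) / ntasks
end

# Outer level (hypothetical): one task per N, each spawning its own inner tasks.
function compute_pi_multiple_nested_sketch(Ns::Vector{Int})
    outer = map(Ns) do N
        @spawn compute_pi_inner_sketch(N)
    end
    return fetch.(outer)
end

compute_pi_multiple_nested_sketch([1000, 100_000, 1_000_000])
```

Whether the extra level of parallelism pays off depends on the task overhead relative to the work per `N`; for very small `N` the inner split is likely to cost more than it saves, which is worth checking in the benchmark.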

In [None]:
some_Ns = [10, 100, 1000, 10_000, 100_000, 1_000_000, 2_000_000, 3_000_000, 4_000_000]

# TODO...

# @btime compute_pi_multiple_nested_parallel($some_Ns) samples=5 evals=2

# from above, for comparison
# @btime compute_pi_multiple_parallel($some_Ns) samples=5 evals=2