# Julia CPU Parallelism

## Summary:
The MacBook Pro M2 Max has 12 CPU cores,<br/>
however, setting JULIA_NUM_THREADS = 12 <font color=red>is NOT beneficial</font>.<br/>
In my tests, setting <font color=green>JULIA_NUM_THREADS = 6 is optimal</font>.

# Number of cores in various computers

- Win 10 Anaconda Prompt: 1
- MacBook Pro Intel CPU: 4
- MacBook Pro 14-inch M2 Pro CPU: 10
- MacBook Pro 16-inch M2 Pro CPU: 12
- <font color=green>MacBook Pro M2 Max CPU : 12</font>

- 2023-07-06 MacBook Pro M2 Max (JupyterLab) returns 1
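
To see where counts like these can come from, the sketch below prints both the number of logical CPU threads that Julia detects and the number of threads the current session was started with. It assumes the list above reports one of these two values, which the notebook does not state explicitly.

```julia
# Logical CPU threads detected by Julia on this machine
cpu_threads = Sys.CPU_THREADS

# Threads available to this Julia session (controlled by JULIA_NUM_THREADS
# or the --threads command-line flag); typically 1 when neither is set
julia_threads = Threads.nthreads()

println("Sys.CPU_THREADS    = ", cpu_threads)
println("Threads.nthreads() = ", julia_threads)
```

If the two numbers differ, the session is simply not using every core the machine offers, which is the situation the rest of this notebook tunes.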

# Create Julia kernel with 12 threads

- Execute the code below a single time to create the kernel

# Notice 12 cores above (MacBook M2 Max)

## <font color=red>Unfortunately, this does not work when starting JupyterLab from Anaconda</font>
In that case `Threads.nthreads()` returns 1.

To fix that, create a JupyterLab kernel with the number of threads you want (after experimentation, I set 6 threads).

In [1]:
Threads.nthreads()

6

In [2]:
# Run once
#using IJulia
#installkernel("Julia (7 threads)", env=Dict("JULIA_NUM_THREADS"=>"7"))

## [ Info: Installing Julia (7 threads) kernelspec in ~/Library/Jupyter/kernels/julia-_7-threads_-1.9
## "~/Library/Jupyter/kernels/julia-_7-threads_-1.9"

- Open the Kernel menu > Change Kernel > select the kernel with the desired number of threads from the dropdown

<img src="assets/Screenshot 2023-07-08 at 06.49.40.png" alt="select Julia kernel" />

In [3]:
Threads.nthreads()

6

# The following might not be necessary anymore if the above worked

## macOS

### Set environment variable:
- bbedit ~/.zshrc # EDIT variables, use bbedit if you installed it
- nano ~/.zshrc # EDIT variables

- export JULIA_NUM_THREADS=6 # ADD this line

### REFRESH the terminal
- source  ~/.zshrc

### Test the variable is set
$ echo $JULIA_NUM_THREADS
6

### Restart JupyterLab
- Anaconda app > start JupyterLab
- $ jupyter lab 
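
After the restart, a cell like the following can confirm whether the variable actually reached the kernel process. This is a minimal sketch; the fallback string `"(not set)"` is only an illustrative placeholder.

```julia
# Read the variable as the Julia process sees it; fall back to a placeholder if absent
requested = get(ENV, "JULIA_NUM_THREADS", "(not set)")

println("JULIA_NUM_THREADS  = ", requested)
println("Threads.nthreads() = ", Threads.nthreads())
```

If the variable prints as not set while the shell shows it, JupyterLab was probably launched from an environment that did not inherit the shell configuration.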

---

## Windows 10, 11

### Command Prompt

- C:\Users\UkiDL>set JULIA_NUM_THREADS=6
- C:\Users\UkiDL>setx JULIA_NUM_THREADS 6  # set permanently (setx takes no "="); applies to newly opened cmd windows
- C:\Users\UkiDL>echo %JULIA_NUM_THREADS%
- 6


In [4]:
Threads.nthreads()

6

In [5]:
# run ONCE, 216 dependencies successfully precompiled in 160 seconds. 84 already precompiled.
# import Pkg; 
# using Pkg
# Pkg.add("BenchmarkTools")  # @btime @test

In [6]:
using BenchmarkTools

n = 2^20                # 2^20 = 1,048,576 (named n to avoid shadowing Base.size)
#x = fill(1.0f0, n)     # a vector filled with 1.0 (Float32)
x = fill(2.0, n)        # 1048576-element Vector{Float64}
y = fill(4.0, n)        # 1048576-element Vector{Float64}

# DOT ADD two arrays
# https://docs.julialang.org/en/v1/manual/functions/#man-vectorized

y .+= x                 # add each element of x to each element of y

# macBook M2 MAX (12 cores) 1.149 ms (2 allocations: 64 bytes)
y[1] # show first element

using Test
@test all(y .== 6.0)

Test Passed

## Single thread sequential add

In [7]:
function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

fill!(y, 4.0) # re-set the y array
sequential_add!(y, x)
@test all(y .== 6.0f0)

Test Passed

### Parallelization using Threads

In [8]:
function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

fill!(y, 4.0) # re-set the y array
parallel_add!(y, x)
#@test all(y .== 6.0f0)
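
To make the work split visible, the following sketch records which thread handled each iteration of a small `Threads.@threads` loop. It is only an illustration: the names `owner` and `n_iter` are arbitrary, and the exact assignment depends on the Julia version and its scheduler.

```julia
using Base.Threads

n_iter = 24
owner = zeros(Int, n_iter)       # owner[i] = id of the thread that ran iteration i

@threads for i in 1:n_iter
    owner[i] = threadid()        # each iteration writes only its own slot, so there is no data race
end

println("threads in session:    ", nthreads())
println("iteration owners:      ", owner)
println("distinct threads used: ", length(unique(owner)))
```

When only one distinct thread shows up, the loop ran without any parallelism, which is a quick way to notice that the thread setting did not take effect.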

In [9]:
function parallel_multiply!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] *= x[i]
    end
    return nothing
end

fill!(y, 4.0) # re-set the y array
parallel_multiply!(y, x)
@test all(y .== 8.0f0)

Test Passed

In [10]:
function parallel_division!(y, x)
    Threads.@threads for i in eachindex(y, x)
        # left division: y[i] = y[i] \ x[i], i.e. x[i] / y[i]
        @inbounds y[i] \= x[i]
    end
    return nothing
end

fill!(y, 4.0) # re-set the y array
parallel_division!(y, x)
y[1]
@test all(y .== 0.5f0)

Test Passed

In [18]:
using BenchmarkTools
fill!(y, 4.0) # re-set the y array
@btime sequential_add!($y, $x)

# fast to slow (best of 3 runs):
# 237.375 μs (0 allocations: 0 bytes)  -- 2023 MacBook Pro M2 Max 14 inch - 7 threads
# 243.375 μs (0 allocations: 0 bytes)  -- 2023 MacBook Pro M2 Max 14 inch - 8 threads
# 243.584 μs (0 allocations: 0 bytes)  -- 2023 MacBook Pro M2 Max 14 inch - 6 threads
# 1.162 ms (0 allocations: 0 bytes)  -- 2023 MacBook Pro M2 Max 14 inch - 10 threads
# 1.162 ms (0 allocations: 0 bytes)  -- 2023 MacBook Pro M2 Max 14 inch - 12 threads

# 1.220 ms (0 allocations: 0 bytes)  -- MacBook Pro M2 Max 14 inch - 1 thread
# 1.477 ms (0 allocations: 0 bytes)  -- Predator Helios 300 Win 10
# 1.853 ms (0 allocations: 0 bytes)  -- MacBook Pro Intel

  243.750 μs (0 allocations: 0 bytes)


In [21]:
fill!(y, 4.0) # re-set the y array
@btime parallel_add!($y, $x)

# fast to slow (best of 3 runs):

# 43.708 μs (48 allocations: 4.03 KiB)   -- 2023 MacBook Pro M2 Max 14 inch -- Threads.nthreads() 8
# 52.000 μs (41 allocations: 3.52 KiB)   -- 2023 MacBook Pro M2 Max 14 inch -- Threads.nthreads() 7
# 216.292 μs (36 allocations: 3.03 KiB)   -- 2023 MacBook Pro M2 Max 14 inch -- Threads.nthreads() 6

# 58.459 μs (36 allocations: 3.03 KiB)   -- Predator Helios 300 Win 10 -- Threads.nthreads() 6
# 660.600 μs (23 allocations: 2.72 KiB)   -- MacBook Pro -- Threads.nthreads() 4

# 1.235 ms (7 allocations: 592 bytes)     -- MacBook Pro M2 Max 14 inch -- Threads.nthreads() 1
# 1.482 ms (6 allocations: 944 bytes)     -- Predator Helios 300 Win 10 -- Threads.nthreads() 1

# FAIL TO FINISH                          -- MacBook Pro M2 Max 14 inch -- Threads.nthreads() 10
# FAIL TO FINISH                          -- MacBook Pro M2 Max 14 inch -- Threads.nthreads() 12

  59.333 μs (37 allocations: 3.06 KiB)
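
The timings listed above can be turned into speedup and parallel efficiency figures with simple arithmetic: speedup is the 1-thread time divided by the n-thread time, and efficiency is the speedup divided by n. The sketch below applies this to the M2 Max `parallel_add!` numbers recorded in the comments; the variable names are illustrative.

```julia
# parallel_add! timings for the MacBook Pro M2 Max, in microseconds,
# copied from the benchmark comments (threads => time)
timings_us = [1 => 1235.0, 6 => 216.292, 7 => 52.000, 8 => 43.708]

baseline = last(first(timings_us))   # time of the 1-thread run

for (threads, t) in timings_us
    speedup = baseline / t
    efficiency = speedup / threads
    println(rpad("threads = $threads", 14),
            "speedup = ", round(speedup, digits = 2),
            "   efficiency = ", round(efficiency, digits = 2))
end
```

Efficiency values above 1 would point to effects beyond the thread count itself (for example caching or run-to-run variation), so figures like these are best treated as rough indicators.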


In [27]:
fill!(y, 4.0) # re-set the y array
@time parallel_multiply!(y, x)

# fast to slow (best of 3 runs):
# 0.000202 seconds (38 allocations: 3.094 KiB) -- 2023 MacBook Pro M2 Max 14 inch -- Threads.nthreads() 6
# 0.000209 seconds (44 allocations: 3.609 KiB) -- 2023 MacBook Pro M2 Max 14 inch -- Threads.nthreads() 7
# 0.000242 seconds (51 allocations: 4.125 KiB) -- 2023 MacBook Pro M2 Max 14 inch -- Threads.nthreads() 8

  0.000202 seconds (38 allocations: 3.094 KiB)
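
Because a single `@time` call is noisy, the numbers above are reported as the best of 3 runs. One way to automate that with only the standard library is sketched below. `best_of` is a hypothetical helper (not part of Julia or BenchmarkTools), and the multiply function is repeated here only to keep the example self-contained.

```julia
# Hypothetical helper: call setup() then time f(), `runs` times; return the shortest time in seconds
function best_of(f, setup, runs)
    best = Inf
    for _ in 1:runs
        setup()                      # reset state outside the timed region
        elapsed = @elapsed f()
        best = min(best, elapsed)
    end
    return best
end

# Same definition as in the notebook
function parallel_multiply!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] *= x[i]
    end
    return nothing
end

x = fill(2.0, 2^20)
y = fill(4.0, 2^20)

parallel_multiply!(y, x)             # warm-up call so compilation is not timed

t = best_of(() -> parallel_multiply!(y, x), () -> fill!(y, 4.0), 3)
println("best of 3: ", t, " seconds")
```

Resetting `y` inside `setup` keeps every run starting from the same values, so the result can still be checked after timing.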


In [33]:
fill!(y, 4.0) # re-set the y array
@time parallel_add!(y, x)


# fast to slow (best of 3 runs):
# 0.000212 seconds (44 allocations: 3.609 KiB)  -- 2023 MacBook Pro M2 Max 14 inch -- Threads.nthreads() 7
# 0.000222 seconds (38 allocations: 3.094 KiB)  -- 2023 MacBook Pro M2 Max 14 inch -- Threads.nthreads() 6
# 0.000298 seconds (51 allocations: 4.125 KiB)  -- 2023 MacBook Pro M2 Max 14 inch -- Threads.nthreads() 8

  0.000228 seconds (37 allocations: 3.062 KiB)


In [39]:
fill!(y, 4.0) # re-set the y array
@time sequential_add!(y, x)


# fast to slow (best of 3 runs):
# 0.000278 seconds  -- 2023 MacBook Pro M2 Max 14 inch -- Threads.nthreads() 7
# 0.000295 seconds  -- 2023 MacBook Pro M2 Max 14 inch -- Threads.nthreads() 8
# 0.000346 seconds  -- 2023 MacBook Pro M2 Max 14 inch -- Threads.nthreads() 6

  0.000358 seconds


In [42]:
fill!(y, 4.0) # re-set the y array
@time parallel_division!(y, x)



# fast to slow (best of 3 runs):
# 0.000200 seconds (37 allocations: 3.062 KiB) -- 2023 MacBook Pro M2 Max 14 inch -- Threads.nthreads() 6
# 0.000203 seconds (53 allocations: 4.156 KiB) -- 2023 MacBook Pro M2 Max 14 inch -- Threads.nthreads() 8
# 0.000218 seconds (46 allocations: 3.672 KiB) -- 2023 MacBook Pro M2 Max 14 inch -- Threads.nthreads() 7

  0.000411 seconds (35 allocations: 3.000 KiB)
