# Julia GPU Support

- https://juliagpu.gitlab.io/CUDA.jl/

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Julia-GPU-Support" data-toc-modified-id="Julia-GPU-Support-1">Julia GPU Support</a></span></li><li><span><a href="#Julia-Threads-(JULIA_NUM_THREADS)" data-toc-modified-id="Julia-Threads-(JULIA_NUM_THREADS)-2">Julia Threads (JULIA_NUM_THREADS)</a></span><ul class="toc-item"><li><span><a href="#MacOs" data-toc-modified-id="MacOs-2.1">MacOs</a></span></li><li><span><a href="#Windows-10" data-toc-modified-id="Windows-10-2.2">Windows 10</a></span><ul class="toc-item"><li><span><a href="#Command-Prompt" data-toc-modified-id="Command-Prompt-2.2.1">Command Prompt</a></span></li></ul></li></ul></li><li><span><a href="#CUDA-Package" data-toc-modified-id="CUDA-Package-3">CUDA Package</a></span><ul class="toc-item"><li><span><a href="#Parallelization-using-CPU" data-toc-modified-id="Parallelization-using-CPU-3.1">Parallelization using CPU</a></span><ul class="toc-item"><li><span><a href="#Parallelization-using-Threads" data-toc-modified-id="Parallelization-using-Threads-3.1.1">Parallelization using Threads</a></span></li></ul></li></ul></li></ul></div>

# Julia Threads (JULIA_NUM_THREADS)

## MacOs

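Set the environment variable:
- nano ~/.bash_profile     # edit the shell profile
- export JULIA_NUM_THREADS=4 # add this line
- source  ~/.bash_profile  # reload the profile in the current terminal
- REPOS $ jupyter lab # restart JupyterLab from that same terminal

On macOS Catalina and later the default shell is zsh, which reads ~/.zshrc rather than ~/.bash_profile.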
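After restarting the notebook, a quick check along these lines can show whether the variable actually reached the Julia process. This is only a minimal sketch: the helper name `thread_report` is hypothetical, and it reads the environment without changing the thread count.

```julia
# Hypothetical helper: compare the requested thread count with the actual one.
function thread_report(env = ENV)
    requested = get(env, "JULIA_NUM_THREADS", "unset")  # value seen by this process
    actual = Threads.nthreads()                         # threads Julia started with
    return (requested = requested, actual = actual)
end

report = thread_report()
println("JULIA_NUM_THREADS  = ", report.requested)
println("Threads.nthreads() = ", report.actual)
```

If the first line prints "unset", the notebook server was started from a process that never saw the export.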

## Windows 10

### Command Prompt

- C:\Users\UkiDL>set JULIA_NUM_THREADS=6  # current cmd window only
- C:\Users\UkiDL>setx JULIA_NUM_THREADS 6  # set permanently; applies to newly opened cmd windows
- C:\Users\UkiDL>echo %JULIA_NUM_THREADS%
- 6

Unfortunately, this does not work when Jupyter is started from the Anaconda Prompt: there, Threads.nthreads() still returns 1.

In [4]:
# MacBook Pro Intel CPU: 4
# MacBook Pro 14-inch M2 Pro CPU: 10
# MacBook Pro 16-inch M2 Pro CPU: 12
# MacBook Pro M2 Max CPU : 12
# The main advantage of M2 Max over M2 Pro is the GPU

# Win 10 Anaconda Prompt: 1
# 2023-07-06 MacBook Pro M2 Max (JupyterLab): 1

Threads.nthreads()

1
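When the environment variable does not reach the kernel (as in the runs above that return 1), one possible workaround is to register a separate Jupyter kernel that carries the setting itself. The sketch below assumes the IJulia package is installed in the active environment; the kernel name "Julia (6 threads)" is just an illustrative label, and 6 is the thread count used in the Windows example.

```julia
# Assumption: IJulia is already installed in the active Julia environment.
using IJulia

# Register an additional kernel whose Julia process is started with
# JULIA_NUM_THREADS=6, regardless of how the Jupyter server was launched.
installkernel("Julia (6 threads)", env = Dict("JULIA_NUM_THREADS" => "6"))
```

After this runs once, the new kernel should appear in the JupyterLab kernel list; selecting it and re-running Threads.nthreads() shows whether the setting took effect.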

In [5]:
N = 2^20                # 1,048,576 elements (avoids shadowing Base.size)
#x = fill(1.0f0, N)     # a vector filled with 1.0 (Float32)
x = fill(1.0, N)        # 1048576-element Vector{Float64}
y = fill(2, N)          # 1048576-element Vector{Int64}

y .+= x                 # add x to y element-wise, in place

using Test
@test all(y .== 3.0)

[32m[1mTest Passed[22m[39m

## Single-threaded sequential add

In [6]:
function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

fill!(y, 2)
sequential_add!(y, x)
@test all(y .== 3.0f0)

[32m[1mTest Passed[22m[39m

### Parallelization using Threads

In [7]:
function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

fill!(y, 2)
parallel_add!(y, x)
@test all(y .== 3.0f0)

[32m[1mTest Passed[22m[39m

In [8]:
using BenchmarkTools
@btime sequential_add!($y, $x)

# 1.853 ms (0 allocations: 0 bytes) -- MacBook Pro Intel
# 1.477 ms (0 allocations: 0 bytes) -- Predator Helios 300 Win 10
# 1.220 ms (0 allocations: 0 bytes) -- MacBook Pro M2 Max 14 inch

  1.223 ms (0 allocations: 0 bytes)


In [10]:
@btime parallel_add!($y, $x)
# 390.601 μs (31 allocations: 5.09 KiB) -- Predator Helios 300 Win 10 -- Threads.nthreads() 6
# 660.600 μs (23 allocations: 2.72 KiB) -- MacBook Pro -- Threads.nthreads() 4
# 1.482 ms (6 allocations: 944 bytes)  -- Predator Helios 300 Win 10 -- Threads.nthreads() 1
# 1.235 ms (7 allocations: 592 bytes) -- MacBook Pro M2 Max 14 inch -- Threads.nthreads() 1

  1.239 ms (7 allocations: 592 bytes)
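The timings recorded in the comments above can be turned into speedup and parallel-efficiency figures. The sketch below uses only those recorded numbers; the helper names `speedup` and `efficiency` are hypothetical, and pairing the 4-thread "MacBook Pro" run with the Intel sequential timing is an assumption.

```julia
speedup(t_seq, t_par) = t_seq / t_par
efficiency(t_seq, t_par, nthreads) = speedup(t_seq, t_par) / nthreads

# Timings in microseconds, copied from the benchmark comments.
helios = (seq = 1477.0, par = 390.601, threads = 6)
# Assumption: the 4-thread "MacBook Pro" run is the Intel machine.
macbook = (seq = 1853.0, par = 660.600, threads = 4)

for (name, m) in (("Predator Helios 300", helios), ("MacBook Pro Intel", macbook))
    s = speedup(m.seq, m.par)
    eff = efficiency(m.seq, m.par, m.threads)
    println(name, ": speedup ", round(s, digits = 2),
            "x, efficiency ", round(100 * eff, digits = 1), "%")
end
```

Both machines scale well below linearly, which is plausible for a loop that does very little arithmetic per element and is likely limited by memory bandwidth. With a single thread, the parallel version takes about as long as the sequential one, as the 1-thread rows show.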
