# PyCUDA installation

In [1]:
!pip install pycuda

Collecting pycuda
  Downloading pycuda-2025.1.2.tar.gz (1.7 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pytools>=2011.2 (from pycuda)
  Downloading pytools-2025.2.5-py3-none-any.whl.metadata (2.9 kB)
Collecting siphash24>=1.6 (from pytools>=2011.2->pycuda)
  Downloading siphash24-1.8-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (3.2 kB)
Downloading pytools-2025.2.5-py3-none-any.whl (98 kB)



---



# Version #3: using ```gpuarray```


The initial portion of the code is the same as in Versions #1 and #2.

In [2]:
import numpy as np

# --- PyCUDA initialization
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit

########
# MAIN #
########

start = cuda.Event()
end   = cuda.Event()

N = 100000

h_a = np.random.randn(N).astype(np.float32)
h_b = np.random.randn(N).astype(np.float32)

h_c = np.empty_like(h_a)

This version uses the ```gpuarray``` class. ```gpuarray.to_gpu()``` allocates device memory and copies a host array into it, the sum of the two arrays is written simply as ```d_c = (d_a + d_b)```, and the ```.get()``` method copies the result back to the host. ```d_c``` needs no explicit declaration: it is allocated automatically when ```d_c = (d_a + d_b)``` executes. Internally, the ```gpuarray``` class compiles an elementwise kernel for arithmetic operations, so expressions such as ```d_c = (d_a + d_b)``` run directly on the GPU without any hand-written kernel code. Because that kernel is compiled the first time the operation is used, the code below runs the sum once as a warmup before timing it.


In [3]:
d_a = gpuarray.to_gpu(h_a)
d_b = gpuarray.to_gpu(h_b)

# --- Warmup execution
d_c = (d_a + d_b)

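# --- Timed execution: wait for pending GPU work, then bracket the sum with events
cuda.Context.synchronize()
start.record()
d_c = d_a + d_b
end.record()
end.synchronize()
# --- time_till() returns milliseconds; convert to seconds
secs = start.time_till(end) * 1e-3
print("Processing time = %fs" % (secs))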

h_c = d_c.get()

Processing time = 0.000450s
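
To put the measured time in context, it can be converted into an approximate effective memory throughput. The sketch below is only a rough estimate: the helper name `effective_bandwidth_gbs` is hypothetical, and it assumes that the sum reads two `float32` arrays and writes one (three arrays of `N` elements, 4 bytes each) and that the measured interval covers only that traffic.

```python
def effective_bandwidth_gbs(n_elements, elapsed_ms,
                            bytes_per_element=4, arrays_touched=3):
    """Rough effective throughput in GB/s for an elementwise operation.

    Assumption: the operation reads two arrays and writes one
    (arrays_touched=3), each holding n_elements float32 values.
    """
    elapsed_s = elapsed_ms * 1e-3          # event timings are in milliseconds
    total_bytes = n_elements * bytes_per_element * arrays_touched
    return total_bytes / elapsed_s / 1e9   # bytes per second -> GB/s


# Values taken from the run above: N = 100000, 0.000450 s = 0.450 ms
bandwidth = effective_bandwidth_gbs(100000, 0.450)
print("Effective bandwidth = %.2f GB/s" % bandwidth)
```

For an array this small, the measured time is likely dominated by fixed overheads such as the kernel launch and the allocation of ```d_c```, so the figure should not be read as the memory bandwidth of the device.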


This last part is the same as for the previous versions.

In [4]:
if np.array_equal(h_c, h_a + h_b):
  print("Test passed!")
else:
  print("Error!")

cuda.Context.synchronize()

Test passed!
