Code adapted from https://shephexd.github.io/development/2017/02/19/pycuda.html

If using Google Colab:
* Click on Runtime and select Change runtime type (in the French interface: Exécution, then Modifier le type d'exécution).
  Then select GPU under Hardware accelerator (Accélérateur matériel).
* Start your session by installing pycuda with the command:

  -> !pip install pycuda

In [3]:
!pip install pycuda

Collecting pycuda
  Downloading https://files.pythonhosted.org/packages/46/61/47d3235a4c13eec5a5f03594ddb268f4858734e02980afbcd806e6242fa5/pycuda-2020.1.tar.gz (1.6MB)
     |████████████████████████████████| 1.6MB 9.1MB/s
Collecting pytools>=2011.2
  Downloading https://files.pythonhosted.org/packages/b7/30/c9362a282ef89106768cba9d884f4b2e4f5dc6881d0c19b478d2a710b82b/pytools-2020.4.3.tar.gz (62kB)
     |████████████████████████████████| 71kB 8.2MB/s
Collecting appdirs>=1.4.0
  Downloading https://files.pythonhosted.org/packages/3b/00/2344469e2084fb287c2e0b57b72910309874c3245463acd6cf5e3db69324/appdirs-1.4.4-py2.py3-none-any.whl
Collecting mako
  Downloading https://files.pythonhosted.org/packages/a6/37/0e706200d22172eb8fa17d68a7ae22dec7631a0a92266634fb518a88a5b2/Mako-1.1.3-py2.py3-none-any.whl (75kB)
     |████████████████████████████████| 81kB 7.2MB/s
Building wheels for collected packages: pycuda, pytools
  Building wheel for pycuda (setup.py) ... 

In [4]:
import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import time

In [5]:
# -- initialize the device
import pycuda.autoinit

# get device information (device 0 is the first GPU)
MyDevice = driver.Device(0)
MyDevice.get_attributes()

{pycuda._driver.device_attribute.MAX_THREADS_PER_BLOCK: 1024,
 pycuda._driver.device_attribute.MAX_BLOCK_DIM_X: 1024,
 pycuda._driver.device_attribute.MAX_BLOCK_DIM_Y: 1024,
 pycuda._driver.device_attribute.MAX_BLOCK_DIM_Z: 64,
 pycuda._driver.device_attribute.MAX_GRID_DIM_X: 2147483647,
 pycuda._driver.device_attribute.MAX_GRID_DIM_Y: 65535,
 pycuda._driver.device_attribute.MAX_GRID_DIM_Z: 65535,
 pycuda._driver.device_attribute.MAX_SHARED_MEMORY_PER_BLOCK: 49152,
 pycuda._driver.device_attribute.TOTAL_CONSTANT_MEMORY: 65536,
 pycuda._driver.device_attribute.WARP_SIZE: 32,
 pycuda._driver.device_attribute.MAX_PITCH: 2147483647,
 pycuda._driver.device_attribute.MAX_REGISTERS_PER_BLOCK: 65536,
 pycuda._driver.device_attribute.CLOCK_RATE: 1590000,
 pycuda._driver.device_attribute.TEXTURE_ALIGNMENT: 512,
 pycuda._driver.device_attribute.GPU_OVERLAP: 1,
 pycuda._driver.device_attribute.MULTIPROCESSOR_COUNT: 40,
 pycuda._driver.device_attribute.KERNEL_EXEC_TIMEOUT: 0,
 pycuda._driver.device

In [6]:
#define the kernel
kernel_code_template = """
__global__ void MatrixMulKernel(float *a, float *b, float *c)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Pvalue accumulates the element of c
    // that is computed by this thread
    float Pvalue = 0;

    // Each thread reads one row of a and one column of b
    //   to produce one element of c.
    for (int k = 0; k < %(MATRIX_SIZE)s; ++k) {
        float Aelement = a[ty * %(MATRIX_SIZE)s + k];
        float Belement = b[k * %(MATRIX_SIZE)s + tx];
        Pvalue += Aelement * Belement;
    }

    // Write the result to device memory;
    // each thread writes one element of c
    c[ty * %(MATRIX_SIZE)s + tx] = Pvalue;
}
"""

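To make the indexing in the kernel easier to follow, here is a minimal pure-Python sketch of what a single thread does. It is only an emulation that runs on the CPU: the names `matrix_mul_kernel` and `launch_single_block` are hypothetical helpers, and the two nested loops stand in for the threads of the single block, which the GPU would run in parallel.

```python
def matrix_mul_kernel(a, b, c, tx, ty, n):
    """Stand-in for one GPU thread: computes c[ty][tx] from flat row-major arrays."""
    p_value = 0.0
    for k in range(n):
        # row ty of a times column tx of b
        p_value += a[ty * n + k] * b[k * n + tx]
    c[ty * n + tx] = p_value


def launch_single_block(a, b, n):
    """Emulate one block of n x n threads by looping over (tx, ty) sequentially."""
    c = [0.0] * (n * n)
    for ty in range(n):
        for tx in range(n):
            matrix_mul_kernel(a, b, c, tx, ty, n)
    return c


n = 2
a = [1.0, 2.0, 3.0, 4.0]  # [[1, 2], [3, 4]] stored row by row
b = [5.0, 6.0, 7.0, 8.0]  # [[5, 6], [7, 8]] stored row by row
c = launch_single_block(a, b, n)
print(c)  # [19.0, 22.0, 43.0, 50.0]
```

Each (tx, ty) pair writes exactly one element of `c`, which is why a block of MATRIX_SIZE x MATRIX_SIZE threads is enough to fill the whole result.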
In [30]:
# define the (square) matrix size
#  note that we only use *one* block of threads here;
#  as a consequence MATRIX_SIZE squared can't exceed MAX_THREADS_PER_BLOCK
#  (1024 on this device, so MATRIX_SIZE <= 32)
# -> use MyDevice.get_attributes() to get this information
MATRIX_SIZE = 32

# create two random square matrices
a_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
b_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
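The single-block limit on MATRIX_SIZE can be worked out from the device attributes. The sketch below hard-codes the MAX_THREADS_PER_BLOCK value printed earlier (1024); the helper name `max_single_block_matrix_size` is hypothetical, and on another GPU the value should be read from the device instead.

```python
import math

# Value reported by MyDevice.get_attributes() in this notebook (an assumption
# for any other device).
MAX_THREADS_PER_BLOCK = 1024


def max_single_block_matrix_size(max_threads_per_block):
    """Largest n such that an n x n block fits in one thread block."""
    return math.isqrt(max_threads_per_block)


limit = max_single_block_matrix_size(MAX_THREADS_PER_BLOCK)
print(limit)  # 32
```

With 1024 threads per block the largest square block is 32 x 32, which is the MATRIX_SIZE used here.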

In [31]:
# compute reference on the CPU to verify GPU computation
time_start = time.time()
c_cpu = np.dot(a_cpu, b_cpu)
time_end = time.time()
timeCPU = time_end - time_start
print('elapsed time (CPU):', timeCPU, ' seconds')

elapsed time (CPU): 0.00013875961303710938  seconds
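A single measurement of about 0.1 ms is close to the timer's noise level. As a rough sketch of a more stable measurement, the helper below (a hypothetical `best_of`, not part of the original notebook) repeats the call and keeps the best average, using `time.perf_counter`. The workload is a placeholder; in the notebook it would be the `np.dot` call.

```python
import time


def best_of(fn, repeats=5, inner=100):
    """Return the best average time per call of fn over several repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(inner):
            fn()
        elapsed = (time.perf_counter() - start) / inner
        best = min(best, elapsed)
    return best


def small_workload():
    # Placeholder standing in for np.dot(a_cpu, b_cpu).
    return sum(i * i for i in range(1000))


t = best_of(small_workload)
print(f"best per-call time: {t:.3e} s")
```

Taking the minimum over repeats reduces the influence of one-off delays such as cache warm-up or other processes.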


In [32]:
# transfer host (CPU) memory to device (GPU) memory
a_gpu = gpuarray.to_gpu(a_cpu)
b_gpu = gpuarray.to_gpu(b_cpu)

# create empty gpu array for the result (C = A * B)
c_gpu = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)

# get the kernel code from the template
# by specifying the constant MATRIX_SIZE
kernel_code = kernel_code_template % {
    'MATRIX_SIZE': MATRIX_SIZE
    }

# compile the kernel code
mod = compiler.SourceModule(kernel_code)

# get the kernel function from the compiled module
matrixmul = mod.get_function("MatrixMulKernel")

# call the kernel on the card
time_start=time.time()

matrixmul(
    # inputs
    a_gpu, b_gpu,
    # output
    c_gpu,
    # (only one) block of MATRIX_SIZE x MATRIX_SIZE threads
    block = (MATRIX_SIZE, MATRIX_SIZE, 1),
    )

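# note: kernel launches are asynchronous; with no synchronization before
# this point, the timer may measure only the launch, not the whole computation
time_end=time.time()
timeGPU = time_end-time_start
print('elapsed time (GPU):',timeGPU,' seconds')

elapsed time (GPU): 0.0001888275146484375  seconds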
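The `%` step that fills in the kernel template is ordinary Python string formatting. The shortened template below is a hypothetical excerpt, used only to show what the compiler receives once MATRIX_SIZE has been substituted.

```python
# Shortened, hypothetical excerpt of the kernel template.
template = """
for (int k = 0; k < %(MATRIX_SIZE)s; ++k) {
    float Aelement = a[ty * %(MATRIX_SIZE)s + k];
}
"""

# Every %(MATRIX_SIZE)s placeholder is replaced by the value in the dict.
source = template % {"MATRIX_SIZE": 32}
print(source)
```

After substitution the size appears as a literal constant in the CUDA source, so changing MATRIX_SIZE means generating and compiling the kernel again.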


In [33]:
# print the results
def display(verbose=False):

  print("Matrix size: ", MATRIX_SIZE)
  # print the matrices
  if verbose:
    print("-" * 80)
    print("Matrix A (GPU):")
    print(a_gpu.get())

    print("-" * 80)
    print("Matrix B (GPU):")
    print(b_gpu.get())

    print("-" * 80)
    print("Matrix C (GPU):")
    print(c_gpu.get())

    print("-" * 80)

  # print the difference between the CPU and GPU results
  print("CPU-GPU difference:")
  norm = np.linalg.norm(c_cpu - c_gpu.get())
  if norm != 0:
    print(c_cpu - c_gpu.get())
  else:
    print(norm)

  # print the ratio of the two timings
  print()
  print("CPU/GPU time ratio: ", round(timeCPU/timeGPU,3))

display()

Matrix size:  32
CPU-GPU difference:
0.0

CPU/GPU time ratio:  0.735
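The check in `display` treats any nonzero norm as a difference. With float32 arithmetic, CPU and GPU results can differ in the last few bits, so a tolerance-based comparison is often more practical. The sketch below uses plain lists and `math.isclose`; the helper names and the sample values are made up for illustration. With NumPy arrays, `np.allclose` plays the same role.

```python
import math


def max_abs_difference(x, y):
    """Largest element-wise absolute difference between two flat sequences."""
    return max(abs(p - q) for p, q in zip(x, y))


def all_close(x, y, rel_tol=1e-5, abs_tol=1e-6):
    """True if every pair of elements agrees within the given tolerances."""
    return all(
        math.isclose(p, q, rel_tol=rel_tol, abs_tol=abs_tol)
        for p, q in zip(x, y)
    )


# Made-up values: the second list mimics small rounding differences.
cpu = [19.0, 22.0, 43.0, 50.0]
gpu = [19.0, 22.000002, 43.0, 49.99999]

print(all_close(cpu, gpu))
print(max_abs_difference(cpu, gpu))
```

The tolerances are assumptions chosen for float32-sized errors; they should be adjusted to the magnitude of the matrix entries.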


#QUESTION 1: Make sure you understand each part of the code

Done

#QUESTION 2: Compare the time needed for the multiplication on the CPU and on the GPU for MATRIX_SIZE = 8, 16, 32. What do you think?

The CPU computation is faster. This is probably because, with matrices this small, more time is lost moving data from one memory area to another than is gained from parallelization. Note that the timed region covers only the kernel call, so the fixed cost of launching the kernel is part of this overhead too.

#QUESTION 3: By what simple method could you make GPU parallelization competitive with the CPU method (i.e. with numpy)?

No idea
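This is not the expected answer, only a possible lead: the fixed GPU overhead matters less as the matrices grow, but matrices larger than 32 x 32 no longer fit in a single block, so the kernel would need a grid of blocks. The sketch below shows, in pure Python, how such a grid is usually sized with a ceiling division and how each thread maps to a (row, col) cell. The helper names are hypothetical, and the kernel itself would have to be adapted to use blockIdx and blockDim.

```python
def grid_for(n, block_size):
    """Number of blocks per dimension needed to cover an n x n matrix."""
    blocks = (n + block_size - 1) // block_size  # ceiling division
    return (blocks, blocks)


def covered_cells(n, block_size):
    """Emulate the thread-to-cell mapping and collect the cells that get written."""
    grid_x, grid_y = grid_for(n, block_size)
    seen = set()
    for by in range(grid_y):
        for bx in range(grid_x):
            for ty in range(block_size):
                for tx in range(block_size):
                    row = by * block_size + ty  # blockIdx.y * blockDim.y + threadIdx.y
                    col = bx * block_size + tx  # blockIdx.x * blockDim.x + threadIdx.x
                    if row < n and col < n:     # guard for threads past the edge
                        seen.add((row, col))
    return seen


n, block_size = 100, 16
grid = grid_for(n, block_size)
cells = covered_cells(n, block_size)
print(grid, len(cells))  # (7, 7) 10000
```

The bounds check matters because 7 x 16 = 112 threads per dimension are launched for only 100 rows and columns.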