## Reduction operation: the sum of the numbers in the range [0, value)

In [5]:
import numpy as np

def reduc_operation(A):
    """Compute the sum of the elements of Array A in the range [0, value)."""
    s = 0
    for i in range(A.size):
        s += A[i]
    return s

# Sequential

value = 5*10**4

X = np.random.rand(value)

# To print the first values of the array

# print(X[0:12])

# Using IPython magic commands

tiempo = %timeit -r 2 -o -q reduc_operation(X)

print("Time taken by reduction operation using a function:", tiempo)


print(f"And the result of the sum of numbers in the range [0, value) is: {reduc_operation(X)}\n")


# Using numpy.sum()

tiempo = %timeit -r 2 -o -q np.sum(X)

print("Time taken by reduction operation using numpy.sum():", tiempo)

print("Now, the result using numpy.sum():", np.sum(X),"\n ")


# Using numpy.ndarray.sum()

tiempo = %timeit -r 2 -o -q X.sum()

print("Time taken by reduction operation using numpy.ndarray.sum():", tiempo)

print("Now, the result using numpy.ndarray.sum():", X.sum())




Time taken by reduction operation using a function: 5.27 ms ± 202 µs per loop (mean ± std. dev. of 2 runs, 100 loops each)
And the result of the sum of numbers in the range [0, value) is: 25010.449831532787

Time taken by reduction operation using numpy.sum(): 14.9 µs ± 2.99 ns per loop (mean ± std. dev. of 2 runs, 100,000 loops each)
Now, the result using numpy.sum(): 25010.449831532696 
 
Time taken by reduction operation using numpy.ndarray.sum(): 12.8 µs ± 19.8 ns per loop (mean ± std. dev. of 2 runs, 100,000 loops each)
Now, the result using numpy.ndarray.sum(): 25010.449831532696
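
The hand-written loop and the NumPy calls agree to about 14 significant digits and then diverge (…532787 versus …532696). This is expected rather than a bug: floating-point addition is not associative, and NumPy's `sum` typically uses pairwise summation on contiguous data, whereas the loop accumulates strictly left to right. The following is a minimal pure-Python sketch of the effect. `sequential_sum` and `pairwise_sum` are illustrative helpers, not NumPy's actual implementation, and `math.fsum` serves as the correctly rounded reference.

```python
import math
import random

def sequential_sum(values):
    """Left-to-right accumulation, like the loop in reduc_operation."""
    total = 0.0
    for v in values:
        total += v
    return total

def pairwise_sum(values, block=8):
    """Split the input in halves recursively and add the two partial sums."""
    n = len(values)
    if n <= block:
        return sequential_sum(values)
    mid = n // 2
    return pairwise_sum(values[:mid], block) + pairwise_sum(values[mid:], block)

random.seed(0)  # fixed seed so the run is repeatable
data = [random.random() for _ in range(50_000)]

reference = math.fsum(data)  # correctly rounded sum of the same values
print("sequential error:", abs(sequential_sum(data) - reference))
print("pairwise error:  ", abs(pairwise_sum(data) - reference))
```

Both strategies are valid; they just round at different points, so small differences in the last digits should not be read as one method being wrong.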


3.3 a) CUPY LIBRARY

In [6]:
import cupy as cp  # Import CuPy

def reduc_operation(A):
    """Compute the sum of the elements of Array A in the range [0, value)."""
    s = 0
    for i in range(A.size):
        s += A[i]
    return s

# Initial setup
value = 5 * 10**4  # Array size

# Create the array directly on the GPU using cupy.random.rand
X_gpu = cp.random.rand(value)

# To print the first values of the array (copying them to the CPU with cp.asnumpy)
# print(cp.asnumpy(X_gpu[:12]))  # Converts to a NumPy array for printing

# Reduction on the GPU with cupy.sum()
tiempo = %timeit -r 2 -o -q cp.sum(X_gpu)

print("Time taken by reduction operation using cupy.sum():", tiempo)
print("Result of the sum using cupy.sum():", cp.sum(X_gpu), "\n")

# Alternatively, use the .sum() method of the CuPy array
tiempo = %timeit -r 2 -o -q X_gpu.sum()

print("Time taken by reduction operation using cupy.ndarray.sum():", tiempo)
print("Result of the sum using cupy.ndarray.sum():", X_gpu.sum())


Time taken by reduction operation using cupy.sum(): 17.9 µs ± 51.6 ns per loop (mean ± std. dev. of 2 runs, 100,000 loops each)
Result of the sum using cupy.sum(): 25107.20226476049 

Time taken by reduction operation using cupy.ndarray.sum(): 17 µs ± 51.1 ns per loop (mean ± std. dev. of 2 runs, 100,000 loops each)
Result of the sum using cupy.ndarray.sum(): 25107.20226476049
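
One caveat for GPU timings like the ones above: CuPy launches kernels asynchronously, so a call such as `cp.sum(X_gpu)` can return before the device has finished, and a timer around the bare call may mostly capture launch overhead. Below is a minimal sketch of a timing helper that waits for a synchronization callback before stopping the clock. `time_call` is a hypothetical helper introduced here, not part of CuPy; the CuPy usage appears only in comments and assumes a CUDA device, with `cp.cuda.Stream.null.synchronize` as the sync function.

```python
import time

def time_call(fn, sync=None, repeat=5):
    """Best wall-clock time in seconds for fn(), calling sync() before and after."""
    best = float("inf")
    for _ in range(repeat):
        if sync is not None:
            sync()                      # drain any pending work before starting
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()                      # wait until the work has really finished
        best = min(best, time.perf_counter() - start)
    return best

# CPU example: no synchronization needed.
data = [0.5] * 50_000
cpu_seconds = time_call(lambda: sum(data))
print(f"CPU sum: {cpu_seconds * 1e6:.1f} µs")

# With CuPy (not executed here; assumes a CUDA device is available):
#   import cupy as cp
#   X_gpu = cp.random.rand(5 * 10**4)
#   gpu_seconds = time_call(lambda: cp.sum(X_gpu),
#                           sync=cp.cuda.Stream.null.synchronize)
```

Timings taken this way include the device execution time, so they can come out noticeably larger than the launch-only figures.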


3.3 b) NUMBA LIBRARY

In [7]:
import numpy as np
import numba
from numba import cuda

# Reduction function for the GPU using Numba
@cuda.jit
def reduc_operation_gpu(arr, result):
    """Compute the sum of elements in arr and store the result in result."""
    idx = cuda.grid(1)  # Unique global index of this thread

    if idx < arr.size:
        # Each thread atomically adds its own element to the accumulator
        cuda.atomic.add(result, 0, arr[idx])

# Problem size

value = 5 * 10**4

# Create the array on the CPU (it is copied to the GPU below)
X = np.random.rand(value)

# Transfer the NumPy array to the GPU
X_gpu = cuda.to_device(X)

# Create a one-element array on the GPU to hold the result
result_gpu = cuda.to_device(np.array([0.0]))  # Initialize the result to 0.0

# Launch configuration for the reduction kernel
threads_per_block = 256
blocks_per_grid = (X_gpu.size + (threads_per_block - 1)) // threads_per_block

# Launch the kernel
reduc_operation_gpu[blocks_per_grid, threads_per_block](X_gpu, result_gpu)

# Copy the result from the GPU back to the CPU
result = result_gpu.copy_to_host()[0]

print(f"Result of reduction using Numba (GPU): {result}\n")

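# Use %timeit to measure the execution time of the operation.
# Note: result_gpu is not reset between runs, so it keeps accumulating here;
# the sum printed above comes from the single launch before the timing.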
tiempo = %timeit -r 2 -o -q reduc_operation_gpu[blocks_per_grid, threads_per_block](X_gpu, result_gpu)
print(f"Time taken by reduction operation using Numba (GPU): {tiempo}")


Result of reduction using Numba (GPU): 25051.210227551637

Time taken by reduction operation using Numba (GPU): 84.8 µs ± 0.243 ns per loop (mean ± std. dev. of 2 runs, 10,000 loops each)
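
The launch configuration in the cell above rounds the number of blocks up so that every element gets a thread, which is why the kernel needs the `idx < arr.size` guard. The arithmetic can be checked with a plain-Python sketch that emulates the grid on the CPU. `blocks_needed` and `emulate_atomic_sum` are hypothetical stand-ins for the launch arithmetic and the kernel, and the emulation runs the "threads" one after another rather than in parallel.

```python
def blocks_needed(n, threads_per_block):
    """Ceiling division: smallest block count whose threads cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block

def emulate_atomic_sum(arr, threads_per_block=256):
    """CPU emulation of the kernel: one 'thread' per global index, guarded by size."""
    result = [0.0]  # one-element accumulator, like result_gpu
    blocks = blocks_needed(len(arr), threads_per_block)
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            # Same global index that cuda.grid(1) computes on the device.
            idx = block_idx * threads_per_block + thread_idx
            if idx < len(arr):         # threads past the end do nothing
                result[0] += arr[idx]  # stands in for cuda.atomic.add
    return result[0]

n = 5 * 10**4
blocks = blocks_needed(n, 256)
idle = blocks * 256 - n
print(blocks, "blocks,", idle, "idle threads in the last block")
print(emulate_atomic_sum([1.0] * n))
```

Because every thread updates the same accumulator, the atomic additions on the GPU are serialized, which is one plausible contributor to this kernel being slower than `cupy.sum()` in the timings above.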


3.3 c) Run it with 5 ∗ 10**6, 5 ∗ 10**7 and 5 ∗ 10**8 elements; output data.

Running with 5000000 elements
Time taken by reduction operation using a function: 2.61 ms ± 11.1 µs per loop (mean ± std. dev. of 2 runs, 100 loops each)
And the result of the sum of numbers in the range [0, value) is: 24970.76122879585

Time taken by reduction operation using numpy.sum(): 7.7 µs ± 0.53 ns per loop (mean ± std. dev. of 2 runs, 100,000 loops each)
Now, the result using numpy.sum(): 24970.761228796026 
 
Time taken by reduction operation using numpy.ndarray.sum(): 6.94 µs ± 4.63 ns per loop (mean ± std. dev. of 2 runs, 100,000 loops each)
Now, the result using numpy.ndarray.sum(): 24970.761228796026
Time taken by reduction operation using cupy.sum(): 25.8 µs ± 15.5 µs per loop (mean ± std. dev. of 2 runs, 1 loop each)
Result of the sum using cupy.sum(): 24996.494543374458 

Time taken by reduction operation using cupy.ndarray.sum(): 6.04 µs ± 4.54 ns per loop (mean ± std. dev. of 2 runs, 100,000 loops each)
Result of the sum using cupy.ndarray.sum(): 24996.494543374458
Result of reduction using Numba (GPU): 24911.578116681303

Time taken by reduction operation using Numba (GPU): 81.8 µs ± 8.71 ns per loop (mean ± std. dev. of 2 runs, 10,000 loops each)



Running with 50000000 elements
Time taken by reduction operation using a function: 2.62 ms ± 722 ns per loop (mean ± std. dev. of 2 runs, 100 loops each)
And the result of the sum of numbers in the range [0, value) is: 25078.57857486828

Time taken by reduction operation using numpy.sum(): 7.91 µs ± 0.695 ns per loop (mean ± std. dev. of 2 runs, 100,000 loops each)
Now, the result using numpy.sum(): 25078.57857486829 
 
Time taken by reduction operation using numpy.ndarray.sum(): 7.14 µs ± 11 ns per loop (mean ± std. dev. of 2 runs, 100,000 loops each)
Now, the result using numpy.ndarray.sum(): 25078.57857486829
Time taken by reduction operation using cupy.sum(): 6.2 µs ± 1.07 ns per loop (mean ± std. dev. of 2 runs, 100,000 loops each)
Result of the sum using cupy.sum(): 24943.07510200829 

Time taken by reduction operation using cupy.ndarray.sum(): 6.01 µs ± 1.72 ns per loop (mean ± std. dev. of 2 runs, 100,000 loops each)
Result of the sum using cupy.ndarray.sum(): 24943.07510200829
Result of reduction using Numba (GPU): 25060.09478794771

Time taken by reduction operation using Numba (GPU): 82.5 µs ± 0.826 ns per loop (mean ± std. dev. of 2 runs, 10,000 loops each)




Running with 500000000 elements
Time taken by reduction operation using a function: 2.62 ms ± 1.08 µs per loop (mean ± std. dev. of 2 runs, 100 loops each)
And the result of the sum of numbers in the range [0, value) is: 25020.25675150178

Time taken by reduction operation using numpy.sum(): 7.76 µs ± 0.0628 ns per loop (mean ± std. dev. of 2 runs, 100,000 loops each)
Now, the result using numpy.sum(): 25020.25675150158 
 
Time taken by reduction operation using numpy.ndarray.sum(): 6.97 µs ± 1.3 ns per loop (mean ± std. dev. of 2 runs, 100,000 loops each)
Now, the result using numpy.ndarray.sum(): 25020.25675150158
Time taken by reduction operation using cupy.sum(): 6.17 µs ± 1.92 ns per loop (mean ± std. dev. of 2 runs, 100,000 loops each)
Result of the sum using cupy.sum(): 25077.323703846654 

Time taken by reduction operation using cupy.ndarray.sum(): 5.94 µs ± 0.357 ns per loop (mean ± std. dev. of 2 runs, 100,000 loops each)
Result of the sum using cupy.ndarray.sum(): 25077.323703846654
Result of reduction using Numba (GPU): 25123.77668474764

Time taken by reduction operation using Numba (GPU): 82.5 µs ± 0.262 ns per loop (mean ± std. dev. of 2 runs, 10,000 loops each)
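
A quick way to check that runs like these used the intended array size: the sum of N independent uniform samples in [0, 1) has mean N/2 and standard deviation sqrt(N/12). The sketch below applies that to the sums reported above for the plain-Python function. `expected_uniform_sum` and `implied_size` are hypothetical helper names introduced for this check.

```python
import math

def expected_uniform_sum(n):
    """Mean and standard deviation of the sum of n independent U[0, 1) samples."""
    return n / 2, math.sqrt(n / 12)

def implied_size(total):
    """Rough array size implied by an observed sum (inverts mean = n / 2)."""
    return round(2 * total)

# Sums copied from the output above, keyed by the size in each run header.
reported = {
    5 * 10**6: 24970.76122879585,
    5 * 10**7: 25078.57857486828,
    5 * 10**8: 25020.25675150178,
}

for n, total in reported.items():
    mean, std = expected_uniform_sum(n)
    print(f"n={n}: expected {mean:.0f} ± {std:.0f}, "
          f"reported {total:.2f}, implied n ≈ {implied_size(total)}")
```

The reported sums near 25,000 correspond to roughly 5*10** 4 elements rather than the sizes in the run headers, and the timings are also nearly identical across the three runs. That suggests `value` may not have been updated for these runs, so the timings are best read with that in mind.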


3.3 d) For small or medium arrays (such as 5*10** 4 or 5*10** 6), the CPU with tools such as NumPy remains the fastest option, thanks to its low per-call overhead and the optimized compiled loops inside these tools.
For large arrays (beyond 5*10** 7 elements), a GPU with CuPy or Numba starts to show its real capacity thanks to massive parallelism. Data transfer and GPU initialization are costs to take into account, but they are quickly amortized on larger problems.
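
The trade-off described here can be illustrated with a toy cost model: the GPU pays a fixed overhead per call (kernel launch, plus transfer setup if the data starts on the host) but a lower cost per element. The figures below are made-up placeholders in picoseconds, not measurements from this notebook, and `cpu_time`, `gpu_time` and `break_even` are hypothetical helpers.

```python
def cpu_time(n, per_element):
    """Modelled CPU time: purely proportional to the number of elements."""
    return n * per_element

def gpu_time(n, fixed_overhead, per_element):
    """Modelled GPU time: a fixed cost per call plus a smaller per-element cost."""
    return fixed_overhead + n * per_element

def break_even(fixed_overhead, cpu_per_element, gpu_per_element):
    """Smallest n at which the GPU model is no slower than the CPU model, or None."""
    gain = cpu_per_element - gpu_per_element
    if gain <= 0:
        return None  # the GPU never catches up in this model
    return -(-fixed_overhead // gain)  # integer ceiling division

# Placeholder costs in picoseconds (assumptions, chosen only for illustration).
OVERHEAD_PS = 20_000_000   # 20 µs fixed cost per GPU call
CPU_PS = 250               # per element on the CPU
GPU_PS = 50                # per element on the GPU

n_star = break_even(OVERHEAD_PS, CPU_PS, GPU_PS)
print("break-even size in this model:", n_star)
for n in (5 * 10**4, 5 * 10**6):
    print(n, cpu_time(n, CPU_PS), gpu_time(n, OVERHEAD_PS, GPU_PS))
```

With real hardware the constants would have to be measured, but the shape of the argument is the same: below the break-even size the fixed overhead dominates, and above it the per-element advantage wins.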
