In [1]:
%load_ext lab_black

In [2]:
import numpy as np
import cupy as cp

In [3]:
def dot(x, y):
    dot_product = 0
    for x_i, y_i in zip(x, y):
        dot_product += x_i * y_i
    return dot_product

In [4]:
ARRAY_SIZE = 10 ** 7

In [5]:
rng = np.random.default_rng(1234)

x = np.array(rng.random(ARRAY_SIZE), dtype=np.float32)
y = np.array(rng.random(ARRAY_SIZE), dtype=np.float32)

## Timing with Time Magic

Jupyter can time a single line of code with the `%time` magic.

In [7]:
%time dot(x, y)

CPU times: user 5.38 s, sys: 10.7 ms, total: 5.39 s
Wall time: 5.43 s


2500685.6167771975

Alternatively, Jupyter can time an entire cell with the `%%time` magic.

In [8]:
%%time
dot(x, y)

CPU times: user 5.28 s, sys: 1.4 ms, total: 5.28 s
Wall time: 5.3 s


2500685.6167771975

Another option is the `%%timeit` cell magic. By default, it performs seven runs. In each run, the cell is executed one or more times, depending on how long a single execution takes: the faster the cell, the more executions (loops) per run. Finally, it reports the mean and standard deviation of the per-loop wall time across the runs.

In [9]:
%%timeit
dot(x, y)

5.41 s ± 175 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
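
The runs-and-loops scheme described above can be approximated outside Jupyter with the standard library's `timeit.repeat`. This is only a minimal sketch, not IPython's actual implementation: the input lists `x_small` and `y_small`, the fixed `LOOPS` value, and the use of the population standard deviation are all assumptions I make for illustration.

```python
import statistics
import timeit


def dot(x, y):
    """Pure-Python dot product, same logic as the notebook's dot()."""
    dot_product = 0.0
    for x_i, y_i in zip(x, y):
        dot_product += x_i * y_i
    return dot_product


# Hypothetical small inputs so the sketch finishes quickly.
x_small = [0.5] * 10_000
y_small = [2.0] * 10_000

RUNS = 7    # number of runs (the -r option of %%timeit)
LOOPS = 10  # executions per run (the -n option); fixed here by assumption

# timeit.repeat returns one total wall time per run, each covering LOOPS calls.
run_totals = timeit.repeat(lambda: dot(x_small, y_small), repeat=RUNS, number=LOOPS)

# Convert each run's total into a per-loop time before summarizing.
per_loop = [total / LOOPS for total in run_totals]
mean_time = statistics.mean(per_loop)
std_time = statistics.pstdev(per_loop)  # population std. dev. (an assumption)

print(
    f"{mean_time * 1e3:.3f} ms ± {std_time * 1e3:.3f} ms per loop "
    f"(mean ± std. dev. of {RUNS} runs, {LOOPS} loops each)"
)
```

Dividing each run's total by `LOOPS` is what makes the reported figure a per-loop time rather than a per-run time.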


It is good practice to take finer control of the timing. Since the arrays I am working with are large, I do only one execution per run by setting `-n 1`. To limit the waiting time, I set the number of runs to three with `-r 3`; timing several runs is still worthwhile. Finally, I pass the `-o` flag so that the magic returns its result as an object.

In [10]:
%%timeit -n 1 -r 3 -o
dot(x, y)

5.53 s ± 139 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


<TimeitResult : 5.53 s ± 139 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)>

As before, the magic prints the mean and standard deviation across the runs. It is tempting to use the mean, but differences in wall time between runs are far more likely to be caused by unrelated activity on the computer or, in some cases, JIT compilation than by the code itself. So, for a more accurate benchmark, it is better to look at the minimum time for a run rather than the average.

I do this by referencing the output of the previous cell and accessing its `best` attribute. In Jupyter, the variable `_` holds the output of the last executed cell.

In [11]:
python_dot_time = _.best
python_dot_time

5.336721798999861

Generally, you'll notice that this result is similar to, but slightly lower than, the wall time reported by the `%%time` magic.

In addition to explaining the timing concepts, I now also have a baseline time for a dot product in pure Python.
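
The minimum-over-runs idea can also be reproduced without the magic. The following is a rough sketch using `timeit.repeat` with one execution per run and three runs, mirroring `-n 1 -r 3`. The small inputs and the variable names `best` and `average` are hypothetical stand-ins; `best` here is simply the minimum of the run times, which is what I take the `best` attribute to represent.

```python
import timeit


def dot(x, y):
    """Pure-Python dot product, same logic as the notebook's dot()."""
    dot_product = 0.0
    for x_i, y_i in zip(x, y):
        dot_product += x_i * y_i
    return dot_product


# Hypothetical small inputs so the sketch finishes quickly.
x_small = [0.5] * 10_000
y_small = [2.0] * 10_000

# One execution per run (number=1), three runs (repeat=3).
all_runs = timeit.repeat(lambda: dot(x_small, y_small), repeat=3, number=1)

# The minimum is the run least disturbed by other activity on the machine.
best = min(all_runs)
average = sum(all_runs) / len(all_runs)

print(f"all runs: {all_runs}")
print(f"best:     {best:.6f} s")
print(f"average:  {average:.6f} s")
```

Because background noise can only add time to a run, the minimum is never larger than the average, which is why it serves as the steadier baseline.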

## Timing NumPy's Dot Product

In [12]:
%%timeit -n 1 -r 3 -o
np.dot(x, y)

10.2 ms ± 818 µs per loop (mean ± std. dev. of 3 runs, 1 loop each)


<TimeitResult : 10.2 ms ± 818 µs per loop (mean ± std. dev. of 3 runs, 1 loop each)>

In [13]:
numpy_dot_time = _.best
numpy_dot_time
_.all_runs

[0.01128629400045611, 0.009792339999876276, 0.009381500999552372]

## Timing CuPy's Dot Product

### CPU

In [14]:
%%timeit -n 1 -r 3 -o
cp.dot(x, y)

10.4 ms ± 1.01 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


<TimeitResult : 10.4 ms ± 1.01 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)>

In [15]:
cp_cpu_time = _.best
cp_cpu_time
_.all_runs

[0.01183192999997118, 0.009935895999660715, 0.009495285000411968]

### GPU

In [16]:
x_cp = cp.asarray(x)
y_cp = cp.asarray(y)

In [17]:
%%timeit -n 1 -r 3 -o
cp.dot(x_cp, y_cp)

The slowest run took 2119.78 times longer than the fastest. This could mean that an intermediate result is being cached.
43.5 ms ± 61.5 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


<TimeitResult : 43.5 ms ± 61.5 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)>

In [18]:
cp_gpu_time = _.best
print(cp_gpu_time)
print(_.all_runs)

6.153899994387757e-05
[0.13044936200003576, 0.00011599699973885436, 6.153899994387757e-05]


### CPU to GPU

In [19]:
%%timeit -n 1 -r 3 -o
x_cp = cp.asarray(x)
y_cp = cp.asarray(y)
cp.dot(x_cp, y_cp)

23.7 ms ± 12.3 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


<TimeitResult : 23.7 ms ± 12.3 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)>

In [20]:
cp_cpu_gpu_time = _.best
print(cp_cpu_gpu_time)
print(_.all_runs)

0.011271320000560081
[0.019287747999442217, 0.0404844980002963, 0.011271320000560081]


### CPU to GPU to CPU

In [21]:
%%timeit -n 1 -r 3 -o
x_cp = cp.asarray(x)
y_cp = cp.asarray(y)
result = cp.dot(x_cp, y_cp)
result.get()

30.5 ms ± 1.3 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


<TimeitResult : 30.5 ms ± 1.3 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)>

In [22]:
cp_cpu_gpu_cpu_time = _.best
print(cp_cpu_gpu_cpu_time)
print(_.all_runs)

0.029510293999919668
[0.0323292989996844, 0.029632329000378377, 0.029510293999919668]
