# High Performance Computing, Part III

## Numpy performance

In [1]:
import numpy as np

In [2]:
%%timeit
foo=np.arange(1,10001)

3.09 µs ± 232 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [3]:
%%timeit
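# NOTE: list[...] subscripts the type and only builds a generic alias; nothing is iterated.
# list(x for x in range(1, 10001)) was probably intended, so this timing is not comparable.
foo=list[(x for x in range(10000))]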

469 ns ± 1.81 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


In [4]:
%%timeit
foo=[x for x in range(1,10001)]

144 µs ± 6.67 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


`np.arange` (3.09 µs) is roughly 47 times faster than the equivalent list comprehension (144 µs).

In [5]:
def naive_dist(p, q):
    square_distance = 0
    for p_i, q_i in zip(p, q):
        square_distance += (p_i - q_i) ** 2
    return square_distance ** 0.5
p = [i for i in range(1000)]
q = [i + 2 for i in range(1000)]

In [6]:
%%timeit
naive_dist(p, q)

41.4 µs ± 1.13 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [7]:
def simple_numpy_dist(p, q):
    return (np.sum((p - q) ** 2)) ** 0.5
p = np.arange(1000); q = np.arange(1000) + 2

In [8]:
%%timeit
simple_numpy_dist(p, q)

6.59 µs ± 310 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [9]:
def numpy_norm_dist(p, q):
    return np.linalg.norm(p - q)

In [10]:
%%timeit
numpy_norm_dist(p,q)

3.61 µs ± 98.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


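The `np.linalg.norm` version (3.61 µs) is about 11 times faster than the pure-Python loop (41.4 µs); the hand-written NumPy expression (6.59 µs) is about 6 times faster.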
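The speed comparison only makes sense if the three functions compute the same value. As a quick sanity check (a sketch that repeats the definitions from the cells above; the variable names `p_list`, `q_arr`, and so on are introduced here for illustration): every coordinate of `p` and `q` differs by 2, so the distance should be sqrt(1000 · 2²) = sqrt(4000).

```python
import math
import numpy as np

def naive_dist(p, q):
    square_distance = 0
    for p_i, q_i in zip(p, q):
        square_distance += (p_i - q_i) ** 2
    return square_distance ** 0.5

def simple_numpy_dist(p, q):
    return (np.sum((p - q) ** 2)) ** 0.5

def numpy_norm_dist(p, q):
    return np.linalg.norm(p - q)

# Same inputs as the benchmark cells, kept under separate names
p_list = list(range(1000))
q_list = [i + 2 for i in range(1000)]
p_arr = np.arange(1000)
q_arr = np.arange(1000) + 2

expected = math.sqrt(1000 * 2 ** 2)  # 1000 coordinates, each differing by 2
results = (
    naive_dist(p_list, q_list),
    simple_numpy_dist(p_arr, q_arr),
    numpy_norm_dist(p_arr, q_arr),
)
print(all(math.isclose(r, expected) for r in results))
```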

## Numpy’s dispatch mechanism

References:
- [Writing custom array containers](https://numpy.org/doc/stable/user/basics.dispatch.html#basics-dispatch)

A convenient pattern is to define a decorator `implements` that can be used to add functions to `HANDLED_FUNCTIONS`.

In [11]:
HANDLED_FUNCTIONS = {}
def implements(np_function):
    def decorator(func):
        HANDLED_FUNCTIONS[np_function] = func
        return func
    return decorator
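The registry only takes effect once a container class consults it from `__array_function__`, as described in the NumPy guide linked above. The sketch below shows that wiring with a hypothetical `WrappedArrays` class; it is a stand-in for illustration and not the `ListFunctor` used in this notebook, whose implementation is not shown here.

```python
import numpy as np

HANDLED_FUNCTIONS = {}

def implements(np_function):
    """Register func as the override for np_function."""
    def decorator(func):
        HANDLED_FUNCTIONS[np_function] = func
        return func
    return decorator

class WrappedArrays:
    """Hypothetical container holding a list of NumPy arrays."""

    def __init__(self, arrays):
        self.arrays = list(arrays)

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this hook when a public function receives a WrappedArrays
        if func not in HANDLED_FUNCTIONS:
            return NotImplemented
        if not all(issubclass(t, WrappedArrays) for t in types):
            return NotImplemented
        return HANDLED_FUNCTIONS[func](*args, **kwargs)

@implements(np.sum)
def wrapped_sum(wrapped):
    """Sum each contained array separately."""
    return [float(a.sum()) for a in wrapped.arrays]

w = WrappedArrays([np.arange(4.0), np.ones(3)])
print(np.sum(w))  # [6.0, 3.0]
```

A NumPy function without a registered override, such as `np.mean(w)` here, raises `TypeError` because the hook returns `NotImplemented`.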

In [12]:
from Contexts.Functor import ListFunctor

@implements(np.sum)
def sum(arr:ListFunctor):
    "Implementation of np.sum for list object"
    return arr

@implements(np.mean)
def mean(arr:ListFunctor):
    "Implementation of np.mean for list objects"
    return arr / len(arr)

@implements(np.fft.fft)
def fft(arr:ListFunctor):
    return arr

In [13]:
N=100000
a=[x for x in range(N)]
b=np.array(a,dtype=float)

In [14]:
f=ListFunctor([b])

In [15]:
%%timeit
np.sum(f)

45 µs ± 777 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [16]:
%%timeit
f.fmap(np.sum)

20 µs ± 442 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [17]:
%%timeit
np.sum(b)

20.3 µs ± 823 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [18]:
%%timeit
np.sum(np.fromiter((x for x in range(N)),dtype=float))

4.33 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [19]:
%%timeit
np.fft.fft(b)

1.38 ms ± 78.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [20]:
%%timeit
f.fmap(np.fft.fft)

1.35 ms ± 6.65 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


### Conclusions

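- `fmap` adds little overhead: `f.fmap(np.sum)` takes 20 µs versus 20.3 µs for `np.sum(b)`, and `f.fmap(np.fft.fft)` takes 1.35 ms versus 1.38 ms for `np.fft.fft(b)`.
- Calling `np.sum(f)` through the dispatch mechanism is slower, at 45 µs.
- Feeding a Python generator to NumPy through `np.fromiter` is very slow: 4.33 ms, about 200 times slower than `np.sum(b)` on an existing array.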

## Cupy
References:
- [Cupy github](https://github.com/cupy/cupy)
- [User Guide](https://docs.cupy.dev/en/stable/user_guide/index.html)

In [21]:
import numpy as np

In [22]:
%%timeit
x=np.arange(6).reshape(2,3).astype('f')
x.sum(axis=1)

2.65 µs ± 236 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [23]:
import cupy as cp

In [24]:
%%timeit
with cp.cuda.Device(0):
    x_on_gpu0 = cp.array([1, 2, 3, 4, 5])


71.9 µs ± 39.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [25]:
x_on_gpu=cp.array([1,2,3,4,5])

In [26]:
x_on_gpu0 = cp.array([1, 2, 3, 4, 5])
x_on_gpu0.device

<CUDA Device 0>

In [27]:
x_on_gpu.device

<CUDA Device 0>

In [28]:
%%timeit
x_on_gpu=cp.array([1,2,3,4,5])
cp.linalg.norm(x_on_gpu)

85.2 µs ± 2.29 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [29]:
%%timeit
x_cpu = np.array([1,2,3,4,5])
l2_cpu = np.linalg.norm(x_cpu)

2.74 µs ± 261 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [30]:
%%timeit
x = cp.arange(600).reshape(20, 30).astype('f')
x.sum(axis=1)

57.2 µs ± 159 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [31]:
%%timeit
x = cp.arange(600).reshape(20, 30).astype('f')
x.sum(axis=1)

60.2 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


### Conclusions
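- For the tiny arrays timed here CuPy is slower than NumPy: `norm` on 5 elements takes 85.2 µs on the GPU versus 2.74 µs on the CPU, most likely because fixed per-call overhead (host-to-device copy, kernel launch) dominates at this size.
- The `sum(axis=1)` timings are not directly comparable, since the NumPy cell uses 6 elements and the CuPy cells use 600.
- These benchmarks do not show where CuPy helps; much larger arrays would be needed to find out.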
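One way to probe this further is to time the same operation at several array sizes. The sketch below is an assumption-laden starting point, not a measured result: `time_norm` and `compare` are hypothetical helpers, the sizes are arbitrary, and the CuPy branch runs only if CuPy imports successfully. Because GPU kernels launch asynchronously, it synchronizes the default stream before stopping the clock.

```python
import time
import numpy as np

try:
    import cupy as cp  # optional; needs a CUDA-capable GPU
except ImportError:
    cp = None

def time_norm(xp, n, repeats=20):
    """Best wall-clock time in seconds of xp.linalg.norm on n elements."""
    x = xp.arange(n, dtype=xp.float64)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        xp.linalg.norm(x)
        if cp is not None and xp is cp:
            # Wait for the GPU to finish before reading the clock
            cp.cuda.Stream.null.synchronize()
        best = min(best, time.perf_counter() - start)
    return best

def compare(sizes=(5, 10_000, 10_000_000)):
    """Print NumPy (and, if available, CuPy) timings for each size."""
    for n in sizes:
        line = f"n={n:>10,}  numpy: {time_norm(np, n) * 1e6:10.1f} µs"
        if cp is not None:
            line += f"  cupy: {time_norm(cp, n) * 1e6:10.1f} µs"
        print(line)
```

Calling `compare()` prints one line per size; the array is created on the device once, outside the timed loop, so the transfer cost is excluded.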

## Dask
References:
- [Dask github](https://github.com/dask/dask)
- [Dask array examples](https://examples.dask.org/array.html)

In [32]:
npx = np.random.random((10000,10000))

In [33]:
%%timeit
y=npx+npx.T
z=y[::2,5000:].mean(axis=1)

248 ms ± 8.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Dashboard

In [34]:
from dask.distributed import Client, progress
client = Client(processes=False, threads_per_worker=4,
                n_workers=1, memory_limit='2GB')
client

- Client
  - Connection method: Cluster object
  - Cluster type: distributed.LocalCluster
  - Dashboard: http://198.18.198.14:8787/status
- Cluster
  - Dashboard: http://198.18.198.14:8787/status
  - Workers: 1
  - Total threads: 4
  - Total memory: 1.86 GiB
  - Status: running
  - Using processes: False
- Scheduler
  - Comm: inproc://198.18.198.14/677845/1
  - Workers: 1
  - Dashboard: http://198.18.198.14:8787/status
  - Total threads: 4
  - Started: Just now
  - Total memory: 1.86 GiB
- Worker
  - Comm: inproc://198.18.198.14/677845/4
  - Total threads: 4
  - Dashboard: http://198.18.198.14:40273/status
  - Memory: 1.86 GiB
  - Nanny: None
  - Local directory: /tmp/dask-worker-space/worker-93k12xxb


In [35]:
import dask.array as da
x = da.random.random((10000, 10000), chunks=(1000, 1000))
x



| | Array | Chunk |
|---|---|---|
| Bytes | 762.94 MiB | 7.63 MiB |
| Shape | (10000, 10000) | (1000, 1000) |
| Dask graph | 100 chunks in 1 graph layer | |
| Data type | float64 numpy.ndarray | |


In [36]:
y = x + x.T
z = y[::2, 5000:].mean(axis=1)
z

| | Array | Chunk |
|---|---|---|
| Bytes | 39.06 kiB | 3.91 kiB |
| Shape | (5000,) | (500,) |
| Dask graph | 10 chunks in 7 graph layers | |
| Data type | float64 numpy.ndarray | |


In [37]:
%%timeit
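# NOTE: this only builds the lazy task graph; no arithmetic runs until z.compute() is called,
# so the timing below is not comparable with the NumPy timing above.
y = x + x.T
z = y[::2, 5000:].mean(axis=1)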

1.55 ms ± 54.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
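The 1.55 ms above measures graph construction only, because Dask arrays are lazy. The sketch below illustrates the difference between building the expression and executing it with `compute()`. It uses a smaller array than the notebook (an assumption made so it runs quickly) and the default local scheduler rather than the `Client` created earlier.

```python
import numpy as np
import dask.array as da

# Smaller than the 10000 x 10000 array above so the sketch runs quickly
x = da.random.random((2000, 2000), chunks=(500, 500))

# Building the expression only records tasks in a graph
y = x + x.T
z = y[::2, 1000:].mean(axis=1)

# compute() executes the graph and returns a concrete NumPy array
result = z.compute()
print(type(z).__name__, type(result).__name__, result.shape)
```

To benchmark Dask against the NumPy cell, the timed statement would need to include `z.compute()`.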


In [38]:
y.persist()



| | Array | Chunk |
|---|---|---|
| Bytes | 762.94 MiB | 7.63 MiB |
| Shape | (10000, 10000) | (1000, 1000) |
| Dask graph | 100 chunks in 1 graph layer | |
| Data type | float64 numpy.ndarray | |


## Numba

## MKL

## OpenBLAS