# Why CrossPy?

Heterogeneous architectures, typically combining CPUs and GPUs, have become ubiquitous in clusters for scientific computing and machine learning.
To harness these architectures, libraries and packages have been developed in Python, the dominant programming language for scientific computing applications. [NumPy][1] and [SciPy][2] have emerged as fundamental libraries for scientific computing on CPU hosts, while [CuPy][3] has been developed for GPU accelerators.
Each library works efficiently on its own target architecture, but the challenge of programming with a mix of them is left to the programmer.

[1]: https://numpy.org/doc/stable/index.html
[2]: https://scipy.org
[3]: https://docs.cupy.dev/en/stable/index.html

Here is a simple example of array addition. On a CPU host, we use [NumPy][1]:

In [1]:
import numpy as np

a = np.random.rand(10)
b = np.random.rand(10)
c = a + b
print(c)

[0.56178508 0.83718242 0.34234627 1.66146186 0.76599034 0.4140401
 0.64815243 0.47140767 0.40633476 1.77319491]


Simple. By replacing `numpy` with `cupy`, we get a single-GPU implementation:

In [2]:
import cupy as cp

with cp.cuda.Device(0):
    ag = cp.random.rand(10)
    bg = cp.random.rand(10)
    cg = ag + bg
print(cg)

[0.89871743 1.0159016  0.79963647 1.32453652 0.55732429 0.58419898
 1.17315572 1.43277415 1.00451392 1.51350342]


Simple, too. But what if we want to use multiple devices, or have to because the array is too large to reside on a single device?
Say, half of the computation runs on a CPU and the other half on a GPU:

In [3]:
dummy_large_number = 10

# conceptually
a_origin = np.random.rand(dummy_large_number)
b_origin = np.random.rand(dummy_large_number)

a_first_half = a_origin[:dummy_large_number // 2]
b_first_half = b_origin[:dummy_large_number // 2]
c_first_half = a_first_half + b_first_half

with cp.cuda.Device(0):
    a_second_half = cp.asarray(a_origin[dummy_large_number // 2:])
    b_second_half = cp.asarray(b_origin[dummy_large_number // 2:])
    c_second_half = a_second_half + b_second_half

c = np.concatenate((c_first_half, cp.asnumpy(c_second_half)))
print(c)

[1.95264092 0.68129014 0.86476446 1.15659911 1.38821283 1.54767908
 1.51356108 1.16500384 0.12428594 1.62914811]


This already looks cumbersome. We can similarly use two GPUs with minimal changes to the example above, but the programming complexity stays the same.
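For reference, the two-GPU variant of the manual approach might look like the sketch below. The helper name `add_on_two_devices` is hypothetical and not part of NumPy, CuPy, or CrossPy. To keep the snippet runnable on machines without GPUs, it assumes a fallback to plain NumPy whenever fewer than two CUDA devices are visible.

```python
import numpy as np

try:
    import cupy as cp  # optional: only needed for the GPU path
    n_gpus = cp.cuda.runtime.getDeviceCount()
except Exception:
    # Assumption for this sketch: no usable CuPy/CUDA means we stay on the host.
    cp, n_gpus = None, 0


def add_on_two_devices(a, b):
    """Add two host arrays, computing each half on its own GPU when possible."""
    half = len(a) // 2
    halves = (slice(None, half), slice(half, None))
    pieces = []
    for dev_id, part in enumerate(halves):
        if n_gpus >= 2:
            # Copy this half to GPU `dev_id`, add there, and copy the result back.
            with cp.cuda.Device(dev_id):
                c_part = cp.asarray(a[part]) + cp.asarray(b[part])
                pieces.append(cp.asnumpy(c_part))
        else:
            # Host fallback so the sketch still runs without two GPUs.
            pieces.append(a[part] + b[part])
    return np.concatenate(pieces)


a_demo = np.random.rand(10)
b_demo = np.random.rand(10)
c_demo = add_on_two_devices(a_demo, b_demo)
```

The slicing, per-device context management, and final concatenation are all still written by hand, which is the bookkeeping the rest of this page aims to remove.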

Now, let's see how CrossPy programs this:

In [4]:
import crosspy as xp

ax = xp.array([a_first_half, a_second_half], axis=0)
bx = xp.array([b_first_half, b_second_half], axis=0)
cx = ax + bx
print(cx)

array([1.95264092, 0.68129014, 0.86476446, 1.15659911, 1.38821283])@<CPU 0>; array([1.54767908, 1.51356108, 1.16500384, 0.12428594, 1.62914811])@<CUDA Device 0>


As simple as the first example! CrossPy handles the complexity of cross-device manipulation and thus eliminates the burden of tedious programming. The printed `cx` is a CrossPy array spanning two devices: the range `[0:5]` resides on the CPU and `[5:10]` on the GPU.
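To make the idea of one logical array backed by several per-device pieces concrete, here is a minimal sketch of how a global index could be mapped to a partition and a local offset. This is an illustration only, not CrossPy's actual implementation, and the function `locate` is a hypothetical name.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(global_index, part_sizes):
    """Return (partition number, local index) for a global index."""
    ends = list(accumulate(part_sizes))  # exclusive end of each partition
    if not 0 <= global_index < ends[-1]:
        raise IndexError(f"index {global_index} is out of range")
    part = bisect_right(ends, global_index)  # first partition ending after the index
    start = ends[part] - part_sizes[part]    # global offset where that partition begins
    return part, global_index - start


# Two partitions of five elements each, as in the example above.
print(locate(4, [5, 5]))  # (0, 4): last element of the first partition
print(locate(5, [5, 5]))  # (1, 0): first element of the second partition
```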

In [5]:
print(cx.shape)
print(cx.device)

(10,)
[<CPU 0>, <CUDA Device 0>]


We can also start from the original large array and let CrossPy handle the partitioning, without adding to the lines of code:

In [6]:
from crosspy import cpu, gpu

ax = xp.array(a_origin, distribution=[cpu(0), gpu(0)], axis=0)
bx = xp.array(b_origin, distribution=[cpu(0), gpu(0)], axis=0)
cx = ax + bx
print(cx)

array([1.95264092, 0.68129014, 0.86476446, 1.15659911, 1.38821283])@<CPU 0>; array([1.54767908, 1.51356108, 1.16500384, 0.12428594, 1.62914811])@<CUDA Device 0>


This example shows the power of CrossPy: to migrate a single-device implementation to a multi-device version, we only need to initialize the arrays with CrossPy and specify how they should be distributed.
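Conceptually, distributing an array evenly over a list of devices amounts to splitting it into near-equal chunks along one axis and pairing each chunk with a device. The sketch below shows that idea with NumPy alone. The function `partition_uniform` and the string device labels are hypothetical stand-ins, and the chunks are not actually moved to any device.

```python
import numpy as np


def partition_uniform(array, devices, axis=0):
    """Split `array` into len(devices) near-equal chunks and pair each with a device label."""
    chunks = np.array_split(array, len(devices), axis=axis)
    return list(zip(devices, chunks))


# Hypothetical device labels; a real library would use device objects instead.
parts = partition_uniform(np.arange(10), ["cpu:0", "gpu:0"])
for device, chunk in parts:
    print(device, chunk)
```

`np.array_split` is used here rather than `np.split` because it also accepts lengths that do not divide evenly by the number of devices.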

Here we highlight an incomplete list of CrossPy features:

- [Flexible data distribution (including uniform partitioning and arbitrary coloring)](start/quickstart.ipynb#array-creation)

- [Distribution-agnostic data transfer interfaces](start/quickstart.ipynb#array-creation)
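As a rough illustration of the difference between uniform partitioning and coloring, the sketch below assumes that a "coloring" is a per-element label naming the device that should own each element. That interpretation, the function `partition_by_coloring`, and the integer device labels are assumptions made for this example; consult the quickstart for CrossPy's actual interface.

```python
import numpy as np


def partition_by_coloring(array, coloring, n_devices):
    """Group elements along axis 0 by the device index given in `coloring`."""
    array = np.asarray(array)
    coloring = np.asarray(coloring)
    if coloring.shape != array.shape[:1]:
        raise ValueError("coloring needs one label per element along axis 0")
    # One boolean mask per device selects the elements assigned to it.
    return [array[coloring == dev] for dev in range(n_devices)]


data = np.arange(6) * 10           # [0, 10, 20, 30, 40, 50]
colors = [0, 1, 0, 1, 1, 0]        # hypothetical per-element device labels
parts_by_color = partition_by_coloring(data, colors, n_devices=2)
print(parts_by_color[0])           # elements labelled 0
print(parts_by_color[1])           # elements labelled 1
```

Unlike a uniform split, the pieces need not be contiguous or equal in size, which is what makes this kind of distribution flexible.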


More details can be found in this documentation.