# Introduction to Data Processing with NumPy

In [1]:
%pip install numpy
import numpy as np

The central data structure offered by NumPy is the `ndarray`, a versatile, multi-dimensional, typed array of fixed size.
Internally, an `ndarray` is a struct with (among others) the following fields:
- `data`: A pointer to a buffer of contiguous memory.
- `shape`: A tuple of integers describing the size of each dimension, e.g., `(3, 4)` denotes a $3 \times 4$ matrix.
- `dtype`: The datatype of the elements in `data`, e.g., `np.float64` or `np.int16`.
- `strides`: A tuple of integers giving, for each dimension, the number of bytes to step in `data` to reach the next element along that dimension.

In [None]:
# Create a 1-dimensional int16 ndarray containing the numbers 0 to 11.
X = np.arange(12, dtype=np.int16)
print(f"{X=}")

print(f"X.data=<ptr 0x{X.ctypes.data:x}>")

print(f"{X.shape=}")

print(f"{X.dtype=} ({X.itemsize=} bytes)")
print(f"{X.strides=}")
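To make the role of `strides` more concrete, the following sketch locates a single element by hand. It uses a separate, hypothetical example array `M` (not part of the cells above) and assumes the default row-major (C) memory layout.

```python
import numpy as np

# Hypothetical example array: a 3x4 int16 matrix in row-major (C) order.
M = np.arange(12, dtype=np.int16).reshape((3, 4))

i, j = 2, 1
# Byte offset of element (i, j) relative to the start of the data buffer.
offset = i * M.strides[0] + j * M.strides[1]

# Read one int16 value directly from the raw bytes at that offset.
raw = M.tobytes()
value = np.frombuffer(raw, dtype=M.dtype, count=1, offset=offset)[0]

print(f"{M.strides=}, {offset=}, {value=}, {M[i, j]=}")
```

With 2-byte elements and 4 columns, stepping one row costs 8 bytes and stepping one column costs 2 bytes, so element `(2, 1)` sits 18 bytes into the buffer.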

NumPy offers a large collection of efficient operations for the `ndarray` data structure, implemented in compiled code.
Many operations can even be performed in $\mathcal{O}(1)$ time, by simply changing the `shape` or `strides` tuples without touching the values in the `data` buffer.
Three very useful operations, which (often) run in constant time, are reshaping, transposition and slicing.

## 1. Reshaping and Transposition

We begin by reshaping the array $X = [0, \dots, 11]$ defined above into a $3 \times 4$ matrix and then taking its transpose:

In [None]:
Y = X.reshape((3, 4))
print("The array X viewed as a 3 by 4 matrix:")
print(Y)
print(
    f"Both arrays are backed by the same data buffer: X.data=<ptr 0x{X.ctypes.data:x}>, Y.data=<ptr 0x{Y.ctypes.data:x}>."
)
print("Only the shape and strides had to be changed:")
print(f"{X.shape=} -> {Y.shape=}")
print(f"{X.strides=} -> {Y.strides=}")

In [None]:
print("Next, we transpose Y (via Y.T):")
print(Y.T)
print(
    f"This, again, does not change the data buffer: Y.data = Y.T.data = <ptr 0x{Y.T.ctypes.data:x}>"
)
print("Only the shape and strides tuples had to be reversed:")
print(f"{Y.shape=} -> {Y.T.shape=}")
print(f"{Y.strides=} -> {Y.T.strides=}")
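Reshaping is only "often" a constant-time operation: if the requested shape cannot be expressed by adjusting `shape` and `strides` alone, NumPy has to copy the data. The sketch below illustrates this with hypothetical arrays `A` and `B` and uses `np.shares_memory` to check whether two arrays are backed by overlapping memory.

```python
import numpy as np

A = np.arange(12, dtype=np.int16)
B = A.reshape((3, 4))

# B is C-contiguous, so flattening it only needs new shape/strides.
flat_view = B.reshape(-1)
# B.T is not C-contiguous, so flattening it requires copying the data.
flat_copy = B.T.reshape(-1)

print(f"{B.T.flags['C_CONTIGUOUS']=}")
print(f"{np.shares_memory(A, flat_view)=}")
print(f"{np.shares_memory(A, flat_copy)=}")
```

Checking `np.shares_memory` (or the `base` attribute) is a simple way to find out whether an operation returned a view or a copy.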

## 2. Slicing

In addition to reshaping, we can also create subarrays from a given array in constant time: 

In [None]:
print("Simple slicing example:")
print(f"{X[2:]=}")
print(
    f"Data pointer difference between X and X[2:]: {X[2:].ctypes.data - X.ctypes.data} bytes"
)

In [None]:
print("Multidimensional slice + adding a third dimension of size 1:")
Y_sub = Y[1:, -3:-1, None]
print(f"{Y_sub=}, {Y_sub.shape=}")
print(
    f"Data pointer difference between Y and Y_sub: {Y_sub.ctypes.data - Y.ctypes.data} bytes"
)
print(f"{Y.strides=} -> {Y_sub.strides=}")
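Because a basic slice is a view on the same data buffer, writing to the slice also changes the original array. The following sketch, using a hypothetical array `base`, contrasts a view with an explicit copy made via `.copy()`.

```python
import numpy as np

base = np.arange(12, dtype=np.int16)

view = base[2:5]                # a view: shares the buffer of base
independent = base[2:5].copy()  # an explicit copy with its own buffer

view[0] = 100        # also changes base[2]
independent[1] = -1  # does not affect base

print(f"{base[:6]=}")
print(f"{np.shares_memory(base, view)=}")
print(f"{np.shares_memory(base, independent)=}")
```

If a subarray should be modified independently of the original, copying it explicitly is the safe choice, at the cost of linear time and extra memory.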

## 3. Vectorization and Broadcasting

As we have just seen, quite a few operations on `ndarray`s can be performed very efficiently without actually changing the data buffer.
However, constant runtime is of course not always achievable; performing an arithmetic operation on every element of an arbitrary $n$-element array takes $\Omega(n)$ time.

NumPy implements such array operations in highly optimized, compiled C code (often using SIMD instructions), which makes them still very fast, and much faster than equivalent loops in Python.
Replacing a slow Python loop with calls to such efficient native implementations is called *vectorization*.

In [None]:
print("Computing the first 100000 square numbers using numpy:")
%timeit (np.arange(100000, dtype=np.int64) + 1) ** 2

print("Computing the first 100000 square numbers using a Python list comprehension:")
%timeit [(z + 1) ** 2 for z in range(100000)]

Vectorization can also be used to apply an element-wise filter to an array.
We can, for example, use it to remove all odd numbers from the array $X = [0, \dots, 11]$ we defined above: 

In [None]:
print(
    "Applying a boolean operator to an ndarray results in a boolean ndarray:",
    X % 2 == 0,
)

print("Boolean arrays can be used as indices, to filter other arrays:", X[X % 2 == 0])
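Unlike basic slicing, indexing with a boolean array returns a copy rather than a view, while assigning through a boolean mask modifies the original array in place. The sketch below demonstrates both on a hypothetical array `values`; `np.flatnonzero` is used to list the positions at which the mask is true.

```python
import numpy as np

values = np.arange(12, dtype=np.int16)
mask = values % 2 == 0

evens = values[mask]              # boolean indexing returns a copy
positions = np.flatnonzero(mask)  # indices at which the mask is True

evens[0] = -1  # modifies only the copy, not values
print(f"{values[0]=}, {evens=}, {positions=}")

# Assigning through a mask does modify the original array in place.
values[~mask] = 0
print(f"{values=}")
```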

NumPy can not only apply operations to each element of a given array; it also supports combining arrays of differing shapes via a technique called [*broadcasting*](https://numpy.org/doc/stable/user/basics.broadcasting.html).
To get an intuition for what broadcasting is, we use it to add a vector to each row and to each column of the $3 \times 4$ matrix $Y$:

In [None]:
print(f"{Y=}")
a = np.array([1, 0, -1, 0])
print(f"We add {a=} to each row of Y:")
Y_a = Y + a.reshape((1, -1))
print(Y_a)

b = a[:-1]
print(f"We add {b=} to each column of Y:")
Y_b = Y + b.reshape((-1, 1))
print(Y_b)
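The reshapes in the example above are needed because broadcasting compares shapes starting from the trailing dimension, and two sizes are compatible only if they are equal or one of them is 1. The following sketch explores this rule with `np.broadcast_shapes` (assuming NumPy 1.20 or newer) and shows, via `np.broadcast_to`, that a broadcast array is again just a view whose repeated dimension has stride 0. The array `row` is a hypothetical stand-in for the vector `a`.

```python
import numpy as np

row = np.array([1, 0, -1, 0], dtype=np.int16)

# Compatible: sizes match or one of them is 1.
print(np.broadcast_shapes((3, 4), (1, 4)))
print(np.broadcast_shapes((3, 4), (3, 1)))

# Incompatible: trailing sizes 4 and 3 differ and neither is 1.
try:
    np.broadcast_shapes((3, 4), (3,))
    compatible = True
except ValueError as err:
    compatible = False
    print("Incompatible shapes:", err)

# broadcast_to returns a read-only view; the repeated axis has stride 0.
stretched = np.broadcast_to(row, (3, 4))
print(f"{stretched.shape=}, {stretched.strides=}")
```

This is why a length-3 vector has to be reshaped to `(3, 1)` before it can be added to each column of a $3 \times 4$ matrix.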

The following example shows how broadcasting can even be used to compute the products of all pairs of elements in $X = [0, \dots, 11]$:

In [None]:
X.reshape((-1, 1)) * X.reshape((1, -1))
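The same table of pairwise products can be written in a few equivalent ways. The sketch below, using a hypothetical vector `v` with the same contents as `X`, compares the explicit reshapes with `None` indexing (as used in the slicing section) and with `np.multiply.outer`.

```python
import numpy as np

v = np.arange(12, dtype=np.int16)

via_reshape = v.reshape((-1, 1)) * v.reshape((1, -1))
via_newaxis = v[:, None] * v[None, :]  # None inserts a dimension of size 1
via_outer = np.multiply.outer(v, v)    # outer product via the ufunc method

print(f"{via_reshape.shape=}")
print(f"{np.array_equal(via_reshape, via_newaxis)=}")
print(f"{np.array_equal(via_reshape, via_outer)=}")
```

All three variants produce a $12 \times 12$ matrix whose entry $(i, j)$ is the product of the $i$-th and $j$-th element of the vector.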