# PPY lecture #9, April ~~11~~18 2023

We will begin our talk on efficient computations in Python using NumPy and SciPy.

# Before we begin, a couple notes:

1. Nice [article](https://www.pythonmorsels.com/any-and-all/) on checking if `any` or `all` conditions are met, in a descriptive, readable, and efficient way using generators
2. ...I found it in one of the latest [editions](https://mailchi.mp/pythonweekly/python-weekly-issue-594?e=b50db26839) of the [Python Weekly](https://www.pythonweekly.com) newsletter, which I highly recommend subscribing to!

3. Did you know about [Saint-Python](https://fr.wikipedia.org/wiki/Secteur_pavé_de_Saint-Python)?

This is how [AI](https://www.craiyon.com) could draw it:

![Saint-Python pavé secteur](src/saint_python.png)

(I was watching Paris-Roubaix cycling race when preparing this lecture... 😉)

# N-dimensional arrays, or `ndarray`s



An `ndarray` is:
* a multidimensional array object provided by the NumPy library
* a data structure that stores homogeneous data (every element has the same data type and the same fixed size), by default in a contiguous block of memory
* a fundamental data structure in scientific computing and data analysis in Python; many popular libraries and frameworks build on it or interoperate with it, including SciPy, pandas, and TensorFlow.
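
A small sketch of what a single shared data type means in practice. The variable names are made up for illustration; the point is only that NumPy converts mixed numeric input to one common `dtype`, while a list keeps each element's own type:

```python
import numpy as np

# Mixed ints and floats: NumPy picks one common dtype for the whole array
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)   # float64
print(mixed)

# A Python list, by contrast, keeps each element's own type
as_list = [1, 2.5, 3]
print([type(x).__name__ for x in as_list])  # ['int', 'float', 'int']
```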

Ndarrays can have any number of dimensions, from one to many, and can be created using a variety of methods, such as the `numpy.array()` function, or by reading data from a file or another data source.

In [1]:
import numpy as np  # short as np...

Once an ndarray is created, its elements can be accessed and manipulated using indexing and slicing operations, just like a regular Python list:

In [2]:
# create a 1D array from a list
a = np.array([1, 2, 3, 4, 5])

In [3]:
print(a[2:4])

[3 4]
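
One difference from lists is worth keeping in mind: a basic slice of an ndarray is a view onto the same memory, not a copy. A minimal sketch (the names `arr`, `view`, `lst` and `part` are only for illustration):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
view = arr[2:4]      # a view onto arr's memory, not a copy
view[0] = 99         # writing through the view changes arr as well
print(arr)           # [ 1  2 99  4  5]

lst = [1, 2, 3, 4, 5]
part = lst[2:4]      # slicing a list creates a new list
part[0] = 99
print(lst)           # [1, 2, 3, 4, 5]

# use .copy() when an independent array is needed
independent = arr[2:4].copy()
independent[0] = -1  # arr is not affected by this
```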


And since ndarrays are stored in a contiguous block of memory, they allow much faster and more efficient mathematical operations on large amounts of data.

A couple more ways of creating ndarrays (look for similarities with e.g. MATLAB):

In [4]:
# create a 2D array from a list of lists
b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [5]:
# create a 2D array with shape (3, 4) filled with zeros
a = np.zeros((3, 4))
a

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [6]:
# create a 1D array with shape (5,) filled with ones
b = np.ones(5)
b

array([1., 1., 1., 1., 1.])

In [7]:
# create a 3D array with shape (2, 3, 4) filled with random values
c = np.random.rand(2, 3, 4)
c

array([[[0.39394696, 0.1895839 , 0.15529477, 0.8791911 ],
        [0.55727713, 0.27283673, 0.23688819, 0.02393742],
        [0.03752672, 0.0719595 , 0.84586446, 0.04532555]],

       [[0.44900557, 0.58184562, 0.01418723, 0.56344271],
        [0.12959124, 0.80552154, 0.88453276, 0.5342757 ],
        [0.04181917, 0.6753316 , 0.52827266, 0.97476493]]])
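
A few more constructors in the same spirit, sketched here with illustrative variable names, together with the attributes that describe the result (`shape`, `ndim`, `dtype`). MATLAB users may recognise `arange` (similar to `start:step:stop`, but with the stop value excluded), `linspace` and `eye`:

```python
import numpy as np

steps = np.arange(0, 10, 2)        # 0, 2, 4, 6, 8 (stop value is excluded)
grid = np.linspace(0.0, 1.0, 5)    # 5 evenly spaced points, endpoints included
ident = np.eye(3)                  # 3x3 identity matrix
mat = np.arange(6).reshape(2, 3)   # 1D range reshaped to 2 rows, 3 columns

for name, arr in [("steps", steps), ("grid", grid), ("ident", ident), ("mat", mat)]:
    # shape: size along each dimension, ndim: number of dimensions
    print(name, arr.shape, arr.ndim, arr.dtype)
```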

## About data types

In NumPy, specifying the data format of an ndarray is important when working with data that has a specific data type or when there is a need to optimize memory usage or processing speed. Here are some situations when you may need to specify the data format of an ndarray:

1. When working with data of a specific type:

In NumPy, the data type of an ndarray is determined automatically based on the input data, but sometimes you may need to work with data of a specific type, such as integers or floats of a certain size. In such cases, you can specify the data type using the `dtype` parameter. For example:

In [8]:
# create an array of integers
a = np.array([1, 2, 3], dtype=np.int32)
print(a)

# create an array of floats
b = np.array([1.0, 2.0, 3.0], dtype=np.float64)
print(b)

[1 2 3]
[1. 2. 3.]


2. When optimizing memory usage:

By default, NumPy does not pick the smallest data type that can represent the input data: Python integers typically become `int64` (the platform's default integer type) and Python floats become `float64`. When working with large datasets, you may therefore want to request a smaller data type to save memory. For example, if you know that your input data will never fall outside the range of the `uint8` data type (0 to 255), you can specify `dtype=np.uint8`.
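
A rough way to see the memory effect is the `nbytes` attribute. This is only a sketch with made-up variable names; the exact default integer size depends on the platform and NumPy version, so the first printed line may differ on your machine:

```python
import numpy as np

values = np.arange(256)            # default integer type (typically int64)
small = values.astype(np.uint8)    # values 0..255 fit exactly into uint8

print(values.dtype, values.nbytes)
print(small.dtype, small.nbytes)   # uint8 256

# Caveat: astype() wraps out-of-range values around silently
wrapped = np.array([255, 256, 257]).astype(np.uint8)
print(wrapped)                     # [255   0   1]
```

The wrap-around in the last lines is the reason to use a small type only when the value range is known in advance.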

3. When optimizing processing speed:

The data format can also affect the processing speed of NumPy operations. For example, using a float32 data type instead of a float64 data type can result in faster computations, especially when working with large datasets. Similarly, using a contiguous memory layout can improve the speed of some operations, such as matrix multiplication, compared to a non-contiguous layout.

In [9]:
# create two arrays of size 1000x1000
a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)

In [10]:
a.dtype

dtype('float64')

In [11]:
%%timeit
c = np.dot(a, b)

22.5 ms ± 3.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [12]:
a = a.astype(np.float32)
b = b.astype(np.float32)

In [13]:
%%timeit
c = np.dot(a, b)

9.57 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In this code, we create two 1000x1000 arrays filled with random values and multiply them using the `np.dot()` function, timing the operation first with `float64` and then with `float32` data. The results vary between machines, but on most of them you should see that matrix multiplication is faster with `float32` than with `float64`. The price is lower precision: about 7 significant decimal digits instead of about 16.
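
The memory-layout remark above can be checked directly via the `flags` attribute. A minimal sketch, assuming a small illustrative array (the names `base`, `transposed` and `fixed` are hypothetical); whether the copy pays off in speed depends on the operation and the machine:

```python
import numpy as np

base = np.random.rand(4, 3)
transposed = base.T                       # a view with swapped strides, no copy

print(base.flags['C_CONTIGUOUS'])         # True
print(transposed.flags['C_CONTIGUOUS'])   # False

# copy the data into C (row-major) order
fixed = np.ascontiguousarray(transposed)
print(fixed.flags['C_CONTIGUOUS'])        # True
print(np.array_equal(fixed, transposed))  # True: same values, new layout
```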

## ...but why bother with `ndarray` at all?

Suppose we have two sequences, one ndarray and one ordinary Python list, each containing 1 million integers. We want to multiply each element by 2 and measure the time taken for the operation.

In [14]:
a = np.random.randint(0, 100, size=1_000_000)

In [15]:
%%timeit
b = a*2

645 µs ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


And here's how we can do the same thing using a list:

In [16]:
a = [i for i in range(1000000)]

In [17]:
%%timeit
b = [i*2 for i in a]

23.2 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


The ndarray version is about 36 times faster here (645 µs versus 23.2 ms). This is because the ndarray is optimized for **vectorized operations**: the loop over the elements runs in low-level, compiled code on a contiguous block of same-typed values. In contrast, the list comprehension runs the loop in the Python interpreter and handles each element as a separate Python object, which is much slower.
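
Outside a notebook, the same comparison can be sketched with the standard `timeit` module instead of the `%%timeit` magic. The variable names are illustrative and the timings will differ from machine to machine; the sketch also checks that both approaches produce the same values:

```python
import timeit
import numpy as np

n = 1_000_000
arr = np.arange(n)
lst = list(range(n))

doubled_arr = arr * 2                 # one vectorized operation
doubled_lst = [x * 2 for x in lst]    # explicit Python-level loop

# both approaches give the same values
print(np.array_equal(doubled_arr, doubled_lst))  # True

# average time per run over 5 repetitions
t_arr = timeit.timeit(lambda: arr * 2, number=5) / 5
t_lst = timeit.timeit(lambda: [x * 2 for x in lst], number=5) / 5
print(f"ndarray: {t_arr * 1e3:.2f} ms, list: {t_lst * 1e3:.2f} ms")
```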

A nice summary article on vectorization: https://towardsdatascience.com/how-to-speedup-data-processing-with-numpy-vectorization-12acac71cfca, including links to more detailed sources.

TL;DR the (second) key term is **broadcasting**:

> The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python.
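
A minimal sketch of what the quote describes, using small made-up arrays: a row, a column, and a scalar are each "stretched" to match a 2x3 matrix, while an incompatible shape raises an error.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])      # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,)
col = np.array([[100], [200]])      # shape (2, 1)

print(matrix + row)   # row is applied to each of the two rows
# [[11 22 33]
#  [14 25 36]]

print(matrix + col)   # col is applied to each of the three columns
# [[101 102 103]
#  [204 205 206]]

print(matrix * 2)     # a scalar is broadcast to every element

try:
    matrix + np.array([1, 2])   # shapes (2, 3) and (2,) are incompatible
except ValueError as err:
    print("cannot broadcast:", err)
```

Shapes are compared from the last dimension backwards; two dimensions are compatible when they are equal or when one of them is 1.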

The article also contains a primer for a later talk about (fast) data processing using `pandas`, which we saw in the 8th lecture; we will get back to it next time.