# Lesson 2: Number Crunching in Python

There are two types of languages, and Python is one of the slow ones.

<br>

<center>
<img src="img/benchmark-games-2023.svg" width="75%">
</center>

That is, it was designed with convenience in mind, rather than speed.

<br><br>

Data analysis frequently involves calculations on large datasets, so speed and memory use are both important!

<br><br>

With this against it, how did Python become such a popular language for data analysis?

## The why and the how

Let's reload the Higgs dataset to get a single list of numbers.

In [30]:
import json

with open("data/SMHiggsToZZTo4L.json") as file:   # the file is closed when the block ends
    dataset_python = json.load(file)

<br>

In [31]:
pt_python = []
for event in dataset_python:
    for electron in event["electron"]:
        pt_python.append(electron["pt"])
    for muon in event["muon"]:
        pt_python.append(muon["pt"])

<br>

In [32]:
len(pt_python)

28809

<br>

Look at the first few values by slicing the list.

In [33]:
pt_python[0:4]

[63.04386901855469, 38.12034606933594, 4.04868745803833, 21.902679443359375]

How much memory is this list using?

<br>

In [34]:
import sys

num_bytes_python = 0
num_bytes_python += sys.getsizeof(pt_python)   # size of the list, not including the numbers

for x in pt_python:
    num_bytes_python += sys.getsizeof(x)       # size of each number
    
num_bytes_python

937904

<br>

How many bytes per value? More than the 8 bytes that a 64-bit floating-point number needs?

<br>

In [35]:
num_bytes_python / len(pt_python)

32.55593738067965
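
Where do the roughly 32.5 bytes per value come from? A minimal sketch, assuming 64-bit CPython (the exact sizes are implementation details and can differ between versions and platforms): each `float` is a full Python object, and the list stores a pointer to each one. The names `values`, `per_float`, and `per_pointer` are made up for this illustration.

```python
import sys

# Hypothetical stand-in for pt_python: 1000 distinct float objects.
values = [float(i) + 0.5 for i in range(1000)]

# Size of one float object: object header plus the 8-byte value.
per_float = sys.getsizeof(values[0])

# Storage the list itself spends per element: one pointer each,
# plus any spare capacity the list reserved while growing.
per_pointer = (sys.getsizeof(values) - sys.getsizeof([])) / len(values)

print(per_float, per_pointer, per_float + per_pointer)
```

The sum of the two numbers is the per-value cost measured above: the 8 bytes of actual data are only a fraction of it.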

Get the same data as a NumPy array (from an HDF5 file).

<br>

In [36]:
import h5py

dataset_hdf5 = h5py.File("data/SMHiggsToZZTo4L.h5", "r")   # open read-only

pt_numpy = dataset_hdf5["particles"]["pt"][:]
pt_numpy

array([63.04387  , 38.120346 ,  4.0486875, ..., 60.098644 ,  3.7663147,
       21.205685 ], dtype=float32)

<br>

Are the values in the list and the array all the same?

In [37]:
assert len(pt_python) == len(pt_numpy)

for list_x, array_x in zip(pt_python, pt_numpy):
    assert list_x == array_x
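
The same check can also be written without a Python `for` loop. A small sketch on made-up values (`array_values` and `list_values` are hypothetical stand-ins for `pt_numpy` and `pt_python`), using `np.array_equal`, which compares shapes and then every element:

```python
import numpy as np

array_values = np.array([63.04387, 38.120346, 4.0486875], dtype=np.float32)
list_values = array_values.tolist()   # Python floats holding exactly the same values

# True only if both have the same shape and all elements are equal.
same = np.array_equal(list_values, array_values)
print(same)
```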

How does their memory use compare?

In [38]:
sys.getsizeof(pt_numpy) / len(pt_numpy)

4.003887673990767

<br>

In [39]:
num_bytes_python / len(pt_python)

32.55593738067965
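
The 4 bytes per value is no accident: a `float32` array stores each number in exactly 4 bytes, plus one fixed-size header for the whole array. A small sketch on a made-up array (`example` is hypothetical, standing in for `pt_numpy`), using the `itemsize` and `nbytes` attributes:

```python
import sys
import numpy as np

example = np.arange(1000, dtype=np.float32)   # hypothetical stand-in for pt_numpy

print(example.itemsize)   # bytes per value
print(example.nbytes)     # bytes of data: number of values * itemsize

# What remains is the array's header, which does not grow with the data.
print(sys.getsizeof(example) - example.nbytes)
```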

<br>

How does speed of computation compare? (Note the units.)

In [40]:
%%timeit

pt2_python = []
for x in pt_python:
    pt2_python.append(x**2)

1.56 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


<br>

In [41]:
%%timeit

pt2_numpy = pt_numpy**2

1.67 µs ± 11.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
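
`%%timeit` only works in IPython and Jupyter. Outside a notebook, a similar comparison can be sketched with the standard library's `timeit` module. The data here is synthetic (`values_list` and `values_array` are made-up stand-ins for `pt_python` and `pt_numpy`), and the timings depend on the machine, so none are shown.

```python
import timeit
import numpy as np

# Synthetic data with the same length and dtype as the lesson's array.
values_array = np.linspace(0.0, 100.0, 28809, dtype=np.float32)
values_list = values_array.tolist()

def square_list():
    return [x**2 for x in values_list]   # one Python-level operation per element

def square_array():
    return values_array**2               # one call; the loop runs in compiled code

repeats = 100
time_list = timeit.timeit(square_list, number=repeats) / repeats
time_array = timeit.timeit(square_array, number=repeats) / repeats
print(f"list: {time_list:.2e} s   array: {time_array:.2e} s")
```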


Memory layout of a Python list.

<br>

<center>
<img src="img/python-list-layout.svg" width="75%">
</center>

Memory layout of a NumPy array.

<br>

<center>
<img src="img/python-array-layout.svg" width="75%">
</center>
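
One way to see the difference from code, as a rough sketch: a NumPy array exposes the step size between its elements (`strides`) and its raw buffer, while a list only stores references to objects that live elsewhere. The names below are made up for the illustration.

```python
import numpy as np

small_array = np.array([1.5, 2.5, 3.5], dtype=np.float32)

print(small_array.strides)          # bytes to step from one element to the next
print(len(small_array.tobytes()))   # the whole data buffer: 3 values * 4 bytes

small_list = [1.5, 2.5, 3.5]
# Each list slot refers to a separate float object. id() returns each
# object's identity (its memory address in CPython, an implementation detail).
print([id(x) for x in small_list])
```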

This also hints at a limitation of NumPy: a single array can't hold values of different types.

<br>

Python list: data type is a property of each element.

In [43]:
type(pt_python[0])

float

<br>

NumPy array: data type is a property of the whole array.

In [44]:
pt_numpy.dtype

dtype('float32')

In [45]:
pt_numpy.dtype.type

numpy.float32

<br>

(Caveat: NumPy does have a `dtype('object')` for storing arbitrary Python objects, but such an array holds pointers just like a list does, so it offers essentially no memory or speed advantage over a Python list.)
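
A short sketch of what this means in practice: given mixed Python values, NumPy converts them all to one common type, and it only keeps the original objects when `dtype=object` is requested. The inputs are arbitrary examples.

```python
import numpy as np

mixed = np.array([1, 2.5, True])   # an int, a float, and a bool
print(mixed.dtype)                 # a single dtype for the whole array
print(mixed)                       # every value was converted to that dtype

as_objects = np.array([1, 2.5, "three"], dtype=object)   # explicit object array
print(as_objects.dtype)
print([type(x) for x in as_objects])   # the original Python types are preserved
```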

## NumPy

NumPy is a third-party (but central!) library for storing data and running computations _in_ C, controlled _from_ Python.

<center>
<img src="img/Numpy_Python_Cheat_Sheet.svg" width="75%">
</center>
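
As a small taste of that style, here is a sketch with made-up momentum components (`px` and `py` are hypothetical, not taken from the dataset): the whole calculation is written as one array expression, and the loop over elements happens inside NumPy.

```python
import numpy as np

px = np.array([3.0, 6.0, 5.0], dtype=np.float32)    # made-up x components
py = np.array([4.0, 8.0, 12.0], dtype=np.float32)   # made-up y components

pt = np.sqrt(px**2 + py**2)   # elementwise, with no Python for loop
print(pt)
print(pt.mean(), pt.max())    # reductions over the whole array
```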