<a href="https://scipp.github.io"><img src="https://scipp.github.io/_static/logo-2022.svg" width="600" /></a>

# Multi-dimensional arrays with labeled dimensions and physical units

## [scipp.github.io](https://scipp.github.io)

In [None]:
%matplotlib inline
import numpy as np
import scipp as sc
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1234)

In [None]:
def plot(*x):
    """
    Useful plot function for 1d and 2d data
    """
    fig, ax = plt.subplots()
    for a in x:
        if a.ndim == 1:
            ax.plot(np.arange(len(a)), a)
        elif a.ndim == 2:
            ax.imshow(a, origin="lower")

def scatter(x, y):
    """
    Simple scatter plot
    """
    fig, ax = plt.subplots()
    ax.scatter(x, y, marker=".", s=1)
    ax.set_aspect("equal")
    ax.set_xlim(x.min(), x.max())
    ax.set_ylim(y.min(), y.max())
    return ax

## 1. Introduction to labeled dimensions: why do we need them?

In [None]:
ny, nx = 10, 20
a = np.sin(np.arange(ny) / (ny / 4)).reshape((-1, 1)) * np.cos(np.arange(nx) / (ny / 4))
a.shape

In [None]:
plot(a)

In [None]:
# Slice out row number 4
plot(a[4, :])

### We can't always deduce from the shape

In [None]:
ny, nx = 20, 20
a = np.sin(np.arange(ny) / (ny / 4)).reshape((-1, 1)) * np.cos(np.arange(nx) / (ny / 4))
a.shape

In [None]:
plot(a)

In [None]:
# Not always obvious which dimension is which
plot(a[:, 4], a[4, :])

### The situation gets worse with more dimensions

Say I now have an array that has 4 dimensions: `x, y, z, time` (in that order, maybe?)

In [None]:
a = np.random.random([20] * 4)
a.shape

I want to get the first `z` slice...

Which one was it again?

In [None]:
z_slice = a[:, :, 0, :]  # x,y,z,t
z_slice = a[0, :, :, :]  # z,y,x,t
z_slice = a[:, :, :, 0]  # t,x,y,z

Quiz: Which one selects the 4th `z`, 3rd `x` and 5th `y`, from the 10th to the 15th `t`?
1. `a[3, 5, 4, 9:15]`
2. `a[9:15, 2, 4, 3]`
3. `a[2, 4, 3, 9:15]`
4. `a[3, 4, 5, 9:15]`

<br><br>

### Introducing labeled dimensions

<img src="https://docs.xarray.dev/en/stable/_static/dataset-diagram-logo.png" width="220" /> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <img src="https://scipp.github.io/_static/logo-2022.svg" width="220" />

[Xarray](https://docs.xarray.dev/en/stable/index.html) (https://docs.xarray.dev) introduced labels to multi-dimensional NumPy arrays.

"*real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc.*"

We have embraced, and to a large extent copied, the Xarray mechanism.

In [None]:
var = sc.array(dims=["x", "y", "z", "time"], values=a)
var

Getting the `z` slice is now easy and **readable**

In [None]:
var["z", 0]

Quiz: Which one selects the 4th `z`, 3rd `x` and 5th `y`, from the 10th to the 15th `t`?
1. `var["x", 3]["y", 5]["z", 4]["time", 9:15]`
2. `var["x", 9:15]["y", 2]["z", 4]["time", 9:15]`
3. `var["x", 2]["y", 4]["z", 3]["time", 9:15]`
4. `var["x", 3]["y", 4]["z", 3]["time", 9:15]`

Easy!

<br><br>

### Adding coordinates

- Coordinates can be specified for each dimension.
- They describe the extent of each axis, as well as how far each data point is from its neighbours.

In [None]:
data = sc.array(dims=["space", "time"], values=rng.random((5, 10)))
sc.show(data)

In Scipp and Xarray, coordinates are added in a data structure called `DataArray`:

In [None]:
da = sc.DataArray(
    data=data,
    coords={
        "altitude": sc.linspace("space", 0, 8000, 5),
    },
)
sc.show(da)

In [None]:
da

### Exercise 1. Add a `year` coordinate covering `[2000, 2010)` to the `time` dimension.
Each column (covering all `altitude`s) was collected once per year, from 2000 to 2009.

**Hint: You can create a ``Variable`` with consecutive numbers by ``sc.arange(dim, start, stop)``.**

In [None]:
da = sc.DataArray(
    data=data,
    coords={
        "year": sc.arange("", 2000, ),
        "altitude": sc.linspace("space", 0, 8000, 5),
    },
)
sc.show(da)
da

**Solution:**

In [None]:
da = sc.DataArray(
    data=data,
    coords={
        "year": sc.arange("time", 2000, 2010),
        "altitude": sc.linspace("space", 0, 8000, 5),
    },
)
sc.show(da)
da

### Exercise 2. Add a `year` coordinate based on `scipp-year`.
The data was recorded using `scipp-year` instead of the Gregorian `year`.
Please add a Gregorian `year` coordinate.
> Hint: `scipp` was first released in 2020.

In [None]:
da = sc.DataArray(
    data= sc.array(dims=["space", "time"], values=rng.random((5, 24))),
    coords={
        "scipp-year": sc.arange("time", -20, 4),
        "altitude": sc.linspace("space", 0, 80000, 5)
    },
)
sc.show(da)
da

**Hint:**

In [None]:
da.coords['year'] = da.coords['scipp-year'] + 
sc.show(da)
da

**Solution:**

In [None]:
da.coords['year'] = da.coords['scipp-year'] + 2020
sc.show(da)
da

<br><br>

## 2. Going further

<img src="https://scipp.github.io/_static/logo-2022.svg" width="220" />

### 2.1 Physical units

Every data variable and coordinate in Scipp has physical units.

(See also [pint](https://pint.readthedocs.io/en/stable/), [astropy.units](https://docs.astropy.org/en/stable/units/index.html), [pint-xarray](https://pint-xarray.readthedocs.io/en/stable/), ...)

In [None]:
x = sc.array(dims=['row'], values=rng.normal(size=10000), unit='cm')
y = sc.array(dims=['row'], values=rng.normal(size=10000), unit='cm')
recording = sc.DataArray(data=sc.ones(sizes=x.sizes, unit='counts'),
                         coords={'x': x, 'y': y})
image = recording.hist(y=100, x=100)
image

In [None]:
image.plot(aspect="equal")

In [None]:
integration_time = sc.scalar(300.0, unit="s")
image /= integration_time
print(image.unit)

image.plot(aspect="equal")

<br><br>

### Units also provide protection

Say I now have a background image (dark frame) that I want to subtract from the signal image above,
but I forgot to normalize it by its integration time first.
The subtraction then fails with a unit error instead of silently producing wrong numbers.

In [None]:
background = sc.array(dims=["y", "x"], values=rng.random((100, 100)), unit="counts")

image - background

In [None]:
background_integration_time = sc.scalar(60.0, unit="s")
background /= background_integration_time

background_subtracted = image - background

In [None]:
background_subtracted.plot(aspect="equal")

- Units help catch difficult-to-spot bugs early in a workflow.
- They save **hours** of debugging time, free up mental capacity and let the user focus on the important thing: **doing science**.

<br><br>

### Units for label-based indexing

We also use units to distinguish between positional indexing and label-based indexing:

In [None]:
image['x', 0.5 * sc.Unit('cm')].plot()

Positional indexing is based on the position along a dimension, whereas label-based (value) indexing is based on the coordinate values.

In [None]:
da = sc.DataArray(
    data=sc.array(dims=["space", "time"], values=rng.random((5, 9))),
    coords={
        "time": sc.arange("time", 19, 28, unit='s'),
        "altitude": sc.linspace("space", 0, 800, 5, unit='m')
    },
)
sc.show(da)
da

We want to select the data where the `time` is `20` seconds.

In [None]:
da.coords['time']  # 20 seconds is at index 1 (the second value) of the `time` coordinate.

So we can select the slice at index 1 of the `time` dimension.

In [None]:
da['time', 1]

But instead, we can use a value with a unit to select the data corresponding to a `time` of `20` seconds.

In [None]:
da['time', sc.scalar(20, unit='s')]

Quiz: Which of these select the data slice where `altitude` is `600 m`? (Multiple answers possible.)

1. `da['altitude', 3]`
2. `da['space', 3]`
3. `da['altitude', sc.scalar(600, unit='m')]`
4. `da['space', sc.scalar(600, unit='m')]`

You can also use it for selecting a range.

In [None]:
da['time', sc.scalar(20, unit='s'):sc.scalar(24, unit='s')]

Quiz: Which of these select the data slice where `altitude` is between `300 m` and `700 m`? (Multiple answers possible.)

1. `da['altitude', 2:4]`
2. `da['space', 2:4]`
3. `da['altitude', sc.scalar(300, unit='m'):sc.scalar(700, unit='m')]`
4. `da['space', sc.scalar(300, unit='m'):sc.scalar(700, unit='m')]`

### Exercise 3. Coordinate and Units

Instead of `altitude`, we want to use a `pressure` coordinate for the `space` dimension.

Here is the incomplete function `altitude_to_pressure` that converts `altitude[m]` into `pressure[hPa]`.

Complete the function and use it to add the `pressure` coordinate.

In [None]:
def altitude_to_pressure(altitude):
    p_b = sc.scalar(1013.25, unit='hPa')
    return p_b*(sc.scalar(1) - altitude/sc.scalar(44307, unit=''))**5

da.coords['pressure'] = altitude_to_pressure(da.coords['altitude'])
da

Now we can drop the unnecessary coordinate, `altitude`.

In [None]:
da = da.drop_coords(['altitude'])
da

<br><br>

### 2.2 Bin-edge coordinates

- It is sometimes necessary to have coordinates that represent a range for each data value.
- E.g. "the temperature was 310 K in the time span between 10 and 20 seconds".
- This also arises every time we histogram data, as in the image above.
- Scipp supports this by having **bin-edge coordinates**: a coordinate which has a length of 1 more than the dimension length.

In [None]:
image = recording.hist(y=8, x=8)
sc.show(image)

In [None]:
image

In [None]:
image.plot(aspect='equal')

- NumPy and Matplotlib return the bin edges and the data counts separately.
- We have everything stored inside a single data structure.
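
For comparison, here is a minimal sketch of the first point using only NumPy. The sample values and the names `demo_rng` and `samples` are assumptions made for this illustration and are unrelated to `recording`.

```python
import numpy as np

# Illustrative sample values (an assumption for this sketch).
demo_rng = np.random.default_rng(seed=0)
samples = demo_rng.normal(size=1000)

# NumPy returns the counts and the bin edges as two separate arrays.
counts, edges = np.histogram(samples, bins=8)

print(counts.shape)  # one count per bin
print(edges.shape)   # one more edge than there are bins
```

Keeping `counts` and `edges` in sync through later slicing or arithmetic is then left to the user, which is the bookkeeping a bin-edge coordinate takes care of.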

### Exercise 4. Bin-edge coordinates and Units

We would like to investigate data by subtracting `background` from `recording`.

Here is the collected data `recording`.

In [None]:
b_x = sc.array(dims=['row'], values=rng.normal(size=100000-200), unit='cm')
b_y = sc.array(dims=['row'], values=rng.normal(size=100000-200), unit='cm')

x1 = rng.random(size=100)*2 + 2
x2 = rng.random(size=100)*2 + 2
y1 = (np.sqrt(1 - (np.abs(x1-3)*2 - 1)**2))/3 - 2
y2 = (np.arccos(1-np.abs(x2-3)*2) - 3.14)/3 - 2
s_x = sc.array(dims=['row'], values = np.concatenate((x1, x2)), unit='cm')
s_y = sc.array(dims=['row'], values = np.concatenate((y1, y2)), unit='cm')

x = sc.concat([s_x, b_x], dim='row')
y = sc.concat([s_y, b_y], dim='row')

recording = sc.DataArray(data=sc.ones(sizes=x.sizes, unit='counts'),
                         coords={'x': x, 'y': y})

In [None]:
recording

And the `recording` was collected for `recording_time` seconds.

In [None]:
recording_time = sc.scalar(100000, unit='s')
recording_time

#### 4-1. First, we want to make a 100 by 100 histogram of the `recording` per second.

**Hint:**

In [None]:
signal = recording.hist(y=, x=)/recording_time
signal.plot(aspect='equal')

**Solution:**

In [None]:
signal = recording.hist(y=100, x=100)/recording_time
signal.plot(aspect='equal')

#### 4-2. Let's subtract the `background` from the `signal`. The `background` was collected for `50,000` seconds.

In [None]:
b_x = sc.array(dims=['row'], values=rng.normal(size=50_000), unit='cm')
b_y = sc.array(dims=['row'], values=rng.normal(size=50_000), unit='cm')

bg_recording = sc.DataArray(data=sc.ones(sizes=b_x.sizes, unit='counts'),
                          coords={'x': b_x, 'y': b_y})
bg_recording

**Hint:**

In [None]:
background_recording_time = sc.scalar( , unit='s')

background = bg_recording.hist(y=signal.coords[''], x=signal.coords[''])
background /= background_recording_time

subtracted = signal - background
subtracted.plot(aspect='equal')

**Solution:**

In [None]:
background_recording_time = sc.scalar(50_000, unit='s')

background = bg_recording.hist(y=signal.coords['y'], x=signal.coords['x'])
background /= background_recording_time

subtracted = signal - background
subtracted.plot(aspect='equal')

<br><br>

## 3. Binned data

Scipp distinguishes **histogrammed** data from **binned** data:

- Histogrammed data refers to regular dense arrays of, e.g., floating-point values with an associated bin-edge coordinate.
- Binned data refers to the precursor of histogrammed data, i.e., each bin contains a “list” of contributing events or values. Binned data can be converted into a histogram by computing the sum over all events or values in a bin.
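
The distinction can be pictured with a rough sketch in plain NumPy. This is only an illustration of the concept, not how Scipp stores binned data internally; the event values, the bin edges and all variable names are assumptions.

```python
import numpy as np

# Hypothetical event positions and bin edges (assumptions for this sketch).
events = np.array([0.1, 0.4, 0.45, 1.2, 1.9, 2.5])
edges = np.array([0.0, 1.0, 2.0, 3.0])

# Bin index of each event: edges[i] <= event < edges[i + 1].
index = np.digitize(events, edges) - 1

# "Binned" view: every bin keeps the list of events that fell into it.
binned_events = [events[index == i].tolist() for i in range(len(edges) - 1)]

# "Histogrammed" view: only the number of events per bin survives.
histogram = [len(contents) for contents in binned_events]

print(binned_events)
print(histogram)
```

The histogram can always be computed from the binned view, but not the other way around.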

<!-- 
TODO: Add image file.
<img src="binned_drawing.svg" />
-->

This is conceptually similar to a multi-dimensional <a href="https://awkward-array.org/doc/main/"><img src="https://iris-hep.org/assets/logos/awkward.svg" width="100" /></a>.

It is best illustrated with an example of data analysis.
For this, we will use one of the NYC taxi datasets.

<br><br>

### NYC yellow taxi dataset

<img src="https://vaex.readthedocs.io/en/latest/_images/datasets_2_1.png" /> <img src="https://cdn-images-1.medium.com/v2/resize:fit:2680/1*fqrY2h4uLD3eKEvJ6hlI2g.png" width="600" />

(https://vaex.readthedocs.io/en/latest/datasets.html, Dataset from 2015, obtained as an HDF5 file from the Vaex docs,
and subsequently cleaned of outliers).

For today, we will use a small subset of it.

In [None]:
!wget https://public.esss.dk/groups/scipp/dmsc-summer-school/scipp/nyc_taxi_data_2015_small.h5

In [None]:
%matplotlib widget

da = sc.io.load_hdf5('nyc_taxi_data_2015_small.h5')
da

In [None]:
n = 1000
x = da.coords["dropoff_longitude"].values[::n]
y = da.coords["dropoff_latitude"].values[::n]
scatter(x, y)

### Binning the data records

- Working with binned data is most efficient when keeping the number of bins relatively low.
- Binning is essentially like overlaying a grid of bin edges onto our data

In [None]:
ax = scatter(x, y)
for lon in np.linspace(*ax.get_xlim(), 9):
    ax.axvline(lon, color="gray")
for lat in np.linspace(*ax.get_ylim(), 9):
    ax.axhline(lat, color="gray")

In [None]:
# Bin into 8 longitude & latitude bins
binned = da.bin(dropoff_latitude=8, dropoff_longitude=8)
binned

In [None]:
sc.show(binned)

In [None]:
# Histogramming is summing all the counts in each bin

binned_sum = binned.bins.sum()
binned_hist = binned.hist()
data_hist = da.hist(dropoff_latitude=8, dropoff_longitude=8)

data_hist.plot(aspect="equal", norm="log") + binned_hist.plot(aspect="equal", norm="log") + binned_sum.plot(aspect="equal", norm="log")

<br><br>

### Selecting/slicing bins

- Binning *groups* the data into bins, but keeps the underlying table of records beneath
- **No information is lost, it is simply re-ordered**
- The bins can then be used for slicing the data, providing extremely efficient data selection and filtering
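
One way to picture why re-ordering makes selection cheap is the NumPy sketch below. It is a conceptual illustration under assumed data (a hypothetical table with `position` and `fare` columns), not Scipp's implementation.

```python
import numpy as np

# Hypothetical table of records: one coordinate column and one payload column.
position = np.array([2.5, 0.1, 1.9, 0.4, 1.2, 0.45])
fare = np.array([30.0, 5.0, 12.0, 7.0, 9.0, 6.0])
edges = np.array([0.0, 1.0, 2.0, 3.0])

# Sort the records by bin index; a stable sort keeps the original order within a bin.
index = np.digitize(position, edges) - 1
order = np.argsort(index, kind="stable")
position_sorted, fare_sorted = position[order], fare[order]

# Offsets marking where each bin begins and ends in the re-ordered table.
offsets = np.searchsorted(index[order], np.arange(len(edges)))

# Selecting bin 1 is now a contiguous slice of the table; no record was dropped.
begin, end = offsets[1], offsets[2]
print(position_sorted[begin:end], fare_sorted[begin:end])
```

Every record is still present after the sort, so a selected bin can be histogrammed again at any resolution.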

In [None]:
manh = binned["dropoff_longitude", 1]["dropoff_latitude", 4]
manh

In [None]:
# We can now histogram this with a much finer resolution

manh.hist(dropoff_latitude=300, dropoff_longitude=300).plot(norm="log", aspect="equal")

In [None]:
# We select another bin, which contains the JFK airport

jfk = binned["dropoff_longitude", 6]["dropoff_latitude", 1]
jfk.hist(dropoff_latitude=300, dropoff_longitude=300).plot(norm="log", aspect="equal")

![jfk](https://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/JFK_airport_terminal_map.png/640px-JFK_airport_terminal_map.png)

(https://commons.wikimedia.org/wiki/File:JFK_airport_terminal_map.png)

<br><br>

### Binning into a new dimension

- Data that has already been binned can also be binned further into new dimensions

In [None]:
manh

- We look at the trip distances inside the Manhattan and JFK bins we selected above.

In [None]:
# Use 100 distance bins
manh_dist = manh.bin(trip_distance=100)
manh_dist

In [None]:
manh_dist.hist().plot()

In [None]:
jfk_dist = jfk.bin(trip_distance=100)
jfk_dist.hist().plot()

<br><br>


### Other operations on bins: what is the fare amount as a function of distance?

- In addition to summing/histogramming, bins can be used for other reduction operations: `min()`, `max()`, and `mean()`.
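
As a conceptual sketch of such per-bin reductions, written in plain NumPy rather than with Scipp's `bins` API, and with made-up `distance` and `fare` values:

```python
import numpy as np

# Hypothetical trips: distance is the binning coordinate, fare is the value to reduce.
distance = np.array([0.5, 0.8, 1.5, 1.7, 2.2])
fare = np.array([4.0, 6.0, 9.0, 11.0, 15.0])
edges = np.array([0.0, 1.0, 2.0, 3.0])

index = np.digitize(distance, edges) - 1

# Reduce the fares that fall into each distance bin.
# Every bin is non-empty here; an empty bin would need special handling.
reduced = {}
for i in range(len(edges) - 1):
    in_bin = fare[index == i]
    reduced[i] = (in_bin.min(), in_bin.max(), in_bin.mean())

print(reduced)
```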

In [None]:
manh_dist

- To get the minimum and maximum fares for all trips that ended inside our Manhattan area, we can do

In [None]:
manh_dist.bins.coords['fare_amount'].min(), manh_dist.bins.coords['fare_amount'].max()

- These values are somewhat strange, indicative of bad data in the table.
- We restrict our fare range from \\$0 to \\$200.

In [None]:
# Make 100 bins between 0 and 200 dollars
nbins = 100
fare_bins = sc.linspace('fare_amount', 0, 200, nbins + 1, unit='dollar')

# Bin & plot our data
manh_dist.bin(fare_amount=fare_bins).hist().transpose().plot(norm="log")

Some things we can say about the data:

- there appears to be a (somewhat expected) correlation between fare amount and trip distance: the further you go, the more you'll have to pay
- for a given trip distance, clients usually pay above the diagonal line, rarely below
- there appears to be a magic fare amount of \\$52 that will take you anywhere from 0 to 60 miles!

<br><br>

## 4. Plopp: interactive data visualization tools

<img src="https://scipp.github.io/plopp/_static/logo.svg" width="200" />

https://scipp.github.io/plopp 

In [None]:
import plopp as pp

fare_lat_lon = da.hist(fare_amount=fare_bins, dropoff_latitude=300, dropoff_longitude=300)
fare_lat_lon

In [None]:
pp.inspector(fare_lat_lon, dim='fare_amount', norm='log')

### Final Exercise.