# Introduction to Scipp

<a href="https://scipp.github.io"><img src="../images/scipp-logo-2022.svg" width="400" /></a>

<h4><i>Multi-dimensional arrays with labeled dimensions and physical units</i></h4>

<h3><a href="https://scipp.github.io">scipp.github.io</a></h3>

<br><br>

Scipp is an open-source library developed by ESS for handling, manipulating and visualizing multi-dimensional data arrays.

It enriches raw NumPy-like arrays by adding named dimensions and associated coordinates.
In addition, it supports

- Physical units which are handled in arithmetic operations
- Histograms, i.e., bin-edge axes, which are one element longer than the data extent
- Propagation of uncertainties

<br><br>

In [None]:
%matplotlib inline
import numpy as np
import scipp as sc
import matplotlib.pyplot as plt

from scipp_utils import quiz, plot, scatter, fetch_data

rng = np.random.default_rng(seed=1234)

<br><br><br><br>

## 1. Labeled dimensions: why do we need them?

Say we have a 2D rectangular array of data

In [None]:
ny, nx = 10, 20
a = np.sin(np.arange(ny) / (ny / 4)).reshape((-1, 1)) * np.cos(np.arange(nx) / (ny / 4))
a.shape

that looks like

In [None]:
plot(a)

The task is now to slice out row number 4.
Because of the shape of the array, we know that the row dimension is the smallest, so we slice the first dimension of the 2D array:

In [None]:
# Slice out row number 4
plot(a[4, :])

### We can't always deduce from the shape

Now say we have an array which has a square shape:

In [None]:
ny, nx = 20, 20
a = np.sin(np.arange(ny) / (ny / 4)).reshape((-1, 1)) * np.cos(np.arange(nx) / (ny / 4))
a.shape

In [None]:
plot(a)

Do we slice the first or the second index of the 2D array?

In [None]:
# Not always obvious which dimension is which
plot(a[:, 4], a[4, :])

### The situation gets worse with more dimensions

Say we now have an array that has 4 dimensions: `x, y, z, t` (in that order, maybe? Or is it `z, y, x, t`, or `t, x, y, z`?)

In [None]:
a = np.random.random([20] * 4)
a.shape

**Quiz time!**

In [None]:
quiz(1)

<br><br>

### Introducing labeled dimensions

<img src="../images/Xarray_Logo_RGB_Final.svg" width="220" /> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <img src="../images/scipp-logo-2022.svg" width="220" />

[Xarray](https://docs.xarray.dev/en/stable/index.html) introduced labels to multi-dimensional NumPy arrays.

"*real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc.*"

We have embraced, and to a large extent copied, the Xarray mechanism.

In [None]:
var = sc.array(dims=["x", "y", "z", "time"], values=a)
var

**Quiz time again!**

Can you guess the syntax?

In [None]:
quiz(2)

<br><br>

Getting the `z` slice is now easy and **readable**.
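
With labeled dimensions, a slice is requested by name using the `obj[dim, index]` form that later cells in this notebook rely on (for example `binned["dropoff_longitude", 1]`). The following plain-Python sketch illustrates the idea behind such a lookup; `index_by_name` is a hypothetical helper written for this illustration, not part of Scipp or Xarray.

```python
def index_by_name(dims, dim, index):
    """Hypothetical helper: turn a (dimension name, index) pair into a
    positional index tuple for an array whose axes follow `dims`."""
    if dim not in dims:
        raise KeyError(f"Unknown dimension {dim!r}; expected one of {dims}")
    axis = dims.index(dim)
    # Keep every axis whole except the one that was requested by name.
    return tuple(index if i == axis else slice(None) for i in range(len(dims)))


dims = ("x", "y", "z", "time")
positional = index_by_name(dims, "z", 4)
print(positional)
```

The caller never needs to remember that `z` is the third axis; only the name matters, which is what makes the labeled version readable.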

<br><br>

### Adding coordinates

- Coordinates can be specified for each dimension.
- They describe the extent of each axis, as well as how far each data point is from its neighbours.

Here is an array that represents air pollution levels as a function of altitude and time.

In [None]:
data = sc.array(
    dims=["altitude", "year"],
    values=np.linspace(500, 10, 5).reshape((5, 1)) * rng.random(10),
)
sc.show(data)

In [None]:
data.plot()

In Scipp and Xarray, coordinates are added in a data structure called `DataArray`:

In [None]:
da = sc.DataArray(
    data=data,
    coords={
        "altitude": sc.linspace("altitude", 0, 8000, 5),
    },
)
sc.show(da)

In [None]:
da

In [None]:
da.plot()

### Accessing and adding more coordinates

Coordinates are stored in a `dict`,
and each dimension can have more than one coordinate.

Getting and setting coordinates is done using the same syntax as Python dicts:

In [None]:
print(da.coords.keys())
da.coords["altitude"]

### Exercise 1.1: Adding a new coordinate

The air pollution data was collected every `year` from 2014 to 2023, i.e., the half-open range `[2014, 2024)`.<br>
Let's add a coordinate, `year`, to the `year` dimension.
> **Tip:** You can create a ``Variable`` with consecutive numbers by using ``sc.arange(dim, start, stop)``.

**Hint**

```python
da = sc.DataArray(
    data=data,
    coords={
        "altitude": sc.linspace("altitude", 0, 8000, 5),
        "year": sc.arange(..., 2014, ...)
    },
)
```
or
```python
da.coords['year'] = sc.arange(..., 2014, ...)
```

**Solution:**

In [None]:
da = sc.DataArray(
    data=data,
    coords={
        "altitude": sc.linspace("altitude", 0, 8000, 5),
        "year": sc.arange("year", 2014, 2024),
    },
)
sc.show(da)
da

### Exercise 1.2: Compute new coordinate

Add a new coordinate representing the Scipp-year.
> **Hint:** Scipp was first released in 2020

**Solution:**

In [None]:
da.coords["scipp-year"] = da.coords["year"] - 2020
sc.show(da)
da

<br><br><br><br><br><br><br><br>

## 2. Going further

<img src="../images/scipp-logo-2022.svg" width="220" />

### 2.1 Physical units

Every data variable and coordinate in Scipp has physical units.
(see also [pint](https://pint.readthedocs.io/en/stable/), [astropy.units](https://docs.astropy.org/en/stable/units/index.html), [pint-xarray](https://pint-xarray.readthedocs.io/en/stable/))

Array `Variable` with unit:

In [None]:
temperature = sc.array(dims=["time"], values=[300.0, 301.0, 312.0, 340.0], unit="K")
temperature

Scalar `Variable` (no dimensions) with unit:

In [None]:
sound_speed = sc.scalar(340.0, unit="m/s")
sound_speed

Coordinates and data with units in a `DataArray`:

In [None]:
cph_air = sc.DataArray(
    data=sc.array(
        dims=["altitude", "year"],
        values=np.linspace(500, 10, 5).reshape((5, 1)) * rng.random(10),
        unit="m^-3",
    ),
    coords={
        "altitude": sc.linspace("altitude", 0, 8000, 5, unit="m"),
        "year": sc.arange("year", 2014, 2024, unit="year"),
    },
)
cph_air

Units are automatically handled in arithmetic operations.

Say we know the mean ultra-fine particle mass

In [None]:
ultra_fine_particle_mass = sc.scalar(1.0e-6, unit="kg")

cph_air *= ultra_fine_particle_mass
cph_air

<br><br>

### Units also provide protection

Say we now also have air pollution data for another city, e.g., NYC.

We would like to compute the difference between CPH and NYC air pollution (as a function of altitude and year),
but we forgot to multiply the NYC data by particle mass:

In [None]:
nyc_air = sc.DataArray(
    data=sc.array(
        dims=["altitude", "year"],
        values=np.linspace(800, 20, 5).reshape((5, 1)) * rng.random(10),
        unit="m^-3",
    ),
    coords={
        "altitude": sc.linspace("altitude", 0, 8000, 5, unit="m"),
        "year": sc.arange("year", 2014, 2024, unit="year"),
    },
)

# This fails: cph_air is a mass density, nyc_air is still a number density
cph_air - nyc_air

In [None]:
nyc_air *= ultra_fine_particle_mass

air_difference = cph_air - nyc_air

In [None]:
air_difference.plot()

- Units are very useful in early prevention of difficult-to-spot bugs in a workflow.
- They save **hours** of debugging time, free up mental capacity and let the user focus on the important thing: **doing science**.
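
To make the mechanism concrete, here is a toy sketch of unit bookkeeping in plain Python. It is not how Scipp implements units; the `Quantity` class and the numbers are invented for this illustration. Multiplication combines unit exponents, while subtraction refuses operands whose units differ.

```python
class Quantity:
    """Toy value-with-unit; a unit is a dict mapping base unit -> exponent."""

    def __init__(self, value, unit):
        self.value = value
        # Drop zero exponents so that equal units compare equal.
        self.unit = {base: exp for base, exp in unit.items() if exp != 0}

    def __mul__(self, other):
        unit = dict(self.unit)
        for base, exp in other.unit.items():
            unit[base] = unit.get(base, 0) + exp
        return Quantity(self.value * other.value, unit)

    def __sub__(self, other):
        if self.unit != other.unit:
            raise ValueError(f"Cannot subtract {other.unit} from {self.unit}")
        return Quantity(self.value - other.value, self.unit)


particle_mass = Quantity(1.0e-6, {"kg": 1})
cph = Quantity(250.0, {"m": -3}) * particle_mass  # now in kg/m^3
nyc = Quantity(400.0, {"m": -3})                  # mass not applied yet

try:
    cph - nyc
    mismatch_caught = False
except ValueError:
    mismatch_caught = True

nyc = nyc * particle_mass  # fix the forgotten conversion
difference = cph - nyc
print(mismatch_caught, difference.value, difference.unit)
```

The forgotten multiplication surfaces immediately as an error instead of silently producing a meaningless difference.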

<br><br><br><br><br><br>

### Units for label-based indexing

We also use units to distinguish between positional indexing and label-based indexing:

In [None]:
cph_air["altitude", 2000.0 * sc.Unit("m")].plot()

Positional indices are based on the `dimension`, and value indices are based on the `coordinates`.
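
The difference between the two kinds of indexing can be sketched in plain Python. The lists and helper functions below are invented for this illustration and only mimic the behaviour: a bare integer picks a position along the dimension, whereas a value is looked up in the coordinate first.

```python
altitude = [0.0, 2000.0, 4000.0, 6000.0, 8000.0]  # coordinate values, in metres
pollution = [480.0, 350.0, 230.0, 120.0, 15.0]    # made-up data values


def by_position(data, index):
    # Positional indexing: the index counts elements along the dimension.
    return data[index]


def by_label(data, coord, value):
    # Label-based indexing: find where the coordinate equals the value,
    # then use that position. list.index raises ValueError if no match.
    return data[coord.index(value)]


print(by_position(pollution, 2))              # third element
print(by_label(pollution, altitude, 2000.0))  # element at 2000 m
```

Here position `2` corresponds to an altitude of 4000 m, while the label `2000.0` corresponds to position `1`, so the two calls return different elements.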

<br><br><br><br><br><br>

### Exercise 2: Coordinates and Units

We have a data array that contains `air pollution` as a function of `year` and `altitude` above the city of Copenhagen.
However, we want to have a `pressure` coordinate for the `altitude` dimension instead of `altitude`.

Assuming a constant air temperature $T$ of 300 K, the pressure as a function of height $h$ is given by

$$ P = P_{0} \exp{\left[ \frac{-g_{0}Mh}{RT} \right]} $$
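
To get a feel for the magnitudes, the formula can be evaluated with plain floats. This is only a numerical sketch: it reuses the constant values from the `altitude_to_pressure` function of this exercise, interpreted as the molar mass of air $M$, standard gravity $g_{0}$, the gas constant $R$ and the sea-level pressure $P_{0}$, and it carries no units, so nothing protects against unit mistakes here.

```python
import math


def pressure_hpa(h_m, T_kelvin=300.0):
    """Evaluate P = P0 * exp(-g0 * M * h / (R * T)) with plain floats."""
    M = 0.0289644   # molar mass of air, kg/mol
    g0 = 9.80665    # standard gravity, m/s^2
    R = 8.3144598   # gas constant, J/(mol K)
    p0 = 1013.25    # sea-level pressure, hPa
    return p0 * math.exp(-g0 * M * h_m / (R * T_kelvin))


for h in (0.0, 2000.0, 4000.0, 6000.0, 8000.0):
    print(f"{h:6.0f} m -> {pressure_hpa(h):7.1f} hPa")
```

The pressure starts at $P_{0}$ on the ground and decays monotonically with altitude, which is a useful sanity check for the unit-aware version.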

Here is the incomplete function `altitude_to_pressure` that converts `altitude[m]` into `pressure[hPa]`.

Complete the function and use it to add the `pressure` coordinate to `cph_air`.

In [None]:
def altitude_to_pressure(altitude):
    M = sc.scalar(0.0289644, unit="kg/mol")
    g0 = sc.scalar(9.80665, unit="m/s2")
    R = sc.scalar(8.3144598, unit="J/mol/K")
    T = sc.scalar(300.0)
    p0 = sc.scalar(1013.25, unit="hPa")
    return p0 * sc.exp(-g0 * M * altitude / (R * T))

**Solution:**

In [None]:
def altitude_to_pressure(altitude):
    M = sc.scalar(0.0289644, unit="kg/mol")
    g0 = sc.scalar(9.80665, unit="m/s2")
    R = sc.scalar(8.3144598, unit="J/mol/K")
    T = sc.scalar(300.0, unit="K")
    p0 = sc.scalar(1013.25, unit="hPa")
    return p0 * sc.exp(-g0 * M * altitude / (R * T))


cph_air.coords["pressure"] = altitude_to_pressure(cph_air.coords["altitude"])
cph_air

<br><br><br><br><br><br><br><br><br><br>

### 2.2 Histogramming and bin-edge coordinates

- It is sometimes necessary to have coordinates that represent a range for each data value.
- E.g., "the temperature was 310 K in the time span between 10 and 20 seconds".
- This also arises every time we histogram data.
- Scipp supports this by having **bin-edge coordinates**: a coordinate which has a length of 1 more than the dimension length.
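
The relation between bin edges and data length can be illustrated with a small plain-Python histogram. This is a sketch for illustration only, not Scipp's implementation; the sample values are invented, and treating every bin as half-open `[left, right)` is an assumption of this sketch.

```python
from bisect import bisect_right


def histogram(values, edges):
    """Count how many values fall into each [left, right) bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v < edges[-1]:
            # bisect_right gives the index of the first edge greater than v,
            # so the bin index is one less.
            counts[bisect_right(edges, v) - 1] += 1
    return counts


edges = [0.0, 10.0, 20.0, 30.0]        # 4 bin edges, in seconds
times = [2.0, 11.0, 12.5, 19.9, 25.0]  # made-up event times
counts = histogram(times, edges)
print(counts, len(edges), len(counts))
```

Three bins need four edges, which is why a bin-edge coordinate is one element longer than the dimension it describes.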

The next data set is meant to represent photon events in a camera.
We have a long list of `x` and `y` positions for the photons.

In [None]:
x = sc.array(dims=["row"], values=rng.normal(size=10000), unit="cm")
y = sc.array(dims=["row"], values=rng.normal(size=10000), unit="cm")
recording = sc.DataArray(
    data=sc.ones(sizes=x.sizes, unit="counts"), coords={"x": x, "y": y}
)
recording

In [None]:
scatter(x.values, y.values)

It is very common to histogram such data.

In Scipp, histogramming has a very concise and easy-to-use syntax.
To make 8 bins in both the `x` and `y` dimensions:

In [None]:
image = recording.hist(y=8, x=8)
image.plot(aspect="equal")

The `x` and `y` coordinates are now **bin-edge** coordinates.

In [None]:
sc.show(image)
image

- NumPy and Matplotlib return the bin edges and the data counts separately.
- We have everything stored inside a single data structure.

You can, of course, adjust the number of bins:

In [None]:
recording.hist(y=100, x=100).plot(aspect="equal")

<br><br><br><br><br>

### Exercise 3: Histogramming

We found a 2D detector that reads your mood!

We recorded a signal with it, and now we can visualize the signal by histogramming.

In [None]:
from scipp_utils import load_signal_to_histogram
signal_rng = np.random.default_rng(1)
signal = load_signal_to_histogram(signal_rng)
signal

#### Exercise 3-1: Number of bins for histogramming.

First, we need to find the right number of bins to histogram the signal.

We tried 200 bins and 4 bins for each axis, but neither seems meaningful!

In [None]:
signal.hist(x=200, y=200).plot() + signal.hist(x=4, y=4).plot()

**Solution:**

In [None]:
# 30 to 50 bins are enough to see the meaningful shape!
signal.hist(x=50, y=50).plot() + signal.hist(x=30, y=30).plot()

#### Exercise 3-2: Custom histogram edges.

However, there is a suspicious hot spot in the very middle of the image.

We want to investigate the signal within a specific range of ``x`` and ``y``.

Let's histogram the hot spot and see what is in there.

You can histogram the data with custom histogram edges like below.

**Hint:**

In [None]:
hist_edges_x = sc.linspace(dim='x', start=-10, stop=10, unit='cm', num=200)
hist_edges_y = sc.linspace(dim='y', start=-10, stop=10, unit='cm', num=200)
signal.hist(x=hist_edges_x, y=hist_edges_y).plot()

**Solution:**

In [None]:
# There was a smiley in the middle of the heart!

hist_edges_x = sc.linspace(dim='x', start=-0.15, stop=0.15, unit='cm', num=50)
hist_edges_y = sc.linspace(dim='y', start=-0.15, stop=0.15, unit='cm', num=50)
signal.hist(x=hist_edges_x, y=hist_edges_y).plot()

#### Hidden Exercise

The smiley is smiling but not really!

You can find a teardrop on the right eye.

In [None]:
tear_range_x = sc.linspace(dim='x', start=0.015, stop=0.021, unit='cm', num=32)
tear_range_y = sc.linspace(dim='y', start=0.056, stop=0.064, unit='cm', num=32)
signal.hist(x=tear_range_x, y=tear_range_y).plot()

<br><br><br><br><br><br><br><br><br><br><br><br>

## 3. Binned data

Scipp distinguishes **histogrammed** data from **binned** data:

- Histogrammed data refers to regular dense arrays of, e.g., floating-point values with an associated bin-edge coordinate.
- Binned data refers to the precursor of histogrammed data, i.e., each bin contains a “list” of contributing events or values. Binned data can be converted into a histogram by computing the sum over all events or values in a bin.

![binned](../images/binned_drawing.svg)

This is conceptually similar to a multi-dimensional <a href="https://awkward-array.org/doc/main/"><img src="../images/awkward.svg" width="100" /></a>.

It is best illustrated with an example of data analysis.
For this, we will use one of the NYC taxi datasets.

<br><br>

### NYC yellow taxi dataset

<img src="../images/taxi_datasets_2_1.png" /> <img src="../images/taxi_dataset_table.png" width="600" />

(https://vaex.readthedocs.io/en/latest/datasets.html, Dataset from 2015, obtained as an HDF5 file from the Vaex docs,
and subsequently cleaned of outliers).

For today, we will use a small subset of it.

In [None]:
file = fetch_data("4-reduction/nyc_taxi_data_2015_small")

In [None]:
# %matplotlib widget

da = sc.io.load_hdf5(file)
da

In [None]:
n = 100
x = da.coords["dropoff_longitude"].values[::n]
y = da.coords["dropoff_latitude"].values[::n]
scatter(x, y)

### Binning the data records

- Working with binned data is most efficient when keeping the number of bins relatively low.
- Binning is essentially like overlaying a grid of bin edges onto our data.

In [None]:
ax = scatter(x, y, get_ax=True)
for lon in np.linspace(*ax.get_xlim(), 9):
    ax.axvline(lon, color="gray")
for lat in np.linspace(*ax.get_ylim(), 9):
    ax.axhline(lat, color="gray")

In [None]:
# Bin into 8 longitude & latitude bins
binned = da.bin(dropoff_latitude=8, dropoff_longitude=8)
binned

In [None]:
# Histogramming is summing all the counts in each bin
binned_sum = binned.bins.sum()

binned_sum.plot(aspect="equal", norm="log")

<br><br><br><br>

### Selecting/slicing bins

- Binning *groups* the data into bins, but keeps the underlying table of records.
- **No information is lost, it is simply re-ordered.**
- The bins can then be used for slicing the data, providing extremely efficient data selection and filtering.
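
The idea that binning only re-orders records can be sketched in plain Python. The records, field names and edges below are invented for this illustration, and the code does not reflect how Scipp stores binned data internally.

```python
from bisect import bisect_right

records = [
    {"x": 0.2, "fare": 5.0},
    {"x": 2.7, "fare": 12.0},
    {"x": 1.1, "fare": 7.5},
    {"x": 0.9, "fare": 6.0},
    {"x": 2.1, "fare": 20.0},
]
edges = [0.0, 1.0, 2.0, 3.0]

# Binning: group the full records by the bin their `x` value falls into.
bins = [[] for _ in range(len(edges) - 1)]
for rec in records:
    if edges[0] <= rec["x"] < edges[-1]:
        bins[bisect_right(edges, rec["x"]) - 1].append(rec)

# Histogramming: reduce each bin to a single number (here, a count).
counts = [len(b) for b in bins]

# Selecting a bin returns the complete records, ready for further analysis.
selected = bins[2]
print(counts, selected)
```

Every record is still present after binning, so a selected bin can be histogrammed again with a finer resolution or along a different coordinate.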

In [None]:
manh = binned["dropoff_longitude", 1]["dropoff_latitude", 4]
manh

In [None]:
# We can now histogram this with a much finer resolution

manh.hist(dropoff_latitude=300, dropoff_longitude=300).plot(norm="log", aspect="equal")

In [None]:
# We select another bin, which contains the JFK airport

jfk = binned["dropoff_longitude", 6]["dropoff_latitude", 1]
jfk.hist(dropoff_latitude=300, dropoff_longitude=300).plot(norm="log", aspect="equal")

![jfk](../images/640px-JFK_airport_terminal_map.png)

(https://commons.wikimedia.org/wiki/File:JFK_airport_terminal_map.png)

<br><br>

### Binning into a new dimension

- Data that has already been binned can also be binned further into new dimensions.

In [None]:
manh

- We look at the trip distances inside the Manhattan and JFK bins we have selected above.

In [None]:
# Use 100 distance bins
manh_dist = manh.bin(trip_distance=100)
manh_dist

In [None]:
manh_dist.hist().plot()

In [None]:
jfk_dist = jfk.bin(trip_distance=100)
jfk_dist.hist().plot()

<br><br>


### Other operations on bins: what is the fare amount as a function of distance?

- In addition to summing/histogramming, bins can be used for other reduction operations: `min()`, `max()`, and `mean()`.
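
As a plain-Python illustration of what such per-bin reductions compute, consider a nested list in which each inner list holds the values that fell into one bin. The data are invented, and the sketch only mirrors the concept, not Scipp's API.

```python
from statistics import mean

# Each inner list holds the fare values of the records in one bin.
fares_per_bin = [[5.0, 6.0], [7.5], [12.0, 20.0]]

bin_min = [min(b) for b in fares_per_bin]
bin_max = [max(b) for b in fares_per_bin]
bin_mean = [mean(b) for b in fares_per_bin]
bin_sum = [sum(b) for b in fares_per_bin]  # what histogramming computes

print(bin_min, bin_max, bin_mean, bin_sum)
```

Each reduction returns one value per bin, so the result is a dense array with the same shape as the grid of bins. This sketch assumes every bin is non-empty, because `min`, `max` and `mean` of an empty list raise an error in plain Python.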

In [None]:
manh_dist

- To get the minimum and maximum fares for all trips that ended inside our Manhattan area, we can do

In [None]:
manh_dist.bins.coords["fare_amount"].min(), manh.bins.coords["fare_amount"].max()

- These values are somewhat strange, indicative of bad data in the table.
- We restrict our fare range from 0 to 200 dollars.

In [None]:
# Make 100 bins between 0 and 200 dollars
nbins = 100
fare_bins = sc.linspace("fare_amount", 0, 200, nbins + 1, unit="$")

# Bin & plot our data
manh_dist.bin(fare_amount=fare_bins).hist().transpose().plot(norm="log")

Some things we can say about the data:

- there appears to be a (somewhat expected) correlation between fare amount and trip distance: the further you go, the more you'll have to pay
- for a given trip distance, clients usually pay above the diagonal line, rarely below
- there appears to be a magic fare amount of &#36;52 that will take you anywhere from 0 to 60 miles!

<br><br>

## 4. Plopp: interactive data visualization tools

<img src="../images/plopp-logo.svg" width="200" />

https://scipp.github.io/plopp 

In [None]:
import plopp as pp

fare_lat_lon = da.hist(
    fare_amount=fare_bins, dropoff_latitude=300, dropoff_longitude=300
)
fare_lat_lon

In [None]:
%matplotlib widget

inspect = pp.inspector(fare_lat_lon, dim="fare_amount", norm="log")
inspect

In [None]:
tool = inspect[0][0].toolbar["inspect"]._tool
tool.start()
tool.click(-73.9859, 40.7463)
tool.click(-73.9575, 40.7120)
tool.click(-73.9522, 40.7777)
display(inspect[0][0].fig)
display(inspect[0][1].fig)

<br><br><br><br><br><br><br>

### Exercise 4.1: Rush hours

Histogram the Manhattan and JFK bins according to hour-of-the-day,
to show the quiet and busy hours for both locations.

**Solution:**

In [None]:
# In Plopp, you can use the + and / operators to make tiled figures
manh.hist(dropoff_hour=24).plot(title='Manhattan') / jfk.hist(dropoff_hour=24).plot(title='JFK')

### Exercise 4.2: Expensive hours

The final exercise is to create an interactive figure that will show histograms of how expensive trips were,
as a function of the hour-of-the-day, for the entire dataset.

You should:

1. Create a `price_per_mile` coordinate on the original dataset `da`
1. Bin `da` using two dimensions: hour-of-the-day and `price_per_mile`
1. Use Plopp's `superplot` function to make a figure with a 1D histogram and an interactive slider to navigate the hour dimension

Use the slider to find the hour of the day when trips are the most expensive!

**Hint:** For binning in hour-of-the-day, using 24 bins should work well.
For binning in `price_per_mile`, you will have to manually set the bin boundaries.

**Solution:**

In [None]:
da.coords['price_per_mile'] = da.coords['fare_amount'] / da.coords['trip_distance']
sp = pp.superplot(
    da.bin(
        dropoff_hour=24,
        price_per_mile=sc.linspace('price_per_mile', 0, 20, 100, unit='$/mi'),
    ).hist()
)
sp

In [None]:
pp.widgets.Box([[sp.canvas.to_image(), sp.right_bar], sp.bottom_bar])

<br><br><br><br><br><br><br>

### Bonus Exercise

You decided to join an exchange program in NY.

But living expenses are too high there, even compared to Copenhagen!

Luckily, you can take over a car from a previous student in the same program,
and you are allowed to have a part-time job for 2 hours every day,
and there is no limit on your income.

So you decide to be a shared-car driver.
Your goal is to maximize your income within those 2 hours,
so you are going to analyse which hours to drive in which area!

You are free to choose 2 hours among all 24 in a day,
and there are 2 places, Manhattan and JFK airport,
where you can be registered as a driver.

**Solution**

In [None]:
# Bin the data again to get the `price_per_mile` coord in the
# Manhattan and JFK bins
binned = da.bin(dropoff_latitude=8, dropoff_longitude=8)

# Manhattan bin
manh = binned["dropoff_longitude", 1]["dropoff_latitude", 4]
# Create a new data array with the `price_per_mile` as weights
prices_manh = sc.DataArray(data=manh.values.coords['price_per_mile'],
                           coords={'dropoff_hour': manh.values.coords['dropoff_hour']})
# Bin by hour-of-the-day and get the mean inside each bin
mean_manh = prices_manh.bin(dropoff_hour=24).bins.mean()

# Repeat for JFK
jfk = binned["dropoff_longitude", 6]["dropoff_latitude", 1]
prices_jfk = sc.DataArray(data=jfk.values.coords['price_per_mile'],
                          coords={'dropoff_hour': jfk.values.coords['dropoff_hour']})
mean_jfk = prices_jfk.bin(dropoff_hour=24).bins.mean()

# Plot
fig = pp.plot({'Manhattan': mean_manh, 'JFK': mean_jfk})
fig

In [None]:
fig.fig