# Introduction to Scipp

<a href="https://scipp.github.io"><img src="https://scipp.github.io/_static/logo-2022.svg" width="400" /></a>

<h4><i>Multi-dimensional arrays with labeled dimensions and physical units</i></h4>

<h3><a href="https://scipp.github.io">scipp.github.io</a></h3>

<br><br>

Scipp is an open-source library developed by ESS for handling, manipulating and visualizing multi-dimensional data arrays.

It enriches raw NumPy-like arrays by adding named dimensions and associated coordinates.
In addition, it supports

- Physical units which are handled in arithmetic operations
- Histograms, i.e., bin-edge axes, which are longer than the data extent by 1
- Propagation of uncertainties

<br><br>

In [None]:
%matplotlib inline
import numpy as np
import scipp as sc
import matplotlib.pyplot as plt

# import scipp_intro
from scipp_utils import quiz, plot, scatter

rng = np.random.default_rng(seed=1234)

<br><br><br><br>

## 1. Labeled dimensions: why do we need them?

Say I have a 2D rectangular array of data

In [None]:
ny, nx = 10, 20
a = np.sin(np.arange(ny) / (ny / 4)).reshape((-1, 1)) * np.cos(np.arange(nx) / (ny / 4))
a.shape

that looks something like

In [None]:
plot(a)

My task is now to slice out row number 4.
Because of the shape of the array, I know that the row dimension is the smallest, so I slice the first dimension of the 2D array:

In [None]:
# Slice out row number 4
plot(a[4, :])

### We can't always deduce from the shape

Now say I have an array which has a square shape:

In [None]:
ny, nx = 20, 20
a = np.sin(np.arange(ny) / (ny / 4)).reshape((-1, 1)) * np.cos(np.arange(nx) / (ny / 4))
a.shape

In [None]:
plot(a)

Do I slice the first or the second index of the 2D array?

In [None]:
# Not always obvious which dimension is which
plot(a[:, 4], a[4, :])

### The situation gets worse with more dimensions

Say I now have an array that has 4 dimensions: `x, y, z, t` (in that order, maybe?, or is it `z, y, x, t`, or `t, x, y, z`?)

In [None]:
a = np.random.random([20] * 4)
a.shape

**Quiz time!**

In [None]:
quiz(1)

<br><br>

### Introducing labeled dimensions

<img src="https://docs.xarray.dev/en/stable/_static/dataset-diagram-logo.png" width="220" /> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <img src="https://scipp.github.io/_static/logo-2022.svg" width="220" />

[Xarray](https://docs.xarray.dev/en/stable/index.html) (https://docs.xarray.dev) introduced labels to multi-dimensional NumPy arrays.

"*real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc.*"

We have embraced, and to a large extent copied, the Xarray mechanism.

In [None]:
var = sc.array(dims=["x", "y", "z", "time"], values=a)
var

**Quiz time again!**

Can you guess the syntax?

In [None]:
quiz(2)

<br><br>

Getting the `z` slice is now easy and **readable**.

<br><br>

### Adding coordinates

- Coordinates can be specified for each dimension.
- They describe the extent of each axis, as well as how far each data point is from its neighbours.

Here is an array that represents air pollution levels as a function of altitude and time.

In [None]:
data = sc.array(
    dims=["altitude", "year"],
    values=np.linspace(500, 10, 5).reshape((5, 1)) * rng.random(10),
)
sc.show(data)

In [None]:
data.plot()

In Scipp and Xarray, coordinates are added in a data structure called `DataArray`:

In [None]:
da = sc.DataArray(
    data=data,
    coords={
        "altitude": sc.linspace("altitude", 0, 8000, 5),
    },
)
sc.show(da)

In [None]:
da

In [None]:
da.plot()

### Accessing and adding more coordinates

Coordinates are stored in a `dict`-like container,
and each dimension can have more than one coordinate.

Getting and setting coordinates is done using the same syntax as Python dicts:

In [None]:
print(da.coords.keys())
da.coords["altitude"]

### Exercise 1.1: Adding a new coordinate

The air pollution data was collected every `year` from 2014 to 2023, i.e. the half-open interval `[2014, 2024)`. <br>
Let's add a coordinate, `year`, to the `year` dimension.
> **Tip:** You can create a ``Variable`` with consecutive numbers by ``sc.arange(dim, start, stop)``.

**Hint:**

```python
da = sc.DataArray(
    data=data,
    coords={
        "altitude": sc.linspace("altitude", 0, 8000, 5),
        "year": sc.arange(..., 2014, ...)
    },
)
```
or
```python
da.coords['year'] = sc.arange(..., 2014, ...)
```

**Solution:**

In [None]:
da = sc.DataArray(
    data=data,
    coords={
        "altitude": sc.linspace("altitude", 0, 8000, 5),
        "year": sc.arange("year", 2014, 2024),
    },
)
sc.show(da)
da

### Exercise 1.2: Compute new coordinate

Add a new coordinate representing the Scipp-year.
> **Hint:** Scipp was first released in 2020

**Solution:**

In [None]:
da.coords["scipp-year"] = da.coords["year"] - 2020
sc.show(da)
da

<br><br><br><br><br><br><br><br>

## 2. Going further

<img src="https://scipp.github.io/_static/logo-2022.svg" width="220" />

### 2.1 Physical units

Every data variable and coordinate in Scipp has physical units.
(see also [pint](https://pint.readthedocs.io/en/stable/), [astropy.units](https://docs.astropy.org/en/stable/units/index.html), [pint-xarray](https://pint-xarray.readthedocs.io/en/stable/), ...)

Array `Variable` with unit:

In [None]:
temperature = sc.array(dims=["time"], values=[300.0, 301.0, 312.0, 340.0], unit="K")
temperature

Scalar `Variable` (no dimensions) with unit:

In [None]:
sound_speed = sc.scalar(340.0, unit="m/s")
sound_speed

Coordinates and data with units in a `DataArray`:

In [None]:
cph_air = sc.DataArray(
    data=sc.array(
        dims=["altitude", "year"],
        values=np.linspace(500, 10, 5).reshape((5, 1)) * rng.random(10),
        unit="m-3",
    ),
    coords={
        "altitude": sc.linspace("altitude", 0, 8000, 5, unit="m"),
        "year": sc.arange("year", 2014, 2024, unit="year"),
    },
)
cph_air

Units are automatically handled in arithmetic operations.

Say I know the mean ultra-fine particle mass

In [None]:
utra_fine_particle_mass = sc.scalar(1.0e-6, unit="kg")

cph_air *= utra_fine_particle_mass
cph_air

<br><br>

### Units also provide protection

Say I now also have air pollution data for another city, e.g. NYC.

I would like to compute the difference between CPH and NYC air pollution (as a function of altitude and year),
but I forgot to multiply the NYC data by the particle mass:

In [None]:
nyc_air = sc.DataArray(
    data=sc.array(
        dims=["altitude", "year"],
        values=np.linspace(800, 20, 5).reshape((5, 1)) * rng.random(10),
        unit="m-3",
    ),
    coords={
        "altitude": sc.linspace("altitude", 0, 8000, 5, unit="m"),
        "year": sc.arange("year", 2014, 2024, unit="year"),
    },
)

# This raises a unit error: cph_air is now in kg/m^3, nyc_air is still in 1/m^3
cph_air - nyc_air

In [None]:
nyc_air *= utra_fine_particle_mass

air_difference = cph_air - nyc_air

In [None]:
air_difference.plot()

- Units are very useful for catching difficult-to-spot bugs early in a workflow.
- They save **hours** of debugging time, free up mental capacity and let the user focus on the important thing: **doing science**.

<br><br><br><br><br><br>

### Units for label-based indexing

We also use units to distinguish between positional indexing and label-based indexing:

In [None]:
cph_air["altitude", 2000.0 * sc.Unit("m")].plot()

Indexing with a plain integer is positional and refers to the `dimension`, whereas indexing with a value that carries a unit is label-based and is looked up in the `coordinates`.

<br><br><br><br><br><br>

### Exercise 2: Coordinate and Units

We have a data array that contains `air pollution` as a function of `year` and `altitude` above the city of Copenhagen.
However, we want to have a `pressure` coordinate for the `altitude` dimension instead of `altitude`.

Assuming a constant air temperature $T$ of 300 K, the pressure as a function of height $h$ is given by

$$ P = P_{0} \exp{\left[ \frac{-g_{0}Mh}{RT} \right]} $$

Here is the incomplete function `altitude_to_pressure` that converts `altitude[m]` into `pressure[hPa]`.

Complete the function and use it to add the `pressure` coordinate to `cph_air`.

In [None]:
def altitude_to_pressure(altitude):
    M = sc.scalar(0.0289644, unit="kg/mol")
    g0 = sc.scalar(9.80665, unit="m/s2")
    R = sc.scalar(8.3144598, unit="J/mol/K")
    T = sc.scalar(300.0)
    p0 = sc.scalar(1013.25, unit="hPa")
    return p0 * sc.exp(-g0 * M * altitude / (R * T))

**Solution:**

In [None]:
def altitude_to_pressure(altitude):
    M = sc.scalar(0.0289644, unit="kg/mol")
    g0 = sc.scalar(9.80665, unit="m/s2")
    R = sc.scalar(8.3144598, unit="J/mol/K")
    T = sc.scalar(300.0, unit="K")
    p0 = sc.scalar(1013.25, unit="hPa")
    return p0 * sc.exp(-g0 * M * altitude / (R * T))


cph_air.coords["pressure"] = altitude_to_pressure(cph_air.coords["altitude"])
cph_air
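
As a rough cross-check of the numbers, independent of Scipp, the same formula can be evaluated with plain floats. This sketch simply reuses the constants from the function above, with the units tracked by hand in comments:

```python
import math

M = 0.0289644   # kg/mol
g0 = 9.80665    # m/s^2
R = 8.3144598   # J/(mol K)
T = 300.0       # K
p0 = 1013.25    # hPa


def pressure_hpa(altitude_m: float) -> float:
    """Pressure in hPa at a given altitude in metres, assuming constant T."""
    # The exponent g0*M*h/(R*T) is dimensionless: (m/s^2)(kg/mol)(m) / (J/mol)
    return p0 * math.exp(-g0 * M * altitude_m / (R * T))


for h in (0.0, 2000.0, 4000.0, 6000.0, 8000.0):
    print(f"{h:6.0f} m -> {pressure_hpa(h):8.2f} hPa")
```

The pressure should start at `p0` on the ground and fall to roughly 40% of that value at 8000 m.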

<br><br><br><br><br><br><br><br><br><br>

### 2.2 Histogramming and Bin-edge coordinates

- It is sometimes necessary to have coordinates that represent a range for each data value.
- E.g. "the temperature was 310 K in the time span between 10 and 20 seconds".
- This also arises every time we histogram data, as in the image above.
- Scipp supports this by having **bin-edge coordinates**: a coordinate that is longer than its dimension by 1.

The next data set is meant to represent photon events arriving on a camera.
We have a long list of `x` and `y` positions for the photons.

In [None]:
x = sc.array(dims=["row"], values=rng.normal(size=10000), unit="cm")
y = sc.array(dims=["row"], values=rng.normal(size=10000), unit="cm")
recording = sc.DataArray(
    data=sc.ones(sizes=x.sizes, unit="counts"), coords={"x": x, "y": y}
)
recording

In [None]:
scatter(x.values, y.values)

It is very common to histogram such data.

In Scipp, histogramming has a very concise and easy-to-use syntax.
To make 8 bins in both the `x` and `y` dimensions:

In [None]:
image = recording.hist(y=8, x=8)
image.plot(aspect="equal")

The `x` and `y` coordinates are now **bin-edge** coordinates.

In [None]:
sc.show(image)
image

- NumPy and Matplotlib return the bin edges and the data counts separately
- We have everything stored inside a single data structure
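
For comparison, this is roughly what the same 8x8 histogram looks like with NumPy alone. The sketch uses freshly generated positions, not the `recording` from above:

```python
import numpy as np

rng = np.random.default_rng(seed=1234)
x = rng.normal(size=10000)
y = rng.normal(size=10000)

# Three separate return values: the counts and one edge array per axis
counts, y_edges, x_edges = np.histogram2d(y, x, bins=8)
print(counts.shape, y_edges.shape, x_edges.shape)
```

It is then up to me to keep `counts`, `y_edges` and `x_edges` together, and to remember the units.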

You can of course adjust the number of bins:

In [None]:
recording.hist(y=100, x=100).plot(aspect="equal")

<br><br><br><br><br>

### Exercise 3: Histogramming (TODO)

<br><br><br><br><br><br><br><br><br><br><br><br>

## 3. Binned data

Scipp distinguishes **histogrammed** data from **binned** data:

- Histogrammed data refers to regular dense arrays of, e.g., floating-point values with an associated bin-edge coordinate.
- Binned data refers to the precursor of histogrammed data, i.e., each bin contains a “list” of contributing events or values. Binned data can be converted into a histogram by computing the sum over all events or values in a bin.

![binned](../images/binned_drawing.svg)

This is conceptually similar to a multi-dimensional <a href="https://awkward-array.org/doc/main/"><img src="https://iris-hep.org/assets/logos/awkward.svg" width="100" /></a>.

It is best illustrated with an example of data analysis.
For this, we will use one of the NYC taxi datasets.

<br><br>

### NYC yellow taxi dataset

<img src="https://vaex.readthedocs.io/en/latest/_images/datasets_2_1.png" /> <img src="https://cdn-images-1.medium.com/v2/resize:fit:2680/1*fqrY2h4uLD3eKEvJ6hlI2g.png" width="600" />

(https://vaex.readthedocs.io/en/latest/datasets.html, Dataset from 2015, obtained as an HDF5 file from the Vaex docs,
and subsequently cleaned of outliers).

For today, we will use a small subset of it.

In [None]:
!wget -nc --no-verbose https://public.esss.dk/groups/scipp/dmsc-summer-school/scipp/nyc_taxi_data_2015_small.zip
!unzip -qq nyc_taxi_data_2015_small.zip

In [None]:
# %matplotlib widget

da = sc.io.load_hdf5("nyc_taxi_data_2015_small.h5")
da

In [None]:
n = 100
x = da.coords["dropoff_longitude"].values[::n]
y = da.coords["dropoff_latitude"].values[::n]
scatter(x, y)

### Binning the data records

- Working with binned data is most efficient when keeping the number of bins relatively low.
- Binning is essentially like overlaying a grid of bin edges onto our data

In [None]:
ax = scatter(x, y, get_ax=True)
for lon in np.linspace(*ax.get_xlim(), 9):
    ax.axvline(lon, color="gray")
for lat in np.linspace(*ax.get_ylim(), 9):
    ax.axhline(lat, color="gray")

In [None]:
# Bin into 8 longitude & latitude bins
binned = da.bin(dropoff_latitude=8, dropoff_longitude=8)
binned

In [None]:
# Histogramming is summing all the counts in each bin
binned_sum = binned.bins.sum()

binned_sum.plot(aspect="equal", norm="log")

<br><br><br><br>

### Selecting/slicing bins

- Binning *groups* the data into bins, but keeps the underlying table of records beneath
- **No information is lost, it is simply re-ordered**
- The bins can then be used for slicing the data, providing extremely efficient data selection and filtering

In [None]:
manh = binned["dropoff_longitude", 1]["dropoff_latitude", 4]
manh

In [None]:
# We can now histogram this with a much finer resolution

manh.hist(dropoff_latitude=300, dropoff_longitude=300).plot(norm="log", aspect="equal")

In [None]:
# We select another bin, which contains the JFK airport

jfk = binned["dropoff_longitude", 6]["dropoff_latitude", 1]
jfk.hist(dropoff_latitude=300, dropoff_longitude=300).plot(norm="log", aspect="equal")

![jfk](https://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/JFK_airport_terminal_map.png/640px-JFK_airport_terminal_map.png)

(https://commons.wikimedia.org/wiki/File:JFK_airport_terminal_map.png)

<br><br>

### Binning into a new dimension

- Data that has already been binned can also be binned further into new dimensions

In [None]:
manh

- We look at the trip distances inside the Manhattan and JFK bins we have selected above.

In [None]:
# Use 100 distance bins
manh_dist = manh.bin(trip_distance=100)
manh_dist

In [None]:
manh_dist.hist().plot()

In [None]:
jfk_dist = jfk.bin(trip_distance=100)
jfk_dist.hist().plot()

<br><br>


### Other operations on bins: what is the fare amount as a function of distance?

- In addition to summing/histogramming, bins can be used for other reduction operations: `min()`, `max()`, and `mean()`.

In [None]:
manh_dist

- To get the minimum and maximum fares for all trips that ended inside our Manhattan area, we can do

In [None]:
manh_dist.bins.coords["fare_amount"].min(), manh.bins.coords["fare_amount"].max()

- These values are somewhat strange, indicative of bad data in the table.
- We restrict our fare range from \\$0 to \\$200.

In [None]:
# Make 100 bins between 0 and 200 dollars
nbins = 100
fare_bins = sc.linspace("fare_amount", 0, 200, nbins + 1, unit="dollar")

# Bin & plot our data
manh_dist.bin(fare_amount=fare_bins).hist().transpose().plot(norm="log")

Some things we can say about the data:

- there appears to be a (somewhat expected) correlation between fare amount and trip distance: the further you go, the more you'll have to pay
- for a given trip distance, clients usually pay above the diagonal line, rarely below
- there appears to be a magic fare amount of \\$52 that will take you anywhere from 0 to 60 miles!

<br><br>

## 4. Plopp: interactive data visualization tools

<img src="https://scipp.github.io/plopp/_static/logo.svg" width="200" />

https://scipp.github.io/plopp 

In [None]:
import plopp as pp

fare_lat_lon = da.hist(
    fare_amount=fare_bins, dropoff_latitude=300, dropoff_longitude=300
)
fare_lat_lon

In [None]:
%matplotlib widget

inspect = pp.inspector(fare_lat_lon, dim="fare_amount", norm="log")
inspect

In [None]:
tool = inspect[0][0].toolbar['inspect']._tool
tool.start()
tool.click(-73.9859, 40.7463)
tool.click(-73.9575, 40.7120)
tool.click(-73.9522, 40.7777)
display(inspect[0][0].fig)
display(inspect[0][1].fig)

<br><br><br><br><br><br><br>

### Final Exercise.

You have decided to join an exchange program in New York.

But living expenses there are very high, even compared to Copenhagen!

Luckily, you can take over a car from a previous student in the same program, and you are allowed to have a part-time job for up to 12 hours a week, with no limit on income.

So you decide to become a shared-car driver. You want to maximize your income within those 12 hours, so you are going to analyse which days to drive and where.

You are free to choose 2 days out of the 7 days of the week, and there are 2 regions where you can register as a driver: Manhattan and the JFK airport region.

In order to do so, let's

1. Add a `weekday` coordinate based on the `pickup_datetime`.
2. Draw histograms of `fare_amount/trip_distance` for each region.

**Solution**

In [None]:
from datetime import datetime


def datetime_to_weekday(t: datetime) -> str:
    ...