# Getting Started

This package is designed to aid in the efficient analysis of large simulations, such as cosmological (hydrodynamical) simulations of large-scale structure.

It uses the [dask](https://dask.org/) library to perform computations. This has several key advantages:
* (i) very large datasets that cannot normally fit into memory can be analyzed,
* (ii) calculations can be automatically distributed onto parallel 'workers', across one or more nodes, to speed them up,
* (iii) abstract graphs ("recipes", such as for derived quantities) can be created and evaluated only on demand.
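
To illustrate the "recipe" idea in point (iii), here is a minimal sketch using plain dask on synthetic data. The array names, sizes, and the kinetic-energy formula are illustrative assumptions and are not taken from the package itself.

```python
import numpy as np
import dask.array as da

# Synthetic stand-in data: 1,000 particles with masses and 3D velocities,
# split into chunks of 250 rows (names and sizes are illustrative).
rng = np.random.default_rng(42)
masses_np = rng.uniform(1.0, 2.0, size=1000)
velocities_np = rng.normal(size=(1000, 3))
masses = da.from_array(masses_np, chunks=250)
velocities = da.from_array(velocities_np, chunks=(250, 3))

# Define a derived quantity; this only builds a task graph.
kinetic_energy = 0.5 * masses * (velocities**2).sum(axis=1)
total = kinetic_energy.sum()

# Nothing is evaluated until .compute() is called.
total_value = total.compute()
```

Here `kinetic_energy` and `total` are still dask arrays; only `total_value` holds an actual number.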

## Loading an individual dataset

The first step is to choose an existing snapshot of a simulation. To start, we intentionally select the $z=0$ output of TNG50-4, which is the lowest resolution version of [TNG50](https://www.tng-project.org/), a suite of galaxy formation simulations in cosmological volumes. Choosing TNG50-4 means that the snapshot is small and easy to work with. We demonstrate how to work with larger datasets at a later stage.

In [None]:
from scida import load
ds = load("TNG50-4_snapshot", units=True)

We can get some general information about this dataset:

In [None]:
ds.info()

## Metadata

Loading this data gives us access to the simulation snapshot's contents. For example, in the case of [AREPO](https://arepo-code.org) we find most of the metadata in the attributes `config`, `header` and `parameters`. The raw metadata from which these dictionaries are derived is available under
```
ds.metadata
```

irrespective of the dataset type.

In [None]:
print("some ds.config entry:", next(iter(ds.config.items())))
print("some ds.header entry:", next(iter(ds.header.items())))
print("some ds.parameters entry:", next(iter(ds.parameters.items())))

If you are familiar with AREPO snapshots, you will know that the output is often split into multiple files. Most of the metadata is the same for all files, but some entries (such as the number of particles in a given file, `NumPart_ThisFile`) are not. In these cases, the differing entries are stacked along the first axis, so that we also have access to this information:

In [None]:
print("Gas cells for each file:", ds.header['NumPart_ThisFile'][:, 0])
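
The stacking along the first axis can be sketched with plain numpy. The per-file counts below are made up, and the layout of six particle types per file (with gas in column 0) is an assumption based on the indexing used above.

```python
import numpy as np

# Hypothetical per-file header entries: particle counts for six particle types.
numpart_per_file = [
    np.array([120, 130, 0, 0, 10, 1]),
    np.array([115, 125, 0, 0, 12, 0]),
    np.array([118, 128, 0, 0, 9, 2]),
]

# Stack along a new first axis: shape (number of files, number of types).
stacked = np.stack(numpart_per_file, axis=0)

gas_per_file = stacked[:, 0]          # column 0: gas cells in each file
total_per_type = stacked.sum(axis=0)  # summing over files gives per-type totals
```

Indexing `[:, 0]` therefore selects one particle type across all files, while a sum over the first axis recovers totals for the whole snapshot.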

## Particle/cell data

Within our `ds` object, `ds.data` contains references to all the particle/cell data in this snapshot. Data is organized in a nested dictionary depending on the type of data.

If the snapshot is split across multiple file chunks on disk (as is the case for most large cosmological simulations), then these chunks are virtually "combined", in the same way as described for the metadata above.

As a result, there is a single array per data entry at the leaves of the nested dictionary. Note that these arrays are **not** normal numpy arrays, but are instead **dask arrays**, which we will return to later.
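
A minimal sketch of such a virtual combination, using plain dask on small in-memory arrays: the three "file chunks" are hypothetical, and the package may combine files through a different mechanism internally.

```python
import numpy as np
import dask.array as da

# Three hypothetical file chunks holding 4, 3 and 5 values of one field.
file_arrays = [np.arange(4.0), np.arange(4.0, 7.0), np.arange(7.0, 12.0)]

# Wrap each chunk as a dask array and join them along the first axis.
pieces = [da.from_array(a, chunks=a.shape) for a in file_arrays]
combined = da.concatenate(pieces, axis=0)

# Slicing stays lazy; .compute() returns an ordinary numpy array.
first_values = combined[:5].compute()
```

The combined array behaves like a single array of length 12, while its `chunks` attribute still records the sizes of the underlying pieces.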

For the TNG50-4 datasets, the first level of `ds.data` is keyed by particle type (such as gas and dark matter), and the second level holds the different physical field arrays (such as density and ionization).

In [None]:
for key,val in ds.data.items():
    print("Particle species:", key)
    print("Three of its fields:", list(val.keys())[:3], end='\n\n')
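
The same two-level layout can be mimicked with an ordinary dictionary, which is convenient for trying out traversal patterns. The container below is a hypothetical stand-in with made-up shapes, not the actual `ds.data` object.

```python
import numpy as np

# Hypothetical two-level container: particle type -> field name -> array.
data = {
    "PartType0": {"Coordinates": np.zeros((4, 3)), "Masses": np.ones(4)},
    "PartType1": {"Coordinates": np.zeros((6, 3)), "Velocities": np.zeros((6, 3))},
}

# Number of particles per type, read off the first axis of any field.
counts = {
    ptype: next(iter(fields.values())).shape[0] for ptype, fields in data.items()
}

# Particle types that provide a given field.
with_masses = [ptype for ptype, fields in data.items() if "Masses" in fields]
```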

## Analyzing snapshot data

In order to perform a given analysis on some available snapshot data, we would normally first explicitly load the required data from disk, and then run some calculations on this data (in memory).

Instead, with dask, our fields are loaded automatically and "lazily", that is, only when actually required.
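
Lazy loading can be demonstrated with plain dask. In this sketch, `load_chunk` is a hypothetical stand-in for reading a field from disk; it records each call so that we can see when the "read" actually happens.

```python
import numpy as np
import dask
import dask.array as da

load_calls = []  # records every time the loader actually runs

def load_chunk(n):
    """Hypothetical loader standing in for a read from disk."""
    load_calls.append(n)
    return np.arange(n, dtype=np.float64)

# Wrap the loader lazily; shape and dtype must be declared up front.
lazy = da.from_delayed(dask.delayed(load_chunk)(5), shape=(5,), dtype=np.float64)
total = lazy.sum()

calls_before = len(load_calls)  # the loader has not run yet
value = total.compute()         # triggers the load and the sum
calls_after = len(load_calls)
```

The loader runs only once `.compute()` is called, not when the array or the sum is defined.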

### Computing a simple statistic on (all) particles

The fields in our snapshot object behave similarly to actual numpy arrays.

As a first simple example, let's calculate the total mass of gas cells in the entire simulation. Just as in numpy, we can write:

In [None]:
masses = ds.data["PartType0"]["Masses"]
task = masses.sum()

Note that all objects remain 'virtual': they are not calculated or loaded from disk, but are merely the required instructions, encoded into tasks. In a notebook we can inspect these:

In [None]:
masses

In [None]:
task

We can request a calculation of the actual operation(s) by applying the `.compute()` method to the task.

In [None]:
task.compute()
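
When several results are needed from the same field, they can be evaluated together with `dask.compute()`, which accepts multiple lazy objects and returns a tuple of results. A minimal sketch on synthetic data (the `masses` array here is made up and unrelated to the snapshot):

```python
import numpy as np
import dask
import dask.array as da

# Synthetic masses, evenly spaced between 1 and 2, in ten chunks.
masses = da.from_array(np.linspace(1.0, 2.0, 1000), chunks=100)

# Two lazy tasks built from the same array.
total_task = masses.sum()
mean_task = masses.mean()

# Evaluate both in a single call.
total_value, mean_value = dask.compute(total_task, mean_task)
```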

### Creating a visualization: projecting onto a 2D image

As an example of calculating something more complicated than just `sum()`, let's do the usual "poor man's projection" via a 2D histogram.

To do so, we use [da.histogram2d()](https://docs.dask.org/en/latest/array.html) of dask, which is analogous to [numpy.histogram2d()](https://numpy.org/doc/stable/reference/generated/numpy.histogram2d.html), except that it operates on a dask array. Later on, we will discuss more advanced, interactive visualization methods.

In [None]:
import dask.array as da
import numpy as np

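# Coordinates is a two-dimensional array; take the x and y columns.
coords = ds.data["PartType0"]["Coordinates"]
x = coords[:, 0]
y = coords[:, 1]

# Bin edges spanning the full box along each axis.
nbins = 512
bins1d = np.linspace(0, ds.header["BoxSize"], nbins + 1)

# histogram2d returns (counts, xedges, yedges); only the counts are evaluated here.
result = da.histogram2d(x, y, bins=[bins1d, bins1d])
im2d = result[0].compute()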
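
To try the same projection without a snapshot at hand, the following sketch uses uniformly distributed synthetic coordinates and checks the dask result against numpy. The box size, particle number, and bin count are arbitrary illustrative choices.

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(0)
box_size = 100.0  # illustrative box size
coords_np = rng.uniform(0.0, box_size, size=(10_000, 3))
coords = da.from_array(coords_np, chunks=(2_500, 3))

nbins = 16
edges = np.linspace(0.0, box_size, nbins + 1)

# Lazy 2D histogram of the x and y columns.
counts_lazy, xedges, yedges = da.histogram2d(
    coords[:, 0], coords[:, 1], bins=[edges, edges]
)
counts = counts_lazy.compute()

# Reference result computed eagerly with numpy on the same data.
reference, _, _ = np.histogram2d(
    coords_np[:, 0], coords_np[:, 1], bins=[edges, edges]
)
```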

The resulting `im2d` is just a two-dimensional array which we can display.

In [None]:
from io import BytesIO

import matplotlib as mpl
import matplotlib.pyplot as plt
from PIL import Image

fig = plt.figure(figsize=(6, 6), dpi=300)
cmap = mpl.cm.viridis
# Color boundaries: log-spaced between the 1st and 99.9th percentiles of the non-empty bins.
cranges = np.logspace(*[np.percentile(np.log10(im2d[im2d > 0]), i) for i in [1, 99.9]], 10)
norm = mpl.colors.BoundaryNorm(cranges, cmap.N, extend='both')
# origin="lower" places y=0 at the bottom, so that the image matches the extent.
plt.imshow(im2d.T, cmap=cmap, norm=norm, origin="lower", extent=[0, ds.header["BoxSize"], 0, ds.header["BoxSize"]], interpolation="bilinear", rasterized=True)
plt.xlabel("x (ckpc/h)")
plt.ylabel("y (ckpc/h)")
ram = BytesIO()
plt.savefig(ram, bbox_inches="tight", dpi=150)
ram.seek(0)
im = Image.open(ram)
im2 = im.convert('RGB').convert('P', palette=Image.ADAPTIVE)
im2.save("hist.png" , format='PNG')
plt.show()
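
The color boundaries used for the plot come from percentiles of the logarithm of the image. The sketch below reproduces that step on a synthetic image and restricts the logarithm to non-empty bins, since `np.log10(0)` evaluates to `-inf` and could otherwise spoil the lower percentile. The image itself is made up for illustration.

```python
import numpy as np

# Synthetic stand-in for a projected image; one bin is deliberately empty.
rng = np.random.default_rng(1)
image = rng.lognormal(mean=3.0, sigma=1.0, size=(64, 64))
image[0, 0] = 0.0

# Take logarithms of the positive bins only, so empty bins cannot produce -inf.
log_values = np.log10(image[image > 0])
lo, hi = np.percentile(log_values, [1, 99.9])

# Ten logarithmically spaced color boundaries between the two percentiles.
boundaries = np.logspace(lo, hi, 10)
```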