# Tutorial: inflating ensembles (pre-binned data format)

While ensembles are a useful data format for storing nested histograms, they are somewhat unwieldy and require many special functions to handle. The largest problem, however, is that they do not work well with vector-based operations and require recursive traversal by the convolution code.

To combat this we can `inflate` ensembles by changing their nested structure to a flattened format. 

In [1]:
import json
import pkg_resources

from syntheticstellarpopconvolve.ensemble_utils import convert_ensemble_to_dataframe

# load the data
example_ensemble_filename = pkg_resources.resource_filename(
    "syntheticstellarpopconvolve", "example_data/example_ensemble.json"
)
with open(example_ensemble_filename, "r") as f_ensemble:
    ensemble = json.loads(f_ensemble.read())

One example of binned data is the `Ensemble` data type generated by `binary_c`.

Ensemble-based data is stored as nested dictionaries. An example of ensemble-based data looks like this:

``` python
"Xyield": {
    "time": {
        "-0.1": {
            "source": {
                "Wind": {
                    "isotope": {
                        "Al27": 1.3202421292393783e-08,
                        "Ar36": 1.7624018781546946e-08,
                        "Ar38": 3.502033439864038e-09,
                        "Ar40": 5.758546201573555e-12,
                        "B10": 2.4295643555965993e-13,
                        "B11": 1.0109986571758494e-12,
                        "Be9": 3.7843822119497306e-14,
                        [...]
                        }
                    }
                [...]
                }
            }
        [...]
        }
    }
```

With `binary_c-python` this type of data can be generated through the options explained in [the ensemble-data logging notebook](https://binary_c.gitlab.io/binary_c-python/examples/notebook_ensembles.html).

To use this type of data, however, one must first transform it to a different shape. In particular, one must `inflate` the ensemble, turning it from a nested dictionary into a rectangular data format. How to do so is covered in XXX (TODO: refer to notebook).
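To make the idea of inflating concrete, the following is a minimal sketch of how a nested dictionary with alternating named and value layers could be flattened into rows. The helper `flatten_named_layers` and the `toy_ensemble` data are hypothetical and for illustration only; this is not the implementation used by `convert_ensemble_to_dataframe`, and it assumes every named layer is followed by exactly one value layer.

```python
def flatten_named_layers(subtree, row=None):
    """Yield one dict per leaf of a nested dict with alternating name/value layers.

    Hypothetical helper for illustration; not part of syntheticstellarpopconvolve.
    """
    row = row or {}
    for layer_name, values in subtree.items():  # named layer, e.g. "time"
        for value, child in values.items():  # value layer, e.g. "-0.1"
            new_row = {**row, layer_name: value}
            if isinstance(child, dict):
                # Deeper named layers remain: recurse into them
                yield from flatten_named_layers(child, new_row)
            else:
                # Leaf reached: the child is the stored number
                yield {**new_row, "probability": child}


# Toy data in the same shape as the "Xyield" subtree shown above;
# the second time bin is made up for this example
toy_ensemble = {
    "time": {
        "-0.1": {
            "source": {
                "Wind": {
                    "isotope": {
                        "Al27": 1.3202421292393783e-08,
                        "Ar36": 1.7624018781546946e-08,
                    }
                }
            }
        },
        "0.0": {"source": {"Wind": {"isotope": {"Al27": 2.0e-08}}}},
    }
}

rows = list(flatten_named_layers(toy_ensemble))
for r in rows:
    print(r)
```

Each row carries the layer names as keys and the layer values as strings, which is the rectangular shape that vector-based operations need.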


We can then inflate this ensemble by using the `convert_ensemble_to_dataframe` function.

It is best to already know the structure of the ensemble, so you know exactly which subtree you want to take, and whether it contains named layers. If the ensemble contains named layers, the structure should be `named_layer_1, value_layer_1, ... named_layer_n, value_layer_n, normalized_yield_layer_n`.

```python
inflated_ensemble = convert_ensemble_to_dataframe(
    ensemble_data,  # subtree of the ensemble
    verbose=False,  # flag to show info while inflating
    # flag to indicate whether the ensemble contains named layers
    # (i.e. layers that indicate what is in the next layer)
    contains_named_layers=True,
)
```

In particular, if you indicate that the ensemble contains named layers, the first layer _should_ be a named layer. If it is not, or if the structure otherwise deviates from the expected layout, the read-out is misaligned. In that case, please double-check that you provided the correct subtree, or wrap it in `{'ensemble': ensemble_data}`.

In [2]:
inflated_ensemble = convert_ensemble_to_dataframe(
    ensemble_data=ensemble["ensemble"]['Xyield'],
    verbose=False,
    contains_named_layers=True,
)

print(inflated_ensemble.head())

   time source isotope probability
0  -0.1   Wind    Al27         0.0
1  -0.1   Wind    Ar36         0.0
2  -0.1   Wind    Ar38         0.0
3  -0.1   Wind    Ar40         0.0
4  -0.1   Wind     B10         0.0


By default the inflated ensemble calls the final layer 'probability', but that data can of course be anything, depending on the pop-synth simulation output.

Moreover, all columns except the final layer hold strings by default. When reading out the ensemble we do not want to impose a type on the data: everything can be represented as a string, but not everything can be converted to a numerical type.

In [3]:
print(inflated_ensemble.dtypes)

time           object
source         object
isotope        object
probability    object
dtype: object


After inflating the ensemble you should convert the columns to their actual types.

In [4]:
inflated_ensemble = inflated_ensemble.astype({'time': 'float'})
print(inflated_ensemble.dtypes)

time           float64
source          object
isotope         object
probability     object
dtype: object
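The same `astype` call can convert several columns at once. The sketch below uses a small stand-in dataframe, `toy_inflated`, with made-up values rather than the real ensemble, and assumes that both the `time` and `probability` columns hold values that can be parsed as floats.

```python
import pandas as pd

# Toy stand-in for an inflated ensemble: every column has dtype object
toy_inflated = pd.DataFrame(
    {
        "time": ["-0.1", "-0.1", "0.0"],
        "source": ["Wind", "Wind", "Wind"],
        "isotope": ["Al27", "Ar36", "Al27"],
        "probability": [1.3e-08, 1.8e-08, 2.0e-08],
    },
    dtype=object,
)

# Convert the numerical columns in one call; the label columns are left alone
toy_converted = toy_inflated.astype({"time": "float", "probability": "float"})
print(toy_converted.dtypes)
```

Columns that are not listed in the mapping keep their original type, so label-like columns such as `source` and `isotope` are untouched.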


In [5]:
print(10**inflated_ensemble['time'])

0             0.794328
1             0.794328
2             0.794328
3             0.794328
4             0.794328
              ...     
102787    15848.931925
102788    15848.931925
102789    15848.931925
102790    15848.931925
102791    15848.931925
Name: time, Length: 102792, dtype: float64
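Once the columns are numerical, the flat table supports vectorised operations directly. The sketch below assumes, as the `10**` expression above suggests, that `time` holds base-10 logarithms; the dataframe `toy` and its values are made up for illustration, and the column name `linear_time` is a hypothetical choice.

```python
import pandas as pd

# Toy inflated ensemble with numerical columns already converted
toy = pd.DataFrame(
    {
        "time": [-0.1, -0.1, 0.0],
        "source": ["Wind", "Wind", "Wind"],
        "isotope": ["Al27", "Ar36", "Al27"],
        "probability": [1.0e-08, 2.0e-08, 4.0e-08],
    }
)

# Store the linear time next to the logarithmic one (element-wise power)
toy["linear_time"] = 10 ** toy["time"]

# Sum the final-layer values over all isotopes and sources in each time bin
total_per_time = toy.groupby("time")["probability"].sum()

print(toy)
print(total_per_time)
```

Both steps operate on whole columns, with no recursive traversal of nested dictionaries.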
