This is a brief introduction to working with IAMC-style data using pandas and pandas-indexing.

In [None]:
import pandas as pd

# Test data set

For experimenting and easy testing, `pandas-indexing` ships with the power sector generation and capacity of the HighRE illustrative modelling pathway from the IPCC AR6 scenario database in IAMC format.

In [None]:
from pandas_indexing.datasets import remindhighre_power


df = remindhighre_power()
df.head()

# Usage styles

`pandas-indexing` defines two different usage styles:

1. functions that can be imported from the top-level module, like
   
   ```python
   from pandas_indexing import assignlevel
   assignlevel(df, unit="Mt CO2e/yr")
   ```
2. convenience accessors that hook into pandas as extensions, like
   
   ```python
   df.pix.assign(unit="Mt CO2e/yr")
   ```

Most of the functionality is available in both styles under slightly different names. I'll present the functional style here first and add the accessor alternative as comments.

In [None]:
from pandas_indexing.core import describelevel


describelevel(df)  # or: df.pix

In [None]:
df.pix

As one can see, the IAMC format is defined by five index levels: `model`, `scenario`, `variable`, `unit` and `region`. This data subset contains a single `model` and `scenario` combination for one `region`, with several capacity `variable`s starting with `Capacity|Electricity|` and generation `variable`s starting with `Secondary Energy|Electricity|`.

The data comes with two different units, `GW` and `GWh/yr`, (hopefully) for capacity and generation, respectively.

# Selecting data

To use pandas indexes effectively for computations, it makes sense to split the hierarchical `variable` index out into separate Python variables: `generation` and `capacity`. The standard pandas tools for this job are `pd.DataFrame.loc` in conjunction with `pd.IndexSlice`, or `pd.DataFrame.query`.
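
For comparison, the vanilla pandas route might look like the following sketch. The frame `toy` and its values are made up for illustration and are not part of the bundled data set.

```python
import pandas as pd

# Made-up toy data with two of the IAMC index levels (illustration only)
index = pd.MultiIndex.from_tuples(
    [
        ("Capacity|Electricity|Coal", "GW"),
        ("Capacity|Electricity|Gas", "GW"),
        ("Secondary Energy|Electricity|Coal", "GWh/yr"),
    ],
    names=["variable", "unit"],
)
toy = pd.DataFrame({2030: [1.0, 2.0, 3.0]}, index=index)

# Exact match on the second level with pd.IndexSlice (index sorted first)
idx = pd.IndexSlice
by_slice = toy.sort_index().loc[idx[:, "GW"], :]

# Prefix match through a boolean mask built from the level values
is_capacity = toy.index.get_level_values("variable").str.startswith("Capacity|")
by_mask = toy.loc[is_capacity]

# query can refer to index level names directly
by_query = toy.query("unit == 'GW'")
```

All three variants select the two capacity rows here, but each needs positional slices or hand-built masks, which is the boilerplate the selectors below are meant to remove.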

`pandas_indexing` brings `ismatch` and `isin` to make this job as easy as possible.

In [None]:
from pandas_indexing import isin, ismatch  # no .pix equivalents

In [None]:
df.loc[ismatch(variable="Capacity|**"), 2030]

`ismatch` allows using a glob-like pattern to subset one or multiple named levels. Together with the standard `rename` method, we can easily get cleaned-up capacity and generation data:

In [None]:
generation = df.loc[ismatch(variable="Secondary Energy|**")].rename(
    index=lambda s: s.removeprefix("Secondary Energy|Electricity|")
)
generation
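
To make the glob-like matching more concrete, here is a standard-library sketch of how such a pattern could be translated into a regular expression. It assumes that `**` matches any characters while a single `*` stops at the `|` separator; `glob_to_regex` is a hypothetical helper written for this illustration and not the code `ismatch` uses.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob-like pattern into a compiled regular expression.

    Assumed semantics (illustration only): ``**`` matches any characters,
    ``*`` matches any characters except the ``|`` separator.
    """
    regex = ""
    # Splitting with a capturing group keeps the wildcards in the result
    for part in re.split(r"(\*\*|\*)", pattern):
        if part == "**":
            regex += ".*"
        elif part == "*":
            regex += r"[^|]*"
        else:
            regex += re.escape(part)
    return re.compile(regex)


variables = [
    "Capacity|Electricity",
    "Capacity|Electricity|Coal",
    "Secondary Energy|Electricity|Coal",
]

deep = glob_to_regex("Capacity|**")
shallow = glob_to_regex("Capacity|*")

matches_deep = [v for v in variables if deep.fullmatch(v)]
matches_shallow = [v for v in variables if shallow.fullmatch(v)]
```

Under these assumptions, `matches_deep` holds both capacity variables, while `matches_shallow` only holds `Capacity|Electricity`.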

Since this extraction of data is relatively common, `extractlevel` simplifies this by matching against a format-like template string:

In [None]:
from pandas_indexing import extractlevel, formatlevel


generation = extractlevel(df, variable="Secondary Energy|{carrier}|{fuel}", drop=True)
capacity = extractlevel(df, variable="Capacity|{carrier}|{fuel}", drop=True)
# or: df.pix.extract(variable="Secondary Energy|{carrier}|{fuel}")
generation
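
The idea behind a format-like template can be sketched with the standard library alone: every `{name}` placeholder becomes a named regex group, and the literal text around it is matched verbatim. `template_to_regex` is a hypothetical helper for illustration, not the implementation behind `extractlevel`.

```python
import re


def template_to_regex(template: str) -> re.Pattern:
    """Turn ``"A|{x}|{y}"`` into a regex with one named group per placeholder."""
    # re.split with a capturing group alternates literal text and placeholder names
    parts = re.split(r"\{(\w+)\}", template)
    regex = ""
    for position, part in enumerate(parts):
        if position % 2 == 0:
            regex += re.escape(part)  # literal text, matched verbatim
        else:
            regex += f"(?P<{part}>.*?)"  # placeholder, captured by name
    return re.compile(regex)


pattern = template_to_regex("Secondary Energy|{carrier}|{fuel}")

match = pattern.fullmatch("Secondary Energy|Electricity|Solar")
extracted = match.groupdict()

# A variable that does not fit the template yields no match at all
no_match = pattern.fullmatch("Capacity|Electricity|Solar")
```

Each key of `extracted` corresponds to one new index level, here `carrier` and `fuel`.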

The inverse operation is to combine strings back together with `formatlevel`:

In [None]:
formatlevel(generation, variable="Secondary Energy|{carrier}|{fuel}", drop=True)
# or: generation.pix.format(variable="Secondary Energy|{carrier}|{fuel}")

With `generation` and `capacity` conveniently split into separate variables, we can calculate capacity factors (ratios of generation and capacity) directly, as long as we take care of removing the conflicting `unit` level. Similarly to `ismatch`, `isin` can be provided as an argument to `.loc[]` to select on named index levels with the difference that only exact matches are considered.

In [None]:
capacity_factor = generation.droplevel("unit") / 8760 / capacity.droplevel("unit")
capacity_factor.loc[isin(fuel=["Solar", "Wind", "Hydro", "Geothermal"]), 2030:2051]
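
The factor 8760 is the number of hours in a non-leap year: dividing annual generation in `GWh/yr` by it gives the average output in `GW`, which can then be compared with the installed capacity. A small worked example with made-up numbers:

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours, ignoring leap years


def capacity_factor(generation_gwh_per_yr: float, capacity_gw: float) -> float:
    """Ratio of average output to installed capacity (dimensionless)."""
    average_output_gw = generation_gwh_per_yr / HOURS_PER_YEAR
    return average_output_gw / capacity_gw


# Made-up plant: 1 GW installed, producing 4380 GWh over the year
cf = capacity_factor(4380.0, 1.0)
```

A plant that runs at full power for exactly half of the year ends up with a capacity factor of 0.5.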

Instead of dropping the `unit` level, there is also a set of unit-aware calculation functions, so that the full capacity factor calculation can be performed in very few steps (the unit-aware calculation correctly recognizes that the capacity factor is unitless):

In [None]:
generation = extractlevel(df, variable="Secondary Energy|{carrier}|{fuel}", drop=True)
capacity = extractlevel(df, variable="Capacity|{carrier}|{fuel}", drop=True)
generation.pix.unitdiv(capacity)

Under the hood, `isin` and `ismatch` generate `Selector` objects. They can be composed intuitively into complex queries, which are kept as a hierarchical structure of objects.

In [None]:
query = isin(fuel=["Coal", "Gas", "Nuclear"], unit="GW") & ~ismatch(fuel="S*")
query

To evaluate such a query, one needs to pass in a data object to produce a boolean mask. Since the pandas `.loc` indexer does exactly that, these queries work as expected.


In [None]:
query(capacity)
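
The pattern of composable, lazily evaluated selectors can be sketched in plain Python. The classes `Equals`, `And` and `Not` below are hypothetical stand-ins invented for this illustration; they only mirror the idea of a query tree that turns into a boolean mask once it is called with data, and are not the classes `pandas-indexing` defines.

```python
from dataclasses import dataclass


class Selector:
    """Base class: operators build a tree instead of evaluating anything."""

    def __and__(self, other):
        return And(self, other)

    def __invert__(self):
        return Not(self)


@dataclass
class Equals(Selector):
    key: str
    value: str

    def __call__(self, rows):
        return [row[self.key] == self.value for row in rows]


@dataclass
class And(Selector):
    left: Selector
    right: Selector

    def __call__(self, rows):
        return [a and b for a, b in zip(self.left(rows), self.right(rows))]


@dataclass
class Not(Selector):
    inner: Selector

    def __call__(self, rows):
        return [not a for a in self.inner(rows)]


# Made-up records standing in for index entries
rows = [
    {"fuel": "Coal", "unit": "GW"},
    {"fuel": "Solar", "unit": "GW"},
    {"fuel": "Coal", "unit": "GWh/yr"},
]

query = Equals("unit", "GW") & ~Equals("fuel", "Solar")  # nothing evaluated yet
mask = query(rows)  # evaluation happens only when data is passed in
```

Keeping the query as a tree of objects means it can be built once, inspected through its `repr`, and applied to different data objects later.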

````{note}
Placing a pandas boolean series **in front of** a selector is only supported from version 0.5.2 onwards; i.e.
```python
(capacity[2030] > 250) & isin(fuel=["Coal", "Gas", "Nuclear"], unit="GW")
```
works, as you would expect, in the same way as
```python
isin(fuel=["Coal", "Gas", "Nuclear"], unit="GW") & (capacity[2030] > 250)
```
````

In [None]:
high_capacity_fossil = capacity.loc[
    isin(fuel=["Coal", "Gas", "Nuclear"], unit="GW") & (capacity[2030] > 250),
    :2041,
]
high_capacity_fossil

Because this is an ordinary `.loc[]` operation, we can also use it to modify values in place:

In [None]:
high_capacity_fossil.loc[isin(fuel="Gas"), 2030:] = 1000.0
high_capacity_fossil

Most methods in `pandas_indexing` do not care whether they are run on an index, a series or a dataframe, but transparently take care of handing them down to the appropriate level:

In [None]:
fossil_series = (
    capacity.loc[isin(fuel=["Coal", "Gas", "Nuclear"]), [2030, 2040, 2050, 2060]]
    .rename_axis(columns="year")
    .stack()
)
fossil_series

In [None]:
fossil_series.loc[isin(year=[2030, 2050])]

In [None]:
isin(fossil_series.index, fuel="Nuclear")

# Selecting based on a multi-index

If we need pairs of data like `Coal` in 2030, `Gas` in 2035 and `Nuclear` in 2040 and 2050, then we can pass a `MultiIndex` to `isin`:

In [None]:
idx = pd.MultiIndex.from_tuples(
    [("Coal", 2030), ("Gas", 2035), ("Nuclear", 2040), ("Nuclear", 2050)],
    names=["fuel", "year"],
)
idx

In [None]:
fossil_series.loc[isin(idx)]

Since `("Gas", 2035)` is not part of the original `fossil_series`, it is silently ignored, just like with other uses of `isin`.

Alternatively, the same result can be retrieved with the more powerful `semijoin` using an `"inner"` join:

In [None]:
from pandas_indexing import semijoin


semijoin(
    fossil_series, idx, how="inner"
)  # or: fossil_series.pix.semijoin(idx, how="inner")


A `"right"` join, on the other hand, follows the order of the provided `idx` and keeps all of its elements. Since `("Gas", 2035)` is not part of the original `fossil_series`, it shows up as `NaN` here:

In [None]:
semijoin(fossil_series, idx, how="right")

It is also possible to get the inverted result, containing only the non-matching rows, with an `antijoin`:

In [None]:
from pandas_indexing import antijoin


antijoin(fossil_series, idx)
# or: fossil_series.pix.antijoin(idx)
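
For intuition, the three joins can be mimicked in vanilla pandas in the special case where the series index has exactly the levels of the lookup index. The following sketch uses made-up values; unlike `semijoin` and `antijoin`, it does not handle additional levels such as `model` or `region`.

```python
import pandas as pd

# Made-up series indexed by exactly the two levels used for the lookup
series = pd.Series(
    [10.0, 20.0, 30.0, 40.0],
    index=pd.MultiIndex.from_tuples(
        [("Coal", 2030), ("Coal", 2040), ("Nuclear", 2040), ("Nuclear", 2050)],
        names=["fuel", "year"],
    ),
)
wanted = pd.MultiIndex.from_tuples(
    [("Coal", 2030), ("Gas", 2035), ("Nuclear", 2040)],
    names=["fuel", "year"],
)

# "inner"-like: keep only rows whose index entry appears in `wanted`
inner_like = series.loc[series.index.isin(wanted)]

# "right"-like: follow the order of `wanted`, missing entries become NaN
right_like = series.reindex(wanted)

# anti-join-like: keep only rows whose index entry does not appear in `wanted`
anti_like = series.loc[~series.index.isin(wanted)]
```

The `("Gas", 2035)` entry is dropped from `inner_like`, shows up as `NaN` in `right_like`, and has no effect on `anti_like`.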

# Projecting levels

Often, after selecting the right subsets (i.e. the interesting `model` or `scenario`), it makes sense to consolidate the data to a given set of `levels`. That is what `projectlevel` is used for:

In [None]:
from pandas_indexing import projectlevel


simple_fossil_series = projectlevel(fossil_series, ["fuel", "year"])
# or: fossil_series.pix.project(["fuel", "year"])
simple_fossil_series

`projectlevel` reduces the levels attached to a multiindex to the ones explicitly named. It is basically the complement of `droplevel`, which removes the listed names:

In [None]:
projectlevel(fossil_series, ["model", "scenario"]) == fossil_series.droplevel(
    ["carrier", "fuel", "unit", "region", "year"]
)
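
The complement relationship can be spelled out in vanilla pandas: projecting onto a set of levels amounts to dropping every level that is not in that set. `project_with_droplevel` is a hypothetical helper for illustration, and the index entries are made up.

```python
import pandas as pd


def project_with_droplevel(index: pd.MultiIndex, keep: list) -> pd.Index:
    """Keep only the named levels by dropping all other levels (sketch)."""
    to_drop = [name for name in index.names if name not in keep]
    return index.droplevel(to_drop)


# Made-up index with four levels
index = pd.MultiIndex.from_tuples(
    [("model_a", "scen_a", "Coal", 2030), ("model_a", "scen_a", "Gas", 2030)],
    names=["model", "scenario", "fuel", "year"],
)

projected = project_with_droplevel(index, ["fuel", "year"])
```

Note that this sketch keeps the original level order, whereas a projection would typically also be expected to follow the order of the requested names.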

# Assigning to levels

`assignlevel` lets you modify the values of individual levels through keyword arguments:

In [None]:
from pandas_indexing import assignlevel


assignlevel(df, variable="Updated|" + projectlevel(df.index, "variable"), unit="bla")
# or: df.pix.assign(variable="Updated|" + df.index.pix.project("variable"), unit="bla")

This particular case is even more clearly handled with `formatlevel`:

In [None]:
from pandas_indexing import formatlevel


formatlevel(df, variable="Updated|{variable}", unit="bla")
# or: df.pix.format(variable=...)

Both functions avoid having to rely on pairs of `reset_index` and `set_index` calls like the following, which are painful for large data, since `set_index` is expensive!

In [None]:
df.reset_index().assign(variable="Capacity").set_index(df.index.names)

# Examining level values and level combinations

We have already seen how to get an overview of the available levels and their values with `describelevel`:

In [None]:
describelevel(df)  # or: df.pix

Often it is necessary to get programmatic access to the unique values of one or more levels:

In [None]:
from pandas_indexing import uniquelevel


uniquelevel(df, "variable")
# or: df.pix.unique("variable")
# or in vanilla pandas: df.index.unique("variable")

In [None]:
uniquelevel(df, ["variable", "unit"])

# BEWARE: Pitfalls

`pd.concat` ignores level order, so make sure to align the levels with `reorder_levels` first:

In [None]:
pd.concat([simple_fossil_series, simple_fossil_series.swaplevel()])

In [None]:
pd.concat(
    [
        simple_fossil_series,
        simple_fossil_series.swaplevel().reorder_levels(
            simple_fossil_series.index.names
        ),
    ]
)

Therefore, `pandas-indexing` brings a variant which does this automatically:

In [None]:
from pandas_indexing import concat


concat([simple_fossil_series, simple_fossil_series.swaplevel()])

# Additional helpful multi-index helpers

The rendering of a MultiIndex is often hard to read, since important information might get abbreviated away. In that case, converting it into a dataframe helps:

In [None]:
projectlevel(fossil_series.index, ["model", "scenario", "fuel"])

In [None]:
projectlevel(fossil_series.index, ["model", "scenario", "fuel"]).to_frame(index=False)