# balance: transformations and formulas

This tutorial focuses on how transformations, formulas, and penalty factors can be used when pre-processing the covariates before adjusting for them.

## Example dataset - preparing the objects

The following is a toy simulated dataset.

For a more basic walkthrough of the elements in the next code block, please take a look at the tutorial: [balance Quickstart: Analyzing and adjusting the bias on a simulated toy dataset](https://import-balance.org/docs/tutorials/quickstart/)


In [None]:
from balance import load_data
target_df, sample_df = load_data()
from balance import Sample
sample = Sample.from_frame(sample_df, outcome_columns=["happiness"])
target = Sample.from_frame(target_df, outcome_columns=["happiness"])
sample_with_target = sample.set_target(target)
sample_with_target

# Transformations

## Basic usage: manipulating existing variables

When trying to understand what an adjustment does, we can look at the model_coef items collected from the diagnostics method.

In [None]:
adjusted = sample_with_target.adjust(
    # method="ipw", # default method
    # transformations=None,
    # formula=None,
    # penalty_factor=None, # all 1s
    # max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")

As we can see from the glm coefficients, the age and gender groups got an extra NA column, and the income variable was bucketed into 10 buckets.

We can change these defaults by deciding on the specific transformation we want.

Let's start with NO transformations.

The transformations argument accepts either a dict or None. None indicates no transformations.

In [None]:
adjusted = sample_with_target.adjust(
    # method="ipw",
    transformations=None,
    # formula=formula,
    # penalty_factor=penalty_factor,
    # max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")

In this setting, income was treated as a numeric variable, with no transformations (e.g.: bucketing) on it.
Regardless of the transformations, the model matrix made sure to turn the gender and age_group into dummy variables (including a column for NA).
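
To make the idea of "dummy variables including a column for NA" concrete, here is a minimal pandas-only sketch. It is not the code balance uses to build its model matrix; the toy gender values are made up, and `pd.get_dummies(..., dummy_na=True)` is only a stand-in for the general idea.

```python
import numpy as np
import pandas as pd

# Toy categorical column with one missing value (values are made up for illustration).
gender = pd.Series(["Female", "Male", np.nan, "Female"], name="gender")

# One indicator column per level, plus an extra indicator for missing values.
dummies = pd.get_dummies(gender, prefix="gender", dummy_na=True)
print(dummies)
```

Each row has exactly one indicator switched on, and the row with the missing value is flagged only in the extra NA column.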


Next we can fit a simple transformation.

Let's say we want to lump together the age_group levels that each make up less than 25% of the data, and use a different bucketing for income. Here is how we'd do it:

In [None]:
from balance.util import fct_lump, quantize

transformations = {
    "age_group": lambda x: fct_lump(x, 0.25),
    "gender": lambda x: x,
    "income": lambda x: quantize(x.fillna(x.mean()), q=3),
}

adjusted = sample_with_target.adjust(
    # method="ipw",
    transformations=transformations,
    # formula=formula,
    # penalty_factor=penalty_factor,
    # max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")

As we can see - we managed to change the bucket sizes of income to have only 3 buckets, and we lumped the age_group to two groups (and collapsed together "small" buckets into the _lumped_other bucket).
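
For intuition about what lumping and quantile bucketing do to a column, here is a rough pandas-only sketch. The helper `lump_small_levels` is hypothetical and written just for this illustration; it is not the implementation of `fct_lump`, and `pd.qcut` is used as an assumed stand-in for the bucketing step, so edge cases may behave differently in balance.

```python
import numpy as np
import pandas as pd


def lump_small_levels(s: pd.Series, min_prop: float, other: str = "_lumped_other") -> pd.Series:
    """Hypothetical helper: replace levels rarer than min_prop with a single label."""
    props = s.value_counts(normalize=True)
    small = props[props < min_prop].index
    return s.where(~s.isin(small), other)


# Toy data: only "25-34" makes up at least 25% of the rows.
age_group = pd.Series(["18-24"] * 2 + ["25-34"] * 5 + ["35-44"] * 2 + ["45+"] * 1)
lumped = lump_small_levels(age_group, 0.25)
print(lumped.value_counts())

# Quantile bucketing: fill missing values with the mean, then cut into 3 equal-frequency bins.
income = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, np.nan])
buckets = pd.qcut(income.fillna(income.mean()), q=3)
print(buckets.value_counts())
```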

Lastly, notice that if we omit a variable from transformations, it will not be available for the model construction (This behavior might change in the future).

In [None]:
transformations = {
    # "age_group": lambda x: fct_lump(x, 0.25),
    "gender": lambda x: x,
    # "income": lambda x: quantize(x.fillna(x.mean()), q=3),
}

adjusted = sample_with_target.adjust(
    # method="ipw",
    transformations=transformations,
    # formula=formula,
    # penalty_factor=penalty_factor,
    # max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")

As we can see, only gender was included in the model.

In [None]:
# TODO: add more examples about how add_na works
# TODO: add more examples about rare values in categorical variables and how they are grouped together. 

## Creating new variables

In the next example we will create several new transformations of income. 

The info printed during the adjustment reports which variables were added, which were transformed, and what the final variables in the output are.

The x in the lambda function can have one of two meanings:
1. When the keys in the dict match the exact names of the variables in the DataFrame (e.g.: "income"), then the lambda function treats x as the pandas.Series of that variable.
2. If the name of the key does NOT exist in the DataFrame (e.g.: "income_squared"), then x will become the DataFrame of the data.
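
The two cases above can be sketched as a small dispatch rule. The function `apply_transformations_sketch` below is hypothetical and only mirrors the description in this tutorial; it is not balance's internal code, and details such as the order of evaluation are assumptions.

```python
import pandas as pd


def apply_transformations_sketch(df: pd.DataFrame, transformations: dict) -> pd.DataFrame:
    """Hypothetical re-implementation of the rule above, for illustration only."""
    out = {}
    for name, func in transformations.items():
        if name in df.columns:
            # Existing column: the function receives that column as a Series.
            out[name] = func(df[name])
        else:
            # New name: the function receives the whole DataFrame.
            out[name] = func(df)
    # Columns not listed in the dict are left out of the result.
    return pd.DataFrame(out, index=df.index)


df = pd.DataFrame({"gender": ["F", "M", "F"], "income": [1.0, 2.0, 3.0]})
result = apply_transformations_sketch(
    df,
    {
        "income": lambda x: x * 10,  # x is the "income" Series
        "income_squared": lambda x: x.income**2,  # x is the DataFrame
    },
)
print(result)
```

Here gender is dropped because it has no entry in the dict, which matches the omission behavior described earlier.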

In [None]:
from balance.util import fct_lump, quantize

transformations = {
    "age_group": lambda x: x,
    "gender": lambda x: x,
    "income": lambda x: x,
    "income_squared": lambda x: x.income**2,
    "income_buckets": lambda x: quantize(x.income.fillna(x.income.mean()), q=3),
}

adjusted = sample_with_target.adjust(
    # method="ipw",
    transformations=transformations,
    # formula=formula,
    # penalty_factor=penalty_factor,
    # max_de=None,
)
adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")

# Formula

The formula can accept a list of strings indicating how to combine the transformed variables together. It follows [the formula notation from patsy](https://patsy.readthedocs.io/en/latest/formulas.html).

For example, we can have an interaction between age_group and gender:

In [None]:
from balance.util import fct_lump_by, quantize

transformations = {
    "age_group": lambda x: x,
    "gender": lambda x: x,
    "income": lambda x: quantize(x.fillna(x.mean()), q=20),
}
formula = ["age_group * gender"]
# the penalty is per element in the formula list:
# penalty_factor = [0.1, 0.1, 0.1]

adjusted = sample_with_target.adjust(
    method="ipw",
    transformations=transformations,
    formula=formula,
    # penalty_factor=penalty_factor,
    # max_de=None,
)

adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")


As we can see, the formula gives us the combinations of age_group and gender, as well as the main effects of age_group and gender. Since income was not in the formula, it is not included in the model.
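
To see why `age_group * gender` yields both main effects and combinations, here is a pandas-only sketch that builds such columns by hand. It is a simplification under stated assumptions: the toy data is made up, every level gets its own indicator, and no reference level is dropped, so the exact columns will differ from what the formula machinery produces.

```python
import itertools

import pandas as pd

df = pd.DataFrame(
    {
        "age_group": ["18-24", "25-34", "18-24", "25-34"],
        "gender": ["Female", "Female", "Male", "Male"],
    }
)

# Main effects: one indicator column per level of each variable.
age = pd.get_dummies(df["age_group"], prefix="age_group").astype(int)
gen = pd.get_dummies(df["gender"], prefix="gender").astype(int)

# Interaction: the product of every age indicator with every gender indicator.
interaction = pd.DataFrame(
    {f"{a}:{g}": age[a] * gen[g] for a, g in itertools.product(age.columns, gen.columns)}
)

# "age_group * gender" is shorthand for "age_group + gender + age_group:gender".
design = pd.concat([age, gen, interaction], axis=1)
print(design.columns.tolist())
```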

# Formula and penalty_factor

The formula can be provided as several strings, and the penalty factor then indicates how strongly the model should adjust for each element of the formula. A larger penalty factor means that element will be corrected less.
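
The effect of per-term penalty factors can be illustrated with a small NumPy sketch. This is not how balance fits its model: the function `fit_penalized_logistic` is hypothetical, it uses a ridge-style penalty and plain gradient descent purely for simplicity, and the simulated data is made up. It only shows the general mechanism that a larger factor shrinks the matching coefficient more.

```python
import numpy as np


def fit_penalized_logistic(X, y, penalty_factor, lam=0.1, lr=0.5, n_iter=2000):
    """Logistic regression with a per-coefficient ridge penalty, fit by gradient descent."""
    pf = np.asarray(penalty_factor, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        # Gradient of the mean log-loss plus the penalty 0.5 * lam * sum(pf * beta**2).
        grad = X.T @ (p - y) / len(y) + lam * pf * beta
        beta -= lr * grad
    return beta


rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
# Both features matter equally in the simulated data.
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-(X[:, 0] + X[:, 1])))).astype(float)

beta_a = fit_penalized_logistic(X, y, penalty_factor=[10, 0.1])
beta_b = fit_penalized_logistic(X, y, penalty_factor=[0.1, 10])
print(beta_a, beta_b)
```

In each fit, the coefficient with the larger penalty factor ends up much closer to zero, even though both features are equally important in the simulated data.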

The next two examples show how in one case we focus on correcting for income, and in the second case we focus on correcting for age and gender.

In [None]:
transformations = {
    "age_group": lambda x: x,
    "gender": lambda x: x,
    "income": lambda x: x,
}
formula = ["age_group + gender", "income"]
# the penalty is per element in the formula list:
penalty_factor = [10, 0.1]

adjusted = sample_with_target.adjust(
    method="ipw",
    transformations=transformations,
    formula=formula,
    penalty_factor=penalty_factor,
    # max_de=None,
)

adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")


The above example focused the correction on income. As we can see, age and gender got 0 correction (since their penalty was so high). Let's now focus the correction on age and gender instead:

In [None]:
transformations = {
    "age_group": lambda x: x,
    "gender": lambda x: x,
    "income": lambda x: x,
}
formula = ["age_group + gender", "income"]
# the penalty is per element in the formula list:
penalty_factor = [0.1, 10]  # this is flipped

adjusted = sample_with_target.adjust(
    method="ipw",
    transformations=transformations,
    formula=formula,
    penalty_factor=penalty_factor,
    # max_de=None,
)

adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")


In the above case, income basically got 0 correction.

We can add two versions of income, and give each of them a higher penalty than age and gender:

In [None]:
from balance.util import fct_lump_by, quantize

transformations = {
    "age_group": lambda x: x,
    "gender": lambda x: x,
    "income": lambda x: x,
    "income_buckets": lambda x: quantize(x.income.fillna(x.income.mean()), q=4),
}
formula = ["age_group + gender", "income", "income_buckets"]
# the penalty is per element in the formula list:
penalty_factor = [1, 2, 2]

adjusted = sample_with_target.adjust(
    method="ipw",
    transformations=transformations,
    formula=formula,
    penalty_factor=penalty_factor,
    # max_de=None,
)

adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")


Another way is to create a formula for several variations of each variable, and give each a penalty of 1. For example:

In [None]:
from balance.util import fct_lump_by, quantize

transformations = {
    "age_group": lambda x: x,
    "gender": lambda x: x,
    "income": lambda x: x,
    "income_buckets": lambda x: quantize(x.income.fillna(x.income.mean()), q=4),
}
formula = ["age_group", "gender", "income + income_buckets"]
# the penalty is per element in the formula list:
penalty_factor = [1, 1, 1]

adjusted = sample_with_target.adjust(
    method="ipw",
    transformations=transformations,
    formula=formula,
    penalty_factor=penalty_factor,
    # max_de=None,
)

adj_diag = adjusted.diagnostics()
adj_diag.query("metric == 'model_coef'")


# The impact of transformations and formulas

## ipw

Using the above can have an impact on the final design effect, ASMD, and outcome. Here are several simple examples.
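
As a reminder of what these quantities measure, here is a minimal NumPy sketch. It assumes Kish's formula for the design effect and an ASMD that is scaled by the target's standard deviation; the functions and the toy numbers are written for this illustration only, and the exact definitions used in the package summaries may differ in details.

```python
import numpy as np


def kish_design_effect(w):
    """Kish's approximate design effect: n * sum(w^2) / (sum(w))^2."""
    w = np.asarray(w, dtype=float)
    return len(w) * np.sum(w**2) / np.sum(w) ** 2


def asmd(sample_x, sample_w, target_x):
    """Absolute difference of (weighted) means, scaled by the target standard deviation."""
    sample_mean = np.average(sample_x, weights=sample_w)
    return abs(sample_mean - np.mean(target_x)) / np.std(target_x)


sample_x = np.array([1.0, 2.0, 3.0, 4.0])
target_x = np.array([2.0, 3.0, 4.0, 5.0])

equal_w = np.ones(4)
adjusted_w = np.array([0.5, 0.5, 1.5, 1.5])  # made-up weights favoring larger values

print(kish_design_effect(equal_w), asmd(sample_x, equal_w, target_x))
print(kish_design_effect(adjusted_w), asmd(sample_x, adjusted_w, target_x))
```

The made-up weights move the sample mean closer to the target (lower ASMD) at the cost of a larger design effect, which is the trade-off to keep in mind when reading the summaries below.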

In [None]:
# Defaults from the package

adjusted = sample_with_target.adjust(
    # max_de=None,
)

print(adjusted.summary())
print(adjusted.outcomes().summary())
adjusted.covars().plot(library = "seaborn", dist_type = "kde")

In [None]:
# No transformations at all

# transformations = None is just like using:
# transformations = {
#     "age_group": lambda x: x,
#     "gender": lambda x: x,
#     "income": lambda x: x,
# }

adjusted = sample_with_target.adjust(
    method="ipw",
    transformations=None,
    # formula=formula,
    # penalty_factor=penalty_factor,
    # max_de=None,
)

print(adjusted.summary())
print(adjusted.outcomes().summary())
adjusted.covars().plot(library = "seaborn", dist_type = "kde")

# slightly smaller design effect, slightly better ASMD reduction.

In [None]:
# No transformations at all
transformations = None
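# But passing a squared term of income to the formula.
# Note: in patsy, ** is a formula operator rather than plain arithmetic,
# so a literal square would likely need to be written as I(income**2).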
formula = ["age_group + gender + income + income**2"]
# the penalty is per element in the formula list:
# penalty_factor = [1]

adjusted = sample_with_target.adjust(
    method="ipw",
    transformations=transformations,
    formula=formula,
    # penalty_factor=penalty_factor,
    # max_de=None,
)

print(adjusted.summary())
print(adjusted.outcomes().summary())
adjusted.covars().plot(library = "seaborn", dist_type = "kde")

# Adding income**2 to the formula led to lower Deff but also lower ASMD reduction.

In [None]:
transformations = {
    "age_group": lambda x: x,
    "gender": lambda x: x,
    "income": lambda x: x,
    "income_buckets": lambda x: quantize(x.income.fillna(x.income.mean()), q=20),
}
formula = ["age_group + gender", "income_buckets"]
# the penalty is per element in the formula list:
penalty_factor = [1, 0.1]

adjusted = sample_with_target.adjust(
    method="ipw",
    transformations=transformations,
    formula=formula,
    penalty_factor=penalty_factor,
    # max_de=None,
)

print(adjusted.summary())
print(adjusted.outcomes().summary())
adjusted.covars().plot(library = "seaborn", dist_type = "kde")

# By adding income_buckets and using it instead of income, as well as putting more weight in it in terms of penalty
# we managed to correct income quite well, but at the expense of age and gender.

## CBPS

Let's see if we can improve on CBPS a bit.

In [None]:
# Defaults from the package

adjusted = sample_with_target.adjust(
    method = "cbps",
    # max_de=None,
)

print(adjusted.summary())
print(adjusted.outcomes().summary())
adjusted.covars().plot(library = "seaborn", dist_type = "kde")

# CBPS already corrects a lot. Let's see if we can make it correct a tiny bit more.

In [None]:
import numpy as np

# Custom transformations: log and bucketed versions of income (raw income is left out)
transformations = {
    "age_group": lambda x: x,
    "gender": lambda x: x,
    # "income": lambda x: x,
    "income_log": lambda x: np.log(x.income.fillna(x.income.mean())),
    "income_buckets": lambda x: quantize(x.income.fillna(x.income.mean()), q=5),
}
formula = ["age_group + gender + income_log * income_buckets"]

adjusted = sample_with_target.adjust(
    method="cbps",
    transformations=transformations,
    formula=formula,
    # penalty_factor=penalty_factor, # CBPS seems to ignore the penalty factor.
    # max_de=None,
)

print(adjusted.summary())
print(adjusted.outcomes().summary())
adjusted.covars().plot(library="seaborn", dist_type="kde")

# Trying various transformations gives slightly different results (some effect on the outcome, Deff and ASMD) - but nothing too major here.

In [None]:
# Session info
import session_info
session_info.show(html=False, dependencies=True)