# balance Quickstart: Analyzing and adjusting the bias on a simulated toy dataset

'balance' is a Python package maintained and released by the Core Data Science Tel-Aviv team at Meta. 'balance' performs and evaluates bias reduction by weighting for a broad set of experimental and observational use cases.

Although balance is written in Python, you don't need a deep understanding of Python to use it. In fact, you can take this notebook, load your data, change some variables, and re-run it to produce your own weights!

This quickstart demonstrates re-weighting a specific simulated dataset. If you have a different use case or want more detailed documentation, check out the comprehensive balance tutorial.

## Analysis

There are four main steps to analysis with balance:
- load data
- check diagnostics before adjustment
- perform adjustment + check diagnostics
- output results

Let's dive right in!

## Example dataset

The following is a toy simulated dataset.

In [None]:
import warnings
warnings.filterwarnings("ignore")

from balance import load_data

In [None]:
target_df, sample_df = load_data()

print("target_df: \n", target_df.head())
print("sample_df: \n", sample_df.head())

In [None]:
target_df.head().round(2).to_dict()
# sample_df.shape

In practice, one can use a pandas loading function (such as `read_csv()`) to import data into the DataFrame objects `sample_df` and `target_df`.
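
As a minimal sketch of that loading step, the snippet below reads two small CSV sources with `pandas.read_csv()`. The in-memory CSV text, its column names, and its values are hypothetical stand-ins for real files; with your own data you would pass file paths instead.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for real files on disk.
sample_csv = io.StringIO(
    "id,gender,age_group,income,happiness\n"
    "1,Female,18-24,3.2,55.1\n"
    "2,Male,25-34,7.9,41.0\n"
)
target_csv = io.StringIO(
    "id,gender,age_group,income\n"
    "10,Male,45+,12.4\n"
    "11,Female,35-44,5.6\n"
)

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(sample_csv)
target_df = pd.read_csv(target_csv)

print(sample_df.shape, target_df.shape)
```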

# Load data into a Sample object

The first thing to do is to import the `Sample` class from balance. All of the data we're going to be working with, sample or population, will be stored in objects of the `Sample` class.

In [None]:
from balance import Sample

We use the `Sample` class to hold both the "sample" we want to adjust and the "target" we want to adjust towards.

We turn the two input pandas DataFrame objects we created (or loaded) into `balance.Sample` objects by using the `.from_frame()` method.

In [None]:
sample = Sample.from_frame(sample_df, outcome_columns=["happiness"])
target = Sample.from_frame(target_df)

Using the `.df` property, we can see the DataFrame stored in `sample`. Note the new weight column (filled with 1s) that was added when the DataFrame was imported into a `balance.Sample` object.

In [None]:
sample.df.info()

We can get a quick text overview of each Sample object by just calling it.

Let's take a look at what this produces:

In [None]:
sample

In [None]:
target

Next, we combine the sample object with the target object. This is what will allow us to adjust the sample to the target.

In [None]:
sample_with_target = sample.set_target(target)

Looking at `sample_with_target` now, we can see that it has the target attached:

In [None]:
sample_with_target

# Pre-Adjustment Diagnostics

We can use `.covars()` and then follow up with `.mean()` and `.plot()` (barplots and qqplots) to get some basic diagnostics on the data.

We can see how:
- The proportion of missing values in gender is similar in sample and target.
- We have younger people in the sample as compared to the target.
- We have more females than males in the sample, compared to a roughly 50-50 split in the (non-NA) target.
- Income is more right skewed in the target as compared to the sample.

In [None]:
print(sample_with_target.covars().mean().T)

In [None]:
print(sample_with_target.covars().asmd().T)
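
ASMD stands for absolute standardized mean difference. The following is a simplified, unweighted sketch for a single numeric covariate, assuming the difference in means is scaled by the target's standard deviation. The function name and the toy values are hypothetical, and balance's own computation also accounts for weights and may differ in detail.

```python
import statistics


def asmd(sample_values, target_values):
    """Absolute difference in means, in units of the target's standard deviation."""
    mean_diff = statistics.fmean(sample_values) - statistics.fmean(target_values)
    return abs(mean_diff) / statistics.stdev(target_values)


# Hypothetical toy values for one covariate.
sample_income = [2.0, 4.0, 6.0]
target_income = [4.0, 8.0, 12.0]

income_asmd = asmd(sample_income, target_income)
print(income_asmd)
```

A value near 0 means the sample and target means are close on that covariate; larger values indicate more imbalance.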

In [None]:
print(sample_with_target.covars().asmd(aggregate_by_main_covar = True).T)
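
To illustrate what aggregating by main covariate could look like, here is a rough sketch that averages per-column ASMD values over the levels of each categorical covariate. The column naming scheme (`covariate[level]`), the ASMD values, and the use of a simple mean are assumptions made for illustration, not a description of balance's internals.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical per-column ASMD values; categorical covariates appear once per level.
asmd_by_column = {
    "age_group[18-24]": 0.2,
    "age_group[25-34]": 0.4,
    "gender[Female]": 0.3,
    "gender[Male]": 0.3,
    "income": 0.5,
}

# Group columns by the covariate name that precedes the "[" of the level suffix.
grouped = defaultdict(list)
for column, value in asmd_by_column.items():
    main_covar = column.split("[", 1)[0]
    grouped[main_covar].append(value)

# One averaged ASMD value per main covariate.
asmd_by_covar = {name: fmean(values) for name, values in grouped.items()}
print(asmd_by_covar)
```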

In [None]:
sample_with_target.covars().plot()

# Adjusting Sample to Population

Next, we adjust the sample to the target. The default method is 'ipw', which computes inverse probability (propensity) weights from a logistic regression with lasso regularization.

In [None]:
# Using ipw to fit survey weights
adjusted = sample_with_target.adjust(max_de=None)
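
To build intuition for inverse propensity weighting, the sketch below works through a deliberately simplified case with a single categorical covariate. Instead of fitting a regularized logistic regression as balance does, it estimates the propensity of being in the sample directly from cell counts, so it is an illustration of the idea rather than balance's implementation. All data and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical data: the sample over-represents females relative to the target.
sample_gender = ["Female"] * 6 + ["Male"] * 2
target_gender = ["Female"] * 5 + ["Male"] * 5

sample_counts = Counter(sample_gender)
target_counts = Counter(target_gender)

weights = []
for gender in sample_gender:
    # Estimated P(unit belongs to the sample | gender) from the pooled counts.
    p = sample_counts[gender] / (sample_counts[gender] + target_counts[gender])
    # Inverse propensity (odds) weight: units that are rare in the sample get larger weights.
    weights.append((1 - p) / p)

female_weight = sum(w for w, g in zip(weights, sample_gender) if g == "Female")
weighted_female_share = female_weight / sum(weights)
print(weighted_female_share)
```

After weighting, the female share in the sample matches the 50% share in the toy target.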

In [None]:
print(adjusted)

# Evaluation of the Results

We can get a basic summary of the results:

In [None]:
print(adjusted.summary())

In [None]:
print(adjusted.covars().mean().T)

We see an improvement in the average ASMD. We can look at a detailed list of ASMD values per variable using the following call.

In [None]:
print(adjusted.covars().asmd().T)

It's easier to learn about the biases by just running `.covars().plot()` on our adjusted object.

In [None]:
adjusted.covars().plot()

We can also use different plots, using the seaborn library, for example with the "kde" dist_type.

In [None]:
# This shows how we could use seaborn to plot a kernel density estimation
adjusted.covars().plot(library = "seaborn", dist_type = "kde")

### Understanding the weights

We can look at the distribution of weights using the following call.

In [None]:
adjusted.weights().plot()

And get the design effect using:

In [None]:
adjusted.weights().design_effect()
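
A common definition of the design effect is Kish's formula, n * sum(w^2) / (sum(w))^2. The sketch below assumes that this is the quantity of interest; the function name and the example weights are hypothetical.

```python
def kish_design_effect(weights):
    """Kish's design effect: n * sum(w^2) / (sum(w))^2."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


# Equal weights carry no variance penalty.
equal_deff = kish_design_effect([1.0, 1.0, 1.0, 1.0])

# One dominant weight inflates the design effect.
unequal_weights = [1.0, 1.0, 1.0, 5.0]
unequal_deff = kish_design_effect(unequal_weights)

# Effective sample size implied by the design effect.
effective_n = len(unequal_weights) / unequal_deff

print(equal_deff, unequal_deff, effective_n)
```

A design effect of 1 corresponds to equal weights; the more the weights vary, the larger the design effect and the smaller the effective sample size.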

# Outcome analysis

In [None]:
print(adjusted.outcomes().summary())

The estimated mean happiness according to our sample is 48 without any adjustment and 54 with adjustment. The following shows the distribution of happiness:

In [None]:
adjusted.outcomes().plot()
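
The adjusted outcome estimate is, in essence, a weighted mean of the outcome. The toy sketch below shows how weights shift the mean; the values and weights are hypothetical and are not the ones produced in this notebook.

```python
def weighted_mean(values, weights):
    """Mean of `values` where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical outcome values and weights.
happiness = [40.0, 50.0, 60.0]
unit_weights = [1.0, 1.0, 1.0]      # before adjustment, every unit counts equally
adjusted_weights = [1.0, 1.0, 2.0]  # after adjustment, the last unit counts double

unadjusted_mean = weighted_mean(happiness, unit_weights)
adjusted_mean = weighted_mean(happiness, adjusted_weights)
print(unadjusted_mean, adjusted_mean)
```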

# Downloading data

Finally, we can prepare the data to be downloaded for future analyses.

In [None]:
adjusted.to_download()

In [None]:
# We can prepare the data to be exported as csv - showing the first 500 characters for simplicity:
adjusted.to_csv()[0:500]
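
Since the call above yields CSV text, one way to persist it is to write the string to a file. The sketch below uses a hypothetical `csv_text` placeholder in place of the real output, along with a hypothetical file name, and reads the file back to confirm the round trip.

```python
import csv
import io
import tempfile
from pathlib import Path

# Hypothetical stand-in for the string returned by `adjusted.to_csv()`.
csv_text = "id,weight\n1,0.8\n2,1.2\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical output file name inside a temporary directory.
    out_path = Path(tmp_dir) / "adjusted_weights.csv"
    out_path.write_text(csv_text, encoding="utf-8")

    # Read the file back and parse it into one dict per row.
    saved_text = out_path.read_text(encoding="utf-8")
    rows = list(csv.DictReader(io.StringIO(saved_text)))

print(rows)
```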

In [None]:
# Session info
import session_info
session_info.show(html=False, dependencies=True)