# balance Quickstart: Inverse Probability Weighting (IPW)

This quickstart demonstrates how to compute inverse probability weighting (IPW) weights with `balance` using both the default logistic regression model and alternative scikit-learn classifiers.

The examples below use the simulated data available in ``balance.datasets.load_sim_data``.


In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

import balance
from balance.weighting_methods.ipw import ipw


INFO (2025-11-26 14:04:41,642) [__init__/<module> (line 70)]: Using balance version 0.12.x



Welcome to balance (Version 0.12.x)!
An open-source Python package for balancing biased data samples.

📖 Documentation: https://import-balance.org/
🛠️ Get Help / Report Issues: https://github.com/facebookresearch/balance/issues/
📄 Citation:
    Sarig, T., Galili, T., & Eilat, R. (2023).
    balance - a Python package for balancing biased data samples.
    https://arxiv.org/abs/2307.06024

Tip: You can access this information at any time with balance.help()



## Load simulated sample and target data

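The helper ``balance.datasets.load_sim_data`` returns two DataFrames, the target first and the sample second, each with demographic variables (``gender``, ``age_group``, ``income``) and a ``happiness`` column. For this example we use uniform design weights, so every row in both DataFrames starts with a weight of 1.0.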


In [2]:
target_df, sample_df = balance.datasets.load_sim_data()

sample_weights = pd.Series(1.0, index=sample_df.index)
target_weights = pd.Series(1.0, index=target_df.index)
variables = ["gender", "age_group", "income"]

sample_df.head()


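|   | id | gender | age_group | income | happiness |
|---|----|--------|-----------|--------|-----------|
| 0 | 0 | Male | 25-34 | 6.428659 | 26.043029 |
| 1 | 1 | Female | 18-24 | 9.94028 | 66.885485 |
| 2 | 2 | Male | 18-24 | 2.673623 | 37.091922 |
| 3 | 3 | NaN | 18-24 | 10.550308 | 49.39405 |
| 4 | 4 | NaN | 18-24 | 2.689994 | 72.304208 |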


## Estimate IPW weights with logistic regression (default)

Passing ``model="sklearn"`` fits a logistic regression with the package defaults. You can customize the logistic regression arguments with ``logistic_regression_kwargs``; here ``max_iter`` is raised to 2000 so the solver has more iterations in which to converge.


In [3]:
ipw_result_lr = ipw(
    sample_df,
    sample_weights,
    target_df,
    target_weights,
    variables=variables,
    model="sklearn",
    logistic_regression_kwargs={"max_iter": 2000},
)

ipw_result_lr["weight"].head()


INFO (2025-11-26 14:04:41,694) [ipw/ipw (line 552)]: Starting ipw function


INFO (2025-11-26 14:04:41,699) [adjustment/apply_transformations (line 418)]: Adding the variables: []


INFO (2025-11-26 14:04:41,700) [adjustment/apply_transformations (line 419)]: Transforming the variables: ['gender', 'age_group', 'income']


INFO (2025-11-26 14:04:41,715) [adjustment/apply_transformations (line 456)]: Final variables in output: ['gender', 'age_group', 'income']


INFO (2025-11-26 14:04:41,727) [ipw/ipw (line 586)]: Building model matrix


INFO (2025-11-26 14:04:41,856) [ipw/ipw (line 608)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']


INFO (2025-11-26 14:04:41,858) [ipw/ipw (line 611)]: The number of columns in the model matrix: 16


INFO (2025-11-26 14:04:41,858) [ipw/ipw (line 612)]: The number of rows in the model matrix: 11000


INFO (2025-11-26 14:05:05,796) [ipw/ipw (line 779)]: Done with sklearn


INFO (2025-11-26 14:05:05,797) [ipw/ipw (line 781)]: max_de: None


INFO (2025-11-26 14:05:05,797) [ipw/ipw (line 803)]: Starting model selection


INFO (2025-11-26 14:05:05,800) [ipw/ipw (line 836)]: Chosen lambda: 0.041158338186664825


INFO (2025-11-26 14:05:05,802) [ipw/ipw (line 852)]: Proportion null deviance explained 0.172637976731584


0    6.531728
1    9.617159
2    3.562973
3    6.952117
4    5.129230
dtype: float64
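
To make the link between a propensity model and the weights concrete, here is a minimal sketch of the textbook odds form of IPW. It assumes each sample row already has an estimated probability `p` of belonging to the sample rather than the target. The function name `ipw_weights_from_propensity` and its arguments are hypothetical, and `balance` may differ in details such as regularization, trimming, and normalization.

```python
def ipw_weights_from_propensity(propensities, design_weights, target_total):
    """Convert P(row is in the sample | covariates) into normalized IPW weights.

    Illustrative only: this is the textbook odds form, not balance's code.
    """
    raw = []
    for p, d in zip(propensities, design_weights):
        if not 0.0 < p < 1.0:
            raise ValueError("propensities must lie strictly between 0 and 1")
        # Odds of belonging to the target rather than the sample,
        # scaled by the row's design weight.
        raw.append(d * (1.0 - p) / p)
    # Rescale so the weights sum to the (weighted) size of the target.
    scale = target_total / sum(raw)
    return [w * scale for w in raw]


# Toy propensities: the first row looks very "sample-like", the last does not.
propensities = [0.8, 0.5, 0.2]
weights = ipw_weights_from_propensity(
    propensities, design_weights=[1.0, 1.0, 1.0], target_total=100.0
)
print(weights)
```

Rows that are over-represented in the sample (high `p`) receive small weights, and rows that resemble the target more than the sample (low `p`) receive large ones.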

## Estimate IPW weights with a custom scikit-learn classifier

You can supply any classifier implementing ``fit`` and ``predict_proba``. Here we use a random forest to capture nonlinear relationships.


In [4]:
rf_model = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    random_state=42,
    n_jobs=-1,
)

ipw_result_rf = ipw(
    sample_df,
    sample_weights,
    target_df,
    target_weights,
    variables=variables,
    model=rf_model,
)

ipw_result_rf["weight"].head()


INFO (2025-11-26 14:05:05,855) [ipw/ipw (line 552)]: Starting ipw function


INFO (2025-11-26 14:05:05,858) [adjustment/apply_transformations (line 418)]: Adding the variables: []


INFO (2025-11-26 14:05:05,859) [adjustment/apply_transformations (line 419)]: Transforming the variables: ['gender', 'age_group', 'income']


INFO (2025-11-26 14:05:05,871) [adjustment/apply_transformations (line 456)]: Final variables in output: ['gender', 'age_group', 'income']


INFO (2025-11-26 14:05:05,882) [ipw/ipw (line 586)]: Building model matrix


INFO (2025-11-26 14:05:06,005) [ipw/ipw (line 608)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']


INFO (2025-11-26 14:05:06,006) [ipw/ipw (line 611)]: The number of columns in the model matrix: 16


INFO (2025-11-26 14:05:06,006) [ipw/ipw (line 612)]: The number of rows in the model matrix: 11000


INFO (2025-11-26 14:05:06,979) [ipw/ipw (line 779)]: Done with sklearn


INFO (2025-11-26 14:05:06,980) [ipw/ipw (line 781)]: max_de: None


INFO (2025-11-26 14:05:06,983) [ipw/ipw (line 836)]: Chosen lambda: nan


INFO (2025-11-26 14:05:06,985) [ipw/ipw (line 852)]: Proportion null deviance explained 0.20840028289431922


0    6.848176
1    6.611668
2    1.834019
3    5.237445
4    7.586927
dtype: float64
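
Before handing a custom estimator to `ipw`, it can help to confirm that it exposes the two methods mentioned above. The check below is a hedged sketch of such a pre-flight test; `looks_like_probabilistic_classifier` and both toy classes are hypothetical names used only for illustration, and this is not how `balance` validates its `model` argument.

```python
def looks_like_probabilistic_classifier(model):
    """Return True if `model` has callable `fit` and `predict_proba` attributes."""
    return all(
        callable(getattr(model, name, None)) for name in ("fit", "predict_proba")
    )


class ConstantClassifier:
    """Toy classifier that always predicts the training share of class 1."""

    def fit(self, X, y, sample_weight=None):
        self.p_ = sum(y) / len(y)
        return self

    def predict_proba(self, X):
        # One [P(class 0), P(class 1)] pair per input row.
        return [[1.0 - self.p_, self.p_] for _ in X]


class NoProbabilities:
    """Toy estimator that can be fitted but cannot output probabilities."""

    def fit(self, X, y, sample_weight=None):
        return self


print(looks_like_probabilistic_classifier(ConstantClassifier()))
print(looks_like_probabilistic_classifier(NoProbabilities()))
```

Some scikit-learn estimators, such as `LinearSVC`, have no `predict_proba` method and would fail this check.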

## Comparing outputs

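Both calls return a dictionary whose ``"weight"`` entry holds the computed weights, alongside model metadata. In the runs above, the logged proportion of null deviance explained is about 0.17 for logistic regression and about 0.21 for the random forest. A better-fitting propensity model does not by itself guarantee better weights, so inspect the distribution of the weights and downstream balance metrics to decide which classifier best fits your data.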
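
One common way to summarize a weight distribution is Kish's design effect, `n * sum(w^2) / (sum(w))^2`, together with the effective sample size `n / deff`. The helpers below are a minimal standard-library sketch with hypothetical names; they are not part of the `balance` API.

```python
def kish_design_effect(weights):
    """Kish's design effect: n * sum(w^2) / (sum(w))^2."""
    n = len(weights)
    total = sum(weights)
    if n == 0 or total <= 0:
        raise ValueError("weights must be non-empty with a positive sum")
    return n * sum(w * w for w in weights) / (total * total)


def effective_sample_size(weights):
    """Number of equally weighted rows carrying the same information."""
    return len(weights) / kish_design_effect(weights)


# Uniform weights lose nothing; uneven weights shrink the effective sample.
uniform = [1.0, 1.0, 1.0, 1.0]
uneven = [1.0, 1.0, 1.0, 3.0]
print(kish_design_effect(uniform), effective_sample_size(uniform))
print(kish_design_effect(uneven), effective_sample_size(uneven))
```

To compare the two fits from this quickstart, you could pass `ipw_result_lr["weight"].tolist()` and `ipw_result_rf["weight"].tolist()` to these helpers. A larger design effect means the weights are more variable, which usually means noisier weighted estimates.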
