# balance Quickstart: Inverse Probability Weighting (IPW)

This quickstart demonstrates how to compute inverse probability weighting (IPW) weights with `balance` using both the default logistic regression model and alternative scikit-learn classifiers.

The examples below use the simulated data available in ``balance.datasets.load_sim_data``.


In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

import balance
from balance.weighting_methods.ipw import ipw


INFO (2025-11-26 15:44:56,937) [__init__/<module> (line 70)]: Using balance version 0.12.x



Welcome to balance (Version 0.12.x)!
An open-source Python package for balancing biased data samples.

📖 Documentation: https://import-balance.org/
🛠️ Get Help / Report Issues: https://github.com/facebookresearch/balance/issues/
📄 Citation:
    Sarig, T., Galili, T., & Eilat, R. (2023).
    balance - a Python package for balancing biased data samples.
    https://arxiv.org/abs/2307.06024

Tip: You can access this information at any time with balance.help()



## Load simulated sample and target data

The helper ``balance.datasets.load_sim_data`` returns a target DataFrame and a sample DataFrame, in that order, each containing demographic variables. This example uses uniform design weights, so every row in both DataFrames gets a weight of 1.0.


In [2]:
target_df, sample_df = balance.datasets.load_sim_data()

sample_weights = pd.Series(1.0, index=sample_df.index)
target_weights = pd.Series(1.0, index=target_df.index)
variables = ["gender", "age_group", "income"]

sample_df.head()


  id  gender age_group     income  happiness
0  0    Male     25-34   6.428659  26.043029
1  1  Female     18-24   9.940280  66.885485
2  2    Male     18-24   2.673623  37.091922
3  3     NaN     18-24  10.550308  49.394050
4  4     NaN     18-24   2.689994  72.304208


## Estimate IPW weights with logistic regression (default)

Passing ``model="sklearn"`` fits a logistic regression with the package's default settings. To customize the solver, regularization, or convergence settings, pass a configured ``sklearn.linear_model.LogisticRegression`` instance via the ``model`` argument instead, as the cell below does with ``max_iter=2000`` and ``solver="liblinear"``.


In [3]:
ipw_result_lr = ipw(
    sample_df,
    sample_weights,
    target_df,
    target_weights,
    variables=variables,
    model=LogisticRegression(max_iter=2000, solver="liblinear"),
)

ipw_result_lr["weight"].head()


INFO (2025-11-26 15:44:56,994) [ipw/ipw (line 546)]: Starting ipw function


INFO (2025-11-26 15:44:56,998) [adjustment/apply_transformations (line 418)]: Adding the variables: []


INFO (2025-11-26 15:44:56,998) [adjustment/apply_transformations (line 419)]: Transforming the variables: ['gender', 'age_group', 'income']


INFO (2025-11-26 15:44:57,012) [adjustment/apply_transformations (line 456)]: Final variables in output: ['gender', 'age_group', 'income']


INFO (2025-11-26 15:44:57,025) [ipw/ipw (line 580)]: Building model matrix


INFO (2025-11-26 15:44:57,157) [ipw/ipw (line 602)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']


INFO (2025-11-26 15:44:57,158) [ipw/ipw (line 605)]: The number of columns in the model matrix: 16


INFO (2025-11-26 15:44:57,159) [ipw/ipw (line 606)]: The number of rows in the model matrix: 11000


INFO (2025-11-26 15:44:57,190) [ipw/ipw (line 767)]: Done with sklearn


INFO (2025-11-26 15:44:57,191) [ipw/ipw (line 769)]: max_de: None


INFO (2025-11-26 15:44:57,194) [ipw/ipw (line 824)]: Chosen lambda: nan


INFO (2025-11-26 15:44:57,195) [ipw/ipw (line 840)]: Proportion null deviance explained 0.18270004840760057


0    5.721442
1    7.641214
2    2.243185
3    4.943831
4    3.441654
dtype: float64

## Estimate IPW weights with a custom scikit-learn classifier

You can supply any classifier implementing ``fit`` and ``predict_proba``. Here we use a random forest to capture nonlinear relationships.


In [4]:
rf_model = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    random_state=42,
    n_jobs=-1,
)

ipw_result_rf = ipw(
    sample_df,
    sample_weights,
    target_df,
    target_weights,
    variables=variables,
    model=rf_model,
)

ipw_result_rf["weight"].head()


INFO (2025-11-26 15:44:57,204) [ipw/ipw (line 546)]: Starting ipw function


INFO (2025-11-26 15:44:57,207) [adjustment/apply_transformations (line 418)]: Adding the variables: []


INFO (2025-11-26 15:44:57,208) [adjustment/apply_transformations (line 419)]: Transforming the variables: ['gender', 'age_group', 'income']


INFO (2025-11-26 15:44:57,220) [adjustment/apply_transformations (line 456)]: Final variables in output: ['gender', 'age_group', 'income']


INFO (2025-11-26 15:44:57,231) [ipw/ipw (line 580)]: Building model matrix


INFO (2025-11-26 15:44:57,343) [ipw/ipw (line 602)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']


INFO (2025-11-26 15:44:57,344) [ipw/ipw (line 605)]: The number of columns in the model matrix: 16


INFO (2025-11-26 15:44:57,345) [ipw/ipw (line 606)]: The number of rows in the model matrix: 11000


INFO (2025-11-26 15:44:58,344) [ipw/ipw (line 767)]: Done with sklearn


INFO (2025-11-26 15:44:58,344) [ipw/ipw (line 769)]: max_de: None


INFO (2025-11-26 15:44:58,347) [ipw/ipw (line 824)]: Chosen lambda: nan


INFO (2025-11-26 15:44:58,348) [ipw/ipw (line 840)]: Proportion null deviance explained 0.20840028289431922


0    6.848176
1    6.611668
2    1.834019
3    5.237445
4    7.586927
dtype: float64
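
To make the ``fit`` / ``predict_proba`` requirement concrete, here is a minimal sketch of the general IPW idea using only NumPy and scikit-learn. It is an illustration under stated assumptions, not balance's actual implementation: the helper ``ipw_weights_from_classifier`` is hypothetical, the data is synthetic, and details such as class balancing, design weights, and weight trimming are left out. The classifier learns to distinguish sample rows from target rows, and each sample row is then weighted by the odds of belonging to the target.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


def ipw_weights_from_classifier(clf, X_sample, X_target):
    """Hypothetical helper (not part of balance): odds-based IPW weights."""
    X = np.vstack([X_sample, X_target])
    # Label 1 marks rows from the sample, 0 marks rows from the target.
    y = np.concatenate(
        [np.ones(len(X_sample), dtype=int), np.zeros(len(X_target), dtype=int)]
    )
    clf.fit(X, y)

    # Probability that each sample row belongs to the sample (class 1).
    class_one_col = list(clf.classes_).index(1)
    p = clf.predict_proba(X_sample)[:, class_one_col]
    # Clip to avoid division by zero for very confident classifiers.
    p = np.clip(p, 1e-6, 1 - 1e-6)

    # Inverse-odds weights: rows that look "target-like" get larger weights.
    w = (1 - p) / p
    # Rescale so the weights sum to the target size (an assumed convention).
    return w * len(X_target) / w.sum()


# Synthetic covariates: the sample is shifted relative to the target.
rng = np.random.default_rng(0)
X_sample = rng.normal(loc=0.5, scale=1.0, size=(500, 2))
X_target = rng.normal(loc=0.0, scale=1.0, size=(2000, 2))

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
weights = ipw_weights_from_classifier(clf, X_sample, X_target)
```

Any estimator exposing these two methods can be swapped in for ``GradientBoostingClassifier`` in this sketch. The quality of the resulting weights depends on how well calibrated the predicted probabilities are.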

## Comparing outputs

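Both calls return a dictionary that holds the computed weights under the ``"weight"`` key, alongside model metadata. In the runs above, the random forest explains a slightly larger share of the null deviance (about 0.21 versus 0.18 for logistic regression), but a better-fitting propensity model does not by itself guarantee better weights. To decide which classifier suits your data, compare the weight distributions (their spread and extreme values) and downstream balance metrics between the weighted sample and the target.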
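
One way to compare weight distributions is sketched below with plain pandas. The helper ``kish_design_effect`` is a hypothetical name introduced for this example; it computes Kish's approximate design effect, n · Σw² / (Σw)², which equals 1 for uniform weights and grows as the weights become more variable. For illustration only, the two series contain just the five weights printed by the cells above; in practice you would pass the full ``ipw_result_lr["weight"]`` and ``ipw_result_rf["weight"]`` series.

```python
import pandas as pd


def kish_design_effect(w: pd.Series) -> float:
    """Hypothetical helper: Kish's design effect, n * sum(w^2) / sum(w)^2."""
    return float(len(w) * (w**2).sum() / w.sum() ** 2)


# First five weights from each run above (illustrative subset only).
lr_head = pd.Series([5.721442, 7.641214, 2.243185, 4.943831, 3.441654])
rf_head = pd.Series([6.848176, 6.611668, 1.834019, 5.237445, 7.586927])

# Side-by-side summary: spread, extremes, and design effect per model.
summary = pd.DataFrame(
    {
        "logistic_regression": lr_head.describe(),
        "random_forest": rf_head.describe(),
    }
)
summary.loc["design_effect"] = [
    kish_design_effect(lr_head),
    kish_design_effect(rf_head),
]
print(summary)
```

A larger design effect means a smaller effective sample size, so a classifier that yields only marginally better covariate balance at the cost of much more variable weights may not be the better choice.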
