# CLI tutorial

This tutorial walks through using the `balance` command-line interface (CLI) to adjust a sample dataset to a target. We will build a small synthetic dataset, run the CLI, and inspect the outputs.

The main advantage of a CLI is that it is easy to automate. The `balance` command can be invoked from shell scripts, scheduled with cron, embedded in CI/CD pipelines, or orchestrated by tools such as Airflow, all with minimal overhead. You can chain `balance` runs with other command-line tools, process batches of files in a loop, or trigger an adjustment when new data arrives, and the command line itself records exactly what was run. Because CLIs return a non-zero exit code on failure, automated systems can halt a pipeline or send an alert when something goes wrong. In short, the CLI turns `balance` from an interactive tool into a building block for reproducible, production-grade workflows.

## Prerequisites

Make sure `balance` is installed and the `balance` CLI is on your `PATH`. You can also run the CLI with `python -m balance.cli` from a checkout of the repository; the cells in this tutorial use that form.


In [None]:
import os
import subprocess
import tempfile

import numpy as np
import pandas as pd

from balance import load_data
from IPython.display import display


## Create a sample + target dataset

We'll create a CSV with two groups: respondents (sample) and non-respondents (target). The CLI expects a binary sample indicator column (`is_respondent` by default), an `id`, a `weight`, and covariates.


In [None]:
rng = np.random.default_rng(2021)
n_sample = 1000
n_target = 2000

sample_df = pd.DataFrame(
    {
        "age": rng.uniform(18, 80, n_sample),
        "gender": rng.choice([1, 2, 3, 4], n_sample),
        "id": range(n_sample),
        "weight": 1.0,
        "is_respondent": 1,
    }
)
target_df = pd.DataFrame(
    {
        "age": rng.uniform(18, 80, n_target),
        "gender": rng.choice([1, 2, 3, 4], n_target),
        "id": range(n_sample, n_sample + n_target),
        "weight": 1.0,
        "is_respondent": 0,
    }
)

input_df = pd.concat([sample_df, target_df], ignore_index=True)
input_df.head()
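
The column requirements described above can be checked before a file is handed to the CLI. The following is a minimal sketch that uses only the standard library; the helper name `check_cli_input` and the inline three-row CSV are illustrative assumptions, not part of `balance`, and the CLI performs its own validation.

```python
import csv
import io

# Columns the tutorial says the CLI expects, using the default names.
REQUIRED_COLUMNS = {"id", "weight", "is_respondent"}


def check_cli_input(csv_text, covariates):
    """Return a list of problems found in a CLI input CSV (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    problems = [
        f"missing column: {name}"
        for name in sorted((REQUIRED_COLUMNS | set(covariates)) - header)
    ]
    if "is_respondent" in header:
        # The sample indicator must be binary: 1 = sample, 0 = target.
        values = {row["is_respondent"] for row in reader}
        if not values <= {"0", "1"}:
            problems.append(f"non-binary is_respondent values: {sorted(values)}")
    return problems


# Hypothetical three-row input, for illustration only.
example = (
    "id,age,gender,weight,is_respondent\n"
    "0,34.5,1,1.0,1\n"
    "1,51.2,3,1.0,0\n"
    "2,27.9,2,1.0,0\n"
)
print(check_cli_input(example, ["age", "gender"]))  # []
print(check_cli_input(example, ["age", "income"]))  # ['missing column: income']
```

Returning a list of problems rather than raising on the first one makes it easier to report everything that is wrong with a file in one pass.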


## Alternative: use the bundled demo data

`balance` ships with a small demo dataset, available via `load_data()`. You can build the CLI input by
adding a sample indicator column and a weight column to each frame, then concatenating the sample and target frames.


In [None]:
target_df, sample_df = load_data()

sample_df = sample_df.copy()
target_df = target_df.copy()
sample_df["is_respondent"] = 1
target_df["is_respondent"] = 0
sample_df["weight"] = 1.0
target_df["weight"] = 1.0

load_data_input_df = pd.concat([sample_df, target_df], ignore_index=True)
load_data_input_df.head()


## Run the CLI

We'll write the input dataset to disk, then call the CLI to compute weights and diagnostics.


In [None]:
with tempfile.TemporaryDirectory() as tmpdir:
    input_path = os.path.join(tmpdir, "input.csv")
    output_path = os.path.join(tmpdir, "weights_out.csv")
    diagnostics_path = os.path.join(tmpdir, "diagnostics_out.csv")

    input_df.to_csv(input_path, index=False)

    cmd = [
        "python",
        "-m",
        "balance.cli",
        "--input_file",
        input_path,
        "--output_file",
        output_path,
        "--diagnostics_output_file",
        diagnostics_path,
        "--covariate_columns",
        "age,gender",
        "--method",
        "ipw",
    ]

    print("CLI command:", " ".join(cmd))
    subprocess.check_call(cmd)

    adjusted_df = pd.read_csv(output_path)
    diagnostics_df = pd.read_csv(diagnostics_path)

adjusted_df.head()
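
`subprocess.check_call` raises `subprocess.CalledProcessError` when the command exits with a non-zero status, which is what lets a pipeline stop when an adjustment fails. Below is a minimal sketch of catching that error. The helper name `run_step` is hypothetical, and a tiny Python one-liner stands in for the `balance` command so that the snippet runs without any input files.

```python
import subprocess
import sys


def run_step(cmd):
    """Run a command and return its exit code instead of raising."""
    try:
        subprocess.check_call(cmd)
    except subprocess.CalledProcessError as err:
        # A real pipeline might log the failure or send an alert here.
        print(f"command failed with exit code {err.returncode}")
        return err.returncode
    return 0


# Stand-in commands: the first succeeds, the second exits with status 3.
ok = run_step([sys.executable, "-c", "pass"])
failed = run_step([sys.executable, "-c", "import sys; sys.exit(3)"])
print(ok, failed)  # 0 3
```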


## Run the CLI on the bundled demo data

Here is the same CLI flow using the data returned by `load_data()`.


In [None]:
with tempfile.TemporaryDirectory() as tmpdir:
    input_path = os.path.join(tmpdir, "input_load_data.csv")
    output_path = os.path.join(tmpdir, "weights_load_data.csv")
    diagnostics_path = os.path.join(tmpdir, "diagnostics_load_data.csv")

    load_data_input_df.to_csv(input_path, index=False)

    cmd = [
        "python",
        "-m",
        "balance.cli",
        "--input_file",
        input_path,
        "--output_file",
        output_path,
        "--diagnostics_output_file",
        diagnostics_path,
        "--covariate_columns",
        "gender,age_group,income",
        "--outcome_columns",
        "happiness",
        "--method",
        "ipw",
    ]

    print("CLI command:", " ".join(cmd))
    subprocess.check_call(cmd)

    load_data_adjusted_df = pd.read_csv(output_path)
    load_data_diagnostics_df = pd.read_csv(diagnostics_path)

display(load_data_adjusted_df.head())
display(load_data_diagnostics_df.head())


## Inspect diagnostics

The diagnostics output is a flat table that includes adjustment metadata and balance
metrics. The `metric` column identifies the type of diagnostic, while `var` indicates the
variable (or `NaN` for overall summaries). The cells below use the diagnostics from the
`load_data()` run (`load_data_diagnostics_df`).


In [None]:
sorted(load_data_diagnostics_df["metric"].unique())


In [None]:
sorted(load_data_diagnostics_df["var"].dropna().unique())


In [None]:
load_data_diagnostics_df.query("metric == 'adjustment_method'")
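
Because every diagnostic lives in the same flat table, a small helper can pull out one metric as a series indexed by `var`. The sketch below runs on a toy table with made-up values so that it is self-contained; only the column names `metric`, `var`, and `val` and the two metric names are taken from this tutorial, and `metric_values` is a hypothetical helper.

```python
import pandas as pd

# Toy diagnostics table. The values are invented for illustration and do not
# come from a real CLI run.
toy_diagnostics = pd.DataFrame(
    {
        "metric": ["adjustment_method", "model_coef", "model_coef"],
        "var": ["ipw", "age", "gender"],
        "val": [0.0, 0.12, -0.03],
    }
)


def metric_values(diagnostics: pd.DataFrame, metric: str) -> pd.Series:
    """Return the `val` entries for one metric, indexed by `var`."""
    rows = diagnostics.loc[diagnostics["metric"] == metric]
    return rows.set_index("var")["val"]


coefs = metric_values(toy_diagnostics, "model_coef")
print(coefs)
```

The same call could be applied to `load_data_diagnostics_df` with any of the metric names listed by the cells above.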


## CLI help and arguments

You can view all available CLI arguments using `--help`:


In [None]:
# Print all CLI arguments (argument list form, consistent with the other cells)
subprocess.run(["python", "-m", "balance.cli", "--help"])


### Key CLI arguments summary

Here are the most commonly used arguments:

| Argument | Default | Description |
|----------|---------|-------------|
| `--method` | `ipw` | Adjustment method: `ipw`, `cbps`, or `rake` |
| `--max_de` | `1.5` | Maximum design effect. Set to `None` to use `lambda_1se` instead |
| `--lambda_min` | `1e-05` | Lower bound for L1 penalty (IPW only) |
| `--lambda_max` | `10` | Upper bound for L1 penalty (IPW only) |
| `--num_lambdas` | `250` | Number of lambda values to search (IPW only) |
| `--weight_trimming_mean_ratio` | `20.0` | Trim weights above `mean(weights) * ratio` |
| `--transformations` | `default` | Covariate transformations. Use `None` to disable |
| `--formula` | `None` | Custom model formula (e.g., `"age + gender"`) |
| `--one_hot_encoding` | `True` | One-hot encode categorical features |
| `--batch_columns` | `None` | Columns to group by for batch processing |
| `--keep_columns` | `None` | Subset of columns to include in output |
| `--outcome_columns` | `None` | Columns treated as outcomes (not covariates) |
| `--ipw_logistic_regression_kwargs` | `None` | JSON string of kwargs for sklearn LogisticRegression |
| `--succeed_on_weighting_failure` | `False` | Return null weights instead of failing on errors |


### Example: Tuning IPW parameters

Below we run the CLI with custom regularization settings and a custom logistic regression solver:


In [None]:
with tempfile.TemporaryDirectory() as tmpdir:
    input_path = os.path.join(tmpdir, "input.csv")
    output_path = os.path.join(tmpdir, "weights_tuned.csv")
    diagnostics_path = os.path.join(tmpdir, "diagnostics_tuned.csv")

    load_data_input_df.to_csv(input_path, index=False)

    cmd = [
        "python",
        "-m",
        "balance.cli",
        "--input_file", input_path,
        "--output_file", output_path,
        "--diagnostics_output_file", diagnostics_path,
        "--covariate_columns", "gender,age_group,income",
        "--method", "ipw",
        # Tuning parameters
        "--max_de", "2.0",
        "--lambda_min", "1e-06",
        "--lambda_max", "100",
        "--num_lambdas", "500",
        "--weight_trimming_mean_ratio", "10.0",
        # Custom logistic regression settings
        "--ipw_logistic_regression_kwargs", '{"solver": "liblinear", "max_iter": 500}',
    ]

    print("CLI command:")
    print(" ".join(cmd))
    subprocess.check_call(cmd)

    tuned_adjusted_df = pd.read_csv(output_path)

tuned_adjusted_df.head()
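
The JSON value passed to `--ipw_logistic_regression_kwargs` contains spaces and quotes, so a command printed with `" ".join(cmd)` cannot be pasted into a shell as is. One option, sketched below, is to build the JSON with `json.dumps` and print the command with `shlex.join` (Python 3.8+), which adds POSIX shell quoting. The command is abbreviated and is only printed, not executed; the file names are placeholders.

```python
import json
import shlex

# Same solver settings as the tuning example, built from a dict.
lr_kwargs = json.dumps({"solver": "liblinear", "max_iter": 500})

cmd = [
    "python", "-m", "balance.cli",
    "--input_file", "input.csv",  # placeholder path
    "--output_file", "weights_tuned.csv",  # placeholder path
    "--method", "ipw",
    "--ipw_logistic_regression_kwargs", lr_kwargs,
]

printable = shlex.join(cmd)
print(printable)

# Splitting the quoted string recovers the original argument list.
assert shlex.split(printable) == cmd
```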


### Example: Using a custom formula

The `--formula` argument allows you to specify a custom model formula, including interaction
terms. When using `--formula`, you should typically also set `--transformations=None` to
prevent automatic transformations from interfering with your custom formula.

The formula uses patsy/R-style syntax:
- `age + gender`: additive terms (no interaction)
- `age * gender`: equivalent to `age + gender + age:gender` (main effects + interaction)
- `age:gender`: only the interaction term
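
To see what the `*` operator adds, the sketch below builds by hand the three numeric columns that `age * gender` expands to, using a couple of made-up rows. It treats both covariates as numeric and leaves out the intercept and any categorical encoding that a formula library would handle; `expand_product` is a hypothetical helper.

```python
def expand_product(rows, a, b):
    """Expand `a * b` into both main effects plus the `a:b` interaction column."""
    return [
        {a: row[a], b: row[b], f"{a}:{b}": row[a] * row[b]}
        for row in rows
    ]


rows = [{"age": 30.0, "gender": 2}, {"age": 45.0, "gender": 1}]  # made-up values
for expanded in expand_product(rows, "age", "gender"):
    print(expanded)
# {'age': 30.0, 'gender': 2, 'age:gender': 60.0}
# {'age': 45.0, 'gender': 1, 'age:gender': 45.0}
```
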

In [None]:
with tempfile.TemporaryDirectory() as tmpdir:
    input_path = os.path.join(tmpdir, "input.csv")
    output_path = os.path.join(tmpdir, "weights_formula.csv")
    diagnostics_path = os.path.join(tmpdir, "diagnostics_formula.csv")

    # Use the synthetic data, whose covariates are numeric, for the formula example
    input_df.to_csv(input_path, index=False)

    cmd = [
        "python",
        "-m",
        "balance.cli",
        "--input_file", input_path,
        "--output_file", output_path,
        "--diagnostics_output_file", diagnostics_path,
        "--covariate_columns", "age,gender",
        "--method", "ipw",
        # Disable transformations to use raw covariates in formula
        "--transformations", "None",
        # Use a formula with interaction term
        "--formula", "age*gender",
    ]

    print("CLI command with custom formula:")
    print(" ".join(cmd))
    subprocess.check_call(cmd)

    formula_diagnostics_df = pd.read_csv(diagnostics_path)

# Check model coefficients to verify formula was applied
print("\nModel coefficients (showing interaction term):")
print(formula_diagnostics_df.query("metric == 'model_coef'")[["var", "val"]])

## Batch processing example

The `--batch_columns` argument allows you to run separate adjustments for each unique
combination of values in the specified columns. This is useful when you want to compute
weights independently for different subgroups (e.g., by gender or region).
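
Conceptually, batching splits the rows by the values of the batch columns and handles each group on its own. The sketch below shows only that grouping step on made-up rows; `split_into_batches` is a hypothetical helper and is not how the CLI is implemented.

```python
from collections import defaultdict


def split_into_batches(rows, batch_columns):
    """Group rows by the tuple of their values in `batch_columns`."""
    batches = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in batch_columns)
        batches[key].append(row)
    return dict(batches)


rows = [  # made-up rows for illustration
    {"id": 1, "gender": "Female", "is_respondent": 1},
    {"id": 2, "gender": "Male", "is_respondent": 1},
    {"id": 3, "gender": "Female", "is_respondent": 0},
]
for key, group in split_into_batches(rows, ["gender"]).items():
    # Each group would be adjusted independently of the others.
    print(key, [row["id"] for row in group])
# ('Female',) [1, 3]
# ('Male',) [2]
```
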

In [None]:
# Create a dataset with a batch column for gender
batch_input_df = load_data_input_df.copy()

# The 'gender' column has values like 'Female', 'Male', and possibly NA
# Filter to only rows with non-null gender for this example
batch_input_df = batch_input_df[batch_input_df["gender"].notna()].copy()
print(f"Rows after filtering: {len(batch_input_df)}")
print(f"Gender distribution:\n{batch_input_df['gender'].value_counts()}")


In [None]:
with tempfile.TemporaryDirectory() as tmpdir:
    input_path = os.path.join(tmpdir, "input_batch.csv")
    output_path = os.path.join(tmpdir, "weights_batch.csv")
    diagnostics_path = os.path.join(tmpdir, "diagnostics_batch.csv")

    batch_input_df.to_csv(input_path, index=False)

    cmd = [
        "python",
        "-m",
        "balance.cli",
        "--input_file", input_path,
        "--output_file", output_path,
        "--diagnostics_output_file", diagnostics_path,
        "--covariate_columns", "age_group,income",  # Note: gender is now used as batch column
        "--outcome_columns", "happiness",
        "--batch_columns", "gender",  # Process each gender separately
        "--method", "ipw",
    ]

    print("CLI command with batch processing:")
    print(" ".join(cmd))
    subprocess.check_call(cmd)

    batch_adjusted_df = pd.read_csv(output_path)
    batch_diagnostics_df = pd.read_csv(diagnostics_path)

print(f"\nOutput rows: {len(batch_adjusted_df)}")
batch_adjusted_df.head()


In [None]:
# Inspect weights by gender - each group was adjusted independently
print("Weight statistics by gender (sample only):")
sample_only = batch_adjusted_df[batch_adjusted_df["is_respondent"] == 1]
print(sample_only.groupby("gender")["weight"].describe().round(3))


## Alternative weighting methods

The CLI supports three adjustment methods:
- **IPW (Inverse Probability Weighting)**: The default method. It uses logistic regression to estimate propensity scores
- **CBPS (Covariate Balancing Propensity Score)**: Balances covariates while estimating propensity scores
- **Rake (Raking/Iterative Proportional Fitting)**: Adjusts weights iteratively to match marginal distributions
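
As a rough illustration of the IPW idea, the sketch below turns estimated propensities into weights using the odds form `(1 - p) / p`, where `p` is a unit's estimated probability of belonging to the sample rather than the target. This is one common convention, assumed here for illustration only. The propensities are made up, and the CLI's implementation involves further steps, such as regularization, trimming, and normalization, that are not shown.

```python
def odds_weights(propensities):
    """Weight each sample unit by (1 - p) / p, so rarer units get larger weights."""
    return [(1.0 - p) / p for p in propensities]


propensities = [0.5, 0.25, 0.125]  # made-up values
print(odds_weights(propensities))  # [1.0, 3.0, 7.0]
```
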

### Example: CBPS method

CBPS simultaneously optimizes covariate balance and propensity score estimation:

In [None]:
with tempfile.TemporaryDirectory() as tmpdir:
    input_path = os.path.join(tmpdir, "input.csv")
    output_path = os.path.join(tmpdir, "weights_cbps.csv")
    diagnostics_path = os.path.join(tmpdir, "diagnostics_cbps.csv")

    input_df.to_csv(input_path, index=False)

    cmd = [
        "python",
        "-m",
        "balance.cli",
        "--input_file", input_path,
        "--output_file", output_path,
        "--diagnostics_output_file", diagnostics_path,
        "--covariate_columns", "age,gender",
        "--method", "cbps",
    ]

    print("CLI command with CBPS method:")
    print(" ".join(cmd))
    subprocess.check_call(cmd)

    cbps_diagnostics_df = pd.read_csv(diagnostics_path)

# Verify the method used
print("\nAdjustment method used:")
print(cbps_diagnostics_df.query("metric == 'adjustment_method'")[["var", "val"]])

### Example: Rake method

Raking iteratively adjusts weights to match target marginal distributions:

In [None]:
with tempfile.TemporaryDirectory() as tmpdir:
    input_path = os.path.join(tmpdir, "input.csv")
    output_path = os.path.join(tmpdir, "weights_rake.csv")
    diagnostics_path = os.path.join(tmpdir, "diagnostics_rake.csv")

    input_df.to_csv(input_path, index=False)

    cmd = [
        "python",
        "-m",
        "balance.cli",
        "--input_file", input_path,
        "--output_file", output_path,
        "--diagnostics_output_file", diagnostics_path,
        "--covariate_columns", "age,gender",
        "--method", "rake",
    ]

    print("CLI command with rake method:")
    print(" ".join(cmd))
    subprocess.check_call(cmd)

    rake_diagnostics_df = pd.read_csv(diagnostics_path)

# Verify the method used
print("\nAdjustment method used:")
print(rake_diagnostics_df.query("metric == 'adjustment_method'")[["var", "val"]])
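
To make the raking idea concrete, here is a small self-contained sketch of iterative proportional fitting over two categorical covariates. The rows, the target margins, and the function name `rake` are made up for illustration. This is not the `balance` implementation, which also handles details such as convergence checks that are omitted here.

```python
def rake(rows, weights, targets, n_iter=50):
    """Rescale weights so weighted category shares match `targets`.

    rows:    list of dicts mapping variable name -> category
    targets: {variable: {category: target share}}, shares summing to 1
    """
    weights = list(weights)
    for _ in range(n_iter):
        for var, shares in targets.items():
            total = sum(weights)
            # Current weighted total of each category of `var`.
            current = {cat: 0.0 for cat in shares}
            for row, w in zip(rows, weights):
                current[row[var]] += w
            # Scale each row so that this variable's margin hits its target.
            weights = [
                w * shares[row[var]] * total / current[row[var]]
                for row, w in zip(rows, weights)
            ]
    return weights


rows = [
    {"gender": "F", "age_group": "young"},
    {"gender": "F", "age_group": "old"},
    {"gender": "M", "age_group": "young"},
    {"gender": "M", "age_group": "old"},
]
targets = {
    "gender": {"F": 0.6, "M": 0.4},
    "age_group": {"young": 0.3, "old": 0.7},
}
raked = rake(rows, [1.0, 1.0, 1.0, 1.0], targets)
share_f = (raked[0] + raked[1]) / sum(raked)
share_young = (raked[0] + raked[2]) / sum(raked)
print(round(share_f, 3), round(share_young, 3))  # 0.6 0.3
```

Each pass fixes one variable's margin at a time; repeating the passes lets all margins settle on their targets together.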

## Next steps

- Try `--method cbps` or `--method rake` for alternative weighting approaches.
- Use `--outcome_columns` to control which columns are treated as outcomes.
- Supply `--ipw_logistic_regression_kwargs` to tune the IPW model.
- Use `--succeed_on_weighting_failure` for pipelines where you want null weights instead of errors.
- Explore `--covariate_columns_for_diagnostics` and `--rows_to_keep_for_diagnostics` to customize diagnostic output.


## Session info

For reproducibility, here is the session information:


In [None]:
import session_info
session_info.show(html=False, dependencies=True)
