# Balance CLI tutorial

This tutorial walks through using the `balance` command-line interface (CLI) to adjust a sample dataset to a target. We will build a small synthetic dataset, run the CLI, and inspect the outputs.


## Prerequisites

Make sure `balance` is installed and the `balance` CLI is on your PATH. You can also run the CLI via `python -m balance.cli` from a checkout of the repository.
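
Before running the rest of the tutorial, it can be useful to confirm that at least one of these entry points is reachable. The following is a minimal sketch using only the standard library; the helper name `find_balance_entry_points` is hypothetical and not part of `balance`.

```python
import importlib.util
import shutil


def find_balance_entry_points():
    """Report whether the `balance` executable and Python package can be located.

    Hypothetical helper for this tutorial; it is not part of the balance API.
    """
    return {
        # Path to a `balance` executable on PATH, or None if none is found.
        "cli_on_path": shutil.which("balance"),
        # True if the package can be found, which `python -m balance.cli` requires.
        "package_importable": importlib.util.find_spec("balance") is not None,
    }


print(find_balance_entry_points())
```

If `package_importable` is `False`, the `python -m balance.cli` invocations later in this tutorial will fail until the package is installed in the active environment.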


In [None]:
import os
import subprocess
import tempfile

import numpy as np
import pandas as pd

from balance import load_data
from IPython.display import HTML, display


## Create a sample + target dataset

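We'll build a DataFrame with two groups: respondents (the sample, `is_respondent = 1`) and non-respondents (the target, `is_respondent = 0`). The CLI expects a binary sample indicator column (`is_respondent` by default), an `id` column, a `weight` column, and one or more covariate columns. The frame is written to CSV in a later step, just before the CLI is called.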
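
A quick sanity check on the CSV before handing it to the CLI can catch missing columns early. The sketch below uses only the standard library; `check_cli_input` is a hypothetical helper, and the required column names are assumed from the description above rather than taken from the CLI's own validation.

```python
import csv
import io

REQUIRED_COLUMNS = ("is_respondent", "id", "weight")  # names assumed from the text above


def check_cli_input(csv_file, covariates):
    """Return a list of problems found in a CLI input CSV (an empty list means none found)."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    problems = [
        f"missing column: {name}"
        for name in (*REQUIRED_COLUMNS, *covariates)
        if name not in header
    ]
    if "is_respondent" in header:
        # The sample indicator should be binary: 1 for sample rows, 0 for target rows.
        values = {row["is_respondent"] for row in reader}
        unexpected = values - {"0", "1"}
        if unexpected:
            labels = sorted(str(value) for value in unexpected)
            problems.append(f"non-binary is_respondent values: {labels}")
    return problems


example = io.StringIO("age,gender,id,weight,is_respondent\n34.5,2,0,1.0,1\n61.2,1,1,1.0,0\n")
print(check_cli_input(example, covariates=["age", "gender"]))  # prints []
```

The check compares the indicator as text, so it expects integer-coded `0`/`1` values such as those produced when pandas writes an integer column to CSV.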


In [None]:
rng = np.random.default_rng(2021)
n_sample = 1000
n_target = 2000

sample_df = pd.DataFrame(
    {
        "age": rng.uniform(18, 80, n_sample),
        "gender": rng.choice([1, 2, 3, 4], n_sample),
        "id": range(n_sample),
        "weight": 1.0,
        "is_respondent": 1,
    }
)
target_df = pd.DataFrame(
    {
        "age": rng.uniform(18, 80, n_target),
        "gender": rng.choice([1, 2, 3, 4], n_target),
        "id": range(n_sample, n_sample + n_target),
        "weight": 1.0,
        "is_respondent": 0,
    }
)

input_df = pd.concat([sample_df, target_df], ignore_index=True)
input_df.head()


## Alternative: use the bundled demo data

`balance` ships with a small demo dataset, available via `load_data()`. You can build the CLI input by
adding a sample indicator column and a weight column to each frame, then concatenating the sample and target frames.


In [None]:
target_df, sample_df = load_data()

sample_df = sample_df.copy()
target_df = target_df.copy()
sample_df["is_respondent"] = 1
target_df["is_respondent"] = 0
sample_df["weight"] = 1.0
target_df["weight"] = 1.0

load_data_input_df = pd.concat([sample_df, target_df], ignore_index=True)
load_data_input_df.head()


## Run the CLI

We'll write the input dataset to disk, then call the CLI to compute weights and diagnostics.
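
When driving a CLI from Python, capturing its output makes failures easier to diagnose than a bare non-zero exit status. The following is a minimal sketch using `subprocess.run`; `run_and_capture` is a hypothetical helper name, and the demo command runs the current Python interpreter rather than `balance` so that the block stays self-contained.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run a command and return its stdout; raise with stderr attached on failure."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"Command failed with exit code {result.returncode}:\n{result.stderr}"
        )
    return result.stdout


# Stand-in command; in this tutorial it would be the `balance.cli` argument list.
print(run_and_capture([sys.executable, "-c", "print('ok')"]))
```

Using `sys.executable` instead of the literal string `"python"` ensures the subprocess runs under the same interpreter (and environment) as the notebook kernel.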


In [None]:
import sys

with tempfile.TemporaryDirectory() as tmpdir:
    input_path = os.path.join(tmpdir, "input.csv")
    output_path = os.path.join(tmpdir, "weights_out.csv")
    diagnostics_path = os.path.join(tmpdir, "diagnostics_out.csv")

    input_df.to_csv(input_path, index=False)

    cmd = [
        sys.executable,  # run the CLI with the same interpreter as this notebook
        "-m",
        "balance.cli",
        "--input_file",
        input_path,
        "--output_file",
        output_path,
        "--diagnostics_output_file",
        diagnostics_path,
        "--covariate_columns",
        "age,gender",
        "--method",
        "ipw",
    ]

    print("CLI command:", " ".join(cmd))
    subprocess.check_call(cmd)

    adjusted_df = pd.read_csv(output_path)
    diagnostics_df = pd.read_csv(diagnostics_path)

adjusted_df.head()


## Run the CLI on the bundled demo data

Here is the same CLI flow using the data returned by `load_data()`.


In [None]:
import sys

with tempfile.TemporaryDirectory() as tmpdir:
    input_path = os.path.join(tmpdir, "input_load_data.csv")
    output_path = os.path.join(tmpdir, "weights_load_data.csv")
    diagnostics_path = os.path.join(tmpdir, "diagnostics_load_data.csv")

    load_data_input_df.to_csv(input_path, index=False)

    cmd = [
        sys.executable,  # run the CLI with the same interpreter as this notebook
        "-m",
        "balance.cli",
        "--input_file",
        input_path,
        "--output_file",
        output_path,
        "--diagnostics_output_file",
        diagnostics_path,
        "--covariate_columns",
        "gender,age_group,income",
        "--outcome_columns",
        "happiness",
        "--method",
        "ipw",
    ]

    print("CLI command:", " ".join(cmd))
    subprocess.check_call(cmd)

    load_data_adjusted_df = pd.read_csv(output_path)
    load_data_diagnostics_df = pd.read_csv(diagnostics_path)

display(load_data_adjusted_df.head())
display(load_data_diagnostics_df.head())


## Inspect diagnostics

The diagnostics output is a flat table that includes adjustment metadata and balance
metrics. The `metric` column identifies the type of diagnostic, while `var` indicates the
variable (or `NaN` for overall summaries). The cells below use the most recent
`load_data()` run (`load_data_diagnostics_df`).
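
Because the table is in long format, pulling out one diagnostic amounts to filtering on `metric`. The sketch below runs on a tiny hand-made frame so that it is self-contained; the `val` column name, the `example_*` metric names, and the numbers are illustrative assumptions, not actual CLI output.

```python
import pandas as pd

# Toy stand-in for the diagnostics table; values and metric names are made up for illustration.
toy_diagnostics = pd.DataFrame(
    {
        "metric": ["adjustment_method", "example_metric", "example_metric", "example_summary"],
        "val": ["ipw", "0.12", "0.05", "0.9"],
        "var": [None, "age_group", "gender", None],
    }
)


def rows_for_metric(df, metric):
    """Return the rows for one metric, indexed by `var`."""
    subset = df.loc[df["metric"] == metric]
    # Overall summaries have a missing `var`; give them a readable label.
    return subset.assign(var=subset["var"].fillna("(overall)")).set_index("var")


print(rows_for_metric(toy_diagnostics, "example_metric"))
```

The same helper could be pointed at the real diagnostics frame once the actual column and metric names have been checked against its contents.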


In [None]:
diagnostics_df = load_data_diagnostics_df


In [None]:
sorted(diagnostics_df["metric"].unique())


In [None]:
sorted(diagnostics_df["var"].dropna().unique())


In [None]:
diagnostics_df.query("metric == 'adjustment_method'")


In [None]:
table_data = diagnostics_df.to_json(orient="records")
table_id = "diagnostics-table"
table_html = f"""
<div style="margin-bottom: 0.5rem;">
  <label for="{table_id}-search">Search diagnostics:</label>
  <input id="{table_id}-search" type="search" placeholder="Type to filter" style="margin-left: 0.5rem;" />
</div>
<div style="max-height: 400px; overflow: auto; border: 1px solid #ddd;">
  <table id="{table_id}" style="width: 100%; border-collapse: collapse;"></table>
</div>
<script>
  // Wrapped in an IIFE so that re-running the cell does not redeclare top-level constants.
  (() => {{
    const data = {table_data};
    const table = document.getElementById("{table_id}");
    const searchInput = document.getElementById("{table_id}-search");

    function renderTable(rows) {{
      if (!rows.length) {{
        table.innerHTML = "<tr><td>No diagnostics rows found.</td></tr>";
        return;
      }}
      const columns = Object.keys(rows[0]);
      const header = "<tr>" + columns.map(col => "<th style='text-align:left; border-bottom: 1px solid #ccc; padding: 4px;'>" + col + "</th>").join("") + "</tr>";
      const body = rows.map(row => {{
        return "<tr>" + columns.map(col => "<td style='border-bottom: 1px solid #eee; padding: 4px;'>" + (row[col] ?? "") + "</td>").join("") + "</tr>";
      }}).join("");
      table.innerHTML = header + body;
    }}

    function filterRows(term) {{
      const lowerTerm = term.toLowerCase();
      return data.filter(row => Object.values(row).some(value => {{
        if (value === null || value === undefined) return false;
        return String(value).toLowerCase().includes(lowerTerm);
      }}));
    }}

    renderTable(data);
    searchInput.addEventListener("input", event => {{
      renderTable(filterRows(event.target.value));
    }});
  }})();
</script>
"""
display(HTML(table_html))


## Next steps

- Try `--method cbps` or `--method rake` for alternative weighting approaches.
- Use `--outcome_columns` to control which columns are treated as outcomes.
- Supply `--ipw_logistic_regression_kwargs` to tune the IPW model.
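
To experiment with these options without retyping the full command, the argument list can be assembled by a small helper. This is a sketch: `build_cli_command` is a hypothetical name, it uses only flags that appear in this tutorial, and it assumes `--outcome_columns` accepts a comma-separated list in the same way as `--covariate_columns`.

```python
import sys


def build_cli_command(input_path, output_path, diagnostics_path, covariates,
                      method="ipw", outcomes=None):
    """Assemble the argument list for `python -m balance.cli`."""
    cmd = [
        sys.executable, "-m", "balance.cli",
        "--input_file", input_path,
        "--output_file", output_path,
        "--diagnostics_output_file", diagnostics_path,
        "--covariate_columns", ",".join(covariates),
        "--method", method,
    ]
    if outcomes:
        # Assumption: outcomes are passed comma-separated, like covariates.
        cmd += ["--outcome_columns", ",".join(outcomes)]
    return cmd


# Build (but do not run) one command per weighting method mentioned above.
for method in ("ipw", "cbps", "rake"):
    cmd = build_cli_command("input.csv", f"weights_{method}.csv",
                            f"diagnostics_{method}.csv",
                            covariates=["gender", "age_group", "income"],
                            method=method, outcomes=["happiness"])
    print(" ".join(cmd))
```

Each resulting list can be passed to `subprocess.check_call` exactly as in the cells above, with the file paths pointed at a real input CSV.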
