## Calibration

The `Calibration` class adjusts the weights of observations in a dataset so that weighted totals match specified target values. This is commonly used in survey research and policy modeling to rebalance a dataset so that it better represents the characteristics of a target population.

The calibration process uses an optimization algorithm to find weights that minimize the distance from the original weights while achieving the target constraints.
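
To make the idea concrete, here is a minimal NumPy sketch of one way such an optimization could work: gradient descent on log-weights that reduces the squared relative error between weighted totals and targets. It is an illustration under stated assumptions, not the algorithm used by `microcalibrate`; the function name `reweight_sketch`, its arguments, and the toy data are all hypothetical. Starting from the original weights keeps the result near them only implicitly, since no explicit distance penalty is included.

```python
import numpy as np


def reweight_sketch(matrix, weights, targets, epochs=500, learning_rate=0.5):
    """Illustrative reweighting (hypothetical helper, not the library's code).

    matrix:  (n_obs, n_targets) contribution of each observation to each target
    weights: (n_obs,) positive starting weights
    targets: (n_targets,) desired weighted column sums
    """
    # Optimizing log-weights keeps every weight strictly positive.
    log_w = np.log(np.asarray(weights, dtype=float))
    for _ in range(epochs):
        w = np.exp(log_w)
        estimates = matrix.T @ w
        rel_err = (estimates - targets) / targets
        # Gradient of sum(rel_err ** 2) with respect to log_w (chain rule).
        grad = (matrix @ (2.0 * rel_err / targets)) * w
        log_w -= learning_rate * grad
    return np.exp(log_w)


# Toy data: two observations contribute to each of two targets.
matrix = np.array([[10.0, 0.0], [20.0, 0.0], [0.0, 30.0], [0.0, 40.0]])
weights = np.ones(4)
targets = np.array([45.0, 56.0])

new_weights = reweight_sketch(matrix, weights, targets)
print(matrix.T @ new_weights)  # weighted totals after reweighting
```

The step size and epoch count here were chosen for this toy problem; real data would likely need different settings.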

## Basic usage

### Parameters

`__init__(data, weights, targets)`

- `data` (pd.DataFrame): The dataset to be calibrated. This should contain all the variables you want to use for calibration.
- `weights` (np.ndarray): Initial weights for each observation in the dataset. Typically starts as an array of ones for equal weighting.
- `targets` (np.ndarray): Target values that the calibration process should achieve. These correspond to the desired weighted sums.
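
As a rough illustration of how these three inputs relate, the sketch below builds them for a tiny hand-written dataset. The data values and the 10% uplift used to create the targets are hypothetical, chosen only to show that `weights` has one entry per row and `targets` has one entry per weighted sum.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row dataset, for illustration only.
data = pd.DataFrame({
    "age": [25, 45, 33, 48],
    "income": [30000.0, 52000.0, 41000.0, 60000.0],
})

# One initial weight per observation.
weights = np.ones(len(data))

# Each observation's contribution to each target (zero outside the age group).
contributions = pd.DataFrame({
    "income_aged_20_30": data["age"].between(20, 30).astype(float) * data["income"],
    "income_aged_40_50": data["age"].between(40, 50).astype(float) * data["income"],
})

# Weighted sums under the initial weights.
current_totals = contributions.mul(weights, axis=0).sum().to_numpy()

# Hypothetical targets: 10% above the current weighted sums.
targets = current_totals * 1.1

print(f"Current totals: {current_totals}")
print(f"Targets: {targets}")
```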

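To calibrate a dataset, initialize the `Calibration` class with the parameters above; the example below also passes optional tuning arguments (`noise_level`, `epochs`, `learning_rate`, `dropout_rate`, `subsample_every`). Then call the `calibrate()` method, which performs the actual calibration using the reweight function. This method: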
- Adjusts the weights to better match the target values
- May subsample the data for efficiency
- Updates both `self.weights` and `self.data` with the calibrated results

## Example

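Below is a complete example showing how to calibrate a dataset to match income targets for specific age groups. The sample data is drawn at random without a fixed seed, so the printed values differ from run to run: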

from microcalibrate.calibration import Calibration
import logging
import numpy as np
import pandas as pd

logging.basicConfig(
    level=logging.INFO,
    format='%(name)s - %(levelname)s - %(message)s'
)

# Create a sample dataset with age and income data
data = pd.DataFrame({
    "age": np.random.randint(18, 70, size=100),
    "income": np.random.normal(40000, 50000, size=100),
})

# Set initial weights (all ones in this example)
weights = np.ones(len(data))

# Build each observation's contribution to the targets: income for ages 20-30 and 40-50.
# This is only an example; existing external targets can be used instead.
targets_matrix = pd.DataFrame({
    "income_aged_20_30": ((data["age"] >= 20) & (data["age"] <= 30)).astype(float) * data["income"],
    "income_aged_40_50": ((data["age"] >= 40) & (data["age"] <= 50)).astype(float) * data["income"],
})

print(targets_matrix)

# Targets are the weighted column sums; the trailing `* 1` is a no-op scale factor
targets = np.array([
    (targets_matrix["income_aged_20_30"] * weights * 1).sum(),
    (targets_matrix["income_aged_40_50"] * weights * 1).sum(),
])

print(f"Original weights: {weights}")
print(f"Original targets: {targets}")

    income_aged_20_30  income_aged_40_50
0       -50487.312561          -0.000000
1            0.000000           0.000000
2           -0.000000          -0.000000
3            0.000000       65919.792922
4            0.000000           0.000000
..                ...                ...
95           0.000000           0.000000
96           0.000000           0.000000
97           0.000000           0.000000
98           0.000000           0.000000
99           0.000000        9531.550398

[100 rows x 2 columns]
Original weights: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1.]
Original targets: [ 630004.69588852 1001707.09307129]


# Initialize the Calibration object
calibrator = Calibration(
    data=data,
    weights=weights,
    targets=targets,
    noise_level=0.05,
    epochs=64,
    learning_rate=0.01,
    dropout_rate=0,
    subsample_every=0,
)

# Perform the calibration
calibrator.calibrate()

print(f"Original dataset size: {len(data)}")
print(f"Calibrated dataset size: {len(calibrator.data)}")
print(f"Number of calibrated weights: {len(calibrator.weights)}")

microcalibrate.reweight - INFO - Starting calibration process for targets ['age' 'income']: [846184.90984739 701149.57088578]
microcalibrate.reweight - INFO - Original weights - mean: 1.0000, std: 0.0000
microcalibrate.reweight - INFO - Initial weights after noise - mean: 1.0240, std: 0.0141
Reweighting progress:   0%|          | 0/64 [00:00<?, ?epoch/s]
microcalibrate.reweight - INFO - Initial weights after noise - mean: 1.0240, std: 0.0141
microcalibrate.reweight - INFO - Estimates: tensor([   4486.9644, 3978299.5000], device='mps:0',
       grad_fn=<SqueezeBackward4>)
microcalibrate.reweight - INFO - Targets: tensor([846184.9375, 701149.5625], device='mps:0')
microcalibrate.reweight - INFO - Relative error: tensor([ 0.9894, 21.8459], device='mps:0', grad_fn=<PowBackward0>)
microcalibrate.reweight - INFO - tensor([1.0123, 1.0214, 1.0323, 1.0488, 1.0007, 1.0284, 1.0179, 1.0087, 1.0489,
        1.0086, 1.0174, 1.0039, 1.0469, 1.0099, 1.0390, 1.0013, 1.0026, 1.0191,
        1.0290, 1.011

Original dataset size: 100
Calibrated dataset size: 100
Number of calibrated weights: 100


# Verify the calibration results
calibrated_matrix = pd.DataFrame({
    "income_aged_20_30": ((calibrator.data["age"] >= 20) & (calibrator.data["age"] <= 30)).astype(float) * calibrator.data["income"],
    "income_aged_40_50": ((calibrator.data["age"] >= 40) & (calibrator.data["age"] <= 50)).astype(float) * calibrator.data["income"],
})

# Calculate final weighted totals
final_totals = calibrated_matrix.mul(calibrator.weights, axis=0).sum().values

print(f"Target totals: {targets}")
print(f"Final calibrated totals: {final_totals}")
print(f"Difference: {final_totals - targets}")
print(f"Relative error: {(final_totals - targets) / targets * 100}")

Target totals: [673662.31574665 906225.91384017]
Final calibrated totals: [152673.82244626 367294.22662297]
Difference: [-520988.49330039 -538931.6872172 ]
Relative error: [-77.33674292 -59.46990469]
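
A check like the following can make the verification step explicit by flagging which targets were met within a chosen tolerance. This is a minimal sketch: the helper `within_tolerance` and the 5% threshold are assumptions made for illustration and are not part of `microcalibrate`. The input values are copied from the output printed above.

```python
import numpy as np


def within_tolerance(estimates, targets, rel_tol=0.05):
    """Return a boolean array: True where |estimate - target| / |target| <= rel_tol."""
    estimates = np.asarray(estimates, dtype=float)
    targets = np.asarray(targets, dtype=float)
    rel_err = np.abs(estimates - targets) / np.abs(targets)
    return rel_err <= rel_tol


# Values copied from the printed output above.
targets = np.array([673662.31574665, 906225.91384017])
final_totals = np.array([152673.82244626, 367294.22662297])

met = within_tolerance(final_totals, targets)
print(f"Targets met within 5%: {met}")
```

For the totals shown above, neither target falls within 5%, so in a case like this it may be worth experimenting with settings such as `epochs` or `learning_rate`.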
