# Sampling

How would you choose $n$ observations from a total of $N$ to effectively estimate (say) the accuracy of a classifier? For example, imagine that our budget is limited and we can only annotate $n=100$ examples from data of size $N=10^{7}$! 

In this notebook, we show how to 
* Sample with stratified simple random sampling (SSRS) with proportional allocation
* Estimate the metric of interest $\mathbb{E}[Z]$ with the Horvitz-Thompson (HT) and difference (DF) estimators

Besides estimating the value of the metric, we also compute the variance of the estimate, which lets us build confidence intervals.
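As a minimal sketch of how a point estimate and its variance turn into an interval, the snippet below assumes the estimator is approximately normal, a common large-sample approximation that may be optimistic for small budgets. The helper name `normal_ci` and the numbers passed to it are illustrative and are not part of `ssepy`.

```python
import math
from statistics import NormalDist


def normal_ci(mean, variance, level=0.95):
    """Normal-approximation confidence interval for an estimate."""
    # Two-sided critical value, e.g. about 1.96 for level=0.95
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(variance)
    return mean - half_width, mean + half_width


# Illustrative numbers only (hypothetical estimate and variance)
lo, hi = normal_ci(mean=0.13, variance=0.001)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
```

The interval narrows as the variance of the estimator shrinks, which is why the sampling design and the choice of estimator matter.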

We focus on estimating the binary accuracy of a multi-class classifier, that is, whether the predicted class matches the true label. Other evaluation metrics can be estimated in a similar way.




## Load the data

Consider a multi-class classification task. Let's load the packages and generate synthetic features and predictions $(m_1(X), \dots, m_K(X))$; before annotation, the model's predictions are typically all we have. You can also plug in your own data!

In [1]:
import numpy as np
from ssepy import ModelPerformanceEvaluator
from sklearn.cluster import KMeans

In [2]:
# Generate synthetic data
num_samples = 1000  # Total population size
num_classes = 10    # Number of prediction classes
num_features = 20   # Features for clustering

# Generate random features and predictions
np.random.seed(42)  # For reproducibility

# Random features (normal distribution)
features = np.random.randn(num_samples, num_features)
# Random scores, turned into class probabilities with a softmax
preds = np.random.rand(num_samples, num_classes)
preds = np.exp(preds) / np.sum(np.exp(preds), axis=1, keepdims=True)

budget = 100
total_sample_size = num_samples

We take performance to be the binary accuracy of the classifier and we will try to estimate its value on the dataset. 

### 1. Predict performance

We obtain an estimate of the expected performance for each observation, that is, we construct a proxy $\hat{Z}$ for $Z$. This proxy can be based on _anything_, but the more strongly associated it is with $Z$, the more precise our estimates of $\mathbb{E}[Z]$ will be.

In this notebook we use the predictions of the model under evaluation to construct $\hat{Z}$: we set $\hat{Z} = \max_{k\in [K]} m_k(X)$, the confidence the model assigns to its predicted class $\arg \max_{k\in [K]} m_k(X)$. Ideally, these predictions would at least be calibrated first; we skip that step here.

In [3]:
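# Synthetic ground-truth labels; in practice these are unknown until annotated
ground_truth = np.random.randint(0, num_classes, size=num_samples)
predicted_labels = np.argmax(preds, axis=1)
# Proxy performance: confidence assigned to the predicted class
proxy_performance = preds[np.arange(num_samples), predicted_labels]
# True performance Z: True if the prediction is correct, False otherwise
performance = (predicted_labels == ground_truth)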

evaluator = ModelPerformanceEvaluator(
    Yh=proxy_performance,
    budget=budget,
)
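To see why the strength of the proxy matters, recall that under simple random sampling the variance of the difference estimator is proportional to the variance of the residuals $Z - \hat{Z}$. The sketch below uses a separate toy population (the distributions and variable names are assumptions made for illustration, not the notebook's data) and compares the residual variance with no proxy, with an informative proxy, and with an uninformative one.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Toy, perfectly calibrated confidence: Z = 1 with probability `confidence`
confidence = rng.uniform(0, 1, size=N)
z = (rng.uniform(0, 1, size=N) < confidence).astype(float)

# A proxy that carries no information about Z
noise_proxy = rng.uniform(0, 1, size=N)

var_no_proxy = np.var(z)                 # residuals when no proxy is used
var_good_proxy = np.var(z - confidence)  # residuals with an informative proxy
var_bad_proxy = np.var(z - noise_proxy)  # residuals with an uninformative proxy

print(f"no proxy:            {var_no_proxy:.3f}")
print(f"informative proxy:   {var_good_proxy:.3f}")
print(f"uninformative proxy: {var_bad_proxy:.3f}")
```

An informative proxy shrinks the residual variance, while an uninformative one can make it larger than using no proxy at all.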

### 2. Stratify

SSRS requires dividing the population into strata, from which we then select the samples to annotate. We form the strata by running k-means on the predictions, following the recommendations from the paper. Other sampling designs can be applied here as well; for example, the strata could be formed by fitting a Gaussian mixture model to feature representations of the data obtained from a neural network.

In [4]:
evaluator.stratify_data(clustering_algo=KMeans(n_clusters=5, random_state=42))
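For intuition, the stratification step can be approximated outside the library. The sketch below clusters a one-dimensional toy proxy with scikit-learn's `KMeans` and reports the resulting stratum sizes. That this mirrors what `stratify_data` does internally is an assumption; the variable names and toy data are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
toy_proxy = rng.uniform(0.1, 1.0, size=1000)  # toy proxy values, one per observation

# KMeans expects a 2-D array, so the 1-D proxy is reshaped into a single column
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
toy_strata = kmeans.fit_predict(toy_proxy.reshape(-1, 1))

# Each cluster label defines one stratum
stratum_ids, stratum_sizes = np.unique(toy_strata, return_counts=True)
for h, size in zip(stratum_ids, stratum_sizes):
    members = toy_proxy[toy_strata == h]
    print(f"stratum {h}: size={size}, "
          f"proxy range=[{members.min():.2f}, {members.max():.2f}]")
```

Observations with similar proxy values end up in the same stratum, which is what makes the strata internally homogeneous when the proxy is informative.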

### 3. Sample

We now sample from the data using SSRS with proportional allocation. In practice, you would choose only one allocation strategy.

In [5]:
# Split the budget across strata proportionally to their size, then draw the sample
evaluator.allocate_budget(allocation_type="proportional")
sampled_idx = evaluator.sample()
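Proportional allocation assigns each stratum $h$ a sample size $n_h \approx n \cdot N_h / N$ and then draws a simple random sample without replacement within each stratum. The following is a minimal NumPy sketch of that idea, not the `ssepy` implementation; the function names, the largest-remainder rounding rule, and the toy strata are assumptions for illustration.

```python
import numpy as np


def proportional_allocation(strata, budget):
    """Split `budget` across strata in proportion to stratum size."""
    labels, sizes = np.unique(strata, return_counts=True)
    raw = budget * sizes / sizes.sum()
    alloc = np.floor(raw).astype(int)
    # Hand the units lost to rounding to the largest fractional parts
    leftover = budget - alloc.sum()
    order = np.argsort(-(raw - alloc), kind="stable")
    alloc[order[:leftover]] += 1
    return labels, alloc


def stratified_sample(strata, budget, rng):
    """Draw a simple random sample without replacement inside each stratum."""
    labels, alloc = proportional_allocation(strata, budget)
    picks = [
        rng.choice(np.flatnonzero(strata == h), size=n_h, replace=False)
        for h, n_h in zip(labels, alloc)
    ]
    return np.concatenate(picks)


rng = np.random.default_rng(0)
toy_strata = rng.integers(0, 5, size=1000)  # toy stratum label per observation
toy_idx = stratified_sample(toy_strata, budget=100, rng=rng)
print(len(toy_idx), "indices selected for annotation")
```

With proportional allocation every observation has roughly the same inclusion probability $n/N$, so the sample is approximately self-weighting.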

### 4. Annotate

Pretend that in this step we annotate the selected samples. Here they are (luckily) already available. 

### 5. Estimate

Now we can estimate the performance on the full dataset from the annotated subset!

In [6]:
# Horvitz-Thompson estimator (no `estimator` argument is passed)
estimates_ht = evaluator.compute_estimate(performance[sampled_idx])
print('Mean is ', estimates_ht[0])
print('Variance is ', estimates_ht[1])

# Difference estimator
estimates_df = evaluator.compute_estimate(performance[sampled_idx], estimator="df")
print('Mean is ', estimates_df[0])
print('Variance is ', estimates_df[1])

Mean is  [0.12947919]
Variance is  [0.00100167]
Mean is  [0.13163876]
Variance is  [0.00100172]
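For reference, the textbook forms of the two estimators under stratified simple random sampling can be sketched in a few lines. With stratum weights $W_h = N_h / N$, the HT estimate of the mean is $\sum_h W_h \bar{z}_h$, with estimated variance $\sum_h W_h^2 (1 - n_h / N_h)\, s_h^2 / n_h$; the difference estimator applies the same formulas to the residuals $Z - \hat{Z}$ and adds back the population mean of the proxy. The code below is a sketch under these assumptions and is not taken from `ssepy`; the function names and toy population are hypothetical, and each stratum needs at least two sampled observations.

```python
import numpy as np


def stratified_ht(values, strata_sample, stratum_sizes):
    """HT estimate of the population mean under stratified SRS, plus its variance."""
    N = sum(stratum_sizes.values())
    mean, var = 0.0, 0.0
    for h, N_h in stratum_sizes.items():
        y_h = values[strata_sample == h]
        n_h, W_h = y_h.size, N_h / N
        mean += W_h * y_h.mean()
        # Finite-population correction (1 - n_h / N_h) times s_h^2 / n_h
        var += W_h**2 * (1 - n_h / N_h) * y_h.var(ddof=1) / n_h
    return mean, var


def stratified_df(z_sample, proxy_sample, strata_sample, proxy_all, stratum_sizes):
    """Difference estimator: population proxy mean plus estimated mean residual."""
    resid_mean, var = stratified_ht(z_sample - proxy_sample, strata_sample, stratum_sizes)
    return proxy_all.mean() + resid_mean, var


# Toy population (illustrative): three strata, proxy equal to the success probability
rng = np.random.default_rng(0)
stratum_sizes = {0: 500, 1: 300, 2: 200}
strata_all = np.repeat([0, 1, 2], [500, 300, 200])
proxy_all = rng.uniform(0, 1, size=1000)
z_all = (rng.uniform(0, 1, size=1000) < proxy_all).astype(float)

# Proportional sample of 100 observations (10% of every stratum)
sample_idx = np.concatenate([
    rng.choice(np.flatnonzero(strata_all == h), size=N_h // 10, replace=False)
    for h, N_h in stratum_sizes.items()
])
ht_mean, ht_var = stratified_ht(z_all[sample_idx], strata_all[sample_idx], stratum_sizes)
df_mean, df_var = stratified_df(z_all[sample_idx], proxy_all[sample_idx],
                                strata_all[sample_idx], proxy_all, stratum_sizes)
print("HT:", ht_mean, ht_var)
print("DF:", df_mean, df_var)
```

In the notebook's synthetic data the labels are drawn independently of the predictions, so the proxy carries no information about $Z$ and the two estimators report nearly identical variances.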
