# Super Simple Autometrics Tutorial

This tutorial shows the basics of using Autometrics.
Load a dataset, run the pipeline, and get your metrics!

## Setup
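
The setup cell below reads the OpenAI key from a file at a user-specific path. A minimal sketch of a more portable pattern, assuming you are free to choose where the key lives: prefer an already-set environment variable and fall back to a key file. The helper name `load_api_key` and the file name in the usage note are hypothetical and not part of autometrics.

```python
import os
from pathlib import Path


def load_api_key(env_var="OPENAI_API_KEY", key_file=None):
    """Return the key from the environment, else from key_file (hypothetical helper)."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    if key_file is not None:
        path = Path(key_file).expanduser()
        if path.is_file():
            # Strip the trailing newline most editors add to the file.
            return path.read_text().strip()
    raise RuntimeError(f"Set {env_var} or pass a readable key_file")
```

With this helper, the assignment in the setup cell could become `os.environ["OPENAI_API_KEY"] = load_api_key(key_file="~/.openai-key.txt")`, where the file name is only a placeholder.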

In [1]:
import os
import dspy
from autometrics.autometrics import Autometrics
from autometrics.dataset.datasets.simplification.simplification import SimpDA
from autometrics.aggregator.regression.ElasticNet import ElasticNet

# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = open('/Users/spangher/.openai-reglab-project-key.txt').read().strip()



[Autometrics] No GPU detected - using BM25 + LLMRec pipeline for CPU-optimized performance
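
The log line above says that metric retrieval falls back to BM25 on CPU. As a rough illustration of what BM25 ranking does, here is a small pure-Python Okapi BM25 scorer. It is not the library's implementation, which the later logs suggest is backed by a Lucene index. The metric descriptions and the query are made up for the example, and `k1=1.2`, `b=0.75` are common defaults rather than values taken from autometrics.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Rank document names by Okapi BM25 score for a whitespace-tokenized query."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized.values() for term in set(toks))

    scores = {}
    for name, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more hits to score well.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical one-line descriptions; real metric documentation is much longer.
docs = {
    "FKGL": "readability grade level of simplified text",
    "BLEU": "n-gram overlap between output and reference translation",
    "Perplexity": "language model fluency of the output",
}
ranking = bm25_rank("simplified text readability", docs)
print(ranking[0])
```

Only the FKGL description shares terms with the query, so it ranks first; the other two score zero.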


In [11]:
dataset

Dataset: SimpDA, Target Columns: ['fluency', 'meaning', 'simplicity'], Ignore Columns: ['id', 'original', 'simple', 'system', 'ref1', 'ref2', 'ref3', 'ref4', 'ref5', 'ref6', 'ref7', 'ref8', 'ref9', 'ref10'], Metric Columns: ['LENS', 'SARI_P', 'SARI_F', 'FKGL', 'BERTScoreP_roberta-large', 'BERTScoreR_roberta-large', 'BERTScoreF_roberta-large', 'BLEU', 'METEOR', 'ROUGE-1-p', 'ROUGE-2-p', 'ROUGE-L-p', 'ROUGE-Lsum-p', 'ROUGE-1-r', 'ROUGE-2-r', 'ROUGE-L-r', 'ROUGE-Lsum-r', 'ROUGE-1-f1', 'ROUGE-2-f1', 'ROUGE-L-f1', 'ROUGE-Lsum-f1', 'distinct_1', 'distinct_2', 'distinct_3', 'distinct_4', 'Perplexity_gpt2-large', 'Autometrics_Regression_simplicity']
   id                                           original  \
0   0  a bastion on the eastern approaches was built ...   
1   0  a bastion on the eastern approaches was built ...   
2   0  a bastion on the eastern approaches was built ...   
3   2  a few animals have chromatic response, changin...   
4   2  a few animals have chromatic response, chan

In [4]:
# Load the SimpDA dataset (text simplification)
dataset = SimpDA(path='../autometrics/dataset/datasets/simplification/simpda.csv')
target_measure = "simplicity"  # The human score column we want to predict

print(f"Dataset: {dataset.get_name()}")
print(f"Size: {len(dataset.get_dataframe())} examples")
print(f"Target measure: {target_measure}")

# Configure LLMs
# Use GPT-4o-mini for both generation and judging
generator_llm = dspy.LM("openai/gpt-4o-mini")
judge_llm = dspy.LM("openai/gpt-4o-mini")

print("LLMs configured!")

Dataset: SimpDA
Size: 434 examples
Target measure: simplicity
LLMs configured!


## Autometrics Pipeline

In [5]:
# Super simple configuration:
# - Generate 1 metric using LLM judge
# - Retrieve 10 metrics from the bank
# - Select top 5 using ElasticNet regression
autometrics = Autometrics(
    metric_generation_configs={
        "llm_judge": {"metrics_per_trial": 1}  # Just generate 1 metric
    },
    regression_strategy=ElasticNet,  # Use ElasticNet instead of default Lasso
    seed=42,  # For reproducibility
    generated_metrics_dir="tutorial_metrics"  # Unique directory for this tutorial
)
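
For context on the `regression_strategy` choice: assuming the `ElasticNet` aggregator follows the standard elastic net formulation (the parameterization below is scikit-learn's; the source does not show which settings autometrics uses), the fitted weights $w$ minimize

$$
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\rho\,\lVert w \rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
$$

where $X$ holds the candidate metric scores for the $n$ examples, $y$ holds the human scores for the target measure, $\alpha$ is the overall penalty strength, and $\rho$ is the L1 ratio. Setting $\rho = 1$ recovers Lasso. Smaller values of $\rho$ add ridge-style shrinkage, which tends to spread weight across correlated metrics rather than keeping one of them arbitrarily.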

In [6]:
print("Running autometrics pipeline...")
print("This will:")
print("1. Generate 1 LLM judge metric")
print("2. Retrieve 10 relevant metrics from the bank")
print("3. Evaluate all metrics on your dataset")
print("4. Select top 5 using ElasticNet regression")
print("5. Create a final aggregated metric")

results = autometrics.run(
    dataset=dataset,
    target_measure=target_measure,
    generator_llm=generator_llm,
    judge_llm=judge_llm,
    num_to_retrieve=10,  # Retrieve 10 metrics
    num_to_regress=5     # Select top 5
)

print("Pipeline complete! 🎉")

Running autometrics pipeline...
This will:
1. Generate 1 LLM judge metric
2. Retrieve 10 relevant metrics from the bank
3. Evaluate all metrics on your dataset
4. Select top 5 using ElasticNet regression
5. Create a final aggregated metric
[Autometrics] Starting pipeline for SimpDA - simplicity
[Autometrics] Configuration: retrieve=10, regress=5, regenerate=False

[Autometrics] Step 1: Generating/Loading Metrics
[Autometrics] Generating 1 metrics using llm_judge...
Initializing BM25 recommender with index path: /Users/spangher/Library/Application Support/autometrics/bm25_all_metrics
Building BM25 index in /Users/spangher/Library/Application Support/autometrics/bm25_all_metrics/index for 48 metrics …




2025-11-17 17:53:54,496 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:208) - Setting log level to INFO
2025-11-17 17:53:54,497 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:212) - AbstractIndexer settings:
2025-11-17 17:53:54,505 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:213) -  + DocumentCollection path: /Users/spangher/Library/Application Support/autometrics/bm25_all_metrics/collection
2025-11-17 17:53:54,506 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:214) -  + CollectionClass: JsonCollection
2025-11-17 17:53:54,506 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:215) -  + Index path: /Users/spangher/Library/Application Support/autometrics/bm25_all_metrics/index
2025-11-17 17:53:54,506 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:216) -  + Threads: 1
2025-11-17 17:53:54,506 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:217) -  + Optimize (merge segments)? false
2025-11-17 17:53:54,522 INFO  [main] inde

Nov 17, 2025 5:53:54 PM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 21; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false


2025-11-17 17:53:54,829 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:292) - Indexing Complete! 48 documents indexed
2025-11-17 17:53:54,830 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:294) - indexed:               48
2025-11-17 17:53:54,830 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:295) - unindexable:            0
2025-11-17 17:53:54,830 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:296) - empty:                  0
2025-11-17 17:53:54,830 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:297) - skipped:                0
2025-11-17 17:53:54,830 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:298) - errors:                 0
2025-11-17 17:53:54,833 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:301) - Total 48 documents indexed in 00:00:00
BM25 recommender loaded successfully
[Autometrics] Saved 1 metrics for llm_judge
[Autometrics] Generated/Loaded 1 metrics

[Autometrics] Step 2: Loading Metric Bank
[Autometri



Building BM25 index in /Users/spangher/Library/Application Support/autometrics/bm25_SimpDA_5db6f1da_card/index for 49 metrics …




2025-11-17 17:53:56,143 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:208) - Setting log level to INFO
2025-11-17 17:53:56,144 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:212) - AbstractIndexer settings:
2025-11-17 17:53:56,153 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:213) -  + DocumentCollection path: /Users/spangher/Library/Application Support/autometrics/bm25_SimpDA_5db6f1da_card/collection
2025-11-17 17:53:56,153 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:214) -  + CollectionClass: JsonCollection
2025-11-17 17:53:56,154 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:215) -  + Index path: /Users/spangher/Library/Application Support/autometrics/bm25_SimpDA_5db6f1da_card/index
2025-11-17 17:53:56,154 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:216) -  + Threads: 1
2025-11-17 17:53:56,154 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:217) -  + Optimize (merge segments)? false
2025-11-17 17:53:56,167

Nov 17, 2025 5:53:56 PM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 21; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false


2025-11-17 17:53:56,440 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:292) - Indexing Complete! 49 documents indexed
2025-11-17 17:53:56,441 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:294) - indexed:               49
2025-11-17 17:53:56,441 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:295) - unindexable:            0
2025-11-17 17:53:56,441 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:296) - empty:                  0
2025-11-17 17:53:56,441 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:297) - skipped:                0
2025-11-17 17:53:56,441 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:298) - errors:                 0
2025-11-17 17:53:56,444 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:301) - Total 49 documents indexed in 00:00:00
Starting iterative recommendation: requesting 10 metrics from 49 available

--- Iteration 1 ---
Requesting 10 metrics from 49 remaining metrics
Planning recommendation strateg



config.json:   0%|          | 0.00/666 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.25G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

[MetricBank] Perplexity model device: cpu
[Autometrics] Built 9 valid metrics
[Autometrics] Evaluating 9 regular metrics using parallel execution...
[Parallel] LENS - checking device before predict...
[Parallel] SARI - checking device before predict...
[Parallel] FKGL - checking device before predict...
[Parallel] BERTScore_roberta-large - checking device before predict...
[Parallel] BERTScore_roberta-large model has no device info
[Parallel] BLEU - checking device before predict...
[Parallel] METEOR - checking device before predict...
[Parallel] ROUGE - checking device before predict...
[Parallel] DistinctNGram - checking device before predict...
[Parallel] Perplexity_gpt2-large - checking device before predict...
[Parallel] Perplexity_gpt2-large model device: cpu


Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

checkpoints/model.ckpt:   0%|          | 0.00/1.43G [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

.gitattributes: 0.00B [00:00, ?B/s]

hparams.yaml:   0%|          | 0.00/705 [00:00<?, ?B/s]






config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]


vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

    ✓ SARI computed successfully (parallel)
    ✓ BLEU computed successfully (parallel)
    ✓ METEOR computed successfully (parallel)
    ✓ DistinctNGram computed successfully (parallel)
    ✓ FKGL computed successfully (parallel)
    ✓ ROUGE computed successfully (parallel)


merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Using LENS with topk=3


/Users/spangher/miniconda3/lib/python3.12/site-packages/pytorch_lightning/core/saving.py:188: Found keys that are not in the model state dict but in the checkpoint: ['encoder.model.embeddings.position_ids']

2025-11-17 17:56:58,647 - pytorch_lightning.utilities.rank_zero - INFO - GPU available: True (mps), used: False

2025-11-17 17:56:58,719 - pytorch_lightning.utilities.rank_zero - INFO - TPU available: False, using: 0 TPU cores

2025-11-17 17:56:58,720 - pytorch_lightning.utilities.rank_zero - INFO - IPU available: False, using: 0 IPUs

2025-11-17 17:56:58,721 - pytorch_lightning.utilities.rank_zero - INFO - HPU available: False, using: 0 HPUs
/Users/spangher/miniconda3/lib/python3.12/site-packages/pytorch_lightning/trainer/setup.py:187: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
Predicting DataLoader 0:   4%|██▎                                                 | 12/272 [00:23<08:19,  1.92s/it]


Groups: 100%|██████████

    ✓ Perplexity_gpt2-large computed successfully (parallel)


Predicting DataLoader 0:  10%|████▉                                               | 26/272 [00:53<08:29,  2.07s/it]

    ✓ BERTScore_roberta-large computed successfully (parallel)


Predicting DataLoader 0: 100%|███████████████████████████████████████████████████| 272/272 [07:20<00:00,  1.62s/it]


    ✓ LENS computed successfully (parallel)
[Autometrics] Aggregating results back to original dataset...
    ✓ LENS aggregated successfully
    ✓ SARI aggregated successfully
    ✓ FKGL aggregated successfully
    ✓ BERTScore_roberta-large aggregated successfully
    ✓ BLEU aggregated successfully
    ✓ METEOR aggregated successfully
    ✓ ROUGE aggregated successfully
    ✓ DistinctNGram aggregated successfully
    ✓ Perplexity_gpt2-large aggregated successfully
[Autometrics] Parallel evaluation complete. Dataset now has 9 metrics
[Autometrics] Successfully evaluated 9 metrics, 0 failed

[Autometrics] Step 5: Regression Analysis (Selecting Top 5 via ElasticNet)
[Autometrics] Running regression to select top 5 metrics from 9 candidates...
  Fitting regression on all metrics to identify importance...
  Metric importance scores:
    1. LENS: 0.3799
    2. BERTScoreP_roberta-large: 0.1846
    3. BLEU: 0.0802
    4. ROUGE-L-p: -0.0473
    5. Perplexity_gpt2-large: 0.04

In [7]:
print("\n" + "="*50)
print("RESULTS")
print("="*50)

print(f"\nGenerated metrics: {len(results['all_generated_metrics'])}")
for i, metric in enumerate(results['all_generated_metrics']):
    print(f"  {i+1}. {metric.__name__}")

print(f"\nRetrieved metrics: {len(results['retrieved_metrics'])}")
for i, metric in enumerate(results['retrieved_metrics'][:3]):  # Show first 3
    print(f"  {i+1}. {metric.__name__}")

print(f"\nTop selected metrics: {len(results['top_metrics'])}")
for i, metric in enumerate(results['top_metrics']):
    print(f"  {i+1}. {metric.get_name()}")

print(f"\nFinal regression metric: {results['regression_metric'].get_name()}")
print(f"Description: {results['regression_metric'].get_description()}")


RESULTS

Generated metrics: 1
  1. Clarity_of_Expression_gpt_4o_mini_LLMJudge

Retrieved metrics: 9
  1. LENS
  2. SARI
  3. FKGL

Top selected metrics: 5
  1. LENS
  2. BERTScore_roberta-large
  3. BLEU
  4. ROUGE
  5. Perplexity_gpt2-large

Final regression metric: Autometrics_Regression_simplicity
Description: Regression aggregator for simplicity using top 5 metrics
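
The importance list in the pipeline log names individual score columns (for example `BERTScoreP_roberta-large` and `ROUGE-L-p`), while the selected metrics above are reported by parent metric name. One plausible reading, and only an assumption about the library's internals, is that columns are ranked by absolute importance and then mapped back to the metric that produces them. The sketch below reproduces that mapping with the four scores that are fully visible in the log (the fifth is truncated there and is left out). The `parent_of` table is transcribed from the report card's MultiMetric entries, and `top_parent_metrics` is a hypothetical helper.

```python
# Column-level importances copied from the pipeline log (fifth entry omitted: truncated).
importances = {
    "LENS": 0.3799,
    "BERTScoreP_roberta-large": 0.1846,
    "BLEU": 0.0802,
    "ROUGE-L-p": -0.0473,
}

# Sub-metric column -> parent metric, per the report card's MultiMetric listings.
parent_of = {
    "BERTScoreP_roberta-large": "BERTScore_roberta-large",
    "ROUGE-L-p": "ROUGE",
}


def top_parent_metrics(importances, parent_of, k):
    """Rank columns by |importance| and return up to k distinct parent metric names."""
    ranked = sorted(importances, key=lambda col: abs(importances[col]), reverse=True)
    parents = []
    for col in ranked:
        parent = parent_of.get(col, col)  # single-column metrics are their own parent
        if parent not in parents:
            parents.append(parent)
        if len(parents) == k:
            break
    return parents


selected = top_parent_metrics(importances, parent_of, k=4)
print(selected)
```

Ranking by absolute value is what places `ROUGE-L-p`, whose score is negative, among the selected columns in this sketch.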


## Use Your Metrics

In [8]:
print("\n" + "="*50)
print("USING YOUR METRICS")
print("="*50)

# Get predictions from your final metric
final_scores = results['regression_metric'].predict(dataset)
human_scores = dataset.get_dataframe()[target_measure]

print(f"\nPredicted vs Human scores for first 5 examples:")
print("Example | Predicted | Human | Pred Rank | Human Rank")
print("-" * 55)

# Get first 5 examples
first_5_pred = final_scores[:5]
first_5_human = human_scores.iloc[:5]

for i in range(min(5, len(final_scores))):
    predicted = first_5_pred[i]
    human = first_5_human.iloc[i]
    
    # Calculate ranks within these 5 examples (rank 1 = highest score)
    pred_rank = (first_5_pred > predicted).sum() + 1
    human_rank = (first_5_human > human).sum() + 1
    
    print(f"  {i+1}     | {predicted:.3f}    | {human:.3f} | {pred_rank:>9} | {human_rank:>10}")

# Check correlation with human scores
from scipy.stats import pearsonr

correlation, p_value = pearsonr(human_scores, final_scores)
print(f"\nCorrelation with human scores: {correlation:.3f} (p={p_value:.3f})")


USING YOUR METRICS

Predicted vs Human scores for first 5 examples:
Example | Predicted | Human | Pred Rank | Human Rank
-------------------------------------------------------
  1     | 0.813    | 1.352 |         1 |          1
  2     | -0.111    | -0.873 |         3 |          5
  3     | -0.641    | -0.472 |         4 |          3
  4     | -0.039    | -0.281 |         2 |          2
  5     | -1.013    | -0.684 |         5 |          4

Correlation with human scores: 0.780 (p=0.000)
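
The rank columns in the table suggest a rank-based check alongside Pearson. As a small worked example, the sketch below recomputes the ranks for the five rows shown above and derives their Spearman correlation in plain Python. It covers only these five rows, so the result says nothing about the full 434-example dataset; for that, `scipy.stats.spearmanr(human_scores, final_scores)` would be the usual call.

```python
import math

# Values copied from the five rows printed above.
predicted = [0.813, -0.111, -0.641, -0.039, -1.013]
human = [1.352, -0.873, -0.472, -0.281, -0.684]


def descending_ranks(values):
    """Rank 1 goes to the largest value, matching the tutorial's rank columns."""
    return [sum(other > v for other in values) + 1 for v in values]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


pred_ranks = descending_ranks(predicted)
human_ranks = descending_ranks(human)
# With no ties, Spearman's rho is the Pearson correlation of the ranks.
spearman_rho = pearson(pred_ranks, human_ranks)
print(pred_ranks, human_ranks, round(spearman_rho, 3))
```

The recomputed ranks match the two rank columns in the table, and the predicted and human rankings agree on the top two of these five examples.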


In [9]:
print("\n" + "="*50)
print("REPORT CARD")
print("="*50)

print(results['report_card'])

print("\n" + "="*50)
print("TUTORIAL COMPLETE!")
print("="*50)
print("You now have:")
print("✅ A custom metric for your task")
print("✅ Top 5 most relevant metrics")
print("✅ A final aggregated metric")
print("✅ Correlation with human judgments")
print("\nYou can use these metrics on new data!")


REPORT CARD

# Autometrics Report Card

## Dataset Information
- **Dataset**: SimpDA
- **Target Measure**: simplicity
- **Dataset Size**: 434 examples

## Top Metrics Selected
- **1.** LENS
- **2.** BERTScore_roberta-large (MultiMetric: BERTScoreP_roberta-large, BERTScoreR_roberta-large, BERTScoreF_roberta-large)
- **3.** BLEU
- **4.** ROUGE (MultiMetric: ROUGE-1-p, ROUGE-2-p, ROUGE-L-p, ROUGE-Lsum-p, ROUGE-1-r, ROUGE-2-r, ROUGE-L-r, ROUGE-Lsum-r, ROUGE-1-f1, ROUGE-2-f1, ROUGE-L-f1, ROUGE-Lsum-f1)
- **5.** Perplexity_gpt2-large

## Regression Aggregator
- **Type**: ElasticNet
- **Name**: Autometrics_Regression_simplicity
- **Description**: Regression aggregator for simplicity using top 5 metrics

## Summary
The Autometrics pipeline successfully identified the most relevant metrics for evaluating simplicity on the SimpDA dataset. The selected metrics can be used individually or combined through the regression aggregator for comprehensive evaluation.

## Hotelling T² Selection
- Select