In [7]:
import pandas as pd

# Load full clustered dataset
df = pd.read_parquet("../data/clustered_benchmark_data.parquet")

# Load LLM interpretation
with open("../data/llm_cluster_analysis.md", "r", encoding="utf-8") as f:
    llm_output = f.read()

print("LLM Heuristic Output:")
print(llm_output)

LLM Heuristic Output:
# LLM Cluster Interpretation

### Performance Benchmarking Analysis

1. **Key Patterns in Runtime Across Clusters:**
   - Cluster 0 and Cluster 3 have lower runtimes compared to Cluster 1 and Cluster 2.
   - Cluster 0 has the lowest runtime, while Cluster 1 has the highest.

2. **Impact of Input Features and Output Format on Performance:**
   - **Null Rate:** Higher null rates tend to increase runtime due to additional processing required for handling missing values.
   - **Cardinality:** Higher cardinality can lead to longer runtimes as it increases the complexity of operations.
   - **Output Format:** The choice of output format can impact performance; for example, Parquet may be faster than CSV due to its columnar storage.

3. **Transitions or Thresholds:**
   - No specific thresholds were explicitly mentioned in the data provided.

4. **Efficient and Inefficient Configurations:**
   - Cluster 0 and Cluster 3 can be considered more efficient due to their lower 

In [9]:
summary_by_regime = df.groupby("regime").agg({
    "runtime_ms": "mean",
    "null_rate": "mean",
    "cardinality": "mean",
    "output_format": lambda x: x.mode()[0],
    "engine": lambda x: x.mode()[0],
    "cluster": lambda x: x.mode()[0],  # Most common cluster for the regime
    "rows": "mean",
    "columns": "mean"
}).round(2)

summary_by_regime

| regime | runtime_ms | null_rate | cardinality | output_format | engine | cluster | rows | columns |
|--------|------------|-----------|-------------|---------------|--------|---------|------|---------|
| high_card_pandas | 882.18 | 0.2 | 7486.88 | csv | pandas | 2 | 541205.16 | 35.07 |
| null_heavy_json | 1051.75 | 0.75 | 549.85 | json | pandas | 1 | 301504.32 | 30.83 |
| small_dense_parquet | 105.59 | 0.05 | 52.01 | parquet | polars | 0 | 2939.9 | 12.31 |
| wide_polars | 353.89 | 0.3 | 2914.03 | parquet | polars | 3 | 51583.79 | 90.04 |


# 📊 Evaluation of LLM-Generated Performance Heuristics

This section evaluates the correctness and usefulness of the LLM-generated heuristics against the benchmark results. The synthetic dataset was generated with four performance regimes, and each regime maps to a distinct HDBSCAN cluster (the `cluster` column in the summary above is the most common cluster per regime).

---

## ✅ Ground Truth: Regime Summary

| Regime               | Cluster | Runtime (ms) | Null Rate | Cardinality | Output Format | Engine  |
|----------------------|---------|--------------|-----------|-------------|----------------|---------|
| small_dense_parquet  | 0       | 106          | 0.05      | 52          | parquet        | polars  |
| null_heavy_json      | 1       | 1052         | 0.75      | 550         | json           | pandas  |
| high_card_pandas     | 2       | 882          | 0.20      | 7487        | csv            | pandas  |
| wide_polars          | 3       | 354          | 0.30      | 2914        | parquet        | polars  |

---

## 🤖 LLM Output Summary

**Key Statements:**
1. Cluster 0 has the lowest runtime; Cluster 1 the highest.
2. High null rates increase runtime.
3. Higher cardinality increases runtime through added operational complexity.
4. Parquet is faster than CSV.
5. Clusters 0 and 3 are efficient; Cluster 1 is inefficient.
6. Recommended heuristics:
   - Minimise nulls
   - Optimise cardinality
   - Prefer Parquet for performance
   - Choose tools based on workload
   - Benchmark real configurations

---

## 🧠 Evaluation of Claims

| LLM Statement                                            | Match? | Assessment |
|----------------------------------------------------------|--------|------------|
| Cluster 0 has lowest, Cluster 1 has highest runtime       | ✅     | Matches runtime data (106 ms vs 1052 ms) |
| High null rate increases runtime                         | ✅     | Null-heavy Cluster 1 (null rate 0.75) is slowest, although Cluster 3 (0.30) is faster than Cluster 2 (0.20) |
| High cardinality increases runtime                       | 🟡     | Partially true; Cluster 2 (cardinality 7487) is slow, but the slowest, Cluster 1, has a cardinality of only 550 |
| Parquet is faster than CSV                               | ✅     | Parquet clusters (0, 3) are the fastest; both also use Polars, so format and engine are not separated here |
| Clusters 0 and 3 are efficient; Cluster 1 inefficient    | ✅     | Matches runtime and input features |
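
The rankings behind these assessments can be re-derived from the regime means. The following is a minimal sketch in plain Python, with values copied from the ground-truth summary rather than read from the parquet file; the `regimes` list and the `order_by` helper are illustrative names, not part of the notebook.

```python
# Regime means copied from the ground-truth summary (illustrative structure).
regimes = [
    {"cluster": 0, "runtime_ms": 105.59, "null_rate": 0.05, "cardinality": 52.01},
    {"cluster": 1, "runtime_ms": 1051.75, "null_rate": 0.75, "cardinality": 549.85},
    {"cluster": 2, "runtime_ms": 882.18, "null_rate": 0.20, "cardinality": 7486.88},
    {"cluster": 3, "runtime_ms": 353.89, "null_rate": 0.30, "cardinality": 2914.03},
]


def order_by(key):
    """Return cluster ids sorted from lowest to highest value of `key`."""
    return [r["cluster"] for r in sorted(regimes, key=lambda r: r[key])]


runtime_order = order_by("runtime_ms")
null_order = order_by("null_rate")
cardinality_order = order_by("cardinality")

print("runtime:    ", runtime_order)
print("null rate:  ", null_order)
print("cardinality:", cardinality_order)

# A single feature fully explains the ranking only if its order
# matches the runtime order exactly.
print("null rate matches runtime order:  ", null_order == runtime_order)
print("cardinality matches runtime order:", cardinality_order == runtime_order)
```

Under these assumptions, neither feature orders the clusters exactly as runtime does: null rate agrees with runtime on the fastest and slowest clusters, while cardinality agrees only on the fastest.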

---

## 📝 Heuristic Evaluation

| Heuristic Rule                                            | Ground Truth Alignment |
|-----------------------------------------------------------|-------------------------|
| Minimise nulls to improve performance                     | ✅ Strong support       |
| Optimise cardinality to avoid degradation                 | 🟡 Partial              |
| Prefer efficient output formats (e.g. Parquet)            | ✅ Strong support       |
| Choose tools (e.g. Polars) based on performance needs     | ✅ Matches observed data |
| Benchmark configurations empirically                      | ✅ Good advice          |
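
One caveat when reading the format and tool heuristics is that output format and engine move together in the regime summary. A rough way to check this is sketched below in plain Python, using the modal values from the ground-truth table; the `regimes` list is an illustrative stand-in for the notebook's `df`, not part of the original analysis.

```python
from collections import defaultdict

# (regime, engine, output_format, mean runtime in ms), copied from the summary table.
regimes = [
    ("small_dense_parquet", "polars", "parquet", 105.59),
    ("null_heavy_json", "pandas", "json", 1051.75),
    ("high_card_pandas", "pandas", "csv", 882.18),
    ("wide_polars", "polars", "parquet", 353.89),
]

formats_by_engine = defaultdict(set)
runtimes_by_engine = defaultdict(list)
for _, engine, fmt, runtime_ms in regimes:
    formats_by_engine[engine].add(fmt)
    runtimes_by_engine[engine].append(runtime_ms)

for engine in sorted(formats_by_engine):
    runtimes = runtimes_by_engine[engine]
    mean_ms = sum(runtimes) / len(runtimes)
    print(engine, sorted(formats_by_engine[engine]), f"{mean_ms:.1f} ms")

# If no engine appears with both Parquet and a non-Parquet format, the
# summary alone cannot separate the format effect from the engine effect.
separable = any(
    "parquet" in formats and len(formats) > 1
    for formats in formats_by_engine.values()
)
print("format effect separable from engine:", separable)
```

Because the summary records only the most common value per regime, the same check on the row-level data, for example with `pd.crosstab(df["engine"], df["output_format"])`, would show whether any rows break this pairing.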

---

## ✅ Conclusion

The LLM provided a high-quality interpretation of the cluster-level benchmarking results:
- It correctly identified the impact of nulls, file format, and general efficiency patterns.
- It produced realistic heuristics that map closely to the synthetic regimes.
- Its one weakness was a tendency to generalise the role of cardinality without detecting more complex interactions.

This evaluation suggests that large language models can generate **interpretable and actionable insights** from unsupervised benchmark clustering when combined with structured prompts and validation against known ground truth, here the four synthetic regimes.