
fix: validate user-provided clustering weight keys#627

Merged
FBumann merged 3 commits into main from validate-clustering-weight-keys
Mar 24, 2026

Conversation


@FBumann FBumann commented Mar 24, 2026

Summary

  • Raise ValueError when ClusterConfig.weights references variables that don't exist in the FlowSystem, catching typos early
  • Weights for constant (but valid) variables are still silently filtered — they exist in the system, they're just not useful for clustering
  • Updated tests to cover both unknown-key rejection and constant-column filtering
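The unknown-key check described above can be sketched roughly as follows. This is a minimal sketch with illustrative names (`validate_weight_keys`, `available_columns`); the actual check lives inside `TransformAccessor.cluster()`:

```python
# Minimal sketch of the unknown-key validation; names are illustrative,
# the real check is implemented inside TransformAccessor.cluster().
def validate_weight_keys(weights, available_columns):
    """Raise ValueError if any weight key is not an available clustering variable."""
    if weights is None:
        return
    unknown = set(weights) - set(available_columns)
    if unknown:
        raise ValueError(
            f'ClusterConfig.weights contains unknown variables: {sorted(unknown)}. '
            'See transform.clustering_data() for the available clustering variables.'
        )
```

A typo such as `'heat_demnd'` then fails immediately with a message listing the offending keys, instead of being silently ignored.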

Test plan

  • test_unknown_weight_keys_raise — bogus keys raise ValueError
  • test_extra_weight_keys_filtered_with_constant_column — constant columns silently filtered
  • test_unknown_weight_keys_raise_multiperiod — bogus key in multi-period raises
  • test_valid_weight_keys_multiperiod — valid keys in multi-period work
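The constant-column behavior that the filtering tests rely on can be illustrated with a small sketch. The helper name `filter_constant_weights` is hypothetical; in flixopt the filtering happens when `drop_constant_arrays()` strips constant variables from the clustering dataset:

```python
# Hypothetical sketch of constant-column filtering: columns whose values never
# change carry no clustering information, so their weights are dropped silently.
def filter_constant_weights(weights, data):
    varying = {name for name, values in data.items() if len(set(values)) > 1}
    return {key: w for key, w in weights.items() if key in varying}

data = {'demand': [1.0, 2.0, 3.0], 'constant_profile': [5.0, 5.0, 5.0]}
weights = {'demand': 2.0, 'constant_profile': 1.0}
print(filter_constant_weights(weights, data))  # {'demand': 2.0}
```

The weight for `constant_profile` is dropped without an error because the variable exists in the system; only references to variables that do not exist at all are rejected.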

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved weight validation for clustering: unknown weight keys are now detected and rejected with clear error messages that point users to the available clustering variables; validation runs early to prevent confusing downstream failures. Single-period clustering still silently ignores constant columns.
  • Tests

    • Updated tests to cover error handling for invalid weight keys and valid-weight success cases across single- and multi-period clustering.

Raise ValueError when ClusterConfig.weights references variables that
don't exist in the FlowSystem, catching typos early with a clear message.
Weights for constant (but valid) variables are still silently filtered.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai bot commented Mar 24, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 88efbe94-5c2e-44c2-a81b-82cb3676d42c

📥 Commits

Reviewing files that changed from the base of the PR and between 2af82d7 and df746c2.

📒 Files selected for processing (1)
  • tests/test_clustering/test_integration.py
📝 Walkthrough

Added validation in TransformAccessor.cluster() to verify that any non-None cluster.weights keys exist in the selected clustering dataset's variables; unknown keys now cause a ValueError listing the unknown keys and pointing to transform.clustering_data(). The check runs before the dataset-empty guard.

Changes

  • Weight Key Validation (flixopt/transform_accessor.py): Added a pre-check in TransformAccessor.cluster() that computes unknown keys from cluster.weights against the clustering input dataset's variables and raises a ValueError listing the unknown keys and referencing transform.clustering_data(). The check runs before the empty-dataset guard.
  • Clustering Tests (tests/test_clustering/test_integration.py): Replaced tests that previously allowed unknown weight keys with tests expecting ValueError (test_unknown_weight_keys_raise and its multi-period variant), added a passing multi-period test limited to valid weight keys, and updated the single-period constant-column handling assertions.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Title check: ✅ Passed. The title 'fix: validate user-provided clustering weight keys' clearly and concisely summarizes the main change of adding validation for clustering weight keys.
  • Description check: ✅ Passed. The PR description covers the key changes (validation logic, constant-column handling) and includes a comprehensive test plan, though it deviates from the template structure with a custom summary and test-plan format.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, above the required 80.00% threshold.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1


⚠️ Outside diff range comments (1)
tests/test_clustering/test_integration.py (1)

399-405: ⚠️ Potential issue | 🟠 Major

This test never exercises the dropped-constant-column path.

Line 400 calls transform.clustering_data(), which already strips constant arrays, so weights is built only from varying columns. The test will still pass even if constant-column filtering in cluster() breaks. Pull the constant time-series name from fs.to_dataset(include_solution=False) (or add the known constant profile key explicitly) before invoking cluster().

Suggested change
-        # The constant column name (find it from clustering_data)
-        all_data = fs.transform.clustering_data()
-        all_columns = set(all_data.data_vars)
+        full_data = fs.to_dataset(include_solution=False)
+        varying_columns = set(fs.transform.clustering_data().data_vars)
+        constant_columns = {
+            name for name, da in full_data.data_vars.items() if 'time' in da.dims and name not in varying_columns
+        }
+        assert constant_columns, 'test setup failed: expected at least one constant time series'

         # Build weights that reference ALL columns including the constant one
         # that will be dropped — these are valid variables, just constant over time
-        weights = {col: 1.0 for col in all_columns}
+        weights = {col: 1.0 for col in varying_columns | constant_columns}
Inline comment on flixopt/transform_accessor.py (around lines 1701-1709):

The weight validation currently compares cluster.weights against the full dataset (ds.data_vars), which lets weights for excluded variables slip through. Validate cluster.weights against the actual selected clustering input instead: compute the selected input set immediately after the data_vars selection and before drop_constant_arrays() is called, so constant columns remain accepted as valid keys while weights for excluded variables are caught, and raise the same ValueError for any weight keys not in that selected set.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 87f28107-3d5e-4881-82fd-6d7e5691c52d

📥 Commits

Reviewing files that changed from the base of the PR and between 27cce42 and 626727c.

📒 Files selected for processing (2)
  • flixopt/transform_accessor.py
  • tests/test_clustering/test_integration.py

… dataset

- Validate against ds_for_clustering (after data_vars selection) instead
  of ds (full dataset), so weights for excluded variables are caught
- Move validation before drop_constant_arrays so constant columns remain
  accepted as valid keys
- Fix test to use to_dataset() instead of clustering_data() so the
  constant-column path is actually exercised

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
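The ordering this commit describes can be sketched as follows. This is a rough sketch with illustrative names (`prepare_clustering_inputs`, plain dicts instead of xarray datasets); `selected` stands in for the dataset after the data_vars selection:

```python
# Rough sketch of the check ordering from the commit message: select data_vars,
# validate weights against the selection, then drop constant columns.
def prepare_clustering_inputs(ds_vars, data_vars, weights):
    # 1. Apply the data_vars selection first.
    selected = {k: v for k, v in ds_vars.items() if data_vars is None or k in data_vars}
    # 2. Validate weights against the *selected* set, before constants are
    #    dropped, so constant columns remain accepted as valid keys while
    #    weights for excluded variables are caught.
    if weights is not None:
        unknown = set(weights) - set(selected)
        if unknown:
            raise ValueError(f'Unknown weight keys: {sorted(unknown)}')
    # 3. Drop constant columns afterwards; their weights are filtered silently.
    return {k: v for k, v in selected.items() if len(set(v)) > 1}
```

Moving step 2 ahead of step 3 is the crux of the fix: a weight on a constant (but existing) variable passes validation and is then filtered, whereas a weight on an excluded or nonexistent variable raises.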

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
tests/test_clustering/test_integration.py (2)

338-361: Please cover the data_vars allow-list directly.

These additions cover typo keys and the happy path, but they still do not exercise the case where a weight key exists in the FlowSystem and should fail only because data_vars excluded it. One targeted regression for that path would lock in the behavior this change is primarily relying on.

Also applies to: 418-497


399-405: Assert the known constant column explicitly.

all_columns > clustering_columns can already be true because to_dataset() contains other non-time-varying data vars, so this can pass without proving that the constant profile is the column being filtered.

🧪 Tighten the assertion
         all_data = fs.to_dataset(include_solution=False)
         all_columns = set(all_data.data_vars)
         clustering_columns = set(fs.transform.clustering_data().data_vars)
-        assert all_columns > clustering_columns, 'Test requires at least one constant column to be meaningful'
+        constant_column = 'constant_load(constant_out)|fixed_relative_profile'
+        assert constant_column in all_columns
+        assert constant_column not in clustering_columns

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dfae4c89-a44e-43d4-8212-84201f13b394

📥 Commits

Reviewing files that changed from the base of the PR and between 626727c and 2af82d7.

📒 Files selected for processing (2)
  • flixopt/transform_accessor.py
  • tests/test_clustering/test_integration.py

- New test_weight_keys_excluded_by_data_vars_raise: verifies that weight
  keys present on the FlowSystem but excluded by data_vars are rejected
- Replace loose set check with explicit constant-column identification
  and per-column assertions in test_extra_weight_keys_filtered_with_constant_column

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@FBumann FBumann merged commit 4cba143 into main Mar 24, 2026
8 checks passed