What happened?
TestOfNoEffectAnalysis (and the underlying functions in ax/utils/stats/no_effects.py) mishandles three inputs that are easy to produce through the public API:
- KeyError: 'n' on standard Client data. check_experiment_effects_per_metric reads dfm["n"], but n is not in Data.REQUIRED_COLUMNS and is never produced by the standard Client.complete_trial(...) flow. Computing the analysis on any such experiment crashes with a bare KeyError: 'n' instead of an informative error.
- sem == 0 silently flips the conclusion. With deterministic data (sem=0, common in BO settings), no_effect_test_welch divides by zero (ws = n / variances), yielding a NaN p-value. has_effect = bool(nan < alpha) is False, so the healthcheck warns "no effects have been detected" even when the exactly-known means clearly differ, which is the opposite of the correct conclusion. The only symptom is a numpy RuntimeWarning.
- Single-arm trials (K == 1) silently produce NaN. The Welch statistic divides by K - 1 == 0, again giving NaN and therefore has_effect=False with no error, even though a one-arm group can't be tested at all.
Also minor: the check_experiment_effects docstring describes ineffective_on_objectives with the same sentence as effective ("can be rejected"), but it means the opposite.
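The mechanism behind cases 2 and 3 is just NaN comparison semantics, which can be shown without Ax at all. This is a minimal standalone sketch, not code from no_effects.py: the function names and the ALPHA value are hypothetical, and ValueError stands in for whatever error type Ax would raise.

```python
import math

ALPHA = 0.05  # hypothetical significance level, for illustration only


def has_effect_unguarded(p_value: float, alpha: float = ALPHA) -> bool:
    # Same shape as the comparison quoted above: any comparison
    # involving NaN evaluates to False, so NaN reads as "no effect".
    return bool(p_value < alpha)


def has_effect_guarded(p_value: float, alpha: float = ALPHA) -> bool:
    # Hypothetical guard: refuse to draw a conclusion from an
    # undefined p-value instead of silently returning False.
    if math.isnan(p_value):
        raise ValueError("p-value is NaN; the test is undefined for this input")
    return p_value < alpha
```

With the unguarded version, a NaN p-value is indistinguishable from a large p-value, which is why the healthcheck reports "no effects" rather than failing.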
Please provide a minimal, reproducible example of the unexpected behavior.
from ax import Client, RangeParameterConfig
from ax.analysis.healthcheck.no_effects_analysis import TestOfNoEffectAnalysis

client = Client()
client.configure_experiment(
    parameters=[RangeParameterConfig(name="x", parameter_type="float", bounds=(0, 1))]
)
client.configure_optimization(objective="y")

# Attach and complete three single-arm trials through the standard Client flow.
for x, y in [(0.1, 1.0), (0.5, 2.0), (0.9, 1.5)]:
    idx = client.attach_trial(parameters={"x": x})
    client.complete_trial(trial_index=idx, raw_data={"y": y})

TestOfNoEffectAnalysis().compute(experiment=client._experiment)  # KeyError: 'n'
For case 2 (silent wrong result):
import pandas as pd
from ax.core.data import Data
from ax.utils.stats.no_effects import check_experiment_effects_per_metric

# Two arms in one trial, both deterministic (sem=0) with different means.
data = Data(df=pd.DataFrame([
    {"trial_index": 0, "arm_name": "0_0", "metric_name": "m1",
     "metric_signature": "m1", "mean": 1.0, "sem": 0.0, "n": 100},
    {"trial_index": 0, "arm_name": "0_1", "metric_name": "m1",
     "metric_signature": "m1", "mean": 2.0, "sem": 0.0, "n": 100},
]))
out = check_experiment_effects_per_metric(data=data, objective_names={"m1"})
print(out)  # p_value=NaN, has_effect=False, but means 1.0 vs 2.0 with sem=0
            # is a maximally significant effect
Please paste any relevant traceback/logs produced by the example provided.
Traceback (most recent call last):
...
File "ax/utils/stats/no_effects.py", line 62, in check_experiment_effects_per_metric
ns=list(dfm["n"]),
...
KeyError: 'n'
# and for sem=0:
ax/utils/stats/no_effects.py:239: RuntimeWarning: divide by zero encountered in divide
ax/utils/stats/no_effects.py:246: RuntimeWarning: invalid value encountered in scalar divide
Ax Version
main (reproduced at d8eeb97)
Python Version
3.14
Operating System
Linux
(Optional) Describe any potential fixes you've considered to the issue outlined above.
- Validate the n column up front in check_experiment_effects_per_metric / check_experiment_effects and raise a UserInputError with an actionable message.
- Handle zero-variance arms explicitly in no_effect_test_welch: differing exactly-known means are an exact effect (p=0); identical all-deterministic means are no effect (p=1); a mixed zero/positive-sem group raises UserInputError since Welch's test is undefined there.
- Skip single-arm (metric, trial) groups in check_experiment_effects_per_metric, validate K >= 2 and n > 1 in no_effect_test_welch, and raise a UserInputError from TestOfNoEffectAnalysis when no trial has two or more arms.
I have a patch with these changes plus tests ready and will open a PR.
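For discussion, the validation rules listed above can be sketched roughly as follows. This is not the patch itself: check_columns and classify_group are hypothetical names, ValueError stands in for Ax's UserInputError, the REQUIRED set is an assumption about which columns the test reads, and the sketch only decides which branch applies rather than computing the Welch statistic.

```python
from typing import Sequence

# Assumption for this sketch: the columns the test needs per arm.
REQUIRED = {"mean", "sem", "n"}


def check_columns(columns: Sequence[str]) -> None:
    # Fail early with an actionable message instead of a bare KeyError.
    missing = REQUIRED - set(columns)
    if missing:
        raise ValueError(
            f"Data is missing required column(s) {sorted(missing)}; "
            "the test of no effect needs per-arm sample sizes."
        )


def classify_group(
    means: Sequence[float], sems: Sequence[float], ns: Sequence[int]
) -> str:
    """Return 'exact_effect', 'no_effect', or 'welch' for one (metric, trial) group."""
    if not (len(means) == len(sems) == len(ns)):
        raise ValueError("means, sems and ns must have the same length")
    if len(means) < 2:
        raise ValueError("need at least two arms (K >= 2) to test for an effect")
    if any(n <= 1 for n in ns):
        raise ValueError("every arm needs n > 1")

    is_zero = [s == 0 for s in sems]
    if all(is_zero):
        # All means are exactly known: compare them directly.
        return "no_effect" if len(set(means)) == 1 else "exact_effect"
    if any(is_zero):
        # Mixed zero/positive sem: treated as undefined in this sketch.
        raise ValueError("mixed zero and positive sem in one group")
    return "welch"  # safe to run the usual Welch test
```

In this sketch 'exact_effect' corresponds to p=0 and 'no_effect' to p=1 from the list above; single-arm groups raise here, whereas the proposed change would skip them at the per-metric level and only raise when no trial has two or more arms.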
Pull Request
Yes, opening one
Code of Conduct