Debugging RuntimeError: cholesky_cpu: For batch 2: U(77,77) is zero, singular U.

This issue is related to https://github.com/facebook/Ax/issues/228, https://github.com/facebook/Ax/issues/99, https://github.com/facebook/Ax/issues/308 and possibly some others as well, but I'm mainly interested in two things:

1. Is there a way to debug what is causing the failure? I think it is related to ill-conditioning of the underlying GP model, but I'm not sure how to confirm this.
2. When given a set of parameters for a trial generated from the model, is there a way to sample repeatedly in the neighborhood and return a mean and variance for the trial?

Using the Service API, my experiment is set up as follows:

```
ax_client.create_experiment(
        name="test_dct_slpEnrg",
        parameters=[
            {
                "name" : "w1",
                "type" : "range",
                "value_type" : "float",
                "bounds" : [1.0e-1, 1.0e2]
            },
            {
                "name" : "w2",
                "type" : "range",
                "value_type" : "float",
                "bounds" : [1.0, 1.0e2]
            },
            {
                "name" : "w3",
                "type" : "range",
                "value_type" : "float",
                "bounds" : [1.0e-3, 1.0]
            },
            {
                "name" : "w4",
                "type" : "range",
                "value_type" : "int",
                "bounds" : [10, 20]
            },
            {
                "name" : "w5",
                "type" : "range",
                "value_type" : "int",
                "bounds" : [2, 20]
            }
        ],
        objective_name="Tc2_slpEnrg",
        minimize=True,
        parameter_constraints=["w4 >= w5", "w2 - w1 >= 0"],
        outcome_constraints=["slp_speed <= 3", "engn_trq >= 0.001"],
        choose_generation_strategy_kwargs={
            "num_initialization_trials": num_init,
            "winsorize_botorch_model": True,
            "winsorization_limits": (0.0, 0.3),
        },
    )
```
The sampled parameters are passed to an evaluation function that internally runs an optimization routine, which either converges and outputs valid `Tc2_slpEnrg, slp_speed, engn_trq` values or fails to converge. A valid result is indicated by `slp_speed <= 3`, which I have also set as an `outcome_constraint`. I was unsure how to deal with parameter values that are invalid (non-convergence), as discussed in https://github.com/facebook/Ax/issues/372.

Currently, my approach is as follows. During the initial Sobol steps, I use `abandon_trial` for values which do not converge. After the Sobol steps, to discourage the model from sampling near parameters that turned out to be invalid, I set the objective to a penalty value of `3000`, which is not excessively high but is very unlikely to occur normally.
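A rough sketch of this handling, for illustration only. The function name `to_raw_data` and the shape of the `metrics` dict are hypothetical and not my actual evaluation code; the SEM of `0.0` mirrors what appears in the log below.

```python
from typing import Dict, Optional, Tuple

PENALTY = 3000.0        # penalty objective used after the Sobol phase
SLP_SPEED_LIMIT = 3.0   # validity threshold, same as the outcome constraint


def to_raw_data(
    metrics: Dict[str, float], in_sobol_phase: bool
) -> Optional[Dict[str, Tuple[float, float]]]:
    """Map one evaluation to Ax-style raw_data, or None to abandon the trial.

    `metrics` is assumed to hold the raw outputs of the evaluation function,
    keyed by metric name (hypothetical structure).
    """
    converged = metrics["slp_speed"] <= SLP_SPEED_LIMIT
    if not converged and in_sobol_phase:
        # Caller would then use ax_client.abandon_trial(trial_index=...).
        return None
    # Report every metric as (mean, SEM) with SEM fixed at 0.0.
    raw = {name: (value, 0.0) for name, value in metrics.items()}
    if not converged:
        # Overwrite only the objective with the penalty value.
        raw["Tc2_slpEnrg"] = (PENALTY, 0.0)
    return raw
```

When the function returns a dict, it would be passed to `ax_client.complete_trial(trial_index=..., raw_data=...)`.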

I think this is the main reason the instability is occurring: nearby values can be very noisy, and the objective can jump between roughly 1000 and 3000 despite very small changes in the parameters. This is why I'd like to sample from a small neighborhood around the generated trial parameters and return the mean as the value. I'm unsure whether Ax supports this feature or whether it's something I would need to set up through BoTorch.
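To make the idea concrete, here is a minimal sketch of neighborhood averaging done outside of Ax. It is not an Ax or BoTorch feature; `neighborhood_mean_sem`, `rel_radius`, and `n_samples` are names made up for this example. It handles float parameters only, so integer parameters such as `w4` and `w5` would need rounding, and the parameter constraints are not enforced here.

```python
import math
import random
from typing import Callable, Dict, Tuple

Bounds = Dict[str, Tuple[float, float]]


def neighborhood_mean_sem(
    evaluate: Callable[[Dict[str, float]], float],
    params: Dict[str, float],
    bounds: Bounds,
    rel_radius: float = 0.01,
    n_samples: int = 10,
    seed: int = 0,
) -> Tuple[float, float]:
    """Evaluate `n_samples` perturbed copies of `params`; return (mean, SEM).

    Each parameter is perturbed uniformly within +/- rel_radius times the
    width of its range and clipped back into its bounds. Needs n_samples >= 2.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_samples):
        point = {}
        for name, x in params.items():
            lo, hi = bounds[name]
            step = rel_radius * (hi - lo)
            point[name] = min(hi, max(lo, x + rng.uniform(-step, step)))
        values.append(evaluate(point))
    mean = sum(values) / n_samples
    # Unbiased sample variance, then standard error of the mean.
    var = sum((v - mean) ** 2 for v in values) / (n_samples - 1)
    return mean, math.sqrt(var / n_samples)
```

The resulting `(mean, sem)` pair has the same shape as the `(value, SEM)` tuples shown in the log below, so it could be reported for the objective in place of `(value, 0.0)`.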

However, I have also tried abandoning these parameter values (during the GPEI step) and still ran into these errors, so I am unsure what the actual issue is or how to resolve it.

Here is a snippet of the trace when the error occurs. Note that I periodically output the best parameter values found so far, since the run fails completely when this RuntimeError occurs:

```
[INFO 09-04 11:32:32] ax.service.ax_client: Generated new trial 630 with parameters {'w1': 19.57, 'w2': 72.48, 'w3': 0.77, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:32:33] ax.service.ax_client: Completed trial 630 with data: {'Tc2_slpEnrg': (1087.14, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.51, 0.0), 'engn_trq': (12.7, 0.0)}.
Completed 125 of 500 trials
[INFO 09-04 11:32:36] ax.service.ax_client: Generated new trial 631 with parameters {'w1': 49.14, 'w2': 59.68, 'w3': 0.27, 'w4': 17, 'w5': 9}.
[INFO 09-04 11:32:37] ax.service.ax_client: Completed trial 631 with data: {'Tc2_slpEnrg': (1082.28, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.46, 0.0), 'engn_trq': (12.64, 0.0)}.
Completed 126 of 500 trials
[INFO 09-04 11:32:40] ax.service.ax_client: Generated new trial 632 with parameters {'w1': 37.7, 'w2': 59.59, 'w3': 0.09, 'w4': 19, 'w5': 9}.
[INFO 09-04 11:32:42] ax.service.ax_client: Completed trial 632 with data: {'Tc2_slpEnrg': (1084.8, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.39, 0.0), 'engn_trq': (12.47, 0.0)}.
Completed 127 of 500 trials
[INFO 09-04 11:32:45] ax.service.ax_client: Generated new trial 633 with parameters {'w1': 71.5, 'w2': 82.59, 'w3': 0.27, 'w4': 20, 'w5': 15}.
Did not converge: (3.869655369315524, 0.0). Setting value to 3000
[INFO 09-04 11:32:48] ax.service.ax_client: Completed trial 633 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (3.87, 0.0), 'engn_trq': (12.63, 0.0)}.
Completed 128 of 500 trials
[INFO 09-04 11:32:51] ax.service.ax_client: Generated new trial 634 with parameters {'w1': 45.01, 'w2': 66.33, 'w3': 0.15, 'w4': 17, 'w5': 9}.
[INFO 09-04 11:32:52] ax.service.ax_client: Completed trial 634 with data: {'Tc2_slpEnrg': (1072.64, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.86, 0.0), 'engn_trq': (12.65, 0.0)}.
Completed 129 of 500 trials
[INFO 09-04 11:32:56] ax.service.ax_client: Generated new trial 635 with parameters {'w1': 53.86, 'w2': 58.84, 'w3': 0.06, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:32:57] ax.service.ax_client: Completed trial 635 with data: {'Tc2_slpEnrg': (1087.49, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (1.98, 0.0), 'engn_trq': (12.64, 0.0)}.
Completed 130 of 500 trials
[INFO 09-04 11:33:00] ax.service.ax_client: Generated new trial 636 with parameters {'w1': 43.28, 'w2': 67.07, 'w3': 0.29, 'w4': 19, 'w5': 9}.
[INFO 09-04 11:33:01] ax.service.ax_client: Completed trial 636 with data: {'Tc2_slpEnrg': (1083.72, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.41, 0.0), 'engn_trq': (12.47, 0.0)}.
Completed 131 of 500 trials
Best params: {'w1': 71.1208035618998, 'w2': 82.38559271674603, 'w3': 0.23882054011337459, 'w4': 20, 'w5': 15}  {'slp_speed': 2.3737070108366964, 'engn_trq': 12.243898660289364, 'Tc2_slpEnrg': 1030.4849595920462, 'max_abs_Jerk': 4.059934646446776}
Completed 131 of 500 trials
[INFO 09-04 11:33:04] ax.service.ax_client: Generated new trial 637 with parameters {'w1': 27.6, 'w2': 64.84, 'w3': 0.2, 'w4': 16, 'w5': 9}.
[INFO 09-04 11:33:05] ax.service.ax_client: Completed trial 637 with data: {'Tc2_slpEnrg': (1075.64, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.97, 0.0), 'engn_trq': (12.57, 0.0)}.
Completed 132 of 500 trials
[INFO 09-04 11:33:09] ax.service.ax_client: Generated new trial 638 with parameters {'w1': 44.71, 'w2': 62.99, 'w3': 0.25, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:33:10] ax.service.ax_client: Completed trial 638 with data: {'Tc2_slpEnrg': (1089.75, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.05, 0.0), 'engn_trq': (12.64, 0.0)}.
Completed 133 of 500 trials
[INFO 09-04 11:33:13] ax.service.ax_client: Generated new trial 639 with parameters {'w1': 20.52, 'w2': 70.79, 'w3': 0.8, 'w4': 17, 'w5': 9}.
Did not converge: (3.0800594621335904, 0.0). Setting value to 3000
[INFO 09-04 11:33:14] ax.service.ax_client: Completed trial 639 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (3.08, 0.0), 'engn_trq': (12.74, 0.0)}.
Completed 134 of 500 trials
[INFO 09-04 11:33:18] ax.service.ax_client: Generated new trial 640 with parameters {'w1': 36.74, 'w2': 60.21, 'w3': 0.43, 'w4': 15, 'w5': 9}.
Did not converge: (86.95642479778826, 0.0). Setting value to 3000
[INFO 09-04 11:33:18] ax.service.ax_client: Completed trial 640 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (2.26, 0.0), 'slp_speed': (86.96, 0.0), 'engn_trq': (70.0, 0.0)}.
Completed 135 of 500 trials
[INFO 09-04 11:33:22] ax.service.ax_client: Generated new trial 641 with parameters {'w1': 13.41, 'w2': 66.27, 'w3': 0.18, 'w4': 17, 'w5': 9}.
[INFO 09-04 11:33:23] ax.service.ax_client: Completed trial 641 with data: {'Tc2_slpEnrg': (1073.76, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.88, 0.0), 'engn_trq': (12.65, 0.0)}.
Completed 136 of 500 trials
[INFO 09-04 11:33:27] ax.service.ax_client: Generated new trial 642 with parameters {'w1': 28.77, 'w2': 66.15, 'w3': 0.22, 'w4': 16, 'w5': 9}.
[INFO 09-04 11:33:28] ax.service.ax_client: Completed trial 642 with data: {'Tc2_slpEnrg': (1074.72, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.98, 0.0), 'engn_trq': (12.57, 0.0)}.
Completed 137 of 500 trials
[INFO 09-04 11:33:32] ax.service.ax_client: Generated new trial 643 with parameters {'w1': 25.92, 'w2': 67.46, 'w3': 0.79, 'w4': 17, 'w5': 9}.
Did not converge: (3.105967542260089, 0.0). Setting value to 3000
[INFO 09-04 11:33:33] ax.service.ax_client: Completed trial 643 with data: {'Tc2_slpEnrg': (3000, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (3.11, 0.0), 'engn_trq': (12.74, 0.0)}.
Completed 138 of 500 trials
[INFO 09-04 11:33:37] ax.service.ax_client: Generated new trial 644 with parameters {'w1': 37.94, 'w2': 66.57, 'w3': 0.32, 'w4': 18, 'w5': 8}.
[INFO 09-04 11:33:38] ax.service.ax_client: Completed trial 644 with data: {'Tc2_slpEnrg': (1086.47, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.48, 0.0), 'engn_trq': (12.61, 0.0)}.
Completed 139 of 500 trials
[INFO 09-04 11:33:41] ax.service.ax_client: Generated new trial 645 with parameters {'w1': 38.19, 'w2': 65.75, 'w3': 0.19, 'w4': 18, 'w5': 9}.
[INFO 09-04 11:33:43] ax.service.ax_client: Completed trial 645 with data: {'Tc2_slpEnrg': (1072.85, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.8, 0.0), 'engn_trq': (12.71, 0.0)}.
Completed 140 of 500 trials
[INFO 09-04 11:33:47] ax.service.ax_client: Generated new trial 646 with parameters {'w1': 37.22, 'w2': 65.46, 'w3': 0.23, 'w4': 17, 'w5': 8}.
[INFO 09-04 11:33:47] ax.service.ax_client: Completed trial 646 with data: {'Tc2_slpEnrg': (1085.76, 0.0), 'max_abs_Jerk': (4.06, 0.0), 'slp_speed': (2.56, 0.0), 'engn_trq': (12.53, 0.0)}.
Completed 141 of 500 trials
Best params: {'w1': 71.1208035618998, 'w2': 82.38559271674603, 'w3': 0.23882054011337459, 'w4': 20, 'w5': 15}  {'slp_speed': 2.373707491338672, 'engn_trq': 12.243896324181325, 'Tc2_slpEnrg': 1030.4849537960213, 'max_abs_Jerk': 4.059934784356625}
Completed 141 of 500 trials


Traceback (most recent call last):
  File "/home/mlab/gitRepo/cvt_opt/cvt_bayes_opt/dct_service_debug.py", line 170, in <module>
    trial_params, trial_index = ax_client.get_next_trial()
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/service/ax_client.py", line 275, in get_next_trial
    trial = self.experiment.new_trial(generator_run=self._gen_new_generator_run())
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/service/ax_client.py", line 865, in _gen_new_generator_run
    experiment=self.experiment
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/generation_strategy.py", line 376, in gen
    keywords=get_function_argument_names(model.gen),
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/base.py", line 626, in gen
    model_gen_options=model_gen_options,
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/array.py", line 238, in _gen
    target_fidelities=target_fidelities,
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/modelbridge/torch.py", line 260, in _model_best_point
    target_fidelities=target_fidelities,
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/botorch.py", line 458, in best_point
    target_fidelities=target_fidelities,
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/botorch_defaults.py", line 353, in recommend_best_observed_point
    options=model_gen_options,
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/model_utils.py", line 296, in best_observed_point
    options=options,
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/model_utils.py", line 399, in best_in_sample_point
    f, cov = as_array(model.predict(X_obs))
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/botorch.py", line 314, in predict
    return self.model_predictor(model=self.model, X=X)  # pyre-ignore [28]
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/ax/models/torch/utils.py", line 454, in predict_from_model
    posterior = model.posterior(X)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/botorch/models/gpytorch.py", line 301, in posterior
    mvn = self(X)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_gp.py", line 328, in __call__
    predictive_mean, predictive_covar = self.prediction_strategy.exact_prediction(full_mean, full_covar)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_prediction_strategies.py", line 302, in exact_prediction
    self.exact_predictive_mean(test_mean, test_train_covar),
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_prediction_strategies.py", line 320, in exact_predictive_mean
    res = (test_train_covar @ self.mean_cache.unsqueeze(-1)).squeeze(-1)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/memoize.py", line 34, in g
    add_to_cache(self, cache_name, method(self, *args, **kwargs))
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/models/exact_prediction_strategies.py", line 269, in mean_cache
    mean_cache = train_train_covar.inv_matmul(train_labels_offset).squeeze(-1)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py", line 934, in inv_matmul
    return func.apply(self.representation_tree(), False, right_tensor, *self.representation())
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/functions/_inv_matmul.py", line 47, in forward
    solves = _solve(lazy_tsr, right_tensor)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/functions/_inv_matmul.py", line 11, in _solve
    return lazy_tsr._cholesky()._cholesky_solve(rhs)
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/memoize.py", line 34, in g
    add_to_cache(self, cache_name, method(self, *args, **kwargs))
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py", line 414, in _cholesky
    cholesky = psd_safe_cholesky(evaluated_mat).contiguous()
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/cholesky.py", line 48, in psd_safe_cholesky
    raise e
  File "/home/mlab/.local/share/virtualenvs/cvt_opt-pgLzqWRW/lib/python3.7/site-packages/gpytorch/utils/cholesky.py", line 25, in psd_safe_cholesky
    L = torch.cholesky(A, upper=upper, out=out)
RuntimeError: cholesky_cpu: For batch 2: U(99,99) is zero, singular U.
```
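The traceback ends in `psd_safe_cholesky`, reached from `mean_cache` while solving against the training covariance, and the log shows many nearly identical parameter sets, each reported with an SEM of `0.0`. One way to check the ill-conditioning hypothesis outside of Ax might look like the sketch below. It is only a stand-in: the RBF kernel and fixed lengthscale are assumptions, not the kernel or hyperparameters that Ax/BoTorch actually fit, and `X` is assumed to hold the completed trials' parameters scaled to [0, 1].

```python
import numpy as np


def rbf_kernel(X: np.ndarray, lengthscale: float = 0.5) -> np.ndarray:
    """Squared-exponential kernel matrix for inputs scaled to the unit cube."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)


def near_duplicates(X: np.ndarray, tol: float = 1e-3):
    """Return index pairs (i, j), i < j, whose points are closer than tol."""
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    i, j = np.where(np.triu(dists < tol, k=1))
    return list(zip(i.tolist(), j.tolist()))


def min_jitter_for_cholesky(K, jitters=(0.0, 1e-8, 1e-6, 1e-4, 1e-2)):
    """Return the smallest tried diagonal jitter for which Cholesky succeeds."""
    for jitter in jitters:
        try:
            np.linalg.cholesky(K + jitter * np.eye(len(K)))
            return jitter
        except np.linalg.LinAlgError:
            continue
    return None  # none of the tried jitters was enough


# Example: 20 random points plus an exact copy of point 3.
rng = np.random.default_rng(0)
X = rng.random((20, 5))
X_dup = np.vstack([X, X[3]])
K = rbf_kernel(X_dup)
print("near-duplicate pairs:", near_duplicates(X_dup))
print("condition number:", np.linalg.cond(K))
print("jitter needed:", min_jitter_for_cholesky(K))
```

A very large condition number together with a non-empty list of near-duplicate pairs would be consistent with repeated, noise-free observations making the covariance matrix numerically singular.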

Please let me know what your thoughts are about my problem and how I should proceed. Thanks!



Debugging RuntimeError: cholesky_cpu: For batch 2: U(77,77) is zero, singular U. #381
