Error `ShardCollectContext for {0,1,2} already added` in low-memory situations #15518

amotl · 2024-02-05T17:15:56Z

CrateDB version

latest/nightly

NB: The issue has existed for some time already, i.e. it was not introduced recently.

CrateDB setup information

In this context, a "low-memory situation" is created by loading a sufficiently large volume of data while running with a heap size of 512 MB, as per the default configuration.

docker run --rm -it --publish=4200:4200 crate/crate:nightly

Problem description

Description
We discovered an edge case where CrateDB may fail to detect a low-memory situation through its circuit breaker mechanics. As a result, it responds with an error message that does not make clear where the problem originates.

Example

mlflow.exceptions.MlflowException: (crate.client.exceptions.ProgrammingError) SQLParseException[ShardCollectContext for 0 already added]
[SQL: DELETE FROM metrics]
(Background on this error at: https://sqlalche.me/e/20/f405)

Reference

Catching a CrateDB fluke: ShardCollectContext for {0,2} already added mlflow-cratedb#53

Observations
The problem happens when loading a reasonable amount of data into the table metrics, immediately followed by a DELETE FROM metrics operation.
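The access pattern described above can be sketched in Python as follows. This is only an illustration of "bulk load, then DELETE right away", not a confirmed reproducer. The table name `metrics_demo`, its columns, and the helper names are hypothetical; the sketch assumes a DB API cursor from the CrateDB Python driver (`crate`) with the `?` parameter style.

```python
from typing import Iterator, List, Tuple

Row = Tuple[str, float, int]


def make_rows(count: int) -> Iterator[Row]:
    """Yield synthetic rows (hypothetical columns: metric_name, metric_value, step)."""
    for step in range(count):
        yield (f"metric-{step % 10}", step * 0.5, step)


def chunked(rows: Iterator[Row], size: int) -> Iterator[List[Row]]:
    """Group rows into lists of at most `size` items."""
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def load_then_delete(cursor, table: str, count: int, batch_size: int = 1000) -> int:
    """Bulk-insert `count` rows, then immediately issue a DELETE on the table."""
    inserted = 0
    for batch in chunked(make_rows(count), batch_size):
        cursor.executemany(
            f"INSERT INTO {table} (metric_name, metric_value, step) VALUES (?, ?, ?)",
            batch,
        )
        inserted += len(batch)
    # The DELETE follows the load without any pause, as in the observation.
    cursor.execute(f"DELETE FROM {table}")
    return inserted


def main() -> None:
    """Run against a local CrateDB instance (assumed at http://localhost:4200)."""
    from crate import client  # CrateDB Python driver: pip install crate

    connection = client.connect("http://localhost:4200")
    cursor = connection.cursor()
    cursor.execute(
        "CREATE TABLE IF NOT EXISTS metrics_demo "
        "(metric_name TEXT, metric_value DOUBLE PRECISION, step BIGINT)"
    )
    print(load_then_delete(cursor, "metrics_demo", count=500_000))
```

Calling `main()` runs the sketch against a local instance. The row count of 500,000 is an arbitrary assumption and may need tuning relative to the configured heap size.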

Steps to Reproduce

We tried to isolate the problem by means of a self-contained Python program, shared as cratedb_heap_exchaust_weird_error.py, but failed. ¹

What works well to reproduce the error is simply running two test cases of the MLflow adapter for CrateDB. Here is a quick walkthrough:

Run CrateDB with low heap size

docker run --rm -it --name=cratedb --publish=4200:4200 \
  --env=CRATE_HEAP_SIZE=256m \
  crate/crate:nightly  -Cdiscovery.type=single-node
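To double-check that the container actually runs with the intended heap size, the maximum heap per node can be read from the `sys.nodes` table. The following is a minimal sketch, assuming a DB API cursor from the CrateDB Python driver; the helper name `max_heap_mib` is hypothetical.

```python
from typing import List, Tuple


def max_heap_mib(cursor) -> List[Tuple[str, float]]:
    """Return (node name, maximum heap in MiB) for each node in the cluster."""
    # heap['max'] is reported in bytes.
    cursor.execute("SELECT name, heap['max'] FROM sys.nodes")
    return [(name, heap_bytes / (1024 * 1024)) for name, heap_bytes in cursor.fetchall()]
```

With the `crate` package installed, a cursor can be obtained through `client.connect("http://localhost:4200").cursor()`.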

Setup development sandbox

git clone https://github.com/crate-workbench/mlflow-cratedb.git
cd mlflow-cratedb
python -m venv .venv
source .venv/bin/activate
pip install --use-pep517 --prefer-binary --editable=.[examples,develop,docs,test]

Invoke test cases

time pytest -vvv -k "test_search_runs_returns_expected_results_with_large_experiment or test_search_runs_run_id"

Actual Result

The CrateDB Python driver raises an exception like:

(crate.client.exceptions.ProgrammingError) SQLParseException[ShardCollectContext for 0 already added]

Expected Result

The CrateDB Python driver responds with an error message that better indicates the problem, such as OutOfMemoryError[Java heap space], or with one of the CircuitBreaker-type exceptions. Either would more readily lead the user to the root cause, and to the solution, which is simply to add memory.
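Until the server reports such a clearer message, a client could apply a heuristic to the error text. The sketch below is a hypothetical client-side helper, not part of the CrateDB driver. The marker `CircuitBreakingException` is an assumption about how circuit breaker errors are named in the message, and treating the ShardCollectContext message as memory-related rests solely on the observation in this report.

```python
import re

# Substrings assumed to indicate an explicit memory-related error.
MEMORY_MARKERS = ("OutOfMemoryError", "CircuitBreakingException")

# The misleading message described in this report.
SHARD_CONTEXT_PATTERN = re.compile(r"ShardCollectContext for \d+ already added")


def classify_error(message: str) -> str:
    """Classify an error message as 'memory', 'possibly-memory', or 'other'."""
    if any(marker in message for marker in MEMORY_MARKERS):
        return "memory"
    if SHARD_CONTEXT_PATTERN.search(message):
        # Per this report, this error was seen when running with a low heap size.
        return "possibly-memory"
    return "other"
```

A caller could, for instance, log a hint to check the heap size whenever the result is not `"other"`.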

¹ Using that program, which is intended to emulate the behaviour of the MLflow test cases, we were only able to trigger sound error responses from CrateDB, such as OutOfMemoryError[Java heap space]. Some of them even crashed the process, and others were tripped by the circuit breaker operating correctly, as we observed in the CrateDB log output.

amotl · 2024-02-05T17:22:17Z

On the repository where we originally observed and reported the problem...

Catching a CrateDB fluke: ShardCollectContext for {0,2} already added mlflow-cratedb#53

... we have now simply increased the heap size assigned to CrateDB on the CI runner, and we believe the error will go away without further ado.

CI: Increase heap memory size for CrateDB to 4 GB mlflow-cratedb#100

jeeminso · 2024-02-05T20:44:22Z

Thank you for providing steps to reproduce, @amotl. I can reproduce this locally and confirm that the cause is a duplicate readerId = 0.

amotl · 2024-02-05T20:49:48Z

Thank you. So, do you think the corresponding patch you are preparing may resolve this problem already?

Assert collect task is not completed before fall-back to resume #15495

jeeminso · 2024-02-05T21:13:32Z

I worked out that PR based simply on reading the code (without a reproducible scenario). I would still like to clarify my finding, but that will come after producing a fix for this.

jeeminso · 2024-02-14T14:55:48Z

Thank you for reporting, @amotl. The fix will be available with the next hotfix release.

amotl added the triage An issue that needs to be triaged by a maintainer label Feb 5, 2024

jeeminso added bug Clear identification of incorrect behaviour and removed triage An issue that needs to be triaged by a maintainer labels Feb 5, 2024

jeeminso self-assigned this Feb 5, 2024

amotl mentioned this issue Feb 5, 2024

Catching a CrateDB fluke: ShardCollectContext for {0,2} already added crate/mlflow-cratedb#53

Closed

jeeminso mentioned this issue Feb 6, 2024

Prevent duplicate SharedShardContext.readerId #15520

Merged

mergify bot closed this as completed in #15520 Feb 13, 2024
