Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error ShardCollectContext for {0,1,2} already added in low-memory situations #15518

Closed
amotl opened this issue Feb 5, 2024 · 5 comments · Fixed by #15520
Closed

Error ShardCollectContext for {0,1,2} already added in low-memory situations #15518

amotl opened this issue Feb 5, 2024 · 5 comments · Fixed by #15520
Assignees
Labels
bug Clear identification of incorrect behaviour

Comments

@amotl
Copy link
Member

amotl commented Feb 5, 2024

CrateDB version

latest/nightly

NB: The issue exists for a longer time already, i.e. it has not been introduced recently.

CrateDB setup information

In this context, a "low-memory situation" is created by loading enough volume of data when running on 512 MB heap size, per default configuration.

docker run --rm -it --publish=4200:4200 crate/crate:nightly

Problem description

Description
We discovered an edge case, where CrateDB may not be able to detect a low-memory situation through corresponding circuit breaker mechanics. Thus, it responds with an error message which does not make it clear where the problem is originating from.

Example

mlflow.exceptions.MlflowException: (crate.client.exceptions.ProgrammingError) SQLParseException[ShardCollectContext for 0 already added]
[SQL: DELETE FROM metrics]
(Background on this error at: https://sqlalche.me/e/20/f405)

Reference

Observations
The problem happens when loading a reasonable amount of data into the table metrics, quickly succeeded by a DELETE FROM metrics operation.

Steps to Reproduce

We tried to isolate the problem on behalf of a corresponding self-contained Python program, shared per cratedb_heap_exchaust_weird_error.py, but failed. 1.

What works well to reproduce the error is indeed by just running two test cases of the MLflow adapter for CrateDB. Hereby, we are sharing a quick walkthrough:

Run CrateDB with low heap size

docker run --rm -it --name=cratedb --publish=4200:4200 \
  --env=CRATE_HEAP_SIZE=256m \
  crate/crate:nightly  -Cdiscovery.type=single-node

Setup development sandbox

git clone https://github.com/crate-workbench/mlflow-cratedb.git
cd mlflow-cratedb
python -m venv .venv
source .venv/bin/activate
pip install --use-pep517 --prefer-binary --editable=.[examples,develop,docs,test]

Invoke test cases

time pytest -vvv -k "test_search_runs_returns_expected_results_with_large_experiment or test_search_runs_run_id"

Actual Result

The CrateDB Python driver raises an exception like:

(crate.client.exceptions.ProgrammingError) SQLParseException[ShardCollectContext for 0 already added]

Expected Result

The CrateDB Python driver responds with an error message better indicating the problem, like OutOfMemoryError[Java heap space], or other exceptions like the CircuitBreaker-types, which also more easily lead the user to the right root cause, that the solution is just about adding memory.

Footnotes

  1. By using that program, which intends to emulate MLflow test case behaviour, we only have been able to trip sound error responses like OutOfMemoryError[Java heap space] by CrateDB, some of them even crashing the process, and some of them tripped by the circuit breaker operating correctly, which we observed on the CrateDB log output.

@amotl amotl added the triage An issue that needs to be triaged by a maintainer label Feb 5, 2024
@amotl
Copy link
Member Author

amotl commented Feb 5, 2024

On the repository where we originally observed and reported about the problem...

... we now just increased the heap size assigned to CrateDB on the CI runner, and we believe the error will go away without further ado.

@jeeminso jeeminso added bug Clear identification of incorrect behaviour and removed triage An issue that needs to be triaged by a maintainer labels Feb 5, 2024
@jeeminso jeeminso self-assigned this Feb 5, 2024
@jeeminso
Copy link
Contributor

jeeminso commented Feb 5, 2024

Thank you for providing steps reproduce @amotl . I can reproduce this locally and confirm that the cause is a duplicate readerId = 0.

스크린샷, 2024-02-05 15-39-29

@amotl
Copy link
Member Author

amotl commented Feb 5, 2024

Thank you. So, do you think the corresponding patch you are preparing may resolve this problem already?

@jeeminso
Copy link
Contributor

jeeminso commented Feb 5, 2024

I worked out that PR based on simply reading the code(without reproducible scenario). Still would like to clarify my finding but that would be after producing a fix for this.

@jeeminso
Copy link
Contributor

Thank you for reporting, @amotl the fix will be available with next hotfix release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Clear identification of incorrect behaviour
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants