Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catching a CrateDB fluke: ShardCollectContext for {0,2} already added #53

Closed
amotl opened this issue Nov 1, 2023 · 19 comments
Closed
Labels
bug Something isn't working

Comments

@amotl
Copy link
Member

amotl commented Nov 1, 2023

Report

While trying to bring in GH-52, we caught an unusual error from CrateDB we haven't seen before.

ShardCollectContext for 0 already added

It is happening on a DELETE FROM SQL statement.

DELETE FROM metrics;

Details

The DELETE FROM metrics is happening within a regular integration test scenario on the canonical setUp() method calling out to self.pruneTables(), in order to supply the test cases with a blank canvas. This is nothing special, we have been doing it like this for a while already, on behalf of different test suites we are maintaining. Also note it was really only a fluke: Re-running the test cases made them succeed on the first attempt already.

-- https://github.com/crate-workbench/mlflow-cratedb/actions/runs/6721484528/job/18267317481?pr=52#step:6:491

Thoughts

As we are using CrateDB nightly for all of our downstream integration tests, it may be a regression introduced just recently.

/cc @matriv, @seut

@amotl amotl changed the title Catching a CrateDB fluke Catching a CrateDB fluke: ShardCollectContext for 0 already added Nov 1, 2023
@amotl amotl changed the title Catching a CrateDB fluke: ShardCollectContext for 0 already added Catching a CrateDB fluke: ShardCollectContext for 0 already added Nov 1, 2023
@BaurzhanSakhariev
Copy link

BaurzhanSakhariev commented Nov 6, 2023

Hi, I don't have permissions to assign myself. I will look into this issue.

@amotl
Copy link
Member Author

amotl commented Nov 6, 2023

Thank you, I've just assigned you. Please note that it was only a single event, and it hasn't been observed ever after. Maybe a real SEU? So, you will have to evaluate if this is important / feels serious, or not.

If you decide that it may actually have been an SEU, or if you can't find anything suspicious, let's just close the issue again.

@BaurzhanSakhariev
Copy link

Maybe relates to crate/crate#11677

@amotl
Copy link
Member Author

amotl commented Nov 13, 2023

@matriv
Copy link

matriv commented Nov 13, 2023

could it be that we need to have an ensureGreen() alternative for this one?

@amotl
Copy link
Member Author

amotl commented Nov 14, 2023

Thank you for watching this conversation, Marios.

Could it be that we need to have an ensureGreen() alternative for this one?

I am not sure what you are particularly referring to with ensureGreen(). Up until now, I only have been able to spot this flaw twice, starting with its first occurrance two weeks ago.

@amotl
Copy link
Member Author

amotl commented Nov 14, 2023

Just to add more context here, to improve the original report: The flaw is apparently not happening on CrateDB startup, but at regular runtime.

tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_pagination PASSED [ 85%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_returns_expected_results_with_large_experiment PASSED [ 85%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_run_id FAILED [ 86%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_run_name PASSED [ 87%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_start_time_alias PASSED [ 88%]

-- https://github.com/crate-workbench/mlflow-cratedb/actions/runs/6721484528/job/18267317481?pr=52#step:6:154

tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_pagination PASSED [ 85%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_returns_expected_results_with_large_experiment PASSED [ 86%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_run_id FAILED [ 86%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_run_name PASSED [ 87%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_start_time_alias PASSED [ 88%]

-- https://github.com/crate-workbench/mlflow-cratedb/actions/runs/6846915144/job/18614271079?pr=64#step:6:488

@matriv
Copy link

matriv commented Nov 14, 2023

I am not sure what you are particularly referring to with ensureGreen(). Up until now, I only have been able to spot this flaw twice, starting with its first occurrance two weeks ago.

I'm probably wrong, I was thinking to check that all shards of a table are allocated before proceeding with statements, but we use only one cratedb node, right?

@amotl
Copy link
Member Author

amotl commented Nov 14, 2023

The output I've shared above clearly indicates that it is happenening at the very same test case function test_search_runs_run_id. What might be particularly more interesting in this case, is the predecessor test case function, test_search_runs_returns_expected_results_with_large_experiment 1.

This observation indicates that CrateDB might not always be ready to accept a DELETE FROM ... statement on a database table after inserting a large amount of data 2 into it. Do you think we are missing an additional REFRESH TABLE ... statement before conducting the DELETE FROM ...?

We can try, and it may qualify as a workaround. However, you may also want to address this on behalf of CrateDB itself, when applicable.

Footnotes

  1. https://github.com/crate-workbench/mlflow-cratedb/blob/7d700523443848917848deac6706ceb677743513/tests/test_tracking.py#L2819-L2831

  2. ... at least, of the shape/characteristics as performed by the test_search_runs_returns_expected_results_with_large_experiment test case function, which I haven't investigated yet.

@amotl
Copy link
Member Author

amotl commented Nov 14, 2023

[You are] using only one CrateDB node, right?

This is correct, the issue is about two flukes reported from pedantically observing CI runs where CrateDB Nightly is used purposely.

@amotl
Copy link
Member Author

amotl commented Nov 14, 2023

To add more information about chronology: It has first happened here at GH-52, on Wed, 01 Nov 2023, with this release 1:

version[5.6.0-SNAPSHOT], pid[1], build[827ccae/NA], OS[Linux/6.2.0-1015-azure/amd64], JVM[Eclipse Adoptium/OpenJDK 64-Bit Server VM/21.0.1+12-LTS]

The second spot was:

version[5.6.0-SNAPSHOT], pid[1], build[ca0a7b6/NA], OS[Linux/6.2.0-1015-azure/amd64], JVM[Eclipse Adoptium/OpenJDK 64-Bit Server VM/21.0.1+12-LTS]

Footnotes

  1. Unfortunately, we didn't schedule nightly checks on this repository, so if there indeed is a flaw, it has probably been introduced way earlier.

@amotl
Copy link
Member Author

amotl commented Nov 22, 2023

@amotl amotl changed the title Catching a CrateDB fluke: ShardCollectContext for 0 already added Catching a CrateDB fluke: ShardCollectContext for {0,2} already added Nov 24, 2023
@amotl
Copy link
Member Author

amotl commented Nov 24, 2023

Now, it is about ShardCollectContext for 2 instead of ShardCollectContext for 0.

FAILED tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_run_id - mlflow.exceptions.MlflowException: (crate.client.exceptions.ProgrammingError) SQLParseException[ShardCollectContext for 2 already added]
[SQL: DELETE FROM metrics]
(Background on this error at: https://sqlalche.me/e/20/f405)

-- https://github.com/crate-workbench/mlflow-cratedb/actions/runs/6985231736/job/19009218790?pr=69#step:6:663

@amotl
Copy link
Member Author

amotl commented Jan 8, 2024

Once more.

SQLParseException[ShardCollectContext for 0 already added] with DELETE FROM metrics.

-- https://github.com/crate-workbench/mlflow-cratedb/actions/runs/7375903029/job/20068020355?pr=81#step:6:907

@amotl
Copy link
Member Author

amotl commented Feb 5, 2024

That patch will increase CrateDB's heap size on CI, in order to explore whether the problem originates from low-memory situations.

@amotl
Copy link
Member Author

amotl commented Feb 5, 2024

[...] whether the problem originates from low-memory situations.

Indeed. Decreasing heap size triggers the issue right away. In this spirit, creating a reproducer will be much easier.

docker run --rm -it --name=cratedb \
  --publish=4200:4200 --publish=5432:5432 \
  --env=CRATE_HEAP_SIZE=256m \
  crate/crate:nightly -Cdiscovery.type=single-node
time pytest -vvv -k "test_search_runs_returns_expected_results_with_large_experiment or test_search_runs_run_id"
mlflow.exceptions.MlflowException: (crate.client.exceptions.ProgrammingError) SQLParseException[ShardCollectContext for 0 already added]
[SQL: DELETE FROM metrics]
(Background on this error at: https://sqlalche.me/e/20/f405)

/cc @BaurzhanSakhariev, @jeeminso

@amotl
Copy link
Member Author

amotl commented Feb 5, 2024

The issue has been reported to the upstream crate/crate repository.

@amotl
Copy link
Member Author

amotl commented Apr 8, 2024

Coming from crate/crate#15518 (comment), and looking at recent nightly scheduled job executions of https://github.com/crate-workbench/mlflow-cratedb/actions, it looks like this problem has been mitigated. Therefore, I am closing this.

Thanks, @jeeminso.

@amotl amotl closed this as completed Apr 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants