testDataFrameTransformCrud failed with "Failed to retrieve checkpointing info" #44011

Closed
DaveCTurner opened this issue Jul 5, 2019 · 3 comments · Fixed by #44058
Assignees
Labels
:ml/Transform Transform >test-failure Triaged test failures from CI

Comments

@DaveCTurner
Contributor

I see (very occasional) failures of integTestRunner org.elasticsearch.xpack.dataframe.integration.DataFrameTransformIT.testDataFrameTransformCrud.

The exception looks like this:

org.elasticsearch.ElasticsearchStatusException: Unable to parse response body
at __randomizedtesting.SeedInfo.seed([D2EC9756C9B4990F:6F9BE6D8A6D3A141]:0)
at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:1701)
at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1461)
at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1433)
at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1400)
at org.elasticsearch.client.DataFrameClient.getDataFrameTransformStats(DataFrameClient.java:105)
at org.elasticsearch.xpack.dataframe.integration.DataFrameIntegTestCase.getDataFrameTransformStats(DataFrameIntegTestCase.java:121)
at org.elasticsearch.xpack.dataframe.integration.DataFrameIntegTestCase.lambda$waitUntilCheckpoint$0(DataFrameIntegTestCase.java:135)
at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:888)
at org.elasticsearch.xpack.dataframe.integration.DataFrameIntegTestCase.waitUntilCheckpoint(DataFrameIntegTestCase.java:134)
at org.elasticsearch.xpack.dataframe.integration.DataFrameIntegTestCase.waitUntilCheckpoint(DataFrameIntegTestCase.java:130)
at org.elasticsearch.xpack.dataframe.integration.DataFrameTransformIT.testDataFrameTransformCrud(DataFrameTransformIT.java:65)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.client.ResponseException: 
method [GET], host [http://[::1]:34681], URI [/_data_frame/transforms/data-frame-transform-crud/_stats], status line [HTTP/1.1 500 Internal Server Error]
{"node_failures":[{"type":"failed_node_exception","reason":"Failed to retrieve checkpointing info","node_id":"ClCfyxTvT-udgCmvKlEPFg","caused_by":{"type":"checkpoint_exception","reason":"checkpoint_exception: Failure during source checkpoint info retrieval","caused_by":{"type":"null_pointer_exception","reason":null}}}],"count":1,"transforms":[{"id":"data-frame-transform-crud","state":{"task_state":"started","indexer_state":"started","checkpoint":0,"node":{"id":"ClCfyxTvT-udgCmvKlEPFg","name":"integTest-1","ephemeral_id":"xrmt63p0S8KyN28YbIpPrg","transport_address":"127.0.0.1:35199","attributes":{}}},"stats":{"pages_processed":0,"documents_processed":0,"documents_indexed":0,"trigger_count":1,"index_time_in_ms":0,"index_total":0,"index_failures":0,"search_time_in_ms":0,"search_total":0,"search_failures":0},"checkpointing":{"operations_behind":0}}]}

Apparently you can ...

REPRODUCE WITH: ./gradlew :x-pack:plugin:data-frame:qa:multi-node-tests:integTestRunner --tests "org.elasticsearch.xpack.dataframe.integration.DataFrameTransformIT.testDataFrameTransformCrud" -Dtests.seed=ED02F21FF219DD2A -Dtests.security.manager=true -Dtests.locale=he -Dtests.timezone=America/Indiana/Knox -Dcompiler.java=12 -Druntime.java=8

... but I had no success doing so.

@DaveCTurner DaveCTurner added >test-failure Triaged test failures from CI :ml/Transform Transform labels Jul 5, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@droberts195
Contributor

"Failed to retrieve checkpointing info" is probably related to #43992 (comment). But then it appears that there's an additional problem with the HLRC when that happens, presumably because the stats object returned doesn't have the expected structure in this case.

@davidkyle davidkyle self-assigned this Jul 5, 2019
@davidkyle
Member

The 'Failed to retrieve checkpointing info' error is an outstanding issue; the problem here is the failure to parse the response.

The response object extends BaseTasksResponse, and the data frame client knows how to parse that. The problem is in logic I added, which says that if there are task or node failures (even if some of the tasks responded correctly) then a 500 status code is returned. The REST client expects some kind of parseable exception in this case, but it gets something that extends BaseTasksResponse and doesn't know how to read it.

The options are to always return a code in the 2xx range and let the data frame client read any errors, or to add GetDataFrameTransformsStatsAction.Response as a named xcontent readable by the client. No doubt returning a 500 code made sense at the time, but looking elsewhere in the code, that is not the pattern used. I'll change the response to be 200, which conforms with other usages.
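For illustration, the two behaviours described above can be contrasted in a minimal, self-contained sketch. The names `TasksResponseSketch`, `statusOld` and `statusNew` are hypothetical stand-ins, not the actual Elasticsearch classes or methods; the sketch only models the status-code decision, under the assumption that a response carries a list of node failures and a list of task failures.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a response that extends BaseTasksResponse.
// It models only the two failure lists, not the real Elasticsearch class.
final class TasksResponseSketch {
    final List<String> nodeFailures;
    final List<String> taskFailures;

    TasksResponseSketch(List<String> nodeFailures, List<String> taskFailures) {
        this.nodeFailures = nodeFailures;
        this.taskFailures = taskFailures;
    }

    // Old behaviour as described in this thread: any failure turns the whole
    // response into a 500, even when other tasks responded correctly.
    static int statusOld(TasksResponseSketch r) {
        boolean failed = !r.nodeFailures.isEmpty() || !r.taskFailures.isEmpty();
        return failed ? 500 : 200;
    }

    // Proposed behaviour: always 200; failures stay in the body for the
    // client to inspect.
    static int statusNew(TasksResponseSketch r) {
        return 200;
    }

    public static void main(String[] args) {
        TasksResponseSketch partial = new TasksResponseSketch(
            Collections.singletonList("Failed to retrieve checkpointing info"),
            Collections.<String>emptyList());
        System.out.println("old=" + statusOld(partial) + " new=" + statusNew(partial));
    }
}
```

With the second variant the body keeps its normal shape regardless of failures, so a client that already knows how to parse the response can read the failure lists itself.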

Best also modify the test to check for node and task failures.
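A check of that kind might look roughly like the sketch below. `StatsResponseSketch` and `FailureChecks.assertNoFailures` are hypothetical names used only for illustration; the real test would use the client's own response type and the test framework's assertions.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical model of a stats response; only the failure lists are represented.
final class StatsResponseSketch {
    final List<String> nodeFailures;
    final List<String> taskFailures;

    StatsResponseSketch(List<String> nodeFailures, List<String> taskFailures) {
        this.nodeFailures = nodeFailures;
        this.taskFailures = taskFailures;
    }
}

final class FailureChecks {
    // Throws if the response reports any node or task failure, so that a
    // 200 status on its own is not treated as success.
    static void assertNoFailures(StatsResponseSketch r) {
        if (!r.nodeFailures.isEmpty()) {
            throw new AssertionError("node failures: " + r.nodeFailures);
        }
        if (!r.taskFailures.isEmpty()) {
            throw new AssertionError("task failures: " + r.taskFailures);
        }
    }

    public static void main(String[] args) {
        // A response with no failures passes the check silently.
        assertNoFailures(new StatsResponseSketch(
            Collections.<String>emptyList(), Collections.<String>emptyList()));
        System.out.println("no failures reported");
    }
}
```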

davidkyle added a commit that referenced this issue Jul 8, 2019
Data frame task responses had logic to return an HTTP 500 status code if there were
any node or task failures, even if other tasks in the same request reported correctly.
This is different to how other task responses are handled, where a 200 is always
returned, leaving the client to check for failures. Returning a 500 also breaks
the high level rest client, so always return a 200.

Closes #44011

4 participants