[CI] FeatureUpgradeIT testGetFeatureUpgradeStatus failing with IllegalArgumentException: mapping with hash [...] not found #101331

Closed
droberts195 opened this issue Oct 25, 2023 · 16 comments · Fixed by #101778
Labels
:Core/Infra/Node Lifecycle (Node startup, bootstrapping, and shutdown) · low-risk (An open issue or test failure that is a low risk to future releases) · Team:Core/Infra (Meta label for core/infra team) · >test-failure (Triaged test failures from CI)

Comments

@droberts195
Contributor

Build scan:
https://gradle-enterprise.elastic.co/s/ibqgql7fylvvw/tests/:qa:rolling-upgrade:v7.17.15%23bwcTest/org.elasticsearch.upgrades.FeatureUpgradeIT/testGetFeatureUpgradeStatus%20%7BupgradedNodes=3%7D
Reproduction line:

./gradlew ':qa:rolling-upgrade:v7.17.15#bwcTest' -Dtests.class="org.elasticsearch.upgrades.FeatureUpgradeIT" -Dtests.method="testGetFeatureUpgradeStatus {upgradedNodes=3}" -Dtests.seed=85641CEF59F01F1F -Dtests.bwc=true -Dtests.locale=ja-JP-u-ca-japanese-x-lvariant-JP -Dtests.timezone=Europe/Bratislava -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
Didn't try

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.upgrades.FeatureUpgradeIT&tests.test=testGetFeatureUpgradeStatus%20%7BupgradedNodes%3D3%7D

Failure excerpt:

java.lang.RuntimeException: An error occurred while checking cluster 'test-cluster' status.

  at __randomizedtesting.SeedInfo.seed([85641CEF59F01F1F:4A3B8B4A3EE04BD8]:0)
  at org.elasticsearch.test.cluster.local.DefaultLocalClusterHandle.waitUntilReady(DefaultLocalClusterHandle.java:188)
  at org.elasticsearch.test.cluster.local.DefaultLocalClusterHandle.upgradeNodeToVersion(DefaultLocalClusterHandle.java:151)
  at org.elasticsearch.test.cluster.local.DefaultLocalElasticsearchCluster.upgradeNodeToVersion(DefaultLocalElasticsearchCluster.java:134)
  at org.elasticsearch.upgrades.ParameterizedRollingUpgradeTestCase.upgradeNode(ParameterizedRollingUpgradeTestCase.java:126)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:980)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.elasticsearch.test.cluster.local.DefaultLocalElasticsearchCluster$1.evaluate(DefaultLocalElasticsearchCluster.java:39)
  at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)

  Caused by: java.io.IOException: 408 Request Timeout

    at org.elasticsearch.test.cluster.local.WaitForHttpResource.checkResource(WaitForHttpResource.java:129)
    at org.elasticsearch.test.cluster.local.WaitForHttpResource.waitFor(WaitForHttpResource.java:107)
    at org.elasticsearch.test.cluster.local.DefaultLocalClusterHandle.waitUntilReady(DefaultLocalClusterHandle.java:186)
    at org.elasticsearch.test.cluster.local.DefaultLocalClusterHandle.upgradeNodeToVersion(DefaultLocalClusterHandle.java:151)
    at org.elasticsearch.test.cluster.local.DefaultLocalElasticsearchCluster.upgradeNodeToVersion(DefaultLocalElasticsearchCluster.java:134)
    at org.elasticsearch.upgrades.ParameterizedRollingUpgradeTestCase.upgradeNode(ParameterizedRollingUpgradeTestCase.java:126)
    at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    at java.lang.reflect.Method.invoke(Method.java:580)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:980)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
    at org.elasticsearch.test.cluster.local.DefaultLocalElasticsearchCluster$1.evaluate(DefaultLocalElasticsearchCluster.java:39)
    at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
    at java.lang.Thread.run(Thread.java:1583)

@droberts195 added the :Search/Mapping (Index mappings, including merging and defining field types) and >test-failure (Triaged test failures from CI) labels Oct 25, 2023
@elasticsearchmachine added the blocker and Team:Search (Meta label for search team) labels Oct 25, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@droberts195
Contributor Author

The important stack trace is not the one HOMER put in the issue description, but this one:

[2023-10-25T17:08:36,592][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [test-cluster-2] fatal error in thread [elasticsearch[test-cluster-2][cluster_coordination][T#1]], exiting
java.lang.AssertionError: org.elasticsearch.gateway.CorruptStateException: org.elasticsearch.gateway.CorruptStateException: java.lang.IllegalArgumentException: mapping with hash [nta1u3NgXPKcAhx4AGeV0OW7hlyKINbUL0KI7DNvFwk=] not found
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.gateway.PersistedClusterStateService$Writer.assertOnCommit(PersistedClusterStateService.java:1245)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.gateway.PersistedClusterStateService$Writer.commit(PersistedClusterStateService.java:1235)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.gateway.PersistedClusterStateService$Writer.writeIncrementalStateAndCommit(PersistedClusterStateService.java:977)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.gateway.GatewayMetaState$LucenePersistedState.writeClusterStateToDisk(GatewayMetaState.java:600)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.gateway.GatewayMetaState$LucenePersistedState.setLastAcceptedState(GatewayMetaState.java:583)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.cluster.coordination.CoordinationState.handlePublishRequest(CoordinationState.java:392)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.cluster.coordination.Coordinator.handlePublishRequest(Coordinator.java:476)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.cluster.coordination.PublicationTransportHandler.acceptState(PublicationTransportHandler.java:214)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.cluster.coordination.PublicationTransportHandler.handleIncomingPublishRequest(PublicationTransportHandler.java:201)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.cluster.coordination.PublicationTransportHandler.lambda$new$0(PublicationTransportHandler.java:113)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.transport.InboundHandler.doHandleRequest(InboundHandler.java:288)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:301)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: org.elasticsearch.gateway.CorruptStateException: org.elasticsearch.gateway.CorruptStateException: java.lang.IllegalArgumentException: mapping with hash [nta1u3NgXPKcAhx4AGeV0OW7hlyKINbUL0KI7DNvFwk=] not found
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.gateway.PersistedClusterStateService.readXContent(PersistedClusterStateService.java:675)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.gateway.PersistedClusterStateService.lambda$loadOnDiskState$12(PersistedClusterStateService.java:623)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.gateway.PersistedClusterStateService.consumeFromType(PersistedClusterStateService.java:717)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.gateway.PersistedClusterStateService.loadOnDiskState(PersistedClusterStateService.java:622)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.gateway.PersistedClusterStateService$Writer.assertOnCommit(PersistedClusterStateService.java:1243)
	... 17 more
Caused by: org.elasticsearch.gateway.CorruptStateException: java.lang.IllegalArgumentException: mapping with hash [nta1u3NgXPKcAhx4AGeV0OW7hlyKINbUL0KI7DNvFwk=] not found
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.gateway.PersistedClusterStateService.lambda$loadOnDiskState$11(PersistedClusterStateService.java:627)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.gateway.PersistedClusterStateService.readXContent(PersistedClusterStateService.java:673)
	... 21 more
Caused by: java.lang.IllegalArgumentException: mapping with hash [nta1u3NgXPKcAhx4AGeV0OW7hlyKINbUL0KI7DNvFwk=] not found
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.cluster.metadata.IndexMetadata$Builder.fromXContent(IndexMetadata.java:2509)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.cluster.metadata.IndexMetadata.fromXContent(IndexMetadata.java:1442)
	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.gateway.PersistedClusterStateService.lambda$loadOnDiskState$11(PersistedClusterStateService.java:625)
	... 22 more

@benwtrent
Member

This failure happens during a cluster state update, so I am switching the label to Distributed.

@benwtrent added the :Distributed/Distributed (A catch all label for anything in the Distributed Area. If you aren't sure, use this one.) label and removed the :Search/Mapping (Index mappings, including merging and defining field types) and Team:Search (Meta label for search team) labels Oct 27, 2023
@elasticsearchmachine added the Team:Distributed (Meta label for distributed team) label Oct 27, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

ywangd added a commit to ywangd/elasticsearch that referenced this issue Oct 30, 2023
@ywangd
Member

ywangd commented Oct 30, 2023

It did not reproduce locally in about 30 runs. The underlying failure is

java.lang.IllegalArgumentException: mapping with hash [nta1u3NgXPKcAhx4AGeV0OW7hlyKINbUL0KI7DNvFwk=] not found

triggered by PersistedClusterStateService#assertOnCommit, which is the same failure as in #99778. Both are upgrade tests.
@williamrandolph thinks it is caused by the change in #99668 and is currently investigating it in #101016. Since the other issue is flagged as medium-risk, I am flagging this one as medium-risk too.

I also raised #101499 to log the index name when a mapping hash is not found.
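
For readers unfamiliar with the failure mode, here is a minimal sketch of how a read-back check can surface a dangling mapping hash. It assumes that the persisted state stores each mapping once, keyed by its hash, and that each index records only the hash; that layout is inferred from the exception message and stack trace, not confirmed here. The class and method names (`MappingHashLookupSketch`, `putMapping`, `putIndex`, `load`) are hypothetical and are not Elasticsearch APIs. The error text follows the form shown later in this thread, with the index name included.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a persisted cluster state: mapping sources are stored once,
// keyed by hash, and each index entry records only the hash of its mapping.
class MappingHashLookupSketch {
    private final Map<String, String> mappingsByHash = new HashMap<>();
    private final Map<String, String> mappingHashByIndex = new HashMap<>();

    void putMapping(String hash, String source) {
        mappingsByHash.put(hash, source);
    }

    void putIndex(String index, String mappingHash) {
        mappingHashByIndex.put(index, mappingHash);
    }

    // Read the state back: resolve every index's hash to a mapping source.
    // A hash with no stored mapping is reported as an error, as in the trace above.
    Map<String, String> load() {
        Map<String, String> resolved = new HashMap<>();
        for (Map.Entry<String, String> entry : mappingHashByIndex.entrySet()) {
            String source = mappingsByHash.get(entry.getValue());
            if (source == null) {
                throw new IllegalArgumentException(
                    "mapping of index [" + entry.getKey() + "] with hash [" + entry.getValue() + "] not found");
            }
            resolved.put(entry.getKey(), source);
        }
        return resolved;
    }

    public static void main(String[] args) {
        MappingHashLookupSketch state = new MappingHashLookupSketch();
        state.putMapping("hash-old", "{\"task\":{}}");
        // The index points at a mapping hash that was never written.
        state.putIndex(".tasks", "hash-new");
        try {
            state.load();
        } catch (IllegalArgumentException e) {
            // prints: mapping of index [.tasks] with hash [hash-new] not found
            System.out.println(e.getMessage());
        }
    }
}
```

The point of reading back right after a write is that the inconsistency is caught on the node that wrote it, rather than at some later restart.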

@idegtiarenko
Contributor

Should we consider this a higher priority, as it seems to prevent a node from starting after the upgrade?

@idegtiarenko added the blocker label and removed the medium-risk (An open issue or test failure that is a medium risk to future releases) label Nov 1, 2023
mark-vieira pushed a commit to mark-vieira/elasticsearch that referenced this issue Nov 2, 2023
@williamrandolph
Contributor

williamrandolph commented Nov 2, 2023

I've got a timeline from a local reproduction:

Mapping failure investigation
Nodes 0, 1, 2

15:10:12 - tests begin
[2023-11-02T15:10:17,474][INFO ][o.e.n.Node               ] [v7.17.13-0] started
[2023-11-02T15:10:17,692][INFO ][o.e.n.Node               ] [v7.17.13-1] started
[2023-11-02T15:10:18,421][INFO ][o.e.n.Node               ] [v7.17.13-2] started

15:10:19 - node 1 is master
[2023-11-02T15:10:19,668][DEBUG][o.e.c.s.MasterService    ] [v7.17.13-1] took [6ms] to compute cluster state update for [elected-as-master ([2] nodes joined)

15:11:55 - tasks index is created
[2023-11-02T15:11:55,881][TRACE][o.e.c.s.MasterService    ] [v7.17.13-1] will process [auto create [.tasks]]

15:11:58 - Node 0 is stopped and restarted
[2023-11-02T19:11:58.918622Z] [BUILD] Stopping node

15:12:40 - Node 1 is stopped and restarted
[2023-11-02T19:12:40.987057Z] [BUILD] Stopping node

15:12:41 - Node 0 elected master, .tasks index update triggered
[2023-11-02T15:12:41,148][DEBUG][o.e.c.s.MasterService    ] [v7.17.13-0] took [8ms] to compute cluster state update for [elected-as-master ([2] nodes joined in term 7)
[2023-11-02T15:12:41,254][INFO ][o.e.i.SystemIndexMappingUpdateService] [v7.17.13-0] Index [.tasks] (alias [null]) mappings are not up-to-date and will be updated

15:12:42 - Node 0 updates system index mappings
[2023-11-02T15:12:42,200][DEBUG][o.e.c.s.MasterService    ] [v7.17.13-0] executing cluster state update for [put-mapping [.tasks/5sma6F-MR0mKEnpoRN8Sag][PutMappingClusterStateUpdateTask[request=org.elasticsearch.action.admin.indices.mapping.put.PutMappingClusterStateUpdateRequest@22736966, listener=org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener/org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener/org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener/org.elasticsearch.tasks.TaskManager$1{SafelyWrappedActionListener[listener=org.elasticsearch.action.support.ContextPreservingActionListener/org.elasticsearch.indices.SystemIndexMappingUpdateService$1@2659a320]}{Task{id=1346, type='transport', action='indices:admin/mapping/put', description='', parentTask=unset, startTime=1698952361254, startTimeNanos=139856818146125}}/org.elasticsearch.action.support.master.TransportMasterNodeAction$$Lambda$6496/0x00000003021058a8@7402b8ea/org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$$Lambda$8336/0x00000003023b3380@6cde9b59/org.elasticsearch.action.admin.indices.mapping.put.TransportPutMappingAction$$Lambda$8341/0x00000003023b3be8@55dfe7fd]]]
[2023-11-02T15:12:42,205][INFO ][o.e.c.m.MetadataMappingService] [v7.17.13-0] [.tasks/5sma6F-MR0mKEnpoRN8Sag] update_mapping [task]
cluster uuid: Se98XR3wQkWurCBCQOrMEA [committed: true]
version: 246
state uuid: dbF4YgZpQfmJMZqEQfX0Kg

15:12:50 - Node 2 gets cluster state version 261
[2023-11-02T15:12:50,380][INFO ][o.e.c.s.ClusterApplierService] [v7.17.13-2] added {{v7.17.13-1}{LxBO0YqDQSSfFXxAaoVQBQ}{dyxRhb0NQq-2PgMYFlydpw}{127.0.0.1}{127.0.0.1:59892}{cdfhilmrstw}}, term: 7, version: 261, reason: ApplyCommitRequest{term=7, version=261, sourceNode={v7.17.13-0}{V-MIT5iuTOyvNLS0Z5GD4g}{j9ScnXQSQn2CHNPjyshTUg}{127.0.0.1}{127.0.0.1:59743}{cdfhilmrstw}{ml.allocated_processors_double=12.0, upgraded=true, ml.machine_memory=34359738368, xpack.installed=true, transform.config_version=10.0.0, testattr=test, ml.config_version=11.0.0, ml.max_jvm_size=536870912, ml.allocated_processors=12}}

15:12:51 - Node 1 adds updated mapping for .tasks
[2023-11-02T15:12:51,085][DEBUG][o.e.i.m.MapperService    ] [v7.17.13-1] [.tasks] [[.tasks/5sma6F-MR0mKEnpoRN8Sag]] added mapping, source [{"task":{"dynamic":"strict","_meta":{"version":"8.12.0","managed_index_mappings_version":0}

There is no similar log message for node 2.

15:13:21 - Node 2 stops and restarts
[2023-11-02T19:13:21.925557Z] [BUILD] Stopping node

15:13:33 - During startup, node 2 hits the error
[2023-11-02T15:13:33,624][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [v7.17.13-2] fatal error in thread [elasticsearch[v7.17.13-2][cluster_coordination][T#1]], exiting java.lang.AssertionError: org.elasticsearch.gateway.CorruptStateException: org.elasticsearch.gateway.CorruptStateException: java.lang.IllegalArgumentException: mapping of index [.tasks] with hash [07t3iDH1U18mgWco/xRfWlHPzxMIaLwTRh/T7NQ1ONk=] not found

Logs for all three nodes are here:
logs-101331.tar.gz

I was running ./gradlew ':qa:rolling-upgrade-legacy:v7.17.13#upgradedClusterTest' -Dtests.seed=DD0C8780632E495C -Dtests.bwc=true -Dtests.locale=sq -Dtests.timezone=Europe/Kiev -Druntime.java=20 against my branch here: #101016

I don't know why node 2 doesn't receive the mapping update from node 0. But it makes sense that #99668 would have caused this. Previously, the system index metadata upgrade service waited until all nodes in the cluster had the same version before running an update, so that mapping update never would have run before all three nodes were upgraded.

Come to think of it, the system index mapping update shouldn't be running this early either. So I think there's a bug in the system index mapping update code, and I'll look for it now.
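
To make the intended behavior concrete, here is a minimal sketch of a version-based guard, assuming the check amounts to "do nothing while nodes report different versions". The class and method names (`MixedClusterGuardSketch`, `isMixedCluster`, `shouldUpdateMappings`) and the version strings are illustrative only; this is not the actual SystemIndexMappingUpdateService code.

```java
import java.util.HashSet;
import java.util.List;

// Hypothetical guard: hold back system index mapping updates while the cluster is mixed.
class MixedClusterGuardSketch {

    // A cluster is "mixed" here when its nodes report more than one distinct version.
    static boolean isMixedCluster(List<String> nodeVersions) {
        return new HashSet<>(nodeVersions).size() > 1;
    }

    // Update only when the mappings are stale and every node is on the same version.
    static boolean shouldUpdateMappings(List<String> nodeVersions, boolean mappingsUpToDate) {
        return !mappingsUpToDate && !isMixedCluster(nodeVersions);
    }

    public static void main(String[] args) {
        // One node still on the old version: the update is held back.
        System.out.println(shouldUpdateMappings(List.of("8.12.0", "8.12.0", "7.17.13"), false)); // prints false
        // All nodes upgraded: the update may run.
        System.out.println(shouldUpdateMappings(List.of("8.12.0", "8.12.0", "8.12.0"), false)); // prints true
    }
}
```

With a guard like this, the `.tasks` mapping update at 15:12:42 in the timeline would have been deferred until node 2 had also been upgraded.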

williamrandolph added a commit to williamrandolph/elasticsearch that referenced this issue Nov 3, 2023
elastic#99668 seems to have introduced a bug where
SystemIndexMappingUpdateService updates system index mappings
even in mixed clusters. This PR restores the old version-based
check in order to be sure that there's no update until the
cluster is fully upgraded.

Fixes elastic#99778, elastic#101331
@rjernst added the :Core/Infra/Node Lifecycle (Node startup, bootstrapping, and shutdown) label and removed the :Distributed/Distributed (A catch all label for anything in the Distributed Area. If you aren't sure, use this one.) and Team:Distributed (Meta label for distributed team) labels Nov 3, 2023
@elasticsearchmachine added the Team:Core/Infra (Meta label for core/infra team) label Nov 3, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

williamrandolph added a commit that referenced this issue Nov 3, 2023
* Don't update system index mappings in mixed clusters

#99668 seems to have introduced a bug where
SystemIndexMappingUpdateService updates system index mappings
even in mixed clusters. This PR restores the old version-based
check in order to be sure that there's no update until the
cluster is fully upgraded.

The timing of the mapping update seems to be causing worse
problems, corrupting persisted cluster state.

Fixes #99778, #101331

* Remove broken assertion

The compatibility versions objects are not showing up
correctly, so we shouldn't assert on them.
williamrandolph added a commit to williamrandolph/elasticsearch that referenced this issue Nov 3, 2023
elasticsearchmachine pushed a commit that referenced this issue Nov 3, 2023
@volodk85
Contributor

@volodk85 volodk85 reopened this Dec 12, 2023
@rjernst
Member

rjernst commented Dec 13, 2023

@williamrandolph Can you please take another look?

@williamrandolph
Contributor

I'm not seeing the java.lang.IllegalArgumentException: mapping with hash [...] not found message in this most recent failure, but it's not yet clear to me what the real problem is. I'll keep looking.

@benwtrent
Member

@volodk85 unless the cause is confirmed to be the same, we shouldn't reopen test failure issues after they have been closed for a while.

@williamrandolph
Contributor

On the second node, a CorruptStateException:

[2023-12-12T17:17:00,062][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [test-cluster-2] fatal error in thread [elasticsearch[test-cluster-2][cluster_coordination][T#1]], exiting
java.lang.AssertionError: org.elasticsearch.gateway.CorruptStateException: org.elasticsearch.xcontent.XContentParseException: [-1:38420] [index_template] failed to parse field [index_template]
        at org.elasticsearch.gateway.PersistedClusterStateService$Writer.assertOnCommit(PersistedClusterStateService.java:1244) ~[elasticsearch-8.9.2.jar:?]
        at org.elasticsearch.gateway.PersistedClusterStateService$Writer.commit(PersistedClusterStateService.java:1234) ~[elasticsearch-8.9.2.jar:?]
        at org.elasticsearch.gateway.PersistedClusterStateService$Writer.writeIncrementalStateAndCommit(PersistedClusterStateService.java:976) ~[elasticsearch-8.9.2.jar:?]
        at org.elasticsearch.gateway.GatewayMetaState$LucenePersistedState.writeClusterStateToDisk(GatewayMetaState.java:581) ~[elasticsearch-8.9.2.jar:?]
        at org.elasticsearch.gateway.GatewayMetaState$LucenePersistedState.setLastAcceptedState(GatewayMetaState.java:564) ~[elasticsearch-8.9.2.jar:?]
        at org.elasticsearch.cluster.coordination.CoordinationState.handlePublishRequest(CoordinationState.java:392) ~[elasticsearch-8.9.2.jar:?]
        at org.elasticsearch.cluster.coordination.Coordinator.handlePublishRequest(Coordinator.java:463) ~[elasticsearch-8.9.2.jar:?]
        at org.elasticsearch.cluster.coordination.PublicationTransportHandler.acceptState(PublicationTransportHandler.java:210) ~[elasticsearch-8.9.2.jar:?]
        at org.elasticsearch.cluster.coordination.PublicationTransportHandler.handleIncomingPublishRequest(PublicationTransportHandler.java:197) ~[elasticsearch-8.9.2.jar:?]
        at org.elasticsearch.cluster.coordination.PublicationTransportHandler.lambda$new$0(PublicationTransportHandler.java:109) ~[elasticsearch-8.9.2.jar:?]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:74) ~[elasticsearch-8.9.2.jar:?]
        at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:315) ~[elasticsearch-8.9.2.jar:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.9.2.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.9.2.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1623) ~[?:?]

Looks like the same thing as here: #103285

Just from what Gradle reported, I think it made sense to re-open this issue. It's just that digging deeper, it turned out to be something else.

@ldematte
Contributor

This is also already tracked in #103358 (same issue)

@williamrandolph added the low-risk (An open issue or test failure that is a low risk to future releases) label and removed the blocker label Dec 14, 2023
@williamrandolph
Contributor

Since the underlying issue is tracked in a few places already, I'm going to close this issue again. We should save this one for failures with the message

java.lang.IllegalArgumentException: mapping with hash [...] not found

I'll update the title to reflect that.

@williamrandolph changed the title from "[CI] FeatureUpgradeIT testGetFeatureUpgradeStatus {upgradedNodes=3} failing" to "[CI] FeatureUpgradeIT testGetFeatureUpgradeStatus failing with IllegalArgumentException: mapping with hash [...] not found" Dec 14, 2023
@jbaiera
Member

jbaiera commented Dec 14, 2023

This might need to have the Failure Store feature flag enabled for the test clusters in the rolling upgrade. I think that #103358 is the same symptom but the solution lives in a different place than this test.
