Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Connection issues in BasicDistributedJobsIT #32196

Closed
cbuescher opened this issue Jul 19, 2018 · 6 comments

Comments

@cbuescher
Copy link
Member

commented Jul 19, 2018

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-windows-compatibility/1767/consoleFull

Does not reproduce with:

REPRODUCE WITH: gradlew :x-pack:plugin:ml:internalClusterTest \
  -Dtests.seed=98AB1BDCBBCE6890 \
  -Dtests.class=org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT \
  -Dtests.security.manager=true \
  -Dtests.locale=en-US \
  -Dtests.timezone=UTC

REPRODUCE WITH: gradlew :x-pack:plugin:ml:internalClusterTest \
  -Dtests.seed=98AB1BDCBBCE6890 \
  -Dtests.class=org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT \
  -Dtests.security.manager=true \
  -Dtests.locale=en-US \
  -Dtests.timezone=UTC

I found #29897 but this looks different in that I don't see an assertion failure in the logs. Rather something in the test cluster formation seems to be getting off track. Cluster starts up, then one node is shut down ([2018-07-19T10:11:05,813][INFO ][o.e.t.InternalTestCluster] Closing random node [node_t0], looks like this is done on purpose). After that I see socket failures and refused connections. Not sure if they are related to the test running of some general test cluster problem:

15:12:15   1> java.net.SocketException: Socket is closed
15:12:15   1> 	at java.net.Socket.getOutputStream(Socket.java:943) ~[?:1.8.0_172]
15:12:15   1> 	at org.elasticsearch.transport.MockTcpTransport$MockChannel.sendMessage(MockTcpTransport.java:412) [framework-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
15:12:15   1> 	at org.elasticsearch.transport.TcpTransport.internalSendMessage(TcpTransport.java:1163) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
15:12:15   1> 	at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1252) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
15:12:15   1> 	at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1220) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
15:12:15   1> 	at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:66) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
15:12:15   1> 	at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:60) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
15:12:15   1> 	at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
15:12:15   1> 	at org.elasticsearch.discovery.zen.MembershipAction$LeaveRequestRequestHandler.messageReceived(MembershipAction.java:287) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
15:12:15   1> 	at org.elasticsearch.discovery.zen.MembershipAction$LeaveRequestRequestHandler.messageReceived(MembershipAction.java:282) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
15:12:15   1> 	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
15:12:15   1> 	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1679) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
15:12:15   1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
15:12:15   1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
15:12:15   1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
15:12:15   1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
15:12:15   1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
15:12:15   1> [2018-07-19T10:11:05,835][WARN ][o.e.t.MockTcpTransport   ] [node_t3] send message failed [channel: MockChannel{profile='default', isOpen=false, localAddress=/127.0.0.1:56400, isServerSocket=false}]
@elasticmachine

This comment has been minimized.

Copy link
Collaborator

commented Jul 19, 2018

@davidkyle

This comment has been minimized.

Copy link
Member

commented Jul 19, 2018

15:12:15 1> java.net.SocketException: Socket is closed

The test is an evil test that randomly takes down nodes in the cluster to assert the ml jobs relocate correctly so these log messages are expected.

The test failed because of zombie threads leaking

2> SEVERE: There are still zombie threads that couldn't be terminated:
2> 1) Thread[id=262, name=elasticsearch[node_t0][ml_autodetect][T#1], state=RUNNABLE, group=TGRP-BasicDistributedJobsIT]

2> INFO: Starting to interrupt leaked threads:
...
at org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectCommunicator.waitFlushToCompletion(AutodetectCommunicator.java:282)

I see why we get the zombie threads because the interrupt is caught here returning null which causes this loop to continue.

Essentially the interrupt is ignored and we can fix that but the underlying question is why was the thread still running when the job should have been closed at the end of the test.

benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Aug 1, 2018
benwtrent added a commit that referenced this issue Aug 1, 2018
@benwtrent

This comment has been minimized.

Copy link
Member

commented Aug 1, 2018

#32541 Addressed the zombie thread issue, but the root cause of the job being closed not causing it to completely die is still not known.

More data is required, and hopefully the aforementioned PR will provide some of that data if we see this test failure again.

dnhatn added a commit that referenced this issue Aug 2, 2018
Merge branch '6.x' into ccr-6.x
* 6.x:
  Fix scriptdocvalues tests with dates
  Correct minor typo in explain.asciidoc for HLRC
  Fix painless whitelist and warnings from backporting #31441
  Build: Add elastic maven to repos used by BuildPlugin (#32549)
  Scripting: Conditionally use java time api in scripting (#31441)
  [ML] Improve error when no available field exists for rule scope (#32550)
  [ML] Improve error for functions with limited rule condition support (#32548)
  [ML] Remove multiple_bucket_spans
  [ML] Fix thread leak when waiting for job flush (#32196) (#32541)
  Painless: Clean Up PainlessField (#32525)
  Add @AwaitsFix for #32554
  Remove broken @link in Javadoc
  Add AwaitsFix to failing test - see #32546
  SQL: Added support for string manipulating functions with more than one parameter (#32356)
  [DOCS] Reloadable Secure Settings (#31713)
  Fix compilation error introduced by #32339
  [Rollup] Remove builders from TermsGroupConfig (#32507)
  Use hostname instead of IP with SPNEGO test (#32514)
  Switch x-pack rolling restart to new style Requests (#32339)
  [DOCS] Small fixes in rule configuration page (#32516)
  Painless: Clean up PainlessMethod (#32476)
  SQL: Add test for handling of partial results (#32474)
  Docs: Add missing migration doc for logging change
  Build: Remove shadowing from benchmarks (#32475)
  Docs: Add all JDKs to CONTRIBUTING.md
  Logging: Make node name consistent in logger (#31588)
  High-level client: fix clusterAlias parsing in SearchHit (#32465)
  REST high-level client: parse back _ignored meta field (#32362)
  backport fix of reduceRandom fix (#32508)
  Add licensing enforcement for FIPS mode (#32437)
  INGEST: Clean up Java8 Stream Usage (#32059) (#32485)
  Improve the error message when an index is incompatible with field aliases. (#32482)
  Mute testFilterCacheStats
  Scripting: Fix painless compiler loader to know about context classes (#32385)
  [ML][DOCS] Fix typo applied_to => applies_to
  Mute SSLTrustRestrictionsTests on JDK 11
  Changed ReindexRequest to use Writeable.Reader (#32401)
  Increase max chunk size to 256Mb for repo-azure (#32101)
  Mute KerberosAuthenticationIT
  fix no=>not typo (#32463)
  HLRC: Add delete watch action (#32337)
  Fix calculation of orientation of polygons (#27967)
  [Kerberos] Add missing javadocs (#32469)
  Fix missing JavaDoc for @throws in several places in KerberosTicketValidator.
  Make get all app privs requires "*" permission (#32460)
  Ensure KeyStoreWrapper decryption exceptions are handled (#32472)
  update rollover to leverage write-alias semantics (#32216)
  [Kerberos] Remove Kerberos bootstrap checks (#32451)
  Switch security to new style Requests (#32290)
  Switch security spi example to new style Requests (#32341)
  Painless: Add PainlessConstructor (#32447)
  Update Fuzzy Query docs to clarify default behavior re max_expansions (#30819)
  Remove > from Javadoc (fatal with Java 11)
  Tests: Fix convert error tests to use fixed value (#32415)
  IndicesClusterStateService should replace an init. replica with an init. primary with the same aId (#32374)
  auto-interval date histogram - 6.x backport (#32107)
  [CI] Mute DocumentSubsetReaderTests testSearch
  [TEST] Mute failing InternalEngineTests#testSeqNoAndCheckpoints
  TEST: testDocStats should always use forceMerge (#32450)
  TEST: Avoid deletion in FlushIT
  AwaitsFix IndexShardTests#testDocStats
  Painless: Add method type to method. (#32441)
  Remove reference to non-existent store type (#32418)
  [TEST] Mute failing FlushIT test
  Fix ordering of bootstrap checks in docs (#32417)
  Wrong discovery.type for azure in breaking changes (#32432)
  Mute ConvertProcessorTests failing tests
  TESTS: Move netty leak detection to paranoid level (#32354) (#32425)
  Upgrade to Lucene-7.5.0-snapshot-608f0277b0 (#32390)
  [Kerberos] Avoid vagrant update on precommit (#32416)
  TEST: Avoid triggering merges in FlushIT
  [DOCS] Fixes formatting of scope object in job resource
  Switch x-pack/plugin to new style Requests (#32327)
  Release requests in cors handle (#32410)
  Remove BouncyCastle dependency from runtime (#32402)
  Copy missing segment attributes in getSegmentInfo (#32396)
  Rest HL client: Add put license action (#32214)
  Docs: Correcting a typo in tophits (#32359)
  Build: Stop double generating buildSrc pom (#32408)
  Switch x-pack full restart to new style Requests (#32294)
  Painless: Clean Up PainlessClass Variables (#32380)
  [ML] Consistent pattern for strict/lenient parser names (#32399)
  Add Restore Snapshot High Level REST API
  Update update-settings.asciidoc (#31378)
  Introduce index store plugins (#32375)
  Rank-Eval: Reduce scope of an unchecked supression
  Make sure _forcemerge respects `max_num_segments`. (#32291)
dnhatn added a commit that referenced this issue Aug 3, 2018
Merge branch 'master' into ccr
* master:
  HLRC: Move commercial clients from XPackClient (#32596)
  Add cluster UUID to Cluster Stats API response (#32206)
  Security: move User to protocol project (#32367)
  [TEST] Test for shard failures, add debug to testProfileMatchesRegular
  Minor fix for javadoc (applicable for java 11). (#32573)
  Painless: Move Some Lookup Logic to PainlessLookup (#32565)
  TEST: Avoid merges in testSeqNoAndCheckpoints
  [Rollup] Remove builders from HistoGroupConfig (#32533)
  Mutes failing SQL string function tests due to #32589
  fixed elements in array of produced terms (#32519)
  INGEST: Enable default pipelines (#32286)
  Remove cluster state initial customs (#32501)
  Mutes LicensingDocumentationIT due to #32580
  [ML] Remove multiple_bucket_spans (#32496)
  [ML] Rename JobProvider to JobResultsProvider (#32551)
  Correct minor typo in explain.asciidoc for HLRC
  Build: Add elastic maven to repos used by BuildPlugin (#32549)
  Clarify the error message when a pipeline agg is used in the 'order' parameter. (#32522)
  Revert "[test] turn on host io cache for opensuse (#32053)"
  Enable packaging tests on suse boxes
  [ML] Improve error when no available field exists for rule scope (#32550)
  [ML] Improve error for functions with limited rule condition support (#32548)
  Painless: Clean Up PainlessField (#32525)
  Add @AwaitsFix for #32554
  Remove broken @link in Javadoc
  Scripting: Conditionally use java time api in scripting (#31441)
  [ML] Fix thread leak when waiting for job flush (#32196) (#32541)
  Add AwaitsFix to failing test - see #32546
  Core: Minor size reduction for AbstractComponent (#32509)
  SQL: Added support for string manipulating functions with more than one parameter (#32356)
  [DOCS] Reloadable Secure Settings (#31713)
  Watcher: Reenable HttpSecretsIntegrationTests#testWebhookAction test (#32456)
  [Rollup] Remove builders from TermsGroupConfig (#32507)
  Use hostname instead of IP with SPNEGO test (#32514)
  Switch x-pack rolling restart to new style Requests (#32339)
  NETWORKING: Fix Netty Leaks by upgrading to 4.1.28 (#32511)
  [DOCS] Small fixes in rule configuration page (#32516)
  Painless: Clean up PainlessMethod (#32476)
  Build: Remove shadowing from benchmarks (#32475)
  Docs: Add all JDKs to CONTRIBUTING.md
  Add licensing enforcement for FIPS mode (#32437)
  SQL: Add test for handling of partial results (#32474)
  Mute testFilterCacheStats
  [ML][DOCS] Fix typo applied_to => applies_to
  Scripting: Fix painless compiler loader to know about context classes (#32385)
@DaveCTurner

This comment has been minimized.

Copy link
Contributor

commented Oct 25, 2018

@benwtrent I've assigned this to you (we're trying to cut down on unassigned test failure issues) but think this can probably be closed until it fails again.

@benwtrent

This comment has been minimized.

Copy link
Member

commented Oct 25, 2018

@DaveCTurner Cool! I will Just close it then :D

@benwtrent benwtrent closed this Oct 25, 2018

@davidkyle

This comment has been minimized.

Copy link
Member

commented Oct 26, 2018

I suspect the close job request in the teardown section timed out due to a slow machine/GC pause then there was a race between the job closing and the test framework interrupting the active threads. It hasn't happened since and if it does occur again we can consider increasing the timeout

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.