
Scroll SearchContextMissingException and NPE in ES logs #25820

Closed
qwerty4030 opened this issue Jul 20, 2017 · 12 comments
Labels
feedback_needed :Search/Search Search-related issues that do not fall into other categories

Comments

@qwerty4030
Contributor

Elasticsearch version: 5.3.3

Plugins installed: [discovery-ec2, repository-s3, x-pack]

JVM version : 1.8.0_131 Java HotSpot(TM) 64-Bit Server VM

OS version : 3.13.0-121-generic AWS Ubuntu 14.04 LTS

Description of the problem including expected versus actual behavior:

A scroll request in a daily scheduled job sometimes fails with [SearchContextMissingException[No search context found for id [162106278]]. I'm pretty confident our code is correct: we use the scroll id from the previous response, and the scroll timeout is set to 10 minutes. The error only occurs occasionally, within a few minutes of starting the scroll (size is 1000), by which point about 20k docs have been processed. Around the time of the error I noticed an NPE, an IAE, and GC warnings in the ES logs. The index has 10 shards and 1 replica across 10 data nodes. Also around this time we see a 60% increase in search latency (per Kibana monitoring).

Unable to reproduce, but this error has occurred 3 times in the last 7 days (the job runs once per day).
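For reference, our scroll loop follows this pattern. This is a simplified sketch: `fetch_page` stands in for the actual HTTP call to `_search/scroll` and is a hypothetical helper, not a real client API.

```python
def scroll_all(first_page, fetch_page):
    """Iterate over all hits, always passing the most recent scroll id.

    `first_page` is the response of the initial search issued with
    ?scroll=10m; `fetch_page(scroll_id)` stands in for a POST to
    _search/scroll (hypothetical helper, not a real client method).
    """
    hits = []
    page = first_page
    while page["hits"]["hits"]:
        hits.extend(page["hits"]["hits"])
        # Use the scroll id returned by the *previous* response, not the
        # original one: the id can change between requests.
        page = fetch_page(page["_scroll_id"])
    return hits
```

So each request renews the 10-minute keep-alive and carries the latest scroll id, which is why I believe the client side is correct.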

Provide logs (if relevant):

Exception from our code:

[ERROR] [19-Jul-2017 13:07:12.930] java.lang.IllegalArgumentException: response shards failed: [shard [_na], reason [RemoteTransportException[[data7][10.206.38.239:9300][indices:data/read/search[phase/query/scroll]]]; nested: SearchContextMissingException[No search context found for id [178138733]]; ], cause [SearchContextMissingException[No search context found for id [178138733]]
	at org.elasticsearch.search.SearchService.findContext(SearchService.java:438)
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:309)
	at org.elasticsearch.action.search.SearchTransportService$8.messageReceived(SearchTransportService.java:351)
	at org.elasticsearch.action.search.SearchTransportService$8.messageReceived(SearchTransportService.java:348)
...

Exceptions in ES logs (about 15s before, same logs for many/most of the other nodes):

[2017-07-19T13:06:57,579][DEBUG][o.e.a.s.TransportSearchScrollAction] [data7] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [data7][10.206.38.239:9300][indices:data/read/search[phase/fetch/id/scroll]]
Caused by: java.lang.NullPointerException
	at org.elasticsearch.search.DefaultSearchContext.isCancelled(DefaultSearchContext.java:826) ~[elasticsearch-5.3.3.jar:5.3.3]
	at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:140) ~[elasticsearch-5.3.3.jar:5.3.3]
	at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:417) ~[elasticsearch-5.3.3.jar:5.3.3]
	at org.elasticsearch.action.search.SearchTransportService$11.messageReceived(SearchTransportService.java:384) ~[elasticsearch-5.3.3.jar:5.3.3]
	at org.elasticsearch.action.search.SearchTransportService$11.messageReceived(SearchTransportService.java:381) ~[elasticsearch-5.3.3.jar:5.3.3]
...
[2017-07-19T13:06:57,582][DEBUG][o.e.a.s.TransportSearchScrollAction] [data7] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [data3][10.206.38.212:9300][indices:data/read/search[phase/fetch/id/scroll]]
Caused by: java.lang.IllegalArgumentException: unable to fetch fields from _source field: _source is disabled in the mappings for index [data-product-5]
	at org.elasticsearch.search.fetch.subphase.FetchSourceSubPhase.hitExecute(FetchSourceSubPhase.java:48) ~[elasticsearch-5.3.3.jar:5.3.3]
	at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:163) ~[elasticsearch-5.3.3.jar:5.3.3]
	at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:417) ~[elasticsearch-5.3.3.jar:5.3.3]
	at org.elasticsearch.action.search.SearchTransportService$11.messageReceived(SearchTransportService.java:384) ~[elasticsearch-5.3.3.jar:5.3.3]
	at org.elasticsearch.action.search.SearchTransportService$11.messageReceived(SearchTransportService.java:381) ~[elasticsearch-5.3.3.jar:5.3.3]
...
[2017-07-19T13:06:57,629][DEBUG][o.e.a.s.TransportSearchScrollAction] [data4] [178138733] Failed to execute query phase
org.elasticsearch.transport.RemoteTransportException: [data7][10.206.38.239:9300][indices:data/read/search[phase/query/scroll]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [178138733]
	at org.elasticsearch.search.SearchService.findContext(SearchService.java:438) ~[elasticsearch-5.3.3.jar:5.3.3]
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:309) ~[elasticsearch-5.3.3.jar:5.3.3]
	at org.elasticsearch.action.search.SearchTransportService$8.messageReceived(SearchTransportService.java:351) ~[elasticsearch-5.3.3.jar:5.3.3]
...
[2017-07-19T13:07:20,773][WARN ][o.e.m.j.JvmGcMonitorService] [data9] [gc][2070933] overhead, spent [6.9s] collecting in the last [7.4s]
@markharwood
Contributor

_source is disabled in the mappings for index [data-product-5]

This error looks significant. It's saying there was an attempt to retrieve the JSON of a matching doc, but the index being hit has explicitly disabled storage of the _source field, which contains that JSON. This is normally only done when saving space is paramount and you are willing to give up the many Elasticsearch features that rely on _source being stored. Was this a deliberate configuration on your part?
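If you want to double-check on your side: in a GET &lt;index&gt;/_mapping response, any type with _source disabled shows `"_source": {"enabled": false}`. A small sketch that scans the per-index mapping body for this (the shape assumed here is the 5.x multi-type response format):

```python
def source_disabled_types(index_body):
    """Return the type names in a per-index _mapping response body
    that have _source explicitly disabled."""
    disabled = []
    for type_name, type_mapping in index_body.get("mappings", {}).items():
        # _source defaults to enabled; only an explicit false disables it.
        if type_mapping.get("_source", {}).get("enabled") is False:
            disabled.append(type_name)
    return disabled
```

If this returns an empty list for your index, _source is not disabled in the mappings and the error would point at something else.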

@qwerty4030
Contributor Author

No, we definitely do not have _source disabled; as I mentioned, the same code works fine on other runs. I'm not sure if this helps, but this index was created in 2.x and restored via snapshot into the 5.3 cluster:

"version": {
  "created": "2040099",
  "upgraded": "5030399"
}

@markharwood
Contributor

not sure if this helps, but this index was created in 2.x and restored via snapshot in the 5.3 cluster:

Having a hard time recreating this.
I've just created a single-field, single-shard index with 100k docs in 2.4.0, snapshotted it, and restored it into 5.3.3, where I was able to scroll through all docs without an issue.
We'll need to dig deeper into your setup: if possible, please supply the index, or at least the mappings, scroll queries, and an example doc.

@imotov imotov added the :Distributed/Task Management Issues for anything around the Tasks API - both persistent and node level. label Sep 7, 2017
@imotov
Contributor

imotov commented Sep 7, 2017

There might be a related issue in https://discuss.elastic.co/t/nullpointerexception-in-scroll/98717

@qwerty4030
Contributor Author

This has occurred again a couple of times in the last few days. I noticed a different NPE in the ES logs:

[2017-09-07T13:13:09,214][DEBUG][o.e.a.s.TransportSearchScrollAction] [data3] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [data6][10.206.38.201:9300][indices:data/read/search[phase/fetch/id/scroll]]
Caused by: java.lang.NullPointerException
	at org.elasticsearch.search.fetch.subphase.MatchedQueriesFetchSubPhase.lambda$hitsExecute$0(MatchedQueriesFetchSubPhase.java:52) ~[elasticsearch-5.3.3.jar:5.3.3]
	at java.util.TimSort.countRunAndMakeAscending(TimSort.java:360) ~[?:1.8.0_131]
	at java.util.TimSort.sort(TimSort.java:234) ~[?:1.8.0_131]
	at java.util.Arrays.sort(Arrays.java:1438) ~[?:1.8.0_131]
	at org.elasticsearch.search.fetch.subphase.MatchedQueriesFetchSubPhase.hitsExecute(MatchedQueriesFetchSubPhase.java:52) ~[elasticsearch-5.3.3.jar:5.3.3]
	at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:168) ~[elasticsearch-5.3.3.jar:5.3.3]
	at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:417) ~[elasticsearch-5.3.3.jar:5.3.3]
	at org.elasticsearch.action.search.SearchTransportService$11.messageReceived(SearchTransportService.java:384) ~[elasticsearch-5.3.3.jar:5.3.3]
	at org.elasticsearch.action.search.SearchTransportService$11.messageReceived(SearchTransportService.java:381) ~[elasticsearch-5.3.3.jar:5.3.3]
	...

@imotov
Contributor

imotov commented Sep 7, 2017

This seems to be a different issue. In the first set of stack traces, it looked like the search context was disappearing. In this stack trace we are dealing with nulls in a SearchHit array. I can come up with a reasonable explanation of how the search context could disappear (some sort of race condition between context cleanup and use). However, the SearchHits array here is a local variable, and I don't see any code path that could leave nulls in it.

@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Scroll labels Feb 14, 2018
@cbuescher
Member

@qwerty4030 do you still experience this issue? It looks like we are still not able to reproduce.
/cc @elastic/es-search-aggs

@qwerty4030
Contributor Author

@cbuescher Unfortunately I don't have access to this cluster anymore. I was hoping the stack traces would be enough to figure it out. Hopefully it's fixed in a later version!

@bleskes bleskes removed the :Distributed/Task Management Issues for anything around the Tasks API - both persistent and node level. label Mar 20, 2018
@talanb

talanb commented Mar 22, 2018

@cbuescher I am the lead dev on this cluster now and we are still seeing this error. Over the last 30 days we have 81 occurrences of this exception. /cc @qwerty4030

@cbuescher
Member

The two main reasons for this exception seem to be reusing the original scroll id for every scroll request, and/or the search context timing out.
Since your initial description seems to rule those out, I'd suggest going back to @markharwood's request for your config, index mappings, scroll queries, and example documents, if possible.
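To illustrate the second cause: every scroll request renews the context's keep-alive, so the context only goes missing when the gap between consecutive requests exceeds the scroll timeout. A toy model of that server-side behavior (this is not Elasticsearch code, just a sketch of the expiry rule):

```python
class ScrollContextStore:
    """Toy model of server-side scroll contexts with keep-alive expiry."""

    def __init__(self, keep_alive_s):
        self.keep_alive_s = keep_alive_s
        self.contexts = {}  # context id -> last-touched timestamp (seconds)

    def open(self, context_id, now):
        """Register a new context, as the initial search request does."""
        self.contexts[context_id] = now

    def touch(self, context_id, now):
        """Serve a scroll request: fail (like SearchContextMissingException)
        if the context expired, otherwise renew its keep-alive."""
        last = self.contexts.get(context_id)
        if last is None or now - last > self.keep_alive_s:
            self.contexts.pop(context_id, None)
            raise KeyError("No search context found for id [%s]" % context_id)
        self.contexts[context_id] = now
```

With a 10-minute keep-alive, requests spaced even 9 minutes apart keep the context alive indefinitely; a single gap longer than 10 minutes loses it. If your client never pauses that long between scroll calls, this cause is indeed ruled out.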

@talanb

talanb commented Mar 22, 2018

@cbuescher I'll start putting that information together. One thing I noticed while researching this error this morning: at the time the error occurred, there is a gap in the monitoring graphs. I've attached two screenshots that show this. I've only checked two of the occurrences so far, but both exhibit the same gap.

The gap appears in the primary master node as well as in at least one of the data nodes. These images are for the primary master node.

(Attached screenshots: 25820_monitoring_master1, 25820_monitoring_overview)

@dliappis
Contributor

dliappis commented Jul 4, 2018

No further feedback received. @talanb / @qwerty4030, if you have the requested information, please add it in a comment and we can look at re-opening this issue.
