Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in _termvectors with artificial document? #21906

Closed
shaie opened this Issue Dec 1, 2016 · 9 comments

Comments

Projects
None yet
3 participants
@shaie
Copy link
Contributor

shaie commented Dec 1, 2016

I am using the _termvectors API with an artificial document, and I hit an inconsistent behavior, depending on which shard the request hits. Below I attach reproducing steps, but let me first explain what I do: I create an index with two shards and index one document with one field and the value "one one". When I request the term vectors of that document with term_statistics, I get ttf=2 as expected. At this point I know that one shard contains the document, and the other shard has no documents at all.

When I submit a _termvectors request with an artificial document with the field text:one, then if the request hits the shard that holds the indexed document, I get back ttf=2 which is expected. But if it hits the shard without any documents, it returns ttf=1 which is confusing.

More so, if I send an artificial document with the field text:two (NOTE: two does not exist in any of the shards!), then the response is similar -- if the query hits the shard without any documents, it returns ttf=1, while if it hits the shard with the one indexed document, it does not return ttf at all (which is expected).

Here are the steps to reproduce:

Create the index

{
  "mappings": {
    "doc": {
      "properties": {
        "text" : {
          "type": "text"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 2,
      "number_of_replicas" : 0
    }
  }
}'

Index one document

curl -XPUT 'http://localhost:9200/tv_bug/doc/1?pretty=true' -d '{
  "text" : "one one"
}'

Verify the document exists in one shard

curl -XGET 'http://localhost:9200/tv_bug/doc/_search?preference=_shards:0&pretty=true'
curl -XGET 'http://localhost:9200/tv_bug/doc/_search?preference=_shards:1&pretty=true'

One of those requests will return 0 hits, the other 1 (in my case shard 0 did not return the hit)

Get the Term Vectors of that one document

curl -XGET 'http://localhost:9200/tv_bug/doc/1/_termvectors?pretty=true' -d '{
  "fields": ["text"],
  "term_statistics" : true
}'

Get the TV of 'one' using an artificial document

From shard 0

curl -XGET 'http://localhost:9200/tv_bug/doc/_termvectors?preference=_shards:0&pretty=true' -d '{
  "term_statistics" : true,
  "doc" : {
    "text" : "one"
  }
}'

From shard 1

curl -XGET 'http://localhost:9200/tv_bug/doc/_termvectors?preference=_shards:1&pretty=true' -d '{
  "term_statistics" : true,
  "doc" : {
    "text" : "one"
  }
}'

In my case, shard 0 returns ttf=1 and shard 1 returns ttf=2 (expected). Also, if you send the word "two" instead, you will see that shard 1 does not return ttf at all (expected), but shard 0 returns ttf=1.

BTW, and this may be a different bug report, when I send the TV requests with the preference parameter, I sometimes receive this error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "remote_transport_exception",
        "reason" : "[C-N1Zbn][127.0.0.1:9300][indices:data/read/tv[s]]"
      }
    ],
    "type" : "null_pointer_exception",
    "reason" : null
  },
  "status" : 500
}

Submitting the request multiple times eventually succeeds.

@shaie

This comment has been minimized.

Copy link
Contributor Author

shaie commented Dec 1, 2016

Oh, forgot to mention that I use 5.0.1.

@shaie

This comment has been minimized.

Copy link
Contributor Author

shaie commented Dec 1, 2016

Here's the NPE stacktrace from the console BTW:

[2016-12-01T11:59:13,208][WARN ][r.suppressed             ] path: /tv_bug/doc/_termvectors, params: {pretty=true, preference=_shards:1, index=tv_bug, type=doc}
org.elasticsearch.transport.RemoteTransportException: [C-N1Zbn][127.0.0.1:9300][indices:data/read/tv[s]]
Caused by: java.lang.NullPointerException
        at org.elasticsearch.action.termvectors.TransportTermVectorsAction.shardOperation(TransportTermVectorsAction.java:79) ~[elasticsearch-5.0.1.jar:5.0.1]
        at org.elasticsearch.action.termvectors.TransportTermVectorsAction.shardOperation(TransportTermVectorsAction.java:42) ~[elasticsearch-5.0.1.jar:5.0.1]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:293) ~[elasticsearch-5.0.1.jar:5.0.1]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:286) ~[elasticsearch-5.0.1.jar:5.0.1]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.0.1.jar:5.0.1]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.0.1.jar:5.0.1]
        at org.elasticsearch.transport.TransportService$6.doRun(TransportService.java:548) [elasticsearch-5.0.1.jar:5.0.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:520) [elasticsearch-5.0.1.jar:5.0.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.0.1.jar:5.0.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_92]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_92]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_92]
@shaie

This comment has been minimized.

Copy link
Contributor Author

shaie commented Dec 1, 2016

After reviewing ES code, I believe I found the issue, in TermVectorsWriter.setFields() (line 72)

            // if no terms found, take the retrieved term vector fields for stats
            if (topLevelTerms == null) {
                topLevelTerms = fieldTermVector;
            }

Since the shard has no documents indexed, it finds no terms for the artificial document's field, and therefore uses the doc's TV as the terms iterator.

So if I send:

$ curl -XGET 'http://localhost:9200/tv_bug/doc/_termvectors?preference=_shards:0&pretty=true' -d '{
  "term_statistics" : true,
  "doc" : {
    "text" : "one one one"
  }
}'

I get ttf=3, which supports the observation above.

I still think it's a bug, in that it's OK to receive the artificial doc's TV, but I don't expect term_statistics to use the doc's stats as what's in the index?

@clintongormley

This comment has been minimized.

Copy link
Member

clintongormley commented Dec 1, 2016

Hi @shaie

Thanks for the report and for digging into the code to find the issue. Would you be up for sending a PR to fix this?

Also, if you don't mind, I'd be interested to hear how what you're using the term vectors API for?

@shaie

This comment has been minimized.

Copy link
Contributor Author

shaie commented Dec 1, 2016

Thanks for the quick response @clintongormley. As for what I use it for, see this discussion that I started https://discuss.elastic.co/t/terms-stats-api/67508 and this feature request #21886. Basically I want to get terms statistics (currently for re-ranking capabilities, and also at the moment outside of 'scripting') and the lack of API got me to try TV and artificial documents, where I send the list of terms I wish to get stats for as an artificial document, to all the shards.

I would love to send a PR, but I'll need to do some work to setup a dev environment. I.e. I don't have an ES fork, not even Gradle installed 😄 , and so I cannot run ES tests, build the code etc. I will start doing that though, cause I wish to be able to contribute to the code (e.g. maybe this terms_stats API), so it seems worth it. In the meanwhile, I think perhaps if you/we implement something like an EmptyTermsEnum which assumes there are no terms to iterate on, and set topLevelTerms to such instance, the rest of the code would just work. But, if I want to test it, I need to setup the dev environment...

Also, what about that NPE? Do you prefer I report that in a separate bug report?

@clintongormley

This comment has been minimized.

Copy link
Member

clintongormley commented Dec 1, 2016

but I'll need to do some work to setup a dev environment. I.e. I don't have an ES fork, not even Gradle installed

just a few easy clicks away :)

Also, what about that NPE? Do you prefer I report that in a separate bug report?

No, a drive-by-fix would be fine :)

@shaie

This comment has been minimized.

Copy link
Contributor Author

shaie commented Dec 1, 2016

just a few easy clicks away :)

Indeed. I've got my fork and gradle set up. Now figuring out what do I need to run to test it actually works :).

No, a drive-by-fix would be fine :)

Oh well, I don't know the cause for that yet ;)

shaie added a commit to shaie/elasticsearch that referenced this issue Dec 1, 2016

Return correct term statistics whem a field is not found in a shard
If you ask for the term vectors of an artificial document with
term_statistics=true, but a shard does not have any terms of the doc's
field(s), it returns the doc's term vectors values as the shard-level
term statistics. This commit fixes that to return 0 for ttf and also
field-level aggregated statistics.

This closes elastic#21906
@shaie

This comment has been minimized.

Copy link
Contributor Author

shaie commented Dec 1, 2016

@clintongormley I pushed a PR #21922 which fixes the term_statistics bug. I haven't yet addressed the NPE bug as I don't yet know what causes it. I think this should go under a separate commit though. Would appreciate your review and feedback on the changes.

@s1monw s1monw closed this in #21922 Dec 2, 2016

s1monw added a commit that referenced this issue Dec 2, 2016

Return correct term statistics when a field is not found in a shard (#…
…21922)

If you ask for the term vectors of an artificial document with
term_statistics=true, but a shard does not have any terms of the doc's
field(s), it returns the doc's term vectors values as the shard-level
term statistics. This commit fixes that to return 0 for `ttf` and also
field-level aggregated statistics.

Closes #21906

s1monw added a commit that referenced this issue Dec 2, 2016

Return correct term statistics when a field is not found in a shard (#…
…21922)

If you ask for the term vectors of an artificial document with
term_statistics=true, but a shard does not have any terms of the doc's
field(s), it returns the doc's term vectors values as the shard-level
term statistics. This commit fixes that to return 0 for `ttf` and also
field-level aggregated statistics.

Closes #21906

s1monw added a commit that referenced this issue Dec 2, 2016

Return correct term statistics when a field is not found in a shard (#…
…21922)

If you ask for the term vectors of an artificial document with
term_statistics=true, but a shard does not have any terms of the doc's
field(s), it returns the doc's term vectors values as the shard-level
term statistics. This commit fixes that to return 0 for `ttf` and also
field-level aggregated statistics.

Closes #21906

s1monw added a commit that referenced this issue Dec 2, 2016

Return correct term statistics when a field is not found in a shard (#…
…21922)

If you ask for the term vectors of an artificial document with
term_statistics=true, but a shard does not have any terms of the doc's
field(s), it returns the doc's term vectors values as the shard-level
term statistics. This commit fixes that to return 0 for `ttf` and also
field-level aggregated statistics.

Closes #21906

@s1monw s1monw removed the help wanted label Dec 2, 2016

@shaie

This comment has been minimized.

Copy link
Contributor Author

shaie commented Dec 2, 2016

I created #21928 to continue tracking the NPE issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.