Error on MoreLikeThis API with Non Stored Numeric Fields #3252

Closed
lmenezes opened this Issue Jun 27, 2013 · 10 comments

3 participants
Contributor

lmenezes commented Jun 27, 2013

According to the documentation:

Note: In order to use the mlt feature a mlt_field needs to be either be stored, store term_vector or source needs to be enabled.

But running this:

curl -XPOST http://localhost:9200/foo
curl -XPUT http://localhost:9200/foo/bar/_mapping -d '{ "bar": { "dynamic": "strict", "properties": { "id": { "type": "integer", "index": "not_analyzed" }, "content": { "type": "string", "analyzer": "standard" }}}}'

curl -XPUT http://localhost:9200/foo/bar/1 -d '{"id":1, "content":"foo bar foo2 bar2 foo3 bar3"}'
curl -XPUT http://localhost:9200/foo/bar/2 -d '{"id":2, "content":"foo3 bar3 foo4 bar4"}'


curl -XGET 'http://localhost:9200/foo/bar/1/_mlt?mlt_fields=content&min_term_freq=1&min_doc_freq=1'
curl -XGET 'http://localhost:9200/foo/bar/1/_mlt?min_term_freq=1&min_doc_freq=1'

the second query fails with:
{"error":"MapperParsingException[failed to parse [id]]; nested: ElasticSearchIllegalStateException[Field should have either a string, numeric or binary value]; ","status":400}

This happens because here (for example):
https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/index/mapper/core/IntegerFieldMapper.java#L356-L360

The numeric value is not actually used unless the field is stored.

Then here:
https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/action/mlt/TransportMoreLikeThisAction.java#L293-L303

if the value can't be read, it just throws an exception.

Contributor Author

lmenezes commented Jul 2, 2013

Any comments on this? I could work around it by storing the fields (even though that's not ideal), but it would be nice for this to work without having to reindex.

@ghost ghost assigned jpountz Jul 2, 2013

Member

clintongormley commented Jul 2, 2013

@lmenezes @jpountz says that he will have a look at it

Contributor Author

lmenezes commented Jul 2, 2013

@clintongormley cool :)

Contributor

jpountz commented Jul 3, 2013

@lmenezes You are right about why you got this error, but unfortunately setting the value of the field instance even when the field is not stored won't work. The reason is that Lucene's MoreLikeThis can only work on top of character token streams, and numeric fields are encoded as binary token streams.

This issue is very similar to #3211, where we decided to ignore numeric fields when performing highlighting in order to match Elasticsearch 0.20 behavior. Maybe we should do the same here? @clintongormley what do you think?

Member

clintongormley commented Jul 3, 2013

My feeling is that the more_like_this functionality is about finding string terms in common, rather than numeric similarity, so I would agree with you on ignoring non-string fields. Numeric similarity implies a different type of comparison, which would usually be better handled by a specific clause outside the mlt query.

If you want to treat numbers as "full text" then you can always use a multi_field to index them both as numbers and as strings.

So ++ for ignoring non-strings, I'd say.
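
A minimal sketch of the multi_field suggestion above, assuming the `foo`/`bar` index and type from the original report. The sub-field name `as_string` is made up for illustration, and the mapping follows the `multi_field` syntax of Elasticsearch releases from that period. It only builds the JSON body; sending it (for example with `curl -XPUT http://localhost:9200/foo/bar/_mapping -d ...`) is left out.

```python
import json

# Hypothetical mapping for the "bar" type from the report above.
# "id" is indexed twice: as an integer (the default sub-field, which shares
# the parent's name) and as a not_analyzed string under the illustrative
# name "as_string".
mapping = {
    "bar": {
        "dynamic": "strict",
        "properties": {
            "id": {
                "type": "multi_field",
                "fields": {
                    "id": {"type": "integer"},
                    "as_string": {"type": "string", "index": "not_analyzed"},
                },
            },
            "content": {"type": "string", "analyzer": "standard"},
        },
    }
}

# JSON body for the put-mapping request.
mapping_body = json.dumps(mapping)
print(mapping_body)
```

With a mapping along these lines, the string sub-field (presumably addressed as `id.as_string`) could be listed in `mlt_fields` next to `content`, while the integer field stays available for range queries and sorting.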

Contributor Author

lmenezes commented Jul 3, 2013

@jpountz @clintongormley I don't really agree: if the numbers are ids for some kind of relation, they represent similarity as well as, or even better than, matching tokens. But if it's a Lucene limitation, ignoring is definitely better than failing. Still, it would be nice to have this working on numeric fields (I guess this affects everything that is internally stored as a number, like ips?).
But yeah, ignoring is ok.

Contributor

jpountz commented Jul 3, 2013

@lmenezes This is correct, the limitation is in Lucene and this affects everything which is stored as a number, so byte, short, integer, long, float and double but also ips and dates. There might be options to support numbers in the future but right now I think the best fix to apply is to ignore numeric data from the mlt fields.

Contributor Author

lmenezes commented Jul 3, 2013

@jpountz cool, waiting for the fix then :)

jpountz added a commit to jpountz/elasticsearch that referenced this issue Jul 15, 2013

Add the ability to ignore or fail on numeric fields when executing more-like-this or fuzzy-like-this queries.

More-like-this and fuzzy-like-this queries expect analyzers which are able to
generate character terms (CharTermAttribute), so unfortunately this doesn't
work with analyzers which generate binary-only terms (BinaryTermAttribute,
the default CharTermAttribute impl being a special BinaryTermAttribute) such as
our analyzers for numeric fields (byte, short, integer, long, float, double but
also date and ip).

To work around this issue, this commit adds a fail_on_unsupported_field
parameter to the more-like-this and fuzzy-like-this parsers. When this parameter
is false, numeric fields will just be ignored and when it is true, an error will
be returned, saying that these queries don't support numeric fields. By default,
this setting is true but the mlt API sets it to false in order not to fail on
documents which contain numeric fields.

Close elastic#3252
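
As an illustration of the parameter described in the commit message, here is a minimal sketch of a more_like_this search body that opts out of the failure. The helper name `mlt_query_body` and the sample text are assumptions for this example; `fields`, `like_text`, `min_term_freq` and `min_doc_freq` are the usual more_like_this query parameters, and `fail_on_unsupported_field` is the one this commit introduces.

```python
import json


def mlt_query_body(fields, like_text, fail_on_unsupported_field=True):
    """Build a more_like_this search body (hypothetical helper).

    The default mirrors the commit message: the query fails on numeric
    fields unless fail_on_unsupported_field is explicitly set to False.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like_text": like_text,
                "min_term_freq": 1,
                "min_doc_freq": 1,
                "fail_on_unsupported_field": fail_on_unsupported_field,
            }
        }
    }


# "id" is the integer field from the report; with the flag set to False it
# should simply be skipped instead of causing an error.
body = mlt_query_body(["id", "content"], "foo3 bar3",
                      fail_on_unsupported_field=False)
print(json.dumps(body, indent=2))
```

The printed JSON could then be posted to the index's `_search` endpoint.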
Contributor

jpountz commented Jul 15, 2013

The mlt API uses the mlt query, so I updated the pull request:

  • the mlt API doesn't fail even if one of the fields of the document is numeric,
  • mlt and flt queries fail if any of the fields is numeric,
  • the new fail_on_unsupported_field parameter (defaults to true) allows for ignoring numeric fields instead of raising an error when set to false.
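
The three behaviours listed above can be modelled roughly as follows. This is a simplified sketch, not the actual Elasticsearch code: the function names, the exception class, and the idea of passing a plain name-to-type dictionary are all assumptions made for illustration.

```python
# Field types that are indexed as numbers, per the discussion in this issue.
NUMERIC_TYPES = {"byte", "short", "integer", "long",
                 "float", "double", "date", "ip"}


class UnsupportedFieldError(ValueError):
    """Raised when a numeric field is used and failing is enabled."""


def select_mlt_fields(field_types, fail_on_unsupported_field=True):
    """Return the field names usable by mlt/flt (hypothetical model).

    field_types maps field name -> mapping type. Numeric fields raise an
    error by default and are skipped when the flag is False.
    """
    supported = []
    for name, field_type in field_types.items():
        if field_type in NUMERIC_TYPES:
            if fail_on_unsupported_field:
                raise UnsupportedFieldError(
                    "mlt/flt queries do not support numeric field [%s]" % name)
            continue  # ignore the numeric field
        supported.append(name)
    return supported


def mlt_api_fields(field_types):
    """Model of the mlt API, which turns the failure off."""
    return select_mlt_fields(field_types, fail_on_unsupported_field=False)


print(mlt_api_fields({"id": "integer", "content": "string"}))  # ['content']
```

Calling `select_mlt_fields` on the same dictionary with the default flag raises `UnsupportedFieldError`, which corresponds to the query-level behaviour described in the second bullet.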
Contributor Author

lmenezes commented Jul 16, 2013

sounds good 👍

@jpountz jpountz closed this in ffcc710 Jul 16, 2013

jpountz added a commit that referenced this issue Jul 16, 2013

Add the ability to ignore or fail on numeric fields when executing more-like-this or fuzzy-like-this queries.

Close #3252

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015

Add the ability to ignore or fail on numeric fields when executing more-like-this or fuzzy-like-this queries.

Close elastic#3252