Highlighting sometimes returns too much text (orders of magnitude larger than fragment_size) #9442

Closed
scharron opened this issue Jan 27, 2015 · 17 comments
Labels
>bug :Search/Highlighting How a query matched a document

Comments

@scharron

Index mapping:

{
    "mappings" : {
        "test": {
            "properties": {
                "content" : {
                    "type" : "string",
                    "analyzer" : "french"
                }
            }
        }
    }
}

Document inserted:
https://gist.github.com/scharron/684a4fbab85135c203ee

Query:

{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "\"Vlaams Brabant\"",
          "fields": [
            "content"
          ]
        }
      }
    }
  },
  "highlight": {
    "fragment_size": 100,
    "fields": {
      "content": {}
    }
  }
}

I would expect to get about 200 characters in 2 fragments, but instead I get 12 KB of data in a single fragment.
I also tried the postings and fast vector highlighters, with the same result.
The behaviour is the same on both ES 1.4.2 and ES 1.3.4.

It may be related to other curious behaviours: http://stackoverflow.com/questions/28167990/curious-behaviour-of-fragment-size-in-elasticsearch-highlighting?noredirect=1

clintongormley added the :Search/Highlighting How a query matched a document label Jan 27, 2015
@strouptl

I second this issue: I am also randomly getting up to 1100 characters per fragment (with fragment_size set to 300). Does anyone have any ideas?

@youxu71

youxu71 commented Jul 15, 2015

I have the same issue: I sometimes get 1000+ characters even though fragment_size is set to 200.
Any ideas?
I am using ES 1.5.2.

@monsieur-d

I second this too. This is really a problem, since it occurs unpredictably, except that it seems to occur only with search terms consisting of two or more words.
This makes the search results of a website really ugly!
When will this be accepted as a bug (I am using 1.5.2 too)?

@strouptl

strouptl commented Aug 5, 2015

Good point, @monsieur-d - @nik9000 can we get this marked as a bug like issue #12648 while you are working on highlighting issues? I would be happy to provide a curl sample!

@nik9000
Member

nik9000 commented Aug 5, 2015

"I would be happy to provide a curl sample!"

I think I got it:
https://gist.github.com/nik9000/d420c02f625a956e0faf

nik9000 added the >bug label Aug 5, 2015
@nik9000
Member

nik9000 commented Aug 5, 2015

And this shows that it's actually just a bug in the plain highlighter:
https://gist.github.com/nik9000/e02b854b67de680759de

Our old, cranky friend the fvh does the right thing here.

@monsieur-d

Very good. I'm looking forward to a fix :-)

@giles-t

giles-t commented Oct 16, 2015

Out of curiosity, are there any further updates on this issue?

@klimchuk

We're also interested in resolving this issue

@sfcgeorge

sfcgeorge commented Jul 5, 2016

I had the same issue and was not able to work around it.

  • Plain highlighter has this bug, sometimes returns too much.
  • Postings highlighter doesn't support fragment_size.
  • FVH doesn't work on nested fields ¯\⁠_(ツ)_/⁠¯

So basically fragment_size is unusable. I had to set number_of_fragments: 0 to get all the text, then implement my own highlight-aware truncating in code.

@nik9000
Member

nik9000 commented Oct 30, 2016

FVH doesn't work on nested fields ¯⁠_(ツ)_/⁠¯

You have to escape the \ or it won't show....

The issue you linked looks like it was fixed for 2.4.0 and 5.0.0. Not that this isn't still a problem, but it isn't one I've looked at for a long, long time.

@clintongormley

This appears to be fixed in 5.0

@olegskl

olegskl commented Jul 19, 2017

This doesn't seem to be fixed (using 5.5.0). I had to specify fragmenter: 'simple' to avoid the issue.

@sipingliu

Same issue in 5.4.1. Fixed it by specifying the type explicitly:

"myDataField": {
    "fragment_size": 100,
    "number_of_fragments": 1,
    "type": "plain"
}

@Garrett-R

Garrett-R commented Sep 30, 2018

Specifying 'type': 'plain' didn't work for me in 6.2.3, but following @olegskl's suggestion of putting 'fragmenter': 'simple' did solve the issue.
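
For reference, the workaround reported in the last few comments can be written out as a full request body. This is only a minimal Python sketch, not a verified fix. The function name build_highlight_request and its parameters are hypothetical. The filtered wrapper from the original query is left out on the assumption that the target is 5.x or later, where that query type no longer exists. The fragmenter option applies to the plain highlighter, and going by the reports above, whether fragmenter or type helps seems to vary by version.

```python
import json


def build_highlight_request(field, phrase, fragment_size=100,
                            fragmenter="simple", highlighter_type=None):
    """Build a search body that highlights `field` for an exact phrase.

    All names here are illustrative; only the JSON keys come from
    Elasticsearch itself. `phrase` is assumed not to contain double quotes.
    """
    field_options = {"fragmenter": fragmenter}
    if highlighter_type is not None:
        # e.g. "plain", as one commenter reported for 5.4.1
        field_options["type"] = highlighter_type

    return {
        "query": {
            "query_string": {
                # Wrap the phrase in quotes so it is searched as a phrase.
                "query": f'"{phrase}"',
                "fields": [field],
            }
        },
        "highlight": {
            "fragment_size": fragment_size,
            "fields": {field: field_options},
        },
    }


body = build_highlight_request("content", "Vlaams Brabant")
print(json.dumps(body, indent=2))
```

Passing highlighter_type="plain" adds the explicit type from the 5.4.1 report. Leaving it unset sends only the fragmenter, matching the 5.5.0 and 6.2.3 reports.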

@choco-love

I had the same issue and was not able to work around.

  • Plain highlighter has this bug, sometimes returns too much.
  • Postings highlighter doesn't support fragment_size.
  • FVH doesn't work on nested fields ¯\⁠_(ツ)_/⁠¯

So basically fragment_size is unusable. I had to set number_of_fragments: 0 to get all the text, then implement my own highlight-aware truncating in code.

How did you fix your problem?

@sfcgeorge

"I had to set number_of_fragments: 0 to get all the text, then implement my own highlight-aware truncating in code."

You need to write application code to do the truncation. It's a terrible workaround because you'll be getting far more text over the network than you need. The code will depend on the language you're using but the algorithm is effectively:

  • Scan for <em>s, if there are none then just truncate as usual.
  • If there are <em>s:
    • Scan up to the first one.
    • Go back a few chars to provide context around the highlight (if there is any text before the highlight).
    • Truncate by character count, but subtract any <em> and </em> tags that would be included, since they're invisible. This is the tricky part.
    • After truncating, count the opening <em> and closing </em> tags; if a closing tag is missing because you've chopped it off, add it at the end.
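
The steps above could be sketched in Python roughly as follows. This is one possible reading of the algorithm, not the code the commenter actually used. The function name truncate_highlighted and the max_visible and context parameters are hypothetical. It assumes the default <em> and </em> highlight tags, well-formed tag nesting, and no other markup in the field text.

```python
import re

# Matches the default highlight tags only; custom pre/post tags are not handled.
TAG_RE = re.compile(r"</?em>")


def truncate_highlighted(text, max_visible=100, context=20):
    """Return at most `max_visible` visible characters around the first highlight.

    Tags do not count towards the limit, and a highlight that gets cut
    in the middle is closed again at the end.
    """
    first = text.find("<em>")
    if first == -1:
        # No highlight at all: truncate as usual.
        return text[:max_visible]

    # Step back a little to keep some context before the first highlight.
    pos = max(0, first - context)
    out = []
    visible = 0
    open_tags = 0

    while pos < len(text) and visible < max_visible:
        tag = TAG_RE.match(text, pos)
        if tag:
            # Copy the tag through without counting it as visible text.
            open_tags += 1 if tag.group(0) == "<em>" else -1
            out.append(tag.group(0))
            pos = tag.end()
        else:
            out.append(text[pos])
            visible += 1
            pos += 1

    # Re-close any highlight that was chopped off by the truncation.
    if open_tags > 0:
        out.append("</em>" * open_tags)
    return "".join(out)


print(truncate_highlighted("xx<em>Vlaams Brabant</em> tail", max_visible=8, context=2))
# prints: xx<em>Vlaams</em>
```

The sketch walks the string one character at a time rather than slicing, so a cut can never land inside a tag. It does not address the network cost mentioned above, since the full field still has to be fetched with number_of_fragments: 0.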
