Highlight suggestions #3442

nik9000 · 2013-08-05T14:43:06Z

I think it'd be cool if the suggestion API returned the suggestions in a tokenized form so consumers could easily highlight the changed tokens. This particular tokenization would have to ignore any shingle filters to in the analyzer because we'd end up highlighting whole changed phrases and that really isn't how users think.

I'm thinking it should work like this:

curl -XDELETE "http://localhost:9200/test?pretty"
curl -XPOST "http://localhost:9200/test?pretty" -d '{
  "settings": {
    "index": {
      "number_of_replicas": 0
      "analysis":{
        "analyzer":{
          "suggest":{
            "type": "custom",
            "tokenizer": "standard",
            "filter": [ "standard", "lowercase", "suggest_shingle" ]
          }
        },
        "filter":{
          "suggest_shingle":{
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 5,
            "output_unigrams": true
          }
        }
      }
    }
  }
}'
curl -XPOST "http://localhost:9200/test/test/_mapping?pretty" -d '{
  "test":{
    "foo":{
      "analyzer":"suggest"
    }
  }
}'

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true&wait_for_status=yellow'

curl -XPOST "http://localhost:9200/test/test?pretty" -d '{
  "foo": "I love shingles for suggestions"
}'

curl -XPOST "http://localhost:9200/test/test?pretty" -d '{
  "foo": "but not for finding differences"
}'

sleep 1

curl -XGET "http://localhost:9200/test/test/_search?pretty" -d '{
  "query": {
    "match_all": {}
  },
  "suggest": {
    "text": "findning differrences",
    "phrase":{
      "phrase":{
        "field": "foo",
        "max_errors": 5
      }
    }
  }
}'

The final request would return something like this:

{
  "took" : 39,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 2,
    "failed" : 3,
    "failures" : [ {
      "status" : 400,
      "reason" : "ElasticSearchIllegalArgumentException[generator field [foo] doesn't exist]"
    }, {
      "status" : 400,
      "reason" : "ElasticSearchIllegalArgumentException[generator field [foo] doesn't exist]"
    }, {
      "status" : 400,
      "reason" : "ElasticSearchIllegalArgumentException[generator field [foo] doesn't exist]"
    } ]
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test",
      "_type" : "test",
      "_id" : "JBXVvQv6QUCHsUAc6XGvRg",
      "_score" : 1.0, "_source" : {
  "foo": "I love shingles for suggestions"
}
    }, {
      "_index" : "test",
      "_type" : "test",
      "_id" : "VnE5ZKNNRp2V_mRKqASnYQ",
      "_score" : 1.0, "_source" : {
  "foo": "but not for finding differences"
}
    } ]
  },
  "suggest" : {
    "phrase" : [ {
      "text" : "differrences",
      "offset" : 0,
      "length" : 12,
      "options" : [ {
        "text" : "finding differences",
        "score" : 0.49144784,
        "changes" : [ {
          "text" : "findning",
          "is" : "finding",
          "from" : 0,
          "to" : 8
        }, {
          "text" : "differrences",
          "is" : "differences",
          "from" : 9,
          "to" : 21
        } ]
      }, {
        "text" : "findning differences",
        "score" : 0.3803136,
        "changes" : [ {
          "text" : "differrences",
          "is" : "differences",
          "from" : 9,
          "to" : 21
        } ]
      }, {
        "text" : "finding differrences",
        "score" : 0.37071815,
        "changes" : [ {
          "text" : "findning",
          "is" : "finding",
          "from" : 0,
          "to" : 8
        } ]
      } ]
    } ]
  }
}

s1monw · 2013-08-05T16:22:19Z

Hey @nik9000 thanks for opening this issue. I'm not sure if this is the right approach to go here since pure tokenization not always what you want. I think we should rather provide an API that provides a highlighted version of the corrected string or it should return plain offsets. Yet, why I think we should not plain tokenize is that if you have synonyms etc. in pre or post filters tokens will not correspond to actual corrected substrings.

The way I think we should do that is to provide pre and post tokens similar to the highlight API and return a "highlighted" version of the string. I think the big advantage of this is that you can display it directly to the user and I don't wanna overcomplicate the response format. We can also return actual offsets which will require some processing in the client which I would want to prevent. Yet, if you wanna take a look at implementing this, we are maintaining the splitted tokens until we are done and join them together. we can add a flag to Correction and highlight if it's a replacement?

nik9000 · 2013-08-05T17:50:50Z

I agree with your point about the API. Elasticsearch should do the highlighting.
I agree with your point about synonyms but won't using the Corrections cause trouble with shingles? If the whole search is corrected using a single shingle (which I think is likely for me) then we'd mark the whole thing as changed which isn't really useful.

s1monw · 2013-08-05T18:09:31Z

I agree with your point about synonyms but won't using the Corrections cause trouble with shingles? If the whole search is corrected using a single shingle (which I think is likely for me) then we'd mark the whole thing as changed which isn't really useful.

internally we use shingles or ngrams to lookup the frequencies but those are only in the index and the bigrams / trigrams that are build from the query are build at runtime and not from the shingle filter so that each Candidate corresponds to a unigram. I will look closer into this tomorrow and provide some examples

s1monw · 2013-08-06T12:08:35Z

Hey @nik9000 I think we can add this in a backwards compatible way by simply using a custom join function on Correction what we need is support for pre_tag & post_tag on the request side something like

"suggest": {
    "text": "findning differrences",
    "phrase":{
      "phrase":{
        "field": "foo",
        "max_errors": 5
        "highlight" : {
          "pre_tag" : "<b>",
          "post_tag" : "</b>"
        }
      }
    }
  }

so if highlight is set we will return the result highlighted like this:

"options" : [ {
        "text" : "<b>finding differences</b>",
        "score" : 0.49144784,

does this make sense? I think what we need here is a notion of a boolean userInput; in the Candidate class such that we can differentiate if this is actually a term that got suggested or not. We can easily make this distinction in NoisyChannelSpellChecker line 96. In Line 82 we actually check if it is a shingle or not and if so we ignore it.

nik9000 · 2013-08-06T15:03:07Z

@s1monw I think I'd prefer to return the unlighted version next to the highlighted version. That way I don't have to re-strip the highlight components when building the suggestion query.

Here is an updated curl testing script:

curl -XDELETE "http://localhost:9200/test?pretty"
curl -XPOST "http://localhost:9200/test?pretty" -d '{
  "settings": {
    "index": {
      "number_of_replicas": 0
      "analysis":{
        "analyzer":{
          "suggest":{
            "type": "custom",
            "tokenizer": "standard",
            "filter": [ "standard", "lowercase", "suggest_shingle" ]
          }
        },
        "filter":{
          "suggest_shingle":{
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 5,
            "output_unigrams": true
          }
        }
      }
    }
  }
}'
curl -XPOST "http://localhost:9200/test/test/_mapping?pretty" -d '{
  "test":{
    "foo":{
      "analyzer":"suggest"
    }
  }
}'

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true&wait_for_status=yellow'

curl -XPOST "http://localhost:9200/test/test?pretty" -d '{
  "foo": "I love shingles for suggestions"
}'

curl -XPOST "http://localhost:9200/test/test?pretty" -d '{
  "foo": "but not for finding differences"
}'

sleep 1

curl -XGET "http://localhost:9200/test/test/_search?pretty" -d '{
  "query": {
    "match_all": {}
  },
  "suggest": {
    "text": "findning differrences",
    "phrase":{
      "phrase":{
        "field": "foo",
        "max_errors": 5
      }
    }
  }
}'



curl -XGET "http://localhost:9200/test/test/_search?pretty" -d '{
  "query": {
    "match_all": {}
  },
  "suggest": {
    "text": "findning differrences",
    "phrase":{
      "phrase":{
        "field": "foo",
        "max_errors": 5,
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}'

The first doesn't have highlighting enabled so the suggest section should look exactly like a search without this feature:

  "suggest" : {
    "phrase" : [ {
      "text" : "findning differrences",
      "offset" : 0,
      "length" : 21,
      "options" : [ {
        "text" : "finding differences",
        "score" : 0.49144784
      }, {
        "text" : "findning differences",
        "score" : 0.3803136
      }, {
        "text" : "finding differrences",
        "score" : 0.37071815
      } ]
    } ]

and the second search should have highlighting like this:

  "suggest" : {
    "phrase" : [ {
      "text" : "findning differrences",
      "offset" : 0,
      "length" : 21,
      "options" : [ {
        "text" : "finding differences",
        "highlighted" : "<em>finding</em> <em>differences</em>",
        "score" : 0.49144784
      }, {
        "text" : "findning differences",
        "highlighted" : "findning <em>differences</em>",
        "score" : 0.3803136
      }, {
        "text" : "finding differrences",
        "highlighted" : "<em>finding</em> differrences",
        "score" : 0.37071815
      } ]
    } ]
  }

s1monw · 2013-08-06T15:08:41Z

oh that is an very good reason for having it side by side to the original suggestion....! agreed!

s1monw · 2013-08-06T19:03:57Z

@nik9000 I changed your commit message a bit and running tests now. I will push this very shortly! Unfortunately we can't add this to 0.90 at this point since it breaks the wire format. It's a pity though. But anyway thanks so much for adding this!

This commit adds general highlighting support to the suggest feature. The only implementation that implements this functionality at this point is the phrase suggester. The API supports a 'pre_tag' and a 'post_tag' that are used to wrap suggested parts of the given user input changed by the suggester. Closes #3442

s1monw · 2013-08-07T08:29:53Z

Backported to 0.90 with version compatibility on the transport protocol. I kept the version checks for master (1.0) for now to have the same code in master and 0.90 but that doesn't mean a major version upgrade from 0.90 to 1.0 doesn't require a full cluster restart

This commit adds general highlighting support to the suggest feature. The only implementation that implements this functionality at this point is the phrase suggester. The API supports a 'pre_tag' and a 'post_tag' that are used to wrap suggested parts of the given user input changed by the suggester. Closes elastic#3442

nik9000 mentioned this issue Aug 5, 2013

Analyze suggestions #3443

Closed

ghost assigned s1monw Aug 5, 2013

nik9000 mentioned this issue Aug 6, 2013

Add highlighting support for suggester. #3451

Closed

s1monw closed this as completed in 72d6d82 Aug 6, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Highlight suggestions #3442

Highlight suggestions #3442

nik9000 commented Aug 5, 2013

s1monw commented Aug 5, 2013

nik9000 commented Aug 5, 2013

s1monw commented Aug 5, 2013

s1monw commented Aug 6, 2013

nik9000 commented Aug 6, 2013

s1monw commented Aug 6, 2013

s1monw commented Aug 6, 2013

s1monw commented Aug 7, 2013

Highlight suggestions #3442

Highlight suggestions #3442

Comments

nik9000 commented Aug 5, 2013

s1monw commented Aug 5, 2013

nik9000 commented Aug 5, 2013

s1monw commented Aug 5, 2013

s1monw commented Aug 6, 2013

nik9000 commented Aug 6, 2013

s1monw commented Aug 6, 2013

s1monw commented Aug 6, 2013

s1monw commented Aug 7, 2013