Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Highlight suggestions #3442

Closed
nik9000 opened this issue Aug 5, 2013 · 8 comments
Closed

Highlight suggestions #3442

nik9000 opened this issue Aug 5, 2013 · 8 comments

Comments

@nik9000
Copy link
Member

nik9000 commented Aug 5, 2013

I think it'd be cool if the suggestion API returned the suggestions in a tokenized form so consumers could easily highlight the changed tokens. This particular tokenization would have to ignore any shingle filters to in the analyzer because we'd end up highlighting whole changed phrases and that really isn't how users think.

I'm thinking it should work like this:

curl -XDELETE "http://localhost:9200/test?pretty"
curl -XPOST "http://localhost:9200/test?pretty" -d '{
  "settings": {
    "index": {
      "number_of_replicas": 0
      "analysis":{
        "analyzer":{
          "suggest":{
            "type": "custom",
            "tokenizer": "standard",
            "filter": [ "standard", "lowercase", "suggest_shingle" ]
          }
        },
        "filter":{
          "suggest_shingle":{
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 5,
            "output_unigrams": true
          }
        }
      }
    }
  }
}'
curl -XPOST "http://localhost:9200/test/test/_mapping?pretty" -d '{
  "test":{
    "foo":{
      "analyzer":"suggest"
    }
  }
}'

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true&wait_for_status=yellow'

curl -XPOST "http://localhost:9200/test/test?pretty" -d '{
  "foo": "I love shingles for suggestions"
}'

curl -XPOST "http://localhost:9200/test/test?pretty" -d '{
  "foo": "but not for finding differences"
}'

sleep 1

curl -XGET "http://localhost:9200/test/test/_search?pretty" -d '{
  "query": {
    "match_all": {}
  },
  "suggest": {
    "text": "findning differrences",
    "phrase":{
      "phrase":{
        "field": "foo",
        "max_errors": 5
      }
    }
  }
}'

The final request would return something like this:

{
  "took" : 39,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 2,
    "failed" : 3,
    "failures" : [ {
      "status" : 400,
      "reason" : "ElasticSearchIllegalArgumentException[generator field [foo] doesn't exist]"
    }, {
      "status" : 400,
      "reason" : "ElasticSearchIllegalArgumentException[generator field [foo] doesn't exist]"
    }, {
      "status" : 400,
      "reason" : "ElasticSearchIllegalArgumentException[generator field [foo] doesn't exist]"
    } ]
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test",
      "_type" : "test",
      "_id" : "JBXVvQv6QUCHsUAc6XGvRg",
      "_score" : 1.0, "_source" : {
  "foo": "I love shingles for suggestions"
}
    }, {
      "_index" : "test",
      "_type" : "test",
      "_id" : "VnE5ZKNNRp2V_mRKqASnYQ",
      "_score" : 1.0, "_source" : {
  "foo": "but not for finding differences"
}
    } ]
  },
  "suggest" : {
    "phrase" : [ {
      "text" : "differrences",
      "offset" : 0,
      "length" : 12,
      "options" : [ {
        "text" : "finding differences",
        "score" : 0.49144784,
        "changes" : [ {
          "text" : "findning",
          "is" : "finding",
          "from" : 0,
          "to" : 8
        }, {
          "text" : "differrences",
          "is" : "differences",
          "from" : 9,
          "to" : 21
        } ]
      }, {
        "text" : "findning differences",
        "score" : 0.3803136,
        "changes" : [ {
          "text" : "differrences",
          "is" : "differences",
          "from" : 9,
          "to" : 21
        } ]
      }, {
        "text" : "finding differrences",
        "score" : 0.37071815,
        "changes" : [ {
          "text" : "findning",
          "is" : "finding",
          "from" : 0,
          "to" : 8
        } ]
      } ]
    } ]
  }
}
@ghost ghost assigned s1monw Aug 5, 2013
@s1monw
Copy link
Contributor

s1monw commented Aug 5, 2013

Hey @nik9000 thanks for opening this issue. I'm not sure if this is the right approach to go here since pure tokenization not always what you want. I think we should rather provide an API that provides a highlighted version of the corrected string or it should return plain offsets. Yet, why I think we should not plain tokenize is that if you have synonyms etc. in pre or post filters tokens will not correspond to actual corrected substrings.

The way I think we should do that is to provide pre and post tokens similar to the highlight API and return a "highlighted" version of the string. I think the big advantage of this is that you can display it directly to the user and I don't wanna overcomplicate the response format. We can also return actual offsets which will require some processing in the client which I would want to prevent. Yet, if you wanna take a look at implementing this, we are maintaining the splitted tokens until we are done and join them together. we can add a flag to Correction and highlight if it's a replacement?

@nik9000
Copy link
Member Author

nik9000 commented Aug 5, 2013

I agree with your point about the API. Elasticsearch should do the highlighting.
I agree with your point about synonyms but won't using the Corrections cause trouble with shingles? If the whole search is corrected using a single shingle (which I think is likely for me) then we'd mark the whole thing as changed which isn't really useful.

@s1monw
Copy link
Contributor

s1monw commented Aug 5, 2013

I agree with your point about synonyms but won't using the Corrections cause trouble with shingles? If the whole search is corrected using a single shingle (which I think is likely for me) then we'd mark the whole thing as changed which isn't really useful.

internally we use shingles or ngrams to lookup the frequencies but those are only in the index and the bigrams / trigrams that are build from the query are build at runtime and not from the shingle filter so that each Candidate corresponds to a unigram. I will look closer into this tomorrow and provide some examples

@s1monw
Copy link
Contributor

s1monw commented Aug 6, 2013

Hey @nik9000 I think we can add this in a backwards compatible way by simply using a custom join function on Correction what we need is support for pre_tag & post_tag on the request side something like

"suggest": {
    "text": "findning differrences",
    "phrase":{
      "phrase":{
        "field": "foo",
        "max_errors": 5
        "highlight" : {
          "pre_tag" : "<b>",
          "post_tag" : "</b>"
        }
      }
    }
  }

so if highlight is set we will return the result highlighted like this:

"options" : [ {
        "text" : "<b>finding differences</b>",
        "score" : 0.49144784,

does this make sense? I think what we need here is a notion of a boolean userInput; in the Candidate class such that we can differentiate if this is actually a term that got suggested or not. We can easily make this distinction in NoisyChannelSpellChecker line 96. In Line 82 we actually check if it is a shingle or not and if so we ignore it.

@nik9000
Copy link
Member Author

nik9000 commented Aug 6, 2013

@s1monw I think I'd prefer to return the unlighted version next to the highlighted version. That way I don't have to re-strip the highlight components when building the suggestion query.

Here is an updated curl testing script:

curl -XDELETE "http://localhost:9200/test?pretty"
curl -XPOST "http://localhost:9200/test?pretty" -d '{
  "settings": {
    "index": {
      "number_of_replicas": 0
      "analysis":{
        "analyzer":{
          "suggest":{
            "type": "custom",
            "tokenizer": "standard",
            "filter": [ "standard", "lowercase", "suggest_shingle" ]
          }
        },
        "filter":{
          "suggest_shingle":{
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 5,
            "output_unigrams": true
          }
        }
      }
    }
  }
}'
curl -XPOST "http://localhost:9200/test/test/_mapping?pretty" -d '{
  "test":{
    "foo":{
      "analyzer":"suggest"
    }
  }
}'

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true&wait_for_status=yellow'

curl -XPOST "http://localhost:9200/test/test?pretty" -d '{
  "foo": "I love shingles for suggestions"
}'

curl -XPOST "http://localhost:9200/test/test?pretty" -d '{
  "foo": "but not for finding differences"
}'

sleep 1

curl -XGET "http://localhost:9200/test/test/_search?pretty" -d '{
  "query": {
    "match_all": {}
  },
  "suggest": {
    "text": "findning differrences",
    "phrase":{
      "phrase":{
        "field": "foo",
        "max_errors": 5
      }
    }
  }
}'



curl -XGET "http://localhost:9200/test/test/_search?pretty" -d '{
  "query": {
    "match_all": {}
  },
  "suggest": {
    "text": "findning differrences",
    "phrase":{
      "phrase":{
        "field": "foo",
        "max_errors": 5,
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}'

The first doesn't have highlighting enabled so the suggest section should look exactly like a search without this feature:

  "suggest" : {
    "phrase" : [ {
      "text" : "findning differrences",
      "offset" : 0,
      "length" : 21,
      "options" : [ {
        "text" : "finding differences",
        "score" : 0.49144784
      }, {
        "text" : "findning differences",
        "score" : 0.3803136
      }, {
        "text" : "finding differrences",
        "score" : 0.37071815
      } ]
    } ]

and the second search should have highlighting like this:

  "suggest" : {
    "phrase" : [ {
      "text" : "findning differrences",
      "offset" : 0,
      "length" : 21,
      "options" : [ {
        "text" : "finding differences",
        "highlighted" : "<em>finding</em> <em>differences</em>",
        "score" : 0.49144784
      }, {
        "text" : "findning differences",
        "highlighted" : "findning <em>differences</em>",
        "score" : 0.3803136
      }, {
        "text" : "finding differrences",
        "highlighted" : "<em>finding</em> differrences",
        "score" : 0.37071815
      } ]
    } ]
  }

@s1monw
Copy link
Contributor

s1monw commented Aug 6, 2013

oh that is an very good reason for having it side by side to the original suggestion....! agreed!

@s1monw
Copy link
Contributor

s1monw commented Aug 6, 2013

@nik9000 I changed your commit message a bit and running tests now. I will push this very shortly! Unfortunately we can't add this to 0.90 at this point since it breaks the wire format. It's a pity though. But anyway thanks so much for adding this!

@s1monw s1monw closed this as completed in 72d6d82 Aug 6, 2013
s1monw pushed a commit that referenced this issue Aug 7, 2013
This commit adds general highlighting support to the suggest feature.
The only implementation that implements this functionality at this
point is the phrase suggester.
The API supports a 'pre_tag' and a 'post_tag' that are used
to wrap suggested parts of the given user input changed by the
suggester.

Closes #3442
@s1monw
Copy link
Contributor

s1monw commented Aug 7, 2013

Backported to 0.90 with version compatibility on the transport protocol. I kept the version checks for master (1.0) for now to have the same code in master and 0.90 but that doesn't mean a major version upgrade from 0.90 to 1.0 doesn't require a full cluster restart

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
This commit adds general highlighting support to the suggest feature.
The only implementation that implements this functionality at this
point is the phrase suggester.
The API supports a 'pre_tag' and a 'post_tag' that are used
to wrap suggested parts of the given user input changed by the
suggester.

Closes elastic#3442
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants