Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the postings highlighter and make unified the default highlighter choice #25028

Merged
merged 3 commits into from Jun 9, 2017

Conversation

Projects
None yet
5 participants
@jimczi
Copy link
Member

commented Jun 2, 2017

This change removes the postings highlighter. This highlighter has been removed from Lucene master (7.x) because it behaves
exactly like the unified highlighter when index_options is set to offsets:
https://issues.apache.org/jira/browse/LUCENE-7815

It also makes the unified highlighter the default choice for highlighting a field (if type is not provided).
The strategy used internally by this highlighter remain the same as before, it checks term_vectors first, then postings and ultimately it re-analyzes the text.
This change also rewrites the docs so that the options that the unified highlighter cannot handle are clearly marked as such.
There are few features that the unified highlighter is not able to handle which is why the other highlighters (plain and fvh) are still available.
I'll open separate issues for these features and we'll deprecate the fvh and plain highlighters when full support for these features have been added to the unified.

@@ -98,3 +98,14 @@ but the only reason why it has not been deprecated too is because it is used
for the `random_score` function. If you really need access to the id of
documents for sorting, aggregations or search scripts, the recommandation is
to duplicate the id as a field in the document.

==== Highlighers

This comment has been minimized.

Copy link
@clintongormley

clintongormley Jun 2, 2017

Member

missing t

@clintongormley
Copy link
Member

left a comment

Added a few docs comments

==== Highlighers

The `unified` highlighter is the new default choice for highlighter.
The offset strategy for each field is picked internally by this highlighter depending on the

This comment has been minimized.

Copy link
@clintongormley

clintongormley Jun 2, 2017

Member

What do you mean by the offset strategy?

docs/reference/search/request/highlighting.asciidoc Outdated
Allows to highlight search results on one or more fields. The
implementation uses either the lucene `plain` highlighter, the
fast vector highlighter (`fvh`) or `postings` highlighter.
Allows to highlight search results on one or more fields.

This comment has been minimized.

Copy link
@clintongormley

clintongormley Jun 2, 2017

Member

"Highlighters allow you to produce highlighted snippets from one or more fields in your search results."

docs/reference/search/request/highlighting.asciidoc Outdated

The default choice of highlighter is of type `plain` and uses the Lucene highlighter.
It tries hard to reflect the query matching logic in terms of understanding word importance and any word positioning criteria in phrase queries.
The default choice of highlighter is of type `unified` and uses the Lucene Unified highlighter.

This comment has been minimized.

Copy link
@clintongormley

clintongormley Jun 2, 2017

Member

The unified highlighter (which is used by default if no highlighter type is specified) uses the Lucene Unified Highlighter.

docs/reference/search/request/highlighting.asciidoc Outdated
It also supports accurate phrase and multi-term (fuzzy, prefix, regex) highlighting.

[float]
===== Offsets Strategy

This comment has been minimized.

Copy link
@clintongormley

clintongormley Jun 2, 2017

Member

Offsets Strategy requires some explanation, perhaps:

In order to create meaningful search snippets from the terms being queried, a highlighter needs to know the start and end character offsets of each word in the original text. These offsets can be obtained from:

  • The postings list (fields mapped as "index_options": "offsets").
  • Term vectors (fields mapped as "term_vectors": "with_positions_offsets").
  • The original field, by reanalysing the text on-the-fly.
@nik9000

nik9000 approved these changes Jun 2, 2017

Copy link
Contributor

left a comment

I left a few minors but LGTM. I guess next is to deprecate the postings highlighter in 5.x, right?

core/src/test/java/org/elasticsearch/search/fetch/subphase/highlight/HighlighterSearchIT.java Outdated
.endObject().endObject().endObject()
.endObject().endObject().endObject()));
.addMapping("type1", jsonBuilder().startObject().startObject("type1").startObject("properties")
// we don't store title and don't use term vector, now lets see if it works...

This comment has been minimized.

Copy link
@nik9000

nik9000 Jun 2, 2017

Contributor

Can you keep the indentation?

This comment has been minimized.

Copy link
@jimczi

jimczi Jun 2, 2017

Author Member

Sorry about that :(

core/src/test/java/org/elasticsearch/search/fetch/subphase/highlight/HighlighterSearchIT.java Outdated
.endObject().endObject().endObject()));
.addMapping("type1", jsonBuilder().startObject().startObject("type1").startObject("properties")
// we don't store title, now lets see if it works...
.startObject("title")

This comment has been minimized.

Copy link
@nik9000

nik9000 Jun 2, 2017

Contributor

Here too.

docs/reference/search/request/highlighting.asciidoc Outdated
will use this information to highlight documents without re-analyzing the text.
It re-runs the original query directly on the postings and extracts the matching offsets
directly from the index limiting the collection to the highlighted documents.
This mode is faster since it doesn't require to reanalyze the text to be highlighted

This comment has been minimized.

Copy link
@nik9000

nik9000 Jun 2, 2017

Contributor

I expect it isn't actually faster for very short strings. At least, that was my experience with the experimental highlighter.

docs/reference/search/request/highlighting.asciidoc Outdated
@@ -814,6 +833,8 @@ to

[[phrase-limit]]
==== Phrase Limit

WARNING:

This comment has been minimized.

Copy link
@nik9000

nik9000 Jun 2, 2017

Contributor

Missing "this is only supported by the fast vector highlighter", I think.

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Jun 2, 2017

Thanks @clintongormley and @nik9000
I pushed a commit to address your comments.

I guess next is to deprecate the postings highlighter in 5.x

We can make this change transparent by allowing postings to be set in 6.x as an alias for unified (and just add the verification that postings is set in the index options) ?

@nik9000

This comment has been minimized.

Copy link
Contributor

commented Jun 2, 2017

We can make this change transparent by allowing postings to be set in 6.x as an alias for unified (and just add the verification that postings is set in the index options) ?

I'm afraid of something sneaky like that. I think we're better off deprecating in 5.x so users of the postings highlighter know that they should test with the unified highlighter. Even if it is the same code I'd prefer they know about the change rather than get some kind of surprise on upgrade.

@javanna

This comment has been minimized.

Copy link
Member

commented Jun 2, 2017

Not sure, but I am wondering if we should rather have two separate PRs, one for removing postings which gets replaced by unified (probably already potentially breaking although if it breaks users it's because of bugs? but certainly breaking for people specifying postings as highlighter type), and another one for changing the default highlighter type (more breaking as it affects also people relying on plain or fvh just because they don't have offsets or they have term vectors).

I tend to agree with @nik9000 on not making unified a synonym of postings under the hood.

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Jun 6, 2017

I tend to agree with @nik9000 on not making unified a synonym of postings under the hood.

Ok so I'll start with the deprecation in 5.x

Not sure, but I am wondering if we should rather have two separate PRs, one for removing postings which gets replaced by unified (probably already potentially breaking although if it breaks users it's because of bugs? but certainly breaking for people specifying postings as highlighter type)

I think it affects 5.x only where we have two options:

  • Deprecate postings but only when the type is forced and redirect automatically to unified (instead of postings) when the field options for offsets are set.
  • Deprecate postings but continue to redirect automatically to this highlighter when the field options are set.

I am leaning toward option 2 since the unified highlighter also handles phrase/span queries (and the postings just ignores the positions).

and another one for changing the default highlighter type (more breaking as it affects also people relying on plain or fvh just because they don't have offsets or they have term vectors)

If the desired behavior for 6.x is to use the unified highlighter by default I don't think we need to split this in two PRs. Doing so would require to change the tests and documentations twice, with a single PR the change is simpler, we don't need to have a replacement for the automatic postings mode.

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Jun 6, 2017

I opened #25073 for the deprecation in 5.x

jimczi added some commits Jun 2, 2017

Remove the postings highlighter and make unified the default highligh…
…ter choice

This change removes the `postings` highlighter. This highlighter has been removed from Lucene master (7.x) because it behaves
exactly like the `unified` highlighter when index_options is set to `offsets`:
https://issues.apache.org/jira/browse/LUCENE-7815

It also makes the `unified` highlighter the default choice for highlighting a field (if `type` is not provided).
The strategy used internally by this highlighter remain the same as before, it checks `term_vectors` first, then `postings` and ultimately it re-analyzes the text.
Ultimately it rewrites the docs so that the options that the `unified` highlighter cannot handle are clearly marked as such.
There are few features that the `unified` highlighter is not able to handle which is why the other highlighters (`plain` and `fvh`) are still available.
I'll open separate issues for these features and we'll deprecate the `fvh` and `plain` highlighters when full support for these features have been added to the `unified`.

@jimczi jimczi force-pushed the jimczi:unified_highlighting_default branch to 4feaaca Jun 8, 2017

jimczi added a commit to jimczi/elasticsearch that referenced this pull request Jun 9, 2017

Deprecate postings highlighter
This change adds a deprecation warning for removal in 6.0.

Relates elastic#25028

jimczi added a commit that referenced this pull request Jun 9, 2017

Postings highlighter deprecation (#25073)
This change adds a deprecation warning for removal in 6.0.
Only one deprecation is logged per request
Relates #25028

@jimczi jimczi merged commit 8250aa4 into elastic:master Jun 9, 2017

2 checks passed

CLA Commit author is a member of Elasticsearch
Details
elasticsearch-ci Build finished.
Details

@jimczi jimczi deleted the jimczi:unified_highlighting_default branch Jun 9, 2017

@jimczi jimczi removed the review label Jun 9, 2017

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jun 10, 2017

Merge branch 'master' into primary-context
* master: (53 commits)
  Log checkout so SHA is known
  Add link to community Rust Client (elastic#22897)
  "shard started" should show index and shard ID (elastic#25157)
  await fix testWithRandomException
  Change BWC versions on create index response
  Return the index name on a create index response
  Remove incorrect bwc branch logic from master
  Correctly format arrays in output
  [Test] Extending parsing checks for SearchResponse (elastic#25148)
  Scripting: Change keys for inline/stored scripts to source/id (elastic#25127)
  [Test] Add test for custom requests in High Level Rest Client (elastic#25106)
  nested: In case of a single type the _id field should be added to the nested document instead of _uid field.
  `type` and `id` are lost upon serialization of `Translog.Delete`. (elastic#24586)
  fix highlighting docs
  Fix NPE in token_count datatype with null value (elastic#25046)
  Remove the postings highlighter and make unified the default highlighter choice (elastic#25028)
  [Test] Adding test for parsing SearchShardFailure leniently (elastic#25144)
  Fix typo in shards.asciidoc (elastic#25143)
  List Hibernate Search (elastic#25145)
  [DOCS] update maxRetryTimeout in java REST client usage page
  ...

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jun 10, 2017

Merge branch 'master' into fix-get-mapping-head
* master: (80 commits)
  Test: remove faling test that relies on merge order
  Log checkout so SHA is known
  Add link to community Rust Client (elastic#22897)
  "shard started" should show index and shard ID (elastic#25157)
  await fix testWithRandomException
  Change BWC versions on create index response
  Return the index name on a create index response
  Remove incorrect bwc branch logic from master
  Correctly format arrays in output
  [Test] Extending parsing checks for SearchResponse (elastic#25148)
  Scripting: Change keys for inline/stored scripts to source/id (elastic#25127)
  [Test] Add test for custom requests in High Level Rest Client (elastic#25106)
  nested: In case of a single type the _id field should be added to the nested document instead of _uid field.
  `type` and `id` are lost upon serialization of `Translog.Delete`. (elastic#24586)
  fix highlighting docs
  Fix NPE in token_count datatype with null value (elastic#25046)
  Remove the postings highlighter and make unified the default highlighter choice (elastic#25028)
  [Test] Adding test for parsing SearchShardFailure leniently (elastic#25144)
  Fix typo in shards.asciidoc (elastic#25143)
  List Hibernate Search (elastic#25145)
  ...

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jun 10, 2017

Merge branch 'master' into simplify-prefix-logger
* master: (1889 commits)
  Test: remove faling test that relies on merge order
  Log checkout so SHA is known
  Add link to community Rust Client (elastic#22897)
  "shard started" should show index and shard ID (elastic#25157)
  await fix testWithRandomException
  Change BWC versions on create index response
  Return the index name on a create index response
  Remove incorrect bwc branch logic from master
  Correctly format arrays in output
  [Test] Extending parsing checks for SearchResponse (elastic#25148)
  Scripting: Change keys for inline/stored scripts to source/id (elastic#25127)
  [Test] Add test for custom requests in High Level Rest Client (elastic#25106)
  nested: In case of a single type the _id field should be added to the nested document instead of _uid field.
  `type` and `id` are lost upon serialization of `Translog.Delete`. (elastic#24586)
  fix highlighting docs
  Fix NPE in token_count datatype with null value (elastic#25046)
  Remove the postings highlighter and make unified the default highlighter choice (elastic#25028)
  [Test] Adding test for parsing SearchShardFailure leniently (elastic#25144)
  Fix typo in shards.asciidoc (elastic#25143)
  List Hibernate Search (elastic#25145)
  ...

@colings86 colings86 added v6.0.0-beta1 and removed v6.0.0 labels Aug 4, 2017

Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Oct 11, 2017

@Mpdreamz Mpdreamz referenced this pull request Oct 11, 2017

Merged

WIP Feature/elasticsearch 6 #2870

25 of 27 tasks complete

Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Oct 21, 2017

Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Oct 21, 2017

Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Nov 16, 2017

Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Nov 16, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.