
Support for highlighting extracted entities. #29467

Closed
markharwood opened this issue Apr 11, 2018 · 28 comments

Comments

@markharwood
Contributor

commented Apr 11, 2018

Background

Highlighting and entity extraction are cornerstones of search systems and yet they currently do not play well with each other in elasticsearch/Lucene.

  • Highlighting provides the evidence of where information was found in large texts.
  • Entity extraction derives structured information such as names of people or organisations from unstructured free-text input.

The two techniques are often used together in systems with large amounts of free-text such as news reports. Consider this example search which combines free-text and a structured field derived from the free-text:

[screenshot: example search combining a free-text query with a structured "person" entity field]

In this particular example highlighting works for the entity Natalia Veselnitskaya but would not for the entity Donald Trump Jr.

Issue - brittle highlighting

Sadly, the structured keyword terms like "person" produced by entity extraction tools rarely exist as tokens in the free text fields where they were originally discovered. The traceability of this discovery is lost. In the example above the natalia veselnitskaya entity only highlights because I carefully constructed the scenario:

  1. I lowercase-normalized the person keyword field's contents
  2. I applied lowercase and 2 word shingles to the unstructured text field

This approach was a good suggestion from @mbarretta, but one which many casual users would overlook, and it is still far from a complete solution. Donald Trump Jr. would require a 3 word shingle analyzer on my text field, and one which knew to preserve the full-stop in Jr. - but I don't want to apply 3 word shingles to all text or retain all full-stops. This is clearly a brittle strategy.

The irony is that entity extractors such as OpenNLP, Rosette or even custom regex have the information required to support highlighting (extracted entity term and offset into original text) but no place to keep this data. Entity extraction is really concerned with 2 or more fields - an unstructured source of data and one or more structured fields (person/organisation/location?) to deposit findings. Due to the way Analysis is focused on single fields we are left with no means for entity extractors to store the offsets that provide the traceability of their discoveries which standard highlighters can use.
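To make the point concrete, even a trivial custom-regex extractor already produces the term-plus-offset traceability data described above. The sketch below is purely illustrative Python (not elasticsearch code; the helper name and toy patterns are invented):

```python
import re

def extract_entities(text, patterns):
    """Run each named regex over the text, returning the matched term
    plus the character offset/length a highlighter would need."""
    entities = []
    for entity_type, pattern in patterns.items():
        for m in re.finditer(pattern, text):
            entities.append({
                "type": entity_type,
                "token": m.group(0),
                "offset": m.start(),
                "length": m.end() - m.start(),
            })
    return entities

text = "Donald Trump Jr. met with russian attorney Natalia Veselnitskaya"
# Toy person patterns - real extractors (OpenNLP, Rosette) use trained models.
patterns = {"person": r"Natalia Veselnitskaya|Donald Trump Jr\."}
found = extract_entities(text, patterns)
# Each entity carries the offset needed to trace it back into the source text.
```

The offsets are exactly the data that currently has "no place" to live in the indexing pipeline.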

Possible Solutions

1) "Internal analysis" - entity extraction performed as Analyzers

(I offer this option only to show how bad this route is...)
If the smarts in entity extractors were performed as part of the Lucene text analysis phase they could potentially emit token streams for both structured and unstructured output fields.

  • Advantages - input JSON source is free of low-level term+offset information
  • Disadvantages are many: Lucene analysis would need to support multiple field outputs, and entity extraction logic would have to be written in Java and run in-line, with added compute expense.

2) "External analysis" - entity extraction performed prior to indexing

In this approach any entity extraction logic is performed outside of core elasticsearch, e.g. using python's nltk or perhaps reusing human-annotated content like Wikipedia. The details of any discoveries are included in the JSON sent to elasticsearch. We would need to allow detailed text offset information of the type produced by analysis to be passed in from outside - akin to Solr's pre-analyzed field. This information could act as an "overlay" on the tokens normally produced by the analysis of a text field. Maybe text strings in JSON could, like geo fields, be presented in more complex object forms to pass the additional metadata, e.g. instead of:

"article_text": "Donald Trump Jr. met with russian attorney"

we could also support this more detailed form:

"article_text": {
    "text": "Donald Trump Jr. met with russian attorney",
    "inject_tokens": [
        {
            "token": "Donald Trump Jr",
            "offset": 0,
            "length": 16
        }
    ]
}

A custom Analyzer could fuse the token streams produced by standard analysis of the text and those provided in the inject_tokens array.
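A minimal sketch of that fusing step, assuming tokens are reduced to (term, start_offset, length) tuples already sorted by start offset - the real implementation would merge Lucene TokenStreams in Java, so this Python is illustrative only:

```python
import heapq

def fuse_token_streams(analyzed, injected):
    """Merge two lists of (token, offset, length) tuples, each already
    sorted by start offset, into one stream ordered by start offset."""
    return list(heapq.merge(analyzed, injected, key=lambda t: t[1]))

# Tokens a hypothetical standard analyzer might emit (lowercased words).
analyzed = [("donald", 0, 6), ("trump", 7, 5), ("jr", 13, 2),
            ("met", 17, 3), ("with", 21, 4)]
# The externally provided entity token spanning the same text region.
injected = [("Donald Trump Jr", 0, 16)]

fused = fuse_token_streams(analyzed, injected)
```

Because both inputs are sorted, the merge is linear and keeps the injected entity token adjacent to the text tokens it overlays.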

@elasticmachine

Collaborator

commented Apr 11, 2018

Pinging @elastic/es-search-aggs

@spinscale

Member

commented Apr 11, 2018

+1 on the second approach, if we decide to do this - it allows us either to use ingest processors or to have completely external NER. Using Lucene analyzers would also mean that we have to keep the models on all Elasticsearch data nodes.

@markharwood

Contributor Author

commented Apr 12, 2018

I've experimented with the example of object-based representations of text strings in JSON (with text and inject_tokens properties) and can see that there are some potential issues.
I had to modify TextFieldMapper to accept object-based strings for indexing, and for highlighting to work I had to hack the SourceLookup.extractRawValues method to look for body.text in the source map if the user asked for the body field to be highlighted. It's not clear to me how users should refer to the body field when making requests for highlighting or source filtering etc. - should they refer to body string fields or body.text? Also, most of the time in things like hits or top_hits aggs, users are unlikely to be interested in seeing all the inject_tokens echoed back as part of the source - these are more of an indexing detail for the search engine than useful information for the end user.

@mayya-sharipova

Contributor

commented Apr 12, 2018

@markharwood very interesting issue and use case.

Also +1 for the second approach.

It would be cool if we can add even the type of token:

"inject_tokens": [
    {
        "token": "Donald Trump Jr",
        "offset": 0,
        "length": 16,
        "type": "person"
    }
]

Have you thought about designing a special type of query for this, something like this:

"query": {
    "entity" : { "person" : "Donald Trump Jr" } 
  }

About the implementation details: it would be cool if we could add an additional tokenStream to a field besides the one traditionally analyzed. Custom similarity modules could then combine both tokenStreams in a custom way.
I have heard similar requests about possible ways to index vector embeddings with a field's content. Some people are using payloads for this, but it doesn't scale well.

@markharwood

Contributor Author

commented Apr 13, 2018

It would be cool if we can add even the type of token:

Yep, anything a regular Token produced by internal analysis can contain should be on the table.
I don't think we do anything with Token type info at query time but that's probably for another issue.

it would be cool if we can add an additional tokenStream to a field besides the one traditionally analyzed.

My assumption is that we'd have to emit one fused TokenStream at index time containing a combo of the internally-generated tokens and the externally-provided ones.

Also, anything that re-analyzes _source text strings at query time will need to repeat this stream-fusing logic. Candidates may include:

  • UnifiedHighlighter
  • SignificantText agg
  • MoreLikeThis Query
  • org.elasticsearch.xpack.ml.job.categorization.CategorizationAnalyzer

I'm not sure how best to handle this - either rely on persisted tokenisation (e.g. TermVectors) for these classes to work properly, or try to abstract the way TokenStreams are obtained from a fusion of on-the-fly Lucene text analysis and the externally-provided tokens.

@markharwood

Contributor Author

commented Apr 17, 2018

I discussed this with @jpountz @jimczi @romseygeek and others and the suggestion was that we should ideally store the externally-provided entity tokens in a separate "annotations" field rather than splicing them directly into the tokens of the text field. These annotations should be thought of as similar to the fields sub-properties in a field mapping - alternative ways of indexing the original JSON text. Let's call both of these concepts "indexed variants" or IV fields for the moment.

There are a number of existing challenges with support for IV fields which we might choose to address in spin-off issues:

  1. Highlighters need to be able to highlight the original text using one or more selected IV fields that contain the tokens used by the user in the query
  2. Positional queries (Spans or the new interval queries) need to support finding "X near Y" where tokens X and Y may be stored in different IV fields.
  3. Other forms of query-time analysis such as the significant_text aggregation, the MoreLikeThis query or ML's CategorizationAnalyzer may need to support the selection of one or more IV fields.

In 1) and 3) above, the new "annotations" IV field would break all their existing assumptions about how to get hold of a token stream for original JSON strings. It is not enough for them to pass the original text to the field's Analyzer for it to re-tokenize. The "annotations" IV field would require callers to pass a different context to retrieve the externally-provided list of tokens. This context might be a Map of the source data, some Lucene object like Document - it's unclear to me how we would abstract this more cleanly - especially when we consider the "TermVectors" alternatives for providing any pre-stored tokens.
It is perhaps also worth formalising the connection between the annotations field and the source text to which it relates. Maybe with a special definition in the mapping. This would help us validate that when highlighting the foo_text JSON field, the "foo_annotations" field is an appropriate choice of IV field whose positions and offsets still relate to the foo_text content.

@markharwood

Contributor Author

commented Apr 17, 2018

Damn. If we adopt the "separate field" approach to storing entity annotations (as suggested in my previous comment) then we can't use positional queries like span/interval. These queries measure token proximity using position increments, not character offsets. Using a separate field to store entity annotations would mean it was impossible to record position increment values that tie up with the tokens recorded in the text field.

For the record - the highlighting and positional query capabilities I would hope to enable are demonstrated here: https://youtu.be/kbK3D_pULd4

@jpountz

Contributor

commented Apr 19, 2018

If we adopt the "separate field" approach to storing entity annotations (as suggested in my previous comment) then we can't use positional queries like span/interval.

I'm not sure this is a blocker. For instance I could imagine that we could merge-sort two token streams in order to reconstitute one that has both the raw tokens and extracted entities. I vaguely remember @romseygeek talking about something like that but could be wrong.

@markharwood markharwood removed the discuss label Apr 20, 2018

@markharwood

Contributor Author

commented Apr 23, 2018

Inline vs external tagging styles

NLP tools such as Apache UIMA and GATE can export a rich-text format of annotated text where annotations are in-lined around selected text by introducing special markup (traditionally XML) to identify items of interest. This is similar to how HTML uses <a href="foo.com"> tags to introduce hyperlinks around selected text. The advantage is that no position offset information is required - such offsets can be brittle when different character encodings are used between systems. The disadvantage is that XML-like structures may be hard to express in JSON. Perhaps the {{...}} style of escaping annotation text popularised by HTML templating engines could be another approach.
Whichever escaping format is used, this approach would rely on elasticsearch supporting a new rich text format which keeps both text and annotations together.

The alternative format is offset-and-length based annotations such as those provided by OpenCalais where any annotations are listed separately from the text using tags, positions and offsets to reference areas of the original text where entities were discovered. This approach would rely on elasticsearch supporting a new "annotations" field type and defining the text field to which it relates in the mapping.
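As a hypothetical illustration of the offset-and-length style (the field names below are invented for this sketch, not a decided format), the annotations might be supplied alongside the text like this:

```json
{
    "article_text": "Donald Trump Jr. met with russian attorney",
    "article_annotations": [
        { "type": "person", "value": "Donald Trump Jr.", "offset": 0, "length": 16 }
    ]
}
```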

@mbarretta

commented Apr 23, 2018

Another approach I've seen in the past was a graph-style annotation schema that took the raw text as a single field and attached various low-level analytics (tokens, mainly) as nodes pointing to offsets of that text and higher-level analytics (SBs, POS, ultimately entities) pointing to the lower-level nodes or to "beginning" and "ending" nodes.

The spec for it is here:
https://github.com/pagi-org/spec/blob/master/pagi.md

@markharwood

Contributor Author

commented Apr 25, 2018

Not heard of pagi before, thanks Mike.
My gut feel is that we should aim for simplicity here and not try to support:
a) overlapping annotations in the text
b) annotation type hierarchies (e.g. person -> organisation)
c) annotation relationships (e.g. text declares annotation 1 has an "employed_by" relationship with nearby annotation 2)
d) multiple annotation properties, e.g. person annotations possibly having "name", "age", "gender" attributes etc.

The simplest option is to support an annotation as a single Token (string + position + offset + length) whose string value serves as both a (hopefully) unique ID and a human-readable label. The dandelion NER plugin, for example, uses Wikipedia URLs, which act as both a unique ID and a human-readable string containing the entity name.

@markharwood

Contributor Author

commented Apr 25, 2018

@jpountz @jimczi We had a meeting on this and came up with the following decisions:

1) Annotated text should be presented inline using tags

A new field type is required ("annotated_text"?) which accepts strings that are interspersed with markup e.g.

"text" : "They met with <a type=`person` value=`Donald Trump Jr.`>Don junior</a> and ..."

The exact tag-escaping mechanism was not decided (thoughts, @clintongormley ?) but it should allow users to express a type and a value which would be indexed as a single token at the same offset and position as the text it surrounds. The value would be indexed as-is so not lower-cased or otherwise formatted. The type information may appear as a payload or possibly a prefix on the value (as yet undecided).
The advantage of the in-line tagging format is that external clients would not have to pass any offsets and lengths for annotations which can be problematic when translating between client and server character encodings. In-line tags also mean we will not support overlapping tokens.

2) Annotations are additional tokens indexed into the same "text" index field

Entity annotations can be thought of as synonyms that expand on the text tokens recorded during analysis. Advantages of putting them in the same text field rather than a separate "text.annotations" indexed field are:
a) existing highlighters work with a single token-stream and not a fusion of multiple indexed fields
b) Positional interval queries also only work with a single indexed field.
The disadvantage is that traditional text-based queries may unexpectedly highlight/match annotation-introduced tokens. We felt this could be mitigated if the annotation tokens adopted a convention like type-prefixing that would ensure there weren't unintentional matches.
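A type-prefixing convention might look like this tiny sketch (the person# separator is an assumption for illustration, not a decided format):

```python
def annotation_token(entity_type, value):
    """Prefix an annotation token with its type so it can never collide
    with a plain text token produced by ordinary analysis (assumed
    convention - the '#' separator is illustrative only)."""
    return f"{entity_type}#{value}"

token = annotation_token("person", "Donald Trump Jr.")
# A plain text query for "donald" cannot match "person#Donald Trump Jr."
```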

3) May only work with one choice of Highlighter impl

The challenge of working on the text of a field that contains both text and annotation markup may mean that we have to "special case" the new annotated_text fields so that they only work with one highlighter impl - possibly a variant of the unified highlighter.

4) Punted for later releases:

a) We won't be allowing in-text annotations to declare "copy_to" commands for any structured keyword fields. This is a convenience for clients that would be hard for us to implement and they do have a work-around. Clients could instead pass JSON source that included identical person names in the text annotations and the structured person keyword field.

b) We won't be offering a means to ask for hits in results that return the text field without the annotation markup. I can see people might want it but we'd need to work out a way to make this clean.

c) MoreLikeThis and Significant Text agg both try to identify statistically significant items in text, doing so using re-analysis. Whether this should include any annotations is perhaps open to interpretation so we did not decide on a policy or any additional user-facing controls that might be needed here.

@markharwood

Contributor Author

commented Apr 30, 2018

I've opted for encoding annotations in text using a markdown style syntax.
So the original text appears in between [ ] followed by a url-like syntax in between( ) used to describe the entity value and type e.g.

 "text" : "They met with [Don junior](type=person&value=Donald%20Trump%20Jr) and ..."

Note that the type and value of the entity are url-encoded parameters.
Note also that markdown is a permissive syntax, meaning regular uses of [ in the text don't have to be escaped; it is only when these characters match the pattern [...](...) that they are interpreted as a URL, or in our case, an entity reference.
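The [...](...) pattern can be decoded with a small regex plus standard URL-parameter parsing. This Python sketch is a hypothetical illustration of the idea, not the field's actual parser:

```python
import re
from urllib.parse import parse_qs

ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

def parse_annotated_text(markup):
    """Split markdown-style annotated text into plain text plus a list
    of annotation tokens carrying offsets into that plain text."""
    plain, annotations, last, pos = [], [], 0, 0
    for m in ANNOTATION.finditer(markup):
        plain.append(markup[last:m.start()])
        pos += m.start() - last
        surface = m.group(1)            # the visible text
        params = parse_qs(m.group(2))   # url-encoded key/value parameters
        annotations.append({
            "type": params.get("type", [None])[0],
            "value": params.get("value", [None])[0],
            "offset": pos,
            "length": len(surface),
        })
        plain.append(surface)
        pos += len(surface)
        last = m.end()
    plain.append(markup[last:])
    return "".join(plain), annotations

text, anns = parse_annotated_text(
    "They met with [Don junior](type=person&value=Donald%20Trump%20Jr) and left")
```

The returned offsets point into the stripped plain text, which is what a highlighter would need.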

@markharwood

Contributor Author

commented May 17, 2018

I'd like to add the option of injecting multiple annotation tokens for a given piece of text.

[screenshot: highlighted text containing both a person and their role]

e.g. in the highlighted text above I want a token to identify both the person and the role.

The question is how to encode multiple tokens in the annotation's markdown-like syntax. I can think of 3 approaches:

  1. Simple key/value pairs
    In this syntax the token type and value are collapsed into simple key/value pairs:

    he paid [John](person=John+Smith&role=payee)

  2. Multiple numbered token properties
    In this syntax the type and value (and potentially other attributes) for each token are associated using numbers

    he paid [John](type1=person&value1=John+Smith&type2=role&value2=payee)

  3. More complex encoding
    We could introduce extra escaping into the url-like syntax to have comma-delimited list of annotation attributes or perhaps use JSON curly braces instead of the (url) syntax e.g. paid [John]{...}

I like the simplicity of 1) but it does preclude having any token attributes other than type and value - we couldn't for instance introduce anything that added payload information in future.
Currently we only use value in the search index - the type part of a token only has potential use in clients rendering this text in type-specific ways.

Proposal

We should reserve [](...) syntax for the simple key/value syntax and use []{...} for any advanced JSON-like syntax we may come up with in future.
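Option 1 maps directly onto standard query-string decoding, with each key/value pair yielding one injected token. A hypothetical Python sketch of that decoding (helper name invented):

```python
from urllib.parse import parse_qsl

def decode_annotation(params):
    """Decode option 1's key/value syntax: each pair yields one token
    whose type is the key and whose value is the url-decoded value."""
    return [{"type": k, "value": v} for k, v in parse_qsl(params)]

tokens = decode_annotation("person=John+Smith&role=payee")
```

As noted above, this simplicity comes at a cost: there is no room in a pair for extra per-token attributes such as payloads.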

@markharwood

Contributor Author

commented May 18, 2018

"Copy_to" from annotations to structured keyword fields looks like it may be tricky.

The copy_to impl works by passing the same JSON property text to multiple fields (see DocumentParser.parseCopy(field, parseContext)). Each target field currently reparses the original JSON string (text-plus-markup). Ideally we'd pass only parsed annotation token values to keyword fields, based on the token type, e.g.

{
	"my_annotated_field": {
		"type": "annotated_text",
		"copy_annotation_types_to": {
			"person": "my_entities_keyword_field",
			"role": "my_roles_keyword_field"
		}
	}
}

Adding this would require a change to DocumentParser to allow annotated fields to pass back "virtual" document properties that are just the annotation token values presented to target fields as if they were there in the original JSON.

Proposal

Copying annotations to structured fields looks too messy to attempt inside elasticsearch. Maybe an ingest pipeline processor that understands the annotation markup syntax is a better way to copy these field values around. Certainly tools like the OpenNLP ingest processor already do this kind of copy-to logic when extracting entities from raw text. In future that tool and others like it will likely automate the process of both marking up the annotated text and copying discoveries to structured fields in the JSON. Any human-authored docs may still get it wrong (e.g. forgetting to add an annotation value to a related structured field) but I expect the majority of value-copying will be done automatically by upstream tools in practice.

@jimczi

Member

commented May 18, 2018

We could also create extra fields directly in my_annotated_field - either one field per annotation type (my_annotated_field.person, my_annotated_field.role) or a single field my_annotated_field.annotations with the type added as a prefix in the annotation, e.g. person#madonna?
I think this field should be able to handle the indexing of the text and the annotations in a doc_values field automatically, otherwise you'll need to handle the format in a lot of places.

@markharwood

Contributor Author

commented May 18, 2018

and add the type as a prefix in annotation

It certainly would be nice to exploit the type information in the annotation tokens.
This is something I think is generally missing in our existing mapping definitions - client tools such as Kibana don't appreciate that the values found in, say, a "from" keyword field can also be used in a "to" keyword field, because the tokens in both represent the same entity type (email address).

Nothing in the mappings declares these fields' tokens are interchangeable. I'd like to see this entity-type information in the mapping alongside the existing choice of storage-type (eg keyword).
Generic clients like Kibana could then understand how discoveries in one field were exploitable in other fields for search or highlighting purposes.
Annotated tokens are perhaps the first place in elasticsearch where an idea of entity type (person, organisation etc) is introduced independently of the type associated with the field that contains it. It would be nice to carry this "entity type" info further into our mappings.

@markharwood

Contributor Author

commented Jun 12, 2018

I think when highlighting an annotated_text field it will be useful to return hit information using the annotation syntax.

Benefits

  1. Rather than plain <em></em> tags we can pass extra "hit" information in the url-parameter like syntax such as the actual search term that matched and possibly scoring weights. An example hit for a search for "tesla" might be marked up as follows: brand new [Tesla](_hit.term=tesla&_hit.score=3.32) launched.
  2. It would also be useful to return any other non-matching annotations from the original text e.g. in this Wikipedia example below the only thing highlighted in yellow is the searched text but the original JSON contains multiple people annotations which would be useful to have marked up in the client too:
    [screenshot: Wikipedia passage where only the searched text is highlighted in yellow, although the source JSON contains multiple person annotations]

A sophisticated client (Kibana?) could make good use of all this extra metadata embedded in the text, rendering results with hyperlinks, different colours, font-weights etc.

Downsides

  • Would potentially be confusing in that the usual "pre" and "post" tag settings sent in search highlight requests would not apply - maybe we should throw an error if passed for use on an annotated text field?
  • Would clients still want an option to return plain-text with the <em> tags for the convenience of displaying as HTML? Annotated text markup needs some client-side parsing before it can be rendered nicely (however, the same is true of viewing the raw content of a doc anyway).

Approach

The implementation would be a special PassageFormatter for the existing UnifiedHighlighter.
When mixing search terms and pre-existing annotations in the final markup the rules would have to be as follows:

  • Search terms that exactly overlap an existing annotation will mix _hit.* attributes into the url-like syntax (the assumption being existing annotation attributes won't name-clash with _hit.* attributes)
  • Search terms that partially overlap an existing annotation will take precedence and replace any existing annotation
  • Search terms that are distinct from any annotations will just inject the appropriate _hit.* markup
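Those three rules amount to a span-overlap classification on character offsets. A sketch in Python (the rule names are invented for illustration; the real logic would live in the custom PassageFormatter):

```python
def classify_hit(hit_span, annotation_spans):
    """Classify a search-hit character span against existing annotation
    spans, applying the three formatting rules above in order."""
    h_start, h_end = hit_span
    for a_start, a_end in annotation_spans:
        if (h_start, h_end) == (a_start, a_end):
            return "merge_hit_attrs"     # exact overlap: mix in _hit.* attrs
        if h_start < a_end and a_start < h_end:
            return "replace_annotation"  # partial overlap: the hit wins
    return "inject_hit_markup"           # distinct: plain _hit.* markup

annotations = [(14, 24)]  # e.g. the span of an existing [Don junior](...) tag
```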
@markharwood

Contributor Author

commented Aug 31, 2018

Had a chat with @jpountz and we agreed that a type system for annotations should come later (hopefully sharing the notion of entity types from metadata also used in keyword fields).

In the interim we said we should reject as malformed any documents that have annotations using the [foo](key=value) syntax rather than the [foo](value) syntax.

OK with this, @colings86 ?

@colings86

Member

commented Aug 31, 2018

Sounds good to me 👍

markharwood added a commit to markharwood/elasticsearch that referenced this issue Sep 17, 2018
New plugin for annotated_text field type.
Largely a copy of `text` field type but adds ability to include markdown-like syntax in the text.
The “AnnotatedText” class parses text+markup and converts into plain text and AnnotationTokens.
The annotation token values are injected unchanged alongside the regular text tokens to provide a
form of additional indexed overlay useful in positional searches and highlighting.
Annotated_text fields do not support fielddata as we want to phase this out.
Also includes a new "annotated" highlighter type that retains annotations and merges in search
hits as additional annotation markup.

Closes elastic#29467
markharwood added a commit that referenced this issue Sep 18, 2018
New plugin - Annotated_text field type (#30364)

Closes #29467

@markharwood markharwood added the v7.0.0 label Sep 18, 2018

markharwood added a commit to markharwood/elasticsearch that referenced this issue Sep 23, 2018
markharwood added a commit that referenced this issue Sep 23, 2018
New plugin - Annotated_text field type (#33851)
Backport of annotated_text plugin, issue #29467

@markharwood markharwood added the v6.5.0 label Sep 24, 2018

@colings86 colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
