
Support DelimitedTermFrequencyTokenFilter #27552

Closed
gkinsman opened this issue Nov 28, 2017 · 21 comments
Labels
discuss >feature :Search/Analysis How text is split into tokens

Comments

@gkinsman

Feature Request

I'd like to use the DelimitedTermFrequencyTokenFilter that was added to Lucene 7, but it doesn't appear to be available via Elastic's API yet.

AFAIK it just needs to be added to CommonAnalysisPlugin.java, as a factory and token filter already exist within Lucene. I'd like to not have to use a custom plugin for this.
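For illustration, if the filter were exposed, the analysis settings might look something like this - the delimited_term_freq filter name is hypothetical, and the whitespace tokenizer keeps the token|frequency pairs intact:

{
   "settings": {
      "analysis": {
         "analyzer": {
            "tags_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": ["delimited_term_freq"]
            }
         }
      }
   }
}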

Thanks!

@jimczi jimczi added :Search/Analysis How text is split into tokens discuss labels Nov 28, 2017
@jimczi
Contributor

jimczi commented Dec 1, 2017

We discussed this internally during FixIt Friday. This is an expert token filter that needs to be used under specific conditions (a tokenizer that does not split on punctuation, a similarity that takes term frequency into account, ...). For this reason we'd like to understand the use cases this new filter could help with, to see whether we can offer a better integration than a direct mapping.
Can you share details of how you plan to use this new filter?

@rpedela

rpedela commented Dec 2, 2017

I have a use case. Given a document, I want to detect keywords and keyphrases using ML and then use them to improve relevance. I could store them as an array in _source, but then I lose term frequency information. I could index them normally, but that isn't space efficient. I believe, though could be wrong, that DelimitedTermFrequencyTokenFilter would give the best of both worlds.
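As a sketch of what that field could look like, with made-up keyphrases and the ML-derived counts encoded via Lucene's default | delimiter:

{
   "keyphrases": "machine_learning|7 neural_networks|3 relevance_tuning|12"
}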

@gkinsman
Author

gkinsman commented Dec 3, 2017

The use case I'm attempting is a slightly unorthodox use of elastic, but I believe it's a valid one.

I have tags on a document that are tokenized and indexed, but the tags also carry vote counts. I'm providing a search facility for these tags that uses the vote count to rank documents with higher counts above other documents carrying the same tag.

I achieved this locally with Lucene by writing my own duplicating token filter and feeding each term through n times, where n is the vote count. I can do a similar thing with Elastic by pre-filtering the token stream, duplicating the tokens in the request I send to Elastic. This obviously won't work well for high values of n, so it's a temporary solution.

In fact, I just used the whitespace tokenizer with Lucene, fed into the aforementioned token filter - there's no real text analysis required for my usage since I'm storing and indexing the tags verbatim; I just want the tf-idf ranking.
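To make that concrete, here's a sketch with made-up tags, showing the stuffed form I use today versus the delimited form this filter would allow.

Term stuffing today (n copies of each tag, where n is the vote count):

{
   "tags": "scifi scifi scifi thriller"
}

The same information with delimited term frequencies (Lucene's default | delimiter):

{
   "tags": "scifi|3 thriller|1"
}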

Thanks for listening!

@nadre

nadre commented Mar 14, 2018

Another use case would be adding related terms to a document when the document is short, also known as document expansion.

I created a plugin for now:
https://github.com/nadre/elasticsearch-delimited-tf-token-filter

@romseygeek
Contributor

cc @elastic/es-search-aggs

@softwaredoug
Contributor

One problem this solves is allowing users to store distributed representations of text in a field and use them in ranking. By overwriting term frequency, you can create arbitrary vector representations and have them used efficiently in ranking.

For example, if you have a 5 dimensional embedding out of word2vec/doc2vec, you might store that in a field:

{
   "text_doc2vec": "0|75 1|1 2|14 3|52 4|12"
}

Here the vector output of doc2vec has been scaled to 0-100.

Now the similarity between this doc and query terms passed through the same word2vec model becomes rather efficient to compute, as it's using the search engine's built-in TF*IDF scoring.

I think Lucene & ES need to do a better job supporting arbitrary vector math like this. I'm finding in my work, clients regularly need this kind of functionality.

Granted, I don't know if this is the absolute best way to implement such functionality. Perhaps a new field type that stores multidimensional data, as syntactic sugar around this approach, would be better. But this is one way I see this being used.
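To sketch what the query side of that dot product could look like: assuming a similarity tuned so that each clause scores roughly in proportion to the stored term frequency, the query vector's components become boosts on term queries against the same field (the boost values here are made up):

{
   "query": {
      "bool": {
         "should": [
            { "term": { "text_doc2vec": { "value": "0", "boost": 0.4 } } },
            { "term": { "text_doc2vec": { "value": "1", "boost": 0.1 } } },
            { "term": { "text_doc2vec": { "value": "2", "boost": 0.9 } } }
         ]
      }
   }
}

Each clause would then contribute roughly tf × boost, and the sum over dimensions approximates the dot product.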

@gkinsman
Author

@softwaredoug that's a great use case, and another example of where this feature could be used to reduce the work done at index/query time while increasing the flexibility of the API.

Imagine if you could write code in your own app to index a field to produce a term frequency map, and then pass that to ES. Would be a game changer!

I'm currently still hacking around the fact that I can't customise term freq by doing term stuffing.

@peterdm

peterdm commented May 29, 2018

I have @gkinsman's use case as well (vote counts or impressions).

@jpountz
Contributor

jpountz commented May 29, 2018

For the record, we already merged #30618, which is one way to leverage custom frequencies and is abstracted in its own field type.

@gkinsman What kind of field do you perform term stuffing on? Something like tags?
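Roughly, the new field is mapped and queried like this - a sketch, see #30618 for the exact syntax. The field stores a single positive value per document, and the feature query turns it into a score through a saturation, log or sigmoid function:

{
   "properties": {
      "pagerank": { "type": "feature" }
   }
}

{
   "query": {
      "feature": {
         "field": "pagerank",
         "saturation": { "pivot": 8 }
      }
   }
}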

@gkinsman
Author

gkinsman commented May 29, 2018

Ah interesting @jpountz, I haven't seen those before - might have to take a look.

I have two fields per doc for tags - one is an original, stored text field containing the tag and count, e.g. "tag1|3 tag2|4 tag3|1".
I then have a stuffed field, which is a non-stored text field with IndexOptions.Freqs enabled and would look like "tag1 tag1 tag1 tag2 tag2 tag2 tag2 tag3". When indexed, the stats generated represent the term score.

I should also mention that both fields use just a whitespace analyzer, as the terms are verbatim.
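For reference, the mappings for those two fields look roughly like this (field names are just examples):

{
   "properties": {
      "tags_raw": {
         "type": "text",
         "store": true,
         "analyzer": "whitespace"
      },
      "tags_stuffed": {
         "type": "text",
         "store": false,
         "analyzer": "whitespace",
         "index_options": "freqs"
      }
   }
}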

@jpountz
Contributor

jpountz commented May 29, 2018

Out of curiosity, what is the number of unique tags that you have in your index? Is it low-cardinality and conceptually closer to a field (e.g. rating: 5 or views: 12345), or high-cardinality and closer to a value (e.g. tags: { politics: 12, economics: 20, germany: 50, renewable_energies: 35 })?

@jpountz
Contributor

jpountz commented May 29, 2018

Sorry, one more question: do you disable norms on this term-stuffing field or not?

@gkinsman
Author

The latter - I have high cardinality tags :).

I don't disable norms on the field currently, but it looks like I should, which would save disk.

@softwaredoug
Contributor

@jpountz thanks, that looks promising! I'm wondering though why I couldn't take the value of what I index directly into scoring? It seems I have to take saturation, log, or sigmoid. I don't think I could implement a dot product using that feature very easily.

To do a dot product, I would need to take the term freq directly and multiply it by a query-time boost.

@jpountz
Contributor

jpountz commented May 30, 2018

It was designed this way because such functionality is easier to evolve over time when it has few, constrained parameters and its use-case is well understood. I do agree that it doesn't address all use-cases. @gkinsman needs a way to handle a feature vector with arbitrary (and potentially many) dimensions (the tags), and I have heard about similar use-cases in the past, so I think we should do something there as well. For instance @jimczi mentioned to me that in order to boost transactions in e-commerce, it is possible to attach the search keywords that led to a transaction to documents and store the likelihood of a transaction as a custom term frequency. Searching this field in addition to the usual designation/description fields is then expected to help relevance and conversion rate. I am hopeful that something like the new feature field, but accepting a hash, e.g. features: { dim0: 20, dim1: 50, dim2: 12 }, rather than a single value, would address both use-cases.

Once you would have your dot product, what would you do with it? Just sum it up with the bm25 score?

@softwaredoug
Contributor

@jpountz

Some of this is a tangent, probably good for another issue related to working with embeddings or other vector representations in the search engine:

On what I'd do with it: certainly use it as a feature for learning to rank. Embeddings are often fed as input to models. I'm sort of inspired by what Vespa does in giving a very math-looking interface for taking arbitrary tensors or other values as input and implementing a direct scoring function. In part they do this because they can't anticipate whatever exotic vector-based model you'll want to use. I'm not sure I'd advocate Elasticsearch going that far in that direction.

For more manual cases, a summation would certainly be useful. Honestly, it could be hard to anticipate completely. In a way it's a different kind of text similarity, comparable to a TF*IDF similarity. In spirit, it's a bit like analyzing the text/query with a 'vector' representation coming out of the analysis, and a 'similarity' that does a dot product. Lucene has added OpenNLP analyzers that work with models to do POS tagging etc. in the analysis chain. Perhaps a token filter that loaded a word2vec model and output a vector representation of the terms/documents somehow?

The problem is that word2vec models are basically a giant matrix of every word/document and their corresponding vector representations - very memory intensive to keep in the search engine. Perhaps there are alternate embedding methods that are more memory efficient?

Alternatively, or in addition, I wonder if an explicit mapping type that took a vector would be useful? In your example you use object notation, but could you also do just a list?

"features": [20, 50, 12]

Or, if you truly are looking to support vectors with values in [-1.0, 1.0], then hide the details in Elasticsearch (whether it's term freq or a multi-valued numerical field):

"features": [-0.2, 0.8, -0.6]

Then a "feature" query (or whatever other 'dot product' query) could do the rest

Finally it's important to recognize these vectors could mean more than text, they could be:

  • A component in a user-item matrix, to help drive personalization/recommendation systems
  • Some kind of 'embedding' decomposing an image

So a general solution that lets you store and compute vector similarities could be useful for a lot of interesting 'search' applications.

@gibrown
Contributor

gibrown commented May 31, 2018

There are two end use cases that immediately come to mind.

  1. Popularity ranking of content

We are often trying to boost content based on some metric of popularity. Often it is a list of users who have liked or commented on something. By adding the user ids as a list of items we can give some weighting, and the IDF somewhat corrects for users who like tons of content. Controlling the TF, though, would allow us to give extra weight to particular users who tend to do a good job of recommending content within a specific category or on certain sites. We use this sort of ranking for search, related posts, and content recommendations.

We've also looked at classifying sites by topic. Something like the Chinese Restaurant Process to group together similar sites. Then each site could be weighted by how much it belongs in a particular topic.

  2. Applying additional NLP processing to search content. In particular searching for domain names

I recently convinced our team to move from developing on Neo4J to ES, but it was not a simple mapping because of the NLP they were applying. In particular they were doing:

  • Meronymy (part-of, is-a, instance-of, sub-part, etc) with a weighting to create an ontology of words to search in
  • Part of Speech tagging where the part of speech attached to a word was weighted based on how common it is
  • Just a generic score that links words together. eg. car and jaguar would have a link of some weight, but jaguar would also have some weight linking to cat

We had to make a number of compromises to get it to work in ES. We got it working, but it was not easy to line it up with how data scientists really think about such problems.

Also, if you want my very open source, opinionated reason for wanting this for the long term:

While we can and should talk about filter bubbles and the impact that these algorithms have on the world, a world where only monopolistic tech giants can deploy these algorithms is not one where publishing is democratic.

-- Me

Ultimately, Lucene is tuned to process term frequencies very fast and efficiently. Controlling those term frequencies provides a lot of flexibility and speed compared to the other workarounds.

jpountz added a commit to jpountz/elasticsearch that referenced this issue Jun 5, 2018
This field is similar to the `feature` field but is better suited to indexing
sparse feature vectors. A use-case for this field could be to record topics
associated with every document alongside a metric that quantifies how well
the topic is connected to this document, and then boost queries based on the
topics that the logged-in user is interested in.

Relates elastic#27552
@jpountz
Contributor

jpountz commented Jun 5, 2018

FYI I submitted a proposal for a new field type called feature_vector at #31102, which should address @gkinsman's and one of @gibrown's use-cases: the ability to weigh by tag/topic.
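Roughly, per the proposal, the field is mapped as feature_vector, documents carry a hash of feature names to positive values, and each feature is queried through the feature query with a dotted path - a sketch, see #31102 for the exact syntax:

{
   "properties": {
      "topics": { "type": "feature_vector" }
   }
}

{
   "topics": {
      "politics": 12,
      "economics": 20,
      "renewable_energies": 35
   }
}

{
   "query": {
      "feature": { "field": "topics.politics" }
   }
}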

@rpedela

rpedela commented Jun 5, 2018

@jpountz Is there an advantage or difference between using feature_vector instead of function_score?

@jpountz
Contributor

jpountz commented Jun 5, 2018

Yes, feature_vector is expected to perform better at computing the top matches of a query than the same thing implemented via function_score.

Scoring functions are black boxes to function_score, so it has no choice but to look at all matches and compute their scores in order to find the top-k matches. On the other hand, these new feature and feature_vector fields are designed to integrate well with Lucene 8.0's implementation of block-max WAND, an algorithm that helps skip matches that cannot produce competitive scores.

The basic idea is that if you search for a OR b, then the score is the sum of the score contributions of a and b. So if the maximum score contribution of a is, say, 2.5 and the k-th best hit that you have collected so far has a score greater than or equal to 2.5, then you won't miss any competitive hits if you dynamically mutate this query into a AND b, which is much more efficient since you can skip documents that don't match both terms.

What these new feature/feature_vector fields do is index features into the term frequency of some terms, so that boosting by a feature can be implemented under the hood by adding a term query for that feature as a SHOULD clause, just with a different similarity (the thing that computes scores). Block-max WAND can then leverage it as it would any other term query.
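Concretely, boosting by a feature then just looks like an extra SHOULD clause next to the regular text query, and its contribution is summed with the bm25 score (field names and query text here are only examples):

{
   "query": {
      "bool": {
         "must": [
            { "match": { "description": "wireless headphones" } }
         ],
         "should": [
            { "feature": { "field": "pagerank" } }
         ]
      }
   }
}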

jpountz added a commit that referenced this issue Jun 7, 2018
This field is similar to the `feature` field but is better suited to indexing
sparse feature vectors. A use-case for this field could be to record topics
associated with every document alongside a metric that quantifies how well
the topic is connected to this document, and then boost queries based on the
topics that the logged-in user is interested in.

Relates #27552
@jpountz
Contributor

jpountz commented Aug 16, 2018

Closing in favor of the upcoming feature_vector (#31102) and feature (#30618) fields.

@jpountz jpountz closed this as completed Aug 16, 2018