Feature Request: pre-select terms in TermVector request #3924

mhoffman · 2013-10-16T21:19:58Z

Dear All

I have a feature request regarding the TermVector API. I was really happy to see this commit, which I had fledgingly written as a plugin before. Thanks @brwe ! Though, would it be possible to submit a list of terms and only have the TermVector returned for those? My hope is that it's considerably faster than the full request. I have a use case, where I know the terms for which I need the information before making the request.

Some pointers of how to do it myself are appreciated, too, though I am afraid my solution won't be as efficient.

Best and many, many thanks for the great work.
Max.

brwe · 2013-10-17T11:30:01Z

Thanks! I am glad to hear that the term vector api is useful. I want to add that @bleskes deserves at least half the credit!

There is another issue related to gathering statistics on terms(#3920). Is this similar to what you need?
Could you describe your use case?

mhoffman · 2013-10-17T15:42:37Z

Thanks for the quick response. So big shoutout to @bleskes, too: Thank you! No in my case I am actually after the precise offset, but typically only of a few terms (say 2-5). To describe this in general terms: I am doing a distance based scoring for things in the text. Can I send you a prototype privately?

brwe · 2013-10-19T13:26:43Z

If you by privately mean that you want to push a branch to your repo without pull request and then discuss it here - sure! Just link the commit or branch here.

mhoffman · 2013-10-22T05:04:02Z

Thanks, for taking the time. I have tried to do that in a low-brow way and the result is at
User/Project@SHA: mhoffman/elasticsearch@5ed795c .

Now, I have two problems so far:

There is some speed-up compared to a full termvector, but it is less than I thought
If the selected term(s) is not in the document, elasticsearch returns a 500, EOFException. Which is probably not desirable.

Any suggestions?

brwe · 2013-10-22T17:02:31Z

Thanks for the commit! I assumed you wanted to discuss e7c1e9e, since 5ed795c does some percolator things, right? Let me know if I am mistaken.

About the speed-up: I could imagine the speedup is not as great as expected because all term info is loaded from disc no matter how many of terms you actually return. Disc IO probably influences the performance most. I still think the change is useful for large documents with many terms when only few are requested.
What kind of documents did you use for measuring the speedup?

About the EOF: We should decide if a requested term should be returned with frequency 0 if it is not in the doc, or not return missing terms at all. I prepared a commit fixing the EOF for the former and added a comment for the latter, just so that you know where the changes have to be made here: 27bc8fd

What do you think: Return the term with frequency 0 or not return it at all if it is not the document?

mhoffman · 2013-10-22T17:45:21Z

Thanks for the corrections! Sorry, about the confusion with the commit link. I must have gotten something wrong with linking into user-specific commits. I am also a bit confused how this 5ed795c you link to comes turn out to be there? I definitely didn't intend to touch any percolator stuff. Is it coincidence that the first 7 characters match?

Anyhow, you seem to have gotten the right commit and in my use-case both solutions work equally well. Though one could argue that given that Disk IO will dominate the runtime of this request anyways, returning a 0 would simply make the result a bit more self-explaining.

Thanks again.

brwe · 2013-10-23T12:47:47Z

Would you like to make a pull request for your changes?

mhoffman · 2013-10-25T07:16:55Z

Sorry for the delay. Why don't you go ahead since you have the last version
in your repo already.

2013/10/23 Britta Weber notifications@github.com

Would you like to make a pull request for your changes?

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/3924#issuecomment-26901527
.

Max J. Hoffmann
Tel: +4989 289 13807 (office)
Room CH62115
TU München
Lichtenbergstr. 4
D-85747 Garching

brwe · 2013-10-25T13:56:39Z

ok, I'll do that.

…req 0 If specific terms are requested, they should be returned with freq 0 if they do not appear in the document. To do this efficiently, we compute a sorted list of terms and then intersect this list with the terms in the documents when writing the terms. This commit changes an inconsistency of uri parameter handling: Previously, when selected fields were given in the uri parameters, these fields were added to the list of selected terms. This does not make sense since it is not the other way round. Now, uri parameters are always overwritten by the parameters given in the body. closes elastic#3924 for single term vector request

…ector api uri parameters were not all parsed for the multi term vector request. This commit makes sure that all parameters are parsed and used when creating the requests for the multi term vector request. In order to simplify both code and json request, the request structure now allows two ways to use multi term vectors: 1. Give all parameters for each document requested in the docs array like this: ``` { "docs": [ { "_index": "testidx", "_type": "test", "_id": "2", "terms": [ "fox" ], "term_statistics": true }, { "_index": "testidx", "_type": "test", "_id": "1", "terms": [ "quick", "brown" ], "term_statistics": false } ] } ``` 2. Define a list of ids and give parameters in a separate parameters object like this: ``` { "ids": [ "1", "2" ], "parameters": { "_index": "testidx", "_type": "test", "terms": [ "brown" ] } } ``` uri parameters are global parameters that are set for both cases. They are overwritten by parameter definitions in the body. closes elastic#3924 for multi term vector api

brwe · 2013-11-13T17:48:36Z

About the speedup: With the current implementation, we load the whole term vector for a document. This makes sense if you need all the terms or do not know in advance which term is requested but also makes it slow.
For pre-selected terms the current implementation (see commits above) we also load the term vector and only keep the terms that we are interested in. I am now wondering if this makes sense at all for pre-selected terms. We do not need the full term vector and could also get all the information from the DocsEnum of the terms - this might be quicker than what we do right now.
Summoning @s1monw here because I am unsure if it really is quicker in this case.

Also, please take a look at pr #4161. It implements access to term vectors in a script. I implemented it wrong there as well (load term vectors and get the information from there instead from the standard DocsEnum) but I can fix that.
If the script term vector access gives you all you need, than maybe you do not even need the pre-selected term vectors in the _termvector api anymore?

Also, sorry for the delay.

brwe · 2014-01-02T10:32:42Z

I just pushed the script api for term statistics (pr #4161) Could you check if this allows you to do all you need?

mhoffman · 2014-02-09T21:37:02Z

Hi

Sorry for the slow reply. The function_score query seems to work as the
docs promise (at least using 'mvel' and 'native'). Though my actual
problem, that made me look for term vectors etc., is still too complex for
this query type: I would need to access the term vector of the parent of
the object that I am scoring with the script. I guess this notion was
expressed before in #1071 and #761 but apparently easier said than done.

I think this could still be a nice feature and since parent/child object
follow the same routing shouldn't all too complicated. Any comments?

2014-01-02 11:33 GMT+01:00 Britta Weber notifications@github.com:

I just pushed the script api for term statistics (pr #4161 #4161)
Could you check if this allows you to do all you need?

Reply to this email directly or view it on GitHubhttps://github.com//issues/3924#issuecomment-31445583
.

Max J. Hoffmann
Tel: +4989 289 13807 (office)
Room CH62115
TU München
Lichtenbergstr. 4
D-85747 Garching

brwe · 2014-02-12T10:00:45Z

That would call for a different issue.

Just to be sure: Do you need both parent and child statistics in the same script? Can you elaborate a little on what exactly you need maybe with a small example?

mhoffman · 2014-02-15T17:40:11Z

Just played with the script_score feature and thinks this allows a really
nice developing workflow. Sweet!

My parent/child use case is the following (I hope this makes sense). In my
application I store my data with parents and children and each parent has
many children (~100). The children are very small in size (4 fields, short
strings each) but the parent has one large field (~1e6 characters). So from
an index size point of view it makes sense to keep the big field with the
parent once instead of with each of the many children

Then the 'functions' part of the 'function_score' request could look like
below, and it is run on the children doc_type but the '_parent' refers to
the respective parent. I think this could make sense since (if I understand
correctly) parent and child are always routed to the same node. I would
like to have a look into this, but it might take me a couple of days.

'functions' : [{
'script_score': {
"params" : {
'pos': 520,
'terms': ['this','system'],
},
'script': """
score = 0;
for (term: terms) {
offsets = _parent['body'].get(term,
_OFFSETS | _CACHE);
for (offset: offsets){
score = score + exp(-(pos -
offset.startOffset)**2 * (-.0001));
}
}
score

                        """,
                                    }
                }],

2014-02-12 11:01 GMT+01:00 Britta Weber notifications@github.com:

That would call for a different issue.

Just to be sure: Do you need both parent and child statistics in the same
script? Can you elaborate a little on what exactly you need maybe with a
small example?

Reply to this email directly or view it on GitHubhttps://github.com//issues/3924#issuecomment-34854215
.

Max J. Hoffmann
Tel: +4989 289 13807 (office)
Room CH62115
TU München
Lichtenbergstr. 4
D-85747 Garching

brwe · 2014-05-06T06:13:01Z

Hi,

sorry for the late reply. Can you check if the has_parent query together with the function score would solve the problem?

Using several documents within a script for computing the score is not supported yet and would need a new issue. Let me know if I can close this one. Thanks!

mhoffman · 2014-05-27T14:52:09Z

Looks good to me.
Am 06.05.2014 08:13 schrieb "Britta Weber" notifications@github.com:

Hi,

sorry for the late reply. Can you check if the has_parent query together
with the function score would solve the problem?

Using several documents within a script for computing the score is not
supported yet and would need a new issue. Let me know if I can close this
one. Thanks!

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/3924#issuecomment-42270734
.

brwe mentioned this issue Oct 17, 2013

Term Count on Search Results #3920

Closed

ghost assigned brwe Oct 29, 2013

kevinkluge added v1.1.0 and removed v1.0.1 labels Feb 11, 2014

s1monw added v1.2.0 and removed v1.1.0 labels Mar 12, 2014

brwe removed the v1.2.0 label May 6, 2014

clintongormley closed this as completed Aug 8, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature Request: pre-select terms in TermVector request #3924

Feature Request: pre-select terms in TermVector request #3924

mhoffman commented Oct 16, 2013

brwe commented Oct 17, 2013

mhoffman commented Oct 17, 2013

brwe commented Oct 19, 2013

mhoffman commented Oct 22, 2013

brwe commented Oct 22, 2013

mhoffman commented Oct 22, 2013

brwe commented Oct 23, 2013

mhoffman commented Oct 25, 2013

brwe commented Oct 25, 2013

brwe commented Nov 13, 2013

brwe commented Jan 2, 2014

mhoffman commented Feb 9, 2014

brwe commented Feb 12, 2014

mhoffman commented Feb 15, 2014

brwe commented May 6, 2014

mhoffman commented May 27, 2014

Feature Request: pre-select terms in TermVector request #3924

Feature Request: pre-select terms in TermVector request #3924

Comments

mhoffman commented Oct 16, 2013

brwe commented Oct 17, 2013

mhoffman commented Oct 17, 2013

brwe commented Oct 19, 2013

mhoffman commented Oct 22, 2013

brwe commented Oct 22, 2013

mhoffman commented Oct 22, 2013

brwe commented Oct 23, 2013

mhoffman commented Oct 25, 2013

brwe commented Oct 25, 2013

brwe commented Nov 13, 2013

brwe commented Jan 2, 2014

mhoffman commented Feb 9, 2014

brwe commented Feb 12, 2014

mhoffman commented Feb 15, 2014

brwe commented May 6, 2014

mhoffman commented May 27, 2014