Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: pre-select terms in TermVector request #3924

Closed
mhoffman opened this issue Oct 16, 2013 · 16 comments
Closed

Feature Request: pre-select terms in TermVector request #3924

mhoffman opened this issue Oct 16, 2013 · 16 comments
Assignees
Labels

Comments

@mhoffman
Copy link

Dear All

I have a feature request regarding the TermVector API. I was really happy to see this commit, which I had fledgingly written as a plugin before. Thanks @brwe ! Though, would it be possible to submit a list of terms and only have the TermVector returned for those? My hope is that it's considerably faster than the full request. I have a use case, where I know the terms for which I need the information before making the request.

Some pointers of how to do it myself are appreciated, too, though I am afraid my solution won't be as efficient.

Best and many, many thanks for the great work.
Max.

@brwe
Copy link
Contributor

brwe commented Oct 17, 2013

Thanks! I am glad to hear that the term vector api is useful. I want to add that @bleskes deserves at least half the credit!

There is another issue related to gathering statistics on terms(#3920). Is this similar to what you need?
Could you describe your use case?

@mhoffman
Copy link
Author

Thanks for the quick response. So big shoutout to @bleskes, too: Thank you! No in my case I am actually after the precise offset, but typically only of a few terms (say 2-5). To describe this in general terms: I am doing a distance based scoring for things in the text. Can I send you a prototype privately?

@brwe
Copy link
Contributor

brwe commented Oct 19, 2013

If you by privately mean that you want to push a branch to your repo without pull request and then discuss it here - sure! Just link the commit or branch here.

@mhoffman
Copy link
Author

Thanks, for taking the time. I have tried to do that in a low-brow way and the result is at
User/Project@SHA: mhoffman/elasticsearch@5ed795c .

Now, I have two problems so far:

  • There is some speed-up compared to a full termvector, but it is less than I thought
  • If the selected term(s) is not in the document, elasticsearch returns a 500, EOFException. Which is probably not desirable.

Any suggestions?

@brwe
Copy link
Contributor

brwe commented Oct 22, 2013

Thanks for the commit! I assumed you wanted to discuss e7c1e9e, since 5ed795c does some percolator things, right? Let me know if I am mistaken.

About the speed-up: I could imagine the speedup is not as great as expected because all term info is loaded from disc no matter how many of terms you actually return. Disc IO probably influences the performance most. I still think the change is useful for large documents with many terms when only few are requested.
What kind of documents did you use for measuring the speedup?

About the EOF: We should decide if a requested term should be returned with frequency 0 if it is not in the doc, or not return missing terms at all. I prepared a commit fixing the EOF for the former and added a comment for the latter, just so that you know where the changes have to be made here: 27bc8fd

What do you think: Return the term with frequency 0 or not return it at all if it is not the document?

@mhoffman
Copy link
Author

Thanks for the corrections! Sorry, about the confusion with the commit link. I must have gotten something wrong with linking into user-specific commits. I am also a bit confused how this 5ed795c you link to comes turn out to be there? I definitely didn't intend to touch any percolator stuff. Is it coincidence that the first 7 characters match?

Anyhow, you seem to have gotten the right commit and in my use-case both solutions work equally well. Though one could argue that given that Disk IO will dominate the runtime of this request anyways, returning a 0 would simply make the result a bit more self-explaining.

Thanks again.

@brwe
Copy link
Contributor

brwe commented Oct 23, 2013

Would you like to make a pull request for your changes?

@mhoffman
Copy link
Author

Sorry for the delay. Why don't you go ahead since you have the last version
in your repo already.

2013/10/23 Britta Weber notifications@github.com

Would you like to make a pull request for your changes?


Reply to this email directly or view it on GitHubhttps://github.com//issues/3924#issuecomment-26901527
.

Max J. Hoffmann
Tel: +4989 289 13807 (office)
Room CH62115
TU München
Lichtenbergstr. 4
D-85747 Garching

@brwe
Copy link
Contributor

brwe commented Oct 25, 2013

ok, I'll do that.

brwe added a commit to brwe/elasticsearch that referenced this issue Oct 29, 2013
…req 0

If specific terms are requested, they should be returned with freq 0 if they do not
appear in the document. To do this efficiently, we compute a sorted list of
terms and then intersect this list with the terms in the documents when writing
the terms.

This commit changes an inconsistency of uri parameter handling:
Previously, when selected fields were given in the uri parameters,
these fields were added to the list of selected terms. This does not make
sense since it is not the other way round. Now, uri parameters are always
overwritten by the parameters given in the body.

closes elastic#3924 for single term vector request
brwe added a commit to brwe/elasticsearch that referenced this issue Oct 29, 2013
…ector api

uri parameters were not all parsed for the multi term vector request. This commit
makes sure that all parameters are parsed and used when creating the requests for the
multi term vector request.

In order to simplify both code and json request, the request structure now allows
two ways to use multi term vectors:

1. Give all parameters for each document requested in the docs array like this:

```
{
   "docs": [
      {
         "_index": "testidx",
         "_type": "test",
         "_id": "2",
         "terms": [
            "fox"
         ],
         "term_statistics": true
      },
      {
         "_index": "testidx",
         "_type": "test",
         "_id": "1",
         "terms": [
            "quick",
            "brown"
         ],
         "term_statistics": false
      }
   ]
}
```

2. Define a list of ids and give parameters in a separate parameters object like this:

```
{
   "ids": [
      "1",
      "2"
   ],
   "parameters": {
      "_index": "testidx",
      "_type": "test",
      "terms": [
         "brown"
      ]
   }
}
```

uri parameters are global parameters that are set for both cases. They are overwritten
by parameter definitions in the body.

closes elastic#3924 for multi term vector api
@ghost ghost assigned brwe Oct 29, 2013
@brwe
Copy link
Contributor

brwe commented Nov 13, 2013

About the speedup: With the current implementation, we load the whole term vector for a document. This makes sense if you need all the terms or do not know in advance which term is requested but also makes it slow.
For pre-selected terms the current implementation (see commits above) we also load the term vector and only keep the terms that we are interested in. I am now wondering if this makes sense at all for pre-selected terms. We do not need the full term vector and could also get all the information from the DocsEnum of the terms - this might be quicker than what we do right now.
Summoning @s1monw here because I am unsure if it really is quicker in this case.

Also, please take a look at pr #4161. It implements access to term vectors in a script. I implemented it wrong there as well (load term vectors and get the information from there instead from the standard DocsEnum) but I can fix that.
If the script term vector access gives you all you need, than maybe you do not even need the pre-selected term vectors in the _termvector api anymore?

Also, sorry for the delay.

@brwe
Copy link
Contributor

brwe commented Jan 2, 2014

I just pushed the script api for term statistics (pr #4161) Could you check if this allows you to do all you need?

@mhoffman
Copy link
Author

mhoffman commented Feb 9, 2014

Hi

Sorry for the slow reply. The function_score query seems to work as the
docs promise (at least using 'mvel' and 'native'). Though my actual
problem, that made me look for term vectors etc., is still too complex for
this query type: I would need to access the term vector of the parent of
the object that I am scoring with the script. I guess this notion was
expressed before in #1071 and #761 but apparently easier said than done.

I think this could still be a nice feature and since parent/child object
follow the same routing shouldn't all too complicated. Any comments?

2014-01-02 11:33 GMT+01:00 Britta Weber notifications@github.com:

I just pushed the script api for term statistics (pr #4161#4161)
Could you check if this allows you to do all you need?

Reply to this email directly or view it on GitHubhttps://github.com//issues/3924#issuecomment-31445583
.

Max J. Hoffmann
Tel: +4989 289 13807 (office)
Room CH62115
TU München
Lichtenbergstr. 4
D-85747 Garching

@kevinkluge kevinkluge added v1.1.0 and removed v1.0.1 labels Feb 11, 2014
@brwe
Copy link
Contributor

brwe commented Feb 12, 2014

That would call for a different issue.

Just to be sure: Do you need both parent and child statistics in the same script? Can you elaborate a little on what exactly you need maybe with a small example?

@mhoffman
Copy link
Author

Just played with the script_score feature and thinks this allows a really
nice developing workflow. Sweet!

My parent/child use case is the following (I hope this makes sense). In my
application I store my data with parents and children and each parent has
many children (~100). The children are very small in size (4 fields, short
strings each) but the parent has one large field (~1e6 characters). So from
an index size point of view it makes sense to keep the big field with the
parent once instead of with each of the many children

Then the 'functions' part of the 'function_score' request could look like
below, and it is run on the children doc_type but the '_parent' refers to
the respective parent. I think this could make sense since (if I understand
correctly) parent and child are always routed to the same node. I would
like to have a look into this, but it might take me a couple of days.

'functions' : [{
'script_score': {
"params" : {
'pos': 520,
'terms': ['this','system'],
},
'script': """
score = 0;
for (term: terms) {
offsets = _parent['body'].get(term,
_OFFSETS | _CACHE);
for (offset: offsets){
score = score + exp(-(pos -
offset.startOffset)**2 * (-.0001));
}
}
score

                        """,
                                    }
                }],

2014-02-12 11:01 GMT+01:00 Britta Weber notifications@github.com:

That would call for a different issue.

Just to be sure: Do you need both parent and child statistics in the same
script? Can you elaborate a little on what exactly you need maybe with a
small example?

Reply to this email directly or view it on GitHubhttps://github.com//issues/3924#issuecomment-34854215
.

Max J. Hoffmann
Tel: +4989 289 13807 (office)
Room CH62115
TU München
Lichtenbergstr. 4
D-85747 Garching

@s1monw s1monw added v1.2.0 and removed v1.1.0 labels Mar 12, 2014
@brwe brwe removed the v1.2.0 label May 6, 2014
@brwe
Copy link
Contributor

brwe commented May 6, 2014

Hi,

sorry for the late reply. Can you check if the has_parent query together with the function score would solve the problem?

Using several documents within a script for computing the score is not supported yet and would need a new issue. Let me know if I can close this one. Thanks!

@mhoffman
Copy link
Author

Looks good to me.
Am 06.05.2014 08:13 schrieb "Britta Weber" notifications@github.com:

Hi,

sorry for the late reply. Can you check if the has_parent query together
with the function score would solve the problem?

Using several documents within a script for computing the score is not
supported yet and would need a new issue. Let me know if I can close this
one. Thanks!


Reply to this email directly or view it on GitHubhttps://github.com//issues/3924#issuecomment-42270734
.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants