Accessing fields of nested doc in custom score script may cause documents missing in query result #3056

Closed
junjun-zhang opened this issue May 17, 2013 · 3 comments

@junjun-zhang

I recently ran into a problem where documents go missing from the query result when a custom score script is used. After some testing, I found that the problem seems to occur when the script tries to access a field in a nested doc and a particular root document does not contain any such nested doc.

To reproduce the problem, test data and queries can be found here: http://goo.gl/iHOc5. The example may not make much sense in the real world, but the idea is to sort products by the average rate from users' reviews. One particular requirement is to always treat an anonymous user's rate as 3, and to assign a rate of 3 to products with no reviews.

We can determine whether a user is anonymous by checking whether the review.user.member_id field is empty: doc['review.user.member_id'].empty. This seems to work fine, except that products with no reviews are dropped from the result, as the first query example shows. Is this a bug? As there is no query/filter that excludes documents, shouldn't all documents be returned?

Also, there seems to be no way to determine whether a review exists. The second query example shows that doc['review'].empty does not work. This makes sense because there is indeed no such field as 'review' under the 'product' index; 'review' is a nested document. However, the question remains: is there a way to determine the existence of a nested doc?

@clintongormley

Hiya

This is a really interesting question (and thanks for the runnable gist!)

The issue is that the nested match_all only matches docs which have
nested docs. Really, you just care about whether a doc has any review by a
registered member, because all other products get the score of 3. In order
to do that, we need to expose review.user.member_id in the parent doc as
well, which we can do by adding include_in_parent: true to the nested mapping.
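
As a rough sketch of what such a mapping could look like, here it is built as a Python dict and serialized to JSON. Only the field paths (review.rate, review.user.member_id) and include_in_parent come from this thread; the "product" type name and the field types are assumptions, since the actual mapping lives in the gist.

```python
import json

# Hypothetical mapping: the "product" type name and the field types
# ("integer", "string") are assumptions, not taken from the gist.
product_mapping = {
    "product": {
        "properties": {
            "review": {
                "type": "nested",
                # Copies the nested fields into the root document as well,
                # so filters on the root doc can see review.user.member_id.
                "include_in_parent": True,
                "properties": {
                    "rate": {"type": "integer"},
                    "user": {
                        "properties": {
                            "member_id": {"type": "string"},
                        }
                    },
                },
            }
        }
    }
}

mapping_json = json.dumps(product_mapping, indent=2)
```

The resulting JSON would be sent as the body of a put-mapping request; existing documents would need to be reindexed for the flattened fields to appear in the parent.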

Then we write the query to match both docs with member reviews, whose
score is calculated by your script, and docs without member reviews, which
get a score of 3. For this we use a bool query with two should clauses.
Also, set disable_coord to true, so that the "pure" score from each clause
comes through. Otherwise the bool query would scale the score by the
fraction of clauses that matched (i.e. 1/2 here, because each doc matches
only one of the two clauses).

  "bool" : {
     "disable_coord" : 1,
     "should" : [
        { clause to match docs without member reviews },
        { clause to match docs with member reviews }
     ]
  }
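
To make the effect of disable_coord concrete, here is a deliberately simplified model of the coord factor (matched clauses divided by total clauses). It ignores query normalization and everything else the real scorer does, and the function names are made up for illustration.

```python
def coord(matched_clauses: int, total_clauses: int) -> float:
    """Fraction of the bool query's clauses that a document matched."""
    return matched_clauses / total_clauses


def bool_score(clause_scores, total_clauses, disable_coord):
    """Simplified bool score: sum of matching clause scores, optionally
    scaled by the coord factor. Query norm is ignored in this sketch."""
    raw = sum(clause_scores)
    if disable_coord:
        return raw
    return raw * coord(len(clause_scores), total_clauses)


# A product without member reviews matches only one of the two clauses.
with_coord = bool_score([3.0], total_clauses=2, disable_coord=False)
without_coord = bool_score([3.0], total_clauses=2, disable_coord=True)
```

In this model the intended score of 3 is halved when coord is left on, and comes through unchanged when it is disabled.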

So the clause WITH member reviews looks like the following:

  {
    "filtered" : {
      "filter" : {
        "exists" : {
          "field" : "review.user.member_id"
        }
      },
      "query" : {
        "nested" : {
          "query" : {
            "custom_score" : {
              "script" : "doc['review.user.member_id'].empty ? 3 : doc['review.rate'].value",
              "query" : {
                "match_all" : {}
              }
            }
          },
          "score_mode" : "avg",
          "path" : "review"
        }
      }
    }
  }

Then, the clause to match docs WITHOUT member reviews. Initially, I wrote this:

  {
    "constant_score" : {
      "boost" : 3,
      "filter" : {
        "missing" : {
          "field" : "review.user.member_id"
        }
      }
    }
  }

But the score of 3 was being combined with the query norm, and so the
clause was returning values like 0.9xxxx. Weirdly, the custom_score doesn't get
combined with the query norm. I'm not sure if that is intentional or not,
but that's the way it is. So the way to get a pure score of 3 from the
above is to wrap it in a custom_score query, and have the script just
return 3:

  {
    "custom_score" : {
      "script" : "3",
      "query" : {
        "constant_score" : {
          "filter" : {
            "missing" : {
              "field" : "review.user.member_id"
            }
          }
        }
      }
    }
  }
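
For reference, the two clauses can be put together roughly like this. It is a sketch assembled only from the fragments in this comment, written as Python dicts; the helper names are hypothetical, and the query in the gist may differ in detail.

```python
import json

MEMBER_FIELD = "review.user.member_id"
SCRIPT = "doc['review.user.member_id'].empty ? 3 : doc['review.rate'].value"


def with_member_reviews():
    """Clause for docs with at least one member review: scored by the script."""
    return {
        "filtered": {
            "filter": {"exists": {"field": MEMBER_FIELD}},
            "query": {
                "nested": {
                    "path": "review",
                    "score_mode": "avg",
                    "query": {
                        "custom_score": {
                            "script": SCRIPT,
                            "query": {"match_all": {}},
                        }
                    },
                }
            },
        }
    }


def without_member_reviews():
    """Clause for docs with no member reviews: fixed score of 3."""
    return {
        "custom_score": {
            "script": "3",
            "query": {
                "constant_score": {
                    "filter": {"missing": {"field": MEMBER_FIELD}}
                }
            },
        }
    }


def build_query():
    """Combine both clauses; disable_coord keeps each clause's score intact."""
    return {
        "query": {
            "bool": {
                "disable_coord": True,
                "should": [without_member_reviews(), with_member_reviews()],
            }
        }
    }


query_json = json.dumps(build_query(), indent=2)
```

The two filters (exists and missing on the same field) are mutually exclusive, so every product matches exactly one clause and none are dropped.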

The full query is here: https://gist.github.com/clintongormley/5604037

IMPORTANT: you're paying the cost of this calculation at query time, but
all the information is known at index time. A much better approach would
be to just calculate the avg rating when you index a document and store it
as a field: avg_rating. Then you can sort by that field.
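
A minimal sketch of that index-time calculation, assuming each product is a dict whose reviews carry a rate and an optional user.member_id (the document shape is an assumption based on the field names in this thread, and avg_rating() is a hypothetical helper):

```python
DEFAULT_RATE = 3.0  # rate for anonymous reviews and for products with no reviews


def avg_rating(reviews):
    """Average rate over reviews, counting anonymous reviews as DEFAULT_RATE.

    Products without any reviews also get DEFAULT_RATE.
    """
    if not reviews:
        return DEFAULT_RATE
    rates = []
    for review in reviews:
        member_id = (review.get("user") or {}).get("member_id")
        rates.append(float(review["rate"]) if member_id else DEFAULT_RATE)
    return sum(rates) / len(rates)


# Compute the field just before indexing the document.
product = {
    "name": "example product",
    "review": [
        {"rate": 5, "user": {"member_id": "m1"}},
        {"rate": 1, "user": {}},  # anonymous, counted as 3
    ],
}
product["avg_rating"] = avg_rating(product["review"])
```

With the value stored on the document, the query can simply sort on avg_rating and no script or nested query is needed at search time.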

clint

@btiernay

For the future reader, #3058 was created to address:

Weirdly, the custom_score doesn't get combined with the query norm. I'm not sure if that is intentional or not, but that's the way it is.

@clintongormley

Closed in favour of #3495
