Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Function Score: Refactor RandomScoreFunction to be consistent, and return values in range [0.0, 1.0] #7446

Closed
wants to merge 5 commits into from

Conversation

Projects
None yet
5 participants
@rjernst
Copy link
Member

commented Aug 25, 2014

RandomScoreFunction previously relied on the order the documents were
iterated in from Lucene. This caused changes in ordering, with the same
seed, if documents moved to different segments. With this change, a
murmur32 hash of the _uid for each document is used as the "random"
value. Also, the hash is adjusted so as to only return values between
0.0 and 1.0 to enable easier manipulation to fit into users' scoring
models.

closes #6907

FunctionScore: Refactor RandomScoreFunction to be consistent, and ret…
…urn values in rang [0.0, 1.0]

RandomScoreFunction previously relied on the order the documents were
iterated in from Lucene. This caused changes in ordering, with the same
seed, if documents moved to different segments. With this change, a
murmur32 hash of the _uid for each document is used as the "random"
value. Also, the hash is adjusted so as to only return values between
0.0 and 1.0 to enable easier manipulation to fit into users' scoring
models.

closes #6907
@@ -160,7 +160,7 @@

# Path to log files:
#
#path.logs: /path/to/logs
path.logs: logs

This comment has been minimized.

Copy link
@jpountz

jpountz Aug 25, 2014

Contributor

leftover?

This comment has been minimized.

Copy link
@rjernst

rjernst Aug 25, 2014

Author Member

Doh! Yes, will remove.

@@ -123,7 +126,12 @@ public FieldDataType defaultFieldDataType() {
}

@Override
protected String defaultPostingFormat() {
protected String defaultPostingFormat() {

This comment has been minimized.

Copy link
@rjernst

rjernst Aug 25, 2014

Author Member

Doh! More leftover stuff, I will remove shortly.

rjernst added some commits Aug 25, 2014

@rjernst

This comment has been minimized.

Copy link
Member Author

commented Aug 25, 2014

Sorry about that @jpountz I have removed those extraneous changes.

@@ -28,7 +28,7 @@
*/
public class RandomScoreFunctionBuilder implements ScoreFunctionBuilder {

private Long seed = null;
private Integer seed = null;

This comment has been minimized.

Copy link
@spinscale

spinscale Aug 26, 2014

Member

is there a particular reason (which I fail to see obviously :-) why this is an object and not a regular int?

This comment has been minimized.

Copy link
@jpountz

jpountz Aug 26, 2014

Contributor

I think it is to be able to do if (seed != null) in the toXContent method?

This comment has been minimized.

Copy link
@spinscale

spinscale Aug 26, 2014

Member

makes sense, I rest my case.. not sure if the noargs constructor makes sense, if that one was gone, the seed would never be empty

This comment has been minimized.

Copy link
@rjernst

rjernst Aug 26, 2014

Author Member

I just went with what was there. I think having the option to not supply the seed (ie you don'g care about reproducing, you just want some randomness) is a good option to keep.

This comment has been minimized.

Copy link
@harmsk

harmsk Nov 17, 2014

Is there a reason this was changed from a long to an int? In 1.4 I can no longer use this function because the seed I was using matches data that is only 64 bits.

This comment has been minimized.

Copy link
@rjernst

rjernst Nov 18, 2014

Author Member

@harmsk This was due to using a 32 bit hash, however it was fixed so longs (as well as strings) work later in #8311

createIndex("test");
ensureGreen();
index("test", "type", "1", jsonBuilder().startObject().endObject());
flush();

This comment has been minimized.

Copy link
@jpountz

jpountz Aug 26, 2014

Contributor

flush should not be necessary, only refresh?

This comment has been minimized.

Copy link
@jpountz

jpountz Aug 26, 2014

Contributor

Oh I see this suite did it for every test already

@jpountz

This comment has been minimized.

Copy link
Contributor

commented Aug 26, 2014

The diff looks good to me but I'm wondering that we should enable doc values by default on _uid before merging such a change?

@jpountz jpountz removed the review label Aug 26, 2014

@rjernst

This comment has been minimized.

Copy link
Member Author

commented Aug 26, 2014

@jpountz That's the assumption that I was waiting on this fix for all this time. However, this still brings improvements that I believe are important to get into master (2 bug fixes and an simplification of return values). The caveat of course is the cost of pulling _uid into field data, but I think that is just something to be improved upon in the future, rather than continuing with the current broken behavior.

@jpountz

This comment has been minimized.

Copy link
Contributor

commented Aug 26, 2014

Then let's add some documentation to make clear that this feature relies on fielddata of the _uid field?

@rjernst

This comment has been minimized.

Copy link
Member Author

commented Aug 27, 2014

@jpountz I modified the docs for random_score a little to mention uid. Let me know if that is what you had in mind, or if you wanted something more.

@rjernst rjernst added the review label Aug 27, 2014

@jpountz

This comment has been minimized.

Copy link
Contributor

commented Aug 27, 2014

This is what I had in mind. Maybe also put a warning that it will load field data for the _uid field (which can be very memory-intensive given that all values are unique). Otherwise LGTM (feel free to push without asking for further review from my end).

@jpountz jpountz removed the review label Aug 27, 2014

rjernst added a commit that referenced this pull request Aug 27, 2014

FunctionScore: Refactor RandomScoreFunction to be consistent, and ret…
…urn values in rang [0.0, 1.0]

RandomScoreFunction previously relied on the order the documents were
iterated in from Lucene. This caused changes in ordering, with the same
seed, if documents moved to different segments. With this change, a
murmur32 hash of the _uid for each document is used as the "random"
value. Also, the hash is adjusted so as to only return values between
0.0 and 1.0 to enable easier manipulation to fit into users' scoring
models.

closes #6907, #7446

rjernst added a commit that referenced this pull request Aug 27, 2014

FunctionScore: Refactor RandomScoreFunction to be consistent, and ret…
…urn values in rang [0.0, 1.0]

RandomScoreFunction previously relied on the order the documents were
iterated in from Lucene. This caused changes in ordering, with the same
seed, if documents moved to different segments. With this change, a
murmur32 hash of the _uid for each document is used as the "random"
value. Also, the hash is adjusted so as to only return values between
0.0 and 1.0 to enable easier manipulation to fit into users' scoring
models.

closes #6907, #7446

@rjernst rjernst closed this Aug 27, 2014

rjernst added a commit that referenced this pull request Sep 8, 2014

FunctionScore: Refactor RandomScoreFunction to be consistent, and ret…
…urn values in rang [0.0, 1.0]

RandomScoreFunction previously relied on the order the documents were
iterated in from Lucene. This caused changes in ordering, with the same
seed, if documents moved to different segments. With this change, a
murmur32 hash of the _uid for each document is used as the "random"
value. Also, the hash is adjusted so as to only return values between
0.0 and 1.0 to enable easier manipulation to fit into users' scoring
models.

closes #6907, #7446

@clintongormley clintongormley changed the title FunctionScore: Refactor RandomScoreFunction to be consistent, and return values in range [0.0, 1.0] Function Score: Refactor RandomScoreFunction to be consistent, and return values in range [0.0, 1.0] Sep 11, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.