New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose MMapDirectory.preLoad(). #18880

Merged
merged 1 commit into from Jun 20, 2016

Conversation

Projects
None yet
4 participants
@jpountz
Contributor

jpountz commented Jun 15, 2016

The MMapDirectory has a switch that allows the content of files to be loaded
into the filesystem cache upon opening. This commit exposes it with the new
index.store.pre_load setting.

@jpountz

This comment has been minimized.

Show comment
Hide comment
@jpountz

jpountz Jun 15, 2016

Contributor

Notes/questions to reviewers:

  • this removes the deprecated default fs to keep things simple
  • should the setting be index.store.pre_load or index.store.fs.pre_load?
  • currently the setting has a default value of false, should we consider enabling it by default to give better search performance on (re)starts and avoid cold starts?
Contributor

jpountz commented Jun 15, 2016

Notes/questions to reviewers:

  • this removes the deprecated default fs to keep things simple
  • should the setting be index.store.pre_load or index.store.fs.pre_load?
  • currently the setting has a default value of false, should we consider enabling it by default to give better search performance on (re)starts and avoid cold starts?

@jpountz jpountz added the review label Jun 15, 2016

@clintongormley

This comment has been minimized.

Show comment
Hide comment
@clintongormley

clintongormley Jun 15, 2016

Member

I'd vote for index.store.preload (without the _)

Member

clintongormley commented Jun 15, 2016

I'd vote for index.store.preload (without the _)

@jpountz

This comment has been minimized.

Show comment
Hide comment
@jpountz

jpountz Jun 15, 2016

Contributor

Something else that I forgot to mention: only the terms dictionary, norms, doc values and points will be preloaded. I think this is important to not waste some file system cache on stored fields or term vectors.

Contributor

jpountz commented Jun 15, 2016

Something else that I forgot to mention: only the terms dictionary, norms, doc values and points will be preloaded. I think this is important to not waste some file system cache on stored fields or term vectors.

@rmuir

This comment has been minimized.

Show comment
Hide comment
@rmuir

rmuir Jun 15, 2016

Contributor

I had in mind two cases for the preload option in Lucene:

  1. better option than new RAMDirectory(FSDirectory) for folks that want to load the entire index into RAM. This is a common thing people want, they think they are smarter than the OS, but then use RAMDir? that's not smarter :) At least this is going to be better than RAMDir if someone is hell-bent on doing that.
  2. more efficient warming than e.g. running a bunch of queries or other things. This could be a concern for those random access datastructures, I agree.

It will touch the first byte of every page. This is still useful as it knows platform-specific crap like page size to do this. Additionally, on linux, before it does that touching, it will call madvise(WILL_NEED).

I'm not sure if we should do this by default: its a fairly serious hammer? Basically you are saying "I think i know better than the operating system" by using this. I worry about doing it by default in cases where these index portions are significantly larger than available RAM, or where the index is considered "cold" by the user anyway. In those cases we could simply cause a ton of unnecessary I/O for no good reason.

But at the same time, it could be useful for some of these random access data structures (like norms/docvalues) where traditionally they were loaded completely into the heap anyway. Like a middle ground?

Just thoughts, i don't really have strong opinions either way. Maybe we could do some kind of test...

Contributor

rmuir commented Jun 15, 2016

I had in mind two cases for the preload option in Lucene:

  1. better option than new RAMDirectory(FSDirectory) for folks that want to load the entire index into RAM. This is a common thing people want, they think they are smarter than the OS, but then use RAMDir? that's not smarter :) At least this is going to be better than RAMDir if someone is hell-bent on doing that.
  2. more efficient warming than e.g. running a bunch of queries or other things. This could be a concern for those random access datastructures, I agree.

It will touch the first byte of every page. This is still useful as it knows platform-specific crap like page size to do this. Additionally, on linux, before it does that touching, it will call madvise(WILL_NEED).

I'm not sure if we should do this by default: its a fairly serious hammer? Basically you are saying "I think i know better than the operating system" by using this. I worry about doing it by default in cases where these index portions are significantly larger than available RAM, or where the index is considered "cold" by the user anyway. In those cases we could simply cause a ton of unnecessary I/O for no good reason.

But at the same time, it could be useful for some of these random access data structures (like norms/docvalues) where traditionally they were loaded completely into the heap anyway. Like a middle ground?

Just thoughts, i don't really have strong opinions either way. Maybe we could do some kind of test...

@jpountz

This comment has been minimized.

Show comment
Hide comment
@jpountz

jpountz Jun 15, 2016

Contributor

Thanks @rmuir for the feedback. I changed the setting to accept a list of extensions to preload so that users who want to load whole indices into memory can do it without requiring us to have a dedicated setting for it. Also the default value is now to preload only norms and doc values, and I improved the documentation to describe how to disable this behaviour for cold indices or preload more files for hot indices.

Contributor

jpountz commented Jun 15, 2016

Thanks @rmuir for the feedback. I changed the setting to accept a list of extensions to preload so that users who want to load whole indices into memory can do it without requiring us to have a dedicated setting for it. Also the default value is now to preload only norms and doc values, and I improved the documentation to describe how to disable this behaviour for cold indices or preload more files for hot indices.

@uschindler

This comment has been minimized.

Show comment
Hide comment
@uschindler

uschindler Jun 17, 2016

Contributor

-1 to enable by default. That's horrible!

Contributor

uschindler commented Jun 17, 2016

-1 to enable by default. That's horrible!

@jpountz

This comment has been minimized.

Show comment
Hide comment
@jpountz

jpountz Jun 17, 2016

Contributor

Fair enough, I changed the default back so that it does not pre-load anything.

Contributor

jpountz commented Jun 17, 2016

Fair enough, I changed the default back so that it does not pre-load anything.

@uschindler

This comment has been minimized.

Show comment
Hide comment
@uschindler

uschindler Jun 17, 2016

Contributor

Thanks Adrien. Maybe others have a different opinion, but those settings have a large effect on opening indexes. It also depends what type of system you have: For some logstash environment its different than an online shop regarding delays while opening new segments.

+1 to make it an array setting. Much cleaner!

Contributor

uschindler commented Jun 17, 2016

Thanks Adrien. Maybe others have a different opinion, but those settings have a large effect on opening indexes. It also depends what type of system you have: For some logstash environment its different than an online shop regarding delays while opening new segments.

+1 to make it an array setting. Much cleaner!

@uschindler

This comment has been minimized.

Show comment
Hide comment
@uschindler

uschindler Jun 17, 2016

Contributor

For the example Robert brought in: People who want to load everything - do we have a shortcut for that? (RAMDirectory replacment)

Contributor

uschindler commented Jun 17, 2016

For the example Robert brought in: People who want to load everything - do we have a shortcut for that? (RAMDirectory replacment)

@jpountz

This comment has been minimized.

Show comment
Hide comment
@jpountz

jpountz Jun 17, 2016

Contributor

Indeed it is probably safer to have it empty by default. Then we can work on documentation, or maybe even a "how to tune for search speed" post like @mikemccand did for indexing.

There is no shortcut for loading everything into memory, do you think there should be one? I'm wondering it might be better to not make it too easy?

Contributor

jpountz commented Jun 17, 2016

Indeed it is probably safer to have it empty by default. Then we can work on documentation, or maybe even a "how to tune for search speed" post like @mikemccand did for indexing.

There is no shortcut for loading everything into memory, do you think there should be one? I'm wondering it might be better to not make it too easy?

@uschindler

This comment has been minimized.

Show comment
Hide comment
@uschindler

uschindler Jun 17, 2016

Contributor

There is no shortcut for loading everything into memory, do you think there should be one? I'm wondering it might be better to not make it too easy?

I agree it would make it too easy. But on the other hand for those people that really want everything in RAM, there is no clean way to do it - because you would need to list every (maybe not yet known) file extension. My first idea would have been a glob pattern like for indexes - "*" as setting matches every file.

Contributor

uschindler commented Jun 17, 2016

There is no shortcut for loading everything into memory, do you think there should be one? I'm wondering it might be better to not make it too easy?

I agree it would make it too easy. But on the other hand for those people that really want everything in RAM, there is no clean way to do it - because you would need to list every (maybe not yet known) file extension. My first idea would have been a glob pattern like for indexes - "*" as setting matches every file.

@jpountz

This comment has been minimized.

Show comment
Hide comment
@jpountz

jpountz Jun 17, 2016

Contributor

@uschindler I pushed a new commit that adds support for "*" in order to load everything into memory.

Contributor

jpountz commented Jun 17, 2016

@uschindler I pushed a new commit that adds support for "*" in order to load everything into memory.

@uschindler

This comment has been minimized.

Show comment
Hide comment
@uschindler

uschindler Jun 17, 2016

Contributor

Looks fine. Let's see what @rmuir says!

Contributor

uschindler commented Jun 17, 2016

Looks fine. Let's see what @rmuir says!

@rmuir

This comment has been minimized.

Show comment
Hide comment
@rmuir

rmuir Jun 17, 2016

Contributor

I don't think we should differ from lucene's defaults here. If there are some interesting numbers, then we can think about what to do.

But lets provide a less horrible alternative to RAMDirectory for stubborn people at least.

Contributor

rmuir commented Jun 17, 2016

I don't think we should differ from lucene's defaults here. If there are some interesting numbers, then we can think about what to do.

But lets provide a less horrible alternative to RAMDirectory for stubborn people at least.

Expose MMapDirectory.preLoad(). #18880
The MMapDirectory has a switch that allows the content of files to be loaded
into the filesystem cache upon opening. This commit exposes it with the new
`index.store.pre_load` setting.

@jpountz jpountz merged commit 93415d4 into elastic:master Jun 20, 2016

1 check passed

CLA Commit author is a member of Elasticsearch
Details

@jpountz jpountz deleted the jpountz:feature/expose_mmapdirectory_preload branch Jun 20, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment