Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add checksum option for index.shard.check_on_startup #9183

Closed
wants to merge 3 commits into from

Conversation

rmuir
Copy link
Contributor

@rmuir rmuir commented Jan 7, 2015

The current "checkindex" on startup is very very expensive. This is
like running one of the old school hard drive diagnostic checkers and
usually not a good idea.

But we can do a CRC32 verification of files. We don't even need to
open an indexreader to do this, its much more lightweight.

This patch also sets it as the default. I think its a really good idea,
but we don't have to do that here.

No tests yet.

The current "checkindex" on startup is very very expensive. This is
like running one of the old school hard drive diagnostic checkers and
usually not a good idea.

But we can do a CRC32 verification of files. We don't even need to
open an indexreader to do this, its much more lightweight.

This patch also sets it as the default. I think its a really good idea,
but we don't have to do that here.

No tests yet.
@rmuir
Copy link
Contributor Author

rmuir commented Jan 7, 2015

Note: the other idea i had was to only do it by default when old segment files are present. This is more complicated but maybe a better default.

@s1monw
Copy link
Contributor

s1monw commented Jan 7, 2015

I like the idea a lot we might even be able to keep it on by default?

@rjernst
Copy link
Member

rjernst commented Jan 7, 2015

LGTM. +1 to having it as the default

@rmuir
Copy link
Contributor Author

rmuir commented Jan 7, 2015

Mainly what i want to catch here is the situation where there is problems on upgrade. Otherwise problems may not be found until later when its harder to deal with.

Note I discovered some issues for 1.x branch here (assuming the code is the same). We need to revisit logic in #9142 and either also handle it in these Store verification methods, which are already used in other places in the code, or just refactor it to handle it in a different place in the Store code (e.g. when we know the old checksum is useless, null it out in metadata so nothing goes wrong).

@rmuir
Copy link
Contributor Author

rmuir commented Jan 7, 2015

I think i found another existing bug? if you use this option today, it will leak a lock because it does not close CheckIndex.

@s1monw
Copy link
Contributor

s1monw commented Jan 7, 2015

I think i found another existing bug? if you use this option today, it will leak a lock because it does not close CheckIndex.

good catch I fixed this in other places to :)

@kimchy
Copy link
Member

kimchy commented Jan 7, 2015

regarding setting it by default, I would love to run a quick test on, lets say, a 200gb index (on AWS for example) and see how long it takes to do the checksum checks? Historically, ES used to do it (though it was more expensive computationally wise), and it slowed down full cluster restart by hours :(. I didn't know back then if the time was spent for computation, or just reading a lot of data, the cluster in question was quite big, almost a petabyte of data. So it would be good just to have an idea about the cost of it.

@rmuir
Copy link
Contributor Author

rmuir commented Jan 11, 2015

I don't want to argue about defaults, lets just turn it off. I at least need this option to improve testing.

@s1monw
Copy link
Contributor

s1monw commented Jan 11, 2015

@rmuir do me a favor and go ahead with false in this PR but please open an issue to rethink this default. I really would recommend this to be set to true to anybody running ES in production. If somebody needs to disable it because they have a slow IO system or a suuper huge index that is OK. I think the right logging here is important such that we you are seeing the progress of what is happeneing so if it's really really slow folks can go back an disable it? I don't like to set defaults based on some extreme usecase that could be slow. but lets discuss this on a different issue.

@rmuir
Copy link
Contributor Author

rmuir commented Jan 11, 2015

Yeah, i just want to turn it on in tests. especially static backwards index ones. honestly we don't even need to publicize/document the option yet. we can defer all of that.

@rmuir
Copy link
Contributor Author

rmuir commented Jan 12, 2015

I turned this off by default, but enabled it in tests. I think its important we make this step because again, the existing code leaks write.lock etc.

@rjernst
Copy link
Member

rjernst commented Jan 16, 2015

Latest changes LGTM.

@rmuir rmuir closed this in 0277300 Feb 3, 2015
@clintongormley clintongormley changed the title add 'checksum' option for index.shard.check_on_startup Add checksum option for index.shard.check_on_startup Jun 8, 2015
@clintongormley clintongormley added >feature :Core/Infra/Settings Settings infrastructure and APIs labels Jun 8, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Settings Settings infrastructure and APIs >feature v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants