Add `checksum` option for `index.shard.check_on_startup` #9183

Closed
wants to merge 3 commits into
from

Projects

None yet

5 participants

@rmuir
Contributor
rmuir commented Jan 7, 2015

The current "checkindex" on startup is very very expensive. This is
like running one of the old school hard drive diagnostic checkers and
usually not a good idea.

But we can do a CRC32 verification of files. We don't even need to
open an indexreader to do this, its much more lightweight.

This patch also sets it as the default. I think its a really good idea,
but we don't have to do that here.

No tests yet.

@rmuir rmuir add 'checksum' option for index.shard.check_on_startup
The current "checkindex" on startup is very very expensive. This is
like running one of the old school hard drive diagnostic checkers and
usually not a good idea.

But we can do a CRC32 verification of files. We don't even need to
open an indexreader to do this, its much more lightweight.

This patch also sets it as the default. I think its a really good idea,
but we don't have to do that here.

No tests yet.
db617c1
@rmuir
Contributor
rmuir commented Jan 7, 2015

Note: the other idea i had was to only do it by default when old segment files are present. This is more complicated but maybe a better default.

@s1monw
Contributor
s1monw commented Jan 7, 2015

I like the idea a lot we might even be able to keep it on by default?

@rjernst
Member
rjernst commented Jan 7, 2015

LGTM. +1 to having it as the default

@rmuir
Contributor
rmuir commented Jan 7, 2015

Mainly what i want to catch here is the situation where there is problems on upgrade. Otherwise problems may not be found until later when its harder to deal with.

Note I discovered some issues for 1.x branch here (assuming the code is the same). We need to revisit logic in #9142 and either also handle it in these Store verification methods, which are already used in other places in the code, or just refactor it to handle it in a different place in the Store code (e.g. when we know the old checksum is useless, null it out in metadata so nothing goes wrong).

@rmuir
Contributor
rmuir commented Jan 7, 2015

I think i found another existing bug? if you use this option today, it will leak a lock because it does not close CheckIndex.

@s1monw
Contributor
s1monw commented Jan 7, 2015

I think i found another existing bug? if you use this option today, it will leak a lock because it does not close CheckIndex.

good catch I fixed this in other places to :)

@kimchy
Member
kimchy commented Jan 7, 2015

regarding setting it by default, I would love to run a quick test on, lets say, a 200gb index (on AWS for example) and see how long it takes to do the checksum checks? Historically, ES used to do it (though it was more expensive computationally wise), and it slowed down full cluster restart by hours :(. I didn't know back then if the time was spent for computation, or just reading a lot of data, the cluster in question was quite big, almost a petabyte of data. So it would be good just to have an idea about the cost of it.

@rmuir
Contributor
rmuir commented Jan 11, 2015

I don't want to argue about defaults, lets just turn it off. I at least need this option to improve testing.

@s1monw
Contributor
s1monw commented Jan 11, 2015

@rmuir do me a favor and go ahead with false in this PR but please open an issue to rethink this default. I really would recommend this to be set to true to anybody running ES in production. If somebody needs to disable it because they have a slow IO system or a suuper huge index that is OK. I think the right logging here is important such that we you are seeing the progress of what is happeneing so if it's really really slow folks can go back an disable it? I don't like to set defaults based on some extreme usecase that could be slow. but lets discuss this on a different issue.

@rmuir
Contributor
rmuir commented Jan 11, 2015

Yeah, i just want to turn it on in tests. especially static backwards index ones. honestly we don't even need to publicize/document the option yet. we can defer all of that.

@rmuir
Contributor
rmuir commented Jan 12, 2015

I turned this off by default, but enabled it in tests. I think its important we make this step because again, the existing code leaks write.lock etc.

@rjernst
Member
rjernst commented Jan 16, 2015

Latest changes LGTM.

@rmuir rmuir added a commit that closed this pull request Feb 3, 2015
@rmuir rmuir core: add 'checksum' option for index.shard.check_on_startup
The current "checkindex" on startup is very very expensive. This is
like running one of the old school hard drive diagnostic checkers and
usually not a good idea.

But we can do a CRC32 verification of files. We don't even need to
open an indexreader to do this, its much more lightweight.

This option (as well as the existing true/false) are randomized in
tests to find problems.

Also fix bug where use of the current option would always leak
an indexwriter lock.

Closes #9183
0277300
@rmuir rmuir closed this in 0277300 Feb 3, 2015
@clintongormley clintongormley removed the review label Mar 19, 2015
@clintongormley clintongormley changed the title from add 'checksum' option for index.shard.check_on_startup to Add `checksum` option for `index.shard.check_on_startup` Jun 8, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment