
Remove IndexCloseListener & Store.OnCloseListener #9009

Closed
bleskes wants to merge 1 commit from bleskes:delete_shard_listeners

Conversation

@bleskes (Member) commented Dec 19, 2014

#8436 introduced shard-level locks in order to prevent directory reuse during fast deletion & creation of indices. As part of that change, close listeners were introduced to delete the folders once all outstanding references were released. The change has created race conditions causing shard folders not to be deleted (causing test failures due to leftover corruption markers). This commit removes the listeners in favour of a simple timeout-based solution to be used until a better listener-based solution is ready (#8608).

Example test failure can be found at: http://build-us-00.elasticsearch.org/job/es_core_master_metal/6120/
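For context, here is a minimal sketch of what a timeout-based deletion could look like. It is purely illustrative: TimeoutShardDeleter, deleteShardData and the lock handling are hypothetical stand-ins, not the actual Elasticsearch implementation. The idea is to try to acquire the shard-level lock for a bounded time, delete the shard directory if the lock is obtained, and otherwise leave the data in place.

```java
// Hypothetical sketch only: a timeout-based shard deletion replacing the close
// listeners. Class and method names are made up and do not correspond to the
// actual Elasticsearch code.
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TimeoutShardDeleter {

    // Stand-in for the shard-level lock that #8436 introduced.
    private final ReentrantLock shardLock = new ReentrantLock();

    /**
     * Tries to acquire the shard lock within the given timeout and, on success,
     * deletes the shard's data directory. Returns false (and deletes nothing)
     * if the lock could not be acquired in time, i.e. the shard is still in use.
     */
    public boolean deleteShardData(Path shardPath, long timeoutMillis)
            throws IOException, InterruptedException {
        if (!shardLock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return false;
        }
        try (var files = Files.walk(shardPath)) {
            // Delete children before their parent directories.
            files.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            return true;
        } finally {
            shardLock.unlock();
        }
    }
}
```

The important branch is the failure case: with close listeners, deletion was deferred until the last reference was released, whereas with a timeout a shard that stays locked past the deadline is simply left on disk, which is what the later comments in this thread pick up on.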

Internal: remove IndexCloseListener & Store.OnCloseListener
@@ -159,8 +164,12 @@ public GatewayMetaState(Settings settings, ThreadPool threadPool, NodeEnvironmen

this.autoImportDangled = AutoImportDangledState.fromString(settings.get(GATEWAY_AUTO_IMPORT_DANGLED, settings.get(GATEWAY_LOCAL_AUTO_IMPORT_DANGLED, AutoImportDangledState.YES.toString())));
this.danglingTimeout = settings.getAsTime(GATEWAY_DANGLING_TIMEOUT, settings.getAsTime(GATEWAY_LOCAL_DANGLING_TIMEOUT, TimeValue.timeValueHours(2)));
this.deleteTimeout = settings.getAsTime(GATEWAY_DELETE_TIMEOUT, TimeValue.timeValueSeconds(30));

@dakrone (Member) Dec 19, 2014
How did this arrive at 30 seconds?

@bleskes (Author, Member) Dec 19, 2014

That's a good question. The delete index request has an ack timeout of 30s (default), and ideally you would like to wait longer than that so the delete is not acked (we could have a different mechanism to make sure it's not acked, but this change is meant to be small and temporary). On the other hand, you don't want to delay cluster state processing beyond the 30s the master waits, to avoid having multiple cluster states in flight. I agree this is not ideal at all, but I felt it was good enough as a temporary measure until a proper solution is in place.
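To make the timing trade-off concrete: a caller of the hypothetical TimeoutShardDeleter sketch from the description above could bound the deletion by the same 30 seconds as the GATEWAY_DELETE_TIMEOUT default in this hunk (the path and class names are made up for illustration):

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

public class DeleteShardExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical shard data path; 30s mirrors the GATEWAY_DELETE_TIMEOUT default.
        Path shardPath = Paths.get("/tmp/es-node-0/indices/test/0");
        TimeoutShardDeleter deleter = new TimeoutShardDeleter();
        boolean deleted = deleter.deleteShardData(shardPath, TimeUnit.SECONDS.toMillis(30));
        if (!deleted) {
            // Lock still held after 30s: the shard folder stays on disk. Per the
            // comment above, the delete-index ack timeout and the master's wait
            // for cluster state publishing also default to 30s.
            System.out.println("shard still in use after 30s; leaving folder in place");
        }
    }
}
```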

@dakrone (Member) commented Dec 19, 2014

LGTM

bleskes closed this in 4d699bd on Dec 19, 2014

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Dec 19, 2014
Internal: remove IndexCloseListener & Store.OnCloseListener

Closes elastic#9009

bleskes deleted the bleskes:delete_shard_listeners branch on Dec 19, 2014

@s1monw (Contributor) commented Dec 29, 2014

This solution is not fixing the problem. Actually it makes it even worse, IMO. Previously a test was failing since the shard was still in use, which is the right thing btw. Now we wait for a timeout and then do nothing if we can't lock the shard. I really don't like this solution at all since it re-introduces a buggy mechanism that we fixed with ref-counting and deleting files under locks. Now we have a race condition where shard data will never be deleted if we run into timeouts. I don't understand why we went this way, to be honest, or rather removed all the infrastructure for fixing the race that #8608 tried to fix.

@s1monw (Contributor) commented Jan 6, 2015

FWIW, there was a misunderstanding on my end about why this was removed and what it was waiting for. The last comment might have been too rough tone-wise, which was mainly due to miscommunication. This was basically a temporary workaround until #8608 is fixed, which I didn't realize, even though it's actually clearly written in the description. Sorry about that :)

clintongormley changed the title from Internal: remove IndexCloseListener & Store.OnCloseListener to Remove IndexCloseListener & Store.OnCloseListener on Jun 7, 2015
