This repository was archived by the owner on May 12, 2021. It is now read-only.

TAJO-1469 allocateQueryMaster can leak resources if it times-out (3sec, hardcoded)#480

Closed
navis wants to merge 1 commit into apache:master from navis:TAJO-1469

Conversation

@navis
Contributor

@navis navis commented Mar 30, 2015

Tested with a hack to reproduce the timeout:

WorkerResourceAllocationResponse response = null;
try {
  response = callFuture.get(3, queryInProgress.getQueryId().getSeq() < 3 ? TimeUnit.MICROSECONDS : TimeUnit.SECONDS);
} catch (Throwable t) {
  ....
}

This produces exceptions for the first two queries but successes for all following queries.
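For illustration, the leak this PR addresses can be sketched as follows. This is a minimal sketch, not Tajo's actual code: the names `Allocation`, `awaitOrRelease`, and `release` are hypothetical. The point is that when `get()` times out, simply dropping the future leaks any response that arrives a moment later, so the fix must arrange for a late response to be released.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class AllocationTimeoutSketch {
    // Hypothetical stand-in for WorkerResourceAllocationResponse.
    static class Allocation {
        final String worker;
        Allocation(String worker) { this.worker = worker; }
    }

    // Wait for the allocation; on timeout, do NOT just abandon the future.
    // A response arriving after the timeout would otherwise carry allocated
    // resources that nobody ever releases.
    static Allocation awaitOrRelease(Future<Allocation> future, long timeoutMs,
                                     ExecutorService cleanup) {
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Arrange for any late response to be given back to the master.
            cleanup.submit(() -> {
                try {
                    release(future.get()); // blocks until the late response arrives
                } catch (Exception ignored) {
                    // interrupted or failed: nothing to release
                }
            });
            return null;
        } catch (InterruptedException | ExecutionException e) {
            return null;
        }
    }

    static void release(Allocation a) {
        // Hypothetical: return the worker's resources to the resource manager.
    }
}
```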

Contributor


I think this line is not necessary if you change the immediately preceding line as follows.

LOG.warn("Got exception while allocating QueryMaster: " + t, t);  

Contributor Author


This (response != null in the catch block) means the response was acquired just after the timeout. I don't think it should be logged the same way as when we got the expected result in time.

@babokim
Contributor

babokim commented Apr 6, 2015

I investigated the locking of CallFuture. CallFuture should be synchronized between run() and get(). The current code looks as if this were implemented, but it is not. If the following situation occurs, some resources or tasks will be lost forever.

  1. Worker: TaskRunner sends GetTask request.
  2. QM: QueryMaster selects proper task and calls RpcCallback.
  3. Worker: AsyncRpcClient receives the response and calls CallFuture.run(response).
    3-1. Worker: If a TimeoutException occurs after 1), between 2) and 3), TaskRunner can't receive the response and doesn't run the allocated task, but the QM doesn't know about that.

If my thought is wrong, please let me know.
If my thought is right, this patch is a temporary solution and we need to create another issue for this problem.
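The race described above can be closed by making delivery and cancellation resolve atomically, so exactly one side "wins" and a late response is handed back for release rather than lost. A minimal sketch of that idea (using a CAS rather than synchronized blocks; the class and method names are hypothetical, not the patch's actual code):

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a cancelable RPC callback: run() and cancel() race for a single
// outcome. If the response arrives after the caller gave up, cancel() returns
// it so the caller can release the associated resources instead of leaking them.
class CancelableCallbackSketch<T> {
    private static final Object CANCELED = new Object();
    private final AtomicReference<Object> state = new AtomicReference<>();

    // Called by the RPC layer when the response arrives.
    // Returns true if the caller is still waiting for it.
    public boolean run(T response) {
        return state.compareAndSet(null, response);
    }

    // Called by the waiting thread on timeout. Returns null if cancellation
    // won (no response was delivered), or the late response so the caller
    // can release its resources.
    @SuppressWarnings("unchecked")
    public T cancel() {
        if (state.compareAndSet(null, CANCELED)) {
            return null;                      // we won: run() will be rejected
        }
        Object v = state.get();
        return v == CANCELED ? null : (T) v;  // late response: hand it back
    }
}
```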

Contributor


I think the name SimpleExchange is not proper for this class. How about "CancelableRpcCallback"?

Contributor Author


ok

@babokim
Contributor

babokim commented Apr 6, 2015

If this patch is considered a temporary solution, it looks good to me.
@navis, thank you for your contribution, I left some comments.

@navis
Contributor Author

navis commented Apr 6, 2015

Yes, if resource allocation is done over the network, we need more serious work to fix that. If the problem is fully recognized through this simple, in-process fix, I intend to fix that, too.

@jihoonson
Contributor

Thanks @navis! This is a critical problem that should be fixed.
You seem to handle a simple case in this issue. If so, would you file the other remaining issues on JIRA? It will be helpful to remind us.
Thanks.

@navis
Contributor Author

navis commented Apr 7, 2015

Addressed comments & added "tajo.qm.resource.allocation.timeout" to set timeout.
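Presumably the new key is set in tajo-site.xml like other Tajo configuration properties; the value below is only an illustrative example (check the patch for the actual default and units):

```xml
<property>
  <name>tajo.qm.resource.allocation.timeout</name>
  <!-- example value only; see the patch for the default and units -->
  <value>3000</value>
</property>
```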

@navis
Contributor Author

navis commented Apr 7, 2015

@jihoonson Would you mind reviewing TAJO-1385 first? It seems it would overlap in some parts.

@jihoonson
Contributor

Ok. I'll review today.

@babokim
Contributor

babokim commented Apr 8, 2015

+1
All tests passed. I'll commit soon.

@asfgit asfgit closed this in 6989542 Apr 8, 2015