DRILL-6477: Drillbit crashes with OOME (Heap) for a large WebUI query #1309

Closed
wants to merge 2 commits into from

kkhatua (Contributor) commented Jun 7, 2018

For queries submitted through the WebUI that retrieve a large result set, the (foreman) Drillbit often hangs or crashes because it runs out of heap memory.

This is because the Web client translates the result set into a massive object on the heap and tries to send that back to the browser. If memory is insufficient, the VM thread spends its time actively trying to perform GC.

The workaround is to have the query's active webConnection wait time out periodically so that the consumed heap space can be checked. The default threshold is 0.85 (i.e. 85%); once heap usage crosses it, a query submitted through the REST API is marked as failed.
In addition, a user exception is thrown, indicating why the query failed, along with alternative suggestions for re-executing the query.
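
The polling idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the class `HeapGuardSketch`, its method names, and the constant name are hypothetical, and the heap reading uses the standard `java.lang.management` API rather than whatever Drill uses internally.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.DoubleSupplier;

/** Hypothetical illustration of a timed wait with a heap check; not Drill's QueryWrapper. */
final class HeapGuardSketch {
    // Threshold from the PR description: fail once more than 85% of the heap is in use.
    // The constant name is an assumption made for this sketch.
    static final double HEAP_MEMORY_FAILURE_THRESHOLD = 0.85;

    /** Fraction of the heap in use, in the range 0..1. */
    static double heapUsageFraction(long usedBytes, long maxBytes) {
        return (double) usedBytes / maxBytes;
    }

    /** Reads the current heap usage from the JVM's memory MXBean. */
    static double currentHeapUsageFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // getMax() returns -1 when the maximum is undefined; fall back to committed memory.
        long max = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return heapUsageFraction(heap.getUsed(), max);
    }

    /**
     * Waits for the query in short slices and checks heap usage after each slice that times out.
     * Returns true if the query completed, false if it should be failed for low heap.
     */
    static boolean awaitWithHeapCheck(CountDownLatch done, long sliceMillis, DoubleSupplier usage)
            throws InterruptedException {
        while (!done.await(sliceMillis, TimeUnit.MILLISECONDS)) {
            if (usage.getAsDouble() > HEAP_MEMORY_FAILURE_THRESHOLD) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would pass `HeapGuardSketch::currentHeapUsageFraction` as the usage supplier; taking the supplier as a parameter only keeps the sketch easy to exercise with fixed values.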

This is an example of a query that tried to scan an entire 60M-row table through the browser. The query failed with the following error message:
[image: screenshot of the error message]

kkhatua (Contributor, Author) commented Jun 7, 2018

@parthchandra could you please review this?

Contributor left a comment

Looks good. Some minor comments

.addToEventQueue(QueryState.FAILED,
UserException.resourceError(
new Throwable(
"Query submitted through the Web interface was failed due to diminishing free heap memory ("+ Math.floor(((1-usagePercent)*100)) +"% free). "

parthchandra (Contributor) commented Jun 11, 2018

We could make this friendlier :)
"There is not enough heap memory to run this query using the web interface. Please try a query with fewer columns or with a filter or limit condition to limit the data returned. You can also try an ODBC/JDBC client"

kkhatua (Author, Contributor) commented Jun 11, 2018

I thought it would be useful to show how much free memory was left just before the cancellation. I suspect that people will see a GC'ed Drillbit after the cancellation and wonder why Drill is complaining of insufficient heap. Hence, the wording.

parthchandra (Contributor) commented Jun 11, 2018

Oh yes. That would be useful information to add to the message. But since the error is triggered at 0.85, the number would always be 15% :)
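
To make the arithmetic in the message concrete, here is a small sketch of the same expression used in the diff excerpt, `Math.floor((1 - usagePercent) * 100)`. The class and method names are hypothetical. It assumes `usagePercent` is the usage measured when the check runs; under that assumption a reading above the 0.85 threshold would report less than 15% free.

```java
/** Hypothetical helper that mirrors the free-heap expression from the reviewed message. */
final class FreeHeapPercentSketch {
    /** Whole-number percentage of heap still free for a usage fraction in the range 0..1. */
    static double freePercent(double usagePercent) {
        return Math.floor((1 - usagePercent) * 100);
    }

    public static void main(String[] args) {
        // At exactly the 0.85 threshold the message would report 15% free.
        System.out.println(freePercent(0.85)); // prints 15.0
        // Half of the heap in use leaves 50% free.
        System.out.println(freePercent(0.5));  // prints 50.0
    }
}
```

Because the expression works on doubles, values that are not exactly representable can land just under a whole number before `Math.floor` is applied, so the reported percentage may be one lower than the decimal arithmetic suggests.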

@@ -204,6 +204,8 @@ private ExecConstants() {
public static final String SERVICE_KEYTAB_LOCATION = SERVICE_LOGIN_PREFIX + ".keytab";
public static final String KERBEROS_NAME_MAPPING = SERVICE_LOGIN_PREFIX + ".auth_to_local";

/* Provide resiliency on web server for queries submitted via HTTP */
public static final String HTTP_QUERY_FAIL_LOW_HEAP_THRESHOLD = "drill.exec.http.query.fail.low_heap.threshold";

parthchandra (Contributor) commented Jun 11, 2018

I would have just put this as a constant in the QueryWrapper class. I don't expect the user to ever modify this and when the QueryWrapper is updated to address large result sets, then we don't have to worry about removing this.

kkhatua (Author, Contributor) commented Jun 11, 2018

I am inclined towards having this as a drill-override.conf property, just so that we have a tuning mechanism. But 85% is a reasonable threshold, so I'll hard-code it as a constant within QueryWrapper.

kkhatua (Contributor, Author) commented Jun 11, 2018

@parthchandra updated the PR based on your review.

Contributor left a comment

+1.

ilooner added a commit to ilooner/drill that referenced this pull request Jun 13, 2018
closes apache#1309
@ilooner ilooner closed this in 7be1e01 Jun 13, 2018
kkhatua added a commit to kkhatua/drill that referenced this pull request Jul 13, 2018
When a query fails, the Web UI result page shows no error, only "No result found."
This was because DRILL-6477 (PR apache#1309) switched to `WebUserConnection.await(long timeoutInMillis)`. Unlike the original `WebUserConnection.await()`, this method did not throw any UserException generated by a query failure.
The fix was to add a new WebUser-only method, `WebUserConnection.timedWait(long timeoutInMillis)`.
This ensures that other callers of `WebUserConnection.await(timeoutInMillis)` are unaffected.
sohami added a commit that referenced this pull request Jul 13, 2018
* DRILL-6591: Show Exception for failed queries submitted in WebUI

When a query fails, the Web UI result page shows no error, only "No result found."
This was because DRILL-6477 (PR #1309) switched to `WebUserConnection.await(long timeoutInMillis)`. Unlike the original `WebUserConnection.await()`, this method did not throw any UserException generated by a query failure. The fix was to use the `WebUserConnection.getError()` method to detect the query failure and throw a UserRemoteException with that error.

closes #1379
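
The difference between a timed wait that swallows a failure and one that surfaces it can be illustrated with a small sketch. Everything here is hypothetical: `ConnectionSketch` and its methods are stand-ins written for this example, not Drill's `WebUserConnection`, and a plain `RuntimeException` stands in for the user exception.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a web connection that records a query failure. */
final class ConnectionSketch {
    private final CountDownLatch latch = new CountDownLatch(1);
    private volatile RuntimeException error;

    /** Called by the query side when the query fails. */
    void fail(RuntimeException e) {
        error = e;
        latch.countDown();
    }

    /** Called by the query side when the query completes normally. */
    void succeed() {
        latch.countDown();
    }

    /** Timed wait that only reports whether the latch opened; a recorded error is ignored. */
    boolean awaitIgnoringError(long millis) throws InterruptedException {
        return latch.await(millis, TimeUnit.MILLISECONDS);
    }

    /** Timed wait that rethrows a recorded failure once the latch has opened. */
    boolean awaitPropagatingError(long millis) throws InterruptedException {
        boolean done = latch.await(millis, TimeUnit.MILLISECONDS);
        if (done && error != null) {
            throw error;
        }
        return done;
    }
}
```

With the first method a failed query looks the same to the caller as a query that finished with no rows, which matches the "No result found." symptom described above; the second method lets the caller show the failure instead.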