Multiple Hive sessions for Tez #436

Merged
merged 6 commits into from Oct 13, 2016

@antbell
Contributor
antbell commented Oct 4, 2016

Allow multiple concurrent HiveServer2 sessions per user.
When using Tez, this prevents a new query from killing a currently running operation.

Note: I'm already in the contributor list.

+
+ else:
+
+ # Get 2 + n_sessions sessions and filter out the busy ones
@romainr
romainr Oct 5, 2016 Member

Curious: is the algo like this in order to avoid adding a Foreign Key from a session to a query?
(that way we could get the last 10 sessions and their queries and quickly check the status of the query/session)

Isn't there a way to know if a session is busy directly with a certain Thrift call? (would even be simpler)

@antbell
Contributor
antbell commented Oct 5, 2016

Hi, Romain. Thanks for the feedback. We are definitely willing to spend more effort to put the pull-request in a better shape for merging.

Now, to your questions:

I have looked a bit for Thrift calls giving the status of a session, but haven't been able to find anything usable. If you have any pointers there, it would certainly simplify the whole logic, as you correctly point out.

As far as I understand, Hive queries started from the notebook are only stored in Document2 objects as JSON, so that looked like the natural place to attach the additional session information.
It could also live in a foreign key referencing the session, making the Document2 model a bit more complex and the logic simpler. If you think that's the proper way to go, we can work on it and submit an amended pull-request.

Cheers

@romainr
Member
romainr commented Oct 10, 2016

Will put some ideas soon on an efficient/robust design

@antbell
Contributor
antbell commented Oct 11, 2016

Thanks. Looking forward to it.
We'd really like to have this feature upstream and are willing to contribute with some development effort.

@romainr
Member
romainr commented Oct 12, 2016

Some designs:

Get a new session for each query.
The session then needs to be closed, similarly to how queries are closed (https://github.com/cloudera/hue/blob/master/desktop/conf.dist/hue.ini#L894), or Hive sessions need a short TTL at the HiveServer2 level (gethue.com/hadoop-tutorial-hive-and-impala-queries-life-cycle/).

Not sure of the actual cost of creating a new session for each query, reloading UDFs, etc.

Look at the last 10 sessions ordered by creation date, then start from the bottom and just try to submit with it. If it errors because the session does not exist, the original code will recreate a session and re-execute it automatically. If the session is busy, it retries with session 9 etc.

What happens when the session is executing a query already and you do an ExecuteStatement()? Does it error out?
If there is no easy way to detect that a session is busy, we would need to store the mapping Session --> QueryHistory and FetchStatus of the operation.

Interesting in any case: the API calls do not require the session ID, just the operation handles. So the good news is that a mapping Session <--> Operation would be needed only if we want to efficiently poll the current operations of a user.

Overall, I think the PR is a good start and we can get it in. I would recommend finding a way to speed up the loop that checks the status of queries, as it makes a lot of calls.
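
For illustration, the second design above (try the most recent sessions one by one) could be sketched as follows. This is a minimal sketch, not Hue code: `SessionBusy`, `submit_with_fallback` and the `submit` callable are hypothetical names, and it assumes a busy session rejects the statement with an error, which is exactly the open question asked above.

```python
class SessionBusy(Exception):
    """Hypothetical error: the session is already running a query."""


def submit_with_fallback(sessions, submit):
    """Try each candidate session in the given order.

    Returns (session, handle) for the first session that accepts the
    statement, or (None, None) if every candidate reports busy.
    `submit` is a caller-supplied callable standing in for the real
    ExecuteStatement call; it is an assumption of this sketch.
    """
    for session in sessions:
        try:
            return session, submit(session)
        except SessionBusy:
            # This session is occupied: fall through to the next candidate.
            continue
    return None, None


def fake_submit(session):
    # Stand-in for the server: the two most recent sessions are busy.
    if session in ("s10", "s9"):
        raise SessionBusy(session)
    return "handle-" + session


result = submit_with_fallback(["s10", "s9", "s8"], fake_submit)
```

If the server does not raise on a busy session, this loop cannot detect the conflict and a stored Session --> Operation mapping would be needed instead.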

@antbell
Contributor
antbell commented Oct 13, 2016

Hi. Thanks for the feedback and ideas.

Some more information from our side:

Opening a new session for each query works, but it increases latency considerably compared to re-using an existing session. Closing sessions explicitly or with a short TTL would keep the number of active sessions under control, but would also quickly make results of related operations unavailable.

When a query is running in a session, executing a new query within the same session when using Tez will kill the current one.

As to the performance of the current implementation, some local checks show that the time to get the session (including parsing the JSON from 100 Document2 instances) is between 0.1 and 0.2 seconds. If the documents are larger than in my case, times might go up.

As you mention, the code to check for available sessions is only executed when starting a new query from the notebook, so there is no penalty for other calls which only need the operation handle.

The choice of 100 documents was arbitrary and is probably overkill. Reducing it to 50 would still work in most cases and reduce the time to get a session by half.
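
To make the approach described in this comment concrete, here is a rough sketch of filtering out busy sessions by scanning recent documents. It is not the code from the PR: `pick_free_session` is a hypothetical helper, and the JSON layout (`snippets`, `status`, `session_guid`) is an assumption made for illustration only.

```python
import json


def pick_free_session(session_guids, documents_json, max_documents):
    """Return the first session guid not referenced by a running snippet.

    `documents_json` is a list of JSON strings, most recent first; only
    the first `max_documents` are parsed, which bounds the lookup cost.
    The field names used here are assumptions, not the real schema.
    """
    busy = set()
    for raw in documents_json[:max_documents]:
        doc = json.loads(raw)
        for snippet in doc.get("snippets", []):
            if snippet.get("status") == "running" and snippet.get("session_guid"):
                busy.add(snippet["session_guid"])
    for guid in session_guids:
        if guid not in busy:
            return guid
    return None  # every known session is busy


docs = [
    json.dumps({"snippets": [{"status": "running", "session_guid": "a"}]}),
    json.dumps({"snippets": [{"status": "available", "session_guid": "b"}]}),
]
free = pick_free_session(["a", "b"], docs, 50)
```

Lowering `max_documents` reduces the parsing work proportionally, at the risk of missing a running query that lives in an older document.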

@romainr

Last question, then let's go with it for the first version!

@@ -227,7 +227,7 @@ def execute(self, notebook, snippet):
     try:
       if statement.get('statement_id') == 0:
         db.use(query.database)
-      handle = db.client.query(query)
+      handle = db.client.query(query, withMultipleSession=True)
@romainr
romainr Oct 13, 2016 Member

Is withMultipleSession really needed, given that we already have 'n_sessions = conf.MAX_NUMBER_OF_SESSIONS.get()' above?

@antbell
antbell Oct 13, 2016 Contributor

Some Hive queries (for instance, the ones dealing with the metastore) are run from the execute_and_wait method in apps/beeswax/src/beeswax/server/dbms
Those don't need the extra session logic because they don't interfere with Tez jobs. The extra flag is only passed for user queries started from the notebook.
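
A small sketch of why the flag and the configured limit are separate switches: the limit says how many sessions are allowed, while the flag says whether this particular call should use them. `select_session` and its parameters are hypothetical, and falling back to the shared session when nothing is free is an assumption of the sketch, not the PR's behaviour.

```python
def select_session(default_session, find_free_session,
                   with_multiple_session=False, max_sessions=1):
    """Pick the session a query should run in.

    Internal callers (e.g. metastore lookups) leave the flag off and
    always get the shared default session. Notebook queries pass
    with_multiple_session=True and get a free session when the
    configured limit allows more than one.
    """
    if with_multiple_session and max_sessions > 1:
        session = find_free_session()
        if session is not None:
            return session
    return default_session


notebook_session = select_session("shared", lambda: "free-1", True, 2)
metastore_session = select_session("shared", lambda: "free-1", False, 2)
```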

@romainr
Member
romainr commented Oct 13, 2016

Thanks for the investigation.

+1 for moving from 100 to 50 or even a bit less

@romainr romainr changed the title from Multiple hive sessions to Multiple Hive sessions for Tez Oct 13, 2016
@antbell
Contributor
antbell commented Oct 13, 2016

40 :)

@romainr romainr merged commit 874d728 into cloudera:master Oct 13, 2016
@romainr
Member
romainr commented Oct 13, 2016

Thanks!

@romainr romainr added a commit that referenced this pull request Dec 13, 2016
antbell + romainr PR436 [hive] Multiple Hive sessions for Tez
#436

* Add session_guid information to operations

* Mark and save hiveserver2 session as invalid on error

* Better error message on missing handle fields for snippet

* Allow multiple concurrent Hive operations by using multiple sessions

* Fixed help string for max_number_of_sessions configuration parameter

* Use 40 user documents instead of 100 when checking busy HS2 sessions

(cherry picked from commit 874d728)
909f943