HDDS-2347. XCeiverClientGrpc's parallel use leads to NPE #81

Merged
merged 1 commit into from Oct 30, 2019

fapifta
Contributor

@fapifta fapifta commented Oct 24, 2019

What changes were proposed in this pull request?

We found this issue during Hive TPCDS tests. The root of the problem is that Hive starts an arbitrary number of threads to work on the same file and reads the file from multiple threads.
In this case all of those threads call the same XCeiverClientGrpc instance, and there are scenarios in which the current client is not synchronized properly. This PR adds the necessary synchronization around the internal closed boolean state and around the channels and asyncstubs structures.
A fundamental change in behaviour is that XCeiverClientManager now serves XCeiverClientGrpc instances only after they have connected to the first DN, and it does so in a synchronized fashion. A later reconnect, if needed, happens after checking whether the DN is connected properly; if it is not, the client reconnects inside a synchronized block.
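
To illustrate the locking pattern described above, here is a minimal sketch. It is not the code from this patch: `SketchClient`, its fields, and its methods are hypothetical stand-ins, and a plain string takes the place of a real gRPC channel. It only shows the idea of guarding the closed flag and the per-DN channel map with one lock, and of doing the "check whether connected, reconnect if not" step inside a synchronized block.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a client that keeps one channel per datanode (DN).
// All names here are illustrative; none are taken from the actual patch.
class SketchClient {
  // Both pieces of shared state are only touched while holding the lock on `this`.
  private final Map<String, String> channels = new HashMap<>();
  private boolean closed = false;
  private int connectCount = 0;

  // Connects to a DN if no channel exists yet; safe to call from many threads.
  synchronized void connect(String dnId) {
    if (closed) {
      throw new IllegalStateException("client is closed");
    }
    if (!channels.containsKey(dnId)) {
      channels.put(dnId, "channel-to-" + dnId);
      connectCount++;
    }
  }

  // Checks the connection and reconnects inside one synchronized block, so
  // no thread can observe a half-initialized or missing channel entry.
  String send(String dnId, String request) {
    String channel;
    synchronized (this) {
      if (closed) {
        throw new IllegalStateException("client is closed");
      }
      if (!channels.containsKey(dnId)) {
        connect(dnId); // the lock is reentrant, so this nested call is fine
      }
      channel = channels.get(dnId);
    }
    // The "request" itself runs outside the lock in this sketch.
    return channel + ":" + request;
  }

  synchronized void close() {
    closed = true;
    channels.clear();
  }

  synchronized int getConnectCount() {
    return connectCount;
  }
}
```

With this shape, many threads calling `send` for the same DN cause a single connect, and a `send` after `close` fails with an explicit exception rather than with an NPE on a missing map entry.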

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-2347

How was this patch tested?

As this issue comes out intermittently, and reproduction depends on how the JVM schedules the code of different threads, I was not able to write any reliable tests so far.
The patch was tested manually on a 42-node cluster, with the 100 tpcds queries on scale 2 and scale 3 large data sets generated by the tools here: https://github.com/fapifta/hive-testbench
These tools come from https://github.com/hortonworks/hive-testbench, with some modifications to be able to use Ozone and HDFS as filesystems in parallel.

After applying the patch on the cluster with current trunk, I have not seen the NPE in 3 runs of the 99 TPCDS queries. Before the patch, I saw 2-5 queries per run fail with the given NPE.

@fapifta
Contributor Author

fapifta commented Oct 24, 2019

/label ozone

@elek elek added the ozone label Oct 24, 2019
@bshashikant
Contributor

The changes look good to me. I am +1 on this.

@hanishakoneru
Contributor

Thank you for working on this @fapifta. Integration test failures do not look related to this PR.
LGTM. +1.

@adoroszlai
Contributor

/retest

@elek elek changed the title HDDS-2347 XCeiverClientGrpc's parallel use leads to NPE HDDS-2347. XCeiverClientGrpc's parallel use leads to NPE Oct 28, 2019
@lokeshj1703
Contributor

The changes look good to me. +1.

@lokeshj1703 lokeshj1703 merged commit d3021fb into apache:master Oct 30, 2019
@lokeshj1703
Contributor

@fapifta Thanks for the contribution! @bshashikant @hanishakoneru Thanks for the reviews! I have merged the PR to master branch.

@fapifta fapifta deleted the HDDS-2347 branch October 30, 2019 13:28
6 participants