HADOOP-17975 Fallback to simple auth does not work for a secondary DistributedFileSystem instance #3658
Conversation
🎊 +1 overall
This message was automatically generated.
      UserGroupInformation ticket, int rpcTimeout,
-     RetryPolicy connectionRetryPolicy, Configuration conf) {
+     RetryPolicy connectionRetryPolicy, Configuration conf,
+     AtomicBoolean fallbackToSimpleAuth) {
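The diff above threads the fallbackToSimpleAuth reference into the ConnectionId constructor. As a rough, simplified sketch (not the actual Hadoop class; names and fields abbreviated), a ConnectionId that keys on the AtomicBoolean reference behaves like this:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified sketch of a ConnectionId that includes the per-client
// fallbackToSimpleAuth reference in equals/hashCode (illustrative only).
public class ConnectionIdSketch {
    final String address;                     // host:port the connection targets
    final AtomicBoolean fallbackToSimpleAuth; // per-DfsClient flag

    ConnectionIdSketch(String address, AtomicBoolean fallback) {
        this.address = address;
        this.fallbackToSimpleAuth = fallback;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ConnectionIdSketch)) return false;
        ConnectionIdSketch other = (ConnectionIdSketch) o;
        // AtomicBoolean does not override equals, so this is an identity
        // comparison: same flag instance means same DfsClient.
        return address.equals(other.address)
            && fallbackToSimpleAuth.equals(other.fallbackToSimpleAuth);
    }

    @Override
    public int hashCode() {
        // AtomicBoolean.hashCode() is Object.hashCode(), i.e. identity-based.
        return 31 * address.hashCode() + fallbackToSimpleAuth.hashCode();
    }

    public static void main(String[] args) {
        AtomicBoolean flag = new AtomicBoolean(false);
        ConnectionIdSketch a = new ConnectionIdSketch("nn:8020", flag);
        ConnectionIdSketch b = new ConnectionIdSketch("nn:8020", flag);
        ConnectionIdSketch c = new ConnectionIdSketch("nn:8020", new AtomicBoolean(false));
        System.out.println(a.equals(b)); // true: same flag instance
        System.out.println(a.equals(c)); // false: distinct flag instances
    }
}
```

The point of the identity semantics is that two ids are equal only when they come from the same DfsClient, even if both flags currently hold the same value.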
I'm not sure whether we should change this to boolean instead of AtomicBoolean here.
The equals of AtomicBoolean does not seem to fit the situation.
I tried to simply use a boolean here, but it did not work, and I realized that we need to distinguish between connections based on the reference of the AtomicBoolean, which is exactly what Object.equals does here.
Here is why:
If we pass in the boolean value of the AtomicBoolean, then the event sequence is this:
- the DfsClient is created with its AtomicBoolean set to false
- we get the connection based on the connection id and the boolean value false
- we call setupIOStreams at first connect, which sets the AtomicBoolean to true
- at the next call, when we get a connection, we get a different one, based on the boolean value true
- the next client creates another AtomicBoolean at initialization, then gets the connection associated with the boolean value false, and its own AtomicBoolean is never initialized via setupIOStreams
The situation is similar if we use the value of the AtomicBooleans in hashCode and equals.
That is why I chose to compare via AtomicBoolean's equals (which falls back to Object.equals), and to use AtomicBoolean's hashCode (which is likewise based on Object.hashCode).
With that, every distinct client, having its own AtomicBoolean, gets a distinct connection and initializes the fallback properly via setupIOStreams.
Objects.equals(new AtomicBoolean(false), new AtomicBoolean(false)) is false.
So if clientA and clientB connect from the same server to the NameNode, they should be using the same connection, but now they use different connections.
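This identity behavior is easy to verify with a tiny program (hypothetical demo class):

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicBoolean;

// AtomicBoolean does not override equals/hashCode, so two instances
// holding the same value still compare unequal (identity semantics).
public class AtomicBooleanEquality {
    static boolean sameByEquals(AtomicBoolean a, AtomicBoolean b) {
        return Objects.equals(a, b);
    }

    public static void main(String[] args) {
        AtomicBoolean x = new AtomicBoolean(false);
        AtomicBoolean y = new AtomicBoolean(false);
        System.out.println(sameByEquals(x, y)); // false: different instances
        System.out.println(sameByEquals(x, x)); // true: same instance
        System.out.println(x.get() == y.get()); // true: same boolean value
    }
}
```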
Yes, they are not equal, but let me reiterate: they should not be equal.
Three things to note here:
- every DfsClient creates a new AtomicBoolean initialized to false
- these AtomicBooleans are set to true by the ipc Client Connection's setupIOStreams method, which can only be called after we have already obtained a connection based on the ConnectionID, whose value we would then possibly change in setupIOStreams.
- in the ipc Client layer, a connection to a NameNode is no different from a connection to a DataNode; it is just a connection to a host:port combination with a given network-related setup.
Comparing the boolean values instead of the AtomicBoolean references brings us back to the initial problem. We have a false value when we first try to get the connection based on the ConnectionID, then after we get the connection we change the value to true. This is even worse: if fallback is on, we are guaranteed two connections even for a single DfsClient, because when we get the connection the second time the value is true, so we create another connection for the new ConnectionID, and we possibly leak the previous one, as that client never creates a ConnectionID with false again. Meanwhile a second DfsClient gets the connection whose ID contains the false value, and we hit the very problem we are trying to fix (the unset AtomicBoolean prevents fallback for the second DfsClient).
Regarding the number of connections we discussed in the previous PR: at that point the NN was out of the picture, but yes, this change affects the NN connections as well if multiple DfsClients are active.
Your suggestion to add the client's ID to the ConnectionID would help, but there is a tradeoff there, as in this case client B has to set up a new socket connection to DataNode D.
That would cause two things:
- the overhead of creating a new connection instead of reusing the one we already have.
By default, connections from the same server with the same configuration reuse the connection tagged with the same ConnectionId. Say we have Client1 through Client10: these 10 clients will reuse the same connection.
If we add an AtomicBoolean field to the comparison of ConnectionId, Client1 through Client10 will each create a new AtomicBoolean; since Object.equals is always false between them, they will have 10 connections. That's a connection leak.
If we change AtomicBoolean to boolean, there will be at most two different connections. Since "false" is the default value, normally we will have only one connection; only when a connection has a "true" value will a second connection be created.
I assume the problem we are initially trying to solve is that clients with "fallback" set to true will be omitted. When we compare with a "boolean" instead of an "AtomicBoolean", normal clients will go to the connection with "boolean" set to "false", and special clients will go to the connection with "boolean" set to "true".
IMO, two connections are not a leak, since idle connections are collected and closed.
Here is the problem again, in other words, as I am starting to feel that we don't understand each other :)
When client code initiates a connection to HDFS, the following decisions are made:
- if the DfsClient is cached, then we will have the same DfsClient with the same AtomicBoolean for all connections (this is the default)
- if the DfsClient is not cached, or if we explicitly create a new instance without checking the cache (we have an API for this), then there will be as many distinct DfsClients, each with its own distinct AtomicBoolean, as are created.
With distinct AtomicBooleans there are distinct ipc ConnectionIDs, which means there are distinct connections.
Ipc Connections are closed when idle for 10 seconds (by default).
So if we include the AtomicBoolean reference in the ConnectionId, we might create extra connections when distinct DfsClients are created (when DfsClients are not cached, or a distinct instance is created via an explicit call); these are then either used by the different clients or closed after the max idle time.
Note that the AtomicBoolean value is not there to determine whether we are allowed to fall back; that comes from configuration via the fallbackAllowed variable into Connection#setupIOStreams. This AtomicBoolean determines, during the real communication, which type of authentication will be used, and we only know whether we really have to fall back to simple auth after we have connected to the server address and discovered its expected authentication method.
As I stated earlier, we cannot use the boolean value of the AtomicBoolean in the key. Here is what happens when we do.
Let's say we have DfsClient1 and DfsClient2. DfsClient1 makes 2 RPC calls to the same address and is not closed, while DfsClient2 is created in the meantime and makes 1 RPC call to the same address.
- DfsClient1 creates ConnectionID1 with the boolean value false.
- DfsClient1 realizes it should fall back to simple auth, so it sets AtomicBoolean1 to true.
- DfsClient1 sends the first RPC via the connection that has ConnectionID1.
- DfsClient1 tries to send the second RPC, and in order to do so it creates a connection with ConnectionID2, where the boolean value is true.
- DfsClient1 sends the second RPC via the connection that has ConnectionID2.
- DfsClient1 goes idle.
- DfsClient2 is created with AtomicBoolean2 set to false.
- DfsClient2 initiates its RPC call and gets the connection with ConnectionID1, which has the boolean value false.
- DfsClient2 returns early from setupIOStreams, as the connection is already established, so AtomicBoolean2 remains false.
- DfsClient2 fails the RPC call with the same error as documented in HADOOP-17975, which we are trying to solve.
While if the ConnectionID holds the reference to the AtomicBoolean instead of its boolean value, then DfsClient1 does not create a second connection with ConnectionID2 on the second RPC call, but DfsClient2 creates a second connection with its own AtomicBoolean, and setupIOStreams can then properly initialize AtomicBoolean2.
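The event sequence above can be modeled with a toy cache (a plain HashMap standing in for the real ipc connection cache; all names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model of the ipc connection cache. Keying by the boolean *value*
// makes DfsClient1 open a second connection after its flag flips, while
// DfsClient2 silently reuses the stale "false" entry and its own flag is
// never initialized (the bug). Keying by the AtomicBoolean *reference*
// gives each client exactly one connection of its own.
public class ConnectionCacheModel {

    // Returns DfsClient2's flag after the sequence when keying by value.
    static boolean client2FlagWithValueKey() {
        Map<Boolean, String> cache = new HashMap<>();
        AtomicBoolean flag1 = new AtomicBoolean(false);
        cache.putIfAbsent(flag1.get(), "conn-1"); // DfsClient1, 1st RPC
        flag1.set(true);                          // setupIOStreams flips it
        cache.putIfAbsent(flag1.get(), "conn-2"); // 2nd RPC: extra connection
        AtomicBoolean flag2 = new AtomicBoolean(false);
        cache.putIfAbsent(flag2.get(), "conn-3"); // no-op: reuses conn-1
        return flag2.get();                       // stays false: the bug
    }

    // Returns the connection count after the sequence when keying by reference.
    static int connectionCountWithReferenceKey() {
        Map<AtomicBoolean, String> cache = new HashMap<>();
        AtomicBoolean flag1 = new AtomicBoolean(false);
        cache.putIfAbsent(flag1, "conn-1"); // DfsClient1, 1st RPC
        flag1.set(true);                    // identity hash is unchanged
        cache.putIfAbsent(flag1, "conn-2"); // no-op: conn-1 is reused
        AtomicBoolean flag2 = new AtomicBoolean(false);
        cache.putIfAbsent(flag2, "conn-3"); // distinct entry for DfsClient2
        flag2.set(true);                    // fresh setupIOStreams can now
                                            // initialize flag2 properly
        return cache.size();                // one connection per client
    }

    public static void main(String[] args) {
        System.out.println(client2FlagWithValueKey());         // false
        System.out.println(connectionCountWithReferenceKey()); // 2
    }
}
```

Mutating the AtomicBoolean under the reference-keyed map is safe here because its hashCode is identity-based and does not change with the value.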
> normally we will only have one connection, only when we have a "true" value connection will the second connection be created.
This might sound appealing, but it is not true. Initially we always have a ConnectionID with the value false, and this false value may change while the connection is being set up, when it is already cached under the false boolean value. As we cannot know the final value until we have set up the sockets and initiated some communication to see what auth method the server side uses, we will never initiate a connection with a ConnectionID that contains true.
Yes, you are right that changing the ConnectionID to have a distinct connection per DfsClient just for this specific case, and creating multiple connections in all other cases, is bad; but if we insist on having this property in the ConnectionID, I do not see any way to avoid it.
Alternatively, as this seems problematic, we can switch back to my previous proposal, which does not increase the number of connections but simply sets the AtomicBoolean value in setupIOStreams whenever it is called, if the auth protocol is SASL, the server-side auth method is SIMPLE, the client runs with a secured configuration, and fallback is allowed (these conditions are checked lazily, in this order).
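That alternative can be sketched roughly as follows; the enum and method names are illustrative, not the real fields in hadoop.ipc.Client:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hedged sketch of the alternative proposal: instead of changing
// ConnectionId, (re)set the shared flag inside setupIOStreams whenever
// the lazily checked conditions hold. Illustrative names throughout.
public class SetupIoStreamsSketch {
    enum AuthProtocol { NONE, SASL }
    enum AuthMethod { SIMPLE, KERBEROS, TOKEN }

    // Conditions checked lazily, in this order: SASL was negotiated, the
    // server expects SIMPLE auth, the client is secured, fallback allowed.
    static void maybeMarkFallback(AuthProtocol authProtocol,
                                  AuthMethod serverAuthMethod,
                                  boolean clientIsSecured,
                                  boolean fallbackAllowed,
                                  AtomicBoolean fallbackToSimpleAuth) {
        if (authProtocol == AuthProtocol.SASL
            && serverAuthMethod == AuthMethod.SIMPLE
            && clientIsSecured
            && fallbackAllowed) {
            // Set on every call, so a later client reusing an already
            // established connection still gets its own flag initialized.
            fallbackToSimpleAuth.set(true);
        }
    }

    public static void main(String[] args) {
        AtomicBoolean flag = new AtomicBoolean(false);
        maybeMarkFallback(AuthProtocol.SASL, AuthMethod.SIMPLE, true, true, flag);
        System.out.println(flag.get()); // true: fallback marked
    }
}
```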
> DfsClient2 is created with AtomicBoolean2 set to false
If the AtomicBoolean is set to false, DFSClient2 is supposed to fail, since it won't fall back to simple auth.
The case in HADOOP-17975 is that DFSClient2 uses "fallback = true" but reuses the connection with "fallback = false", so it fails.
Let me quote myself here:
> Note that the AtomicBoolean value is not there to determine whether we are allowed to fall back; that comes from configuration via the fallbackAllowed variable into Connection#setupIOStreams. This AtomicBoolean determines, during the real communication, which type of authentication will be used, and we only know whether we really have to fall back to simple auth after we have connected to the server address and discovered its expected authentication method.
Please see this code: the fallbackAllowed boolean is what you are talking about. It is initialized from config here.
The AtomicBoolean is initialized in the Connection's setupIOStreams method, which is called from the getConnection method, which also caches the connection based on ConnectionID here.
The fallbackToSimpleAuth AtomicBoolean is created in the DfsClient here and shared with the SaslDataTransferClient here, while it is provided to the ipc Client via the protocol proxy between these two points.
Later on, the SaslDataTransferClient uses the AtomicBoolean to decide whether to fall back or not in this method.
So again:
We cannot use the config, because even if fallback is allowed we do not want to fall back against a secured server, and we do not know whether the server is secure or insecure until we are connected to it.
The AtomicBoolean is initialized to false and set to true only after the Connection has been set up and cached based on the ConnectionID, so we cannot rely on its boolean value.
So the question still stands: do you still think we should go down this route and tinker with the ConnectionID, instead of setting the AtomicBoolean's value in setupIOStreams whenever it should be true?
Alright, thank you for confirming; I am reopening that PR and closing this one.
Thank you for the review and the continued discussion about selecting the final solution.
Description of PR
In the JIRA there is a long description of the problem; the TL;DR is that if there are two DFSClient instances at the same time on the client side, the second one cannot properly fall back to SIMPLE auth, due to how the SaslRpcClient handles the fallback via an AtomicBoolean created by the DfsClient and set by hadoop.ipc.Client.
The initial idea/solution was posted as #3579, where we had a long discussion with @symious about the candidate solutions and their tradeoffs. In the end I was convinced that this is a proper solution that fits into the current system, so I am adding this PR to finally fix the issue.
The solution is to add the AtomicBoolean fallbackToSimpleAuth, which controls whether and how the SaslRpcClient falls back to simple auth, to the ConnectionId class, and with that distinguish the connections of the two DfsClients based on the fallbackToSimpleAuth AtomicBoolean instance. If it differs from any other connection's AtomicBoolean instance, we get a distinct connection, and with that the value of fallbackToSimpleAuth is initialized properly.
How was this patch tested?
JUnit test added.
The problem manifested itself as a failing HBase ExportSnapshot job between two small clusters, one using Kerberos auth and one using Simple auth. With this fix the ExportSnapshot job was able to run as well.