[SPARK-44976] Preserve full principal user name on executor side #42690

Closed


@eubnara eubnara commented Aug 26, 2023

What changes were proposed in this pull request?

Use the full principal name as the Spark user name so that hadoop.security.auth_to_local is respected when accessing a non-kerberized HDFS from a kerberized Hadoop cluster.

Why are the changes needed?

Since https://issues.apache.org/jira/browse/SPARK-6558, Spark has used the short user name, so hadoop.security.auth_to_local on the NameNode of a non-kerberized Hadoop cluster is not respected.
Also, if a user provides the --principal and --keytab options when submitting a Spark job on a kerberized cluster and writes output to a non-kerberized HDFS, file/directory ownership is inconsistent (note the differing owners of _SUCCESS and the part files below).



$ hdfs dfs -ls hdfs:///user/eub/some/path/20230510/23
Found 52 items
-rw-rw-rw-   3 _ex_eub hdfs          0 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/_SUCCESS
-rw-r--r--   3 eub      hdfs  134418857 2023-05-11 00:15 hdfs:///user/eub/some/path/20230510/23/part-00000-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
-rw-r--r--   3 eub      hdfs  153410049 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-00001-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
-rw-r--r--   3 eub      hdfs  157260989 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-00002-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
-rw-r--r--   3 eub      hdfs  156222760 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-00003-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
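
The ownership split in the listing above can be illustrated with a small model. This is only a sketch under stated assumptions: the realm `EXAMPLE.COM`, the `_ex_` mapping, and the names `mapToLocal` and `shortName` are hypothetical stand-ins, not Hadoop's actual `auth_to_local` rule engine. It shows why a name that has already been shortened can no longer match a realm-based rule on the NameNode.

```scala
object AuthToLocalSketch {
  // Hypothetical stand-in for a NameNode-side hadoop.security.auth_to_local rule:
  // principals in the assumed realm EXAMPLE.COM map to "_ex_<primary>";
  // names that carry no realm (or another realm) pass through unchanged.
  def mapToLocal(name: String): String = name.split("@", 2) match {
    case Array(primary, "EXAMPLE.COM") => "_ex_" + primary.takeWhile(_ != '/')
    case _                             => name
  }

  // Simplified model of a short user name: the first component of the principal.
  def shortName(principal: String): String =
    principal.takeWhile(c => c != '/' && c != '@')

  def main(args: Array[String]): Unit = {
    val principal = "eub@EXAMPLE.COM" // assumed principal, for illustration only
    // Full principal sent to the NameNode: the realm-based rule can match.
    println(mapToLocal(principal))
    // Short name sent instead: the realm is gone, so the rule cannot match.
    println(mapToLocal(shortName(principal)))
  }
}
```

In this model the two calls yield different local users for the same person, which mirrors the mixed owners in the listing.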

Additional description is on https://issues.apache.org/jira/browse/SPARK-44976.

Does this PR introduce any user-facing change?

File/directory ownership of the output will be consistent even when a Spark job in a kerberized cluster writes to a non-kerberized HDFS cluster.

How was this patch tested?

Manually tested.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the CORE label Aug 26, 2023

eubnara commented Aug 26, 2023

I found that this doesn't work because the SPARK_USER environment variable is passed to the executor, so getCurrentUser() is never called there.
I'll try to figure out how to fix it.
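
A minimal sketch of the lookup order described in this comment, assuming the user name is taken from SPARK_USER when it is set and only otherwise from the current-user lookup. The names `resolveUser`, `env`, and `fallback` are hypothetical; `env` stands in for the process environment and `fallback` for a call such as `UserGroupInformation.getCurrentUser()`.

```scala
object SparkUserLookupSketch {
  // env: stand-in for the executor's process environment (hypothetical parameter).
  // fallback: stand-in for a current-user lookup (hypothetical parameter).
  // getOrElse takes its argument by name, so fallback() runs only
  // when SPARK_USER is absent.
  def resolveUser(env: Map[String, String], fallback: () => String): String =
    env.get("SPARK_USER").getOrElse(fallback())

  def main(args: Array[String]): Unit = {
    val fullPrincipal = () => "eub@EXAMPLE.COM" // assumed value, for illustration
    // The driver exported a short name, so the fallback is skipped entirely.
    println(resolveUser(Map("SPARK_USER" -> "eub"), fullPrincipal))
    // Without SPARK_USER the fallback decides the name.
    println(resolveUser(Map.empty, fullPrincipal))
  }
}
```

Under this assumption, changing only the fallback has no effect on executors, because the value exported by the driver always wins.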

@eubnara eubnara changed the title [SPARK-44976] Utils.getCurrentUserName should return the full principal name [SPARK-44976] Preserve full principal user name on executor side Aug 28, 2023

eubnara commented Aug 28, 2023

I changed it to set the full principal user name on the executor side.

@@ -560,7 +560,7 @@ class SparkContext(config: SparkConf) extends Logging {
     // TODO: Set this only in the Mesos scheduler.
     executorEnvs("SPARK_EXECUTOR_MEMORY") = executorMemory + "m"
     executorEnvs ++= _conf.getExecutorEnv
-    executorEnvs("SPARK_USER") = sparkUser
+    executorEnvs("SPARK_USER") = Utils.getCurrentFullUserName()

Pass the SPARK_USER environment variable from the driver to the executors.

@@ -890,7 +890,7 @@ private[spark] class Client(
     val env = new HashMap[String, String]()
     populateClasspath(args, hadoopConf, sparkConf, env, sparkConf.get(DRIVER_CLASS_PATH))
     env("SPARK_YARN_STAGING_DIR") = stagingDirPath.toString
-    env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName()
+    env("SPARK_USER") = UserGroupInformation.getCurrentUser().getUserName()

Pass the full principal user name to the driver.

@@ -64,7 +64,7 @@ private[spark] class SparkHadoopUtil extends Logging {
   }

   def createSparkUser(): UserGroupInformation = {
-    val user = Utils.getCurrentUserName()
+    val user = Utils.getCurrentFullUserName()

Set the full principal user name on the UGI on the executor side.

github-actions bot commented Dec 7, 2023

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 7, 2023

eubnara commented Dec 7, 2023

This should be considered when using a kerberized cluster.

@github-actions github-actions bot closed this Dec 8, 2023

eubnara commented Dec 8, 2023

How can I open it again?


eubnara commented Dec 8, 2023

Continued on #44244
