[FLINK-32775] Add parent dir of files to classpath using yarn.provided.lib.dirs #23164
@architgyl Could you please clarify the description a bit? It is not quite clear why it doesn't for
Updated the description.
In this case, this is an issue because HiveConf tries to load
@becketqin @wangyang0918 can you please help review this PR?
Could you please add the before and after classpath output? Please add it to the description if it is small; if not, a gist would probably be better.
@architgyl Thanks for the patch. I had a comment regarding how to make the resource files more explicit.
Also ping @wangyang0918 for review.
URI parentDirectoryUri = new Path(fileName).getParent().toUri();
String relativeParentDirectory =
        new Path(filePath.getName())
                .toUri()
                .relativize(parentDirectoryUri)
                .toString();

if (!resourcesDir.contains(relativeParentDirectory)) {
    resourcesDir.add(relativeParentDirectory);
}
resources.add(fileName);
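To make the relativization in this snippet concrete, here is a minimal, self-contained sketch that uses only java.net.URI instead of Hadoop's Path. The class RelativeParentDirs, its method names, and the sample file names are hypothetical and not part of the patch; the sketch only illustrates how a file such as lib/hadoop/core-site.xml could contribute the single directory entry hadoop, with duplicates skipped.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; this is not the code in YarnApplicationFileUploader.
final class RelativeParentDirs {

    // Returns the parent directory of fileName, relative to providedDirName.
    // Both arguments are assumed to be relative, '/'-separated paths.
    static String relativeParent(String providedDirName, String fileName) {
        int lastSlash = fileName.lastIndexOf('/');
        String parent = lastSlash < 0 ? "" : fileName.substring(0, lastSlash);
        return URI.create(providedDirName).relativize(URI.create(parent)).toString();
    }

    // Collects each distinct relative parent directory once, in encounter order.
    static List<String> collect(String providedDirName, List<String> fileNames) {
        List<String> dirs = new ArrayList<>();
        for (String fileName : fileNames) {
            String dir = relativeParent(providedDirName, fileName);
            if (!dirs.contains(dir)) {
                dirs.add(dir);
            }
        }
        return dirs;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
                "lib/hadoop/core-site.xml",
                "lib/hadoop/hdfs-site.xml",
                "lib/hive/hive-site.xml");
        System.out.println(collect("lib", files));
    }
}
```

A list is used here to mirror the contains/add check in the snippet; an insertion-ordered set would express the same de-duplication more directly.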
I am wondering whether naively adding every file that is not the Flink dist jar or a plugin as a resource file is overkill. A more accurate and explicit option is to do something similar to the plugins, i.e. introduce a reserved keyword "resources" as a directory name. All files in this directory would be treated as resources.
BTW, given that this patch introduces a user-visible behavior change, we need a FLIP.
The existing config yarn.provided.lib.dirs supports adding all the resources (jars or configs) under each of those dirs, separated by semicolons, to the classpath. Why is also adding the parent dir to the classpath overkill?
Take yarn.ship-files as an example: it takes an arbitrary list of dirs and adds both the files and the dirs under each of them to the classpath. Isn't this the case with yarn.provided.lib.dirs as well? The only difference is that yarn.provided.lib.dirs holds platform-specific libs, which won't change between different Flink apps.
Also, introducing a new resources dir and only supporting that feels limiting. Ideally, users or platforms would like to logically group resources into their own dirs, e.g. Hadoop libs, Hive libs, etc.
The main concern is the length of the classpath. I think there is a limit of 32K characters. For one of my simple local jobs, the classpath is already over 10K chars. If we include all the parent dirs in the classpath, it might easily go beyond the limit. Also, in general, adding the resource dir to the classpath in a more explicit and accurate way should probably be preferred.
Also introducing a new resources dir and only supporting that feels limiting. Ideally, users or platforms would like to logically group resources in to their own dirs. For eg: hadoop, hive libs etc in to their own dirs.
Maybe we can have a yarn.provided.resources.dirs in this case.
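The length concern can be checked with a rough estimate. The sketch below is only an illustration under stated assumptions: the class ClasspathLength is hypothetical, the limit of 32 * 1024 characters is taken from the "32K" figure mentioned in this comment and not from any verified specification, and ':' is assumed as the separator.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for estimating how long a joined classpath string is.
final class ClasspathLength {

    // Assumed limit, taken from the "32K characters" estimate in the discussion.
    static final int ASSUMED_LIMIT = 32 * 1024;

    // Length of the classpath string obtained by joining the entries with ':'.
    static int joinedLength(List<String> entries) {
        return String.join(":", entries).length();
    }

    // True if the joined classpath stays within the assumed limit.
    static boolean fitsWithinLimit(List<String> entries) {
        return joinedLength(entries) <= ASSUMED_LIMIT;
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList("hadoop", "hive");
        System.out.println(joinedLength(entries) + " chars, fits: " + fitsWithinLimit(entries));
    }
}
```

Since each distinct directory adds one entry, the growth depends on how many directories the provided lib dirs contain, not on how many files they hold.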
One thing to add here: we are already relativizing the path, so no redundant top-level directory names are added to the classpath. For example, for /users/test/lib/hadoop/;/users/test/lib/hive, .:hadoop:hive: will be added to the classpath.
Note: the same applies to the existing yarn.ship-files config, which works fine without issues.
Also, in general, adding the resource dir to class path in a more explicit and accurate way should probably be preferred.
Do you mean that resources, i.e. anything other than a jar, such as .xml, .yaml, etc., should be added through yarn.provided.resources.dirs? Can you please explain what you mean by "more explicit and accurate way"?
Given this separation, won't the same issue still happen for *.jar?
@GJL can you please help review this PR?
Sorry for the late response due to some internal things and vacation.
I left a comment that we might need some discussion.
LGTM overall, except for some minor nits in test.
Thanks for addressing the comments. LGTM.
+1 for merging
@architgyl Could you please verify in a real YARN cluster whether this PR solves your original requirement about the Hive config? After that, I will merge this PR.
I have verified the changes and it works as expected.
What is the purpose of the change
This change aims to enhance the handling of classpath configuration when using the yarn.provided.lib.dirs property in a YARN environment. Currently, the yarn.ship-files property is used to add specific files to the classpath by including their parent directories. This is particularly useful for cases where resources like hive-site.xml need to be accessible to the application. In this case, an issue arises because HiveConf attempts to load hive-site.xml using Thread.currentThread().getContextClassLoader().getResource("hive-site.xml"), which fails because the parent directory of hive-site.xml is not on Flink's classpath.
The proposed change extends this approach to yarn.provided.lib.dirs: the parent directory of each file is added to the classpath, while the files themselves are not added unless they are jar files. This ensures that resources in these directories are available on the application's classpath, enabling access to required configuration files.
Gist reference to show classpath before and after code change : Link
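The lookup behavior described in this section can be tried in isolation. The following is a minimal sketch, not Flink or Hive code: the class ResourceLookupDemo and the temporary directory layout are hypothetical, and a plain URLClassLoader with no parent stands in for the context class loader. It compares a class loader whose only entry is the hive-site.xml file itself with one whose only entry is the file's parent directory; on a typical JVM only the second is expected to resolve the resource.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo; a stand-in for the getResource("hive-site.xml") lookup.
final class ResourceLookupDemo {

    // Creates a temporary directory containing a small hive-site.xml and returns the file.
    static Path writeSampleConfig() {
        try {
            Path dir = Files.createTempDirectory("conf");
            Path file = dir.resolve("hive-site.xml");
            Files.write(file, "<configuration/>".getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if "hive-site.xml" resolves through a class loader
    // whose only classpath entry is the given path.
    static boolean canLoad(Path classpathEntry) {
        URL[] urls;
        try {
            urls = new URL[] {classpathEntry.toUri().toURL()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // A null parent keeps the application classpath out of the lookup.
        try (URLClassLoader loader = new URLClassLoader(urls, null)) {
            return loader.getResource("hive-site.xml") != null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeSampleConfig();
        System.out.println("file as entry: " + canLoad(file));
        System.out.println("parent dir as entry: " + canLoad(file.getParent()));
    }
}
```

A non-jar file used as a classpath entry is treated as an archive that cannot be opened, so it contributes nothing to resource lookups; that is why the directory, not the file, has to be the entry.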
Brief change log
Add the parent directory of every resource given in yarn.provided.lib.dirs to the classpath; do not add the resources themselves, except jar files.
Verifying this change
Verified the change by adding the full path of the hive-site.xml file to the classpath; the Flink job was submitted successfully and could interact with Hive.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): no
Documentation