Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK- 32775]Add parent dir of files to classpath using yarn.provided.lib.dirs #23164

Merged
merged 9 commits into from
Sep 14, 2023

Conversation

architgyl
Copy link
Contributor

@architgyl architgyl commented Aug 8, 2023

What is the purpose of the change

This change aims to enhance the handling of classpath configuration when using the yarn.provided.lib.dirs property in a YARN environment. Currently, the yarn.ship-files property is used to add specific files to the classpath by including their parent directories. This is particularly useful for cases where resources like hive-site.xml need to be accessible to the application. In the specific context of this situation, an issue arises due to the fact that HiveConf attempts to load hive-site.xml using Thread.currentThread().getContextClassLoader().getResource("hive-site.xml"), which isn't feasible because the parent directory of hive-site.xml is not included within Flink's classpath.

The proposed change is to extend this approach to include the parent directory of files specified in yarn.provided.lib.dirs to the classpath and not the resources itself other than jar files. This will ensure that resources in these directories are available for the application's classpath enabling access to required configuration files.

Gist reference to show classpath before and after code change : Link

Brief change log

Adding parent directory of all the resources given in the yarn.provided.lib.dirsto the classpath and not adding resources itself except jar files.

Verifying this change

Verified the change by adding hive-site.xml file full path to the classpath and was able to submit Flink job successfully and interact with Hive.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@flinkbot
Copy link
Collaborator

flinkbot commented Aug 8, 2023

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@architgyl architgyl changed the title Add parent dir classpath [FLINK- 32775]Add parent dir of files to classpath using yarn.provided.lib.dirs Aug 8, 2023
@venkata91
Copy link
Contributor

@architgyl Could you please clarify the description a bit? It is not quite clear why it doesn't for hive-site.xml alone.

@architgyl
Copy link
Contributor Author

@architgyl Could you please clarify the description a bit? It is not quite clear why it doesn't for hive-site.xml alone.

Updated the description.

@venkata91
Copy link
Contributor

venkata91 commented Aug 9, 2023

@architgyl Could you please clarify the description a bit? It is not quite clear why it doesn't for hive-site.xml alone.

Updated the description.

In this case, this is an issue because HiveConf tries to load hive-site.xml using Thread.currentThread().getContextClassLoader().getResource("hive-site.xml") right which isn't visible because the parent dir of the hive-site.xml is not in Flink's classpath

@architgyl
Copy link
Contributor Author

@becketqin @wangyang0918 can you please help review this PR.

@venkata91
Copy link
Contributor

Could you please add the before and after classpath output? Please add it to description if it is small if not a gist probably would be better.

Copy link
Contributor

@becketqin becketqin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@architgyl Thanks for the patch. I had a comment regarding how to make the resource files more explicit.

Also ping @wangyang0918 for review.

Comment on lines 365 to 375
URI parentDirectoryUri = new Path(fileName).getParent().toUri();
String relativeParentDirectory =
new Path(filePath.getName())
.toUri()
.relativize(parentDirectoryUri)
.toString();

if (!resourcesDir.contains(relativeParentDirectory)) {
resourcesDir.add(relativeParentDirectory);
}
resources.add(fileName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if naively adding all the files that are not Flink dist-jar or plugin as a resource file is over killing. One potential solution more accurate / explicit is to do something similar to the plugins. i.e. introduce a reserved keyword of "resources" as a dir name. All the files in this dir will be treated as resources.

BTW, given that this patch introduces a user sensible behavior change, we need a FLIP.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing config yarn.provided.lib.dirs supports adding all the resources (jars or configs) under each of those dirs separated by comma in to the classpath. Why adding the parent dir also to the classpath is an overkill?

Take yarn.ship-files as an example takes arbitrary list of dirs and adds both the files and the dir under each of the individual dirs to the classpath. Isn't this the case with yarn.provided.lib.dirs as well? Except that in the yarn.provided.lib.dirs is more of a platform specific libs which won't change between different flink apps.

Also introducing a new resources dir and only supporting that feels limiting. Ideally, users or platforms would like to logically group resources in to their own dirs. For eg: hadoop, hive libs etc in to their own dirs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main concern is the length of class path. I think there is a limit of 32K characters. For one of my simple local job, the class path is already over 10K chars. If we are including all the parent dir into the class path. This might easily go beyond the limit. Also, in general, adding the resource dir to class path in a more explicit and accurate way should probably be preferred.

Also introducing a new resources dir and only supporting that feels limiting. Ideally, users or platforms would like to logically group resources in to their own dirs. For eg: hadoop, hive libs etc in to their own dirs.

Maybe we can have a yarn.provided.resources.dirs in this case.

Copy link
Contributor

@venkata91 venkata91 Aug 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to add here is, we are already relativizing the path so there are no redundant top level directory names added to the classpath. For eg: /users/test/lib/hadoop/;/users/test/lib/hive - .:hadoop:hive: will be added to the classpath.

Note: the same applies to existing yarn.ship-files config as well which is working fine without issues.

Also, in general, adding the resource dir to class path in a more explicit and accurate way should probably be preferred.

Do you mean resources as anything other than jar like .xml, .yaml etc should be added through yarn.provided.resources.dirs? Can you please explain what do you mean when you say more explicit and accurate way?

Given this separation, won't the same issue still happen for *.jar?

@architgyl
Copy link
Contributor Author

@GJL can you please help review this PR.

Copy link
Contributor

@wangyang0918 wangyang0918 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late response due to some internal things and vacation.

I left a comment that we might need some discussion.

Copy link
Contributor

@venkata91 venkata91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, except for some minor nits in test.

Copy link
Contributor

@wangyang0918 wangyang0918 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the comments. LGTM.

+1 for merging

@wangyang0918
Copy link
Contributor

wangyang0918 commented Sep 13, 2023

@architgyl Could you please verify this PR in a real YARN cluster whether it solves your original requirement about hive config? After then I will merge this PR.

@architgyl
Copy link
Contributor Author

architgyl commented Sep 13, 2023

@architgyl Could you please verify this PR in a real YARN cluster whether it solves your original requirement about hive config? After then I will merge this PR.

I have verified the changes and it works as expected.

@wangyang0918 wangyang0918 merged commit 1f96218 into apache:master Sep 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants