load hive-site.xml for flink catalog #1558
zhangjun0x01 wants to merge 3 commits into apache:master from
Conversation
String scheme = path.toUri().getScheme();
if (scheme == null) {
  // for case : /tmp/hive-site.xml
  return HIVE_SITE_SCHEME_FILE;
In most projects I've worked on, it's possible to set the default scheme. I know this is true for Flink as well: it has fs.default-scheme, which is used for paths that don't have a scheme specified.
Given that this is specifically for Flink, would it make more sense to fall back to that configuration value if it's available? Perhaps one of the more regular Flink contributors could chime in here. I'd be more likely to default to file as the scheme over HDFS, but I run all of my Flink deployments in containers so I'm likely biased.
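For illustration, a minimal sketch of that fallback, assuming the factory has access to a Flink Configuration (the flinkConf parameter and the resolveScheme helper are hypothetical, and the fs.default-scheme key is read directly); this is not code from the PR:

import java.net.URI;

import org.apache.flink.configuration.Configuration;

public class SchemeFallback {
  static final String HIVE_SITE_SCHEME_FILE = "file";

  // Resolve the scheme to use for a hive-site.xml path: prefer the path's own scheme,
  // then Flink's fs.default-scheme, and finally fall back to "file" as this PR does.
  static String resolveScheme(URI hiveSiteUri, Configuration flinkConf) {
    if (hiveSiteUri.getScheme() != null) {
      return hiveSiteUri.getScheme();
    }
    String defaultFs = flinkConf.getString("fs.default-scheme", null);
    if (defaultFs != null && URI.create(defaultFs).getScheme() != null) {
      // fs.default-scheme is usually a full URI such as hdfs://namenode:9000/
      return URI.create(defaultFs).getScheme();
    }
    return HIVE_SITE_SCHEME_FILE;
  }
}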
String catalogType = options.getOrDefault(ICEBERG_CATALOG_TYPE, ICEBERG_CATALOG_TYPE_HIVE);
switch (catalogType) {
  case "hive":
  case ICEBERG_CATALOG_TYPE_HIVE:
These changes seem unrelated to loading hive-site.xml. Could you remove them?
I think kbendick's suggestion is right: we should extract some constants, but that is unrelated to loading hive-site.xml, so I will remove these changes. If necessary, we can open another PR.
properties.add(HADOOP_WAREHOUSE_LOCATION);
properties.add(DEFAULT_DATABASE);
properties.add(BASE_NAMESPACE);
properties.add(HIVE_SITE_PATH);
If I understand correctly, this is trying to add the hive-site path to the catalog's configuration properties? Why not load it from the environment? That's the typical way to load these configuration files.
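As a rough sketch of what loading from the environment could look like, assuming the conventional HIVE_CONF_DIR variable used by Hive deployments (the class and method names are illustrative):

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class HiveSiteFromEnv {
  // If HIVE_CONF_DIR is set and contains hive-site.xml, add it to the Hadoop Configuration.
  static Configuration loadHiveSite(Configuration conf) {
    String hiveConfDir = System.getenv("HIVE_CONF_DIR");
    if (hiveConfDir != null) {
      File hiveSite = new File(hiveConfDir, "hive-site.xml");
      if (hiveSite.exists()) {
        conf.addResource(new Path(hiveSite.toURI()));
      }
    }
    return conf;
  }
}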
  }
}

private void downloadFromHdfs(Configuration configuration, Path hdfsHiveSitePath) {
I've never seen code like this needed. Usually, hive-site.xml is loaded from the classpath. Here's some code that does it from our Spark build:
val configFile = someClassLoader.getResource("hive-site.xml")
if (configFile != null) {
hadoopConfiguration.addResource(configFile)
}

config.put(FlinkCatalogFactory.ICEBERG_CATALOG_TYPE, isHadoopCatalog ? "hadoop" : "hive");
config.put(FlinkCatalogFactory.HADOOP_WAREHOUSE_LOCATION, "file:" + warehouse);
String path = this.getClass().getClassLoader().getResource("hive-site.xml").getPath();
config.put(FlinkCatalogFactory.HIVE_SITE_PATH, path);
I think this only works because the file is in the local FS for tests, and the FS is assumed to be "file" when there is no scheme. I don't think this is right.
I followed the practice in Flink: if no scheme is specified, such as /tmp/abc, the path is converted to a local file path according to the operating system.
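A small standalone illustration of that behaviour with plain Hadoop classes (not code from this PR): a scheme-less Path reports no scheme, and with a default Configuration it resolves against the local filesystem.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemelessPathDemo {
  public static void main(String[] args) throws Exception {
    Path path = new Path("/tmp/hive-site.xml");
    // No scheme is present in the path itself.
    System.out.println(path.toUri().getScheme());   // prints: null
    // With the default fs.defaultFS (file:///), the path resolves to the local filesystem.
    FileSystem fs = path.getFileSystem(new Configuration());
    System.out.println(fs.getUri());                 // prints: file:///
  }
}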
@zhangjun0x01, @JingsongLi, how does Flink normally handle hive-site.xml files? This seems strange to me, so I'd like to understand if this is something that Flink would normally do.

hi @rdblue: I think this should also be an issue with Flink. What do you think? @JingsongLi
I created https://issues.apache.org/jira/browse/FLINK-19541 to track it. Personally, I think we can provide two ways in Flink:
9a77514 to dfa85e3
hi:

I think this patch should be included in the next release 0.10.0, so I marked it in the milestone. I will take a look.
dfa85e3 to 88db208
Can you help me understand Flink's behavior a bit more?
hi @rdblue: In order to use Hadoop features (such as deploying Flink to a YARN cluster), Flink needs to load the Hadoop classpath, but Hive is not necessary for deploying Flink, so there is no need to load Hive when loading the Hadoop Configuration. In addition, Flink provides a unified catalog API to access external systems; HiveCatalog, JdbcCatalog, etc. are all subclasses used to access different external systems, so Hive is not a plugin or extension of Flink. Therefore, we cannot load Hive in the Flink startup script when Flink is started. I don't know if my understanding is correct. @JingsongLi
@rdblue, you may want to read this flink document. There's a
it will bootstrap a separate flink cluster on yarn for the current submission, and the started flink jobmanager will execute the
Providing a flink config key to indicate the hive-site URI sounds good to me. Then we could submit the flink job in application mode like the following:

Above we are talking about a Flink DataStream job. For a Flink SQL job, we could do the similar thing like:

CREATE CATALOG hive_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  'uri'='thrift://localhost:9083',
  'clients'='5',
  'property-version'='1',
  'hive-conf-dir'='hdfs://config/hive-site.xml'
);

For most cases, we could just set the

CREATE CATALOG hive_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  'uri'='thrift://localhost:9083',
  'clients'='5',
  'property-version'='1',
  'warehouse'='hdfs://data-dir'
);
88db208 to 5a9f5d1
Why not automatically load the
Why not just add that directory to the classpath so that
load hive-site.xml from path, then from classpath
fix checkstyle
fix checkstyle
update comment
fix test code
public static final String HADOOP_WAREHOUSE_LOCATION = "warehouse";

public static final String HIVE_SITE_PATH = "hive-site-path";
public static final String HIVE_SITE_SCHEME_FILE = "file";
If we plan to support a hadoop path, would we still need the file or hdfs scheme? Just load those configuration files via the hadoop filesystem?
I have tested this. If we specify a remote path, as in the following code:
Configuration conf = new Configuration();
Path path = new Path("hdfs://localhost/data/flink/conf/hive-site.xml");
conf.addResource(path);
String warehouseLocation = conf.get("hive.metastore.warehouse.dir");
System.out.println(warehouseLocation);
it will not be able to get the attributes in hive-site.xml; the result is null.
If you specify a local path,
Path path = new Path("file:///tmp/hive-site.xml");
we can get it, so I use the scheme to make that decision: if it is a non-local path, download it to a local file first, and then load it.
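For reference, a hedged sketch of that download-then-load idea; the class and method names are illustrative and not the exact code in this PR.

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteHiveSiteLoader {
  // Local paths are added directly; remote paths (e.g. hdfs://) are first copied to a
  // local temp file, matching the observation above that addResource with an hdfs:// path
  // did not take effect.
  static Configuration loadHiveSite(Configuration conf, Path hiveSitePath) throws Exception {
    String scheme = hiveSitePath.toUri().getScheme();
    if (scheme == null || "file".equals(scheme)) {
      conf.addResource(hiveSitePath);
      return conf;
    }
    File localCopy = File.createTempFile("hive-site", ".xml");
    FileSystem fs = hiveSitePath.getFileSystem(conf);
    fs.copyToLocalFile(hiveSitePath, new Path(localCopy.toURI()));
    conf.addResource(new Path(localCopy.toURI()));
    return conf;
  }
}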
public static final String HIVE_CLIENT_POOL_SIZE = "clients";
public static final String HADOOP_WAREHOUSE_LOCATION = "warehouse";

public static final String HIVE_SITE_PATH = "hive-site-path";
Could we align with the HiveCatalog in Flink by using hive-conf-dir?
ok, I will update it
37a3c59 to 88195d0
Okay, I understand why Flink won't load

What I don't understand is why Iceberg should do anything other than ensure that

Most configuration for the

If Hive only tuning defaults are controlled by

In summary, unless we need catalog-specific Hive configs, I don't see why we wouldn't use the "normal" ways to distribute and add a
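A Java version of the classpath-based approach, mirroring the Spark snippet quoted earlier in this thread; the class name is illustrative.

import java.net.URL;

import org.apache.hadoop.conf.Configuration;

public class HiveSiteFromClasspath {
  // Add hive-site.xml to the Hadoop Configuration when it is visible on the classpath,
  // i.e. the "normal" distribution mechanism being discussed here.
  static Configuration loadHiveSite(Configuration conf) {
    URL hiveSite = HiveSiteFromClasspath.class.getClassLoader().getResource("hive-site.xml");
    if (hiveSite != null) {
      conf.addResource(hiveSite);
    }
    return conf;
  }
}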
Let me talk about my thoughts. In Flink, the use of Hive is only the behavior of the application, so hive-site.xml and the Hive jars are not loaded at Flink startup; they are loaded by the Flink application.
For flink SQL, we usually create only one

For hadoop configurations, we provide

@zhangjun0x01, for uploading the configuration files to hdfs and downloading & loading them for a flink stream job in

I provided a more reasonable patch (#1586) to handle hive-site.xml. (As the flink 0.10.0 release is coming and we hope to resolve this as soon as possible, I opened a pull request with the patch for the same issue; your discussion is valuable @zhangjun0x01, hope you don't mind. Thanks.)
@openinx it doesn't matter
In order to be compatible with various Flink submission modes, FlinkCatalogFactory provides a variable hive-site-path to specify the path of hive-site.xml, which can be a local path or a path on other non-local storage.
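As a usage sketch, the catalog properties might look like the following, mirroring how the test code above passes FlinkCatalogFactory.HIVE_SITE_PATH; the HDFS URI is a placeholder, not a value from this PR.

import java.util.HashMap;
import java.util.Map;

public class CatalogOptionsExample {
  public static void main(String[] args) {
    Map<String, String> config = new HashMap<>();
    config.put("type", "iceberg");
    config.put("catalog-type", "hive");
    config.put("uri", "thrift://localhost:9083");
    // hive-site-path may point at local storage or at a remote location such as HDFS.
    config.put("hive-site-path", "hdfs://namenode:8020/conf/hive-site.xml");
    System.out.println(config);
  }
}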
This is for issue #1437.
I submitted a PR (#1527) before, but it contained some conflicts and included other PRs. I did not handle it well, so I closed it and opened a new PR.
@rdblue could you help review it and give me some suggestions? Thank you.
@kbendick Thanks for your suggestions, I have updated the code accordingly.