
Conversation

@jamescao

[FLINK-1919]
https://issues.apache.org/jira/browse/FLINK-1919
Add HCatOutputFormat for Tuple data types for the Java and Scala APIs; also fix a bug in the Scala API's HCatInputFormat for Hive complex types.
The Java API includes a check for whether the schema of the HCatalog table and the Flink tuples match if the user provides a TypeInformation in the constructor. For data types other than tuples, the OutputFormat requires a preceding Map function that converts them to HCatRecords.
The Scala API includes a check whether the schema of the HCatalog table and the Scala tuples match. For data types other than Scala tuples, the OutputFormat requires a preceding Map function that converts them to HCatRecords. The Scala API requires the user to import org.apache.flink.api.scala._ so that the type can be captured by the Scala macro.
The HCatalog jar in Maven Central is compiled against Hadoop 1, which is not compatible with the Hive jars used for testing, so a Cloudera HCatalog jar is pulled into the pom for testing purposes. It can be removed if not required.
Java List and Map cannot be cast to Scala List and Map; JavaConverters is used to fix a bug in the HCatInputFormat Scala API.
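As a rough, self-contained sketch of the "preceding Map function" requirement described above: `SimpleRecord` and `MapFunction` below are hypothetical stand-ins for HCatalog's `HCatRecord` and Flink's `MapFunction`, so the example runs without the real dependencies and is not the actual Flink/HCatalog API.

```java
import java.util.ArrayList;
import java.util.List;

public class MapToRecordSketch {

    // Hypothetical stand-in for org.apache.hive.hcatalog.data.HCatRecord.
    static class SimpleRecord {
        final List<Object> fields = new ArrayList<>();
    }

    // Hypothetical stand-in for Flink's MapFunction<IN, OUT>.
    interface MapFunction<IN, OUT> {
        OUT map(IN value);
    }

    // A POJO that is not a Tuple, so it needs an explicit conversion
    // step before it can be handed to the output format.
    static class Person {
        final String name;
        final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    public static void main(String[] args) {
        MapFunction<Person, SimpleRecord> toRecord = p -> {
            SimpleRecord r = new SimpleRecord();
            r.fields.add(p.name);  // field order must match the table schema
            r.fields.add(p.age);
            return r;
        };

        SimpleRecord rec = toRecord.map(new Person("alice", 30));
        System.out.println(rec.fields);  // [alice, 30]
    }
}
```

In an actual job, such a conversion function would sit between the non-tuple DataSet and the sink that writes via the HCatalog output format.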

@chiwanpark @rmetzger
I have changed the HCatalog jar to the Apache version. That required moving the hcatalog module to the hadoop1 profile.
@chiwanpark
I have addressed most of your comments, except the one regarding the verification of exceptions in the tests. I feel it's better to verify an exception at the exact point where it's expected to be thrown. With a method-wide annotation, we cannot be sure where in the test method the exception was thrown, which is unsafe, especially for common exception types such as IOException. I did remove the tests' dependency on the exception error message.
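The point about exception scope can be illustrated in plain Java (hypothetical method names, no JUnit dependency): asserting at the exact call site distinguishes the expected failure from an IOException thrown by an earlier, unrelated line, which a method-wide expected-exception annotation cannot do.

```java
import java.io.IOException;

public class ExceptionScopeSketch {

    // Hypothetical stand-in for the code under test.
    static void writeRecord(Object record) throws IOException {
        if (record == null) {
            throw new IOException("record must not be null");
        }
    }

    public static void main(String[] args) throws IOException {
        // Setup runs outside the guarded block: an unexpected IOException
        // here fails the test instead of silently satisfying it.
        writeRecord("valid record");

        boolean thrown = false;
        try {
            writeRecord(null);     // the one call expected to throw
        } catch (IOException expected) {
            thrown = true;         // note: no dependency on the message text
        }
        System.out.println("expected exception thrown: " + thrown);
    }
}
```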

Java API and Scala API
fix Scala HCatInputFormat bug for complex types
moved hcatalog module to hadoop1 profile
modified the surefire configuration for hcatalog tests
addressed review comments from the first PR
removed unused import
@chiwanpark
Member

From this JIRA, it seems that HCatalog is deployed with Hadoop 2. Are you sure that the vanilla version of HCatalog only works with Hadoop 1?

@jamescao
Author

@chiwanpark
Yes, I'm sure that the Apache version of HCatalog is compiled against Hadoop 1. Using this jar in a Hadoop 2 environment leads to binary incompatibility errors. I think the Hive community is aware of this problem:
https://issues.apache.org/jira/browse/HIVE-7349
On the other hand, the Cloudera HCatalog jar is compiled against Hadoop 2 and therefore works in a Hadoop 2 environment, but using the Cloudera jar in a Hadoop 1 environment likewise leads to binary incompatibility errors. I guess one way to solve this is to add a vendor-specific hcatalog module for users who want to use HCatalog with Hadoop 2, or we wait until the Hive community publishes two artifacts in Maven, one for Hadoop 1 and one for Hadoop 2.
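Tying the module to the Hadoop 1 build, as discussed here, could look roughly like the following in the parent pom. This is an illustrative sketch only: the profile id and module path are assumptions, not taken from the actual Flink build files.

```xml
<!-- Illustrative only: activate the hcatalog module solely under the
     hadoop-1 profile, since the HCatalog artifact in Maven Central is
     compiled against Hadoop 1. -->
<profile>
	<id>hadoop-1</id>
	<modules>
		<module>flink-hcatalog</module>
	</modules>
</profile>
```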

@jamescao
Author

jamescao commented Sep 8, 2015

Any updates from the review process?

@chiwanpark
Member

Oh, sorry for the delay. I'm reviewing this.

Member


HCatInputOutputFormatITCase would be a better name.

Member


Unnecessary new line. (Only this line; the other lines are okay.)

@chiwanpark
Member

Hi @jamescao, I just reviewed your PR. There are some issues to fix before merging.

  • HcatInputOutputFormatITest.java must use tab characters for indentation.
  • There are some formatting issues, which I commented on.
  • Some of the Scala code doesn't follow Scala style.

As @twalthr said, I also think we should split the current test cases into separate cases for HCatInputFormat and HCatOutputFormat. This issue and PR should cover the HCatalog output format. We can open another JIRA issue and send a PR for that.

@jamescao
Author

@chiwanpark
Thanks for all your comments! I will fix them in subsequent commits. There is also a bug fix for HCatInputFormat in this PR; do you think that part should also be removed?
@fhueske
It seems there is a fix for Hadoop concurrency in #1111.
I will incorporate the content of that PR into this one.

@chiwanpark
Member

@jamescao Yes, I think the parts (including tests) related to HCatInputFormat should be removed.

@fhueske
Contributor

fhueske commented Sep 10, 2015

@jamescao good observation! The fix from #1111 should be added to HCatOutputFormatBase as well.
There was another fix for the HadoopIOFormats (FLINK-2555, PR #1038) that is also relevant for this PR.

@jamescao
Author

@chiwanpark
The PR is now updated. I pulled out the code related to HCatInputFormat and incorporated FLINK-2555 and FLINK-2617. I also changed the test environment from CollectionEnvironment to LocalEnvironment to cover testing of the de-/serialization of the HCatOutputFormat.
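The switch to LocalEnvironment matters because the output format is serialized and shipped to the runtime, which a CollectionEnvironment run would not exercise. Below is a minimal, self-contained sketch of that kind of round trip using plain java.io serialization; `DummyOutputFormat` is a hypothetical stand-in, and Flink's actual serialization path differs.

```java
import java.io.*;

public class SerializationRoundTripSketch {

    // Hypothetical stand-in for an OutputFormat carrying configuration state.
    static class DummyOutputFormat implements Serializable {
        private static final long serialVersionUID = 1L;
        final String tableName;
        DummyOutputFormat(String tableName) { this.tableName = tableName; }
    }

    public static void main(String[] args) throws Exception {
        DummyOutputFormat original = new DummyOutputFormat("mytable");

        // Serialize, then deserialize, as shipping the format to a
        // worker would; a non-serializable field would fail here.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(original);
        }
        DummyOutputFormat copy;
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            copy = (DummyOutputFormat) ois.readObject();
        }
        System.out.println("table after round trip: " + copy.tableName);
    }
}
```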

@fpompermaier
Contributor

Any update on this?

@twalthr
Contributor

twalthr commented Jan 30, 2019

This PR has not found ownership by committers for quite some time. I'm not sure if it is still mergeable. Most of the functionality will also be implemented as part of FLINK-10556. I will close the PR for now. Feel free to reopen it if you think it still makes sense to keep the contribution.

@twalthr twalthr closed this Jan 30, 2019