Conversation

@jamescao

[FLINK-1919]
Add an HCatOutputFormat for Tuple data types to the Java and Scala APIs, and fix a bug in the Scala API's HCatInputFormat for Hive complex types.
The Java API includes a check that the schema of the HCatalog table matches the Flink tuples if the user provides a TypeInformation in the constructor. For data types other than tuples, the OutputFormat requires a preceding Map function that converts them to HCatRecords.
The Scala API includes a check that the schema of the HCatalog table matches the Scala tuples. For data types other than Scala tuples, the OutputFormat requires a preceding Map function that converts them to HCatRecords. The Scala API requires the user to import org.apache.flink.api.scala._ so that the type can be captured by the Scala macro; see the sketch below.
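A rough usage sketch of the API described above (not taken from this PR; the HCatOutputFormat constructor and the Person case class are assumptions for illustration):

```scala
import org.apache.flink.api.scala._ // required so the Scala macro can capture the tuple type
import org.apache.hive.hcatalog.data.DefaultHCatRecord

object HCatWriteSketch {
  case class Person(id: Int, name: String) // hypothetical non-tuple type

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Tuples can go to the OutputFormat directly; the format checks them
    // against the HCatalog table schema.
    val tuples: DataSet[(Int, String)] = env.fromElements((1, "a"), (2, "b"))
    tuples.output(new HCatOutputFormat[(Int, String)]("mydb", "mytable")) // constructor assumed

    // Non-tuple types need a preceding Map function that emits HCatRecords
    // before they can be handed to the OutputFormat.
    val records: DataSet[DefaultHCatRecord] =
      env.fromElements(Person(1, "a"), Person(2, "b")).map { p =>
        val rec = new DefaultHCatRecord(2)
        rec.set(0, p.id)   // Scala Int is autoboxed to java.lang.Integer
        rec.set(1, p.name)
        rec
      }

    env.execute("HCatOutputFormat sketch")
  }
}
```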
The HCatalog jar in Maven Central is compiled against hadoop1, which is not compatible with the Hive jars used for testing, so a Cloudera HCatalog jar is pulled into the pom for testing purposes. It can be removed if not required.
A Java List or Map cannot be cast to a Scala List or Map; JavaConverters is used to fix this bug in the HCatInputFormat Scala API (see the snippet below).
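For reference, the conversion pattern behind that fix (a minimal standalone sketch; HCatInputFormat itself applies it to the fields of each record):

```scala
import scala.collection.JavaConverters._

object ConversionSketch extends App {
  // HCatalog hands back java.util.List / java.util.Map for Hive ARRAY / MAP
  // columns; casting them to Scala collections throws ClassCastException.
  val javaList: java.util.List[String] = java.util.Arrays.asList("a", "b")
  val javaMap = new java.util.HashMap[String, Int]()
  javaMap.put("a", 1)

  // asScala wraps the Java collection; toList / toMap then materialize
  // immutable Scala collections.
  val scalaList: List[String] = javaList.asScala.toList
  val scalaMap: Map[String, Int] = javaMap.asScala.toMap

  println(scalaList) // List(a, b)
  println(scalaMap)  // Map(a -> 1)
}
```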

java api and scala api
fix scala HCatInputFormat bug for complex type
pull in cloudera Hcatalog jar for end to end test
Member

I'm not an HCatalog expert, but I'm not sure that this third-party repository is needed.

Contributor

We should not depend on vendor-specific repositories/versions in the normal builds.
In the parent pom, there is a profile to enable vendor repositories.

@chiwanpark
Member

Hi @jamescao, thanks for your pull request!
I reviewed it roughly and will review it in more detail in a few days.

Regarding the version of the HCatalog release, it would be better to use the vanilla release.

@jamescao
Author

@chiwanpark @rmetzger
Thanks for all your comments; I will work to improve it. The reason I have to use a Cloudera pom is that the HCatalog jar in Maven Central is compiled against hadoop1, which makes it incompatible with the Hive testing utilities. It seems that the Hive test environment has some issues in Travis CI that do not show up on my Mac; I will have a look and try to resolve them soon.

@jamescao
Author

I need to work offline to debug the Travis builds, so I am closing the PR for now. Thanks for all your time and comments! I will reopen it once all the tests are fixed.

@jamescao jamescao closed this Aug 29, 2015
@twalthr
Contributor

twalthr commented Sep 2, 2015

@jamescao: It seems that you also wrote tests for the HCatInputFormat, right? Is it possible to split the PR into an OutputFormat part and open a separate PR for the HCatInputFormat tests? I'm still working on FLINK-2167 and need an HCatalog testing infrastructure; otherwise I have to write it on my own. Anyway, I wonder why none of the HCat I/O format classes have tests so far...

@jamescao
Author

@twalthr: Sorry I missed your message; this PR has been reopened as #1079. I didn't check this closed page. I should have continued working on this one instead of closing and reopening.
I will split the testing code and the bug fix in HCatInputFormat into a standalone PR. Meanwhile, you can refer to the code in the existing PR #1079. The code has passed the tests in Travis CI, and the Hive test environment is quite simple to set up. One catch is that the Hive JUnit tests are not thread-safe, so you may need to tweak the Surefire configuration to control test concurrency; a sketch is shown below. The other catch is that the HCatalog jar from Maven Central only works in the hadoop1 profile; please see my discussion with @chiwanpark on that page (#1079).
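A minimal Surefire sketch of what that tweak could look like (not from this PR; the exact placement and plugin version depend on the module's pom):

```xml
<!-- Sketch: run tests in a single fork and do not reuse it, so the
     non-thread-safe Hive JUnit tests never run concurrently. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <forkCount>1</forkCount>
    <reuseForks>false</reuseForks>
  </configuration>
</plugin>
```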
