-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-1158]HCatalog integration test for HadoopInputFormatIO #3158
Conversation
Thanks! Can you clarify what is the purpose of adding this integration test? What kind of failures is it intended to catch? I believe we already have tests for verifying that HadoopInputFormatIO works properly. If the intention is to serve as an example of how to use Hive with HadoopInputFormat, that's valid, but it should probably be under |
Thanks for reviewing ! Yes, this is an example of reading from Hive with HadoopInputFormat. The reason I chose the folder - beam/sdks/java/io/hadoop/jdk1.8-tests/src/test/java/org/apache/beam/sdk/io/hadoop/inputformat/integration/tests/ is because I found similar tests here for Cassandra & ElasticSearch (they look like simple read use cases similar to what I have for Hive, if I am not missing something here). Please let me know if otherwise. |
Hmm, I'm surprised that the Cassandra and ES tests in hadoop/inputformat do not include scripts for generating the data, that seems problematic. @ssisk , do you know what's up with that? However, the original intention of those Cassandra and ES tests was to validate HadoopInputFormatIO itself, which is not the case of your PR (the intention of your PR is simply to demonstrate how to use it with HiveIO, correct?) The substantial part of your PR, relevant to people who want to use Beam with Hive, is the E.g. you could post a question "How do I read from Hive using Apache Beam?" and self-answer it with the |
I think, since now all it takes to read from Hive is creating a Configuration object, which can be done with the 7 lines below - that JIRA can be simply closed with a link to the SO post - @iemejia , WDYT? Code to read from Hive:
I'm hesitant to add a couple hundred lines of code for a complete example or integration test, given that its salient part is only 7 lines - it seems not worth the overhead. (alternatively, if that JIRA is talking about having native support in Beam for reading from Hive - i.e. without going through HadoopInputFormat, assuming there's a compelling reason to have that - then that should be addressed separately) |
re: "I'm surprised that the Cassandra and ES tests in hadoop/inputformat do not include scripts for generating the data, that seems problematic. @ssisk ," these do include scripts for generating data - see https://github.com/apache/beam/tree/master/.test-infra/kubernetes - look for data-load.sh files under that directory |
I skimmed this quickly, will provide better comments tomorrow, but I agree with the general sentiment that we should not add tests for everything that HIFIO can be used with. I do think it'd be great to add comments to HIFIO's readme, but apparently that got removed in 1f631eb so we'll have to figure out a place for that to live. cc @aaltay |
HIFIO's readme file moved to web site (https://beam.apache.org/documentation/io/built-in/hadoop/). Is this a good place to add these information? |
+1 on adding to the hadoop input-format docs. |
@aaltay, please let me know on how to add content here - https://beam.apache.org/documentation/io/built-in/hadoop/ |
@seshadri-cr website is hosted at https://github.com/apache/beam-site. You can open a PR to change the content there. Also contribution guide has some information on doing that (https://beam.apache.org/contribute/contribution-guide/#website) |
@seshadri-cr Considering that this code allows to read data from HCatalog (a 'subsystem' of Hive). Can you please rename the code to only mention HCatalog and not Hive. I will update the JIRA to make this coherent. |
@aaltay , thank you ! I have raised the pull request for site doc - apache/beam-site#249 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks almost good to me, we'll see if this gets merged depending on the general opinion and that the Jenkins test passes.
pipeline.run().waitUntilFinish(); | ||
} | ||
|
||
static SimpleFunction<DefaultHCatRecord, String> myValueTranslate = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to raise the type to HCatRecord here (and in all uses, and in the website PR)?
* <p>This test requires a running instance of Hive, and the test dataset must exist in | ||
* the database. | ||
* | ||
* <p>You can run this test by doing the following: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove this detail of how to run the test, since it will be run by with the full suite.
//explicitly specifying database, table & filter | ||
HCatInputFormat.setInput(hcatConf, HIVE_DATABASE, HIVE_TABLE, HIVE_FILTER); | ||
|
||
//specifying table only, assumes 'default' database in this case and no filter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably best to remove this since this is not tested.
|
||
/** | ||
* Returns Hadoop configuration for reading data from Hive using HCatalog. | ||
* To read data from Hive using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inline this to make comment shorter (of course respecting the 100-char limit)
@ssisk I agree with the general sentiment that we should not add HIFIO tests for every existing datastore that supports HIF, however I think we should prioritize those who don't exist yet in Beam, for example I will remove without hesitation the HIFIElastic classes and put this because we don't have a native way to read Hive HCatalog records yet. We shall probably do the same for others (e.g. HIFICassandra) once the native implementations are already merged. WDYT ? |
Other issue to discuss later on is if this examples probably should exist as independent modules considering that we can really fast get into dependency problems. |
@iemejia these ITs don't have:
So Beam can't run the tests in this PR as-is. That's what's useful about the current ES/cassandra HIFIO ITs. I think it could be useful to add tests like these to verify that HIFIO works for things that we don't have direct support yet, but we need data stores to test against. It's also not clear how expensive these types of tests are in the long run - for example, these HIFIO compat tests could potentially just be a small unit test since we aren't really testing beam code with this test and don't need to worry about scale/etc.. |
@ssisk , @iemejia if the general opinion is that it is an overkill to add a complete code example for Hive / HCatalog with HIFIO, I think we should go with the PR apache/beam-site#249 which adds a snippet to beam-site for Hive with HIFIO. WDYT ? |
@seshadri-cr I think effectively that we cant expect to have any existing implementation of HIFIO to be included in the examples, however I would like to have a clear criteria to decide what implementations should be included as examples, for that reason I opened the discussion in the mailing list. IMO we should only have as examples the implementations of HIF that don't exist as native IOs, so once the native implementation is there they shouldn't be there (e.g. I would expect that both of the current examples get removed once we have their native implemationts (e.g. Elasticsearch 5 and Cassandra). |
@iemejia , agree with your comments. Closing this PR |
Be sure to do all of the following to help us incorporate your contribution
quickly and easily:
[BEAM-<Jira issue #>] Description of pull request
mvn clean verify
.<Jira issue #>
in the title with the actual Jira issuenumber, if there is one.
Individual Contributor License Agreement.