
[BEAM-1158]HCatalog integration test for HadoopInputFormatIO #3158

Closed
seshadri-cr wants to merge 6 commits into apache:master from seshadri-cr:hiveintegrationtest

Conversation

@seshadri-cr (Contributor) commented May 16, 2017

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

  • Make sure the PR title is formatted like:
    [BEAM-<Jira issue #>] Description of pull request
  • Make sure tests pass via mvn clean verify.
  • Replace <Jira issue #> in the title with the actual Jira issue
    number, if there is one.
  • If this contribution is large, please file an Apache
    Individual Contributor License Agreement.

@coveralls

Coverage Status

Coverage decreased (-0.003%) to 70.219% when pulling fb127d7 on seshadri-cr:hiveintegrationtest into 69522fe on apache:master.

@aaltay (Member) commented May 16, 2017

R: @jkff @jasonkuster

@jkff (Contributor) commented May 16, 2017

Thanks! Can you clarify the purpose of adding this integration test? What kinds of failures is it intended to catch? I believe we already have tests verifying that HadoopInputFormatIO works properly.

If the intention is to serve as an example of how to use Hive with HadoopInputFormat, that's valid, but it should probably be under examples, and should not be a test.

@seshadri-cr (Contributor, Author)

Thanks for reviewing! Yes, this is an example of reading from Hive with HadoopInputFormat. The reason I chose the folder beam/sdks/java/io/hadoop/jdk1.8-tests/src/test/java/org/apache/beam/sdk/io/hadoop/inputformat/integration/tests/ is that I found similar tests there for Cassandra and Elasticsearch (they look like simple read use cases, similar to what I have for Hive, if I am not missing something). Please let me know if otherwise.

@jkff (Contributor) commented May 17, 2017

Hmm, I'm surprised that the Cassandra and ES tests in hadoop/inputformat do not include scripts for generating the data, that seems problematic. @ssisk , do you know what's up with that?

However, the original intention of those Cassandra and ES tests was to validate HadoopInputFormatIO itself, which is not the case for your PR (the intention of your PR is simply to demonstrate how to use it with HiveIO, correct?)

The substantial part of your PR, relevant to people who want to use Beam with Hive, is the getConfiguration method and myValueTranslate. I'm wondering if perhaps the proper place for the code in this PR is a StackOverflow question with a self-answer (perhaps linking to a gist), rather than code committed to the Beam repo?

E.g. you could post a question "How do I read from Hive using Apache Beam?" and self-answer it with the getConfiguration snippet. What do you think?

@seshadri-cr (Contributor, Author)

JIRA BEAM-1158, reported by @iemejia, asks for Hive support in Apache Beam; after discussion with @iemejia we decided to try HadoopInputFormatIO to read from Hive.
I believe we should have this in both the Apache Beam repo (as an example) and a Stack Overflow post. What do you think?

@jkff (Contributor) commented May 18, 2017

I think that, since all it takes to read from Hive is creating a Configuration object (which can be done with the seven lines below), that JIRA can simply be closed with a link to the SO post. @iemejia, WDYT?

Code to read from Hive:

Configuration conf = new Configuration();
conf.setClass("mapreduce.job.inputformat.class", HCatInputFormat.class, InputFormat.class);
conf.setClass("key.class", LongWritable.class, WritableComparable.class);
conf.setClass("value.class", DefaultHCatRecord.class, Writable.class);
conf.set("hive.metastore.uris", "...");
HCatInputFormat.setInput(conf, "myDatabase", "myTable", "myFilter");
PCollection<KV<LongWritable, DefaultHCatRecord>> data =
    p.apply(HadoopInputFormatIO.<LongWritable, DefaultHCatRecord>read().withConfiguration(conf));

I'm hesitant to add a couple hundred lines of code for a complete example or integration test, given that its salient part is only 7 lines - it seems not worth the overhead.

(alternatively, if that JIRA is talking about having native support in Beam for reading from Hive - i.e. without going through HadoopInputFormat, assuming there's a compelling reason to have that - then that should be addressed separately)
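The snippet above covers the read configuration; the other piece mentioned earlier, myValueTranslate, maps each DefaultHCatRecord to a type with a readily available Beam coder, such as String. Since a DefaultHCatRecord essentially wraps a positional list of column values, the core of such a function reduces to joining those fields. Below is a minimal, dependency-free sketch of that logic; the class and method names are hypothetical, and in Beam this body would sit inside a SimpleFunction<DefaultHCatRecord, String>:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the translation logic behind a
// myValueTranslate-style SimpleFunction<DefaultHCatRecord, String>.
// A DefaultHCatRecord wraps a positional List<Object> of column
// values, so the sketch operates on a plain List<Object> to stay
// runnable without Hive or Beam on the classpath.
public class HCatRecordTranslate {

  // Joins the record's fields with tabs; in Beam this would be the
  // apply() method of the SimpleFunction.
  static String translate(List<Object> recordFields) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < recordFields.size(); i++) {
      if (i > 0) {
        sb.append('\t');
      }
      sb.append(recordFields.get(i)); // null fields print as "null"
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Example row (id, name) from a hypothetical Hive table.
    System.out.println(translate(Arrays.asList(1L, "alice")));
  }
}
```

If I recall the HIFIO API correctly, a function like this can be supplied to the read transform via withValueTranslation, so that the resulting PCollection holds Strings rather than DefaultHCatRecords.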

@ssisk (Contributor) commented May 19, 2017

re: "I'm surprised that the Cassandra and ES tests in hadoop/inputformat do not include scripts for generating the data, that seems problematic. @ssisk ,"

These do include scripts for generating data - see https://github.com/apache/beam/tree/master/.test-infra/kubernetes - look for data-load.sh files under that directory.

@ssisk (Contributor) commented May 19, 2017

I skimmed this quickly and will provide better comments tomorrow, but I agree with the general sentiment that we should not add tests for everything that HIFIO can be used with.

I do think it'd be great to add comments to HIFIO's readme, but apparently that got removed in 1f631eb so we'll have to figure out a place for that to live. cc @aaltay

@aaltay (Member) commented May 22, 2017

HIFIO's readme file moved to the web site (https://beam.apache.org/documentation/io/built-in/hadoop/). Is this a good place to add this information?

@seshadri-cr (Contributor, Author)

@aaltay, this place looks good to me. @jkff, @ssisk, @iemejia, what is your opinion on adding the Hive use case at https://beam.apache.org/documentation/io/built-in/hadoop/ and proposing to close the ticket BEAM-1158?

@ssisk (Contributor) commented May 22, 2017

+1 on adding to the hadoop input-format docs.

@seshadri-cr (Contributor, Author)

@aaltay, please let me know how to add content here: https://beam.apache.org/documentation/io/built-in/hadoop/

@aaltay (Member) commented May 23, 2017

@seshadri-cr the website is hosted at https://github.com/apache/beam-site. You can open a PR to change the content there. Also, the contribution guide has some information on doing that (https://beam.apache.org/contribute/contribution-guide/#website).

@iemejia (Member) commented May 23, 2017

@seshadri-cr Considering that this code reads data from HCatalog (a 'subsystem' of Hive), can you please rename the code to mention only HCatalog and not Hive? I will update the JIRA to make this coherent.

@seshadri-cr (Contributor, Author)

@aaltay, thank you! I have raised the pull request for the site doc: apache/beam-site#249

@iemejia (Member) left a comment:

It looks almost good to me; we'll see if this gets merged, depending on the general opinion and on the Jenkins tests passing.

pipeline.run().waitUntilFinish();
}

static SimpleFunction<DefaultHCatRecord, String> myValueTranslate =
@iemejia (Member):

Is it possible to raise the type to HCatRecord here (and in all uses, and in the website PR)?

* <p>This test requires a running instance of Hive, and the test dataset must exist in
* the database.
*
* <p>You can run this test by doing the following:
@iemejia (Member):

You can remove this detail of how to run the test, since it will be run with the full suite.

//explicitly specifying database, table & filter
HCatInputFormat.setInput(hcatConf, HIVE_DATABASE, HIVE_TABLE, HIVE_FILTER);

//specifying table only, assumes 'default' database in this case and no filter
@iemejia (Member):

Probably best to remove this since this is not tested.


/**
* Returns Hadoop configuration for reading data from Hive using HCatalog.
* To read data from Hive using
@iemejia (Member):

Inline this to make the comment shorter (respecting the 100-char limit, of course).

@iemejia (Member) commented May 24, 2017

@ssisk I agree with the general sentiment that we should not add HIFIO tests for every existing datastore that supports HIF; however, I think we should prioritize those that don't exist yet in Beam. For example, I would remove the HIFIElastic classes without hesitation and keep this one, because we don't have a native way to read Hive HCatalog records yet. We should probably do the same for the others (e.g. HIFICassandra) once the native implementations are merged. WDYT?

@iemejia (Member) commented May 24, 2017

Another issue to discuss later is whether these examples should exist as independent modules, considering how fast we can get into dependency problems.

@seshadri-cr seshadri-cr changed the title [BEAM-1158]Hive integration test for HadoopInputFormatIO [BEAM-1158]HCatalog integration test for HadoopInputFormatIO May 24, 2017
@coveralls

Coverage Status

Coverage increased (+0.5%) to 70.697% when pulling db1a50c on seshadri-cr:hiveintegrationtest into 69522fe on apache:master.

@ssisk (Contributor) commented May 24, 2017

@iemejia these ITs don't have:

  1. an accompanying k8s script to set up small/large instances of HiveIO
  2. data load scripts

So Beam can't run the tests in this PR as-is. That's what's useful about the current ES/cassandra HIFIO ITs.

I think it could be useful to add tests like these to verify that HIFIO works for things that we don't have direct support for yet, but we need data stores to test against. It's also not clear how expensive these types of tests are in the long run - for example, these HIFIO compat tests could potentially just be a small unit test, since we aren't really testing Beam code with this test and don't need to worry about scale, etc.

@seshadri-cr (Contributor, Author)

@ssisk, @iemejia, if the general opinion is that it is overkill to add a complete code example for Hive/HCatalog with HIFIO, I think we should go with PR apache/beam-site#249, which adds a snippet to beam-site for Hive with HIFIO. WDYT?

@iemejia (Member) commented May 30, 2017

@seshadri-cr I think, effectively, that we can't expect every existing implementation of HIFIO to be included in the examples; however, I would like a clear criterion for deciding which implementations should be included, and for that reason I opened the discussion on the mailing list. IMO we should only have as examples the HIF implementations that don't exist as native IOs, so once the native implementation is there, the example shouldn't be (e.g. I would expect both of the current examples, Elasticsearch 5 and Cassandra, to be removed once we have their native implementations).
The problem with having this PR only as documentation (e.g. on the website) is that APIs can change and the code on the website can become outdated; at least we would discover this by having the implementation in the repo. But since you have the other PR for the native implementation, maybe it makes sense to close this one in favor of the beam-site one.

@seshadri-cr (Contributor, Author)

@iemejia, agree with your comments. Closing this PR.

@seshadri-cr seshadri-cr closed this Jun 1, 2017
@seshadri-cr seshadri-cr deleted the hiveintegrationtest branch July 24, 2017 17:26
6 participants