
[BEAM-1158]HCatalog integration test for HadoopInputFormatIO #3158

Closed
seshadri-cr wants to merge 6 commits into apache:master from seshadri-cr:hiveintegrationtest

Conversation

@seshadri-cr (Contributor) commented May 16, 2017

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

  • Make sure the PR title is formatted like:
    [BEAM-<Jira issue #>] Description of pull request
  • Make sure tests pass via mvn clean verify.
  • Replace <Jira issue #> in the title with the actual Jira issue
    number, if there is one.
  • If this contribution is large, please file an Apache
    Individual Contributor License Agreement.

@coveralls

Coverage Status

Coverage decreased (-0.003%) to 70.219% when pulling fb127d7 on seshadri-cr:hiveintegrationtest into 69522fe on apache:master.

@aaltay (Member) commented May 16, 2017

R: @jkff @jasonkuster

@jkff (Contributor) commented May 16, 2017

Thanks! Can you clarify the purpose of adding this integration test? What kinds of failures is it intended to catch? I believe we already have tests verifying that HadoopInputFormatIO works properly.

If the intention is to serve as an example of how to use Hive with HadoopInputFormat, that's valid, but it should probably be under examples, and should not be a test.

@seshadri-cr (Contributor, Author)

Thanks for reviewing! Yes, this is an example of reading from Hive with HadoopInputFormat. The reason I chose the folder beam/sdks/java/io/hadoop/jdk1.8-tests/src/test/java/org/apache/beam/sdk/io/hadoop/inputformat/integration/tests/ is that I found similar tests there for Cassandra and Elasticsearch (they look like simple read use cases, similar to what I have for Hive, if I am not missing something). Please let me know if otherwise.

@jkff (Contributor) commented May 17, 2017

Hmm, I'm surprised that the Cassandra and ES tests in hadoop/inputformat do not include scripts for generating the data, that seems problematic. @ssisk , do you know what's up with that?

However, the original intention of those Cassandra and ES tests was to validate HadoopInputFormatIO itself, which is not the case for your PR (the intention of your PR is simply to demonstrate how to use it with HiveIO, correct?)

The substantial part of your PR, relevant to people who want to use Beam with Hive, is the getConfiguration method and myValueTranslate. I'm wondering if perhaps the proper place for the code in this PR is a StackOverflow question with a self-answer (perhaps linking to a gist), rather than code committed to the Beam repo?

E.g. you could post a question "How do I read from Hive using Apache Beam?" and self-answer it with the getConfiguration snippet. What do you think?

@seshadri-cr (Contributor, Author)

JIRA BEAM-1158, reported by @iemejia, asks for Hive support in Apache Beam; after discussion with @iemejia we decided to try HadoopInputFormatIO to read from Hive.
I believe we should have this in both the Apache Beam repo (as an example) and a Stack Overflow post. What do you think?

@jkff (Contributor) commented May 18, 2017

I think that, since all it takes to read from Hive is creating a Configuration object (which can be done with the seven lines below), that JIRA can simply be closed with a link to the SO post. @iemejia, WDYT?

Code to read from Hive:

Configuration conf = new Configuration();
conf.setClass("mapreduce.job.inputformat.class", HCatInputFormat.class, InputFormat.class);
conf.setClass("key.class", LongWritable.class, WritableComparable.class);
conf.setClass("value.class", DefaultHCatRecord.class, Writable.class);
conf.set("hive.metastore.uris", "...");
HCatInputFormat.setInput(conf, "myDatabase", "myTable", "myFilter");
PCollection<KV<LongWritable, DefaultHCatRecord>> data =
    p.apply(HadoopInputFormatIO.<LongWritable, DefaultHCatRecord>read().withConfiguration(conf));

I'm hesitant to add a couple hundred lines of code for a complete example or integration test, given that its salient part is only 7 lines - it seems not worth the overhead.

(alternatively, if that JIRA is talking about having native support in Beam for reading from Hive - i.e. without going through HadoopInputFormat, assuming there's a compelling reason to have that - then that should be addressed separately)
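The snippet above covers the read configuration; the other piece mentioned earlier, myValueTranslate, maps each DefaultHCatRecord to a type with a readily available Beam coder, such as String. Since a DefaultHCatRecord essentially wraps a positional list of column values, the core of such a function reduces to joining those fields. Below is a minimal, dependency-free sketch of that logic; the class and method names are hypothetical, and in Beam this body would sit inside a SimpleFunction<DefaultHCatRecord, String>:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the translation logic behind a
// myValueTranslate-style SimpleFunction<DefaultHCatRecord, String>.
// A DefaultHCatRecord wraps a positional List<Object> of column
// values, so the sketch operates on a plain List<Object> to stay
// runnable without Hive or Beam on the classpath.
public class HCatRecordTranslate {

  // Joins the record's fields with tabs; in Beam this would be the
  // apply() method of the SimpleFunction.
  static String translate(List<Object> recordFields) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < recordFields.size(); i++) {
      if (i > 0) {
        sb.append('\t');
      }
      sb.append(recordFields.get(i)); // null fields print as "null"
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Example row (id, name) from a hypothetical Hive table.
    System.out.println(translate(Arrays.asList(1L, "alice")));
  }
}
```

If I recall the HIFIO API correctly, a function like this can be supplied to the read transform via withValueTranslation, so that the resulting PCollection holds Strings rather than DefaultHCatRecords.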

@ssisk (Contributor) commented May 19, 2017

re: "I'm surprised that the Cassandra and ES tests in hadoop/inputformat do not include scripts for generating the data, that seems problematic. @ssisk ,"

These do include scripts for generating data - see https://github.com/apache/beam/tree/master/.test-infra/kubernetes - look for data-load.sh files under that directory.

@ssisk (Contributor) commented May 19, 2017

I skimmed this quickly and will provide better comments tomorrow, but I agree with the general sentiment that we should not add tests for everything that HIFIO can be used with.

I do think it'd be great to add comments to HIFIO's readme, but apparently that got removed in 1f631eb so we'll have to figure out a place for that to live. cc @aaltay

@aaltay (Member) commented May 22, 2017

HIFIO's readme file moved to the web site (https://beam.apache.org/documentation/io/built-in/hadoop/). Is this a good place to add this information?

@seshadri-cr (Contributor, Author)

@aaltay, this place looks good to me. @jkff, @ssisk, @iemejia, what is your opinion on adding the Hive use case at https://beam.apache.org/documentation/io/built-in/hadoop/ and proposing to close the ticket BEAM-1158?

@ssisk (Contributor) commented May 22, 2017

+1 on adding to the hadoop input-format docs.

@seshadri-cr (Contributor, Author)

@aaltay, please let me know how to add content here: https://beam.apache.org/documentation/io/built-in/hadoop/

@aaltay (Member) commented May 23, 2017

@seshadri-cr the website is hosted at https://github.com/apache/beam-site. You can open a PR to change the content there. Also, the contribution guide has some information on doing that (https://beam.apache.org/contribute/contribution-guide/#website).

@iemejia (Member) commented May 23, 2017

@seshadri-cr Considering that this code reads data from HCatalog (a 'subsystem' of Hive), can you please rename the code to mention only HCatalog and not Hive? I will update the JIRA to make this coherent.

@seshadri-cr (Contributor, Author)

@aaltay, thank you! I have raised the pull request for the site doc: apache/beam-site#249

@iemejia (Member) left a comment:

It looks almost good to me; we'll see if this gets merged, depending on the general opinion and on the Jenkins tests passing.

pipeline.run().waitUntilFinish();
}

static SimpleFunction<DefaultHCatRecord, String> myValueTranslate =
@iemejia (Member):

Is it possible to raise the type to HCatRecord here (and in all uses, and in the website PR)?

* <p>This test requires a running instance of Hive, and the test dataset must exist in
* the database.
*
* <p>You can run this test by doing the following:
@iemejia (Member):

You can remove this detail of how to run the test, since it will be run with the full suite.

//explicitly specifying database, table & filter
HCatInputFormat.setInput(hcatConf, HIVE_DATABASE, HIVE_TABLE, HIVE_FILTER);

//specifying table only, assumes 'default' database in this case and no filter
@iemejia (Member):

Probably best to remove this since this is not tested.


/**
* Returns Hadoop configuration for reading data from Hive using HCatalog.
* To read data from Hive using
@iemejia (Member):

Inline this to make the comment shorter (respecting the 100-char limit, of course).

@iemejia (Member) commented May 24, 2017

@ssisk I agree with the general sentiment that we should not add HIFIO tests for every existing datastore that supports HIF; however, I think we should prioritize those that don't exist yet in Beam. For example, I would remove the HIFIElastic classes without hesitation and keep this one, because we don't have a native way to read Hive HCatalog records yet. We should probably do the same for the others (e.g. HIFICassandra) once the native implementations are merged. WDYT?

@iemejia (Member) commented May 24, 2017

Another issue to discuss later is whether these examples should exist as independent modules, considering how fast we can get into dependency problems.

@seshadri-cr seshadri-cr changed the title [BEAM-1158]Hive integration test for HadoopInputFormatIO [BEAM-1158]HCatalog integration test for HadoopInputFormatIO May 24, 2017
@coveralls

Coverage Status

Coverage increased (+0.5%) to 70.697% when pulling db1a50c on seshadri-cr:hiveintegrationtest into 69522fe on apache:master.

@ssisk (Contributor) commented May 24, 2017

@iemejia these ITs don't have:

  1. an accompanying k8s script to set up small/large instances of HiveIO
  2. data load scripts

So Beam can't run the tests in this PR as-is. That's what's useful about the current ES/cassandra HIFIO ITs.

I think it could be useful to add tests like these to verify that HIFIO works for things that we don't have direct support for yet, but we need data stores to test against. It's also not clear how expensive these types of tests are in the long run - for example, these HIFIO compat tests could potentially just be a small unit test, since we aren't really testing Beam code with this test and don't need to worry about scale, etc.

@seshadri-cr (Contributor, Author)

@ssisk, @iemejia, if the general opinion is that it is overkill to add a complete code example for Hive/HCatalog with HIFIO, I think we should go with PR apache/beam-site#249, which adds a snippet to beam-site for Hive with HIFIO. WDYT?

@iemejia (Member) commented May 30, 2017

@seshadri-cr I think, effectively, that we can't expect every existing implementation of HIFIO to be included in the examples; however, I would like a clear criterion for deciding which implementations should be included, and for that reason I opened the discussion on the mailing list. IMO we should only have as examples the HIF implementations that don't exist as native IOs, so once the native implementation is there, the example shouldn't be (e.g. I would expect both of the current examples, Elasticsearch 5 and Cassandra, to be removed once we have their native implementations).
The problem with having this PR only as documentation (e.g. on the website) is that APIs can change and the code on the website can become outdated; at least we would discover this by having the implementation in the repo. But since you have the other PR for the native implementation, maybe it makes sense to close this one in favor of the beam-site one.

@seshadri-cr (Contributor, Author)

@iemejia, agree with your comments. Closing this PR.

@seshadri-cr seshadri-cr closed this Jun 1, 2017
@seshadri-cr seshadri-cr deleted the hiveintegrationtest branch July 24, 2017 17:26
6 participants