
Moving all dependencies off cdh and to apache #420

Merged
merged 3 commits on Sep 11, 2018

Conversation

vinothchandar
Member

  • Tests redone in the process
  • Main changes are to RealtimeRecordReader and how it treats maps/arrays

@vinothchandar
Member Author

@n3nash @bvaradar Need your inputs before we can head further down this path.

This is an attempt to reset our dependencies to just Apache and not CDH. Then we can create profiles to generate CDH builds as needed.
Will mention you in tricky spots.

@@ -219,62 +201,6 @@ public Configuration getConf() {
return super.getRecordReader(split, job, reporter);
}

/**
Member Author

Killed this code for now, since it's not adding value but just creating more dependency issues. With these gone, it should (I think) be easier to support Hive 2, since all that matters is that the class subclasses MapredParquetInputFormat.
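To illustrate that point, a minimal sketch of the subclassing idea. The stub base class below stands in for Hive's `MapredParquetInputFormat` (which lives in hive-exec) so the sketch compiles on its own; all names here are hypothetical:

```java
// Hypothetical stand-in for org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat,
// so this sketch compiles without hive-exec on the classpath.
class StubMapredParquetInputFormat {
    public String formatName() {
        return "parquet";
    }
}

// The idea: as long as Hoodie's input format simply subclasses
// MapredParquetInputFormat and defers to it, the same class can sit
// on top of either Hive 1 or Hive 2.
public class HoodieInputFormatSketch extends StubMapredParquetInputFormat {
    public static void main(String[] args) {
        // Inherited behavior is used unchanged.
        System.out.println(new HoodieInputFormatSketch().formatName()); // prints "parquet"
    }
}
```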

@@ -226,7 +238,8 @@ public static Writable avroToArrayWritable(Object value, Schema schema) {
mapValues[1] = avroToArrayWritable(mapEntry.getValue(), schema.getValueType());
values3[index3++] = new ArrayWritable(Writable.class, mapValues);
}
return new ArrayWritable(Writable.class, values3);
wrapperWritable = new Writable[]{new ArrayWritable(Writable.class, values3)};
Member Author

Hive 1.2.1 and CDH Hive 1.1.0 seem to return ArrayWritables for maps/arrays differently. In particular, the Apache version has a wrapper to hold the elements. This was causing a mismatch with our code, which was expecting the map items to be at the top level.

Meta point here is, I think we should fork the Hive Parquet record reader for the RealtimeRecordReader usage ourselves and maintain it going forward. This is consistent with our goal that the RO view will continue to offer the native query perf of the engine, while RT view perf is something owned by hoodie.

What do you both think? If we are in agreement, I'll go ahead and get that rolling as well. Otherwise, once we land this, the CDH build would fail.
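The nesting difference described above can be sketched as follows, using plain `Object` arrays as a hypothetical stand-in for Hadoop's `ArrayWritable` (only the extra wrapper level differs between the two flavors):

```java
// Sketch of the map layout difference between the two Hive flavors.
// Plain Object[] stands in for ArrayWritable (hypothetical simplification).
public class MapWrappingSketch {
    public static void main(String[] args) {
        // Each map entry is a [key, value] pair.
        Object[] entry1 = {"k1", "v1"};
        Object[] entry2 = {"k2", "v2"};

        // CDH-style: the entries sit directly at the top level.
        Object[] cdhStyle = {entry1, entry2};

        // Apache-style: one extra wrapper element holds all the entries,
        // so code expecting top-level entries sees a single element instead.
        Object[] apacheStyle = {new Object[]{entry1, entry2}};

        System.out.println(cdhStyle.length);                    // prints 2
        System.out.println(apacheStyle.length);                 // prints 1
        System.out.println(((Object[]) apacheStyle[0]).length); // prints 2
    }
}
```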

Contributor

Just like we discussed, maybe we should decide if we even want to support RealTime in Hive (Hive on Spark).

/**
* Goes through the log files and populates a map with latest version of each key logged, since
* the base split was written.
*/
private void init() throws IOException {
writerSchema = new AvroSchemaConverter().convert(baseFileSchema);
if (split.getDeltaFilePaths().size() > 0) {
Member Author

This is temporary. I plan to reimplement this nicely.

With the change in avro/parquet deps, the record reader test started to fail since the "name" property was not properly retained in the base parquet file's schema. This caused avro deserialization to fail on the log data. But this was a stop gap anyway, since we should be reading the schema from the log file and not the base file if we want the latest columns to show up. So I just did that; it in turn ended up reading the schema from the data block itself, which fixed the original test failure. Basically, fixed another bug and the test issue went away.
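The schema-selection idea described above can be sketched roughly as follows (all names are hypothetical; in the actual code the schemas come from the base parquet file and the log data blocks):

```java
import java.util.List;

// Hypothetical sketch: prefer the schema carried by the latest log data block
// when delta (log) files exist, so the newest columns show up; otherwise fall
// back to the base file's schema.
public class SchemaChoiceSketch {
    static String chooseReaderSchema(String baseFileSchema, List<String> logBlockSchemas) {
        if (!logBlockSchemas.isEmpty()) {
            // The latest log block carries the most recent set of columns.
            return logBlockSchemas.get(logBlockSchemas.size() - 1);
        }
        return baseFileSchema;
    }

    public static void main(String[] args) {
        // Delta files present: read schema from the newest log block.
        System.out.println(chooseReaderSchema("base-v1", List.of("log-v1", "log-v2")));
        // No delta files: fall back to the base file schema.
        System.out.println(chooseReaderSchema("base-v1", List.of()));
    }
}
```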

Contributor

Sounds good

@vinothchandar
Member Author

@n3nash do you know anything about this test failure?

testRollbackWithDeltaAndCompactionCommit(com.uber.hoodie.table.TestMergeOnReadTable)  Time elapsed: 7.575 sec  <<< ERROR!
java.lang.NullPointerException
        at com.uber.hoodie.table.TestMergeOnReadTable.testRollbackWithDeltaAndCompactionCommit(TestMergeOnReadTable.java:504)

Succeeds locally, but keeps failing in CI

@n3nash
Contributor

n3nash commented Jul 24, 2018

@vinothchandar Looked at the failing test case. As you mentioned, it passes locally and I do not see an apparent reason for it to fail. If you get a chance, please try adding some logs/try-catch to HoodieMergeOnReadTestUtils.getRecordsUsingInputFormat, since it seems like it's failing there; else we will have to dig deeper.

@n3nash
Contributor

n3nash commented Sep 5, 2018

@vinothchandar Can you rebase this diff and update it please? I can then take a look at the failing test case; at the moment this diff has compilation errors against master and it is difficult to fix the test.

@n3nash
Contributor

n3nash commented Sep 7, 2018

@vinothchandar I see that you rebased it. I tried compiling using `mvn clean install -Dmaven.test.skip=true -Dhive11` but it doesn't work; let me know how to compile it and I will look into the other failing test case.

@bvaradar (Contributor) left a comment

We have tested this change as part of Hive compatibility tests. Good to go.

@n3nash
Contributor

n3nash commented Sep 10, 2018

@vinothchandar I have put out a PR for the issue you described in the chat, but the current build seems to be failing for some other reason. I'm going to rebase and try to run this PR with my fix; could you do the same when you get a chance?

@n3nash
Contributor

n3nash commented Sep 10, 2018

@vinothchandar compiles for me locally.

Vinoth Chandar and others added 3 commits September 11, 2018 09:36
 - Tests redone in the process
 - Main changes are to RealtimeRecordReader and how it treats maps/arrays
 - Make hive sync work with Hive 1/2 and CDH environments
 - Fixes to make corner cases for Hive queries
 - Spark Hive integration - Working version across Apache and CDH versions
 - Known Issue - apache#439