
Moving all dependencies off cdh and to apache #420

Merged
merged 3 commits on Sep 11, 2018

Conversation

vinothchandar
Member

  • Tests redone in the process
  • Main changes are to RealtimeRecordReader and how it treats maps/arrays

@vinothchandar
Member Author

@n3nash @bvaradar Need your inputs before we can head further down this path.

This is an attempt to reset our dependencies to just Apache and not CDH. Then we can create profiles to generate CDH builds as needed.
Will mention you in tricky spots.

@@ -219,62 +201,6 @@ public Configuration getConf() {
return super.getRecordReader(split, job, reporter);
}

/**
Member Author

Killed this code for now, since it's not adding value but just creating more dependency issues. With these gone, it should (I think) be easier to support Hive 2, since all that matters is that the class subclasses MapredParquetInputFormat.
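To illustrate that point, a minimal sketch of the subclassing idea. The stub base class below stands in for Hive's `MapredParquetInputFormat` (which lives in hive-exec) so the sketch compiles on its own; all names here are hypothetical:

```java
// Hypothetical stand-in for org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat,
// so this sketch compiles without hive-exec on the classpath.
class StubMapredParquetInputFormat {
    public String formatName() {
        return "parquet";
    }
}

// The idea: as long as Hoodie's input format simply subclasses
// MapredParquetInputFormat and defers to it, the same class can sit
// on top of either Hive 1 or Hive 2.
public class HoodieInputFormatSketch extends StubMapredParquetInputFormat {
    public static void main(String[] args) {
        // Inherited behavior is used unchanged.
        System.out.println(new HoodieInputFormatSketch().formatName()); // prints "parquet"
    }
}
```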

@@ -226,7 +238,8 @@ public static Writable avroToArrayWritable(Object value, Schema schema) {
mapValues[1] = avroToArrayWritable(mapEntry.getValue(), schema.getValueType());
values3[index3++] = new ArrayWritable(Writable.class, mapValues);
}
return new ArrayWritable(Writable.class, values3);
wrapperWritable = new Writable[]{new ArrayWritable(Writable.class, values3)};
Member Author

Hive 1.2.1 and CDH Hive 1.1.0 seem to return ArrayWritables for maps/arrays differently. In particular, the Apache version has a wrapper to hold the elements. This was causing a mismatch with our code, which was expecting the map items to be at the top level.

Meta point here is, I think we should fork the Hive Parquet record reader for the RealtimeRecordReader usage ourselves and maintain it going forward. This is consistent with our goal that the RO view will continue to offer the native query perf of the engine, while RT view perf is something owned by hoodie.

What do you both think? If we are in agreement, I'll go ahead and get that rolling as well. Otherwise, once we land this, the CDH build would fail.
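The nesting difference described above can be sketched as follows, using plain `Object` arrays as a hypothetical stand-in for Hadoop's `ArrayWritable` (only the extra wrapper level differs between the two flavors):

```java
// Sketch of the map layout difference between the two Hive flavors.
// Plain Object[] stands in for ArrayWritable (hypothetical simplification).
public class MapWrappingSketch {
    public static void main(String[] args) {
        // Each map entry is a [key, value] pair.
        Object[] entry1 = {"k1", "v1"};
        Object[] entry2 = {"k2", "v2"};

        // CDH-style: the entries sit directly at the top level.
        Object[] cdhStyle = {entry1, entry2};

        // Apache-style: one extra wrapper element holds all the entries,
        // so code expecting top-level entries sees a single element instead.
        Object[] apacheStyle = {new Object[]{entry1, entry2}};

        System.out.println(cdhStyle.length);                    // prints 2
        System.out.println(apacheStyle.length);                 // prints 1
        System.out.println(((Object[]) apacheStyle[0]).length); // prints 2
    }
}
```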

Contributor

Just like we discussed, maybe we should decide if we even want to support RealTime in Hive (Hive on Spark).

/**
* Goes through the log files and populates a map with latest version of each key logged, since
* the base split was written.
*/
private void init() throws IOException {
writerSchema = new AvroSchemaConverter().convert(baseFileSchema);
if (split.getDeltaFilePaths().size() > 0) {
Member Author

This is temporary. I plan to reimplement this nicely.

With the change in avro/parquet deps, the record reader test started to fail since the "name" property was not properly retained in the base parquet file's schema. This caused avro deserialization to fail on the log data. But this was a stop gap anyway, since we should be reading the schema from the log file and not the base file if we want the latest columns to show up. So I just did that; it in turn ended up reading the schema from the data block itself, which fixed the original test failure. Basically, fixed another bug and the test issue went away.
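The schema-selection idea described above can be sketched roughly as follows (all names are hypothetical; in the actual code the schemas come from the base parquet file and the log data blocks):

```java
import java.util.List;

// Hypothetical sketch: prefer the schema carried by the latest log data block
// when delta (log) files exist, so the newest columns show up; otherwise fall
// back to the base file's schema.
public class SchemaChoiceSketch {
    static String chooseReaderSchema(String baseFileSchema, List<String> logBlockSchemas) {
        if (!logBlockSchemas.isEmpty()) {
            // The latest log block carries the most recent set of columns.
            return logBlockSchemas.get(logBlockSchemas.size() - 1);
        }
        return baseFileSchema;
    }

    public static void main(String[] args) {
        // Delta files present: read schema from the newest log block.
        System.out.println(chooseReaderSchema("base-v1", List.of("log-v1", "log-v2")));
        // No delta files: fall back to the base file schema.
        System.out.println(chooseReaderSchema("base-v1", List.of()));
    }
}
```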

Contributor

Sounds good

@vinothchandar
Member Author

@n3nash do you know anything about this test failure?

testRollbackWithDeltaAndCompactionCommit(com.uber.hoodie.table.TestMergeOnReadTable)  Time elapsed: 7.575 sec  <<< ERROR!
java.lang.NullPointerException
        at com.uber.hoodie.table.TestMergeOnReadTable.testRollbackWithDeltaAndCompactionCommit(TestMergeOnReadTable.java:504)

Succeeds locally, but keeps failing in CI

@n3nash
Contributor

n3nash commented Jul 24, 2018

@vinothchandar Looked at the failing test case. As you mentioned, it passes locally and I do not see an apparent reason for it to fail. If you get a chance, please try adding some logs/try-catch to HoodieMergeOnReadTestUtils.getRecordsUsingInputFormat, since it seems like it's failing there; else we will have to dig deeper.

@n3nash
Contributor

n3nash commented Sep 5, 2018

@vinothchandar Can you rebase this diff and update it please? I can then take a look at the failing test case; at the moment this diff has compilation errors against master and it is difficult to fix the test.

@n3nash
Contributor

n3nash commented Sep 7, 2018

@vinothchandar I see that you rebased it. I tried compiling using `mvn clean install -Dmaven.test.skip=true -Dhive11` but it doesn't work; let me know how to compile it and I will look into the other failing test case.

@bvaradar (Contributor) left a comment

We have tested this change as part of Hive compatibility tests. Good to go.

@n3nash
Contributor

n3nash commented Sep 10, 2018

@vinothchandar I have put out a PR for the issue you described in the chat, but the current build seems to be failing for some other reason. I'm going to rebase and try to run this PR with my fix; could you do the same when you get a chance?

@n3nash
Contributor

n3nash commented Sep 10, 2018

@vinothchandar compiles for me locally.

Vinoth Chandar and others added 3 commits September 11, 2018 09:36
 - Tests redone in the process
 - Main changes are to RealtimeRecordReader and how it treats maps/arrays
 - Make hive sync work with Hive 1/2 and CDH environments
 - Fixes to make corner cases for Hive queries
 - Spark Hive integration - Working version across Apache and CDH versions
 - Known Issue - apache#439