Moving all dependencies off cdh and to apache #420
Conversation
vinothchandar
commented
Jul 16, 2018
- Tests redone in the process
- Main changes are to RealtimeRecordReader and how it treats maps/arrays
@@ -219,62 +201,6 @@ public Configuration getConf() {
      return super.getRecordReader(split, job, reporter);
    }

    /**
Killed this code for now, since it's not adding value, just creating more dependency issues. With these gone, it should be (I think) easier to support Hive 2, since all that matters is that the class subclasses MapredParquetInputFormat.
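A minimal sketch of the point above, using stand-in classes rather than the real Hive/Hudi API: if the realtime input format only needs to extend `MapredParquetInputFormat` and override the record reader, then supporting another Hive version is mostly a question of which superclass is on the classpath, not of extra helper code.

```java
// Stand-in for Hive's org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat;
// the real class has a different signature, this only illustrates the shape.
class MapredParquetInputFormat {
  String getRecordReader() {
    return "parquet-reader";
  }
}

// Hypothetical realtime input format: delegate to the engine's native
// parquet reader, then layer the log-merging on top.
class HoodieRealtimeInputFormat extends MapredParquetInputFormat {
  @Override
  String getRecordReader() {
    return super.getRecordReader() + "+log-merge";
  }
}

public class SubclassSketch {
  public static void main(String[] args) {
    System.out.println(new HoodieRealtimeInputFormat().getRecordReader());
  }
}
```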
@@ -226,7 +238,8 @@ public static Writable avroToArrayWritable(Object value, Schema schema) {
        mapValues[1] = avroToArrayWritable(mapEntry.getValue(), schema.getValueType());
        values3[index3++] = new ArrayWritable(Writable.class, mapValues);
      }
-     return new ArrayWritable(Writable.class, values3);
+     wrapperWritable = new Writable[]{new ArrayWritable(Writable.class, values3)};
Hive 1.2.1 and CDH Hive 1.1.0 seem to return ArrayWritables for maps/arrays differently. In particular, the apache version has a wrapper to hold the elements. This was causing a mismatch with our code, which was expecting the map items to be at the top level.
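The nesting difference described above can be sketched with a minimal stand-in for Hadoop's `ArrayWritable` (the `Arr` class below is hypothetical, just to show the extra wrapper level, not the real Hadoop type):

```java
// Hypothetical stand-in for Hadoop's ArrayWritable: a named array of children.
class Arr {
  final Object[] values;
  Arr(Object... values) { this.values = values; }
}

public class MapNesting {
  public static void main(String[] args) {
    // Two map entries, each serialized as a [key, value] pair.
    Arr e1 = new Arr("k1", "v1");
    Arr e2 = new Arr("k2", "v2");

    // CDH-Hive-1.1.0-style layout (as described above): entries are top level.
    Arr cdh = new Arr(e1, e2);

    // Apache-Hive-1.2.1-style layout: an extra wrapper holds the entries array.
    Arr apache = new Arr(new Arr(e1, e2));

    System.out.println(cdh.values.length);    // 2 — entries directly
    System.out.println(apache.values.length); // 1 — the wrapper element
  }
}
```

Code written against one layout (e.g. indexing map entries at the top level) breaks on the other, which is the mismatch the diff above works around.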
The meta point here is: I think we should fork the Hive Parquet record reader ourselves for the RealtimeRecordReader usage and maintain it going forward. This is consistent with our goal that the RO view will continue to offer the native query perf of the engine, while RT view perf is something owned by hoodie.
What do you both think? If we are in agreement, I'll go ahead and get that rolling as well. Otherwise, once we land this, the cdh build would fail.
Just like we discussed, maybe we should decide if we even want to support RealTime in Hive (Hive on Spark).
/**
 * Goes through the log files and populates a map with the latest version of each key logged, since
 * the base split was written.
 */
private void init() throws IOException {
  writerSchema = new AvroSchemaConverter().convert(baseFileSchema);
  if (split.getDeltaFilePaths().size() > 0) {
this is temporary. I plan to reimplement this nicely.
With the change in avro/parquet deps, the record reader test started to fail since the "name" property was not properly retained in the base parquet file's schema. This caused avro deserialization to fail on the log data. But this was a stop-gap anyway, since we should be reading the schema from the log file, not the base file, if we want the latest columns to show up. So I just did that, and it in turn ended up reading the schema from the data block itself, which fixed the original test failure. Basically, fixing another bug made the test issue go away.
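The schema-selection logic described above can be sketched as follows. The types here are hypothetical stand-ins (plain strings instead of Avro schemas, a `DataBlock` record instead of the real Hudi log-block classes); the idea is simply to prefer the schema carried in the latest log data block, falling back to the base file's schema only when there are no log blocks.

```java
import java.util.List;

public class SchemaSelection {
  // Hypothetical stand-in for a Hudi log data block carrying its writer schema.
  record DataBlock(String schemaJson) {}

  static String pickWriterSchema(List<DataBlock> logBlocks, String baseFileSchema) {
    if (!logBlocks.isEmpty()) {
      // The latest log block carries the schema the writer actually used,
      // so newly added columns show up even if the base file predates them.
      return logBlocks.get(logBlocks.size() - 1).schemaJson();
    }
    // No log data yet: the base file schema is all we have.
    return baseFileSchema;
  }

  public static void main(String[] args) {
    System.out.println(pickWriterSchema(
        List.of(new DataBlock("logSchema")), "baseSchema")); // logSchema
    System.out.println(pickWriterSchema(
        List.of(), "baseSchema"));                           // baseSchema
  }
}
```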
Sounds good
@n3nash do you know anything about this test failure?
Succeeds locally, but keeps failing in CI
@vinothchandar Looked at the failing test case. As you mentioned, it passes locally and I don't see an apparent reason for it to fail. If you get a chance, please try adding some logs/try-catch to
@vinothchandar Can you rebase this diff and update it, please? I can then take a look at the failing test case; at the moment this diff has compilation errors against master, which makes it difficult to fix the test.
Force-pushed from e026cef to 897d675
@vinothchandar I see that you rebased it. I tried compiling using
We have tested this change as part of Hive compatibility tests. Good to go.
@vinothchandar I have put out a PR for the issue you described in the chat, but the current build seems to be failing for some other reason. I'm going to rebase and try to run this PR with my fix; could you do the same when you get a chance?
@vinothchandar compiles for me locally.
- Tests redone in the process
- Main changes are to RealtimeRecordReader and how it treats maps/arrays
- Make hive sync work with Hive 1/2 and CDH environments
- Fixes to make corner cases for Hive queries
- Spark Hive integration
- Working version across Apache and CDH versions
- Known Issue - apache#439
Force-pushed from 564319c to e6348b5