Remove com.databricks:spark-avro and build the Spark Avro schema by itself #770
Conversation
@cdmikechen Have you tried running the demo steps to ensure these changes work fine?
@@ -272,12 +112,23 @@ object AvroConversionUtils {
      case ShortType => (item: Any) =>
        if (item == null) null else item.asInstanceOf[Short].intValue
      case _: DecimalType => (item: Any) => if (item == null) null else item.toString
This line needs to be removed, otherwise Decimal is still being converted to String.
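For reference, a minimal sketch of what a decimal-aware converter could look like instead of the toString fallback above. This is not the PR's code: it assumes Avro 1.8+ on the classpath (for LogicalTypes and Conversions.DecimalConversion), and the helper name is made up.

// Illustrative sketch: convert a Spark decimal value to an Avro fixed carrying the
// decimal logical type, rather than falling back to toString. Assumes Avro 1.8+.
import org.apache.avro.{Conversions, LogicalTypes, SchemaBuilder}

def decimalConverter(precision: Int, scale: Int): Any => Any = {
  // Fixed schema annotated with decimal(precision, scale); the size is a simple
  // upper bound on the bytes needed for the unscaled value.
  val fixedSchema = LogicalTypes.decimal(precision, scale)
    .addToSchema(SchemaBuilder.fixed(s"decimal_${precision}_${scale}").size(precision / 2 + 2))
  val conversion = new Conversions.DecimalConversion()

  (item: Any) =>
    if (item == null) null
    else {
      // toFixed requires the value's scale to match the declared logical type scale.
      val value = item.asInstanceOf[java.math.BigDecimal].setScale(scale)
      conversion.toFixed(value, fixedSchema, fixedSchema.getLogicalType)
    }
}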
I pulled in this PR and ran tests with tables containing timestamp fields. The tables in Hive end up being created with those fields typed as bigint.

Upon diving further into this issue, I was able to narrow it down to the line where the Parquet footer is read to get the schema that is handed to Hive. What is happening here is that in this schema conversion from the Parquet schema to the Hive schema, the timestamp-millis logical type annotation is dropped. Thus any context of this field being a timestamp is lost: the check on the Parquet type only sees an int64, and it is ultimately converted to bigint in the Create Table command generated by Hudi.
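To illustrate the gap, here is an assumed, minimal example rather than Hudi's actual code; it only needs parquet-mr 1.8+ so that the TIMESTAMP_MILLIS original type is known. The annotation is present in the footer schema, but a mapping that switches only on the primitive type produces bigint.

// Illustrative only: the timestamp annotation is in the Parquet schema, but a mapping
// that looks only at the primitive type yields bigint, losing the timestamp context.
import org.apache.parquet.schema.MessageTypeParser
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName

object TimestampMappingSketch {
  def main(args: Array[String]): Unit = {
    val schema = MessageTypeParser.parseMessageType(
      "message example { optional int64 event_ts (TIMESTAMP_MILLIS); }")
    val field = schema.getType("event_ts").asPrimitiveType()

    println(field.getOriginalType)          // TIMESTAMP_MILLIS -- the annotation is there

    // A conversion that switches only on the primitive type loses that context:
    val hiveType = field.getPrimitiveTypeName match {
      case PrimitiveTypeName.INT64 => "bigint"
      case other                   => other.toString.toLowerCase
    }
    println(s"hive type: $hiveType")        // prints "hive type: bigint"
  }
}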
Should we first rebase and resolve the conflicts?
For my testing, I had rebased this patch on top of release-0.5.0. But yes, @cdmikechen should maybe rebase the PR. Either way, the issue will still exist.
@umehrot2 Great analysis. Would upgrading parquet-avro help?
Good point @vinothchandar. Upon a quick look at parquet-avro, the logical type support is not there in the version we are currently on. I will upgrade the parquet version and test. Will update here with what I find.
@umehrot2 It is based on the 0.4.8 version. You also need to upgrade Avro to 1.8.2 or a higher version (which supports logical types), and Parquet to 1.8.2 or higher.
I was able to read and write the data correctly after upgrading the versions. Is there a way we can prioritize this work and get it merged? Is there any additional testing I can help perform that would give us confidence that it can be merged? @cdmikechen you mentioned there are still some issues; if you can point them out here, I would be willing to help out with those as well.
@umehrot2 #903 was opened for the shading changes, FYI. On bumping up versions, there are a few compatibility considerations.
Also feel free to open a new PR, since @cdmikechen will take a few weeks to circle back, as he mentioned.
@umehrot2 Here is a modified version of Hive's WritableTimestampObjectInspector that also handles LongWritable values:

package org.apache.hadoop.hive.serde2.objectinspector.primitive;
import java.sql.Timestamp;
import org.apache.hadoop.hive.serde2.io.TimestampWritable;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
import org.apache.hadoop.io.LongWritable;
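/**
 * Modified version of Hive's WritableTimestampObjectInspector: every accessor also
 * accepts a LongWritable (epoch milliseconds) and converts it to a TimestampWritable,
 * so that int64 timestamp values coming from Parquet can still be read as Hive timestamps.
 */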
public class WritableTimestampObjectInspector extends
AbstractPrimitiveWritableObjectInspector implements
SettableTimestampObjectInspector {
public WritableTimestampObjectInspector() {
super(TypeInfoFactory.timestampTypeInfo);
}
@Override
public TimestampWritable getPrimitiveWritableObject(Object o) {
if (o instanceof LongWritable) {
return (TimestampWritable) PrimitiveObjectInspectorFactory.writableTimestampObjectInspector
.create(new Timestamp(((LongWritable) o).get()));
}
return o == null ? null : (TimestampWritable) o;
}
public Timestamp getPrimitiveJavaObject(Object o) {
if (o instanceof LongWritable) {
return new Timestamp(((LongWritable) o).get());
}
return o == null ? null : ((TimestampWritable) o).getTimestamp();
}
public Object copyObject(Object o) {
if (o instanceof LongWritable) {
return new TimestampWritable(new Timestamp(((LongWritable) o).get()));
}
return o == null ? null : new TimestampWritable((TimestampWritable) o);
}
public Object set(Object o, byte[] bytes, int offset) {
if (o instanceof LongWritable) {
o = PrimitiveObjectInspectorFactory.writableTimestampObjectInspector
.create(new Timestamp(((LongWritable) o).get()));
} else
((TimestampWritable) o).set(bytes, offset);
return o;
}
public Object set(Object o, Timestamp t) {
if (t == null) {
return null;
}
if (o instanceof LongWritable) {
o = PrimitiveObjectInspectorFactory.writableTimestampObjectInspector.create(t);
} else
((TimestampWritable) o).set(t);
return o;
}
public Object set(Object o, TimestampWritable t) {
if (t == null) {
return null;
}
if (o instanceof LongWritable) {
o = PrimitiveObjectInspectorFactory.writableTimestampObjectInspector
.create(new Timestamp(((LongWritable) o).get()));
} else
((TimestampWritable) o).set(t);
return o;
}
public Object create(byte[] bytes, int offset) {
return new TimestampWritable(bytes, offset);
}
public Object create(Timestamp t) {
return new TimestampWritable(t);
}
}

I'm looking for a solution that doesn't need to modify the Hive source code. See if you can come up with any good ideas.
@vinothchandar
@vinothchandar At the moment, I cannot think of a good way to upgrade the Avro version while still continuing to support Spark 2.3 or earlier. What @cdmikechen has mentioned about asking users for the additional step of replacing the avro 1.7.7 jars in SPARK_HOME/jars with avro 1.8.2 seems reasonable. If we agree that it is fine, either I or @cdmikechen can create a new PR based off this, with the following changes:
It appears that with the above two changes, this PR can be in a state to be merged. We can continue on the timestamp issue in a separate JIRA/PR.
I have a slightly different strategy. We can move to Spark 2.4 and match its parquet (1.10.1) and avro (1.8.2) versions.

Also @umehrot2, is supporting 2.3 a must, or can we drop Hudi support for versions lower than 2.4? The Hudi community is OK per se with just supporting 2.4. If so, then we can also drop com.databricks from the code and use org.apache.spark.avro (which is only available in 2.4).

cc @bvaradar @bhasudha who are looking into the Spark 2.4 move
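A minimal sketch of that direction, assuming Spark 2.4's built-in avro module (org.apache.spark:spark-avro) and that org.apache.spark.sql.avro.SchemaConverters is usable from Hudi's side; the record name and namespace below are made up for illustration.

// Illustrative only: deriving the Avro schema via Spark 2.4's own converter instead of
// com.databricks:spark-avro. Timestamp and decimal keep their logical types.
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

object Spark24AvroSchemaSketch {
  def main(args: Array[String]): Unit = {
    val struct = StructType(Seq(
      StructField("id", StringType, nullable = false),
      StructField("ts", TimestampType),
      StructField("price", DecimalType(10, 2))))

    // Spark 2.4 maps TimestampType to long/timestamp-micros and DecimalType to
    // fixed/decimal, so the logical types survive the conversion.
    val avroSchema = SchemaConverters.toAvroType(struct, false, "record", "hoodie")
    println(avroSchema.toString(true))
  }
}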
@vinothchandar At EMR we do not have a use case to support Spark 2.3 or earlier. We would be offering Hudi starting with our latest release, which has Spark 2.4.3; anything earlier than that we would not be supporting. So it might be a good idea to just move to 2.4 and drop support for earlier versions.
@umehrot2 I think Balaji has his hands full with the release at the moment. Do you have the bandwidth to try moving to Spark 2.4 and do these changes on top?
Sure. I will take this up then.
Sounds good. Assigned you https://issues.apache.org/jira/browse/HUDI-91. Let's continue there.
@vinothchandar @umehrot2 We can either ask users to replace the Avro jars in SPARK_HOME/jars themselves, or do some change like the Hive dependency handling in spark-bundle and ship the needed Avro version there. In this way, we can be compatible with Spark 2.2, 2.3 and 2.4.
I will continue to discuss this issue on JIRA later. The version I'm running in the production environment now is Hudi 0.4.8 with this PR added. If there are new changes, I can also do some experiments in my test environment.
Closing this PR right now. Some problems have been fixed in PR #1005. The remaining timestamp type problem will be discussed further in other JIRA issues.
Provide a way to let hoodie support timestamp and decimal.

Change the type of timestamp from long to int64 (logical_type=timestamp-millis).
Change the type of date from int to int32 (logical_type=date).
Change the type of decimal from string to fixed (logical_type=decimal).

In Spark, hoodie can correctly convert all the primitive data types into Parquet types (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L372).

In Hive, hoodie can correctly convert the decimal primitive type into the Parquet type, but can only read timestamp as long (ParquetHiveSerDe cannot read the logical_type).

Another thing to mention: we need to replace avro*-1.7.7.jar in SPARK_HOME/jars with avro*-1.8.2.jar, so that Spark can use the logical type classes.
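As a reference for the three mappings above, a minimal sketch of how the corresponding Avro schemas look when built programmatically (illustrative only; requires Avro 1.8.2+).

// Illustrative sketch of the logical types described above (Avro 1.8.2+).
import org.apache.avro.{LogicalTypes, Schema, SchemaBuilder}

object LogicalTypeSketch {
  def main(args: Array[String]): Unit = {
    // timestamp: long annotated with timestamp-millis
    val timestampMillis = LogicalTypes.timestampMillis()
      .addToSchema(Schema.create(Schema.Type.LONG))
    // date: int annotated with date
    val date = LogicalTypes.date()
      .addToSchema(Schema.create(Schema.Type.INT))
    // decimal: fixed annotated with decimal(precision, scale)
    val decimal = LogicalTypes.decimal(10, 2)
      .addToSchema(SchemaBuilder.fixed("price_fixed").size(5))

    Seq(timestampMillis, date, decimal).foreach(s => println(s.toString(true)))
  }
}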