[SPARK-25654][SQL] Support for nested JavaBean arrays, lists and maps in createDataFrame #22646
Changes to `org.apache.spark.sql.SQLContext`:

```diff
@@ -17,8 +17,10 @@
 
 package org.apache.spark.sql
 
+import java.lang.reflect.{Array => JavaArray, ParameterizedType, Type}
 import java.util.Properties
 
 import scala.collection.JavaConverters._
 import scala.collection.immutable
 import scala.reflect.runtime.universe.TypeTag
 
@@ -30,6 +32,7 @@ import org.apache.spark.internal.config.ConfigEntry
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.catalyst._
 import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, GenericArrayData}
 import org.apache.spark.sql.execution.command.ShowTablesCommand
 import org.apache.spark.sql.internal.{SessionState, SharedState, SQLConf}
 import org.apache.spark.sql.sources.BaseRelation
```
```diff
@@ -1098,12 +1101,20 @@ object SQLContext {
       data: Iterator[_],
       beanClass: Class[_],
       attrs: Seq[AttributeReference]): Iterator[InternalRow] = {
+    def interfaceParameters(t: Type, interface: Class[_], dataType: DataType): Array[Type] =
+      t match {
+        case parType: ParameterizedType if parType.getRawType == interface =>
+          parType.getActualTypeArguments
+        case _ => throw new UnsupportedOperationException(
+          s"Type ${t.getTypeName} is not supported for data type ${dataType.simpleString}. " +
+            s"Expected ${interface.getName}")
+      }
     def createStructConverter(cls: Class[_], fieldTypes: Seq[DataType]): Any => InternalRow = {
       val methodConverters =
         JavaTypeInference.getJavaBeanReadableProperties(cls).zip(fieldTypes)
           .map { case (property, fieldType) =>
             val method = property.getReadMethod
-            method -> createConverter(method.getReturnType, fieldType)
+            method -> createConverter(method.getGenericReturnType, fieldType)
           }
       value =>
         if (value == null) {
@@ -1115,9 +1126,38 @@ object SQLContext {
           })
         }
     }
-    def createConverter(cls: Class[_], dataType: DataType): Any => Any = dataType match {
-      case struct: StructType => createStructConverter(cls, struct.map(_.dataType))
-      case _ => CatalystTypeConverters.createToCatalystConverter(dataType)
+    def createConverter(t: Type, dataType: DataType): Any => Any = (t, dataType) match {
+      case (cls: Class[_], struct: StructType) =>
```
Review comments on this change:

- wait .. can we reuse …
- (quoting `spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala`, lines 131 to 132 at commit 5264164) We should drop the support for getter or setter only. Adding @cloud-fan here as well.
- Reusing …
- Hm, how about we fix them together while we are here? I also checked another difference, which is beans without a getter and/or setter, but I think this is something we should fix in 3.0.
- Frankly, I was not really sure about serializing sets as arrays, since the result stops behaving like a set, but I found a PR (#18416) where this seems to have been permitted, so I will go ahead and add that.
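To make the "getter or setter only" point concrete, here is a small hypothetical sketch (the bean and property names are invented) of how `java.beans` introspection reports read-only and write-only properties; `getJavaBeanReadableProperties` in the diff above keeps only the properties that expose a read method:

```scala
import java.beans.Introspector

// Hypothetical bean used only for illustration.
class ExampleBean {
  private var name: String = _
  def getName: String = name        // readable property: getter only, no setter
  def setAge(a: Int): Unit = ()     // write-only property: setter only, no getter
}

val props = Introspector.getBeanInfo(classOf[ExampleBean]).getPropertyDescriptors
  .filterNot(_.getName == "class")  // drop the synthetic "class" property
props.foreach { p =>
  println(s"${p.getName}: readable=${p.getReadMethod != null}, " +
    s"writable=${p.getWriteMethod != null}")
}
// Expected output (roughly):
//   age: readable=false, writable=true
//   name: readable=true, writable=false
```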
The hunk continues:

```diff
+        // bean type
+        createStructConverter(cls, struct.map(_.dataType))
+      case (arrayType: Class[_], array: ArrayType) if arrayType.isArray =>
+        // array type
+        val converter = createConverter(arrayType.getComponentType, array.elementType)
+        value => new GenericArrayData(
+          (0 until JavaArray.getLength(value)).map(i =>
+            converter(JavaArray.get(value, i))).toArray)
+      case (_, array: ArrayType) =>
```
Review comments on this change:

- Can you add a few code comments explaining why there are two cases both for …
- Sorry, I should have added a check for …
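As a minimal sketch of the question above (class and getter names are hypothetical): a Java array and a `java.util.List` both map to Spark's `ArrayType`, but reflection reports them as different kinds of `Type`, which is why two cases are needed:

```scala
import java.lang.reflect.{ParameterizedType, Type}

// Hypothetical bean with one array-typed and one List-typed property.
class ArraysAndLists {
  def getTags: Array[String] = Array("a", "b")
  def getNames: java.util.List[String] = java.util.Arrays.asList("c")
}

val tags: Type = classOf[ArraysAndLists].getMethod("getTags").getGenericReturnType
val names: Type = classOf[ArraysAndLists].getMethod("getNames").getGenericReturnType

// String[] is reported as a plain Class whose isArray is true ...
println(tags.asInstanceOf[Class[_]].isArray)    // true
// ... while List<String> is a ParameterizedType carrying its element type.
println(names.isInstanceOf[ParameterizedType])  // true
```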
The hunk continues:

```diff
+        // java.util.List type
+        val cls = classOf[java.util.List[_]]
```
Review comments on this change:

- Seems like …
- I think you are right. It would be better to change it to avoid confusion. I also agree with a separate PR for that.
- On second thoughts, we should use …
The hunk continues:

```diff
+        val params = interfaceParameters(t, cls, dataType)
+        val converter = createConverter(params(0), array.elementType)
+        value => new GenericArrayData(
+          value.asInstanceOf[java.util.List[_]].asScala.map(converter).toArray)
+      case (_, map: MapType) =>
+        // java.util.Map type
+        val cls = classOf[java.util.Map[_, _]]
+        val params = interfaceParameters(t, cls, dataType)
+        val keyConverter = createConverter(params(0), map.keyType)
+        val valueConverter = createConverter(params(1), map.valueType)
+        value => {
+          val (keys, values) = value.asInstanceOf[java.util.Map[_, _]].asScala.unzip[Any, Any]
+          new ArrayBasedMapData(
+            new GenericArrayData(keys.map(keyConverter).toArray),
+            new GenericArrayData(values.map(valueConverter).toArray))
+        }
+      case _ =>
+        // other types
+        CatalystTypeConverters.createToCatalystConverter(dataType)
     }
     val dataConverter = createStructConverter(beanClass, attrs.map(_.dataType))
     data.map(dataConverter)
```
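For reference, a hypothetical end-to-end sketch of what this patch is meant to enable (all class, field and variable names are invented, and `spark` is assumed to be an existing `SparkSession`): a bean whose properties include a list of nested beans and a map.

```scala
import java.util.{Arrays, Collections}
import scala.beans.BeanProperty

// Hypothetical nested bean.
class Address(@BeanProperty var city: String) {
  def this() = this(null)
}

// Hypothetical top-level bean with a nested-bean list and a map property.
class Person(
    @BeanProperty var name: String,
    @BeanProperty var addresses: java.util.List[Address],
    @BeanProperty var scores: java.util.Map[String, java.lang.Integer]) {
  def this() = this(null, null, null)
}

val people = Arrays.asList(
  new Person("Ann",
    Arrays.asList(new Address("Prague")),
    Collections.singletonMap("math", Int.box(1))))

// With this change, the Address beans nested inside the list and the map entries
// are converted recursively rather than being rejected by the generic converter.
val df = spark.createDataFrame(people, classOf[Person])
df.printSchema()
// Expected schema (roughly): addresses: array<struct<city:string>>,
// name: string, scores: map<string,int>
```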
General review discussion:

- BTW, how about we put this method in `CatalystTypeConverters`? It looks like it is a Catalyst converter for beans. A few Java types like `java.lang.Iterable`, `java.math.BigDecimal` and `java.math.BigInteger` are already being handled there.
- I'm okay with moving this to `CatalystTypeConverters`, but note that, unfortunately, it seems `CatalystTypeConverters` doesn't work properly with nested beans as we are trying to support them here.
- Yea .. I was just thinking of moving this func there .. it looks ugly that this file is getting long.
- I took a quick look at `CatalystTypeConverters` and I believe there would be a problem in not being able to reliably distinguish Java beans from other arbitrary classes. We might use setters or set fields directly on objects that are not prepared for such manipulation, potentially creating hard-to-find errors. This method already assumes a Java bean, so that problem is not present here. Isn't that so?
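A rough illustration of the concern in the last comment (this touches Spark-internal APIs, so it is only expected to compile inside Spark's own `org.apache.spark.sql` packages, and the class names are invented): the existing struct conversion in `CatalystTypeConverters` accepts `Row`s and Scala `Product`s, but it cannot tell from a value alone that an arbitrary class should be treated as a JavaBean.

```scala
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(StructField("city", StringType)))
val toCatalyst = CatalystTypeConverters.createToCatalystConverter(schema)

// A case class is a Product, which the generic struct converter understands.
case class AddressProduct(city: String)
toCatalyst(AddressProduct("Prague"))   // converts to an InternalRow

// A JavaBean is neither a Row nor a Product, so the generic converter rejects it;
// that is why beansToRows keeps its own bean-aware converter.
class AddressBean { def getCity: String = "Prague" }
// toCatalyst(new AddressBean())       // would be rejected at runtime
```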