New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name #38200
[SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name #38200
Conversation
BTW struct type is a blocker for us to support the full version of |
|
||
def protoQualifiedAttr: proto.Expression.QualifiedAttribute = | ||
proto.Expression.QualifiedAttribute.newBuilder() | ||
.setName(identifier.mkString(".")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For LocalRelation's attributes, how to convert identifier: Seq[String]
to it?
Do we need to add an extra qualifier: Seq[String]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually a.b
should not directly map to QualifiedAttribute
. It can be GetStructField(QualifiedAttribute...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can follow the catalyst dsl
- if we call
protoAttr
, then we parse the name (split by dot) and createUnresolvedAttribute
- if we call
bool
, then we directly use the name as attribute name and createAttributeReference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah let me take a look.
message StructField { | ||
DataType type = 1; | ||
string name = 2; | ||
Nullability nullability = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't Nullability just a boolean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and shall we include metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Can one of the admins verify this patch? |
message StructField { | ||
DataType type = 1; | ||
string name = 2; | ||
bool nullability = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current data model assumes that the type itself expresses its nullability, so why do we need nullability here?
I am not saying that the current approach of the datatype having nullability is great. I would also be good if we only set nullability in the StructField message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a pure type system design, for composable type in which there are sub-types, both parent and children can have nullability. Use StructType as an example:
-
Struct(nullability=true, structFields=[(name="a", nullability=true)]):
val x = Struct("a"=NULL) // valid
val y: Struct(nullability=true, structFields=[(name="a", nullability=true)] = NULL // valid -
Struct(nullability=true, structFields=[(name="a", nullability=false)]):
val x = Struct("a"=NULL) // invalid
val y: Struct(nullability=true, structFields=[(name="a", nullability=true)] = NULL // valid -
Struct(nullability=false, structFields=[(name="a", nullability=false)]):
val x = Struct("a"=NULL) // invalid
val y: Struct(nullability=true, structFields=[(name="a", nullability=true)] = NULL // invalid
Same for ARRAY, MAP, etc.
This is very general. I am thinking you are thinking a data model that is a subset of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is different from Catalyst. The catalyst data model is:
- most data types are just enums: string, int, float, etc.
- some data types have parameters: decimal(3, 1), varchar(10), etc.
- there are 3 collection types: struct, array, map. These 3 types have nullability: struct field nullability, array element nullability, map value nullability.
Why is proto data type different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the connect proto type system is a bit different on the container type. However I just refactor code here and this is existing code. We don't have tests for types neither.
How about let me follow up on this topic later after having more discussions with people. I can add more tests for nullability at that time.
db4ea09
to
993c2c3
Compare
@@ -114,7 +115,26 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest { | |||
import org.apache.spark.sql.connect.dsl.plans._ | |||
transform(connectTestRelation.as("target_table")) | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessary change
thanks, merging to master! |
…d and each field should have a name ### What changes were proposed in this pull request? This PR refactors proto's struct datatype with 1) each struct has a struct field list 2) each struct field has a name. ### Why are the changes needed? In the past, connect's struct datatype does not have name for its struct fields. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes apache#38200 from amaliujia/struct_type_should_contain_name. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
This PR refactors proto's struct datatype with 1) each struct has a struct field list 2) each struct field has a name.
Why are the changes needed?
In the past, connect's struct datatype does not have name for its struct fields.
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT