Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name #38200

Closed

Conversation

amaliujia
Copy link
Contributor

What changes were proposed in this pull request?

This PR refactors proto's struct datatype with 1) each struct has a struct field list 2) each struct field has a name.

Why are the changes needed?

In the past, connect's struct datatype does not have name for its struct fields.

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT

@amaliujia
Copy link
Contributor Author

R: @cloud-fan
cc @grundprinzip @HyukjinKwon

@amaliujia
Copy link
Contributor Author

amaliujia commented Oct 11, 2022

BTW struct type is a blocker for us to support the full version of LocalRelation which is a blocker for us to support sparkSession.createDataFrame(in-memory data structure).


def protoQualifiedAttr: proto.Expression.QualifiedAttribute =
proto.Expression.QualifiedAttribute.newBuilder()
.setName(identifier.mkString("."))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For LocalRelation's attributes, how to convert identifier: Seq[String] to it?

Do we need to add an extra qualifier: Seq[String]?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually a.b should not directly map to QualifiedAttribute. It can be GetStructField(QualifiedAttribute...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can follow the catalyst dsl

  1. if we call protoAttr, then we parse the name (split by dot) and create UnresolvedAttribute
  2. if we call bool, then we directly use the name as attribute name and create AttributeReference

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah let me take a look.

@HyukjinKwon HyukjinKwon changed the title [SPARK-40743][CONNECT]StructType should contain a list of StructField and each field should have a name [SPARK-40743][CONNECT] StructType should contain a list of StructField and each field should have a name Oct 11, 2022
message StructField {
DataType type = 1;
string name = 2;
Nullability nullability = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't Nullability just a boolean?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and shall we include metadata?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

message StructField {
DataType type = 1;
string name = 2;
bool nullability = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current data model assumes that the type itself expresses its nullability, so why do we need nullability here?

I am not saying that the current approach of the datatype having nullability is great. I would also be good if we only set nullability in the StructField message.

Copy link
Contributor Author

@amaliujia amaliujia Oct 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a pure type system design, for composable type in which there are sub-types, both parent and children can have nullability. Use StructType as an example:

  1. Struct(nullability=true, structFields=[(name="a", nullability=true)]):
    val x = Struct("a"=NULL) // valid
    val y: Struct(nullability=true, structFields=[(name="a", nullability=true)] = NULL // valid

  2. Struct(nullability=true, structFields=[(name="a", nullability=false)]):
    val x = Struct("a"=NULL) // invalid
    val y: Struct(nullability=true, structFields=[(name="a", nullability=true)] = NULL // valid

  3. Struct(nullability=false, structFields=[(name="a", nullability=false)]):
    val x = Struct("a"=NULL) // invalid
    val y: Struct(nullability=true, structFields=[(name="a", nullability=true)] = NULL // invalid

Same for ARRAY, MAP, etc.

This is very general. I am thinking you are thinking a data model that is a subset of it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is different from Catalyst. The catalyst data model is:

  1. most data types are just enums: string, int, float, etc.
  2. some data types have parameters: decimal(3, 1), varchar(10), etc.
  3. there are 3 collection types: struct, array, map. These 3 types have nullability: struct field nullability, array element nullability, map value nullability.

Why is proto data type different?

Copy link
Contributor Author

@amaliujia amaliujia Oct 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the connect proto type system is a bit different on the container type. However I just refactor code here and this is existing code. We don't have tests for types neither.

How about let me follow up on this topic later after having more discussions with people. I can add more tests for nullability at that time.

@@ -114,7 +115,26 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
import org.apache.spark.sql.connect.dsl.plans._
transform(connectTestRelation.as("target_table"))
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary change

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 269acf7 Oct 18, 2022
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…d and each field should have a name

### What changes were proposed in this pull request?

This PR refactors proto's struct datatype with 1) each struct has a struct field list  2) each struct field has a name.

### Why are the changes needed?

In the past, connect's struct datatype does not have name for its struct fields.

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

UT

Closes apache#38200 from amaliujia/struct_type_should_contain_name.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants