Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VL] fix delta column mapping for struct type columns #4530

Merged
merged 4 commits into from
Jan 26, 2024

Conversation

zhli1142015
Copy link
Contributor

@zhli1142015 zhli1142015 commented Jan 26, 2024

What changes were proposed in this pull request?

We only handles column mapping for top layers, if there are struct type columns, then it would cause wrong result (null for struct type columns). This change leverage delta API to map the schema and attributes to solve the above issue.

Repro:

import org.apache.spark.sql.{AnalysisException, DataFrame, Row}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{ArrayType, IntegerType, MapType, StringType, StructType}
import scala.collection.JavaConverters._
val simpleNestedSchema = new StructType().add("a", StringType, true).add("b",new StructType().add("c", StringType, true).add("d", IntegerType, true)).add("map", MapType(StringType, StringType), true).add("arr", ArrayType(IntegerType), true)
val simpleNestedData = spark.createDataFrame(Seq(Row("str1", Row("str1.1", 1), Map("k1" -> "v1"), Array(1, 11)),Row("str2", Row("str1.2", 2), Map("k2" -> "v2"), Array(2, 22))).asJava, simpleNestedSchema)
spark.sql("CREATE TABLE t1 (a STRING,b STRUCT<c: STRING NOT NULL, d: INT>,map MAP<STRING, STRING>,arr ARRAY<INT>) USING DELTA PARTITIONED BY (`a`) TBLPROPERTIES ('delta.columnMapping.mode' = 'name')")
simpleNestedData.write.format("delta").mode("append").saveAsTable("t1")
spark.sql("select * from t1")

Result:

+----+----+----------+-------+
|a   |b   |map       |arr    |
+----+----+----------+-------+
|str2|**null**|{k2 -> v2}|[2, 22]|
|str1|**null**|{k1 -> v1}|[1, 11]|
+----+----+----------+-------+

Wrong plan:

-- Project[expressions: (n1_4:ROW<c:VARCHAR,d:INTEGER>, "n0_0"), (n1_5:MAP<VARCHAR,VARCHAR>, "n0_1"), (n1_6:ARRAY<INTEGER>, "n0_2"), (n1_7:VARCHAR, "n0_3")] -> n1_4:ROW<c:VARCHAR,d:INTEGER>, n1_5:MAP<VARCHAR,VARCHAR>, n1_6:ARRAY<INTEGER>, n1_7:VARCHAR
  -- TableScan[table: hive_table] -> n0_0:ROW<c:VARCHAR,d:INTEGER>, n0_1:MAP<VARCHAR,VARCHAR>, n0_2:ARRAY<INTEGER>, n0_3:VARCHAR

-->n0_0:ROW<c:VARCHAR,d:INTEGER>

Expected Result:

+----+-----------+----------+-------+
|a   |b          |map       |arr    |
+----+-----------+----------+-------+
|str2|{str1.2, 2}|{k2 -> v2}|[2, 22]|
|str1|{str1.1, 1}|{k1 -> v1}|[1, 11]|
+----+-----------+----------+-------+

Expected plan:

-- Project[expressions: (n1_4:VARCHAR, "n0_3"), (n1_5:ROW<"col-0dab91d3-baa7-4e3a-8633-b436d7cbe1dc":VARCHAR,"col-e394acdc-73fb-4e78-9b79-258e53143547":INTEGER>, "n0_0"), (n1_6:MAP<VARCHAR,VARCHAR>, "n0_1"), (n1_7:ARRAY<INTEGER>, "n0_2")] -> n1_4:VARCHAR, n1_5:ROW<"col-0dab91d3-baa7-4e3a-8633-b436d7cbe1dc":VARCHAR,"col-e394acdc-73fb-4e78-9b79-258e53143547":INTEGER>, n1_6:MAP<VARCHAR,VARCHAR>, n1_7:ARRAY<INTEGER>
  -- TableScan[table: hive_table] -> n0_0:ROW<"col-0dab91d3-baa7-4e3a-8633-b436d7cbe1dc":VARCHAR,"col-e394acdc-73fb-4e78-9b79-258e53143547":INTEGER>, n0_1:MAP<VARCHAR,VARCHAR>, n0_2:ARRAY<INTEGER>, n0_3:VARCHAR

--> n0_0:ROW<"col-0dab91d3-baa7-4e3a-8633-b436d7cbe1dc":VARCHAR,"col-e394acdc-73fb-4e78-9b79-258e53143547":INTEGER>

How was this patch tested?

UT.

Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Copy link

Run Gluten Clickhouse CI

@zhli1142015 zhli1142015 marked this pull request as ready for review January 26, 2024 07:15
@zhli1142015
Copy link
Contributor Author

zhli1142015 commented Jan 26, 2024

@liujiayi771, @YannByron, @yma11, @JkSelf @zhouyuan could you help review?

@zhli1142015 zhli1142015 changed the title [VL] fix delta column mapping for struct type fields [VL] fix delta column mapping for struct type columns Jan 26, 2024
@zhli1142015 zhli1142015 merged commit 15f51e8 into apache:main Jan 26, 2024
23 checks passed
@GlutenPerfBot
Copy link
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query log/native_4530_time.csv log/native_master_01_25_2024_c54d993cf_time.csv difference percentage
q1 32.96 33.10 0.138 100.42%
q2 23.72 23.80 0.083 100.35%
q3 36.42 38.54 2.124 105.83%
q4 37.62 38.13 0.508 101.35%
q5 68.16 68.95 0.788 101.16%
q6 5.46 6.99 1.535 128.11%
q7 82.16 81.35 -0.817 99.01%
q8 84.69 82.69 -2.002 97.64%
q9 120.32 121.78 1.457 101.21%
q10 42.25 43.34 1.085 102.57%
q11 20.47 19.67 -0.808 96.05%
q12 25.49 24.86 -0.630 97.53%
q13 45.05 44.65 -0.407 99.10%
q14 18.07 16.33 -1.737 90.38%
q15 26.97 25.93 -1.036 96.16%
q16 12.63 12.58 -0.053 99.58%
q17 99.69 99.43 -0.261 99.74%
q18 145.46 146.99 1.536 101.06%
q19 12.49 13.86 1.376 111.02%
q20 26.40 27.84 1.436 105.44%
q21 225.30 222.28 -3.025 98.66%
q22 13.38 14.97 1.595 111.92%
total 1205.17 1208.05 2.884 100.24%

@GlutenPerfBot
Copy link
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query log/native_master_01_26_2024_time.csv log/native_master_01_25_2024_c54d993cf_time.csv difference percentage
q1 32.35 33.10 0.745 102.30%
q2 23.86 23.80 -0.056 99.77%
q3 37.79 38.54 0.752 101.99%
q4 35.53 38.13 2.605 107.33%
q5 68.63 68.95 0.317 100.46%
q6 6.67 6.99 0.322 104.82%
q7 81.85 81.35 -0.501 99.39%
q8 81.03 82.69 1.660 102.05%
q9 121.70 121.78 0.083 100.07%
q10 42.62 43.34 0.718 101.68%
q11 19.76 19.67 -0.094 99.52%
q12 28.00 24.86 -3.137 88.80%
q13 45.98 44.65 -1.336 97.09%
q14 15.48 16.33 0.846 105.46%
q15 26.74 25.93 -0.815 96.95%
q16 14.26 12.58 -1.680 88.22%
q17 100.65 99.43 -1.218 98.79%
q18 145.21 146.99 1.781 101.23%
q19 14.13 13.86 -0.263 98.14%
q20 26.05 27.84 1.788 106.86%
q21 223.23 222.28 -0.958 99.57%
q22 13.44 14.97 1.530 111.38%
total 1204.97 1208.05 3.089 100.26%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants