[SPARK-50067][SQL] Codegen Support for SchemaOfCsv(by Invoke & RuntimeReplaceable) #48595

panbingkun · 2024-10-22T08:42:48Z

What changes were proposed in this pull request?

This PR adds codegen support for schema_of_csv by implementing SchemaOfCsv with Invoke and RuntimeReplaceable.
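For readers unfamiliar with the approach, here is a minimal, self-contained sketch of the general idea: the expression does not evaluate itself but exposes a replacement that invokes a method on a separate evaluator object. Every name below (`Expr`, `Lit`, `InvokeSketch`, `RuntimeReplaceableSketch`, `CsvSchemaEvaluatorSketch`, `SchemaOfCsvSketch`) is a hypothetical stand-in, not one of Spark's Catalyst classes, and the "schema inference" is a toy rule used only to make the example runnable.

```scala
// Simplified stand-ins for illustration only; these are NOT Spark's Catalyst classes.
trait Expr { def eval(): Any }

case class Lit(value: Any) extends Expr {
  def eval(): Any = value
}

// A tiny interpreted "Invoke": evaluates the arguments, then calls the named
// method on the target object via Java reflection.
case class InvokeSketch(target: AnyRef, method: String, args: Seq[Expr]) extends Expr {
  def eval(): Any = {
    val values = args.map(_.eval().asInstanceOf[AnyRef])
    val m = target.getClass.getMethods
      .find(c => c.getName == method && c.getParameterCount == values.length)
      .getOrElse(throw new NoSuchMethodException(method))
    m.invoke(target, values: _*)
  }
}

// An expression that does not evaluate itself; it delegates to `replacement`.
trait RuntimeReplaceableSketch extends Expr {
  def replacement: Expr
  def eval(): Any = replacement.eval()
}

// Hypothetical evaluator with toy inference: all-digit fields are INT, the rest STRING.
case class CsvSchemaEvaluatorSketch(options: Map[String, String]) {
  private val sep = java.util.regex.Pattern.quote(options.getOrElse("sep", ","))

  def evaluate(csv: String): String =
    csv.split(sep, -1).zipWithIndex.map { case (field, i) =>
      val tpe = if (field.nonEmpty && field.forall(_.isDigit)) "INT" else "STRING"
      s"_c$i: $tpe"
    }.mkString("STRUCT<", ", ", ">")
}

case class SchemaOfCsvSketch(child: Expr, options: Map[String, String])
    extends RuntimeReplaceableSketch {
  // All evaluation logic lives in the evaluator; the expression only wires it up.
  lazy val replacement: Expr =
    InvokeSketch(CsvSchemaEvaluatorSketch(options), "evaluate", Seq(child))
}

object InvokeSketchDemo {
  def main(args: Array[String]): Unit = {
    val expr = SchemaOfCsvSketch(Lit("1,abc"), Map.empty)
    println(expr.eval()) // prints STRUCT<_c0: INT, _c1: STRING>
  }
}
```

The point of the split is that the expression class carries no evaluation logic of its own, so a single evaluator method can serve both interpreted and generated code paths.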

Why are the changes needed?

Improve codegen coverage.
Simplify the code.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Passes GitHub Actions (GA) and the existing unit tests (e.g., CsvFunctionsSuite#*schema_of_csv*).

Was this patch authored or co-authored using generative AI tooling?

No.

panbingkun · 2024-10-22T08:46:16Z

.../src/main/scala/org/apache/spark/sql/catalyst/expressions/json/JsonExpressionEvalUtils.scala

    nullableSchema: DataType,
    nameOfCorruptRecord: String,
    timeZoneId: Option[String],
-    variantAllowDuplicateKeys: Boolean) extends Serializable {


This change is not related to the main purpose of this PR. The class here is a case class, which implements Serializable by default, so the explicit `extends Serializable` can be removed.
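As a quick check of that claim, the following sketch Java-serializes a case class that declares no explicit `extends Serializable`. `EvaluatorSketch` and `SerializableCheck` are hypothetical names for illustration and only mirror two of the fields visible in the diff; this is not the class from the PR.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in for the evaluator in the diff above: a plain case class
// with no explicit `extends Serializable`.
case class EvaluatorSketch(nameOfCorruptRecord: String, timeZoneId: Option[String])

object SerializableCheck {
  // Java-serializes the value and returns the number of bytes written.
  // Throws java.io.NotSerializableException if the value is not serializable.
  def serializedSize(value: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(value) finally out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    val evaluator = EvaluatorSketch("_corrupt_record", Some("UTC"))
    // The compiler mixes Serializable into every case class.
    assert(evaluator.isInstanceOf[java.io.Serializable])
    assert(serializedSize(evaluator) > 0)
  }
}
```

Note that a case class is only serializable in practice if all of its fields are serializable as well.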

panbingkun · 2024-10-22T09:00:31Z

...st/src/main/scala/org/apache/spark/sql/catalyst/expressions/csv/CsvExpressionEvalUtils.scala

+import org.apache.spark.unsafe.types.UTF8String
+
+case class SchemaOfCsvEvaluator(options: Map[String, String]) {
+


I think we can share some objects across calls instead of creating them every time, which reduces GC pressure.
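One common way to share such objects, sketched here under the assumption that they depend only on the options, is to hold them in `@transient lazy val` fields of the evaluator so they are built once on first use and reused by later calls. `ParsedOptionsSketch`, `SharedEvaluatorSketch`, and `columnCount` are hypothetical names, not taken from the PR.

```scala
import java.util.regex.Pattern

// Hypothetical "expensive" helper; the companion counter records how many are built.
final class ParsedOptionsSketch(raw: Map[String, String]) {
  ParsedOptionsSketch.created += 1
  val separator: String = Pattern.quote(raw.getOrElse("sep", ","))
}

object ParsedOptionsSketch {
  var created: Int = 0
}

case class SharedEvaluatorSketch(options: Map[String, String]) {
  // Built on first use, then reused by every later call on this evaluator.
  // @transient keeps the helper out of the serialized form; it is rebuilt lazily.
  @transient private lazy val parsed = new ParsedOptionsSketch(options)

  def columnCount(csv: String): Int = csv.split(parsed.separator, -1).length
}

object SharedEvaluatorDemo {
  def main(args: Array[String]): Unit = {
    val before = ParsedOptionsSketch.created
    val evaluator = SharedEvaluatorSketch(Map("sep" -> ";"))
    evaluator.columnCount("a;b;c")
    evaluator.columnCount("1;2")
    // Two calls, but the helper was constructed only once.
    assert(ParsedOptionsSketch.created == before + 1)
  }
}
```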

panbingkun · 2024-10-22T11:47:27Z

cc @MaxGekk @cloud-fan

MaxGekk · 2024-10-23T08:27:58Z

+1, LGTM. Merging to master.
Thank you, @panbingkun.

panbingkun · 2024-10-23T08:36:26Z

> +1, LGTM. Merging to master. Thank you, @panbingkun.

Thanks @MaxGekk ❤️

…lable as false

### What changes were proposed in this pull request?
This PR follows up [schema_of_json](#48473), [schema_of_xml](#48594) and [schema_of_csv](#48595), setting returnNullable to false.

### Why are the changes needed?
Per `cloud-fan`'s comment https://github.com/apache/spark/pull/48594/files#r1860534460, we should follow the original logic; otherwise it is a regression.
https://github.com/apache/spark/blob/1a502d32ef5a69739e10b827be4c9063b2a20493/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L846
https://github.com/apache/spark/blob/1a502d32ef5a69739e10b827be4c9063b2a20493/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xmlExpressions.scala#L166
https://github.com/apache/spark/blob/1a502d32ef5a69739e10b827be4c9063b2a20493/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala#L141

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48987 from panbingkun/SPARK-50066_FOLLOWUP.

Authored-by: panbingkun <panbingkun@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

[SPARK-50067][SQL] Codegen Support for SchemaOfCsv(by Invoke & RuntimeReplaceable)

d9f9950

github-actions bot added the SQL label Oct 22, 2024

fix ut

b77600d

github-actions bot added the CONNECT label Oct 22, 2024

panbingkun marked this pull request as ready for review October 22, 2024 11:47

MaxGekk approved these changes Oct 23, 2024

MaxGekk closed this in 369c40c Oct 23, 2024

panbingkun mentioned this pull request Nov 27, 2024

[SPARK-50067][SPARK-50066][SPARK-49954][SQL][FOLLOWUP] Make returnNullable as false #48987

Closed
