Skip to content

Conversation

panbingkun
Copy link
Contributor

What changes were proposed in this pull request?

The pr aims to add Codegen Support for schema_of_csv.

Why are the changes needed?

  • improve codegen coverage.
  • simplified code.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass GA & Existed UT (eg: CsvFunctionsSuite#*schema_of_csv*)

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Oct 22, 2024
nullableSchema: DataType,
nameOfCorruptRecord: String,
timeZoneId: Option[String],
variantAllowDuplicateKeys: Boolean) extends Serializable {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not related to this PR, as we used the case class here, which implements Serializable by default, so we can remove it.

import org.apache.spark.unsafe.types.UTF8String

case class SchemaOfCsvEvaluator(options: Map[String, String]) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can share some objects to avoid creating them every time, thereby reducing the pressure on GC

@panbingkun panbingkun marked this pull request as ready for review October 22, 2024 11:47
@panbingkun
Copy link
Contributor Author

cc @MaxGekk @cloud-fan

@MaxGekk
Copy link
Member

MaxGekk commented Oct 23, 2024

+1, LGTM. Merging to master.
Thank you, @panbingkun.

@MaxGekk MaxGekk closed this in 369c40c Oct 23, 2024
@panbingkun
Copy link
Contributor Author

+1, LGTM. Merging to master. Thank you, @panbingkun.

Thanks @MaxGekk ❤️

cloud-fan pushed a commit that referenced this pull request Nov 28, 2024
…lable as false

### What changes were proposed in this pull request?
The pr is following up [schema_of_json](#48473), [schema_of_xml](#48594) and [schema_of_csv](#48595), to make  returnNullable as false.

### Why are the changes needed?
As `cloud-fan`'s comment https://github.com/apache/spark/pull/48594/files#r1860534460, we should follow the original logic, otherwise it's a regression.

https://github.com/apache/spark/blob/1a502d32ef5a69739e10b827be4c9063b2a20493/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L846

https://github.com/apache/spark/blob/1a502d32ef5a69739e10b827be4c9063b2a20493/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xmlExpressions.scala#L166

https://github.com/apache/spark/blob/1a502d32ef5a69739e10b827be4c9063b2a20493/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala#L141

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48987 from panbingkun/SPARK-50066_FOLLOWUP.

Authored-by: panbingkun <panbingkun@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants