[SPARK-45433][SQL] Fix CSV/JSON schema inference when timestamps do not match specified timestampFormat #43243

Closed
@@ -27,7 +27,7 @@ import org.apache.spark.sql.catalyst.expressions.ExprUtils
 import org.apache.spark.sql.catalyst.util.{DateFormatter, TimestampFormatter}
 import org.apache.spark.sql.catalyst.util.LegacyDateFormats.FAST_DATE_FORMAT
 import org.apache.spark.sql.errors.QueryExecutionErrors
-import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.{LegacyBehaviorPolicy, SQLConf}
 import org.apache.spark.sql.types._

 class CSVInferSchema(val options: CSVOptions) extends Serializable {
@@ -202,8 +202,11 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
       // We can only parse the value as TimestampNTZType if it does not have zone-offset or
       // time-zone component and can be parsed with the timestamp formatter.
       // Otherwise, it is likely to be a timestamp with timezone.
-      if (timestampNTZFormatter.parseWithoutTimeZoneOptional(field, false).isDefined) {
-        SQLConf.get.timestampType
+      val timestampType = SQLConf.get.timestampType
+      if ((SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY ||
+          timestampType == TimestampNTZType) &&
+          timestampNTZFormatter.parseWithoutTimeZoneOptional(field, false).isDefined) {
Comment on lines +206 to +208
Member

Could you clarify this: with
(legacyTimeParserPolicy == LEGACY || timestampType == TimestampLTZType)
we try to parse the value as NTZ, and if it is parsable we return TimestampLTZType?

This confuses me: we return TIMESTAMP_LTZ when the input was parsed by an NTZ function.

Member

also cc @gengliangwang

Member Author

  1. The LEGACY behavior uses timestampNTZFormatter to parse timestamps, so I didn't change it for LEGACY mode. Without this, test cases such as CSVLegacyTimeParserSuite."SPARK-37326: Timestamp type inference for a column with TIMESTAMP_NTZ values" can't pass. https://github.com/Hisoka-X/spark/runs/17462554632
  2. It should be (legacyTimeParserPolicy = LEGACY || timestampType == TimestampNTZType), not (legacyTimeParserPolicy = LEGACY || timestampType == TimestampLTZType), if I understand correctly.

Member

> Because the LEGACY behavior used timestampNTZFormatter to parse timestamp.

I see. It is ok if such legacy behaviour is covered by a test.

Contributor

> Because the LEGACY behavior used timestampNTZFormatter to parse timestamp.

I can't find the related code, @Hisoka-X can you point to it?

Member Author

Because the string format is the same for the two types.
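A minimal java.time sketch of this point (standalone, not Spark code): a timestamp string without a zone offset is syntactically valid for both TIMESTAMP_NTZ and TIMESTAMP_LTZ, so parsing alone cannot distinguish the two types.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// The same zone-less text parses fine as a LocalDateTime (the NTZ shape);
// whether Spark types the column NTZ or LTZ is decided by session config,
// not by anything in the string itself.
val text = "2023-06-24T02:45:51"
val asNtz = LocalDateTime.parse(text, DateTimeFormatter.ISO_LOCAL_DATE_TIME)
```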

Contributor

Then we hit the else branch and tryParseTimestamp can infer the type properly?

Member Author

According to the test cases, I think yes. Is there any case that is not handled correctly now?

Contributor

The code looks wrong: we may infer LTZ using the NTZ formatter. This is a potential bug that could bite us in the future.

Member Author

Yep, but it only happens in legacy mode. Feel free to change it if you think the legacy behavior is not right.

+        timestampType
       } else {
         tryParseTimestamp(field)
       }
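The revised control flow above can be sketched standalone, with the Spark internals replaced by simple stand-ins (all names below are hypothetical, not Spark's API):

```scala
sealed trait InferredType
case object TimestampNTZ extends InferredType
case object TimestampLTZ extends InferredType
case object StringFallback extends InferredType

// Stand-in for the fixed branch in CSVInferSchema: only honor the NTZ
// formatter when legacy parsing is on or the session prefers NTZ;
// otherwise fall through, so a value matching neither formatter ends up
// as a string instead of a wrongly inferred timestamp.
def inferTimestamp(
    field: String,
    sessionType: InferredType,           // SQLConf.get.timestampType stand-in
    legacyPolicy: Boolean,               // legacyTimeParserPolicy == LEGACY
    parsesAsNtz: String => Boolean,      // timestampNTZFormatter stand-in
    parsesWithFormat: String => Boolean  // timestampFormatter stand-in
): InferredType = {
  if ((legacyPolicy || sessionType == TimestampNTZ) && parsesAsNtz(field)) {
    sessionType
  } else if (parsesWithFormat(field)) {
    TimestampLTZ
  } else {
    StringFallback
  }
}
```

Under the default session type (LTZ) and a strict user-specified format, a fractional-second value that only the NTZ formatter accepts now falls through to the string fallback, which is the scenario the new tests pin down.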
Expand Up @@ -32,7 +32,7 @@ import org.apache.spark.sql.catalyst.json.JacksonUtils.nextUntil
import org.apache.spark.sql.catalyst.util._
import org.apache.spark.sql.catalyst.util.LegacyDateFormats.FAST_DATE_FORMAT
import org.apache.spark.sql.errors.QueryExecutionErrors
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.internal.{LegacyBehaviorPolicy, SQLConf}
import org.apache.spark.sql.types._
import org.apache.spark.util.Utils

@@ -148,11 +148,13 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
           val bigDecimal = decimalParser(field)
           DecimalType(bigDecimal.precision, bigDecimal.scale)
         }
+        val timestampType = SQLConf.get.timestampType
         if (options.prefersDecimal && decimalTry.isDefined) {
           decimalTry.get
-        } else if (options.inferTimestamp &&
+        } else if (options.inferTimestamp && (SQLConf.get.legacyTimeParserPolicy ==
+            LegacyBehaviorPolicy.LEGACY || timestampType == TimestampNTZType) &&
             timestampNTZFormatter.parseWithoutTimeZoneOptional(field, false).isDefined) {
-          SQLConf.get.timestampType
+          timestampType
         } else if (options.inferTimestamp &&
             timestampFormatter.parseOptional(field).isDefined) {
           TimestampType
@@ -263,4 +263,14 @@ class CSVInferSchemaSuite extends SparkFunSuite with SQLHelper {
     inferSchema = new CSVInferSchema(options)
     assert(inferSchema.inferField(DateType, "2012_12_12") == DateType)
   }
+
+  test("SPARK-45433: inferring the schema when timestamps do not match specified timestampFormat" +
+    " with only one row") {
+    val options = new CSVOptions(
+      Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss"),
+      columnPruning = false,
+      defaultTimeZoneId = "UTC")
+    val inferSchema = new CSVInferSchema(options)
+    assert(inferSchema.inferField(NullType, "2884-06-24T02:45:51.138") == StringType)
+  }
 }
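For context on the sample value in the new test: in java.time (which Spark's non-legacy formatters build on), the pattern yyyy-MM-dd'T'HH:mm:ss has no fractional-second field, so the .138 suffix leaves unparsed text and strict parsing fails, while the ISO formatter accepts an optional fraction. A standalone sketch:

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// The user-specified pattern consumes no fractional seconds, so the
// trailing ".138" is left over and parse() throws DateTimeParseException.
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")
val strict = Try(LocalDateTime.parse("2884-06-24T02:45:51.138", fmt))

// The ISO formatter, by contrast, accepts an optional fraction, which is
// why the NTZ formatter could previously claim this value during inference.
val iso = Try(LocalDateTime.parse("2884-06-24T02:45:51.138",
  DateTimeFormatter.ISO_LOCAL_DATE_TIME))
```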
@@ -112,4 +112,12 @@ class JsonInferSchemaSuite extends SparkFunSuite with SQLHelper {
     checkType(Map("inferTimestamp" -> "true"), json, TimestampType)
     checkType(Map("inferTimestamp" -> "false"), json, StringType)
   }
+
+  test("SPARK-45433: inferring the schema when timestamps do not match specified timestampFormat" +
+    " with only one row") {
+    checkType(
+      Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss", "inferTimestamp" -> "true"),
+      """{"a": "2884-06-24T02:45:51.138"}""",
+      StringType)
+  }
 }