
Timestamps not matching format are replaced with nulls #662

Closed
dolfinus opened this issue Oct 9, 2023 · 2 comments

@dolfinus
Contributor

dolfinus commented Oct 9, 2023

Hi.

I'm trying to parse a simple XML file:

```xml
<item>
  <created-at>2021-01-01T01:01:01+00:00</created-at>
</item>
```

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampType

spark = SparkSession.builder.config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0").getOrCreate()
schema = StructType([StructField("created-at", TimestampType())])
spark.read.format("xml").options(rowTag="item").schema(schema).load("1.xml").show()
```

Result:

| created-at          |
|---------------------|
| 2021-01-01 01:01:01 |

But if the timestamp does not match the format, e.g. the `T` is replaced with a space:

```xml
<item>
  <created-at>2021-01-01 01:01:01+00:00</created-at>
</item>
```

It is read as null:

| created-at |
|------------|
| null       |

I see that there is a `mode` option with `PERMISSIVE` as the default, which means that when the parser encounters a field of the wrong datatype, it sets the offending field to null. But the malformed value is not added to the `_corrupt_record` column, because there is nothing wrong with the XML structure itself.
So there is no way to detect whether the input file contains a tag with a wrong field value or a `nullValue`, unless the user sets a different mode.
Is that the desired behavior?
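
For illustration, here is a minimal sketch of opting into a stricter mode, continuing the snippet above; spark-xml documents `PERMISSIVE`, `DROPMALFORMED`, and `FAILFAST`, and the assumption here is that `FAILFAST` raises on the unparsable value rather than nulling it:

```python
# Minimal sketch (assumption: FAILFAST raises on a value that cannot be
# cast to the schema type, instead of silently replacing it with null
# as the default PERMISSIVE mode does).
spark.read.format("xml") \
    .options(rowTag="item", mode="FAILFAST") \
    .schema(schema) \
    .load("1.xml") \
    .show()
```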

@dolfinus changed the title from "Timestamps not matchinf format are replaced with nulls" to "Timestamps not matching format are replaced with nulls" on Oct 9, 2023
@srowen
Collaborator

srowen commented Oct 9, 2023

You did not include the column `_corrupt_record` in your schema. It's added automatically if you infer the schema; otherwise you need to add it yourself. If it isn't present in the schema, the malformed record has nowhere to be recorded.

@srowen srowen closed this as completed Oct 9, 2023
@dolfinus
Contributor Author

dolfinus commented Oct 9, 2023

Tried:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampType, StringType

spark = SparkSession.builder.config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0").getOrCreate()
schema = StructType([StructField("created-at", TimestampType()), StructField("_corrupt_record", StringType())])
spark.read.format("xml").options(rowTag="item").schema(schema).load("1.xml").show(10, False)
```

| created-at | _corrupt_record                                                        |
|------------|------------------------------------------------------------------------|
| null       | <item>\n  <created-at>2021-01-01 01:01:01+00:00</created-at>\n</item>   |

It is worth mentioning in the README that `_corrupt_record` should be explicitly added to the DataFrame schema.
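
As a side note, a minimal sketch of parsing the non-ISO value itself, assuming spark-xml's documented `timestampFormat` option accepts a Java `DateTimeFormatter` pattern and that the pattern below matches the space-separated value:

```python
# Sketch: supply an explicit timestamp pattern so the space-separated
# value parses instead of becoming null. The pattern is an assumption
# chosen to match "2021-01-01 01:01:01+00:00".
spark.read.format("xml") \
    .options(rowTag="item", timestampFormat="yyyy-MM-dd HH:mm:ssXXX") \
    .schema(schema) \
    .load("1.xml") \
    .show()
```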
