
Timestamps not matching format are replaced with nulls #662

Closed
dolfinus opened this issue Oct 9, 2023 · 2 comments

@dolfinus
Contributor

dolfinus commented Oct 9, 2023

Hi.

I'm trying to parse a simple XML file:

```xml
<item>
  <created-at>2021-01-01T01:01:01+00:00</created-at>
</item>
```

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampType

spark = SparkSession.builder.config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0").getOrCreate()
schema = StructType([StructField("created-at", TimestampType())])
spark.read.format("xml").options(rowTag="item").schema(schema).load("1.xml").show()
```

Result:

| created-at          |
|---------------------|
| 2021-01-01 01:01:01 |

But if the timestamp does not match the format, e.g. the `T` is replaced with a space:

```xml
<item>
  <created-at>2021-01-01 01:01:01+00:00</created-at>
</item>
```

It is read as null:

| created-at |
|------------|
| null       |

I see that there is a `mode` option with `PERMISSIVE` as the default, which means that when the parser encounters a field of the wrong datatype, it sets the offending field to null. But the malformed value is not added to the `_corrupt_record` column, because there is nothing wrong with the XML structure itself.
So there is no way to detect whether the input file contains a tag with a wrong field value or a `nullValue`, unless the user sets a different mode.
Is that the desired behavior?
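
For illustration, here is a minimal sketch of opting into a stricter mode, continuing the snippet above; spark-xml documents `PERMISSIVE`, `DROPMALFORMED`, and `FAILFAST`, and the assumption here is that `FAILFAST` raises on the unparsable value rather than nulling it:

```python
# Minimal sketch (assumption: FAILFAST raises on a value that cannot be
# cast to the schema type, instead of silently replacing it with null
# as the default PERMISSIVE mode does).
spark.read.format("xml") \
    .options(rowTag="item", mode="FAILFAST") \
    .schema(schema) \
    .load("1.xml") \
    .show()
```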

@dolfinus changed the title from "Timestamps not matchinf format are replaced with nulls" to "Timestamps not matching format are replaced with nulls" on Oct 9, 2023
@srowen
Collaborator

srowen commented Oct 9, 2023

You did not include the column `_corrupt_record` in your schema. It's added automatically if you infer the schema; otherwise you need to add it yourself. If it isn't present in the schema, the malformed record has nowhere to be recorded.

@srowen srowen closed this as completed Oct 9, 2023
@dolfinus
Contributor Author

dolfinus commented Oct 9, 2023

Tried:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampType, StringType

spark = SparkSession.builder.config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0").getOrCreate()
schema = StructType([StructField("created-at", TimestampType()), StructField("_corrupt_record", StringType())])
spark.read.format("xml").options(rowTag="item").schema(schema).load("1.xml").show(10, False)
```

| created-at | _corrupt_record                                                        |
|------------|------------------------------------------------------------------------|
| null       | <item>\n  <created-at>2021-01-01 01:01:01+00:00</created-at>\n</item>   |

It is worth mentioning in the README that `_corrupt_record` should be explicitly added to the DataFrame schema.
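
As a side note, a minimal sketch of parsing the non-ISO value itself, assuming spark-xml's documented `timestampFormat` option accepts a Java `DateTimeFormatter` pattern and that the pattern below matches the space-separated value:

```python
# Sketch: supply an explicit timestamp pattern so the space-separated
# value parses instead of becoming null. The pattern is an assumption
# chosen to match "2021-01-01 01:01:01+00:00".
spark.read.format("xml") \
    .options(rowTag="item", timestampFormat="yyyy-MM-dd HH:mm:ssXXX") \
    .schema(schema) \
    .load("1.xml") \
    .show()
```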
