
[SPARK-45035][SQL] Fix ignoreCorruptFiles/ignoreMissingFiles with multiline CSV/JSON will report error #42979

Closed
wants to merge 13 commits

Conversation


@Hisoka-X Hisoka-X commented Sep 18, 2023

What changes were proposed in this pull request?

Fix the bug where using ignoreCorruptFiles/ignoreMissingFiles with multiline CSV/JSON reports an error like the following:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4940.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4940.0 (TID 4031) (10.68.177.106 executor 0): com.univocity.parsers.common.TextParsingException: java.lang.IllegalStateException - Error reading from input
Parser Configuration: CsvParserSettings:
	Auto configuration enabled=true
	Auto-closing enabled=true
	Autodetect column delimiter=false
	Autodetect quotes=false
	Column reordering enabled=true
	Delimiters for detection=null
	Empty value=
	Escape unquoted values=false
	Header extraction enabled=null
	Headers=null
	Ignore leading whitespaces=false
	Ignore leading whitespaces in quotes=false
	Ignore trailing whitespaces=false
	Ignore trailing whitespaces in quotes=false
	Input buffer size=1048576
	Input reading on separate thread=false
	Keep escape sequences=false
	Keep quotes=false
	Length of content displayed on error=1000
	Line separator detection enabled=true
	Maximum number of characters per column=-1
	Maximum number of columns=20480
	Normalize escaped line separators=true
	Null value=
	Number of records to read=all
	Processor=none
	Restricting data in exceptions=false
	RowProcessor error handler=null
	Selected fields=none
	Skip bits as whitespace=true
	Skip empty lines=true
	Unescaped quote handling=STOP_AT_DELIMITER
Format configuration:
	CsvFormat:
		Comment character=#
		Field delimiter=,
		Line separator (normalized)=\n
		Line separator sequence=\n
		Quote character="
		Quote escape character=\
		Quote escape escape character=null
Internal state when error was thrown: line=0, column=0, record=0
	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402)
	at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:277)
	at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:843)
	at org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.<init>(UnivocityParser.scala:463)
	at org.apache.spark.sql.catalyst.csv.UnivocityParser$.convertStream(UnivocityParser.scala:46... 

This happens because multiline CSV/JSON uses BinaryFileRDD instead of FileScanRDD. FileScanRDD checks the ignoreCorruptFiles config when it meets corrupt files so that no IOException is reported; BinaryFileRDD does not report the error itself because it just returns a normal PortableDataStream, and the failure surfaces later when the stream is read. So we should catch the exception in the lambda function that infers the schema. The same thing is done for ignoreMissingFiles.
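
To make the mechanism concrete, here is a minimal sketch (not the actual patch) of the kind of per-file handling added around schema inference; the helper object and its wiring are hypothetical, only the exception/option pattern mirrors the change:

    import java.io.{FileNotFoundException, IOException}
    import org.apache.spark.sql.types.StructType

    // Hypothetical helper: wraps schema inference for a single file so that
    // missing or corrupt files are skipped instead of failing the whole job.
    object SafeInferSketch {
      def inferOrSkip(
          inferOneFile: () => Option[StructType],
          ignoreMissingFiles: Boolean,
          ignoreCorruptFiles: Boolean): Option[StructType] = {
        try {
          inferOneFile()
        } catch {
          case e: FileNotFoundException if ignoreMissingFiles =>
            // The file disappeared between listing and reading: skip it.
            Console.err.println(s"Skipped missing file during schema inference: $e")
            Some(StructType(Nil))
          // Throw FileNotFoundException even if `ignoreCorruptFiles` is true.
          case e: FileNotFoundException =>
            throw e
          case e: IOException if ignoreCorruptFiles =>
            // Skip the rest of the corrupted file instead of failing the query.
            Console.err.println(s"Skipped the rest of the content in a corrupted file: $e")
            Some(StructType(Nil))
        }
      }
    }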

Why are the changes needed?

Fix the bug when using multiline mode with the ignoreCorruptFiles/ignoreMissingFiles config.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new test.
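
For reference, a rough reproduction of the scenario the new test exercises; this is a sketch rather than the actual test, and the directory path and file layout are assumed:

    import org.apache.spark.sql.SparkSession

    object MultiLineIgnoreCorruptRepro {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[1]")
          .config("spark.sql.files.ignoreCorruptFiles", "true")
          .getOrCreate()

        // Assumed layout: /tmp/csv-dir holds valid CSV files plus one corrupt
        // file (e.g. a truncated .csv.gz). With multiLine=true the reader goes
        // through BinaryFileRDD, which is where the reported error comes from.
        val df = spark.read
          .option("multiLine", "true")
          .option("header", "true")
          .csv("/tmp/csv-dir")

        // Before the fix: TextParsingException during schema inference.
        // After the fix: the corrupt file is skipped and valid rows are returned.
        df.show()
        spark.stop()
      }
    }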

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Sep 18, 2023
@Hisoka-X (Member Author)

cc @HyukjinKwon


@MaxGekk MaxGekk left a comment


@Hisoka-X Could you rebase this on the recent master Scala 2.13 + Java 17.


Hisoka-X commented Oct 9, 2023

@Hisoka-X Could you rebase this on the recent master Scala 2.13 + Java 17.

Done

Some(StructType(Nil))
// Throw FileNotFoundException even if `ignoreCorruptFiles` is true
case e: FileNotFoundException if !options.ignoreMissingFiles => throw e
case e: IOException if options.ignoreCorruptFiles =>
Member

How about RuntimeException like at

case e @ (_: RuntimeException | _: IOException) if ignoreCorruptFiles =>
?

Member

I feel like the JSON/CSV inference code should be similar to FileScanRDD, and ignoreCorruptFiles should also apply to errors caused by RuntimeException. What happens if we put RuntimeException in the case below?

                  case e @ (_: RuntimeException | _: IOException) if ignoreCorruptFiles =>

Member Author

It never reaches here if we catch a RuntimeException, because it will already be caught by https://github.com/apache/spark/pull/42979/files/942b7c38a277a1b38b85a64ad940b56528ae8a03#diff-774d08eb04cd18039c576c7e23609430476d3dd2668535f0432f04b65b8ab234R92. So it would be dead code.

Member

I mean remove it from the case:

          case e @ (_: JsonProcessingException | _: MalformedInputException) =>
            handleJsonErrorsByParseMode(parseMode, columnNameOfCorruptRecord, e)
...

and put it there:

          case e @ (_: IOException | _: RuntimeException) if options.ignoreCorruptFiles =>
            logWarning("Skipped the rest of the content in the corrupted file", e)
            Some(StructType(Nil))

Member Author

Oh, I got it. But I'm not sure whether it would change behavior. Let me change it and see whether CI reports an error.

Member

From my understanding, we consider a JSON record malformed when the JSON parser cannot parse its already retrieved input, i.e. it was able to read some text from the file but could not parse it. For instance, if we look at the JSON parser exceptions:

    public JsonParser createParser(InputStream in) throws IOException, JsonParseException {
        IOContext ctxt = _createContext(_createContentReference(in), false);
        return _createParser(_decorate(in, ctxt), ctxt);
    }
...
    public abstract JsonToken nextToken() throws IOException;

JsonParseException (JsonProcessingException) + IOException

So, a RuntimeException can only come from corrupted files, not from merely malformed records. cc @HyukjinKwon @cloud-fan

@MaxGekk MaxGekk left a comment

Please adjust the PR's title and description with regard to ignoreMissingFiles.

@Hisoka-X Hisoka-X changed the title from "[SPARK-45035][SQL] Fix ignoreCorruptFiles with multiline CSV/JSON will report error" to "[SPARK-45035][SQL] Fix ignoreCorruptFiles/ignoreMissingFiles with multiline CSV/JSON will report error" on Oct 16, 2023
@@ -99,6 +102,13 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
val wrappedCharException = new CharConversionException(msg)
wrappedCharException.initCause(e)
handleJsonErrorsByParseMode(parseMode, columnNameOfCorruptRecord, wrappedCharException)
case e: FileNotFoundException if ignoreMissingFiles =>
@cloud-fan cloud-fan (Contributor) commented Oct 17, 2023

It's a bit annoying that we need to handle corrupted/missing files in multiple places, but I don't have a better idea.


MaxGekk commented Oct 18, 2023

+1, LGTM. Merging to master.
Thank you, @Hisoka-X and @cloud-fan @HyukjinKwon @beliefer for review.

@MaxGekk MaxGekk closed this in 11e7ea4 Oct 18, 2023
@Hisoka-X Hisoka-X deleted the SPARK-45035_csv_multi_line branch October 18, 2023 06:40