
[SPARK-45035][SQL] Fix ignoreCorruptFiles/ignoreMissingFiles with multiline CSV/JSON will report error #42979

Closed
wants to merge 13 commits

Conversation


@Hisoka-X Hisoka-X commented Sep 18, 2023

What changes were proposed in this pull request?

Fix the bug where using ignoreCorruptFiles/ignoreMissingFiles with multiline CSV/JSON reports an error like the following:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4940.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4940.0 (TID 4031) (10.68.177.106 executor 0): com.univocity.parsers.common.TextParsingException: java.lang.IllegalStateException - Error reading from input
Parser Configuration: CsvParserSettings:
	Auto configuration enabled=true
	Auto-closing enabled=true
	Autodetect column delimiter=false
	Autodetect quotes=false
	Column reordering enabled=true
	Delimiters for detection=null
	Empty value=
	Escape unquoted values=false
	Header extraction enabled=null
	Headers=null
	Ignore leading whitespaces=false
	Ignore leading whitespaces in quotes=false
	Ignore trailing whitespaces=false
	Ignore trailing whitespaces in quotes=false
	Input buffer size=1048576
	Input reading on separate thread=false
	Keep escape sequences=false
	Keep quotes=false
	Length of content displayed on error=1000
	Line separator detection enabled=true
	Maximum number of characters per column=-1
	Maximum number of columns=20480
	Normalize escaped line separators=true
	Null value=
	Number of records to read=all
	Processor=none
	Restricting data in exceptions=false
	RowProcessor error handler=null
	Selected fields=none
	Skip bits as whitespace=true
	Skip empty lines=true
	Unescaped quote handling=STOP_AT_DELIMITER
Format configuration:
	CsvFormat:
		Comment character=#
		Field delimiter=,
		Line separator (normalized)=\n
		Line separator sequence=\n
		Quote character="
		Quote escape character=\
		Quote escape escape character=null
Internal state when error was thrown: line=0, column=0, record=0
	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402)
	at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:277)
	at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:843)
	at org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.<init>(UnivocityParser.scala:463)
	at org.apache.spark.sql.catalyst.csv.UnivocityParser$.convertStream(UnivocityParser.scala:46... 

This happens because multiline CSV/JSON uses BinaryFileRDD instead of FileScanRDD. FileScanRDD checks the ignoreCorruptFiles config when it meets corrupt files so that no IOException is reported; BinaryFileRDD does not report the error itself because it just returns a normal PortableDataStream, and the failure surfaces later when the stream is read. So we should catch the exception in the lambda function that infers the schema. The same thing is done for ignoreMissingFiles.
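
To make the mechanism concrete, here is a minimal sketch (not the actual patch) of the kind of per-file handling added around schema inference; the helper object and its wiring are hypothetical, only the exception/option pattern mirrors the change:

    import java.io.{FileNotFoundException, IOException}
    import org.apache.spark.sql.types.StructType

    // Hypothetical helper: wraps schema inference for a single file so that
    // missing or corrupt files are skipped instead of failing the whole job.
    object SafeInferSketch {
      def inferOrSkip(
          inferOneFile: () => Option[StructType],
          ignoreMissingFiles: Boolean,
          ignoreCorruptFiles: Boolean): Option[StructType] = {
        try {
          inferOneFile()
        } catch {
          case e: FileNotFoundException if ignoreMissingFiles =>
            // The file disappeared between listing and reading: skip it.
            Console.err.println(s"Skipped missing file during schema inference: $e")
            Some(StructType(Nil))
          // Throw FileNotFoundException even if `ignoreCorruptFiles` is true.
          case e: FileNotFoundException =>
            throw e
          case e: IOException if ignoreCorruptFiles =>
            // Skip the rest of the corrupted file instead of failing the query.
            Console.err.println(s"Skipped the rest of the content in a corrupted file: $e")
            Some(StructType(Nil))
        }
      }
    }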

Why are the changes needed?

Fix the bug when using multiline mode with the ignoreCorruptFiles/ignoreMissingFiles config.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new test.
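
For reference, a rough reproduction of the scenario the new test exercises; this is a sketch rather than the actual test, and the directory path and file layout are assumed:

    import org.apache.spark.sql.SparkSession

    object MultiLineIgnoreCorruptRepro {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[1]")
          .config("spark.sql.files.ignoreCorruptFiles", "true")
          .getOrCreate()

        // Assumed layout: /tmp/csv-dir holds valid CSV files plus one corrupt
        // file (e.g. a truncated .csv.gz). With multiLine=true the reader goes
        // through BinaryFileRDD, which is where the reported error comes from.
        val df = spark.read
          .option("multiLine", "true")
          .option("header", "true")
          .csv("/tmp/csv-dir")

        // Before the fix: TextParsingException during schema inference.
        // After the fix: the corrupt file is skipped and valid rows are returned.
        df.show()
        spark.stop()
      }
    }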

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Sep 18, 2023
@Hisoka-X (Member Author)

cc @HyukjinKwon


@MaxGekk MaxGekk left a comment


@Hisoka-X Could you rebase this on the recent master Scala 2.13 + Java 17.


Hisoka-X commented Oct 9, 2023

@Hisoka-X Could you rebase this on the recent master Scala 2.13 + Java 17.

Done

Some(StructType(Nil))
// Throw FileNotFoundException even if `ignoreCorruptFiles` is true
case e: FileNotFoundException if !options.ignoreMissingFiles => throw e
case e: IOException if options.ignoreCorruptFiles =>
Member

How about RuntimeException like at

case e @ (_: RuntimeException | _: IOException) if ignoreCorruptFiles =>
?

Member

I feel like the JSON/CSV inference code should be similar to FileScanRDD, and ignoreCorruptFiles should also apply to errors caused by RuntimeException. What happens if we put RuntimeException in the case below?

                  case e @ (_: RuntimeException | _: IOException) if ignoreCorruptFiles =>

Member Author

It never reaches here if we catch a RuntimeException, because it will already be caught by https://github.com/apache/spark/pull/42979/files/942b7c38a277a1b38b85a64ad940b56528ae8a03#diff-774d08eb04cd18039c576c7e23609430476d3dd2668535f0432f04b65b8ab234R92. So it would be dead code.

Member

I mean remove it from the case:

          case e @ (_: JsonProcessingException | _: MalformedInputException) =>
            handleJsonErrorsByParseMode(parseMode, columnNameOfCorruptRecord, e)
...

and put it there:

          case e @ (_: IOException | _: RuntimeException) if options.ignoreCorruptFiles =>
            logWarning("Skipped the rest of the content in the corrupted file", e)
            Some(StructType(Nil))

Member Author

Oh, I got it. But I'm not sure whether it would change behavior. Let me change it and see whether CI reports an error.

Member

From my understanding, we consider a JSON record malformed when the JSON parser cannot parse its already retrieved input, i.e. it was able to read some text from the file but could not parse it. For instance, if we look at the JSON parser exceptions:

    public JsonParser createParser(InputStream in) throws IOException, JsonParseException {
        IOContext ctxt = _createContext(_createContentReference(in), false);
        return _createParser(_decorate(in, ctxt), ctxt);
    }
...
    public abstract JsonToken nextToken() throws IOException;

JsonParseException (JsonProcessingException) + IOException

So, a RuntimeException can only come from corrupted files, not from merely malformed records. cc @HyukjinKwon @cloud-fan

@MaxGekk MaxGekk left a comment

Please adjust the PR's title and description with regard to ignoreMissingFiles.

@Hisoka-X Hisoka-X changed the title from "[SPARK-45035][SQL] Fix ignoreCorruptFiles with multiline CSV/JSON will report error" to "[SPARK-45035][SQL] Fix ignoreCorruptFiles/ignoreMissingFiles with multiline CSV/JSON will report error" on Oct 16, 2023
@@ -99,6 +102,13 @@ private[sql] class JsonInferSchema(options: JSONOptions) extends Serializable {
val wrappedCharException = new CharConversionException(msg)
wrappedCharException.initCause(e)
handleJsonErrorsByParseMode(parseMode, columnNameOfCorruptRecord, wrappedCharException)
case e: FileNotFoundException if ignoreMissingFiles =>
@cloud-fan cloud-fan (Contributor) commented Oct 17, 2023

It's a bit annoying that we need to handle corrupted/missing files in multiple places, but I don't have a better idea.


MaxGekk commented Oct 18, 2023

+1, LGTM. Merging to master.
Thank you, @Hisoka-X and @cloud-fan @HyukjinKwon @beliefer for review.

@MaxGekk MaxGekk closed this in 11e7ea4 Oct 18, 2023
@Hisoka-X Hisoka-X deleted the SPARK-45035_csv_multi_line branch October 18, 2023 06:40