Skip to content

Conversation

@cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

Currently JSON and CSV have exactly the same logic about handling bad records, this PR tries to abstract it and put it in a upper level to reduce code duplication.

The overall idea is, we make the JSON and CSV parser to throw a BadRecordException, then the upper level, FileScanRDD, handles bad records according to the parse mode.

How was this patch tested?

existing tests

@cloud-fan
Copy link
Contributor Author

It's not finished yet, I'm sending it out to get some initial feedback.

cc @HyukjinKwon

@SparkQA
Copy link

SparkQA commented Mar 14, 2017

Test build #74525 has finished for PR 17291 at commit c378ea1.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class BadRecordException(
  • class RowWithBadRecord(var row: InternalRow, index: Int, var record: UTF8String)

@HyukjinKwon
Copy link
Member

Thank you for cc'ing me @cloud-fan. Yes, I support this idea and my final plan was also the de-duplication.

One thing I am worried of is though, some code paths (for example, DataFrameReader.json(jsonDataset: Dataset[String]) and DataFrameReader.csv(csvDataset: Dataset[String])) directly use JacksonParser or UnivocityParser. If it is generalized in FileScanRDD, I am worried of missing the variants if I understood correctly.

@SparkQA
Copy link

SparkQA commented Mar 16, 2017

Test build #74645 has finished for PR 17291 at commit 23c1c3e.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class BadRecordException(
  • class DataSourceReader(mode: String, numFields: Int, corruptFieldIndex: Option[Int])
  • class RowWithBadRecord(var row: InternalRow, index: Int, var record: UTF8String)

@cloud-fan cloud-fan closed this Mar 16, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants