[SPARK-19949][SQL][WIP] unify bad record handling in CSV and JSON #17291

cloud-fan · 2017-03-14T14:09:44Z

What changes were proposed in this pull request?

Currently, JSON and CSV use exactly the same logic for handling bad records. This PR abstracts that logic and moves it to an upper level to reduce code duplication.

The overall idea is to make the JSON and CSV parsers throw a BadRecordException; the upper level, FileScanRDD, then handles bad records according to the parse mode.
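A minimal, Spark-free sketch of that split may help reviewers picture it. Everything here is a hypothetical stand-in rather than the code in this patch: `Row`, `parseCsvLine`, `handle`, and `parseAll` are illustrative names, the `BadRecordException` shown is a simplified two-field version, and the three modes are assumed to mirror Spark's PERMISSIVE, DROPMALFORMED, and FAILFAST options.

```scala
// Hypothetical, simplified stand-ins; these are NOT the classes from the patch.
sealed trait ParseMode
case object Permissive extends ParseMode
case object DropMalformed extends ParseMode
case object FailFast extends ParseMode

// Thrown by a parser when a record cannot be parsed; carries the raw text.
final case class BadRecordException(record: String, cause: Throwable)
  extends Exception(s"Malformed record: $record", cause)

// A parsed row: the field values, plus the raw text if the record was corrupt.
final case class Row(values: Seq[Option[Int]], corruptRecord: Option[String] = None)

object BadRecordSketch {
  // Toy "CSV" parser: every field must be an integer. It knows nothing about
  // parse modes; on failure it only reports the bad record by throwing.
  def parseCsvLine(line: String): Row =
    try Row(line.split(",", -1).toSeq.map(f => Some(f.trim.toInt)))
    catch { case e: NumberFormatException => throw BadRecordException(line, e) }

  // The single, format-agnostic place where the parse mode is applied.
  def handle(line: String, parse: String => Row, mode: ParseMode, numFields: Int): Option[Row] =
    try Some(parse(line))
    catch {
      case e: BadRecordException =>
        mode match {
          // Keep the row: all fields null-like, raw text in the corrupt column.
          case Permissive    => Some(Row(Seq.fill(numFields)(None), Some(e.record)))
          // Silently skip the record.
          case DropMalformed => None
          // Abort the whole read.
          case FailFast      => throw new RuntimeException("Malformed record in FAILFAST mode", e)
        }
    }

  // Plays the role of the upper level (FileScanRDD in the proposal).
  def parseAll(lines: Iterator[String], parse: String => Row,
               mode: ParseMode, numFields: Int): Iterator[Row] =
    lines.flatMap(line => handle(line, parse, mode, numFields).iterator)
}
```

Under these assumptions, a JSON parser would only need to throw the same exception to get identical mode handling, for example via `BadRecordSketch.parseAll(lines, parseJsonLine, Permissive, numFields)` with a hypothetical `parseJsonLine`.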

How was this patch tested?

existing tests

cloud-fan · 2017-03-14T14:10:15Z

It's not finished yet; I'm sending it out to get some initial feedback.

cc @HyukjinKwon

SparkQA · 2017-03-14T14:13:32Z

Test build #74525 has finished for PR 17291 at commit c378ea1.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class BadRecordException(
class RowWithBadRecord(var row: InternalRow, index: Int, var record: UTF8String)

HyukjinKwon · 2017-03-14T21:57:20Z

Thank you for cc'ing me, @cloud-fan. Yes, I support this idea; de-duplication was also my eventual plan.

One thing I am worried about, though, is that some code paths (for example, DataFrameReader.json(jsonDataset: Dataset[String]) and DataFrameReader.csv(csvDataset: Dataset[String])) use JacksonParser or UnivocityParser directly. If the handling is generalized in FileScanRDD, I am worried that these variants will be missed, if I understood correctly.

SparkQA · 2017-03-16T06:09:00Z

Test build #74645 has finished for PR 17291 at commit 23c1c3e.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class BadRecordException(
class DataSourceReader(mode: String, numFields: Int, corruptFieldIndex: Option[Int])
class RowWithBadRecord(var row: InternalRow, index: Int, var record: UTF8String)

unify bad record handling in CSV and JSON

23c1c3e

cloud-fan force-pushed the bad-record branch from c378ea1 to 23c1c3e Compare March 16, 2017 05:56

cloud-fan closed this Mar 16, 2017
