[SPARK-18699][SQL][FOLLOWUP] Add explanation in CSV parser and minor cleanup #17142
  fieldsWithIndexes
}.map { case (f, i) =>
  (dataSchema.indexOf(f), i)
}.toArray
cc @cloud-fan and @maropu, could you check if this comment looks nicer to you?
Thanks for cc'ing and brushing up the code ;) I'll check within a few hours.
It is actually just a small code change :). I hope the comment makes this code easier to read and does not look too verbose.
Test build #73783 has finished for PR 17142 at commit
  val corrFieldIndex = corruptFieldIndex.get
  reorderedFields.indices.filter(_ != corrFieldIndex).toArray
} else {
  reorderedFields.indices.toArray
Would the code below be better?
private val rowIndexArr: Array[Int] = corruptFieldIndex.map { corrFieldIndex =>
reorderedFields.indices.filter(_ != corrFieldIndex).toArray
}.getOrElse {
reorderedFields.indices.toArray
}
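To make the comparison concrete, here is a minimal Scala sketch of both forms side by side. It assumes the guarding condition in the diff is `corruptFieldIndex.isDefined` (the condition line is not visible in the hunk above). `RowIndexSketch`, `withIfElse`, `withMapGetOrElse`, and the `numFields` parameter are hypothetical names used only for illustration; they are not part of `UnivocityParser`.

```scala
object RowIndexSketch {
  // Form from the diff (assumed condition: corruptFieldIndex.isDefined).
  def withIfElse(numFields: Int, corruptFieldIndex: Option[Int]): Array[Int] =
    if (corruptFieldIndex.isDefined) {
      val corrFieldIndex = corruptFieldIndex.get
      (0 until numFields).filter(_ != corrFieldIndex).toArray
    } else {
      (0 until numFields).toArray
    }

  // Suggested form: no explicit .get, the Option drives the branching.
  def withMapGetOrElse(numFields: Int, corruptFieldIndex: Option[Int]): Array[Int] =
    corruptFieldIndex.map { corrFieldIndex =>
      (0 until numFields).filter(_ != corrFieldIndex).toArray
    }.getOrElse {
      (0 until numFields).toArray
    }
}
```

Both functions should return the same array for any input; the second simply avoids calling `.get` on the `Option`.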
  requiredSchema
}

private val tokenIndexArr: Array[Int] = {
How about `fromIndexInTokens` instead of `tokenIndexArr`, to be more self-describing? Along with this, `rowIndexArr` to `toIndexInRow`?
//
// For example, let's say there is CSV data as below:
//
// a,b,c
How about "_c0,_c1,_c2" in the header line? At first I couldn't tell that this is a header.
// to map those values below:
//
// required schema - ["c", "b", "_unparsed", "a"]
// CSV data schema - ["a", "b", "c"]
ISTM it'd be better to map the names here to the variables `requiredSchema` and `dataSchema`?
//
// required schema - ["c", "b", "_unparsed", "a"]
// CSV data schema - ["a", "b", "c"]
// required CSV data schema - ["c", "b", "a"]
I feel "required CSV data schema" is a little ambiguous because there is no schema variable by this name in this class. So it seems we need to describe it further?
// Only used to create both `tokenIndexArr` and `rowIndexArr`. This variable means
// the fields that we should try to convert.
private val reorderedFields = if (options.dropMalformed) {
Would `requiredFields` be better?
// value's converter (by its value) in an order of CSV data schema. In this case,
// [string->int, string->int, string->string].
//
// - `tokenIndexArr`, input tokens - required CSV data schema
Ditto; `tokenIndexArr` is an index array into the tokens, corresponding to `requiredSchema`?
// `tokenIndexArr` keeps the positions of input token indices (by its index) to reordered
// fields given the required CSV data schema (by its value). In this case, [2, 1, 0].
//
// - `rowIndexArr`, input tokens - required schema
ditto
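For readers following the naming discussion, the mapping described in the code comment can be sketched roughly as below, using the example schemas from the comment. This is only an illustration under several assumptions: field names are plain strings rather than `StructField`s, value conversion (`valueConverters`) is skipped so tokens stay as strings, the `dropMalformed` branch is ignored, and the `rowIndexArr` value `[0, 1, 3]` is derived from the `reorderedFields.indices.filter(_ != corrFieldIndex)` hunk rather than quoted from the diff. `IndexMappingSketch`, `corruptFieldName`, and `toRow` are hypothetical names, not part of `UnivocityParser`.

```scala
object IndexMappingSketch {
  // Example values taken from the code comment under review.
  val requiredSchema: Seq[String] = Seq("c", "b", "_unparsed", "a")
  val dataSchema: Seq[String] = Seq("a", "b", "c")
  val corruptFieldName = "_unparsed" // assumed columnNameOfCorruptRecord

  // Position of the corrupt-record column in the required schema, if selected.
  val corruptFieldIndex: Option[Int] =
    Some(requiredSchema.indexOf(corruptFieldName)).filter(_ >= 0)

  // For each required data field: where its token sits in a parsed CSV line.
  val tokenIndexArr: Array[Int] =
    requiredSchema
      .filter(_ != corruptFieldName)
      .map(name => dataSchema.indexOf(name))
      .toArray

  // For each required data field: where its value goes in the output row.
  val rowIndexArr: Array[Int] = corruptFieldIndex.map { corrFieldIndex =>
    requiredSchema.indices.filter(_ != corrFieldIndex).toArray
  }.getOrElse {
    requiredSchema.indices.toArray
  }

  // Copy each token to its output position; the corrupt-record slot stays null.
  def toRow(tokens: Array[String]): Array[Any] = {
    val row = new Array[Any](requiredSchema.length)
    var i = 0
    while (i < tokenIndexArr.length) {
      row(rowIndexArr(i)) = tokens(tokenIndexArr(i))
      i += 1
    }
    row
  }
}
```

With the input tokens `["1", "2", "A"]`, `toRow` fills positions 0, 1, and 3 with `"A"`, `"2"`, and `"1"` and leaves position 2 (the `_unparsed` column) as `null`.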
except for minor comments, LGTM.
thanks, merging to master!
@HyukjinKwon you can address @maropu's comments in your next CSV PR.
I definitely will. Thank you so much @cloud-fan and @maropu.
What changes were proposed in this pull request?
This PR suggests adding some comments to the `UnivocityParser` logic to explain what happens. It also proposes slightly cleaner code (IMHO, or at least code that is easier for me to explain).
How was this patch tested?
Unit tests in `CSVSuite`.