
[SPARK-18699][SQL][FOLLOWUP] Add explanation in CSV parser and minor cleanup #17142

Closed
wants to merge 1 commit into master from SPARK-18699

Conversation

@HyukjinKwon (Member):

What changes were proposed in this pull request?

This PR suggests adding some comments to the UnivocityParser logic to explain what happens. It also proposes, IMHO, a slightly cleaner structure (at least one that is easier for me to explain).

How was this patch tested?

Unit tests in CSVSuite.

fieldsWithIndexes
}.map { case (f, i) =>
(dataSchema.indexOf(f), i)
}.toArray
@HyukjinKwon (Member Author), Mar 2, 2017:
cc @cloud-fan and @maropu, could you check if this comment looks nicer to you?

Reviewer (Member):

Thanks for cc'ing and brushing up the code ;) I'll check within a few hours.

@HyukjinKwon (Member Author):

It is just a small code change, actually :). I hope the comment makes reading this code easier without looking too verbose.


SparkQA commented Mar 2, 2017

Test build #73783 has finished for PR 17142 at commit ee30cc3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val corrFieldIndex = corruptFieldIndex.get
reorderedFields.indices.filter(_ != corrFieldIndex).toArray
} else {
reorderedFields.indices.toArray
Reviewer (Member):

Is the code below better?

  private val rowIndexArr: Array[Int] = corruptFieldIndex.map { corrFieldIndex =>
    reorderedFields.indices.filter(_ != corrFieldIndex).toArray
  }.getOrElse {
    reorderedFields.indices.toArray
  }
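
For reference, a quick standalone sketch (pasteable into the Scala REPL) showing that the proposed map/getOrElse form computes the same array as the original if/else; `reorderedFields` and `corruptFieldIndex` here are simplified stand-ins, not the actual UnivocityParser fields:

  // Stand-ins with the same shapes as the real fields.
  val reorderedFields = Seq("c", "b", "_unparsed", "a")
  val corruptFieldIndex: Option[Int] = Some(2)

  // Original if/else form.
  val viaIfElse: Array[Int] = if (corruptFieldIndex.isDefined) {
    val corrFieldIndex = corruptFieldIndex.get
    reorderedFields.indices.filter(_ != corrFieldIndex).toArray
  } else {
    reorderedFields.indices.toArray
  }

  // Proposed map/getOrElse form.
  val viaMapGetOrElse: Array[Int] = corruptFieldIndex.map { corrFieldIndex =>
    reorderedFields.indices.filter(_ != corrFieldIndex).toArray
  }.getOrElse {
    reorderedFields.indices.toArray
  }

  assert(viaIfElse.sameElements(viaMapGetOrElse))  // both are Array(0, 1, 3)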

requiredSchema
}

private val tokenIndexArr: Array[Int] = {
Reviewer (Member):

How about fromIndexInTokens instead of tokenIndexArr, to be more self-describing?
Along with this, rowIndexArr to toIndexInRow?

//
// For example, let's say there is CSV data as below:
//
// a,b,c
Reviewer (Member):

How about "_c0,_c1,_c2" in the header line? At first, I couldn't tell this was a header.
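
(Side note for readers of this thread: when Spark reads CSV without a header, it auto-generates the column names _c0, _c1, ..., which is presumably why those names read unambiguously as generated columns rather than as header data. A minimal spark-shell sketch, with a hypothetical file path:)

  // In spark-shell; /tmp/no_header.csv is a hypothetical file containing e.g. "1,2,A".
  val df = spark.read.csv("/tmp/no_header.csv")  // `header` not set, defaults to false
  df.printSchema()
  // root
  //  |-- _c0: string (nullable = true)
  //  |-- _c1: string (nullable = true)
  //  |-- _c2: string (nullable = true)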

// to map those values below:
//
// required schema - ["c", "b", "_unparsed", "a"]
// CSV data schema - ["a", "b", "c"]
Reviewer (Member):

ISTM it'd be better to map these names onto the actual variables here, requiredSchema and dataSchema?

//
// required schema - ["c", "b", "_unparsed", "a"]
// CSV data schema - ["a", "b", "c"]
// required CSV data schema - ["c", "b", "a"]
Reviewer (Member):

I feel "required CSV data schema" is a little ambiguous because there is no schema variable with this name in this class. So, it seems we need to describe it more?
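
(To pin the term down for readers: a minimal Scala REPL sketch using the example values quoted above, with _unparsed standing in for the corrupt-record column; these are illustrative stand-ins, not the actual UnivocityParser fields:)

  val dataSchema     = Seq("a", "b", "c")               // CSV data schema
  val requiredSchema = Seq("c", "b", "_unparsed", "a")  // required schema
  val corruptColumn  = "_unparsed"                      // the corrupt-record column

  // "required CSV data schema": the required fields that actually occur in the
  // CSV data, i.e. the required schema with the corrupt-record column dropped.
  val requiredCsvDataSchema = requiredSchema.filterNot(_ == corruptColumn)
  // => List(c, b, a)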


// Only used to create both `tokenIndexArr` and `rowIndexArr`. This variable means
// the fields that we should try to convert.
private val reorderedFields = if (options.dropMalformed) {
Reviewer (Member):

Is requiredFields better?

// value's converter (by its value) in an order of CSV data schema. In this case,
// [string->int, string->int, string->string].
//
// - `tokenIndexArr`, input tokens - required CSV data schema
Reviewer (Member):

Ditto; tokenIndexArr is an array of indices into the tokens, corresponding to requiredSchema?

// `tokenIndexArr` keeps the positions of input token indices (by its index) to reordered
// fields given the required CSV data schema (by its value). In this case, [2, 1, 0].
//
// - `rowIndexArr`, input tokens - required schema
Reviewer (Member):

ditto
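
(Continuing the sketch from the earlier comment, the two arrays under discussion work out as follows for the example values; again these are illustrative stand-ins, not the real fields:)

  // tokenIndexArr: for each field of requiredCsvDataSchema ["c", "b", "a"],
  // the position of its token within dataSchema ["a", "b", "c"].
  val tokenIndexArr = requiredCsvDataSchema.map(f => dataSchema.indexOf(f)).toArray
  // => Array(2, 1, 0)

  // rowIndexArr: the slot in the output row (ordered by requiredSchema) that each
  // converted value is written to, skipping the corrupt-record slot at index 2.
  val corruptFieldIndex = requiredSchema.indexOf(corruptColumn)  // 2
  val rowIndexArr = requiredSchema.indices.filter(_ != corruptFieldIndex).toArray
  // => Array(0, 1, 3)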


maropu commented Mar 3, 2017

Except for the minor comments, LGTM.

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan (Contributor):

@HyukjinKwon you can address @maropu 's comments in your next CSV PR.

@asfgit closed this in d556b31 on Mar 3, 2017.
@HyukjinKwon (Member Author):

I definitely will. Thank you so much @cloud-fan and @maropu.

@HyukjinKwon deleted the SPARK-18699 branch on January 2, 2018.