
[SPARK-18699][SQL][FOLLOWUP] Add explanation in CSV parser and minor cleanup #17142

Closed
wants to merge 1 commit into master from SPARK-18699

Conversation

@HyukjinKwon (Member):

What changes were proposed in this pull request?

This PR suggests adding some comments to the UnivocityParser logic to explain what happens. It also proposes, IMHO, a slightly cleaner structure (at least one that is easier for me to explain).

How was this patch tested?

Unit tests in CSVSuite.

fieldsWithIndexes
}.map { case (f, i) =>
(dataSchema.indexOf(f), i)
}.toArray
@HyukjinKwon (Member Author), Mar 2, 2017:
cc @cloud-fan and @maropu, could you check if this comment looks nicer to you?

Reviewer (Member):

Thanks for cc'ing and brushing up the code ;) I'll check within a few hours.

@HyukjinKwon (Member Author):

It is just a small code change, actually :). I hope the comment makes reading this code easier without looking too verbose.


SparkQA commented Mar 2, 2017

Test build #73783 has finished for PR 17142 at commit ee30cc3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val corrFieldIndex = corruptFieldIndex.get
reorderedFields.indices.filter(_ != corrFieldIndex).toArray
} else {
reorderedFields.indices.toArray
Reviewer (Member):

Is the code below better?

  private val rowIndexArr: Array[Int] = corruptFieldIndex.map { corrFieldIndex =>
    reorderedFields.indices.filter(_ != corrFieldIndex).toArray
  }.getOrElse {
    reorderedFields.indices.toArray
  }
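
For reference, a quick standalone sketch (pasteable into the Scala REPL) showing that the proposed map/getOrElse form computes the same array as the original if/else; `reorderedFields` and `corruptFieldIndex` here are simplified stand-ins, not the actual UnivocityParser fields:

  // Stand-ins with the same shapes as the real fields.
  val reorderedFields = Seq("c", "b", "_unparsed", "a")
  val corruptFieldIndex: Option[Int] = Some(2)

  // Original if/else form.
  val viaIfElse: Array[Int] = if (corruptFieldIndex.isDefined) {
    val corrFieldIndex = corruptFieldIndex.get
    reorderedFields.indices.filter(_ != corrFieldIndex).toArray
  } else {
    reorderedFields.indices.toArray
  }

  // Proposed map/getOrElse form.
  val viaMapGetOrElse: Array[Int] = corruptFieldIndex.map { corrFieldIndex =>
    reorderedFields.indices.filter(_ != corrFieldIndex).toArray
  }.getOrElse {
    reorderedFields.indices.toArray
  }

  assert(viaIfElse.sameElements(viaMapGetOrElse))  // both are Array(0, 1, 3)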

requiredSchema
}

private val tokenIndexArr: Array[Int] = {
Reviewer (Member):

How about fromIndexInTokens instead of tokenIndexArr, to be more self-describing?
Along with this, rowIndexArr to toIndexInRow?

//
// For example, let's say there is CSV data as below:
//
// a,b,c
Reviewer (Member):

How about "_c0,_c1,_c2" in the header line? At first, I couldn't tell this was a header.
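
(Side note for readers of this thread: when Spark reads CSV without a header, it auto-generates the column names _c0, _c1, ..., which is presumably why those names read unambiguously as generated columns rather than as header data. A minimal spark-shell sketch, with a hypothetical file path:)

  // In spark-shell; /tmp/no_header.csv is a hypothetical file containing e.g. "1,2,A".
  val df = spark.read.csv("/tmp/no_header.csv")  // `header` not set, defaults to false
  df.printSchema()
  // root
  //  |-- _c0: string (nullable = true)
  //  |-- _c1: string (nullable = true)
  //  |-- _c2: string (nullable = true)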

// to map those values below:
//
// required schema - ["c", "b", "_unparsed", "a"]
// CSV data schema - ["a", "b", "c"]
Reviewer (Member):

ISTM it'd be better to map these names onto the actual variables here, requiredSchema and dataSchema?

//
// required schema - ["c", "b", "_unparsed", "a"]
// CSV data schema - ["a", "b", "c"]
// required CSV data schema - ["c", "b", "a"]
Reviewer (Member):

I feel "required CSV data schema" is a little ambiguous because there is no schema variable with this name in this class. So, it seems we need to describe it more?
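
(To pin the term down for readers: a minimal Scala REPL sketch using the example values quoted above, with _unparsed standing in for the corrupt-record column; these are illustrative stand-ins, not the actual UnivocityParser fields:)

  val dataSchema     = Seq("a", "b", "c")               // CSV data schema
  val requiredSchema = Seq("c", "b", "_unparsed", "a")  // required schema
  val corruptColumn  = "_unparsed"                      // the corrupt-record column

  // "required CSV data schema": the required fields that actually occur in the
  // CSV data, i.e. the required schema with the corrupt-record column dropped.
  val requiredCsvDataSchema = requiredSchema.filterNot(_ == corruptColumn)
  // => List(c, b, a)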


// Only used to create both `tokenIndexArr` and `rowIndexArr`. This variable means
// the fields that we should try to convert.
private val reorderedFields = if (options.dropMalformed) {
Reviewer (Member):

Is requiredFields better?

// value's converter (by its value) in an order of CSV data schema. In this case,
// [string->int, string->int, string->string].
//
// - `tokenIndexArr`, input tokens - required CSV data schema
Reviewer (Member):

Ditto; tokenIndexArr is an array of indices into the tokens, corresponding to requiredSchema?

// `tokenIndexArr` keeps the positions of input token indices (by its index) to reordered
// fields given the required CSV data schema (by its value). In this case, [2, 1, 0].
//
// - `rowIndexArr`, input tokens - required schema
Reviewer (Member):

ditto
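
(Continuing the sketch from the earlier comment, the two arrays under discussion work out as follows for the example values; again these are illustrative stand-ins, not the real fields:)

  // tokenIndexArr: for each field of requiredCsvDataSchema ["c", "b", "a"],
  // the position of its token within dataSchema ["a", "b", "c"].
  val tokenIndexArr = requiredCsvDataSchema.map(f => dataSchema.indexOf(f)).toArray
  // => Array(2, 1, 0)

  // rowIndexArr: the slot in the output row (ordered by requiredSchema) that each
  // converted value is written to, skipping the corrupt-record slot at index 2.
  val corruptFieldIndex = requiredSchema.indexOf(corruptColumn)  // 2
  val rowIndexArr = requiredSchema.indices.filter(_ != corruptFieldIndex).toArray
  // => Array(0, 1, 3)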


maropu commented Mar 3, 2017

Except for the minor comments, LGTM.

@cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan (Contributor):

@HyukjinKwon you can address @maropu 's comments in your next CSV PR.

@asfgit closed this in d556b31 on Mar 3, 2017.
@HyukjinKwon (Member Author):

I definitely will. Thank you so much @cloud-fan and @maropu.

@HyukjinKwon deleted the SPARK-18699 branch on January 2, 2018.