Configurable null values #76

Open · wants to merge 4 commits into master

Conversation

@petro-rudenko
Contributor

There are datasets where each column has its own marker for missing values, but spark-csv assumes only an empty string marks a missing value. To avoid additional data transformation and re-saving on the user's side, it would be great to be able to specify a set of null markers and have the library replace them with empty strings.
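A sketch of what the requested feature might look like from the user's side; the `nullValues` option name is hypothetical and not part of spark-csv's actual API:

```scala
// Hypothetical usage sketch -- the "nullValues" option does not exist
// in spark-csv; it only illustrates the feature being requested here.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("nullValues", "NA,-,null") // tokens to treat as missing values
  .load("data.csv")
```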

@saurfang
Contributor

+1. This will be very helpful.

@falaki
Member

falaki commented Jul 2, 2015

@petro-rudenko Since this can easily be done with a transformation, I prefer leaving it as it is rather than adding yet another option to spark-csv.
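For reference, the kind of post-load transformation being suggested might look like this (a sketch, assuming Spark 1.5+ and that `df` was loaded with all columns as strings; the marker list is illustrative):

```scala
import org.apache.spark.sql.functions.{col, when}

// Map known null markers to real nulls after loading the DataFrame.
val markers = Seq("NA", "-", "null")
val cleaned = df.columns.foldLeft(df) { (acc, name) =>
  acc.withColumn(name, when(col(name).isin(markers: _*), null).otherwise(col(name)))
}
```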

@petro-rudenko
Contributor Author

spark-csv accepts a sqlContext and a path to files, so a transformation is only possible by saving to a file first, which is not efficient for big files. Also, the replacement is done on a token basis (after the CSV parser has parsed the data). If CSV parsing were done on the client side, there would be no need to use spark-csv at all.

```scala
case _: DateType => Date.valueOf(datum)
case _: StringType => datum
case _ => throw new RuntimeException(s"Unsupported type: ${castType.typeName}")
if (datum.isEmpty && castType != StringType) {
```


It'd be nice if this were another option too. E.g., in my application we have decided to standardize on parsing empty string fields as nulls rather than empty strings.
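A rough sketch of what the library-side change under discussion could look like: check the token against a configurable set of null markers before attempting a typed cast. The `nullMarkers` parameter is illustrative, not the PR's actual signature:

```scala
// Illustrative sketch only -- not the actual patch in this PR.
import java.sql.Date
import org.apache.spark.sql.types._

def castTo(datum: String, castType: DataType,
           nullMarkers: Set[String] = Set("")): Any = {
  // Any configured marker counts as null, except for string columns.
  if (nullMarkers.contains(datum) && castType != StringType) {
    null
  } else {
    castType match {
      case _: DoubleType => datum.toDouble
      case _: DateType   => Date.valueOf(datum)
      case _: StringType => datum
      case _ => throw new RuntimeException(s"Unsupported type: ${castType.typeName}")
    }
  }
}
```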

@mohitjaggi
Contributor

I was working on the same and several other options; see https://github.com/databricks/spark-csv/pull/94/files.
I closed that pull request temporarily because I forgot to extract the unit tests from the bigdf code.

@mohitjaggi
Contributor

Please look at pull request #113.

@jaley
Contributor

jaley commented Aug 1, 2015

One reason to want this over client-side processing is that user-provided schemata initially have to declare all nullable columns as StringType in order to avoid parsing problems. Amazon ELB logs use a - character to represent null, including in columns that are really DoubleType (e.g. request time in seconds). You'll get a number format exception if you try to make that a double column straight away.

Seeing as a user-provided schema can tell us whether a column is nullable or not, it might be nice if we could also say what the null values will actually look like in the data.
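For context, the client-side workaround described above looks roughly like this (a sketch; `raw` is the DataFrame loaded with a StringType schema, and `request_time` is an illustrative column name):

```scala
import org.apache.spark.sql.functions.{col, when}

// Declare the column as StringType in the schema, then convert the
// "-" null marker to a real null and cast to double afterwards.
val fixed = raw.withColumn(
  "request_time",
  when(col("request_time") === "-", null)
    .otherwise(col("request_time"))
    .cast("double")
)
```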

@ragrawal

+1. I have some data that was generated using R, where nulls are encoded as "NA". Currently I am running another job that converts "NA" to "", but it would be nice if there were an option to specify how null values are encoded. All CSV parsers I know of have such an option.

@HyukjinKwon
Member

@falaki Now that #224 is merged, I think we might need to close this?
