Configurable null values #76

Open · wants to merge 4 commits into master

Conversation

@petro-rudenko
Contributor

There are datasets where each column has its own marker for missing values, but spark-csv assumes only an empty string marks a missing value. To avoid additional data transformation and re-saving on the user's side, it would be great to be able to specify a set of null markers and have the library replace them with empty strings.
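A sketch of what the requested feature might look like from the user's side; the `nullValues` option name is hypothetical and not part of spark-csv's actual API:

```scala
// Hypothetical usage sketch -- the "nullValues" option does not exist
// in spark-csv; it only illustrates the feature being requested here.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("nullValues", "NA,-,null") // tokens to treat as missing values
  .load("data.csv")
```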

@saurfang
Contributor

+1. This will be very helpful.

@falaki
Member

falaki commented Jul 2, 2015

@petro-rudenko Since this can easily be done with a transformation, I prefer leaving it as it is rather than adding yet another option to spark-csv.
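For reference, the kind of post-load transformation being suggested might look like this (a sketch, assuming Spark 1.5+ and that `df` was loaded with all columns as strings; the marker list is illustrative):

```scala
import org.apache.spark.sql.functions.{col, when}

// Map known null markers to real nulls after loading the DataFrame.
val markers = Seq("NA", "-", "null")
val cleaned = df.columns.foldLeft(df) { (acc, name) =>
  acc.withColumn(name, when(col(name).isin(markers: _*), null).otherwise(col(name)))
}
```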

@petro-rudenko
Contributor Author

spark-csv accepts a sqlContext and a path to files, so a transformation is only possible by saving to a file first, which is not efficient for big files. Also, the replacement is done on a token basis (after the CSV parser has parsed the data). If CSV parsing were done on the client side, there would be no need to use spark-csv at all.

```scala
case _: DateType => Date.valueOf(datum)
case _: StringType => datum
case _ => throw new RuntimeException(s"Unsupported type: ${castType.typeName}")
if (datum.isEmpty && castType != StringType) {
```


It'd be nice if this were another option too. E.g., in my application we have decided to standardize on parsing empty string fields as nulls rather than empty strings.
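A rough sketch of what the library-side change under discussion could look like: check the token against a configurable set of null markers before attempting a typed cast. The `nullMarkers` parameter is illustrative, not the PR's actual signature:

```scala
// Illustrative sketch only -- not the actual patch in this PR.
import java.sql.Date
import org.apache.spark.sql.types._

def castTo(datum: String, castType: DataType,
           nullMarkers: Set[String] = Set("")): Any = {
  // Any configured marker counts as null, except for string columns.
  if (nullMarkers.contains(datum) && castType != StringType) {
    null
  } else {
    castType match {
      case _: DoubleType => datum.toDouble
      case _: DateType   => Date.valueOf(datum)
      case _: StringType => datum
      case _ => throw new RuntimeException(s"Unsupported type: ${castType.typeName}")
    }
  }
}
```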

@mohitjaggi
Contributor

I was working on the same and several other options; see https://github.com/databricks/spark-csv/pull/94/files.
I closed that pull request temporarily because I forgot to extract the unit tests from the bigdf code.

@mohitjaggi
Contributor

Please look at pull request #113.

@jaley
Contributor

jaley commented Aug 1, 2015

One reason to want this over client-side processing is that user-provided schemata initially have to declare all nullable columns as StringType in order to avoid parsing problems. Amazon ELB logs use a - character to represent null, including in columns that are really DoubleType (e.g. request time in seconds). You'll get a number format exception if you try to make that a double column straight away.

Seeing as a user-provided schema can tell us whether a column is nullable or not, it might be nice if we could also say what the null values will actually look like in the data.
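For context, the client-side workaround described above looks roughly like this (a sketch; `raw` is the DataFrame loaded with a StringType schema, and `request_time` is an illustrative column name):

```scala
import org.apache.spark.sql.functions.{col, when}

// Declare the column as StringType in the schema, then convert the
// "-" null marker to a real null and cast to double afterwards.
val fixed = raw.withColumn(
  "request_time",
  when(col("request_time") === "-", null)
    .otherwise(col("request_time"))
    .cast("double")
)
```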

@ragrawal

+1. I have some data that was generated using R, where nulls are encoded as "NA". Currently I am running another job that converts "NA" to "", but it would be nice if there were an option to specify how null values are encoded. All CSV parsers I know of have such an option.

@HyukjinKwon
Member

@falaki Now that #224 is merged, I think we might need to close this?
