Maybe parse numbers? #6

makc · 2015-01-25T21:17:08Z

Let's say, pass whatever matches

/^[+-]?\d*(?:\d*\.\d*)?(?:e-?\d+)?$/i

(not tested :) through parseFloat().
Arguably, figuring out what the data is is not the parser's job, but realistically almost every use of this deals with numbers.


makc · 2015-01-25T22:45:16Z

These guys suggest

/^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/
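For illustration, a minimal sketch of what this would amount to, assuming the second regex above acts as the gate and anything it rejects stays a string. The name maybeNumber is made up here and is not part of d3-dsv.

```javascript
// Hypothetical helper, not part of d3-dsv: convert a field only if it
// matches the number pattern quoted above; otherwise keep the string.
const NUMBER_RE = /^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/;

function maybeNumber(value) {
  return NUMBER_RE.test(value) ? parseFloat(value) : value;
}

console.log(maybeNumber("3.14"));  // 3.14
console.log(maybeNumber("-1e3"));  // -1000
console.log(maybeNumber("abc"));   // abc (left as a string)
console.log(maybeNumber(""));      // empty string, left alone
console.log(maybeNumber("00320")); // 320 (the leading zeros are lost)
```

Note that the pattern accepts zero-padded values such as "00320", so identifier-like columns would still be converted.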

mbostock · 2015-01-25T23:22:27Z

Related d3/d3#387. Generally speaking I think it’s a bad thing to convert to non-string types implicitly. Even though it’s what you want in 99% of cases, the cases where it does something unexpected to your data can be very harmful. Hence we’ve been favoring the use of type-conversion functions where you can explicitly coerce the data to the type you want (typically on a per-column basis).

mbostock · 2016-01-06T05:06:51Z

I think it’s probably worth doing this automatically, if we can. As long as there’s a way to disable it.

curran · 2016-01-06T06:10:49Z

Related projects with similar intent:

datalib - dl.csv supports arguments that specify the type each column should be coerced to. Not sure if it does number parsing automatically, but it seems like it does from the documentation.
dsv-dataset - parses numbers and dates according to a specification of column types. No automatic type detection.

mbostock · 2016-01-06T17:23:42Z

Thanks for the references. I was aware of those projects, but it was a useful reminder.

This issue was only intended to cover detecting numbers. It appears datalib also detects booleans and dates. I wonder whether that’s possible to do safely.

Detecting booleans as “true” or “false” is simple enough, but many datasets do not use these exact strings to represent booleans; “Y” and “N”, for example, are probably more common. Also, many datasets use the empty string to indicate missing data. You wouldn’t want to inadvertently coerce the empty string to false (undefined would be more appropriate), and it would likewise be weird to mix the empty string in with boolean true and false.

Similarly, what would you do with a mix of “true”, “false” and other non-empty strings? The same issue applies to detecting numbers. Do you use strings, NaN or undefined for non-numeric values if the column contains a mix of numeric and non-numeric values? Undefined for the empty string and NaN for non-numeric values is arguably more type-safe than including strings, but it loses information.
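To make the trade-off concrete, here is a rough sketch of the "undefined for empty, NaN for non-numeric" policy described above. The function name coerceNumber is hypothetical, and treating whitespace-only fields as missing is an extra assumption of this sketch.

```javascript
// Hypothetical policy, not an existing d3-dsv function:
//   empty (or whitespace-only) field -> undefined (missing)
//   numeric field                    -> number
//   anything else                    -> NaN (the original text is lost)
function coerceNumber(value) {
  if (value.trim() === "") return undefined;
  return Number(value); // Number("n/a") is NaN
}

const column = ["1.5", "", "n/a", "42"];
const typed = column.map(coerceNumber);
console.log(typed); // [ 1.5, undefined, NaN, 42 ]
```

The result is uniformly "number or undefined", but the original "n/a" can no longer be recovered from the typed column.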

Detecting dates generically is much more difficult. Using Date.parse is dangerous because its behavior varies widely across browsers: you might test your code on one browser that understands the given date format, but users on a different browser might see invalid dates! It’s safer to strictly define the set of supported date formats, but that implies d3-dsv depending on d3-time-format and d3-time, which is a fairly significant addition!

Maybe I’ve convinced myself to stick with the status quo and coerce types explicitly.

mbostock · 2016-01-06T17:34:09Z

Also, JavaScript already provides type coercion if you don’t mind being sloppy: for example, putting a string value into an arithmetic expression automatically coerces that string to a number. And the nice thing about leaving things as strings is that it doesn’t lose information, like greedily coercing to a number does. Though of course there are times, such as when sorting, that leaving number-like values as strings will behave unexpectedly.
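A few plain-JavaScript lines showing both sides of this: arithmetic coerces number-like strings, while comparison and the default sort treat them as strings.

```javascript
const a = "10";
const b = "9";

const product = a * 2;       // 20: multiplication coerces the string
const concatenated = a + 2;  // "102": + concatenates when one side is a string
const stringLess = a < b;    // true: compared as strings, "1" sorts before "9"
const numberLess = +a < +b;  // false: compared as numbers

const values = ["10", "9", "1"];
const lexicographic = [...values].sort();            // ["1", "10", "9"]
const numeric = [...values].sort((x, y) => x - y);   // ["1", "9", "10"]

console.log(product, concatenated, stringLess, numberLess);
console.log(lexicographic, numeric);
```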

Another approach to this problem might be to improve how types are coerced. Do we think the current approach is too verbose, or too tedious to type? Or perhaps we think it’s also unsafe, because it silently coerces the empty string to zero and non-numeric strings to NaN?

d3_dsv.csv(text, function(d) {
  return {
    foo: +d.foo,
    bar: !!d.bar,
    baz: parseDate(d.baz)
  };
}, function(error, data) {
  if (error) throw error;
  console.log(data);
});

I could imagine an API for constructing the above type-coercion function more declaratively (but not that much more declarative, since the above is pretty clean):

d3_dsv.csv(text, d3_dsv.type()
    .field("foo", d3_dsv.typeNumber)
    .field("bar", d3_dsv.typeBoolean)
    .field("baz", parseDate), function(error, data) {
  if (error) throw error;
  console.log(data);
});

So, the field foo is coerced to a number, but maybe it throws an error if the number is invalid rather than silently coercing to NaN?
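A sketch of what such a throwing converter could look like. The name strictNumber is invented for this example (the d3_dsv.typeNumber above is itself hypothetical), and rejecting empty fields is an assumption.

```javascript
// Hypothetical strict converter: returns a number or throws, instead of
// silently producing NaN (or 0 for the empty string, as unary + does).
function strictNumber(value) {
  const n = value.trim() === "" ? NaN : Number(value);
  if (Number.isNaN(n)) {
    throw new TypeError("not a number: " + JSON.stringify(value));
  }
  return n;
}

console.log(strictNumber("12.5")); // 12.5

try {
  strictNumber("abc");
} catch (error) {
  console.log(error.message); // not a number: "abc"
}
```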

The hypothetical API doesn’t seem like a big win, though, since the current approach is shorter (or about the same) to type and more transparent.

I suppose you could have d3_dsv.typeAuto() if you wanted to opt-in to unsafe conversion, though. :)

mbostock · 2016-01-06T17:39:34Z

Another variation:

d3_dsv.csv(text, d3_dsv.type({
      foo: d3_dsv.typeNumber,
      bar: d3_dsv.typeBoolean,
      baz: parseDate
    }), function(error, data) {
  if (error) throw error;
  console.log(data);
});
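For what it's worth, the object-based variant would be small to implement. The sketch below is only a guess at the semantics: type, typeNumber and typeBoolean are the hypothetical names from the snippet above, and passing unlisted columns through unchanged is an assumption.

```javascript
// Hypothetical: build a row-conversion function from {column: converter}.
// Columns without a converter are copied through as strings.
function type(spec) {
  return function (row) {
    const out = {};
    for (const key of Object.keys(row)) {
      const convert = Object.prototype.hasOwnProperty.call(spec, key) ? spec[key] : null;
      out[key] = convert ? convert(row[key]) : row[key];
    }
    return out;
  };
}

// Stand-ins mirroring the +d.foo and !!d.bar coercions used earlier.
const typeNumber = (s) => +s;
const typeBoolean = (s) => !!s;

const convertRow = type({foo: typeNumber, bar: typeBoolean});
const converted = convertRow({foo: "1", bar: "", baz: "x"});
console.log(converted); // { foo: 1, bar: false, baz: 'x' }
```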

It’s also interesting to consider whether this would be useful for renaming columns (#10) and restructuring. But, that’s also something the current approach does relatively well, perhaps even better if you use ES6 destructuring.

d3_dsv.csv(text, function(d) {
  return {
    foo: +d.Foo,
    bar: {
      confirmed: !!d.barConfirmed,
      date: parseDate(d.barDate)
    }
  };
}, function(error, data) {
  if (error) throw error;
  console.log(data);
});
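With ES6 destructuring, the row function above might be written as follows. Here parseDate is a stand-in built on the Date constructor, which is an assumption of this sketch rather than the parser intended above.

```javascript
// Stand-in date parser for the sketch; only reliable for ISO 8601 input.
const parseDate = (s) => new Date(s);

// Same renaming and restructuring as above, using parameter destructuring.
const row = ({Foo, barConfirmed, barDate}) => ({
  foo: +Foo,
  bar: {
    confirmed: !!barConfirmed,
    date: parseDate(barDate)
  }
});

const example = row({Foo: "2", barConfirmed: "yes", barDate: "2016-01-06"});
console.log(example.foo, example.bar.confirmed, example.bar.date.toISOString());
```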

makc · 2016-01-06T19:40:44Z

Ha ha,

mbostock self-assigned this 15 hours ago
mbostock removed their assignment 2 hours ago

I'm late to the party and yet did not miss anything.

curran · 2016-01-19T20:10:22Z

All very interesting ideas. It seems like automatic parsing might be best suited to leave out of D3, as other tools like Datalib will come along and evolve. Lots of open questions, like how to detect date format automatically. Also there's the classic case of id fields that are strings, like "00320", that should not be parsed as numbers. I've heard so many stories of Excel automatically parsing these kinds of identifiers (e.g. FIPS codes) and causing problems.
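The identifier problem is easy to reproduce. The round-trip guard below is just one conservative idea, not something proposed in this thread; safeNumber is a made-up name.

```javascript
const fips = "00320";
console.log(+fips);                  // 320
console.log(String(+fips) === fips); // false: the leading zeros are gone

// Hypothetical guard: convert only when the number prints back to exactly
// the same text, so "00320", "1.50" and "" all stay strings.
function safeNumber(value) {
  const n = +value;
  return String(n) === value ? n : value;
}

console.log(safeNumber("00320")); // 00320 (still a string)
console.log(safeNumber("42"));    // 42
```

The guard is deliberately strict: it also leaves values like "1e3" and "1.50" as strings, which may or may not be what you want.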

makc · 2016-01-19T20:22:48Z

@curran I trust microsoft that excel team has performed extensive use case studies and determined that the number of cases where this caused problems was far less than the number of cases where it was helpful.

vogievetsky · 2016-01-19T21:24:44Z

And yet, in a data file with a large number of columns, Excel will always find some column to mess up :-p

jstcki · 2016-08-29T11:53:30Z

@makc Whoops!

makc · 2016-08-30T13:24:13Z

@herrstucki way to blame computer program for human error. this is how robot revolution will start.

mbostock · 2019-02-07T17:15:44Z

I realize this issue is four years old and I haven’t found a solution I’m happy with yet, but I’d like to make some progress here. At the very least, there should be some explicit option to coerce values to numbers if the value would not be NaN, even if it’s not the default behavior. For example:

function autoType(d) {
  for (const key in d) {
    if (!isNaN(d[key])) {
      d[key] = +d[key];
    }
  }
  return d;
}

This “autoType” function could then be passed as the row accessor function to dsv.parse.
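One caveat with the sketch as written: isNaN("") is false, so empty fields become 0. The snippet below demonstrates that on plain row objects (standing in for parsed rows) and shows a guarded variant; autoTypeSkipEmpty is a hypothetical name, and leaving blanks as strings is just one possible choice.

```javascript
// The function from above, unchanged.
function autoType(d) {
  for (const key in d) {
    if (!isNaN(d[key])) {
      d[key] = +d[key];
    }
  }
  return d;
}

// Hypothetical variant: leave empty or whitespace-only fields untouched.
function autoTypeSkipEmpty(d) {
  for (const key in d) {
    const value = d[key];
    if (value.trim() !== "" && !isNaN(value)) {
      d[key] = +value;
    }
  }
  return d;
}

const rows = [{name: "a", value: "1.5", note: ""}];
const typed = rows.map((r) => autoType({...r}));
const guarded = rows.map((r) => autoTypeSkipEmpty({...r}));

console.log(typed[0]);   // { name: 'a', value: 1.5, note: 0 }
console.log(guarded[0]); // { name: 'a', value: 1.5, note: '' }
```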

If we change the default string conversion for dates to use ISO 8601 format, we could likewise add parsing for dates in ISO 8601 format to autoType, and thus have a clean way to roundtrip dates as well. (While avoiding the problem of trying to parse arbitrary date formats, which is a minefield, and should be avoided anyway in most cases by encouraging people to use the standard ISO 8601 representation.)
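A minimal sketch of the date half of that idea, assuming only date-only values and full UTC timestamps are accepted. The pattern and the name parseIsoDate are assumptions for illustration, not the eventual implementation.

```javascript
// Hypothetical: accept YYYY-MM-DD or YYYY-MM-DDTHH:mm:ss(.sss)Z only.
const ISO_RE = /^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}(\.\d{3})?Z)?$/;

function parseIsoDate(value) {
  if (!ISO_RE.test(value)) return value;   // not ISO-shaped: keep the string
  const date = new Date(value);            // date-only ISO strings are read as UTC
  return Number.isNaN(date.getTime()) ? value : date;
}

const parsed = parseIsoDate("2019-02-07T17:15:44Z");
console.log(parsed.toISOString());       // 2019-02-07T17:15:44.000Z
console.log(parseIsoDate("02/07/2019")); // 02/07/2019 (left as a string)
```

Because Date.prototype.toISOString emits the same format, a value formatted that way parses back to the same instant.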

mbostock · 2019-02-07T17:33:10Z

We should also guarantee that NaN is roundtripped as the number NaN, rather than coming back as the string "NaN". It might also be sensible to parse Python’s "nan" and R’s "NA" as NaN, too.

mbostock · 2019-02-07T17:34:08Z

We could roundtrip "true" and "false" (exact strings, but case-insensitive?) to booleans, too.
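Sketching the NaN and boolean cases from this comment and the previous one: the token list and the case-insensitive boolean match are the open questions raised above, so treat both as assumptions. parseSpecial is a made-up name.

```javascript
// Hypothetical: map a few well-known tokens to non-string values and
// leave everything else as a string.
function parseSpecial(value) {
  if (value === "NaN" || value === "nan" || value === "NA") return NaN;
  const lower = value.toLowerCase();
  if (lower === "true") return true;
  if (lower === "false") return false;
  return value;
}

console.log(parseSpecial("NaN"));  // NaN
console.log(parseSpecial("TRUE")); // true
console.log(parseSpecial("Y"));    // Y (not recognized, stays a string)
console.log(String(NaN));          // NaN, so the number NaN formats back to "NaN"
```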

mbostock · 2019-02-07T19:28:24Z

I’ve fleshed out a solution that I’m pretty happy with in #42.

https://github.com/d3/d3-dsv/blob/auto-type/README.md#autoType

mbostock self-assigned this Jan 6, 2016

mbostock removed their assignment Jan 6, 2016

mbostock mentioned this issue Jul 14, 2016

Dynamic Data Scales d3/d3#2860

Closed

lennerd mentioned this issue Oct 7, 2016

Parse numbers as numbers wbkd/dsv-loader#8

Closed

curran mentioned this issue Feb 7, 2017

Type inference for strings containing numbers vega/altair#296

Open

mbostock mentioned this issue Feb 7, 2019

Format dates as ISO 8601 rather than using date.toString. #41

Closed

mbostock added a commit that referenced this issue Feb 7, 2019

Add d3.autoType. Fixes #6.

c84f67b

mbostock mentioned this issue Feb 7, 2019

Add d3.autoType. #42

Merged

mbostock closed this as completed in #42 Feb 7, 2019

snyk-bot mentioned this issue May 2, 2020

[Snyk] Upgrade d3-dsv from 1.0.5 to 1.2.0 jmarca/tams_classifications#1

Open
