Maybe parse numbers? #6
These guys suggest
|
Related d3/d3#387. Generally speaking I think it’s a bad thing to convert to non-string types implicitly. Even though it’s what you want in 99% of cases, the cases where it does something unexpected to your data can be very harmful. Hence we’ve been favoring the use of type-conversion functions where you can explicitly coerce the data to the type you want (typically on a per-column basis). |
I think it’s probably worth doing this automatically, if we can. As long as there’s a way to disable it. |
Related projects with similar intent:
|
Thanks for the references. I was aware of those projects, but it was a useful reminder.
This issue was only intended to cover detecting numbers. It appears datalib also detects booleans and dates. I wonder whether that’s possible to do safely. Detecting booleans as “true” or “false” is simple enough, but many datasets do not use these exact strings to represent booleans; “Y” and “N”, for example, are probably more common. Also, many datasets use the empty string to indicate missing data. You wouldn’t want to inadvertently coerce the empty string to false (undefined would be more appropriate), and it would likewise be weird to mix the empty string in with boolean true and false. Similarly, what would you do with a mix of “true”, “false” and other non-empty strings?
The same issue applies to detecting numbers. Do you use strings, NaN or undefined for non-numeric values if the column contains a mix of numeric and non-numeric values? Undefined for the empty string and NaN for non-numeric values is arguably more type-safe than including strings, but it loses information.
Detecting dates generically is much more difficult. Using Date.parse is dangerous because its behavior varies widely across browsers: you might test your code on one browser that understands the given date format, but users on a different browser might see invalid dates! It’s safer to strictly define the set of supported date formats, but that implies d3-dsv depending on d3-time-format and d3-time, which is a fairly significant addition!
Maybe I’ve convinced myself to stick with the status quo and coerce types explicitly. |
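To make the number-detection trade-off above concrete, here is a minimal sketch of an explicit converter that maps the empty string to undefined and non-numeric strings to NaN. The name `toNumber` is hypothetical and is not part of d3-dsv; it only illustrates the "more type-safe but loses information" option described in the comment.

```javascript
// Hypothetical helper, not part of d3-dsv: explicit per-column number coercion.
// Empty (or whitespace-only) strings are treated as missing data -> undefined.
// Non-numeric strings become NaN, which discards the original text.
function toNumber(value) {
  if (value.trim() === "") return undefined;
  return Number(value);
}

const column = ["1.5", "", "n/a", "42"];
const converted = column.map(toNumber);
console.log(converted); // 1.5, undefined, NaN, 42
```

Note that the "n/a" entry can no longer be recovered after conversion, which is the information loss the comment warns about.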
Also, JavaScript already provides type coercion if you don’t mind being sloppy: for example, putting a string value into an arithmetic expression automatically coerces that string to a number. And the nice thing about leaving things as strings is that it doesn’t lose information, like greedily coercing to a number does. Though of course there are times, such as when sorting, that leaving number-like values as strings will behave unexpectedly.
Another approach to this problem might be to improve how types are coerced. Do we think the current approach is too verbose, or too tedious to type? Or perhaps we think it’s also unsafe, because it silently coerces the empty string to zero and non-numeric strings to NaN?
d3_dsv.csv(text, function(d) {
  return {
    foo: +d.foo,
    bar: !!d.bar,
    baz: parseDate(d.baz)
  };
}, function(error, data) {
  if (error) throw error;
  console.log(data);
});
I could imagine an API for constructing the above type-coercion function more declaratively (but not that much more declarative, since the above is pretty clean):
d3_dsv.csv(text, d3_dsv.type()
    .field("foo", d3_dsv.typeNumber)
    .field("bar", d3_dsv.typeBoolean)
    .field("baz", parseDate), function(error, data) {
  if (error) throw error;
  console.log(data);
});
So, the field The hypothetical API doesn’t seem like a big win, though, since the current approach is shorter (or about the same) to type and more transparent. I suppose you could have d3_dsv.typeAuto() if you wanted to opt-in to unsafe conversion, though. :) |
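The pitfalls mentioned in this comment (string sorting, and unary plus silently turning the empty string into zero) can be checked with plain JavaScript. This is only an illustration of built-in language behavior; the variable names are made up for the example.

```javascript
// Sorting number-like strings compares them lexicographically.
const asStrings = ["10", "9", "2"].sort();

// Coercing first, then sorting with a numeric comparator.
const asNumbers = ["10", "9", "2"].map(Number).sort((a, b) => a - b);

// Unary plus silently maps the empty string to zero and junk to NaN.
const emptyAsNumber = +"";
const junkAsNumber = +"abc";

console.log(asStrings);                   // "10", "2", "9"
console.log(asNumbers);                   // 2, 9, 10
console.log(emptyAsNumber, junkAsNumber); // 0 NaN
```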
Another variation:
d3_dsv.csv(text, d3_dsv.type({
  foo: d3_dsv.typeNumber,
  bar: d3_dsv.typeBoolean,
  baz: parseDate
}), function(error, data) {
  if (error) throw error;
  console.log(data);
});
It’s also interesting to consider whether this would be useful for renaming columns (#10) and restructuring. But, that’s also something the current approach does relatively well, perhaps even better if you use ES6 destructuring.
d3_dsv.csv(text, function(d) {
  return {
    foo: +d.Foo,
    bar: {
      confirmed: !!d.barConfirmed,
      date: parseDate(d.barDate)
    }
  };
}, function(error, data) {
  if (error) throw error;
  console.log(data);
}); |
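For readers wondering what the object-based variation might look like underneath, here is a minimal sketch in plain JavaScript. The functions `type`, `typeNumber` and `typeBoolean` are hypothetical stand-ins for the API floated in this thread; nothing here is an actual d3-dsv export, and the converters simply reuse the `+d.foo` and `!!d.bar` idioms from the examples above.

```javascript
// Hypothetical sketch: build a row accessor from a {field: converter} map.
// Fields without a converter are copied through unchanged as strings.
function type(spec) {
  return function (row) {
    const out = {};
    for (const key of Object.keys(row)) {
      const hasConverter = Object.prototype.hasOwnProperty.call(spec, key);
      out[key] = hasConverter ? spec[key](row[key]) : row[key];
    }
    return out;
  };
}

const typeNumber = (s) => +s;   // same as the +d.foo idiom
const typeBoolean = (s) => !!s; // same as the !!d.bar idiom: any non-empty string is true

const convert = type({ foo: typeNumber, bar: typeBoolean });
const typedRow = convert({ foo: "3", bar: "", baz: "2015-01-01" });
console.log(typedRow); // foo: 3, bar: false, baz: "2015-01-01"
```

A helper like this saves little typing over a hand-written row function, which matches the conclusion reached in the comment.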
Ha ha,
I'm late to the party and yet did not miss anything. |
All very interesting ideas. It seems like automatic parsing might be best suited to leave out of D3, as other tools like Datalib will come along and evolve. Lots of open questions, like how to detect date format automatically. Also there's the classic case of id fields that are strings, like "00320", that should not be parsed as numbers. I've heard so many stories of Excel automatically parsing these kinds of identifiers (e.g. FIPS codes) and causing problems. |
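The identifier problem is easy to reproduce. The sketch below shows the loss of leading zeros and one possible guard, which only coerces when the number converts back to exactly the same text. The helper name `toNumberIfLossless` is hypothetical and this is not how any library in this thread is known to behave.

```javascript
// Hypothetical guard: coerce only if the number round-trips to the same string.
function toNumberIfLossless(value) {
  const n = Number(value);
  return String(n) === value ? n : value;
}

console.log(Number("00320"));             // 320: the leading zeros are gone
console.log(toNumberIfLossless("00320")); // "00320" stays a string
console.log(toNumberIfLossless("320"));   // 320
```

The guard is deliberately conservative: values such as "1.50" or "1e3" also stay strings because they do not round-trip character for character.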
@curran I trust that Microsoft’s Excel team has performed extensive use case studies and determined that the number of cases where this caused problems was far less than the number of cases where it was helpful. |
And yet, in a data file with a large number of columns, Excel will always find some column to mess up :-p |
@herrstucki way to blame computer program for human error. this is how robot revolution will start. |
I realize this issue is four years old and I haven’t found a solution I’m happy with yet, but I’d like to make some progress here. At the very least, there should be some explicit option to coerce values to numbers if the value would not be NaN, even if it’s not the default behavior. For example:
function autoType(d) {
  for (const key in d) {
    // Note: isNaN("") is false, so the empty string is coerced to 0 here.
    if (!isNaN(d[key])) {
      d[key] = +d[key];
    }
  }
  return d;
}
This “autoType” function could then be passed as the row accessor function to dsv.parse. If we change the default string conversion for dates to use ISO 8601 format, we could likewise add parsing for dates in ISO 8601 format to autoType, and thus have a clean way to roundtrip dates as well. (While avoiding the problem of trying to parse arbitrary date formats, which is a minefield, and should be avoided anyway in most cases by encouraging people to use the standard ISO 8601 representation.) |
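As a rough illustration of the direction described above, the following sketch extends the row function with empty-string handling and a narrow ISO 8601 check. It is an assumption-laden sketch, not the d3-dsv implementation: the name `autoTypeSketch`, the choice of null for missing values, and the restriction to date-only strings (YYYY-MM-DD) are all choices made for this example.

```javascript
// Hypothetical variant of the autoType idea; not the d3-dsv implementation.
// Only date-only strings (YYYY-MM-DD) are recognized, a narrow subset of ISO 8601.
const isoDate = /^\d{4}-\d{2}-\d{2}$/;

function autoTypeSketch(d) {
  for (const key in d) {
    const value = d[key].trim(); // assumes every field value is a string
    if (value === "") {
      d[key] = null; // treat empty as missing data rather than zero
    } else if (!isNaN(value)) {
      d[key] = +value; // numeric strings
    } else if (isoDate.test(value)) {
      const date = new Date(value); // the date-only form is parsed as UTC
      if (!isNaN(date)) d[key] = date; // keep the string if the date is invalid
    }
  }
  return d;
}

const row = autoTypeSketch({ id: "7", name: "foo", when: "2019-03-01", note: "" });
console.log(row.id, row.name, row.when.toISOString(), row.note);
```

Restricting detection to one well-defined format sidesteps the browser differences with arbitrary date strings raised earlier in the thread.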
We should also guarantee that |
We could roundtrip |
I’ve fleshed out a solution that I’m pretty happy with in #42. https://github.com/d3/d3-dsv/blob/auto-type/README.md#autoType |
Let's say, put whatever is
(not tested :) through parseFloat()
Arguably, figuring out what the data is is not the parser’s job, but realistically almost every use of this deals with numbers.