A more consistent approach to data. #1145
Possibilities:
- Always wrap data.
- Never wrap data; assign directly to properties. (See the sketch below for these first two options.)
- Add an out operator to all layouts to control assignment (and possibly allow wrapping?).
- Wrap data using Object.create (tempting but likely confusing).
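To make the first two options concrete, here is a rough sketch of what each would mean for a hypothetical layout (illustrative only; computeX and computeY stand in for whatever the layout computes):

// Option 1 (always wrap): the layout returns new node objects that reference the input data.
var nodes = data.map(function(d, i) {
  return {data: d, x: computeX(d, i), y: computeY(d, i)};
});

// Option 2 (never wrap): the layout assigns computed properties directly to the input objects.
data.forEach(function(d, i) {
  d.x = computeX(d, i);
  d.y = computeY(d, i);
});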
Related: d3.geom.quadtree requires input of the form {x, y}, whereas the other d3.geom classes use [x, y]. The defaults for d3.svg.line and d3.svg.area also use [x, y], though the hierarchy and force layouts use {x, y}.
thanks, just checking to make sure i was on the right page... i agree that accessors are more convenient than map. As far as "Should the derived data contain the accessed values? How should the derived data be linked back to the original data?" goes, creating an 'accessed_data' object that stores the values pulled from the original data by the accessor seems inefficient, since this process might need to be repeated each time the original data changes... what about making the values of 'accessed_data' be functions that reproduce the accessor actions on the original data? i.e. out.x() rather than out.x for the output data structure, in order to maintain its link to the original data
Hi, I want to contribute, even though this topic requires a broad knowledge of d3.
My humble opinion is that the case of non-isomorphic results should be handled separately. I think the term "layout" in d3 has grown over time, and maybe now is the right moment to define what a layout should be, what it should not be, and perhaps some new terms.
Keeping the original input data is useful (consider, for example, storing original values on element attributes in order to show labels or popups), and I can't see any drawback to it. Of course, nobody should expect to write to those attributes on the output nodes and have the computed attributes updated. And of course, people should be aware that the derived attributes could overwrite the original data attributes.
I agree that it would be a good idea to add the out method to all layouts, though meaningful defaults are always important to encourage convergence.
Regarding input data without accessors (like geom, line and area), what is the advantage? Performance, or data size? If you want to keep this advantage, you could also produce an array node with accessors as a result. I don't know whether you'd consider this inelegant; anyway, javascript won't prevent it. This would be an object where you can access d[0], d[1], and also d.x.
I don't understand what it means that just the force layout is stateful. Are the produced nodes stateful? What if all the nodes were stateful, so that you could feed new data into them and propagate the change to the selection elements? Maybe it already works like this; I haven't used many layouts.
Anyway, if a layout just enriches the original data object, the produced object and the original data object become conceptually very close to each other, and maybe a specific term (nodes?) could be used to define them.
I hope this is of some help.
making the values of 'accessed_data' be functions that reproduce the accessor actions on the original data
Thanks for the suggestion, but I don't think this is what we want, for a few reasons.
First, reinvoking the accessor functions is already pretty easy to do. For example, pack.value() returns the value accessor function, so pack.value()(node) would reevaluate this function on the specified node. Even easier, since the caller defines the value accessor function, they could just keep a reference to that function and say value(node). Creating closures on the accessors that are bound to a specific node and then returning them on the output data objects would add a significant amount of overhead for little value; a closure is significantly heavier than the numeric value it returns.
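For instance, a minimal sketch (assuming the 3.x pack API, a hierarchical root, and data with a size property):

var value = function(d) { return d.size; };
var pack = d3.layout.pack().value(value);
var nodes = pack.nodes(root);

// Either keep a reference to the accessor and call it directly...
var v1 = value(nodes[0]);

// ...or retrieve it from the layout; pack.value() with no arguments returns the accessor.
var v2 = pack.value()(nodes[0]);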
(This glosses over some nuances in how the accessors are invoked. For example the value accessor is invoked with an additional argument of the depth in the tree, and the this context is the hierarchy layout instance. But in practice accessors rarely depend on this, and the caller could adjust the invocation syntax accordingly if necessary.)
More crucially, accessors reflect the state of the input, and that state should be captured in the layout's output. In other words, the output of the layout should be static, reflective of the produced layout, and not a live view of the data. In other parts of D3 we even go so far as to evaluate accessors consistently so that nondeterministic accessors behave sensibly; for example, this allows you to apply random jitter with the d3.svg.area generator:
fx0 = d3_functor(x0),
fy0 = d3_functor(y0),
fx1 = x0 === x1 ? function() { return x; } : d3_functor(x1),
fy1 = y0 === y1 ? function() { return y; } : d3_functor(y1),

Here, if the area's x0 and x1 are the same function, that function is only evaluated once and the value is reused for both properties.
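As a sketch of what that consistent evaluation enables (illustrative only; the x and y scales, height, svg and data are assumed to exist), a nondeterministic accessor can add random jitter and the shared value keeps the corresponding edges aligned:

// A random-jitter x accessor with d3.svg.area (3.x API). area.x sets both x0 and x1
// to the same function, so the jittered value is computed once per point and reused.
var area = d3.svg.area()
    .x(function(d, i) { return x(i) + (Math.random() - 0.5) * 4; })
    .y0(height)
    .y1(function(d) { return y(d); });

svg.append("path").datum(data).attr("d", area);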
I don't think it's strictly a requirement that a layout capture and export all of the accessed properties. If, for example, the caller defines the value accessor as function(d) { return d.size; }, we could expect them to know to access the value as d.size rather than relying on d.value. But since the layout has to output data anyway, it does seem more convenient to also output the accessed values.
I don't understand what it means that just the force layout is stateful.
Most layouts follow the configurable function pattern, and are stateless. By stateless, I mean that layouts are functions that take some input and return some output, and aside from some minimal configuration, the output depends only on the input. In other words, the configuration is the only state of the layout.
The force layout is different because the data (the nodes and links) are part of the configuration. This is largely because the force layout is iterative rather than a one-time transformation of data. A more consistent way of implementing the force layout would probably be for the force layout to return a simulation that you can then listen to, rather than listening to the force layout directly. Then you could use the same force layout instance (and thus share a configuration) while simulating multiple networks.
For example, you might do something like this:
var force = d3.layout.force()
    .size([960, 500])
    .charge(-40)
    .gravity(.2);

var simulation = force({nodes: nodes, links: links})
    .on("tick", tick);

Then if you wanted to start and stop the simulation, you'd say simulation.start or simulation.stop, rather than using the force layout.
But then again, I'm not sure how useful this additional abstraction is. I mean, it's not very often that you need to simulate multiple networks simultaneously, and even if you did, it's pretty easy to write a function that creates multiple force layouts with the same parameters.
The treemap layout is also partly stateful if you use the sticky feature.
I agree that it would be a good idea to add the out method to all layouts, though meaningful defaults are always important to encourage convergence.
This might be challenging, at least for the force layout.
For most layouts an out function would be doable; it's already the case that some layouts store temporary state on nodes during the computation of the layout. (The tree layout uses d._tree and the pack layout uses d._pack_next and d._pack_prev, for example.) These layouts could store their accessed properties on temporary private variables as needed to compute the layout, before mapping them to output data using the out function.
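For context, the only layout that currently supports an output transformation is d3.layout.stack; its default out in 3.x is roughly:

// Default out for d3.layout.stack: writes the computed baseline and thickness
// back onto each point.
function out(d, y0, y) {
  d.y0 = y0;
  d.y = y;
}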
The force layout would be harder since the private data needs to persist after the invocation of the layout (force.start). Even if force.tick can capture a mapping from private data to the associated output node, to support interaction with the force layout's drag behavior, it would need a way to modify the private data for a given node (the reverse mapping). I suppose you could iterate over the private data to find a matching node, but this would be slow.
My larger concern on adding an output transformation is that this additional indirection makes the code more complicated without a commensurate benefit; it would probably be better to limit accessors to being a convenient way to map input data to the standard representation, and from there always refer to the standard representation. So, for data produced by the layout, I argue it should always be in the same format rather than allowing an output transformation. I think this would make user code easier to read, as well, since the layout outputs would be named consistently.
I actually can't find any examples of using stack.out at the moment, so I think it's a good candidate for deletion in 4.0. For that matter, even the stacked bar chart example was simpler without using the stack layout.
Regarding input data without accessors (like geom, line and area), what is the advantage? Performance, or data size?
d3.svg.line and d3.svg.area do allow accessors (e.g., line.x). But they output only path data strings, so they don't need to capture and export the accessed values. The d3.geom classes differ from layouts in that they do not allow accessors; they require a standard form of input. They were designed this way because it made them simpler to implement. But part of this discussion is whether those accessor-less classes in d3.geom should exist, or whether they should be rewritten as layouts (and thus have accessors). I've already done this for the voronoi layout, which subsumes d3.geom.voronoi and d3.geom.delaunay. That leaves d3.geom.hull and d3.geom.quadtree, which could be rewritten as layouts. As for d3.geom.polygon, I'm less sure, but I suppose d3.layout.clip is an option?
Anyway, I'm debating whether it was a good idea to rewrite d3.geom.voronoi as a layout. And if it was, should d3.geom.quadtree and d3.geom.hull also be layouts? Will the quadtree layout be noticeably slower if it has to invoke accessors rather than relying on a standard input form?
thanks for the great explanations, i can see that the bigger picture of how the layout outputs fit into the design patterns already used in other parts of D3 is richer than i realized. i wasn't considering the advantages of layouts as static snapshots of the data, but now that i understand this perspective i agree that the extra overhead of closures etc. would be undesirable. one of the great things about learning D3 is that i'm also learning about programming in general.
quick question regarding your use of the word "functor": is this to emphasize that the output could be any object, not necessarily a function? i.e. i'm more used to the term "operator" for functors that output functions, so to me the name "functor" suggests a non-function output but i'm not sure if this was intended or not. although this detail might seem unimportant, understanding this choice of terminology might reveal something about D3 design & implementation in general.
Questions regarding d3.geom.voronoi:
- Should d3.layout.voronoi exist?
- If so, should d3.geom.voronoi also exist, or be deprecated?
- If d3.geom.voronoi is deprecated, should d3.geom.quadtree and d3.geom.hull be rewritten as layouts, too (i.e., with accessors)?
- If d3.layout.voronoi should not exist, should d3.geom.voronoi be configurable with a clip function (but no accessors), or should people simply use d3.geom.polygon clip as before?
Questions regarding d3.geom.quadtree:
- Should d3.geom.quadtree take [x, y] as input, for consistency with the other d3.geom classes, or {x, y} as input for consistency with its major consumer, d3.layout.force? (If we restored [x, y] input support it would need both for backwards-compatibility, at least in 3.x.)
Questions regarding the stack layout:
- Should d3.layout.stack's default x and y accessors be d[0] and d[1] for consistency with other layouts, rather than d.x and d.y? For backwards-compatibility, we could have default accessors in 3.x that return d.x if (!(0 in d)) and d.y if (!(1 in d)); see the sketch after this list.
- Should d3.layout.stack expose the accessed x property? Currently it only outputs the computed y (which is typically unchanged except for the "expand" offset) and y0. Should the output properties be d.x, d.y and d.y0, or d[0], d[1] and something else? If d.x, d.y and d.y0, is it inconsistent that the standard input is different from the standard output?
- Should d3.layout.stack support an output transformation (out)? (I think I've convinced myself that it would be nice if this went away.)
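A sketch of the backwards-compatible 3.x defaults proposed above (hypothetical, for illustration):

var stack = d3.layout.stack()
    .x(function(d) { return 0 in d ? d[0] : d.x; })
    .y(function(d) { return 1 in d ? d[1] : d.y; });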
Questions regarding other layouts and data:
- Should other layouts use d[0] and d[1] as output rather than d.x and d.y? One benefit of using the array form is that the default toString() returns "x,y", which is convenient for setting a translate attribute or computing a path data string; see the sketch after this list. (It would also make the output of layouts consistent with the inputs to other path generators!) Of course, this would work better if each input object were an array rather than an object, since the default object toString is "[object Object]". For backwards-compatibility, 3.x layouts could set both properties, though this wouldn't provide strict backwards-compatibility for custom constraints with a force layout (e.g., assigning x and y for collision detection).
- Should d3.layout.pie wrap data or assign to properties on data as other layouts?
- Should d3.layout.force and d3.layout.treemap be stateful (i.e., capture data)?
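To illustrate the toString convenience mentioned above (assuming a selection named node whose bound data are layout outputs):

// If layout output nodes were arrays [x, y], array.toString gives "x,y" for free:
node.attr("transform", function(d) { return "translate(" + d + ")"; });

// With the current {x, y} object form, the same transform is more verbose:
node.attr("transform", function(d) { return "translate(" + d.x + "," + d.y + ")"; });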
regarding your use of the word "functor": is this to emphasize that the output could be any object, not necessarily a function?
That's a code snippet referring to d3_functor, whose purpose is to promote a constant value (such as 42) to a function that returns this value (function() { return 42; }), so that code that can accept either constants or functions as inputs only needs to handle the more general case of a function.
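In 3.x its implementation is essentially:

function d3_functor(v) {
  return typeof v === "function" ? v : function() { return v; };
}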
ah i see, so "functor" is being used as a programming term and not in the mathematical sense of a "function that takes a function as input". thanks for the explanation! i can see why this type of object-to-function promotion could be handy. in mathematics, this might be called a "function-valued function" since it can take non-function inputs, but always returns a function as output, unlike a "functor" or an "operator" which (by definition) take functions as input. Perhaps a name like "to_func" would be less confusing for some users.
I think if performance is no object, then it seems consistent and more flexible to use higher-order programming and have d3.layout.{voronoi,delaunay,quadtree,hull,polygon}. I’d hope that accessor functions would be inlined by modern VMs anyway, although a benchmark would say for sure.
You could of course argue the opposite: that input data should be transformed into a particular structure using array.map or array.forEach prior to being passed to all layouts. I’d say that accessor functions are cleaner and more readable, since you can reuse a configured layout rather than having to perform a transformation step on input data each time. Are there any other benefits?
I don’t think accessed values should ever need to be saved, because if you really want to you can either a) call the accessor on the output again or b) you can perform a transformation step as mentioned above and store computed values for efficiency.
I agree d3.layout.stack.out seems inconsistent with the operation of other layouts: they all have a defined output structure and will overwrite existing properties, so I think it’s fine if the stack layout does the same and we eventually drop stack.out.
Ultimately I think there probably is a small performance hit (and of course a slight complexity in implementation) when using accessor functions, so the question is whether the benefits of flexibility (and other benefits?) outweigh this cost.
As for arrays as output, another possibility is to have point objects with a toString method function() { return this.x + "," + this.y; }.
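A rough sketch of that idea (hypothetical point constructor, for illustration):

function point(x, y) {
  return {
    x: x,
    y: y,
    toString: function() { return this.x + "," + this.y; }
  };
}

var p = point(10, 20);
var t = "translate(" + p + ")"; // "translate(10,20)"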
For voronoi.{links,triangles}, I think the outputs should refer directly to the input elements (assuming we are not going to go the wrap-everything route). So {source: a, target: b} and [a, b, c], where the input is [a, b, c, …]. I’m thinking the voronoi layout itself should set a cell property (or polygon?) on each input element.
On second thoughts, it might also be reasonable to separate d3.geom and d3.layout on the basis of “primitives” that are useful as building blocks for algorithms, and higher-level layouts that are more useful to lay out data for visualisations. In that sense, I can see that the Voronoi algorithm is a little bit ambiguous because it’s more likely to be used in visualisations, albeit probably in the background for hit testing etc. As for quadtree and hull though, they’re definitely primitive building blocks and so higher-order programming is not so important.
For what it's worth, I generally think custom accessor functions are excessive sugar compared to the general minimalism of the d3 library. I think it would take a new user less time to understand data preparation with map/forEach (when necessary), and that understanding would be much more valuable than understanding the accessor functions, which are limited to d3. A general accessor object (with all the usual suspects: x, y, x1, y1, ...) to transform data might be a good intermediate step?
Ziggy, I think removing all accessors would be more tedious than you think, and would give D3 a less declarative feel. For example, consider the line chart example, which defines the line path generator as both accessing the appropriate property and applying the relevant x- and y-scale:
var line = d3.svg.line()
    .x(function(d) { return x(d.date); })
    .y(function(d) { return y(d.close); });

This allows the line generator to be declared before the x- and y-scale domains are set. Down below, once we have data, we render the chart:
svg.append("path")
.datum(data)
.attr("class", "line")
.attr("d", line);If d3.svg.line didn't support accessors, and thus didn't require configuration (ok, ignoring other line generator properties such as interpolation), the code would look like:
svg.append("path")
.datum(data)
.attr("class", "line")
.attr("d", function(data) {
return d3.svg.line(data.map(function(d) { return [x(d.date), y(d.close)]; }));
});Or alternatively, you could bind to the transformed data:
svg.append("path")
.datum(data.map(function(d) { return [x(d.date), y(d.close)]; }))
.attr("class", "line")
.attr("d", d3.svg.line);Or alternatively, if you don't want to bind data:
svg.append("path")
.attr("class", "line")
.attr("d", d3.svg.line(data.map(function(d) { return [x(d.date), y(d.close)]; })));But then if you want to render an area, or you want to re-render the line after updating the data, you have to perform the same mapping again. So you'd want to then encapsulate your data mapping to a separate function:
function xy(d) {
  return [x(d.date), y(d.close)];
}

And then you could say:
svg.append("path")
.attr("class", "line")
.attr("d", d3.svg.line(data.map(xy)));Or, perhaps you'd combine the data mapping and the line rendering into one:
function line(data) {
  return d3.svg.line(data.map(function(d) { return [x(d.date), y(d.close)]; }));
}

At which point you're basically back where you started, having implemented your own accessors on top of d3.svg.line. And the original syntax is cleaner, and guides the user to an appropriate abstraction:
var line = d3.svg.line()
    .x(function(d) { return x(d.date); })
    .y(function(d) { return y(d.close); });

So, while it would certainly be possible to remove accessors and thus eliminate some of the problems discussed here, I think it would end up being more burdensome to users, as they would need to solve the problem of mapping data themselves without guidance from D3. I know it's good, in general, to favor a minimalist approach. But I think retaining accessors here is still a good idea, even if it's not required for everything.
My opinion on this is not a strong one, but I wanted to share the viewpoint nevertheless. I do understand and agree with you on many levels. My view is that any data property in a layout should always have a logical/meaningful name, not just an "array position". So in the examples above, if svg.line() simply looked for an x and a y in the data, the result could be:
d3.svg.line(data.map(function(d) {
  return {
    x: x(d.date),
    y: y(d.close)
  };
}));
Alternatively, a generic map function for all layouts could work like this:

d3.svg.line(data)
    .map(function(d) {
      return {
        x: x(d.date),
        y: y(d.close)
      };
    })
A benefit of both methods above is that the base is a standard javascript map function, which is probably the most powerful element of javascript. The code could also be closer to the current accessors, like this:
d3.svg.line(data)
    .map({
      x: function(d) { return x(d.date); },
      y: function(d) { return y(d.close); }
    })
But again, I understand why the current structure is sticky and the arguments for changes aren't strong enough. Maybe in a major release it might make sense to evaluate the pros and cons of general boilerplate (accessors, getters/setters, etc.).
I appreciate you sharing your opinion, Ziggy, even if it's not a strong one! :)
As for using {x, y} rather than [x, y] (that is, named properties rather than numeric properties), I don't see that as a big readability win given that the data being represented is 2D coordinates. An array [x, y] is a very natural representation for 2D coordinates. (I would agree with you if x and y were not representing coordinates, but instead had arbitrary meaning.)
And, as I mentioned previously, using [x, y] has the advantage of useful array.toString and array.map implementations, as well as already being the standard input form of many methods of D3… (e.g., d3.geo.projection, d3.geom.polygon). Not to mention standard fare for GeoJSON and TopoJSON.
That is indeed a very good point, but I would like to note that the .map() functionality mentioned above for each layout would not conflict with a default accessor of [x, y]. I'm thinking along the lines of: d.hasOwnProperty("x") ? d.x : d[0];
Anyway, I should rest my case before it's too late :) Appreciate the discussion!
Despite my previous comment, I don't think it would be worth the trouble to switch the layouts to set d[0] and d[1]. For example, if we change d3.layout.stack to set d[0] and d[1] rather than d.x and d.y, should it still set d.y0? In addition d.y is a length, not a position, for the stack layout; so should it instead set y0 as d[1] and then set d.y (or d.dy)? Moreover it feels clunky for layouts to set d[0] and d[1] if the nodes are objects, and not arrays, where they won't have the useful array.toString and array.map built-in.
For code that operates solely on geometry, and is not about deriving geometry from non-geometric data (as layouts do), then I think it makes sense to continue to use [x, y] rather than switching to {x, y}. This applies to d3.geom, d3.geo, d3.svg, d3.mouse and d3.touches. (I'm unsure about d3.behavior.drag origin.)
And specifically d3.geom.quadtree? It does seem like it should take [x, y] as input as it did originally, but then the force layout must map its nodes to this representation, or set d[0] and d[1] in addition to d.x and d.y, or the quadtree must expose accessors. If the force layout maps nodes, I worry that this would be slow, and we'd still need a version of d3.geom.quadtree that accepts {x, y} for backwards-compatibility. And if the quadtree allows accessors, should voronoi, delaunay and hull likewise allow accessors? Jason argues that these are low-level components, but
even the lowly d3.min and d3.max support accessors! :)
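For example, even these one-liners accept an accessor as an optional second argument:

var min = d3.min(data, function(d) { return d.value; });
var max = d3.max(data, function(d) { return d.value; });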
Perhaps we should refactor d3.geom.quadtree so that it is more like a layout, even if its default behavior is the same as it is today. For example:
var quadtree = d3.geom.quadtree()
    .x(function(d) { return d.x; })
    .y(function(d) { return d.y; })
    .size([width, height]);

var q = quadtree(points);

For backwards-compatibility, the old d3.geom.quadtree(points, width, height) is equivalent to:
var q = d3.geom.quadtree()
    .x(function(d) { return d[0]; })
    .y(function(d) { return d[1]; })
    .size([width, height])
    (points);

Likewise, a similar redesigned d3.geom.voronoi could obviate the d3.layout.voronoi staged in #1120, so there wouldn't need to be confusion about whether to use the d3.layout or d3.geom version. Yay, parsimony.
Why not simply support both... wherever both [x, y] and {x: d.x, y: d.y} could be viewed as a natural choice, that is.
This is implicitly the case for many layouts already; e.g., d3.svg.line has an accessor function named x that points by default to element [0]. Clearly the name ['x'] and the number [0] are two different name-tags for the same thing (although one is just an accessor name).
Using accessors (the current general setup), the default accessor, where applicable, would simply be:
var d3_default_X = function(d) {
  return d.hasOwnProperty("x") ? d.x : d[0];
};
.....
This way, accessor selection functions could be added systematically to all layouts, with the default taking care of both possibilities.
Should polygon be refactored to use accessors too?
No, I don't think that's necessary. But I think we should do this before releasing 3.1.0:
- d3.geom.hull should be refactored to use accessors.
Quick notes:
D3 favors accessors to convert data to expected representations (e.g., line.x, hierarchy.children). These accessors are more convenient than array.map, presumably. For path generators that return path strings rather than readable data, and thus only use derived variables (via accessors) temporarily, there's no issue.
However, layouts (rather than path generators) return inspectable data. Layouts generally write their output to fixed properties on the data (e.g., force.x, hierarchy.x). Configuration of output is not generally allowed, with the exception of stack.out.
Sometimes this causes problems, as when an accessor is specified that reads from an output property (e.g., pack.value(function(d) { return d.value * 2; })), or when the layout clobbers an input data property.

Some layouts do not require objects as input, and so instead produce wrapper objects, such as the pie layout. (Previously, the hierarchy layouts also did this.) For wrapping layouts, a data property is usually provided to access the original data from the layout's provided wrapper.

Some layouts return data that is not isomorphic to the input, such as voronoi.links and voronoi.triangles. What should happen if the layout in this case is also dependent on accessors? Should the derived data contain the accessed values? How should the derived data be linked back to the original data?
Related: the force layout is stateful, whereas other layouts are stateless (or mostly stateless) transformers.