References #4

davidagold · 2016-07-23T14:33:05Z

A collection of relevant issues in other repositories:

floswald · 2016-07-25T12:13:38Z

Hi @davidagold!
I'm a big fan of dplyr and therefore welcome your initiative. I was wondering whether it might be useful to add a short discussion to your readme of the main differences from DataFramesMeta, as that package seems very close to what you are building here. Do you expect significant differences in performance? Will you be better able to deal with things like #58 in DataFramesMeta? I'm not a DataFramesMeta developer, just curious.

davidagold · 2016-07-25T13:38:06Z

@floswald That's a great idea! I just included issue #58 above because it seems like something this package will eventually confront as well. I don't think this package will necessarily be more suited to handling it -- it seems like a design decision more than an implementation detail. I could be wrong. I think performance in both packages should be comparable.

One small difference concerns how the packages emulate R's non-standard evaluation. In DataFramesMeta (DFM), columns are referenced within macro invocations by symbol objects (e.g. :SepalLength) whereas in the present package columns can just be named (e.g. SepalLength).

A larger difference is that the commands of DFM are built on top of @with, which works fundamentally at the level of DataFrame columns. As a result, one uses vectorized operations in, say, filtering expressions passed to @where. In the present package, filtering a DataFrame is accomplished via bitbroadcast (as described in the readme), and hence filtering expressions can use elementwise operations.

The two points above are summarized by the difference between

@where(iris, :SepalLength .> 1.5)

and

@filter(iris, SepalLength > 1.5)

Beyond the added flexibility, this means one isn't restricted to thinking about manipulations fundamentally at the level of whole columns.
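To make the column-level versus row-level distinction concrete, here is a minimal sketch in plain Julia. It uses neither DataFramesMeta nor this package; the vector `sepal_length` and the helper `is_long` are made up for illustration, and the sketch says nothing about how either macro is actually implemented.

```julia
# Plain-Julia illustration only; `sepal_length` and `is_long` are hypothetical.
sepal_length = [5.1, 1.2, 4.7, 0.9]

# Column-level (vectorized) style, analogous in spirit to `:SepalLength .> 1.5`:
# the comparison is broadcast over the whole column and yields a Boolean mask.
mask_vectorized = sepal_length .> 1.5

# Row-level (elementwise) style, analogous in spirit to `SepalLength > 1.5`:
# a scalar predicate is written once and applied to each value in turn.
is_long(x) = x > 1.5
mask_elementwise = map(is_long, sepal_length)

# Both styles select the same rows.
@assert mask_vectorized == mask_elementwise
kept = sepal_length[mask_vectorized]
```

The point is only that the second style lets the user write the predicate as if it applied to a single row.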

Another larger difference concerns the generality of backends. The plan for this package is to support querying multiple backends, not just DataFrames. We envision support for database connections as well as other Julia objects that satisfy some sort of interface for tabular data (hopefully to be codified in AbstractTables.jl). As such, the internals of the package revolve around the creation of graphs that represent the manipulation commands issued by the user. Both the objective (supporting multiple backends) and the means of achieving it (representing manipulations via DAGs) are notable differences from DFM. Furthermore, the DAG representation of manipulations will ideally not be purely internal, but also available for use by methods from other packages.
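As a rough illustration of what "representing manipulations via a graph" can look like, here is a small self-contained sketch. All of the names (`QueryNode`, `SourceNode`, `FilterNode`, `SelectNode`, `collect_rows`) are hypothetical and are not taken from this package; the "backend" is just a vector of NamedTuple rows. The idea is that the graph only describes the manipulation, and a separate method decides how to run it against a given data source.

```julia
# Hypothetical node types; not this package's actual internals.
abstract type QueryNode end

struct SourceNode <: QueryNode end        # placeholder for "the data source"

struct FilterNode <: QueryNode
    input::QueryNode
    predicate::Function                   # row -> Bool
end

struct SelectNode <: QueryNode
    input::QueryNode
    columns::Vector{Symbol}
end

# One possible backend: walk the graph against a vector of NamedTuple rows.
# A different backend could walk the same graph and emit SQL instead.
collect_rows(::SourceNode, rows) = rows
collect_rows(n::FilterNode, rows) = filter(n.predicate, collect_rows(n.input, rows))
function collect_rows(n::SelectNode, rows)
    names = Tuple(n.columns)
    # NamedTuple{names}(row) keeps only the named fields of `row`.
    return [NamedTuple{names}(row) for row in collect_rows(n.input, rows)]
end

rows = [
    (Species = "setosa",    SepalLength = 5.1),
    (Species = "virginica", SepalLength = 6.3),
    (Species = "setosa",    SepalLength = 4.9),
]

# Build the description first, then execute it.
graph = SelectNode(FilterNode(SourceNode(), r -> r.Species == "setosa"), [:SepalLength])
result = collect_rows(graph, rows)
```

Because `graph` is an ordinary value, other code can inspect or transform it before anything is executed, which is the property that makes multiple backends plausible.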

Another minor-ish feature difference concerns support for chaining. The present package supports, for instance,

iris |> @filter(Species == "setosa")

whereas chaining using DFM requires use of either Mike Innes' Lazy.jl package or the @linq macro. Support for chaining in the present package is facilitated by the internals as described above.
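For readers unfamiliar with how a one-argument form can work with `|>`, here is a plain-Julia sketch of the general pattern: a call that omits the data returns a function that is still waiting for it. The helper `keep_where` is hypothetical and is not a claim about how `@filter` is implemented here.

```julia
# Hypothetical helper: given only a predicate, return a closure that
# accepts the data later. This is what lets it sit on the right of `|>`.
keep_where(pred) = rows -> filter(pred, rows)

rows = [
    (Species = "setosa",    SepalLength = 5.1),
    (Species = "virginica", SepalLength = 6.3),
    (Species = "setosa",    SepalLength = 4.9),
]

# `x |> f` is just `f(x)`, so the data flows into the pending operation.
setosa = rows |> keep_where(r -> r.Species == "setosa")

# Steps compose left to right.
long_setosa = rows |>
    keep_where(r -> r.Species == "setosa") |>
    keep_where(r -> r.SepalLength > 5.0)
```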

Those are some thoughts off the top of my head. Is that helpful?

floswald · 2016-07-25T14:14:58Z

Very helpful! I think the multiple backend support you mention is a particularly big difference. I'm looking forward to hearing more about this!
