References #4

Open
davidagold opened this issue Jul 23, 2016 · 3 comments

Comments

@davidagold
Owner

A collection of relevant issues in other repositories:

API:
- JuliaData/DataFramesMeta.jl#58

@floswald

Hi @davidagold!
I'm a big fan of dplyr and therefore welcome your initiative. I was just wondering whether it might be useful to add a short discussion to your readme of the main differences from DataFramesMeta, as that package seems very close to what you are building here. Do you expect significant differences in performance? Will you be better able to deal with things like #58 in DataFramesMeta? I'm not a DataFramesMeta developer, just curious.

@davidagold
Owner Author

@floswald That's a great idea! I just included issue #58 above because it seems like something this package will eventually confront as well. I don't think this package will necessarily be more suited to handling it -- it seems like a design decision more than an implementation detail. I could be wrong. I think performance in both packages should be comparable.

One small difference concerns how the packages emulate R's non-standard evaluation. In DataFramesMeta (DFM), columns are referenced within macro invocations by symbol objects (e.g. :SepalLength) whereas in the present package columns can just be named (e.g. SepalLength).

A larger difference is that the commands of DFM are built on top of @with, which works fundamentally at the level of DataFrame columns. As a result, one uses vectorized operations in, say, filtering expressions passed to @where. In the present package, filtering a DataFrame is accomplished via bitbroadcast (as described in the readme), so filtering expressions can use elementwise operations.

The two points above are summarized by the difference between

@where(iris, :SepalLength .> 1.5)

and

@filter(iris, SepalLength > 1.5)

Apart from the increased flexibility, this means one isn't restricted to thinking about manipulations fundamentally at the level of columns.
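To make the column-level versus row-level distinction concrete, here is a rough sketch in plain Julia that uses neither package. The table `tbl` is a hypothetical stand-in (a NamedTuple of column vectors), and `row_pred` and `select_rows` are illustrative names, not part of DFM or of the present package.

```julia
# Hypothetical stand-in for a table: a NamedTuple of equal-length column vectors.
tbl = (SepalLength = [5.1, 1.2, 4.7, 0.9],
       Species     = ["setosa", "setosa", "virginica", "virginica"])

# Column-level style (in the spirit of @where): the expression sees a whole
# column, so the comparison has to be written with a dotted, vectorized operator.
mask_columns = tbl.SepalLength .> 1.5

# Row-level style (in the spirit of @filter): the predicate is an ordinary
# scalar function, and something else takes care of applying it to each row.
row_pred(sepal_length) = sepal_length > 1.5
mask_rows = map(row_pred, tbl.SepalLength)

# Keep only the rows where the mask is true, column by column.
select_rows(t, mask) = map(col -> col[mask], t)

filtered = select_rows(tbl, mask_rows)
```

Both masks select the same rows; the difference is only in which form the user has to write the predicate.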

Another larger difference concerns the generality of backends. The plan for this package is to support querying multiple backends, not just DataFrames. We envision support for database connections as well as other Julia objects that satisfy some sort of interface for tabular data (hopefully to be codified in AbstractTables.jl). As such, the internals of the package revolve around the creation of graphs that represent the manipulation commands issued by the user. Both the objective (supporting multiple backends) and the means of achieving it (representing manipulations via DAGs) are notable differences from DFM. Furthermore, the DAG representation of manipulations will ideally not be purely internal, but also available for use by methods from other packages.
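As a very rough illustration of what representing manipulations as a graph can look like, here is a self-contained sketch. Every name in it (`QueryNode`, `SourceNode`, `FilterNode`, `SelectNode`, `run_query`, `steps`) is hypothetical and is not this package's actual internals; the graph is a simple linear chain, and the only "backend" is a vector of NamedTuple rows.

```julia
# Hypothetical node types; each non-source node points at its input node.
abstract type QueryNode end

struct SourceNode{T} <: QueryNode
    data::T
end

struct FilterNode{F} <: QueryNode
    input::QueryNode
    pred::F                # row -> Bool
end

struct SelectNode <: QueryNode
    input::QueryNode
    cols::Vector{Symbol}
end

# One possible interpreter, for an in-memory vector of NamedTuple rows.
# A different backend could walk the same graph and do something else with it.
run_query(n::SourceNode) = n.data
run_query(n::FilterNode) = filter(n.pred, run_query(n.input))
run_query(n::SelectNode) =
    [NamedTuple{Tuple(n.cols)}(row) for row in run_query(n.input)]

# Because the graph is plain data, other code can inspect it without running it.
steps(::SourceNode) = ["source"]
steps(n::FilterNode) = [steps(n.input); "filter"]
steps(n::SelectNode) = [steps(n.input); "select " * join(n.cols, ", ")]

rows = [(SepalLength = 5.1, Species = "setosa"),
        (SepalLength = 1.2, Species = "setosa"),
        (SepalLength = 6.3, Species = "virginica")]

graph = SelectNode(FilterNode(SourceNode(rows), r -> r.SepalLength > 1.5),
                   [:Species])
result = run_query(graph)
```

The point of the sketch is only the separation: building `graph` records what the user asked for, and `run_query` (or any other method that accepts the graph) decides how to carry it out.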

Another minor-ish feature difference concerns support for chaining. The present package supports, for instance,

iris |> @filter(Species == "setosa")

whereas chaining using DFM requires use of either Mike Innes' Lazy.jl package or the @linq macro. Support for chaining in the present package is facilitated by the internals as described above.
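For readers wondering how a call that omits the data argument can sit on the right of `|>`, one common pattern is for that form to return a function of the data. The sketch below uses a plain function rather than a macro; `keep_rows` is a hypothetical name and this is not how `@filter` is actually implemented.

```julia
# Hypothetical helper: given only a predicate, return a function that
# filters whatever collection of rows it is later applied to.
keep_rows(pred) = rows -> filter(pred, rows)

rows = [(SepalLength = 5.1, Species = "setosa"),
        (SepalLength = 4.9, Species = "setosa"),
        (SepalLength = 6.3, Species = "virginica")]

# `x |> f` is just `f(x)`, so the partially applied form chains naturally.
setosa = rows |> keep_rows(r -> r.Species == "setosa")
```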

Those are some thoughts off the top of my head. Is that helpful?

@floswald

Very helpful! I think the multiple backend support you mention, in particular, is a big difference. I'm looking forward to hearing more about this!
