New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

`compute` does not compute #1401

Closed
kwmsmith opened this Issue Jan 27, 2016 · 3 comments

Comments

Projects
None yet
2 participants
@kwmsmith
Member

kwmsmith commented Jan 27, 2016

I'm breaking this topic out into its own issue to try and keep things topical. It has overlap with issues #1357 and #1304.

Proposed improvements to compute() API

For reasons described below, we propose the following changes to compute():

  • compute() always see computation through to completion, unless instructed otherwise by a default argument override. Notably, this would change the behavior of compute() for SQL-backed computations.
  • We propose that, for table-like results, compute() by default always returns a Pandas dataframe or Dask dataframe for results that will fit in memory (for some modifiable heuristic for "what fits in memory").
  • For tabular results that don't fit in memory, my initial preference is to fall back to odo's failure mode or raise a NotImplementedError asking the user to reduce / summarize the computation further.
  • The new compute() can be instructed to return the native, backend-specific result (like a SQLA select object) via overriding a return_native=True or return_type='native' default argument at call time (I'm open to better names here).
  • For array-like things, compute() returns a NumPy array as its default for in-memory results, and raises or returns a dask.array for OOC results. This behavior is similarly overrideable.

Reasoning and usecases

Currently, Blaze's compute() returns different types of results depending on the backend. For instance, for a Pandas-backed dataframe a:

a = bz.Data(pd.DataFrame(...))
bz.compute(a.col.sum())

this code actually triggers a Pandas computation, and yields a scalar.

If we use the same logical data but use a SQL DB, the results are significantly different:

db = bz.Data('postgresql://...')
bz.compute(db.col.sum())

This will result in a sqlalchemy.sql.selectable.Select object. It is not obvious without searching in the documentation that the "right way" to see the computation to completion is to use odo:

db = bz.Data('postgresql://...')
bz.odo(bz.compute(db.col.sum()), float)

The current behavior has led to significant confusion and lost time on the part of new users, especially for new users who know nothing of SQL and who are specifically using Blaze as a way to abstract away SQL entirely.

The benefits to these proposed changes to compute() are several:

  • Blaze's user-facing API is more consistent across backends. Pandas and SQL backed computations always result in a (fully realized) Pandas dataframe by default when that result can fit in memory, and results in a dask.dataframe otherwise (and similarly for array-based computations with NumPy / dask.array as the default).
  • This behavior is more straightforward, less error prone, and less surprising. * compute() makes good on its name: it actually computes things. (The current SQLA behavior is more accurately called compile, not compute, IMO.)
  • This behavior facilitates users comparing results when swapping between SQL and Pandas backends, for example.
  • By providing an overrideable optional argument, users who, for example, want the sqlalchemy object for use in other SQLA code can easily get what they need.

Implementation-wise, one way to reduce code churn would be to provide a public-facing compute() with the behavior described here. The current compute() would be renamed to an internal function and would retain its current behavior. The public compute() could be implemented along the lines of:

# variable names subject to modification...
def compute(expr, data, return_type=None, **kwargs):
    '''
    User-facing docstring w/ examples, etc.
    '''
    res = _internal_compute(expr, data, **kwargs)
    if return_type == 'native':
        return res
    elif (return_type is None and
            isinstance(res, tabular_types) and
            fits_in_memory_heuristic(res)):
        return odo(res, pd.DataFrame)
    elif # other cases for dask.dataframe, numpy, etc.
        ...
@postelrich

This comment has been minimized.

Contributor

postelrich commented Jan 31, 2016

the features/fixes I see we want are:

  • A way to handle large datasets - can be done with pagination or returning an error
  • Fix compute to always compute as compute can sometimes just build an expression instead of computing
  • Give options of how the data can be returned: pandas, numpy, native
@kwmsmith

This comment has been minimized.

Member

kwmsmith commented Feb 1, 2016

@postelrich

A way to handle large datasets - can be done with pagination or returning an error

First cut would just slap a .head(10) on the expression before previewing it -- that's what Blaze does now to keep the dataset usage low.

Future work would add some sophistication to detect when we're doing potentially expensive operations--joins for example--and raise an exception in those cases.

Fix compute to always compute as compute can sometimes just build an expression instead of computing

Yes, that's right -- after this work, compute will always see the computation through to the end, no "intermediate representations" for lack of a better term.

Give options of how the data can be returned: pandas, numpy, native

Yes. One more level of sophistication:

  • if we're doing table-centric stuff, then the default result is pandas (or dask.dataframe for out-of-core, but that's a future improvement and out of scope for this issue here).
  • if we're doing array-centric stuff, then the default result is numpy (or dask.array also for the future, and out of scope for now).
  • if the user specifies compute(..., return_type=sqlalchemy.Select) (or something similar), then the result is a SQLA query--you see how this generalizes.
  • the return_type='native' would return a type that's consistent with the backends and doesn't require any migration.
@kwmsmith

This comment has been minimized.

Member

kwmsmith commented Apr 13, 2016

With the work started in #1411, this thread is resolved, closing.

@kwmsmith kwmsmith closed this Apr 13, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment