Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coalesce and cast #1409

Closed
wants to merge 7 commits into from

Conversation

@llllllllll
Copy link
Member

commented Feb 9, 2016

cast is needed when blaze is incorrect about the types of an expression. For example, the case I opened earlier about reductions of var * A being typed A instead of ?A when the collection could be empty. To make it easy to work around this we should be able to explicitly overwrite the type.

coalesce is useful for a lot of queries.

@llllllllll llllllllll force-pushed the quantopian:c-funcs branch 3 times, most recently from 3e008de to d290fc6 Feb 12, 2016

@llllllllll llllllllll force-pushed the quantopian:c-funcs branch 4 times, most recently from 6e0ca18 to 0b6e761 Feb 12, 2016

@llllllllll

This comment has been minimized.

Copy link
Member Author

commented Feb 18, 2016

ping @kwmsmith I would like someone to look this one over because I am adding some new expressions.

@llllllllll llllllllll force-pushed the quantopian:c-funcs branch from 0b6e761 to 7da76ef Feb 22, 2016

@llllllllll

This comment has been minimized.

Copy link
Member Author

commented Feb 22, 2016

freshly rebased against master

@llllllllll llllllllll force-pushed the quantopian:c-funcs branch from 7da76ef to b42cab5 Mar 1, 2016

@richafrank richafrank force-pushed the quantopian:c-funcs branch from b42cab5 to 3142218 Mar 4, 2016

@llllllllll

This comment has been minimized.

Copy link
Member Author

commented Mar 10, 2016

@kwmsmith I would like to merge this soon

@llllllllll

This comment has been minimized.

Copy link
Member Author

commented Mar 11, 2016

If you don't have any comments I plan on merging by the end of tomorrow

class Cast(Expr):
"""Cast an expression to a different type.
This is only an expression time operation.

This comment has been minimized.

Copy link
@kwmsmith

kwmsmith Mar 11, 2016

Member

Can you provide a brief Examples section?

This comment has been minimized.

Copy link
@llllllllll

llllllllll Mar 15, 2016

Author Member

done

@@ -790,6 +804,92 @@ def coerce(expr, to):
return Coerce(expr, dshape(to) if isinstance(to, _strtypes) else to)


class Cast(Expr):

This comment has been minimized.

Copy link
@kwmsmith

kwmsmith Mar 11, 2016

Member

We'll now have both cast and coerce expressions with this. Is the essential distinction that cast is for the expression's dshape, while coerce is for the underlying backend?

This comment has been minimized.

Copy link
@llllllllll

llllllllll Mar 11, 2016

Author Member

Yeha, coerce will actually affect the data, cast will just reinterperate the data

coalesce(a, b) = {
a if a is not NULL
b otherwise
}

This comment has been minimized.

Copy link
@kwmsmith

kwmsmith Mar 11, 2016

Member

A brief Examples section would be very useful here as well.

This comment has been minimized.

Copy link
@llllllllll

llllllllll Mar 15, 2016

Author Member

done

}
"""
__slots__ = '_hash', 'lhs', 'rhs', 'dshape'
__inputs__ = 'lhs', 'rhs'

This comment has been minimized.

Copy link
@kwmsmith

kwmsmith Mar 11, 2016

Member

Is Coalesce always binary? Wouldn't it make sense to be variadic? Or is it variadic in SQL and we can't express variadic things in Blaze's multiple dispatch, so we're limited to binary?

This comment has been minimized.

Copy link
@llllllllll

llllllllll Mar 11, 2016

Author Member

This is variadic in sql but we don't yet have a good way to handle this in blaze. When we do support that this would be a good candidate for a variadic (and then concat)

@@ -845,6 +945,7 @@ def ndim(expr):
schema_method_list.extend([
(isscalar, set([label, relabel, coerce])),
(isrecord, set([relabel])),
(lambda ds: isinstance(ds, Option), {coalesce}),

This comment has been minimized.

Copy link
@kwmsmith

kwmsmith Mar 11, 2016

Member

coalesce is valid for any isscalar dshape, right? Not just Option types? Or am I misunderstanding coalesce?

This comment has been minimized.

Copy link
@llllllllll

llllllllll Mar 11, 2016

Author Member

it is not meaningful to coalesce a non-optional type with anything because it is just the identity.

# to hold either result
assert_dshape_equal(expr.dshape, dshape('?int64'))
assert expr.lhs.isidentical(v)
assert expr.rhs.isidentical(y)

This comment has been minimized.

Copy link
@kwmsmith

kwmsmith Mar 11, 2016

Member

Some error / edge cases for testing:

  • non-scalar dshapes: Record, others?;
  • non-integer dshapes: string comes to mind;
  • non-atomic expression arguments with compatible dshapes.
for reinterpreting an expression as another dshape. This is based on C++
``reinterpret_cast``, or just normal C casts. For example:
``symbol('s', 'int32').cast('uint32').dshape == dshape('uint32')``. This
expression is the identity function at runtime but allows users to alter the

This comment has been minimized.

Copy link
@kwmsmith

kwmsmith Mar 11, 2016

Member

"is the identity function at runtime" -- can you expand on this? Do you mean it has no effect on the execution of compute on the expression?

This comment has been minimized.

Copy link
@llllllllll

llllllllll Mar 11, 2016

Author Member

I can put this in the docs, but cast is defined as

@dispatch(Expr, object)
def compute_up(expr, data, **kwargs):
    return data
``coalesce(None, 1)`` == 1, and ``coalesce(None, None) == None``.
This is inspired by the sql function of the same name (:issue:`1409`).
* Adds :func:`~blaze.expr.expressions.cast` expression as a type level function
for reinterpreting an expression as another dshape. This is based on C++

This comment has been minimized.

Copy link
@kwmsmith

kwmsmith Mar 11, 2016

Member

Perhaps "Adds the :func:~blaze.expr.expressions.cast expression to reinterpret an expression's dshape"?

@kwmsmith

This comment has been minimized.

Copy link
Member

commented Mar 11, 2016

@llllllllll Apologies for taking so long to get to this -- I've been pinned down on a number of internal things lately.

This looks good to me; my only question is on the names; new users may be confused on the distinction between cast and coerce, and coalesce isn't obvious at first what it does unless you're coming from SQL :)

Because of the possibly confusing or non-obvious names, it's important that we have good docstrings with good examples.

@llllllllll

This comment has been minimized.

Copy link
Member Author

commented Mar 11, 2016

No worries, I just wanted to make sure didn't get forgotten. I completely agree about the naming and documentation issue. I will go through and try to add some good examples and maybe some prose docs about the difference between cast and coerce. If you have an alternative name for cast please let me know. This operation is not at all the same as sql cast which can change values, this is really more a C cast, but idk if our users will have the same intuition for what a C cast is vs a sql cast.

@kwmsmith kwmsmith added this to the 0.10 milestone Mar 11, 2016

@kwmsmith

This comment has been minimized.

Copy link
Member

commented Mar 11, 2016

If you have an alternative name for cast please let me know.

Taking a page from Numpy / pandas, which have astype(): what about as_dshape instead of cast? I'm trying to capture the idea that we're just changing the dshape of the expression, not the underlying data.

@llllllllll

This comment has been minimized.

Copy link
Member Author

commented Mar 11, 2016

numpy astype actually does conversions though, the equivalent in numpy is view

@kwmsmith

This comment has been minimized.

Copy link
Member

commented Mar 11, 2016

For cast, what happens if the user casts a dataframe with integer column to a dataframe with float column?

bz.compute(df.int_column.cast('float32'))

Since cast is a no-op at compute time, the result here would be an integer pandas series, correct? So in that sense, it isn't analogous to Numpy's view, since the cast is completely ignored at compute time. I don't think I can come up with any Numpy / Pandas analogue for cast here, can you?

@llllllllll

This comment has been minimized.

Copy link
Member Author

commented Mar 11, 2016

I guess it is closer to view which is a C style cast. Really this is just a type level function for allowing us to use different expressions when blaze gets the type wrong. The use case I have for this is when there are bugs in the blaze type system I need to be able to fix them in my app code before I have time to fix them in blaze. For example, reductions over var returning A instead of ?A

@kwmsmith

This comment has been minimized.

Copy link
Member

commented Mar 11, 2016

Yes, I totally understand the usecase for correcting blaze / datashape's option / non option types. That makes sense. It's for the other cases when the user tries to cast from integer to floats, for example, and it has no effect in the computed results that gets more confusing.

Is dealing with option types the only usecase here, or are there others? Because my preference would be to keep this limited to handling casts between option / non-option types of the same underlying type for now, and only expand it to more general casts if there's a clear usecase for that.

@llllllllll

This comment has been minimized.

Copy link
Member Author

commented Mar 11, 2016

The use case is for any time that the blaze type system is incorrect. This was happening before with reductions on timedeltas giving back floats. This was corrected but but I would like to have the freedom to work around this class of issue in the future.

@kwmsmith

This comment has been minimized.

Copy link
Member

commented Mar 11, 2016

I'm OK with this for the option / non-option case, of course, and I might envision other cases as well (string & string[10], for example). But if it turns out users misuse cast frequently and it leads to confusion--the int to float cast, for example, and the user expecting some change in the result--then we may need to modify the types cast is allowed to cast between.

My preference would be to only allow cast for the clear pain points, and expand it as other cases become clear in the future.

I'm OK with merging this as long as we're able to restrict the cases for cast in the future if it becomes a place where users get frequently tripped up.

llllllllll added some commits Feb 9, 2016

ENH: adds cast and coalesce
Cast is a type level function that reinterpretes a node as a different
type. This is like c cast, or c++ reinterpret_cast. This allows users to
correct mistakes in the blaze type system.

Coalesce is a binary operator much like sql coalesce. It picks the first
non null element.

@llllllllll llllllllll force-pushed the quantopian:c-funcs branch from cb51585 to f78e0e5 Mar 15, 2016

llllllllll added some commits Mar 15, 2016

ENH: fix fields optional records
updates tests for coalesce
@llllllllll

This comment has been minimized.

Copy link
Member Author

commented Mar 15, 2016

Added more tests, including some xfail tests around promotion of mixed shaped records which I think should work in the future.

@kwmsmith tests are passing

def test_coalesce(sqla):
t = symbol('t', discover(sqla))
assert (
odo(compute(coalesce(t.B, -1), {t: sqla}), list) ==

This comment has been minimized.

Copy link
@kwmsmith

kwmsmith Mar 15, 2016

Member

nit: for consistency, we've been using compute(expr, sql_data, return_type=list) for simple cases like this rather than an explicit odo call.

[(1,), (1,), (-1,)]
)
assert (
odo(compute(coalesce(t.A, 'z'), {t: sqla}), list) ==

This comment has been minimized.

Copy link
@kwmsmith
@kwmsmith

This comment has been minimized.

Copy link
Member

commented Mar 15, 2016

@llllllllll just some consistency nits, otherwise +1.

As I mentioned above, in the future, we may need to restrict cast's applicability if we find that non power-users misuse it frequently. And if we find a better name for cast, then I'm fine with a rename in the future.

Also, please make a separate issue to add sections for coerce and cast in the blaze docs.

@llllllllll

This comment has been minimized.

Copy link
Member Author

commented Mar 15, 2016

I like the return_type use here, I think it reads nicely. I added these exprs to the API section, but I will add an issue to write actual docs for this

@kwmsmith

This comment has been minimized.

Copy link
Member

commented Mar 15, 2016

LGTM.

@llllllllll

This comment has been minimized.

Copy link
Member Author

commented Mar 15, 2016

Thanks for looking at this

@llllllllll

This comment has been minimized.

Copy link
Member Author

commented Mar 15, 2016

merged on cli

@llllllllll llllllllll closed this Mar 15, 2016

@llllllllll llllllllll deleted the quantopian:c-funcs branch Mar 15, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.