Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coalesce and cast #1409

Closed
wants to merge 7 commits into from
Closed

coalesce and cast #1409

wants to merge 7 commits into from

Conversation

llllllllll
Copy link
Member

cast is needed when blaze is incorrect about the types of an expression. For example, the case I opened earlier about reductions of var * A being typed A instead of ?A when the collection could be empty. To make it easy to work around this we should be able to explicitly overwrite the type.

coalesce is useful for a lot of queries.

@llllllllll
Copy link
Member Author

ping @kwmsmith I would like someone to look this one over because I am adding some new expressions.

@llllllllll
Copy link
Member Author

freshly rebased against master

@llllllllll
Copy link
Member Author

@kwmsmith I would like to merge this soon

@llllllllll
Copy link
Member Author

If you don't have any comments I plan on merging by the end of tomorrow

class Cast(Expr):
"""Cast an expression to a different type.

This is only an expression time operation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide a brief Examples section?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@kwmsmith
Copy link
Member

@llllllllll Apologies for taking so long to get to this -- I've been pinned down on a number of internal things lately.

This looks good to me; my only question is on the names; new users may be confused on the distinction between cast and coerce, and coalesce isn't obvious at first what it does unless you're coming from SQL :)

Because of the possibly confusing or non-obvious names, it's important that we have good docstrings with good examples.

@llllllllll
Copy link
Member Author

No worries, I just wanted to make sure didn't get forgotten. I completely agree about the naming and documentation issue. I will go through and try to add some good examples and maybe some prose docs about the difference between cast and coerce. If you have an alternative name for cast please let me know. This operation is not at all the same as sql cast which can change values, this is really more a C cast, but idk if our users will have the same intuition for what a C cast is vs a sql cast.

@kwmsmith kwmsmith added this to the 0.10 milestone Mar 11, 2016
@kwmsmith
Copy link
Member

If you have an alternative name for cast please let me know.

Taking a page from Numpy / pandas, which have astype(): what about as_dshape instead of cast? I'm trying to capture the idea that we're just changing the dshape of the expression, not the underlying data.

@llllllllll
Copy link
Member Author

numpy astype actually does conversions though, the equivalent in numpy is view

@kwmsmith
Copy link
Member

For cast, what happens if the user casts a dataframe with integer column to a dataframe with float column?

bz.compute(df.int_column.cast('float32'))

Since cast is a no-op at compute time, the result here would be an integer pandas series, correct? So in that sense, it isn't analogous to Numpy's view, since the cast is completely ignored at compute time. I don't think I can come up with any Numpy / Pandas analogue for cast here, can you?

@llllllllll
Copy link
Member Author

I guess it is closer to view which is a C style cast. Really this is just a type level function for allowing us to use different expressions when blaze gets the type wrong. The use case I have for this is when there are bugs in the blaze type system I need to be able to fix them in my app code before I have time to fix them in blaze. For example, reductions over var returning A instead of ?A

@kwmsmith
Copy link
Member

Yes, I totally understand the usecase for correcting blaze / datashape's option / non option types. That makes sense. It's for the other cases when the user tries to cast from integer to floats, for example, and it has no effect in the computed results that gets more confusing.

Is dealing with option types the only usecase here, or are there others? Because my preference would be to keep this limited to handling casts between option / non-option types of the same underlying type for now, and only expand it to more general casts if there's a clear usecase for that.

@llllllllll
Copy link
Member Author

The use case is for any time that the blaze type system is incorrect. This was happening before with reductions on timedeltas giving back floats. This was corrected but but I would like to have the freedom to work around this class of issue in the future.

@kwmsmith
Copy link
Member

I'm OK with this for the option / non-option case, of course, and I might envision other cases as well (string & string[10], for example). But if it turns out users misuse cast frequently and it leads to confusion--the int to float cast, for example, and the user expecting some change in the result--then we may need to modify the types cast is allowed to cast between.

My preference would be to only allow cast for the clear pain points, and expand it as other cases become clear in the future.

I'm OK with merging this as long as we're able to restrict the cases for cast in the future if it becomes a place where users get frequently tripped up.

Joe Jevnik added 3 commits March 14, 2016 19:56
Cast is a type level function that reinterpretes a node as a different
type. This is like c cast, or c++ reinterpret_cast. This allows users to
correct mistakes in the blaze type system.

Coalesce is a binary operator much like sql coalesce. It picks the first
non null element.
@llllllllll
Copy link
Member Author

Added more tests, including some xfail tests around promotion of mixed shaped records which I think should work in the future.

@kwmsmith tests are passing

def test_coalesce(sqla):
t = symbol('t', discover(sqla))
assert (
odo(compute(coalesce(t.B, -1), {t: sqla}), list) ==
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for consistency, we've been using compute(expr, sql_data, return_type=list) for simple cases like this rather than an explicit odo call.

@kwmsmith
Copy link
Member

@llllllllll just some consistency nits, otherwise +1.

As I mentioned above, in the future, we may need to restrict cast's applicability if we find that non power-users misuse it frequently. And if we find a better name for cast, then I'm fine with a rename in the future.

Also, please make a separate issue to add sections for coerce and cast in the blaze docs.

@llllllllll
Copy link
Member Author

I like the return_type use here, I think it reads nicely. I added these exprs to the API section, but I will add an issue to write actual docs for this

@kwmsmith
Copy link
Member

LGTM.

@llllllllll
Copy link
Member Author

Thanks for looking at this

@llllllllll
Copy link
Member Author

merged on cli

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants