Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

concat #1128

Merged
merged 22 commits into from Jun 22, 2015

Conversation

@llllllllll
Copy link
Member

commented Jun 11, 2015

Defines a vstack operation that can be performed on 2 tables. This pull request also implements the backend for pandas, sql, csv, and numpy

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2015

I think you should just call this concat or concatenate. I have always found the *stack methods pretty confusing, my 2c.

@llllllllll

This comment has been minimized.

Copy link
Member Author

commented Jun 12, 2015

I don't have a preference; however, I took the name from @notestaff in #974. I do like that vstack vs hstack drops the ambiguity about how you are stacking where concat could be horizontal or vertical.

__inputs__ = 'lhs', 'rhs'

@property
def dshape(self):

This comment has been minimized.

Copy link
@llllllllll

llllllllll Jun 12, 2015

Author Member

is there some magic in datashape to do this already?

@llllllllll llllllllll closed this Jun 12, 2015

@llllllllll llllllllll reopened this Jun 12, 2015

@mrocklin

This comment has been minimized.

Copy link
Member

commented Jun 12, 2015

Thoughts on the stack operation as defined in dask.array (and now I think ported over to numpy). This is, I think, significantly cleaner than the *stack functions.

http://dask.readthedocs.org/en/latest/stack.html

@llllllllll

This comment has been minimized.

Copy link
Member Author

commented Jun 12, 2015

I think that concatenate is a good name for this operation. How should I name the node to not clash with the Concat class for string concats?

@cpcloud cpcloud added this to the 0.8.1 milestone Jun 12, 2015

@cpcloud

This comment has been minimized.

Copy link
Member

commented Jun 12, 2015

i like stack because it's flexible enough to cover both the table case as well as the array case. With tables the only valid value for the axis argument is 0 whereas with arrays you can stack along any dimension.

@mrocklin

This comment has been minimized.

Copy link
Member

commented Jun 12, 2015

unioning two tables sounds more like np.concatenate than stack. To me the difference is that stack creates a new dimension while concatenate adds along an existing dimension.

@cpcloud

This comment has been minimized.

Copy link
Member

commented Jun 12, 2015

fair enough, concat is probably the way to go. @llllllllll i think as long as they're namespaced properly it should be fine to name this operation Concat as well.

@llllllllll

This comment has been minimized.

Copy link
Member Author

commented Jun 12, 2015

Do you not think that would make the compute engine code more complicated because I can string concat two tables or "vstack" concat two tables? As a reader, how would I know what is meant. Also, the fact that bz.concat(a, b) and a.concat(b) are totally different would be too confusing. I personally think that we should just change string concat to an implementation of Add on string shapes to avoid the confusion, I am not sure that the new Concat node was even in the new release so it can't have too much adoption yet

@llllllllll llllllllll force-pushed the quantopian:vstack branch from b5ccd56 to 445317c Jun 16, 2015

@llllllllll

This comment has been minimized.

Copy link
Member Author

commented Jun 16, 2015

Only test_sparksql is failing.

@llllllllll llllllllll changed the title vstack concat Jun 17, 2015

@llllllllll llllllllll force-pushed the quantopian:vstack branch from d7df0c3 to 6872e03 Jun 17, 2015

@llllllllll

This comment has been minimized.

Copy link
Member Author

commented Jun 17, 2015

rebased with spark fix.

if not (isinstance(lhs, type(rhs)) or isinstance(rhs, type(lhs))):
raise TypeError('lhs and rhs must be the same type')

return _concat((lhs, rhs), axis=t.axis)

This comment has been minimized.

Copy link
@llllllllll

llllllllll Jun 17, 2015

Author Member

I cannot decide if I should reset_index(drop=True) on this so that if you are using an integer index, it is still unique after the concat.

This comment has been minimized.

Copy link
@cpcloud

cpcloud Jun 17, 2015

Member

if you decided to do that then pass ignore_index=True to pd.concat

This comment has been minimized.

Copy link
@llllllllll

llllllllll Jun 17, 2015

Author Member

Ah, I didn't know about this flag.

@llllllllll llllllllll closed this Jun 18, 2015

@llllllllll llllllllll reopened this Jun 18, 2015

@llllllllll llllllllll force-pushed the quantopian:vstack branch from 42d1638 to 5c5daec Jun 19, 2015

@llllllllll

This comment has been minimized.

Copy link
Member Author

commented Jun 19, 2015

Can someone take a took at this?

ldshape = lhs.dshape
rdshape = rhs.dshape
if ldshape.measure != rdshape.measure:
raise ValueError(

This comment has been minimized.

Copy link
@cpcloud

cpcloud Jun 20, 2015

Member

I think a TypeError is more appropriate here.

for n, (a, b) in enumerate(zip_longest(lshape, rshape, fillvalue=None)):
if n != axis and a != b:
raise ValueError(
"Shapes are not equal along axis '{n}': {a} != {b}".format(

This comment has been minimized.

Copy link
@cpcloud

cpcloud Jun 20, 2015

Member

Any reason for the single quotes here? I found it a bit awkward to see this

Shapes are not equal along axis '0': 5 != 3

Not a big deal though.

This comment has been minimized.

Copy link
@llllllllll

llllllllll Jun 20, 2015

Author Member

I will drop them. Do you think this should also be a type error since it is about the shape of the expr?

This comment has been minimized.

Copy link
@cpcloud

cpcloud Jun 20, 2015

Member

Yes a TypeError is probably the way to go here.

@cpcloud

This comment has been minimized.

Copy link
Member

commented Jun 20, 2015

@llllllllll Can you add a test that exercises the error conditions?

def compute_up(t, lhs, rhs, **kwargs):
if t.axis != 0:
raise NotImplementedError(
'Cannot concat along a non-zero axis in sql',

This comment has been minimized.

Copy link
@cpcloud

cpcloud Jun 20, 2015

Member

I wonder if pointing people to Merge expressions is useful in this case since if they pass axis=1, Merge is likely what they want.

This comment has been minimized.

Copy link
@cpcloud

cpcloud Jun 20, 2015

Member

I also think a ValueError is better here. Whenever I see NotImplementedError I have hope that one day the world will improve. In this case, I don't think we'll ever implement axis=1 for SQL.

This comment has been minimized.

Copy link
@llllllllll

llllllllll Jun 20, 2015

Author Member

I think will point people to merge here, because this seems like it might be a common mistake. I also will switch to a valueerror. I think there needs to be some notion of a NotImplementedOnPurpose exception in the stdlib for cases like this.

This comment has been minimized.

Copy link
@cpcloud

cpcloud Jun 20, 2015

Member

ha! raise DreamsCrushedError() :)

This comment has been minimized.

Copy link
@llllllllll

llllllllll Jun 20, 2015

Author Member

lol

@cpcloud

This comment has been minimized.

Copy link
Member

commented Jun 22, 2015

@llllllllll Anything else you want to add here?

@llllllllll

This comment has been minimized.

Copy link
Member Author

commented Jun 22, 2015

Nope, all set

@cpcloud

This comment has been minimized.

Copy link
Member

commented Jun 22, 2015

@llllllllll merge when ready

llllllllll added a commit that referenced this pull request Jun 22, 2015

@llllllllll llllllllll merged commit ff370aa into blaze:master Jun 22, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@llllllllll llllllllll deleted the quantopian:vstack branch Jun 22, 2015

@llllllllll llllllllll referenced this pull request Jun 26, 2015

Closed

Union Operation #974

@ywang007

This comment has been minimized.

Copy link

commented on 737953b Jul 7, 2015

Hi, I see you've made the effort to expose the concat, however, even with this, I'm still getting the concat from toolz.concat:

from a freshly started python shell, I do the following:

help(concat)
Traceback (most recent call last):
File "", line 1, in
NameError: name 'concat' is not defined
from blaze import concat
help(concat)

----> Now it will show me: Help on function concat in module toolz.itertoolz:

Any idea why?

Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.