Split input into multiple partitions on request? #3

Closed
mrocklin opened this issue Jun 7, 2015 · 6 comments

Comments

@mrocklin
Member

mrocklin commented Jun 7, 2015

There is the use case of "please partition my data by hour, regardless of how much data I give you". Is this something that we want to support on top of castra?

@esc
Contributor

esc commented Jun 9, 2015

Yes, and support efficient buffering and appends, like bcolz. But I would say that is a feature for the future. It also depends on what our 'clients' want.

@cpcloud
Member

cpcloud commented Jul 15, 2015

wouldn't this be better suited for dask?

something like

df.repartition(by='H')

@cpcloud
Member

cpcloud commented Jul 15, 2015

currently repartition is a bit more generic, but i could imagine an index-type-sensitive version of it

@mrocklin
Member Author

This would still be important for data access. The idea here is that one might want data that is accessible in hourly chunks (or some other fixed period). While dask could repartition this data, it would still have to read possibly larger chunks at a time.
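
To illustrate the access argument, here is a minimal sketch, assuming partitions are stored one per hour and keyed by their start time. The function names are hypothetical and are not part of castra or dask. It lists which hourly partitions a time-range query would need to read, so a 90-minute query touches two small partitions rather than whatever larger chunk happened to be written.

```python
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)


def floor_to_hour(ts):
    """Round a datetime down to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def partitions_for_range(start, stop):
    """Return the start times of the hourly partitions overlapping [start, stop).

    Hypothetical helper: assumes one partition per hour, keyed by hour start.
    """
    if stop <= start:
        return []
    keys = []
    current = floor_to_hour(start)
    while current < stop:
        keys.append(current)
        current += HOUR
    return keys


# A query from 09:30 to 11:00 only needs the 09:00 and 10:00 partitions.
needed = partitions_for_range(datetime(2015, 6, 7, 9, 30),
                              datetime(2015, 6, 7, 11, 0))
```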

@cpcloud
Member

cpcloud commented Jul 15, 2015

hm, thinking a bit more, i can see the value of having the splitting done as soon as you call extend
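
As a rough sketch of what splitting at `extend` time could look like, assuming the incoming data is a pandas DataFrame with a `DatetimeIndex`: the helper below (`split_by_hour`, a hypothetical name, not castra's actual implementation) groups rows by the hour they fall in, so each group could be written out as its own partition.

```python
import pandas as pd


def split_by_hour(df):
    """Split a DataFrame with a DatetimeIndex into one frame per hour.

    Hypothetical helper: returns a dict mapping each hour's start
    timestamp to the rows that fall within that hour.
    """
    # Round every index value down to the start of its hour.
    hour_starts = df.index.floor(pd.Timedelta(hours=1))
    # Group rows by that floored timestamp; groupby sorts the keys.
    return {start: part for start, part in df.groupby(hour_starts)}


idx = pd.to_datetime(["2015-06-07 09:15", "2015-06-07 09:45", "2015-06-07 10:05"])
frame = pd.DataFrame({"x": [1, 2, 3]}, index=idx)
parts = split_by_hour(frame)
```

Here `parts` holds two frames, one for the 09:00 hour and one for the 10:00 hour, regardless of how many rows were passed in a single call.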

@jcrist
Member

jcrist commented Aug 27, 2015

Fixed by #40. Closing.
