Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First impression feedback #2

Closed
7 of 8 tasks
mrocklin opened this issue Oct 25, 2016 · 3 comments
Closed
7 of 8 tasks

First impression feedback #2

mrocklin opened this issue Oct 25, 2016 · 3 comments

Comments

@mrocklin
Copy link
Member

mrocklin commented Oct 25, 2016

Just took this for a spin, here is some feedback. Some of this is administrative and probably obvious to you. I'm listing it here just for completeness, not to prioritize:

Administrative

  • Needs nicer __repr__ implementation
  • Should move code out of __init__.py
  • Should maybe move tests within parquet/tests ?
  • Should probably keep test data outside of the package (I think that this is already done, just wanted to check)
  • Need to remove references to personal namespace (mdurant shows up in the tests)
  • How are we going to package this long-term? Do we need a different name than the existing parquet project?

Performance

  • The from_delayed call needs to be passed metadata to avoid triggering a local computation
  • The from_delayed call would like to be passed divisions= information if we can get it from the parquet file. We might want to sniff statistics within the parquet file for sorted columns and, if we find exactly one, make it the index automatically?
@martindurant
Copy link
Member

I now pass divisions= based on the row counts in each row_group. The information on whether there are sorting columns and statistics in each piece are also available, so I can implement what you suggest.

@martindurant
Copy link
Member

Suggestions for what should should be in __repr__/__str__? I have an info, columns, dtypes, divisions (i.e., row numbers) and cats (i.e., partitions) properties currently available.

@mrocklin
Copy link
Member Author

Location, size in bytes, columns or number of columns, dtypes? I wouldn't
worry about it near term though. I think we'll have more to say on this
topic after the project sees more use.

On Wed, Oct 26, 2016 at 9:46 AM, Martin Durant notifications@github.com
wrote:

Suggestions for what should should be in repr/str? I have an
info, columns, dtypes, divisions (i.e., row numbers) and cats (i.e.,
partitions) properties currently available.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#2 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AASszNhUMuZrhXIXYI5ttzCNKYVBEKfXks5q31nCgaJpZM4Kf9JM
.

martindurant added a commit that referenced this issue May 30, 2017
TEST: added updated test file for special strings used in filters for #149 and #155
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants