Single partition fix #183

Merged 2 commits on Jul 21, 2017


martindurant
Member

Fixes #182

Martin Durant added 2 commits July 13, 2017 11:26
Prevent expected warnings from displaying in tests; small opt to
MAP record assembly.
In the case where data is partitioned and the topmost partition
has only one possible value, there is no clear way to tell, when
passed a list of paths, that the directory is meant as a partition
rather than as part of the location.
As a workaround, provide an extra parameter to ParquetFile (and merge)
to specify where the dataset root is.
@martindurant
Member Author

@yackoa, can you test against your S3 data, please? You would do:

pf = fastparquet.ParquetFile(files, root='root_dir_in_s3/my_table')

@yackoa

yackoa commented Jul 21, 2017

@martindurant would this work for multiple leaf partitions as well?
e.g.:

root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-16/part.0.parquet
root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-25/part.1.parquet
root_dir_in_s3/my_table/unique_id=700/my_date=2017-02-12/part.1.parquet

@martindurant
Member Author

Yes indeed it should - it skips the inference of the root of the data-set in favour of the value provided.
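
To illustrate why an explicit root helps, here is a minimal sketch in plain Python. It is not fastparquet's actual implementation: `infer_root` and `partitions` are hypothetical helpers, and the file list is an assumed example in which `unique_id` takes only one value. Guessing the root from the common directory of the paths swallows the single-valued partition, whereas a supplied root keeps it.

```python
import posixpath


def infer_root(paths):
    """Guess the dataset root as the deepest directory shared by all paths.

    Hypothetical helper for illustration; not fastparquet's real logic.
    """
    dirs = [posixpath.dirname(p) for p in paths]
    return posixpath.commonpath(dirs)


def partitions(path, root):
    """Return the key=value directory components between root and the file."""
    rel = posixpath.relpath(posixpath.dirname(path), root)
    parts = [] if rel == "." else rel.split("/")
    return dict(part.split("=", 1) for part in parts if "=" in part)


# Assumed file list: every file sits under the same unique_id value.
paths = [
    "root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-16/part.0.parquet",
    "root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-25/part.1.parquet",
]

# Inference treats unique_id=500 as part of the location, so the
# unique_id partition is lost.
guessed = infer_root(paths)
print(guessed)
print(partitions(paths[0], guessed))

# Supplying the root explicitly recovers both partition levels.
print(partitions(paths[0], "root_dir_in_s3/my_table"))
```

With the root given explicitly, the result no longer depends on how many distinct values the top-level partition happens to have in the file list.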

@yackoa

yackoa commented Jul 21, 2017

Which means
pf = fastparquet.ParquetFile(files, root='root_dir_in_s3/my_table')
should work whether my_table has one partition or multiple partitions, right? 💃

@martindurant
Member Author

Yes, that is the idea.

@yackoa

yackoa commented Jul 21, 2017

Awesome!
Could you please let me know the conda-forge command for the old version, in case I need to revert?

I reckon that conda install -c conda-forge fastparquet will install the latest version with the fix.
Please correct me if I am wrong.

The reason is that I have some programs working with the previous version (worked around by using fp.assign() with hard-coded partition assignments), so I would need to revert in case there are any issues. It would help me very much if you could share the conda command for the previous version as well.

Is it conda install -c conda-forge fastparquet=0.1.0?

@martindurant
Member Author

Yes, you are correct for how to revert. This fix is not in conda-forge yet.
In fact, it is not even in master - I was hoping you might try it before I merged the fix.
I just tried now, and it works for me, so I'll merge, and try to get a version on conda-forge soon (perhaps later today).

martindurant reopened this Jul 21, 2017
martindurant merged commit 65769ad into dask:master Jul 21, 2017
@yackoa

yackoa commented Jul 21, 2017

I have been using conda to test this on my work EC2 machine, where everything is already set up, so it would be easy for me to test there. That's why I asked for the conda version.

I can pip install on my local machine if needed, but I would need some time to get my dependencies in order, since my local and work machines are currently out of sync.

martindurant deleted the single_partition_fix branch September 30, 2017 18:47
2 participants