Writing Partitioned Datasets recipe for Python #47

Merged
merged 4 commits into from
Aug 24, 2021

amol- (Member) commented Aug 19, 2021

Fixes #46

jorisvandenbossche (Member) left a comment

I understand that you are using the dummy "chunk" column to be able to have a partitioned dataset without having a natural column to partition on, but my feeling is that this is a bit of a workaround because we don't give a better alternative (yet).

Personally, I would maybe use a more useful partitioning column as the main example, and then mention the "chunk" column as a way to do it if there is no column already present that can be used for partitioning (and then we can easily update this when a better API becomes available)

python/source/io.rst (review thread outdated and resolved)

amol- (Member, Author) commented Aug 20, 2021

> I understand that you are using the dummy "chunk" column to be able to have a partitioned dataset without having a natural column to partition on, but my feeling is that this is a bit of a workaround because we don't give a better alternative (yet).
>
> Personally, I would maybe use a more useful partitioning column as the main example, and then mention the "chunk" column as a way to do it if there is no column already present that can be used for partitioning (and then we can easily update this when a better API becomes available)

Switched to a real column (year of birthday) instead of using the chunk column.

amol- (Member, Author) commented Aug 23, 2021

@jorisvandenbossche can you re-review? Should have addressed your concerns

jorisvandenbossche (Member) left a comment

Looks good! One small language remark


Arrow will partition datasets in subdirectories by default, which will
result in 10 different directories named with the value of the partitioning
column and with file containing the data partition inside:
Member:

the "and with file containing the data partition inside" reads a bit strange. Maybe something like "each with a file containing the subset of the data for that partition"

Member Author:

👍 reworded

jorisvandenbossche (Member) left a comment

Ready for merge?

@jorisvandenbossche jorisvandenbossche merged commit a3e01f7 into apache:main Aug 24, 2021

Successfully merging this pull request may close these issues.

[Python] Add a recipe on how to save partitioned datasets to the Cookbook
2 participants