Writing Partitioned Datasets recipe for Python #47

amol- · 2021-08-19T09:34:11Z

Fixes #46

jorisvandenbossche

I understand that you are using the dummy "chunk" column to be able to have a partitioned dataset without having a natural column to partition on, but my feeling is that this is a bit of a workaround because we don't give a better alternative (yet).

Personally, I would maybe use a more useful partitioning column as the main example, and then mention the "chunk" column as a way to do it if there is no column already present that can be used for partitioning (and then we can easily update this when a better API becomes available)

python/source/io.rst

amol- · 2021-08-20T14:31:02Z

> I understand that you are using the dummy "chunk" column to be able to have a partitioned dataset without having a natural column to partition on, but my feeling is that this is a bit of a workaround because we don't give a better alternative (yet).
>
> Personally, I would maybe use a more useful partitioning column as the main example, and then mention the "chunk" column as a way to do it if there is no column already present that can be used for partitioning (and then we can easily update this when a better API becomes available)

Switched to a real column (year of birthday) instead of using the chunk column.

amol- · 2021-08-23T09:51:37Z

@jorisvandenbossche can you re-review? Should have addressed your concerns

jorisvandenbossche

Looks good! One small language remark

jorisvandenbossche · 2021-08-23T15:05:04Z

python/source/io.rst

+
+Arrow will partition datasets in subdirectories by default, which will
+result in 10 different directories named with the value of the partitioning
+column and with file containing the data partition inside:


The phrase "and with file containing the data partition inside" reads a bit strange. Maybe something like "each with a file containing the subset of the data for that partition"?

amol- · 2021-08-24

👍 reworded

jorisvandenbossche

Ready for merge?

Writing Partitioned Datasets recipe for Python

dfd6303

jorisvandenbossche reviewed Aug 19, 2021

amol- added 2 commits August 20, 2021 12:36

ops, rename variable

18afd59

Do not use the chunk column

dbeebe9

jorisvandenbossche reviewed Aug 23, 2021

reword

f088096

jorisvandenbossche approved these changes Aug 24, 2021

jorisvandenbossche merged commit a3e01f7 into apache:main Aug 24, 2021
