Writing Partitioned Datasets recipe for Python (#47)
amol- committed Aug 24, 2021
1 parent d15e75c commit a3e01f762b70f964650e5fb5999d2fa5893b4352

Writing Partitioned Datasets
----------------------------

When your dataset is big it usually makes sense to split it into
multiple separate files. You can do this manually, or you can use
:func:`pyarrow.dataset.write_dataset` to let Arrow do the work of
splitting the data into chunks for you.

The ``partitioning`` argument lets you tell :func:`pyarrow.dataset.write_dataset`
which columns the data should be split by.

For example, given 100 birthdays between 2000 and 2009

.. testcode::

    import numpy.random
    import pyarrow as pa

    # numpy.random.randint excludes the upper bound, so use 32 and 13
    # to cover days 1-31 and months 1-12.
    data = pa.table({"day": numpy.random.randint(1, 32, size=100),
                     "month": numpy.random.randint(1, 13, size=100),
                     "year": [2000 + x // 10 for x in range(100)]})

Then we could partition the data by the year column so that it
gets saved in 10 different files:

.. testcode::

    import pyarrow as pa
    import pyarrow.dataset as ds

    ds.write_dataset(data, "./partitioned", format="parquet",
                     partitioning=ds.partitioning(pa.schema([("year", pa.int16())])))

Arrow partitions datasets into subdirectories by default. Here this results
in 10 directories, each named after a value of the partitioning column and
each containing a file with the subset of the data for that partition:

.. testcode::

    from pyarrow import fs

    localfs = fs.LocalFileSystem()
    partitioned_dir_content = localfs.get_file_info(fs.FileSelector("./partitioned", recursive=True))
    files = sorted((f.path for f in partitioned_dir_content if f.type == fs.FileType.File))

    for file in files:
        print(file)

.. testoutput::

    ./partitioned/2000/part-0.parquet
    ./partitioned/2001/part-0.parquet
    ./partitioned/2002/part-0.parquet
    ./partitioned/2003/part-0.parquet
    ./partitioned/2004/part-0.parquet
    ./partitioned/2005/part-0.parquet
    ./partitioned/2006/part-0.parquet
    ./partitioned/2007/part-0.parquet
    ./partitioned/2008/part-0.parquet
    ./partitioned/2009/part-0.parquet
Reading Partitioned data
------------------------
