ARROW-7745: [Doc] [C++] Update Parquet documentation
Add documentation for new StreamReader and StreamWriter classes.

Also document ParquetException class and Parquet PARQUET_THROW_NOT_OK
and PARQUET_ASSIGN_OR_THROW macros.

Closes #6341 from gawain-bolton/ARROW-7745_update_parquet_documentation and squashes the following commits:

9f6e546 <gawain.bolton> Updates after review by Antoine Pitrou on 20200203
3084c79 <gawain.bolton> ARROW-7745:   Update Parquet documentation

Authored-by: gawain.bolton <gawain.bolton@cfm.fr>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
gawain.bolton authored and nealrichardson committed Feb 8, 2020
1 parent f99a81b commit d15c3b2
Showing 3 changed files with 149 additions and 10 deletions.
6 changes: 6 additions & 0 deletions docs/source/cpp/api/formats.rst
@@ -69,6 +69,9 @@ Parquet reader
.. doxygengroup:: parquet-arrow-reader-factories
:content-only:

.. doxygenclass:: parquet::StreamReader
:members:

Parquet writer
==============

@@ -83,4 +86,7 @@ Parquet writer

.. doxygenfunction:: parquet::arrow::WriteTable

.. doxygenclass:: parquet::StreamWriter
:members:

.. TODO ORC
8 changes: 8 additions & 0 deletions docs/source/cpp/api/support.rst
@@ -33,6 +33,14 @@ Error return and reporting
:project: arrow_cpp
:members:

.. doxygenclass:: parquet::ParquetException
:project: arrow_cpp
:members:

.. doxygendefine:: ARROW_RETURN_NOT_OK

.. doxygendefine:: ARROW_ASSIGN_OR_RAISE

.. doxygendefine:: PARQUET_THROW_NOT_OK

.. doxygendefine:: PARQUET_ASSIGN_OR_THROW
145 changes: 135 additions & 10 deletions docs/source/cpp/parquet.rst
@@ -18,7 +18,7 @@
.. default-domain:: cpp
.. highlight:: cpp

.. cpp:namespace:: parquet::arrow
.. cpp:namespace:: parquet

=================================
Reading and writing Parquet files
@@ -27,11 +27,29 @@ Reading and writing Parquet files
The Parquet C++ library is part of the Apache Arrow project and benefits
from tight integration with Arrow C++.

Reading
=======
The :class:`arrow::FileReader` class reads data for an entire
file or row group into an :class:`::arrow::Table`.

The Parquet :class:`FileReader` requires a :class:`::arrow::io::RandomAccessFile`
instance representing the input file.
The :func:`arrow::WriteTable` function writes an entire
:class:`::arrow::Table` to an output file.

The :class:`StreamReader` and :class:`StreamWriter` classes allow
data to be read and written using a C++ input/output streams approach,
field by field and row by row. This approach is offered for ease of
use and type safety. It is also useful when data must be streamed,
with files being read and written incrementally.

Please note that the performance of the :class:`StreamReader` and
:class:`StreamWriter` classes will not be as good as that of
:class:`arrow::FileReader` and :func:`arrow::WriteTable`, due to the
type checking and the fact that column values are processed one at a
time.

FileReader
==========

The Parquet :class:`arrow::FileReader` requires a
:class:`::arrow::io::RandomAccessFile` instance representing the input
file.

.. code-block:: cpp
@@ -58,12 +76,119 @@ instance representing the input file.
}
}
Finer-grained options are available through the :class:`FileReaderBuilder`
helper class.
Finer-grained options are available through the
:class:`arrow::FileReaderBuilder` helper class.

.. TODO write section about performance and memory efficiency
Writing
=======
WriteTable
==========

The :func:`arrow::WriteTable` function writes an entire
:class:`::arrow::Table` to an output file.

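.. code-block:: cpp

   #include "arrow/io/file.h"
   #include "parquet/arrow/writer.h"
   #include "parquet/exception.h"

   {
      // `table` is assumed to be an existing arrow::Table.
      std::shared_ptr<arrow::io::FileOutputStream> outfile;
      PARQUET_ASSIGN_OR_THROW(
         outfile,
         arrow::io::FileOutputStream::Open("test.parquet"));

      // The last argument is the maximum number of rows per row group.
      PARQUET_THROW_NOT_OK(
         parquet::arrow::WriteTable(table, arrow::default_memory_pool(), outfile, 3));
   }
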
StreamReader
============

The :class:`StreamReader` allows for Parquet files to be read using
standard C++ input operators, which ensures type safety.

Please note that types must match the schema exactly, i.e. if the
schema field is an unsigned 16-bit integer then you must supply a
``uint16_t`` variable.

Exceptions are used to signal errors. A :class:`ParquetException` is
thrown in the following circumstances:

* Attempt to read a field by supplying an incorrect type.

* Attempt to read beyond the end of a row.

* Attempt to read beyond the end of the file.

.. code-block:: cpp

   #include "arrow/io/file.h"
   #include "parquet/stream_reader.h"

   {
      std::shared_ptr<arrow::io::ReadableFile> infile;
      PARQUET_ASSIGN_OR_THROW(
         infile,
         arrow::io::ReadableFile::Open("test.parquet"));

      parquet::StreamReader stream{parquet::ParquetFileReader::Open(infile)};

      // The variable types must match the schema's column types exactly.
      std::string article;
      float price;
      uint32_t quantity;

      // Read one row per iteration until the end of the file.
      while ( !stream.eof() )
      {
         stream >> article >> price >> quantity >> parquet::EndRow;
         // ...
      }
   }

StreamWriter
============

The :class:`StreamWriter` allows for Parquet files to be written using
standard C++ output operators. This type-safe approach also ensures
that rows are written without omitting fields and allows for new row
groups to be created automatically (after a certain volume of data) or
explicitly by using the :type:`EndRowGroup` stream modifier.

Exceptions are used to signal errors. A :class:`ParquetException` is
thrown in the following circumstances:

* Attempt to write a field using an incorrect type.

* Attempt to write too many fields in a row.

* Attempt to skip a required field.

TODO: write this

.. code-block:: cpp

   #include "arrow/io/file.h"
   #include "parquet/stream_writer.h"

   {
      std::shared_ptr<arrow::io::FileOutputStream> outfile;
      PARQUET_ASSIGN_OR_THROW(
         outfile,
         arrow::io::FileOutputStream::Open("test.parquet"));

      parquet::WriterProperties::Builder builder;
      std::shared_ptr<parquet::schema::GroupNode> schema;

      // Set up builder with required compression type etc.
      // Define schema.
      // ...

      parquet::StreamWriter os{
         parquet::ParquetFileWriter::Open(outfile, schema, builder.build())};

      // Loop over some data structure which provides the required
      // fields to be written and write each row.
      for (const auto& a : getArticles())
      {
         os << a.name() << a.price() << a.quantity() << parquet::EndRow;
      }
   }
