
ARROW-13231: [Doc] Add ORC documentation #11779

Closed
wants to merge 20 commits into from

Conversation

iajoiner

No description provided.

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@iajoiner iajoiner changed the title from "[Doc]ARROW-13231: Add ORC documentation" to "ARROW-13231: [Doc] Add ORC documentation" on Nov 26, 2021

@iajoiner
Author

@jorisvandenbossche Here we go!

@iajoiner
Author

iajoiner commented Feb 6, 2022

@jorisvandenbossche I think I already have the non-dataset-related Python user guide added. I will work on the C++ docs as well as the dataset docs. However, since I didn't write the code, I'm not that familiar with the exact state of progress, so please correct me if I'm wrong. Thanks a lot!

@iajoiner iajoiner marked this pull request as ready for review February 18, 2022 09:42
@iajoiner
Author

There is of course more that can be added to the ORC user guides, but we need to start somewhere.

@iajoiner
Author

@jorisvandenbossche @pitrou This is the first time I have ever written user guides. Could you please check whether anything is seriously wrong? I will iron out the details myself. Thanks a lot!

Member

@pitrou pitrou left a comment

Thank you very much @iajoiner . This is a quick first pass. I'll let @jorisvandenbossche give further advice for the Python docs.

Supported ORC features
==========================

The ORC format has many features, and we support a subset of them.
Member

Also mention which datatypes are supported?

Author

Yup.

-----------

+-------------------+---------+
| Compression codec | Notes   |
Member

No need for a notes column if there aren't any notes.

Author

@iajoiner iajoiner Mar 17, 2022

You are right. Removed!


The `Apache ORC <http://orc.apache.org/>`_ project provides a
standardized open-source columnar storage format for use in data analysis
systems. It was created originally for use in `Apache Hadoop
Member

The Python and C++ doc pages should probably give the same description of the ORC format, it seems a bit gratuitous to have two different ones.

Author

Thanks! Fixed.

<http://spark.apache.org>`_ adopting it as a shared standard for high
performance data IO.

Apache Arrow is an ideal in-memory transport layer for data that is being read
Member

Rather a representation layer if you're using ORC for transport or storage...

If you installed ``pyarrow`` with pip or conda, it should be built with ORC
support bundled:

.. ipython:: python
Member

We would like to cut down on ipython code blocks actually, since they make building the docs slower and more fragile. AFAIU we would like to use doctest instead:
https://www.sphinx-doc.org/en/master/usage/extensions/doctest.html

@jorisvandenbossche or @amol- may want to elaborate on this.

Author

Done!

There are some additional data type handling-specific options
described below.

Omitting the DataFrame index
Member

I get the impression you're copying and adapting large chunks of the Parquet docs. I'm not sure it makes sense to do this (also, I don't know why the Parquet docs talk about preserve_index specifically).

Member

Yes, I think in this case we can omit this section about the pandas index (or maybe refer to the pandas.rst page on those details).

(I suppose the reason that the parquet page includes those details is that, historically, many people were using the parquet methods to get pandas DataFrames (the ParquetFile also has functionality to directly get a DataFrame))

files.

Partitioned Datasets (Multiple Files)
------------------------------------------------
Member

This section doesn't showcase any example or API, is it useful to have it?

Author

@iajoiner iajoiner Mar 17, 2022

I agree that it is not that useful now, so I removed it. Once the ORC dataset feature is complete, we can add it back.

s3 = fs.S3FileSystem(region="us-east-2")
table = po.read_table("bucket/object/key/prefix", filesystem=s3)

Currently, :class:`HDFS <pyarrow.fs.HadoopFileSystem>` and
Member

I think we could stop this section here and just redirect the user to the filesystem docs for details.

Author

Done!

@iajoiner
Author

iajoiner commented Mar 18, 2022

Note that the Docs error is caused by the existence of pyarrow.dataset.ORCFileFormat being conditional in the 7.0.0 release. It is not shown in the actual dataset API docs. Maybe we need to include ORC when building and releasing our docs?

@pitrou @jorisvandenbossche Could you please review again? Thanks a lot!

@pitrou
Member

pitrou commented Mar 22, 2022

@iajoiner Can you look at the doc building warnings and errors?

@iajoiner
Author

@pitrou Yes. It looks like the problem is that we are not building either Arrow Dataset or ORC support (likely the latter) in the GitHub Action, hence the failure. Should the fix be to change the test so that ORC support is actually included and this can run correctly?

@pitrou
Member

pitrou commented Mar 22, 2022

Well, it is being built correctly, so the problem is probably something else:
https://github.com/apache/arrow/runs/5582660707?check_suite_focus=true#step:6:3525

Note you can easily reproduce locally using Archery and Docker:
https://arrow.apache.org/docs/developers/continuous_integration/docker.html

@jorisvandenbossche
Member

@iajoiner the error about pyarrow.dataset.OrcFileFormat is actually a pre-existing warning and not related to this PR, I think (it was already added to /docs/source/python/api/dataset.rst a long time ago). It's still something we should try to fix, of course (though that can be done separately from this PR), but the error we see here should have some other cause.

@jorisvandenbossche
Member

Running the docs locally (and removing the -j8 from the Makefile to run it single threaded, so you get more informative logging), it seems that it now segfaults while reading the orc.rst page.
This might be indicating that one of the code examples is segfaulting?

.. ipython:: python

   with po.ORCWriter('example2.orc') as writer:
       writer.write_table(table)
Member

It is this example that segfaults

Author

@iajoiner iajoiner Mar 24, 2022

Ah, it is due to a typo. If you call an orc.ORCWriter method that doesn't actually exist, it segfaults. I wonder whether this behavior itself is concerning (that's for a different PR, of course).

Member

If you call a random orc.ORCWriter method that doesn't actually exist it segfaults out. I wonder whether this behavior itself is concerning (that's for a different PR of course).

Yes, that sounds like something we should also fix then! Can you open a JIRA about it?

Author

Sure!


Member

@jorisvandenbossche jorisvandenbossche left a comment

In addition to the build error, also added a few other comments. And thanks a lot for working on this!


.. ipython:: python

import pyarrow.orc as po
Member

I know this mimics the import pyarrow.parquet as pq we already have, but I personally would want to generalize such cryptic abbreviations. I would maybe go for from pyarrow import orc and then orc.read_table etc ?

Author

I agree. Fixed!

df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]},
                  index=list('abc'))
Member

I would leave out the custom index here, to have the example focus on reading and writing ORC, and not on the pandas<->arrow conversion details.
Maybe we could even start with creating directly a pyarrow.Table from this dict (pa.table({..}) instead of pd.DataFrame({{..}))?

Author

Yup.


We need not use a string to specify the origin of the file. It can be any of:

* A file path as a string
Member

Does a pathlib.Path object also work? (or in general a path-like object)

Author

I think so. io.BytesIO as well.

Author

Confirmed.

Member

Can you add the pathlib.Path object to the above bullet point?

Author

Yup.

@iajoiner
Author

@pitrou @jorisvandenbossche Now the tests pass! Do I need to start replacing ipython blocks with doctests in files as well?

@jorisvandenbossche
Member

Do I need to start replacing ipython blocks with doctests in files as well?

Either way is fine with me. Given we already use IPython blocks in many other places, I wouldn't want to block this PR on that. But if you want to convert it, that's also fine. (For example, I don't think we have discussed yet what format we want to use; I would personally use the doctest format (with >>>, like in docstrings), but the sphinx.ext.doctest extension also provides other directives.)
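For illustration, with the sphinx.ext.doctest extension a tested snippet could look like this (a sketch; the plain >>> style inside a code-block also works with the standard doctest tooling):

```rst
.. doctest::

   >>> a = 1 + 1
   >>> a
   2
```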

Member

@jorisvandenbossche jorisvandenbossche left a comment

One more thing: you will also need to add orc to the python and cpp toctrees (in source/python/index.rst and source/cpp/index.rst)

In general, a Python file object will have the worst read performance, while a
string file path or an instance of :class:`~.NativeFile` (especially memory
maps) will perform the best.

Member

I would maybe add here a Note or See also box about the fact that you can also read partitioned datasets with multiple ORC files through the pyarrow.dataset interface, and refer to that documentation.

Author

Thanks! Added!


orc_file.schema
orc_file.nrows

See the :class:`~pyarrow.orc.ORCFile()` docstring for more details.
Member

Suggested change:
- See the :class:`~pyarrow.orc.ORCFile()` docstring for more details.
+ See the :class:`~pyarrow.orc.ORCFile` docstring for more details.

table = orc.read_table("bucket/object/key/prefix", filesystem=s3)

.. seealso::
:ref:`Documentation for filesystems <filesystems>`.
Member

Suggested change:
- :ref:`Documentation for filesystems <filesystems>`.
+ :ref:`Documentation for filesystems <filesystem>`.

@iajoiner
Author

@pitrou @jorisvandenbossche The ipython examples have been converted to doctests.

@jorisvandenbossche
Member

Note that the Docs error is in the 7.0.0 release caused by the existence of pyarrow.dataset.ORCFileFormat being conditional. It is not being shown in the actual dataset API docs. Maybe we need to include ORC when building and release our docs?

This was actually caused by a typo. Fixing this in #12714.

Member

@jorisvandenbossche jorisvandenbossche left a comment

Just a small remaining comment on using ... in continuation lines in the doctest format, looks good for the rest!

docs/source/python/orc.rst: two review threads marked outdated and resolved.
chloeandmargaret and others added 2 commits March 25, 2022 07:38
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@iajoiner
Author

iajoiner commented Mar 25, 2022

@jorisvandenbossche @pitrou Fixed! :)
Also, thanks a lot for #12714! Can we merge?

@iajoiner
Author

@pitrou @jorisvandenbossche Just a reminder that this one is still open, with all tests passing and comments addressed. :)

@jorisvandenbossche
Member

Thanks @iajoiner !

@ursabot

ursabot commented Apr 4, 2022

Benchmark runs are scheduled for baseline = 9ac8301 and contender = 4c3edd2. 4c3edd2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.17% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.38% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/436| 4c3edd27 ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/421| 4c3edd27 test-mac-arm>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/421| 4c3edd27 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/431| 4c3edd27 ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/435| 9ac83015 ec2-t3-xlarge-us-east-2>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/420| 9ac83015 test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/420| 9ac83015 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/430| 9ac83015 ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

jcralmeida pushed a commit to rafael-telles/arrow that referenced this pull request Apr 19, 2022
Closes apache#11779 from iajoiner/ARROW-13231-docs

Lead-authored-by: Ian Alexander Joiner <iajoiner809@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>