
ARROW-13231: [Doc] Add ORC documentation #11779

Closed
wants to merge 20 commits into from

Conversation

iajoiner

No description provided.

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@iajoiner iajoiner changed the title from "[Doc]ARROW-13231: Add ORC documentation" to "ARROW-13231: [Doc] Add ORC documentation" on Nov 26, 2021

@iajoiner
Author

@jorisvandenbossche Here we go!

@iajoiner
Author

iajoiner commented Feb 6, 2022

@jorisvandenbossche I think I already have the non-dataset-related Python user guide added. I will work on the C++ docs as well as the dataset docs. However, since I didn't write the code, I'm not that familiar with the exact state of progress, so please correct me if I'm wrong. Thanks a lot!

@iajoiner iajoiner marked this pull request as ready for review February 18, 2022 09:42
@iajoiner
Author

There is of course more that can be added to the ORC user guides, but we need to start somewhere.

@iajoiner
Author

@jorisvandenbossche @pitrou This is the first time I have ever written user guides. Could you please check whether anything is seriously wrong? I will iron out the details myself. Thanks a lot!

Member

@pitrou pitrou left a comment

Thank you very much @iajoiner . This is a quick first pass. I'll let @jorisvandenbossche give further advice for the Python docs.

Supported ORC features
==========================

The ORC format has many features, and we support a subset of them.
Member

Also mention which datatypes are supported?

Author

Yup.

-----------

+-------------------+---------+
| Compression codec | Notes   |
Member

No need for a notes column if there aren't any notes.

Author

@iajoiner iajoiner Mar 17, 2022

You are right. Removed!


The `Apache ORC <http://orc.apache.org/>`_ project provides a
standardized open-source columnar storage format for use in data analysis
systems. It was created originally for use in `Apache Hadoop
Member

The Python and C++ doc pages should probably give the same description of the ORC format, it seems a bit gratuitous to have two different ones.

Author

Thanks! Fixed.

<http://spark.apache.org>`_ adopting it as a shared standard for high
performance data IO.

Apache Arrow is an ideal in-memory transport layer for data that is being read
Member

Rather a representation layer if you're using ORC for transport or storage...

If you installed ``pyarrow`` with pip or conda, it should be built with ORC
support bundled:

.. ipython:: python
Member

We would like to cut down on ipython code blocks actually, since they make building the docs slower and more fragile. AFAIU we would like to use doctest instead:
https://www.sphinx-doc.org/en/master/usage/extensions/doctest.html

@jorisvandenbossche or @amol- may want to elaborate on this.

Author

Done!

There are some additional data type handling-specific options
described below.

Omitting the DataFrame index
Member

I get the impression you're copying and adapting large chunks of the Parquet docs. I'm not sure it makes sense to do this (also, I don't know why the Parquet docs talk about preserve_index specifically).

Member

Yes, I think in this case we can omit this section about the pandas index (or maybe refer to the pandas.rst page on those details).

(I suppose the reason that the parquet page includes those details is that, historically, many people were using the parquet methods to get pandas DataFrames (the ParquetFile also has functionality to directly get a DataFrame))

files.

Partitioned Datasets (Multiple Files)
------------------------------------------------
Member

This section doesn't showcase any example or API, is it useful to have it?

Author

@iajoiner iajoiner Mar 17, 2022

I agree that it is not that useful now, so I removed it. Once the ORC dataset feature is complete, we can add it back.

s3 = fs.S3FileSystem(region="us-east-2")
table = po.read_table("bucket/object/key/prefix", filesystem=s3)

Currently, :class:`HDFS <pyarrow.fs.HadoopFileSystem>` and
Member

I think we could stop this section here and just redirect the user to the filesystem docs for details.

Author

Done!

@iajoiner
Author

iajoiner commented Mar 18, 2022

Note that the Docs error is caused by the existence of pyarrow.dataset.ORCFileFormat being conditional in the 7.0.0 release. It is not shown in the actual dataset API docs. Maybe we need to include ORC when building and releasing our docs?

@pitrou @jorisvandenbossche Could you please review again? Thanks a lot!

@pitrou
Member

pitrou commented Mar 22, 2022

@iajoiner Can you look at the doc building warnings and errors?

@iajoiner
Author

@pitrou Yes. It looks like the problem is that we are not building either Arrow Dataset or ORC support (likely the latter) in the GitHub Action, hence the failure. Should the fix be to change the test so that ORC support is actually included and this can run correctly?

@pitrou
Member

pitrou commented Mar 22, 2022

Well, it is being built correctly, so the problem is probably something else:
https://github.com/apache/arrow/runs/5582660707?check_suite_focus=true#step:6:3525

Note you can easily reproduce locally using Archery and Docker:
https://arrow.apache.org/docs/developers/continuous_integration/docker.html

@jorisvandenbossche
Member

@iajoiner the error about pyarrow.dataset.OrcFileFormat is actually a pre-existing warning and not related to this PR, I think (it was already added to /docs/source/python/api/dataset.rst a long time ago). It's still something we should try to fix, of course (though that can be done separately from this PR), but the error we see here should have some other cause.

@jorisvandenbossche
Member

Running the docs locally (and removing the -j8 from the Makefile to run it single threaded, so you get more informative logging), it seems that it now segfaults while reading the orc.rst page.
This might be indicating that one of the code examples is segfaulting?

.. ipython:: python

   with po.ORCWriter('example2.orc') as writer:
       writer.write_table(table)
Member

It is this example that segfaults

Author

@iajoiner iajoiner Mar 24, 2022

Ah, it is due to a typo. If you call an orc.ORCWriter method that doesn't actually exist, it segfaults. I wonder whether this behavior itself is concerning (that's for a different PR, of course).

Member

If you call a random orc.ORCWriter method that doesn't actually exist it segfaults out. I wonder whether this behavior itself is concerning (that's for a different PR of course).

Yes, that sounds like something we should also fix then! Can you open a JIRA about it?

Author

Sure!


Member

@jorisvandenbossche jorisvandenbossche left a comment

In addition to the build error, also added a few other comments. And thanks a lot for working on this!


.. ipython:: python

import pyarrow.orc as po
Member

I know this mimics the import pyarrow.parquet as pq we already have, but I personally would want to generalize such cryptic abbreviations. I would maybe go for from pyarrow import orc and then orc.read_table etc ?

Author

I agree. Fixed!

df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]},
                  index=list('abc'))
Member

I would leave out the custom index here, to have the example focus on reading and writing ORC, and not on the pandas<->arrow conversion details.
Maybe we could even start with creating directly a pyarrow.Table from this dict (pa.table({..}) instead of pd.DataFrame({{..}))?

Author

Yup.


We need not use a string to specify the origin of the file. It can be any of:

* A file path as a string
Member

Does a pathlib.Path object also work? (or in general a path-like object)

Author

I think so. io.BytesIO as well.

Author

Confirmed.

Member

Can you add the pathlib.Path object to the above bullet point?

Author

Yup.

@iajoiner
Author

@pitrou @jorisvandenbossche Now the tests pass! Do I need to start replacing ipython blocks with doctests in files as well?

@jorisvandenbossche
Member

Do I need to start replacing ipython blocks with doctests in files as well?

Either way is fine with me. Given we already use IPython blocks in many other places, I wouldn't want to block this PR on that. But if you want to convert it, that's also fine. (For example, I don't think we have discussed yet what format we want to use; I would personally use the doctest format (with >>>, like in docstrings), but the sphinx.ext.doctest extension also provides other directives.)
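For illustration, with the sphinx.ext.doctest extension a tested snippet could look like this (a sketch; the plain >>> style inside a code-block also works with the standard doctest tooling):

```rst
.. doctest::

   >>> a = 1 + 1
   >>> a
   2
```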

Member

@jorisvandenbossche jorisvandenbossche left a comment

One more thing: you will also need to add orc to the python and cpp toctrees (in source/python/index.rst and source/cpp/index.rst)

In general, a Python file object will have the worst read performance, while a
string file path or an instance of :class:`~.NativeFile` (especially memory
maps) will perform the best.

Member

I would maybe add here a Note or See also box about the fact that you can also read partitioned datasets with multiple ORC files through the pyarrow.dataset interface, and refer to that documentation.

Author

Thanks! Added!


orc_file.schema
orc_file.nrows

See the :class:`~pyarrow.orc.ORCFile()` docstring for more details.
Member

Suggested change:
- See the :class:`~pyarrow.orc.ORCFile()` docstring for more details.
+ See the :class:`~pyarrow.orc.ORCFile` docstring for more details.

table = orc.read_table("bucket/object/key/prefix", filesystem=s3)

.. seealso::
:ref:`Documentation for filesystems <filesystems>`.
Member

Suggested change:
- :ref:`Documentation for filesystems <filesystems>`.
+ :ref:`Documentation for filesystems <filesystem>`.

@iajoiner
Author

@pitrou @jorisvandenbossche The ipython examples have been converted to doctests.

@jorisvandenbossche
Member

Note that the Docs error is in the 7.0.0 release caused by the existence of pyarrow.dataset.ORCFileFormat being conditional. It is not being shown in the actual dataset API docs. Maybe we need to include ORC when building and release our docs?

This was actually caused by a typo. Fixing this in #12714.

Member

@jorisvandenbossche jorisvandenbossche left a comment

Just a small remaining comment on using ... in continuation lines in the doctest format, looks good for the rest!

docs/source/python/orc.rst: two review threads marked outdated and resolved.
chloeandmargaret and others added 2 commits March 25, 2022 07:38
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@iajoiner
Author

iajoiner commented Mar 25, 2022

@jorisvandenbossche @pitrou Fixed! :)
Also, thanks a lot for #12714! Can we merge?

@iajoiner
Author

@pitrou @jorisvandenbossche Just a reminder that this one is still open, with all tests passing and comments addressed. :)

@jorisvandenbossche
Member

Thanks @iajoiner !

@ursabot

ursabot commented Apr 4, 2022

Benchmark runs are scheduled for baseline = 9ac8301 and contender = 4c3edd2. 4c3edd2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.17% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.38% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/436| 4c3edd27 ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/421| 4c3edd27 test-mac-arm>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/421| 4c3edd27 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/431| 4c3edd27 ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/435| 9ac83015 ec2-t3-xlarge-us-east-2>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/420| 9ac83015 test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/420| 9ac83015 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/430| 9ac83015 ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

jcralmeida pushed a commit to rafael-telles/arrow that referenced this pull request Apr 19, 2022
Closes apache#11779 from iajoiner/ARROW-13231-docs

Lead-authored-by: Ian Alexander Joiner <iajoiner809@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>