ARROW-13231: [Doc] Add ORC documentation #11779
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
@jorisvandenbossche Here we go!
@jorisvandenbossche I think I already have the non-dataset-related Python user guide added. I will work on C++ as well as dataset docs. However, since I didn't write the code I'm not that familiar with the exact amount of progress, so please correct me if I'm wrong. Really thanks!
There is of course more that can be added to the ORC user guides, but we need to start from somewhere.
@jorisvandenbossche @pitrou This is the first time I have ever written user guides. Could you please check whether there is anything that is seriously wrong? I will iron out the details myself. Really thanks!
Thank you very much @iajoiner . This is a quick first pass. I'll let @jorisvandenbossche give further advice for the Python docs.
Supported ORC features
======================

The ORC format has many features, and we support a subset of them.
Also mention which datatypes are supported?
Yup.
docs/source/cpp/orc.rst
Outdated
-----------

+-------------------+---------+
| Compression codec | Notes   |
No need for a notes column if there aren't any notes.
You are right. Removed!
The `Apache ORC <http://orc.apache.org/>`_ project provides a
standardized open-source columnar storage format for use in data analysis
systems. It was created originally for use in `Apache Hadoop
The Python and C++ doc pages should probably give the same description of the ORC format, it seems a bit gratuitous to have two different ones.
Thanks! Fixed.
docs/source/python/orc.rst
Outdated
<http://spark.apache.org>`_ adopting it as a shared standard for high
performance data IO.

Apache Arrow is an ideal in-memory transport layer for data that is being read
Rather a representation layer if you're using ORC for transport or storage...
docs/source/python/orc.rst
Outdated
If you installed ``pyarrow`` with pip or conda, it should be built with ORC
support bundled:

.. ipython:: python
We would like to cut down on ``ipython`` code blocks actually, since they make building the docs slower and more fragile. AFAIU we would like to use doctest instead:
https://www.sphinx-doc.org/en/master/usage/extensions/doctest.html
@jorisvandenbossche or @amol- may want to elaborate on this.
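For concreteness, a doctest-style block of the kind the linked Sphinx extension supports might look like this sketch (the ``doctest`` directive name is from the Sphinx docs; the filename is hypothetical):

```rst
.. doctest::

    >>> from pyarrow import orc              # doctest: +SKIP
    >>> table = orc.read_table('example.orc')  # doctest: +SKIP
```

Unlike ``.. ipython:: python`` blocks, these are plain text at build time and can be checked separately with the doctest builder.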
Done!
docs/source/python/orc.rst
Outdated
There are some additional data type handling-specific options
described below.

Omitting the DataFrame index
I get the impression you're copying and adapting large chunks of the Parquet docs. I'm not sure it makes sense to do this (also, I don't know why the Parquet docs talk about ``preserve_index`` specifically).
Yes, I think in this case we can omit this section about the pandas index (or maybe refer to the pandas.rst page for those details).
(I suppose the reason that the parquet page includes those details is that, historically, many people were using the parquet methods to get pandas DataFrames (the ParquetFile also has functionality to directly get a DataFrame).)
docs/source/python/orc.rst
Outdated
files.

Partitioned Datasets (Multiple Files)
-------------------------------------
This section doesn't showcase any example or API, is it useful to have it?
I agree that it is not that useful now. I removed it. When we complete the ORC dataset feature we will add it again.
docs/source/python/orc.rst
Outdated
s3 = fs.S3FileSystem(region="us-east-2")
table = po.read_table("bucket/object/key/prefix", filesystem=s3)

Currently, :class:`HDFS <pyarrow.fs.HadoopFileSystem>` and
I think we could stop this section here and just redirect the user to the filesystem docs for details.
Done!
Note that the Docs error is in the 7.0.0 release caused by the existence of
@pitrou @jorisvandenbossche Could you please review again? Really thanks!
@iajoiner Can you look at the doc building warnings and errors?
@pitrou Yes. Looks like the problem is that we are not building either Arrow Dataset or ORC support (likely the latter) in the GitHub Action, hence it is upset. Shall the fix be changing the test so that ORC support is actually included, so that this can run correctly?
Well, it is being built correctly, so the problem is probably something else. Note you can easily reproduce locally using Archery and Docker:
@iajoiner the error about
Running the docs locally (and removing the
docs/source/python/orc.rst
Outdated
.. ipython:: python

    with po.ORCWriter('example2.orc') as writer:
        writer.write_table(table)
It is this example that segfaults
Ah, it is due to a typo. If you call a random ``orc.ORCWriter`` method that doesn't actually exist it segfaults out. I wonder whether this behavior itself is concerning (that's for a different PR of course).
If you call a random orc.ORCWriter method that doesn't actually exist it segfaults out. I wonder whether this behavior itself is concerning (that's for a different PR of course).
Yes, that sounds like something we should also fix then! Can you open a JIRA about it?
Sure!
In addition to the build error, also added a few other comments. And thanks a lot for working on this!
docs/source/python/orc.rst
Outdated
.. ipython:: python

    import pyarrow.orc as po
I know this mimics the ``import pyarrow.parquet as pq`` we already have, but I personally wouldn't want to generalize such cryptic abbreviations. I would maybe go for ``from pyarrow import orc`` and then ``orc.read_table`` etc.?
I agree. Fixed!
docs/source/python/orc.rst
Outdated
df = pd.DataFrame({'one': [-1, np.nan, 2.5], | ||
'two': ['foo', 'bar', 'baz'], | ||
'three': [True, False, True]}, | ||
index=list('abc')) |
I would leave out the custom index here, to have the example focus on reading and writing ORC, and not on the pandas<->arrow conversion details.
Maybe we could even start by creating a pyarrow.Table directly from this dict (``pa.table({..})`` instead of ``pd.DataFrame({..})``)?
Yup.
We need not use a string to specify the origin of the file. It can be any of:

* A file path as a string
Does a pathlib.Path object also work? (or in general a path-like object)
I think so. ``io.BytesIO`` as well.
Confirmed.
Can you add the pathlib.Path object to the above bullet point?
Yup.
@pitrou @jorisvandenbossche Now the tests pass! Do I need to start replacing ipython blocks with doctests in files as well?
Either way is fine for me. Given we already use IPython blocks in many other places, I wouldn't want to block this PR on that. But if you want to convert it, that's also fine (but, e.g., I don't think we already discussed what format we want to use. I would personally use the doctest format (with
One more thing: you will also need to add ``orc`` to the python and cpp toctrees (in ``source/python/index.rst`` and ``source/cpp/index.rst``).
In general, a Python file object will have the worst read performance, while a
string file path or an instance of :class:`~.NativeFile` (especially memory
maps) will perform the best.
I would maybe add here a Note or See also box about the fact that you can also read partitioned datasets with multiple ORC files through the pyarrow.dataset interface, and refer to that documentation.
Thanks! Added!
docs/source/python/orc.rst
Outdated
orc_file.schema
orc_file.nrows

See the :class:`~pyarrow.orc.ORCFile()` docstring for more details.
See the :class:`~pyarrow.orc.ORCFile()` docstring for more details.
See the :class:`~pyarrow.orc.ORCFile` docstring for more details.
docs/source/python/orc.rst
Outdated
table = orc.read_table("bucket/object/key/prefix", filesystem=s3)

.. seealso::
   :ref:`Documentation for filesystems <filesystems>`.
:ref:`Documentation for filesystems <filesystems>`.
:ref:`Documentation for filesystems <filesystem>`.
@pitrou @jorisvandenbossche The ipython examples have been converted to doctests.
This was actually caused by a typo. Fixing this in #12714.
Just a small remaining comment on using ``...`` in continuation lines in the doctest format; looks good for the rest!
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche @pitrou Fixed! :)
@pitrou @jorisvandenbossche Just a reminder that this one is still open with all the tests passed and comments addressed. :)
Thanks @iajoiner !
Benchmark runs are scheduled for baseline = 9ac8301 and contender = 4c3edd2. 4c3edd2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Closes apache#11779 from iajoiner/ARROW-13231-docs Lead-authored-by: Ian Alexander Joiner <iajoiner809@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>