ARROW-7906: [C++] [Python] Add ORC write support #8648

iajoiner · 2020-11-12T03:40:44Z

This pull request tracks the progress on adding ORC write support. The functionality is not complete yet. However, for most types, populating an ORC ColumnVectorBatch with data from an Arrow Array already works.

Arrow data types (arrow::Type::type) I do support:
Boolean: BOOL
Numerical: INT8, INT16, INT32, INT64, FLOAT, DOUBLE
Time-related: DATE32
Binary: BINARY, STRING, LARGE_BINARY, LARGE_STRING, FIXED_SIZE_BINARY
Nested: LIST, LARGE_LIST, FIXED_SIZE_LIST, STRUCT, MAP, DENSE_UNION, SPARSE_UNION

Arrow data types I plan to support:
Numerical: DECIMAL128
Time-related: DATE64, TIMESTAMP
Dictionary: DICTIONARY

Arrow data types I currently do NOT plan to support:
Numerical: UINT8, UINT16, UINT32, UINT64, HALF_FLOAT, DECIMAL256 (There are no corresponding types in ORC. Except for DECIMAL256, these can always be cast into larger types, but I think users should do that explicitly.)
Time-related: TIME32, TIME64, INTERVAL_MONTHS, INTERVAL_DAY_TIME, DURATION (There are no corresponding types in ORC, and it is impossible to cast them into ORC types without losing time-related information.)
Extension: EXTENSION

github-actions · 2020-11-12T03:46:58Z

https://issues.apache.org/jira/browse/ARROW-7906

xhochy

The test file is really huge. I made some general suggestions that should save quite a few lines.

cpp/src/arrow/adapters/orc/adapter.cc

cpp/src/arrow/adapters/orc/adapter.h

cpp/src/arrow/adapters/orc/adapter_test.cc

iajoiner · 2020-12-27T16:17:46Z

I have revamped the tests completely and refactored the code to eliminate dependency issues and get all checks to pass. Right now I'm integrating my old nested type tests into adapter_test.cc. Then I will make this PR ready for review again.

Note that support for dense union and sparse union has been deferred to a follow-up PR, since the ORC reader cannot read unions yet, which makes testing hard. In that PR I will probably add the following features:

Read and write support for union types.
Replicate orc::WriterOptions in adapters::orc::ORCWriterOptions.
Replicate orc::ReaderOptions in adapters::orc::ORCReaderOptions and add them to the ORC reader.

iajoiner · 2021-01-03T11:06:57Z

Now it is ready for review! I haven't found any ORC writer bugs in the code base itself in the past 8 days, so I think it is in pretty good shape.

xhochy · 2021-01-07T08:32:34Z

I can have a look on Monday / Tuesday.

codecov-io · 2021-01-10T16:46:18Z

Codecov Report

Merging #8648 (552bf93) into master (1f32ca1) will decrease coverage by 0.00%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #8648      +/-   ##
==========================================
- Coverage   81.80%   81.80%   -0.01%     
==========================================
  Files         214      214              
  Lines       51383    51383              
==========================================
- Hits        42034    42033       -1     
- Misses       9349     9350       +1

Impacted Files	Coverage Δ
rust/arrow/src/buffer.rs	`97.72% <ø> (ø)`
rust/arrow/src/compute/kernels/aggregate.rs	`75.00% <ø> (ø)`
rust/arrow/src/compute/kernels/arithmetic.rs	`89.83% <ø> (ø)`
rust/arrow/src/compute/kernels/comparison.rs	`95.91% <ø> (ø)`
rust/arrow/src/datatypes.rs	`78.75% <ø> (ø)`
rust/arrow/src/util/bit_util.rs	`100.00% <ø> (ø)`
rust/parquet/src/encodings/encoding.rs	`95.24% <0.00%> (-0.20%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

iajoiner · 2021-01-10T16:52:37Z

I have finished the Python binding as well. Note that I have made no changes to the Rust code.

iajoiner · 2021-01-12T23:35:02Z

@xhochy Please review it when you can. Thanks!

nevi-me · 2021-01-13T14:50:30Z

Hi @mathyingzhou I see that you didn't make changes to the Rust code. Please rebase with git rebase origin/master (if origin is apache/arrow) so that you can remove the Rust changes from the PR. Then you'll need to force-push onto your branch (git push --force). Thanks

cpp/src/arrow/adapters/orc/adapter.cc

emkornfield · 2021-04-14T05:53:38Z

cpp/src/arrow/adapters/orc/adapter.cc

+    }
+    return Status::OK();
+  }
+  Status Close() {


nit: whitespace

Sure. Fixed.

emkornfield · 2021-04-14T05:54:58Z

cpp/src/arrow/adapters/orc/adapter_test.cc

+}
+
+template <typename T, typename U>
+void randintpartition(int64_t n, T sum, std::vector<U>* out) {


some docs here would be good.

Sure. Docs added.

emkornfield · 2021-04-14T05:55:11Z

cpp/src/arrow/adapters/orc/adapter_test.cc

+}
+
+template <typename T, typename U>
+void randintpartition(int64_t n, T sum, std::vector<U>* out) {


nit: RandIntPartition?

Ah, yes. Eventually I plan to move this code to where we generate random arrays, since it helps generate random ChunkedArrays.
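
The body of the helper is not shown in this thread. As a rough illustration only, here is a minimal Python sketch of one way to split a total into `n` random non-negative integer parts, for example the chunk lengths of a ChunkedArray. The name `rand_int_partition` and the sorted cut-point approach are assumptions made for this sketch, not the PR's implementation.

```python
import random


def rand_int_partition(n, total, rng=None):
    """Split `total` into `n` non-negative integers that sum to `total`.

    Hypothetical helper for illustration; not the PR's randintpartition.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if total < 0:
        raise ValueError("total must be non-negative")
    rng = rng or random.Random()
    # Draw n - 1 cut points in [0, total], sort them, and use the gaps
    # between consecutive cut points as the parts.
    cuts = sorted(rng.randint(0, total) for _ in range(n - 1))
    bounds = [0] + cuts + [total]
    return [bounds[i + 1] - bounds[i] for i in range(n)]


# Example: four chunk lengths for an array of 100 elements.
parts = rand_int_partition(4, 100, random.Random(0))
```

Parts may be zero, which is convenient for tests because it also exercises empty chunks.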

emkornfield · 2021-04-14T05:55:48Z

cpp/src/arrow/adapters/orc/adapter_test.cc

+      date64(), rand.Int64(size, kMilliMin, kMilliMax, null_probability));
+}
+
+Result<std::shared_ptr<Array>> GenerateRandomTimestampArray(int64_t size,


@lidavidm did you recently check in code that could replace this?

@emkornfield @lidavidm Please correct me if I'm wrong. I rely on the fact that the generated DATE64 and TIMESTAMP values (with a unit other than NANO) can be cast to TIMESTAMP with unit NANO without overflowing int64_t, because ORC essentially only supports nanoseconds (see TimestampVectorBatch in https://orc.apache.org/docs/core-cpp.html). So I don't think arrow::random::RandomArrayGenerator.ArrayOf can be used.

You could do something like GenerateArray(field("", timestamp(TimeUnit::SECOND), key_value_metadata({{"min", kSecondMin}, {"max", kSecondMax}}))) but that isn't really much more concise than what you have here at this point, right, since your constraint is that they all have to be castable to nanoseconds without overflowing.
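
The constraint discussed here can be made concrete with a little arithmetic. The values of `kSecondMin` and `kSecondMax` are not shown in this thread, so the following Python sketch derives bounds from first principles; the function names are hypothetical and the sketch only assumes that a cast to nanoseconds multiplies the stored integer by the number of nanoseconds per unit.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1

# Nanoseconds per timestamp unit.
NANOS_PER_UNIT = {"s": 10**9, "ms": 10**6, "us": 10**3, "ns": 1}


def nano_safe_bounds(unit):
    """Return (lo, hi): the range of values in `unit` that fit in int64
    after being multiplied up to nanoseconds."""
    factor = NANOS_PER_UNIT[unit]
    hi = INT64_MAX // factor          # floor(INT64_MAX / factor)
    lo = -((-INT64_MIN) // factor)    # ceil(INT64_MIN / factor)
    return lo, hi


def casts_without_overflow(value, unit):
    """True if `value` in `unit` is representable as int64 nanoseconds."""
    nanos = value * NANOS_PER_UNIT[unit]
    return INT64_MIN <= nanos <= INT64_MAX


second_lo, second_hi = nano_safe_bounds("s")
```

A random generator for test data would then draw values only from `[lo, hi]` for the unit in question.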

emkornfield

Sorry still reviewing, will try to do more tomorrow.

iajoiner · 2021-04-14T07:24:02Z

Sorry still reviewing, will try to do more tomorrow.

That's fine. I have addressed all the comments you gave and pushed.

emkornfield · 2021-04-15T05:40:16Z

sorry, I did not have time to do a further review.

emkornfield · 2021-04-15T05:40:26Z

will aim for tomorrow.

pitrou

Ok, lots of comments still. Thanks for being persistent :-)

pitrou · 2021-04-15T12:24:39Z

python/pyarrow/tests/test_orc.py

@@ -26,140 +24,13 @@
 pytestmark = pytest.mark.orc


-try:


I don't understand why the tests against example files were removed here. I think it would be worthwhile to keep them, especially if there are no new tests to replace them.

We can keep them for now but eventually these tests need to be replaced by Arrow2ORC(ORC2Arrow) ones.

Those are slightly different, though. Roundtripping between Arrow and ORC doesn't validate that ORC data is correct, or that we are able to read foreign-produced ORC files.

Well, I removed a test that compares the ORC schema with some JSON one since they don’t actually line up given the new behavior of the ORC Reader on MAP arrays.

@pitrou Fair enough. Well, back then in Dec and Jan I did manual tests using the ORC adapter to write ORC files and then loaded them using pyorc and compared the results. So we should be good.

Of course, in the future I can add some more tests (read using Arrow, write without Arrow, and vice versa). Can we get this PR out there though? The functionality has been very stable since early January, and only 2 to 4 bugs affecting the actual ORC files have been caught during the past 3 months.

pitrou · 2021-04-15T12:26:41Z

cpp/src/arrow/adapters/orc/adapter.h

+  /// \return Status
+  Status Write(const Table& table);
+
+  /// \brief Close a file


Does it actually close the file (i.e. the output stream)? It doesn't seem to. Can you make the docstring less ambiguous?

It closes the std::unique_ptr<liborc::Writer> writer_, so yes, closure does take place. However, it doesn't close the output stream. Doc clarified.

pitrou · 2021-04-15T12:28:51Z

python/pyarrow/_orc.pyx

+        get_writer(source, &rd_handle)
+        with nogil:
+            self.writer = move(GetResultValue[unique_ptr[ORCFileWriter]](
+                ORCFileWriter.Open(rd_handle.get())))


ORCFileWriter doesn't keep a strong reference to the shared_ptr[COutputStream], which is a local variable. This means the stream can be destroyed when this function exits.

You should store the shared_ptr[COutputStream] either in the Python ORCWriter object, or in the C++ ORCFileWriter object. The latter sounds better to me.

Thanks! Given that we use the pImpl pattern for ORCFileWriter (and in fact ORCFileReader as well) I will store it in the Python ORCWriter object.
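
The lifetime problem raised in this thread can be illustrated with a pure-Python analogy. All classes below are hypothetical stand-ins, not pyarrow or Arrow C++ types: a weak reference plays the role of a non-owning pointer to the stream, and an extra attribute plays the role of the stored shared_ptr.

```python
import gc
import weakref


class Stream:
    """Stand-in for an output stream (hypothetical)."""


class BorrowingWriter:
    """Holds only a weak reference, like a non-owning pointer."""

    def __init__(self, stream):
        self._stream = weakref.ref(stream)

    def stream_alive(self):
        return self._stream() is not None


class OwningWriter(BorrowingWriter):
    """Also keeps a strong reference, so the stream outlives the call."""

    def __init__(self, stream):
        super().__init__(stream)
        self._keepalive = stream


def open_writer(writer_cls):
    stream = Stream()          # local variable, like the handle in the diff
    return writer_cls(stream)  # `stream` goes out of scope on return


borrowing = open_writer(BorrowingWriter)
owning = open_writer(OwningWriter)
gc.collect()  # not needed on CPython, but makes the result deterministic elsewhere
```

After `open_writer` returns, only the writer that stored a strong reference still has a usable stream, which is the effect of keeping the handle in the Python writer object.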

pitrou · 2021-04-15T12:31:48Z

cpp/src/arrow/testing/random.h

+  // /// \return a generated Array
+  // std::shared_ptr<Array> Struct(const ArrayVector& children, int64_t size,
+  //                               double null_probability);
+


If this is not implemented in this PR, can you remove the commented declaration?

pitrou · 2021-04-15T12:32:44Z

cpp/src/arrow/adapters/orc/adapter_util.h

-                   int64_t offset, int64_t length, ArrayBuilder* builder);
+                   int64_t offset, int64_t length, arrow::ArrayBuilder* builder);
+
+Status WriteBatch(liborc::ColumnVectorBatch* column_vector_batch,


No need for a formal docstring, but you could still add a comment explaining what this does. Especially the arrow_index_offset, arrow_chunk_offset and length parameters.

pitrou · 2021-04-15T13:34:40Z

cpp/src/arrow/adapters/orc/adapter_util.cc

+    case arrow::Type::type::FIXED_SIZE_LIST:
+    case arrow::Type::type::LARGE_LIST: {
+      std::shared_ptr<arrow::DataType> arrow_child_type =
+          static_cast<const arrow::BaseListType&>(type).value_type();


checked_cast...

Yup. Fixed.

pitrou · 2021-04-15T13:35:16Z

cpp/src/arrow/adapters/orc/adapter_util.cc

+    std::string field_name = field->name();
+    std::shared_ptr<DataType> arrow_child_type = field->type();
+    ORC_UNIQUE_PTR<liborc::Type> orc_subtype =
+        ::GetORCType(*arrow_child_type).ValueOrDie();


ARROW_ASSIGN_OR_RAISE, or similar

pitrou · 2021-04-15T13:35:37Z

cpp/src/arrow/adapters/orc/adapter_util.cc

        type_codes.push_back(static_cast<int8_t>(child));
      }
      *out = sparse_union(fields, type_codes);
      break;
    }
    default: {
-      return Status::Invalid("Unknown Orc type kind: ", kind);
+      return Status::Invalid("Unknown Orc type kind: ", type->toString());


TypeError or NotImplemented here

TypeError is used, since reaching the default case does indicate a type error on the ORC side.

pitrou · 2021-04-15T13:36:23Z

cpp/src/arrow/adapters/orc/adapter_test.cc

+#include "arrow/buffer.h"
+#include "arrow/buffer_builder.h"
+#include "arrow/chunked_array.h"
+#include "arrow/compute/cast.h"


Is the compute module needed to test ORC functionality? I'm a bit surprised.

It actually is. Arrow has a lot more types than ORC, hence Arrow2ORC(ORC2Arrow(x)) may not be the same as x. As a result we need casting for testing purposes.

Fair enough!
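
To make the roundtrip argument concrete, here is a small Python sketch using plain strings for type names. The two mapping tables are a simplified assumption for illustration (the real adapter handles many more types and their parameters); they only capture the idea that several Arrow types collapse onto one ORC type.

```python
# Assumed, simplified type mappings for illustration only.
ARROW_TO_ORC = {
    "int32": "int",
    "int64": "bigint",
    "string": "string",
    "large_string": "string",   # two Arrow types share one ORC type
    "binary": "binary",
    "large_binary": "binary",
}
ORC_TO_ARROW = {
    "int": "int32",
    "bigint": "int64",
    "string": "string",
    "binary": "binary",
}


def roundtrip_type(arrow_type):
    """Arrow type obtained after writing to ORC and reading back."""
    return ORC_TO_ARROW[ARROW_TO_ORC[arrow_type]]


def needs_cast(arrow_type):
    """True if the expected table must be cast before comparison."""
    return roundtrip_type(arrow_type) != arrow_type


lossy = sorted(t for t in ARROW_TO_ORC if needs_cast(t))
```

A roundtrip test would cast the input to `roundtrip_type(...)` before comparing it with what the reader returns, which is where a cast kernel comes in.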

pitrou · 2021-04-15T13:42:57Z

cpp/src/arrow/adapters/orc/adapter_test.cc

+
+std::shared_ptr<ChunkedArray> GenerateRandomChunkedArray(
+    const std::shared_ptr<DataType>& data_type, int64_t size, int64_t min_num_chunks,
+    int64_t max_num_chunks, double null_probability) {


I suggest factoring this functionality in testing/random.h under the form:

class ARROW_TESTING_EXPORT RandomArrayGenerator {
  // [snip]
  std::shared_ptr<ChunkedArray> Chunked(const std::shared_ptr<Array>& array, int num_chunks);
};

Sure. This and weak composition do not seem to belong to the ORC adapter tests as they are a lot more general.

Wait. This function itself contains ORC-specific code, such as the requirement that Date64 and Timestamp scalars must not overflow when cast to Timestamp NANO. Unless this requirement is actually universal, in which case we should change how random arrays are canonically generated for these types, we shouldn't leave such an ORC-specific function in testing/random.h.

iajoiner · 2021-04-16T07:05:02Z

@pitrou Thank you very much for your detailed comments! I have addressed all of them. Please review again since we need to release it. Thanks!

* Reduce C++ tests runtime
* Expose less global symbols
* Factor out ORC error handling
* Clean up style

pitrou · 2021-04-19T09:26:47Z

I've made a bunch of fixes (including restoring the Python integration tests). I'll merge if CI is green.

Closes apache#8648 from mathyingzhou/ARROW-7906_pyarrow_write_orc

Lead-authored-by: Ying Zhou <yingzhou474@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Heres, Daniel <danielheres@gmail.com>
Co-authored-by: Dmitry Patsura <zaets28rus@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Yibo Cai <yibo.cai@arm.com>
Co-authored-by: Yordan Pavlov <yordan.pavlov@outlook.com>
Co-authored-by: mqy <meng.qingyou@gmail.com>
Co-authored-by: Kenta Murata <mrkn@mrkn.jp>
Co-authored-by: Johannes Müller <JohannesMueller@fico.com>
Co-authored-by: Mahmut Bulut <vertexclique@gmail.com>
Co-authored-by: Ryan Jennings <ryan@ryanj.net>
Co-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Jörn Horstmann <joern.horstmann@signavio.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Matt Brubeck <mbrubeck@limpet.net>
Co-authored-by: Max Burke <max@urbanlogiq.com>
Co-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

iajoiner marked this pull request as draft November 16, 2020 20:45

iajoiner marked this pull request as ready for review November 28, 2020 20:41

xhochy reviewed Dec 3, 2020

pitrou added the Component: C++ label Dec 9, 2020

iajoiner marked this pull request as draft December 27, 2020 16:10

iajoiner marked this pull request as ready for review January 3, 2021 10:49

iajoiner requested a review from xhochy January 7, 2021 01:18

github-actions bot added the Component: Python label Jan 10, 2021

iajoiner changed the title from "ARROW-7906: [C++] Add ORC write support" to "ARROW-7906: [C++] [Python] Add ORC write support" Jan 10, 2021

github-actions bot added Component: Rust Component: Parquet labels Jan 10, 2021

Ying Zhou added 11 commits January 13, 2021 17:21

Add Arrow2ORC type conversion

68120da

Comment out incomplete feature, prepare for testing

0da3aeb

GetORCType compiles

231a74e

Add FillBatch array -> chunkedarray

98b216e

numeric done

e65d53c

Numeric tests completed, binary & decimal added

3a7baae

Binary types done and tested

cefa9da

Struct tested

99d34db

List written

23f2424

All lists tested

a07c3ce

Union initial version

ee51e9c

emkornfield reviewed Apr 14, 2021

emkornfield reviewed Apr 14, 2021

Address further comments

b3ca8fd

pitrou requested changes Apr 15, 2021

Ying Zhou added 5 commits April 15, 2021 21:33

Fix issues

f15c69b

Address most comments by Antoine Pitrou

65475a3

Decimal fix for big endian machines

4d42e1d

All C++ comments addressed

e4a4cb3

All comments addressed

3447e36

iajoiner requested a review from pitrou April 16, 2021 07:05

Ying Zhou and others added 4 commits April 16, 2021 03:35

Linter satisfied, obsolete python test removed

1aff1f0

Python test issue fixed

efc6b69

Python test issue fixed

db14f9d

* Restore Python integration tests

328db6f

* Reduce C++ tests runtime
* Expose less global symbols
* Factor out ORC error handling
* Clean up style

pitrou approved these changes Apr 19, 2021

pitrou closed this in 1dc8f94 Apr 19, 2021

iajoiner deleted the ARROW-7906_pyarrow_write_orc branch April 19, 2021 18:48

asfimport mentioned this pull request Apr 19, 2021

[C++][Python] Full functionality for ORC format #24128

Closed
