New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-7906: [C++] [Python] Add ORC write support #8648
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test file is really huge. I made some general suggestions that hopefully save quite some lines.
I have revamped the tests completely and refactored the code to eliminate dependency issues and get all checks to pass. Right now I'm integrating my old nested type tests into adapter_test.cpp. Then I will make this PR ready for review again. Note that support for dense union and sparse union has been delayed till a further PR since there is no read union in the ORC reader which makes testing hard. In that PR I will probably add the following features:
|
Now it is ready for review! I haven't spotted any ORC writer bug in the code base itself since 8 days ago so I think it is likely pretty good. |
I can have a look on Monday / Tuesday. |
Codecov Report
@@ Coverage Diff @@
## master #8648 +/- ##
==========================================
- Coverage 81.80% 81.80% -0.01%
==========================================
Files 214 214
Lines 51383 51383
==========================================
- Hits 42034 42033 -1
- Misses 9349 9350 +1
Continue to review full report at Codecov.
|
I have finished the Python binding as well. Note that I have made no changes to the Rust code. |
@xhochy Please review it when you can. Thanks! |
Hi @mathyingzhou I see that you didn't make changes to the Rust code. Please rebase with |
} | ||
return Status::OK(); | ||
} | ||
Status Close() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: whitespace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Fixed.
} | ||
|
||
template <typename T, typename U> | ||
void randintpartition(int64_t n, T sum, std::vector<U>* out) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some docs here would be good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Docs Added.
} | ||
|
||
template <typename T, typename U> | ||
void randintpartition(int64_t n, T sum, std::vector<U>* out) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: RandIntPartition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. Yes. Eventually I plan to relocate the code to the place where we generate random arrays since this functionality helps generating random ChunkedArrays.
date64(), rand.Int64(size, kMilliMin, kMilliMax, null_probability)); | ||
} | ||
|
||
Result<std::shared_ptr<Array>> GenerateRandomTimestampArray(int64_t size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lidavidm did you recently check in code that could replace this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield @lidavidm Please correct me if I'm wrong. Since I use the fact that real DATE64 and TIMESTAMP (with UNIT not equals NANO) can be cast to TIMESTAMP (using NANO) without getting beyond int64_t (because ORC essentially only supports NANO, see TimestampVectorBatch in https://orc.apache.org/docs/core-cpp.html) I don't think arrow::random::RandomArrayGenerator.ArrayOf
can be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do something like GenerateArray(field("", timestamp(TimeUnit::SECOND), key_value_metadata({{"min", kSecondMin}, {"max", kSecondMax}})))
but that isn't really much more concise than what you have here at this point, right, since your constraint is that they all have to be castable to nanoseconds without overflowing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry still reviewing, will try to do more tomorrow.
That's fine. I have addressed all the comments you gave and pushed. |
sorry, I did not have time to do a further review. |
will aim for tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, lots of comments still. Thanks for being persistent :-)
python/pyarrow/tests/test_orc.py
Outdated
@@ -26,140 +24,13 @@ | |||
pytestmark = pytest.mark.orc | |||
|
|||
|
|||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why the tests against example files were removed here. I think it would be worthwhile to keep them, especially if there's no new tests to replace them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can keep them for now but eventually these tests need to be replaced by Arrow2ORC(ORC2Arrow) ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are slightly different, though. Roundtripping between Arrow and ORC doesn't validate that ORC data is correct, or that we are able to read foreign-produced ORC files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Restored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I removed a test that compares the ORC schema with some JSON one since they don’t actually line up given the new behavior of the ORC Reader on MAP arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou Fair enough. Well, back then in Dec and Jan I did manual tests using the ORC adapter to write ORC files and then loaded them using pyorc and compared the results. So we should be good.
Of course in the future I can add some more tests (read using Arrow, write without Arrow and vice versa). Can we get this PR out there though? The functionality has been very stable since early Jan and the amount of bugs affecting the actual ORC files we have caught during the past 3 months is 2-4.
cpp/src/arrow/adapters/orc/adapter.h
Outdated
/// \return Status | ||
Status Write(const Table& table); | ||
|
||
/// \brief Close a file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it actually close the file (i.e. the output stream)? It doesn't seem to. Can you make the docstring less ambiguous?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It closes the std::unique_ptr<liborc::Writer> writer_
so yes closure does take place. However it doesn’t close the output stream. Doc clarified.
python/pyarrow/_orc.pyx
Outdated
get_writer(source, &rd_handle) | ||
with nogil: | ||
self.writer = move(GetResultValue[unique_ptr[ORCFileWriter]]( | ||
ORCFileWriter.Open(rd_handle.get()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ORCFileWriter
doesn't keep a strong reference to the shared_ptr[COutputStream]
, which is a local variable. This means the stream can be destroyed when this function exists.
You should store the shared_ptr[COutputStream]
either in the Python ORCWriter
object, or in the C++ ORCFileWriter
object. The latter sounds better to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Given that we use the pImpl
pattern for ORCFileWriter
(and in fact ORCFileReader
as well) I will store it in the Python ORCWriter
object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cpp/src/arrow/testing/random.h
Outdated
// /// \return a generated Array | ||
// std::shared_ptr<Array> Struct(const ArrayVector& children, int64_t size, | ||
// double null_probability); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is not implemented in this PR, can you remove the commented declaration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. Done!
int64_t offset, int64_t length, ArrayBuilder* builder); | ||
int64_t offset, int64_t length, arrow::ArrayBuilder* builder); | ||
|
||
Status WriteBatch(liborc::ColumnVectorBatch* column_vector_batch, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for a formal docstring, but you could still add a comment explaining what this does. Especially the arrow_index_offset
, arrow_chunk_offset
and length
parameters.
case arrow::Type::type::FIXED_SIZE_LIST: | ||
case arrow::Type::type::LARGE_LIST: { | ||
std::shared_ptr<arrow::DataType> arrow_child_type = | ||
static_cast<const arrow::BaseListType&>(type).value_type(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checked_cast
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. Fixed.
std::string field_name = field->name(); | ||
std::shared_ptr<DataType> arrow_child_type = field->type(); | ||
ORC_UNIQUE_PTR<liborc::Type> orc_subtype = | ||
::GetORCType(*arrow_child_type).ValueOrDie(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ARROW_ASSIGN_OR_RAISE
, or similar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup.
type_codes.push_back(static_cast<int8_t>(child)); | ||
} | ||
*out = sparse_union(fields, type_codes); | ||
break; | ||
} | ||
default: { | ||
return Status::Invalid("Unknown Orc type kind: ", kind); | ||
return Status::Invalid("Unknown Orc type kind: ", type->toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TypeError
or NotImplemented
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TypeError is used haha since if default is reached there is indeed some TypeError in ORC.
#include "arrow/buffer.h" | ||
#include "arrow/buffer_builder.h" | ||
#include "arrow/chunked_array.h" | ||
#include "arrow/compute/cast.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the compute module needed to test ORC functionality? I'm a bit surprised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It actually does. Arrow has a lot more types than ORC hence Arrow2ORC(ORC2Arrow(x)) may not be the same as x. As a result we have to have casting for testing purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough!
|
||
std::shared_ptr<ChunkedArray> GenerateRandomChunkedArray( | ||
const std::shared_ptr<DataType>& data_type, int64_t size, int64_t min_num_chunks, | ||
int64_t max_num_chunks, double null_probability) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest factoring this functionality in testing/random.h
under the form:
class ARROW_TESTING_EXPORT RandomArrayGenerator {
// [snip]
std::shared_ptr<ChunkedArray> Chunked(const std::shared_ptr<Array>& Array, int num_chunks);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. This and weak composition do not seem to belong to the ORC adapter tests as they are a lot more general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait. This function itself actually contains ORC-specific code such as the requirement that Date64 and Timestamp scalars must not overflow when cast to Timestamp NANO. Unless this requirement is actually universal in which case we should change how random arrays are canonically generated for these types we shouldn't really leave some function so ORC-specific in testing/random.h
.
@pitrou Really thanks for your detailed comments! I have addressed all of them. Please review again since we need to release it. Thanks! |
* Reduce C++ tests runtime * Expose less global symbols * Factor out ORC error handling * Clean up style
I've made a bunch of fixes (including restoring the Python integration tests). I'll merge if CI is green. |
This pull request tracks the progress on adding ORC write support. The functionality is not complete yet. However for most types the process of populating a ColumnVectorBatch in ORC using data from Arrow Array. Arrow data types (arrow::Type::type) I do support: Boolean: BOOL Numerical: INT8, INT16, INT32, INT64, FLOAT, DOUBLE Time-related: DATE32 Binary: BINARY, STRING, LARGE_BINARY, LARGE_STRING, FIXED_SIZE_BINARY Nested: LIST, LARGE_LIST, FIXED_SIZE_LIST, STRUCT, MAP, DENSE_UNION, SPARSE_UNION Arrow data types I plan to support: Numerical: DECIMAL128 Time-related: DATE64, TIMESTAMP Dictionary: DICTIONARY Arrow data types I currently do NOT plan to support: Numerical: UINT8, UINT16, UINT32, UINT64, HALF_FLOAT, DECIMAL256 (There are no corresponding types in ORC. Of course except for in the case of DECIMAL256 we can always cast them into larger types. However I think maybe users need to explicitly do that.) Time-related: TIME32, TIME64, INTERVAL_MONTHS, INTERVAL_DAY_TIME, DURATION (There are no corresponding types in ORC and it is impossible to cast them into ORC types without losing time-related information) Extension: EXTENSION Closes apache#8648 from mathyingzhou/ARROW-7906_pyarrow_write_orc Lead-authored-by: Ying Zhou <yingzhou474@gmail.com> Co-authored-by: Sutou Kouhei <kou@clear-code.com> Co-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Heres, Daniel <danielheres@gmail.com> Co-authored-by: Dmitry Patsura <zaets28rus@gmail.com> Co-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: Yibo Cai <yibo.cai@arm.com> Co-authored-by: Yordan Pavlov <yordan.pavlov@outlook.com> Co-authored-by: mqy <meng.qingyou@gmail.com> Co-authored-by: Kenta Murata <mrkn@mrkn.jp> Co-authored-by: Johannes Müller <JohannesMueller@fico.com> Co-authored-by: Mahmut Bulut <vertexclique@gmail.com> Co-authored-by: Ryan Jennings <ryan@ryanj.net> Co-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Co-authored-by: Jörn Horstmann <joern.horstmann@signavio.com> Co-authored-by: Daniël Heres <danielheres@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com> Co-authored-by: Matt Brubeck <mbrubeck@limpet.net> Co-authored-by: Max Burke <max@urbanlogiq.com> Co-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
This pull request tracks the progress on adding ORC write support. The functionality is not complete yet. However for most types the process of populating a ColumnVectorBatch in ORC using data from Arrow Array. Arrow data types (arrow::Type::type) I do support: Boolean: BOOL Numerical: INT8, INT16, INT32, INT64, FLOAT, DOUBLE Time-related: DATE32 Binary: BINARY, STRING, LARGE_BINARY, LARGE_STRING, FIXED_SIZE_BINARY Nested: LIST, LARGE_LIST, FIXED_SIZE_LIST, STRUCT, MAP, DENSE_UNION, SPARSE_UNION Arrow data types I plan to support: Numerical: DECIMAL128 Time-related: DATE64, TIMESTAMP Dictionary: DICTIONARY Arrow data types I currently do NOT plan to support: Numerical: UINT8, UINT16, UINT32, UINT64, HALF_FLOAT, DECIMAL256 (There are no corresponding types in ORC. Of course except for in the case of DECIMAL256 we can always cast them into larger types. However I think maybe users need to explicitly do that.) Time-related: TIME32, TIME64, INTERVAL_MONTHS, INTERVAL_DAY_TIME, DURATION (There are no corresponding types in ORC and it is impossible to cast them into ORC types without losing time-related information) Extension: EXTENSION Closes apache#8648 from mathyingzhou/ARROW-7906_pyarrow_write_orc Lead-authored-by: Ying Zhou <yingzhou474@gmail.com> Co-authored-by: Sutou Kouhei <kou@clear-code.com> Co-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Heres, Daniel <danielheres@gmail.com> Co-authored-by: Dmitry Patsura <zaets28rus@gmail.com> Co-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: Yibo Cai <yibo.cai@arm.com> Co-authored-by: Yordan Pavlov <yordan.pavlov@outlook.com> Co-authored-by: mqy <meng.qingyou@gmail.com> Co-authored-by: Kenta Murata <mrkn@mrkn.jp> Co-authored-by: Johannes Müller <JohannesMueller@fico.com> Co-authored-by: Mahmut Bulut <vertexclique@gmail.com> Co-authored-by: Ryan Jennings <ryan@ryanj.net> Co-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Co-authored-by: Jörn Horstmann <joern.horstmann@signavio.com> Co-authored-by: Daniël Heres <danielheres@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com> Co-authored-by: Matt Brubeck <mbrubeck@limpet.net> Co-authored-by: Max Burke <max@urbanlogiq.com> Co-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
This pull request tracks the progress on adding ORC write support. The functionality is not complete yet. However for most types the process of populating a ColumnVectorBatch in ORC using data from Arrow Array. Arrow data types (arrow::Type::type) I do support: Boolean: BOOL Numerical: INT8, INT16, INT32, INT64, FLOAT, DOUBLE Time-related: DATE32 Binary: BINARY, STRING, LARGE_BINARY, LARGE_STRING, FIXED_SIZE_BINARY Nested: LIST, LARGE_LIST, FIXED_SIZE_LIST, STRUCT, MAP, DENSE_UNION, SPARSE_UNION Arrow data types I plan to support: Numerical: DECIMAL128 Time-related: DATE64, TIMESTAMP Dictionary: DICTIONARY Arrow data types I currently do NOT plan to support: Numerical: UINT8, UINT16, UINT32, UINT64, HALF_FLOAT, DECIMAL256 (There are no corresponding types in ORC. Of course except for in the case of DECIMAL256 we can always cast them into larger types. However I think maybe users need to explicitly do that.) Time-related: TIME32, TIME64, INTERVAL_MONTHS, INTERVAL_DAY_TIME, DURATION (There are no corresponding types in ORC and it is impossible to cast them into ORC types without losing time-related information) Extension: EXTENSION Closes apache#8648 from mathyingzhou/ARROW-7906_pyarrow_write_orc Lead-authored-by: Ying Zhou <yingzhou474@gmail.com> Co-authored-by: Sutou Kouhei <kou@clear-code.com> Co-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Heres, Daniel <danielheres@gmail.com> Co-authored-by: Dmitry Patsura <zaets28rus@gmail.com> Co-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: Yibo Cai <yibo.cai@arm.com> Co-authored-by: Yordan Pavlov <yordan.pavlov@outlook.com> Co-authored-by: mqy <meng.qingyou@gmail.com> Co-authored-by: Kenta Murata <mrkn@mrkn.jp> Co-authored-by: Johannes Müller <JohannesMueller@fico.com> Co-authored-by: Mahmut Bulut <vertexclique@gmail.com> Co-authored-by: Ryan Jennings <ryan@ryanj.net> Co-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Co-authored-by: Jörn Horstmann <joern.horstmann@signavio.com> Co-authored-by: Daniël Heres <danielheres@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com> Co-authored-by: Matt Brubeck <mbrubeck@limpet.net> Co-authored-by: Max Burke <max@urbanlogiq.com> Co-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
This pull request tracks the progress on adding ORC write support. The functionality is not complete yet. However for most types the process of populating a ColumnVectorBatch in ORC using data from Arrow Array.
Arrow data types (arrow::Type::type) I do support:
Boolean: BOOL
Numerical: INT8, INT16, INT32, INT64, FLOAT, DOUBLE
Time-related: DATE32
Binary: BINARY, STRING, LARGE_BINARY, LARGE_STRING, FIXED_SIZE_BINARY
Nested: LIST, LARGE_LIST, FIXED_SIZE_LIST, STRUCT, MAP, DENSE_UNION, SPARSE_UNION
Arrow data types I plan to support:
Numerical: DECIMAL128
Time-related: DATE64, TIMESTAMP
Dictionary: DICTIONARY
Arrow data types I currently do NOT plan to support:
Numerical: UINT8, UINT16, UINT32, UINT64, HALF_FLOAT, DECIMAL256 (There are no corresponding types in ORC. Of course except for in the case of DECIMAL256 we can always cast them into larger types. However I think maybe users need to explicitly do that.)
Time-related: TIME32, TIME64, INTERVAL_MONTHS, INTERVAL_DAY_TIME, DURATION (There are no corresponding types in ORC and it is impossible to cast them into ORC types without losing time-related information)
Extension: EXTENSION