ARROW-7906: [C++] [Python] Add ORC write support #8648

Closed
wants to merge 187 commits into from

iajoiner

This pull request tracks the progress on adding ORC write support. The functionality is not complete yet. However, for most types, populating an ORC ColumnVectorBatch with data from an Arrow Array already works.

Arrow data types (arrow::Type::type) I do support:
Boolean: BOOL
Numerical: INT8, INT16, INT32, INT64, FLOAT, DOUBLE
Time-related: DATE32
Binary: BINARY, STRING, LARGE_BINARY, LARGE_STRING, FIXED_SIZE_BINARY
Nested: LIST, LARGE_LIST, FIXED_SIZE_LIST, STRUCT, MAP, DENSE_UNION, SPARSE_UNION

Arrow data types I plan to support:
Numerical: DECIMAL128
Time-related: DATE64, TIMESTAMP
Dictionary: DICTIONARY

Arrow data types I currently do NOT plan to support:
Numerical: UINT8, UINT16, UINT32, UINT64, HALF_FLOAT, DECIMAL256 (ORC has no corresponding types. Except for DECIMAL256, these can always be cast to larger types, but I think users should do that explicitly.)
Time-related: TIME32, TIME64, INTERVAL_MONTHS, INTERVAL_DAY_TIME, DURATION (ORC has no corresponding types, and these cannot be cast to ORC types without losing time-related information.)
Extension: EXTENSION
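
The remark about explicitly casting unsigned integers to larger types can be made concrete. The sketch below is only an illustration, not code from this PR: the helper `smallest_signed_target` and both tables are hypothetical names, and it considers only the signed integer types listed as supported above.

```python
# Bit widths of the signed integer types listed as supported above.
SIGNED_BITS = {"INT8": 8, "INT16": 16, "INT32": 32, "INT64": 64}
# Bit widths of the unsigned types that are not planned to be supported.
UNSIGNED_BITS = {"UINT8": 8, "UINT16": 16, "UINT32": 32, "UINT64": 64}


def smallest_signed_target(unsigned_name):
    """Return the narrowest signed type that holds every value of the
    given unsigned type, or None if no signed integer type is wide enough."""
    max_value = 2 ** UNSIGNED_BITS[unsigned_name] - 1
    for name, bits in sorted(SIGNED_BITS.items(), key=lambda item: item[1]):
        if max_value <= 2 ** (bits - 1) - 1:
            return name
    return None


for unsigned in UNSIGNED_BITS:
    print(unsigned, "->", smallest_signed_target(unsigned))
```

In this sketch UINT64 maps to None, because no signed integer type of 64 bits or fewer can hold all of its values; a user would have to pick a different kind of target type for it.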

@iajoiner iajoiner marked this pull request as draft November 16, 2020 20:45
@iajoiner iajoiner marked this pull request as ready for review November 28, 2020 20:41
Member

@xhochy xhochy left a comment

The test file is really huge. I made some general suggestions that will hopefully save quite a few lines.

cpp/src/arrow/adapters/orc/adapter.cc (outdated, resolved)
cpp/src/arrow/adapters/orc/adapter.h (outdated, resolved)
cpp/src/arrow/adapters/orc/adapter_test.cc (outdated, resolved)
cpp/src/arrow/adapters/orc/adapter_test.cc (outdated, resolved)
cpp/src/arrow/adapters/orc/adapter_test.cc (outdated, resolved)
@iajoiner iajoiner marked this pull request as draft December 27, 2020 16:10
@iajoiner
Author

I have revamped the tests completely and refactored the code to eliminate dependency issues and get all checks to pass. Right now I'm integrating my old nested type tests into adapter_test.cc. Then I will make this PR ready for review again.

Note that support for dense union and sparse union has been deferred to a later PR, since the ORC reader cannot read unions yet, which makes testing hard. In that PR I will probably add the following features:

  1. Read and write support for union types.
  2. Replicate orc::WriterOptions in adapters::orc::ORCWriterOptions.
  3. Replicate orc::ReaderOptions in adapters::orc::ORCReaderOptions and add them to the ORC reader.

@iajoiner iajoiner marked this pull request as ready for review January 3, 2021 10:49
@iajoiner
Author

iajoiner commented Jan 3, 2021

Now it is ready for review! I haven't spotted any ORC writer bugs in the code base itself in the past 8 days, so I think it is likely in pretty good shape.

@iajoiner iajoiner requested a review from xhochy January 7, 2021 01:18
@xhochy
Member

xhochy commented Jan 7, 2021

I can have a look on Monday / Tuesday.

@iajoiner iajoiner changed the title from "ARROW-7906: [C++] Add ORC write support" to "ARROW-7906: [C++] [Python] Add ORC write support" Jan 10, 2021
@codecov-io

Codecov Report

Merging #8648 (552bf93) into master (1f32ca1) will decrease coverage by 0.00%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #8648      +/-   ##
==========================================
- Coverage   81.80%   81.80%   -0.01%     
==========================================
  Files         214      214              
  Lines       51383    51383              
==========================================
- Hits        42034    42033       -1     
- Misses       9349     9350       +1     
Impacted Files Coverage Δ
rust/arrow/src/buffer.rs 97.72% <ø> (ø)
rust/arrow/src/compute/kernels/aggregate.rs 75.00% <ø> (ø)
rust/arrow/src/compute/kernels/arithmetic.rs 89.83% <ø> (ø)
rust/arrow/src/compute/kernels/comparison.rs 95.91% <ø> (ø)
rust/arrow/src/datatypes.rs 78.75% <ø> (ø)
rust/arrow/src/util/bit_util.rs 100.00% <ø> (ø)
rust/parquet/src/encodings/encoding.rs 95.24% <0.00%> (-0.20%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@iajoiner
Author

I have finished the Python binding as well. Note that I have made no changes to the Rust code.

@iajoiner
Author

@xhochy Please review it when you can. Thanks!

@nevi-me
Contributor

nevi-me commented Jan 13, 2021

Hi @mathyingzhou I see that you didn't make changes to the Rust code. Please rebase with git rebase origin/master (if origin is apache/arrow) so that you can remove the Rust changes from the PR. Then you'll need to force-push onto your branch (git push --force). Thanks

}
return Status::OK();
}
Status Close() {
Contributor

nit: whitespace

Author

Sure. Fixed.

}

template <typename T, typename U>
void randintpartition(int64_t n, T sum, std::vector<U>* out) {
Contributor

some docs here would be good.

Author

@iajoiner iajoiner Apr 14, 2021

Sure. Docs Added.

}

template <typename T, typename U>
void randintpartition(int64_t n, T sum, std::vector<U>* out) {
Contributor

nit: RandIntPartition?

Author

@iajoiner iajoiner Apr 14, 2021

Ah. Yes. Eventually I plan to move this code to where we generate random arrays, since it helps generate random ChunkedArrays.
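
The body of randintpartition is not shown in this thread. As a rough sketch of what such a helper might do, assuming it splits a total into n non-negative parts (for example, chunk lengths for a random ChunkedArray), the hypothetical Python function below picks sorted cut points and returns the gaps between them. It makes no claim about matching the distribution used in the PR.

```python
import random


def rand_int_partition(n, total, rng=None):
    """Split `total` into `n` non-negative integers that sum to `total`.

    Hypothetical stand-in for randintpartition; not the PR's implementation.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if total < 0:
        raise ValueError("total must be non-negative")
    rng = rng or random.Random()
    # Choose n - 1 cut points in [0, total], sort them, and take the gaps.
    cuts = sorted(rng.randint(0, total) for _ in range(n - 1))
    bounds = [0] + cuts + [total]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


# Example: lengths of 5 chunks for an array of 100 elements.
chunk_lengths = rand_int_partition(5, 100, random.Random(0))
print(chunk_lengths, sum(chunk_lengths))
```

Some parts may be zero, which is useful for exercising empty chunks in tests.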

date64(), rand.Int64(size, kMilliMin, kMilliMax, null_probability));
}

Result<std::shared_ptr<Array>> GenerateRandomTimestampArray(int64_t size,
Contributor

@lidavidm did you recently check in code that could replace this?

Author

@iajoiner iajoiner Apr 14, 2021

@emkornfield @lidavidm Please correct me if I'm wrong. I rely on the fact that real DATE64 and TIMESTAMP values (with a unit other than NANO) can be cast to TIMESTAMP (NANO) without overflowing int64_t, because ORC essentially only supports NANO (see TimestampVectorBatch in https://orc.apache.org/docs/core-cpp.html). So I don't think arrow::random::RandomArrayGenerator.ArrayOf can be used.

Member

You could do something like GenerateArray(field("", timestamp(TimeUnit::SECOND), key_value_metadata({{"min", kSecondMin}, {"max", kSecondMax}}))) but that isn't really much more concise than what you have here at this point, right, since your constraint is that they all have to be castable to nanoseconds without overflowing.
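
The constraint discussed here (values must stay within int64 once converted to nanoseconds) can be worked out with plain integer arithmetic. The sketch below is illustrative only: `safe_range` and `NANOS_PER_UNIT` are hypothetical names, and the actual values of constants such as kSecondMin and kSecondMax in the PR are not shown in this thread.

```python
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1

# Nanoseconds per unit; DATE64 stores milliseconds, so it uses the "ms" entry.
NANOS_PER_UNIT = {"s": 10 ** 9, "ms": 10 ** 6, "us": 10 ** 3, "ns": 1}


def safe_range(unit):
    """Return (lo, hi): the extreme values in `unit` whose nanosecond
    equivalents still fit in a signed 64-bit integer."""
    factor = NANOS_PER_UNIT[unit]
    # Round towards zero so that value * factor stays inside the int64 range.
    lo = -((-INT64_MIN) // factor)
    hi = INT64_MAX // factor
    return lo, hi


for unit in NANOS_PER_UNIT:
    print(unit, safe_range(unit))
```

A random generator restricted to these bounds produces values that can be cast to nanosecond timestamps without overflow.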

Contributor

@emkornfield emkornfield left a comment

Sorry still reviewing, will try to do more tomorrow.

@iajoiner
Author

> Sorry still reviewing, will try to do more tomorrow.

That's fine. I have addressed all the comments you gave and pushed.

@emkornfield
Contributor

sorry, I did not have time to do a further review.

@emkornfield
Contributor

will aim for tomorrow.

Member

@pitrou pitrou left a comment

Ok, lots of comments still. Thanks for being persistent :-)

@@ -26,140 +24,13 @@
pytestmark = pytest.mark.orc


try:
Member

I don't understand why the tests against example files were removed here. I think it would be worthwhile to keep them, especially if there are no new tests to replace them.

Author

We can keep them for now but eventually these tests need to be replaced by Arrow2ORC(ORC2Arrow) ones.

Member

Those are slightly different, though. Roundtripping between Arrow and ORC doesn't validate that ORC data is correct, or that we are able to read foreign-produced ORC files.
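
To illustrate why a roundtrip test alone is weaker than a test against an independently produced file, here is a toy sketch. It does not use ORC at all: the "format" is just packed 32-bit integers, assumed to be specified as little-endian, and all names are hypothetical. The writer and reader share the same byte-order bug, so the roundtrip passes while reading the reference bytes fails.

```python
import struct


def buggy_write(values):
    # Bug: writes big-endian, while the (assumed) spec says little-endian.
    return b"".join(struct.pack(">i", v) for v in values)


def buggy_read(data):
    # The same bug mirrored in the reader, so the two cancel out.
    return [v for (v,) in struct.iter_unpack(">i", data)]


# Bytes as an independent, spec-conforming writer would produce them.
REFERENCE_FILE = struct.pack("<3i", 1, 2, 3)

values = [1, 2, 3]
roundtrip_ok = buggy_read(buggy_write(values)) == values
reference_ok = buggy_read(REFERENCE_FILE) == values

print("roundtrip passes:", roundtrip_ok)
print("reference file read correctly:", reference_ok)
```

Only the comparison against the reference bytes exposes the bug, which is the role the example-file tests play.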

Author

Restored.

Author

Well, I removed a test that compares the ORC schema with some JSON one since they don’t actually line up given the new behavior of the ORC Reader on MAP arrays.

Author

@iajoiner iajoiner Apr 16, 2021

@pitrou Fair enough. Well, back then in Dec and Jan I did manual tests using the ORC adapter to write ORC files and then loaded them using pyorc and compared the results. So we should be good.

Of course, in the future I can add some more tests (read using Arrow, write without Arrow, and vice versa). Can we get this PR out there though? The functionality has been very stable since early Jan, and we have caught only 2-4 bugs affecting the actual ORC files during the past 3 months.

/// \return Status
Status Write(const Table& table);

/// \brief Close a file
Member

Does it actually close the file (i.e. the output stream)? It doesn't seem to. Can you make the docstring less ambiguous?

Author

@iajoiner iajoiner Apr 16, 2021

It closes the std::unique_ptr<liborc::Writer> writer_ so yes closure does take place. However it doesn’t close the output stream. Doc clarified.

get_writer(source, &rd_handle)
with nogil:
self.writer = move(GetResultValue[unique_ptr[ORCFileWriter]](
ORCFileWriter.Open(rd_handle.get())))
Member

ORCFileWriter doesn't keep a strong reference to the shared_ptr[COutputStream], which is a local variable. This means the stream can be destroyed when this function exits.

You should store the shared_ptr[COutputStream] either in the Python ORCWriter object, or in the C++ ORCFileWriter object. The latter sounds better to me.

Author

Thanks! Given that we use the pImpl pattern for ORCFileWriter (and in fact ORCFileReader as well) I will store it in the Python ORCWriter object.

Author

Done.
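
The lifetime problem raised in this thread can be mimicked in pure Python with weak references. This is only an analogy with hypothetical class names: in the real C++ code the writer would be left with a dangling pointer rather than a dead weak reference, but the ownership question is the same.

```python
import gc
import weakref


class Stream:
    """Stand-in for an output stream (hypothetical)."""


class NonOwningWriter:
    """Holds only a weak reference, so it does not keep the stream alive."""

    def __init__(self, stream):
        self._stream = weakref.ref(stream)

    def stream_alive(self):
        return self._stream() is not None


class OwningWriter:
    """Stores a strong reference, so the stream lives as long as the writer."""

    def __init__(self, stream):
        self._stream = stream

    def stream_alive(self):
        return self._stream is not None


def open_writer(writer_cls):
    stream = Stream()  # local variable, like the shared_ptr in the snippet
    return writer_cls(stream)


non_owning = open_writer(NonOwningWriter)
owning = open_writer(OwningWriter)
gc.collect()  # make sure unreachable objects are reclaimed

print("non-owning writer still has a stream:", non_owning.stream_alive())
print("owning writer still has a stream:", owning.stream_alive())
```

Storing the stream on a longer-lived object, as done here in the Python ORCWriter, corresponds to the owning variant.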

// /// \return a generated Array
// std::shared_ptr<Array> Struct(const ArrayVector& children, int64_t size,
// double null_probability);

Member

If this is not implemented in this PR, can you remove the commented declaration?

Author

Yup. Done!

int64_t offset, int64_t length, ArrayBuilder* builder);
int64_t offset, int64_t length, arrow::ArrayBuilder* builder);

Status WriteBatch(liborc::ColumnVectorBatch* column_vector_batch,
Member

No need for a formal docstring, but you could still add a comment explaining what this does. Especially the arrow_index_offset, arrow_chunk_offset and length parameters.

case arrow::Type::type::FIXED_SIZE_LIST:
case arrow::Type::type::LARGE_LIST: {
std::shared_ptr<arrow::DataType> arrow_child_type =
static_cast<const arrow::BaseListType&>(type).value_type();
Member

checked_cast...

Author

Yup. Fixed.

std::string field_name = field->name();
std::shared_ptr<DataType> arrow_child_type = field->type();
ORC_UNIQUE_PTR<liborc::Type> orc_subtype =
::GetORCType(*arrow_child_type).ValueOrDie();
Member

ARROW_ASSIGN_OR_RAISE, or similar

Author

Yup.

type_codes.push_back(static_cast<int8_t>(child));
}
*out = sparse_union(fields, type_codes);
break;
}
default: {
return Status::Invalid("Unknown Orc type kind: ", kind);
return Status::Invalid("Unknown Orc type kind: ", type->toString());
Member

TypeError or NotImplemented here

Author

TypeError is used, haha, since reaching the default case does mean there is a type error on the ORC side.

#include "arrow/buffer.h"
#include "arrow/buffer_builder.h"
#include "arrow/chunked_array.h"
#include "arrow/compute/cast.h"
Member

Is the compute module needed to test ORC functionality? I'm a bit surprised.

Author

@iajoiner iajoiner Apr 15, 2021

It actually is. Arrow has a lot more types than ORC, hence Arrow2ORC(ORC2Arrow(x)) may not be the same as x. As a result, we need casting for testing purposes.

Member

Fair enough!
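
The need for casting in the tests can be sketched with a toy type table. The mapping below is an assumption made for illustration (several Arrow types collapsing onto one type after a trip through ORC); it is not taken from the adapter code, and the function names are hypothetical.

```python
# Assumed type seen after writing to ORC and reading back (illustrative only).
ROUNDTRIP_TYPE = {
    "INT32": "INT32",
    "STRING": "STRING",
    "LARGE_STRING": "STRING",
    "LARGE_BINARY": "BINARY",
    "LARGE_LIST": "LIST",
    "FIXED_SIZE_LIST": "LIST",
}


def simulate_roundtrip(schema):
    """Pretend to write a schema to ORC and read it back."""
    return [(name, ROUNDTRIP_TYPE[arrow_type]) for name, arrow_type in schema]


def cast_expected(schema):
    """Cast the original schema to the types the reader is expected to return."""
    return [(name, ROUNDTRIP_TYPE[arrow_type]) for name, arrow_type in schema]


original = [("a", "LARGE_STRING"), ("b", "INT32"), ("c", "FIXED_SIZE_LIST")]
read_back = simulate_roundtrip(original)

print("equal without cast:", read_back == original)
print("equal after cast:", read_back == cast_expected(original))
```

A direct equality check fails for the wider Arrow types, so the expected table has to be cast before comparison.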


std::shared_ptr<ChunkedArray> GenerateRandomChunkedArray(
const std::shared_ptr<DataType>& data_type, int64_t size, int64_t min_num_chunks,
int64_t max_num_chunks, double null_probability) {
Member

I suggest factoring this functionality in testing/random.h under the form:

class ARROW_TESTING_EXPORT RandomArrayGenerator {
  // [snip]
  std::shared_ptr<ChunkedArray> Chunked(const std::shared_ptr<Array>& array, int num_chunks);
};
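
As a rough picture of what a helper like the proposed Chunked could do, the hypothetical Python function below splits a sequence into a fixed number of contiguous slices. How the real method would choose chunk boundaries is not specified in this thread; near-equal lengths are an assumption.

```python
def chunked(values, num_chunks):
    """Split `values` into `num_chunks` contiguous slices of near-equal length."""
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    base, extra = divmod(len(values), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks get one additional element.
        length = base + (1 if i < extra else 0)
        chunks.append(values[start:start + length])
        start += length
    return chunks


print(chunked(list(range(10)), 3))
```

Concatenating the chunks in order gives back the original sequence, which is the property a chunked test array needs.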

Author

@iajoiner iajoiner Apr 15, 2021

Sure. This and weak composition do not seem to belong to the ORC adapter tests as they are a lot more general.

Author

@iajoiner iajoiner Apr 16, 2021

Wait. This function itself actually contains ORC-specific code, such as the requirement that Date64 and Timestamp scalars must not overflow when cast to Timestamp NANO. Unless this requirement is actually universal (in which case we should change how random arrays are canonically generated for these types), we shouldn't put such an ORC-specific function in testing/random.h.

@iajoiner
Author

@pitrou Thank you very much for your detailed comments! I have addressed all of them. Please review again since we need to release it. Thanks!

@iajoiner iajoiner requested a review from pitrou April 16, 2021 07:05
Ying Zhou and others added 4 commits April 16, 2021 03:35
* Reduce C++ tests runtime
* Expose less global symbols
* Factor out ORC error handling
* Clean up style
@pitrou
Member

pitrou commented Apr 19, 2021

I've made a bunch of fixes (including restoring the Python integration tests). I'll merge if CI is green.

@pitrou pitrou closed this in 1dc8f94 Apr 19, 2021
@iajoiner iajoiner deleted the ARROW-7906_pyarrow_write_orc branch April 19, 2021 18:48
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Closes apache#8648 from mathyingzhou/ARROW-7906_pyarrow_write_orc

Lead-authored-by: Ying Zhou <yingzhou474@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Heres, Daniel <danielheres@gmail.com>
Co-authored-by: Dmitry Patsura <zaets28rus@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Yibo Cai <yibo.cai@arm.com>
Co-authored-by: Yordan Pavlov <yordan.pavlov@outlook.com>
Co-authored-by: mqy <meng.qingyou@gmail.com>
Co-authored-by: Kenta Murata <mrkn@mrkn.jp>
Co-authored-by: Johannes Müller <JohannesMueller@fico.com>
Co-authored-by: Mahmut Bulut <vertexclique@gmail.com>
Co-authored-by: Ryan Jennings <ryan@ryanj.net>
Co-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Jörn Horstmann <joern.horstmann@signavio.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Matt Brubeck <mbrubeck@limpet.net>
Co-authored-by: Max Burke <max@urbanlogiq.com>
Co-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 10, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021