-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Arrow: Support DecimalType #3023
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a3f2bd4 to
4b5e344
Compare
rymurr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, I would rather we lazily convert the decimal value though
arrow/src/main/java/org/apache/iceberg/arrow/DictEncodedArrowConverter.java
Outdated
Show resolved
Hide resolved
4b5e344 to
0703854
Compare
| DecimalVector decimalVector = new DecimalVector( | ||
| vectorHolder.vector().getName(), | ||
| ArrowSchemaUtil.convert(vectorHolder.icebergField()).getFieldType(), | ||
| vectorHolder.vector().getAllocator()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that this should be doing any allocation. We should pass in the correct vector and this should just fill it with data. That way, we can control allocation and reuse buffers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this allocation is also the reason why there is a breaking change to the VectorHolder API. Can we avoid the breaking change as well by fixing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about the same approach as you suggested when I was implementing this. However, when we reach this point, the ArrowVectorAccessor in this particular case is a DictionaryDecimalIntAccessor which holds an IntVector and not a DecimalVector, so we can't just pass the correct vector (or maybe I'm misunderstanding by what you mean with "just pass the correct vector").
In order to initialize the correct arrow vector (DecimalVector in this particular case), it's not enough to just know the Type. We actually need to know the entire NestedField to know if it's a required/optional field, hence the change around changing from Type to NestedField in the VectorHolder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the need for NestedField in order to create the vector here. But I don't think that this should be allocating anything. It should instead copy from the IntVector to DecimalVector, both of which should be passed in.
The current structure always passes vectors into readers to be filled with data. Sometimes those are reallocated, but we try to be able to pass the last set of vectors in to avoid allocation. Here should be the same.
For example, if you're reading a table of (int, string) then we allocate a vector for each and pass them into the read method. If that int is actually a decimal, then we should create an appropriate decimal vector and pass that in as well so it can be reused through the same lifecycle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdblue I just pushed a new version where we try and re-use the non-dict-encoded vector in https://github.com/apache/iceberg/pull/3023/files#diff-8b31f9172c09339b4a3bc06e6fc288e2b1e34de90091a3d73a51b84e89bd97a9R189. Is this what you had in mind?
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
Outdated
Show resolved
Hide resolved
c74fded to
7d029af
Compare
…pected Arrow vector types
7d029af to
82faecd
Compare
This is partly fixing #2486 and adding
DecimalTypesupport.Additional functionality to
DictEncodedArrowConverterwill be added as part of #2484.VectorizedReadFlatParquetDataBenchmark
Results on this branch
Results on
mastervectorized-read-flat-parquet-data-result-branch.txt
vectorized-read-flat-parquet-data-result-master.txt