Contributor

@nastra nastra commented Aug 25, 2021

This partly fixes #2486 and adds DecimalType support.
Additional functionality for DictEncodedArrowConverter will be added as part of #2484.

VectorizedReadFlatParquetDataBenchmark

Results on this branch

Benchmark                                                                 Mode  Cnt  Score   Error  Units
VectorizedReadFlatParquetDataBenchmark.readDatesIcebergVectorized5k         ss    5  1.609 ± 0.114   s/op
VectorizedReadFlatParquetDataBenchmark.readDatesSparkVectorized5k           ss    5  1.522 ± 0.092   s/op
VectorizedReadFlatParquetDataBenchmark.readDecimalsIcebergVectorized5k      ss    5  8.706 ± 0.063   s/op
VectorizedReadFlatParquetDataBenchmark.readDecimalsSparkVectorized5k        ss    5  7.947 ± 0.079   s/op
VectorizedReadFlatParquetDataBenchmark.readDoublesIcebergVectorized5k       ss    5  2.826 ± 0.044   s/op
VectorizedReadFlatParquetDataBenchmark.readDoublesSparkVectorized5k         ss    5  3.082 ± 0.158   s/op
VectorizedReadFlatParquetDataBenchmark.readFloatsIcebergVectorized5k        ss    5  2.392 ± 0.094   s/op
VectorizedReadFlatParquetDataBenchmark.readFloatsSparkVectorized5k          ss    5  2.231 ± 0.105   s/op
VectorizedReadFlatParquetDataBenchmark.readIntegersIcebergVectorized5k      ss    5  2.700 ± 0.152   s/op
VectorizedReadFlatParquetDataBenchmark.readIntegersSparkVectorized5k        ss    5  2.559 ± 0.135   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsIcebergVectorized5k         ss    5  2.773 ± 0.234   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsSparkVectorized5k           ss    5  2.593 ± 0.066   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsIcebergVectorized5k       ss    5  4.289 ± 0.157   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsSparkVectorized5k         ss    5  4.222 ± 0.176   s/op
VectorizedReadFlatParquetDataBenchmark.readTimestampsIcebergVectorized5k    ss    5  1.593 ± 0.030   s/op
VectorizedReadFlatParquetDataBenchmark.readTimestampsSparkVectorized5k      ss    5  1.554 ± 0.063   s/op

Results on master

Benchmark                                                                 Mode  Cnt  Score   Error  Units
VectorizedReadFlatParquetDataBenchmark.readDatesIcebergVectorized5k         ss    5  1.587 ± 0.106   s/op
VectorizedReadFlatParquetDataBenchmark.readDatesSparkVectorized5k           ss    5  1.515 ± 0.061   s/op
VectorizedReadFlatParquetDataBenchmark.readDecimalsIcebergVectorized5k      ss    5  9.324 ± 0.511   s/op
VectorizedReadFlatParquetDataBenchmark.readDecimalsSparkVectorized5k        ss    5  7.987 ± 0.164   s/op
VectorizedReadFlatParquetDataBenchmark.readDoublesIcebergVectorized5k       ss    5  2.829 ± 0.239   s/op
VectorizedReadFlatParquetDataBenchmark.readDoublesSparkVectorized5k         ss    5  2.381 ± 0.142   s/op
VectorizedReadFlatParquetDataBenchmark.readFloatsIcebergVectorized5k        ss    5  2.399 ± 0.088   s/op
VectorizedReadFlatParquetDataBenchmark.readFloatsSparkVectorized5k          ss    5  2.584 ± 0.230   s/op
VectorizedReadFlatParquetDataBenchmark.readIntegersIcebergVectorized5k      ss    5  2.664 ± 0.083   s/op
VectorizedReadFlatParquetDataBenchmark.readIntegersSparkVectorized5k        ss    5  2.535 ± 0.090   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsIcebergVectorized5k         ss    5  2.740 ± 0.171   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsSparkVectorized5k           ss    5  2.255 ± 0.128   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsIcebergVectorized5k       ss    5  4.803 ± 0.142   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsSparkVectorized5k         ss    5  4.326 ± 0.286   s/op
VectorizedReadFlatParquetDataBenchmark.readTimestampsIcebergVectorized5k    ss    5  1.946 ± 0.152   s/op
VectorizedReadFlatParquetDataBenchmark.readTimestampsSparkVectorized5k      ss    5  1.551 ± 0.042   s/op

vectorized-read-flat-parquet-data-result-branch.txt
vectorized-read-flat-parquet-data-result-master.txt

Contributor

@rymurr rymurr left a comment

looks good, I would rather we lazily convert the decimal value though

@nastra nastra requested a review from rymurr September 17, 2021 09:12
@nastra nastra force-pushed the arrow-support-decimal branch from 4b5e344 to 0703854 Compare September 18, 2021 07:19
DecimalVector decimalVector = new DecimalVector(
    vectorHolder.vector().getName(),
    ArrowSchemaUtil.convert(vectorHolder.icebergField()).getFieldType(),
    vectorHolder.vector().getAllocator());
Contributor

I don't think that this should be doing any allocation. We should pass in the correct vector and this should just fill it with data. That way, we can control allocation and reuse buffers.

Contributor

Looks like this allocation is also the reason why there is a breaking change to the VectorHolder API. Can we avoid the breaking change as well by fixing this?

Contributor Author

I was thinking about the same approach you suggested when I was implementing this. However, when we reach this point, the ArrowVectorAccessor in this particular case is a DictionaryDecimalIntAccessor, which holds an IntVector and not a DecimalVector, so we can't just pass the correct vector (or maybe I'm misunderstanding what you mean by "just pass the correct vector").

In order to initialize the correct Arrow vector (DecimalVector in this particular case), it's not enough to just know the Type. We actually need the entire NestedField to know whether it's a required or optional field, hence the change from Type to NestedField in the VectorHolder.
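
To illustrate the point about Type versus NestedField, here is a minimal sketch using simplified stand-in classes. `DecimalTypeInfo`, `FieldInfo`, and `VectorSpec` are hypothetical names made up for this example and are not the Iceberg or Arrow classes; the sketch only assumes that the type carries precision and scale while the field adds a name and the optional flag.

```java
// Hypothetical stand-in for a decimal type: it knows precision and scale only.
final class DecimalTypeInfo {
  final int precision;
  final int scale;

  DecimalTypeInfo(int precision, int scale) {
    this.precision = precision;
    this.scale = scale;
  }
}

// Hypothetical stand-in for a nested field: the type plus a name and optionality.
final class FieldInfo {
  final String name;
  final boolean optional;
  final DecimalTypeInfo type;

  FieldInfo(String name, boolean optional, DecimalTypeInfo type) {
    this.name = name;
    this.optional = optional;
    this.type = type;
  }
}

// Hypothetical description of the vector to create for a field.
final class VectorSpec {
  final String name;
  final boolean nullable;
  final int precision;
  final int scale;

  private VectorSpec(String name, boolean nullable, int precision, int scale) {
    this.name = name;
    this.nullable = nullable;
    this.precision = precision;
    this.scale = scale;
  }

  // Nullability comes from the field, not from the type, so the type alone
  // would not be enough to build this spec.
  static VectorSpec fromField(FieldInfo field) {
    return new VectorSpec(
        field.name, field.optional, field.type.precision, field.type.scale);
  }
}
```

With only a `DecimalTypeInfo` in hand, the `nullable` flag and the vector name would have to be guessed, which is roughly the gap described above.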

Contributor

I understand the need for NestedField in order to create the vector here. But I don't think that this should be allocating anything. It should instead copy from the IntVector to DecimalVector, both of which should be passed in.

The current structure always passes vectors into readers to be filled with data. Sometimes those are reallocated, but we try to pass in the last set of vectors to avoid allocation. The same should apply here.

For example, if you're reading a table of (int, string) then we allocate a vector for each and pass them into the read method. If that int is actually a decimal, then we should create an appropriate decimal vector and pass that in as well so it can be reused through the same lifecycle.
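
The lifecycle described above could be sketched roughly as follows. This is plain Java with hypothetical stand-ins (`DecimalBuffer`, `DictionaryDecimalReader`), not the Arrow or Iceberg API: the caller owns the target buffer and passes it into every read, and the reader only fills it, growing it when a batch does not fit.

```java
import java.math.BigDecimal;

// Hypothetical stand-in for a caller-owned, reusable decimal vector.
final class DecimalBuffer {
  private BigDecimal[] values;
  private int valueCount;
  private int allocations; // number of times the backing array was allocated

  DecimalBuffer(int capacity) {
    this.values = new BigDecimal[capacity];
    this.allocations = 1;
  }

  // Reallocate only when the batch does not fit; otherwise keep the array.
  void ensureCapacity(int required) {
    if (values.length < required) {
      values = new BigDecimal[required];
      allocations++;
    }
  }

  void set(int index, BigDecimal value) {
    values[index] = value;
  }

  BigDecimal get(int index) {
    return values[index];
  }

  void setValueCount(int count) {
    valueCount = count;
  }

  int valueCount() {
    return valueCount;
  }

  int allocations() {
    return allocations;
  }
}

// Hypothetical reader that decodes dictionary ids into decimals.
final class DictionaryDecimalReader {
  private final long[] dictionary; // unscaled values, indexed by dictionary id
  private final int scale;

  DictionaryDecimalReader(long[] dictionary, int scale) {
    this.dictionary = dictionary;
    this.scale = scale;
  }

  // Fills the buffer passed in by the caller; never creates a new DecimalBuffer.
  void read(int[] ids, int count, DecimalBuffer target) {
    target.ensureCapacity(count);
    for (int i = 0; i < count; i++) {
      target.set(i, BigDecimal.valueOf(dictionary[ids[i]], scale));
    }
    target.setValueCount(count);
  }
}

final class ReuseDemo {
  public static void main(String[] args) {
    DictionaryDecimalReader reader =
        new DictionaryDecimalReader(new long[] {1050L, 299L, 12345L}, 2);
    DecimalBuffer target = new DecimalBuffer(4);

    reader.read(new int[] {0, 2, 1, 0}, 4, target); // first batch
    reader.read(new int[] {1, 1}, 2, target);       // second batch reuses the buffer

    System.out.println(target.get(0) + " allocations=" + target.allocations());
  }
}
```

Because the second batch fits into the buffer created for the first, the allocation counter stays at one across both reads. In the real code the same idea would presumably apply to the IntVector holding dictionary ids and the DecimalVector receiving the decoded values.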

Contributor Author

@rdblue I just pushed a new version where we try and re-use the non-dict-encoded vector in https://github.com/apache/iceberg/pull/3023/files#diff-8b31f9172c09339b4a3bc06e6fc288e2b1e34de90091a3d73a51b84e89bd97a9R189. Is this what you had in mind?

@nastra nastra force-pushed the arrow-support-decimal branch 2 times, most recently from c74fded to 7d029af Compare October 14, 2021 16:03
@nastra nastra force-pushed the arrow-support-decimal branch from 7d029af to 82faecd Compare October 21, 2021 12:17
@nastra nastra closed this Apr 28, 2023
@nastra nastra deleted the arrow-support-decimal branch April 28, 2023 08:40