Skip to content
This repository was archived by the owner on May 10, 2024. It is now read-only.

PARQUET-545: Improve API to support decimal type#65

Closed
majetideepak wants to merge 15 commits intoapache:masterfrom
majetideepak:PARQUET-545
Closed

PARQUET-545: Improve API to support decimal type#65
majetideepak wants to merge 15 commits intoapache:masterfrom
majetideepak:PARQUET-545

Conversation

@majetideepak
Copy link

This PR also

  1. Improves scanner coverage.
  2. Implements PARQUET-546
  3. Adds tests for Types to string

@majetideepak
Copy link
Author

@wesm I don't see much to do for DECIMAL types except to check that the scale and precision are set properly in the ColumnDescriptor. I modified the ColumnDescriptor and I am yet to add some test cases.
Do you have other thoughts here?
I also implemented PARQUET-546. Can you have a look ? I will add test cases if you are okay with the new API.
Thanks!

TEST(FLBAEncodeDecode, TestEncodeDecode) {
schema::NodePtr node;
node = schema::PrimitiveNode::MakeFLBA("name", Repetition::OPTIONAL,
flba_length, LogicalType::UTF8);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have a single API to create Primitive nodes.

@majetideepak
Copy link
Author

I just looked at parquet-mr and it looks like they check if the length precision and scale validate the decimal type. I will add this check.

@wesm
Copy link
Member

wesm commented Feb 26, 2016

I commented on https://issues.apache.org/jira/browse/PARQUET-545. Impala has a lot of decimal-related code, but it is mostly used for computations in the runtime. I'm not sure if having container types for decimal data (and stuff like coercing to/from double -- that would be useful IMHO) would be helpful overall. Being able to print decimals would be nice. It's a can of worms though

@majetideepak majetideepak changed the title Parquet 545: (WIP) Support for decimals Parquet 545: Improve API to support decimal type Feb 26, 2016
@majetideepak
Copy link
Author

@wesm I agree. I just commented on https://issues.apache.org/jira/browse/PARQUET-545 too. In this patch, I want to extend the ColumnDescriptor API to be able to extract the scale and precision values required to interpret decimals. We can use the Impala Decimal code at the Scanner level in a later patch.


bool is_required() const {
return max_definition_level_ == 0;
return primitive_node_->is_required();
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be using the Type values set in the PrimitiveNode.

@majetideepak
Copy link
Author

@wesm I feel this patch is important for the release as well. Can you please give feedback. I will rebase this off #64.

bool is_repeated() const {
return max_repetition_level_ > 0;
return primitive_node_->is_repeated();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes have fundamentally changed the nature of their results -- they can't be used like they were in the scanner any more, because you could have required types with definition level > 0:

optional group bag
  repeated group list
    required int32 item

I suggest using the max repetition/definition levels in the scanner to avoid this issue

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense. I did not think about the groups. Will revert this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this didn't cause test failures, it may be worth trying to add a failing test case

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have any groups tests yet. This code works for all flat schemas.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add some GroupNode tests in the descriptor tests

ASSERT_THROW(descr_.Init(node), ParquetException);
}

TEST_F(TestSchemaDescriptor, DescriptorRepetitionValues) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests for repetition values of schemas with GroupNodes

@majetideepak
Copy link
Author

The changes are complete. I will wait for the test coverage report.

@wesm
Copy link
Member

wesm commented Feb 26, 2016

Cool I can review in a little bit

reinterpret_cast<TypedScanner<FLBAType::type_num>* >(scanner_.get());
ASSERT_EQ(10, scanner->descr()->type_precision());
ASSERT_EQ(2, scanner->descr()->type_scale());
ASSERT_EQ(FLBA_LENGTH, scanner->descr()->type_length());
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test for ColumnDescriptor API extensions in this patch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should probably go in their own test case.

@wesm
Copy link
Member

wesm commented Feb 26, 2016

Looks like we're blocked on Travis builds again for the next few hours. Let me know on these relatively minor issues and I'll look again a bit later. I'm going to close out the remaining clean-up patches as quickly as I can

@wesm
Copy link
Member

wesm commented Feb 26, 2016

Also, the PR title needs to be changed to PARQUET-545

@majetideepak majetideepak changed the title Parquet 545: Improve API to support decimal type Parquet-545: Improve API to support decimal type Feb 27, 2016
@majetideepak
Copy link
Author

I am making these changes now.

@wesm
Copy link
Member

wesm commented Feb 27, 2016

+1, thank you!

@majetideepak
Copy link
Author

@wesm Thanks!
On a separate note, will it help if we test the column reader and scanner for Dictionary pages too ?

@wesm
Copy link
Member

wesm commented Feb 28, 2016

Yes, that would be a good idea -- I only tested the lowest tier of encoding. Can you open a JIRA?

@wesm
Copy link
Member

wesm commented Feb 28, 2016

This issue title needs to start with PARQUET-545: (capital letters) or the merge tool will fail

@majetideepak majetideepak changed the title Parquet-545: Improve API to support decimal type PARQUET-545: Improve API to support decimal type Feb 28, 2016
@majetideepak
Copy link
Author

@julienledem
Copy link
Member

+1

@asfgit asfgit closed this in ebb45b1 Feb 29, 2016
@majetideepak majetideepak deleted the PARQUET-545 branch March 3, 2016 15:00
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants