
Add a simple test to execute TableScan using Parquet files #869

Closed


mbasmanova
Contributor

@mbasmanova mbasmanova commented Jan 11, 2022

  • Add a small test to execute TableScan using Parquet files with a pushed down filter and an aggregation on top.
  • Update HiveConnector to destroy RowReader before destroying Reader. ParquetRowReader holds an instance of an Allocator which needs to be alive as long as ParquetRowReader is alive.

Fixes #846
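
The destruction-order fix described above can be illustrated with a minimal, self-contained sketch. All types here (`Allocator`, `Reader`, `RowReader`, `DataSource`) are hypothetical stand-ins, not the Velox classes, and the sketch assumes the allocator is owned by the reader while the row reader only refers to it. It relies on the C++ rule that non-static data members are destroyed in reverse order of declaration.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical allocator; stands in for whatever the row reader depends on.
struct Allocator {};

// Hypothetical reader that owns the allocator (an assumption of this sketch).
struct Reader {
  explicit Reader(std::vector<std::string>& log) : log_(log) {}
  ~Reader() { log_.push_back("Reader"); }
  Allocator allocator;
  std::vector<std::string>& log_;
};

// Hypothetical row reader that borrows the allocator from the reader.
struct RowReader {
  RowReader(std::vector<std::string>& log, Allocator& allocator)
      : log_(log), allocator_(allocator) {}
  ~RowReader() { log_.push_back("RowReader"); }
  std::vector<std::string>& log_;
  Allocator& allocator_;
};

// Members are destroyed in reverse declaration order, so declaring `reader`
// before `rowReader` guarantees that `rowReader` is destroyed first.
struct DataSource {
  explicit DataSource(std::vector<std::string>& log)
      : reader(std::make_unique<Reader>(log)),
        rowReader(std::make_unique<RowReader>(log, reader->allocator)) {}
  std::unique_ptr<Reader> reader;        // declared first, destroyed last
  std::unique_ptr<RowReader> rowReader;  // declared last, destroyed first
};

// Builds and tears down a DataSource, returning the order of destruction.
std::vector<std::string> destructionOrder() {
  std::vector<std::string> log;
  {
    DataSource source(log);
  }  // `source` goes out of scope here
  return log;
}
```

Calling `destructionOrder()` returns the names in the order the destructors ran. Swapping the two member declarations in `DataSource` would reverse that order and leave the row reader holding a dangling reference during its own destruction.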

@facebook-github-bot facebook-github-bot added the CLA Signed label Jan 11, 2022
@mbasmanova
Contributor Author

I fixed the issue by changing the order of destruction to destroy RowReader before Reader. Will work on a proper PR.

@@ -14,7 +14,7 @@

add_subdirectory(common)
add_subdirectory(dwrf)
-if(VELOX_ENABLE_PARQUET)
+#if(VELOX_ENABLE_PARQUET)
Collaborator

Why is the if check commented out?

@@ -177,7 +177,7 @@ void ParquetRowReader::resetFilterCaches() {
}

size_t ParquetRowReader::estimatedRowSize() const {
-VELOX_FAIL("ParquetRowReader::estimatedRowSize is NYI");
+// VELOX_FAIL("ParquetRowReader::estimatedRowSize is NYI");
Collaborator

Why is this line commented out?

@@ -14,25 +14,19 @@
* limitations under the License.
*/

#include <gmock/gmock.h>
Collaborator

Are these changes relevant?

@@ -60,6 +60,7 @@ target_link_libraries(
velox_functions_lib
velox_functions_prestosql
velox_hive_connector
velox_dwio_parquet_reader
Collaborator

Should we add an if(VELOX_ENABLE_PARQUET) check? If it's not turned on, the binary won't be built.

.planNode();

auto filePath =
"/Users/mbasmanova/cpp/velox-1/velox/dwio/parquet/tests/examples/sample.parquet";
Collaborator

Does this hard-coded path need to change?

@majetideepak
Collaborator

> Will work on a proper PR

Thanks!

@mbasmanova mbasmanova force-pushed the parquet-example branch 4 times, most recently from b892e53 to 6147768 on January 12, 2022 00:01
@mbasmanova mbasmanova changed the title from "[WIP] A quick example of TableScan for Parquet files" to "Add a simple test to execute TableScan using Parquet files" on Jan 12, 2022
@mbasmanova
Contributor Author

@majetideepak @yingsu00 The PR is ready for review.

@facebook-github-bot
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

protected:
using OperatorTestBase::assertQuery;

void SetUp() override {
Contributor

nit: setUp (please modify the base function to be also camelcase).

Collaborator

This call is defined in the googletest library.

Collaborator

@majetideepak majetideepak left a comment

Changes look good to me.
I assume we cannot test a subset of columns due to #865?

parquet::registerParquetReaderFactory();
}

std::string getExampleFilePath(const std::string& fileName) {
Collaborator

nit: inline?

"velox/dwio/parquet/tests", "examples/" + fileName);
}

std::shared_ptr<connector::hive::HiveConnectorSplit> makeSplit(
Collaborator

@majetideepak majetideepak Jan 12, 2022

Can we extend makeHiveConnectorSplit inside velox/exec/tests/utils/HiveConnectorTestBase.h and remove this here?

Contributor Author

I don't see a good way of doing that, but I updated this method to re-use makeHiveConnectorSplit:

    auto split = makeHiveConnectorSplit(filePath);
    split->fileFormat = dwio::common::FileFormat::PARQUET;
    return split;
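
For readers without the Velox test utilities at hand, the pattern in the snippet above can be sketched in isolation. The `FileFormat` enum, the `HiveConnectorSplit` struct, and the body of `makeHiveConnectorSplit` below are simplified, hypothetical stand-ins rather than the real Velox definitions, and the non-Parquet default format is an assumption made only for illustration.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for dwio::common::FileFormat.
enum class FileFormat { kUnknown, kDwrf, kParquet };

// Hypothetical, heavily simplified stand-in for the Hive connector split.
struct HiveConnectorSplit {
  std::string filePath;
  FileFormat fileFormat;
};

// Stand-in for the shared test helper. This sketch assumes it produces a
// split with a default, non-Parquet file format.
std::shared_ptr<HiveConnectorSplit> makeHiveConnectorSplit(
    const std::string& filePath) {
  return std::make_shared<HiveConnectorSplit>(
      HiveConnectorSplit{filePath, FileFormat::kDwrf});
}

// Reuses the shared helper and overrides only the file format, mirroring
// the three lines quoted in the comment above.
std::shared_ptr<HiveConnectorSplit> makeSplit(const std::string& filePath) {
  auto split = makeHiveConnectorSplit(filePath);
  split->fileFormat = FileFormat::kParquet;
  return split;
}
```

The design choice here is to keep the Parquet-specific override local to the test, so the shared helper does not need a new parameter.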

});
createDuckDbTable({data});

auto filePath = getExampleFilePath("sample.parquet");
Collaborator

Inline?

auto split = makeSplit(getExampleFilePath("sample.parquet"));

Contributor Author

done.

@mbasmanova
Contributor Author

> I assume we cannot test a subset of columns due to #865?

That's right. Now that we have a basic test in place, it will be easy to repro such issues and add tests when fixing them.

…ncubator#869)

Summary:
- Add a small test to execute TableScan using Parquet files with a pushed down filter and an aggregation on top.
- Update HiveConnector to destroy RowReader before destroying Reader. ParquetRowReader holds an instance of an Allocator which needs to be alive as long as ParquetRowReader is alive.

Pull Request resolved: facebookincubator#869

Differential Revision: D33540693

Pulled By: mbasmanova

fbshipit-source-id: 26fd5cb07af1b3cf255838e3f9fc46740872a20f
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D33540693

Collaborator

@majetideepak majetideepak left a comment

Thanks, Masha!

@yingsu00
Collaborator

@mbasmanova Was this actually merged? It seems the PR was closed without merging.

@mbasmanova
Contributor Author

> @mbasmanova Was this actually merged? It seems the PR was closed without merging.

Yes, this change was merged. See https://github.com/facebookincubator/velox/commits/main

Development

Successfully merging this pull request may close these issues.

Parquet reader throws EXC_BAD_ACCESS on simple query with filter