Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test for reading read statistics from parquet files without statistics and boolean & struct data type #10608

Merged
merged 9 commits into from
May 22, 2024

Conversation

NGA-TRAN
Copy link
Contributor

Which issue does this PR close?

More tests for #10453

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label May 21, 2024
@alamb alamb changed the title read statistics from parquet without statistics Test for reading read statistics from parquet files without statistics May 21, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NGA-TRAN -- I think this just needs some comments (or maybe a builder style API)

)
}

pub fn parquet_file_one_column_stats(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please add some doc comments about what this function does / how it is different? It is clear from this PR that it adds enable_stats but that might not be obvious in the future

While we are on this topic, I personally find a struct with named parameters easier to read than something the following where you have to look up parquet_file_one_column_stats to know what the 0, 4 and 7 mean.

    let reader =
        parquet_file_one_column_stats(0, 4, 7, row_per_group, EnabledStatistics::None);

I wonder if we could change the code to something like this to make it easier to read 🤔

  let reader = TestFile {
    num_null: 0
    no_null_values_start: 4,
    no_null_values_end: 7,
    row_per_group,
    enable_stats: EnabledStatistic::None,
  }
  .build();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to do this, we need to think about parquet_file_many_columns, too.
I have to sign off now. How about we do this in following PR?

@alamb
Copy link
Contributor

alamb commented May 21, 2024

This PR also appears to have a conflict now

@NGA-TRAN NGA-TRAN changed the title Test for reading read statistics from parquet files without statistics Test for reading read statistics from parquet files without statistics and boolean & struct data type May 21, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @NGA-TRAN -- I think this is great. I took the liberty of updating the test framework in the more functional style and merged up from main.

@alamb alamb merged commit 1bf7112 into apache:main May 22, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented May 22, 2024

Since this was test only code I just merged it in to keep things moving

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…s and boolean & struct data type (apache#10608)

* read statistics from parquet without statistics

* tests for boolean and struct array

* link to a the struct array bug

* Make builder style API

* Port to builder style

* simlify

* Add column name to test

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants