Add `ParquetMetadataWriter` allow ad-hoc encoding of `ParquetMetadata` #6000

adriangb · 2024-07-03T18:23:46Z

A step towards #5988, #6002

alamb

Thanks @adriangb -- this PR looks good to me and I think we could proceed with this design.

I did file #6002 to track a potentially more flexible API that I think is worth considering. However, adding this API to mirror decode_metadata I think would also be fine (and we could make a more complex API later)

alamb · 2024-07-04T10:00:04Z

parquet/src/file/footer.rs

+
+        let encoded = encode_metadata(&metadata).unwrap();
+        let decoded = decode_metadata(&encoded).unwrap();
+        assert_eq!(


Can you simply just assert that encoded == decoded?

alamb · 2024-07-04T10:01:11Z

parquet/src/file/footer.rs

+        {
+            assert_eq!(a, b);
+        }
+        // TODO: add encoding and decoding of column and offset indexes (aka page indexes)


I agree that encoding/decoding of these structures doesn't have to be present in the initial PR, however given they are stored out of line / slightly differently than the other structures I think it would be good to ensure we could encode them using this same API

alamb · 2024-07-04T10:03:55Z

parquet/src/file/footer.rs

+/// specified by the [Parquet Spec].
+///
+/// [Parquet Spec]: https://github.com/apache/parquet-format#metadata
+pub fn encode_metadata(metadata: &ParquetMetaData) -> Result<Vec<u8>> {


Is it possible to switch the existing writers to use this API as well? Not only would that avoid code duplication, it would ensure the API is general enough

For example, I wonder if it would make sense for this function signature to be more like

/// write the metadata to the target `std::io::Write`, returning the number of bytes written
pub fn encode_metadata<W: Write>(metadata: &ParquetMetaData) -> Result<usize> { ... }

That would allow writing into a Vec but also allow writing into various other targets and perhaps avoid buffering
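
A minimal sketch of the `Write`-based shape suggested above, assuming the writer is passed in as an extra argument. `FooterSketch` and its byte layout are hypothetical stand-ins, not the real `ParquetMetaData` or its Thrift encoding; the only point is that one function can target a `Vec<u8>`, a file, or any other `std::io::Write`, and report how many bytes it wrote.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for the real metadata type (an assumption, not the parquet API).
pub struct FooterSketch {
    pub num_rows: i64,
    pub created_by: String,
}

/// Write the metadata to any `std::io::Write`, returning the number of bytes written.
/// The layout (8-byte row count, 4-byte length, UTF-8 bytes) is made up for
/// illustration; it is not the Thrift encoding used by Parquet.
pub fn encode_metadata<W: Write>(metadata: &FooterSketch, mut writer: W) -> io::Result<usize> {
    let name = metadata.created_by.as_bytes();
    writer.write_all(&metadata.num_rows.to_le_bytes())?;
    writer.write_all(&(name.len() as u32).to_le_bytes())?;
    writer.write_all(name)?;
    Ok(8 + 4 + name.len())
}
```

Passing `&mut Vec<u8>` gives the buffered behavior of the current signature, while passing a file or socket would skip the intermediate buffer.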

adriangb · 2024-07-04T11:23:20Z

@alamb I pushed a fluentish API version of this.

I got bogged down implementing the page index writing because there doesn't seem to be a clean path to go from a ParquetMetadata's PageLocation and Index to the thrift OffsetIndex and ColumnIndex. I think the thing is that the current writers never materialize a ParquetMetadata and thus forcing them to do so might introduce unnecessary overhead. Maybe the path to go from a ParquetMetadata to bytes shouldn't be merged with writers? But also maybe I just couldn't come up with a good implementation and with more trial or with your help we can get there.

I do think the readers could be merged.

For this encoder to make sense I think it should have an option to handle page indexes and have it enabled and working by default (like the writers do).

adriangb · 2024-07-06T22:43:55Z

One thing I can do to avoid blocking on my lack of knowledge of encoding the page index stuff is to design the API first and implement it later. E.g. we can add .with_page_index(bool) and error if you set it to true or don't set it at all so that you're forced to acknowledge that the future default will be true.
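
A rough sketch of that "design the API first" idea. `EncoderSketch`, `finish`, and the error strings are hypothetical names chosen for illustration, not part of the parquet crate; the option is stored as `Option<bool>` so that "never set" can be told apart from an explicit `false`.

```rust
/// Hypothetical encoder builder (names are assumptions, not the parquet crate API).
#[derive(Debug, Default)]
pub struct EncoderSketch {
    // `None` means the caller never chose, which is rejected on purpose.
    page_index: Option<bool>,
}

impl EncoderSketch {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record the caller's explicit choice about page index encoding.
    pub fn with_page_index(mut self, enabled: bool) -> Self {
        self.page_index = Some(enabled);
        self
    }

    /// Validate the options; only an explicit `false` is accepted for now.
    pub fn finish(self) -> Result<(), String> {
        match self.page_index {
            Some(false) => Ok(()),
            Some(true) => Err("page index encoding is not implemented yet".to_string()),
            None => Err("call with_page_index(false); the future default will be true".to_string()),
        }
    }
}
```

Once page index encoding exists, the `None` arm could switch to meaning `true` without surprising callers who already opted out explicitly.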

alamb · 2024-07-08T10:04:35Z

Thanks @adriangb -- I will try and review this PR today

alamb · 2024-07-11T00:09:11Z

Working through the list of PRs in arrow-rs is on my list of things to do tomorrow

alamb

Thanks @adriangb -- this is looking like a good start

I think we should try and structure the code so the existing writer uses this new MetadataEncoder, which would keep metadata writing consistent as well as enable use cases like encoding bloom filters, etc.

Let me know what you think.

cc @sunchao @tustvold @Jefffrey @liukun4515 @nevi-me for any thoughts you might have on this API / approach

alamb · 2024-07-11T10:03:19Z

parquet/src/file/metadata/mod.rs

@@ -86,7 +86,7 @@ pub type ParquetOffsetIndex = Vec<Vec<Vec<PageLocation>>>;
 ///
 /// [`parquet.thrift`]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
 /// [`parse_metadata`]: crate::file::footer::parse_metadata
-#[derive(Debug, Clone)]
+#[derive(Debug, Clone, PartialEq)]


alamb · 2024-07-11T10:16:40Z

parquet/src/file/footer.rs

+        let column_orders = encode_column_orders(metadata.file_metadata().column_orders());
+        let schema = types::to_thrift(&metadata.file_metadata().schema().clone())?;
+
+        let t_file_metadata = TFileMetaData {


I noticed that this is not quite the same code as used in the actual writer (specifically, the column order is handled differently), so I worry it would be inconsistent with, or drift over time from, the actual writer

arrow-rs/parquet/src/file/writer.rs

Lines 352 to 375 in 22e0b44

        // We only include ColumnOrder for leaf nodes.
        // Currently only supported ColumnOrder is TypeDefinedOrder so we set this
        // for all leaf nodes.
        // Even if the column has an undefined sort order, such as INTERVAL, this
        // is still technically the defined TYPEORDER so it should still be set.
        let column_orders = (0..self.schema_descr().num_columns())
            .map(|_| parquet::ColumnOrder::TYPEORDER(parquet::TypeDefinedOrder {}))
            .collect();
        // This field is optional, perhaps in cases where no min/max fields are set
        // in any Statistics or ColumnIndex object in the whole file.
        // But for simplicity we always set this field.
        let column_orders = Some(column_orders);
        let file_metadata = parquet::FileMetaData {
            num_rows,
            row_groups,
            key_value_metadata,
            version: self.props.writer_version().as_num(),
            schema: types::to_thrift(self.schema.as_ref())?,
            created_by: Some(self.props.created_by().to_owned()),
            column_orders,
            encryption_algorithm: None,
            footer_signing_key_metadata: None,
        };

Thus what I suggest we do here is change writer.rs to use the ParquetMetadataEncoder and refactor the code from there into this function. That would be a bit more involved but I think would set us up nicely so that metadata encoding remains consistent.

adriangb · 2024-07-11T15:41:55Z

> I think we should try and structure the code so the existing writer uses this new MetadataEncoder which would keep metadata writing consistent as well as enable usecases like encoding bloom filters, etc.

I completely agree. That's just a much bigger chunk to bite off, I can give it a shot but I may need support to get there.

adriangb · 2024-07-12T04:07:59Z

I've made some progress. I made a (very rough) metadata writer that is used internally by SerializedFileWriter and can encode from a ParquetMetadata. My plan of attack from here:

1. Implement reading of metadata without needing to have the entire file available. There's already MetadataLoader as pointed out in API for encoding/decoding ParquetMetadata with more control #6002 (comment) but it wants to read metadata from an entire file and I think needs to be refactored to be able to load metadata when that's all you have.
2. Get feedback here on the APIs (they really aren't pretty).
3. Add roundtrip tests.

alamb

I think this is looking quite nice @adriangb and I think we should try and proceed with this approach.

I think it would be easier to make progress if we can work on the approach incrementally as multiple smaller PRs rather than one large one (it will be easier for me to give you timely feedback)

Also, it is probably good to know of #5486 from @etseidl which could conflict as we change the metadata.

Also #5933 from @progval

Given we are now being careful about breaking changes (see https://github.com/apache/arrow-rs/blob/master/CONTRIBUTING.md#breaking-changes) I am worried that these PRs will interact / cause conflicts with each other

What do you think of this idea: #6050 ?

alamb · 2024-07-13T11:03:35Z

parquet/src/file/writer.rs

+            Some(self.props.created_by().to_string()),
+            self.props.writer_version().as_num(),
+        );
+        encoder.finish()


alamb · 2024-07-13T11:04:37Z

parquet/src/file/writer.rs


        let mut row_groups = self
            .row_groups
-            .as_slice()
            .iter()
            .map(|v| v.to_thrift())
            .collect::<Vec<_>>();

        self.write_bloom_filters(&mut row_groups)?;


FWIW #5933 also contains changes for bloom filter writing

alamb · 2024-07-13T11:23:25Z

parquet/src/file/writer.rs

@@ -791,23 +710,274 @@ impl<'a, W: Write + Send> PageWriter for SerializedPageWriter<'a, W> {
    }
 }

+struct ThriftMetadataWriter<'a, W: Write> {


When reading the parquet code I always get confused about which structures are the generated Thrift structures and which are the structures in https://docs.rs/parquet/latest/parquet/file/metadata/index.html

I like how you have split out writing of the thrift structures here from the writing of the parquet::file structures

alamb · 2024-07-13T11:26:45Z

parquet/src/file/writer.rs

+        Ok(())
+    }
+
+    fn convert_column_indexes(&self) -> Vec<Vec<Option<ColumnIndex>>> {


I was looking around for another copy of this code and I now see that this is the first time we are going from Index --> ColumnIndex

Makes sense to me. I think this type of structure could really help clean up some of the tests too (but I am getting ahead of myself)

alamb · 2024-07-13T11:30:56Z

parquet/src/file/writer.rs

-
-        let file_metadata = parquet::FileMetaData {
-            num_rows,
+        let encoder = ThriftMetadataWriter::new(


This might read nicer like this:

let encoder = ThriftMetadataWriter::new()
    .with_schema(&self.schema)
    .with_descr(&self.descr)
    .with_row_groups(row_groups)
    ...;
// encode the data to buf
encoder.encode(&mut buf)

Though I realize many of these fields are required

Maybe something like

let encoder = ThriftMetadataWriter::new(
    &self.schema,
    &self.descr,
    ...
)
.with_column_indexes(&self.column_indexes)
.with_offset_indexes(&self.offset_indexes);
encoder.encode(&mut buf)
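
To make the second shape concrete, here is a small self-contained sketch in which required inputs go through `new` and optional ones through `with_*` methods. `WriterSketch`, its field types, and the toy byte layout are all assumptions for illustration; the real `ThriftMetadataWriter` takes different types and writes Thrift.

```rust
/// Hypothetical writer: required inputs go through `new`, optional ones through `with_*`.
/// All names and field types are illustrative assumptions.
pub struct WriterSketch<'a> {
    schema: &'a str,
    row_groups: Vec<u64>,
    column_indexes: Option<&'a [u8]>,
    offset_indexes: Option<&'a [u8]>,
}

impl<'a> WriterSketch<'a> {
    /// Required fields cannot be forgotten because `new` demands them.
    pub fn new(schema: &'a str, row_groups: Vec<u64>) -> Self {
        Self { schema, row_groups, column_indexes: None, offset_indexes: None }
    }

    pub fn with_column_indexes(mut self, indexes: &'a [u8]) -> Self {
        self.column_indexes = Some(indexes);
        self
    }

    pub fn with_offset_indexes(mut self, indexes: &'a [u8]) -> Self {
        self.offset_indexes = Some(indexes);
        self
    }

    /// Append a toy encoding to `buf` and return the number of bytes appended.
    pub fn encode(&self, buf: &mut Vec<u8>) -> usize {
        let start = buf.len();
        buf.extend_from_slice(self.schema.as_bytes());
        for rg in &self.row_groups {
            buf.extend_from_slice(&rg.to_le_bytes());
        }
        if let Some(ci) = self.column_indexes {
            buf.extend_from_slice(ci);
        }
        if let Some(oi) = self.offset_indexes {
            buf.extend_from_slice(oi);
        }
        buf.len() - start
    }
}
```

Compared with an all-`with_*` builder, this shape turns a missing required field into a compile error rather than a runtime check.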

etseidl · 2024-07-14T00:14:54Z

parquet/src/file/writer.rs

+        if let Some(row_group_offset_indexes) = self.metadata.offset_index() {
+            (0..self.metadata.row_groups().len())
+                .map(|rg_idx| {
+                    let column_indexes = &row_group_offset_indexes[rg_idx];


Minor nit: could this be named offset_indexes?

etseidl · 2024-07-15T17:31:25Z

parquet/src/file/page_index/index.rs

+        let null_counts = self
+            .indexes
+            .iter()
+            .map(|x| x.null_count())
+            .collect::<Option<Vec<_>>>()
+            .unwrap_or_else(|| vec![0; self.indexes.len()]);


While merging with #5486, I noticed this. IIUC, if on read the optional thrift ColumnIndex::null_counts is not present, then the PageIndex::null_count will be None. When converting back to a thrift ColumnIndex, it appears that this will convert the missing null_counts into a vector of num_pages zeros. I don't know if this is the correct behavior, mostly because the spec is (AFAICT) silent on the interpretation of a non-present null_counts. Is it not present as an optimization when there are no nulls, or is it not present due to a lack of information (say a V1 encoder doesn't keep null counts since the V1 page header doesn't require them)? Due to that ambiguity I think null_counts here should be None if any or all of the PageIndex::null_count fields is None. Perhaps stop after the collect() and pass null_counts directly below.
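
The difference between the two behaviors can be shown with plain `Option` values. This is a sketch only: the function names are hypothetical, and a slice of `Option<i64>` stands in for the per-page `null_count` fields.

```rust
/// Behavior of the hunk above: one missing count turns the whole vector into zeros.
pub fn null_counts_zero_filled(pages: &[Option<i64>]) -> Vec<i64> {
    pages
        .iter()
        .copied()
        .collect::<Option<Vec<_>>>()
        .unwrap_or_else(|| vec![0; pages.len()])
}

/// Suggested behavior: stop after `collect()`, so any missing count yields `None`.
pub fn null_counts_optional(pages: &[Option<i64>]) -> Option<Vec<i64>> {
    pages.iter().copied().collect()
}
```

Collecting an iterator of `Option<T>` into `Option<Vec<T>>` short-circuits to `None` on the first missing element, which is what keeps "unknown" distinct from "zero nulls".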

alamb · 2024-07-16T15:46:54Z

Update here is I plan to make a 53 dev branch today so we can start getting this code merged and iterate on the API

alamb · 2024-07-16T22:57:31Z

Hi @adriangb -- I changed this PR to point at the 53.0.0-dev branch. I plan to give it a careful review tomorrow and then I am thinking we can merge it and iterate over the course of a few PRs

Again, I am really sorry for the delay in reviewing. I think this is a really important feature but I have been overwhelmed with reviews for the last week or two

adriangb · 2024-07-17T02:25:02Z

Thank you @alamb! No need to apologize; you have such a diverse and impactful contribution to open source, your time management is really quite inspiring. If anything I need to apologize for lagging on applying feedback. I will go over this PR and incorporate feedback (hopefully before your review tomorrow).

alamb

Here is how I suggest we proceed with this PR:

1. Let's create an example with the use case described in API for encoding/decoding ParquetMetadata with more control #6002 (comment) (I will try to do this later today). I think this will motivate what the API looks like.
2. In parallel we could pull out some of the simple usability changes (like adding PartialEq and pub use thrift stuff) into their own PR so we can merge that.

alamb · 2024-07-17T21:53:39Z

I started on a basic example here: #6081 -- tomorrow I'll try and find time to try and rebase it on this PR and see if I can do what is needed


alamb

Thanks again @adriangb -- I think this is looking really close

While looking through the API for #6097 I had a few more suggestions, but then I think this would be ready. What do you think @etseidl ? I feel this is close to what is proposed in #6095 and related PRs. I feel we are quite close to having some sort of reasonable API for reading / writing these structures:

- Metadata is stored in ParquetFileMetadata and associated structures
- Write ParquetFileMetadata to bytes using ThriftMetadataWriter (and maybe there will be an async version)
- (Eventually) we can have an equivalent ThriftMetadataReader (and the async version `MetadataLoader`)

alamb · 2024-07-21T11:37:33Z

parquet/src/file/page_index/index.rs

@@ -168,6 +168,38 @@ impl<T: ParquetValueType> NativeIndex<T> {
            boundary_order: index.boundary_order,
        })
    }
+
+    pub(crate) fn to_column_index(&self) -> ColumnIndex {


I think calling this method to_thrift might be more consistent with other APIs like

https://docs.rs/parquet/latest/parquet/file/metadata/struct.RowGroupMetaData.html#method.to_thrift

The naming is already pretty confusing


alamb · 2024-07-21T11:51:28Z

parquet/src/file/writer.rs

+    buf: &'a mut TrackedWrite<W>,
+    schema: &'a TypePtr,
+    schema_descr: &'a SchemaDescPtr,
+    row_groups: Vec<RowGroup>,


Rather than storing these fields separately, I wonder if it would be possible simply to store a https://docs.rs/parquet/latest/parquet/file/metadata/struct.FileMetaData.html

(can be done as a follow on PR)

Something like

struct ThriftMetadataWriter<'a, W: Write> {
    buf: &'a mut TrackedWrite<W>,
    parquet_metadata: &ParquetMetadata, // or maybe Arc<ParquetMetadata> 🤔
}

And then add the various builder APIs like with_key_value_metadata directly to ParquetMetadata

This would make it easier to work / manipulate ParquetMetadata in general and would make the responsibility of the writer clearer (handle the details of coordinating the writing of the thrift encoded structures and their indexes)

I agree, but let's do that in a follow-up PR since (I think) that would not be breaking in any way and this is quite large already


etseidl · 2024-07-22T17:39:48Z

> What do you think @etseidl ? I feel this is close to what is proposed in #6095 and related PRs.

Yes, I think this is ready to merge into 53.0.0-dev. The unencoded size info is still separate from the page locations, so hopefully this will merge in fairly cleanly. From a practical standpoint, I think it will be easier to merge this before #6095, and then I can make the needed changes to the new offset index struct to implement to_thrift().

adriangb · 2024-07-22T20:03:12Z

I think I've addressed the feedback and updated the branch :)


adriangb · 2024-07-24T16:29:49Z

I'm not sure why the test is failing (it was before, I don't think it's from a merge). Need to investigate.

etseidl · 2024-07-24T16:37:57Z

> I'm not sure why the test is failing (it was before, I don't think it's from a merge). Need to investigate.

I think you'll need to merge 53.0.0-dev again to pick up the latest changes to the offset index, and then reformat (some new names are longer and changed how the linter wants lines wrapped).

adriangb · 2024-07-24T19:35:21Z

I've updated the branch and cleaned up, test is still failing. It seems the reading part is trying to access byte 0 of the file, which doesn't make sense and makes me think there's a bug somewhere (could be in the test since there's a lot of shim in there): https://github.com/apache/arrow-rs/actions/runs/10082832978/job/27878006690?pr=6000#step:6:761

etseidl · 2024-07-24T22:35:57Z

parquet/src/file/writer.rs

+
+        let data = buf.into_inner().freeze();
+
+        let decoded_metadata = load_metadata_from_bytes(metadata.file_size, data).await;


Suggested change

-        let decoded_metadata = load_metadata_from_bytes(metadata.file_size, data).await;
+        let decoded_metadata = load_metadata_from_bytes(data.len(), data).await;

This will load the page indexes, but then the assert below fails because the offset_index_offset and column_index_offset fields of the column chunk are different. Might have to write an equals that accounts for that.
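
One way such an "equals that accounts for that" might look, sketched on a trimmed-down stand-in type. `ColumnChunkSketch` and `eq_ignoring_index_offsets` are hypothetical; only the two offset field names come from the comment above, and the real column chunk metadata has many more fields.

```rust
/// Hypothetical, trimmed-down column chunk record (not the parquet crate type).
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnChunkSketch {
    pub num_values: i64,
    pub column_index_offset: Option<i64>,
    pub offset_index_offset: Option<i64>,
}

/// Compare two chunks while ignoring where their page indexes were written.
pub fn eq_ignoring_index_offsets(a: &ColumnChunkSketch, b: &ColumnChunkSketch) -> bool {
    // Clear the position-dependent fields on copies, then reuse derived equality.
    let strip = |c: &ColumnChunkSketch| ColumnChunkSketch {
        column_index_offset: None,
        offset_index_offset: None,
        ..c.clone()
    };
    strip(a) == strip(b)
}
```

Reusing the derived `PartialEq` on stripped copies means any field added later is compared by default, instead of being silently skipped by a hand-written comparison.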

alamb · 2024-07-26T10:15:57Z

I merged the 53 dev branch ~~and that seems to have closed this PR~~ -- any chance you can retarget main?

Update: I restored the branch

alamb · 2024-07-26T13:02:19Z

I wrote up some thoughts that were floating in my head in #6129

I am hoping to spend some more time today looking at this PR in detail

Thank you again for your patience

github-actions bot added the parquet Changes to the parquet crate label Jul 3, 2024

adriangb changed the title ~~Add function to mirror~~ Add encode_metadata function to mirror decode_metadata and allow ad-hoc encoding of ParquetMetadata Jul 3, 2024

alamb mentioned this pull request Jul 4, 2024

API for encoding/decoding ParquetMetadata with more control #6002

Open

alamb reviewed Jul 4, 2024


alamb mentioned this pull request Jul 7, 2024

DataFusion weekly project plan (Andrew Lamb) - July 1, 2024 apache/datafusion#11190

Closed

10 tasks

alamb mentioned this pull request Jul 8, 2024

DataFusion weekly project plan (Andrew Lamb) - July 8, 2024 apache/datafusion#11334

Closed

9 tasks

alamb reviewed Jul 11, 2024


adriangb force-pushed the add-encode_metadata branch from afa975d to d7a4156 Compare July 12, 2024 04:01

This was referenced Jul 13, 2024

Minor: clarify the relationship between file::metadata and format in docs #6049

Merged

Proposal: parquet 53.0.0 feature branch #6050

Closed

alamb reviewed Jul 13, 2024


adriangb mentioned this pull request Jul 13, 2024

Reintroduce: Write Bloom filters between row groups instead of the end #5933

Merged

etseidl reviewed Jul 14, 2024


alamb mentioned this pull request Jul 15, 2024

DataFusion weekly project plan (Andrew Lamb) - July 15, 2024 apache/datafusion#11474

Closed

7 tasks

etseidl reviewed Jul 15, 2024


alamb changed the base branch from master to 53.0.0-dev July 16, 2024 22:56

alamb reviewed Jul 17, 2024


adriangb added a commit to adriangb/arrow-rs that referenced this pull request Jul 18, 2024

Add PartialEq to ParquetMetaData and FileMetadata

c43107a

Prep for apache#6000

adriangb mentioned this pull request Jul 18, 2024

Add PartialEq to ParquetMetaData and FileMetadata #6082

Merged

adriangb changed the title ~~Add encode_metadata function to mirror decode_metadata and allow ad-hoc encoding of ParquetMetadata~~ Add ParquetMetadataWriter allow ad-hoc encoding of ParquetMetadata Jul 18, 2024

adriangb force-pushed the add-encode_metadata branch from 67545a6 to 96fa84d Compare July 18, 2024 13:14

alamb pushed a commit that referenced this pull request Jul 19, 2024

Add PartialEq to ParquetMetaData and FileMetadata (#6082)

16915b5

Prep for #6000

alamb reviewed Jul 21, 2024


alamb mentioned this pull request Jul 22, 2024

DataFusion weekly project plan (Andrew Lamb) - July 22, 2024 apache/datafusion#11601

Open

5 tasks

etseidl reviewed Jul 22, 2024


etseidl reviewed Jul 23, 2024


adriangb force-pushed the add-encode_metadata branch from b7943fc to b38ccf7 Compare July 24, 2024 19:27

Add ParquetMetadataWriter allow ad-hoc encoding of ParquetMetadata

b41173f

adriangb force-pushed the add-encode_metadata branch from b38ccf7 to b41173f Compare July 24, 2024 19:28

etseidl reviewed Jul 24, 2024


alamb deleted the branch apache:53.0.0-dev July 26, 2024 10:11

alamb closed this Jul 26, 2024

alamb reopened this Jul 26, 2024

alamb mentioned this pull request Jul 26, 2024

[DISCUSSION] Parquet Metadata Improvements #6129

Open

etseidl added a commit to etseidl/arrow-rs that referenced this pull request Jul 26, 2024

add to_thrift to NativeIndex in prep for apache#6000

e8a0b7f

