From 399416452ea97ec36cedb1c0fb1aa67840047c7f Mon Sep 17 00:00:00 2001
From: emkornfield
Date: Tue, 4 Jun 2024 23:07:40 -0700
Subject: [PATCH 01/28] DRAFT: Strawman guidance on feature releases

---
 CONTRIBUTING.md | 74 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 38a845e81..bbf00bebe 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -29,3 +29,77 @@ Recommendations and requirements for how to best contribute to Parquet. We striv
 ### License
 By contributing your code, you agree to license your contribution under the terms of the APLv2:
 https://github.com/apache/parquet-format/blob/master/LICENSE
+
+### Additions/Changes to the Format
+
+The general steps for adding features to the format are as follows:
+
+1. Discuss changes on on the developer mailing list (dev@parquet.apache.org). Often times it is helpful to link to a draft pull request to make the discussion concrete. This step is complete when there lazy consensus.
+
+2. One a change has lazy consensus two implementations of the feature
+demonstrating interopability must also be provided. One implementation MUST be [parquet-java](http://github.com/apache/parquet-java). It is preferred that the second implementation be [parquet-cpp](https://github.com/apache/arrow) or [parquet-rs](https://github.com/apache/arrow-rs), however at the discretion of the PMC any
+open source Parquet implementation may be acceptable.
+
+3. After the first two steps are complete a formal vote is held on the Parquet mailing list to officially
+ratify the feature. After the vote passes the format change is merged into the parquet-format repository
+and it is expected the change in step 2 will also be merged soon after.
+
+#### General guidelines/preferences on additions.
+
+1. To the greatest extent possible changes should have an option for backwards compatibility.
+2. New encodings should be fully specified in this repository and ideally not rely on an external
+ dependencies for implementation (i.e. Parquet is the source of truth for the encoding)
+3. New compression mechanisms must have a pure Java implementation that can be used as dependency
+ in parquet-java.
+
+### Releases
+
+The parquet community aims to do releases of the format package only as needed when new features are introduced.
+If multiple new features are being proposed simultaneously some features might be consolidated into the same release. Guidance is provided below on when implementations should enable features added to the specification.
+Do to confusion in the past over parquet versioning it is not expected that there will be a 3.0 release of the specification in the foreseeable future.
+
+### Compatibility and Feature Enablement
+
+For the purposes of this discussion we classify features into the following buckets:
+
+1. Backwards compatible. A file written by an older version of a library can be read by a newer version of the
+library.
+
+2. Forwards compatible. A file written by a new version of the library can be read by an older version
+of the library.
+
+3. Forwards compatible with suboptimal performance. A file written by a new version of the library can
+be read an older version of the library but performance might be suboptimal (e.g. statistics might be missing
+from the older reader's perspective).
+
+4. Forwards incompatible. A file written with a new version of the library cannot be read by an older version
+of the library.
+
+Backwards compatibility is the concern of implementations but given the ubiquity of Parquet and the length
+of time it has been used, libraries SHOULD always support reading older format versions.
+
+The Parquet PMC recommends the following guidance for using features defined in the specification.
+
+1. Forwards compatible changes MAY be used by default in implementations once the parquet-format containing
+those changes has been formally released. These features SHOULD be turned on 1 year after the parquet-java
+implementation containing feature is released.
+
+2. Forwards compatible with suboptimal performance features MAY be used default after
+the parquet-java implementation containing the feature is released. Features in this category
+SHOULD be turned on 1 year after the parquet-java
+implementation containing the feature is released. Implementations MAY choose
+to do a major version bump when turning on a feature in this category.
+
+3. Forwards incompatible changes MAY be turned on by default 2 years after the parquet-java
+implementation containing feature is released. Features in this category SHOULD be turned on by
+default 3 years after the parquet-java implementation containing feature is released. Implementations MUST do
+a major version bump when enabling a forward incompatible feature by default.
+
+For feature released prior to October 2024, target dates for each of these categories will be updated
+as part of the parquet-java 2.0 process based on a collected feature compatibility matrix.
+
+For each release of parquet-java or parquet-format that influences this guidance it is expected
+exact dates will be added to parquet-format to provide clarity to implementors.
+
+End users of software are generally encouraged to follow the same guidance unless they have mechanisms
+for ensuring the version of all possible readers of the Parquet files.
\ No newline at end of file
From 527b8f4e5f1afad165d9accae1780dd799e8030b Mon Sep 17 00:00:00 2001
From: emkornfield
Date: Tue, 4 Jun 2024 23:38:29 -0700
Subject: [PATCH 02/28] fix some typos

---
 CONTRIBUTING.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index bbf00bebe..157522bd5 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -46,7 +46,8 @@ and it is expected the change in step 2 will also be merged soon after.

 #### General guidelines/preferences on additions.

-1. To the greatest extent possible changes should have an option for backwards compatibility.
+1. To the greatest extent possible changes should have an option for forwards compatibility
+ (old readers can still read files).
 2. New encodings should be fully specified in this repository and ideally not rely on an external
  dependencies for implementation (i.e. Parquet is the source of truth for the encoding)
 3. New compression mechanisms must have a pure Java implementation that can be used as dependency
@@ -54,7 +55,7 @@ and it is expected the change in step 2 will also be merged soon after.

 ### Releases

-The parquet community aims to do releases of the format package only as needed when new features are introduced.
+The Parquet community aims to do releases of the format package only as needed when new features are introduced.
 If multiple new features are being proposed simultaneously some features might be consolidated into the same release. Guidance is provided below on when implementations should enable features added to the specification.
 Do to confusion in the past over parquet versioning it is not expected that there will be a 3.0 release of the specification in the foreseeable future.

From ec561335128e34372b8bce63423cac05164c6acb Mon Sep 17 00:00:00 2001
From: emkornfield
Date: Wed, 5 Jun 2024 09:44:20 -0700
Subject: [PATCH 03/28] Update CONTRIBUTING.md

Co-authored-by: Rok Mihevc
---
 CONTRIBUTING.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 157522bd5..434f0b91c 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -57,7 +57,7 @@ and it is expected the change in step 2 will also be merged soon after.

 The Parquet community aims to do releases of the format package only as needed when new features are introduced.
 If multiple new features are being proposed simultaneously some features might be consolidated into the same release. Guidance is provided below on when implementations should enable features added to the specification.
-Do to confusion in the past over parquet versioning it is not expected that there will be a 3.0 release of the specification in the foreseeable future.
+Due to confusion in the past over parquet versioning it is not expected that there will be a 3.0 release of the specification in the foreseeable future.

 ### Compatibility and Feature Enablement

From e2ef8d270956531987915674aa07d6830b5055cb Mon Sep 17 00:00:00 2001
From: emkornfield
Date: Wed, 5 Jun 2024 21:10:54 -0700
Subject: [PATCH 04/28] update based on comments

---
 CONTRIBUTING.md | 58 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 36 insertions(+), 22 deletions(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 434f0b91c..819bda047 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -38,20 +38,28 @@ The general steps for adding features to the format are as follows:

 2. One a change has lazy consensus two implementations of the feature
 demonstrating interopability must also be provided. One implementation MUST be [parquet-java](http://github.com/apache/parquet-java). It is preferred that the second implementation be [parquet-cpp](https://github.com/apache/arrow) or [parquet-rs](https://github.com/apache/arrow-rs), however at the discretion of the PMC any
-open source Parquet implementation may be acceptable.
+open source Parquet implementation may be acceptable. Implementations
+whose contributors actively
+participate in the community (e.g. keep their feature matrix
+up-to-date on parquet-site) are more likely to be considered.
+
+Unless otherwise discussed, it is expected the implementations will
+develop from the main branch (i.e. packporting is not expected).
+
+In some cases in addition to library level implementations it is
+expected the changes to be justified with integration into a
+processing engine to show there viability.

 3. After the first two steps are complete a formal vote is held on the Parquet mailing list to officially
 ratify the feature. After the vote passes the format change is merged into the parquet-format repository
-and it is expected the change in step 2 will also be merged soon after.
+and it is expected the change in step 2 will also be merged soon after. Before merging into Parquet-java a parquet-format release
+must be performed.

 #### General guidelines/preferences on additions.

-1. To the greatest extent possible changes should have an option for forwards compatibility
- (old readers can still read files).
-2. New encodings should be fully specified in this repository and ideally not rely on an external
- dependencies for implementation (i.e. Parquet is the source of truth for the encoding)
-3. New compression mechanisms must have a pure Java implementation that can be used as dependency
- in parquet-java.
+1. To the greatest extent possible changes should have an option for forwards compatibility (old readers can still read files).
+2. New encodings should be fully specified in this repository and ideally not rely on an external dependencies for implementation (i.e. Parquet is the source of truth for the encoding).
+3. New compression mechanisms must have a pure Java implementation that can be used as dependency in parquet-java.

 ### Releases

@@ -69,38 +77,44 @@ library.
 2. Forwards compatible. A file written by a new version of the library can be read by an older version
 of the library.

-3. Forwards compatible with suboptimal performance. A file written by a new version of the library can
+3. Forward compatible with suboptimal performance. A file written by a new version of the library can
 be read an older version of the library but performance might be suboptimal (e.g. statistics might be missing
 from the older reader's perspective).

-4. Forwards incompatible. A file written with a new version of the library cannot be read by an older version
+4. Forward incompatible. A file written with a new version of the library cannot be read by an older version
 of the library.

-Backwards compatibility is the concern of implementations but given the ubiquity of Parquet and the length
-of time it has been used, libraries SHOULD always support reading older format versions.
+The Parquet community hopes that new features are widely beneficial
+to users of Parquet, and therefore third-party implementations will
+adopt them quickly after they are introduced. It is assumed that most new features will be implemented behind a feature flag that defaults to "off".To avoid, compatibility issues across the ecosystem some amount of lead time is desirable to ensure a critical mass of Parquet implementations support a feature. Therefore, the Parquet PMC gives the following guidance for changing a feature to be "on" by default:

-The Parquet PMC recommends the following guidance for using features defined in the specification.
+1. Backwards compatibility is the concern of implementations but given the ubiquity of Parquet and the length
+of time it has been used, libraries SHOULD support reading older
+file variants.

-1. Forwards compatible changes MAY be used by default in implementations once the parquet-format containing
+2. Forward compatible changes MAY be used by default in implementations once the parquet-format containing
 those changes has been formally released. These features SHOULD be turned on 1 year after the parquet-java
-implementation containing feature is released.
+implementation containing feature is released (e.g. it is expected
+the Java implementation itself will turn them on for the first
+release 1 year after a features initial introduction).

-2. Forwards compatible with suboptimal performance features MAY be used default after
+3. Forward compatible with suboptimal performance features MAY be made default after
 the parquet-java implementation containing the feature is released. Features in this category
 SHOULD be turned on 1 year after the parquet-java
 implementation containing the feature is released. Implementations MAY choose
 to do a major version bump when turning on a feature in this category.

-3. Forwards incompatible changes MAY be turned on by default 2 years after the parquet-java
-implementation containing feature is released. Features in this category SHOULD be turned on by
+4. Forwards incompatible changes MAY be made default 2 years after the parquet-java
+implementation containing the feature is released. Features in this category SHOULD be turned on by
 default 3 years after the parquet-java implementation containing feature is released. Implementations MUST do
 a major version bump when enabling a forward incompatible feature by default.

-For feature released prior to October 2024, target dates for each of these categories will be updated
+For features released prior to October 2024, target dates for each of these categories will be updated
 as part of the parquet-java 2.0 process based on a collected feature compatibility matrix.

 For each release of parquet-java or parquet-format that influences this guidance it is expected
-exact dates will be added to parquet-format to provide clarity to implementors.
+exact dates will be added to parquet-format to provide clarity to implementors (e.g. When parquet-java 2.X.X is released, any
+new format features it uses will be updated with concrete dates).

-End users of software are generally encouraged to follow the same guidance unless they have mechanisms
-for ensuring the version of all possible readers of the Parquet files.
\ No newline at end of file
+End users of software are generally encouraged to follow the same guidance unless they have mechanisms for ensuring the version of all possible readers of the Parquet files support the feature. One way
+of doing this is to cross-reference feature matrix.
\ No newline at end of file
From 660c32532a8054cb42915d03a61e9e0db01200cf Mon Sep 17 00:00:00 2001
From: emkornfield
Date: Wed, 5 Jun 2024 23:05:24 -0700
Subject: [PATCH 05/28] fix typo

---
 CONTRIBUTING.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 819bda047..d235640dc 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -36,7 +36,7 @@ The general steps for adding features to the format are as follows:

 1. Discuss changes on on the developer mailing list (dev@parquet.apache.org). Often times it is helpful to link to a draft pull request to make the discussion concrete. This step is complete when there lazy consensus.

-2. One a change has lazy consensus two implementations of the feature
+2. Once a change has lazy consensus two implementations of the feature
 demonstrating interopability must also be provided. One implementation MUST be [parquet-java](http://github.com/apache/parquet-java). It is preferred that the second implementation be [parquet-cpp](https://github.com/apache/arrow) or [parquet-rs](https://github.com/apache/arrow-rs), however at the discretion of the PMC any
 open source Parquet implementation may be acceptable. Implementations
 whose contributors actively
From 11c6b7697d564555dc2512c124cf5d031faa7dcd Mon Sep 17 00:00:00 2001
From: emkornfield
Date: Wed, 5 Jun 2024 23:19:39 -0700
Subject: [PATCH 06/28] Apply suggestions from code review

Co-authored-by: Ed Seidl
---
 CONTRIBUTING.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index d235640dc..fa901b03d 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -44,15 +44,15 @@ participate in the community (e.g. keep their feature matrix
 up-to-date on parquet-site) are more likely to be considered.

 Unless otherwise discussed, it is expected the implementations will
-develop from the main branch (i.e. packporting is not expected).
+develop from the main branch (i.e. backporting is not expected).

 In some cases in addition to library level implementations it is
-expected the changes to be justified with integration into a
-processing engine to show there viability.
+expected the changes will be justified via integration into a
+processing engine to show their viability.

 3. After the first two steps are complete a formal vote is held on the Parquet mailing list to officially
 ratify the feature. After the vote passes the format change is merged into the parquet-format repository
-and it is expected the change in step 2 will also be merged soon after. Before merging into Parquet-java a parquet-format release
+and it is expected the changes from step 2 will also be merged soon after. Before merging into Parquet-java a parquet-format release
 must be performed.

 #### General guidelines/preferences on additions.
From 82c0fa13014c3cddf3c6c16dd3fc9d8ecd39e881 Mon Sep 17 00:00:00 2001
From: emkornfield
Date: Wed, 5 Jun 2024 23:23:31 -0700
Subject: [PATCH 07/28] add paragraph covering minor changes

---
 CONTRIBUTING.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index fa901b03d..1b4157c9a 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -32,6 +32,13 @@ https://github.com/apache/parquet-format/blob/master/LICENSE

 ### Additions/Changes to the Format

+Note: This section applies to actual functional changes to the
+specification. Fixing typos, grammar, and clarifying concepts
+that would not change the semantics of the specification can
+be done as long a comitter feels comfortable to merge them. When
+in doubt starting a discussion on the dev mailing list is
+encouraged.
+
 The general steps for adding features to the format are as follows:

 1. Discuss changes on on the developer mailing list (dev@parquet.apache.org). Often times it is helpful to link to a draft pull request to make the discussion concrete. This step is complete when there lazy consensus.
From 5eba8d62ce3489d8396e1f8813114cdfaa30677f Mon Sep 17 00:00:00 2001
From: Micah Kornfield
Date: Fri, 7 Jun 2024 04:57:00 +0000
Subject: [PATCH 08/28] reflow text

---
 CONTRIBUTING.md | 177 ++++++++++++++++++++++++++++---------------------
 1 file changed, 101 insertions(+), 76 deletions(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 1b4157c9a..100701688 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -32,96 +32,121 @@ https://github.com/apache/parquet-format/blob/master/LICENSE

 ### Additions/Changes to the Format

-Note: This section applies to actual functional changes to the
-specification. Fixing typos, grammar, and clarifying concepts
-that would not change the semantics of the specification can
-be done as long a comitter feels comfortable to merge them. When
-in doubt starting a discussion on the dev mailing list is
+Note: This section applies to actual functional changes to the specification.
+Fixing typos, grammar, and clarifying concepts that would not change the
+semantics of the specification can be done as long a comitter feels comfortable
+to merge them. When in doubt starting a discussion on the dev mailing list is
 encouraged.

 The general steps for adding features to the format are as follows:

-1. Discuss changes on on the developer mailing list (dev@parquet.apache.org). Often times it is helpful to link to a draft pull request to make the discussion concrete. This step is complete when there lazy consensus.
+1. Discuss changes on on the developer mailing list (dev@parquet.apache.org).
+ Often times it is helpful to link to a draft pull request to make the
+ discussion concrete. This step is complete when there lazy consensus.

 2. Once a change has lazy consensus two implementations of the feature
-demonstrating interopability must also be provided. One implementation MUST be [parquet-java](http://github.com/apache/parquet-java). It is preferred that the second implementation be [parquet-cpp](https://github.com/apache/arrow) or [parquet-rs](https://github.com/apache/arrow-rs), however at the discretion of the PMC any
-open source Parquet implementation may be acceptable. Implementations
-whose contributors actively
-participate in the community (e.g. keep their feature matrix
-up-to-date on parquet-site) are more likely to be considered.
-
-Unless otherwise discussed, it is expected the implementations will
-develop from the main branch (i.e. backporting is not expected).
-
-In some cases in addition to library level implementations it is
-expected the changes will be justified via integration into a
-processing engine to show their viability.
-
-3. After the first two steps are complete a formal vote is held on the Parquet mailing list to officially
-ratify the feature. After the vote passes the format change is merged into the parquet-format repository
-and it is expected the changes from step 2 will also be merged soon after. Before merging into Parquet-java a parquet-format release
-must be performed.
+ demonstrating interopability must also be provided. One implementation MUST
+ be [parquet-java](http://github.com/apache/parquet-java). It is preferred
+ that the second implementation be
+ [parquet-cpp](https://github.com/apache/arrow) or
+ [parquet-rs](https://github.com/apache/arrow-rs), however at the discretion
+ of the PMC any open source Parquet implementation may be acceptable.
+ Implementations whose contributors actively participate in the community
+ (e.g. keep their feature matrix up-to-date on parquet-site) are more likely
+ to be considered.
+
+Unless otherwise discussed, it is expected the implementations will develop from
+the main branch (i.e. backporting is not expected).
+
+In some cases in addition to library level implementations it is expected the
+changes will be justified via integration into a processing engine to show their
+viability.
+
+3. After the first two steps are complete a formal vote is held on the Parquet
+ mailing list to officially ratify the feature. After the vote passes the
+ format change is merged into the parquet-format repository and it is expected
+ the changes from step 2 will also be merged soon after. Before merging into
+ Parquet-java a parquet-format release must be performed.

 #### General guidelines/preferences on additions.

-1. To the greatest extent possible changes should have an option for forwards compatibility (old readers can still read files).
-2. New encodings should be fully specified in this repository and ideally not rely on an external dependencies for implementation (i.e. Parquet is the source of truth for the encoding).
-3. New compression mechanisms must have a pure Java implementation that can be used as dependency in parquet-java.
+1. To the greatest extent possible changes should have an option for forwards
+ compatibility (old readers can still read files).
+2. New encodings should be fully specified in this repository and ideally not
+ rely on an external dependencies for implementation (i.e. Parquet is the
+ source of truth for the encoding).
+3. New compression mechanisms must have a pure Java implementation that can be
+ used as dependency in parquet-java.

 ### Releases

-The Parquet community aims to do releases of the format package only as needed when new features are introduced.
-If multiple new features are being proposed simultaneously some features might be consolidated into the same release. Guidance is provided below on when implementations should enable features added to the specification.
-Due to confusion in the past over parquet versioning it is not expected that there will be a 3.0 release of the specification in the foreseeable future.
+The Parquet community aims to do releases of the format package only as needed
+when new features are introduced. If multiple new features are being proposed
+simultaneously some features might be consolidated into the same release.
+Guidance is provided below on when implementations should enable features added
+to the specification. Due to confusion in the past over parquet versioning it
+is not expected that there will be a 3.0 release of the specification in the
+foreseeable future.

 ### Compatibility and Feature Enablement

 For the purposes of this discussion we classify features into the following buckets:

-1. Backwards compatible. A file written by an older version of a library can be read by a newer version of the
-library.
-
-2. Forwards compatible. A file written by a new version of the library can be read by an older version
-of the library.
-
-3. Forward compatible with suboptimal performance. A file written by a new version of the library can
-be read an older version of the library but performance might be suboptimal (e.g. statistics might be missing
-from the older reader's perspective).
-
-4. Forward incompatible. A file written with a new version of the library cannot be read by an older version
-of the library.
-
-The Parquet community hopes that new features are widely beneficial
-to users of Parquet, and therefore third-party implementations will
-adopt them quickly after they are introduced. It is assumed that most new features will be implemented behind a feature flag that defaults to "off".To avoid, compatibility issues across the ecosystem some amount of lead time is desirable to ensure a critical mass of Parquet implementations support a feature. Therefore, the Parquet PMC gives the following guidance for changing a feature to be "on" by default:
-
-1. Backwards compatibility is the concern of implementations but given the ubiquity of Parquet and the length
-of time it has been used, libraries SHOULD support reading older
-file variants.
-
-2. Forward compatible changes MAY be used by default in implementations once the parquet-format containing
-those changes has been formally released. These features SHOULD be turned on 1 year after the parquet-java
-implementation containing feature is released (e.g. it is expected
-the Java implementation itself will turn them on for the first
-release 1 year after a features initial introduction).
-
-3. Forward compatible with suboptimal performance features MAY be made default after
-the parquet-java implementation containing the feature is released. Features in this category
-SHOULD be turned on 1 year after the parquet-java
-implementation containing the feature is released. Implementations MAY choose
-to do a major version bump when turning on a feature in this category.
-
-4. Forwards incompatible changes MAY be made default 2 years after the parquet-java
-implementation containing the feature is released. Features in this category SHOULD be turned on by
-default 3 years after the parquet-java implementation containing feature is released. Implementations MUST do
-a major version bump when enabling a forward incompatible feature by default.
-
-For features released prior to October 2024, target dates for each of these categories will be updated
-as part of the parquet-java 2.0 process based on a collected feature compatibility matrix.
-
-For each release of parquet-java or parquet-format that influences this guidance it is expected
-exact dates will be added to parquet-format to provide clarity to implementors (e.g. When parquet-java 2.X.X is released, any
-new format features it uses will be updated with concrete dates).
-
-End users of software are generally encouraged to follow the same guidance unless they have mechanisms for ensuring the version of all possible readers of the Parquet files support the feature. One way
-of doing this is to cross-reference feature matrix.
\ No newline at end of file
+1. Backwards compatible. A file written by an older version of a library can be
+ read by a newer version of the library.
+
+2. Forwards compatible. A file written by a new version of the library can be
+ read by an older version of the library.
+
+3. Forward compatible with suboptimal performance. A file written by a new
+ version of the library can be read an older version of the library but
+ performance might be suboptimal (e.g. statistics might be missing from the
+ older reader's perspective).
+
+4. Forward incompatible. A file written with a new version of the library cannot
+ be read by an older version of the library.
+
+The Parquet community hopes that new features are widely beneficial to users of
+Parquet, and therefore third-party implementations will adopt them quickly after
+they are introduced. It is assumed that most new features will be implemented
+behind a feature flag that defaults to "off".To avoid, compatibility issues
+across the ecosystem some amount of lead time is desirable to ensure a critical
+mass of Parquet implementations support a feature. Therefore, the Parquet PMC
+gives the following guidance for changing a feature to be "on" by default:
+
+1. Backwards compatibility is the concern of implementations but given the
+ ubiquity of Parquet and the length of time it has been used, libraries SHOULD
+ support reading older file variants.
+
+2. Forward compatible changes MAY be used by default in implementations once the
+ parquet-format containing those changes has been formally released. These
+ features SHOULD be turned on 1 year after the parquet-java implementation
+ containing feature is released (e.g. it is expected the Java implementation
+ itself will turn them on for the first release 1 year after a features
+ initial introduction).
+
+3. Forward compatible with suboptimal performance features MAY be made default
+ after the parquet-java implementation containing the feature is released.
+ Features in this category SHOULD be turned on 1 year after the parquet-java
+ implementation containing the feature is released. Implementations MAY
+ choose to do a major version bump when turning on a feature in this category.
+
+4. Forwards incompatible changes MAY be made default 2 years after the
+ parquet-java implementation containing the feature is released. Features in
+ this category SHOULD be turned on by default 3 years after the parquet-java
+ implementation containing feature is released. Implementations MUST do a
+ major version bump when enabling a forward incompatible feature by default.
+
+For features released prior to October 2024, target dates for each of these
+categories will be updated as part of the parquet-java 2.0 process based on a
+collected feature compatibility matrix.
+ +For each release of parquet-java or parquet-format that influences this guidance +it is expected exact dates will be added to parquet-format to provide clarity to +implementors (e.g. When parquet-java 2.X.X is released, any new format features +it uses will be updated with concrete dates). + +End users of software are generally encouraged to follow the same guidance +unless they have mechanisms for ensuring the version of all possible readers of +the Parquet files support the feature. One way of doing this is to +cross-reference feature matrix. From 4fe4859dfd6d4775c08c2005a11d8f2c4c2eed68 Mon Sep 17 00:00:00 2001 From: Micah Kornfield Date: Fri, 7 Jun 2024 05:31:39 +0000 Subject: [PATCH 09/28] rephrase to be less proscriptive --- CONTRIBUTING.md | 93 +++++++++++++++++++++++++------------------------ 1 file changed, 47 insertions(+), 46 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 100701688..ebe750dcb 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -42,7 +42,12 @@ The general steps for adding features to the format are as follows: 1. Discuss changes on on the developer mailing list (dev@parquet.apache.org). Often times it is helpful to link to a draft pull request to make the - discussion concrete. This step is complete when there lazy consensus. + discussion concrete. This step is complete when there lazy consensus. Part + of the consensus is whether it sufficient to provide 2 working + implementations as outlined in step 2 or if demonstration of the feature + with a down-stream query engine is necessary to justify the feature (e.g. + demonstrate performance improvements in Arrow's DataSet library or + Apache Data Fusion). 2. Once a change has lazy consensus two implementations of the feature demonstrating interopability must also be provided. One implementation MUST @@ -53,15 +58,12 @@ The general steps for adding features to the format are as follows: of the PMC any open source Parquet implementation may be acceptable. 
Implementations whose contributors actively participate in the community (e.g. keep their feature matrix up-to-date on parquet-site) are more likely - to be considered. + to be considered. If discussed as a requirement in step one demonstration + of integration with a query engine is also required for this step. Unless otherwise discussed, it is expected the implementations will develop from the main branch (i.e. backporting is not expected). -In some cases in addition to library level implementations it is expected the -changes will be justified via integration into a processing engine to show their -viability. - 3. After the first two steps are complete a formal vote is held on the Parquet mailing list to officially ratify the feature. After the vote passes the format change is merged into the parquet-format repository and it is expected @@ -81,7 +83,7 @@ viability. ### Releases The Parquet community aims to do releases of the format package only as needed -when new features are introduced. If multiple new features are being proposed +when new features are introduced. If multiple new features are being proposed simultaneously some features might be consolidated into the same release. Guidance is provided below on when implementations should enable features added to the specification. Due to confusion in the past over parquet versioning it @@ -92,50 +94,49 @@ foreseeable future. For the purposes of this discussion we classify features into the following buckets: -1. Backwards compatible. A file written by an older version of a library can be - read by a newer version of the library. - -2. Forwards compatible. A file written by a new version of the library can be - read by an older version of the library. - -3. Forward compatible with suboptimal performance. A file written by a new - version of the library can be read an older version of the library but - performance might be suboptimal (e.g. statistics might be missing from the - older reader's perspective). - -4. 
Forward incompatible. A file written with a new version of the library cannot - be read by an older version of the library. +1. Backwards compatible. A file written under an older version of the format + should be readable under a newer version of the format. +2. Forwards compatible. A file written under a newer version of the format with + the enabled feature can be read under an older version of the format, but + some information might be missing or performance might be suboptimal. +3. Forward incompatible. A file written under a new version of the format with + the feature enabled cannot be read under and older version of the format + (e.g. Adding a new compression algorithm. The Parquet community hopes that new features are widely beneficial to users of Parquet, and therefore third-party implementations will adopt them quickly after they are introduced. It is assumed that most new features will be implemented -behind a feature flag that defaults to "off".To avoid, compatibility issues -across the ecosystem some amount of lead time is desirable to ensure a critical -mass of Parquet implementations support a feature. Therefore, the Parquet PMC -gives the following guidance for changing a feature to be "on" by default: +behind a feature flag that defaults to "off" and at some future point the +features are turned on by default. To avoid, compatibility issues across the +ecosystem some amount of lead time is desirable to ensure a critical mass of +Parquet implementations support a feature. Therefore, the Parquet PMC gives the +following guidance for changing a feature to be "on" by default: 1. Backwards compatibility is the concern of implementations but given the ubiquity of Parquet and the length of time it has been used, libraries SHOULD - support reading older file variants. - -2. Forward compatible changes MAY be used by default in implementations once the - parquet-format containing those changes has been formally released. 
These - features SHOULD be turned on 1 year after the parquet-java implementation - containing feature is released (e.g. it is expected the Java implementation - itself will turn them on for the first release 1 year after a features - initial introduction). - -3. Forward compatible with suboptimal performance features MAY be made default - after the parquet-java implementation containing the feature is released. - Features in this category SHOULD be turned on 1 year after the parquet-java - implementation containing the feature is released. Implementations MAY - choose to do a major version bump when turning on a feature in this category. - -4. Forwards incompatible changes MAY be made default 2 years after the - parquet-java implementation containing the feature is released. Features in - this category SHOULD be turned on by default 3 years after the parquet-java - implementation containing feature is released. Implementations MUST do a - major version bump when enabling a forward incompatible feature by default. + support reading older version of the formats to the greatest extent possible. + +2. Forward compatible features/changes MAY be used by default in implementations + once the parquet-format containing those changes has been formally released. + For features that may pose a significant performance regression to prior + format readers, libaries SHOULD consider delaying until 1 year after the + release of the parquet-java implementation that contains the feature + implementation. Implementations MAY choose to do a major version bump when + turning on a feature in this category. + +3. Forwards incompatible features/changes MAY be made default 2 years after the + parquet-java implementation containing the feature is released. + Implementations MUST do a major version bump when enabling a forward + incompatible feature by default. 
+ +For forward compatible changes which have a high chance or performance +regression for older readers and forward incompatible changes implementations +SHOULD clearly document the compatibility issues and SHOULD consider also +logging a warning when such a feature is used. Additionally, while it is up to +maintainers of individual implementations to make the best decision to serve +their ecosystem they are encouraged to start enabling features by default along +the same timelines as parquet-java. Parquet-java will aim to enable features +based on the most conservative timelines outlined above. For features released prior to October 2024, target dates for each of these categories will be updated as part of the parquet-java 2.0 process based on a @@ -148,5 +149,5 @@ it uses will be updated with concrete dates). End users of software are generally encouraged to follow the same guidance unless they have mechanisms for ensuring the version of all possible readers of -the Parquet files support the feature. One way of doing this is to -cross-reference feature matrix. +the Parquet files support the feature they want to enable. One way of doing this +is to cross-reference feature matrix and any relevant vendor documentation. From e6ce62e5165a4d6ad0fd4267ea9a605e619b7321 Mon Sep 17 00:00:00 2001 From: emkornfield Date: Fri, 7 Jun 2024 09:29:51 -0700 Subject: [PATCH 10/28] Update CONTRIBUTING.md Co-authored-by: Andrew Lamb --- CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index ebe750dcb..a85a5ce0e 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -107,7 +107,7 @@ The Parquet community hopes that new features are widely beneficial to users of Parquet, and therefore third-party implementations will adopt them quickly after they are introduced. 
It is assumed that most new features will be implemented behind a feature flag that defaults to "off" and at some future point the -features are turned on by default. To avoid, compatibility issues across the +features are turned on by default. To avoid compatibility issues across the ecosystem some amount of lead time is desirable to ensure a critical mass of Parquet implementations support a feature. Therefore, the Parquet PMC gives the following guidance for changing a feature to be "on" by default: From f58c5d2f96aa58a6e992dbf5ed1941dc9d7d40c0 Mon Sep 17 00:00:00 2001 From: emkornfield Date: Fri, 7 Jun 2024 09:33:30 -0700 Subject: [PATCH 11/28] Update CONTRIBUTING.md Co-authored-by: Andrew Lamb --- CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index a85a5ce0e..c625901a7 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -75,7 +75,7 @@ the main branch (i.e. backporting is not expected). 1. To the greatest extent possible changes should have an option for forwards compatibility (old readers can still read files). 2. New encodings should be fully specified in this repository and ideally not - rely on an external dependencies for implementation (i.e. Parquet is the + rely on an external dependencies for implementation (i.e. `parquet-format` is the source of truth for the encoding). 3. New compression mechanisms must have a pure Java implementation that can be used as dependency in parquet-java. 
From 8421b24d5360235ecc42e72bdf5e4825e1329e9e Mon Sep 17 00:00:00 2001 From: emkornfield Date: Fri, 7 Jun 2024 09:57:57 -0700 Subject: [PATCH 12/28] clarify forward incompatible features --- CONTRIBUTING.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index c625901a7..fec51b874 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -105,7 +105,8 @@ For the purposes of this discussion we classify features into the following buck The Parquet community hopes that new features are widely beneficial to users of Parquet, and therefore third-party implementations will adopt them quickly after -they are introduced. It is assumed that most new features will be implemented +they are introduced. It is assumed that forward +incompatible features will be implemented behind a feature flag that defaults to "off" and at some future point the features are turned on by default. To avoid compatibility issues across the ecosystem some amount of lead time is desirable to ensure a critical mass of From 7e1845209748a7f56d2838682193b2bd2ca9a8e1 Mon Sep 17 00:00:00 2001 From: Micah Kornfield Date: Mon, 10 Jun 2024 05:21:12 +0000 Subject: [PATCH 13/28] Respond to more feedback. - Soften language on recommendations and add clarifications - Fix some grammar issues. --- CONTRIBUTING.md | 114 ++++++++++++++++++++++++++---------------------- 1 file changed, 63 insertions(+), 51 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index fec51b874..4d780eb8d 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -40,16 +40,16 @@ encouraged. The general steps for adding features to the format are as follows: -1. Discuss changes on on the developer mailing list (dev@parquet.apache.org). +1. Discuss changes on the developer mailing list (dev@parquet.apache.org). Often times it is helpful to link to a draft pull request to make the - discussion concrete. This step is complete when there lazy consensus. Part + discussion concrete. 
This step is complete when there is lazy consensus. Part of the consensus is whether it sufficient to provide 2 working - implementations as outlined in step 2 or if demonstration of the feature - with a down-stream query engine is necessary to justify the feature (e.g. - demonstrate performance improvements in Arrow's DataSet library or - Apache Data Fusion). + implementations as outlined in step 2 or if demonstration of the feature with + a down-stream query engine is necessary to justify the feature (e.g. + demonstrate performance improvements in Arrow's DataSet library or Apache + Data Fusion or another open source engine). -2. Once a change has lazy consensus two implementations of the feature +2. Once a change has lazy consensus, two implementations of the feature demonstrating interopability must also be provided. One implementation MUST be [parquet-java](http://github.com/apache/parquet-java). It is preferred that the second implementation be @@ -58,8 +58,10 @@ The general steps for adding features to the format are as follows: of the PMC any open source Parquet implementation may be acceptable. Implementations whose contributors actively participate in the community (e.g. keep their feature matrix up-to-date on parquet-site) are more likely - to be considered. If discussed as a requirement in step one demonstration - of integration with a query engine is also required for this step. + to be considered. If discussed as a requirement in step one, demonstration + of integration with a query engine is also required for this step. The + implementations must be made available publicly (e.g. as a pull request + against the target repository). Unless otherwise discussed, it is expected the implementations will develop from the main branch (i.e. backporting is not expected). @@ -74,20 +76,22 @@ the main branch (i.e. backporting is not expected). 1. 
To the greatest extent possible changes should have an option for forwards compatibility (old readers can still read files). + 2. New encodings should be fully specified in this repository and ideally not - rely on an external dependencies for implementation (i.e. `parquet-format` is the - source of truth for the encoding). + rely on an external dependencies for implementation (i.e. `parquet-format` is + the source of truth for the encoding). + 3. New compression mechanisms must have a pure Java implementation that can be used as dependency in parquet-java. ### Releases -The Parquet community aims to do releases of the format package only as needed -when new features are introduced. If multiple new features are being proposed +The Parquet PMC aims to do releases of the format package only as needed when +new features are introduced. If multiple new features are being proposed simultaneously some features might be consolidated into the same release. Guidance is provided below on when implementations should enable features added to the specification. Due to confusion in the past over parquet versioning it -is not expected that there will be a 3.0 release of the specification in the +is not expected that there will be a 3.x release of the specification in the foreseeable future. ### Compatibility and Feature Enablement @@ -96,48 +100,52 @@ For the purposes of this discussion we classify features into the following buck 1. Backwards compatible. A file written under an older version of the format should be readable under a newer version of the format. + 2. Forwards compatible. A file written under a newer version of the format with the enabled feature can be read under an older version of the format, but some information might be missing or performance might be suboptimal. + 3. Forward incompatible. A file written under a new version of the format with - the feature enabled cannot be read under and older version of the format - (e.g. Adding a new compression algorithm. 
- -The Parquet community hopes that new features are widely beneficial to users of -Parquet, and therefore third-party implementations will adopt them quickly after -they are introduced. It is assumed that forward -incompatible features will be implemented -behind a feature flag that defaults to "off" and at some future point the -features are turned on by default. To avoid compatibility issues across the -ecosystem some amount of lead time is desirable to ensure a critical mass of -Parquet implementations support a feature. Therefore, the Parquet PMC gives the -following guidance for changing a feature to be "on" by default: + the feature enabled cannot be read under an older version of the format (e.g. + Adding a new compression algorithm). + +New features are intended to be widely beneficial to users of Parquet, and +therefore it is hoped third-party implementations will adopt them quickly after +they are introduced. It is assumed that writing new parts of the format, and +especially forward incompatible features, will be configured with feature flag +defaulted to "off" and at some future point the features are turned on by default +(reading of the new feature will typically be enabled without configuration or +defaulted to on). Some amount of lead time is desirable to ensure a critical +mass of Parquet implementations support a feature to avoid compatability issues +across the ecosystem. Therefore, the Parquet PMC gives the following +recommendations for managing features: 1. Backwards compatibility is the concern of implementations but given the - ubiquity of Parquet and the length of time it has been used, libraries SHOULD - support reading older version of the formats to the greatest extent possible. + ubiquity of Parquet and the length of time it has been used, libraries should + support reading older version of the format to the greatest extent possible. -2. Forward compatible features/changes MAY be used by default in implementations +2. 
Forward compatible features/changes may be used by default in implementations once the parquet-format containing those changes has been formally released. - For features that may pose a significant performance regression to prior - format readers, libaries SHOULD consider delaying until 1 year after the - release of the parquet-java implementation that contains the feature - implementation. Implementations MAY choose to do a major version bump when - turning on a feature in this category. - -3. Forwards incompatible features/changes MAY be made default 2 years after the - parquet-java implementation containing the feature is released. - Implementations MUST do a major version bump when enabling a forward - incompatible feature by default. - -For forward compatible changes which have a high chance or performance + For features that may pose a significant performance regression to older + format readers, libaries should consider delaying default enablement until 1 + year after the release of the parquet-java implementation that contains the + feature implementation. + +3. Forwards incompatible features/changes should not be turned on by default + until 2 years after the parquet-java implementation containing the feature is + released. It is recommended that changing the default value for a forward + incompatible feature flag be done as part of a major release of an + implementation (it is out of the scope for this guidance on how and when + implementations decide to do releases). + +For forward compatible changes which have a high chance of performance regression for older readers and forward incompatible changes implementations -SHOULD clearly document the compatibility issues and SHOULD consider also -logging a warning when such a feature is used. 
Additionally, while it is up to -maintainers of individual implementations to make the best decision to serve -their ecosystem they are encouraged to start enabling features by default along -the same timelines as parquet-java. Parquet-java will aim to enable features -based on the most conservative timelines outlined above. +should clearly document the compatibility issues and should consider logging a +warning when such a feature is used. Additionally, while it is up to maintainers +of individual implementations to make the best decision to serve their +ecosystem, they are encouraged to start enabling features by default along the +same timelines as parquet-java. Parquet-java will aim to enable features by +default based on the most conservative timelines outlined above. For features released prior to October 2024, target dates for each of these categories will be updated as part of the parquet-java 2.0 process based on a @@ -146,9 +154,13 @@ collected feature compatibility matrix. For each release of parquet-java or parquet-format that influences this guidance it is expected exact dates will be added to parquet-format to provide clarity to implementors (e.g. When parquet-java 2.X.X is released, any new format features -it uses will be updated with concrete dates). +it uses will be updated with concrete dates). As part of parquet-format +releases the compatibility matrix will be updated to contain the release date +in the format. Implementations are also encouraged to provide implementation +date/release version information when updating the feature matrix. End users of software are generally encouraged to follow the same guidance -unless they have mechanisms for ensuring the version of all possible readers of -the Parquet files support the feature they want to enable. One way of doing this -is to cross-reference feature matrix and any relevant vendor documentation. 
+detailed above unless they have mechanisms for ensuring the version of all +possible readers of the Parquet files support the feature they want to enable. +One way of doing this is to cross-reference feature matrix and any relevant +vendor documentation. From f8fc1498cd11413e680092b98d4c554de65a09cc Mon Sep 17 00:00:00 2001 From: emkornfield Date: Mon, 10 Jun 2024 10:18:10 -0700 Subject: [PATCH 14/28] Apply suggestions from code review Co-authored-by: Antoine Pitrou --- CONTRIBUTING.md | 29 +++++++++++++++-------------- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 4d780eb8d..8d58027ca 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -60,11 +60,12 @@ The general steps for adding features to the format are as follows: (e.g. keep their feature matrix up-to-date on parquet-site) are more likely to be considered. If discussed as a requirement in step one, demonstration of integration with a query engine is also required for this step. The - implementations must be made available publicly (e.g. as a pull request - against the target repository). + implementations must be made available publicly, and they should + be fit for inclusion (for example, they were submitted as a pull request + against the target repository and committers gave positive reviews). -Unless otherwise discussed, it is expected the implementations will develop from -the main branch (i.e. backporting is not expected). +Unless otherwise discussed, it is expected the implementations will be developed +from their respective main branch (i.e. backporting is not expected). 3. After the first two steps are complete a formal vote is held on the Parquet mailing list to officially ratify the feature. After the vote passes the @@ -90,7 +91,7 @@ The Parquet PMC aims to do releases of the format package only as needed when new features are introduced. 
If multiple new features are being proposed simultaneously some features might be consolidated into the same release. Guidance is provided below on when implementations should enable features added -to the specification. Due to confusion in the past over parquet versioning it +to the specification. Due to confusion in the past over Parquet versioning it is not expected that there will be a 3.x release of the specification in the foreseeable future. @@ -102,29 +103,29 @@ For the purposes of this discussion we classify features into the following buck should be readable under a newer version of the format. 2. Forwards compatible. A file written under a newer version of the format with - the enabled feature can be read under an older version of the format, but + the feature enabled can be read under an older version of the format, but some information might be missing or performance might be suboptimal. -3. Forward incompatible. A file written under a new version of the format with +3. Forwards incompatible. A file written under a newer version of the format with the feature enabled cannot be read under an older version of the format (e.g. - Adding a new compression algorithm). + adding and using a new compression algorithm). New features are intended to be widely beneficial to users of Parquet, and therefore it is hoped third-party implementations will adopt them quickly after they are introduced. It is assumed that writing new parts of the format, and -especially forward incompatible features, will be configured with feature flag -defaulted to "off" and at some future point the features are turned on by default +especially forward incompatible features, will be configured with a feature flag +defaulted to "off", and at some future point the feature is turned on by default (reading of the new feature will typically be enabled without configuration or defaulted to on). 
Some amount of lead time is desirable to ensure a critical -mass of Parquet implementations support a feature to avoid compatability issues +mass of Parquet implementations support a feature to avoid compatibility issues across the ecosystem. Therefore, the Parquet PMC gives the following recommendations for managing features: 1. Backwards compatibility is the concern of implementations but given the ubiquity of Parquet and the length of time it has been used, libraries should - support reading older version of the format to the greatest extent possible. + support reading older versions of the format to the greatest extent possible. -2. Forward compatible features/changes may be used by default in implementations +2. Forwards compatible features/changes may be enabled and used by default in implementations once the parquet-format containing those changes has been formally released. For features that may pose a significant performance regression to older format readers, libaries should consider delaying default enablement until 1 @@ -139,7 +140,7 @@ recommendations for managing features: implementations decide to do releases). For forward compatible changes which have a high chance of performance -regression for older readers and forward incompatible changes implementations +regression for older readers and forward incompatible changes, implementations should clearly document the compatibility issues and should consider logging a warning when such a feature is used. Additionally, while it is up to maintainers of individual implementations to make the best decision to serve their From 5117b0330d31be482a495dca00b67d8f7a7d7ba0 Mon Sep 17 00:00:00 2001 From: emkornfield Date: Mon, 17 Jun 2024 21:11:56 -0700 Subject: [PATCH 15/28] address feedback. 
--- CONTRIBUTING.md | 57 ++++++++++++++++++++++++------------------------- 1 file changed, 28 insertions(+), 29 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 8d58027ca..32a49a7ef 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -51,13 +51,13 @@ The general steps for adding features to the format are as follows: 2. Once a change has lazy consensus, two implementations of the feature demonstrating interopability must also be provided. One implementation MUST - be [parquet-java](http://github.com/apache/parquet-java). It is preferred + be [`parquet-java`](http://github.com/apache/parquet-java). It is preferred that the second implementation be - [parquet-cpp](https://github.com/apache/arrow) or - [parquet-rs](https://github.com/apache/arrow-rs), however at the discretion + [`parquet-cpp`](https://github.com/apache/arrow) or + [`parquet-rs`](https://github.com/apache/arrow-rs), however at the discretion of the PMC any open source Parquet implementation may be acceptable. Implementations whose contributors actively participate in the community - (e.g. keep their feature matrix up-to-date on parquet-site) are more likely + (e.g. keep their feature matrix up-to-date on the Parquet website) are more likely to be considered. If discussed as a requirement in step one, demonstration of integration with a query engine is also required for this step. The implementations must be made available publicly, and they should @@ -67,15 +67,14 @@ The general steps for adding features to the format are as follows: Unless otherwise discussed, it is expected the implementations will be developed from their respective main branch (i.e. backporting is not expected). -3. After the first two steps are complete a formal vote is held on the Parquet - mailing list to officially ratify the feature. After the vote passes the - format change is merged into the parquet-format repository and it is expected - the changes from step 2 will also be merged soon after. 
Before merging into - Parquet-java a parquet-format release must be performed. +3. After the first two steps are complete a formal vote is held on dev@parquet.apache.org to officially ratify the feature. After the vote passes the + format change is merged into the `parquet-format` repository and it is expected + the changes from step 2 will also be merged soon after (implementations should not be merged until the addition has + been merged to `parquet-format`). #### General guidelines/preferences on additions. -1. To the greatest extent possible changes should have an option for forwards +1. To the greatest extent possible changes should have an option for forward compatibility (old readers can still read files). 2. New encodings should be fully specified in this repository and ideally not @@ -83,7 +82,7 @@ from their respective main branch (i.e. backporting is not expected). the source of truth for the encoding). 3. New compression mechanisms must have a pure Java implementation that can be - used as dependency in parquet-java. + used as dependency in `parquet-java`. ### Releases @@ -99,16 +98,18 @@ foreseeable future. For the purposes of this discussion we classify features into the following buckets: -1. Backwards compatible. A file written under an older version of the format +1. Backward compatible. A file written under an older version of the format should be readable under a newer version of the format. -2. Forwards compatible. A file written under a newer version of the format with +2. Forward compatible. A file written under a newer version of the format with the feature enabled can be read under an older version of the format, but some information might be missing or performance might be suboptimal. -3. Forwards incompatible. A file written under a newer version of the format with +3. Forward incompatible. A file written under a newer version of the format with the feature enabled cannot be read under an older version of the format (e.g. 
- adding and using a new compression algorithm). + adding and using a new compression algorithm). It is expected + any feature in this category will provide a signal to older readers, so they can unambiguously determine that they cannot + properly read the file (e.g. via changing the `PAR1` magic number). New features are intended to be widely beneficial to users of Parquet, and therefore it is hoped third-party implementations will adopt them quickly after @@ -121,41 +122,39 @@ mass of Parquet implementations support a feature to avoid compatibility issues across the ecosystem. Therefore, the Parquet PMC gives the following recommendations for managing features: -1. Backwards compatibility is the concern of implementations but given the +1. Backward compatibility is the concern of implementations but given the ubiquity of Parquet and the length of time it has been used, libraries should support reading older versions of the format to the greatest extent possible. -2. Forwards compatible features/changes may be enabled and used by default in implementations +2. Forward compatible features/changes may be enabled and used by default in implementations once the parquet-format containing those changes has been formally released. For features that may pose a significant performance regression to older format readers, libaries should consider delaying default enablement until 1 year after the release of the parquet-java implementation that contains the feature implementation. -3. Forwards incompatible features/changes should not be turned on by default +3. Forward incompatible features/changes should not be turned on by default until 2 years after the parquet-java implementation containing the feature is released. It is recommended that changing the default value for a forward - incompatible feature flag be done as part of a major release of an - implementation (it is out of the scope for this guidance on how and when - implementations decide to do releases). 
+ incompatible feature flag should be clearly advertised to consumers (e.g. via a major version release if using Semantic Versioning, or highlighed in release notes). For forward compatible changes which have a high chance of performance regression for older readers and forward incompatible changes, implementations -should clearly document the compatibility issues and should consider logging a -warning when such a feature is used. Additionally, while it is up to maintainers +should clearly document the compatibility issues. Additionally, while it is up to maintainers of individual implementations to make the best decision to serve their ecosystem, they are encouraged to start enabling features by default along the -same timelines as parquet-java. Parquet-java will aim to enable features by -default based on the most conservative timelines outlined above. +same timelines as `parquet-java`. Parquet-java will wait to enable features by +default until the most conservative timelines outlined above +have been exceeded. For features released prior to October 2024, target dates for each of these -categories will be updated as part of the parquet-java 2.0 process based on a +categories will be updated as part of the `parquet-java 2.0` release process based on a collected feature compatibility matrix. -For each release of parquet-java or parquet-format that influences this guidance +For each release of `parquet-java` or `parquet-format` that influences this guidance it is expected exact dates will be added to parquet-format to provide clarity to -implementors (e.g. When parquet-java 2.X.X is released, any new format features -it uses will be updated with concrete dates). As part of parquet-format +implementors (e.g. When `parquet-java` 2.X.X is released, any new format features +it uses will be updated with concrete dates). As part of `parquet-format` releases the compatibility matrix will be updated to contain the release date in the format. 
Implementations are also encouraged to provide implementation date/release version information when updating the feature matrix. From 7bd9c1d6be7b9c44d0feafcf09c79a7647cb4ae4 Mon Sep 17 00:00:00 2001 From: Micah Kornfield Date: Tue, 18 Jun 2024 04:14:32 +0000 Subject: [PATCH 16/28] reflow --- CONTRIBUTING.md | 78 ++++++++++++++++++++++++++----------------------- 1 file changed, 42 insertions(+), 36 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 32a49a7ef..b0eed93ca 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -57,20 +57,22 @@ The general steps for adding features to the format are as follows: [`parquet-rs`](https://github.com/apache/arrow-rs), however at the discretion of the PMC any open source Parquet implementation may be acceptable. Implementations whose contributors actively participate in the community - (e.g. keep their feature matrix up-to-date on the Parquet website) are more likely - to be considered. If discussed as a requirement in step one, demonstration - of integration with a query engine is also required for this step. The - implementations must be made available publicly, and they should - be fit for inclusion (for example, they were submitted as a pull request - against the target repository and committers gave positive reviews). + (e.g. keep their feature matrix up-to-date on the Parquet website) are more + likely to be considered. If discussed as a requirement in step one, + demonstration of integration with a query engine is also required for this + step. The implementations must be made available publicly, and they should be + fit for inclusion (for example, they were submitted as a pull request against + the target repository and committers gave positive reviews). Unless otherwise discussed, it is expected the implementations will be developed from their respective main branch (i.e. backporting is not expected). -3. 
After the first two steps are complete a formal vote is held on dev@parquet.apache.org to officially ratify the feature. After the vote passes the - format change is merged into the `parquet-format` repository and it is expected - the changes from step 2 will also be merged soon after (implementations should not be merged until the addition has - been merged to `parquet-format`). +3. After the first two steps are complete a formal vote is held on + dev@parquet.apache.org to officially ratify the feature. After the vote + passes the format change is merged into the `parquet-format` repository and + it is expected the changes from step 2 will also be merged soon after + (implementations should not be merged until the addition has been merged to + `parquet-format`). #### General guidelines/preferences on additions. @@ -107,9 +109,10 @@ For the purposes of this discussion we classify features into the following buck 3. Forward incompatible. A file written under a newer version of the format with the feature enabled cannot be read under an older version of the format (e.g. - adding and using a new compression algorithm). It is expected - any feature in this category will provide a signal to older readers, so they can unambiguously determine that they cannot - properly read the file (e.g. via changing the `PAR1` magic number). + adding and using a new compression algorithm). It is expected any feature in + this category will provide a signal to older readers, so they can + unambiguously determine that they cannot properly read the file (e.g. via + changing the `PAR1` magic number). New features are intended to be widely beneficial to users of Parquet, and therefore it is hoped third-party implementations will adopt them quickly after @@ -126,38 +129,41 @@ recommendations for managing features: ubiquity of Parquet and the length of time it has been used, libraries should support reading older versions of the format to the greatest extent possible. -2. 
Forward compatible features/changes may be enabled and used by default in implementations - once the parquet-format containing those changes has been formally released. - For features that may pose a significant performance regression to older - format readers, libaries should consider delaying default enablement until 1 - year after the release of the parquet-java implementation that contains the - feature implementation. +2. Forward compatible features/changes may be enabled and used by default in + implementations once the parquet-format containing those changes has been + formally released. For features that may pose a significant performance + regression to older format readers, libaries should consider delaying default + enablement until 1 year after the release of the parquet-java implementation + that contains the feature implementation. 3. Forward incompatible features/changes should not be turned on by default until 2 years after the parquet-java implementation containing the feature is released. It is recommended that changing the default value for a forward - incompatible feature flag should be clearly advertised to consumers (e.g. via a major version release if using Semantic Versioning, or highlighed in release notes). + incompatible feature flag should be clearly advertised to consumers (e.g. via + a major version release if using Semantic Versioning, or highlighed in + release notes). For forward compatible changes which have a high chance of performance regression for older readers and forward incompatible changes, implementations -should clearly document the compatibility issues. Additionally, while it is up to maintainers -of individual implementations to make the best decision to serve their -ecosystem, they are encouraged to start enabling features by default along the -same timelines as `parquet-java`. Parquet-java will wait to enable features by -default until the most conservative timelines outlined above -have been exceeded. 
+should clearly document the compatibility issues. Additionally, while it is up +to maintainers of individual implementations to make the best decision to serve +their ecosystem, they are encouraged to start enabling features by default along +the same timelines as `parquet-java`. Parquet-java will wait to enable features +by default until the most conservative timelines outlined above have been +exceeded. For features released prior to October 2024, target dates for each of these -categories will be updated as part of the `parquet-java 2.0` release process based on a -collected feature compatibility matrix. - -For each release of `parquet-java` or `parquet-format` that influences this guidance -it is expected exact dates will be added to parquet-format to provide clarity to -implementors (e.g. When `parquet-java` 2.X.X is released, any new format features -it uses will be updated with concrete dates). As part of `parquet-format` -releases the compatibility matrix will be updated to contain the release date -in the format. Implementations are also encouraged to provide implementation -date/release version information when updating the feature matrix. +categories will be updated as part of the `parquet-java 2.0` release process +based on a collected feature compatibility matrix. + +For each release of `parquet-java` or `parquet-format` that influences this +guidance it is expected exact dates will be added to parquet-format to provide +clarity to implementors (e.g. When `parquet-java` 2.X.X is released, any new +format features it uses will be updated with concrete dates). As part of +`parquet-format` releases the compatibility matrix will be updated to contain +the release date in the format. Implementations are also encouraged to provide +implementation date/release version information when updating the feature +matrix. 
End users of software are generally encouraged to follow the same guidance detailed above unless they have mechanisms for ensuring the version of all From fcb2eb1dc32e6f34a3199f08d678e9e7017a6bac Mon Sep 17 00:00:00 2001 From: emkornfield Date: Wed, 26 Jun 2024 14:12:32 -0700 Subject: [PATCH 17/28] Update CONTRIBUTING.md Co-authored-by: Ed Seidl --- CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index b0eed93ca..d2faa1e83 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -84,7 +84,7 @@ from their respective main branch (i.e. backporting is not expected). the source of truth for the encoding). 3. New compression mechanisms must have a pure Java implementation that can be - used as dependency in `parquet-java`. + used as a dependency in `parquet-java`. ### Releases From 27ba2f5d4a7b5e4b7e05cb6bc626e9eeef832467 Mon Sep 17 00:00:00 2001 From: emkornfield Date: Wed, 26 Jun 2024 14:13:39 -0700 Subject: [PATCH 18/28] Update CONTRIBUTING.md Co-authored-by: Ed Seidl --- CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index d2faa1e83..32b546c2c 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -112,7 +112,7 @@ For the purposes of this discussion we classify features into the following buck adding and using a new compression algorithm). It is expected any feature in this category will provide a signal to older readers, so they can unambiguously determine that they cannot properly read the file (e.g. via - changing the `PAR1` magic number). + adding a new value to an existing enum). 
New features are intended to be widely beneficial to users of Parquet, and therefore it is hoped third-party implementations will adopt them quickly after From 890fc2d320f7f255015327d2b59035dc71dc4e98 Mon Sep 17 00:00:00 2001 From: emkornfield Date: Wed, 26 Jun 2024 14:14:06 -0700 Subject: [PATCH 19/28] Update CONTRIBUTING.md Co-authored-by: Ed Seidl --- CONTRIBUTING.md | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 32b546c2c..e9a4c8ac6 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -165,8 +165,6 @@ the release date in the format. Implementations are also encouraged to provide implementation date/release version information when updating the feature matrix. -End users of software are generally encouraged to follow the same guidance -detailed above unless they have mechanisms for ensuring the version of all -possible readers of the Parquet files support the feature they want to enable. -One way of doing this is to cross-reference feature matrix and any relevant -vendor documentation. +End users of software are generally encouraged to consult the feature matrix +and vendor documentation before enabling features that are not yet widely +adopted. From 2a8875a8560e923cba05d69454af9040dfda8400 Mon Sep 17 00:00:00 2001 From: emkornfield Date: Mon, 1 Jul 2024 00:59:44 -0700 Subject: [PATCH 20/28] Update CONTRIBUTING.md Co-authored-by: Gang Wu --- CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index e9a4c8ac6..c91140a5c 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -34,7 +34,7 @@ https://github.com/apache/parquet-format/blob/master/LICENSE Note: This section applies to actual functional changes to the specification. 
Fixing typos, grammar, and clarifying concepts that would not change the -semantics of the specification can be done as long a comitter feels comfortable +semantics of the specification can be done as long as a committer feels comfortable to merge them. When in doubt starting a discussion on the dev mailing list is encouraged. From 4a13c2ae0a85b13d7100fb897d6913a755ba61b4 Mon Sep 17 00:00:00 2001 From: emkornfield Date: Tue, 9 Jul 2024 13:11:46 -0700 Subject: [PATCH 21/28] Apply suggestions from code review Co-authored-by: Antoine Pitrou --- CONTRIBUTING.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index c91140a5c..8769280e5 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -43,11 +43,11 @@ The general steps for adding features to the format are as follows: 1. Discuss changes on the developer mailing list (dev@parquet.apache.org). Often times it is helpful to link to a draft pull request to make the discussion concrete. This step is complete when there is lazy consensus. Part - of the consensus is whether it sufficient to provide 2 working - implementations as outlined in step 2 or if demonstration of the feature with - a down-stream query engine is necessary to justify the feature (e.g. - demonstrate performance improvements in Arrow's DataSet library or Apache - Data Fusion or another open source engine). + of the consensus is whether it is sufficient to provide two working + implementations as outlined in step 2, or if demonstration of the feature with + a downstream query engine is necessary to justify the feature (e.g. + demonstrate performance improvements in the Apache Arrow C++ Dataset library, + the Apache DataFusion query engine, or any other open source engine). 2. Once a change has lazy consensus, two implementations of the feature demonstrating interopability must also be provided. 
One implementation MUST @@ -58,7 +58,7 @@ The general steps for adding features to the format are as follows: of the PMC any open source Parquet implementation may be acceptable. Implementations whose contributors actively participate in the community (e.g. keep their feature matrix up-to-date on the Parquet website) are more - likely to be considered. If discussed as a requirement in step one, + likely to be considered. If discussed as a requirement in step 1 above, demonstration of integration with a query engine is also required for this step. The implementations must be made available publicly, and they should be fit for inclusion (for example, they were submitted as a pull request against From 12a79abbab5e6d9282038dbc6c0845c0cbe4a41a Mon Sep 17 00:00:00 2001 From: emkornfield Date: Fri, 12 Jul 2024 08:40:21 -0700 Subject: [PATCH 22/28] wip, address comments. --- CONTRIBUTING.md | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 8769280e5..f3d731374 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -62,10 +62,11 @@ The general steps for adding features to the format are as follows: demonstration of integration with a query engine is also required for this step. The implementations must be made available publicly, and they should be fit for inclusion (for example, they were submitted as a pull request against - the target repository and committers gave positive reviews). + the target repository and committers gave positive reviews). Reports on the benefits from closed source implementations + are welcome and help in deciding x Unless otherwise discussed, it is expected the implementations will be developed -from their respective main branch (i.e. backporting is not expected). +from their respective main branch (i.e. backporting is not required), to demonstrate that the feature is mergeable to its implementation. 3. 
After the first two steps are complete a formal vote is held on dev@parquet.apache.org to officially ratify the feature. After the vote @@ -81,7 +82,9 @@ from their respective main branch (i.e. backporting is not expected). 2. New encodings should be fully specified in this repository and ideally not rely on an external dependencies for implementation (i.e. `parquet-format` is - the source of truth for the encoding). + the source of truth for the encoding). If it does require an + external dependency, then the external dependency must have its + own specification separate from implementation. 3. New compression mechanisms must have a pure Java implementation that can be used as a dependency in `parquet-java`. @@ -105,7 +108,8 @@ For the purposes of this discussion we classify features into the following buck 2. Forward compatible. A file written under a newer version of the format with the feature enabled can be read under an older version of the format, but - some information might be missing or performance might be suboptimal. + some metadata might be missing or performance might be suboptimal. Simply phrased, forward compatible means all + data can be read back in an older version of the format. 3. Forward incompatible. A file written under a newer version of the format with the feature enabled cannot be read under an older version of the format (e.g. From c62b3f32abfe8de1116874637c6d588cf2052b2a Mon Sep 17 00:00:00 2001 From: emkornfield Date: Fri, 12 Jul 2024 09:35:42 -0700 Subject: [PATCH 23/28] finish addressing comments --- CONTRIBUTING.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index f3d731374..ec26903e7 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -40,16 +40,17 @@ encouraged. The general steps for adding features to the format are as follows: -1. Discuss changes on the developer mailing list (dev@parquet.apache.org). 
- Often times it is helpful to link to a draft pull request to make the - discussion concrete. This step is complete when there is lazy consensus. Part +1. Design/scoping: The goal of this phase is to identify design goals of a feature and provide some demonstration that the feature meets those goals. This phase starts with a discussion of changes on the developer mailing list (dev@parquet.apache.org). Depending on the scope and goals of the feature the it can be useful to provide additional artifacts as part of a discussion. The artifacts can include a +design docuemnt, a draft pull request to make the + discussion concrete and/or an prototype implementation to demostrate the viability of implementation. This step is complete when there is lazy consensus. Part of the consensus is whether it is sufficient to provide two working implementations as outlined in step 2, or if demonstration of the feature with a downstream query engine is necessary to justify the feature (e.g. demonstrate performance improvements in the Apache Arrow C++ Dataset library, the Apache DataFusion query engine, or any other open source engine). -2. Once a change has lazy consensus, two implementations of the feature +2. Completeness: The goal of this phase is to ensure the feature is +viable, there is no ambiguity in its specification by demonstrating compatibility between implementations. Once a change has lazy consensus, two implementations of the feature demonstrating interopability must also be provided. One implementation MUST be [`parquet-java`](http://github.com/apache/parquet-java). It is preferred that the second implementation be @@ -63,12 +64,13 @@ The general steps for adding features to the format are as follows: step. The implementations must be made available publicly, and they should be fit for inclusion (for example, they were submitted as a pull request against the target repository and committers gave positive reviews). 
Reports on the benefits from closed source implementations - are welcome and help in deciding x + are welcome and can help lend weight to features desirability but + are not sufficient for acceptance of a new feature. Unless otherwise discussed, it is expected the implementations will be developed from their respective main branch (i.e. backporting is not required), to demonstrate that the feature is mergeable to its implementation. -3. After the first two steps are complete a formal vote is held on +3. Ratification: After the first two steps are complete a formal vote is held on dev@parquet.apache.org to officially ratify the feature. After the vote passes the format change is merged into the `parquet-format` repository and it is expected the changes from step 2 will also be merged soon after From 34933539e3abfe4dd7f0eeca2c9f92dffe850d8c Mon Sep 17 00:00:00 2001 From: Micah Kornfield Date: Fri, 12 Jul 2024 16:38:56 +0000 Subject: [PATCH 24/28] reflow --- CONTRIBUTING.md | 49 ++++++++++++++++++++++++++++++------------------- 1 file changed, 30 insertions(+), 19 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index ec26903e7..5cdf53a84 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -40,19 +40,27 @@ encouraged. The general steps for adding features to the format are as follows: -1. Design/scoping: The goal of this phase is to identify design goals of a feature and provide some demonstration that the feature meets those goals. This phase starts with a discussion of changes on the developer mailing list (dev@parquet.apache.org). Depending on the scope and goals of the feature the it can be useful to provide additional artifacts as part of a discussion. The artifacts can include a -design docuemnt, a draft pull request to make the - discussion concrete and/or an prototype implementation to demostrate the viability of implementation. This step is complete when there is lazy consensus. 
Part - of the consensus is whether it is sufficient to provide two working - implementations as outlined in step 2, or if demonstration of the feature with - a downstream query engine is necessary to justify the feature (e.g. - demonstrate performance improvements in the Apache Arrow C++ Dataset library, - the Apache DataFusion query engine, or any other open source engine). - -2. Completeness: The goal of this phase is to ensure the feature is -viable, there is no ambiguity in its specification by demonstrating compatibility between implementations. Once a change has lazy consensus, two implementations of the feature - demonstrating interopability must also be provided. One implementation MUST - be [`parquet-java`](http://github.com/apache/parquet-java). It is preferred +1. Design/scoping: The goal of this phase is to identify design goals of a + feature and provide some demonstration that the feature meets those goals. + This phase starts with a discussion of changes on the developer mailing list + (dev@parquet.apache.org). Depending on the scope and goals of the feature the + it can be useful to provide additional artifacts as part of a discussion. The + artifacts can include a design docuemnt, a draft pull request to make the + discussion concrete and/or an prototype implementation to demostrate the + viability of implementation. This step is complete when there is lazy + consensus. Part of the consensus is whether it is sufficient to provide two + working implementations as outlined in step 2, or if demonstration of the + feature with a downstream query engine is necessary to justify the feature + (e.g. demonstrate performance improvements in the Apache Arrow C++ Dataset + library, the Apache DataFusion query engine, or any other open source + engine). + +2. Completeness: The goal of this phase is to ensure the feature is viable, + there is no ambiguity in its specification by demonstrating compatibility + between implementations. 
Once a change has lazy consensus, two + implementations of the feature demonstrating interopability must also be + provided. One implementation MUST be + [`parquet-java`](http://github.com/apache/parquet-java). It is preferred that the second implementation be [`parquet-cpp`](https://github.com/apache/arrow) or [`parquet-rs`](https://github.com/apache/arrow-rs), however at the discretion @@ -63,12 +71,14 @@ viable, there is no ambiguity in its specification by demonstrating compatibilit demonstration of integration with a query engine is also required for this step. The implementations must be made available publicly, and they should be fit for inclusion (for example, they were submitted as a pull request against - the target repository and committers gave positive reviews). Reports on the benefits from closed source implementations - are welcome and can help lend weight to features desirability but - are not sufficient for acceptance of a new feature. + the target repository and committers gave positive reviews). Reports on the + benefits from closed source implementations are welcome and can help lend + weight to features desirability but are not sufficient for acceptance of a + new feature. Unless otherwise discussed, it is expected the implementations will be developed -from their respective main branch (i.e. backporting is not required), to demonstrate that the feature is mergeable to its implementation. +from their respective main branch (i.e. backporting is not required), to +demonstrate that the feature is mergeable to its implementation. 3. Ratification: After the first two steps are complete a formal vote is held on dev@parquet.apache.org to officially ratify the feature. After the vote @@ -110,8 +120,9 @@ For the purposes of this discussion we classify features into the following buck 2. Forward compatible. 
A file written under a newer version of the format with the feature enabled can be read under an older version of the format, but - some metadata might be missing or performance might be suboptimal. Simply phrased, forward compatible means all - data can be read back in an older version of the format. + some metadata might be missing or performance might be suboptimal. Simply + phrased, forward compatible means all data can be read back in an older + version of the format. 3. Forward incompatible. A file written under a newer version of the format with the feature enabled cannot be read under an older version of the format (e.g. From 0841c94e4750619a6030a8c5c8a4c7a7897f5f41 Mon Sep 17 00:00:00 2001 From: emkornfield Date: Fri, 12 Jul 2024 09:47:26 -0700 Subject: [PATCH 25/28] clarify new logical types --- CONTRIBUTING.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 5cdf53a84..702e23c35 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -122,7 +122,8 @@ For the purposes of this discussion we classify features into the following buck the feature enabled can be read under an older version of the format, but some metadata might be missing or performance might be suboptimal. Simply phrased, forward compatible means all data can be read back in an older - version of the format. + version of the format. New logical types are considered forward + compatible despite the loss of semantic meaning. 3. Forward incompatible. A file written under a newer version of the format with the feature enabled cannot be read under an older version of the format (e.g. 
From 4d6a9473e06f4cfdd588bdf489e33fcf8affb4e3 Mon Sep 17 00:00:00 2001 From: emkornfield Date: Fri, 12 Jul 2024 22:21:47 -0700 Subject: [PATCH 26/28] Address some comments --- CONTRIBUTING.md | 21 +++++++++++++++------ 1 file changed, 15 insertions(+), 6 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 702e23c35..38e70d91d 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -90,16 +90,19 @@ demonstrate that the feature is mergeable to its implementation. #### General guidelines/preferences on additions. 1. To the greatest extent possible changes should have an option for forward - compatibility (old readers can still read files). + compatibility (old readers can still read files). The 'compatibility and feature enablement' section below provides + more details on expectations for changes that break compatibility. -2. New encodings should be fully specified in this repository and ideally not +2. New encodings should be fully specified in this repository and not rely on an external dependencies for implementation (i.e. `parquet-format` is the source of truth for the encoding). If it does require an external dependency, then the external dependency must have its own specification separate from implementation. -3. New compression mechanisms must have a pure Java implementation that can be - used as a dependency in `parquet-java`. +3. New compression mechanisms should have a pure Java implementation that can be + used as a dependency in `parquet-java`, exceptions may be + discussed on the mailing list to see if a non-native Java + implementation is acceptable. ### Releases @@ -164,11 +167,17 @@ recommendations for managing features: For forward compatible changes which have a high chance of performance regression for older readers and forward incompatible changes, implementations should clearly document the compatibility issues. 
Additionally, while it is up -to maintainers of individual implementations to make the best decision to serve +to maintainers of individual open-source implementations to make the best decision to serve their ecosystem, they are encouraged to start enabling features by default along the same timelines as `parquet-java`. Parquet-java will wait to enable features by default until the most conservative timelines outlined above have been -exceeded. +exceeded. This timeline is an attempt to balance ensuring +new features make there way into the ecosystem and avoiding +breaking compatiblity for readers that are slower to adopt new standards. We encourage earlier adoption of new features when +an organization using Parquet can guarantee that +all readers of the parquet files they produce can read a new +feature. + For features released prior to October 2024, target dates for each of these categories will be updated as part of the `parquet-java 2.0` release process From 1f8178e5528cf0d8309f26b9f8244232145242d8 Mon Sep 17 00:00:00 2001 From: Micah Kornfield Date: Sat, 13 Jul 2024 05:25:12 +0000 Subject: [PATCH 27/28] add exceptions to top and reflow the rest of the content. --- CONTRIBUTING.md | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 38e70d91d..c3787716e 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -17,7 +17,7 @@ - under the License. --> -Recommendations and requirements for how to best contribute to Parquet. We strive to obey these as best as possible. As always, thanks for contributing--we hope these guidelines make it easier and shed some light on our approach and processes. +Recommendations and requirements for how to best contribute to Parquet. We strive to obey these as best as possible. As always, thanks for contributing--we hope these guidelines make it easier and shed some light on our approach and processes. 
If you believe there should be a change or exception to these rules please bring it up for discussion on the developer mailing list (dev@parquet.apache.org).
 
 ### Key branches
 - `master` has the latest stable changes
@@ -90,8 +90,9 @@ demonstrate that the feature is mergeable to its implementation.
 #### General guidelines/preferences on additions.
 
 1. To the greatest extent possible changes should have an option for forward
-   compatibility (old readers can still read files). The 'compatibility and feature enablement' section below provides
-   more details on expectations for changes that break compatibility.
+   compatibility (old readers can still read files). The 'compatibility and
+   feature enablement' section below provides more details on expectations for
+   changes that break compatibility.
 
 2. New encodings should be fully specified in this repository and not
    rely on an external dependencies for implementation (i.e. `parquet-format` is
@@ -100,7 +101,7 @@ demonstrate that the feature is mergeable to its implementation.
    own specification separate from implementation.
 
 3. New compression mechanisms should have a pure Java implementation that can be
-   used as a dependency in `parquet-java`, exceptions may be
+   used as a dependency in `parquet-java`, exceptions may be
    discussed on the mailing list to see if a non-native Java
    implementation is acceptable.
 
@@ -173,9 +174,9 @@ the same timelines as `parquet-java`. Parquet-java will wait to enable features
 by default until the most conservative timelines outlined above have been
 exceeded. This timeline is an attempt to balance ensuring
 new features make there way into the ecosystem and avoiding
-breaking compatiblity for readers that are slower to adopt new standards. We encourage earlier adoption of new features when
-an organization using Parquet can guarantee that
-all readers of the parquet files they produce can read a new
+breaking compatiblity for readers that are slower to adopt new standards.
We
+encourage earlier adoption of new features when an organization using Parquet
+can guarantee that all readers of the parquet files they produce can read a new
 feature.
 
 
From f05a25696bef52d46d2e44f5c379baa2533a523a Mon Sep 17 00:00:00 2001
From: emkornfield
Date: Fri, 12 Jul 2024 22:36:52 -0700
Subject: [PATCH 28/28] fix some typos, and sentence around keeping feature flags for compatibility

---
 CONTRIBUTING.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index c3787716e..f1fe768e1 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -170,15 +170,19 @@ regression for older readers and forward incompatible changes, implementations
 should clearly document the compatibility issues. Additionally, while it is up
 to maintainers of individual open-source implementations to make the best decision to serve
 their ecosystem, they are encouraged to start enabling features by default along
-the same timelines as `parquet-java`. Parquet-java will wait to enable features
+the same timelines as `parquet-java`. Parquet-java will wait to enable features
 by default until the most conservative timelines outlined above have been
 exceeded. This timeline is an attempt to balance ensuring
-new features make there way into the ecosystem and avoiding
+new features make their way into the ecosystem and avoiding
 breaking compatiblity for readers that are slower to adopt new standards. We
 encourage earlier adoption of new features when an organization using Parquet
 can guarantee that all readers of the parquet files they produce can read a new
 feature.
 
+After turning a feature on by default implementations
+are encouraged to keep a configuration to turn off the feature.
+A recommendation for full deprecation will be made in a future
+iteration of this document.
 
 For features released prior to October 2024, target dates for each of these
 categories will be updated as part of the `parquet-java 2.0` release process