-
Notifications
You must be signed in to change notification settings - Fork 422
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2489: Strawman guidance on feature releases #258
base: master
Are you sure you want to change the base?
Conversation
Co-authored-by: Rok Mihevc <rok@mihevc.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking the lead on this @emkornfield. It will be very helpful to have this codified.
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
@etseidl thank you for the proof reading |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is overly prescriptive wrt. timing of introducing and enabling new features. We should let every implementation choose their support policy, based on their user demographics and their respective targeted use cases.
cc @alamb
CONTRIBUTING.md
Outdated
|
||
In some cases in addition to library level implementations it is | ||
expected the changes will be justified via integration into a | ||
processing engine to show their viability. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a "processing engine" in this context? Perhaps this is out of scope?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, the intent here is to demonstrate some real world benefit to justify the cost. For instance, for the recent size statistics addition we probably should have required some benchmark numbers showing a clear benefit to justify the cost of creating and storing them. My GPU workflow was made much faster due to the inclusion of the unencoded byte array size information and could have provided hard numbers demonstrating that. Similarly, it would have been nice to have a pushdown example that was made possible because of the histograms. Simply adding the feature and showing that two implementations produced them in a compatible way was probably insufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to rephrase and in a new pull request moved this to step 1.
CONTRIBUTING.md
Outdated
library. | ||
|
||
2. Forwards compatible. A file written by a new version of the library can be read by an older version | ||
of the library. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly speaking, this is true of almost no format enhancements, since the additional semantics couldn't be understood under an older version of the format.
Perhaps this should be relaxed a bit? For example:
Forwards compatible. A file written under a newer version of the format can be read under an older version of the format, but some information might be missing or performance might be suboptimal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to use your wording (will be present in next commit).
Would it address most of your concerns if I replaced the "SHOULD" language, I'll revise to do so and we can hopefully iterate from there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for writing this up @emkornfield -- it is very much needed and helpful
CONTRIBUTING.md
Outdated
### Releases | ||
|
||
The Parquet community aims to do releases of the format package only as needed | ||
when new features are introduced. If multiple new features are being proposed | ||
simultaneously some features might be consolidated into the same release. | ||
Guidance is provided below on when implementations should enable features added | ||
to the specification. Due to confusion in the past over parquet versioning it | ||
is not expected that there will be a 3.0 release of the specification in the | ||
foreseeable future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given there seems to be confusion about what even constitutes a parquet "release" maybe we simply should say that the parquet-format has no overall release version identifier and instead is a collection of individual features?
Or maybe we should give it a version date ("parquet-format as of 2024-06-07" etc) 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think at least for parquet-java we need a version identifier compatible with Maven because I think that is how it is consumed. We can maybe discuss separately just inlining the generated code there as well, as i think that is typically what other distributions do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC @wgtmac thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking more about this, I think releases are also potentially helpful for any third-party java implementations beyond parquet java. I don't have strong feelings here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although the version identifier of parquet-format is compatible with Maven, it looks much weaker than usual Java libraries. IMO, a new version of parquet-format usually includes a set of new feature additions and format clarification/correction, meaning that a new version should be a good superset of previous versions. Therefore we should always stick to the latest version of parquet-format wherever possible. An exception is the deprecation of features to the format where a major version bump is required.
CONTRIBUTING.md
Outdated
the feature enabled cannot be read under and older version of the format | ||
(e.g. Adding a new compression algorithm. | ||
|
||
The Parquet community hopes that new features are widely beneficial to users of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think trying to create a process to roll out a feature across all implementations is an admirable goal, however I am not sure there is sufficient consensus to include it in this document. If we include this (aspirational) process I fear there is a non trivial chance it would not be followed in practice, and thus its inclusion would be confusing
Thus I suggest we consider not including the rollout in docs
( I think this is similar to what @pitrou is saying here #258 (review))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to remove almost all prescriptions that actually require effort out of the suggestions. I feel that if we don't provide some guidance the ecosystem will never really move forward. If possible it would be nice to focus on concrete concerns. Are there any parts of what is documented here that you see as problematic as a maintainer from arrow-rs? Also @pitrou anything you see as specifically problematic from a parquet-cpp perspective?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the update. However, it is my opinion that the updated text still adds more potential for confusion than it adds. Specifically trying to proscribe the rollout cadence across loosly (at best) coordinated implementations in a document is unlikely to change behaviors and is likely to create confusion
Are there any parts of what is documented here that you see as problematic as a maintainer from arrow-rs?
One specific example is the mention of feature flags. While I am sure we could over come the technical details, as a maintainer I would not want to be in the position to try and dictate when a featre should be used by our users -- instead I would imagine that as new features were supported in parquet-rs there would be an appropriate writer configuration that users would decide to enable/disable as they deemed appropriate.
Another potential improvement (orthogonal to this disucssion) would be to label new format additions with the date in which they were added to the spec, and perhaps track the date when support was added to other implementations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifically trying to proscribe the rollout cadence across loosly (at best) coordinated implementations in a document is unlikely to change behaviors and is likely to create confusion
Is there specific language here that you find problematic here? With the latest revision, my intent was to make the recommendation only about preserving compatibility. i.e. forward incompatible changes must wait at least 2 years before an implementation decides to turn them on by default. I wonder if changing the phrasing from MAY to MUST NOT USE before would help clarify?
instead I would imagine that as new features were supported in parquet-rs there would be an appropriate writer configuration that users would decide to enable/disable as they deemed appropriate.
I agree. Maybe this is a terminology issue. I view feature flag as a type of configuration, and I think I removed all recommendations for when, if ever, an implementation should turn something on by default (please let me know if something is still ambiguous or maybe I missed something in rephrasing).
Another potential improvement (orthogonal to this disucssion) would be to label new format additions with the date in which they were added to the spec, and perhaps track the date when support was added to other implementations
Agreed, I'll add this to the process I thought the feature matrix that was already underreview would be a good place for this? I would like to avoid putting too many dates directly in the spec to avoid confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took another read -- I think it was easy to miss the sentence that says the rollout dates are guidelines.
Therefore, the Parquet PMC gives the following guidance for changing a feature to be "on" by default:
So I guess I got confused that the text says it is providing guidance and then uses MAY which suggests to me the process as a whole is required but the individual steps have options. I think we should just make it clear these are recommended best practices
What about toning it down a bit like instead of
Forward compatible features/changes MAY be used by default in implementations
once the parquet-format containing those changes has been formally released.
For features that may pose a significant performance regression to prior
format readers, libaries SHOULD consider delaying until 1 year after the
release of the parquet-java implementation that contains the feature
implementation. Implementations MAY choose to do a major version bump when
turning on a feature in this category.
Something like
Forward compatible features/changes may be used by default in implementations
once the parquet-format containing those changes has been formally released.
For features that may pose a significant performance regression to prior
format readers, libraries should wait 1 year after the
release of the first parquet-java implementation that contains the feature
implementation.
I also think the major version bump suggestion is so library specific it is not worth putting in the guidance. The libraries have their own cadence (e.g. parquet-rs and parquet-cpp have major releases based on calendar cycles rather than feature content)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to rephrase to make the issues above clearer.
I also think the major version bump suggestion is so library specific it is not worth putting in the guidance. The libraries have their own cadence (e.g. parquet-rs and parquet-cpp have major releases based on calendar cycles rather than feature content)
My intent here was not to specify a release cadence but to specify that if changing a forward incompatible change to default the next release must be a major version release, I will try to clarify. I think parquet-rs and parquet-cpp trivially satisfy this given the current release policies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb thanks for feedback, I've update to clarify and remove upper case for SHOULD/MAY language, please let me know if the current iteration still raises concerns.
I think it could be rephrased as an example of how to deal with feature additions, rather than a recommendation. |
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
My goal here is to provide at least a little bit stronger then just an example, I think we do want to try to align implementations for compatiblity purposes and moving the ecosystem further. Maybe we can continue the conversation here on the @alamb's comment |
- Soften language on recommendations and add clarifications - Fix some grammar issues.
discussion concrete. This step is complete when there is lazy consensus. Part | ||
of the consensus is whether it sufficient to provide 2 working | ||
implementations as outlined in step 2 or if demonstration of the feature with | ||
a down-stream query engine is necessary to justify the feature (e.g. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is overly specific IMHO. Not all features are developed in relation with a "down-stream query engine". For example, the BYTE_STREAM_SPLIT additions were motivated by size measurements with different encodings and compression algorithms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I think a lot of features will not need this per @etseidl point on a previous revision, this helps avoid extraneous features that sound good in practice but don't provide much value in practice. Other areas where this is more likely to apply could be for more complicated indexing schemes.
CONTRIBUTING.md
Outdated
the source of truth for the encoding). | ||
|
||
3. New compression mechanisms must have a pure Java implementation that can be | ||
used as dependency in parquet-java. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FTR, is it the case for all current compression algorithms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't audited myself, on the discussion it was asserted it was the case for parquet-java.
CONTRIBUTING.md
Outdated
format readers, libaries should consider delaying default enablement until 1 | ||
year after the release of the parquet-java implementation that contains the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still skeptical that this should suggest a precise delay for third-party implementations. Or, alternatively, replace "should" with "may"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a point of discussion. I'm OK changing the language either way but would like to get more opinions before doing so. I think specifying some timeline is important, even if non-binding for third party implementations as it gives third-parties a sense of timeline for adopting features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we can either invite more parties to this discussion, or raise it on the ML?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'll plan on doing one more pass to address open comments then start a discuss thread on the mailing list (hopefully I will have time to do this today)
CONTRIBUTING.md
Outdated
3. Forwards incompatible features/changes should not be turned on by default | ||
until 2 years after the parquet-java implementation containing the feature is | ||
released. It is recommended that changing the default value for a forward | ||
incompatible feature flag be done as part of a major release of an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that "major release" is vague and potentially misleading, I would remove this recommendation as it is not really helpful for implementors, IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll see if I can rephrase to something more reasonable otherwise we can remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to rephrase by only giving examples, let me know if you think this language is still too strong or still not helpful.
feature implementation. | ||
|
||
3. Forwards incompatible features/changes should not be turned on by default | ||
until 2 years after the parquet-java implementation containing the feature is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still skeptical about a delay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per above, I can understand the skepticism, I'd like to get more opinions here and we can update accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add a shortcut like "or a PMC vote was in favour of it". If something gets really fast adoption, then we could enable it sooner.
How did you come up with "2 years"? Was this a guess or a deeper reasoning behind it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion is to avoid this sentence and just start with "It is recommended that changing the default value for a forward incompatible feature flag should be clearly advertised to consumers..."
Otherwise, this could be read as "any change made to parquet will not even plausibly be used for at least two years". I think it would be a shame if people used this 2 years as a reason to slow down their adoption of new parquet features
I think parquet would be better off if we let market demand drive the proliferation of features rather than trying to control a rollout. The thinking is that if new files begin appearing that are not readable by some system that is lagging, it would be beneficial to add pressure on that system upgrading their support
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add a shortcut like "or a PMC vote was in favour of it". If something gets really fast adoption, then we could enable it sooner.
I think we should either do away with the guidance or release it. PMC can always modify the language necessary via vote if necessary, but I'm not sure there is a good scenario where we want to actively break guidance?
How did you come up with "2 years"? Was this a guess or a deeper reasoning behind it?
Mostly guesstimate and it might actually be too short. This assumes a least some downstream dependencies only do major releases on a yearly basis as well. So it leaves time for a downstream system to pickup a release in the following year and then ~1 year for rollout. Taking Spark as an example, it appears to average about 3.5 years beetween major releases. Minor releases are aimed at every six-months, but not everyone is going to use a release immediately. If you look at EMR ignoring the LTE release of 2.4.8, they are currently supporting 3.5.0 and 3.4.1. 3.4.1 was released 9 months ago and ~1 year ago respectively. Dataproc, explicitly supports images for 24 months, but the actual Spark versions are older then AWS (i.e. 3.3)
If we do a parquet 2.0 release with a breaking change, we will probably learn how long it has taken people to actually adopt some of the newer breaking features.
I think parquet would be better off if we let market demand drive the proliferation of features rather than trying to control a rollout. The thinking is that if new files begin appearing that are not readable by some system that is lagging, it would be beneficial to add pressure on that system upgrading their support
I think this ignores how long it takes changes to roll through the ecosystem, I detailed more in my reply on why 2 years above. I'd welcome people to start experimenting features early but I think not recommending at least some delay will cause a lot of unnecessary pain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this ignores how long it takes changes to roll through the ecosystem, I detailed more in my reply on why 2 years above
I am quite sure that stable mature enterprise platforms will take 2+ years to adopt any changes. However, this doesn't mean we should offer guidance that the entire ecosystem should follow that pace.
Instead I think we (the parquet community) should offer help / visibility into how far the feature has percolated through the ecosystem (e.g. by a compatibility matrix) rather than some prescriptive proposal of a rollout schedule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thinking is that if new files begin appearing that are not readable by some system that is lagging, it would be beneficial to add pressure on that system upgrading their support
Having a not-before date published is a means of applying pressure without breaking anything. I think it is much more reasonable messaging for downstream consumers to say, "you are seeing files that you can't read because you haven't updated your software for a long time". It seems less reasonable to me to say to someone you are seeing "files you can't read because we a new feature was designed and implemented last month and we decided to make it default so you should upgrade your software as well and/or contribute an implementation to the software you are using". Also, maintainers have differing priorities, motivations and bandwidth. "Pressuring" them because things are broken immediately rather than giving a reasonable time frame for implementation does not seem to me to be a considerate thing to do.
I am quite sure that stable mature enterprise platforms will take 2+ years to adopt any changes. However, this doesn't mean we should offer guidance that the entire ecosystem should follow that pace.
Perhaps there is more word-smithing needed here. This guidance is about when forward incompatible files should start being written by default. I expect that early adopters will opt into features sooner, especially if new features are appropriately publicized. Early adopters are also usually more sophisticated and can better assess the risks for there use-case. There is already language here saying users can decide for themselves if they are sure they are using Parquet in a closed ecosystem (maybe it can be strengthened?). I am more concerned about late adopters. I'd love to see new features "on" by default, but I just don't think it is serves the broader ecosystem well to do so. As a point of reference even two years might not be sufficient as seen by trinodb/trino#7953 (part of this is likely partially caused over confusion of "V2").
Instead I think we (the parquet community) should offer help / visibility into how far the feature has percolated through the ecosystem (e.g. by a compatibility matrix) rather than some prescriptive proposal of a rollout schedule.
I think both are useful. A matrix doesn't help further adoption without a mechanism for guiding when a feature should be turned on by default. I think it is especially difficult without more contextualized information about relative popularity of readers (e.g. if a parquet haskell implementation is lagging behind on a feature should it give pause for enablement).
Ultimately, implementations not under the governance of the Parquet PMC are free to do what they want but I think providing official and consistent guidance here to give some lead time for breaking changes is in the best interest of everyone using Parquet.
To annotate some of the options already suggested on the thread:
- Don't provide guidance. I think this will result in either the current status quo which is to never turn anything on by default OR lead to many incompatibilities across the ecosystem.
- PMC votes on when to recommend a feature be turned on (possibly informed by feature matrix?). I don't think this provides consistency (or if there is a consistent means of making decision we should simply document it here and avoid the overhead of votes).
- Time based implementation/implementations. Provides a consistent criteria that other implementations can understand and take action on.
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
CONTRIBUTING.md
Outdated
|
||
3. Forwards incompatible. A file written under a newer version of the format with | ||
the feature enabled cannot be read under an older version of the format (e.g. | ||
adding and using a new compression algorithm). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One clarifying question here - does "cannot be read under an older version of the format" allow for "cannot be read correctly under an older version of the format"? I suspect we want that be considered a breaking change.
For reference I have the Arrow spec in mind which defines forward compatibility as "An older client library must be able to either read data generated from a new client library or detect that it cannot properly read the data." (emphasis mine). Is the "detect" case what we're calling "Forwards incompatible" here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to clarify this. Forward incompatible features I think should provide a signal to old readers so ideally an appropriate error message can be returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more suggestions.
As to the most controversial section, I think the current wording strikes a good balance. Implementers should consider the ramifications of defaulting to breaking changes, and I like the idea of providing a deadline for downstream apps to adapt. Otherwise Parquet will always be trapped in amber, unable to steer users towards beneficial new features because some downstream project refuses to update for a decade. And these are now recommendations (unlike earlier revisions), so an implementation could still decide that a new feature is so beneficial for its user base that it's worth risking compatibility problems (so long as it clearly documents how to revert to non-breaking behavior). Most implementations already seem to be pretty conservative with their choice of defaults, so I don't think this guidance will be too burdensome.
Overall I'm +1 for the release process, and +0.5 for the last section.
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Make sure you have checked all steps below.
Jira
Commits
Documentation