
Versioning: Data Package v2 or v1.1? #858

Open
peterdesmet opened this issue Dec 20, 2023 · 16 comments

@peterdesmet
Member

Hi all, the communication on the Frictionless specs update names it v2 (version 2, see also #853 #857). The announcement blog post also states (emphasis mine):

The modular approach will of course still be the cornerstone of the Frictionless specs v2, and we won’t introduce any breaking changes.

I'm very happy no breaking changes will be introduced; I think that should be a guiding principle. But following semantic versioning, the specs update should then be a minor version. Given that all major specs† are currently v1, I would argue that the upcoming release is v1.1.

I understand that v2 indicates that there is serious momentum behind the current development (dedicated project, new website). But to anyone who's not closely following Frictionless, v2 seems like a major overhaul without backward compatibility. A v1.1 would (correctly) communicate that while Data Package is now its own standard, most things will work as expected. It also sets us on a path to incorporate more changes in future (minor) releases.

Sidenote: will we version Data Package (the collection of standards) as a whole or will the 4 standards be versioned separately (current approach)? I see benefits and downsides with both approaches.

†All major specs are v1: Data Package, Tabular Data Package, Data Resource, Tabular Data Resource and Table Schema. The exception is CSV Dialect, which is v1.2, but since it seems this one is being renamed to Table Dialect, one could argue to start over. Some of the other experimental specs (like Fiscal Package or Views) have other version numbers, like 1.0-rc.1 and 1.0-beta.

@khusmann

+1 -- When I heard the v2 announcement, I immediately assumed it would include breaking changes and was surprised to find it was going to be backwards compatible.

Was v2 chosen because v1.1 felt like it wasn't communicating enough "distance" from v1.0 given the new website, dplib, etc.? If so, a jump to v1.5 might be another option to create separation before/after this initiative, which I would interpret as "major overhaul but no breaking changes".

... that said my opinion isn't very strong on this, so I'm happy to defer to whatever strategy has the most consensus/momentum.

Sidenote: will we version Data Package (the collection of standards) as a whole or will the 4 standards be versioned separately (current approach)? I see benefits and downsides with both approaches.

I think this is an excellent question and definitely warrants further discussion. How it is handled seems intertwined with the standard's governance structure / processes moving forward... Is this the sort of thing we want to/are planning to cover in the working group?

@nichtich
Contributor

I would not be surprised if there were an edge case of some artificial piece of data that is compliant with 1.0 but not with the new version, because the existing wording allows things that were never intended to be allowed. Moreover, I think a version 2.0 will attract rather than discourage use.

@fjuniorr

fjuniorr commented Dec 22, 2023

I don't even think we will need artificial data to hit this problem. #379 and #697 are breaking changes that are likely to be discussed and that at some point were added¹ to frictionless-py v5.

Sidenote: will we version Data Package (the collection of standards) as a whole or will the 4 standards be versioned separately (current approach)? I see benefits and downsides with both approaches.

Thinking about "communication simplicity", I think they should be versioned as a whole. This quote from @roll captures the problem quite well:

For example, we would like to make our Python libs 100% compatible with/implementing the specs. TBH, at the moment I don't really understand what that means: whether there is a frozen v1 of the specs to be compatible with, where all the current spec changes go (a v1.1/v2 branch of the specs), etc.

To give another example, I can see how frictionless-r could support Tabular Data Resource v2 with #379 but not support CSV/Table Dialect v2 with #697. However, this creates an explosion in the number of ways a client could be "standard compliant", creating confusion for users.

Footnotes

  1. I think https://github.com/frictionlessdata/specs/issues/379 was removed after https://github.com/frictionlessdata/frictionless-py/issues/868, but frictionless-py 5.16.0 converts "dialect": {"delimiter": ";"} to "dialect": {"csv": {"delimiter": ";"}} unless system.standards = "v1" is specified. I noticed this after having some difficulties in creating data packages that would play nice with both frictionless-py and frictionless-r.
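
To make the footnote concrete, here are the two descriptor shapes it refers to, written out as a small sketch (the resource name and path are made up; the two dialect shapes come straight from the footnote above):

```python
# Dialect as written per the v1 specs: CSV Dialect properties sit
# directly under "dialect".
resource_v1 = {
    "name": "example",  # hypothetical resource
    "path": "data.csv",
    "dialect": {"delimiter": ";"},
}

# Dialect as emitted by frictionless-py 5.16.0 unless system.standards = "v1"
# is set: the CSV properties are nested under a "csv" key (cf. #697).
resource_v5 = {
    "name": "example",
    "path": "data.csv",
    "dialect": {"csv": {"delimiter": ";"}},
}
```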

@roll roll added this to the v2 milestone Jan 3, 2024
@roll roll added the general label Jan 3, 2024
@roll roll removed this from the v2 milestone Jan 3, 2024
@roll
Member

roll commented Jan 4, 2024

I think it's a valid point, and as a Working Group, we can vote on the version when we have finished the changelog.

Peter outlined the pros of staying on v1.1, so I'll add some arguments in favor of v2:

  • I think we should try pursuing the idea of having no breaking changes in the specs, forever. That sounds really doable in my opinion, as it's only a matter of not changing data types and not strongly changing the semantics of existing metadata properties. So if we stick to v1.1 we might never get a v2 at all (which isn't bad, just stating it)
  • At the same time, we're currently doing the first update in 5 years, and it will include many (mostly minor) changes and some new features. In the future, if e.g. we need to add just one thing like package.propX, by semver we would still be bumping to v1.3. So we would get two versions, v1.2 and v1.3 (and following), that are not comparable in size and importance. I think a v2 followed by small v2.1, v2.2, etc. releases would communicate the structure of the changes better
  • As already mentioned, naming major updates with major versions (v2, and let's say v3 in a few years), even if they are not breaking, is just easier for communication, funding, etc.

TBH, I'm not sure the specs need 100% compliance with semver, as they're not software. For example, JSON Schema versioning was "Draft X" for years and is now yyyy-mm based. Honestly speaking, those "Draft X" labels looked really weird, but they kinda worked: implementors just thought about being compliant with draft "version X".

@roll
Member

roll commented Jan 5, 2024

@peterdesmet
I think we need to treat the core standard and the domain-specific extensions as separate projects, so it will be core vX, camtrap vY, fiscal vZ, etc. So I would just version the datapackage repository as a whole (I guess you do the same for camtrap).

PS.
Fiscal Data Package, as a domain-specific extension, has moved to its own project: https://github.com/frictionlessdata/datapackage-fiscal

@khusmann

I just realized "backwards compatibility" / "no breaking changes" has different levels/types of strictness, and I'm not clear where we stand:

  1. An implementation designed for v2 spec should be equally capable of reading v1 data packages

  2. An implementation designed for v1 spec should be capable of reading v2 data packages (albeit with reduced features)

Different types of modifications to the spec break in different ways:

  • adding a new optional prop in v2 does not break either type of compatibility

  • removing a prop in v2 breaks (1) but not (2)

  • changing a prop type from integer in v1 to integer | string in v2 breaks (2) but not (1)

etc.

In general, it's easier to upgrade software than existing data artifacts... so I'd argue we should hold to (1) and relax (2) to give us more freedom for v2 improvements. It also puts me squarely in the v2 semver camp: although a given v2 spec implementation will be "backwards compatible with v1 data", it is still "breaking" in that v2 data will not necessarily work with a v1 implementation.
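
A minimal sketch of the third bullet above, using the Python jsonschema package and a made-up property name, to show how widening a type keeps direction (1) intact while breaking direction (2):

```python
from jsonschema import validate, ValidationError

# Hypothetical profile fragments: "prop" is an integer in v1 and is widened
# to integer-or-string in v2 (the names are illustrative, not from the specs).
profile_v1 = {"type": "object", "properties": {"prop": {"type": "integer"}}}
profile_v2 = {"type": "object", "properties": {"prop": {"type": ["integer", "string"]}}}

v1_descriptor = {"prop": 42}      # written against the v1 spec
v2_descriptor = {"prop": "high"}  # uses the widened v2 type

# Direction (1): a v2 implementation reads v1 data -- still fine.
validate(v1_descriptor, profile_v2)

# Direction (2): a v1 implementation reads v2 data -- this is what breaks.
try:
    validate(v2_descriptor, profile_v1)
except ValidationError as error:
    print(error.message)  # "'high' is not of type 'integer'"
```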

@peterdesmet
Member Author

Thanks @khusmann for the summary, I completely agree that we should hold to (1) and relax (2), i.e. future software applications should still be able to read v1 data packages (since those will be around for a long time), but can be slow in adopting new features of v2.

I draw a different conclusion regarding the versioning though, since a v2 spec suggests (to me) that software implementations can at some point give up on v1. A v1.1 indicates that this is still within the same major version of the spec.

@roll
Member

roll commented Jan 25, 2024

@peterdesmet
Answering frictionlessdata/datapackage#12 (comment) as I think it will be good to have everything related to the versioning discussion in one place.

Why is it structurally non-breaking for implementations?

By a structurally breaking change I mean something that would fail all implementations on the next nightly build. That would happen if we made a breaking change to one of the JSON Schema profiles, e.g. changing schema.fields to be a mapping instead of an array.

Unfortunately, because the specs were in some places written very broadly, we also have a grey zone. Maybe finiteNumber was a bad example of it, but something like the any format for dates is: the specs just say that it's implementation-specific, so changing this would be breaking in an implementation-specific way.

So in my head, for v2 I have these tiers (and my opinion on whether changes are possible):

  • profile (JSON Schema) level breaking -> no for v2 (and probably no forever)
  • semantically/grey-zone/etc. breaking -> discussable
  • not breaking -> yes
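
To make the first tier concrete, here's a hypothetical sketch of the schema.fields example mentioned above (the field names are made up):

```python
# Every existing Table Schema lists fields as an array of field descriptors:
schema_today = {
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "title", "type": "string"},
    ]
}

# Redefining schema.fields as a mapping keyed by field name would make every
# existing descriptor fail JSON Schema (profile) validation on the next
# nightly build -- the kind of change ruled out for v2 (and probably forever):
schema_hypothetical = {
    "fields": {
        "id": {"type": "integer"},
        "title": {"type": "string"},
    }
}
```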

@roll
Member

roll commented Jan 25, 2024

Also, it's a specific of working on standards that many kinds of new features (e.g. an added property) don't have full forward compatibility: a new constraint, for example, kind of breaks the validation completeness of current implementations. So maybe this kind of change could be what differentiates major from minor in our case (see the sketch after these bullets). E.g.:

  • source.version -> minor, as it's covered by JSON Schema validation
  • constraints.inclusiveMaximum -> major, as it requires implementation updates and affects validation completeness
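
A rough sketch of why a new constraint affects validation completeness (the field and the older constraint list are illustrative, not exhaustive):

```python
# A field written against a spec version that has the new constraint:
field = {
    "name": "score",
    "type": "integer",
    "constraints": {"inclusiveMaximum": 100},
}

# An implementation that only knows the earlier constraint vocabulary will
# silently skip the unknown key and report out-of-range rows as valid,
# whereas an updated implementation would flag them. Adding source.version,
# by contrast, is fully handled by JSON Schema validation of the descriptor.
known_constraints = {"required", "unique", "minimum", "maximum", "enum", "pattern"}
enforced = {k: v for k, v in field["constraints"].items() if k in known_constraints}
print(enforced)  # {} -> inclusiveMaximum is never enforced by the older implementation
```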

@peterdesmet
Member Author

@roll, since you wanted everything related to versioning to be part of this discussion, I'm also referring to this comment by @khughitt and me regarding implementations retrieving or detecting the version of the Data Package spec:

Tangential but, this makes me wonder whether it would make sense to modify the validation machinery to support validating against earlier versions of the spec?

That would be useful, but rather than implementations (or users) guessing what version of the spec was used for a datapackage.json, it would likely be good if that were indicated. I don't think this is currently possible?

@roll
Member

roll commented Jan 25, 2024

I think on the Standard side, we need to decide whether we provide standard version information for an individual descriptor e.g. as proposed here #444

I think every implementation is free to decide how to handle it, as it's just a matter of resources. E.g. some implementations can have a feature to validate against versions X, Y, and Z, and some just against Y.

Note that currently we consider datapackage.json to be versionless.

@peterdesmet
Member Author

peterdesmet commented Jan 29, 2024

I think the rules for changing the Data Package spec should be declared (on the spec website or elsewhere). I currently find it difficult to assess whether PRs follow the rules. Here's a first attempt:

General rules

(in line with @khusmann's statement that software is easier to update than data artifacts #858 (comment))

  1. An existing datapackage.json that is valid MUST NOT become invalid in the future.
  2. A new datapackage.json MAY be invalid because a software implementation does not support the latest version of the specification (yet).

Because of these rules, a datapackage.json does not have to indicate what version of Data Package it uses (i.e. it is versionless). Implementations have no direct way of assessing the version (even though this would make it easier, see #858 (comment), it is not something that we can require from data publishers, imo).

Versioning

  1. The Data Package specification is versioned. This is new over 1.0, where changes were added without increasing the version.
  2. The Data Package specification is versioned as a whole: a number of changes are considered, discussed, added or refused and released as a new minor version.

Property changes

  1. A property MUST NOT change type
  2. A property MAY allow an additional type (e.g. also accept an array). @roll, you want to avoid this as a rule, but it does offer flexibility, cf. Make contributor role an array of strings #804 (comment)
  3. A property MUST NOT become required
  4. A property MAY become optional. Example: Make contributors[].title and sources[].title not required datapackage#7
  5. A property MUST NOT add an enum
  6. A property MAY remove an enum. Example: Allow free text role for the contributors property #809 (see the sketch after this list)
  7. A property MUST NOT remove enum values
  8. A property MAY add enum values
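
A small sketch of the enum rules (5-8), again using the Python jsonschema package; the v1 role list is quoted from memory and the rejected value is made up:

```python
from jsonschema import validate, ValidationError

# Profile fragment for contributors[].role with the v1 enum, and the same
# property after the enum is removed (#809):
role_with_enum = {
    "type": "string",
    "enum": ["author", "publisher", "maintainer", "wrangler", "contributor"],
}
role_free_text = {"type": "string"}

# Rule 6: removing the enum keeps every previously valid value valid.
validate("author", role_free_text)

# Rule 5 (mirror image): adding an enum would invalidate existing free-text
# values such as this hypothetical one.
try:
    validate("data steward", role_with_enum)
except ValidationError as error:
    print(error.message)
```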

Table schema changes

  1. A field type MUST NOT change its default format. Example: does Updated date/time definitions datapackage#23 align with this?
  2. A field type MUST NOT remove format pattern options
  3. A field type MAY add format pattern options

New properties

  1. A new property MAY make a datapackage.json invalid (because of general rule 2). Example: Added field.missingValues datapackage#24
  2. A new property CANNOT be required

Removed properties

  1. Removing a property CANNOT make a datapackage.json invalid (because of general rule 1)

@khughitt

khughitt commented Feb 1, 2024

Thanks for taking the time to put this together, @peterdesmet! This seems like a great idea.

I think it would be useful to use this as a starting point for a description of the revision process in the docs.

I'll create a separate issue so that it can be tracked separately from the issue discussion here.

@fomcl

fomcl commented Mar 7, 2024

My 2 cents here:

@roll
Member

roll commented Apr 26, 2024

@peterdesmet
Regarding provisional properties, I think we have an even more elegant solution: for example, using a special Data Package Draft/Next extension (or a profile per feature) where we can test new features and ideas without actually affecting the core specs themselves. Users would just need to use a draft Data Package profile to join the testing.

And then, once we have an established release cycle, we can merge tested features into the core specs on a schedule. Using this approach, feature development can even be decentralized.
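
A loose sketch of what opting in could look like for a publisher (the profile URL and the mechanism are hypothetical, not an agreed design):

```python
# A package that joins testing by pointing its profile at a hypothetical
# Data Package Draft/Next profile instead of the core one:
package = {
    "name": "example-package",
    "profile": "https://example.org/profiles/data-package-next.json",  # hypothetical URL
    "resources": [
        {"name": "data", "path": "data.csv"},
    ],
}
```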

@peterdesmet
Member Author

peterdesmet commented Apr 29, 2024

@roll sounds promising, would have to see it in action to fully understand. 😄
