Amend RFC 1000: Revision ids should be hashes of the revision contents #2

tailhook · 2020-02-03T10:05:59Z

Motivation

Each revision is a file in dbschema/migrations/* per spec. Files are stored in revision control, so technically can be edited after being registered in the database server. Even if we explicitly document that editing files is supported, users can still fail on fixing merge conflicts. The erroneous migration is hard to fix if we don't have hashes because it's unclear which contents is the source of truth if two branches were merged (i.e. id is opaque, and any revision could be the right one depending on which branch was applied first, and order might be different on different staging servers or even canary deployments).

Assumptions

It should be possible to hash revision contents based solely on tokenization of the file without understanding the semantics of the data.

Alternatively, it might be possible to implement full AST or simplified AST (parse tree) of the revision file and hash that, if some forms of statements are ambiguous.

The text was updated successfully, but these errors were encountered:

elprans · 2020-06-15T20:58:52Z

The current spec explicitly avoids talking about the schema contents. The only thing that is currently enforced is the order of migrations. The main reason is that we want the migration scripts to be editable.

That said, your comment made me realize that the RFC does not explicitly talk about the "latest migration" record, or the migration history "HEAD" if you will. This could be a simple file with the id of the topmost migration.

tailhook · 2020-06-16T08:48:17Z

The current spec explicitly avoids talking about the schema contents. The only thing that is currently enforced is the order of migrations. The main reason is that we want the migration scripts to be editable.

Sure. That's why it's important to have hashes. it's fine to edit migrations until you applied them to all persistent servers (i.e. perhaps production, but maybe also staging that you can't restore from earlier backup if you need to). And editing later means revisions are out of sync in different installations of the database (i.e. we will never reapply revision with id already applied, even if that is edited). And this can lead to the revision mess that goes unnoticed quite some time and which is hard to fix later on.

It's not clear from the initial proposal, but my current idea is that hash of the migration is computed by migration tool, at the time of execution. So text is editable, but if you edited migration that already applied, it conflicts. You have two options then: revert that file or restore the database from the backup.

I know that we are used to editing that in alembic (an presumably all other existing tools), but alembic does not store history in the database that can go out of sync. So if you need to fix something in alembic you can run DDLs manually and change the revision. In our case, as long as I understand, migrations in the edgedb and in files will be out of sync forever at this point, with no way to find out what was actually applied at any given point.

That said, all of it is very important for schema migration, an might be much less important for the data migrations. Applying different data migrations on staging and production might even be useful, but not at the cost of editing migration history.

That said, your comment made me realize that the RFC does not explicitly talk about the "latest migration" record, or the migration history "HEAD" if you will. This could be a simple file with the id of the topmost migration.

Not sure it's important. I think alembic treats as "HEAD" anything that that has no child revisions.

elprans · 2020-06-17T16:32:16Z

OK, we can make a note of the migration content hash when the migration is committed. If such integrity is the only concern, then this can be a field in the migration object.

tailhook added the enhancement New feature or request label Feb 3, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Amend RFC 1000: Revision ids should be hashes of the revision contents #2

Amend RFC 1000: Revision ids should be hashes of the revision contents #2

tailhook commented Feb 3, 2020

elprans commented Jun 15, 2020

tailhook commented Jun 16, 2020

elprans commented Jun 17, 2020

Amend RFC 1000: Revision ids should be hashes of the revision contents #2

Amend RFC 1000: Revision ids should be hashes of the revision contents #2

Comments

tailhook commented Feb 3, 2020

Motivation

Assumptions

elprans commented Jun 15, 2020

tailhook commented Jun 16, 2020

elprans commented Jun 17, 2020