-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Amend RFC 1000: Revision ids should be hashes of the revision contents #2
Comments
The current spec explicitly avoids talking about the schema contents. The only thing that is currently enforced is the order of migrations. The main reason is that we want the migration scripts to be editable. That said, your comment made me realize that the RFC does not explicitly talk about the "latest migration" record, or the migration history "HEAD" if you will. This could be a simple file with the id of the topmost migration. |
Sure. That's why it's important to have hashes. it's fine to edit migrations until you applied them to all persistent servers (i.e. perhaps production, but maybe also staging that you can't restore from earlier backup if you need to). And editing later means revisions are out of sync in different installations of the database (i.e. we will never reapply revision with id already applied, even if that is edited). And this can lead to the revision mess that goes unnoticed quite some time and which is hard to fix later on. It's not clear from the initial proposal, but my current idea is that hash of the migration is computed by migration tool, at the time of execution. So text is editable, but if you edited migration that already applied, it conflicts. You have two options then: revert that file or restore the database from the backup. I know that we are used to editing that in alembic (an presumably all other existing tools), but alembic does not store history in the database that can go out of sync. So if you need to fix something in alembic you can run DDLs manually and change the revision. In our case, as long as I understand, migrations in the edgedb and in files will be out of sync forever at this point, with no way to find out what was actually applied at any given point. That said, all of it is very important for schema migration, an might be much less important for the data migrations. Applying different data migrations on staging and production might even be useful, but not at the cost of editing migration history.
Not sure it's important. I think alembic treats as "HEAD" anything that that has no child revisions. |
OK, we can make a note of the migration content hash when the migration is committed. If such integrity is the only concern, then this can be a field in the migration object. |
Motivation
Each revision is a file in
dbschema/migrations/*
per spec. Files are stored in revision control, so technically can be edited after being registered in the database server. Even if we explicitly document that editing files is supported, users can still fail on fixing merge conflicts. The erroneous migration is hard to fix if we don't have hashes because it's unclear which contents is the source of truth if two branches were merged (i.e. id is opaque, and any revision could be the right one depending on which branch was applied first, and order might be different on different staging servers or even canary deployments).Assumptions
It should be possible to hash revision contents based solely on tokenization of the file without understanding the semantics of the data.
Alternatively, it might be possible to implement full AST or simplified AST (parse tree) of the revision file and hash that, if some forms of statements are ambiguous.
The text was updated successfully, but these errors were encountered: