Conversation

@migmartri migmartri commented Aug 31, 2023

Instead of running the migration as a separate job, which had some issues (#309, #271), we now run the migration as an init container.

I also double-checked that our backend, Postgres, can safely handle concurrent migration requests from different replicas using advisory locking, as stated in the Atlas upstream documentation.

--- /tmp/before.yaml	2023-08-31 17:28:50.707396512 +0200
+++ /tmp/after.yaml	2023-08-31 17:31:46.650255138 +0200
@@ -763,6 +763,17 @@
       serviceAccountName: foo-chainloop-controlplane
       securityContext:
         null
+      initContainers:
+        - name: migrate
+          image: "ghcr.io/chainloop-dev/chainloop/control-plane-migrations:v0.16.2"
+          imagePullPolicy: 
+          args:
+            - migrate
+            - apply
+            - --url
+            - "postgres://chainloop:chainlooppwd@foo-postgresql:5432/chainloop-cp?sslmode=disable"
+            - --dir
+            - file:///migrations
       containers:
         
         - name: chainloop
@@ -1192,46 +1203,3 @@
       volumeMounts:
   volumes:
   restartPolicy: Never
----
-# Source: chainloop/templates/controlplane/migrate-job.yaml
-apiVersion: batch/v1
-kind: Job
-metadata:
-  name: foo-chainloop-controlplane-migrate
-  labels:
-    app.kubernetes.io/name: chainloop
-    helm.sh/chart: chainloop-1.13.0
-    app.kubernetes.io/instance: foo
-    app.kubernetes.io/managed-by: Helm
-    app.kubernetes.io/part-of: chainloop
-    app.kubernetes.io/component: controlplane-migration
-  annotations:
-    "helm.sh/hook": post-install,pre-upgrade
-    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
-spec:
-  template:
-    metadata:
-      labels:
-        app.kubernetes.io/name: chainloop
-        helm.sh/chart: chainloop-1.13.0
-        app.kubernetes.io/instance: foo
-        app.kubernetes.io/managed-by: Helm
-        app.kubernetes.io/part-of: chainloop
-        app.kubernetes.io/component: controlplane-migration
-    spec:
-      restartPolicy: OnFailure
-      serviceAccountName: foo-chainloop-controlplane
-      securityContext:
-        null
-      containers:
-        
-        - name: migrate
-          image: "ghcr.io/chainloop-dev/chainloop/control-plane-migrations:v0.16.2"
-          imagePullPolicy: 
-          args:
-            - migrate
-            - apply
-            - --url
-            - "postgres://chainloop:chainlooppwd@foo-postgresql:5432/chainloop-cp?sslmode=disable"
-            - --dir
-            - file:///migrations

Closes #309 and #271
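
For context on why this is safe with several replicas: Kubernetes runs init containers to completion before the app containers start, so each control-plane pod only comes up against an already-migrated schema, and, per the Atlas documentation referenced above, concurrent `migrate apply` runs against Postgres serialize on an advisory lock. A stripped-down sketch of the pattern (placeholder names and credentials, not the chart's actual template):

```yaml
# Illustrative only: two replicas, each running the same migration init container.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: controlplane-example
spec:
  # Both pods may run "migrate apply" concurrently; per the description above,
  # Atlas guards the run with a Postgres advisory lock.
  replicas: 2
  selector:
    matchLabels:
      app: controlplane-example
  template:
    metadata:
      labels:
        app: controlplane-example
    spec:
      initContainers:
        # Must exit successfully before the app container below starts.
        - name: migrate
          image: "ghcr.io/chainloop-dev/chainloop/control-plane-migrations:v0.16.2"
          args:
            - migrate
            - apply
            - --url
            - "postgres://user:pwd@postgres:5432/chainloop-cp?sslmode=disable"
            - --dir
            - file:///migrations
      containers:
        # Starts only once the migration init container has completed.
        - name: controlplane
          image: "example.com/chainloop-controlplane:placeholder" # placeholder app image
```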

@migmartri migmartri requested review from danlishka and zaibon August 31, 2023 15:43

zaibon commented Aug 31, 2023

I think this will be an issue when using the sql-proxy. Since the migration is run as an init container, the sql-proxy sidecar is not yet started, so the migration will not be able to reach the database.

migmartri commented Aug 31, 2023

> I think this will be an issue when using the sql-proxy. Since the migration is run as an init container, the sql-proxy sidecar is not yet started, so the migration will not be able to reach the database.

oh, that's a good point, let me think...

This proxy makes things so tricky :(

@migmartri
Member Author

There might be another way of running the proxy. Maybe we can run the proxy as its own deployment, and then talk to it through its internal DNS name. What do you think?

What I mean is:

1 - Deploy the proxy as its own deployment in the same namespace
2 - Configure the connection string to use the service name of that newly exposed deployment instead of localhost
3 - See if it works

If we do that, we could keep either the job approach or the init container approach; either should work, I think.

I like the init container approach; it makes the setup more predictable, since the control plane will not work if the DB settings are not properly set.
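
A rough sketch of what steps 1 and 2 could look like; the proxy image and resource names below are placeholders (this assumes something along the lines of the Cloud SQL Auth Proxy, and the real chart would template all of it):

```yaml
# Illustrative only: the SQL proxy as its own Deployment plus a Service,
# instead of a sidecar inside every control-plane pod.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo-sql-proxy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: foo-sql-proxy
  template:
    metadata:
      labels:
        app: foo-sql-proxy
    spec:
      containers:
        - name: sql-proxy
          # Placeholder image: whichever proxy is in use, configured to listen
          # on 0.0.0.0:5432 so other pods in the namespace can reach it.
          image: "example.com/sql-proxy:placeholder"
          ports:
            - containerPort: 5432
---
# Stable in-cluster DNS name for the proxy.
apiVersion: v1
kind: Service
metadata:
  name: foo-sql-proxy
spec:
  selector:
    app: foo-sql-proxy
  ports:
    - port: 5432
      targetPort: 5432
```

Both the migration init container and the control plane would then point their connection string at the Service name, e.g. postgres://user:pwd@foo-sql-proxy:5432/chainloop-cp, instead of at localhost.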

zaibon commented Aug 31, 2023

I think the approach of having a single deployment for the SQL proxy is the right one. There is no need for the sql-proxy to run multiple times, once in each deployment of the control plane.

> I like the init container approach; it makes the setup more predictable, since the control plane will not work if the DB settings are not properly set.

Would it be possible that the migration code changes somehow, then a new control-plane pod starts and runs the migration while other control-plane pods are still alive, making the pods that did not restart incompatible with the latest version of the database schema?

@migmartri
Member Author

> Would it be possible that the migration code changes somehow, then a new control-plane pod starts and runs the migration while other control-plane pods are still alive, making the pods that did not restart incompatible with the latest version of the database schema?

Yes, I guess that's indeed a possibility. But I do not see how it differs from having a job that modifies the DB schema while we are still running an old version of the code, as we currently do, no?

In any case, to me schema changes need to be non-destructive and take into account both the rollout process and possible rollbacks. We can achieve that now through the review process, since the migrations are part of the code.

What I could see, though, is an old version of the migration code trying to update a DB that is already in a newer state. The good news is that in that case migrate apply will just skip, since there is nothing new to apply.

Does it make sense?

BTW, I'll hold off on merging this code until we figure out how to proceed re: the proxy. Thanks for raising the issue.

zaibon commented Sep 1, 2023

> Yes, I guess that's indeed a possibility. But I do not see how it differs from having a job that modifies the DB schema while we are still running an old version of the code, as we currently do, no?

Indeed this could also happen with the current setup.

> Does it make sense?

Yes it does. I think we can go ahead with your proposal.

@migmartri
Member Author

> Yes it does. I think we can go ahead with your proposal.

Perfect. Let me know if updating the Chart to run the proxy as its own component is something you could give a try, since you have a GKE setup; otherwise it will take me more time. Happy either way.

Thanks

@migmartri migmartri marked this pull request as draft September 5, 2023 08:33

zaibon commented Sep 5, 2023

> Perfect. Let me know if updating the Chart to run the proxy as its own component is something you could give a try, since you have a GKE setup

Yes, I can work on that and test it on GKE.

Signed-off-by: Miguel Martinez Trivino <miguel@chainloop.dev>
@migmartri migmartri marked this pull request as ready for review September 11, 2023 12:26
@migmartri migmartri merged commit 4ca2a82 into chainloop-dev:main Sep 14, 2023
@migmartri migmartri deleted the fix-migration branch September 14, 2023 09:08