
Restoring from VM Backups on Google Cloud

Isabelle Guyon edited this page Aug 6, 2019 · 17 revisions

Related posts:

Restoring a VM from Snapshots on Google Cloud Storage

Step 1: Creating a new VM from a backed-up VM

  • Start from the Google Cloud Console and click VM instances under Compute Engine in the side menu.
  • Select Create Instance in the top menu.
  • Configure the machine name and options. For a production-level Codalab server with heavy traffic, we recommend 4-8 vCPUs and 16-32 GB of RAM.
  • Under Boot disk, select Change.
  • Select the Snapshots tab.

  • Select the snapshot you wish to restore from. Each snapshot is listed with the date it was created.

  • Ensure the disk size and type are correct and submit the form. Note: we recommend at least 50 GB of disk space.
  • Under Firewall, enable HTTP and HTTPS traffic.
  • Expand Management, Security, Disks, Networking, Sole Tenancy.
  • Add the tag allow-rabbitmq-and-flower under Network tags. (This allows RabbitMQ, Celery, etc. to communicate with the workers.)
  • Submit the form to create your new instance from the snapshot backup.
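
The console steps above can also be approximated from the command line. The following is a minimal sketch, assuming the gcloud CLI is installed and authenticated. The instance name, snapshot name, zone, and machine type (e2-standard-4, which has 4 vCPUs and 16 GB of RAM) are placeholders, not values from the original setup. The http-server and https-server tags are the ones the console applies when HTTP/HTTPS traffic is enabled; this assumes the matching firewall rules already exist in your project. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/usr/bin/env bash
# Sketch: restore a Codalab VM from a snapshot with the gcloud CLI.
# All names passed to restore_vm below are hypothetical placeholders.

# Print the command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# restore_vm <instance-name> <snapshot-name> <zone>
restore_vm() {
  local name="$1" snapshot="$2" zone="$3"

  # Create a 50 GB boot disk from the chosen snapshot.
  run gcloud compute disks create "${name}-disk" \
    --source-snapshot "$snapshot" --size 50GB --zone "$zone"

  # Create the instance on that disk, with the network tag for
  # RabbitMQ/Flower and the tags used for HTTP/HTTPS traffic.
  run gcloud compute instances create "$name" \
    --zone "$zone" --machine-type e2-standard-4 \
    --disk "name=${name}-disk,boot=yes" \
    --tags allow-rabbitmq-and-flower,http-server,https-server
}

restore_vm codalab-restored my-snapshot us-central1-a
```

Reviewing the printed commands before running them with DRY_RUN=0 is a cheap way to catch a wrong snapshot name or zone.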

Step 2: Configuring the restored VM by editing the .env file

  • SSH into the instance using the SSH button in the instances menu.
  • Change to the ubuntu user's codalab-competitions directory. (If you SSH in as the user ubuntu, it is inside the directory you land in after login.)
  • If you are practicing backup recovery, will not be using the same URL as the original instance, or the restored instance has a new IP address that the original URL (e.g. autodl.lri.fr) does not point to yet, you need to edit the Docker environment configuration file (.env) so that it refers to the new VM in the meantime. sudo is not needed if you are logged in as ubuntu. (You need a text editor for this; if you dislike vim, run sudo apt install emacs to get Emacs.)
  • Change all references to the domain name of the original instance to the new instance's IP address (or to the new URL if you already have one pointing to the right IP address). The most important setting is CADDY_DOMAIN, as this is what Caddy will try to serve. If the domain or IP it is trying to serve does not match the address it is being served from, it will not work, and you will see a message like "autodl.lri.fr is not served from this instance". When using an IP address, append :80 to it so that SSL is not used. Otherwise Caddy will try to run with SSL enabled, fail to retrieve a certificate, and you will receive an SSL error.
  • WARNING: Google Cloud assigns a NEW DYNAMIC IP when you restore your VM, and again whenever you stop and start it. Unless you have a static IP assigned, you need to redo this procedure every time the IP changes. To avoid this, reserve a static IP address for the instance.

  • If you want to assign a new URL, put the new domain name in CADDY_DOMAIN instead of the IP address. Do not forget to create an A record with your DNS provider (e.g. Moniker) so that the domain points to the correct IP address. It may take a while for the DNS change to propagate.

  • Run docker-compose up -d to update all containers. (Note: the output should show the containers as having been recreated.)
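
The .env edit described in this step can be scripted. The sketch below is one possible approach, not part of the Codalab tooling: update_env and caddy_domain_for are hypothetical helper names. It replaces every occurrence of the old domain with the new host, then sets CADDY_DOMAIN, appending :80 only when the new host is a bare IPv4 address. It assumes CADDY_DOMAIN appears as a plain KEY=value line, and a .env.bak copy is kept so the change can be reverted.

```shell
#!/usr/bin/env bash
# Sketch: point a restored instance's .env at a new IP or domain.
# Helper names are hypothetical; adapt them to your setup.

# Print the value to use for CADDY_DOMAIN: a bare IPv4 address gets ":80"
# appended (plain HTTP, no certificate), a hostname is printed unchanged.
caddy_domain_for() {
  if printf '%s\n' "$1" | grep -Eq '^[0-9]+(\.[0-9]+){3}$'; then
    printf '%s:80\n' "$1"
  else
    printf '%s\n' "$1"
  fi
}

# update_env <env-file> <old-domain> <new-host>
update_env() {
  local env_file="$1" old="$2" new="$3"
  local old_re caddy
  # Escape dots so the old domain is matched literally by sed.
  old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
  caddy=$(caddy_domain_for "$new")
  # First replace every reference, then pin the CADDY_DOMAIN line.
  sed -i.bak \
    -e "s|${old_re}|${new}|g" \
    -e "s|^CADDY_DOMAIN=.*|CADDY_DOMAIN=${caddy}|" \
    "$env_file"
}

# Example (203.0.113.10 is a documentation address, not a real server):
#   update_env .env autodl.lri.fr 203.0.113.10
#   docker-compose up -d
```

Whether settings other than CADDY_DOMAIN also need a port suffix depends on your configuration, so compare .env against .env.bak before recreating the containers.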

Step 3: Checking your website is up and running

  • Verify you can connect via web browser to the instance

  • If the browser reports a connection or SSL error after following these steps, make sure you are not trying to use https.

  • If you get a message similar to: "The Codalab site is not currently available" then Codalab is probably still starting up. Check the django container logs with docker-compose logs -f django. The last few steps should be checking static files and running any migrations.

  • OPTIONAL: If you are not restoring a specific domain, or are only doing a test, the instance may not have the default worker enabled, so no compute workers will be attached. To re-enable the default worker, rename or delete docker-compose.override.yaml. You may also need to ensure that the correct worker version is specified.

  • Verify all services are running with docker ps

  • Access the logs with the command docker-compose logs -f (Use docker-compose logs -f <container> to view a specific container's logs)
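
The checks in this step can be combined into a quick script. This is a rough sketch: missing_services is a hypothetical helper, and the service names in the usage comments (django, caddy, rabbit, flower) are assumptions based on the components mentioned on this page, so match them against what docker ps actually reports on your instance.

```shell
#!/usr/bin/env bash
# Sketch: report expected services that have no running container.

# missing_services <name>...
# Reads running container names on stdin (one per line) and prints every
# expected name that does not appear as a substring of any of them.
missing_services() {
  local running svc
  running="$(cat)"
  for svc in "$@"; do
    if ! printf '%s\n' "$running" | grep -qF -- "$svc"; then
      printf '%s\n' "$svc"
    fi
  done
}

# Example usage on the restored VM (names and address are assumptions):
#   docker ps --format '{{.Names}}' | missing_services django caddy rabbit flower
#   curl -sS -o /dev/null -w '%{http_code}\n' "http://203.0.113.10/"
```

An empty result from missing_services means every expected name matched a running container. The curl line prints only the HTTP status code returned by the site.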

Step 4: Check competitions can be created and sample submissions work

  • Upload the test competition and make sample submissions to verify everything is working correctly. Baseline0 should work with the default CPU queue. The other baselines require a GPU queue. If your submissions get stuck, make sure you are submitting to the right queue.

  • Next you have to restore the GPU queue and test the sample submissions that require a GPU (Baselines 1 and 2).
