
feat(aws): support multiple nodes #100

Merged
merged 1 commit into fermyon:main from support-aws-multi-nodes on Nov 30, 2022

Conversation

FrankYang0529
Contributor

ref: #63

@FrankYang0529
Contributor Author

Hi @vdice, sorry for pushing a draft PR. I encountered an issue and would like to get your feedback first. Thank you!

Currently, this branch can deploy 3 instances, the HashiCorp stack, and the Fermyon platform. However, when I try to run spin deploy to deploy http-rust, I always get an error from bindle. I'm not familiar with the bindle part. Could you help me take a look at it? Thank you!

> spin --version
spin 0.4.0 (784094e 2022-07-13)

> spin deploy
Error: Failed to push bindle spin-hello-world/1.0.0+q8e6334c to server https://bindle.44.210.10.26.sslip.io/v1

Caused by:
    0: Failed to push bindle from '/var/folders/jp/2ylhtk2x0m1fgp3s7k0yb3kr0000gn/T/.tmpoJwMBS' to server at 'https://bindle.44.210.10.26.sslip.io/v1'
    1: Error contacting server: resource could not be loaded

For the design, I originally wanted to use aws_lb. However, the ALB DNS record can't support wildcard subdomains, so we would need users to provide their own domain name and add an A record pointing to the ALB. After realizing that, I chose the current, simpler approach: using sslip.io on an EIP. I will add ALB support after I resolve the spin deploy issue.
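Roughly, the sslip.io-on-EIP idea looks like this in Terraform (a sketch only; the resource and local names are illustrative, not the exact code in this branch):

resource "aws_eip" "lb" {
  vpc = true
}

locals {
  # Wildcard-friendly hostnames come from the Elastic IP via sslip.io,
  # e.g. bindle.44.210.10.26.sslip.io and hippo.44.210.10.26.sslip.io
  dns_zone   = "${aws_eip.lb.public_ip}.sslip.io"
  bindle_url = "https://bindle.${local.dns_zone}/v1"
}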

Also, I wonder how we can sync a host volume between nodes. If we can't do that, will Postgres be redeployed to another node? If so, we may lose the data from the previous node.

For Vault, I am not sure where we use it. I don't deploy it in the current version. Will that break something?

For Consul/Nomad, do we want to split client nodes onto separate instances?

@vdice
Member

vdice commented Aug 31, 2022

@FrankYang0529 no worries; great idea to put this up as a draft first on a big effort like this. I hope to deploy from this branch and take a look at the bindle issue soon.

Meanwhile:

I will add ALB support after I resolve the spin deploy issue.

👍

Also, I wonder how we can sync a host volume between nodes. If we can't do that, will Postgres be redeployed to another node? If so, we may lose the data from the previous node.

Right, we would lose data with the current postgres job config. Good question; we'd need to research techniques (use AWS EFS? Are there similar options for other cloud installers (GCP, DO, etc.)? Is there a cloud platform-agnostic approach?). Maybe relegate this to a follow-up due to its complexity and reach across installer scenarios?

For Vault, I am not sure where we use it. I don't deploy it in the current version. Will that break something?

It is fine to leave Vault out of the mix for now -- you're correct, it isn't used currently.

For Consul/Nomad, do we want to split client nodes onto separate instances?

Perhaps eventually, but that would be fine to tackle in a different issue. For now, I like the simplicity of each node looking like a clone of the rest -- though we'll see if that bites us somehow as we test :)

@FrankYang0529
Contributor Author

Is there a cloud platform-agnostic approach?

I took a look at Ceph today, but it looks like we would need a Ceph cluster first. I don't want to introduce too many components in this PR, so perhaps let's use EBS first? From the Nomad documentation, we can use a similar technique on GCP and DigitalOcean (GCP persistent disks and DigitalOcean droplet storage volumes).
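Roughly, registering the Postgres EBS volume with Nomad's CSI support would look like this (a sketch with illustrative IDs; it assumes an aws-ebs CSI plugin job with plugin_id "aws-ebs0" and an EBS volume created by Terraform):

# volume.hcl, registered with `nomad volume register volume.hcl`
id          = "postgres"
name        = "postgres"
type        = "csi"
plugin_id   = "aws-ebs0"
external_id = "vol-0123456789abcdef0" # EBS volume id from Terraform output

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

The postgres job can then claim it with a group-level volume stanza of type "csi" and a task-level volume_mount, so the data survives rescheduling to another node.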

@vdice
Member

vdice commented Sep 2, 2022

I took a look at Ceph today, but it looks like we would need a Ceph cluster first. I don't want to introduce too many components in this PR

Agreed, let's try using the EBS/container storage interface approach first. Good find.

You know what... if that works... maybe we try containerizing bindle and get persistent data there as well 😄 (Maybe a large effort)

I always get an error from bindle. I'm not familiar with the bindle part. Could you help me take a look at it? Thank you!

Ok, I had a chance to test this. The bindle issue is due to bindle not being able to write to the host volume. There are two changes needed; I'll add comments in-line.

Once I applied the changes around the bindle volume, I was able to attempt a spin deploy. I saw that the bindle transaction was successful (bindle shows up in volume) but it looks like our next hurdle is related to hippo's postgresql driver and the postgres container; I see the following: https://gist.github.com/vdice/cc081811e1649c3cb3a5abf55bd71b5f
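For context, a writable Nomad host volume needs both a client-side declaration and a matching volume/volume_mount in the job; roughly like this (paths and names here are illustrative, not the exact in-line changes):

# nomad client config, e.g. /etc/nomad.d/nomad.hcl
client {
  host_volume "bindle" {
    path      = "/opt/bindle/data" # must exist and be writable by the bindle task's user
    read_only = false
  }
}

The bindle job then references it with a group-level volume "bindle" { type = "host", source = "bindle" } stanza and a volume_mount in the task.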

Inline review comments (outdated, resolved): aws/terraform/scripts/install_nomad.sh.tpl (two comments), aws/terraform/vm_assets/job/bindle.nomad
@FrankYang0529
Contributor Author

Once I applied the changes around the bindle volume, I was able to attempt a spin deploy. I saw that the bindle transaction was successful (bindle shows up in volume) but it looks like our next hurdle is related to hippo's postgresql driver and the postgres container; I see the following: https://gist.github.com/vdice/cc081811e1649c3cb3a5abf55bd71b5f

I tested different Hippo versions. It looks like there is an issue working with PostgreSQL. I created an issue to track it:

deislabs/hippo#1167

@FrankYang0529
Contributor Author

FrankYang0529 commented Sep 11, 2022

Hi @vdice, I fixed the incorrect bindle config and updated the Postgres host volume to use the aws-ebs CSI. I think we can also dockerize bindle, so we don't need to use a host volume for it. I would like to leave aws_elb to the next PR, so the remaining items for this PR are:

"provider=aws tag_key=ConsulRole tag_value=consul-server addr_type=private_v4"
],
"client_addr": "0.0.0.0",
"bind_addr": "{{ GetInterfaceIP \"eth0\" }}",
Member

This was interesting to me: when I overrode the host type to t3.small, the interface name was ens0. Anyway, nothing to change now, but note that we may want to harden this in the future to support other instance types.

Contributor Author

Changed it to {{ GetPrivateInterfaces | include \"flags\" \"forwardable|up\" | attr \"address\" }}. This works on t2.medium.

default = "t2.medium"
}

variable "availability_zone" {
Member

We can keep this variable for users who want full control of the zone, but I'm wondering if it would be better for the default to be based on the current AWS region (as configured in the environment or the AWS CLI).

We could do the latter via adding the following to main.tf:

data "aws_region" "current" {}

and then in the locals block:

  availability_zone = "${data.aws_region.current.name}b"

I used "b" because my current region (us-west-1) doesn't actually have an "a" 😄
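Putting those two pieces together, main.tf would look roughly like this (a sketch of the suggestion, not the final code):

data "aws_region" "current" {}

locals {
  # Default to the "b" zone of the configured region; some regions
  # (e.g. us-west-1) don't expose an "a" zone.
  availability_zone = "${data.aws_region.current.name}b"
}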

Contributor Author

Added data "aws_region" and removed the variable. I think we can use this for now and add the variable back in a follow-up PR. Thank you.

@vdice
Member

vdice commented Sep 13, 2022

@FrankYang0529 the EBS setup looks slick! I was encountering errors when testing -- it seemed to schedule all of the jobs up to and including postgres, but hippo never ran... However, I thought some things might still be in progress on this branch, so I didn't dig deep.

👍 on list from #100 (comment)

@FrankYang0529
Contributor Author

Hi @vdice, I made some updates but haven't fully tested them yet. I just wanted to record some steps that I used during manual deployment. I will do more tests tomorrow. Thank you.

  • Use the docker driver for bindle (see the sketch after this list).
  • Generate the bindle key in the script.
  • Only run run_servers.sh on the first node.
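Roughly, the docker-driver bindle job looks something like this (a sketch only; the image, port, volume name, and mount path are placeholders, not the exact job spec in this branch):

job "bindle" {
  datacenters = ["dc1"]

  group "bindle" {
    network {
      port "http" {
        static = 8080 # port number illustrative
      }
    }

    volume "bindle" {
      type      = "host"
      source    = "bindle"
      read_only = false
    }

    task "bindle" {
      driver = "docker"

      config {
        image = "frankyang/bindle:dev" # temporary test image mentioned below; the final image is settled later in this thread
        ports = ["http"]
      }

      volume_mount {
        volume      = "bindle"
        destination = "/bindle-data" # illustrative mount path
      }
    }
  }
}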

@FrankYang0529 FrankYang0529 force-pushed the support-aws-multi-nodes branch 2 times, most recently from d4c7487 to 05d80fa on October 12, 2022
@FrankYang0529
Contributor Author

Hi @vdice, I tested the provisioning part. Currently, it can deploy all components successfully. However, we need a new Hippo release to avoid the PostgreSQL DateTime issue (deislabs/hippo#1167). Also, we need a CI update to publish an official bindle image; I used frankyang/bindle:dev for testing. I will create a PR in the bindle repo for it.

@bacongobbler
Member

Thanks @FrankYang0529. I'll see about cutting a new Hippo release some time this week.

@bacongobbler
Member

Hippo v0.19.1 is in the process of going through the release pipeline. I'll update fermyon/installer once that's finished.

@bacongobbler
Member

#109

@FrankYang0529
Contributor Author

Besides the README, this PR is almost ready. It currently waits on fermyon/spin#786 and deislabs/bindle#353, so I would like to add some test steps for anyone who wants to try it:

  1. Apply resources.
$ cd aws/terraform
$ terraform apply
  2. Export the ssh key.
$ terraform output -raw ec2_ssh_private_key > /tmp/ec2_ssh_private_key.pem
$ chmod 600 /tmp/ec2_ssh_private_key.pem
  3. Log in to the EC2 instance and wait for cloud-init to finish.
$ ssh -i /tmp/ec2_ssh_private_key.pem ubuntu@<elastic_ip>
$ sudo journalctl -u cloud-final.service -f
  4. After cloud-init is finished, you can check the nomad jobs at https://<elastic_ip>:4646.
  5. Copy secret_keys.toml and keyring.toml from the EC2 instance to your laptop.
# on ec2
$ cd ~
$ cat secret_keys.toml
$ cat keyring.toml
  6. Clone and build fermyon/spin#786 (feat: bump bindle to v0.9.0-rc.1).
# in spin folder
$ cargo build
$ ./target/debug/spin templates install --git https://github.com/fermyon/spin
  7. Run the following commands in the same terminal.
# in installer/aws/terraform folder
$ $(terraform output -raw environment)
# in spin folder
$ ./target/debug/spin new http-rust myapp
$ export BINDLE_KEYRING_FILE=<path to keyring.toml>
$ export BINDLE_SECRET_FILE=<path to secret_keys.toml>
$ export BINDLE_LABEL="user<me@example.com>"
$ cd myapp
$ ../target/debug/spin build
$ ../target/debug/spin deploy
  8. Open https://spin-deploy.myapp.hippo.<elastic_ip>.sslip.io and you will see Hello, Fermyon.

If you make some changes and want to push the app again, run the following commands in the same terminal used for step 7. We need the environment variables from $(terraform output -raw environment) as well as BINDLE_KEYRING_FILE, BINDLE_SECRET_FILE, and BINDLE_LABEL.
  9. Install bindle v0.9.0-rc1 locally.
  10. Fetch the server keys. In stdout, you will see where it writes the output to.
$ bindle keys fetch
  11. Add the key to your BINDLE_KEYRING_FILE.
  12. In the myapp folder, push the app again.
$ ../target/debug/spin build
$ ../target/debug/spin deploy --deploy-existing-bindle

@vdice
Member

vdice commented Nov 9, 2022

Awesome work, @FrankYang0529!

I have yet to test the flow laid out in #100 (comment) but I do want to mention the following item to get your thoughts:

Can we extract the bindle version update (from v0.8.0 to v0.9.x) into a separate follow-up? I ask for a few reasons:

  1. When we're ready to do so, I'd like the bindle version to be updated in all scenarios (local and cloud providers, either one- or multi-node, etc.) at once (e.g. in the same PR), if possible.
  2. It would be great to scope this PR down to only the changes needed to support multi-node clusters in AWS.
  3. It would be preferable not to be tightly coupled to the corresponding Spin PR that bumps its bindle client version (fermyon/spin#786, feat: bump bindle to v0.9.0-rc.1).

For 2 (the purposes of this PR), do we really need anything more than a publicly accessible Docker image of bindle, regardless of version? If we had such an image for bindle v0.8.0, could we still run the platform in AWS in a multi-node context -- and use the currently available v0.6.0 spin to deploy?

What do you think? I know you've spent a lot of time on this and so I may be missing some details...

@FrankYang0529
Contributor Author

Hi @vdice, yeah, I think it's easier to use Bindle v0.8.x. Let's handle the version bump in a follow-up PR. I also updated the README.

If we had such an image for bindle v0.8.0, could we still run the platform in AWS in a multi-node context -- and use the currently available v0.6.0 spin to deploy?

Yes, we can. Since we don't use bindle v0.9.x, all the test steps are the same as usual. For now, I temporarily used the bindle image frankyang/bindle:v0.8.3. If we merge fermyon/bindle#2 and release a new fermyon/bindle image, I will change it to ghcr.io/fermyon/bindle:v0.8.3. Thanks!

@FrankYang0529
Contributor Author

Hi @vdice, I updated the bindle image to ghcr.io/fermyon/bindle:v0.8.2 and it works fine. This PR is ready for review. Thanks.

@vdice vdice self-requested a review November 15, 2022 14:44
@vdice
Member

vdice commented Nov 17, 2022

@FrankYang0529 thanks so much; sorry for the delay here. Should have cycles to devote to this tomorrow -- excited to test it out!

Member

@vdice vdice left a comment

Haven't tested this yet, but I have a first round of comments. Thanks so much for all of the work thus far.

I did want to get your thoughts on the following: if we go with these changes, the multi-node setup will be the de facto AWS deployment, and it includes a fair amount of additional resources (more and larger instances, EBS volumes, etc.), increasing the barrier to entry. Arguably, this formation will be the more attractive configuration for users wanting to run the Fermyon Platform in a non-trivial/real-world capacity on AWS -- but I can also see the lighter-footprint single-node config being attractive for quickly kicking the tires on AWS with lower commitment. However, adding logic to support both modes may be complex/involved and I don't want to derail the progress here. What do you think? Willing to be convinced that we just drop the single-node formation, pending user feedback -- we can always bring it back from git history...

Inline review comments (outdated, resolved): aws/README.md (two comments), aws/terraform/main.tf (three comments), aws/terraform/vm_assets/job/hippo.nomad (two comments), aws/terraform/vm_assets/run_servers.sh (three comments)
@FrankYang0529
Contributor Author

However, adding logic to support both modes may be complex/involved and I don't want to derail the progress here. What do you think? Willing to be convinced that we just drop the single-node formation, pending user feedback -- we can always bring it back from git history...

How about we split the folders for single-node and multiple-node deployments? This may mean more maintenance work, but we can share the single-node scripts with GCP/Azure/DO, so I think it's acceptable.
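Something like this layout (the directory names here are just placeholders to illustrate the split, not the final names):

aws/
  terraform/          # single-node deployment (scripts shareable with GCP/Azure/DO)
  terraform-multiple/ # multiple-node deployment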

@vdice
Member

vdice commented Nov 18, 2022

How about we split the folders for single-node and multiple-node deployments? This may mean more maintenance work, but we can share the single-node scripts with GCP/Azure/DO, so I think it's acceptable.

@FrankYang0529 that does sound like the best approach for now. 👍 Let's go that route for this PR; we can always refine/revisit later.

@FrankYang0529 FrankYang0529 force-pushed the support-aws-multi-nodes branch 2 times, most recently from 8953ee8 to 10c8115 on November 22, 2022
Member

@vdice vdice left a comment

Just tested the multiple-nodes scenario on AWS; the services came up as expected and an example app deploy worked! 🎉

I think we're at the stage of making sure the docs are in order (and accurate to each scenario, now that we have multi- and single-node modes). Then we should be good to merge this in and start using it.

The multiple-nodes configuration is a more robust and flexible foundation compared to the single-node configuration, so I'm excited to iterate here!

Inline review comments (outdated, resolved): aws/README.md, aws/terraform/main.tf
@FrankYang0529 FrankYang0529 force-pushed the support-aws-multi-nodes branch 2 times, most recently from 50ec8dd to 3ac427b on November 29, 2022
Member

@vdice vdice left a comment

Tested both the single- and multi-node scenarios and they look great. One last minor docs suggestion and then I believe we're ready to get this in! 🎉

Inline review comment (outdated, resolved): README.md
Signed-off-by: Frank Yang <yangpoan@gmail.com>
Member

@vdice vdice left a comment

Thank you for the incredible amount of work here @FrankYang0529! Super excited to now have a multi-node scenario for users to try. 🚀

@vdice vdice merged commit 1480d74 into fermyon:main Nov 30, 2022
@FrankYang0529 FrankYang0529 deleted the support-aws-multi-nodes branch December 1, 2022 02:55