
feat(aws): support multiple nodes #100

Merged
merged 1 commit into fermyon:main from support-aws-multi-nodes on Nov 30, 2022

Conversation

FrankYang0529
Contributor

ref: #63

@FrankYang0529
Contributor Author

Hi @vdice, sorry for pushing a draft PR. I encountered an issue and would like to get your feedback first. Thank you!

Currently, this branch can deploy 3 instances, the HashiCorp stack, and the Fermyon platform. However, when I try to run spin deploy to deploy http-rust, I always get an error from bindle. I'm not familiar with the bindle part. Could you help me take a look at it? Thank you!

> spin --version
spin 0.4.0 (784094e 2022-07-13)

> spin deploy
Error: Failed to push bindle spin-hello-world/1.0.0+q8e6334c to server https://bindle.44.210.10.26.sslip.io/v1

Caused by:
    0: Failed to push bindle from '/var/folders/jp/2ylhtk2x0m1fgp3s7k0yb3kr0000gn/T/.tmpoJwMBS' to server at 'https://bindle.44.210.10.26.sslip.io/v1'
    1: Error contacting server: resource could not be loaded

For the design, I originally wanted to use aws_lb. However, the ALB DNS record can't support wildcard subdomains, so we would need users to provide their own domain name and add an A record pointing to the ALB. After realizing that, I chose the current, simpler approach: using sslip.io on an EIP. I will add ALB support after I resolve the spin deploy issue.
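Roughly, the sslip.io-on-EIP idea looks like this in Terraform (a sketch only; the resource and local names are illustrative, not the exact code in this branch):

resource "aws_eip" "lb" {
  vpc = true
}

locals {
  # Wildcard-friendly hostnames come from the Elastic IP via sslip.io,
  # e.g. bindle.44.210.10.26.sslip.io and hippo.44.210.10.26.sslip.io
  dns_zone   = "${aws_eip.lb.public_ip}.sslip.io"
  bindle_url = "https://bindle.${local.dns_zone}/v1"
}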

Also, I wonder how we can sync a host volume between nodes. If we can't do that, will Postgres be redeployed to another node? If so, we may lose the data from the previous node.

For Vault, I am not sure where we use it. I don't deploy it in the current version. Will that break something?

For Consul/Nomad, do we want to split client nodes onto separate instances?

@vdice
Member

vdice commented Aug 31, 2022

@FrankYang0529 no worries; great idea to put this up as a draft first on a big effort like this. I hope to deploy from this branch and take a look at the bindle issue soon.

Meanwhile:

I will add ALB support after I resolve the spin deploy issue.

👍

Also, I wonder how we can sync a host volume between nodes. If we can't do that, will Postgres be redeployed to another node? If so, we may lose the data from the previous node.

Right, we would lose data with the current postgres job config. Good question; we'd need to research techniques (use AWS EFS? Are there similar options for other cloud installers (GCP, DO, etc.)? Is there a cloud platform-agnostic approach?). Maybe relegate this to a follow-up due to its complexity and reach across installer scenarios?

For Vault, I am not sure where we use it. I don't deploy it in the current version. Will that break something?

It is fine to leave Vault out of the mix for now -- you're correct, it isn't used currently.

For Consul/Nomad, do we want to split client nodes onto separate instances?

Perhaps eventually, but that would be fine to tackle in a different issue. For now, I like the simplicity of each node looking like a clone of the rest -- though we'll see if that bites us somehow as we test :)

@FrankYang0529
Contributor Author

Is there a cloud platform-agnostic approach?

I took a look at Ceph today, but it looks like we would need a Ceph cluster first. I don't want to introduce too many components in this PR, so perhaps let's use EBS first? From the Nomad documentation, we can use a similar technique on GCP and DigitalOcean (GCP persistent disks and DigitalOcean droplet storage volumes).
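Roughly, registering the Postgres EBS volume with Nomad's CSI support would look like this (a sketch with illustrative IDs; it assumes an aws-ebs CSI plugin job with plugin_id "aws-ebs0" and an EBS volume created by Terraform):

# volume.hcl, registered with `nomad volume register volume.hcl`
id          = "postgres"
name        = "postgres"
type        = "csi"
plugin_id   = "aws-ebs0"
external_id = "vol-0123456789abcdef0" # EBS volume id from Terraform output

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

The postgres job can then claim it with a group-level volume stanza of type "csi" and a task-level volume_mount, so the data survives rescheduling to another node.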

@vdice
Member

vdice commented Sep 2, 2022

I took a look at Ceph today, but it looks like we would need a Ceph cluster first. I don't want to introduce too many components in this PR

Agreed, let's try using the EBS/container storage interface approach first. Good find.

You know what... if that works... maybe we try containerizing bindle and get persistent data there as well 😄 (Maybe a large effort)

I always get an error from bindle. I'm not familiar with the bindle part. Could you help me take a look at it? Thank you!

Ok, I had a chance to test this. The bindle issue is due to bindle not being able to write to the host volume. There are two changes needed; I'll add comments in-line.

Once I applied the changes around the bindle volume, I was able to attempt a spin deploy. I saw that the bindle transaction was successful (bindle shows up in volume) but it looks like our next hurdle is related to hippo's postgresql driver and the postgres container; I see the following: https://gist.github.com/vdice/cc081811e1649c3cb3a5abf55bd71b5f
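For context, a writable Nomad host volume needs both a client-side declaration and a matching volume/volume_mount in the job; roughly like this (paths and names here are illustrative, not the exact in-line changes):

# nomad client config, e.g. /etc/nomad.d/nomad.hcl
client {
  host_volume "bindle" {
    path      = "/opt/bindle/data" # must exist and be writable by the bindle task's user
    read_only = false
  }
}

The bindle job then references it with a group-level volume "bindle" { type = "host", source = "bindle" } stanza and a volume_mount in the task.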

Inline review comments (outdated, resolved): aws/terraform/scripts/install_nomad.sh.tpl (two comments), aws/terraform/vm_assets/job/bindle.nomad
@FrankYang0529
Contributor Author

Once I applied the changes around the bindle volume, I was able to attempt a spin deploy. I saw that the bindle transaction was successful (bindle shows up in volume) but it looks like our next hurdle is related to hippo's postgresql driver and the postgres container; I see the following: https://gist.github.com/vdice/cc081811e1649c3cb3a5abf55bd71b5f

I tested different Hippo versions. It looks like there is an issue working with PostgreSQL. I created an issue to track it:

deislabs/hippo#1167

@FrankYang0529
Contributor Author

FrankYang0529 commented Sep 11, 2022

Hi @vdice, I fixed the incorrect bindle config and updated the Postgres host volume to use the aws-ebs CSI. I think we can also dockerize bindle, so we don't need to use a host volume for it. I would like to leave aws_elb to the next PR, so the remaining items for this PR are:

"provider=aws tag_key=ConsulRole tag_value=consul-server addr_type=private_v4"
],
"client_addr": "0.0.0.0",
"bind_addr": "{{ GetInterfaceIP \"eth0\" }}",
Member

This was interesting to me: when I overrode the host type to t3.small, the interface name was ens0. Anyway, nothing to change now, but note that we may want to harden this in the future to support other instance types.

Contributor Author

Changed it to {{ GetPrivateInterfaces | include \"flags\" \"forwardable|up\" | attr \"address\" }}. This works on t2.medium.

default = "t2.medium"
}

variable "availability_zone" {
Member

We can keep this variable for users who want full control of the zone, but I'm wondering if it would be better for the default to be based on the current AWS region (as configured in the environment or the AWS CLI).

We could do the latter via adding the following to main.tf:

data "aws_region" "current" {}

and then in the locals block:

  availability_zone = "${data.aws_region.current.name}b"

I used "b" because my current region (us-west-1) doesn't actually have an "a" 😄
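Putting those two pieces together, main.tf would look roughly like this (a sketch of the suggestion, not the final code):

data "aws_region" "current" {}

locals {
  # Default to the "b" zone of the configured region; some regions
  # (e.g. us-west-1) don't expose an "a" zone.
  availability_zone = "${data.aws_region.current.name}b"
}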

Contributor Author

Added data "aws_region" and removed the variable. I think we can use this for now and add the variable back in a follow-up PR. Thank you.

@vdice
Member

vdice commented Sep 13, 2022

@FrankYang0529 the EBS setup looks slick! I was encountering errors when testing -- it seemed to schedule all of the jobs up to and including postgres, but hippo never ran... However, I thought some things might still be in progress on this branch, so I didn't dig deep.

👍 on list from #100 (comment)

@FrankYang0529
Contributor Author

Hi @vdice, I made some updates but haven't fully tested them yet. I just wanted to record some steps that I used during manual deployment. I will do more tests tomorrow. Thank you.

  • Use the docker driver for bindle (see the sketch after this list).
  • Generate the bindle key in the script.
  • Only run run_servers.sh on the first node.
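Roughly, the docker-driver bindle job looks something like this (a sketch only; the image, port, volume name, and mount path are placeholders, not the exact job spec in this branch):

job "bindle" {
  datacenters = ["dc1"]

  group "bindle" {
    network {
      port "http" {
        static = 8080 # port number illustrative
      }
    }

    volume "bindle" {
      type      = "host"
      source    = "bindle"
      read_only = false
    }

    task "bindle" {
      driver = "docker"

      config {
        image = "frankyang/bindle:dev" # temporary test image mentioned below; the final image is settled later in this thread
        ports = ["http"]
      }

      volume_mount {
        volume      = "bindle"
        destination = "/bindle-data" # illustrative mount path
      }
    }
  }
}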

@FrankYang0529 FrankYang0529 force-pushed the support-aws-multi-nodes branch 2 times, most recently from d4c7487 to 05d80fa on October 12, 2022
@FrankYang0529
Contributor Author

Hi @vdice, I tested the provisioning part. Currently, it can deploy all components successfully. However, we need a new Hippo release to avoid the PostgreSQL DateTime issue (deislabs/hippo#1167). Also, we need a CI update to publish an official bindle image; I used frankyang/bindle:dev for testing. I will create a PR in the bindle repo for it.

@bacongobbler
Member

Thanks @FrankYang0529. I'll see about cutting a new Hippo release some time this week.

@bacongobbler
Member

Hippo v0.19.1 is in the process of going through the release pipeline. I'll update fermyon/installer once that's finished.

@bacongobbler
Member

#109

@FrankYang0529
Contributor Author

Besides the README, this PR is almost ready. It currently waits on fermyon/spin#786 and deislabs/bindle#353, so I would like to add some test steps for anyone who wants to try it:

  1. Apply resources.
$ cd aws/terraform
$ terraform apply
  2. Export the ssh key.
$ terraform output -raw ec2_ssh_private_key > /tmp/ec2_ssh_private_key.pem
$ chmod 600 /tmp/ec2_ssh_private_key.pem
  3. Log in to the EC2 instance and wait for cloud-init to finish.
$ ssh -i /tmp/ec2_ssh_private_key.pem ubuntu@<elastic_ip>
$ sudo journalctl -u cloud-final.service -f
  4. After cloud-init is finished, you can check the nomad jobs at https://<elastic_ip>:4646.
  5. Copy secret_keys.toml and keyring.toml from the EC2 instance to your laptop.
# on ec2
$ cd ~
$ cat secret_keys.toml
$ cat keyring.toml
  6. Clone and build fermyon/spin#786 (feat: bump bindle to v0.9.0-rc.1).
# in spin folder
$ cargo build
$ ./target/debug/spin templates install --git https://github.com/fermyon/spin
  7. Run the following commands in the same terminal.
# in installer/aws/terraform folder
$ $(terraform output -raw environment)
# in spin folder
$ ./target/debug/spin new http-rust myapp
$ export BINDLE_KEYRING_FILE=<path to keyring.toml>
$ export BINDLE_SECRET_FILE=<path to secret_keys.toml>
$ export BINDLE_LABEL="user<me@example.com>"
$ cd myapp
$ ../target/debug/spin build
$ ../target/debug/spin deploy
  8. Open https://spin-deploy.myapp.hippo.<elastic_ip>.sslip.io and you will see Hello, Fermyon.

If you make some changes and want to push the app again, run the following commands in the same terminal used for step 7. We need the environment variables from $(terraform output -raw environment) as well as BINDLE_KEYRING_FILE, BINDLE_SECRET_FILE, and BINDLE_LABEL.
  9. Install bindle v0.9.0-rc1 locally.
  10. Fetch the server keys. In stdout, you will see where it writes the output to.
$ bindle keys fetch
  11. Add the key to your BINDLE_KEYRING_FILE.
  12. In the myapp folder, push the app again.
$ ../target/debug/spin build
$ ../target/debug/spin deploy --deploy-existing-bindle

@vdice
Member

vdice commented Nov 9, 2022

Awesome work, @FrankYang0529!

I have yet to test the flow laid out in #100 (comment) but I do want to mention the following item to get your thoughts:

Can we extract the bindle version update (from v0.8.0 to v0.9.x) into a separate follow-up? I ask for a few reasons:

  1. When we're ready to do so, I'd like the bindle version to be updated in all scenarios (local and cloud providers, either one- or multi-node, etc.) at once (e.g. in the same PR), if possible.
  2. It would be great to scope this PR down to only the changes needed to support multi-node clusters in AWS.
  3. It would be preferable not to be tightly coupled to the corresponding Spin PR that bumps its bindle client version (fermyon/spin#786, feat: bump bindle to v0.9.0-rc.1).

For 2 (the purposes of this PR), do we really need anything more than a publicly accessible Docker image of bindle, regardless of version? If we had such an image for bindle v0.8.0, could we still run the platform in AWS in a multi-node context -- and use the currently available v0.6.0 spin to deploy?

What do you think? I know you've spent a lot of time on this and so I may be missing some details...

@FrankYang0529
Contributor Author

Hi @vdice, yeah, I think it's easier to use Bindle v0.8.x. Let's handle the version bump in a follow-up PR. I also updated the README.

If we had such an image for bindle v0.8.0, could we still run the platform in AWS in a multi-node context -- and use the currently available v0.6.0 spin to deploy?

Yes, we can. Since we don't use bindle v0.9.x, all the test steps are the same as usual. For now, I temporarily used the bindle image frankyang/bindle:v0.8.3. If we merge fermyon/bindle#2 and release a new fermyon/bindle image, I will change it to ghcr.io/fermyon/bindle:v0.8.3. Thanks!

@FrankYang0529
Contributor Author

Hi @vdice, I updated the bindle image to ghcr.io/fermyon/bindle:v0.8.2 and it works fine. This PR is ready for review. Thanks.

@vdice vdice self-requested a review November 15, 2022 14:44
@vdice
Member

vdice commented Nov 17, 2022

@FrankYang0529 thanks so much; sorry for the delay here. Should have cycles to devote to this tomorrow -- excited to test it out!

Member

@vdice vdice left a comment

Haven't tested this yet, but I have a first round of comments. Thanks so much for all of the work thus far.

I did want to get your thoughts on the following: if we go with these changes, the multi-node setup will be the de facto AWS deployment, and it includes a fair amount of additional resources (more and larger instances, EBS volumes, etc.), increasing the barrier to entry. Arguably, this formation will be the more attractive configuration for users wanting to run the Fermyon Platform in a non-trivial/real-world capacity on AWS -- but I can also see the lighter-footprint single-node config being attractive for quickly kicking the tires on AWS with lower commitment. However, adding logic to support both modes may be complex/involved and I don't want to derail the progress here. What do you think? Willing to be convinced that we just drop the single-node formation, pending user feedback -- we can always bring it back from git history...

Inline review comments (outdated, resolved): aws/README.md (two comments), aws/terraform/main.tf (three comments), aws/terraform/vm_assets/job/hippo.nomad (two comments), aws/terraform/vm_assets/run_servers.sh (three comments)
@FrankYang0529
Contributor Author

However, adding logic to support both modes may be complex/involved and I don't want to derail the progress here. What do you think? Willing to be convinced that we just drop the single-node formation, pending user feedback -- we can always bring it back from git history...

How about we split the folders for single-node and multiple-node deployments? This may mean more maintenance work, but we can share the single-node scripts with GCP/Azure/DO, so I think it's acceptable.
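Something like this layout (the directory names here are just placeholders to illustrate the split, not the final names):

aws/
  terraform/          # single-node deployment (scripts shareable with GCP/Azure/DO)
  terraform-multiple/ # multiple-node deployment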

@vdice
Member

vdice commented Nov 18, 2022

How about we split the folders for single-node and multiple-node deployments? This may mean more maintenance work, but we can share the single-node scripts with GCP/Azure/DO, so I think it's acceptable.

@FrankYang0529 that does sound like the best approach for now. 👍 Let's go that route for this PR; we can always refine/revisit later.

@FrankYang0529 FrankYang0529 force-pushed the support-aws-multi-nodes branch 2 times, most recently from 8953ee8 to 10c8115 on November 22, 2022
Member

@vdice vdice left a comment

Just tested the multiple-nodes scenario on AWS; the services came up as expected and an example app deploy worked! 🎉

I think we're at the stage of making sure the docs are in order (and accurate to each scenario, now that we have multi- and single-node modes). Then we should be good to merge this in and start using it.

The multiple-nodes configuration is a more robust and flexible foundation compared to the single-node configuration, so I'm excited to iterate here!

Inline review comments (outdated, resolved): aws/README.md, aws/terraform/main.tf
@FrankYang0529 FrankYang0529 force-pushed the support-aws-multi-nodes branch 2 times, most recently from 50ec8dd to 3ac427b on November 29, 2022
Member

@vdice vdice left a comment

Tested both the single- and multi-node scenarios and they look great. One last minor docs suggestion and then I believe we're ready to get this in! 🎉

Inline review comment (outdated, resolved): README.md
Signed-off-by: Frank Yang <yangpoan@gmail.com>
Member

@vdice vdice left a comment

Thank you for the incredible amount of work here @FrankYang0529! Super excited to now have a multi-node scenario for users to try. 🚀

@vdice vdice merged commit 1480d74 into fermyon:main Nov 30, 2022
@FrankYang0529 FrankYang0529 deleted the support-aws-multi-nodes branch December 1, 2022 02:55