feat(aws): support multiple nodes #100
Conversation
Hi @vdice, sorry for pushing a draft PR first. I encountered an issue and would like to get your feedback before going further. Thank you! Currently, this branch can deploy 3 instances, the HashiCorp stack, and the Fermyon Platform. However, I ran into an issue when I tried to run a deploy.
For the design, I have a few questions:
- How do we sync a host volume between nodes? If we can't, will Postgres be redeployed to another node? If so, we may lose the data on the previous node.
- For Vault, I am not sure where we use it, and I don't deploy it in the current version. Will that break anything?
- For Consul/Nomad, do we want to split the client nodes onto separate instances?
@FrankYang0529 no worries; it's a great idea to put up a draft first on a big effort like this. I hope to deploy from this branch and take a look at the bindle issue soon. Meanwhile:
👍
Right, we would lose data with the current postgres job config. Good question; we'd need to research techniques (use AWS EFS? Are there similar options for the other cloud installers (GCP, DO, etc.)? Is there a cloud-platform-agnostic approach?). Maybe relegate this to a follow-up due to its complexity and reach across installer scenarios?
It is fine to leave Vault out of the mix for now -- you're correct, it isn't used currently.
Perhaps eventually, but would be fine to tackle in a different issue. For now, I like the simplicity of each node looking like a clone of the rest -- though we'll see if that bites us somehow as we test :)
I took a look at Ceph today, but it looks like we would need a Ceph cluster first. I don't want to introduce too many components in this PR, so perhaps let's use EBS first? From the Nomad documentation, we can use a similar technique on GCP and DigitalOcean (GCP persistent disks and DigitalOcean droplet storage volumes).
Agreed, let's try using the EBS/container storage interface approach first. Good find. You know what... if that works... maybe we try containerizing bindle and get persistent data there as well 😄 (Maybe a large effort)
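For context, here is a minimal sketch of what the EBS/CSI approach can look like in Nomad, assuming the aws-ebs CSI controller and node plugins are already running and the EBS volume has been created; the IDs, plugin name, and mount path below are illustrative, not taken from this PR:

# volume.hcl -- registered once with: nomad volume register volume.hcl
type        = "csi"
id          = "postgres"
name        = "postgres"
external_id = "vol-0123456789abcdef0"  # EBS volume ID, e.g. created by Terraform
plugin_id   = "aws-ebs0"               # ID of the deployed aws-ebs CSI plugin

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

# excerpt of a postgres job spec: claim the registered volume and mount it,
# so the data survives the job being rescheduled onto another node
group "postgres" {
  volume "pgdata" {
    type            = "csi"
    source          = "postgres"
    access_mode     = "single-node-writer"
    attachment_mode = "file-system"
  }

  task "postgres" {
    driver = "docker"

    volume_mount {
      volume      = "pgdata"
      destination = "/var/lib/postgresql/data"
    }
  }
}

With a single-node-writer CSI volume, the controller plugin detaches the volume from the old node and reattaches it to whichever node the job lands on, which is what addresses the data-loss concern above.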
Ok, I had a chance to test this. The bindle issue is due to bindle not being able to write to the host volume. There are two changes needed; I'll add comments in-line. Once I applied the changes around the bindle volume, I was able to attempt a deploy.
I tested different Hippo versions. It looks like there is an issue working with PostgreSQL. I created an issue to track it.
Force-pushed from 0733c6d to 7f2bdb7.
Hi @vdice, I fixed the incorrect config for bindle and updated the postgres host volume to use the aws-ebs CSI plugin. I think we can also dockerize bindle.
"provider=aws tag_key=ConsulRole tag_value=consul-server addr_type=private_v4" | ||
], | ||
"client_addr": "0.0.0.0", | ||
"bind_addr": "{{ GetInterfaceIP \"eth0\" }}", |
This was interesting to me: when I overrode the host type to be t3.small, the interface name is ens0. Anyways, nothing to change now, but note that we may want to harden this in the future to support other instance types.
Changed it to {{ GetPrivateInterfaces | include \"flags\" \"forwardable|up\" | attr \"address\" }}. I can use this on t2.medium.
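For reference, here is a sketch of the resulting agent configuration, written in Consul's HCL form (the fragment quoted above is the JSON equivalent, and the array is assumed to be the retry_join list used for AWS cloud auto-join). The go-sockaddr template resolves the bind address from whichever private interface is up and forwardable, instead of hard-coding an interface name such as eth0 or ens0:

retry_join  = ["provider=aws tag_key=ConsulRole tag_value=consul-server addr_type=private_v4"]
client_addr = "0.0.0.0"
# resolve the bind address from any up, forwardable private interface
bind_addr   = "{{ GetPrivateInterfaces | include \"flags\" \"forwardable|up\" | attr \"address\" }}"

The same template works across instance types (t2, t3, etc.) because it does not depend on the kernel's interface naming scheme.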
aws/terraform/variables.tf (outdated)
  default = "t2.medium"
}

variable "availability_zone" {
We can keep this variable for users who want full control of the zone, but I'm wondering if it would be better for the default to be based on the current AWS region (as configured in the env or the AWS CLI)?
We could do the latter by adding the following to main.tf:
data "aws_region" "current" {}
and then, in the locals block:
availability_zone = "${data.aws_region.current.name}b"
where I used "b" because my current region (us-west-1) doesn't actually have an "a" 😄
Added data "aws_region" and removed the variable. I think we can use this for now and add the variable back in a follow-up PR. Thank you.
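Put together, the change discussed above would look roughly like this in main.tf (a sketch; the surrounding locals block is assumed to already exist):

data "aws_region" "current" {}

locals {
  # default to a zone in the currently configured AWS region; "b" is used
  # because some regions (e.g. us-west-1) do not expose an "a" zone
  availability_zone = "${data.aws_region.current.name}b"
}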
@FrankYang0529 the EBS setup looks slick! I was encountering errors when testing -- it seems like it would schedule all of the jobs up to and including postgres, but hippo never ran... However, I thought maybe some things are still in progress on this branch, so I didn't dig deep. 👍 on the list from #100 (comment)
Force-pushed from 7f2bdb7 to d658929.
Hi @vdice, I made some updates but haven't fully tested them yet. I just wanted to capture some steps that I used during manual deployment. I will do more testing tomorrow. Thank you.
Force-pushed from d4c7487 to 05d80fa.
Hi @vdice, I tested the provisioning part. Currently, it can deploy all components successfully. However, we need a new Hippo release to avoid the PostgreSQL DateTime issue deislabs/hippo#1167. Also, we need a CI update to publish an official bindle image; for now, I used a temporary one.
Thanks @FrankYang0529. I'll see about cutting a new Hippo release some time this week.
Hippo v0.19.1 is in the process of going through the release pipeline. Will update fermyon/installer once that's finished.
Force-pushed from 05d80fa to e44f9fe.
Besides the README, this PR is almost ready. Currently, it is waiting on fermyon/spin#786 and deislabs/bindle#353, so I would like to add some test steps for anyone who wants to try it:
$ cd aws/terraform
$ terraform apply
$ terraform output -raw ec2_ssh_private_key > /tmp/ec2_ssh_private_key.pem
$ chmod 600 /tmp/ec2_ssh_private_key.pem
$ ssh -i /tmp/ec2_ssh_private_key.pem ubuntu@<elastic_ip>
$ sudo journalctl -u cloud-final.service -f
# on ec2
$ cd ~
$ cat secret_keys.toml
$ cat keyring.toml
# in installer/aws/terraform folder
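# evaluates the Terraform "environment" output, which exports the platform's
# connection settings into this shell (the exact variables depend on the installer's outputs)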
$ $(terraform output -raw environment)
# in spin folder
$ ./target/debug/spin new http-rust myapp
$ export BINDLE_KEYRING_FILE=<path to keyring.toml>
$ export BINDLE_SECRET_FILE=<path to secret_keys.toml>
$ export BINDLE_LABEL="user<me@example.com>"
$ cd myapp
$ ../target/debug/spin build
$ ../target/debug/spin deploy
If you make some changes and want to push the app again, you can run the following commands in the same terminal as step 7 (we need the environment variables from that step):
$ bindle keys fetch
$ ../target/debug/spin build
$ ../target/debug/spin deploy --deploy-existing-bindle
Awesome work, @FrankYang0529! I have yet to test the flow laid out in #100 (comment), but I do want to mention the following item to get your thoughts: can we extract the bindle version update (from v0.8.0 to v0.9.x) into a separate follow-up? I ask for a few reasons:
For 2 (the purposes of this PR), do we really only need a publicly accessible Docker image of bindle, regardless of version? If we had such an image for bindle v0.8.0, could we still run the platform in AWS in a multi-node context -- and use the currently available Spin v0.6.0 to deploy? What do you think? I know you've spent a lot of time on this, so I may be missing some details...
Force-pushed from e44f9fe to 322fc25.
Hi @vdice, yeah, I think it's easier to use Bindle v0.8.x. Let's do the follow-up in the next PR. I also updated the README.
Yes, we can. Since we don't use bindle v0.9.x, all the test steps are the same as before. For now, I temporarily used an unofficial bindle image.
Force-pushed from 322fc25 to 3d6e04f.
Hi @vdice, I updated the bindle image.
@FrankYang0529 thanks so much; sorry for the delay here. Should have cycles to devote to this tomorrow -- excited to test it out!
Haven't tested this yet, but I have a first round of comments. Thanks so much for all of the work thus far.
I did want to get your thoughts on the following: if we go with these changes, the multi-node setup will be the de facto AWS deployment, and it includes a fair amount of additional resources (more, larger instances, EBS volumes, etc.), increasing the barrier to entry. Arguably, this formation will be the more attractive configuration for users wanting to run the Fermyon Platform in a non-trivial/real-world capacity on AWS -- but I can also see the lighter-footprint single-node config being attractive for quickly kicking the tires on AWS with lower commitment. However, adding logic to support both modes may be complex/involved and I don't want to derail the progress here. What do you think? I'm willing to be convinced that we just drop the single-node formation, pending user feedback -- we can always bring it back from git history...
How about we split the folders for single-node and multiple-node deployments? This means a bit more maintenance work, but we can share the single-node scripts with GCP/Azure/DO, so I think it's acceptable.
@FrankYang0529 that does sound like the best approach for now. 👍 Let's go that route for this PR; we can always refine/revisit later.
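For illustration, the split could look roughly like this (a sketch only; the exact directory names are whatever the PR settles on, though the review below refers to the multi-node scenario as multiple-nodes):

aws/
  terraform/
    single-node/       # lighter-footprint formation, shareable with GCP/Azure/DO
    multiple-nodes/    # multi-node formation introduced in this PR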
Force-pushed from 8953ee8 to 10c8115.
Just tested the multiple-nodes scenario on AWS; the services came up as expected and an example app deploy worked! 🎉
I think we're at the stage of making sure the docs are in order (and accurate to each scenario, now that we have multi- and single-node modes). Then we should be good to merge this in and start using it.
The multiple-nodes configuration is a more robust and flexible foundation compared to the single-node configuration, so I'm excited to iterate here!
Force-pushed from 10c8115 to 4da0371.
Force-pushed from 50ec8dd to 3ac427b.
Tested both the single- and multi-node scenarios and they look great. One last minor docs suggestion and then I believe we're ready to get this in! 🎉
Force-pushed from 3ac427b to 0651ef4.
Signed-off-by: Frank Yang <yangpoan@gmail.com>
Force-pushed from 0651ef4 to a25ea98.
Thank you for the incredible amount of work here @FrankYang0529! Super excited to now have a multi-node scenario for users to try. 🚀
ref: #63