add ELB tags for custom ingresses #1391
Conversation
Is an empty string a valid tag value? That would make more sense to me.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html
LGTM beyond the value used for the tag.
Set the empty string. Going to test that.
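For reference, AWS allows an empty string as a tag value (the key is required, the value may be empty), so tagging the subnets this way should be valid. A minimal sketch of setting such a tag with aws-sdk-go; the subnet ID and default credential setup are placeholder assumptions, not values from this PR:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))

	// An empty string is a valid EC2 tag value; the discussion in this
	// thread suggests only the key matters for subnet discovery anyway.
	_, err := svc.CreateTags(&ec2.CreateTagsInput{
		Resources: []*string{aws.String("subnet-0123456789abcdef0")}, // placeholder subnet ID
		Tags: []*ec2.Tag{
			{Key: aws.String("kubernetes.io/role/elb"), Value: aws.String("")},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("tagged subnet with empty-valued role tag")
}
```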
When a cluster is updated, do we know how CloudFormation treats the changed subnets? Will it simply add the tags and keep the network functionality up? What would be the best way to verify that?
Can you try to update from a previous version to WIP with this change?
Sure, but what behaviour should I look for? How do I verify that the subnets behave properly?
Yeah, I have no idea. I remember that for security groups, adding a tag deleted and recreated the SG, which obviously has an impact on customer traffic. Maybe you can ping an application there and see if there is downtime during the upgrade. I can help if you need.
In the case of a tag it should not recreate the entire SG, because the change is not at the level of a single rule but at the level of the SG itself. Apologies in advance if I remember this wrongly. Changes at the level of a rule are a different case: there the connection drops because the SG is destroyed and recreated. 😭😬😱
lgtm
For testing, I would do the upgrade and look at CloudFormation: check that the CF state is green, check that the subnets have the proper tag after the update, and then check cluster functionality, i.e. access the k8s API (a scripted node check is sketched below), try to SSH into one of the nodes, and ping each node from master/worker. If all of this works then the subnets are fine. Testing this on a multi-AZ cluster makes sense as you will have more subnets and inter-subnet communication.
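To script the "access the k8s API" step of that checklist, a minimal client-go sketch could look like the following; the kubeconfig path is a placeholder assumption:

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder path to the tenant cluster kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", "/tmp/kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Listing nodes exercises the API through the ELB, and the Ready
	// condition shows whether the kubelets reach the control plane.
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, n := range nodes.Items {
		for _, c := range n.Status.Conditions {
			if c.Type == "Ready" {
				fmt.Printf("%s Ready=%s\n", n.Name, c.Status)
			}
		}
	}
}
```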
@xh3b4sd you could also create an old cluster, then modify the CloudFormation stack and see what it does when you execute the change, if that helps.
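A change set is one way to do exactly that without risking the stack: it previews whether each resource would be updated in place or replaced before anything is executed. A rough sketch with aws-sdk-go; the stack name, change set name, and template body are placeholder assumptions:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudformation"
)

func main() {
	svc := cloudformation.New(session.Must(session.NewSession()))

	// Placeholder: the existing stack template plus the new subnet tags.
	templateBody := "..."

	_, err := svc.CreateChangeSet(&cloudformation.CreateChangeSetInput{
		StackName:     aws.String("cluster-main-stack"), // placeholder stack name
		ChangeSetName: aws.String("add-elb-subnet-tags"),
		TemplateBody:  aws.String(templateBody),
	})
	if err != nil {
		log.Fatal(err)
	}

	desc := &cloudformation.DescribeChangeSetInput{
		StackName:     aws.String("cluster-main-stack"),
		ChangeSetName: aws.String("add-elb-subnet-tags"),
	}
	if err := svc.WaitUntilChangeSetCreateComplete(desc); err != nil {
		log.Fatal(err)
	}

	// Replacement tells us whether CF would destroy and recreate a resource,
	// which is the concern with the subnets here.
	out, err := svc.DescribeChangeSet(desc)
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range out.Changes {
		rc := c.ResourceChange
		fmt.Printf("%s %s replacement=%s\n",
			aws.StringValue(rc.Action),
			aws.StringValue(rc.LogicalResourceId),
			aws.StringValue(rc.Replacement))
	}
}
```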
The new cluster which has the tags apparently has connection issues. I deleted the tags manually and restarted the kubelets, but fooling around did not lead to anything. I would like to pair with somebody to figure out which problems we have there. The cluster …
I updated …
@xh3b4sd Did you try to create those tags with the value …? For instance kops creates …. See also here: …
But it seems that you already solved the issue and it works, so ignore this, as said. 🤓
I've run a test cluster and reproduced the error perfectly. I also made some further progress towards the root cause. It is crucial here to check out the timeline.

23.20: the tenant cluster is created.

23.22: it is not possible to reach the cluster on the first hit because it returns the following error:

➜ kgno
Unable to connect to the server: EOF

and by logging the … we see:

I0206 22:24:58.788349 1 log.go:172] http: TLS handshake error from 10.1.0.215:63042: EOF
I0206 22:25:03.388177 1 log.go:172] http: TLS handshake error from 10.1.0.214:42568: EOF
I0206 22:25:03.788940 1 log.go:172] http: TLS handshake error from 10.1.0.215:63056: EOF
I0206 22:25:08.176294 1 log.go:172] http: TLS handshake error from 10.1.0.104:51114: EOF
I0206 22:25:08.387920 1 log.go:172] http: TLS handshake error from 10.1.0.214:42574: EOF
I0206 22:25:08.788566 1 log.go:172] http: TLS handshake error from 10.1.0.215:63062: EOF
I0206 22:25:13.388652 1 log.go:172] http: TLS handshake error from 10.1.0.214:42576: EOF
I0206 22:25:13.788126 1 log.go:172] http: TLS handshake error from 10.1.0.215:63064: EOF

The EC2 instance (master) linked to the api ELB is marked …. The EC2 instances (worker) linked to the ingress ELB are marked ….

23.38: the cluster gets healthy and consequently the ingress ELB instances are marked ….

Using the tags in question introduces a delay somewhere in the creation of the cluster. In general the cluster takes more time, as is visible from two things: …
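To watch the InService/OutOfService transitions from this timeline without clicking through the console, the classic ELB API can be polled. A small sketch with aws-sdk-go; the load balancer name is a placeholder assumption:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elb"
)

func main() {
	svc := elb.New(session.Must(session.NewSession()))
	for {
		out, err := svc.DescribeInstanceHealth(&elb.DescribeInstanceHealthInput{
			LoadBalancerName: aws.String("tenant-api-elb"), // placeholder ELB name
		})
		if err != nil {
			log.Fatal(err)
		}
		for _, s := range out.InstanceStates {
			// State is "InService" or "OutOfService"; Description often
			// explains why a health check failed.
			fmt.Printf("%s %s %s %s\n",
				time.Now().Format(time.RFC3339),
				aws.StringValue(s.InstanceId),
				aws.StringValue(s.State),
				aws.StringValue(s.Description))
		}
		time.Sleep(30 * time.Second)
	}
}
```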
Where to go next? I'd recommend debugging the creation of the cluster, in particular each component, because for some reason these tags introduce some computational overhead somewhere, perhaps right in this part: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L3150
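For context, the linked provider code discovers ELB subnets via tags. This is a rough sketch of that kind of lookup (not the provider's actual code) with aws-sdk-go; the cluster tag key is a placeholder assumption:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))

	// Find subnets carrying the cluster tag, then classify them by role tag,
	// roughly what the cloud provider does when it picks subnets for an ELB.
	out, err := svc.DescribeSubnets(&ec2.DescribeSubnetsInput{
		Filters: []*ec2.Filter{{
			Name:   aws.String("tag-key"),
			Values: []*string{aws.String("kubernetes.io/cluster/example")}, // placeholder
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, s := range out.Subnets {
		role := "none"
		for _, t := range s.Tags {
			switch aws.StringValue(t.Key) {
			case "kubernetes.io/role/elb":
				role = "public ELB"
			case "kubernetes.io/role/internal-elb":
				role = "internal ELB"
			}
		}
		fmt.Printf("%s role=%s\n", aws.StringValue(s.SubnetId), role)
	}
}
```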
The cluster takes time to create, that is nothing new and we are fine with it. The problem AFAIK was that it did not resolve; the connection was broken for days after creation. If you can access the server 20 min after creation, as you said, then you did not hit the bug.
20 min after the creation I can run …
Could you please point out how to reproduce this? Maybe I haven't tested something.
The issue works as follows. You create a cluster. You wait until it is created. The API works. You wait some more time, maybe an hour or two. Then the API is not working anymore and only returns EOF.
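Since the failure only shows up an hour or two after creation, a long-running probe makes it easier to catch. A minimal sketch that hits the API endpoint through the ELB and logs any EOF; the URL is a placeholder assumption:

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Timeout: 10 * time.Second,
		// The probe only checks reachability, not certificate validity.
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	for {
		resp, err := client.Get("https://api.example.cluster/healthz") // placeholder URL
		if err != nil {
			log.Printf("probe failed: %v", err) // an EOF here matches the reported symptom
		} else {
			log.Printf("probe ok: %s", resp.Status)
			resp.Body.Close()
		}
		time.Sleep(30 * time.Second)
	}
}
```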
I see. Thank you @xh3b4sd for the clarification. So we need to observe this in the long run. When you say the API is not working anymore, how does the api ELB look? Is it still InService?
The ELB appears to be healthy and the master node is also in service.
Your cluster …
I see. Then I think we should improve one thing: if the API returns EOF, we should know about it.
The API does not return EOF; the API is working. The ELB returns EOF. The health check for the API has worked fine so far. Maybe just try to debug it yourself :p
Ok @calvix, but do you agree that if the ELB returns EOF it is a symptom that something is not working properly, and that we should be alerted about it? So we need to introduce/scrape some kind of metrics at the level of the ELB to know what is going on. Do we agree that VPC flow logs could be the way to dig deeper into this?
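If we do end up scraping ELB-level metrics, CloudWatch already exposes some that map to this symptom for classic ELBs, e.g. BackendConnectionErrors and HTTPCode_ELB_5XX. A sketch of reading one of them with aws-sdk-go; the load balancer name is a placeholder assumption:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

func main() {
	svc := cloudwatch.New(session.Must(session.NewSession()))

	// Sum of backend connection errors over the last hour in 5 min buckets.
	out, err := svc.GetMetricStatistics(&cloudwatch.GetMetricStatisticsInput{
		Namespace:  aws.String("AWS/ELB"),
		MetricName: aws.String("BackendConnectionErrors"),
		Dimensions: []*cloudwatch.Dimension{{
			Name:  aws.String("LoadBalancerName"),
			Value: aws.String("tenant-api-elb"), // placeholder
		}},
		StartTime:  aws.Time(time.Now().Add(-1 * time.Hour)),
		EndTime:    aws.Time(time.Now()),
		Period:     aws.Int64(300),
		Statistics: []*string{aws.String("Sum")},
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, dp := range out.Datapoints {
		fmt.Printf("%s sum=%v\n",
			aws.TimeValue(dp.Timestamp).Format(time.RFC3339),
			aws.Float64Value(dp.Sum))
	}
}
```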
Vaclav created an issue in SIG Monitoring about this yesterday: https://github.com/giantswarm/giantswarm/issues/5294.
I'm still on this and trying to understand more. I enabled the VPC flow logs but it seems they didn't help that much; I haven't seen any suspicious logs. I'll continue to check.
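For completeness, enabling VPC flow logs programmatically looks roughly like this with aws-sdk-go; the VPC ID, log group, and IAM role ARN are placeholder assumptions:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))
	out, err := svc.CreateFlowLogs(&ec2.CreateFlowLogsInput{
		ResourceIds:  []*string{aws.String("vpc-0123456789abcdef0")}, // placeholder VPC ID
		ResourceType: aws.String("VPC"),
		// Rejected traffic is the interesting part when hunting broken connections.
		TrafficType:              aws.String("REJECT"),
		LogGroupName:             aws.String("tenant-cluster-flow-logs"),                 // placeholder
		DeliverLogsPermissionArn: aws.String("arn:aws:iam::123456789012:role/flow-logs"), // placeholder
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("flow log ids:", aws.StringValueSlice(out.FlowLogIds))
}
```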
Yes, but we should first find what is wrong before we add scraping of random metrics. We already scrape some of them in …. What I thought is that you could debug the problem, find what is wrong, and then figure out a way we can detect such a situation.
@calvix is there anything left to test? Is it good enough for us that you fixed and verified the issue and CI is green? I am moving the tags to the new WIP bundle as well. Just checking when we can merge this PR and get the PM sorted.
@xh3b4sd should be fine. I tested this a few times on ginger and gauss and was not able to reproduce it again.
Should the changes in … be made available for current versions as well?
It would be great to have this available for current versions.
Fixes https://github.com/giantswarm/giantswarm/issues/5220. I checked b8z5x on seal. See https://console.aws.amazon.com/vpc/home?region=us-east-1#subnets:search=b8z5x;sort=SubnetId. Only the private subnet is tagged there. I simply added the tags to the CF template. Not sure what the value should be, but I guess due to the weird Kubernetes label format that might not even matter at all. … is set already in AWS for a couple of clusters.