Failed Authentication: Too many connects #28
This error is the result of too many connection requests at the same time given your broker's instance size. You probably want to set reconnect.backoff.ms to something higher than the default. See this page for more information: https://docs.aws.amazon.com/msk/latest/developerguide/limits.html
"Too many connects" is a sign that one or more IAM clients are trying to connect to a particular broker too many times per second and the broker is protecting itself. Please note this error is not about the total number of connections per broker but the rate of new IAM connections per broker.
@sayantacC @liko9 Thanks a lot for the detailed answer! Making this change worked for me. Maybe I missed it, but was this limits page referenced in the IAM documentation? I think it would be helpful to have that. Thanks again!
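For illustration, a minimal sketch of the kind of client-side override suggested above; the values are assumptions chosen for the example, not recommendations from this thread:

```properties
# Kafka client configuration (e.g. client.properties) - illustrative values.
# Wait longer before a reconnect attempt (the default is 50 ms), so a broker
# enforcing an IAM connection-rate limit is not hammered with new connections.
reconnect.backoff.ms=1000
# Upper bound for the exponential backoff; must be >= reconnect.backoff.ms.
reconnect.backoff.max.ms=10000
```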
I have increased this value, but am still getting the error. I'm running Kafka Connect in distributed mode, and this is the error from the Kafka Connect service:
@BhuviTheDataGuy The number of Kafka Connect workers will increase the rate of new connections to the Kafka brokers. Depending on the number of Kafka Connect workers you have, you might need to set reconnect.backoff.ms even higher.
Instance type: t3.small (2 brokers)
@BhuviTheDataGuy Sorry for the delayed response. If you update the broker type to a larger instance, does the problem go away?
We are encountering the same problems. Setting reconnect.backoff.ms did not help. When would the reconnect take place? As I went through the implementation, what I see is that the SaslAuthenticationException is treated as not retriable. Even if the retry would work, several AdminClients are created, which all connect to the MSK cluster. Since this is not a reconnect, reconnect.backoff.ms will not help. So either AWS removes the limit on t3.small for IAM connections, or the Kafka clients should throw a different exception :-/ See parts of our logs using AWS MSK Connect:
Since this forces us to use the m5.large instance or not use IAM, I posted this question on the AWS re:Post site to hopefully obtain help from the AWS community: https://repost.aws/questions/QU3qd8DVjTR4qHH_4Zf4KtuA/msk-connect-on-t-3-small-fails-due-to-not-retryable-sasl-authentication-exception-reconnect-backoff-ms-worker-configuration-will-not-help-can-aws-remove-the-connection-limit
@mfbieber What do your Kafka Connect configurations look like? Obviously, please mask anything sensitive, but I'm keenly interested in how you are setting the backoff parameter. One element of Kafka Connect which isn't immediately obvious is the number of different contexts it runs within. In my testing, I had to set the backoff for two additional contexts - Producer and Consumer. From what you shared, it appears that your AdminClientConfig took the parameter as expected, but I'd be curious to see if the others did as well (my guess is that they did not). Also, you are referencing trunk in your code investigation - did you compile the latest version from trunk yourself, or are you using a specific version of Apache Kafka?
I believe this is because the default backoff is 50 ms, which is way too quick for MSK using IAM. I also set reconnect.backoff.max.ms higher than reconnect.backoff.ms so that the reconnect could happen more than once instead of failing immediately. (Reference: https://kafka.apache.org/documentation/#producerconfigs_reconnect.backoff.max.ms)
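To make the multiple-contexts point above concrete, here is a hedged sketch of a Kafka Connect worker configuration that sets the backoff for the admin, producer, and consumer contexts; the values are illustrative:

```properties
# connect-distributed.properties - a sketch, values illustrative only.
# Backoff for the worker's own admin client:
reconnect.backoff.ms=1000
reconnect.backoff.max.ms=10000
# Kafka Connect forwards "producer."- and "consumer."-prefixed settings to
# the internal producer and consumer it creates, so the backoff has to be
# set for those contexts separately:
producer.reconnect.backoff.ms=1000
producer.reconnect.backoff.max.ms=10000
consumer.reconnect.backoff.ms=1000
consumer.reconnect.backoff.max.ms=10000
```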
Hi @liko9, thanks for your interest and your answer! I tried your configuration, but I am seeing the same errors as before with this worker configuration:
Also, because I am unsure of how to configure the connector, I added those values to the connector configuration (where should they actually go? I would have thought the worker, but I am just starting out with Kafka):
In the connector log, I also see:
I believe I am seeing the AdminClientConfig only. That one is crashing (as described), and I suspect that prevents any producers and consumers from being started.
The Kafka Connect version is 2.7.1 and the cluster's Kafka version is 2.8.1. I believe it is easier if you see the full Kafka Connect worker log, so I've attached it here; it contains no production-relevant information. I really hope that I am missing something in the configuration.
@mfbieber First off, thank you for your detailed descriptions and full logs. If more people were willing to provide that, it would make troubleshooting far, far easier. I previously overlooked that you are using MSK Connect. I do see in the logs the same as you: the AdminClientConfig has accepted the parameters you provided, and these do match my working configuration. However, my working configuration is not from MSK Connect but rather from Kafka Connect running on an EC2 instance (created prior to the availability of MSK Connect).

You are correct that because the AdminClient is unable to check its topics (the config, offsets, and storage topics) in MSK due to the "too many connects" error, it is not continuing to where the producer or consumer are initialized. There is an automatic retry/reconnection attempt that uses the reconnect.backoff parameters. I'm at a loss as to why this isn't working within MSK Connect; I'd suggest opening a support case to see if the service team can see anything that we're not seeing in the logs. The parameters are worker-level parameters (not connector parameters), so you did set them in the proper place.

The two things you could try while you're waiting on support are to set reconnect.backoff.ms to 10000 (make it 10 s instead of 1 s; this is a totally silly idea, but it might be interesting if the behavior changes) or to run Kafka Connect yourself on EC2. I'm sorry that I couldn't be of further help.
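For the first experiment, the change would look something like this in the worker properties; the 10 s value is the deliberately extreme one suggested above, and the max value is an assumption for the sketch:

```properties
# Deliberately extreme backoff, to test whether MSK Connect honors the
# setting at all (default initial backoff is 50 ms).
reconnect.backoff.ms=10000
# Assumed ceiling for the exponential backoff in this experiment.
reconnect.backoff.max.ms=30000
```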
Hello all, I am having this issue as well.

Test 1 (not working): t3.small + no custom config for reconnect.backoff. I reached out to AWS Support and was told that the issue is related to the TCP connection limit.

Test 2 (not working): t3.small + overriding the reconnect.backoff settings. I wasn't convinced by what AWS Support said, so I tried with the updated config.

Honestly, it doesn't make sense that I have to increase the machine type because t3.small can't handle the IAM auth requests from one MSK connector.
Although, sadly, I do not yet have a solution to offer, I want to reopen the issue to raise visibility.
@sayantacC To give a little more detail: I use only one MSK connector, with tasks.max=1, connecting to the MSK cluster.
@liko9: thanks for verifying! @Gatsby-Lee: thank you. We will also invest some $s in support and add a ticket for the t3.small limitation. The more people request it, the higher it might get on their priority list; that's what I heard, at least. The next (but not preferred) option would have been to use SASL/SCRAM, configure ACLs, and use the public endpoint to connect to the MSK cluster. I tested it, but unfortunately it is not possible to provide […]; I am going to create one more ticket for that too. Also, I will create another ticket for an option to delete worker configurations and custom plugins, which is not possible at the moment: https://stackoverflow.com/questions/70025964/how-to-delete-a-worker-and-a-plugin-on-aws-msk-connect
@mfbieber I escalated the two limitations you mentioned. I will share any updates I get about the TCP connection throttle issue with t3.small. For fun, I wrote about the issue here.
Thanks, @Gatsby-Lee. Noted; I will also share any updates I get here with you all.
This seems to be exclusive to MSK Connect, as I do not experience the same issue when running Kafka Connect with IAM authentication against MSK (it seems to respect the reconnect.backoff parameters as expected). Have either of you tried Kafka Connect without MSK Connect?
@liko9 no, I haven't tried it.
Hi all, I have this same problem on our dev cluster setting up MSK Kafka Connect. I have IAM working when running on EC2 or ECS, but with MSK Kafka Connect this remains a problem. Upgrading the instance types to m5.large did work around the problem, but this means I have a really expensive dev environment for very low-volume producers and consumers. Here is the custom worker config:
Errors from logs:
@mfbieber were you able to get MSK Kafka Connect working with t3.small?
@pmalon: no, but I have gotten feedback from the AWS Support team:
We will probably either wait until it works with t3 or implement a hacky solution involving a public cluster with self-built mechanisms to secure it. It simply depends on the timing, and we would rather wait for the IAM + t3 solution.
@liko9 I got the same error with pure Kafka Connect, without MSK Connect.
@mfbieber Since the current MSK Connect doesn't allow overriding the […]
We just went live running ~80 Debezium connectors in a Kafka Connect ECS cluster (not MSK Connect, as that was going to be much more expensive for our use case). Our MSK cluster is running 3 m5.large brokers, and I currently have ~7 connectors experiencing this issue.
@sayantacC or anyone else that might have insight: is there any discussion or progress internally related to this issue? I feel like something is severely wrong with the library here, or the number of connects per second is too low for our broker types. This is causing us severe headaches and a lack of confidence in the solution in production. I am going to open a support ticket with AWS, but I'd like to keep the conversation going here as well.

I want to stress that I do not think this is an issue with only MSK Connect or only with small broker sizes. We are spending good money on a large MSK cluster and Kafka Connect in ECS, and we constantly see this problem. Below is our setup. We currently are only producing data and don't even have our consumers wired up yet. We are able to sustain 30k messages per second, with spikes up to 70k-90k per second, while the brokers stay at ~50-75% CPU usage. However, our connectors keep failing with the error below. I have tried to set the retry and backoff as suggested in this thread. The longest we've gone without connectors failing for this is maybe 12 hours. Also, after restarting connectors they sometimes seem to fail faster than the max timeout we have set, so it feels like these settings aren't being honored, as others have said above.

It is frustrating to have to ask for more or larger brokers when the cluster seems to be performing more than sufficiently for our load otherwise. Explaining that we need more brokers just to handle IAM auth is a very tough sell. I did load testing at 60-100k messages per second and was not running into this problem, so again, it is hard to explain why I need to spend more on brokers now, and I'm not even sure that would solve our issue.

What can we do to troubleshoot further or get you more information? Should we just stop trying to use IAM auth? I see how we can monitor total TCP connections in our cluster in CloudWatch metrics, and we are staying below quota there. How do we monitor new connections per second, which seems to be where the problem is? And can we aggregate this by the source of the connection to help pinpoint where the issue is coming from?

Kafka: Broker config:
Kafka Connect: connect-distributed.properties:
Error:
@dude0001 Thanks for reporting your problem with such detail, and I am sorry about the pain this issue has been causing you. We are looking into the pain points with using IAM and Kafka Connect. In the meanwhile, I have captured my best understanding of this issue based on some digging and have some suggestions based on it.

Background: As you are painfully aware, each MSK broker imposes a limit on the connection creation rate by IAM clients. When this limit is breached, the broker rejects the next connection with a "Too many connects" error message that gets encapsulated in a SaslAuthenticationException on the client. By default, Kafka Connect fails the connector task immediately on any error. As a result, any SaslAuthenticationException caused the connector task to fail immediately. However, this behavior can be changed via the errors.tolerance field. My reading of the Kafka Connect code seems to indicate that setting errors.tolerance to all should keep the task from failing on these errors.

Suggestion: Would you be willing to try modifying your Connect's error handling by setting errors.tolerance=all?
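A minimal sketch of this suggestion, using the standard Kafka Connect error-handling properties:

```properties
# Connector configuration - tolerate errors instead of failing the task
# immediately. Note: "all" also skips over problematic records, which may
# not be acceptable for every use case (see the discussion below).
errors.tolerance=all
# Keep the tolerated errors visible in the worker log.
errors.log.enable=true
```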
I will point out for completeness that it is possible to include the messages while logging the errors as well (errors.log.include.messages).

Monitoring: The […] metric can be used to monitor the rate of new connections.
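Returning to the logging option mentioned above, a sketch assuming the standard Kafka Connect error-handling property:

```properties
# Include the failing record itself in the error log entry.
# Be careful: this can write sensitive record contents to the logs.
errors.log.include.messages=true
```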
Well, IMHO your suggestion works only for the case where it's OK to lose some data. In my case, it raises a data integrity question.
Is there any other logging or anything needed to reproduce and troubleshoot the issue? It seems easy enough to reproduce already, but if I can help I am glad to do it. Having to buy more compute in an already large cluster just for IAM auth feels bad, and there has to be another solution.

Thank you for pointing me to the metric. I missed this, and now I can at least monitor. There still doesn't seem to be a way to aggregate a source from this so we can pinpoint where we are opening a lot of connections quickly and try to optimize.

The "'all' changes the behavior to skip over problematic records" part of the suggested error-tolerance workaround does seem problematic for us as well. I appreciate the idea, but I am still researching it. Right now the only solution I have found is to automate restarting the failed connectors with a Lambda running on a cron schedule, but that has its own consequences for us as well.
@Gatsby-Lee Although it is unlikely to be useful, I will still point out for completeness that it is possible to include the messages while logging the errors as well (errors.log.include.messages).
@dude0001 We are internally looking into the problem of IAM connections being rejected with "Too many connects". In this case, there isn't a Kafka metric from the broker side that can aggregate by something like client-id, since the broker rejects the connection before it can really learn about the client-id. I will continue to look for any other mechanism of finding this information (such as logging). I wanted to point out that there are some JMX-based metrics on all Kafka producer/consumer/connect/streams instances. I have updated my comment above suggesting the errors.tolerance change.
I'm going to dog-pile on this same issue with another "this isn't just the small instance types" / "this isn't just MSK Connect" example. We are running with […], and it makes IAM-based auth nearly unusable for any larger deployment of Kafka Connect.
AWS MSK made an improvement for connection burst rates in the IAMAuthAgent. AWS needs to deploy the new version to your Amazon MSK clusters to mitigate the issue on your behalf; you can request this from AWS Support. The deployment requires a rolling restart of all the brokers in the cluster. During deployment, one broker at a time will be unavailable for read/write operations. Clients will still be able to communicate with the cluster if you follow the Amazon MSK best practices [1] to avoid any availability loss during the planned upgrade.

[1] https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html
We have recently addressed this issue in MSK. We have also updated the documentation at https://docs.aws.amazon.com/msk/latest/developerguide/limits.html#msk-provisioned-quota. If you still need help, please create a support case.
What does this error indicate? The logic of my application is that I have two producers and one consumer running in parallel with each other; could that be what is causing this issue? This is the first time I have seen this error:
22:03:32.012 [kafka-producer-network-thread | producer-1] INFO org.apache.kafka.common.network.Selector - [Producer clientId=producer-1] Failed authentication with example.kafka.us-east-1.amazonaws.com/10.1.1.132 ([446c81dc-9ab3-4d4b-b174-4ecd9baa406c]: Too many connects)
22:03:32.046 [kafka-producer-network-thread | producer-1] ERROR org.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Connection to node -1 (example.kafka.us-east-1.amazonaws.com/10.1.1.132:9098) failed authentication due to: [446c81dc-9ab3-4d4b-b174-4ecd9baa406c]: Too many connects
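If this is a standalone Java producer application, the same backoff settings discussed throughout this thread would go in the producer's own configuration; a minimal sketch with illustrative values:

```properties
# Producer configuration - illustrative values only.
# Two producers plus a consumer starting at once each open new connections;
# raising the backoff spreads out reconnect attempts after a
# "Too many connects" rejection.
reconnect.backoff.ms=1000
reconnect.backoff.max.ms=10000
```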