
Introduce a specific error code for Slack API rate limit errors #2695

Open
chillleader opened this issue Jun 4, 2024 · 12 comments
Labels: kind:enhancement (New feature or request), scope:slack, support (Support ticket)
@chillleader (Member)

Is your feature request related to a problem? Please describe.

When using the Slack API via the Slack OOTB connector, it is possible to hit the rate limit. The connector currently doesn't distinguish rate-limit errors from other error types: instead of a specific error message, the user sees a generic one, and only the logs may contain further details. This also makes it impossible to work around the rate limit by implementing custom retry behavior for such errors.

Describe the solution you'd like

The Slack connector should differentiate between rate-limiting errors and other error types. It should return a specific error code for rate limiting, making it possible to handle the error in an error expression, for example by modeling an explicit retry cycle in BPMN.
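As a sketch of what this could look like, the connector might classify failures into error codes before surfacing them. The code names `SLACK_RATE_LIMIT` and `SLACK_API_ERROR` and this classifier are illustrative assumptions, not the connector's actual implementation:

```java
// Hypothetical sketch: mapping Slack API failures to a dedicated error code
// that could then be matched in a BPMN error expression. Names are assumptions.
public class SlackErrorClassifier {

    static final String RATE_LIMIT_CODE = "SLACK_RATE_LIMIT";
    static final String GENERIC_CODE = "SLACK_API_ERROR";

    /** Derives an error code from the HTTP status and Slack's error string. */
    static String classify(int httpStatus, String slackError) {
        // Slack signals rate limiting with HTTP 429 and error "ratelimited"
        if (httpStatus == 429 || "ratelimited".equals(slackError)) {
            return RATE_LIMIT_CODE;
        }
        return GENERIC_CODE;
    }

    public static void main(String[] args) {
        System.out.println(classify(429, "ratelimited"));        // SLACK_RATE_LIMIT
        System.out.println(classify(404, "channel_not_found"));  // SLACK_API_ERROR
    }
}
```

With a dedicated code, an error expression on the connector task could catch only rate-limit failures and route them into an explicit retry loop while other errors fail the job as before.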

Describe alternatives you've considered

The Slack connector could support a customized retry mechanism tailored specifically to rate-limiting errors, but that is difficult given the stateless nature of outbound connectors. We may investigate further in this direction.

@chillleader chillleader added kind:enhancement New feature or request scope:slack support Support ticket labels Jun 4, 2024
@chillleader chillleader self-assigned this Jun 4, 2024
@furudean commented Jun 7, 2024

Why shouldn't the connector just handle rate limiting as documented in the Slack API? Given that we keep state in the connector anyway, wouldn't it be more useful to resume the iteration instead of starting over and performing all the same lookups for user IDs and channel IDs again (usually resulting in the same rate-limit error)?

@chillleader (Member Author)

@furudean that would be a great option, but the issue is that it would result in a potentially unbounded execution time for the connector task, which conflicts with Zeebe's concept of a job timeout. Job workers must return a result synchronously, or Zeebe will consider the job stuck and make it available to workers again, which can lead to request duplication. Depending on the job timeout configuration, the connector could even behave differently in otherwise similar situations.

As I understand it, there is no guarantee that a request to the Slack API will succeed after waiting the period specified in Retry-After, so in theory the total waiting time is unbounded.

@chillleader (Member Author) commented Jun 10, 2024

Alternatively, we could support exponential backoff strategies in the connector's Retry backoff property to allow more flexibility in overcoming rate limiting in a platform-native way - wdyt?
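For illustration, an exponential backoff schedule of the kind suggested here usually doubles a base delay per attempt up to a cap. The formula and the cap below are assumptions for the sketch, not an existing connector feature:

```java
// Illustrative exponential backoff schedule: base * 2^attempt, capped.
// Base delay, cap, and formula are assumptions, not the connector's actual config.
import java.time.Duration;

public class BackoffSchedule {

    /** Returns base * 2^attempt, capped at maxBackoff. attempt is 0-based. */
    static Duration backoffFor(int attempt, Duration base, Duration maxBackoff) {
        // Cap the shift so the long multiplication cannot overflow
        long millis = base.toMillis() << Math.min(attempt, 30);
        return (millis < 0 || millis > maxBackoff.toMillis()) ? maxBackoff : Duration.ofMillis(millis);
    }

    public static void main(String[] args) {
        Duration base = Duration.ofSeconds(1);
        Duration cap = Duration.ofMinutes(1);
        // Prints 1s, 2s, 4s, ... until the 1-minute cap is reached
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.println("attempt " + attempt + " -> " + backoffFor(attempt, base, cap));
        }
    }
}
```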

@chillleader (Member Author) commented Jun 10, 2024

Okay, thinking about this some more, I realized we already have a way to support it. This looks like a great use case for the ConnectorRetryException, which is already supported by the connector runtime.

On rate-limiting errors, the connector would throw this exception, setting the backoff to the value of Retry-After from the Slack response. The job is returned to Zeebe immediately, and Zeebe reassigns it to the worker after the requested backoff. You would need to specify the maximum retry count in the element template, after which the job fails. What do you think of this?
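The backoff value would come from Slack's Retry-After header, which carries a number of seconds. A small, self-contained parsing sketch with a fallback for a missing or malformed header (the 30-second fallback is an assumption for illustration):

```java
// Sketch: derive the retry backoff from Slack's Retry-After header.
// The DEFAULT_BACKOFF fallback value is an assumption for illustration.
import java.time.Duration;

public class RetryAfterParser {

    static final Duration DEFAULT_BACKOFF = Duration.ofSeconds(30);

    /** Parses a Retry-After value in seconds; falls back if absent or malformed. */
    static Duration parseRetryAfter(String headerValue) {
        if (headerValue == null) {
            return DEFAULT_BACKOFF;
        }
        try {
            long seconds = Long.parseLong(headerValue.trim());
            return seconds >= 0 ? Duration.ofSeconds(seconds) : DEFAULT_BACKOFF;
        } catch (NumberFormatException e) {
            return DEFAULT_BACKOFF;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseRetryAfter("30")); // PT30S
        System.out.println(parseRetryAfter(null)); // PT30S (fallback)
    }
}
```

The connector could then throw the runtime's ConnectorRetryException with this duration as the backoff, so the broker waits exactly as long as Slack requested before retrying.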

@furudean

@chillleader Could you explain how using ConnectorRetryException differs from throwing a normal exception in this case? Would this enable us to resume the iteration inside the connector without sending redundant requests?

@chillleader (Member Author) commented Jun 11, 2024

@furudean when a connector throws a ConnectorRetryException, the job is returned to the broker (as with any other exception), but ConnectorRetryException lets the connector customize the backoffDuration (by default it's 0, i.e. jobs are retried immediately). The connector can throw this special exception with the backoff duration set to exactly what Slack asked for in the Retry-After header. The broker then waits for the specified time before the retry, after which the connector runtime picks up the job and sends the request again.

This won't require any manual error handling on your side (process execution recovers automatically). The only thing to keep in mind is that the default number of retries is 3; you may want to increase this value in the connector properties should you continue to see rate-limiting issues.

@furudean commented Jun 11, 2024

@chillleader That sounds helpful in some cases, but this won't help with the issue of too many channels/users in a workspace when we're using the "Invite" method, right? The connector would query the user and channel endpoints until it hits a rate limit, and then the context is thrown away with the connector job. If it retries at this point, it basically starts over from zero.

@chillleader (Member Author)

Ah, I see what you mean, thanks for explaining! For some reason I assumed this lookup was performed in bulk, but I see now that's not the case 🤦 I agree that it would not be a complete solution for your problem.

Let me just gather more opinions on this from the team and get back to you asap.

@chillleader (Member Author) commented Jun 12, 2024

Our suggestion is to use the Slack connector in multi-instance mode. This way, each Slack connector invocation contains only one user per job, and the context of the batch processing is managed by Zeebe. Please check the examples below. To prevent rate-limiting issues with this setup, it's best to use sequential multi-instance (not parallel) with a reasonable retry backoff.

This can already be configured with the current implementation of the Slack connector. As an improvement, I still think it would be valuable to throw the ConnectorRetryException and configure the correct backoff value automatically. Please let me know if this works for you.

Connector properties - multi-instance (screenshot, 2024-06-12)

Slack connector multi-instance.bpmn.zip
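Concretely, the multi-instance setup described above could look something like this on the connector task (property and variable names here are illustrative, not taken from the actual element template):

```
Multi-instance marker:   sequential
Input collection:        = userEmails     // process variable holding the list of users
Input element:           userEmail
Connector "Users" field: = userEmail      // one user per connector invocation
Retries:                 5
Retry backoff:           PT30S            // wait 30s between attempts
```

Each instance then performs a single user lookup and invite, so a rate-limited retry only repeats that one user's requests instead of the whole batch.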

@furudean

@chillleader I tried out the linked process in our instance, switching the variables to values relevant to our Slack workspace. I seem to just get the same error. I'm not sure what the multi-instance process gained us here.

(screenshot of the error, 2024-06-14)

@chillleader (Member Author)

In this example, multi-instance lets us avoid losing the context of already-resolved user IDs and wasting retries on re-resolving all user IDs every time the job is retried. This is achieved by breaking the bulk job into a number of instances, each responsible for resolving a single user ID, resolving the channel ID, and then inviting that user to the channel. With multi-instance, when Slack returns a rate-limiting error, Zeebe handles retries for each user separately (vs. attempting to resolve N users together and failing with a rate-limiting issue again).

But multi-instance itself is only part of the solution: retries also need to be configured so that the job isn't retried immediately (use the Retry backoff property for this). This process uses sequential multi-instance, so users are invited one by one; if the connector is rate-limited while inviting the first user, it won't attempt to invite the rest until the first issue is cleared.

Regarding the error, I'm not sure what the cause was in this case - unfortunately we don't expose any details in the error message (but that will be solved when we resolve this GitHub issue!), so ideally we'd need to look at the connector runtime logs. If it's the rate-limiting error again, you would need to configure the connector's retry parameters to handle it. I'm also happy to organize a call and support you synchronously with this issue if you like - the support team can help with scheduling 🙂

But I wonder if the suggested solution makes sense and if it could work for you?

@furudean commented Jun 20, 2024

Sorry about the delay in getting back to you. Here's the error trace:

2024-06-20T22:28:25.625Z ERROR 1 --- [pool-2-thread-4] i.c.c.r.c.outbound.ConnectorJobHandler   : Exception while processing job: 6755399451199228 for tenant: <default>
java.lang.RuntimeException: Unable to find conversation with name: meri-test-7
	at io.camunda.connector.slack.outbound.utils.DataLookupService.getChannelIdByName(DataLookupService.java:193)
	at io.camunda.connector.slack.outbound.model.ConversationsInviteData.invoke(ConversationsInviteData.java:70)
	at io.camunda.connector.slack.outbound.SlackRequest.invoke(SlackRequest.java:54)
	at io.camunda.connector.slack.outbound.SlackFunction.execute(SlackFunction.java:50)
	at io.camunda.connector.runtime.core.outbound.ConnectorJobHandler.handle(ConnectorJobHandler.java:178)
	at io.camunda.connector.runtime.outbound.jobhandling.SpringConnectorJobHandler.lambda$handle$0(SpringConnectorJobHandler.java:75)
	at io.micrometer.core.instrument.AbstractTimer.record(AbstractTimer.java:247)
	at io.camunda.zeebe.spring.client.actuator.MicrometerMetricsRecorder.executeWithTimer(MicrometerMetricsRecorder.java:50)
	at io.camunda.connector.runtime.outbound.jobhandling.SpringConnectorJobHandler.handle(SpringConnectorJobHandler.java:66)
	at io.camunda.zeebe.client.impl.worker.JobRunnableFactoryImpl.executeJob(JobRunnableFactoryImpl.java:45)
	at io.camunda.zeebe.client.impl.worker.JobRunnableFactoryImpl.lambda$create$0(JobRunnableFactoryImpl.java:40)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: com.slack.api.methods.SlackApiException: status: 429, error: ratelimited, needed: null, provided: null, warning: null
	at com.slack.api.methods.impl.MethodsClientImpl.parseJsonResponseAndRunListeners(MethodsClientImpl.java:3641)
	at com.slack.api.methods.impl.MethodsClientImpl.parseJsonResponseAndRunListeners(MethodsClientImpl.java:3614)
	at com.slack.api.methods.impl.MethodsClientImpl.postFormWithTokenAndParseResponse(MethodsClientImpl.java:3521)
	at com.slack.api.methods.impl.MethodsClientImpl.conversationsList(MethodsClientImpl.java:1923)
	at io.camunda.connector.slack.outbound.utils.DataLookupService.getChannelIdByName(DataLookupService.java:176)
	... 16 common frames omitted

I'm getting a rate-limit response trying to invite just one user into one channel, so unfortunately multi-instance does not help for our large Slack workspace. It will always hit the rate limit with how the connector hammers the API, and it fails quickly.

Let's continue the conversation in the support ticket, as it's getting a little off topic. The general sentiment of this issue is good and will help with debugging, but it unfortunately won't solve our problem.
