
Must have a custom scale controller #27

Closed
ryancrawcour opened this issue Mar 21, 2019 · 20 comments

@ryancrawcour
Contributor

Until this extension is supported by the Functions scale controller we will need logic to handle scale.

Something (a web app or a WebJob, running in the same App Svc?) will need to check the "queue length" of the configured Kafka topic(s) and determine whether the current number of Function instances is keeping up adequately.

If we're falling behind, we need to scale out.
If we're ahead, or have drained the queue, then we need to scale back in.
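The check-and-decide loop described above could be sketched as follows (a minimal illustration; `desired_instances`, the lag threshold, and the per-partition lag figures are all hypothetical, not part of any existing API):

```python
def desired_instances(partition_lags, current, partition_count,
                      lag_threshold=1000):
    """Decide how many Function instances we need.

    partition_lags: per-partition consumer lag (messages behind).
    Scale out while partitions are behind, never beyond the partition
    count (one consumer per partition); scale in once lag is drained.
    """
    lagging = sum(1 for lag in partition_lags if lag > lag_threshold)
    if lagging > current:
        return min(lagging, partition_count)   # falling behind: scale out
    if lagging == 0:
        return 1                               # drained: scale back in
    return current                             # keeping up: hold steady
```

An external observer (the "web app or WebJob" above) would run this periodically and apply the result to the App Service plan's instance count.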

@jeffhollan & @anirudhgarg, can you please add some details to this requirement that will get us to an MVP scale controller?

@ryancrawcour ryancrawcour added this to the P0 milestone Mar 21, 2019
@jeffhollan

jeffhollan commented Mar 21, 2019

This is being worked out in a related and private repo. Will link to this issue from there.

--

Also, in the short term, this trigger is expected to work in the Azure service only on an App Service Plan with autoscaling. Getting Kafka scaling logic into the Azure Functions multi-tenant scale controller is still being evaluated based on customer need. Since Kafka is generally not exposed publicly, it remains to be seen whether there is interest in a multi-tenant, non-VNet-connected Kafka trigger. Once we have VNet-enabled triggers, or potentially custom trigger support, or enough customer interest, we can consider supporting this in the Azure Functions service scale controller.

@ryancrawcour
Contributor Author

Does this mean we won't need a custom scaler if we plan on hosting this in Azure in an App Service (until Consumption is supported)?

@jeffhollan

Correct

@ryancrawcour
Contributor Author

Awesome. Might just close this one then :)
Or, leave it open until we verify we can deploy this custom scaler into an App Svc environment.

@fbeltrao
Contributor

Do we need to expose statistics or additional information to the scaler or is it going directly to the source?

@jeffhollan

The App Svc environment won't have a custom scaler - it will rely on autoscaling, which scales based on CPU and memory. So the short-term solution (sans that private repo thing I linked above) is to rely on CPU metrics to drive scaling.

Allowing you to use a custom scaler in something like App Service or even Consumption is on the backlog for Functions, but far enough out that it's hard to give an ETA. I don't expect this calendar year. If there's enough customer demand, probably the quickest route is for us to provide support in the Azure Functions Consumption scale controller - that's a much more straightforward path. Not easy, but easier than custom scalers.

@ryancrawcour
Contributor Author

Not sure I understand how CPU metrics will help figure out that our Function is falling behind.

@jeffhollan

jeffhollan commented Mar 21, 2019

Also I'm sorry I thought this comment was in this repo but I added to the other. And the short answer is it won't.

Also, in the short term, this trigger is expected to work in the Azure service only on an App Service Plan with autoscaling. Getting Kafka scaling logic into the Azure Functions multi-tenant scale controller is still being evaluated based on customer need. Since Kafka is generally not exposed publicly, it remains to be seen whether there is interest in a multi-tenant, non-VNet-connected Kafka trigger. Once we have VNet-enabled triggers, or potentially custom trigger support, or enough customer interest, we can consider supporting this in the Azure Functions service scale controller.

@ryancrawcour
Contributor Author

Hmmmm, this is why I was thinking that we'd need something that would check the queue length and scale our App Svc accordingly.

I know @anirudhgarg and I discussed this briefly when we met.

Even if I scale the App Svc (that hosts my Functions) manually, how will it know to spin up additional instances of the Function? Something needs to tell it to spin up new instances, or tear down existing instances (or does the App Svc plan not do this, and just keep instances sitting idle waiting for work?)

@jeffhollan

You aren't wrong - but there's no feature or existing way to do this today. So there's nothing we can build that could check the queue length and then use that detail to drive scaling decisions of an app service plan.

@ryancrawcour ryancrawcour added the blocked Waiting for someone else to resolve label Mar 25, 2019
@jcooper1982

Hi guys,

Just found this thread and I’m not sure if the approach you guys are talking about is the most natural way to scale Kafka consumers. Not that your ideas aren’t great, they just seem more aligned to something like Azure Service Bus than Kafka.

Queue length ideally shouldn't be used to decide how many instances to spin up, as that will compromise ordered processing of messages if you have multiple instances processing against a given partition. Rather, the best way to handle this is to see how many partitions of the given topic the consumer is lagging against and to spin up an instance for each. The difficulty here is that in order to achieve consumption at scale, you really want the function assigned to a given partition to hang around for a while and process lots of messages rather than just one, thus taking advantage of librdkafka's batching logic.

Have been running Kafka consumers very successfully in WebJobs, but without any dynamic scaling. If you guys crack this problem in Functions then this would be amazing. It might be worth paying attention to Knative's approach, but I suspect the way it works there is via an observer container that then decides how many worker containers to spin up. Without the observer I think you guys are going to have to be mega-creative to come up with a good solution that is a natural fit for Kafka.

Again apologies if all of this is obvious to you guys. Not trying to poke holes but am instead keen to see a great solution eventuate out of this. I’ll definitely use it!

Cheers
Johann

@ryancrawcour
Contributor Author

Thanks @jcooper1982. Very good points. We currently only have one consumer per partition to ensure we keep ordered delivery intact. In terms of how many instances we need etc. this is where we're a bit stuck right now. Current thinking is that we need some kind of observer, but what should it observe on?

@jcooper1982

Hi @ryancrawcour, consumer lag for the consumer group per partition is definitely the right thing to observe because then you can spin up one instance per partition which the consumer is lagging against, ensuring you only have just enough active functions running at a given time. Is it actually possible to build a custom observer for Kafka in the functions engine?

I believe there is an admin client in the latest Confluent SDK, though I haven’t played with it yet, which you might be able to get consumer lag out of. Alternatively you could get it from consumer statistics.
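As a rough illustration of what per-partition consumer lag means (pure arithmetic on offsets; no Kafka client involved, and the function name is made up):

```python
def consumer_lag(high_watermarks, committed):
    """Per-partition lag = the log-end (high watermark) offset minus the
    consumer group's committed offset for that partition.

    A partition with no committed offset lags by the full log, which is
    modelled here by defaulting the committed offset to 0.
    """
    return {partition: hw - committed.get(partition, 0)
            for partition, hw in high_watermarks.items()}
```

In practice these two numbers would come from the broker (watermarks) and the consumer group's committed offsets, whether via an admin client or consumer statistics as suggested above.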

A trick I’ve done before (have written a Kafka .Net SDK wrapping confluent) is to have multiple threads within a consumer with logic to stage messages in memory if they share a key with a message already being processed. This obviously requires some throttling controls to ensure you don’t overrun memory but does offer you the best of all worlds. I’ll be happy to contribute (probably not for a few weeks) if you guys are keen to enable this as an option. This was an optional feature in my SDK as well.
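The staging trick described above might look something like this toy, single-threaded illustration (this is not the SDK mentioned in the comment; the class and its behavior are invented for the sketch):

```python
from collections import defaultdict, deque

class KeyAffinityDispatcher:
    """Stage messages whose key is already in flight, so that two
    messages sharing a key are never processed concurrently, while
    messages with distinct keys can fan out to worker threads."""

    def __init__(self, max_staged=10_000):
        self.in_flight = set()              # keys currently being processed
        self.staged = defaultdict(deque)    # key -> queued messages
        self.max_staged = max_staged        # throttle to protect memory

    def submit(self, key, msg):
        """Return msg if it may be processed now, else stage it."""
        if key in self.in_flight:
            if sum(len(q) for q in self.staged.values()) >= self.max_staged:
                raise RuntimeError("staging buffer full; throttle consumer")
            self.staged[key].append(msg)    # hold until the key frees up
            return None
        self.in_flight.add(key)
        return msg

    def done(self, key):
        """Mark key finished; return the next staged message for it, if any."""
        if self.staged[key]:
            return self.staged[key].popleft()  # key stays in flight
        self.in_flight.discard(key)
        del self.staged[key]
        return None
```

A real implementation would need locking around `submit`/`done` and back-pressure on the consumer rather than an exception, but the core invariant (one in-flight message per key) is the same.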

@ryancrawcour
Contributor Author

ryancrawcour commented Apr 21, 2019

Thanks @jcooper1982 we'd value your contributions whenever you are able.

@fbeltrao
Contributor

fbeltrao commented Apr 22, 2019

Hi @jcooper1982 thanks for your feedback!

Current design:

  • Let librdkafka manage partition assignment
  • Adding new hosts will rearrange partitions. Scaling up to Topic.PartitionCount
  • Order is guaranteed at partition level
  • Single consumer per partition. Consumer has 1+ partitions

I believe this is a good general implementation, assuming you have well distributed partition load and processing.

However, if partitions are unbalanced we would have to manage partition assignments manually and spawn new hosts to handle lagging partitions in isolation. I am not sure it makes sense, as the problem source is most likely a sub-optimal partition strategy.

Does it make sense?
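The rearrangement in the bullets above could be sketched like this (a hypothetical round-robin helper for illustration only; real assignment is handled by librdkafka's consumer-group rebalancing):

```python
def assign_partitions(partition_count, host_count):
    """Spread partitions round-robin over hosts.

    Adding a host rearranges partitions; hosts beyond the partition
    count sit idle, which is why scaling caps at Topic.PartitionCount.
    """
    active = min(host_count, partition_count)
    assignment = {h: [] for h in range(host_count)}
    for p in range(partition_count):
        assignment[p % active].append(p)
    return assignment
```

With 4 partitions, growing from 2 to 4 hosts halves each consumer's share; a 5th host would get nothing.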

@jcooper1982

Hi @fbeltrao, yup that sounds like a perfectly good and standard design. I will go through the code over the next few weeks and provide more feedback then too.

The main question is how you will decide whether to add new hosts or not. An extreme option is to always have one host per partition, which will mean rapid processing but inefficient cost, and you might as well go for continuous WebJobs. Alternatively, you need some intelligence that decides that scale-out is necessary, and I suppose you have a few options below.

- Scale out based on the number of partitions for the consumer group which have consumer lag. As previously mentioned, this could be done using the admin client if this function is supported, or via consumer statistics. This would be the ultimate option in terms of cost effectiveness while still supporting high throughput.
- Scale out based on CPU usage (doesn't really make much sense, because CPU usage doesn't indicate that there is work to do on other partitions) or some other arbitrary measure.
- Scale out if the current set of hosts is busy, since that is an indicator that there are active messages on the topic, but it is no guarantee that there will be any work for an additional host to do.

The other issue is the short lifespan of functions. I suppose your scaling engine will need to account for the fact that a function can only run for so long if it's running within a consumption based plan. The extension is going to have to have smarts to auto shutdown in a clean fashion and the engine is going to need to decide whether to instantly spin up another host because there's still more work to do.

@fbeltrao
Contributor

I wouldn't worry about the function execution time, because the limit you are talking about applies to the custom code that will be triggered by this extension. AFAIK the listening part does not have a limit, and will be shut down only if the scale controller decides to scale it down.

@jcooper1982

Fantastic. Will dive into this a bit deeper soon and will hopefully have a better appreciation for the underlying architecture of Azure functions by then since I only understand this at a conceptual level right now.

Will also speak with my (soon to be new) employer about what level of collaboration they are willing to commit, since the SDK I wrote for them will likely have quite a few reusable parts here that could add a lot of value.

@ryancrawcour ryancrawcour modified the milestones: P0, P1 Apr 28, 2019
@jeffhollan

Just updating this - we now have a way to create a custom trigger that runs in VNet-enabled plans (Premium). We should be able to make this trigger run well in the Premium plan (Linux and Windows) now.

@jeffhollan jeffhollan removed the blocked Waiting for someone else to resolve label Nov 21, 2019
@ryancrawcour
Contributor Author

Azure Functions Premium now supports scaling based on Kafka topic length.
Support also exists in KEDA.
