
Must have a custom scale controller #27

Closed
ryancrawcour opened this issue Mar 21, 2019 · 20 comments

@ryancrawcour
Contributor

Until this extension is supported by the Functions scale controller we will need logic to handle scale.

Something (a web app or a WebJob, running in the same App Svc?) will need to check the "queue length" of the configured Kafka topic(s) and determine whether the current number of Function instances is keeping up adequately.

If we're falling behind, we need to scale out.
If we're ahead, or have drained the queue, then we need to scale back in.
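The check-and-decide loop described above could be sketched as follows (a minimal illustration; `desired_instances`, the lag threshold, and the per-partition lag figures are all hypothetical, not part of any existing API):

```python
def desired_instances(partition_lags, current, partition_count,
                      lag_threshold=1000):
    """Decide how many Function instances we need.

    partition_lags: per-partition consumer lag (messages behind).
    Scale out while partitions are behind, never beyond the partition
    count (one consumer per partition); scale in once lag is drained.
    """
    lagging = sum(1 for lag in partition_lags if lag > lag_threshold)
    if lagging > current:
        return min(lagging, partition_count)   # falling behind: scale out
    if lagging == 0:
        return 1                               # drained: scale back in
    return current                             # keeping up: hold steady
```

An external observer (the "web app or WebJob" above) would run this periodically and apply the result to the App Service plan's instance count.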

@jeffhollan & @anirudhgarg, can you please add some details to this requirement that will get us to an MVP scale controller?

@ryancrawcour ryancrawcour added this to the P0 milestone Mar 21, 2019
@jeffhollan

jeffhollan commented Mar 21, 2019

This is being worked out in a related and private repo. Will link to this issue from there.

--

Also, in the short term, this trigger is expected to work in the Azure service only on an App Service Plan with autoscaling. Getting Kafka scaling logic into the Azure Functions multi-tenant scale controller is still being evaluated based on customer need. Since Kafka is generally not exposed publicly, it remains to be seen whether there is interest in a multi-tenant, non-VNet-connected Kafka trigger. Once we have VNet-enabled triggers, or potentially custom trigger support, or enough customer interest, we can consider supporting this in the Azure Functions service scale controller.

@ryancrawcour
Contributor Author

Does this mean we won't need a custom scaler if we plan on hosting this in Azure in an App Service (until Consumption is supported)?

@jeffhollan

Correct

@ryancrawcour
Contributor Author

Awesome. Might just close this one then :)
Or, leave it open until we verify we can deploy this custom scaler into an App Svc environment.

@fbeltrao
Contributor

Do we need to expose statistics or additional information to the scaler or is it going directly to the source?

@jeffhollan

The App Svc environment won't have a custom scaler - it will rely on autoscaling, which scales based on CPU and memory. So the short-term solution (sans that private repo thing I linked above) is to rely on CPU metrics to drive scaling.

Allowing you to use a custom scaler in something like App Service or even Consumption is on the backlog for Functions, but far enough out that it's hard to give an ETA. I don't expect this calendar year. If there's enough customer demand, probably the quickest route is for us to provide support in the Azure Functions Consumption scale controller - that's a much more straightforward path. Not easy, but easier than custom scalers.

@ryancrawcour
Contributor Author

Not sure I understand how CPU metrics will help figure out that our Function is falling behind.

@jeffhollan

jeffhollan commented Mar 21, 2019

Also I'm sorry I thought this comment was in this repo but I added to the other. And the short answer is it won't.

Also, in the short term, this trigger is expected to work in the Azure service only on an App Service Plan with autoscaling. Getting Kafka scaling logic into the Azure Functions multi-tenant scale controller is still being evaluated based on customer need. Since Kafka is generally not exposed publicly, it remains to be seen whether there is interest in a multi-tenant, non-VNet-connected Kafka trigger. Once we have VNet-enabled triggers, or potentially custom trigger support, or enough customer interest, we can consider supporting this in the Azure Functions service scale controller.

@ryancrawcour
Contributor Author

Hmmmm, this is why I was thinking that we'd need something that would check the queue length and scale our App Svc accordingly.

I know @anirudhgarg and I discussed this briefly when we met.

Even if I scale the App Svc (that hosts my Functions) manually, how will it know to spin up additional instances of the Function? Something needs to tell it to spin up new instances, or tear down existing instances (or does the App Svc plan not do this, and just keep instances sitting idle waiting for work?)

@jeffhollan

You aren't wrong - but there's no feature or existing way to do this today. So there's nothing we can build that could check the queue length and then use that detail to drive scaling decisions of an app service plan.

@ryancrawcour ryancrawcour added the blocked Waiting for someone else to resolve label Mar 25, 2019
@jcooper1982

Hi guys,

Just found this thread and I’m not sure if the approach you guys are talking about is the most natural way to scale Kafka consumers. Not that your ideas aren’t great, they just seem more aligned to something like Azure Service Bus than Kafka.

Queue length ideally shouldn't be used to decide how many instances to spin up, as that will compromise ordered processing of messages if you have multiple instances processing against a given partition. Rather, the best way to handle this is to see how many partitions of the given topic the consumer is lagging against and to spin up an instance for each. The difficulty here is that in order to achieve consumption at scale, you really want the function assigned to a given partition to hang around for a while and process lots of messages rather than just one, thus taking advantage of librdkafka's batching logic.

Have been running Kafka consumers very successfully in WebJobs, but without any dynamic scaling. If you guys crack this problem in Functions then this would be amazing. It might be worth paying attention to Knative's approach, but I suspect the way it works there is via an observer container that then decides how many worker containers to spin up. Without the observer I think you guys are going to have to be mega-creative to come up with a good solution that is a natural fit for Kafka.

Again apologies if all of this is obvious to you guys. Not trying to poke holes but am instead keen to see a great solution eventuate out of this. I’ll definitely use it!

Cheers
Johann

@ryancrawcour
Contributor Author

Thanks @jcooper1982. Very good points. We currently only have one consumer per partition to ensure we keep ordered delivery intact. In terms of how many instances we need etc. this is where we're a bit stuck right now. Current thinking is that we need some kind of observer, but what should it observe on?

@jcooper1982

Hi @ryancrawcour, consumer lag for the consumer group per partition is definitely the right thing to observe because then you can spin up one instance per partition which the consumer is lagging against, ensuring you only have just enough active functions running at a given time. Is it actually possible to build a custom observer for Kafka in the functions engine?

I believe there is an admin client in the latest Confluent SDK, though I haven’t played with it yet, which you might be able to get consumer lag out of. Alternatively you could get it from consumer statistics.
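As a rough illustration of what per-partition consumer lag means (pure arithmetic on offsets; no Kafka client involved, and the function name is made up):

```python
def consumer_lag(high_watermarks, committed):
    """Per-partition lag = the log-end (high watermark) offset minus the
    consumer group's committed offset for that partition.

    A partition with no committed offset lags by the full log, which is
    modelled here by defaulting the committed offset to 0.
    """
    return {partition: hw - committed.get(partition, 0)
            for partition, hw in high_watermarks.items()}
```

In practice these two numbers would come from the broker (watermarks) and the consumer group's committed offsets, whether via an admin client or consumer statistics as suggested above.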

A trick I’ve done before (have written a Kafka .Net SDK wrapping confluent) is to have multiple threads within a consumer with logic to stage messages in memory if they share a key with a message already being processed. This obviously requires some throttling controls to ensure you don’t overrun memory but does offer you the best of all worlds. I’ll be happy to contribute (probably not for a few weeks) if you guys are keen to enable this as an option. This was an optional feature in my SDK as well.
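The staging trick described above might look something like this toy, single-threaded illustration (this is not the SDK mentioned in the comment; the class and its behavior are invented for the sketch):

```python
from collections import defaultdict, deque

class KeyAffinityDispatcher:
    """Stage messages whose key is already in flight, so that two
    messages sharing a key are never processed concurrently, while
    messages with distinct keys can fan out to worker threads."""

    def __init__(self, max_staged=10_000):
        self.in_flight = set()              # keys currently being processed
        self.staged = defaultdict(deque)    # key -> queued messages
        self.max_staged = max_staged        # throttle to protect memory

    def submit(self, key, msg):
        """Return msg if it may be processed now, else stage it."""
        if key in self.in_flight:
            if sum(len(q) for q in self.staged.values()) >= self.max_staged:
                raise RuntimeError("staging buffer full; throttle consumer")
            self.staged[key].append(msg)    # hold until the key frees up
            return None
        self.in_flight.add(key)
        return msg

    def done(self, key):
        """Mark key finished; return the next staged message for it, if any."""
        if self.staged[key]:
            return self.staged[key].popleft()  # key stays in flight
        self.in_flight.discard(key)
        del self.staged[key]
        return None
```

A real implementation would need locking around `submit`/`done` and back-pressure on the consumer rather than an exception, but the core invariant (one in-flight message per key) is the same.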

@ryancrawcour
Contributor Author

ryancrawcour commented Apr 21, 2019

Thanks @jcooper1982 we'd value your contributions whenever you are able.

@fbeltrao
Contributor

fbeltrao commented Apr 22, 2019

Hi @jcooper1982 thanks for your feedback!

Current design:

  • Let librdkafka manage partition assignment
  • Adding new hosts will rearrange partitions. Scaling up to Topic.PartitionCount
  • Order is guaranteed at partition level
  • Single consumer per partition. Consumer has 1+ partitions

I believe this is a good general implementation, assuming you have well distributed partition load and processing.

However, if partitions are unbalanced we would have to manage partition assignments manually and spawn new hosts to handle lagging partitions in isolation. I am not sure it makes sense, as the problem source is most likely a sub-optimal partition strategy.

Does it make sense?
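The rearrangement in the bullets above could be sketched like this (a hypothetical round-robin helper for illustration only; real assignment is handled by librdkafka's consumer-group rebalancing):

```python
def assign_partitions(partition_count, host_count):
    """Spread partitions round-robin over hosts.

    Adding a host rearranges partitions; hosts beyond the partition
    count sit idle, which is why scaling caps at Topic.PartitionCount.
    """
    active = min(host_count, partition_count)
    assignment = {h: [] for h in range(host_count)}
    for p in range(partition_count):
        assignment[p % active].append(p)
    return assignment
```

With 4 partitions, growing from 2 to 4 hosts halves each consumer's share; a 5th host would get nothing.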

@jcooper1982

Hi @fbeltrao, yup that sounds like a perfectly good and standard design. I will go through the code over the next few weeks and provide more feedback then too.

The main question is how you will decide whether to add new hosts or not. An extreme option is to always have one host per partition, which will mean rapid processing but inefficient cost, and you might as well go for continuous WebJobs. Alternatively, you need some intelligence that decides that scale-out is necessary, and I suppose you have a few options below.

- Scale out based on the number of partitions for the consumer group which have consumer lag. As previously mentioned, this could be done using the admin client if this function is supported, or via consumer statistics. This would be the ultimate option in terms of cost effectiveness while still supporting high throughput.
- Scale out based on CPU usage (doesn't really make much sense, because CPU usage doesn't indicate that there is work to do on other partitions) or some other arbitrary measure.
- Scale out if the current set of hosts is busy, since that is an indicator that there are active messages on the topic, but it is no guarantee that there will be any work for an additional host to do.

The other issue is the short lifespan of functions. I suppose your scaling engine will need to account for the fact that a function can only run for so long if it's running within a consumption based plan. The extension is going to have to have smarts to auto shutdown in a clean fashion and the engine is going to need to decide whether to instantly spin up another host because there's still more work to do.

@fbeltrao
Contributor

I wouldn't worry about the function execution time, because the limit you are talking about applies to the custom code that will be triggered by this extension. AFAIK the listening part does not have a limit, and will be shut down only if the scale controller decides to scale it down.

@jcooper1982

Fantastic. Will dive into this a bit deeper soon and will hopefully have a better appreciation for the underlying architecture of Azure functions by then since I only understand this at a conceptual level right now.

Will also speak with my (soon to be new) employer about what level of collaboration they are willing to commit, since the SDK I wrote for them will likely have quite a few reusable parts here that could add a lot of value.

@ryancrawcour ryancrawcour modified the milestones: P0, P1 Apr 28, 2019
@jeffhollan

Just updating this - we now have a way to create a custom trigger that runs in VNet-enabled plans (Premium). We should be able to make this trigger run well in the Premium plan (Linux and Windows) now.

@jeffhollan jeffhollan removed the blocked Waiting for someone else to resolve label Nov 21, 2019
@ryancrawcour
Contributor Author

Azure Functions Premium now supports scaling based on Kafka topic length.
Support also exists in KEDA.
