Add a matcher for partitioning services #3224

Closed
lawrencegripper opened this issue Apr 23, 2018 · 23 comments
@lawrencegripper
Contributor

Do you want to request a feature or report a bug?

Feature (I'll write it)

What did you do?

Service Fabric uses partitioning of services to improve scalability. I would like to add a matching rule which allows requests to be partitioned. A frontend would be created per partition, and the matching rule would ensure requests are matched to the correct frontend based on the value of a hash function, allowing you to distribute evenly across n partitions. This would be useful to other providers too, for example allowing requests to be partitioned across multiple container instances or services in Kubernetes.

Additional discussion: jjcollinge/traefik-on-service-fabric#45

Proposal

Add an additional matching rule to Traefik which enables a hashed-range match, for example HashedRange: type:header value:x-partitionheader match:0-100 range:0-300. It would take an input and use a hashing algorithm to convert it to an int with even distribution in a range. In this case the full range would be 0-300 and this rule would match if the hashed result of the header x-partitionheader fell in the range 0-100.

This could be used, for example, to create 3 partitions with KeyMin=0 and KeyMax=300 and distribute load between them:

  • Frontend for Partition 1 with matcher HashedRange: type:header value:x-partitionheader match:0-100 range 0-300
  • Frontend for Partition 2 with matcher HashedRange: type:header value:x-partitionheader match:100-200 range 0-300
  • Frontend for Partition 3 with matcher HashedRange: type:header value:x-partitionheader match:200-300 range 0-300
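A minimal sketch of the matching logic in Go. FNV-1a is an assumption here (the proposal doesn't prescribe a hash algorithm), and half-open [lo, hi) buckets are used so adjacent partitions don't overlap:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashToRange maps an arbitrary input string onto [0, max) with a
// roughly even distribution. FNV-1a is an illustrative choice; the
// proposal leaves the algorithm open.
func hashToRange(value string, max uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(value))
	return h.Sum32() % max
}

// matchesHashedRange reports whether value hashes into the half-open
// bucket [lo, hi) of the full range [0, max).
func matchesHashedRange(value string, lo, hi, max uint32) bool {
	b := hashToRange(value, max)
	return b >= lo && b < hi
}

func main() {
	headerValue := "customer-42" // e.g. the x-partitionheader value
	for _, p := range [][2]uint32{{0, 100}, {100, 200}, {200, 300}} {
		fmt.Printf("partition %d-%d matches: %v\n",
			p[0], p[1], matchesHashedRange(headerValue, p[0], p[1], 300))
	}
}
```

Because the buckets are half-open and cover the full range, exactly one frontend matches any given header value.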

In addition to the type:header option, I would also look to add a url-regex type, which would match a section of the URL to hash.

I can think of more types but I think these two cover most use cases.

Example of url-regex type

URL: http://example.com/bob/?customerid=jamesnesbit
HashedRange: type:url-regex value:[=].* match:0-100 range:0-300

This would hash jamesnesbit and match if the result was in the range 0-100.

What did you expect to see?

The Service Fabric provider would query stateful services and create a frontend for each partition with the appropriate HashedRange matcher. Requests would then be matched to the correct partition based on the value of their header or url-regex.

What did you see instead?

I don't believe it's currently possible to achieve this behavior in Traefik.

CC: @jjcollinge

@lawrencegripper lawrencegripper changed the title Add a Matcher for Partitioned services Add a matcher for partitioning services Apr 23, 2018
@ldez ldez added the kind/proposal a proposal that needs to be discussed. label Apr 27, 2018
@geraldcroes
Contributor

Hi, I'm quite new to the Service Fabric world so excuse my candor here.

Why can't you use the existing matchers in your use case? Like, a simple regex-based matcher?
Why do you need to hash the value?

To me, it looks like you're trying to implement a load balancing rule in the matcher.

Before taking any action, we need to fully understand your use case.

More specifically, what is a partitioned service? Can you give us some pointers, a diagram along with a use case?

Thanks for your help

@petertiedemann

petertiedemann commented Apr 27, 2018

@geraldcroes A partitioned service is typically a stateful service that has been broken into partitions (like shards for databases). Imagine you have customers A, B, C, D, and you decide to have 3 partitions.

The client only knows the customer; it does not know how A, B, C, D are allocated to partitions, but this information is required to route the request correctly. Without @lawrencegripper's feature, either the client must call somewhere else to get the partition, or a separate proxy service has to be set up.

Also see https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-concepts-partitioning .

FYI, the reason I am replying here is that we are starting to use Traefik in Service Fabric to replace the Azure APIM in our setup, and this is a bit of an annoyance (so there is a real-world "customer" here :)

@geraldcroes
Contributor

Thank you for your pointers, I'll read them right away.

I still don't get why the partitioning logic has to exist both in the stateful service and in Traefik (and why Traefik can't contact some kind of master that would handle the routing).

Also, one of my questions was: why do you need to hash / compute the value and not use the value "as is" with a regexp?

@jjcollinge
Contributor

jjcollinge commented Apr 27, 2018

The SF partitions have no knowledge of their fellow partitions and need to be load balanced individually; hence the appropriate partition endpoint must be resolved before the request is then load balanced over the partition's instances. There is an SF API that can resolve the request, but this would require a lookup from a piece of custom middleware for each request destined for a stateful partition (pick your poison?).

The need for a hash is to support range-based matching rather than direct string matching. Hashing ensures even distribution across the partitions, avoiding hot spots and making effective use of the underlying resources.

@lawrencegripper
Contributor Author

lawrencegripper commented Apr 27, 2018

A similar approach is used by the Metaparticle project to handle this in Kubernetes - good doc with diagram. The doc explains how this approach is used, hopefully demonstrating that the matcher is useful both in SF and in other orchestrators.

@geraldcroes
Contributor

A quick update -- I wasn't able to work on it last week but am now setting up an environment so I can test it and move forward.

@geraldcroes
Contributor

Another update -- I've dived into Service Fabric and now have a better grasp of the problem at hand.

To be completely honest, I was not familiar with the stateful services approach. Until now, I've always preferred the stateless one (a computing unit with an external persistence store).

That being said, I understand its value and the fact that it is an important feature (even if I have questions that I cannot yet find the answers to).

Allow us a bit more time to discuss it. We'll soon come back to you.

@lawrencegripper
Contributor Author

lawrencegripper commented May 11, 2018

Thanks for taking a look into the Service Fabric use case. I think the matcher also has broader usability for other systems too - as partitioning can be used for both scale and A/B testing.

For scale in large deployments

The Metaparticle link I shared provides a good example:

Sharding is useful because it ensures that only a small number of containers handle any particular request. This in turn ensures that caches stay hot and, if failures occur, they are limited to a small subset of users.

https://metaparticle.io/tutorials/dotnet-sharding/

For A/B testing

For example, you want to A/B test a new UI change. You want to expose the new version to a low number of users initially to understand how it affects engagement or error rates. To do this, you treat deployments as immutable, keeping the old version deployed alongside the new one and sending a % of requests to the new version. The problem is that, once a user has the new UI, you don't want them to jump randomly between new and old versions between requests or devices.

With the HashedRange matcher you can run your 2 deployments with the following labels:

  • New Version: HashedRange: type:header value:x-userid match:0-5 range 0-100
  • Old Version: HashedRange: type:header value:x-userid match:5-100 range 0-100

This would ensure that 5% of users are always directed to the new version, even if they disconnect/reconnect or log in on a different device or browser. The same 5% of users (identified through the x-userid header) will always see the new deployment. This gives users a consistent experience during an A/B test, and gives you a consistent test group.

This method may be preferable to sticky sessions (cookies) as, even if the user disconnects/reconnects, flushes cookies, or uses incognito or a different browser, they will always see the new deployment.

@geraldcroes
Contributor

Another update --

We've discussed the proposal at length, and there is still some debate about whether Traefik should embed this feature. For my part, after having investigated the issue and its use cases, I'm convinced that it should (at some point) be included.

There are some cons though.

  • The matcher would stand out as being more complicated than the others,
  • The matcher could have a serious impact on performance,
  • There should be more options to define the routes, maybe a chain system,
  • There should be support for custom algorithms to select the shards,
  • The feature is currently not being asked for by many (nor supported by a large community) ... even if I think that it would be welcome.

So for now, even if the team seems interested in the feature, it doesn't fully agree (yet) on the proposal.

Still, in the foreseeable future, Traefik will provide a feature that should enable users to customise and introduce the behaviour you're asking for.

In the meantime, I'll let maintainers take over and move forward.

@lawrencegripper
Contributor Author

Hi @geraldcroes, thanks for taking a look, and thanks to the wider team for the discussions - I appreciate the time and effort taken and agree with a number of the cons listed. In the interest of exploring all options, do any of the following give us a way forward?

  1. Move code under SF Provider

In the proposal I tried to make the matcher generic so that it works outside the Service Fabric provider. If it was specific to the SF provider and located within it, would that change the team's view? I believe it would mitigate the impact on the wider Traefik codebase while still allowing Traefik to support the SF stateful service use case, leaving support and maintenance to the SF community.

We would need a way to add a matcher to the list from the provider code, like we do at the moment with the application insights hook for logrus here.

  2. Explore a plugin model

We previously ruled this out as Go's native plugin package still doesn't support Windows. Would you be open to using something like hashicorp/go-plugin? I'd be happy to POC creating an extension point with it, allowing plugins to register matchers. This approach would have an added benefit: many SF users are .NET developers, so they could write their own partitioning matcher in .NET. I would want to benchmark to ensure the RPC calls didn't introduce too much latency, but that cost would be specific to SF users.

Let me know your thoughts.

@petertiedemann

@geraldcroes I definitely agree that it makes sense to support multiple sharding algorithms, but I am not sure why it would be considered so much more complicated than the other matchers, or why it would have such a significant performance impact.

You mention that it is not a very requested feature, but it is certainly a feature we would like where I work. Would it make any difference if we were a paying customer (I noticed you introduced commercial support)?

@geraldcroes
Contributor

@lawrencegripper We're discussing options, I'll keep you updated as soon as I can.

@ldez
Member

ldez commented Jun 1, 2018

Moving the code under the SF Provider doesn't look appealing because it would make it stand apart (even more than it currently does).

One of our goals is to offer a cohesive and straightforward API, whatever provider the users have chosen, and we don't welcome the idea of proposing features here but not there.

Once again, we understand that the feature would be welcomed by the Service Fabric community, but unfortunately we're not yet ready to include it as is.

This is not the first time that plugin systems (or similar) have come up in the discussion (see below for references). Even though we're working toward solutions that would make it possible, we're not ready yet - and by "yet", I mean that we're actively working on it.

It's never an easy thing to answer with "sorry, not yet," but this is all I can do for now.

@petertiedemann Whether you're a paying customer or not has not come up once in the debate. The only reason we're postponing the proposal is that we truly are not ready yet.

We thank you once again for the proposal, which we'll keep open, and regret having to close the current pull request.

Rule #1 of open-source: no is temporary, yes is forever.

https://twitter.com/solomonstre/status/715277134978113536?lang=en

grpc plugin: #2362
go plugin: #1865
plugin: #1336

@lawrencegripper
Contributor Author

@ldez Really appreciate the response, thanks for taking a look at the alternatives I proposed and all the help with SF provider 👍

Let me know when/how things go with the plugin model and look forward to taking another crack at this in the future.

@petertiedemann

@ldez I only brought the support thing up because @geraldcroes said this functionality was not a much-requested feature and not supported by a large community; I thought that having paying customers using the feature might help justify having to support it.

Without this feature we will either have to use a fork of Traefik, or write stateless proxies for our stateful services (luckily we only have a few). I haven't explored how paid support would work if using a fork, but I doubt it would work out well.

You guys really need plugin support :)

@lawrencegripper
Contributor Author

So I recently came across goja, an ECMAScript implementation in Go (https://github.com/dop251/goja), via its use in the k6 load testing project: https://github.com/loadimpact/k6/blob/master/js/compiler/compiler_test.go

It would, in theory, allow us to have simple JS functions defining matching rules/middleware. These could be base64-encoded and set as labels on the services, then loaded and run dynamically, or provided to Traefik in the TOML.

We would need to run some tests to understand the impact on performance; my hope is that basic rules would be faster than out-of-process RPC-style plugin models.

@ldez If this sounds of interest I'd be happy to look at running some benchmarks.

@marshalYuan

@lawrencegripper I also want a dynamic matcher for A/B testing or service chaining, and go-lua was my original plan. But our engineers debated its performance. What about goja?

@lawrencegripper
Contributor Author

lawrencegripper commented Aug 1, 2018

I’m unclear on the perf as I haven’t run any benchmarks, but I’d be happy to do some testing if this is something the Traefik team would consider merging, assuming it can meet performance goals.

@clazarr

clazarr commented Mar 3, 2019

As a potentially interested party seeking alternatives to Azure APIM and custom-coded API gateways, I was wondering whether any progress has been made in supporting stateful services in Service Fabric with Traefik, almost a year since the original PR and more than 9 months since the "no, not yet" response? Is there another path under development by the Traefik maintainers, such as the JS-functions-as-matching-rules/middleware approach @lawrencegripper mentioned?

This is important to Traefik's integration with the SF platform since stateful service support is a major differentiator of the SF platform. In other words, without it, folks in my situation will likely look elsewhere.

@lawrencegripper
Contributor Author

There hasn't been any progress on this that I'm aware of as it's blocked by the availability of a plugin model to move this out of the tree.

@ldez's comment is a good summary of the situation. I understand that managing an OSS project with lots of different users and supported platforms means some will not get everything they want.

On a related note, building OSS is hard, and people can expect a lot and sometimes unintentionally appear ungrateful for the hard work of others. Please keep in mind that @ldez and the Traefik team have taken a lot of time to review, improve, and maintain the Service Fabric provider.

@clazarr

clazarr commented Mar 4, 2019

On a related note, building OSS is hard, and people can expect a lot...

I appreciate the response and status update. I agree that successful OSS projects are built upon lots of hard work. Contributions of ideas and code from the community help further that success. I appreciate the Traefik team's specific vision for the right way to evolve the project. It's understandable that there has been much interest in an extensibility model to allow additional functionality or leverage platform capabilities (e.g. Service Fabric and others) for a long time. I think we're all just trying to move things forward and address significant functional requirements / use cases.

@aantono
Contributor

aantono commented Mar 15, 2019

For what it’s worth, I’ve been prototyping various embeddable interpreters like go-lua and goja, etc. So far their performance hasn’t been great (worse than previous attempts with gRPC or HashiCorp go-plugin). I have gotten good results with https://github.com/d5/tengo, so I will try to make a PR for the Traefik folks to consider.

@nmengin
Contributor

nmengin commented Feb 8, 2024

Hello,

This proposal targets Traefik v1, which is no longer supported.
I am closing the issue accordingly.

We'll re-open it later if necessary.

@nmengin nmengin closed this as completed Feb 8, 2024
@traefik traefik locked and limited conversation to collaborators Mar 10, 2024