Skip to content

Add controlled rollout proposal#6221

Merged
chiiph merged 2 commits intomainfrom
controlled-rollout-proposal
Jul 7, 2022
Merged

Add controlled rollout proposal#6221
chiiph merged 2 commits intomainfrom
controlled-rollout-proposal

Conversation

@chiiph
Copy link
Copy Markdown
Contributor

@chiiph chiiph commented Jun 14, 2022

Proposal for allow features to be rolled out in a more controlled way.

@chiiph chiiph requested review from a team and zwass June 14, 2022 14:50
Comment on lines +61 to +62
enable_software_inventory:
default: false
Copy link
Copy Markdown
Contributor

@michalnicp michalnicp Jun 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason using an object with default: false wouldn't work for the regular case as well? I usually find that having multiple types are harder to implement and more confusing. In this case it seems like a RolloutBoolean can be an object or a boolean. Are we doing this for backwards compatibility reasons?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the goal is backwards compatibility.


## How

We would create a new type of boolean value in our `AppConfig` called `RolloutBoolean`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have typically seen this sort of thing called a feature flag. Our users may also be more familiar with that term than RolloutBoolean.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be, feature flags in my book is a binary thing, though. This is a way of rolling out a feature so that eventually you enable for all if you want.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feature flags are not always just a binary off/on thing. For example, in LaunchDarkly, feature flags can target a specific user, a subset of users, use conditional logic. They can also take a range of values (not just boolean, but also strings, numbers, etc)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note I've heard customers express concerns about turning on features for a bunch of hosts at once. They want to be able to roll out things like the software inventory queries slowly and monitor the performance impacts.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feature flags are not binary in what they target, but I think they do tend to be associated with temporary things though (i.e. roll out slowly but eventually it should be enabled everywhere) and typically to be more for the development team than a config option for the users.

If I understand correctly this proposal, this could be for permanent things - maybe team1 will never want feature X, but team2 wants it, is that correct? (and of course it seems to be strictly for users, not for us as a way to "hide" an experimental feature - they manage the flags however they want for how long they want)

So in my understanding, it might be confusing to call this "feature flags" though I certainly see the similarities.

concerns about turning on features for a bunch of hosts at once

I'm not sure the config approach of setting specific host IDs scales very well though. E.g. if I manage 20K hosts, it would be nicer to be able to say "enable for 10%" or "enable for hosts in this label", something like that, although that means having to manage some state for the percentage, and the label membership is dynamic. Thinking out loud, but to assist users with many hosts, we could have tooling in fleetctl (and/or the UI) to support rolling out in this way (x%, or hosts in "label", etc.) that would take a snapshot of matching host ids and set the config using host ids, so that there's no state to manage elsewhere and the label membership is just that - a snapshot at that moment in time. When you grow the rollout, it ensures the current host IDs stay in the set and only adds new ones.

Overall, I'm +1 too for the proposal - there's some complexity around allowing different types for the same config value, but nothing crazy.

I have some concerns about being sure to properly handle the now per-host application of the features in all cases, it might get very tricky depending on the feature (e.g. you can't just check if the feature is enabled and if so, do a batch SELECT of things related to that feature - you have to check if each affected host has the feature enabled).

My concerns would get clearer once we design this for a specific feature, but to give an hypothetical example, if the failing policies webhook integration was controlled like this, we'd have to add checks for each failed policy if it contains any host that has the feature enabled, ignore it if it only has hosts with the feature disabled, and only send the list of hosts with the feature enabled if it has a mix of both.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we should give people running Fleet tools to ... use a feature within the scope that is useful for them.

The wording from the doc above suggests that this is intended as a config option for users, to control whether certain features are enabled

it would be nicer to be able to say "enable for 10%" or "enable for hosts in this label"

I think a percentage rollout would be a great idea, especially for things that affect performance.

I have some concerns about being sure to properly handle the now per-host application of the features in all cases, it might get very tricky depending on the feature (e.g. you can't just check if the feature is enabled and if so, do a batch SELECT of things related to that feature - you have to check if each affected host has the feature enabled).

The best way I can think of implementing this right now is having a separate table you can join hosts with to check whether a feature is enabled eg host_features

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally left percentages out because of its added complexity. We can add it later on. If you want progressive rollout, either assign teams or specific hosts. If you want a range, we can make fleetctl auto generate the config based on a range, like @mna mentioned.

As for the wording, yup, it was left phrased like that so that this would serve both purposes: full AND partia rollouts. We can disagree on wether we want this, but the idea is: what's the simplest we could do that could serve the most use cases we've heard of?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see percentages being useful, but I can also see customers requesting other ways to control this, I think we should ask around before settling. One crazy idea could be to control the target hosts via an osquery query

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One crazy idea could be to control the target hosts via an osquery query

Not so crazy! This is basically (iiuc) being able to target labels :)

@lucasmrod
Copy link
Copy Markdown
Member

+1. I think this is a good low-risk improvement. We can start with the shown overrides as they are the most requested by customers (that is: enable on a certain team, on certain hosts and on certain platforms)

@lucasmrod
Copy link
Copy Markdown
Member

Am using docs/Contributing/Blueprints/ for pushing proposals/new-specs, should we use proposals instead? Am fine with either.

@lucasmrod lucasmrod self-requested a review June 14, 2022 18:28
lucasmrod
lucasmrod previously approved these changes Jun 14, 2022
@chiiph
Copy link
Copy Markdown
Contributor Author

chiiph commented Jun 14, 2022

Am using docs/Contributing/Blueprints/ for pushing proposals/new-specs, should we use proposals instead? Am fine with either.

Hm, well, I didn't think of throwing this in docs because it didn't feel like a doc... but if it is approved and merged, it would technically serve as the doc for the feature, so it's not a bad idea at all!

enabled by default, everybody would love them all, and they would work flawlessly for all possible use cases.

We are in the real world, though, which is not ideal. So we should give people running Fleet tools to rollout features
slowly, so that they can update infrastructure if needed, or only use a feature within the scope that is useful for
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great idea, should we put guidelines or a DRY individual to make the final decision regarding which features can be configured?

I can imagine this feature left unused because of the extra work, or being overused and suddenly everything is behind a config.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be work to strike a balance. The default is probably NOT to use this unless there's a need.

Comment thread proposals/001-controlled-rollout.md Outdated

We would create a new type of boolean value in our `AppConfig` called `RolloutBoolean`.

`RolloutBoolean` will have a function `Enabled(h *fleet.Host) bool`. So instead of doing this:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! as a general comment, I think it would be useful to include which RolloutBoolean values are true in the error instance we store in Redis.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you expand on your comment? I'm not sure I follow.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, sorry that was pretty abstract. I meant it would be nice to ultimately include which flags are enabled in fleetctl debug errors and/or fleetctl debug archive

Comment thread proposals/001-controlled-rollout.md Outdated
We would do:

```go
if ac.HostSettings.EnableHostUsers.Enabled(host) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very much a nit, but it would be nice to think about a naming convention for these booleans that reads better than EnableHostUsers.Enabled

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, any ideas?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for changing the name of this method, Enabled is a little too specific as well. I would propose something generic like Value or Get to retrieve the actual value of the boolean based on the host.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like Get!

roperzh
roperzh previously approved these changes Jun 15, 2022
@noahtalerman
Copy link
Copy Markdown
Member

noahtalerman commented Jun 15, 2022

@chiiph this is looking great overall!

What is the reasoning behind starting with three options (team, platform, and host) ?

what's the simplest we could do that could serve the most use cases we've heard of?

I'm thinking we could start with just teams. And the following becomes the documented way for test performance of new features:

  1. Create a "Sandbox" team
  2. Enable features for this team
  3. Transfer test hosts to this team

This also enables us tell the story of Fleet Premium being the paid, "enterprise edition" of Fleet. If you want to be able to test performance of new features, because you have performance concerns for your organization, sign up for Fleet Premium.

@chiiph
Copy link
Copy Markdown
Contributor Author

chiiph commented Jun 15, 2022

@noahtalerman

What is the reasoning behind starting with three options (team, platform, and host) ?

Basically the complexity of doing all three is not much greater than doing just one, and we would cover more use cases that would lower the barrier of adoption/rollout of features we've seen struggles with being adopted in orgs.

@zwass
Copy link
Copy Markdown
Member

zwass commented Jun 16, 2022

Thanks folks! The way we're going to proceed with this is to take this proposal as "here's what engineering thinks is possible/practical". Then Noah and I will work together to craft the "interface" portions of this and come back to Platform with prioritization for what gets implemented.

@chiiph chiiph dismissed stale reviews from roperzh and lucasmrod via 923d75c June 17, 2022 17:16
@chiiph chiiph temporarily deployed to Docker Hub June 17, 2022 17:16 Inactive
default: false
overrides:
platforms:
- linux: true
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDEA: To simplify first iteration, we should not allow users to add more than one overrides field.

You need to choose one of teams, hosts_ids or platforms, otherwise we would have to define a preference order between the three fields:

---
apiVersion: v1
kind: config
spec:
  host_settings:
    enable_software_inventory:
      default: false
      overrides:
        teams:
        - team1: true
        host_ids:
        - 3214: true
        platforms:
        - linux: false

(E.g. What do we do if host 3214 is a linux host in team1?)

@chiiph
Copy link
Copy Markdown
Contributor Author

chiiph commented Jul 7, 2022

Merging this since we're going to implement it asap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants