
Decouple workload, configuration management from charm.py #9

Closed
wants to merge 7 commits

Conversation

sed-i (Contributor) commented May 25, 2023

In this PR, a few parts of operating grafana agent have been decoupled from charm.py.
Something similar is expected to follow in a future PR for operating nginx.

  • Split out "WorkloadManager" and "Config".
  • Set the stage for the config to be lazily evaluated, in an attempt to avoid code-ordering issues.
  • New "compound status" handling approach:
    • Charm components report back via a callback
    • Each component, and then the charm, is responsible for calculating the "total" status for itself

Apart from that, most of the code was copy-pasted from grafana agent (k8s operator).
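Below is a minimal sketch of the charm-side wiring for this approach. The combine() helper, the component name, and the callback name are illustrative assumptions, not the exact code in this PR:

from typing import Dict

from ops.charm import CharmBase
from ops.model import ActiveStatus, StatusBase


def combine(statuses) -> StatusBase:
    """Pick the most severe status; a naive stand-in for real combination logic."""
    order = {"blocked": 0, "waiting": 1, "maintenance": 2, "active": 3}
    return min(statuses, key=lambda s: order.get(s.name, 4), default=ActiveStatus())


class MimirCoordinatorCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self._component_statuses: Dict[str, StatusBase] = {}

    def _on_component_status_changed(self, component: str, status: StatusBase):
        # Each component reports back its own "total" via this callback;
        # the charm computes the total of totals and sets it on the unit.
        self._component_statuses[component] = status
        self.unit.status = combine(self._component_statuses.values())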

Comment on lines +18 to +23
class Status:
    """Helping with centralized status setting."""

    def __init__(self, callback: Callable[[StatusBase], None] = lambda _: None):
        self._config: StatusBase = UnknownStatus()
        self._callback = callback


It seems to me like this status class should be defined globally in charm.py, because the status will need to be combined with any status from Mimir itself.

Contributor Author:

It for sure would need to be combined, but the idea I'm proposing here is:

  • Each component has its own status and calculates its own total status
  • Each component reports back its total status
  • The "total of totals" is calculated in charm code

This seems more flexible than introducing additional assumptions about the structure/hierarchy of one global status object. Wdyt?
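For concreteness, here is a sketch of the component side under this proposal. The Status class in this diff tracks only a config status; the second "service" aspect below is added purely for illustration:

from typing import Callable

from ops.model import ActiveStatus, StatusBase, UnknownStatus


class Status:
    """Track per-aspect statuses and report this component's total upstream."""

    def __init__(self, callback: Callable[[StatusBase], None] = lambda _: None):
        self._config: StatusBase = UnknownStatus()
        self._service: StatusBase = UnknownStatus()  # illustrative second aspect
        self._callback = callback

    @property
    def config(self) -> StatusBase:
        return self._config

    @config.setter
    def config(self, value: StatusBase):
        self._config = value
        # Recalculate this component's total and report it back to the charm.
        self._callback(self._combined())

    def _combined(self) -> StatusBase:
        # Naive combination: the first non-active aspect wins.
        for status in (self._config, self._service):
            if not isinstance(status, ActiveStatus):
                return status
        return ActiveStatus()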

Contributor:

I'm thinking we should wait for compound status to land in ops; inventing our own here seems like reinventing the wheel.

Comment on lines +63 to +80
super().__init__(charm, f"{self.__class__.__name__}-{container_name}")

# Property to facilitate centralized status update
self.status = Status(callback=status_changed_callback)  # pyright: ignore

self._unit = charm.unit

self._service_name = self._container_name = container_name
self._container = charm.unit.get_container(container_name)

self._render_config = config_getter

# turn the container name to a valid Python identifier
snake_case_container_name = self._container_name.replace("-", "_")
charm.framework.observe(
    getattr(charm.on, "{}_pebble_ready".format(snake_case_container_name)),
    self._on_pebble_ready,
)


Since we call super, I think we should be able to use self.framework and self.unit rather than charm.framework and charm.unit.

Contributor Author:

You're right, but this seems more explicit in a good way?

Comment on lines +108 to +109
if version := self.version:
    self._unit.set_workload_version(version)
@dstathis, May 30, 2023:

issue (blocking): I'm not sure the version we display should be the grafana-agent version. This is a Mimir charm after all. Not sure what version we should display but I don't think it should be this.

Contributor Author:

Good point!
I was focusing on the "workload manager" aspect and overlooked this.

  • If we aim to have the same "workload manager" for gagent here and elsewhere, then we would probably need a means to selectively set/not set the workload version (sketched below).
  • The coordinator runs gagent and nginx. How awful would agent: x.y.z; nginx: a.b.c be?
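As a rough sketch of the first bullet: the set_workload_version flag and the stub version property below are hypothetical, not part of this PR:

class WorkloadManager:
    """Fragment of a shared manager; only the version-stamping bits are shown."""

    def __init__(self, charm, *, container_name: str, set_workload_version: bool = True):
        self._unit = charm.unit
        self._container = charm.unit.get_container(container_name)
        # Hypothetical opt-out: a coordinator running several workloads would
        # pass False and decide on a combined version string itself.
        self._set_workload_version = set_workload_version

    @property
    def version(self) -> str:
        return "0.0.0"  # placeholder; the real code queries the workload binary

    def _on_pebble_ready(self, event):
        if self._set_workload_version and (version := self.version):
            self._unit.set_workload_version(version)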

Contributor Author:

TODO:

  • log.info the workload version
  • do not set the workload version here; let charm code access the .version prop directly
  • in this case, charm code should set n/a (or multi/various?)

Comment on lines +173 to +183
if config == old_config:
    # Nothing changed, possibly new installation. Move on.
    self.status.config = ActiveStatus()
    return

try:
    self.write_file(self.CONFIG_PATH, yaml.dump(config))
except APIError as e:
    logger.warning(str(e))
    self.status.config = WaitingStatus(str(e))
    return


question: Should the write_file be in an else block? Seems like it shouldn't be needed if the config file did not change.

question: Should we restart or reload grafana-agent after writing the new config file?

Contributor Author:

  1. If write_file goes in the else block, what would we have in the try block? Wouldn't it be equivalent to just having the write_file in the try block in the first place?
  2. At first I had a restart in this same function (like we do elsewhere), but that made the function not reusable on pebble-ready, because there is no service yet. I think it makes sense to separate the two concerns: update config and restart. We could come up with a third name that does both (sketched below).
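A sketch of that separation: update_config and restart mirror the intent described above, while reconcile stands in for the hypothetical "third name":

class WorkloadManager:
    """Fragment: config updates and restarts as separate concerns."""

    def update_config(self) -> None:
        """Render and push the config; safe to call on pebble-ready, before the service exists."""
        ...

    def restart(self) -> None:
        """(Re)start the workload service; requires the service to be defined."""
        ...

    def reconcile(self) -> None:
        """Hypothetical combined entry point for handlers where both are wanted."""
        self.update_config()
        self.restart()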

Comment on lines +180 to +183
except APIError as e:
    logger.warning(str(e))
    self.status.config = WaitingStatus(str(e))
    return
@dstathis, May 30, 2023:

issue (blocking): I do not think waiting status makes sense. We are not waiting on a related app. In fact, this error is not resolvable at all, so we should surface the exception in order to go into error state.

Contributor Author:

This is copied from gagent.

Seems like:

  • It should be WaitingStatus if we're before pebble-ready ("auto resolvable")
  • BlockedStatus otherwise

So I agree, all in all this should be BlockedStatus.
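One way to express that split, assuming the manager attributes from this diff (self._container, self.status, self.write_file); using can_connect() as the "before pebble-ready" check is an assumption:

import logging

import yaml
from ops.model import BlockedStatus, WaitingStatus
from ops.pebble import APIError

logger = logging.getLogger(__name__)


def _update_config(self, config: dict) -> None:
    try:
        self.write_file(self.CONFIG_PATH, yaml.dump(config))
    except APIError as e:
        logger.warning(str(e))
        if not self._container.can_connect():
            # Before pebble-ready: likely auto-resolvable, keep waiting.
            self.status.config = WaitingStatus(str(e))
        else:
            # After pebble-ready: not auto-resolvable, surface as blocked.
            self.status.config = BlockedStatus(str(e))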

Comment on lines +162 to +163
config = self._render_config() # TODO: Must not be None
assert config is not None
Contributor:

Are we capturing possible AssertionError exception? 🤔
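For reference, a sketch of replacing the assert with an explicit check; the error message is illustrative:

config = self._render_config()
if config is None:
    # Fail loudly with a descriptive error instead of an assert, which is
    # stripped when Python runs with -O.
    raise RuntimeError("config_getter returned None; cannot render the workload config")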

Comment on lines +76 to +84
config_getter=lambda: Config(
    topology=JujuTopology.from_charm(self),
    scrape_configs=None,  # FIXME generate from memberlist
    remote_write=self.remote_write_consumer.endpoints,
    loki_endpoints=self.loki_consumer.loki_endpoints,
    insecure_skip_verify=True,
    http_listen_port=3500,
    grpc_listen_port=3600,
).build(),  # TODO figure out what to do about potential code ordering problem
Contributor:

I would build the Config() object beforehand, so this section is easier to understand.

Member:

We cannot, unfortunately, as it must be evaluated as a lambda to not miss out on the changes happening as part of the event hook execution.

@sed-i: What we could do, however, is to move the Config object instantiation to a separate function/method, and pass that invocation as the lambda expression.
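A sketch of that suggestion, reusing names from this diff; the method name _render_config is illustrative:

class MimirCoordinatorCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.workload = WorkloadManager(
            self,
            container_name="agent",
            config_getter=self._render_config,  # bound method, still evaluated lazily
        )

    def _render_config(self) -> dict:
        # Runs at call time, inside the event hook, so it sees fresh relation data.
        return Config(
            topology=JujuTopology.from_charm(self),
            scrape_configs=None,  # FIXME generate from memberlist
            remote_write=self.remote_write_consumer.endpoints,
            loki_endpoints=self.loki_consumer.loki_endpoints,
            insecure_skip_verify=True,
            http_listen_port=3500,
            grpc_listen_port=3600,
        ).build()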

Comment on lines +28 to +29
for endpoint in self.remote_write + self.loki_endpoints:
    endpoint["tls_config"] = {"insecure_skip_verify": insecure_skip_verify}
Member:

This does not look like it necessarily belongs in the init function.
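For instance, the mutation could move into build() and copy rather than mutate the caller's endpoint dicts; a sketch, with hypothetical output keys:

class Config:
    def __init__(self, *, remote_write, loki_endpoints, insecure_skip_verify: bool):
        self.remote_write = remote_write
        self.loki_endpoints = loki_endpoints
        self.insecure_skip_verify = insecure_skip_verify

    def build(self) -> dict:
        tls = {"insecure_skip_verify": self.insecure_skip_verify}
        # Copy rather than mutate the endpoint dicts passed to __init__.
        return {
            "remote_write": [{**ep, "tls_config": tls} for ep in self.remote_write],
            "loki_endpoints": [{**ep, "tls_config": tls} for ep in self.loki_endpoints],
        }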


@PietroPasotti (Contributor) left a comment:

RE the different Config approach, I don't think it's such a different approach. The implementation is different, sure, but the idea is the same. Also AM's config has somewhat different design goals:

  • storedstate-backed hash instead of recalculation from filesystem (we could factor that out, of course).
  • the config building is somewhat more complicated, not just one big dict.

We should see if we can generalize the two?
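As one direction for generalizing, the storedstate-backed hash could be factored into a small helper; a sketch with illustrative names:

import hashlib

from ops.framework import Object, StoredState


class ConfigHashTracker(Object):
    """Detect config changes via a StoredState hash instead of re-reading the filesystem."""

    _stored = StoredState()

    def __init__(self, charm, key: str):
        super().__init__(charm, key)
        self._stored.set_default(config_hash="")

    def changed(self, rendered: str) -> bool:
        new_hash = hashlib.sha256(rendered.encode()).hexdigest()
        if new_hash == self._stored.config_hash:
            return False
        self._stored.config_hash = new_hash
        return True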


def config(self, value: StatusBase):
    self._config = value
    # When status is updated, it is likely desirable to have some kind of side effect.
    self._side_effect()
Contributor:

[nit]: _side_effect --> _update_status

    logger.debug("Status updated to: %s", self._combined())
    self._callback(self._combined())

def _combined(self) -> StatusBase:
Contributor:

[nit]: either _combine() or @property\ndef _combined(self)...

Returns:
    A string equal to the agent version
"""
if not self.is_ready:
Contributor:

is_ready is not a property (but IMHO it should be), so this will always be trivially true.
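A sketch of the suggested fix, assuming the manager holds self._container as elsewhere in this diff:

@property
def is_ready(self) -> bool:
    # As a property, `if not self.is_ready` checks pebble connectivity instead
    # of the truthiness of a bound method (which is always True).
    return self._container.can_connect()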

try:
    self.write_file(self.CONFIG_PATH, yaml.dump(config))
except APIError as e:
    logger.warning(str(e))
Contributor:

If this fails, how does the charm know? And how does it retry?

@sed-i (Contributor Author) commented Nov 28, 2023:

Outdated. Closing.

@sed-i closed this Nov 28, 2023