This repository has been archived by the owner on Feb 25, 2023. It is now read-only.

gravel: add etcd #403

Merged
merged 22 commits into from Apr 13, 2021

Conversation

@jecluis (Member) commented Apr 10, 2021

We need something to share state across multiple nodes, consistently, and, ideally, without a single point of failure. This patchset proposes etcd as that thing.

Why etcd?

Because it exists, is simple enough to set up, and does not require Ceph to work.

Why not Ceph's monitors kvstore?

We could use the monitors' config-key/kv store, but a few things strike me as good enough reasons not to:

  1. That was never what it was supposed to exist for, and everyone keeps abusing it. Ceph needs something like etcd to keep its stuff, or to implement a proper kvstore, be it in the monitors or elsewhere (the manager would be preferable, tbh).
  2. We lose the store if the cluster dies, doesn't start, or a monitor fills up to the brim and causes a quorum loss. If that happens, Aquarium still needs to be available to provide info to the user, and to tell the user which services have gone away, stuff like that.
  3. The ceph-mon kvstore lacks a few things that are pretty cool, like watching key updates, and transactions.

Few things of note

  • We add a new dependency on aetcd3. This was the only asyncio etcd3 library I found that seems recent enough, has decent testing, and actually builds on PyPI, even though it doesn't seem to be particularly active.
  • aetcd3 is far from perfect, and has one thing that bites us on quit: resources are not properly freed and an exception is raised. I've opened a PR against it (client: ensure channels are closed on __del__() martyanov/aetcd#4), but I have very little hope of getting it merged. This is why, for the purposes of this patchset, we're relying on a custom-compiled library sitting on one of my servers, somewhere.
  • Unless there's suddenly more activity on that lib, we may have to fork it and carry it ourselves.

Resolves #32

Signed-off-by: Joao Eduardo Luis <joao@suse.com>

@jecluis jecluis added feature New feature gravel Related to the Aquarium Backend milestone: required Required for the assigned Milestone labels Apr 10, 2021
@jecluis jecluis added this to the Milestone 3 milestone Apr 10, 2021
@jecluis jecluis added this to In progress in Project Aquarium via automation Apr 10, 2021
@jecluis jecluis force-pushed the wip-etcd branch 2 times, most recently from 44c4ba4 to 19b1758 Compare April 10, 2021 21:38
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@@ -41,6 +41,7 @@ class WelcomeMessageModel(BaseModel):
pubkey: str
cephconf: str
keyring: str
peer_url: str
@tserong (Member) commented Apr 12, 2021

Maybe I'm just nitpicking, but can we give this a slightly different name? Because it's peer_url, it makes me think it's a URL, but it's actually a string in the form "name=URL". Maybe call it, ummm... etcd_peer? Just something to make it slightly more obvious it's weird.

@tserong (Member):

(oops, that comment should have been part of the review below)

@jecluis (Member, Author):

yep, sounds good.

@tserong (Member) left a comment:

Looks like it should work :-) Missing a cache, for the case where etcd is dead, but you already noted that we'll need a cache in one of the commit messages, so that's fine. I have no particular reason not to approve this, but a few questions inline.

@@ -174,7 +174,8 @@ EOF
fi

if [[ "$kiwi_profiles" == *"Ceph"* ]]; then
pip install fastapi uvicorn aetcd3
pip install fastapi uvicorn \
https://celeuma.wipwd.dev/aqr/aetcd3/aetcd3-0.1.0a5.dev2+g0aa852e.d20210410-py3-none-any.whl
@tserong (Member):

Is this referring to the fix for martyanov/aetcd#4, which was since closed? In that case, do we still need this?

@jecluis (Member, Author):

I'll update the initial comment, but no. Even though that PR was closed, there are a couple of additional patches on top of the original repo: one fixes the aiofiles dependency version (the original repo requires <0.6, while uvicorn requires >=0.6, and it's annoying to have to deal with those conflicts when pip installs), and another adds a missing await when shutting down the lib, IIRC.

@@ -7,4 +7,4 @@ starlette==0.13.6
uvicorn==0.13.3
pip
websockets==8.1
aetcd3
https://celeuma.wipwd.dev/aqr/aetcd3/aetcd3-0.1.0a5.dev2+g0aa852e.d20210410-py3-none-any.whl
@tserong (Member):

Same comment as in pip install line

logger.info(f"started etcd process pid = {process.pid}")
t = threading.Thread(target=_bootstrap_process)
t.start()
logger.info("started etcd thread")
@tserong (Member):

"Starting it as a process, for some reason, allows etcd to squash the event loop's signal handlers, which breaks our "on shutdown" cleanup." Huh? As a separate process it shouldn't be able to do anything to the parent, should it? And if it can mess up the parent process, why wouldn't it also mess up the process when run as a thread? (I kinda feel like etcd should really be a separate process, just because it's its own daemon, rather than being some class or library that we import...)

@jecluis (Member, Author):

From the docs, that seems to be the daemon's instance name. initial-cluster seems much like mon_initial_cluster from Ceph. Even if that might set the cluster's name to the name of the first node, maybe there's a way to do that without having to name the first node something that does not match the hostname? (I'll look into that)

@jecluis (Member, Author):

wtf - this last comment should have been a reply to something else entirely. github seems to be on fire today.

Anyhoo, to @tserong's actual comment:

"Starting it as a process, for some reason, allows etcd to squash the event loop's signal handlers, which breaks our "on shutdown" cleanup." Huh? As a separate process it shouldn't be able to do anything to the parent, should it? And if it can mess up the parent process, why wouldn't it also mess up the process when run as a thread? (I kinda feel like etcd should really be a separate process, just because it's its own daemon, rather than being some class or library that we import...)

Yeah, I know. This is weird, really. I assumed the same thing, but for some reason, when starting an asyncio subprocess from a multiprocessing.Process, the signal handler is squashed; when starting the asyncio subprocess from a threading.Thread, it is not. I'll try getting a script to reproduce the behavior, just to ensure I'm not imagining things.

@jecluis (Member, Author):

Okay, I'm struggling to reproduce this with a script because it works fine with a Process. I truly can't tell wtf was going on. I'll try reverting it and see what happens, I guess.

async def ensure_connection(self) -> None:
""" Open k/v store connection """
# try getting the status, loop until we make it.
opened = False
@tserong (Member):

What happens if we never make it? The KV store is unavailable forever?

@jecluis (Member, Author):

yeah. This might warrant a timeout or something. But yeah, that's it.

@tserong (Member):

It's probably fine for now. It'll also presumably benefit once we have a cache later (at least we'll be able to read cached K/Vs while we're waiting for etcd)

@jecluis (Member, Author) left a comment:

For some reason github did not allow me to reply to @tserong without reviewing my own PR? 🤷


def _load(self) -> None:
self._token = self._load_token()
self._token = await self._load_token()
await self._kvstore.watch("/nodes/token", _watcher)
@jhesketh (Contributor):

Why is a token obtained, and then the key watched for updates? Do we expect the token to change?

@jecluis (Member, Author):

I hope so. In this case it was meant mostly as a PoC that watching keys works, but it's also thinking ahead, given I hope we'll let users refresh their token should they want to.

Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Add new members on join. Relies on python's aetcd3 library
for etcd shenanigans.

Signed-off-by: Joao Eduardo Luis <joao@suse.com>
So we can have a consistent and up to date ceph.conf across all nodes,
let's rely on cephadm to manage it.

We don't drop the ceph.conf initially shared with a joining node because
we want to ensure that node is able to perform operations on the cluster
as soon as join finishes, and we don't want to have to wait for cephadm
to write the ceph.conf.

Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Also adds needed typings for aetcd3.locks module.

Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
We needed to patch aetcd3 and the fix hasn't been merged upstream yet.
So, we created our own package and uploaded it somewhere it can be
reached. This is what we are installing now.

Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Obtain state and watch changes, update as needed.

Signed-off-by: Joao Eduardo Luis <joao@suse.com>
We're going to cache things in the near-future, and obtain state from
the kvstore, watches and whatnot. We need it to be a full fledged
service.

Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
For those services needing to cleanup state, add a shutdown method to be
called when we are shutting down.

Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.com>
@jecluis (Member, Author) commented Apr 12, 2021

@tserong addressed all comments, I think.

Project Aquarium automation moved this from In progress to Reviewer approved Apr 13, 2021
@jhesketh (Contributor) left a comment:

I haven't tested this manually, but I'm in favour of the general approach etc.

@jecluis jecluis merged commit 649a96b into aquarist-labs:main Apr 13, 2021
Project Aquarium automation moved this from Reviewer approved to Done Apr 13, 2021
@jecluis jecluis deleted the wip-etcd branch April 13, 2021 07:00
mgfritch added a commit to mgfritch/aquarium that referenced this pull request Apr 15, 2021
introduced by PR aquarist-labs#403

Signed-off-by: Michael Fritch <mfritch@suse.com>

Successfully merging this pull request may close these issues.

gravel: persistent state
3 participants