Write a basic Dotmesh Operator that replicates our current DaemonSet setup #344

Closed
alaric-dotmesh opened this issue Mar 22, 2018 · 8 comments

@alaric-dotmesh
Contributor

alaric-dotmesh commented Mar 22, 2018

This is part of epic #385 .

With #343 done, we can write a simple Dotmesh Operator that runs Dotmesh on every node in the cluster.

This can become the canonical way of running DM in k8s once documented!

@lukemarsden
Collaborator

lukemarsden commented Mar 29, 2018

Per #343 (comment), we don't have a StatefulSet template per se, but we do have a design for an operator which will set up PVCs, node labels etc appropriately!

@lukemarsden lukemarsden self-assigned this Mar 29, 2018
lukemarsden added a commit that referenced this issue Mar 31, 2018
lukemarsden added a commit that referenced this issue Mar 31, 2018
lukemarsden added a commit that referenced this issue Mar 31, 2018
@lukemarsden
Collaborator

lukemarsden commented Apr 5, 2018

As part of this issue, we need a testing strategy.

We are going to develop simple dind-flexvolume and dind-dynamic-provisioner modules which can be used to simulate cloud provider volumes (insofar as they provide writeable filesystems which dotmesh can initialize zfs pools-in-files on) in the dind tests.

Then we can write dind tests for the dotmesh operator creating pods which consume PVCs from a dind storageclass and provide dotmesh PVs.

Later, we'll be able to simulate killing a dotmesh node and having the PV failover and "reattach" to a different dind.

@lukemarsden
Collaborator

lukemarsden commented Apr 6, 2018

We now have dind flexvolume and dynamic provisioners, which work according to this test.

Next steps, as I see them:

  • write a test that cordoning a node results in the fake "block device" for a pod being reattached by Kubernetes to the pod on the new node

LOCAL

  • write a test that providing the dotmesh operator (which is partially implemented here) works just as well as the current daemonset. make it pass with the following config by implementing enough of the algorithm:
storageMode: local
localMode:
  poolSizePerNode: 10G
  poolLocation: /var/lib/dotmesh

PV PER NODE

  • write a test that providing the dotmesh operator with the following config provisions PVs from the dind provisioner. make it pass by implementing more of the algorithm:
storageMode: pvPerNode
pvPerNodeMode:
  pvSizePerNode: 100G
  storageClass: fast
  • write a test which deletes a dind node and demonstrates that the dotmesh operator fails over the PV to another node. make it pass by implementing more of the algorithm. you'll need to support multiple zpools on a single node (by running two dotmesh storage instances on one node).

POOL OF DOTMESHES

  • implement NFS support... and then write a test for the following config where storage nodes are only a subset of the nodes in the cluster, demonstrating accessing a dot from a different node to where the storage is hosted:
storageMode: storageNodes
storageNodesMode:
  storageNodes: 3
  sizePerNode: 100G
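The three configurations in this plan could be modelled along these lines. This is only a sketch: the YAML key names come from the snippets above, but the Go type names, field names and validation rules are assumptions made for illustration, not the operator's real code.

```go
package main

import (
	"errors"
	"fmt"
)

// Config mirrors the YAML keys quoted above. All Go identifiers here are
// hypothetical; only the key names in the comments come from the issue.
type Config struct {
	StorageMode  string            // storageMode: local | pvPerNode | storageNodes
	Local        *LocalMode        // localMode
	PVPerNode    *PVPerNodeMode    // pvPerNodeMode
	StorageNodes *StorageNodesMode // storageNodesMode
}

type LocalMode struct {
	PoolSizePerNode string // poolSizePerNode, e.g. "10G"
	PoolLocation    string // poolLocation, e.g. "/var/lib/dotmesh"
}

type PVPerNodeMode struct {
	PVSizePerNode string // pvSizePerNode, e.g. "100G"
	StorageClass  string // storageClass, e.g. "fast"
}

type StorageNodesMode struct {
	StorageNodes int    // storageNodes, e.g. 3
	SizePerNode  string // sizePerNode, e.g. "100G"
}

// Validate checks that the block matching StorageMode is present and filled
// in. The rules are assumed for this sketch.
func (c Config) Validate() error {
	switch c.StorageMode {
	case "local":
		if c.Local == nil || c.Local.PoolSizePerNode == "" || c.Local.PoolLocation == "" {
			return errors.New("local mode needs poolSizePerNode and poolLocation")
		}
	case "pvPerNode":
		if c.PVPerNode == nil || c.PVPerNode.PVSizePerNode == "" || c.PVPerNode.StorageClass == "" {
			return errors.New("pvPerNode mode needs pvSizePerNode and storageClass")
		}
	case "storageNodes":
		if c.StorageNodes == nil || c.StorageNodes.StorageNodes < 1 || c.StorageNodes.SizePerNode == "" {
			return errors.New("storageNodes mode needs storageNodes >= 1 and sizePerNode")
		}
	default:
		return fmt.Errorf("unknown storageMode %q", c.StorageMode)
	}
	return nil
}

func main() {
	cfg := Config{
		StorageMode: "local",
		Local:       &LocalMode{PoolSizePerNode: "10G", PoolLocation: "/var/lib/dotmesh"},
	}
	if err := cfg.Validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println("config ok, mode:", cfg.StorageMode)
}
```

Keeping one optional block per mode means a test for each stage of the plan only has to fill in the block it exercises.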

@lukemarsden
Collaborator

Note that some of the above plan spans different github issues!

@lukemarsden
Collaborator

lukemarsden commented Apr 6, 2018

In particular "pool of dotmeshes" depends on #341, #345 and #346 and the unissued final bullet in #100

@lukemarsden
Collaborator

see https://kubernetes.io/blog/2018/01/introducing-client-go-version-6 (section "Updating dependencies – golang/dep")

@prisamuel prisamuel self-assigned this Apr 9, 2018
prisamuel pushed a commit that referenced this issue Apr 10, 2018
alaric-dotmesh added a commit that referenced this issue Apr 11, 2018
alaric-dotmesh added a commit that referenced this issue Apr 11, 2018
We now have a thing we can compile and run in a test cluster locally,
which prints out messages when nodes come/go/change, and the start of a code
structure that will run The Algorithm whenever something interesting
happens.
prisamuel pushed a commit that referenced this issue Apr 12, 2018
alaric-dotmesh added a commit that referenced this issue Apr 13, 2018
…ve no pod bound to them!

Downside: the pods crash and burn on startup. But I think that should be
just a matter of tweaking the pod template.
alaric-dotmesh added a commit that referenced this issue Apr 13, 2018
alaric-dotmesh added a commit that referenced this issue Apr 16, 2018
…rt new dotmesh pods while old ones are dying.
alaric-dotmesh added a commit that referenced this issue Apr 17, 2018
@alaric-dotmesh alaric-dotmesh mentioned this issue Apr 17, 2018
8 tasks
@alaric-dotmesh
Contributor Author

I'm moving the "extra work" out into a new epic, of which this issue is just the first part. This is now part of epic #385, which is a sub-epic of #100!

@alaric-dotmesh alaric-dotmesh changed the title Write a basic Dotmesh Operator to instantiate our StatefulSet template Write a basic Dotmesh Operator that replicates our current DaemonSet setup Apr 17, 2018
alaric-dotmesh added a commit that referenced this issue Apr 23, 2018
…DM namespace already), make node labelling two-stage.
alaric-dotmesh added a commit that referenced this issue Apr 23, 2018
…od deployment, so GC works correctly (and kubectl drain?)
alaric-dotmesh added a commit that referenced this issue May 2, 2018
…works.

`kubectl drain` fails if the node has pods controlled by operators on it.
This makes it intermittent already because sometimes the etcd pod is on that
node, and downright failsome with the dotmesh operator in play.
alaric-dotmesh added a commit that referenced this issue May 3, 2018
…p), and they're not referenced from the docs any more.
@alaric-dotmesh
Contributor Author

This is in production, so I'm calling it done.

binocarlos added a commit that referenced this issue May 9, 2018
* master:
  NFC: More logging
  dotscience#3 make subdot roots writeable by all, for containers which run as non-root
  FIX: Missed space :-( Testing stuff in CI is tedious.
  FIX: Missed the `-c` option to the `dm dot delete...`
  FIX: Typo...
  #17: Pull the right image, use a dedicated config, and test `dm dot delete` on the remote
  NFC: Test adding sleep to ensure replication.
  #17: Avoid echoing the API key, and run the smoke tests on Linux (it's easier for me to debug them there)
  #17: Made the smoke test push to a remote cluster (if credentials are passed into SMOKE_TEST_REMOTE and SMOKE_TEST_APIKEY).
  NFC: Fix logging on error messages
  #352: Attempt to reduce flakiness by checking replication status on both nodes in a cluster
  NFC: Comments concerning pod health checking
  NFC: Re-enable flaky test for debugging
  #344: We no longer need the GKE yamls (that's handled in the ConfigMap), and they're not referenced from the docs any more.
  NFC: fix typo sneaked into yaml
  NFC: Comment out test until we can work out how to fix it