unit state lost following upgrade-charm #316

Closed
faebd7 opened this issue Jun 1, 2020 · 6 comments

Comments

faebd7 commented Jun 1, 2020

I just upgraded a charm I'm developing, after which the unit seems to have lost its unit state. Unfortunately, I don't have a consistent way to reproduce this. I've run into it only once or twice previously, but I've done many more successful charm upgrades.

Before:

mattermost/10*  active    idle   10.1.1.17  8065/TCP  

After:

mattermost/10*  waiting   idle   10.1.1.17  8065/TCP  Waiting for database relation

The relation is still there:

[agnew(charm-k8s-mattermost)] juju status --format yaml mattermost
model:
  name: mm
  type: caas
  controller: mattermost
  cloud: k8s
  region: localhost
  version: 2.8-rc2
  model-status:
    current: available
    since: 28 May 2020 14:24:31+12:00
  sla: unsupported
machines: {}
applications:
  mattermost:
    charm: local:kubernetes/mattermost-17
    series: kubernetes
    os: kubernetes
    charm-origin: local
    charm-name: mattermost
    charm-rev: 17
    charm-version: 250d9df-dirty
    scale: 1
    provider-id: 335e8836-3907-48bc-8b65-98ddbe6d5263
    address: 10.152.183.32
    exposed: false
    application-status:
      current: active
      since: 02 Jun 2020 09:43:01+12:00
    relations:
      db:
      - postgresql
    units:
      mattermost/10:
        workload-status:
          current: waiting
          message: Waiting for database relation
          since: 02 Jun 2020 11:18:01+12:00
        juju-status:
          current: idle
          since: 02 Jun 2020 11:18:04+12:00
        leader: true
        open-ports:
        - 8065/TCP
        address: 10.1.1.17
        provider-id: d9681e49-0fb4-4fe4-a236-935caf767650
    endpoint-bindings:
      "": alpha
      db: alpha
application-endpoints:
  postgresql:
    url: mattermost:admin/daturbase.postgresql
    endpoints:
      db:
        interface: pgsql
        role: provider
    application-status:
      current: active
      message: Live master (10.12)
      since: 28 May 2020 14:25:48+12:00
    relations:
      db:
      - mattermost
storage: {}
controller:
  timestamp: 11:31:43+12:00
[agnew(charm-k8s-mattermost)] _

but the unit state DB seems to have been reset:

root@mattermost-operator-0:/var/lib/juju# path_to_unit_state_db=/var/lib/juju/agents/unit-mattermost-10/charm/.unit-state.db
root@mattermost-operator-0:/var/lib/juju# python3 -c 'import pickle, pprint, sqlite3, sys ; show_none = False ; print("\n".join(["=== {} ===\n{}\n".format(t[0], pprint.pformat(t[1])) for t in [(row[0], pickle.loads(row[1])) for row in sqlite3.connect(sys.argv[1]).execute("SELECT handle, data from snapshot")] if show_none or t[1] is not None]))' ${path_to_unit_state_db?}
=== MattermostK8sCharm/StoredStateData[state] ===
{'db_conn_str': None, 'db_ro_uris': [], 'db_uri': None}

=== MattermostK8sCharm/PostgreSQLClient[db]/StoredStateData[_state] ===
{'rels': {}}

=== StoredStateData[_stored] ===
{'event_count': 6}

root@mattermost-operator-0:/var/lib/juju# _
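For readability, the same inspection can be written as a short standalone script. It makes the same assumptions as the one-liner above (the store is an SQLite file whose snapshot table holds a handle plus a pickled value per row); nothing here is beyond what the one-liner already does.

#!/usr/bin/env python3
"""Dump the operator framework's .unit-state.db snapshots.

Equivalent to the one-liner above: each row of the snapshot table is a
handle and a pickled value; rows whose value is None are skipped.
"""
import pickle
import pprint
import sqlite3
import sys

SHOW_NONE = False  # set to True to also print handles whose snapshot is None


def dump(db_path):
    conn = sqlite3.connect(db_path)
    for handle, data in conn.execute("SELECT handle, data FROM snapshot"):
        value = pickle.loads(data)
        if value is None and not SHOW_NONE:
            continue
        print("=== {} ===".format(handle))
        pprint.pprint(value)
        print()


if __name__ == "__main__":
    # e.g. /var/lib/juju/agents/unit-mattermost-10/charm/.unit-state.db
    dump(sys.argv[1])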

The charm source is here: https://git.launchpad.net/~pjdc/charm-k8s-mattermost/+git/charm-k8s-mattermost/tree/src/charm.py?h=control-source#n61

And here's the debug-log from the upgrade: https://paste.ubuntu.com/p/bp4YdzR5XW/

faebd7 (Author) commented Jun 2, 2020

The reproduction above was with Juju 2.8-rc2, and I've also just reproduced it with Juju 2.8-rc3 (which will become 2.8.0).

I did a brief test with IAAS and CAAS charms on 2.8-rc3, using a flag file in /var/lib/juju/agents/unit-foo-N/charm, and based on that it seems that this isn't a case of the charm directory itself being reset somehow.

faebd7 (Author) commented Jun 2, 2020

Note that my charm is targeting k8s clusters that do not have persistent storage, so I've set min-juju-version: 2.8.0 in metadata.yaml, which means that the Juju pods have no volumes attached. However, this is happening on a single-node microk8s, so I don't think this is being caused by pods being rescheduled, since there's nowhere to reschedule them to.
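For reference, a metadata.yaml fragment along those lines might look like the following; everything apart from the min-juju-version line is illustrative rather than copied from the actual charm:

# Illustrative fragment only. Per this thread, declaring min-juju-version 2.8.0
# means Juju deploys the operator pod without persistent storage attached.
name: mattermost
min-juju-version: 2.8.0
requires:
  db:
    interface: pgsql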

faebd7 (Author) commented Jun 2, 2020

From looking over scrollback (unfortunately I didn't preserve everything) I can't 100% rule out the application's operator pod being restarted and therefore losing its state due to lack of persistent storage.

I understand that controller-backed charm state support is being worked on for Juju 2.8, but since I couldn't find an issue specifically dedicated to it (there's only a mention in #240 (comment)), I filed #317.

jameinel (Member) commented Jun 2, 2020

I'm pretty sure charm pods are restarted on upgrade. If they aren't on 'upgrade-charm', they definitely are on 'upgrade-juju', because the base image that holds the Juju agent is being updated. (I believe they are also restarted on 'upgrade-charm', but I'm not positive of that.)
Without persistent storage today (i.e. with min-juju-version set), when the charm comes up again it will very much lose all of its state. (We currently store the unit state in an SQLite database on the disk of the unit agent.)
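To illustrate the mechanism, here is a minimal sketch of the usual operator-framework StoredState pattern, using the attribute names visible in the snapshot dump earlier in this issue; it is an approximation, not the linked charm source. When the on-disk snapshot is gone, set_default reinstates the initial values, which is exactly what the dump shows.

# Minimal sketch of the operator-framework StoredState pattern, not the
# actual charm source; attribute names mirror the snapshot dump above.
from ops.charm import CharmBase
from ops.framework import StoredState
from ops.main import main


class MattermostK8sCharm(CharmBase):
    # Snapshotted to .unit-state.db as MattermostK8sCharm/StoredStateData[state].
    state = StoredState()

    def __init__(self, *args):
        super().__init__(*args)
        # If the on-disk snapshot is lost (e.g. the operator pod restarts
        # without persistent storage), the charm only sees these defaults.
        self.state.set_default(db_conn_str=None, db_ro_uris=[], db_uri=None)


if __name__ == "__main__":
    main(MattermostK8sCharm)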

I've started working on state-get today (before reading this, which is why it caught my eye).
I'm happy to use issue #317 to track this, though I'm pretty sure this is just a direct result of that issue.

Note that you don't have to set 'min-juju-version'; without it, Juju will continue to provision charm operator pods as stateful sets. Though it's my understanding that you're looking specifically to run these charms in locations where storage is not available.

faebd7 (Author) commented Jun 2, 2020

> I've started working on state-get today (before reading this, which is why it caught my eye).
> I'm happy to use issue #317 to track this, though I'm pretty sure this is just a direct result of that issue.

Excellent news, thanks!

> Note that you don't have to set 'min-juju-version'; without it, Juju will continue to provision charm operator pods as stateful sets. Though it's my understanding that you're looking specifically to run these charms in locations where storage is not available.

That's correct -- none of our k8s clusters have persistent storage.

chipaca (Contributor) commented Jul 1, 2020

Dupe of #317.
