unit state lost following upgrade-charm #316

Closed
faebd7 opened this issue Jun 1, 2020 · 6 comments

Comments

faebd7 commented Jun 1, 2020

I just upgraded a charm I'm developing, after which the unit seems to have lost its unit state. Unfortunately, I don't have a consistent way to reproduce this. I've run into it only once or twice previously, but I've done many more successful charm upgrades.

Before:

mattermost/10*  active    idle   10.1.1.17  8065/TCP  

After:

mattermost/10*  waiting   idle   10.1.1.17  8065/TCP  Waiting for database relation

The relation is still there:

[agnew(charm-k8s-mattermost)] juju status --format yaml mattermost
model:
  name: mm
  type: caas
  controller: mattermost
  cloud: k8s
  region: localhost
  version: 2.8-rc2
  model-status:
    current: available
    since: 28 May 2020 14:24:31+12:00
  sla: unsupported
machines: {}
applications:
  mattermost:
    charm: local:kubernetes/mattermost-17
    series: kubernetes
    os: kubernetes
    charm-origin: local
    charm-name: mattermost
    charm-rev: 17
    charm-version: 250d9df-dirty
    scale: 1
    provider-id: 335e8836-3907-48bc-8b65-98ddbe6d5263
    address: 10.152.183.32
    exposed: false
    application-status:
      current: active
      since: 02 Jun 2020 09:43:01+12:00
    relations:
      db:
      - postgresql
    units:
      mattermost/10:
        workload-status:
          current: waiting
          message: Waiting for database relation
          since: 02 Jun 2020 11:18:01+12:00
        juju-status:
          current: idle
          since: 02 Jun 2020 11:18:04+12:00
        leader: true
        open-ports:
        - 8065/TCP
        address: 10.1.1.17
        provider-id: d9681e49-0fb4-4fe4-a236-935caf767650
    endpoint-bindings:
      "": alpha
      db: alpha
application-endpoints:
  postgresql:
    url: mattermost:admin/daturbase.postgresql
    endpoints:
      db:
        interface: pgsql
        role: provider
    application-status:
      current: active
      message: Live master (10.12)
      since: 28 May 2020 14:25:48+12:00
    relations:
      db:
      - mattermost
storage: {}
controller:
  timestamp: 11:31:43+12:00
[agnew(charm-k8s-mattermost)] _

but the unit state DB seems to have been reset:

root@mattermost-operator-0:/var/lib/juju# path_to_unit_state_db=/var/lib/juju/agents/unit-mattermost-10/charm/.unit-state.db
root@mattermost-operator-0:/var/lib/juju# python3 -c 'import pickle, pprint, sqlite3, sys ; show_none = False ; print("\n".join(["=== {} ===\n{}\n".format(t[0], pprint.pformat(t[1])) for t in [(row[0], pickle.loads(row[1])) for row in sqlite3.connect(sys.argv[1]).execute("SELECT handle, data from snapshot")] if show_none or t[1] is not None]))' ${path_to_unit_state_db?}
=== MattermostK8sCharm/StoredStateData[state] ===
{'db_conn_str': None, 'db_ro_uris': [], 'db_uri': None}

=== MattermostK8sCharm/PostgreSQLClient[db]/StoredStateData[_state] ===
{'rels': {}}

=== StoredStateData[_stored] ===
{'event_count': 6}

root@mattermost-operator-0:/var/lib/juju# _
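For readability, the same inspection can be written as a short standalone script. It makes the same assumptions as the one-liner above (the store is an SQLite file whose snapshot table holds a handle plus a pickled value per row); nothing here is beyond what the one-liner already does.

#!/usr/bin/env python3
"""Dump the operator framework's .unit-state.db snapshots.

Equivalent to the one-liner above: each row of the snapshot table is a
handle and a pickled value; rows whose value is None are skipped.
"""
import pickle
import pprint
import sqlite3
import sys

SHOW_NONE = False  # set to True to also print handles whose snapshot is None


def dump(db_path):
    conn = sqlite3.connect(db_path)
    for handle, data in conn.execute("SELECT handle, data FROM snapshot"):
        value = pickle.loads(data)
        if value is None and not SHOW_NONE:
            continue
        print("=== {} ===".format(handle))
        pprint.pprint(value)
        print()


if __name__ == "__main__":
    # e.g. /var/lib/juju/agents/unit-mattermost-10/charm/.unit-state.db
    dump(sys.argv[1])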

The charm source is here: https://git.launchpad.net/~pjdc/charm-k8s-mattermost/+git/charm-k8s-mattermost/tree/src/charm.py?h=control-source#n61

And here's the debug-log from the upgrade: https://paste.ubuntu.com/p/bp4YdzR5XW/

faebd7 (Author) commented Jun 2, 2020

The reproduction above was with Juju 2.8-rc2, and I've also just reproduced it with Juju 2.8-rc3 (which will become 2.8.0).

I did a brief test with IAAS and CAAS charms on 2.8-rc3, using a flag file in /var/lib/juju/agents/unit-foo-N/charm, and based on that it seems that this isn't a case of the charm directory itself being reset somehow.

faebd7 (Author) commented Jun 2, 2020

Note that my charm is targeting k8s clusters that do not have persistent storage, so I've set min-juju-version: 2.8.0 in metadata.yaml, which means that the Juju pods have no volumes attached. However, this is happening on a single-node microk8s, so I don't think this is being caused by pods being rescheduled, since there's nowhere to reschedule them to.
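For reference, a metadata.yaml fragment along those lines might look like the following; everything apart from the min-juju-version line is illustrative rather than copied from the actual charm:

# Illustrative fragment only. Per this thread, declaring min-juju-version 2.8.0
# means Juju deploys the operator pod without persistent storage attached.
name: mattermost
min-juju-version: 2.8.0
requires:
  db:
    interface: pgsql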

faebd7 (Author) commented Jun 2, 2020

From looking over scrollback (unfortunately I didn't preserve everything) I can't 100% rule out the application's operator pod being restarted and therefore losing its state due to lack of persistent storage.

I understand that controller-backed charm state support is being worked on for Juju 2.8, but since I couldn't find an issue specifically dedicated to it (there's only a mention in #240 (comment)), I filed #317.

jameinel (Member) commented Jun 2, 2020

I'm pretty sure charm pods are restarted on upgrade. If they aren't on 'upgrade-charm', they definitely are on 'upgrade-juju', because the base image that holds the Juju agent is being updated. (I believe they are also restarted on 'upgrade-charm', but I'm not positive of that.)
Without persistent storage today (i.e. with min-juju-version set), when the charm comes up again it will very much lose all of its state. (We currently store the unit state in an SQLite database on the disk of the unit agent.)
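To illustrate the mechanism, here is a minimal sketch of the usual operator-framework StoredState pattern, using the attribute names visible in the snapshot dump earlier in this issue; it is an approximation, not the linked charm source. When the on-disk snapshot is gone, set_default reinstates the initial values, which is exactly what the dump shows.

# Minimal sketch of the operator-framework StoredState pattern, not the
# actual charm source; attribute names mirror the snapshot dump above.
from ops.charm import CharmBase
from ops.framework import StoredState
from ops.main import main


class MattermostK8sCharm(CharmBase):
    # Snapshotted to .unit-state.db as MattermostK8sCharm/StoredStateData[state].
    state = StoredState()

    def __init__(self, *args):
        super().__init__(*args)
        # If the on-disk snapshot is lost (e.g. the operator pod restarts
        # without persistent storage), the charm only sees these defaults.
        self.state.set_default(db_conn_str=None, db_ro_uris=[], db_uri=None)


if __name__ == "__main__":
    main(MattermostK8sCharm)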

I've started working on state-get today (before reading this, which is why it caught my eye).
I'm happy to use issue #317 to track this, though I'm pretty sure this is just a direct result of that issue.

Note that you don't have to set 'min-juju-version'; without it, Juju will continue to provision charm operator pods as stateful sets. Though it's my understanding that you're looking specifically to run these charms in locations where storage is not available.

faebd7 (Author) commented Jun 2, 2020

> I've started working on state-get today (before reading this, which is why it caught my eye).
> I'm happy to use issue #317 to track this, though I'm pretty sure this is just a direct result of that issue.

Excellent news, thanks!

> Note that you don't have to set 'min-juju-version'; without it, Juju will continue to provision charm operator pods as stateful sets. Though it's my understanding that you're looking specifically to run these charms in locations where storage is not available.

That's correct -- none of our k8s clusters have persistent storage.

chipaca (Contributor) commented Jul 1, 2020

Dupe of #317.
