gluon-scheduled-domain-switch: add package #1555

Merged
merged 1 commit into from
Feb 12, 2019

Conversation

blocktrron
Member

This package allows a node to switch to another domain automatically, either at a given point in time or after the node has been offline long enough.
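The two triggers described above can be sketched roughly as follows. This is an illustration only, not the package's actual code (which is not shown in this conversation): `SwitchConfig`, `target_domain`, `switch_time`, and `should_switch` are hypothetical names, and only `switch_after_offline_mins` is taken from later in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SwitchConfig:
    # Hypothetical field names, except switch_after_offline_mins,
    # which is mentioned later in this conversation.
    target_domain: str
    switch_time: int                # Unix timestamp of the scheduled switch
    switch_after_offline_mins: int  # offline time that triggers the fallback


def should_switch(cfg: SwitchConfig, now: Optional[int], offline_mins: int) -> bool:
    """Return True if the node should move to cfg.target_domain.

    `now` is None while the node has no trustworthy wall clock.
    """
    # Trigger 1: the scheduled point in time has been reached.
    if now is not None and now >= cfg.switch_time:
        return True
    # Trigger 2: the node has been offline for long enough.
    return offline_mins >= cfg.switch_after_offline_mins
```

Under these assumptions the switch is one-way: once either condition holds, the node moves to the target domain and nothing here moves it back.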

Its primary goal is to allow communities that still use IBSS to migrate to 802.11s without having to run both protocols at the same time, which might lead to overloaded routers.

Depending on how #1377 is scoped, this might close it.

@rotanid rotanid added 0. type: enhancement The changeset is an enhancement 3. topic: package Topic: Gluon Packages labels Oct 22, 2018
@blocktrron
Member Author

blocktrron commented Oct 23, 2018

I've just noticed there are some things to do in terms of code style (tabs vs. spaces); I will fix that soon.

fixed

@blocktrron
Member Author

Addressed all issues pointed out by @mweinelt

@rotanid rotanid added this to the 2018.2 milestone Nov 4, 2018
@genofire

genofire commented Nov 5, 2018

Maybe a map of domains would be good,
e.g. old_domain1 to new_domain1 and old_domain2 to new_domain2.

@blocktrron
Member Author

@genofire This is indirectly achieved by the ability to put the settings into the respective domain files. This way, it is also possible to, e.g., configure switch_after_offline_mins on a per-domain basis.
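To illustrate how per-domain settings imply such a map, here is a small sketch. It models the domain files as a plain Python dictionary, which is an assumption made for illustration only; the key `target_domain`, the values, and the helper `domain_map` are hypothetical, and only `switch_after_offline_mins` is named in this thread.

```python
# Hypothetical per-domain settings, one entry per domain file.
# Only switch_after_offline_mins is a name taken from this conversation.
DOMAIN_SETTINGS = {
    "old_domain1": {"target_domain": "new_domain1", "switch_after_offline_mins": 120},
    "old_domain2": {"target_domain": "new_domain2", "switch_after_offline_mins": 240},
    "new_domain1": {},  # target domains carry no switch settings
    "new_domain2": {},
}


def domain_map(settings):
    """Derive the old -> new mapping implied by the per-domain settings."""
    return {
        name: cfg["target_domain"]
        for name, cfg in settings.items()
        if "target_domain" in cfg
    }
```

Here `domain_map(DOMAIN_SETTINGS)` yields the mapping old_domain1 to new_domain1 and old_domain2 to new_domain2, while each old domain keeps its own offline threshold.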

@T-X
Contributor

T-X commented Nov 15, 2018

If I remember correctly, @neoraider was thinking about some rollback mechanism, too: if gateways are unreachable after switching to the new domain settings, revert to the previous domain. This could probably be implemented and added later, but @blocktrron, have you given any thought to something like this?

Would it be like switching back and forth between old and new domain settings after some timeout? (Could that run into some undesired side-effects?)

@blocktrron
Member Author

A little bit of background here: This is pretty much a stripped-down, tidied-up version of the package we used to migrate the network of Freifunk Darmstadt.

We also thought about this but dropped the idea because switching back and forth between domains might cause issues in larger meshes. We never wanted nodes to go back. If you run into the fallback switch, you will see the node again at the point when all other nodes have switched, even if this means you will have no mesh connection for ~5 days (IMHO, a reasonable timespan between rolling out the scheduled domain switch and the actual switch date is one week or maybe 10 days).

So back-and-forth switching is not necessary; all nodes will be back online after a week.

As far as undesired side-effects go: I've never thought about potential problems, except for the time it takes to stabilize a larger mesh with different fallback intervals and domain states. This will sort itself out over time, but I think the one-time switch is superior.

@T-X
Contributor

T-X commented Nov 17, 2018

> So a back-and-forth switching is not necessary, all nodes will be back online after a week.

Unless you've made some mistake in the new domain settings. Or have other bugs that only show up with the new domain settings.

While domain switching might seem safer than the autoupdater because it does not need to flash the whole system, the lack of a revert option might make it less safe: With the autoupdater you can validate your images/changes via beta branches and an updating timespan. With the domain switch it's an "all-in / all-jump-now" approach.

Assuming that humans will make mistakes and will create broken domain settings (or will run into new protocol or driver bugs that only surface in the combination of the new domain settings and this particular network), do we have enough safety measures in the scheduled-domain-switch approach to account for that?

@mweinelt
Contributor

> Unless you've made some mistake in the new domain settings. Or have other bugs that only show up with the new domain settings.

Certainly, but I'd argue that, as usual, any firmware doing such a major migration should be properly tested ahead of time. Additionally, signing a firmware manifest should imply that it has received testing, more so with a domain migration; anything else feels unreasonable.

Apart from misconfiguration, I don't believe the general concept of this PR is particularly error-prone. It makes sure that nodes definitely migrate at some point; its reliability basically comes down to how reliably you can roll out new firmware before the switch time is up. I don't believe wiggling back and forth between old and new configuration is strictly necessary, although it would be a nice addition.

The primary drawback for me is that you have to configure the switch time at build time and cannot further control or monitor the switch at run time, as you could with a remote-controlled approach.

  • Set it too long, you're fine, it just takes forever.
  • Set it too short, you risk splitting your network.

Fortunately with a non-parallel approach we don't run into load issues along the way. I'd say most communities can reliably roll out a firmware to most devices in about a week, with most nodes receiving the update very quickly, and only a few select routers lagging behind, because most famously

  • their uplink is bad,
  • they were offline between firmware rollout and migration

So I'd say a recommendation of at least 10-14 days between firmware rollout and the actual switch would be suitable. Obviously more days are needed if you have configured a slower rollout.
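As a worked example of the lead time recommended above, the following sketch derives a switch time from a rollout date. It assumes the switch time is expressed as a Unix timestamp, which this conversation does not confirm; `switch_timestamp` and `lead_days` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


def switch_timestamp(rollout: datetime, lead_days: int = 14) -> int:
    """Return the Unix timestamp `lead_days` after the firmware rollout."""
    if rollout.tzinfo is None:
        # A naive datetime would be interpreted in the build host's local zone.
        raise ValueError("rollout must be timezone-aware")
    return int((rollout + timedelta(days=lead_days)).timestamp())


# Example: firmware rolled out on 2019-01-01 00:00 UTC, switch 14 days later.
rollout = datetime(2019, 1, 1, tzinfo=timezone.utc)
print(switch_timestamp(rollout))       # 14-day lead
print(switch_timestamp(rollout, 10))   # 10-day lead, the lower bound above
```

A slower, staged rollout would call for a larger `lead_days`, since the value is fixed at build time and cannot be adjusted afterwards.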

Member

@neocturne neocturne left a comment

This is not a complete review, just a few comments.

@blocktrron
Member Author

PR updated, changes still need testing on a real device though.

@blocktrron blocktrron force-pushed the pr-switch-domain branch 2 times, most recently from 23e2666 to edd5889 Compare December 21, 2018 21:22
@blocktrron
Member Author

Just tested my changes and they behave as expected. Also switched to using system uptime instead of date and time to determine the offline-duration of the node.
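A minimal sketch of the uptime-based approach, assuming a Linux system that exposes `/proc/uptime`. The function names are hypothetical and this is not the package's code. The point is that uptime is monotonic, so the measured offline duration does not jump when the wall clock is set later.

```python
def read_uptime_secs(path: str = "/proc/uptime") -> float:
    """Read seconds since boot; the first field of /proc/uptime on Linux."""
    with open(path) as f:
        return float(f.read().split()[0])


def offline_mins(uptime_now_secs: float, last_online_uptime_secs: float) -> int:
    """Whole minutes elapsed since the node last had connectivity.

    Both arguments are uptime readings, so a later wall-clock
    adjustment cannot inflate or shrink the result.
    """
    return int((uptime_now_secs - last_online_uptime_secs) // 60)
```

With wall-clock timestamps instead, a node that boots with a stale clock and then corrects it could appear to have been offline far longer than it actually was, which might trigger the switch prematurely.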

blocktrron added a commit to blocktrron/ffda-packages that referenced this pull request Dec 22, 2018
This uses the uptime instead of the date to determine whether or not to
switch due to the node being offline for a period of time.

This way, we mitigate race conditions when a node is powered on and sets
it's clock after X minutes via NTP.

Thanks to Linus Lüssing for the suggestion
freifunk-gluon/gluon#1555 (comment)
@blocktrron
Member Author

Updated this PR. Tested on one node without issues.

I hope I caught everything. 😄

wusel42 pushed a commit to ffgtso/ffgt_packages-v2018.1 that referenced this pull request Jan 29, 2019
This uses the uptime instead of the date to determine whether or not to
switch due to the node being offline for a period of time.

This way, we mitigate race conditions when a node is powered on and sets
it's clock after X minutes via NTP.

Thanks to Linus Lüssing for the suggestion
freifunk-gluon/gluon#1555 (comment)
@blocktrron
Member Author

Fixed two minor inconsistencies (see fixup commit).

Member

@neocturne neocturne left a comment

Only minor issues left, I think we can get this merged after this round 👍

@blocktrron
Member Author

Updated PR.

  • Addressed everything @NeoRaider and @skorpy2009 pointed out.
  • Rebased onto current master

This package allows to automatically switch to another domain, either
at a given point in time or after the node was offline long enough.
@rotanid rotanid merged commit c1b9ea2 into freifunk-gluon:master Feb 12, 2019
mweinelt pushed a commit that referenced this pull request Feb 26, 2019
This package allows to automatically switch to another domain, either
at a given point in time or after the node was offline long enough.
@blocktrron blocktrron deleted the pr-switch-domain branch March 5, 2019 15:49
christf pushed a commit to christf/gluon that referenced this pull request May 23, 2019
This package allows to automatically switch to another domain, either
at a given point in time or after the node was offline long enough.
8 participants