instance-failover-post isn't always called #124

Closed
Ganeti-Issues-Migrator opened this issue Jun 21, 2017 · 24 comments

Comments

@Ganeti-Issues-Migrator

Ganeti-Issues-Migrator commented Jun 21, 2017

Originally reported on Google Code with ID 68.

<b>What steps will reproduce the problem?</b>
1. I create an instance with a DRBD disk layout on node1 and node2; node1 is also the
cluster master.
2. I add a simple script (date > /tmp/date, for example) as
/etc/ganeti/hooks/instance-failover-post.d/01-test
The instance is initially started on node1.
First gnt-instance failover (node1 -> node2): everything goes well, the post
hook is called on node1 and node2.
Second gnt-instance failover (node2 -> node1): the hook on node2 isn't called.

Am I doing something wrong?

Cheers,

Maxence

Originally added on 2009-08-26 09:26:15 +0000 UTC.

@Ganeti-Issues-Migrator

I just tested with a third node, so that my cluster master is node1 and the instance
is on node2/node3. In both failover directions, the hook isn't called on the node the
instance leaves (the 'old' primary node).


Originally added on 2009-08-26 11:33:48 +0000 UTC.

@Ganeti-Issues-Migrator

I investigated a bit more: in rpc.py, in call_hooks_runner, the node_list doesn't
contain any nodes other than the master.

Originally added on 2009-08-26 14:00:18 +0000 UTC.

@Ganeti-Issues-Migrator

ok, found the bug :)
In cmdlib.py, class LUFailoverInstance, in BuildHooksEnv, the line defining the node
list is:

nl = [self.cfg.GetMasterNode()] + list(self.instance.secondary_nodes)

It should be:

    nl = ([self.cfg.GetMasterNode(), self.instance.primary_node] +
          list(self.instance.secondary_nodes))

I didn't test it, but looking at the code, I suspect the same issue will appear with
live migration.
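The proposed change can be sketched as a stand-alone snippet. FakeConfig, FakeInstance, and build_node_list are stand-ins invented for this illustration; the real code is LUFailoverInstance.BuildHooksEnv in cmdlib.py:

```python
# Stand-alone sketch of the node-list fix suggested above, using stub objects
# in place of Ganeti's real config and instance classes.

class FakeConfig(object):
    def GetMasterNode(self):
        return "node1"          # cluster master

class FakeInstance(object):
    primary_node = "node2"      # old primary, about to be failed over from
    secondary_nodes = ("node3",)

def build_node_list(cfg, instance):
    # Original: nl = [cfg.GetMasterNode()] + list(instance.secondary_nodes)
    # which skips the old primary, so its post hook never runs.
    # Proposed: include the (old) primary node as well.
    return ([cfg.GetMasterNode(), instance.primary_node]
            + list(instance.secondary_nodes))

nodes = build_node_list(FakeConfig(), FakeInstance())
```

With the stub values above, the resulting list covers the master, the old primary, and the secondary, so the hook runner contacts all three.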

Another thing:
Let's say I have an instance inst1 on node1 (primary) and node2 (secondary).
The failover command switches the primary and secondary nodes, but, because
the node list is fetched before the switch, the post hooks are called with
the "old config".

cheers

Originally added on 2009-08-26 14:33:59 +0000 UTC.

@Ganeti-Issues-Migrator

I made a cleaner correction; the patch is attached. It corrects 4 node lists.

Maxence

Originally added on 2009-08-26 20:09:16 +0000 UTC.

Attachments:
node_name.patch

@Ganeti-Issues-Migrator

Hi,
it's not a bug, it's a *documented* feature. :)

The hooks documentation says that the OP_INSTANCE_FAILOVER pre and post phases
run on the master and the secondary node, not on the primary.

With your patch, if you fail over instances away from a dying node, you're going
to ask the dying node to run the pre phase hooks. That request is likely to fail,
aborting the whole failover process.

Another option might be to run only the post phase on the primary in that case..

Luca

Originally added on 2009-09-24 18:13:39 +0000 UTC.

@Ganeti-Issues-Migrator

hmm, indeed, I agree about the pre-exec hook. But I don't see any reason not to add
the primary node to the post hooks?

And, about my other question:
Let's say I have an instance inst1 on node1 (primary) and node2 (secondary).
The failover command switches the primary and secondary nodes, but, because
the node list is fetched before the switch, the post hooks are called with
the "old config".

any tips ? 

cheers,

maxence

Originally added on 2009-09-28 16:42:39 +0000 UTC.

@Ganeti-Issues-Migrator

Hi,
yes, I think you should send a patch for post hooks only to ganeti-devel for
review.

About your second question: it's not clear to me what "old config" means to you,
because IGNORE_CONSISTENCY is the only failover-related environment variable
passed to the hooks, and it can't really be considered "old" at any point.
I'll try to add a few considerations anyway..

I think the basic idea for failover hooks on the nodes which host the instance
was:
- run the "pre" phase on the node which is about to receive the instance, so it
  can perform some checks and possibly fail
- run the "post" phase on the node which has received the instance, to make
  small changes on the node itself for the new machine
So in the code the secondary is looked up for both phases.

In general, if your hook needs a lot of cluster information which might change
during the execution of the operation, you can look it up in the ssconf files:
/var/lib/ganeti/ssconf_* (possibly expanding the files or their content).
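The ssconf lookup described above can be sketched as a small hook helper. This is a hedged sketch: read_ssconf is a helper invented here, and ssconf_master_node is assumed to be one of the files under /var/lib/ganeti on a Ganeti node; the demo below runs against a throwaway directory instead of the real path.

```python
# Sketch: read live cluster state from ssconf files inside a hook, instead of
# trusting the environment snapshot taken before the operation started.
import os
import tempfile

def read_ssconf(key, ssconf_dir="/var/lib/ganeti"):
    """Return the stripped contents of ssconf_<key> under ssconf_dir."""
    with open(os.path.join(ssconf_dir, "ssconf_" + key)) as f:
        return f.read().strip()

# Demonstrate against a throwaway directory (a real hook would use the default).
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "ssconf_master_node"), "w") as f:
    f.write("node1.tld\n")
master = read_ssconf("master_node", ssconf_dir=demo_dir)
```

Because the files are read at hook run time, the values reflect whatever the cluster state is at that moment, not the snapshot taken when the opcode was built.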

Cheers,

Luca

Originally added on 2009-09-29 14:54:09 +0000 UTC.

@Ganeti-Issues-Migrator

Hi,

I'll send the patch asap.

By "old config", I mean:
Let's say we have a VM "vm1.tld" and 2 nodes, "host1.tld" and "host2.tld".
host1 is the primary node of vm1, host2 the secondary.

So in the normal state I have: prim = host1, sec = host2. In that case,
gnt-instance info vm1.tld will give me something like:
[...]
 - primary : host1.tld
 - secondaries : host2.tld
[...]

Now I run: gnt-instance failover vm1.tld, after which I re-run gnt-instance info
vm1.tld. I get:
[...]
 - primary : host2.tld
 - secondaries : host1.tld
[...]

Which seems normal: host2 is the new primary of the instance.

What is strange is that, in the post hooks, the primary node is always host1 :)
Since the action has been performed, shouldn't it be updated with the new configuration?

Cheers,

Maxence

Originally added on 2009-09-29 15:04:32 +0000 UTC.

@Ganeti-Issues-Migrator

Hi,
as far as I know there's no environment variable made available to hooks which says 
something about primary and secondary.

Cheers,

Luca

Originally added on 2009-09-29 15:14:48 +0000 UTC.

@Ganeti-Issues-Migrator

In my scripts, there are at least 2 environment variables:
$GANETI_INSTANCE_SECONDARIES and $GANETI_INSTANCE_PRIMARY.

The documentation says : "All instance operations take at least the following
variables: [...]".
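A hook consuming these variables might look like the sketch below. describe_failover is invented for this example, and the host names are the ones used elsewhere in this thread; note the caveat discussed here, that in the post phase these values may still reflect the pre-failover configuration.

```python
# Hedged sketch of an instance-failover-post hook reading the variables
# named above from its environment.
import os

def describe_failover(environ=None):
    env = os.environ if environ is None else environ
    primary = env.get("GANETI_INSTANCE_PRIMARY", "")
    # GANETI_INSTANCE_SECONDARIES is a whitespace-separated list of node names.
    secondaries = env.get("GANETI_INSTANCE_SECONDARIES", "").split()
    return primary, secondaries

# Simulated environment, matching the example hosts used in this thread.
primary, secondaries = describe_failover(
    {"GANETI_INSTANCE_PRIMARY": "host1.tld",
     "GANETI_INSTANCE_SECONDARIES": "host2.tld"})
```

A real hook would call describe_failover() with no argument and act on the node names, keeping in mind that after a failover the roles shown here may be stale.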

Maxence

Originally added on 2009-09-29 15:22:47 +0000 UTC.

@Ganeti-Issues-Migrator

Hi,
sorry, I skipped that part.. yes, in the post phase that's no longer consistent.
I think it should be reported as a separate bug.

Cheers,
Luca

Originally added on 2009-09-29 15:33:18 +0000 UTC.

@Ganeti-Issues-Migrator

ok, so to sum up: I'll send you a patch for the post-hook calls and open a new bug
for the data inconsistency :)

Cheers,

Maxence

Originally added on 2009-09-29 15:42:13 +0000 UTC.

@Ganeti-Issues-Migrator

Oh you can also attach a patch to solve the data inconsistency problem :P

Luca

Originally added on 2009-09-29 15:50:26 +0000 UTC.

@Ganeti-Issues-Migrator

Hi,
I've added a few lines to the hooks documentation in branch-2.1 describing the
non-intuitive behaviour of those environment variables in the failover/migration post phase.

Luca

Originally added on 2009-09-30 08:40:35 +0000 UTC.

@Ganeti-Issues-Migrator

I'm trying to review my patch for hooks, and I'm not sure about a few things:

====
Remove: the initial version only notifies the master:
nl = [self.cfg.GetMasterNode()]

whereas the function documentation says:
"This runs on master, primary and secondary nodes of the instance."

So I updated it to notify the master, but also all of the instance's nodes, in both
pre and post hooks:
nl = [self.cfg.GetMasterNode()] + list(self.instance.all_nodes)

====
Failover:
In the pre phase, I only notify the master and secondaries. In the post phase, I
notify the master plus all of the instance's nodes.

====
Migration:
Same as failover.

====
GrowDisk:
The initial code didn't notify the secondary nodes. I guess growing an LV requires
an action on both the primary and secondary node, so if the operation is going to
work, we can be sure that all nodes are available.

So I patched it to notify the master + primary + secondaries in both pre and post phases.

====

Please let me know if this seems good, so that I can send you the patch.

Cheers,

Maxence

Originally added on 2009-10-01 07:15:07 +0000 UTC.

@Ganeti-Issues-Migrator

Hi!

If the function documentation says one thing and the code does another, it's
more likely that the documentation is lagging behind the code, not vice versa; so
if you're running hooks on more nodes, be sure you can explain why it's an
improvement (e.g. provide a real case in which you want to run a hook on the
primary and secondary for LUGrowDisk).

That said the idea seems fine to me.

Thank you for working on this!


Luca


Originally added on 2009-10-01 13:31:14 +0000 UTC.

@Ganeti-Issues-Migrator

For LUGrowDisk, I don't have an example to say "it's useful to call the hooks on the
primary and secondaries", but I don't see any argument that says "it's bad to call them" :)

I'm attaching the patch; feel free to remove the inappropriate sections :)

Originally added on 2009-10-01 13:51:24 +0000 UTC.

Attachments:
PATCH-correct-hooks.patch

@Ganeti-Issues-Migrator

Any news about this patch? (I'm waiting for it to be applied before removing the
patch from the Ganeti+OSPF howto.)

Originally added on 2009-10-13 12:35:09 +0000 UTC.

@Ganeti-Issues-Migrator

-- Empty comment --

Originally added on 2009-10-23 12:47:50 +0000 UTC.

Added to Milestone: Release2.1

@Ganeti-Issues-Migrator

Any news?
I just took a look at git, and it seems there are no changes related to this bug (I
looked in particular at the LUFailoverInstance env inconsistency).

Regards,

Maxence

Originally added on 2010-01-17 11:55:00 +0000 UTC.

@Ganeti-Issues-Migrator

Hi,

I still can't find the modifications in git, and we are at the rc5 stage; is this
still planned for 2.1?

Regards

Originally added on 2010-02-10 16:07:49 +0000 UTC.

@Ganeti-Issues-Migrator

It won't make 2.1.0 but will be in 2.1.1, sorry!

I'm working on a combined patch that will fix both this issue and issue 71.

iustin

Originally added on 2010-02-11 13:09:42 +0000 UTC.

Mentioned Issue with GoogleCode ID 71 has been migrated to GitHub with ID #127.
Added Labels: Type-Enhancement
Changed State: Started

@Ganeti-Issues-Migrator

Committed to devel-2.1, will be in 2.1.1.

iustin

Originally added on 2010-02-11 16:20:41 +0000 UTC.

Changed State: Fixed

@Ganeti-Issues-Migrator

-- Empty comment --

Originally added on 2012-10-22 08:57:07 +0000 UTC.

Changed State: Released
