tools: add a tool to restore crush map after a faulty one is injected #5052

Merged
merged 10 commits into from Jul 17, 2015

Conversation

tchaikov
Contributor

  • add a "--dump <format>" command to osdmaptool, so we are able to extract the osdmap epoch using xmlstarlet
  • add a command to ceph-monstore-tool to rewrite monstore
  • add a script to help fix the crush map

see http://tracker.ceph.com/issues/11815

@ghost

ghost commented Jun 23, 2015

@tchaikov
Contributor Author

thanks @dachary, i just fixed the script's path, and also added it to the dist tarball.

@tchaikov tchaikov changed the title tools: add a tool to restore crush map after a faulty one is injected [DNM] tools: add a tool to restore crush map after a faulty one is injected Jun 23, 2015
@tchaikov tchaikov force-pushed the wip-11815-restore-crushmap branch 2 times, most recently from 79d1874 to 3c6fc77 Compare June 23, 2015 13:04
@tchaikov
Contributor Author

@jecluis could you take a quick look at this change?

  1. packed all the changes into a proposal, and put it into a new paxos commit. but i didn't update the infected osdmap epochs as you suggested: the paxos commit will be applied on all monitors, so the bad epochs on the leader will be fixed once the proposal is accepted.
  2. added a new osdmap commit, so the OSDMonitor will update its current osdmap on seeing it.

thanks =)

@tchaikov tchaikov force-pushed the wip-11815-restore-crushmap branch 3 times, most recently from 9fc5f93 to eda9844 Compare June 24, 2015 04:05
@tchaikov tchaikov force-pushed the wip-11815-restore-crushmap branch 2 times, most recently from 0e78e54 to b7cddf8 Compare June 24, 2015 14:22
@tchaikov tchaikov changed the title [DNM] tools: add a tool to restore crush map after a faulty one is injected tools: add a tool to restore crush map after a faulty one is injected Jun 24, 2015
@tchaikov
Contributor Author

the failure was due to:

2015-06-24 16:42:15.427958 2b22a8bf2bc0 -1 mon.a@-1(probing).osd e3 update_from_paxos full map CRC mismatch, resetting to canonical

@tchaikov tchaikov changed the title tools: add a tool to restore crush map after a faulty one is injected [DNM] tools: add a tool to restore crush map after a faulty one is injected Jun 24, 2015
@tchaikov tchaikov force-pushed the wip-11815-restore-crushmap branch 3 times, most recently from a961462 to 2e362d3 Compare June 25, 2015 08:01
@tchaikov tchaikov changed the title [DNM] tools: add a tool to restore crush map after a faulty one is injected tools: add a tool to restore crush map after a faulty one is injected Jun 25, 2015
@tchaikov
Contributor Author

tchaikov commented Jul 7, 2015

per the discussion tonight, we should also offer a command with which the user can extract both the full map and the incrementals, so they can be injected into the OSD's object store. we can do it as part of #5127.

@tchaikov
Contributor Author

@jecluis i managed to work around the "local" command.

$ help local
...
Exit Status:
Returns success unless an invalid option is supplied, a variable
assignment error occurs, or the shell is not executing a function.

so "local" practically always succeeds, whatever the following $() subshell returns, and the subshell's exit status is lost. what we need is to put the assignment on a separate line.
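a minimal sketch of the behaviour described above, not the code from the script itself. `fails`, `masked` and `preserved` are hypothetical names used only for illustration; `fails` stands in for any command that exits non-zero.

```shell
#!/usr/bin/env bash
# Hypothetical helper that always fails, standing in for a real command.
fails() { echo "partial output"; return 3; }

masked() {
    # "local" itself returns success, so the exit status 3 of the
    # command substitution is replaced by the status of "local".
    local out=$(fails)
    echo "status seen: $?"
}

preserved() {
    # Declare first, then assign on a separate line: the status of a
    # plain assignment is the status of its command substitution.
    local out
    out=$(fails)
    echo "status seen: $?"
}

masked      # prints "status seen: 0"
preserved   # prints "status seen: 3"
```

with the split form, a guard such as `out=$(cmd) || return` works as expected inside a function.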

the following is what i get if the monitor is running:

$ ./tools/ceph-monstore-update-crush.sh dev/mon.a/
IO error: lock dev/mon.a/store.db/LOCK: Resource temporarily unavailable

error accessing mon store at dev/mon.a/

@tchaikov
Contributor Author

FAIL: test/libradosstriper/rados-striper.sh

the failed test is not related to this change.

@ghost

ghost commented Jul 15, 2015

@tchaikov could you please rebase and repush? It's the first time this rados-striper.sh error has happened and I'd like to make sure it's a transient error.

@tchaikov
Contributor Author

@dachary rebased and repushed.

tchaikov and others added 10 commits July 17, 2015 19:14
"rewrite" command will
 - add a new osdmap version to update current osdmap held by OSDMonitor
 - add a new paxos version, as a proposal it will
   * rewrite all osdmap epochs from specified epoch to the last_committed
     one with the specified crush map.
   * add the new osdmap which is added just now
so the leader monitor can trigger a recovery process to apply the transaction
to all monitors in quorum, and hence bring them back to normal after being
injected with a faulty crushmap.

Fixes: #11815
Signed-off-by: Kefu Chai <kchai@redhat.com>
* --dump will accept a formatter argument.

Signed-off-by: Kefu Chai <kchai@redhat.com>
* its '--dump-json' option is replaced by '--dump json'

Signed-off-by: Kefu Chai <kchai@redhat.com>
Fixes: #11815
Signed-off-by: Kefu Chai <kchai@redhat.com>
Fixes: #11815
Signed-off-by: Kefu Chai <kchai@redhat.com>
Fixes: #11815
Signed-off-by: Kefu Chai <kchai@redhat.com>
Signed-off-by: Kefu Chai <kchai@redhat.com>
Signed-off-by: Kefu Chai <kchai@redhat.com>
Signed-off-by: Joao Eduardo Luis <joao@suse.de>
Signed-off-by: Joao Eduardo Luis <joao@suse.de>
@tchaikov
Contributor Author

rebased against master to resolve the merge conflicts in PendingReleaseNotes.

@jecluis ping?

@jecluis
Member

jecluis commented Jul 17, 2015

Let's wait for the bot to come back with a clean build and I'll merge it then ;)

@tchaikov
Contributor Author

cool, thanks in advance! =)

jecluis added a commit that referenced this pull request Jul 17, 2015
tools: add a tool to restore crush map after a faulty one is injected

Reviewed-by: Joao Eduardo Luis <joao@suse.de>
@jecluis jecluis merged commit a89cae4 into master Jul 17, 2015
@jecluis jecluis deleted the wip-11815-restore-crushmap branch July 17, 2015 15:50
3 participants