orchestrator api graceful-master-takeover does not always return a code #949

Closed
laurent-indermuehle opened this issue Aug 4, 2019 · 9 comments · Fixed by #1166

Comments

@laurent-indermuehle
Contributor

I'm trying to capture the return code of the graceful-master-takeover command: either a success, a failure, or a refusal (for example, if the destination is already the master).

Here's what I got so far (orchestrator 3.1.0):

  1. Using the orchestrator command. It works, but the output is hard to capture from a script because it also prints SQL syntax errors:
orchestrator -c graceful-master-takeover -i mysql-customer1-t1:33005 -d mysql-customer1-t2:33005


2019-07-18 ERROR Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'and cluster_name='mysql-customer1-t2:33005'' at line 6
2019-07-18 ERROR Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'and cluster_name='mysql-customer1-t2:33005'' at line 6
mysql-test-t1:33005
mysqld5-bin.000010:1563
  2. Using the API. It works on success or refusal, but on error the timeout is very long and the output is again hard to capture:
orchestrator-client -c api -path graceful-master-takeover/mysql-customer1-t2/33006/mysql-customer1-t1/33006 | jq -r '.[].Code, .Message'

#Already master
GracefulMasterTakeover: indicated designated instance mysql-customer1-t1:33006 must be directly replicating from the master mysql-customer1-t1:33006

# Success
OK
graceful-master-takeover: successor promoted
(syslog also reports the SQL syntax error here)

# Failure (network down)
[looooong pause]
dial tcp 192.168.100.101:33006: connect: no route to host

# Failure (slave stopped too long for master to have binlog anymore)
Start SLAVE UNTIL is past coordinates: mysql-test-t2:33006

I was hoping orchestrator-client would always return a code. That is not the case.

@shlomi-noach
Collaborator

Interesting timing, seeing that just today I proposed #947 and #948 which address some of these problems.

The SQL error is an unfortunate bug, fixed in #931 and also in #948.

Let me look into formalizing the output for that, though I'd suggest using the API for most automated tasks.

Also, what's a "looooong pause"? A number would be great. Please see if #948 reduces that number.

@laurent-indermuehle
Contributor Author

You're working faster than we can open issues :P

I'll rerun the tests next week (starting August 12). I'll add some numbers for the looong pause too.

@shlomi-noach
Collaborator

shlomi-noach commented Aug 4, 2019

You're working faster than we can open issues

Sheer coincidence, I assure you 😄

@laurent-indermuehle
Contributor Author

Sorry for the delay, I had to learn how to compile orchestrator.
I've rerun all the tests with version a5822fd:

  1. Using the orchestrator command:
# Good conditions:
[root@mysql-orc]# orchestrator -c graceful-master-takeover -i mysql-c1-t1:33005 -d mysql-c1-t2:33005
mysql-c1-t2:33005
mysqld5-bin.000002:1501

# Same command once again, so the destination is already master:
[root@mysql-orc]# orchestrator -c graceful-master-takeover -i mysql-c1-t1:33005 -d mysql-c1-t2:33005
2019-08-13 14:03:03 FATAL GracefulMasterTakeover: indicated designated instance mysql-c1-t2:33005 must be directly replicating from the master mysql-c1-t2:33005

Perfect! The SQL syntax errors are gone!

  2. Using the API:
orchestrator-client -c api -path graceful-master-takeover/mysql-c1-t2/33006/mysql-c1-t1/33006 | jq -r '.[].Code, .Message'

# Already master - no changes
GracefulMasterTakeover: indicated designated instance mysql-c1-t1:33005 must be directly replicating from the master mysql-c1-t1:33005

# Success - no more SQL syntax errors in syslog!
OK
graceful-master-takeover: successor promoted

# Failure (network down)
[long pause of about 10 minutes]
dial tcp 192.168.100.101:33005: connect: no route to host

# Failure (slave stopped for too long for master to have binlog anymore)
Start SLAVE UNTIL is past coordinates: mysql-c1-t2:33005
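
One way a calling script could avoid waiting through a pause like the one above is to put its own deadline on the client call. The following is only a sketch: `run_with_deadline` is a hypothetical helper, not part of orchestrator, and the commented-out command simply reuses the `orchestrator-client` invocation shown in this thread. A client-side deadline only stops the script from waiting; it presumably does not cancel whatever the server is still doing.

```python
import subprocess
import sys


def run_with_deadline(cmd, deadline_seconds):
    """Run cmd and return (status, stdout).

    status is 'DONE' (exit code 0), 'FAILED' (non-zero exit code)
    or 'TIMEOUT' (the deadline passed and the child was killed).
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=deadline_seconds
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return "TIMEOUT", ""
    status = "DONE" if proc.returncode == 0 else "FAILED"
    return status, proc.stdout


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; in practice this would be
    # something like:
    #   ["orchestrator-client", "-c", "api", "-path",
    #    "graceful-master-takeover/mysql-c1-t2/33006/mysql-c1-t1/33006"]
    status, out = run_with_deadline([sys.executable, "-c", "print('hello')"], 30)
    print(status, out.strip())
```

The deadline value (30 seconds here) is arbitrary and would need to be longer than a healthy takeover normally takes.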

I should have mentioned the problem with the API not always returning a code. It generates this error: jq: error (at <stdin>:1): Cannot index string with string "Code"

@laurent-indermuehle changed the title from "orchestrator api graceful-master-takeover" to "orchestrator api graceful-master-takeover does not always return a code" on Aug 13, 2019
@shlomi-noach
Collaborator

@Honiix thank you for looking into this. In this case, I'm interested in what the JSON actually contains. Since your command only prints '.[].Code, .Message', I'm unsure what the full output was.

If you could repeat the two short tests and paste the raw JSON output, that would be great. Meanwhile, I acknowledge that, to the best of my memory, the API can return a different format on error than on success. But let's look at that JSON.

@laurent-indermuehle
Contributor Author

That's the thing. The message "GracefulMasterTakeover: indicated designated instance mysql-c1-t1:33005 must be directly replicating from the master mysql-c1-t1:33005" is a string, not JSON.

When the master no longer has the binlog, the message is also a string:
WaitForExecBinlogCoordinatesToReach: reached maxWait 20s on mysql-c1-t1:33005
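
Until the output format is made uniform, a script could normalize whatever comes back before looking for a code. The sketch below is illustrative only: `normalize_response` is a hypothetical helper, and treating any bare string as an error is an assumption based on the two failure cases reported in this thread, not documented API behaviour.

```python
import json


def normalize_response(raw: str) -> dict:
    """Return a dict with Code, Message and Details, whatever shape raw had."""
    text = raw.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # The body was plain text rather than JSON.
        # Assumption: a bare message means the request failed.
        return {"Code": "ERROR", "Message": text, "Details": None}

    if isinstance(parsed, dict) and "Code" in parsed:
        # Structured reply carrying Code/Message (and possibly Details).
        return {
            "Code": parsed["Code"],
            "Message": parsed.get("Message", ""),
            "Details": parsed.get("Details"),
        }
    if isinstance(parsed, str):
        # A JSON-encoded string; same assumption as for plain text.
        return {"Code": "ERROR", "Message": parsed, "Details": None}
    # Anything else (list, number, object without Code) is left for a human.
    return {"Code": "UNKNOWN", "Message": text, "Details": parsed}


if __name__ == "__main__":
    sample = "WaitForExecBinlogCoordinatesToReach: reached maxWait 20s on mysql-c1-t1:33005"
    print(normalize_response(sample)["Code"])
```

With this in place the caller can always read `Code` and `Message` from the result instead of indexing into the raw reply.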

@shlomi-noach
Collaborator

@Honiix ohhhh! OK cool, thanks for this info; I'll look into it.

@shlomi-noach
Collaborator

Fixed in #1166

@shlomi-noach
Collaborator

As a concrete example of how #1166 addresses this issue:

$ orchestrator-client -c api -path graceful-master-takeover-auto/ci/127.0.0.1/10114 | jq .
{
  "Code": "ERROR",
  "Message": "GracefulMasterTakeover: Recovery attempted yet no replica promoted; err=RecoverDeadMaster: failed 127.0.0.1:10114 promotion; PreventCrossRegionMasterFailover: will not promote server in rgn-west when failed server in rgn-east",
  "Details": null
}
$ orchestrator-client -c api -path graceful-master-takeover-auto/ci/127.0.0.1/10112 | jq .
{
  "Code": "ERROR",
  "Message": "GracefulMasterTakeover: indicated designated instance 127.0.0.1:10112 must be directly replicating from the master 127.0.0.1:10111",
  "Details": null
}
$ orchestrator-client -c api -path graceful-master-takeover-auto/ci/127.0.0.1/10113 | jq .
{
  "Code": "OK",
  "Message": "graceful-master-takeover: successor promoted",
  "Details": {
    "Id": 192,
    "UID": "1589784607961527000:b4a441fbf7f276fc841c0b01d9f1985c0bfe0c754605f4d4bfdf33bde5b5ada7",
    "AnalysisEntry": {
      "AnalyzedInstanceKey": {
        "Hostname": "127.0.0.1",
        "Port": 10111
      },
      "AnalyzedInstanceMasterKey": {
        "Hostname": "",
        "Port": 0
      },
      "ClusterDetails": {
        "ClusterName": "127.0.0.1:10111",
        "ClusterAlias": "ci",
        "ClusterDomain": "",
        "CountInstances": 4,
        "HeuristicLag": 0,
        "HasAutomatedMasterRecovery": true,
        "HasAutomatedIntermediateMasterRecovery": true
      },
      "AnalyzedInstanceDataCenter": "dc-east-1",
      "AnalyzedInstanceRegion": "rgn-east",
      "AnalyzedInstancePhysicalEnvironment": "prod",
      "IsMaster": true,
      "IsCoMaster": false,
      "LastCheckValid": true,
      "LastCheckPartialSuccess": true,
      "CountReplicas": 1,
      "CountValidReplicas": 1,
      "CountValidReplicatingReplicas": 1,
      "CountReplicasFailingToConnectToMaster": 0,
      "CountDowntimedReplicas": 0,
      "ReplicationDepth": 0,
      "SlaveHosts": [
        {
          "Hostname": "127.0.0.1",
          "Port": 10113
        }
      ],
      "IsFailingToConnectToMaster": false,
      "Analysis": "DeadMaster",
      "Description": "",
      "StructureAnalysis": null,
      "IsDowntimed": false,
      "IsReplicasDowntimed": false,
      "DowntimeEndTimestamp": "",
      "DowntimeRemainingSeconds": 0,
      "IsBinlogServer": false,
      "PseudoGTIDImmediateTopology": false,
      "OracleGTIDImmediateTopology": true,
      "MariaDBGTIDImmediateTopology": false,
      "BinlogServerImmediateTopology": false,
      "CountLoggingReplicas": 1,
      "CountStatementBasedLoggingReplicas": 0,
      "CountMixedBasedLoggingReplicas": 0,
      "CountRowBasedLoggingReplicas": 1,
      "CountDistinctMajorVersionsLoggingReplicas": 1,
      "CountDelayedReplicas": 0,
      "CountLaggingReplicas": 0,
      "IsActionableRecovery": true,
      "ProcessingNodeHostname": "shlomi-mbp",
      "ProcessingNodeToken": "302337ab8c792a5e8fc71c3919fa35c7da04e275c9b915b2457a6ce0b00b354c",
      "CountAdditionalAgreeingNodes": 0,
      "StartActivePeriod": "",
      "SkippableDueToDowntime": false,
      "GTIDMode": "ON",
      "MinReplicaGTIDMode": "ON",
      "MaxReplicaGTIDMode": "ON",
      "MaxReplicaGTIDErrant": "",
      "CommandHint": "graceful-master-takeover",
      "IsReadOnly": false
    },
    "SuccessorKey": {
      "Hostname": "127.0.0.1",
      "Port": 10113
    },
    "SuccessorAlias": "",
    "IsActive": false,
    "IsSuccessful": true,
    "LostReplicas": [],
    "ParticipatingInstanceKeys": [],
    "AllErrors": [],
    "RecoveryStartTimestamp": "",
    "RecoveryEndTimestamp": "",
    "ProcessingNodeHostname": "",
    "ProcessingNodeToken": "",
    "Acknowledged": false,
    "AcknowledgedAt": "",
    "AcknowledgedBy": "",
    "AcknowledgedComment": "",
    "LastDetectionId": 0,
    "RelatedRecoveryId": 0,
    "Type": "MasterRecovery",
    "RecoveryType": "MasterRecoveryGTID"
  }
}
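
With replies shaped like the three above, a wrapper script could turn `Code` into a process exit status and pull the new master out of `Details`. This is a sketch under that assumption only: `summarize_takeover` is a hypothetical helper, and the field names (`Code`, `Message`, `Details`, `SuccessorKey`, `Hostname`, `Port`) are taken from the sample output in this thread rather than from a published schema.

```python
import json


def summarize_takeover(body: str):
    """Return (exit_status, summary_line) for a Code/Message/Details reply."""
    response = json.loads(body)
    code = response.get("Code")
    message = response.get("Message", "")
    if code != "OK":
        # Refusals and failures both arrive with a non-OK code.
        return 1, f"{code}: {message}"

    # Details is null on errors and an object on success in the samples.
    details = response.get("Details") or {}
    successor = details.get("SuccessorKey") or {}
    host, port = successor.get("Hostname"), successor.get("Port")
    if host is None:
        return 0, message
    return 0, f"{message} ({host}:{port})"


if __name__ == "__main__":
    import sys

    # Typical use: pipe the client output into this script.
    status, line = summarize_takeover(sys.stdin.read())
    print(line)
    sys.exit(status)
```

Returning a plain exit status keeps the calling shell script simple: it can branch on `$?` without parsing JSON itself.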

2 participants