SCSI reservations issue on failover #25
Comments
Hello, this doesn't follow the spirit of what I documented. Why doesn't your design use multipath?
Don't have the space for a second HBA; the servers have only a few low-profile slots. One has the HBA, the other a dual 10G card for NFS connectivity. This is a low-budget operation with second-hand hardware, paid for by donations, to provide storage for the build/compile servers of an open source development team I help out with some infra and admin work. I understand that two HBAs and multipathing would provide additional availability, but unfortunately it is what it is. Until a big sponsor comes along... ;-)
We had the same problems with SCSI reservations, as it never worked as expected. Sometimes when a failover happened, the new active server could not import the disks.
Good to read I'm not alone. Not good that you needed to work around it like that. I'd hoped to avoid that.
I haven't had such issues with any deployment. @milleroff Please make sure each HBA port is cabled to a separate enclosure.
That is indeed how I've hooked it up now, one port going to each of the enclosures. All my disks are Hitachi HUS156060VLS600, which are dual-port SAS drives. I don't have any SCSI fencing active; removing that was my first step in trying to find the problem.
SCSI fencing is crucial to what you're doing. That's how the failover and pool import work.
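For reference, SCSI fencing in a Pacemaker cluster is typically a single fence_scsi STONITH resource along these lines; a minimal sketch in which the node names and device paths are placeholders:

```
# fence_scsi STONITH resource; 'provides=unfencing' lets Pacemaker re-register
# a node's key on the shared devices when that node rejoins the cluster
pcs stonith create fence-scsi fence_scsi \
    pcmk_host_list="node1 node2" \
    devices="/dev/mapper/mpatha,/dev/mapper/mpathb" \
    meta provides=unfencing
```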
So what is setting the reservations, as it's not the fence_scsi agent? I get that in production you need it to avoid imports on both nodes (which would be utter horror), but if it doesn't work in a controlled failover (where the pools are cleanly exported and a node is cleanly shut down to trigger a failover), I don't see how adding another layer of complexity will fix the issue.
Controlled and uncontrolled failovers work in the setup I've described and documented. I do not know what's unique about your environment, but removing critical components of the design isn't going to help the situation. What is the output of `multipath -ll`? This high-availability design assumes:
I get that. I'd rather work with a documented (and if possible supported) environment as well. But as said, it is what it is. ;-)
The SSD disk is a 32GB SATA DOM the server boots from. I don't have a multipath setup, so no multipath service installed. And therefore no multipath output, and no dm-multipath devices. I know my setup isn't as documented, but I was still hoping someone knew where these reservations came from, so I could work with/around them. I have no issues writing a resource agent to deal with those if needed, if that is what it takes. Thanks so far. I have reinstated the fence_scsi agent now, will try another failover tomorrow. I'm in GMT+2, getting late here... ;-)
I'd enable the multipath daemon and re-create the pool with the resulting multipath device names. Your NetApp shelves should allow you to do this.
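On CentOS 7 that would look roughly like this; a sketch in which the pool name and the mpath WWIDs are placeholders for the real devices:

```
# Install and enable device-mapper-multipath
yum install -y device-mapper-multipath
mpathconf --enable --with_multipathd y   # writes /etc/multipath.conf and starts multipathd
multipath -ll                            # each SAS disk should now show two paths

# Re-create the mirrored pool on the resulting /dev/mapper names
zpool create -f tank \
    mirror /dev/mapper/35000cca012340001 /dev/mapper/35000cca012340002 \
    mirror /dev/mapper/35000cca012340003 /dev/mapper/35000cca012340004
```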
Ok, will do that tomorrow evening. Thanks for the help so far.
Decided to restart the project completely. Formatted and cleared everything, reinstalled CentOS and ZFS, following the wiki virtually to the letter. The difference is that I decided not to use multipath. I had a chat with some Red Hat DC guys today, and they all advised me not to use it when there is no multipath in use, to avoid another layer of complexity. So I followed their advice and built the pool on the disks' WWN device names instead.

Just tested a few failovers by switching nodes to standby and back and faking network issues, and that seems to work fine now, including the SCSI fencing. Happy days.

Only one issue left: when ZFS fails over, the shares aren't activated after the failover, and I need to re-share them manually after the import. No idea where to look next. I didn't have this problem before, so I seem to roll from one issue to the next...
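For anyone hitting the same share problem, a couple of checks narrow it down; a sketch, with `tank` standing in for the real pool name:

```
# Is NFS sharing actually set as a property on the datasets?
zfs get -r sharenfs tank

# If sharenfs is set but nothing is exported after an import,
# re-sharing by hand confirms that only the share step is failing
zfs share -a
exportfs -v        # the ZFS-managed exports should now be listed
```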
I don't understand what you are trying to do by avoiding multipath, as it is a key element of this design. I understand you're seeking assistance, but you have not clearly articulated the reasoning behind not using multipath devices. If there's an architectural issue preventing multipath cabling, please explain.
I don't have multiple paths to my storage, so it is totally pointless to install and use multipath; every device has only one path. I get that it is a key element of your design, but I don't have the hardware to match, as I have already explained (I only have one HBA per server, and only room for one).
I'm sorry, but the guidelines are very clear. Single HBAs aren't a problem if they have two external ports. Your equipment choices and the workarounds you're crafting are not a valid support issue.
What an attitude. Disappointing. I have servers with a single HBA. They have two ports each. One port is connected to shelf A, one port is connected to shelf B. The shelves themselves also only have two ports, so I CAN'T create a multipath even if I wanted to. As I wrote yesterday, it is what it is, and then you didn't have a problem with it. Besides that, the fact that NFS shares don't become available after a zpool import has absolutely zero to do with whether multipath is in use or not. It doesn't work either if I boot up one node while the other is switched off...
This is outside the scope of support because your solution is not built properly. Regarding ZFS shares, filesystem exports are shared automatically on zpool import. I suspect that your pools aren't actually exporting/importing, since the servers and disks have no knowledge of each other because there's no use of multipath devices/device names in your zpool.
@ewwhite Thanks so much for your excellent and hard work on this project! If you're interested, I was hoping to share some work I've done to integrate your design with ZnapZend, which (as I'm sure you're aware) stores all of the snapshot & replication configuration within properties of the ZFS filesystem itself. In my testing (with minor modification) ZnapZend meshes well with your design :-)
@milleroff @WanWizard If you have each SAS HBA port connected to a separate shelf on both server nodes, then the physical SAS connectivity IS already multipathed. However, for this ZFS-HA design to function, and as @ewwhite has described in the wiki, it is essential to install and configure device-mapper-multipath and use the /dev/mapper/ device IDs for the vdevs when you create the ZFS pool. Hope this helps.
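A quick way to confirm that the cabling really gives each disk two paths before relying on multipath; a sketch assuming lsscsi and device-mapper-multipath are installed:

```
# Each dual-ported SAS disk should appear twice in lsscsi, once per HBA port
lsscsi

# With multipathd running, each WWID should then aggregate both paths
multipath -ll
```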
Here are the notes from the Red Hat support article linked above (the Resolution and Root Cause sections).
Thanks for the prompt response and helpful info @ewwhite :-) It seems the fence_mpath agent is a little more complex to set up, and requires "that /etc/multipath.conf be configured with a unique reservation_key hexadecimal value on each node, either in the defaults or in a multipath block for each cluster-shared device." Have you tested using the fence_mpath agent with your design, BTW?
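From the Red Hat documentation quoted above, the fence_mpath setup looks roughly like this; a sketch in which the keys, node names and device paths are placeholders (check the fence_mpath man page for the exact key format):

```
# /etc/multipath.conf on node1 (node2 gets its own unique key, e.g. 0x2):
#   defaults {
#       reservation_key 0x1
#   }

# Pacemaker STONITH resource mapping each node to its key
pcs stonith create fence-mpath fence_mpath \
    pcmk_host_map="node1:0x1;node2:0x2" \
    devices="/dev/mapper/mpatha,/dev/mapper/mpathb" \
    meta provides=unfencing
```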
For those finding this issue because of a similar problem: multipath is not a requirement in my setup. Creating (and failing over) a zpool with vdevs built on WWNs works fine; WWNs are fixed and unique. I had this confirmed by my company's Red Hat system engineer, and I've checked the ZoL code. I addressed the share issue by modifying the nfsserver heartbeat script, adding a re-share step after the pool import.
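For reference, a WWN-based pool creation looks something like this; a sketch in which the pool name and WWNs are placeholders:

```
# The wwn-* names are stable across reboots and identical on both nodes
ls -l /dev/disk/by-id/wwn-*

# Mirrored pool built on the full by-id paths (replace with your own WWNs)
zpool create -f tank \
    mirror /dev/disk/by-id/wwn-0x5000cca000000001 /dev/disk/by-id/wwn-0x5000cca000000002 \
    mirror /dev/disk/by-id/wwn-0x5000cca000000003 /dev/disk/by-id/wwn-0x5000cca000000004
```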
@rcproam No, I have not had a need to use fence_mpath. I have not encountered SCSI reservation problems in my builds. Definitely try to make sure

The other thing that I do these days is ensure there's a discrete heartbeat network path between nodes. I've been using a simple USB transfer cable between hosts to provide this additional link as the alternate Corosync ring. I found this to be necessary in environments where I have MLAG/MC-LAG switches and multi-chassis LACP from the servers to the switches. A switch failure with collapsed VLANs for data, heartbeat, etc. would kill all of the network links, including the Corosync rings. That's the only other modification I've needed. I don't suspect that SCSI reservation issues are commonplace.

@WanWizard - I advise leaving the NFS service running full time on both nodes. ZFS takes care of the rest. There's no need to start/stop that service for this purpose. Note that there's no NFS server resource, just the ZFS zpool, STONITH and IP address.
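A second Corosync ring over a dedicated point-to-point link is configured roughly like this; a sketch for corosync 2.x, with placeholder network addresses:

```
# /etc/corosync/corosync.conf excerpt
totem {
    version: 2
    rrp_mode: passive                # fall back to ring 1 only if ring 0 fails
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.10.0    # primary heartbeat network
    }
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.250.0   # point-to-point link (e.g. the USB NIC)
    }
}
```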
As I wrote, I have the NFS server managed as a cluster resource (nfsserver). This is on the suggestion of that same Red Hat engineer, and documented here: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_administration/s1-resourcegroupcreatenfs-haaa

Doing so requires the zpool and its datasets to be available before NFS starts, and that can only be achieved using the nfsserver resource in combination with order constraints. It does create a chicken-and-egg problem, I understand that now. Red Hat's examples are based on DRBD, which doesn't have this problem. I worked around it, and I don't have a problem with that. Just wanted to report that back for future reference.
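For future readers, the ordering/colocation approach looks roughly like this; a sketch in which the resource names (`zfs-tank`, `nfs-daemon`) and the infodir path are placeholders:

```
# NFS server as a cluster resource, started only after the pool resource
pcs resource create nfs-daemon nfsserver \
    nfs_shared_infodir=/tank/nfsinfo

pcs constraint order start zfs-tank then start nfs-daemon
pcs constraint colocation add nfs-daemon with zfs-tank INFINITY
```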
Hello, I think I'm in the same situation as @WanWizard. I have a Dell MD1220, which basically has 2 EMM controllers with 2 SAS ports each. Unfortunately the 2nd port is unusable for multipath because it is reserved for the daisy chain between enclosures only. So I'm forced to connect only one SAS cable per EMM (EMM1 on host1 HBA, EMM2 on host2 HBA). Disks are SAS dual-port. I have multipath enabled, but obviously multipath -ll shows a single path for each disk on each host. However, it seems that failover is working with no issues. My question is: can I stay safe with this setup, or do I have to migrate to a full dual-port SAS solution?
The only additional risk you run is that a cable or connection issue between the active node and one of the enclosures will trigger a failover, whereas with multipath the active node would remain active and use the second path. It depends on your situation, but in my case everything runs in a locked rack that nobody ever opens, the chance of connection issues is very slim, and a failover because of one is not a problem (that is why I have two nodes, right?). In my case the replacement costs by far outweigh the risk.
I just read through the technical guidebook for the Dell MD1220. The manual says that clustering is not supported on the enclosure.
What happens if you create a SAS Multipath ring and use the Out ports on the EMM? The manual says the ports may be disabled depending on the enclosure mode (split/unified). If this doesn’t work, I guess that means this Dell is not an ideal enclosure for ZFS clustering purposes.
By connecting a SAS HBA to the Out port on the EMM, nothing happens; multipath -ll says there's only one possible path. However, I tried some failover tests and everything seemed fine:
I think also that the Dell MD1200 should be removed from the wiki, since it is equipped with the same EMM controllers as the MD1220. Riccardo
@rbicelli The point of the dual cabling is to provide HBA, port, cable and controller resilience. The limitation of the MD1220 controller is disappointing to see.
I am having this issue as well. My setup is Red Hat 8. multipath -ll shows two paths for each disk. multipathd was initially trying to use TUR; this was giving reservation errors on the passive node every time it tried to check if the path was available. I changed it to use directio, and this stopped the errors on the passive node on path checks. However, on failover I still get reservation issues causing a failure to fail over:

`sd 1:0:62:0: reservation conflict`

sg_persist shows the disk has reservations also, like the original poster.
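For anyone debugging the same symptom, the registrations and the active reservation can be inspected with sg3_utils; a sketch with a placeholder device path:

```
# List the keys registered on a shared disk
sg_persist --in --read-keys --device=/dev/sdc

# Show which key currently holds the reservation, and its type
sg_persist --in --read-reservation --device=/dev/sdc
```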
Hey, followed your great instructions to the letter, but I'm left with a situation that leaves me stumped.
I have a setup with two Supermicros, each connected to two 12-disk JBODs with SAS disks, but without a loop, so no multipath (and multipath is not installed). Both JBODs are used in mirrored vdevs, so I can lose an entire JBOD without much issue.
OS: CentOS Linux release 7.6.1810 (Core)
ZFS: 0.7.13, from the zfs-kmod repo
This setup works fine, until pacemaker decides there is a need to failover. It doesn't matter if that is because the active node is put into standby, because the hardware is switched off, etc.
When pacemaker fails over, the second node tries to import the pool, which fails because something on the first node has placed SCSI reservations on the disk:
As soon as the failover happens, the second node starts to log:
which either causes the entire import to fail or, if the import succeeds, leaves disks offline due to excessive errors.
I've been pulling my hair out for about two weeks now, but have no clue what sets these reservations, or how I can have them released on a cluster start or a cluster failover. There seem to be lots of people building Linux HA clusters with ZFS, judging by the discussions I found, but no one mentions this issue...
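If a stale key left by the old active node is what blocks the import, it can in principle be cleared by hand with sg_persist before importing; a sketch in which both keys and the device path are placeholders, and which should only ever target the stale node's key:

```
# Register a key of our own on the disk...
sg_persist --out --register --param-sark=0xdead0001 --device=/dev/sdc

# ...then preempt (remove) the stale registration left by the old node
sg_persist --out --preempt-abort --param-rk=0xdead0001 --param-sark=0x3abe0001 --device=/dev/sdc
```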