This repository has been archived by the owner on Apr 17, 2020. It is now read-only.

Clarification needed on AWS CFE examples #17

Closed
C0missar opened this issue Mar 3, 2020 · 7 comments
Labels
documentation Improvements or additions to documentation

Comments

@C0missar

C0missar commented Mar 3, 2020

The AWS section of the CFE user guide leaves a lot of questions open, or if the answers are there, I didn't understand them.

https://clouddocs.f5.com/products/extensions/f5-cloud-failover/latest/userguide/aws.html

The drawing doesn't match either the text or the example declarations. It would help to have a single drawing, with the declarations matching it exactly, and discussion using that scenario and IPs - preferably two examples, a same-AZ and an across-AZ case. The routing considerations are substantially different.

• What route(s) are to be updated? The BIG-IPs can be in different subnets.
• The examples talk about both the default route and RFC 1918 routes being updated.
• Must the web servers' default route be pointed at the BIG-IP's internal interface?
• Is iLX installation required? It appears so.
• Can CFE share the same S3 bucket as the one created by the CFT? It appears so.
• The failover drawing shows that VIPs must be in traffic group 'none' – why?
• Using addresses like '10.0.1.10' and '10.0.11.10' is confusing and hard to read. Why not '10.0.20.x' and '10.0.30.x' so the differences stand out?

When it comes to operations, I haven't been able to make CFE do anything. Although it accepted my declaration and responds appropriately to status and failover triggers, nothing is actually happening. Not terribly surprising, as I still don't understand it, but I should get some indication back.

• How do you troubleshoot CFE?
• Why does a call to Trigger Failover return "SUCCEEDED" when nothing happened?
• The Across-AZ CFT creates an EIP and a private VIP on bigip1, but no private VIP on bigip2, so there is nothing to associate the EIP with on failover.

Thanks,
Stan

@shyawnkarim

Thanks for reaching out to us about our documentation. We released a new version of the Cloud Failover Extension last night and have made many additions and improvements to our documentation. Please take a look at it and let me know if there are items that still need clarification.

@chen23

chen23 commented Mar 17, 2020

I'll add to @C0missar's comments that the "quick start" omits details about how CFE behaves when deployed via the CFT. Specifically:

  • the "stack name" will map to "mydeployment"
  • all fields are "required" (for those that have used the HA iApp previously it was a "surprise" to have an S3 dependency)

In my own testing I could only get it to work using routeTag and not static. It would be helpful to add screenshots of the EIP / route updates to illustrate the desired outcome. Here's an example of what my route table looks like after applying tags.

[image: screenshot of the route table after applying tags]

For the term "failover.scopingTags" in the docs, it wasn't clear to me whether you should specify a custom key name or just the key value. After looking at the outcome of the CFT, it became apparent that the key defaults to the value used in the documentation. It would be nice to have a note in the examples/docs that references the default values when using the CFT.

In the debug steps, it would be helpful to mention where to look for logs (restnoded.log) and how to trigger the script manually (either by forcing the device to standby or by running the scripts directly via bash).
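As a rough illustration of the log-hunting step, here is a minimal Python sketch that keeps only the lines of a log file mentioning CFE. Both the marker string "f5-cloud-failover" and any path you pass in are assumptions for illustration; this thread only names the file restnoded.log, so adjust both to whatever your system actually shows.

```python
def cfe_log_lines(lines, marker="f5-cloud-failover"):
    """Return only the lines that contain the marker substring.

    The default marker is an assumption, not something confirmed in this thread.
    """
    return [line.rstrip("\n") for line in lines if marker in line]


def read_cfe_log(path, marker="f5-cloud-failover"):
    """Read a log file and keep the lines that mention the marker."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return cfe_log_lines(handle, marker)
```

Calling read_cfe_log with the location of restnoded.log on the BIG-IP right after a forced failover should show whether CFE logged anything for that event.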

AFAIK traffic group NONE helps in cases where you run active/active and/or you want the BIG-IP to keep accepting traffic while traffic is being redirected to the standby device during failover (otherwise the traffic would immediately get dropped).

It would help to mention the destination of the JSON body. I had to hunt around back to the quickstart page to get the URL. Maybe the quickstart could be per-environment (AWS, Azure, GCP) and make the assumption that you are starting with a CFT, ARM, or GDM template? It gets a bit repetitive, but it also makes it easier to start and end on the same page.

Here's an example of my JSON output that I used in my environment using the CFT deployment.

{
    "class": "Cloud_Failover",
    "environment": "aws",
    "externalStorage": {
        "scopingTags": {
            "f5_cloud_failover_label": "erchen-cross-az"
        }
    },
    "failoverAddresses": {
        "scopingTags": {
            "f5_cloud_failover_label": "erchen-cross-az"
        }
    },
    "failoverRoutes": {
        "scopingTags": {
            "f5_cloud_failover_label": "erchen-cross-az"
        },
        "scopingAddressRanges": [
            {
                "range": "192.168.1.0/24"
            }
        ],
        "defaultNextHopAddresses": {
            "discoveryType": "routeTag"
        }
    }
}
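
For anyone assembling the same body by script, here is a minimal Python sketch that builds a declaration with the same shape as the one above from a single label. The function name and its parameters are hypothetical; only the field names and values come from the declaration itself. It prints the JSON and does not send it anywhere, since the destination URL is on the quickstart page rather than in this thread.

```python
import json


def build_declaration(label, discovery_type="routeTag",
                      ranges=("192.168.1.0/24",),
                      tag_key="f5_cloud_failover_label"):
    """Build a Cloud_Failover declaration shaped like the example above.

    build_declaration is a hypothetical helper, not part of CFE.
    """
    scoping = {tag_key: label}
    return {
        "class": "Cloud_Failover",
        "environment": "aws",
        # The same tag scopes the storage, the addresses, and the routes.
        "externalStorage": {"scopingTags": dict(scoping)},
        "failoverAddresses": {"scopingTags": dict(scoping)},
        "failoverRoutes": {
            "scopingTags": dict(scoping),
            "scopingAddressRanges": [{"range": r} for r in ranges],
            "defaultNextHopAddresses": {"discoveryType": discovery_type},
        },
    }


declaration = build_declaration("erchen-cross-az")
print(json.dumps(declaration, indent=4))
```

Using one label for all three scopingTags blocks mirrors what the CFT produced in my environment; nothing in the sketch requires them to be identical.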

@C0missar
Author

C0missar commented Mar 17, 2020

I honestly didn't see much in the way of documentation changes. The drawing in AWS now shows the pair of VSs in TG1, but that was all I noticed (in the AWS section).

I would like to have a clearer statement about how CFE relates to traffic groups and traditional HA. I get that the mechanism is dual VIPs and API-based rather than GARP, but will normal box failover events trigger the standby box to make the API calls and reassociate the EIPs? If you force a TG to standby, will that trigger CFE?

I'd also like to understand why Active-Active is not recommended. Other articles on failover (non-CFE) have recommended Active-Active, and it seems to me that for CFE, it wouldn't really matter, since you are failing over to a different VIP anyway.

On routing, the only way I can think of to make HA across AZs work is with a SNAT pool or SNAT automap. I don't see the use case for route updates, and I'd like to know what assumptions are made about web server routing/default gateways.

I have yet to see this work. My declaration looks like that of @chen23 above, except that the discoveryType is "static" and scopingAddressRanges is "[]". I don't understand the intent of this field, and my PS consultant had this exact declaration working in his lab using that empty value.

{
    "class": "Cloud_Failover",
    "environment": "aws",
    "externalStorage": {
        "scopingTags": {
            "f5_cloud_failover_label": "AWSUSE2PROD-cfe"
        }
    },
    "failoverAddresses": {
        "scopingTags": {
            "f5_cloud_failover_label": "AWSUSE2PROD-cfe"
        }
    },
    "failoverRoutes": {
        "scopingTags": {
            "f5_cloud_failover_label": "AWSUSE2PROD-cfe"
        },
        "scopingAddressRanges": [],
        "defaultNextHopAddresses": {
            "discoveryType": "static",
            "items": []
        }
    },
    "controls": {
        "class": "Controls",
        "logLevel": "silly"
    }
}

The results I'm getting are a) forced failover of TG1 via the Config UI has absolutely no effect on CFE or the contents of the CFE state file in the S3 bucket; b) A trigger via declaration returns "SUCCEEDED" and the state file is updated, but the EIP is not re-associated and the state of TG1 remains unchanged.

The S3 bucket has the f5_cloud_failover_label tag matching the declaration, and the EIP contains both the f5_cloud_failover_label tag and matching value, and the VIPS tag containing the two VIPs separated by a comma as shown in the example.
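
To make that tag check repeatable, here is a minimal Python sketch of the comparison I did by eye: does a resource carry every scoping tag from the declaration, and does its VIPS tag hold two comma-separated addresses? The function names are hypothetical, the tag key "VIPS" is used as written above and may not be the exact key name, and the two addresses are just the sample ones from the user guide. It works on plain dictionaries and makes no AWS calls.

```python
import ipaddress


def tags_match(resource_tags, scoping_tags):
    """True if every scoping tag is present on the resource with the same value."""
    return all(resource_tags.get(key) == value
               for key, value in scoping_tags.items())


def parse_vips(tag_value):
    """Split a comma-separated tag value into a list of IP address strings."""
    addresses = [part.strip() for part in tag_value.split(",") if part.strip()]
    for address in addresses:
        ipaddress.ip_address(address)  # raises ValueError on a malformed address
    return addresses


def check_eip_tags(resource_tags, scoping_tags, vips_key="VIPS"):
    """Return a list of problems found; an empty list means the tags look consistent."""
    problems = []
    if not tags_match(resource_tags, scoping_tags):
        problems.append("scoping tag missing or value differs")
    try:
        vips = parse_vips(resource_tags.get(vips_key, ""))
    except ValueError as error:
        problems.append(f"{vips_key} tag has an invalid address: {error}")
    else:
        if len(vips) != 2:
            problems.append(f"expected 2 addresses in {vips_key}, found {len(vips)}")
    return problems


scoping = {"f5_cloud_failover_label": "AWSUSE2PROD-cfe"}
eip_tags = {"f5_cloud_failover_label": "AWSUSE2PROD-cfe",
            "VIPS": "10.0.1.10,10.0.11.10"}
print(check_eip_tags(eip_tags, scoping))  # prints []
```

An empty list only means the tags agree with the declaration; it says nothing about whether CFE will act on them.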

@shyawnkarim

Thanks for all these details on how to improve our documentation. I've created internal issue AUTOSDK-230 to improve our documentation. I've included detailed notes for our documentation team on all the points you have raised here.

@C0missar
Author

I also don't understand these terms or how to configure them. None of them are questions that come up in traditional failover discussions.

defaultNextHopAddresses - Next hop for what? On which interface?
discoveryType - What is discovered, given that each failover object is declared individually?
scopingAddressRanges - Isn't that what is specified in the VIPS tag on the EIP?
What assumptions are made for the web server's routing?

And what does tgactive.sh have to do with anything - the AWS CFE page had it being removed for same net deployments, but didn't even mention anything for across-AZ. There was discussion of tgactive.sh on Issue #19 as if it was something I needed to be concerned with, but I don't know what or why.

Is any of this stuff written down, or am I supposed to be reading the source code? I've been messing with this for several weeks and still can't get it to work, and support won't open a ticket on it. I'd post more info on my config, but don't know what would be useful.

@rafkruczkowski

rafkruczkowski commented Mar 26, 2020

I'm getting lost on what element fails over. In the gif diagram, the private IP address of the NIC is shown, but would that be a manually configured floating IP on the F5, or an AWS object?

Also, the doc hints that the declaration can be used for failover addresses and/or failover routes. My deployment only has one NIC, so I posted the declaration below and got a success back, but I'm still not sure how to test the ingress traffic to the active unit.

{
    "class": "Cloud_Failover",
    "environment": "aws",
    "externalStorage": {
        "scopingTags": {
            "f5_cfe": "alpha"
        }
    },
    "failoverAddresses": {
        "scopingTags": {
            "f5_cfe": "alpha"
        }
    }
}

Lastly, the docs indicate the tag keys can be anything, and that looks to be the case, but the one exception is the NIC mapping key, which needs to be f5_cloud_failover_nic_map. I would request clarity in the documentation and the declaration on this point.

@alaari-f5
Collaborator

Closing this issue

As of release CFE 1.2 we moved this CFE repo under F5Networks. Your issue was recreated there. To follow-up on this issue visit:

F5Networks/f5-cloud-failover-extension#9
