# Introduction to Forwarding Change Validation


Hello and welcome to our Getting to know Batfish series where we showcase key capabilities of Batfish and how they can be used as part of your network automation workflow. 

My name is Matt. 

In this video, I will show you how to validate network forwarding changes using Batfish. 

Network forwarding changes are:
- Common
- Often hard to get right
- Often hard to validate
- Often high-risk

Network engineers frequently make changes that impact forwarding behavior: adding new routes, opening or closing flows, routing traffic through different devices, etc. These changes are often hard to get right and hard to validate. The risks of deploying a bad change can be huge, potentially causing outages or security breaches. 

For these reasons, having a powerful change validation process is crucial for modern networks.

This video will show you how Batfish can help get your changes right.
With Batfish, change validation can be:

- Proactive: catch bugs before they hit the network

Proactive, meaning you can validate changes offline, before you deploy them to the network. 
- Any problems will be discovered before they can affect the network, and
- you'll only deploy your changes after you have complete confidence that they will have the intended effects.

- Network-scale: full network behavior of all devices

Change validation with Batfish is also network-scale: Batfish analyzes full end-to-end network behavior, allowing it to find bugs caused by interactions between devices far away from each other.

- Comprehensive: validate all possible traffic, not just a few probes

It's also comprehensive: Batfish can validate the forwarding behavior for all possible flows. This allows it to find bugs you may never have thought to test. 

All of this makes Batfish the ideal tool for your change validation workflow.

In [2]:
# Import packages and load questions
%run startup.py

# Initialize base snapshot
NETWORK_NAME = "forwarding-change-validation"
BASE_NAME = "base"
BASE_PATH = "networks/forwarding-change-validation/base"

bf_set_network(NETWORK_NAME)
bf_init_snapshot(BASE_PATH, name=BASE_NAME, overwrite=True)

'base'

Let's get started.

(RUN)

As usual, we'll begin by initializing the current snapshot of our network in Batfish.

Batfish will parse the configuration files, build a model of the network, and automatically compute the RIBs and FIBs.

# Change Scenario 1: Costing out `core1`
![example-network](https://raw.githubusercontent.com/batfish/pybatfish/master/jupyter_notebooks/networks/forwarding-change-validation/differential%20forwarding%20network.png)

We're going to validate two different changes to our network.

The network has two host devices connected to `leaf1` -- a webserver and a database server.

On the other end, we have two border routers that connect to the external network.

The network is provisioned with failover redundancy for the core routers. All traffic is normally routed through `core1` but will automatically switch to  `core2` in case of a failure or during maintenance.

In this scenario, we want to service `core1` so we're going to shift traffic to `core2`. 

We'll implement a change to cost out `core1`, and verify that it does not affect reachability to the hosts from the external network.

In [None]:
# Run traceroutes from border interfaces to hosts
answer = bfq.traceroute(
    startLocation="enter(border.*[GigabitEthernet0/0])",
    headers=HeaderConstraints(dstIps="ofLocation(host.*)")
).answer(snapshot=BASE_NAME)
show(answer.frame())

First, let's test the current forwarding behavior using the `traceroute` question.

We want to see how external-to-host traffic is routed, so:
- we'll use the dstIps parameter to specify traffic destined to the host devices, and
- the startLocation parameter to specify traffic entering the network through the border routers' external interfaces

(RUN)

The results include a flow from each border router, and all possible paths of each flow. As we can see in the `Traces` column, both flows are routed through `core1`, which is all we're concerned with here.  For more information about the `traceroute` question, see our previous video in this series, `Introduction to Forwarding Analysis`.

For now, having confirmed that traffic is being routed through `core1` as expected, let's go ahead with our change to cost it out so that traffic is routed through `core2` instead.

### Initialize the change snapshot


```
$ diff -r base/ change1/
diff -r base/configs/core1.cfg change1/configs/core1.cfg
68c68
<  ip ospf cost 1
---
>  ip ospf cost 500
73c73
<  ip ospf cost 1
---
>  ip ospf cost 500
78c78
<  ip ospf cost 1
---
>  ip ospf cost 500
83c83
<  ip ospf cost 1
---
>  ip ospf cost 500
```

In [11]:
CHANGE1_NAME = "change"
CHANGE1_PATH = "networks/forwarding-change-validation/change1"

bf_init_snapshot(CHANGE1_PATH, name=CHANGE1_NAME, overwrite=True)

'change'

To cost out core1, we'll raise the OSPF cost of each interface from 1 to 500.
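
To see why raising the interface costs moves traffic, here is a minimal sketch of OSPF-style shortest-path selection. It is not how Batfish computes routes. The topology is a simplified, hypothetical version of our network, and every cost other than the 1 and 500 on `core1` is an assumption made for illustration (in particular, `core2` is given a cost of 10 so that it acts as the backup path).

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra over {node: {neighbor: cost}}; returns (total_cost, path) or None."""
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, link_cost in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None

def build_topology(core1_cost):
    """Directed graph where each edge cost is the sender's outgoing interface cost."""
    other_cost = 1   # assumed cost on border and leaf interfaces
    core2_cost = 10  # assumed: core2 is the less-preferred backup
    return {
        "border1": {"core1": other_cost, "core2": other_cost},
        "core1": {"border1": core1_cost, "leaf1": core1_cost},
        "core2": {"border1": core2_cost, "leaf1": core2_cost},
        "leaf1": {"core1": other_cost, "core2": other_cost},
    }

base_result = shortest_path(build_topology(core1_cost=1), "border1", "leaf1")
change_result = shortest_path(build_topology(core1_cost=500), "border1", "leaf1")
print("base:  ", base_result)
print("change:", change_result)
```

With these assumed costs, the base path goes through `core1` and the changed path goes through `core2`. A sketch like this only checks route selection in a toy model, which is why we still validate the real configurations in Batfish.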

(FRAGMENT)

Now we'll take the change and initialize a new candidate snapshot in Batfish. 

(RUN)

We now have two snapshots in Batfish we can compare. Remember, this change exists only in Batfish -- we won't push anything until we have full confidence that it's correct.

### Change validation process
1. No traffic is routed through `core1`
2. External-to-host traffic is unaffected

Now we're ready to validate our change. 

We'll do this using a two-step process:
- Step 1 validates that the change has the intended effect.
  - No outside-to-host traffic is routed through `core1`
- Step 2 validates that it has no unintended effects. 
  - In this case, the change should not affect reachability of external-to-host traffic.
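
If you want to automate this process, both steps reduce to the same check: the query should return no rows. Here is a minimal sketch of such a gate. The function name `validate_change` is hypothetical, and plain lists stand in for the answer rows. Since `answer.frame()` returns a pandas DataFrame, `len()` should work on it the same way.

```python
def validate_change(intended_counterexamples, unintended_differences):
    """Collect problems from both validation steps; an empty list means the change passed."""
    problems = []
    # Step 1: any counterexample means the intended effect was not achieved.
    if len(intended_counterexamples) > 0:
        problems.append(
            f"step 1 failed: {len(intended_counterexamples)} counterexample(s)"
        )
    # Step 2: any differential row means something changed that should not have.
    if len(unintended_differences) > 0:
        problems.append(
            f"step 2 failed: {len(unintended_differences)} unexpected difference(s)"
        )
    return problems

# Toy inputs standing in for answer rows: step 1 is clean, step 2 found one difference.
print(validate_change([], ["example differing flow"]))
```

Returning a list of problems instead of raising on the first failure lets a pipeline report both steps in a single run.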


### Step 1: No outside-to-host traffic is routed through core1

In [None]:
answer = bfq.reachability(
    pathConstraints=PathConstraints(
        startLocation="enter(border.*[GigabitEthernet0/0])",
        transitLocations="core1"),
    headers=HeaderConstraints(dstIps="ofLocation(host.*)", srcIps="0.0.0.0/0"),
    actions="SUCCESS,FAILURE"
).answer(snapshot=CHANGE1_NAME)
show(answer.frame())

To verify that **no** outside-to-host traffic is routed through `core1`, we search for counterexamples: outside-to-host traffic that *is* routed through `core1`. If no counterexamples are found, we are *guaranteed* that `core1` is never used. 

We see three new parameters here:
- The `srcIps` parameter tells Batfish to search all possible source IP addresses. By default, Batfish will search source IPs appropriate to each start location.
- The `transitLocations` parameter restricts the search to flows that can transit `core1`.
- The `actions` parameter tells Batfish to include flows that are dropped as well as those that are successfully delivered. By default, only successfully delivered flows are included.

Let's run this query.

(RUN)

Good! The `reachability` question was unable to find any outside-to-host flow that transited `core1`, showing that we have successfully costed it out.


### Step 2: Outside-to-host traffic is unaffected.

In [None]:
answer = bfq.differentialReachability(
    pathConstraints=PathConstraints(startLocation="enter(border.*[GigabitEthernet0/0])"),
    headers=HeaderConstraints(dstIps="ofLocation(host.*)", srcIps="0.0.0.0/0"),
    actions="SUCCESS"
).answer(
    snapshot=CHANGE1_NAME,
    reference_snapshot=BASE_NAME)
show(answer.frame())

Now let's validate that costing out core1 has no unintended effects. 

To do that, we'll compare the forwarding behavior of the change snapshot against the original using the differentialReachability question.

Batfish will search for flows that match the specified criteria in one of the snapshots but not the other. 
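
Conceptually, the comparison looks like the sketch below. This is only an illustration over a few made-up flows and dispositions. Batfish itself reasons about the entire space of flows rather than enumerating them one by one, and the function name `differential` is hypothetical.

```python
def differential(snapshot, reference, matches):
    """Flows for which `matches(disposition)` holds in exactly one of the two snapshots."""
    differences = []
    for flow in sorted(set(snapshot) | set(reference)):
        in_snapshot = matches(snapshot.get(flow))
        in_reference = matches(reference.get(flow))
        if in_snapshot != in_reference:
            differences.append((flow, snapshot.get(flow), reference.get(flow)))
    return differences

def is_success(disposition):
    """Treat only delivered flows as matching, similar in spirit to actions="SUCCESS"."""
    return disposition == "ACCEPTED"

# Made-up data: flow name -> disposition in each snapshot.
reference = {"flow-a": "ACCEPTED", "flow-b": "ACCEPTED", "flow-c": "DENIED_IN"}
snapshot = {"flow-a": "ACCEPTED", "flow-b": "NO_ROUTE", "flow-c": "DENIED_IN"}

for flow, new, old in differential(snapshot, reference, is_success):
    print(f"{flow}: {old} in reference, {new} in snapshot")
```

Only `flow-b` is reported here, because it is the only flow whose success status differs between the two snapshots. Flows that behave the same way in both, whether delivered or dropped, are not differences.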

There should be no differences, since external-to-host reachability should be the same going through core2 as it is going through core1. 

(RUN)

Unfortunately, these results show that there is a difference in reachability: some traffic that was being delivered in the reference snapshot (before the change) is being *null routed* in the change snapshot. 

If we had deployed this change, there would have been a loss of connectivity. Thankfully, Batfish was able to identify the bug *before* we deployed the change.

The results include an example flow from each border router. Each flow comes with detailed traces of all the paths it can take through the network, in each snapshot. These traces help us to quickly diagnose the problem: `core2` has an old static route for `2.128.1.1/32` that should have been removed. A similar problem could occur with out-of-date ACLs along the path through `core2`. Batfish would find those problems as well.

Having identified and diagnosed the problem, we can remove the bad static route in an updated change snapshot.

In [3]:
CHANGE1_FIXED_NAME = "change-fixed"
CHANGE1_FIXED_PATH = "networks/forwarding-change-validation/change1-fixed"
bf_init_snapshot(CHANGE1_FIXED_PATH, name=CHANGE1_FIXED_NAME, overwrite=True)

'change-fixed'

In [4]:
answer = bfq.reachability(
    pathConstraints=PathConstraints(
        startLocation="enter(border.*[GigabitEthernet0/0])",
        transitLocations="core1"),
    headers=HeaderConstraints(dstIps="ofLocation(host.*)", srcIps="0.0.0.0/0"),
    actions="SUCCESS,FAILURE"
).answer(snapshot=CHANGE1_FIXED_NAME)
show(answer.frame())

Unnamed: 0,Flow,Traces,TraceCount


In [5]:
answer = bfq.differentialReachability(
    pathConstraints=PathConstraints(startLocation="enter(border.*[GigabitEthernet0/0])"),
    headers=HeaderConstraints(dstIps="ofLocation(host.*)", srcIps="0.0.0.0/0")
).answer(
    snapshot=CHANGE1_FIXED_NAME,
    reference_snapshot=BASE_NAME)
show(answer.frame())

Unnamed: 0,Flow,Snapshot_Traces,Snapshot_TraceCount,Reference_Traces,Reference_TraceCount


(RUN)

First, we'll initialize the fixed snapshot.

(FRAGMENT)

Then we'll repeat step 1 to confirm that `core1` is still costed out.

(RUN)

Great!

(FRAGMENT)

Now let's repeat step 2 to make sure there are no other problems we need to fix.

(RUN)

Great! There are no other bugs, so this change can be safely pushed to the network.

## Change 1 Summary


Let's recap the steps we took to validate this change:

1. The change has the intended effect.
 - No traffic is routed through `core1`.

Step 1 validated that the change does what it's supposed to: move traffic from core1 to core2.

2. The change has no unintended effects.
 - All outside-to-host traffic is unaffected.

Step 2 validated that our change doesn't break anything: the end-to-end behavior of the network is unaffected.

Together, these steps give us complete confidence that we can safely cost out and service core1.

# Change Scenario 2: Validating the end-to-end impact of an ACL change
![example-network](https://raw.githubusercontent.com/batfish/pybatfish/master/jupyter_notebooks/networks/forwarding-change-validation/differential%20forwarding%20network.png)

Now let's validate another change to the same network. Unlike the previous scenario, this time we **do** want to affect end-to-end reachability, and we will use the same validation process to ensure that our change has the intended effect, *and* that it has no *unintended* effects.

For this scenario, imagine that we have developed and tested a new web service on the webserver, and are now ready to open it to the outside world. 

### Trace HTTP traffic to host-www from outside the network. 

In [None]:
answer = bfq.traceroute(
    startLocation="enter(border.*[GigabitEthernet0/0])",
    headers=HeaderConstraints(dstIps="ofLocation(host-www)", applications="HTTP")
).answer(snapshot=BASE_NAME)
show(answer.frame())

Let's start by using the `traceroute` question to confirm that the web service is not currently accessible from outside the network. 

Our parameters specify:
- HTTP traffic
- to the webserver
- from the external network.

(RUN)

As you can see, the flow is dropped by the ingress ACL `OUTSIDE_TO_INSIDE` on each border router. This is where we'll make our change. 


```
ip access-list extended OUTSIDE_TO_INSIDE
 permit tcp any 2.128.0.0 0.0.1.255 eq ssh
 permit udp any 2.0.0.0 0.255.255.255
 deny ip any any
```

This snippet shows the original ACL definition.

The first line filters traffic to the host subnet.

Currently, SSH is the only TCP traffic permitted.

```
ip access-list extended OUTSIDE_TO_INSIDE
 permit tcp any 2.128.0.0 0.0.1.255 eq ssh
 permit tcp any 2.128.0.0 0.0.1.255 eq www
 permit udp any 2.0.0.0 0.255.255.255
 deny ip any any
```

Here's the updated version of the ACL.

We've added this second line to permit HTTP traffic,

and we're opening it to the entire subnet, because we already filter per-host on the leaf router.
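
To make the wildcard mask concrete, here is a minimal sketch of first-match evaluation for the updated ACL. It handles only the fields used in this ACL and is not a full IOS ACL evaluator. The host addresses `2.128.0.1` and `2.128.1.1` are assumptions chosen to fall in the two halves of the host subnet.

```python
import ipaddress

def wildcard_match(address, base, wildcard):
    """Cisco-style wildcard match: bits set in the wildcard are ignored."""
    addr = int(ipaddress.IPv4Address(address))
    base_int = int(ipaddress.IPv4Address(base))
    care = ~int(ipaddress.IPv4Address(wildcard)) & 0xFFFFFFFF
    return (addr & care) == (base_int & care)

# (action, protocol, destination base, destination wildcard, destination port)
UPDATED_ACL = [
    ("permit", "tcp", "2.128.0.0", "0.0.1.255", 22),   # eq ssh
    ("permit", "tcp", "2.128.0.0", "0.0.1.255", 80),   # eq www (the new line)
    ("permit", "udp", "2.0.0.0", "0.255.255.255", None),
    ("deny", "ip", "0.0.0.0", "255.255.255.255", None),  # deny ip any any
]
ORIGINAL_ACL = [line for line in UPDATED_ACL if line[4] != 80]

def evaluate(acl, protocol, dst_ip, dst_port):
    """Return the action of the first matching line; the source is always 'any' here."""
    for action, proto, base, wildcard, port in acl:
        if proto not in ("ip", protocol):
            continue
        if port is not None and port != dst_port:
            continue
        if wildcard_match(dst_ip, base, wildcard):
            return action
    return "deny"  # implicit deny at the end of every ACL

for dst in ("2.128.0.1", "2.128.1.1", "2.128.2.1"):  # assumed addresses
    print(dst, evaluate(ORIGINAL_ACL, "tcp", dst, 80), "->", evaluate(UPDATED_ACL, "tcp", dst, 80))
```

The wildcard `0.0.1.255` ignores the low 9 bits, so the new line covers all of `2.128.0.0/23`. HTTP to any address in that range is permitted at the border, and anything narrower has to be enforced further along the path.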

### Trace HTTP traffic to host-www from outside the network.

In [None]:
# Initialize the change snapshot
CHANGE2_NAME = "change2"
CHANGE2_PATH = "networks/forwarding-change-validation/change2"
bf_init_snapshot(CHANGE2_PATH, name=CHANGE2_NAME, overwrite=True)

answer = bfq.traceroute(
    startLocation="enter(border.*[GigabitEthernet0/0])",
    headers=HeaderConstraints(dstIps="ofLocation(host-www)", applications="HTTP")
).answer(snapshot=CHANGE2_NAME)
show(answer.frame())

Let's test this change with the same `traceroute` command we used before.

(RUN)

Good. We can see that HTTP traffic can now reach the webserver from the external network through either border router. 

This gives us some confidence that the change is correct, but we don't yet have complete confidence.

Normally, we'd independently validate the change to each border router ACL following the steps outlined in the `Provably Safe ACL and Firewall Changes` video.

However, in the interest of time, I'll skip those steps, and proceed to validating the end-to-end network behavior. 

### Change validation process
1. HTTP traffic from outside the network can reach `host-www`.
2. No other traffic is affected.

As before, I'll validate our change in two steps:
1. Step 1 validates that the change has the intended effect.
  - Our web service is now available from the external network.
2. Step 2 validates that the change has no unintended effects.

### Step 1: HTTP traffic from outside the network can reach host-www

In [None]:
answer = bfq.reachability(
    pathConstraints=PathConstraints(startLocation="enter(border.*[GigabitEthernet0/0])"),
    headers=HeaderConstraints(dstIps="ofLocation(host-www)", srcIps="0.0.0.0/0", applications="HTTP"),
    actions="FAILURE"
).answer(snapshot=CHANGE2_NAME)
show(answer.frame())

`traceroute` showed that after our change, *some* external traffic can reach the webservice.

Step 1 will validate that *all* such traffic can reach it. 

We'll do that by searching for flows that *fail* to reach the webserver, by setting the `actions` parameter to `FAILURE`.

(RUN) 

Good! Batfish comprehensively searched all external flows destined to the webservice, and verified that all will be delivered successfully.

### Step 2: No other traffic is affected

In [None]:
answer = bfq.differentialReachability(
    pathConstraints=PathConstraints(startLocation="enter(border.*[GigabitEthernet0/0])"),
    headers=HeaderConstraints(dstIps="ofLocation(host-www)", srcIps="0.0.0.0/0", applications="HTTP"),
    invertSearch=True
).answer(snapshot=CHANGE2_NAME, reference_snapshot=BASE_NAME)
show(answer.frame())

Step 2 validates that no other network behavior has changed.

As before, we'll use the `differentialReachability` question to compare our change snapshot against the original. 

This time, we know *some* traffic has changed -- namely external traffic to the webservice.

We'll ask Batfish to search for differences that affect *any other traffic*, using the `invertSearch` parameter.

This directs Batfish to search outside the specified header space instead of within it.
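
The idea behind an inverted search can be sketched with a tiny, enumerated flow space. Everything here is made up for illustration, including the `host-other` destination, the set of changed flows, and the `search` helper. Batfish works over the full header space rather than a short list.

```python
from itertools import product

def search(flows, in_header_space, changed, invert_search=False):
    """Changed flows inside the header space, or outside it when invert_search is True."""
    return [
        flow for flow in flows
        if changed(flow) and (in_header_space(flow) != invert_search)
    ]

# A toy flow space of (destination, application) pairs.
flows = list(product(["host-www", "host-other"], ["HTTP", "SSH"]))

def in_header_space(flow):
    """The header space we expect to change: HTTP traffic to host-www."""
    return flow == ("host-www", "HTTP")

# Hypothetical set of flows whose behavior differs between the two snapshots.
changed_flows = {("host-www", "HTTP"), ("host-other", "HTTP")}

def changed(flow):
    return flow in changed_flows

print("expected changes:  ", search(flows, in_header_space, changed))
print("unexpected changes:", search(flows, in_header_space, changed, invert_search=True))
```

An empty result from the inverted search means every difference falls inside the header space we meant to change. Any flow it returns is a side effect worth investigating.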

(RUN)

Unfortunately, our change had a broader impact than we intended, and Batfish gives us some counterexample flows.

In this case, we see a flow that was previously dropped by the border router ACLs, but is now being delivered to the database server.

We can also see that this is HTTP traffic, which is supposed to be blocked at the leaf router. 

So we've quickly diagnosed the problem: The leaf router is not properly filtering traffic to the hosts: it permits HTTP to both hosts, rather than just the webserver.

### Step 1 (again): HTTP traffic from outside the network can reach host-www

In [8]:
CHANGE2_FIXED_NAME = "change2-fixed"
CHANGE2_FIXED_PATH = "networks/forwarding-change-validation/change2-fixed"
bf_init_snapshot(CHANGE2_FIXED_PATH, name=CHANGE2_FIXED_NAME, overwrite=True)

'change2-fixed'

In [9]:
answer = bfq.reachability(
    pathConstraints=PathConstraints(startLocation="enter(border.*[GigabitEthernet0/0])"),
    headers=HeaderConstraints(dstIps="ofLocation(host-www)", srcIps="0.0.0.0/0", applications="HTTP"),
    actions="FAILURE"
).answer(snapshot=CHANGE2_FIXED_NAME)
show(answer.frame())

Unnamed: 0,Flow,Traces,TraceCount


In [10]:
answer = bfq.differentialReachability(
    pathConstraints=PathConstraints(startLocation="enter(border.*[GigabitEthernet0/0])"),
    headers=HeaderConstraints(dstIps="ofLocation(host-www)", srcIps="0.0.0.0/0", applications="HTTP"),
    invertSearch=True
).answer(snapshot=CHANGE2_FIXED_NAME, reference_snapshot=BASE_NAME)
show(answer.frame())

Unnamed: 0,Flow,Snapshot_Traces,Snapshot_TraceCount,Reference_Traces,Reference_TraceCount


Having fixed the bad ACL on the leaf router, we'll initialize an updated change snapshot. 

(RUN)

Let's validate *this* snapshot.

(FRAGMENT)

(RUN) 

As before, step 1 validates that all HTTP flows to the webserver will be delivered.

(FRAGMENT)

(RUN)

Great! This time, step 2 validates that no other flows are affected by the change.

## Change 2 Summary

Let's recap the steps we took to verify the change in this scenario:

1. The change had the intended effect.
 - HTTP traffic from outside the network can reach host-www

First, we verified that the intent of the change is achieved: our webservice is now reachable from the outside world.

2. The change had no unintended effects.
 - No other external-to-host traffic is affected.

Second, we verified that there were no unintended effects to external-to-host traffic. As in the previous scenario, we used the `differentialReachability` question to compare the forwarding behavior of our change snapshot against the original.

# Thanks!

Want to learn more? Come find us on Slack and GitHub!

Thanks for watching this demo of network forwarding change validation using Batfish. You've seen how Batfish can help provide full-scale, comprehensive change validation proactively, so you can deploy changes to your network with complete confidence.

If you want to learn more about how to use Batfish for validating changes to your network, or other ways Batfish can help you build and maintain your network, please leave a comment, or come find us on Slack or GitHub. You'll find links in the description.

And don't forget to check out the other videos in the Getting to know Batfish series.

Thanks!