# Network Layer Services

Layer|Message
---|---
application|message
transport|segment
network|datagram
link|frame

- transport segment from sending to receiving host
- on sending side: encapsulates segments into datagrams
- on receiving side: delivers segments to transport layer
- network layer protocols reside in **every host, router**
- router examines header fields in all IP datagrams passing through it

## Two key network-layer functions

- **forwarding**: move packets from router's input to appropriate router output
- **routing**: determine route taken by packets from source to destination
    - routing algorithms

# Router Architecture

- two key router functions
    - run routing algorithms/protocol (RIP, OSPF, BGP)
    - forwarding datagrams from incoming to outgoing link

<img src="img/Snip20191028_40.png" width=80%/>

# Input Port Functions

- line termination
- link layer protocol (receiver)
- decentralized switching (lookup, forwarding, queueing)

<img src="img/Snip20191028_41.png" width=80%/>

## Input Port Queueing

- fabric slower than input ports combined -> queueing may occur at input queues
    - queueing delay and loss due to input buffer overflow
- **Head-of-the-Line (HOL) blocking**: queued datagram at front of queue prevents others in queue from moving forward

<img src="img/Snip20191028_46.png" width=80%/>

# Switching Fabrics

- transfer packet from input buffer to appropriate output buffer
- switching rate: rate at which packets can be transferred from inputs to outputs
    - often measured as multiple of input/output line rate
    - N inputs: switching rate $N$ times line rate desirable

<img src="img/Snip20191028_42.png" width=80%/>

## Switching via Memory

- first generation routers
    - traditional computers with switching under direct control of CPU
    - packet copied to system's memory
    - speed limited by memory bandwidth (2 bus crossings per datagram)

<img src="img/Snip20191028_43.png" width=80%/>

## Switching via a Bus

- datagram from input port memory to output port memory via a shared bus
- **bus contention**: switching speed limited by bus bandwidth


## Switching via Interconnection Network
- overcome bus bandwidth limitations
- banyan networks, crossbar, other interconnection nets initially developed to connect processors in multiprocessor
- advanced design: fragmenting datagram into fixed length cells, switch cells through the fabric


# Output Ports

<img src="img/Snip20191028_44.png" width=80%/>

- datagrams (packets) can be lost due to congestion, lack of buffers
- priority scheduling - who gets best performance, network neutrality
- **buffering** required when datagrams arrive from fabric faster than the transmission rate
- **scheduling discipline** chooses among queued datagrams for transmission

## Output Port Queueing

<img src="img/Snip20191028_45.png" width=80%/>

- buffering when arrival rate via switch exceeds output line speed
- *queueing (delay)* and *loss* due to output port buffer overflow

### Find Required Buffering for a Router's Output Interface

- we want to keep the output link utilization near 100%

- **RFC 3439**: average buffering equal to "typical" RTT (say 250 ms) times link capacity C
    - e.g., C = 10 Gbps link, => 2.5 Gbit buffer

- recent recommendation: with N flows, buffering equal to **${RTT \times C\over \sqrt{N}}$**
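
The two sizing rules above can be compared numerically. The following is a minimal sketch; the function names and the flow count of 10,000 are illustrative assumptions, not values from the notes.

```python
import math

def buffer_rule_of_thumb(rtt_s: float, capacity_bps: float) -> float:
    """Classic rule of thumb: buffer size in bits = RTT * C."""
    return rtt_s * capacity_bps

def buffer_with_flows(rtt_s: float, capacity_bps: float, n_flows: int) -> float:
    """Newer recommendation: buffer size in bits = RTT * C / sqrt(N)."""
    return rtt_s * capacity_bps / math.sqrt(n_flows)

# RTT = 250 ms and C = 10 Gbps, as in the example above.
classic = buffer_rule_of_thumb(0.250, 10e9)
# N = 10,000 flows is an assumed value chosen only for illustration.
shared = buffer_with_flows(0.250, 10e9, 10_000)

print(f"RTT * C           = {classic / 1e9:.2f} Gbit")
print(f"RTT * C / sqrt(N) = {shared / 1e6:.1f} Mbit")
```

With these assumed inputs the second rule asks for a buffer 100 times smaller, since $\sqrt{10000} = 100$.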

# The Internet Network Layer

<img src="img/Snip20191028_47.png" width=80%/>

# IP Datagram/Packet Format

<img src="img/Snip20191028_48.png" width=40%/>

- **Version(4 bits)**: 4 (= 0100)
- **Header Length (4 bits)**: unit is 4 bytes (e.g., a 20-byte header has this field equal to 5 (0101))
- **DS field (Differentiated Services / 8 bits)**: Type of data carried
    - 6-bit DSCP (Differentiated Services Code Point) + 2-bit ECN (Explicit Congestion Notification)
    - DSCP = 46 -> high priority; DSCP = 0 -> low priority
- **length (16 bits)**: packet length in bytes
- **16-bit identifier**: a long IP packet is **fragmented** into smaller packets, all those small packets carry the *same* 16-bit-ID
- **flags (3-bit)**: <Not used, Don't frag., More frags. to follow>
- **fragment offset (13 bits)**: gives the position of the fragment in the original packet
- **TTL (8 bits)**: Time to live
    - max number of the remaining hops
    - TTL is decremented by 1 at each router
    - TTL = 0 -> Router discards the packet
- **Upper Layer (8 bits)**: upper layer protocol to deliver payload to, e.g., TCP = 6, UDP = 17
- **Header Checksum (16 bits)**: to detect bit errors in **packet header** (errors in data are ignored)
- **Source IP address (32 bits)**: 32-bit IP address of the node that originally created the packet
- **Destination IP address (32 bits)**: 32-bit IP address of the destination node of the packet
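
As a rough illustration of the field layout above, the sketch below unpacks a 20-byte IPv4 header with Python's `struct` module. The function name `parse_ipv4_header` and the dictionary keys are hypothetical; the sample bytes are the header used in the checksum example later in these notes. It assumes a header without options.

```python
import socket
import struct

def parse_ipv4_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header (options are ignored)."""
    (ver_ihl, ds, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,                    # upper 4 bits
        "header_len_bytes": (ver_ihl & 0x0F) * 4,   # field counts 4-byte units
        "dscp": ds >> 2,                            # upper 6 bits of the DS byte
        "ecn": ds & 0x03,                           # lower 2 bits of the DS byte
        "total_length": total_len,
        "identifier": ident,
        "dont_fragment": bool(flags_frag & 0x4000),
        "more_fragments": bool(flags_frag & 0x2000),
        # the 13-bit offset field counts 8-byte units
        "fragment_offset_bytes": (flags_frag & 0x1FFF) * 8,
        "ttl": ttl,
        "protocol": proto,                          # e.g., 6 = TCP, 17 = UDP
        "checksum": checksum,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }

header = bytes.fromhex("4500 0073 0000 4000 4011 b861 c0a8 0001 c0a8 00c7")
fields = parse_ipv4_header(header)
for name, value in fields.items():
    print(f"{name}: {value}")
```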


## Header Checksum Calculation (Sender)

- example: IP header in Hex with checksum set to 0000
    - 4500 0073 0000 4000 4011 **0000** c0a8 0001 c0a8 00c7
    - step 1: add all 16-bit blocks of the header
        - result = 2 479c
        - add the carry (2) to the rest (479c) to get
            - temp = 479e
    - step 2: take 1's complement of temp to get the checksum
        - checksum = 1's complement (temp) = b861
    - header with checksum 
        - 4500 0073 0000 4000 4011 **b861** c0a8 0001 c0a8 00c7
        - the IP packet will be sent with this header

## Header Checksum Recalculation (Receiver)
- assume that the header is received without bit error
    - 4500 0073 0000 4000 4011 **b861** c0a8 0001 c0a8 00c7
- step 1: add all 16-bit blocks of the header
    - result = 2 fffd
    - add the carry(2) to fffd to get ffff
- step 2: take 1's complement of the result from step 1
    - 1's complement of ffff = 0000
- step 3: if the result from step 2 is **0000**, then there is no error; otherwise there is a bit-error, drop the packet
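
The sender and receiver steps above can be reproduced with a short sketch. The function names are hypothetical, and the code assumes the header has an even number of bytes (true for IPv4, whose header length is a multiple of 4).

```python
def ones_complement_sum(header: bytes) -> int:
    """Add all 16-bit blocks, folding any carry back into the low 16 bits."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def ipv4_checksum(header: bytes) -> int:
    """1's complement of the folded sum."""
    return ~ones_complement_sum(header) & 0xFFFF

# Sender: checksum field (bytes 10-11) set to 0000 before computing.
sender = bytes.fromhex("4500 0073 0000 4000 4011 0000 c0a8 0001 c0a8 00c7")
checksum = ipv4_checksum(sender)
print(f"checksum = {checksum:04x}")

# Receiver: recompute over the header including the checksum; 0000 means no error.
received = sender[:10] + checksum.to_bytes(2, "big") + sender[12:]
print(f"receiver result = {ipv4_checksum(received):04x}")
```

Flipping any single bit in `received` makes the receiver's result nonzero, which is how header corruption is detected.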

# IP Fragmentation & Reassembly

- network links have MTU (maximum transmission unit): largest possible link-level frame
    - different link types have different MTUs
- large IP datagrams are divided (fragmented) within the network layer
    - one datagram becomes several datagrams
    - reassembled only at final destination
    - IP header bits used to identify, order related fragments
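
A minimal sketch of how a datagram might be split is given below. The function name `fragment` is hypothetical, and the 4000-byte datagram over a 1500-byte MTU is an assumed example. It relies on the fact that the fragment offset field counts 8-byte units, so every fragment except the last carries a multiple of 8 data bytes.

```python
def fragment(total_length: int, mtu: int, header_len: int = 20):
    """Return (fragment_length, offset_in_8_byte_units, more_fragments) tuples."""
    payload = total_length - header_len
    # Largest data size per fragment, rounded down to a multiple of 8 bytes.
    max_data = (mtu - header_len) // 8 * 8
    fragments = []
    offset = 0
    while offset < payload:
        data = min(max_data, payload - offset)
        more = offset + data < payload   # "more fragments" flag
        fragments.append((header_len + data, offset // 8, more))
        offset += data
    return fragments

frags = fragment(4000, 1500)
for length, offset, more in frags:
    print(f"length={length} offset={offset} MF={int(more)}")
```

All fragments would carry the same 16-bit identifier, so the destination can group them and use the offsets to restore the original order.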

<img src="img/Snip20191101_105.png" width=60%/>

<img src="img/Snip20191101_106.png" width=80%/>

<img src="img/Snip20191101_107.png" width=80%/>


# IP Addresses

- in IPv4, an IP address is 32 bits long
- L2 switches do not have IP addresses
- end-devices generally have one IP address each
- routers have multiple IP addresses, 1+ for each physical interface

<img src="img/Snip20191101_108.png" width=30%/>

# Network Prefix

<img src="img/Snip20191101_109.png" width=80%/>

- all the IP addresses in one network have a common portion in their most significant bits
    - the common portion is known as the **Network Prefix**

<img src="img/Snip20191101_110.png" width=40%/>

# Classful Addressing 

- classful addressing is characterized by *fixed length prefixes*
    - Class A: addr begins with 0 and it is of the form: <Prefix.Host.Host.Host>, prefix length is 8 bits
    - Class B: addr begins with 10 and it is of the form: <Prefix.Prefix.Host.Host>, prefix length is 16 bits
    - Class C: addr begins with 110, <Prefix.Prefix.Prefix.Host>, prefix length is 24 bits
- drawback: fixed number of networks and fixed number of addresses per network

# Classless Addressing

- CIDR: Classless InterDomain Routing
    - subnet portion (prefix portion) of addresses is of arbitrary length
- VLSM: Variable Length Subnet Mask

<img src="img/Snip20191101_111.png" width=60%/>

# Network Prefix Notation

- given 129.97.0.0/16
    - 16 is the number of bits used for the network prefix starting from the MSB
- routers use the following notation
    - destination network address: 129.97.0.0
    - network mask: 11111111.11111111.00000000.00000000: 255.255.0.0
    - destination network address: 129.97.0.0/255.255.0.0

- **Network ID**: appears in routing tables
    - individual host IP addresses do NOT
- **Broadcast Address**: used to perform IP-level broadcast
    - send an IP packet with a Broadcast Address as the destination, the IP packet is delivered to all the nodes on the network

- **example**: IP 10.32.8.84/17
    - network prefix: 10.32.0.0
    - host ID (15 bits): 000 1000.0101 0100
    - broadcast IP (host ID bits all 1's, i.e., 111 1111.1111 1111): 10.32.127.255

- IP address: network prefix + host ID
    - network ID: network prefix + host ID all 0's
    - broadcast IP: network prefix + host ID all 1's

<img src="img/Snip20191104_118.png" width=80%/>

- consider the network 10.2.5.16/28
    - /28 = 11111111.11111111.11111111.11110000 = 255.255.255.240
    - the **first** address in the net 10.2.5.16/28 is 10.2.5.00010000 = 10.2.5.16
        - this is the **network ID**
    - the next address is 10.2.5.00010001 = 10.2.5.17
        - this address is **assigned to router**
    - the next is 10.2.5.00010010 = 10.2.5.18, ..., all the way up to the second last address in the net 10.2.5.00011110 = 10.2.5.30
        - these addresses are **assigned to hosts**
    - the **last** address is 10.2.5.00011111 = 10.2.5.31
        - this is the **broadcast address**
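
The two worked examples above can be checked with Python's standard `ipaddress` module. This is a small sketch; the helper name `describe` and its dictionary keys are hypothetical.

```python
import ipaddress

def describe(cidr: str) -> dict:
    """Derive mask, network ID, broadcast, and usable range from address/prefix."""
    net = ipaddress.ip_interface(cidr).network
    return {
        "mask": str(net.netmask),
        "network_id": str(net.network_address),      # host ID all 0's
        "broadcast": str(net.broadcast_address),     # host ID all 1's
        "first_usable": str(net.network_address + 1),
        "last_usable": str(net.broadcast_address - 1),
    }

for cidr in ("10.32.8.84/17", "10.2.5.16/28"):
    print(cidr, describe(cidr))
```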

# Public vs Private IP Address

- public
    - globally unique
    - routers everywhere can recognize these addresses
    - any host can open a TCP connection with a machine with a public IP address
- private
    - not globally unique, but unique within an organization
    - routers outside the organization cannot recognize these
    - if a host has a private IP address, a host outside the organization cannot open a TCP connection with it

- the advantages of having public/private IP addresses are
    - reuse of IP addresses
    - added security

## Private IP Addresses (RFC 1918)

- 10.0.0.0/8: valid IP addresses are 10.0.0.1 ~ 10.255.255.254
- 172.16.0.0/12: valid IP addresses are 172.16.0.1 ~ 172.31.255.254
- 192.168.0.0/16: valid IP addresses are 192.168.0.1 ~ 192.168.255.254

# Hop-by-hop Routing

## Packet Forwarding

- the *relaying of packets* from one physical interface to another by routers on the Internet
- for a router R to determine the next hop for a given IP packet P, the decision uses the destination address in P and the routing table in R
    - this is called **table-driven** routing
- alternative - source routing: the source host finds the path and puts it in the header, and sends the packet to the next router identified in the path; the next router looks at the path in the packet to further forward it, and so on

## Structure of Routing Table

Destination Address|Mask|Next Hop|Interface|Metric
---|---|---|---|---
IP Address|IP Address|Router or hosts|interface number|

- contains entries from the routing algorithm
- contains the networks directly connected to the router
- contains the default entry

<img src="img/Snip20191104_120.png" width=60%/>

- assume interface 3 directly connects to the network 129.97.8.0/24
- then the routing table at B is given as

Destination Address|Mask|Next Hop|Interface|Metric
---|---|---|---|---
72.2.0.0|255.255.0.0|IP1|1|
129.97.8.0|255.255.255.0|"connected"|3|
0.0.0.0|0.0.0.0|IP2|2|

- the router learned the first entry by running routing protocols
- the second entry indicates the network connected to the router (configurable)
- the third entry is the default entry, 0.0.0.0 indicates all other networks

- the router chooses the next hop for a packet by matching the packet's destination address against the entries in the routing table, and forwards the packet via the corresponding interface

<img src="img/Snip20191104_121.png" width=80%/>

- if $(x.y.z.w \textrm{ AND } a.b.c.d) == (\textrm{IP1 AND }a.b.c.d)$ matching occurs
- if more than one entry matches, **longest prefix matching** breaks the tie: the entry with the longest prefix wins
    - the longer the prefix, the closer it is to the destination

### Matching and Forwarding Algorithm
- inputs
    - IP1: IP address from the packet header
    - RT: routing table
- procedure
    - if matching occurs between IP1 and RT
        - find the matching entry with the longest prefix
        - forward the packet via the appropriate interface
    - else if default entry exists in RT (default 0.0.0.0/0)
        - forward the packet via the appropriate interface
    - else
        - send error message (ICMP message) to the source of the IP packet
        - ICMP: Internet Control Message Protocol

### Longest Prefix Matching
- a longer prefix means a smaller pool of IP addresses in the network; a shorter prefix means a larger pool of IP addresses in the network
- packet forwarding based on the longest prefix matching in each router will eventually deliver the IP packet to the destination host
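
The matching and forwarding procedure can be sketched as follows, using the example routing table at B from earlier. The names `ROUTES`, `to_int`, and `lookup` are hypothetical, and the default entry is treated here as an ordinary entry with a prefix length of 0, which is one simple way to implement it.

```python
import ipaddress

# (destination address, mask, next hop, interface), as in the table at B.
ROUTES = [
    ("72.2.0.0", "255.255.0.0", "IP1", 1),
    ("129.97.8.0", "255.255.255.0", "connected", 3),
    ("0.0.0.0", "0.0.0.0", "IP2", 2),   # default entry
]

def to_int(addr: str) -> int:
    return int(ipaddress.IPv4Address(addr))

def lookup(dst: str, table=ROUTES):
    """Return (next_hop, interface) of the longest matching entry, or None."""
    best, best_len = None, -1
    for network, mask, next_hop, iface in table:
        m = to_int(mask)
        # An entry matches when (dst AND mask) == (network AND mask).
        if (to_int(dst) & m) == (to_int(network) & m):
            prefix_len = bin(m).count("1")
            if prefix_len > best_len:
                best, best_len = (next_hop, iface), prefix_len
    return best   # None: no match and no default, so an ICMP error would be sent

for dst in ("129.97.8.20", "72.2.200.1", "8.8.8.8"):
    print(dst, "->", lookup(dst))
```

A linear scan is used for readability; real routers typically use specialized lookup structures instead.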


# IP Address Configuration

- hardcode
- DHCP: Dynamic Host Configuration Protocol
    - dynamically get address from a server, plug-and-play

# DHCP: Dynamic Host Configuration Protocol

- goal: allow host to dynamically obtain its IP address from network server when it joins network
    - can renew its lease on address in use
    - allows reuse of addresses (only hold address while connected)
    - support for mobile users who want to join network

- procedure
    - host broadcasts **DHCP discover** message (optional)
    - DHCP server responds with **DHCP offer** message (optional)
    - host requests IP address **DHCP request** message
    - DHCP server sends address **DHCP ack** message

# DHCP Client-Server

<img src="img/Snip20191105_123.png" width=80%/>

<img src="img/Snip20191105_124.png" width=80%/>

- DHCP returns
    - the IP address of the first-hop router for client (aka default gateway)
    - name and IP address of DNS server
    - network mask
        - IP address block
        - Network mask

<img src="img/Snip20191105_125.png" width=80%/>

<img src="img/Snip20191105_126.png" width=80%/>


# Example on Subnetting

- the IP block address 192.168.16.0/24 is assigned to a company
- the company has 4 departments and we would like each department to have its own LAN as follows
    - LAN A with 120 hosts
    - LAN B with 50 hosts
    - LAN C with 12 hosts
    - LAN D with 28 hosts
- perform subnetting to give the network ID and prefix of each LAN

LAN | Number of Hosts | Number of Host ID bits | Network ID | Prefix | Broadcast IP
---|---|---|---|---|---
A|120|7|192.168.16.128|25|192.168.16.255
B|50|6|192.168.16.64|26|192.168.16.127
C|12|4|192.168.16.16|28|192.168.16.31
D|28|5|192.168.16.32|27|192.168.16.63
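
The table can be sanity-checked with a short sketch. It assumes two addresses per subnet are reserved (network ID and broadcast) and does not count a router interface separately; the names `host_bits`, `check_plan`, and `PLAN` are hypothetical.

```python
import ipaddress

def host_bits(n_hosts: int) -> int:
    """Smallest h with 2**h - 2 >= n_hosts (network ID and broadcast reserved)."""
    return (n_hosts + 1).bit_length()

def check_plan(block: str, plan: dict) -> dict:
    """Validate a subnetting plan and return each LAN's broadcast address."""
    parent = ipaddress.ip_network(block)
    nets = {}
    for lan, (hosts, cidr) in plan.items():
        net = ipaddress.ip_network(cidr)
        if not net.subnet_of(parent):
            raise ValueError(f"LAN {lan}: {net} is outside {parent}")
        if 32 - net.prefixlen < host_bits(hosts):
            raise ValueError(f"LAN {lan}: {net} is too small for {hosts} hosts")
        nets[lan] = net
    names = list(nets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if nets[a].overlaps(nets[b]):
                raise ValueError(f"LAN {a} and LAN {b} overlap")
    return {lan: str(net.broadcast_address) for lan, net in nets.items()}

PLAN = {
    "A": (120, "192.168.16.128/25"),
    "B": (50, "192.168.16.64/26"),
    "C": (12, "192.168.16.16/28"),
    "D": (28, "192.168.16.32/27"),
}
broadcasts = check_plan("192.168.16.0/24", PLAN)
print(broadcasts)
```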


# Hierarchical Addressing: Address Aggregation (Supernetting)

- Hierarchical addressing allows efficient advertisement of routing information

<img src="img/Snip20191108_202.png" width=80%/>

<img src="img/Snip20191108_203.png" width=80%/>

<img src="img/Snip20191108_204.png" width=80%/>


# NAT: Network Address Translation

- local network uses just one IP address as far as outside world is concerned
- range of addresses not needed from ISP: just one IP address for all devices
- can change addresses of devices in local network without notifying outside world
- can change ISP without changing addresses of devices in local network
- devices inside local network not explicitly addressable, visible by outside world (better for security)

<img src="img/Snip20191108_205.png" width=80%/>

- 3 computers and one internal router interface
    - all 4 interfaces do not have public IP addresses, rather, those are given private IP addresses
    - problem: a computer from outside the home network cannot send an IP packet to the home computers, because the private net is not advertised by routers

<img src="img/Snip20191108_206.png" width=80%/>

- 16-bit port-number field
    - 60,000 simultaneous connections with a single LAN-side address
- NAT is controversial
    - router should only process up to layer 3
    - violates end-to-end argument
        - NAT possibility must be taken into account by app designers, e.g., P2P applications
    - address shortage should instead be solved by IPv6
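
A toy model of the NAT translation table is sketched below. The class `Nat`, its method names, and the addresses and port numbers are all illustrative assumptions; a real NAT also tracks the protocol, remote endpoint, and timeouts, and handles port exhaustion, none of which is modeled here.

```python
import itertools

class Nat:
    """Toy NAT: maps (LAN IP, LAN port) pairs to ports on one WAN-side address."""

    def __init__(self, wan_ip: str, first_port: int = 5001):
        self.wan_ip = wan_ip
        self._ports = itertools.count(first_port)   # next unused WAN-side port
        self._out = {}    # (lan_ip, lan_port) -> wan_port
        self._back = {}   # wan_port -> (lan_ip, lan_port)

    def outbound(self, lan_ip: str, lan_port: int):
        """Rewrite the source of an outgoing packet; create an entry if needed."""
        key = (lan_ip, lan_port)
        if key not in self._out:
            wan_port = next(self._ports)
            self._out[key] = wan_port
            self._back[wan_port] = key
        return self.wan_ip, self._out[key]

    def inbound(self, wan_port: int):
        """Map an incoming destination port back to the LAN host, or None."""
        return self._back.get(wan_port)

nat = Nat("138.76.29.7")
print(nat.outbound("10.0.0.1", 3345))   # new table entry
print(nat.outbound("10.0.0.2", 3345))   # same LAN port, different host
print(nat.inbound(5001))
print(nat.inbound(6000))                # no entry: unsolicited packet is dropped
```

The last call illustrates why hosts behind the NAT are not directly reachable: without an existing table entry there is nothing to translate an incoming packet to.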



# Routing Algorithm Classification

## Global vs Decentralized

- **global**: all routers (within an autonomous system: AS) have complete topology, link cost info
    - **"link state" algorithms**
- **decentralized**: router knows physically-connected neighbors, and the algorithms involve iterative process of computation, as well as exchange of info with neighbors
    - **"distance vector" algorithms**

## Static vs Dynamic
- **static**: routes change slowly over time
- **dynamic**: routes change more quickly
    - periodic update
    - in response to link cost changes


# Autonomous Systems (AS)

<img src="img/Snip20191108_207.png" width=80%/>

- an Autonomous System (AS) is a **set of routers** under a single technical admin, using an *intra-AS routing protocol* to route packets **within the AS**, and using an *inter-AS routing protocol* to route packets to other AS's
    - $AS := \{\textrm{routers}\}$
        - $1+$ protocol for routing within the AS
        - 1 protocol for routing among AS's
        - everything owned by a single admin
- AS's are identified by a 16-bit ID called **AS number**

<img src="img/Snip20191108_208.png" width=80%/>

<img src="img/Snip20191108_209.png" width=60%/>


# Distance Vector Algorithm (for RIP)

- $D_x(y)$: estimate of least cost from $x$ to $y$
    - $x$ maintains distance vector $D_x = [D_x(y): y\in N]$
    - $N$ is the set of all the nodes in the network
- node $x$
    - knows cost to each neighbor $v$: $c(x, v)$
    - maintains its neighbors' distance vectors, for each neighbor $v$, $x$ maintains $D_v = [D_v(y): y\in N]$

- from time-to-time, each node sends its own distance vector estimate to neighbors
- when $x$ receives new DV estimate from neighbor, it updates its own DV using Bellman-Ford:
    - $D_x(y) = \min_v\{c(x,v) + D_v(y)\}$ for each node $y\in N$
- under minor, natural conditions, the estimate $D_x(y)$ converges to the actual least cost $d_x(y)$

- **iterative, asynchronous**: each local iteration caused by
    - local link cost change
    - DV update message from neighbor
- **distributed**: each node notifies neighbors *only* when its DV changes
    - neighbors then notify their neighbors if necessary
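
The Bellman-Ford update can be sketched as below. This is a simplification: the real algorithm is asynchronous, whereas this version runs synchronous rounds in which every node recomputes from a snapshot of its neighbors' vectors. The function name `distance_vector` and the three-node topology with its costs are assumptions for illustration, and the network is assumed to be connected.

```python
INF = float("inf")

def distance_vector(links: dict) -> dict:
    """links maps (a, b) -> cost of an undirected link; returns D_x(y) for all x, y."""
    nodes = sorted({n for edge in links for n in edge})
    cost = {x: {} for x in nodes}
    for (a, b), c in links.items():
        cost[a][b] = c
        cost[b][a] = c
    # Initially each node knows only the cost to itself and its direct neighbors.
    dv = {x: {y: 0 if x == y else cost[x].get(y, INF) for y in nodes}
          for x in nodes}
    changed = True
    while changed:
        changed = False
        # Vectors as "advertised" at the start of this round.
        snapshot = {x: dict(dv[x]) for x in nodes}
        for x in nodes:
            for y in nodes:
                if x == y:
                    continue
                # D_x(y) = min over neighbors v of c(x, v) + D_v(y)
                best = min(cost[x][v] + snapshot[v][y] for v in cost[x])
                if best != dv[x][y]:
                    dv[x][y] = best
                    changed = True
    return dv

dv = distance_vector({("x", "y"): 2, ("y", "z"): 1, ("x", "z"): 7})
for node, vector in dv.items():
    print(node, vector)
```

In this assumed topology, `x` reaches `z` at cost 3 via `y` rather than at cost 7 over the direct link.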

<img src="img/Snip20191108_210.png" width=40%/>

<img src="img/Snip20191111_228.png"/>

<img src="img/Snip20191111_229.png"/>

# Distance Vector: Link Cost Changes

- "good news travels fast"
    - if a node discovers a shorter path to a node, it informs its neighbors about it right away, and this can ripple through the entire network until stability is reached
- "bad news travels slow"
    - if a node detects a link failure, the algorithm does not explicitly tell its neighbors about the failure
    - routing loops can then form in the network, as in the **two-node instability problem**

## Two-node Instability Problem

- given Routing Table at each node in the form of 
    - <Destination, Next Hop, Cost>
- for a 3-node net fragment

<img src="img/Snip20191111_1.png" width=60%/>

<img src="img/Snip20191111_2.png" width=60%/>

## Solutions to Cut the Loop

- **Split Horizon**
    - if node $B$ learns about a **path** to $X$ from $A$, subsequently, $B$ does not tell $A$ about this

- **Poisoned Reverse**
    - if $A$ tries to route to $X$ via $B$, $A$ tells $B$ that it has an infinity-cost path to $X$ when the ($X$, $A$) link is broken
    - enables $B$ to not try to forward the packet to $X$ via $A$

- these techniques do not solve the loop problem with more nodes
    - three-node instability
    - four-node instability
    - ...


# Routing Information Protocol (RIP)

- DV algorithm
    - distance metric: # hops (maximum 15 hops), each link has cost 1
    - DVs exchanged with neighbors every 30 seconds in response message (aka **advertisement**)
    - each advertisement: list of up to 25 destination *subnets* (in IP addressing sense)

<img src="img/Snip20191111_3.png" width=80%/>

## RIP: link failure & recovery

- if no advertisement heard after 180 seconds, the neighbor/link is declared dead
    - routes via the neighbor are invalidated
    - new advertisements sent to neighbors
    - neighbors in turn send out new advertisements (if tables changed)
    - link failure info quickly propagates to entire net
    - **poison reverse** used to prevent ping-pong loops (infinite distance = 16 hops)
   

## RIP Table Processing

- RIP routing tables managed by *application-level* process called *route-d* (daemon)
- advertisements sent in UDP packets periodically

<img src="img/Snip20191115_15.png" width=80%/>

# OSPF (Open Shortest Path First)

- each router learns the complete topology of the network within AS via flooding of OSPF-advertisement messages
- each router uses Dijkstra's algorithm to compute the shortest paths from itself to all other nodes
- for a given source & destination pair, finds the "next hop" from the shortest path


- uses link-state algorithm
    - link-state packet dissemination
    - topology map at each node
    - route computation using Dijkstra's algorithm
- OSPF advertisement carries one entry per neighbor
    - broadcast link state periodically even if there is no change (for the purpose of robustness)
- advertisements flooded to *entire* AS
    - carried in OSPF messages directly over IP (rather than TCP/UDP)

## Link-State Routing Algorithm

- default cost computation: $\textrm{cost} = {\alpha \over \textrm{Link Speed}}$, where $\alpha$ is a reference bandwidth

- **example**: $\textrm{cost} = {100 \textrm{ Mbps} \over \textrm{Link Speed}}$
    - 1 Mbps link: cost = 100
    - 10 Mbps link: cost = 10

## Link-State Routing for OSPF

- **Dijkstra's Algorithm**
    - net topology, link costs known to all nodes
        - accomplished via "link state broadcast"
        - all nodes have same info
    - computes least cost paths from one node ("source") to all other nodes 
        - gives *forwarding table* for that node
    - iterative: after $k$ iterations, know least cost path to $k$ destinations

- $c(x,y)$: link cost from node $x$ to $y$, equal to infinity if not direct neighbors
- $D(v)$: current value of cost of path from source to destination $v$
- $p(v)$: predecessor node along path from source to $v$
- $N'$: set of nodes whose least cost path definitively known
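
Using the notation above, a minimal sketch of Dijkstra's algorithm is given below. It uses a binary heap rather than a linear scan for the minimum, which is an implementation choice and not something the notes prescribe. The function names and the six-node graph with its costs are illustrative assumptions.

```python
import heapq

def dijkstra(graph: dict, source: str):
    """graph[u][v] = c(u, v). Returns (D, p): least costs and predecessors."""
    D = {v: float("inf") for v in graph}
    p = {}
    D[source] = 0
    done = set()              # N': nodes whose least cost is definitively known
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue          # stale heap entry
        done.add(u)
        for v, c in graph[u].items():
            if v not in done and d + c < D[v]:
                D[v] = d + c
                p[v] = u
                heapq.heappush(heap, (D[v], v))
    return D, p

def first_hop(p: dict, source: str, dest: str) -> str:
    """Walk predecessors back from dest to find the next hop out of source."""
    node = dest
    while p[node] != source:
        node = p[node]
    return node

# Illustrative undirected graph; the costs are assumed, not taken from the notes.
edges = [("u", "v", 2), ("u", "x", 1), ("u", "w", 5), ("v", "x", 2), ("v", "w", 3),
         ("x", "w", 3), ("x", "y", 1), ("w", "y", 1), ("w", "z", 5), ("y", "z", 2)]
graph = {}
for a, b, c in edges:
    graph.setdefault(a, {})[b] = c
    graph.setdefault(b, {})[a] = c

D, p = dijkstra(graph, "u")
forwarding = {dest: first_hop(p, "u", dest) for dest in graph if dest != "u"}
print(D)
print(forwarding)
```

The `forwarding` dictionary corresponds to the forwarding table for the source node: for each destination it records only the first hop on the least-cost path.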

<img src="img/Snip20191115_16.png" width=80%/>


<img src="img/Snip20191115_17.png" width=80%/>


<img src="img/Snip20191115_18.png" width=80%/>
    


<img src="img/Snip20191115_19.png" width=80%/>

## OSPF: Link Types

1. **Point-to-Point Link**

<img src="img/Snip20191115_20.png" width=40%/>

- there are 3 subnets associated with the Point-to-Point link, one at each router and one in between the two routers


2. **Transient Link** (Broadcast Link)

<img src="img/Snip20191115_21.png" width=40%/>

3. **Stub Link**

<img src="img/Snip20191115_22.png" width=40%/>

- by means of LS advertisement, other routers know what destinations are connected in the AS.

## OSPF: Link-State

<img src="img/Snip20191115_23.png" width=80%/>

## OSPF Advanced Features (not in RIP)

- **security**: all OSPF messages are authenticated (to prevent malicious intrusion)
- **multiple** same-cost paths are allowed (whereas in RIP, only one path is allowed)
- for each link, there can be multiple cost metrics for different Type Of Service (TOS) 
- integrated unicast and multicast support
    - Multicast OSPF (MOSPF) uses the same topology data base as OSPF
- **hierarchical** OSPF in large domains

### Hierarchical OSPF

<img src="img/Snip20191115_24.png" width=80%/>

- each area runs a separate instance of OSPF
- **two-level hierarchy**: local area and backbone
    - link-state advertisements only inside the area
    - each node has detailed area topology; only knows the direction to nets in other areas
- **area border routers**: summarize distances to nets in own area, advertise to other area border routers
- **backbone routers**: run OSPF routing limited to backbone
- **boundary routers**: connect to other AS's



# Link-State vs Distance-Vector Algorithms

## LS vs DV: Message Complexity

- Link-State: with $n$ nodes, and $E$ links, since nodes running the LS algorithm perform flooding of their link-states, $O(nE)$ messages are sent


- Distance-Vector: distance vectors are exchanged between neighbors only, so the number of messages (and the convergence time) varies

## LS vs DV: Speed of Convergence

- Link-State: initially, it will be slow due to the flooding algorithm ($O(n^2)$ algorithm for $O(nE)$ messages); convergence after subsequent updates will be much faster


- Distance-Vector: initially, it will be faster; subsequent updates will trigger recalculation through many iterations in the network and will therefore be slower

## LS vs DV: Robustness

- Link-State: since each node receives advertisements about the same link from more than one source, and each node computes its own topology, LS is more robust


- Distance-Vector: an error by a node will propagate through the entire network (due to each node's table used by others) and therefore DV is less robust



# Interconnected AS's


<img src="img/Snip20191115_25.png" width=80%/>

# BGP: Internet Inter-AS Routing

- **Border Gateway Protocol**: the de facto inter-domain routing protocol
- BGP provides each AS a means to
    - **eBGP**: obtain subnet reachability information from neighboring AS's
    - **iBGP**: propagate reachability information to all AS-internal routers
    - determine "good" routes to other networks based on reachability information and policy
- allows subnet to advertise its existence to the rest of the Internet

## BGP Messages

- BGP messages exchanged between peers over TCP connection
- BGP messages
    - **OPEN**: opens TCP connection to peer and authenticates sender
    - **UPDATE**: advertises new path (or withdraws old)
    - **KEEPALIVE**: keeps a connection alive in absence of UPDATES; also ACKs OPEN request
    - **NOTIFICATION**: reports errors in previous message; also used to close connection

<img src="img/Snip20191115_26.png" width=80%/>

## BGP Basics

- BGP session: two BGP routers ("peers") exchange BGP messages
    - advertising *paths* to different destination network prefixes ("path vector" protocol)
    - exchanged over semi-permanent TCP connections
- when AS3 advertises a prefix to AS1
    - AS3 *promises* it will forward datagrams towards that prefix
- AS3 can aggregate prefixes in its advertisement
    - e.g., there are 4 subnets attached to AS3: 138.16.64/24, 138.16.65/24, 138.16.66/24, and 138.16.67/24
    - then AS3 can aggregate the prefixes for these 4 subnets and use BGP to advertise the single prefix 138.16.64/22 to AS1
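
The aggregation of the four /24 subnets can be checked with the standard `ipaddress` module. This is only a sketch; the helper name `aggregate` is hypothetical.

```python
import ipaddress

def aggregate(cidrs):
    """Merge adjacent or overlapping prefixes into the fewest covering prefixes."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]

subnets = ["138.16.64.0/24", "138.16.65.0/24", "138.16.66.0/24", "138.16.67.0/24"]
print(aggregate(subnets))
```

The four subnets share their first 22 bits (the third octet ranges over 010000**00** to 010000**11**), which is why a single /22 covers exactly these addresses.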

<img src="img/Snip20191118_44.png" width=80%/>


- using eBGP session between 3a and 1c, AS3 sends prefix reachability info to AS1
    - 1c can then use iBGP to distribute new prefix info to all routers in AS1
    - 1b can then re-advertise new reachability info to AS2 over 1b-to-2a eBGP session
- when router learns new prefix, it creates entry for prefix in its forwarding table



## Path Attributes and BGP Routes

- advertised prefix includes BGP attributes
    - `prefix + attributes = "route"`
- two important attributes
    - **AS-PATH**: contains AS's through which prefix advertisement has passed (e.g., AS 67, AS 17)
    - **NEXT-HOP**: indicates specific internal-AS router to next-hop AS (may be multiple links from current AS to next-hop-AS)


## BGP Import Policy

- Gateway router receiving a route advertisement uses **import policy** to accept/decline (**policy-based routing**)
    - e.g., never route through AS x
    - e.g., the router receives the AS-PATH $\{AS 12, AS 215, AS 64496, AS 99\}$ to destination $X$ but does not trust $AS 64496$, so it declines that path


## BGP Route Selection

- router may learn about more than 1 route to destination AS, selects route based on
    1. local preference value attribute: policy decision
    2. shortest AS-PATH
    3. closest NEXT-HOP router (hot potato routing)
    4. additional criteria
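
The first three selection criteria can be expressed as a sort key, as in the sketch below. The `Route` class, the `igp_cost` field (standing in for the intra-AS distance to the NEXT-HOP), the default local preference of 100, and all cost values are hypothetical; a higher local preference is taken to win. Real BGP implementations apply further tie-breakers that are not modeled here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    prefix: str
    as_path: tuple          # AS's the advertisement has passed through
    next_hop: str
    local_pref: int = 100   # assumed default; set by local policy
    igp_cost: int = 0       # assumed intra-AS cost to reach next_hop

def select_best(routes):
    """Highest local preference, then shortest AS-PATH, then closest NEXT-HOP."""
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path), r.igp_cost))

candidates = [
    Route("138.16.64/22", ("AS3", "AS131", "AS201"), "201.44.13.125", igp_cost=2),
    Route("138.16.64/22", ("AS2", "AS17"), "111.99.86.55", igp_cost=5),
]
best = select_best(candidates)
print(best.as_path, best.next_hop)
```

Here the shorter AS-PATH wins even though its NEXT-HOP is farther away, because the hot potato criterion is only consulted when the earlier criteria tie.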


## Forwarding Table Construction

1. Router becomes aware of prefix
    - via BGP route advertisements from other routers
2. Router determines output port for prefix
    - use BGP route selection to find best inter-AS route
    - use OSPF to find best intra-AS route leading to best inter-AS route
    - router identifies router port for that best route
3. Router enters prefix-port in forwarding table

### Router Becomes Aware of Prefix

<img src="img/Snip20191118_46.png" width=80%/>

- BGP message contains "routes"
- "route" is a prefix and attributes: AS-PATH, NEXT-HOP, ...
    - e.g., `Prefix: 138.16.64/22; AS-PATH: AS3 AS131; NEXT-HOP: 201.44.13.125`

<img src="img/Snip20191118_47.png" width=80%/>

- router may receive multiple routes for the same prefix and it has to select one route
- the router selects route based on **shortest AS-PATH**
    - e.g., `AS2 AS17 to 138.16.64/22` beats `AS3 AS131 AS201 to 138.16.64/22`

- the router (that received the advertisements) then finds the best intra-AS route to the BGP route using **OSPF** with the selected route's NEXT-HOP attribute, which is the IP address of the router interface that begins the AS-PATH

<img src="img/Snip20191118_48.png" width=80%/>

- e.g., `AS-PATH: AS2 AS17; NEXT-HOP: 111.99.86.55`

**Hot Potato Routing**

<img src="img/Snip20191118_51.png" width=80%/>

- suppose there are two or more best inter-AS routes
- Hot potato routing chooses route with the closest NEXT-HOP
    - use OSPF to determine which gateway is closest
    - e.g., for 1c, choose AS3 AS201 ASX over AS2 AS17 ASX since it is closer

### Router Determines Output Port for Prefix

- identifies port along the OSPF shortest path
- adds prefix-port entry to its forwarding table
    - e.g., `(138.16.64/22, port 4)`

<img src="img/Snip20191118_50.png" width=80%/>


# BGP Route Selection

- BGP router may learn about more than 1 route to destination AS; it selects a route based on
    1. policies
        - import policy
        - export policy
    2. shortest AS-PATH

# BGP Routing Policy

<img src="img/Snip20191122_2.png" width=80%/>

- A, B, C are **provider networks**
- X, W, Y are customers of provider networks
- X is *dual-homed*: attached to two networks
    - X does not want to route from B via X to C, so it will not advertise to B a route to C


- An AS's **export policy** (which routes it will advertise)
    - to a customer: all routes
    - to a peer or service provider:
        - routes to all its own subnets and to its customers' subnets
        - but not to subnets learned from other providers or peers


- $A$ advertises path $AW$ to $B$
- $B$ advertises path $BAW$ to $X$
- $B$ should NOT advertise path $BAW$ to $C$: $B$ gets no benefit from routing $CBAW$, since neither $W$ nor $C$ is $B$'s customer; $B$ wants to force $C$ to route to $W$ via $A$, and wants to route only to/from its own customers




# Motivation for Intra/Inter AS routing

- policy
    - inter-AS: admin wants control over how its traffic routed, who routes through its net
    - intra-AS: single admin, so no policy decisions needed
- scale
    - hierarchical routing reduces table size and update traffic
- performance
    - intra-AS: can focus on performance
    - inter-AS: policy may dominate over performance