# SEN163A - Fundamentals of Data Analytics
## Assignment 2 - Large Scale Internet Data Analysis
### Dr. Ir. Jacopo De Stefani - [J.deStefani@tudelft.nl](mailto:J.deStefani@tudelft.nl)
### Joao Pizani Flor, M.Sc. - [J.p.pizaniflor@tudelft.nl](mailto:J.p.pizaniflor@tudelft.nl)

# Review

- AS
- BGP


## AS

### Definition

An autonomous system number (ASN) is a globally unique identifier that allows its autonomous system to exchange routing information with other systems.

### Overview

An autonomous system (AS) is a group of IP prefixes with a clearly defined external routing policy. In order for multiple autonomous systems to interact, each needs to have a unique identifier. Autonomous system numbers can be public or private. Public ASNs are required for systems to exchange information over the Internet. A private ASN can be used instead if a system is communicating solely with a single provider via Border Gateway Protocol (BGP).
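
The public/private distinction can be checked numerically. The sketch below assumes the private-use ranges reserved by RFC 6996 (64512 to 65534 for 2-byte ASNs and 4200000000 to 4294967294 for 4-byte ASNs); these ranges are not stated in the text above, and the name `is_private_asn` is purely illustrative.

```python
# Private-use ASN ranges, assumed from RFC 6996 (not given in the text above)
PRIVATE_ASN_RANGES = [
    (64512, 65534),            # 2-byte private range
    (4200000000, 4294967294),  # 4-byte private range
]


def is_private_asn(asn: int) -> bool:
    """Return True if asn falls inside one of the assumed private-use ranges."""
    return any(low <= asn <= high for low, high in PRIVATE_ASN_RANGES)


print(is_private_asn(3265))   # a number outside both ranges
print(is_private_asn(64512))  # first number of the 2-byte private range
```

A private ASN found in a dataset of Internet-facing systems would be worth a second look, since such numbers are not meant to be visible on the public Internet.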

### Regional Internet registries

The Internet Assigned Numbers Authority (IANA) is responsible for globally coordinating the DNS root, IP addressing, and other Internet protocol resources, including ASNs. IANA assigns blocks of ASNs to the regional Internet registries (RIRs), which are organizations that manage Internet number resources in a particular region of the world.

The five regional Internet registries are:

- **African Network Information Center (AFRINIC)**
- **American Registry for Internet Numbers (ARIN)**
- **Asia-Pacific Network Information Centre (APNIC)**
- **Latin American and Caribbean Network Information Centre (LACNIC)**
- **Réseaux IP Européens Network Coordination Centre (RIPE NCC)**

The five RIRs are united by an unincorporated organization called the Number Resource Organization (NRO). The NRO’s mission is to contribute to an open, stable, and secure Internet by coordinating joint RIR activities and projects, such as Resource Certification (RPKI) and Internet governance activities.

### Types

There are four types of autonomous systems that generally need an ASN. These include:

- **Multihomed** – Connected to more than one autonomous system.
- **Stub** – Only connected to one other autonomous system.
- **Transit** – Provides connections through itself. For example, network A can reach network C either directly or by crossing network B, in which case B acts as a transit AS.
- **Internet Exchange Point** – Autonomous system created by the physical infrastructure located at Internet exchange points.

### Autonomous system number formats

Until 2007, all autonomous system numbers were 2-byte (16-bit) numbers. This gave IANA 65,536 possible ASNs to distribute. Like the IPv4 address space, this pool was bound to run out. Just as IPv6 was created to extend the address space, 4-byte (32-bit) ASNs were introduced to remedy the issue. The new system provides 4,294,967,296 autonomous system numbers.

With the switch to 4-byte, people grew concerned that number representation would become too difficult. To mitigate those concerns, two alternative ways to represent the number were created.

The standard method for displaying the number is called `asplain`, which is a simple decimal representation.

The `asdot+` method breaks the number into high-order and low-order 16-bit values and separates them with a dot, high-order value first. For example, 65525 would be displayed as 0.65525, 65536 would be displayed as 1.0, 65680 would be displayed as 1.144, and so on.

The `asdot` method is a mixture of `asplain` and `asdot+`. Any number in the 2-byte range is displayed in `asplain` format, so 65525 stays 65525. Any number outside that range is displayed in `asdot+` format, so 65680 becomes 1.144.
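
The three notations can be sketched in a few lines of Python. This is a minimal illustration of the rules described above, not a reference implementation; the function names `to_asdot_plus`, `to_asdot`, and `from_asdot` are made up for this example.

```python
def to_asdot_plus(asn: int) -> str:
    """Format an ASN as <high 16 bits>.<low 16 bits>."""
    if not 0 <= asn <= 0xFFFFFFFF:
        raise ValueError("ASN must fit in 32 bits")
    high, low = divmod(asn, 65536)
    return f"{high}.{low}"


def to_asdot(asn: int) -> str:
    """Use asplain inside the 2-byte range and asdot+ outside of it."""
    return str(asn) if asn < 65536 else to_asdot_plus(asn)


def from_asdot(text: str) -> int:
    """Parse asplain, asdot, or asdot+ text back into an integer ASN."""
    if "." not in text:
        return int(text)
    high, low = text.split(".")
    return int(high) * 65536 + int(low)


for asn in (65525, 65680):
    # asplain is simply the decimal number itself
    print(asn, to_asdot_plus(asn), to_asdot(asn))
```

Because `from_asdot` accepts all three notations, it can be used to normalize ASNs from mixed sources into plain integers before comparing them.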

**Quick reference:** https://blog.stackpath.com/autonomous-system-number/


## BGP

![BGP](BGP.png)

**Quick reference:** https://www.cloudflare.com/learning/security/glossary/what-is-bgp/

# AS Data

In [4]:
import pickle

# Open the file in binary mode
with open('data/AS_dataset.pkl', 'rb') as file:

    # Call the load method to deserialize the DataFrame
    AS_df = pickle.load(file)


In [5]:
AS_df.shape

(60122, 5)

In [6]:
AS_df.head()

| | ASN | Country | Name | NumIPs | type |
|---|---|---|---|---|---|
| 0 | AS55330 | AF | AFGHANTELECOM GOVERNMENT COMMUNICATION NETWORK | 50432 | hosting |
| 1 | AS17411 | AF | Io Global Services Pvt. Limited | 13568 | business |
| 2 | AS55424 | AF | Instatelecom Limited | 13312 | business |
| 3 | AS38742 | AF | AWCC | 11520 | isp |
| 4 | AS131284 | AF | Etisalat Afghan | 10240 | isp |
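
The `ASN` column stores each number as text with an `AS` prefix. A possible first processing step, sketched here on the five rows shown above, is to derive a numeric column and aggregate `NumIPs` per `type`. The sketch assumes every value in the full dataset carries the `AS` prefix, which the sample suggests but does not guarantee; `as_sample`, `asn_number`, and `is_4byte` are names introduced for this example.

```python
import pandas as pd

# The five rows from the head() output above
as_sample = pd.DataFrame({
    "ASN": ["AS55330", "AS17411", "AS55424", "AS38742", "AS131284"],
    "Country": ["AF", "AF", "AF", "AF", "AF"],
    "Name": [
        "AFGHANTELECOM GOVERNMENT COMMUNICATION NETWORK",
        "Io Global Services Pvt. Limited",
        "Instatelecom Limited",
        "AWCC",
        "Etisalat Afghan",
    ],
    "NumIPs": [50432, 13568, 13312, 11520, 10240],
    "type": ["hosting", "business", "business", "isp", "isp"],
})

# Drop the two-character "AS" prefix (assumed present on every row)
as_sample["asn_number"] = as_sample["ASN"].str[2:].astype(int)

# Numbers above 65535 do not fit in the original 2-byte format
as_sample["is_4byte"] = as_sample["asn_number"] > 65535

# Sum of NumIPs for each AS type
ips_per_type = as_sample.groupby("type")["NumIPs"].sum()

print(as_sample[["ASN", "asn_number", "is_4byte"]])
print(ips_per_type)
```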


# Probe Data

In [7]:
# Open the file in binary mode
with open('data/probe_dataset.pkl', 'rb') as file:

    # Call the load method to deserialize the DataFrame
    probe_df = pickle.load(file)

In [8]:
probe_df.shape

(11008, 2)

In [9]:
probe_df.head()

| | prb_id | ASN |
|---|---|---|
| 0 | 1 | AS3265 |
| 1 | 2 | AS1136 |
| 2 | 3 | AS3265 |
| 3 | 6 | AS6830 |
| 4 | 8 | AS3265 |
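
Since both tables share an `ASN` column, probes can be counted per AS and enriched with AS metadata through a join. The following is a rough sketch on the sample rows shown in this notebook, not a prescribed method; `probe_sample`, `as_sample`, and the other variable names are introduced for this example.

```python
import pandas as pd

# The five probe rows from the head() output above
probe_sample = pd.DataFrame({
    "prb_id": [1, 2, 3, 6, 8],
    "ASN": ["AS3265", "AS1136", "AS3265", "AS6830", "AS3265"],
})

# Two columns of the five AS rows shown in the AS Data section
as_sample = pd.DataFrame({
    "ASN": ["AS55330", "AS17411", "AS55424", "AS38742", "AS131284"],
    "type": ["hosting", "business", "business", "isp", "isp"],
})

# Number of probes hosted in each AS, largest first
probes_per_asn = (
    probe_sample.groupby("ASN").size().sort_values(ascending=False)
)

# A left join keeps every probe; indicator=True adds a _merge column
# that records whether a matching AS row was found
merged = probe_sample.merge(as_sample, on="ASN", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

print(probes_per_asn)
print(len(unmatched), "of", len(merged), "probes have no AS metadata")
```

None of the five probe ASNs appear among the five AS rows shown, so every probe is flagged as unmatched in this tiny sample. On the full tables the same check would reveal which probes, if any, lack AS metadata.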


# RIPE Data

In [16]:
import bz2
import time
import json

In [17]:
bz2Filename = './data/ping-2022-02-13T2300.bz2'

# Read the first 10 lines to inspect the record format
count = 0
st    = time.time()
# The with-block closes the file automatically, even if parsing fails
with bz2.open(bz2Filename, 'rt') as bz2File:
    for line in bz2File:
        print(line)
        decoded_line = json.loads(line)
        print(decoded_line)
        count = count + 1
        if count >= 10:
            break

# Print the elapsed time and estimate the total loading time.
# Disabled: this assumes 100k lines were read and needs nrOfLines
# (the total number of lines in the file), which is not defined in this cell.
#dur         = round(time.time() - st,2)
#estTotTime  = round( (dur/100000)*nrOfLines )
#print("\nbz2 file:" )
#print("Loading 100k lines took: "  + str(dur) + " seconds")
#print("Estimated loading time of entire bz2 file: "  + str(estTotTime) + \
#      " seconds" )

{"fw":5040,"mver":"2.4.1","lts":221,"dst_name":"185.184.236.30","af":4,"dst_addr":"185.184.236.30","src_addr":"45.77.211.82","proto":"ICMP","ttl":54,"size":64,"result":[{"rtt":136.430751},{"rtt":136.515399},{"rtt":136.571458}],"dup":0,"rcvd":3,"sent":3,"min":136.430751,"max":136.571458,"avg":136.5058693333,"msm_id":22782578,"prb_id":6434,"timestamp":1644795355,"msm_name":"Ping","from":"45.77.211.82","type":"ping","group_id":22782577,"step":240}

{'fw': 5040, 'mver': '2.4.1', 'lts': 221, 'dst_name': '185.184.236.30', 'af': 4, 'dst_addr': '185.184.236.30', 'src_addr': '45.77.211.82', 'proto': 'ICMP', 'ttl': 54, 'size': 64, 'result': [{'rtt': 136.430751}, {'rtt': 136.515399}, {'rtt': 136.571458}], 'dup': 0, 'rcvd': 3, 'sent': 3, 'min': 136.430751, 'max': 136.571458, 'avg': 136.5058693333, 'msm_id': 22782578, 'prb_id': 6434, 'timestamp': 1644795355, 'msm_name': 'Ping', 'from': '45.77.211.82', 'type': 'ping', 'group_id': 22782577, 'step': 240}
{"fw":5040,"mver":"2.4.1","lts":222,"dst_name":
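
Each line of the file is one JSON record describing a ping measurement. As a sketch of how such a record might be reduced to the fields needed for analysis, the example below parses the first record printed above. It assumes that `rtt` values are in milliseconds, that `timestamp` is a Unix time in seconds, and that failed replies appear in `result` as entries without an `rtt` key; none of this is confirmed by the sample itself. The function name `summarize_ping` is made up for this example.

```python
import json
from datetime import datetime, timezone

# First record printed above, copied verbatim
line = (
    '{"fw":5040,"mver":"2.4.1","lts":221,"dst_name":"185.184.236.30","af":4,'
    '"dst_addr":"185.184.236.30","src_addr":"45.77.211.82","proto":"ICMP",'
    '"ttl":54,"size":64,'
    '"result":[{"rtt":136.430751},{"rtt":136.515399},{"rtt":136.571458}],'
    '"dup":0,"rcvd":3,"sent":3,"min":136.430751,"max":136.571458,'
    '"avg":136.5058693333,"msm_id":22782578,"prb_id":6434,'
    '"timestamp":1644795355,"msm_name":"Ping","from":"45.77.211.82",'
    '"type":"ping","group_id":22782577,"step":240}'
)


def summarize_ping(record: dict) -> dict:
    """Keep only the fields needed to relate latency to probe and target."""
    # Skip entries without an rtt (assumed to be timeouts or errors)
    rtts = [reply["rtt"] for reply in record["result"] if "rtt" in reply]
    sent, rcvd = record["sent"], record["rcvd"]
    return {
        "prb_id": record["prb_id"],
        "dst_addr": record["dst_addr"],
        "time_utc": datetime.fromtimestamp(record["timestamp"], tz=timezone.utc),
        "avg_rtt": sum(rtts) / len(rtts) if rtts else None,
        "loss": 1 - rcvd / sent if sent else None,
    }


summary = summarize_ping(json.loads(line))
print(summary)
```

The `prb_id` field is the key that links a measurement back to the probe table, and through the probe's `ASN` to the AS table.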

# IPv4 data

This site or product includes IP2Location LITE data available from [https://lite.ip2location.com](https://lite.ip2location.com).

In [18]:
import pandas

In [19]:
ipv4_df = pandas.read_csv("data/IP2LOCATION-LITE-DB1.CSV")

In [20]:
ipv4_df.shape

(188932, 4)

In [21]:
ipv4_df.head()

| | 0 | 16777215 | - | -.1 |
|---|---|---|---|---|
| 0 | 16777216 | 16777471 | AU | Australia |
| 1 | 16777472 | 16778239 | CN | China |
| 2 | 16778240 | 16779263 | AU | Australia |
| 3 | 16779264 | 16781311 | CN | China |
| 4 | 16781312 | 16785407 | JP | Japan |


In [24]:
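# read_csv used the first data row (0, 16777215, -, -) as the header,
# so give the columns meaningful names
ipv4_df.rename(columns={'0': 'ip_from', '16777215': 'ip_to',
                        '-': 'country_code', '-.1': 'country_name'},
               inplace=True)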

In [25]:
ipv4_df.head()

| | ip_from | ip_to | country_code | country_name |
|---|---|---|---|---|
| 0 | 16777216 | 16777471 | AU | Australia |
| 1 | 16777472 | 16778239 | CN | China |
| 2 | 16778240 | 16779263 | AU | Australia |
| 3 | 16779264 | 16781311 | CN | China |
| 4 | 16781312 | 16785407 | JP | Japan |
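
The column names in the first `head()` output (`0`, `16777215`, `-`, `-.1`) suggest that the CSV has no header row, so `read_csv` consumed the first range as column names and that range is missing from `ipv4_df`. The sketch below shows one way to keep it, by passing `header=None` and `names`, and then looks up the country of an address. It runs on a small in-memory sample reconstructed from the outputs above and assumes that `ip_from` and `ip_to` are the integer form of IPv4 addresses and that the ranges are sorted and non-overlapping. The helper names `ip_to_int` and `lookup_country` are made up for this example.

```python
import io
import ipaddress

import pandas as pd

# First rows of the file as implied by the outputs above
sample_csv = io.StringIO(
    "0,16777215,-,-\n"
    "16777216,16777471,AU,Australia\n"
    "16777472,16778239,CN,China\n"
    "16778240,16779263,AU,Australia\n"
)

columns = ["ip_from", "ip_to", "country_code", "country_name"]
# header=None keeps the first range as data; keep_default_na=False stops
# pandas from reading the country code "NA" as a missing value
ipv4_sample = pd.read_csv(
    sample_csv, header=None, names=columns, keep_default_na=False
)


def ip_to_int(address: str) -> int:
    """Convert dotted-quad IPv4 text to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(address))


def lookup_country(address: str, table: pd.DataFrame):
    """Return the country code of the range containing address, or None."""
    ip_num = ip_to_int(address)
    # Index of the last range whose start is <= ip_num (table sorted by ip_from)
    idx = int(table["ip_from"].searchsorted(ip_num, side="right")) - 1
    if idx < 0:
        return None
    row = table.iloc[idx]
    return row["country_code"] if ip_num <= row["ip_to"] else None


print(ip_to_int("1.0.0.1"), lookup_country("1.0.0.1", ipv4_sample))
```

The binary search keeps each lookup cheap, which matters when the same lookup is applied to every `dst_addr` in the ping data.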
