# SEN163A - Fundamentals of Data Analytics
# Assignment 2 - Large-scale Internet Data Analysis
### Ir. Jacopo De Stefani - [J.deStefani@tudelft.nl](mailto:J.deStefani@tudelft.nl)
### Joao Pizani Flor, M.Sc. - [J.p.pizaniflor@tudelft.nl](mailto:J.p.pizaniflor@tudelft.nl)

### 05-03-2022
## Group 2
- Emmanuel M Boateng - '5617642'
- Joost Oortwijn - '4593472'
- Philip Busscher - '4611993'
- Floris Kool - '4975243'


# Introduction

This report presents a data analysis for The Groote Nationale Investeer Bank, which wants to identify the four most suitable locations for a datacenter in the EU. Four data sets are provided to carry out the research. The data sets are structured and prepared for analysis in a Jupyter Notebook. 
The provided data sets are a RIPE data set, an IP location data set, an AS (Autonomous System) data set and a probe data set. The next section gives a brief explanation of each data set. 


# 1. Dataset description

The AS data set contains ASNs (Autonomous System Numbers), each of which identifies a group of IP addresses run by a single network operator. The country in which the network is situated, the number of IPs and the type of network are also included in this dataframe. This data set is needed to determine where the probes are situated. 

The probe data set contains two attributes: the id of the probe and the ASN it is connected to. Using this data set, one can assign a probe to a specific AS in a specific country. 

The RIPE data set contains ping measurements for approximately 30 days. Each day consists of twenty-four files, each of which contains the ping measurements for one hour. A ping measurement is the time it takes a small packet of data to travel from your device to a server and back. In this research only the ping measurements of the 1st of March are used. 

The last data set is the IP location data set, which gives the country for each range of IPv4 addresses and is used to locate the senders and receivers of the data packets. 
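
Each line of an hourly RIPE file is a single JSON object. The sketch below shows how the four fields used in this report (`af`, `dst_addr`, `prb_id`, `avg`) could be extracted from one line. The sample values are invented, the helper name `parse_ping` is hypothetical, and the rule that a non-positive `avg` marks a ping without replies is an assumption that should be checked against the data.

```python
import json

# Hypothetical sample line: the field names match the RIPE data, the values are invented
sample_line = '{"af": 4, "dst_addr": "193.0.14.129", "prb_id": 1, "avg": 12.5}'

def parse_ping(line):
    """Return (dst_addr, prb_id, avg) for a usable IPv4 ping, otherwise None."""
    record = json.loads(line)
    required = ("af", "dst_addr", "prb_id", "avg")
    if any(key not in record for key in required):
        return None
    if record["af"] != 4:
        return None
    # Assumption: a non-positive average marks a ping that received no replies
    if record["avg"] <= 0:
        return None
    return record["dst_addr"], record["prb_id"], record["avg"]

print(parse_ping(sample_line))
```

Filtering at parse time keeps the list of tuples small, which matters once all twenty-four files of a day are read.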


## 1.1 Opening the data

In [36]:
import pickle
import time
import bz2
import os
import sys
import json
import pandas as pd

In [37]:
#AS Dataset

AS_Filename = 'data/AS_dataset.pkl'

with open(AS_Filename, 'rb') as file:
    
    AS_df = pickle.load(file)
    
AS_df.head(10)

Unnamed: 0,ASN,Country,Name,NumIPs,type
0,AS55330,AF,AFGHANTELECOM GOVERNMENT COMMUNICATION NETWORK,50432,hosting
1,AS17411,AF,Io Global Services Pvt. Limited,13568,business
2,AS55424,AF,Instatelecom Limited,13312,business
3,AS38742,AF,AWCC,11520,isp
4,AS131284,AF,Etisalat Afghan,10240,isp
5,AS45178,AF,ROSHAN-AF,5376,business
6,AS132471,AF,MTNAFGHANISTAN,4864,business
7,AS7494,AF,CeReTechs Co ltd,4096,business
8,AS138322,AF,Afghan Wireless,3584,business
9,AS55745,AF,Neda Telecommunications,2560,business


In [38]:
#Probe dataset

Probe_Filename = 'data/probe_dataset.pkl'

with open(Probe_Filename, 'rb') as file:
    
    P_df = pickle.load(file)
    
P_df.head(10)

Unnamed: 0,prb_id,ASN
0,1,AS3265
1,2,AS1136
2,3,AS3265
3,6,AS6830
4,8,AS3265
5,11,AS12333
6,14,AS3269
7,20,AS3265
8,24,AS7922
9,26,AS3265


In [39]:
#Ripe dataset (Single file)

#Option 1 decompressed file
#decomFilename = 'data/ping-2022-03-01T2300_decom'
#decomFile     = open(decomFilename, 'rt')

#Option 2 BZ2 file
bz2Filename = 'data/ping-2022-03-01T2300.bz2'
bz2File     = bz2.open(bz2Filename, 'rt')


# List of tuples
# https://stackoverflow.com/questions/28056171/how-to-build-and-fill-pandas-dataframe-from-for-loop
tuple_list = []

start  = time.time()

#Iterate over the file opened above (use decomFile here instead when Option 1 is chosen)
for line in bz2File:
    
    decoded_line = json.loads(line)
    if "af" in decoded_line and "dst_addr" in decoded_line and "prb_id" in decoded_line and "avg" in decoded_line: 
        if decoded_line["af"] == 4:
            tuple_list.append((decoded_line["dst_addr"],decoded_line["prb_id"],decoded_line["avg"]))

            
dur         = round(time.time() - start,2)
print("Loading took: "  + str(dur) + " seconds")
print("Lines added to tuple: " + str(len(tuple_list)))

#finally close bz2File
bz2File.close()

In [None]:
#Load tuples data into dataframe
start  = time.time()

RIPE_df = pd.DataFrame(tuple_list)

dur         = round(time.time() - start,2)
print("Loading took: "  + str(dur) + " seconds")

In [None]:
#IP 2 Location dataset

IP_Filename = "data/IP2LOCATION-LITE-DB1.CSV"

#The CSV file has no header row, so the column names are supplied explicitly
#(otherwise the first IP range would be consumed as the header)
ipv4_df = pd.read_csv(IP_Filename, header=None,
                      names=['ip_from', 'ip_to', 'country_code', 'country_name'])

ipv4_df.head(10)

More detailed description of data if needed (Can also be after opening each dataset)

## 1.2 Limitations in data

Evaluate if there are limitations in the provided datasets (AS and probe data set). If you find limitations, describe these and conjecture possible reasons, supported with data.

...

In [None]:
#Code needed to prove limitations

Some list of limitations in text
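
A possible starting point for supporting the limitations with data is sketched below. It counts three common problems: ASNs that occur more than once in the AS table, probes without an ASN, and probes whose ASN does not occur in the AS table. The toy frames only mimic the column names of `AS_df` and `P_df`, their values are invented, and the helper `summarize_limitations` is hypothetical.

```python
import pandas as pd

# Toy stand-ins for AS_df and P_df (invented values, same column names)
as_toy = pd.DataFrame({
    "ASN": ["AS1", "AS2", "AS2"],
    "Country": ["NL", "DE", "DE"],
    "type": ["hosting", "isp", "isp"],
})
probe_toy = pd.DataFrame({"prb_id": [10, 11, 12], "ASN": ["AS1", "AS9", None]})

def summarize_limitations(as_df, probe_df):
    """Count a few data-quality problems in the AS and probe tables."""
    # A left merge with indicator=True marks probes whose ASN has no match
    merged = probe_df.merge(as_df.drop_duplicates("ASN"), on="ASN",
                            how="left", indicator=True)
    unknown = (merged["_merge"] == "left_only") & merged["ASN"].notna()
    return {
        "duplicate_asn_rows": int(as_df.duplicated("ASN").sum()),
        "probes_without_asn": int(probe_df["ASN"].isna().sum()),
        "probes_with_unknown_asn": int(unknown.sum()),
    }

print(summarize_limitations(as_toy, probe_toy))
```

Running the same function on the real `AS_df` and `P_df` would give counts that can be quoted directly in the list of limitations.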

# 2 Analysis

Short description of what is going to be analyzed

## 2.1 AS (Question B)

With the AS and probe data set, find the number m of AS’s that can be used for hosting in the EU
and have probes in the RIPE data set. Sort the ASN’s in ascending order and include the first and last
three in your report (number, name and country).


In [None]:
#Merge the AS and Probe datasets
as_probe_joined_df = pd.merge(P_df,AS_df, on='ASN')

In [None]:
# EU country codes retrieved from: https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Glossary:Country_codes
# Eurostat lists Greece as 'EL'; the ISO 3166 code 'GR' is used here because the IP2Location data uses ISO codes
EU_list = ['BE','BG','CZ','DK','DE','EE','IE','GR','ES','FR','HR','IT','CY','LV','LT','LU','HU','MT','NL','AT','PL','PT','RO','SI','SK','FI','SE']

# Filter data set for AS's that can be used for hosting in the EU
as_probe_joined_df = as_probe_joined_df.loc[(as_probe_joined_df['type'] == 'hosting') & (as_probe_joined_df['Country'].isin(EU_list))]

In [None]:
#Get the unique number of probe IDs that are in the RIPE Data
unique_prbID = RIPE_df[1].unique()

print("Unique probe IDs: " + str(len(unique_prbID)))

In [None]:
#Filter the data set by only selecting the ASN's that have probes in the Ripe dataset
AS_Probe_RIPE_df = as_probe_joined_df.loc[as_probe_joined_df['prb_id'].isin(unique_prbID)]

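#Sort by AS number; the numeric part is used because sorting the 'AS...' strings would be alphabetical
#The result is assigned back, since sort_values returns a new dataframe
AS_Probe_RIPE_df = AS_Probe_RIPE_df.sort_values(by='ASN', key=lambda s: s.str[2:].astype(int))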

print("Number of probes connected to AS that can be used for hosting in the EU and are in the RIPE dataset: " + str(len(AS_Probe_RIPE_df["ASN"])))


In [None]:
#Remove duplicate ASNs (Probes connected to same AS)
display_df = AS_Probe_RIPE_df.drop_duplicates(subset=['ASN'])

#Remove unused columns
display_df = display_df.drop(columns=['prb_id', 'NumIPs', 'type'])

#Print answer to question B
print("Number of AS that can be used for hosting in the EU and are in the RIPE dataset: " + str(len(display_df["ASN"])))


In [None]:
#First 3 ASs
display_df.head(3)

In [None]:
#Last 3 ASs
display_df.tail(3)
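
Because the ASNs are stored as strings such as `AS3265`, the order of the first and last three rows depends on how the sort is done. The sketch below contrasts a plain string sort with a sort on the numeric part, using three invented ASNs. It assumes that every value starts with the prefix `AS` followed only by digits.

```python
import pandas as pd

# Invented ASNs, chosen so that the two orderings differ
toy = pd.DataFrame({"ASN": ["AS9009", "AS200", "AS12876"]})

# Plain string sort: compares character by character
lexicographic = toy.sort_values("ASN")["ASN"].tolist()

# Numeric sort: strip the 'AS' prefix and compare the integers
numeric = toy.sort_values(
    "ASN", key=lambda s: s.str[2:].astype(int)
)["ASN"].tolist()

print(lexicographic)
print(numeric)
```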

Description of results

## 2.2 Hosting location (Question C)
For a single hour in the RIPE data set: find all valid entries where the probe has hosting type AS and
the target IPv4 is from an EU country. Implement this in an efficient way.

In [None]:
#Selects all entries in RIPE data with probe connected to an EU AS of type hosting
#.copy() makes this an independent dataframe, so it can be modified later without warnings
RIPE_HostAS_df = RIPE_df.loc[RIPE_df[1].isin(AS_Probe_RIPE_df['prb_id'])].copy()

print("Entries with probe connected to an EU AS with type hosting: " + str(len(RIPE_HostAS_df[1])))

In [None]:
#Convert IP strings to IP integers
#Each of the four octets is a base-256 digit, so the weights are 256^3, 256^2, 256 and 1
def ip_to_int(ip_string):
    octets = ip_string.split(".")
    return int(octets[0]) * 16777216 + int(octets[1]) * 65536 + int(octets[2]) * 256 + int(octets[3])

RIPE_HostAS_df = RIPE_HostAS_df.copy()
RIPE_HostAS_df[0] = RIPE_HostAS_df[0].apply(ip_to_int)



#Add Integer values of IP to dataframe
#RIPE_HostAS_df["IP_Integer"] = IPs_Integer
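
The integer form of an IPv4 address treats the four octets as digits in base 256, which is also how the `ip_from` and `ip_to` columns of the IP2Location table are encoded. The sketch below is a small self-contained check of that arithmetic against Python's standard `ipaddress` module; the function name `ip_to_int` and the example addresses are chosen for illustration only.

```python
import ipaddress

def ip_to_int(ip_string):
    """Convert dotted-quad IPv4 text to its integer value (base-256 digits)."""
    a, b, c, d = (int(part) for part in ip_string.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

# 1.0.0.0 is exactly 256**3
print(ip_to_int("1.0.0.0"))

# The standard library gives the same result for an arbitrary address
print(ip_to_int("193.0.14.129") == int(ipaddress.IPv4Address("193.0.14.129")))
```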

In [None]:
#Sorting the IP lists so we can check from low to high IPs
RIPE_HostAS_df = RIPE_HostAS_df.sort_values(by=[0])
ipv4_df = ipv4_df.sort_values(by=["ip_from"])

Dest_Addr_Countries = []
ripeindex = 0
ipindex = 0

#Check if there are IP addresses lower than included in the IP2Location dataset
while RIPE_HostAS_df.iat[ripeindex, 0] < ipv4_df.at[ipindex, "ip_from"]:
    ripeindex = ripeindex + 1
    Dest_Addr_Countries.append("-")

print("IP addresses not included in IP2location dataset: " + str(ripeindex))

#Check for each range of IP addresses in the IP2Location dataset which dst_addr IPs are present
#Break loop early if the length of the RIPE dataset is reached
for ipindex in ipv4_df.index:
    while RIPE_HostAS_df.iat[ripeindex, 0] >= ipv4_df.at[ipindex, "ip_from"] and RIPE_HostAS_df.iat[ripeindex, 0] <= ipv4_df.at[ipindex, "ip_to"]:
        Dest_Addr_Countries.append(ipv4_df.at[ipindex, "country_code"])
        ripeindex = ripeindex + 1
        if ripeindex >= len(RIPE_HostAS_df[0]):
            break
    
    if ripeindex >= len(RIPE_HostAS_df[0]):
        break

print("IP addresses linked to country: " + str(len(Dest_Addr_Countries)))
#Add list for destination address location to dataframe
RIPE_HostAS_df["Country"] = Dest_Addr_Countries



In [None]:
#Remove entries not in EU
RIPE_HostAS_df = RIPE_HostAS_df.loc[RIPE_HostAS_df['Country'].isin(EU_list)]

print("Entries with probe connected to an EU AS with type hosting and destination address within EU: " + str(len(RIPE_HostAS_df[1])))
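
The sorted sweep above can also be expressed with `pandas.merge_asof`, which matches every IP to the last range that starts at or below it. The sketch below is one possible formulation on invented toy data, not the method used for the reported numbers. The column name `ip` and the helper `add_country` are hypothetical.

```python
import pandas as pd

# Toy stand-ins: ranges shaped like the IP2Location table, pings with integer IPs
ranges = pd.DataFrame({
    "ip_from": [0, 100, 200],
    "ip_to": [99, 199, 299],
    "country_code": ["-", "NL", "US"],
})
pings = pd.DataFrame({"ip": [250, 150, 400, 100], "prb_id": [1, 2, 3, 4]})

def add_country(pings_df, ranges_df):
    """Attach the country_code of the range containing each IP (NaN if none)."""
    left = pings_df.sort_values("ip")          # merge_asof needs sorted keys
    right = ranges_df.sort_values("ip_from")
    out = pd.merge_asof(left, right, left_on="ip", right_on="ip_from")
    # merge_asof only checks the start of the range, so matches where the IP
    # lies beyond the end of the range are blanked out
    out["country_code"] = out["country_code"].where(out["ip"] <= out["ip_to"])
    return out[["ip", "prb_id", "country_code"]]

result = add_country(pings, ranges)
print(result)
```

The result is sorted by IP; filtering it with `isin` on the list of EU country codes gives the EU-only entries.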

## Part C Alternative Approach reading all the dataset

In [None]:
import pickle
import time
import bz2
import os
import sys
import json
import pandas
import io
import datetime
import socket
import struct

def ip2int(addr):
    return struct.unpack("!I", socket.inet_aton(addr))[0]

with open('data/AS_dataset.pkl', 'rb') as file:
    AS_df = pickle.load(file)
    
with open('data/probe_dataset.pkl', 'rb') as file:    
    P_df = pickle.load(file)
    
decomFilename = 'data/ping-2022-03-01T2300.bz2'
#decomFile     = bz2.open(decomFilename, 'rt')   
merged_df = P_df.merge(AS_df)

#The CSV file has no header row, so the column names are supplied explicitly
ipv4_df = pandas.read_csv("data/IP2LOCATION-LITE-DB1.CSV", header=None,
                          names=['ip_from', 'ip_to', 'country_code', 'country_name'])


#All 27 EU member states (ISO 3166 codes)
EU_Countries = ["AT","BE","BG","HR","CY","CZ","DK","EE","FI","FR","GR","DE","HU",
                "IE","IT","LV","LT","LU","MT","NL","PL","PT","RO","SK","SI",
                "ES","SE"]

EU_data = merged_df[merged_df['Country'].isin(EU_Countries)]
EU_Hosting = EU_data[EU_data['type'] == 'hosting']



merged_df.insert(2, 'AS', merged_df['ASN'].str.replace('AS',''))
merged_df['AS'] = pandas.to_numeric(merged_df['AS'])
merged_df['prb_id'] = pandas.to_numeric(merged_df['prb_id'])


merged_df_sorted = merged_df.sort_values('AS')
df_HostingAS = merged_df[merged_df['type'] == 'hosting']

ipv4_df.head()
tpl = ipv4_df.loc[:, 'ip_from':'ip_to'].apply(tuple, 1).tolist()
idx = pandas.IntervalIndex.from_tuples(tpl, 'both')

t0 = time.time()
#Probes connected to an EU AS of type hosting (a set gives fast membership tests)
hosting_probes = set(EU_Hosting['prb_id'])
checked = set()   #probes that have already been counted
total_lines = 0
m = 0
with open(decomFilename, 'rb') as file:
    decomp = bz2.BZ2Decompressor()
    residue = b''
    #102400 Bytes = 102.4 KB (in decimal)
    #102400 Bytes = 100 KB (in binary)
    #Iterate over RIPE data in  100 KB chunks 
    for data in iter(lambda: file.read(100 * 1024), b''):
        #Prepend the incomplete last line of the previous chunk to the newly decompressed data
        raw = residue + decomp.decompress(data) 
        current_block = raw.split(b'\n')
        #The last element is an incomplete line (or b'' when raw ends with a newline),
        #so it is always carried over to the next chunk
        residue = current_block.pop()
        ##Process all complete lines in the current block
        for items in current_block:
            if not items:
                continue
            df_dict = json.loads(items.decode('utf-8'))
            if ('dst_addr' in df_dict) and (df_dict.get('af') == 4):
                ##convert to integer
                df_ip = ip2int(df_dict['dst_addr'])
                if df_ip > 0: # certain lines have 0.0.0.0 IP
                    try:
                        loc = idx.get_loc(df_ip)
                    except KeyError:
                        continue # IP is not covered by any IP2Location range
                    prb_id = df_dict.get('prb_id')
                    if (ipv4_df.loc[loc,'country_code'] in EU_Countries) and (prb_id in hosting_probes) and (prb_id not in checked):
                        m +=1 ## increment count
                        ##keep the probes that could be used later
                        checked.add(prb_id)
        total_lines += len(current_block)

print("Total processing time: ",(time.time() - t0))
print("Number of unique probes with hosting type AS and EU target in RIPE is %i" %(m))
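
The chunked reading pattern used in this cell can be isolated into a small generator, which makes it easy to test on in-memory data. The sketch below assumes a single-stream bz2 file with newline-separated records. The name `iter_lines` and the three-record payload are invented for illustration.

```python
import bz2
import io

def iter_lines(binary_file, chunk_size=100 * 1024):
    """Yield complete lines from a bz2-compressed binary stream, chunk by chunk."""
    decomp = bz2.BZ2Decompressor()
    residue = b""
    for data in iter(lambda: binary_file.read(chunk_size), b""):
        raw = residue + decomp.decompress(data)
        lines = raw.split(b"\n")
        # The last piece is incomplete (or empty), so keep it for the next chunk
        residue = lines.pop()
        for line in lines:
            if line:
                yield line
    # A final line without a trailing newline is still a complete record
    if residue:
        yield residue

# Tiny in-memory example; the small chunk size forces several reads
payload = b'{"prb_id": 1}\n{"prb_id": 2}\n{"prb_id": 3}'
compressed = io.BytesIO(bz2.compress(payload))
lines = list(iter_lines(compressed, chunk_size=16))
print(len(lines))
```

Only one chunk and one partial line are held in memory at a time, so the approach scales to the full hourly files.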



Description of results

## 2.3 Latency (Question D)
Move from using only an hour to the full day. It is advisable to store the raw results of each file. Then,
using all processed files, calculate the average latency’s for each country-AS combination and store
the results into one ncountries ×m matrix. If we could place one server in each country, what would the
minimum average latency be for each country? (include in your report)
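
When the raw results of each hourly file are stored separately, the daily average per country-AS combination should weight every measurement equally. One way to achieve this, sketched below on invented data, is to store the sum and the count per combination for each file and to divide only once at the end. The column names `Country`, `ASN` and `avg` and the helpers `summarize_file` and `combine` are assumptions for this sketch.

```python
import pandas as pd

def summarize_file(df):
    """Per (Country, ASN): sum and count of the latencies in one hourly file."""
    return df.groupby(["Country", "ASN"])["avg"].agg(["sum", "count"])

def combine(summaries):
    """Add up sums and counts over all files, then divide once."""
    total = pd.concat(summaries).groupby(level=["Country", "ASN"]).sum()
    return total["sum"] / total["count"]

# Two invented hourly files with a different number of measurements
hour_1 = pd.DataFrame({"Country": ["NL", "NL"], "ASN": ["AS1", "AS1"], "avg": [10.0, 20.0]})
hour_2 = pd.DataFrame({"Country": ["NL"], "ASN": ["AS1"], "avg": [60.0]})

daily = combine([summarize_file(hour_1), summarize_file(hour_2)])
print(daily)
```

In this example the daily average is 30.0, whereas averaging the two hourly averages (15.0 and 60.0) would give 37.5.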


In [None]:
#We want a matrix of 27 countries * 113 ASNs (For a single file, should be more for 24 files)

#List of (country, ASN, average latency) tuples
test1= []

for country in EU_list:
    
    #Filter each country's ping values separately into a dataframe
    country_df = RIPE_HostAS_df.loc[RIPE_HostAS_df['Country'] == country]
    
    for ASN in display_df["ASN"]:
        
        #Filter probe IDs for each separate ASN
        #There are more probes than ASs; to calculate the average ping more accurately we use all probes
        prb_df = AS_Probe_RIPE_df.loc[AS_Probe_RIPE_df['ASN'] == ASN]                            
        
        #Filter the ping data so it includes all probes from selected ASN and selected country
        temp_df = country_df.loc[country_df[1].isin(prb_df['prb_id'])]
        
        #Check if there are ping measurements between AS - Country
        #Store the average when available, enter '-' when no data available
        if len(temp_df) > 0:
            test1.append((country, ASN, temp_df[2].mean()))
        else:
            test1.append((country, ASN, '-'))

#Load tuple list into dataframe
test1_df = pd.DataFrame(test1, columns=['Country', 'ASN', 'avg_latency'])
test1_df.head(5)
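
The list of (country, ASN, average) tuples can be reshaped into the requested countries × m matrix with `pivot_table`, after which the minimum per row answers the question for each country. The sketch below uses invented latencies and reads the question as "the lowest average latency towards each country over all candidate ASs", which is one possible interpretation.

```python
import pandas as pd

# Invented long-format averages: one row per (Country, ASN) pair with data
long_df = pd.DataFrame({
    "Country": ["NL", "NL", "DE", "DE"],
    "ASN": ["AS1", "AS2", "AS1", "AS2"],
    "avg": [5.0, 9.0, 12.0, 7.0],
})

# Rows are countries, columns are ASNs; missing combinations become NaN
matrix = long_df.pivot_table(index="Country", columns="ASN",
                             values="avg", aggfunc="mean")

min_latency = matrix.min(axis=1)   # lowest average latency per country
best_asn = matrix.idxmin(axis=1)   # AS that achieves it

print(matrix)
print(min_latency)
```

Using NaN rather than a text placeholder for missing combinations keeps the matrix numeric, so `min` and `idxmin` skip the gaps automatically.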



Description of results

## 2.4 Optimal server locations (Question E)
Since we are only allowed to place four servers, determine the best four datacenters based on the total
latency for all countries. Report your findings and your procedure to obtain them. Also include the
average latency for each country.


In [None]:
#Code...
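
One possible procedure for this question is an exhaustive search: for every set of four ASs, each country is served by the AS with the lowest latency in that set, and the set with the lowest summed latency wins. The sketch below illustrates this on an invented 3 × 3 matrix with k = 2. The function name `best_k_servers` is hypothetical, and the sketch assumes that at least one combination covers every country.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def best_k_servers(matrix, k=4):
    """Exhaustively pick the k columns that minimise the summed per-country latency."""
    values = matrix.to_numpy(dtype=float)
    # Missing measurements count as unreachable
    values = np.where(np.isnan(values), np.inf, values)
    best_cols, best_total = None, np.inf
    for cols in combinations(range(values.shape[1]), k):
        # Each country uses the closest of the k candidate servers
        total = values[:, list(cols)].min(axis=1).sum()
        if total < best_total:
            best_cols, best_total = cols, total
    chosen = [matrix.columns[i] for i in best_cols]
    per_country = matrix[chosen].min(axis=1)
    return chosen, best_total, per_country

# Invented latencies: rows are countries, columns are candidate ASs
toy_matrix = pd.DataFrame(
    {"AS1": [5.0, 30.0, 40.0], "AS2": [35.0, 6.0, 45.0], "AS3": [50.0, 50.0, 4.0]},
    index=["NL", "DE", "ES"],
)

chosen, total, per_country = best_k_servers(toy_matrix, k=2)
print(chosen, total)
print(per_country)
```

With 113 candidate ASs and k = 4 there are roughly 6.5 million combinations, which is still feasible to enumerate but noticeably slower than this toy case.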

Description of results

# Conclusions

... 
add code if needed

In [None]:
HostProbes = []

for ProbeASN in HostProbeASNs:
    
    index = 0
    for ASN in P_df["ASN"]:
        if ASN == ProbeASN:
            HostProbes.append(P_df["prb_id"][index])
            break
        index = index + 1
        
print(len(HostProbes))

In [None]:
decomFilename = 'C:/Users/Kooltje/Downloads/FoDa Data/ping-2022-03-01T2300'
decomFile     = open(decomFilename, 'rt')   


HostIPs = []
index = 0

for line in decomFile:
    jsonline = json.loads(line)
    
    
    if jsonline["prb_id"] in HostProbes:
        try:
            #Check for duplicates
            if jsonline["dst_addr"] not in HostIPs:
                #Check if IP is of type 4
                if jsonline["af"] == 4:
                    HostIPs.append(jsonline["dst_addr"])
        except KeyError as err:
            pass
    
    #Read only first 1m lines
    index = index + 1
    if index > 1000000:
        break
                              
print("Amount of IPs in the RIPE data connected to an AS of type Hosting: " + str(len(HostIPs)))

decomFile.close()

In [None]:
#Converting the xxx.xxx.xxx.xxx format of the host IPs to integer
#Needed for when comparing to the IPv4 dataset

HostIPs_Integer = []


for IPString in HostIPs:
    IP_Splitstring = IPString.split(".") 
 
    #Each octet is a base-256 digit, so the weights are 256^3, 256^2, 256 and 1
    HostIPs_Integer.append(int(IP_Splitstring[0]) * 16777216 + int(IP_Splitstring[1]) * 65536 + int(IP_Splitstring[2]) * 256 + int(IP_Splitstring[3]))  

print(len(HostIPs_Integer))



In [None]:
#Upper part should be removed because run in part 1
import pickle

with open('data/AS_dataset.pkl', 'rb') as file:
    AS_df = pickle.load(file)
    
with open('data/probe_dataset.pkl', 'rb') as file:    
    P_df = pickle.load(file)
    
decomFilename = 'C:/Users/Kooltje/Downloads/FoDa Data/ping-2022-03-01T2300'
decomFile     = open(decomFilename, 'rt')   


RIPEProbes = []
index = 0

#Create list of all probes that are in the RIPE dataset
for line in decomFile:
    jsonline = json.loads(line)
    
    if jsonline["prb_id"] not in RIPEProbes:
        RIPEProbes.append(jsonline["prb_id"])
                          
    #Read only first 1m lines
    index = index + 1
    if index > 1000000:
        break
        
                  
print("Probes in first 1m lines of RIPE Dataset: " +str(len(RIPEProbes)))            

decomFile.close()


In [None]:
index = 0
ProbeASNs = []

#Create list of all probes in both RIPE and probe datasets
#Saves only the ASNs as these are used later
#Probe IDs no longer used after this point
for probe in P_df["prb_id"]:
    if probe in RIPEProbes:
        ProbeASNs.append(P_df["ASN"][index])
        
    index = index + 1
    
print("Probes in both RIPE and probe dataset: " + str(len(ProbeASNs)))    

In [None]:
#List of country codes that are an EU member

EU_Countries = ["AT",
    "BE",
    "BG",
    "HR",
    "CY",
    "CZ",
    "DK",
    "EE",
    "FI",
    "FR",
    "GR",
    "DE",
    "HU",
    "IE",
    "IT",
    "LV",
    "LT",
    "LU",
    "MT",
    "NL",
    "PL",
    "PT",
    "RO",
    "SK",
    "SI",
    "ES",
    "SE"]

In [None]:
HostProbeASNs = []

index = 0

for ASN in AS_df["ASN"]:
    
    if ASN in ProbeASNs:
        if AS_df["type"][index] == "hosting":
            HostProbeASNs.append(ASN)            
    index = index + 1    
    
print("Amount of probes with an ASN with type hosting: " + str(len(HostProbeASNs))) 

HostProbeASNs.sort()


In [None]:
#Compare IPv4 with HostIPs_Integer

ipv4_df = pandas.read_csv("data/IP2LOCATION-LITE-DB1.CSV")
ipv4_df.rename(columns = {'0':'ip_from', '16777215':'ip_to',
                              '-':'country_code','-.1':'country_name'}, inplace = True)

ipv4_df.head()

HostIPs_EU = []
index = 0
ipv4index = 0

#Sorting the IP list so we can check from low to high IPs
HostIPs_Integer.sort()

for IP_to in ipv4_df["ip_to"]:
    
    #ip_to is the last address of the range, so it is included in the comparison
    while HostIPs_Integer[index] <= IP_to:
        if ipv4_df["country_code"][ipv4index] in EU_Countries:
            HostIPs_EU.append(HostIPs_Integer[index])
            print(HostIPs_Integer[index])
            print(ipv4_df["country_code"][ipv4index])
        index = index + 1
        if index >= len(HostIPs_Integer):
            break
    ipv4index = ipv4index + 1  
    if index >= len(HostIPs_Integer):
        break
                  

print(len(HostIPs_EU))


In [None]:
#create list of unique probe ID's in RIPE dataset
unique_prbID = []
for i in tuple_list:
    if i[1] not in unique_prbID:
        unique_prbID.append(i[1])
        
print("Unique probe IDs in RIPE dataset: " + str(len(unique_prbID)))

In [None]:
index = 0
ProbeASNs = []

#Create list of all probes in both RIPE and probe datasets
#Saves only the ASNs as these are used later
#Probe IDs no longer used after this point
for probe in P_df["prb_id"]:
    if probe in RIPEProbes:
        ProbeASNs.append(P_df["ASN"][index])
        
    index = index + 1
    
print("Probes in both RIPE and probe dataset: " + str(len(ProbeASNs)))    

In [None]:
HostProbes = []

index = 0

print(len(HostProbeASNs))

for ASN in P_df["ASN"]:
    if ASN in HostProbeASNs:
        if P_df["prb_id"][index] not in HostProbes:
            HostProbes.append(P_df["prb_id"][index])
    index = index + 1

print(len(HostProbes))    

In [None]:
#Random stuff I didn't want to throw away yet
#Code for finding all host probes from EU in the dataset of one hour
import time
import bz2
import os
import sys
import json

# open decompressed file
decomFilename = 'C:/Users/Kooltje/Downloads/FoDa Data/ping-2022-03-01T2300'
decomFile     = open(decomFilename, 'rt') 

#read first line and print
#firstLine = decomFile.readline();
#print(firstLine)

#the line appears to be json-formatted: pretty print json
#firstLineJson = json.loads(firstLine)

#read all lines of first file
count = 0
st    = time.time()
for line in decomFile:
    jsonline = json.loads(line)
    #print(json.dumps(jsonline, sort_keys=True, indent=4))
    count = count + 1
    if count > 10000: 
          break

#print the last line
print(json.dumps(jsonline, sort_keys=True, indent=4))

#print the read duration
dur         = round(time.time() - st,2)
print("Loading took: " + str(dur) + " seconds")
print("The file had " + str(count) + " lines")

#finally close decomFile
decomFile.close()