# Changelog
Reporting:
- Rewritten introduction
- Rewritten dataset description
- Created seperate headings for Variable descriptions and EU list
- Added text explaining AS and Probe merge
- Added chapter 2 introduction
- Added chapter 2.1 conclusion
- Added chapter 2.2 conclusion
- Added chapter 2.3 conclusion
- Added chapter 2.4 conclusion
	
Analysis
- Changed EL to GR in EU_List
- Loaded all 24 files of RIPE data (We previously loaded only the first 6 due to time constraints)
- Changed the AS/Probe dataset limitations. Now shows probes with ASNs not in ASN dataset and shows amount of ASNs being used for analysis
- Included negative ping checks in RIPE dataset limitation description
- Added number of probes in RIPE dataset bigger than number of probes in probe dataset to limitation description
- Added table of minimal latency from selected set to each country
- Added ASN locations of selected to support conclusion

# SEN163A - Fundamentals of Data Analytics
# Assignment 2 - Large-scale Internet Data Analysis
### Ir. Jacopo De Stefani - [J.deStefani@tudelft.nl](mailto:J.deStefani@tudelft.nl)
### Joao Pizani Flor, M.Sc. - [J.p.pizaniflor@tudelft.nl](mailto:J.p.pizaniflor@tudelft.nl)

### 08-04-2022
## Group 2
- Emmanuel M Boateng - '5617642'
- Joost Oortwijn - '4593472'
- Philip Busscher - ''4611993''
- Floris Kool - ''4975243''


# Introduction & problem description
The Groote Nationale Investeer (GNI) Bank has given us the assignment to find the four best datacenter locations for their expansion into the mobile banking sector. To do this we will use data analytics to find the 4 locations that overall give the lowest latency to all EU countries. To analyze this we will use Atlas RIPE ping dataset. This dataset uses probes that measure different internet metrics at a lot of different locations, chapter one will explain this in more detail. To accomplish our analysis this we will go through five steps:

1. Load all the data nessesary and describe potential limitations in the available data.
2. Transform and combine the datasets so we have the right probes selected and only contain the information needed for analysis.
3. Find all measurement data from the selected probes to EU countries, from a subset of the complete dataset.
4. Calculate the average latency from each measurement probe to each country.
5. Find the 4 locations that give the best latency to all eu countries and provide a conclusion.

# 1. Dataset description

The Atlas RIPE dataset uses a large number of probes to measure different internet metrics. We use their ping dataset, which contains latency data from all of their probes to various IP adresses. Their ping dataset is stored in a seperate file for each hour, containing about 28 milion lines per file. We will use the measurement data of all 24 hours of March 1st. 

The probe dataset contains a list of 11008 measurement probes used in the RIPE dataset. For each probe they have the ASN (Autonomous System Number) it's connected to.

The AS dataset contains a list of 60122 ASNs and the country that they're located in. Combining this with the probe dataset we can find out in which country the probes of the RIPE dataset are located.

IP2Location dataset contains IP ranges and assigned countries to them. Using this dataset you can find in which country an IP address is located.

## 1.1 Description of variables used across the entire notebook

In [1]:
#Description of variables used across the entire notebook

#AS_df - Complete AS dataset as provided
#P_df - Complete probe dataset as provided
#EU_list - list of countries in EU
#ipv4_df - Complete ip2location dataset
    
#as_probe_joined_df - Merge of AS and Probe dataset, from 2.1 on filtered to contain only type hosting and location from EU
#AS_Probe_RIPE_df - Merge of AS and Probe dataset with probe ids in RIPE dataset and ASNs of type hosting and location from EU
#display_df - Same as AS_Probe_RIPE_df with removed duplicate ASNs

#RIPE_df - Complete useful contents of a single hour of ripe data (used for 2.1 & 2.2)
#RIPE_HostAS_df - Entries of a single hour of ripe data with probe connected to an EU ASN with type host,
    #From 2.2 on also filtered to only contain entries with destination address in EU
    
#Complete_ASN_Set - Set of ASNs of hosting type from EU and in complete RIPE dataset (Ended up being the same for each hour of data)
#Complete_RIPE_Entries_df - Complete set of RIPE entries with probe ASN in eu and type host and destination in EU
    #Can be loaded from all ripe files
    #Also saved in RIPE_00-23.pkl 
    
#ASN_Country_Avg_df - Combination of each country, ASN and average ping
#ASN_Country_Matrix_df - Combination of each country, ASN and average ping, with Country as index and ASN as column labels

## 1.2 Opening the data

In [2]:
import pickle
import time
import bz2
import os
import sys
import json
import pandas as pd
import numpy as np
import ipaddress
import io
from itertools import combinations

### AS and Probe datasets

In [3]:
#AS Dataset

AS_Filename = 'data/AS_dataset.pkl'

with open(AS_Filename, 'rb') as file:
    
    AS_df = pickle.load(file)

In [4]:
#Probe dataset

Probe_Filename = 'data/probe_dataset.pkl'

with open(Probe_Filename, 'rb') as file:
    
    P_df = pickle.load(file)

The AS and Probe dataset can already be merged as we only need the ASNs of the RIPE probes in the probe dataset

In [5]:
#Merge the AS and Probe datasets
as_probe_joined_df = pd.merge(P_df,AS_df, on='ASN')

### EU countries

In [6]:
# EU country codes retrieved from: https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Glossary:Country_codes
# Changed EL -> GR
EU_list = ['BE','BG','CZ','DK','DE','EE','IE','GR','ES','FR','HR','IT','CY','LV','LT','LU','HU','MT','NL','AT','PL','PT','RO','SI','SK','FI','SE']

### IP2Location dataset

In [7]:
#IP 2 Location dataset

IP_Filename = "data/IP2LOCATION-LITE-DB1.CSV"

ipv4_df = pd.read_csv(IP_Filename)

ipv4_df.rename(columns = {'0':'ip_from', '16777215':'ip_to',
                              '-':'country_code','-.1':'country_name'}, inplace = True)

### Single ripe file (Used for C)

In [53]:
#Ripe dataset (Single file)

#Option 1 decompressed file
decomFilename = 'data/ping-2022-03-01T2300_decom'
#decomFile     = open(decomFilename, 'rt')

#Option 2 BZ2 file
bz2Filename = 'data/ping-2022-03-01T2300.bz2'
bz2File     = bz2.open(bz2Filename, 'rt')


# List of tuples
# https://stackoverflow.com/questions/28056171/how-to-build-and-fill-pandas-dataframe-from-for-loop
tuple_list = []

start  = time.time()

#for line in decomFile:
for line in bz2File:
    
    decoded_line = json.loads(line)
    if "af" in decoded_line and "dst_addr" in decoded_line and "prb_id" in decoded_line and "avg" in decoded_line: 
        if (decoded_line["af"] == 4)and (df_dict["avg"] > 0):
            tuple_list.append((decoded_line["dst_addr"],decoded_line["prb_id"],decoded_line["avg"]))

            
dur         = round(time.time() - start,2)
print("Loading took: "  + str(dur) + " seconds")
print("Lines added to tuple: " + str(len(tuple_list)))


#finally close bz2File
#decomFile.close()
bz2File.close()

Loading took: 557.97 seconds
Lines added to tuple: 17118054


In [55]:
#Load tuples data into dataframe
start  = time.time()

RIPE_df = pd.DataFrame(tuple_list)

dur         = round(time.time() - start,2)
print("Loading took: "  + str(dur) + " seconds")

Loading took: 6.11 seconds


### Complete ripe dataset from BZ2 files
Changed reading method to raw characters to save about 20% in loading time

In [9]:
RIPE_Filenames = pd.date_range('2022-03-01', periods=24, freq='60min').strftime('D:/FoDa Data/ping-%Y-%m-%dT%H%M.bz2').tolist()

Complete_ASN_List = []
Complete_RIPE_Entries_df = pd.DataFrame({0:[], 1:[], 2:[], 'Country':[]})

# Filter data set for AS's that can be used for hosting in the EU
as_probe_joined_df = as_probe_joined_df.loc[(as_probe_joined_df['type'] == 'hosting') & (as_probe_joined_df['Country'].isin(EU_list))]

for filename in RIPE_Filenames:
    #Read RIPE data
    print(filename)
    
    start  = time.time()
    with open(filename, 'rb') as fi:
        decomp = bz2.BZ2Decompressor()
        residue = b''
        total_lines = 0
        m = 0
        tuple_list = []
    
        for data in iter(lambda: fi.read(100 * 1024), b''):
            raw = residue + decomp.decompress(data) # process the raw data and  concatenate residual of the previous block to the beginning of the current raw data block
            residue = b''
            # process_data(current_block) => do the processing of the current data block
            current_block = raw.split(b'\n')
            if raw[-1] != b'\n':
                residue = current_block.pop() # last line could be incomplete

            for items in current_block:
                df_dict = json.loads(items.decode('utf-8'))
                if ('dst_addr' in df_dict) and (df_dict['af'] == 4) and (df_dict["avg"] > 0):
                    tuple_list.append((df_dict["dst_addr"],df_dict["prb_id"],df_dict["avg"]))
    
    fi.close()
    
    Temp_RIPE_df = pd.DataFrame(tuple_list)
    
    #Get list of ASNs
    unique_prbID = Temp_RIPE_df[1].unique()
    
    Temp_AS_Probe_RIPE_df = as_probe_joined_df.loc[as_probe_joined_df['prb_id'].isin(unique_prbID)]
    
    unique_ASNs = Temp_AS_Probe_RIPE_df['ASN'].unique()
    Complete_ASN_List.extend(unique_ASNs)
    
    
    #Get RIPE entries with dst addr in eu
    Temp_RIPE_HostAS_df = Temp_RIPE_df.loc[Temp_RIPE_df[1].isin(Temp_AS_Probe_RIPE_df['prb_id'])]
    
    for i in Temp_RIPE_HostAS_df.index:
        Temp_RIPE_HostAS_df.at[i, 0] = int(ipaddress.IPv4Address(Temp_RIPE_HostAS_df[0][i]))

    Temp_RIPE_HostAS_df = Temp_RIPE_HostAS_df.sort_values(by=[0])
    ipv4_df = ipv4_df.sort_values(by=["ip_from"])

    Dest_Addr_Countries = []
    ripeindex = 0
    ipindex = 0

    while Temp_RIPE_HostAS_df.iat[ripeindex, 0] < ipv4_df.at[ipindex, "ip_from"]:
        ripeindex = ripeindex + 1
        Dest_Addr_Countries.append("-")

    for ipindex in ipv4_df.index:
        while Temp_RIPE_HostAS_df.iat[ripeindex, 0] >= ipv4_df.at[ipindex, "ip_from"] and Temp_RIPE_HostAS_df.iat[ripeindex, 0] <= ipv4_df.at[ipindex, "ip_to"]:
            Dest_Addr_Countries.append(ipv4_df.at[ipindex, "country_code"])
            ripeindex = ripeindex + 1
            if ripeindex >= len(Temp_RIPE_HostAS_df[0]):
                break

        if ripeindex >= len(Temp_RIPE_HostAS_df[0]):
            break
            
    Temp_RIPE_HostAS_df["Country"] = Dest_Addr_Countries
    Temp_RIPE_HostAS_df = Temp_RIPE_HostAS_df.loc[Temp_RIPE_HostAS_df['Country'].isin(EU_list)]    
    
    
    #Add entries to complete dataframe
    frames = [Complete_RIPE_Entries_df, Temp_RIPE_HostAS_df]
    Complete_RIPE_Entries_df = pd.concat(frames)
    
    dur         = round(time.time() - start,2)
    print("Added " + str(len(Temp_RIPE_HostAS_df[0])) + " entries and " + str(len(unique_ASNs)) + " ASNs in " + str(dur) + " seconds")
    print()
    

#Remove duplicates
Complete_ASN_Set = set(Complete_ASN_List)


D:/FoDa Data/ping-2022-03-01T0000.bz2
Added 259980 entries and 112 ASNs in 568.8 seconds

D:/FoDa Data/ping-2022-03-01T0100.bz2
Added 259968 entries and 112 ASNs in 625.48 seconds

D:/FoDa Data/ping-2022-03-01T0200.bz2
Added 259650 entries and 112 ASNs in 1111.9 seconds

D:/FoDa Data/ping-2022-03-01T0300.bz2
Added 259786 entries and 113 ASNs in 1177.38 seconds

D:/FoDa Data/ping-2022-03-01T0400.bz2
Added 259978 entries and 112 ASNs in 1252.62 seconds

D:/FoDa Data/ping-2022-03-01T0500.bz2
Added 259673 entries and 112 ASNs in 1276.23 seconds

D:/FoDa Data/ping-2022-03-01T0600.bz2
Added 259513 entries and 112 ASNs in 1280.33 seconds

D:/FoDa Data/ping-2022-03-01T0700.bz2
Added 259540 entries and 112 ASNs in 1180.1 seconds

D:/FoDa Data/ping-2022-03-01T0800.bz2
Added 259938 entries and 112 ASNs in 559.47 seconds

D:/FoDa Data/ping-2022-03-01T0900.bz2
Added 260024 entries and 112 ASNs in 571.24 seconds

D:/FoDa Data/ping-2022-03-01T1000.bz2
Added 259996 entries and 112 ASNs in 544.0 second

In [11]:
#Save filtered RIPE data to file
print("Unique ASNs: " + str(len(Complete_ASN_Set)))
print("Ripe entries: " + str(len(Complete_RIPE_Entries_df[0])))

Complete_ASN_Set_df = pd.DataFrame(Complete_ASN_Set)

Complete_ASN_Set_df.to_pickle("data/ASN_00-23.pkl")

Complete_RIPE_Entries_df.to_pickle("data/RIPE_00-23.pkl")

Unique ASNs: 113
Ripe entries: 6236514


### Complete ripe dataset from pkl file

In [78]:
#Probe dataset

Ripe_Filename = 'data/RIPE_00-23.pkl'
ASN_Filename = 'data/ASN_00-23.pkl'

with open(Ripe_Filename, 'rb') as file:
    
    Complete_RIPE_Entries_df = pickle.load(file)
    

with open(ASN_Filename, 'rb') as file:
    
    Complete_ASN_Set_df = pickle.load(file)

Complete_ASN_Set = Complete_ASN_Set_df[0]



## 1.2 Limitations in data (Question A)

Evaluate if there are limitations in the provided datasets (AS and probe data set). If you find limitations, describe these and conjecture possible reasons, supported with data.

### 1.2.1 Limitations in the AS and Probe dataset
1. The first limitation we could find is that not all probes have an ASN included in the ASN dataset, as shown in the code below. For this reason 303 out of 11008 probes cannot be used for our analysis, because we can't find out where they are located and if they are of type: hosting.

In [43]:
ASN_ProbeDataSet = list(P_df['ASN'])
ASN_ASNDataSet = list(AS_df['ASN'])

temp_tuple = []

for index, row in P_df.iterrows():
    if row['ASN'] not in ASN_ASNDataSet:
        temp_tuple.append((row['prb_id'], row['ASN']))
        
ExcludedProbes_df = pd.DataFrame(temp_tuple, columns = ['prb_id', 'ASN'])

print('The probe dataset has: ' + str(len(P_df)) + ' probes')
print('However ' + str(len(ExcludedProbes_df)) + ' probes have an ASN that is not included in the ASN dataset')
print('Excluded probes:')
ExcludedProbes_df.head(5)

The probe dataset has: 11008 probes
However 303 probes have an ASN that is not included in the ASN dataset
Excluded probes:


Unnamed: 0,prb_id,ASN
0,2,AS1136
1,239,AS8346
2,319,AS1909
3,345,AS1734
4,646,AS6067


2. Another possible limitation we have in our datasets is that only 145 out of the 542 possible server locations are analyzed. As shown in the code below there are 542 ASNs inside the EU with type hosting (in the ASN dataset). We can find 339 probes that use these ASNs, however some are connected to the same ASN. This means that we can use data from the 339 probes in the analysis, but we're only analysing 145 different locations.

In [48]:
temp_merged_df = pd.merge(P_df,AS_df, on='ASN')
temp_merged_df = temp_merged_df.loc[(temp_merged_df['type'] == 'hosting') & (temp_merged_df['Country'].isin(EU_list))]

print("Probes in probe dataset: " + str(len(P_df)))
print("Probes left after merge with ASN: " + str(len(pd.merge(P_df,AS_df, on='ASN'))))
print("Useful ASNs in ASN dataset: " + str(len(AS_df.loc[(AS_df['type'] == 'hosting') & (AS_df['Country'].isin(EU_list))])))
print("Useful ASNs: " + str(len(temp_merged_df['ASN'].unique())))
print("Useful Probes: " + str(len(temp_merged_df)))

Probes in probe dataset: 11008
Probes left after merge with ASN: 10705
Useful ASNs in ASN dataset: 542
Useful ASNs: 145
Useful Probes: 339


3. After analyzing an hour of the RIPE dataset we found another limitation to the probe dataset. The ripe dataset contains 11608 seperate probe IDs, which is more than the 11008 IDs in the probe dataset. This means we cannot use some entries of the ripe dataset, because we can't check if the probe is in the EU and is of correct type.

### 1.2.2 Limitation in the IP location dataset
1. When reading through the RIPE dataset in 2.2 and comparing this to the IP2Location dataset we noticed that the first 79 entries did not fit in the lowest range of the IP2Location dataset. Meaning they have an IP address that is lower than 1.0.0.0 (the lowest IP in the IP2loaction dataset). This is shown in the code at 2.2. The implication is that we cannot check the destination location for these ripe entries

### 1.2.3 Limitations in the RIPE dataset

When looking at the RIPE dataset, it was found that for some lines the IP destination addresses were missing. This was not the case for the Probe ID's and average round trip times as these were always included within the lines of the RIPE dataset. Another issue that was found is that some average ping values are -1. A -1 ping time would mean traveling back in time 1 milisecond to deliver a package. We don't think the owners of a ripe probe have invented time travel, so we excluded negative values from the analysis.

Below we checked for missing entries and incorrect ping times for the first 5m lines of a RIPE file. Out of the 5 milion line about 11 thousand were missing data and about 870 thousand had incorrect/no ping data. This limits our analysis because the amount of useful data shrinks to about 4/5.

In [58]:
bz2Filename = 'data/ping-2022-03-01T2300.bz2'
bz2File_limitation = bz2.open(bz2Filename, 'rt') 
missing_adres = 0
missing_probeID = 0
missing_avg = 0
incorrect_avg = 0
line_number = 0

for line in bz2File_limitation:
    decoded_line = json.loads(line)
    line_number += 1
    if "dst_addr" not in decoded_line: 
        missing_adres += 1
      
    if "prb_id" not in decoded_line: 
        missing_probeID += 1
        
    if "avg" not in decoded_line: 
        missing_avg += 1
    elif decoded_line["avg"] <= 0:
        incorrect_avg += 1
           
    if line_number > 5000000:
        print('There are', missing_adres, 'missing IP destination addresses in the first 5m lines of the RIPE dataset (for one hour)')
        print('There are', missing_probeID, 'missing probe ID\'s in the first 5m lines of the RIPE dataset (for one hour)')
        print('There are', missing_avg, 'missing average round-trip time values in the first 5m lines of the RIPE dataset (for one hour)')
        print('There are', incorrect_avg, 'incorrect average round-trip time values in the first 5m lines of the RIPE dataset (for one hour)')
        
        break
        
bz2File_limitation.close()

There are 11428 missing IP destination addresses in the first 5m lines of the RIPE dataset (for one hour)
There are 0 missing probe ID's in the first 5m lines of the RIPE dataset (for one hour)
There are 0 missing average round-trip time values in the first 5m lines of the RIPE dataset (for one hour)
There are 872019 incorrect average round-trip time values in the first 5m lines of the RIPE dataset (for one hour)


# 2 Analysis

To find out the best 4 locations for GNI's servers we will first combine the AS, Probe and RIPE dataset to find all the possible hosting locations in the EU. After that we will find all entries in an hour of measurement data from these locations that also have a destination within the EU. Then using the entire dataset we will find all average times from every location to every EU country. To conclude we will select the 4 locations that give the lowest latency to all countries.

## 2.1 AS (Question B)

With the AS and probe data set, find the number m of AS’s that can be used for hosting in the EU
and have probes in the RIPE data set. Sort the ASN’s in ascending order and include the first and last
three in your report (number, name and country).


In [64]:
#Merge the AS and Probe datasets
as_probe_joined_df = pd.merge(P_df,AS_df, on='ASN')

# Filter data set for AS's that can be used for hosting in the EU
as_probe_joined_df = as_probe_joined_df.loc[(as_probe_joined_df['type'] == 'hosting') & (as_probe_joined_df['Country'].isin(EU_list))]

#Get the unique number of probe IDs that are in the RIPE Data
unique_prbID = RIPE_df[1].unique()

print("Unique probe IDs: " + str(len(unique_prbID)))

#Filter the data set by only selecting the ASN's that have probes in the Ripe dataset
AS_Probe_RIPE_df = as_probe_joined_df.loc[as_probe_joined_df['prb_id'].isin(unique_prbID)]

#Sort by ASN
AS_Probe_RIPE_df.sort_values(by=['ASN']).sort_values(by=['ASN'])

print("Number of probes connected to AS that can be used for hosting in the EU and are in the RIPE dataset: " + str(len(AS_Probe_RIPE_df["ASN"])))

#Remove duplicate ASNs (Probes connected to same AS)
display_df = AS_Probe_RIPE_df.drop_duplicates(subset=['ASN'])

#Remove unused columns
display_df = display_df.drop(columns=['prb_id', 'NumIPs', 'type'])

#Sort by ASN
display_df.insert(2, 'AS', display_df['ASN'].str.replace('AS', ''))
display_df['AS'] = pd.to_numeric(display_df['AS'])
display_df = display_df.sort_values('AS')

#Print anwser to question B
print("Number of AS that can be used for hosting in the EU and are in the RIPE dataset: " + str(len(display_df["ASN"])))


Unique probe IDs: 11608
Number of probes connected to AS that can be used for hosting in the EU and are in the RIPE dataset: 234
Number of AS that can be used for hosting in the EU and are in the RIPE dataset: 113


In [65]:
#First 3 probes
display_df.head(3)

Unnamed: 0,ASN,Country,AS,Name
6422,AS6724,DE,6724,Strato AG
10262,AS8304,FR,8304,Ecritel SARL
8489,AS8315,NL,8315,Sentia Netherlands BV


In [66]:
#Last 3 probes
display_df.tail(3)

Unnamed: 0,ASN,Country,AS,Name
8377,AS201978,CY,201978,Osbil Technology Ltd.
9379,AS203944,LU,203944,NTT Luxembourg PSF S.A.
2910,AS203953,DK,203953,Hiper A/S


### Conclusion
As described in the limitations of probe and AS dataset: Out of the 11008 probes in the probe dataset, 10705 can be combined with the ASN dataset. 339 Probes in this dataset are of type hosting and in the EU, which use 145 different ASNs.

The RIPE dataset contains 11608 probes, 234 of these are in the probe dataset and have an ASN in the EU that can be used for hosting. Some of these probes are connected to the same ASN, leaving 113 ASNs located in the EU that GNI can use for hosting their server.

## 2.2 Hosting location (Question C)
For a single hour in the RIPE data set: find all valid entries where the probe has hosting type AS and
the target IPv4 is from an EU country. Implement this in an efficient way.

In [68]:
#Selects all entries in RIPE data with probe connected to EU as of type hosting
RIPE_HostAS_df = RIPE_df.loc[RIPE_df[1].isin(AS_Probe_RIPE_df['prb_id'])]

print("Entries with probe connected to a potential hosting location: " + str(len(RIPE_HostAS_df[1])))

Entries with probe connected to a potential hosting location: 704517


In [69]:
#Convert IP strings to IP integers
for i in RIPE_HostAS_df.index:  
    IP_Splitstring = RIPE_HostAS_df[0][i].split(".") 
    RIPE_HostAS_df.at[i, 0] = int(IP_Splitstring[0]) * 16581375 + int(IP_Splitstring[1]) * 65025 + int(IP_Splitstring[2]) * 255 + int(IP_Splitstring[3])

In [70]:
#Add country of dst_addr to RIPE_HostAS_df

#Sorting the IP lists so we can check from low to high IPs
RIPE_HostAS_df = RIPE_HostAS_df.sort_values(by=[0])
ipv4_df = ipv4_df.sort_values(by=["ip_from"])

Dest_Addr_Countries = []
ripeindex = 0
ipindex = 0

#Check if there are IP addresses lower than included in the IP2Location dataset
while RIPE_HostAS_df.iat[ripeindex, 0] < ipv4_df.at[ipindex, "ip_from"]:
    ripeindex = ripeindex + 1
    Dest_Addr_Countries.append("-")

print("IP addresses not included in IP2location dataset: " + str(ripeindex))

#Check for each range of IP addresses in the IP2Location dataset which dst_addr IPs are present
#Break loop early if the length of the RIPE dataset is reached
for ipindex in ipv4_df.index:
    while RIPE_HostAS_df.iat[ripeindex, 0] >= ipv4_df.at[ipindex, "ip_from"] and RIPE_HostAS_df.iat[ripeindex, 0] <= ipv4_df.at[ipindex, "ip_to"]:
        Dest_Addr_Countries.append(ipv4_df.at[ipindex, "country_code"])
        ripeindex = ripeindex + 1
        if ripeindex >= len(RIPE_HostAS_df[0]):
            break
    
    if ripeindex >= len(RIPE_HostAS_df[0]):
        break

print("IP addresses linked to country: " + str(len(Dest_Addr_Countries)))

#Add list for destination address location to dataframe
RIPE_HostAS_df["Country"] = Dest_Addr_Countries

IP addresses not included in IP2location dataset: 79
IP addresses linked to country: 704517


In [71]:
#Remove entries not in EU
RIPE_HostAS_df = RIPE_HostAS_df.loc[RIPE_HostAS_df['Country'].isin(EU_list)]

print("Entries with probe connected to an EU AS with type hosting and destination address within EU: " + str(len(RIPE_HostAS_df[1])))

Entries with probe connected to an EU AS with type hosting and destination address within EU: 134528


In [73]:
RIPE_HostAS_df.head(5)

Unnamed: 0,0,1,2,Country
2202050,34230351,6413,0.642325,FR
7105316,34230366,6332,3.692374,FR
10643681,34322379,18820,3.350143,FR
6852126,34322379,19338,4.226577,FR
6003356,34322379,52596,1.965267,FR


### Conclusion
Filtering through an hour of the ripe dataset, we found 704517 out of 17118054 entries that have a probe connected to a potential hosting location. 79 of these had an unusable IP adress and were excluded as described in chapter 1.2.2. 134528 entries also had a destination address in the EU. Meaning for an hour of data we have 134528 measurements of potential hosting locations to potential client locations.

## 2.3 Latency (Question D)
Move from using only an hour to the full day. It is advisable to store the raw results of each file. Then,
using all processed files, calculate the average latency’s for each country-AS combination and store
the results into one ncountries ×m matrix. If we could place one server in each country, what would the
minimum average latency be for each country? (include in your report)


In [79]:
#Load the avg ping for each country-AS combination into a DF
ASN_Country_Avg =[]
start  = time.time()

for country in EU_list:
    
    #Filter each country's ping values seperately into a dataframe
    country_df = Complete_RIPE_Entries_df.loc[Complete_RIPE_Entries_df['Country'] == country]
    
    for ASN in Complete_ASN_Set:
        
        #Filter probe IDs for each seperate ASN
        #There are more probes than ASs to calculate the average ping more accurately we use all probes
        prb_df = as_probe_joined_df.loc[as_probe_joined_df['ASN'] == ASN]                            
        
        #Filter the ping data so it includes all probes from selected ASN and selected country
        temp_df = country_df.loc[country_df[1].isin(prb_df['prb_id'])]
        
        #Create sum of all ASN - Country ping measurements
        sumvalue = 0
        i = 0
        for pingvalue in temp_df[2]:
            sumvalue = sumvalue + pingvalue
            i = i+1
        
        #Check if there are ping measurements between AS - Country
        #Calculate average when needed, enter nan when no data available
        if not i == 0:
            average = sumvalue/i
            ASN_Country_Avg.append((country, ASN, average))
        else:
            ASN_Country_Avg.append((country, ASN, np.nan))
            
    

#Load tuple list into dataframe
ASN_Country_Avg_df = pd.DataFrame(ASN_Country_Avg)  
ASN_Country_Avg_df.columns = ['Country','ASN','Average latency']

dur         = round(time.time() - start,2)
print("Loading took: " + str(dur) + " seconds")

#ASN_Country_Avg_df.head(5)

Loading took: 61.24 seconds


In [80]:
#Display Country-AS-AVerage dataframe as a matrix

ASN_Country_Avg_df = ASN_Country_Avg_df.iloc[:, 1:] # asn and latency
df_groupby = ASN_Country_Avg_df.groupby('ASN')['Average latency'].apply(list)

new_dftesttest = np.zeros((len(df_groupby), len(df_groupby[0])))
for i in range(len(df_groupby)):
    for j in range(len(df_groupby[0])):
        new_dftesttest[i,j] = df_groupby[i][j]

df_groupby.index
ASN_Country_Matrix_df = pd.DataFrame(new_dftesttest.transpose())   

column_list = list(df_groupby.index)
ASN_Country_Matrix_df.columns=column_list
ASN_Country_Matrix_df.insert(0,'Country', EU_list)
ASN_Country_Matrix_df.set_index('Country')

Unnamed: 0_level_0,AS12676,AS12824,AS12859,AS12876,AS12993,AS13287,AS15401,AS15598,AS15685,AS15817,...,AS61211,AS62000,AS62282,AS62416,AS6724,AS8304,AS8315,AS8560,AS8893,AS9211
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
BE,11.166851,40.074876,10.507757,15.188454,47.158572,50.331953,32.000726,14.27247,7.090369,17.91884,...,13.677713,9.139959,30.264256,37.625758,21.895923,36.613629,14.180203,15.608819,14.24865,23.18245
BG,34.065354,56.2402,37.460264,,,,,,24.493638,37.480807,...,,37.181894,,65.739097,45.362127,,,31.467327,39.602015,34.999751
CZ,12.093469,37.812455,25.809037,,,,,,2.435987,16.390667,...,26.435202,19.131593,,55.626967,19.849781,37.179526,,11.805042,17.589719,
DK,20.984928,30.216756,16.241454,,34.615563,,,,,22.766526,...,,31.615565,27.155439,54.632706,21.40588,40.281921,,22.094678,17.782914,17.562175
DE,8.990465,26.565322,16.135886,15.900784,33.302987,39.10749,34.1256,6.982521,12.363413,13.329492,...,30.20947,17.422402,28.505335,46.415247,17.95417,32.618335,12.573617,9.873529,12.003693,11.463877
EE,33.675421,,37.19887,,9.687826,,,,,37.387413,...,,41.549688,,70.265312,30.381828,,,35.153123,30.205509,
IE,23.375734,,25.699692,,,,,,,29.080462,...,,23.829812,,46.472121,31.682828,37.142185,,25.672747,25.116084,
GR,45.387253,67.395819,57.466218,,,,,,36.755894,48.657652,...,,51.527981,,77.508062,52.276134,,,44.504839,52.042627,
ES,36.942678,54.136859,40.268548,19.919217,,16.318842,,,58.671675,38.207938,...,,29.958473,,19.476205,43.597386,33.196142,,33.745641,38.570608,
FR,17.599229,38.852481,21.875865,7.830189,,27.534623,10.814385,22.820869,16.24821,22.054226,...,43.471482,7.703143,35.566478,36.405002,27.983363,19.763384,28.061133,16.793594,21.933111,21.031365


In [81]:
#Calculate the minimum latency for each country
min_latency_s = ASN_Country_Matrix_df.min(axis = 1)
min_latency_df = pd.Series.to_frame(min_latency_s, 'min Latency')
min_latency_df.index = EU_list
min_latency_df
    

  min_latency_s = ASN_Country_Matrix_df.min(axis = 1)


Unnamed: 0,min Latency
BE,2.871515
BG,8.684864
CZ,1.802406
DK,2.938525
DE,6.982521
EE,1.791287
IE,3.860732
GR,31.803821
ES,5.093963
FR,2.770778


### Conclusion
After checking calculating the average latency of each potential hosting location to each eu country we visualized this in a matrix. Some country-as combination do not have data. This is likely because not every measurement probe will send a ping request to a location within every EU country. 

We calculated the minimum average latency from a measurement probe to every EU country. If we could place a 26 servers, one for each country, this is the average latency each country could expect. We did notice some flaws in our results: 
- There are no measurements available to both Cyprus and Malta. 
- Some countries have a relatively high ping (Greece, and Hungary). There likely isn't a measurement probe in the RIPE dataset in or near these countries, so a hosting location near these places is probably not considered in our analysis. 

## 2.4 Optimal server locations (Question E)
Since we are only allowed to place four servers, determine the best four datacenters based on the total
latency for all countries. Report your findings and your procedure to obtain them. Also include the
average latency for each country.


In [84]:
#Find optimal location of 4 servers

i = 0
start  = time.time()
Set_Average = 1000

#Find each possible combination of 4 ASNs in ASN set
ASNCombinations = [ASNCombination for ASNCombination in combinations(Complete_ASN_Set,4)]  

for ASNCombination in ASNCombinations:  
    #Calculate the minimum latency for each country in this ASN set
    combination_df  = ASN_Country_Matrix_df[[ASNCombination[0], ASNCombination[1], ASNCombination[2], ASNCombination[3]]]
    #Calculate average latency of minimum latency of each country in this set
    min_latency_s = combination_df.min(axis = 1)
    
    #Save the lowest average latency of all sets
    if Set_Average > min_latency_s.mean():    
        Set_Average = min_latency_s.mean()
        CombinationIndex = i
    
    i = i+1
    #if i > 10000:
    #    break
        
        

dur         = round(time.time() - start,2)
expecteddur = round(dur * len(ASNCombinations)/i)
print("Loading took: " + str(dur) + " seconds")
#print("Expected time: " + str(expecteddur) + " seconds")
print("Set of ASNs with the lowest average latency to all countries: " + str(ASNCombinations[CombinationIndex]))
print("Average latency from this set: " + str(Set_Average))
    

Loading took: 3475.63 seconds
Set of ASNs with the lowest average latency to all countries: ('AS34971', 'AS16245', 'AS25151', 'AS15598')
Average latency from this set: 6.7286224930603


In [85]:
#Get name, country and number of available IPs of selected ASNs
AS_Optimal_df = AS_df[AS_df['ASN'].isin(ASNCombinations[CombinationIndex])]

AS_Optimal_df

Unnamed: 0,ASN,Country,Name,NumIPs,type
15365,AS16245,DK,Netgroup A/S,68608,hosting
18721,AS15598,DE,QSC AG,165376,hosting
28851,AS34971,IT,Prometeus di Daniela Agro,11008,hosting
34778,AS25151,NL,Cyso Management B.V.,16640,hosting


In [89]:
#Select 4 ASNs from matrix
ASN_Country_Matrix_Optimal_df = ASN_Country_Matrix_df[list(ASNCombinations[CombinationIndex])]

#Show minimal average latency to each country with 4 optimal locations
min_latency_s = ASN_Country_Matrix_df.min(axis = 1)
min_latency_df = pd.Series.to_frame(min_latency_s, 'min Latency')
min_latency_df.index = EU_list
min_latency_df

  min_latency_s = ASN_Country_Matrix_df.min(axis = 1)


Unnamed: 0,min Latency
BE,2.871515
BG,8.684864
CZ,1.802406
DK,2.938525
DE,6.982521
EE,1.791287
IE,3.860732
GR,31.803821
ES,5.093963
FR,2.770778


### Conclusion
After finding the average latency of every set of 4 possible hosting locations we found that the following set has the lowest average latency to all countries:

- Netgroup A/S from Danmark with AS number: AS16245
- QSC AG from Germany with AS number: AS15598
- Prometeus di Daniela Agro from Italy with AS number: AS34971
- Cyso Management B.V. from the Netherlands with AS number AS25151

This set of AS has an average lowest latency to all countries of The table above shows the average latency of 6.73 ms. Therefor we recommend GNI to place servers in under those mentioned AS numbers. With this GNI can expect an average latency to each country as mentioned in the table above.
