This notebook describes part of the PeeringDB dataset.  
It consists of IXP metadata (table `ix`), AS metadata (table `net`), a directed graph (`DiGraph`), and a table containing the metadata of the graph's nodes (table `nodes`).

In [1]:
import numpy as np
import pandas as pd
import networkx as nx
import pickle
import matplotlib.pyplot as plt

# Loading preprocessed data
Notes on the preprocessing:
* Every entry is uniquely identified by an index.
 * ASes: the index is the AS number (`asn`).
 * IXPs: the index is a negative number that I assigned.
* The graph is first built from the information in the `netixlan_set` field of the API. This yields a bipartite graph (AS-IXP) whose links are weighted by the router port size (`speed` in the API).
* To derive a directed graph, we rely on the `info_ratio` attribute of the ASes, which can take the values `Not Disclosed`, `Heavy Inbound`, `Mostly Inbound`, `Balanced`, `Mostly Outbound` and `Heavy Outbound`.
 * Inbound: a link of weight `speed` is created from the IXP to the AS, and a link of weight $(1-\beta)$*`speed` is created in the other direction.
 * Outbound: a link of weight `speed` is created from the AS to the IXP, and a link of weight $(1-\beta)$*`speed` is created in the other direction.
 * `Balanced` or `Not Disclosed`: a link of weight `speed` is created in both directions.
 * Heavy categories: $\beta=\beta_H=0.95$; Mostly categories: $\beta=\beta_M=0.75$.
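
The weighting rule above can be sketched as a small function. This is only an illustration of the rule as stated in these notes, not the actual preprocessing code: the helper name `directed_weights`, its return convention, and the symmetric fallback for unexpected values are assumptions.

```python
BETA_H = 0.95  # "Heavy" categories
BETA_M = 0.75  # "Mostly" categories

def directed_weights(info_ratio, speed):
    """Return (weight AS->IXP, weight IXP->AS) for one AS-IXP port.

    Hypothetical helper: it only restates the rule from the notes above.
    """
    beta = {"Heavy": BETA_H, "Mostly": BETA_M}.get(info_ratio.split(" ")[0])
    if beta is None:
        # "Balanced" or "Not Disclosed": symmetric link.
        # Assumption: any other unexpected value is treated the same way.
        return speed, speed
    if info_ratio.endswith("Inbound"):
        # Full weight from the IXP to the AS, reduced weight the other way
        return (1 - beta) * speed, speed
    # Outbound: full weight from the AS to the IXP, reduced weight the other way
    return speed, (1 - beta) * speed

print(directed_weights("Heavy Inbound", 1000.0))
print(directed_weights("Mostly Outbound", 1000.0))
print(directed_weights("Balanced", 1000.0))
```

With $\beta_H=0.95$, a `Heavy Inbound` AS keeps only 5% of `speed` on its AS-to-IXP link, whereas a `Mostly Outbound` AS keeps 25% on its IXP-to-AS link.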

In [2]:
path = "./"
prefix = "peeringdb_2_dump_"
date = "2021_03_01"

# Node metadata (ASes and IXPs together)
with open(path+"nodes/"+prefix+date+".pickle", "rb") as pickle_in:
    nodes = pickle.load(pickle_in)
nodes = nodes.loc[nodes["port_capacity"]>0]  # port_capacity = sum of all ports
print("nodes table summary")
display(nodes.info())

# IXP metadata
with open(path+"ix/"+prefix+date+".pickle", "rb") as pickle_in:
    ix = pickle.load(pickle_in)
ix = ix.loc[ix["port_capacity"]>0]
print("ix table summary")
display(ix.info())

# AS metadata
with open(path+"net/"+prefix+date+".pickle", "rb") as pickle_in:
    net = pickle.load(pickle_in)
net = net.loc[net["port_capacity"]>0]
print("net table summary")
display(net.info())

BETA_H = 0.95
BETA_M = 0.75

# Edge list file: one "source,target,weight" line per directed edge
with open(path+"graph/"+format(BETA_H, '.4f')+"_"+format(BETA_M, '.4f')+"_"+prefix+date+".txt", "r") as edgelist:
    DiGraph = nx.parse_edgelist(edgelist, nodetype=int, data=(('weight', float),), create_using=nx.DiGraph, delimiter=",")

# Sanity checks: the tables and the graph must describe the same set of nodes
assert len(nodes) == len(ix) + len(net)
assert len(nodes) == len(DiGraph)

print("Total number of nodes:", len(nodes))
print("Total number of IXP:", len(ix))
print("Total number of ASes: ", len(net))
print("Total number of edges: ", len(DiGraph.edges()))

nodes table summary
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12282 entries, 20940 to -893
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           12282 non-null  object 
 1   type           12282 non-null  object 
 2   prev_id        12282 non-null  int64  
 3   AStype         11472 non-null  object 
 4   region         12282 non-null  object 
 5   asn            12282 non-null  int64  
 6   port_capacity  12282 non-null  float64
dtypes: float64(1), int64(2), object(4)
memory usage: 767.6+ KB


None

ix table summary
<class 'pandas.core.frame.DataFrame'>
Int64Index: 810 entries, -1 to -893
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   proto_ipv6        810 non-null    bool  
 1   status            810 non-null    object
 2   url_stats         810 non-null    object
 3   id                810 non-null    int64 
 4   tech_email        810 non-null    object
 5   city              810 non-null    object
 6   policy_email      810 non-null    object
 7   tech_phone        810 non-null    object
 8   media             810 non-null    object
 9   proto_multicast   810 non-null    bool  
 10  ixf_last_import   127 non-null    object
 11  website           810 non-null    object
 12  updated           810 non-null    object
 13  net_count         810 non-null    int64 
 14  policy_phone      810 non-null    object
 15  proto_unicast     810 non-null    bool  
 16  region_continent  810 non-null    object
 1

None

net table summary
<class 'pandas.core.frame.DataFrame'>
Int64Index: 11472 entries, 20940 to 61437
Data columns (total 35 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   status                        11472 non-null  object 
 1   looking_glass                 11472 non-null  object 
 2   route_server                  11472 non-null  object 
 3   netixlan_updated              11472 non-null  object 
 4   info_ratio                    11472 non-null  object 
 5   id                            11472 non-null  int64  
 6   policy_ratio                  11472 non-null  bool   
 7   info_unicast                  11472 non-null  bool   
 8   policy_general                11472 non-null  object 
 9   website                       11472 non-null  object 
 10  allow_ixp_update              11472 non-null  bool   
 11  updated                       11472 non-null  object 
 12  netfac_updated                7121 non

None

Total number of nodes: 12282
Total number of IXP: 810
Total number of ASes:  11472
Total number of edges:  63914


# Selecting the main connected component
Most graph algorithms behave best when the graph has a single connected component, so we keep only the largest one.

In [3]:
# Work only with the main (largest) connected component; the matching
# entries of nodes, ix and net are removed below.
# Caution: casting a DiGraph to a Graph merges each pair of reciprocal edges
# into a single undirected edge. That is harmless here, because only
# connectivity matters.
components = sorted(nx.connected_components(nx.Graph(DiGraph)), key=len, reverse=True)
print("Number of connected components", len(components))
print("Percentage of nodes in the graph main connected component", 100.0*len(components[0])/DiGraph.number_of_nodes())
DiGraph = DiGraph.subgraph(components[0])

# Remove the nodes of every other component from the tables.
for component in components[1:]:
    for node in component:
        if node >= 0:
            # AS: the index is the (non-negative) AS number
            net.drop(index=node, inplace=True)
        else:
            # IXP: the index is a negative number
            ix.drop(index=node, inplace=True)
        nodes.drop(index=node, inplace=True)

assert len(nodes) == len(ix) + len(net)
assert len(nodes) == len(DiGraph)

Number of connected components 28
Percentage of nodes in the graph main connected component 99.22651034033545
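
To illustrate why ignoring edge direction is safe for this step, here is a minimal pure-Python sketch that finds the components of a directed edge list by treating every edge as undirected. It is not the networkx implementation; the function name `weak_components` and the toy edge list are made up for the example.

```python
from collections import defaultdict, deque

def weak_components(edges):
    """Group nodes into components, ignoring edge direction (largest first)."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)  # direction is dropped on purpose
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            for other in neighbours[node] - seen:
                seen.add(other)
                component.add(other)
                queue.append(other)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Toy graph: ASes have non-negative ids, IXPs negative ids (as in this notebook)
toy_edges = [(10, -1), (-1, 10), (20, -1), (30, -2)]
components = weak_components(toy_edges)
print("main component:", components[0])
print("nodes to drop:", [n for c in components[1:] for n in c])
```

The reciprocal pair `(10, -1)` / `(-1, 10)` collapses into a single undirected link, which changes edge counts but not which nodes end up in the same component.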


- #### Defining the columns relevant to the study

In [4]:
list_columns_ix = ['net_count', 'name', 'country', 
                   'notes', 'port_capacity', 'asn', 'ixf_net_count', 'id']

list_columns_net = ['info_ratio', 'id', 'policy_general', 'policy_locations',
                    'info_traffic', 'asn', 'info_type', 'ix_count', 'port_capacity']

In [7]:
# Show a few AS nodes and IXP nodes
nodes[nodes.asn<50].head()

asn,name,type,prev_id,AStype,region,asn,port_capacity
42,Packet Clearing House AS42,AS,3924,Educational/Research,Global,42,559507.0
46,Rutgers University,AS,9329,Educational/Research,Regional,46,100000.0
-1,Equinix Ashburn,IXP,1,,North America,-1,14120580.0
-2,Equinix Chicago,IXP,2,,North America,-2,11511604.0
-3,Equinix Dallas,IXP,3,,North America,-3,8022300.0


In [9]:
nodes['AStype'].value_counts()

Cable/DSL/ISP           4993
NSP                     2135
Content                 1226
Not Disclosed           1179
Enterprise               625
Educational/Research     500
Non-Profit               320
Route Server             258
                         122
Network Services          27
Route Collector           12
Government                10
Name: AStype, dtype: int64

In [150]:
net['port_capacity'].sort_values(ascending=False).head(100)

asn
32934     26860000.0
20940     23992000.0
16509     21640000.0
15169     19590000.0
8075      18730000.0
             ...    
16735       921000.0
9049        921000.0
13238       893000.0
2635        890000.0
263237      890000.0
Name: port_capacity, Length: 100, dtype: float64

In [63]:
nodes['port_capacity'].sort_values(ascending=False).head(5000)

asn
-102      50251588.0
-24       45180600.0
-21       39243675.0
 32934    26860000.0
-15       24737596.0
             ...    
 46618       20000.0
 59842       20000.0
 19116       20000.0
 53085       20000.0
 46811       20000.0
Name: port_capacity, Length: 5000, dtype: float64

In [184]:
net['policy_general'].value_counts()

Open           9217
Selective      1921
Restrictive     167
                 71
No               31
Name: policy_general, dtype: int64

In [58]:
DiGraph[-212]

AtlasView(FilterAtlas({20144: {'weight': 500.0}, 1930: {'weight': 1000.0}}, <function FilterAdjacency.__getitem__.<locals>.new_node_ok at 0x000001E154B42430>))

- ### Adding the non-ordinal class `info_scope`

In [25]:
net['info_scope'].value_counts()

Regional         3614
Europe           1903
Not Disclosed    1527
Global           1264
Asia Pacific     1228
South America     690
North America     646
Africa            226
Australia         172
                   96
Middle East        41
Name: info_scope, dtype: int64

- ### Adding the ordinal class `info_traffic`

In [15]:
net['info_traffic'].value_counts()

                3112
1-5Gbps         2093
5-10Gbps        1263
100-1000Mbps    1149
10-20Gbps       1055
20-50Gbps        900
50-100Gbps       555
100-200Gbps      313
20-100Mbps       231
1-5Tbps          178
0-20Mbps         149
500-1000Gbps     130
300-500Gbps      123
200-300Gbps      117
10-20Tbps         17
5-10Tbps          13
20-50Tbps          5
100+Tbps           3
50-100Tbps         1
Name: info_traffic, dtype: int64

In [23]:
pd.get_dummies(net['info_traffic'])

asn,"",0-20Mbps,1-5Gbps,1-5Tbps,10-20Gbps,10-20Tbps,100+Tbps,100-1000Mbps,100-200Gbps,20-100Mbps,20-50Gbps,20-50Tbps,200-300Gbps,300-500Gbps,5-10Gbps,5-10Tbps,50-100Gbps,50-100Tbps,500-1000Gbps
20940,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
31800,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
22822,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3303,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6079,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39928,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
204923,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
133279,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
34959,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0


- ### Adding the `net_count` column [IXP]

In [30]:
max_net_count = ix['net_count'].max()

dataset_ixp = pd.DataFrame()
dataset_ixp['net_count'] = ix['net_count']/max_net_count

In [31]:
dataset_ixp.head()

asn,net_count
-1,0.251337
-2,0.184874
-3,0.132162
-4,0.072574
-5,0.161192


- ### Adding the `ix_count` column [AS]

In [32]:
max_ix_count = net['ix_count'].max()

dataset_as = pd.DataFrame()
# Note: the column is named 'net_count' but holds the scaled `ix_count` of each AS
dataset_as['net_count'] = net['ix_count']/max_ix_count

In [33]:
dataset_as.head()

asn,net_count
20940,0.643293
31800,0.042683
22822,0.317073
3303,0.152439
6079,0.018293


- ### Adding the `info_traffic` column

In [41]:
# Ordinal encoding of the traffic classes: rank 1 (lowest) to 18 (highest)
scale_mapper = {"0-20Mbps":1, "20-100Mbps":2, "100-1000Mbps":3, "1-5Gbps":4, "5-10Gbps":5, "10-20Gbps":6, "20-50Gbps":7, "50-100Gbps":8, "100-200Gbps":9, "200-300Gbps":10, "300-500Gbps":11, "500-1000Gbps":12, "1-5Tbps":13, "5-10Tbps":14, "10-20Tbps":15, "20-50Tbps":16, "50-100Tbps":17, "100+Tbps":18}
# Rescale the ranks to (0, 1]
my_dictionary = {k: v/len(scale_mapper) for k, v in scale_mapper.items()}
# Labels absent from the mapper (e.g. the empty string) are left unchanged by replace()
dataset_as["info_traffic"] = net["info_traffic"].replace(my_dictionary)

dataset_as.head()

asn,net_count,port_capacity,info_traffic
20940,0.643293,0.477438,1.0
31800,0.042683,0.000135,0.166667
22822,0.317073,0.145886,0.722222
3303,0.152439,0.008113,0.722222
6079,0.018293,0.007363,0.722222
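
The ordinal encoding above can be illustrated without pandas. The sketch below rebuilds the same rank-over-18 scale from an ordered list and makes the treatment of unknown labels explicit. The helper `encode_traffic` and its `missing` argument are hypothetical; returning `None` for an empty label is one possible choice, not what the cell above does.

```python
# Traffic classes in increasing order, as listed in `scale_mapper` above
levels = ["0-20Mbps", "20-100Mbps", "100-1000Mbps", "1-5Gbps", "5-10Gbps",
          "10-20Gbps", "20-50Gbps", "50-100Gbps", "100-200Gbps", "200-300Gbps",
          "300-500Gbps", "500-1000Gbps", "1-5Tbps", "5-10Tbps", "10-20Tbps",
          "20-50Tbps", "50-100Tbps", "100+Tbps"]

# rank / number of classes, so every value falls in (0, 1]
scaled = {label: rank / len(levels) for rank, label in enumerate(levels, start=1)}

def encode_traffic(label, missing=None):
    """Hypothetical helper: scaled rank of a label, `missing` if the label is unknown."""
    return scaled.get(label, missing)

print(encode_traffic("100+Tbps"))      # highest class
print(encode_traffic("100-1000Mbps"))  # third class out of 18
print(encode_traffic(""))              # undisclosed traffic level
```

Because 3112 ASes have an empty `info_traffic`, how those rows are encoded is a modelling decision worth making explicit before the column is used as a numeric feature.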


- ### Adding the `port_capacity_scaled` column

In [34]:
max_portAS = max(nodes['port_capacity'])  # maximum port_capacity over all nodes (IXPs and ASes)
port_capacity = nodes['port_capacity']

# Assignment aligns on the index, so dataset_as only receives the AS rows
dataset_as['port_capacity'] = port_capacity/max_portAS
nodes['port_capacity_scaled'] = port_capacity/max_portAS

In [35]:
nodes[['port_capacity', 'port_capacity_scaled']].sort_values(by='port_capacity_scaled', ascending=False).head(10)

asn,port_capacity,port_capacity_scaled
-102,50251588.0,1.0
-24,45180600.0,0.899088
-21,39243675.0,0.780944
32934,26860000.0,0.53451
-15,24737596.0,0.492275
20940,23992000.0,0.477438
16509,21640000.0,0.430633
15169,19590000.0,0.389838
-98,18934200.0,0.376788
8075,18730000.0,0.372725


In [36]:
dataset_as.head()

asn,net_count,port_capacity
20940,0.643293,0.477438
31800,0.042683,0.000135
22822,0.317073,0.145886
3303,0.152439,0.008113
6079,0.018293,0.007363
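
A small worked example of this max-scaling, using four capacities copied from the outputs above. It is a plain-Python sketch rather than the pandas code of the notebook, and the variable names are illustrative.

```python
# A few port capacities copied from the outputs above (index -> capacity)
port_capacity = {-102: 50251588.0, -24: 45180600.0, 32934: 26860000.0, 20940: 23992000.0}

# The maximum is taken over IXPs and ASes together; here it belongs to an IXP
max_capacity = max(port_capacity.values())
scaled = {node: value / max_capacity for node, value in port_capacity.items()}

for node, value in scaled.items():
    print(node, round(value, 6))
```

Since the largest capacity belongs to an IXP, no AS reaches 1.0 on this scale: the largest AS shown above (AS 32934) sits at about 0.53.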


Computing the **page_rank** scores

In [4]:
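# Weighted PageRank (nx.pagerank uses the 'weight' edge attribute by default)
pagerank_pondere = nx.pagerank(DiGraph)
# Unweighted PageRank
pagerank = nx.pagerank(DiGraph, weight=None)
# Same two scores on the graph with every edge reversed
pagerank_pondere_inverse = nx.pagerank(DiGraph.reverse(copy=True))
pagerank_inverse = nx.pagerank(DiGraph.reverse(copy=True), weight=None)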
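
To make the difference between these four variants concrete, here is a simplified power-iteration PageRank in plain Python. It is a sketch for intuition only, not the networkx implementation: the function name `pagerank_sketch`, the fixed iteration count, and the toy edge list are assumptions.

```python
def pagerank_sketch(edges, alpha=0.85, iterations=100, weighted=True):
    """Power-iteration PageRank on a list of (source, target, weight) edges.

    Simplified illustration, not the networkx implementation.
    """
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    out_weight = {n: 0.0 for n in nodes}
    for u, v, w in edges:
        out_weight[u] += w if weighted else 1.0
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        new_rank = {n: (1.0 - alpha + alpha * dangling) / len(nodes) for n in nodes}
        for u, v, w in edges:
            # Each node passes its rank to its successors in proportion to edge weight
            share = (w if weighted else 1.0) / out_weight[u]
            new_rank[v] += alpha * rank[u] * share
        rank = new_rank
    return rank

# Toy graph: IXP -1 sends little to AS 1 (weight 50) and a lot to AS 2 (weight 1000)
toy = [(1, -1, 1000.0), (-1, 1, 50.0), (2, -1, 1000.0), (-1, 2, 1000.0)]
weighted_rank = pagerank_sketch(toy)
unweighted_rank = pagerank_sketch(toy, weighted=False)
reversed_rank = pagerank_sketch([(v, u, w) for u, v, w in toy])
print(weighted_rank)
print(unweighted_rank)
print(reversed_rank)
```

In the toy graph the weighted score separates AS 1 from AS 2, whereas the unweighted score cannot, because both ASes have the same link structure. Reversing the edges ranks nodes by what they send rather than by what they receive.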