This notebook describes part of the PeeringDB dataset.  
It consists of IXP metadata (table `ix`), AS metadata (table `net`), a directed graph (`DiGraph`), and a table of metadata for the graph's nodes (table `nodes`).

In [392]:
import numpy as np
import pandas as pd
import networkx as nx
import pickle
import matplotlib.pyplot as plt
from sklearn import preprocessing

# Loading preprocessed data
Notes on the preprocessing: 
* Every entry is uniquely identified by an index.
 * ASes: the index is the AS number (`asn`).
 * IXPs: the index is a negative number assigned during preprocessing.
* The graph is first built from the `netixlan_set` information exposed by the API. This yields a bipartite graph (AS-IXP) whose links are weighted by the router port size (`speed` in the API).
* To derive a directed graph, we rely on the `info_ratio` attribute of each AS, which takes one of the values `Not Disclosed`, `Heavy Inbound`, `Heavy Outbound`, `Mostly Inbound`, `Mostly Outbound`, or `Balanced`.
 * Inbound: a link of weight `speed` is created from the IXP to the AS, and a link of weight $(1-\beta)$*`speed` is created in the other direction.
 * Outbound: a link of weight `speed` is created from the AS to the IXP, and a link of weight $(1-\beta)$*`speed` is created in the other direction.
 * `Balanced` or `Not Disclosed`: a link of weight `speed` is created in both directions.
 * Heavy categories: $\beta=\beta_H=0.95$; Mostly categories: $\beta=\beta_M=0.75$.
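
The weighting rules above can be sketched as a small function. This is only an illustration, not the preprocessing code itself: the name `directed_weights` and the tuple it returns are assumptions, while the $\beta$ values and the `info_ratio` labels come from the notes.

```python
BETA_H = 0.95  # Heavy Inbound / Heavy Outbound
BETA_M = 0.75  # Mostly Inbound / Mostly Outbound

def directed_weights(info_ratio, speed, beta_h=BETA_H, beta_m=BETA_M):
    """Return (weight AS->IXP, weight IXP->AS) for one AS-IXP port.

    Hypothetical helper that mirrors the rules listed above.
    """
    if info_ratio.startswith("Heavy"):
        beta = beta_h
    elif info_ratio.startswith("Mostly"):
        beta = beta_m
    else:
        # Balanced or Not Disclosed: same weight in both directions.
        return speed, speed
    minor = (1 - beta) * speed
    if info_ratio.endswith("Inbound"):
        return minor, speed   # full weight on the IXP -> AS link
    return speed, minor       # Outbound: full weight on the AS -> IXP link

print(directed_weights("Heavy Inbound", 10000))
print(directed_weights("Mostly Outbound", 10000))
print(directed_weights("Balanced", 10000))
```

With these values, the minor direction of a Heavy link carries 5% of `speed` and the minor direction of a Mostly link carries 25%.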

In [393]:
path = "./"
prefix = "peeringdb_2_dump_"
date = "2021_03_01"

with open(path+"nodes/"+prefix+date+".pickle", "rb") as pickle_in:
    nodes = pickle.load(pickle_in)
nodes = nodes.loc[nodes["port_capacity"]>0] ##port capacity = sum of all ports
print("nodes table summary")
display(nodes.info())

with open(path+"ix/"+prefix+date+".pickle", "rb") as pickle_in:
    ix = pickle.load(pickle_in)
ix = ix.loc[ix["port_capacity"]>0]
print("ix table summary")
display(ix.info())

with open(path+"net/"+prefix+date+".pickle", "rb") as pickle_in:
    net = pickle.load(pickle_in)
net = net.loc[net["port_capacity"]>0]
print("net table summary")
display(net.info())

BETA_H = 0.95
BETA_M = 0.75

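edgelist_path = path+"graph/"+format(BETA_H, '.4f')+"_"+format(BETA_M, '.4f')+"_"+prefix+date+".txt"
# Read the edge list inside a context manager so that the file is closed afterwards.
with open(edgelist_path, "r") as edgelist:
    DiGraph = nx.parse_edgelist(edgelist, nodetype=int, data=(('weight', float),), create_using=nx.DiGraph, delimiter=",")


# Sanity checks: every node is either an IXP or an AS, and the tables match the graph.
assert len(nodes) == len(ix) + len(net)
assert len(nodes) == len(DiGraph)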

print("Total number of nodes:", len(nodes))
print("Total number of IXP:", len(ix))
print("Total number of ASes: ", len(net))
print("Total number of edges: ", len(DiGraph.edges()))

nodes table summary
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12282 entries, 20940 to -893
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   name           12282 non-null  object 
 1   type           12282 non-null  object 
 2   prev_id        12282 non-null  int64  
 3   AStype         11472 non-null  object 
 4   region         12282 non-null  object 
 5   asn            12282 non-null  int64  
 6   port_capacity  12282 non-null  float64
dtypes: float64(1), int64(2), object(4)
memory usage: 767.6+ KB


None

ix table summary
<class 'pandas.core.frame.DataFrame'>
Int64Index: 810 entries, -1 to -893
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   proto_ipv6        810 non-null    bool  
 1   status            810 non-null    object
 2   url_stats         810 non-null    object
 3   id                810 non-null    int64 
 4   tech_email        810 non-null    object
 5   city              810 non-null    object
 6   policy_email      810 non-null    object
 7   tech_phone        810 non-null    object
 8   media             810 non-null    object
 9   proto_multicast   810 non-null    bool  
 10  ixf_last_import   127 non-null    object
 11  website           810 non-null    object
 12  updated           810 non-null    object
 13  net_count         810 non-null    int64 
 14  policy_phone      810 non-null    object
 15  proto_unicast     810 non-null    bool  
 16  region_continent  810 non-null    object
 1

None

net table summary
<class 'pandas.core.frame.DataFrame'>
Int64Index: 11472 entries, 20940 to 61437
Data columns (total 35 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   status                        11472 non-null  object 
 1   looking_glass                 11472 non-null  object 
 2   route_server                  11472 non-null  object 
 3   netixlan_updated              11472 non-null  object 
 4   info_ratio                    11472 non-null  object 
 5   id                            11472 non-null  int64  
 6   policy_ratio                  11472 non-null  bool   
 7   info_unicast                  11472 non-null  bool   
 8   policy_general                11472 non-null  object 
 9   website                       11472 non-null  object 
 10  allow_ixp_update              11472 non-null  bool   
 11  updated                       11472 non-null  object 
 12  netfac_updated                7121 non

None

Total number of nodes: 12282
Total number of IXP: 810
Total number of ASes:  11472
Total number of edges:  63914


# Selecting the main connected component
Most graph algorithms behave best when the graph has a single connected component, so only the largest component is kept.

In [394]:
## Only the main connected component is kept, so some entries of nodes, ix and net must be removed.
## Main connected component.
# Caution: casting the DiGraph to a Graph merges each pair of opposite edges into a single edge, so edge weights are not preserved. Only connectivity matters here, so this is fine.
components = sorted(nx.connected_components(nx.Graph(DiGraph)), key=len, reverse=True) 
print("Number of connected components", len(components))
print("Percentage of nodes in the graph main connected component", 100.0*len(components[0])/DiGraph.number_of_nodes())
DiGraph = DiGraph.subgraph(components[0])

##Removing entries.
for i in range(1,len(components)):
    component = components[i]
    for node in component:
        #if node is an AS
        if node >= 0:
            net.drop(index=node, inplace=True)
            nodes.drop(index=node, inplace=True)
        #if node is an IXP
        if node < 0:
            ix.drop(index=node, inplace=True)
            nodes.drop(index=node, inplace=True)
            
assert(len(nodes) == len(ix) + len(net))
assert(len(nodes) == len(DiGraph))

Number of connected components 28
Percentage of nodes in the graph main connected component 99.22651034033545
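
As a sketch of the same selection on a toy graph: `nx.weakly_connected_components` ignores edge direction without building an undirected copy, and filtering the tables with `isin` avoids dropping rows one by one. The toy graph and the names `G_main` and `nodes_main` are illustrative assumptions, not part of the dataset.

```python
import networkx as nx
import pandas as pd

# Toy bipartite AS-IXP graph: IXPs have negative ids, ASes positive ones.
G = nx.DiGraph()
G.add_weighted_edges_from([
    (10, -1, 1000.0), (-1, 10, 250.0),
    (20, -1, 1000.0), (-1, 20, 1000.0),
    (30, -2, 100.0), (-2, 30, 100.0),   # small, separate component
])
nodes = pd.DataFrame({"type": ["AS", "AS", "AS", "IXP", "IXP"]},
                     index=[10, 20, 30, -1, -2])

# Weak connectivity treats every edge as undirected.
largest = max(nx.weakly_connected_components(G), key=len)

# .copy() turns the read-only subgraph view into an independent graph.
G_main = G.subgraph(largest).copy()

# Keep only the table rows whose index is a node of the main component.
nodes_main = nodes.loc[nodes.index.isin(largest)]

assert len(nodes_main) == G_main.number_of_nodes()
print(sorted(G_main.nodes()))
```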


- #### Defining the columns relevant to the study

In [395]:
list_columns_ix = ['net_count', 'name', 'country', 
                   'notes', 'port_capacity', 'asn', 'ixf_net_count', 'id']

list_columns_net = ['info_ratio', 'id', 'policy_general', 'policy_locations',
                    'info_traffic', 'asn', 'info_type', 'ix_count', 'port_capacity']

In [396]:
net[list_columns_net].head(10)

asn,info_ratio,id,policy_general,policy_locations,info_traffic,asn,info_type,ix_count,port_capacity
20940,Heavy Outbound,2,Open,Not Required,100+Tbps,20940,Content,211,23992000.0
31800,Heavy Inbound,3,Open,Preferred,100-1000Mbps,31800,Non-Profit,14,6800.0
22822,Mostly Outbound,4,Selective,Required - US,1-5Tbps,22822,Content,104,7331000.0
3303,Mostly Inbound,5,Selective,Preferred,1-5Tbps,3303,Cable/DSL/ISP,50,407700.0
6079,Mostly Inbound,7,Selective,Preferred,1-5Tbps,6079,Cable/DSL/ISP,6,370000.0
23148,,8,Open,Not Required,,23148,,1,20000.0
7843,Heavy Inbound,9,Selective,Preferred,1-5Tbps,7843,Cable/DSL/ISP,8,580000.0
2828,Balanced,13,Selective,Required - US,1-5Tbps,2828,NSP,7,110000.0
3257,Balanced,14,Restrictive,Required - International,,3257,NSP,3,210000.0
3265,Balanced,16,Open,Not Required,50-100Gbps,3265,Cable/DSL/ISP,5,230000.0


In [397]:
# Display a few AS nodes and IXP nodes
nodes[nodes.asn<50].head()

asn,name,type,prev_id,AStype,region,asn,port_capacity
42,Packet Clearing House AS42,AS,3924,Educational/Research,Global,42,559507.0
46,Rutgers University,AS,9329,Educational/Research,Regional,46,100000.0
-1,Equinix Ashburn,IXP,1,,North America,-1,14120580.0
-2,Equinix Chicago,IXP,2,,North America,-2,11511604.0
-3,Equinix Dallas,IXP,3,,North America,-3,8022300.0


In [398]:
nodes['AStype'].value_counts()

Cable/DSL/ISP           4993
NSP                     2135
Content                 1226
Not Disclosed           1179
Enterprise               625
Educational/Research     500
Non-Profit               320
Route Server             258
                         122
Network Services          27
Route Collector           12
Government                10
Name: AStype, dtype: int64

In [399]:
net['port_capacity'].sort_values(ascending=False).head(100)

asn
32934     26860000.0
20940     23992000.0
16509     21640000.0
15169     19590000.0
8075      18730000.0
             ...    
16735       921000.0
9049        921000.0
13238       893000.0
2635        890000.0
263237      890000.0
Name: port_capacity, Length: 100, dtype: float64

In [400]:
nodes['port_capacity'].sort_values(ascending=False).head(500)

asn
-102      50251588.0
-24       45180600.0
-21       39243675.0
 32934    26860000.0
-15       24737596.0
             ...    
 11019      410000.0
 8302       410000.0
 3303       407700.0
-535        407000.0
-185        402000.0
Name: port_capacity, Length: 500, dtype: float64

In [401]:
net['policy_general'].value_counts()

Open           9217
Selective      1921
Restrictive     167
                 71
No               31
Name: policy_general, dtype: int64

- ### Adding the `info_scope` column

In [402]:
net['info_scope'].value_counts()

Regional         3614
Europe           1903
Not Disclosed    1527
Global           1264
Asia Pacific     1228
South America     690
North America     646
Africa            226
Australia         172
                   96
Middle East        41
Name: info_scope, dtype: int64

In [403]:
dataset_as = pd.DataFrame()
dataset_ixp = pd.DataFrame()

info_scope_scaled = net['info_scope'].map({"Regional":1/3, "Europe":2/3, "Not Disclosed":2/3, "Global":1, 
                                           "Asia Pacific":2/3, "South America":2/3, "North America":2/3, 
                                           "Africa":2/3, "Australia":2/3, "":2/3, "Middle East":2/3})

In [404]:
dataset_as['info_scope'] = info_scope_scaled
dataset_as['info_scope'].value_counts()

0.666667    6529
0.333333    3614
1.000000    1264
Name: info_scope, dtype: int64
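
One thing to keep in mind with `Series.map` and a dictionary: any label missing from the dictionary silently becomes `NaN`. The sketch below, on a toy Series with a deliberately unmapped label ("Oceania" is made up for the example), shows a quick way to list such labels before using the scaled column.

```python
import pandas as pd

# Toy stand-in for net['info_scope']; "Oceania" is a made-up, unmapped label.
scope = pd.Series(["Regional", "Europe", "Global", "", "Oceania"],
                  index=[101, 102, 103, 104, 105], name="info_scope")

mapping = {"Regional": 1/3, "Europe": 2/3, "Not Disclosed": 2/3,
           "": 2/3, "Global": 1.0}

scaled = scope.map(mapping)

# Labels that were not found in the mapping end up as NaN.
missing = scope[scaled.isna()].unique()
print("labels without a mapping:", list(missing))
```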

In [405]:
not_disclosed_vs_port = pd.DataFrame(net['port_capacity'] [net['info_scope'] == 'Not Disclosed'])
not_disclosed_vs_port.describe()

,port_capacity
count,1527.0
mean,20195.25
std,102094.7
min,1.0
25%,1000.0
50%,8200.0
75%,11000.0
max,3200000.0


In [406]:
not_disclosed_vs_ixcount = pd.DataFrame(net['ix_count'] [net['info_scope'] == 'Not Disclosed'])
not_disclosed_vs_ixcount.describe()

,ix_count
count,1527.0
mean,1.78258
std,1.848774
min,1.0
25%,1.0
50%,1.0
75%,2.0
max,29.0


In [407]:
info_scope_regional = pd.DataFrame(net['ix_count'] [net['info_scope'] == 'Regional'])
info_scope_regional.describe()

,ix_count
count,3614.0
mean,2.368567
std,2.531346
min,1.0
25%,1.0
50%,2.0
75%,3.0
max,75.0


- ### Adding the `info_traffic` column

In [408]:
net['info_traffic'].value_counts()

                3112
1-5Gbps         2093
5-10Gbps        1263
100-1000Mbps    1149
10-20Gbps       1055
20-50Gbps        900
50-100Gbps       555
100-200Gbps      313
20-100Mbps       231
1-5Tbps          178
0-20Mbps         149
500-1000Gbps     130
300-500Gbps      123
200-300Gbps      117
10-20Tbps         17
5-10Tbps          13
20-50Tbps          5
100+Tbps           3
50-100Tbps         1
Name: info_traffic, dtype: int64

In [409]:
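# 19 distinct values: 18 traffic bands plus the empty string (undisclosed), which the mapping below places mid-scale at rank 10.
nb_info_traffic = len(net['info_traffic'].value_counts())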

info_traffic_scaled = net['info_traffic'].map({"0-20Mbps":1/nb_info_traffic, "20-100Mbps":2/nb_info_traffic,
                       "100-1000Mbps":3/nb_info_traffic, "1-5Gbps":4/nb_info_traffic, 
                       "5-10Gbps":5/nb_info_traffic, "10-20Gbps":6/nb_info_traffic, 
                       "20-50Gbps":7/nb_info_traffic, "50-100Gbps":8/nb_info_traffic, 
                       "100-200Gbps":9/nb_info_traffic, "":10/nb_info_traffic, 
                       "200-300Gbps":11/nb_info_traffic, "300-500Gbps":12/nb_info_traffic,
                       "500-1000Gbps":13/nb_info_traffic, "1-5Tbps":14/nb_info_traffic, 
                       "5-10Tbps":15/nb_info_traffic, "10-20Tbps":16/nb_info_traffic, 
                       "20-50Tbps":17/nb_info_traffic, "50-100Tbps":18/nb_info_traffic,
                       "100+Tbps":19/nb_info_traffic})

In [410]:
dataset_as['info_traffic'] = info_traffic_scaled
dataset_as['info_traffic'].value_counts()

0.526316    3112
0.210526    2093
0.263158    1263
0.157895    1149
0.315789    1055
0.368421     900
0.421053     555
0.473684     313
0.105263     231
0.736842     178
0.052632     149
0.684211     130
0.631579     123
0.578947     117
0.842105      17
0.789474      13
0.894737       5
1.000000       3
0.947368       1
Name: info_traffic, dtype: int64
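
The same kind of ordinal encoding can be generated from an ordered list of bands instead of hand-written fractions. This is a variant, not the mapping used above: it scales over the 18 bands only and, as an assumption made for the example, imputes undisclosed values with the median of the disclosed ones rather than with a fixed mid-scale rank.

```python
import pandas as pd

# Traffic bands in increasing order (labels as they appear in the table above).
bands = ["0-20Mbps", "20-100Mbps", "100-1000Mbps", "1-5Gbps", "5-10Gbps",
         "10-20Gbps", "20-50Gbps", "50-100Gbps", "100-200Gbps", "200-300Gbps",
         "300-500Gbps", "500-1000Gbps", "1-5Tbps", "5-10Tbps", "10-20Tbps",
         "20-50Tbps", "50-100Tbps", "100+Tbps"]

# Rank r = 1..18 maps to r / 18, so the largest band gets 1.0.
rank = {label: (i + 1) / len(bands) for i, label in enumerate(bands)}

# Toy stand-in for net['info_traffic']; the empty string means undisclosed.
traffic = pd.Series(["1-5Gbps", "", "100+Tbps", "0-20Mbps"], index=[1, 2, 3, 4])
scaled = traffic.map(rank)

# Assumption: undisclosed values take the median of the disclosed ones.
scaled = scaled.fillna(scaled.median())
print(scaled)
```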

- ### Adding the `net_count` column [IXP]

In [411]:
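max_net_count = ix['net_count'].max()
# Out-degree of each IXP node (number of neighbouring ASes), scaled by the largest declared net_count.
# The list follows the order of DiGraph.nodes(), so dataset_ixp gets a positional 0..n-1 index.
net_count_scaled = [len(DiGraph[i])/max_net_count for i in DiGraph.nodes() if i < 0]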

dataset_ixp['net_count'] = net_count_scaled

In [412]:
dataset_ixp.head()

,net_count
0,0.249809
1,0.18411
2,0.131398
3,0.077922
4,0.092437
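
A degree feature like this one can also be keyed by node id instead of by position, which keeps each value attached to its IXP whatever the iteration order of the graph. The sketch below uses a toy graph; `features` and `out_degree` are illustrative names, and `G.out_degree(n)` gives the same number as `len(G[n])` for a directed graph.

```python
import networkx as nx
import pandas as pd

G = nx.DiGraph()
G.add_edges_from([(-1, 10), (-1, 20), (-2, 10), (10, -1), (20, -1), (10, -2)])

# Toy IXP table, deliberately ordered differently from the graph's node order.
ix = pd.DataFrame({"net_count": [1, 2]}, index=[-2, -1])

# Out-degree of every IXP node, keyed by node id rather than by position.
out_degree = pd.Series({n: G.out_degree(n) for n in G.nodes() if n < 0})

features = pd.DataFrame(index=ix.index)
# Assigning a Series aligns on the index, so each value lands on its own IXP row.
features["net_count"] = out_degree / ix["net_count"].max()
print(features)
```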


- ### Adding the `ix_count` column [AS]

In [413]:
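max_ix_count = net['ix_count'].max()
# Out-degree of each AS node (number of distinct IXPs it connects to), scaled by the largest declared ix_count.
# The list follows the order of DiGraph.nodes(), which is assumed to match the row order of dataset_as.
ix_count_scaled = [len(DiGraph[i])/max_ix_count for i in DiGraph.nodes() if i > 0]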

dataset_as['ix_count'] = ix_count_scaled

In [414]:
dataset_as.head()

asn,info_scope,info_traffic,ix_count
20940,1.0,1.0,0.515244
31800,1.0,0.157895,0.042683
22822,1.0,0.736842,0.253049
3303,0.666667,0.736842,0.152439
6079,0.666667,0.736842,0.018293


In [415]:
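# Despite its name, this column holds the standardized `ix_count` of each AS (zero mean, unit variance), not an IXP net_count.
dataset_as['net_count'] = preprocessing.scale(net['ix_count'])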

In [416]:
dataset_as.head()

asn,info_scope,info_traffic,ix_count,net_count
20940,1.0,1.0,0.515244,22.742338
31800,1.0,0.157895,0.042683,1.17839
22822,1.0,0.736842,0.253049,11.02994
3303,0.666667,0.736842,0.152439,5.11901
6079,0.666667,0.736842,0.018293,0.302696
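
Unlike the other columns, a standardized column is not confined to [0, 1], which is why values such as 22.74 appear above. The sketch below reproduces standardization with NumPy on made-up numbers (subtract the mean, divide by the population standard deviation, which is what `preprocessing.scale` computes) and compares it with max scaling.

```python
import numpy as np

# Toy, heavy-tailed values standing in for an ix_count column.
ix_count = np.array([1.0, 1.0, 2.0, 3.0, 50.0])

# Standardization: zero mean, unit (population) standard deviation.
standardized = (ix_count - ix_count.mean()) / ix_count.std()

# Max scaling, as used for the other columns of dataset_as.
max_scaled = ix_count / ix_count.max()

print(standardized.round(3))
print(max_scaled.round(3))
```

Whether a feature should be standardized or max-scaled depends on the downstream model; mixing both in one table makes the columns harder to compare.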


- ### Adding the `port_capacity` column <br>
   - 1) Divide by the maximum port capacity of the ASes
   - 2) Divide by the maximum port capacity, computed independently for IXPs and ASes
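
Scaling each node by the maximum of its own type can be sketched with a `groupby` on a toy `nodes` table (the values are made up). The global variant is shown only for contrast: with a single maximum, the much larger IXP capacities would compress the AS values towards zero.

```python
import pandas as pd

nodes = pd.DataFrame(
    {"type": ["AS", "AS", "IXP", "IXP"],
     "port_capacity": [200.0, 50.0, 1000.0, 250.0]},
    index=[10, 20, -1, -2],
)

# Contrast: one global maximum shared by ASes and IXPs.
global_scaled = nodes["port_capacity"] / nodes["port_capacity"].max()

# Per-type scaling: each node is divided by the maximum of its own type.
per_type_max = nodes.groupby("type")["port_capacity"].transform("max")
per_type_scaled = nodes["port_capacity"] / per_type_max

print(pd.DataFrame({"global": global_scaled, "per_type": per_type_scaled}))
```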

In [417]:
max_portAS = net['port_capacity'].max()
dataset_as['port_capacity'] = net['port_capacity']/max_portAS

In [418]:
dataset_as.head()

asn,info_scope,info_traffic,ix_count,net_count,port_capacity
20940,1.0,1.0,0.515244,22.742338,0.893224
31800,1.0,0.157895,0.042683,1.17839,0.000253
22822,1.0,0.736842,0.253049,11.02994,0.272934
3303,0.666667,0.736842,0.152439,5.11901,0.015179
6079,0.666667,0.736842,0.018293,0.302696,0.013775


In [463]:
max_portIXP = ix['port_capacity'].max()

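list_port_scaled = ix['port_capacity']/max_portIXP
print(list_port_scaled)

# dataset_ixp has a positional 0..n-1 index, so assigning the Series directly would align on the
# negative IXP ids and yield only NaN. Reorder to the graph's IXP order and assign by position.
ixp_order = [i for i in DiGraph.nodes() if i < 0]
dataset_ixp['port_capacity'] = list_port_scaled.reindex(ixp_order).to_numpy()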

asn
-1      0.280998
-2      0.229079
-3      0.159643
-4      0.077176
-5      0.141351
          ...   
-886    0.002599
-888    0.015741
-890    0.000044
-892    0.001592
-893    0.000219
Name: port_capacity, Length: 780, dtype: float64



- ### Adding the `info_ratio` column [AS]

In [423]:
net['info_ratio'].value_counts()

Balanced           3494
Mostly Inbound     3476
Not Disclosed      1965
Mostly Outbound    1078
Heavy Inbound       803
Heavy Outbound      397
                    194
Name: info_ratio, dtype: int64

In [424]:
nb_info_ratio = 5
info_ratio_scaled = net['info_ratio'].map({"Heavy Inbound":1/nb_info_ratio, "Mostly Inbound":2/nb_info_ratio,
                                           "Balanced":3/nb_info_ratio, "Not Disclosed":3/nb_info_ratio,
                                           "":3/nb_info_ratio, "Mostly Outbound":4/nb_info_ratio,
                                           "Heavy Outbound":1})

In [425]:
dataset_as['info_ratio'] = info_ratio_scaled
dataset_as['info_ratio'].value_counts()

0.6    5653
0.4    3476
0.8    1078
0.2     803
1.0     397
Name: info_ratio, dtype: int64


- ### Adding the `policy_general` column [AS]

In [427]:
net['policy_general'].value_counts()

Open           9217
Selective      1921
Restrictive     167
                 71
No               31
Name: policy_general, dtype: int64

In [428]:
nb_policy_general = 4
policy_general_scaled = net['policy_general'].map({"Open":1/nb_policy_general, "Selective":2/nb_policy_general,
                                                   "Restrictive":3/nb_policy_general, "No":1,
                                                   "":1/nb_policy_general})

In [429]:
dataset_as['policy_general'] = policy_general_scaled
dataset_as['policy_general'].value_counts()

0.25    9288
0.50    1921
0.75     167
1.00      31
Name: policy_general, dtype: int64

- ### Adding the `info_type` column [AS]

In [430]:
net['info_type'].value_counts()

Cable/DSL/ISP           4993
NSP                     2135
Content                 1226
Not Disclosed           1179
Enterprise               625
Educational/Research     500
Non-Profit               320
Route Server             258
                         122
Network Services          27
Route Collector           12
Government                10
Name: info_type, dtype: int64

In [431]:
info_type_classified = net['info_type'].map({"Cable/DSL/ISP":0, "NSP":1, "Content":2, "":0,
                                             "Not Disclosed":0, "Enterprise":3, "Educational/Research":4,
                                             "Non-Profit":5, "Route Server":6, "Network Services":7,
                                             "Route Collector":8, "Government":9})

In [432]:
dataset_as['info_type'] = info_type_classified
dataset_as['info_type'].value_counts()

0    6294
1    2135
2    1226
3     625
4     500
5     320
6     258
7      27
8      12
9      10
Name: info_type, dtype: int64
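
Integer codes like these imply an ordering between categories (for instance, that "Government" is further from "Cable/DSL/ISP" than "NSP" is), which a model may pick up on. As a possible alternative, not used in this notebook, a one-hot encoding can be sketched with `pd.get_dummies`; the toy Series and the `type` prefix are assumptions for the example.

```python
import pandas as pd

# Toy stand-in for net['info_type'].
info_type = pd.Series(["Cable/DSL/ISP", "NSP", "Content", "NSP"],
                      index=[3303, 2828, 20940, 3257], name="info_type")

# One indicator column per category; no ordering between categories is implied.
one_hot = pd.get_dummies(info_type, prefix="type", dtype=float)
print(one_hot)
```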

In [433]:
dataset_as.head()

asn,info_scope,info_traffic,ix_count,net_count,port_capacity,info_ratio,policy_general,info_type
20940,1.0,1.0,0.515244,22.742338,0.893224,1.0,0.25,2
31800,1.0,0.157895,0.042683,1.17839,0.000253,0.2,0.25,5
22822,1.0,0.736842,0.253049,11.02994,0.272934,0.8,0.5,2
3303,0.666667,0.736842,0.152439,5.11901,0.015179,0.4,0.5,0
6079,0.666667,0.736842,0.018293,0.302696,0.013775,0.4,0.5,0


### Exporting the data:
Table `dataset_as` for the **AS features** <br>
Table `dataset_ixp` for the **IXP features**


In [475]:
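# Export the data
suffix = '_pDB_'

# Note: index=False drops the index (the asn for dataset_as), so rows can only be matched back to nodes by position.
dataset_as.to_csv('data_GCN/dataset_AS'+suffix+'.csv', index=False)
dataset_ixp.to_csv('data_GCN/dataset_IXP'+suffix+'.csv', index=False)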
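
If the node ids are needed downstream, one option is to write the index to the CSV and restore it on load. The sketch below does the round trip in memory with `io.StringIO` standing in for a file; the toy table is an assumption for the example.

```python
import io
import pandas as pd

# Toy feature table indexed by AS number.
dataset_as = pd.DataFrame({"info_scope": [1.0, 2/3]},
                          index=pd.Index([20940, 3303], name="asn"))

buffer = io.StringIO()                  # stands in for a file on disk
dataset_as.to_csv(buffer, index=True)   # keep the asn index as the first column
buffer.seek(0)

restored = pd.read_csv(buffer, index_col="asn")
print(restored.index.tolist())
```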

 Computing the **PageRank** scores

In [None]:
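# Weighted PageRank (nx.pagerank uses the 'weight' edge attribute by default)
pagerank_pondere = nx.pagerank(DiGraph)
# Unweighted PageRank
pagerank = nx.pagerank(DiGraph, weight=None)
# Same two scores on the graph with every edge direction reversed
pagerank_pondere_inverse = nx.pagerank(DiGraph.reverse(copy=True))
pagerank_inverse = nx.pagerank(DiGraph.reverse(copy=True), weight=None)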
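
To see what the four variants measure, here is a sketch on a three-node toy graph (node names and weights are made up). With weights, rank flows preferentially along heavy edges; with `weight=None`, every edge counts the same; reversing the graph swaps the roles of senders and receivers.

```python
import networkx as nx

G = nx.DiGraph()
G.add_weighted_edges_from([
    ("as1", "ixp", 10.0), ("as2", "ixp", 1.0),
    ("ixp", "as1", 1.0), ("ixp", "as2", 10.0),
])

weighted = nx.pagerank(G)                   # uses the 'weight' attribute by default
unweighted = nx.pagerank(G, weight=None)    # every edge counts the same
reverse_weighted = nx.pagerank(G.reverse(copy=True))

for name, scores in [("weighted", weighted), ("unweighted", unweighted),
                     ("reversed", reverse_weighted)]:
    print(name, {n: round(s, 3) for n, s in scores.items()})
```

In this toy graph, `as2` receives most of the IXP's outgoing weight and therefore ranks above `as1` in the weighted score, while the order flips on the reversed graph.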