# Air Traffic Input Data Exploration and Graph Construction for GraphStorm

This notebook helps readers get a better idea of the input data format that GraphStorm's graph construction commands require. For more details on the input data format, readers can refer to GraphStorm's [Input Raw Data Specification](https://graphstorm.readthedocs.io/en/latest/cli/graph-construction/raw_data.html) documentation.

GraphStorm can handle enterprise-level graph data, measured in billions of nodes and tens of billions of edges. This synthetic air transportation network, however, is relatively small, so we will use the `graphstorm.gconstruct.construct_graph` command, which runs on a single machine. For larger graphs whose construction might exceed a single machine's memory, readers can refer to GraphStorm's [Distributed Graph Construction](https://graphstorm.readthedocs.io/en/latest/cli/graph-construction/distributed/index.html) documentation.

In [1]:
import os
import json
import pandas as pd

## Explore the Input Data

In [2]:
airport_path = './airport.parquet'
airport_node_df = pd.read_parquet(airport_path)

In [3]:
airport_node_df.sample(7)

| | iata_code | latitude_deg | longitude_deg | inventory_amounts |
|---|---|---|---|---|
| 25 | MUC | 48.353802 | 11.7861 | [38.0836, 149.0189, 309.1679, 382.152, 392.400... |
| 311 | SDJ | 38.139702 | 140.917007 | [31.3694, 146.7459, 293.8917, 368.27, 383.98, ... |
| 326 | BEL | -1.379279 | -48.476207 | [34.9218, 150.4731, 298.6603, 357.8567, 372.01... |
| 120 | DEN | 39.861698 | -104.672997 | [55.7827, 207.3755, 396.9583, 506.4432, 537.53... |
| 264 | VRA | 23.034401 | -81.435303 | [29.8999, 155.3827, 316.1508, 389.1257, 405.43... |
| 159 | SAT | 29.533701 | -98.469803 | [64.7669, 213.8917, 410.0118, 519.9839, 547.92... |
| 404 | HYD | 17.231318 | 78.429855 | [37.6584, 141.7291, 281.9312, 342.3924, 358.37... |
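
The `inventory_amounts` column in the table above holds a sequence of values per airport rather than a scalar, and the display truncates it. As a minimal sketch, the snippet below uses a tiny made-up DataFrame with the same column names (the values and the vector length are illustrative, not taken from the real file, and `summarize_node_table` is a hypothetical helper) to show one way of confirming that node IDs are unique and that all vectors share the same length:

```python
import pandas as pd

# Hypothetical stand-in for airport_node_df; the real table comes from airport.parquet.
toy_airport_df = pd.DataFrame({
    "iata_code": ["MUC", "DEN", "SAT"],
    "latitude_deg": [48.353802, 39.861698, 29.533701],
    "longitude_deg": [11.7861, -104.672997, -98.469803],
    "inventory_amounts": [
        [38.0, 149.0, 309.0],
        [55.0, 207.0, 396.0],
        [64.0, 213.0, 410.0],
    ],
})

def summarize_node_table(df, id_col, vector_col):
    """Return (ids_are_unique, set of vector lengths) for a node table."""
    ids_are_unique = bool(df[id_col].is_unique)
    vector_lengths = set(df[vector_col].map(len))
    return ids_are_unique, vector_lengths

ids_unique, lengths = summarize_node_table(
    toy_airport_df, "iata_code", "inventory_amounts"
)
print(f"unique IDs: {ids_unique}, vector lengths: {sorted(lengths)}")
```

On the real data, the same helper could be called as `summarize_node_table(airport_node_df, "iata_code", "inventory_amounts")`.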


In [8]:
demand_edge_path = './demand_edge.parquet'
demand_edge_df = pd.read_parquet(demand_edge_path)

In [9]:
demand_edge_df.sample(7)

| | src_code | dst_code | demands |
|---|---|---|---|
| 146026 | BSB | ITM | [0.0, 0.0, 0.37, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,... |
| 43840 | LAX | HRG | [0.0, 0.03, 0.0, 0.26, 0.0, 0.0, 0.0, 0.0, 0.0... |
| 75580 | ZIA | SAV | [0.0, 0.0, 0.42, 0.0, 0.0, 0.0, 0.0, 0.0, 0.02... |
| 205054 | MCO | TSN | [0.0, 0.0, 0.16, 0.0, 0.12, 0.99, 0.0, 0.0, 0.... |
| 216086 | SJU | PVG | [0.72, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,... |
| 125803 | MWX | BZE | [0.61, 0.0, 0.0, 0.17, 0.0, 0.0, 0.0, 0.0, 0.0... |
| 13914 | BEY | HAJ | [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |


In [10]:
traffic_edge_path = 'traffic_edge.parquet'
traffic_edge_df = pd.read_parquet(traffic_edge_path)

In [11]:
traffic_edge_df.sample(7)

| | src_code | dst_code | capacity | traffics |
|---|---|---|---|---|
| 6850 | SEA | JFK | 2.25 | [0.0485, 0.0485, 0.0485, 0.0485, 0.0485, 0.048... |
| 7363 | SNA | SDF | 2.0 | [0.0028, 0.0028, 0.0028, 0.0028, 0.0028, 0.002... |
| 8489 | ZAG | PVG | 0.0625 | [0.0407, 0.0407, 0.0407, 0.0407, 0.0407, 0.040... |
| 4008 | LHW | TAO | 3.0 | [1.08, 1.08, 1.08, 1.08, 1.08, 1.08, 1.08, 1.0... |
| 4059 | LRM | KWL | 0.0625 | [0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.062... |
| 3554 | KWE | RUH | 0.25 | [0.0275, 0.0275, 0.0275, 0.0275, 0.0275, 0.027... |
| 3246 | JED | CLE | 0.25 | [0.155, 0.155, 0.155, 0.155, 0.155, 0.155, 0.1... |
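
Both edge tables identify their endpoints by the same airport codes that the node table uses as IDs. As a hedged sketch, the snippet below looks for edges whose `src_code` or `dst_code` does not appear among the node IDs. The toy tables and the `find_dangling_edges` helper are hypothetical illustrations; how graph construction itself treats such rows is not covered here.

```python
import pandas as pd

# Hypothetical toy tables that reuse the column names of the real ones.
toy_nodes = pd.DataFrame({"iata_code": ["SEA", "JFK", "LAX"]})
toy_edges = pd.DataFrame({
    "src_code": ["SEA", "LAX", "SEA"],
    "dst_code": ["JFK", "SEA", "XXX"],  # "XXX" is deliberately not a known airport
})

def find_dangling_edges(edge_df, node_ids, src_col="src_code", dst_col="dst_code"):
    """Return the rows of edge_df whose source or destination is not in node_ids."""
    known = set(node_ids)
    bad = ~edge_df[src_col].isin(known) | ~edge_df[dst_col].isin(known)
    return edge_df[bad]

dangling = find_dangling_edges(toy_edges, toy_nodes["iata_code"])
print(dangling)
```

On the real data, the equivalent calls would be `find_dangling_edges(demand_edge_df, airport_node_df["iata_code"])` and the same for `traffic_edge_df`.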


### Prepare the JSON file for `graphstorm.gconstruct.construct_graph` command

The `graphstorm.gconstruct.construct_graph` command relies on a JSON file to understand the given graph data, so here we provide the JSON file for the synthetic air transportation network. For more details on each field of the JSON file and its format requirements, readers can refer to the [Configuration JSON Object Explanations](https://graphstorm.readthedocs.io/en/latest/cli/graph-construction/single-machine-gconstruct.html#configuration-json-object-explanations).

In [12]:
air_traffic_json = {"version": "gconstruct-v0.1"}

Node objects record the node types in a graph, where the node data tables are stored, and the node features and labels, if any. Note that we normalize some features here using GraphStorm's built-in feature engineering functions, which can help GNN models converge faster.
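
To make the `max_min_norm` transform used in the node configuration more concrete, here is a minimal NumPy sketch. It assumes the transform clips values to `[min_val, max_val]` and then rescales them as `(x - min_val) / (max_val - min_val)`. This reflects our understanding of the documented behavior and is not GraphStorm's actual implementation; the function name `max_min_norm_sketch` is hypothetical.

```python
import numpy as np

def max_min_norm_sketch(values, min_val, max_val):
    """Clip values to [min_val, max_val], then rescale linearly to [0, 1]."""
    values = np.asarray(values, dtype=np.float64)
    clipped = np.clip(values, min_val, max_val)
    return (clipped - min_val) / (max_val - min_val)

# Latitudes span [-90, 90], matching the bounds chosen for "latitude_deg".
latitudes = np.array([48.353802, -1.379279, 90.0, -90.0])
print(max_min_norm_sketch(latitudes, min_val=-90.0, max_val=90.0))
```

Fixing the bounds to the physical range of latitude and longitude, rather than to the observed minimum and maximum, keeps the scaling independent of which airports happen to be in the data.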

In [13]:
nodes = []
airport = {
    "node_type": "airport",
    "format": {
        "name": "parquet"
    },
    "files": [
        airport_path
    ],
    "node_id_col": "iata_code",
    "features": [
        {
            "feature_col": "latitude_deg",
            "feature_name": "latitude",
            "transform": {"name": "max_min_norm",
                          "max_val": 90.,
                          "min_val": -90.}
        },
        {
            "feature_col": "longitude_deg",
            "feature_name": "longitude",
            "transform": {"name": "max_min_norm",
                          "max_val": 180.,
                          "min_val": -180.}
        },
        {
            "feature_col": "inventory_amounts",
            "feature_name": "inventory_amounts",
            "transform": {"name": "max_min_norm",
                          "max_val": 1000.,
                          "min_val": 0.}
        }
    ],
    "labels": [
        {
            "label_col": "inventory_amounts",
            "task_type": "regression",
            "split_pct": [
                0.8,
                0.1,
                0.1
            ]
        }
    ]
}

nodes.append(airport)
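
The `split_pct` list gives the fractions of labeled nodes assigned to the training, validation, and test sets. The construction log later in this notebook reports 471 airport nodes and a 376/47/47 split. The sketch below reproduces those counts under the assumption that each count is truncated to an integer. That assumption is inferred from the logged numbers, not taken from GraphStorm's source, and `split_counts` is a hypothetical helper.

```python
def split_counts(num_items, split_pct):
    """Return int(num_items * pct) for each fraction in split_pct."""
    return [int(num_items * pct) for pct in split_pct]

counts = split_counts(471, [0.8, 0.1, 0.1])
print(counts, "unassigned:", 471 - sum(counts))
```

Under this assumption the three sets cover 470 of the 471 nodes, so one node ends up in none of them.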

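Edge objects record the edge types in a graph, where the edge data tables are stored, and the edge features, if any.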

In [None]:
edges = []
ap_demand_ap = {
    "relation": [
        "airport",
        "demand",
        "airport"
    ],
    "format": {
        "name": "parquet"
    },
    "files": [
        demand_edge_path
    ],
    "source_id_col": "src_code",
    "dest_id_col": "dst_code",
    "features": [
        {
            "feature_col": "demands",
            "feature_name": "demands"
        }
    ]
}
ap_traffic_ap = {
    "relation": [
        "airport",
        "traffic",
        "airport"
    ],
    "format": {
        "name": "parquet"
    },
    "files": [
        traffic_edge_path
    ],
    "source_id_col": "src_code",
    "dest_id_col": "dst_code",
    "features": [
        {
            "feature_col": "capacity",
            "feature_name": "capacity"
        },
        {
            "feature_col": "traffics",
            "feature_name": "traffics"
        }
    ]
}
edges.append(ap_demand_ap)
edges.append(ap_traffic_ap)

In [15]:
air_traffic_json['nodes'] = nodes
air_traffic_json['edges'] = edges

In [16]:
with open("config.json", "w") as f:
    json.dump(air_traffic_json, f, indent=4)
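
Before running graph construction, it can help to confirm that every column named in the configuration actually exists in the corresponding data table. The following is a minimal sketch of such a check. The helper names `referenced_columns` and `missing_columns` are hypothetical, and the check only covers the configuration keys used in this notebook.

```python
def referenced_columns(entry):
    """Collect every column name that a node or edge entry expects in its files."""
    cols = []
    for key in ("node_id_col", "source_id_col", "dest_id_col"):
        if key in entry:
            cols.append(entry[key])
    cols += [feat["feature_col"] for feat in entry.get("features", [])]
    cols += [lbl["label_col"] for lbl in entry.get("labels", []) if "label_col" in lbl]
    return cols

def missing_columns(entry, available_columns):
    """Return the referenced columns that are absent from available_columns."""
    available = set(available_columns)
    return [col for col in referenced_columns(entry) if col not in available]

# Toy entry mirroring the structure of the airport node object above.
toy_entry = {
    "node_type": "airport",
    "node_id_col": "iata_code",
    "features": [{"feature_col": "latitude_deg", "feature_name": "latitude"}],
    "labels": [{"label_col": "inventory_amounts", "task_type": "regression"}],
}
toy_columns = ["iata_code", "latitude_deg"]  # "inventory_amounts" left out on purpose
print(missing_columns(toy_entry, toy_columns))
```

In this notebook, the check could be applied as `missing_columns(airport, airport_node_df.columns)`, and likewise for each edge object with its edge DataFrame.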

## Run the GraphStorm `gconstruct` Command to Process the Air Traffic Data

In [1]:
!pip install graphstorm

# If using CPU instances
!pip install torch==2.1.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
!pip install dgl==1.1.3 -f https://data.dgl.ai/wheels-internal/repo.html

Looking in indexes: https://download.pytorch.org/whl/cpu
Looking in links: https://data.dgl.ai/wheels-internal/repo.html


In [17]:
!python -m graphstorm.gconstruct.construct_graph \
           --conf-file config.json \
           --output-dir gs_1p/ \
           --num-parts 1 \
           --graph-name air_traffic

INFO:root:The graph has 1 node types and 2 edge types.
INFO:root:Node type airport has 471 nodes
INFO:root:Edge type ('airport', 'demand', 'airport') has 221370 edges
INFO:root:Edge type ('airport', 'traffic', 'airport') has 8512 edges
INFO:root:Node type airport has features: ['latitude', 'longitude', 'inventory_amounts', 'train_mask', 'val_mask', 'test_mask'].
INFO:root:Train/val/test on airport with mask train_mask, val_mask, test_mask: 376, 47, 47
INFO:root:Note: Custom train, validate, test mask information for nodes are not collected.
INFO:root:Edge typ