# Air Traffic Input Data Exploration and Graph Construction for Air Traffic Prediction

This notebook helps readers better understand the input data format of the synthetic air transportation graph and how it meets the requirements of GraphStorm's graph construction command. Readers should first run `Synthetic_Airport_Traffic_wAirlines.ipynb` to generate the three parquet files used in this notebook.

GraphStorm can handle enterprise-level graph data, measured in billions of nodes and tens of billions of edges. This synthetic air transportation graph, however, is relatively small. Therefore, we will use the `graphstorm.gconstruct.construct_graph` command, which runs on a single machine. Details of the input data format can be found in GraphStorm's [Input Raw Data Specification](https://graphstorm.readthedocs.io/en/latest/cli/graph-construction/raw_data.html) documentation. For larger graphs whose construction needs the memory of multiple machines, readers can refer to GraphStorm's [Distributed Graph Construction](https://graphstorm.readthedocs.io/en/latest/cli/graph-construction/distributed/index.html) documentation.

In [19]:
import os
import json
import pandas as pd

## Explore the Input Data

To handle time series data, we store each series as a list in a single column. In this way, a time series column can be treated as an $N \times T$ tensor, where $N$ is the number of samples (rows) and $T$ is the number of days. This tensor can then be used for modeling.
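
As a minimal sketch of the list-per-row layout described above, the snippet below stacks a list-valued column into an $N \times T$ array. The DataFrame `toy_df` and its airport codes are hypothetical stand-ins, not rows from the generated parquet files.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a table such as airport.parquet: one list per row.
# The codes and values here are made up for illustration.
toy_df = pd.DataFrame({
    "iata_code": ["AAA", "BBB", "CCC"],
    "inventory_amounts": [
        [1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0],
    ],
})

# Stack the per-row sequences into an N x T array (N rows, T time steps).
series = np.stack(toy_df["inventory_amounts"].tolist())
print(series.shape)  # (3, 4)
```

The same call should apply to the real columns, such as `airport_node_df["inventory_amounts"]`, provided every row holds a sequence of the same length.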

In [20]:
airport_path = './airport.parquet'
airport_node_df = pd.read_parquet(airport_path)

In [21]:
airport_node_df.sample(7)

Unnamed: 0,iata_code,latitude_deg,longitude_deg,inventory_amounts
67,LUN,-15.330833,28.452722,"[32.0998, 149.0973, 278.5409, 355.3423, 374.61..."
298,ANC,61.179004,-149.992561,"[59.6863, 225.2477, 405.2508, 519.401, 551.395..."
420,SUB,-7.37983,112.787003,"[32.94, 145.8966, 278.7163, 353.1881, 370.8502..."
250,MEX,19.435137,-99.071328,"[36.5487, 166.4547, 312.2333, 373.408, 389.343..."
312,HND,35.552299,139.779999,"[26.7707, 129.2653, 275.964, 353.6493, 368.263..."
154,RDU,35.877602,-78.787498,"[54.9289, 197.2499, 388.1767, 508.6314, 539.00..."
146,ORF,36.895341,-76.201,"[62.4095, 216.1833, 412.1431, 499.6082, 526.24..."


In [22]:
demand_edge_path = './demand_edge.parquet'
demand_edge_df = pd.read_parquet(demand_edge_path)

In [23]:
demand_edge_df.sample(7)

Unnamed: 0,src_code,dst_code,demands
158907,CPH,GYE,"[0.0, 0.75, 0.0, 0.82, 0.0, 0.0, 0.0, 0.0, 0.0..."
42749,UBN,ADD,"[0.77, 0.0, 0.38, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0..."
27113,AEP,GOT,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
44660,YYC,JUB,"[0.0, 0.0, 0.71, 0.56, 0.0, 0.0, 0.0, 0.0, 0.2..."
213372,KWE,KHN,"[0.49, 0.0, 0.81, 1.91, 0.0, 0.22, 0.84, 0.0, ..."
200582,COV,BNE,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
183528,AYT,GOI,"[0.0, 0.49, 0.0, 0.46, 0.0, 0.0, 0.0, 0.0, 0.0..."


In [24]:
traffic_edge_path = 'traffic_edge.parquet'
traffic_edge_df = pd.read_parquet(traffic_edge_path)

In [25]:
traffic_edge_df.sample(7)

Unnamed: 0,src_code,dst_code,capacity,traffics
1744,DFW,DAM,0.25,"[0.0125, 0.0125, 0.0125, 0.0125, 0.0125, 0.012..."
1575,CVG,ORD,6.25,"[0.26, 0.26, 0.26, 0.26, 0.26, 0.26, 0.26, 0.2..."
7188,SMF,FLL,3.25,"[0.0363, 0.0363, 0.0363, 0.0363, 0.0363, 0.036..."
143,AER,LED,3.1875,"[0.0367, 0.0367, 0.0367, 0.0367, 0.0367, 0.036..."
4980,ORD,AMS,0.25,"[0.0, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25, 0.25..."
7957,TYN,HAK,5.25,"[0.0077, 0.0077, 0.0077, 0.0077, 0.0077, 0.007..."
2415,FUK,DNA,1.375,"[0.0017, 0.0017, 0.0017, 0.0017, 0.0017, 0.001..."


### Prepare the JSON file for `graphstorm.gconstruct.construct_graph` command

The `graphstorm.gconstruct.construct_graph` command relies on a JSON file to understand the given graph data, so here we build the JSON file for the synthetic air transportation network. For more details on each field of the JSON file and its format requirements, readers can refer to the [Configuration JSON Object Explanations](https://graphstorm.readthedocs.io/en/latest/cli/graph-construction/single-machine-gconstruct.html#configuration-json-object-explanations).

In [26]:
air_traffic_json = {"version": "gconstruct-v0.1"}

Node objects describe the node types in a graph. Each one gives the path to the node data table and lists the node features and labels, if any. Note that we normalize some features with GraphStorm's built-in transform (feature engineering) functions, which can help GNN models converge faster.

In [27]:
nodes = []
airport = {
    "node_type": "airport",
    "format": {
        "name": "parquet"
    },
    "files": [
        airport_path
    ],
    "node_id_col": "iata_code",
    "features": [
        {
            "feature_col": "latitude_deg",
            "feature_name": "latitude",
            "transform": {"name": "max_min_norm",
                          "max_val": 90.,
                          "min_val": -90.}
        },
        {
            "feature_col": "longitude_deg",
            "feature_name": "longitude",
            "transform": {"name": "max_min_norm",
                          "max_val": 180.,
                          "min_val": -180.}
        },
        {
            "feature_col": "inventory_amounts",
            "feature_name": "inventory_amounts",
            "transform": {"name": "max_min_norm",
                          "max_val": 1000.,
                          "min_val": 0.}
        }
    ],
    "labels": [
        {
            "label_col": "inventory_amounts",
            "task_type": "regression",
            "split_pct": [
                0.8,
                0.1,
                0.1
            ]
        }
    ]
}

nodes.append(airport)
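
The `max_min_norm` transforms in the node configuration above rescale each feature with explicit bounds. The sketch below illustrates the idea in plain Python, assuming values are clipped to `[min_val, max_val]` and then mapped to `[0, 1]`. The function `max_min_norm_sketch` is a hypothetical helper for illustration only, not GraphStorm's implementation.

```python
def max_min_norm_sketch(values, max_val, min_val):
    """Clip each value to [min_val, max_val], then rescale to [0, 1].

    Illustrative only: this mirrors the usual min-max formula
    (v - min_val) / (max_val - min_val) and is an assumption about
    how the configured transform behaves.
    """
    span = max_val - min_val
    return [(min(max(v, min_val), max_val) - min_val) / span for v in values]


# Latitude bounds from the configuration: -90 to 90 degrees.
latitudes = max_min_norm_sketch([-90.0, 0.0, 61.179004], max_val=90.0, min_val=-90.0)

# Inventory bounds from the configuration: 0 to 1000.
# Under the clipping assumption, a value above max_val maps to 1.0.
inventories = max_min_norm_sketch([0.0, 500.0, 1200.0], max_val=1000.0, min_val=0.0)
print(inventories)  # [0.0, 0.5, 1.0]
```

Fixing the bounds in the configuration, rather than deriving them from the data, keeps the scaling the same regardless of which rows are present.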

Edge objects are similar to node objects, except that they have a "relation" field that records the source node type, edge type, and destination node type in a list.

In [28]:
edges = []
ap_demand_ap = {
    "relation": [
        "airport",
        "demand",
        "airport"
    ],
    "format": {
        "name": "parquet"
    },
    "files": [
        demand_edge_path
    ],
    "source_id_col": "src_code",
    "dest_id_col": "dst_code",
    "features": [
        {
            "feature_col": "demands",
            "feature_name": "demands"
        }
    ]
}
ap_traffic_ap = {
    "relation": [
        "airport",
        "traffic",
        "airport"
    ],
    "format": {
        "name": "parquet"
    },
    "files": [
        traffic_edge_path
    ],
    "source_id_col": "src_code",
    "dest_id_col": "dst_code",
    "features": [
        {
            "feature_col": "capacity",
            "feature_name": "capacity"
        },
        {
            "feature_col": "traffics",
            "feature_name": "traffics"
        }
    ]
}
edges.append(ap_demand_ap)
edges.append(ap_traffic_ap)
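
Because the edge tables refer to airports by the same codes used in `node_id_col`, it may be worth checking that every edge endpoint appears in the node table before running graph construction. The helper below, `missing_endpoints`, is a hypothetical sanity check and not part of GraphStorm.

```python
def missing_endpoints(node_ids, src_ids, dst_ids):
    """Return the sorted edge endpoint IDs that are absent from node_ids."""
    known = set(node_ids)
    return sorted((set(src_ids) | set(dst_ids)) - known)


# Toy data; "XXX" is a made-up code that has no matching node.
missing = missing_endpoints(
    node_ids=["ANC", "MEX", "HND"],
    src_ids=["ANC", "MEX"],
    dst_ids=["HND", "XXX"],
)
print(missing)  # ['XXX']
```

In this notebook, the inputs would be `airport_node_df["iata_code"]` together with the `src_code` and `dst_code` columns of `demand_edge_df` or `traffic_edge_df`. An empty result means that all endpoints are covered.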

In [29]:
air_traffic_json['nodes'] = nodes
air_traffic_json['edges'] = edges

In [30]:
# Save to a local file, named config.json
with open(os.path.join("config.json"), "w") as f:
    json.dump(air_traffic_json, f, indent=4)
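
As a quick check of the saved configuration, the sketch below lists the node types and relations found in a configuration dictionary. The function `summarize_config` and the `toy_conf` dictionary are hypothetical. In the notebook, the dictionary would come from `json.load` on `config.json`.

```python
import json


def summarize_config(conf):
    """Return (node types, relation triples) declared in a config dict."""
    node_types = [node["node_type"] for node in conf.get("nodes", [])]
    relations = [tuple(edge["relation"]) for edge in conf.get("edges", [])]
    return node_types, relations


# Round-trip a small config through JSON text to mimic reading it from disk.
toy_conf = json.loads(json.dumps({
    "version": "gconstruct-v0.1",
    "nodes": [{"node_type": "airport"}],
    "edges": [
        {"relation": ["airport", "demand", "airport"]},
        {"relation": ["airport", "traffic", "airport"]},
    ],
}))

node_types, relations = summarize_config(toy_conf)
print(node_types)  # ['airport']
print(relations)   # [('airport', 'demand', 'airport'), ('airport', 'traffic', 'airport')]
```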

## Run the GraphStorm `gconstruct` Command to Process the Air Traffic Data

With the three parquet files and the `config.json` file in place, we can now run graph construction.

First, let's install GraphStorm and its dependencies, assuming a CPU machine.

In [31]:
!pip install graphstorm

# If using CPU instances
!pip install torch==2.1.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
!pip install dgl==1.1.3 -f https://data.dgl.ai/wheels-internal/repo.html

Now, run the `gconstruct.construct_graph` command with a few arguments. Because the air traffic graph is relatively small, we split the data into only one partition, i.e., `--num-parts 1`. The output is saved to a folder named `gs_1p`, which is used for model training and inference in the `AirTrafficPrediction.ipynb` notebook.

In [32]:
!python -m graphstorm.gconstruct.construct_graph \
           --conf-file config.json \
           --output-dir gs_1p/ \
           --num-parts 1 \
           --graph-name air_traffic

INFO:root:The graph has 1 node types and 2 edge types.
INFO:root:Node type airport has 471 nodes
INFO:root:Edge type ('airport', 'demand', 'airport') has 221370 edges
INFO:root:Edge type ('airport', 'traffic', 'airport') has 8408 edges
INFO:root:Node type airport has features: ['latitude', 'longitude', 'inventory_amounts', 'train_mask', 'val_mask', 'test_mask'].
INFO:root:Train/val/test on airport with mask train_mask, val_mask, test_mask: 376, 47, 47
INFO:root:Note: Custom train, validate, test mask information for nodes are not collected.
INFO:root:Edge type ('airport', 'demand', 'airport') has features: ['demands'].
INFO:root:Edge type ('airport', 'traffic', 'airport') has features: ['capacity', 'traffics'].
The graph has 1 node types and balance among 4 types
Converting to homogeneous graph takes 0.003s, peak mem: 5.120 GB
Save partitions: 0.022 seconds, peak memory: 6.127 GB
There are 229778 edges in the graph and 0 edge cuts for 1 partitions.
INFO:root:Graph construction generate
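
The log above reports 376, 47, and 47 airports in the train, validation, and test masks, out of 471 nodes. One reading consistent with these numbers is that each count is the node total multiplied by its `split_pct` entry and truncated to an integer. This is an assumption, as the log does not say how the counts are computed. The arithmetic is sketched below.

```python
num_nodes = 471              # from the log: "Node type airport has 471 nodes"
split_pct = [0.8, 0.1, 0.1]  # from the "labels" entry in the node config

# Assumption: each split size is truncated toward zero.
counts = [int(num_nodes * pct) for pct in split_pct]
unassigned = num_nodes - sum(counts)

print(counts)      # [376, 47, 47]
print(unassigned)  # 1
```

Under this reading, one node falls outside all three masks because of the truncation.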