# Notebook 0: Data Preparation

This notebook creates an example graph dataset that the other notebooks use to demonstrate how to program with GraphStorm APIs.
The example graph data comes from [DGL's ACM publication dataset](https://data.dgl.ai/dataset/ACM.mat), which is the same data explained in the [Use Your Own Data tutorial](https://graphstorm.readthedocs.io/en/latest/tutorials/own-data.html).

-----

## Prerequisites
This notebook assumes the following:
- Python 3;
- Linux OS, Ubuntu or Amazon Linux;
- GraphStorm and its dependencies (following the [Setup GraphStorm with pip packages tutorial](https://graphstorm.readthedocs.io/en/latest/install/env-setup.html#setup-graphstorm-with-pip-packages));
- [Jupyter web interactive server](https://jupyter.org/).

Users can run the following cell to check whether GraphStorm can be imported in the current environment.

In [1]:
import graphstorm as gs
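
If the import above fails, it may not be obvious which prerequisite is missing. The sketch below uses only the Python standard library to print the interpreter version and OS, and to report whether a few packages can be located without importing them. The package list is an assumption: `dgl` and `torch` are included because GraphStorm builds on DGL and PyTorch, and the helper name `check_prerequisites` is hypothetical.

```python
import importlib.util
import platform
import sys


def check_prerequisites(packages=("graphstorm", "dgl", "torch")):
    """Report which of the given top-level packages can be located.

    The default package list is an assumption made for illustration.
    """
    found = {}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        found[name] = importlib.util.find_spec(name) is not None
    return found


print(f"Python {sys.version_info.major}.{sys.version_info.minor} on {platform.system()}")
for name, ok in check_prerequisites().items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```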

## Download Data Generation Script
GraphStorm provides a Python script that downloads the DGL ACM publication data and converts it for use with GraphStorm. First, let's download the script file from the [GraphStorm GitHub repository](https://github.com/awslabs/graphstorm).

In [2]:
!wget -O ./acm_data.py https://github.com/awslabs/graphstorm/raw/main/examples/acm_data.py

## Generate ACM Raw Table Data
Then we can use the command below to build the raw table data, which is the standard input format for GraphStorm's `gconstruct` module.

In [None]:
!python ./acm_data.py --output-path ./acm_raw --output-type raw_w_text

## Construct GraphStorm Input Graph Data
With the raw ACM tables, we can then use GraphStorm's graph construction module to prepare the ACM graph for the other notebooks. The graph construction module performs the following steps:
- reads in the raw data and converts it to a DGL graph;
- splits the DGL graph into multiple partitions that form a distributed DGL graph;
- produces node ID mapping files and other supporting files.

For the GraphStorm Standalone mode, we only need one partition. Therefore, in the command below we set `--num-parts` to `1`. For the other arguments, users can refer to [GraphStorm Graph Construction arguments](https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html).

In [4]:
!python -m graphstorm.gconstruct.construct_graph \
          --conf-file ./acm_raw/config.json \
          --output-dir ./acm_gs_1p \
          --num-parts 1 \
          --graph-name acm

#### 3-Partition Input Data
To better illustrate the input data structure that GraphStorm requires, we can use the following command to create a 3-partition version of the input data.

In [5]:
!python -m graphstorm.gconstruct.construct_graph \
          --conf-file ./acm_raw/config.json \
          --output-dir ./acm_gs_3p \
          --num-parts 3 \
          --graph-name acm
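
The two commands above differ only in `--output-dir` and `--num-parts`. As a minimal sketch, the hypothetical helper `build_construct_cmd` below assembles the same command line in Python from the four arguments shown above, so that it can be reused for other partition counts. It only builds the argument list; actually running the command requires GraphStorm and the raw data generated in the previous step.

```python
import sys


def build_construct_cmd(conf_file, output_dir, num_parts, graph_name):
    """Assemble the construct_graph command used in the cells above.

    Hypothetical helper: it only builds the argument list and runs nothing.
    """
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    return [
        sys.executable, "-m", "graphstorm.gconstruct.construct_graph",
        "--conf-file", conf_file,
        "--output-dir", output_dir,
        "--num-parts", str(num_parts),
        "--graph-name", graph_name,
    ]


cmd = build_construct_cmd("./acm_raw/config.json", "./acm_gs_3p", 3, "acm")
print(" ".join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```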

## Data Exploration and Explanation
The above commands created two sets of ACM data: the raw ACM data tables and the ACM GraphStorm input graphs. Below we explore these datasets and explain their formats so that users can prepare their own graph data easily.

### Raw ACM Table Data in the `./acm_raw` Folder
We can explore the `acm_raw` folder with the `ls -al` command. 

In [6]:
!ls -al ./acm_raw

total 24
drwxrwxr-x 4 ubuntu ubuntu 4096 Dec 19 21:27 .
drwxrwxr-x 6 ubuntu ubuntu 4096 Dec 19 21:29 ..
-rw-rw-r-- 1 ubuntu ubuntu 5306 Dec 19 21:27 config.json
drwxrwxr-x 2 ubuntu ubuntu 4096 Dec 19 21:27 edges
drwxrwxr-x 2 ubuntu ubuntu 4096 Dec 19 21:27 nodes


In [7]:
!ls -al ./acm_raw/nodes

total 38744
drwxrwxr-x 2 ubuntu ubuntu     4096 Dec 19 21:27 .
drwxrwxr-x 4 ubuntu ubuntu     4096 Dec 19 21:27 ..
-rw-rw-r-- 1 ubuntu ubuntu 18843566 Dec 19 21:27 author.parquet
-rw-rw-r-- 1 ubuntu ubuntu 20704514 Dec 19 21:27 paper.parquet
-rw-rw-r-- 1 ubuntu ubuntu   113462 Dec 19 21:27 subject.parquet


In [8]:
!ls -al ./acm_raw/edges

total 1016
drwxrwxr-x 2 ubuntu ubuntu   4096 Dec 19 21:27 .
drwxrwxr-x 4 ubuntu ubuntu   4096 Dec 19 21:27 ..
-rw-rw-r-- 1 ubuntu ubuntu 263138 Dec 19 21:27 author_writing_paper.parquet
-rw-rw-r-- 1 ubuntu ubuntu 156358 Dec 19 21:27 paper_cited_paper.parquet
-rw-rw-r-- 1 ubuntu ubuntu 162714 Dec 19 21:27 paper_citing_paper.parquet
-rw-rw-r-- 1 ubuntu ubuntu  87792 Dec 19 21:27 paper_is-about_subject.parquet
-rw-rw-r-- 1 ubuntu ubuntu 265948 Dec 19 21:27 paper_written-by_author.parquet
-rw-rw-r-- 1 ubuntu ubuntu  84005 Dec 19 21:27 subject_has_paper.parquet


#### Graph Description JSON File `config.json`

The `acm_raw` folder includes a `config.json` file that describes the table-based raw graph data. Besides a **version** object, the JSON file contains a **nodes** object and an **edges** object.

The **nodes** object contains a list of *node* objects, each of which includes a set of properties that describe one node type in the graph. For example, the `config.json` file defines a node type called "paper". For each node type, GraphStorm defines a few other properties, such as **format**, **files**, and **features**.

Similarly, the **edges** object contains a list of *edge* objects. Most *edge* properties are the same as those of a *node*, except that an *edge* object has a **relation** property that defines an edge type in a canonical format, i.e., *source node type*, *relation type*, and *destination node type*.

For a full list of the JSON configuration properties, users can refer to the [GraphStorm Graph Construction JSON Explanations](https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations).

To use their own graph data, users need to prepare their own JSON file.

In [6]:
!cat ./acm_raw/config.json
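
To make the structure described above concrete, here is a minimal sketch of such a configuration, built as a Python dictionary and serialized to JSON. The top-level **version**, **nodes**, and **edges** objects and the **format**, **files**, **features**, and **relation** properties come from the description above. The remaining key names (`node_type`, `node_id_col`, `feature_col`, `source_id_col`, `dest_id_col`), the shape of the `format` value, the version string, and the file paths are assumptions; check them against the printed `config.json` and the linked JSON reference before relying on them.

```python
import json

# Entries marked "assumed" are not confirmed by the text above.
config = {
    "version": "gconstruct-v0.1",  # assumed version string
    "nodes": [
        {
            "node_type": "paper",                    # assumed key name
            "format": {"name": "parquet"},           # assumed value shape
            "files": ["nodes/paper.parquet"],        # illustrative path
            "node_id_col": "node_id",                # assumed key name
            "features": [{"feature_col": "feat"}],   # assumed key name
        }
    ],
    "edges": [
        {
            # Canonical edge type: source node type, relation type, destination node type.
            "relation": ["paper", "citing", "paper"],
            "format": {"name": "parquet"},           # assumed value shape
            "files": ["edges/paper_citing_paper.parquet"],  # illustrative path
            "source_id_col": "source_id",            # assumed key name
            "dest_id_col": "dest_id",                # assumed key name
        }
    ],
}

# Every relation should reference node types that are declared under "nodes".
node_types = {node["node_type"] for node in config["nodes"]}
for edge in config["edges"]:
    src_type, _, dst_type = edge["relation"]
    assert src_type in node_types and dst_type in node_types

print(json.dumps(config, indent=4))
```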

#### Raw ACM Tables in the `nodes/` and `edges/` Folders
As defined in the `./acm_raw/config.json` file, the node data files are stored in the `./acm_raw/nodes/` folder, and the edge data files are stored in the `./acm_raw/edges/` folder. A general description of these files can be found in [Input raw node/edge data files](https://graphstorm.readthedocs.io/en/latest/tutorials/own-data.html#input-raw-node-edge-data-files). Here, we read one node table ("paper") and one edge table (\["paper", "citing", "paper"\]) to learn more about them.

In [10]:
import pandas as pd

paper_node_path = './acm_raw/nodes/paper.parquet'
paper_citing_paper_edge_path = './acm_raw/edges/paper_citing_paper.parquet'

**The "paper" node table**

The paper node table can be read in as a Pandas DataFrame. The table has a few columns, whose names are used in the `config.json`. For the "paper" nodes, there is a `node_id` column containing a unique identifier for each node, a `feat` column containing a 256D numerical tensor for each node, a `text` column containing a free-text feature for each node, and a `label` column containing an integer that indicates the class assigned to each node.

The other two node types, "author" and "subject", have similar data tables. Users can explore them with code similar to the cell below.

In [11]:
paper_node_df = pd.read_parquet(paper_node_path)

print(paper_node_df.shape)
paper_node_df.sample(4)

(12499, 4)


| | node_id | label | feat | text |
|---|---|---|---|---|
| 4011 | p4011 | 4 | [0.012342263, -0.01471429, -0.012913096, 0.007... | 'User behavior driven ranking without editoria... |
| 11379 | p11379 | 12 | [-0.012718345, 0.020719944, -0.010691697, 0.00... | 'Reducing truth-telling online mechanisms to o... |
| 9401 | p9401 | 8 | [-0.013923097, 0.017362924, -0.009770028, -0.0... | 'The lazy adversary conjecture fails We prove ... |
| 4928 | p4928 | 1 | [0.019353714, 0.0066366955, 0.0115322415, 0.01... | 'Privacy preserving schema and data matching ... |


**The (paper, citing, paper) edge table**

The "paper, citing, paper" edge table can also be read in as a Pandas DataFrame. It has three columns. The `source_id` and `dest_id` columns contain the same identifiers as those listed in the "paper" node table. The `label` column is a placeholder used for splitting the "paper, citing, paper" edges for a link prediction task.

In [12]:
pcp_edge_df = pd.read_parquet(paper_citing_paper_edge_path)

print(pcp_edge_df.shape)
pcp_edge_df.sample(4)

(30789, 3)


| | source_id | dest_id | label |
|---|---|---|---|
| 28779 | p11255 | p12232 | 1.0 |
| 2791 | p704 | p6747 | 1.0 |
| 429 | p119 | p16 | 1.0 |
| 7301 | p2354 | p8747 | 1.0 |
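
Because `source_id` and `dest_id` must refer to identifiers that exist in the "paper" node table, a quick consistency check is useful when preparing your own data. The sketch below illustrates such a check on a few toy rows using plain Python sets. `find_dangling_edges` is a hypothetical helper, and the toy IDs are made up for illustration.

```python
def find_dangling_edges(node_ids, edges):
    """Return the edges whose source or destination is not a known node ID."""
    known = set(node_ids)
    return [(src, dst) for src, dst in edges if src not in known or dst not in known]


# Toy data for illustration only; not taken from the ACM tables.
toy_node_ids = ["p0", "p1", "p2"]
toy_edges = [("p0", "p1"), ("p2", "p0"), ("p1", "p9")]

print(find_dangling_edges(toy_node_ids, toy_edges))  # [('p1', 'p9')]
```

On the real DataFrames, the same check can be expressed with `Series.isin`, for example `pcp_edge_df['source_id'].isin(paper_node_df['node_id']).all()`.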


### GraphStorm Input Graph Data in the `./acm_gs_*p/` Folder

In the above cells, we created a 1-partition graph in the `acm_gs_1p` folder and a 3-partition graph in the `acm_gs_3p` folder. The contents of the two folders are nearly the same, including:

1. a GraphStorm partition configuration JSON file;
2. mapping files from the original node ID space to the GraphStorm node ID space, created during graph processing;
3. mapping files from the GraphStorm node ID space to the shuffled node ID space, created during graph partitioning;
4. label statistics files.

In [13]:
!ls -al ./acm_gs_1p

total 1884
drwxrwxr-x 3 ubuntu ubuntu    4096 Dec 19 21:28 .
drwxrwxr-x 6 ubuntu ubuntu    4096 Dec 19 21:29 ..
-rw-rw-r-- 1 ubuntu ubuntu    1673 Dec 19 21:28 acm.json
-rw-rw-r-- 1 ubuntu ubuntu  213402 Dec 19 21:28 author_id_remap.parquet
-rw-rw-r-- 1 ubuntu ubuntu     191 Dec 19 21:28 edge_label_stats.json
-rw-rw-r-- 1 ubuntu ubuntu 1287802 Dec 19 21:28 edge_mapping.pt
-rw-rw-r-- 1 ubuntu ubuntu     515 Dec 19 21:28 node_label_stats.json
-rw-rw-r-- 1 ubuntu ubuntu  241655 Dec 19 21:28 node_mapping.pt
-rw-rw-r-- 1 ubuntu ubuntu  150409 Dec 19 21:28 paper_id_remap.parquet
drwxrwxr-x 2 ubuntu ubuntu    4096 Dec 19 21:28 part0
-rw-rw-r-- 1 ubuntu ubuntu    2934 Dec 19 21:28 subject_id_remap.parquet


In [14]:
!ls -al ./acm_gs_3p

total 1892
drwxrwxr-x 5 ubuntu ubuntu    4096 Dec 19 21:29 .
drwxrwxr-x 6 ubuntu ubuntu    4096 Dec 19 21:29 ..
-rw-rw-r-- 1 ubuntu ubuntu    3319 Dec 19 21:29 acm.json
-rw-rw-r-- 1 ubuntu ubuntu  213402 Dec 19 21:29 author_id_remap.parquet
-rw-rw-r-- 1 ubuntu ubuntu     191 Dec 19 21:29 edge_label_stats.json
-rw-rw-r-- 1 ubuntu ubuntu 1287802 Dec 19 21:29 edge_mapping.pt
-rw-rw-r-- 1 ubuntu ubuntu     515 Dec 19 21:29 node_label_stats.json
-rw-rw-r-- 1 ubuntu ubuntu  241655 Dec 19 21:29 node_mapping.pt
-rw-rw-r-- 1 ubuntu ubuntu  150409 Dec 19 21:29 paper_id_remap.parquet
drwxrwxr-x 2 ubuntu ubuntu    4096 Dec 19 21:29 part0
drwxrwxr-x 2 ubuntu ubuntu    4096 Dec 19 21:29 part1
drwxrwxr-x 2 ubuntu ubuntu    4096 Dec 19 21:29 part2
-rw-rw-r-- 1 ubuntu ubuntu    2934 Dec 19 21:29 subject_id_remap.parquet


Because the two graphs were built with different numbers of partitions, the two folders contain different partition data sub-folders, named "part0" to "part***N-1***", where ***N*** is the number of partitions specified with the `--num-parts` argument of the `construct_graph` command.
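
The sub-folder naming can be checked programmatically. As a minimal sketch, the hypothetical helper `missing_partitions` below lists the partition sub-folders expected for a given `--num-parts` value (numbered from 0, as in the listings above) and reports any that are absent from an output folder.

```python
from pathlib import Path


def missing_partitions(output_dir, num_parts):
    """Return the expected partition folder names that are absent from output_dir.

    With N partitions, the expected folders are part0 ... part(N-1).
    """
    expected = [f"part{i}" for i in range(num_parts)]
    return [name for name in expected if not (Path(output_dir) / name).is_dir()]


# An empty list means every expected partition folder exists.
print(missing_partitions("./acm_gs_3p", 3))
```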

<div class="alert alert-block alert-info">
<b>Tip:</b> In the next sections, we use the 3-partition graph to explore these four sets of files and sub-folders one by one. But we will use the 1-partition graph in the other notebooks for GraphStorm standalone mode programming tutorials. </div>


#### The GraphStorm Partition Configuration File `acm.json`
The `acm.json` file describes the partitioned graph that GraphStorm uses for model training and inference.

It includes basic information about the partitioned graph, such as node and edge types, the number of each node and edge type, and the number of partitions along with the other partition mapping information.

In [7]:
!cat ./acm_gs_3p/acm.json

#### Raw Node ID Mapping Files `****_id_remap.parquet`
Because the original node IDs can be of any type, e.g., strings, integers, or even floats, GraphStorm performs an ID mapping during graph processing, which maps the original node ID space given by users into an integer node ID space starting from 0. This mapping information is stored in the `****_id_remap.parquet` files.

In [16]:
author_nid_mapping_df = pd.read_parquet('./acm_gs_3p/author_id_remap.parquet')

print(author_nid_mapping_df.shape)
author_nid_mapping_df.sample(4)

(17431, 2)


| | orig | new |
|---|---|---|
| 765 | a765 | 765 |
| 8438 | a8438 | 8438 |
| 14914 | a14914 | 14914 |
| 5227 | a5227 | 5227 |


As shown above, the `author_id_remap.parquet` file has two columns. The `orig` column contains the original string node IDs from the raw node table data, while the `new` column contains the new integer node IDs in the GraphStorm node ID space.
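
The idea behind this mapping can be illustrated with a few lines of plain Python. This is a sketch of the concept only, not GraphStorm's implementation. In particular, the order in which new IDs are assigned (first-seen order here) is an assumption, and `build_id_remap` is a hypothetical helper.

```python
def build_id_remap(orig_ids):
    """Map arbitrary hashable node IDs to consecutive integers starting from 0.

    IDs are numbered in first-seen order, which is an assumption made for
    this illustration.
    """
    remap = {}
    for orig in orig_ids:
        if orig not in remap:
            remap[orig] = len(remap)
    return remap


# Toy IDs of mixed types; not taken from the ACM data.
toy_ids = ["a7", 42, 3.5, "a7"]
remap = build_id_remap(toy_ids)
print(remap)  # {'a7': 0, 42: 1, 3.5: 2}

# The inverse mapping recovers the original ID from the integer ID.
inverse = {new: orig for orig, new in remap.items()}
print(inverse[1])  # 42
```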

#### GraphStorm Partition Node/Edge ID Mapping Files `****_mapping.pt`
GraphStorm relies on the distributed DGL graph as its input graph data. The distributed DGL graph has its own node ID space, so graph partitioning creates another node ID mapping.

These node and edge ID mappings, in the form of Python dictionaries, are stored in the `****_mapping.pt` files, which can be loaded using PyTorch.

<div class="alert alert-block alert-info">
<b>Tip:</b> In general, users do not need to map the IDs back themselves. When GraphStorm's command line interface is used to train models and run inference, GraphStorm automatically remaps the partitioned ID space to the original node ID space. </div>

In [17]:
import torch as th

node_mapping_dict = th.load('./acm_gs_3p/node_mapping.pt')
print('Node id mapping:')
print(f'Node mapping keys: {list(node_mapping_dict.keys())}')
ntype0 = list(node_mapping_dict.keys())[0]
print(f'Node type \'{ntype0}\' first 10 mapping ids: {node_mapping_dict[ntype0][:10]}\n')

edge_mapping_dict = th.load('./acm_gs_3p/edge_mapping.pt')
print('Edge id mapping:')
print(f'Edge mapping keys: {list(edge_mapping_dict.keys())}')
etype0 = list(edge_mapping_dict.keys())[0]
print(f'Edge type \'{etype0}\' first 10 mapping ids: {edge_mapping_dict[etype0][:10]}\n')

Node id mapping:
Node mapping keys: ['author', 'paper', 'subject']
Node type 'author' first 10 mapping ids: tensor([16442,  7664,  7665,  7667, 16448,  7669,  7670, 16443,  7674, 16453])

Edge id mapping:
Edge mapping keys: [('author', 'writing', 'paper'), ('paper', 'cited', 'paper'), ('paper', 'citing', 'paper'), ('paper', 'is-about', 'subject'), ('paper', 'written-by', 'author'), ('subject', 'has', 'paper')]
Edge type '('author', 'writing', 'paper')' first 10 mapping ids: tensor([ 8198, 15018,  3479,   253, 21728, 20622, 15980, 13148, 11788,  9858])



The ID mapping logic in these tensors is that the stored values are GraphStorm graph IDs, and their position indexes are the new partitioned node IDs. For example, for "author" nodes, the GraphStorm graph ID `16442` has the new partitioned node ID `0` because the number `16442` is in the first position (index=`0`) of the mapping tensor.
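
The same logic can be written out in code. The sketch below uses a plain Python list in place of the tensor; the values are the first ten "author" entries printed above.

```python
# First ten entries of the "author" mapping tensor printed above.
author_mapping = [16442, 7664, 7665, 7667, 16448, 7669, 7670, 16443, 7674, 16453]

# Position index = partitioned node ID; stored value = GraphStorm graph ID.
partitioned_id = 0
print(author_mapping[partitioned_id])  # 16442

# Inverse lookup: GraphStorm graph ID -> partitioned node ID.
gs_to_partitioned = {gs_id: pid for pid, gs_id in enumerate(author_mapping)}
print(gs_to_partitioned[7667])  # 3
```

With the loaded tensor, the forward lookup is the same indexing operation, i.e., `node_mapping_dict['author'][partitioned_id]`.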

#### Label Statistic Files `****_label_stats.json`

If users specify the label statistics property in the `config.json` file, e.g., by setting `"label_stats_type": "frequency_cnt"` in the "paper" node's `label` object, GraphStorm collects the label statistics and stores them in the `****_label_stats.json` files.

In [8]:
!cat ./acm_gs_3p/node_label_stats.json
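
As an illustration of what a frequency count over labels means, the sketch below counts toy labels with `collections.Counter`. It does not reproduce the exact layout of `node_label_stats.json`, which should be read from the file printed above, and the toy labels are made up.

```python
from collections import Counter

# Made-up labels for illustration; not taken from the ACM data.
toy_labels = [4, 12, 8, 1, 4, 4, 12]

# Count how often each class label occurs.
label_counts = Counter(toy_labels)
for label, count in sorted(label_counts.items()):
    print(f"label {label}: {count}")
```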

#### Partitioned Graph Data `partN/***.dgl`

The distributed DGL graph datasets are saved in these `partN` sub-folders, each of which contains three DGL-formatted files:
1. `edge_feat.dgl`: edge features of one partition, if any.
2. `graph.dgl`: graph structure of one partition.
3. `node_feat.dgl`: node features of one partition, if any.

In [19]:
!ls -al ./acm_gs_3p/part0

total 13892
drwxrwxr-x 2 ubuntu ubuntu     4096 Dec 19 21:29 .
drwxrwxr-x 6 ubuntu ubuntu     4096 Dec 19 21:29 ..
-rw-rw-r-- 1 ubuntu ubuntu    31926 Dec 19 21:29 edge_feat.dgl
-rw-rw-r-- 1 ubuntu ubuntu  2081555 Dec 19 21:29 graph.dgl
-rw-rw-r-- 1 ubuntu ubuntu 12097671 Dec 19 21:29 node_feat.dgl
