# Notebook 0: Data Preparation

This notebook creates an example graph dataset that the other notebooks use to demonstrate how to program with GraphStorm APIs.
The example graph data comes from [DGL's ACM publication dataset](https://data.dgl.ai/dataset/ACM.mat), the same data explained in the [Use Your Own Data tutorial](https://graphstorm.readthedocs.io/en/latest/tutorials/own-data.html).

## Prerequisites
This notebook assumes the following prerequisites:
- Python 3.
- Linux OS (Ubuntu or Amazon Linux).
- GraphStorm and its dependencies (see the [Setup GraphStorm with pip packages tutorial](https://graphstorm.readthedocs.io/en/latest/install/env-setup.html#setup-graphstorm-with-pip-packages)).
- [Jupyter web interactive server](https://jupyter.org/).

Users can use the following command to check if the above prerequisites are met.

In [1]:
import graphstorm as gs
print(gs.__version__)

0.2.1
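
The cell above only confirms that GraphStorm can be imported. As a rough sketch, the other listed prerequisites can be checked with the Python standard library alone. The helper name `check_prerequisites` is hypothetical and is not part of GraphStorm. It assumes that "Linux OS" can be detected through `platform.system()`, and it does not distinguish Ubuntu from Amazon Linux.

```python
import importlib.util
import platform
import sys


def check_prerequisites(required_modules=("graphstorm",)):
    """Return a dict mapping each prerequisite name to True or False.

    Hypothetical helper for illustration; not a GraphStorm API.
    """
    results = {
        "python3": sys.version_info.major == 3,
        "linux": platform.system() == "Linux",
    }
    for name in required_modules:
        # find_spec returns None when a top-level module is not installed
        results[name] = importlib.util.find_spec(name) is not None
    return results


for item, ok in check_prerequisites().items():
    print(f"{item}: {'OK' if ok else 'MISSING'}")
```

Any entry reported as `MISSING` points to a prerequisite that should be revisited before running the remaining cells.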


### Download Data Generation Script
GraphStorm provides a Python script that downloads the DGL ACM publication data and converts it for GraphStorm usage. First, let's download the script file from the [GraphStorm GitHub repository](https://github.com/awslabs/graphstorm).

In [2]:
!wget -O ./acm_data.py https://github.com/awslabs/graphstorm/raw/main/examples/acm_data.py

--2023-12-11 18:58:35--  https://github.com/awslabs/graphstorm/raw/main/examples/acm_data.py
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/awslabs/graphstorm/main/examples/acm_data.py [following]
--2023-12-11 18:58:35--  https://raw.githubusercontent.com/awslabs/graphstorm/main/examples/acm_data.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18296 (18K) [text/plain]
Saving to: ‘./acm_data.py’


2023-12-11 18:58:35 (29.6 MB/s) - ‘./acm_data.py’ saved [18296/18296]



### Generate ACM Raw Table Data
The Python script is self-contained. We can use it with the command below to build the raw table data, which is the standard input data for GraphStorm.

In [2]:
!python ./acm_data.py --output-path ./acm_raw --output-type raw_w_text

Namespace(download_path='/tmp/ACM.mat', dataset_name='acm', output_type='raw_w_text', output_path='./acm_raw')
Graph(num_nodes={'author': 17431, 'paper': 12499, 'subject': 73},
      num_edges={('author', 'writing', 'paper'): 37055, ('paper', 'cited', 'paper'): 30789, ('paper', 'citing', 'paper'): 30789, ('paper', 'is-about', 'subject'): 12499, ('paper', 'written-by', 'author'): 37055, ('subject', 'has', 'paper'): 12499},
      metagraph=[('author', 'paper', 'writing'), ('paper', 'paper', 'cited'), ('paper', 'paper', 'citing'), ('paper', 'subject', 'is-about'), ('paper', 'author', 'written-by'), ('subject', 'paper', 'has')])

 Number of classes: 14

 Paper node labels: torch.Size([12499])

 ('paper', 'citing', 'paper') edge labels:30789
Saving ACM data to /tmp/acm.dgl ......
/tmp/acm.dgl saved.
Saving ACM node text to /tmp/acm_text.pkl ......
/tmp/acm_text.pkl saved.
author nodes have: Index(['node_id', 'feat', 'text'], dtype='object') columns ......
paper nodes have: Index(['node_id',

### Construct GraphStorm Input Graph Data
With the raw ACM tables, we can use GraphStorm's graph construction method to prepare the ACM graph for the other notebooks.

In GraphStorm Standalone mode, only one partition is needed, so the command below sets `--num-parts` to `1`. For the other arguments, users can refer to [GraphStorm Graph Construction arguments](https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html).

In [3]:
!python -m graphstorm.gconstruct.construct_graph \
          --conf-file ./acm_raw/config.json \
          --output-dir ./acm_gs \
          --num-parts 1 \
          --graph-name acm

INFO:root:The graph has 3 node types and 6 edge types.
INFO:root:Node type author has 17431 nodes
INFO:root:Node type paper has 12499 nodes
INFO:root:Node type subject has 73 nodes
INFO:root:Edge type ('author', 'writing', 'paper') has 37055 edges
INFO:root:Edge type ('paper', 'cited', 'paper') has 30789 edges
INFO:root:Edge type ('paper', 'citing', 'paper') has 30789 edges
INFO:root:Edge type ('paper', 'is-about', 'subject') has 12499 edges
INFO:root:Edge type ('paper', 'written-by', 'author') has 37055 edges
INFO:root:Edge type ('subject', 'has', 'paper') has 12499 edges
INFO:root:Node type author has features: ['feat', 'input_ids', 'attention_mask', 'token_type_ids'].
INFO:root:Node type paper has features: ['feat', 'input_ids', 'attention_mask', 'token_type_ids', 'train_mask', 'val_mask', 'test_mask', 'label'].
INFO:root:Train/val/test on paper: 9999, 1249, 1249
INFO:root:Node type subject has features: ['feat', 'input_ids', 'attention_mask', 'token_type_ids'].
INFO:root:Edge type 

## Data Exploration and Explanation
The above commands created two sets of ACM data. Below we explore these datasets and explain their formats so that users can prepare their own graph data for use with GraphStorm.

### Raw ACM Table Data in the `./acm_raw` Folder
We can list the `acm_raw` folder with the `ls -al` command. 

In [4]:
!ls -al ./acm_raw

total 24
drwxrwxr-x 4 ubuntu ubuntu 4096 Dec 12 22:41 .
drwxrwxr-x 5 ubuntu ubuntu 4096 Dec 12 22:41 ..
-rw-rw-r-- 1 ubuntu ubuntu 5249 Dec 12 22:41 config.json
drwxrwxr-x 2 ubuntu ubuntu 4096 Dec 12 22:41 edges
drwxrwxr-x 2 ubuntu ubuntu 4096 Dec 12 22:41 nodes


### GraphStorm Graph Construction JSON

The `acm_raw` folder includes one `config.json` file that describes the table-based raw data format. In addition to a **version** object, the JSON file contains a **nodes** object and an **edges** object.

The **nodes** object contains a list of *nodes*, each of which includes a set of properties that describe one node type in the graph. For example, the `config.json` file defines a node type called "paper". For each node type, GraphStorm defines a few other properties, such as **format**, **files**, and **features**.

Similarly, the **edges** object contains a list of *edges*. Most *edge* properties are the same as those of a *node*, except that an *edge* object has the **relation** property, which defines an edge type in canonical form, i.e., *source node type*, *relation type*, and *destination node type*.

For a full list of the JSON configuration properties, users can refer to the [GraphStorm Graph Construction JSON Explanations](https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations).
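
To make the structure described above concrete, here is a minimal sketch that lists the node types and canonical edge types declared in such a configuration. The function name `summarize_gconstruct_config` is hypothetical. The sample JSON is a trimmed illustration: its node entries copy keys from this notebook's `config.json`, while the edge entry is an assumed example of the **relation** property rather than a copy of the real file.

```python
import json


def summarize_gconstruct_config(conf):
    """Return (node_types, edge_types) declared in a config dict.

    Hypothetical helper for illustration; not a GraphStorm API.
    """
    node_types = [node["node_type"] for node in conf.get("nodes", [])]
    # "relation" is [source node type, relation type, destination node type]
    edge_types = [tuple(edge["relation"]) for edge in conf.get("edges", [])]
    return node_types, edge_types


# Trimmed, illustrative configuration. The edge entry is an assumption
# based on the prose description of the "relation" property.
sample_conf = json.loads("""
{
    "version": "gconstruct-v0.1",
    "nodes": [
        {"node_type": "author", "format": {"name": "parquet"},
         "files": ["./acm_raw/nodes/author.parquet"], "node_id_col": "node_id"},
        {"node_type": "paper", "format": {"name": "parquet"},
         "files": ["./acm_raw/nodes/paper.parquet"]}
    ],
    "edges": [
        {"relation": ["paper", "citing", "paper"], "format": {"name": "parquet"},
         "files": ["./acm_raw/edges/paper_citing_paper.parquet"]}
    ]
}
""")

node_types, edge_types = summarize_gconstruct_config(sample_conf)
print(node_types)
print(edge_types)
```

To run the same summary on the generated file, replace `sample_conf` with the result of `json.load` applied to an open handle on `./acm_raw/config.json`.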

In [6]:
!cat ./acm_raw/config.json

{
    "version": "gconstruct-v0.1",
    "nodes": [
        {
            "node_type": "author",
            "format": {
                "name": "parquet"
            },
            "files": [
                "./acm_raw/nodes/author.parquet"
            ],
            "node_id_col": "node_id",
            "features": [
                {
                    "feature_col": "feat",
                    "feature_name": "feat"
                },
                {
                    "feature_col": "text",
                    "feature_name": "text",
                    "transform": {
                        "name": "tokenize_hf",
                        "bert_model": "bert-base-uncased",
                        "max_seq_length": 16
                    }
                }
            ]
        },
        {
            "node_type": "paper",
            "format": {
                "name": "parquet"
            },
            "files": [
                "./acm_raw/nodes/paper.parquet"
            ]

### Raw ACM Tables
As defined in the `./acm_raw/config.json` file, the node data files are stored in the `./acm_raw/nodes/` folder, and the edge data files are stored in the `./acm_raw/edges/` folder. A general description of these files can be found in [Input raw node/edge data files](https://graphstorm.readthedocs.io/en/latest/tutorials/own-data.html#input-raw-node-edge-data-files). Here, we read the "paper" node table and the \["paper", "citing", "paper"\] edge table to learn more about them.

In [7]:
import pandas as pd

paper_node_path = './acm_raw/nodes/paper.parquet'
paper_citing_paper_edge_path = './acm_raw/edges/paper_citing_paper.parquet'

#### Paper node table

In [8]:
paper_node_df = pd.read_parquet(paper_node_path)

print(paper_node_df.shape)
paper_node_df.sample(4)

(12499, 4)


| | node_id | label | feat | text |
|---|---|---|---|---|
| 1511 | p1511 | 2 | [-0.0006541484, 0.009245367, 0.017781269, 0.01... | 'Conversation specification: a new approach to... |
| 8644 | p8644 | 8 | [-0.017404346, -0.013577025, 0.0048244707, -0.... | 'Strip packing with precedence constraints and... |
| 8602 | p8602 | 9 | [0.0011218095, 0.016130054, -0.011704452, -0.0... | ' Network arrivals are often modeled as Poiss... |
| 1580 | p1580 | 0 | [-0.019133765, 0.010865432, -0.021192657, 0.00... | 'Dynamic hybrid clustering of bioinformatics b... |


The paper node table has four columns: `node_id`, `label`, `feat`, and `text`.
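
When preparing their own node tables in this layout, users may find a few basic sanity checks helpful. The sketch below uses a toy DataFrame with made-up values. Only the column names follow the paper node table, and the helper `describe_node_table` is hypothetical. It assumes each `feat` entry is a list-like vector that supports `len`.

```python
import pandas as pd

# Toy stand-in for a node table. The values are made up; only the column
# layout (node_id, label, feat, text) follows the paper node table.
toy_nodes = pd.DataFrame({
    "node_id": ["p0", "p1", "p2"],
    "label": [2, 8, 2],
    "feat": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    "text": ["first title", "second title", "third title"],
})


def describe_node_table(df):
    """Collect basic sanity statistics for a node table (hypothetical helper)."""
    return {
        "num_nodes": len(df),
        "ids_unique": bool(df["node_id"].is_unique),
        # Distinct feature vector lengths; a single value means all rows agree
        "feat_dims": sorted(df["feat"].map(len).unique().tolist()),
        "label_counts": df["label"].value_counts().to_dict(),
    }


summary = describe_node_table(toy_nodes)
print(summary)
```

The same helper can be pointed at `paper_node_df` from the cell above, provided the `feat` column loads as array-like values.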

#### paper_citing_paper edge table

In [14]:
pcp_edge_df = pd.read_parquet(paper_citing_paper_edge_path)

print(pcp_edge_df.shape)
pcp_edge_df.sample(4)

(30789, 3)


| | source_id | dest_id | label |
|---|---|---|---|
| 24725 | 8712 | 4377 | 1.0 |
| 16460 | 5471 | 7375 | 1.0 |
| 6829 | 2262 | 97 | 1.0 |
| 28274 | 10890 | 5780 | 1.0 |
