### Google Colab Setup

Before getting started, run the setup cells below to configure the environment. You'll need to re-run them each time you start or restart the notebook runtime.

First, run this cell to load the [autoreload](https://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html?highlight=autoreload) extension. With `%autoreload 2`, IPython reloads all imported modules before executing each cell, so edits to `.py` source files take effect in the notebook without restarting the kernel.

In [1]:
%load_ext autoreload
%autoreload 2

Next we need to run a few commands to set up our environment on Google Colab. If you are running this notebook on a local machine you can skip this section.

Run the following cell to mount your Google Drive. Follow the link and sign in to the Google account that stores this notebook.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os

# TODO: Fill in the path, relative to MyDrive, of the folder where you uploaded this project
# Example: if the project lives at MyDrive/Colab Notebooks/Group-Project, use 'Colab Notebooks/Group-Project'
GOOGLE_DRIVE_PATH_POST_MYDRIVE = 'Colab Notebooks/Group-Project'
GOOGLE_DRIVE_PATH = os.path.join('/content', 'drive', 'MyDrive', GOOGLE_DRIVE_PATH_POST_MYDRIVE)
print(os.listdir(GOOGLE_DRIVE_PATH))

['.git', 'dataset-screening', 'paper', 'data', 'model', 'notebooks', 'scripts', 'README.md', '.gitignore', 'requirements.txt', 'config', 'logs', 'plot']
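
If the folder name is mistyped or Drive is not mounted, `os.listdir` fails with a bare `FileNotFoundError`. A minimal sketch of a guard that gives a clearer message is shown here; the helper name `list_project_dir` is hypothetical and not part of this notebook or repository.

```python
import os

def list_project_dir(path):
    """Return the sorted entries of `path`, or raise a descriptive error.

    Hypothetical helper for illustration only.
    """
    if not os.path.isdir(path):
        raise FileNotFoundError(
            f"Project folder not found: {path!r}. "
            "Check GOOGLE_DRIVE_PATH_POST_MYDRIVE and that Drive is mounted."
        )
    return sorted(os.listdir(path))
```

In the cell above, `print(list_project_dir(GOOGLE_DRIVE_PATH))` could stand in for the plain `os.listdir` call.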


### Local Setup or Google Colab
Run the cell below in either setup to set the project path. When running locally, the path falls back to the current directory.

In [4]:
# When running locally, point GOOGLE_DRIVE_PATH at the current directory
import sys
if 'google.colab' in sys.modules:
  print(f'Running in google colab. Our path is `{GOOGLE_DRIVE_PATH}`')
else:
  GOOGLE_DRIVE_PATH = '.'
  print('Running locally.')

Running in google colab. Our path is `/content/drive/MyDrive/Colab Notebooks/Group-Project`


# Env setup & package installation

In [None]:
!bash "{GOOGLE_DRIVE_PATH}/config/setup_env.sh"

⏬ Downloading https://github.com/jaimergp/miniforge/releases/download/24.11.2-1_colab/Miniforge3-colab-24.11.2-1_colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:06
🔁 Restarting kernel...
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.11/dist-packages/condacolab.py", line 207, in install_miniforge
    install_from_url(
  File "/usr/local/lib/python3.11/dist-packages/condacolab.py", line 169, in install_from_url
    get_ipython().kernel.do_shutdown(True)
    ^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'kernel'
Channels:
 - conda-forge
Platform: linux-64
Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ | / - \ done
Solving environment: / - done


    current version: 24.11.2
    latest version: 25.5.1

Please update conda by running

    $ conda update -n base -c 

In [None]:
# Remove pinsage_env for a clean reinstall (uncomment the next line to run)
# !conda remove -n pinsage_env --all -y


Remove all packages in environment /usr/local/envs/pinsage_env:


## Package Plan ##

  environment location: /usr/local/envs/pinsage_env


The following packages will be REMOVED:

  _libgcc_mutex-0.1-conda_forge
  _openmp_mutex-4.5-2_gnu
  brotli-python-1.1.0-py310hf71b8c6_3
  bzip2-1.0.8-h4bc722e_7
  ca-certificates-2025.7.14-hbd8a1cb_0
  certifi-2025.7.14-pyhd8ed1ab_0
  cffi-1.17.1-py310h8deb56e_0
  charset-normalizer-3.4.2-pyhd8ed1ab_0
  colorama-0.4.6-pyhd8ed1ab_1
  dgl-2.0.0.cu118-py310_0
  h2-4.2.0-pyhd8ed1ab_0
  hpack-4.1.0-pyhd8ed1ab_0
  hyperframe-6.1.0-pyhd8ed1ab_0
  icu-75.1-he02047a_0
  idna-3.10-pyhd8ed1ab_1
  ld_impl_linux-64-2.44-h1423503_1
  libblas-3.9.0-32_h59b9bed_openblas
  libcblas-3.9.0-32_he106b2a_openblas
  libexpat-2.7.1-hecca717_0
  libffi-3.4.6-h2dba641_1
  libgcc-15.1.0-h767d61c_3
  libgcc-ng-15.1.0-h69a702a_3
  libgfortran-15.1.0-h69a702a_3
  libgfortran5-15.1.0-hcea5267_3
  libgomp-15.1.0-h767d61c_3
  liblapack-3.9.0-32_h7ac8fdf_openblas
  liblzma-5.8.1-
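
Before calling `conda run -n pinsage_env ...`, it can help to confirm that the environment actually exists, especially after a removal like the one above. The sketch below parses the JSON printed by `conda env list --json`; the function names are hypothetical, and it assumes environment names equal the last component of each reported path (so the base install at `/usr/local` shows up as `local`, not `base`).

```python
import json
import os
import shutil
import subprocess

def env_names_from_json(payload):
    """Extract environment names from `conda env list --json` output."""
    envs = json.loads(payload).get("envs", [])
    # Assumption: the env name is the final path component.
    return {os.path.basename(p.rstrip("/")) for p in envs}

def conda_env_exists(name):
    """Return True if conda is on PATH and reports an env called `name`."""
    if shutil.which("conda") is None:
        return False
    out = subprocess.run(
        ["conda", "env", "list", "--json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return name in env_names_from_json(out)
```

For example, `conda_env_exists("pinsage_env")` could gate the test cell that follows.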

In [None]:
# Check that the packages needed by the PinSAGE sampler are available
import os

# Change to the scripts folder so the test can look for the local modules
os.chdir(os.path.join(GOOGLE_DRIVE_PATH, "scripts"))
# Run the test inside the conda environment
!conda run -n pinsage_env python test_installation.py

🔍 Starting environment test for DGL PinSAGE example...

✅ Built-in modules loaded (os, re, pickle, argparse)
✅ Data modules loaded (numpy, pandas, scipy, dask)
✅ Torch loaded (version: 2.0.1+cu117, CUDA: True)
✅ DGL loaded (version: 2.0.0+cu118)
✅ TorchText loaded (tokenizer, vocab)
✅ tqdm loaded
✅ pandas.api.types helpers loaded
✅ Local module found: sampler.py
⚠️  Local module NOT found: evaluation.py (expected if not copied yet)
⚠️  Local module NOT found: layers.py (expected if not copied yet)
⚠️  Local module NOT found: builder.py (expected if not copied yet)
⚠️  Local module NOT found: data_utils.py (expected if not copied yet)

🎉 Environment test complete!
⚠️  Note: Some local scripts were not found. If you're setting up the full repo, be sure to include:
  - evaluation.py
  - layers.py
  - builder.py
  - data_utils.py
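
The report above distinguishes modules that load from modules that are not found. The contents of `test_installation.py` are not shown here, so the following is only a sketch of one way to run such a check with the standard library; `missing_modules` is a hypothetical name, and it expects top-level module names (dotted names can raise if the parent package is absent).

```python
import importlib.util

def missing_modules(names):
    """Return the subset of top-level module `names` that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Example with the local modules named in the report; the result depends
# on the current working directory and import path.
local_modules = ["sampler", "evaluation", "layers", "builder", "data_utils"]
print(missing_modules(local_modules))
```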



# PinSage Model Training

In [None]:
# import os
# os.chdir("/content/drive/MyDrive/Colab Notebooks/Group-Project")

# import subprocess
# result = subprocess.run(
#     ["conda", "run", "-n", "pinsage_env", "python", "scripts/builder.py"],
#     stdout=subprocess.PIPE,
#     stderr=subprocess.STDOUT,
#     universal_newlines=True
# )
# print(result.stdout)

🔍 Loading data and building inductive split...


=== Building Inductive Split ===
Loading features from: /content/drive/MyDrive/Colab Notebooks/Group-Project/data/features.npy
Feature shape: (402691, 1937)
Loading edges from: /content/drive/MyDrive/Colab Notebooks/Group-Project/data/edges.csv
Total edges read: 808185
Positive edges count: 808185
Graph loaded with 402691 nodes and 808185 edges
Splitting 402691 nodes into train/test with ratio 0.8/0.2
Train nodes: 322152, Test nodes: 80539
Train graph: 322152 nodes, 515791 edges
Extracting edges within node set of size: 322152
Extracted 515791 edges
Generating 5 negative edges from 322152 valid nodes
Generated 5 negative edges
Train feature pairs shape: torch.Size([515796, 3874])
Extracting edges within node set of size: 80539
Extracted 32866 edges
Generating 5 negative edges from 80539 valid nodes
Generated 5 negative edges
Validation feature pairs shape: torch.Size([32871, 3874])
=== Inductive Split Complete ===
✅ Train Graph:
  - Num 
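
The commented-out cell above collects all output with `subprocess.run` and prints it only after the script finishes. For long-running scripts, a variant that echoes output as it arrives may be more convenient. This is a sketch under that assumption; `run_streaming` is a hypothetical helper, not part of the repository.

```python
import subprocess
import sys

def run_streaming(cmd):
    """Run `cmd`, echoing merged stdout/stderr line by line.

    Returns (return_code, list_of_output_lines).
    """
    lines = []
    with subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    ) as proc:
        for line in proc.stdout:
            print(line, end="")          # show progress immediately
            lines.append(line.rstrip("\n"))
    # Leaving the `with` block waits for the process, so returncode is set.
    return proc.returncode, lines

# Harmless demonstration using the current Python interpreter.
code, lines = run_streaming([sys.executable, "-c", "print('hello')"])
```

With conda, the command list could be `["conda", "run", "-n", "pinsage_env", "--no-capture-output", "python", "scripts/builder.py"]`, mirroring the flags used in the later cells.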

In [None]:
# step 1: data preparation
%cd /content/drive/MyDrive/Colab\ Notebooks/Group-Project/scripts/
!conda run -n pinsage_env --no-capture-output python test_datautils.py

/content
🔍 Loading data and building inductive split...

Generated 515791 negative edges
Generated 32866 negative edges
✅ Train Graph:
  - Num nodes: 322152
  - Num edges: 515791

✅ Validation Graph:
  - Num nodes: 80539
  - Num edges: 32866

✅ Train Edges / Labels:
  - Positive + Negative edges: 1031582
  - Labels (sample): [1, 1, 1, 1, 1]

✅ Validation Edges / Labels:
  - Positive + Negative edges: 65732
  - Labels (sample): [1, 1, 1, 1, 1]

🎉 Data utility pipeline works! You can now feed train_feat_pairs and train_labels into a model.
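
The counts reported in this notebook can be cross-checked with a little arithmetic. The sketch below uses figures copied from the cell outputs and rests on three assumptions that are consistent with those outputs but not confirmed by the source code: the node split takes the floor of `num_nodes * ratio`, one negative edge is drawn per positive edge, and a feature pair is the head and tail feature vectors placed side by side.

```python
# Figures copied from the cell outputs in this notebook.
NUM_NODES = 402_691
FEATURE_DIM = 1_937
TRAIN_RATIO = 0.8
TRAIN_POS_EDGES = 515_791
VAL_POS_EDGES = 32_866

def split_summary(num_nodes, ratio, train_pos, val_pos, feat_dim):
    """Derive the split sizes implied by the inputs."""
    train_nodes = int(num_nodes * ratio)      # assumption: floor
    return {
        "train_nodes": train_nodes,
        "val_nodes": num_nodes - train_nodes,
        "train_edges_total": 2 * train_pos,   # assumption: 1 negative per positive
        "val_edges_total": 2 * val_pos,
        "pair_feature_dim": 2 * feat_dim,     # assumption: head and tail concatenated
    }

summary = split_summary(
    NUM_NODES, TRAIN_RATIO, TRAIN_POS_EDGES, VAL_POS_EDGES, FEATURE_DIM
)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The derived values match the reported 322152 and 80539 nodes, 1031582 and 65732 labeled edges, and the pair width of 3874.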


In [None]:
# step 2: build sampler
%cd /content/drive/MyDrive/Colab\ Notebooks/Group-Project/scripts/
!conda run -n pinsage_env --no-capture-output python test_sampler.py

/content
🔍 Loading data and building inductive split...

Generated 515791 negative edges
Generated 32866 negative edges

🚀 Testing one batch from train_loader...

✅ Train heads: [46996, 229709, 132423, 314186]
✅ Train tails: [290125, 208811, 131516, 287460]
✅ Labels: [1, 1, 0, 1]
✅ Train Blocks: [102, 38]

🧪 Testing one batch from val_loader...

✅ Val heads: [52043, 41178, 41702, 40321]
✅ Val tails: [56814, 50460, 8094, 53274]
✅ Labels: [1, 1, 1, 1]
✅ Val Blocks: [36, 25]
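
The batches above mix positive edges (label 1) with generated negative edges (label 0). How the repository draws its negatives is not shown, so the following is only an illustrative sketch of uniform negative sampling over directed node pairs; `sample_negative_edges` and its parameters are hypothetical.

```python
import random

def sample_negative_edges(num_nodes, positive_edges, num_samples, seed=0):
    """Draw distinct (head, tail) pairs that are not positive edges.

    Illustrative only: treats edges as directed and excludes self-loops.
    """
    positives = set(positive_edges)
    # Conservative upper bound on how many negatives can exist.
    capacity = num_nodes * (num_nodes - 1) - len(positives)
    if num_samples > capacity:
        raise ValueError("not enough candidate pairs for the requested samples")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < num_samples:
        head = rng.randrange(num_nodes)
        tail = rng.randrange(num_nodes)
        if head != tail and (head, tail) not in positives:
            negatives.add((head, tail))
    return sorted(negatives)

# Tiny example: 5 nodes, 2 positive edges, 4 negatives.
positive = [(0, 1), (1, 2)]
negative = sample_negative_edges(5, positive, num_samples=4)
labels = [1] * len(positive) + [0] * len(negative)
```

Rejection sampling like this is cheap when the graph is sparse, as it is here (about 0.8 million edges among roughly 0.4 million nodes), because almost every random pair is a valid negative.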


In [None]:
# step 3: train
%cd /content/drive/MyDrive/Colab\ Notebooks/Group-Project/scripts/
!conda run -n pinsage_env --no-capture-output python train_pinsage_model.py

/content

🔍 Loading data and building inductive split...

Generated 515791 negative edges
Generated 32866 negative edges
✅ Graph: 322152 nodes, 515791 edges
✅ Graph: 80539 nodes, 32866 edges
✅ Train edges: 1031582, Validation edges: 65732

🎲 Trial 1/1
🔧 Config: {'batch_size': 512, 'num_epochs': 35, 'learning_rate': 0.0001, 'hidden_feats': 384, 'out_feats': 128, 'num_layers': 2, 'dropout': 0.3, 'train_ratio': 0.8}

📦 Preparing dataloaders...

✅ Epoch 1 | Loss: 0.5896 | AUC: 0.7972 | AP: 0.7892 | Val AUC: 0.9148 | Time: 65.92s | GPU: 6246.24 MB
✅ Epoch 2 | Loss: 0.5317 | AUC: 0.8724 | AP: 0.8705 | Val AUC: 0.9423 | Time: 64.58s | GPU: 6246.85 MB
✅ Epoch 3 | Loss: 0.5126 | AUC: 0.8919 | AP: 0.8929 | Val AUC: 0.9495 | Time: 64.62s | GPU: 6246.98 MB
✅ Epoch 4 | Loss: 0.4998 | AUC: 0.9044 | AP: 0.9071 | Val AUC: 0.9560 | Time: 64.48s | GPU: 6246.98 MB
✅ Epoch 5 | Loss: 0.4902 | AUC: 0.9132 | AP: 0.9168 | Val AUC: 0.9595 | Time: 64.57s | GPU: 6246.98 MB
✅ Epoch 6 | Loss: 0.4822 | AUC: 0.9203 
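
The AUC columns in the training log summarize how well the model ranks positive edges above negative ones. As a reading aid, here is a small pure-Python sketch of ROC AUC computed as the fraction of positive/negative pairs ranked correctly, with ties counted as one half. It is not the training script's implementation, which is not shown, and its quadratic cost makes it suitable only for small examples.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    # Count correctly ordered pairs; a tie contributes 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Three of the four positive/negative pairs are ordered correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)  # 0.75
```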

In [None]:
!nvidia-smi

Thu Jul 17 19:55:44 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   33C    P0             53W /  400W |       0MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
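
The table above is convenient to read but awkward to process. When the same numbers are needed in code, `nvidia-smi` can print selected fields as CSV through its `--query-gpu` and `--format` options. The sketch below wraps that call; the function names are hypothetical, and it assumes every queried field is reported as an integer (some GPUs report `[N/A]` for certain fields, which this parser does not handle).

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]

def parse_gpu_rows(text):
    """Parse CSV rows of 'used, total, util' into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        used, total, util = (int(field) for field in line.split(","))
        rows.append({"used_mib": used, "total_mib": total, "util_pct": util})
    return rows

def gpu_status():
    """Return one dict per GPU, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_rows(out)
```

For the A100 snapshot above, the equivalent CSV row would be `0, 40960, 100`.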
                                                