# HW4: Graph Neural Networks

* There are two datasets in the `data` folder: `train.pt` and `test.pt`. You will train a graph convolutional neural network (GCNN) on the train dataset, then make predictions on the test dataset.
* This notebook has two parts. `Part I` provides a custom `Dataset` class and loads the datasets; the `QM_Dataset` class inherits from the PyTorch Geometric `Dataset` class. `Part II` outlines an example solution.
* This HW is implemented with [PyTorch Geometric (PyG)](https://pytorch-geometric.readthedocs.io/en/latest/index.html). Another popular library for implementing GNNs is [Deep Graph Library (DGL)](https://www.dgl.ai/).

In [1]:
!pip install torch_geometric

Collecting torch_geometric
  Downloading torch_geometric-2.6.1-py3-none-any.whl.metadata (63 kB)
Downloading torch_geometric-2.6.1-py3-none-any.whl (1.1 MB)
Installing collected packages: torch_geometric
Successfully installed torch_geometric-2.6.1


In [2]:
import pandas as pd
import torch
import torch.nn.functional as F
from torch.nn import Linear, ReLU, Sequential
from torch_geometric.data import Dataset
from torch_geometric.loader import DataLoader

In [None]:
# only needed on Google Colab

from google.colab import drive
drive.mount('/content/drive')

## Part I. The training and test datasets are provided

- The train and test datasets consist of pre-processed graphs. The train dataset contains 20,000 graphs, and the test dataset contains 2,000 graphs.
- Each graph contains the following components:

    - `x`, the matrix of node features, `[num_of_nodes, num_node_features=11]`
    - `edge_index`, the matrix describing which nodes are connected, `[2, num_of_edges]`
    - `y`, the label for the graph, a scalar. The value is set to `0` in the test dataset
    - `pos`, the matrix of node positions, `[num_of_nodes, 3]`
    - `edge_attr`, the matrix of edge features, `[num_of_edges, 4]`
    - `names`, the identifier of the graph, for example `gdb_59377`

- Depending on the graph convolutional layer that is used, different components are needed. For the most basic application, `x`, `edge_index` and `y` will be used.


In [None]:
class QM_Dataset(Dataset):
    def __init__(self, path):
        super().__init__(root=".")
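        # The files hold pickled graph objects rather than plain tensors, so
        # weights_only=False is needed on PyTorch >= 2.6 (only load trusted files).
        self.data = torch.load(path, weights_only=False)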

    def len(self):
        return len(self.data)

    def get(self, idx):
        return self.data[idx]

train_path = ".........../data/train.pt"
test_path = ".........../data/test.pt"

train_data_ = QM_Dataset(train_path)

# train dataset can be split for validation purposes
train_data, validate_data = torch.utils.data.random_split(train_data_, [19000, 1000])
test_data = QM_Dataset(test_path)

## Part II. Example solution

In [None]:
# define the network
# many convolutional layers are available in torch_geometric.nn
# here NNConv is just used as an example

from torch_geometric.nn import NNConv, Set2Set

class Net(torch.nn.Module):
    def __init__(self, num_features=11, dim=64):
        super().__init__()
        self.lin0 = torch.nn.Linear(num_features, dim)
        nn = Sequential(Linear(4, 128), ReLU(), Linear(128, dim * dim))
        self.conv = NNConv(dim, dim, nn, aggr='mean')      # replace with your own convolutional layers here
        self.set2set = Set2Set(dim, processing_steps=3)    # set2set is used to map from nodes to graphs
        self.lin1 = torch.nn.Linear(2 * dim, dim)
        self.lin2 = torch.nn.Linear(dim, 1)

    def forward(self, data):
        out = F.relu(self.lin0(data.x))                    # data.x: [batch_num_nodes, num_node_features]
        for _ in range(3):
            out = F.relu(self.conv(out, data.edge_index, data.edge_attr))
        out = self.set2set(out, data.batch)                # [batch_num_nodes, dim] ==> [batch_num_graphs, 2 * dim]
        out = F.relu(self.lin1(out))
        out = self.lin2(out)
        return out.view(-1)

In [None]:
# define training and evaluation functions
def train(loader):
    """Take the training DataLoader,
    train the model for one pass over it (one epoch),
    update the parameters,
    and return the current loss."""

# note: this name shadows Python's built-in eval()
def eval(loader):
    """Take the validation DataLoader
    and return the validation MAE (mean absolute error)."""
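
One possible way to fill in these two functions is sketched below. It is not the reference solution: the mean-squared-error loss, the explicit `model`, `optimizer`, and `device` arguments, and the names `train_one_epoch` and `evaluate_mae` are assumptions made for illustration (the stubs above rely on globals instead). It also assumes that each batch exposes `.to(device)`, `.y`, and `.num_graphs`, as PyG batches do.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, loader, optimizer, device):
    """Run one pass over `loader`; return the mean per-graph MSE loss."""
    model.train()
    total_loss, num_graphs = 0.0, 0
    for data in loader:
        data = data.to(device)
        optimizer.zero_grad()
        loss = F.mse_loss(model(data), data.y.view(-1))
        loss.backward()
        optimizer.step()
        # Weight each batch by its size so the last, smaller batch counts fairly.
        total_loss += loss.item() * data.num_graphs
        num_graphs += data.num_graphs
    return total_loss / num_graphs

@torch.no_grad()
def evaluate_mae(model, loader, device):
    """Return the mean absolute error of `model` over all graphs in `loader`."""
    model.eval()
    total_abs_error, num_graphs = 0.0, 0
    for data in loader:
        data = data.to(device)
        total_abs_error += (model(data) - data.y.view(-1)).abs().sum().item()
        num_graphs += data.num_graphs
    return total_abs_error / num_graphs
```

Training on MSE while reporting MAE is a common pairing, but `F.l1_loss` could be swapped in to optimize the reported metric directly.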

In [None]:
# load the datasets
train_loader = DataLoader(train_data, batch_size=128)
validate_loader = DataLoader(validate_data, batch_size=128)
test_loader = DataLoader(test_data, batch_size=8)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = Net().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.00005)

In [None]:
# training
num_epochs = 100
for epoch in range(1, num_epochs + 1):  # + 1 so that all num_epochs epochs run
    """Calculate loss and
    validation MAE"""
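
The loop body above is left as a stub. A minimal sketch of the bookkeeping it describes is given below. `run_training`, `train_step`, and `evaluate` are hypothetical names, with the two callables standing in for `train(train_loader)` and `eval(validate_loader)`.

```python
def run_training(train_step, evaluate, num_epochs):
    """Call `train_step` and `evaluate` once per epoch and log the results.

    Returns the per-epoch history and the best (lowest) validation MAE.
    """
    history = []
    best_mae = float("inf")
    for epoch in range(1, num_epochs + 1):
        loss = train_step()   # e.g. lambda: train(train_loader)
        mae = evaluate()      # e.g. lambda: eval(validate_loader)
        best_mae = min(best_mae, mae)
        history.append((epoch, loss, mae))
        print(f"Epoch {epoch:03d}  loss {loss:.4f}  val MAE {mae:.4f}")
    return history, best_mae
```

With the notebook's own functions this could be called as `run_training(lambda: train(train_loader), lambda: eval(validate_loader), num_epochs)`. Tracking the best validation MAE makes it easier to spot when further epochs stop helping.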

In [None]:
# predict
model.eval()
y_pred = []
Idx = []
for data in test_loader:
    """Predict and save graph index and
    predicted y value"""

assert len(Idx) == len(y_pred)
df = pd.DataFrame({"Idx": Idx, "labels": y_pred})
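
The prediction loop could be filled in along the following lines. This is a sketch, not the reference solution: `predict` is a hypothetical helper, and it assumes that batching turns the per-graph `names` string into a list of strings on each batch, which should be checked against the actual data.

```python
import torch

@torch.no_grad()
def predict(model, loader, device):
    """Return (names, predictions) for every graph yielded by `loader`."""
    model.eval()
    names, predictions = [], []
    for data in loader:
        data = data.to(device)
        names.extend(data.names)                        # assumed: list of str per batch
        predictions.extend(model(data).cpu().tolist())  # one float per graph
    return names, predictions
```

The two lists could then feed the DataFrame, for example via `Idx, y_pred = predict(model, test_loader, device)`.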

In [None]:
# upload solution
df.columns = ['Idx', 'labels']
df.to_csv(".........../data/submission1.csv", index=False)