# Graph Convolutional Neural Network (GCNN) Homework

* There are two datasets in the `data` folder: `train.pt`, `test.pt`. You will train a GCNN on the training dataset, then predict on the test dataset.
* This notebook has two parts. `Part I` provides a custom `Dataset` class and loads the datasets; the `QM_Dataset` class inherits from the PyTorch Geometric `Dataset` class. `Part II` is an example solution.
* This HW is implemented with [PyTorch Geometric (PyG)](https://pytorch-geometric.readthedocs.io/en/latest/index.html). Another popular library for implementing graph neural networks is [Deep Graph Library (DGL)](https://www.dgl.ai/).

In [None]:
import pandas as pd
import torch
import torch.nn.functional as F
from torch.nn import Linear, ReLU, Sequential
from torch_geometric.data import Dataset
from torch_geometric.loader import DataLoader

## Part I. The training and test datasets are provided

- The training and test datasets consist of pre-processed graphs. The training dataset contains 20,000 graphs and the test dataset contains 2,000 graphs.
- Each graph contains the following components:

    - `x`, the matrix containing node features, `[num_of_nodes, num_node_features=11]`
    - `edge_index`, the matrix containing connection information about different nodes, `[2, num_of_edges]`
    - `y`, the label for the graph, a scalar. The value is set to `0` in the test dataset
    - `pos`, the matrix containing the node positions, `[num_of_nodes, 3]`
    - `edge_attr`, the matrix containing the edge information, `[num_edges, 4]`
    - `name`, the identifier of the graph (accessed as `data['name']` in the code below). For example, `gdb_59377`

- Depending on the graph convolutional layer that is used, different components are needed. For the most basic application, `x`, `edge_index` and `y` will be used.
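To make the roles of `x` and `edge_index` concrete, here is a minimal sketch in plain Python (no PyTorch, so the arithmetic is visible). The toy graph, its feature values, and the helper `mean_aggregate` are illustrative assumptions, not part of the homework data. The sketch only averages neighbour features; actual convolutional layers additionally apply learned weights.

```python
# Toy graph with 3 nodes and 2 undirected edges: 0-1 and 1-2.
# Each undirected edge is stored as two directed edges (source row, target row).
edge_index = [
    [0, 1, 1, 2],  # source nodes
    [1, 0, 2, 1],  # target nodes
]
# One scalar feature per node here; the homework graphs have 11 per node.
x = [[1.0], [2.0], [4.0]]


def mean_aggregate(x, edge_index):
    """Replace each node's features by the mean of its in-neighbours' features."""
    num_nodes, num_feats = len(x), len(x[0])
    sums = [[0.0] * num_feats for _ in range(num_nodes)]
    counts = [0] * num_nodes
    for src, dst in zip(edge_index[0], edge_index[1]):
        for f in range(num_feats):
            sums[dst][f] += x[src][f]
        counts[dst] += 1
    # Nodes without incoming edges keep a zero vector.
    return [[s / c if c else 0.0 for s in row] for row, c in zip(sums, counts)]


aggregated = mean_aggregate(x, edge_index)
print(aggregated)  # [[2.0], [2.5], [2.0]]
```

Node 1 has two neighbours (nodes 0 and 2), so its new value is the average of 1.0 and 4.0.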


In [None]:
class QM_Dataset(Dataset):
    def __init__(self, path):
        super().__init__()  # no root directory is needed; the graphs come from a single file
        self.path = path
        self.data = torch.load(self.path)

    def len(self):
        return len(self.data)
    
    def get(self, idx):
        return self.data[idx]
        
train_path = "data2/train.pt"
test_path = "data2/test.pt"

train_data_ = QM_Dataset(train_path)
# The training dataset can be split into two parts for validation purposes
train_data, validate_data = torch.utils.data.random_split(train_data_, [19000, 1000])
test_data = QM_Dataset(test_path)
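
The 19,000 / 1,000 split above is random, so the validation set changes from run to run unless the random state is fixed. The following plain-Python sketch shows the idea of a seeded split on indices only. The helper `split_indices` and the seed value are hypothetical and are not what `random_split` does internally. With PyTorch itself, `torch.utils.data.random_split` accepts a `generator` argument that serves the same purpose.

```python
import random


def split_indices(n_total, n_val, seed=0):
    """Shuffle indices 0..n_total-1 with a fixed seed and hold out n_val of them."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(20000, 1000, seed=0)
print(len(train_idx), len(val_idx))  # 19000 1000
```

Calling `split_indices` again with the same seed returns the same two index lists, which keeps validation MAE values comparable across runs.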

## Part II. Example solution

In [None]:
# Define the network
# Many convolutional layers are available in torch_geometric.nn
# Here NNConv is used as an example
from torch_geometric.nn import NNConv, Set2Set


class Net(torch.nn.Module):
    def __init__(self, num_features=11, dim=64):
        super().__init__()
        self.lin0 = torch.nn.Linear(num_features, dim)
        nn = Sequential(Linear(4, 128), ReLU(), Linear(128, dim * dim))
        self.conv = NNConv(dim, dim, nn, aggr='mean')  # Replace this with your own convolutional layers
        self.set2set = Set2Set(dim, processing_steps=3)  # Set2Set pools node features into one vector per graph
        self.lin1 = torch.nn.Linear(2 * dim, dim)
        self.lin2 = torch.nn.Linear(dim, 1)

    def forward(self, data):
        out = F.relu(self.lin0(data.x))  #data.x size [batch_num_nodes, num_node_features]
        for _ in range(3):
            out = F.relu(self.conv(out, data.edge_index, data.edge_attr))
        out = self.set2set(out, data.batch)  #[batch_num_nodes, dim] ==> [batch_num_graphs, dim*2]
        out = F.relu(self.lin1(out))
        out = self.lin2(out)
        return out.view(-1)
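
As a rough size check on the network above, the sketch below counts the parameters of its fully connected layers by hand. It assumes the default `bias=True` of `torch.nn.Linear` and deliberately leaves out the parameters that `NNConv` and `Set2Set` hold internally, so it is a lower bound rather than the full model size. The helper `linear_params` is hypothetical.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


dim, num_features, edge_features = 64, 11, 4

# Edge network passed to NNConv: Linear(4, 128) -> ReLU -> Linear(128, dim * dim)
edge_net = linear_params(edge_features, 128) + linear_params(128, dim * dim)
dense = {
    "lin0": linear_params(num_features, dim),
    "lin1": linear_params(2 * dim, dim),
    "lin2": linear_params(dim, 1),
}
print(edge_net)  # 529024
print(dense)     # {'lin0': 768, 'lin1': 8256, 'lin2': 65}
```

The edge network dominates because its last layer must output a full `dim x dim` matrix for every edge, which is worth keeping in mind when choosing `dim` or swapping in a different convolutional layer.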

In [None]:
# Define training and evaluation functions
def train(loader):
    """Takes the training dataset loader,
    trains the model for one epoch,
    updates the parameters after every batch,
    returns the mean training loss (MSE) over the epoch"""
    model.train()
    loss_all = 0
    for data in loader:
        data = data.to(device)
        optimizer.zero_grad()
        loss = F.mse_loss(model(data), data.y)
        loss.backward()
        loss_all += loss.item() * data.num_graphs
        optimizer.step()
    return loss_all / len(loader.dataset)

def eval(loader):
    """Takes the validation dataset loader,
    returns the validation MAE"""
    model.eval()
    Ys = []
    Y_true = []
    with torch.no_grad():  # no gradients are needed for evaluation
        for data in loader:
            data = data.to(device)
            out = model(data)
            Ys += out.to("cpu").tolist()
            Y_true += data.y.view(-1).to("cpu").tolist()
    assert len(Y_true) == len(Ys)
    # Mean absolute error over all graphs in the loader
    E = 0.0
    for i in range(len(Y_true)):
        E += abs(Y_true[i] - Ys[i])
    return E / len(Y_true)
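
In `train`, each batch loss is multiplied by `data.num_graphs` before being summed and divided by the dataset size. The sketch below illustrates why: `F.mse_loss` returns a per-batch mean, and the last batch is usually smaller than the others, so an unweighted average of batch means would over-weight it. The batch losses and sizes here are made-up numbers, and `epoch_mean` is a hypothetical helper.

```python
def epoch_mean(batch_means, batch_sizes):
    """Combine per-batch mean losses into the mean over all graphs."""
    total = sum(m * n for m, n in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)


# Hypothetical numbers: two full batches of 128 graphs and a final batch of 44.
batch_means = [1.0, 2.0, 4.0]
batch_sizes = [128, 128, 44]

weighted = epoch_mean(batch_means, batch_sizes)
naive = sum(batch_means) / len(batch_means)
print(round(weighted, 4), round(naive, 4))  # 1.8667 2.3333
```

The weighted value is the true mean over all 300 graphs, while the naive average counts the small final batch as heavily as a full one.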

In [None]:
# Load the datasets
train_loader = DataLoader(train_data, batch_size=128)
validate_loader = DataLoader(validate_data, batch_size=128)
test_loader = DataLoader(test_data, batch_size=8)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = Net().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.00005)

In [None]:
#Training
mae = eval(validate_loader)
print(f'Epoch: 0, MAE: {mae:.7f}')
for epoch in range(1, 200):
    loss = train(train_loader)
    mae = eval(validate_loader)
    print(f'Epoch: {epoch:03d}, Loss: {loss:.7f}, MAE: {mae:.7f}')

Epoch: 0, MAE: 2.8424060
Epoch: 001, Loss: 2.6428377, MAE: 0.9120902
Epoch: 002, Loss: 1.4618543, MAE: 0.8981929
Epoch: 003, Loss: 1.4353232, MAE: 0.8911480
Epoch: 004, Loss: 1.4149225, MAE: 0.8850090
Epoch: 005, Loss: 1.3891462, MAE: 0.8611026
Epoch: 006, Loss: 1.3565412, MAE: 0.8406896
Epoch: 007, Loss: 1.3250736, MAE: 0.8318036
Epoch: 008, Loss: 1.3061361, MAE: 0.8283791
Epoch: 009, Loss: 1.2951733, MAE: 0.8256869
Epoch: 010, Loss: 1.2844763, MAE: 0.8202440
Epoch: 011, Loss: 1.2746580, MAE: 0.8201486
Epoch: 012, Loss: 1.2636615, MAE: 0.8160493
Epoch: 013, Loss: 1.2573440, MAE: 0.8200057
Epoch: 014, Loss: 1.2504716, MAE: 0.8164753
Epoch: 015, Loss: 1.2417441, MAE: 0.8147994
Epoch: 016, Loss: 1.2384232, MAE: 0.8136805
Epoch: 017, Loss: 1.2336900, MAE: 0.8054752
Epoch: 018, Loss: 1.2316047, MAE: 0.8146860
Epoch: 019, Loss: 1.2264985, MAE: 0.8085163
Epoch: 020, Loss: 1.2215189, MAE: 0.8081459
Epoch: 021, Loss: 1.2175943, MAE: 0.8074743
Epoch: 022, Loss: 1.2198225, MAE: 0.8108947
Epoch: 
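
The validation MAE in the log above does not decrease monotonically (epoch 17 is better than the epochs that follow it), so it can help to track the best epoch rather than rely on the last one. A minimal sketch, assuming the log format printed by the training loop and using four lines copied from the output above:

```python
import re

# A few lines copied from the training log above.
log = """\
Epoch: 016, Loss: 1.2384232, MAE: 0.8136805
Epoch: 017, Loss: 1.2336900, MAE: 0.8054752
Epoch: 018, Loss: 1.2316047, MAE: 0.8146860
Epoch: 019, Loss: 1.2264985, MAE: 0.8085163
"""

pattern = re.compile(r"Epoch: (\d+), Loss: ([\d.]+), MAE: ([\d.]+)")
records = [(int(e), float(l), float(m)) for e, l, m in pattern.findall(log)]

# Pick the record with the smallest validation MAE.
best_epoch, _, best_mae = min(records, key=lambda r: r[2])
print(best_epoch, best_mae)  # 17 0.8054752
```

The same comparison could be done inside the training loop, for example by keeping the lowest `mae` seen so far, instead of parsing printed text afterwards.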

In [None]:
# Predict
model.eval()
Ys = []
Names = []
with torch.no_grad():  # inference only, no gradients are needed
    for data in test_loader:
        data = data.to(device)
        out = model(data)
        ys = out.to("cpu").tolist()
        Ys += ys
        Names += data['name']
 
assert(len(Names) == len(Ys))
df = pd.DataFrame({"Names": Names, "labels": Ys})
df.head()

Unnamed: 0,Names,labels
0,gdb_59377,1.450459
1,gdb_14632,3.597212
2,gdb_35326,1.688538
3,gdb_11448,2.629341
4,gdb_35889,3.674875


In [None]:
#Upload solution
df.columns = ['Idx', 'labels']
df.to_csv("/home/liucmu/pps3/PYG/GCNN/data2/Y_sample.csv", index=False)
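
For reference, the saved file is expected to contain a header row `Idx,labels` followed by one row per test graph. The sketch below reproduces that layout with the standard `csv` module, using the first two rows of the prediction table shown above and an in-memory buffer in place of a real file. It is an illustration of the format under those assumptions, not a replacement for `df.to_csv`.

```python
import csv
import io

# First rows of the prediction table shown above.
names = ["gdb_59377", "gdb_14632"]
labels = [1.450459, 3.597212]

buffer = io.StringIO()  # in-memory stand-in for the output file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Idx", "labels"])
writer.writerows(zip(names, labels))

print(buffer.getvalue())
```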