In [3]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

# Load the Cora dataset
dataset = Planetoid(root='data/Cora', name='Cora')

# Prepare data
data = dataset[0]

# Define a 2-layer GCN
class GCN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = torch.relu(x)
        x = self.conv2(x, edge_index)
        return torch.log_softmax(x, dim=1)

# Initialize model, optimizer, and loss function
model = GCN(input_dim=dataset.num_node_features, hidden_dim=16, output_dim=dataset.num_classes)
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

def train():
    # Training loop
    for epoch in range(100):
        model.train()
        optimizer.zero_grad()
        
        # Forward pass
        out = model(data)
        loss = criterion(out[data.train_mask], data.y[data.train_mask])
        loss.backward()
        optimizer.step()

        if epoch % 10 == 0:
            print(f'Epoch {epoch}, Loss: {loss.item()}')

    print("Training complete!")

train()


Epoch 0, Loss: 1.9547138214111328
Epoch 10, Loss: 0.6605719327926636
Epoch 20, Loss: 0.1383318454027176
Epoch 30, Loss: 0.030570585280656815
Epoch 40, Loss: 0.01079611573368311
Epoch 50, Loss: 0.005712065380066633
Epoch 60, Loss: 0.003950811456888914
Epoch 70, Loss: 0.0031179855577647686
Epoch 80, Loss: 0.002624556655064225
Epoch 90, Loss: 0.002282576635479927
Training complete!


## Explanation:
GCN aggregates features from a node’s neighbors using graph convolutions. This allows the network to learn representations based on both node features and graph structure.
The Cora dataset is used to classify nodes into one of 7 research topics.
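
To make the neighbor aggregation concrete, here is a minimal, dependency-free sketch of one GCN-style propagation step on a toy graph. It assumes the default `GCNConv` behavior of adding self-loops and applying symmetric degree normalization, and it leaves out the learned weight matrix and bias. The function name `gcn_propagate` and the toy graph are made up for illustration.

```python
import math

def gcn_propagate(features, edges):
    """One GCN-style aggregation step, without the learned weights.

    features: list of per-node feature lists
    edges: list of undirected (u, v) pairs
    """
    n = len(features)
    # Add a self-loop so each node keeps part of its own signal.
    neighbors = [{i} for i in range(n)]
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degree = [len(nb) for nb in neighbors]

    out = []
    for i in range(n):
        row = [0.0] * len(features[i])
        for j in neighbors[i]:
            # Symmetric normalization: 1 / sqrt(deg_i * deg_j)
            w = 1.0 / math.sqrt(degree[i] * degree[j])
            for k, value in enumerate(features[j]):
                row[k] += w * value
        out.append(row)
    return out

# Toy path graph 0 - 1 - 2 with one-hot node features (hypothetical data).
toy_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_edges = [(0, 1), (1, 2)]
propagated = gcn_propagate(toy_features, toy_edges)
print(propagated[0])
```

After one step, node 0 carries a mix of its own feature and node 1's feature, but nothing from node 2, which is two hops away. A second layer would be needed for that information to arrive.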

## Questions (1 point each):

1. What would happen if we added more GCN layers (e.g., 3 layers instead of 2)? How would this affect over-smoothing?
2. What would happen if we used a larger hidden dimension (e.g., 64 instead of 16)? How would this impact the model's capacity?
3. What would happen if we replaced ReLU activation with a sigmoid function? Would the performance change?
4. What would happen if we trained on only 10% of the nodes and tested on the remaining 90%? How would the performance be affected?
5. What would happen if we used a different optimizer (e.g., RMSprop) instead of Adam? Would it affect the convergence speed?

Extra credit: 
1. What would happen if we used edge weights (non-binary) in the adjacency matrix? How would it affect message passing?
2. What would happen if we removed the log-softmax function in the output layer? Would the loss function still work correctly?

## No points, just for you to think about:
1. What would happen if we applied dropout to the node features during training? How would it affect the model’s generalization?
2. What would happen if we used mean-pooling instead of summing the messages in the GCN layers?
3. What would happen if we pre-trained the node features using a different algorithm, like Node2Vec, before feeding them into the GCN?


### 1. What would happen if we added more GCN layers (e.g., 3 layers instead of 2)? How would this affect over-smoothing?

In [4]:
# Define a 3-layer GCN
class GCN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.conv3 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = torch.relu(x)
        x = self.conv2(x, edge_index)
        x = torch.relu(x)
        x = self.conv3(x, edge_index)
        return torch.log_softmax(x, dim=1)

# Initialize model, optimizer, and loss function
model = GCN(input_dim=dataset.num_node_features, hidden_dim=16, output_dim=dataset.num_classes)
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

train()

Epoch 0, Loss: 1.9440739154815674
Epoch 10, Loss: 0.7493696212768555
Epoch 20, Loss: 0.1318102329969406
Epoch 30, Loss: 0.016000602394342422
Epoch 40, Loss: 0.003237819531932473
Epoch 50, Loss: 0.0012195336166769266
Epoch 60, Loss: 0.0007049707346595824
Epoch 70, Loss: 0.0005119502893649042
Epoch 80, Loss: 0.00041591981425881386
Epoch 90, Loss: 0.0003577024326659739
Training complete!


Adding a third layer lowers the training loss further: at epoch 90 it drops from about 0.0023 to about 0.0004. Training loss alone does not show that the model is more accurate, though. Each extra layer aggregates over one more hop of neighbors, so node representations become more alike (over-smoothing) and some node-specific information is lost. With only 3 layers the effect is probably mild; it tends to become more pronounced as more layers are stacked.
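
The over-smoothing effect can be sketched on a toy graph. The example below is a simplification, not what `GCNConv` computes exactly: it assumes plain mean aggregation with self-loops and no learned weights or nonlinearity, and it tracks how far apart the node values are after each round of aggregation. The graph and the name `mean_aggregate` are hypothetical.

```python
def mean_aggregate(values, neighbors):
    """Replace each node's value with the mean over itself and its neighbors."""
    return [sum(values[j] for j in nbrs) / len(nbrs) for nbrs in neighbors]

# Path graph 0 - 1 - 2 - 3; each list includes the node itself (self-loop).
path_neighbors = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3]]
values = [1.0, 0.0, 0.0, 0.0]

spreads = []
for _ in range(50):
    # Spread = gap between the largest and smallest node value.
    spreads.append(max(values) - min(values))
    values = mean_aggregate(values, path_neighbors)
final_spread = max(values) - min(values)

print(spreads[:4], final_spread)
```

The spread shrinks with every aggregation step, so after enough steps all nodes hold nearly the same value and can no longer be told apart. A real GCN has weights and activations between the steps, so this only illustrates the tendency.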

### 2. What would happen if we used a larger hidden dimension (e.g., 64 instead of 16)? How would this impact the model's capacity?

In [5]:
# Define a 2-layer GCN
class GCN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = torch.relu(x)
        x = self.conv2(x, edge_index)
        return torch.log_softmax(x, dim=1)

# Initialize model, optimizer, and loss function
model = GCN(input_dim=dataset.num_node_features, hidden_dim=64, output_dim=dataset.num_classes)
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

train()

Epoch 0, Loss: 1.9572679996490479
Epoch 10, Loss: 0.08742734789848328
Epoch 20, Loss: 0.003399561857804656
Epoch 30, Loss: 0.0006251619779504836
Epoch 40, Loss: 0.00026571392663754523
Epoch 50, Loss: 0.00018113326223101467
Epoch 60, Loss: 0.00015136782894842327
Epoch 70, Loss: 0.00013669671898242086
Epoch 80, Loss: 0.00012743067054543644
Epoch 90, Loss: 0.00012043907190673053
Training complete!


Using a larger hidden dimension gives the model more capacity: each layer can learn more features per node. It does not add to over-smoothing, because the number of aggregation steps is unchanged; only the width of each representation grows. The added capacity may lead to overfitting if the model becomes too complex for the amount of labeled data. We can see an even larger decrease in training loss, from about 0.0023 to about 0.0001 at epoch 90.
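
As a rough measure of capacity, we can count the trainable parameters for both hidden sizes. This is a sketch that assumes each `GCNConv` layer holds one weight matrix and one bias vector, and that Cora has 1433 input features and 7 classes. The helper `gcn_param_count` is made up for illustration.

```python
def gcn_param_count(input_dim, hidden_dim, output_dim):
    """Weights plus biases for two stacked GCN layers."""
    layer1 = input_dim * hidden_dim + hidden_dim
    layer2 = hidden_dim * output_dim + output_dim
    return layer1 + layer2

small = gcn_param_count(1433, 16, 7)
large = gcn_param_count(1433, 64, 7)
print(small, large)  # 23063 92231
```

Going from 16 to 64 hidden units roughly quadruples the parameter count, because almost all parameters sit in the first layer's weight matrix. In the notebook itself, `sum(p.numel() for p in model.parameters())` should give the same totals if these assumptions hold.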

### 3. What would happen if we replaced ReLU activation with a sigmoid function? Would the performance change?

In [6]:
# Define a 2-layer GCN
class GCN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = torch.sigmoid(x)
        x = self.conv2(x, edge_index)
        return torch.log_softmax(x, dim=1)

# Initialize model, optimizer, and loss function
model = GCN(input_dim=dataset.num_node_features, hidden_dim=16, output_dim=dataset.num_classes)
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

train()

Epoch 0, Loss: 2.1052041053771973
Epoch 10, Loss: 1.4802534580230713
Epoch 20, Loss: 1.0245342254638672
Epoch 30, Loss: 0.6691532135009766
Epoch 40, Loss: 0.42247724533081055
Epoch 50, Loss: 0.2708204686641693
Epoch 60, Loss: 0.18145246803760529
Epoch 70, Loss: 0.12838099896907806
Epoch 80, Loss: 0.09573208540678024
Epoch 90, Loss: 0.07465018332004547
Training complete!


Using sigmoid instead of ReLU squashes the hidden activations (not the weights) into the range 0 to 1, whereas ReLU leaves positive values unbounded. Sigmoid's gradient is also small, at most 0.25, and approaches 0 for large inputs, which slows learning. We can see that here the loss converges much more slowly and ends higher (about 0.075 versus 0.0023 at epoch 90), which suggests that the bounded, saturating activation hinders training on this dataset.
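
One likely reason for the slower convergence is the size of the gradient that each activation passes backward. The sketch below compares the derivatives of the two functions at a few input values; the function names are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if z > 0 else 0.0

for z in (0.0, 2.0, 5.0):
    print(f"z={z}: sigmoid'={sigmoid_grad(z):.4f}, relu'={relu_grad(z):.1f}")
```

For positive inputs ReLU passes the gradient through unchanged, while sigmoid scales it down by at least a factor of 4, and by much more once the input is large.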

### 4. What would happen if we trained on only 10% of the nodes and tested on the remaining 90%? How would the performance be affected?

In [9]:
# Define a 2-layer GCN
class GCN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = torch.relu(x)
        x = self.conv2(x, edge_index)
        return torch.log_softmax(x, dim=1)

# Initialize model, optimizer, and loss function
model = GCN(input_dim=dataset.num_node_features, hidden_dim=16, output_dim=dataset.num_classes)
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

train_mask = torch.randperm(data.num_nodes) < int(0.1 * data.num_nodes)
test_mask = ~train_mask

# Training loop
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    
    # Forward pass
    out = model(data)
    loss = criterion(out[train_mask], data.y[train_mask])
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

print("Training complete!")

model.eval()
with torch.no_grad():
    out = model(data)
    test_loss = criterion(out[test_mask], data.y[test_mask])

print(test_loss)

Epoch 0, Loss: 1.937464714050293
Epoch 10, Loss: 0.7758140563964844
Epoch 20, Loss: 0.2278340607881546
Epoch 30, Loss: 0.07061011344194412
Epoch 40, Loss: 0.028967803344130516
Epoch 50, Loss: 0.015462013892829418
Epoch 60, Loss: 0.010025356896221638
Epoch 70, Loss: 0.007363757584244013
Epoch 80, Loss: 0.005820433143526316
Epoch 90, Loss: 0.004811818245798349
Training complete!
tensor(0.7256)


Training on only 10% of the nodes and testing on the remaining 90% gives a model that generalizes poorly relative to how well it fits the training nodes. We can see this through the loss on the held-out nodes (0.7256) being far higher than the final training loss (about 0.005). This gap points to overfitting, as the model has seen labels for too few nodes to cover the variety in the data.
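
A loss value on its own is hard to interpret, so accuracy on the held-out nodes is a useful second number. The sketch below shows the computation in plain Python on made-up scores; `masked_accuracy` and the toy data are hypothetical. In the notebook, the same idea would start from `out.argmax(dim=1)` and the `test_mask` defined above.

```python
def masked_accuracy(scores, labels, mask):
    """Fraction of masked nodes whose highest-scoring class matches the label."""
    correct = 0
    total = 0
    for row, label, keep in zip(scores, labels, mask):
        if not keep:
            continue
        # Index of the largest score is the predicted class.
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
        total += 1
    return correct / total if total else float("nan")

# Hypothetical scores for four nodes and three classes.
toy_scores = [[2.0, 0.1, -1.0], [0.3, 0.2, 0.9], [0.0, 1.5, 0.2], [1.0, 0.9, 0.8]]
toy_labels = [0, 2, 0, 0]
toy_test_mask = [False, True, True, True]
toy_accuracy = masked_accuracy(toy_scores, toy_labels, toy_test_mask)
print(toy_accuracy)
```

Here two of the three held-out nodes are classified correctly. Reporting accuracy for both the training and the test nodes would make the generalization gap easier to judge than the loss values alone.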

### 5. What would happen if we used a different optimizer (e.g., RMSprop) instead of Adam? Would it affect the convergence speed?

In [10]:
# Define a 2-layer GCN
class GCN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = torch.relu(x)
        x = self.conv2(x, edge_index)
        return torch.log_softmax(x, dim=1)

# Initialize model, optimizer, and loss function
model = GCN(input_dim=dataset.num_node_features, hidden_dim=16, output_dim=dataset.num_classes)
optimizer = optim.RMSprop(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

train()

Epoch 0, Loss: 1.9579170942306519
Epoch 10, Loss: 0.03823234885931015
Epoch 20, Loss: 0.013815916143357754
Epoch 30, Loss: 0.00785114150494337
Epoch 40, Loss: 0.005249897018074989
Epoch 50, Loss: 0.0038193848449736834
Epoch 60, Loss: 0.002943785861134529
Epoch 70, Loss: 0.002356960903853178
Epoch 80, Loss: 0.00193834921810776
Epoch 90, Loss: 0.0016265342710539699
Training complete!


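RMSprop scales each parameter's step by a running average of its squared gradients, which tends to work well when gradient magnitudes vary a lot. In this run the loss drops much faster early on (0.038 at epoch 10 versus 0.66 with Adam) and ends slightly lower (about 0.0016 versus 0.0023 at epoch 90), so with the same learning rate RMSprop converges faster here.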
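
One plausible contributor to the faster early drop is the size of the very first update. The sketch below works through the first step of each optimizer for a single parameter. It assumes the default hyperparameters as I understand them (`alpha=0.99` for RMSprop with no momentum, `betas=(0.9, 0.999)` for Adam) and the learning rate of 0.01 used above; the function names are made up for illustration.

```python
import math

def rmsprop_first_step(grad, lr=0.01, alpha=0.99, eps=1e-8):
    """Size of the first RMSprop update (no momentum, no bias correction)."""
    square_avg = (1 - alpha) * grad * grad
    return lr * grad / (math.sqrt(square_avg) + eps)

def adam_first_step(grad, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update, including bias correction."""
    m_hat = ((1 - beta1) * grad) / (1 - beta1)
    v_hat = ((1 - beta2) * grad * grad) / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

rms_step = rmsprop_first_step(2.0)
adam_step = adam_first_step(2.0)
print(rms_step, adam_step)
```

Because RMSprop's running average starts at zero and is not bias-corrected, its first step is about ten times the learning rate, while Adam's first step is about equal to the learning rate. This only covers the first few updates, so it is a partial explanation at best, and a comparison over several random seeds would be needed to confirm it.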