<a href="https://colab.research.google.com/github/dsinsight/GNN/blob/main/Simple_GNN_using_CORA_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Simple GNN using CORA Dataset**

## **1. Importing Required Libraries**

- **torch:** PyTorch library, used for building and training deep learning models.
- **torch.nn.functional as F:** Provides useful functions like activation functions (relu, log_softmax), loss functions (nll_loss), and other operations that don't require creating layers explicitly.
- **GCNConv:** A Graph Convolution layer from torch_geometric, which is a key component for building Graph Neural Networks (GNNs).
- **Planetoid:** A dataset loader from torch_geometric for benchmark datasets like CORA, CiteSeer, and PubMed. These datasets are often used for tasks like node classification on citation networks.
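- **DataLoader:** A PyTorch Geometric utility for loading data in mini-batches. It is not used in this example, because CORA is a single graph and the model is trained on it in full-batch mode. In PyTorch Geometric 2.x the class lives in torch_geometric.loader; the older torch_geometric.data import path is deprecated.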


In [28]:
!pip install torch_geometric
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import DataLoader  # not used below; torch_geometric.data.DataLoader is deprecated



## **2. Loading the Dataset**

- **dataset = Planetoid(...):** Loads the CORA dataset, a citation network in which nodes represent papers and edges represent citations between them. The graph has 2,708 nodes. Each node has a feature vector (a binary bag-of-words vector over a 1,433-word vocabulary, marking which words appear in the paper) and a label (one of 7 paper categories).
- **root='/tmp/Cora':** Specifies where to download/store the dataset locally.

In [29]:
# Load the Cora dataset (a citation network)
dataset = Planetoid(root='/tmp/Cora', name='Cora')
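
As a minimal, dependency-free sketch of the data layout described above: PyTorch Geometric stores the graph as a node-feature matrix (data.x), a label vector (data.y), and an edge list in coordinate format (data.edge_index, a 2 x num_edges array of source and target node indices). The four-paper graph, the three-word vocabulary, and the helper name to_edge_index below are made up for illustration and are not part of the library.

```python
# Toy citation graph: 4 papers, 3 citation links (hypothetical data).
citations = [(0, 1), (0, 2), (2, 3)]


def to_edge_index(pairs):
    """Return a coordinate-format edge list [sources, targets].

    Every link is stored in both directions, so the graph is undirected.
    """
    sources, targets = [], []
    for u, v in pairs:
        sources += [u, v]
        targets += [v, u]
    return [sources, targets]


edge_index = to_edge_index(citations)

# Bag-of-words features: x[i][j] == 1 if paper i contains vocabulary word j.
x = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
# One integer class label per paper.
y = [0, 1, 0, 1]

num_nodes = len(x)
num_features = len(x[0])
num_edges = len(edge_index[0])  # each undirected link is counted twice
print(num_nodes, num_features, num_edges)
```

Because each undirected link is stored in both directions, the edge count is twice the number of links. The CORA graph loaded above is stored the same way.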

## **3. Defining the Graph Convolutional Network (GCN) Model**

- **GCN:** A custom class for the Graph Convolutional Network (GCN) model. It inherits from torch.nn.Module.
- **GCNConv(dataset.num_features, 16):** The first graph convolution layer. dataset.num_features is the number of input features per node (for CORA, the 1,433 entries of the bag-of-words vector), and 16 is the number of output channels (hidden units).
- **GCNConv(16, dataset.num_classes):** The second graph convolution layer. It takes the 16 hidden units from the first layer as input and outputs dataset.num_classes values per node, one for each paper category in CORA (7 in total).
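
To illustrate what a single graph-convolution layer computes, the following plain-Python sketch applies the propagation rule from Kipf and Welling's GCN formulation, which GCNConv follows with its default settings: add self-loops to the adjacency matrix, normalize it symmetrically by node degree, and multiply it by the linearly transformed features. The function name gcn_layer and the three-node path graph are hypothetical, and the bias term and the library's sparse-matrix machinery are left out.

```python
import math


def gcn_layer(x, edges, weight):
    """One graph-convolution step: normalize(A + I) @ x @ weight.

    x: per-node feature lists; edges: undirected (u, v) pairs;
    weight: in_features x out_features matrix as nested lists.
    """
    n = len(x)
    n_in, n_out = len(weight), len(weight[0])
    # Adjacency matrix with self-loops, so each node keeps its own features.
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    # Symmetric normalization: a_hat[i][j] = a[i][j] / sqrt(deg_i * deg_j).
    deg = [sum(row) for row in a]
    a_hat = [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
             for i in range(n)]
    # Linear transform of every node's features.
    xw = [[sum(row[k] * weight[k][o] for k in range(n_in)) for o in range(n_out)]
          for row in x]
    # Weighted sum over each node's neighborhood (including itself).
    return [[sum(a_hat[i][j] * xw[j][o] for j in range(n)) for o in range(n_out)]
            for i in range(n)]


# Path graph 0 - 1 - 2, one scalar feature per node, identity weight.
out = gcn_layer([[1.0], [0.0], [0.0]], [(0, 1), (1, 2)], [[1.0]])
print(out)
```

Only node 0 starts with a non-zero feature. After one layer its direct neighbor (node 1) has received part of that signal, while node 2, two hops away, is still at zero. Stacking a second layer, as the model does, extends each node's reach to two hops.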

## **4. Defining the Forward Pass**

- **data.x:** The feature matrix of the graph, where each row represents a node and contains its features.
- **data.edge_index:** The edge list that represents the connections between nodes (edges of the graph).
- **self.conv1(x, edge_index):** The first graph convolution layer. It aggregates features from neighboring nodes based on the graph structure (defined by edge_index).
- **F.relu(x):** Applies the ReLU activation function to introduce non-linearity.
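- **F.dropout(x, training=self.training):** Applies dropout (with the default drop probability p=0.5) as regularization against overfitting. Because of training=self.training, it is active only in training mode and does nothing once model.eval() has been called.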
- **self.conv2(x, edge_index):** The second graph convolution layer, which produces the logits for classification.
- **F.log_softmax(x, dim=1):** Converts the logits into log-probabilities over the classes for each node. This is the input format that F.nll_loss expects in multi-class classification.



In [30]:
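class GCN(torch.nn.Module):
    """Two-layer GCN for node classification."""

    def __init__(self):
        super().__init__()
        # Layer sizes are read from the global dataset loaded above.
        self.conv1 = GCNConv(dataset.num_features, 16)
        self.conv2 = GCNConv(16, dataset.num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)             # aggregate neighbor features
        x = F.relu(x)
        x = F.dropout(x, training=self.training)  # active only in training mode
        x = self.conv2(x, edge_index)             # per-node class logits
        return F.log_softmax(x, dim=1)            # per-node log-probabilities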
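
The two element-wise steps of the forward pass can be sketched without PyTorch. The code below is an illustrative reimplementation, not the library's code: dropout uses the usual "inverted" scheme (zero an entry with probability p and scale the survivors by 1 / (1 - p)), and log_softmax uses the standard max-subtraction trick for numerical stability. The function names mirror the PyTorch ones only for readability, and the input values are made up.

```python
import math
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each entry with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(values)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in values]


def log_softmax(logits):
    """Numerically stable log-softmax over one node's class scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


hidden = [0.2, 1.5, 0.0, 0.7]
print(dropout(hidden, training=False))        # evaluation mode: unchanged
print(dropout(hidden, rng=random.Random(0)))  # each entry is zeroed or doubled

log_probs = log_softmax([2.0, 1.0, 0.1])
print(sum(math.exp(lp) for lp in log_probs))  # probabilities sum to 1 (up to rounding)
```

Rescaling the surviving activations keeps their expected value the same in training and evaluation mode, which is why no extra correction is needed at test time.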

## **5. Creating the Model and Optimizer**

- **torch.device('cuda'...):** Checks if a GPU is available and sets the device to cuda (GPU) if available, otherwise uses cpu.
- **model = GCN().to(device):** Initializes the GCN model and transfers it to the specified device (GPU or CPU).
- **data = dataset[0].to(device):** Accesses the first graph object in the dataset (in this case, the CORA citation network) and transfers it to the same device.
- **optimizer = torch.optim.Adam(...):** Creates an Adam optimizer with a learning rate of 0.01 and a weight decay of 5e-4. The weight decay acts as L2 regularization on the parameters and helps prevent overfitting.

In [32]:
# Create the model and the optimizer (the loss is computed inside the training loop)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GCN().to(device)
data = dataset[0].to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
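
To show how the learning rate and the weight decay enter the update, here is a sketch of one Adam step for a single scalar parameter. It assumes the optimizer's default hyperparameters (betas of 0.9 and 0.999, eps of 1e-8) and treats weight decay as torch.optim.Adam does, as an L2 term added to the gradient. The function adam_step and its state dictionary are illustrative and are not the library's implementation.

```python
import math


def adam_step(param, grad, state, lr=0.01, weight_decay=5e-4,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter, with L2 weight decay."""
    grad = grad + weight_decay * param  # L2 penalty folded into the gradient
    state["t"] += 1
    # Exponential moving averages of the gradient and its square.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction compensates for the zero-initialized averages.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


state = {"t": 0, "m": 0.0, "v": 0.0}
w = adam_step(1.0, grad=0.3, state=state)
print(w)  # close to 0.99: the first step has a magnitude of about lr
```

Because the update divides by the running gradient magnitude, the step size is governed mainly by lr rather than by the raw gradient scale.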

## **6. Training the GNN**

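- **model.train():** Sets the model to training mode, which enables dropout (layers such as dropout behave differently in training and evaluation modes).
- **Training Loop:** The model is trained for 200 epochs. Each epoch is a single full-batch pass over the whole graph, so there is one parameter update per epoch.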
  - **optimizer.zero_grad():** Clears the gradients from the previous step.
  - **out = model(data):** Forward pass through the GCN model, generating predictions for the graph.
  - **F.nll_loss(out[data.train_mask], data.y[data.train_mask]):** Calculates the negative log-likelihood loss on the training nodes only (selected by data.train_mask), comparing the predicted log-probabilities (out) with the true labels (data.y).
  - **loss.backward():** Computes the gradients for the model parameters using backpropagation.
  - **optimizer.step():** Updates the model parameters based on the computed gradients.

In [33]:
# Training the GNN
model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
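
The masked loss in the loop above can be sketched in plain Python. The sketch assumes the default mean reduction of F.nll_loss; the function name masked_nll_loss and the three-node example values are hypothetical.

```python
import math


def masked_nll_loss(log_probs, labels, mask):
    """Mean negative log-likelihood over the nodes where mask is True."""
    picked = [-log_probs[i][labels[i]] for i in range(len(labels)) if mask[i]]
    return sum(picked) / len(picked)


# Per-node log-probabilities for a 2-class problem (made-up values).
log_probs = [
    [math.log(0.7), math.log(0.3)],
    [math.log(0.4), math.log(0.6)],
    [math.log(0.1), math.log(0.9)],
]
labels = [0, 1, 0]
train_mask = [True, True, False]  # the third node is held out

loss = masked_nll_loss(log_probs, labels, train_mask)
print(f'{loss:.4f}')
```

The held-out node is badly misclassified here, but it does not affect the loss. In the same way, the validation and test nodes of CORA take part in message passing but contribute nothing to the gradient.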

## **7. Testing the GNN**

- **model.eval():** Puts the model into evaluation mode, which disables dropout (and switches any other layers that behave differently during training).
- **_, pred = model(data).max(dim=1):** Performs a forward pass to get predictions for all nodes in the graph. max(dim=1) returns both the maximum log-probability for each node and its index; the index (pred) is the predicted class label.
- **pred[data.test_mask].eq(data.y[data.test_mask]):** Compares the predicted labels (pred) with the actual labels (data.y) for the test set (indicated by data.test_mask).
- **correct:** Counts the number of correct predictions.
- **accuracy:** Computes the accuracy by dividing the number of correct predictions by the total number of test examples.
- **print(f'Accuracy: {accuracy:.4f}'):** Prints the accuracy of the model on the test set.



In [34]:
# Testing the GNN
model.eval()
_, pred = model(data).max(dim=1)
correct = int(pred[data.test_mask].eq(data.y[data.test_mask]).sum().item())
accuracy = correct / int(data.test_mask.sum())
print(f'Accuracy: {accuracy:.4f}')

Accuracy: 0.8030
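
The accuracy computation can be traced by hand on a small example. The sketch below takes the arg-max class for every node and then counts matches on the masked subset only; the function name masked_accuracy and the four-node values are invented for illustration.

```python
def masked_accuracy(log_probs, labels, mask):
    """Fraction of masked nodes whose highest-scoring class equals the label."""
    correct = total = 0
    for scores, label, keep in zip(log_probs, labels, mask):
        if not keep:
            continue
        pred = max(range(len(scores)), key=scores.__getitem__)  # arg-max class
        correct += int(pred == label)
        total += 1
    return correct / total


log_probs = [[-0.1, -2.4], [-1.6, -0.2], [-0.3, -1.3], [-0.9, -0.5]]
labels = [0, 1, 1, 1]
test_mask = [False, True, True, True]  # the first node is not a test node

accuracy = masked_accuracy(log_probs, labels, test_mask)
print(f'Accuracy: {accuracy:.4f}')
```

Two of the three test nodes are predicted correctly, so the result is 2/3.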


This code implements a **Graph Convolutional Network (GCN)** for node classification on the **CORA citation network**. The GCN leverages the graph structure by using graph convolutions, which allow the model to aggregate information from neighboring nodes to improve classification performance. The code includes the steps for loading the data, defining the GCN architecture, training the model, and evaluating its accuracy on the test set.