# Graph Machine Learning

This practical session introduces graph machine learning with Graph Neural Networks (GNNs) through a simple, classic classification experiment on the [MUTAG dataset]().

Before starting, let's check that the kernel is configured correctly. Run the next cell; it should not raise any error.

In [None]:
import os
import torch
os.environ['TORCH'] = torch.__version__
print(torch.__version__)

import networkx as nx
import numpy as np
import matplotlib.pyplot as plt

from tqdm import tqdm

## Dataset Loading


The MUTAG dataset is a classic classification dataset used in graph machine learning. It consists of a collection of mutagenic aromatic and heteroaromatic nitro compounds. The goal is to predict whether a compound is mutagenic or non-mutagenic based on its molecular structure.

The dataset contains a set of graphs, where each graph represents a compound. Each node in the graph represents an atom in the compound, and the edges represent the bonds between atoms. The nodes carry the atom types as attributes and the edges the kind of atomic bond.

The dataset is labeled, with each graph labeled as either mutagenic or non-mutagenic. This makes it a binary graph classification problem.
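As a rough illustration of the representation described above, the sketch below builds a toy three-atom graph by hand with plain PyTorch tensors. The names `x`, `edge_index` and `edge_attr` follow the PyTorch Geometric convention; the values are made up for illustration and are not taken from MUTAG.

```python
import torch

# Toy molecule with 3 atoms and 2 bonds (0-1 and 1-2); values are illustrative only.
# Node features: one row per atom, one-hot over 2 hypothetical atom types.
x = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 0.0]])

# Edges in COO format: column k is the directed edge edge_index[0, k] -> edge_index[1, k].
# An undirected bond is stored as two directed edges.
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]])

# Edge features: one row per directed edge, one-hot over 2 hypothetical bond types.
edge_attr = torch.tensor([[1.0, 0.0],
                          [1.0, 0.0],
                          [0.0, 1.0],
                          [0.0, 1.0]])

num_nodes = x.size(0)
num_bonds = edge_index.size(1) // 2
print(num_nodes, num_bonds)
```

Storing each bond twice is why the number of edges reported for a molecular graph is usually twice the number of chemical bonds.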

The following code imports the `TUDataset` class from the `torch_geometric.datasets` module and creates an instance of the dataset, specifying the root directory where the data will be stored locally.

In [None]:
from torch_geometric.datasets import TUDataset


dataset_path = "data/TUDataset"
dataset = TUDataset(root=dataset_path, name='MUTAG')

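# The constructor above already downloads and processes the data when it is missing,
# so an explicit call to dataset.download() is not needed.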

### Data exploration

The size of the dataset gives the number of compounds. Each item in the dataset corresponds to a graph, and each item of `dataset.y` corresponds to the class of the compound.

In [None]:
print(len(dataset))

In [None]:
graph = dataset[0]
print(graph)

To visualize a molecule, we can rely on the `networkx` library. Note that node attributes encode the atom type as a one-hot vector, and edge attributes encode the kind of atomic bond in the same way.

What are the possible node and edge labels? How many nodes does the first graph have?
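One possible way to read a one-hot encoding is to take the position of the 1 in each row. The sketch below does this on a small hypothetical feature matrix (not MUTAG data) using only `torch.argmax` and `torch.bincount`.

```python
import torch

# Hypothetical one-hot node features for 4 nodes and 3 possible labels.
x = torch.tensor([[1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])

# argmax along the feature dimension recovers the integer label of each node.
labels = x.argmax(dim=1)

# Count how many nodes carry each label.
counts = torch.bincount(labels, minlength=x.size(1))
print(labels.tolist(), counts.tolist())
```

The same two calls can be applied to `graph.x` or `graph.edge_attr` to see which label indices appear in a given molecule.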

In [None]:
from torch_geometric.utils import to_networkx
g = to_networkx(graph, node_attrs=['x'])  # node_attrs expects a collection of attribute names
nx.draw(g, with_labels=True)


## Preprocessing the dataset

Reminder: keeping the train and test sets separate is essential, and PyTorch Geometric offers good practices for doing so.
This section also introduces dataloaders, the notion of batch, and the choice of whether to shuffle.


To build a good model, we need to evaluate its performance on a test set that is not used during training. This split can be done by passing the dataset object to `train_test_split` from the scikit-learn library.

Once the split is done, we create two dataloaders. The `DataLoader` class loads the data in mini-batches during the training and testing phases.

Instead of processing the whole dataset at once, the model iterates over batches of graphs (here, up to 64 graphs per batch). The training loader shuffles the graphs at every epoch, while the test loader keeps a fixed order.
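Batching graphs differs from batching images because graphs have different sizes. PyTorch Geometric collates the graphs of a mini-batch into one large disconnected graph and keeps a `batch` vector that records which graph each node belongs to. The sketch below reproduces the idea by hand on two toy graphs with plain PyTorch; it illustrates the principle and is not the library's actual implementation.

```python
import torch

# Two toy graphs: node features and edge_index. Values are illustrative only.
x1 = torch.ones(3, 2)                      # graph 1: 3 nodes
e1 = torch.tensor([[0, 1], [1, 2]])        # edges 0->1 and 1->2
x2 = torch.zeros(2, 2)                     # graph 2: 2 nodes
e2 = torch.tensor([[0], [1]])              # edge 0->1

# Stack node features, and shift the node indices of graph 2 by the size of graph 1.
x = torch.cat([x1, x2], dim=0)
edge_index = torch.cat([e1, e2 + x1.size(0)], dim=1)

# The batch vector maps each node to the graph it belongs to.
batch = torch.cat([torch.zeros(3, dtype=torch.long), torch.ones(2, dtype=torch.long)])

print(edge_index.tolist())
print(batch.tolist())
```

Because no edge connects nodes of different graphs, message passing on the merged graph gives the same result as processing each graph separately.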

In [None]:
from sklearn.model_selection import train_test_split

train_dataset, test_dataset = train_test_split(dataset, train_size=.8, random_state=42) 


In [None]:
from torch_geometric.loader import DataLoader

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

In [None]:
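# Check the number of graphs in each batch
# (len(data) would return the number of stored attributes, not the batch size)
for data in train_loader:
    print(data.num_graphs)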

## A first GNN 

To create our first GNN prediction model, we will rely on a model implemented with the PyTorch Geometric library in the `graphadon.py` file. We will first use it as a black box, and then try to understand its components.

![./figures/graph_level.svg](./figures/graph_level.svg)

Create an instance of `FirstGNN` with the default parameters and print the instance.  

In [None]:
%load_ext autoreload
%autoreload 2

from graphadon import FirstGNN

num_node_features = dataset.num_node_features
model = FirstGNN()
print(model)


### Model Learning

Once our architecture has been created, we need to tune its parameters to fit a particular task, here our classification task. As with a classic MLP or CNN, each iteration of the training loop follows the same steps:
 1. Forward pass
 1. Loss computation
 1. Backward pass (gradient computation)
 1. Parameter update
 1. Reset of the gradients.
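The order of operations in such a loop can be sketched on a toy linear classifier with random data. Everything prefixed with `toy_` is hypothetical and unrelated to `FirstGNN`; the point is only the sequence of calls.

```python
import torch

torch.manual_seed(0)

# Toy data and model, used only to show the order of the steps.
inputs = torch.randn(16, 4)
targets = torch.randint(0, 2, (16,))
toy_model = torch.nn.Linear(4, 2)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)
toy_criterion = torch.nn.CrossEntropyLoss()

toy_losses = []
for step in range(20):
    toy_optimizer.zero_grad()                 # reset gradients from the previous step
    logits = toy_model(inputs)                # forward pass
    loss = toy_criterion(logits, targets)     # loss computation
    loss.backward()                           # backward pass: compute the gradients
    toy_optimizer.step()                      # parameter update
    toy_losses.append(loss.item())

print(toy_losses[0], toy_losses[-1])
```

Resetting the gradients at the start or at the end of an iteration is equivalent, as long as it happens once between two calls to `backward()`.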

Complete the following code to learn the parameters of our `FirstGNN`.
For each epoch, compute the loss and the accuracy performance on train set. 



In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

NB_EPOCHS = ...

for epoch in tqdm(range(1, NB_EPOCHS+1)):
    model.train()
    for data in train_loader:  # for each mini-batch
        # Forward pass. What does the GNN need to perform the forward pass?
        out = model(data.x, data.edge_index, data.batch)
        loss = criterion(out, data.y)  # compute the loss
        loss.backward()  # backward pass: compute the gradients

        optimizer.step()  # update the parameters
        optimizer.zero_grad()  # reset the gradients for the next iteration

    model.eval()
    # accuracy on the train set
    for data in train_loader:
        out = model(data.x, data.edge_index, data.batch)
        pred = out.argmax(dim=1)  # predicted class: the one with the highest score
    accuracy = ...
    


### Learning curves

To make sure that everything went well, plot the learning curves: the loss and the accuracy on the train set as a function of the epoch.

Do the curves match your expectations?
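A minimal plotting sketch is given below. The two lists contain placeholder numbers chosen only so that the code runs; they are not results from this notebook and should be replaced by the values recorded at each epoch of your own training loop.

```python
import matplotlib.pyplot as plt

# Placeholder values for illustration only; replace with values recorded during training.
example_losses = [0.70, 0.62, 0.55, 0.50]
example_accuracies = [0.55, 0.66, 0.71, 0.74]
epochs = range(1, len(example_losses) + 1)

# One panel per metric, sharing the same x axis meaning (epoch number).
fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
ax_loss.plot(epochs, example_losses)
ax_loss.set_xlabel("epoch")
ax_loss.set_ylabel("train loss")
ax_acc.plot(epochs, example_accuracies)
ax_acc.set_xlabel("epoch")
ax_acc.set_ylabel("train accuracy")
fig.tight_layout()
```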


### Prediction

So far, we have only measured performance on data whose labels the model has already seen, which says little about how well it generalises. Let's evaluate our model on the test set.

Compute the accuracy on the test set and compare it with the accuracy on the train set.

What do you think of the value obtained?
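When accuracy is computed over several mini-batches, one safe approach is to count correct predictions across all batches and divide once at the end, since averaging per-batch accuracies gives too much weight to a smaller last batch. The sketch below uses hypothetical logits and labels rather than a real model.

```python
import torch

# Hypothetical model outputs (logits) and labels for two mini-batches of unequal size.
batches = [
    (torch.tensor([[2.0, 0.1], [0.3, 1.5], [0.2, 0.9]]), torch.tensor([0, 1, 0])),
    (torch.tensor([[1.2, 0.4], [0.1, 0.8]]), torch.tensor([0, 1])),
]

correct = 0
total = 0
for logits, y in batches:
    pred = logits.argmax(dim=1)               # predicted class per graph
    correct += int((pred == y).sum())         # correct predictions in this batch
    total += y.size(0)

accuracy = correct / total
print(accuracy)
```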


In [None]:
model.eval()  # switches off training-only behaviour such as dropout

for data in test_loader:  
    ...
acc = ...
print('Accuracy: {:.4f}'.format(acc))

## Add a validation set

Let's now evaluate the performance on our test set (which hence becomes a validation set) for each epoch. 

1. Complete the function `test` to get the accuracy of a model on a given dataloader
2. Modify your training loop to compute the accuracy on both the training and validation sets
3. Draw the learning curves 
 

In [None]:
def test(model, loader):
    """Compute the accuracy on loader using model"""
    correct = ...  # number of correct predictions over the whole loader
    return correct / len(loader.dataset)  

## GNN Implementation

Now that we are able to train a predictive model as a black box, let's go into the details and analyze how this GNN is built.

### Layer Analysis

1. Open the file `graphadon.py` and analyze the contents of the `FirstGNN` class. Which layers can you identify?
> Note that you can rely heavily on the PyTorch Geometric documentation: [https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html)

2. Add a convolutional layer followed by a ReLU. 
> Congrats! You have implemented your first GNN, and you can now test it.
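To get an intuition of what a graph convolution followed by a ReLU computes, the sketch below applies the propagation rule of the GCN layer of Kipf and Welling in dense matrix form: add self-loops to the adjacency matrix, normalise it symmetrically by the node degrees, then aggregate, transform and apply ReLU. Whether `FirstGNN` uses this exact layer is an assumption to check in `graphadon.py`; PyTorch Geometric layers also work on a sparse `edge_index` rather than a dense matrix.

```python
import torch

# Toy undirected path graph 0 - 1 - 2, as a dense adjacency matrix.
A = torch.tensor([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
H = torch.eye(3)                      # one-hot node features
torch.manual_seed(0)
W = torch.randn(3, 4)                 # weight matrix (learned in practice)

A_hat = A + torch.eye(3)              # add self-loops
deg = A_hat.sum(dim=1)                # node degrees including self-loops
D_inv_sqrt = torch.diag(deg.pow(-0.5))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # symmetric normalisation

H_next = torch.relu(A_norm @ H @ W)   # aggregate neighbours, transform, apply ReLU
print(H_next.shape)
```

Stacking a second layer of this kind lets each node receive information from nodes two hops away.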




### So many convolutions 

Change the convolutional layers to `GraphConv` and test your new model.

You can also pick among the many layers available: https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#convolutional-layers



### Readout/Pooling

In the first GNN we implemented, we inserted a `global_mean_pool` layer.

1. What is the rationale behind this function? What happens if we remove it?
2. Replace this function with another readout strategy and test the new model.
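A mean readout turns a variable number of node embeddings into one fixed-size vector per graph. The sketch below computes it by hand from a `batch` vector using plain PyTorch (`index_add_` and `bincount`); it mimics the idea of a mean readout on illustrative values and is not the library code.

```python
import torch

# Node embeddings for 5 nodes spread over 2 graphs (illustrative values).
h = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0],
                  [0.0, 2.0],
                  [2.0, 4.0]])
batch = torch.tensor([0, 0, 0, 1, 1])     # graph index of each node
num_graphs = int(batch.max()) + 1

# Sum node embeddings per graph, then divide by the number of nodes in each graph.
sums = torch.zeros(num_graphs, h.size(1)).index_add_(0, batch, h)
counts = torch.bincount(batch, minlength=num_graphs).unsqueeze(1)
graph_embeddings = sums / counts
print(graph_embeddings.tolist())
```

Skipping the division gives a sum readout, which, unlike the mean, keeps information about the size of each graph.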



### Hyperparameters

GNN architectures open up many new possibilities, but they come with a significant drawback: tuning hyperparameters.

1. Identify the hyperparameters for the different layers, the model itself and the optimization algorithm.

2. Propose a strategy to find the best combination 
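One simple strategy, among others such as random search, is an exhaustive grid search scored on the validation set. The sketch below enumerates every combination with `itertools.product`. The search space and the `evaluate` function are hypothetical: `evaluate` returns a dummy score so that the code runs on its own, and would have to be replaced by a function that trains a model and returns its validation accuracy.

```python
from itertools import product

# Hypothetical search space; the names and values are examples only.
search_space = {
    "hidden_channels": [16, 32, 64],
    "lr": [0.01, 0.001],
    "batch_size": [32, 64],
}

def evaluate(config):
    """Placeholder: train a model with `config` and return its validation accuracy.

    Here it returns a dummy score so that the sketch runs on its own.
    """
    return -abs(config["hidden_channels"] - 32) - abs(config["lr"] - 0.01)

# Build one dictionary per combination of hyperparameter values.
keys = list(search_space)
configs = [dict(zip(keys, values)) for values in product(*search_space.values())]

best_config = max(configs, key=evaluate)
print(len(configs), best_config)
```

The number of combinations grows multiplicatively with each added hyperparameter, so the grid should stay small, and the test set should not be used to pick the winner.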



## Competition 

Now that you are a GNN expert, let's compete. We created a private leaderboard on Kaggle to test your skills on another dataset of molecular compounds: [Kaggle Competition](https://www.kaggle.com/competitions/graphadon-contest)

Here is your invitation link: https://www.kaggle.com/t/aa069d65592d4d15ba457898007e7540

It is still a binary classification task, but this time you only have access to the graphs of the test set, not to the labels you have to predict!

To compete, you will need to train a model using the `train_dataset` provided in the `train_dataset.pt` file. The submission file for the Kaggle competition can be obtained using the `generate_pred_for_kaggle` and `generate_kaggle_file` functions provided in the `graphadon.py` file. Use the example based on `FirstGNN` below as a source of inspiration.

The train dataset is composed of 2168 molecules, where each node is encoded by 14 binary values corresponding to a one-hot encoding of the atom type. The test set is composed of 2169 molecules encoded in the same way as the train set, but without the `y` value.
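Since the test labels are hidden, one way to estimate performance locally is to keep part of the training data aside as a validation subset. The sketch below shows a reproducible 80/20 split on a stand-in list. It assumes the object loaded from `train_dataset.pt` can be indexed like a list, which should be checked; `items`, `fit_items` and `val_items` are hypothetical names.

```python
import random

# Stand-in for the list of training graphs; any indexable collection works the same way.
items = list(range(100))

rng = random.Random(42)               # fixed seed for a reproducible split
indices = list(range(len(items)))
rng.shuffle(indices)

n_val = int(0.2 * len(items))         # keep 20% aside for validation
val_items = [items[i] for i in indices[:n_val]]
fit_items = [items[i] for i in indices[n_val:]]
print(len(fit_items), len(val_items))
```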

Do your best!

In [None]:
len(test_dataset)

In [None]:
# Load the data

train_dataset = torch.load("./train_dataset.pt")
test_dataset = torch.load("./test_dataset.pt")

In [None]:

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)


In [None]:
from graphadon import FirstGNN
from graphadon import learn_and_perf

model = FirstGNN()

acc_train, acc_test, losses  = learn_and_perf(model, train_loader, None, nb_epochs=100)

plt.plot(acc_train, label="train")
plt.legend()

In [None]:
from graphadon import generate_pred_for_kaggle, generate_kaggle_file
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
preds = generate_pred_for_kaggle(model, test_loader)
generate_kaggle_file(preds, "./kaggle.csv")
