# TP Machine Learning on Networks

----------

## Preparatory Instructions

#### Packages
Make sure you have installed the following packages:
* pytorch 1.9.*
    - Please go to https://pytorch.org to generate the installation command for your setup
* pytorch-geometric
    - Please go to https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html to generate the installation command for your setup
* matplotlib

Additionally, make sure you have the Python files **utility_functions.py** and **neural_network_models.py** provided by the teacher.

-----

## TME Instructions

### Exercise 1: Node classification with Graph Neural Networks

#### 1.1: Importance of Graphs in Machine Learning

**1.1.1**. Load cora dataset
- Use num_train_per_class = 100
- Use num_val = 300
- Use num_test = 1000

**1.1.2**. Classification of cora dataset with classical neural network
- Create a MLP_1_hidden model (neural network black box). 
    - Use 50 hidden neurons.
    
- Test the performance of the model without training
- Train the model (do not use the validation set, play with the number of epochs)
- Test the performance of the model after training

**1.1.3**. Classification of cora dataset with graph neural network 
- Create a GCN_1_hidden model (neural network black box). 
    - Use 50 hidden neurons.
    
- Test the performance of the model without training
- Train the model (do not use the validation set, play with the number of epochs)
- Test the performance of the model after training
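
The steps of 1.1.2 and 1.1.3 follow the same pattern, so they can be sketched with one helper. This is only a sketch built from the signatures listed in the Appendix: the helper name `run_baseline` and its `uf` argument (the imported `utility_functions` module, passed in explicitly) are assumptions made for illustration, not part of the provided files.

```python
def run_baseline(uf, kind, num_epochs, hidden=50):
    """Report test accuracy on Cora before and after training.

    uf   : the imported utility_functions module (hypothetical argument)
    kind : "MLP_1_hidden" or "GCN_1_hidden"
    """
    dataset = uf.load_cora_dataset(num_train_per_class=100, num_val=300, num_test=1000)
    data = dataset[0]  # the Cora container holds a single graph object

    # Pick the get/train/test trio that matches the requested model kind
    get_model = getattr(uf, f"get_{kind}")
    train = getattr(uf, f"train_{kind}")
    test = getattr(uf, f"test_{kind}")

    model = get_model(dataset.num_features, hidden, dataset.num_classes)
    acc_before = test(model, data)                          # untrained model
    model = train(model, data, num_epochs, use_val=False)   # no validation set here
    acc_after = test(model, data)
    return acc_before, acc_after
```

With the teacher's file on the path, `import utility_functions as uf` followed by `run_baseline(uf, "MLP_1_hidden", num_epochs=200)` and `run_baseline(uf, "GCN_1_hidden", num_epochs=200)` would give the two pairs of accuracies to compare; the value 200 is an arbitrary starting point.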

#### 1.2: Preventing Overfitting

**1.2.1**. Repeat **1.1.2** but this time use the validation set during training

**1.2.2**. Repeat **1.1.3** but this time use the validation set during training
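
One possible way to organise 1.2.1 and 1.2.2 is to train a fresh model twice, once with `use_val=False` and once with `use_val=True`, and compare the test accuracies. The helper name `compare_use_val` and the `uf` argument (the imported `utility_functions` module) are assumptions for illustration. What `use_val=True` does internally is not specified in this sheet, so the sketch treats it as a black-box switch.

```python
def compare_use_val(uf, kind, num_epochs, hidden=50):
    """Return {use_val: test accuracy} for kind in {"MLP_1_hidden", "GCN_1_hidden"}.

    uf is the imported utility_functions module (hypothetical argument).
    """
    dataset = uf.load_cora_dataset(num_train_per_class=100, num_val=300, num_test=1000)
    data = dataset[0]

    get_model = getattr(uf, f"get_{kind}")
    train = getattr(uf, f"train_{kind}")
    test = getattr(uf, f"test_{kind}")

    results = {}
    for use_val in (False, True):
        # Build a fresh model each time so the second run does not start
        # from weights that were already trained in the first run.
        model = get_model(dataset.num_features, hidden, dataset.num_classes)
        model = train(model, data, num_epochs, use_val=use_val)
        results[use_val] = test(model, data)
    return results
```

Running this for a deliberately large number of epochs is one way to make any difference between the two settings visible.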

#### 1.3: Impact of training samples

**1.3.1**. Load non-trained MLP_1_hidden and GCN_1_hidden models

**1.3.2**. Train both models for different amounts of training samples (both models should be trained on the same data)
- Use num_val = 300
- Use num_test = 1000

**1.3.3**. Compare the performance of the models
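
The comparison in 1.3 can be sketched as a loop over training-set sizes in which the dataset is loaded once per size, so that both models see exactly the same split. The helper name `sweep_train_size`, the `uf` argument (the imported `utility_functions` module) and the default of 50 hidden neurons are assumptions for illustration.

```python
def sweep_train_size(uf, sizes, num_epochs, hidden=50):
    """Return a list of (num_train_per_class, mlp_accuracy, gcn_accuracy) rows.

    uf is the imported utility_functions module (hypothetical argument).
    """
    rows = []
    for n in sizes:
        # One load per size: the MLP and the GCN share the same graph object
        dataset = uf.load_cora_dataset(num_train_per_class=n, num_val=300, num_test=1000)
        data = dataset[0]

        # Fresh, untrained models for every training-set size
        mlp = uf.get_MLP_1_hidden(dataset.num_features, hidden, dataset.num_classes)
        gcn = uf.get_GCN_1_hidden(dataset.num_features, hidden, dataset.num_classes)

        mlp = uf.train_MLP_1_hidden(mlp, data, num_epochs)
        gcn = uf.train_GCN_1_hidden(gcn, data, num_epochs)

        rows.append((n, uf.test_MLP_1_hidden(mlp, data), uf.test_GCN_1_hidden(gcn, data)))
    return rows
```

A call such as `sweep_train_size(uf, [5, 10, 20, 50, 100], num_epochs=200)` returns rows that can be printed as a table or plotted with matplotlib; the sizes and the epoch count are arbitrary choices.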

#### 1.4: Impact of network parameters

**1.4.1**. Load cora dataset
- Use num_train_per_class = 20
- Use num_val = 300
- Use num_test = 1000

**1.4.2**. Load a non-trained GCN_1_hidden model

**1.4.3**. Train and evaluate the GCN_1_hidden model for different numbers of hidden neurons

**1.4.4**. Load a non-trained GCN_2_hidden model

**1.4.5**. Train and evaluate the GCN_2_hidden model for different numbers of hidden neurons in both hidden layers
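
The two sweeps of 1.4 can be sketched as follows, with a one-dimensional loop for GCN_1_hidden and a grid over both layer sizes for GCN_2_hidden. The helper names `sweep_hidden_1` and `sweep_hidden_2` and the `uf` argument (the imported `utility_functions` module) are assumptions for illustration; `dataset` is the container returned by `load_cora_dataset`.

```python
from itertools import product


def sweep_hidden_1(uf, dataset, hidden_sizes, num_epochs):
    """Return {hidden_size: test accuracy} for GCN_1_hidden."""
    data = dataset[0]
    results = {}
    for h in hidden_sizes:
        # A new untrained model for every hidden size
        model = uf.get_GCN_1_hidden(dataset.num_features, h, dataset.num_classes)
        model = uf.train_GCN_1_hidden(model, data, num_epochs)
        results[h] = uf.test_GCN_1_hidden(model, data)
    return results


def sweep_hidden_2(uf, dataset, hidden_sizes, num_epochs):
    """Return {(hid_1, hid_2): test accuracy} for GCN_2_hidden."""
    data = dataset[0]
    results = {}
    # product(..., repeat=2) enumerates every (hid_1, hid_2) combination
    for h1, h2 in product(hidden_sizes, repeat=2):
        model = uf.get_GCN_2_hidden(dataset.num_features, h1, h2, dataset.num_classes)
        model = uf.train_GCN_2_hidden(model, data, num_epochs)
        results[(h1, h2)] = uf.test_GCN_2_hidden(model, data)
    return results
```

The grid grows quadratically with the number of candidate sizes, so a short list such as `[8, 16, 32, 64]` (an arbitrary choice) keeps the run time manageable.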

#### Questions:
* Describe the differences between the multi-layer perceptron and the graph convolutional neural network.
* Describe the issue of overfitting and how the validation set can be used to mitigate it.
* Describe the effects of varying the number of training samples and the reasons behind them.
* Describe the effect of increasing or decreasing the number of hidden layers and the number of neurons in them.

### Exercise 2: Graph classification with Graph Neural Networks

#### 2.1: The learning rate and batch sizes

**2.1.1**. Load proteins dataset
- Use num_train_per_class = 150
- Use num_val = 300
- Use num_test = 500
- Use batch_size = 32

**2.1.2**. Create a GCN_graph_classif_model model

**2.1.3**. Train and evaluate the model for different values of the learning rate

**2.1.4**. Train and evaluate the model for different values of the batch_size (batch_size cannot be larger than num_train_per_class)

**2.1.5**. Train and evaluate the model by decreasing num_train_per_class
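
The sweeps of 2.1 can be sketched as below. Since `batch_size` is an argument of `load_protein_dataset`, the dataset is reloaded for every batch size, whereas the learning rate is passed to the training function. The helper names, the `uf` argument (the imported `utility_functions` module) and the hidden sizes `(64, 64, 64)` are assumptions for illustration.

```python
def sweep_learning_rate(uf, learning_rates, num_epochs, hidden=(64, 64, 64)):
    """Return {lr: test accuracy} on the proteins dataset with batch_size = 32."""
    dataset = uf.load_protein_dataset(
        num_train_per_class=150, num_val=300, num_test=500, batch_size=32
    )
    results = {}
    for lr in learning_rates:
        # Fresh model per learning rate so the runs are comparable
        model = uf.get_GCN_graph_classif(dataset.num_features, *hidden, dataset.num_classes)
        model = uf.train_GCN_graph_classif(model, dataset, num_epochs, lr=lr)
        results[lr] = uf.test_GCN_graph_classif(model, dataset)
    return results


def sweep_batch_size(uf, batch_sizes, num_epochs, lr=0.01,
                     num_train_per_class=150, hidden=(64, 64, 64)):
    """Return {batch_size: test accuracy}; the dataset is reloaded per batch size."""
    results = {}
    for bs in batch_sizes:
        if bs > num_train_per_class:
            # Constraint stated in step 2.1.4
            raise ValueError("batch_size cannot be larger than num_train_per_class")
        dataset = uf.load_protein_dataset(
            num_train_per_class=num_train_per_class, num_val=300, num_test=500, batch_size=bs
        )
        model = uf.get_GCN_graph_classif(dataset.num_features, *hidden, dataset.num_classes)
        model = uf.train_GCN_graph_classif(model, dataset, num_epochs, lr=lr)
        results[bs] = uf.test_GCN_graph_classif(model, dataset)
    return results
```

Calling `sweep_batch_size` again with a smaller `num_train_per_class` is one way to approach the last step of this part, keeping the other settings fixed.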

#### Questions:
* Describe the role of the learning rate.
* Describe the role of the batch size.
* Do the best values of learning rate and batch size depend on the size of the training dataset? If they do, 

#### 2.2: Proposing your own model

**2.2.1**. Open the **neural_network_models.py** file and propose a modification of the **GCN_graph_classif_model** model (by adding or removing layers) that either improves prediction performance or makes training easier while maintaining performance (with respect to your results from exercise 2.1).



-----

## Appendix

The following methods are all contained in the file **utility_functions.py**

### Loading datasets

**load_cora_dataset(num_train_per_class, num_val, num_test, seed=0)**

Input:
- num_train_per_class: number of elements taken for each class for training
- num_val: number of elements assigned to validation set
- num_test: number of elements assigned to the test set

Output:
- Dataset container (it has 1 graph object)

**load_protein_dataset(num_train_per_class, num_val, num_test, batch_size, seed=0)**

Input:
- num_train_per_class: number of elements taken for each class for training
- num_val: number of elements assigned to validation set
- num_test: number of elements assigned to the test set
- batch_size: number of graphs per mini-batch

Output:
- Dataset container (it has 1113 graph objects)

**Dataset container**

This is a container that provides access to all the graph objects associated with this dataset. Each graph object has all the necessary edge-lists, features and splits required by the neural network models.

The following Dataset attributes and accessors will be useful to you:

* Dataset.num_features : number of features associated with each vertex
* Dataset.num_classes : number of classes to predict
* Dataset[ n ] : access to the n-th graph object
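
As a small illustration of these three accessors, the sketch below collects them into a dictionary. The helper name `summarize` is an assumption for illustration, and nothing beyond the three accessors listed above is assumed about the container.

```python
def summarize(dataset, n=0):
    """Collect the basic quantities needed to size a model for this dataset."""
    graph = dataset[n]  # n-th graph object of the container
    return {
        "num_features": dataset.num_features,  # size of the input layer
        "num_classes": dataset.num_classes,    # size of the output layer
        "graph_type": type(graph).__name__,    # class name of the graph object
    }
```

The first two values are the ones to pass as `in_dimension` and `out_dimension` when creating a model.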

### Creating, training and testing a multi-layer perceptron with 1 hidden layer

**get_MLP_1_hidden(in_dimension, hid_dimension, out_dimension)**

Input:
- in_dimension: Number of neurons in the input layer
- hid_dimension: Number of neurons in the hidden layer
- out_dimension: Number of neurons in the output layer

Output:
- Untrained MLP_1_hidden neural network (black-box) model

**train_MLP_1_hidden(model, data, num_epochs, lr=0.01, use_val=False)**

Input:
- model: Untrained MLP_1_hidden model
- data: Graph object 
- num_epochs: Number of iterations during training phase
- lr (optional, default=0.01): Learning rate
- use_val (optional, default=False): Use the validation set during training 

Output:
- Trained MLP_1_hidden model

**test_MLP_1_hidden(model, data)**

Input:
- model: MLP_1_hidden model
- data: Graph object 

Output:
- Accuracy of the model on the test set

### Creating, training and testing a graph convolutional network for node classification with 1 hidden layer

**get_GCN_1_hidden(in_dimension, hid_dimension, out_dimension)**

Input:
- in_dimension: Number of neurons in the input layer
- hid_dimension: Number of neurons in the hidden layer
- out_dimension: Number of neurons in the output layer

Output:
- Untrained GCN_1_hidden neural network (black-box) model

**train_GCN_1_hidden(model, data, num_epochs, lr=0.01, use_val=False)**

Input:
- model: Untrained GCN_1_hidden model
- data: Graph object 
- num_epochs: Number of iterations during training phase
- lr (optional, default=0.01): Learning rate
- use_val (optional, default=False): Use the validation set during training 

Output:
- Trained GCN_1_hidden model

**test_GCN_1_hidden(model, data)**

Input:
- model: GCN_1_hidden model
- data: Graph object 

Output:
- Accuracy of the model on the test set

### Creating, training and testing a graph convolutional network for node classification with 2 hidden layers

**get_GCN_2_hidden(in_dimension, hid_1_dimension, hid_2_dimension, out_dimension)**

Input:
- in_dimension: Number of neurons in the input layer
- hid_1_dimension: Number of neurons in first hidden layer
- hid_2_dimension: Number of neurons in second hidden layer
- out_dimension: Number of neurons in the output layer

Output:
- Untrained GCN_2_hidden neural network (black-box) model

**train_GCN_2_hidden(model, data, num_epochs, lr=0.01, use_val=False)**

Input:
- model: Untrained GCN_2_hidden model
- data: Graph object 
- num_epochs: Number of iterations during training phase
- lr (optional, default=0.01): Learning rate
- use_val (optional, default=False): Use the validation set during training 

Output:
- Trained GCN_2_hidden model

**test_GCN_2_hidden(model, data)**

Input:
- model: GCN_2_hidden model
- data: Graph object 

Output:
- Accuracy of the model on the test set

### Creating, training and testing a graph convolutional network for graph classification

**get_GCN_graph_classif(in_dimension, hid_1_dimension, hid_2_dimension, hid_3_dimension, out_dimension)**

Input:
- in_dimension: Number of neurons in the input layer
- hid_1_dimension: Number of neurons in first hidden layer
- hid_2_dimension: Number of neurons in second hidden layer
- hid_3_dimension: Number of neurons in third hidden layer
- out_dimension: Number of neurons in the output layer

Output:
- Untrained GCN_graph_classif neural network (black-box) model

**train_GCN_graph_classif(model, dataset, num_epochs, lr=0.01, use_val=False)**

Input:
- model: Untrained GCN_graph_classif model
- dataset: Dataset container 
- num_epochs: Number of iterations during training phase
- lr (optional, default=0.01): Learning rate
- use_val (optional, default=False): Use the validation set during training 

Output:
- Trained GCN_graph_classif model

**test_GCN_graph_classif(model, dataset)**

Input:
- model: GCN_graph_classif model
- dataset: Dataset container 

Output:
- Accuracy of the model on the test set