# Config File Structure

In [1]:
import json
import keras as ks

The hyper-parameter dictionary used for model tuning, fitting, and data structure has the following form:
```python3
hyper = {
    "info":{ 
        # General information for training run
        "kgcnn_version": "4.0.0", # Version 
        "postfix": "" # postfix for output folder
    },
    "model": { 
        # Model specific parameter, see kgcnn.literature
    },
    "data": { 
        # Dataset specific parameter
    },
    "training": {
        "fit": { 
            # Keras fit arguments serialized
        },
        "compile": { 
            # Keras compile arguments serialized
        }
    }
}
```
The following sections explain each block.


In [2]:
hyper = {}

## Model

The model parameters can be reviewed from the default values in ``kgcnn.literature``. In most cases, only the model input and output have to be matched to the data representation, that is, the type of each input and its shape. An input-type checker is available as `assert_valid_model_input` of `kgcnn.data.base.MemoryGraphDataset`. ``inputs`` must be a list of kwargs, each of which is unpacked into the corresponding ``ks.layers.Input``. The order matters and is model dependent.

Moreover, the name of each model input is used to link the tensor properties of the dataset with the model input.

In [3]:
hyper.update({
    "model":{
        "module_name": None, 
        "class_name": "make_model",
        "config":{
            "inputs": [
                {"shape": [None, 100], "name": "node_attributes", "dtype": "float32"},
                {"shape": [None, 2], "name": "edge_indices", "dtype": "int64"},
                {"shape": (), "name": "total_nodes", "dtype": "int64"},
                {"shape": (), "name": "total_edges", "dtype": "int64"}
            ],
            # More model specific kwargs, like:
            "depth": 5,
            # Output part defining model output
            "output_embedding": "graph",
            "output_mlp": {"use_bias": [True, True, False], "units": [140, 70, 70],
                           "activation": ["relu", "relu", "softmax"]}
        }
    }
})

Here, the training script should provide ``dataset.tensor({"name": "edge_indices"})`` of shape `(batch, None, 2)` and ``dataset.tensor({"name": "node_attributes"})`` of shape `(batch, None, 100)` from the dataset. Note that the shape given in ``inputs`` must match the actual shape in the dataset. The same holds for `total_nodes` and `total_edges`.
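The shape requirement can be checked before a training run. The following is a minimal sketch in plain Python and not part of kgcnn: `shape_matches` is a hypothetical helper that compares the per-sample shape of a graph property (without the batch dimension) against the `shape` entry of an input config, treating `None` as a variable-size dimension. The sample shapes are made up for illustration.

```python
def shape_matches(config_shape, sample_shape):
    """Return True if a per-sample shape is compatible with a config shape.

    `None` in `config_shape` stands for a variable-size dimension.
    """
    config_shape, sample_shape = tuple(config_shape), tuple(sample_shape)
    if len(config_shape) != len(sample_shape):
        return False
    return all(c is None or c == s for c, s in zip(config_shape, sample_shape))


inputs = [
    {"shape": [None, 100], "name": "node_attributes", "dtype": "float32"},
    {"shape": [None, 2], "name": "edge_indices", "dtype": "int64"},
    {"shape": (), "name": "total_nodes", "dtype": "int64"},
]

# Hypothetical shapes of a single graph with 12 nodes and 26 edges.
sample_shapes = {
    "node_attributes": (12, 100),
    "edge_indices": (26, 2),
    "total_nodes": (),
}

# Collect the names of all inputs whose shape does not fit the data.
mismatched = [
    cfg["name"] for cfg in inputs
    if not shape_matches(cfg["shape"], sample_shapes[cfg["name"]])
]
```

An empty `mismatched` list means every configured input is compatible with the sample. For a real dataset, `assert_valid_model_input` mentioned above is the intended check.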

For the output, ideally all models simply have an MLP at the output, so that the activation and the final output dimension can be chosen through the last entries of ``units`` and ``activation`` in the kwargs ``output_mlp`` (unpacked in the MLP). The last number in ``units`` must match the number of labels or classes of the target. This is mostly ``dataset.tensor({"name": "graph_labels"})``, but it depends on the dataset and on the task, either graph or node classification.
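A quick consistency check of ``output_mlp`` can catch a mismatch before the model is built. This is a sketch under assumptions: `check_output_mlp` is a hypothetical helper, not a kgcnn function, and the target dimension of 70 is chosen only to agree with the example config above.

```python
def check_output_mlp(output_mlp, num_targets):
    """Validate the per-layer lists and the final layer width of `output_mlp`."""
    # Every list-valued entry describes one value per MLP layer.
    lengths = {
        key: len(value)
        for key, value in output_mlp.items()
        if isinstance(value, (list, tuple))
    }
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Per-layer lists differ in length: {lengths}")
    # The last layer must produce one value per label or class.
    last_units = output_mlp["units"][-1]
    if last_units != num_targets:
        raise ValueError(f"Last layer has {last_units} units, expected {num_targets}.")
    return True


output_mlp = {
    "use_bias": [True, True, False],
    "units": [140, 70, 70],
    "activation": ["relu", "relu", "softmax"],
}
ok = check_output_mlp(output_mlp, num_targets=70)  # 70 classes is an assumption
```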

## Data

The kwargs for the dataset are not fully identical and vary a little depending on the dataset. However, the most common ones are listed below.



In [4]:
hyper.update({
    "data":{      
        # Other optional entries (depends on the training script)
        "data_unit": "mol/L",
    },
    "dataset": {
        "class_name": "QM9Dataset", # Name of the dataset
        "module_name": "kgcnn.data.datasets.QM9Dataset",
        
        # Config like filepath etc., leave empty for pre-defined datasets
        "config": {}, 
        
        # Methods to run on dataset, i.e. the list of graphs
        "methods": [
            {"prepare_data": {}}, # Used for cache and pre-compute data, leave out for pre-defined datasets
            {"read_in_memory": {}}, # Used for reading into memory, leave out for pre-defined datasets
            
            # Example method to run over each graph in the list using `map_list` method.
            {"map_list": {"method": "set_range", "max_distance": 4, "max_neighbours": 30}},
            {"map_list": {"method": "count_nodes_and_edges", "total_edges": "total_edges",
                          "count_edges": "edge_indices", "count_nodes": "node_attributes", "total_nodes": "total_nodes"}},
        ]
    },
})
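To illustrate how a ``methods`` list of this form could be processed, the sketch below walks over the entries and calls each named method with its kwargs. It is an assumption about what a training script might do, not the kgcnn implementation: `ToyDataset` and `apply_methods` are hypothetical stand-ins that only record the calls they receive.

```python
class ToyDataset:
    """Stand-in for a dataset class that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def prepare_data(self, **kwargs):
        self.calls.append(("prepare_data", kwargs))

    def read_in_memory(self, **kwargs):
        self.calls.append(("read_in_memory", kwargs))

    def map_list(self, method, **kwargs):
        # `method` names the per-graph operation; the rest are its arguments.
        self.calls.append(("map_list:" + method, kwargs))


def apply_methods(dataset, methods):
    """Call every configured method on the dataset, in list order."""
    for entry in methods:
        for method_name, kwargs in entry.items():
            getattr(dataset, method_name)(**kwargs)
    return dataset


methods = [
    {"prepare_data": {}},
    {"read_in_memory": {}},
    {"map_list": {"method": "set_range", "max_distance": 4, "max_neighbours": 30}},
]
dataset = apply_methods(ToyDataset(), methods)
called = [name for name, _ in dataset.calls]
```

Because the entries form a list of single-key dictionaries, the same method (here ``map_list``) can appear several times and the order of execution is preserved.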

## Training

The kwargs for training simply set the arguments for ``model.compile(**kwargs)`` and ``model.fit(**kwargs)``, matching the Keras arguments, as well as the arguments for the k-fold split from scikit-learn. The kwargs are expected to be fully serialized if the hyper-parameters are supposed to be saved to JSON.

In [5]:
hyper.update({
    "training":{
        # Cross-validation of the data
        "cross_validation": {
            "class_name": "KFold",
            "config": {"n_splits": 5, "random_state": 42, "shuffle": True}
        },
        # Standard scaler for regression targets
        "scaler": {
            "class_name": "StandardScaler",
            "module_name": "kgcnn.data.transform.scaler.standard",
            "config": {"with_std": True, "with_mean": True, "copy": True}
        },
        # Keras model compile and fit
        "compile": {
            "loss": "categorical_crossentropy",
            "optimizer": ks.saving.serialize_keras_object(
                ks.optimizers.Adam(learning_rate=0.001))
        },
        "fit": {
            "batch_size": 32, "epochs": 800, "verbose": 2, 
            "callbacks": []
        }
    }
})
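Whether a block is fully serialized can be tested with a JSON round trip. The sketch below uses only the standard `json` module on a reduced, illustrative copy of the dictionary (`hyper_part` is a name introduced here). It also shows two side effects of the round trip: a tuple such as the scalar shape `()` comes back as a list, and a non-serialized Python object is rejected.

```python
import json

# Reduced copy of the hyper dictionary, for illustration only.
hyper_part = {
    "model": {"config": {"inputs": [
        {"shape": (), "name": "total_nodes", "dtype": "int64"},
        {"shape": [None, 100], "name": "node_attributes", "dtype": "float32"},
    ]}},
    "training": {"fit": {"batch_size": 32, "epochs": 800, "verbose": 2, "callbacks": []}},
}

text = json.dumps(hyper_part, indent=2)
restored = json.loads(text)

# Tuples are written as JSON arrays and therefore come back as lists.
scalar_shape = restored["model"]["config"]["inputs"][0]["shape"]

# A plain Python object that was not serialized first cannot be dumped.
try:
    json.dumps({"optimizer": object()})
    dump_failed = False
except TypeError:
    dump_failed = True
```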

## Info

Some general information on the training, such as the kgcnn version used or a postfix for the output files.

In [6]:
hyper.update({
    "info":{ # General information
        "postfix": "_v1", # Appends _v1 to output folder
        "postfix_file": "_run2", # Appends _run2 to info files
        "kgcnn_version": "4.0.0"    
    }
})

# Summary

The final hyper dictionary, which can be passed to the training script:

In [7]:
hyper

{'model': {'module_name': None,
  'class_name': 'make_model',
  'config': {'inputs': [{'shape': [None, 100],
     'name': 'node_attributes',
     'dtype': 'float32'},
    {'shape': [None, 2], 'name': 'edge_indices', 'dtype': 'int64'},
    {'shape': (), 'name': 'total_nodes', 'dtype': 'int64'},
    {'shape': (), 'name': 'total_edges', 'dtype': 'int64'}],
   'depth': 5,
   'output_embedding': 'graph',
   'output_mlp': {'use_bias': [True, True, False],
    'units': [140, 70, 70],
    'activation': ['relu', 'relu', 'softmax']}}},
 'data': {'data_unit': 'mol/L'},
 'dataset': {'class_name': 'QM9Dataset',
  'module_name': 'kgcnn.data.datasets.QM9Dataset',
  'config': {},
  'methods': [{'prepare_data': {}},
   {'read_in_memory': {}},
   {'map_list': {'method': 'set_range',
     'max_distance': 4,
     'max_neighbours': 30}},
   {'map_list': {'method': 'count_nodes_and_edges',
     'total_edges': 'total_edges',
     'count_edges': 'edge_indices',
     'count_nodes': 'node_attributes',
     'tot