# MoleculeNet

The `MoleculeNetDataset` class is intended for datasets that consist of a table of SMILES strings and corresponding targets. It converts them into a tensor representation for graph networks and provides properties and methods for generating graph features from SMILES.
The typical input is a `csv` or `excel` file with SMILES and corresponding graph labels.

The graph structure matches the molecular graph, i.e. the chemical structure. Features for atoms and bonds are generated with the `RDKit` cheminformatics software.

The atomic coordinates are generated by a conformer guess. Since this requires some computation time, it is done only once, and the molecular coordinates (mol blocks) are stored in a single SDF file with the base name of the csv file.

For demonstration, we make an artificial table of SMILES and some values and store it in a file.

In [1]:
import os
os.makedirs("ExampleMol", exist_ok=True)
csv_data = "".join([
    "smiles,Values1,Values2\n",  # The header row is required
    "CCC, 1, 0.1\n",
    "CCCO, 2, 0.3\n",
    "CCCN, 3, 0.2\n",
    "CCCC=O, 4, 0.4\n",
    "NOCF, 4, 1.4\n",
])
with open("ExampleMol/data.csv", "w") as f:
    f.write(csv_data)
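
As a quick sanity check on the table layout, the same text can be parsed with Python's standard `csv` module. This is only a sketch of how the header and rows line up, not how `MoleculeNetDataset` reads the file internally; the `skipinitialspace` option is used here because the example rows have a blank after each comma.

```python
import csv
import io

# Same table as in the cell above; the header row names the columns.
csv_data = (
    "smiles,Values1,Values2\n"
    "CCC, 1, 0.1\n"
    "CCCO, 2, 0.3\n"
    "CCCN, 3, 0.2\n"
    "CCCC=O, 4, 0.4\n"
    "NOCF, 4, 1.4\n"
)

# skipinitialspace drops the blank that follows each comma in the data rows.
reader = csv.DictReader(io.StringIO(csv_data), skipinitialspace=True)
rows = list(reader)

print(reader.fieldnames)                 # column names taken from the header
print([row["smiles"] for row in rows])   # one SMILES string per data row
```

Without the header line, the first molecule would be consumed as column names, which is why the header is marked as required.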

The file structure is:


```bash
├── ExampleMol
    ├── data.csv
    └── data.sdf  # After prepare_data
```

In [2]:
from kgcnn.data.moleculenet import MoleculeNetDataset, OneHotEncoder

## 1. Initialization

To load the dataset, the ``MoleculeNetDataset`` class requires the directory that contains the data and the name of the csv file. Providing a name for the dataset is also recommended.

In [3]:
dts = MoleculeNetDataset(file_name="data.csv", 
                         data_directory="ExampleMol/", 
                         dataset_name="ExampleMol")

## 2. Data Preparation

Precompute the molecular structures, and optionally coordinates, and cache them as an SDF file (mol table format) in the same folder that was provided at class initialization. The structure generation can run in parallel, but the generated SDF file may be large and must still fit in memory.
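
An SDF file is plain text: a sequence of mol blocks, each record terminated by a line containing `$$$$`. The following sketch uses a hypothetical helper, `split_sdf_records`, which is not part of `kgcnn`, to split such text into records. The sample text uses placeholder strings instead of real mol blocks.

```python
def split_sdf_records(sdf_text):
    """Split SDF text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            # Record delimiter reached: store the collected lines.
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


# Placeholder content standing in for two real mol blocks.
sample = "mol block of CCC\n$$$$\nmol block of CCCO\n$$$$\n"
records = split_sdf_records(sample)
print(len(records))  # number of records found in the text
```

Once `prepare_data()` has run, applying the helper to the contents of `ExampleMol/data.sdf` gives a rough check of how many records were cached, assuming one record is written per table row.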

In [4]:
dts.prepare_data(
    overwrite=True, 
    smiles_column_name="smiles", 
    add_hydrogen=True,
    make_conformers=True,
    optimize_conformer=True,
    num_workers=None  # Default is #cpus
)

INFO:kgcnn.data.ExampleMol:Generating molecules and store ExampleMol/data.sdf to disk...
INFO:kgcnn.data.ExampleMol: ... converted molecules 5 from 5


<kgcnn.data.moleculenet.MoleculeNetDataset at 0x266817903d0>

## 3. Read Data

After ``prepare_data()`` is called, the cached mol-file can be read directly from the data directory.
The reading step can also define the labels or targets by assigning the property `graph_labels` from columns of the csv table. By default, a simple graph is generated without attributes.

In [5]:
dts.read_in_memory(
    label_column_name=["Values1", "Values2"], 
    add_hydrogen=False,  # Remove hydrogens
    has_conformers=True  # Keep the structure (coordinates)
)
print("Number of graphs:", len(dts))

INFO:kgcnn.data.ExampleMol:Read molecules from mol-file.


TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

## 4. Setting Attributes

For molecular graphs, nodes and edges (atoms and bonds) should have attributes. For `MoleculeNetDataset` these are generated from `RDKit` and include chirality, stereo information, etc. Note that if a SMILES string cannot be processed by `RDKit`, the graph will not have attributes.

This can be achieved by setting a list of identifiers of predefined attributes or by supplying custom functions.
Additionally, an encoder can be provided to cast or transform the `RDKit` data formats into a list or value that can eventually be cast into a numpy `dtype="float"` array. One-hot encoding or mapping onto distributions can also be handled by encoders.
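
To illustrate what an encoder does, here is a minimal stand-in for a one-hot encoder written in plain Python. It is a sketch only: `make_one_hot` is a hypothetical helper, not the `OneHotEncoder` from `kgcnn`, and its handling of unknown values (all zeros when `add_unknown=False`) is an assumption that may differ from the library's behaviour.

```python
def make_one_hot(categories, add_unknown=True):
    """Return a function mapping a value to a one-hot list over `categories`."""
    index = {c: i for i, c in enumerate(categories)}
    size = len(categories) + (1 if add_unknown else 0)

    def encode(value):
        vec = [0.0] * size
        if value in index:
            vec[index[value]] = 1.0
        elif add_unknown:
            vec[-1] = 1.0  # extra slot for values outside `categories`
        return vec

    return encode


encode_symbol = make_one_hot(["C", "N", "O"], add_unknown=False)
print(encode_symbol("N"))  # [0.0, 1.0, 0.0]
print(encode_symbol("F"))  # [0.0, 0.0, 0.0]
```

Any callable with this shape (one value in, a number or list of numbers out) plays the same role as the plain `int` cast used for `"BondType"` in the `set_attributes` call.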

In [None]:
# Class to make attributes used by MoleculeNetDataset
import rdkit.Chem as Chem
from kgcnn.mol.module_rdkit import MolecularGraphRDKit
mol = MolecularGraphRDKit()
# Identifiers:
print("Atoms:", list(mol.atom_fun_dict.keys()))
print("Bonds:", list(mol.bond_fun_dict.keys()))
print("Molecule:", list(mol.mol_fun_dict.keys()))

Custom functions must take an `RDKit` Atom, Bond, or Mol instance as input for node, edge, or graph, respectively.

In [None]:
# Or define a custom function; this one takes an RDKit Mol and returns its atom count
def mol_feature(m):
    return m.GetNumAtoms()

Alternatively, use a callback directly for a new attribute. As in the example, it takes the molecular graph (`mg`) and the corresponding entry of the csv table (`ds`) as arguments.

In [None]:
def graph_size_callback(mg, ds):
    return mg.mol.GetNumAtoms()

Or use a custom transform:

In [None]:
def custum_trafo(mg):
    return mg.compute_charge()

In [None]:
dts.set_attributes(
    # Nodes
    nodes=["Symbol", "TotalNumHs", "GasteigerCharge"], 
    encoder_nodes={
        "Symbol": OneHotEncoder(["C", "N", "O"], dtype="str", add_unknown=False)
    },
    # Edges
    edges=["BondType", "Stereo"], 
    encoder_edges={
        "BondType": int
    },
    # Graph-level
    graph=["ExactMolWt", mol_feature],
    additional_callbacks={"size": graph_size_callback},
    custom_transform=custum_trafo
)

## 5. Checking graphs in dataset

In [None]:
import networkx as nx

In [None]:
dts.obtain_property("node_number"), dts.obtain_property("node_symbol")

In [None]:
print(dts[3])

In [None]:
G = nx.Graph()
G.add_nodes_from([(i, {"atom": x}) for i, x in enumerate(dts.obtain_property("node_symbol")[3])])
G.add_edges_from(dts.obtain_property("edge_indices")[3])

In [None]:
labels = nx.get_node_attributes(G, 'atom') 
nx.draw(G,labels=labels)

In [None]:
Chem.MolFromSmiles("CCCC=O")

Check the output of the ``set_attributes`` method:

In [None]:
dts.obtain_property("node_attributes")

In [None]:
dts.obtain_property("edge_attributes")

In [None]:
dts.obtain_property("graph_attributes")

In [None]:
dts.obtain_property("graph_labels")

In [None]:
dts.save()

In [None]:
dts.load()

In [None]:
dts[0]