#  Working With Splitters

When using machine learning, you typically divide your data into training, validation, and test sets.  The MoleculeNet loaders do this automatically.  But how should you divide up the data?  This question seems simple at first, but it turns out to be quite complicated.  There are many ways of splitting up data, and which one you choose can have a big impact on the reliability of your results.  This tutorial introduces some of the splitting methods provided by DeepChem.

## Colab

This tutorial and the rest in this sequence can be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/Working_With_Splitters.ipynb)



In [1]:
!pip install --pre deepchem
import deepchem
deepchem.__version__




'2.8.1.dev'

## Splitters

In DeepChem, a method of splitting samples into multiple datasets is defined by a `Splitter` object.  Choosing an appropriate method for your data is very important.  Otherwise, your trained model may seem to work much better than it really does.

Consider a typical drug development pipeline.  You might begin by screening many thousands of molecules to see if they bind to your target of interest.  Once you find one that seems to work, you try to optimize it by testing thousands of minor variations on it, looking for one that binds more strongly.  Then perhaps you test it in animals and find it has unacceptable toxicity, so you try more variations to fix the problems.

This has an important consequence for chemical datasets: they often include lots of molecules that are very similar to each other.  If you split the data into training and test sets in a naive way, the training set will include many molecules that are very similar to the ones in the test set, even if they are not exactly identical.  As a result, the model may do very well on the test set, but then fail badly when you try to use it on other data that is less similar to the training data.

Let's take a look at a few of the splitters found in DeepChem.

- General splitters
  - `RandomSplitter`
  - `RandomGroupSplitter`
  - `RandomStratifiedSplitter`
  - `SingletaskStratifiedSplitter`
  - `IndexSplitter`
  - `SpecifiedSplitter`
  - `TaskSplitter`
- Molecular splitters
  - `ScaffoldSplitter`
  - `MolecularWeightSplitter`
  - `MaxMinSplitter`
  - `ButinaSplitter`
  - `FingerprintSplitter`


Let's look at how several of these splitters work.




### RandomSplitter

This is one of the simplest splitters.  It just selects samples for the training, validation, and test sets in a completely random way.

Didn't we just say that's a bad idea?  Well, it depends on your data.  If every sample is truly independent of every other, then this is just as good a way as any to split the data.  There is no universally best choice of splitter.  It all depends on your particular dataset, and for some datasets this is a fine choice.
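The idea behind a random split can be sketched in a few lines of plain Python. This is a minimal illustration of the technique, not DeepChem's implementation; the function name `random_split` and its parameters are assumptions made for this example.

```python
import random


def random_split(n_samples, frac_train=0.8, frac_valid=0.1, seed=0):
    """Shuffle sample indices and cut them into train/valid/test lists.

    Hypothetical helper for illustration; not part of DeepChem.
    """
    indices = list(range(n_samples))
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(indices)
    train_end = int(frac_train * n_samples)
    valid_end = int((frac_train + frac_valid) * n_samples)
    return indices[:train_end], indices[train_end:valid_end], indices[valid_end:]


train, valid, test = random_split(100)
print(len(train), len(valid), len(test))
```

Every sample has the same chance of landing in any of the three sets, which is exactly why near-duplicate molecules can end up on both sides of the split.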

### RandomStratifiedSplitter

Some datasets are very unbalanced: only a tiny fraction of all samples are positive.  In that case, random splitting may sometimes lead to the validation or test set having few or even no positive samples for some tasks.  That makes it impossible to evaluate performance on those tasks.

`RandomStratifiedSplitter` addresses this by dividing up the positive and negative samples proportionally.  If you ask for an 80/10/10 split, the validation and test sets will contain not just 10% of samples, but also 10% of the positive samples for each task.
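To make the stratification idea concrete, here is a minimal single-task sketch in plain Python. It is not DeepChem's implementation (which handles multiple tasks at once); the helper `stratified_split` and the toy labels are assumptions for illustration.

```python
import random


def stratified_split(labels, frac_train=0.8, frac_valid=0.1, seed=0):
    """Split each class separately so every set keeps the class proportions.

    Hypothetical single-task helper for illustration; not part of DeepChem.
    """
    rng = random.Random(seed)
    train, valid, test = [], [], []
    for cls in sorted(set(labels)):
        # Collect and shuffle the indices belonging to this class only.
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        train_end = int(frac_train * len(members))
        valid_end = int((frac_train + frac_valid) * len(members))
        train += members[:train_end]
        valid += members[train_end:valid_end]
        test += members[valid_end:]
    return train, valid, test


# Toy unbalanced dataset: 20 positives and 180 negatives.
labels = [1] * 20 + [0] * 180
train, valid, test = stratified_split(labels)
positives = [sum(labels[i] for i in part) for part in (train, valid, test)]
print(positives)
```

With an 80/10/10 split, the 20 positives are divided 16/2/2, so neither the validation nor the test set is left without positive samples.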

### ScaffoldSplitter

This splitter tries to address the problem discussed above where many molecules are very similar to each other.  It identifies the scaffold that forms the core of each molecule, and ensures that all molecules with the same scaffold are put into the same dataset.  This is still not a perfect solution, since two molecules may have different scaffolds but be very similar in other ways, but it usually is a large improvement over random splitting.
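The grouping step can be sketched as follows. This example assumes the scaffold of each molecule has already been computed (in practice a cheminformatics toolkit such as RDKit would do that) and uses made-up scaffold labels. The function `group_split` is hypothetical, and the rule of assigning the largest groups first is one reasonable choice for this sketch rather than a statement of how DeepChem does it.

```python
from collections import defaultdict


def group_split(group_keys, frac_train=0.8, frac_valid=0.1):
    """Assign whole groups to train/valid/test, largest groups first.

    Hypothetical helper for illustration; not part of DeepChem.
    """
    groups = defaultdict(list)
    for index, key in enumerate(group_keys):
        groups[key].append(index)
    # Largest groups first; ties are broken by the first index in the group.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(group_keys)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for members in ordered:
        if len(train) + len(members) <= train_cutoff:
            train += members
        elif len(train) + len(valid) + len(members) <= valid_cutoff:
            valid += members
        else:
            test += members
    return train, valid, test


# Made-up scaffold labels standing in for computed scaffolds.
scaffolds = ["benzene"] * 5 + ["pyridine"] * 3 + ["indole"] + ["furan"]
train, valid, test = group_split(scaffolds)
print(train, valid, test)
```

Because groups are never broken up, no scaffold appears in more than one dataset. The price is that the actual split fractions only approximate the requested ones.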

### ButinaSplitter

This is another splitter that tries to address the problem of similar molecules.  It clusters them based on their molecular fingerprints, so that ones with similar fingerprints will tend to be in the same dataset.  The time required by this splitting algorithm scales as the square of the number of molecules, so it is mainly useful for small to medium sized datasets.
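The sketch below shows a simplified Butina-style clustering on toy fingerprints, each represented as a set of "on" bit positions. It is an illustration under stated assumptions, not the RDKit or DeepChem implementation: the function names, the toy fingerprints, and the distance cutoff of 0.5 are all made up for this example. The nested loop over all pairs is where the quadratic cost comes from.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina_clusters(fingerprints, cutoff=0.5):
    """Simplified Butina-style clustering; cutoff is a Tanimoto distance.

    Hypothetical helper for illustration; not part of DeepChem or RDKit.
    """
    n = len(fingerprints)
    neighbors = [set() for _ in range(n)]
    # Compare every pair once: this is the O(n^2) step.
    for i in range(n):
        for j in range(i + 1, n):
            if 1.0 - tanimoto(fingerprints[i], fingerprints[j]) <= cutoff:
                neighbors[i].add(j)
                neighbors[j].add(i)
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Pick the unassigned molecule with the most unassigned neighbors.
        centroid = max(sorted(unassigned),
                       key=lambda i: len(neighbors[i] & unassigned))
        members = {centroid} | (neighbors[centroid] & unassigned)
        clusters.append(sorted(members))
        unassigned -= members
    return clusters


# Toy fingerprints: two families of similar molecules and one outlier.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6},
       {10, 11, 12}, {10, 11, 13}, {20, 21}]
clusters = butina_clusters(fps)
print(clusters)
```

Once the clusters are known, whole clusters can be assigned to the training, validation, or test set, in the same spirit as the scaffold-based grouping.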

### SpecifiedSplitter

This splitter leaves everything up to the user.  You tell it exactly which samples to put in each dataset.  This is useful when you know in advance that a particular splitting is appropriate for your data.

An example is temporal splitting.  Consider a research project where you are continually generating and testing new molecules.  As you gain more data, you periodically retrain your model on the steadily growing dataset, then use it to predict results for other not yet tested molecules.  A good way of validating whether this works is to pick a particular cutoff date, train the model on all data you had at that time, and see how well it predicts other data that was generated later.
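A temporal split like the one described above comes down to comparing each sample's date against a cutoff. The dates and variable names below are invented for illustration. The resulting index lists are the kind of explicit assignment that a user-specified split works from.

```python
from datetime import date

# Hypothetical measurement dates, one per sample.
measurement_dates = [
    date(2020, 1, 15),
    date(2020, 6, 1),
    date(2021, 3, 10),
    date(2021, 9, 30),
    date(2022, 2, 5),
]
cutoff = date(2021, 1, 1)

# Train on everything measured before the cutoff, test on everything after.
train_indices = [i for i, d in enumerate(measurement_dates) if d < cutoff]
test_indices = [i for i, d in enumerate(measurement_dates) if d >= cutoff]
print(train_indices, test_indices)
```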

### TaskSplitter

This splitter provides a simple interface for splitting datasets task-wise.

For some learning problems, the training and test datasets should have different tasks entirely. This is a different paradigm from the usual `Splitter`, which ensures that split datasets have different data points, not different tasks. Splitting by task is useful when you want to measure how well a model generalizes to tasks it never saw during training.
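The difference from a sample-wise split is easiest to see on a small label matrix with one row per sample and one column per task. The sketch below is plain Python with made-up task names and labels; the helper `split_tasks` is hypothetical and does not reflect DeepChem's API.

```python
def split_tasks(task_names, label_rows, n_train_tasks):
    """Split a label matrix by columns (tasks) instead of rows (samples).

    Hypothetical helper for illustration; not part of DeepChem.
    """
    train_tasks = task_names[:n_train_tasks]
    test_tasks = task_names[n_train_tasks:]
    train_labels = [row[:n_train_tasks] for row in label_rows]
    test_labels = [row[n_train_tasks:] for row in label_rows]
    return (train_tasks, train_labels), (test_tasks, test_labels)


task_names = ["task_a", "task_b", "task_c", "task_d"]
label_rows = [
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
]
(train_tasks, train_labels), (test_tasks, test_labels) = split_tasks(
    task_names, label_rows, n_train_tasks=3)
print(train_tasks, test_tasks)
```

Every sample appears in both parts; only the set of tasks differs.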

### SingletaskStratifiedSplitter

Another way of splitting data, particularly for classification tasks with imbalanced class distributions, is the single-task stratified splitter. It stratifies on the labels of a single task so that the training, validation, and test sets each maintain the label distribution of the original dataset. This is crucial when working with imbalanced datasets where some classes may be under-represented.

### FingerprintSplitter

This splitter divides the data based on the Tanimoto similarity between ECFP4 fingerprints. ECFP4 fingerprints, a type of Morgan fingerprint, encode the substructures present in a molecule as a set of bits, and the Tanimoto similarity measures how much two such sets overlap.

It tries to split the data such that the molecules in each dataset are as different as possible from the ones in the other datasets. This makes it a very stringent test of models. Predicting the test and validation sets may require extrapolating far outside the training data.
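The Tanimoto similarity of two fingerprints is the number of bits they share divided by the number of bits set in either one. The sketch below uses small sets of bit positions as stand-ins for real ECFP4 fingerprints (the values are made up) and measures how close each test molecule is to its nearest training molecule.

```python
def tanimoto(a, b):
    """Tanimoto similarity: shared on-bits divided by total distinct on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Toy fingerprints as sets of on-bit positions (not real ECFP4 output).
train_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}]
test_fps = [{1, 2, 3, 6}, {7, 8, 9}]

# For each test molecule, the similarity to its closest training molecule.
nearest = [max(tanimoto(t, tr) for tr in train_fps) for t in test_fps]
print(nearest)
```

A splitter that aims for maximally different datasets pushes these nearest-neighbor similarities down, so the second test molecule (similarity 0.0) is the kind of sample it favors for the test set.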

### MolecularWeightSplitter

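`MolecularWeightSplitter` splits the data based on molecular weight.  It sorts the molecules by weight and divides them in that order, so each dataset covers a different range of molecular weights.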
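A weight-based split can be sketched by sorting on molecular weight and slicing. The weights below are approximate values in g/mol, listed only for illustration. Putting the lightest molecules in the training set is an assumption of this sketch.

```python
# Approximate molecular weights in g/mol (illustrative values).
molecular_weights = {
    "caffeine": 194.19,
    "methane": 16.04,
    "paclitaxel": 853.91,
    "benzene": 78.11,
    "cholesterol": 386.65,
    "ethanol": 46.07,
    "ibuprofen": 206.28,
    "phenol": 94.11,
    "aspirin": 180.16,
    "naphthalene": 128.17,
}

# Sort molecule names from lightest to heaviest.
ordered = sorted(molecular_weights, key=molecular_weights.get)
n = len(ordered)
train_end = int(0.8 * n)
valid_end = int(0.9 * n)

# Assumption: lightest molecules go to training, heaviest to testing.
train = ordered[:train_end]
valid = ordered[train_end:valid_end]
test = ordered[valid_end:]
print(train, valid, test)
```

The model is then evaluated on molecules larger than anything it was trained on, which tests how well it extrapolates across molecular size.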


## Effect of Using Different Splitters

Let's look at an example.  We will load the Tox21 toxicity dataset using random, scaffold, Butina, and fingerprint splitting.  For each one we train a model and evaluate it on the training and test sets.

In [2]:
import deepchem as dc

# Compare the same model under four different splitting strategies.
splitters = ['random', 'scaffold', 'butina', 'fingerprint']
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
for splitter in splitters:
    # Load Tox21 with ECFP features, split using the chosen strategy.
    tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='ECFP', splitter=splitter)
    train_dataset, valid_dataset, test_dataset = datasets
    # n_features must match the length of the fingerprints from the featurizer.
    model = dc.models.MultitaskClassifier(n_tasks=len(tasks), n_features=1024, layer_sizes=[1000])
    model.fit(train_dataset, nb_epoch=10)
    # Report ROC AUC on the training set and on the held-out test set.
    print('splitter:', splitter)
    print('training set score:', model.evaluate(train_dataset, [metric], transformers))
    print('test set score:', model.evaluate(test_dataset, [metric], transformers))
    print()

[16:36:02] Explicit valence for atom # 4 Al, 6, is greater than permitted
    rdkit.Chem.rdmolfiles.CanonicalRankAtoms(NoneType)
did not match C++ signature:
    CanonicalRankAtoms(RDKit::ROMol mol, bool breakTies=True, bool includeChirality=True, bool includeIsotopes=True, bool includeAtomMaps=True, bool includeChiralPresence=False)
[16:36:03] Explicit valence for atom # 9 Al, 6, is greater than permitted
    rdkit.Chem.rdmolfiles.CanonicalRankAtoms(NoneType)
did not match C++ signature:
    CanonicalRankAtoms(RDKit::ROMol mol, bool breakTies=True, bool includeChirality=True, bool includeIsotopes=True, bool includeAtomMaps=True, bool includeChiralPresence=False)
[16:36:03] Explicit valence for atom # 5 Al, 6, is greater than permitted
    rdkit.Chem.rdmolfiles.CanonicalRankAtoms(NoneType)
did not match C++ signature:
    CanonicalRankAtoms(RDKit::ROMol mol, bool breakTies=True, bool includeChirality=True, bool includeIso

splitter: random
training set score: {'roc_auc_score': 0.9554874391199669}
test set score: {'roc_auc_score': 0.7838446499068819}

splitter: scaffold
training set score: {'roc_auc_score': 0.958455659270116}
test set score: {'roc_auc_score': 0.6845891062590703}

splitter: butina
training set score: {'roc_auc_score': 0.9586332682931221}
test set score: {'roc_auc_score': 0.6008836631437293}

splitter: fingerprint
training set score: {'roc_auc_score': 0.9547898839426442}
test set score: {'roc_auc_score': 0.627558291868893}



All of them produce very similar performance on the training set (ROC AUC of about 0.96), but the random splitter has a much higher test set score (0.78).  Scaffold splitting has a lower test set score (0.68), and fingerprint and Butina splitting are lower still (0.63 and 0.60).  Does that mean random splitting is better?  No!  It means random splitting doesn't give you an accurate measure of how well your model works.  Because the test set contains lots of molecules that are very similar to ones in the training set, it isn't truly independent.  It makes the model appear to work better than it really does.  Scaffold, fingerprint, and Butina splitting give a better indication of what you can expect on independent data in the future.
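One way to summarize the results is the gap between training and test scores for each splitter. The snippet below simply redoes that arithmetic on the ROC AUC values printed above; the variable names are chosen for this example.

```python
# (training score, test score) pairs copied from the output above.
scores = {
    "random": (0.9554874391199669, 0.7838446499068819),
    "scaffold": (0.958455659270116, 0.6845891062590703),
    "butina": (0.9586332682931221, 0.6008836631437293),
    "fingerprint": (0.9547898839426442, 0.627558291868893),
}

# Gap between training and test ROC AUC, rounded to three decimals.
gaps = {name: round(train - test, 3) for name, (train, test) in scores.items()}
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:12s} train-test gap: {gap:.3f}")
```

The random split shows the smallest gap (about 0.17) and the Butina split the largest (about 0.36), which matches the argument that random splitting gives the most optimistic estimate.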

# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Discord
The DeepChem [Discord](https://discord.com/channels/1170082703853490257/1210627997921443870) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!

## Citing This Tutorial
If you found this tutorial useful, please consider citing it using the provided BibTeX.

In [None]:
@manual{Intro8,
 title={Working With Splitters},
 organization={DeepChem},
 author={Eastman, Peter and Mohapatra, Bibhusundar and Ramsundar, Bharath},
 howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Working_With_Splitters.ipynb}},
 year={2021},
}