# Using Splitters 

In this tutorial we look at the splitters available in the DeepChem library and how each of them can be used.

In [5]:
import deepchem as dc
import pandas as pd
import os

We start with the IndexSplitter. Its split method returns three range objects that partition the dataset indices according to the fractions provided by the user. These ranges can then be used to iterate over the train, valid and test portions of the dataset.


Each of the splitters inherits two methods from the base class: train_test_split, which splits the data into training and testing sets, and train_valid_test_split, which splits the data into train, validation and test sets.

Note: All the splitters default to an 80/10/10 split into train, valid and test respectively. This can be changed by passing frac_train, frac_valid and frac_test with the fractions we want.

In [7]:
input_data=os.path.join("/home/ab/deepchem/deepchem/models/tests/example.csv")

We then featurize the data using one of the available featurizers, here CircularFingerprint.

In [15]:
tasks = ['log-solubility']
# Circular fingerprints of length 1024, computed from the "smiles" column
featurizer = dc.feat.CircularFingerprint(size=1024)
loader = dc.data.CSVLoader(tasks=tasks, smiles_field="smiles", featurizer=featurizer)
dataset = loader.featurize(input_data)

Loading raw samples now.
shard_size: 8192
About to start loading CSV from /home/ab/deepchem/deepchem/models/tests/example.csv
Loading shard 1 of size 8192.
Featurizing sample 0
TIMING: featurizing shard 0 took 0.035 s
TIMING: dataset construction took 0.045 s
Loading dataset from disk.


In [12]:
from deepchem.splits.splitters import IndexSplitter

In [16]:
splitter=IndexSplitter()
train_data,valid_data,test_data=splitter.split(dataset)

In [19]:
train_data,valid_data,test_data

(range(0, 8), range(8, 9), range(9, 10))
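
The ranges hold row positions, not the data itself. As a minimal sketch of how they might be consumed, the snippet below uses a plain Python list as a stand-in for the dataset. The `samples` list and its `mol_*` names are hypothetical placeholders and are not part of DeepChem.

```python
# Hypothetical stand-in for a 10-row dataset; the names are placeholders.
samples = [f"mol_{i}" for i in range(10)]

# Index ranges matching the default 80/10/10 split shown above.
train_idx, valid_idx, test_idx = range(0, 8), range(8, 9), range(9, 10)

# Select the rows belonging to each subset by position.
train = [samples[i] for i in train_idx]
valid = [samples[i] for i in valid_idx]
test = [samples[i] for i in test_idx]

print(len(train), len(valid), len(test))  # prints: 8 1 1
```

Because the ranges are contiguous and do not overlap, every row lands in exactly one subset.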

In [18]:
len(dataset)

10

As we can see, when no fractions are specified the data is split using the default 80/10/10 ratio.

When we specify these parameters, the dataset is split according to our specifications.

In [20]:
train_data,valid_data,test_data=splitter.split(dataset,frac_train=0.7,frac_valid=0.2,frac_test=0.1)

In [21]:
train_data,valid_data,test_data

(range(0, 7), range(7, 9), range(9, 10))
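
Both results are consistent with cutting the index range at truncated cumulative fractions. The function below is a minimal sketch under that assumption. `index_split` is a hypothetical helper written for illustration; it is not DeepChem's implementation, and the tolerance used in the check is an arbitrary choice.

```python
def index_split(n, frac_train=0.8, frac_valid=0.1, frac_test=0.1):
    """Split range(n) into contiguous train/valid/test index ranges."""
    # Assumption: the three fractions are expected to cover the whole dataset.
    if abs(frac_train + frac_valid + frac_test - 1.0) > 1e-8:
        raise ValueError("fractions must sum to 1")
    # Truncate the cumulative fractions to integer cutoffs.
    train_cutoff = int(frac_train * n)
    valid_cutoff = int((frac_train + frac_valid) * n)
    indices = range(n)
    # Slicing a range yields another range, so no data is copied.
    return (indices[:train_cutoff],
            indices[train_cutoff:valid_cutoff],
            indices[valid_cutoff:])


print(index_split(10))
print(index_split(10, frac_train=0.7, frac_valid=0.2, frac_test=0.1))
```

In this sketch, any rows left over after truncation fall into the test range, so on small datasets the realized proportions may differ slightly from the requested fractions.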