# Working With Datasets

Data is central to machine learning.  This tutorial introduces the `Dataset` class that DeepChem uses to store and manage data.  The class provides simple but powerful tools for working efficiently with large amounts of data.  It is also designed to interact easily with other popular Python frameworks such as NumPy, Pandas, TensorFlow, and PyTorch.

## Colab

This tutorial and the rest in this sequence can be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/Working_With_Datasets.ipynb)



In [None]:
!pip install --pre deepchem

We can now import the `deepchem` package to play with.

In [1]:
import deepchem as dc
dc.__version__

'2.4.0-rc1.dev'

# Anatomy of a Dataset

In the last tutorial we loaded the Delaney dataset of molecular solubilities.  Let's load it again.

In [2]:
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = datasets

We now have three Dataset objects: the training, validation, and test sets.  What information does each of them contain?  We can start to get an idea by printing out the string representation of one of them.

In [3]:
print(test_dataset)

<DiskDataset X.shape: (113,), y.shape: (113, 1), w.shape: (113, 1), ids: ['C1c2ccccc2c3ccc4ccccc4c13' 'COc1ccccc1Cl'
 'COP(=S)(OC)Oc1cc(Cl)c(Br)cc1Cl' ... 'CCSCCSP(=S)(OC)OC' 'CCC(C)C'
 'COP(=O)(OC)OC(=CCl)c1cc(Cl)c(Cl)cc1Cl'], task_names: ['measured log solubility in mols per litre']>


There's a lot of information there, so let's start at the beginning.  The output begins with the label "DiskDataset".  `Dataset` is an abstract class with a few subclasses that correspond to different ways of storing data.

- `DiskDataset` is a dataset that has been saved to disk.  The data is stored in a way that can be efficiently accessed, even if the total amount of data is far larger than your computer's memory.
- `NumpyDataset` is an in-memory dataset that holds all the data in NumPy arrays.  It is a useful tool when manipulating small to medium sized datasets that can fit entirely in memory.
- `ImageDataset` is a more specialized class that stores some or all of the data in image files on disk.  It is useful when working with models that have images as their inputs or outputs.

Now let's consider the contents of the Dataset.  Every Dataset stores a list of *samples*.  Very roughly speaking, a sample is a single data point.  In this case, each sample is a molecule.  In other datasets a sample might correspond to an experimental assay, a cell line, an image, or many other things.  For every sample the dataset stores the following information.

- The *features*, referred to as `X`.  This is the input that should be fed into a model to represent the sample.
- The *labels*, referred to as `y`.  This is the desired output from the model.  During training, the optimizer adjusts the model so that its output for each sample is as close as possible to `y`.
- The *weights*, referred to as `w`.  This can be used to indicate that some data values are more important than others.  In later tutorials we will see examples of how this is useful.
- An *ID*, which is a unique identifier for the sample.  This can be anything as long as it is unique.  Sometimes it is just an integer index, but in this dataset the ID is a SMILES string describing the molecule.

Notice that `X`, `y`, and `w` all have 113 as the size of their first dimension.  That means this dataset contains 113 samples.

The final piece of information listed in the output is `task_names`.  Some datasets contain multiple pieces of information for each sample.  For example, if a sample represents a molecule, the dataset might record the results of several different experiments on that molecule.  This dataset has only a single task: "measured log solubility in mols per litre".  Also notice that `y` and `w` each have shape (113, 1).  The second dimension of these arrays usually matches the number of tasks.
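To make the roles of `y` and `w` more concrete, here is a minimal NumPy sketch of a weighted loss over a two-task label array.  The helper `weighted_mse` is hypothetical and is not part of DeepChem; it only illustrates how per-entry weights can change how much each label counts.  The choice of weight 0 for an unmeasured entry is an assumption made for this example.

```python
import numpy as np

def weighted_mse(y_true, y_pred, w):
    """Weighted mean squared error over all (sample, task) entries.

    Hypothetical helper for illustration; not a DeepChem function.
    """
    squared_error = (y_true - y_pred) ** 2
    # Entries with weight 0 contribute nothing; larger weights count more.
    return float(np.sum(w * squared_error) / np.sum(w))

# Three samples, two tasks: y and w both have shape (n_samples, n_tasks).
y_true = np.array([[1.0, 0.0],
                   [2.0, 0.0],
                   [3.0, 5.0]])
y_pred = np.array([[1.0, 9.0],
                   [2.0, 9.0],
                   [4.0, 5.0]])
# Assumption for this example: the second task was only measured for the
# last sample, so the first two entries in that column get weight 0.
w = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [1.0, 1.0]])

loss = weighted_mse(y_true, y_pred, w)
print(loss)  # 0.25
```

The large errors in the zero-weight entries have no effect on the result; only the four weighted entries are averaged.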

# Accessing Data from a Dataset

There are many ways to access the data contained in a dataset.  The simplest is just to directly access the `X`, `y`, `w`, and `ids` properties.  Each of these returns the corresponding information as a NumPy array.

In [4]:
test_dataset.y

array([[-1.7065408738415053],
       [0.2911162036252904],
       [-1.4272475857596547],
       [-0.9254664241210759],
       [-1.9526976701170347],
       [1.3514839414275706],
       [-0.8591934405084332],
       [-0.6509069205829855],
       [-0.32900957160729316],
       [0.6082797680572224],
       [1.8295961803473488],
       [1.6213096604219008],
       [1.3751528641463715],
       [0.45632528420252055],
       [1.0532555151706793],
       [-1.1053502367839627],
       [-0.2011973889257683],
       [0.3479216181504126],
       [-0.9870056231899582],
       [-0.8161160011602158],
       [0.8402352107014712],
       [0.22815686919328],
       [0.06247441016167367],
       [1.040947675356903],
       [-0.5197810887208284],
       [0.8023649343513898],
       [-0.41895147793873655],
       [-2.5964923680684198],
       [1.7443880585596654],
       [0.45206487811313645],
       [0.233837410645792],
       [-1.7917489956291888],
       [0.7739622270888287],
       [1.0011838851893173]

This is a very easy way to access data, but use it with care.  It requires the data for all samples to be loaded into memory at once.  That's fine for small datasets like this one, but a large dataset could easily need more memory than you have.

A better approach is to iterate over the dataset.  That lets it load just a little data at a time, process it, then free the memory before loading the next bit.  You can use the `itersamples()` method to iterate over samples one at a time.

In [5]:
for X, y, w, sample_id in test_dataset.itersamples():
    print(y, sample_id)

[-1.70654087] C1c2ccccc2c3ccc4ccccc4c13
[0.2911162] COc1ccccc1Cl
[-1.42724759] COP(=S)(OC)Oc1cc(Cl)c(Br)cc1Cl
[-0.92546642] ClC(Cl)CC(=O)NC2=C(Cl)C(=O)c1ccccc1C2=O
[-1.95269767] ClC(Cl)C(c1ccc(Cl)cc1)c2ccc(Cl)cc2 
[1.35148394] COC(=O)C=C
[-0.85919344] CN(C)C(=O)Nc2ccc(Oc1ccc(Cl)cc1)cc2
[-0.65090692] N(=Nc1ccccc1)c2ccccc2
[-0.32900957] CC(C)c1ccc(C)cc1
[0.60827977] Oc1c(Cl)cccc1Cl
[1.82959618] OCC2OC(OC1(CO)OC(CO)C(O)C1O)C(O)C(O)C2O 
[1.62130966] OC1C(O)C(O)C(O)C(O)C1O
[1.37515286] Cn2c(=O)n(C)c1ncn(CC(O)CO)c1c2=O
[0.45632528] OCC(NC(=O)C(Cl)Cl)C(O)c1ccc(cc1)N(=O)=O
[1.05325552] CCC(O)(CC)CC
[-1.10535024] CC45CCC2C(CCC3CC1SC1CC23C)C4CCC5O
[-0.20119739] Brc1ccccc1Br
[0.34792162] Oc1c(Cl)cc(Cl)cc1Cl
[-0.98700562] CCCN(CCC)c1c(cc(cc1N(=O)=O)S(N)(=O)=O)N(=O)=O
[-0.816116] C2c1ccccc1N(CCF)C(=O)c3ccccc23 
[0.84023521] CC(C)C(=O)C(C)C
[0.22815687] O=C1NC(=O)NC(=O)C1(C(C)C)CC=C(C)C
[0.06247441] c1c(O)C2C(=O)C3cc(O)ccC3OC2cc1(OC)
[1.04094768] Cn1cnc2n(C)c(=O)n(C)c(=O)c12
[-0.51978109] CC(=O)SC4C
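Iterating one sample at a time also makes it possible to compute statistics without holding the whole dataset in memory.  The sketch below uses a hypothetical generator, `toy_itersamples`, as a stand-in that yields the same `(X, y, w, id)` tuples; the toy values are made up for illustration.

```python
import numpy as np

def toy_itersamples(X, y, w, ids):
    """Hypothetical stand-in for a sample iterator; yields (X, y, w, id)."""
    for sample in zip(X, y, w, ids):
        yield sample

# Made-up toy data: four samples, one task.
toy_X = np.zeros((4, 2))
toy_y = np.array([[-1.7], [0.3], [-1.4], [-0.9]])
toy_w = np.ones((4, 1))
toy_ids = ["a", "b", "c", "d"]

# Accumulate a running sum so only one sample is needed at a time.
count = 0
total = 0.0
for X_i, y_i, w_i, sample_id in toy_itersamples(toy_X, toy_y, toy_w, toy_ids):
    total += float(y_i[0])
    count += 1

running_mean = total / count
print(running_mean)
```

The same loop shape should work with `itersamples()` on a real dataset, since it yields tuples in the same order.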

Most deep learning models can process a batch of multiple samples all at once.  You can use `iterbatches()` to iterate over batches of samples.

In [6]:
for X, y, w, ids in test_dataset.iterbatches(batch_size=50):
    print(y.shape)

(50, 1)
(50, 1)
(13, 1)


`iterbatches()` has other features that are useful when training models.  For example, `iterbatches(batch_size=100, epochs=10, deterministic=False)` will iterate over the complete dataset ten times, each time with the samples in a different random order.
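As a rough picture of what multi-epoch, non-deterministic batching means, here is a NumPy-only sketch.  The function `iter_minibatches` and its parameters are hypothetical; it mimics the behavior described above and is not DeepChem's implementation.

```python
import numpy as np

def iter_minibatches(labels, batch_size, epochs=1, shuffle=True, seed=0):
    """Yield batches of labels, reshuffling the sample order each epoch.

    Hypothetical helper for illustration; not DeepChem's implementation.
    """
    rng = np.random.default_rng(seed)
    n_samples = len(labels)
    for _ in range(epochs):
        # A fresh permutation per epoch gives a different order each pass.
        order = rng.permutation(n_samples) if shuffle else np.arange(n_samples)
        for start in range(0, n_samples, batch_size):
            yield labels[order[start:start + batch_size]]

# 113 samples with one task, the same shape as the test set labels.
labels = np.arange(113).reshape(113, 1)
batch_shapes = [batch.shape
                for batch in iter_minibatches(labels, batch_size=50, epochs=2)]
print(batch_shapes)
```

Each epoch produces two full batches of 50 and a final batch of 13, and every sample appears exactly once per epoch.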

Datasets can also expose data using the standard interfaces for TensorFlow and PyTorch.  To get a `tensorflow.data.Dataset`, call `make_tf_dataset()`.  To get a `torch.utils.data.IterableDataset`, call `make_pytorch_dataset()`.  See the API documentation for more details.

The final way of accessing data is `to_dataframe()`.  This copies the data into a Pandas `DataFrame`.  This requires storing all the data in memory at once, so you should only use it with small datasets.

In [7]:
test_dataset.to_dataframe()

| | X | y | w | ids |
|---|---|---|---|---|
| 0 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | -1.706541 | 1.0 | C1c2ccccc2c3ccc4ccccc4c13 |
| 1 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | 0.291116 | 1.0 | COc1ccccc1Cl |
| 2 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | -1.427248 | 1.0 | COP(=S)(OC)Oc1cc(Cl)c(Br)cc1Cl |
| 3 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | -0.925466 | 1.0 | ClC(Cl)CC(=O)NC2=C(Cl)C(=O)c1ccccc1C2=O |
| 4 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | -1.952698 | 1.0 | ClC(Cl)C(c1ccc(Cl)cc1)c2ccc(Cl)cc2 |
| ... | ... | ... | ... | ... |
| 108 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | 0.646150 | 1.0 | FC(F)(F)C(Cl)Br |
| 109 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | 1.505805 | 1.0 | CNC(=O)ON=C(SC)C(=O)N(C)C |
| 110 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | -0.007586 | 1.0 | CCSCCSP(=S)(OC)OC |
| 111 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | -0.049716 | 1.0 | CCC(C)C |


# Creating Datasets

Now let's talk about how you can create your own datasets.  Creating a `NumpyDataset` is very simple: just pass the arrays containing the data to the constructor.  Let's create some random arrays, then wrap them in a NumpyDataset.

In [8]:
import numpy as np

X = np.random.random((10, 5))
y = np.random.random((10, 2))
dataset = dc.data.NumpyDataset(X=X, y=y)
print(dataset)

<NumpyDataset X.shape: (10, 5), y.shape: (10, 2), w.shape: (10, 1), ids: [0 1 2 3 4 5 6 7 8 9], task_names: [0 1]>


Notice that we did not specify weights or IDs.  These are optional, as is `y` for that matter.  Only `X` is required.  Since we left them out, the constructor automatically built `w` and `ids` arrays for us, setting all weights to 1 and setting the IDs to integer indices.

In [9]:
dataset.to_dataframe()

| | X1 | X2 | X3 | X4 | X5 | y1 | y2 | w | ids |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.54733 | 0.919941 | 0.289138 | 0.431806 | 0.776672 | 0.532579 | 0.443258 | 1.0 | 0 |
| 1 | 0.980867 | 0.642487 | 0.46064 | 0.500153 | 0.014848 | 0.678259 | 0.274029 | 1.0 | 1 |
| 2 | 0.953254 | 0.704446 | 0.857458 | 0.378372 | 0.705789 | 0.704786 | 0.90108 | 1.0 | 2 |
| 3 | 0.90497 | 0.72971 | 0.304247 | 0.861546 | 0.917029 | 0.121747 | 0.758845 | 1.0 | 3 |
| 4 | 0.464144 | 0.059168 | 0.600405 | 0.880529 | 0.688043 | 0.595495 | 0.719861 | 1.0 | 4 |
| 5 | 0.820482 | 0.139002 | 0.627421 | 0.129399 | 0.920024 | 0.63403 | 0.464525 | 1.0 | 5 |
| 6 | 0.113727 | 0.551801 | 0.536189 | 0.066091 | 0.31132 | 0.699331 | 0.171532 | 1.0 | 6 |
| 7 | 0.516131 | 0.918903 | 0.429036 | 0.844973 | 0.639367 | 0.464089 | 0.337989 | 1.0 | 7 |
| 8 | 0.809393 | 0.20145 | 0.82142 | 0.84139 | 0.100026 | 0.230462 | 0.376151 | 1.0 | 8 |
| 9 | 0.07675 | 0.389277 | 0.350371 | 0.291806 | 0.127522 | 0.544606 | 0.306578 | 1.0 | 9 |


What about creating a DiskDataset?  If you have the data in NumPy arrays, you can call `DiskDataset.from_numpy()` to save it to disk.  Since this is just a tutorial, we will save it to a temporary directory.

In [10]:
import tempfile

with tempfile.TemporaryDirectory() as data_dir:
    disk_dataset = dc.data.DiskDataset.from_numpy(X=X, y=y, data_dir=data_dir)
    print(disk_dataset)

<DiskDataset X.shape: (10, 5), y.shape: (10, 2), w.shape: (10, 1), ids: [0 1 2 3 4 5 6 7 8 9], task_names: [0 1]>


What about larger datasets that can't fit in memory?  What if you have some huge files on disk containing data on hundreds of millions of molecules?  The process for creating a DiskDataset from them is slightly more involved.  Fortunately, DeepChem's `DataLoader` framework can automate most of the work for you.  That is a larger subject, so we will return to it in a later tutorial.

# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!

## Citing This Tutorial
If you found this tutorial useful, please consider citing it using the provided BibTeX.

In [None]:
@manual{Intro2, 
 title={Working with Datasets}, 
 organization={DeepChem},
 author={Ramsundar, Bharath}, 
 howpublished = {\url{https://github.com/deepchem/deepchem/blob/168bea9e0959b51e5c66bbcd569b572c656fd000/examples/tutorials/Working_With_Datasets.ipynb}}, 
 year={2021}, 
} 