Data

DeepChem dc.data provides APIs for handling your data.

If your data is stored in files such as CSV or SDF, you can use the Data Loaders. A Data Loader reads your data, converts it to features (e.g. SMILES to ECFP), and saves the features in a Dataset object. If your data is already held in Python objects such as NumPy arrays or pandas DataFrames, you can use the Datasets directly.


Datasets

DeepChem dc.data.Dataset objects are one of the core building blocks of DeepChem programs. Dataset objects hold representations of data for machine learning and are widely used throughout DeepChem.

The goal of the Dataset class is to be maximally interoperable with other common representations of machine learning datasets. For this reason we provide interconversion methods mapping from Dataset objects to pandas DataFrames, TensorFlow Datasets, and PyTorch datasets.

NumpyDataset

The dc.data.NumpyDataset class provides an in-memory implementation of the abstract Dataset which stores its data in numpy.ndarray objects.

deepchem.data.NumpyDataset

DiskDataset

The dc.data.DiskDataset class allows for the storage of larger datasets on disk. Each DiskDataset is associated with a directory in which it writes its contents to disk. Note that a DiskDataset can be very large, so some of the utility methods to access fields of a Dataset can be prohibitively expensive.

deepchem.data.DiskDataset

ImageDataset

The dc.data.ImageDataset class is optimized for convenient processing of image-based datasets.

deepchem.data.ImageDataset

Data Loaders

Processing large amounts of input data to construct a dc.data.Dataset object can require a fair amount of custom code. The dc.data.DataLoader classes simplify this process by providing utilities to load and featurize large amounts of data.

CSVLoader

deepchem.data.CSVLoader

UserCSVLoader

deepchem.data.UserCSVLoader

ImageLoader

deepchem.data.ImageLoader

JsonLoader

JSON is a flexible file format that is human-readable, lightweight, and more compact than other open standard formats like XML. JSON files are similar to Python dictionaries of key-value pairs. All keys must be strings, but values can be any of string, number, object, array, boolean, or null, so the format is more flexible than CSV. JSON is used to describe structured data and to serialize objects. It is a convenient format for reading and writing pandas DataFrames with the pandas.read_json function and the DataFrame.to_json method.
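The pandas round trip can be sketched as follows. This uses only pandas, not `JsonLoader` itself, and the column names and values are invented for illustration.

```python
from io import StringIO

import pandas as pd

# A small DataFrame with hypothetical columns.
df = pd.DataFrame({"smiles": ["CCO", "CCC"], "label": [0.5, 1.2]})

# Serialize to a JSON string as a list of records (one object per row).
json_text = df.to_json(orient="records")
print(json_text)

# Parse the JSON text back into a DataFrame.
df_roundtrip = pd.read_json(StringIO(json_text), orient="records")
print(df_roundtrip)
```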

deepchem.data.JsonLoader

SDFLoader

deepchem.data.SDFLoader

FASTALoader

deepchem.data.FASTALoader

InMemoryLoader

The dc.data.InMemoryLoader is designed to facilitate the processing of large datasets where you already hold the raw data in memory (for example, in a pandas DataFrame).

deepchem.data.InMemoryLoader

Data Classes

DeepChem featurizers often transform each data point into a "data class" object. These are classes that hold all the information needed to train a model on that data point. Models then transform these objects into tensors for training in their default_generator methods.

Graph Data

This section documents the data classes used for graph convolutions. We plan to simplify these classes (ConvMol, MultiConvMol, WeaveMol) into a joint data representation (GraphData) for all graph convolutions in a future version of DeepChem, so these APIs may not remain stable.

The graph convolution models that inherit from KerasModel depend on ConvMol, MultiConvMol, or WeaveMol. The graph convolution models that inherit from TorchModel depend on GraphData.

deepchem.feat.mol_graphs.ConvMol

deepchem.feat.mol_graphs.MultiConvMol

deepchem.feat.mol_graphs.WeaveMol

deepchem.feat.graph_data.GraphData

Base Classes (for development)

Dataset

The dc.data.Dataset class is the abstract parent class for all datasets. This class should never be instantiated directly, but it contains a number of useful method implementations.

deepchem.data.Dataset

DataLoader

The dc.data.DataLoader class is the abstract parent class for all data loaders. This class should never be instantiated directly, but it contains a number of useful method implementations.

deepchem.data.DataLoader