DeepChem's dc.data module provides APIs for handling your data.
If your data is stored in files such as CSV or SDF, you can use the Data Loaders. A Data Loader reads your data, converts it to features (e.g., SMILES to ECFP), and saves the features in a Dataset object. If your data is already held in Python objects such as NumPy arrays or pandas DataFrames, you can use the Datasets directly.
DeepChem dc.data.Dataset objects are one of the core building blocks of DeepChem programs. Dataset objects hold representations of data for machine learning and are widely used throughout DeepChem.
The goal of the Dataset class is to be maximally interoperable with other common representations of machine learning datasets. For this reason we provide interconversion methods mapping from Dataset objects to pandas DataFrames, TensorFlow Datasets, and PyTorch datasets.
The dc.data.NumpyDataset class provides an in-memory implementation of the abstract Dataset class, which stores its data in numpy.ndarray objects.
deepchem.data.NumpyDataset
The dc.data.DiskDataset class allows for the storage of larger datasets on disk. Each DiskDataset is associated with a directory in which it writes its contents. Note that a DiskDataset can be very large, so utility methods that load a whole field of the Dataset into memory can be prohibitively expensive.
deepchem.data.DiskDataset
The dc.data.ImageDataset class is optimized to allow for convenient processing of image-based datasets.
deepchem.data.ImageDataset
Processing large amounts of input data to construct a dc.data.Dataset object can require a fair amount of custom code. To simplify this process, you can use the dc.data.DataLoader classes, which provide utilities for loading and processing large amounts of data.
deepchem.data.CSVLoader
deepchem.data.UserCSVLoader
deepchem.data.ImageLoader
JSON is a flexible file format that is human-readable, lightweight, and more compact than other open standard formats like XML. JSON files are similar to Python dictionaries of key-value pairs. All keys must be strings, but values can be any of string, number, object, array, boolean, or null, so the format is more flexible than CSV. JSON is used for describing structured data and for serializing objects. It is conveniently read into and written from pandas DataFrames with the pandas.read_json function and the DataFrame.to_json method.
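To make the key and value rules above concrete, here is a small sketch using only Python's standard json module. The record and its field names are hypothetical and are not a format that JsonLoader requires.

```python
import json

# Hypothetical record; the field names are illustrative only.
record = {
    "smiles": "CCO",             # string
    "label": 1.5,                # number
    "tags": ["alcohol", "small"],  # array
    "meta": {"source": None},    # nested object containing null
    "valid": True,               # boolean
}

# Serialize to JSON text and parse it back into a dictionary.
text = json.dumps(record)
restored = json.loads(text)

# JSON keys are always strings: a non-string key is converted on output.
int_key_text = json.dumps({1: "one"})
```

The round trip preserves the nested structure, while the integer key comes back as the string "1".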
deepchem.data.JsonLoader
deepchem.data.SDFLoader
deepchem.data.FASTALoader
The dc.data.InMemoryLoader is designed to facilitate the processing of large datasets where you already hold the raw data in memory (say, in a pandas DataFrame).
deepchem.data.InMemoryLoader
DeepChem featurizers often transform data points into "data classes". These are classes that hold all the information needed to train a model on that data point. Models then transform these into the tensors for training in their default_generator methods.
This section documents the data classes for graph convolutions. We plan to simplify these classes (ConvMol, MultiConvMol, WeaveMol) into a joint data representation (GraphData) for all graph convolutions in a future version of DeepChem, so these APIs may not remain stable.
The graph convolution models that inherit from KerasModel depend on ConvMol, MultiConvMol, or WeaveMol. The graph convolution models that inherit from TorchModel, on the other hand, depend on GraphData.
deepchem.feat.mol_graphs.ConvMol
deepchem.feat.mol_graphs.MultiConvMol
deepchem.feat.mol_graphs.WeaveMol
deepchem.feat.graph_data.GraphData
The dc.data.Dataset class is the abstract parent class for all datasets. This class should never be directly initialized, but contains a number of useful method implementations.
deepchem.data.Dataset
The dc.data.DataLoader class is the abstract parent class for all dataloaders. This class should never be directly initialized, but contains a number of useful method implementations.
deepchem.data.DataLoader