# A Guide To Dataloaders


### What Is A Dataloader?

As the name suggests, dataloaders are tools we use to load data from files on disk into primary memory. In data science and machine learning, handling large amounts of data is an everyday occurrence, and it requires tools to load and manage that data. With small datasets and modern tools like pandas, this is a simple process. However, with larger datasets that may not fit wholly into memory, loading and processing the data becomes a more complex task.

In this tutorial, we will look at the dataloaders used in DeepChem and how we use them for various tasks.

### Required Prerequisites:
1.  deepchem installed on your machine, venv, or jupyter notebook environment
2.  A working dataset, for the sake of demonstration. These can be found [here]()
3.  Pandas and numpy in your environment

In [None]:
import deepchem as dc
import pandas as pd  # the package name is lowercase; "import Pandas" raises ModuleNotFoundError
import numpy as np

### A General Overview And Some Terms To Keep in Mind

The purpose of a loader is to read raw data from disk, featurize it, and store the result in a 'Dataset' object. In general, loaders in DeepChem use a featurizer to process the input data before writing it to disk.

Parts of the dataset can be loaded into memory for use, as pandas dataframes. Such parts are referred to as 'Shards'.
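The idea of a shard can be illustrated with plain pandas, independently of DeepChem. The sketch below is only an illustration, not DeepChem's own code path: it builds a small in-memory CSV (the column names `smiles` and `activity` are made up for the example) and reads it back in fixed-size chunks, each of which is an ordinary dataframe.

```python
import io

import pandas as pd

# A tiny CSV held in memory; a real dataset would be a file on disk.
csv_text = "smiles,activity\nCCO,0.1\nCCC,0.5\nCCN,0.9\nCCCC,0.3\nCO,0.7\n"

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading every row at once.
shard_sizes = []
for shard in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Each shard is a regular DataFrame holding at most 2 rows.
    shard_sizes.append(len(shard))

print(shard_sizes)
```

Only one chunk needs to be in memory at a time, which is the property that makes sharding useful for datasets larger than RAM.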

Featurize, in this context, means to represent real-world objects, such as molecules, in a computer-understandable format. The dataloader featurizes the input data and stores the result on disk, from where parts of it can easily be loaded into memory for handling. There are many ways of featurizing molecule data, such as SMILES strings or fingerprint methods. Such methods are beyond the scope of this tutorial, however, and will not be covered here.
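To make the term concrete, here is a deliberately naive sketch of a featurizer. It is not one of DeepChem's featurizers: the function `toy_featurize` and the three-symbol vocabulary are hypothetical, and counting characters ignores two-letter elements, aromatic atoms, and everything else a real featurizer handles. It only shows the shape of the operation: a molecule description goes in, a fixed-length numeric vector comes out.

```python
from typing import List

# Hypothetical, deliberately tiny vocabulary of element symbols.
ELEMENTS = ["C", "N", "O"]


def toy_featurize(smiles: str) -> List[int]:
    """Count how often each symbol in ELEMENTS appears in the string.

    Toy only: this is plain character counting, not chemistry-aware parsing.
    """
    return [smiles.count(symbol) for symbol in ELEMENTS]


# Ethanol, ethylamine, and acetic acid written as SMILES strings.
features = [toy_featurize(s) for s in ["CCO", "CCN", "CC(=O)O"]]
print(features)
```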

### How We Use a Dataloader In Code

The dataloaders shown below are implementations of the DataLoader class in DeepChem. This class defines four methods: create_dataset(), featurize(), and the internal helpers _get_shards() and _featurize_shard(). The featurize() method is deprecated, and create_dataset() should be used instead.
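How these methods fit together can be sketched with a small stand-in class. Everything below is hypothetical and uses only the standard library: `ToyLoader` is not DeepChem code, its method names merely follow the ones listed above, and its "featurization" is just the length of a string. The point is the flow: the input is split into shards, each shard is featurized on its own, and the results are collected.

```python
import csv
import io
from typing import Dict, Iterator, List


class ToyLoader:
    """Hypothetical stand-in that mirrors the shape of a loader."""

    def __init__(self, feature_field: str):
        self.feature_field = feature_field

    def _get_shards(self, text: str, shard_size: int) -> Iterator[List[Dict[str, str]]]:
        """Yield the rows of a CSV string in groups of at most shard_size."""
        shard: List[Dict[str, str]] = []
        for row in csv.DictReader(io.StringIO(text)):
            shard.append(row)
            if len(shard) == shard_size:
                yield shard
                shard = []
        if shard:  # final, possibly smaller shard
            yield shard

    def _featurize_shard(self, shard: List[Dict[str, str]]) -> List[List[int]]:
        """Toy featurization: the length of the feature string."""
        return [[len(row[self.feature_field])] for row in shard]

    def create_dataset(self, text: str, shard_size: int = 2) -> List[List[List[int]]]:
        """Featurize one raw shard at a time and collect the results.

        A real loader would write each featurized shard to disk instead.
        """
        return [self._featurize_shard(s) for s in self._get_shards(text, shard_size)]


loader = ToyLoader(feature_field="smiles")
shards = loader.create_dataset("smiles,activity\nCCO,0.1\nCCCC,0.5\nCO,0.9\n")
print(shards)
```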

create_dataset() returns a dataset object after featurizing the input. If the input file is large, featurizing it in a single step may be too memory intensive. In such cases, the data is processed and accessed shard by shard. The method takes the following parameters:
1. inputs: List. List of inputs to process. Entries can be filenames or arbitrary objects.
2. data_dir: str, optional (default None). Directory to store featurized dataset.
3. shard_size: int, optional (default 8192). Number of examples stored in each shard.

The function returns a DiskDataset object containing a featurized representation of the data from 'inputs'.

### The Loaders Of DeepChem

The loaders of DeepChem are:
1. CSVLoader
2. UserCSVLoader
3. JsonLoader
4. SDFLoader
5. FASTALoader
6. FASTQLoader
7. ImageLoader
8. InMemoryLoader
9. DFTYamlLoader
10. SAMLoader
11. BAMLoader
12. CRAMLoader

While this may seem like a long and overly complicated list, choosing among these dataloaders is just a matter of matching the loader to the type of your input. The names are largely self-explanatory, i.e., use JsonLoader for JSON files and CRAMLoader for CRAM format inputs. As such, the loaders above are all used in very similar ways.
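The "match the loader to the input type" rule can be written down as a small lookup. The helper below is hypothetical and not part of DeepChem: it only returns the name of a loader class from the list above, and the file extensions in the table are assumptions about common naming conventions rather than anything DeepChem enforces.

```python
from pathlib import Path

# Hypothetical lookup table: file extension -> loader class name from the list above.
LOADER_BY_EXTENSION = {
    ".csv": "CSVLoader",
    ".json": "JsonLoader",
    ".sdf": "SDFLoader",
    ".fasta": "FASTALoader",
    ".fastq": "FASTQLoader",
    ".sam": "SAMLoader",
    ".bam": "BAMLoader",
    ".cram": "CRAMLoader",
}


def suggest_loader(filename: str) -> str:
    """Return the name of the loader that matches the file's extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in LOADER_BY_EXTENSION:
        raise ValueError(f"No loader suggestion for extension {suffix!r}")
    return LOADER_BY_EXTENSION[suffix]


print(suggest_loader("molecules.sdf"))
print(suggest_loader("reads.CRAM"))
```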