In PyTorch, Dataset and DataLoader are two important classes that help you efficiently load, preprocess, and iterate over your data during training or evaluation of a machine learning model. They are part of the torch.utils.data module.

1. Dataset
What it is: A Dataset is an abstract class that represents a collection of data samples (e.g., images, text, etc.). It provides a way to access individual data samples and their corresponding labels.

Purpose: It allows you to define how to load and preprocess your data.

Key Methods:

* `__len__()`: Returns the total number of samples in the dataset.

* `__getitem__(index)`: Returns the data sample and its label at the specified index.

Example: Custom Dataset
Let's create a custom dataset for a simple example where we have a list of numbers and their squares.

In [2]:
from torch.utils.data import Dataset

class SquareDataset(Dataset):
    def __init__(self, data):
        self.data = data  # List of numbers

    def __len__(self):
        return len(self.data)  # Total number of samples

    def __getitem__(self, index):
        x = self.data[index]  # Input number
        y = x ** 2            # Label (square of the number)
        return x, y

# Create a dataset
data = [1, 2, 3, 4, 5]
dataset = SquareDataset(data)

# Access a sample
print(dataset[1])  # Output: (2, 4)

(2, 4)


2. DataLoader

What it is: A DataLoader is a utility that wraps around a Dataset and provides features like:

Batching: Splits the data into smaller batches.

Shuffling: Randomly reorders the samples at every epoch, so batches do not reflect the order in which the data was stored.

Parallel Loading: Loads data in parallel using multiple workers (for faster data loading).

Purpose: It makes it easy to iterate over the dataset in batches during training or evaluation.

Example: Using DataLoader

Let's use the SquareDataset we created earlier with a DataLoader.

In [3]:
from torch.utils.data import DataLoader

# Create a DataLoader
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Iterate over the DataLoader
for batch in dataloader:
    x, y = batch
    print("Batch of inputs:", x)
    print("Batch of labels:", y)

Batch of inputs: tensor([4, 1])
Batch of labels: tensor([16,  1])
Batch of inputs: tensor([3, 2])
Batch of labels: tensor([9, 4])
Batch of inputs: tensor([5])
Batch of labels: tensor([25])
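
The last batch above holds a single sample because 5 is not divisible by 2. The following is a minimal sketch of how `drop_last=True` discards such an incomplete batch. The plain list `numbers` is a stand-in dataset chosen for brevity and is not part of the original example.

```python
from torch.utils.data import DataLoader

# Stand-in dataset (assumption): any object with __len__ and __getitem__ works
numbers = [1, 2, 3, 4, 5]

# shuffle=False keeps the order fixed so the batch sizes are easy to follow
keep_last = DataLoader(numbers, batch_size=2, shuffle=False)
drop_last = DataLoader(numbers, batch_size=2, shuffle=False, drop_last=True)

# Each batch is a 1-D tensor, so len(batch) is the number of samples in it
sizes_keep = [len(batch) for batch in keep_last]
sizes_drop = [len(batch) for batch in drop_last]

print(sizes_keep)  # the final batch is smaller than batch_size
print(sizes_drop)  # the incomplete final batch is skipped
```

Dropping the last batch is mainly useful when later code assumes every batch has exactly `batch_size` samples.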


In [1]:
import torch
from torch.utils.data import DataLoader, Dataset, TensorDataset

In [3]:

x = torch.arange(12,dtype=torch.float16)
x

tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.],
       dtype=torch.float16)

The code x = torch.arange(12, dtype=torch.float16) creates a 1-dimensional tensor (a tensor with one axis) containing a sequence of numbers from 0 to 11, with each number represented as a 16-bit floating-point number (float16).

Breakdown of the Code:

1. `torch.arange(12)`:

This generates a sequence of numbers starting from `0` up to but not including `12`.

The sequence is: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]`.

2. `dtype=torch.float16`:

This specifies the data type of the tensor as `float16` (16-bit floating-point).

`float16` is a lower-precision floating-point format that uses less memory compared to `float32` or `float64`.
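
The memory and precision trade-off can be checked directly. This is a small sketch; the comparison with `float32` and the value `2049.0` are illustrative choices, not part of the original notebook.

```python
import torch

x16 = torch.arange(12, dtype=torch.float16)
x32 = torch.arange(12, dtype=torch.float32)

# element_size() is the number of bytes per element; nelement() is the element count
bytes16 = x16.element_size() * x16.nelement()
bytes32 = x32.element_size() * x32.nelement()
print(bytes16, bytes32)

# float16 has an 11-bit significand, so above 2048 only even integers are representable
rounded = torch.tensor(2049.0, dtype=torch.float16)
print(rounded.item())
```

The small integers 0 to 11 used here are all represented exactly in `float16`, so the lower precision does not affect this example.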

A DataLoader is an iterable, so we can loop over it to visit the individual examples in the dataset.

In [4]:
data_loader = DataLoader(x)

In [7]:
for item in data_loader:
    print(item)

tensor([0.], dtype=torch.float16)
tensor([1.], dtype=torch.float16)
tensor([2.], dtype=torch.float16)
tensor([3.], dtype=torch.float16)
tensor([4.], dtype=torch.float16)
tensor([5.], dtype=torch.float16)
tensor([6.], dtype=torch.float16)
tensor([7.], dtype=torch.float16)
tensor([8.], dtype=torch.float16)
tensor([9.], dtype=torch.float16)
tensor([10.], dtype=torch.float16)
tensor([11.], dtype=torch.float16)
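
Each item printed above is a one-element tensor because `batch_size` defaults to 1, so every sample is wrapped in a batch of its own. The sketch below compares that default with `batch_size=None`, which turns automatic batching off. It recreates `x` so that it runs on its own.

```python
import torch
from torch.utils.data import DataLoader

x = torch.arange(12, dtype=torch.float16)

# Default behaviour: automatic batching with batch_size=1
batched = next(iter(DataLoader(x)))
print(batched.shape)

# batch_size=None disables automatic batching, so samples come back unbatched
unbatched = next(iter(DataLoader(x, batch_size=None)))
print(unbatched.shape)
```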


# Creating batches of data

In [8]:
data_loader = DataLoader(x, batch_size=4, shuffle=True)

What is a DataLoader?
A DataLoader is a PyTorch utility that helps you efficiently load and iterate over your data in batches. It is commonly used during training or evaluation of machine learning models.

Parameters Explained:

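1. `x`:

* This is the dataset you want to load. It is usually an instance of a `Dataset` class (a custom dataset or a built-in one such as `TensorDataset`), but any object that supports indexing and `len()` works, which is why the plain tensor `x` is accepted here.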

* Example: If `x` is a list of tensors or a custom dataset, the `DataLoader` will iterate over it.

2. `batch_size=4`:

* This specifies the number of samples to load in each batch.

* Example: If your dataset has 100 samples and batch_size=4, the DataLoader will create 25 batches, each containing 4 samples.

3. `shuffle=True`:

* This determines whether the data should be shuffled before creating batches.

* If `shuffle=True`, the data will be randomly shuffled at the beginning of each epoch (one full pass over the entire dataset).

* Shuffling is typically used during training to ensure the model doesn't learn any unintended patterns from the order of the data.
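
The parameters above can be checked with a short sketch. The 100-sample tensor `samples` and the helper `first_batch` are hypothetical names introduced for illustration. Passing a seeded `torch.Generator` is one way to make the shuffle repeatable and is an addition to the original example.

```python
import math
import torch
from torch.utils.data import DataLoader

samples = torch.arange(100)  # hypothetical dataset of 100 samples

loader = DataLoader(samples, batch_size=4, shuffle=True)
n_batches = len(loader)  # number of batches in one epoch
print(n_batches, math.ceil(100 / 4))

# One epoch visits every sample exactly once, only in a shuffled order
seen = torch.cat(list(loader))
covers_all = torch.equal(seen.sort().values, samples)
print(covers_all)

def first_batch(seed):
    """Return the first shuffled batch drawn with a generator seeded by `seed`."""
    g = torch.Generator().manual_seed(seed)
    seeded = DataLoader(samples, batch_size=4, shuffle=True, generator=g)
    return next(iter(seeded))

# The same seed gives the same shuffle, which helps when debugging
same_order = torch.equal(first_batch(0), first_batch(0))
print(same_order)
```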

# Example: Using DataLoader with a TensorDataset

Let's say you have a dataset of input features X and corresponding labels y. You can use TensorDataset to wrap them into a dataset and then create a DataLoader.

In [11]:
import torch
from torch.utils.data import DataLoader, TensorDataset

# Example data
X = torch.tensor([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]], dtype=torch.float32)
y = torch.tensor([0, 1, 0, 1, 0, 1], dtype=torch.float32)

# Create a TensorDataset
dataset = TensorDataset(X, y)

# Create a DataLoader
data_loader = DataLoader(dataset, batch_size=4, shuffle=True)

# Iterate over the DataLoader
for batch in data_loader:
    inputs, labels = batch
    print("Inputs:", inputs)
    print("Labels:", labels)

Inputs: tensor([[ 9., 10.],
        [11., 12.],
        [ 7.,  8.],
        [ 1.,  2.]])
Labels: tensor([0., 1., 1., 0.])
Inputs: tensor([[3., 4.],
        [5., 6.]])
Labels: tensor([1., 0.])


# Output:

Since shuffle=True, the order of the samples will be different each time you run the code. The output above is one example.

# What Happens Under the Hood? 

* The DataLoader takes the dataset (here the `TensorDataset` named `dataset`) and splits it into batches of at most `batch_size=4` samples.

* If shuffle=True, it randomly shuffles the data before creating batches.

* During iteration, it returns one batch at a time, containing both the input features and labels.
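
The steps above can be imitated by hand. This is a conceptual sketch only, not how `DataLoader` is actually implemented: it shuffles indices with `torch.randperm`, cuts them into chunks, and gathers the matching rows.

```python
import torch

X = torch.tensor([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]], dtype=torch.float32)
y = torch.tensor([0, 1, 0, 1, 0, 1], dtype=torch.float32)
batch_size = 4

# 1. Shuffle: draw a random permutation of the sample indices
perm = torch.randperm(len(X))

# 2. Batch: cut the permutation into chunks of at most batch_size indices
index_batches = perm.split(batch_size)

# 3. Collate: gather the features and labels for each chunk of indices
batches = [(X[idx], y[idx]) for idx in index_batches]

for inputs, labels in batches:
    print(inputs.shape, labels.shape)
```

Because the same indices select rows from both `X` and `y`, every input stays paired with its own label after shuffling.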

# Key Points:

* Batching: The DataLoader splits the dataset into smaller batches for efficient processing.

* Shuffling: Shuffling reduces the chance that the model picks up patterns that come only from the order of the data.

* Flexibility: You can use DataLoader with any dataset, whether it's a custom dataset or a built-in one.

# Common Use Cases:

* Training: Use shuffle=True to shuffle the data at the beginning of each epoch.

* Evaluation: Use shuffle=False to keep the data in its original order.

* Large Datasets: Use num_workers to load data in parallel for faster processing.
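
A minimal sketch of the training and evaluation setups listed above. The toy tensors and the names `train_loader` and `eval_loader` are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data (assumption): 6 samples with 2 features each
X = torch.arange(12, dtype=torch.float32).reshape(6, 2)
y = torch.tensor([0, 1, 0, 1, 0, 1], dtype=torch.float32)
dataset = TensorDataset(X, y)

# Training: reshuffle the samples at every epoch
train_loader = DataLoader(dataset, batch_size=4, shuffle=True)

# Evaluation: fixed order, so outputs line up with the original rows.
# num_workers=0 (the default) loads data in the main process; a larger
# value starts that many worker subprocesses to load batches in parallel.
eval_loader = DataLoader(dataset, batch_size=4, shuffle=False, num_workers=0)

# Concatenating the evaluation batches reproduces the labels in order
eval_labels = torch.cat([labels for _, labels in eval_loader])
print(torch.equal(eval_labels, y))
```

Whether extra workers actually speed things up depends on how expensive `__getitem__` is. For small in-memory tensors like these, the default is usually sufficient.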

# Summary:

* The line `data_loader = DataLoader(x, batch_size=4, shuffle=True)` creates a DataLoader that:

  * Iterates over the dataset `x`.

  * Splits the data into batches of size 4.

  * Shuffles the data before creating batches.

This is a standard way to prepare your data for training or evaluation in PyTorch!

# Output:

Since shuffle=True, the order of the batches will be different each time you run the code. Here’s an example output:

In [12]:
for i , batch in enumerate(data_loader):
    print(f'Batch : {i}', batch)

Batch : 0 [tensor([[ 7.,  8.],
        [ 3.,  4.],
        [ 1.,  2.],
        [11., 12.]]), tensor([1., 1., 0., 1.])]
Batch : 1 [tensor([[ 5.,  6.],
        [ 9., 10.]]), tensor([0., 0.])]


# What Does This Code Do?

1. `enumerate(data_loader)`:

* The `enumerate` function adds an index (`i`) to each batch as you iterate over the `DataLoader`.

* It returns a tuple `(i, batch)`, where:

  * i is the batch index (starting from 0).

  * batch is the actual batch of data.

2. `print(f'Batch : {i}', batch)`:

* This prints the batch index (`i`) and the corresponding batch of data.

Explanation of the Output:

1. Batch 0:

* Contains 4 samples (since `batch_size=4`).

* The input features (`X`) are:

```python
tensor([[ 7.,  8.],
        [ 3.,  4.],
        [ 1.,  2.],
        [11., 12.]])
```
* The corresponding labels (`y`) are:
```python
tensor([1., 1., 0., 1.])
```

2. Batch 1:

* Contains the remaining 2 samples (since the dataset has 6 samples in total).

* The input features (`X`) are:

```python
tensor([[ 5.,  6.],
        [ 9., 10.]])
```
* The corresponding labels (`y`) are:

```python
tensor([0., 0.])
```

In [13]:
X.shape , y.shape

(torch.Size([6, 2]), torch.Size([6]))

In [16]:
dataset = TensorDataset(X, y)  # X and y are already tensors; combine them into a dataset
data_loader = DataLoader(dataset, batch_size=5)  # no shuffle, so the original order is kept
for i, batch in enumerate(data_loader):
    print(f'Batch: {i} \n X: {batch[0]} , \n y: {batch[1]}')

Batch: 0 
 X: tensor([[ 1.,  2.],
        [ 3.,  4.],
        [ 5.,  6.],
        [ 7.,  8.],
        [ 9., 10.]]) , 
 y: tensor([0., 1., 0., 1., 0.])
Batch: 1 
 X: tensor([[11., 12.]]) , 
 y: tensor([1.])
