## **Why we need this**

The training loop we wrote earlier has a **big flaw**:
```py
for epoch in range(epochs):

  # Forward pass (call the model directly rather than model.forward so hooks run)
  y_pred = model(x_train_tensor)

  # Compute the loss
  loss = loss_function(y_pred.squeeze(), y_train_tensor)

  # Zero the gradients before the backward pass so they do not accumulate across epochs
  optimizer.zero_grad()

  # Backward pass (backpropagation)
  loss.backward()

  # Update weights and biases using the optimizer
  optimizer.step()

  # Print the loss
  print(f"Epoch:{epoch + 1}, Loss:{loss.item()}")
  print('='*60)

```

In the code above, every epoch makes a single pass over the entire dataset (`x_train_tensor`) at once.

- This is inefficient because it is Batch Gradient Descent: the whole dataset must fit in memory for each forward and backward pass.
- Convergence is slow because the parameters are updated only once per epoch, after looking at all of the data.

---
**Solution**: Divide the data into batches and update the parameters once per batch, iterating over all the batches in each epoch. This is called **Mini-Batch Gradient Descent**.

**First attempt: nested loops over the data**

This works, but the approach has some problems:

- Gathering the data can be difficult. For example, an image dataset may be spread across multiple folders, one per category, and a plain loop over a tensor does not handle that.

- There is no place for transformations. For example, we may need to convert **RGB** images to **B/W** (grayscale) before training.

- There is no shuffling or sampling.

- Batch management and parallelization have to be written by hand.

So it works, but it is not a good way to do it.
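For reference, the nested-loop approach described above might look like the following sketch. The data, model, loss, and optimizer are placeholders invented for illustration (they are not the ones from the earlier notebook); the point is only the inner loop that slices the tensors into fixed, unshuffled batches.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for the training tensors used earlier.
x_train_tensor = torch.randn(100, 2)
y_train_tensor = (x_train_tensor[:, 0] > 0).float()

model = nn.Linear(2, 1)                  # placeholder model
loss_function = nn.BCEWithLogitsLoss()   # placeholder loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epochs = 3
batch_size = 32
n_samples = x_train_tensor.shape[0]
n_updates = 0

for epoch in range(epochs):
    # Inner loop: walk over the data in fixed, unshuffled slices.
    for start in range(0, n_samples, batch_size):
        x_batch = x_train_tensor[start:start + batch_size]
        y_batch = y_train_tensor[start:start + batch_size]

        y_pred = model(x_batch)
        loss = loss_function(y_pred.squeeze(1), y_batch)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        n_updates += 1   # one parameter update per mini-batch

    print(f"Epoch:{epoch + 1}, Loss:{loss.item():.4f}")
```

With 100 samples and a batch size of 32, each epoch performs four updates (three full batches and one of 4 samples) instead of one. Everything else, such as shuffling, transformations, and parallel loading, would still have to be written by hand.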

---

**To solve these problems, PyTorch provides the Dataset and DataLoader classes.**

## **How Dataset & DataLoader Work**

Dataset and DataLoader are core abstractions in PyTorch that decouple how you define your data from how you efficiently iterate over it in training loops.

**Dataset Class**

The Dataset class is essentially a blueprint. When you create a custom Dataset, you decide how data is loaded and returned.

It defines:

- `__init__()`, which defines how the data is loaded.

- `__len__()`, which returns the total number of samples. The DataLoader uses it, together with the batch size, to work out the number of batches.

- `__getitem__(index)`, which returns the data (and label) at the given index.

---

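The Dataset class handles loading (reading) the data and keeps track of where each sample lives.

**Dataset is an abstract class.**

A custom map-style dataset must subclass it and implement `__getitem__()`. `__len__()` is also expected by the DataLoader's default samplers, and `__init__()` is where we set up the data, so in practice we write all three methods in our custom dataset class.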

**DataLoader Class**

The DataLoader wraps a Dataset and handles batching, shuffling, and parallel loading for you.

DataLoader Control Flow:

- At the start of each epoch, the DataLoader shuffles the indices using a sampler (if `shuffle=True`).

- It divides the indices into chunks of `batch_size`.

- For each index in a chunk, the sample is fetched from the Dataset object by calling `__getitem__`, which returns an item by index.

- The samples are then collected and combined into a batch (using `collate_fn`).

- The batch is returned to the main training loop.
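The control flow above can be reproduced by hand with the same building blocks that ship in `torch.utils.data`. This is only a sketch of the idea on a made-up 10-sample dataset, not the DataLoader's actual implementation; `TensorDataset` is used here as a ready-made Dataset to keep the example short.

```python
import torch
from torch.utils.data import (
    BatchSampler, RandomSampler, TensorDataset, default_collate
)

# Toy data: 10 samples with 2 features each (illustrative only).
features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

# Step 1: the sampler yields shuffled indices.
sampler = RandomSampler(dataset)

# Step 2: the batch sampler chunks them into lists of batch_size indices.
batch_sampler = BatchSampler(sampler, batch_size=4, drop_last=False)

batch_sizes = []
for batch_indices in batch_sampler:
    # Step 3: fetch each sample through Dataset.__getitem__.
    samples = [dataset[i] for i in batch_indices]
    # Step 4: combine the list of (feature, label) pairs into one batch.
    batch_features, batch_labels = default_collate(samples)
    # Step 5: this is what the training loop would receive.
    batch_sizes.append(batch_features.shape[0])
    print(batch_indices, batch_features.shape, batch_labels.shape)
```

With 10 samples and `batch_size=4`, the loop produces batches of 4, 4, and 2 samples; the order of the indices changes from run to run because of the random sampler.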

---

The DataLoader creates and extracts batches from the data that the Dataset has loaded, which is what gives us mini-batches.

In [1]:
from sklearn.datasets import make_classification
import torch

In [2]:
# Create a dummy classification dataset
x , y = make_classification(
    n_samples=100,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_classes=2,
    random_state=42
)

In [3]:
x

array([[ 0.55942643,  2.38869353],
       [ 1.31217492, -0.7173148 ],
       [-1.5598485 , -1.92487377],
       [-2.2813861 , -0.1368559 ],
       [ 1.56070438, -0.42795824],
       [-0.80804463,  1.19664076],
       [-0.27062383, -2.25553963],
       [ 0.480502  ,  0.54914434],
       [-1.20757158, -1.26898369],
       [ 0.25415746, -1.79532002],
       [ 2.59123946,  0.24472415],
       [ 0.07123641,  0.49429823],
       [-1.17762637, -1.20592943],
       [ 0.93343952,  0.68811892],
       [ 1.65214494, -0.35885569],
       [-1.40735658, -1.56826626],
       [ 1.02255619, -1.08324727],
       [-0.81680628, -0.6795874 ],
       [ 1.50575249, -0.38919817],
       [-2.17105282, -0.04862909],
       [ 0.71479373, -1.42922002],
       [-0.15013844, -0.11708689],
       [-1.4117586 , -1.5332749 ],
       [-2.58590856, -0.40925706],
       [ 0.82600732, -1.05383855],
       [-0.07133524,  0.08896214],
       [ 0.6273745 , -1.32933233],
       [ 1.65882246, -0.43131517],
       [ 1.2798899 ,

In [4]:
x.shape

(100, 2)

In [10]:
type(x)

numpy.ndarray

In [6]:
y

array([0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1,
       0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0])

In [7]:
y.shape

(100,)

In [9]:
type(y)

numpy.ndarray

In [12]:
# Convert them to tensors
x_tensor = torch.tensor(x, dtype= torch.float32)
y_tensor = torch.tensor(y, dtype= torch.long)

In [13]:
x_tensor

tensor([[ 0.5594,  2.3887],
        [ 1.3122, -0.7173],
        [-1.5598, -1.9249],
        [-2.2814, -0.1369],
        [ 1.5607, -0.4280],
        [-0.8080,  1.1966],
        [-0.2706, -2.2555],
        [ 0.4805,  0.5491],
        [-1.2076, -1.2690],
        [ 0.2542, -1.7953],
        [ 2.5912,  0.2447],
        [ 0.0712,  0.4943],
        [-1.1776, -1.2059],
        [ 0.9334,  0.6881],
        [ 1.6521, -0.3589],
        [-1.4074, -1.5683],
        [ 1.0226, -1.0832],
        [-0.8168, -0.6796],
        [ 1.5058, -0.3892],
        [-2.1711, -0.0486],
        [ 0.7148, -1.4292],
        [-0.1501, -0.1171],
        [-1.4118, -1.5333],
        [-2.5859, -0.4093],
        [ 0.8260, -1.0538],
        [-0.0713,  0.0890],
        [ 0.6274, -1.3293],
        [ 1.6588, -0.4313],
        [ 1.2799,  1.2590],
        [ 0.2506,  0.1398],
        [-0.0532,  1.8561],
        [-2.0583, -2.5234],
        [-2.0263,  0.0619],
        [-1.6583, -1.5713],
        [ 1.2801,  1.2894],
        [ 0.9642,  0

In [14]:
type(x_tensor)

torch.Tensor

In [15]:
y_tensor

tensor([0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0,
        1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1,
        1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0,
        0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1,
        1, 0, 0, 0])

In [16]:
type(y_tensor)

torch.Tensor

**Import the Dataset and DataLoader utility classes**

In [17]:
from torch.utils.data import Dataset, DataLoader

In [18]:
# Create a custom dataset class with 3 methods.

class CustomDataset(Dataset):

  # How data is loaded
  def __init__(self, features, labels):
    self.features = features
    self.labels = labels

  # Length of the data
  def __len__(self):
    # Number of rows
    return self.features.shape[0]

  # get item at index
  def __getitem__(self, index):
    # return item at that index
    return self.features[index], self.labels[index]

In [19]:
# Object of that Class

# Feature and Label
dataset = CustomDataset(x_tensor, y_tensor)

In [20]:
dataset

<__main__.CustomDataset at 0x7c344dbea960>

In [21]:
len(dataset)

100

In [22]:
dataset[3]

(tensor([-2.2814, -0.1369]), tensor(0))

**Data Loader Class**

In [23]:
"""
We don't need to write our own DataLoader class.
Just create an instance of it and pass the dataset, the batch size, and shuffle=True or False.
"""

# dataloader = DataLoader(dataset_object, batch_size, Shuffling)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

In [24]:
dataloader

<torch.utils.data.dataloader.DataLoader at 0x7c344cfc0d40>

In [25]:
"""
Extract Data with data loader
"""

# Simply run a loop and extract info which is there in data loader, We have features and labels

for feature,label in dataloader:
  print(feature)
  print(label)
  print('-'*50)

tensor([[ 1.9852, -0.0219],
        [-2.3280, -0.1759],
        [-0.2706, -2.2555],
        [-0.9192,  1.0617]])
tensor([1, 0, 1, 0])
--------------------------------------------------
tensor([[ 0.8524, -1.1466],
        [ 0.4805,  0.5491],
        [-1.0063, -1.1405],
        [-2.0263,  0.0619]])
tensor([1, 1, 0, 0])
--------------------------------------------------
tensor([[-0.5083,  1.4805],
        [-1.5598, -1.9249],
        [-1.4118, -1.5333],
        [-1.5445, -1.5104]])
tensor([0, 0, 0, 0])
--------------------------------------------------
tensor([[2.5912, 0.2447],
        [0.8620, 1.3597],
        [1.5331, 1.7415],
        [0.2506, 0.1398]])
tensor([1, 1, 1, 1])
--------------------------------------------------
tensor([[-1.1581,  0.8656],
        [ 1.1733,  0.7364],
        [-0.5750, -0.3751],
        [ 1.6521, -0.3589]])
tensor([0, 1, 0, 1])
--------------------------------------------------
tensor([[ 0.9642,  0.5560],
        [ 1.8399,  2.3045],
        [-1.2172, -1.3672],
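Putting the pieces together, the training loop from the start of this section can be rewritten around a DataLoader so that the parameters are updated once per mini-batch. The sketch below is self-contained, so it uses synthetic data, the built-in `TensorDataset` in place of `CustomDataset`, and a placeholder linear model with `BCEWithLogitsLoss`; none of these come from the original notebook.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for the 100-sample, 2-feature dataset above.
x_tensor = torch.randn(100, 2)
y_tensor = (x_tensor[:, 0] + x_tensor[:, 1] > 0).float()

dataset = TensorDataset(x_tensor, y_tensor)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

model = nn.Linear(2, 1)                  # placeholder model
loss_function = nn.BCEWithLogitsLoss()   # placeholder loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epochs = 5
for epoch in range(epochs):
    epoch_loss = 0.0
    for batch_features, batch_labels in dataloader:
        y_pred = model(batch_features)
        loss = loss_function(y_pred.squeeze(1), batch_labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()   # one update per mini-batch

        epoch_loss += loss.item()

    print(f"Epoch:{epoch + 1}, Avg loss:{epoch_loss / len(dataloader):.4f}")
```

Each epoch now performs `len(dataloader)` updates (25 here) instead of one, and the batches are reshuffled every epoch.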

## **Where to Apply Transformation in this**

We do this in the Dataset class's `__getitem__` method: the transformation is applied to the sample just before it is returned.

What kinds of transformation? Some examples:

1. For Images:

- resize
- b/w to rgb and vice versa
- data augmentation
- etc etc...

2. For Text:

- lower case
- lemmatization
- stop word removal
- etc etc...
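As a minimal sketch of this idea, the dataset below accepts an optional `transform` callable and applies it inside `__getitem__`. The class name `TransformedDataset` and the `scale_to_unit_range` transform are hypothetical, chosen only to illustrate where the transformation goes.

```python
import torch
from torch.utils.data import Dataset


class TransformedDataset(Dataset):
    """Hypothetical dataset that applies an optional transform per sample."""

    def __init__(self, features, labels, transform=None):
        self.features = features
        self.labels = labels
        self.transform = transform  # any callable: sample -> sample

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, index):
        x = self.features[index]
        if self.transform is not None:
            # Transform just before the row is returned.
            x = self.transform(x)
        return x, self.labels[index]


def scale_to_unit_range(x):
    """Illustrative transform: map pixel-like values in [0, 255] to [0, 1]."""
    return x / 255.0


features = torch.tensor([[0.0, 255.0], [51.0, 102.0]])
labels = torch.tensor([0, 1])
dataset = TransformedDataset(features, labels, transform=scale_to_unit_range)

x, y = dataset[1]
print(x, y)
```

Because the transform runs per sample at access time, the stored data stays untouched, and random augmentations can produce a different result each epoch.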

## **Parallelization**

The `num_workers` parameter controls how many worker processes load data in parallel.

Imagine the entire data loading and training process for one epoch with num_workers=4:

**Assumptions:**
- Total samples: 10,000
- Batch size: 32
- Workers (num_workers): 4
- Batches per epoch: 312 full batches plus one final partial batch of 16 samples (10000 / 32 = 312.5), so 313 in total unless `drop_last=True`.
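The batch count for these assumptions can be checked with a little arithmetic, and cross-checked against what a DataLoader reports. The dummy all-zeros dataset below exists only to give the DataLoader something of the right length.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

n_samples, batch_size = 10_000, 32

full_batches, leftover = divmod(n_samples, batch_size)
total_batches = math.ceil(n_samples / batch_size)
print(full_batches, leftover, total_batches)   # 312 16 313

# Cross-check against what a DataLoader reports for a dummy dataset.
dummy = TensorDataset(torch.zeros(n_samples, 1))
keep_last = DataLoader(dummy, batch_size=batch_size)
drop_last = DataLoader(dummy, batch_size=batch_size, drop_last=True)
print(len(keep_last), len(drop_last))          # 313 312
```

So an epoch has 313 iterations by default: 312 full batches and one final batch of 16 samples, which is dropped only when `drop_last=True`.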

## **Workflow**

**1. Sampler and Batch Creation (Main Process):**

At the start of the epoch, the DataLoader’s sampler generates a shuffled order of all 10,000 indices. These are grouped into 312 batches of 32 indices each, plus a final batch of 16. The batches of indices are then handed out to the workers as the epoch progresses.

**2. Parallel Data Loading (Workers):**

At the start of the training epoch, you run a training loop like:
```py
for batch_data, batch_labels in dataloader:
  # Training logic
```


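Under the hood, as soon as you start iterating over `dataloader`, it dispatches the first four batches of indices to the four workers (by default it also queues further batches per worker for prefetching, controlled by `prefetch_factor`):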

- Worker #1 loads batch 1 (indices [batch_1_indices])
- Worker #2 loads batch 2 (indices [batch_2_indices])
- Worker #3 loads batch 3 (indices [batch_3_indices])
- Worker #4 loads batch 4 (indices [batch_4_indices])

*Each worker:*

- Fetches the corresponding samples by calling `__getitem__` on the dataset for each index in that batch.

- Applies any defined transforms and passes the samples through `collate_fn` to form a single batch tensor.

**3. First Batch Returned to Main Process:**

As soon as a worker finishes, it sends its fully prepared batch back to the main process. By default the DataLoader yields batches in the order they were requested, so batch 1 is the first one your loop sees.

As soon as the main process has this first prepared batch, it yields it to your training loop, so `for batch_data, batch_labels in dataloader:` receives `(batch_data, batch_labels)` for the first batch.


**4. Model Training on the Main Process:**

While you are now performing the forward pass, computing loss, and doing backpropagation on the first
batch, the other three workers are still preparing their batches in parallel.

By the time you finish updating your model parameters for the first batch, the DataLoader likely has the
second, third, or even more batches ready to go (depending on processing speed and hardware).


**5. Continuous Processing:**

As soon as a worker finishes its batch, it grabs the next batch of indices from the queue.

For example: after Worker #1 finishes with batch 1, it immediately starts on batch 5. After Worker #2
finishes batch 2, it takes batch 6, and so forth.

This creates a pipeline effect: at any given moment, up to 4 batches are being prepared concurrently.

**6. Loop Progression:**

Your training loop simply sees:
```python
for batch_data, batch_labels in dataloader:
    # forward pass
    # loss computation
    # backward pass
    # optimizer step
```

Each iteration, it gets a new, ready-to-use batch without long I/O waits, because the workers have been preloading and processing data in parallel.

**7. End of the Epoch:**

After 313 iterations (312 full batches plus the final partial one), all batches have been processed. All indices have been consumed, so the DataLoader has no more batches to yield.

The epoch ends. If shuffle=True, on the next epoch, the sampler reshuffles indices, and the whole process
repeats with workers again loading data in parallel.
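A minimal sketch of switching on parallel loading is shown below. `SquaresDataset` and `count_batches` are hypothetical names used for illustration, and the dataset is deliberately trivial; in a real project `__getitem__` would do the slow I/O or preprocessing that makes extra workers worthwhile.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class SquaresDataset(Dataset):
    """Hypothetical dataset; __getitem__ stands in for slow I/O or preprocessing."""

    def __init__(self, n_samples):
        self.n_samples = n_samples

    def __len__(self):
        return self.n_samples

    def __getitem__(self, index):
        x = torch.tensor([float(index)])
        return x, x ** 2


def count_batches(num_workers):
    """Iterate one epoch and return the number of batches yielded."""
    loader = DataLoader(
        SquaresDataset(1_000),
        batch_size=32,
        shuffle=True,
        num_workers=num_workers,  # 0 means load in the main process
    )
    n_batches = 0
    for batch_data, batch_labels in loader:
        n_batches += 1  # training logic would go here
    return n_batches


if __name__ == "__main__":
    # The guard matters on platforms that start workers with "spawn"
    # (e.g. Windows and macOS), where each worker re-imports this module.
    print(count_batches(num_workers=0))
    print(count_batches(num_workers=4))
```

The training loop itself does not change; only the `num_workers` argument does. Both calls yield the same number of batches, and whether the parallel version is actually faster depends on how expensive `__getitem__` is relative to the cost of starting the workers.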

---


## **How does shuffling Happen in Data Loader**

Shuffling is handled by a sampler.

In PyTorch, the sampler in the DataLoader determines the strategy for selecting samples from the dataset during data loading. It controls how indices of the dataset are drawn for each batch.

**Types of Samplers**

PyTorch provides several predefined samplers, and you can create custom ones:
1. **SequentialSampler**:

- Samples elements sequentially, in the order they appear in the dataset.

- Default when shuffle=False.

- Should be used when working with time-series data, where the order of samples matters.

2. **RandomSampler**:

- Samples elements randomly without replacement.

- Default when shuffle=True.

We can also create custom sampling strategies, for example when the drawn samples must follow a particular distribution.
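One built-in way to make the drawn samples follow a chosen distribution is `WeightedRandomSampler`. The sketch below uses a made-up imbalanced dataset (90 samples of class 0, 10 of class 1) and weights each sample by the inverse frequency of its class, so that both classes are drawn about equally often. The inverse-frequency weighting is one common choice, assumed here for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Imbalanced toy labels: 90 samples of class 0, 10 of class 1.
labels = torch.cat([torch.zeros(90, dtype=torch.long),
                    torch.ones(10, dtype=torch.long)])
features = torch.randn(100, 2)
dataset = TensorDataset(features, labels)

# Weight each sample by the inverse frequency of its class.
class_counts = torch.bincount(labels)            # tensor([90, 10])
sample_weights = 1.0 / class_counts[labels].float()

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(dataset),
    replacement=True,   # needed so minority samples can repeat
)

# shuffle must stay False (the default) when a sampler is given.
loader = DataLoader(dataset, batch_size=10, sampler=sampler)

drawn_labels = torch.cat([batch_labels for _, batch_labels in loader])
print(torch.bincount(drawn_labels, minlength=2))
```

The printed counts vary from run to run, but they should be much closer to balanced than the original 90/10 split.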

## **Collate Function**

The collate_fn in PyTorch's DataLoader is a function that specifies how to combine a list of samples from a dataset into a single batch.

By default, the DataLoader uses a simple batch collation mechanism, but collate_fn allows you to customize how the data should be processed and batched.

**Why would we need a custom strategy for merging samples into batches?**

- When samples contain tensors of different sizes (for example, variable-length sequences), the default collation cannot stack them because their shapes do not match.
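A minimal sketch of a custom `collate_fn` for this situation is shown below. It pads every sequence in a batch to the length of the longest one using `torch.nn.utils.rnn.pad_sequence`. The dataset class, the `pad_collate` function, and the padding value of 0 are assumptions made for the example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class VariableLengthDataset(Dataset):
    """Hypothetical dataset whose samples are sequences of different lengths."""

    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, index):
        return self.sequences[index], self.labels[index]


def pad_collate(batch):
    """Pad every sequence in the batch to the length of the longest one."""
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    return padded, torch.tensor(labels), lengths


sequences = [torch.tensor([1, 2, 3]), torch.tensor([4]), torch.tensor([5, 6])]
labels = [0, 1, 0]
dataset = VariableLengthDataset(sequences, labels)

loader = DataLoader(dataset, batch_size=3, shuffle=False, collate_fn=pad_collate)
padded, batch_labels, lengths = next(iter(loader))
print(padded)
print(batch_labels, lengths)
```

Returning the original lengths alongside the padded tensor lets the model ignore the padding later, for example by masking.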

## **Data Loader Important Parameters**

The DataLoader class in PyTorch comes with several parameters that allow you to customize how data is loaded, batched, and preprocessed. Some of the most commonly used and important parameters include:

1. dataset (mandatory):
The Dataset from which the DataLoader will pull data.
For a map-style dataset, this is a subclass of `torch.utils.data.Dataset` that implements `__getitem__` and `__len__`.


2. batch_size: How many samples per batch to load. Default is 1. Larger batch sizes can speed up training on GPUs but require more memory.

3. shuffle: If True, the DataLoader will shuffle the dataset indices each epoch.
Helpful to avoid the model becoming too dependent on the order of samples.

4. num_workers:
The number of worker processes used to load data in parallel.
Setting num_workers > 0 can speed up data loading by leveraging multiple CPU
cores, especially if I/O or preprocessing is a bottleneck.

5. pin_memory:
If True, the DataLoader will copy tensors into pinned (page-locked) memory before
returning them.
This can improve GPU transfer speed and thus overall training throughput,
particularly on CUDA systems.


6. drop_last:
If True, the DataLoader will drop the last incomplete batch if the total number of samples is not divisible by the batch size.
Useful when exact batch sizes are required (for example, in some batch
normalization scenarios).

7. collate_fn:
A callable that processes a list of samples into a batch (the default simply stacks tensors).
Custom collate_fn can handle variable-length sequences, perform custom batching logic, or handle complex data structures.

8. sampler / batch_sampler:
`sampler` defines the strategy for drawing samples (e.g., for handling imbalanced classes, or custom sampling strategies). It cannot be combined with `shuffle=True`.
`batch_sampler` works at the batch level, controlling how batches are formed.
Typically, you don’t need to specify these if you are using `batch_size` and `shuffle`.
However, they provide lower-level control if you have advanced requirements.
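As a short illustration of several of these parameters used together, the sketch below configures a DataLoader on a made-up 100-sample dataset. The specific values are arbitrary choices for the example, and `pin_memory` is enabled only when a CUDA device is available.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 2), torch.randint(0, 2, (100,)))

loader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    num_workers=0,                          # raise to load in parallel
    pin_memory=torch.cuda.is_available(),   # only useful with a CUDA GPU
    drop_last=True,                         # discard the final batch of 4
)

batch_shapes = {tuple(features.shape) for features, _ in loader}
print(len(loader), batch_shapes)
```

With `drop_last=True`, the 100 samples yield 12 batches of exactly 8 samples each, and the remaining 4 samples are skipped for that epoch.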
