##  **1. Custom Dataset Class**

```
torch/
    __init__.py
    utils/
        __init__.py
        data/
            __init__.py  # Makes 'data' a package and re-exports Dataset, DataLoader, random_split, ...
            dataset.py   # Defines the base Dataset class, Subset, and random_split
            dataloader.py  # Defines the DataLoader class
            sampler.py   # Defines the Sampler classes
```


So, you can use: 

```python
from torch.utils.data import DataLoader, random_split
```


or 

```python
import torch.utils.data.dataset as dataset
import torch.utils.data.dataloader as dataloader

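# The same objects, reached through the submodules that define them:
dataset.random_split    # same function as torch.utils.data.random_split
dataloader.DataLoader   # same class as torch.utils.data.DataLoader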
```



You define your dataset by subclassing `torch.utils.data.Dataset` and overriding `__len__()` and `__getitem__()`.

---

In [15]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split, Dataset, Subset


class MyCustomDataSet(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

##  **2. `random_split`**

`torch.utils.data.random_split(dataset, lengths, generator=None)` splits a dataset into non-overlapping new datasets of the given lengths. `lengths` can be absolute sizes that sum to `len(dataset)` or, in recent PyTorch versions, fractions that sum to 1.

---


#### **2.1. Only Index-Based – No Data Copy**

* `random_split` **does not copy** the underlying data.
* It **wraps the original dataset** and uses a shuffled list of indices to define each subset.
* Memory usage is minimal because each split is just a view via `Subset`.

Example:


In [16]:
torch.manual_seed(42)
sample_size = 100
data = torch.randn(sample_size, 2)
labels = torch.randint(0, 2, (sample_size,))

dataset = MyCustomDataSet(data=data, labels=labels)

train_size = int(0.75*len(dataset))
val_size = int(0.15*len(dataset))
test_size = len(dataset)-train_size-val_size
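
The no-copy behaviour can be checked directly. The following is a minimal sketch that uses `TensorDataset` as a stand-in for the custom class (an assumption made only to keep the block self-contained); the variable names are illustrative.

```python
import torch
from torch.utils.data import TensorDataset, random_split

features = torch.randn(100, 2)
targets = torch.randint(0, 2, (100,))
full = TensorDataset(features, targets)

train, val = random_split(full, [80, 20], generator=torch.Generator().manual_seed(0))

# Each split is a Subset that keeps a reference to the same dataset object.
same_object = train.dataset is full and val.dataset is full

# Only index lists are stored per split; together they cover every sample exactly once.
all_indices = sorted(train.indices + val.indices)

# Reading through the split returns the row stored at the mapped index of the parent.
x0, y0 = train[0]
matches_source = torch.equal(x0, features[train.indices[0]])

print(same_object, matches_source, len(train), len(val))
```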

---

####  **2.2. Reproducibility with Generator**

To ensure reproducibility (same split every run), pass a seeded `torch.Generator`:

In [17]:
generator = torch.Generator().manual_seed(42)

train_dataset, val_dataset, test_dataset = random_split(
    dataset, [train_size, val_size, test_size], generator=generator)


If you don't pass a generator, `random_split` draws from PyTorch's global default generator, so the split changes from run to run unless you seed that generator with `torch.manual_seed`.
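
As a quick check of the reproducibility claim, the sketch below splits the same toy dataset twice with the same seed and compares the resulting index lists. The helper `split_indices` and the toy dataset are hypothetical names introduced for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

full = TensorDataset(torch.arange(20))


def split_indices(seed):
    """Return the training indices produced by a split seeded with `seed`."""
    gen = torch.Generator().manual_seed(seed)
    train, _ = random_split(full, [15, 5], generator=gen)
    return train.indices


# Two splits built from identically seeded generators select the same samples.
same_seed_matches = split_indices(42) == split_indices(42)
print(same_seed_matches)
```

Using a dedicated `torch.Generator` rather than the global seed keeps the split stable even if other code consumes random numbers before the split is made.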

---

####  **2.3. How It Works Internally**

* Internally, it:

  * Shuffles indices using the generator (if given),
  * Splits them into the specified sizes,
  * Creates `Subset(dataset, indices)` for each split.
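
The three steps above can be sketched as a small function. This is a simplified re-implementation for illustration, not PyTorch's actual source: the name `manual_random_split` is made up, and it only handles integer lengths.

```python
import torch
from torch.utils.data import Subset, TensorDataset


def manual_random_split(dataset, lengths, generator=None):
    """Simplified version of the steps above (integer lengths only)."""
    if sum(lengths) != len(dataset):
        raise ValueError("lengths must sum to len(dataset)")
    # Step 1: shuffle all indices.
    perm = torch.randperm(len(dataset), generator=generator).tolist()
    # Steps 2 and 3: slice the permutation and wrap each slice in a Subset.
    splits, offset = [], 0
    for length in lengths:
        splits.append(Subset(dataset, perm[offset:offset + length]))
        offset += length
    return splits


full = TensorDataset(torch.arange(10))
train, val, test = manual_random_split(
    full, [7, 2, 1], generator=torch.Generator().manual_seed(42))
print(len(train), len(val), len(test))
```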

---

####  **2.4. Common Pitfalls**

* **Don't modify the original dataset in-place** after splitting. The splits reference it.
* Be careful with imbalanced class distributions — `random_split` does **not** preserve class ratios.
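
To see how the class ratios turn out after a split, you can count the labels in each subset. The sketch below assumes a deliberately imbalanced toy dataset (90 samples of class 0, 10 of class 1); `class_counts` is a hypothetical helper, not a PyTorch function.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Imbalanced toy labels: 90 samples of class 0 and 10 of class 1.
labels = torch.cat([torch.zeros(90, dtype=torch.long),
                    torch.ones(10, dtype=torch.long)])
full = TensorDataset(torch.randn(100, 2), labels)

train, val = random_split(full, [80, 20], generator=torch.Generator().manual_seed(0))


def class_counts(subset, num_classes=2):
    """Count labels in a Subset by indexing the parent's label tensor."""
    return torch.bincount(labels[subset.indices], minlength=num_classes)


train_counts = class_counts(train)
val_counts = class_counts(val)
print(train_counts.tolist(), val_counts.tolist())
```

Nothing in `random_split` forces the minority class to appear in the validation split in the same 9:1 proportion, so the printed counts are worth inspecting before training.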

---


### Example

In [18]:
import torch
from torch.utils.data import DataLoader, Dataset, random_split


class MRIDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]


seed = 42
torch.manual_seed(seed)

data = torch.randint(low=0, high=20, size=(10,))
labels = torch.randint(low=0, high=4, size=(10,))

dataset = MRIDataset(data=data, labels=labels)
train_dataset, val_dataset, test_dataset = random_split(
    dataset, [0.7, 0.15, 0.15], generator=torch.Generator().manual_seed(seed))


for sample, label in train_dataset:  # 'sample' avoids overwriting the 'data' tensor
    print(sample.item(), label.item())

16 2
0 3
7 2
10 3
6 3
15 0
2 2
13 2

The training split holds 8 of the 10 samples rather than 7. With fractional lengths, each split first receives `floor(fraction * len(dataset))` samples (here 7, 1, and 1), and the leftover samples are then handed out one at a time in round-robin order starting with the first split.


## **3. `Subset`**

`Subset` creates a **view** of a dataset using a list of indices. It’s a wrapper that lets you work with just a portion of a dataset **without copying** the data.

```python
torch.utils.data.Subset(dataset, indices)
```
---


In [19]:

indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
subset = Subset(dataset, indices)

print(dataset[0])

subset.dataset   # Original dataset
subset.indices   # List of indices used

(tensor(2), tensor(2))


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


#### 3.1. **No Data Copy**

* Like `random_split`, `Subset` does **not duplicate data** — it just stores references (indices).
* It’s memory-efficient and fast.

#### 3.2. **How It Works**

* Internally, `Subset` defines `__getitem__` like this:

  ```python
  def __getitem__(self, idx):
      return self.dataset[self.indices[idx]]
  ```
* So each item access fetches from the original dataset using the provided index mapping.
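
A small sketch of this index mapping, using `TensorDataset` as an assumed stand-in for any map-style dataset. It also shows why modifying the parent in place is risky: the change is visible through the subset.

```python
import torch
from torch.utils.data import Subset, TensorDataset

values = torch.tensor([10, 11, 12, 13, 14])
full = TensorDataset(values)

# Position 0 of the subset maps to index 4 of the parent, position 1 to index 2, and so on.
subset = Subset(full, [4, 2, 0])

first = subset[0][0].item()    # reads parent index 4 -> 14
second = subset[1][0].item()   # reads parent index 2 -> 12

# Nothing was copied, so an in-place change to the parent shows up in the subset.
values[4] = 99
first_after_edit = subset[0][0].item()   # -> 99

print(first, second, first_after_edit)
```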

---

#### **3.3 Stratified Splits with scikit-learn**



Because `random_split` does not preserve class ratios, you can use scikit-learn's `StratifiedShuffleSplit` to compute label-balanced index arrays and then wrap those indices in `Subset`:

```python
from sklearn.model_selection import StratifiedShuffleSplit
from torch.utils.data import Subset

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
targets = dataset.targets  # Or dataset.labels depending on the dataset

for train_idx, val_idx in sss.split(X=targets, y=targets):
    train_ds = Subset(dataset, train_idx)
    val_ds = Subset(dataset, val_idx)
```
---

##  **4. ImageFolder**


In PyTorch, `torchvision.datasets.ImageFolder` is a utility class for loading image datasets arranged in a specific directory structure. It automatically assigns labels based on subdirectory names, making it ideal for classification tasks.

---

**Directory Structure**

`ImageFolder` expects the dataset directory to be structured like this:

```
root/
    class1/
        img1.png
        img2.png
        ...
    class2/
        img3.png
        img4.png
        ...
```

* Each **subfolder** under `root` is treated as a class.
* All images inside a class folder are treated as samples of that class.

---

**How It Works**

```python
from torchvision import datasets, transforms

# Define optional transforms (resizing, normalization, etc.)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])

# Load dataset
dataset = datasets.ImageFolder(root='path/to/root', transform=transform)
```

---

**Labels and Classes**

* `dataset.classes`: list of class names (e.g., `['cat', 'dog']`)
* `dataset.class_to_idx`: dict mapping class names to label indices (e.g., `{'cat': 0, 'dog': 1}`)
* Each sample is a tuple `(image, label)`; the image is a tensor when the transform includes `ToTensor()`, otherwise a PIL image

You can access an image and its label like this:

```python
img, label = dataset[0]
```
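
The label assignment can be illustrated without any images. The sketch below follows the documented behaviour (class folders are sorted by name and numbered from 0) but is a simplified stand-in, not torchvision's own code; `find_classes` here is a local helper and the folder names are hypothetical.

```python
import os
import tempfile


def find_classes(root):
    """Map each immediate subdirectory of `root` to an integer label, in sorted order."""
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    class_to_idx = {name: idx for idx, name in enumerate(classes)}
    return classes, class_to_idx


# Build a throwaway directory tree with hypothetical class names.
with tempfile.TemporaryDirectory() as root:
    for name in ("dog", "cat", "bird"):
        os.makedirs(os.path.join(root, name))
    classes, class_to_idx = find_classes(root)

print(classes)        # ['bird', 'cat', 'dog']
print(class_to_idx)   # {'bird': 0, 'cat': 1, 'dog': 2}
```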
---

##  **5. DataLoader**


The `DataLoader` handles batching, shuffling, and loading the data in parallel with multiple workers.

```python
import torch
from torch.utils.data import DataLoader

# MRIDataset is the custom Dataset class defined in the example above
dataset = MRIDataset(data=torch.randn(100, 3, 32, 32), labels=torch.randint(0, 10, (100,)))
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=2)
```

---

###  **Using the DataLoader in Training**

```python
for batch in loader:
    inputs, targets = batch
    # your training loop here
```
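
For a concrete picture of what the loop body might contain, here is a minimal sketch of one training epoch. The linear model, loss, optimizer, and random data are assumptions chosen only for illustration, and `num_workers` is left at its default of 0 so the block runs anywhere.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
dataset = TensorDataset(torch.randn(100, 8), torch.randint(0, 3, (100,)))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

model = nn.Linear(8, 3)                       # toy classifier, for illustration only
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

batch_sizes = []
for inputs, targets in loader:
    optimizer.zero_grad()                     # clear gradients from the previous step
    loss = criterion(model(inputs), targets)  # forward pass and loss
    loss.backward()                           # backpropagate
    optimizer.step()                          # update the weights
    batch_sizes.append(inputs.shape[0])

# 100 samples with batch_size=16 give six full batches and a final batch of 4.
print(batch_sizes)
```

Pass `drop_last=True` to the `DataLoader` if the smaller final batch is a problem, for example with layers that are sensitive to batch size.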

### Example

In [20]:

batch_size = 4
loader = DataLoader(dataset, batch_size=batch_size,
                    shuffle=True, num_workers=2, pin_memory=True)

for batch in loader:
    inputs, targets = batch   # the dataset returns (data, label), in that order
    print(inputs)
    print(targets)
    print('='*50)

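tensor([10, 16,  4, 15])
tensor([3, 2, 3, 0])
==================================================
tensor([13,  6, 14,  0])
tensor([2, 3, 2, 3])
==================================================
tensor([2, 7])
tensor([2, 2])
==================================================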


---

###  Alternative: Use Built-in Datasets

PyTorch's domain libraries ship ready-made datasets: `torchvision.datasets` for images, plus `torchtext`, `torchaudio`, and others.

Example with CIFAR-10:
```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor()
])

# Downloads CIFAR-10 to ./data on first use
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
```

---

## Standard Practice For Loading Dataset

The **common and correct** approach in PyTorch is:


#### **1. Keep ALL dataset tensors on CPU**

You do **not** move the raw dataset to GPU.

Example:

```python
x = torch.arange(0, 10, 0.5)   # created on the CPU
labels = func(x)               # func: your own labeling function; the result stays on the CPU
```

#### **2. Use DataLoader with:**

* `pin_memory=True`
* `num_workers > 0` (for speed)

```python
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    pin_memory=True
)
```

#### **3. Move batches to GPU inside the training loop**

This is where `.to(device)` is used:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for xb, yb in loader:
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
```
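
Putting the three steps together, a minimal end-to-end sketch might look like the following. The target function `2 * x + 1` is a placeholder, `num_workers` is kept at 0 so the block runs on any machine, and pinning is enabled only when a GPU is actually present; all of these are assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. The dataset tensors are created on the CPU and stay there.
x = torch.arange(0, 10, 0.5)
y = 2 * x + 1                      # placeholder target function
dataset = TensorDataset(x, y)

# 2. Pinning only helps when a GPU is present, so enable it conditionally.
loader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    num_workers=0,                 # kept at 0 for portability; raise it for real workloads
    pin_memory=torch.cuda.is_available(),
)

# 3. Only the current batch is moved to the device.
seen = 0
for xb, yb in loader:
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    seen += xb.shape[0]

print(seen, x.device)
```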

---

## Why is this the correct approach?

#### **Reason 1 — GPU memory is limited**

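Datasets can run to many gigabytes, while a typical GPU has roughly 4–24 GB of memory, which also has to hold the model, activations, and optimizer state.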

#### **Reason 2 — pin_memory=True speeds up CPU → GPU transfer**

Pinning applies only to CPU tensors.
Pinned (page-locked) host memory lets the GPU read the data via **DMA**, which is faster than copying from ordinary pageable memory.

#### **Reason 3 — DataLoader performs prefetching & asynchronous transfer**

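When you combine the two settings:

```python
pin_memory=True     # DataLoader argument
non_blocking=True   # argument to .to(device, ...)
```

two things can overlap:

* worker processes (`num_workers > 0`) load and collate the next batches
* while the GPU trains on the current batch, with copies from pinned memory issued asynchronously

This can speed up training noticeably when data loading or transfer is the bottleneck.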

#### **Reason 4 — You avoid the error**

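Only CPU tensors can be pinned. If the dataset already holds CUDA tensors, a `DataLoader` with `pin_memory=True` raises an error when it tries to pin a batch.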

---

## ❌ WRONG approach: moving the whole dataset to the GPU

```python
x = x.to(device)
labels = labels.to(device)
```

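The whole dataset now lives on the GPU, so:

* DataLoader cannot use pinned memory (`pin_memory=True` fails on CUDA tensors)
* GPU memory is taken up by data that the current batch does not need
* the asynchronous CPU → GPU transfer pipeline is bypassed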

---



