# **Transforms**
- Data does not always come in the final processed form required for training machine learning algorithms. We use **transforms** to manipulate the data and make it suitable for training.
- All TorchVision datasets have two parameters that accept callables containing the transformation logic:
    - `transform`, which modifies the **features**.
    - `target_transform`, which modifies the **labels**.
- The ***torchvision.transforms*** module offers several commonly used transforms out of the box.
- The _FashionMNIST_ features are in PIL Image format, and the labels are integers. For training, we need the features as normalized tensors, and the labels as one-hot encoded tensors. To make these transformations, we use `ToTensor` and `Lambda`.

In [1]:
!pip install torch
!pip install torchvision




In [2]:
import torch
from torchvision import datasets
from torchvision.transforms import ToTensor, Lambda

ds = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),  # normalized tensor
    target_transform=Lambda(lambda y: torch.zeros(10, dtype=torch.float).scatter_(0, torch.tensor(y), value=1)) # one hot encoding
)

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to data/FashionMNIST/raw/train-images-idx3-ubyte.gz


100%|██████████| 26421880/26421880 [00:03<00:00, 6837305.34it/s] 


Extracting data/FashionMNIST/raw/train-images-idx3-ubyte.gz to data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to data/FashionMNIST/raw/train-labels-idx1-ubyte.gz


100%|██████████| 29515/29515 [00:00<00:00, 140054.67it/s]


Extracting data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz


100%|██████████| 4422102/4422102 [00:01<00:00, 2568286.35it/s]


Extracting data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz


100%|██████████| 5148/5148 [00:00<00:00, 6497826.36it/s]

Extracting data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to data/FashionMNIST/raw






## **ToTensor()**
- `ToTensor()` converts a PIL image or NumPy `ndarray` into a `FloatTensor` and scales the image's pixel intensity values to the range [0., 1.].
## **Lambda Transforms**
- Lambda transforms apply any user-defined lambda function. Here, we define a function that turns the integer label into a one-hot encoded tensor. It first creates a zero tensor of size 10 (the number of labels in our dataset) and then calls `scatter_`, which assigns `value=1` at the index given by the label `y`.

In [3]:
target_transform = Lambda(lambda y: torch.zeros(
    10, dtype=torch.float).scatter_(dim=0, index=torch.tensor(y), value=1))

# **Transforming and Augmenting Images**
- Torchvision supports common computer vision transformations in the `torchvision.transforms` and `torchvision.transforms.v2` modules. Transforms can be used to transform or augment data for training or inference of different tasks (image classification, detection, segmentation, video classification).


In [4]:
# Image Classification
import torch
from torchvision.transforms import v2

H, W = 32, 32
img = torch.randint(0, 256, size=(3, H, W), dtype=torch.uint8)

transforms = v2.Compose([
    v2.RandomResizedCrop(size=(224, 224), antialias=True),
    v2.RandomHorizontalFlip(p=0.5),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = transforms(img)
img

tensor([[[-0.0116, -0.0116, -0.0116,  ...,  2.1804,  2.1804,  2.1804],
         [-0.0116, -0.0116, -0.0116,  ...,  2.1804,  2.1804,  2.1804],
         [-0.0116, -0.0116, -0.0116,  ...,  2.1804,  2.1804,  2.1804],
         ...,
         [ 0.8618,  0.8618,  0.8618,  ..., -0.6965, -0.6965, -0.6965],
         [ 0.8618,  0.8618,  0.8618,  ..., -0.6965, -0.6965, -0.6965],
         [ 0.8618,  0.8618,  0.8618,  ..., -0.6965, -0.6965, -0.6965]],

        [[ 1.7108,  1.7108,  1.7108,  ...,  2.2360,  2.2360,  2.2360],
         [ 1.7108,  1.7108,  1.7108,  ...,  2.2360,  2.2360,  2.2360],
         [ 1.7108,  1.7108,  1.7108,  ...,  2.2360,  2.2360,  2.2360],
         ...,
         [ 2.2885,  2.2885,  2.2885,  ..., -0.1800, -0.1800, -0.1800],
         [ 2.2885,  2.2885,  2.2885,  ..., -0.1800, -0.1800, -0.1800],
         [ 2.2885,  2.2885,  2.2885,  ..., -0.1800, -0.1800, -0.1800]],

        [[ 0.3219,  0.3219,  0.3219,  ..., -1.2467, -1.2467, -1.2467],
         [ 0.3219,  0.3219,  0.3219,  ..., -1

In [5]:
# Detection (re-using imports and transforms from above)
from torchvision import tv_tensors

img = torch.randint(0, 256, size=(3, H, W), dtype=torch.uint8)
boxes = torch.randint(0, H // 2, size=(3, 4))
boxes[:, 2:] += boxes[:, :2]
boxes = tv_tensors.BoundingBoxes(boxes, format="XYXY", canvas_size=(H, W))

# The same transforms can be used!
img, boxes = transforms(img, boxes)
# And you can pass arbitrary input structures
output_dict = transforms({"image": img, "boxes": boxes})
output_dict

{'image': tensor([[[ 2.3932,  2.3932,  2.4937,  ...,  1.5820,  1.4959,  1.4959],
          [ 2.3932,  2.3932,  2.4937,  ...,  1.5820,  1.4959,  1.4959],
          [ 2.4118,  2.4118,  2.5087,  ...,  1.7523,  1.6626,  1.6626],
          ...,
          [-0.2803, -0.2803, -0.2302,  ..., -2.4676, -2.4496, -2.4496],
          [-0.2989, -0.2989, -0.2558,  ..., -2.5566, -2.5423, -2.5423],
          [-0.2989, -0.2989, -0.2558,  ..., -2.5566, -2.5423, -2.5423]],
 
         [[-1.8231, -1.8231, -1.8531,  ...,  1.6670,  1.7721,  1.7721],
          [-1.8231, -1.8231, -1.8531,  ...,  1.6670,  1.7721,  1.7721],
          [-1.7650, -1.7650, -1.7950,  ...,  1.8376,  1.9463,  1.9463],
          ...,
          [ 5.2504,  5.2504,  5.2429,  ...,  0.6986,  0.6986,  0.6986],
          [ 5.2891,  5.2891,  5.2891,  ...,  0.8342,  0.8342,  0.8342],
          [ 5.2891,  5.2891,  5.2891,  ...,  0.8342,  0.8342,  0.8342]],
 
         [[-1.1483, -1.1483, -1.1186,  ..., -4.3008, -4.4792, -4.4792],
          [-1.1483,

### Supported input types and conventions
- Tensor images are expected to be of shape (`C`, `H`, `W`), where `C` is the number of channels, and `H` and `W` are the height and width. Most transforms support batched tensor input. A batch of tensor images is a tensor of shape (`N`, `C`, `H`, `W`), where `N` is the number of images in the batch. The ***v2*** transforms generally accept an arbitrary number of leading dimensions (`...`, `C`, `H`, `W`) and can handle batched images or batched videos.

### Dtype and expected value range 
- The expected range of the values of a tensor image is implicitly defined by the tensor dtype. Tensor images with a float dtype are expected to have values in `[0, 1]`. Tensor images with an integer dtype are expected to have values in `[0, MAX_DTYPE]` where `MAX_DTYPE` is the largest value that can be represented in that dtype. Typically, images of dtype `torch.uint8` are expected to have values in `[0, 255]`.

### V1 or V2? Which one should I use?
- We recommend using the `torchvision.transforms.v2` transforms instead of those in `torchvision.transforms`. They are faster and support more use cases. In most cases, changing the import is all that is needed. Moving forward, new features and improvements will only be considered for the v2 transforms.
- Torchvision 0.15 (March 2023) introduced a new set of transforms in the `torchvision.transforms.v2` namespace. These transforms have several advantages over the v1 ones (in `torchvision.transforms`):
    - They can transform ***images*** but also ***bounding boxes***, ***masks***, or ***videos***. This provides support for tasks beyond _image classification_: _detection_, _segmentation_, _video classification_, etc.
    - They support more transforms like `CutMix` and `MixUp`.
    - They support arbitrary input structures (dicts, lists, tuples, etc.).
    - Future improvements and features will be added to the v2 transforms only.
- These transforms are fully backward compatible with the v1 ones, so if you are already using transforms from `torchvision.transforms`, all you need to do is update the import to `torchvision.transforms.v2`. The outputs may differ negligibly because of implementation differences.
### Performance Considerations
- We recommend the following guidelines to get the best performance out of the transforms:
    - Rely on the v2 transforms from `torchvision.transforms.v2`
    - Use tensors instead of PIL images.
    - Use `torch.uint8` dtype, especially for resizing.
    - Resize with bilinear or bicubic mode.
- ***Note:*** Resize transforms like `Resize` and `RandomResizedCrop` typically prefer __channels-last__ input and tend not to benefit from `torch.compile()` at this time.

In [6]:
from torchvision.transforms import v2
transforms = v2.Compose([
    v2.ToImage(),  # Convert to tensor, only needed if you had a PIL image
    v2.ToDtype(torch.uint8, scale=True),  # optional, most inputs are already uint8 at this point
    # ...
    v2.RandomResizedCrop(size=(224, 224), antialias=True),  # Or Resize(antialias=True)
    # ...
    v2.ToDtype(torch.float32, scale=True),  # Normalize expects float input
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
transforms

Compose(
      ToImage()
      ToDtype(scale=True)
      RandomResizedCrop(size=(224, 224), scale=(0.08, 1.0), ratio=(0.75, 1.3333333333333333), interpolation=InterpolationMode.BILINEAR, antialias=True)
      ToDtype(scale=True)
      Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], inplace=False)
)

## **Transform classes, functionals, and kernels**
- Transforms are available as classes like `Resize`, but also as functionals like `resize()` in the `torchvision.transforms.v2.functional` namespace. This is very much like the `torch.nn` package which defines both classes and functional equivalents in `torch.nn.functional`.
- The functionals support PIL images, pure tensors, or **TVTensors**, e.g. both `resize(image_tensor)` and `resize(boxes)` are valid.
