# Memory Format

## Channels Last Memory Format in PyTorch

Channels last memory format is an alternative way of ordering NCHW tensors in memory while preserving the order of dimensions. Channels last tensors are ordered so that channels become the densest dimension (that is, images are stored pixel by pixel).

For example, classic (contiguous) storage of an NCHW tensor (here, two 4x4 images with 3 color channels) looks like this:

![classic_memory_format](./figs/classic_memory_format.png)

Channels last memory format orders data differently:

![channels_last_memory_format](./figs/channels_last_memory_format.png)
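The difference between the two layouts can also be seen numerically. The following is a minimal sketch, assuming a small integer tensor filled with `torch.arange` (an illustrative choice, not part of the original figures), that prints the first few elements in physical memory order for each format.

```python
import torch

# Two 4x4 images with 3 channels, filled with 0..95 so that each value
# reveals its logical (n, c, h, w) position.
x = torch.arange(2 * 3 * 4 * 4).reshape(2, 3, 4, 4)
x_cl = x.contiguous(memory_format=torch.channels_last)

# Flattening a contiguous tensor walks memory in storage order.
classic_order = x.flatten()

# For a channels last tensor, permuting to (N, H, W, C) gives a view whose
# logical order matches the physical order, so flattening it shows the
# layout in memory.
channels_last_order = x_cl.permute(0, 2, 3, 1).flatten()

print(classic_order[:6].tolist())        # [0, 1, 2, 3, 4, 5]
print(channels_last_order[:6].tolist())  # [0, 16, 32, 1, 17, 33]
```

In the classic layout, neighbouring memory cells hold neighbouring pixels of the same channel. In channels last, they hold the three channel values of one pixel (0, 16, 32), followed by the three channel values of the next pixel.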


PyTorch supports memory formats (and provides backward compatibility with existing models, including eager, JIT, and TorchScript) by using the existing stride structure.
For example, a 10x3x16x16 batch in channels last format has strides equal to (768, 1, 48, 3).
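The stride values quoted above follow from the NHWC storage order. As a rough sketch, the helper `channels_last_strides` below (a hypothetical name introduced only for illustration) derives them from the shape and compares the result with what PyTorch reports.

```python
import torch


def channels_last_strides(n, c, h, w):
    """Strides (in elements) of an N x C x H x W tensor stored in NHWC order.

    n does not appear in the result: the batch stride depends only on the
    size of one image.
    """
    # Moving one step along N skips a whole image (H * W * C elements),
    # along C skips 1 element, along H skips one row of pixels (W * C),
    # and along W skips one pixel (C elements).
    return (h * w * c, 1, w * c, c)


expected = channels_last_strides(10, 3, 16, 16)
actual = torch.empty(10, 3, 16, 16).to(memory_format=torch.channels_last).stride()

print(expected)  # (768, 1, 48, 3)
print(actual)    # (768, 1, 48, 3)
```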

<div class="alert alert-info"><h4>Note</h4><p>NCHW stands for: batch N, channels C, height H, width W (5-D volumetric tensors add a depth dimension D, giving NCDHW). It describes the order in which a multidimensional array is laid out in memory, which can be viewed as a 1-D array.</p></div>





### Classic PyTorch contiguous tensor

In [12]:
import torch

N, C, H, W = 10, 3, 32, 32
x = torch.empty(N, C, H, W)
print(x.shape)  # Outputs: torch.Size([10, 3, 32, 32])
x.stride()  # Outputs: (3072, 1024, 32, 1)

torch.Size([10, 3, 32, 32])


(3072, 1024, 32, 1)

**Conversion operator**

In [14]:
x = x.to(memory_format=torch.channels_last)
print(x.shape)  # Outputs: torch.Size([10, 3, 32, 32]); the dimension order is preserved
x.stride()  # Outputs: (3072, 1, 96, 3)

torch.Size([10, 3, 32, 32])


(3072, 1, 96, 3)

**Back to contiguous**

In [15]:
x = x.to(memory_format=torch.contiguous_format)
x.stride()  # Outputs: (3072, 1024, 32, 1)

(3072, 1024, 32, 1)
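Besides inspecting strides, the format of a tensor can be checked with `is_contiguous`, which accepts a `memory_format` argument. The sketch below assumes a random tensor as stand-in data and also confirms that converting between formats changes only the layout, not the values.

```python
import torch

x = torch.randn(10, 3, 32, 32)
x_cl = x.to(memory_format=torch.channels_last)
x_back = x_cl.to(memory_format=torch.contiguous_format)

# The channels last tensor is contiguous only with respect to its own format.
print(x_cl.is_contiguous(memory_format=torch.channels_last))  # True
print(x_cl.is_contiguous())                                   # False

# Converting back restores the classic layout.
print(x_back.is_contiguous())                                 # True

# Values and logical indexing are unchanged by either conversion.
print(torch.equal(x, x_cl), torch.equal(x, x_back))           # True True
```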

## Performance Gain 

The most significant performance gains are observed on NVIDIA’s hardware with Tensor Core support running at reduced precision (`torch.float16`). We were able to achieve performance gains of over 22% with channels last compared with the contiguous format, in both cases using AMP (Automatic Mixed Precision) training scripts.
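To give a sense of what enabling channels last looks like in code, here is a minimal sketch. It assumes a tiny stand-in network instead of resnet50 and uses `torch.autocast` for mixed precision, which is not necessarily what the training script in this section does. It does not measure or reproduce the speedup quoted above, and on a machine without CUDA it simply runs in float32.

```python
import contextlib

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in for a real network such as resnet50.
model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU())
model = model.to(device).to(memory_format=torch.channels_last)

# Inputs must be converted as well, otherwise the layouts do not match.
inputs = torch.randn(4, 3, 32, 32, device=device)
inputs = inputs.to(memory_format=torch.channels_last)

# Assumption: mixed precision via torch.autocast on GPU only.
if device == "cuda":
    amp = torch.autocast(device_type="cuda", dtype=torch.float16)
else:
    amp = contextlib.nullcontext()

with torch.no_grad(), amp:
    out = model(inputs)

print(out.shape)  # torch.Size([4, 8, 32, 32])
```

Converting the model changes the layout of its 4-D convolution weights, and converting the inputs keeps both operands in the same format.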

### Launch command

**You need ImageNet to execute the code.**

In [1]:
!python code/main_amp.py -a resnet50 --b 200 --workers 16 --opt-level O2 --channels-last true ./data

opt_level = O2
keep_batchnorm_fp32 = None <class 'NoneType'>
loss_scale = None <class 'NoneType'>

CUDNN VERSION: 8204

=> creating model 'resnet50'
Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Traceback (most recent call last):
  File "code/main_amp.py", line 543, in <module>
    main()
  File "code/main_amp.py", line 207, in main
    train_dat

## Credits/links
- Vitaly Fedyunin <https://github.com/VitalyFedyunin>
- https://github.com/apache/incubator-mxnet/issues/5778
- https://oneapi-src.github.io/oneDNN/dev_guide_understanding_memory_formats.html