
What is a tensor in PyTorch?
- PyTorch tensors are pointers into allocated memory, with metadata (such as strides) describing how to reach each element.

In [1]:
import torch

In [3]:
x=torch.tensor([
    [0.,1,2,3],
    [4,5,6,7],
    [8,9,10,11],
    [12,13,14,15],])
x

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [12., 13., 14., 15.]])

In [7]:
## To go to the next row (dim 0), skip 4 elements in storage
assert x.stride(0)==4

In [8]:
## To go to the next column (dim 1), skip 1 element in storage
assert x.stride(1)==1

In [9]:
## To find an element's position in storage, combine its indices with the strides
r,c=1,2
index=r*x.stride(0)+c*x.stride(1)
index

6
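
As a quick check of the stride arithmetic above, the sketch below rebuilds the same 4x4 tensor and confirms that the computed flat index points at `x[r, c]`. It assumes the tensor is contiguous with a storage offset of 0, so that `flatten()` lists elements in storage order; the variable names `flat`, `value_from_index`, and `value_from_indexing` are illustrative.

```python
import torch

# Same values as the 4x4 tensor above: 0., 1., ..., 15. in row-major order
x = torch.arange(16, dtype=torch.float32).reshape(4, 4)

r, c = 1, 2
index = r * x.stride(0) + c * x.stride(1)

# For a contiguous tensor with storage offset 0, flatten() returns the
# elements in the same order as the underlying storage.
flat = x.flatten()
value_from_index = flat[index].item()
value_from_indexing = x[r, c].item()

print(index, value_from_index, value_from_indexing)
```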

In [10]:
x=torch.tensor([[1,2,3],[4,5,6]])


In [12]:
def same_storage(x: torch.Tensor, y: torch.Tensor) -> bool:
    """Return True if x and y are backed by the same underlying storage."""
    return x.untyped_storage().data_ptr() == y.untyped_storage().data_ptr()


In [13]:
x.untyped_storage().data_ptr()

338481472

In [21]:
## Get row 0
y=x[0]
print(f"X tensor memory allocation : {x.untyped_storage().data_ptr()}")
print(f"Y tensor memory allocation : {y.untyped_storage().data_ptr()}")
print(f"Both X and Y Tensor have same value : {torch.equal(y,x[0])}")
print(f"Both X and Y Tensor have same memory allocation : {same_storage(x,y)}")

X tensor memory allocation : 338481472
Y tensor memory allocation : 338481472
Both X and Y Tensor have same value : True
Both X and Y Tensor have same memory allocation : True


In [25]:
## Get Column 1
y=x[:,1]
print(torch.equal(y,torch.tensor([2,5])))
print(same_storage(x,y))

True
True
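
A column slice can share storage with the original tensor because a view only needs its own offset and stride. The sketch below inspects both for column 1 of the same 2x3 tensor; `col`, `col_offset`, and `col_stride` are illustrative names.

```python
import torch

x = torch.tensor([[1, 2, 3], [4, 5, 6]])
col = x[:, 1]  # view of column 1, no data is copied

# The view starts one element into the storage and steps over a full
# row of x (3 elements) to reach its next entry.
col_offset = col.storage_offset()
col_stride = col.stride()

print(col_offset, col_stride)
```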


In [29]:
## View the same storage as a 3x2 matrix
y=x.view(3,2)
print(y)
assert torch.equal(y,torch.tensor([[1,2],[3,4],[5,6]]))
print(same_storage(x,y))


tensor([[1, 2],
        [3, 4],
        [5, 6]])
True


In [32]:
x

tensor([[1, 2, 3],
        [4, 5, 6]])

In [36]:
## Transpose the matrix
y=x.transpose(1,0)
print(y)
print(torch.equal(y,torch.tensor([[1,4],[2,5],[3,6]])))
print(same_storage(x,y))

tensor([[1, 4],
        [2, 5],
        [3, 6]])
True
True


In [38]:
## Check that mutating x also mutates y
x[0][0]=100
print(x)
print(y)

tensor([[100,   2,   3],
        [  4,   5,   6]])
tensor([[100,   4],
        [  2,   5],
        [  3,   6]])
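
If this aliasing is not wanted, one option is to copy the view with `clone()`. The sketch below contrasts a view with a cloned copy; `y_view` and `y_copy` are illustrative names.

```python
import torch

x = torch.tensor([[1, 2, 3], [4, 5, 6]])
y_view = x.transpose(1, 0)          # shares storage with x
y_copy = x.transpose(1, 0).clone()  # owns a separate copy of the data

x[0][0] = 100

view_value = y_view[0, 0].item()  # follows the mutation of x
copy_value = y_copy[0, 0].item()  # keeps the original value

print(view_value, copy_value)
```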


In [39]:
x=torch.tensor([[1,2,3],[4,5,6]])
y=x.t()
print(x)
print(y)

tensor([[1, 2, 3],
        [4, 5, 6]])
tensor([[1, 4],
        [2, 5],
        [3, 6]])


In [45]:
## y is a transposed (non-contiguous) view, so view() raises an error
y.view(2,3)

RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

In [43]:
same_storage(x,y)

True

A transposed tensor is a non-contiguous view, so it cannot be viewed again with an arbitrary shape. One can force a tensor to be contiguous first:

In [47]:
y=x.transpose(1,0).contiguous().view(2,3)
y

tensor([[1, 4, 2],
        [5, 3, 6]])

In [48]:
same_storage(x,y)

False
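
The error message shown earlier suggests `.reshape(...)` as an alternative. The sketch below checks contiguity and strides of the transposed tensor and shows that `reshape` ends up copying in this case, just like `contiguous().view(...)`; the names `z` and `shares_storage` are illustrative.

```python
import torch

x = torch.tensor([[1, 2, 3], [4, 5, 6]])
y = x.t()

# The transpose only swaps the strides; the data stays where it is.
contiguous = y.is_contiguous()
strides = y.stride()

# reshape returns a view when it can and a copy otherwise; here the
# layout of y is incompatible with shape (2, 3), so it copies.
z = y.reshape(2, 3)
shares_storage = (
    z.untyped_storage().data_ptr() == x.untyped_storage().data_ptr()
)

print(contiguous, strides, shares_storage)
print(z)
```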

In [49]:
print(x.untyped_storage().data_ptr())
print(y.untyped_storage().data_ptr())
## Views are free; copying takes both (additional) memory and compute.


364240896
364281472


## Matrix Multiplication

In [50]:
x=torch.ones(16,32)
w=torch.ones(32,2)
print(x,w)
y=x@w
print(y)


tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        ...
[output truncated]

In general, we perform the same operation for every example in a batch and every token in a sequence.

In [51]:
x=torch.ones([4,8,16,32])
w=torch.ones(32,2)
y=x@w
y

tensor([[[[32., 32.],
          [32., 32.],
          [32., 32.],
          ...,
          [32., 32.],
          [32., 32.],
          [32., 32.]],

         ...,
[output truncated]
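
To make the batched case easier to inspect, the sketch below repeats the idea with small, hypothetical sizes (`batch`, `seq`, `hidden`, and `out` are not from the notebook). It shows that `@` applies the same weight matrix to every (batch, sequence) position, which gives the same result as flattening the leading dimensions, multiplying, and reshaping back.

```python
import torch

batch, seq, hidden, out = 2, 3, 4, 5  # hypothetical small sizes
x = torch.ones(batch, seq, hidden)
w = torch.ones(hidden, out)

# w is broadcast over the leading (batch, seq) dimensions
y = x @ w

# Equivalent: treat every (batch, seq) position as one row of a 2-D matrix
y_flat = (x.reshape(-1, hidden) @ w).reshape(batch, seq, out)

print(y.shape)
print(torch.equal(y, y_flat))
```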
        

In [66]:
import timeit

In [67]:
def time_matmul(a: torch.Tensor, b: torch.Tensor) -> float:
    """Return the number of seconds required to perform `a @ b`."""
    # Wait until previous CUDA threads are done
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    def run():
        # Perform the operation
        a @ b
        # Wait until CUDA threads are done
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    # Time the operation `num_trials` times
    num_trials = 5
    total_time = timeit.timeit(run, number=num_trials)
    return total_time / num_trials

## Linear Model

As motivation, suppose you have a linear model:

- We have n points.
- Each point is d-dimensional.
- The linear model maps each d-dimensional vector to k outputs.

In [68]:
if torch.cuda.is_available():
    B = 16384  # Number of points
    D = 32768  # Dimension
    K = 8192   # Number of outputs
else:
    B = 1024
    D = 256
    K = 64

In [69]:
device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [70]:
x=torch.ones(B,D,device=device)
w=torch.ones(D,K,device=device)
print(x.shape,w.shape)

torch.Size([16384, 32768]) torch.Size([32768, 8192])


In [71]:
## Matrix multiplication
y=x@w
print(y.shape)

torch.Size([16384, 8192])


We have one multiplication (x[i][j] * w[j][k]) and one addition per (i, j, k) triple, so the matrix multiplication takes 2 * B * D * K FLOPs in total.

In [80]:
actual_num_flops =2*B*D*K
print(f"Total number of Flops for particular linear model : {actual_num_flops}")

Total number of Flops for particular linear model : 8796093022208
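
The count can be checked by hand on a tiny example. The sketch below runs a naive triple-loop matrix multiplication in plain Python with small, hypothetical sizes and counts one multiplication and one addition per (i, j, k) triple.

```python
B, D, K = 3, 4, 5  # tiny hypothetical sizes, not the ones used above

x = [[1.0] * D for _ in range(B)]
w = [[1.0] * K for _ in range(D)]
y = [[0.0] * K for _ in range(B)]

num_flops = 0
for i in range(B):
    for k in range(K):
        for j in range(D):
            # one multiplication and one addition per (i, j, k) triple
            y[i][k] += x[i][j] * w[j][k]
            num_flops += 2

print(num_flops, 2 * B * D * K)
```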


### FLOPs of other operations

- An elementwise operation on an m x n matrix requires O(m n) FLOPs.
- Addition of two m x n matrices requires m n FLOPs.

In general, no other operation that you'd encounter in deep learning is as expensive as matrix multiplication for large enough matrices.

Interpretation:

- B is the number of data points
- (D K) is the number of parameters
- FLOPs for the forward pass is 2 (# tokens) (# parameters)

It turns out this generalizes to Transformers (to a first-order approximation).
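
The rule of thumb above can be written as a small helper. This is only a sketch: `forward_flops` is a hypothetical name, and the sizes are the GPU values of B, D, and K used in this notebook.

```python
def forward_flops(num_tokens: int, num_params: int) -> int:
    """Approximate forward-pass FLOPs: 2 * (# tokens) * (# parameters)."""
    return 2 * num_tokens * num_params

B, D, K = 16384, 32768, 8192  # GPU sizes from the linear model above
flops = forward_flops(num_tokens=B, num_params=D * K)

print(flops)
```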

In [74]:
actual_time=time_matmul(x,w)
print(f"Actual Time to Compute Matrix Multiplication on Linear Model on GPU(T4) {actual_time}")

Actual Time to Compute Matrix Multiplication on Linear Model on GPU(T4) 1.7326699799999914


In [84]:
actual_flops_per_sec = actual_num_flops / actual_time
actual_flops_per_sec

5076611890169.67

In [82]:
promised_flop_per_sec=8e12

## Model FLOPs utilization (MFU)

Definition: (actual FLOP/s) / (promised FLOP/s) [ignoring communication/overhead]

In [86]:
mfu=actual_flops_per_sec/promised_flop_per_sec
print(f"Model FLOPs utilization (MFU) : {mfu}")

Model FLOPs utilization (MFU) : 0.6345764862712088


Usually, an MFU of >= 0.5 is quite good (and it will be higher if matmuls dominate).
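
The definition can be wrapped in a small function. The sketch below plugs in the float32 timing and the promised 8e12 FLOP/s reported in this notebook; the function name `mfu` and the variable names are illustrative.

```python
def mfu(actual_flops_per_sec: float, promised_flops_per_sec: float) -> float:
    """Model FLOPs utilization: achieved throughput over promised throughput."""
    return actual_flops_per_sec / promised_flops_per_sec

num_flops = 2 * 16384 * 32768 * 8192  # 2 * B * D * K from the run above
elapsed_seconds = 1.7326699799999914  # measured float32 matmul time above
value = mfu(num_flops / elapsed_seconds, promised_flops_per_sec=8e12)

print(value)
```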

### Let's do it with bfloat16:

In [89]:
x=x.to(torch.bfloat16)
w=w.to(torch.bfloat16)
print(x.dtype,w.dtype)
bfloat16_actual_time=time_matmul(x,w)
print(f"Actual Time to Compute Matrix Multiplication on Linear Model on GPU(T4) {bfloat16_actual_time}")


bf16_actual_flop_per_sec = actual_num_flops / bfloat16_actual_time
print(f"BF16 actual flop per sec {bf16_actual_flop_per_sec}")


## Promised bfloat16 throughput used here (2x the float32 figure)
promised_flop_per_sec=16e12
bf16_mfu=bf16_actual_flop_per_sec/promised_flop_per_sec

print(f"Model FLOPs utilization (MFU) : {bf16_mfu}")

torch.bfloat16 torch.bfloat16
Actual Time to Compute Matrix Multiplication on Linear Model on GPU(T4) 3.8737608425999497
BF16 actual flop per sec 2270685615249.374
Model FLOPs utilization (MFU) : 0.14191785095308587


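Note: in this run, the bfloat16 matmul is slower than the float32 one (about 2.27e12 vs. 5.08e12 actual FLOP/s), so the bfloat16 MFU is low (about 0.14).
A likely reason is that the T4, a Turing-generation GPU, does not have native bfloat16 tensor-core support, which would make the promised 16e12 FLOP/s optimistic for this hardware.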
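
One possible reason for a low bfloat16 MFU is hardware without native bfloat16 support. The sketch below reports the compute capability of the current GPU. It rests on the assumption that native bfloat16 tensor cores arrived with compute capability 8.0 (Ampere), and `describe_bf16_support` is a hypothetical helper name.

```python
import torch

def describe_bf16_support() -> str:
    """Describe whether the current device likely has native bfloat16 support."""
    if not torch.cuda.is_available():
        return "no CUDA device available"
    major, minor = torch.cuda.get_device_capability()
    # Assumption: native bfloat16 tensor cores require compute capability
    # 8.0 or newer (Ampere); a T4 reports 7.5.
    likely_native = major >= 8
    return f"compute capability {major}.{minor}, native bfloat16 likely: {likely_native}"

print(describe_bf16_support())
```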