In [None]:
!nvcc --version


nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


In [None]:
# The build reported below is 2.3.0+cu121, so request the CUDA 12.1 wheels rather than cu118.
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121


In [None]:
import torch
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CUDA not available")


2.3.0+cu121
True
Tesla T4
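The nvcc release (12.2) and the wheel tag (cu121) do not have to match: the PyTorch wheels ship with their own CUDA runtime libraries, so the system toolkit version mainly matters when building extensions. As a minimal sketch, torch.version.cuda reports the runtime the installed wheel was built against; the printed labels are illustrative only.

```python
import torch

# CUDA runtime version the installed wheel was built against
# (None on CPU-only builds).
print("Built with CUDA:", torch.version.cuda)

# Compute capability is only meaningful when a GPU is visible.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Compute capability: {major}.{minor}")
else:
    print("No CUDA device visible")
```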


## Using PyTorch 2.0: Side-by-Side Comparison with Previous Versions

Basic Tensor Operations
PyTorch 1.x:

In [None]:
import torch

# Create tensors
a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])

# Basic operations
c = a + b
d = a * b

print(f"a: {a}")
print(f"b: {b}")
print(f"a + b: {c}")
print(f"a * b: {d}")


a: tensor([1, 2, 3])
b: tensor([4, 5, 6])
a + b: tensor([5, 7, 9])
a * b: tensor([ 4, 10, 18])


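PyTorch 2.0: The same operations run on the GPU when the tensors are created with device='cuda'. This placement also works in PyTorch 1.x; the speedups specific to 2.0 come from torch.compile, not from eager-mode tensor arithmetic.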

In [None]:
import torch

# Create tensors
a = torch.tensor([1, 2, 3], device='cuda')
b = torch.tensor([4, 5, 6], device='cuda')

# Basic operations
c = a + b
d = a * b

print(f"a: {a}")
print(f"b: {b}")
print(f"a + b: {c}")
print(f"a * b: {d}")
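The cell above hard-codes device='cuda' and fails on a machine without a GPU. A minimal sketch of a device-agnostic variant, assuming a fallback to the CPU is acceptable (the variable name device is a choice made here, not something from the notebook):

```python
import torch

# Use the GPU when one is visible, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.tensor([1, 2, 3], device=device)
b = torch.tensor([4, 5, 6], device=device)

c = a + b
d = a * b

print(f"device: {device}")
print(f"a + b: {c.cpu().tolist()}")  # [5, 7, 9]
print(f"a * b: {d.cpu().tolist()}")  # [4, 10, 18]
```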

Dynamic Shapes and Control Flow
PyTorch 1.x:

In [None]:
x = torch.randn(3, 4)
y = torch.randn(4, 5)

result = torch.matmul(x, y)
print(f"Result: {result}")

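PyTorch 2.0: The eager-mode code is unchanged apart from placing the tensors on the GPU. The improved handling of dynamic shapes and control flow in PyTorch 2.0 applies to code compiled with torch.compile, which captures ordinary Python code and can reuse compiled kernels across input shapes.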

In [None]:
x = torch.randn(3, 4, device='cuda')
y = torch.randn(4, 5, device='cuda')

result = torch.matmul(x, y)
print(f"Result: {result}")


To illustrate the differences and improvements in PyTorch 2.0 compared to previous versions, we'll highlight how common operations and tasks have been enhanced. This comparison will help you understand the new features and optimizations in PyTorch 2.0, making it easier to use the latest version effectively.

Explanation of Differences
Dynamic Shapes and Control Flow
In PyTorch 1.x, tensor operations such as matrix multiplication (torch.matmul) run on both CPU and GPU, and eager mode already accepts inputs of any shape. The limitation was in the graph-capture tools: torch.jit.trace records a single execution path for one example input, so variable-length inputs and data-dependent branches were hard to optimize.

PyTorch 2.0 Enhancements:

GPU Placement: Creating tensors directly on the GPU (using device='cuda') avoids a host-to-device copy. This works the same way in 1.x and 2.0, so it is not what makes 2.0 faster.

Efficient Execution: torch.compile captures a model's operations from ordinary Python code, falls back to eager execution for code it cannot capture, and can generate code that accepts varying input sizes (dynamic=True) instead of recompiling for each new shape.

Automatic Optimizations: TorchInductor, the default torch.compile backend, generates optimized code for both CPU and GPU, including fused kernels. This means less manual optimization is required from developers.
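The points above can be sketched with torch.compile on a small function whose batch size changes between calls. This is an illustration under assumptions: the function scaled_matmul is made up for this example, and backend="eager" is used only so the sketch runs without a compiler toolchain. Omitting the backend argument selects TorchInductor.

```python
import torch

def scaled_matmul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Hypothetical example function; the first dimension of x varies per call.
    return torch.relu(x @ y) * 0.5

# dynamic=True asks the compiler to treat input sizes as symbolic.
# backend="eager" skips code generation; the default backend is "inductor".
compiled = torch.compile(scaled_matmul, dynamic=True, backend="eager")

y = torch.randn(4, 5)
for batch in (3, 7, 16):
    x = torch.randn(batch, 4)
    out = compiled(x, y)
    # The compiled function should agree with plain eager execution.
    print(batch, tuple(out.shape), torch.allclose(out, scaled_matmul(x, y)))
```

On a GPU machine, the same call works with tensors created on device='cuda'. The first call for a new function includes compilation time, so timing comparisons should exclude it.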

Mixed Precision Training
PyTorch 1.x (with apex):

In [None]:
from apex import amp
import torch

model = torch.nn.Linear(4, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# Dummy data
data = torch.randn(16, 4, device='cuda')
target = torch.randn(16, 4, device='cuda')

output = model(data)
loss = torch.nn.functional.mse_loss(output, target)

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()

print(f"Output: {output}")

PyTorch 2.0 (native AMP, available since PyTorch 1.6):

In [None]:
from torch.cuda.amp import autocast, GradScaler
import torch

model = torch.nn.Linear(4, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
scaler = GradScaler()

# Dummy data
data = torch.randn(16, 4, device='cuda')
target = torch.randn(16, 4, device='cuda')

# Training step with mixed precision
with autocast():
    output = model(data)
    loss = torch.nn.functional.mse_loss(output, target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

print(f"Output: {output}")

Explanation of Differences
Mixed Precision Training
In early PyTorch 1.x releases, mixed precision training relied on the NVIDIA apex library, which required a separate installation and wrapping the model and optimizer with amp.initialize.

PyTorch 2.0 Enhancements:

Integrated Mixed Precision: Native mixed precision lives in the torch.cuda.amp module. It was added in PyTorch 1.6 and carries over to 2.0, so the external apex library is no longer needed. Recent releases also expose the same functionality as torch.autocast and torch.amp.GradScaler and deprecate the torch.cuda.amp spellings.

Autocast and GradScaler: The autocast context manager runs each operation in a precision chosen per operation type, for example float16 for matrix multiplications and float32 for loss functions. GradScaler multiplies the loss by a scale factor before the backward pass so that small float16 gradients do not underflow to zero, unscales the gradients before the optimizer step, and skips the step if any gradient is inf or NaN.
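A minimal sketch of what autocast changes, assuming torch.autocast is available (PyTorch 1.10 or later). It falls back to bfloat16 on the CPU so that it runs without a GPU; the variable names are chosen for this example.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# float16 is the usual choice on CUDA; CPU autocast uses bfloat16.
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

model = torch.nn.Linear(4, 4).to(device)
data = torch.randn(16, 4, device=device)

with torch.autocast(device_type=device, dtype=amp_dtype):
    output = model(data)

# The parameters stay in float32; only the computation inside the
# autocast region runs in the lower precision.
print("weight dtype:", model.weight.dtype)
print("output dtype:", output.dtype)
```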

Asynchronous CUDA Execution
PyTorch 1.x:

In [None]:
# Run work on a side stream, with device-wide synchronization before and after.
import torch

stream = torch.cuda.Stream()
torch.cuda.synchronize()

with torch.cuda.stream(stream):
    a = torch.rand(10000, 10000, device='cuda')
    b = torch.rand(10000, 10000, device='cuda')
    c = torch.matmul(a, b)

torch.cuda.synchronize()
print("Matrix multiplication completed asynchronously")
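
PyTorch 2.0:

In [None]:
import torch

stream = torch.cuda.Stream()
# Make the side stream wait for work already queued on the current stream.
stream.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(stream):
    a = torch.rand(10000, 10000, device='cuda')
    b = torch.rand(10000, 10000, device='cuda')
    c = torch.matmul(a, b)  # the kernel launch returns immediately

# Block the host until this stream's work is done; other streams keep running.
stream.synchronize()
print("Matrix multiplication completed on the side stream")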

Explanation of Differences
Asynchronous CUDA Execution
CUDA kernel launches are asynchronous with respect to the host in both PyTorch 1.x and PyTorch 2.0, and the stream API (torch.cuda.Stream, torch.cuda.stream) is the same in both. PyTorch 2.0 does not remove the need for synchronization: before the host reads a result, reports completion, or stops a timer, it must wait for the work to finish. Omitting the wait does not make the computation faster; it only means the print statement may run while the matrix multiplication is still in progress.

What to use in either version:

Scope of Synchronization: torch.cuda.synchronize() blocks the host until every stream on the device is idle. stream.synchronize() and CUDA events wait only for the work queued on one stream, which leaves other streams free to keep running.

Better Performance: Narrower synchronization reduces idle time and lets independent work overlap, which matters most for large tensor operations and for overlapping data transfers with computation.
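As a sketch of event-based waiting, the helper below queues a matrix multiplication on a side stream and blocks the host only until that specific point is reached. The function name matmul_on_side_stream and the default size are assumptions made for this example; it returns None when no GPU is available so that it can run anywhere.

```python
import torch

def matmul_on_side_stream(n: int = 256):
    """Multiply two random n x n matrices on a side stream (hypothetical helper)."""
    if not torch.cuda.is_available():
        return None  # nothing to demonstrate without a GPU

    stream = torch.cuda.Stream()
    done = torch.cuda.Event()

    a = torch.rand(n, n, device="cuda")
    b = torch.rand(n, n, device="cuda")

    # The inputs were produced on the current stream; wait for them first.
    stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(stream):
        c = a @ b
        done.record(stream)  # marks the point right after the matmul

    # Block the host until the recorded point is reached, not the whole device.
    done.synchronize()
    return c

result = matmul_on_side_stream(64)
print("ran on GPU" if result is not None else "CUDA not available")
```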

## Mixed Precision Training: CIFAR-10 Example

In [None]:
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# Data transformation
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Loading CIFAR-10 data
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = DataLoader(trainset, batch_size=32, shuffle=True, num_workers=2)


Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz


100%|██████████| 170498071/170498071 [00:05<00:00, 30844359.64it/s]


Extracting ./data/cifar-10-python.tar.gz to ./data


In [None]:
import torch
from torch import nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)        # 3x32x32 -> 6x28x28
        self.pool = nn.MaxPool2d(2, 2)         # halves height and width
        self.conv2 = nn.Conv2d(6, 16, 5)       # 6x14x14 -> 16x10x10
        self.fc1 = nn.Linear(16 * 5 * 5, 120)  # 16x5x5 after the second pool
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)           # 10 CIFAR-10 classes

    def forward(self, x):
        # Functional ReLU avoids constructing a new nn.ReLU module on every call
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # keep the batch dimension, flatten the rest
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
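The 16 * 5 * 5 input size of the first linear layer follows from the shapes of the convolution and pooling layers. A short sketch that traces a CIFAR-10 sized input through the same layer configuration; the activations are left out because they do not change shapes, and the stages list is a construct of this example only.

```python
import torch
from torch import nn

# Same layer hyperparameters as SimpleCNN, without the activations.
stages = [
    ("conv1", nn.Conv2d(3, 6, 5)),
    ("pool", nn.MaxPool2d(2, 2)),
    ("conv2", nn.Conv2d(6, 16, 5)),
    ("pool", nn.MaxPool2d(2, 2)),
]

x = torch.randn(1, 3, 32, 32)  # one CIFAR-10 sized image
for name, layer in stages:
    x = layer(x)
    print(f"{name}: {tuple(x.shape)}")

flat_features = x.flatten(1).shape[1]
print("features into fc1:", flat_features)  # 400 = 16 * 5 * 5
```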


In [None]:
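import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast

model = SimpleCNN().cuda()
criterion = nn.CrossEntropyLoss()  # build the loss once, not on every batch
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scaler = GradScaler()

for epoch in range(2):  # Example with 2 epochs
    for i, (inputs, labels) in enumerate(trainloader):
        inputs, labels = inputs.cuda(), labels.cuda()

        optimizer.zero_grad()
        with autocast():  # forward pass and loss in mixed precision
            outputs = model(inputs)
            loss = criterion(outputs, labels)

        scaler.scale(loss).backward()  # backward pass on the scaled loss
        scaler.step(optimizer)         # unscales gradients; skips the step on inf/NaN
        scaler.update()                # adjusts the scale factor for the next step

        if i % 200 == 199:
            # Loss of the current batch only, not a running average
            print(f'Epoch {epoch + 1}, Batch {i + 1}, Loss: {loss.item():.3f}')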


Epoch 1, Batch 200, Loss: 2.315
Epoch 1, Batch 400, Loss: 2.301
Epoch 1, Batch 600, Loss: 2.281
Epoch 1, Batch 800, Loss: 2.266
Epoch 1, Batch 1000, Loss: 2.223
Epoch 1, Batch 1200, Loss: 2.112
Epoch 1, Batch 1400, Loss: 2.007
Epoch 2, Batch 200, Loss: 1.941
Epoch 2, Batch 400, Loss: 1.929
Epoch 2, Batch 600, Loss: 1.926
Epoch 2, Batch 800, Loss: 1.884
Epoch 2, Batch 1000, Loss: 2.011
Epoch 2, Batch 1200, Loss: 1.702
Epoch 2, Batch 1400, Loss: 1.648
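Each printed value is the loss of a single batch, which is why the sequence is noisy (for example 2.011 at batch 1000 of epoch 2). One option is to report the mean over each 200-batch window instead. A minimal sketch with a hypothetical helper, window_means, and made-up loss values:

```python
def window_means(losses, window):
    """Average consecutive, non-overlapping windows of per-batch losses."""
    return [
        sum(losses[start:start + window]) / window
        for start in range(0, len(losses) - window + 1, window)
    ]

# Made-up per-batch losses, for illustration only.
batch_losses = [1.0, 3.0, 2.0, 4.0, 5.0]
print(window_means(batch_losses, window=2))  # [2.0, 3.0]; the trailing 5.0 is dropped
```

In the training loop, the per-batch values would come from appending loss.item() to a list after each step.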


## Using TorchScript

In [None]:
import torch
from torch import nn

class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()
        self.layer1 = nn.Linear(10, 50)
        self.layer2 = nn.Linear(50, 10)

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        x = self.layer2(x)
        return x


In [None]:
# Convert the model to TorchScript
# Initialize the model and switch to inference mode before tracing
model = NeuralNet()
model.eval()

# Example input tensor
example_input = torch.rand(1, 10)

# Trace the model
traced_model = torch.jit.trace(model, example_input)
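Tracing records the operations executed for the example input, which is sufficient for NeuralNet because its forward pass has no branches. For code with data-dependent control flow, torch.jit.script compiles the Python source instead. The sketch below contrasts the two on a made-up function, shift_by_sign; tracing it emits a TracerWarning, which is expected here.

```python
import torch

def shift_by_sign(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical function with a branch that depends on the data.
    if bool(x.sum() < 0):
        return x - 1
    return x + 1

scripted = torch.jit.script(shift_by_sign)
# Traced with a positive input, so only the "x + 1" branch is recorded.
traced = torch.jit.trace(shift_by_sign, torch.ones(3))

negative = -torch.ones(3)
print("scripted:", scripted(negative))  # follows the branch: x - 1
print("traced:  ", traced(negative))    # replays the recorded branch: x + 1
```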


In [None]:
# Save and load the TorchScript model

# Save the model
traced_model.save("model.pt")

# Load the model
loaded_model = torch.jit.load("model.pt")


In [None]:
# Run the model with some input
output = loaded_model(torch.rand(1, 10))
print(output)
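A quick way to gain confidence in a saved model is to compare its outputs with the original module on the same input. The sketch below is self-contained, so it rebuilds an equivalent two-layer network with nn.Sequential and writes to a temporary directory instead of model.pt; both choices are assumptions made for this example.

```python
import os
import tempfile

import torch
from torch import nn

# Stand-in for NeuralNet with the same layer sizes.
model = nn.Sequential(nn.Linear(10, 50), nn.ReLU(), nn.Linear(50, 10)).eval()
traced = torch.jit.trace(model, torch.rand(1, 10))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pt")
    traced.save(path)
    reloaded = torch.jit.load(path)

# Use a batch size different from the traced example input.
test_input = torch.rand(4, 10)
with torch.no_grad():
    matches = torch.allclose(model(test_input), reloaded(test_input))
print("Outputs match:", matches)
```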