## Example of optimization done by PyTorch compared to the NumPy library (JIT compilation)

#### JIT Compilation and Caching (using only CPU)
One significant optimization that PyTorch implements is Just-In-Time (JIT) compilation and caching, which can lead to performance improvements over NumPy for certain operations.

In this example, we define a simple computation and run it using NumPy, PyTorch, and PyTorch with JIT compilation. The JIT-compiled version often shows better performance, especially for repeated computations, due to the following optimizations:

    - Code optimization: The JIT compiler analyzes the computation graph and applies various optimizations, such as operator fusion and dead code elimination.
    - Specialized code generation: It generates specialized machine code for the specific operations and data types used in your function.
    - Caching: Once compiled, the optimized version is cached, reducing overhead for subsequent calls.
    - Reduced Python overhead: JIT compilation can reduce the Python interpreter overhead by executing more operations in compiled code.

These optimizations can lead to significant performance improvements, especially for complex operations or when the same computation is repeated many times. The exact performance gain can vary depending on the specific operation, input sizes, and hardware. 

It's worth noting that for very simple operations or small data sizes, the overhead of JIT compilation might outweigh its benefits. However, for larger, more complex computations typical in deep learning scenarios, PyTorch's JIT optimization can provide substantial speedups compared to NumPy.
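
To make the optimizations listed above more concrete, here is a minimal sketch (not part of the original benchmark) that scripts the same `x * y + x` expression, checks it against eager execution, and prints the TorchScript code the compiler recorded. The function name `scripted_compute` and the 4x4 inputs are illustrative assumptions.

```python
import torch

# Hypothetical name; the body mirrors the x * y + x computation benchmarked in this section.
@torch.jit.script
def scripted_compute(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x * y + x

x = torch.rand(4, 4)
y = torch.rand(4, 4)

# The scripted function should agree numerically with eager execution.
scripted_result = scripted_compute(x, y)
eager_result = x * y + x
print(torch.allclose(scripted_result, eager_result))

# .code holds the TorchScript source recorded for the scripted function.
print(scripted_compute.code)
```

Inspecting `.code` (or `.graph`) is a quick way to confirm what was actually compiled before attributing a speedup to the JIT.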

In [None]:
import torch
import numpy as np
import time

# Define a simple function
def compute(x, y):
    return x * y + x

# Create input data
x = np.random.rand(10000, 10000)
y = np.random.rand(10000, 10000)

# NumPy version
def numpy_compute():
    return compute(x, y)

# PyTorch version (torch.from_numpy shares memory with the NumPy array, so no copy is made)
def torch_compute():
    x_torch = torch.from_numpy(x)
    y_torch = torch.from_numpy(y)
    return compute(x_torch, y_torch)

# JIT-compiled PyTorch version (TorchScript also compiles the compute() function it calls)
@torch.jit.script
def torch_jit_compute(x, y):
    return compute(x, y)

# No print() calls inside the timed functions: console I/O would distort the measurement
def torch_jit_run():
    x_torch = torch.from_numpy(x)
    y_torch = torch.from_numpy(y)
    return torch_jit_compute(x_torch, y_torch)

# Benchmark
def benchmark(func, name):
    start = time.time()
    for _ in range(1000):
        func()
    end = time.time()
    print(f"{name} took {end - start:.4f} seconds")

benchmark(numpy_compute, "NumPy")
benchmark(torch_compute, "PyTorch")
benchmark(torch_jit_run, "PyTorch JIT")

NumPy took 339.1130 seconds

PyTorch took 137.8524 seconds

PyTorch JIT took 69.9366 seconds
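
The timing loop above counts the first JIT calls, which include one-time compilation and optimization work, together with the steady-state calls. A hedged sketch of a benchmark helper that runs a few untimed warm-up calls first is shown below; the helper name `benchmark_with_warmup`, the iteration counts, and the 512x512 inputs are assumptions chosen so the example runs quickly.

```python
import time
import torch

def benchmark_with_warmup(func, name, warmup=3, iters=20):
    """Time func after untimed warm-up calls; return mean seconds per call."""
    for _ in range(warmup):
        func()  # untimed: absorbs one-time costs such as JIT compilation
    start = time.perf_counter()
    for _ in range(iters):
        func()
    mean_seconds = (time.perf_counter() - start) / iters
    print(f"{name}: {mean_seconds * 1e3:.3f} ms per call")
    return mean_seconds

@torch.jit.script
def scripted_compute(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x * y + x

# Small CPU tensors so the sketch finishes in well under a second.
a = torch.rand(512, 512)
b = torch.rand(512, 512)

eager_time = benchmark_with_warmup(lambda: a * b + a, "PyTorch eager")
jit_time = benchmark_with_warmup(lambda: scripted_compute(a, b), "PyTorch JIT")
```

Reporting time per call rather than the total also makes runs with different iteration counts comparable.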

### JIT Compilation and Caching (using GPU if available)
#### (But with the overhead of transferring data between CPU and GPU memory)

**Key changes from the above version:**

    - We use torch.device() to automatically select GPU if available, otherwise fallback to CPU.
    - The input data is cast to float32 to ensure compatibility with GPU operations.
    - In the PyTorch functions, we move the tensors to the selected device using .to(device).
    - We add a GPU warm-up step to ensure the GPU is initialized before benchmarking.
    - We use torch.cuda.synchronize() before and after the warm-up to ensure all CUDA operations are completed.
    - In the PyTorch functions, we move the result back to the CPU and convert it to a NumPy array if the GPU was used. This ensures a fair comparison with the NumPy version, which always returns a NumPy array.
    - The JIT-compiled function is now device-agnostic. It will run on whatever device the input tensors are on.

This version uses the GPU if one is available, which can widen PyTorch's lead over NumPy for larger computations. However, every call here copies the result back to the CPU, and that transfer overhead can offset or even cancel the performance gains, as the timings below show.

Remember that the relative performance can still vary based on the specific hardware, CUDA version, and the nature of the computation being performed. It's always a good idea to benchmark with your specific use case and data sizes.
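
One way to see where the time goes is to time the host-to-device copy, the computation, and the device-to-host copy separately. The sketch below does this with CUDA events; the function name `time_gpu_steps` and the 2048x2048 size are assumptions, and the function simply returns `None` on a machine without CUDA.

```python
import torch

def time_gpu_steps(size=2048):
    """Return (copy_in_ms, compute_ms, copy_out_ms), or None when CUDA is unavailable."""
    if not torch.cuda.is_available():
        return None
    device = torch.device("cuda")
    x_cpu = torch.rand(size, size)
    y_cpu = torch.rand(size, size)

    # Warm up so CUDA context creation is not counted in the measurement.
    (x_cpu.to(device) * 2).cpu()
    torch.cuda.synchronize()

    marks = [torch.cuda.Event(enable_timing=True) for _ in range(4)]
    marks[0].record()
    x_gpu, y_gpu = x_cpu.to(device), y_cpu.to(device)  # host -> device copies
    marks[1].record()
    result = x_gpu * y_gpu + x_gpu                      # on-device computation
    marks[2].record()
    result_cpu = result.cpu()                           # device -> host copy
    marks[3].record()
    torch.cuda.synchronize()  # all events must complete before reading elapsed times

    # elapsed_time reports milliseconds between two recorded events.
    return tuple(marks[i].elapsed_time(marks[i + 1]) for i in range(3))

timings = time_gpu_steps()
if timings is None:
    print("CUDA not available; nothing to measure.")
else:
    print("copy in: %.2f ms, compute: %.2f ms, copy out: %.2f ms" % timings)
```

If the two copy figures dominate the compute figure, the workload is transfer-bound and moving it to the GPU is unlikely to help unless the data can stay there.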

In [5]:
import torch
import numpy as np
import time

# Check if GPU is available and set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Define a simple function
def compute(x, y):
    return x * y + x

# Create input data
x = np.random.rand(10000, 10000).astype(np.float32)
y = np.random.rand(10000, 10000).astype(np.float32)

x_torch = torch.from_numpy(x).to(device)
y_torch = torch.from_numpy(y).to(device)

# NumPy version
def numpy_compute():
    return compute(x, y)

# PyTorch version
def torch_compute():
    result = compute(x_torch, y_torch)
    return result.cpu().numpy() if device.type == "cuda" else result.numpy()

# JIT-compiled PyTorch version
@torch.jit.script
def torch_jit_compute(x, y):
    return compute(x, y)

def torch_jit_run():
    result = torch_jit_compute(x_torch, y_torch)
    return result.cpu().numpy() if device.type == "cuda" else result.numpy()

# Benchmark
def benchmark(func, name):
    start = time.time()
    for _ in range(1000):
        func()
    end = time.time()
    print(f"{name} took {end - start:.4f} seconds")

# Warm-up GPU
if device.type == "cuda":
    torch.cuda.synchronize()
    warm_up = torch.rand(10000, 10000, device=device)
    warm_up = warm_up * warm_up + warm_up
    torch.cuda.synchronize()

print("Starting benchmarks...")
benchmark(numpy_compute, "NumPy")
benchmark(torch_compute, "PyTorch")
benchmark(torch_jit_run, "PyTorch JIT")

Using device: cuda
Starting benchmarks...
NumPy took 176.4616 seconds
PyTorch took 183.7095 seconds
PyTorch JIT took 181.1943 seconds


## Modified version of the code that keeps the results on the GPU for PyTorch operations

Key changes and explanations:

    - We create the PyTorch tensors directly on the GPU using device=device in the tensor creation.
    - The compute function needs no explicit .cuda() call, since the tensors are already on the GPU.
    - We don't move the result back to CPU in the PyTorch function.
    - In the benchmark function, we use torch.cuda.synchronize() for PyTorch operations to ensure all GPU computations are completed before measuring time.
    - We run each benchmark for 1,000 iterations, the same count as in the previous versions, so the timings stay comparable.
    - For the NumPy benchmark, we move the PyTorch tensors to CPU and convert them to NumPy arrays only once, outside the timing loop.

This setup should give you a more accurate comparison between GPU-accelerated PyTorch operations and CPU-based NumPy operations. The PyTorch operations now stay entirely on the GPU, **avoiding the overhead of transferring data between CPU and GPU memory** for each operation. 

Remember that for small tensors (for example 200x100, far smaller than the 10000x10000 tensors used here), the GPU might not show a significant speedup because the overhead of GPU kernel launches dominates. The benefits of GPU acceleration are typically more pronounced for larger tensor sizes and more complex operations.

In [4]:
import torch
import numpy as np
import time

# Check if GPU is available and set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Define a simple function
def compute(x, y):
    return x * y + x

# Create input data
x = torch.rand(10000, 10000, device=device)
y = torch.rand(10000, 10000, device=device)

x_np = x.cpu().numpy()
y_np = y.cpu().numpy()

# NumPy version (always on CPU)
def numpy_compute():
    return compute(x_np, y_np)

# PyTorch version
def torch_compute():
    return compute(x, y)

# JIT-compiled PyTorch version
@torch.jit.script
def torch_jit_compute(x, y):
    return compute(x, y)

# Benchmark function
def benchmark(func, name, *args):
    start = time.time()
    for _ in range(1000):
        result = func(*args)
        if torch.is_tensor(result) and result.is_cuda:
            torch.cuda.synchronize()  # Ensure GPU operations are completed (skipped on CPU-only runs)
    end = time.time()
    print(f"{name} took {end - start:.6f} seconds")

# Warm-up GPU
if device.type == "cuda":
    warm_up = torch.rand(10000, 10000, device=device)
    print("Warming up GPU...")
    warm_up = warm_up * warm_up + warm_up
    torch.cuda.synchronize()

print("Starting benchmarks...")

# NumPy benchmark
benchmark(numpy_compute, "NumPy (CPU)")

# PyTorch benchmark
benchmark(torch_compute, "PyTorch")

# PyTorch JIT benchmark
benchmark(torch_jit_compute, "PyTorch JIT", x, y)

Using device: cuda
Warming up GPU...
Starting benchmarks...
NumPy (CPU) took 184.981980 seconds
PyTorch took 4.242224 seconds
PyTorch JIT took 2.127695 seconds
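
As an alternative to a hand-written timing loop, PyTorch ships `torch.utils.benchmark.Timer`, which handles warm-up and CUDA synchronization itself. The following is a small sketch under assumed sizes (1024x1024) and an assumed run count of 50, not a reproduction of the numbers above.

```python
import torch
import torch.utils.benchmark as benchmark

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def compute(x, y):
    return x * y + x

# Tensors are created directly on the selected device, as in the cell above.
x = torch.rand(1024, 1024, device=device)
y = torch.rand(1024, 1024, device=device)

timer = benchmark.Timer(
    stmt="compute(x, y)",
    globals={"compute": compute, "x": x, "y": y},
    label="x * y + x",
    description=str(device),
    # num_threads defaults to 1, so CPU results reflect single-threaded execution.
)

measurement = timer.timeit(50)  # run the statement 50 times
print(measurement)
mean_seconds = measurement.mean  # mean time per run, in seconds
```

Because `Timer` defaults to a single thread, pass `num_threads=torch.get_num_threads()` if the goal is to compare against a multithreaded CPU baseline.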


# Key Takeaways

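We have seen:

    - On the CPU, the JIT-compiled PyTorch version was the fastest of the three (about 70 seconds, versus 138 seconds for plain PyTorch and 339 seconds for NumPy).
    - On the GPU, copying every result back to the CPU erased the advantage (about 181 to 184 seconds for PyTorch, versus 176 seconds for NumPy).
    - Keeping the data and the results on the GPU gave by far the largest gain (about 2 to 4 seconds for PyTorch, versus 185 seconds for NumPy on the CPU).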

Here are some key strategies to reduce the overhead of data transfer between GPU and CPU in deep learning projects:

    - Use Pinned Memory:
    When transferring data from CPU to GPU, use pinned (page-locked) memory. This can be done by setting pin_memory=True in DataLoader or by calling .pin_memory() on a CPU tensor before moving it to the GPU
    - Asynchronous Data Transfer:
    Utilize asynchronous data transfers to overlap computation with data transfer. This can be achieved using CUDA streams
    - Minimize Transfer Frequency:
    Try to keep data on the GPU as much as possible. Only transfer data to CPU when absolutely necessary, such as for logging or saving checkpoints
    - Batch Processing:
    Process data in batches to reduce the number of transfers and take advantage of GPU parallelism
    - Use CUDA Graphs:
    For static computational graphs, use CUDA graphs to reduce CPU overhead in launching GPU operations
    - Avoid Unnecessary Synchronization:
    Minimize calls to torch.cuda.synchronize() and other operations that force synchronization between CPU and GPU
    - Direct Tensor Creation on GPU:
    Create tensors directly on the GPU when possible, rather than creating them on CPU and then transferring
    - Optimize Data Loading:
    Use efficient data loading techniques like NVIDIA DALI or PyTorch's DataLoader with multiple workers and pinned memory
    - Consider Using SpeedTorch:
    For specific use cases, libraries like SpeedTorch can provide faster CPU-GPU data transfer than standard PyTorch methods
    - Profile and Benchmark:
    Use tools like PyTorch Profiler or NVIDIA Nsight Systems to identify bottlenecks in data transfer and optimize accordingly
    - Use Appropriate Data Types:
    Ensure you're using the most appropriate data types for your tensors to minimize transfer sizes
    - Pipeline Parallelism:
    For large models, consider using pipeline parallelism to distribute computation across multiple GPUs and reduce data transfer overhead
    - Avoid Unnecessary Transfers:
    Be cautious about operations that implicitly move data between devices, such as using .item() or .cpu() in critical paths

Remember that the effectiveness of these strategies can vary depending on your specific use case and hardware setup. It's important to profile your application and focus on optimizing the most significant bottlenecks.
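
Several of the strategies above (pinned memory, asynchronous transfer, and keeping intermediate results on the GPU) can be combined in a few lines. The sketch below is illustrative only: the batch shape is a made-up stand-in for an image batch, and on a machine without CUDA it falls back to plain CPU execution.

```python
import torch

# Hypothetical image batch created on the CPU (shape chosen only for illustration).
batch = torch.rand(64, 3, 32, 32)

if torch.cuda.is_available():
    device = torch.device("cuda")
    pinned = batch.pin_memory()                       # page-locked host copy
    on_device = pinned.to(device, non_blocking=True)  # copy can overlap with host work
    total = (on_device * 2).sum()                     # intermediate result stays on the GPU
    value = total.item()                              # single device-to-host transfer (synchronizes)
else:
    on_device = batch
    value = (batch * 2).sum().item()

print(on_device.device, value)
```

Note that `non_blocking=True` only has an effect when the source tensor is in pinned memory, and that `.item()` forces a synchronization, so it is best called once at the end rather than inside a loop.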

## Profiling

To profile a deep learning-based application for breast cancer diagnosis, you can use several techniques and tools. Here's a comprehensive approach:

1. Use PyTorch Profiler:
PyTorch provides a built-in profiler that can help identify performance bottlenecks:

```python
from torch.profiler import profile, record_function, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], 
             record_shapes=True, profile_memory=True) as prof:
    with record_function("model_inference"):
        model(input)  # assumes model and input are already defined and on the same device

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

This will give you detailed information about CPU and GPU usage, memory consumption, and execution time for different operations[4].

2. NVIDIA Nsight Systems:
For GPU-specific profiling, NVIDIA Nsight Systems can provide detailed insights into GPU utilization, memory transfers, and kernel execution[4].

3. Memory Profiling:
Use `torch.cuda.memory_summary()` to get a detailed breakdown of GPU memory usage[4].

4. Data Loading Optimization:
Profile your data loading pipeline:
- Use `num_workers` in DataLoader to parallelize data loading
- Set `pin_memory=True` for faster CPU to GPU transfers[4]

5. Model Architecture Analysis:
Use the `summary` function from `torchinfo` to get a detailed view of your model architecture, including the number of parameters and an estimate of multiply-add operations[4].

6. Batch Size and Learning Rate Profiling:
Experiment with different batch sizes and learning rates to find the optimal balance between speed and accuracy[4].

7. Mixed Precision Training:
Profile the performance impact of using mixed precision training with `torch.cuda.amp`[4].

8. Distributed Training Profiling:
If using distributed training, profile the communication overhead and load balancing between GPUs[4].

9. Inference Optimization:
For deployment, profile the inference speed and consider optimizations like model quantization or TorchScript[4].

10. Custom Profiling:
Implement custom timing functions for specific parts of your pipeline:

```python
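import functools
import time

def timeit(func):
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__} took {end - start:.2f} seconds")
        return result
    return wrapper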

@timeit
def preprocess_data(data):
    # Your preprocessing code here
    pass
```

11. System-level Profiling:
Use tools like `nvidia-smi` for GPU monitoring and `htop` for CPU and memory usage[4].

12. End-to-end Profiling:
Profile the entire pipeline from data loading to model inference to results analysis to identify overall bottlenecks[4].

Remember to profile on a representative dataset and under realistic conditions. Also, consider the trade-offs between speed and accuracy, especially in a medical diagnosis context where accuracy is crucial. Always validate any optimizations to ensure they don't negatively impact the model's diagnostic performance.
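
For reference, here is a self-contained variant of the PyTorch Profiler step that runs as-is. The tiny `torch.nn.Linear` model and random inputs are placeholders assumed for illustration, standing in for a real diagnostic model and dataset; CUDA activity is recorded only when a GPU is present.

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Placeholder model and inputs (assumptions, not a real diagnostic model).
model = torch.nn.Linear(64, 8)
inputs = torch.rand(32, 64)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, inputs = model.cuda(), inputs.cuda()

with profile(activities=activities, record_shapes=True) as prof:
    with record_function("model_inference"):
        with torch.no_grad():  # inference only, no autograd bookkeeping
            output = model(inputs)

# Sort by total CPU time so the table is meaningful on CPU-only machines too.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)
```

On a GPU machine, sorting by `cuda_time_total` instead highlights the most expensive kernels.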

Citations:
[1] https://jeas.springeropen.com/articles/10.1186/s44147-024-00411-z
[2] https://discuss.pytorch.org/t/gpu-cpu-memory-transfer-time-changes/176384
[3] https://discuss.pytorch.org/t/torch-is-slow-compared-to-numpy/117502
[4] https://towardsdatascience.com/optimize-pytorch-performance-for-speed-and-memory-efficiency-2022-84f453916ea6?gi=be1659ebf739
[5] https://discuss.pytorch.org/t/does-jit-makes-model-faster/44532
[6] https://discuss.pytorch.org/t/cpu-x10-faster-than-gpu-recommendations-for-gpu-implementation-speed-up/54980
[7] https://www.mdpi.com/2076-3417/13/21/11654
[8] https://discuss.pytorch.org/t/pytorch-tensor-performance-vs-numpy-array/9168