# GPU Acceleration

Let's check out some different PyTorch arithmetic on the CPU vs. GPU!

First, import PyTorch and check the version.
We will also make sure we can access the CUDA GPU.
This should always be your first step!

We'll also set a manual seed with `torch.manual_seed()` for the random operations. While this isn't strictly necessary for this experiment, it's good practice because it aids reproducibility.

In [None]:
import torch

print("PyTorch Version:", torch.__version__)

# Help with reproducibility of the test
torch.manual_seed(2016)

if not torch.cuda.is_available():
    raise OSError("ERROR: No GPU found.")

## Dot Product

Dot products are **extremely** common tensor operations. They are used in deep neural networks and linear algebra applications.

A dot product is essentially just a bunch of multiplications and additions.
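
To make that concrete, here is a minimal pure-Python sketch of a 1D dot product. The helper name `dot` is hypothetical and is not part of PyTorch; it only illustrates the multiply-then-add pattern.

```python
def dot(xs, ys):
    """Multiply matching elements, then add up the products."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    total = 0
    for x, y in zip(xs, ys):
        total += x * y
    return total


print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```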

PyTorch provides the [`torch.tensordot()`](https://pytorch.org/docs/stable/generated/torch.tensordot.html) function.
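
When called on two 2D tensors without a `dims` argument, `torch.tensordot()` uses its default `dims=2`, which contracts both dimensions and returns a single number: the sum of all element-wise products. The sketch below mirrors that arithmetic with plain nested lists. `full_contraction` is a hypothetical helper for illustration, not a PyTorch function.

```python
def full_contraction(a, b):
    """Sum a[i][j] * b[i][j] over every row i and column j."""
    total = 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += x * y
    return total


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(full_contraction(a, b))  # 1*5 + 2*6 + 3*7 + 4*8 = 70
```

This differs from matrix multiplication, which would return a 2x2 result for these inputs.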

First, let's define two functions to compute the dot product. One will run on the CPU and the other on the GPU.

### CPU Timing

The CPU function is trivial: it is a single call to `torch.tensordot()`.

### GPU Timing

The GPU function has a bit more to it. We must:

1. Send the tensors to the GPU for computation. We call [`Tensor.to()`](https://pytorch.org/docs/stable/generated/torch.Tensor.to.html) on each tensor to send it to a particular [device](https://pytorch.org/docs/stable/tensor_attributes.html#torch.device).
2. Wait for the GPU to synchronize. According to [the docs](https://pytorch.org/docs/stable/notes/cuda.html#asynchronous-execution), GPU ops take place asynchronously, so you need to call `torch.cuda.synchronize()` for precise timing.
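
The reason synchronization matters for timing can be sketched with a small generic helper. This is only an illustration under assumptions: `time_call` and its `sync` parameter are hypothetical names, and the code uses only the standard library so it runs without a GPU. In this notebook you would pass `torch.cuda.synchronize` as `sync`.

```python
import time


def time_call(fn, sync=None):
    """Time one call to fn, returning (result, elapsed_seconds).

    If sync is given, it is called before the clock starts (so earlier
    queued work is not counted) and again before the clock stops (so the
    work launched by fn has actually finished).
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()
    elapsed = time.perf_counter() - start
    return result, elapsed


# CPU-only demonstration; no synchronization is needed here.
result, elapsed = time_call(lambda: sum(range(1000)))
print(result, elapsed >= 0)

# Hypothetical GPU usage inside this notebook:
# time_call(lambda: torch.tensordot(a_gpu, b_gpu), sync=torch.cuda.synchronize)
```

Without the second `sync()` call, the clock could stop as soon as the operation is queued, which would under-report the GPU time.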

In [None]:
import pandas as pd
import timeit


# Compute the tensor dot product on CPU
def cpu_dot_product(a, b):
    return torch.tensordot(a, b)


# Send the tensor to GPU then compute dot product
# synchronize() required for timing accuracy, see:
# https://pytorch.org/docs/stable/notes/cuda.html#asynchronous-execution
def gpu_dot_product(a, b):
    a_gpu = a.to("cuda")
    b_gpu = b.to("cuda")
    product = torch.tensordot(a_gpu, b_gpu)
    torch.cuda.synchronize()
    return product

### Running the benchmark

This section declares the start and stop tensor sizes for our test.
You can change `SIZE_LIMIT` and then run again; just know that at some point you will run out of memory!

Next, it runs the test at several sizes within this range, doubling the size each time.
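
As a quick sanity check on how many sizes that produces, the doubling schedule can be sketched on its own. `doubling_sizes` is a hypothetical helper that mirrors the `while tensor_size < SIZE_LIMIT` loop used in this notebook.

```python
def doubling_sizes(start, limit):
    """Return start, 2*start, 4*start, ... for every value below limit."""
    sizes = []
    size = start
    while size < limit:
        sizes.append(size)
        size *= 2
    return sizes


print(doubling_sizes(10, 10000))
# [10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120]
```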

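We use [`timeit.timeit()`](https://docs.python.org/3/library/timeit.html#timeit.timeit) for the tests. It calls the function `number` times and returns the total elapsed time in seconds, so divide by `number` if you want the average per call. Because both timings below use the same `number`, the speedup ratio is unaffected. `timeit` is also more reliable than manually calling Python's time function and subtracting, since it uses a high-resolution clock and temporarily disables garbage collection while timing.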
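
Here is a small standard-library sketch of that timing pattern. Note that `timeit.timeit()` returns the total time for all `number` calls, so a per-call figure needs a division. `busy_work` and `RUNS` are hypothetical stand-ins for the benchmark functions and run count, and the callable is passed directly instead of as a string.

```python
import timeit


def busy_work():
    """Stand-in workload; the real benchmark would call a dot-product function."""
    return sum(range(1000))


RUNS = 50

# Total wall-clock seconds for RUNS calls
total_seconds = timeit.timeit(busy_work, number=RUNS)

# Average seconds per call
per_call_seconds = total_seconds / RUNS

print(f"total: {total_seconds:.6f} s, per call: {per_call_seconds:.8f} s")
```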

Finally, results are saved into a list that's then exported to a pandas DataFrame for easy viewing.

In [None]:
SIZE_LIMIT = 10000  # stop once the size reaches this limit (exclusive)
tensor_size = 10  # start at size 10
results = []

print("Running with 2D tensors from", tensor_size, "to", SIZE_LIMIT, "square")

# Run the test
while tensor_size < SIZE_LIMIT:
    # Random array
    a = torch.rand(tensor_size, tensor_size)
    b = torch.rand(tensor_size, tensor_size)

    # Time the CPU operation
    cpu_time = timeit.timeit("cpu_dot_product(a, b)", globals=globals(), number=50)

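    # Time the GPU operation
    # First, run the function once without timing it, called the warm-up,
    # so one-time costs (such as CUDA initialization) are not measured.
    # Whether that cost is important or negligible depends on the application.
    # We exclude it here because timeit() measures many repeated runs.
    # Note: each timed call still includes the CPU-to-GPU transfer.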
    gpu_dot_product(a, b)
    # Now we time the actual operation
    gpu_time = timeit.timeit("gpu_dot_product(a, b)", globals=globals(), number=50)

    # Record the results
    results.append(
        {
            "tensor_size": tensor_size,
            "cpu_time": cpu_time,
            "gpu_time": gpu_time,
            "gpu_speedup": cpu_time / gpu_time,  # Greater than 1 means faster on GPU
        }
    )

    # Double tensor_size
    tensor_size = tensor_size * 2

# Done! Convert the results to a DataFrame and print
results_df = pd.DataFrame(results)
print(results_df)

### Dot Product Results

If you left the default sizes, you should see 10 rows of results.
You'll notice that with small tensors the CPU is *faster* than the GPU!
This is also indicated by the **gpu_speedup** being less than 1.

But as the tensor sizes grow, the GPU overtakes the CPU for speed! 🏎️

## Next: Summing a tensor

Your task is to repeat this benchmark below, but computing the sum of a single **1D tensor**.

Use the [`torch.sum()`](https://pytorch.org/docs/stable/generated/torch.sum.html) method.

In [None]:
# Define your methods here

In [None]:
# Conduct your benchmark here

### Tensor sum results

Jot down some thoughts to yourself here about what you saw 📈