# Titan V100 Model
We'll define some system parameters and then define the model.

For simplicity's sake, this model can be broken down into three main inputs:

1. GPU parameters
2. Convolutional Parameters
3. "CUDA" parameters (i.e. how the convolution calculations are divided amongst CUDA cores)

Given the first two inputs, we should be able to define a roofline model. Given the third input, we can then find where our CUDA implementation lies on that model.
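As a rough illustration of the roofline idea, the sketch below caps attainable performance at the lower of the compute peak and the memory-bound rate. The peak uses the 5376 FP32 cores at 1.5 GHz that this notebook works with; the 900 GB/s memory bandwidth and the function name `attainable_flops` are assumptions made for the example, not values taken from this notebook.

```python
def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline bound: the lower of the compute peak and intensity * bandwidth."""
    return min(peak_flops, intensity * mem_bandwidth)


# Placeholder values, not measurements.
peak_flops = 5376 * 1.5e9   # FP32 cores * clock, one op per core per cycle
mem_bandwidth = 900e9       # bytes per second (assumed for illustration)

# Intensity (ops per byte) at which the kernel stops being memory bound.
ridge_point = peak_flops / mem_bandwidth

for intensity in (1.0, ridge_point, 100.0):
    bound = attainable_flops(intensity, peak_flops, mem_bandwidth)
    print("{:8.2f} ops/byte -> {:8.1f} GFLOPS".format(intensity, bound / 1e9))
```

Under these assumed numbers, anything below the ridge point is limited by memory traffic and anything above it by the compute peak.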

Here, we'll define the parameters for our model.

In [52]:
# convolution model parameters

# input feature map (an RGB first layer would have depth 3; this layer uses 64 channels)
input_layers = {
    'dimx': 226,
    'dimy': 226,
    'depth': 64,
}

kernel_parameters = {
    "x": 5,
    "y": 5,
    "depth": input_layers["depth"],
    "padding": 3,
    "stridex": 1,
    "stridey": 1,
}
output_layers = {
    'dimx': 226,
    'dimy': 226,
    'depth': 64,
}

# GPU parameters
system_parameters = {
    "core_frequency": 1.5e9,  # Hz (1.5 GHz)
    "multiply": 1,  # Ops per cycle
    "scratchpad_mem_access": 1,  # Ops per cycle
}

# CUDA parameters
cuda_parameters = {
    "tile_size": 32,
    "block_size": 32,
    "warp_size": 32,
    "num_threads": 1024,
    "num_blocks": 1024,
}
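As a sanity check on the convolution parameters, the standard output-size formula for a non-dilated convolution can be sketched as below. The helper name `conv_output_size` is hypothetical. With the values defined above (input 226, kernel 5, padding 3, stride 1) the formula gives 228 rather than the 226 stated in `output_layers`; a padding of 2 would reproduce 226. The totals computed later read `output_layers` directly, so they are not affected by this.

```python
def conv_output_size(in_size, kernel_size, padding, stride):
    """Output length along one axis of a standard (non-dilated) convolution."""
    return (in_size + 2 * padding - kernel_size) // stride + 1


# Padding as given in kernel_parameters.
print(conv_output_size(226, 5, 3, 1))

# "Same" padding for a 5x5 kernel with stride 1.
print(conv_output_size(226, 5, 2, 1))
```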


Next, we'll make some definitive calculations about our model: the total number of bytes and the total number of operations. These should be constant regardless of how we allocate our problem space in CUDA.

In [53]:
# the definitive total number of bytes for this convolution, starting with the output
total_bytes = output_layers['dimx'] * output_layers['dimy'] * output_layers['depth'] * 4 # 4 bytes per float

# every output channel has its own x * y * depth filter of weights
total_bytes += kernel_parameters['x'] * kernel_parameters['y'] * kernel_parameters['depth'] * output_layers['depth'] * 4 # 4 bytes per float
# the input is read as well
total_bytes += input_layers['dimx'] * input_layers['dimy'] * input_layers['depth'] * 4 # 4 bytes per float

# total_ops = input_layers['dimx'] * input_layers['dimy'] * input_layers['depth'] * kernel_parameters['x'] * kernel_parameters['y'] * kernel_parameters['depth']
total_ops = output_layers['dimx'] * output_layers['dimy'] * output_layers['depth'] * kernel_parameters['x'] * kernel_parameters['y'] * kernel_parameters['depth']

# print the total bytes in MB
print("Total bytes: {} MB".format(total_bytes / 1024 / 1024))

# print the total ops in GFLOP (each multiply is counted as one operation)
print("Total ops: {} GFLOP".format(total_ops / 1000 / 1000 / 1000))

# calculate ops per second

Total bytes: 25.330078125 MB
Total ops: 5.2301824 GFLOP
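The two totals combine into an arithmetic intensity for the whole layer, which is the x-coordinate a roofline plot needs. The sketch below recomputes both totals in a hypothetical helper, `conv_totals`, and divides them. It assumes, as the byte count above does, that the input, the weights, and the output are each moved exactly once; a real kernel with imperfect reuse would move more bytes and land at a lower intensity.

```python
BYTES_PER_FLOAT = 4


def conv_totals(in_dims, out_dims, kernel_xy):
    """Return (total_bytes, total_ops) for one convolution layer.

    in_dims and out_dims are (x, y, depth); kernel_xy is (x, y).
    Each multiply is counted as one operation.
    """
    in_x, in_y, in_d = in_dims
    out_x, out_y, out_d = out_dims
    k_x, k_y = kernel_xy
    weights = k_x * k_y * in_d * out_d
    elements = in_x * in_y * in_d + out_x * out_y * out_d + weights
    total_ops = out_x * out_y * out_d * k_x * k_y * in_d
    return elements * BYTES_PER_FLOAT, total_ops


total_bytes, total_ops = conv_totals((226, 226, 64), (226, 226, 64), (5, 5))
intensity = total_ops / total_bytes
print("Arithmetic intensity: {:.1f} ops/byte".format(intensity))
```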


Our total number of operations and the number of cores utilized give us a rough estimate of the total execution time.

In [54]:
streaming_multiprocessors = 84
blocks_per_multiprocessor = 4
fp32_cores_per_block = 16
total_fp32_cores = streaming_multiprocessors * blocks_per_multiprocessor * fp32_cores_per_block


# operations needed to produce one output channel
flops = kernel_parameters['x'] * kernel_parameters['y'] * kernel_parameters['depth'] * output_layers['dimx'] * output_layers['dimy']
# bytes of weights in one filter (4 bytes per float)
bytes = kernel_parameters['x'] * kernel_parameters['y'] * kernel_parameters['depth'] * 4
print("FLOPS: {}".format(flops))
print("Bytes: {}".format(bytes))
print("FLOPS/Byte: {}".format(flops / bytes))

print("Total FP32 cores: {}".format(total_fp32_cores))

max_theoretical_throughput = total_fp32_cores * system_parameters["core_frequency"]
print("Max theoretical throughput: {} GFLOPS".format(max_theoretical_throughput / 1000 / 1000 / 1000))

# assumes tile_size * block_size (32 * 32 = 1024) lanes, each retiring one op per cycle
execution_time = total_ops / system_parameters["core_frequency"] / cuda_parameters["tile_size"] / cuda_parameters["block_size"]
print("Execution time: {} seconds".format(execution_time))

FLOPS: 81721600
Bytes: 6400
FLOPS/Byte: 12769.0
Total FP32 cores: 5376
Max theoretical throughput: 8064.0 GFLOPS
Execution time: 0.0034050666666666667 seconds
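The execution time above divides the work across `tile_size * block_size` = 1024 parallel lanes, while the throughput figure assumes all 5376 FP32 cores. The sketch below puts both estimates side by side using a hypothetical helper, `estimated_time`. Both are compute-only figures under the notebook's assumption of one operation per core per cycle; memory traffic and scheduling overhead are ignored.

```python
def estimated_time(total_ops, core_frequency, parallel_lanes, ops_per_cycle=1):
    """Compute-only time estimate: total work divided by aggregate op rate."""
    return total_ops / (core_frequency * parallel_lanes * ops_per_cycle)


total_ops = 5_230_182_400   # from the totals computed earlier
core_frequency = 1.5e9      # Hz

# Same assumption as the cell above: 32 * 32 lanes.
t_tile = estimated_time(total_ops, core_frequency, 32 * 32)

# Lower bound if every one of the 5376 FP32 cores were kept busy.
t_all_cores = estimated_time(total_ops, core_frequency, 5376)

print("1024 lanes: {:.6f} seconds".format(t_tile))
print("5376 cores: {:.6f} seconds".format(t_all_cores))
```

The gap between the two numbers shows how much the estimate depends on how many cores the chosen CUDA launch configuration actually occupies.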
