# Protocol: Multiple VTs Sharing a Kernel

Date: 16.09.2021

## Question

Does it make a difference whether VTs are executed sequentially, with only one VT per kernel, or in parallel as blocks of a single kernel?

## Hypothesis

Both variants yield the same results. Parallel execution should perform better, as only one kernel setup is needed.

## Setup

- GPU: NVIDIA GeForce RTX 2080 Ti
- Program: `main` branch, commit e160572
- Model: Waypoints model
- CUDA_FLAGS: `-DGRAPPLE_MODEL=WaypointsState`

## Implementation

First, we compile with the additional CUDA_FLAG `-DGRAPPLE_VTS=1` and execute with `-n 250`. Result:

```
$ time ./build/grapple -s 1736331306 -n 250
run,block,thread,state,uniques,visited,visited_percent,vts,total_visited
...
249,,,,,1.95865e+07,0.456033,62500,254321

real    0m26.691s
user    0m26.424s
sys     0m0.204s
```

Cumulative Total Visited States across all 250 runs: 63568343

Full output data is available at [EXP-03-shared-kernel-1.csv](./data/EXP-03-shared-kernel-1.csv).

---

Then, we compile with the additional CUDA_FLAG `-DGRAPPLE_VTS=250` and execute with `-n 1`. Result:

```
$ time ./build/grapple -s 1736331306 -n 1
run,block,thread,state,uniques,visited,visited_percent,vts,total_visited
0,,,,,1.93387e+07,0.450264,250,63568498

real    0m0.754s
user    0m0.506s
sys     0m0.184s
```

Full output data is available at [EXP-03-shared-kernel-2.csv](./data/EXP-03-shared-kernel-2.csv).
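
The summary row of such an output can be read programmatically. The following is a minimal sketch that parses the output of the second experiment, copied verbatim from above. It assumes that rows with empty `block`, `thread`, and `state` fields are per-run summaries; the layout of the detail rows is not shown in this protocol, and the helper name `summary_rows` is hypothetical.

```python
import csv
import io

# Output of the second experiment, copied verbatim from the run above.
OUTPUT = """\
run,block,thread,state,uniques,visited,visited_percent,vts,total_visited
0,,,,,1.93387e+07,0.450264,250,63568498
"""


def summary_rows(text):
    """Yield rows whose block/thread/state fields are all empty.

    Assumption: such rows are the per-run summaries.
    """
    for row in csv.DictReader(io.StringIO(text)):
        if not (row["block"] or row["thread"] or row["state"]):
            yield row


rows = list(summary_rows(OUTPUT))
summary = rows[-1]  # last summary row of the output

total_visited = int(summary["total_visited"])
coverage_percent = float(summary["visited_percent"])
print(total_visited, coverage_percent)
```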

## Evaluation

The number of Unique Visited States, and thus the state space coverage, is highly similar in both experiments: about 0.45% each (0.456033% versus 0.450264%).

The number of Total Visited States is also highly similar: both experiments visit about 63568000 states (63568343 versus 63568498).
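
To make "highly similar" concrete, the differences can be quantified from the numbers reported in the two runs. This is a small sketch using only values copied from the outputs above; the helper `rel_diff` is introduced here for illustration.

```python
# Values copied from the two experiment outputs above.
coverage_seq, coverage_par = 0.456033, 0.450264  # visited_percent column
total_seq, total_par = 63568343, 63568498        # Total Visited States


def rel_diff(a, b):
    """Relative difference of two positive numbers, measured against the larger one."""
    return abs(a - b) / max(a, b)


total_abs = abs(total_seq - total_par)
coverage_abs = abs(coverage_seq - coverage_par)

print(f"total visited: {total_abs} states apart, "
      f"{rel_diff(total_seq, total_par):.2e} relative")
print(f"coverage: {coverage_abs:.6f} percentage points apart, "
      f"{rel_diff(coverage_seq, coverage_par):.2%} relative")
```

The Total Visited States differ by 155 states, and the coverage values differ by well under 0.01 percentage points.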

The second experiment is more than 35x faster in wall-clock time (0.754 s versus 26.691 s).
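
The speedup follows directly from the `time` outputs above. The sketch below computes it for both the `real` and the `user` time; note that each value comes from a single measurement of the whole process, so the ratios are indicative rather than exact.

```python
# Timings in seconds, copied from the `time` outputs above.
real_seq, real_par = 26.691, 0.754  # wall-clock time
user_seq, user_par = 26.424, 0.506  # user CPU time

speedup_real = real_seq / real_par
speedup_user = user_seq / user_par

print(f"wall-clock speedup: {speedup_real:.1f}x")  # prints 35.4x
print(f"user-time speedup:  {speedup_user:.1f}x")  # prints 52.2x
```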

In the paper, presumably only one VT is executed per *grid*, which results in one CUDA kernel per VT.

## Conclusion, Discussion

The experiment shows that executing multiple VTs inside a single CUDA kernel greatly reduces the execution time while producing nearly equal results. VTs should therefore be executed in batches whenever possible.

Thus, our hypothesis is confirmed.