[issue-858] GPU implementation of 911 vertices #880

Merged
NicolasJPosey merged 71 commits into PoseyDevelopment from issue-858-911vertices-gpu-implementation on Mar 12, 2026

@NicolasJPosey NicolasJPosey commented Aug 27, 2025

Closes #858

Description

This PR introduces a GPU implementation of the NG911 model in connection with Nicolas Posey's master's capstone.

A small 911 test file was added to the GPU list in the RunTest.sh file to allow automated regression testing of this new implementation. Documentation has also been added to help with implementing new mirrored GPU implementations from existing, domain-specific CPU implementations (see CpuGpuArchitecture.md). Lastly, documentation of the existing small 911 test file and a new medium 911 test file has been added to start documenting important configuration information for existing regression test files (see RegressionTestDocumentation.md).

Checklist (Mandatory for new features)

  • Added Documentation
  • Added Unit Tests

Testing (Mandatory for all changes)

  • GPU Test: test-medium-connected.xml Passed
  • GPU Test: test-large-long.xml Passed


Change due to allowing for noise when using less than 100 vertices in GPUModel.cpp

@NicolasJPosey NicolasJPosey added the documentation, enhancement, GPU, and NG911 labels Feb 16, 2026
@NicolasJPosey NicolasJPosey requested a review from stiber February 16, 2026 18:47
@stiber stiber linked an issue Feb 17, 2026 that may be closed by this pull request
@stiber stiber left a comment

Two docs items.

@NicolasJPosey NicolasJPosey marked this pull request as ready for review March 12, 2026 04:26
@NicolasJPosey NicolasJPosey requested a review from stiber March 12, 2026 04:27
@NicolasJPosey NicolasJPosey merged commit f7e7fe1 into PoseyDevelopment Mar 12, 2026
2 checks passed
NicolasJPosey added a commit that referenced this pull request Mar 13, 2026
* [issue-723] 911 edges GPU implementation (#867)

* Use vector of char for consistency with neuro

* Move setEdgeClassID method down into AllNeuroEdges since it's not used in 911 edges

* Initial implementation of 911 edges for GPU

* Make public explicitly

* Fix clang format issue

* Another try at a clang fix

* Manually release array memory from stack on device copy

This is to prevent a segmentation fault due to a stack overflow from the array declarations when running large graphs

* Expand vertex type map reported in debug log

* Resolve post merge changes

* Update GPU results with updates from PR 877

* [issue-858] GPU implementation of 911 vertices (#880)

* Add loadEpochInputs to OperationManager

Adds support for two uint64_t argument functions so that loadEpochInputs can be registered and called from the OperationManager class.
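
A minimal sketch of what registering a two-argument operation could look like. The class name `TwoArgOperationRegistry`, the `Operation` enum, and the method names are hypothetical and are not taken from the actual OperationManager; the meaning of the two `uint64_t` arguments is also left open here.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical operation identifiers; the real enum is not shown in this PR.
enum class Operation { loadEpochInputs };

// Minimal registry for callbacks that take two uint64_t arguments.
class TwoArgOperationRegistry {
public:
    using Callback = std::function<void(uint64_t, uint64_t)>;

    // Store a callback under the given operation.
    void registerOperation(Operation op, Callback fn)
    {
        callbacks_[op].push_back(std::move(fn));
    }

    // Invoke every callback registered for the operation, in registration order.
    void executeOperation(Operation op, uint64_t first, uint64_t second) const
    {
        auto it = callbacks_.find(op);
        if (it == callbacks_.end()) {
            return;   // nothing registered: do nothing
        }
        for (const auto &fn : it->second) {
            fn(first, second);
        }
    }

private:
    std::map<Operation, std::vector<Callback>> callbacks_;
};
```

A caller would register a lambda or bound member function once and later trigger it with `executeOperation(Operation::loadEpochInputs, a, b)`.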

* Add vertices device struct to 911 class

* Add total number of events data member to InputManager

Added a data member for storing the total number of events that are read into the InputManager. This allows us to define vector capacities based on the number of events being simulated.

* Initial CPU GPU architecture documentation

* Refactor of loadEpochInputs to support loading inputs to GPU

AllVertices now has a non-virtual loadEpochInputs method. This calls two virtual methods, one for loading the epoch inputs and the other for copying the inputs to the GPU. The default behavior for both is to do nothing.
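
The structure described here resembles the non-virtual interface pattern. The sketch below illustrates it under assumed names: `VerticesSketch`, `Vertices911Sketch`, the hook names, and the parameter names are hypothetical, and the real AllVertices signatures may differ.

```cpp
#include <cstdint>
#include <string>
#include <vector>

class VerticesSketch {
public:
    virtual ~VerticesSketch() = default;

    // Non-virtual entry point: load on the host first, then copy to the device.
    void loadEpochInputs(uint64_t currentStep, uint64_t endStep)
    {
        loadEpochInputsImpl(currentStep, endStep);
        copyEpochInputsToDevice();
    }

protected:
    // Both hooks default to doing nothing, so models without inputs need no code.
    virtual void loadEpochInputsImpl(uint64_t, uint64_t) {}
    virtual void copyEpochInputsToDevice() {}
};

// A model that overrides both hooks and records the order in which they ran.
class Vertices911Sketch : public VerticesSketch {
public:
    std::vector<std::string> log;

protected:
    void loadEpochInputsImpl(uint64_t currentStep, uint64_t endStep) override
    {
        log.push_back("load " + std::to_string(currentStep) + "-" + std::to_string(endStep));
    }

    void copyEpochInputsToDevice() override
    {
        log.push_back("copy");
    }
};
```

Keeping the entry point non-virtual fixes the order of the two steps in one place, while a CPU-only subclass can simply leave the copy hook at its empty default.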

* Refactor getEdgeToClosestResponder method to be an All911Vertices method instead of a connections method

This method makes more sense as a vertices behavior because it also needs to run on the GPU.

* Forgot CPU code

* Partial GPU implementation (still incomplete)

* Remove reserve call since RecordableVector doesn't implement it

* Refactor internal vector use in PSAP and RESP advance logic

We need a dynamically sized array, so we use a vector instead of an array. But we want the implementation to be easily mirrored on the GPU, so we interact with the vector as we would with an array.

* Convert call metrics to EventBuffers and swap push back for insert event call

The push back call was not easy to mirror on the GPU. We already have examples of using the EventBuffer and insertEvent call on the GPU so we change to use this implementation. This also allows us to make it clear what size the buffer should be, again helping the mirrored GPU implementation.
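
To illustrate why an insert-into-preallocated-storage call is easier to mirror on a device than `push_back`, here is a simplified stand-in. `FixedEventBuffer` is a hypothetical class written for this example; it is not the project's EventBuffer and only borrows the `insertEvent` name from the text above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Storage is allocated once, up front; inserting never reallocates.
class FixedEventBuffer {
public:
    explicit FixedEventBuffer(std::size_t capacity) : data_(capacity, 0), count_(0) {}

    // Write into the next free slot. Returns false instead of growing when full.
    bool insertEvent(uint64_t timeStep)
    {
        if (count_ == data_.size()) {
            return false;
        }
        data_[count_++] = timeStep;
        return true;
    }

    std::size_t size() const { return count_; }
    std::size_t capacity() const { return data_.size(); }
    uint64_t operator[](std::size_t index) const { return data_[index]; }

private:
    std::vector<uint64_t> data_;   // fixed-size backing array
    std::size_t count_;            // number of slots in use
};
```

Because the capacity is stated at construction, the same number can be used to size the matching device allocation.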

* Replace numeric bool with actual bool for readability

* Change vector type from RecordableVector to EventBuffer

This allows us to remove the resize calls which we don't want to do on the GPU. Also added a DoubleEventBuffer to use in place of RecordableVector<double>.

* Bug fix for copying spike histories from device

The correct pattern is to first copy the device pointers to the CPU and then copy the values into the CPU data members. A uint64_t and a pointer to uint64_t happen to occupy the same number of bytes. However, if this pattern is repeated for a type like float, an illegal memory error is thrown.
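
One possible reading of this note is that a copy sized by the value type only matched the size of the pointer table by coincidence. The host-only sketch below shows that coincidence; the helper names are hypothetical, no CUDA calls are involved, and the equality for `uint64_t` assumes a platform with 8-byte pointers.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes needed to hold `count` device pointers of type T*.
template <typename T> constexpr std::size_t pointerTableBytes(std::size_t count)
{
    return count * sizeof(T *);
}

// Bytes requested if the copy is sized by the value type instead of the pointer type.
template <typename T> constexpr std::size_t valueSizedBytes(std::size_t count)
{
    return count * sizeof(T);
}

// True when sizing by value happens to equal sizing by pointer for type T.
template <typename T> constexpr bool sizesCoincide(std::size_t count)
{
    return pointerTableBytes<T>(count) == valueSizedBytes<T>(count);
}
```

On a typical 64-bit system, `sizesCoincide<uint64_t>(n)` holds while `sizesCoincide<float>(n)` does not, which is consistent with the error appearing only once the pattern was reused for float.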

* Add a guard and debugging message for GPU random noise

The GPU noise array only works when numVertices is at least 100 and a multiple of 100. Otherwise, an invalid kernel configuration error is thrown, which masks other possible errors.

* Updates to support copying to and from GPU

* Support for copying to and from GPU and make type float for now

* Add GPU 911 vertices to make list

* Implementation runs but results aren't quite right

* Fix case-sensitive copying of call responder types

* Fix bug using wrong size for queue length and utilization histories

* Remove debugging printfs and replace asserts with printfs for errors

Having asserts in kernels can cause them to fail silently. Printing a message and returning is a better way to fail inside kernels.

* General cleanup

Clean up of commented out code, unnecessary extra variables, and unused methods.

* Free the array used to determine available servers and units in kernels.

* Re-add support for getting dropped calls

The change in vertex creation that gives each vertex a same-sized data member for the GPU meant that we would never get a dropped call, because the queues were too large. The logic where we interact with vertex queues in PSAPs and RESPs was changed to act as if the size equals the number of trunks, which was the original implementation size.

* Fix error if a dropped call is found after the first epoch

RecordableVectors of the dynamic type are cleared after each epoch, but the size is not reset. We need subscript operator access for droppedCalls, so the type must be constant, which does not clear the vector after each epoch.

* Support for noise in 911 models

The current implementation for generating noise on a device has some assumptions that break for the 911 GPU model. To get noise support for 911, we implemented a way for vertices to specify how many noise elements they need. A method was then implemented in GPUModel that rounds this count up to the nearest multiple of 100.
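
The rounding step can be sketched with integer arithmetic. The function name `roundUpToMultiple` is hypothetical and is not the GPUModel method; the sketch also ignores overflow for counts near the maximum `uint64_t` value.

```cpp
#include <cstdint>

// Round `count` up to the nearest multiple of `multiple` (100 by default).
// A count that is already a multiple is returned unchanged, and 0 stays 0.
constexpr uint64_t roundUpToMultiple(uint64_t count, uint64_t multiple = 100)
{
    return ((count + multiple - 1) / multiple) * multiple;
}
```

For example, a graph asking for 7 noise elements would be given 100, and one asking for 101 would be given 200.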

* Add assert for random number thread count

* Add support for using noise to simulate attempted redials

Because only caller regions simulate attempted redials, we add a vector to map the caller region vertex IDs to the noise array on the device. This allows us to use the existing noise algorithm with larger graphs since we can only generate noise for up to 10000 vertices.
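
One way to build such a mapping is sketched below, assuming each vertex has a known kind and that non-caller vertices get a sentinel value. `VertexKind`, `NO_NOISE_SLOT`, and `buildNoiseIndexMap` are hypothetical names, and the direction of the mapping (vertex ID to noise slot) is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical vertex kinds for illustration only.
enum class VertexKind { CallerRegion, Psap, Responder };

// Sentinel for vertices that never draw noise.
constexpr int32_t NO_NOISE_SLOT = -1;

// For each vertex ID, return its slot in the noise array, or NO_NOISE_SLOT.
// Only caller regions consume slots, so the noise array can be much smaller
// than the total vertex count.
std::vector<int32_t> buildNoiseIndexMap(const std::vector<VertexKind> &kinds)
{
    std::vector<int32_t> slots(kinds.size(), NO_NOISE_SLOT);
    int32_t nextSlot = 0;
    for (std::size_t vertex = 0; vertex < kinds.size(); ++vertex) {
        if (kinds[vertex] == VertexKind::CallerRegion) {
            slots[vertex] = nextSlot++;
        }
    }
    return slots;
}
```

With this layout, the noise array only needs as many elements as there are caller regions, which is what lets larger graphs stay under the noise generator's vertex limit.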

* Fix isFull error message to show right buffer size

* Fix bug with waiting queue check

If the number of trunks and servers are equal and the queue is full, capacity minus busy servers is negative. Since dstQueueSize is of type uint64_t, it can't be negative. The comparison then gives a false positive that the queue is not full. The fix is to cast the size to an int so that the right comparison is done.
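
The signed/unsigned pitfall described here can be reproduced in isolation. In the sketch below, the function names and the `waitingCapacity` parameter are hypothetical, and the fix casts to `int64_t` rather than `int` purely for the illustration.

```cpp
#include <cstdint>

// Buggy check. Writing `queueSize >= waitingCapacity` directly would convert
// the int to uint64_t implicitly; the explicit cast shows what that does.
// A waitingCapacity of -1 becomes 18446744073709551615, so the queue never
// looks full.
bool isFullBuggy(uint64_t queueSize, int waitingCapacity)
{
    return queueSize >= static_cast<uint64_t>(waitingCapacity);
}

// Fixed check: compare in a signed type so negative capacities stay negative.
bool isFullFixed(uint64_t queueSize, int waitingCapacity)
{
    return static_cast<int64_t>(queueSize) >= static_cast<int64_t>(waitingCapacity);
}
```

With a queue size of 4 and a waiting capacity of -1, the buggy version reports "not full" while the fixed version reports "full".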

* Debugging statements for memory analysis

* Add some larger 911 graphs

* Updates to history to support less memory usage on GPU

The call metrics account for the vast majority of the physical memory used by the GPU. By resizing each to a smaller value, we can fit larger graphs on the GPU by using more epochs with smaller steps per epoch.

* Fix firing rate value

Firing rate should actually be equal to 1 since we can have at most 1 call per second.

* Fix issue using wrong buffer size

The buffer size used for a CircularBuffer is 1 more than the capacity passed into the constructor. When we construct the buffer, we pass in the number of trunks but were effectively using 1 less during the simulation.
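
A common reason for a circular buffer to allocate one slot more than its capacity is to keep one slot empty so that "full" and "empty" can be told apart. The sketch below assumes that design; `RingBufferSketch` is a hypothetical class and is not the project's CircularBuffer.

```cpp
#include <cstddef>
#include <vector>

class RingBufferSketch {
public:
    // The backing array holds capacity + 1 slots; one slot always stays unused.
    explicit RingBufferSketch(std::size_t capacity)
        : buffer_(capacity + 1), front_(0), end_(0) {}

    std::size_t bufferSize() const { return buffer_.size(); }
    bool isEmpty() const { return front_ == end_; }
    bool isFull() const { return (end_ + 1) % buffer_.size() == front_; }

    // Number of stored elements, computed from the end index and the front index.
    std::size_t size() const
    {
        return (end_ + buffer_.size() - front_) % buffer_.size();
    }

    bool put(int value)
    {
        if (isFull()) {
            return false;
        }
        buffer_[end_] = value;
        end_ = (end_ + 1) % buffer_.size();
        return true;
    }

    bool get(int &out)
    {
        if (isEmpty()) {
            return false;
        }
        out = buffer_[front_];
        front_ = (front_ + 1) % buffer_.size();
        return true;
    }

private:
    std::vector<int> buffer_;
    std::size_t front_;   // index of the oldest element
    std::size_t end_;     // index one past the newest element
};
```

Any code that indexes the backing array, including device-side allocation and copy sizes, has to use `bufferSize()` (capacity + 1) for the modular arithmetic, not the capacity passed to the constructor.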

* Fix getting front index when we want end index for queue length calculation

* Add back in random redial attempt

* More updates to reduce memory usage

Metrics that used totalNumberOfEvents and totalTimeSteps were using more memory than needed. These were changed to maxEventsPerEpoch and stepsPerEpoch respectively. Also changed copyTo and copyFrom in All911Edges to use heap memory to prevent stack overflows with large graphs.

* Fix bug with vertex queue size

The buffer inside the CircularBuffer implementation is 1 larger than the capacity set at construction. VertexQueues are CircularBuffers so we add 1 where we use the buffer size.

* Another CircularBuffer size bug fix

Fixed allocation, copyTo, and copyFrom for VertexQueues. They are CircularBuffers which internally have a buffer that is 1 more than the capacity. The sizes used were updated to be 1 more than the stepsPerEpoch to match the construction capacity.

* Fix firing rate and change epoch parameters to reduce memory

Memory is mostly dependent on epoch duration so we decrease that parameter and increase the number of epochs parameter by the same factor. This keeps the total time steps constant but reduces memory usage. We can only have 1 call per step so the max firing rate should be 1.
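
The trade-off can be checked with simple arithmetic. The sketch below assumes a history buffer with one entry per vertex per step in an epoch; `EpochConfig`, the function names, and any numbers used with them are hypothetical and are not taken from the actual configuration files.

```cpp
#include <cstdint>

struct EpochConfig {
    uint64_t numEpochs;
    uint64_t stepsPerEpoch;
};

// Total simulated steps across all epochs.
constexpr uint64_t totalSteps(EpochConfig config)
{
    return config.numEpochs * config.stepsPerEpoch;
}

// Bytes for one per-step history buffer per vertex, sized by the epoch length.
constexpr uint64_t historyBytes(EpochConfig config, uint64_t numVertices, uint64_t bytesPerEntry)
{
    return numVertices * config.stepsPerEpoch * bytesPerEntry;
}

// Shorten each epoch by `factor` and run `factor` times as many epochs.
// Assumes stepsPerEpoch is divisible by factor.
constexpr EpochConfig shortenEpochs(EpochConfig config, uint64_t factor)
{
    return EpochConfig{config.numEpochs * factor, config.stepsPerEpoch / factor};
}
```

Shortening epochs by a factor of 4 leaves `totalSteps` unchanged while dividing `historyBytes` by 4, under the stated assumption about how the history is sized.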

* Add an approximate statewide, month-long configuration

* General cleanup and adding of comments

* GPU Optimizations

Remove some branching and make changes to reduce amount of register usage.

* Dataset updates

* Timing adds, documentation, and updates

* Add regression testing documentation markdown

* Update after changing Abandoned and QueueLength history types

* Add small 911 test to regression script

* Add larger 911 test

This corresponds to Dataset A in the Posey capstone report.

* Remove testing datasets

* Remove temp timing changes

* Correct how 2D arrays are copied from device to host

* Add noise state logging for debugging

* Noise is now generated and used for graphs with fewer than 100 vertices

* Fix formatting

* Try another clang fix

* Try to fix clang in function node file

* clang fix attempt

* more clang

* clang

* Clean up and port some optimizations to CPU

* clang formatting

* Rename GPU documentation file

* Remove trivial example and rewrite to clarify design of CPU implementation relative to GPU
@NicolasJPosey NicolasJPosey deleted the issue-858-911vertices-gpu-implementation branch March 13, 2026 17:31

Labels

documentation, enhancement, GPU, NG911

Development

Successfully merging this pull request may close these issues.

Create GPU Implementation of NG911 Vertices

2 participants