# CUDA Neural Network Demo: Example from the Current LHCb HLT1 Trigger Production Code

This repository demonstrates GPU-accelerated inference for a fully connected neural network with a single hidden layer, implemented in CUDA for NVIDIA GPUs.

## Key Features

- **GPU-accelerated inference**: Neural network evaluation runs entirely on GPU using CUDA
- **Template-based design**: Compile-time configuration of input size and hidden nodes
- **Optimized CUDA kernels**: Uses loop unrolling and fast math operations for performance
- **Random model generation**: Utility to generate random weights and biases for testing
- **Test harness**: Checks that all outputs lie in [0, 1] and prints min/max/average statistics

## Neural Network Architecture

The implementation features a single-layer fully connected neural network with:
- **Input Layer**: Configurable size (template parameter)
- **Hidden Layer**: Single fully connected layer with ReLU activation
- **Output Layer**: Single neuron with sigmoid activation (0-1 output range)
- **Data preprocessing**: Input normalization using mean and standard deviation


## Table of Contents

1. [Introduction and Architecture](#1-introduction-and-architecture)
2. [Project Structure](#2-project-structure)
3. [Core Implementation - demo.cu](#3-core-implementation---democu)
4. [Test Program - main.cu](#4-test-program---maincu)
5. [Random Model Generator](#5-random-model-generator)
6. [Build System Configuration](#6-build-system-configuration)
7. [Mock Dependencies](#7-mock-dependencies)
8. [JSON Model Format](#8-json-model-format)
9. [Compilation and Usage](#9-compilation-and-usage)
10. [Performance Analysis](#10-performance-analysis)
11. [Allen Framework Background](#11-allen-framework-background)
12. [Complete Example Workflow](#12-complete-example-workflow)


## 1. Introduction and Architecture

### What is a Neural Network?

A neural network is a computational model inspired by biological neural networks. It consists of interconnected nodes (neurons) organized in layers that transform input data to produce desired outputs through learned parameters (weights and biases).

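### GPU Acceleration for Neural Networks

Evaluating the network for one input is independent of every other input, so inference parallelizes naturally: in this demo each CUDA thread evaluates the full network for one input vector, and the model parameters are copied to GPU memory once when the model is loaded.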

### Architecture Overview

```
Input (4 features)
    ↓ normalization: (x - mean) / std
    ↓ fully connected: weights1[8][4] + bias1[8]
Hidden Layer (8 neurons, ReLU activation)
    ↓ fully connected: weights2[8] + bias2
Output Layer (1 neuron, Sigmoid activation)
    ↓
Final Output (0.0 to 1.0)
```

### Mathematical Formulation

The forward pass computes:

1. **Input Normalization**: `x_norm[i] = (x[i] - mean[i]) / std[i]`
2. **Hidden Layer**: `h[j] = ReLU(Σ(x_norm[i] * weights1[j][i]) + bias1[j])`
3. **Output Layer**: `output = Sigmoid(Σ(h[j] * weights2[j]) + bias2)`

Where:
- `ReLU(x) = max(0, x)`
- `Sigmoid(x) = 1 / (1 + exp(-x))`
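
The three steps can be checked on the CPU with a small reference function. The following is a minimal sketch, not code from the repository: the name `forward_reference` and the use of `std::vector` are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side reference for the forward pass (hypothetical helper, not part of the repository).
// weights1 is indexed as weights1[j][i]: hidden node j, input feature i.
float forward_reference(const std::vector<float>& x,
                        const std::vector<float>& mean,
                        const std::vector<float>& stddev,
                        const std::vector<std::vector<float>>& weights1,
                        const std::vector<float>& bias1,
                        const std::vector<float>& weights2,
                        float bias2) {
    assert(x.size() == mean.size() && x.size() == stddev.size());
    assert(weights1.size() == bias1.size() && weights1.size() == weights2.size());

    // Step 1: input normalization
    std::vector<float> x_norm(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        x_norm[i] = (x[i] - mean[i]) / stddev[i];
    }

    // Steps 2 and 3: hidden layer with ReLU, accumulated directly into the output sum
    float z = bias2;
    for (std::size_t j = 0; j < weights1.size(); ++j) {
        assert(weights1[j].size() == x.size());
        float a = bias1[j];
        for (std::size_t i = 0; i < x.size(); ++i) {
            a += x_norm[i] * weights1[j][i];
        }
        z += std::max(0.0f, a) * weights2[j];
    }

    // Sigmoid of the output-layer sum
    return 1.0f / (1.0f + std::exp(-z));
}
```

As a worked example, take `x = [1, 2]`, `mean = [1, 0]`, `std = [1, 2]`, `weights1 = [[1, 1], [-1, -1]]`, zero hidden biases, `weights2 = [2, 5]` and `bias2 = -2`. The normalized input is `[0, 1]`, the hidden activations are `[1, 0]`, and the output-layer sum is `0`, so the function returns `0.5`.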


## 2. Project Structure

The repository contains the following files:

```
gpu-demo/
├── demo.cu                 # Main neural network implementation
├── main.cu                 # Test program with CUDA kernel
├── nn_gen/
│   └── json_generator.cpp  # Utility to generate random model parameters
├── mock_dependencies.h     # Mock implementations for Allen framework
├── MVAModelsManager.h      # Header with framework includes
├── CMakeLists.txt         # Build system configuration
├── .gitignore            # Git ignore patterns
└── README.md             # Project documentation
```

### File Descriptions

| File | Purpose | Language |
|------|---------|----------|
| `demo.cu` | Core neural network implementation with CUDA kernels | CUDA C++ |
| `main.cu` | Test harness and example usage | CUDA C++ |
| `json_generator.cpp` | Random model parameter generation | C++ |
| `mock_dependencies.h` | Allen framework compatibility layer | C++ |
| `MVAModelsManager.h` | Framework integration headers | C++ |
| `CMakeLists.txt` | CMake build configuration | CMake |


## 3. Core Implementation - demo.cu

The `demo.cu` file contains the heart of the neural network implementation. Let's examine each component:

### 3.1 Headers and Namespace


In [None]:
// Main neural network implementation from demo.cu
#pragma once
#include "MVAModelsManager.h"
#include <nlohmann/json.hpp>

namespace Allen::MVAModels {

// Structure to hold single layer neural network data
struct SingleLayerData {
    std::vector<float> mean;
    std::vector<float> std;
    std::vector<std::vector<float>> weights1;
    std::vector<float> fweights1;  // Flattened weights1
    std::vector<float> bias1;
    std::vector<float> weights2;
    float bias2;
};

// Function to read JSON model parameters
inline SingleLayerData readSingleLayerJSON(std::string full_path) {
    SingleLayerData to_copy;

    nlohmann::json j;
    {
        std::ifstream i(full_path);
        j = nlohmann::json::parse(i);
    }

    // Parse JSON fields
    using array1d_t = std::vector<float>;
    using array2d_t = std::vector<std::vector<float>>;

    to_copy.mean = j.at("mean").get<array1d_t>();
    to_copy.std = j.at("std").get<array1d_t>();
    to_copy.weights1 = j.at("weights1").get<array2d_t>();
    to_copy.bias1 = j.at("bias1").get<array1d_t>();
    to_copy.weights2 = j.at("weights2").get<array1d_t>();
    to_copy.bias2 = j.at("bias2").get<float>();

    // Sanity checks (assert is compiled out when NDEBUG is defined, e.g. in Release builds)
    assert(to_copy.mean.size() == j.at("num_input").get<int>());
    assert(to_copy.std.size() == j.at("num_input").get<int>());
    assert(to_copy.weights1.size() == j.at("num_node").get<int>() && 
           to_copy.weights1.front().size() == j.at("num_input").get<int>());
    assert(to_copy.bias1.size() == j.at("num_node").get<int>());
    assert(to_copy.weights2.size() == j.at("num_node").get<int>());

    // Flatten 2D weights array for GPU transfer
    for (const auto& innerVec : to_copy.weights1) {
        to_copy.fweights1.insert(to_copy.fweights1.end(), 
                                innerVec.begin(), innerVec.end());
    }

    return to_copy;
}
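
Because `assert` is disabled in Release builds, a malformed model file could pass the sanity checks unnoticed there. One possible alternative is to throw on a shape mismatch while flattening. This is a sketch under that assumption; `flatten_checked` is a hypothetical helper, not a function from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: flattens weights[num_node][num_input] row by row and
// throws on a shape mismatch, so the check also runs in Release builds.
std::vector<float> flatten_checked(const std::vector<std::vector<float>>& weights,
                                   std::size_t num_node,
                                   std::size_t num_input) {
    if (weights.size() != num_node) {
        throw std::invalid_argument("weights1: unexpected number of rows");
    }
    std::vector<float> flat;
    flat.reserve(num_node * num_input);
    for (const auto& row : weights) {
        if (row.size() != num_input) {
            throw std::invalid_argument("weights1: unexpected row length");
        }
        // Row-major order: all inputs of node 0, then all inputs of node 1, ...
        flat.insert(flat.end(), row.begin(), row.end());
    }
    assert(flat.size() == num_node * num_input);
    return flat;
}
```

An exception thrown here would propagate to the `try`/`catch` around `readData` in the test program, which already reports `std::exception` messages.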

### 3.2 Device Neural Network Structure

The device structure is a template, so the input size and the number of hidden nodes are compile-time constants and every parameter array has a fixed size:


In [None]:
// Template-based device neural network structure
template <unsigned num_input, unsigned num_node>
struct DeviceSingleLayerFCNN {
    constexpr static unsigned nInput = num_input;
    constexpr static unsigned nNode = num_node;

    // Data preprocessing parameters
    float mean[nInput];
    float std[nInput];

    // Model parameters
    float weights1[nNode][nInput];
    float bias1[nNode];
    float weights2[nNode];
    float bias2;

    // Main evaluation function (defined later)
    __device__ inline float evaluate(float* input) const;
};
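
Since every data member is a `float` array (or a single `float`), the structure should be densely packed, and the flattened row-major weights can be copied straight into `weights1[nNode][nInput]`. The sketch below checks this on the host. `HostLayoutMirror`, `parameter_count`, and `flat_index` are illustrative names, and the mirror only reproduces the data members, not the `evaluate` function.

```cpp
#include <cassert>
#include <cstddef>

// Host-only mirror of the device struct's data members (hypothetical name),
// used here only to inspect size and layout without a GPU.
template <unsigned num_input, unsigned num_node>
struct HostLayoutMirror {
    float mean[num_input];
    float stddev[num_input];  // corresponds to the member called `std` in the original
    float weights1[num_node][num_input];
    float bias1[num_node];
    float weights2[num_node];
    float bias2;
};

// Number of float parameters: 2*num_input + num_node*num_input + 2*num_node + 1
constexpr std::size_t parameter_count(unsigned num_input, unsigned num_node) {
    return 2u * num_input + static_cast<std::size_t>(num_node) * num_input + 2u * num_node + 1u;
}

// All members are floats, so no padding is expected between them.
static_assert(sizeof(HostLayoutMirror<4, 8>) == parameter_count(4, 8) * sizeof(float),
              "struct should be densely packed");

// Position of weights1[node][input] inside the flattened row-major array.
inline std::size_t flat_index(std::size_t node, std::size_t input, std::size_t num_input) {
    assert(input < num_input);
    return node * num_input + input;
}
```

For the 4-input, 8-node configuration used in the test program this gives 57 floats in total.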

### 3.3 Host Neural Network Class

The host class manages GPU memory and model loading:


In [None]:
// Host neural network class template
template <unsigned num_input, unsigned num_node>
struct SingleLayerFCNN : public MVAModelBase {
    using DeviceType = DeviceSingleLayerFCNN<num_input, num_node>;

    SingleLayerFCNN(std::string name, std::string path) 
        : MVAModelBase(name, path) {
        m_device_pointer = nullptr;
    }

    const DeviceType* getDevicePointer() const {
        return m_device_pointer;
    }

    void readData(std::string parameters_path) override {
        auto data_to_copy = readSingleLayerJSON(parameters_path + m_path);

        // Allocate GPU memory
        Allen::malloc((void**)&m_device_pointer, sizeof(DeviceType));

        // Calculate memory sizes
        constexpr auto size_mean = DeviceType::nInput * sizeof(float);
        constexpr auto size_std = DeviceType::nInput * sizeof(float);
        constexpr auto size_weights1 = (DeviceType::nNode * DeviceType::nInput) * sizeof(float);
        constexpr auto size_bias1 = DeviceType::nNode * sizeof(float);
        constexpr auto size_weights2 = DeviceType::nNode * sizeof(float);
        constexpr auto size_bias2 = sizeof(float);

        // Copy data to GPU
        Allen::memcpy(m_device_pointer->mean, data_to_copy.mean.data(), 
                     size_mean, Allen::memcpyHostToDevice);
        Allen::memcpy(m_device_pointer->std, data_to_copy.std.data(), 
                     size_std, Allen::memcpyHostToDevice);
        Allen::memcpy(m_device_pointer->weights1, data_to_copy.fweights1.data(), 
                     size_weights1, Allen::memcpyHostToDevice);
        Allen::memcpy(m_device_pointer->bias1, data_to_copy.bias1.data(), 
                     size_bias1, Allen::memcpyHostToDevice);
        Allen::memcpy(m_device_pointer->weights2, data_to_copy.weights2.data(), 
                     size_weights2, Allen::memcpyHostToDevice);
        Allen::memcpy(&m_device_pointer->bias2, &data_to_copy.bias2, 
                     size_bias2, Allen::memcpyHostToDevice);
    }

private:
    DeviceType* m_device_pointer;
};

### 3.4 Activation Functions

ReLU and sigmoid are implemented as inline CUDA device functions; the sigmoid uses the fast `__expf` intrinsic:


In [None]:
// Activation functions namespace
namespace ActivateFunction {
    // Rectified Linear Unit
    __device__ inline float relu(const float x) {
        return x > 0 ? x : 0;
    }

    // Sigmoid activation function
    __device__ inline float sigmoid(const float x) {
        return 1.0f / (1.0f + __expf(-x));
    }
}
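
For unit tests that run without a GPU, host-side counterparts of the two functions can be useful. The following is a sketch with hypothetical names (`relu_host`, `sigmoid_host`, `close_to_reference`). It uses `std::exp` in place of the device-only `__expf`, so small differences from GPU results are to be expected, and the default tolerance is an arbitrary choice.

```cpp
#include <cassert>
#include <cmath>

// Host-side counterparts of the device functions (hypothetical names).
inline float relu_host(float x) { return x > 0.0f ? x : 0.0f; }
inline float sigmoid_host(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Returns true if a GPU result agrees with the host reference within a tolerance.
// The default tolerance is an illustrative value, not one taken from the repository.
inline bool close_to_reference(float gpu_value, float host_value, float tolerance = 1e-5f) {
    assert(tolerance >= 0.0f);
    return std::fabs(gpu_value - host_value) <= tolerance;
}
```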

### 3.5 Main Evaluation Function

The heart of the neural network - the forward pass implementation:


In [None]:
// Out-of-class definition of the evaluate member function (note: it normalizes `input` in place)
template <unsigned num_input, unsigned num_node>
__device__ inline float Allen::MVAModels::DeviceSingleLayerFCNN<num_input, num_node>::evaluate(float* input) const {
    using ModelType = Allen::MVAModels::DeviceSingleLayerFCNN<num_input, num_node>;

    // Data preprocessing - normalize inputs
#if (defined(TARGET_DEVICE_CUDA) && defined(__CUDACC__))
#pragma unroll
#endif
    for (unsigned i = 0; i < ModelType::nInput; i++) {
        input[i] = (input[i] - mean[i]) / std[i];
    }

    // Initialize hidden layer activations
    float h1[ModelType::nNode] = {0.f};

    // First layer computation with ReLU activation
#if (defined(TARGET_DEVICE_CUDA) && defined(__CUDACC__))
#pragma unroll
#endif
    for (unsigned i = 0; i < ModelType::nNode; i++) {
#if (defined(TARGET_DEVICE_CUDA) && defined(__CUDACC__))
#pragma unroll
#endif
        for (unsigned j = 0; j < ModelType::nInput; j++) {
            h1[i] += input[j] * weights1[i][j];
        }
        h1[i] = ActivateFunction::relu(h1[i] + bias1[i]);
    }

    // Output layer computation
    float output = 0.f;
#if (defined(TARGET_DEVICE_CUDA) && defined(__CUDACC__))
#pragma unroll
#endif
    for (unsigned i = 0; i < ModelType::nNode; i++) {
        output += h1[i] * weights2[i];
    }

    // Apply sigmoid activation to final output
    output = ActivateFunction::sigmoid(output + bias2);

    return output;
}
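
Note that `evaluate` overwrites the caller's `input` buffer with the normalized values. Where that side effect is unwanted, a variant that normalizes into a local array is one option. The host-side sketch below follows the same loop structure; `HostModel` and its member names are assumptions for illustration, not the repository's types.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Host-side sketch with the same loop structure as evaluate(), but the
// normalized inputs go into a local array so the caller's data is untouched.
template <unsigned num_input, unsigned num_node>
struct HostModel {
    std::array<float, num_input> mean;
    std::array<float, num_input> stddev;
    std::array<std::array<float, num_input>, num_node> weights1;
    std::array<float, num_node> bias1;
    std::array<float, num_node> weights2;
    float bias2;

    float evaluate(const std::array<float, num_input>& input) const {
        // Normalize into a local copy instead of modifying the input
        std::array<float, num_input> x{};
        for (unsigned i = 0; i < num_input; i++) {
            assert(stddev[i] != 0.0f);  // a zero std would yield inf or NaN
            x[i] = (input[i] - mean[i]) / stddev[i];
        }

        // Hidden layer with ReLU, accumulated into the output-layer sum
        float output = 0.f;
        for (unsigned i = 0; i < num_node; i++) {
            float h = 0.f;
            for (unsigned j = 0; j < num_input; j++) {
                h += x[j] * weights1[i][j];
            }
            h += bias1[i];
            output += (h > 0.f ? h : 0.f) * weights2[i];
        }

        // Sigmoid on the final sum
        return 1.0f / (1.0f + std::exp(-(output + bias2)));
    }
};
```

On the GPU the local copy costs a few extra registers per thread, which is the same trade-off the test kernel already makes with its `local_input` array.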

## 4. Test Program - main.cu

The `main.cu` file demonstrates how to use the neural network implementation with a complete CUDA kernel test harness.

### 4.1 CUDA Kernel for Neural Network Testing


In [None]:
#include "mock_dependencies.h"
#include "demo.cu"
#include <iostream>
#include <vector>
#include <random>
#include <iomanip>
#include <algorithm>
#include <numeric>

// CUDA kernel to test neural network evaluation
template <unsigned num_input, unsigned num_node>
__global__ void test_neural_network_kernel(
    const Allen::MVAModels::DeviceSingleLayerFCNN<num_input, num_node>* model,
    float* input_data,
    float* output_data,
    int num_tests)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < num_tests) {
        // Each thread processes one test case
        float local_input[num_input];
        for (unsigned i = 0; i < num_input; i++) {
            local_input[i] = input_data[idx * num_input + i];
        }

        // Evaluate the neural network
        output_data[idx] = model->evaluate(local_input);
    }
}
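
The `idx < num_tests` guard matters because the launch rounds the number of threads up to a whole number of blocks, so most launches start more threads than there are test cases. The index arithmetic can be emulated sequentially on the host. This is an illustrative sketch with hypothetical function names, not part of the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Ceiling division used for the launch configuration: enough blocks to cover all items.
inline int blocks_for(int num_items, int block_size) {
    assert(num_items >= 0 && block_size > 0);
    return (num_items + block_size - 1) / block_size;
}

// Sequential emulation of the kernel's index arithmetic (illustrative only):
// for each launched "thread", records whether it passes the idx < num_tests guard.
inline std::vector<bool> active_threads(int num_tests, int block_size) {
    const int grid_size = blocks_for(num_tests, block_size);
    std::vector<bool> active;
    active.reserve(static_cast<std::size_t>(grid_size) * block_size);
    for (int block = 0; block < grid_size; ++block) {
        for (int thread = 0; thread < block_size; ++thread) {
            // Same formula as blockIdx.x * blockDim.x + threadIdx.x
            const int idx = block * block_size + thread;
            active.push_back(idx < num_tests);
        }
    }
    return active;
}
```

With 10 test cases and a block size of 256, one block of 256 threads is launched and only the first 10 threads do any work.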

### 4.2 Host Test Function

The comprehensive test function that orchestrates the entire evaluation process:


In [None]:
// Host function to test neural network evaluation
void test_neural_network_evaluation(const std::string& json_filepath) {
    constexpr unsigned num_input = 4;
    constexpr unsigned num_node = 8;
    constexpr int num_tests = 10;

    std::cout << "\n=== Neural Network Evaluation Test ===" << std::endl;
    std::cout << "Network configuration:" << std::endl;
    std::cout << "- Input size: " << num_input << std::endl;
    std::cout << "- Hidden nodes: " << num_node << std::endl;
    std::cout << "- Number of test cases: " << num_tests << std::endl;

    // Parse file path
    std::string path, filename;
    size_t last_slash_idx = json_filepath.find_last_of("/");
    if (std::string::npos != last_slash_idx) {
        path = json_filepath.substr(0, last_slash_idx + 1);
        filename = json_filepath.substr(last_slash_idx + 1);
    } else {
        path = "./";
        filename = json_filepath;
    }

    // Create and initialize the model
    Allen::MVAModels::SingleLayerFCNN<num_input, num_node> model("test_model", filename);

    try {
        // Load model data
        model.readData(path);
        std::cout << "✓ Model loaded successfully" << std::endl;
    } catch (const std::exception& e) {
        std::cout << "✗ Error loading model: " << e.what() << std::endl;
        return;
    }

    // Generate random test input data
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<float> dis(-2.0f, 2.0f);

    std::vector<float> host_input(num_tests * num_input);
    std::vector<float> host_output(num_tests);

    std::cout << "\nGenerating random test inputs..." << std::endl;
    for (int i = 0; i < num_tests * num_input; i++) {
        host_input[i] = dis(gen);
    }

    // Allocate device memory
    float* device_input;
    float* device_output;

    cudaMalloc(&device_input, num_tests * num_input * sizeof(float));
    cudaMalloc(&device_output, num_tests * sizeof(float));

    // Copy input data to device
    cudaMemcpy(device_input, host_input.data(), 
               num_tests * num_input * sizeof(float), 
               cudaMemcpyHostToDevice);

### 4.3 Kernel Launch and Results Processing


In [None]:
    // Launch kernel
    int block_size = 256;
    int grid_size = (num_tests + block_size - 1) / block_size;

    std::cout << "Launching CUDA kernel..." << std::endl;
    std::cout << "Grid size: " << grid_size << ", Block size: " << block_size << std::endl;

    test_neural_network_kernel<<<grid_size, block_size>>>(
        model.getDevicePointer(), device_input, device_output, num_tests);

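    // Check for kernel launch errors
    cudaError_t kernel_error = cudaGetLastError();
    if (kernel_error != cudaSuccess) {
        std::cout << "✗ CUDA kernel launch error: " << cudaGetErrorString(kernel_error) << std::endl;
        cudaFree(device_input);
        cudaFree(device_output);
        return;
    }

    // Wait for kernel to complete; errors raised during execution are reported here
    kernel_error = cudaDeviceSynchronize();
    if (kernel_error != cudaSuccess) {
        std::cout << "✗ CUDA kernel execution error: " << cudaGetErrorString(kernel_error) << std::endl;
        cudaFree(device_input);
        cudaFree(device_output);
        return;
    }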

    // Copy results back to host
    cudaMemcpy(host_output.data(), device_output, 
               num_tests * sizeof(float), 
               cudaMemcpyDeviceToHost);

    // Display results
    std::cout << "\n=== Test Results ===" << std::endl;
    std::cout << std::fixed << std::setprecision(6);

    for (int i = 0; i < num_tests; i++) {
        std::cout << "Test " << std::setw(2) << (i + 1) << ": ";
        std::cout << "Input [";
        for (int j = 0; j < num_input; j++) {
            std::cout << std::setw(8) << host_input[i * num_input + j];
            if (j < num_input - 1) std::cout << ", ";
        }
        std::cout << "] -> Output: " << std::setw(8) << host_output[i] << std::endl;
    }

    // Validate outputs (sigmoid should produce values between 0 and 1).
    // The condition is written positively so that NaN is also flagged as invalid.
    bool all_valid = true;
    for (int i = 0; i < num_tests; i++) {
        if (!(host_output[i] >= 0.0f && host_output[i] <= 1.0f)) {
            all_valid = false;
            break;
        }
    }

    std::cout << "\n=== Validation ===" << std::endl;
    if (all_valid) {
        std::cout << "✓ All outputs are in valid range [0, 1] (sigmoid activation)" << std::endl;
    } else {
        std::cout << "✗ Some outputs are outside valid range [0, 1]" << std::endl;
    }

    // Calculate statistics
    float min_output = *std::min_element(host_output.begin(), host_output.end());
    float max_output = *std::max_element(host_output.begin(), host_output.end());
    float avg_output = std::accumulate(host_output.begin(), host_output.end(), 0.0f) / num_tests;

    std::cout << "Output statistics:" << std::endl;
    std::cout << "- Min: " << min_output << std::endl;
    std::cout << "- Max: " << max_output << std::endl;
    std::cout << "- Average: " << avg_output << std::endl;

    // Clean up
    cudaFree(device_input);
    cudaFree(device_output);

    std::cout << (all_valid ? "\n✓ Test completed successfully!" : "\n✗ Test completed with invalid outputs!") << std::endl;
}
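
The validation and statistics steps can also be factored into a function that is testable without a GPU. The sketch below is one possible form; `OutputSummary` and `summarize_outputs` are hypothetical names, not part of the repository.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical result type for the validation and statistics step.
struct OutputSummary {
    bool all_valid;  // true if every output is finite and inside [0, 1]
    float min;
    float max;
    float mean;
};

inline OutputSummary summarize_outputs(const std::vector<float>& outputs) {
    assert(!outputs.empty());
    OutputSummary s{true, outputs.front(), outputs.front(), 0.0f};
    float sum = 0.0f;
    for (float v : outputs) {
        // NaN fails every ordered comparison, so the range test is written positively.
        if (!(std::isfinite(v) && v >= 0.0f && v <= 1.0f)) {
            s.all_valid = false;
        }
        if (v < s.min) s.min = v;
        if (v > s.max) s.max = v;
        sum += v;
    }
    s.mean = sum / static_cast<float>(outputs.size());
    return s;
}
```

A NaN output, for instance from a zero entry in `std`, is reported as invalid here.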

### 4.4 Main Function

The entry point that handles command-line arguments and GPU detection:


In [None]:
int main(int argc, char* argv[]) {
    std::cout << "CUDA Neural Network Demo" << std::endl;
    std::cout << "========================" << std::endl;

    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <path_to_model.json>" << std::endl;
        return 1;
    }

    // Check CUDA device
    int device_count = 0;
    cudaError_t count_error = cudaGetDeviceCount(&device_count);

    if (count_error != cudaSuccess || device_count == 0) {
        std::cout << "✗ No CUDA devices found!" << std::endl;
        return 1;
    }

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::cout << "Using CUDA device: " << prop.name << std::endl;
    std::cout << "Compute capability: " << prop.major << "." << prop.minor << std::endl;

    // Run the neural network test
    test_neural_network_evaluation(argv[1]);

    return 0;
}

## 5. Random Model Generator

The `nn_gen/json_generator.cpp` file creates random neural network parameters for testing purposes.

### 5.1 Generator Implementation


In [None]:
#include <iostream>
#include <fstream>
#include <vector>
#include <random>
#include <iomanip>
#include <nlohmann/json.hpp>

void generate_random_model(const std::string& filename, 
                          unsigned num_input, 
                          unsigned num_node,
                          float weight_range = 1.0f, 
                          float bias_range = 0.5f) {

    nlohmann::json j;

    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<float> weight_dist(-weight_range, weight_range);
    std::uniform_real_distribution<float> bias_dist(-bias_range, bias_range);
    std::uniform_real_distribution<float> norm_dist(0.1f, 2.0f); // For mean/std normalization

    j["num_input"] = num_input;
    j["num_node"] = num_node;

    // Generate mean values for input normalization
    std::vector<float> mean(num_input);
    for (unsigned i = 0; i < num_input; i++) {
        mean[i] = norm_dist(gen);
    }
    j["mean"] = mean;

    // Generate std values for input normalization  
    std::vector<float> std_dev(num_input);
    for (unsigned i = 0; i < num_input; i++) {
        std_dev[i] = norm_dist(gen);
    }
    j["std"] = std_dev;

    // Generate weights1 (2D array: num_node x num_input)
    std::vector<std::vector<float>> weights1(num_node, std::vector<float>(num_input));
    for (unsigned i = 0; i < num_node; i++) {
        for (unsigned k = 0; k < num_input; k++) {  // k rather than j, which names the JSON object
            weights1[i][k] = weight_dist(gen);
        }
    }
    j["weights1"] = weights1;

    // Generate bias1 (hidden layer biases)
    std::vector<float> bias1(num_node);
    for (unsigned i = 0; i < num_node; i++) {
        bias1[i] = bias_dist(gen);
    }
    j["bias1"] = bias1;

    // Generate weights2 (output layer weights)
    std::vector<float> weights2(num_node);
    for (unsigned i = 0; i < num_node; i++) {
        weights2[i] = weight_dist(gen);
    }
    j["weights2"] = weights2;

    // Generate bias2 (output layer bias)
    j["bias2"] = bias_dist(gen);

    // Write to file
    std::ofstream file(filename);
    if (!file) {
        std::cerr << "Could not open " << filename << " for writing" << std::endl;
        return;
    }
    file << std::setw(4) << j << std::endl;
    file.close();

    std::cout << "Generated random model: " << filename << std::endl;
    std::cout << "- Input size: " << num_input << std::endl;
    std::cout << "- Hidden nodes: " << num_node << std::endl;
    std::cout << "- Weight range: [-" << weight_range << ", " << weight_range << "]" << std::endl;
    std::cout << "- Bias range: [-" << bias_range << ", " << bias_range << "]" << std::endl;
}

### 5.2 Generator Main Function


In [None]:
int main(int argc, char* argv[]) {
    unsigned num_input = 4;
    unsigned num_node = 8;
    std::string filename = "random_model.json";

    if (argc > 1) num_input = std::stoi(argv[1]);
    if (argc > 2) num_node = std::stoi(argv[2]);
    if (argc > 3) filename = argv[3];

    std::cout << "Random Neural Network Model Generator" << std::endl;
    std::cout << "====================================" << std::endl;

    generate_random_model(filename, num_input, num_node);

    return 0;
}
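
The generator seeds `std::mt19937` from `std::random_device`, so every run produces a different model. For reproducible tests, a variant that takes an explicit seed may be preferable. This is a sketch of that idea; `random_values` and its `seed` parameter are assumptions, not part of the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical variant of the weight generation that takes an explicit seed,
// so the same seed reproduces the same values on a given standard library.
inline std::vector<float> random_values(std::size_t count, float range, std::uint32_t seed) {
    assert(range > 0.0f);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(-range, range);
    std::vector<float> values(count);
    for (auto& v : values) {
        v = dist(gen);
    }
    return values;
}
```

The raw `std::mt19937` sequence is fixed by the C++ standard for a given seed, but the mapping done by `std::uniform_real_distribution` is implementation-defined, so values may differ between standard library implementations.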

## 6. Build System Configuration

The `CMakeLists.txt` file defines the build process using CMake.

### 6.1 CMake Configuration


In [None]:
cmake_minimum_required(VERSION 3.18)

project(GpuDemoNN LANGUAGES CXX CUDA)

# Find required packages
find_package(nlohmann_json 3.2.0 REQUIRED)

# Neural network test executable
add_executable(neural_network_test main.cu demo.cu)
target_link_libraries(neural_network_test PRIVATE nlohmann_json::nlohmann_json)
set_target_properties(neural_network_test PROPERTIES
    CUDA_STANDARD 17
    CXX_STANDARD 17
)

# JSON generator executable  
add_executable(json_generator nn_gen/json_generator.cpp)
target_link_libraries(json_generator PRIVATE nlohmann_json::nlohmann_json)
set_target_properties(json_generator PROPERTIES
    CXX_STANDARD 17
)

## 7. Mock Dependencies

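The `mock_dependencies.h` file provides minimal stand-ins for the parts of the Allen framework that `demo.cu` relies on (`Allen::malloc`, `Allen::memcpy`, and `MVAModelBase`), so the demo builds without Allen.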

### 7.1 Mock Allen Framework Functions


In [None]:
#pragma once
#include <cuda_runtime.h>
#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <memory>
#include <nlohmann/json.hpp>
#include <cassert>

// Mock Allen namespace functions
namespace Allen {
    enum MemcpyKind {
        memcpyHostToDevice = cudaMemcpyHostToDevice,
        memcpyDeviceToHost = cudaMemcpyDeviceToHost,
        memcpyDeviceToDevice = cudaMemcpyDeviceToDevice
    };

    inline void malloc(void** ptr, size_t size) {
        cudaMalloc(ptr, size);
    }

    inline void memcpy(void* dst, const void* src, size_t size, MemcpyKind kind) {
        cudaMemcpy(dst, src, size, (cudaMemcpyKind)kind);
    }
}

// Mock base class for MVA models
struct MVAModelBase {
    std::string m_name, m_path;
    MVAModelBase(std::string name, std::string path) : m_name(name), m_path(path) {}
    virtual void readData(std::string parameters_path) = 0;
    virtual ~MVAModelBase() = default;
};

## 8. JSON Model Format

The neural network parameters are stored in JSON format for easy configuration and testing.

### 8.1 JSON Schema

The model file contains the following fields:

| Field | Type | Description |
|-------|------|-------------|
| `num_input` | integer | Number of input features |
| `num_node` | integer | Number of hidden layer neurons |
| `mean` | array[float] | Input normalization means |
| `std` | array[float] | Input normalization standard deviations |
| `weights1` | array[array[float]] | Hidden layer weights (num_node × num_input) |
| `bias1` | array[float] | Hidden layer biases |
| `weights2` | array[float] | Output layer weights |
| `bias2` | float | Output layer bias |

### 8.2 Example JSON Model


In [None]:
{
    "num_input": 4,
    "num_node": 8,
    "mean": [0.5, 1.2, -0.3, 2.1],
    "std": [1.0, 0.8, 1.5, 0.9],
    "weights1": [
        [0.234, -0.567, 0.891, 0.123],
        [-0.456, 0.789, 0.345, -0.678],
        [0.123, -0.234, 0.567, -0.890],
        [0.678, 0.345, -0.123, 0.456],
        [-0.789, 0.234, 0.567, -0.345],
        [0.890, -0.123, 0.456, 0.678],
        [-0.345, 0.678, -0.789, 0.234],
        [0.456, -0.890, 0.123, -0.567]
    ],
    "bias1": [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8],
    "weights2": [0.8, -0.6, 0.4, -0.2, 0.9, -0.7, 0.5, -0.3],
    "bias2": 0.15
}

## 9. Compilation and Usage

### 9.1 Requirements

- **NVIDIA GPU** with CUDA support
- **CUDA Toolkit** (tested with CUDA 12.x)  
- **C++17** compatible compiler
- **CMake** (3.18+)
- **nlohmann-json** library (system-wide installation)

### 9.2 GPU Architecture Support

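The `CMakeLists.txt` can be configured for different GPU architectures. With CMake 3.18 or newer, one option is to pass `-DCMAKE_CUDA_ARCHITECTURES=<number>` (for example `86`) at configure time: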

| Architecture | GPU Series | Compute Capability |
|--------------|-------------|-------------------|
| `sm_60` | Tesla P100 | 6.0 |
| `sm_61` | GTX 10 series | 6.1 |
| `sm_70` | Titan V, Tesla V100 | 7.0 |
| `sm_75` | RTX 20 series, GTX 16 series | 7.5 (default) |
| `sm_86` | RTX 30 series | 8.6 |
| `sm_89` | RTX 40 series | 8.9 |

### 9.3 Build Commands


In [None]:
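# Create build directory and enter it
# (%cd persists across cells; "!cd" only changes directory in a temporary subshell)
!mkdir -p build
%cd build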

# Configure project
!cmake ..

# Compile executables
!make

# This creates two executables:
# - json_generator: Creates random model parameters
# - neural_network_test: Runs neural network evaluation tests

### 9.4 Usage Examples

#### Generate a Random Model


In [None]:
# Generate model with 4 inputs, 8 hidden nodes
./json_generator 4 8 ../test_model.json

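# Generate a model with other dimensions
# (neural_network_test is compiled for 4 inputs and 8 hidden nodes; to evaluate this
#  model, change num_input and num_node in main.cu and rebuild)
./json_generator 6 16 ../my_custom_model.json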

#### Run Neural Network Test


In [None]:
# Test with generated model
./neural_network_test ../test_model.json

# Expected output:
# CUDA Neural Network Demo
# ========================
# Using CUDA device: GeForce RTX 3080
# Compute capability: 8.6
# 
# === Neural Network Evaluation Test ===
# Network configuration:
# - Input size: 4
# - Hidden nodes: 8
# - Number of test cases: 10
# ✓ Model loaded successfully
# 
# Generating random test inputs...
# Launching CUDA kernel...
# Grid size: 1, Block size: 256
# 
# === Test Results ===
# Test  1: Input [-0.527618, -0.339161,  1.488258, -0.412800] -> Output: 0.947760
# Test  2: Input [ 0.694293,  1.286170,  1.175678, -0.213452] -> Output: 0.694606
# ...
# 
# === Validation ===
# ✓ All outputs are in valid range [0, 1] (sigmoid activation)
# Output statistics:
# - Min: 0.382523
# - Max: 0.952799  
# - Average: 0.793204
# 
# ✓ Test completed successfully!

## 10. Performance Analysis

### 10.1 CUDA Optimization Techniques

The implementation uses several optimization strategies:

#### Loop Unrolling
```cuda
#if (defined(TARGET_DEVICE_CUDA) && defined(__CUDACC__))
#pragma unroll
#endif
```
- Reduces loop overhead
- Enables better instruction-level parallelism
- Compile-time optimization for known loop bounds

#### Template-based Design
- Compile-time configuration of network dimensions
- Makes all loop bounds compile-time constants, so the loops can be fully unrolled
- Enables aggressive compiler optimizations

#### Fast Math Operations
- Uses the `__expf()` intrinsic for the single-precision exponential in the sigmoid
- Faster than the standard `expf()`, at the cost of reduced accuracy
- Available in device code only

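#### Memory Access Patterns
- Each thread copies its own input vector into a local array before evaluation
- Inputs are stored sample by sample (`input_data[idx * num_input + i]`), so neighbouring threads read addresses `num_input` floats apart; this strided pattern is not fully coalesced
- All threads read the same model parameters, which are copied to the GPU once when the model is loaded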
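
In the test kernel each thread reads `input_data[idx * num_input + i]`, so for a fixed feature `i` neighbouring threads access addresses that are `num_input` floats apart. A feature-major layout would place those values next to each other instead. The sketch below compares the two index formulas and converts between the layouts on the host. The function names are hypothetical, and the feature-major layout is a possible alternative rather than what the repository does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sample-major layout (as in main.cu): the features of one test case are adjacent.
inline std::size_t sample_major_index(std::size_t sample, std::size_t feature,
                                      std::size_t num_input) {
    return sample * num_input + feature;
}

// Feature-major layout (hypothetical alternative): the same feature of consecutive
// test cases is adjacent, so consecutive threads would read consecutive floats.
inline std::size_t feature_major_index(std::size_t sample, std::size_t feature,
                                       std::size_t num_tests) {
    return feature * num_tests + sample;
}

// Converts a sample-major buffer into a feature-major one on the host.
inline std::vector<float> to_feature_major(const std::vector<float>& data,
                                           std::size_t num_tests, std::size_t num_input) {
    assert(data.size() == num_tests * num_input);
    std::vector<float> out(data.size());
    for (std::size_t s = 0; s < num_tests; ++s) {
        for (std::size_t f = 0; f < num_input; ++f) {
            out[feature_major_index(s, f, num_tests)] = data[sample_major_index(s, f, num_input)];
        }
    }
    return out;
}
```

A kernel using the feature-major buffer would read `input_data[i * num_tests + idx]`. Whether the change pays off depends on the batch size and is best confirmed by profiling.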

### 10.2 Performance Characteristics

| Aspect | Benefit |
|--------|---------|
| **Parallel Evaluation** | Process multiple inputs simultaneously |
| **Template Specialization** | Zero runtime overhead for network configuration |
| **CUDA Intrinsics** | Hardware-accelerated mathematical operations |
| **Per-thread local arrays** | Inputs and hidden activations stay in thread-local storage during evaluation |

### 10.3 Scalability

The implementation scales well with:
- Number of concurrent test cases
- GPU compute capability
- Memory bandwidth
- Number of CUDA cores


## 11. Allen Framework Background

### 11.1 About Allen

Allen is the GPU-based first-level software trigger (HLT1) developed for the LHCb experiment at CERN. This neural network demo is adapted from Allen's production codebase.

#### Key Facts about Allen:
- **First complete high-throughput GPU trigger** for a High Energy Physics experiment
- **Processes 40 Tbit/s** data rate from the upgraded LHCb detector  
- **Reduces data rate** by factor of 30-60 in real-time
- **Designed to run on roughly 500 GPU cards** in production operation
- **Operates at full LHC collision rate** of 30-40 MHz

### 11.2 Technical Achievements

#### Performance Metrics
- **Throughput**: Processes millions of collision events per second
- **Latency**: Real-time decision making within microseconds
- **Efficiency**: Linear scaling with GPU computational power
- **Reliability**: Production-ready for continuous operation

#### Algorithm Portfolio
Allen implements numerous pattern recognition algorithms:
- **Charged particle tracking** through silicon detectors
- **Primary vertex reconstruction** from collision points
- **Particle identification** (hadrons vs muons)
- **Displaced vertex finding** for long-lived particle decays

### 11.3 Innovation Impact

Allen represents a paradigm shift in high-energy physics computing:
- **GPU-first architecture** for trigger systems
- **Template-based programming** for compile-time optimization  
- **Heterogeneous computing** leveraging CPU and GPU strengths
- **Open-source framework** enabling broad community adoption

### 11.4 Publications and Recognition

- Published in **Computing and Software for Big Science** (2020)
- **CERN openlab** collaboration project
- Featured in **NVIDIA** GPU computing showcases
- **Apache 2.0 license** for community access

This neural network demo preserves Allen's design principles while providing a simplified, educational implementation suitable for learning CUDA neural network programming.


## 12. Complete Example Workflow

### 12.1 Step-by-Step Tutorial

Let's walk through a complete example from model generation to evaluation:

#### Step 1: Environment Setup


# Install dependencies (Ubuntu/Debian)
sudo apt update
sudo apt install build-essential cmake nvidia-cuda-toolkit
sudo apt install nlohmann-json3-dev

# Fedora/Red Hat instructions
sudo dnf groupinstall "Development Tools"
sudo dnf install cmake cuda-toolkit
sudo dnf install nlohmann-json-devel

# Verify CUDA installation
nvcc --version
nvidia-smi

In [None]:
# In the Binder environment for this tutorial, all dependencies are preinstalled except nlohmann_json.
# Run the following command in a terminal rather than in a notebook cell (drop the leading "!").
!conda install -c conda-forge nlohmann_json

#### Step 2: Build the Project


In [None]:
# Create build directory
mkdir build && cd build

# Configure with CMake
cmake .. -DCMAKE_BUILD_TYPE=Release

# Compile (adjust -j flag based on available CPU cores)
make -j4

# Verify executables were created
ls -la json_generator neural_network_test

#### Step 3: Generate Model Parameters


In [None]:
# Generate a model with 4 inputs and 8 hidden neurons
./json_generator 4 8 ../demo_model.json

# Check the generated file
cat ../demo_model.json | head -20

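# Generate models with different architectures
# (evaluating them requires rebuilding the test program, which is compiled for 4 inputs and 8 hidden nodes)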
./json_generator 6 12 ../larger_model.json
./json_generator 2 4 ../smaller_model.json

#### Step 4: Run Neural Network Evaluation


In [None]:
# Test the demo model
./neural_network_test ../demo_model.json

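# Models with other dimensions (larger_model.json, smaller_model.json) cannot be loaded by
# this binary: num_input = 4 and num_node = 8 are compile-time constants in main.cu.
# To evaluate them, change those constants, rebuild, and then run for example:
# ./neural_network_test ../larger_model.json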

# Save output to file for analysis
./neural_network_test ../demo_model.json > results.txt

### 12.2 Python Integration Example

You can integrate this CUDA implementation with Python for data analysis:


In [None]:
import os
import subprocess

def generate_and_test_model(num_input, num_hidden, model_name):
    """Generate a model and run tests, return the test program's stdout"""

    # Make sure the output directory exists
    os.makedirs('../models', exist_ok=True)
    model_path = f'../models/{model_name}.json'

    # Generate model (check=True raises CalledProcessError if the command fails)
    subprocess.run(
        ['./json_generator', str(num_input), str(num_hidden), model_path],
        check=True
    )

    # Run test and capture output
    result = subprocess.run(
        ['./neural_network_test', model_path],
        capture_output=True, text=True, check=True
    )

    return result.stdout

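def analyze_model_performance():
    """Compare output statistics across several randomly generated models"""

    # neural_network_test is compiled for 4 inputs and 8 hidden nodes,
    # so every model must use those dimensions.
    configurations = [
        (4, 8, "model_a"),
        (4, 8, "model_b"),
        (4, 8, "model_c")
    ]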

    results = {}
    for num_input, num_hidden, name in configurations:
        output = generate_and_test_model(num_input, num_hidden, name)

        # Parse output statistics (simplified parsing)
        # In practice, you'd use more robust parsing
        lines = output.split('\n')
        for line in lines:
            if 'Average:' in line:
                avg_output = float(line.split(':')[-1].strip())
                results[name] = {
                    'config': (num_input, num_hidden),
                    'avg_output': avg_output
                }

    return results

# Example usage
results = analyze_model_performance()
print("Model Performance Analysis:")
for name, data in results.items():
    config = data['config'] 
    avg = data['avg_output']
    print(f"{name}: {config[0]}x{config[1]} -> avg output: {avg:.3f}")

### 12.3 Performance Benchmarking

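For production deployments, consider benchmarking different configurations. Two caveats apply to the script below: `neural_network_test` must be rebuilt with matching `num_input` and `num_node` for each configuration (they are compile-time constants in `main.cu`), and `time` measures the whole process, including CUDA initialization and model loading, not just the kernel: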


In [None]:
# Benchmark script example
for input_size in 2 4 6 8; do
    for hidden_size in 4 8 16 32; do
        echo "Testing ${input_size}x${hidden_size} configuration..."
        ./json_generator $input_size $hidden_size test_${input_size}_${hidden_size}.json
        time ./neural_network_test test_${input_size}_${hidden_size}.json
        echo "---"
    done
done

## Conclusion

This tutorial has walked through a complete CUDA neural network implementation adapted from LHCb's Allen framework. The key takeaways include:

### Technical Achievements
- **Template-based design** enabling compile-time optimizations
- **GPU memory management** with efficient CUDA memory operations  
- **Optimized kernels** using loop unrolling and fast math
- **JSON-based configuration** for flexible model parameters
- **Production-ready code** adapted from real LHC trigger systems

### Educational Value
- Complete CUDA neural network from scratch
- Integration of C++, CUDA, and CMake build systems
- Real-world application from high-energy physics
- Performance optimization techniques
- Memory management best practices

### Extensions and Future Work

This implementation can be extended in several ways:
1. **Multi-layer networks** with arbitrary depth
2. **Batch processing** for improved throughput
3. **Training capabilities** with backpropagation
4. **Different activation functions** (tanh, leaky ReLU, etc.)
5. **Mixed-precision arithmetic** using Tensor Cores
6. **Multi-GPU scaling** for large-scale deployments

### Resources for Further Learning

- **Allen Documentation**: https://allen-doc.docs.cern.ch
- **NVIDIA CUDA Toolkit**: https://developer.nvidia.com/cuda-toolkit
- **cuDNN Library**: https://developer.nvidia.com/cudnn
- **CUDA Programming Guide**: https://docs.nvidia.com/cuda/cuda-c-programming-guide/


