Using Claude.ai - OPUS to recreate the gp_driver.cpp #5

Open
johndpope opened this issue Apr 15, 2024 · 0 comments


johndpope commented Apr 15, 2024

PREFACE
Admittedly - it's garbage in, garbage out.

I'm not entirely clear on what you're attempting to do with the AMD driver internals

  • (but I enjoyed watching many hours of you pulling apart the AMD GPU)

It seems like you want to recreate this driver code for the AMD GPU / 7900XT?
https://github.com/geohot/cuda_ioctl_sniffer/blob/master/gpu_driver.cc

If not - please articulate. Below are two attempts to create the driver - the first with the referenced PM4 header, and an earlier one without it.

Understand that Claude.ai has read every repo - every line of code - so it knows all the code / headers from the AMD ROCT-Thunk-Interface repo.

In addition, I cat a few files (PM4Packet / queue / cpp / hpp) into the prompt - so the context looks like this:
https://gist.github.com/johndpope/68b233e6d85f863fbb775aabca5ed65f

We pass this in - and I attach your code, gpu_driver.cc:
recreate the nvidia driver for amd gpu 7900xt - gpu leveraging the ROCT-Thunk-Interface _setup /gpu_memcpy /gpu_compute etc - leverage the attached classes - use a pm4 queue

[Screenshot: Claude.ai conversation, 2024-04-15 11:34 am]

Here is an attempt at recreating the NVIDIA driver code to work with an AMD GPU (7900XT), leveraging the ROCT-Thunk-Interface and the provided PM4 queue classes:

#include "helpers.h"
#include "nouveau.h"
#include "ROCT-Thunk-Interface.h"
#include "PM4Queue.hpp"
#include "PM4Packet.hpp"

#define ROCHSA_PM4_QUEUE_SIZE (64*1024) // 64 KB

uint64_t trivial[] = {
    // Trivial compute shader, same as original
    0x00005a00ff057624, 0x000fe200078e00ff,
    0x0000580000027a02, 0x000fe20000000f00,
    0x0000590000037a02, 0x000fca0000000f00,
    0x0000000502007986, 0x000fe2000c101904,
    0x000000000000794d, 0x000fea0003800000,
};

void gpu_setup(PM4Queue* pQueue) {
    // Initialize the PM4 queue
    pQueue->Init();
}

void gpu_memcpy(PM4Queue* pQueue, uint64_t dst, const uint32_t *src, int len) {
    assert(len % 4 == 0);

    // Use a PM4 DMA packet to do the memcpy
    pQueue->PlaceAndSubmitPacket(PM4DmaDataPacket(dst, src, len));
}

void gpu_compute(PM4Queue* pQueue, uint64_t shader_addr, uint64_t cb_addr, int cb_len) {
    // Set up registers
    const unsigned int COMPUTE_PGM_VALUES[] = {
        static_cast<uint32_t>(shader_addr),      // PGM_LO
        static_cast<uint32_t>(shader_addr >> 32) // PGM_HI
    };

    const unsigned int COMPUTE_PGM_RSRC1[] = { 0x000c0084 }; // Same as original

    const unsigned int COMPUTE_DISPATCH_DIMENSIONS[] = {
        1, 1, 1, // THREADS_X/Y/Z
        1, 1, 1, // GROUPS_X/Y/Z
        0, 0     // PIPELINESTAT/PERFCOUNT
    };

    const unsigned int COMPUTE_USER_DATA[] = {
        static_cast<uint32_t>(cb_addr),       // CB1_BASE_LO
        static_cast<uint32_t>(cb_addr >> 32), // CB1_BASE_HI
        cb_len,                               // CB1_SIZE
        1                                     // CB1_VALID
    };

    // Configure shader registers
    pQueue->PlaceAndSubmitPacket(
        PM4SetShaderRegPacket(mmCOMPUTE_PGM_LO, COMPUTE_PGM_VALUES,
            sizeof(COMPUTE_PGM_VALUES)/sizeof(COMPUTE_PGM_VALUES[0])));

    pQueue->PlaceAndSubmitPacket(
        PM4SetShaderRegPacket(mmCOMPUTE_PGM_RSRC1, COMPUTE_PGM_RSRC1,
            sizeof(COMPUTE_PGM_RSRC1)/sizeof(COMPUTE_PGM_RSRC1[0])));

    pQueue->PlaceAndSubmitPacket(
        PM4SetShaderRegPacket(mmCOMPUTE_NUM_THREAD_X, COMPUTE_DISPATCH_DIMENSIONS,
            sizeof(COMPUTE_DISPATCH_DIMENSIONS)/sizeof(COMPUTE_DISPATCH_DIMENSIONS[0])));

    pQueue->PlaceAndSubmitPacket(
        PM4SetShaderRegPacket(mmCOMPUTE_USER_DATA_0, COMPUTE_USER_DATA,
            sizeof(COMPUTE_USER_DATA)/sizeof(COMPUTE_USER_DATA[0])));

    // Dispatch the compute shader
    pQueue->PlaceAndSubmitPacket(PM4DispatchDirectPacket(1, 1, 1));

    // Wait for shader completion
    pQueue->PlaceAndSubmitPacket(PM4ReleaseMemoryPacket(true, cb_addr, 0xC0FFEE));
    pQueue->Wait4PacketConsumption();
}

int main() {
    PM4Queue queue;
    HsaMemoryBuffer isaBuf(trivial, sizeof(trivial), PAGE_SIZE, false);

    // Map and initialize GPU resources
    void* gpu_mmio_ptr = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE,
                              MAP_SHARED, open("/dev/mem", O_RDWR), 0);

    uint64_t gpu_local_mem = 0; // Allocate with hsaKmtAllocMemory()
    uint64_t cb_gpu_addr = gpu_local_mem;

    // Set up the queue
    gpu_setup(&queue);

    // Copy shader code to GPU memory
    gpu_memcpy(&queue, gpu_local_mem, trivial, sizeof(trivial));

    // Run the shader
    gpu_compute(&queue, gpu_local_mem, cb_gpu_addr, 16);

    // Clean up
    munmap(gpu_mmio_ptr, PAGE_SIZE);
    hsaKmtFreeMemory(gpu_local_mem, sizeGpuMem); // NB: sizeGpuMem is never defined in this attempt

    return 0;
}

Key changes:

  • Use ROCT-Thunk-Interface for low-level AMD GPU access instead of NVIDIA-specific interfaces.
  • Allocate GPU memory using hsaKmtAllocMemory() instead of NVIDIA allocators.
  • Use the provided PM4Queue class to manage the AMD command queue.
  • Build the shader dispatch sequence using PM4 packets such as PM4SetShaderRegPacket and PM4DispatchDirectPacket.
  • Wait for shader completion using PM4ReleaseMemoryPacket.

This provides the general structure, but more details would need to be filled in: proper memory allocation, any kernel-mode driver interactions, shader compilation, and so on. The low-level register setup would also need to be adapted for the specific AMD GPU. But hopefully this illustrates the high-level approach of porting the NVIDIA code to AMD, leveraging the ROCT-Thunk-Interface and a PM4 queue.

I had asked it before passing the PM4 packets - and it gave me this variation, where it didn't have the PM4Packet header.

In this updated version, we use PM4 packets to perform memory copies and kernel dispatches. The main changes are in the gpu_memcpy and gpu_compute functions.

For gpu_memcpy, we create a PM4 packet with the PM4_HDR_IT_OPCODE_COPY opcode to copy data from the source to the destination. We fill in the necessary fields of the packet, such as the source and destination addresses and the size of the data. Then, we enqueue the packet into the HSA queue and wait for it to complete using a signal.

Similarly, for gpu_compute, we create a PM4 packet with the PM4_HDR_IT_OPCODE_DISPATCH_DIRECT opcode to dispatch the kernel. We provide the kernel object, argument buffer, and grid dimensions in the packet. We enqueue the packet into the HSA queue and wait for it to complete using a signal.

The rest of the code remains largely the same, with the addition of the PM4_HEADER_BUILD macro to construct the PM4 packet headers.

Please note that this code assumes you have the necessary headers and libraries for the AMD GPU PM4 packets, such as amd_gpu_pm4.h. Make sure to link against the appropriate libraries when compiling the code.

Again, this is a simplified example, and you may need to adapt it based on your specific requirements and the kernel code object you are using.

#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>
#include <hsa/amd_gpu_pm4.h>
#include <iostream>
#include <fstream>
#include <vector>
#include <cstring>
#include <cassert>

void check(hsa_status_t status, const char* msg) {
    if (status != HSA_STATUS_SUCCESS) {
        std::cerr << msg << " failed with status " << status << std::endl;
        exit(1);
    }
}

void gpu_setup(hsa_queue_t* queue) {
    // Placeholder for any necessary setup steps
}

void gpu_memcpy(hsa_queue_t* queue, void* dst, const void* src, size_t size) {
    hsa_status_t status;
    hsa_signal_t signal;
    status = hsa_signal_create(1, 0, NULL, &signal);
    check(status, "hsa_signal_create");

    hsa_region_t region;
    status = hsa_agent_iterate_regions(queue->agent, [](hsa_region_t region, void* data) {
        if (region.handle != 0) {
            hsa_region_global_flag_t flags;
            hsa_region_get_info(region, HSA_REGION_INFO_GLOBAL_FLAGS, &flags);
            if (flags & HSA_REGION_GLOBAL_FLAG_FINE_GRAINED) {
                *static_cast<hsa_region_t*>(data) = region;
                return HSA_STATUS_INFO_BREAK;
            }
        }
        return HSA_STATUS_SUCCESS;
    }, &region);
    check(status, "Finding a memory region");

    uint32_t* pm4_packet = static_cast<uint32_t*>(malloc(4 * sizeof(uint32_t)));
    pm4_packet[0] = PM4_HEADER_BUILD(PM4_HDR_IT_OPCODE_COPY, size / 4, 0, 0, 0);
    pm4_packet[1] = reinterpret_cast<uintptr_t>(src) >> 8;
    pm4_packet[2] = reinterpret_cast<uintptr_t>(dst) >> 8;
    pm4_packet[3] = size;

    uint64_t index = hsa_queue_load_write_index_relaxed(queue);
    hsa_kernel_dispatch_packet_t* dispatch_packet = reinterpret_cast<hsa_kernel_dispatch_packet_t*>(queue->base_address) + (index % queue->size);
    memcpy(dispatch_packet, pm4_packet, 4 * sizeof(uint32_t));
    hsa_queue_store_write_index_relaxed(queue, index + 1);

    hsa_signal_store_relaxed(queue->doorbell_signal, index);

    hsa_signal_wait_acquire(signal, HSA_SIGNAL_CONDITION_LT, 1, UINT64_MAX, HSA_WAIT_STATE_BLOCKED);

    status = hsa_signal_destroy(signal);
    check(status, "hsa_signal_destroy");

    free(pm4_packet);
}

void gpu_compute(hsa_queue_t* queue, const void* code_object, size_t code_object_size,
                 const void* args, size_t args_size) {
    hsa_status_t status;
    hsa_executable_t executable;
    status = hsa_executable_create(HSA_PROFILE_FULL, HSA_EXECUTABLE_STATE_UNFROZEN, NULL, &executable);
    check(status, "hsa_executable_create");

    status = hsa_executable_load_code_object(executable, queue->agent, code_object, code_object_size, NULL, NULL);
    check(status, "hsa_executable_load_code_object");

    status = hsa_executable_freeze(executable, NULL);
    check(status, "hsa_executable_freeze");

    hsa_executable_symbol_t kernel_symbol;
    status = hsa_executable_get_symbol(executable, NULL, "<kernel_name>", queue->agent, 0, &kernel_symbol);
    check(status, "hsa_executable_get_symbol");

    uint64_t kernel_object;
    status = hsa_executable_symbol_get_info(kernel_symbol, HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_OBJECT, &kernel_object);
    check(status, "hsa_executable_symbol_get_info");

    hsa_signal_t signal;
    status = hsa_signal_create(1, 0, NULL, &signal);
    check(status, "hsa_signal_create");

    uint32_t* pm4_packet = static_cast<uint32_t*>(malloc(12 * sizeof(uint32_t)));
    pm4_packet[0] = PM4_HEADER_BUILD(PM4_HDR_IT_OPCODE_ACQUIRE_MEM, 0, 0, 0, 0);
    pm4_packet[1] = 0;
    pm4_packet[2] = PM4_HEADER_BUILD(PM4_HDR_IT_OPCODE_DISPATCH_DIRECT, 0, 0, 0, 0);
    pm4_packet[3] = kernel_object & 0xFFFFFFFF;
    pm4_packet[4] = kernel_object >> 32;
    pm4_packet[5] = args_size;
    pm4_packet[6] = 0;
    pm4_packet[7] = reinterpret_cast<uintptr_t>(args) & 0xFFFFFFFF;
    pm4_packet[8] = reinterpret_cast<uintptr_t>(args) >> 32;
    pm4_packet[9] = 1;
    pm4_packet[10] = 1;
    pm4_packet[11] = 1;

    uint64_t index = hsa_queue_load_write_index_relaxed(queue);
    hsa_kernel_dispatch_packet_t* dispatch_packet = reinterpret_cast<hsa_kernel_dispatch_packet_t*>(queue->base_address) + (index % queue->size);
    memcpy(dispatch_packet, pm4_packet, 12 * sizeof(uint32_t));
    hsa_queue_store_write_index_relaxed(queue, index + 1);

    hsa_signal_store_relaxed(queue->doorbell_signal, index);

    hsa_signal_wait_acquire(signal, HSA_SIGNAL_CONDITION_LT, 1, UINT64_MAX, HSA_WAIT_STATE_BLOCKED);

    status = hsa_signal_destroy(signal);
    check(status, "hsa_signal_destroy");

    status = hsa_executable_destroy(executable);
    check(status, "hsa_executable_destroy");

    free(pm4_packet);
}

void gpu_dma_copy(hsa_queue_t* queue, void* dst, const void* src, size_t size) {
    gpu_memcpy(queue, dst, src, size);
}

int main() {
    hsa_status_t status;

    status = hsa_init();
    check(status, "hsa_init");

    hsa_agent_t agent;
    status = hsa_iterate_agents([](hsa_agent_t agent, void* data) {
        hsa_device_type_t device_type;
        hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &device_type);
        if (device_type == HSA_DEVICE_TYPE_GPU) {
            *static_cast<hsa_agent_t*>(data) = agent;
            return HSA_STATUS_INFO_BREAK;
        }
        return HSA_STATUS_SUCCESS;
    }, &agent);
    check(status, "Finding a GPU agent");

    hsa_queue_t* queue;
    status = hsa_queue_create(agent, 4096, HSA_QUEUE_TYPE_SINGLE, NULL, NULL, UINT32_MAX, UINT32_MAX, &queue);
    check(status, "hsa_queue_create");

    uintptr_t gpu_base = 0x200500000;

    gpu_setup(queue);
    uint8_t data[] = {0xaa, 0xbb, 0xcc, 0xdd};
    gpu_memcpy(queue, reinterpret_cast<void*>(gpu_base + 4), data, sizeof(data));

    uint64_t args[] = {gpu_base, 0x1337};

    std::ifstream code_object_file("kernel.co", std::ios::binary);
    std::vector<char> code_object(std::istreambuf_iterator<char>(code_object_file), {});
    gpu_memcpy(queue, reinterpret_cast<void*>(gpu_base + 0x1000), code_object.data(), code_object.size());
    gpu_memcpy(queue, reinterpret_cast<void*>(gpu_base + 0x2160), args, sizeof(args));

    gpu_compute(queue, code_object.data(), code_object.size(), args, sizeof(args));

    gpu_dma_copy(queue, reinterpret_cast<void*>(gpu_base + 0x14), reinterpret_cast<void*>(gpu_base), 8);

    uint32_t result;
    gpu_memcpy(queue, &result, reinterpret_cast<void*>(gpu_base), sizeof(result));
    assert(result == 0x1337);

    gpu_memcpy(queue, &result, reinterpret_cast<void*>(gpu_base + 4), sizeof(result));
    assert(result == 0xDDCCBBAA);

    uint32_t dma_result1, dma_result2;
    gpu_memcpy(queue, &dma_result1, reinterpret_cast<void*>(gpu_base), sizeof(dma_result1));
    gpu_memcpy(queue, &dma_result2, reinterpret_cast<void*>(gpu_base + 0x14), sizeof(dma_result2));
    assert(dma_result1 == dma_result2);

    gpu_memcpy(queue, &dma_result1, reinterpret_cast<void*>(gpu_base + 4), sizeof(dma_result1));
    gpu_memcpy(queue, &dma_result2, reinterpret_cast<void*>(gpu_base + 0x18), sizeof(dma_result2));
    assert(dma_result1 == dma_result2);

    status = hsa_queue_destroy(queue);
    check(status, "hsa_queue_destroy");

    status = hsa_shut_down();
    check(status, "hsa_shut_down");

    return 0;
}

If you could articulate concisely and clearly what the end game is, I am happy to get OPUS to have another crack.
If I need to include other code / context, I'm happy to drop that in.
