
EGL extensions not found in FreeBSD 12 (nvidia-driver 390.87_1) #7

Open
amshafer opened this issue Jan 14, 2019 · 30 comments

@amshafer

I have a quick question on some EGL extensions for the FreeBSD driver that can’t seem to be found. I’m posting here because the Nvidia forums for FreeBSD don’t seem to get any answers from developers and I found this issue trying to run this repository. If there is a better place for me to ask this question please let me know.

I am trying to write a small Wayland compositor on FreeBSD, and while testing the eglstreams example the FreeBSD driver reports that it can't find certain extensions that the patch notes say are included.

Replication on FreeBSD 12.0-RELEASE amd64 on a GTX 1070 with nvidia-driver 390.87_1

(pkg install gmake …)
git clone https://github.com/aritger/eglstreams-kms-example
cd eglstreams-kms-example
gmake <——— Compilation with gmake works fine

running ./eglstreams-kms-example results in:
ERROR: eglGetProcAddress(eglGetOutputLayersEXT) failed


Some reading showed that the extensions are searched for in the "utils.c" file. The 346 driver release notes say:
Added support for the following EGL extensions:
EGL_EXT_device_base
EGL_EXT_platform_device
EGL_EXT_output_base <— It looks like this is the one missing
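For context, checks like the one in utils.c typically scan the space-separated extension string returned by eglQueryString()/eglQueryDeviceStringEXT(), and a correct check has to match whole tokens so that one extension name is not mistaken for a prefix of another. A minimal sketch of such a check (has_extension is a hypothetical helper, not the actual utils.c code):

```c
#include <string.h>

/* Hypothetical helper: returns 1 if `name` appears as a whole
 * space-separated token in `extensions`, so "EGL_EXT_output_base"
 * is not falsely matched inside a longer name like
 * "EGL_EXT_output_base_foo". */
static int has_extension(const char *extensions, const char *name)
{
    size_t len = strlen(name);
    const char *p = extensions;

    while ((p = strstr(p, name)) != NULL) {
        int starts_token = (p == extensions) || (p[-1] == ' ');
        int ends_token = (p[len] == '\0') || (p[len] == ' ');
        if (starts_token && ends_token)
            return 1;
        p += len;
    }
    return 0;
}
```

If the extension name is simply absent from the string the driver advertises, lookups of its entry points will fail, which is consistent with the eglGetProcAddress() error above.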

I don’t see EGL_EXT_output_base in the nvidia-settings app either. Is this repository a good way to get an OpenGL context without X11 on FreeBSD? And are these extensions still supported?

Please let me know if there is any more information I can provide, I am more than happy to help in any way to get this working, as it is a feature I am extremely interested in.

Thank you so much for your time!
Austin Shafer

@mvicomoya
Contributor

Hi @ashaferian,

These mechanisms were only enabled for FreeBSD very recently, and drivers from the 390.xx series won't include them.

Is there any chance you can try with 415.25?

@amshafer
Author

Hi @mvicomoya

Thanks for your reply! I have installed the 415.25 driver (I didn't try it originally because it has yet to make it into the FreeBSD ports tree; I'll try to get the port bumped to 415.25), but I still run into the following problem from line 154 of egl.c:

151     drmDeviceFile = pEglQueryDeviceStringEXT(device, EGL_DRM_DEVICE_FILE_EXT);
152 
153     if (drmDeviceFile == NULL) {
154         Fatal("No DRM device file found for EGL device.\n");
155     }

devfs does show the files /dev/nvidia-modeset, /dev/nvidia0, and /dev/nvidiactl. I am no expert on EGL, and certainly not on these extensions, but it seems like /dev/nvidia0 is what drmDeviceFile is looking for?

On the plus side, it does seem to find the extensions that were originally missing. That's a start at least. Also I should mention that the 415.25 driver does work well on X11. Please let me know if there is anything I can do to help.
Thank you so much!

@aritger
Owner

aritger commented Jan 15, 2019

Sorry, we don't yet have DRM KMS support in the NVIDIA FreeBSD driver. So, the NVIDIA EGL implementation on FreeBSD won't be able to find a DRM device file.

@amshafer
Author

That is unfortunate to hear, though it certainly explains the problems I've had. Isn't that what the nvidia-modeset.ko module was supposed to provide? Or does it only support KMS without the full DRM feature set that EGL needs?

Is there any hope that this will be added in the future? I know most of the development time is spent on the Windows or Linux drivers but I'd really love to see this supported on FreeBSD.

Thanks again!

@aritger
Owner

aritger commented Jan 15, 2019

The core modesetting functionality is provided by nvidia-modeset.ko, but gluing that to the DRM APIs is provided on Linux by nvidia-drm.ko. We designed this with the intent of making nvidia-drm.ko available on FreeBSD, eventually, but that hasn't been done, yet. At the time we started on nvidia-drm.ko, the DRM infrastructure on FreeBSD was lagging behind Linux. I haven't kept up with recent FreeBSD DRM developments.

If you're interested and motivated, it might be possible for you to take the nvidia-drm.ko source from the Linux .run driver package, and port that to FreeBSD. Otherwise, it is on our todo list within NVIDIA, though I admit it may be difficult to prioritize very highly.

Sorry this isn't already in place.

@mvicomoya
Contributor

@ashaferian Yeah, what @aritger said. Sorry I misled you by saying "these mechanisms were just enabled very recently". In fact, only some of these mechanisms were enabled. You should be able to use the Wayland EGL stack, but without an NVIDIA DRM KMS driver, EGL wouldn't be able to use EGLOutput to scan out.

Looking at https://github.com/FreeBSDDesktop/kms-drm, it might actually be relatively easy to grab nvidia-drm.ko source code from the Linux .run package (as @aritger also suggested) and use it as-is (or with very few modifications) using the linux-kpi compatibility interface.

@amshafer
Author

Thank you guys for pointing me in the right direction! I'd certainly be interested in giving it a try. I'll have to take a look at the Linux nvidia-drm.ko and see if it can be plugged into the linuxkpi. Is all of nvidia-drm open source? If so I'd rather just port it to FreeBSD than rely on the linuxkpi. I'd need to read up on the FreeBSD DRM interface before either of these, though.

Another quick question: is Vulkan supported? In the 390.xx drivers libEGL had the vkGetProcAddress symbol for the Vulkan loader, but it seems to be removed in 415.25.

Thanks!

@amshafer
Author

Is all of nvidia-drm open source? If so I'd rather just port it to FreeBSD than rely on the linuxkpi. I'd need to read up on the FreeBSD DRM interface before either of these, though.

Back at my desktop, so I can answer part of this for myself: I do see all the sources in the nvidia-drm folder. Obviously the nvidia-drm-linux.c file is very Linux-specific, but how about the rest? Some of the others, like nvidia-dma-fence-helper.c, seem to be Linux-specific as well, but most don't appear to be. How closely tied are the remaining files? (i.e. is it a feasible port or would it involve a total rewrite?)

Thanks again for letting me pester you with questions!

@mvicomoya
Contributor

@ashaferian nvidia-drm sources are Linux-centric overall. I don't think we have been very good at keeping Linux-specific bits contained in nvidia-drm-linux.c, sorry.

I'm not a FreeBSD expert myself, but it seems the DRM APIs in FreeBSD and Linux differ considerably, so a port might entail more work than the linuxkpi approach.

As for your Vulkan question, we don't support Vulkan on FreeBSD. There have been changes related to GLVND, and that's probably the cause of vkGetProcAddress not being exported by libEGL, since libEGL is now GLVND's.

@amshafer
Author

@mvicomoya Now that I think about it, I recall seeing a mailing list post saying that the linuxkpi-based drm was what they were going to use going forward. I think the old drm you mentioned is getting removed in the next major release. I'll have to double-check that, but linuxkpi seems the way to go.

So there wouldn't be any problems having nvidia-drm use the linuxkpi while nvidia.ko runs natively on FreeBSD? It sounds fine in theory, but I figured I'd double-check before I start investigating.

Ah that would make sense. Explains the removal of the symbol.

Thanks!

@amshafer
Author

Porting nvidia-drm to FreeBSD

Apologies for the long message

@mvicomoya @aritger Thanks for your advice from this thread, I have thrown some time into porting the nvidia-drm module using FreeBSD's linuxkpi and have some questions about a roadblock I have experienced. I'm sure I have made mistakes so please let me know if there is anything I can correct.

I placed what I have so far in a repo so that I can get feedback (and hopefully others can test). From reading Nvidia's license it seems like this was okay to do based on the Linux/BSD exception. If this is not allowed, or is a problem in any way, please let me know so that I can remove it.

Current Status

nvidia-drm.ko builds and can be loaded on FreeBSD using the linuxkpi. /dev/dri/card0 is created and suspend/resume is not broken. This repo (eglstreams-kms-example) does not work, however, as it 1) appears to need to link with libglvnd and 2) cannot perform drmModeGetResources. All my work was tested on a GTX 1070. A patch for PCI ID detection had to be added to the linuxkpi, so this port has to be run on FreeBSD 13.0-CURRENT revision 343451 or later.

Temporary hacks which need to be resolved at some point

  • FreeBSD's drm port does not install the drm headers. They had to be copied into src/common/linuxkpi to get things to compile. Definitely not ideal. The FreeBSD drm devs said the nvidia-drm module should be added to their repo, but I doubt Nvidia would allow that. I'll have to work out with them how to get those headers installed.
  • src/nvidia-drm/nvidia-drm-conftest.h has the macro options enabled manually, as FreeBSD doesn't have conftest.sh in the nvidia driver.
  • I don't know how to use libglvnd yet, and it appears this is the new way to get access to EGL? I will probably have to write a minimal program that links with libglvnd and then performs what this repo does.
  • Currently the linuxkpi seems to have a few memory leaks which are printed during the unloading of the nvidia-drm module. I don't think they are nvidia-drm's fault, but I plan on fixing them anyway.

The current problem (why I am writing to you)

One of the conftest options (NV_DRM_ATOMIC_MODESET_AVAILABLE) seems to be what actually enables most of the drm functionality. If this option is not enabled, nvidia-drm can be loaded successfully. It seems to work until the drmModeGetResources call, where it fails because this modeset option is not enabled.

I also had to implement some of the missing nvkms API functions in nvidia-modeset.ko. They appear to have been left out as nvidia-modeset originally didn't need them due to a lack of drm support.

The actual point of failure is line 303 of nvidia-drm-drv.c returning NULL:
pDevice = nvKms->allocateDevice(&allocateDeviceParams);
The allocateDevice function pointer appears to point to _nv000007kms in the closed source portion of the driver, which is why I am a bit stuck.

  • Is NV_DRM_ATOMIC_MODESET_AVAILABLE needed to realistically use drm?
  • What is the best way to test if the drm portion works?
  • What could be going wrong with the nvKms->allocateDevice call? It appears to fail after the nvkms_sema_alloc call.
  • Should nvkms_open_from_kapi call nvidia_open_dev_kernel? That seems to be the FreeBSD driver's equivalent of nvidia_open_common from the Linux driver.

All in all this port has some problems, and is far from perfect. It wasn't bad getting the open source portions ported, but now that the closed source allocation call is failing I could use a nudge in the right direction. I am still interested in getting this working, so please tell me if there is anything I can do to resolve these issues and get this port functional. I can also provide a more in-depth write-up of what I actually had to change to get to this point, if it would be useful for future work on the FreeBSD driver.

Thank you so much for your time, I really appreciate your help.

Austin Shafer

@aritger
Owner

aritger commented Jan 29, 2019

Nice work, Austin :)

It is definitely fine to host your work in progress on GitHub.

For the foreseeable future, I think we want to distribute the official nvidia-drm source as part of the NVIDIA driver package (we don't make guarantees about source or binary compatibility between NVIDIA driver components from different releases). So, yes, we'll need to find some compromise with the FreeBSD drm developers for access to drm header files.

libglvnd is a vendor-neutral dispatch layer for OpenGL, GLX, and EGL: https://github.com/NVIDIA/libglvnd AFAIK, it is enabled on FreeBSD. What specifically is the problem you're seeing with trying to use libglvnd?

Yes, NV_DRM_ATOMIC_MODESET_AVAILABLE is needed for nvidia-drm to support DRM Kernel Modesetting (KMS).

I'm sorry the code for nvKms->allocateDevice() is not currently publicly available. Reading through it, it will call back to nvidia-modeset-freebsd.c for:

nvkms_sema_alloc()
nvkms_open_gpu()
nvkms_call_rm()
nvkms_open_from_kapi()
nvkms_ioctl_from_kapi()

(and possibly others I missed)

It looks like you already have tracing code in nvidia-modeset-freebsd.c. Are any of those calls failing?

For your nvkms_open_from_kapi() question, I think the important part is that it calls nvKmsOpen(), which it looks like your nvidia-modeset-freebsd.c does.

For background:

  • The nv-modeset-kernel.o binary is the core of the "NVKMS" ("NVIDIA modeset") driver; it is called mostly through its exported symbols nvKmsOpen(), nvKmsClose(), and nvKmsIoctl().
  • Most clients of NVKMS are user-space driver components (The NVIDIA X driver, the NVIDIA OpenGL driver, etc), but nvidia-drm is a kernel-mode NVKMS client.
  • nvidia-drm calls through the NVKMS "kapi" layer, which is part of the nv-modeset-kernel.o binary, which will then call through the nvidia-modeset-{linux,freebsd}.c "kernel interface layers", and then call into NVKMS through the same nvKmsOpen(), nvKmsClose(), and nvKmsIoctl() that user-space NVKMS clients use.

My guess at this point is that nvidia-modeset-freebsd.c is missing something that the NVKMS "kapi" layer is expecting. Try to trace all the nvidia-modeset-freebsd.c functions and see if any of those are failing.
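A low-effort way to do that kind of tracing is an entry macro dropped at the top of every function in nvidia-modeset-freebsd.c; any function that never logs is either unimplemented or unreached. A userspace sketch of the idea (the TRACE() macro and the stub function names are illustrative, not part of the actual driver; in the kernel the fprintf would be a printf(9)/log(9) call):

```c
#include <stdio.h>

/* Illustrative tracing sketch: count and log every entry. */
static int trace_calls;

#define TRACE() do {                                    \
        trace_calls++;                                  \
        fprintf(stderr, "nvkms trace: %s\n", __func__); \
    } while (0)

/* Hypothetical stand-ins for kernel interface layer entry points. */
static void *nvkms_sema_alloc_stub(void)
{
    TRACE();
    return &trace_calls; /* stand-in for a real lock object */
}

static int nvkms_open_gpu_stub(int gpu_id)
{
    TRACE();
    return (gpu_id >= 0) ? 0 : -1;
}
```

Comparing the sequence of entries logged on FreeBSD against the same instrumentation in nvidia-modeset-linux.c should show where the two flows diverge.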

If that still doesn't help shed light on things, then a next step could be to run nvidia-drm.ko on Linux, and trace the functions in nvidia-modeset-linux.c. I would expect the flow on nvidia-modeset-linux.c and nvidia-modeset-freebsd.c to match. Maybe that will help make it clearer what is different.

Sorry, that debugging may be a bit tedious. Good luck!

@amshafer
Author

Thanks for your reply :)

So, yes, we'll need to find some compromise with the FreeBSD drm developers for access to drm header files.

I assume they don't install the headers because no one else uses them, so I think if I get this ported they should be open to having their port install them.

libglvnd AFAIK, it is enabled on FreeBSD. What specifically is the problem you're seeing with trying to use libglvnd?

No problems, I just wanted to make sure that was the proper way to get to EGL now. This repo wants to link with libEGL directly and fails, I wanted to make sure glvnd was the way to go before I spent time making a new minimal test app.

nvkms_call_rm()

Hmm, I don't remember seeing or implementing this one. I'll have to add it. I don't think I get far enough for this to be the root cause, but it's impossible to tell until I implement it. Do I need to do anything special to register these functions I'm adding to the kapi? I wonder if I missed a step and that's why my open function isn't being called.

It looks like you already have tracing code in nvidia-modeset-freebsd.c. Are any of those calls failing?

The only call that I see in /var/log/messages is to nvkms_sema_alloc, which returns a pointer to the lock. When poking around in the kernel debugger afterwards I can't read that address, so I think it may get freed when nvKms->allocateDevice fails? I assume that method has a 'failed' goto tag at the end like some of the other methods do.

For reference here is the debugging output from /var/log/messages:

Jan 29 22:52:05 wolfgang kernel: [drm] [nvidia-drm] nvKms:--------------
Jan 29 22:52:05 wolfgang kernel: [drm] [nvidia-drm] nvKms->enumerateGpus = ffffffff835545f0
Jan 29 22:52:05 wolfgang kernel: [drm] [nvidia-drm] nv_drm_driver.driver_features = 1f000
Jan 29 22:52:05 wolfgang kernel: drmn0: Comparing:--------------
Jan 29 22:52:05 wolfgang kernel: drmn0: vendor:10de == id->vendor:10de && device:1b81 == id->device:ffffffff
Jan 29 22:52:05 wolfgang kernel: drmn0: <drmn> on vgapci0
Jan 29 22:52:05 wolfgang kernel: drmn0: Comparing:--------------
Jan 29 22:52:05 wolfgang kernel: drmn0: vendor:10de == id->vendor:10de && device:1b81 == id->device:ffffffff
Jan 29 22:52:05 wolfgang kernel: [drm] [nvidia-drm] pci_dev fffff8003a76cc00 ------------------------
Jan 29 22:52:05 wolfgang kernel: [drm] [nvidia-drm] ->vendor = 4318
Jan 29 22:52:05 wolfgang kernel: [drm] [nvidia-drm] ->device = 65535
Jan 29 22:52:05 wolfgang kernel: [drm] [nvidia-drm] ->driver = ffffffff844d6170
Jan 29 22:52:05 wolfgang kernel: [drm] [nvidia-drm] [GPU ID 0x00000000] Loading driver
Jan 29 22:52:05 wolfgang kernel: nvkms_sema_alloc: creating mutex
Jan 29 22:52:05 wolfgang kernel: nvkms_sema_alloc: return 0x713f5860
Jan 29 22:52:05 wolfgang kernel: [drm:nv_drm_load] [nvidia-drm] [GPU ID 0x00000000] Failed to allocate NvKmsKapiDevice
Jan 29 22:52:05 wolfgang kernel: [drm:nv_drm_register_drm_device] [nvidia-drm] [GPU ID 0x00000000] Failed to register device
Jan 29 22:52:05 wolfgang kernel: device_attach: drmn0 attach returned -1
Jan 29 22:52:05 wolfgang kernel: [drm] [nvidia-drm] Registered pci driver with ret: 0

It's far from the greatest debugging output, but I can see the return from nvkms_sema_alloc (which should mean success). From reading the disassembled nvKms->allocateDevice, it appears the next call is nvkms_open_gpu, but I don't see any of my tracing print statements being called. That's about where I get lost.

Dump of assembler code for function _nv000007kms:
   0xffffffff83555be0 <+0>:	sub    $0x118,%rsp
   0xffffffff83555be7 <+7>:	mov    $0x1,%esi
   0xffffffff83555bec <+12>:	mov    %rbp,0xf0(%rsp)
   0xffffffff83555bf4 <+20>:	mov    %rdi,%rbp
   0xffffffff83555bf7 <+23>:	mov    $0xb8,%edi
   0xffffffff83555bfc <+28>:	mov    %rbx,0xe8(%rsp)
   0xffffffff83555c04 <+36>:	mov    %r12,0xf8(%rsp)
   0xffffffff83555c0c <+44>:	mov    %r13,0x100(%rsp)
   0xffffffff83555c14 <+52>:	mov    %r14,0x108(%rsp)
   0xffffffff83555c1c <+60>:	mov    %r15,0x110(%rsp)
   0xffffffff83555c24 <+68>:	callq  0xffffffff83515e20 <_nv002285kms>
   0xffffffff83555c29 <+73>:	test   %rax,%rax
   0xffffffff83555c2c <+76>:	mov    %rax,%rbx
   0xffffffff83555c2f <+79>:	je     0xffffffff83555c4b <_nv000007kms+107>
   0xffffffff83555c31 <+81>:	callq  0xffffffff83574ee0 <nvkms_sema_alloc>
   0xffffffff83555c36 <+86>:	test   %rax,%rax
   0xffffffff83555c39 <+89>:	mov    %rax,0x8(%rbx)
   0xffffffff83555c3d <+93>:	je     0xffffffff83555c4b <_nv000007kms+107>
   0xffffffff83555c3f <+95>:	mov    0x0(%rbp),%edi
   0xffffffff83555c42 <+98>:	callq  0xffffffff83574ae0 <nvkms_open_gpu> <---- It doesn't seem to get to here
...

For your nvkms_open_from_kapi() question, I think the important part is that it calls nvKmsOpen(), which it looks like your nvidia-modeset-freebsd.c does.

So should I use the nvkms_open_common that I wrote or the nvidia_open_dev_kernel? Or are you saying it shouldn't matter because they both call nvKmsOpen?

My guess at this point is that nvidia-modeset-freebsd.c is missing something that the NVKMS "kapi" layer is expecting. Try to trace all the nvidia-modeset-freebsd.c functions and see if any of those are failing.

I'll try adding print statements to every kapi call and see if I'm missing one. I only added statements to the functions I implemented myself, so I think your hunch is right and something is failing silently.

Sorry, that debugging may be a big tedious. Good luck!

Thanks! I suspect I'll have to pass my GPU through to a VM or something so I can step through it live. With my limited experience with this code it's kind of hard to just look at the methods and tell what's going wrong.

Thanks again for all the help!

@aritger
Owner

aritger commented Jan 30, 2019

If nvkms_sema_alloc() returns non-NULL, then the next thing nvKms->allocateDevice() will do is call nvkms_open_gpu(). Can you add some tracing to it?

nvkms_call_rm() is already implemented in nvidia-modeset-freebsd.c. It is just that you may want to add tracing to it. But, I suppose its data is opaque, so there isn't much to be learned by tracing it.

For nvkms_open_from_kapi(), I think it will be best to pattern it after the nvidia-modeset-linux.c version, which just calls nvkms_open_common().

@amshafer
Author

Yup and nvkms_call_rm is called super frequently, so it just seemed like noise when I traced it.

Made some progress: it turns out I was only tracing nvkms_open_from_kapi; however, nvkms_open_gpu calls nvidia_open_dev_kernel instead. I have been using arbitrary GPU IDs (as I couldn't find a way to get the correct ID in the nvidia-drm code), which makes nvidia_open_dev_kernel fail while trying to get nvidia_find_state. I didn't realize until now that the main nvidia module used IDs as well. I'll see if I can get the correct GPU IDs later after class, and hopefully that will solve this issue.

Cool I'll stick with the current linux modeled implementation of nvkms_open_common unless I run into problems with it.

Thanks!

@shkhln

shkhln commented Feb 3, 2019

we don't support Vulkan on FreeBSD

If you don't mind random strangers interjecting, is there currently any plan to support Vulkan on FreeBSD in the future?

@aritger
Owner

aritger commented Feb 6, 2019

We don't have current plans for Vulkan support on FreeBSD. But, knowing it is important to users helps us prioritize future work.

@shkhln

shkhln commented Feb 6, 2019

Admittedly, I've only seen about a dozen or so complaints (like this one), which is not much considering it's been 3 years. What really annoys me is the total lack of communication regarding supported features in the Linux/Solaris/FreeBSD drivers; even a simple comparison table posted somewhere on devtalk.nvidia.com would be nice.

@amshafer
Author

amshafer commented Feb 7, 2019

Hi @aritger,

Back with some more questions: one is a bug I think I've stumbled upon, and the other is about GEM page faulting.

Also, I would like to second adding Vulkan support. I'd much rather be interacting with Vulkan than EGL, as Vulkan definitely seems to be the future. Out of curiosity, is Vulkan dependent on nvidia-drm? (in the Linux driver, at least)

Nvidia-modeset Acquiring Duplicate Locks

Kernel Panic: nvidia-modeset tries to acquire os.lock_sx @ nvidia_os.c:651 despite currently holding the lock

Steps to replicate - Just run kldload nvidia-modeset on FreeBSD-CURRENT or any version with witness debugging enabled.

Can be encountered on a clean version of 415.25. It's pretty trivial to replicate, although the result can differ depending on how you trigger it. Behavior can range from warnings or a kernel panic during module loading to crashing the system later while trying to start X11. This is in the closed source portion of the driver, so there's nothing I can do. It'd be great if you have a FreeBSD CURRENT test system that you can install 415.25 on to check.

While debugging I enabled most of the kernel debugging options available for FreeBSD. The strict lock checking from the witness debugger notices locks being acquired while they are already held. I am on 12-RELEASE, but FreeBSD 13-CURRENT should have the same results, with stricter checking.

Normally on a release branch this isn't a problem at all; it just gets ignored. However, the CURRENT branch of FreeBSD (the development branch) always has the strict lock checking enabled. My guess is that the port maintainer for the nvidia driver encountered this on their development system, assumed it was also broken on 12.0, and won't pull 415+ until it works on CURRENT.

Feb  7 12:26:23 wolfgang kernel: acquiring duplicate lock of same type: "os.lock_sx"
Feb  7 12:26:23 wolfgang kernel:  1st os.lock_sx @ nvidia_os.c:651
Feb  7 12:26:23 wolfgang kernel:  2nd os.lock_sx @ nvidia_os.c:651
Feb  7 12:26:23 wolfgang kernel: stack backtrace:
Feb  7 12:26:23 wolfgang kernel: #0 0xffffffff80be1b93 at witness_debugger+0x73
Feb  7 12:26:23 wolfgang kernel: #1 0xffffffff80be18e3 at witness_checkorder+0xab3
Feb  7 12:26:23 wolfgang kernel: #2 0xffffffff80b82a98 at _sx_xlock+0x68
Feb  7 12:26:23 wolfgang kernel: #3 0xffffffff83be8522 at os_acquire_mutex+0x32
Feb  7 12:26:23 wolfgang kernel: #4 0xffffffff83acbac6 at _nv034611rm+0x16
Feb  7 12:27:19 wolfgang kernel: nvidia_open_dev:
Feb  7 12:27:20 wolfgang kernel: acquiring duplicate lock of same type: "os.lock_mtx"
Feb  7 12:27:20 wolfgang kernel:  1st os.lock_mtx @ nvidia_os.c:886
Feb  7 12:27:20 wolfgang kernel:  2nd os.lock_mtx @ nvidia_os.c:886
Feb  7 12:27:20 wolfgang kernel: stack backtrace:
Feb  7 12:27:20 wolfgang kernel: #0 0xffffffff80be1b93 at witness_debugger+0x73
Feb  7 12:27:20 wolfgang kernel: #1 0xffffffff80be18e3 at witness_checkorder+0xab3
Feb  7 12:27:20 wolfgang kernel: #2 0xffffffff80b58c33 at __mtx_lock_flags+0x93
Feb  7 12:27:20 wolfgang kernel: #3 0xffffffff83be899b at os_acquire_spinlock+0x1b
Feb  7 12:27:20 wolfgang kernel: #4 0xffffffff83ac70ac at _nv033903rm+0xc

Just for reference, here are the exact debug options I added to my kernel config (/usr/src/sys/amd64/conf/GENERIC):

Index: sys/amd64/conf/GENERIC
===================================================================
--- sys/amd64/conf/GENERIC      (revision 343025)
+++ sys/amd64/conf/GENERIC      (working copy)
@@ -88,6 +88,18 @@
 # Debugging support.  Always need this:
 options        KDB                     # Enable kernel debugger support.
 options        KDB_TRACE               # Print a stack trace for a panic.
+# For full debugger support use (turn off in stable branch):
+options        BUF_TRACKING            # Track buffer history
+options        DDB                     # Support DDB.
+options        FULL_BUF_TRACKING       # Track more buffer history
+options        GDB                     # Support remote GDB.
+options        DEADLKRES               # Enable the deadlock resolver
+options        INVARIANTS              # Enable calls of extra sanity checking
+options        INVARIANT_SUPPORT       # Extra sanity checks of internal structures, required by INVARIANTS
+options        WITNESS                 # Enable checks to detect deadlocks and cycles
+options        WITNESS_SKIPSPIN        # Don't run witness on spinlocks for speed
+options        MALLOC_DEBUG_MAXZONES=8 # Separate malloc(9) zones
+options        VERBOSE_SYSINIT=0       # Support debug.verbose_sysinit, off by default
 
 # Kernel dump features.
 options        EKCD                    # Support for encrypted kernel dumps

Please let me know if there is anything you need me to test or do for this.

How does __nv_drm_vma_fault work? (This is my real problem)

Update - Porting nvidia-drm is going relatively well, I can load the module, open /dev/dri/card0, and generally get pretty far before things start going wrong. Thanks again for all your advice along the way!

Problem - I can't seem to process the GEM nvkms page faults correctly. I don't find the correct vm_page_t to add in the FreeBSD portion, and at best I cause the userland EGL application to segfault. Most of my problems result from not having done this before. I've written virtual memory/page fault handling for XINU, but have had to learn the Linux and FreeBSD virtual memory systems on the fly for this.

I have read up on lwn.net about GEM and fault handling in Linux, and I found this which appears to be the Intel equivalent from the FreeBSD linuxkpi-based drm tree.

The relevant code is here

static int __nv_drm_vma_fault(struct vm_area_struct *vma,
                              struct vm_fault *vmf)
{
    unsigned long address = nv_page_fault_va(vmf);
    struct drm_gem_object *gem = vma->vm_private_data;
    struct nv_drm_gem_nvkms_memory *nv_nvkms_memory = to_nv_nvkms_memory(
        to_nv_gem_object(gem));
    unsigned long page_offset, pfn;
    int ret = -EINVAL;

    pfn = (unsigned long)(uintptr_t)nv_nvkms_memory->pPhysicalAddress;
    pfn >>= PAGE_SHIFT;

    page_offset = vmf->pgoff - drm_vma_node_start(&gem->vma_node);
#ifdef __linux__
#if defined(NV_VMF_INSERT_PFN_PRESENT)
    (void)ret;
    return vmf_insert_pfn(vma, address, pfn + page_offset);
#else
... (leaving out the related vm_insert_pfn section for clarity)
  • Request - I'd love it if you could give me some explanation of what the __nv_drm_vma_fault handler does, or point me to some resources I can use to learn. I can follow it roughly, but I am confused on things like:
  • Is this the page fault handler for video memory?
    • Is the GPU video memory handle mapped at nv_nvkms_memory->pPhysicalAddress? If not, what is this address?
    • If not video memory, does this handler just back a gem buffer with normal anonymous memory?
  • What is the significance of drm_vma_node_start(&gem->vma_node)? It's unclear to me why this offset is used. I assume it's just the offset into the virtual memory area where the gem memory should be stored?
  • I think most of my confusion can be summed up by "How do you find the location from which to back the faulted page?"

Just for reference, and to others reading this, here is my (broken) FreeBSD fault handling equivalent:

    page_offset = address + drm_vma_node_start(&gem->vma_node);
    vm_page_t page = PHYS_TO_VM_PAGE(IDX_TO_OFF(pfn + page_offset)); <--- can't find the right page
    vm_object_t obj = vma->vm_obj;
    vm_pindex_t pidx = OFF_TO_IDX(address);

    if (!page || !obj) {
	    NV_DRM_LOG_INFO("__nv_drm_vma_fault: missing page or vm object");
	    return VM_FAULT_OOM;
    }
    
    if (vm_page_busied(page)) {
	    NV_DRM_LOG_INFO("__nv_drm_vma_fault: page was busy, probably got the wrong one");
	    return VM_FAULT_OOM;
    }
    if (vm_page_insert(page, obj, pidx)) {
	    NV_DRM_LOG_INFO("__nv_drm_vma_fault: Could not insert the page");
	    return VM_FAULT_OOM;
    }

    page->valid = VM_PAGE_BITS_ALL;
    vm_page_xbusy(page);

    ret = VM_FAULT_NOPAGE;
    vma->vm_pfn_count++;
    return ret;

If any FreeBSD people see this and know what I've done wrong please feel free to comment. My work is stored in the git repo linked earlier.

As always please let me know if there is anything I can do differently to get this to work.
Thanks again for all of your time and your help!

@shkhln

shkhln commented Feb 7, 2019

Out of curiosity is Vulkan dependent on nvidia-drm? (in the linux driver at least)

No, Vulkan doesn't depend on nvidia-drm. At least not for the basic functionality, since it actually works under the Linuxulator.

@amshafer
Author

amshafer commented Feb 8, 2019

Oh cool, I assume that's what your repository is for, right? When you say it works, is that with or without X11 running?

@shkhln

shkhln commented Feb 8, 2019

Oh cool, I assume that's what your repository is for right?

Not quite; my repo contains a crude glibc shim that allows Linux libGL.so (& friends) to be loaded into a regular FreeBSD process without the Linuxulator.

When you say it works is that with or without x11 running?

With X11.

@aritger
Owner

aritger commented Feb 9, 2019

Hi Austin. Yes, I've taken note of the FreeBSD Vulkan requests.

nvidia-drm is not required for Vulkan, either within X11 or for VK_KHR_display.

I've filed NVIDIA internal bug 2507077 for the duplicate lock issue (I don't believe you can access the bug report, but you can at least use that number as a handle if asking about it in the future).

Virtual memory management is not my forte, but here is my understanding (looking at the Linux version of nvidia-drm-gen-nvkms-memory.c):

  • nv_drm_dumb_create() calls nvKms->allocateMemory() to allocate video memory.

  • nv_drm_dumb_create() calls nvKms->mapMemory() to map that video memory to the CPU.

  • nv_nvkms_memory->pPhysicalAddress is the physical address where the video memory is mapped. The GPU has several BARs (Base Address Registers) which are hardware-specific regions of address space. The system firmware and/or OS assign a physical address to each BAR. The BARs are smaller than all of video memory. So, mapping video memory to the CPU involves programming a portion of the GPU's BAR to point to a particular region of video memory. The physical address of the video memory is then the physical address of the BAR, plus the offset into the BAR where that video memory is mapped.

  • The physical mapping is linear: i.e., it occupies the physical addresses from
    nv_nvkms_memory->pPhysicalAddress
    to
    nv_nvkms_memory->pPhysicalAddress + args->size

Hopefully that helps with some context.

What sorts of values do you get for page_offset if you compute it like this?

page_offset = vmf->pgoff - drm_vma_node_start(&gem->vma_node);

Are those small values, less than PAGE_SIZE?

Looking at the FreeBSD linuxkpi-based drm tree you pointed to, does just this give you something sane?

vm_page_t page = PHYS_TO_VM_PAGE(nv_nvkms_memory->pPhysicalAddress);

It might also help to sprinkle some printouts in the Linux version of __nv_drm_vma_fault() to get a better understanding of the values that it computes.

I hope that helps.

@amshafer
Author

amshafer commented Feb 9, 2019

Hi @aritger

Thanks! I'm also happy to test anything regarding that bug if needed.

Also, thanks for the explanation; it's very helpful. I'll take a look at finding the correct physical location this weekend, but I'll follow up with a few things first.

The physical address of the video memory is then the physical address of the BAR, plus the offset into the BAR where that video memory is mapped.

So drm_vma_node_start(&gem->vma_node) is the offset into the BAR where the GEM buffer is located (and the location I should use to handle the fault)? Just want to make sure I'm not misunderstanding what you're saying.

What sorts of values do you get for page_offset if you compute it like this?

I get something like 0xfffffffff0000 or other such bad values, though I do get reasonable values from the drm_vma_node_start(&gem->vma_node) part. In theory this line should work; however, FreeBSD doesn't appear to have an equivalent of vmf->pgoff in its vm system, so that field will always be zero (causing the integer wraparound to the value listed earlier). The virtual_address field is filled with the offset calculated from the index into the vm_object corresponding to the page fault, so that looks like what my code should pass to vm_page_insert. On FreeBSD, the index value serves as both the physical offset and the index into the internal radix tree, which is pretty cool. The index values I get don't seem to be the problem; selecting the right page to back the request is where I seem to go wrong.

Looking at the FreeBSD linuxkpi-based drm tree you pointed to, does just this give you something sane?

This does, although the vm_page_t is busy; I assume that's because it was already mapped/in use by nvidia.ko. I'm still learning the ins and outs of the FreeBSD vm system, so I'm exploring to see what is and isn't normal. I can also get valid vm_page_t structs (that may or may not be busy) with the page_offset logic currently in my code.

Thanks!

@aritger
Owner

aritger commented Feb 12, 2019

@ashaferian, if you haven't seen it, this looks like a helpful resource for what you're investigating:
https://www.kernel.org/doc/html/v4.15/gpu/drm-mm.html

I'm not too familiar with drm_vma_node_start(), but hopefully the above document helps.

@amshafer
Author

Thanks! I had actually read the first part about the concepts and then got sidetracked trying to fix things. Looking at it again made me notice the function descriptions, including drm_vma_node_start. My guess is that the note about shifting by PAGE_SHIFT bits to get an offset is the mistake I've been making.

@amshafer
Author

amshafer commented Mar 4, 2019

Hi @aritger

I've made some progress, and have run into an issue with EGL (that I assume is coming from a drm initialization problem). I don't have source for libEGL_nvidia so I was hoping you could point me in the right direction.

Where I am at

I solved the page faulting problem from my previous messages. (For anyone reading this in the future, it came from me not properly setting the vma area's pfn_first and pfn_count fields.) The repo I linked earlier has the changes pushed, the commit messages have more details.

Running this eglstreams example repo, I can make it all the way through DRM setup and mode setting (kms.c)! The screen goes black as the vt sets itself to graphics mode, but EGL is not initialized (egl.c), so there is no data to display.

Current Problem

Here's the bit in egl.c where it breaks:

ret = pEglGetOutputLayersEXT(eglDpy, layerAttribs, &eglLayer, 1, &n);

if (!ret || !n) {
    Fatal("Unable to get EGLOutputLayer for plane 0x%08x\n", planeID);
}

pEglGetOutputLayersEXT is a valid pointer, but n comes back as 0. ret is 1 (EGL_TRUE, it appears), which means the EGL call itself succeeded and there just was no output layer for the plane. The plane ID is 0x1e, if it's relevant.

I've tried dtracing around, as it's been really helpful with the kernel part of this port, but I can't seem to get anywhere without the source. There are no calls into the nvidia-drm module during this EGL line. Is there some way I need to register an output layer for a DRM plane that I'm not doing currently? Any help would be appreciated.

EGL linking question

What's the proper way (simplest case) to link with EGL using the 415.25 driver? I see that there is a libEGL.so in the glvnd folder, and a libEGL_nvidia.so seems to be installed. The 410 driver series is currently being brought into the FreeBSD ports tree (see Bugzilla), so glvnd isn't yet available through a package manager. I assume Nvidia has some docs or tutorials on getting started with glvnd, but I can't seem to find anything about a simple EGL example. (Also, I assume glvnd is used rather than linking with libEGL directly.)

Thank you so much!
Austin

@aritger
Owner

aritger commented Mar 5, 2019

@ashaferian, nice debugging.

I stepped through things on the EGL side on Linux, and it looks like there is a good chunk of the code that interfaces EGL and DRM that is only built into the Linux EGL binary, but not the FreeBSD EGL binary. Sigh. Sorry :(

I think that explains why your output layer query is returning n==0.

I'll see what I can do to correct the FreeBSD build of EGL for a future NVIDIA driver release. In the meantime: is running the Linux EGL driver through the linuxulator viable to help test your nvidia-drm implementation? Other DRM-KMS tests might include the xf86-video-modesetting X11 driver.

For EGL linking, there is some stale, Linux-centric documentation about packaging glvnd here:

https://devtalk.nvidia.com/default/topic/915640/unix-graphics-announcements-and-news/multiple-glx-client-libraries-in-the-nvidia-linux-driver-installer-package/

The intent is that libEGL.so should be provided by glvnd, and that NVIDIA just provides a libEGL_nvidia.so that gets loaded by glvnd. On Linux, we bundle a glvnd libEGL with the driver .run file, and at install time detect if the distro has a distro-provided glvnd libEGL; if not, we install the one in the driver package. I forget what we do on FreeBSD, but I suspect we unconditionally install the driver-packaged glvnd libEGL.so. In any case, applications should just be able to link against libEGL.so, independent of any particular driver implementation.

@amshafer
Author

amshafer commented Mar 6, 2019

@aritger I'm sorry to hear that :( If you manage to get it included at some point, please let me know! Would introducing the DRM changes to FreeBSD's EGL mess with EGL users that aren't using DRM?

I can't seem to get the Linuxulator working with EGL, although I think it's within the Linuxulator's abilities. I had some trouble linking with the libraries from the Nvidia driver because they are 32-bit, and I need to track down a libdrm package for things to work. I can try this at some point in the future when I have a little more free time.

Also, thanks for the link. It's definitely helpful even if it's a little outdated. Maybe I'll help the committers finally get glvnd included in ports. And from a quick look I think you are right: Nvidia just automatically makes a link from libEGL.so to libEGL_nvidia. I vaguely remember hearing this causes problems if you install mesa-libs too, but I could be wrong.

Somewhat unrelated, but even though the EGL problems spell doom for this port, I did have a lot of fun learning how to do it. I'm looking for an internship before I go to grad school this fall, and if you all are still hiring interns for this summer I'd love to keep working on stuff like this.

Thank you so much for all of your help!
Austin

@shkhln

shkhln commented Mar 6, 2019

I did have a lot of fun learning how to do it

I've got a suspicion you might be interested in looking at CUDA stuff. Specifically a few stubs in nvidia.ko: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224358#c15.

I can't seem to get the Linuxulator working with EGL, although I think it's within the Linuxulator's abilities. I had some trouble linking with the libraries from the Nvidia driver because they are 32-bit

If you have installed nvidia-driver port, you should also have 64-bit Linux libraries.

Nvidia just automatically makes a link for libEGL.so to the libEGL_nvidia.

Not really, libEGL.so is one of the glvnd wrapper libraries.
