New features (SDL) #12
Comments
SDL2: F2 quit, Left / Right Alt Open Solid Apple |
Ok keyboard works. |
This all looks very promising and it's great to see the SDL build included with Qt build. A quick informal test (Mario Bros) and I definitely see some input delays for Qt. SDL appears more responsive, but although CPU is only about 30% on Pi4, emulation appears to be running slower than normal (enhanced speed is unchecked). The --qt-ini flag makes changing configurations very easy. Exciting to see the rapid progress! |
Added audio. Enhanced speed only affects the emulator when the disk is spinning. I need to find some tradeoff quality / speed. |
Would it be possible to allocate audio processing to another core? I can imagine synchronization might be an issue... what about dedicating a core to the 6502 CPU responsibilities and another core to everything else (or developing the model of add-on cards using separate threads / cores)? Have you reached out to the main AppleWin group to inquire about your code repository being wrapped in as a first class citizen? |
The problem with what you suggest is that it requires a departure from AppleWin, which would make merging 10x more complicated. What might be more effective is a decrease in video quality, which is where most of the time is actually spent. As for merging the code with AW, you could try to mention it there and see what they answer: AppleWin#538 I periodically create PRs to fix compiler issues in the files that are shared, but have never tried to unify the code completely. AW would require some work to become more modular, which I don't think they are interested in. |
I stumbled across the following value in the applelin.conf file from https://github.com/linappleii/linapple: The comment surrounding this config value reads "By default the emulator's draw code, a large share of the processing, is performed in a separate thread, probably on a different core." I can confirm that on Pi two cores are being consumed for smooth sound & video. I understand not wanting to deviate from the AppleWin core, but since the video component deviates anyway, offloading video to a separate core may be reasonably straightforward and reconcile issues on Pi4. |
Could you please do me a favour and check this https://github.com/audetto/AppleWin/blob/master/source/frontends/sa2/emulator.cpp#L197 Using The emulator will not run and just render a black screen. No idea how to improve it though, short of reducing FPS. |
With top, CPU utilization is at 97-98%. Looks like pretty close to a full core being utilized based on overall CPU utilization of around 33%. Don't see a latency / performance problem with the normal screen. Switching to fullscreen, I now see lag (guessing top would show 100% utilization). Commenting out lines dropped overall utilization by 5-8%, but obviously isn't displaying anything. The above referenced linapple repository successfully runs SDL on a separate core, but it's SDL 1.2. I'm attaching A2 emulator code I've been working on from James Hammons that pulls some of the SDL2 initialization routines from GSPlus as a possible point of reference. The code may or may not be useful... the emulator itself only uses ~2/3 of a CPU core, so I have no idea whether it would span an additional core if needed. An additional note... SDL seems to be very sensitive to how it's initialized. It's still early to say definitively, but I'm seeing that initializing with the wrong audio values can sometimes sap performance. |
I am still puzzled by the results. What definitely takes a long time is the video update, which you can switch off to check. If you disable this https://github.com/audetto/AppleWin/blob/master/source/frontends/sa2/emulator.cpp#L199 it will still run the CPU and draw the screen (black), but will not update the Apple bitmap. I suspect that other emulators have a less precise video generation and so run quicker. I originally wrote my own video update (not precise) but dropped it as it was a lot of extra work. You are definitely right about SDL and I still have to find a quick way to just paint a black screen at 60FPS on a Pi using SDL2. |
Try this https://github.com/audetto/AppleWin/tree/pi On a Pi3 with FakeKMS ~77% in 1x size, ~99% in 2x size. |
I have tried and it does run very much at the same speed as my SDL port. My version runs at 78% or 99% as mentioned above and in 1x size audio is ok. I think they all suffer the same problem: SDL screen drawing. If you find any other emulator that runs quick on a Pi at 60FPS, I am happy to copy their video rendering. |
I've really been surprised by the performance differential between SheepShaver (PPC Mac emulator) versus several different Apple II emulators. I'm not sure whether the build I'm running is using SDL1.2 or SDL2, but resource utilization is only about 10% of CPU. My guess is the A2 emulators dedicate more cycles to ensure proper 6502 timing (whereas PPC Mac emulators mainly don't seem to care about CPU timing). It looks like SheepShaver can be compiled against either SDL1.2 or SDL2, which may be a solid test to identify whether SDL2 is the performance culprit. For GSPlus (and several other A2 emulators), increasing audio sample buffer initialization up to 4096, (e.g. wanted.samples = 4096;) addressed both sound problems AND reduced CPU load for SDL1.2 AND SDL2. A key advantage of SDL2 is that you get "free" scaling and texture overlays. It's allowed me to make the screen resizeable (up to 1080p) without significantly affecting performance as well as simulating scanlines. HOWEVER, in order to do this, I had to disable where the emulator was relying on code routines to double the image size (i.e. switched emulator FROM running at 560x384 with software scaling BACK to 280x192 relying on SDL2 hardware scaling for arbitrary resolution). This emulator is running at 60FPS along with some video improvements at about 66% core utilization. It seems that increasing the video buffer from 280x192 to 560x384 is pushing too much data to the SDL video buffer. I haven't taken a close look at how you're rendering video yet (i.e. a whole frame at a time or only refreshed regions). I don't know of any easy way to calculate a "buffer checksum," but for some emulators I think CPU utilization would go WAY down if it were possible to quickly calculate the video buffer checksum (something like an array_sum function) and only push video updates to SDL1.2/2 video buffer if the checksum changes. I'm not really surprised that DirectX on Windows is rendering faster than SDL2 on Linux. 
Unless you're using OpenGL within SDL, it's probably not taking full advantage of the GPU. |
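For anyone weighing the `wanted.samples = 4096` suggestion: a larger buffer means fewer audio callbacks per second, but each buffer also covers more time, which shows up as latency. A rough sketch of that arithmetic follows. The function names are made up for illustration, and the 44100 Hz rate in the usage note is an assumption, not a value taken from any of the emulators mentioned here.

```cpp
#include <cassert>

// Time covered by one audio buffer, in milliseconds.
// `samples` plays the role of the value assigned to wanted.samples,
// `sampleRate` is the output frequency in Hz (assumed, not from the thread).
double bufferLatencyMs(int samples, int sampleRate)
{
    assert(samples > 0 && sampleRate > 0);
    return 1000.0 * samples / sampleRate;
}

// How many times per second the audio callback must run to keep the device fed.
double callbacksPerSecond(int samples, int sampleRate)
{
    assert(samples > 0 && sampleRate > 0);
    return static_cast<double>(sampleRate) / samples;
}
```

At an assumed 44100 Hz, `bufferLatencyMs(4096, 44100)` comes out at roughly 93 ms with about 11 callbacks per second, while 1024 samples would be roughly 23 ms with about 43 callbacks per second.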
GSPlus idles at about 10% CPU utilization, which is much more in line with SheepShaver (PPC Mac emulator). It uses SDL2 and is probably the best reference (check the Issues section for tips on proper compilation). Linapple happily runs over 100% (top), meaning it appears to be truly multi-core, but it's SDL1.2. The emulator by James Hammons that I've optimized runs at 66% of a CPU core. Happy to drop ARM compiled binaries if you want to check for yourself. Alternatively, happy to supply whatever source you'd like (if you can't find it on GitHub). |
One thing at a time.
|
I sat down and took a look at your rendering code. I think you're spending a lot of time on the memcpy operation inside emulator.cpp (~40% of CPU time on a Pi4) to basically copy your video buffer for SDL (I believe instead of using SDL's own version). My own tests with memcpy in the past weren't very good for a block of data as large as what you're generating (I think you mentioned 560 x 384 in your SDL forum post). Here's relevant (working) code that renders without a memcpy operation: Here's a really simple example that renders a black screen: You've got a similar set of operations within your refreshTexture method, but are returning a rectangle and then pursuing subsequent rendering operations. I tried a relatively straightforward code swap, but am getting a black screen. Top, however, shows %CPU right at 60%, so if you can get rid of memcpy I believe it'll run with breathing room on a Pi 4. If you do manage to engage the Broadcom GPU blob mentioned within the SDL forums (perhaps using OpenGL ES), combined with eliminating memcpy, I think you'll be able to get this running with desired performance characteristics on a Pi 3. |
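To put a number on how much data an extra per-frame copy moves, here is a back-of-the-envelope sketch. It assumes a 32-bit (4 bytes per pixel) frame buffer, which is an assumption on my part and not something stated in this thread. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Size of one frame in bytes for a given resolution and pixel size.
constexpr std::size_t bytesPerFrame(std::size_t width, std::size_t height,
                                    std::size_t bytesPerPixel)
{
    return width * height * bytesPerPixel;
}

// Bytes moved per second if every frame is copied once at the given rate.
constexpr std::size_t bytesPerSecond(std::size_t width, std::size_t height,
                                     std::size_t bytesPerPixel, std::size_t fps)
{
    return bytesPerFrame(width, height, bytesPerPixel) * fps;
}
```

Under that assumption, 560 x 384 at 60 FPS is 51,609,600 bytes per second for each copy, four times the 12,902,400 bytes per second of a 280 x 192 buffer. Every additional copy of the frame (emulator buffer to staging buffer, staging buffer to texture) adds the same amount again.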
Yes, this is something I changed here https://github.com/audetto/AppleWin/tree/pi It seems that SDL_UpdateTexture is faster even if the doc says no. So it requires no memcpy. The problem with your code is that AW manages its own memory buffer for the video, so it would require deeper changes. |
I think your best bet is to really investigate how GSPlus is doing things: The only change I've made locally is documented here (to get sound working): digarok/gsplus#106 When I start this up, the emulator indicates that OpenGL is being used. Running at 2.8Mhz top indicates about 10% CPU utilization. At 8Mhz it's up to 25%. At "unlimited" I'm hitting 100% CPU, but that's no surprise. Looks like the key is to get OpenGL involved. Based on what I'm seeing with GSPlus, that should fix the problems on Pi3 too. |
Try this https://github.com/audetto/AppleWin/tree/threads I've moved AW CPU to a separate thread. |
Tested threads branch. I do see that CPU is exceeding 100% (multi-threading) and emulator is playable at full-screen with a minor increase in load. Sound, for me, is still significantly delayed (by at least a few seconds). |
As another point of reference, GSPort (https://github.com/david-schmidt/gsport/) is running at under 5% utilization ( Could this be useful: digarok/gsplus#58? |
https://github.com/digarok/gsplus : this one is fast because it does not redraw at a constant speed. If the screen changes, CPU goes up. It is a good idea, but needs cooperation from AW. The code was more complicated than I was able to understand at a quick glance. I put some counters around texture update and render copy. https://github.com/david-schmidt/gsport/ : this one uses xlib and maybe the same trick as gsplus. I don't want to try xlib. If you can make up a simple loop refreshing at 60Hz, and it is fast, then we can move from SDL to xlib (a very sad decision). digarok/gsplus#58 : it does not say much. |
Have you run a profiler on it? Seems like gprof and Gperftools work on Pi. I actually rewrote the core Windows BitBlt/StretchBlt in GSport back in 2015 to add full-screen integer scaling. I'm pretty sure some partial redraws were done, but I completely forget the redraws, sorry. I did test other GSport features I was developing on a Pi 2 and performance seemed fine. |
Short of profiling SDL2 and the Pi kernel, I don't know what else to do. In order to increase SNR, I've created a small project that does exactly the same as this SDL2 port of AW so we can experiment and find the best configuration. https://github.com/audetto/SDL_Demo compiled in Release, on a Pi3 I get 66% CPU just to redraw the screen. If anybody can do better than this, I'd be happy to know. Doing a smart update of the screen requires invasive changes to AW video update routines, which will not happen anytime soon. |
This is a good approach. I took a few stabs at it and don't see a way to significantly reduce the CPU load. It DOES look like the problem relates to SDL and there are special ways to build SDL for Pi that interface directly with the Broadcom GPU (which then has to be statically linked). A discussion here that looks particularly relevant: grimfang4/sdl-gpu#87 ...and this: https://sourceforge.net/projects/raspberry-pi-cross-compilers/ |
In that discussion they were suggesting opengles and if one uses the opengles2 driver in SDL, CPU usage drops to 42% in the demo. Good. I've added a few options to
At the end of the run, it will print stats about timings:
The meaning of
They do not include time spent in locking. The clock shows expected vs actual speed (crucial for correct audio play). |
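For readers who want to collect similar per-section timings in their own test loop, a minimal sketch of an accumulating timer is below. It is not the code used in the SDL port. `SectionTimer` and its members are hypothetical names, and it only relies on `std::chrono::steady_clock`.

```cpp
#include <cassert>
#include <chrono>

// Accumulates the time spent in one section of the main loop
// (for example a texture update or a render copy).
struct SectionTimer
{
    using Clock = std::chrono::steady_clock;

    Clock::duration total = Clock::duration::zero();
    unsigned long count = 0;

    // Runs `work` once and adds its duration to the running total.
    template <typename F>
    void measure(F&& work)
    {
        const Clock::time_point start = Clock::now();
        work();
        total += Clock::now() - start;
        ++count;
    }

    // Mean time per call in microseconds, or 0 if nothing was measured.
    double meanMicros() const
    {
        if (count == 0)
            return 0.0;
        return std::chrono::duration<double, std::micro>(total).count() / count;
    }
};
```

One timer per section, wrapped around the call being measured and printed at exit, is enough to see which step dominates a frame.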
FYI, the changes drop CPU from ~50% (top) on Pi4 to 27%. Just about cuts resource usage in half. |
Hi guys. If the lack of hardware acceleration is the problem, the only
recourse will be to try updating frames (to SDL) less often. A couple of
approaches come to mind.
1. I think GSport does this: track performance, and if it's less than
desired then increase a variable which controls frame skipping. So first
you would skip every other frame (30 fps) etc. Of course, AppleWin will
still be updating the bitmap, but you just don't push it to SDL.
2. If it can be done efficiently, try to detect whether a new frame to be
pushed is the same as the last frame pushed. If it is, don't push to SDL.
Ideally this would be done inside AppleWin, but support for this kind of
feature was removed as Tom described. The cell-based method AppleWin used
was a common way to do this. Also line by line methods are used. Another
simple way is to keep the video data for the last frame and compare byte by
byte as the new frame is generated. This could be done externally to
AppleWin. A quicker method (rather than comparing) would be to generate a
checksum of the frame and just keep that. Compare it to the checksum for
the new frame and skip it if it is the same. Obviously this risks skipping
frames that are different but have the same checksum. The safest way is to
keep the entire last frame and compare every byte to the new frame. Of
course, if they differ early you can immediately start the SDL push. In
most cases they won't differ, which means you check the entire buffer, but
the upside is you avoid the SDL push, so it might be faster.
Cheers,
Nick.
|
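A minimal sketch of the first approach (frame skipping), assuming the caller can tell whether the emulator fell behind real time during the last frame. `FrameSkipper` and `shouldPresent` are hypothetical names and are not taken from AppleWin, SDL, or GSport. AppleWin would still update its bitmap every frame, and only the push to SDL is skipped.

```cpp
#include <cassert>

// Decides, frame by frame, whether the finished frame should be pushed
// to the renderer. The skip level rises while the emulator is behind
// real time and falls back towards zero when it keeps up.
class FrameSkipper
{
public:
    explicit FrameSkipper(int maxSkip) : myMaxSkip(maxSkip)
    {
        assert(maxSkip >= 0);
    }

    // Call once per emulated frame. `behind` is true when the last frame
    // took longer than its time budget. Returns true if this frame
    // should be presented.
    bool shouldPresent(bool behind)
    {
        if (behind)
        {
            if (mySkip < myMaxSkip)
                ++mySkip;
        }
        else if (mySkip > 0)
        {
            --mySkip;
        }

        if (myCounter >= mySkip)
        {
            myCounter = 0;  // present this frame and start counting again
            return true;
        }
        ++myCounter;        // drop this frame
        return false;
    }

    // Number of frames currently dropped between two presented frames.
    int skip() const { return mySkip; }

private:
    int myMaxSkip;
    int mySkip = 0;
    int myCounter = 0;
};
```

With `maxSkip` set to 1 and a permanently slow machine, this settles at presenting every other frame (30 fps at a 60 Hz emulation rate). Adjusting the level on every single frame is deliberately naive, so a real implementation would probably average over several frames first.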
This is what implicitly happens with the 2 threads version (I think). What I was toying with is exactly what you said, trying to be smart about detecting duplicate frames.
With the latest findings about opengles2, none of them are urgent, but I find the problem challenging and interesting, I will try to see what can be done. |
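For reference, the "keep the entire last frame and compare" variant can be done outside AppleWin with very little code. The sketch below assumes 32-bit pixels in one contiguous buffer. `FrameCache` is a hypothetical name, and this is not what the port currently does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Remembers the last frame that was pushed and reports whether a new
// frame differs from it.
class FrameCache
{
public:
    // Returns true if `frame` differs from the stored frame (or if no
    // frame is stored yet) and stores a copy of it. Returns false if the
    // frame is identical, in which case the push to SDL can be skipped.
    bool changed(const std::uint32_t* frame, std::size_t pixels)
    {
        assert(frame != nullptr && pixels > 0);
        const std::size_t bytes = pixels * sizeof(std::uint32_t);
        if (myLast.size() == pixels &&
            std::memcmp(myLast.data(), frame, bytes) == 0)
        {
            return false;
        }
        myLast.assign(frame, frame + pixels);
        return true;
    }

private:
    std::vector<std::uint32_t> myLast;
};
```

The comparison reads both buffers in full when nothing changed, and a changed frame costs a compare plus a copy, so whether this is a net win depends on how expensive the SDL push is and how often the screen is static.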
Here are a few compiler options compatible w/ cmake: the 'crypto' option for fpu might help with those hashes. It's designed for coin mining, so you may need to choose your hash algorithms carefully for it to kick in. |
That is very complicated considering we are trying to fix a performance problem. ; - ) Using an 8-bit (size_t) hash is also a bad idea because you only have 256 values, so a high chance of collision. I would just XOR or ADD the data to get the simplest checksum possible. Because we want the fastest result, and also the smallest chance of hash (checksum) collision, we should choose the longest data unit available. If you can use 64-bit then do that, otherwise just a running 32-bit XOR or ADD should be good enough. It's probably worth looking at the generated assembly language and using that to try to optimize your C++ loop. For instance, it might be like 6502 assembly in that counting backwards to 0 is more efficient than counting from 0 up to a constant. Cheers, |
All you say is true except: size_t is 32 / 64 bits depending on architecture. You are right as well in the data size, the loop should be over 32 / 64 bits at least, now it does byte by byte. |
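A sketch of what a checksum over 64-bit units could look like, for comparison with the byte-by-byte loop. `xorChecksum` is a hypothetical name. The loop uses `std::memcpy` for each word so it does not depend on the buffer being 8-byte aligned, and compilers normally turn that into a single load.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// XORs the buffer together 8 bytes at a time. Any remaining 1..7 bytes
// are folded in as one zero-padded word.
std::uint64_t xorChecksum(const void* data, std::size_t bytes)
{
    assert(data != nullptr || bytes == 0);
    const unsigned char* p = static_cast<const unsigned char*>(data);
    std::uint64_t sum = 0;
    std::size_t i = 0;
    for (; i + sizeof(std::uint64_t) <= bytes; i += sizeof(std::uint64_t))
    {
        std::uint64_t word;
        std::memcpy(&word, p + i, sizeof(word));  // unaligned-safe load
        sum ^= word;
    }
    if (i < bytes)
    {
        std::uint64_t tail = 0;
        std::memcpy(&tail, p + i, bytes - i);
        sum ^= tail;
    }
    return sum;
}
```

The usual caveat applies: XOR is blind to changes that cancel out. For example, writing the same value into two pixels that sit at the same offset in different 64-bit words leaves the checksum untouched, so a frame can change without being detected.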
Of course, you're right. I've been working in C#, JavaScript, and PowerShell for weeks, so my C++ personality is paged out. ; - ) I thought I would have a quick look in VS 2017, and got a surprise. Counting down, I was pleased that it unrolled the loop, and this version took 20 microseconds:
If your compiler doesn't do this you could manually unroll the loop - which didn't change the time here of course:
But then I tried counting up, and VS unleashed the SIMD magic. This code ran in 11 microseconds:
I would have to look those instructions up(!) but I know ARM has SIMD these days too. Cheers, |
Audio: made the emulator speed stick to wall clock.
Channel 1 is Speaker, 2 is Mboard. |
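For readers unfamiliar with the idea, "sticking to the wall clock" usually means deriving the number of CPU cycles to execute from elapsed real time instead of from a fixed per-frame budget. The sketch below shows that calculation only. It is not the code in the SDL port, and `targetCycles` and `cyclesToRun` are hypothetical names. The clock rate is a parameter (an Apple II runs at roughly 1.02 MHz).

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Total number of CPU cycles that should have been executed once
// `elapsed` wall-clock time has passed since the emulation started.
std::uint64_t targetCycles(std::chrono::nanoseconds elapsed, double clockHz)
{
    assert(clockHz > 0.0 && elapsed.count() >= 0);
    const double seconds = std::chrono::duration<double>(elapsed).count();
    return static_cast<std::uint64_t>(seconds * clockHz);
}

// Cycles to run now so that emulated time catches up with the wall
// clock. Returns 0 when the emulator is already ahead.
std::uint64_t cyclesToRun(std::uint64_t executed,
                          std::chrono::nanoseconds elapsed, double clockHz)
{
    const std::uint64_t target = targetCycles(elapsed, clockHz);
    return target > executed ? target - executed : 0;
}
```

Because the audio device consumes samples at a fixed real-time rate, keeping the emulated cycle count tied to elapsed time is what keeps the generated audio from drifting ahead of or behind playback.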
Awesome to see you making headway! I've been investigating ways to integrate a proper UI (Gtk3 or Qt5) with SDL2 in order to be able to provide the responsiveness along with a full-fledged interface. SDL2 provides the There's not a whole lot of documentation available for mixing SDL with UI libraries, but it looks like Bsnes (https://github.com/bsnes-emu/bsnes) is using SDL2 with Gtk2, Gtk3, Qt4, and Qt5 (apparently selectable at compile time). It might be possible to come full circle back to the Qt interface you built along with the optimized SDL2 rendering code. |
Can you post an example of how you display a gtk dialog in sdl. |
Unfortunately I don't have Gtk code that goes further than attaching SDL2 to a Gtk3 window. For my own needs, I think ImGui (https://github.com/ocornut/imgui) is the more straightforward approach. It provides GUI elements with native SDL2 support. It's pretty lightweight w/ extensive demo code. Since it uses the native rendering engine (e.g. SDL2) it doesn't require the "window hack" (for Gtk/Qt) that probably isn't portable to Windows and appears reasonably cross-platform. Since you're rewriting the display mechanism, I assume you have access to it, but the caveat with ImGui is that it ties into the main rendering loop. If you're interested in taking the ImGui route, I'll try to throw together sample dialog code. |
My first reaction was: not another GUI toolkit! GTK or QT are stable, supported, and available everywhere. But, but, but .... their front page looks really impressive. Are packages available in the main distros: Fedora, Ubuntu, Raspbian? That would definitely help. |
I've learned what It uses OpenGL2, but they say one should jump to OpenGL3, and I need to see how they both work on a Pi. https://github.com/audetto/AppleWin/tree/imgui one needs to pass Most of the SDL code can be reused. |
I think this is a positive development! Keeping the GUI elements managed by SDL reduces the dependencies and, I suspect, will provide greater longevity. XGS (https://github.com/jmthompson/xgs) also uses ImGui + OpenGL3 and this has been tested on a Pi. In private exchanges with the author, he had this to say about rendering: "The VideoCore stuff isn't supported on the Pi 4 or 400; there is now a fully open source OpenGL driver for the Pi 4/400 as well as the Pi 3. But, I had to custom compile SDL to enable it (the KMSDRM driver). If you don't do this, then on the Pi 4/400 Mesa (and by extension, SDL) will fall back to the llvmpipe software rendering pipeline..and that would certainly spike your CPU because it really will be copying textures around manually. Why Raspberry Pi OS still ships without KMSDRM enabled in SDL baffles me." A new commit should be coming out soon that includes ImGui 1.80 and some further speed optimizations. Looking forward to trying out the ImGui version of AppleWin and will post back if I encounter any problems. |
Good to hear. |
Apologies, haven't had a chance to compile the latest code on my build machine over the last few weeks. Attempting the following: Returns CMake Warning: Manually-specified variables were not used by the project: IMGUI_PATH I git cloned the imgui library into the imgui path within the AppleWin project directory. I assume if there were a path issue or something that cmake would complain a little more loudly. The only other error message cmake displays is "Bad LIBRETRO_COMMON_PATH=NONE, skipping libretro code." The inclusion of imgui sounds like a promising development, but I'm not sure how to test it. |
have you used the imgui branch? if the path is not found, you should get a warning |
have you used the imgui branch? The imgui version runs very well on the Pi4 (very responsive). Not sure why, but the Qt build seems to run slow, even though the CPU core isn't hitting 100%. Scaling on imgui didn't seem to noticeably affect performance. |
Try this https://github.com/audetto/AppleWin/tree/imgui3 It uses SDL2+OpenGLES2 which should be a better option. You need |
I tried the imgui3 branch, but don't see a performance difference (on Pi4). I believe that the ImGui note on "OpenGL2 being a non-ideal choice" is related to a couple of factors: I don't think there's a specific reason to opt for OpenGLES 2 over OpenGL 2 in ImGui if you use the newer style syntax to initialize your OpenGL2 rendering engine. By comparison, OpenGL ES support was more recently added, so the examples provide "modern" syntax. Hope this provides some clarity. It's how I understand the situation with ImGui. |
Let's use a separate Issue for ImGui related opinions: #22 |
Ability to use OpenGL2 or OpenGLES has been added
|
Qt app: quit from menu. Ctrl-Q
SDL2: F6 full screen and some command line options.