[kernel] Implement compressed executables in ELKS #912
Conversation
Last commit fixes medium model binaries.
The last commit adds the ability to compress many files at once. This PR is almost ready to merge. A separate PR will be published with all … After committing these PRs, creating a floppy with all files compressed is easily done for testing:
For those not wanting to build anything, attached is a bootable floppy with all executables compressed. The kernel boots and runs all executables with no noticeable delay (on QEMU). The uncompressed 1.44M floppy had 24k free; the compressed version has 335k available!
@ghaerr, More testing and comparison in progress - using your 'floppy distro'. Thank you very much!! --Mellvik
Hello @Mellvik,
Glad to hear the compressed binary distribution seems to be working flawlessly on real hardware! I am particularly interested in whether there is a noticeable difference from reading (or not reading, actually) binaries from the filesystem.
Actually, it was your comment in #902 (comment) that informed me of the existence of Exomizer, which then pointed me to @tkchia's prior research into it; that is what prompted the thought that such a scheme could now be implemented relatively quickly. And here we are! Thank you!
Hello @Mellvik, I've attached a zip file containing the final nano and nano.z (compressed version), which should both run. These large medium-model executables (142k and 92k) should allow you to more easily see whether compressed executables help with floppy load times. There's also an optional nano.rc that shows some of the features (except color) that can be used; it needs to be renamed to ".nanorc" and placed in the /root (home) directory. I plan on uploading all of the nano source soon. Thank you!
Thank you @ghaerr. This seems to me to be a perfect setup for testing just that. BTW - your .nanorc came already installed in /root on the image, so that confused me a little to begin with. Possibly a good thing, because when I then ran
… which I guess should not happen. diff(1) has been running fine for quite some time as I recall. And BTW I'm not running low on memory:
Back to nano:
The output from
As to startup speed, my estimate is that while the uncompressed version takes about 10 seconds to load (this is the Compaq 386/20), the compressed version takes 20. So there seems to be a significant speed disadvantage for the compressed version, even when run from floppies. This is a disappointment, and it confirms the feeling I've been getting after testing the system for a while when running other commands. Still, I haven't booted the 'old' floppy image to get the 'feel' back in sync yet. Will do that. --Mellvik
Hello @Mellvik, Nice testing, thank you for your full report! I think you are pretty much correct on all counts. First, the sad reality of … The .nanorc file in the /root directory was a mistake; I've removed it so that users will manually add it for the time being. I'm still trying to find the settings that will work without running out of memory, and those will be used to create a default nano.rc. Given these issues, and the work put into the complete nano port PR #915, I think I will still commit it, but may end up having to later revert to nano v1.0.6, which is a few years earlier (1998 instead of 2000). That earlier version doesn't have any of the memory problems. It is still important to test the build capabilities of v2.0.6 across various machines.
Dang, this is a big disappointment!! I am pretty sure this means that a large (~112k) decompression takes 10 seconds on an old system. The good news is that most executables are vastly smaller, so the decompression time is much shorter.
I see. The other factor is that ELKS now has full-track disk caching, so in many cases the floppy read time is very small. So one doesn't really notice the disk read time at all, since it's already been read in during the track read. At this point, we probably will need more full-system testing of all the binaries to see where to set the tradeoff between the disk-space gain from allowing more executables (35%) and the added wait time for decompressing larger executables. Unfortunately, these two benefits seem to be in direct opposition to each other. By default, all executables (except nano, in the new PR) remain uncompressed. Thank you!
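To make the track-caching point concrete, here is a minimal conceptual sketch of the idea (not the actual ELKS floppy driver; the cache layout and the floppy_read() helper are hypothetical):

#include <string.h>

#define SECTOR_SIZE        512
#define SECTORS_PER_TRACK  18          /* 1.44M floppy geometry, sectors numbered 1..18 */

/* hypothetical low-level driver call: read 'count' sectors starting at 'sec' */
extern void floppy_read(int cyl, int head, int sec, int count, char *dest);

static struct {
    int valid, cyl, head, start;                    /* first cached sector of the track */
    char data[SECTORS_PER_TRACK * SECTOR_SIZE];
} cache;

/* return one sector, refilling the cache with the rest of the track on a miss */
static void read_sector_cached(int cyl, int head, int sec, char *buf)
{
    if (!cache.valid || cache.cyl != cyl || cache.head != head || sec < cache.start) {
        /* miss: read from the requested sector through the end of the track */
        floppy_read(cyl, head, sec, SECTORS_PER_TRACK - sec + 1, cache.data);
        cache.valid = 1;
        cache.cyl = cyl;
        cache.head = head;
        cache.start = sec;
    }
    memcpy(buf, cache.data + (sec - cache.start) * SECTOR_SIZE, SECTOR_SIZE);
}

The point is that once any sector of a track has been touched, the rest of that track is usually already in memory, so the remaining reads of an executable cost very little extra disk time.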
Thank you, @ghaerr.
> This is a disappointment, and it confirms the feeling I've been getting after testing the system for a while when running other commands.
> I see. The other factor is that ELKS now has full-track disk caching, so in many cases the floppy read time is very small. So one doesn't really notice the disk read time at all, since it's already been read in during the track read.
That makes sense.
> At this point, we probably will need more full-system testing of all the binaries to see where to set the tradeoff between the disk-space gain from allowing more executables (35%) and the added wait time for decompressing larger executables. Unfortunately, these two benefits seem to be in direct opposition to each other.
It may be worthwhile looking into doing the decompression in assembler. If I remember correctly, there is a lot of looping and bit stuffing (roughly like the sketch below) that lends itself well to 'optimization by hand'. Admittedly a project for the list rather than for the short term.
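For reference, the kind of inner loop involved typically looks something like this (a generic, hypothetical bit-reader sketch, not Exomizer's actual code). It is short, branchy and executed once per bit, which is exactly the sort of routine that tends to reward hand-written 8086 assembly:

/* Generic MSB-first bit reader over a byte stream consumed backwards.
   Names and layout are illustrative only. */
static unsigned char *inp;      /* input pointer, moving toward lower addresses */
static unsigned int bitbuf;     /* current byte being consumed */
static int bitcount;            /* number of unread bits left in bitbuf */

static int read_bit(void)
{
    if (bitcount == 0) {        /* refill from the next (lower) input byte */
        bitbuf = *--inp;
        bitcount = 8;
    }
    bitcount--;
    return (int)((bitbuf >> bitcount) & 1);
}

/* read an n-bit value, most significant bit first */
static unsigned int read_bits(int n)
{
    unsigned int v = 0;
    while (n-- > 0)
        v = (v << 1) | (unsigned int)read_bit();
    return v;
}

In assembly the same work collapses to a handful of shift/rotate instructions kept in registers, which is where the hand-optimization gain would come from.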
Anyway - having compressed binaries available adds value & flexibility to ELKS.
—Mellvik
More benchmarking: it is really hard to get exact numbers (it may actually be an idea to implement the …). FWIW: Using the …
BTW - I'm using … -M
Hello @Mellvik, Thanks for the continued benchmarking. So it's clear that there is significant interaction between disk track caching (which doesn't help FAT filesystem reads much) and executable compression, which runs faster because of the time saved by reading less (non-cached) data from disk. There will also be significant savings for ROM filesystems. I am still a bit surprised by the "nano" timings on MINIX filesystems being so much slower with compression and track caching... did you ever try running the same tests using vi instead? I've tested nano v2.0.6 a bit more: sadly, it seems that version can't even edit its own .rc file; its use of the data segment is just too big.
Yes, any version of nano will likely be too big for much real use. I still would like to proceed with adding at least v1.0.6 so we have a medium-model binary to continue testing tools with. Thank you!
@ghaerr - This time off of a minix floppy image only, and still using … Like before, the test command was … My setup does not have a minix fs on the hard drive at the moment; I may take another run using the same test later, not so much to test compression as to compare fs performance. -M
Hello @Mellvik, Thank you for the testing. There are a number of factors that will affect any test results. The following configuration options may greatly change the timings:
CONFIG_FS_NR_EXT_BUFFERS=64 (the number of 1K RAM buffers used; this is AFTER reading from the disk/track cache)
CONFIG_TRACK_CACHE=y (turns on disk drive track caching; reads the remaining track from the requested disk sector)
CONFIG_EXEC_COMPRESS=y (turns on executable decompression)
Now for the details:
The vi binary is 69008 bytes. This means that with CONFIG_FS_NR_EXT_BUFFERS set to less than 68 (68 * 1024 = 69632), the system will never fully buffer the vi read in kernel memory. Since there is additional directory and FAT/inode reading required, setting this value to 96 should remove this factor from the equation.
With CONFIG_TRACK_CACHE on, the system will read full tracks, starting from the requested sector on every track, unless the requested sector is already in the track cache. This means that, depending on exactly how the vi executable is written on the media, there will likely be extra sectors read, skewing the timing. On brand-new FAT or MINIX volumes this is minimized, as all files are written in (almost) consecutive sector order. But on existing HD images that are greatly fragmented, this option could in fact slow down system operation, and it will certainly skew timing. Thus there should be separate testing with this option OFF.
The CONFIG_TRACK_CACHE option will also operate quite differently between FAT and MINIX filesystems, due to the disk layout. I think we determined at the time of implementation that it didn't help FAT nearly as much as MINIX filesystem read times.
Having said all this, it seems that vi.z may be taking ~3+ seconds to decompress, while vi.plain is taking ~5 seconds to load (likely due to not enough system buffers). I am a bit surprised it is still taking 5 seconds to load vi.plain, and wonder whether the combined operation of track caching and system buffering is still not optimized, as I would think we could get that down to < 1 second total for fully buffered operations. Figuring out these reasons will help determine where the kernel is spending most of its time. It is possible that even copying memory is quite slow, for instance.
To see a bit more clearly (with stopwatch, lol) the difference between the disk loading times and the decompression times, perhaps recompile the kernel with the following lines inserted in elks/fs/exodecr.c:

size_t decompress(char *buf, seg_t seg, size_t orig_size, size_t compr_size, int safety)
{
    char *in = buf + compr_size;
    char *out = buf + orig_size + safety;

    debug("decompress: seg %x orig %u compr %u safety %d\n",
          seg, orig_size, compr_size, safety);
    segp = seg;
+   printk("Start decompression\n");
    out = exo_decrunch(in, out);
    if (out - buf != safety)
    {
        debug("EXEC: decompress error\n");
        return 0;
    }
    fmemcpyb(buf, seg, out, seg, orig_size);
+   printk("Stop decompression\n");
    return orig_size;
}

I am going to guess that the decompression time for vi.z is ~3 seconds. It would be nice to know. It's a bit tricky since the compressed sizes also decrease both the track caching and system buffering requirements. So this whole effort does depend on proper system tuning, which may need a bit more work! Thank you!
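As an alternative to timing the two printk lines with a stopwatch, the same function could print elapsed clock ticks directly. This is only a sketch: it assumes the kernel's jiffies tick counter (an unsigned long advancing at HZ, typically 100 ticks per second in ELKS) is visible in exodecr.c, and it otherwise reuses the declarations already present there:

size_t decompress(char *buf, seg_t seg, size_t orig_size, size_t compr_size, int safety)
{
    char *in = buf + compr_size;
    char *out = buf + orig_size + safety;
    unsigned long start;                       /* tick count before decompression */

    segp = seg;
    start = jiffies;                           /* assumes jiffies is in scope here */
    out = exo_decrunch(in, out);
    printk("decompress: %d ticks\n", (int)(jiffies - start));   /* ~10 ms per tick */
    if (out - buf != safety)
    {
        debug("EXEC: decompress error\n");
        return 0;
    }
    fmemcpyb(buf, seg, out, seg, orig_size);
    return orig_size;
}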
Thank you @ghaerr -
expanding the scope to cover more than just the viability of compressed binaries makes further benchmarking all the more meaningful.
I'll continue this later this week.
A sidenote: The numbers reported earlier (coming from time()) do not convey the actual duration of the wait (which is significantly longer).
BTW - having verified that decompression of vi takes (say) 3 secs, we have a metric if we dive into rewriting key parts of the decompression algorithm in assembler.
I'll create a minix partition on the hard drive of this machine in order to expand the base for the testing. (I need to take a look inside the machine too, I've forgotten whether the HD is physical or CF lol).
—Mellvik
Hi @ghaerr,
Some early numbers per your email last night (all numbers from the minix floppy image, like before).
Just like you guessed, there is a significant random factor in the measurements, depending on the placement of the actual files on the floppy/track. The new numbers show - in one case - as little as 3%, in a different case as much as 65%.
Decompress time is just above 3 secs as expected (stopwatch timing :-)). And as you suggested, since the compressed image fits entirely in the buffers, the time measured by time() is almost entirely decompression, so actually 3.5 secs is a reasonable metric for decompression of vi.
After increasing external buffers to 96, the load time for plain vi, now entirely in the buffer, is 0.35 secs.
Possibly interesting - the decompressor is called twice: the first call doing the real work, the second being very short (example below).
CONFIG_TRACK_CACHE has been on all the time.
Example:
# cd /
# /mnt/elks/zbin/ls
Start decomp
Stop decomp
Start decomp
Stop decomp
bin bootopts dev etc home lib linux
mnt root tmp var
#
For now, we have a reasonable idea of the actual decompression time.
And still no real clue as to why loading a 69k file from a (1.44M) floppy takes 5 seconds :-)
Maybe we need to do some call tracing in the fs code?
—Mellvik
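One low-tech way to start such call tracing would be a simple counter printed from the filesystem's block-read path, in the same spirit as the Start/Stop decompression printk lines. This is only a hedged sketch with hypothetical names; the real ELKS fs entry points may differ:

/* hypothetical instrumentation: count block reads during an exec */
static int bread_count;

/* call this at the top of the fs block-read routine (name is illustrative) */
static void trace_block_read(int block)
{
    bread_count++;
    printk("bread #%d: block %d\n", bread_count, block);
}

Comparing the counts (and the spacing of the console messages) for vi.plain against vi.z would show whether the 5 seconds go into disk reads or somewhere else in the kernel.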
Thanks @Mellvik, I agree with all your conclusions. The decompressor is called, with different memory addresses, separately for each section (.text and .data) of an executable. So that's correct.
My guess is that you're looking at the overhead of a whole track read for a 69k file, since CONFIG_TRACK_CACHE is on.
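Schematically, the exec path makes one decompress() call per compressed section, along the lines of the sketch below. This is a simplified, hypothetical illustration reusing the decompress() signature shown earlier, not the actual elks/fs/exec.c code:

#include <stddef.h>

typedef unsigned short seg_t;    /* 8086 segment value, as assumed for this sketch */

size_t decompress(char *buf, seg_t seg, size_t orig_size, size_t compr_size, int safety);

/* expand the two compressed sections of an a.out image */
static int expand_sections(char *text_buf, seg_t text_seg,
                           size_t text_orig, size_t text_compr,
                           char *data_buf, seg_t data_seg,
                           size_t data_orig, size_t data_compr, int safety)
{
    /* first call: the .text (code) section, usually the large one */
    if (decompress(text_buf, text_seg, text_orig, text_compr, safety) == 0)
        return -1;

    /* second call: the .data section, typically much smaller, which is why
       the second Start/Stop pair in the console output appears instantaneous */
    if (decompress(data_buf, data_seg, data_orig, data_compr, safety) == 0)
        return -1;

    return 0;
}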
Just for the heck of it I disabled TRACK_CACHE and ran the tests again. The results are as expected and as reported by the … So, without the cache, … --Mellvik
This PR implements the ability to execute significantly compressed binaries directly from disk or ROM.
Implements request for compression of executables discussed in #226.
Uses third design discussed in #908.
The compression mechanism allows the kernel to quickly decompress arbitrary text, fartext and data sections in executables while retaining backwards compatibility with the a.out header format, so that chmem can still be used to set heap and stack sizes after linking. The kernel "cost" of decompressing a binary is only 16 bytes extra per application, as the decompression is done "backwards in-place", which allows the decompression to be performed within the decompressed memory size plus 16 bytes. The added kernel code for decompressing binaries is less than 2K bytes.
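To illustrate what "backwards in-place" means here, consider the toy program below. It is only a conceptual sketch (the real decompressor is Exomizer's, not this byte copier): the compressed image sits at the start of the buffer, both the input and output pointers start at the high ends of their regions and move downward, and the safety margin guarantees that the output never overtakes unread input.

#include <stdio.h>
#include <string.h>

#define SAFETY 16                          /* per-binary overhead, as in the PR */

/* toy "decompressor": every input byte expands to itself, but it reads and
   writes strictly from high addresses toward low ones, like exo_decrunch() */
static char *toy_decrunch(char *in, char *out, size_t orig_size)
{
    while (orig_size-- > 0)
        *--out = *--in;
    return out;                            /* first byte of the expanded image */
}

int main(void)
{
    const char msg[] = "hello, ELKS";
    size_t orig_size = sizeof msg;         /* pretend compressed == original size */
    size_t compr_size = sizeof msg;
    char buf[sizeof msg + SAFETY];
    char *in, *out;

    memcpy(buf, msg, compr_size);          /* "compressed" image loaded at buf[0] */

    in = buf + compr_size;                 /* one past the compressed data */
    out = buf + orig_size + SAFETY;        /* one past the output area */
    out = toy_decrunch(in, out, orig_size);

    /* out now equals buf + SAFETY, mirroring the kernel's (out - buf == safety)
       check; slide the result down to the start, as fmemcpyb() does in-kernel */
    memmove(buf, out, orig_size);
    printf("%s\n", buf);
    return 0;
}

In the real case compr_size is smaller than orig_size, so the whole operation fits within the final memory footprint plus the 16-byte margin, which is where the "16 bytes extra per application" figure comes from.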
It seems the average per-binary savings is on the order of 31-36%. If this were applied to all binaries on an ELKS 1.44M distribution floppy disk, disk access times for program startup should decrease by about 30%, while also making another 500+ KB of free space available per floppy.
As an example, /bin/sh has been reduced from 50,960 bytes to 33,239, a 35% reduction in size.
This PR will have several commits added over the next few days, to bring in the changes incrementally.
This first commit adds the changes required for the kernel. The next commit will add the elks-compress utility, which compresses any ELKS binary. Almost everything seems to be working well, except that, for a reason I can't yet identify, nano (the large fartext editor application) won't yet run. Stay tuned!