
[kernel] Implement compressed executables in ELKS #912

Merged 10 commits into ghaerr:master from the compress branch on Apr 6, 2021

Conversation

ghaerr (Owner) commented Apr 5, 2021

This PR implements the ability to execute significantly compressed binaries directly from disk or ROM.
It implements the request for executable compression discussed in #226, using the third design discussed in #908.

The compression mechanism allows the kernel to quickly decompress arbitrary text, fartext and data sections in executables while retaining backwards compatibility with the a.out header format, so that chmem can still be used to set heap and stack sizes after linking.

The kernel "cost" of decompressing a binary is only 16 extra bytes per application: decompression is performed "backwards in-place", so it fits within the decompressed memory size plus a 16-byte safety margin. The added kernel code for decompressing binaries is less than 2K bytes.

It seems the average per-binary savings is on the order of 31-36%. If this were applied to all binaries on an ELKS 1.44M distribution floppy disk, disk access times for program startup should decrease by about 30%, while also freeing up another 500+ KB of space per floppy.

As an example, /bin/sh has been reduced from 50,960 bytes to 33,239, a 35% reduction in size.

This PR will have several commits added over the next few days, to bring in the changes incrementally.
This first commit adds changes required for the kernel. The next commit will add the elks-compress utility which compresses any ELKS binary.

Almost everything seems to be working well, except that, for a reason I can't yet identify, nano (the large fartext editor application) won't yet run. Stay tuned!

ghaerr requested a review from tkchia April 5, 2021 04:30
ghaerr (Owner) commented Apr 5, 2021

Last commit fixes medium model binaries; nano now runs compressed. The nano binary shrank from 139,480 to 90,487 bytes, a 35% reduction. All three sections (text, fartext and data) are compressed.

ghaerr (Owner) commented Apr 6, 2021

The last commit adds the ability to compress many files at once: cd target; elks-compress bin/*.
By default, elks-compress compresses the input file. Using elks-compress -z infile will create infile.z for testing.

This PR is almost ready to merge. A separate PR will be published with all of the Exomizer source in elks/tools/elks-compress/exomizer. This is being done to keep those many files out of this PR, for ease of review.

After committing these PRs, creating a floppy with all files compressed is easily done for testing:

$ make
$ cd image
$ make compress     <- runs elks-compress on the elks/target/bin filesystem
$ make copyminix    <- (or copyfat) makes a MINIX or FAT distribution floppy fd1440.bin of compressed files

For those not wanting to build anything, attached is a bootable floppy with all executables compressed. The kernel boots and runs all executables with no noticeable delay (on QEMU). The uncompressed 1.44M floppy had 24k free; the compressed version has 335k available!

fd1440.bin.zip

ghaerr merged commit cc9f2ab into ghaerr:master Apr 6, 2021
ghaerr mentioned this pull request Apr 6, 2021
Mellvik (Contributor) commented Apr 7, 2021

@ghaerr,
while the rest of us (or at least some, in this part of the world) took a break for Easter, you not only specified but implemented and tested compressed binaries.
Great work – being tested as we speak on real hardware, without a hitch so far and with minimal delay as far as I can tell.

More testing and comparison in progress - using your 'floppy distro'.

Thank you very much!!

--Mellvik

ghaerr (Owner) commented Apr 7, 2021

Hello @Mellvik,

More testing and comparison in progress - using your 'floppy distro'.

Glad to hear the compressed binary distribution seems to be working flawlessly on real hardware! I am particularly interested in whether there is a noticeable difference from reading (or not reading, actually) binaries from the filesystem. vi would be a good example to time, but remember that both the floppy disk cache and the filesystem buffer cache mean a true test can be performed only once after boot; afterwards, additional buffering will be present.

while the rest of us (or at least some, in this part of the world) took a break for Easter, you not only specified but implemented and tested compressed binaries.

Actually it was your comment at #902 (comment) that informed me of the existence of Exomizer, which pointed me to @tkchia's prior research into it and prompted the thought that such a scheme could now be implemented relatively quickly. And here we are!

Thank you!

ghaerr (Owner) commented Apr 7, 2021

Hello @Mellvik,

I've attached a zip file containing the final nano and nano.z (compressed version), both of which should run. These large medium-model executables (142k and 92k) should let you more easily see whether compressed executables help with floppy load times. There's also an optional nano.rc that shows some of the features (except color) that can be used; it needs to be renamed to ".nanorc" and placed in the /root (home) directory.

nano.zip

I plan on uploading all of the nano source soon.

Thank you!

Mellvik (Contributor) commented Apr 8, 2021

Thank you @ghaerr,

Seems to me to be a perfect setup for testing just that.

BTW - your .nanorc was already installed in /root on the image, which confused me a little to begin with. Possibly just as well, because when I then ran diff on .nanorc and nano.rc (the latter being the one from your nano.zip file), I got this:

# diff .nanorc nano.rc
6c6
< set regexp 
---
> # set regexp 
sys_brk(22) fail: brk 2a70 over by 1020 bytes
sys_brk(22) fail: brk 2872 over by 510 bytes
diff: out of memory
# 

… which I guess should not happen. diff(1) has been running fine for quite some time as I recall.

And BTW I'm not running low on memory:

# meminfo
  HEAP   TYPE  SIZE    SEG   TYPE    SIZE  CNT
  d7ee   SEG     16   223c   free      64    0
  d80a   SEG     16   3240   CSEG    6848    1
  d826   SEG     16   2240   BUF    65536    1
  d842   TTY     80
  d89e   SEG     16   33ec   DSEG    8816    1
  d8ba   SEG     16   3613   DSEG   12112    1
  d8d6   free    16
  d8f2   SEG     16   3a61   CSEG    6592    1
  d90e   SEG     16   3bfd   DSEG    6480    1
  d92a   SEG     16   3d92   CSEG    4576    1
  d946   SEG     16   94f5   free   45232    0
  d962   SEG     16   507f   DSEG   20592    1
  d97e   TTY     80
  d9da   TTY    104
  da4e   TTY   1024
  de5a   SEG     16   40a1   CSEG   46016    1
  de76   SEG     16   4bdd   free   18976    0
  de92   SEG     16   5586   free  259824    0
  deae   SEG     16   3eb0   free    7952    0
  deca   SEG     16   3908   free    5520    0
  dee6   free  8474
  Heap/free   10270/ 8490 Total mem  515136
  Memory usage  503KB total,  173KB used,  330KB free
# ps
  PID   GRP  TTY USER STAT CSEG DSEG  HEAP   FREE   SIZE COMMAND
    1     0      root    S 3240 33ec  3072   2010  15664 /bin/init 3 
    8     8    1 root    S 3a61 3bfd     0   1980  13072 /bin/getty /dev/tty1 
    9     9   S0 root    S 40a1 507f  1166   8866  66608 -/bin/sh 
   14     9   S0 root    R 3d92 3613  1024   6306  18608 ps 
# 

Back to nano:

nano starts fine if run w/o arguments. I can write a few lines of text and we're ok.
If run with an existing file (such as ./nano nano.rc), it runs out of memory during startup.
If run with a new file, it starts, but runs out of memory after just a few words of writing:

This is the tsys_brk(23) fail: brk f146 over by 444 bytes
sys_brk(23) fail: brk f14a over by 448 bytes
sys_brk(23) fail: brk f14c over by 450 bytes
ime  ll good men 
to   

The output from ./nano nano.rc goes like this:


Error in /root/.nanorc on line 86: Command "syntax" not understood

Error in /root/.nanorc on line 89: Command "color" not understood

Error in /root/.nanorc on line 91: Command "color" not understood

Error in /root/.nanorc on line 110: Command "color" not understood

Press Enter to continue starting nano.

sys_brk(24) fail: brk f2fe over by 882 bytes
sys_brk(24) fail: brk f302 over by 886 bytes
sys_brk(24) fail: brk f102 over by 374 bytes
nano is out of memory!# 

As to startup speed, it is my estimate that while the uncompressed version takes about 10 seconds to load (this is the Compaq 386/20), the compressed version takes 20. So there seems to be a significant speed disadvantage for the compressed version, even when run from floppies. This is a disappointment, and confirms the feeling I've been getting after testing the system for a while - when running other commands. Still, I haven't booted the 'old' floppy image to get the 'feel' back in sync yet. Will do that.

--Mellvik

ghaerr (Owner) commented Apr 8, 2021

Hello @Mellvik,

Nice testing, thank you for your full report!

I think you are pretty much correct on all counts. First, the sad reality of nano 2.0.6 seems to be that it runs out of memory all the time. This has nothing to do with how much system memory you have left; rather, ELKS medium-model programs are still limited to 64K of data, and nano seems to be using almost all of it. I saw this happen immediately after I added color support (which is now turned off); color also requires regex support, which apparently uses tons of data memory.
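
As a contrived illustration of that limit (not nano's actual failure path): in an ELKS medium-model program, data + bss + heap + stack all share a single 64K segment, so a simple allocation loop exhausts the heap long before system memory runs out:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t total = 0;

    /* Each allocation comes out of the one 64K data segment,
     * no matter how much system memory is free. */
    while (malloc(1024) != NULL)
        total += 1024;
    printf("heap exhausted after %u KB\n", (unsigned)(total / 1024));
    return 0;   /* on ELKS medium model this prints well under 64 KB */
}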

The .nanorc file in the /root directory was a mistake; I've removed it so that users will add it manually for the time being. I'm still trying to find the settings that work without running out of memory, which will then be used to create a default nano.rc.

Given these issues, and the work put into the complete nano port PR #915, I think I will still commit it, but may end up having to revert later to nano v1.0.6, which is a few years older (1998 instead of 2000) and doesn't have any of the memory problems. It is still important to test the build across various machines using v2.0.6.

As to startup speed, it is my estimate that while the uncompressed version takes about 10 seconds to load (this is the Compaq 386/20), the compressed version takes 20. So there seems to be a significant speed disadvantage for the compressed version, even when run from floppies.

Dang, this is a big disappointment!! I am pretty sure this means that a large (~112k) decompression takes 10 seconds on an old system. The good news is that most executables are vastly smaller, so the decompression time is much smaller.

This is a disappointment, and confirms the feeling I've been getting after testing the system for a while - when running other commands.

I see. The other factor is that ELKS now has full-track disk caching, so in many cases the floppy read time is very small: one doesn't really notice the disk read at all, since the data has already been read in during the track read.

At this point, we probably need more full-system testing of all the binaries to see where to set the tradeoff between the disk-space gain from compression (~35%) and the added wait time for decompressing larger executables. Unfortunately, these two benefits seem to be in direct opposition to each other.

By default, all executables (except nano, in the new PR) remain uncompressed. elks-compress can be used to compress just a few binaries should we want to get into finer-grained testing to determine what a future distribution should look like.

Thank you!

Mellvik (Contributor) commented Apr 8, 2021 via email

Mellvik (Contributor) commented Apr 9, 2021

More benchmarking:

It is really hard to get exact numbers (it may actually be an idea to implement the time command in sash), but here's a surprise:
Startup time for vi.z is significantly shorter than for vi.plain (compressed vs uncompressed) when run off a FAT filesystem (HD, real hardware). I guess that says quite a bit about the performance of the FAT filesystem - but still, here's a setting where binary compression is actually beneficial speed-wise. It may be worthwhile to get some real numbers to tag onto this experience.

FWIW: using the time command and ensuring the binary is buffered before the run gives an average (5 runs) startup time of:

  • vi.z - 2.95s
  • vi.plain - 4.85s

BTW - I'm using vi as the test object because nano is too big and I need networking active for my setup to work in a convenient way.

-M

ghaerr deleted the compress branch April 9, 2021 16:34
ghaerr (Owner) commented Apr 9, 2021

Hello @Mellvik,

Thanks for the continued benchmarking. So it's clear that there is significant interaction between disk track caching (which doesn't help FAT filesystem reads much) and executable compression: the compressed binary loads faster because the space saved by compression outweighs the cost of the non-cached disk reads. There will also be significant savings for ROM filesystems.

I am still a bit surprised by the "nano" timings on MINIX filesystems being so much slower with compression and track caching... did you ever try running the same tests using vi instead?

I've tested nano v2.0.6 a bit more: sadly, it seems that version can't even edit its own .rc file; its data segment usage is just too big. I'm going to add nano v1.0.6 to the outstanding PR, and possibly remove 2.0.6, since it adds tons of files that aren't really useful.

I'm using vi as the test object because nano is too big and I need networking active for my setup to work in a convenient way.

Yes, any version of nano will likely be too big for much real use. I still would like to proceed with adding at least v1.0.6 so we have a medium model binary to continue testing tools with.

Thank you!

Mellvik (Contributor) commented Apr 12, 2021

@ghaerr -
a final run of benchmarking compressed vs uncompressed ELKS binaries, per your suggestion above.

This time off a MINIX floppy image only, still using vi as the test object. The results are somewhat better than in the previous (FAT-based) test, but confirm the general impression: compression increases load time by about 40%. OTOH, there is the (almost obvious) observation (last two lines, 2nd number) that the smaller compressed image has a significant advantage when the binary is (mostly) buffered.

Like before, the test command was # time /bin/vi.{z,plain} < /dev/null. Alternating compressed and uncompressed runs ensured no buffer effects, except for the 'buffered' numbers, which come from running the same command twice.

[Screenshot: benchmark timing table for vi.z vs vi.plain, 2021-04-12]

My setup does not have a MINIX fs on the hard drive at the moment; I may take another run using the same test later, not so much to test compression as to compare fs performance.

-M

ghaerr (Owner) commented Apr 12, 2021

Hello @Mellvik,

Thank you for the testing.

There are a number of factors that will affect any test results. The following configuration options may greatly change the timings:

CONFIG_FS_NR_EXT_BUFFERS=64  (The number of 1K RAM buffers used. This is AFTER reading from disk/track cache).
CONFIG_TRACK_CACHE=y       (Turns on disk drive track caching. Reads remaining track of requested disk sector).
CONFIG_EXEC_COMPRESS=y     (Turns on executable decompression).

Now for the details:

  • The vi binary is 69,008 bytes. This means that with CONFIG_FS_NR_EXT_BUFFERS set to less than 68 (68 * 1024 = 69,632), the system will never fully buffer vi in kernel memory. Since there is additional directory and FAT/inode reading required, setting this value to 96 should remove this factor from the equation (see the .config sketch after this list).
  • With CONFIG_TRACK_CACHE on, the system will read full tracks, starting from the requested sector on every track, unless the requested sector is already in the track cache. This means that, depending on exactly how the vi executable is laid out on the media, there will likely be extra sectors read, skewing the timing. On brand-new FAT or MINIX volumes this is minimized, as all files are written in (almost) consecutive sector order. But on existing HD images that are greatly fragmented, this option could in fact slow down system operation, and will certainly skew timing. Thus there should be separate testing with this option OFF.
  • The CONFIG_TRACK_CACHE option also will operate quite differently between FAT and MINIX filesystems, due to the disk layout. I think we determined at the time of implementation that it didn't help FAT nearly as much as MINIX filesystem read times.
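
For a controlled re-run, the relevant .config fragment might look like this (buffer count raised per the first bullet so vi fits fully in the buffer cache; a second pass would turn CONFIG_TRACK_CACHE off):

CONFIG_FS_NR_EXT_BUFFERS=96
CONFIG_TRACK_CACHE=y
CONFIG_EXEC_COMPRESS=y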

Having said all this, it seems that vi.z may be taking ~3+ seconds to decompress, while vi.plain takes ~5 seconds to load (likely due to not enough system buffers). I am a bit surprised it still takes 5 seconds to load vi.plain, and wonder whether the combined operation of track caching and system buffering is still not optimized, as I would think we could get that down to < 1 second total for fully buffered operations. Figuring out these reasons will help determine where the kernel is spending most of its time. It is possible that even copying memory is quite slow, for instance.

To see more clearly (with a stopwatch, lol) the difference between disk loading time and decompression time, perhaps recompile the kernel with the following lines (marked +) inserted in elks/fs/exodecr.c:

size_t decompress(char *buf, seg_t seg, size_t orig_size, size_t compr_size, int safety)
{
    char *in = buf + compr_size;
    char *out = buf + orig_size + safety;

    debug("decompress: seg %x orig %u compr %u safety %d\n",
        seg, orig_size, compr_size, safety);
    segp = seg;
+printk("Start decompression\n");
    out = exo_decrunch(in, out);
    if (out - buf != safety)
    {
        debug("EXEC: decompress error\n");
        return 0;
    }
    fmemcpyb(buf, seg, out, seg, orig_size);
+printk("Stop decompression");
    return orig_size;
}
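
Alternatively, assuming the kernel's tick counter is accessible here as jiffies (ELKS follows the Linux convention, though the exact header and printk %lu support are assumptions), the stopwatch could be skipped by printing elapsed ticks around the call:

/* Sketch only: assumes `jiffies` (kernel tick counter) is visible in
 * this file and that printk handles %lu. */
    unsigned long start = jiffies;
    out = exo_decrunch(in, out);
    printk("decompression took %lu ticks\n", jiffies - start);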

I am going to guess that the decompression time for vi.z is ~3 seconds; it would be nice to know. It's a bit tricky, since the compressed sizes also decrease both the track caching and system buffering requirements. So this whole effort does depend on proper system tuning, which may need a bit more work!

Thank you!

Mellvik (Contributor) commented Apr 13, 2021 via email

Mellvik (Contributor) commented Apr 13, 2021 via email

ghaerr (Owner) commented Apr 13, 2021

Thanks @Mellvik,

I agree with all your conclusions.

The decompressor is called, with different memory addresses, separately for each section (.text and .data) of an executable. So that's correct.

And still no real clue as to why loading a 69k file from (1.44M) floppy takes 5 seconds :-)

My guess is that you're looking at the overhead of a whole track read for a 69k file, since CONFIG_TRACK_CACHE is on.

Mellvik (Contributor) commented Apr 15, 2021

Just for the heck of it I disabled TRACK_CACHE and ran the tests again.

The results are as expected and as reported by the fdtest utility we created while working on floppy speedup last year: track caching improves file load times by 400%.

So, without the cache, vi loads in approx. 16.5 seconds regardless of compression (on average, compressed binaries load slightly faster).

--Mellvik
