[kernel] Use optional ASM inb/out for fast I/O in ATA CF driver #2370

Merged
ghaerr merged 2 commits into master from insw on Jul 25, 2025

@ghaerr
Owner

ghaerr commented Jul 25, 2025

Uses GCC ASM extension for fast I/O in ATA CF driver.

Currently turned on using FASTIO=1 in ata.c. Tested on QEMU but not real hardware. ATA CF I/O should be as fast as possible now. IDE query code also updated to use insw() macro.

Getting the GCC ASM constraints correct took a little doing on this one, given ia16-elf-gcc doesn't seem to exactly match GCC documentation.

ghaerr merged commit 4648b8c into master on Jul 25, 2025 (1 check passed)
ghaerr deleted the insw branch on July 25, 2025 03:43
@toncho11
Contributor

Nice. Please switch it to ON by default. I will try to test it.

@ghaerr
Owner Author

ghaerr commented Jul 25, 2025 via email

@toncho11
Contributor

Ah, yes. Sorry.

@toncho11
Contributor

toncho11 commented Jul 26, 2025

Everything seems to work with the latest build except for the missing cfa* entries in /dev. Once I rebooted with root=cfa1 and disabled=hda, I was not able to mount any /dev/cf* because there are none. It is strange because at boot /dev/cf* seems to be present. The system boots normally and prints /dev/cfa1 for root, but when you look in /dev it is not there. The /dev/hda* entries are there and I have 43 entries in /dev. Maybe I reached the maximum for /dev? Maybe the /dev/hda* entries are not renamed to /dev/cfa* when disabled=hda?

@toncho11
Contributor

toncho11 commented Jul 26, 2025

I used a MINIX image this time, with xtide=3.

@ghaerr
Owner Author

ghaerr commented Jul 26, 2025

Let me check why there is not a /dev/cfa* on MINIX filesystems. Strange, because that's what I test with.

It could be that sys doesn't create it... did you use sys to populate the CF image? This could be the problem since sys would work with FAT images, since /dev is faked, but not with MINIX. I'll check but I bet that's it.

@ghaerr
Owner Author

ghaerr commented Jul 26, 2025

> It could be that sys doesn't create it

Yes - that's the problem, the new /dev/cf* entries had not been added to sys. I'll fix that with a PR I'm preparing now.

@toncho11
Contributor

So I made tests: copying the bin folder 2 times on a minix filesystem with time

cfa1: 4m16s and 4m17s
hda1: 1m50s and 1m56s

I restarted to avoid any caching.
The difference is big!
Is it possible that the XUB does 16 bit transfers, while we do 8 bit in XTCF mode?

@toncho11
Contributor

ChatGPT suggested these optimizations:

+----------------------------+-----------------------------------------------------------+
| Suggestion                 | Description & 8086 ASM Example                            |
+----------------------------+-----------------------------------------------------------+
| Use `REP INSB` for reads   | Efficient 512-byte read from data port:                   |
|                            |   MOV DX, BASE       ; BASE + 0 (data port)               |
|                            |   MOV CX, 512                                             |
|                            |   REP INSB           ; port → ES:DI                       |
+----------------------------+-----------------------------------------------------------+
| Use `REP OUTSB` for writes | Efficient 512-byte write to data port:                    |
|                            |   MOV DX, BASE       ; BASE + 0 (data port)               |
|                            |   MOV CX, 512                                             |
|                            |   REP OUTSB          ; DS:SI → port                       |
+----------------------------+-----------------------------------------------------------+
| Efficient DRQ polling      | Wait for DRQ (bit 3) to be set before transfer:           |
|                            |   MOV DX, BASE + 7   ; Status port                        |
|                            | WAIT_DRQ:                                                 |
|                            |   IN AL, DX                                               |
|                            |   TEST AL, 08h       ; Check DRQ                          |
|                            |   JZ WAIT_DRQ                                             |
+----------------------------+-----------------------------------------------------------+
| Inline register writes     | Set sector count and LBA quickly before issuing command:  |
|                            |   MOV DX, BASE + 4                                        |
|                            |   OUT DX, AL         ; Sector count                       |
|                            |   MOV DX, BASE + 6                                        |
|                            |   OUT DX, AL         ; Sector number                      |
|                            |   MOV DX, BASE + 8                                        |
|                            |   OUT DX, AL         ; Cylinder low                       |
|                            |   MOV DX, BASE + 10                                       |
|                            |   OUT DX, AL         ; Cylinder high                      |
|                            |   MOV DX, BASE + 12                                       |
|                            |   OUT DX, AL         ; Drive/head                         |
|                            |   MOV DX, BASE + 14                                       |
|                            |   OUT DX, AL         ; Command (e.g., 20h)                |
+----------------------------+-----------------------------------------------------------+

@ghaerr
Owner Author

ghaerr commented Jul 27, 2025

> cfa1: 4m16s and 4m17s
> hda1: 1m50s and 1m56s
>
> The difference is big!
> Is it possible that the XUB does 16 bit transfers, while we do 8 bit in XTCF mode?

Wow, the speed difference sure is big. Although XUB is extremely tightly coded assembly, we should be able to do better than twice as slow. It will be hard to replicate on my side since I can't actually emulate XUB, but I'll try to think up some ideas as to why.

> Is it possible that the XUB does 16 bit transfers, while we do 8 bit in XTCF mode?

Well, no. XUB definitely does 16-bit transfers for XTIDE cards, but XTCF is purposely built for 8-bit I/O, so I don't think that can be the reason. (It might be worth testing sometime with your XTIDE v2 card and xtide=2, both to exercise that portion of the ATA CF driver and to measure speed, since that I/O mechanism uses 16-bit transfers, which was the purpose of the "high-speed" XTIDE v2 mod.)

> Use REP INSB for reads
> Use REP OUTSB for writes

Both of these would be great, except that the INSB/OUTSB (and INSW/OUTSW) instructions are not present on the 8088. They were all added in the 80186.

We use a loop like the following for the recently-introduced fast I/O:

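    mov $512,%cx        /* byte count for one sector */
again:
    in (%dx),%al        /* read one byte from the data port */
    stosb               /* store AL at ES:DI and advance DI */
    loop again          /* decrement CX, repeat until zero */

That is, three instructions per 8-bit read. It could be sped up by unrolling the loop so that each iteration performs two reads, using something like:

    mov $256,%cx        /* two bytes per iteration, so half the count */
again:
    in (%dx),%al
    stosb
    in (%dx),%al
    stosb
    loop again

I'll try that. QEMU is so fast on my box that it's very hard to tell the difference in speed, but differences should show up when lots of data is being copied.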
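
For readers who want to experiment with the unrolling idea away from real hardware, here is a minimal portable C sketch. It is an illustration only, not the ELKS driver code: `port_read8_fn`, `read_sector_simple`, `read_sector_unrolled`, and `fake_data_port` are hypothetical names, and the function pointer stands in for the real port-input instruction. It shows that both loops deliver the same 512 bytes, while the unrolled one pays the loop overhead 256 times instead of 512.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512

/* Hypothetical stand-in for a port read; a real driver would issue IN. */
typedef uint8_t (*port_read8_fn)(uint16_t port);

/* One read per iteration: loop overhead is paid 512 times per sector. */
static size_t read_sector_simple(port_read8_fn in8, uint16_t port, uint8_t *buf)
{
    size_t iterations = 0;
    for (size_t i = 0; i < SECTOR_SIZE; i++) {
        buf[i] = in8(port);
        iterations++;
    }
    return iterations;
}

/* Two reads per iteration: same data, loop overhead paid 256 times. */
static size_t read_sector_unrolled(port_read8_fn in8, uint16_t port, uint8_t *buf)
{
    size_t iterations = 0;
    for (size_t i = 0; i < SECTOR_SIZE; i += 2) {
        buf[i]     = in8(port);
        buf[i + 1] = in8(port);
        iterations++;
    }
    return iterations;
}

/* Fake data port for experiments: returns 0, 1, 2, ... wrapping at 256. */
static uint8_t fake_counter;

static uint8_t fake_data_port(uint16_t port)
{
    (void)port;               /* the port number is ignored by the fake */
    return fake_counter++;
}
```

Whether the saved loop overhead is measurable on real hardware is a separate question that only timing on an actual 8088 can answer.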

> Efficient DRQ polling
>
> Inline register writes

While we don't use ASM for the polling, I don't think these are the issue, since they only occur once per sector read, whereas the I/O itself occurs 512x per sector. It seems to me that only the latter could cut the total throughput in half. I'll look into both though.

Thanks for your testing and suggestions.

@ghaerr
Owner Author

ghaerr commented Jul 27, 2025

> So I made tests: copying the bin folder 2 times on a minix filesystem with time

Was this test made reading from, or writing to, the CF card? This could make a big difference.

The current driver design waits up to 50ms after each CF read for the CF to become "not busy". This was part of the original design and could be slowing things down a lot. On writing, the CF driver waits for the card to signal not busy as well, since according to spec it could potentially take seconds for some sectors to write to flash. So this wait is not as easily removed.

It sounds like a potential redesign of the driver wait mechanism might make sense, at least on the read side, if your testing was CF reads only.

@toncho11
Contributor

It was: time cp /bin/* /root/bincopy. So both.

@toncho11
Contributor

📚 Intel Manual Confirmation (8086 Programmer's Reference Manual)
INSB — Input String Byte
Description: Input byte from port to string at ES:DI

Opcode: 6C

Available on: 8086, 8088, 80186, 80188

✅ Conclusion
✅ INSB and OUTSB are 100% valid and supported on the 8088, 8086, and 80188 — despite assembler syntax differences.

They are safe to use for your XT-IDE driver on real 8086 hardware.

@toncho11
Contributor

So it should be:

cld                  ; Clear direction flag (important for REP)
mov dx, BASE         ; BASE + 0 = data port
mov cx, 512
rep insb             ; Reads 512 bytes from DX into ES:DI

@toncho11
Contributor

ChatGPT rewrite:

#define insb(port, seg, offset, cnt)                  \
    __extension__ ({                                  \
        asm volatile (                                \
            "push %%es\n\t"                           \
            "mov %1, %%es\n\t"                        \
            "cld\n\t"                                 \
            "rep insb\n\t"                            \
            "pop %%es"                                \
            :                                         \
            : "d" (port), "r" (seg), "D" (offset), "c" (cnt) \
            : "memory"                                \
        );                                            \
    })

@ghaerr
Owner Author

ghaerr commented Jul 27, 2025

Are you using ChatGPT for "Intel Manual Confirmation (8086 Programmer's Reference Manual)"?? Did you actually read the manual, or are you still believing ChatGPT?

> ✅ Conclusion
> ✅ INSB and OUTSB are 100% valid and supported on the 8088, 8086, and 80188 — despite assembler syntax differences.
> They are safe to use for your XT-IDE driver on real 8086 hardware.

Sorry, but not true. Here's the original Intel application note on the 80186 showing the new instructions:
[Screenshot: Intel 80186 application note listing the new instructions]

Neither the emulators nor the disassemblers I've written for the 8086 include opcode 6C; it's part of a well-defined "undefined" section of opcodes on the 8086.

The speed issue is somewhere else. I wish it were as easy as just adding INSB/OUTSB.
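
To make the opcode point concrete, here is a small hedged C sketch. INSB, INSW, OUTSB and OUTSW occupy opcodes 0x6C through 0x6F, which arrived with the 80186, while the loop the driver uses relies on IN AL,DX (0xEC), STOSB (0xAA) and LOOP (0xE2), all of which exist on the 8086/8088. The helper names and the two byte arrays are hypothetical, and the check only inspects the opcode bytes it is handed; it is not a disassembler.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* INSB, INSW, OUTSB, OUTSW: opcodes 0x6C-0x6F, introduced with the 80186. */
static bool is_80186_string_io(uint8_t opcode)
{
    return opcode >= 0x6C && opcode <= 0x6F;
}

/* True if none of the given opcode bytes is an 80186-only string I/O opcode. */
static bool avoids_string_io(const uint8_t *opcodes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (is_80186_string_io(opcodes[i]))
            return false;
    }
    return true;
}

/* Opcode bytes only (operands omitted): IN AL,DX / STOSB / LOOP. */
static const uint8_t loop_8088[] = { 0xEC, 0xAA, 0xE2 };

/* REP prefix followed by INSB. */
static const uint8_t rep_insb[] = { 0xF3, 0x6C };
```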

@ghaerr
Owner Author

ghaerr commented Jul 27, 2025

> It was: time cp /bin/* /root/bincopy. So both.

In that case, we've likely got the worst of three worlds: possibly slow I/O using a single IN or OUT byte instruction per loop, a wait after each read, and a wait after each write.

Can you perform the test(s) FD/HD -> CF or CF -> FD/HD separately? That might help figure out where the issue(s) are. I suspect they are in all three areas. I can change the driver for speed, at the potential risk of the third issue above: not knowing when a write to the CF card fails. What do you think?

@toncho11
Contributor

toncho11 commented Jul 27, 2025

Sorry. I won't be able to test for a while.
ChatGPT first managed to persuade me that it was only a matter of early assembler support, and told me that this was the source of the confusion. It was fairly persuasive. A deeper check confirmed what you said. Yes, I use it to save time. It actually works well for many things.

@ghaerr
Owner Author

ghaerr commented Jul 27, 2025

> Yes, I use it to save time. It actually works well for many things.

No problem, I almost believed it for a moment, but had to set the record straight.

> I won't be able to test for a while.

I'm looking at the XUB source and am seeing ways this can be sped up, which I'll add. Thanks for your testing. I'll devise some ways to test over here and hopefully improve the speed.

@toncho11
Contributor

toncho11 commented Jul 27, 2025

I think we can start by adding a FASTCF (for fast CF cards) or ATACF (only for CF) flag that reduces the write-confirmation wait times in ata_wait to a few ms for CF cards. ChatGPT says that a write should typically take between 2 and 7 ms with the confirmation. So we can check every 3 ms.

Why not set the wait time dynamically? Maybe measure the response time at boot and adjust the wait. Or gather statistics on the first write and then adjust for all future writes.

@toncho11
Contributor

Also ELKS reported boot time is 2 times slower with cfa compared to hda.

@toncho11
Contributor

toncho11 commented Jul 28, 2025

I was thinking about the character that spins in the upper-right corner of the screen on all reads/writes. I think the tight loop in ata_wait() and the console might interfere. Also, it seems over-polling can degrade performance on some CF cards. We can:

  • disable this console output (the spinning /|\ character) while reading/writing and see if this helps
  • introduce a small delay in ata_wait
  • maybe even disable interrupts temporarily while I/O is performed to check for interference (with the console, for example)

@ghaerr
Owner Author

ghaerr commented Jul 28, 2025

> Also ELKS reported boot time is 2 times slower with cfa compared to hda.

Wow, that's a big difference. I'm looking into this. I don't think things like the spinning cursor have much to do with a 2x difference in speed between our ATA CF driver and XUB through our BIOS driver (the cursor spins during both anyway). However, our BIOS driver uses a cylinder track cache, which could be producing a big difference; it certainly does on floppies. Both that mechanism and XUB use multi-sector reads, which might also change the timing. For now, it's quite a lot of work to pull out the BIOS track caching. The ELKS block drivers will use 2-sector reads/writes on MINIX, and larger full-cylinder multi-sector reads with floppy track caching. We need to prove that track caching is the bottleneck before implementing it, though: the BIOS driver doesn't use track caching for hard drives, just 2-sector reads/writes.

So there's lots going on under the hood that could contribute to this. Setting the number of EXT buffers larger (or smaller) than 64k could also make a difference. Lots of tuning would need to be done on real hardware if it matters.

I've finally hacked up a version of PCem that will emulate both XTIDE v1 as well as (for the moment) XT CF, so that I can actually test w/o real hardware. BTW, the ATA CF XTIDE works, so no big rush to test your real hardware. This setup will allow me to test enhancements to the insb and outsb macros for XT CF to try to speed it up. I'm not really sure if that's a bottleneck or not.

I just tested a 500-sector dd with and without FASTIO and there was no difference at all.

> introduce a small delay in ata_wait

> Why not set the wait time dynamically?

We don't want more delays in waiting. Remember, the waiting is only performed when it has to be: while the device is signaling that it is not ready. We can't decrease that wait time, and the max wait times are set by the ATA spec. If your device is faster, then the waits will be shorter. There is an unneeded wait after reading a sector that will be removed, and I found that we're not actually checking for a write error in the wait after the write. Both will be fixed.
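
As a hedged illustration of that wait (not the actual ata_wait code), the sketch below polls a status byte until BSY clears and only then looks at ERR, so a faster device simply exits the loop sooner. The status bit values are the standard ATA ones; `wait_not_busy`, `status_read_fn`, the return codes, and the fake device are hypothetical names invented for this example.

```c
#include <stdint.h>

#define ATA_STATUS_ERR 0x01   /* error bit, valid only once BSY is clear */
#define ATA_STATUS_DRQ 0x08   /* data request */
#define ATA_STATUS_BSY 0x80   /* device busy */

/* Hypothetical stand-in for reading the ATA status register. */
typedef uint8_t (*status_read_fn)(void);

/*
 * Poll until BSY clears or max_polls reads have been made.
 * Returns 0 on success, -1 if the device reports ERR, -2 on timeout.
 * *polls receives the number of status reads performed.
 */
static int wait_not_busy(status_read_fn read_status, unsigned max_polls,
                         unsigned *polls)
{
    for (unsigned n = 1; n <= max_polls; n++) {
        uint8_t status = read_status();
        if (status & ATA_STATUS_BSY)
            continue;         /* other bits are not meaningful while busy */
        *polls = n;
        return (status & ATA_STATUS_ERR) ? -1 : 0;
    }
    *polls = max_polls;
    return -2;
}

/* Fake device: busy for fake_busy_polls reads, then fake_final_status. */
static unsigned fake_busy_polls;
static uint8_t fake_final_status;

static uint8_t fake_status(void)
{
    if (fake_busy_polls > 0) {
        fake_busy_polls--;
        return ATA_STATUS_BSY | ATA_STATUS_ERR;  /* ERR must be ignored here */
    }
    return fake_final_status;
}
```

With the fake device busy for three reads, the function returns on the fourth poll; a device that is never busy returns on the first.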

> ChatGPT says that a write should typically take between 2 and 7 ms with the confirmation. So we can check every 3 ms.

We check as fast as possible; waiting longer to check doesn't help any because there's nothing else for the system to do when locked in a read/write loop using dd, for example. The I/O is sector-by-sector, that's all the ATA interface will do, unless multi-sector I/O is implemented. We should probably add 2-sector I/O, so that MINIX blocks are processed quicker. I'm looking into that.
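
A rough sketch of the arithmetic behind 2-sector I/O, assuming 1024-byte MINIX blocks and 512-byte sectors: each block maps to two consecutive sectors, so issuing one 2-sector command per block halves the number of commands (and per-command setup and waits) compared with sector-by-sector I/O. The struct and function names are hypothetical, and partition offsets are deliberately ignored.

```c
#include <stdint.h>

#define SECTOR_SIZE       512u
#define MINIX_BLOCK_SIZE  1024u   /* assumption: 1K MINIX filesystem blocks */
#define SECTORS_PER_BLOCK (MINIX_BLOCK_SIZE / SECTOR_SIZE)

/* Hypothetical description of one ATA read/write command. */
struct ata_request {
    uint32_t lba;           /* first sector, relative to partition start */
    uint8_t  sector_count;  /* sectors transferred by this command */
};

/* Map a filesystem block number to a single 2-sector request. */
static struct ata_request block_request(uint32_t block)
{
    struct ata_request req;
    req.lba = block * SECTORS_PER_BLOCK;
    req.sector_count = (uint8_t)SECTORS_PER_BLOCK;
    return req;
}

/* Number of commands needed to transfer `blocks` blocks. */
static uint32_t commands_needed(uint32_t blocks, uint32_t sectors_per_command)
{
    uint32_t sectors = blocks * SECTORS_PER_BLOCK;
    return (sectors + sectors_per_command - 1) / sectors_per_command;
}
```

For 100 blocks this gives 200 single-sector commands versus 100 two-sector commands; the data transferred is identical, only the per-command overhead changes.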

@toncho11
Contributor

I see that there are a thousand aspects. So a write, read or wait can not be interrupted by the kernel for let's say console output? Still I think the console output might be slowing the whole IO. The CPU must execute it after all.

@ghaerr
Owner Author

ghaerr commented Jul 28, 2025

> So a write, read or wait can not be interrupted by the kernel for let's say console output?

Yes, the I/O wait is interrupted by the kernel via the hardware interrupt, which occurs every 10ms. Every 80ms, the spinner is updated (that's just under 1/10 second). Very little time, really. Remember, our BIOS driver going through XUB does the same thing (updating the spinner), so the spinner isn't different between the two, but there's apparently a 2x speed difference in booting anyway.

> Still I think the console output might be slowing the whole IO.

Yes, it might slow it down by 1/10 second overall. No way does it slow down from 9 seconds to 18. Can you remind me what the boot timing numbers were for ATA CF vs XUB? I can compare them to the PCem emulator which is currently booting from ATA CF in 6.64 secs.

@toncho11
Contributor

On my Amstrad 1640 I got 3.5 seconds and 7 seconds for cfa and hda.

I was suggesting to avoid over-polling to prevent CF card bus saturation by adding a small delay (small nop loop) between the bsy register checks. I think it is worth checking.

@toncho11
Contributor

toncho11 commented Jul 28, 2025

Do not blame me too much. I am too tired this evening :). ChatGPT is suggesting this:

static int ATPROC ata_wait_drq(void)
{
    unsigned long timeout = jiffies + 10;  // ~100ms timeout
    unsigned char status;

    while (!time_after(jiffies, timeout)) {
        status = inb(ATA_REG_STATUS);

        if (status & ATA_STATUS_ERR)
            return -EIO;

        if (!(status & ATA_STATUS_BSY) && (status & ATA_STATUS_DRQ))
            return 0;

        // Minimal delay to reduce bus saturation
        inb(0x80);  // I/O delay; harmless dummy read
    }

    return -ETIMEDOUT;
}

Benefits:

  • Less aggressive polling avoids CF card slowdown.
  • Exits early on errors (e.g., no DRQ set).
  • Compatible with multi-process context (avoids blocking as long).
  • Follows the same polling rhythm that XUB uses (per-sector).

// Send READ or WRITE command first...
outb(ATA_CMD_READ, ATA_REG_CMD);

// Wait for DRQ, with timeout and early error check
if (ata_wait_drq() < 0)
    return -EIO;

// Now perform the entire 512-byte transfer using looped INs or OUTs

@toncho11
Contributor

toncho11 commented Jul 28, 2025

The BIOS might be able to better handle both console and int 13h. It is aware of both.

@toncho11
Contributor

Based on some indirect evidence I think the read is more problematic. When I used the dd command to write to the floppy, I noticed that it was slower than on DOS. If the floppy speed is the same, then it must be the cfa read.

@ghaerr
Owner Author

ghaerr commented Jul 28, 2025

> I was suggesting to avoid over-polling to prevent CF card bus saturation by adding a small delay (small nop loop) between the bsy register checks. I think it is worth checking.

None of my emulators will emulate bus saturation. If you really think this would make a difference, go ahead and add it and test on your hardware. I haven't read anything about a delay required when polling the status register, but who knows?

I can also add it if you want to test again without compiling.

> Do not blame me too much. I am too tired this evening :). ChatGPT is suggesting this:

The ATA CF driver is essentially the same as what ChatGPT is suggesting (with all the other listed benefits), with the exception of the bus delay. If you can have ChatGPT dig up a link as to how excessive polling slows down CF cards, I'd definitely like to read about that.

> On my Amstrad 1640 I got 3.5 seconds and 7 seconds for cfa and hda.

You mean that hda is 3.5 and cfa is 2x as slow - 7?

@toncho11
Contributor

> You mean that hda is 3.5 and cfa is 2x as slow - 7?

Yes.

@toncho11
Contributor

toncho11 commented Jul 28, 2025

> The ATA CF driver is essentially the same as what ChatGPT is suggesting (with all the other listed benefits), with the exception of the bus delay. If you can have ChatGPT dig up a link as to how excessive polling slows down CF cards, I'd definitely like to read about that.

I see that an error is reported only after BSY is cleared, so the above code is not even valid. I thought that if we could detect an error early, we could save the time spent waiting for BSY to clear. But no. Anyway, this would only matter if there were many errors.

@ghaerr
Owner Author

ghaerr commented Jul 28, 2025

> You mean that hda is 3.5 and cfa is 2x as slow - 7?

A (somewhat crude) idea would be to video the boot screen from HDA, then from CFA. Since one is 3.5 seconds slower than the other, we might be able to tell what is taking all the extra time literally from watching the video. That's quite a bit of time difference and one would think we should be able to notice something...

I'll keep playing with PCem to see if I can get anywhere near 2x changes, but so far nothing.

@ghaerr
Owner Author

ghaerr commented Jul 28, 2025

> but so far nothing.

Holy heck! PCem is duplicating the speed issue, 2.25 seconds with root=hda1, 6.64 seconds with root=cfa1. XUB is being emulated as well through BIOS hda1. It seems the boot is approximately the same until the filesystem is mounted via VFS. Then, our cfa1 driver is taking much longer than hda1 via XUB. I should be able to figure out somehow where the problem is.

