Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emulator Crashes #116

Closed
Vutshi opened this issue May 18, 2024 · 22 comments
Closed

Emulator Crashes #116

Vutshi opened this issue May 18, 2024 · 22 comments
Assignees
Labels
bug Something isn't working good first issue Good for newcomers question Further information is requested

Comments

@Vutshi
Copy link

Vutshi commented May 18, 2024

Describe the bug
I run a program in ELKS and martypc crashes with the following output:

Loaded keyboard mapping file: ./configs/keyboard_layouts/keyboard_US.toml
thread 'main' panicked at core/src/cpu_808x/cycle.rs:632:26:
index out of bounds: the len is 2 but the index is 2
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at /Users/x2241064/.cargo/registry/src/index.crates.io-6f17d22bba15001f/winit-0.29.15/src/platform_impl/macos/app_state.rs:387:33:
called `Result::unwrap()` on an `Err` value: PoisonError { .. }

To Reproduce
Steps to reproduce the behavior:

  1. Boot the attached ELKS image
  2. login: root
  3. Execute mand_el4
  4. Following the rendering of the Mandelbrot set, martypc will crash

Expected behavior
Do not crash.

Environment (please complete the following information):

  • macOS 13.6.6 (22G630)
  • 2,3 GHz Quad-Core Intel Core i7
  • Intel Iris Plus Graphics
  • Rust 1.78.0

Build info

Disk image
fd360-fat-ELKS.img.zip

@Vutshi Vutshi added the bug Something isn't working label May 18, 2024
@dbalsom
Copy link
Owner

dbalsom commented May 18, 2024

thanks. does it crash with the 0.2.0 release build?

@dbalsom dbalsom added good first issue Good for newcomers question Further information is requested labels May 18, 2024
@Vutshi
Copy link
Author

Vutshi commented May 18, 2024

Hi @dbalsom,

thanks. does it crash with the 0.2.0 release build?

Yes, it does.

@dbalsom
Copy link
Owner

dbalsom commented May 22, 2024

Looks to be a bug with 0xFE - I execute 3 cycles but only provide 2 microcode addresses. The instruction in question is an undefined form, which is a bit unusual. My 8088 tests don't exercise the undefined 0xFE forms it so they didn't catch it.

This raises another point, that doing the bounds checking for the microcode line number slice every time I call cycles_i() is expensive, and the microcode logging was never fully accurate, so I may just tear that whole feature out.

@ghaerr
Copy link

ghaerr commented May 22, 2024

Hello @dbalsom,

Looks to be a bug with 0xFE
The instruction in question is an undefined form, which is a bit unusual.

May I ask exactly what is the instruction sequence being referred to? Are we talking about an undefined 8086 instruction starting with 0xFE, where the following byte modRMReg bits indicate (byte operand size and 2 <= modRMReg <= 6) or modRMReg == 7? (That is 0xFE-based CALL/JMP/PUSH with byte operand or regmod == 7).

@Vutshi, I am wondering whether this was reproduced using unmodified NASM output, or whether this is the CS segment having been scribbled upon and creating an invalid opcode... ?

Thank you!

@Vutshi
Copy link
Author

Vutshi commented May 22, 2024

@Vutshi, I am wondering whether this was reproduced using unmodified NASM output, or whether this is the CS segment having been scribbled upon and creating an invalid opcode... ?

This is minimally modified NASM code for DOS which barely runs in ELKS. I guess it writes over some parts of ELKS and creates invalid opcode.

Best

@dbalsom
Copy link
Owner

dbalsom commented May 22, 2024

Hello @dbalsom,

Looks to be a bug with 0xFE
The instruction in question is an undefined form, which is a bit unusual.

May I ask exactly what is the instruction sequence being referred to? Are we talking about an undefined 8086 instruction starting with 0xFE, where the following byte modRMReg bits indicate (byte operand size and 2 <= modRMReg <= 6) or modRMReg == 7? (That is 0xFE-based CALL/JMP/PUSH with byte operand or regmod == 7).

@Vutshi, I am wondering whether this was reproduced using unmodified NASM output, or whether this is the CS segment having been scribbled upon and creating an invalid opcode... ?

Thank you!

Sorry, the opcode is actually FF. The same bug is present in both FE and FF forms, and the instruction byte sequence is FF FF. The bug is present in both FE and FF forms, probably due to copy-paste. In any case both bugs involve the invalid form of CALL FAR with a register operand (mod=3,r/m=7) However REG is 7 which should be PUSH so I am not sure how I am winding up there. The disassembly viewer can be inaccurate if there is self modifying code, but IRET should clear the queue. I guess I have more debugging to do.

This value is actually from video memory at B800:000A. Here is where things go off the rails:

image

For some reason the trap flag is on and the trap handler iret returns us to a bad segment which wraps into VRAM.

@dbalsom
Copy link
Owner

dbalsom commented May 22, 2024

Ah, I figured it out. the disassembler does not resolve MMIO addresses (I suppose I never assumed one would try to execute code from VRAM). A peek at the memory viewer which does resolve MMIO shows the true instruction sequence being executed: FF DD.

image

Mystery solved there I suppose.

@ghaerr
Copy link

ghaerr commented May 22, 2024

the instruction byte sequence is FF FF.

This value is actually from video memory at B800:000A

Thanks for the clarification and screenshot. Since the IP is in video RAM that pretty much confirms a major problem. I would guess that since the code seems to be executing IRETs, a stack/data overwrite could be the cause. The application program is running as a first-port under real mode ELKS which doesn't allow setting interrupt vectors, and a hardware timer interrupt during an application interrupt handling could cause this sort of thing, as ELKS may have switched stacks, etc.

What I do find interesting is that special handling of 0xFFFF (like 0x00) is probably a good idea for emulators since those two values seem highly probable when a program takes a wrong turn... I've made a note to do the same for some of my emulators.

@dbalsom
Copy link
Owner

dbalsom commented May 22, 2024

What I do find interesting is that special handling of 0xFFFF (like 0x00) is probably a good idea for emulators since those two values seem highly probable when a program takes a wrong turn... I've made a note to do the same for some of my emulators.

I do have an 'off rails detection' option, that triggers on 5+ repeated opcodes, which was very useful for development as I was off the rails often. It does have some false positives in rare code paths, but you could bump that up to 10 and probably be fine.

@ghaerr
Copy link

ghaerr commented May 22, 2024

shows the true instruction sequence being executed: FF DD.

Do you suppose the emulator ran through the 0xFFFF's then died on 0xFFDD? In any case, I've noted that invalid 0xFF-starting instruction sequences might want to trap for my emulators.

I do have an 'off rails detection' option, that triggers on 5+ repeated opcodes

That's a good idea :)

@dbalsom
Copy link
Owner

dbalsom commented May 22, 2024

Do you suppose the emulator ran through the 0xFFFF's then died on 0xFFDD? In any case, I've noted that 0xFF-starting instruction sequences might want to trap for my emulators.

yes FF FF itself is valid, PUSH DI.

@dbalsom
Copy link
Owner

dbalsom commented May 22, 2024

That's another good idea :)

you can either have a list of opcodes to trigger on repeat (0x00 and 0xFF would catch a lot) or have exceptions for things like NOP, as NOP can be repeated a lot for timing purposes. Also probably avoid IN and OUT as IO was used for timing things, like adlib ports.

I have encountered a few titles (demos and such) that count on 0x00 0x00 being a fairly innocuous ADD instruction so they can roll through blocks of initialized but otherwise empty memory...

@ghaerr
Copy link

ghaerr commented May 22, 2024

you can either have a list of opcodes to trigger on repeat (0x00 and 0xFF would catch a lot) or have exceptions for things like NOP, as NOP can be repeated a lot for timing purposes. Also probably avoid IN and OUT as IO was used for timing things, like adlib ports.

Very good ideas. I'll take a look at how this might be added to my Blink16 and 86Sim projects. You've done a great job on MartyPC especially being cycle-exact, the GUI looks super also :)

@dbalsom
Copy link
Owner

dbalsom commented May 22, 2024

Very good ideas. I'll take a look at how this might be added to my Blink16 and 86Sim projects. You've done a great job on MartyPC especially being cycle-exact, the GUI looks super also :)

Thanks, I appreciate the kind words. As an emulator author I don't suppose you've seen my 8088 tests?
https://github.com/singleStepTests/8088

This issue is good motivation to add tests for the illegal FE and FF forms :D

@ghaerr
Copy link

ghaerr commented May 22, 2024

I don't suppose you've seen my 8088 tests?

Wow, you've really gone to town on that!! Thank you, I'l take a closer look at them.

This issue is good motivation to add tests for the illegal FE and FF forms

Yes, my emulators (but not disassembler) currently ignore many invalid opcodes but now I realize that misses an opportunity to flag to the application developer something they likely don't know about. Of course, a runtime flag to allow the program to keep running with undefined behavior anyways is probably also needed, as that's what real hardware does in real mode!

@dbalsom
Copy link
Owner

dbalsom commented May 22, 2024

Yes, my emulators (but not disassembler) currently ignore many invalid opcodes but now I realize that misses an opportunity to flag to the application developer something they likely don't know about. Of course, a runtime flag to allow the program to keep running with undefined behavior anyways is probably also needed, as that's what real hardware does in real mode!

Yeah, an 8088 is unstoppable, you can literally feed it random bytes (as long as you filter out HLT) and it will keep going...

I am working on a new test set for the NEC V20. It is not quite as resilient as the 8088 - a few undefined forms will make it unhappy and it enters a halt-like state (with no HALT bus condition)

@ghaerr
Copy link

ghaerr commented May 22, 2024

Interesting on the NEC V20. Is that chip the same internal Intel-licensed silicon design, or did NEC come up with a new implementation and only implement the chip ISA?

@dbalsom
Copy link
Owner

dbalsom commented May 22, 2024

Interesting on the NEC V20. Is that chip the same internal Intel-licensed silicon design, or did NEC come up with a new implementation and only implement the chip ISA?

NEC originally cloned the 8088 and 8086, got sued by Intel over it, and settled the lawsuit by signing an agreement to become an official second-source provider. Everything was fine until they got restless and extended the 8088/86 with 186 opcodes, new NEC-specific opcodes and 8080 support with the V20/V30 and got sued a second time. They won that lawsuit and would go on to make a bunch more V-series CPUs, none of which were quite as notable.

One of the reasons they won the second lawsuit is they proved that they had a clean-room approach to writing the V20 microcode and didn't simply copy Intel's. What I find hilarious is that the ruling also said Intel lost their copyright to the microcode because they didn't stamp a (C) symbol on all of their chips. You used to have to register copyright for it to be in effect...

@dbalsom
Copy link
Owner

dbalsom commented May 22, 2024

This weekend I will pull the 5150 back out and see if this mandelbrot works on hardware. If MartyPC is correctly emulating things, bugs aside, I would expect the mandelbrot program to hang when it gets stuck in VRAM.

it's still possible I have some bug in handling the trap flag. That is one of the areas I have done the least testing as I cannot turn the trap flag on in my arduino validator.

I guess another good question is - is the trap flag supposed to be on? If not, it's possible we went off the rails earlier and did a POPF from garbage.

@ghaerr
Copy link

ghaerr commented May 22, 2024

What I find hilarious is that the ruling also said Intel lost their copyright to the microcode because they didn't stamp a (C) symbol on all of their chips.

No wonder lawyers have to be cautious when one can lose IP over not having copyright stamped on just documents surrounding the work, but having to stamp the work of art itself. Now I'm wondering whether the replacement click-to-agree license agreements are valid for associated hardware. I guess I'll have to check my smart toaster for possible non-compliance.

it's still possible I have some bug in handling the trap flag.

IIRC, there's an IBM application note out on TF for the XT vs AT PCs when the trap handling changed, as well as for exactly what happens when PUSH SP is executed.

@dbalsom
Copy link
Owner

dbalsom commented May 22, 2024

IIRC, there's an IBM application note out on TF for the XT vs AT PCs when the trap handling changed, as well as for exactly what happens when PUSH SP is executed.

Yeah 286 changed quite a few things. Division exceptions are handled differently as well

@dbalsom
Copy link
Owner

dbalsom commented Jun 9, 2024

fixed in 8d48c74

@dbalsom dbalsom closed this as completed Jun 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants