ktcp status #610
Thanks for testing! However, bad news. It used to work, I'm almost sure Marc was using it years ago, but that could have been slip only. A couple questions: I assume you're using it on a network card only? What is the startup command line exactly? Which source version are you running, the version before or after Christoph's mods? @pawosm-arm has tested using slip, and said it works at 300 baud, except for incoming connections to telnetd. In summary - we need to get to a working version, be it 300 baud slip or ethernet, before or after the waiting PR. If needed, we could go way back to an earlier kernel and/or ktcp version, as it will be much easier to debug from something working than apparently where it is now. And I will have to use an emulator, of which I don't have slip or NE2000 running yet. Ugh! |
Indeed, I managed to establish telnet connection from ELKS to Linux over SLIP (with |
Good news.
The symptoms were just too suspicious - like you said @ghaerr, it used to work.
A few hours later, via DOS and misc packet drivers - different machine and different NE2K - and we're sort of operational.
Or at least, we have two way traffic. Very unstable - 'select' runaway errors and hangs, so this will take some time. I'm using a commit from 24 hrs ago, the '-s 4800' is on the command line, but doesn't seem to affect anything.
netstat works:
# netstat
Retransmition memory : 0 bytes
Number of control blocks : 3
no State RTT lport raddress rport
-----------------------------------------------------
1 ESTABLISHED 4000ms 1024 0.0.0.0 2
2 LISTEN 4000ms 80 0.0.0.0 0
3 LISTEN 4000ms 23 0.0.0.0 0
Ping works 23 times just after reboot, then silence (seems like ktcp hangs).
Incoming (to ELKS) telnet connects, but doesn't spawn a login or shell.
Incoming lynx is successful - even after the telnet, which has to be terminated at the source.
New netstat with lynx active:
# netstat
Retransmition memory : 0 bytes
Number of control blocks : 4
no State RTT lport raddress rport
-----------------------------------------------------
1 ESTABLISHED 4000ms 1024 0.0.0.0 2
2 CLOSE_WAIT 4000ms 23 10.0.2.1 -32394
3 LISTEN 4000ms 80 0.0.0.0 0
4 LISTEN 4000ms 23 0.0.0.0 0
Next command (nslookup) causes hang (after reporting 'Nameserver queried: 203……).
Testing continues - hints at specific things to test appreciated.
—Mellvik
… On 4 May 2020 at 17:09, Gregory Haerr ***@***.***> wrote:
Thanks for testing!
However, bad news. It used to work, I'm almost sure Marc was using it years ago, but that could have been slip only.
A couple questions: I assume you're using it on a network card only? What is the startup command line exactly? Which source version are you running, the version before or after Christoph's mods?
@pawosm-arm <https://github.com/pawosm-arm> has tested using slip, and said it works at 300 baud, except for incoming connections to telnetd.
In summary - we need to get to a working version, be it 300 baud slip or ethernet, before or after the waiting PR. If needed, we could go way back to an earlier kernel and/or ktcp version, as it will be much easier to debug from something working than apparently where it is now. And I will have to use an emulator, of which I don't have slip or NE2000 running yet. Ugh!
|
You're breathing life into it Helge!! Things sound a bit promising, since it's working for at least a little while. I'm a bit curious as to your switching your DOS packet driver and now ELKS works better, untouched. So there could be issues on the DOS end also? I'd like to see screenshots of the select printks, etc. With regards to the older 24-hour commit, that's probably ok. The -s option that was added only affects slip operation, the other code wasn't changed, just cleanup. However - telnetd.c WAS changed, a one-liner at line 28, where /bin/login is execed, rather than /bin/sh. You can change that back and recompile telnetd to see if that changes anything and you get a shell prompt. It wasn't working for Paul. If you're running with full ktcp debugging, perhaps a screenshot or two when it's hung... however I can't duplicate anything over here yet on QEMU. |
Things sound a bit promising, since it's working for at least a little while. I'm a bit curious as to your switching your DOS packet driver and now ELKS works better, untouched. So there could be issues on the DOS end also?
I figured there had to be an interrupt problem, so I booted DOS and couldn't get packets in, only out. So - like I mentioned, I switched machines and Ethernet cards. Turned out there was indeed an interrupt problem with the first NE2K card (soft configurable, and I didn't find the tool to change it), so I dug out an older card...
I'd like to see screenshots of the select printks, etc.
With regards to the older 24-hour commit, that's probably ok. The -s option that was added only affects slip operation, the other code wasn't changed, just cleanup.
Commit is af736b6.
However - telnetd.c WAS changed, a one-liner at line 28, where /bin/login is execed, rather than /bin/sh. You can change that back and recompile telnetd to see if that changes anything and you get a shell prompt. It wasn't working for Paul.
If you're running with full ktcp debugging, perhaps a screenshot or two when it's hung... however I can't duplicate anything over here yet on QEMU.
Coming!
…-M
|
Ok - so you're running ktcp directly from the /etc/rc.d/rc.sys startup script. That's fine. However, I added the untested "-b" (run in background) option, but looks like the error messages are working, so we're probably ok there. It used to be run from the shell with "&", but that closes output file descriptors so I added the option to make things cleaner. The 'select: Bad file number' is not a kernel printk. It is coming from the select() call in arp.c. After @cjsthompson's first commit to his outstanding cleanup PR, we realized that the "static int tcpdevfd" was in error in multiple source files, and was cleaned up (that is, all occurrences combined). See discussion in #507 for details. Since there is only one file number in that select(), and that is tcpdevfd, which is static and = 0, that's the problem for sure. As a result, I suggest adding Christoph's PR #607, and recompiling all of ktcp, I think it will fix that error. Interestingly, it's possible that my -b option broke the incorrectly coded arp select() option - that is, when run async from the shell with "&", tcpdevfd would be the first file opened, and thus have the value of 0 - which would have matched an uninitialized static variable!! We're making progress! Thank you! |
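The failure mode described above can be reproduced outside ELKS. The sketch below is not ktcp code: `poll_fd_once` and `make_closed_fd` are hypothetical helper names, and a POSIX system is assumed. It only shows that select() on a descriptor number that is not open fails with EBADF, the error behind the "Bad file number" message. A file-local `static int tcpdevfd` that is never assigned in that file stays 0, so the same error would be expected whenever descriptor 0 is not open.

```c
#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

/* Poll one descriptor for readability with a zero timeout.
 * Returns 0 on success, or the errno value that select() reported. */
int poll_fd_once(int fd)
{
    fd_set rfds;
    struct timeval tv = { 0, 0 };

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
        return errno;
    return 0;
}

/* Returns a descriptor number that was just closed (or -1 on error),
 * standing in for a stale or never-assigned fd variable. */
int make_closed_fd(void)
{
    int p[2];

    if (pipe(p) < 0)
        return -1;
    close(p[1]);
    close(p[0]);
    return p[0];
}
```

Calling `poll_fd_once(make_closed_fd())` should return EBADF, while the same call on an open pipe end returns 0.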
I got QEMU to run with the network card finally, and turned on CONFIG_ETH in .config, a couple things to report: The "static int tcpdevfd" is definitely the problem with the "select Bad file number", all static int declarations of it need "static" removed, or use PR #607. Also, it seems that the -b option I added some time back may be affecting operation. For the time being, I changed some lines in rootfs_template/etc/rc.d/rc.sys to remove it and run it async from the shell:
This seems to allow |
Ok - so you're running ktcp directly from the /etc/rc.d/rc.sys startup script. That's fine.
Yes and no. During testing, I've started the processes manually most of the time, then killing and restarting as required. Having httpd and telnetd active prevents elvis from running (out of memory). [I really need to figure out the elks ed editor, I expected it to be like unix ed. It's not.]
However, I added the untested "-b" (run in background) option, but looks like the error messages are working, so we're probably ok there. It used to be run from the shell with "&", but that closes output file descriptors so I added the option to make things cleaner.
This does not seem to make any difference, i.e. the -b option seems fine.
The 'select: Bad file number' is not a kernel printk. It is coming from the select() call in arp.c. After @cjsthompson <https://github.com/cjsthompson>'s first commit to his outstanding cleanup PR, we realized that the "static int tcpdevfd" was in error in multiple source files, and was cleaned up (that is, all occurrences combined). See discussion in #507 <#507> for details. Since there is only one file number in that select(), and that is tcpdevfd, which is static and = 0, that's the problem for sure.
As a result, I suggest adding Christoph's PR #607 <#607>, and recompiling all of ktcp, I think it will fix that error.
Thanks - I'll pull it in and check.
Interestingly, it's possible that my -b option broke the incorrectly coded arp select() option - that is, when run async from the shell with "&", tcpdevfd would be the first file opened, and thus have the value of 0 - which would have matched an uninitialized static variable!!
We're making progress!
—Mellvik
|
Are you saying that running httpd and telnetd STOP elvis from out of memory? It should be the other way around.
I'm seeing definite differences in the debug output of ktcp and netstat, so it might be best to always run ktcp from the command line for our initial testing - FYI. |
Having httpd and telnetd active prevents elvis from running (out of memory).
Are you saying that running httpd and telnetd STOP elvis from out of memory? It should be the other way around.
No - consider the parenthesized content a possible explanation, not a continuation of the sentence.
vi seems to be the application that uses the most memory at this time. Back when configurable L2 buffers was added, the default was increased from 64K to 128k of buffers. In further heavy ELKS use with all gettys running and multiple logins, I have seen ELKS run out of memory when trying to run vi. Probably not worth the tradeoff of 128k buffers when vi won't run, and now networking is active. I'm thinking that decreasing the default L2 buffers back to somewhere between 64-96k would be a good idea so vi always runs.
This does not seem to make any difference, i.e. the -b option seems fine.
I'm seeing definite differences in the debug output of ktcp and netstat, so it might be best to always run ktcp from the command line for our initial testing - FYI.
Noted.
…-M
|
More testing today:
- added the changes from #607, removing the static declarations etc. Now the system hangs on any and all network accesses.
There is an additional complication that may affect the situation:
- Some time over the last week a change has badly affected the XT keyboard driver. The key LEDs (capslock, numlock etc.) no longer work, and the keypad sends wrong characters.
- In the same timeframe a change in the serial driver (presumably) has made serial access non-functional: output is 'eating' most of the data, less than half of the bytes sent get to the terminal window. Input seems ok but it's hard to tell when output is incomplete.
More testing on this tomorrow - @ghaerr, if you have any immediate guesses, I'd appreciate it.
This probably belongs in its own thread.
…--Mellvik
|
Ok, a bit of a mess, when combined with the XT kbd and serial driver problems (please comment on those in #612). Since #607 is not committed, we'll keep that on hold. Please do a 'git checkout -f master' and recompile from scratch, and we'll start working on each of the three problems separately. I will submit a small PR that only changes the 'static int tcpdevfd' fix that is required for the arp.c "select bad file descriptor" problem. That will allow you to continue ktcp testing. See #612 for serial and kbd fixes. |
Helge, here's a minimal patch to fix the "select" and "static int tcpdevfd" errors without yet using the full PR #607. This works on my system using the latest git HEAD. This patch also starts I've reviewed #607 and can't see why it causes system hangs, so I'd like to see whether this does before I test 607 on my own machine. |
Here's a serial style screen dump from a minimal tcp test, which also tests /bootopts for the first time. This is indeed a great improvement for debugging. printk needs some CRs to improve readability, otherwise great.
BTW - the halt at this point is hard, not soft - need power recycling. --M |
Continuation from the previous comment: Anyway, this is not bad at all.
On the client:
Using netcat on the client side I at one point managed to login and get a shell prompt. It stopped (nethang, not system hang) after that (ps reporting the correct terminal).
|
What terminal emulator are you using to read the serial console? I'm running on macOS Terminal, which doesn't have this problem. Perhaps an option can be turned on that converts LFs to CRLFs. If this is not easily possible, then I can look into always converting LF -> CRLF for printk's in the kernel. Let me know so this can be fixed ASAP for you. |
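The LF to CRLF conversion mentioned above could look roughly like the following. This is a stand-alone sketch, not the ELKS printk path: `lf_to_crlf` is a hypothetical name, and it assumes the input contains bare LFs only (an existing "\r\n" would come out as "\r\r\n").

```c
#include <stddef.h>

/* Copy src into dst, emitting "\r\n" for every '\n'.
 * The output is always NUL-terminated when dstsize > 0 and is truncated
 * (never in the middle of a CR/LF pair) if dst is too small.
 * Returns the number of bytes written, not counting the NUL. */
size_t lf_to_crlf(const char *src, char *dst, size_t dstsize)
{
    size_t out = 0;

    if (dstsize == 0)
        return 0;
    while (*src != '\0') {
        size_t need = (*src == '\n') ? 2 : 1;

        if (out + need >= dstsize)      /* keep room for the NUL */
            break;
        if (*src == '\n')
            dst[out++] = '\r';
        dst[out++] = *src++;
    }
    dst[out] = '\0';
    return out;
}
```

Doing this on the ELKS side would make the console output independent of the terminal chain (here macOS Terminal, ssh, and screen on the Pi); the alternative is to enable the equivalent output mapping in whichever of those layers supports it.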
First, this is using the latest commits, including #614 (basic fixes to enable network testing), correct? With ongoing PING, sounds like there is nasty memory corruption in either the /dev/eth driver or ktcp. We will need to find out which by running ktcp with slip at some point soon.
Sounds like networking code is still buggy in other places.
That's exciting - that's the first time we've seen telnetd working on ethernet. I'm pretty certain the ^M echoing is on your Pi side, check the stty options there.
Can you get logged in, or does the system crash afterwards? Suggest looking at ps to see if there's any memory left just before entering the password, the problem may be no more system memory. Run meminfo as well to see. |
printk needs some CRs to improve readability, otherwise great.
What terminal emulator are you using to read the serial console? I'm running on macOS Terminal, which doesn't have this problem. Perhaps an option can be turned on that converts LFs to CRLFs.
If this is not easily possible, then I can look into always converting LF -> CRLF for printk's in the kernel. Let me know so this can be fixed ASAP for you.
I'm running MacOS terminal too, ssh to raspberrypi -> screen(1) to serial.
Will take a look under the hood and follow up.
…-M
|
First, this is using the latest commits, including #614 <#614> (basic fixes to enable network testing), correct?
Yes.
With ongoing PING, sounds like there is nasty memory corruption in either the /dev/eth driver or ktcp. We will need to find out which by running ktcp with slip at some point soon.
Seems the ping failure is entirely repeatable. Bombs after 20 pings.
Rebooting w/o the ongoing PING, and instead telnetting into elks, leaves us with the following. It's not consistent in that it doesn't stop at the same place every time, and sometimes just loops in retransmits.
Sounds like networking code is still buggy in other places.
The elks telnetd responds with login: but continues after a few seconds (doesn't wait for newline) and execs /bin/login with whatever was typed, and emits 'Password:', where it actually waits. The tty (pty) echoes control characters like '^M' etc.
That's exciting - that's the first time we've seen telnetd working on ethernet.
I'm pretty certain the ^M echoing is on your Pi side, check the stty options there.
Quite possible. It's different with NC, now - nc (as I'm using it here) is line oriented, not a good comparison. To be investigated.
Anyway, this is not bad at all.
Can you get logged in, or does the system crash afterwards? Suggest looking at ps to see if there's any memory left just before entering the password, the problem may be no more system memory. Run meminfo as well to see.
Like the 2nd report indicated, yes, I can log in via netcat, even get a shell prompt. Nothing further. The shell does not seem to get any of the input.
—Mellvik
|
Run meminfo when you have the shell prompt connected via netcat. That will report how much free memory there is, in case this might be a memory problem. Another thought is to login as toor, and use sash. Unfortunately, sash is very big with all the builtins and is only 15k smaller than ash at this point. But might be interesting to compare. Unfortunately I can't replicate anything over here - running QEMU when I try "telnet localhost" nothing happens. I can't use an external program for testing!! |
yes, I can log in via netcat, even get a shell prompt. Nothing further. The shell does not seem to get any of the input.
Run meminfo when you have the shell prompt connected via netcat. That will report how much free memory there is, in case this might be a memory problem. Another thought is to login as toor, and use sash. Unfortunately, sash is very big with all the builtins and is only 15k smaller than ash at this point. But might be interesting to compare.
Will do! With the recent memory improvements this may just be possible.
Unfortunately I can't replicate anything over here - running QEMU when I try "telnet localhost" nothing happens. I can't use an external program for testing!!
—
It would be extremely beneficial to get you closer into the debugging loop. There must be a way to get this to work reasonably with QEMU. I'm inclined to spend some time figuring that out.
Not knowing anything about the inside of ktcp, would it make sense to locate and fix the icmp echo problem to begin with, assuming it's on the 'outside' softwarewise as it is in the stack?
Btw, outgoing telnet (from elks) connects, then disconnects, leaving the calling shell in a loop, printing prompts. This only if started from the console. If started from serial, outgoing telnet just hangs.
Seems to me there may be a stdio problem (to be tested) in both telnet situations, e.g. that when I get a shell from telnetd in nc on the client, that shell expects input from the console, not via the ttyp.
Again, given the improvements over just a few days, inbound & outbound telnet should be operational in a matter of days.
…--M
|
I just got incoming QEMU figured out! Have duplicated httpd running well, telnetd gets a shell and fails with double characters and other problems. Will be figuring this out and submitting PRs shortly. Still don't have outgoing working, so can't test telnet yet.
QEMU supports host forwarding, I'll be working on seeing if I can ping ELKS. I took a quick look and ktcp is supposed to support ICMP echo.
I can't test that yet, but already have a handful of problems I can finally duplicate :) |
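Hello @Mellvik, An update on networking: I've deep-dived into it, and it's a big mess. There's lots of work to be done, and I'm a bit hesitant to start, as it could take lots of time. I have incoming-networking only working, so only tested that. Outgoing remains entirely untested, and internal (localhost) connections appear not to work at all. Brief synopsis:
telnetd is completely broken. It doesn't implement any telnet sequences, it's basically a stripped-down login-style daemon. As a result, it works terribly with any telnet-protocol compliant client, and reverts telnet clients to line mode, or double-echoing, which makes it unusable.
httpd appears to work well.
ktcp works well for a few minutes or connections, then fails with unreported memory allocation fails and subsequent use of NULL pointers. I've got that fixed, but there remains a big memory corruption problem that causes it to run out of memory improperly and stop executing after a few minutes.
Overall, ktcp is poorly written, unfinished, has corruption problems, and is married to the kernel /dev/tcp kluge within the socket implementation. We really need something like Phil Karn's well-written TCP/IP suite, but fear it would take lots of work to understand the kernel handling and subsequent passing to user code and back.
One scenario is to add significant good debugging into ktcp, but it suffers ultimately from never being finished, and writing TCP/IP network code from scratch is not a great idea these days. I fear it can't be made bug-free unless its memory corruption is tracked down. In addition, telnetd needs to be completely replaced with the MINIX 2 version, which I was going to do, but ktcp won't run long enough to be useful. Running outside applications using telnetd and httpd simultaneously will cause ktcp to crash within seconds.
Another possibility is to ditch ktcp and move to userland-only networking. This would be equivalent to running Karn's code or a micro TCP stack as a user program complete within itself, and stay out of the kernel, at least at the start. This could work great, but ultimately fails because we need sockets in the kernel to support arbitrary networking. Since ELKS is real mode, I'm not sure how important this is, versus getting real usable networking running. |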
@ghaerr,
thanks for a thorough rundown and summary. Very enlightening indeed. Whatever the status is, now we have a baseline. (Which somewhat ironically reminds me of the BSD 4.1-4.2 transition: integrate external networking code (from BBN) or develop from scratch. In hindsight they made the right choice, and created sockets (and more) to boot.)
In our case that would be a choice between ktcp (somewhat working) and Phil Karn's code (not entirely from scratch, but as I understand it, a clean kernel-integrated approach). Fully userland just doesn't taste good.
I appreciate the hesitation to take on the challenge of the Karn approach, and my choice, in the name of speed (although an assumption), would be to go for a cleanup of ktcp and get file transfer (either WebDAV via the current http server or a new ftpd) and a telnetd stable, leaving outgoing clients for now.
From a practical perspective, having two-way file transfer operational would ease debugging a lot, making the make--move-floppy-to-ext-drive--dd-to-floppy--move-floppy-to-PC--boot'n'watch sequence much shorter and faster. This is helped further by the newly added serial console flexibility.
My 2¢...
…--Mellvik
An update on networking: I've deep-dived into it, and its a big mess. There's lots of work to be done, and I'm a bit hesitant to start, as it could take lots of time.
I have incoming-networking only working, so only tested that. Outgoing remains entirely untested, and internal (localhost) connections appear not to work at all.
Brief synopsis:
telnetd is completely broken. It doesn't implement any telnet sequences, its basically a stripped-down login-style daemon. As a result, it works terribly with any telnet-protocol compliant client, and reverts telnet clients to line mode, or double-echoing, which makes it unusable.
httpd appears to work well.
ktcp works well for a few minutes or connections, then fails with unreported memory allocation fails and subsequent use of NULL pointers. I've got that fixed, but there remains a big memory corruption problem that causes it to run out of memory improperly and stop executing after a few minutes.
Overall, ktcp is poorly written, unfinished, has corruption problems, and is married to the kernel /dev/tcp kluge within the socket implementation. We really need something like Phil Karn's well-written TCP/IP suite, but fear it would take lots of work to understand the kernel handling and subsequent passing to user code and back.
One scenario is to add significant good debugging into ktcp, but it suffers ultimately from never being finished, and writing TCP/IP network code from scratch is not a great idea these days. I fear it can't be made bug-free unless its memory corruption is tracked down. In addition, telnetd needs to be completely replaced with the MINIX 2 version, which I was going to do, but ktcp won't run long enough to be useful. Running outside applications using telnetd and httpd simultaneously will cause ktcp to crash within seconds.
Another possibility is to ditch ktcp and move to userland-only networking. This would be equivalent to running Karn's code or a micro TCP stack as a user program complete within itself, and stay out of the kernel, at least at the start. This could work great, but ultimately fails because we need sockets in the kernel to support arbitrary networking. Since ELKS is real mode, I'm not sure how important this is, versus getting real usable networking running.
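To make the "doesn't implement any telnet sequences" point in the synopsis concrete: a daemon that wants to stay dumb still has to strip IAC commands from the byte stream and refuse the options the client proposes. The sketch below uses the command codes from RFC 854, but it is not the ELKS or MINIX telnetd; `telnet_filter` is a hypothetical name, and it assumes each command arrives complete within one buffer.

```c
#include <stddef.h>

/* Telnet command bytes (RFC 854). */
enum { TN_SE = 240, TN_SB = 250, TN_WILL = 251, TN_WONT = 252,
       TN_DO = 253, TN_DONT = 254, TN_IAC = 255 };

/* Strip telnet commands from in[0..n), copying plain data to data[] and
 * queuing a refusal for every DO/WILL in reply[]. Both output buffers must
 * hold at least n bytes. Returns the number of data bytes; *reply_len
 * receives the number of reply bytes to send back to the client. */
size_t telnet_filter(const unsigned char *in, size_t n,
                     unsigned char *data,
                     unsigned char *reply, size_t *reply_len)
{
    size_t i = 0, out = 0, r = 0;

    while (i < n) {
        if (in[i] != TN_IAC) {          /* ordinary data byte */
            data[out++] = in[i++];
            continue;
        }
        if (i + 1 >= n)                 /* lone IAC at end of buffer: drop */
            break;
        switch (in[i + 1]) {
        case TN_IAC:                    /* IAC IAC is an escaped 0xFF */
            data[out++] = TN_IAC;
            i += 2;
            break;
        case TN_DO:                     /* refuse: DO -> WONT, WILL -> DONT */
        case TN_WILL:
            if (i + 2 >= n) {           /* option byte missing: drop */
                i = n;
                break;
            }
            reply[r++] = TN_IAC;
            reply[r++] = (in[i + 1] == TN_DO) ? TN_WONT : TN_DONT;
            reply[r++] = in[i + 2];
            i += 3;
            break;
        case TN_DONT:                   /* already off: nothing to answer */
        case TN_WONT:
            i += 3;
            break;
        case TN_SB:                     /* skip subnegotiation up to IAC SE */
            i += 2;
            while (i + 1 < n && !(in[i] == TN_IAC && in[i + 1] == TN_SE))
                i++;
            i += 2;
            break;
        default:                        /* other two-byte commands */
            i += 2;
            break;
        }
    }
    *reply_len = r;
    return out;
}
```

Refusing everything keeps the client in its default (NVT) mode; a real replacement such as the MINIX 2 telnetd mentioned above would negotiate echo and character mode instead.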
|
For me it's not a matter of taste, it's the practicality. Fully userland solution isn't that bad and isn't that uncommon, especially if you're limited in space for kernel. Shifting the burden to a daemon running in userspace is the solution then. |
Update on networking status, and thanks for both your comments. After careful consideration of the work involved, I decided to stick with I've now got ktcp able to run for longer periods of time, have fixed a big memory overwrite bug and have added code to prevent it from running out of memory. I have tracked down the unbounded memory growth problem, which happens for a couple reasons: 1) for reasons yet unknown, TCP gets out of sync, and unlimited packets are allocated for retransmission, which grew to a very large size and malloc fails, and 2) TCP close wait operations with
All of this is being tested with NE2K card emulation on QEMU. I badly need a method of testing SLIP/CSLIP to determine at which level the bugs are. Can anyone help to set up a test SLIP scenario using an emulator? I plan on submitting a few cleanup PRs that should allow testing volunteers to help move this forward. |
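The two defences mentioned above (checking allocation results, and keeping retransmission memory from growing without limit) can be sketched as follows. This is not the ktcp retransmit code: the structure names, `retrans_add`, and the 4096-byte cap are assumptions for illustration only.

```c
#include <stdlib.h>
#include <string.h>

#define RETRANS_MAX_BYTES 4096u     /* assumed cap, not a ktcp constant */

struct retrans_node {
    struct retrans_node *next;
    size_t len;
    unsigned char data[];           /* C99 flexible array member */
};

struct retrans_queue {
    struct retrans_node *head;
    size_t bytes;                   /* payload bytes currently queued */
};

/* Queue a copy of pkt for retransmission (newest entry first).
 * Returns 0 on success, -1 if the cap would be exceeded or malloc()
 * fails; the caller must handle -1 instead of using a NULL pointer. */
int retrans_add(struct retrans_queue *q, const void *pkt, size_t len)
{
    struct retrans_node *n;

    if (len > RETRANS_MAX_BYTES - q->bytes)
        return -1;
    n = malloc(sizeof *n + len);
    if (n == NULL)
        return -1;
    memcpy(n->data, pkt, len);
    n->len = len;
    n->next = q->head;
    q->head = n;
    q->bytes += len;
    return 0;
}

/* Release every queued packet, e.g. when the connection goes away. */
void retrans_free_all(struct retrans_queue *q)
{
    while (q->head != NULL) {
        struct retrans_node *n = q->head;

        q->head = n->next;
        q->bytes -= n->len;
        free(n);
    }
}
```

With a hard cap, a connection that gets out of sync fails visibly at `retrans_add` rather than exhausting the heap for every other connection.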
It looks like deveth.c line 94 checks for two MAC addresses: broadcast and ELKS. I'm thinking that ARP requests with a broadcast header (all 0xFF) are coming in through there, into arp_recvpacket, adding the request to the cache (which may not make sense if the request is a query, not a response), then issuing an ARP reply regardless (line 199 in arp.c). I can't exactly debug this in QEMU, is ELKS responding to all ARP requests with its own MAC/ethernet address? I think we can fix this by comparing the requested src IP in arp.c line 198 and not call arp_reply if its local_ip. Perhaps test this by adding the line if (arp->ip>src == local_ip) before arp_reply on line 199. I'll read up a bit on ARP headers to see whether the call to arp_cache_add is appropriate. Note that ktcp doesn't timeout ARP entries either, so there's still work to be done. |
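For reference, the destination-MAC test described above (accept broadcast frames and frames addressed to this host) amounts to something like the sketch below. `eth_accept` and `MAC_LEN` are hypothetical names, not the identifiers used in deveth.c. The point is that this filter necessarily lets every broadcast ARP request through, so the "is this request for my IP" decision has to be made later, in the ARP code.

```c
#include <string.h>

#define MAC_LEN 6

/* Returns 1 if a frame with destination address dst should be accepted
 * by a host whose own address is own, 0 otherwise. */
int eth_accept(const unsigned char dst[MAC_LEN],
               const unsigned char own[MAC_LEN])
{
    static const unsigned char bcast[MAC_LEN] =
        { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };

    return memcmp(dst, bcast, MAC_LEN) == 0 ||
           memcmp(dst, own, MAC_LEN) == 0;
}
```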
ELKS will answer any and all ARP requests. Neither arp.c nor deveth.c checks that the request is for its IP address.
It looks like deveth.c line 94 checks for two MAC addresses: broadcast and ELKS. I'm thinking that ARP requests with a broadcast header (all 0xFF) are coming in through there, into arp_recvpacket, adding the request to the cache (which may not make sense if the request is a query, not a response), then issuing an ARP reply regardless (line 199 in arp.c).
Yes, this is what I'm reading too. This is not per the recommendation in RFC 826. It may in fact be argued that with such a small ARP cache, this behaviour is bad news for ELKS on a populated network. More on that below.
While reading the ELKS ARP code I believe I'm seeing a related but different problem. arp_recvpacket calls arp_cache_add before doing any processing (other than checking the ether addresses), which is per the RFC guidelines. arp_cache_add calls arp_cache_get to check if the IP address is already in the cache. Still correct. THEN arp_cache_get OVERWRITES the ether-address from the call with the ether address in the cache, thus not only passing the opportunity to correct the entry if it's wrong, but also preserving incorrect info in the cache. This is not per the RFC guide for obvious reasons. @ghaerr, are you seeing the same?
I can't exactly debug this in QEMU, is ELKS responding to all ARP requests with its own MAC/ethernet address?
Yes.
I think we can fix this by comparing the requested src IP in arp.c line 198 and not call arp_reply if its local_ip. Perhaps test this by adding the line if (arp->ip>src == local_ip) before arp_reply on line 199.
Yes, this is per the RFC recommendation - which essentially says 1) check the cache, update cached ether_addr if IP_addr found, 2) check whether we are the destination (our IP in the request), 3) add to cache if not found in step 1, 4) send arp-reply:
""
[optionally check the protocol length ar$pln]
Merge_flag := false
If the pair <protocol type, sender protocol address> is
already in my translation table, update the sender
hardware address field of the entry with the new
information in the packet and set Merge_flag to true.
?Am I the target protocol address?
Yes:
If Merge_flag is false, add the triplet <protocol type,
sender protocol address, sender hardware address> to
the translation table.
""
I'll read up a bit on ARP headers to see whether the call to arp_cache_add is appropriate. Note that ktcp doesn't timeout ARP entries either, so there's still work to be done.
Yes, I realize that - there is a lot of ARP stuff ELKS doesn't do, which is fine for now. I think correcting the behaviour above and adding an arp command that can flush, list, maybe manually populate the cache, will reduce the need for a timed cache to almost zero.
My ktcp branch is too messy to lend itself to a PR for this right now, but I can send a diff when it's tested.
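The RFC 826 steps quoted above can be sketched in C as follows. This is an illustration, not the diff referred to here and not the ELKS arp.c code: `arp_input`, `arp_find`, `arp_add`, the cache layout, and the cache size of 5 are all assumptions. The merge flag is kept local rather than global.

```c
#include <stdint.h>
#include <string.h>

#define ARP_CACHE_SIZE 5            /* assumed size of a small fixed cache */

struct arp_entry { uint32_t ip; uint8_t mac[6]; int used; };
struct arp_cache { struct arp_entry e[ARP_CACHE_SIZE]; int next; };

/* Look up ip; returns the entry or NULL. Never modifies the cache. */
struct arp_entry *arp_find(struct arp_cache *c, uint32_t ip)
{
    int i;

    for (i = 0; i < ARP_CACHE_SIZE; i++)
        if (c->e[i].used && c->e[i].ip == ip)
            return &c->e[i];
    return NULL;
}

/* Store a new mapping, overwriting the oldest slot (round robin). */
static void arp_add(struct arp_cache *c, uint32_t ip, const uint8_t mac[6])
{
    struct arp_entry *ent = &c->e[c->next];

    ent->ip = ip;
    memcpy(ent->mac, mac, 6);
    ent->used = 1;
    c->next = (c->next + 1) % ARP_CACHE_SIZE;
}

/* Process one received ARP packet following the RFC 826 order.
 * Returns 1 if the caller should send an ARP reply, 0 otherwise. */
int arp_input(struct arp_cache *c, uint32_t local_ip,
              uint32_t sender_ip, const uint8_t sender_mac[6],
              uint32_t target_ip, int is_request)
{
    int merge_flag = 0;
    struct arp_entry *ent = arp_find(c, sender_ip);

    if (ent != NULL) {              /* known sender: refresh its MAC */
        memcpy(ent->mac, sender_mac, 6);
        merge_flag = 1;
    }
    if (target_ip != local_ip)      /* not addressed to us: stop here */
        return 0;
    if (!merge_flag)                /* new sender talking to us: learn it */
        arp_add(c, sender_ip, sender_mac);
    return is_request ? 1 : 0;      /* only requests get a reply */
}
```

Two properties follow from this ordering: the packet's hardware address always wins over a cached one, and broadcasts for other hosts can refresh existing entries but never add new ones, which matters with a cache this small.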
—Mellvik
|
Diff for arp.c fixes, re. message above:
|
Hello @Mellvik, Thanks for the ARP fix! Is this version RFC 826 compliant, according to your reading of it? I read most of the RFC yesterday. This fixes the ARP replies being generated by ELKS when it shouldn't, does it fix anything else that can be noted? I don't have a way to fully test it, but will check what I can on QEMU. I would like to get rid of the merge_flag global and will submit a slightly revised PR with that passed as a parameter to arp_cache_get() for you to test and we'll commit it. I've been trying to get a PC that will run serial port or ethernet and have struck out twice in the last week. The first system was too modern for an ISA NE2K, but wouldn't boot an ELKS FAT32 MBR USB stick for some reason. The second system was quite ancient but its BIOS didn't support any USB boot. I can't write floppies so USB boot seems to be the way to go. Can't wait to actually get some real hardware that can run ELKS lol :) |
Is this version RFC 826 compliant, according to your reading of it? I read most of the RFC yesterday.
Cache updates are now per the RFC recommendations.
This fixes the ARP replies being generated by ELKS when it shouldn't, does it fix anything else that can be noted?
The bug pointed out in my previous message is fixed, in addition to the 'automatic' cache update.
I don't have a way to fully test it, but will check what I can on QEMU. I would like to get rid of the merge_flag global and will submit a slightly revised PR with that passed as a parameter to arp_cache_get() for you to test and we'll commit it.
I expected that :-) so I took the short route.
I've been trying to get a PC that will run serial port or ethernet and have struck out twice in the last week. The first system was too modern for an ISA NE2K, but wouldn't boot an ELKS FAT32 MBR USB stick for some reason. The second system was quite ancient but its BIOS didn't support any USB boot. I can't write floppies so USB boot seems to be the way to go. Can't wait to actually get some real hardware that can run ELKS lol :)
Good luck with that. May I recommend compaq portables on ebay? I've sold 4 so far this year. They're great fun to restore... just got a broken compaq slt386s/20 from australia. Looks great, unfortunately, xircom parallel (and serial) is the only way to network it. Runs win 3.1 :-)
M
.
BTW the ping 24 problem seems to be in the driver - I'm having fun with 86 assembly code right now.
|
That's a neat idea, I used to have one years ago that I developed on... but I need a system that will boot an ELKS MBR USB, since I have no way of writing a floppy! |
May I recommend compaq portables on ebay?
That's a neat idea, I used to have one years ago that I developed on... but I need a system that will boot an ELKS MBR USB, since I have no way of writing a floppy!
A standard 1.44 usb floppy works fine on your mac (last time I checked, I'm using the raspi these days - from the mac) ... and costs little..
M
|
Thanks for you suggestions on hardware. I bought a Compaq Portable 386 on ebay, along with a Gotek floppy drive emulator. I hope to use this to improve the serial port driver for higher speeds, and, with the single ISA expansion bay, get an NE2K network card up and running for network testing, and ultimately the implementation of TFTP for updating ELKS on its internal hard drive. |
Congratulations!! Presumably with an expansion box so you get the ISA bus. I used a floppy emulator like the gotek for a while, works fine. If the machine has the original (slow) disk, you may consider replacing it with a CF card IDE socket. Too bad I couldn't sell you one - I'm preparing two 386/20s for ebay right now. Let me know if there is anything you need. BTW - I just noticed that the 3COM 503 card uses the same chipset as the ne2k, so expanding ELKS support to that 3com model would be easy. |
Yes.
I've heard that the system has to be booted from 5" floppy first and some sort of setup program run... The Compaq BIOS expects a 1.2M 5.25" drive, will there be a problem if I try running the Gotek as a 1.44M 3.5" drive when attached? I guess I'll have to open the case to even get that working.
Just looked on eBay, and the 503 cards have the older BNC connector... what about this one, a 3C509 which has the ethernet connector, do you think that would work? |
Presumably with an expansion box so you get the ISA bus
Yes.
I used a floppy emulator like the gotek for a while, works fine.
I've heard that the system has to be booted from 5" floppy first and some sort of setup program run... The Compaq BIOS expects a 1.2M 5.25" drive, will there be a problem if I try running the Gotek as a 1.44M 3.5" drive when attached? I guess I'll have to open the case to even get that working.
No, booting from a 1.44MB drive from scratch works. If you're getting a machine with a broken battery - most likely - you have to open it anyway. I can send you pictures on how to fix that. Minimal soldering, a CR2032 socket, a CR2032 battery and some double sided tape.
the 3COM 503 card uses the same chipset as the ne2k, so expanding ELKS support to that 3com model would be easy.
Just looked on eBay, and the 503 cards have the older BNC connector... what about this one, a 3C509 which has the ethernet connector, do you think that would work?
No, go for a real ne2k card, that's easier. A real old one with jumpers instead of programmed config (IRQ, ioaddr ) is an unbelievable simplification. You won't believe the time I spent to find a DOS utility that actually worked with my 'other' ne2k card - to change the IRQ to 9 - some don't even allow that ... or something else.
I'll have a look on ebay. May make sense to take this thread off of github. helge@mymayday.com is good.
M
|
Update on ktcp status:
I’m (partly) rewriting the low level of the ne2k driver, now being tested - incoming ping works, arp works, incoming telnet is almost stable.
While finishing up, here’s a (probably minor) challenge for you @ghaerr:
The case is incoming telnet to elks. If the client sends a RESET, tcp.c line 239 handles that fine, cleans up the receiving end. But if there are outgoing packets pending, ktcp panics in tcpdev.c line 310. Seems a send-cleanup is needed too ... here’s the debug output (w/my extensions)
IP: recv tcp packet
retrans check buffers 3, mem 255
retrans retry #1 rto 2048
arp: using cached entry for 10.0.2.2
IP: send 10.0.2.15 -> 10.0.2.2 v4 hl:5 tos:0 len:105 id:182 fo:0 ihl:5 chk:c961 (OK) ttl:64 prot:6
|t||r|/C53|B4d/ |RD: 1536 64 94da|IP: recv 10.0.2.2 -> 10.0.2.15 v4 hl:5 tos:0 len:40 id:0 fo:4000 ihl:5 chk:c022 (OK) ttl:64 prot:6
IP: recv tcp packet
tcp: RST received, removing retrans packets
retrans free buffers 2 mem 170
retrans free buffers 1 mem 85
retrans free buffers 0 mem 0
Free CB
ktcp: panic in write
—Mellvik
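For illustration only, here is a minimal C sketch of the kind of send-side cleanup suggested above. The names (`struct cb`, `cb_reset`, `cb_write`) are hypothetical and do not correspond to the actual ktcp or tcpdev code. The assumption is simply that a RST should release pending output along with the retransmit buffers, and that a later write on the freed control block should return an error rather than panic.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CB 4

/* Hypothetical, heavily simplified control block. */
struct cb {
    int in_use;         /* non-zero while the connection exists */
    int retrans_bufs;   /* retransmit buffers still queued */
    int pending_write;  /* bytes accepted from the application, not yet sent */
};

struct cb cbs[MAX_CB];

/* Called when a RST arrives: release everything tied to the connection. */
void cb_reset(struct cb *c)
{
    assert(c != NULL);
    c->retrans_bufs = 0;    /* "removing retrans packets" in the log */
    c->pending_write = 0;   /* assumed extra step: drop pending output too */
    c->in_use = 0;          /* "Free CB" in the log */
}

/* Write path: report an error for a freed CB instead of panicking. */
int cb_write(struct cb *c, int nbytes)
{
    if (c == NULL || !c->in_use)
        return -1;          /* caller could map this to a reset error */
    c->pending_write += nbytes;
    return nbytes;
}
```

With this shape, a write that races with the RST just fails and the application sees a closed connection. Whether that fits the kernel/tcpdev handshake is a separate question.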
|
Thanks for the update!
Nice - did you find particular areas of the NE2K driver that were broken? What were they?
Unfortunately I haven't been able to determine the problem with this yet. ktcp has a number of problems on session termination, whether from RST, or from normal or abnormal telnet termination. These are complicated with the kernel/tcpdev design which passes socket information to/from user programs and ktcp. I have been testing using localhost, which only works for a while before hanging. None of the session disconnects work properly and there also seems to be memory corruption errors afterwards as well. |
I’m (partly) rewriting the low level of the ne2k driver, now being tested - incoming ping works, arp works, incoming telnet is almost stable.
Nice - did you find particular areas of the NE2K driver that were broken? What were they?
The key problem was the ring buffer wraparound, which the driver did ‘manually’ in addition to the NIC doing it automatically. Several special cases arose which caused the driver to deliver the wrong data on read.
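To make the wraparound point concrete, here is a small simulation in C. It is a sketch under stated assumptions, not the ELKS driver: the page numbers `RX_START` and `RX_STOP`, the `nic_mem` array and both function names are made up for the example. The NIC model wraps from the end of the ring back to its start by itself, so the driver side issues one read for the whole packet and does no splitting of its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 256u
#define RX_START  0x46u   /* hypothetical first page of the receive ring */
#define RX_STOP   0x80u   /* hypothetical page just past the ring */

/* Simulated on-card buffer memory, indexed by NIC address. */
uint8_t nic_mem[RX_STOP * PAGE_SIZE];

/* Models a NIC read that wraps from ring end to ring start automatically. */
void nic_remote_read(uint16_t addr, uint8_t *dst, size_t len)
{
    assert(addr >= RX_START * PAGE_SIZE && addr < RX_STOP * PAGE_SIZE);
    while (len--) {
        *dst++ = nic_mem[addr++];
        if (addr == RX_STOP * PAGE_SIZE)
            addr = RX_START * PAGE_SIZE;   /* the NIC's own wraparound */
    }
}

/* Driver side: a single read per packet, no second "manual" wrap. */
void driver_read_packet(uint16_t addr, uint8_t *dst, size_t len)
{
    nic_remote_read(addr, dst, len);
}
```

If the driver also split the read at the ring boundary and adjusted the address itself, the second half would be fetched from the wrong place, which matches the "wrong data on read" symptom.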
BTW, I ordered the network card you recommended for the Compaq 386.
Congrats! The next thing to look for (on eBay) is keyboard cables. The coating is cracking up and falling off the original cables.
The case is incoming telnet to elks. If the client sends a RESET, tcp.c line 239 handles that fine, cleans up the receiving end. But if there are outgoing packets pending, ktcp panics in tcpdev.c line 310.
Unfortunately I haven't been able to determine the problem with this yet. ktcp has a number of problems on session termination, whether from RST, or from normal or abnormal telnet termination. These are complicated with the kernel/tcpdev design which passes socket information to/from user programs and ktcp. I have been testing using localhost, which only works for a while before hanging. None of the session disconnects work properly and there also seems to be memory corruption errors afterwards as well.
It will be interesting to see how much of this is going away when the low level driver works correctly. I just found that the mentioned panic is less severe than I thought. Ktcp continues to work afterwards. Additionally - outgoing telnet seems to work fine in limited testing, even terminates correctly. I still have too much debug output to start real life testing, but this is very promising.
My final step is to add code to make the NIC recover from ring buffer overruns, which turns out to be quite complicated (and a situation that is likely to occur on slow HW).
…-M
|
I went ahead and merged #667 which finalized the ethernet and ARP fixes discussed here, with the exception of your enhanced NE2K driver. Since we had a number of other intervening PRs, I figured it would be better to pull everything down from the repo for your next round of testing. Since this ktcp status issue is getting pretty long, go ahead and close this if the ARP wait kluge works, I'll then close the original issue, and you can bring up subsequent problems in a new thread. Thanks! |
I've built ELKS this morning with all of the recent ktcp and ARP changes and here are some observations:
Unfortunately, this telnet session didn't seem to work any longer and I couldn't open new telnet connection from Linux to ELKS. I couldn't kill A side note about
I guess it's only possible on the real ne2k-compatible device. |
@ghaerr, --Mellvik |
Hello @pawosm-arm, Thanks again for your continued testing of networking over SLIP!
That's great! It is taking a while, but I think we're slugging away slowly eliminating many of the bugs involved in getting networking to operate for a reasonable amount of time. It sounds like the serial driver is working now (up to 4800 baud) on your system. Did you require the larger 4k input buffer, or can ktcp work with the standard 1k buffer? I worked to change the MTU to 1k bytes so a larger ring buffer wasn't required. I'm wondering whether we need some kind of
Yes, I'm aware of the retransmission problem, and haven't quite figured out why it's happening so often. It doesn't appear to be the result of dropped serial characters, but I don't see it happen on ethernet, so still a bit confusing. I have no idea what the main "repeating character" problem is you found doing a telnet session. I would guess it's not a new bug, but a result of one of the 3-4 major memory corruption bugs still identified and present in the networking stack.
FYI - ktcp is absolutely NOT reliable yet after any session has closed. I don't yet know the reasons but will attack these after we get the basic networking working (which is almost complete, awaiting further testing and a NIC driver enhancement from @Mellvik).
Yes, once networking has an error or closes a session, it seems there is corruption that causes innumerable other issues. Thanks again for your testing, and it's nice to hear progress is being made! |
Can you explain the ARP issues a bit more? I added the MAC address to arp debug issues after finding how I broke the ARP code for one commit, but am not aware of any new ARP bugs... and am anxious to know whether you're talking about something I might have broken with the ARP wait kluge fix rewrite, or something else. Obviously, the way that ARP works is changed, by delaying the IP packet until a reply is seen, but that's the only change made. I'm dependent on your testing for ethernet - the QEMU networking is pretty strange... for instance, the ARP commit I broke ended up sending an all-zero MAC address in replying to packets, and guess what? The QEMU "stack" worked, it apparently doesn't even look at the destination packet address! If I hadn't been in the middle of adding the new |
Nah it wasn't required, it's just a change that slipped in from my previous experiments with the higher baud rates. I'd remove it for next batch of tests. |
I can confirm that the original input buffer size (1k) is OK, no need to change it. I've been testing today's changes, no change in operation and stability observed. |
I was planning to get into this today, but ran into some other issues. :-)
The initial symptom is that incoming arp doesn't work, I need to start an elks outgoing telnet to trigger arp in order to get started. The reason may be somewhere else though, I'm starting with an arp debug enabled kernel and a rebooted raspi in the morning.
…--M
On 19 Jul 2020, at 17:58, Gregory Haerr ***@***.***> wrote:
@Mellvik,
the first using the latest commit shows definite ARP issues, so I'm holding back on the closing till I have a clear problem report
Can you explain the ARP issues a bit more? I added the MAC address to arp debug issues after finding how I broke the ARP code for one commit, but am not aware of any new ARP bugs... and am anxious to know whether you're talking about something I might have broken with the ARP wait kluge fix rewrite, or something else. Obviously, the way that ARP works is changed, by delaying the IP packet until a reply is seen, but that's the only change made.
I'm dependent on your testing for ethernet - the QEMU networking is pretty strange... for instance, the ARP commit I broke ended up sending an all-zero MAC address in replying to packets, and guess what? The QEMU "stack" worked, it apparently doesn't even look at the destination packet address! If I hadn't been in the middle of adding the new arp command which showed a table of zero MAC address, I wouldn't even have known it!
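As a rough illustration of "delaying the IP packet until a reply is seen", here is a minimal C sketch. Everything in it is hypothetical (a one-entry cache, a single held packet, and the names `ip_send`, `arp_reply`, `eth_send`); it is not the ktcp code, only the general shape of the technique.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PKT 64

/* All names below are hypothetical; this is not the ktcp source. */
struct arp_entry { uint32_t ip; uint8_t mac[6]; int valid; };
struct held_pkt  { uint32_t ip; uint8_t data[MAX_PKT]; size_t len; int used; };

struct arp_entry cache;   /* one-entry cache keeps the sketch short */
struct held_pkt  held;    /* the single IP packet waiting for a reply */
int frames_sent;          /* frames handed to the (simulated) driver */
int requests_sent;        /* ARP requests issued */

/* Stand-in for the ethernet driver write. */
void eth_send(const uint8_t *mac, const uint8_t *data, size_t len)
{
    (void)mac; (void)data; (void)len;
    frames_sent++;
}

/* Send now on a cache hit; otherwise hold the packet and ask via ARP. */
void ip_send(uint32_t ip, const uint8_t *data, size_t len)
{
    assert(len <= MAX_PKT);
    if (cache.valid && cache.ip == ip) {
        eth_send(cache.mac, data, len);
        return;
    }
    held.ip = ip;
    held.len = len;
    held.used = 1;
    memcpy(held.data, data, len);
    requests_sent++;      /* a real stack would broadcast the request here */
}

/* On an ARP reply: record the MAC, then release the held packet. */
void arp_reply(uint32_t ip, const uint8_t mac[6])
{
    cache.ip = ip;
    cache.valid = 1;
    memcpy(cache.mac, mac, 6);
    if (held.used && held.ip == ip) {
        eth_send(cache.mac, held.data, held.len);
        held.used = 0;
    }
}
```

A side effect of this design is that a frame never goes out with an unresolved (for example all-zero) destination MAC, because nothing is sent until the cache has an entry.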
|
ARP seems to be working flawlessly. I'm not getting any consistency in the ICMP reply duplicates, so I'm resting the case.
As to the 'not working' status of yesterday, it was once again the ne2k IRQs that hit me (I keep changing it in ports.h and forgetting that it has to be changed in two places).
BTW - I'm also seeing the huge RTO reported by @pawosm-arm sometimes (telnet session):
…
retrans free buffers 1 mem 23
arp: using cached entry for 10.0.2.94 (00.80.c7.30.5c.2e)
retrans alloc buffers 2, mem 50
arp: using cached entry for 10.0.2.94 (00.80.c7.30.5c.2e)
retrans check buffers 2, mem 50
retrans retry #1 rto 38654544 mem 50
arp: using cached entry for 10.0.2.94 (00.80.c7.30.5c.2e)
retrans check buffers 2, mem 50
retrans free buffers 1 mem 27
retrans check buffers 1, mem 27
retrans free buffers 0 mem 0
arp: using cached entry for 10.0.2.94 (00.80.c7.30.5c.2e) [hang]
—Mellvik
|
I'm now starting the merge of the updated ne2k driver and related changes to other files, which will enable more extensive testing of both arp and the tcp level with (presumably) more time between hangs.
—Mellvik
|
Great! I am going to close the three year old issue regarding the ARP wait hack as being fixed.
Let's consider this a separate issue, should it come up again.
Just found that bug: the basic timer mechanism within ktcp was broken, the microsecond time field was being interpreted as signed rather than unsigned and it sign-extended that into other time data, causing wildly inaccurate time values, depending on the microsecond! Kind of amazing it worked at all. Haven't tested everything yet, but this may fix other timeout and retransmission issues.
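For anyone wondering how a sign bug can corrupt the rest of a time value, here is a small C illustration. It is a sketch under assumptions: the layout (seconds in the upper bits, a 16-bit sub-second field in the lower bits) and the function names are invented for the example and are not the ktcp timer code. It only shows the mechanism, where a signed field gets sign-extended when widened and its high bits overwrite the seconds part.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy variant: the sub-second field is signed, so widening it to
 * 32 bits sign-extends and smears 1-bits over the seconds field. */
uint32_t ticks_signed(uint32_t secs, int16_t frac)
{
    return (secs << 16) | (uint32_t)(int32_t)frac;
}

/* Fixed variant: the same field treated as unsigned widens with zeros. */
uint32_t ticks_unsigned(uint32_t secs, uint16_t frac)
{
    return (secs << 16) | frac;
}

/* The two agree while the top bit of frac is clear, and diverge
 * once it is set, so the error depends on the sub-second value. */
void timer_demo(void)
{
    assert(ticks_signed(1, 0x7FFF) == ticks_unsigned(1, 0x7FFF));
    assert(ticks_signed(1, -32768) != ticks_unsigned(1, 0x8000));
}
```

That dependence on the low-order field would explain values that are sane most of the time and wildly off at others, like the occasional huge rto in the logs.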
Super - the ARP and ICMP stuff should hopefully be stable enough to test your new driver for submission! |
Just found that bug: the basic timer mechanism within ktcp was broken, the microsecond time field was being interpreted as signed rather than unsigned and it sign-extended that into other time data, causing wildly inaccurate time values, depending on the microsecond! Kind of amazing it worked at all. Haven't tested everything yet, but this may fix other timeout and retransmission issues.
I'm now starting the merge of the updated ne2k driver and related changes to other files, which will enable more extensive testing of both arp and the tcp level
Super - the ARP and ICMP stuff should hopefully be stable enough to test your new driver for submission!
I have a good feeling about this one. It may have been the root cause of many issues.
Rather looking forward to the next PR.
…--M
|
@ghaerr today's pull request did fix the large rto values bug. Frankly, I'm not observing retransmissions at 4800bps at all! So I tried 9600, and although retransmissions started to pop up from time to time (infrequently), with rto 16, my TCP connections were still stable and undisturbed.
Note that with CSLIP it took me minutes to get the above. I guess CSLIP issue deserves separate ticket. |
I believe it makes sense to open a new issue on ktcp: It has not been tested in a while, and much has changed in and around the kernel since.
I fired up KTCP on a physical machine this morning, and the overall status is: It does not work. Then the good news:
Connecting to 10.0.2.1
message and maybe 20 successful echoes. Once I even had something coming back from the other end. Garbled, but in terms of byte count, something like a login prompt. This is from ELKS:
This is from the pinging host: