Implement spelling dictionary #12

hackerb9 · 2022-08-02T05:06:26Z

I had an idea for how one could create a spelling dictionary by using a hashtable but not storing the words themselves, just a bit vector. There'd be some false positives (bogus words that happen to hash to the same bit as a valid word), but the chances are low if the vector is large enough. Normally a large vector would be a problem for limited memory, but since it would be sparse, it should be easily compressible.

Turns out someone beat me to it... by forty years. There's an IEEE paper from 1982 by Doug McIlroy (https://ia800805.us.archive.org/33/items/development-of-spelling-list/Image092317125441_text.pdf) which lays out how he managed to fit a spell checker for 30,000 words (250 kilobytes) in a 64 kilobyte machine. From the abstract:

Stripping prefixes and suffixes reduces the list below one third of its original size, hashing discards 60 percent of the bits that remain, and data compression halves it once again.

So, potentially, hashing and compression could cut the wordlist down to a quarter of the size.

It appears Wordle uses a 13,000 word (72 kilobyte) list (https://github.com/tabatkins/wordle-list) of what it will accept. Even a quarter of that, 18 kilobytes, is still rather large for a Model T, so it may make sense to use a smaller corpus.

- SCOWL (http://wordlist.aspell.net/) makes it easy to create a list of the most frequent five letter words. For example, here is a list of 7,000 words (https://gist.github.com/hackerb9/0d18b7b68149faa8f22841bfcec7ad35) (35 kilobytes), but the size is flexible since the words are partitioned into frequency bins.
- Alternatively, one could use SCOWL's list of least common words and subtract them from Wordle's list. That way, unusual words that Wordle knows about but SCOWL doesn't will be kept. Here's a list of 7,500 words (https://gist.github.com/hackerb9/5f951b41bda4348623d85d48ec1397fb) (44 kilobytes) created that way.

Currently, M100LE takes up about 8KB of storage (for the program and one year's worth of words). I do not know how much RAM is required at runtime, but I would not expect it to be more than a kilobyte. On a Model 200, which has only 19KB of RAM free for BASIC to use, that'd leave about 10KB for a wordlist to be stored plus the extra code to access it plus any extra RAM usage.

I believe, but am not sure, that, since files are actually already in RAM, a BASIC program can access the data without having to load up a second copy into memory. If so, it would be tight, but possible!

hackerb9 · 2022-08-02T08:26:46Z

I believe, but am not sure, that, since files are actually already in RAM, a BASIC program can access the data without having to load up a second copy into memory. If so, it would be tight, but possible!

I wrote a test program on my Tandy 200 and I am able to instantly load up the word of the day by reading it from memory instead of looping over LINE INPUT #1 as M100LE currently does. I do not know if the NEC PC-8201A has the same RAM directory structure, but I bet it does since that seems to have been something that came from their common evolutionary ancestor, the Kyocera.

Basic method for randomly accessing files in RAM:

1. CLEAR to make memory locations sane.
2. Check PEEK(1) to determine machine type.
3. Read RAM directory at 63842 (M100) or 62034 (T200).
4. Each entry is eleven bytes:
   - 1: File attributes
   - 2,3: File address in memory (little endian)
   - 4-9: File name before the dot (padded with spaces at end if necessary)
   - 10,11: File name extension after the dot (starts with space if no extension)
5. Keep reading filenames until "WL2022.DO" is found.
6. Let X←File address in memory.
7. Let DY←Ordinal day number ("Julian date").
8. Today's word starts at address X + (DY-1)*7; PEEK the five bytes from there.
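For anyone following along on a modern machine, the steps above can be sketched in Python against a fake block of RAM. This is only an illustration of the directory layout described here, not M100LE code: the helper names, the flag value 0xC0, the file addresses, the 55-entry limit and the three sample words are all made up for the demo.

```python
import struct

ENTRY_SIZE = 11          # each RAM directory entry is eleven bytes
M100_DIRECTORY = 63842   # directory address quoted above for the M100


def build_entry(flag, address, name, ext):
    """Pack one directory entry (hypothetical helper, only used to fake RAM)."""
    return (bytes([flag]) + struct.pack("<H", address)
            + name.ljust(6).encode("ascii") + ext.ljust(2).encode("ascii"))


def find_file(memory, directory, name, ext, max_entries=55):
    """Walk the directory; return the file's address, or None if not listed."""
    wanted = name.ljust(6).encode("ascii") + ext.ljust(2).encode("ascii")
    for i in range(max_entries):          # 55 is a placeholder limit
        base = directory + i * ENTRY_SIZE
        entry = bytes(memory[base:base + ENTRY_SIZE])
        if len(entry) < ENTRY_SIZE:
            break
        if entry[3:11] == wanted:
            # Second and third bytes hold the little-endian file address.
            return entry[1] + 256 * entry[2]
    return None


def word_for_day(memory, file_address, day):
    """Read the five letters of the 7-byte record (word + CR + LF) for a day."""
    start = file_address + (day - 1) * 7
    return bytes(memory[start:start + 5]).decode("ascii")


# Fake 64 KB of RAM: two directory entries and a three-day word list.
ram = bytearray(65536)
ram[M100_DIRECTORY:M100_DIRECTORY + 22] = (
    build_entry(0xC0, 39000, "NOTES", "DO")      # flag 0xC0 is a placeholder
    + build_entry(0xC0, 40000, "WL2022", "DO"))
words = b"CIGAR\r\nREBUT\r\nSISSY\r\n"           # made-up sample words
ram[40000:40000 + len(words)] = words

address = find_file(ram, M100_DIRECTORY, "WL2022", "DO")
print(address, word_for_day(ram, address, 2))    # 40000 REBUT
```

The point is that the lookup is a single multiplication once the file address is known, which is why it beats looping over LINE INPUT #1.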

bgri · 2022-08-03T02:11:43Z

Well, that paper was an interesting read! Seems like Dr. McIlroy knows his stuff /jk I won't pretend to understand a fair bit of what was written :) https://en.wikipedia.org/wiki/Douglas_McIlroy

That being said, I am intrigued by this. If it's possible to give a realistic sub-set of acceptable words to test against, then that sounds like a fun goal. I like the idea of the SCOWL exclusion list -- retaining more of the original content and sacrificing least common words. Would have to test to see what kind of savings we get.

I *believe*, but am not sure, that, since files are actually already in RAM, a BASIC program can access the data without having to load up a second copy into memory. If so, it would be tight, but possible!

I saw your following message and like this idea. It's what I kinda had in mind with the line input #1 method... but this is much neater! I remain impressed (and grateful) that you're continuing to be interested in this! I should soon be in a position to assist (beyond cheerleading) as I think I've found my m100 hardware boxes :)

bgri · 2022-08-03T02:32:19Z

Lovely! I think the NEC has the same process, perhaps accessed slightly differently. Will have to look.

Assuming we're able to get a nice tidy (and suitably slim) set of acceptable words -- I was wondering, since we know the 'guess' word to be tested (against a valid word as well as against the daily word), is there a way to speed up the test by indexing against the hash? (I don't understand how hashing works very well).

For example, if our daily word is ABOVE, and the guess is ABOUT, can the guess-hash only test against a subset of the whole acceptable words dictionary, testing the first two characters -- known words starting with AB -- and abort/exclude any others (we don't care about ACRES)? Basically using the first two characters of the guess to eliminate all others in the acceptable words dictionary.

Can we specify how the hash is built to accommodate something like this? (yes, a noob when it comes to this level of stuff). And then, would we know in RAM where to look? All AB hashes start at location #0643 (for example)? Or does this even matter and would any speed gains be negligible?

hackerb9 · 2022-08-03T07:21:20Z

Lovely! I think the NEC has the same process, perhaps accessed slightly differently.

Excellent. My main computer died in a recent heatwave, but I'll see if I can jury-rig something to send over my test program.

I was wondering, since we know the 'guess' word to be tested (against a valid word as well as against the daily word), is there a way to speed up the test by indexing against the hash? (I don't understand how hashing works very well).

Hashing already speeds up the checking by (essentially) using the guess word as an index into the array of valid words.

Think of hashing like a checksum: a magic function that, given a bunch of data — such as the string "ABOVE" — returns a number — such as 49989. Ideally, it acts like a blackbox: giving apparently random output, uniformly distributed.
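To make that concrete, here is a rough Python sketch of the bit-vector idea from the top of this thread. It is not the McIlroy scheme and not M100LE code: the class name, the choice of CRC-32 as the "magic function" and the 65,536-bit table size are assumptions picked only for illustration.

```python
import zlib


class HashedWordSet:
    """Bit vector indexed by a hash of the word; the words are never stored."""

    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)

    def _index(self, word):
        # Any well-mixed hash would do; CRC-32 is just a convenient stand-in.
        return zlib.crc32(word.upper().encode("ascii")) % self.num_bits

    def add(self, word):
        i = self._index(word)
        self.bits[i // 8] |= 1 << (i % 8)

    def __contains__(self, word):
        # One hash, one byte read, one bit test: no searching at all.
        i = self._index(word)
        return bool(self.bits[i // 8] & (1 << (i % 8)))


valid = HashedWordSet(65536)             # 8 KB of bits, a placeholder size
for w in ("ABOVE", "ABOUT", "ACRES"):
    valid.add(w)

print("ABOUT" in valid)                  # True: every stored word is accepted
print(len(valid.bits))                   # 8192
```

Checking a guess touches exactly one bit, so narrowing the search to words starting with AB would not save anything. A bogus guess is rejected unless it happens to land on a bit that a real word already set.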

hackerb9 · 2022-08-03T22:23:35Z

Here's a sample program that can randomly access the wordlist file directly from RAM:

0 REM RNDACC by hackerb9 2022
1 REM Random access to files in RAM.
2 ' This program can read directly
3 ' from a file without OPENing it.
4 ' When you just need a small bit
5 ' of a large file, this is faster.
6 ' 
7 ' Files change their location in RAM,     moving aside as other files grow. 
8 ' Note: EDIT modifies a hidden file,      but not the directory pointers!
9 ' CLEAR updates the RAM directory.
10 CLEAR
14 ' Ram Directory address (Anderson's "Programming Tips" gives RD=63842 for M100 and 62034 for T200.)
15 IF PEEK(1)=171 THEN RD=62034: ELSE RD=63842
17 ' WL20xx.DO is the wordle wordlist for each day in 20xx.
18 WL$="WL20"+RIGHT$(DATE$, 2)+".DO"
19 ' Search directory for "WL20xx.DO" 
20 FOR A = RD TO RD+11*55 STEP 11
29 ' Attribute flag: See Oppedahl's "Inside the TRS-80 Model 100" for details.
30 FL=PEEK(A) 
39 ' File address in memory
40 X=PEEK(A+1)+256*PEEK(A+2)
45 FN$=""
50 FOR T = A+3 TO A+8
70 C$=CHR$(PEEK(T))
72 ' Filenames are padded with spaces.
73 IF C$=" " THEN T=A+8: GOTO 80
75 FN$=FN$+C$
80 NEXT
89 ' BASIC, TELCOM, have no extension.
90 IF PEEK(A+9)=ASC(" ") THEN 150
100 EX$=CHR$(PEEK(A+9))+CHR$(PEEK(A+10))
110 FN$=FN$+"."+EX$
150 ' Got filename in FN$
160 REM PRINT FN$, X
170 IF FN$=WL$ THEN 200
180 NEXT
199 END
200 REM Found WL20xx.DO. Now access it.
210 INPUT "Enter an ordinal date (1 to 366)"; DY
220 DY=DY-1
228 ' X is WL20XX.DO's address in RAM
229 ' Format is 5 letters + CR + LF.
230 FOR T = X+DY*7 TO X+DY*7+4
240 PRINT CHR$(PEEK(T));
250 NEXT
260 PRINT

It should work on any of the Tandys, but I'm curious if it works on your NEC.

bgri · 2022-08-04T01:47:04Z

Nice! Got my 8201 unpacked and running today. Did a quick load of your program and, while it ran, no joy for the file names. Digging in to convert the addresses to their NEC values based on this: https://www.web8201.net/default.asp?content=tech.asp

Thinking we're looking at 63633 for the start of the file namespace:

...
EDTDIR  F886  63622  Directory entry for edit workspace
USRDIR  F891  63633  First user directory entry, of 21
DIREND  F978  63864  End of directory mark (0FF)

Format of 11-byte directory entry
=================================
Byte 0      Directory Flag
            Bit 7 - Master bit (1=valid entry)
            Bit 6 - ASCII bit (1=ASCII text file)
            Bit 5 - Binary bit (1=Machine language file)
            Bit 4 - File-in-ROM (1=File is in ROM)
            Bit 3 - Hidden file (1=hidden from main menu)
            Bit 2 - (not used)
            Bit 1 - RAM File open (1=currently open flag)
            Bit 0 - (internal use - set to 0 normally)
Bytes 1-2   Address Field
            *For a BASIC program, it's the address to what TXTTAB must be set.
            *For an ASCII text file, it's the beginning addr
            *For a binary file, it's the beginning addr
            *For a ROM file, it's the entry address
Bytes 3-10  File name
###########################################################
DIRPNT  F979  63865  Pointer to directory of current BASIC program
CASPRV  F97B  63867  Storage for previous character for cassette
COMPRV  F97C  63868  Storage for previous character for COM port
WNDPRV  F97D  63869  Storage for previous character for bar code reader
...

But crashing for tonight. Will look again at this tomorrow... great fun!
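As a sanity check on the map above, a small Python sketch can decode a directory flag byte and confirm the quoted addresses. The bit names in the dictionary are my own shorthand for the descriptions in the listing, and the sample flag value is made up.

```python
# Bit positions taken from the "Directory Flag" listing quoted above.
# The short names are hypothetical labels, not official ones.
FLAG_BITS = {
    7: "valid",
    6: "ascii",
    5: "binary",
    4: "in_rom",
    3: "hidden",
    1: "open",
}


def decode_flag(flag):
    """Return which of the documented flag bits are set in one byte."""
    return {name: bool(flag & (1 << bit)) for bit, name in FLAG_BITS.items()}


USRDIR = int("F891", 16)   # first user directory entry
DIREND = int("F978", 16)   # end of directory mark

# 21 entries of 11 bytes fit exactly between the two addresses.
print(USRDIR, DIREND, (DIREND - USRDIR) // 11)   # 63633 63864 21

flags = decode_flag(0b11000000)   # made-up example: valid ASCII text file
print(flags["valid"], flags["ascii"], flags["hidden"])   # True True False
```

The arithmetic agrees with "First user directory entry, of 21", which suggests 63633 is a sensible starting point on the NEC.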

hackerb9 · 2022-08-04T07:58:33Z

That is good information. It looks to be the same 11-byte format, plus it mentions something I didn't know before: I can detect the DIREND by looking for 0xFF in the Flag byte. Going to crash for tonight myself, but this is quite promising.

hackerb9 · 2022-08-06T23:12:43Z

I've improved the random access program so that it should handle the NEC 8201A. Does this work on your machine?

RNDACC.DO:

0 REM RNDACC by hackerb9 2022
1 REM Random access to files in RAM.
2 ' This program can read directly
3 ' from a file without OPENing it.
4 ' When you just need a small bit
5 ' of a large file, this is faster.
6 ' 
7 ' Files change their location in RAM,     moving aside as other files grow. 
8 ' Note: EDIT modifies a hidden file,      but not the directory pointers!
9 ' CLEAR refreshes the pointers.
10 CLEAR
12 ' HW ID. 51=M100, 171=T200, 148=NEC,      35=M10, 225=K85
13 ID=PEEK(1)
14 ' Ram Directory address. (Anderson's "Programming Tips" gives RD=63842 for M100 and 62034 for T200.)
15 ' (Gary Weber's NEC.MAP gives RD=63567, but we can skip the system files by starting at 63633.)
16 RD=-( 63842*(ID=51) + 62034*(ID=171) + 63633*(ID=148) )
17 ' WL20xx.DO is the wordle wordlist        for each day in 20xx.
18 WL$="WL20"+RIGHT$(DATE$, 2)+".DO"
19 ' Search directory for "WL20xx.DO" 
20 FOR A = RD TO 65535 STEP 11
29 ' Attribute flag: See Oppedahl's "Inside the TRS-80 Model 100" for details.
30 FL=PEEK(A) 
39 ' Stop at end of directory (255)
40 IF FL=255 THEN 300
49 ' X is file address in memory
50 X=PEEK(A+1)+256*PEEK(A+2)
59 ' Add filename all at once for speed
60 FN$=CHR$(PEEK(A+3)) + CHR$(PEEK(A+4)) + CHR$(PEEK(A+5)) + CHR$(PEEK(A+6)) + CHR$(PEEK(A+7)) + CHR$(PEEK(A+8)) + "." + CHR$(PEEK(A+9)) + CHR$(PEEK(A+10))
69 ' Got filename in FN$
70 PRINT FN$, X
80 IF FN$=WL$ THEN 200
90 NEXT A
99 GOTO 300
200 REM Found WL20xx.DO. Now access it.
210 INPUT "Enter an ordinal date (1 to 366)"; DY
220 DY=DY-1
228 ' X is WL20XX.DO's address in RAM
229 ' Format is 5 letters + CR + LF.
230 FOR T = X+DY*7 TO X+DY*7+4
240 PRINT CHR$(PEEK(T));
250 NEXT
260 PRINT
299 END
300 REM File not found
310 PRINT "Error: File ";WL$;" not found."
320 END

This version is also much faster because I hadn't realized previously how slow BASIC was at repeated string concatenation. Now, I create the filename from the Ram Directory entry all at once, but the downside is that the filename is padded with spaces if it has fewer than six characters. (E.g., ABC␠␠␠.DO.) That's okay for this test program as I mainly wanted to print the directory listing for debugging. The actual M100LE code could use PEEK to compare each filename directly to "WL2022DO" and skip doing any string concatenation at all.

I should note that, to be correct, this program ought to check each entry's attributes flag to make sure the file hasn't been KILLed. Skipping invalid files can be an optimization in the M100LE code.

hackerb9 · 2022-08-07T04:20:54Z

I just tested and all CR and LF can be removed for 29% smaller wordlists, if M100LE uses the random access method instead of LINE INPUT #1.

File	Bytes	Lines	Chars per entry
WL2022.DO	2555	365	7
WL2022.DA	1825	1	5
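The numbers in the table are easy to double-check. The Python sketch below only illustrates the arithmetic and the offset calculation for the two layouts; the function names and the three sample words are made up.

```python
def file_size(days, record_len):
    """Bytes needed for one word per day at a fixed record length."""
    return days * record_len


def word_at(data, day, record_len):
    """Slice the five letters for a 1-based day out of fixed-length records."""
    start = (day - 1) * record_len
    return data[start:start + 5]


with_crlf = "CIGAR\r\nREBUT\r\nSISSY\r\n"   # 7 bytes per entry (made-up words)
packed = "CIGARREBUTSISSY"                  # 5 bytes per entry

saving = 1 - file_size(365, 5) / file_size(365, 7)
print(file_size(365, 7), file_size(365, 5), file_size(366, 5))  # 2555 1825 1830
print(round(saving * 100))                                      # 29
print(word_at(with_crlf, 2, 7), word_at(packed, 2, 5))          # REBUT REBUT
```

The only change to the lookup is the multiplier: 7 bytes per day with CR and LF, 5 without.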

Side note: While it is not as easy, the Tandy text editor can still edit the wordlist even though it appears as a single line of 1825 characters (1830 for leap years). That's surprising considering that other machines of the era had much shorter line length limitations. (VAX/VMS was 255, IIRC).

I am mulling using a different extension, like .DA, instead of the usual .DO to signal to people that they probably want to treat it as a raw data file, not a text document. The Tandy computers seem to accept any extension that starts with D, so it works fine on my machine, but I wonder about your NEC 8201A.

hackerb9 · 2022-08-08T03:54:48Z

I'm going to separate the speed up from random access to a separate issue so that this one can focus on the spelling dictionary.

bgri · 2022-08-08T21:27:24Z

Oh! That is neat! I like that idea for distributed word lists, though I'd likely want to keep the 'original' with the CRLF pair, just to keep things sane for me. Easy enough to strip them out using a macro, etc for distribution. That's wild that it can handle that much text per line (255 cap sounds right to me).

Interesting. The NEC 'can't see' the .DA file on my backpack. But if I try and copy WL2022.DO from the backpack to the NEC, and use the file name WL2022.DA, then it copies. BUT, when viewing the file list on the NEC, it shows as a .DO file. Which your test file finds. But if I rename the string on line 18 to look for .DA instead of .DO, then it can't find the file.

So maybe using the .DA format on modern computers would work, but as far as the NEC is concerned, it may introduce user confusion... *(edit: Do -> DO)

bgri · 2022-08-08T21:27:44Z

Good idea.

hackerb9 · 2022-08-09T15:36:32Z

While it would be cool to try and fit the spelling dictionary in, if it's missing, does it actually detract from the game? As it is, yes, we'll accept 'XXXXX' as a guess if you choose to use that as a strategy to solve the word. Was it fun? Was it more fun being told 'XXXXX' isn't a word?

And is it worth the tradeoff that to use the spelling dictionary, we'll eat a large portion of your free RAM?

Good questions. While I think it does add to the game, it doesn't add much. Let's put the spelling dictionary on a back burner to simmer.

I guess we could make it an optional play mode if they have the RAM available -- a light version for folk without the available space, and the full experience for someone with a REX, or a second 32k RAM bank just for the game, etc.

I was hoping to get it to work in a stock M100, but using another RAM bank is worth considering. The NEC had two RAM banks, right? And I think the Olivetti M10 had that option as well. My Tandy 200 has three RAM banks, but each is only 24K.

bgri · 2022-08-09T16:14:03Z

Good questions. While I think it does add to the game, it doesn't add much. Let's put the spelling dictionary on a back burner to simmer.

Sounds good. Neat feature to have... but yeah, more pressing things.

The NEC had two RAM banks, right? And I think the Olivetti M10 had that option as well. My Tandy 200 has three RAM banks, but each is only 24K.

Yep, with all the RAM sockets populated the NEC has bank #1 and bank #2 available (32k ea). There is an expansion port on the left side that allows for another 32k (bank #3). But (to my knowledge) there's no easy way to pass data between them...

hackerb9 · 2022-08-27T07:44:53Z

Just a note for future selves: Using the RAM Directory to access files in storage is a perfect way to keep a large bit vector.

bgri · 2022-10-11T07:55:00Z

Nice! One minor thing in line 18 - NEC DATE$ leads with the year, so changing RIGHT$ to LEFT$ fixed it right up!

hackerb9 · 2022-10-16T08:19:22Z

Another note for our future selves: A “Bloom filter” may work for spell checking. The short of it: Like my proposed data structure above, it saves space by allowing a chance of incorrectly accepting words. Bloom filters save even more memory by using multiple hash functions to reduce the likelihood of a false positive, which allows smaller bit arrays to be used. See the Wikipedia description of Bloom Filters. Sample Python code is here. While not necessary for implementation, it is interesting to see the probability of false positives (accepting words which aren't actually in the word list): see Probability in Data Science.
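For the record, here is a minimal Bloom filter sketch in Python so the idea is easy to picture. It is not the sample code linked above and would not run on a Model 100 as is: the class name, the use of SHA-256 with a seed byte to fake several hash functions, and the sizes (65,536 bits, 4 hashes, 7,000 words) are all assumptions for illustration. The estimate in `false_positive_rate` is the usual approximation (1 - e^(-kn/m))^k.

```python
import hashlib
import math


class BloomFilter:
    """Bit array plus several hash functions; never stores the words."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _indexes(self, word):
        data = word.upper().encode("ascii")
        for seed in range(self.num_hashes):
            # A different seed byte stands in for a different hash function.
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:4], "big") % self.num_bits

    def add(self, word):
        for i in self._indexes(word):
            self.bits[i // 8] |= 1 << (i % 8)

    def __contains__(self, word):
        # A word is accepted only if every one of its bits is set.
        return all(self.bits[i // 8] & (1 << (i % 8))
                   for i in self._indexes(word))


def false_positive_rate(num_bits, num_hashes, num_words):
    """Approximate chance that a word not in the list is accepted anyway."""
    return (1 - math.exp(-num_hashes * num_words / num_bits)) ** num_hashes


bloom = BloomFilter(65536, 4)            # 8 KB of bits, placeholder sizes
for w in ("ABOVE", "ABOUT", "ACRES"):
    bloom.add(w)

print(all(w in bloom for w in ("ABOVE", "ABOUT", "ACRES")))   # True
print(false_positive_rate(65536, 4, 7000))
```

Stored words are always accepted; the open question for a Model 100 is how cheap the hash functions can be made in BASIC while still spreading the bits evenly.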

Additionally, we should look into “Prefix Sets”. Google Chrome used to use Bloom filters, but a decade ago switched to Prefix Sets for a 33% space saving. I know nothing about Prefix Sets, but if they are significantly more complicated or slower than Bloom filters, then they may not be appropriate for a Model 100.

hackerb9 mentioned this issue Aug 8, 2022

Maybe compress wordlist #14

Closed

bgri added the enhancement New feature or request label Aug 18, 2022
