Implement spelling dictionary #12
Well, that paper was an interesting read! Seems like Dr. McIlroy knows his
stuff /jk
I won't pretend to understand a fair bit of what was written :)
https://en.wikipedia.org/wiki/Douglas_McIlroy
That being said, I am intrigued by this. If it's possible to give a
realistic sub-set of acceptable words to test against, then that sounds
like a fun goal.
I like the idea of the SCOWL exclusion list -- retaining more of the
original content and sacrificing least common words. Would have to test to
see what kind of savings we get.
I *believe*, but am not sure, that, since files are actually already in
RAM, a BASIC program can access the data without having to load up a second
copy into memory. If so, it would be tight, but possible!
I saw your following message and like this idea. It's what I kinda had in
mind with the line input #1 method... but this is much neater!
I remain impressed (and grateful) that you're continuing to be interested
in this! I should soon be in a position to assist (beyond cheerleading) as
I think I've found my m100 hardware boxes :)
…--Brad
On Mon, Aug 1, 2022 at 11:06 PM hackerb9 ***@***.***> wrote:
I had an idea for how one could create a spelling dictionary by using a
hashtable but not storing the words themselves, just a bit vector. There'd
be some false positives (bogus words that happen to hash to the same bit as
a valid word), but the chances are low if the vector is large enough.
Normally a large vector would be a problem for limited memory, but since it
would be sparse, it should be easily compressible.
Turns out someone beat me to it... by forty years. There's an IEEE paper
from 1982 by Doug McIlroy
<https://ia800805.us.archive.org/33/items/development-of-spelling-list/Image092317125441_text.pdf>
which lays out how he managed to fit a spell checker for 30,000 words (250
kilobytes) in a 64 kilobyte machine. From the abstract:
Stripping prefixes and suffixes reduces the list below one third of its
original size, hashing discards 60 percent of the bits that remain, and
data compression halves it once again.
So, potentially, hashing and compression could cut the wordlist down to a
quarter of the size.
It appears Wordle uses a 13,000 word
<https://github.com/tabatkins/wordle-list> (72 kilobyte) list of what it
will accept. Even a quarter of that, 18 kilobytes, is still rather large
for a Model T, so it may make sense to use a smaller corpus.
- SCOWL <http://wordlist.aspell.net/> makes it easy to create a list of the
  most frequent five letter words. For example, here is a list of 7,000 words
  <https://gist.github.com/hackerb9/0d18b7b68149faa8f22841bfcec7ad35>
  (35 kilobytes), but the size is flexible since the words are partitioned
  into frequency bins.
- Alternately, one could use SCOWL's list of *least* common words and
  subtract them from Wordle's list; that way, unusual words that Wordle knows
  about but SCOWL doesn't will be kept. Here's a list of 7,500 words
  <https://gist.github.com/hackerb9/5f951b41bda4348623d85d48ec1397fb>
  (44 kilobytes) created that way.
Currently, M100LE takes up about 8KB of storage (for the program and one
year's worth of words). I do not know how much RAM is required at runtime,
but I would not expect it to be more than a kilobyte. On a Model 200, which
has only 19KB of RAM free for BASIC to use, that'd leave about 10KB for a
wordlist to be stored plus the extra code to access it plus any extra RAM
usage.
I *believe*, but am not sure, that, since files are actually already in
RAM, a BASIC program can access the data without having to load up a second
copy into memory. If so, it would be tight, but possible!
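As a rough check on the bit-vector idea, here is a minimal Python sketch that estimates how often a bogus word would be accepted. It assumes an ideal, uniformly distributed hash, a single bit set per word, and an uncompressed vector; the function names are made up for illustration. The word counts and the 10KB budget are the figures quoted above.

```python
import math


def false_accept_rate(num_words, num_bits):
    """Expected fraction of bits set when each word sets one uniformly random bit.

    A bogus word is accepted exactly when its bit happens to be set, so this
    fraction is also the chance of accepting a bogus word.
    """
    return 1.0 - math.exp(-num_words / num_bits)


def report(num_words, kilobytes):
    """Format the estimate for a vector of the given size in kilobytes."""
    num_bits = int(kilobytes * 1024 * 8)
    rate = false_accept_rate(num_words, num_bits)
    return f"{num_words} words in {kilobytes} KB: about {rate:.1%} of bogus words accepted"


for words in (7000, 7500, 13000):
    for kb in (4, 8, 10):
        print(report(words, kb))
```

Under these assumptions, a smaller word list or a larger vector lowers the rate; compression would change the storage cost but not the acceptance rate.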
Brad Grier
Lovely! I think the NEC has the same process, perhaps accessed slightly
differently. Will have to look.
Assuming we're able to get a nice tidy (and suitably slim) set of
acceptable words -- I was wondering, since we know the 'guess' word to be
tested (against a valid word as well as against the daily word), is there a
way to speed up the test by indexing against the hash? (I don't understand
how hashing works very well).
For example, if our daily word is ABOVE, and the guess is ABOUT, can the
guess-hash only test against a subset of the whole acceptable words
dictionary, testing the first two characters -- known words starting with
AB -- and abort/exclude any others (we don't care about ACRES)? Basically
using the first two characters of the guess to eliminate all others in the
acceptable words dictionary. Can we specify how the hash is built to
accommodate something like this? (yes, a noob when it comes to this level
of stuff).
And then, would we know in RAM where to look? All AB hashes start at
location #0643 (for example)?
Or does this even matter and would any speed gains be negligible?
…--Brad
On Tue, Aug 2, 2022 at 2:26 AM hackerb9 ***@***.***> wrote:
I *believe*, but am not sure, that, since files are actually already in
RAM, a BASIC program can access the data without having to load up a second
copy into memory. If so, it would be tight, but possible!
I wrote a test program on my Tandy 200 and I am able to instantly load up
the word of the day by reading it from memory instead of looping over LINE
INPUT #1 as M100LE currently does. I do not know if the NEC PC-8201A has
the same RAM directory structure, but I bet it does since that seems to
have been something that came from their common evolutionary ancestor, the
Kyocera.
Basic method for randomly accessing files in RAM:
1. CLEAR to make memory locations sane.
2. Check PEEK(1) to determine machine type.
3. Read RAM directory at 63842 (M100) or 62034 (T200)
4. Each entry is eleven bytes:
- 1: File attributes
- 2,3: File address in memory (little endian)
- 4-9: File name before the dot (padded with spaces at end if
necessary)
- 10,11: File name extension after the dot (starts with space if no
extension)
5. Keep reading filenames until "WL2022.DO" is found
6. Let X←File address in memory
7. Let DY←Ordinal day number ("Julian date")
8. Today's word can be found at PEEK( X + (DY-1)*7)
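The steps above can be sketched in Python against a simulated 64KB address space. This is only an illustration of the lookup logic, not the BASIC program itself: the memory contents, the file address 40000, the second file name, and the flag value 0xC0 are all made up, and the helper names are hypothetical.

```python
ENTRY_SIZE = 11  # bytes per RAM directory entry, per the layout above


def make_entry(flag, address, name, ext):
    """Build one entry: flag byte, little-endian address, 6-char name, 2-char extension."""
    return (bytes([flag]) + address.to_bytes(2, "little")
            + name.ljust(6).encode("ascii") + ext.ljust(2).encode("ascii"))


def find_file(mem, dir_start, max_entries, name, ext):
    """Return the address stored in the entry for name.ext, or None if absent."""
    target = name.ljust(6).encode("ascii") + ext.ljust(2).encode("ascii")
    for i in range(max_entries):
        a = dir_start + i * ENTRY_SIZE
        if bytes(mem[a + 3:a + 11]) == target:
            return mem[a + 1] + 256 * mem[a + 2]
    return None


def word_for_day(mem, file_addr, day, record_len=7):
    """Read the five letters of record number `day` (1-based)."""
    start = file_addr + (day - 1) * record_len
    return bytes(mem[start:start + 5]).decode("ascii")


# Simulated memory (hypothetical contents, not a real memory dump).
mem = bytearray(65536)
DIR_START = 63842   # M100 directory address quoted above
FILE_ADDR = 40000   # made-up location for the word list
words = ["ABOVE", "ABOUT", "ACRES"]
data = "".join(w + "\r\n" for w in words).encode("ascii")
mem[FILE_ADDR:FILE_ADDR + len(data)] = data
entries = [make_entry(0xC0, 32768, "NOTES", "DO"),
           make_entry(0xC0, FILE_ADDR, "WL2022", "DO")]
for i, e in enumerate(entries):
    mem[DIR_START + i * ENTRY_SIZE:DIR_START + (i + 1) * ENTRY_SIZE] = e

addr = find_file(mem, DIR_START, len(entries), "WL2022", "DO")
print(addr, word_for_day(mem, addr, 2))  # prints: 40000 ABOUT
```

The point of the design is that the cost of fetching a word no longer depends on the day number: one directory scan, then a single address computation.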
Excellent. My main computer died in a recent heatwave, but I'll see if I can jury-rig something to send over my test program.
Hashing already speeds up the checking by (essentially) using the guess word as an index into the array of valid words. Think of hashing like a checksum: a magic function that, given a bunch of data (such as the string "ABOVE"), returns a number (such as 49989). Ideally, it acts like a black box, giving apparently random output, uniformly distributed.
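To make that concrete, here is a minimal Python sketch of a hashed bit vector. The hash function is a toy polynomial hash chosen only for illustration (it is an assumption, not what M100LE would necessarily use), and the vector size is arbitrary. Note that checking a guess touches exactly one bit, so there is nothing to gain from first narrowing the search to words starting with "AB".

```python
def word_hash(word, num_bits):
    """Toy polynomial hash (an assumption); maps a word to a bit index."""
    h = 0
    for ch in word.upper():
        h = (h * 31 + ord(ch)) % num_bits
    return h


def build_bit_vector(words, num_bits):
    """Set one bit per valid word; the words themselves are not stored."""
    bits = bytearray((num_bits + 7) // 8)
    for w in words:
        i = word_hash(w, num_bits)
        bits[i // 8] |= 1 << (i % 8)
    return bits


def probably_valid(word, bits, num_bits):
    """True if the word's bit is set. Bogus words can collide with a set bit."""
    i = word_hash(word, num_bits)
    return bool(bits[i // 8] & (1 << (i % 8)))


NUM_BITS = 8 * 1024 * 8  # an 8 KB vector, chosen arbitrarily
valid = ["ABOVE", "ABOUT", "ACRES"]
bits = build_bit_vector(valid, NUM_BITS)
print(probably_valid("ABOUT", bits, NUM_BITS))  # prints: True
```

A valid word is never rejected; the only possible error is accepting a bogus word whose hash lands on a bit that some valid word already set.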
Here's a sample program that can randomly access the wordlist file directly from RAM:
0 REM RNDACC by hackerb9 2022
1 REM Random access to files in RAM.
2 ' This program can read directly
3 ' from a file without OPENing it.
4 ' When you just need a small bit
5 ' of a large file, this is faster.
6 '
7 ' Files change their location in RAM, moving aside as other files grow.
8 ' Note: EDIT modifies a hidden file, but not the directory pointers!
9 ' CLEAR updates the RAM directory.
10 CLEAR
14 ' Ram Directory address (Anderson's "Programming Tips" gives RD=63842 for M100 and 62034 for T200.)
15 IF PEEK(1)=171 THEN RD=62034: ELSE RD=63842
17 ' WL20xx.DO is the wordle wordlist for each day in 20xx.
18 WL$="WL20"+RIGHT$(DATE$, 2)+".DO"
19 ' Search directory for "WL20xx.DO"
20 FOR A = RD TO RD+11*55 STEP 11
29 ' Attribute flag: See Oppedahl's "Inside the TRS-80 Model 100" for details.
30 FL=PEEK(A)
39 ' File address in memory
40 X=PEEK(A+1)+256*PEEK(A+2)
45 FN$=""
50 FOR T = A+3 TO A+8
70 C$=CHR$(PEEK(T))
72 ' Filenames are padded with spaces.
73 IF C$=" " THEN T=A+8: GOTO 80
75 FN$=FN$+C$
80 NEXT
89 ' BASIC, TELCOM, have no extension.
90 IF PEEK(A+9)=ASC(" ") THEN 150
100 EX$=CHR$(PEEK(A+9))+CHR$(PEEK(A+10))
110 FN$=FN$+"."+EX$
150 ' Got filename in FN$
160 REM PRINT FN$, X
170 IF FN$=WL$ THEN 200
180 NEXT
199 END
200 REM Found WL20xx.DO. Now access it.
210 INPUT "Enter an ordinal date (1 to 366)"; DY
220 DY=DY-1
228 ' X is WL20XX.DO's address in RAM
229 ' Format is 5 letters + CR + LF.
230 FOR T = X+DY*7 TO X+DY*7+4
240 PRINT CHR$(PEEK(T));
250 NEXT
260 PRINT
It should work on any of the Tandys, but I'm curious if it works on your NEC.
Nice! Got my 8201 unpacked and running today. Did a quick load of your
program and, while it ran, no joy for the file names. Digging in to convert
the addresses to their NEC values based on this:
https://www.web8201.net/default.asp?content=tech.asp
Thinking we're looking at 63633 for the start of the file namespace:
...
EDTDIR F886 63622 Directory entry for edit workspace
USRDIR F891 63633 First user directory entry, of 21
DIREND F978 63864 End of directory mark (0FF)
Format of 11-byte directory entry
=================================
Byte 0 Directory Flag
Bit 7 - Master bit (1=valid entry)
Bit 6 - ASCII bit (1=ASCII text file)
Bit 5 - Binary bit (1=Machine language file)
Bit 4 - File-in-ROM (1=File is in ROM)
Bit 3 - Hidden file (1=hidden from main menu)
Bit 2 - (not used)
Bit 1 - RAM File open (1=currently open flag)
Bit 0 - (internal use - set to 0 normally)
Bytes 1-2 Address Field
*For a BASIC program, it's the address to what
TXTTAB must be set.
*For an ASCII text file, it's the beginning addr
*For a binary file, it's the beginning addr
*For a ROM file, it's the entry address
Bytes 3-10 File name
###########################################################
DIRPNT F979 63865 Pointer to directory of current BASIC program
CASPRV F97B 63867 Storage for previous character for cassette
COMPRV F97C 63868 Storage for previous character for COM port
WNDPRV F97D 63869 Storage for previous character for bar code reader
...
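For reference, the flag byte described in the map above can be decoded as in this minimal Python sketch. The bit meanings are copied from the NEC listing; the function names are made up, and whether the Tandy machines use identical bits is an assumption not confirmed here.

```python
# Bit meanings copied from the NEC directory flag description above.
FLAG_BITS = {
    7: "valid entry",
    6: "ASCII text file",
    5: "machine language file",
    4: "file in ROM",
    3: "hidden from main menu",
    1: "currently open",
}


def is_end_of_directory(flag):
    """The map lists 0xFF as the end-of-directory mark."""
    return flag == 0xFF


def decode_flag(flag):
    """Return the names of the set bits, highest bit first."""
    return [name for bit, name in sorted(FLAG_BITS.items(), reverse=True)
            if flag & (1 << bit)]


print(decode_flag(0xC0))  # prints: ['valid entry', 'ASCII text file']
```

A directory scan would test for the end mark first, since 0xFF has every bit set and would otherwise decode as a valid entry.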
But crashing for tonight. Will look again at this tomorrow... great fun!
That is good information. It looks to be the same 11-byte format, plus it mentions something I didn't know before: I can detect the DIREND by looking for 0xFF in the Flag byte. Going to crash for tonight myself, but this is quite promising.
I've improved the random access program so that it should handle the NEC 8201A. Does this work on your machine?
RNDACC.DO:
0 REM RNDACC by hackerb9 2022
1 REM Random access to files in RAM.
2 ' This program can read directly
3 ' from a file without OPENing it.
4 ' When you just need a small bit
5 ' of a large file, this is faster.
6 '
7 ' Files change their location in RAM, moving aside as other files grow.
8 ' Note: EDIT modifies a hidden file, but not the directory pointers!
9 ' CLEAR refreshes the pointers.
10 CLEAR
12 ' HW ID. 51=M100, 171=T200, 148=NEC, 35=M10, 225=K85
13 ID=PEEK(1)
14 ' Ram Directory address. (Anderson's "Programming Tips" gives RD=63842 for M100 and 62034 for T200.)
15 ' (Gary Weber's NEC.MAP gives RD=63567, but we can skip the system files by starting at 63633.)
16 RD=-( 63842*(ID=51) + 62034*(ID=171) + 63633*(ID=148) )
17 ' WL20xx.DO is the wordle wordlist for each day in 20xx.
18 WL$="WL20"+RIGHT$(DATE$, 2)+".DO"
19 ' Search directory for "WL20xx.DO"
20 FOR A = RD TO 65535 STEP 11
29 ' Attribute flag: See Oppedahl's "Inside the TRS-80 Model 100" for details.
30 FL=PEEK(A)
39 ' Stop at end of directory (255)
40 IF FL=255 THEN 300
49 ' X is file address in memory
50 X=PEEK(A+1)+256*PEEK(A+2)
59 ' Add filename all at once for speed
60 FN$=CHR$(PEEK(A+3)) + CHR$(PEEK(A+4)) + CHR$(PEEK(A+5)) + CHR$(PEEK(A+6)) + CHR$(PEEK(A+7)) + CHR$(PEEK(A+8)) + "." + CHR$(PEEK(A+9)) + CHR$(PEEK(A+10))
69 ' Got filename in FN$
70 PRINT FN$, X
80 IF FN$=WL$ THEN 200
90 NEXT A
99 GOTO 300
200 REM Found WL20xx.DO. Now access it.
210 INPUT "Enter an ordinal date (1 to 366)"; DY
220 DY=DY-1
228 ' X is WL20XX.DO's address in RAM
229 ' Format is 5 letters + CR + LF.
230 FOR T = X+DY*7 TO X+DY*7+4
240 PRINT CHR$(PEEK(T));
250 NEXT
260 PRINT
299 END
300 REM File not found
310 PRINT "Error: File ";WL$;" not found."
320 END
This version is also much faster because I hadn't realized previously how slow BASIC was at repeated string concatenation. Now, I create the filename from the Ram Directory entry all at once, but the downside is the filename is padded with spaces if it has fewer than six characters. (E.g., ABC␠␠␠DO.) That's okay for this test program as I mainly wanted to print the directory listing for debugging. The actual M100LE code could use PEEK to compare each filename directly to "WL2022DO" and skip doing any string concatenation at all.
I should note that, to be correct, this program ought to check each entry's attributes flag to make sure the file hasn't been KILLed. Skipping invalid files can be an optimization in the M100LE code.
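I just tested and all CR and LF can be removed for 29% smaller wordlists, if M100LE uses the random access method instead of LINE INPUT #1.
File       Bytes  Lines  Chars per entry
WL2022.DO  2555   365    7
WL2022.DA  1825   1      5
Side note: While it is not as easy, the Tandy text editor can still edit the wordlist even though it appears as a single line of 1825 characters (1830 for leap years). That's surprising considering that other machines of the era had much shorter line length limitations. (VAX/VMS was 255, IIRC). I am mulling using a different extension, like .DA, instead of the usual .DO to signify to people that they probably want to treat it as a raw data file not a text document. The Tandy computers seem to accept any extension that starts with D, so it works fine on my machine, but I wonder about your NEC 8201A.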
I'm going to split the random-access speedup into a separate issue so that this one can focus on the spelling dictionary.
Oh! That is neat! I like that idea for distributed word lists, though I'd
likely want to keep the 'original' with the CRLF pair, just to keep things
sane for me. Easy enough to strip them out using a macro, etc for
distribution.
That's wild that it can handle that much text per line (255 cap sounds
right to me).
Interesting. The NEC 'can't see' the .DA file on my backpack. But if I try
and copy WL2022.DO from the backpack to the NEC, and use the file name
WL2022.DA, then it copies. BUT, when viewing the file list on the NEC, it
shows as a .DO file. Which your test file finds. But if I rename the string
on line 18 to look for .DA instead of .DO, then it can't find the file.
So maybe using the .DA format on modern computers would work, but as far as
the NEC is concerned, it may introduce user confusion...
*(edit: Do -> DO)
…On Sat, Aug 6, 2022 at 10:21 PM hackerb9 ***@***.***> wrote:
I just tested and all CR and LF can be removed for 29% smaller wordlists,
if M100LE uses the random access method instead of LINE INPUT #1.
File       Bytes  Lines  Chars per entry
WL2022.DO  2555   365    7
WL2022.DA  1825   1      5
Side note: While it is not as easy, the Tandy text editor can still edit
the wordlist even though it appears as a single line of 1825 characters
(1830 for leap years). That's surprising considering that other machines of
the era had much shorter line length limitations. (VAX/VMS was 255, IIRC).
I am mulling using a different extension, like .DA, instead of the usual
.DO to signify to people that they probably want to treat it as a raw data
file not a text document. The Tandy computers seem to accept any extension
that starts with D, so it works fine on my machine, but I wonder about
your NEC 8201A.
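The size difference and the change in indexing can be checked with a small Python sketch. The word list here is placeholder data (three real-looking words padded out to 365 entries), and the function names are hypothetical; only the record lengths of 7 and 5 come from the table above.

```python
def with_line_endings(words):
    """Lay out words as in the .DO file: five letters + CR + LF per entry."""
    return "".join(w + "\r\n" for w in words).encode("ascii")


def packed(words):
    """Lay out words back to back with no separators, as in the proposed .DA file."""
    return "".join(words).encode("ascii")


def lookup(data, day, record_len):
    """Fetch the word for a 1-based ordinal day, given the bytes per record."""
    start = (day - 1) * record_len
    return data[start:start + 5].decode("ascii")


# Placeholder year of 365 five-letter entries.
words = ["ABOVE", "ABOUT", "ACRES"] + ["WORDS"] * 362
do_file = with_line_endings(words)
da_file = packed(words)

print(len(do_file), len(da_file))                    # prints: 2555 1825
print(lookup(do_file, 2, 7), lookup(da_file, 2, 5))  # prints: ABOUT ABOUT
```

Only the multiplier in the address calculation changes (7 becomes 5), which is why the packed layout works with random access but not with line-oriented reads.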
Good idea.
…On Sun, Aug 7, 2022 at 9:55 PM hackerb9 ***@***.***> wrote:
I'm going to separate the speed up from random access to a separate issue
so that this one can focus on the spelling dictionary.
Good questions. While I think it does add to the game, it doesn't add much. Let's put the spelling dictionary on a back burner to simmer.
I was hoping to get it to work in a stock M100, but using another RAM bank is worth considering. The NEC had two RAM banks, right? And I think the Olivetti M10 had that option as well. My Tandy 200 has three RAM banks, but each is only 24K.
Sounds good. Neat feature to have... but yeah, more pressing things.
Yep, with all the RAM sockets populated the NEC has bank #1 and bank #2 available (32k ea). There is an expansion port on the left side that allows for another 32k (bank #3). But (to my knowledge) there's no easy way to pass data between them...
Just a note for future selves: Using the RAM Directory to access files in storage is a perfect way to keep a large bit vector.
Nice! One minor thing in line 18 - NEC DATE$ leads with the year, so
changing RIGHT$ to LEFT$ fixed it right up!
…On Sat, Aug 6, 2022 at 5:12 PM hackerb9 ***@***.***> wrote:
I've improved the sample random access program so that it should
work on the NEC 8201A. Does this work on your 8201A?
It's also much faster because I hadn't realized previously how slow BASIC
was at repeated string concatenation. Now, I create the filename from the
Ram Directory entry all at once, but the downside is the filename is padded
with spaces if it has less than six characters. (E.g., ABC␠␠␠DO.) That's
okay for this test program as I mainly wanted to print the directory
listing for debugging. The actual M100LE code could use PEEK to compare
each filename directly to "WL2022DO" and skip doing any string
concatenation at all.
I should note that, to be correct, this program ought to check each
entry's attributes flag to make sure the file hasn't been KILLed. Again, in
M100LE, it can be an optimization to skip invalid files.
Another note for our future selves: A “Bloom filter” may work for spell checking.
The short of it: Like my proposed data structure above, it saves space by allowing a chance of incorrectly accepting words. Bloom filters save even more memory by using multiple hash functions to reduce the likelihood of a false positive, which allows smaller bit arrays to be used.
See the Wikipedia description of Bloom Filters. Sample Python code is here. While not necessary for implementation, it is interesting to see the probability of false positives (accepting words which aren't actually in the word list): see Probability in Data Science.
Additionally, we should look into “Prefix Sets”. Google Chrome used to use Bloom filters, but a decade ago switched to Prefix Sets for a 33% space saving. I know nothing about Prefix Sets, but if they are significantly more complicated or slower than Bloom filters, then they may not be appropriate for a Model 100.
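For the record, a Bloom filter can be sketched in a few lines of Python. This is a minimal illustration, not a proposal for the Model 100 code: deriving the hash functions by salting SHA-256 is an assumption made for convenience (far too heavy for an 8-bit machine), and the class name, bit count, and number of hashes are arbitrary.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash functions; membership tests may give false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, word):
        # Derive each hash by salting one standard hash with the hash number.
        for k in range(self.num_hashes):
            digest = hashlib.sha256(f"{k}:{word.upper()}".encode("ascii")).digest()
            yield int.from_bytes(digest[:4], "big") % self.num_bits

    def add(self, word):
        """Set every bit position that the word hashes to."""
        for i in self._positions(word):
            self.bits[i // 8] |= 1 << (i % 8)

    def __contains__(self, word):
        """True only if all of the word's bit positions are set."""
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._positions(word))


bloom = BloomFilter(num_bits=8 * 1024 * 8, num_hashes=3)
for w in ["ABOVE", "ABOUT", "ACRES"]:
    bloom.add(w)
print("ABOUT" in bloom)  # prints: True
```

A bogus word is accepted only if every one of its positions collides with a set bit, which is what lets a Bloom filter reach a given false-positive rate with fewer bits than a single-hash vector.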