Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement spelling dictionary #12

Open
hackerb9 opened this issue Aug 2, 2022 · 17 comments
Open

Implement spelling dictionary #12

hackerb9 opened this issue Aug 2, 2022 · 17 comments
Labels
enhancement New feature or request

Comments

@hackerb9
Copy link
Contributor

hackerb9 commented Aug 2, 2022

I had an idea for how one could create a spelling dictionary by using a hashtable but not storing the words themselves, just a bit vector. There'd be some false negatives (bogus words that happen to hash to a valid word), but the chances are low if the vector is large enough. Normally a large vector would be a problem for limited memory, but since it would be sparse, it should be easily compressible.

Turns out someone beat me to... by forty years. There's an IEEE paper from 1982 by Doug McIlroy which lays out how he managed to fit a spell checker for 30,000 words (250 kilobytes) in a 64 kilobyte machine. From the abstract:

Stripping prefixes and suffixes reduces the list below one third of its original size, hashing discards 60 percent of the bits that remain, and data compression halves it once again.

So, potentially, hashing and compression could cut the wordlist down to a quarter of the size.

It appears Wordle uses a 13,000 word (72 kilobyte) list of what it will accept. Even a quarter of that, 18 kilobytes, is still rather large for a Model T, so it may make sense to use a smaller corpus.

  • SCOWL makes it easy to create a list of the most frequent five letter words. For example, here is a list of 7,000 words (35 kilobytes), but the size is flexible since the words are partitioned into frequency bins.

  • Alternately, one could use SCOWL's list of least common words and subtract them from Wordle's list, that way unusual words that Wordle knows about but SCOWL doesn't will be kept. Here's a list of 7500 words (44 kilobytes) created that way.

Currently, M100LE takes up about 8KB of storage (for the program and one year's worth of words). I do not know how much RAM is required at runtime, but I would not expect it to be more than a kilobyte. On a Model 200, which has only 19KB of RAM free for BASIC to use, That'd leave about 10KB for a wordlist to be stored plus the extra code to access it plus any extra RAM usage.

I believe, but am not sure, that, since files are actually already in RAM, a BASIC program can access the data without having to load up a second copy into memory. If so, it would be tight, but possible!

@hackerb9
Copy link
Contributor Author

hackerb9 commented Aug 2, 2022

I believe, but am not sure, that, since files are actually already in RAM, a BASIC program can access the data without having to load up a second copy into memory. If so, it would be tight, but possible!

I wrote a test program on my Tandy 200 and I am able to instantly load up the word of the day by reading it from memory instead of looping over LINE INPUT #1 as M100LE currently does. I do not know if the NEC PC-8201A has the same RAM directory structure, but I bet it does since that seems to have been something that came from their common evolutionary ancestor, the Kyocera.

Basic method for randomly accessing files in RAM:

  1. CLEAR to make memory locations sane.
  2. Check PEEK(1) to determine machine type.
  3. Read RAM directory at 63842 (M100) or 62034 (T200)
  4. Each entry is eleven bytes:
    • 1: File attributes
    • 2,3: File address in memory (little endian)
    • 4-9: File name before the dot (padded with spaces at end if necessary)
    • 10,11: File name extension after the dot (starts with space if no extension)
  5. Keep reading filenames until "WL2022.DO" is found
  6. Let X←File address in memory
  7. Let DY←Ordinal day number ("Julian date")
  8. Today's word can be found at PEEK( X + (DY-1)*7)

@bgri
Copy link
Owner

bgri commented Aug 3, 2022 via email

@bgri
Copy link
Owner

bgri commented Aug 3, 2022 via email

@hackerb9
Copy link
Contributor Author

hackerb9 commented Aug 3, 2022

Lovely! I think the NEC has the same process, perhaps accessed slightly differently.

Excellent. My main computer died in a recent heatwave, but I'll see if I can jury-rig something to send over my test program.

I was wondering, since we know the 'guess' word to be tested (against a valid word as well as against the daily word), is there a way to speed up the test by indexing against the hash? (I don't understand how hashing works very well).

Hashing already speeds up the checking by (essentially) using the guess word as an index into the array of valid words.

Think of hashing like a checksum: a magic function that, given a bunch of data — such as the string "ABOVE" — returns a number — such as 49989. Ideally, it acts like a blackbox: giving apparently random output, uniformly distributed.

@hackerb9
Copy link
Contributor Author

hackerb9 commented Aug 3, 2022

Here's a sample program that can randomly access the wordlist file directly from RAM:

0 REM RNDACC by hackerb9 2022
1 REM Random access to files in RAM.
2 ' This program can read directly
3 ' from a file without OPENing it.
4 ' When you just need a small bit
5 ' of a large file, this is faster.
6 ' 
7 ' Files change their location in RAM,     moving aside as other files grow. 
8 ' Note: EDIT modifies a hidden file,      but not the directory pointers!
9 ' CLEAR updates the RAM directory.
10 CLEAR
14 ' Ram Directory address (Anderson's "Programming Tips" gives RD=63842 for M100 and 62034 for T200.)
15 IF PEEK(1)=171 THEN RD=62034: ELSE RD=63842
17 ' WL20xx.DO is the wordle wordlist for each day in 20xx.
18 WL$="WL20"+RIGHT$(DATE$, 2)+".DO"
19 ' Search directory for "WL20xx.DO" 
20 FOR A = RD TO RD+11*55 STEP 11
29 ' Attribute flag: See Oppedahl's "Inside the TRS-80 Model 100" for details.
30 FL=PEEK(A) 
39 ' File address in memory
40 X=PEEK(A+1)+256*PEEK(A+2)
45 FN$=""
50 FOR T = A+3 TO A+8
70 C$=CHR$(PEEK(T))
72 ' Filenames are padded with spaces.
73 IF C$=" " THEN T=A+8: GOTO 80
75 FN$=FN$+C$
80 NEXT
89 ' BASIC, TELCOM, have no extension.
90 IF PEEK(A+9)=ASC(" ") THEN 150
100 EX$=CHR$(PEEK(A+9))+CHR$(PEEK(A+10))
110 FN$=FN$+"."+EX$
150 ' Got filename in FN$
160 REM PRINT FN$, X
170 IF FN$=WL$ THEN 200
180 NEXT
199 END
200 REM Found WL20xx.DO. Now access it.
210 INPUT "Enter an ordinal date (1 to 366)"; DY
220 DY=DY-1
228 ' X is WL20XX.DO's address in RAM
229 ' Format is 5 letters + CR + LF.
230 FOR T = X+DY*7 TO X+DY*7+5
240 PRINT CHR$(PEEK(T));
250 NEXT
260 PRINT

It should work on any of the Tandys, but I'm curious if it works on your NEC.

@bgri
Copy link
Owner

bgri commented Aug 4, 2022 via email

@hackerb9
Copy link
Contributor Author

hackerb9 commented Aug 4, 2022

That is good information. It looks to be the same 11-byte format, plus it mentions something I didn't know before: I can detect the DIREND by looking for 0xFF in the Flag byte. Going to crash for tonight myself, but this is quite promising.

@hackerb9
Copy link
Contributor Author

hackerb9 commented Aug 6, 2022

I've improved the random access program so that it should handle the NEC 8201A. Does this work on your machine?

Click to see RNDACC.DO
0 REM RNDACC by hackerb9 2022
1 REM Random access to files in RAM.
2 ' This program can read directly
3 ' from a file without OPENing it.
4 ' When you just need a small bit
5 ' of a large file, this is faster.
6 ' 
7 ' Files change their location in RAM,     moving aside as other files grow. 
8 ' Note: EDIT modifies a hidden file,      but not the directory pointers!
9 ' CLEAR refreshes the pointers.
10 CLEAR
12 ' HW ID. 51=M100, 171=T200, 148=NEC,      35=M10, 225=K85
13 ID=PEEK(1)
14 ' Ram Directory address. (Anderson's "Programming Tips" gives RD=63842 for M100 and 62034 for T200.)
15 ' (Gary Weber's NEC.MAP gives RD=63567, but we can skip the system files by starting at 63633.)
16 RD=-( 63842*(ID=51) + 62034*(ID=171) + 63633*(ID=148) )
17 ' WL20xx.DO is the wordle wordlist        for each day in 20xx.
18 WL$="WL20"+RIGHT$(DATE$, 2)+".DO"
19 ' Search directory for "WL20xx.DO" 
20 FOR A = RD TO 65535 STEP 11
29 ' Attribute flag: See Oppedahl's "Inside the TRS-80 Model 100" for details.
30 FL=PEEK(A) 
39 ' Stop at end of directory (255)
40 IF FL=255 THEN 300
49 ' X is file address in memory
50 X=PEEK(A+1)+256*PEEK(A+2)
59 ' Add filename all at once for speed
60 FN$=CHR$(PEEK(A+3)) + CHR$(PEEK(A+4)) + CHR$(PEEK(A+5)) + CHR$(PEEK(A+6)) + CHR$(PEEK(A+7)) + CHR$(PEEK(A+8)) + "." + CHR$(PEEK(A+9)) + CHR$(PEEK(A+10))
69 ' Got filename in FN$
70 PRINT FN$, X
80 IF FN$=WL$ THEN 200
90 NEXT A
99 GOTO 300
200 REM Found WL20xx.DO. Now access it.
210 INPUT "Enter an ordinal date (1 to 366)"; DY
220 DY=DY-1
228 ' X is WL20XX.DO's address in RAM
229 ' Format is 5 letters + CR + LF.
230 FOR T = X+DY*7 TO X+DY*7+5
240 PRINT CHR$(PEEK(T));
250 NEXT
260 PRINT
299 END
300 REM File not found
310 PRINT "Error: File ";WL$;" not found."
320 END

This version is also much faster because I hadn't realized previously how slow BASIC was at repeated string concatenation. Now, I create the filename from the Ram Directory entry all at once, but the downside is the filename is padded with spaces if it has less than six characters. (E.g., ABC␠␠␠.DO.) That's okay for this test program as I mainly wanted to print the directory listing for debugging. The actual M100LE code could use PEEK to compare each filename directly to WL2022DO and skip doing any string concatenation at all.

I should note that, to be correct, this program ought to check each entry's attributes flag to make sure the file hasn't been KILLed. Skipping invalid files can be an optimization in the M100LE code.

@hackerb9
Copy link
Contributor Author

hackerb9 commented Aug 7, 2022

I just tested and all CR and LF can be removed for 29% smaller wordlists, if M100LE uses the random access method instead of LINE INPUT #1.

File Bytes Lines Chars
per entry
WL2022.DO 2555 365 7
WL2022.DA 1825 1 5

Side note: While it is not as easy, the Tandy text editor can still edit the wordlist even though it appears as a single line of 1825 characters (1830 for leap years). That's surprising considering that other machines of the era had much shorter line length limitations. (VAX/VMS was 255, IIRC).

I am mulling using a different extension, like .DA, instead of the usual .DO to signify to people that they probably want to treat it as a raw data file not a text document. The Tandy computers seem to accept any extension that starts with D , so it works fine on my machine, but I wonder about your NEC 8201A.

@hackerb9
Copy link
Contributor Author

hackerb9 commented Aug 8, 2022

I'm going to separate the speed up from random access to a separate issue so that this one can focus on the spelling dictionary.

@bgri
Copy link
Owner

bgri commented Aug 8, 2022 via email

@bgri
Copy link
Owner

bgri commented Aug 8, 2022 via email

@hackerb9
Copy link
Contributor Author

hackerb9 commented Aug 9, 2022

While it would be cool to try and fit the spelling
dictionary in, if it's missing, does it actually detract from the game? As
it is, yes, we'll accept 'XXXXX' as a guess if you choose to use that as a
strategy to solve the word. Was it fun? Was it more fun being told 'XXXXX'
isn't a word?

And is it worth the tradeoff that to use the spelling dictionary, we'll eat
a large portion of your free RAM?

Good questions. While I think it does add to the game, it doesn't add much. Let's put the spelling dictionary on a back burner to simmer.

I guess we could make it an optional play mode if they have the RAM
available -- a light version for folk without the available space, and the
full experience for someone with a REX, or a second 32k RAM bank just for
the game, etc.

I was hoping to get it to work in a stock M100, but using another RAM bank is worth considering. The NEC had two RAM banks, right? And I think the Olivetti M10 had that option as well. My Tandy 200 has three RAM banks, but each is only 24K.

@bgri
Copy link
Owner

bgri commented Aug 9, 2022

Good questions. While I think it does add to the game, it doesn't add much. Let's put the spelling dictionary on a back burner to simmer.

Sounds good. Neat feature to have... but yeah, more pressing things.

The NEC had two RAM banks, right? And I think the Olivetti M10 had that option as well. My Tandy 200 has three RAM banks, but each is only 24K.

Yep, with all the RAM sockets populated the NEC has bank #1 and bank #2 available (32k ea). There is an expansion port on the left side that allows for another 32k (bank #3). But (to my knowledge) there's no easy way to pass data between them...

@bgri bgri added the enhancement New feature or request label Aug 18, 2022
@hackerb9
Copy link
Contributor Author

Just a note for future selves: Using the RAM Directory to access files in storage is a perfect way to keep a large bit vector.

@bgri
Copy link
Owner

bgri commented Oct 11, 2022 via email

@hackerb9
Copy link
Contributor Author

hackerb9 commented Oct 16, 2022

Another note for our future selves: A “Bloom filter” may work for spell checking. The short of it: Like my proposed data structure above, it saves space by allowing a chance of incorrectly accepting words. Bloom filters save even more memory by using multiple hash functions to reduce the likelihood of a false positive, which allows smaller bit arrays to be used. See the Wikipedia description of Bloom Filters. Sample Python code is here. While not necessary for implementation, it is interesting to see the probability of false positives (accepting words which aren't actually in the word list): see Probability in Data Science.

Additionally, we should look into “Prefix Sets”. Google Chrome used to use Bloom filters, but a decade ago switched to Prefix Sets for a 33% space saving. I know nothing about Prefix Sets, but if they are significantly more complicated or slower than Bloom filters, then they may not be appropriate for a Model 100.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants