## Word games

The purpose of these exercises is to give you some practice with user-defined classes,
in particular, practice at adding methods and refining your class definitions
as the functionality of the class becomes richer.

## Spelling Bees

The New York Times games section has a game called **Spelling Bee** which presents the user with the following sort of information:

<figure>
<center>
<img 
src='https://gawron.sdsu.edu/spelling_bee_06_03_25.png'
     width=250
     />
<figcaption>A Spelling Bee puzzle (NYT)</figcaption>
</center>
</figure>

We are given a set of seven letters we'll call  `candidates`,  in this case:

```python
{'a', 'g', 'i', 'l', 'm', 'n', 'u'}
```

The goal is to construct as many words as possible using only the letters in `candidates`. Valid words must be at least four letters long, may contain duplicate instances of the same letter, and must contain the center letter (here, *u*).

For example, in the game above, *mull*, *glum*, and *lunging* are valid words,  but *mangling* (missing the center letter), *gum* (too short) and *mauls* (contains a letter not in `candidates`) are not.

Our task will be to implement a function that returns all valid words built from `candidates`. 

We will call the list of English words satisfying our minimum length constraint
`english_words`.  We will call the set of possible letters the `upper_bd` set.  We will allow it to be of any size. 
And we will allow there to be a (possibly disjoint) set of required letters  (the `lower_bd` set).  We can use
`set(wd)` to compute the set of letters in a word string.  Then
we want our function to return the following set:

$$
\lbrace \text{wd} \in \text{english_words} \mid \text{lower_bd} \subseteq \text{set}(\text{wd}) \subseteq \text{upper_bd} \rbrace
$$
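The subset conditions in this definition map directly onto Python's set comparison operators. Here is a minimal sketch on a hand-picked word list; `toy_words` is a made-up stand-in for `english_words` (not the `nltk` corpus), and the length constraint is assumed to have been applied already.

```python
# Toy stand-in for english_words (an assumption for illustration only)
toy_words = ["mull", "glum", "lunging", "mangling", "mauls"]

lower_bd = set("u")          # required letters (the center letter)
upper_bd = set("limgnau")    # every letter that may appear

# On sets, <= means "is a subset of"; comparisons can be chained
valid = [wd for wd in toy_words if lower_bd <= set(wd) <= upper_bd]
print(valid)   # ['mull', 'glum', 'lunging']
```

The words *mangling* and *mauls* drop out for the reasons given earlier: the first lacks the required letter and the second contains a letter outside `upper_bd`.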


We will use the `nltk` words corpus as `english_words`. Since it's constructed from
naturally occurring texts, this word set actually contains
a number of forms not accepted  in playing the *Spelling Bee* game online (proper names, extremely rare words,
and misspellings), but for the sake of clean code and simplicity, let's set those problems
aside in this demonstration exercise.

Here's the code:

In [416]:
from nltk.corpus import words

corp = words.words()
min_length= 4
corp = [wd for wd in corp if len(wd) >= min_length]

def find_words(lower_bd, upper_bd, candidates=corp):
    """
    Return the result of filtering candidates down to::
    
    { wd in candidates | lower_bd <= set(wd) <= upper_bd }
    """
    
    lower_bd, upper_bd = set(lower_bd),set(upper_bd)
    # what's required must also be allowed
    upper_bd.update(lower_bd)
    return [wd for wd in candidates if upper_bd >= (letterset := set(wd)) and lower_bd <= letterset ]

In [417]:
# Using the game pictured above as our example
upper_bound,lower_bound = "limgna","u"
L1 = find_words(lower_bound, upper_bound)
L1[:25]

['agua',
 'alalunga',
 'algum',
 'almug',
 'alnuin',
 'alula',
 'alum',
 'alumina',
 'aluminium',
 'aluminum',
 'alumium',
 'alumna',
 'alumnal',
 'alumni',
 'amamau',
 'ammu',
 'amula',
 'amulla',
 'amunam',
 'anagua',
 'angula',
 'anilau',
 'annual',
 'annul',
 'aula']

In [418]:
len(L1)

152

In [273]:
# Find words that contain "b" and use only letters found in the word "habitability"
L2 = find_words("b", "habitability")
L2[60:70]

['blay',
 'byth',
 'haab',
 'hability',
 'habit',
 'habitability',
 'habitably',
 'habitally',
 'habitat',
 'hillbilly']

### Coding point: Walrus operator

The code above contains the expression:

```python
[wd for wd in candidates if upper_bd >= (letterset := set(wd)) and lower_bd <= letterset ]
```

What does that `:=` mean?  It's called the Walrus operator and it is worth a little discussion.

Let

In [405]:
S = set("abc")

Then

In [406]:
(T := S - {"b"}).issubset(S)

True

is shorthand for:

In [409]:
T = (S - {"b"})
T.issubset(S)

True

In general

```python
<name> := <expression>
```

is an expression whose value is the same as `<expression>`.  However the assignment 

```python
<name> = <expression>
```

has also been performed.  Note that using a normal assignment in the example above is a syntax error:

In [408]:
(T = S - {"b"}).issubset(S)

SyntaxError: invalid syntax. Maybe you meant '==' or ':=' instead of '='? (2306610694.py, line 1)

because 

```python
T = S - {"b"}
```

is not an expression.  It has no value.  Python assignments are part of a larger class of Python terms
called **statements** which include deletions (`del dd[key]`) and `if`-constructions (`if X < 2: X += 3`).
Statements have no values. The idea is that either they just do something that changes the computational state or they are complex blocks of code with no part that can naturally be thought of as the value for the whole block.   Expressions, which include most Python terms, do have values. The expression `3+4`, for example, has the value `7`.
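One way to probe the distinction yourself is the built-in `eval`, which accepts only expressions. The following is a small sketch; the variable names `value`, `assignment_rejected`, and `walrus_value` are chosen here purely for illustration.

```python
# eval() evaluates an expression and returns its value
value = eval("3 + 4")
print(value)   # 7

# An assignment is a statement, so eval() refuses it
assignment_rejected = False
try:
    eval("T = 3 + 4")
except SyntaxError:
    assignment_rejected = True
print(assignment_rejected)   # True

# The := form is an expression, so eval() accepts it and returns its value
walrus_value = eval("(T := 3 + 4)")
print(walrus_value)   # 7
```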

Terms containing the so-called **walrus operator** (`:=`) have the form `<name> := <expression>`; they are themselves expressions (which therefore have values), but, simultaneously, they change the computational state by executing assignments.  

They are a shortcut.  Generally it is considered clearer and better computational style to avoid the shortcut.
Write

```python
T = (S - {"b"})
T.issubset(S)
```

instead of 

```python
(T := S - {"b"}).issubset(S)
```

However using the Walrus operator sometimes saves computation. This is the case in the code above.

```python
[wd for wd in candidates if upper_bd >= (letterset := set(wd)) and lower_bd <= letterset ]
```

could have been written

```python
[wd for wd in candidates if upper_bd >= set(wd) and lower_bd <= set(wd) ]
```

But then the computation `set(wd)` would have to be done twice.  This conversion to a set at the
very least involves removing duplicates, which could be costly if `wd` is a large string.  Another idea

```python
[wd for wd in candidates if letterset = set(wd) and upper_bd >= letterset and lower_bd <= letterset ]
```

raises a `SyntaxError`.  This is because terms formed with the `and` operator are expressions and expressions can contain expressions, but not statements.  So we can see that  by providing an expression that executes an assignment inside it, the walrus operator serves a real need; but it's worth pointing out that even in this example there are conflicting considerations.  One is saving a computation; the other is writing clear and simple code.  You complicate the code with something like the Walrus operator when the tradeoff in computation time is worth it. If your strings are always short (English words probably always count as short) it may not be worth it.
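Whether the saved computation matters is an empirical question. The sketch below times both versions with the standard `timeit` module on a synthetic word list; `toy_words` and the two function names are assumptions made for this illustration, and the timings will vary from machine to machine.

```python
import timeit

# Synthetic stand-in for the word list (an assumption, not the nltk corpus)
toy_words = ["mull", "glum", "lunging", "mangling", "mauls", "alumni"] * 1000
lower_bd, upper_bd = set("u"), set("limgnau")

def with_walrus():
    # set(wd) is computed once per word
    return [wd for wd in toy_words
            if upper_bd >= (letterset := set(wd)) and lower_bd <= letterset]

def without_walrus():
    # set(wd) may be computed twice per word
    return [wd for wd in toy_words
            if upper_bd >= set(wd) and lower_bd <= set(wd)]

# Both versions select the same words; only the running time may differ
same_result = with_walrus() == without_walrus()
print(same_result)   # True
print("walrus:   ", timeit.timeit(with_walrus, number=20))
print("no walrus:", timeit.timeit(without_walrus, number=20))
```

It is also worth knowing that a chained comparison such as `lower_bd <= set(wd) <= upper_bd` evaluates its middle operand only once, so it avoids the repeated `set(wd)` without any assignment at all.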

**Footnote**:  A final option which may occur to the attentive reader is 

```python
[wd for wd in candidates if (letterset := set(wd)) and upper_bd >= letterset and lower_bd <= letterset ]
```

which is valid (the parentheses around the assignment are required), and arguably clearer; but it is not quite equivalent. The if-condition will fail whenever `letterset` is the empty set, so empty-string `wd`s will be filtered out, even when the lower bound is the empty set.
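The difference only shows up when the word list contains an empty string. Here is a minimal sketch with a made-up three-item list and no required letters:

```python
candidates = ["", "glum", "mauls"]
lower_bd, upper_bd = set(), set("limgnau")   # no required letters

# Version from the main text: the empty string passes both subset tests
original = [wd for wd in candidates
            if upper_bd >= (letterset := set(wd)) and lower_bd <= letterset]

# Footnote version: an empty letterset is falsy, so "" is dropped
footnote = [wd for wd in candidates
            if (letterset := set(wd)) and upper_bd >= letterset and lower_bd <= letterset]

print(original)   # ['', 'glum']
print(footnote)   # ['glum']
```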

## Word Information States

Consider the problem of representing partial information about a computational or linguistic object.

To be concrete, let's consider partial information about a **word**, and to be even more concrete,
let's consider the information you acquire about a word as you are filling
in a crossword puzzle.  The highlighted squares in the 
image below show what we know at a particular stage in
solving the New York Times mini-puzzle. 

<figure>
<center>
<img 
width=500
src='https://gawron.sdsu.edu/mini_crossword_game_partial.png'
     />
<figcaption>A partially filled mini crossword (NYT June 9, 2025)</figcaption>
</center>
</figure>

Let's call the things we know about our word **constraints**.  The constraints on  the word at 4-down are that
it has five letters; the first letter is "p"; the third and fourth are "i" and "o" respectively.

Let's cook up a class (call it `WordInformationState`) for representing constraints on a word.  The intuition guiding the implementation is that a `WordInformationState` should behave like a word string as much as possible.  It should print like a string; it should be indexable like a string (`word[2]`) and it should print and return the letter at an index when it is known and a place-holder when it is not. It should also enforce length constraints.

In [179]:
from nltk.corpus import words

corp = words.words()

class WordInformationState ():
    
    # self._position_dict_[idx] is the value at position idx 
    # in self's character sequence
    # We implement it as a dictionary so that there can
    # be positions whose letter is unknown.
   
    # This one should be accessible to users
    blank_char = "*"
    
    def __init__ (self,position_dict,length,blank_char=None):
      
        self.len = length
        if blank_char is not None:
            self.blank_char = blank_char
        # We update the word state with positional info and simultaneously
        # check that each position index is a valid index
        self._position_dict_ = {}
        for (k,v) in position_dict.items():
            self._index_check_(k)
            self._position_dict_[k] = v
          
        
    def _index_check_(self,idx):
        # check that position index idx is a valid index 
        assert isinstance(idx,int), f"{idx} is not an integer and therefore not a valid position index"
        assert idx < self.len, f"{idx} is too large a position index for a word of length {self.len}"
        
    def __getitem__ (self, k):
        self._index_check_(k)
        try:
            return self._position_dict_[k]
        except KeyError:
            return self.blank_char
        
    def __repr__ (self):
        #return "".join([self._position_dict_[k] if k in self._position_dict_ else "*" 
        #                for k in range(self.len)])
        return "".join([self[k] for k in range(self.len)])

# A sample instance of a WordInformationState incorporating
# the length and position constraints of 4 down in the partially
# solved crossword above.
wis = WordInformationState (position_dict={0:"p",2:"i",3:"o"}, length=5)
wis

p*io*

In [286]:
wis[3]

'o'

In [288]:
wis[1]

'*'

In [287]:
wis[6]

AssertionError: 6 is too large a position index for a word of length 5

Building on the class definition above, complete the following tasks:

1.  Your first task is to add a new method to the `WordInformationState` class (call the new method `matches`), which has two arguments, the word information-state (call it `self`) and a word (call it `word`).  The `matches` method returns  `True` if `word` satisfies the constraints in `self`. Otherwise, it returns `False`. 
In particular in order for `matches` to return `True`, `word` has to be the right length 
and it has to have the required letters at the positions `self` has information about.
2.  Your second task is to add a second new method to the `WordInformationState` class (call it `filter_words`); `filter_words` should also  have two arguments, the word information-state (call it `self`) and a word sequence.  The `filter_words` method returns the words in the sequence which satisfy the constraints in `self`. 

Before testing your new methods don't forget to re-execute the cell defining the class
and make sure you create your new instance (`wis`) after redefining the class.


Here are some test cases for both of the new methods; `wis` is the `WordInformationState` instance defined
above.  It incorporates the constraints from the mini-crossword example.

Since we have only partial information on the word in question, multiple character
sequences may be compatible with what we know:

In [164]:
word_list = []
# This satisfies the constraints in wis and it happens to be the right word for the  crossword
word_list.append("prior")
# This also satisfies the constraints in wis and is also a word (though a weird one),
# but it happens not to be the right word for the  crossword
word_list.append("prion")
# This is not a word, but it satisfies the constraints in wis.
word_list.append("priod")
# This is a word, but it does not satisfy the crossword's constraints
word_list.append("proof")
for word in word_list:
    print(word, wis.matches(word))

prior True
prion True
priod True
proof False


To test `filter_words`, use the `corp` word_list defined in the previous example.  Even limiting
our candidates to a set of known words, multiple words may still satisfy the constraints:

In [165]:
wis.filter_words(corp)

['prion', 'prior']

Here's another word state to test `filter_words` on:

In [166]:
wis2 = WordInformationState ({0:"h",7:"l",8:"y"},9)
wis2.filter_words(corp)

['habitably',
 'habitally',
 'hackingly',
 'haggardly',
 'haggishly',
 'haltingly',
 'hangingly',
 'haplessly',
 'haplolaly',
 'harmfully',
 'hastately',
 'hatefully',
 'haughtily',
 'healingly',
 'healthily',
 'heartedly',
 'heatingly',
 'hedgingly',
 'heedfully',
 'heinously',
 'helically',
 'hellishly',
 'helpfully',
 'helpingly',
 'heritably',
 'hideously',
 'hillbilly',
 'hintingly',
 'hissingly',
 'hoggishly',
 'holdingly',
 'homophyly',
 'homostyly',
 'honeyedly',
 'honorably',
 'hootingly',
 'hopefully',
 'hoppingly',
 'hostilely',
 'howlingly',
 'huffingly',
 'huffishly',
 'hugeously',
 'huggingly',
 'hurriedly',
 'hurtfully',
 'husbandly',
 'hushfully',
 'hushingly',
 'hypertely']

## A further wrinkle

The last exercise is to modify the definition of the `WordInformationState` class to allow two new kinds of letter constraints on the unknown word: upper and lower bound constraints. These kinds of constraints were discussed in the **Spelling Bees** section of this notebook.

That is, in addition to length and positional information, we allow constraints of the following form:

1.  Any letter in the lower bound set must occur in a valid candidate word.
2.  Only letters in the upper bound set may occur in a valid candidate word.
3.  When defining a `WordInformationState` any of the following may be supplied: a lower bound set, an upper bound set, a position dict, and a length.  None of those **must** be supplied.

Let's take an example instance of your new `WordInformationState` class and call it `wis`. Here are some things to think about.

1.  Modify the `wis.matches()`  method to enforce lower and upper bound constraints.  The code in `find_words` defined in the **Spelling Bees** section should be of help.
2.  There are potentially consistency issues among the constraints.  For example, suppose the `position_dict` contains a letter not in the upper bound set. What should you do about that? Note that this is the tip of an iceberg.  What if our position dict incorporates constraints not satisfiable by any English word, for example, that "q" must be immediately followed by "b"? What should we do about that?  Let's assume that we have a default corpus that includes all English words, possibly filtered by some additional constraints such as minimum length. Consider using the `nltk` words corpus to help with that. For convenience make the default corpus an attribute of each `WordInformationState` instance.  Can we use that to help with all our consistency issues?
3.  You should be able to create a `wis` that captures the Spelling Bee game in the Spelling Bee section of this notebook; `wis.filter_words(wis.default_corp)` should give the set of solution words.  To better capture the constraints of the game, filter `wis.default_corp` to only include words of length at least 4.  You can then check the result of `wis.filter_words(wis.default_corp)` against the word set produced by `find_words` in the Spelling Bee section. 
4. You should also be able to create a `wis` that captures the position and length constraints of 4-down in the crossword puzzle example above; `wis.filter_words(wis.default_corp)` should then give you all possible fillers for 4-down.
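As a starting point for thinking about item 2, the narrowest kind of inconsistency (a known letter that the upper bound forbids) can be detected without any corpus. The helper below is a hypothetical sketch, not part of the class defined above; it says nothing about contradictions with the facts of English, which need a corpus-based check.

```python
def out_of_bound_positions(position_dict, upper_bound):
    """
    Return the positions whose known letter is not in upper_bound.
    An empty result means these two constraints are compatible.
    (Hypothetical helper for illustration only.)
    """
    allowed = set(upper_bound)
    return {idx: let for (idx, let) in position_dict.items()
            if let not in allowed}

print(out_of_bound_positions({0: "p", 2: "i", 3: "o"}, "prio"))  # {}
print(out_of_bound_positions({0: "p", 2: "i", 3: "o"}, "pri"))   # {3: 'o'}
```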

## Final wrinkle (negative information)

We have been using word games to concoct instances where we have partial information about the identity
of a word, and defining classes that represent that partial information.

Now consider a more difficult kind of game state which arises in playing the game **Wordle**.

[Game description]

At first glance Wordle may seem like more of the same.  You have the length constraint of 5 letters and when a guess letter is colored green you know the identity of the letter at a particular position.  When a guess letter is colored yellow you know a lower bound constraint: valid words must contain that letter.

However, there is more to it.  Wordle game states introduce two new kinds of information, both of which can be thought of as **negative** information.

1.  If a guess letter is colored black, that means that letter may **not** occur in a valid word.
2.  A guess letter colored yellow tells us more than just a lower bound constraint. If a guess letter at position k is colored yellow, that also means that a valid word may not contain that letter at position k.


> Introduce a new kind of internal data structure to explicitly represent the negative constraints and modify the `.matches()` method to enforce them.
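Before designing the data structure, it may help to see how the feedback from a single guess breaks down into the kinds of information just described. The function below is only a sketch under stated assumptions: the name `constraints_from_guess` and the color encoding (`"g"`, `"y"`, `"b"`) are invented here, and a black letter that is also green or yellow elsewhere in the same guess is treated as excluded only at its own position.

```python
from collections import defaultdict

def constraints_from_guess(guess, colors):
    """
    Translate one Wordle guess into constraint data.
    colors has one character per guess letter: "g" (green),
    "y" (yellow) or "b" (black).  (Hypothetical encoding.)
    """
    position_dict = {}                    # green: letter known at a position
    neg_position_dict = defaultdict(set)  # letters excluded at a position
    lower_bound = set()                   # letters that must occur somewhere
    for idx, (let, col) in enumerate(zip(guess, colors)):
        if col == "g":
            position_dict[idx] = let
            lower_bound.add(let)
        elif col == "y":
            lower_bound.add(let)
            neg_position_dict[idx].add(let)
    black_letters = set()                 # letters excluded everywhere
    for idx, (let, col) in enumerate(zip(guess, colors)):
        if col == "b":
            if let in lower_bound:
                # same letter is green/yellow elsewhere: exclude it here only
                neg_position_dict[idx].add(let)
            else:
                black_letters.add(let)
    return position_dict, dict(neg_position_dict), lower_bound, black_letters

result = constraints_from_guess("crane", "bbyby")
print(result)
```

For the guess `"crane"` with `"a"` and `"e"` yellow, this yields no positional information, the exclusions `{2: {'a'}, 4: {'e'}}`, the required letters `a` and `e`, and the black letters `c`, `r`, and `n`.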
    
    
We might then proceed in either of two ways.

1.  Enforce all constraints when the `WordInformationState` instance is created by calling `.matches()` on each word of the default corpus. Call the result the Filtered English Words.  This is a big efficiency win since constraint-matching only needs to be computed once, after which all constraints can be forgotten.  Then `.filter_words()` could simply be the intersection of the Filtered English Words with whatever word list is passed in as an argument to `.filter_words()`.   Note the hidden assumption here: We are assuming the default corpus contains all possible English words.  Therefore after the constraints are applied, the Filtered English Words contain all possible words that satisfy our constraints.  
2.  Proceed along the lines we outlined when we introduced our first wrinkle.  Maintain explicit representations of all constraints and enforce them on whatever word  `.matches()`  is called on.   In that case `.filter_words()` would call `.matches()` and the word set returned by `.filter_words()` might include words not in the default corpus.

With respect to distinguishing the two solution paths, consider the word "gleepsite", which is the title of a story by Joanna Russ, but is also a word used by a small community of real speakers, as she has explained.  It happens that it does not occur in our unfiltered default corpus.  Suppose that the word "gleepsite" satisfies all constraints used to define `wis`.  What should `wis.filter_words(["gleepsite"])` return?  Will both solutions (1) and (2) get this right?

We can have the best of both worlds if we implement solution 1 with two versions of `filter_words`, one
which applies the `wis` constraints while iterating through the word list (by calling `matches` on each word and without consulting the default corpus) and one which performs an intersection with the default corpus.  Let's call the first version `filter_words_by_matching`. This is the version which will be used to create the Filtered English Words by filtering the default corpus when `wis` is created.

In [290]:
"gleepsite" in corp

False
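The contrast between the two versions can be seen in miniature with plain functions. Everything in this sketch is a stand-in chosen for illustration: `default_corpus` is a three-word toy list, and the `matches` function plays the role of `wis.matches` with an arbitrary constraint (nine letters, starting with "gl").

```python
# Toy stand-ins (assumptions for illustration only)
default_corpus = ["glimpsing", "sleepless", "glum"]

def matches(word):
    return len(word) == 9 and word.startswith("gl")

def filter_words_by_matching(word_list):
    # Applies the constraints directly; never consults the default corpus
    return [wd for wd in word_list if matches(wd)]

# Computed once, for example when the information state is created
filtered_english_words = set(filter_words_by_matching(default_corpus))

def filter_words_by_intersection(word_list):
    # Keeps only words already known to satisfy the constraints
    return [wd for wd in word_list if wd in filtered_english_words]

test_words = ["gleepsite", "glimpsing"]
print(filter_words_by_matching(test_words))      # ['gleepsite', 'glimpsing']
print(filter_words_by_intersection(test_words))  # ['glimpsing']
```

The intersection version silently drops any word missing from the corpus, even one that satisfies every constraint.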

### Debugging advice

To check your ideas for implementing the final wrinkle, make sure that you can represent the game state of
some Wordle game.  Here's an example of a partially completed game.  See if your solution to the final wrinkle
can represent what we know about the target word after these four guesses:

<p>
<figure>
<center>
<img 
src='https://gawron.sdsu.edu/wordle_archive_108.png'
     width=250
     />
<figcaption>An incomplete Wordle game</figcaption>
</center>
</figure>
</p>

One way to represent the above information (there are others):

In [347]:
# yellow and black letter info

black_letters = set("rtioclshmndpu")
# negative yellow letter info
neg_position_dict = {0:set("a"),
                     1:set("a"),
                     2:set("ae"),
                     3:set(),
                     4:set("e")}
# merge negative yellow and black letter info
neg_position_dict = {k:black_letters|s for (k,s) in neg_position_dict.items()}
# green letter info
position_dict = {
                 3:"a",
                 }

# yellow letter info
lower_bound = set("ae")

Note:  The solution to the Wordle game above (which is quite difficult) will be computed in
the exercise solution.
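To convince yourself that this representation captures the game state, you can test individual strings against it before touching the class. The function `satisfies` below is a hypothetical stand-alone check, not the class method from the solution; it repeats the data from the cell above so that it runs on its own, and `"zebaz"` is a made-up string used only to exercise the check.

```python
black_letters = set("rtioclshmndpu")
neg_position_dict = {0: set("a"), 1: set("a"), 2: set("ae"), 3: set(), 4: set("e")}
neg_position_dict = {k: black_letters | s for (k, s) in neg_position_dict.items()}
position_dict = {3: "a"}
lower_bound = set("ae")

def satisfies(word, length=5):
    """Check word against the game state above (hypothetical helper)."""
    if len(word) != length:
        return False
    # green letters: the known letter must be at its position
    if any(word[idx] != let for (idx, let) in position_dict.items()):
        return False
    # negative information: no excluded letter at any position
    if any(word[idx] in banned for (idx, banned) in neg_position_dict.items()):
        return False
    # yellow letters: every required letter must occur somewhere
    return lower_bound <= set(word)

print(satisfies("vegan"))   # False: "n" is a black letter
print(satisfies("weave"))   # False: position 3 is not "a"
print(satisfies("zebaz"))   # True
```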

## Solutions

### Information states solution

We implement `matches` and `filter_words`.

In [278]:
class WordInformationState ():
    
    # self._position_dict_[idx] is the value at position idx 
    # in self's character sequence
    # We implement it as a dictionary so that there can
    # be positions whose letter is unknown.
   
    # This one should be accessible to users.
    # character to be used as  filler when the letter
    # at a position is unknown.
    filler_char = "*"
    
    def __init__ (self,position_dict,length,filler_char=None):
      
        self.len = length
        if filler_char is not None:
            self.filler_char = filler_char
        # We update the word state with positional info and simultaneously
        # check that each position index is a valid index
        self._position_dict_ = {}
        for (k,v) in position_dict.items():
            self._index_check_(k)
            self._position_dict_[k] = v
          
        
    def _index_check_(self,idx):
        # check that position index idx is a valid index 
        assert isinstance(idx,int), f"{idx} is not an integer and therefore not a valid position index"
        assert idx < self.len, f"{idx} is too large a position index for a word of length {self.len}"
        
    def __getitem__ (self, k):
        """
        This allows us to look up the letter at position k using
        list-indexing syntax: self[2] returns the letter at position 2 if
        known, or the filler character if not known.
        """
        self._index_check_(k)
        try:
            return self._position_dict_[k]
        except KeyError:
            return self.filler_char
        
    def __repr__ (self):
        #return "".join([self._position_dict_[k] if k in self._position_dict_ else "*" 
        #                for k in range(self.len)])
        return "".join([self[k] for k in range(self.len)])
    
    def matches (self, word):
        if not len(word) == self.len:
            return False
        for (idx,let) in self._position_dict_.items():
            if not word[idx] == let:
                return False
        return True
    
    def filter_words (self, word_list):
        return [wd for wd in word_list if self.matches(wd)]

In [282]:
wis = WordInformationState ({0:"p",2:"i",3:"o"},5)
wis

p*io*

In [279]:
wis.filter_words(corp)

['prion', 'prior']

In [178]:
wis2 = WordInformationState ({0:"h",7:"l",8:"y"},9)
wis2.filter_words(corp)

['habitably',
 'habitally',
 'hackingly',
 'haggardly',
 'haggishly',
 'haltingly',
 'hangingly',
 'haplessly',
 'haplolaly',
 'harmfully',
 'hastately',
 'hatefully',
 'haughtily',
 'healingly',
 'healthily',
 'heartedly',
 'heatingly',
 'hedgingly',
 'heedfully',
 'heinously',
 'helically',
 'hellishly',
 'helpfully',
 'helpingly',
 'heritably',
 'hideously',
 'hillbilly',
 'hintingly',
 'hissingly',
 'hoggishly',
 'holdingly',
 'homophyly',
 'homostyly',
 'honeyedly',
 'honorably',
 'hootingly',
 'hopefully',
 'hoppingly',
 'hostilely',
 'howlingly',
 'huffingly',
 'huffishly',
 'hugeously',
 'huggingly',
 'hurriedly',
 'hurtfully',
 'husbandly',
 'hushfully',
 'hushingly',
 'hypertely']

### A further wrinkle solution

We implement lower bound and upper bound constraints, allowing them to be passed in as
container arguments to `WordInformationState` when the instance is created;
we also modify `matches` to enforce the intended constraints:

In [339]:
from nltk.corpus import words
from string import ascii_lowercase

corp = words.words()

class WordInformationState ():
    
    # self._position_dict_[idx] is the value at position idx 
    # in self's character sequence
    # We implement it as a dictionary so that there can
    # be positions whose letter is unknown.
   
    # This one should be accessible to users
    blank_char = "*"
    default_corp = corp
    
    def __init__ (self,position_dict=None,length=None,
                  lower_bound=None, upper_bound=None,
                  blank_char=None,default_corp=None):
      
        self.len = length
        if blank_char is not None:
            self.blank_char = blank_char
        # We update the word state with positional info and simultaneously
        # check that each position index is a valid index
        self._position_dict_ = {}
        if position_dict is not None:
            for (k,v) in position_dict.items():
                self._index_check_(k)
                self._position_dict_[k] = v
        if default_corp is not None:
            self.default_corp = default_corp
        self.lower_bound = lower_bound
        self.upper_bound = upper_bound
        if self.upper_bound is not None:
            self.upper_bound = set(upper_bound)
        else:
            self.upper_bound = set(ascii_lowercase)
        if self.lower_bound is not None:
            self.lower_bound = set(lower_bound)
        else:
            self.lower_bound = set()
        # what's required must also be allowed
        self.upper_bound.update(self.lower_bound)
        #consistency check
        assert len(self.filter_words(self.default_corp)) > 0, "The current constraints are not consistent"\
        " with the default corpus.  Check for a contradiction."
          
        
    def _index_check_(self,idx):
        # check that position index idx is a valid index 
        assert isinstance(idx,int), f"{idx} is not an integer and therefore not a valid position index"
        if self.len is not None:
            assert idx < self.len, f"{idx} is too large a position index for a word of length {self.len}"
        
    def __getitem__ (self, k):
        self._index_check_(k)
        try:
            return self._position_dict_[k]
        except KeyError:
            return self.blank_char
        
    def __repr__ (self):
        #return "".join([self._position_dict_[k] if k in self._position_dict_ else "*" 
        #                for k in range(self.len)])
        if self.len is not None:
            return "".join([self[k] for k in range(self.len)])
        elif len(self._position_dict_) > 0:
            max_pos = max(self._position_dict_)
            # include the last known position itself in the printed prefix
            return "".join([self[k] for k in range(max_pos + 1)]) + "..."
        else:
            return "..."

    def matches (self, word,verbose=False):
        if self.len is not None and not len(word)== self.len:
            if verbose: print("len")
            return False
        if self._position_dict_ is not None:
            for (idx,let) in self._position_dict_.items():
                # a word too short to have position idx cannot match
                if idx >= len(word) or not word[idx] == let:
                    if verbose: print("position",idx,let)
                    return False
        letterset = set(word)
        if not self.upper_bound >= letterset:
            if verbose: print("upper_bound")
            return False
        if not self.lower_bound <= letterset:
            if verbose: print("lower_bound")
            return False
        return True
     
    def filter_words (self, word_list):
        return [wd for wd in word_list if self.matches(wd)]
    
# A sample instance of a WordInformationState incorporating
# the length and position constraints of 4 down in the partially
# solved crossword above.
wis = WordInformationState (position_dict={0:"p",2:"i",3:"o"}, length=5)
wis

p*io*

In [243]:
wis.filter_words(wis.default_corp)

['prion', 'prior']

In [248]:
wis[0]

'p'

Add an upper bound constraint (allowing only the letters "p", "r", "i", and "o"):

In [238]:
wis2 = WordInformationState (position_dict={0:"p",2:"i",3:"o"}, length=5, upper_bound="prio")
wis2

p*io*

In [240]:
wis2.upper_bound

{'i', 'o', 'p', 'r'}

In [241]:
wis2.filter_words(wis.default_corp)

['prior']

Create an inconsistent word state (position dict contains letter not in upper_bound):

In [242]:
wis3 = WordInformationState (position_dict={0:"p",2:"i",3:"o"}, length=5, upper_bound="pri")
wis3

AssertionError: The current constraints are not consistent with the default corpus.  Check for a contradiction.

Create a word state inconsistent with the facts of English words:

In [340]:
wis4 = WordInformationState (position_dict={0:"q",2:"b",3:"o"}, length=5)
wis4

AssertionError: The current constraints are not consistent with the default corpus.  Check for a contradiction.

Create a word state with no constraints:

In [255]:
wis5 = WordInformationState ()
wis5

...

In [256]:
res = wis5.filter_words(wis.default_corp)
len(res)

236734

In [259]:
len(wis5.default_corp)

236736

The `wis` for the Spelling Bee game above.

In [401]:
min_length = 4
default_corpus = [word for word in corp if len(word) >= min_length]
upper_bound,lower_bound = "limgna","u"
wis6 = WordInformationState (lower_bound=lower_bound, upper_bound=upper_bound,
                             default_corp = default_corpus)
wis6

...

In [269]:
allowed = wis6.filter_words(wis6.default_corp)
len(allowed)

152

From the Spelling Bee section.

In [275]:
L1 == allowed

True

### The final wrinkle: solution

A more constrained word set with no proper names (Adam Kilgarriff's lemma set from the British National Corpus).

We'll use this below.

In [355]:
import pandas as pnd
fn="https://www.kilgarriff.co.uk/BNClists/all.num.o5"
#word_df = pd.read_csv(fn,index_col=1,sep=" ", names=["Freq","POS", "DocFreq"])
word_df = pnd.read_csv(fn,sep=" ", names=["Freq","Word", "POS", "DocFreq"])

In [359]:
punct = r"\?\+_!\.@\$%\^&\*\(\)=\[\]{}#<>`~'\",/\|~:;"
digits = "01234567890"


def isword (wd):
    """
    Let's filter out a host of nonwords.
    """
    if not isinstance(wd,str):
        return False
    if wd.startswith("-"):
        return False
    for char in punct:
        if char in wd:
            return False
    for char in digits:
        if char in wd:
            return False
    return True

real_words = [word for word in word_df.iloc[1:]["Word"].unique() if isword(word)]

In [380]:
from nltk.corpus import words
from collections import defaultdict
from string import ascii_lowercase

corp = words.words()
corp = {wd for wd in corp if len(wd)==5}
#corp = [wd for wd in real_words if len(wd)==5]

class WordInformationState ():
    
    # self._position_dict_[idx] is the value at position idx 
    # in self's character sequence
    # We implement it as a dictionary so that there can
    # be positions whose letter is unknown.
    
    # self._neg_position_dict_[idx] is a set of values that may NOT occur at position idx 
   
    # This one should be accessible to users
    blank_char = "*"
    default_corp = corp
    
    def __init__ (self,position_dict=None, neg_position_dict=None, length=None,
                  lower_bound=None, upper_bound=None,
                  blank_char=None,default_corp=None):
      
        self.len = length
        if blank_char is not None:
            self.blank_char = blank_char
        # We update the word state with positional info and simultaneously
        # check that each position index is a valid index
        self._position_dict_ = {}
        if position_dict is not None:
            for (k,v) in position_dict.items():
                self._index_check_(k)
                self._position_dict_[k] = v
        # create a default dict whose initial default values are empty sets
        self._neg_position_dict_ = defaultdict(set)
        if neg_position_dict is not None:
            for (k,v) in neg_position_dict.items():
                self._index_check_(k)
                self._neg_position_dict_[k].update(v)
        if default_corp is not None:
            self.default_corp = default_corp
        self.lower_bound = lower_bound
        self.upper_bound = upper_bound
        if self.upper_bound is not None:
            self.upper_bound = set(upper_bound)
        else:
            self.upper_bound = set(ascii_lowercase)
        if self.lower_bound is not None:
            self.lower_bound = set(lower_bound)
        else:
            self.lower_bound = set()
        # what's required must also be allowed
        self.upper_bound.update(self.lower_bound)
        # enforce constraints on default corpus
        self.default_corp = self.filter_words_by_matching(self.default_corp)
        # Warn (rather than raise) when no word in the corpus satisfies the constraints
        if len(self.default_corp) == 0:
            print("The current constraints are not consistent"
                  " with the default corpus.  Check for a contradiction.")
          
        
    def _index_check_(self,idx):
        # check that position index idx is a valid index 
        assert isinstance(idx,int), f"{idx} is not an integer and therefore not a valid position index"
        if self.len is not None:
            assert idx < self.len, f"{idx} is too large a position index for a word of length {self.len}"
        
    def __getitem__ (self, k):
        self._index_check_(k)
        # Positions whose letter is unknown display as the blank character
        return self._position_dict_.get(k, self.blank_char)
        
    def __repr__ (self):
        #return "".join([self._position_dict_[k] if k in self._position_dict_ else "*" 
        #                for k in range(self.len)])
        if self.len is not None:
            return "".join([self[k] for k in range(self.len)])
        elif len(self._position_dict_) > 0:
            max_pos = max(self._position_dict_)
            # include the last known position itself, hence max_pos + 1
            return "".join([self[k] for k in range(max_pos + 1)]) + "..."
        else:
            return "..."

    def matches (self, word,verbose=False):
        if self.len is not None and not len(word)== self.len:
            if verbose: print("len")
            return False
        if self._position_dict_ is not None:
            for (idx,let) in self._position_dict_.items():
                # a word too short to have position idx cannot match
                if idx >= len(word) or not word[idx] == let:
                    if verbose: print("position",idx,let)
                    return False
        if self._neg_position_dict_ is not None:
            for (idx,set0) in self._neg_position_dict_.items():
                # positions beyond the end of the word impose no constraint
                if idx < len(word) and word[idx] in set0:
                    if verbose: print("neg position",idx,set0)
                    return False
        letterset = set(word)
        if not self.upper_bound >= letterset:
            if verbose: print("upper_bound")
            return False
        if not self.lower_bound <= letterset:
            if verbose: print("lower_bound")
            return False
        return True
     
    def filter_words_by_matching (self, word_list):
        return {wd for wd in word_list if self.matches(wd)}
    
    def filter_words (self, word_list):
        return self.default_corp.intersection(word_list)


# A sample instance of a WordInformationState incorporating
# the constraints learned in the Wordle game above
black_letters = set("rtioclshmndpu")
# negative yellow letter info
neg_position_dict = {0:set("a"),
                     1:set("a"),
                     2:set("ae"),
                     3:set(),
                     4:set("e")}
# merge negative yellow and black letter info
neg_position_dict = {k:black_letters|s for (k,s) in neg_position_dict.items()}
# green letter info
position_dict = {
                 3:"a",
                 }

# yellow letter info
lower_bound = set("ae")

wis = WordInformationState (position_dict=position_dict,
                            neg_position_dict=neg_position_dict,
                            lower_bound=lower_bound,
                            length=5)
wis

***a*

In [373]:
len(real_words)

113982

These are the only 5-letter words in the `nltk` words corpus compatible with the constraints (arguably, one of them is not English):

In [378]:
wis.default_corp

['bebay', 'begay', 'kebab']

So we filter against `real_words`, a word list we take to be more reliably English:

In [381]:
wis.filter_words(real_words)

{'kebab'}
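
To see why *kebab* survives, the same constraints can be checked by hand with plain Python sets, without the class. This is only a minimal sketch: the helper name `satisfies` and the variable names `yellow_misses`, `excluded`, `green` and `required` are hypothetical, and the function merely mirrors the checks made in `matches` (length, green positions, letters excluded per position, and the letter bounds).

```python
from string import ascii_lowercase

# Constraints copied from the Wordle example above
black_letters = set("rtioclshmndpu")
yellow_misses = {0: set("a"), 1: set("a"), 2: set("ae"), 3: set(), 4: set("e")}
excluded = {k: black_letters | s for (k, s) in yellow_misses.items()}
green = {3: "a"}
required = set("ae")

def satisfies(word, length=5):
    """Hypothetical helper mirroring the checks in WordInformationState.matches."""
    if len(word) != length:
        return False
    # every green letter must be in its known position
    if any(word[idx] != let for (idx, let) in green.items()):
        return False
    # no position may hold a letter excluded from it
    if any(word[idx] in letters for (idx, letters) in excluded.items()):
        return False
    # required letters <= letters of the word <= allowed letters
    return required <= set(word) <= set(ascii_lowercase)

print(satisfies("kebab"))   # True
print(satisfies("cabal"))   # False: 'c' is a black letter
```

Merging the black letters into every position's excluded set is what lets a single per-position test cover both kinds of negative information.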

#### The gleepsite example

In [397]:
min_length = 4
default_corpus = [word for word in words.words() if len(word) >= min_length]
upper_bound,lower_bound = set("gleepsite"),"p"
position_dict = {2:"e",3:"e"}
wis9 = WordInformationState (lower_bound=lower_bound, upper_bound=upper_bound,
                             default_corp = default_corpus,
                             position_dict=position_dict)

In [399]:
wis9.default_corp

{'epee',
 'epeeist',
 'sleep',
 'sleepless',
 'speel',
 'speelless',
 'steep',
 'steeple',
 'steepleless'}

The word *gleepsite* is not among the filtered English words, so its intersection with `["gleepsite"]` is empty:

In [391]:
wis9.filter_words(["gleepsite"])

set()

But matching the word directly against the constraints succeeds:

In [392]:
wis9.filter_words_by_matching(["gleepsite"])

{'gleepsite'}

This tells us that the word *gleepsite* meets the constraints on this word state.
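
The contrast between the two methods can be reproduced with plain sets. The sketch below is an illustration, not the class's own code: the `filtered` set is copied from the output of `wis9.default_corp` above, and the helper name `meets_constraints` is hypothetical.

```python
upper_bound, lower_bound = set("gleepsite"), set("p")
position_dict = {2: "e", 3: "e"}

# The filtered corpus, copied from the output of wis9.default_corp
filtered = {"epee", "epeeist", "sleep", "sleepless", "speel",
            "speelless", "steep", "steeple", "steepleless"}

def meets_constraints(word):
    """Hypothetical stand-in for wis9.matches: positions plus letter bounds."""
    if any(idx >= len(word) or word[idx] != let
           for (idx, let) in position_dict.items()):
        return False
    return lower_bound <= set(word) <= upper_bound

# filter_words is a membership test against the already filtered corpus
print(filtered & {"gleepsite"})                        # set()
# filter_words_by_matching tests only the constraints themselves
print(meets_constraints("gleepsite"))                  # True
print(all(meets_constraints(wd) for wd in filtered))   # True
```

So `filter_words` asks whether a word is both attested in the corpus and consistent with the constraints, while `filter_words_by_matching` asks only about consistency.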