##  Missing: A section introducing Wordl and the concept of a WordlBot.

In this notebook we move a significant part of the way toward building a Wordl-bot, a program that can play Wordl significantly better than you.  The  first thing we will focus on is a way of scoring candidate guesses.  Some guesses are  much better than others because on average they knock out more contenders than other candidates.  You need a scoring metric that measures this property: `metric(candidate, all_candidates) = score`.

In the rest of this note we give a quick sketch of how a scoring metric works. The intuition
we'll try to flesh out is that the best scoring metric knocks out more possibilities than others **on average**.
But what does that mean?

We saw in the last exercise that the guess "audio" left us with 85 candidates.  And we knew that because
we executed the function `compatible_words`.  But maybe there was a guess that would
have left us with only 50 candidates.  That would have been better. 
That's an insight, but not an immediately helpful one.
How did we arrive at the information that one guess left us with 85 candidates?
We computed that by knowing the coloring and then finding the
words compatible with that coloring, but computing
a coloring requires knowing the target.
How do we score guesses when we don't know the target?

Here's what we do. We hold the guess constant and we compute
the coloring that guess would produce if each of the possible candidate words
were the target.  That gives a set of colorings, and each coloring is associated with a set
of candidate words (possibly a set of size 1).  This set of sets is called a **partition** and
the key insight is that it should be partitions we score. Each of the sets in the partition  is
one possible outcome for the set of words left after our guess.  So the
simplest idea for scoring the guess is just to compute the average size
of those partition sets.  The smaller the average size, the more words 
eliminated on average, and the better the guess.
In fact, since the sizes of the partition sets add up to
N, the total number of candidate words, and we divide by the number
of partition sets to get the average size, minimizing average partition set size
is the same as maximizing the number of partition sets.
That's not bad.  You could build a pretty good WordlBot with that idea.
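
The equivalence between a small average set size and a large number of sets can be checked on a toy example. The sketch below is only illustrative: the colorings and word groupings in `toy_partition` are made up, not computed from the real word list.

```python
# Toy partition: coloring -> set of candidate words.
# The groupings are invented for illustration only.
toy_partition = {
    "kkkkk": {"fuzzy", "jazzy", "mummy"},
    "kykkk": {"audio", "adieu"},
    "gkkkk": {"raise"},
}

n_words = sum(len(words) for words in toy_partition.values())  # N, total candidates
n_sets = len(toy_partition)                                    # number of partition sets
avg_size = n_words / n_sets                                    # average partition set size

print(n_words, n_sets, avg_size)
```

Because `n_words` is the same for every guess, the only way to shrink `avg_size` is to increase `n_sets`.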

But it turns out there is a better idea.  Our problem can be phrased:
How do we measure the amount of information gained in going
from the original set of candidates to a smaller set? 
If we can answer that question, then we just compute the
average of the information gain for each set in the partition
and we have a useful score for our guess.  We show
below that this sometimes gives a different answer
from maximizing the number of partition sets.

This is exactly the sort of question answered by a branch of computer science called **Information Theory**.
The information theoretic approach often settles on the same answer
as average partition size, but it's somewhat more principled
and a lot more sophisticated because it is grounded in a 
mathematical theory for measuring the information of an event,
which is in turn grounded in a very satisfying way in probability
theory.  What follows is a **very brief** sketch of how
something called **information gain** (closely related to entropy)
can be used to determine which guess achieves the highest average gain in
information.

#### An information-theoretic measure

We have a set of candidate words.  Choose one of them as a guess.
Consider each candidate word in turn and determine what
coloring would result if that word were the target.
For example, if the guess is "mouth" and the target
word is "snort", the resulting coloring would be "kykyk" because
the guess letters "o" and "t" are contained in the target but not in
the correct positions. If the target were "stone", "store", or "adopt", the guess "mouth" would
earn the same coloring.
So "snort", "stone", "store", and "adopt" all
go in the same coloring set for the guess "mouth".
The colorings **partition** the set of candidates: every candidate gets a coloring and no candidate
gets more than one.  Another way of saying it is that a guess partitions
the set of candidates into a set of coloring classes.
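
Here is a minimal sketch of how a guess partitions a candidate list into coloring classes. The names `simple_coloring` and `coloring_classes` and the five-word candidate list are assumptions made for this illustration; `simple_coloring` is a stand-in that follows the usual green-then-yellow rule, not the coloring function supplied with this notebook.

```python
from collections import Counter, defaultdict

def simple_coloring(target, guess):
    """Color a guess against a target: g = green, y = yellow, k = black."""
    colors = ["k"] * len(guess)
    unmatched = Counter()          # target letters not matched by a green
    for i, (t, g) in enumerate(zip(target, guess)):
        if t == g:
            colors[i] = "g"
        else:
            unmatched[t] += 1
    for i, g in enumerate(guess):  # hand out yellows left to right
        if colors[i] == "k" and unmatched[g] > 0:
            colors[i] = "y"
            unmatched[g] -= 1
    return "".join(colors)

def coloring_classes(guess, candidates):
    """Group candidates by the coloring the guess would earn if each were the target."""
    classes = defaultdict(set)
    for target in candidates:
        classes[simple_coloring(target, guess)].add(target)
    return dict(classes)

# Hypothetical candidate list, chosen only to show a few different classes.
candidates = ["snort", "stone", "adopt", "toast", "mouth"]
classes = coloring_classes("mouth", candidates)
for coloring, members in classes.items():
    print(coloring, sorted(members))
```

Every candidate lands in exactly one class, which is what makes the result a partition.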

So how does this help score a guess?
Consider comparing the partitions that result from different guesses.
Better generally means more coloring classes with
fewer members.  For example, if the current set
of candidates has 24 words, the best possible partitioning would be 24 classes with
1 member each, because then no matter which candidate turns out to be right,
we (errorless logical agents that we are) would take one look at the resulting
coloring and be sure to guess the word on the next guess.

So how do we quantify this?  Say there are 24 candidates. What's better,
guess A which defines a partition with  2 groups of size 3, 4 groups of size 2, and 10 of size 1,
or guess B which defines a partition with 1 group of size 3, 8 groups of size 2, and 5 of size 1?

$$
\begin{array}{l|c|c}
\text{Name}& \text{Number of groups} & \text{Size}\\
\hline
\text{A} &  2 & 3\\
         &  4 & 2\\
         & 10 & 1 \\
         \hline
\text{B} & 1 & 3\\
         & 8 & 2\\
         & 5 & 1\\
         \hline
\end{array}
$$

Here's the information-theoretic way of looking at it.
When we make a guess, it earns a coloring, and now we have a new (and smaller!) set of candidate words.
That's information gained.
We score a  guess  by computing the amount of information
gained for each of the colorings it might result in
and then taking the average of all the possible information gains.

Suppose, for example,  the game thus far has led us to a set of
candidate words which determines partitioning A above.
And suppose our next guess narrows things down to one of the two groups with 3 members.  Then a random
guess has a 1/3 chance of being the target.  Suppose,
on the other hand, we land in one of the 10 groups with 1 member.
Then a random guess has a 100% chance of being the target.
(We say the random guess has a probability of 1.0 of being right.)
Clearly we have gained more information by landing in the second
group than in the first.  We'd like to quantify the intuition that smaller
candidate sets are better, and we're going to do that by using the fact
that in smaller groups the probability of guessing the answer is higher.

The key step is to define the amount of information gained by determining
an event $e$ ($I(e)$) (in our case, finding out what the target is).

$$
I(e) = - \log_{2} \,p(e)
$$

Read this as $I(e)$, the information needed to determine $e$, 
is equal to the negative of the log of the probability of $e$ (to the base of 2).
The unit is **bits**, binary choices.
For example, suppose we are talking about the information needed to determine
that a coin flip X has resulted in heads.  Assuming a fair coin, the probability of
a heads outcome is 1/2, so
$$
I(X=\text{heads}) = -\log_{2} \frac{1}{2} = - (- 1) = 1 \text{ bit}
$$
So 1 bit of information is needed to determine that X=heads.  Or suppose
we are rolling a 6-sided die (Y).  The probability that the outcome
is 3 is 1/6, so $I(Y=\text{3})$ is:

In [1]:
import math

- math.log(1/6,2)

2.584962500721156

Since the log of a probability is never positive and grows more negative as the probability shrinks,
determining events with lower probability takes more information.   Hence the outcome of the die toss carries more information
than the outcome of the coin flip.

We can now quantify the intuitions
above.  In the scenario above we start in a situation
with 24 equally probable candidates.  That means the information
needed to determine any one of them is

In [12]:
Ic = - math.log(1/24,2)
Ic

4.584962500721157

Then we assumed that the next guess narrowed things down to a group of 3 members.  That means the information
needed to determine any one of them  went down: 

In [13]:
Is = - math.log(1/3,2)
Is

1.5849625007211563

So the information gain achieved by the guess that got us from a set of equally likely candidates
of size 24 to a set of candidates of size 3 is:

In [14]:
#4.584962500721157 - 1.5849625007211563
Ic - Is

3.000000000000001

That's 3 bits, give or take a floating point twitch.  Note that when we went from a set of size 24 to a set
of size 3, we shrank the number of possibilities by a factor of 8.  The 3 bits comes from the fact
that $\log_{2} 8 = 3$.  Or to put it another way, halving the number of possibilities takes one bit
(as it did when we flipped the coin and went from 2 possible outcomes to 1).  Halving the number of possibilities
again takes another bit.  Halving them a third time takes a third bit.  So reducing the number of
possibilities by a factor of 8 takes 3 bits.

That's the key idea about what 1 bit of information means: if we are in a state in which there are N equally likely possibilities and we move to a state in which there are N/2 possibilities, then we have received
1 bit of information. If we want to end up  in  a state in which there is only 1 possibility
it will take $\log_{2} N$ bits ($\log_{2} N$ successive halvings of the number of possibilities).
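
The halving picture can be checked directly. The helper `halvings_to_one` below is a hypothetical name introduced for this sketch; it counts halvings in a loop and compares the count with $\log_{2} N$.

```python
import math

def halvings_to_one(n):
    """Count how many times n possibilities must be halved until at most one is left."""
    count = 0
    while n > 1:
        n /= 2
        count += 1
    return count

for n in (2, 8, 1024):
    print(n, halvings_to_one(n), math.log2(n))
```

For powers of 2 the two numbers agree exactly. When N is not a power of 2, $\log_{2} N$ is fractional and the loop count is that value rounded up.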

What about reducing the number of equally likely possibilities by 1? The amount of information that will
take depends on N, the total number of possibilities.  If N=2, it's 1 bit; if N is greater,
it will be less.  In general it's

$$
\log_{2} \frac{N}{N-1} = \log_{2} {N} - \log_{2} ({N-1}) 
$$
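
As a quick numeric illustration of this formula, the sketch below evaluates it for a few values of N. The function name `bits_for_removing_one` is an assumption made for the example.

```python
import math

def bits_for_removing_one(n):
    """Information gained going from n equally likely possibilities to n - 1."""
    return math.log2(n / (n - 1))

for n in (2, 3, 24, 3205):
    print(n, round(bits_for_removing_one(n), 4))
```

The gain shrinks quickly as N grows: ruling out a single word from a large candidate set tells us very little.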

Let's summarize what we just did and introduce some
notation.  Call the original set with 24 possibilities C.
And let's use $I_{C}$ for the information required to determine any single member of C
assuming they're all equally likely, that is,

$$
\begin{array}{lcl}
\text{I}_{C} &=& - \log_{2} \frac{1}{\mid \text{C} \mid} \\
             &=& \log_{2} \mid \text{C} \mid \\
             &=& \log_{2} 24 = 4.584962500721157
\end{array}
$$

Then we assumed that the next guess narrowed things down to a set of 3 members.
Let's call that set with 3 members $s$.
And let's use $I_{s}$ for the information required to determine 1 member of $s$:

$$
\begin{array}{lcl}
\text{I}_{s} &=& - \log_{2} \frac{1}{\mid s \mid} \\
             &=& \log_{2} \mid s \mid \\
             &=& \log_{2} 3 = 1.5849625007211563
\end{array}
$$

The information gained by that guess is:

$$
\text{I}_{C}  - \text{I}_{s} = 3 \text{ bits}.
$$

More generally, we define the Information Gain of a partition of a set C (the set
of candidates) as the average of the information gains of 
narrowing the possibilities down to $s_{i}$ for each $s_{i}$
in the partition:

$$
\begin{array}{llcl}
(a) &\text{IG}(\pi) & = & \sum_{s_{i} \in \pi} p_{\pi}(s_{i})\left \lbrack \text{I}_{C} - \text{I}_{s_{i}} \right \rbrack \\
(b)&                & = & \sum_{s_{i} \in \pi} p_{\pi}(s_{i})\,\text{I}_{C} - \sum_{s_{i} \in \pi}p_{\pi}(s_{i})\cdot \text{I}_{s_{i}}\\
(c)&                & = & \text{I}_{C} - \sum_{s_{i} \in \pi}p_{\pi}(s_{i})\cdot \text{I}_{s_{i}}\\
\end{array}
$$


Here, the weighting of each set in the average is given by its probability. The probability
associated with a set  $s_{i}$ in the partition $\pi$ is simply the probability that a randomly
chosen member of C  will be in $s_{i}$.  Writing that out:

$$
p_{\pi}(s_{i}) = \frac{\mid s_{i} \mid}{\mid \text{C} \mid}
$$
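
This probability also shows why information gain is closely related to entropy. Starting from line (a) of the definition and using $\text{I}_{C} = \log_{2} \mid \text{C} \mid$ and $\text{I}_{s_{i}} = \log_{2} \mid s_{i} \mid$, a short worked derivation (added here as a sketch) gives:

$$
\begin{array}{lcl}
\text{IG}(\pi) &=& \sum_{s_{i} \in \pi} p_{\pi}(s_{i}) \left\lbrack \log_{2} \mid \text{C} \mid - \log_{2} \mid s_{i} \mid \right\rbrack \\
               &=& \sum_{s_{i} \in \pi} p_{\pi}(s_{i}) \, \log_{2} \frac{\mid \text{C} \mid}{\mid s_{i} \mid} \\
               &=& - \sum_{s_{i} \in \pi} p_{\pi}(s_{i}) \, \log_{2} p_{\pi}(s_{i})
\end{array}
$$

The last line is the entropy of the probability distribution over the sets of the partition, so under the equal-likelihood assumption the average information gain of a guess is the entropy of the colorings it can produce.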

Given our definition of information for a partition,
let's calculate the information gain for the two cases above.

In [15]:
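# A:  2 groups of size 3,  4 groups of size 2,  10 groups of size 1
# Despite the name, this is the weighted average of the information still needed
# after the guess; the information gain is Ic minus this value.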
GainsA = \
2 * -math.log(1/3,2) * 3/24 + \
4 * -math.log(1/2,2) * 2/24 + \
10 * -math.log(1/1,2) * 1/24
GainsA 

0.7295739585136224

So given this partitioning, on average it will take about 0.73 yes-no questions to determine the target after the guess.  How is that possible?  Bear in mind that this is the **average** number of questions, and that for 10 of the partition sets, amounting to
10/24 of the probability mass, 0 questions are required:

In [9]:
-math.log(1/1,2)

-0.0

Our final information gain:

In [18]:
Ic - GainsA

3.8553885422075345

Computing the average partition size for partition A:

In [8]:
# average partition size
(2*3 + 4*2 + 10 * 1)/16

1.5

Turning to Partition B:

In [20]:
# B:  1 group of size 3,  8 groups of size 2,  5 groups of size 1
# As with GainsA, this is the weighted average of the information still needed after the guess.
GainsB = \
1 * -math.log(1/3,2) * 3/24 + \
8 * -math.log(1/2,2) * 2/24 + \
5 * -math.log(1/1,2) * 1/24
GainsB 

0.8647869792568111

In [21]:
Ic - GainsB

3.7201755214643457

In [17]:
# average partition size for Partition B
(1*3 + 8*2 + 5 * 1)/14

1.7142857142857142

So both metrics agree that Partition A should be preferred.
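
The two calculations above depend only on the sizes of the partition sets, so they can be sketched as small functions of a list of sizes. The names `info_gain` and `average_size` are assumptions made for this illustration.

```python
import math

def info_gain(sizes):
    """Average information gain (in bits) of a partition, given its set sizes."""
    n = sum(sizes)
    # Expected information still needed after learning which set we are in.
    remaining = sum((sz / n) * math.log2(sz) for sz in sizes)
    return math.log2(n) - remaining

def average_size(sizes):
    """Average partition set size."""
    return sum(sizes) / len(sizes)

sizes_a = [3] * 2 + [2] * 4 + [1] * 10   # partition A
sizes_b = [3] * 1 + [2] * 8 + [1] * 5    # partition B

for name, sizes in (("A", sizes_a), ("B", sizes_b)):
    print(name, round(info_gain(sizes), 4), round(average_size(sizes), 4))
```

A has both the higher information gain and the lower average set size, matching the cell-by-cell results above.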

#### Coding exercise

**Part One**

Write a function that scores a partition using the above definition of information
gain for a partition.  Assume a partition is a container of
sets.  Assume the sets are disjoint so that we have a genuine
partition.  The function `score_partition` has the signature

```python
score_partition(partition)
```

and returns a floating point number that is the average information gain for
that partition.

Check your function  by seeing if you get the same answer for partitions
like A and B as we got in  the calculations above.
You will have to cook up your own partitions to do the test.
Bear in mind that the set members don't matter.  What matters
is the total number of members in C (24, for partitions A and B) and the sizes of the sets
C is partitioned into.  Also bear in mind the partition sets
must not overlap.

In [1]:
import math

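def score_partition(partition):
    """
    Use information gain to score a partition.
    """
    # C is the full candidate set: the union of the (disjoint) partition sets
    C = {s for S in partition for s in S}
    N = len(C)
    # Information needed to pick out one of N equally likely candidates
    Ic = math.log(N,2)
    # Despite its name, Gains accumulates the expected information still needed after the guess
    Gains = 0
    for S in partition:
        sz = len(S)
        Gains += (math.log(sz,2)*(sz/N))
    return Ic-Gains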



In [2]:

# A:  2 groups of size 3  4 groups of size 2 10 groups of size 1
part_A = [set(lets) for lets in ("abc","def","gh","ij","kl", "mn", "o","p","q","r","s","t","u","v","w","x")]
    
print("A", score_partition(part_A))

# B:  1 groups of size 3  8 groups of size 2 5 groups of size 1
part_B = [set(lets) for lets in ("abc","de","fg","hi","jk", "lm", "no","pq","rs","t","u","v","w","x")]
    
print("B", score_partition(part_B))

A 3.8553885422075345
B 3.7201755214643457


**Part Two**

Take your partition scorer out for a spin by using it to find the best Wordl opening word.

The situation:  All possible words are your candidates. Consider each as a potential guess
and use it to partition the set of candidates.  Score your partition using your partition scorer,
then rank words by partition score.  Report the highest ranking word.  For fun you should
also report the lowest ranking word to answer this too infrequently asked question:  "What is the worst possible
Wordl opener?" (Keep these words around; they make great target words to test on when you finish building
your WordlBot).

1. To partition the set of candidates, use the function `color_guess` provided to you below for free.
   It handles the case of guesses that have duplicate letters correctly, which is a wrinkle
   that introduces some slightly tricky logic. It also represents colorings as strings of length 5,
   which means they can be keys in a dictionary.  Hint, hint. You will probably want to write a function
   `get_partition` that loops through the candidates and assigns each to a partition set 
   with `color_guess`.  Perfectly oblique to everything, you may want to know about `defaultdict`
   in the `collections` module, which makes it easy to create and maintain a dictionary whose
   values are sets.
2. Use the wordlist provided to you in the next cell as your set of all possible 5-letter English words.
   It's far from perfect, but it's a decent simulation of the unknown set of actual Wordl solutions.
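
For readers who have not met `defaultdict`, here is a minimal sketch of the pattern hinted at in item 1. It groups a handful of words by first letter rather than by coloring, so it does not give the exercise away; the word list is arbitrary.

```python
from collections import defaultdict

groups = defaultdict(set)          # a missing key starts out as an empty set
for word in ["raise", "arise", "audio", "slate", "stare"]:
    groups[word[0]].add(word)      # no need to check whether the key exists yet

for key in sorted(groups):
    print(key, sorted(groups[key]))
```

Swapping the first-letter key for a coloring string gives a dictionary from colorings to sets of candidates.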

In [3]:
import pandas as pd

def get_wordl_words():
    df = pd.read_csv("wordlebot_words.txt",header=None,names=["Word"])
    words = df["Word"].values
    words = [w for w in words if w.isalpha()]
    print(len(words), "words loaded")
    return words

In [6]:
words = get_wordl_words()

3205 words loaded


In [4]:
from collections import defaultdict

def color_guess (target, guess,wd_len=5):
    """
    This version appears to fix the duplicate
    letters bug.
    """
    def wd_dict(wd):
        dd = defaultdict(set)
        for (i,l) in enumerate(wd):
            dd[l].add(i)
        return dd

    coloring = ['k'] * wd_len
    tgt_dict,guess_dict = wd_dict(target),wd_dict(guess)
    for let in set(guess):
        gs_toks,tgt_toks = guess_dict[let],tgt_dict[let]
        gs = gs_toks & tgt_toks
        hits, max_hits = 0, len(tgt_toks)
        for gs_i in gs:
            coloring[gs_i] = 'g'
            hits += 1
        # We have yellows to assign, working left to right
        for gs_i in sorted(gs_toks - gs):
            if tgt_toks and hits < max_hits:
                coloring[gs_i] = 'y'
                hits += 1
    return ''.join((coloring))

def score_partition2(partition):
    """
    Alternative scoring metric.  Use the number of partition sets as the score.
    Since the set sizes always sum to N, more sets means a smaller average set size.
    """
    return len(partition)

Solution code 1:

In [5]:
def get_partition(word, candidates):
    partition_dict = defaultdict(set)
    for target in candidates:
        partition_dict[color_guess (target, word)].add(target)
    return list(partition_dict.values())
    
score_dict = dict()

for word in words:
    partition = get_partition(word, words)
    score_dict[word] = score_partition(partition)
    
best_words = sorted(score_dict.items(),key=lambda x: x[1])

Worst and best words:

In [10]:
best_words[:10]

[('fuzzy', 2.31120883170194),
 ('jazzy', 2.358289825160469),
 ('kudzu', 2.3962645970544916),
 ('yummy', 2.45230403439553),
 ('jiffy', 2.4762419366254154),
 ('whizz', 2.482124135306151),
 ('mummy', 2.4954390597304332),
 ('muzzy', 2.519383700381457),
 ('fizzy', 2.5281181350554522),
 ('fluff', 2.528793215233197)]

In [9]:
best_words[-20:]

[('alert', 5.73018532171354),
 ('laser', 5.751294160134383),
 ('stale', 5.7592753881157215),
 ('least', 5.761116634753805),
 ('alter', 5.7829269004299855),
 ('later', 5.79745275115367),
 ('trace', 5.800893346730176),
 ('arose', 5.805461934796522),
 ('crate', 5.806269755928691),
 ('snare', 5.815225585881023),
 ('orate', 5.818973122184737),
 ('rates', 5.829482177080974),
 ('carte', 5.8363495758139825),
 ('irate', 5.8389371539479535),
 ('caret', 5.840502120823964),
 ('stare', 5.855465911832635),
 ('saner', 5.860374367657325),
 ('arise', 5.869070725450308),
 ('slate', 5.885125881857474),
 ('raise', 5.948792890207549)]

The all-time winner is "raise".

In [48]:
# A favorite opener for many, "audio" only gets ranked 1273rd by this metric
best_words[-1273]

('audio', 4.772757713488988)

In [51]:
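# This list uses the number of partition sets as its metric (score_partition2), which is
# equivalent to ranking by average partition set size.  It gives different rankings,
# but agrees on many things.
score_dict2 = {word: score_partition2(get_partition(word, words)) for word in words}
best_words2 = sorted(score_dict2.items(),key=lambda x: x[1])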

#### Distinguishing the two metrics

In [6]:
# Each counter maps a partition set size to the number of sets of that size:
# 15 sets of size 1, 9 of size 2, etc.
croup_ctr = {1: 15, 2: 9, 3: 5, 6: 2, 9: 1}
chirp_ctr = {1: 19, 2: 5, 3: 4, 5: 2, 4: 2, 10: 1}

def igain (ctr):
    N = sum(k*v for (k,v) in ctr.items())
    Ic = math.log(N,2)
    Gains = 0
    for (sz,ct) in ctr.items():
        Gains += ct*(math.log(sz,2)*(sz/N))
    return Ic-Gains

def num_partitions(ctr):
    """
    Since N is constant across partitionings,
    minimizing avg partition size is the same as maximizing
    the number of groups.  ctr maps set size to the number of
    sets of that size, so the number of groups is sum(ctr.values()).
    """
    return sum(ctr.values())

In [141]:
#croup is the winner, by an eyelash
igain(croup_ctr),igain(chirp_ctr)

(4.640070651960025, 4.63811703784482)

In [134]:
#chirp is the winner, by one group
num_partitions(croup_ctr),num_partitions(chirp_ctr)

(32, 33)
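
Counters like the ones above can be built from any partition. The sketch below uses `collections.Counter` for that; the helper name `size_counter` and the toy partition (same shape as Partition A, arbitrary members) are assumptions made for the illustration.

```python
from collections import Counter

def size_counter(partition):
    """Map each set size to the number of partition sets of that size."""
    return Counter(len(s) for s in partition)

# A toy partition with the same shape as Partition A (members are arbitrary letters).
part_a = [set(x) for x in ("abc", "def", "gh", "ij", "kl", "mn",
                           "o", "p", "q", "r", "s", "t", "u", "v", "w", "x")]
ctr_a = size_counter(part_a)

n_sets = sum(ctr_a.values())                          # number of partition sets
n_words = sum(sz * ct for sz, ct in ctr_a.items())    # total number of candidates
print(dict(ctr_a), n_sets, n_words)
```

In such a counter the keys are set sizes and the values are counts, so the number of sets is the sum of the values, and the number of candidates is the sum of size times count.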

## Part Three

You have almost written a WordlBot.  

Want to finish?  Your Bot can operate in two modes.

1.  Write a while loop that is given a value for an initial guess, then waits to have a coloring input before outputting the next guess. Each coloring that is input should update the set of current words.  This will be quite straightforward if you still have the partition induced by your last guess, especially if you implemented your partition as a dictionary whose keys are colorings. Where are the colorings coming from?  You're playing Wordl online. Your Wordl skill scores should skyrocket.
2.  Give the Bot a solution and let it play on its own (it can now generate its own colorings).  You can now compare your Wordl games with its games.  You can also search for the hardest Wordl words, the ones that make it take 4 or more guesses.  
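
The second mode can be sketched in miniature as follows. This is a simplified illustration under stated assumptions, not a full bot: `simple_coloring`, `partition_by_coloring`, `self_play`, and the eight-word `toy_words` list are all hypothetical, the target is assumed to be in the word list, the next guess is chosen only from the remaining candidates, and the number of partition sets is used as the score.

```python
from collections import Counter, defaultdict

def simple_coloring(target, guess):
    """Stand-in coloring function: g = green, y = yellow, k = black."""
    colors = ["k"] * len(guess)
    unmatched = Counter()
    for i, (t, g) in enumerate(zip(target, guess)):
        if t == g:
            colors[i] = "g"
        else:
            unmatched[t] += 1
    for i, g in enumerate(guess):
        if colors[i] == "k" and unmatched[g] > 0:
            colors[i] = "y"
            unmatched[g] -= 1
    return "".join(colors)

def partition_by_coloring(guess, candidates):
    """Dictionary from coloring to the set of candidates that would produce it."""
    parts = defaultdict(set)
    for target in candidates:
        parts[simple_coloring(target, guess)].add(target)
    return parts

def self_play(target, words, first_guess, max_guesses=6):
    """The bot colors its own guesses until it hits the target or runs out of turns."""
    candidates, guess, history = set(words), first_guess, []
    for _ in range(max_guesses):
        coloring = simple_coloring(target, guess)
        history.append((guess, coloring))
        if coloring == "g" * len(guess):
            break
        # Keep only the words that would have produced this coloring.
        candidates = partition_by_coloring(guess, candidates)[coloring]
        # Next guess: the remaining candidate whose partition has the most sets.
        guess = max(sorted(candidates),
                    key=lambda w: len(partition_by_coloring(w, candidates)))
    return history

toy_words = ["raise", "arise", "stare", "slate", "crate", "trace", "snort", "mouth"]
history = self_play("trace", toy_words, first_guess="raise")
print(history)
```

For the interactive mode, the call to `simple_coloring(target, guess)` would be replaced by reading the coloring from `input()`; the candidate update stays the same.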

## Given code (either from earlier in this NB or new for this problem)

In [7]:
import math
import pandas as pd
from collections import defaultdict
import random
import numpy as np
from IPython.display import display

def get_wordl_words(url=None):
    """
    eliminate a few very odd words still in the list
    
    Note there needs to be second broader list of words that are allowable
    guesses (for run_game mode) in order for players to make strategic
    guesses that are English words but not wordle words (like regular
    plurals of 4-letter words).
    """
    # How did regular plurals ruses and doses get in there?
    eliminate = ("kylix","middy","ruses","doses","meses")
    if url is None:
        url = "https://raw.githubusercontent.com/gawron/python-for-social-science/refs/heads/master/" +\
    "text_classification/wordlebot_words.txt"
    df = pd.read_csv(url,header=None,names=["Word"])
    words = df["Word"].values
    words = [w for w in words if (w.isalpha() and w not in eliminate)]
    print(len(words), "words loaded")
    return words

def get_partition_dict(word, candidates):
    partition_dict = defaultdict(set)
    for target in candidates:
        partition_dict[color_guess (target, word)].add(target)
    return partition_dict


def color_guess (target, guess,wd_len=5):
    """
    This version appears to fix the duplicate
    letters bug.
    """
    def wd_dict(wd):
        dd = defaultdict(set)
        for (i,l) in enumerate(wd):
            dd[l].add(i)
        return dd

    coloring = ['k'] * wd_len
    tgt_dict,guess_dict = wd_dict(target),wd_dict(guess)
    for let in set(guess):
        gs_toks,tgt_toks = guess_dict[let],tgt_dict[let]
        gs = gs_toks & tgt_toks
        hits, max_hits = 0, len(tgt_toks)
        for gs_i in gs:
            coloring[gs_i] = 'g'
            hits += 1
        # We have yellows to assign, working left to right
        for gs_i in sorted(gs_toks - gs):
            if tgt_toks and hits < max_hits:
                coloring[gs_i] = 'y'
                hits += 1
    return ''.join((coloring))


def score_partition(partition):
    """
    Use information gain to score a partition.
    """
    C = {s for S in partition for s in S}
    N = len(C)
    Ic = math.log(N,2)
    Gains = 0
    for S in partition:
        sz = len(S)
        Gains += (math.log(sz,2)*(sz/N))
    return Ic-Gains

def score_partition2(partition):
    """
    Alternative scoring metric.  Use number of partition sets as score
    """
    return len(partition)



def color_word(word,coloring,display_style=False):
    """
    Using https://pandas.pydata.org/docs/user_guide/style.html
    """
    df = pd.DataFrame([list(word)])
    df.index=['']
    s= df.style
    s.set_table_styles([  # create internal CSS classes
        {'selector': '.g', 'props': 'background-color: green; color:white;'},
        {'selector': '.y', 'props': 'background-color: #ffffb3; color:black;'},
        {'selector': '.k', 'props': 'background-color: black; color:white;'},
        {'selector': 'th', 'props': 'background-color: white; color:white; border-bottom: 1px solid white'}
    ], overwrite=False)
    cell_color = pd.DataFrame([list(coloring)],
                          index=df.index,
                          columns=df.columns)
    s.set_td_classes(cell_color)
    if display_style:
        display(s)
    else:
        return s

## Solution

The solution below implements a bot with 3 modes:

1.  Create the bot with bot = Wordlbot(verbose=1) and call bot.play(target=word), where word is a word you want to test the bot on.  It will play
    on its own, printing out guess and coloring info, and so far as I can tell, it will always get the answer
    within 6 tries (the maximum number of guesses in the official game).
2.  Call bot.play() with no target.  It will play interactively. You give it a first guess and it will ask for a coloring,
    then update its list of candidates and output the best guess.  To be used to cheat while you are playing
    Wordle on the NYT Wordle page.
3.  Call bot.run_game().  It will simulate the NYT Wordle page game, though without the pretty
    graphics.  It chooses a target, then enters the loop,
    where it waits for your guess, then gives you back a coloring. And loop.
    It will tell you when you've won and when you've lost.

In [2]:
import os.path
import pandas as pd 

data_url= "https://raw.githubusercontent.com/gawron/python-for-social-science/refs/"\
"heads/master/text_classification/"
word_file = "wordlebot_words.txt"
bnc_freqs_file = "bnc_word_and_doc_freqs_2023_04_13.csv"
word_url = os.path.join(data_url,word_file)
bnc_freqs_url = os.path.join(data_url,bnc_freqs_file)
df = pd.read_csv(bnc_freqs_url,header=None,names=("Word","Freq","DocFreq"))
df = df.set_index('Word')
# proposal for lowest partition
#new_df = df[df["Word"].isin(wordl_words)]

In [3]:
wordl_words = sorted(get_wordl_words(url=word_url))

3216 words loaded


In [4]:
from nltk.corpus import wordnet as wn

In [5]:
wn.morphy("weeded")

'weed'

In [6]:
df.loc[["the","of","warbler"]]

Word,Freq,DocFreq
the,2618291,1713
of,1417475,1713
warbler,58,15


In [None]:
#  2 ^ x = p
# (e ^{log 2}) ^ x = p
# e ^ (log 2 * x) = p
# log 2 * x  = log p
# x = log p/log 2

In [7]:
##  Compute probabilities and log probabilities (base 2)
# BNC has about 50 million words
import numpy as np

N = df.Freq.sum()

df['Prob'] = df.Freq/N
df['LogProb'] =  np.log(df["Prob"]) /np.log(2)

In [8]:
df

Word,Freq,DocFreq,Prob,LogProb
the,2618291,1713,5.060588e-02,-4.304551
",",2398174,1713,4.635150e-02,-4.431240
.,2131797,1713,4.120301e-02,-4.601107
of,1417475,1713,2.739671e-02,-5.189853
and,1227434,1713,2.372363e-02,-5.397531
...,...,...,...,...
23c,2,1,3.865566e-08,-24.624745
comfortablest,2,1,3.865566e-08,-24.624745
Epitomy,2,1,3.865566e-08,-24.624745
self-fruition,2,1,3.865566e-08,-24.624745


The next step may not be quite right.  See discussion immediately below it.

In [279]:
#filtered_words = [word for word in wordl_words if word in df.index]
#new_df = df.loc[filtered_words]

Issue.  We want to smooth: assign some minimal probability to wordlbot words that didn't turn up in the BNC.
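
One common way to do this is add-one smoothing. The sketch below is only an illustration of that idea using plain dictionaries: the function name `smoothed_log_probs` and the toy counts are assumptions, and the probabilities are normalized over the small vocabulary only, which may not be the normalization this notebook ends up using.

```python
import math

def smoothed_log_probs(counts, vocabulary):
    """Add-one smoothing: every vocabulary word gets count + 1 before normalizing."""
    smoothed = {w: counts.get(w, 0) + 1 for w in vocabulary}
    total = sum(smoothed.values())
    return {w: math.log2(c / total) for w, c in smoothed.items()}

toy_counts = {"shack": 3, "snack": 2}       # made-up corpus counts
toy_vocab = ["shack", "snack", "schwa"]     # "schwa" is unseen in the toy corpus
log_probs = smoothed_log_probs(toy_counts, toy_vocab)
print(log_probs)
```

The unseen word ends up with a small but nonzero probability, so its log probability is finite.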

In [9]:
# schwa not in BNC but is in wordlebot words
# add 1,1 row  (way back above, before probs are computed)
new_df.loc[["shack","snack","stand","stank","spawn"]]

NameError: name 'new_df' is not defined

In [276]:
len(new_df)

2794

In [15]:
import os.path
import pandas as pd 

class Wordlbot(object):
    
    #word_url= "https://raw.githubusercontent.com/gawron/python-for-social-science/refs/heads/master/" +\
    #"text_classification/wordlebot_words.txt"
    data_url= "https://raw.githubusercontent.com/gawron/python-for-social-science/refs/heads/master/text_classification/"
    word_file = "wordlebot_words.txt"
    bnc_freqs_file = "bnc_word_and_doc_freqs_2023_04_13.csv"
    bnc_freqs_url = os.path.join(data_url,bnc_freqs_file)
    word_url = os.path.join(data_url,word_file)
    wordl_words = sorted(get_wordl_words(url=word_url))
    candidates = None
    initial_guess = "raise"
    #initial_guess="slate"
    print("Initializing...")
    initial_partition = get_partition_dict(initial_guess, wordl_words)
    print("Initialized")
    
    
    def __init__(self, target=None,verbose=0):
        """
        verbose = 1 some notifications
        verbose = 2 debug mode
        """
        self.target=target
        self.games = []
        self.verbose= verbose
        self.batch_mode=False
        self.wins=0
        self.losses=0
        
    def get_coloring(self,guess):
        colors = {"g","k","y"}
        if self.target is None:
            color = input("Please enter a coloring.\n")
            while not (len(color)==5 and set(color).issubset(colors)):
                color = input("Not a valid coloring.  Please enter a 5 color coloring using color kgy.\n")
            return color
        else:
            return color_guess (self.target, guess)
        
    def get_best_guess_and_partition(self,candidates,other_metric=False):
        global partitions, score_dict, best_words
        if candidates is None:
            # This happens when initializing
            return self.initial_guess, self.initial_partition
        partitions,score_dict = dict(),dict()
        if self.verbose:
            print("Searching for next guess\n")
        # Note: don't limit the next guess to candidates
        for word in self.wordl_words:
            partition = get_partition_dict(word, candidates)
            if other_metric:
                score_dict[word] = self.score_partition_dict2(partition)
            else:
                score_dict[word] = self.score_partition_dict(partition)
            partitions[word] = partition
        best_words = sorted(score_dict.items(),key=lambda x: x[1])
        best_word, best_score = best_words[-1]
        # Use one of the candidates if they score well enough
        for c in candidates:
            if score_dict[c] == best_score:
                 best_word, best_score = (c, best_score)
        return best_word,partitions[best_word]
        
    def play (self,target=None,initial_guess=None,candidates=None,
              other_metric=False,use_user_guesses=False):
        """
        Play in two modes, target is None and target is given.
        
        In target is None mode: the bot prints out a guess based on the current 
        list of candidates and asks the user for the coloring (which
        the user presumably learns by inputing the guess to the NYT Wordl page);
        then the bot generates a fresh set of compatible candidates and re enters 
        the while-loop.
        
        In target is given mode, there is no interaction, the bot just generates a 
        guess based on the current list of candidates, computes the coloring on its own,
        then generates a new set of candidates and re-enters the while loop.
        
        Loop until either 6 guesses have been made or the coloring is `'ggggg'`.
        """
        if target is not None:
            self.target = target
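        # coloring starts as None so the loop test is safe even when candidates are passed in
        game,ctr,coloring = [],1,None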
        self.games.append(game)
        #while ctr< 7 and (candidates is None or len(candidates)>1):
        while ctr < 7 and (candidates is None or coloring!="ggggg"):
            if self.verbose:
                print(f"Making guess {ctr}")
            if initial_guess is not None:
                guess,initial_guess=initial_guess,None
                partition = get_partition_dict(guess, self.wordl_words)
            elif use_user_guesses:
                guess = input("Please enter your guess.\n")
                partition = get_partition_dict(guess, candidates)
            else:
                guess,partition = self.get_best_guess_and_partition(candidates,other_metric=other_metric)
            if self.verbose or target is None:
                print(f"Guess is {guess}")
            coloring = self.get_coloring(guess)
            if self.verbose:
                print(f"Coloring is {coloring}")
            game.append((guess,coloring))
            candidates = partition[coloring]
            if self.verbose==2 and hasattr(self,"target") and self.target in partition[coloring]:
                print(f"{self.target} in  candidates")
            elif self.verbose==2 and hasattr(self,"target") :
                print(f"{self.target} not in  candidates")
            if self.verbose:
                print(len(candidates))
            ctr += 1
        candidates = list(candidates)
        #if len(candidates)==1 and game[-1][1] != "ggggg":
        #    game.append((candidates[0],"ggggg"))
        # go back to default: bot-is-player mode 
        self.target=None
        if len(candidates) == 0:
            print("*** Errorful state: No candidates left! Check the colorings for consistency! ***")
            return
        elif coloring == "ggggg":
            if not self.batch_mode:
                print("Winner winner! Chicken dinner!")
            self.wins += 1
        elif (ctr >= 7) and (coloring != 'ggggg'):
            self.losses += 1
        if not self.batch_mode:
            print(f"Target is {candidates[0]}")
        return candidates[0]
    

    def score_partition_dict (self,partition_dict):
        """
        Use information gain to score a partition.
        
        This makes score_partition superfluous.
        Implemented to allow passing of partition
        dicts instead of the simpler alternative, a partition
        sequence.  list(partition_dict.values()) is
        the list of values in the corresponding 
        partition sequence.  So after the first line
        of code, this function is just like score_partition.
        """
        partition  = list(partition_dict.values())
        C = {s for S in partition for s in S}
        N = len(C)
        Ic = math.log(N,2)
        Gains = 0
        for S in partition:
            sz = len(S)
            Gains += (math.log(sz,2)*(sz/N))
        return Ic-Gains
    
    def score_partition_dict2(self, partition_dict):
        """
        Alternative scoring metric.  Best guess
        minimizes average size of partition sets.
        
        Since the size of the set we are partitioning
        is constant across all candidate guesses,
        minimizing N/num_psets is the same as maximizing
        num_psets.
        """
        #partition  = list(partition_dict.values())
        #Gains = 0
        #for s in partition:
        #    Gains += len(s)
        #
        return len(partition_dict)

    def run_game(self, candidates=None,target=None):
        """
        This object simulates the Wordl game program on the web.  You 
        are the player.  No cute keyboard mockup screen.
        
        Can supply target and play for debugging. Or just to make yourself feel smart.
        """
        if target is None:
            self.target = random.sample(self.wordl_words,1)[0]
        else:
            self.target = target
        ctr, game, guess, coloring = 1, [ ], None, None
        self.games.append(game)
        if candidates is None:
            candidates = self.wordl_words
        while (ctr < 7) and (guess != 'q') and coloring != 'ggggg':
            guess = input("Please input your guess! (q to quit): ")
            partition = get_partition_dict(guess, self.wordl_words)
            coloring = self.get_coloring(guess)
            if guess in self.wordl_words:
                candidates = partition[coloring]
                #print(f"Coloring is {coloring}")
                color_word(guess,coloring,display_style=True)
            else:
                print("Unknown word")
                ctr -= 1
            game.append((ctr, guess,coloring, self.target))
            if self.verbose:
                print(f"{ctr}. {len(candidates)} candidates left")
            ctr += 1
            assert self.target in candidates, "Something wrong.  Target not in candidates"
        candidates = list(candidates)
        last_candidate = candidates[-1]
        if (guess == "q"):
            print("Quitting!")
        elif (guess == last_candidate):
            print("Winner winner! Chicken dinner!")
        elif (ctr >= 7) and (coloring != 'ggggg'):
            print(f"Sorry! Target was {self.target}")
        else:
            print("Program ending!  Don't know why!")
    
    def batch_play (self,target_list, other_metric=False,initial_guess=None):
        self.target_list = target_list
        self.wins,self.losses = (0,0)
        self.batch_mode = True
        num_games = len(target_list)
        for target in target_list:
            self.play(target=target,other_metric=other_metric,initial_guess=initial_guess)
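        # Average over this batch only; self.games also holds games from earlier calls
        batch_games = self.games[-num_games:]
        print(f"Report: {sum(len(game) for game in batch_games)/num_games:.2f} {self.losses} losses")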
        
        

3211 words loaded
Initializing...
Initialized


In [10]:
##  Competition
num_games = 500
print(f"Initializing bots and approximately {num_games} tournament words")

#bot1,bot2 = Wordlbot(),Wordlbot()
bot1 = Wordlbot()
tournament_targets = list(set(random.sample(bot1.wordl_words,num_games)))
num_games = len(tournament_targets)
print(f"Final set: {num_games} tournament words")

Initializing bots and approximately 500 tournament words
Final set: 500 tournament words


In [11]:
list(tournament_targets).index("diner")

ValueError: 'diner' is not in list

In [42]:
tournament_targets[88]

'diner'

In [43]:
bot1.games[88]

[('raise', 'ykyky'),
 ('veldt', 'kykyk'),
 ('wormy', 'kkykk'),
 ('cider', 'kgygg'),
 ('diner', 'ggggg')]

In [72]:
bot1.initial_guess

'slate'

In [73]:
"mizen" in bot1.wordl_words

True

In [12]:
print("Bot1 playing")
bot1 = Wordlbot()
bot1.batch_play(tournament_targets,other_metric=True,initial_guess="slate")
#print("Bot2 playing")
#bot2.batch_play(tournament_targets)
print("Tournament over!")

Bot1 playing
Report: 3.53 0 losses
Tournament over!


A word that takes bot1 6 guesses: `'joker'`.  Another such word:

In [27]:
def find_max_words(bot, target_list, idx=None):
    """
    Find hardest words after batch play.
    """
    b = np.array([len(g) for g in bot.games[-len(target_list):]])
    target_array = np.array(target_list)
    idxs = b.argsort()
    max_score = b[idxs[-1]]
    num_max_scores = len(b[b==max_score])
    return max_score, target_array[idxs][-num_max_scores:]


(max_score,max_words)=find_max_words(bot1,tournament_targets)
print(max_score,max_words)

5 ['funky' 'cower' 'plump' 'erode' 'drome' 'sloop' 'gaged' 'bowel' 'crack'
 'stool' 'cheep' 'daunt' 'armor' 'graze' 'jaded' 'probe' 'flank' 'hippy'
 'older' 'diner' 'cover' 'smock' 'power' 'latch' 'gland' 'fuzzy']


In [49]:
bot1.verbose=1

In [52]:
"diner" in bot1.wordl_words

True

In [83]:
bot1 = Wordlbot()
bot1.play(initial_guess="slate")

Guess is slate
Please enter a coloring.
ykkkk
Guess is croup
Please enter a coloring.
kkygk
Guess is bogus
Please enter a coloring.
ggkgg
Guess is bonus
Please enter a coloring.
ggggg
Winner winner! Chicken dinner!
Target is bonus


'bonus'

The word `'joker'` also takes 6 guesses for bot2.  However `'boxer'` took only 5 guesses.

In [208]:
# boxer for b2
print(find_max_word(b2,idx=392))

(5, 'boxer', 392)


In [130]:
bot.play(target="sunny",initial_guess="slate")

Making guess 1
Guess is slate
Coloring is gkkkk
69
Making guess 2
Searching for next guess

Guess is croup
Coloring is kkkyk
9
Making guess 3
Searching for next guess

Guess is suing
Coloring is ggkgk
1
Making guess 4
Searching for next guess

Guess is sunny
Coloring is ggggg
1
Target is sunny


'sunny'

In [129]:
bot.play(target="sunny",initial_guess="slate",other_metric=True)

Making guess 1
Guess is slate
Coloring is gkkkk
69
Making guess 2
Searching for next guess

Guess is chirp
Coloring is kkkkk
10
Making guess 3
Searching for next guess

Guess is snowy
Coloring is gykkg
1
Making guess 4
Searching for next guess

Guess is sunny
Coloring is ggggg
1
Target is sunny


'sunny'

In [17]:
bot1 = Wordlbot()
bot1.play(initial_guess="ratio",use_user_guesses=True)

Guess is ratio
Please enter a coloring.
kykkk
Please enter your guess.
clash
Guess is clash
Please enter a coloring.
kkykk
Please enter your guess.
amend
Guess is amend
Please enter a coloring.
ykykk
Please enter your guess.
pupae
Guess is pupae
Please enter a coloring.
kkkgy
Please enter your guess.
pupae
Guess is pupae
Please enter a coloring.
kkkgy
Please enter your guess.
pupae
Guess is pupae
Please enter a coloring.
kkkgy
Target is kebab


'kebab'

In [82]:
bot.play(initial_guess="slate")

Making guess 1
Guess is slate
Please enter a coloring.
gkkkk
Coloring is gkkkk
69
Making guess 2
Searching for next guess

Guess is croup
Please enter a coloring.
kkkyk
Coloring is kkkyk
9
Making guess 3
Searching for next guess

Guess is suing
Please enter a coloring.
ggkgk
Coloring is ggkgk
1
Making guess 4
Searching for next guess

Guess is sunny
Please enter a coloring.
ggggg
Coloring is ggggg
1
Target is sunny


'sunny'

In [22]:
#from IPython.display import HTML, display
bot1 = Wordlbot()
bot1.run_game()
#s=color_word("tripe","ggkky")
#display(s)
#True

Please input your guess! (q to quit): slate


Unnamed: 0,0,1,2,3,4
,s,l,a,t,e


Please input your guess! (q to quit): dress


Unnamed: 0,0,1,2,3,4
,d,r,e,s,s


Please input your guess! (q to quit): bases


Unnamed: 0,0,1,2,3,4
,b,a,s,e,s


Please input your guess! (q to quit): muses
Unknown word
Please input your guess! (q to quit): ruses


Unnamed: 0,0,1,2,3,4
,r,u,s,e,s


Please input your guess! (q to quit): vises
Unknown word
Please input your guess! (q to quit): wises
Unknown word
Please input your guess! (q to quit): doses


Unnamed: 0,0,1,2,3,4
,d,o,s,e,s


Please input your guess! (q to quit): sizes
Unknown word
Please input your guess! (q to quit): wizes
Unknown word
Please input your guess! (q to quit): doses


Unnamed: 0,0,1,2,3,4
,d,o,s,e,s


Sorry! Target was meses


In [225]:
# sample prob dist 53 cands; guess is "plant"
def_prob = "2.1 "
###############################################
gp1 = "chaos shack shady shaky smack".split()
gp1_probs = (def_prob*5).split()
gp2 = "shall small shawl scaly scald".split()
gp2_probs = (def_prob*3 +  "2 1.7").split()
gp3  = "squad scuba sumac squab assay schwa".split()
gp3_probs = "2.1 2.1 2 2 1.8 .1".split()
gp4 = "stack staff swath".split()
gp4_probs = (def_prob * 3).split()
gp5 = "swamp soapy scamp".split()
gp5_probs = "2.1 2 1.9".split()
gp6 = "snack snafu shaky".split()
gp6_probs = "2.1 1.9 1.9".split()
gp7 = "stall stalk".split()
gp7_probs = "2.1 2.1 ".split()
gp8 = "shank swank swang ".split()
gp8_probs = "2.1 2 .1".split()
gp9 = "stamp staph".split()
gp9_probs = "2.1 1.9".split()
gp10 = "stand stank".split()
gp10_probs = "2.1 1.9".split()
gp11 = "squat ascot".split()
gp11_probs = "2.1 1.9".split()
gp12 = "usual shoal".split()
gp12_probs = "2.1 1.9".split()
gp13 = "scant".split()
gp13_probs = "2.1".split()
gp14 = "slang".split()
gp14_probs = "2.1".split()
gp15 = "slack".split()
gp15_probs = "2.1".split()
gp16 = "shaft".split()
gp16_probs = "2.1".split()

gp17 = "spawn".split()
gp17_probs = "2.1".split()
gp18 = "slant".split()
gp18_probs = "2.1".split()
gp19 = "scalp".split()
gp19_probs = "2.1".split()
gp20 = "shalt".split()
gp20_probs = "2.1".split()
gp21 = "splat".split()
gp21_probs = "2.1".split()
gp22 = "spank".split()
gp22_probs = "2.1".split()
gp23 = "splay".split()
gp23_probs = "2.1".split()
gp24 = "psalm".split()
gp24_probs = "2.1".split()
gp25 = "pshaw".split()
gp25_probs = ".1".split()
gp26 = "unsay".split()
gp26_probs = ".1".split()

words22 = (gp1 + gp2 + gp3 + gp5 + gp6 + gp7  + gp8  + gp9 + gp10 + gp11 + gp12 + gp13 + gp14 + \
           gp15 + gp16 + gp17 + gp18 + gp19 + gp20 + gp21 + gp22 + gp23 + gp24 + gp25 + gp26)
probs22 = (gp1_probs + gp2_probs + gp3_probs + gp5_probs + \
        gp6_probs + gp7_probs  + gp8_probs  + gp9_probs + gp10_probs + gp11_probs + gp12_probs + gp13_probs + \
        gp14_probs + gp15_probs + gp16_probs + gp17_probs + gp18_probs + gp19_probs + gp20_probs + gp21_probs + \
        gp22_probs + gp23_probs + gp24_probs + gp25_probs + gp26_probs)



In [226]:
partition = get_partition_dict("plant", words22)

In [227]:
set(words22) - set(words)

{'ascot',
 'psalm',
 'pshaw',
 'shall',
 'shalt',
 'snafu',
 'staph',
 'swang',
 'swank'}

In [221]:
len(partition)

25

## Introducing entropy

In this small section we discuss the relationship of our definition
of Information Gain to some standard concepts of Information Theory.

We were able to greatly simplify the discussion of Information Gain above because we
were dealing with uniform distributions, quite reasonably so.  From
the point of view of a Wordl player every 5-letter word has to have an equal
chance of being the target. So we did quite well by defining something
we called $\text{I}_{s}$, the information required to determine
any single member of $s$, which depended only on the size of $s$. 
>Note:  If you're a follower of the WordleBot analyses, you may have noticed that the Bot doesn't actually respect this assumption when assigning probabilities to solutions.  Some words have very low probabilities of being targets. It's not at all clear what justifies these numbers and I'm not sure how a Wordl player could know them, so I've left this out of the discussion.  As will become clear below, it made life simpler.

But Information Theory is designed to deal with general
probability distributions, where the events may all have different
probabilities, so the average information required to determine
an event from a set of events has a slightly more complicated definition.
The term in Information Theory is **Entropy**, and its definition is:

$$
H(X) = -\sum_{x\in X} p(x) \log_{2} p(x)
$$

Your favorite Information Theory textbook
will call X a random variable (for example, Cover and Thomas 1991 *Elements of Information
Theory*),  but for our purposes (finite discrete distributions)
we can think of it as a set of events or values with an associated
probability distribution $p$. The entropy is the probability-weighted average
of the information carried by events in X, also called the **Expected Information**.
Let's apply this definition to our set of candidate words C,
assuming as usual they all have an equal chance of being the target:

$$
\begin{array}{lcl}
H(C) &=& - \sum_{x\in C} p(x) \log_{2} p(x)\\
     & = & - \mid C \mid \cdot \frac{1}{\mid C \mid} \log_{2} \frac{1}{\mid C \mid}   \\
     & = & \log_{2} \mid C \mid  \\
\end{array}
$$

This of course is what we have been calling $I_{C}$.
When all the candidates carry an equal amount of information
the average information is the same as the information
carried by any single candidate. This allowed us to 
compute the entropy of C just by computing the information carried by one
candidate.
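
As a quick numerical check of the derivation above, here is a minimal sketch of the entropy definition. The helper name `entropy` is ours (it is not part of the `Wordlbot` class), and it assumes the probabilities passed in already sum to 1.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability events contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform distribution over N candidates: the entropy equals log2(N).
N = 24
uniform = [1 / N] * N
print(entropy(uniform), math.log2(N))

# A skewed distribution over the same number of events has lower entropy.
skewed = [0.5] + [0.5 / (N - 1)] * (N - 1)
print(entropy(skewed))
```

The uniform case reproduces $\log_{2} \mid C \mid$, and any non-uniform distribution over the same events scores lower, which is why the uniform assumption is the worst case for the player.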

The next important concept is conditional entropy.

$$
H(X \mid Y) = \sum_{y\in Y} p(y) \cdot H(X\mid y)
$$

That is the expected conditional entropy of
X given Y, or the average amount of information
carried by an event in X when the value of Y is known.

In our context we have been talking about
C and the various sets in a partition of C,
$\pi$.  We computed the Partition Information (PI),
the average information still needed once we know which set of $\pi$ contains the target, as:

$$
\text{PI} = \sum_{s \in \pi}p_{\pi}(s)\cdot \text{I}_{s}\\
$$

And we now know $\text{I}_{s}$ is equal to
$H(s)$.  Also, since $s$ is a subset of C,
once we know the target is in $s$, every $x \in s$
is equally likely to be the target and every other $x \in C$
is ruled out, so $H(s)$ is just the entropy of C given $s$, which we write as 
$H(\text{C}\mid s)$.  So we have:

$$
H(C \mid \pi)  = \text{PI} = \sum_{s \in \pi}p_{\pi}(s)\cdot H(\text{C}\mid s)\\
$$

That is, what we've been calling 
the PI is the Conditional Entropy of $C$ given $\pi$,
or the average information needed to determine the target when it is known
which of the sets of $\pi$ the target is in.

When we plug these rewrites back into our definition of
Information Gain, we get the standard definition of
Information Gain:

$$
IG(C,\pi) = H(C) - H(C\mid \pi)
$$

This is also equivalent to the **Mutual Information** of C and $\pi$,
written $I(C;\pi)$.
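
The identities above can be checked on a toy partition. This is only a sketch under the uniform assumption: the three function names are ours, and `toy` is a small hand-built dict (coloring to words) that mimics the shape of what `get_partition_dict` returns for the guess "plant"; the keys play no role in the arithmetic.

```python
import math

def conditional_entropy(partition_dict):
    """H(C | pi): expected bits still needed once the partition set is known."""
    sets = list(partition_dict.values())
    n = sum(len(s) for s in sets)
    return sum((len(s) / n) * math.log2(len(s)) for s in sets)

def information_gain(partition_dict):
    """IG(C, pi) = H(C) - H(C | pi) for equally likely candidates."""
    n = sum(len(s) for s in partition_dict.values())
    return math.log2(n) - conditional_entropy(partition_dict)

def partition_entropy(partition_dict):
    """H(pi): entropy of the distribution over partition sets (colorings)."""
    sizes = [len(s) for s in partition_dict.values()]
    n = sum(sizes)
    return -sum((sz / n) * math.log2(sz / n) for sz in sizes)

# Hand-built toy partition: 8 candidates split into sets of size 4, 2, 1, 1.
toy = {
    "kkgkk": ["chaos", "shack", "shady", "smack"],
    "kygkk": ["shall", "small"],
    "kkykg": ["squat"],
    "kkggg": ["scant"],
}
print(information_gain(toy), partition_entropy(toy))
```

The two printed numbers agree. The coloring is completely determined by the target, so the mutual information $I(C;\pi)$ reduces to the entropy of the coloring distribution, and scoring a guess by information gain is the same as scoring it by how evenly it spreads the candidates over colorings.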


## fragment 1
Spelling this out a bit:  the definition of the amount of information in an event with probability $p$ is

$$
- \log p
$$

There are good reasons for this definition.  For example, the minus sign
tells us the more improbable an event is the more information it carries.
Also the information
carried by two independent events with probabilities $p$ and $q$ is just 

$$
- \log pq = - \log p - \log q.
$$

Thus, the information content of two independent events is just the sum of their respective
information contents.  There are other motivations we leave out for lack of space.
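
A small numerical illustration of both properties, as a sketch: the helper name `information` is ours, and we take logs base 2 so the units are bits.

```python
import math

def information(p):
    """Information content of an event with probability p, in bits."""
    return -math.log2(p)

p, q = 1 / 4, 1 / 8

# The rarer event carries more information.
print(information(p), information(q))

# For independent events the joint probability is p*q,
# and the information contents add.
print(information(p * q), information(p) + information(q))
```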


## fragment 2

Entropy can be thought of as measuring how many binary choices on
average we need to determine one event given a probability
distribution over events. The units are called bits. The worst case is when all the events are
equally probable.  The number above means that to discriminate among 24 equally probable candidates
on average it will take 4.585 or so binary choices (answers to yes-no questions).  For example,
supposing it's a number between 1 and 24, and supposing we get a series of yes answers, we
might proceed as follows:

1.  Is it less than 13?
2.  Is it less than 7?
3.  Is it less than 4?

Whereupon we have 3 candidates left to choose among, which will take either 1 or 2 questions,
and on average it will be:

In [6]:
math.log(3,2)

1.5849625007211563

questions.

If you know that the target is in a set
of size 1, you are done.  It takes 0 binary questions to identify the target. That's a
0-entropy distribution.  If you know the target is in a set of size 2, it takes
1 yes-no question to identify the target.  That partition set has an entropy of 1 bit.

$$
\begin{array}{lcl}
1 \cdot 1 \cdot (- \log_{2} 1) & = & 0\\
2 \cdot 0.5 \cdot (- \log_{2} 0.5) & = & 1 \\
\end{array}
$$

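If you have a set of size 3, as we just saw, the entropy is $\log_{2} 3 \approx 1.58$ bits.
Strictly speaking, any actual sequence of yes-no questions for 3 equally likely candidates
averages $5/3 \approx 1.67$ questions; the entropy is the lower bound on that average.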
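
The numbers in this fragment can be reproduced with a few lines. This is a sketch that assumes equally likely candidates, so the entropy of a set depends only on its size.

```python
import math

# Entropy in bits of a uniform distribution over a set of each size.
for size in (1, 2, 3, 24):
    print(size, round(math.log2(size), 3))

# Three halving questions (all answered "yes") take 24 candidates down to 3.
remaining = 24
for _ in range(3):
    remaining //= 2

# 3 bits spent on the halving questions plus the entropy of what remains
# adds up to the entropy of the original 24 candidates.
print(remaining, 3 + math.log2(remaining), math.log2(24))
```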