### Riddler Classic: Built After Seeing Other Solutions

The solution can be found [here](https://fivethirtyeight.com/features/can-you-design-the-perfect-wedding/).

### Goal

We want to maximize our chances of winning Wordle in <= 3 guesses.

### General Concept 

Zach (the Riddler writer) provides a really intuitive strategy, which I restate in Python below.

Let's say we have 10 words that our mystery word might be. At each step we want to whittle down our list of possible words as much as possible.

Say we are choosing between two possible first guesses, each of which splits the words into buckets:

In [1]:
sample_words = ['at', 'ab', 'ad', 'bb', 'bd', 'cb', 'cd', 'eb', 'te', 'pe']

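mystery_guess = ['az'] # set 1: a*, set 2: remaining (no a and no z)
mystery_guess2 = ['be'] # set 1: b*, set 2: b misplaced, set 3: *e, set 4: eb, set 5: no b and no e

The above is a contrived example, but we can see that the first guess only separates the words into two sets:
- `['at', 'ab', 'ad']` and `['bb', 'bd', 'cb', 'cd', 'eb', 'te', 'pe']`
- In other words, we aren't extracting enough information. Guessing a `z` is effectively wasted, because no candidate word contains it.

Our second guess differentiates better and builds 5 sets:
- set 1: `b*` (`bb`, `bd`), set 2: `b` in the wrong position (`ab`, `cb`), set 3: `*e` (`te`, `pe`), set 4: both letters in the wrong position (`eb`), set 5: neither `b` nor `e` (`at`, `ad`, `cd`)

We can calculate the probability that the next guess is correct. We land in a set with probability (set size)/10, and a random guess within that set is right with probability 1/(set size):

#### Guessing `az`:
- 3/10 for first set, and guess likelihood is 1/3 -> EV = $\frac{3}{10} * \frac{1}{3} = \frac{1}{10}$
- 7/10 for second set, and guess likelihood is 1/7 -> EV = $\frac{7}{10} * \frac{1}{7} = \frac{1}{10}$

- Total likelihood: $\frac{3}{10} * \frac{1}{3} + \frac{7}{10} * \frac{1}{7} = \frac{2}{10}$

#### Guessing `be`:
- 2/10 for first set, and guess likelihood is 1/2 -> EV = $\frac{2}{10} * \frac{1}{2} = \frac{1}{10}$
- 2/10 for second set, and guess likelihood is 1/2 -> EV = $\frac{2}{10} * \frac{1}{2} = \frac{1}{10}$
- 2/10 for third set, and guess likelihood is 1/2 -> EV = $\frac{2}{10} * \frac{1}{2} = \frac{1}{10}$
- 1/10 for fourth set, and guess likelihood is 1 -> EV = $\frac{1}{10} * 1 = \frac{1}{10}$
- 3/10 for fifth set, and guess likelihood is 1/3 -> EV = $\frac{3}{10} * \frac{1}{3} = \frac{1}{10}$

- Total likelihood: $5 * \frac{1}{10} = \frac{5}{10}$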
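
The bucket counts for the toy example can be checked mechanically. The sketch below is a minimal illustration, not part of the Riddler solution: `feedback` and `buckets` are hypothetical helper names, and the scoring rule is the same simplified per-position rule used by `indexStatus` later in this notebook.

```python
from collections import defaultdict

sample_words = ['at', 'ab', 'ad', 'bb', 'bd', 'cb', 'cd', 'eb', 'te', 'pe']

def feedback(guess, actual):
    """Per-letter status: '2' right spot, '1' elsewhere in the word, '0' absent."""
    return ''.join(
        '2' if g == actual[i] else ('1' if g in actual else '0')
        for i, g in enumerate(guess)
    )

def buckets(guess, candidates):
    """Group the candidate words by the feedback the guess would receive."""
    groups = defaultdict(list)
    for word in candidates:
        groups[feedback(guess, word)].append(word)
    return dict(groups)

for guess in ['az', 'be']:
    groups = buckets(guess, sample_words)
    # P(land in bucket) * P(random pick inside bucket is right), summed over buckets
    win_prob = sum(len(ws) / len(sample_words) * (1 / len(ws)) for ws in groups.values())
    print(guess, len(groups), round(win_prob, 3))
```

Running this gives 2 buckets for `az` and 5 buckets for `be`, so the chance of being right on the following guess is 2/10 versus 5/10.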


#### Takeaway

As explained in the RC solution, our probability of being correct on the next guess is $\frac{n}{N}$, where `N` is the number of candidate words still possible and `n` is the number of distinct sets of words the guess creates based on per-character feedback (incorrect, correct but wrong position, correct). Each set contributes exactly $\frac{1}{N}$, as in the `az` calculation above.

So, with 5-letter words each character can be `{incorrect, correct, correct but wrong position}`, which means our theoretical max is $3^{5} = 243$ distinct sets.

The goal is to find the word that maximizes `n` at each step, which shrinks the set of words we still have to choose between as much as possible.
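
As a quick sanity check on the counting argument, the snippet below enumerates every possible feedback string and derives the resulting cap on the next-guess win probability. It is only a sketch: `max_win_prob` is a hypothetical helper, and the bound assumes nothing beyond "at most $3^L$ buckets for a word of length $L$".

```python
from itertools import product

WORD_LEN = 5

# every feedback string over the alphabet {0, 1, 2}
patterns = [''.join(p) for p in product('012', repeat=WORD_LEN)]
print(len(patterns))  # 243

def max_win_prob(num_candidates, word_len=WORD_LEN):
    """Upper bound on P(next guess is correct): at most min(3^L, N) buckets out of N words."""
    return min(3 ** word_len, num_candidates) / num_candidates
```

Not every pattern is necessarily reachable for a given guess and word list, so the real `n` is usually well below this ceiling.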


### Code: 

#### Part 1: Find the Top Word To Start With

We need the word that maximizes the number of distinct sets of words.

Solution: `TRACE`

In [21]:
import pandas as pd
import random
import time
from collections import defaultdict

# read in mystery words
mystery_corpus = pd.read_csv("data/mystery_words.csv", header=None)
mystery_list = [w[0] for w in mystery_corpus.values]
mystery_words = set(mystery_list)

# read in eligible guess words
guess_corpus = pd.read_csv("data/guess_words.csv", header=None)
guess_list = [w[0] for w in guess_corpus.values]

In [16]:
def indexStatus(idx: int, guess: str, actual: str) -> str:
    """
    Return status representing the following relative to actual mystery word:
    - '0': char not in word
    - '1': char in word but not in proper position
    - '2': char in proper pos
    """
    # see if we have a clean match
    if guess[idx] == actual[idx]:
        return '2'

    # if not, check if char even exists
    for idx2, char in enumerate(actual):
        if (idx != idx2) and (guess[idx] == char):
            return '1'
    return '0'

assert(indexStatus(1, 'dog', 'ogl') == '1')
assert(indexStatus(0, 'dog', 'dogl') == '2')
assert(indexStatus(2, 'dog', 'eat') == '0')
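
One caveat: `indexStatus` scores each position independently, so a letter repeated in the guess can be marked `1` more than once even when the mystery word contains it only once. The sketch below contrasts that simplified rule with a count-aware variant that, as far as I know, matches how the Wordle game treats duplicate letters. Both `simple_feedback` and `counted_feedback` are hypothetical names introduced here for illustration.

```python
from collections import Counter

def simple_feedback(guess, actual):
    """Same rule as indexStatus, applied to every position of the guess."""
    return ''.join(
        '2' if g == actual[i] else ('1' if g in actual else '0')
        for i, g in enumerate(guess)
    )

def counted_feedback(guess, actual):
    """Count-aware variant: a letter earns '1' only while unmatched copies remain."""
    status = ['0'] * len(guess)
    # letters of the mystery word that were not matched exactly
    remaining = Counter(a for g, a in zip(guess, actual) if g != a)
    for i, (g, a) in enumerate(zip(guess, actual)):
        if g == a:
            status[i] = '2'
    for i, g in enumerate(guess):
        if status[i] == '0' and remaining[g] > 0:
            status[i] = '1'
            remaining[g] -= 1
    return ''.join(status)

print(simple_feedback('speed', 'abide'))   # 00111
print(counted_feedback('speed', 'abide'))  # 00101
```

The two rules agree whenever the guess has no repeated letters, which is the case for `trace`, so the starter-word analysis below is unaffected by this difference.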

In [23]:
# Step 1: Find the initial word that has largest `n`
initial_dict = {}
i = 1
for guess in guess_list:
    distinct_keys = set()
    for mystery in mystery_words:
        key = ''
        for idx, char in enumerate(guess):
            key = key + indexStatus(idx, guess, mystery)

        distinct_keys.add(key)
    
    # add len of set back to dict
    initial_dict[guess] = len(distinct_keys)
    
    if i % 1000 == 0:
        print(f"Reviewing {i} words so far")
    i += 1
    
# Now we can find the best starter word:
top_word = max(initial_dict, key=initial_dict.get)
win_p = initial_dict[top_word] / len(mystery_list)
print(f"Win prob in AT MOST TWO GUESSES is: {100 * win_p:.5f} with top word {top_word}")

Reviewing 1000 words so far
Reviewing 2000 words so far
Reviewing 3000 words so far
Reviewing 4000 words so far
Reviewing 5000 words so far
Reviewing 6000 words so far
Reviewing 7000 words so far
Reviewing 8000 words so far
Reviewing 9000 words so far
Reviewing 10000 words so far
Reviewing 11000 words so far
Reviewing 12000 words so far


#### Part 2: Depending on Information Received, Maximize Chance

Depending on feedback, we then run a very similar process:

1) Our initial guess will be `TRACE`.
2) We then receive feedback on this guess, in our case in the form of the digits `0`, `1`, `2` concatenated into a string.
    - `00101` means no character is in the correct position, but `A` and `E` both exist elsewhere in our mystery word.
3) Based on the feedback, we run a similar `maximize n` process, as below:
    - We iterate through all possible guesses based on the feedback received (the example above whittles us down from `12K+ guesses to 457 possible guesses`).
    - We look at how each of the possible guesses separates the remaining eligible `mystery words`.
        - We are effectively rerunning Part 1, but on a subset of guesses and over a subset of mystery words.
4) Finally, we determine the new `max n` and make that guess.
5) Note: after this guess we get feedback again and are left with the smallest set of eligible words; from this point on it is a totally random guess within that set.
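
The steps above can be sketched as two small functions. This is a toy illustration under stated assumptions: `feedback`, `filter_candidates`, and `best_guess` are hypothetical helpers, the word lists are made up for the example, and the scoring rule is the simplified one from `indexStatus`.

```python
def feedback(guess, actual):
    """Simplified per-position rule, equivalent to calling indexStatus on each index."""
    return ''.join(
        '2' if g == actual[i] else ('1' if g in actual else '0')
        for i, g in enumerate(guess)
    )

def filter_candidates(guess, observed, candidates):
    """Keep only the candidates that would have produced the observed feedback."""
    return [w for w in candidates if feedback(guess, w) == observed]

def best_guess(guesses, candidates):
    """Return (guess, n), where n is the largest number of distinct feedback patterns."""
    scores = {g: len({feedback(g, w) for w in candidates}) for g in guesses}
    top = max(scores, key=scores.get)
    return top, scores[top]

# hypothetical toy lists, not the real Wordle lists
toy_mystery = ['abbey', 'badge', 'eagle', 'ledge', 'mango', 'pedal']
toy_guesses = toy_mystery + ['trace']

observed = feedback('trace', 'abbey')                         # step 2
remaining = filter_candidates('trace', observed, toy_mystery)  # step 3
second, n = best_guess(toy_guesses, remaining)                 # step 4
print(observed, remaining, second, n / len(remaining))
```

Here the feedback `00101` leaves two candidates, and any second guess that tells them apart gives `n = 2`, so the final guess is certain to be right in this toy case.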

Below is an example of the process for a single mystery word:

In [157]:
guess = 'trace'
mystery = mystery_list[3] # testing 3
#mystery = 'taxis' -> confirmed noily
#mystery = 'shack'
print(mystery)
feedback = ''
for idx, char in enumerate(guess):
    feedback = feedback + indexStatus(idx, guess, mystery)
print(feedback)

#feedback = '01000'

# set of words are those in 00101
step_two_remains = defaultdict(list)
for mystery in guess_list:
    key = ''
    for idx, char in enumerate(guess):
        key = key + indexStatus(idx, guess, mystery)
    step_two_remains[key].append(mystery)
    
# remaining:
print(len(step_two_remains[feedback]))

# we now rerun process above to find next best guess based on feedback
# Step 2: Find the next guess with highest n
second_dict = {}
i = 1
new_mystery = [x for x in mystery_words if x in step_two_remains[feedback]]
for guess in guess_list:
    distinct_keys = set()
    for mystery in new_mystery:
        key = ''
        for idx, char in enumerate(guess):
            key = key + indexStatus(idx, guess, mystery)
        distinct_keys.add(key)
    
    # add len of set back to dict
    second_dict[guess] = len(distinct_keys)
    
# Now we can find the best second guess:
top_word = max(second_dict, key=second_dict.get)
print(f"Total wins: {second_dict[top_word]}")
win_p = second_dict[top_word] / len(new_mystery)
print(f"Win prob in THREE STEPS when mystery word is {mystery_list[3]} is: {100 * win_p:.5f} with top word {top_word}")

abbey
00101
457
Total wins: 27
Win prob in THREE STEPS when mystery word is abbey is: 56.25000 with top word genal


### Build out Necessary Dictionary:

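We now run the above over all `mystery words`, storing a dictionary that maps each feedback pattern to the score of every candidate second guess, so that we can look up the top word per pattern.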

In [130]:
# let's find the subset of mystery words with feedback '22000' -> FTE got LOUTS as best
test = []
guess = 'trace'
for mystery in mystery_list:
    feedback = ''
    for idx, char in enumerate(guess):
        feedback = feedback + indexStatus(idx, guess, mystery)
    #print(feedback)
    if feedback == '22000':
        test.append(mystery)
        
print(test)

['troll', 'troop', 'trout', 'truly', 'trump', 'trunk', 'truss', 'trust', 'truth', 'tryst']


In [147]:
start = time.time()
first_guess = 'trace'
my_dict = defaultdict(lambda: defaultdict(int))

# build a dictionary which stores eligible guesses by feedback key
# (it only depends on the first guess, so build it once, outside the loop)
step_two_remains = defaultdict(list)
for word in guess_list:
    key = ''
    for idx, char in enumerate(first_guess):
        key = key + indexStatus(idx, first_guess, word)
    step_two_remains[key].append(word)

j = 0
for mystery in mystery_list:
    # distinct loop variable names keep `first_guess` and `mystery` from being overwritten
    feedback = ''
    for idx, char in enumerate(first_guess):
        feedback = feedback + indexStatus(idx, first_guess, mystery)

    # subset eligible mystery words
    eligible = set(step_two_remains[feedback])
    new_mystery = [x for x in mystery_words if x in eligible]

    # Step 2: Find the next guess with highest n
    for second_guess in guess_list:
        distinct_keys = set()
        for candidate in new_mystery:
            key = ''
            for idx, char in enumerate(second_guess):
                key = key + indexStatus(idx, second_guess, candidate)
            distinct_keys.add(key)

        # add len of set back to dict (accumulates once per mystery word with this feedback)
        my_dict[feedback][second_guess] += len(distinct_keys)

    # print out
    j += 1
    if j % 50 == 0:
        print(f"On step {j}, total_time: {time.time() - start:.3f}")

print(f"total_time: {time.time() - start:.3f}")

KeyboardInterrupt: 

In [148]:
print(f"On step {j}, total_time: {time.time() - start:.3f}")

On step 18, total_time: 338.709
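
Much of the cost comes from scoring every guess once per mystery word, even though all mystery words that share a feedback pattern lead to the same subproblem. One possible way to cut the work is to group the mystery words by pattern first and score each group once. The sketch below shows the idea on a made-up toy list; `feedback` and `second_guess_table` are hypothetical names, and the scoring rule is the simplified one from `indexStatus`.

```python
from collections import defaultdict

def feedback(guess, actual):
    """Simplified per-position rule, equivalent to calling indexStatus on each index."""
    return ''.join(
        '2' if g == actual[i] else ('1' if g in actual else '0')
        for i, g in enumerate(guess)
    )

def second_guess_table(first_guess, mystery_words, guess_words):
    """Map each first-guess feedback pattern to (best second guess, n, bucket size)."""
    groups = defaultdict(list)
    for word in mystery_words:
        groups[feedback(first_guess, word)].append(word)

    table = {}
    for pattern, bucket in groups.items():
        # each pattern is scored once, however many mystery words share it
        scores = {g: len({feedback(g, w) for w in bucket}) for g in guess_words}
        top = max(scores, key=scores.get)
        table[pattern] = (top, scores[top], len(bucket))
    return table

# hypothetical toy lists, not the real Wordle lists
toy_mystery = ['abbey', 'badge', 'eagle', 'ledge', 'mango', 'pedal']
toy_guesses = toy_mystery + ['trace']

table = second_guess_table('trace', toy_mystery, toy_guesses)
expected_wins = sum(n for _, n, _ in table.values())
print(table)
print(expected_wins / len(toy_mystery))
```

Summing `n` over all patterns and dividing by the number of mystery words gives the overall chance of winning by the third guess, following the same `n / N` logic used in Part 1.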


In [153]:
top_word = max(my_dict['00220'], key=my_dict['00220'].get)
top_word

'blush'

In [154]:
my_dict['00220']['blush']

10

In [156]:
for k,v in my_dict['00220'].items():
    if v >= 10:
        print(k)

blush
shlub
shuln


### To Do:

This is slow, so I think it makes sense to encode the words and feedback in NumPy arrays.
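
As a rough starting point, the sketch below encodes words as integer arrays and scores one guess against all candidate words at once, using the same simplified rule as `indexStatus`. The helper names (`encode`, `pattern_codes`, `count_buckets`) are hypothetical, and this has not been benchmarked against the loops above.

```python
import numpy as np

def encode(words):
    """Turn a list of equal-length words into an (n_words, word_len) array of letter codes."""
    return np.array([[ord(c) for c in w] for w in words], dtype=np.uint8)

def pattern_codes(guess_row, answers):
    """Base-3 feedback code of one guess (shape (L,)) against every answer (shape (N, L))."""
    word_len = guess_row.shape[0]
    exact = answers == guess_row  # (N, L): letter in the right spot
    # present[n, i]: guess letter i appears somewhere in answer n
    present = (answers[:, None, :] == guess_row[None, :, None]).any(axis=2)
    status = np.where(exact, 2, np.where(present, 1, 0))
    weights = 3 ** np.arange(word_len - 1, -1, -1)
    return status @ weights  # e.g. '00101' -> 10

def count_buckets(guess, answers):
    """Number of distinct feedback patterns the guess produces over the answers."""
    return len(np.unique(pattern_codes(encode([guess])[0], answers)))

sample_words = ['at', 'ab', 'ad', 'bb', 'bd', 'cb', 'cd', 'eb', 'te', 'pe']
sample_arr = encode(sample_words)
print(count_buckets('az', sample_arr), count_buckets('be', sample_arr))
```

Storing each pattern as a single integer also makes it cheap to precompute a full guess-by-answer matrix once and reuse it for both Part 1 and Part 2.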