In [5]:
import pandas as pd
from concurrent.futures import ProcessPoolExecutor
from tqdm.notebook import tqdm
from functools import partial
import random

### Basic Methodology


The source code in `helper.py` can be examined directly. The basic process:

1. For every target word, create a list of dictionaries holding the Wordle line score for each of the 12,000+ possible guesses:

```python
    [{
        'score': get_num_line(x, target_word),
        'target': target_word,
        'guess': x
    } for x in short_words]
```
2. Then, using that list as input, convert each score to a string and, for every unique result string (e.g. `11010`), sum the frequencies of the words that could produce it. If the only match is one rare word such as `aahed`, the value will be low (0); if several common words match, it will be high. Each target word now has its own dictionary that looks like:

```
{'00000': 12134875945,
 '00001': 1061445461,
 '00002': 2490766909,
 '00010': 2277035094,
 '00011': 74674653,
 '00012': 349391730,
 '00020': 732438223,
 '00021': 5105185,
 '00022': 53822692,
 '00100': 6964659770,
 '00101': 166668574,
 '00102': 1014367158,
 '00110': 772725995,
 '00111': 3277303,
 '00112': 9080173
 ...
 }
 
``` 

That is all the precomputed data needed to solve Wordles!
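The two steps above can be sketched as follows. This is a minimal illustration, not the code in `helper.py`: `score_guess` and `pattern_weights` are hypothetical names, and the digit encoding (0 = letter absent, 1 = letter present elsewhere, 2 = letter in the correct position) is an assumption about how the result strings are built.

```python
from collections import Counter, defaultdict

def score_guess(guess, target):
    """Score a guess against a target as a digit string (assumed encoding:
    2 = correct position, 1 = present elsewhere, 0 = absent)."""
    result = ['0'] * len(guess)
    remaining = Counter()  # target letters not matched in place
    # First pass: mark exact matches and collect the unmatched target letters.
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            result[i] = '2'
        else:
            remaining[t] += 1
    # Second pass: mark misplaced letters, using each target letter only once.
    for i, g in enumerate(guess):
        if result[i] == '0' and remaining[g] > 0:
            result[i] = '1'
            remaining[g] -= 1
    return ''.join(result)

def pattern_weights(target, guesses, freqs):
    """Sum the frequencies of all guesses that yield each result string."""
    totals = defaultdict(int)
    for guess in guesses:
        totals[score_guess(guess, target)] += freqs.get(guess, 0)
    return dict(totals)
```

Calling `pattern_weights(target_word, short_words, freqs)` would then give one dictionary per target word, shaped like the sample above.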

### Read in target and guessable words

In [1]:

all_words = pd.read_csv('wordle-dictionary-full.txt',header=None)[0].tolist()
target_words = pd.read_csv('wordle-targets_2022-02-15.txt',header=None)[0].tolist()

### Load the functions from helper.py and make the frequencies

In [2]:

from helper import helper_func,make_freqs

The `make_freqs` function is borrowed from the original Kaggle project. It ranks how common each guessable Wordle word is, using counts derived from the Google Web Trillion Word Corpus, which [is available on Kaggle](https://www.kaggle.com/rtatman/english-word-frequency).

In [3]:
freqs = make_freqs()

Here we can see the results. The function inserts 0s for missing words (rather than a minimum value, as the function originally did).

In [6]:
for s in random.sample(all_words,10):
    print(f"{s} - {freqs.get(s,0)}")


lutea - 94001
volts - 2454421
raxed - 0
mobey - 0
bling - 1627355
wifes - 563949
funny - 34281806
bimah - 0
chill - 4489643
dyers - 89927


### Run the code on all target words

In [7]:

p = partial(helper_func,freqs=freqs)

with ProcessPoolExecutor(max_workers=8) as executor:
    all_enhanced_counters = list(tqdm(executor.map(p, target_words), total=len(target_words)))


  0%|          | 0/2309 [00:00<?, ?it/s]

In [8]:
zipped_counters = list(zip(target_words, all_enhanced_counters))

### Save the file

In [10]:
import json
# Context manager closes the file once the dump finishes
with open("zipped_counters_nyt_2022_02_15.json", "w") as f:
    json.dump(zipped_counters, f, indent=4)
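When the file is read back, JSON has no tuple type, so each `(target, counter)` pair comes back as a two-element list. A minimal sketch of reloading the data into a dictionary keyed by target word (`load_counters` is a hypothetical helper, not part of `helper.py`):

```python
import json

def load_counters(path):
    """Load the saved [target, counter] pairs and index them by target word."""
    with open(path) as f:
        pairs = json.load(f)  # tuples were written as two-element lists
    return {target: counter for target, counter in pairs}
```

For example, `load_counters("zipped_counters_nyt_2022_02_15.json")` would return a mapping from each target word to its result-string dictionary.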

In [9]:
# pickle.dump(zipped_counters, open( "zipped_counters_nyt_2022_02_15.pickle", "wb" ) )

### Run the same code on _all_ the guessable words

One might feel that using the 2315-word target dictionary is too much inside information (which [I think it is for my other bot](https://twitter.com/thewordlebot/status/1481628447541809162?s=20&t=XikGUr5F4Pb2ICovJigjdA)). We can precompute the dictionaries for all 12,000+ guessable words instead.

Since some of the true Wordle answers might score 0 in our frequency dictionary (there is actually exactly one such word), I'll remake the `freqs` dictionary with a non-zero minimum value.

This dataset can be accessed in the class with `TwitterWordle(use_limited_targets=False)`.
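As a rough sketch of the idea, the snippet below finds words with a zero frequency and then raises every value to a floor. The helper names are hypothetical, and the toy dictionary reuses three values from the sample output earlier in the notebook; the real `freqs` dictionary covers all guessable words.

```python
def zero_frequency_words(words, freqs):
    """Return the words whose frequency is missing or zero."""
    return [w for w in words if freqs.get(w, 0) == 0]

def floor_frequencies(freqs, minimum):
    """Return a copy of freqs with every value raised to at least `minimum`."""
    return {word: max(count, minimum) for word, count in freqs.items()}

# Toy data for illustration, taken from the sample output above.
toy_freqs = {"funny": 34281806, "chill": 4489643, "raxed": 0}

print(zero_frequency_words(["funny", "raxed"], toy_freqs))  # ['raxed']
print(floor_frequencies(toy_freqs, 12716)["raxed"])         # 12716
```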

In [None]:
# Floor every frequency at 12716 so that no word is left with a zero weight
new_freqs = {key: max(val, 12716) for key, val in freqs.items()}
p = partial(helper_func,freqs=new_freqs)

with ProcessPoolExecutor(max_workers=8) as executor:
    all_enhanced_counters_full_list = list(tqdm(executor.map(p, all_words), total=len(all_words)))

import pickle

with open("zipped_counters_allwords.pickle", "wb") as f:
    pickle.dump(all_enhanced_counters_full_list, f)
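Unlike the JSON file, this pickle stores only the list of counters, without the words. Because `executor.map` returns results in input order, the counters can be paired with their words again on load. A minimal sketch, assuming the same `all_words` list (in the same order) is available when loading; `load_full_counters` is a hypothetical helper:

```python
import pickle

def load_full_counters(path, words):
    """Pair each saved counter with its word, assuming the saved order matches `words`."""
    with open(path, "rb") as f:
        counters = pickle.load(f)
    if len(counters) != len(words):
        raise ValueError("word list and saved counters differ in length")
    return dict(zip(words, counters))
```

For example, `load_full_counters("zipped_counters_allwords.pickle", all_words)` would map each guessable word to its result-string dictionary.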