# Wordle analysis

Let's load the master word dictionary from the game, [found here](https://www.powerlanguage.co.uk/wordle/main.db1931a8.js). Look for the list after `var Aa=`. I've copied that into `wordle.csv`.

In [1]:
from collections import Counter

import pandas as pd

In [2]:
words = pd.read_csv("wordle.csv", names=["whole_word"])

# and create columns holding each letter in the given position
words[["pos1", "pos2", "pos3", "pos4", "pos5"]] = words["whole_word"].str.split(
    "", expand=True
)[[1, 2, 3, 4, 5]]
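
The empty-pattern split above appears to rely on the regex split returning an empty string on either side of the letters, which is why only columns 1 to 5 are kept. A minimal alternative sketch, assuming every word is exactly five characters, indexes each character directly. The three-word `sample` frame is made up for illustration and stands in for `wordle.csv`.

```python
import pandas as pd

# Hypothetical stand-in for wordle.csv (not the real word list)
sample = pd.DataFrame({"whole_word": ["share", "arise", "alert"]})

# .str[i] picks the character at index i from every word
for i in range(5):
    sample[f"pos{i + 1}"] = sample["whole_word"].str[i]

print(sample)
```

This avoids depending on how a given pandas version treats an empty split pattern.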

## How many times does each letter occur overall? And how many times in each spot?

In [3]:
df_counts = pd.DataFrame()  # empty df to hold the answers
for v in words.columns:
    # count the # of each letter in this column
    c = Counter("".join(words[v].tolist()))
    # and then add it to the df_counts (blunt, ugly code here, sry)
    df = pd.DataFrame.from_dict(c, orient="index").sort_values(0)
    df.columns = [v]
    df_counts = df_counts.merge(df, how="outer", left_index=True, right_index=True)
    
df_counts = df_counts.fillna(0)        

In [4]:
# print out the answer
df_counts.sort_values("whole_word", ascending=False)

Unnamed: 0,whole_word,pos1,pos2,pos3,pos4,pos5
e,1233,72.0,242,177,318.0,424.0
a,979,141.0,304,307,163.0,64.0
r,899,105.0,267,163,152.0,212.0
o,754,41.0,279,244,132.0,58.0
t,729,149.0,77,111,139.0,253.0
l,719,88.0,201,112,162.0,156.0
i,671,34.0,202,266,158.0,11.0
s,669,366.0,16,80,171.0,36.0
n,575,37.0,87,139,182.0,130.0
c,477,198.0,40,56,152.0,31.0
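
The same table can be built without the merge loop. The sketch below is one possible shortcut, not the code that produced the output above: it counts letters with `value_counts` and lets pandas align the columns on the letter. The `sample` words are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real word list
sample = pd.DataFrame({"whole_word": ["share", "shale", "arise"]})
for i in range(5):
    sample[f"pos{i + 1}"] = sample["whole_word"].str[i]

# Overall letter counts: turn each word into a list of letters, then count
overall = sample["whole_word"].apply(list).explode().value_counts()

# Per-position counts: one value_counts per pos column, aligned on the letter
by_pos = sample.filter(like="pos").apply(pd.Series.value_counts)

counts = pd.concat([overall.rename("whole_word"), by_pos], axis=1)
counts = counts.fillna(0).astype(int)  # letters missing from a spot get 0
print(counts.sort_values("whole_word", ascending=False))
```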


## One way to pick a starting word: SHARE

Find the most common letter in each spot. Here is a function to do that:

In [5]:
def most_common(words):
    
    # this is just like the above
    df_counts = pd.DataFrame()
    for v in words.columns:    
        c = Counter("".join(words[v].tolist()))
        df = pd.DataFrame.from_dict(c, orient='index').sort_values(0)
        df.columns = [v]
        df_counts = df_counts.merge(df,how='outer',left_index=True,right_index=True)
        
    # now, for each position, get the most common letter (output the letter and count)    
    max1 = df_counts.filter(like='pos').idxmax()
    max2 = df_counts.filter(like='pos').max()
    max1.name, max2.name = 'letter','count'
    return pd.concat([max1,max2],axis=1)


The most common letter in any single spot is "e" in the last position.

In [6]:
most_common(words)

Unnamed: 0,letter,count
pos1,s,366.0
pos2,a,304.0
pos3,a,307.0
pos4,e,318.0
pos5,e,424.0


Now, we can repeat this analysis on the words that end in "e":

In [7]:
most_common(words.query('pos5=="e"'))

Unnamed: 0,letter,count
pos1,s,75.0
pos2,r,62.0
pos3,a,84.0
pos4,s,52.0
pos5,e,424.0


So "a" in the middle spot is the most common letter for words ending in "e". We can repeat this two more times to get to `sha-e`.

In [8]:
most_common(words.query('pos5=="e" & pos3 == "a"'))

Unnamed: 0,letter,count
pos1,s,24.0
pos2,r,20.0
pos3,a,84.0
pos4,t,11.0
pos5,e,84.0


In [9]:
most_common(words.query('pos5=="e" & pos3 == "a" & pos1 == "s"'))

Unnamed: 0,letter,count
pos1,s,24.0
pos2,h,7.0
pos3,a,24.0
pos4,r,5.0
pos5,e,24.0


At this point, each remaining word has a different fourth letter, so we should pick "SHARE" because "R" is the most common overall of the letters in the available words:

In [10]:
words.query('pos5=="e" & pos3 == "a" & pos1 == "s" & pos2 == "h"')

Unnamed: 0,whole_word,pos1,pos2,pos3,pos4,pos5
244,shake,s,h,a,k,e
306,shame,s,h,a,m,e
657,shade,s,h,a,d,e
1004,share,s,h,a,r,e
1213,shape,s,h,a,p,e
1543,shale,s,h,a,l,e
2314,shave,s,h,a,v,e


## Another way: ARISE or AROSE

What word gives us the most coverage over all the available words?

Meaning: Which word will result in the highest likelihood of at least one match?

In [11]:
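# earlier version (1+ hits only), kept for reference
# def idiot_count(word):
#     myregex = "(" + "|".join(word) + ")"
#     return words["whole_word"].str.contains(myregex).mean()

def idiot_count(word,lim=2):
    """Share of words in the master list with at least `lim` hits from `word`."""
    # regex that matches any letter of the guess, e.g. "(a|r|i|s|e)"
    myregex = "(" + "|".join(word) + ")"
    # str.count tallies every matching character, so a repeated letter counts each time
    coverage = words["whole_word"].str.count(myregex) >= lim
    return coverage.mean()

words['any_hits_1plus'] = words['whole_word'].apply(idiot_count,lim=1)
words['any_hits_2plus'] = words['whole_word'].apply(idiot_count,lim=2)
words['any_hits_3plus'] = words['whole_word'].apply(idiot_count,lim=3)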

In [12]:
words.sort_values('any_hits_1plus',ascending=False).head(5)

Unnamed: 0,whole_word,pos1,pos2,pos3,pos4,pos5,any_hits_1plus,any_hits_2plus,any_hits_3plus
560,arise,a,r,i,s,e,0.92743,0.660475,0.288553
1668,raise,r,a,i,s,e,0.92743,0.660475,0.288553
112,alone,a,l,o,n,e,0.921814,0.641901,0.238877
1252,arose,a,r,o,s,e,0.92095,0.681641,0.301944
1589,audio,a,u,d,i,o,0.917927,0.406479,0.07905
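
One nuance of the regex count: it tallies every matching character in the target word, so for the guess ARISE a word like "eerie" is credited with five hits even though only three distinct letters are shared. The sketch below is a set-based variant that counts distinct shared letters. The function names and the `sample` list are hypothetical, and whether this changes the rankings above is not something the notebook checks.

```python
def shared_letters(guess, target):
    """Number of distinct letters the guess and the target have in common."""
    return len(set(guess) & set(target))


def coverage(guess, word_list, lim=1):
    """Share of words with at least `lim` distinct letters in common with the guess."""
    hits = sum(shared_letters(guess, w) >= lim for w in word_list)
    return hits / len(word_list)


# Hypothetical mini word list
sample = ["eerie", "arise", "sooty", "clung"]
print(coverage("arise", sample, lim=1))
print(coverage("arise", sample, lim=3))
```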


Between "ARISE" and "RAISE", ARISE gives you a letter in exactly the right spot a little more often. The two words differ only in their first two letters, so it is enough to compare those positions:

In [13]:
len(words.query('(pos1=="a" | pos2 == "r")'))

396

In [14]:
len(words.query('(pos1=="r" | pos2 == "a")'))

386

And if you want to focus on 2+ matches, the next cell shows that you should switch to AROSE; "O" is in more words than "I".

In [15]:
words.sort_values('any_hits_2plus',ascending=False).head(5)

Unnamed: 0,whole_word,pos1,pos2,pos3,pos4,pos5,any_hits_1plus,any_hits_2plus,any_hits_3plus
1252,arose,a,r,o,s,e,0.92095,0.681641,0.301944
872,irate,i,r,a,t,e,0.916199,0.673866,0.298056
1563,alert,a,l,e,r,t,0.915335,0.665659,0.308855
1508,alter,a,l,t,e,r,0.915335,0.665659,0.308855
873,later,l,a,t,e,r,0.915335,0.665659,0.308855


(Why does "ARISE" give better 1+ coverage than "AROSE" even though "O" is more common than "I"? Because "I" is in more words without "ARSE" than "O" is.)

![](https://c.tenor.com/uRENbx5Ekw8AAAAC/brian-baumgartner-badumtss.gif)
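
To spell that out: ARISE and AROSE share the letters A, R, S and E, so their 1+ coverage can only differ on words that contain none of those four. A rough way to check the claim is to count such words for "I" and for "O". The helper name and the `sample` list below are made up, so the printed numbers say nothing about the real word list.

```python
def count_only(letter, excluded, word_list):
    """Count words that contain `letter` but none of the letters in `excluded`."""
    return sum(
        letter in w and not (set(w) & set(excluded)) for w in word_list
    )


# Hypothetical mini word list
sample = ["lingo", "pinky", "build", "glory", "mount", "trace"]
print(count_only("i", "arse", sample))
print(count_only("o", "arse", sample))
```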

And if you want to focus on 3+ matches, switch to ALERT (its anagrams ALTER and LATER tie with it).

In [16]:
words.sort_values('any_hits_3plus',ascending=False).head(5)

Unnamed: 0,whole_word,pos1,pos2,pos3,pos4,pos5,any_hits_1plus,any_hits_2plus,any_hits_3plus
1563,alert,a,l,e,r,t,0.915335,0.665659,0.308855
873,later,l,a,t,e,r,0.915335,0.665659,0.308855
1508,alter,a,l,t,e,r,0.915335,0.665659,0.308855
915,stare,s,t,a,r,e,0.901944,0.651836,0.307559
1252,arose,a,r,o,s,e,0.92095,0.681641,0.301944


## ALERT vs AROSE vs ARISE: Which gives most exact matches?

The three words have very similar coverage stats. How many times do the words yield exact position matches? Maybe that's a tiebreaker.

In [17]:
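wordy = words
for c in ['pos1','pos2','pos3','pos4','pos5']:
    # look up how often this word's letter occurs in this position across the list
    wordy = wordy.merge(df_counts[[c]],left_on=c,right_index=True,suffixes=('','_exact'))

# add up the five positional frequencies
wordy['tot_exact'] = wordy.filter(like='_exact').sum(axis=1)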

In [18]:
wordy.sort_values('tot_exact',ascending=False).head(10)


Unnamed: 0,whole_word,pos1,pos2,pos3,pos4,pos5,any_hits_1plus,any_hits_2plus,any_hits_3plus,pos1_exact,pos2_exact,pos3_exact,pos4_exact,pos5_exact,tot_exact
878,slate,s,l,a,t,e,0.904536,0.615983,0.27473,366.0,201,307,139.0,424.0,1437.0
2032,sauce,s,a,u,c,e,0.895032,0.557667,0.174946,366.0,304,165,152.0,424.0,1411.0
1627,slice,s,l,i,c,e,0.873866,0.540389,0.183585,366.0,201,266,152.0,424.0,1409.0
1543,shale,s,h,a,l,e,0.895896,0.553348,0.22419,366.0,144,307,162.0,424.0,1403.0
275,saute,s,a,u,t,e,0.903672,0.584449,0.225486,366.0,304,165,139.0,424.0,1398.0
1004,share,s,h,a,r,e,0.897192,0.597408,0.251404,366.0,144,307,152.0,424.0,1393.0
1530,sooty,s,o,o,t,y,0.721814,0.29892,0.081641,366.0,279,244,139.0,364.0,1392.0
2270,shine,s,h,i,n,e,0.855724,0.494168,0.151188,366.0,144,266,182.0,424.0,1382.0
1704,suite,s,u,i,t,e,0.87257,0.532613,0.190497,366.0,186,266,139.0,424.0,1381.0
1729,crane,c,r,a,n,e,0.886393,0.613823,0.25054,198.0,267,307,182.0,424.0,1378.0
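
Note what `tot_exact` measures: it adds up the positional frequencies of the five letters, so it is the total number of letter-in-position matches across the list (a word matching in three spots is counted three times), not the number of words with at least one exact match. A small plain-Python sketch of the same sum, using a made-up `sample` list:

```python
from collections import Counter

# Hypothetical mini word list
sample = ["slate", "share", "shale", "crane", "sooty"]

# One Counter per position: how many words have each letter in that spot
pos_counts = [Counter(w[i] for w in sample) for i in range(5)]


def tot_exact(guess):
    """Sum of the positional frequencies of the guess's five letters."""
    return sum(pos_counts[i][c] for i, c in enumerate(guess))


for w in sample:
    print(w, tot_exact(w))
```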


In [19]:
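def dumb_exact(word):
    ''' how many words in the master list does this word match in exactly 1 position? 2? 3? 4? 5?'''
    return (
        pd.concat(
            [(words["whole_word"].str[i] == c).astype(int) for i, c in enumerate(word)],
            axis=1,
        )
        .sum(axis=1)  # number of exact position matches for each word in the list
        .value_counts()
        # keep match counts 1..5 in a fixed order (drops 0, fills gaps with 0)
        .reindex(range(1, 6), fill_value=0)
    )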
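
For intuition, here is the same histogram idea in plain Python on a tiny, made-up word list. `exact_match_hist` is a hypothetical name, and the sketch only illustrates what the function above returns for a single guess.

```python
from collections import Counter


def exact_match_hist(guess, word_list):
    """Map k -> number of words with exactly k same-position letters as the guess."""
    hist = Counter(
        sum(g == t for g, t in zip(guess, target)) for target in word_list
    )
    # Report k = 1..5 explicitly so missing values show up as 0
    return {k: hist.get(k, 0) for k in range(1, 6)}


# Hypothetical mini word list
sample = ["share", "shale", "stare", "crane", "sooty"]
print(exact_match_hist("share", sample))
```

The guess itself is in the list, which is why there is always one word with five matches.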

In [20]:
# apply the function
test = wordy['whole_word'].apply(dumb_exact).fillna(0)

# more sensible columns are "# words with 1+ exact positional matches"
# so we need to cum sum across the columns, except right-to-left
test = test[test.columns[::-1]].cumsum(axis=1)
test.columns = ['5+ exact pos match','4+ exact pos match','3+ exact pos match','2+ exact pos match','1+ exact pos match']
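
The right-to-left cumulative sum turns "exactly k matches" counts into "k or more matches" counts. A minimal sketch with invented numbers (the `exact` frame below is not real output):

```python
import pandas as pd

# Hypothetical counts of words with exactly 1..5 exact-position matches
exact = pd.DataFrame({1: [700], 2: [230], 3: [50], 4: [10], 5: [1]}, index=["guess"])

# Reverse the columns to 5..1, then accumulate left to right
at_least = exact[exact.columns[::-1]].cumsum(axis=1)
at_least.columns = [f"{k}+" for k in at_least.columns]
print(at_least)
```

Here "2+" comes out as 1 + 10 + 50 + 230, i.e. every word with two or more exact matches.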

In [21]:
# attach the match counts, then keep only the 10 words with the most 2+ exact position matches
wordy = pd.concat([wordy, test], axis=1).sort_values("2+ exact pos match").tail(10)

In [22]:
wordy.sort_values("any_hits_2plus", ascending=False).head(5)

Unnamed: 0,whole_word,pos1,pos2,pos3,pos4,pos5,any_hits_1plus,any_hits_2plus,any_hits_3plus,pos1_exact,pos2_exact,pos3_exact,pos4_exact,pos5_exact,tot_exact,5+ exact pos match,4+ exact pos match,3+ exact pos match,2+ exact pos match,1+ exact pos match
528,stale,s,t,a,l,e,0.904536,0.615983,0.27473,366.0,77,307,162.0,424.0,1336.0,1.0,12.0,54.0,289.0,980.0
878,slate,s,l,a,t,e,0.904536,0.615983,0.27473,366.0,201,307,139.0,424.0,1437.0,1.0,6.0,62.0,290.0,1078.0
1729,crane,c,r,a,n,e,0.886393,0.613823,0.25054,198.0,267,307,182.0,424.0,1378.0,1.0,6.0,54.0,305.0,1012.0
1004,share,s,h,a,r,e,0.897192,0.597408,0.251404,366.0,144,307,152.0,424.0,1393.0,1.0,16.0,73.0,303.0,1000.0
1543,shale,s,h,a,l,e,0.895896,0.553348,0.22419,366.0,144,307,162.0,424.0,1403.0,1.0,12.0,59.0,310.0,1021.0


In [23]:
wordy.sort_values('2+ exact pos match',ascending=False).head(5)

Unnamed: 0,whole_word,pos1,pos2,pos3,pos4,pos5,any_hits_1plus,any_hits_2plus,any_hits_3plus,pos1_exact,pos2_exact,pos3_exact,pos4_exact,pos5_exact,tot_exact,5+ exact pos match,4+ exact pos match,3+ exact pos match,2+ exact pos match,1+ exact pos match
2270,shine,s,h,i,n,e,0.855724,0.494168,0.151188,366.0,144,266,182.0,424.0,1382.0,1.0,7.0,62.0,315.0,997.0
1543,shale,s,h,a,l,e,0.895896,0.553348,0.22419,366.0,144,307,162.0,424.0,1403.0,1.0,12.0,59.0,310.0,1021.0
1729,crane,c,r,a,n,e,0.886393,0.613823,0.25054,198.0,267,307,182.0,424.0,1378.0,1.0,6.0,54.0,305.0,1012.0
1004,share,s,h,a,r,e,0.897192,0.597408,0.251404,366.0,144,307,152.0,424.0,1393.0,1.0,16.0,73.0,303.0,1000.0
1750,slant,s,l,a,n,t,0.863067,0.510151,0.178402,366.0,201,307,182.0,253.0,1309.0,1.0,4.0,47.0,292.0,965.0


In [31]:
words.query('pos4 == "u" & pos2 == "e"')


Unnamed: 0,whole_word,pos1,pos2,pos3,pos4,pos5,any_hits_1plus,any_hits_2plus,any_hits_3plus
1,rebut,r,e,b,u,t,0.841037,0.499784,0.18013
196,rebus,r,e,b,u,s,0.843197,0.487257,0.168467
332,fetus,f,e,t,u,s,0.812959,0.450972,0.148164
566,recut,r,e,c,u,t,0.866955,0.535637,0.196112
859,begun,b,e,g,u,n,0.760691,0.368467,0.096328
962,demur,d,e,m,u,r,0.804752,0.452268,0.142981
1145,venue,v,e,n,u,e,0.717927,0.271706,0.053132
1203,serum,s,e,r,u,m,0.848812,0.494168,0.171058
1314,lemur,l,e,m,u,r,0.862635,0.511447,0.170194
1370,segue,s,e,g,u,e,0.754644,0.317927,0.07689
