This notebook is an attempt to find the ideal first word for the Wordle game.

# Load 5-letter words and split into 5 columns

- - -

Using Owen Yin's Wordle list from [here](https://medium.com/@owenyin/here-lies-wordle-2021-2027-full-answer-list-52017ee99e86). Analysis always starts with a good dataset.

In [135]:
# import string
import pandas as pd

# using Owen Yin's Wordle list from here: 
# https://medium.com/@owenyin/here-lies-wordle-2021-2027-full-answer-list-52017ee99e86
df = pd.read_csv('nyt.txt', header=None, names=['word'])

# Filter out rows where words are not exactly 5 characters long. This should not
# be necessary as the word list should already be limited to words that are only
# 5 characters long.
df = df[df['word'].str.len()==5]

# split each word into its characters
df[['char1','char2','char3','char4','char5']] = df['word'].apply(lambda x: pd.Series(list(x)))

# mark each row as to whether or not it consists of 5 unique letters
df['unique'] = df['word'].apply(lambda x: len(set(list(x)))==5)

print(F"number of words = {len(df)}")
df.to_csv('df.csv')
df.head()

number of words = 2315


Unnamed: 0,word,char1,char2,char3,char4,char5,unique
0,CIGAR,C,I,G,A,R,True
1,REBUT,R,E,B,U,T,True
2,SISSY,S,I,S,S,Y,False
3,HUMPH,H,U,M,P,H,False
4,AWAKE,A,W,A,K,E,False


# Analyze characters of each word

- - -

### Show the frequency of each letter in each position

Use Pandas `value_counts()` to count the number of times characters appear in each of the five columns. The result is a 26-row dataframe (one row for each letter of the alphabet). The five columns show how many times each letter appears in that position of the words in the dataset.

In [136]:
# create an empty dataframe to collect the per-position counts
df_freq = pd.DataFrame()
# df_freq.index = list(string.ascii_uppercase)

# create value counts for each column and concat them together
for col_name in ['char1','char2','char3','char4','char5']:
    temp_df = df[col_name].value_counts().to_frame()
    df_freq = pd.concat([df_freq, temp_df],axis=1)

# concat can create NaN values in Pandas; replace those with zero
df_freq.fillna(0,inplace=True)

# NaN in a column forces the column to be float; convert to int
df_freq = df_freq.astype(int)

df_freq.columns = ['char1','char2','char3','char4','char5']
df_freq = df_freq.sort_index()
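
As a rough illustration of what the per-position counts contain, here is a minimal sketch that uses only the standard library on a tiny sample. The names `sample_words` and `position_counts` are hypothetical and the three words are only a stand-in for the real list.

```python
from collections import Counter

# Hypothetical three-word sample, not the full Wordle list
sample_words = ["CIGAR", "REBUT", "SISSY"]

# zip(*sample_words) regroups the words into five columns of letters,
# one column per position; each column gets its own Counter
position_counts = [Counter(column) for column in zip(*sample_words)]

for position, counts in enumerate(position_counts, start=1):
    print(f"char{position}: {dict(counts)}")
```

A letter that never appears in a position simply has no entry in that position's `Counter` and reads as zero, which plays the same role as the `fillna(0)` step in the cell above.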

### Show the number of words that contain each letter

The `total` column shows the number of *words* in which each letter appears. The word `SISSY` is one word (not three) that contains the letter `S`. That makes the logic to create that column more complex than simply summing each row.

In [137]:
from collections import Counter

# Initialize a Counter to keep track of letter counts
letter_counts = Counter()

# Iterate over each word, convert to a set of unique letters, and update the Counter
for word in df['word']:
    unique_letters = set(word)
    letter_counts.update(unique_letters)

df_freq = pd.concat(
    [df_freq, pd.DataFrame.from_dict(letter_counts, orient='index', columns=['total'])],
    axis=1
)

df_freq.to_csv('df_freq.csv')
df_freq

Unnamed: 0,char1,char2,char3,char4,char5,total
A,141,304,307,163,64,909
B,173,16,57,24,11,267
C,198,40,56,152,31,448
D,111,20,75,69,118,370
E,72,242,177,318,424,1056
F,136,8,25,35,26,207
G,115,12,67,76,41,300
H,69,144,9,28,139,379
I,34,202,266,158,11,647
J,20,2,3,2,0,27
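
To make the difference between counting occurrences and counting words concrete, here is a small sketch on a hypothetical three-word sample. The names `occurrences` and `words_containing` are illustrative and do not appear elsewhere in the notebook.

```python
from collections import Counter

# Hypothetical sample, not the full Wordle list
sample_words = ["SISSY", "STARE", "AWAKE"]

occurrences = Counter()       # counts every occurrence of a letter
words_containing = Counter()  # counts each letter at most once per word

for word in sample_words:
    occurrences.update(word)            # SISSY adds 3 to "S"
    words_containing.update(set(word))  # SISSY adds 1 to "S"

print(occurrences["S"], words_containing["S"])  # prints: 4 2
```

Converting each word to a `set` before updating the counter is what keeps `SISSY` from counting three times toward `S`.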


### Show most common letters for each position

The following step sorts the results from the previous step to show the most common letters in each of the five positions. The resulting DataFrame is for visual inspection only; it is not referenced anywhere else in this notebook.

In [138]:
df_sorted = pd.DataFrame()

# go column by column and retrieve the letters ordered from most to least
# common by sorting df_freq on that column in descending order
for col_name in df_freq.columns:
    sort_list = df_freq.sort_values(by=[col_name],ascending=False).index
    df_sorted[col_name] = sort_list

df_sorted.to_csv('df_sorted.csv')
df_sorted.head()

Unnamed: 0,char1,char2,char3,char4,char5,total
0,S,A,A,E,E,E
1,C,O,I,N,Y,A
2,B,R,O,S,T,R
3,T,E,E,A,R,O
4,P,I,U,L,L,T


# Calculate scores

- - -

### Calculate scores based on overall popularity

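Score each word by summing the `total` value of each of its letters. The highest-scoring words all contain repeated letters, because a repeated letter is added once for every time it appears. These are bad choices for a first Wordle guess, since a repeated letter tests fewer distinct letters.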

In [139]:
df['score'] = df['word'].apply(
    lambda word: sum(df_freq['total'][y] for y in word)
)
df[['word','score','unique']].sort_values(by=['score'], ascending=False).head()

Unnamed: 0,word,score,unique
620,EERIE,4652,False
737,EATER,4525,False
1522,ERASE,4476,False
1857,RARER,4476,False
1565,ELATE,4336,False
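
The inflation caused by repeated letters can be seen in a small sketch. The values in `letter_totals` are made up for illustration and are not taken from `df_freq`. The `distinct_score` function is a hypothetical alternative that counts each letter once; the notebook itself filters on the `unique` column instead.

```python
# Hypothetical per-letter word counts, for illustration only
letter_totals = {"E": 10, "R": 8, "I": 6, "A": 9, "T": 7}

def raw_score(word):
    """Sum the total for every letter, counting repeats each time."""
    return sum(letter_totals[letter] for letter in word)

def distinct_score(word):
    """Sum the total for each distinct letter only once."""
    return sum(letter_totals[letter] for letter in set(word))

for word in ["EERIE", "IRATE"]:
    print(word, raw_score(word), distinct_score(word))
# prints:
# EERIE 44 24
# IRATE 40 40
```

With these made-up values, `EERIE` beats `IRATE` on the raw score only because `E` is added three times; once repeats are ignored, the word with five distinct letters comes out ahead.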


This cell simply filters out words that don't have 5 unique letters.

In [140]:
# keep only words with 5 unique letters (no repeated letters)
df.loc[df['unique'],['word','score']].sort_values(by=['score'], ascending=False).head(10)

Unnamed: 0,word,score
1563,ALERT,4117
873,LATER,4117
1508,ALTER,4117
872,IRATE,4116
1252,AROSE,4093
915,STARE,4087
1668,RAISE,4067
560,ARISE,4067
1346,RENAL,4000
1278,LEARN,4000


### Calculate tie breaker based on positional popularity

In [141]:
# add up the points for each letter of the word based on
# its popularity in that position of df_freq
def calc_positional_score(row):
    col_names = ['char1','char2','char3','char4','char5']
    total = 0
    for col in col_names:
        letter = row[col]
        total += df_freq[col][letter]
    return total

df['tie break'] = df.apply(calc_positional_score, axis=1)
df.loc[df['unique'],['word','score','tie break']].sort_values(by=['score','tie break'], ascending=False).head(10)

Unnamed: 0,word,score,tie break
873,LATER,4117,1033
1508,ALTER,4117,983
1563,ALERT,4117,924
872,IRATE,4116,1171
1252,AROSE,4093,1247
915,STARE,4087,1326
1668,RAISE,4067,1270
560,ARISE,4067,1269
1278,LEARN,4000,919
1346,RENAL,4000,805
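
Anagrams such as `LATER`, `ALTER`, and `ALERT` always tie on the overall score because they contain the same letters, so position is the only thing that can separate them. The following is a minimal standard-library sketch of that idea on a hypothetical four-word sample; `sample_words`, `position_counts`, and `positional_score` are illustrative names, and the resulting numbers are unrelated to the real dataset.

```python
from collections import Counter

# Hypothetical sample, not the full Wordle list
sample_words = ["LATER", "ALTER", "ALERT", "WATER"]

# One Counter per letter position, built from the columns of the sample
position_counts = [Counter(column) for column in zip(*sample_words)]

def positional_score(word):
    """Add up how often each letter occurs at its own position in the sample."""
    return sum(position_counts[i][letter] for i, letter in enumerate(word))

for word in ["LATER", "ALTER", "ALERT"]:
    print(word, positional_score(word))
# prints:
# LATER 12
# ALTER 13
# ALERT 7
```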


### Calculate scores based on a hybrid approach

The hybrid score is just the sum of the score and the tie break points. This considers the overall popularity of each letter as well as the positional popularity of each letter.

In [142]:
df['hybrid'] = df['score'] + df['tie break']
df.loc[df['unique'],['word','score','tie break','hybrid']].sort_values(by='hybrid',ascending=False).head()

Unnamed: 0,word,score,tie break,hybrid
915,STARE,4087,1326,5413
1252,AROSE,4093,1247,5340
1668,RAISE,4067,1270,5337
560,ARISE,4067,1269,5336
878,SLATE,3898,1437,5335


# Words with the most unique vowels or consonants

- - -

In [143]:
# returns a point for each unique vowel in a word (Y is counted as a vowel here)
def score_vowels(val):
    total = 0
    for letter in 'AEIOUY':
        if letter in val:
            total += 1
    return total

df['vowels'] = df['word'].apply(score_vowels)
df[['word','vowels','score']].sort_values(by=['vowels','score'], ascending=False).head()


Unnamed: 0,word,vowels,score
1589,AUDIO,4,3056
352,BAYOU,4,2723
872,IRATE,3,4116
1252,AROSE,3,4093
560,ARISE,3,4067


In [144]:
# returns a point for each consonant in the word, counting repeated consonants each time
def score_consonants(val):
    total = 0
    for letter in val:
        if letter in 'BCDFGHJKLMNPQRSTVWXZ':
            total += 1
    return total

df['consonants'] = df['word'].apply(score_consonants)
df[['word','consonants','score']].sort_values(by=['consonants','score'], ascending=False).head()

Unnamed: 0,word,consonants,score
1297,STERN,4,3728
84,START,4,3698
1128,CREST,4,3626
736,CRESS,4,3577
533,SNARL,4,3562


- - -

# Cheating at Wordle

If you are so inclined, there is some filtering that can be done once you have made one or more guesses. The dataset can be filtered to:
1. Show words that contain a list of letters you know are in the solution. These are letters that are either green or yellow in Wordle.
1. Show words that do *not* contain a list of letters you know are *not* in the solution. These are letters on the Wordle keyboard that are grayed out.
1. Show words that have specific letters in a specific position. These are letters that are green in Wordle.

This is cheating, but you can use this procedure to search for words that contain *all* of the letters in `include_letters`, *none* of the letters in `exclude_letters`, and the letters in `known_letters` at their given positions.

In [152]:
# words must include ALL of the following letters - specify empty string if needed
include_letters = 'AOT'

# words cannot include ANY of the following letters - specify empty string if needed
exclude_letters = 'NCID'

# words must have the specified letter in the specified position in this list - specify
# an empty list [ ] if no letter positions are known, and use None for any position
# where the character is unknown
known_letters = ['A', None, None, 'O', None]

# make a copy of only the 'word' column for the entire dataset
df_cheat = df['word'].copy().to_frame()

# if there are letters to exclude, create a regex of those letters and exclude them with ~
if len(exclude_letters):
    pattern = F"[{exclude_letters}]"
    df_cheat = df_cheat[~df_cheat['word'].str.contains(pattern)]

# save only words that contain each letter in include_letters
for letter in include_letters:
    df_cheat = df_cheat[df_cheat['word'].str.contains(letter)]

# save only words that have the specified letter in the specified position
for idx, letter in enumerate(known_letters):
    # we have to specify a starting position in the case where letters appear more than once
    if letter:
        df_cheat = df_cheat[df_cheat['word'].str.find(letter,idx)==idx]

df_cheat


Unnamed: 0,word
1222,AFOOT
1697,ABBOT
2098,ALLOT


In [146]:
# words must include ALL of the following letters - specify empty string if needed
include_letters = 'AOT'

# words cannot include ANY of the following letters - specify empty string if needed
exclude_letters = 'NCID'

# words must have specified letter in the specified position - specify { } if needed
known_letters = {
    '1': 'A',
    '4': 'O'
}

# make a copy of only the 'word' column for the entire dataset
df_cheat = df['word'].copy().to_frame()

# if there are letters to exclude, create a regex of those letters and exclude them with ~
if len(exclude_letters):
    pattern = F"[{exclude_letters}]"
    df_cheat = df_cheat[~df_cheat['word'].str.contains(pattern)]

# save only words that contain each letter in include_letters
for letter in include_letters:
    df_cheat = df_cheat[df_cheat['word'].str.contains(letter)]

# save only words that have the specified letter in the specified position
for k,v in known_letters.items():
    # the user will specify a 1-based position, so convert to zero-based
    start = int(k) - 1
    # we have to specify a starting position in the case where letters appear more than once
    df_cheat = df_cheat[df_cheat['word'].str.find(v,start)==start]

df_cheat

Unnamed: 0,word
1222,AFOOT
1697,ABBOT
2098,ALLOT
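
The three filters can also be wrapped in a single reusable function. The sketch below is one possible way to do that in plain Python, without pandas. The function name `filter_candidates`, the integer 1-based keys in `known_letters`, and the six-word sample are all assumptions made for illustration; the cells above use string keys or a list instead.

```python
def filter_candidates(words, include_letters="", exclude_letters="", known_letters=None):
    """Return the words consistent with all three kinds of Wordle feedback."""
    known_letters = known_letters or {}
    matches = []
    for word in words:
        # reject words containing any letter known to be absent
        if any(letter in word for letter in exclude_letters):
            continue
        # reject words missing any letter known to be present
        if not all(letter in word for letter in include_letters):
            continue
        # reject words whose letter at a known 1-based position differs
        if any(word[pos - 1] != letter for pos, letter in known_letters.items()):
            continue
        matches.append(word)
    return matches

# Hypothetical sample, not the full Wordle list
sample_words = ["AFOOT", "ABBOT", "ALLOT", "ACTOR", "ADOPT", "TOAST"]
result = filter_candidates(sample_words, "AOT", "NCID", {1: "A", 4: "O"})
print(result)  # prints: ['AFOOT', 'ABBOT', 'ALLOT']
```

Comparing `word[pos - 1]` directly with the expected letter avoids the need for a starting offset when a letter appears more than once in a word.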
