This notebook is an attempt to find the ideal first word for the Wordle game.

# Load 5-letter words and split into 5 columns

- - -

The word list is Owen Yin's Wordle answer list, published [here](https://medium.com/@owenyin/here-lies-wordle-2021-2027-full-answer-list-52017ee99e86) and read below from a local file named `nyt.txt`. Analysis always starts with a good dataset.

In [None]:
import string
import pandas as pd

# using Owen Yin's Wordle list from here: 
# https://medium.com/@owenyin/here-lies-wordle-2021-2027-full-answer-list-52017ee99e86
df = pd.read_csv('nyt.txt', header=None, names=['word'])

# Filter out rows where words are not exactly 5 characters long. This should not
# be necessary as the word list should already be limited to words that are only
# 5 characters long.
df = df[df['word'].str.len()==5]

# split each word into its characters
df[['char1','char2','char3','char4','char5']] = df['word'].apply(lambda x: pd.Series(list(x)))

# len(df) counts rows (words); df.size would count every cell (rows x columns)
print(f"number of words = {len(df)}")
df.head()

# Analyze characters of each word

- - -

The characters for each word can be analyzed to count how many times each character appears in each of the five positions of the five-character words. Calculating a sum for each row will show how many times each letter shows up overall.

### Show the frequency of each letter in each position

Use Pandas `value_counts()` to count the number of times characters appear in each of the five columns. The result is a 26-row dataframe (one row for each letter of the alphabet). The five columns show how many times each letter appears in that position of the words in the dataset.

The `sum` column shows the totals.
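
As a small illustration of the counting step, the sketch below builds the same kind of table for a made-up list of three words. The word list and the names `toy` and `toy_freq` are assumptions for illustration only; they are not part of the notebook's dataset, and the table is built slightly differently from the cell that follows.

```python
import string
import pandas as pd

# hypothetical three-word list, for illustration only
toy = pd.DataFrame({'word': ['CRANE', 'SLATE', 'CRATE']})

# count how often each letter occurs at each of the five positions
toy_freq = pd.DataFrame(
    {f'char{i + 1}': toy['word'].str[i].value_counts() for i in range(5)}
)

# add rows for letters that never occur, fill gaps with zero, total each row
toy_freq = toy_freq.reindex(list(string.ascii_uppercase)).fillna(0).astype(int)
toy_freq['sum'] = toy_freq.sum(axis=1)

print(toy_freq.loc[['C', 'S', 'A', 'E']])
```

In this toy list `A` appears three times, always in the third position, so its row reads `0, 0, 3, 0, 0` with a `sum` of 3.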

In [None]:
# create an empty dataframe whose index is the alphabet
df_freq = pd.DataFrame()
df_freq.index = list(string.ascii_uppercase)

# create value counts for each column and concat them together
for col_name in ['char1','char2','char3','char4','char5']:
    temp_df = df[col_name].value_counts().to_frame()
    df_freq = pd.concat([df_freq, temp_df],axis=1)

# concat can create NaN values in Pandas; replace those with zero
df_freq.fillna(0,inplace=True)

# NaN in a column forces the column to be float; convert to int
df_freq = df_freq.astype(int)

df_freq.columns = ['char1','char2','char3','char4','char5']

# create a column of sums showing the total occurrences of each letter
df_freq['sum'] = df_freq.sum(axis=1)
df_freq


### Show most common letters for each position

In [None]:
# most popular characters in each position

df_sorted = pd.DataFrame()

# go column by column and list the letters from most to least common
# by sorting df_freq on that column in descending order
for col_name in df_freq.columns:
    sort_list = df_freq.sort_values(by=[col_name],ascending=False).index
    df_sorted[col_name] = sort_list

df_sorted.head(10)


In [None]:
df_freq.sort_values(by='sum',ascending=False)['sum'].head(10)

# Calculate scores

- - -

### Calculate scores based on overall popularity

In [None]:
df['popularity score'] = df['word'].apply(lambda word: sum(df_freq['sum'][y] for y in word))
df.sort_values(by=['popularity score'], ascending=False).head(10)

In [None]:
# flag words whose five letters are all different, then rank only those
df['unique'] = df['word'].apply(lambda x: len(set(x)) == 5)
df[df['unique']].sort_values(by=['popularity score'], ascending=False).head(20)

### Calculate scores based on positional popularity

In [None]:
def calc_positional_score(row):
    col_names = ['char1','char2','char3','char4','char5']
    total = 0
    for col in col_names:
        letter = row[col]
        total += df_freq[col][letter]
    return total

df['positional score'] = df.apply(calc_positional_score, axis=1)
df.sort_values(by=['positional score'], ascending=False).head(20)
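
To make the difference between the two scores concrete, here is a minimal sketch on a made-up four-word list. The words and the helper names `popularity_score` and `positional_score` are hypothetical and exist only for this example; the notebook itself computes the scores from `df_freq`.

```python
import pandas as pd

# hypothetical word list, for illustration only
words = ['CRANE', 'SLATE', 'CRATE', 'EERIE']

# positional counts: pos_counts[i][letter] = words with `letter` at position i
pos_counts = [pd.Series([w[i] for w in words]).value_counts() for i in range(5)]

# overall counts: total occurrences of each letter across all positions
overall = pd.Series(list(''.join(words))).value_counts()

def popularity_score(word):
    # sum of overall letter counts; a repeated letter is counted every time
    return int(sum(overall[ch] for ch in word))

def positional_score(word):
    # sum of the counts for each letter at the position where it appears
    return int(sum(pos_counts[i][ch] for i, ch in enumerate(word)))

for w in words:
    print(w, popularity_score(w), positional_score(w))
```

On this list EERIE gets the highest popularity score (22) because each of its three E's is counted, but the lowest positional score (8). That is the effect the `unique` filter above guards against when ranking by popularity.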

### Calculate scores based on a hybrid approach

In [None]:
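def calc_hybrid_score(word):
    # For each letter, add its count at the position where it appears
    # (positional popularity) plus its counts across all five positions
    # (overall popularity).
    total = 0
    for idx, letter in enumerate(word):
        total += df_freq.iloc[:, idx][letter]
        for col in ['char1','char2','char3','char4','char5']:
            total += df_freq[col][letter]
    return total

# calc_hybrid_score(word) equals popularity score + positional score, so the
# column can be computed directly from the two existing score columns
df['hybrid score'] = df['popularity score'] + df['positional score']

# 'unique' is a boolean column, so use it directly as the row filter
df[df['unique']].sort_values(by=['hybrid score'], ascending=False).head(10)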

In [None]:
# we don't need the character columns in our main dataframe anymore

df.drop(labels=['char1','char2','char3','char4','char5'],axis=1,inplace=True)

# Words with the most unique vowels or consonants

- - -

In [None]:
# returns a point for each unique vowel in a word (Y is counted as a vowel)
def score_vowels(val):
    total = 0
    for letter in 'AEIOUY':
        if letter in val:
            total += 1
    return total

df['vowels'] = df['word'].apply(lambda x: score_vowels(x))
df.sort_values(by=['vowels','popularity score','positional score'], ascending=False).head(10)


In [None]:
# returns a point for each consonant in the word; unlike score_vowels,
# a repeated consonant is counted each time it appears
def score_consonants(val):
    total = 0
    for letter in val:
        if letter in 'BCDFGHJKLMNPQRSTVWXZ':
            total += 1
    return total

df['consonants'] = df['word'].apply(lambda x: score_consonants(x))
df.sort_values(by=['consonants','popularity score','positional score'], ascending=False).head(10)

# Searching for patterns

- - -

In [None]:
pattern = 'EAROT'

df['pattern'] = df['word'].apply(lambda word: sum(1 for letter in pattern if letter in word))
df.sort_values(['pattern','popularity score','positional score'], ascending=False).head()
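
The pattern count above can also be wrapped in a small helper so that several candidate letter sets can be compared. The sketch below is one possible way to do that; the function name `pattern_hits` and the three sample words are hypothetical. It uses `set(pattern)`, so a letter repeated in the pattern is counted once, which differs slightly from the cell above when the pattern contains duplicates.

```python
def pattern_hits(word, pattern):
    # number of distinct pattern letters that occur anywhere in the word
    return sum(1 for letter in set(pattern) if letter in word)

# hypothetical words, for illustration only
for w in ['ORATE', 'CRANE', 'MUMMY']:
    print(w, pattern_hits(w, 'EAROT'))
```

ORATE contains all five pattern letters, CRANE contains three (E, A, R), and MUMMY contains none.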