In early 2022, the game Wordle took social media by storm. It's a fun game where players have six guesses to figure out a five-letter word. The more I thought about it, the more I decided that it would be the perfect teaching tool for Natural Language Processing (NLP) methods, both classical and neural. By the end of this notebook you should be able to write some algorithms to play the game as well as understand the fundamentals behind natural language processing.

First, let's create a training dictionary.

In [1]:
toy_dictionary_file = "./data/toy_dictionary.txt"
training_dictionary_file = "./data/train.txt"

# Get data for running in colab from github
!mkdir -p data # Create the data directory if it doesn't exist
!wget https://raw.githubusercontent.com/evanyli007/nlp_wordle/main/data/toy_dictionary.txt -P ./data/
#!wget https://raw.githubusercontent.com/evanyli007/nlp_wordle/main/data/train.txt -P ./data/

training_dictionary = []
# with open(training_dictionary_file) as f:
with open(toy_dictionary_file) as f: # useful for debugging
    for line in f:
        line = line.rstrip('\n')
        line = line.lower() # case doesn't matter in the game
        training_dictionary.append(line)
print(training_dictionary)

--2025-09-04 19:26:32--  https://raw.githubusercontent.com/evanyli007/nlp_wordle/main/data/toy_dictionary.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24 [text/plain]
Saving to: ‘./data/toy_dictionary.txt’


2025-09-04 19:26:32 (408 KB/s) - ‘./data/toy_dictionary.txt’ saved [24/24]

['words', 'count', 'equal', 'fivey']


Let's make a mock board to play on.

In [None]:
print("__|__|__|__|__\n__|__|__|__|__\n__|__|__|__|__\n__|__|__|__|__\n__|__|__|__|__\n__|__|__|__|__\n")

__|__|__|__|__
__|__|__|__|__
__|__|__|__|__
__|__|__|__|__
__|__|__|__|__
__|__|__|__|__



Now let's choose a random word from our training dictionary. Every time you run the next bit of code, a new random word is chosen.

In [None]:
import random
answer_word = random.choice(training_dictionary)
print("The answer word is \"", answer_word,"\"")

The answer word is " count "
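If you want the same answer word on every run (handy when debugging), one option is to draw from a seeded generator. This is a minimal sketch: the helper name `choose_answer`, the seed value 42, and the stand-in word list are assumptions for illustration, not part of the notebook above.

```python
import random

# Stand-in for training_dictionary (the toy dictionary loaded above)
words = ['words', 'count', 'equal', 'fivey']

def choose_answer(words, seed=None):
    """Pick an answer word; passing a seed makes the pick repeatable."""
    rng = random.Random(seed)  # independent generator, leaves the global one untouched
    return rng.choice(words)

first = choose_answer(words, seed=42)
second = choose_answer(words, seed=42)
print(first == second)  # True: same seed, same word
```

Leaving `seed=None` keeps the original behavior of a fresh random word on each run.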


Let's now take our first guess.

In [None]:
guess = input() #type it in and hit enter

words


In [None]:
guess = guess.lower()
print("Your guess was:", guess)
if  guess not in training_dictionary:
    print("However, it is not in the dictionary. Guess again.")
elif guess == answer_word:
    print("YAY! You got the answer_word")
else:
    print("Not correct. Guess again.")

Your guess was: words
Not correct. Guess again.


Of course, this isn't the whole gameplay. Let's add the rest of it. First, let's make a function that takes a guess and updates the game board. We will start with scoring words. Since colors are hard to do on the command line across platforms, we will mark correct letters in the correct place with uppercase and leave letters that are out of place in lowercase. The code below also wraps each letter in ANSI color codes (green, yellow, and gray backgrounds), which show up where the terminal supports them.

In [None]:
# pretty print output
def pretty_print_output(scored):
    recombined = ""
    for letter in scored:
        recombined = recombined + '|' + letter
    recombined = recombined + '|'
    print(recombined)

In [None]:
def score_guess(guess, answer_word):
    place = 0
    scored = []
    while place < len(answer_word):
        if guess[place] == answer_word[place]:
            scored.append("\033[42m\033[97m"+guess[place].upper()+"\033[0m")
        elif guess[place] in answer_word:
            scored.append("\033[43m\033[97m"+guess[place]+"\033[0m")
        else:
            scored.append("\033[100m\033[97m"+guess[place]+"\033[0m")
        place = place + 1
        #scored.append("|")
    return scored

scored = score_guess("sound", "doubt")
pretty_print_output(scored)

|[100m[97ms[0m|[42m[97mO[0m|[42m[97mU[0m|[100m[97mn[0m|[43m[97md[0m|


In [None]:
def make_a_guess(guess, num_guesses, guesses, max_guesses=6):
    guess = guess.lower()
    num_guesses = num_guesses + 1
    if  guess not in training_dictionary:
        print("Your guess was:", guess)
        num_guesses = num_guesses - 1
        print("However, it is not in the dictionary. Guess again.")
        return(num_guesses, guesses, max_guesses)
    elif guess == answer_word:
        scored = score_guess(guess, answer_word)
        pretty_print_output(scored)
        print("YAY! You got the answer_word in", num_guesses, "guesses.")
        num_guesses = max_guesses # Dumb way to end the game without breaking the notebook
    else:
        #print("Not correct. Guess again.")
        print("")
        print("")
        #guesses.append(guess)
        #To do uncomment and get the previous guesses (bug)
        #for guess in guesses:
        #    print(guess) #Print previous guesses
        scored = score_guess(guess, answer_word)
        pretty_print_output(scored)
        for i in range(num_guesses, max_guesses):
            print("|_|_|_|_|_|")

    return(num_guesses, guesses, max_guesses)
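The commented-out TODO in `make_a_guess` is about showing earlier guesses on the board. One way to approach it is to keep a list of already-scored rows and render the whole board from that list each turn. This is only a sketch: `render_board` and its parameters are hypothetical names, and it uses plain letters rather than the colored output from `score_guess`.

```python
def render_board(scored_guesses, max_guesses=6, word_length=5):
    """Build the board text: one row per scored guess, then empty rows."""
    rows = ["|" + "|".join(row) + "|" for row in scored_guesses]
    empty_row = "|" + "|".join("_" * word_length) + "|"
    rows += [empty_row] * (max_guesses - len(scored_guesses))
    return "\n".join(rows)

# Two earlier guesses, already scored (uppercase = right place)
history = [
    ["w", "O", "r", "d", "s"],
    ["C", "O", "U", "N", "T"],
]
print(render_board(history))
```

Because the history is a separate list, the loop variable never overwrites the current `guess`, which is the problem the commented-out loop would have run into.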





Now let's add some boilerplate code that asks a user for input.

In [None]:
num_guesses = 0
max_guesses = 6
guesses = []

while num_guesses < max_guesses:
    guess = input()
    num_guesses, guesses, max_guesses = make_a_guess(guess, num_guesses, guesses, max_guesses)

words


|[100m[97mw[0m|[42m[97mO[0m|[100m[97mr[0m|[100m[97md[0m|[100m[97ms[0m|
|_|_|_|_|_|
|_|_|_|_|_|
|_|_|_|_|_|
|_|_|_|_|_|
|_|_|_|_|_|
count
|[42m[97mC[0m|[42m[97mO[0m|[42m[97mU[0m|[42m[97mN[0m|[42m[97mT[0m|
YAY! You got the answer_word in 2 guesses.


Now that we have the basic gameplay up and running, let's make an algorithm that plays the game.

### Dumb Algorithm ###
Let's create the dumbest algorithm that we can. We will just randomly guess from the list of training words.

In [None]:
num_guesses = 0
max_guesses = 6
guesses = []

while num_guesses < max_guesses:
    guess = random.choice(training_dictionary)
    num_guesses, guesses, max_guesses = make_a_guess(guess, num_guesses, guesses, max_guesses)



|[100m[97mf[0m|[100m[97mi[0m|[100m[97mv[0m|[100m[97me[0m|[100m[97my[0m|
|_|_|_|_|_|
|_|_|_|_|_|
|_|_|_|_|_|
|_|_|_|_|_|
|_|_|_|_|_|
|[42m[97mC[0m|[42m[97mO[0m|[42m[97mU[0m|[42m[97mN[0m|[42m[97mT[0m|
YAY! You got the answer_word in 2 guesses.


Now this is a dumb algorithm, but it still gets the answer correct some of the time. Each guess is a random pick from the dictionary (with replacement). This means that it can guess the same word every time; it is not learning. On every guess, each word is picked with probability exactly 1/num_words_in_dictionary. This is sampling from a uniform distribution.
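To put a number on "some of the time": with uniform sampling with replacement, each guess misses with probability 1 - 1/N, so the chance of at least one hit in six independent guesses is 1 - (1 - 1/N)^6. The helper name below is an assumption made for this sketch.

```python
def win_probability_with_replacement(num_words, max_guesses=6):
    """Chance that at least one of max_guesses uniform random guesses hits the answer."""
    miss_once = 1 - 1 / num_words          # probability a single guess is wrong
    return 1 - miss_once ** max_guesses    # complement of missing every time

print(win_probability_with_replacement(4))     # toy dictionary: about 0.82
print(win_probability_with_replacement(2000))  # a larger, hypothetical dictionary size
```

For the four-word toy dictionary this comes to roughly 0.82, and it drops quickly as the dictionary grows.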

Let's make a slightly less dumb algorithm. Now let's sample WITHOUT replacement. In other words, it cannot guess the same word twice.

In [None]:
num_guesses = 0
max_guesses = 6
guesses = []

potential_guesses = random.sample(training_dictionary, min(max_guesses,len(training_dictionary)))
for guess in potential_guesses:
    num_guesses, guesses, max_guesses = make_a_guess(guess, num_guesses, guesses, max_guesses)
    if guess == answer_word:
        break # Stop guessing once the answer has been found



|[100m[97mw[0m|[42m[97mO[0m|[100m[97mr[0m|[100m[97md[0m|[100m[97ms[0m|
|_|_|_|_|_|
|_|_|_|_|_|
|_|_|_|_|_|
|_|_|_|_|_|
|_|_|_|_|_|


|[100m[97mf[0m|[100m[97mi[0m|[100m[97mv[0m|[100m[97me[0m|[100m[97my[0m|
|_|_|_|_|_|
|_|_|_|_|_|
|_|_|_|_|_|
|_|_|_|_|_|
|[42m[97mC[0m|[42m[97mO[0m|[42m[97mU[0m|[42m[97mN[0m|[42m[97mT[0m|
YAY! You got the answer_word in 3 guesses.


## Statistically Inspired Dumb Algorithm ##
Clearly, guessing randomly is not an intelligent way to solve Wordle. After people have played the game a few times, they start to notice that guessing some words early on can tell them a lot for future guesses. This is basic corpus statistics, and we will use it as a simple machine learning model. Let's rank the words in our dictionary by how often the letters within them occur.

In [None]:
def sort_by_frequency(training_dictionary):
    character_counts = {}
    for word in training_dictionary:
        #print(word)
        for character in word:
            #print(character)
            if character in character_counts:
                count = character_counts[character]
                count = count + 1
                character_counts[character] = count
            else:
                character_counts[character] = 1

    # The character counts of the whole dictionary
    #print(character_counts)

    scored_words = {}
    for word in training_dictionary:
        score = 0
        for character in word:
            count = character_counts[character]
            score = score + count
        scored_words[word] = score


    # Sort and print
    sorted_scored_words = sorted(scored_words.items(), key=lambda x: x[1], reverse=True)
    #print(sorted_scored_words)
    sorted_dictionary = []
    for word in sorted_scored_words:
        sorted_dictionary.append(word[0])
    #print(sorted_dictionary)
    return(sorted_dictionary)

print(sort_by_frequency(training_dictionary))

['count', 'equal', 'words', 'fivey']
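The same ranking can be written more compactly with `collections.Counter` from the standard library. This is a sketch of an equivalent formulation, not the notebook's function; the name `sort_by_frequency_compact` is an assumption. Unlike the dictionary-based version above, it would keep duplicate words if the input list contained any.

```python
from collections import Counter

def sort_by_frequency_compact(words):
    """Rank words by the summed corpus frequency of their letters (highest first)."""
    # Count every character across the whole word list
    character_counts = Counter(ch for word in words for ch in word)
    # sorted() is stable, so words with equal scores keep their original order
    return sorted(
        words,
        key=lambda word: sum(character_counts[ch] for ch in word),
        reverse=True,
    )

print(sort_by_frequency_compact(['words', 'count', 'equal', 'fivey']))
# ['count', 'equal', 'words', 'fivey']
```

Here "count" and "equal" both score 7 (they share the repeated letters "o", "u", and "e" with other words), while "words" and "fivey" score 6.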


Now this is a slightly better method, at least in theory. However, it doesn't take into account any of the knowledge that you have from previous guesses. The method is still just as likely to win as our other dumb algorithm of sampling without replacement (assuming the hidden word is chosen uniformly at random, not by letter statistics).
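The "just as likely to win" claim can be checked with a small sketch. If the answer is drawn uniformly and the guesses follow any fixed order without using feedback, the win rate is simply the fraction of words that fall within the first six guesses. The function name and the ten placeholder words are assumptions for illustration.

```python
def win_rate_fixed_order(ordered_words, max_guesses=6):
    """Fraction of possible answers found when guessing in a fixed order."""
    first_guesses = set(ordered_words[:max_guesses])
    wins = sum(1 for answer in ordered_words if answer in first_guesses)
    return wins / len(ordered_words)

# Ten hypothetical placeholder words; only the count matters here
words = ["word%02d" % i for i in range(10)]
print(win_rate_fixed_order(words))                   # 0.6
print(win_rate_fixed_order(list(reversed(words))))   # 0.6
```

Any ordering gives 6/10 here, which is why sorting by letter frequency alone does not improve the odds until the feedback from each guess is used.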

## Remaining Possibilities ##
To make a much less dumb algorithm, let's constrain ourselves to ONLY guess words that are possible based on our previous guesses. This is actually easier to implement for the "Hard Mode" setting of Wordle. It is also basically what the game Absurdle is doing.

### Dumb Remaining Possibilities ###
This is similar to our dumb model above. We randomly sample from the remaining possibilities. As we are not using learned information from the corpus (dictionary) this is NOT statistical NLP or machine learning.

The first thing to do is define a function that scores a guess with the answer word.

In [None]:
def dumb_mask(guess, answer_word):
    place = 0
    scored = []
    while place < len(answer_word):
        if guess[place] == answer_word[place]:
            scored.append(guess[place].upper())
        elif guess[place] in answer_word:
            scored.append(guess[place])
        else:
            scored.append("_")
        place = place + 1
    return scored

In [None]:
import copy

def dumb_remove_words(remaining, scored, guess):
    new_remaining = []
    for word in remaining:
        #print("remaining word:", word)
        place = 0
        still_valid = True
        while place < len(word):
            #print("guess:", guess[place], "word:", word[place])
            if guess[place] == word[place]:
                # Guess had letter in this place
                if scored[place] == "_":
                    # Word can no longer be correct
                    #print(word, "can no longer be correct")
                    still_valid = False
            place = place + 1
        if still_valid:
            new_remaining.append(word)

    # This has only filtered exact mistakes

    # Check for exact matches
    place = 0
    while place < len(scored):
        letter = scored[place]
        if letter.isupper(): # Exact matches
            print(letter)
            updated_remaining = []
            for word in new_remaining:
                if word[place].upper() == letter:
                    updated_remaining.append(word)
            new_remaining = updated_remaining
        elif letter.islower(): # Right letter, wrong place
            print(letter)
            updated_remaining = []
            for word in new_remaining:
                if letter in word: # It is in the word #TODO check logic
                    if word[place] != letter: # And not in this place
                        updated_remaining.append(word)
            new_remaining = updated_remaining
        place = place + 1



    return new_remaining

remaining = copy.deepcopy(training_dictionary)

print("REMAINING:", remaining)

num_guesses = 0
max_guesses = 6
guesses = []

answer_word = "fivey"

while num_guesses < max_guesses:
    if len(remaining) == 0:
        print("No remaining words")
        break
    guess = random.choice(remaining)
    num_guesses = num_guesses + 1 # Count this guess
    print("Guess:", guess)
    scored = dumb_mask(guess, answer_word)
    print(scored)
    if guess == answer_word:
        remaining = [] # Another dumb way to exit the loop without breaking the notebook
    remaining = dumb_remove_words(remaining, scored, guess)
    print("Possible remaining words:", remaining)
    scored = score_guess(guess, answer_word)
    pretty_print_output(scored)



if guess == answer_word:
    print("Algorithm solved this in", num_guesses, "guesses")
else:
    print("You lose")



REMAINING: ['words', 'count', 'equal', 'fivey']
Guess: equal
['e', '_', '_', '_', '_']
e
Possible remaining words: ['fivey']
|[43m[97me[0m|[100m[97mq[0m|[100m[97mu[0m|[100m[97ma[0m|[100m[97ml[0m|
Guess: fivey
['F', 'I', 'V', 'E', 'Y']
F
I
V
E
Y
Possible remaining words: []
|[42m[97mF[0m|[42m[97mI[0m|[42m[97mV[0m|[42m[97mE[0m|[42m[97mY[0m|
No remaining words
Algorithm solved this in 2 guesses


Now, using our dumb selection algorithm, we can randomly guess and remove words.
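A different way to express the removal step, shown here as a sketch rather than as the notebook's method: keep a candidate only if it would have produced exactly the same feedback for the guess. The names `mask` and `filter_consistent` are assumptions; `mask` follows the same marking scheme as `dumb_mask`.

```python
def mask(guess, answer_word):
    """Uppercase = right place, lowercase = in the word elsewhere, '_' = absent."""
    scored = []
    for g, a in zip(guess, answer_word):
        if g == a:
            scored.append(g.upper())
        elif g in answer_word:
            scored.append(g)
        else:
            scored.append("_")
    return scored

def filter_consistent(remaining, guess, scored):
    """Keep words that would have produced the observed feedback for this guess."""
    return [word for word in remaining if mask(guess, word) == scored]

words = ['words', 'count', 'equal', 'fivey']
observed = mask("equal", "fivey")
print(observed)                                     # ['e', '_', '_', '_', '_']
print(filter_consistent(words, "equal", observed))  # ['fivey']
```

This filter also removes words that contain a letter marked absent anywhere in the word, not only in the guessed position, and it drops the guess itself whenever the guess was wrong.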

### Statistically Informed Removing Possibilities ###
Now, we can combine this with the statistically informed model we created earlier. Rather than randomly selecting from our remaining words, we sort by the most likely.

In [None]:
import copy

num_guesses = 0
max_guesses = 6
guesses = []

answer_word = "fivey"

# Remaining Guesses. First just use all the words in the dictionary sorted by frequency
remaining = copy.deepcopy(training_dictionary)
remaining = sort_by_frequency(remaining)
print("Remaining words in dictionary:", remaining)

while len(remaining) > 0:
    guess = remaining[0]
    print("Guessing:", guess)
    scored = score_guess(guess, answer_word)
    pretty_print_output(scored)
    #num_guesses, guesses, max_guesses = make_a_guess(guess, num_guesses, guesses, max_guesses)
    scored = dumb_mask(guess, answer_word)
    remaining = dumb_remove_words(remaining, scored, guess) # Remove non-possibilities
    if guess == answer_word:
        remaining = [] # Another dumb way to exit the loop without breaking the notebook
    print("Possible remaining words:", remaining)
    remaining = sort_by_frequency(remaining) # Sort remaining by frequency
    num_guesses = num_guesses + 1

print("No remaining words")
print("Algorithm solved this in:", num_guesses, "guesses.")


Remaining words in dictionary: ['count', 'equal', 'words', 'fivey']
Guessing: count
|[100m[97mc[0m|[100m[97mo[0m|[100m[97mu[0m|[100m[97mn[0m|[100m[97mt[0m|
Possible remaining words: ['fivey']
Guessing: fivey
|[42m[97mF[0m|[42m[97mI[0m|[42m[97mV[0m|[42m[97mE[0m|[42m[97mY[0m|
F
I
V
E
Y
Possible remaining words: []
No remaining words
Algorithm solved this in: 2 guesses.
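A single run only tells us about one answer word. To see how the statistically informed approach behaves across the whole dictionary, we can run it once per possible answer and record the number of guesses. The sketch below is self-contained and simplified: `mask`, `rank`, and `solve` are assumed names that mirror the ideas above (feedback marking, letter-frequency ranking, and filtering to consistent words), not the exact notebook functions.

```python
from collections import Counter

def mask(guess, answer_word):
    """Uppercase = right place, lowercase = in the word elsewhere, '_' = absent."""
    return [g.upper() if g == a else (g if g in answer_word else "_")
            for g, a in zip(guess, answer_word)]

def rank(words):
    """Sort words by summed letter frequency over the given list, highest first."""
    counts = Counter(ch for word in words for ch in word)
    return sorted(words, key=lambda w: sum(counts[ch] for ch in w), reverse=True)

def solve(answer_word, dictionary, max_guesses=6):
    """Return the number of guesses used, or None if the answer was not found."""
    remaining = rank(dictionary)
    for num_guesses in range(1, max_guesses + 1):
        if not remaining:
            return None
        guess = remaining[0]
        if guess == answer_word:
            return num_guesses
        scored = mask(guess, answer_word)
        # Keep only words consistent with the feedback, then re-rank them
        remaining = rank([w for w in remaining if mask(guess, w) == scored])
    return None

toy = ['words', 'count', 'equal', 'fivey']
results = {answer: solve(answer, toy) for answer in toy}
print(results)  # {'words': 2, 'count': 1, 'equal': 2, 'fivey': 2}
print(sum(results.values()) / len(results))  # 1.75
```

On the toy dictionary every answer is found within two guesses, which matches the run above for "fivey".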
