# Monte Carlo Simulator
This script is the big simulator. Effectively, we run the original monte_carlo_wordle_demo script over every word in ./data/top_words.txt and write the results to a CSV file. I have kept the old version around because, **WARNING**, this script can take a while to run: with top_words at about 100 entries, it usually takes 20-30 minutes.

If you just want to check how good your opener is, use monte_carlo_wordle_demo. If you want to know what the best Wordle opener is, stay here.

## The Script

We start by grabbing all the data we need: the list of all valid guesses and the list of all valid answers. We will also need the list of top words, but that is loaded later.

In [None]:
# Set up the global variables and housekeeping.
from collections import Counter
from random import Random
import csv
import matplotlib.pyplot as plt

# --- Load Word Lists ---
# Answers are all possible Wordle answers (about 2000 words).
with open("./data/answers.txt") as f:
    ANSWERS = [w.strip().upper() for w in f if len(w.strip()) == 5]

# Guesses are all valid Wordle guesses, including the possible answers (about 12000 words).
with open("./data/guesses.txt") as f:
    GUESSES = [w.strip().upper() for w in f if len(w.strip()) == 5]

## The Wordle Simulator
This is very similar to the original one in monte_carlo_wordle_demo, except we return the values instead of printing them. If you're curious how it works, you can dig into the code, but essentially this whole section just lets us play a game of Wordle with a single opening word.

In [2]:
# --- Feedback Function ---
def feedback(secret: str, guess: str):
    """Return pattern as tuple of G/Y/- (greens, yellows, grays)."""
    secret, guess = secret.upper(), guess.upper()
    res = ["-"] * 5
    counts = Counter(secret)

    # Greens
    for i in range(5):
        if guess[i] == secret[i]:
            res[i] = "G"
            counts[guess[i]] -= 1

    # Yellows
    for i in range(5):
        if res[i] == "G":
            continue
        if counts[guess[i]] > 0:
            res[i] = "Y"
            counts[guess[i]] -= 1

    return tuple(res)

def filter_candidates(cands, guess, patt):
    """Keep only words consistent with guess & feedback pattern."""
    return [w for w in cands if feedback(w, guess) == patt]

# --- The actual gameplay strategy ---
def play_one(secret, opener, max_guesses=6):
    """
    Play one game:
    - Start with `opener`
    - Then always pick the first candidate from remaining ANSWERS
    """
    cands = ANSWERS[:]  # secrets are always from the ANSWERS list
    guess = opener.upper()

    for turn in range(1, max_guesses + 1):
        patt = feedback(secret, guess)
        if patt == ("G","G","G","G","G"):
            return turn, True

        cands = filter_candidates(cands, guess, patt)
        if not cands:
            return turn, False

        # Next guess: naive (alphabetically first remaining candidate)
        guess = min(cands)

    return max_guesses, False

# --- Monte Carlo Simulation ---
def run_monte_carlo(opener, trials, seed=42):
    """
    Run Monte Carlo simulation:
    - Randomly sample `trials` secrets from ANSWERS
    - Play each game with the strategy
    - Summarize results
    """
    rng = Random(seed)
    secrets = rng.sample(ANSWERS, min(trials, len(ANSWERS)))

    results = [play_one(s, opener=opener) for s in secrets]
    solved = [t for (t, ok) in results if ok]

    success_rate = len(solved) / len(secrets)
    avg_guesses = sum(solved) / len(solved) if solved else float('nan')

    return {
        "opener": opener,
        "secrets_tested": len(secrets),
        "success_rate": success_rate,
        "average_guesses": avg_guesses
    }
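
To see how the two-pass scoring treats repeated letters, here is a small standalone check. The function is a compact copy of `feedback` from the cell above, repeated only so the snippet runs on its own; the example words are arbitrary illustrations, not taken from the word lists.

```python
from collections import Counter

def feedback(secret: str, guess: str):
    """Compact copy of feedback() above: mark greens first, then yellows."""
    secret, guess = secret.upper(), guess.upper()
    res = ["-"] * 5
    counts = Counter(secret)  # letters of the secret still available to match
    for i in range(5):
        if guess[i] == secret[i]:
            res[i] = "G"
            counts[guess[i]] -= 1
    for i in range(5):
        if res[i] != "G" and counts[guess[i]] > 0:
            res[i] = "Y"
            counts[guess[i]] -= 1
    return tuple(res)

# The secret has a single E, the guess has three. The green E uses it up,
# so the other two E's stay gray instead of turning yellow.
print("".join(feedback("CRANE", "EERIE")))  # --Y-G

# Two B's in the secret: one matches in place, the other is found elsewhere.
print("".join(feedback("ABBEY", "BABES")))  # YYGG-
```

Handling greens before yellows is what stops a letter from being counted more often than it appears in the secret.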

## Analysis
**WARNING:** this is the part that takes about half an hour to run.

All it really does is loop through all the words in top_words and play 1000 games of Wordle for each one.

In [3]:
# Run the simulation across the ~100 top openers in ./data/top_words.txt, then export the results
# to a file sorted by average_guesses (lowest, i.e. best, first).

results = []
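with open("data/top_words.txt") as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing newline at the end of the file
        # Only the part before the first ":" (the word) is needed here.
        opener = line.split(":", 1)[0].strip()
        result = run_monte_carlo(opener=opener, trials=1000)
        results.append(result)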

# Sort results by average_guesses
results.sort(key=lambda x: x["average_guesses"])

# Export to results.csv
with open("./data/results.csv", "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["opener", "average_guesses", "success_rate", "secrets_tested"])
    for res in results:
        writer.writerow([res['opener'], f"{res['average_guesses']:.6f}", f"{res['success_rate']:.6f}", res['secrets_tested']])


## Caveats

You'll notice the best words according to this script are *NOT* the same as the best words in top_words. Why is this?

- **Duplicate letters**: Words with duplicate letters just don't do well in Wordle. A lot of words with two E's or two A's score really high because those are common letters on their own, but the second copy is redundant and takes up a slot that could hold another letter.
- **Over-reliance on vowels**: For some reason, words with 3 or 4 vowels actually perform quite poorly. I am not entirely sure why. Wordle pros say they lead to more "traps", but I have no idea what that means. Regardless, high-vowel words score high but don't perform.
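
Both effects are easy to flag for any opener. The helper below is a minimal sketch, not part of the original script: `letter_profile` is a hypothetical name, the example words are arbitrary, and it assumes only A, E, I, O, U count as vowels (Y is left out).

```python
VOWELS = set("AEIOU")  # assumption: Y is not treated as a vowel

def letter_profile(word: str) -> dict:
    """Summarize repeated letters and vowel count for one candidate opener."""
    word = word.upper()
    distinct = len(set(word))
    return {
        "word": word,
        "distinct_letters": distinct,
        "vowels": sum(ch in VOWELS for ch in word),
        "has_duplicates": distinct < len(word),
    }

for w in ["EERIE", "ADIEU", "SLATE"]:
    print(letter_profile(w))
```

Adding these fields as extra columns in results.csv would make it possible to check whether the outliers really are the duplicate-letter and high-vowel words.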

**This is important** because, while there is a clear correlation between scoring high and performing well, there are tons of outliers. It is entirely possible that the best-performing word is the 215th-highest-scoring one, in which case we wouldn't find it. However, I estimate that running the simulation on all 12,000 valid Wordle words would take approximately 50 hours. The 100 top-scoring words almost certainly include the best Wordle word, but if you disagree, please try running the full experiment and get back to me.
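
As a rough check, the 50-hour figure can be reproduced from the numbers quoted earlier (about 100 openers in 20-30 minutes). This is a back-of-the-envelope sketch: `estimate_hours` is a hypothetical helper, and it assumes run time scales linearly with the number of openers.

```python
def estimate_hours(batch_minutes: float, batch_size: int = 100, total_words: int = 12_000) -> float:
    """Scale the measured time for one batch of openers up to the full word list."""
    seconds_per_word = batch_minutes * 60 / batch_size
    return seconds_per_word * total_words / 3600

# Low end, midpoint, and high end of the observed 20-30 minute range.
for minutes in (20, 25, 30):
    print(f"{minutes} min per 100 openers -> about {estimate_hours(minutes):.0f} hours for 12,000")
```

That gives roughly 40 to 60 hours, with the midpoint landing on 50.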

I do want to try a version where I run only, say, 10 simulations on all 12,000 words, take the best-performing words from there, run 100 simulations on those, and so on. That might have a better chance of pulling up some weird words, but I don't have the time yet.
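
The staged idea could look something like the sketch below. Nothing here has been run against the real word lists: `staged_search`, `keep_fraction`, `trial_growth` and the toy `noisy_score` function are all hypothetical names and settings chosen for illustration. In the notebook, `evaluate` would presumably be something like `lambda w, n: run_monte_carlo(w, n)["average_guesses"]`.

```python
from random import Random
from typing import Callable, List

def staged_search(
    words: List[str],
    evaluate: Callable[[str, int], float],
    initial_trials: int = 10,
    trial_growth: int = 10,
    keep_fraction: float = 0.1,
    max_trials: int = 1000,
) -> List[str]:
    """Score every survivor cheaply, keep the best fraction, then repeat with more trials."""
    survivors = list(words)
    trials = initial_trials
    while True:
        scores = {w: evaluate(w, trials) for w in survivors}
        survivors.sort(key=scores.__getitem__)  # lower average guesses is better
        if trials >= max_trials or len(survivors) <= 1:
            return survivors  # final ranking, best first
        keep = max(1, int(len(survivors) * keep_fraction))
        survivors = survivors[:keep]
        trials = min(trials * trial_growth, max_trials)

# Toy stand-in for the simulator: each fake word has a hidden "true" average,
# and evaluating it returns the mean of `trials` noisy draws around that value.
rng = Random(0)
true_mean = {f"W{i:03d}": 3.5 + 0.01 * i for i in range(200)}

def noisy_score(word: str, trials: int) -> float:
    return sum(rng.gauss(true_mean[word], 1.0) for _ in range(trials)) / trials

print(staged_search(list(true_mean), noisy_score))
```

With these defaults and 12,000 words, the rounds would play 12,000 x 10 + 1,200 x 100 + 120 x 1,000 = 360,000 games, compared with 12,000,000 for 1000 games on every word. The trade-off is that a 10-game first round is very noisy, so a genuinely good word can be dropped early; a larger `keep_fraction` makes that less likely at the cost of more run time.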
