<a href="https://colab.research.google.com/github/andersonmdcanteli/wordle/blob/main/wordle_only_answers_pair_of_words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mastering WORDLE - Dataset 1: Only answers (PART 2)

This notebook contains part of the analyses carried out to find the best words for the game ***WORDLE***. Here we will find the best ***pair of words*** to use in the first two attempts.

The focus is on the dataset of words that can be used as the word of the day. For other datasets and a general discussion, see this [other notebook](https://colab.research.google.com/drive/1ulRd4zAWIo9Yq6GujbEX7eyp8XhXRCkO?usp=sharing).



## Libraries and versions

To perform this analysis, I'm using [Google Colab](https://colab.research.google.com/drive/1ulRd4zAWIo9Yq6GujbEX7eyp8XhXRCkO?usp=sharing) with the following Python and library versions:

- Python: `3.7.13`
- Pandas: `1.3.5`
- NumPy: `1.21.6`
- matplotlib: `3.2.2`
- Seaborn: `0.11.2`
- SciPy: `1.7.3`

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

## Data collection and preparation

The list of words that are used as answers in the game is available in this [repository](https://gist.github.com/cfreshman/).

We can import the data using pandas:


In [2]:
df_answers = pd.read_csv(
    "https://gist.githubusercontent.com/cfreshman/a03ef2cba789d8cf00c08f767e0fad7b/raw/28804271b5a226628d36ee831b0e36adef9cf449/wordle-answers-alphabetical.txt",
    header=None, 
    names=['words'])
df_answers.head(2)

Unnamed: 0,words
0,aback
1,abase


In [3]:
df_answers.describe().transpose()

Unnamed: 0,count,unique,top,freq
words,2315,2315,aback,1


In the first part of the analysis of this dataset, we obtained the best first word to use in the game. Now let's find the best pair of opening words.

To do this, we need all the unique combinations among the available words. To find out how many unique combinations exist, I'll use the `comb` function from the `scipy.special` package:

In [4]:
from scipy import special

In [5]:
special.comb(df_answers.shape[0], 2, exact=True)

2678455

Pairing the 2315 words in the dataset, without repetition and ignoring order, results in 2678455 combinations, and we want to know which one is the best.
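As a quick sanity check on that count, the number of unordered pairs that can be drawn from n items is n(n-1)/2. A minimal sketch in plain Python, assuming only the word count reported above (the variable names are illustrative):

```python
# Number of unordered pairs that can be formed from n distinct words
n_words = 2315  # size of the answers dataset, as reported by describe()
n_pairs = n_words * (n_words - 1) // 2
print(n_pairs)  # 2678455
```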

To actually generate all the word pairs, I'm going to use the `combinations` function from the `itertools` package:

In [6]:
from itertools import combinations

We can create a function that produces the combinations as follows:

In [7]:
def generator_func(data, x):
  for comb in combinations(data, x):
    yield comb

To get the pairs, just pass the data with all the words as the first parameter and the number of elements that the combination should return as the second parameter. As we want the pairs, just pass the number two.

In [8]:
pairs = generator_func(df_answers['words'], 2)
type(pairs)

generator

However, the result obtained is a generator, which must be iterated over to be used. As we have many combinations, it is more efficient to split this generator into smaller parts (chunks) and consume it step by step.

To do this, we can use the `ichunked` function from the `more_itertools` package:

In [9]:
from more_itertools import ichunked

In [10]:
all_chunks = ichunked(pairs, 1000)

The variable `all_chunks` contains chunks of word pairs that have not yet been generated. Each chunk contains up to 1000 pairs (this value was chosen arbitrarily), and we can get all pairs in a `while` loop. The advantage of using `ichunked` is that once the combinations of a chunk have been consumed, the chunk no longer takes up space, which reduces memory consumption.
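The chunking idea can be illustrated with the standard library alone. The sketch below is not `ichunked` itself: it uses `itertools.islice` and materializes each chunk as a list, whereas `ichunked` yields lazy sub-iterators. The helper name `iter_chunks` and the short word list are assumptions made for this example.

```python
from itertools import combinations, islice

def iter_chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return  # the source iterator is exhausted
        yield chunk

# A few five-letter words, used only for illustration
words = ["aback", "abase", "abate", "abbey", "abbot"]
pairs = combinations(words, 2)  # lazy: 5 * 4 / 2 = 10 pairs
chunk_sizes = [len(chunk) for chunk in iter_chunks(pairs, 4)]
print(chunk_sizes)  # [4, 4, 2]
```

Only one chunk is held in memory at a time, which is the property the notebook relies on.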

We can create the following structure to consume the chunks:

In [11]:
out = True
count = 0
while out:
  try:
    next(all_chunks)
    count += 1
  except StopIteration:  # raised once all chunks have been consumed
    out = False
print(count)

2679
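The chunk count follows directly from the number of pairs and the chunk size: every chunk holds 1000 pairs except the last one, which holds the remainder. A small check, assuming the values shown above:

```python
import math

n_pairs = 2678455   # total number of word pairs
chunk_size = 1000   # pairs per chunk

n_chunks = math.ceil(n_pairs / chunk_size)
last_chunk_size = n_pairs - (n_chunks - 1) * chunk_size
print(n_chunks, last_chunk_size)  # 2679 455
```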


To actually build all possible combinations, we just need to adapt the code above to get a `DataFrame` with the word pairs. But since the previous loop already exhausted the generator wrapped by `ichunked`, it is necessary to create the chunks again:

In [12]:
pairs = generator_func(df_answers['words'], 2)
all_chunks = ichunked(pairs, 1000)

In [13]:
super_df = []
while True:
    try:
        chunk_n = next(all_chunks)
    except StopIteration:
        break  # every chunk has been consumed
    # materialize the current chunk as a small DataFrame
    super_df.append(pd.DataFrame(chunk_n, columns=['word_1', 'word_2']))
# concatenate all partial DataFrames in a single step
df_pairs = pd.concat(super_df, axis=0, ignore_index=True)
super_df = None  # free the list of partial DataFrames

Note that at each iteration we get a new `DataFrame`, which is stored in a `list` (`super_df`). After all chunks are consumed, the loop terminates, and all the `DataFrames` contained in the `super_df` `list` are concatenated into a single `DataFrame`.

In [14]:
df_pairs.head(2)

Unnamed: 0,word_1,word_2
0,aback,abase
1,aback,abate


In [15]:
df_pairs.shape

(2678455, 2)

Now we have all the unique pairs in a data frame, and we can estimate the strength of the pair of words.

## Strength of a pair of words

To find the best opening pair, we need the letters of each word pair in a single cell. For this, I will add a new column with the concatenation of each pair of words:

In [16]:
df_pairs['words_combined'] = df_pairs['word_1'] + df_pairs['word_2']
df_pairs.head(2)

Unnamed: 0,word_1,word_2,words_combined
0,aback,abase,abackabase
1,aback,abate,abackabate


Now I will split the combination of the word pair into a list, where each element corresponds to a single letter:

In [17]:
df_pairs['letters'] = df_pairs['words_combined'].apply(list)
df_pairs.head(2)

Unnamed: 0,word_1,word_2,words_combined,letters
0,aback,abase,abackabase,"[a, b, a, c, k, a, b, a, s, e]"
1,aback,abate,abackabate,"[a, b, a, c, k, a, b, a, t, e]"


Now let's create an auxiliary dataframe containing the letters of the word pair separated into 10 columns (1 letter in each cell):

In [18]:
df_pairs_aux = pd.DataFrame(df_pairs['letters'].to_list(), columns=range(1,11)).copy()
df_pairs_aux.head(2)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10
0,a,b,a,c,k,a,b,a,s,e
1,a,b,a,c,k,a,b,a,t,e


Then we need to replace each letter with its respective strength, similar to what we did before. However, the letter frequencies are not loaded yet. Let's import them:

In [19]:
url_freq = 'https://drive.google.com/file/d/1UIemFrAwSlsnE8BzZnJeDphALIYNE446/view?usp=sharing'
url_freq = 'https://drive.google.com/uc?id=' + url_freq.split('/')[-2]
df_freq_answers = pd.read_csv(url_freq, index_col=0)
df_freq_answers.head(2)

Unnamed: 0,frequency_answers
a,0.084579
b,0.024276


Now we can replace the letters with their frequencies:

In [20]:
# This cell takes around 8 minutes to run
for i in df_freq_answers.index:
  df_pairs_aux.replace(i, df_freq_answers['frequency_answers'][i], inplace=True)
df_pairs_aux.head(2)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10
0,0.084579,0.024276,0.084579,0.04121,0.018143,0.084579,0.024276,0.084579,0.057797,0.106523
1,0.084579,0.024276,0.084579,0.04121,0.018143,0.084579,0.024276,0.084579,0.062981,0.106523


Now we need to sum the 10 columns of each row to estimate the strength of the word pair:

In [21]:
df_pairs_aux['strength'] = df_pairs_aux.sum(axis=1)
df_pairs_aux.head(2)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,strength
0,0.084579,0.024276,0.084579,0.04121,0.018143,0.084579,0.024276,0.084579,0.057797,0.106523,0.61054
1,0.084579,0.024276,0.084579,0.04121,0.018143,0.084579,0.024276,0.084579,0.062981,0.106523,0.615724
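The two steps above (replacing each letter by its frequency, then summing across the row) can be sketched for a single pair in plain Python. The frequencies below are the rounded values displayed in the outputs above, restricted to the letters needed here, so the results agree with the table only up to that rounding. The helper name `pair_strength` is hypothetical.

```python
# Letter frequencies copied from the rounded values displayed above;
# only the letters needed for this example are included.
freq = {"a": 0.084579, "b": 0.024276, "c": 0.04121, "k": 0.018143,
        "s": 0.057797, "t": 0.062981, "e": 0.106523}

def pair_strength(word_1, word_2, freq):
    """Sum the frequency of every letter in the two words (repeats included)."""
    return sum(freq[letter] for letter in word_1 + word_2)

strength_abase = pair_strength("aback", "abase", freq)
strength_abate = pair_strength("aback", "abate", freq)
print(strength_abase, strength_abate)
```

Because repeated letters are counted every time they appear, pairs that reuse common letters score highly under this measure.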


Finally, we just need to copy the `strength` column from the auxiliary `DataFrame` into `df_pairs`:

In [22]:
df_pairs['strength'] = df_pairs_aux['strength'].round(7).copy()
df_pairs.head(2)

Unnamed: 0,word_1,word_2,words_combined,letters,strength
0,aback,abase,abackabase,"[a, b, a, c, k, a, b, a, s, e]",0.61054
1,aback,abate,abackabate,"[a, b, a, c, k, a, b, a, t, e]",0.615723


And then we create a new column with the ranking of the pair of words:

In [23]:
df_pairs['rank'] = df_pairs['strength'].rank(ascending=False, method="dense").astype(int)
df_pairs.head(2)

Unnamed: 0,word_1,word_2,words_combined,letters,strength,rank
0,aback,abase,abackabase,"[a, b, a, c, k, a, b, a, s, e]",0.61054,2524
1,aback,abate,abackabate,"[a, b, a, c, k, a, b, a, t, e]",0.615723,2464
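With `method="dense"`, tied pairs share the same rank and the next distinct value receives the next integer, so a filter such as `rank <= 10` can return more than ten rows. A plain-Python sketch of that rule follows; the helper `dense_rank_desc` is hypothetical and only mimics the behaviour for illustration.

```python
def dense_rank_desc(values):
    """Dense ranking in descending order: ties share a rank, no gaps between ranks."""
    distinct = sorted(set(values), reverse=True)
    rank_of = {value: position + 1 for position, value in enumerate(distinct)}
    return [rank_of[value] for value in values]

ranks = dense_rank_desc([0.9, 0.8, 0.9, 0.7])
print(ranks)  # [1, 2, 1, 3]
```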


Hence, the **TOP 10 pairs of words** are:

In [24]:
df_pairs[df_pairs['rank'] <= 10].sort_values(by='rank')

Unnamed: 0,word_1,word_2,words_combined,letters,strength,rank
1258004,eater,eerie,eatereerie,"[e, a, t, e, r, e, e, r, i, e]",0.893477,1
1266448,eerie,erase,eerieerase,"[e, e, r, i, e, e, r, a, s, e]",0.888294,2
1267323,eerie,rarer,eerierarer,"[e, e, r, i, e, r, a, r, e, r]",0.879309,3
1266419,eerie,elate,eerieelate,"[e, e, r, i, e, e, l, a, t, e]",0.877927,4
1267803,eerie,tease,eerietease,"[e, e, r, i, e, t, e, a, s, e]",0.873607,5
1254633,easel,eerie,easeleerie,"[e, a, s, e, l, e, e, r, i, e]",0.872743,6
1266893,eerie,lease,eerielease,"[e, e, r, i, e, l, e, a, s, e]",0.872743,6
1258038,eater,erase,eatererase,"[e, a, t, e, r, e, r, a, s, e]",0.871361,7
1267811,eerie,tepee,eerietepee,"[e, e, r, i, e, t, e, p, e, e]",0.86946,8
1266454,eerie,ester,eerieester,"[e, e, r, i, e, e, s, t, e, r]",0.866695,9


## Adding penalty for words with repeated letters

Again, the top-ranked pairs are those that repeat the letter `e`, so we need to penalize them. For this, I will add a new column containing only the unique letters of each pair, obtained with the function `unique`:

In [25]:
def unique(sequence):
  # source  https://stackoverflow.com/a/58666031/17872198
  seen = set()
  return [x for x in sequence if not (x in seen or seen.add(x))]
# example
unique("juliana")

['j', 'u', 'l', 'i', 'a', 'n']

In [26]:
df_pairs['unique_letters'] = df_pairs['letters'].apply(unique).apply("".join)
df_pairs.head(2)

Unnamed: 0,word_1,word_2,words_combined,letters,strength,rank,unique_letters
0,aback,abase,abackabase,"[a, b, a, c, k, a, b, a, s, e]",0.61054,2524,abckse
1,aback,abate,abackabate,"[a, b, a, c, k, a, b, a, t, e]",0.615723,2464,abckte
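An equivalent way to drop repeated letters while keeping the order of first appearance, offered only as an alternative sketch, relies on `dict.fromkeys`, since dictionaries preserve insertion order from Python 3.7 onwards. The helper name `unique_letters` is an assumption for this example.

```python
def unique_letters(sequence):
    """Return the distinct items of `sequence` in order of first appearance."""
    return list(dict.fromkeys(sequence))

print("".join(unique_letters("abackabase")))  # abckse
print(unique_letters("juliana"))  # ['j', 'u', 'l', 'i', 'a', 'n']
```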


Now we need to count the number of letters left after removing the repeated ones. For this, we can apply the `str.len` method to the `"unique_letters"` column:

In [27]:
df_pairs['count_unique'] = df_pairs['unique_letters'].str.len()
df_pairs.head(2)

Unnamed: 0,word_1,word_2,words_combined,letters,strength,rank,unique_letters,count_unique
0,aback,abase,abackabase,"[a, b, a, c, k, a, b, a, s, e]",0.61054,2524,abckse,6
1,aback,abate,abackabate,"[a, b, a, c, k, a, b, a, t, e]",0.615723,2464,abckte,6


To find out how many pairs have each number of unique letters, just use the `value_counts` method:

In [28]:
df_pairs['count_unique'].value_counts()

8     1002502
9      765306
7      558029
10     196175
6      140781
5       15082
4         569
3          11
Name: count_unique, dtype: int64

Out of curiosity, the pairs of words that contain only 3 unique letters are:

In [29]:
df_pairs[df_pairs['count_unique'] == 3]

Unnamed: 0,word_1,word_2,words_combined,letters,strength,rank,unique_letters,count_unique
156194,amass,mamma,amassmamma,"[a, m, a, s, s, m, a, m, m, a]",0.56311,3073,ams,3
265479,assay,sassy,assaysassy,"[a, s, s, a, y, s, a, s, s, y]",0.616155,2459,asy,3
522532,bobby,booby,bobbybooby,"[b, o, b, b, y, b, o, o, b, y]",0.390238,5074,boy,3
685559,cacao,cocoa,cacaococoa,"[c, a, c, a, o, c, o, c, o, a]",0.613996,2484,cao,3
1541425,freer,refer,freerrefer,"[f, r, e, e, r, r, e, f, e, r]",0.776501,603,fre,3
1582864,gamma,magma,gammamagma,"[g, a, m, m, a, m, a, g, m, a]",0.501253,3789,gam,3
1582869,gamma,mamma,gammamamma,"[g, a, m, m, a, m, a, m, m, a]",0.501685,3784,gam,3
1991123,llama,mamma,llamamamma,"[l, l, a, m, a, m, a, m, m, a]",0.57175,2973,lam,3
2034918,madam,mamma,madammamma,"[m, a, d, a, m, m, a, m, m, a]",0.508769,3702,mad,3
2039444,magma,mamma,magmamamma,"[m, a, g, m, a, m, a, m, m, a]",0.501685,3784,mag,3


Now we need to decide on a criterion for penalizing pairs with repeated letters. I will use the same criterion adopted previously, increasing the penalty by 20% for each repeated letter. However, for word pairs with 5 or fewer unique letters I will set the penalty at 90%.

Note that these values are arbitrary, and will impact the results.

In [30]:
def penalty_func(x):
  if x == 10:
    return 1
  elif x == 9:
    return 0.8
  elif x == 8:
    return 0.6
  elif x == 7:
    return 0.4
  elif x == 6:
    return 0.2    
  else:
    return 0.1
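The branches above follow a simple rule: each letter missing from the maximum of 10 unique letters removes 0.2 from the factor, with a floor of 0.1. A compact sketch of the same mapping, using integer arithmetic before the final division to avoid floating-point drift (the name `penalty_formula` is hypothetical):

```python
def penalty_formula(count_unique):
    """Correction factor: 1.0 for 10 unique letters, minus 0.2 per repeat, floor 0.1."""
    repeated = 10 - count_unique
    return max(1, 10 - 2 * repeated) / 10

factors = {n: penalty_formula(n) for n in range(10, 2, -1)}
print(factors)
```

A pair with 6 unique letters therefore keeps 20% of its strength, and any pair with 5 or fewer keeps 10%.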

Now we just need to create a new column with the penalty (correction factor):

In [31]:
df_pairs['penalty'] = df_pairs['count_unique'].apply(penalty_func)
df_pairs.head(2)

Unnamed: 0,word_1,word_2,words_combined,letters,strength,rank,unique_letters,count_unique,penalty
0,aback,abase,abackabase,"[a, b, a, c, k, a, b, a, s, e]",0.61054,2524,abckse,6,0.2
1,aback,abate,abackabate,"[a, b, a, c, k, a, b, a, t, e]",0.615723,2464,abckte,6,0.2


To apply the penalty, simply multiply the `"strength"` column by the `"penalty"` column:

In [32]:
df_pairs['strength_penalty'] = df_pairs['strength']*df_pairs['penalty']
df_pairs['strength_penalty'] = df_pairs['strength_penalty'].round(7)
df_pairs.head(2)

Unnamed: 0,word_1,word_2,words_combined,letters,strength,rank,unique_letters,count_unique,penalty,strength_penalty
0,aback,abase,abackabase,"[a, b, a, c, k, a, b, a, s, e]",0.61054,2524,abckse,6,0.2,0.122108
1,aback,abate,abackabate,"[a, b, a, c, k, a, b, a, t, e]",0.615723,2464,abckte,6,0.2,0.123145


Finally, we can create a new ranking based on the `"strength_penalty"` column:

In [33]:
df_pairs['rank_penalty'] = df_pairs['strength_penalty'].rank(ascending=False, method="dense").astype(int)
df_pairs.head(2)

Unnamed: 0,word_1,word_2,words_combined,letters,strength,rank,unique_letters,count_unique,penalty,strength_penalty,rank_penalty
0,aback,abase,abackabase,"[a, b, a, c, k, a, b, a, s, e]",0.61054,2524,abckse,6,0.2,0.122108,16481
1,aback,abate,abackabate,"[a, b, a, c, k, a, b, a, t, e]",0.615723,2464,abckte,6,0.2,0.123145,16420


Hence, the **TOP 10 pairs of words** are:

In [34]:
df_pairs[df_pairs['rank_penalty'] <= 10].sort_values(by='rank_penalty')

Unnamed: 0,word_1,word_2,words_combined,letters,strength,rank,unique_letters,count_unique,penalty,strength_penalty,rank_penalty
33335,acorn,islet,acornislet,"[a, c, o, r, n, i, s, l, e, t]",0.665659,1886,acornislet,10,1.0,0.665659,1
722809,carol,stein,carolstein,"[c, a, r, o, l, s, t, e, i, n]",0.665659,1886,carolstein,10,1.0,0.665659,1
875889,clone,stair,clonestair,"[c, l, o, n, e, s, t, a, i, r]",0.665659,1886,clonestair,10,1.0,0.665659,1
877985,close,train,closetrain,"[c, l, o, s, e, t, r, a, i, n]",0.665659,1886,closetrain,10,1.0,0.665659,1
2189333,octal,siren,octalsiren,"[o, c, t, a, l, s, i, r, e, n]",0.665659,1886,octalsiren,10,1.0,0.665659,1
...,...,...,...,...,...,...,...,...,...,...,...
2082091,merit,salon,meritsalon,"[m, e, r, i, t, s, a, l, o, n]",0.651749,2047,meritsalon,10,1.0,0.651749,10
2086585,metro,slain,metroslain,"[m, e, t, r, o, s, l, a, i, n]",0.651749,2047,metroslain,10,1.0,0.651749,10
2086628,metro,snail,metrosnail,"[m, e, t, r, o, s, n, a, i, l]",0.651749,2047,metrosnail,10,1.0,0.651749,10
2097539,minor,stale,minorstale,"[m, i, n, o, r, s, t, a, l, e]",0.651749,2047,minorstale,10,1.0,0.651749,10


## Adding weights

The last step is to consider the position of the letters in each word. To do this, I'm going to recreate the auxiliary `DataFrame` with the letters of the word pair split into one column each:

In [35]:
df_pairs_aux = pd.DataFrame(df_pairs['letters'].to_list(), columns=range(1,11)).copy()
df_pairs_aux.head(2)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10
0,a,b,a,c,k,a,b,a,s,e
1,a,b,a,c,k,a,b,a,t,e


Now, we just need to replace each letter with its probability multiplied by the weight of its position. But before that, we need to load these weights:



In [36]:
url_weights = 'https://drive.google.com/file/d/10vlb5SJCO3uNIKHghS7spP-SeKvwPAxA/view?usp=sharing'
url_weights = 'https://drive.google.com/uc?id=' + url_weights.split('/')[-2]
df_freq_weights = pd.read_csv(url_weights, index_col=0)
df_freq_weights.head(2)

Unnamed: 0,1,2,3,4,5
a,0.005151,0.011107,0.011216,0.005955,0.002338
b,0.001814,0.000168,0.000598,0.000252,0.000115


Before using this data, it is necessary to convert the column names from strings to integers, which is done as follows:

In [37]:
df_freq_weights.columns = df_freq_weights.columns.astype(int) 

Now we can run the loop to replace the letters with the frequencies that consider the position of the letter in the words:



In [38]:
# This cell takes around 8 minutes to run
for column in df_freq_weights.columns:
  for i in df_freq_weights.index:
    df_pairs_aux[column].replace(i, df_freq_weights[column][i], inplace=True)
    df_pairs_aux[column + 5].replace(i, df_freq_weights[column][i], inplace=True)
df_pairs_aux.head(2)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10
0,0.005151,0.000168,0.011216,0.002706,0.000886,0.005151,0.000168,0.011216,0.004269,0.01951
1,0.005151,0.000168,0.011216,0.002706,0.000886,0.005151,0.000168,0.011216,0.003782,0.01951
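The `column + 5` indexing means that letter positions 6 to 10 (the second word) reuse the weights of positions 1 to 5. A sketch for a single pair, assuming a lookup keyed by `(letter, position)` and filled only with the rounded values visible in the output above; the helper name `weighted_strength` is hypothetical.

```python
# Position-dependent weights, copied from the rounded output above.
# Keys are (letter, position within the word), positions running from 1 to 5.
weights = {
    ("a", 1): 0.005151, ("b", 2): 0.000168, ("a", 3): 0.011216,
    ("c", 4): 0.002706, ("k", 5): 0.000886,
    ("s", 4): 0.004269, ("t", 4): 0.003782, ("e", 5): 0.01951,
}

def weighted_strength(word_1, word_2, weights):
    """Sum position-dependent weights; both words use positions 1 to 5."""
    letters = word_1 + word_2
    return sum(weights[(letter, i % 5 + 1)] for i, letter in enumerate(letters))

w_abase = weighted_strength("aback", "abase", weights)
w_abate = weighted_strength("aback", "abate", weights)
print(w_abase, w_abate)
```

The results agree with the notebook's values only up to the rounding of the inputs.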


Now, we just need to sum each row to get the strength of each pair of words:

In [39]:
df_pairs_aux['strength'] = df_pairs_aux.sum(axis=1)

Finally, we just need to copy the weighted strength from the auxiliary `DataFrame` into `df_pairs`:

In [40]:
df_pairs['strength_weigth'] = df_pairs_aux['strength'].copy()
df_pairs['strength_weigth'] = df_pairs['strength_weigth'].round(7)
df_pairs.head(2)

Unnamed: 0,word_1,word_2,words_combined,letters,strength,rank,unique_letters,count_unique,penalty,strength_penalty,rank_penalty,strength_weigth
0,aback,abase,abackabase,"[a, b, a, c, k, a, b, a, s, e]",0.61054,2524,abckse,6,0.2,0.122108,16481,0.060442
1,aback,abate,abackabate,"[a, b, a, c, k, a, b, a, t, e]",0.615723,2464,abckte,6,0.2,0.123145,16420,0.059954


Before computing the rank, we just need to multiply the `"strength_weigth"` column by the `"penalty"` column to get the desired results:

In [41]:
df_pairs['strength_weigth_penalty'] = df_pairs['strength_weigth']*df_pairs['penalty']
df_pairs['strength_weigth_penalty'] = df_pairs['strength_weigth_penalty'].round(7)
df_pairs.head(2)

Unnamed: 0,word_1,word_2,words_combined,letters,strength,rank,unique_letters,count_unique,penalty,strength_penalty,rank_penalty,strength_weigth,strength_weigth_penalty
0,aback,abase,abackabase,"[a, b, a, c, k, a, b, a, s, e]",0.61054,2524,abckse,6,0.2,0.122108,16481,0.060442,0.012088
1,aback,abate,abackabate,"[a, b, a, c, k, a, b, a, t, e]",0.615723,2464,abckte,6,0.2,0.123145,16420,0.059954,0.011991


Finally, we get the new rank for the pair of words:

In [42]:
df_pairs['rank_weigth_penalty'] = df_pairs['strength_weigth_penalty'].rank(ascending=False, method="dense").astype(int)
df_pairs.head(2)

Unnamed: 0,word_1,word_2,words_combined,letters,strength,rank,unique_letters,count_unique,penalty,strength_penalty,rank_penalty,strength_weigth,strength_weigth_penalty,rank_weigth_penalty
0,aback,abase,abackabase,"[a, b, a, c, k, a, b, a, s, e]",0.61054,2524,abckse,6,0.2,0.122108,16481,0.060442,0.012088,467418
1,aback,abate,abackabate,"[a, b, a, c, k, a, b, a, t, e]",0.615723,2464,abckte,6,0.2,0.123145,16420,0.059954,0.011991,468321


Hence, the **TOP 10 pairs of words** are:

In [43]:
df_pairs[['word_1', 'word_2', 'rank_weigth_penalty']][df_pairs['rank_weigth_penalty'] <= 10].sort_values(by='rank_weigth_penalty')

| | word_1 | word_2 | rank_weigth_penalty |
|---|---|---|---|
| 1009866 | crony | slate | 1 |
| 2562809 | soapy | trice | 2 |
| 1007968 | crone | shalt | 3 |
| 2317939 | price | slant | 4 |
| 610356 | briny | slate | 5 |
| 998887 | crime | slant | 6 |
| 1674687 | grant | slice | 7 |
| 1007901 | crone | salty | 8 |
| 926661 | corny | slate | 9 |
| 604307 | brine | soapy | 10 |


And the **10 WORST pairs of words** are:

In [44]:
df_pairs[['word_1', 'word_2', 'rank_weigth_penalty']][df_pairs['rank_weigth_penalty'] >= (df_pairs['rank_weigth_penalty'].max() - 9)].sort_values(by='rank_weigth_penalty')

| | word_1 | word_2 | rank_weigth_penalty |
|---|---|---|---|
| 2442649 | rumba | umbra | 526085 |
| 1309609 | ennui | undid | 526086 |
| 1820020 | humus | mummy | 526087 |
| 2135059 | mucus | music | 526088 |
| 1492279 | fluff | lupus | 526089 |
| 2023377 | lupus | usurp | 526090 |
| 1819493 | humph | thump | 526091 |
| 817587 | chump | humph | 526092 |
| 1820017 | humus | mucus | 526093 |
| 1818439 | humph | humus | 526094 |
