<a href="https://colab.research.google.com/github/andersonmdcanteli/wordle/blob/main/wordle_only_allowed_pair_of_words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mastering WORDLE Dataset 2: Only not answers words (PART 2)

This notebook contains part of the analyses carried out to find the best words for the game ***WORDLE***. Here we will get the ***best pair of words*** to use in the first two attempts.

This notebook focuses on getting the best word pair for the dataset that contains the allowed words, disregarding the words that might be the word of the day. For other datasets and a general discussion, see this [other notebook](https://colab.research.google.com/drive/1ulRd4zAWIo9Yq6GujbEX7eyp8XhXRCkO?usp=sharing).



## Libraries and versions

To perform this analysis, I'm using [Google Colab](https://colab.research.google.com/drive/1ulRd4zAWIo9Yq6GujbEX7eyp8XhXRCkO?usp=sharing), and the following libraries:

- Python: `3.7.13`
- Pandas: `1.3.5`
- NumPy: `1.21.6`
- matplotlib: `3.2.2`
- Seaborn: `0.11.2`
- SciPy: `1.7.3`

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

## Data collection and preparation

The list of words that can be used for guesses but cannot be the word of the day is available in this [repository](https://gist.github.com/cfreshman/).

We can import the data using pandas:


In [2]:
df_allowed = pd.read_csv(
    "https://gist.githubusercontent.com/cfreshman/cdcdf777450c5b5301e439061d29694c/raw/b8375870720504ecf89c1970ea4532454f12de94/wordle-allowed-guesses.txt",
    header=None, 
    names=['words'])
df_allowed.head(2)

Unnamed: 0,words
0,aahed
1,aalii


In [3]:
df_allowed.describe().transpose()

Unnamed: 0,count,unique,top,freq
words,10657,10657,aahed,1


In the first part of the analysis of this dataset, we obtained the best first word to be used in the game. Now let's get the best first pairs of words.

To do this, we need to get all the unique combinations among the available words. To find out the number of possible unique combinations, I'll use the `comb` function from the `scipy.special` package:

In [3]:
from scipy import special

In [4]:
special.comb(df_allowed.shape[0], 2, exact=True)

56780496

Pairing the 10657 words available in the dataset, without repetition, results in 56780496 combinations, and we want to know which one is the best. Yes, over 56 million unique combinations!
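The count returned by `special.comb` can be checked by hand, since the number of unordered pairs among `n` items is `n * (n - 1) / 2`. A minimal check in plain Python (the variable names are illustrative and not part of the original notebook):

```python
# Number of unordered pairs among n items: n * (n - 1) / 2
n_words = 10657  # number of words in df_allowed, as reported by describe()
n_pairs = n_words * (n_words - 1) // 2
print(n_pairs)  # 56780496
```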

To effectively get all the word pairs, I'm going to use the `combinations` function from the `itertools` package:

In [5]:
from itertools import combinations

We can create a function that produces the combinations as follows:

In [6]:
def generator_func(data, x):
  for comb in combinations(data, x):
    yield comb

To get the pairs, just pass the data with all the words as the first parameter and the number of elements that the combination should return as the second parameter. As we want the pairs, just pass the number two.

In [8]:
pairs = generator_func(df_allowed['words'], 2)
type(pairs)

generator

However, the result obtained is a generator, which must be looped to be used. As we have many combinations, it is more efficient to split this generator into smaller parts (chunks) and consume it step by step.

To do this, we can use the `ichunked` function from the `more_itertools` package:

In [7]:
from more_itertools import ichunked

In [8]:
all_chunks = ichunked(pairs, 1000)


Variable `all_chunks` contains chunks of word pairs, which have not yet been generated. Each chunk contains 1000 possible pairs (this value was chosen arbitrarily), and we can get all pairs in a `while` loop. The advantage of using `ichunked` is that once the combinations of each chunk are obtained, the chunk does not take up any more space, which reduces memory consumption.
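To make the idea of consuming the pairs chunk by chunk concrete, here is a minimal sketch that uses only the standard library. It is not how `ichunked` is implemented: the helper `chunked_pairs` and the four-word sample are assumptions made for illustration, and unlike `ichunked` it materializes each chunk as a list.

```python
from itertools import combinations, islice

def chunked_pairs(words, chunk_size):
    """Yield lists with at most chunk_size word pairs (illustrative stand-in for ichunked)."""
    pairs = combinations(words, 2)  # lazy iterator, nothing is generated yet
    while True:
        chunk = list(islice(pairs, chunk_size))  # pull the next chunk_size pairs
        if not chunk:  # the iterator is exhausted
            return
        yield chunk

toy_words = ["aahed", "aalii", "aargh", "aarti"]  # hypothetical small sample
chunks = list(chunked_pairs(toy_words, 4))
print([len(chunk) for chunk in chunks])  # [4, 2]
```

Four words give six pairs, so a chunk size of 4 produces one full chunk and one partial chunk.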

We can create the following structure to consume the chunks:

In [11]:
out = True
count = 0
while out:
  try:
    next(all_chunks)
    count += 1
  except StopIteration:  # raised when no chunks are left
    out = False
print(count)

56781


To effectively get all possible combinations, we just need to adapt the code above to get a `DataFrame` with the word pairs. But since the previous loop already consumed the generator behind `ichunked`, it is necessary to create the chunks again:

In [9]:
pairs = generator_func(df_allowed['words'], 2)
all_chunks = ichunked(pairs, 1000)

In [10]:
# this cell takes one minute to run
super_df = []
while True:
  try:
    chunk_n = next(all_chunks)
  except StopIteration:  # all chunks were consumed
    break
  # build a small DataFrame from the (word_1, word_2) tuples of this chunk
  super_df.append(pd.DataFrame(chunk_n, columns=['word_1', 'word_2']))
df_pairs = pd.concat(super_df, axis=0, ignore_index=True)
super_df = None  # free memory

Note that at each iteration we get a new `DataFrame` which is stored in a `list` (`super_df`). After all chunks were consumed the loop is terminated, and then all the `DataFrames` contained in the `super_df` `list` are concatenated into a single `DataFrame`.

In [14]:
df_pairs.head(2)

Unnamed: 0,word_1,word_2
0,aahed,aalii
1,aahed,aargh


In [15]:
df_pairs.shape

(56780496, 2)

Note that the resulting `DataFrame` contains exactly the number of possible combinations! Because it holds this many rows of `object` (string) data, this dataframe takes up a considerable amount of memory: **866 MB**.

In [16]:
df_pairs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56780496 entries, 0 to 56780495
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   word_1  object
 1   word_2  object
dtypes: object(2)
memory usage: 866.4+ MB


Unfortunately, doing operations on `objects` consumes a lot of memory. If the full `DataFrame` is used to do all the necessary operations, the colab environment will disconnect because it will *run out of RAM*.

To get around this problem, let's split the dataset into small `DataFrames`, perform the necessary operations, and then put it all back together to find the best word pairs.

To separate the data frame into parts, we can use the `array_split` function of the `NumPy` library. I chose to split it into 1000 parts because, in my tests, each part is processed in a few seconds. But this choice was arbitrary!

In [17]:
df_split = np.array_split(df_pairs, 1000)
len(df_split)

1000

Note that the `DataFrames` were stored in a `list`.

Each `DataFrame` averages 887 KB, which is very good for avoiding excessive memory usage!

In [18]:
df_split[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56781 entries, 0 to 56780
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   word_1  56781 non-null  object
 1   word_2  56781 non-null  object
dtypes: object(2)
memory usage: 887.3+ KB
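The first chunk has 56781 rows rather than 56780 because 56780496 is not a multiple of 1000. According to the NumPy documentation, `array_split` gives the first `n % k` parts one extra element. A small sketch of that rule (the helper `split_sizes` is hypothetical and written only for this illustration):

```python
def split_sizes(n_rows, n_parts):
    """Sizes of the parts produced when n_rows are split into n_parts nearly equal parts."""
    base, extra = divmod(n_rows, n_parts)
    # the first `extra` parts receive one additional row
    return [base + 1] * extra + [base] * (n_parts - extra)

sizes = split_sizes(56780496, 1000)
print(sizes[0], sizes[-1], sum(sizes))  # 56781 56780 56780496
```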


To develop the code I will use only the first dataframe. So, after finishing this step, we just need to put the code inside a loop.

We start by creating a copy of the first `DataFrame`:

In [19]:
j = 0
df = df_split[j].copy()
df.head(2)

Unnamed: 0,word_1,word_2
0,aahed,aalii
1,aahed,aargh


As we are going to repeat the following steps inside a loop, it is important to reset the data frame index. Hence:

In [21]:
df.reset_index(inplace=True, drop=True)
df.head(2)

Unnamed: 0,word_1,word_2
0,aahed,aalii
1,aahed,aargh


Now, we need to concatenate the two words to get a string with the 10 letters of the two words combined:

In [22]:
df['words_combined'] = df['word_1'] + df['word_2'] 
df.head(2)

Unnamed: 0,word_1,word_2,words_combined
0,aahed,aalii,aahedaalii
1,aahed,aargh,aahedaargh


Now we need to convert each cell of the new column into a list of 10 elements, where each element will correspond to a single letter:

In [23]:
df['letters'] = df['words_combined'].apply(list)
df.head()

Unnamed: 0,word_1,word_2,words_combined,letters
0,aahed,aalii,aahedaalii,"[a, a, h, e, d, a, a, l, i, i]"
1,aahed,aargh,aahedaargh,"[a, a, h, e, d, a, a, r, g, h]"
2,aahed,aarti,aahedaarti,"[a, a, h, e, d, a, a, r, t, i]"
3,aahed,abaca,aahedabaca,"[a, a, h, e, d, a, b, a, c, a]"
4,aahed,abaci,aahedabaci,"[a, a, h, e, d, a, b, a, c, i]"


Now we use this new column (`"letters"`) to create an auxiliary `DataFrame`, which will only contain the letters of each pair of words in individual cells:

In [24]:
df_aux = pd.DataFrame(df['letters'].to_list(), columns=range(1,11)).copy()
df_aux.head(2)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10
0,a,a,h,e,d,a,a,l,i,i
1,a,a,h,e,d,a,a,r,g,h


Now we just need to replace each letter with its respective frequency. The frequencies were previously obtained (in [this notebook](https://colab.research.google.com/drive/1mkjHoHicvU3hMVs7iQGtcams12jarrr-?usp=sharing)), and we can import them as follows:


In [25]:
url_freq = 'https://drive.google.com/file/d/1HlLaMr2R0XO6jmnjBQA_XG18SfG9h95l/view?usp=sharing'
url_freq = 'https://drive.google.com/uc?id=' + url_freq.split('/')[-2]
df_freq_allowed = pd.read_csv(url_freq, index_col=0)
df_freq_allowed.head(2)

Unnamed: 0,frequency_allowed
a,0.094041
b,0.02526


Now let's replace each letter with its respective frequency:

In [26]:
for i in df_freq_allowed.index:
  df_aux.replace(i, df_freq_allowed['frequency_allowed'][i], inplace=True)
df_aux.head(2)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10
0,0.094041,0.094041,0.02573,0.101886,0.03866,0.094041,0.094041,0.04977,0.057953,0.057953
1,0.094041,0.094041,0.02573,0.101886,0.03866,0.094041,0.094041,0.061162,0.025016,0.02573


Now, we just need to sum each row to get the strength of each pair of words:

In [27]:
df_aux['strength'] = df_aux.sum(axis=1)
df_aux.head(2)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,strength
0,0.094041,0.094041,0.02573,0.101886,0.03866,0.094041,0.094041,0.04977,0.057953,0.057953,0.708117
1,0.094041,0.094041,0.02573,0.101886,0.03866,0.094041,0.094041,0.061162,0.025016,0.02573,0.654349
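As a sanity check, the strength of the first two pairs can be reproduced with plain Python. This is only a sketch: the dictionary holds the rounded frequencies displayed above (so the last decimal place may differ slightly from the notebook's values), and the helper `pair_strength` is a hypothetical name.

```python
# Letter frequencies copied from the rounded values displayed above
letter_freq = {
    'a': 0.094041, 'h': 0.02573, 'e': 0.101886, 'd': 0.03866,
    'l': 0.04977, 'i': 0.057953, 'r': 0.061162, 'g': 0.025016,
}

def pair_strength(word_1, word_2, freq):
    """Sum the frequency of each of the 10 letters in the pair."""
    return sum(freq[letter] for letter in word_1 + word_2)

print(round(pair_strength('aahed', 'aalii', letter_freq), 5))
print(round(pair_strength('aahed', 'aargh', letter_freq), 5))
```

Both values agree with the `strength` column above to within the rounding of the displayed frequencies.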


Finally, just insert column `"strength"` in the original `DataFrame`:

In [28]:
df['strength'] = df_aux['strength'].round(7).copy()
df.head(2)

Unnamed: 0,word_1,word_2,words_combined,letters,strength
0,aahed,aalii,aahedaalii,"[a, a, h, e, d, a, a, l, i, i]",0.708117
1,aahed,aargh,aahedaargh,"[a, a, h, e, d, a, a, r, g, h]",0.654349


To save some memory I will remove columns `"words_combined"` and `"letters"`:

In [29]:
df = df.drop(['words_combined', 'letters'], axis=1) 
df.head(2)

Unnamed: 0,word_1,word_2,strength
0,aahed,aalii,0.708117
1,aahed,aargh,0.654349


Finally, we replace this `DataFrame` (`df`) in the `list` (`df_split`) that contains all parts of the original `DataFrame` (`df_pairs`):

In [30]:
df_split[j] = df.copy()

Now we need to perform all the steps above for each chunk of the full `DataFrame`. To do this, we can use a `for` loop:

In [31]:
# This loop takes around 2 hours to finish
for j in range(len(df_split)):
  df = df_split[j].copy()
  df.reset_index(inplace=True, drop=True)
  df['words_combined'] = df['word_1'] + df['word_2'] 
  df['letters'] = df['words_combined'].apply(list)
  df_aux = pd.DataFrame(df['letters'].to_list(), columns=range(1,11)).copy()
  for i in df_freq_allowed.index:
    df_aux.replace(i, df_freq_allowed['frequency_allowed'][i], inplace=True)
  df_aux['strength'] = df_aux.sum(axis=1)
  df['strength'] = df_aux['strength'].round(7).copy()
  df_aux = None # free memory
  df = df.drop(['words_combined', 'letters'], axis=1) 
  df_split[j] = df.copy()
  df = None # free memory


After the iterations are finished, we need to concatenate all the `DataFrames` that are in list `df_split`. That is:

In [32]:
df_pair_of_words = pd.concat(df_split, axis=0, ignore_index=True)
df_pair_of_words.head(2)

Unnamed: 0,word_1,word_2,strength
0,aahed,aalii,0.708117
1,aahed,aargh,0.654349


This new `DataFrame` takes up a fair amount of memory:

In [33]:
df_pair_of_words.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56780496 entries, 0 to 56780495
Data columns (total 3 columns):
 #   Column    Dtype  
---  ------    -----  
 0   word_1    object 
 1   word_2    object 
 2   strength  float64
dtypes: float64(1), object(2)
memory usage: 1.3+ GB


Now we need to rank the pairs of words. For this, I will create a new column using the `rank` function:

In [34]:
df_pair_of_words['rank'] = df_pair_of_words['strength'].rank(ascending=False, method="dense").astype(int)
df_pair_of_words.head(2)

Unnamed: 0,word_1,word_2,strength,rank
0,aahed,aalii,0.708117,11585
1,aahed,aargh,0.654349,14450
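With `method="dense"`, tied values share the same rank and the next distinct value gets the next integer, so several pairs can hold the same rank. A minimal pure-Python sketch of this ranking scheme (the helper `dense_rank_desc` is hypothetical and only mimics the behavior for illustration; it is not how pandas computes ranks):

```python
def dense_rank_desc(values):
    """Dense ranks in descending order: ties share a rank, with no gaps between ranks."""
    ordered = sorted(set(values), reverse=True)  # distinct values, largest first
    rank_of = {value: rank for rank, value in enumerate(ordered, start=1)}
    return [rank_of[value] for value in values]

strengths = [1.074862, 1.074862, 1.067017, 1.064221, 1.067017]
print(dense_rank_desc(strengths))  # [1, 1, 2, 3, 2]
```

One consequence is that a filter such as `rank <= 10` can return more than 10 rows whenever there are ties.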


So, the **TOP 10 pairs of words** are:

In [35]:
df_pair_of_words[df_pair_of_words['rank'] <= 10].sort_values(by='rank')

Unnamed: 0,word_1,word_2,strength,rank
25931381,esses,sessa,1.074862,1
5012415,asses,esses,1.074862,1
25931179,esses,sasse,1.074862,1
5017539,asses,sasse,1.067017,2
5017741,asses,sessa,1.067017,2
53050151,sasse,sessa,1.067017,2
25931294,esses,sease,1.064221,3
24664549,eases,esses,1.064221,3
53357546,sease,sessa,1.056376,4
53050064,sasse,sease,1.056376,4


and the **WORST 10 pairs of words** are:

In [36]:
df_pair_of_words[df_pair_of_words['rank'] >= (df_pair_of_words['rank'].max() - 9)].sort_values(by='rank')

Unnamed: 0,word_1,word_2,strength,rank
29539942,fuffy,muzzy,0.234381,34937
29544169,fuffy,whizz,0.233443,34938
45910606,muzzy,whizz,0.229558,34939
29538193,fuffy,huzzy,0.228958,34940
13197334,buzzy,fuffy,0.228488,34941
36222166,huzzy,muzzy,0.225073,34942
13200052,buzzy,muzzy,0.224604,34943
36226393,huzzy,whizz,0.224134,34944
13204279,buzzy,whizz,0.223665,34945
13198303,buzzy,huzzy,0.21918,34946


## Adding penalty for words with repeated letters

As in the previous cases, the best pairs of words are those with repeated letters. We need to add a penalty to these pairs.

To do this, we need to know which pairs of words have repeating letters, and how many repeating letters each pair has. We also need to define how the penalty will work. 

I'm going to use the same strategy adopted before: separate the original data frame into several chunks, perform all the necessary operations and finally put everything together in a `DataFrame` to rank the results.

First, we need to split the `DataFrame`:

In [37]:
df_split = np.array_split(df_pairs, 1000)
len(df_split)

1000

Then, we need to get a copy of the first chunk:

In [38]:
j = 0
df = df_split[j].copy()
df.reset_index(inplace=True, drop=True)
df.head(2)

Unnamed: 0,word_1,word_2
0,aahed,aalii
1,aahed,aargh


We need to combine the pair of words:

In [39]:
df['words_combined'] = df['word_1'] + df['word_2'] 
df.head(2)

Unnamed: 0,word_1,word_2,words_combined
0,aahed,aalii,aahedaalii
1,aahed,aargh,aahedaargh


Now we need to find out how many repeated letters each pair of words has. For this, I'll get a new string containing only the unique letters using the `unique` function:

In [15]:
def unique(sequence):
  # source  https://stackoverflow.com/a/58666031/17872198
  seen = set()
  return [x for x in sequence if not (x in seen or seen.add(x))]
## example
unique("juliana")

['j', 'u', 'l', 'i', 'a', 'n']
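Since Python 3.7, dictionaries preserve insertion order, so the same order-preserving deduplication can also be sketched with `dict.fromkeys`. This is only an alternative illustration, not the function used in this notebook; the helper name `unique_letters` is an assumption.

```python
def unique_letters(sequence):
    """Return the distinct characters of sequence as a string, in first-seen order."""
    return "".join(dict.fromkeys(sequence))

print(unique_letters("juliana"))     # julian
print(unique_letters("aahedaalii"))  # ahedli
```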

In [41]:
df['unique'] = df['words_combined'].apply(unique).apply("".join)
df.head(2)

Unnamed: 0,word_1,word_2,words_combined,unique
0,aahed,aalii,aahedaalii,ahedli
1,aahed,aargh,aahedaargh,ahedrg


Now we just need to count the length of each string in column `"unique"`:

In [42]:
df['count_unique'] = df['unique'].str.len()
df.head(2)

Unnamed: 0,word_1,word_2,words_combined,unique,count_unique
0,aahed,aalii,aahedaalii,ahedli,6
1,aahed,aargh,aahedaargh,ahedrg,6


Now that we know how many unique letters each pair of words has, we can use this information to apply a penalty rule. The rule is the same one adopted in the other cases and was chosen arbitrarily. The function used is as follows:

In [14]:
def penalty_func(x):
  if x == 10:
    return 1
  elif x == 9:
    return 0.8
  elif x == 8:
    return 0.6
  elif x == 7:
    return 0.4
  elif x == 6:
    return 0.2    
  else:
    return 0.1

Now just apply the function:

In [44]:
df['penalty'] = df['count_unique'].apply(penalty_func)
df.head(2)

Unnamed: 0,word_1,word_2,words_combined,unique,count_unique,penalty
0,aahed,aalii,aahedaalii,ahedli,6,0.2
1,aahed,aargh,aahedaargh,ahedrg,6,0.2
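The penalty rule is linear in the number of unique letters: each missing unique letter removes 0.2 from the factor, and pairs with five or fewer unique letters get a fixed factor of 0.1. A compact sketch of the same rule (the names `penalty_linear` and `pair_penalty` are hypothetical, and `len(set(...))` is used as a shortcut because only the count of unique letters matters here):

```python
def penalty_linear(n_unique):
    """Same values as penalty_func: 1, 0.8, 0.6, 0.4, 0.2 for 10..6 unique letters, else 0.1."""
    if n_unique >= 6:
        return round(1 - 0.2 * (10 - n_unique), 1)
    return 0.1

def pair_penalty(word_1, word_2):
    """Penalty factor for a pair, based on its number of distinct letters."""
    return penalty_linear(len(set(word_1 + word_2)))

print(pair_penalty('aahed', 'aalii'))  # 0.2 (6 unique letters)
print(pair_penalty('adits', 'enrol'))  # 1.0 (10 unique letters)
```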


Thus, the data frame with the penalty was obtained. To apply the penalty, simply multiply column `"penalty"` by the `"strength"` column. However, that column is not stored in this `DataFrame`, so it is necessary to join these two pieces of information in a single `DataFrame`.

But before that, let's remove columns `'words_combined'` and `'unique'` as they are object types and take up a lot of memory.

In [45]:
df = df.drop(['words_combined', 'unique'], axis=1) 
df.head(2)

Unnamed: 0,word_1,word_2,count_unique,penalty
0,aahed,aalii,6,0.2
1,aahed,aargh,6,0.2


Now we replace the obtained dataframe in the original list:

In [46]:
df_split[j] = df.copy()

Now we are ready to put all these steps together to get the penalties for the entire dataset. That is:

In [47]:
df_split = np.array_split(df_pairs, 1000)
len(df_split)

1000

In [48]:
# This loop takes around 4 minutes
for j in range(len(df_split)):
  df = df_split[j].copy()
  df.reset_index(inplace=True, drop=True)
  df['words_combined'] = df['word_1'] + df['word_2'] 
  df['unique'] = df['words_combined'].apply(unique).apply("".join)
  df['count_unique'] = df['unique'].str.len()
  df['penalty'] = df['count_unique'].apply(penalty_func)
  df = df.drop(['words_combined', 'unique'], axis=1) 
  df_split[j] = df.copy()
  df = None


Once the loop is finished, just concatenate all the `DataFrames`:

In [49]:
df_penalty = pd.concat(df_split, axis=0, ignore_index=True)
df_penalty.head(2)

Unnamed: 0,word_1,word_2,count_unique,penalty
0,aahed,aalii,6,0.2
1,aahed,aargh,6,0.2


In [50]:
df_penalty.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56780496 entries, 0 to 56780495
Data columns (total 4 columns):
 #   Column        Dtype  
---  ------        -----  
 0   word_1        object 
 1   word_2        object 
 2   count_unique  int64  
 3   penalty       float64
dtypes: float64(1), int64(1), object(2)
memory usage: 1.7+ GB


Now we need to join the `df_penalty` `DataFrame` with the column `"strength"` of the `df_pair_of_words` `DataFrame`. As the order of the rows has not changed, the word pairs are in the same position in both dataframes, which makes our work easier:

In [51]:
df_penalty["strength"] = df_pair_of_words["strength"].copy()
df_penalty.head(2)

Unnamed: 0,word_1,word_2,count_unique,penalty,strength
0,aahed,aalii,6,0.2,0.708117
1,aahed,aargh,6,0.2,0.654349


To effectively *apply* the penalty, simply multiply the `"penalty"` column by the `"strength"` column:

In [52]:
df_penalty['strength_penalty'] = df_penalty['penalty']*df_penalty['strength']
df_penalty.head(2)

Unnamed: 0,word_1,word_2,count_unique,penalty,strength,strength_penalty
0,aahed,aalii,6,0.2,0.708117,0.141623
1,aahed,aargh,6,0.2,0.654349,0.13087


Let's drop some columns:

In [54]:
df_penalty = df_penalty.drop(['count_unique', 'penalty', 'strength'], axis=1)
df_penalty.head(2)

Unnamed: 0,word_1,word_2,strength_penalty
0,aahed,aalii,0.141623
1,aahed,aargh,0.13087


Now we just need to get a column ranking the word pairs, which is done with the `rank` function:

In [55]:
df_penalty['rank'] = df_penalty['strength_penalty'].rank(ascending=False, method="dense").astype(int)
df_penalty.head(2)

Unnamed: 0,word_1,word_2,strength_penalty,rank
0,aahed,aalii,0.141623,91129
1,aahed,aargh,0.13087,94704


So, the **top-ranked pairs of words**, all tied at rank 1, are:

In [56]:
df_penalty[df_penalty['rank'] <= 1].sort_values(by='rank')

Unnamed: 0,word_1,word_2,strength_penalty,rank
957697,adits,enrol,0.677902,1
41889583,lints,oread,0.677902,1
41889444,lints,oared,0.677902,1
41886785,linos,tared,0.677902,1
41885064,linos,rated,0.677902,1
...,...,...,...,...
22984216,dorts,elain,0.677902,1
22970189,dorsa,lenti,0.677902,1
22969443,dorsa,intel,0.677902,1
22967781,dorsa,elint,0.677902,1


and the **WORST 10 pairs of words** are:

In [57]:
df_penalty[df_penalty['rank'] >= (df_penalty['rank'].max() - 10)].sort_values(by='rank')

Unnamed: 0,word_1,word_2,strength_penalty,rank
49243156,phizz,whizz,0.024378,140600
15588633,chizz,whizz,0.024189,140601
9459031,bizzy,buzzy,0.02383,140602
36027622,huffy,huzzy,0.023808,140603
12454754,buffy,buzzy,0.023714,140604
29539942,fuffy,muzzy,0.023438,140605
29538193,fuffy,huzzy,0.022896,140606
13197334,buzzy,fuffy,0.022849,140607
36222166,huzzy,muzzy,0.022507,140608
13200052,buzzy,muzzy,0.02246,140609


In [58]:
df_penalty.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56780496 entries, 0 to 56780495
Data columns (total 4 columns):
 #   Column            Dtype  
---  ------            -----  
 0   word_1            object 
 1   word_2            object 
 2   strength_penalty  float64
 3   rank              int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 1.7+ GB


## Adding weights

The last step is to consider the position of the letters in each word. I will continue using the strategy of separating the dataframe into chunks to avoid memory problems.

Since we're going to need to recalculate the word strength, I'm going to clean up the dataframes:



In [59]:
df_penalty = None
df_pair_of_words = None
df_split = None

In [87]:
import gc

In [92]:
gc.collect()

88

First, we recreate the chunks of the dataframe:

In [60]:
df_split = np.array_split(df_pairs, 1000)
len(df_split)

1000

To develop the code I will use only the first dataframe. So, after finishing this step, we just need to put the code inside a loop.

We start by creating a copy of the first `DataFrame`:

In [61]:
j = 0
df = df_split[j].copy()
df.head(2)

Unnamed: 0,word_1,word_2
0,aahed,aalii
1,aahed,aargh


As we are going to repeat the following steps inside a loop, it is important to reset the data frame index. Hence:

In [62]:
df.reset_index(inplace=True, drop=True)

Now, we need to concatenate the two words to get a string with the 10 letters of the two words combined:

In [63]:
df['words_combined'] = df['word_1'] + df['word_2'] 
df.head(2)

Unnamed: 0,word_1,word_2,words_combined
0,aahed,aalii,aahedaalii
1,aahed,aargh,aahedaargh


Now we need to convert each cell of the new column into a list of 10 elements, where each element will correspond to a single letter:

In [64]:
df['letters'] = df['words_combined'].apply(list)
df.head()

Unnamed: 0,word_1,word_2,words_combined,letters
0,aahed,aalii,aahedaalii,"[a, a, h, e, d, a, a, l, i, i]"
1,aahed,aargh,aahedaargh,"[a, a, h, e, d, a, a, r, g, h]"
2,aahed,aarti,aahedaarti,"[a, a, h, e, d, a, a, r, t, i]"
3,aahed,abaca,aahedabaca,"[a, a, h, e, d, a, b, a, c, a]"
4,aahed,abaci,aahedabaci,"[a, a, h, e, d, a, b, a, c, i]"


Now we use this new column (`"letters"`) to create an auxiliary `DataFrame`, which will only contain the letters of each pair of words in individual cells:

In [65]:
df_aux = pd.DataFrame(df['letters'].to_list(), columns=range(1,11)).copy()
df_aux.head(2)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10
0,a,a,h,e,d,a,a,l,i,i
1,a,a,h,e,d,a,a,r,g,h


Now let's replace each letter with its respective frequency, this time taking into account the position of the letter in the word. But first, we need to import this data:

In [66]:
url_freq_weigth = 'https://drive.google.com/file/d/1XOwGPnKWCKozj-CgSbF7OzZo_am8QdpU/view?usp=sharing'
url_freq_weigth = 'https://drive.google.com/uc?id=' + url_freq_weigth.split('/')[-2]
df_freq_weigth_allowed = pd.read_csv(url_freq_weigth, index_col=0)
df_freq_weigth_allowed.head(2)

Unnamed: 0,1,2,3,4,5
a,0.005259,0.017287,0.008198,0.008039,0.005436
b,0.001745,0.000154,0.000659,0.000519,0.000114


Before using this data, it is necessary to transform the column name from string to integer, which is done as follows:

In [68]:
df_freq_weigth_allowed.columns = df_freq_weigth_allowed.columns.astype(int) 
df_freq_weigth_allowed.head(2)

Unnamed: 0,1,2,3,4,5
a,0.005259,0.017287,0.008198,0.008039,0.005436
b,0.001745,0.000154,0.000659,0.000519,0.000114


In [69]:
for column in df_freq_weigth_allowed.columns:
  for i in df_freq_weigth_allowed.index:
    df_aux[column].replace(i, df_freq_weigth_allowed[column][i], inplace=True)
    df_aux[column + 5].replace(i, df_freq_weigth_allowed[column][i], inplace=True)
df_aux.head(2)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10
0,0.005259,0.017287,0.000268,0.019207,0.002558,0.005259,0.017287,0.003437,0.003926,0.001463
1,0.005259,0.017287,0.000268,0.019207,0.002558,0.005259,0.017287,0.00594,0.000815,0.000558


Now, we just need to sum each row to get the strength of each pair of words:

In [70]:
df_aux['strength'] = df_aux.sum(axis=1)
df_aux.head(2)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,strength
0,0.005259,0.017287,0.000268,0.019207,0.002558,0.005259,0.017287,0.003437,0.003926,0.001463,0.075951
1,0.005259,0.017287,0.000268,0.019207,0.002558,0.005259,0.017287,0.00594,0.000815,0.000558,0.074437
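The position-weighted strength of the first pair can be reproduced with plain Python. This is a sketch under stated assumptions: the dictionary contains only the (letter, position) weights visible in the rounded output above, and the helper `weighted_strength` is a hypothetical name. A letter in position `k` of either word uses the weight from column `k`.

```python
# (letter, position) -> weight, copied from the rounded values displayed above
position_weights = {
    ('a', 1): 0.005259, ('a', 2): 0.017287, ('h', 3): 0.000268,
    ('e', 4): 0.019207, ('d', 5): 0.002558, ('l', 3): 0.003437,
    ('i', 4): 0.003926, ('i', 5): 0.001463,
}

def weighted_strength(word_1, word_2, weights):
    """Sum the position-specific weight of every letter in both words."""
    total = 0.0
    for word in (word_1, word_2):
        for position, letter in enumerate(word, start=1):
            total += weights[(letter, position)]
    return total

print(round(weighted_strength('aahed', 'aalii', position_weights), 6))  # 0.075951
```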


And then add the new metric to the dataset:

In [71]:
df['strength_weigth'] = df_aux['strength'].round(7).copy()
df.head(2)

Unnamed: 0,word_1,word_2,words_combined,letters,strength_weigth
0,aahed,aalii,aahedaalii,"[a, a, h, e, d, a, a, l, i, i]",0.075951
1,aahed,aargh,aahedaargh,"[a, a, h, e, d, a, a, r, g, h]",0.074437


To get the desired results, we need to multiply the `"strength_weigth"` column by the `"penalty"` column. But first we need to compute the penalty, starting with the unique letters of each pair:

In [72]:
df['unique'] = df['words_combined'].apply(unique).apply("".join)
df.head(2)

Unnamed: 0,word_1,word_2,words_combined,letters,strength_weigth,unique
0,aahed,aalii,aahedaalii,"[a, a, h, e, d, a, a, l, i, i]",0.075951,ahedli
1,aahed,aargh,aahedaargh,"[a, a, h, e, d, a, a, r, g, h]",0.074437,ahedrg


Now we just need to count the length of each string in column `"unique"`:

In [73]:
df['count_unique'] = df['unique'].str.len()
df.head(2)

Unnamed: 0,word_1,word_2,words_combined,letters,strength_weigth,unique,count_unique
0,aahed,aalii,aahedaalii,"[a, a, h, e, d, a, a, l, i, i]",0.075951,ahedli,6
1,aahed,aargh,aahedaargh,"[a, a, h, e, d, a, a, r, g, h]",0.074437,ahedrg,6


Now just apply the function:

In [74]:
df['penalty'] = df['count_unique'].apply(penalty_func)
df.head(2)

Unnamed: 0,word_1,word_2,words_combined,letters,strength_weigth,unique,count_unique,penalty
0,aahed,aalii,aahedaalii,"[a, a, h, e, d, a, a, l, i, i]",0.075951,ahedli,6,0.2
1,aahed,aargh,aahedaargh,"[a, a, h, e, d, a, a, r, g, h]",0.074437,ahedrg,6,0.2


and multiply by the penalty:

In [75]:
df['strength_weigth_penalty'] = df['strength_weigth']*df['penalty']
df.head(2)

Unnamed: 0,word_1,word_2,words_combined,letters,strength_weigth,unique,count_unique,penalty,strength_weigth_penalty
0,aahed,aalii,aahedaalii,"[a, a, h, e, d, a, a, l, i, i]",0.075951,ahedli,6,0.2,0.01519
1,aahed,aargh,aahedaargh,"[a, a, h, e, d, a, a, r, g, h]",0.074437,ahedrg,6,0.2,0.014887


Let's drop some columns:

In [76]:
df = df.drop(['words_combined', 'letters', 'count_unique', 'penalty', 'strength_weigth', 'unique'], axis=1)
df.head(2)

Unnamed: 0,word_1,word_2,strength_weigth_penalty
0,aahed,aalii,0.01519
1,aahed,aargh,0.014887


And replace the original dataset:

In [77]:
df_split[j] = df.copy()
df_split[j].head(2)

Unnamed: 0,word_1,word_2,strength_weigth_penalty
0,aahed,aalii,0.01519
1,aahed,aargh,0.014887


Now just put everything together in a loop, and wait a few hours until the calculations are finished.

In [78]:
df_split = np.array_split(df_pairs, 1000)
len(df_split)

1000

In [93]:
for j in range(len(df_split)):
  df = df_split[j].copy()
  df.reset_index(inplace=True, drop=True)
  df['words_combined'] = df['word_1'] + df['word_2'] 
  df['letters'] = df['words_combined'].apply(list)
  df_aux = pd.DataFrame(df['letters'].to_list(), columns=range(1,11)).copy()
  for column in df_freq_weigth_allowed.columns:
    for i in df_freq_weigth_allowed.index:
      df_aux[column].replace(i, df_freq_weigth_allowed[column][i], inplace=True)
      df_aux[column + 5].replace(i, df_freq_weigth_allowed[column][i], inplace=True)

  df_aux['strength'] = df_aux.sum(axis=1)
  df['strength_weigth'] = df_aux['strength'].round(7).copy()
  df_aux = None
  df['unique'] = df['words_combined'].apply(unique).apply("".join)
  df['count_unique'] = df['unique'].str.len()
  df['penalty'] = df['count_unique'].apply(penalty_func)
  df['strength_weigth_penalty'] = df['strength_weigth']*df['penalty']
  df = df.drop(['words_combined', 'letters', 'count_unique', 'penalty', 'strength_weigth', 'unique'], axis=1)
  df_split[j] = df.copy()
  df = None


In [94]:
df_weigth_penalty = pd.concat(df_split, axis=0, ignore_index=True)
df_weigth_penalty.head(2)

Unnamed: 0,word_1,word_2,strength_weigth_penalty
0,aahed,aalii,0.01519
1,aahed,aargh,0.014887


In [95]:
df_weigth_penalty.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56780496 entries, 0 to 56780495
Data columns (total 3 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   word_1                   object 
 1   word_2                   object 
 2   strength_weigth_penalty  float64
dtypes: float64(1), object(2)
memory usage: 1.3+ GB


In [96]:
df_split = None

In [109]:
gc.collect()

224

Now we can create the rank column:

In [101]:
df_weigth_penalty['rank'] = df_weigth_penalty['strength_weigth_penalty'].rank(ascending=False, method="dense").astype(int)
df_weigth_penalty.head(2)

Unnamed: 0,word_1,word_2,strength_weigth_penalty,rank
0,aahed,aalii,0.01519,3588629
1,aahed,aargh,0.014887,3609558


In [105]:
df_weigth_penalty.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56780496 entries, 0 to 56780495
Data columns (total 4 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   word_1                   object 
 1   word_2                   object 
 2   strength_weigth_penalty  float64
 3   rank                     int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 1.7+ GB


Hence, the **TOP 10 pairs of words** are:

In [102]:
df_weigth_penalty[df_weigth_penalty['rank'] <= 10].sort_values(by='rank')

Unnamed: 0,word_1,word_2,strength_weigth_penalty,rank
40786913,laris,tones,0.119775,1
48584077,paris,tones,0.119617,2
20189518,daris,tones,0.119614,3
48584061,paris,toles,0.119604,4
48388625,palis,tores,0.119604,4
49920983,polis,tares,0.119604,4
19975371,dalis,tores,0.119601,5
20189502,daris,toles,0.119601,5
22916565,doris,tales,0.119601,5
48563793,pares,toils,0.119404,6


And the **10 WORST pairs of words** are:

In [103]:
df_weigth_penalty[df_weigth_penalty['rank'] >= (df_weigth_penalty['rank'].max() - 9)].sort_values(by='rank')

Unnamed: 0,word_1,word_2,strength_weigth_penalty,rank
19654582,cwtch,phpht,0.001229,4380043
12341993,buchu,chuff,0.001215,4380044
36335559,hyphy,yucch,0.001214,4380045
15823748,chuff,uhuru,0.001198,4380046
33514490,gyppo,hyphy,0.00119,4380047
36849436,infix,unfix,0.001182,4380048
36335401,hyphy,xylyl,0.001181,4380049
15824599,chuff,yucch,0.001144,4380050
36332156,hyphy,psych,0.001054,4380051
38177632,jugum,ungum,0.000984,4380052


## Rethinking frequencies and weights

In this notebook we found the best pairs of words to use in the first two attempts. However, the dataset used does not include the words that can be the word of the day. Thus, the results obtained are not optimized for the possible answers.

To fix this, instead of using the frequencies and weights estimated for the `df_allowed` dataset, we can use the results obtained for the `df_answers` dataset.

We just need to import this dataset:

In [11]:
url_answers = 'https://drive.google.com/file/d/10vlb5SJCO3uNIKHghS7spP-SeKvwPAxA/view?usp=sharing'
url_answers = 'https://drive.google.com/uc?id=' + url_answers.split('/')[-2]
df_freq_weights_answers = pd.read_csv(url_answers, index_col=0)
df_freq_weights_answers.columns = df_freq_weights_answers.columns.astype(int) 
df_freq_weights_answers.head(2)

Unnamed: 0,1,2,3,4,5
a,0.005151,0.011107,0.011216,0.005955,0.002338
b,0.001814,0.000168,0.000598,0.000252,0.000115


And replace `df_freq_weigth_allowed` with `df_freq_weights_answers`

In [16]:
df_split = np.array_split(df_pairs, 1000)
len(df_split)

1000

In [17]:
for j in range(len(df_split)):
  df = df_split[j].copy()
  df.reset_index(inplace=True, drop=True)
  df['words_combined'] = df['word_1'] + df['word_2'] 
  df['letters'] = df['words_combined'].apply(list)
  df_aux = pd.DataFrame(df['letters'].to_list(), columns=range(1,11)).copy()
  for column in df_freq_weights_answers.columns:
    for i in df_freq_weights_answers.index:
      df_aux[column].replace(i, df_freq_weights_answers[column][i], inplace=True)
      df_aux[column + 5].replace(i, df_freq_weights_answers[column][i], inplace=True)

  df_aux['strength'] = df_aux.sum(axis=1)
  df['strength_weigth'] = df_aux['strength'].round(7).copy()
  df_aux = None
  df['unique'] = df['words_combined'].apply(unique).apply("".join)
  df['count_unique'] = df['unique'].str.len()
  df['penalty'] = df['count_unique'].apply(penalty_func)
  df['strength_weigth_penalty'] = df['strength_weigth']*df['penalty']
  df = df.drop(['words_combined', 'letters', 'count_unique', 'penalty', 'strength_weigth', 'unique'], axis=1)
  df_split[j] = df.copy()
  df = None


Now we need to concatenate all chunks:

In [18]:
df_weigth_penalty_answers = pd.concat(df_split, axis=0, ignore_index=True)
df_weigth_penalty_answers.head(2)

Unnamed: 0,word_1,word_2,strength_weigth_penalty
0,aahed,aalii,0.011249
1,aahed,aargh,0.011476


In [19]:
df_weigth_penalty_answers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56780496 entries, 0 to 56780495
Data columns (total 3 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   word_1                   object 
 1   word_2                   object 
 2   strength_weigth_penalty  float64
dtypes: float64(1), object(2)
memory usage: 1.3+ GB


Now we can create the rank column:

In [20]:
df_weigth_penalty_answers['rank'] = df_weigth_penalty_answers['strength_weigth_penalty'].rank(ascending=False, method="dense").astype(int)
df_weigth_penalty_answers.head(2)

Unnamed: 0,word_1,word_2,strength_weigth_penalty,rank
0,aahed,aalii,0.011249,1844445
1,aahed,aargh,0.011476,1831096


In [21]:
df_weigth_penalty_answers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56780496 entries, 0 to 56780495
Data columns (total 4 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   word_1                   object 
 1   word_2                   object 
 2   strength_weigth_penalty  float64
 3   rank                     int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 1.7+ GB


Hence, the **TOP 10 pairs of words** are:

In [22]:
df_weigth_penalty_answers[df_weigth_penalty_answers['rank'] <= 10].sort_values(by='rank')

Unnamed: 0,word_1,word_2,strength_weigth_penalty,rank
16498569,clint,soare,0.079181,1
50314580,prate,soily,0.07918,2
11639139,brane,soily,0.079173,3
29187084,frate,soily,0.078402,4
54564875,soily,trape,0.078192,5
18884826,crout,saine,0.077981,6
18711072,crine,slaty,0.077861,7
54565717,soily,wrate,0.077839,8
54565181,soily,urate,0.07781,9
18519012,crame,soily,0.07778,10


And the **10 WORST pairs of words** are:

In [24]:
df_weigth_penalty_answers[df_weigth_penalty_answers['rank'] >= (df_weigth_penalty_answers['rank'].max() - 9)].sort_values(by='rank')

Unnamed: 0,word_1,word_2,strength_weigth_penalty,rank
45680434,mumus,umphs,0.001067,2443304
50740000,pumps,umphs,0.001061,2443305
56356842,undug,ungum,0.001059,2443306
40165770,kudzu,kukus,0.001037,2443307
40165786,kudzu,kuzus,0.001029,2443308
45666295,mumms,umphs,0.001005,2443309
36698547,immix,imshi,0.000997,2443310
45671009,mumps,umphs,0.000993,2443311
36052409,huhus,umphs,0.000982,2443312
36129599,humps,umphs,0.000967,2443313
