<a href="https://colab.research.google.com/github/brettevenhouse/NEW-IC-TEST/blob/main/Wordle.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#IDENTIFY

**Background:** Wordle is a popular, simple word game that was recently purchased by The New York Times for more than a million US dollars.

**Objectives:**

1. Which five-letter words should be used as the starting guess?
1. Which pairs of five-letter words should be used as the first two guesses?

#COLLECT

**Objectives:**

  1. Import modules / packages 
  1. Import data (official answers and guesses from the Wordle) 

## Import Modules

In [None]:
!pip install pandas_bokeh
import os
import pandas as pd
import pandas_bokeh
import re
pandas_bokeh.output_notebook()
pd.set_option('plotting.backend', 'pandas_bokeh')

# notes: see creatingwordlist.txt for instructions on how to create the Wordle list CSVs

def printvar(variable):
  print('==========PRINTVAR============')
  print('Type: ' + str(type(variable)))
  #print('Name: ' + str(variable.name))
  print('Length: ' + str(len(variable)))
  print('Index: ' + str(variable.index))
  print('Contents: \n' + str(variable))
  print('==========PRINTVAR============')

Collecting pandas_bokeh
  Downloading pandas_bokeh-0.5.5-py2.py3-none-any.whl (29 kB)
Installing collected packages: pandas-bokeh
Successfully installed pandas-bokeh-0.5.5


## Load Data

In [None]:
!wget -nc 'https://raw.githubusercontent.com/uscprofessor/itp487_enterprise_data_analytics/main/data/englishwordlist.txt'
!wget -nc 'https://github.com/uscprofessor/itp487_enterprise_data_analytics/raw/main/data/wordle_answers.csv'
!wget -nc 'https://github.com/uscprofessor/itp487_enterprise_data_analytics/raw/main/data/wordle_guesses.csv'
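datafile = 'englishwordlist.txt'
# NOTE: read_csv treats the first line of each file as a header by default,
# so the first word of each list becomes the column name instead of a row;
# pass header=None to keep it as data
answers = pd.read_csv('wordle_answers.csv')
guesses = pd.read_csv('wordle_guesses.csv')
dictionary = pd.read_csv(datafile)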

File ‘englishwordlist.txt’ already there; not retrieving.

File ‘wordle_answers.csv’ already there; not retrieving.

File ‘wordle_guesses.csv’ already there; not retrieving.



#CLEAN

**Objectives:**
  1. View Data Sample
  1. Rename Columns
  1. Remove extra columns / non-alpha words / NaNs
  1. Normalize data set 
  1. Create distribution of letters 

In [None]:
# view data 
print(answers)
print(guesses)
print(dictionary)

      cigar
0     rebut
1     sissy
2     humph
3     awake
4     blush
...     ...
2303  judge
2304  rower
2305  artsy
2306  rural
2307  shave

[2308 rows x 1 columns]
       aahed
0      aalii
1      aargh
2      aarti
3      abaca
4      abaci
...      ...
10632  zuzim
10633  zygal
10634  zygon
10635  zymes
10636  zymic

[10637 rows x 1 columns]
                  a
0                aa
1               aaa
2            aachen
3          aardvark
4         aardvarks
...             ...
194427  zymotically
194428      zymurgy
194429       zyrian
194430      zyrians
194431       zythum

[194432 rows x 1 columns]
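
In the printed frames above, the first word of each file (`cigar`, `aahed`, `a`) shows up as the column label rather than as row 0. That is the default behaviour of `pd.read_csv`, which treats the first line as a header. A minimal sketch of the difference, using a made-up three-word sample in place of the real files:

```python
import io
import pandas as pd

# hypothetical stand-in for a word list file that has no header line
sample = "cigar\nrebut\nsissy\n"

# default behaviour: the first line is consumed as the column name
with_header = pd.read_csv(io.StringIO(sample))

# header=None keeps every line as data; names= labels the column
without_header = pd.read_csv(io.StringIO(sample), header=None, names=['word'])

print(list(with_header.columns), len(with_header))
print(list(without_header.columns), len(without_header))
```

With `header=None` the renaming step becomes unnecessary and the first word stays in the data.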


In [None]:
# name the columns
answers.columns = ['word']
guesses.columns = ['word']
dictionary.columns = ['word']


In [None]:
# REMOVE EXTRA COLUMNS
answers = answers[['word']]
guesses = guesses[['word']]
dictionary = dictionary[['word']]

# DROP NAs
# reassign instead of using inplace=True so each step returns a new frame
answers = answers.dropna()
guesses = guesses.dropna()
dictionary = dictionary.dropna()

# NORMALIZE DATA SET
# remove words with non-alpha characters
answers = answers[answers.word.str.isalpha()]
guesses = guesses[guesses.word.str.isalpha()]
dictionary = dictionary[dictionary.word.str.isalpha()]

# make all letters uppercase
answers = answers.apply(lambda x: x.astype(str).str.upper())
guesses = guesses.apply(lambda x: x.astype(str).str.upper())
dictionary = dictionary.apply(lambda x: x.astype(str).str.upper())
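
The cleaning steps above (drop missing values, keep alphabetic words only, convert to upper case) can be checked on a tiny frame before running them on the full lists. This is only a sketch: the `raw` frame and its contents are invented for illustration.

```python
import pandas as pd

# hypothetical raw input: one missing value, one word with an apostrophe,
# one word with digits, and mixed case
raw = pd.DataFrame({'word': ['cigar', None, "don't", 'Rebut', 'abc12']})

clean = raw.dropna(subset=['word'])                 # drop missing entries
clean = clean[clean['word'].str.isalpha()]          # keep purely alphabetic words
clean = clean.assign(word=clean['word'].str.upper())  # normalize to upper case

cleaned_words = clean['word'].tolist()
print(cleaned_words)
```

Dropping the missing values first matters: `str.isalpha()` returns NaN for a missing entry, and a mask containing NaN cannot be used for boolean indexing.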

In [None]:
# Select Which Word List we will Analyze
wordlist = pd.concat([answers, guesses])
# use only the five-letter words from the word list
fiveletterwords = wordlist[wordlist.word.str.len() == 5]
# sort defaults to ascending order
fiveletterwords = fiveletterwords.sort_values(by=['word'])
# reset indices (the previous index is kept as an 'index' column)
fiveletterwords = fiveletterwords.reset_index()

In [None]:
# count the number of words that contain each letter of the alphabet
# note: all words were already converted to upper case
upperalpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

# DataFrame.append was removed in pandas 2.0, so collect the rows in a list
# and build the frame once; lettercount stays an int so pandas_bokeh can plot it
rows = []
for letter in upperalpha:
  wordscontainingletter = fiveletterwords[fiveletterwords.word.str.contains(letter)]
  rows.append({'letter': letter, 'lettercount': int(wordscontainingletter.shape[0])})
countwords = pd.DataFrame(rows, columns=['letter', 'lettercount'])

countwords = countwords.sort_values(by=['lettercount'], ascending=False)
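
The count above is the number of words that contain a letter at least once, not the number of times the letter occurs. A short sketch of the same idea with `collections.Counter`, assuming a made-up three-word list; taking `set(word)` ensures a repeated letter is counted once per word:

```python
from collections import Counter

# hypothetical sample; SISSY contains S three times but should count once
words = ['AROSE', 'SISSY', 'UNTIL']

# one increment per (word, distinct letter) pair
counts = Counter(letter for word in words for letter in set(word))

print(counts['S'], counts['I'], counts['A'])
```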

#MODEL

**Objectives:**

1. Visualize distribution 
1. Find words that include the top 5 letters
1. Use top 10 letters to find best guesses for first 2 attempts

In [None]:
# visualize the letter distribution as a bar chart
pandas_bokeh.output_notebook()
countwords.plot_bokeh(
    kind='bar',
    x='letter',
    y='lettercount',
    ylabel='Number of Words',
    xlabel='Letter',
    title='Number of Words by Letter'
)

In [None]:
# get the top five letters
# could also use countwords.head(5)
topfiveletters = countwords.iloc[:5]

# make a string with the top five letters
topfiveletterstring = ''.join(topfiveletters['letter'])

print('Topfiveletterstring: ' + topfiveletterstring)
# keep the words that contain every one of the top five letters
char = set(topfiveletterstring)
temp = fiveletterwords.word
topfiveletterwords = temp[temp.apply(lambda x: char.issubset(x))]
printvar(topfiveletterwords)

Topfiveletterstring: SEARO
Type: <class 'pandas.core.series.Series'>
Length: 3
Index: Int64Index([139, 547, 10342], dtype='int64')
Contents: 
139      AEROS
547      AROSE
10342    SOARE
Name: word, dtype: object
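
The filter relies on `set.issubset`, which accepts any iterable, so a word can be passed directly as a string. A minimal sketch with an invented four-word list shows which words pass:

```python
# the five most common letters found above
top_five = set('SEARO')

# hypothetical candidates: two use exactly the top five letters, two do not
words = ['AROSE', 'ARSON', 'SOARE', 'STARE']

# a word passes only if every top-five letter appears in it
matches = [word for word in words if top_five.issubset(word)]
print(matches)
```

Because each word has five characters and there are five required letters, a matching word is necessarily an anagram of the top-five string.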


In [None]:
# get top ten letters
# could also use countwords.head(10)
toptenletters = countwords.iloc[:10]
print(toptenletters)

# make a string with the top ten letters
toptenletterstring = ''.join(toptenletters['letter'])
print(toptenletterstring)

# get the other letters
# could also use countwords.tail(16)
nontoptenletters = countwords.iloc[-16:]
print(nontoptenletters)

# make a string with the remaining sixteen letters
nontoptenletterstring = ''.join(nontoptenletters['letter'])
print(nontoptenletterstring)

   letter  lettercount
18      S         5924
4       E         5695
0       A         5322
17      R         3904
14      O         3904
8       I         3581
11      L         3109
19      T         3030
13      N         2783
20      U         2433
SEAROILTNU
   letter  lettercount
3       D         2293
24      Y         2024
2       C         1912
15      P         1880
12      M         1867
7       H         1701
6       G         1538
1       B         1516
10      K         1435
22      W         1026
5       F          987
21      V          673
25      Z          391
9       J          289
23      X          287
16      Q          111
DYCPMHGBKWFVZJXQ


In [None]:
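# get words made up of only the top ten letters:
# flag the words that match a non-top-ten letter, then invert the boolean mask
# NOTE: str.match anchors at the start of each word, so this mask only tests the
# first letter (AARGH and ABACA still pass); str.contains(pattern) would test
# every letter. The final pairs are unaffected, because the later filter on the
# concatenated pairs requires all ten top letters within ten characters.
pattern = '['+nontoptenletterstring+']'
nottoptenwordsboolean = fiveletterwords.word.str.match(pat=pattern)
toptenwordsboolean = ~nottoptenwordsboolean
toptenwordsonly = fiveletterwords[toptenwordsboolean]
toptenwordsonly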

Unnamed: 0,index,word
0,0,AALII
1,1,AARGH
2,2,AARTI
3,3,ABACA
4,4,ABACI
...,...,...
11985,1089,UTILE
11986,1511,UTTER
11987,9813,UVEAL
11988,9814,UVEAS


In [None]:
cartesianproduct = toptenwordsonly.merge(toptenwordsonly, how='cross')
print(cartesianproduct)

          index_x word_x  index_y word_y
0               0  AALII        0  AALII
1               0  AALII        1  AARGH
2               0  AALII        2  AARTI
3               0  AALII        3  ABACA
4               0  AALII        4  ABACI
...           ...    ...      ...    ...
30880244     9815  UVULA     1089  UTILE
30880245     9815  UVULA     1511  UTTER
30880246     9815  UVULA     9813  UVEAL
30880247     9815  UVULA     9814  UVEAS
30880248     9815  UVULA     9815  UVULA

[30880249 rows x 4 columns]
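
The cross join pairs every word with every word, in both orders and including itself, so 5,557 words produce 5,557² = 30,880,249 rows. When the order of the two guesses does not matter, `itertools.combinations` visits each unordered pair once. A small sketch under that assumption, with a hypothetical four-word list:

```python
from itertools import combinations

# hypothetical candidates built from the top ten letters
words = ['AROSE', 'UNTIL', 'UNLIT', 'SOARE']
top_ten = set('SEAROILTNU')

# n * (n - 1) / 2 unordered pairs instead of n * n ordered ones
unordered_pairs = list(combinations(words, 2))

# keep the pairs whose ten characters cover all ten letters
covering_pairs = [(a, b) for a, b in unordered_pairs if top_ten.issubset(a + b)]

print(len(unordered_pairs))
print(covering_pairs)
```

Each covering pair found this way corresponds to two rows of the cross join, one for each order.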


In [None]:
word_xy = cartesianproduct['word_x'] + cartesianproduct['word_y']

In [None]:
# create a set of the top ten letters
chars = set(toptenletterstring)

In [None]:
# keep the pairs whose ten characters cover every one of the top ten letters
char = set(toptenletterstring)
besttwowords = word_xy[word_xy.apply(lambda x: char.issubset(x))]
printvar(besttwowords)

Type: <class 'pandas.core.series.Series'>
Length: 600
Index: Int64Index([  777884,   777906,  1269047,  1284562,  1285270,  1285567,
             1586638,  1586641,  1588066,  1588191,
            ...
            30779421, 30815093, 30818752, 30848463, 30849965, 30850668,
            30850847, 30853025, 30855254, 30856423],
           dtype='int64', length=600)
Contents: 
777884      AEROSUNLIT
777906      AEROSUNTIL
1269047     AIRTSNOULE
1284562     AITUSENROL
1285270     AITUSLONER
               ...    
30850668    UTERISLOAN
30850847    UTERISOLAN
30853025    UTILEARSON
30855254    UTILEROANS
30856423    UTILESONAR
Length: 600, dtype: object


In [None]:
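# for each top-five-letter opener, list the pairs that start with it
# (Series.iteritems was removed in pandas 2.0; items() is the replacement)
for index, word in topfiveletterwords.items():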
  temp_xy = besttwowords[besttwowords.apply(lambda x: x.startswith(word))]
  print(word)
  print(temp_xy)

temp_xy

AEROS
777884    AEROSUNLIT
777906    AEROSUNTIL
dtype: object
AROSE
3045140    AROSEUNLIT
3045162    AROSEUNTIL
dtype: object
SOARE
21727774    SOAREUNLIT
21727796    SOAREUNTIL
dtype: object


21727774    SOAREUNLIT
21727796    SOAREUNTIL
dtype: object

#MODEL 2
## Komal's Version

**Objectives:**
  
  1. Use 'answers' or 'guesses' to find most common prefix and suffix
  1. Find all potential 5-letter words 
  1. Filter out repeats and common letters 

*Note: uses the data loaded and cleaned in the earlier sections.*

In [None]:
# KP -- Alternate Approach 
# Find best Prefix and Suffix 
# note: you will get different answers if you use answers vs guesses 
# I think using answers gives a more practical suffix and prefix 

prefix = {}
suffix = {}
# count the three-letter prefix and suffix of every word in answers
for word in answers['word']:
  pre = word[:3]
  suf = word[-3:]
  # dict.get returns 0 the first time a prefix or suffix is seen
  prefix[pre] = prefix.get(pre, 0) + 1
  suffix[suf] = suffix.get(suf, 0) + 1
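
The same tally can be written with pandas string slicing and `value_counts`, which returns the counts already sorted in descending order. This is a sketch of an alternative, not the notebook's method; the five-word sample is invented.

```python
import pandas as pd

# hypothetical sample of upper-case five-letter answers
words = pd.Series(['STARE', 'STALE', 'STING', 'BRING', 'FLING'])

# slice the first and last three letters of every word, then count them
prefix_counts = words.str[:3].value_counts()
suffix_counts = words.str[-3:].value_counts()

# the most frequent value is the first index entry
best_pre = prefix_counts.index[0]
best_suf = suffix_counts.index[0]

print(best_pre, best_suf)
```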


In [None]:
# sort prefix values 
prefix = pd.DataFrame(prefix, index=[0])
prefix = prefix.transpose()
prefix = prefix.sort_values(by=0, ascending=False)
best_pre = prefix.index[0]

# sort suffix values
suffix = pd.DataFrame(suffix, index=[0])
suffix = suffix.transpose()
suffix = suffix.sort_values(by=0, ascending=False)
best_suf = suffix.index[0]

In [None]:
# loop through dictionary to find best guess
# each word has to either have prefix or suffix and length of 5  
first_guess = []
second_guess = []
for i in range(len(dictionary)): 
  word = str(dictionary.iloc[i,0])
  if word[0:3] == best_pre and len(word) == 5:
    first_guess.append(word)
  if word[len(word)-3:len(word)] == best_suf and len(word) == 5:
    second_guess.append(word)

In [None]:
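# first guess can't contain any letter of the best suffix (I, N, G): remove those words
# iterating backwards means a removal never shifts the items still to be visited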
for i in range(len(first_guess)-1, -1, -1): 
  word = first_guess[i]
  if 'I' in word: 
    first_guess.remove(word)
  elif 'N' in word: 
    first_guess.remove(word)
  elif 'G' in word:
    first_guess.remove(word)
  else: 
    continue
  
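# second guess can't contain any letter of the best prefix (S, T, A): remove those words
# NOTE: this loop removes from second_guess while iterating over it, so the word
# right after each removed word is skipped and can survive the filter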
for word in second_guess: 
  if 'S' in word: 
    second_guess.remove(word)
  elif 'T' in word: 
    second_guess.remove(word)
  elif 'A' in word:
    second_guess.remove(word)
  else: 
    continue
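
Several filters in this section call `list.remove` inside a `for` loop over the same list. Python's list iterator advances by position, so after a removal the next element slides into the current slot and is never examined. A minimal sketch with a made-up four-word list (the words and the `STA` letter set are illustrative), next to a list comprehension that builds a new list instead:

```python
words = ['STING', 'TOING', 'LUING', 'SWING']
banned = set('STA')

# removing while iterating: TOING is skipped after STING is removed
buggy = list(words)
for word in buggy:
    if set(word) & banned:
        buggy.remove(word)

# building a new list examines every word exactly once
safe = [word for word in words if not set(word) & banned]

print(buggy)
print(safe)
```

The comprehension form also avoids hard-coding each letter in its own `elif` branch, since the banned letters can be taken from a string such as `best_pre`.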



In [None]:
# more filtering: words can't include any letter outside the top ten of the distribution
for letter in nontoptenletterstring:
  for f_word in first_guess: 
    if letter in f_word: 
      first_guess.remove(f_word)

print(first_guess, second_guess)

for letter in nontoptenletterstring:
  for s_word in second_guess: 
    if letter in s_word:
        second_guess.remove(s_word)

  

['STALE', 'STARE', 'STARS', 'STATE'] ['BRING', 'LUING', 'POING', 'STING', 'TOING']


In [None]:
# remove words with repeating letters 
def repeating_letters(guess):
  repeat = False
  for i in range(len(guess)-1):
    for j in range(len(guess)-1,i,-1):
      if guess[i] == guess[j]:
        repeat = True
  return repeat 

for word in first_guess: 
  if repeating_letters(word): 
    first_guess.remove(word)

for word in second_guess: 
  if repeating_letters(word): 
    second_guess.remove(word)
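
A repeated letter can also be detected by comparing the number of distinct letters with the word length. The helper below is a hypothetical alternative to `repeating_letters`, shown here on the four first-guess candidates printed earlier:

```python
def has_repeat(word):
    # a set keeps one copy of each letter, so it is shorter than the
    # word exactly when some letter occurs more than once
    return len(set(word)) < len(word)

candidates = ['STALE', 'STARE', 'STARS', 'STATE']

# keep only the words whose five letters are all different
no_repeats = [word for word in candidates if not has_repeat(word)]
print(no_repeats)
```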

In [None]:
print('First Guesses: ', first_guess)
print('Second Guesses:', second_guess)

print('Best Pairs (first guess, second guess):\n - (STALE, TOING) \n - (STARE, LUING)')


First Guesses:  ['STALE', 'STARE']
Second Guesses: ['LUING', 'TOING']
Best Pairs (first guess, second guess):
 - (STALE, TOING) 
 - (STARE, LUING)


#ANALYZE

#PUBLISH