## POS Tagging and Analysis

This notebook uses the Natural Language Toolkit (NLTK, https://www.nltk.org/) to assign part-of-speech (POS) tags to the Wordle words.



### Prepare Environment

Load the necessary packages and set the `datapath` location.

In [None]:
# !pip install nltk

In [1]:
import numpy as np
import pandas as pd
import nltk

Download the averaged perceptron tagger model:

In [2]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

To read the data files from Google Drive, set `GOOGLE_DRIVE = True` and adjust the `datapath` reference below. Otherwise, the local `data/` directory is used.

In [3]:
GOOGLE_DRIVE = True

In [4]:
if GOOGLE_DRIVE:
  from google.colab import drive
  drive.mount('/content/drive/')
  datapath = '/content/drive/My Drive/Colab Notebooks/UIUC/CS_410/Wordle/data/'
else:
  datapath = 'data/'

Mounted at /content/drive/


### Prepare Words

In [5]:
with open(datapath + "past_words.txt", "r") as fin:
  past_words = fin.read().splitlines()

len(past_words)

509

In [6]:
with open(datapath + "allowed_words.txt", "r") as fin:
  all_words = fin.read().splitlines()

len(all_words)

12972
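The two word lists above are read exactly as stored. A minimal sketch of a slightly more defensive loader follows, assuming one word per line. The helper name `load_words` and the five-letter filter are illustrative assumptions, not part of the original notebook.

```python
from pathlib import Path


def load_words(path, length=5):
    """Read one word per line, normalise case, and drop blank or wrong-length entries.

    `load_words` and the `length` filter are hypothetical additions for illustration.
    """
    words = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        word = line.strip().lower()
        if len(word) == length:
            words.append(word)
    return words
```

In the notebook this could be called as `past_words = load_words(datapath + "past_words.txt")`.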

### POS Tagging

POS tag past Wordle solution words:

In [7]:
past_pos = nltk.pos_tag(past_words)

In [9]:
past_pos[:20]

[('cigar', 'NN'),
 ('rebut', 'NN'),
 ('sissy', 'JJ'),
 ('humph', 'NN'),
 ('awake', 'VBP'),
 ('blush', 'JJ'),
 ('focal', 'JJ'),
 ('evade', 'NN'),
 ('naval', 'JJ'),
 ('serve', 'VBP'),
 ('heath', 'NN'),
 ('dwarf', 'NN'),
 ('model', 'NN'),
 ('karma', 'NN'),
 ('stink', 'VBP'),
 ('grade', 'VBN'),
 ('quiet', 'JJ'),
 ('bench', 'NN'),
 ('abate', 'VBP'),
 ('feign', 'NN')]
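`nltk.pos_tag` takes a list of tokens and tags them as one sequence, so the tag assigned to a word can depend on its neighbours in the list. This may contribute to tags such as `('rebut', 'NN')` and `('evade', 'NN')` above. One possible variation, sketched below, tags each word as its own one-token sequence. The helper `tag_in_isolation` is hypothetical, and the tagger is passed in as an argument so the function can be tried without NLTK installed. Whether this yields better tags for Wordle words is not established here.

```python
def tag_in_isolation(words, tagger):
    """Tag every word as a separate one-token sequence.

    `tagger` is any callable with the shape of `nltk.pos_tag`:
    it takes a list of tokens and returns a list of (token, tag) pairs.
    """
    return [tagger([word])[0] for word in words]


# Usage in the notebook (assumes the tagger model has been downloaded):
# past_pos_isolated = tag_in_isolation(past_words, nltk.pos_tag)
```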

POS tag all allowable Wordle gameplay words:

In [8]:
all_pos = nltk.pos_tag(all_words)
all_df = pd.DataFrame(all_pos, columns = ['word', 'tag'])

In [12]:
all_df.head(n=10)

|   | word  | tag |
|---|-------|-----|
| 0 | aahed | VBN |
| 1 | aalii | JJ  |
| 2 | aargh | NN  |
| 3 | aarti | NN  |
| 4 | abaca | NN  |
| 5 | abaci | VBP |
| 6 | aback | NN  |
| 7 | abacs | NN  |
| 8 | abaft | NN  |
| 9 | abaka | NN  |


### POS Analysis

Get count by POS tag for past Wordle solution words:

In [13]:
past_np = np.array(past_pos)
past_tags, past_counts = np.unique(past_np[:, 1], return_counts=True)

Get count by POS tag for all Wordle gameplay words:

In [14]:
all_np = np.array(all_pos)
all_tags, all_counts = np.unique(all_np[:, 1], return_counts=True)
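The same per-tag counts can also be obtained without converting the tagged pairs to a NumPy array. The sketch below uses `collections.Counter` on a list of `(word, tag)` pairs in the format returned by `nltk.pos_tag`. The sample data is the first five pairs from the past-word output shown earlier, and the names `sample_pos` and `tag_counts` are illustrative.

```python
from collections import Counter

# First five (word, tag) pairs from the past-word output shown earlier.
sample_pos = [("cigar", "NN"), ("rebut", "NN"), ("sissy", "JJ"),
              ("humph", "NN"), ("awake", "VBP")]

# Count how often each tag occurs, ignoring the word itself.
tag_counts = Counter(tag for _, tag in sample_pos)
print(tag_counts.most_common())
```

A `Counter` returns 0 for tags it has never seen, which can be convenient when comparing two lists with different tag sets.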

Compare frequency distribution of POS tags between all words and past words:

In [19]:
past_freq_df = pd.DataFrame({'tag':past_tags, 'past_count':past_counts})
all_freq_df = pd.DataFrame({'tag':all_tags, 'all_count':all_counts})

pos_df = past_freq_df.merge(all_freq_df, how='outer', on='tag')
pos_df.fillna(0, inplace=True)

pos_df['past_p'] = pos_df['past_count']/pos_df['past_count'].sum()
pos_df['all_p'] = pos_df['all_count']/pos_df['all_count'].sum()

In [20]:
pos_df

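|   | tag | past_count | all_count | past_p   | all_p    |
|---|-----|------------|-----------|----------|----------|
| 0 | CC  | 1.0        | 11        | 0.001965 | 0.000848 |
| 1 | DT  | 2.0        | 7         | 0.003929 | 0.00054  |
| 2 | IN  | 5.0        | 78        | 0.009823 | 0.006013 |
| 3 | JJ  | 111.0      | 2645      | 0.218075 | 0.203901 |
| 4 | JJR | 3.0        | 59        | 0.005894 | 0.004548 |
| 5 | JJS | 1.0        | 15        | 0.001965 | 0.001156 |
| 6 | MD  | 3.0        | 8         | 0.005894 | 0.000617 |
| 7 | NN  | 315.0      | 4711      | 0.618861 | 0.363167 |
| 8 | NNP | 1.0        | 44        | 0.001965 | 0.003392 |
| 9 | NNS | 4.0        | 2375      | 0.007859 | 0.183087 |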


Comparing the two frequency distributions, 61.9% of past Wordle solution words are tagged ‘NN’ (a singular or mass noun in the Penn Treebank tag set), whereas only 36.3% of allowable Wordle gameplay words carry that tag. Conversely, plural nouns (‘NNS’) make up 18.3% of allowable words but only 0.8% of past solutions.
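One way to quantify this comparison is the ratio of `past_p` to `all_p` for each tag. The sketch below computes that ratio for three tags, with values copied from the table above. The names `shares` and `lift` are illustrative labels, not part of the original analysis.

```python
# (past_p, all_p) copied from the table above for three tags.
shares = {
    "NN": (0.618861, 0.363167),
    "JJ": (0.218075, 0.203901),
    "NNS": (0.007859, 0.183087),
}

# Ratio > 1: the tag is over-represented among past solutions; < 1: under-represented.
lift = {tag: past_p / all_p for tag, (past_p, all_p) in shares.items()}

for tag, ratio in sorted(lift.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tag}: {ratio:.2f}")
```

On these figures ‘NN’ is about 1.7 times as common among past solutions as among allowable words, ‘JJ’ is roughly even, and ‘NNS’ is far less common.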

### Process Results

Use the frequency distribution of POS tags among past Wordle solution words to assign each allowable Wordle gameplay word a probability based on its tag. This lets the Wordle strategy tool prefer words whose POS tags are more likely to belong to a solution word.

In [21]:
all_df['p'] = all_df['tag'].map(pos_df.set_index('tag')['past_p'])

In [23]:
all_df.head(n=20)

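|   | word  | tag | p        |
|---|-------|-----|----------|
| 0 | aahed | VBN | 0.013752 |
| 1 | aalii | JJ  | 0.218075 |
| 2 | aargh | NN  | 0.618861 |
| 3 | aarti | NN  | 0.618861 |
| 4 | abaca | NN  | 0.618861 |
| 5 | abaci | VBP | 0.031434 |
| 6 | aback | NN  | 0.618861 |
| 7 | abacs | NN  | 0.618861 |
| 8 | abaft | NN  | 0.618861 |
| 9 | abaka | NN  | 0.618861 |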
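Because `pos_df` fills missing counts with 0, any tag that occurs in the allowable list but never among past solutions receives `p = 0`. If that is too strict for the strategy tool, additive (Laplace) smoothing is one option. The sketch below is a possible variation, not the method used in this notebook. The function name `smoothed_tag_probs` and the pseudo-count `alpha` are assumptions.

```python
def smoothed_tag_probs(past_counts, all_tags, alpha=1.0):
    """Additive (Laplace) smoothing over the tag vocabulary.

    past_counts: dict mapping tag -> count among past solution words
    all_tags:    every tag seen in the allowable word list
    alpha:       pseudo-count added to each tag (an assumed value, not tuned)
    """
    tags = sorted(set(all_tags) | set(past_counts))
    total = sum(past_counts.values()) + alpha * len(tags)
    return {tag: (past_counts.get(tag, 0) + alpha) / total for tag in tags}


# Toy counts: "FW" stands in for a tag that never occurs among past solutions.
probs = smoothed_tag_probs({"NN": 315, "JJ": 111}, ["NN", "JJ", "FW"])
```

The resulting dictionary could then be applied with `all_df['tag'].map(probs)`, since `Series.map` accepts a dict.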


In [None]:
all_df.to_csv(datapath + 'all_pos.csv', index=False)