## TF-IDF Scoring and Analysis

This notebook uses the Google Books Ngram Viewer API (https://books.google.com/ngrams/) to obtain a TF-IDF score for each of the 12,972 five-letter words allowed in Wordle gameplay.

The input is the word list in `allowed_words.txt`. The output is the CSV file `all_tf.csv`, which contains two columns, `word` and `p`, where `p` is the normalized TF-IDF score: the average likelihood of seeing that word in the Google Ngram English-language corpus over the period from 2010 to 2020.


### Prepare Environment

Load the necessary packages and set the `datapath` location.

In [55]:
import numpy as np
import pandas as pd
import requests
import os

Set `GOOGLE_DRIVE = True` and adjust the `datapath` reference below to read the data files from Google Drive. Set it to `False` to read them from the local `data/` directory instead.

In [56]:
GOOGLE_DRIVE = True

In [57]:
if GOOGLE_DRIVE:
  from google.colab import drive
  drive.mount('/content/drive/')
  datapath = '/content/drive/My Drive/Colab Notebooks/UIUC/CS_410/Wordle/data/'
else:
  datapath = 'data/'

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


### Example

We demonstrate how to use the Google Ngram Viewer API on a test sample of two words: `abyss` and `crate`.

In [58]:
words = ['abyss', 'crate']
query = ','.join(words)
year_start = 2010
year_end = 2020

In [59]:
url = 'https://books.google.com/ngrams/json?content={0}&year_start={1}&year_end={2}&corpus=26&smoothing=0'.format(query, year_start, year_end)
r = requests.get(url)
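The query string can also be assembled from a dictionary rather than by string formatting, which keeps each parameter visible and percent-encodes the comma-separated word list. The following is a minimal sketch using only the standard library. The helper name `build_ngram_url` is hypothetical, the parameter names and values (`corpus=26`, `smoothing=0`) are copied from the URL above, and it assumes the endpoint decodes standard percent-encoding (`%2C` for the comma).

```python
from urllib.parse import urlencode

NGRAM_JSON_URL = 'https://books.google.com/ngrams/json'

def build_ngram_url(words, year_start, year_end):
  """Build the Ngram Viewer JSON URL for a list of words (hypothetical helper)."""
  params = {
    'content': ','.join(words),  # comma-separated word list
    'year_start': year_start,
    'year_end': year_end,
    'corpus': 26,                # value copied from the notebook's URL
    'smoothing': 0,              # value copied from the notebook's URL
  }
  return NGRAM_JSON_URL + '?' + urlencode(params)

example_url = build_ngram_url(['abyss', 'crate'], 2010, 2020)
print(example_url)
```

The resulting string can be passed to `requests.get` in the same way as the formatted `url` above.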

Inspect result:

In [60]:
r.json()

[{'ngram': 'abyss',
  'parent': '',
  'type': 'NGRAM',
  'timeseries': [2.2948961486690678e-06,
   2.331440100533655e-06,
   2.230736527053523e-06,
   2.389529527135892e-06,
   2.6553443603916094e-06,
   2.6752152280096198e-06,
   2.7062797016697004e-06,
   3.1142387797444826e-06,
   3.449187715887092e-06,
   3.261609663240961e-06]},
 {'ngram': 'crate',
  'parent': '',
  'type': 'NGRAM',
  'timeseries': [1.9144608813803643e-06,
   2.3333661829383345e-06,
   2.157176368200453e-06,
   1.8032590105576674e-06,
   2.24838709073083e-06,
   2.2294332211458823e-06,
   2.0214131382090272e-06,
   1.9424621768848738e-06,
   1.84960515525745e-06,
   1.9351048194948817e-06]}]

In [61]:
query_tf_df = pd.DataFrame.from_dict(r.json())
query_tf_df['p'] = query_tf_df.apply(lambda row: np.mean(row['timeseries']), axis=1)

In [62]:
query_tf_df.head(n=2)

|   | ngram | parent | type | timeseries | p |
|---|-------|--------|------|------------|---|
| 0 | abyss |  | NGRAM | [2.2948961486690678e-06, 2.331440100533655e-06... | 3e-06 |
| 1 | crate |  | NGRAM | [1.9144608813803643e-06, 2.3333661829383345e-0... | 2e-06 |
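To make the definition of `p` concrete, the sketch below recomputes it for `abyss` by hand: `p` is the arithmetic mean of the yearly values in `timeseries`. The numbers are copied from the API response shown earlier, and the variable names are illustrative only.

```python
from statistics import fmean

# yearly relative frequencies for 'abyss', copied from the response above
abyss_timeseries = [
  2.2948961486690678e-06, 2.331440100533655e-06, 2.230736527053523e-06,
  2.389529527135892e-06, 2.6553443603916094e-06, 2.6752152280096198e-06,
  2.7062797016697004e-06, 3.1142387797444826e-06, 3.449187715887092e-06,
  3.261609663240961e-06,
]

# p is the mean of the yearly values
p_abyss = fmean(abyss_timeseries)
print('{0:.3e}'.format(p_abyss))
```

The result is approximately 2.71e-06, which the DataFrame display rounds to `3e-06`.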


### Prepare Words

In [63]:
with open(datapath + "allowed_words.txt", "r") as fin:
  all_words = fin.read().splitlines()

In [64]:
n_words = len(all_words)
n_words

12972

The Google Ngram Viewer API does not accept the full word list in a single call, so we split the word list into batches of 500 words.

In [65]:
batch_size = 500
n_batches = (n_words + batch_size - 1)//batch_size  # ceiling division: no empty final batch when n_words is a multiple of batch_size
n_batches

26

In [66]:
queries = []

for i in range(n_batches):
  queries.append(','.join(all_words[i*batch_size:(i + 1)*batch_size]))

In [67]:
queries[0][:89]

'aahed,aalii,aargh,aarti,abaca,abaci,aback,abacs,abaft,abaka,abamp,aband,abase,abash,abask'

In [68]:
queries[-1][-89:]

'zoris,zorro,zouks,zowee,zowie,zulus,zupan,zupas,zuppa,zurfs,zuzim,zygal,zygon,zymes,zymic'
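The batching steps above can also be written as a small generator that never produces an empty batch, regardless of whether the list length is a multiple of the batch size. This is a sketch only. The name `iter_batches` is hypothetical, and the toy list is the first five words of `queries[0]` with a batch size of 2 so the result is easy to inspect.

```python
def iter_batches(items, batch_size):
  """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
  for start in range(0, len(items), batch_size):
    yield items[start:start + batch_size]

# toy input: the first five words of the allowed list
toy_words = ['aahed', 'aalii', 'aargh', 'aarti', 'abaca']

# one comma-separated query string per batch
toy_queries = [','.join(batch) for batch in iter_batches(toy_words, 2)]
print(toy_queries)
```

With the real data, the equivalent call would be `iter_batches(all_words, batch_size)`.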

### Run Queries

The Google Books Ngram Viewer API calls take a significant amount of time to run (~1 hour on Colab).

- Set `REFRESH_TF = True` to generate a new job and run the queries.
- Set `REFRESH_TF = False` to reuse the previous TF-IDF results from `all_tf.csv` when that file exists. If it does not exist, the queries are run anyway.

In [77]:
REFRESH_TF = False

In [78]:
def getNgramTf(queries, year_start, year_end):
  """Query the Ngram Viewer once per batch and return a DataFrame with columns `word` and `p`."""
  queries_tf = []

  for query in queries:
    url = 'https://books.google.com/ngrams/json?content={0}&year_start={1}&year_end={2}&corpus=26&smoothing=0'.format(query, year_start, year_end)
    r = requests.get(url)
    r.raise_for_status()  # stop on an HTTP error instead of trying to parse an error page

    # one row per word found in the corpus; p is the mean of the yearly values
    query_tf_df = pd.DataFrame.from_dict(r.json())
    query_tf_df['p'] = query_tf_df['timeseries'].apply(np.mean)
    queries_tf.append(query_tf_df[['ngram', 'p']])

  # stack the per-batch results and rename `ngram` to `word`
  words_tf = pd.concat(queries_tf, ignore_index=True)
  words_tf = words_tf.rename(columns={'ngram': 'word'})

  return words_tf
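Because the full run takes about an hour, a single transient network failure partway through would discard all progress. One possible safeguard, sketched here under stated assumptions, is to wrap each request in a retry helper. The name `call_with_retry`, the default retry count, and the default delay are all hypothetical choices and are not part of the notebook.

```python
import time

def call_with_retry(fetch, retries=3, delay_seconds=5.0):
  """Call `fetch()` until it succeeds, sleeping between failed attempts.

  Re-raises the last exception if every attempt fails.
  """
  if retries < 1:
    raise ValueError('retries must be at least 1')

  for attempt in range(1, retries + 1):
    try:
      return fetch()
    except Exception as exc:
      if attempt == retries:
        raise  # out of attempts: surface the original error
      print('Attempt {0} failed ({1}); retrying...'.format(attempt, exc))
      time.sleep(delay_seconds)
```

Inside the loop of `getNgramTf`, the request line could then become something like `r = call_with_retry(lambda: requests.get(url, timeout=30))`, where the 30-second timeout is likewise an assumed value.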

In [79]:
tf_file = datapath + 'all_tf.csv'

In [80]:
if REFRESH_TF or not os.path.exists(tf_file):
  print('Getting data...')
  words_tf = getNgramTf(queries, 2010, 2020)
else:
  words_tf = pd.read_csv(tf_file)

print('Done!')

Getting data...
Done!


In [81]:
words_tf.shape

(12901, 2)

In [82]:
words_tf.head(n=10)

|   | word | p |
|---|------|---|
| 0 | aahed | 4.629283e-08 |
| 1 | aalii | 3.59456e-10 |
| 2 | aargh | 4.751212e-09 |
| 3 | aarti | 1.33111e-08 |
| 4 | abaca | 2.638774e-08 |
| 5 | abaci | 1.290335e-08 |
| 6 | aback | 2.593479e-06 |
| 7 | abacs | 7.097529e-11 |
| 8 | abaft | 1.051571e-07 |
| 9 | abaka | 9.413733e-10 |


In [83]:
words_tf.tail(n=10)

|   | word | p |
|---|------|---|
| 12891 | zulus | 7.271149e-10 |
| 12892 | zupan | 1.31166e-09 |
| 12893 | zupas | 5.376945e-11 |
| 12894 | zuppa | 1.7021e-08 |
| 12895 | zurfs | 3.66251e-11 |
| 12896 | zuzim | 2.757089e-09 |
| 12897 | zygal | 2.307935e-10 |
| 12898 | zygon | 1.668479e-09 |
| 12899 | zymes | 7.890288e-09 |
| 12900 | zymic | 1.525064e-10 |


### TF-IDF Analysis

In [88]:
with open(datapath + "past_words.txt", "r") as fin:
  past_words = fin.read().splitlines()

len(past_words)

509

In [89]:
past_tf = words_tf.loc[words_tf['word'].isin(past_words)]

len(past_tf)

509

In [93]:
tf_df = pd.DataFrame({'past_tf':past_tf['p'].describe(), 'all_tf':words_tf['p'].describe()})
tf_df

|   | past_tf | all_tf |
|---|---------|--------|
| count | 509.0 | 12901.0 |
| mean | 3.224554e-05 | 5.33724e-06 |
| std | 0.0001321954 | 4.594543e-05 |
| min | 1.070396e-08 | 3.7379e-12 |
| 25% | 6.549039e-07 | 2.002063e-09 |
| 50% | 2.550427e-06 | 1.864217e-08 |
| 75% | 1.465864e-05 | 3.377631e-07 |
| max | 0.001927869 | 0.001990731 |


In [95]:
tf_df.loc['mean', 'past_tf'] / tf_df.loc['mean', 'all_tf']

6.041613922275527

Comparing the two frequency distributions, the mean TF-IDF likelihood score of past Wordle solution words is about 6 times the mean across all allowable Wordle gameplay words. The gap between the medians is larger still (2.55e-06 versus 1.86e-08, roughly 137 times). Common words are therefore much more likely than rare words to be Wordle solutions.
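Both distributions are heavily skewed (in each column the standard deviation is several times the mean), so the mean is dominated by a handful of very frequent words. A sketch of the same comparison at the quartiles is given below. The values are copied from the summary table, and the dictionary layout and variable names are illustrative only.

```python
# summary statistics copied from the describe() table above
summary = {
  'mean': {'past_tf': 3.224554e-05, 'all_tf': 5.33724e-06},
  '25%':  {'past_tf': 6.549039e-07, 'all_tf': 2.002063e-09},
  '50%':  {'past_tf': 2.550427e-06, 'all_tf': 1.864217e-08},
  '75%':  {'past_tf': 1.465864e-05, 'all_tf': 3.377631e-07},
}

# ratio of past-solution score to all-word score for each statistic
ratios = {stat: row['past_tf'] / row['all_tf'] for stat, row in summary.items()}

for stat, ratio in ratios.items():
  print('{0:>4}: {1:.1f}x'.format(stat, ratio))
```

On the full DataFrame, the same ratios could be obtained with `tf_df['past_tf'] / tf_df['all_tf']`, which divides the two columns row by row.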

### Process Results

If Google Ngram Viewer has no results for a particular word, that word is omitted from the API response. Here we check for any words missing from the results of the API calls:

In [84]:
if REFRESH_TF or not os.path.exists(tf_file):
  print('Processing results...')

  # identify the words in all_words that are absent from words_tf
  found_words = set(words_tf['word'])
  missing_words = [word for word in all_words if word not in found_words]

  print('Missing words: ', ', '.join(missing_words))

  # assign zero probability to the missing words in missing_tf
  missing_tf = pd.DataFrame({'word': missing_words, 'p': [0.0]*len(missing_words)})

  # merge the two dataframes as all_tf and sort alphabetically
  all_tf = pd.concat([words_tf, missing_tf], ignore_index=True)
  all_tf = all_tf.sort_values('word').reset_index(drop=True)

  # save the output file as all_tf.csv so it can be used by the Wordle strategy tool
  all_tf.to_csv(tf_file, index=False)

  print('Done!')

Processing results...
Missing words:  avyze, awdls, azygy, boygs, byked, byrls, daych, dorbs, dsobo, dsomo, durgy, dzhos, eevns, egmas, ennog, erevs, euked, evhoe, ewked, gowfs, hiois, humfs, hwyls, jarps, jokol, kerky, khafs, koaps, kophs, kuzus, lance, mausy, nabks, odyls, omovs, pebas, peeoy, peghs, phpht, poupt, pyins, qapik, qophs, ryked, sdayn, skyfs, skyrs, snebs, sohur, sowfs, syped, takky, tiyns, uraos, viffs, voema, voips, vutty, wembs, whyda, wudus, xysts, yaffs, yarco, yesks, ylems, ylkes, yrivd, zedas, zexes, zimbs
Done!
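The zero-filling step can also be expressed with a single pandas `reindex` call, which aligns the results to the full word list and fills the gaps in one pass. The sketch below is an alternative formulation and not the notebook's own method. It uses a three-word toy list with illustrative `p` values (`zexes` is one of the missing words reported above), and it assumes each word appears at most once in `words_tf`.

```python
import pandas as pd

# toy stand-ins for the notebook's all_words and words_tf (illustrative values)
all_words = ['abyss', 'crate', 'zexes']
words_tf = pd.DataFrame({'word': ['abyss', 'crate'], 'p': [2.7e-06, 2.0e-06]})

# words with no API result
missing_words = sorted(set(all_words) - set(words_tf['word']))

# align to the full word list; words without a result get p = 0
all_tf = (
  words_tf.set_index('word')
  .reindex(all_words, fill_value=0.0)
  .rename_axis('word')
  .reset_index()
)

print(missing_words)
print(all_tf)
```

The rows of `all_tf` follow the order of `all_words`, so no separate sort is needed when that list is already alphabetical.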
