<a href="https://colab.research.google.com/github/anirudhvaliathan/wordshenanigans/blob/main/whatstheword.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WHAT'S THE WORD?


I remember hearing the song *Supercalifragilisticexpialidocious* in the film *Mary Poppins* as a child and reading about how it was the longest word in the dictionary, only to come across the word *pneumonoultramicroscopicsilicovolcanoconiosis* a few years later. It referred to a lung disease caused by inhaling very fine ash and sand dust. But it was intentionally created to be the longest.

Further research showed that there are much longer words if we allow scientific names, which put all the aforementioned title holders to shame. The chemical name for the largest known protein, titin, is *methionylthreonylthreonylglutaminylalanyl…isoleucine.* It can only be represented here with an ellipsis because it is a whopping 189,819 letters long!

So let us take a look at some other questions about English words, the shortest and the longest, the weird and the obscure, and answer them with a helper who never shies away from the tedious: the computer.

### Setup

To get started we will need a list of words. We'll import the `nltk` library and use the word list in its `words` corpus.


Of course, the results and conclusions drawn from the data are only as good as the data itself. That is to say, with another dictionary that contains a different collection of words, some of the results could be different.

In [2]:
import random
import string
from math import comb
import pandas as pd
import nltk
nltk.download('words')
from nltk.corpus import words

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


In [3]:
# Get the list of English words
english_word_list = words.words()

In [4]:
# Check number of words in the dictionary
print("The dictionary contains {} words".format(len(english_word_list)))


The dictionary contains 236736 words


Let's look at a random sample of these.

In [5]:
print("A few random words: {}".format(', '.join(random.sample(english_word_list, 10))))

A few random words: horizontality, catheterism, precompensate, northerliness, unguentarium, annexionist, drawnet, Bellovaci, precriticism, cyclistic


Running the previous cell a few times shows that the corpus seems to contain proper nouns. Let's remove these.

In [6]:
# Keep only the entries that do not start with a capital letter
english_word_list = [word for word in english_word_list if not word[0].isupper()]

print("The dictionary now contains {} words.".format(len(english_word_list)))

The dictionary now contains 211536 words.


Also, looking at the last few entries of the list, it seems that a few words are repeated, perhaps for *NLP* purposes. Let's remove these as well. A little experimenting shows that we need the last 849 entries removed.

In [7]:
# Drop the last 849 entries (a slice ending at -1 would leave the final one behind)
del english_word_list[-849:]
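If the repeated entries are exact duplicates, order-preserving de-duplication is an alternative that does not rely on knowing how many entries to cut. The sketch below runs on a made-up list; `deduplicate` and `sample` are illustrative names, not part of the notebook, and whether it removes exactly the same entries as the slice above depends on the corpus.

```python
def deduplicate(words):
    """Return the words with later repeats dropped, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(words))


sample = ["apple", "berry", "cherry", "apple", "berry"]
print(deduplicate(sample))
```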

We have now cleaned our data and can start answering questions.

### Longests and Shortests

First, let's look at some longest and shortest words under different restrictions on the letters and patterns allowed, beginning with the longest word in this dictionary. But before that, let's create a dictionary with word lengths as keys and the list of words of each length as values.

In [9]:
# Length of words

word_lengths = {}
for word in english_word_list:
  length = len(word)
  if length not in word_lengths.keys():
    word_lengths[length] = [word]
  else:
    word_lengths[length].append(word)
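The same grouping can be written with `collections.defaultdict`, which creates the empty list for a new length automatically. This is a minimal sketch on a tiny made-up list; `group_by_length` and `sample` are hypothetical names used only for illustration.

```python
from collections import defaultdict


def group_by_length(words):
    """Map each word length to the list of words of that length."""
    groups = defaultdict(list)
    for word in words:
        groups[len(word)].append(word)
    return dict(groups)


sample = ["cat", "dog", "horse", "a", "zebra"]
groups = group_by_length(sample)
longest = max(groups)  # the largest key is the greatest word length
print(longest, groups[longest])
```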

Now let's find the longest words by looking up the largest key.

In [10]:
longest_length = max(word_lengths)
print('The longest words in this dictionary are {}, containing {} letters each'.format(
    ', '.join(word_lengths[longest_length]), longest_length))


The longest words in this dictionary are formaldehydesulphoxylate, pathologicopsychological, scientificophilosophical, tetraiodophenolphthalein, thyroparathyroidectomize, containing 24 letters each


What about the longest word that contains only vowels?

In [11]:
# All words with only vowels

vowels = ['a', 'e', 'i', 'o', 'u']
consonantless = english_word_list.copy()

for word in english_word_list:
  for letter in word:
    if letter not in vowels:
      consonantless.remove(word)
      break
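Removing items from a copied list rescans the list on every removal, which is slow on a corpus of this size. A set-based filter gives the same kind of result in one pass; the sketch below also covers the mirror-image question of words with no vowels at all. `only_vowels`, `no_vowels` and `sample` are illustrative names, and *y* is treated as a consonant, as in the notebook.

```python
VOWELS = set("aeiou")


def only_vowels(words):
    """Words whose letters are all vowels."""
    return [w for w in words if set(w) <= VOWELS]


def no_vowels(words):
    """Words that contain none of a, e, i, o, u."""
    return [w for w in words if VOWELS.isdisjoint(w)]


sample = ["iao", "rhythm", "oii", "tea", "symphysy"]
print(only_vowels(sample))
print(no_vowels(sample))
```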

In [16]:
# Longest word with only vowels

m = max(map(len, consonantless))
longest = [x for x in consonantless if len(x) == m]

print("The longest words in this dictionary containing only vowels are {}".format(', '.join(longest)))

The longest words in this dictionary containing only vowels are iao, oii


Our word list tops out at three letters. The record usually cited elsewhere is [Euouae](https://en.wikipedia.org/wiki/Euouae), an abbreviation used as a musical mnemonic in Latin psalters and other liturgical books of the Roman Rite. Hmm... a mnemonic...

Music reminds me of the time I discovered the word *rhythm* had no vowels in it. I thought long and hard to come up with something longer, but to no avail. Until I realised I could add an *s* and get *rhythms*. But there was no way to know if there was a longer word still, hiding from my limited vocabulary. Well, today my curiosity will be sated.

In [18]:
# Words with only consonants

vowelless = english_word_list.copy()

for word in english_word_list:
  for letter in word:
    if letter in vowels:
      vowelless.remove(word)
      break


In [23]:
# Longest words with only consonants

m = max(map(len, vowelless))
longest_vowelless = [x for x in vowelless if len(x) == m ]

print("The longest word in this dictionary containing only consonants is {}".format(', '.join(longest_vowelless)))

The longest word in this dictionary containing only consonants is symphysy


It seems like I have been beaten, but the internet states the word [symphysy](https://en.wiktionary.org/wiki/symphysy) is both obsolete and rare, so the spoils can be shared.

What about the shortest word that contains all the vowels?

In [24]:
# Words with all the vowels

all_vowels = []
for word in english_word_list:
  vowellist = []
  for letter in word:
    if letter in vowels:
      vowellist.append(letter)
  # sorted(vowellist) equals vowels only when each vowel occurs exactly once
  if sorted(vowellist) == vowels:
    all_vowels.append(word)

In [25]:
# Shortest word with all vowels

m = min(map(len, all_vowels))
shortest_allvowels = [x for x in all_vowels if len(x) == m]

print("The shortest words in this dictionary containing all the vowels are {}.".format(', '.join(shortest_allvowels)))

The shortest words in this dictionary containing all the vowels are adoulie, eulogia, moineau.
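The comparison `sorted(vowellist) == vowels` accepts a word only if each vowel occurs exactly once. If "contains all the vowels" is read as "at least once each", a set test is the looser alternative. The sketch below contrasts the two readings; the function names are hypothetical and the example words are chosen for illustration.

```python
VOWELS = "aeiou"


def has_every_vowel(word):
    """True if each of a, e, i, o, u occurs at least once."""
    return set(VOWELS) <= set(word)


def has_every_vowel_once(word):
    """True if each vowel occurs exactly once (the check used above)."""
    return sorted(c for c in word if c in VOWELS) == list(VOWELS)


for w in ["eulogia", "evacuation", "rhythm"]:
    print(w, has_every_vowel(w), has_every_vowel_once(w))
```

For the shortest-word question the two readings are unlikely to differ much, since a repeated vowel makes a word longer, but they do diverge on words such as *evacuation*.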


What about words that have all the vowels in order? And what about all the vowels in reverse order? The first should be possible because of the many common words that end with the *ious* suffix. For the first part, we only need to drop the call to the `sorted` function on *vowellist* in

```
if sorted(vowellist) == vowels:
```

and for the second part, we compare *vowellist* against the reversed list of vowels.

In [26]:
# Words with all the vowels in order

all_vowels_ordered = []
for word in english_word_list:
  vowellist = []
  for letter in word:
    if letter in vowels:
      vowellist.append(letter)
  if vowellist == vowels:
    all_vowels_ordered.append(word)

m = min(map(len, all_vowels_ordered))
shortest_allvowels_ordered  = [x for x in all_vowels_ordered if len(x) == m]

print("The shortest word in the dictionary containing all the vowels in order is {}.".format(', '.join(shortest_allvowels_ordered )))


# Words with all the vowels in reverse order

reversed_vowels = vowels[::-1]
all_vowels_revordered  = []
for word in english_word_list:
  vowellist = []
  for letter in word:
    if letter in vowels:
      vowellist.append(letter)
  # the vowels must read u, o, i, e, a; sorting them in reverse could never equal vowels
  if vowellist == reversed_vowels:
    all_vowels_revordered.append(word)


if len(all_vowels_revordered) > 0:
  m = min(map(len, all_vowels_revordered))
  shortest_allvowels_revordered  = [x for x in all_vowels_revordered if len(x) == m]
  print("The shortest word in the dictionary containing all the vowels in reverse order is {}.".format(', '.join(shortest_allvowels_revordered)))
else:
  print('There are no words that contain all the vowels in reverse order')


The shortest word in the dictionary containing all the vowels in order is caesious.


Unoinkebla! We have a candidate for the first part. [Caesious](https://www.merriam-webster.com/dictionary/caesious#:~:text=adjective,color%20very%20low%20in%20chroma) is an adjective that refers to anything that has a blue colour which is very low in chroma. For the second part, English does offer candidates such as *subcontinental* and *uncomplimentary*, whose vowels run u, o, i, e, a; which of them is the shortest here depends on what this word list contains.

What if we loosened the restrictions and, instead of only the vowels, looked for the longest words in which all the letters appear in alphabetical order, or in reverse alphabetical order? (This tightens certain other restrictions at the same time.)

In [27]:
# Longest word with letters in alphabetical order

longest_word_with_letters_ordered = []
length = 0

for word in english_word_list:
  # only check if the current word is at least as long as the longest found so far
  if len(word) >= length:
    # flag
    is_ordered = True
    for i in range(len(word) - 1):
      if word[i] > word[i + 1]:
        is_ordered = False
        break
    if is_ordered and (len(word) == length):
      longest_word_with_letters_ordered.append(word)
    if is_ordered and (len(word) > length):
      longest_word_with_letters_ordered = [word]
      length = len(word)

In [None]:
print("The longest words in this dictionary with letters appearing in alphabetical order are {}".format(', '.join(longest_word_with_letters_ordered)))

The longest words in this dictionary with letters appearing in alphabetical order are alloquy, beefily, begorry, billowy, egilops
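The pairwise comparison above is equivalent to asking whether a word is unchanged by sorting its letters. A compact sketch of that idea, including the reverse-order variant, follows; the function names are illustrative only. Like the loop above, it allows repeated adjacent letters, which is why *billowy* and *beefily* qualify.

```python
def is_alphabetical(word):
    """True if the letters never decrease from left to right."""
    return list(word) == sorted(word)


def is_reverse_alphabetical(word):
    """True if the letters never increase from left to right."""
    return list(word) == sorted(word, reverse=True)


for w in ["billowy", "wronged", "apple"]:
    print(w, is_alphabetical(w), is_reverse_alphabetical(w))
```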


In [28]:
# Longest word with letters in reverse alphabetical order

longest_word_with_letters_reversed = []
length = 0
for word in english_word_list:
  # only check if the current word is at least as long as the longest found so far
  if len(word) >= length:
    # flag
    is_reversed = True
    for i in range(len(word) - 1):
      if word[i] < word[i + 1]:
        is_reversed = False
        break
    if is_reversed and (len(word) == length):
      longest_word_with_letters_reversed.append(word)
    if is_reversed and (len(word) > length):
      longest_word_with_letters_reversed = [word]
      length = len(word)

In [None]:
print("The longest words in this dictionary with letters appearing in reverse alphabetical order are {}".format(', '.join(longest_word_with_letters_reversed)))

The longest words in this dictionary with letters appearing in reverse alphabetical order are spiffed, sponged, troolie, wronged


Finally, let us find the longest word in which all the letters are different.

In [29]:
# longest word with all different letters

longest_word_with_different_letters = []
length = 0

for word in english_word_list:
  # only check if current word is as long as previous
  if len(word) > length - 1:
    # list of letters checked
    prev_letters = []
    for letter in word:
      if letter in prev_letters:
        break
      else:
        prev_letters.append(letter)
    # only runs if all letters are different
    if (len(word) == len(prev_letters)) and (len(word) == length):
      longest_word_with_different_letters.append(word)
    if (len(word) == len(prev_letters)) and (len(word) > length):
      longest_word_with_different_letters = [word]
      length = len(word)


In [None]:
print("The longest word in this dictionary with all different letters is {}.".format(', '.join(longest_word_with_different_letters)))

The longest word in this dictionary with all different letters is dermatoglyphics.
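A word has all different letters exactly when the set of its letters is as large as the word itself. The sketch below uses that observation on a small made-up list; `has_distinct_letters`, `longest_with_distinct_letters` and `sample` are hypothetical names.

```python
def has_distinct_letters(word):
    """True if no letter occurs more than once."""
    return len(set(word)) == len(word)


def longest_with_distinct_letters(words):
    """All words of maximal length among those with no repeated letter."""
    candidates = [w for w in words if has_distinct_letters(w)]
    if not candidates:
        return []
    best = max(map(len, candidates))
    return [w for w in candidates if len(w) == best]


sample = ["planet", "letter", "background", "balloon"]
print(longest_with_distinct_letters(sample))
```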


The word *dermatoglyphics* itself has something to do with patterns, a fabulous coincidence! It refers to the study of skin markings or patterns on fingers, hands, and feet, and its application, especially in criminology.

### Repetitions and Sequences

For the first part of this section, I would like to answer the question: which word has the most occurrences of a single letter?

We will iterate through the word list and, for each word, build a dictionary recording how often each of its letters occurs. If the word's highest letter count ties the best so far, we add the word to our main dictionary; if it beats the best, we empty the main dictionary first. As a small optimization, the loop only examines words that are longer than the current record, since a shorter word cannot beat it.

In [30]:
# Most occurrences of a letter in a word

most_occurrences = {}
count = 0

for word in english_word_list:
  letter_count = {}
  # only check if word is longer than the max occurrences found till now
  if len(word) > count:
    for letter in word:
      if letter not in letter_count.keys():
        letter_count[letter] = 1
      else:
        letter_count[letter] += 1
    # highest count in this word and the letter(s) that reach it
    best = max(letter_count.values())
    best_letters = [x for x in letter_count if letter_count[x] == best]
    if best > count:
      # new record: discard the earlier entries
      most_occurrences = {}
      count = best
    if best == count:
      most_occurrences[word] = best_letters, best

print(len(most_occurrences))


1


It seems like we have an undisputed winner: only one word made it into our dictionary of results.

In [31]:
winner, (letters, times) = next(iter(most_occurrences.items()))
print("The word which contains the most occurrences of a single letter is {}. The letter {} appears {} times in it.".format(winner, ', '.join(letters), times))

The word which contains the most occurrences of a single letter is possessionlessness. The letter s appears 8 times in it.
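The per-word tally can also be delegated to `collections.Counter` from the standard library. This is a minimal sketch, assuming non-empty words; `top_letter` is an illustrative helper, not part of the notebook.

```python
from collections import Counter


def top_letter(word):
    """Return (letter, count) for the most frequent letter in a non-empty word."""
    # most_common(1) gives a one-element list of (letter, count) pairs
    letter, count = Counter(word).most_common(1)[0]
    return letter, count


print(top_letter("possessionlessness"))
```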


It is natural to look for a generalisation. That is, for a given letter, which word contains the most occurrences of it?

In [32]:
# Most occurrences for each letter in any word

# initialize a dictionary with each letter as a key and a list to record the word(s) and the number of times it occurs in it
most_occurrences_of_each_letter = {letter: [0, []] for letter in string.ascii_lowercase}

for word in english_word_list:
  letter_count = {}
  for letter in word:
    if letter not in letter_count.keys():
      letter_count[letter] = 1
    else:
      letter_count[letter] += 1
  for key in letter_count.keys():
    if letter_count[key] > most_occurrences_of_each_letter[key][0]:
      most_occurrences_of_each_letter[key][0] = letter_count[key]
      most_occurrences_of_each_letter[key][1].clear()
      most_occurrences_of_each_letter[key][1].append(word)
    elif letter_count[key] == most_occurrences_of_each_letter[key][0]:
      most_occurrences_of_each_letter[key][1].append(word)


Some of the lists will contain many words, so it is more informative to view the result as a pandas DataFrame.

In [33]:
pd.set_option('display.max_colwidth', 1000)
table = pd.DataFrame.from_dict(most_occurrences_of_each_letter).transpose()
table.columns = ['count', 'words']
table

| | count | words |
|---|---|---|
| a | 6 | [astragalocalcaneal, calcaneoastragalar] |
| b | 4 | [beblubber, beerbibber, bubbybush, flibbertigibbet, gibblegabble, gibblegabbler] |
| c | 5 | [chroococcaceous, circumcrescence, coccochromatic, cryptococcic] |
| d | 5 | [disdodecahedroid] |
| e | 7 | [electrotelethermometer] |
| f | 4 | [giffgaff, riffraff] |
| g | 4 | [cuggermugger, ganggang, giggling, gigglingly, gumdigging, higglehaggle, huggermugger, huggermuggery, throughganging] |
| h | 4 | [choledochorrhaphy, chromophotolithograph, ichthyophthalmite, ichthyophthiriasis, ophthalmophthisis, phenolsulphonephthalein, photochromolithograph, thymolsulphonephthalein] |
| i | 6 | [impossibilification, indistinguishability, indivisibility, minimifidianism, pericardiomediastinitis] |
| j | 2 | [ajaja, avijja, cholecystojejunostomy, duodenojejunal, duodenojejunostomy, gastrojejunal, gastrojejunostomy, hajilij, jajman, jeewhillijers, jejunal, jejunator, jejune, jejunely, jejuneness, jejunitis, jejunity, jejunoduodenal, jejunoileitis, jejunostomy, jejunotomy, jejunum, jeremejevite, jimberjaw, jimberjawed, jimjam, jinglejangle, jinja, jinjili, jipijapa, jojoba, jujitsu, juju, jujube, jujuism, jujuist, perijejunitis, pidjajap] |


It seems that for every letter there is at least one word where it appears twice, and there are only three letters for which two is the maximum: *x*, *q*, and *j*. There is also only one letter for which the maximum is 7, *e*, and only one word where this happens.

There are a lot of words where two different double letters appear back to back, for instance *balloon*. But is there a word where this happens with three different doubles? The word *committee* comes very close, but not quite. If only that *i* didn't come in between, humph.

In [34]:
# Three double letters in a row
thricecons = []
for word in english_word_list:
  if len(word) > 5:
    for i in range(len(word) - 5):
      if word[i] == word[i+1] and word[i+2] == word[i+3] and word[i+4] == word[i+5]:
        thricecons.append(word)
        break

print("The words where a double appears thrice consecutively are {}.".format(', '.join(thricecons)))

The words where a double appears thrice consecutively are bookkeeper, bookkeeping, subbookkeeper.


We have 3 words! Of course, the same question for 4 needs no further answering. Subbookkeepers rejoice!
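The check generalises to any number of back-to-back doubles, which makes the remark about 4 easy to confirm. The sketch below slides a window of `2 * k` letters along the word; `has_double_run` is a hypothetical helper and assumes `k` is at least 1.

```python
def has_double_run(word, k):
    """True if the word contains k doubled letters back to back, e.g. 'ookkee' for k=3."""
    width = 2 * k
    for start in range(len(word) - width + 1):
        window = word[start:start + width]
        # compare the two letters of each pair inside the window
        if all(window[i] == window[i + 1] for i in range(0, width, 2)):
            return True
    return False


for word, k in [("balloon", 2), ("committee", 3), ("bookkeeper", 3), ("subbookkeeper", 4)]:
    print(word, k, has_double_run(word, k))
```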

If we have gone down this line of curiosity, we must ask if there is a word where a letter appears three times in a row.

In [35]:
# A letter hattrick
hattrick = []
for word in english_word_list:
  if len(word) > 2:
    for i in range(len(word) - 2):
      if (word[i] == word[i+1] == word[i+2]):
        # record the match in hattrick, not in thricecons
        hattrick.append(word)
        break

print(len(hattrick))


### Anagrams

In this section we seek to answer the following question: Which subset of letters can create the most words (repetition is allowed)? This is inspired by the [Spelling Bee](https://spellbee.org/) game where you must make words from the seven letters given, each of which must include the middle letter.

Of course if we allow all 26 then we may create all the words. So we must impose a restriction on the size of the subset. So, our question becomes thus: which group of three or four or five letters can create the most words? And what are they?

To answer this we will create a function which takes as input a number and returns a dictionary.

In [36]:
#  set of n letters with which you can make the most words

def anagram_subset_finder(n):
  '''
  Parameters: n (a positive integer)
  Returns: a dictionary called anagrams with keys as ordered strings (alphabetically) of n distinct letters,
  and values are a list of the corresponding words that can be formed with those letters.

  For example, if n=3, the function returns a dictionary with keys being the unique three letters
  and the values being all the possible words that contain those 3 letters.
  If you want to see the possible combinations with the letters 'r', 'o', and 'e',
  anagrams['eor'] gives you a list of all the words. If a key is not present, there are no words with that combination.
  '''

  # initialise the dictionary
  anagrams = {}

  # return empty
  if n < 1:
    return anagrams

  # loop through list
  for word in english_word_list:

    # maintain list of letters for each word
    list_of_letters = []
    for letter in word:
      if letter not in list_of_letters:
        list_of_letters.append(letter)
      # if the current word has more than n distinct letters break out of inner loop
      if len(list_of_letters) > n:
        break

    # if current word does not have exactly n distinct letters move to the next word
    if len(list_of_letters) != n:
      continue
    else:
      # create key
      sorted_string = ''.join(sorted(list_of_letters))
      # create entry in dict if not already there; else add to existing list
      if sorted_string not in anagrams.keys():
        anagrams[sorted_string] = [word]
      else:
        anagrams[sorted_string].append(word)

  return anagrams
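The core of the function is "take the set of distinct letters, sort it, and use it as a key". A compact sketch of that idea follows; unlike the function above it takes the word list as an argument, and `group_by_letter_set` and `sample` are illustrative names rather than the notebook's own.

```python
from collections import defaultdict


def group_by_letter_set(words, n):
    """Group words that use exactly n distinct letters, keyed by those letters sorted."""
    groups = defaultdict(list)
    for word in words:
        letters = set(word)
        if len(letters) == n:
            groups["".join(sorted(letters))].append(word)
    return dict(groups)


sample = ["rest", "steer", "tree", "rate", "tear", "street"]
print(group_by_letter_set(sample, 4))
```

Building the full set for every word gives up the early exit of the original loop, which is a reasonable trade for brevity on a list of this size.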


In [37]:
anagram_table = {number: len(anagram_subset_finder(number))  for number in range(1, 27)}

In [38]:
anagram_dataframe = pd.DataFrame.from_dict(anagram_table, orient="index")
anagram_dataframe.rename(columns={0: 'combinations'}, inplace=True)
anagram_dataframe

| | combinations |
|---|---|
| 1 | 26 |
| 2 | 104 |
| 3 | 858 |
| 4 | 3147 |
| 5 | 7125 |
| 6 | 12882 |
| 7 | 18030 |
| 8 | 19980 |
| 9 | 17003 |
| 10 | 11348 |


In [39]:
# Calculate the total number of possible combinations for each n
total_combinations = []
for i in range(1, 27):
  total_combinations.append(comb(26, i))

pd.set_option('display.float_format', '{:.20f}'.format)
anagram_dataframe['total_combinations'] = total_combinations
anagram_dataframe['proportion'] = anagram_dataframe['combinations']/total_combinations
anagram_dataframe

| | combinations | total_combinations | proportion |
|---|---|---|---|
| 1 | 26 | 26 | 1.0 |
| 2 | 104 | 325 | 0.32 |
| 3 | 858 | 2600 | 0.33 |
| 4 | 3147 | 14950 | 0.2105016722408026 |
| 5 | 7125 | 65780 | 0.1083155974460322 |
| 6 | 12882 | 230230 | 0.0559527429092646 |
| 7 | 18030 | 657800 | 0.0274095469747643 |
| 8 | 19980 | 1562275 | 0.0127890416219935 |
| 9 | 17003 | 3124550 | 0.0054417436110799 |
| 10 | 11348 | 5311735 | 0.0021364017595004 |
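Each proportion is the number of letter sets that actually occur divided by the number of ways to choose that many letters from 26. The sketch below reproduces two rows of the table from the counts shown; `coverage` is a hypothetical helper name.

```python
from math import comb


def coverage(observed, n, alphabet_size=26):
    """Return (possible letter sets of size n, fraction of them that occur)."""
    possible = comb(alphabet_size, n)
    return possible, observed / possible


print(coverage(104, 2))   # 104 of the 325 possible pairs occur
print(coverage(3147, 4))  # 3147 of the 14950 possible sets of four occur
```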


Since the function returns an empty dictionary for every value above 16, the words that use the most distinct letters use 16 of them.

As an example, let us see which combination of four letters can create the most words:

In [46]:
fourletters = anagram_subset_finder(4)
m = max(map(len, fourletters.values()))
most = [x for x in fourletters if len(fourletters[x]) == m]
m, most

(30, ['aert', 'erst'])

In [45]:
fourletters['erst']

['ester',
 'estre',
 'reest',
 'reester',
 'reset',
 'resetter',
 'rest',
 'rester',
 'restes',
 'restress',
 'retest',
 'sert',
 'setter',
 'steer',
 'steerer',
 'stere',
 'stert',
 'stre',
 'stree',
 'street',
 'streets',
 'stress',
 'stresser',
 'stret',
 'strette',
 'terse',
 'tester',
 'tress',
 'trest',
 'tsere']

In [44]:
fourletters['aert']

['aerate',
 'arete',
 'ateeter',
 'atter',
 'eater',
 'errata',
 'rate',
 'rater',
 'ratter',
 'rerate',
 'retare',
 'retreat',
 'retreater',
 'tare',
 'tarea',
 'tarrer',
 'tartaret',
 'tartrate',
 'tater',
 'tatter',
 'teaer',
 'tear',
 'tearer',
 'teart',
 'tera',
 'terrar',
 'tetra',
 'treat',
 'treatee',
 'treater']