An analysis of English words by length and number of unique letters. (Python)
- Project Overview
- Data Source and Preparation
- Findings
- Longest word
- Average number of letters
- Words with most unique letters
The goal of the project was to create two distributions: (1) the count of English words by length, and (2) the count of English words by number of unique letters. A second goal was to explore the data set by answering questions like:
- What is the longest word in the English language? (Is it "antidisestablishmentarianism"?)
- What is the average number of letters in an English word?
- Which words have the most unique letters?
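The two distributions described above can be sketched in a few lines. This is a minimal illustration using only the standard library, not the notebook code itself; the `words` list is a made-up sample standing in for the full word list.

```python
from collections import Counter

# Hypothetical sample; the real project uses the full word list.
words = ["a", "at", "tea", "tree", "letter", "banana"]

# Distribution 1: how many words there are at each word length.
length_counts = Counter(len(word) for word in words)

# Distribution 2: how many words there are at each number of unique letters.
unique_counts = Counter(len(set(word)) for word in words)

for length in sorted(length_counts):
    print("length", length, "->", length_counts[length], "word(s)")

for n_unique in sorted(unique_counts):
    print("unique letters", n_unique, "->", unique_counts[n_unique], "word(s)")
```

`len(set(word))` counts each letter once, so "banana" has a length of 6 but only 3 unique letters.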
The data file for this project can be found here. This file was originally obtained from GitHub user dwyl. The list contains 370,103 words. Note that hyphenated words (like "self-image") are excluded.
All data cleaning and analysis occurred in a Jupyter Notebook using Python. The initial cleaning steps involved eliminating missing values and checking for duplicates, among a few other minor corrections to the dataframe. The data was then ready for analysis.
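As a rough sketch of the cleaning steps mentioned above, the snippet below drops blank entries and duplicates from a list of raw lines. It is a plain-Python stand-in for the dataframe cleaning done in the notebook; the function name `clean_words` and the sample input are assumptions made for illustration.

```python
def clean_words(raw_lines):
    """Drop blank (missing) entries and duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for line in raw_lines:
        word = line.strip()
        if not word:       # treat blank lines as missing values
            continue
        if word in seen:   # skip words that were already kept
            continue
        seen.add(word)
        cleaned.append(word)
    return cleaned


# Hypothetical raw input, as if read line by line from the word file.
raw = ["apple\n", "\n", "banana\n", "apple\n", "  cherry  \n"]
words = clean_words(raw)
print(words)
```

If the list is loaded with pandas instead, it is worth knowing that `read_csv` by default parses strings such as "null" and "NA" as missing values; passing `keep_default_na=False` keeps them as text.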
When I was in grade school, we were taught that antidisestablishmentarianism is the longest word in the dictionary. Is it true?
Yeah, sort of. "Antidisestablishmentarianism" is 28 letters long. There are other words that are just as long or longer, but they are the technical names for chemical compounds. In my opinion, those don't really count.
- hydroxydehydrocorticosterone (28 letters)
- cyclotrimethylenetrinitramine (29)
- trinitrophenylmethylnitramine (29)
- dichlorodiphenyltrichloroethane (31)
However, Merriam-Webster doesn't consider "antidisestablishmentarianism" a real word. Some outlets claim that "floccinaucinihilipilification" (29 letters) is the longest word, but that word didn't appear in the data set used here.
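For anyone who wants to check the letter counts, here is a small sketch that ranks words by length. The `words` list contains only the words named in this section plus one short filler word; in the real analysis it would be the full data set.

```python
# Words named in this section, plus one short filler word (an assumption).
words = [
    "antidisestablishmentarianism",
    "hydroxydehydrocorticosterone",
    "cyclotrimethylenetrinitramine",
    "trinitrophenylmethylnitramine",
    "dichlorodiphenyltrichloroethane",
    "dictionary",
]

# Length of the longest word, and every word that reaches it.
max_len = max(len(word) for word in words)
longest = [word for word in words if len(word) == max_len]

# All words ranked from longest to shortest.
by_length = sorted(words, key=len, reverse=True)

for word in by_length:
    print(len(word), word)
```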
I found the average word length a bit surprising. As the bar graph at the very top of this page shows, the average English word is 9 letters long. (This seems long to me!) Of course, we don't actually use most of those really long words (like antidisestablishmentarianism) in everyday discourse, so these long, rarely used words probably shift the distribution to the right. A distribution of only the most commonly used words would very likely show that the average English word is shorter than 9 letters.
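To illustrate how a few very long words pull the average up, here is a minimal sketch that compares the mean and median length of a tiny, made-up sample. The numbers it produces describe only this sample, not the full data set.

```python
from statistics import mean, median

# Hypothetical sample: four ordinary words and one very long one.
words = ["cat", "word", "letter", "english", "antidisestablishmentarianism"]

lengths = [len(word) for word in words]

avg_len = mean(lengths)    # sensitive to the one very long word
mid_len = median(lengths)  # much less affected by it

print("mean length:", avg_len)
print("median length:", mid_len)
```

In this sample the single 28-letter word lifts the mean well above the median, which is the same right-shift effect described above.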
16 letters is the record for the most unique letters, and six different English words can claim that title. They are:
- blepharoconjunctivitis
- formaldehydesulphoxylic
- pneumoventriculography
- pseudolamellibranchiata
- pseudolamellibranchiate
- superacknowledgment
Once again, these words are mostly technical, used only in chemistry, medicine, or biology. To me, these don't really count. "Superacknowledgment", however, does count, and I think it can lay claim to being the non-technical English word with the most unique letters. Good luck using it in everyday conversation, though.
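The unique-letter count is easy to verify. The sketch below counts distinct letters with `set` and picks out the record holders; the `words` list holds the six words above plus two extra words added for contrast (an assumption, since the real analysis runs over the whole list).

```python
# The six words listed above, plus two extra words for contrast.
words = [
    "blepharoconjunctivitis",
    "formaldehydesulphoxylic",
    "pneumoventriculography",
    "pseudolamellibranchiata",
    "pseudolamellibranchiate",
    "superacknowledgment",
    "antidisestablishmentarianism",
    "banana",
]

# Number of distinct letters in each word.
unique_count = {word: len(set(word)) for word in words}

# The highest count, and every word that reaches it.
most = max(unique_count.values())
record_holders = sorted(word for word, n in unique_count.items() if n == most)

print("most unique letters:", most)
for word in record_holders:
    print(word)
```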
If you're curious, the distribution of English words by unique letters is shown below.