published: true
layout: post
title: Kangaroo words
date: 2018-03-26 00:01
author: funvill

Recently I saw two great art projects by @whisbe. Each one is a sign with a phrase written on it, and the phrase's meaning changes when certain letters or words are added or removed.


24/7 365 Happy Fucking Valentines Day you filthy animals 🖤

A post shared by WhIsBe (@whisbe) on Feb 14, 2018 at 1:26pm PST


I loved these projects and wanted to make something similar of my own. I searched for other examples, without much success. Maybe my Google-fu is lacking today.

In my search, I discovered a list of different types of word play, and Kangaroo words.

A Kangaroo word is a word that contains the letters of another word, in order (without transposing any letters). For example, encourage contains courage, cog, cur, urge, core, cure, nag, rag, age, nor, rage and enrage.
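The definition above boils down to a subsequence test. Here is a minimal sketch of that check; the function name `is_sub_word` is my own for illustration and is not taken from the original script.

```python
def is_sub_word(candidate: str, word: str) -> bool:
    """Return True if candidate's letters appear in word, in the same order.

    Hypothetical helper: the letters do not need to be adjacent, they only
    need to keep their relative order. A word is not its own sub word.
    """
    candidate = candidate.lower()
    word = word.lower()
    if not candidate or len(candidate) >= len(word):
        return False
    remaining = iter(word)
    # "ch in remaining" advances the iterator until it finds ch, so each
    # letter has to be found after the previous one.
    return all(ch in remaining for ch in candidate)


print(is_sub_word("courage", "encourage"))
print(is_sub_word("enrage", "encourage"))
print(is_sub_word("grace", "encourage"))
```

The first two calls should report a match, while "grace" fails because no "r" follows the "g" in "encourage".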

Kangaroo words weren't exactly what I was looking for, but they were on the right path. I started searching for a list of all the common Kangaroo words, and for the largest Kangaroo word that exists. I was able to find a few small lists, but nothing that was complete.

The Word Circus: A Letter - Perfect Book (Lighter Side of Language Series) was a good book with lots of Kangaroo words and phrases.

Because there was no exhaustive list of Kangaroo words, and I needed to practice my Python, I decided to create my own list. The script goes through all ~250,000 words and finds all the sub words that appear in each word. The problem with this version is that many of the sub words it found were not common words.
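A rough sketch of that first pass might look like the following. This is not the original script: the names `is_sub_word` and `find_sub_words` are made up for illustration, and the brute-force comparison of every word against every other word is an assumption that would be slow on the full ~250,000 word list.

```python
def is_sub_word(candidate, word):
    """True if candidate is shorter than word and its letters appear in order."""
    if not candidate or len(candidate) >= len(word):
        return False
    remaining = iter(word)
    return all(ch in remaining for ch in candidate)


def find_sub_words(words):
    """Map each word to the sorted list of other words hidden inside it.

    Hypothetical helper: compares every word with every other word, so the
    cost grows with the square of the vocabulary size.
    """
    vocabulary = sorted({w.strip().lower() for w in words if w.strip()})
    result = {}
    for word in vocabulary:
        subs = [c for c in vocabulary if is_sub_word(c, word)]
        if subs:
            result[word] = subs
    return result


sample = ["districts", "district", "strict", "dirt", "disc", "cat"]
for word, subs in find_sub_words(sample).items():
    print(word, "->", ", ".join(subs))
```

With a real word list, `words` could be read from a file with one word per line; the tiny `sample` list here only keeps the example self-contained.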

For example: Districts has 44 sub words that include: ric, iris, srs, ist, tit, disc, dcs, discs, itc, tis, irc, sic, sti, ics, dst, dir, str, tits, src, ict, sri, irs, its, sits, dss, dist, iss, sit, tic, district, tri, sis, rcs, rts, dit, sts, dts, dsc, isis, cts, iis, dis, strict, dirt

I would have preferred that it only showed the most common words: disc, discs, strict, dirt, tits, district, sits

Source code, Output

The next version used only the 20,000 most commonly used words, taken from an n-gram frequency analysis of Google's Trillion Word Corpus. This subset of words also included slang, swear words, and names of companies. To reduce the noise, I limited this script to sub words that are longer than 3 letters. This produces a much better result.

For example:

  • Facilities has 21 sub words that include: fats, clit, cite, lies, fail, facts, acts, files, cities, fits, ties, aces, flies, fact, face, cites, lite, file, fails, fate, faces
  • Generation has 17 sub words that include: nato, neat, ratio, raton, erin, eaton, gene, rain, neon, gran, tion, grin, nero, enron, gain, grain, nation
  • Servants has 17 sub words that include: evans, vans, eats, seats, sean, seas, sent, rants, servant, serv, evan, sans, rant, rats, sets, ants, seat
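The two filters described above (only the most common words, and only sub words longer than 3 letters) could be sketched like this. It assumes the frequency list is already loaded as a Python list ordered from most to least common; the function names and the `top_n` and `min_length` parameters are hypothetical, not taken from the original script.

```python
def is_sub_word(candidate, word):
    """True if candidate is shorter than word and its letters appear in order."""
    if not candidate or len(candidate) >= len(word):
        return False
    remaining = iter(word)
    return all(ch in remaining for ch in candidate)


def common_sub_words(word, ranked_words, top_n=20000, min_length=4):
    """Return the sub words of word found among the top_n most common words.

    Assumption: ranked_words is ordered from most to least common.
    min_length=4 keeps only sub words longer than 3 letters.
    """
    target = word.lower()
    common = [w.lower() for w in ranked_words[:top_n]]
    return [
        c for c in common
        if len(c) >= min_length and is_sub_word(c, target)
    ]


# Tiny stand-in for the real frequency list.
ranked = ["the", "facts", "file", "fit", "cat", "facilities", "lies"]
print(common_sub_words("facilities", ranked))
```

On this stand-in list, "the", "fit" and "cat" are dropped by the length filter, and "facilities" itself is skipped because a word is not counted as its own sub word.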

Source code, Output

Now I have a giant list of words and their sub words.

What is the largest Kangaroo word in the top 20,000 most commonly used words?

Telecommunications has 120 sub words that include: lemma, communion, comm, tion, cocos, cont, coco, elena, unions, lena, louis, loan, coats, counts, onion, union, lion, ciao, coca, ions, cain, conn, cons, icon, mains, cats, tons, econ, toons, locations, latin, comma, elec, onto, lent, lemon, telecommunication, eaton, coins, conan, comics, commits, elect, tele, mins, unit, communication, toni, luton, comic, unto, mans, laos, teas, location, como, outs, cuts, count, tits, tuna, toon, commit, nato, units, lots, oman, main, cunt, commons, econo, elem, loans, tout, elton, lions, icons, cans, lotion, lean, common, coma, omni, lucas, eats, tous, tomato, emma, coat, onions, nation, mais, lets, telecom, coin, leica, leon, tees, tent, teen, teens, tomcat, lens, mats, elections, cation, omit, luna, tuition, tents, cmos, lois, communications, tions, luis, otis, election, tens, telecoms, nations
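Given a mapping of words to their sub words, ranking the words by how many sub words they contain is a short step. The sketch below assumes the results are stored in a plain dictionary; `rank_by_sub_words` is a hypothetical name, and "largest" is taken here to mean "most sub words".

```python
def rank_by_sub_words(sub_words_by_word, limit=3):
    """Return (word, count) pairs, words with the most sub words first.

    Hypothetical helper: ties are broken alphabetically so the order is stable.
    """
    counts = [(word, len(subs)) for word, subs in sub_words_by_word.items()]
    counts.sort(key=lambda pair: (-pair[1], pair[0]))
    return counts[:limit]


# Small made-up sample using sub words listed in this post.
sample = {
    "facilities": ["facts", "file", "lies"],
    "servants": ["vans", "rant"],
    "generation": ["neat", "gain"],
}
for word, count in rank_by_sub_words(sample):
    print(f"{word} has {count} sub words")
```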

The next step is to create a phrase using the words with the most sub words, then test different arrangements of sub words to see if they produce a phrase that also makes sense. Testing whether a string of words makes a proper English phrase is harder than it sounds. I am going to try it manually first, and if I fail I can let the robots at it.
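If the manual attempt fails, one possible starting point for the robots is to enumerate every phrase you get by swapping words for their sub words. This is only a sketch under my own assumptions: `candidate_phrases` is a hypothetical name, and the code does nothing to judge whether a candidate makes sense, which is the hard part.

```python
from itertools import product


def candidate_phrases(phrase, sub_words_by_word):
    """Yield every phrase made by replacing words with one of their sub words.

    Hypothetical helper: each word may stay as it is or be swapped for any of
    its known sub words. The unchanged phrase itself is skipped.
    """
    options = [
        [word] + list(sub_words_by_word.get(word.lower(), []))
        for word in phrase.split()
    ]
    for combination in product(*options):
        candidate = " ".join(combination)
        if candidate != phrase:
            yield candidate


subs = {"encourage": ["courage", "rage"]}
for candidate in candidate_phrases("encourage me", subs):
    print(candidate)
```

The number of candidates multiplies with every word that has sub words, so a filter or a human reader is still needed to pick out the ones that read as real phrases.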