## Text salad: WordNet, word lists, unicode, and compression

### WordNet introduction

WordNet (WN) is a large multilingual database pairing words and meanings.

WordNet implements two key ideas.  First, the **senses** (or meanings) of a word are
language-independent concepts represented in a **very large** concept graph. We'll draw some concept network pictures when we get to networks.  For now,
some basics of WordNet.

Second, concepts (or **synsets** in WN) are linked to **lemmas**, which are
language-particular ways of expressing a concept: a single concept links
to *dog* in English and *chien* in French. Technically,
the lemma is a lexical entry, which may have multiple forms;
the English lemma just mentioned has two forms, *dog*
and *dogs*.  So we refer to the string that captures the dictionary
form of the lemma as the **lemma name**. The lemma and the lemma
name are distinct: the lemma is a class instance with properties
like a language, a name, hypernyms, antonyms, and so on,
while the lemma name is a string.

There are three different ways of spelling (expressing in writing) the first (most important) sense of the word *dog*.

In [428]:
[ln for ln in wn.synsets('dog')[0].lemma_names(lang='eng')]

['dog', 'domestic_dog', 'Canis_familiaris']

Below, just for fun, we implement a more general function than we need, `get_active_words_wn`, which collects all 5-letter words
in a given language (if the language is in WordNet!).

For example, let's collect all 5-letter French words.

In [418]:
f_wds = get_active_words_wn (lang='fra')


Since `f_wds` is a set, we can't just look at the first 20 elements, so instead we look at a random sample.

```
from random import sample

# random.sample needs a sequence (sets are rejected in Python 3.11+)
>>> sample(sorted(f_wds),20)
['reine',
 'mikvé',
 'geste',
 'orgue',
 'luire',
 'anode',
 'éluer',
 'jaune',
 'kobus',
 'hindi',
 'dingo',
 'osier',
 'lupin',
 'gruau',
 'ajuga',
 'rumen',
 'prise',
 'unix™',
 'axial',
 'gecko']
```

If you know French, you will notice that the set contains some pretty oddball words.



Again, it's a set, so we can't grab the first 20 elements.
We'll just randomly sample another 20 words, plus one other we already know about.

In [427]:
from random import sample

add_on = [w for w in f_wds if w.startswith('unix')]
sample(sorted(f_wds),20) + add_on

['mucor',
 'lycée',
 'paire',
 'combe',
 'kogia',
 'carvi',
 'score',
 'dévas',
 'chéri',
 'nèfle',
 'prêle',
 'nanti',
 'heurt',
 'privé',
 'nuire',
 'adoré',
 'golfe',
 'slave',
 'aotus',
 'curry',
 'unix™']

We note in passing that "Unix" with the trademark symbol counts as a 5-letter word.  Just one of many surprises you will experience once you start working with Unicode.
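To see why `unix™` slips through, here is a minimal sketch using only the standard library. `looks_like_active_word` is meant to mirror the conditions used in `get_active_words_wn`; `all_letters` is a hypothetical stricter check that is not part of this notebook.

```python
import unicodedata
from string import digits

def looks_like_active_word(s, length=5):
    """Mirror the word-list filter: lowercase, fixed length, no '_' and no ASCII digits."""
    return (s.islower() and len(s) == length and '_' not in s
            and set(digits).isdisjoint(s))

def all_letters(s):
    """Hypothetical stricter check: every character is in a Unicode letter category (L*)."""
    return all(unicodedata.category(ch).startswith('L') for ch in s)

word = 'unix\u2122'  # 'unix' followed by TRADE MARK SIGN
# The trademark sign is one code point and is not a cased character,
# so it does not affect islower() and adds exactly 1 to the length.
print(len(word), word.islower(), unicodedata.category(word[-1]))
print(looks_like_active_word(word), all_letters(word))
# Compatibility normalization rewrites the sign as two ordinary letters.
print(unicodedata.normalize('NFKC', word))
```

Note that `all_letters` would also reject hyphenated entries such as `so-so`, so whether it is an improvement depends on what you want the word list to contain.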

Here's the code for building a WordNet word list.

In [15]:
from nltk.corpus import wordnet as wn
# For a more natural list of words than nltk.words()
from string import ascii_lowercase,digits
digits = set(digits)


def get_active_words_wn (lang='eng'):
    return {ln for w in wn.all_synsets() for ln in w.lemma_names(lang=lang) 
                 if ln.islower()  and len(ln) == 5 and  '_' not in ln
                 and digits.intersection(ln) == set()}

def get_definitions(word_set,language = None):
    """
    Print the WordNet definitions of each word in word_set.

    TODO: `language` is not used yet.  Need to check that each synset
    has at least one lemma in the given language.
    """
    for wd in word_set:
        print(wd)
        for (i,ss) in enumerate(wn.synsets(wd)):
            print(f'{i+1}. {ss.definition()}.',end= '  ')
            print()
        print()
        print()



active_words = get_active_words_wn()


In [11]:
len(active_words)

4158

`active_words` is a set so we can't just look at the first 20 elements. 

Let's look at a random sample.

In [17]:
from random import sample
L = sample(sorted(active_words),20)
L

['merit',
 'zesty',
 'nitid',
 'bract',
 'so-so',
 'fryer',
 'urban',
 'scoot',
 'dimly',
 'aroid',
 'three',
 'tower',
 'snare',
 'wheal',
 'chivy',
 'gummy',
 'angry',
 'glare',
 'edged',
 'fovea']

In [18]:
get_definitions(L)

merit
1. any admirable quality or attribute.  
2. the quality of being deserving (e.g., deserving assistance).  
3. be worthy or deserving.  


zesty
1. having an agreeably pungent taste.  
2. marked by spirited enjoyment.  


nitid
1. bright with a steady but subdued shining.  


bract
1. a modified leaf or leaflike part just below and protecting an inflorescence.  


so-so
1. being neither good nor bad.  
2. in an acceptable (but not outstanding) manner.  


fryer
1. flesh of a medium-sized young chicken suitable for frying.  


urban
1. relating to or concerned with a city or densely populated area.  
2. located in or characteristic of a city or city life.  


scoot
1. run or move very quickly or hastily.  


dimly
1. in a dim indistinct manner.  
2. in a manner lacking interest or vitality.  
3. with a dim light.  


aroid
1. any plant of the family Araceae; have small flowers massed on a spadix surrounded by a large spathe.  
2. relating to a plant of the family Araceae.  


three

### Unicode and unicode code points

To go from a numerical code point to the corresponding unicode character,
use `chr`:

In [238]:
smiley = chr(0x1F600)
knight = chr(0x265E)
len(smiley + knight), smiley + knight

(2, '😀♞')

In [224]:
0x1F600, chr(0x1F600)

(128512, '😀')

In [223]:
0x1F600 == 128512

True

Hence

In [240]:
chr(128512)

'😀'

To go from character to code point, use `ord`:

In [241]:
ord(smiley)

128512

You will usually want to see this in hex, so do some string formatting.

In [244]:
f'{ord(smiley):#08x}'

'0x01f600'

Note that using "#" in the formatting code just produces a string that advertises
the fact that it represents a hexadecimal number (prefix "0x"). Compare:


In [245]:
f'{ord(smiley):08x}'

'0001f600'
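As a small illustration of these format codes, here is a sketch of a helper that produces the conventional `U+XXXX` label for a character. The name `codepoint_label` is an assumption for this example, not something defined elsewhere in the notebook.

```python
def codepoint_label(ch):
    """Return the conventional 'U+XXXX' label for a single character."""
    # 04X: uppercase hex, zero-padded to at least 4 digits (longer values are kept whole)
    return f'U+{ord(ch):04X}'

print(codepoint_label('\U0001F600'), codepoint_label('\u03b2'))

# Going the other way: a hex string (with or without the 0x prefix) back to a character
print(chr(int('0x01f600', 16)) == chr(int('0001f600', 16)))
```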

`ord` requires a single character argument.

In [249]:
ord('xy')

TypeError: ord() expected a character, but string of length 2 found

To keep this rule consistent, `ord` does not support the extended notion of
Unicode character, which admits some characters that require **two** unicode code points
(flags, some emoji). See the [Unicode org docs.](https://unicode.org/Public/emoji/4.0/emoji-sequences.txt)

In [437]:
# Regional indicators A + D: "AD" is the region code for Andorra
flag_of_andorra = '\U0001F1E6\U0001F1E9'
flag_of_poland = '\U0001F1F5\U0001F1F1'
print(len(flag_of_andorra),flag_of_andorra)
print(len(flag_of_poland),flag_of_poland)

2 🇦🇩
2 🇵🇱


In [252]:
ord(flag_of_andorra)

🇦🇩


TypeError: ord() expected a character, but string of length 2 found
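The two code points in each flag string are regional indicator symbols, one per letter of a two-letter region code. A minimal sketch of the mapping in both directions follows; the helper names `flag` and `region_of` are made up for this illustration.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A

def flag(region_code):
    """Map a two-letter region code such as 'PL' to its pair of regional indicator symbols."""
    return ''.join(chr(REGIONAL_INDICATOR_A + ord(c) - ord('A'))
                   for c in region_code.upper())

def region_of(flag_str):
    """Recover the two-letter region code from a flag string."""
    return ''.join(chr(ord(c) - REGIONAL_INDICATOR_A + ord('A')) for c in flag_str)

pl = flag('PL')
print(len(pl), pl, [f'{ord(c):X}' for c in pl])
print(region_of('\U0001F1E6\U0001F1E9'), region_of('\U0001F1F5\U0001F1F1'))
```

Since `ord` works on one code point at a time, applying it to each element of the string, as in the list comprehension above, is the way to inspect a flag.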

To use Unicode code points in strings, use `\U` and `\u` escapes.

In [236]:
len("\U0001F600"),"\U0001F600",len("\u265E"),"\u265E"

(1, '😀', 1, '♞')

Note: because of the escapes, the strings above are characters, not representations
of hexadecimal numbers.  Compare:

In [248]:
hex_str = f'{ord(smiley):08x}'
unicode_char = "\U0001F600"
print(len(hex_str),hex_str,len(unicode_char),unicode_char)

8 0001f600 1 😀


In [272]:
print('\U0001f4af')

💯


A lot of times you will have access to a keyboard that will let you 
do "literal" unicode character entry.  Use it! 

Or cut and paste from a window where you can!

In [274]:
alphabet = 'αβγδεζηθικλμνξοπρςστυφχψ'
len(alphabet),alphabet

(24, 'αβγδεζηθικλμνξοπρςστυφχψ')

In [276]:
ord(alphabet[1])

946

### Encodings

To represent unicode characters in a file or in a data stream traveling from
computer to computer, we need some conventions about how to **encode** characters.

Here are two different encodings of the same character:

In [255]:
b1, b2 = smiley.encode(encoding='utf8'),smiley.encode(encoding='utf16')
b1,b2

(b'\xf0\x9f\x98\x80', b'\xff\xfe=\xd8\x00\xde')

The result of such an encoding in Python is an instance of a different
sequence type called **bytes**, the same type we get when we read in a file
in binary mode (for example, a compiled program).  It's just
data.

In [258]:
type(b1),type(b2),b1 == b2

(bytes, bytes, False)

In [257]:
len(b1),len(b2)

(4, 6)

When a bytes instance represents a unicode text string, it is very
hard to do anything with it unless we know the encoding.

In [264]:
#Using the wrong encoding.  Utf8 impossibility
b2.decode(encoding='utf8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

And that's not the worst thing that can happen:


In [260]:
# Using the wrong encoding.  Nonsense!
b1.decode(encoding='utf16')

'\u9ff0肘'
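The same two failure modes can be reproduced on a smaller scale. The sketch below, using only built-in codecs, first checks that each encoding round-trips when the matching codec is used for decoding, then shows a wrong codec that raises no error at all and silently returns the wrong text.

```python
text = '\u00e9\U0001F600'  # e-acute followed by the grinning face

sizes = {}
for enc in ('utf-8', 'utf-16', 'utf-32'):
    data = text.encode(enc)
    # Decoding with the same codec always gives back the original string
    assert data.decode(enc) == text
    sizes[enc] = len(data)
    print(f'{enc:7} {len(data):2} bytes  {data!r}')

# Wrong codec, no exception: every byte is a valid latin-1 character,
# so the two UTF-8 bytes of e-acute come back as two unrelated characters.
garbled = '\u00e9'.encode('utf-8').decode('latin-1')
print(len(garbled), garbled)
```

The `utf-16` and `utf-32` sizes include the byte order mark that Python writes at the start of the output for those codecs.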

For a great discussion of the design features of encodings
and what representing textual information on a computer means,
see [the Real Python encodings guide.](https://realpython.com/python-encodings-guide/)

### The unicode sandwich

The basic principle in Python: treat all text strings as sequences of unicode characters, each represented by a unicode code point.  Python supports entering a character by its code point (`chr`, `\u` and `\U` escapes) and requesting the code point of a character (`ord`).

This was demonstrated above.  

Encoded strings are bytes: binary data.  A different type.  Textual unicode
data belongs on one side of a great divide; bytes strings on another.

This was illustrated above.

But what about the real world?  The world outside of Python.  The world
in which operating system vendors can disagree on what encoding to use, and in which
particular languages, for instance Japanese and Chinese, may have established their
own encodings before UTF-8 became a de facto default?

The way to deal with this is the **unicode sandwich**.

```
encoded representation

      \|/   Input: decode (somebody told you the encoding!)
      /|\
      
  unicode string     [your program goes here!]
  
      \|/   Output: encode (somebody, maybe your OS, told you what encoding to use!)
      /|\   
   
encoded  representation
```

In other words the only parts of your program that know
anything about encodings are the Input/Output (IO) parts (a little bit
of an idealization, but that's the goal!).
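A minimal sketch of the sandwich on a throwaway file, assuming (for illustration only) that the input arrives in Latin-1 and the output must be UTF-8. The function name `transform_file` and the file names are hypothetical.

```python
import os
import tempfile

def transform_file(in_path, out_path, in_encoding, out_encoding):
    """Decode on input, work on str in the middle, encode on output."""
    with open(in_path, 'rb') as f:
        text = f.read().decode(in_encoding)        # bottom slice: bytes -> str
    result = text.upper()                          # the filling: pure unicode, no encodings
    with open(out_path, 'wb') as f:
        f.write(result.encode(out_encoding))       # top slice: str -> bytes

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'in.txt')
    dst = os.path.join(tmp, 'out.txt')
    with open(src, 'wb') as f:
        f.write('ch\u00e9ri\n'.encode('latin-1'))  # 6 bytes on disk
    transform_file(src, dst, 'latin-1', 'utf-8')
    with open(dst, 'rb') as f:
        written = f.read()                         # 7 bytes on disk

print(written, written.decode('utf-8'))
```

In everyday code the same effect is usually achieved by passing `encoding=` to `open` in text mode; the explicit `decode` and `encode` calls are spelled out here only to make the two slices of bread visible.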

You don't need to care about whether you are dealing with
Greek, Latin, Tamil, Tibetan, or Tagbanwa characters,
or emoji.  All can be freely combined.  All will 
increase a string length by 1.

Adapted from 
the [Python unicode How To doc:](https://docs.python.org/3/howto/unicode.html)

In [292]:
import unicodedata

u = chr(946) + chr(233) + chr(0x0bf2) + chr(3973) + chr(6000) + smiley

print(f'{len(u)} characters in search of a word: {u}')
print()
for i, c in enumerate(u):
    print(f'{i} {c:^3} {ord(c):08x} {unicodedata.category(c)}', end=" ")
    print(unicodedata.name(c))

# Get numeric value of numeric character
print()
print(f'Tamil {u[2]} is {unicodedata.numeric(u[2])}')

6 characters in search of a word: βé௲྅ᝰ😀

0  β  000003b2 Ll GREEK SMALL LETTER BETA
1  é  000000e9 Ll LATIN SMALL LETTER E WITH ACUTE
2  ௲  00000bf2 No TAMIL NUMBER ONE THOUSAND
3  ྅  00000f85 Po TIBETAN MARK PALUTA
4  ᝰ  00001770 Lo TAGBANWA LETTER SA
5  😀  0001f600 So GRINNING FACE

Tamil ௲ is 1000.0


We use the `unicodedata` module to access the database info on each
character, displaying its name/description and its unicode category.

Tagbanwa is one of the scripts indigenous to the Philippines used by the Tagbanwa people and the Palawan people.

The unicode categories used above are

```
Ll Letter, lowercase
No Number, other
Po Punctuation, other
Lo Letter, other
So Symbol, other
```

See [compart.com unicode docs](https://www.compart.com/en/unicode/category) 
for more discussion of Unicode categories.

There's a lot of well-thought-out discussion of Unicode concepts such as Unicode categories, planes, and blocks on [compart.com.](https://www.compart.com/en/unicode/)

Things are only slightly more complicated with 2-character symbols.

In [441]:
v = '\U0001F1F5\U0001F1F1'
print(f'{v:^3}  {unicodedata.category(v[0])} {unicodedata.category(v[1])}', end=" ")
print(unicodedata.name(v[0]),unicodedata.name(v[1]))

🇵🇱   So So REGIONAL INDICATOR SYMBOL LETTER P REGIONAL INDICATOR SYMBOL LETTER L


Here "PL" is short for Poland.
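Flags are not the only symbols built from more than one code point. As a further sketch with the same `unicodedata` module, an accented letter can be stored either as one precomposed code point or as a base letter plus a combining mark; normalization (here the NFC form) maps the second spelling onto the first.

```python
import unicodedata

composed = '\u00e9'      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = 'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT, two code points

for s in (composed, decomposed):
    names = [unicodedata.name(c) for c in s]
    print(len(s), s, names)

# The two strings look the same when displayed but compare unequal...
print(composed == decomposed)
# ...until both are brought to the same normalization form.
nfc = unicodedata.normalize('NFC', decomposed)
print(nfc == composed, len(nfc))
```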

### Reading a large compressed file

The idea here is to present code that lets you process a large compressed
file without having to (a) uncompress the whole thing in memory or
(b) read the entire compressed file into memory.

Decide first what you want to do with the file contents.  We will use
the example of the Google Books word list.  We will read and uncompress it up
to a frequency threshold, and then stop.

We will iterate through the file line by line.  The inner loop
needs to process a single line.  Here's what that looks like
for this example file.

In [4]:
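def process_line(line,freq_dict,threshhold,debug=False,format2=False):
    """
    Parse one line of the word list and record the word in freq_dict.

    Return True to keep reading, False to stop (malformed line, or count
    below threshhold).  Note: debug=True assumes the 5-column format.
    """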
    try:
        # Expected line format goes here
        if format2:
             (wd, count) = line.split()
        else:
            (rank, wd, count, pct, cum_pct) = line.split()
    except ValueError:
        # Abort!  This aint happening.  Not necessarily an error.
        # Could be last line of a well-formatted file.
        print(line)
        return False
    try:
        # replace does not raise an Exception if there is no ",".
        # `int(...)` will for any uncoerceable string.
        int_ct = int(count.replace(",",''))
    except Exception as e:
        print(line)
        raise e
    # We'll stop processing when we get to low frequency stuff.
    if int_ct < threshhold:
        return False
    else:
        if not debug:
            freq_dict[wd] = int_ct
        else:
            freq_dict[wd] = (rank, count, pct, cum_pct)
        return True
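To make the parsing step concrete, here is a stripped-down sketch of just the split-and-convert part. `parse_frequency_line` is a hypothetical simplification, not a replacement for `process_line`. The sample line is reconstructed from the debug-mode entry for *the* shown later in this notebook; the exact whitespace used in the real file is an assumption.

```python
def parse_frequency_line(line):
    """Return (word, count) for a 5-column line, or None if the line does not fit."""
    fields = line.split()          # splits on any run of whitespace (spaces or tabs)
    if len(fields) != 5:
        return None
    rank, word, count, pct, cum_pct = fields
    # Counts are written with thousands separators, so strip the commas first
    return word, int(count.replace(',', ''))

sample_line = '2\tthe\t53,097,503,134\t5.937009%\t12.189017%\n'
print(parse_frequency_line(sample_line))
print(parse_frequency_line('\n'))
```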

### The data

The file loaded below is a truncated version of the word list in
[hackerb9's wordlist repository.](https://github.com/hackerb9/gwordlist)
This in turn is derived from the [Google corpus V3 20200217.](https://storage.googleapis.com/books/ngrams/books/datasetsv3.html)

Have a look at that webpage if you want an idea of the
value of the service hackerb9 is providing. It describes what the line-by-line
values for a Google ngram file are.  An extract:

```
As an example, here are the 3,000,000th and 3,000,001st lines from the a file of the English 1-grams (googlebooks-eng-all-1gram-20120701-a.gz):

circumvallate   1978   335    91
circumvallate   1979   261    91
```

This is [the
page with just the unigram data for that release.](http://storage.googleapis.com/books/ngrams/books/20200217/eng/eng-1-ngrams_exports.html)

Kaggle has released a page with links to a subset of this data, [a bigram corpus derived from
Google data of the same date,](https://www.kaggle.com/ketchupduck/google-2grams-20200217-english-fiction) which is more
difficult to work with for building a word list.

With no filtering, hackerb9's wordlist derived from
Google's ngram corpora contains 8 million words.  That's too many.
The rarest "words" have count 40.  Here are some of those:

```
0-7923-3178-8 
0-7923-2968-6
0-7923-1344-5
07-1499
```
So this list of 8M words is obviously too long if you want **just**
words, according to some "ordinary" definition of word.
Various attempts to work out a better frequency threshold than keeping all 8M are shown below.

The 8M words are surely too many in many ways,
but they may also be too few in other ways.  For example,
"butch" does not make the frequency cutoff in
this mega-corpus, supposedly including a representative
sample of modern texts.  Is this something
hackerb9 did, or is it Google's sampling method?
An open question I haven't the time to resolve.


### Compression

The file in the example is a gzip file; gzip is one of several common ways
to compress.  The discussion of compression is placed next to the
discussion of unicode encodings, because what is compressed is some encoding
of a file.  Therefore what is uncompressed will still need to
be decoded before becoming unicode.  We illustrate this below.

The idea is this: decompress using the `gzip` module. Given a file path,
`gzip.open` returns a binary file object (the default mode is `'rb'`), an iterator that
allows us to loop through the uncompressed file contents one
line at a time, each line a `bytes` object. This is a very transparent way of doing
decompression: `open(path)` becomes `gzip.open(path)`.
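Here is a small self-contained sketch of that behavior on a throwaway file (the file name and contents are made up). It also shows an alternative: opening in text mode (`'rt'`) with an `encoding` argument, in which case `gzip.open` does the decoding and yields `str` lines.

```python
import gzip
import os
import tempfile

lines = ['lyc\u00e9e 10\n', 'n\u00e8fle 7\n']

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'words.txt.gz')
    # Compress some UTF-8 encoded text into a small gzip file
    with gzip.open(path, 'wb') as out:
        out.write(''.join(lines).encode('utf8'))

    # Default mode is binary: each line is bytes and still needs decoding
    with gzip.open(path) as f:
        as_bytes = [ln for ln in f]

    # Text mode: gzip.open decodes each line for us
    with gzip.open(path, 'rt', encoding='utf8') as f:
        as_text = [ln for ln in f]

print(as_bytes)
print(as_text)
```

Either way the file is decompressed incrementally as the loop advances, so stopping early means the rest of the file is never decompressed.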

Let's illustrate two different situations, one a little
more complicated than the other.

First suppose this file exists as a file on your
machine.  The code looks like this:

```
with gzip.open(path) as go:    gzip.open(path) returns a file stream we
                               can iterate through line by line; the
                               lines are uncompressed.
for (i,ln) in enumerate(go):   We begin iterating line by line.
line = ln.decode(...):         We decode the line, which is (uncompressed) bytes.
```

In [5]:
import gzip
import os.path

####  File Location info
wd = '/Users/gawron/Desktop/src/sphinx/python_for_ss/colab_notebooks/' \
         'python-for-social-science-drafts'
# Google books
fn,format2 = 'gwordlist/gwordlist-master/frequency-all.txt.gz',False
# Google ngrams.  The two don't seem very different apart from format
#fn,format2 = 'gwordlist/gwordlist-master/1gramsbyfreq.txt.gz',True
####  End File Location info

### Code parameters
path = os.path.join(wd, fn)
## If for example we use threshhold freq of 5,000, then
## a "word" must be used at least 5000 times to make the vocab list.
## This is a very high number, but this is also Google Books!
## using 40 means all words in the list will be used.
freq_dict_40,threshhold,encoding,debug = dict(),40,'utf8',False
#debug_stopper = 19
debug_stopper = None

with gzip.open(path) as go:
    for (i,ln) in enumerate(go):
        if i==0 :
            continue
        elif i == debug_stopper:
            break
        else:
            line = ln.decode(encoding=encoding)
            if process_line(line,freq_dict_40,threshhold,debug=debug,
                           format2=format2):
                continue
            else:
                print(f'Exiting on line {i:,}')
                break
print(f'{len(freq_dict_40):,} words in Frequency Dict')

Exiting on line 7,919,827
7,919,826 words in Frequency Dict


In [8]:
sum(freq_dict_40.values())

838432998263

Situation 2.  The file is on the web.

The problem is that the `gzip` module wants
to be handed either a string which gives
the path to a file in the current file
system, or a file-like object, which
provides a binary data stream from an already opened file.
How do we connect this pipeline to a stream
from the web?

Solution.  We use `urllib.request.urlopen` to
download the raw bytes from the file on the web.
Then we use the `io.BytesIO` class to go from
raw bytes to a file-like object containing
those bytes, suitable for a `gzip.GzipFile`
instance to decompress.

```
Line 8: resp (short for response) is a stream.
Line 11: resp.read() is the downloaded bytes.  The sad thing is we now load
         the entire compressed file into memory all at once. That's one problem
         with this solution.  No workaround found yet.
Line 11: bts is a file-like object (file stream)
         that we can pass to the gzip.GzipFile class.  It is
         still compressed binary data, so iterating through it line by line
         would not give us lines of text.
Line 12: The gzip.GzipFile instance consumes bts to create an iterable (gzipfile) that
         yields uncompressed bytes, and can be iterated through line by line.
         So as we loop we uncompress. Note the uncompressed file does
         not have to reside in memory all at once.
Line 13: We begin iterating line by line.
Line 17: We decode ln, which is a line of (uncompressed) bytes.
```

In [456]:
from io import BytesIO
from urllib.request import urlopen
# Must visit raw version of the github repository for single file downloads
raw_github = 'https://raw.githubusercontent.com/hackerb9/' \
             'gwordlist/master/frequency-all.txt.gz'
# The Google unigram min freq is 40.  Let's try 5000
freq_dict_5000,threshhold = dict(),5_000
resp = urlopen(raw_github)

# Create a FileLike Object usable by a Gzipfile instance
with BytesIO(resp.read()) as bts:
    with gzip.GzipFile(fileobj=bts) as gzipfile:
        for (i,ln) in enumerate(gzipfile):
            if i==0:
                continue
            else:
                line = ln.decode(encoding=encoding)
                if process_line(line,freq_dict_5000,threshhold):
                    continue
                else: 
                    print(f'Exiting on line {i}')
                    break

def pfreq(wd,freq_dict=freq_dict_40):
    print(f'{freq_dict[wd]:,}')

print(f'{len(freq_dict_5000):,} words in Frequency Dict')

Exiting on line 676232
676,231 words in Frequency Dict
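The code above reads the whole compressed file into memory via `resp.read()`. One possible way around that, sketched here under assumptions, is to hand a readable binary stream straight to `gzip.GzipFile`, which only needs a `read` method on its `fileobj` when decompressing. The helper `iter_gzip_lines` and the stand-in class `ForwardOnlyBytes` are hypothetical, and the sketch is exercised only on an in-memory stream that refuses to seek; passing the object returned by `urlopen` in the same way is an assumption that has not been tested against the GitHub URL.

```python
import gzip
import io

def iter_gzip_lines(fileobj, encoding='utf8'):
    """Yield decoded lines from a gzip-compressed binary stream, decompressing as we go."""
    with gzip.GzipFile(fileobj=fileobj) as gz:
        for raw in gz:
            yield raw.decode(encoding)

class ForwardOnlyBytes(io.BytesIO):
    """Stand-in for a network stream: it can be read, but not rewound."""
    def seekable(self):
        return False
    def seek(self, *args):
        raise io.UnsupportedOperation('seek')
    def tell(self):
        raise io.UnsupportedOperation('tell')

payload = 'rank word\n1 the\n2 of\n'
stream = ForwardOnlyBytes(gzip.compress(payload.encode('utf8')))
streamed_lines = list(iter_gzip_lines(stream))
print(streamed_lines)

# Hypothetical use with a web resource (assumption, not run here):
#     for line in iter_gzip_lines(urlopen(raw_github)):
#         ...
```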


In [457]:
len(freq_dict_5000)

676231

In [458]:
pfreq('men'),pfreq('women')

454,777,905
312,387,136


(None, None)

### Eyeballing the rarest words

In [460]:
#il = sorted(freq_dict_40.items(),key=lambda x:x[1], reverse=True)
il = freq_dict_40.items()

In [461]:
in_list,min_ct = il,40
# Still getting some mighty strange "words" with 5K cutoff!
# (add `ct >= min_ct and` to the condition to apply a count cutoff as well)
filtered_il = [(wd,ct) for (wd,ct) in in_list
               if digits.intersection(wd) == set()
               and not wd.istitle() and not wd.isupper()]
len(il), len(filtered_il), filtered_il[-50:]

(7919827,
 2852598,
 [('acciderc', 40),
  ('accesj', 40),
  ('accepte_', 40),
  ('acceptance.This', 40),
  ("a'ccept", 40),
  ('accempany', 40),
  ('accedervi', 40),
  ('acceae', 40),
  ('acagainst', 40),
  ('abuse.f', 40),
  ('abundantlyclear', 40),
  ('abtiut', 40),
  ('absurdissimas', 40),
  ('absurd.f', 40),
  ('absoVOL', 40),
  ('absolverc', 40),
  ('absoluttly', 40),
  ('absolte', 40),
  ('abscindatur_.', 40),
  ('abrupl', 40),
  ('abrasionem', 40),
  ('above.That', 40),
  ('aboveshown', 40),
  ('aboveone', 40),
  ('aboutedly', 40),
  ('abouli', 40),
  ('aborderait', 40),
  ('abominablv', 40),
  ('abolut', 40),
  ("Abolish'n", 40),
  ("abo'", 40),
  ('abl.e', 40),
  ('abieast', 40),
  ('abgetrotzt', 40),
  ('abgeschildert', 40),
  ('abdi_', 40),
  ('abcjut', 40),
  ('ABCjCLIO', 40),
  ('abbracciarono', 40),
  ('abbraccia_.', 40),
  ('abandonnat', 40),
  ('aavant', 40),
  ('aarred', 40),
  ('aapers', 40),
  ('aamission', 40),
  ('aakcd', 40),
  ('aafr', 40),
  ('aadin', 40),
  ('a

### Using/evaluating the word list

Complaint:  Our WordNet-generated list of active words has a lot of oddballs on it.

Let's use the Google Books word list to filter our WordNet-derived list
of active 5-letter words, built earlier in this notebook.

Here are some oddball WordNet 5-letter words:

In [20]:
sample(sorted(active_words),20)

['oxlip',
 'grume',
 'moray',
 'felon',
 'queue',
 'spark',
 'moody',
 'knave',
 'clank',
 'tears',
 'speak',
 'flout',
 'lxxiv',
 'dopey',
 'judas',
 'snort',
 'draba',
 'hunch',
 'rumba',
 'galax']

In [21]:
len(active_words)

4158

How rare can our Google words be and still provide useful filtering?  Let's try 2,000 as the threshold k
(words below frequency k in the ginormous Google Books corpus are ignored).
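The dictionaries `freq_dict_2000`, `freq_dict_1000`, and so on used below are not constructed in the cells shown. One way they might be derived, assuming a full dictionary such as `freq_dict_40` is already loaded, is to filter it by count. The helper names and the toy counts in this sketch are invented for illustration (only the count for *the* comes from the word list).

```python
def apply_threshold(freq_dict, k):
    """Keep only the words whose count is at least k."""
    return {wd: ct for wd, ct in freq_dict.items() if ct >= k}

def survivors_by_threshold(words, freq_dict, thresholds):
    """For each threshold k, count how many of `words` survive the frequency filter."""
    return {k: sum(1 for w in words if freq_dict.get(w, 0) >= k)
            for k in thresholds}

# Toy stand-in for freq_dict_40; all counts except 'the' are made up
toy_freqs = {'the': 53_097_503_134, 'hello': 1_500, 'bruin': 300, 'aakcd': 40}
toy_words = {'hello', 'bruin', 'butch'}   # 'butch' is absent from toy_freqs

print(apply_threshold(toy_freqs, 2_000))
print(survivors_by_threshold(toy_words, toy_freqs, [2_000, 1_000, 100, 40]))
```

With the real data this would be called as, for example, `freq_dict_2000 = apply_threshold(freq_dict_40, 2_000)`.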

In [322]:
active_words2 = {w for w in active_words if w in freq_dict_2000}

len(active_words2)

3344

Here's the bathwater:

In [323]:
eliminated = active_words - active_words2

Let's examine this bathwater for babies:

In [335]:
import random
random.sample(sorted(eliminated),60)

['abele',
 'sprat',
 'hippo',
 'gulch',
 'picot',
 'dotty',
 'donna',
 'swami',
 'orach',
 'sulla',
 'lyssa',
 'pappa',
 'verso',
 'redux',
 'rubel',
 'egest',
 'krone',
 'sigma',
 'skive',
 'minge',
 'henry',
 'annex',
 'benne',
 'savin',
 'arras',
 'jinks',
 'ravel',
 'brome',
 'psalm',
 'titty',
 'butte',
 'kraft',
 'amort',
 'bruin',
 'sadhu',
 'molly',
 'goody',
 'daddy',
 'aegir',
 'bosie',
 'athar',
 'chile',
 'daisy',
 'immix',
 'costa',
 'rugby',
 'cager',
 'midge',
 'dhava',
 'welch',
 'plage',
 'poyou',
 'galax',
 'knawe',
 'testa',
 'piper',
 'roble',
 'scoke',
 'butch',
 'lally']

Some definite babies found when k = 2,000!

```
annex
daisy
bruin
rugby
psalm
butch
```

Try lowering k to 1,000.

In [337]:
active_words1 = {w for w in active_words if w in freq_dict_1000}
eliminated1 = active_words - active_words1
len(active_words), len(active_words1)

(4158, 3367)

Notice the very small increase in vocab size, from 3344 to 3367.

The bathwater:

In [338]:
import random
random.sample(sorted(eliminated1),60)

['kasha',
 'hakim',
 'seine',
 'xcvii',
 'osier',
 'tabor',
 'selva',
 'hello',
 'ilama',
 'testa',
 'baron',
 'coney',
 'so-so',
 'spawl',
 'jemmy',
 'tiyin',
 'cirio',
 'judas',
 'pokey',
 'benne',
 'co-ed',
 'lemma',
 'pinto',
 'cypre',
 'burry',
 'butch',
 'lynch',
 'quint',
 'mamba',
 'table',
 'kempt',
 'salat',
 'beery',
 'vouge',
 'baboo',
 'fatso',
 'cytol',
 'cohoe',
 'bravo',
 'spiff',
 'reccy',
 'creel',
 'chino',
 'clegg',
 'corps',
 'rouge',
 'typic',
 'colly',
 'carol',
 'lough',
 'lanai',
 'tubby',
 'whish',
 'xlvii',
 'sally',
 'daisy',
 'stoke',
 'barde',
 'pipit',
 'sigma']

Still some babies!

```
hello (hello!)
judas (probably established as a common noun)
beery
corps
rouge
tubby
sigma
```

How about 500?

In [342]:
active_words500 = {w for w in active_words if w in freq_dict_500}
eliminated500 = active_words - active_words500
print(len(active_words), len(active_words500))
random.sample(sorted(eliminated500),60)

4158 3375


['deist',
 'wedel',
 'natty',
 'pavis',
 'marri',
 'raita',
 'savin',
 'punic',
 'scone',
 'hi-fi',
 'jawan',
 'xlvii',
 'hello',
 'thane',
 'viola',
 'brail',
 'dicer',
 'leone',
 'hadji',
 'eggar',
 'barde',
 'agama',
 'sabra',
 'draba',
 'hewer',
 'etude',
 'kiley',
 'delta',
 'south',
 'china',
 'gigue',
 'benny',
 'rouge',
 'liger',
 'lotte',
 'paseo',
 'allis',
 'braky',
 'annex',
 'kalif',
 'caddy',
 'boner',
 'xcvii',
 'imaum',
 'phlox',
 'boney',
 'liege',
 'marsh',
 'xliii',
 'ganef',
 'armet',
 'eblis',
 'momma',
 'liman',
 'oxbow',
 'letch',
 'laver',
 'verso',
 'deism',
 'agave']

Babies galore! This is getting tedious.

In [348]:
active_words100 = {w for w in active_words if w in freq_dict_100}
eliminated100 = active_words - active_words100
print(len(active_words), len(active_words100))
random.sample(sorted(eliminated100),60)

4158 3381


['doyly',
 'groak',
 'braky',
 'valse',
 'boule',
 'pommy',
 'orpin',
 'butch',
 'dawah',
 'nance',
 'jagua',
 'vista',
 'pater',
 'skeet',
 'belle',
 'tubby',
 'mimer',
 'pruno',
 'roman',
 'indri',
 'whang',
 'grail',
 'chine',
 'clxxx',
 'hydra',
 'gonzo',
 'fleck',
 'pavan',
 'bruin',
 'villa',
 'momma',
 'press',
 'cypre',
 'khadi',
 'cohoe',
 'pacha',
 'argal',
 'hogan',
 'salat',
 'bazar',
 'whish',
 'brant',
 'leech',
 'spiff',
 'armet',
 'vroom',
 'lanai',
 'halma',
 'lazar',
 'arras',
 'codex',
 'hippo',
 'laver',
 'blitz',
 'cronk',
 'hi-fi',
 'henry',
 'jello',
 'tench',
 'emery']

There are many babies here:

```
bruin
villa
momma
press
gonzo
fleck
hi-fi
jello (!)
emery

```
Lowering the threshold from 100 to 40 adds exactly one word to our word list. 

Still babies. For example the mysteriously absent "butch" is still absent.  

In [462]:
active_words40 = {w for w in active_words if w in freq_dict_40}
eliminated40 = active_words - active_words40
print(len(active_words), len(active_words40))
random.sample(sorted(eliminated40),60)

4158 3382


['boule',
 'muser',
 'roper',
 'islay',
 'aleph',
 'molly',
 'serin',
 'pichi',
 'anele',
 'hello',
 'jawan',
 'plyer',
 'bosie',
 'pruno',
 'roach',
 'minty',
 'quint',
 'stein',
 'elver',
 'queen',
 'hadji',
 'gamba',
 'indri',
 'paseo',
 'nacho',
 'sigeh',
 'fermi',
 'braky',
 'bodge',
 'tenno',
 'dixie',
 'sissy',
 'nicad',
 'boeuf',
 'pavan',
 'grail',
 'leppy',
 'blanc',
 'gulch',
 'blitz',
 'vanda',
 'deism',
 'serge',
 'ottar',
 'caddy',
 'etude',
 'babka',
 'liger',
 'munja',
 'gemma',
 'no-go',
 'osier',
 'monas',
 'terry',
 'randy',
 'lxvii',
 'titan',
 'lemma',
 'south',
 'lapin']

An incomplete list of babies:

In [470]:
for wd in "butch bruin villa momma press gonzo fleck hi-fi "\
          "jello emery metro carol curry roach lemma hello south".split():
    print(f'{wd} {wd in active_words} {wd in active_words40}')

butch True False
bruin True False
villa True False
momma True False
press True False
gonzo True False
fleck True False
hi-fi True False
jello True False
emery True False
metro True False
carol True False
curry True False
roach True False
lemma True False
hello True False
south True False


Bottom line: using frequency in Google ngrams as a criterion for filtering
out some of WordNet's odder words is not working very well yet.

This could be because of issues with hackerb9's filtering code.
It could be because of Google.  Based on looking at **butch**, it appears to be
something about hackerb9's methods.

The facts are below.

### Why this may be a problem with the word set, not Google Ngrams

Here's one very odd omission from the freq dict for the whole word set.

In [36]:
freq_dict_40['butch']

KeyError: 'butch'

If it's a Google problem, that word should just have a very small frequency in the original set.

Let's check.

First some background numbers.  We use the count of *the* together with its percentage figure to calculate the size of the corpus.

This entry was created in debug mode:

In [25]:
freq_dict_00['the']

('2', '53,097,503,134', '5.937009%', '12.189017%')

In [28]:
# 53097503134 = .05937 x N
# so N = 53097503134/.05937
N = 53_097_503_134/.05937
# Nearly 1 trillion words
print(f'{N:,}')

894,349,050,597.9451


Another way, another answer: summing the counts in the frequency dictionary gives a smaller total.
hackerb9 gives a reason for this discrepancy: filtering of, for example, punctuation.  I believe
the percentages are closer to right.

In [9]:
sum(freq_dict_40.values())

838432998263

In [30]:
# The frequency of "butch" in its peak year (c. 1998) 
# according to the Google Ngrams viewer
X = .0000414287 * .01

The expected count of *butch* if X, its max frequency, were maintained throughout all the years in the corpus (N is the aggregate of all texts for all years):

In [31]:
N*X

370517.18512507086

The expected count if Y, the min frequency of *butch*, were maintained throughout all the years in the corpus:

In [32]:
Y = .00000184 * .01
N*Y

16456.02253100219

Both numbers are well above the Google Ngram cutoff of 40.  So this argues that something in hackerb9's filtering methods has filtered out the word **butch**.
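The arithmetic of this argument can be collected into two small functions. This is only a restatement of the calculation above, using the same figures; the function names are invented for the sketch, and the percentages are the ones quoted in this notebook (rounded to 5.937% for *the*).

```python
def corpus_size(count, pct):
    """Estimate total corpus tokens from one word's count and its percentage share."""
    return count / (pct / 100)

def expected_count(n_tokens, pct):
    """Expected occurrences of a word whose relative frequency stays at pct percent."""
    return n_tokens * (pct / 100)

GOOGLE_MIN_COUNT = 40

n_est = corpus_size(53_097_503_134, 5.937)     # from the entry for 'the'
high = expected_count(n_est, 0.0000414287)     # peak-year frequency of 'butch'
low = expected_count(n_est, 0.00000184)        # minimum frequency of 'butch'

print(f'{n_est:,.0f} tokens; between {low:,.0f} and {high:,.0f} expected occurrences')
print(low > GOOGLE_MIN_COUNT and high > GOOGLE_MIN_COUNT)
```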