# Counting words

## Frequency

The frequency of a word in a text, its unusual length, or both can tell a lot about the text.

In this Jupyter notebook, I discuss some counting techniques.

In [183]:
#Importing libs 
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import nltk

In [184]:
#Loading the example texts (text3 is the book of Genesis)
from nltk.book import *

# Counting words

In [81]:
#Counting the number of tokens (words and punctuation marks) in Genesis
len(text3)

44764

In [90]:
#Print each distinct word of Genesis
print(set(text3))

{'o', 'Leummim', 'trained', 'Kenizzites', 'fowl', 'scarlet', 'reward', 'Beerlahairoi', 'Adam', 'enlarge', 'afflict', 'meanest', 'couching', 'dominion', 'on', 'Hushim', 'submit', 'They', 'waited', 'Cause', 'wor', 'if', 'Bedad', 'cloud', 'strange', 'hearth', 'buy', 'beari', 'rose', 'feet', 'rebuked', 'them', 'Look', 'destroy', 'si', 'east', 'descending', 'ruler', 'young', 'chariots', ';)', 'ones', 'Babel', 'birthright', 'ended', 'ri', 'wick', 'camel', 'verified', 'vale', 'Cursed', 'liest', 'overdrive', 'child', 'trough', 'stood', 'fifth', 'knowest', 'conceal', 'alo', 'wrong', 'early', 'quiver', 'womenservan', 'charge', 'olive', 'Even', 'much', 'truly', 'Wherefore', 'love', 'touching', 'service', 'kill', 'Tarshish', 'together', 'tithes', 'travail', 'wouldest', 'thoughts', 'evening', 'guard', 'countries', 'divided', 'perfect', 'Do', 'Abimael', 'Keturah', 'distress', 'company', 'occasion', 'This', 'gods', 'stay', 'kine', 'mourning', 'old', 'smote', 'silv', 'servan', 'Ishmeelites', 'bondmen'

In [89]:
#Sorting the words in alphabetical order
print(sorted(set(text3)))

['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)', 'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech', 'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', 'Adam', 'Adbeel', 'Admah', 'Adullamite', 'After', 'Aholibamah', 'Ahuzzath', 'Ajah', 'Akan', 'All', 'Allonbachuth', 'Almighty', 'Almodad', 'Also', 'Alvah', 'Alvan', 'Am', 'Amal', 'Amalek', 'Amalekites', 'Ammon', 'Amorite', 'Amorites', 'Amraphel', 'An', 'Anah', 'Anamim', 'And', 'Aner', 'Angel', 'Appoint', 'Aram', 'Aran', 'Ararat', 'Arbah', 'Ard', 'Are', 'Areli', 'Arioch', 'Arise', 'Arkite', 'Arodi', 'Arphaxad', 'Art', 'Arvadite', 'As', 'Asenath', 'Ashbel', 'Asher', 'Ashkenaz', 'Ashteroth', 'Ask', 'Asshur', 'Asshurim', 'Assyr', 'Assyria', 'At', 'Atad', 'Avith', 'Baalhanan', 'Babel', 'Bashemath', 'Be', 'Because', 'Becher', 'Bedad', 'Beeri', 'Beerlahairoi', 'Beersheba', 'Behold', 'Bela', 'Belah', 'Benam', 'Benjamin', 'Beno', 'Beor', 'Bera', 'Bered', 'Beriah', 'Bethel', 'Bethlehem', 'Bethuel

In [92]:
#print the number of distinct words of Genesis
len(set(text3))

2789

We can calculate how many times, on average, each distinct word is used in a text.

In [9]:
len(text3)/len(set(text3)) #total number of words of the text divided by total number of distinct words of the text

16.050197203298673

On average, a word of that text is used about 16 times.
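The arithmetic behind that average can be sketched on a small token list. The list `tokens` below is made up for illustration (an assumption, not taken from Genesis), and punctuation counts as a token, just as it does in the NLTK texts.

```python
# Toy token list standing in for text3 (hypothetical data, not from Genesis)
tokens = ['the', 'cat', 'and', 'the', 'dog', 'and', 'the', 'bird', '.']

total_tokens = len(tokens)          # every occurrence counts, punctuation included
distinct_tokens = len(set(tokens))  # set() keeps a single copy of each token

# Average number of occurrences per distinct token
average_use = total_tokens / distinct_tokens
print(total_tokens, distinct_tokens, round(average_use, 2))  # prints 9 6 1.5
```

The same division applied to Genesis gives 44764 / 2789, which is about 16.05.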

We may also want to know what percentage of a text a given word accounts for.

In [11]:
#Counting word God
text3.count('God')

231

In [12]:
#Counting word Abraham
text3.count('Abraham')

129

In [15]:
#Calculating the percentage of the presence of the word God in the text
round((text3.count('God')/len(text3)) * 100, 2)

0.52

In [17]:
#Calculating the percentage of the presence of the word Abraham in the text
round((text3.count('Abraham')/len(text3)) * 100, 2)

0.29

As you can see, the word God appears more often than the word Abraham. However, each accounts for less than 1% of the text.

### Defining a function to calculate lexical diversity on average and percentage of a word in a text

An interesting observation concerns the lexical diversity of a text, which indicates how rich its vocabulary is. A long text can have a poor vocabulary, and a short text can have a rich one.

Below I define two functions to measure that richness. The first measures how many times, on average, a word appears in the text. The second measures how diversified the text is, as the percentage of distinct words among all words. The closer that percentage is to 100%, the lexically richer the text.

In [34]:
#Defining lexical diversity function
def lexical_diversity(text):
    #It prints how many times, on average, a word is used in the text (nothing is returned)
    print('On average, a word appears {} times in the text.'.format(round(len(text)/len(set(text)), 2)))

In [35]:
#Defining percentage of lexical diversity function
def lexical_diversity_rate(text):
    #It prints the distinct words as a percentage of all words (nothing is returned)
    print('The lexical diversity is in {}%.'.format(round((len(set(text))/len(text))*100, 2)))

In [36]:
#Testing lexical diversity function
lexical_diversity(text3)

On average, a word appears 16.05 times in the text.


In [37]:
#Testing lexical diversity rate function
lexical_diversity_rate(text3)

The lexical diversity is in 6.23%.


To decide whether a text has low or high lexical diversity, looking at the rate alone is not enough. It is also necessary to study other texts of the same class in order to understand the patterns of that text type.
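Such a comparison is easier when the rate comes back as a number rather than a printed sentence. The sketch below assumes a helper named `diversity_rate` and two made-up toy texts (both hypothetical). With NLTK loaded, texts such as `text3` could be passed instead.

```python
def diversity_rate(tokens):
    """Return distinct tokens as a percentage of all tokens (0.0 for an empty text)."""
    if not tokens:
        return 0.0
    return round(len(set(tokens)) / len(tokens) * 100, 2)

# Hypothetical toy texts; real use would compare texts of the same class
texts = {
    'repetitive': 'and he said and he said and he went'.split(),
    'varied': 'the quick brown fox jumps over a lazy dog'.split(),
}

rates = {name: diversity_rate(tokens) for name, tokens in texts.items()}

# Listing the texts from the least to the most diverse
for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print('{}: {}%'.format(name, rate))
```

Both toy texts have nine tokens. Keeping the lengths similar matters, because longer texts tend to repeat words more and so tend to show a lower rate.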

The function below gives the presence rate of a word in a text.

In [50]:
#Building word presence rate function in the text
def word_presence_rate_in_the_text(text, word):
    #It prints the occurrences of the word as a percentage of all words (nothing is returned)
    print('The presence rate of this word is about {}%.'.format(round((text.count(word)/len(text))*100, 4)))

In [51]:
#Testing word presence rate function
word_presence_rate_in_the_text(text3, 'God')

The presence rate of this word is about 0.516%.


In [52]:
#Testing the function on list
test_list = ['Hello', '!', 'My', 'names', 'is', 'Bruno', '.']

In [53]:
lexical_diversity(test_list) #Applying lexical diversity function

On average, a word appears 1.0 times in the text.


In [54]:
lexical_diversity_rate(test_list) #Applying lexical diversity rate function

The lexical diversity is in 100.0%.


In [56]:
word_presence_rate_in_the_text(test_list, 'names') #Word presence rate function

The presence rate of this word is about 14.2857%.


In [57]:
len(test_list) #Getting the total number of words in the list

7
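The functions above print their results, so the numbers cannot be stored or compared afterwards. A possible variant, sketched here with hypothetical names (`lexical_diversity_value` and `word_presence_rate_value`), returns the numbers instead and guards against an empty text.

```python
def lexical_diversity_value(text):
    """Average number of occurrences per distinct token (0.0 for an empty text)."""
    return round(len(text) / len(set(text)), 2) if text else 0.0

def word_presence_rate_value(text, word):
    """Occurrences of word as a percentage of all tokens (0.0 for an empty text)."""
    return round(text.count(word) / len(text) * 100, 4) if text else 0.0

test_list = ['Hello', '!', 'My', 'names', 'is', 'Bruno', '.']
average = lexical_diversity_value(test_list)
rate = word_presence_rate_value(test_list, 'names')
print(average, rate)  # prints 1.0 14.2857
```

The values match the printed results above, and the caller decides how to format or compare them.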

# Picking up what makes a text distinct
## Frequency distribution

The methods below work on word frequencies.

FreqDist( ) is a class that builds a dictionary-like object where the words (tokens) of the text are the keys and their frequencies in the text are the values.

In [188]:
#Building the frequency distribution of the words in the book of Genesis
most_freq_words3 = FreqDist(text3)
most_freq_words3

FreqDist({',': 3681, 'and': 2428, 'the': 2411, 'of': 1358, '.': 1315, 'And': 1250, 'his': 651, 'he': 648, 'to': 611, ';': 605, ...})
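To list only the top entries, `most_common` can be used. `FreqDist` inherits it from `collections.Counter`, so the sketch below uses `Counter` on a made-up token list (an assumption for illustration) to stay self-contained.

```python
from collections import Counter

# Hypothetical toy tokens; with NLTK loaded, FreqDist(text3) would take Counter's place
tokens = ['and', 'the', 'and', ',', 'the', 'and', 'God', 'and', 'the']
freq = Counter(tokens)

# most_common(n) returns the n most frequent (token, count) pairs, highest count first
top_two = freq.most_common(2)
print(top_two)  # prints [('and', 4), ('the', 3)]
```

On the notebook's own object, `most_freq_words3.most_common(50)` would return the 50 most frequent tokens of Genesis.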

To access the values (the frequencies), use values( ).

In [266]:
#Showing the frequency of each word
most_freq_words3.values()

dict_values([12, 2411, 5, 231, 11, 28, 2428, 111, 1315, 1250, 317, 9, 1, 3681, 1, 605, 4, 139, 47, 1358, 6, 2, 2, 32, 476, 29, 116, 254, 11, 238, 57, 509, 290, 44, 10, 157, 98, 1, 648, 1, 10, 20, 163, 16, 71, 342, 8, 588, 6, 64, 5, 72, 198, 17, 11, 1, 52, 1, 14, 15, 16, 590, 70, 46, 4, 184, 1, 1, 2, 4, 40, 43, 2, 7, 5, 58, 9, 21, 107, 651, 13, 16, 267, 2, 9, 60, 5, 10, 3, 611, 27, 230, 297, 1, 1, 61, 13, 59, 52, 32, 5, 5, 1, 1, 5, 81, 31, 49, 1, 3, 4, 2, 8, 80, 31, 16, 43, 1, 2, 1, 88, 16, 3, 153, 1, 43, 79, 3, 14, 18, 2, 1, 5, 50, 9, 39, 20, 8, 93, 46, 114, 63, 5, 1, 132, 4, 2, 8, 8, 245, 21, 8, 387, 11, 7, 2, 1, 53, 484, 21, 82, 4, 253, 10, 8, 3, 282, 139, 65, 18, 2, 12, 3, 1, 4, 57, 4, 3, 4, 3, 1, 54, 21, 119, 16, 112, 249, 166, 1, 44, 91, 1, 224, 4, 4, 7, 24, 37, 110, 122, 1, 5, 10, 3, 8, 1, 86, 2, 4, 12, 13, 3, 13, 3, 6, 40, 25, 106, 2, 3, 17, 19, 2, 16, 14, 27, 18, 1, 12, 2, 50, 100, 1, 2, 4, 20, 8, 1, 1, 11, 1, 22, 1, 1, 4, 19, 12, 1, 1, 81, 2, 13, 18, 6, 272, 5, 1, 2, 73, 47, 1

keys( ) shows the keys/words.

In [189]:
#Printing the keys
keys_of_most_freq_words3 = most_freq_words3.keys()
keys_of_most_freq_words3

dict_keys(['In', 'the', 'beginning', 'God', 'created', 'heaven', 'and', 'earth', '.', 'And', 'was', 'without', 'form', ',', 'void', ';', 'darkness', 'upon', 'face', 'of', 'deep', 'Spirit', 'moved', 'waters', 'said', 'Let', 'there', 'be', 'light', ':', 'saw', 'that', 'it', 'good', 'divided', 'from', 'called', 'Day', 'he', 'Night', 'evening', 'morning', 'were', 'first', 'day', 'a', 'firmament', 'in', 'midst', 'let', 'divide', 'made', 'which', 'under', 'above', 'firmame', 'so', 'Heaven', 'second', 'gathered', 'together', 'unto', 'one', 'place', 'dry', 'land', 'appe', 'Earth', 'gathering', 'Se', 'bring', 'forth', 'grass', 'herb', 'yielding', 'seed', 'fruit', 'tree', 'after', 'his', 'kind', 'whose', 'is', 'itself', 'ear', 'brought', 'ki', 'third', 'lights', 'to', 'night', 'them', 'for', 'signs', 'seasons', 'days', 'yea', 'give', 'two', 'great', 'greater', 'rule', 'lesser', 'nig', 'stars', 'also', 'set', 'over', 'darkne', 'fourth', 'abundantly', 'moving', 'creature', 'hath', 'life', 'fowl', 

Both keys( ) and values( ) work with FreqDist( ).
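Since keys and values come back in the same order, they can be paired up. A minimal sketch, again using `collections.Counter` (the parent class of `FreqDist`) on the first sentence of Genesis as tokenized in the output above:

```python
from collections import Counter

# First sentence of Genesis, one token per list element
tokens = ['In', 'the', 'beginning', 'God', 'created', 'the',
          'heaven', 'and', 'the', 'earth', '.']
freq = Counter(tokens)

# keys() and values() follow the same order, so zip() pairs each word with its count
paired = list(zip(freq.keys(), freq.values()))

# items() yields the same pairs directly
same_pairs = paired == list(freq.items())

# Indexing by word gives its frequency; a word that never occurs gives 0
print(freq['the'], freq['missing'], same_pairs)  # prints 3 0 True
```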

Words that appear only once in a text can be very informative. Those words are called hapaxes. The ones in the book of Genesis are returned below. Note that "Night" is among them. That is probably because counting is case-sensitive: "Night", spelled with a capital "N", is counted separately from "night".

In [194]:
#Printing the words that appear only once
print(most_freq_words3.hapaxes())

['form', 'void', 'Day', 'Night', 'firmame', 'Heaven', 'appe', 'Earth', 'signs', 'seasons', 'lesser', 'nig', 'darkne', 'fly', 'whales', 'winged', 'seas', 'likene', 'subdue', 'finished', 'sanctified', 'plant', 'gr', 'mist', 'breathed', 'parted', 'Pison', 'bdellium', 'onyx', 'Gihon', 'Ethiopia', 'Hiddekel', 'Assyria', 'Euphrates', 'freely', 'eatest', 'sle', 'ribs', 'rib', 'Woman', 'Man', 'cleave', 'ashamed', 'subtil', 'gard', 'knowing', 'desired', 'sewed', 'fig', 'leaves', 'aprons', 'walking', 'cool', 'whereof', 'gavest', 'belly', 'enmity', 'conception', 'Thorns', 'thistles', 'sweat', 'tak', 'coats', 'clothed', 'Cherubims', 'flaming', 'tiller', 'firstlings', 'fallen', 'crieth', 'tillest', 'henceforth', 'punishment', 'driven', 'findeth', 'whosoever', 'slayeth', 'vengeance', 'mark', 'finding', 'Nod', 'Methusa', 'Methusael', 'Jabal', 'Jubal', 'handle', 'organ', 'instructor', 'artificer', 'brass', 'ir', 'Naamah', 'spee', 'wounding', 'avenged', 'En', 'book', 'Male', 'begotten', 'Eno', 'always'

In [193]:
#Printing the number of hapaxes
len(most_freq_words3.hapaxes())

1195
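The effect of capitalization on hapaxes can be checked on a small example. The sketch below assumes a helper named `hapaxes` built on `collections.Counter` and a made-up token list (both hypothetical). It is not NLTK's own implementation.

```python
from collections import Counter

# Hypothetical toy tokens in which some words differ only by case
tokens = ['Night', 'night', 'night', 'Day', 'day', 'void', 'form', 'night']

def hapaxes(token_list):
    """Return the tokens that occur exactly once, in first-seen order."""
    counts = Counter(token_list)
    return [token for token, count in counts.items() if count == 1]

case_sensitive = hapaxes(tokens)
case_folded = hapaxes([token.lower() for token in tokens])

print(case_sensitive)  # prints ['Night', 'Day', 'day', 'void', 'form']
print(case_folded)     # prints ['void', 'form']
```

Lowercasing before counting removes the hapaxes that exist only because of capitalization. Whether that is desirable depends on the analysis, since capitalized forms may carry meaning of their own.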

# Fine-grained selection of words

Another good way to get at the main content of a text is to look at its long words. Long words tend to be rare and can convey important information about the content of a text.

For that, observe this mathematical notation:

P: the property of having more than 9 characters;

P(w) is true if and only if w is more than 9 characters long;

V: Vocabulary;

a. {w | w ∈ V & P(w)}

b. [w for w in V if P(w)]

Applying that notation, where (b) is the Python equivalent of (a), the words below are returned.

In [203]:
#Applying the notation
V = set(text3) #Storing distinct words of the book of Genesis
expected_long_words = [w for w in V if len(w) > 9] #Storing words that have more than 9 characters
print(sorted(expected_long_words)) #Sorting the derived list

['Abelmizraim', 'Adullamite', 'Aholibamah', 'Allonbachuth', 'Amalekites', 'Beerlahairoi', 'Canaanites', 'Canaanitish', 'Chedorlaomer', 'EleloheIsrael', 'Girgashites', 'Hazarmaveth', 'Hazezontamar', 'Ishmeelites', 'Jegarsahadutha', 'Jehovahjireh', 'Kadmonites', 'Kenizzites', 'Kiriathaim', 'Kirjatharba', 'Mahalaleel', 'Melchizedek', 'Mesopotamia', 'Methuselah', 'Midianites', 'Peradventure', 'Perizzites', 'Philistines', 'Potipherah', 'Zaphnathpaaneah', 'abomination', 'abundantly', 'acknowledged', 'affliction', 'afterwards', 'altogether', 'anointedst', 'birthright', 'buryingplace', 'butlership', 'circumcise', 'circumcised', 'commanding', 'commandment', 'commandments', 'compasseth', 'conception', 'concerning', 'concubines', 'confederate', 'continually', 'countenance', 'deceitfully', 'deliverance', 'descending', 'displeased', 'distressed', 'established', 'everlasting', 'exceedingly', 'excellency', 'experience', 'fatfleshed', 'firstlings', 'fourteenth', 'generation', 'generations', 'graciousl

Those words characterize, in part, the book of Genesis. Note some of them:

circumcised, commanding, commandment, commandments, concubines, generation, generations, habitations, imagination, inhabitants, interpretation, interpretations, interpreted, interpreter and pilgrimage. 

They are words that are related to one another to some degree.

In [204]:
#Number of long words selected from the original text
len(expected_long_words)

139

According to the result above, 139 long words meet the criterion.
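The notation treats P as a property that is either true or false for each word. One way to make that explicit, sketched here with a hypothetical helper `longer_than` and a small made-up vocabulary, is to build P as a function whose length limit is a parameter.

```python
def longer_than(limit):
    """Build the property P: P(w) is True when w has more than `limit` characters."""
    def P(w):
        return len(w) > limit
    return P

# Hypothetical vocabulary; in the notebook V would be set(text3)
V = {'God', 'generations', 'commandment', 'earth', 'Melchizedek', 'light'}

P = longer_than(9)
long_words = sorted([w for w in V if P(w)])  # same shape as [w for w in V if P(w)]
print(long_words)  # prints ['Melchizedek', 'commandment', 'generations']
```

Capitalized words come first because `sorted` orders uppercase letters before lowercase ones. Changing the argument of `longer_than` is enough to try another limit.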

Looking for long words is not always enough. Adding frequency to that criterion can be a good way of categorizing a text.

In [264]:
freqdist = FreqDist(text3) #Defining a dictionary where the words are keys and the values are their frequencies
print(sorted([w for w in V if len(w) > 4 and freqdist[w] > 16])) 
            #A word w from V is kept if and only if it has more than 4 characters and its frequency is more than 16

['Abimelech', 'Abraham', 'Abram', 'Behold', 'Benjamin', 'Canaan', 'Egypt', 'Isaac', 'Ishmael', 'Israel', 'Jacob', 'Joseph', 'Judah', 'Laban', 'Pharaoh', 'Rachel', 'Rebekah', 'Sarah', 'Sarai', 'Shechem', 'Sodom', 'These', 'according', 'after', 'again', 'against', 'among', 'answered', 'beast', 'because', 'before', 'begat', 'behold', 'between', 'bless', 'blessed', 'bread', 'brethren', 'bring', 'brother', 'brought', 'called', 'camels', 'cattle', 'children', 'commanded', 'conceived', 'covenant', 'daughter', 'daughters', 'dream', 'drink', 'dwell', 'dwelt', 'earth', 'every', 'famine', 'father', 'field', 'firstborn', 'flesh', 'flocks', 'forth', 'found', 'given', 'great', 'ground', 'hands', 'heard', 'heaven', 'himself', 'house', 'hundred', 'little', 'lived', 'master', 'money', 'morning', 'mother', 'multiply', 'nations', 'night', 'people', 'place', 'returned', 'saying', 'servant', 'servants', 'seven', 'shall', 'shalt', 'sheep', 'should', 'sight', 'sister', 'spake', 'stood', 'taken', 'their', 'th

That instruction returns many names of people, nations and ethnic groups, as well as some words for foods, drinks and time.
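The combined criterion can be wrapped in a function so that both limits are easy to vary. This is a minimal sketch: the name `frequent_long_words` and its parameters are assumptions, and `collections.Counter` stands in for `FreqDist` so that the example runs without the NLTK corpora.

```python
from collections import Counter

def frequent_long_words(tokens, length_over=4, count_over=16):
    """Return, sorted, the distinct tokens with more than length_over characters
    that occur more than count_over times."""
    counts = Counter(tokens)
    return sorted(w for w, n in counts.items()
                  if len(w) > length_over and n > count_over)

# Hypothetical toy tokens, filtered with small limits
tokens = ['Jacob'] * 3 + ['the'] * 5 + ['covenant'] * 2 + ['Joseph'] * 4
result = frequent_long_words(tokens, length_over=4, count_over=2)
print(result)  # prints ['Jacob', 'Joseph']
```

Here 'the' is dropped for being too short and 'covenant' for occurring only twice. With the default limits and `text3` as input, the function applies the same filter as the cell above.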