# Aspects of a text by number of characters

Studying how many characters the words of a text usually have can give us useful information. Comparing, for example, 20 text types that fall into just two text classes by their character counts may reveal significant patterns. Perhaps recipes use shorter words than scientific articles. If so, word length could serve as a text classification criterion.

In [1]:
#Importing libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import nltk

In [2]:
#Loading the NLTK book texts (this star import also brings FreqDist into the namespace)
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


# Collocations

A collocation is a sequence of words that occurs together unusually often. A characteristic of collocations is that they are resistant to substitution with words that have similar senses.

The collocations are essentially just frequent bigrams.

The collocations that emerge are very specific to the genre of the texts.

In [29]:
#Listing up to 100 collocations
text3.collocations(num=100)

said unto; pray thee; thou shalt; thou hast; thy seed; years old;
spake unto; thou art; LORD God; every living; God hath; begat sons;
seven years; shalt thou; little ones; living creature; creeping thing;
savoury meat; thirty years; every beast; ill favoured; shall come;
creepeth upon; thy servant; four hundred; LORD hath; every man; hast
done; bless thee; eight hundred; five years; burnt offering; God said;
unto thee; nine hundred; right hand; Thou shalt; thy father; chief
butler; appeared unto; found grace; yet alive; find grace; hundred
years; came near; thy brother; brought forth; third day; every
creeping; gray hairs; hath taken; unto Joseph; youngest brother; taken
away; thy servants; Potipherah priest; ewe lambs; mischief befall; old
age; fifth part; fifty righteous; art thou; two daughters; many
colours; thou wilt; well favoured; LORD said; draw water; thou mayest;
bring forth; thy wife; three hundred; everlasting covenant; God
created; drink wine; thine eyes; true men; forty d
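
The idea that collocations are essentially frequent bigrams can be sketched in plain Python. The token list below is a hypothetical toy sample, not taken from text3, and the "occurs more than once" threshold is an assumption chosen for illustration. NLTK's own `collocations()` applies additional filtering and scoring that this sketch does not attempt to reproduce.

```python
from collections import Counter

# Toy token list (hypothetical, for illustration only)
tokens = ["thou", "shalt", "not", "eat", ";", "thou", "shalt", "go",
          "unto", "him", ";", "he", "said", "unto", "him"]

# Pair each token with its successor to form bigrams
bigrams = list(zip(tokens, tokens[1:]))

# Count every bigram and keep those that occur more than once
bigram_counts = Counter(bigrams)
frequent = [pair for pair, n in bigram_counts.most_common() if n > 1]

print(frequent)
```

On a real corpus the threshold would need to be much higher, and very common function words would dominate unless they are filtered out.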

In [63]:
#Generating a dictionary of word lengths and their frequencies
FreqDist([len(w) for w in text3])

FreqDist({3: 11599, 4: 8984, 1: 7408, 2: 5968, 5: 4693, 6: 2549, 7: 1674, 8: 928, 9: 599, 10: 182, ...})

Above is a dictionary that displays word lengths and their frequencies. It is useful to know statistics about the word lengths.
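
`FreqDist` behaves much like the standard library's `collections.Counter`, so the same length count can be sketched without NLTK. The short token list below is a sample sentence used only for illustration, and the punctuation mark is kept as a token because NLTK texts also include punctuation.

```python
from collections import Counter

# A short sample sentence, already split into tokens (illustration only)
tokens = ["In", "the", "beginning", "God", "created", "the",
          "heaven", "and", "the", "earth", "."]

# Map each token to its length, then count how often each length occurs
length_counts = Counter(len(w) for w in tokens)

# Print lengths in ascending order with their frequencies
for length, count in sorted(length_counts.items()):
    print(length, count)
```

Because one-character punctuation tokens are counted too, the frequency of length 1 may overstate the number of true one-letter words.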

In [61]:
#Displaying statistics about the distinct word lengths (the keys only, not weighted by frequency)
pd.Series(FreqDist([len(w) for w in text3]).keys()).describe()

count    15.000000
mean      8.000000
std       4.472136
min       1.000000
25%       4.500000
50%       8.000000
75%      11.500000
max      15.000000
dtype: float64

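These statistics summarize the 15 distinct length values (1 to 15), not the words themselves, so the mean of 8 is not the average word length. Weighting each length by its frequency gives an average of about 3.5 characters per token.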
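
To weight each length by how often it occurs, the counts reported for text3 can be combined directly. The sketch below is plain Python that uses those counts as a literal dictionary. The variable names are illustrative and not part of the original notebook.

```python
# Word-length counts for text3, copied from the FreqDist output in this notebook
length_counts = {1: 7408, 2: 5968, 3: 11599, 4: 8984, 5: 4693,
                 6: 2549, 7: 1674, 8: 928, 9: 599, 10: 182,
                 11: 123, 12: 39, 13: 9, 14: 7, 15: 2}

# Total number of tokens is the sum of all frequencies
total_tokens = sum(length_counts.values())

# Mean of the distinct lengths only (what describe() on the keys reports)
unweighted_mean = sum(length_counts) / len(length_counts)

# Mean token length, with each length weighted by its frequency
weighted_mean = sum(length * count for length, count in length_counts.items()) / total_tokens

print(total_tokens)
print(unweighted_mean)
print(round(weighted_mean, 2))
```

The gap between the two means reflects how heavily the distribution is concentrated on short tokens.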

In [67]:
#Listing the (length, frequency) pairs
FreqDist([len(w) for w in text3]).items()

dict_items([(2, 5968), (3, 11599), (9, 599), (7, 1674), (6, 2549), (5, 4693), (1, 7408), (4, 8984), (8, 928), (10, 182), (11, 123), (12, 39), (13, 9), (14, 7), (15, 2)])

The `items()` view pairs each key (a word length) with its value (the frequency) in a tuple.

The `max()` method returns the most frequent word length.

In [74]:
#Finding the most frequent word length
FreqDist([len(w) for w in text3]).max()

3

Indexing the distribution with the returned value gives the frequency of that length, which is the highest count.

In [75]:
#Looking up the frequency of the most frequent word length
FreqDist([len(w) for w in text3])[3]

11599
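
The same two lookups can be sketched with an ordinary dictionary: the most frequent length is the key with the largest value, and indexing by that key gives its count. This is a plain-Python illustration using the counts reported for text3, not a description of how `FreqDist` is implemented.

```python
# Word-length counts for text3, copied from the items() output in this notebook
length_counts = {2: 5968, 3: 11599, 9: 599, 7: 1674, 6: 2549,
                 5: 4693, 1: 7408, 4: 8984, 8: 928, 10: 182,
                 11: 123, 12: 39, 13: 9, 14: 7, 15: 2}

# Key whose associated count is largest
most_common_length = max(length_counts, key=length_counts.get)

# Count stored under that key
highest_count = length_counts[most_common_length]

print(most_common_length, highest_count)
```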

The `freq()` method returns the proportion of tokens that have a given length, as a value between 0 and 1. Multiplying it by 100 gives a percentage.

In [84]:
#Computing the proportion of tokens with the most frequent word length
FreqDist([len(w) for w in text3]).freq(3)

0.2591144669823966

Words made up of 3 characters represent about 25.91% of the text.
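
The proportion is simply the count for that length divided by the total number of tokens. In the sketch below, the total of 44764 is an assumption derived by summing the 15 counts listed in the `items()` output.

```python
# Count of 3-character tokens in text3, as reported in this notebook
count_len3 = 11599

# Total tokens, assumed here to be the sum of all counts in the items() output
total_tokens = 44764

# Proportion between 0 and 1, matching what freq(3) reports
proportion = count_len3 / total_tokens

# Format the proportion as a percentage with two decimal places
percentage = f"{proportion:.2%}"

print(proportion)
print(percentage)
```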

In [110]:
#Computing the proportion of tokens equal to the name God
FreqDist([w for w in text3]).freq('God')

0.005160396747386293

The word God represents about 0.52% of the text.
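
Since `freq()` returns a proportion rather than a percentage, the value has to be multiplied by 100 before it is read as a percent. The sketch below converts the reported value and also recovers the implied number of occurrences, assuming a total of 44764 tokens (the sum of the word-length counts shown earlier).

```python
# Proportion of tokens equal to 'God', as reported by freq('God') in this notebook
proportion_god = 0.005160396747386293

# Total tokens in text3, assumed from the sum of the word-length counts
total_tokens = 44764

# Express the proportion as a percentage string
percentage_god = f"{proportion_god:.2%}"

# Recover the approximate number of occurrences from the proportion
implied_count = round(proportion_god * total_tokens)

print(percentage_god)
print(implied_count)
```

The lookup is case-sensitive, so tokens such as "god" would be counted separately from "God".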