# Text Normalization

This notebook provides an introduction to performing text normalization in Python using NLTK and other NLP libraries.

This notebook covers the following processes:

*   Data Exploration
*   Word Tokenization
*   Word Normalization
*   Sentence Segmentation

# Initialize NLTK

Download some of the resources that NLTK needs.

In [1]:
import nltk
nltk.download('book', quiet=True)

True

## Import the additional modules

The `re` module is Python's built-in regular expression module, while `tokenizers` is a [library from Hugging Face](https://github.com/huggingface/tokenizers).

In [2]:
import re
from tokenizers import ByteLevelBPETokenizer, BertWordPieceTokenizer

## Load the text data of interest

There are many ways to load text. If a text file exists, reading it with Python's `open` function is the easiest. For now, however, one of the predefined corpora in the NLTK library is used.

The first 500 characters of the text are shown as an initial view of the data.

In [3]:
TEXT_DATA = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')
TEXT_DATA[:500]

'[Moby Dick by Herman Melville 1851]\r\n\r\n\r\nETYMOLOGY.\r\n\r\n(Supplied by a Late Consumptive Usher to a Grammar School)\r\n\r\nThe pale Usher--threadbare in coat, heart, body, and brain; I see him\r\nnow.  He was ever dusting his old lexicons and grammars, with a queer\r\nhandkerchief, mockingly embellished with all the gay flags of all the\r\nknown nations of the world.  He loved to dust his old grammars; it\r\nsomehow mildly reminded him of his mortality.\r\n\r\n"While you take in hand to school others, and to teac'
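
For a text that lives in a local file instead of an NLTK corpus, the `open` route mentioned above could look like the sketch below. The helper name `load_text` and the sample file are assumptions made for illustration; they are not part of the notebook.

```python
import tempfile
from pathlib import Path


def load_text(path, encoding="utf-8"):
    """Read a whole text file into one string (hypothetical helper)."""
    # newline="" keeps the original line endings (for example "\r\n") intact.
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read()


# A throwaway file created only for this demonstration.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample_corpus.txt"
    sample_path.write_bytes(b"Call me Ishmael.\r\nSome years ago...")
    sample_text = load_text(sample_path)

print(repr(sample_text[:500]))
```

Passing the encoding explicitly avoids depending on the platform default, which differs between operating systems.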

## Data Exploration

Regular expressions can be used to look at different aspects of the data. The examples below are deliberately naive; they show the power of regular expressions, and they can certainly be extended to be as complex as needed. For the most part, they are only meant to give an overview of the data.

### Past Tenses

Below is a very naive way of finding words in the past tense: it matches any word that ends in `ed`, so it also picks up words such as `hundred` and `sacred`.

In [4]:
re.findall(r'(?<=\W)[A-Za-z-]+ed(?=\W)', TEXT_DATA)

['Supplied',
 'embellished',
 'loved',
 'reminded',
 'called',
 'named',
 'arched',
 'vaulted',
 'Supplied',
 'sacred',
 'fancied',
 'seven-storied',
 'long-pampered',
 'splintered',
 'created',
 'prepared',
 'crooked',
 'called',
 'proceeded',
 'appeared',
 'open-mouthed',
 'visited',
 'catched',
 'killed',
 'swallowed',
 'described',
 'received',
 'extracted',
 'bred',
 'wounded',
 'learned',
 'fixed',
 'created',
 'called',
 'swallowed',
 'Created',
 'Stretched',
 'placed',
 'forced',
 'proceed',
 'called',
 'informed',
 'agreed',
 'killed',
 'hundred',
 'attended',
 'stuffed',
 'armed',
 'supposed',
 'killed',
 'seemed',
 'stranded',
 'grounded',
 'barbed',
 'Up-spouted',
 'covered',
 'Floundered',
 'dived',
 'Led',
 'Assaulted',
 'hooked',
 'observed',
 'killed',
 'answered',
 'introduced',
 'answered',
 'dimmed',
 'gleamed',
 'floundered',
 'engaged',
 'amounted',
 'infuriated',
 'expanded',
 'propelled',
 'destroyed',
 'neglected',
 'excited',
 'possessed',
 'armed',
 'regarded'
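
A long flat list is hard to read, so one possible follow-up is to count the matches. The sketch below is an assumption about how that could be done, not the notebook's method: it uses `\b` word boundaries instead of the lookarounds (so a word at the very start of the text is also found) and a hypothetical helper `count_ed_words`.

```python
import re
from collections import Counter

# Words made of letters and hyphens that end in "ed".
ED_PATTERN = re.compile(r'\b[A-Za-z-]+ed\b')


def count_ed_words(text):
    """Count case-folded words ending in 'ed' (hypothetical helper)."""
    return Counter(word.lower() for word in ED_PATTERN.findall(text))


sample = ("Supplied by an usher. He loved his sacred books; "
          "he loved a hundred of them.")
counts = count_ed_words(sample)
print(counts.most_common())
```

The counts make the false positives easier to spot: `sacred` and `hundred` end in `ed` but are not past-tense verbs.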

### Proper Names

Below is a naive way of looking for proper names in the text: it matches capitalized words. Obvious false positives are words at the start of a sentence.

In [5]:
re.findall(r'(?<![\s.!]\W)(?<=\W)[A-Z][a-z-]+(?=\W)', TEXT_DATA)

['Moby',
 'Dick',
 'Herman',
 'Melville',
 'Late',
 'Consumptive',
 'Usher',
 'Grammar',
 'School',
 'Usher--threadbare',
 'Dan',
 'Dan',
 'Dut',
 'Ger',
 'Sub',
 'Sub',
 'Librarian',
 'Sub',
 'Sub',
 'Leviathan',
 'Sub',
 'Sub',
 'Pale',
 'Sherry',
 'Give',
 'Sub',
 'Subs',
 'Hampton',
 'Court',
 'Tuileries',
 'Gabriel',
 'Michael',
 'Raphael',
 'God',
 'One',
 'Lord',
 'Jonah',
 'Leviathan',
 'Lord',
 'Leviathan',
 'Leviathan',
 'Indian',
 'Sea',
 'Whales',
 'Whirlpooles',
 'Balaene',
 'Whales',
 'Nick',
 'Leviathan',
 'Moses',
 'Job',
 'Leviathan',
 'Nescio',
 'Spencer',
 'Talus',
 'Leviathan',
 'Commonwealth',
 'Latin',
 'Civitas',
 'Mansoul',
 'God',
 'There',
 'Leviathan',
 'Leviathan',
 'Elbe',
 'The',
 'Whale',
 'Shetland',
 'Spitzbergen',
 'Anno',
 'Pitferren',
 'God',
 'Asiatics',
 'Spermacetti',
 'Whale',
 'Nantuckois',
 'Europe',
 'London',
 'Bridge',
 'Spermacetti',
 'Whales',
 'May',
 'Leviathan',
 'Atlantic',
 'Polar',
 'Sea',
 'Susan',
 'Gothic',
 'Arch',
 'Pacific',
 '
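
The effect of the negative lookbehind `(?<![\s.!]\W)` is easier to see on a short string. The sketch below compares the notebook's pattern with a plain word-boundary pattern; the sample sentence is made up for illustration.

```python
import re

sample = "He met Ahab. The captain saw Moby Dick! Then Starbuck spoke."

# Every capitalized word, including sentence-initial ones.
all_capitalized = re.findall(r'\b[A-Z][a-z-]+\b', sample)

# The notebook's pattern: skip a capitalized word when the two characters
# before it are (whitespace, '.' or '!') followed by a non-word character,
# which is typically the case right after a sentence end.
proper_names = re.findall(r'(?<![\s.!]\W)(?<=\W)[A-Z][a-z-]+(?=\W)', sample)

print(all_capitalized)
print(proper_names)
```

The pattern also needs a non-word character before the match, so a word at the very beginning of the text is never reported.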

### Words without Vowels

Below is a naive way to check which words do not have vowels in them.

In [6]:
re.findall(r'(?<=\W)[^aeiouAEIOU\W]+(?=\W)', TEXT_DATA)

['by',
 '1851',
 'by',
 'by',
 'H',
 'Sw',
 'S',
 'S',
 'S',
 'by',
 's',
 'by',
 'by',
 'by',
 's',
 'S',
 'S',
 'S',
 'S',
 'S',
 'BY',
 'D',
 '890',
 's',
 's',
 'fly',
 'fly',
 'by',
 's',
 'S',
 'S',
 's',
 's',
 'T',
 'V',
 's',
 'S',
 'By',
 'S',
 'S',
 'S',
 'fry',
 'S',
 'S',
 'T',
 'S',
 'S',
 'N',
 't',
 'D',
 '1671',
 '1652',
 '500',
 'S',
 'try',
 'by',
 'S',
 'D',
 '1668',
 's',
 'N',
 'by',
 'S',
 'D',
 '1729',
 'S',
 'sylphs',
 'by',
 'S',
 'S',
 'S',
 'S',
 '1772',
 'by',
 'S',
 '1778',
 'S',
 's',
 'S',
 'by',
 'S',
 'S',
 's',
 'S',
 '40',
 'S',
 'd',
 'by',
 'by',
 'd',
 'S',
 's',
 'S',
 '1690',
 's',
 'S',
 'by',
 's',
 'S',
 'by',
 's',
 'S',
 'S',
 'My',
 'Mr',
 'by',
 'BY',
 'BY',
 '1821',
 '10',
 '440',
 'S',
 '1839',
 'S',
 '1840',
 '13',
 'J',
 'S',
 '1846',
 'BY',
 'D',
 '1828',
 'by',
 'by',
 'Mr',
 'by',
 'S',
 'S',
 '1828',
 'S',
 'S',
 'BY',
 'T',
 'BY',
 'S',
 'by',
 'S',
 's',
 'S',
 'my',
 '1',
 'my',
 'my',
 'my',
 's',
 'my',
 'by',
 'by',
 'by',
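
The output above is dominated by numbers and single letters, because digits count as word characters without vowels. One possible refinement, shown here as a sketch on a made-up sentence, excludes digits and underscores and requires at least two letters.

```python
import re

sample = "Fly by the sky in 1851, my S. sylphs try rhythm."

# The notebook's pattern: any run of word characters that contains no vowel.
naive = re.findall(r'(?<=\W)[^aeiouAEIOU\W]+(?=\W)', sample)

# Refinement (an assumption, not part of the notebook): letters only,
# two or more of them, delimited by word boundaries.
refined = re.findall(r'\b[^\W\d_aeiouAEIOU]{2,}\b', sample)

print(naive)
print(refined)
```

The negated class `[^\W\d_aeiouAEIOU]` reads as "a word character that is not a digit, an underscore, or a vowel".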
 

## Word Tokenization

For this part, both Penn Treebank tokenization and regex tokenization are shown.

### Penn Treebank Tokenization

In [7]:
ptb_tokenizer = nltk.tokenize.treebank.TreebankWordTokenizer()
PTB_TOKENS = ptb_tokenizer.tokenize(TEXT_DATA)

In [8]:
PTB_TOKENS[:500]

['[',
 'Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 ']',
 'ETYMOLOGY.',
 '(',
 'Supplied',
 'by',
 'a',
 'Late',
 'Consumptive',
 'Usher',
 'to',
 'a',
 'Grammar',
 'School',
 ')',
 'The',
 'pale',
 'Usher',
 '--',
 'threadbare',
 'in',
 'coat',
 ',',
 'heart',
 ',',
 'body',
 ',',
 'and',
 'brain',
 ';',
 'I',
 'see',
 'him',
 'now.',
 'He',
 'was',
 'ever',
 'dusting',
 'his',
 'old',
 'lexicons',
 'and',
 'grammars',
 ',',
 'with',
 'a',
 'queer',
 'handkerchief',
 ',',
 'mockingly',
 'embellished',
 'with',
 'all',
 'the',
 'gay',
 'flags',
 'of',
 'all',
 'the',
 'known',
 'nations',
 'of',
 'the',
 'world.',
 'He',
 'loved',
 'to',
 'dust',
 'his',
 'old',
 'grammars',
 ';',
 'it',
 'somehow',
 'mildly',
 'reminded',
 'him',
 'of',
 'his',
 'mortality.',
 "''",
 'While',
 'you',
 'take',
 'in',
 'hand',
 'to',
 'school',
 'others',
 ',',
 'and',
 'to',
 'teach',
 'them',
 'by',
 'what',
 'name',
 'a',
 'whale-fish',
 'is',
 'to',
 'be',
 'called',
 'in',
 'our',
 'to
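
Tokens such as `'now.'` and `'world.'` in the output above keep their period because the Treebank tokenizer only splits off a period at the end of its input; it expects text that has already been segmented into sentences. The sketch below assumes NLTK is installed and uses two hand-split sentences, so it needs no downloaded resources.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Sentences split by hand here; in practice they could come from a
# sentence segmenter such as nltk.sent_tokenize.
sentences = ["I see him now.", "He was ever dusting his old lexicons."]

# Tokenizing the joined text leaves the sentence-internal period attached.
whole_text_tokens = tokenizer.tokenize(" ".join(sentences))

# Tokenizing one sentence at a time separates every final period.
per_sentence_tokens = [
    token for sentence in sentences for token in tokenizer.tokenize(sentence)
]

print(whole_text_tokens)
print(per_sentence_tokens)
```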

### Regex Tokenization

In [9]:
regex_tokenizer_pattern = \
    r'''(?x)     # set flag to allow verbose regexps
        (?:[A-Z]\.)+       # abbreviations, e.g. U.S.A.
        | \w+(?:-\w+)*       # words with optional internal hyphens
        | \$?\d+(?:\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
        | \.\.\.             # ellipsis
        | [][.,;"'?():-_`]   # separate tokens, includes ], [; note that :-_ is a range (: to _), so - is not matched
    '''
REGEX_TOKENS = nltk.regexp_tokenize(TEXT_DATA, regex_tokenizer_pattern)

In [10]:
REGEX_TOKENS[:500]

['[',
 'Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 ']',
 'ETYMOLOGY',
 '.',
 '(',
 'Supplied',
 'by',
 'a',
 'Late',
 'Consumptive',
 'Usher',
 'to',
 'a',
 'Grammar',
 'School',
 ')',
 'The',
 'pale',
 'Usher',
 'threadbare',
 'in',
 'coat',
 ',',
 'heart',
 ',',
 'body',
 ',',
 'and',
 'brain',
 ';',
 'I',
 'see',
 'him',
 'now',
 '.',
 'He',
 'was',
 'ever',
 'dusting',
 'his',
 'old',
 'lexicons',
 'and',
 'grammars',
 ',',
 'with',
 'a',
 'queer',
 'handkerchief',
 ',',
 'mockingly',
 'embellished',
 'with',
 'all',
 'the',
 'gay',
 'flags',
 'of',
 'all',
 'the',
 'known',
 'nations',
 'of',
 'the',
 'world',
 '.',
 'He',
 'loved',
 'to',
 'dust',
 'his',
 'old',
 'grammars',
 ';',
 'it',
 'somehow',
 'mildly',
 'reminded',
 'him',
 'of',
 'his',
 'mortality',
 '.',
 '"',
 'While',
 'you',
 'take',
 'in',
 'hand',
 'to',
 'school',
 'others',
 ',',
 'and',
 'to',
 'teach',
 'them',
 'by',
 'what',
 'name',
 'a',
 'whale-fish',
 'is',
 'to',
 'be',
 'called',
 'in',
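
In the output above, the `--` after `Usher` is missing: inside the character class, `:-_` is read as a range from `:` to `_`, which does not contain the hyphen. The sketch below is a variant written with the standard `re` module only; moving the hyphen to the end of the class and placing the currency alternative before the word alternative are assumptions about the intended behavior, not the notebook's pattern.

```python
import re

TOKEN_PATTERN = re.compile(r'''(?x)   # verbose mode: whitespace and comments ignored
      (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?%?     # currency and percentages, e.g. $12.40, 82%
    | \w+(?:-\w+)*           # words with optional internal hyphens
    | \.\.\.                 # ellipsis
    | [\]\[.,;"'?():_`-]     # single punctuation tokens; hyphen last, so it is literal
''')


def regex_tokenize(text):
    """Return all non-overlapping token matches (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)


sample = "The pale Usher--threadbare paid $12.40 in the U.S.A."
tokens = regex_tokenize(sample)
print(tokens)
```

With this variant each hyphen of `--` becomes its own token; matching `--` as one token would need an extra alternative.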


## Data-based Tokenization

For this part, the text data is saved to a file to match the file-based training interface of the `tokenizers` module. Both Byte Pair Encoding and Word Piece Encoding are shown.

In [11]:
TEXT_PATH = 'moby_dick.txt'
# Write with an explicit encoding so the result does not depend on the platform default.
with open(TEXT_PATH, 'w', encoding='utf-8') as f:
    f.write(TEXT_DATA)

### Byte Pair Encoding

In [12]:
bl_tokenizer = ByteLevelBPETokenizer(lowercase=True)
bl_tokenizer.train([TEXT_PATH])

In [13]:
BL_TOKENS = bl_tokenizer.encode(TEXT_DATA).tokens

In [14]:
BL_TOKENS[:500]

['[',
 'moby',
 'Ġdick',
 'Ġby',
 'Ġher',
 'man',
 'Ġmel',
 'vil',
 'le',
 'Ġ1851',
 ']',
 'čĊ',
 'čĊ',
 'č',
 'Ċ',
 'ety',
 'mo',
 'logy',
 '.',
 'čĊ',
 'č',
 'Ċ',
 '(',
 'supplied',
 'Ġby',
 'Ġa',
 'Ġlate',
 'Ġcons',
 'umpt',
 'ive',
 'Ġusher',
 'Ġto',
 'Ġa',
 'Ġgrammar',
 'Ġschool',
 ')',
 'čĊ',
 'č',
 'Ċ',
 'the',
 'Ġpale',
 'Ġusher',
 '--',
 'thread',
 'bare',
 'Ġin',
 'Ġcoat',
 ',',
 'Ġheart',
 ',',
 'Ġbody',
 ',',
 'Ġand',
 'Ġbrain',
 ';',
 'Ġi',
 'Ġsee',
 'Ġhim',
 'č',
 'Ċ',
 'now',
 '.',
 'Ġ',
 'Ġhe',
 'Ġwas',
 'Ġever',
 'Ġdusting',
 'Ġhis',
 'Ġold',
 'Ġlexic',
 'ons',
 'Ġand',
 'Ġgrammars',
 ',',
 'Ġwith',
 'Ġa',
 'Ġqueer',
 'č',
 'Ċ',
 'handkerchief',
 ',',
 'Ġmock',
 'ingly',
 'Ġembellished',
 'Ġwith',
 'Ġall',
 'Ġthe',
 'Ġgay',
 'Ġflag',
 's',
 'Ġof',
 'Ġall',
 'Ġthe',
 'č',
 'Ċ',
 'known',
 'Ġnations',
 'Ġof',
 'Ġthe',
 'Ġworld',
 '.',
 'Ġ',
 'Ġhe',
 'Ġloved',
 'Ġto',
 'Ġdust',
 'Ġhis',
 'Ġold',
 'Ġgrammars',
 ';',
 'Ġit',
 'č',
 'Ċ',
 'somehow',
 'Ġmildly',
 'Ġreminded',
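
Splits such as `Ġher` + `man` come from merge rules learned during training. The sketch below is a toy version of the classic BPE training loop on plain characters, written with the standard library only; it is not the `ByteLevelBPETokenizer` implementation, which works on bytes and applies its own pre-tokenization. The function name and the tiny corpus are made up for illustration.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges merge rules from a list of words (toy sketch)."""
    # Each word is a tuple of symbols, stored with its corpus frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


corpus = ["the"] * 5 + ["then"] * 2 + ["hen"] * 2
merges, vocab = learn_bpe_merges(corpus, num_merges=2)
print(merges)
print(dict(vocab))
```

Frequent words end up as single symbols after enough merges, while rare words stay split into smaller pieces, which matches the pattern visible in the output above.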

### Word Piece Encoding

In [15]:
wp_tokenizer = BertWordPieceTokenizer(lowercase=True)
wp_tokenizer.train([TEXT_PATH])

In [16]:
WP_TOKENS = wp_tokenizer.encode(TEXT_DATA).tokens

In [17]:
WP_TOKENS[:500]

['[',
 'moby',
 'dick',
 'by',
 'herm',
 '##an',
 'mel',
 '##vil',
 '##le',
 '1851',
 ']',
 'et',
 '##y',
 '##mo',
 '##lo',
 '##gy',
 '.',
 '(',
 'supplied',
 'by',
 'a',
 'late',
 'cons',
 '##ump',
 '##ti',
 '##ve',
 'usher',
 'to',
 'a',
 'grammar',
 'school',
 ')',
 'the',
 'pale',
 'usher',
 '-',
 '-',
 'thread',
 '##bar',
 '##e',
 'in',
 'coat',
 ',',
 'heart',
 ',',
 'body',
 ',',
 'and',
 'brain',
 ';',
 'i',
 'see',
 'him',
 'now',
 '.',
 'he',
 'was',
 'ever',
 'dusting',
 'his',
 'old',
 'lexicon',
 '##s',
 'and',
 'grammars',
 ',',
 'with',
 'a',
 'queer',
 'handkerchief',
 ',',
 'mocking',
 '##ly',
 'embellished',
 'with',
 'all',
 'the',
 'gay',
 'flag',
 '##s',
 'of',
 'all',
 'the',
 'known',
 'nations',
 'of',
 'the',
 'world',
 '.',
 'he',
 'loved',
 'to',
 'dust',
 'his',
 'old',
 'grammars',
 ';',
 'it',
 'somehow',
 'mildly',
 'reminded',
 'him',
 'of',
 'his',
 'mortal',
 '##ity',
 '.',
 '"',
 'while',
 'you',
 'take',
 'in',
 'hand',
 'to',
 'school',
 'others',
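
The `##` prefix in the output above marks a piece that continues the current word. Once a vocabulary exists, a word is commonly split by greedy longest-match-first lookup, as in the sketch below. The function, the tiny vocabulary, and the `[UNK]` fallback are assumptions for illustration; the sketch covers only the encoding step, not how `BertWordPieceTokenizer` trains its vocabulary.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Split one word into pieces by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"herm", "##an", "mortal", "##ity", "lexicon", "##s"}
for word in ["herman", "mortality", "lexicons", "whale"]:
    print(word, wordpiece_encode(word, toy_vocab))
```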
 

## Word Normalization

In this section, several word normalization techniques are shown: case folding, the Porter stemmer, and the WordNet lemmatizer. From here on, the Penn Treebank tokens are used.

### Case Folding

In [18]:
CF_PTB_TOKENS = [w.lower() for w in PTB_TOKENS]
CF_PTB_TOKENS[:500]

['[',
 'moby',
 'dick',
 'by',
 'herman',
 'melville',
 '1851',
 ']',
 'etymology.',
 '(',
 'supplied',
 'by',
 'a',
 'late',
 'consumptive',
 'usher',
 'to',
 'a',
 'grammar',
 'school',
 ')',
 'the',
 'pale',
 'usher',
 '--',
 'threadbare',
 'in',
 'coat',
 ',',
 'heart',
 ',',
 'body',
 ',',
 'and',
 'brain',
 ';',
 'i',
 'see',
 'him',
 'now.',
 'he',
 'was',
 'ever',
 'dusting',
 'his',
 'old',
 'lexicons',
 'and',
 'grammars',
 ',',
 'with',
 'a',
 'queer',
 'handkerchief',
 ',',
 'mockingly',
 'embellished',
 'with',
 'all',
 'the',
 'gay',
 'flags',
 'of',
 'all',
 'the',
 'known',
 'nations',
 'of',
 'the',
 'world.',
 'he',
 'loved',
 'to',
 'dust',
 'his',
 'old',
 'grammars',
 ';',
 'it',
 'somehow',
 'mildly',
 'reminded',
 'him',
 'of',
 'his',
 'mortality.',
 "''",
 'while',
 'you',
 'take',
 'in',
 'hand',
 'to',
 'school',
 'others',
 ',',
 'and',
 'to',
 'teach',
 'them',
 'by',
 'what',
 'name',
 'a',
 'whale-fish',
 'is',
 'to',
 'be',
 'called',
 'in',
 'our',
 'to
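
One way to see what case folding buys is to compare the number of distinct tokens before and after. The sketch below uses a small made-up token list; it also shows `str.casefold`, which is a more aggressive alternative to `str.lower` for some non-English characters.

```python
tokens = ['The', 'pale', 'Usher', 'the', 'usher', 'THE', 'Pale']

lowered = [token.lower() for token in tokens]

# Distinct surface forms before and after case folding.
print(len(set(tokens)), len(set(lowered)))

# lower() keeps the German sharp s, casefold() expands it to "ss".
german = "Straße"
print(german.lower(), german.casefold())
```

The trade-off is that distinctions carried by case, such as `Usher` as a title versus `usher` as a common noun, are lost.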

### Stemming using Porter

In [19]:
stemmer = nltk.PorterStemmer()
PS_PTB_TOKENS = [stemmer.stem(w) for w in PTB_TOKENS]
PS_PTB_TOKENS[:500]

['[',
 'mobi',
 'dick',
 'by',
 'herman',
 'melvil',
 '1851',
 ']',
 'etymology.',
 '(',
 'suppli',
 'by',
 'a',
 'late',
 'consumpt',
 'usher',
 'to',
 'a',
 'grammar',
 'school',
 ')',
 'the',
 'pale',
 'usher',
 '--',
 'threadbar',
 'in',
 'coat',
 ',',
 'heart',
 ',',
 'bodi',
 ',',
 'and',
 'brain',
 ';',
 'I',
 'see',
 'him',
 'now.',
 'He',
 'wa',
 'ever',
 'dust',
 'hi',
 'old',
 'lexicon',
 'and',
 'grammar',
 ',',
 'with',
 'a',
 'queer',
 'handkerchief',
 ',',
 'mockingli',
 'embellish',
 'with',
 'all',
 'the',
 'gay',
 'flag',
 'of',
 'all',
 'the',
 'known',
 'nation',
 'of',
 'the',
 'world.',
 'He',
 'love',
 'to',
 'dust',
 'hi',
 'old',
 'grammar',
 ';',
 'it',
 'somehow',
 'mildli',
 'remind',
 'him',
 'of',
 'hi',
 'mortality.',
 "''",
 'while',
 'you',
 'take',
 'in',
 'hand',
 'to',
 'school',
 'other',
 ',',
 'and',
 'to',
 'teach',
 'them',
 'by',
 'what',
 'name',
 'a',
 'whale-fish',
 'is',
 'to',
 'be',
 'call',
 'in',
 'our',
 'tongu',
 'leav',
 'out',
 ',',

### Lemmatize using WordNet

In [20]:
lemmatizer = nltk.WordNetLemmatizer()
WNL_PTB_TOKENS = [lemmatizer.lemmatize(w) for w in PTB_TOKENS]
WNL_PTB_TOKENS[:500]

['[',
 'Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 ']',
 'ETYMOLOGY.',
 '(',
 'Supplied',
 'by',
 'a',
 'Late',
 'Consumptive',
 'Usher',
 'to',
 'a',
 'Grammar',
 'School',
 ')',
 'The',
 'pale',
 'Usher',
 '--',
 'threadbare',
 'in',
 'coat',
 ',',
 'heart',
 ',',
 'body',
 ',',
 'and',
 'brain',
 ';',
 'I',
 'see',
 'him',
 'now.',
 'He',
 'wa',
 'ever',
 'dusting',
 'his',
 'old',
 'lexicon',
 'and',
 'grammar',
 ',',
 'with',
 'a',
 'queer',
 'handkerchief',
 ',',
 'mockingly',
 'embellished',
 'with',
 'all',
 'the',
 'gay',
 'flag',
 'of',
 'all',
 'the',
 'known',
 'nation',
 'of',
 'the',
 'world.',
 'He',
 'loved',
 'to',
 'dust',
 'his',
 'old',
 'grammar',
 ';',
 'it',
 'somehow',
 'mildly',
 'reminded',
 'him',
 'of',
 'his',
 'mortality.',
 "''",
 'While',
 'you',
 'take',
 'in',
 'hand',
 'to',
 'school',
 'others',
 ',',
 'and',
 'to',
 'teach',
 'them',
 'by',
 'what',
 'name',
 'a',
 'whale-fish',
 'is',
 'to',
 'be',
 'called',
 'in',
 'our',
 'tongue',

## Sentence Segmentation

This part shows how to do sentence segmentation using the Punkt system. NLTK provides a pretrained Punkt model through its `sent_tokenize` function.

In [21]:
nltk.sent_tokenize(TEXT_DATA)

['[Moby Dick by Herman Melville 1851]\r\n\r\n\r\nETYMOLOGY.',
 '(Supplied by a Late Consumptive Usher to a Grammar School)\r\n\r\nThe pale Usher--threadbare in coat, heart, body, and brain; I see him\r\nnow.',
 'He was ever dusting his old lexicons and grammars, with a queer\r\nhandkerchief, mockingly embellished with all the gay flags of all the\r\nknown nations of the world.',
 'He loved to dust his old grammars; it\r\nsomehow mildly reminded him of his mortality.',
 '"While you take in hand to school others, and to teach them by what\r\nname a whale-fish is to be called in our tongue leaving out, through\r\nignorance, the letter H, which almost alone maketh the signification\r\nof the word, you deliver that which is not true."',
 '--HACKLUYT\r\n\r\n"WHALE.',
 '... Sw. and Dan.',
 'HVAL.',
 'This animal is named from roundness\r\nor rolling; for in Dan.',
 'HVALT is arched or vaulted."',
 '--WEBSTER\'S\r\nDICTIONARY\r\n\r\n"WHALE.',
 '...',
 'It is more immediately from the Dut.',
 '
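
For comparison, a purely rule-based splitter can be written in one line with `re.split`. The sketch below is a naive baseline made up for illustration, not what Punkt does; it breaks after every `.`, `!`, or `?` that is followed by whitespace, so abbreviations are split as well.

```python
import re


def naive_sentence_split(text):
    """Split after '.', '!' or '?' when whitespace follows (naive baseline)."""
    return re.split(r'(?<=[.!?])\s+', text)


sample = ("He loved to dust his old grammars. "
          "It reminded him of Mr. Stubb! Did it?")
sentences = naive_sentence_split(sample)
for sentence in sentences:
    print(repr(sentence))
```

The break after `Mr.` is the kind of error that abbreviation handling is meant to prevent; the output above shows that even the pretrained model still splits after abbreviations such as `Dan.` in this text.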