# Text Normalization

This notebook introduces text normalization in Python using NLTK and other NLP libraries.

It covers the following processes:

*   Data Exploration
*   Word Tokenization
*   Data-based Tokenization
*   Word Normalization
*   Sentence Segmentation

# Initialize NLTK

Download the resources that NLTK needs.

In [None]:
import nltk
nltk.download('book')

## Import the additional modules

The `re` module is Python's built-in regex module, while the `tokenizers` module is a [library from Hugging Face](https://github.com/huggingface/tokenizers).

In [None]:
import re
from tokenizers import ByteLevelBPETokenizer, BertWordPieceTokenizer

## Load the text data of interest

There are many ways to load text. If a text file exists, Python's `open` function is the easiest. For now, however, one of the predefined corpora in the NLTK library is used.

The first 500 characters of the text are shown as an initial view of the data.

In [None]:
TEXT_DATA = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')
TEXT_DATA[:500]
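
For the text-file route mentioned above, a minimal sketch could look like the following. The file name `sample.txt` and the variable `text_from_file` are hypothetical; a temporary file is written first only so that the example can run on its own.

```python
import tempfile
from pathlib import Path

# Create a throwaway file so the example is self-contained.
# In practice the path would point at an existing text file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sample.txt"  # hypothetical file name
    path.write_text("Call me Ishmael.\n", encoding="utf-8")

    # Read the whole file into a single string, as done with the corpus above.
    with open(path, encoding="utf-8") as f:
        text_from_file = f.read()

print(text_from_file[:500])
```

Passing `encoding` explicitly avoids depending on the platform default.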

## Data Exploration

Regular expressions can be used to look at different aspects of the data. The examples below are deliberately naive: they show the power of regular expressions, but they can certainly be extended to be as complex as needed. For the most part, they are only meant to give an overview of the data.

### Past Tenses

Below is a very naive way of checking for words in the past tense: it matches any word that ends in `ed`.

In [None]:
past_tenses = re.findall(r'(?<=\W)[A-Za-z-]+ed(?=\W)', TEXT_DATA)
past_tenses[:20]
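
To see why this check is naive, the same pattern can be run on a short, made-up sentence (the sample string is an assumption, not taken from the corpus). Any word ending in `ed` is matched, whether or not it is a verb.

```python
import re

# Hypothetical sample; the surrounding spaces give every word a
# non-word character on both sides, as the lookarounds require.
sample = " He walked to the red shed and jumped. "

pattern = r'(?<=\W)[A-Za-z-]+ed(?=\W)'
matches = re.findall(pattern, sample)
print(matches)
```

Here `red` and `shed` are matched alongside `walked` and `jumped`, which is the kind of false positive to expect in the corpus output as well.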

### Proper Names

Below is a naive way of looking for proper names in the text. The pattern tries to skip capitalized words at the start of a sentence, but some of them will still be matched by mistake.

In [None]:
proper_names = re.findall(r'(?<![\s.!]\W)(?<=\W)[A-Z][a-z-]+(?=\W)', TEXT_DATA)
proper_names[:20]
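
A small, made-up example (the sample text is an assumption) shows both what the pattern catches and where it goes wrong. The negative lookbehind rejects a word that follows `.`, `!`, or whitespace plus one more non-word character, but it does not know about `?`.

```python
import re

# Hypothetical sample text, not taken from the corpus.
sample = ("Call me Ishmael. Some years ago I met Queequeg. "
          "Was he kind? Yes, and Ahab was not. ")

pattern = r'(?<![\s.!]\W)(?<=\W)[A-Z][a-z-]+(?=\W)'
names = re.findall(pattern, sample)
print(names)
```

`Some` and `Was` are skipped because they follow a period and a space, while `Yes` slips through because it follows a question mark. `Call` is skipped only because nothing precedes it in the string.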

### Words without Vowels

Below is a naive way to check which words do not have vowels in them.

In [None]:
words_without_vowels = re.findall(r'(?<=\W)[^aeiouAEIOU\W]+(?=\W)', TEXT_DATA)
words_without_vowels[:20]
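
Running the same pattern on a short, made-up sentence (an assumption, not corpus text) makes its behavior easier to see. The character class excludes vowels and non-word characters, so digits and abbreviations also count as "words without vowels", and `y` is treated as a consonant.

```python
import re

# Hypothetical sample text, not taken from the corpus.
sample = " Why fly by night in 1851, Mr. Starbuck? "

pattern = r'(?<=\W)[^aeiouAEIOU\W]+(?=\W)'
no_vowels = re.findall(pattern, sample)
print(no_vowels)
```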

## Word Tokenization

For this part, both Penn Treebank tokenization and regex tokenization are shown.

### Penn Treebank Tokenization

In [None]:
ptb_tokenizer = nltk.tokenize.treebank.TreebankWordTokenizer()
ptb_tokens = ptb_tokenizer.tokenize(TEXT_DATA)
ptb_tokens[:20]

### Regex Tokenization

In [None]:
regex_tokenizer_pattern = \
    r'''(?x)                 # set flag to allow verbose regexps
        (?:[A-Z]\.)+         # abbreviations, e.g. U.S.A.
        | \w+(?:-\w+)*       # words with optional internal hyphens
        | \$?\d+(?:\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
        | \.\.\.             # ellipsis
        | [][.,;"'?():_`-]   # these are separate tokens; includes ], [
    '''
regex_tokens = nltk.regexp_tokenize(TEXT_DATA, regex_tokenizer_pattern)
regex_tokens[:20]
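
The effect of each alternative is easier to see on a short string. The sketch below applies the same alternatives to two made-up sentences, with `re.findall` standing in for `nltk.regexp_tokenize` (a reasonable substitute here only because the pattern has no capturing groups). In this copy the hyphen is placed last in the punctuation class so that it is read as a literal character. The name `token_pattern` and both sample strings are assumptions.

```python
import re

token_pattern = r'''(?x)     # verbose mode: whitespace and comments are ignored
    (?:[A-Z]\.)+         # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*       # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%? # currency and percentages
    | \.\.\.             # ellipsis
    | [][.,;"'?():_`-]   # punctuation as separate tokens
'''

# Hypothetical sample sentences.
tokens = re.findall(token_pattern, "That U.S.A. poster-print costs $12.40...")
plain = re.findall(token_pattern, "It costs 12.40 now")

print(tokens)
print(plain)
```

The alternatives are tried in order. `$12.40` stays together because `\w+` cannot start at `$`, but a bare `12.40` is split into `12`, `.`, and `40` because the word alternative matches the digits first.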

## Data-based Tokenization

This part shows both Byte Pair Encoding and Word Piece Encoding.

### Byte Pair Encoding

In [None]:
bl_tokenizer = ByteLevelBPETokenizer(lowercase=True)
bl_tokenizer.train_from_iterator([TEXT_DATA])
bl_tokens = bl_tokenizer.encode(TEXT_DATA).tokens
bl_tokens[:20]
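
To give an intuition for what training does, here is a toy, character-level sketch of the merge-learning idea behind Byte Pair Encoding. It is not the `tokenizers` implementation, which works on bytes and has its own training options; the function name `learn_bpe_merges` and the word list are assumptions made for illustration.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the vocabulary.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair (ties go to the first one seen).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Hypothetical tiny corpus.
merges, vocab = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=3)
print(merges)
print(dict(vocab))
```

After three merges the frequent word `low` has become a single symbol, while the rarer `lower` and `lowest` are still split into smaller pieces.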

### Word Piece Encoding

In [None]:
wp_tokenizer = BertWordPieceTokenizer(lowercase=True)
wp_tokenizer.train_from_iterator([TEXT_DATA])
wp_tokens = wp_tokenizer.encode(TEXT_DATA).tokens
wp_tokens[:20]

## Word Normalization

In this section, several word normalization techniques are shown: case folding, the Porter stemmer, and the WordNet lemmatizer. From this point on, the Penn Treebank tokens are used.

### Case Folding

In [None]:
cf_ptb_tokens = [w.lower() for w in ptb_tokens]
cf_ptb_tokens[:20]
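
For English text `str.lower()` is usually enough. Python also provides `str.casefold()`, a more aggressive variant intended for caseless matching. The short sketch below, on made-up tokens, shows where the two differ.

```python
# Hypothetical tokens; the last one contains the German sharp s.
tokens = ["The", "WHALE", "Straße"]

lowered = [t.lower() for t in tokens]
folded = [t.casefold() for t in tokens]

print(lowered)
print(folded)
```

The two results differ only for `Straße`, which `casefold()` maps to `strasse`.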

### Stemming using Porter

In [None]:
stemmer = nltk.PorterStemmer()
ps_ptb_tokens = [stemmer.stem(w) for w in ptb_tokens]
ps_ptb_tokens[:20]

### Lemmatize using WordNet

In [None]:
lemmatizer = nltk.WordNetLemmatizer()
wnl_ptb_tokens = [lemmatizer.lemmatize(w) for w in ptb_tokens]
wnl_ptb_tokens[:20]

## Sentence Segmentation

This part shows how to do sentence segmentation using the Punkt system. NLTK provides a pretrained Punkt model through its `sent_tokenize` function.

In [None]:
sent_seg = nltk.sent_tokenize(TEXT_DATA)
sent_seg[:20]
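
For comparison, a naive splitter that breaks after every `.`, `!`, or `?` followed by whitespace is sketched below on a made-up string. It is not part of NLTK; it only illustrates the kind of mistake that an abbreviation-aware approach such as Punkt is designed to avoid.

```python
import re

# Hypothetical sample text.
sample = "Call me Ishmael. Mr. Starbuck was the chief mate. Who was he?"

# Split on whitespace that directly follows sentence-final punctuation.
naive_sentences = re.split(r'(?<=[.!?])\s+', sample)
print(naive_sentences)
```

The abbreviation `Mr.` ends up as a "sentence" of its own, because the rule cannot tell an abbreviation period from a sentence-final one.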