- Text corpora and other textual data in their raw form are usually not well formatted or standardized, which is to be expected, since text data is highly unstructured.
- Text processing, or more specifically pre-processing, uses a variety of techniques to convert raw text into well-defined sequences of linguistic components that have a standard structure and notation.
- The most popular text pre-processing techniques, which we explore in this chapter, are:
    - Tokenization
    - Tagging
    - Chunking
    - Stemming
    - Lemmatization
- A robust text pre-processing system is an essential part of any NLP or text analytics application.
    - The primary reason is that the textual components obtained after pre-processing (words, phrases, sentences, or any other tokens) form the basic building blocks of input fed into the later stages of the application, which perform more complex analyses such as learning patterns and extracting information.

# 1. Text Tokenization
- __Tokens__ : independent and minimal textual components that have some definite syntax and semantics.
- Tokenization is the process of breaking down or splitting textual data into smaller, meaningful components called tokens.

## 1) Sentence Tokenization
- The process of splitting a text corpus into sentences, which act as the first level of tokens that make up the corpus.
- This is also known as sentence segmentation, because we try to segment the text into meaningful sentences.
- A text corpus is a body of text in which each paragraph comprises several sentences.

In [4]:
import nltk
from nltk.corpus import gutenberg

In [12]:
alice = gutenberg.raw(fileids='carroll-alice.txt')
type(alice), len(alice)

(str, 144395)

In [13]:
alice[:100]

"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I. Down the Rabbit-Hole\n\nAlice was"

In [14]:
sample_text = 'We will discuss briefly about the basic syntax, structure and \
design philosophies. There is a defined hierarchical syntax for Python code \
which you should remember when writing code! Python is a really powerful \
programming language!'

In [24]:
default_st = nltk.sent_tokenize

In [21]:
alice_sentences = default_st(alice)
alice_sentences[:5]

["[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I.",
 "Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversation?'",
 'So she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her.',
 "There was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, 'Oh dear!",
 'Oh dear!']

In [20]:
sample_sentences = default_st(sample_text)
sample_sentences

['We will discuss briefly about the basic syntax, structure and design philosophies.',
 'There is a defined hierarchical syntax for Python code which you should remember when writing code!',
 'Python is a really powerful programming language!']

In [25]:
# Total sentences
len(sample_sentences), len(alice_sentences)

(3, 1625)

- __The tokenizer does not rely on periods alone to delimit sentences; it also considers other punctuation and the capitalization of words.__
- We can also tokenize text in other languages.
    - For German text, we can either call _sent_tokenize_ with _language='german'_, which uses an already-trained model, or load a pre-trained German tokenization model into a _PunktSentenceTokenizer_ instance and perform the same operation.

In [28]:
# Other Language(German) example
from nltk.corpus import europarl_raw
german_text = europarl_raw.german.raw(fileids='ep-00-01-17.de')
len(german_text), german_text[:100]

(157171,
 ' \nWiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sit')

In [40]:
german_sentences_def = default_st(text=german_text, language='german')
german_sentences_def[:5]

[' \nWiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .',
 'Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .',
 'Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .',
 'Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .',
 'Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .']

In [45]:
# loading german text tokenizer into a PunktSentenceTokenizer instance
german_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/german.pickle')
# verify the type of german_tokenizer
# should be PunktSentenceTokenizer
type(german_tokenizer) 
# -> PunktSentenceTokenizer, which is specialized in dealing with the German language.

nltk.tokenize.punkt.PunktSentenceTokenizer

In [44]:
german_sentences = german_tokenizer.tokenize(german_text)
german_sentences[:5]

[' \nWiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .',
 'Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .',
 'Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .',
 'Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .',
 'Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .']

In [46]:
german_sentences_def == german_sentences

True

In [50]:
punkt_st = nltk.tokenize.PunktSentenceTokenizer()

In [51]:
sample_sentences = punkt_st.tokenize(sample_text)

In [52]:
sample_sentences

['We will discuss briefly about the basic syntax, structure and design philosophies.',
 'There is a defined hierarchical syntax for Python code which you should remember when writing code!',
 'Python is a really powerful programming language!']

- The RegexpTokenizer class can also tokenize text into sentences, using specific regular expression-based patterns to segment them.

In [55]:
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
regex_st = nltk.tokenize.RegexpTokenizer(pattern=SENTENCE_TOKENS_PATTERN, gaps=True)

In [57]:
sample_sentences = regex_st.tokenize(sample_text)
sample_sentences

['We will discuss briefly about the basic syntax, structure and design philosophies.',
 'There is a defined hierarchical syntax for Python code which you should remember when writing code!',
 'Python is a really powerful programming language!']
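
- To see what the negative lookbehinds in the pattern above contribute, here is a minimal sketch that uses only the standard _re_ module instead of nltk. The sample text and the NAIVE_PATTERN baseline are illustrative assumptions and are not part of the original notebook.

```python
import re

# Same pattern as SENTENCE_TOKENS_PATTERN in the cell above.
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
# Hypothetical baseline: split on any whitespace that follows ., ? or !
NAIVE_PATTERN = r'(?<=[.?!])\s'

# Illustrative text containing titles (Mr., Dr.) and an abbreviation (p.m.)
text = 'Mr. Fox met Dr. Who at 5 p.m. today. It went well! Did it?'

naive = re.split(NAIVE_PATTERN, text)               # breaks after Mr., Dr. and p.m.
guarded = re.split(SENTENCE_TOKENS_PATTERN, text)   # lookbehinds block those breaks

print(naive)
print(guarded)
```

- The three negative lookbehinds reject a split after patterns such as _p.m._, _Dr._, and single capital initials, so only the genuine sentence ends remain as split points.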

## 2) Word Tokenization
- Word tokenization is the process of splitting or segmenting sentences into their constituent words. 
- A sentence is a collection of words, and with tokenization we essentially split a sentence into a list of words that can be used to reconstruct the sentence. 
- Word tokenization is very important in many processes, especially in cleaning and normalizing text, where operations like stemming and lemmatization work on each individual word to derive its stem or lemma.

In [58]:
sentence = "The brown fox wasn't that quick and he couldn't win the race"

### i) word_tokenize
- The default and recommended word tokenizer in nltk.
- word_tokenize is a wrapper function: internally it delegates to a Treebank-style tokenizer (an instance of the __TreebankWordTokenizer__ class in the nltk version used here).

In [60]:
default_wt = nltk.word_tokenize
words = default_wt(sentence)
words

['The',
 'brown',
 'fox',
 'was',
 "n't",
 'that',
 'quick',
 'and',
 'he',
 'could',
 "n't",
 'win',
 'the',
 'race']

### ii) TreebankWordTokenizer
- based on the Penn Treebank and uses various regular expressions to tokenize the text. 
- Of course, one primary assumption here is that we have already performed sentence tokenization beforehand. 
- Some of the main features of this tokenizer include the following:
    - Splits and separates out periods that appear at the end of a sentence
    - Splits and separates commas and single quotes when they are followed by whitespace
    - Splits most punctuation characters into independent tokens
    - Splits words with standard contractions, for example don't into do and n't
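
- The contraction rule in the list above can be approximated with two regular expressions. This is a rough sketch under stated assumptions, not nltk's implementation: the function name split_contractions and the short clitic list are hypothetical, and the real tokenizer applies many more rules.

```python
import re

def split_contractions(text):
    """Hypothetical helper: split n't and a few common clitics off their host word."""
    # "wasn't" -> "was n't", "don't" -> "do n't"
    text = re.sub(r"(?i)(\w)(n't)\b", r"\1 \2", text)
    # "she's" -> "she 's", "they'll" -> "they 'll" (assumed, non-exhaustive clitic list)
    text = re.sub(r"(?i)(\w)('(?:s|re|ve|ll|d|m))\b", r"\1 \2", text)
    # After the rewrites, a plain whitespace split yields the tokens
    return text.split()

sentence = "The brown fox wasn't that quick and he couldn't win the race"
print(split_contractions(sentence))
```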

In [63]:
treebank_wt = nltk.TreebankWordTokenizer()
words = treebank_wt.tokenize(sentence)
words # output is similar to word_tokenize() because both use the same tokenizing mechanism.

['The',
 'brown',
 'fox',
 'was',
 "n't",
 'that',
 'quick',
 'and',
 'he',
 'could',
 "n't",
 'win',
 'the',
 'race']

### iii) RegexpTokenizer

In [64]:
# gaps: if set to True, the pattern is used to find the gaps between the tokens. Otherwise, it is used to find the tokens themselves.

In [67]:
# pattern to identify tokens themselves
TOKEN_PATTERN = r'\w+'
regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN, gaps=False)
words = regex_wt.tokenize(sentence)
words

['The',
 'brown',
 'fox',
 'wasn',
 't',
 'that',
 'quick',
 'and',
 'he',
 'couldn',
 't',
 'win',
 'the',
 'race']

In [68]:
# pattern to identify gaps in tokens
GAP_PATTERN = r'\s+'
regex_wt = nltk.RegexpTokenizer(pattern=GAP_PATTERN, gaps=True)
words = regex_wt.tokenize(sentence)
words

['The',
 'brown',
 'fox',
 "wasn't",
 'that',
 'quick',
 'and',
 'he',
 "couldn't",
 'win',
 'the',
 'race']
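
- The two modes of the gaps parameter correspond roughly to two familiar operations in the standard _re_ module: matching the tokens versus splitting on the separators. The sketch below is an analogy under that assumption, not a description of how RegexpTokenizer is implemented; the variable names are illustrative.

```python
import re

sentence = "The brown fox wasn't that quick and he couldn't win the race"

# Analogue of gaps=False: the pattern describes the tokens themselves
tokens_by_match = re.findall(r'\w+', sentence)

# Analogue of gaps=True: the pattern describes the separators between tokens;
# empty strings (from leading or trailing whitespace) are dropped
tokens_by_gap = [tok for tok in re.split(r'\s+', sentence) if tok]

print(tokens_by_match)
print(tokens_by_gap)
```

- Matching on \w+ breaks contractions at the apostrophe, while splitting on \s+ keeps them intact, which is the same difference seen in the two outputs above.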

In [69]:
# get start and end indices of each token
word_indices = list(regex_wt.span_tokenize(sentence))

In [70]:
word_indices

[(0, 3),
 (4, 9),
 (10, 13),
 (14, 20),
 (21, 25),
 (26, 31),
 (32, 35),
 (36, 38),
 (39, 47),
 (48, 51),
 (52, 55),
 (56, 60)]

In [71]:
[sentence[start:end] for start, end in word_indices]

['The',
 'brown',
 'fox',
 "wasn't",
 'that',
 'quick',
 'and',
 'he',
 "couldn't",
 'win',
 'the',
 'race']
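
- The same kind of (start, end) offsets can be obtained with the standard _re_ module, which may help clarify what span_tokenize returns. A small sketch, assuming whitespace-delimited tokens as in the cell above; the variable names are illustrative.

```python
import re

sentence = "The brown fox wasn't that quick and he couldn't win the race"

# Each match object knows where it starts and ends in the original string
spans = [match.span() for match in re.finditer(r'\S+', sentence)]

# Slicing the original string with the spans recovers the tokens
tokens = [sentence[start:end] for start, end in spans]

print(spans)
print(tokens)
```

- Keeping offsets alongside tokens is useful whenever results need to be mapped back to positions in the original text.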

### iv) WordPunctTokenizer
- Uses the pattern __r'\w+|[^\w\s]+'__ to tokenize sentences into independent alphanumeric and non-alphanumeric tokens.

In [74]:
wordpunkt_wt = nltk.WordPunctTokenizer() # pattern = r'\w+|[^\w\s]+'
words = wordpunkt_wt.tokenize(sentence)
words

['The',
 'brown',
 'fox',
 'wasn',
 "'",
 't',
 'that',
 'quick',
 'and',
 'he',
 'couldn',
 "'",
 't',
 'win',
 'the',
 'race']
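
- Applying the same pattern directly with _re.findall_ shows how each alternative behaves on mixed punctuation. The sample string is an illustrative assumption chosen to include repeated punctuation and a decimal number.

```python
import re

# Pattern quoted above: runs of word characters, or runs of
# characters that are neither word characters nor whitespace
WORD_PUNCT_PATTERN = r'\w+|[^\w\s]+'

text = "Hello, world!! It's $3.50."
tokens = re.findall(WORD_PUNCT_PATTERN, text)
print(tokens)
```

- Consecutive punctuation marks stay together as one token, while a decimal number such as 3.50 is split into three tokens because the period is not a word character.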

### v) WhitespaceTokenizer
- Splits sentences into words on whitespace: tabs, newlines, and spaces.

In [76]:
whitespace_wt = nltk.WhitespaceTokenizer()
words = whitespace_wt.tokenize(sentence)
words

['The',
 'brown',
 'fox',
 "wasn't",
 'that',
 'quick',
 'and',
 'he',
 "couldn't",
 'win',
 'the',
 'race']

# 2. Text Normalization

## 1) Cleaning Text

## 2) Tokenizing Text

## 3) Removing Special Characters

## 4) Expanding Contractions

## 5) Case Conversions

## 6) Removing Stopwords

## 7) Correcting Words

## 8) Stemming

## 9) Lemmatization

# 3. Understanding Text Syntax and Structure