In [24]:
import nltk
from nltk.corpus import gutenberg
from pprint import pprint

# Text Tokenization
- Tokens => Independent and minimal textual components that have some definite syntax and semantics.

NOTE: This minimal textual component can be sentences, clauses, phrases, or words depending on what you want.

- Tokenization => The process of breaking down or splitting textual data into tokens.
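
As a rough illustration of how the chosen token unit changes the result, the sketch below splits one string into sentence-level and word-level tokens using only Python's standard library. The two split rules are simplifying assumptions made for this example; the NLTK tokenizers used in the rest of this notebook handle many more cases.

```python
import re

text = "Python is powerful. It is also easy to read!"

# Sentence-level tokens: split on whitespace that follows ., ! or ?
sentence_tokens = re.split(r'(?<=[.!?])\s+', text)

# Word-level tokens: runs of word characters, punctuation dropped
word_tokens = re.findall(r'\w+', text)

print(sentence_tokens)
print(word_tokens)
```

The same text yields two sentence tokens or nine word tokens, depending on which unit we decide to treat as minimal.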

## A. Sentence Tokenization

In [25]:
alice = gutenberg.raw(fileids='carroll-alice.txt')

print(alice[0:500])

[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy an


In [26]:
sample_text = '''We will discuss briefly about the basic syntax, structure and design philosophies. There is a defined hierarchical syntax for Python codewhich you should remember when writing code! Python is a really powerful programming language!'''

print(sample_text)

We will discuss briefly about the basic syntax, structure and design philosophies. There is a defined hierarchical syntax for Python codewhich you should remember when writing code! Python is a really powerful programming language!


**The nltk.sent_tokenize Function**

> The nltk.sent_tokenize function is the default sentence tokenization function that nltk recommends.

> It uses an instance of the PunktSentenceTokenizer class internally. This instance is not created from scratch: it is loaded from a pre-trained model, and pre-trained models are available for many popular languages besides English.

In [27]:
# Using nltk.sent_tokenize function
from nltk import sent_tokenize
# or
# from nltk.tokenize import sent_tokenize

alice_sentences = sent_tokenize(text=alice)
sample_sentences = sent_tokenize(text=sample_text)

In [28]:
print('Total sentences in sample_text:', len(sample_sentences))
print('Sample text sentences :')
pprint(sample_sentences)

Total sentences in sample_text: 3
Sample text sentences :
['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python codewhich you should '
 'remember when writing code!',
 'Python is a really powerful programming language!']


In [29]:
print('Total sentences in alice:', len(alice_sentences))
print('First 5 sentences in alice:')
pprint(alice_sentences[0:5])

Total sentences in alice: 1625
First 5 sentences in alice:
["[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I.",
 'Down the Rabbit-Hole\n'
 '\n'
 'Alice was beginning to get very tired of sitting by her sister on the\n'
 'bank, and of having nothing to do: once or twice she had peeped into the\n'
 'book her sister was reading, but it had no pictures or conversations in\n'
 "it, 'and what is the use of a book,' thought Alice 'without pictures or\n"
 "conversation?'",
 'So she was considering in her own mind (as well as she could, for the\n'
 'hot day made her feel very sleepy and stupid), whether the pleasure\n'
 'of making a daisy-chain would be worth the trouble of getting up and\n'
 'picking the daisies, when suddenly a White Rabbit with pink eyes ran\n'
 'close by her.',
 'There was nothing so VERY remarkable in that; nor did Alice think it so\n'
 "VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!",
 'Oh dear!']


**Using a pre-trained tokenization model with PunktSentenceTokenizer**

In [30]:
from nltk.corpus import europarl_raw

german_text = europarl_raw.german.raw(fileids='ep-00-01-17.de')
print("Total characters in the corpus: ", len(german_text))
print("First 100 characters in the corpus\n", german_text[:100])

Total characters in the corpus:  157171
First 100 characters in the corpus
  
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sit


In [31]:
# Using pre-trained tokenization

german_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/german.pickle')
german_sentences = german_tokenizer.tokenize(german_text)

print(german_sentences[:5])

[' \nWiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .', 'Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .', 'Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .', 'Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .', 'Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .']


In [32]:
# Note: We can use sent_tokenize too, by passing the language

german_sentences_st = sent_tokenize(text=german_text, language='german')

print(german_sentences_st[:5])

[' \nWiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .', 'Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .', 'Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .', 'Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .', 'Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .']


In [33]:
# Make sure both results are the same

german_sentences_st == german_sentences

True

**Using RegexpTokenizer**

In [34]:
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'

from nltk.tokenize import RegexpTokenizer

regex_st = RegexpTokenizer(pattern=SENTENCE_TOKENS_PATTERN,
                           gaps=True)
sample_sentences = regex_st.tokenize(sample_text)
pprint(sample_sentences)

['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python codewhich you should '
 'remember when writing code!',
 'Python is a really powerful programming language!']
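
To see what the negative lookbehinds in SENTENCE_TOKENS_PATTERN contribute, the same pattern can be tried with the standard library's re.split and compared against a naive pattern that splits after every ., ? or !. The example text and the naive pattern are made up for this illustration.

```python
import re

SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
NAIVE_PATTERN = r'(?<=[.!?])\s'  # assumption: split after any ., ! or ?

demo_text = "Dr. Smith arrived at 9 a.m. sharp. He was late! Why?"

# The lookbehinds keep "Dr." and "a.m." attached to their sentence
guarded_split = re.split(SENTENCE_TOKENS_PATTERN, demo_text)

# Without them, abbreviations are treated as sentence ends
naive_split = re.split(NAIVE_PATTERN, demo_text)

print(guarded_split)
print(naive_split)
```

The guarded pattern returns three sentences, while the naive one breaks the first sentence at "Dr." and "a.m.".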


## B. Word Tokenization

**Using nltk.word_tokenize Function**

> It is the default and recommended word tokenizer as specified by nltk.

> Internally, this function relies on an instance of the TreebankWordTokenizer class and acts as a wrapper around that core class.

In [35]:
# Using nltk.word_tokenize function

from nltk import word_tokenize

sentence = "The brown fox wasn't that quick and he couldn't win the race."

words = word_tokenize(sentence)
print(words)

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race', '.']


**Using TreebankWordTokenizer**

> It's based on the Penn Treebank and uses various regular expressions to tokenize the text.

Read more about Penn Treebank: www.cis.upenn.edu/~treebank/tokenizer.sed

The main features of this tokenizer:
- Splits and separates out periods that appear at the end of a sentence.
- Splits and separates commas and single quotes when followed by whitespace.
- Most punctuation characters are split and separated into independent tokens.
- Splits words with standard contractions. For example, don't is split into do and n't.
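
The features listed above can be imitated, very loosely, with a few regular-expression substitutions. This is only a sketch covering three of the rules; the function name treebank_like_split is hypothetical, and the real TreebankWordTokenizer applies a much longer list of rules.

```python
import re

def treebank_like_split(sentence):
    # Detach the n't contraction from the word it is attached to
    sentence = re.sub(r"(\w)(n't)\b", r"\1 \2", sentence)
    # Separate commas that are followed by whitespace
    sentence = re.sub(r",(\s)", r" ,\1", sentence)
    # Separate a period at the very end of the sentence
    sentence = re.sub(r"\.$", " .", sentence)
    return sentence.split()

tokens = treebank_like_split(
    "The brown fox wasn't that quick, and he couldn't win the race.")
print(tokens)
```

Each substitution only inserts a space, so the final whitespace split produces the separated tokens.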

In [36]:
treebank_wt = nltk.TreebankWordTokenizer()
words = treebank_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race', '.']


NOTE: Since the word_tokenize function and TreebankWordTokenizer use the same tokenizing mechanism, their outputs are the same.

**Using RegexpTokenizer**

There are two main parameters:
- pattern ==> For building the tokenizer.
- gaps ==> If set to True, the pattern is used to find the gaps between tokens. Otherwise, the pattern is used to find the tokens themselves.
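
The difference between the two modes is comparable to the difference between re.findall and re.split in the standard library. The sketch below is an analogy under that assumption, not the RegexpTokenizer implementation itself.

```python
import re

sentence = "The brown fox wasn't that quick."

# Comparable to gaps=False: the pattern describes the tokens themselves
tokens_from_matches = re.findall(r'\w+', sentence)

# Comparable to gaps=True: the pattern describes the separators between tokens
tokens_from_gaps = re.split(r'\s+', sentence)

print(tokens_from_matches)
print(tokens_from_gaps)
```

Matching tokens with \w+ drops the apostrophe and the final period, while splitting on whitespace keeps them attached to their words.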


In [37]:
# Pattern to identify tokens themselves

TOKEN_PATTERN = r'\w+'
regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN,
                                gaps=False)

words = regex_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', 'wasn', 't', 'that', 'quick', 'and', 'he', 'couldn', 't', 'win', 'the', 'race']


In [38]:
# Pattern to identify gaps in tokens

GAP_PATTERN = r'\s+'
regex_wt = nltk.RegexpTokenizer(pattern=GAP_PATTERN,
                                gaps=True)

words = regex_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race.']


**WordPunctTokenizer & WhitespaceTokenizer**
> The WordPunctTokenizer uses the pattern r'\w+|[^\w\s]+' to tokenize sentences into independent alphabetic and non-alphabetic tokens.

> The WhitespaceTokenizer splits sentences into words on whitespace (tabs, newlines, and spaces).
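
Both behaviours can be approximated with the standard library alone, which makes the effect of the pattern easier to see. This is a sketch of the idea, assuming re.findall with the quoted pattern and str.split as stand-ins for the two NLTK classes.

```python
import re

sentence = "He couldn't win the race."

# Word-character runs and punctuation runs become separate tokens
word_punct_tokens = re.findall(r'\w+|[^\w\s]+', sentence)

# Splitting on whitespace leaves punctuation attached to the words
whitespace_tokens = sentence.split()

print(word_punct_tokens)
print(whitespace_tokens)
```

Note that \w also matches digits and underscores, so the first group of tokens is not strictly alphabetic.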

In [39]:
# WordPunctTokenizer

wordpunk_wt = nltk.WordPunctTokenizer()
words = wordpunk_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', 'wasn', "'", 't', 'that', 'quick', 'and', 'he', 'couldn', "'", 't', 'win', 'the', 'race', '.']


In [40]:
# WhitespaceTokenizer

whitespace_wt = nltk.WhitespaceTokenizer()
words = whitespace_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race.']


# Text Normalization

> The process of wrangling, cleaning, and standardizing textual data into a form that can be consumed as input by other NLP and analytics systems and applications.

NOTE: 
- Text normalization is also called text cleansing or text wrangling.
- Tokenization is a part of text normalization.

In [57]:
import nltk
import re
import string

corpus = ["The brown fox wasn't that quick and he couldn't win the race",
          "Hey that's a great deal! I just bought a phone for $199",
          "@@You'll (learn) a **lot** in the book. Python is an amazing language !@@"]

print(corpus)

["The brown fox wasn't that quick and he couldn't win the race", "Hey that's a great deal! I just bought a phone for $199", "@@You'll (learn) a **lot** in the book. Python is an amazing language !@@"]


**Cleaning Text**

> Often the textual data we want to use or analyze contains a lot of extraneous and unnecessary tokens and characters that should be removed before performing any further operations like tokenization or other normalization techniques.

For example:
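- Extract meaningful text from HTML => We can (1) use the clean_html() function from nltk (older releases only; in NLTK 3 it raises NotImplementedError and points to BeautifulSoup instead), or (2) parse the HTML data using BeautifulSoup.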
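
Neither option is demonstrated in this notebook. As a dependency-free sketch of the same idea, the example below uses the standard library's html.parser, which is a swapped-in technique and not clean_html() or BeautifulSoup. The class TextExtractor, the function strip_html, and the sample HTML are hypothetical names and data chosen for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ('script', 'style'):
            self._skip = False

    def handle_data(self, data):
        # Keep only non-empty text outside script/style blocks
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)

html_doc = ('<html><head><style>p {color: red;}</style></head>'
            '<body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>')
clean_text = strip_html(html_doc)
print(clean_text)
```

For real-world, malformed HTML a dedicated library such as BeautifulSoup is usually the more robust choice.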


**Tokenizing Text**

In [75]:
def tokenize_text(text):
    sentences = nltk.sent_tokenize(text)
    word_tokens = [nltk.word_tokenize(sentence) for sentence in sentences]
    return word_tokens

token_list = [tokenize_text(text) for text in corpus]

print(token_list)

[[['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']], [['Hey', 'that', "'s", 'a', 'great', 'deal', '!'], ['I', 'just', 'bought', 'a', 'phone', 'for', '$', '199']], [['@', '@', 'You', "'ll", '(', 'learn', ')', 'a', '*', '*', 'lot', '*', '*', 'in', 'the', 'book', '.'], ['Python', 'is', 'an', 'amazing', 'language', '!'], ['@', '@']]]


**Removing Special Characters**

- Special characters may be symbols or punctuation that occur in sentences.

- They are usually removed because punctuation and special characters often carry little significance when we analyze the text and extract features or information from it with NLP and ML.

- This step can be performed either before or after tokenization.

In [83]:
# Example removing special characters after tokenization

def remove_characters_after_tokenization(tokens):
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    filtered_tokens = list(filter(None, [pattern.sub('', token) for token in tokens]))
    # filter(None, ...) is used to remove any empty strings that may 
    #   result from removing all punctuation from a token.
    return filtered_tokens

filtered_list_1 = [list(filter(None, [remove_characters_after_tokenization(tokens) for tokens in sentence_tokens])) for sentence_tokens in token_list]

print(filtered_list_1)

[[['The', 'brown', 'fox', 'was', 'nt', 'that', 'quick', 'and', 'he', 'could', 'nt', 'win', 'the', 'race']], [['Hey', 'that', 's', 'a', 'great', 'deal'], ['I', 'just', 'bought', 'a', 'phone', 'for', '199']], [['You', 'll', 'learn', 'a', 'lot', 'in', 'the', 'book'], ['Python', 'is', 'an', 'amazing', 'language']]]


In [97]:
# Example removing special characters before tokenization

def remove_characters_before_tokenization(sentence,
                                          keep_apostrophes=False):
    
    # Removing leading and trailing whitespace from the sentence
    sentence = sentence.strip()
    
    # If keep_apostrophes is True, remove only specific special characters
    if keep_apostrophes:
        PATTERN = r'[?|$&*%@()~]'  # inside [...] every character is literal, so one | is enough
        filtered_sentence = re.sub(PATTERN, r'', sentence)
    else:
        # If keep_apostrophes is False, remove all non-alphanumeric characters except spaces
        PATTERN = r'[^a-zA-Z0-9 ]'
        filtered_sentence = re.sub(PATTERN, r'', sentence)

    # Return the cleaned sentence
    return filtered_sentence

filtered_list_2 = [remove_characters_before_tokenization(sentence) for sentence in corpus]
cleaned_corpus = [remove_characters_before_tokenization(sentence, True) for sentence in corpus]

print("Do not keep some apostrophes: ", filtered_list_2)
print()
print("Keep some apostrophes: ", cleaned_corpus)

Do not keep some apostrophes:  ['The brown fox wasnt that quick and he couldnt win the race', 'Hey thats a great deal I just bought a phone for 199', 'Youll learn a lot in the book Python is an amazing language ']

Keep some apostrophes:  ["The brown fox wasn't that quick and he couldn't win the race", "Hey that's a great deal! I just bought a phone for 199", "You'll learn a lot in the book. Python is an amazing language !"]


**Expanding Contractions**

Contractions are shortened versions of words or syllables.

Why contractions pose a problem for NLP and text analytics:
- Contractions contain a special apostrophe character.
- We could have two or more words represented by a contraction (e.g., you'll => you will or you shall). This opens a whole new can of worms when we try to tokenize this or even standardize the words.
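
The mapping approach used in the next cell can be pictured with a very small dictionary. The four entries below are a hypothetical, heavily shortened stand-in for the CONTRACTION_MAP kept in contractions.py, and the helper name expand is likewise an assumption. Note that the map has to commit to one expansion per contraction (here you'll => you will), which is exactly the ambiguity described above.

```python
import re

# Hypothetical sample; a real map needs many more entries
CONTRACTION_MAP_SAMPLE = {
    "wasn't": "was not",
    "couldn't": "could not",
    "that's": "that is",
    "you'll": "you will",
}

# One alternation over all (escaped) contraction keys, case-insensitive
pattern = re.compile('|'.join(re.escape(key) for key in CONTRACTION_MAP_SAMPLE),
                     flags=re.IGNORECASE)

def expand(sentence):
    def replace(match):
        matched = match.group(0)
        expanded = CONTRACTION_MAP_SAMPLE[matched.lower()]
        # Preserve the casing of the first character of the original text
        return matched[0] + expanded[1:]
    return pattern.sub(replace, sentence)

expanded = expand("You'll see that the fox wasn't quick")
print(expanded)
```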



In [108]:
# contractions.py is a local file that we create; it defines CONTRACTION_MAP.

from contractions import CONTRACTION_MAP

def expand_contractions(sentence, contraction_mapping):
    # Build one alternation pattern from all contraction keys
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        # Look up the exact match first, then fall back to its lowercase form
        expanded_contraction = (contraction_mapping.get(match)
                                or contraction_mapping.get(match.lower()))
        # Keep the casing of the first character of the original text
        expanded_contraction = first_char + expanded_contraction[1:]
        return expanded_contraction
    
    expanded_sentence = contractions_pattern.sub(expand_match, sentence)
    return expanded_sentence

expanded_corpus = [expand_contractions(sentence, CONTRACTION_MAP)
                   for sentence in cleaned_corpus]
print(expanded_corpus)

['The brown fox was not that quick and he could not win the race', 'Hey that is a great deal! I just bought a phone for 199', 'You will learn a lot in the book. Python is an amazing language !']


**Case Conversions**

Usually there are two types of case conversion: (1) lowercase and (2) uppercase.

In [109]:
# Lowercase
print(corpus[0].lower())
# Uppercase
print(corpus[0].upper())

the brown fox wasn't that quick and he couldn't win the race
THE BROWN FOX WASN'T THAT QUICK AND HE COULDN'T WIN THE RACE


**Correcting Words**

> The WordNet corpus helps to validate whether a word exists.

In [113]:
# Correcting Repeating Characters
from nltk.corpus import wordnet

old_word = 'finalllyyy'
step = 1

def remove_repeated_characters(tokens):
    repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
    match_substitution = r'\1\2\3'
    def replace(old_word):
        if wordnet.synsets(old_word):
            return old_word
        new_word = repeat_pattern.sub(match_substitution, old_word)
        return replace(new_word) if new_word != old_word else new_word

    correct_tokens = [replace(word) for word in tokens]
    return correct_tokens


# The same logic in iterative (non-recursive) form
# repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
# match_substitution = r'\1\2\3'
# while True:
#     # check for semantically correct word
#     if wordnet.synsets(old_word):
#         print ("Final correct word:", old_word)
#         break
#     # remove one repeated character
#     new_word = repeat_pattern.sub(match_substitution, old_word)
#     if new_word != old_word:
#         print('Step: {} Word: {}'.format(step, new_word))
#         step += 1 # update step
#         # update old word to last substituted state
#         old_word = new_word
#         continue
#     else:
#         print("Final word:", new_word)
#         break


sample_sentence = 'My schooool is realllllyyy amaaazingggg'
sample_sentence_tokens = tokenize_text(sample_sentence)[0]
print(sample_sentence_tokens)
print(remove_repeated_characters(sample_sentence_tokens))


['My', 'schooool', 'is', 'realllllyyy', 'amaaazingggg']
['My', 'school', 'is', 'really', 'amazing']


**Correcting Spellings**

Some algorithms:
- Peter Norvig Algorithm (https://norvig.com/spell-correct.html)
- PyEnchant (http://pythonhosted.org/pyenchant/)
- aspell-python

In [118]:
import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

corpus = """
This is a sample corpus of text. It contains a few sentences with words.
This is useful for demonstration purposes, but in practice, you would
want a much larger corpus to improve the accuracy of the spelling correction.
"""

WORDS = Counter(words(corpus))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))


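print(correction("speling"))    # "spelling" occurs in the sample corpus, so this is corrected
print(correction("korrectud"))  # "corrected" is not in the tiny corpus, so the input comes back unchanged
print(correction("bycycle"))    # "bicycle" is not in the corpus either
print(correction('fianlly'))    # nor is "finally"; a much larger corpus is needed for these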

spelling
korrectud
bycycle
fianlly


**Stemming**

- Stem => The base form of a word.
- Inflection => Attaching affixes to the base form to create new words.
- Stemming => Removing the affixes from an inflected word to recover its base form.

For example:

"jumping" ==> base: "jump" and affixes: "-ing"


Some stemmers:
- Porter Stemmer ==> Invented by Dr. Martin Porter
> The algorithm has five phases for reducing inflections to their stems, and each phase has its own set of rules.

- Porter 2 Stemmer ==> An improved version of the algorithm, suggested by Dr. Porter

- Lancaster Stemmer (or Paice/Husk Stemmer) ==> Invented by Chris D. Paice
> It is an iterative stemmer with over 120 rules, each specifying the removal or replacement of an affix to obtain the word stem.

- RegexpStemmer
> It uses regular expressions to identify the morphological affixes in words; any part of the string that matches is removed.

- SnowballStemmer ==> Supports stemming in over a dozen languages besides English (https://snowballstem.org/)


NOTE: The Porter Stemmer is the most frequently used, but the stemmer should be chosen based on the problem at hand, usually after some trial and error.
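
The regular-expression approach is the simplest of these to picture. The sketch below imitates it with the standard library: any suffix matching the pattern is removed, unless the word is shorter than a minimum length. The function name regex_stem and its parameter names are hypothetical; this is an illustration of the idea, not NLTK's RegexpStemmer code.

```python
import re

def regex_stem(word, pattern=r'ing$|s$|ed$', min_length=4):
    # Words shorter than min_length are returned untouched
    if len(word) < min_length:
        return word
    # Remove whichever suffix matches at the end of the word
    return re.sub(pattern, '', word)

stems = [regex_stem(w)
         for w in ['jumping', 'jumps', 'jumped', 'lying', 'strange', 'was']]
print(stems)
```

The result for lying shows the weakness of pure suffix stripping: the stem ly is not a real word, which is why rule-based stemmers such as Porter add many special cases.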

In [120]:
# Porter Stemmer

from nltk.stem import PorterStemmer

ps = PorterStemmer()

print(ps.stem('jumping'), ps.stem('jumps'), ps.stem('jumped'))
print(ps.stem('lying'))
print(ps.stem('strange'))

jump jump jump
lie
strang


In [123]:
# Lancaster Stemmer
from nltk.stem import LancasterStemmer

ls = LancasterStemmer()

print(ls.stem('jumping'), ls.stem('jumps'), ls.stem('jumped'))
print(ls.stem('lying'))
print(ls.stem('strange'))

jump jump jump
lying
strange


In [124]:
# Regex based
from nltk.stem import RegexpStemmer

rs = RegexpStemmer('ing$|s$|ed$', min=4)

print(rs.stem('jumping'), rs.stem('jumps'), rs.stem('jumped'))
print(rs.stem('lying'))
print(rs.stem('strange'))

jump jump jump
ly
strange


In [126]:
# Snowball Stemmer
from nltk.stem import SnowballStemmer
print('Supported Languages:', SnowballStemmer.languages)
print()

# Use the German stemmer

ss = SnowballStemmer("german")

# Stemming German words
# autobahnen -> highways
# autobahn -> highway
print(ss.stem('autobahnen'))

# springen -> to jump
# spring -> jump
print(ss.stem('springen'))

Supported Languages: ('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')

autobahn
spring


**Lemmatization**

> Lemmatization is very similar to stemming: you remove word affixes to get to a base form of the word. In this case, however, the base form is the root word, not the root stem.


NOTE:
The difference between root stem & root word:
- The root stem may not always be a lexicographically correct word; it may not be present in the dictionary.
- The root word (also known as the lemma) will always be present in the dictionary.

In [129]:
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

# lemmatize nouns
print(wnl.lemmatize('cars', 'n'))
print(wnl.lemmatize('men', 'n'))

# lemmatize verbs
print(wnl.lemmatize('running', 'v'))
print(wnl.lemmatize('ate', 'v'))

# lemmatize adjectives
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('fancier', 'a'))

# A wrong part of speech leaves the word unchanged
print(wnl.lemmatize('ate', 'n')) # 'ate' is a verb, not a noun
print(wnl.lemmatize('fancier', 'v')) # 'fancier' is an adjective, not a verb

car
men
run
eat
sad
fancy
ate
fancier


NOTE: 

> The preceding code uses the WordNetLemmatizer class, which internally uses the morphy() function belonging to the WordNetCorpusReader class. 

> It finds the base form, or lemma, of a given word from the word and its part of speech: it checks the WordNet corpus and recursively removes affixes from the word until a match is found in WordNet. If no match is found, the input word is returned unchanged.

> The part of speech is extremely important here: if it is wrong, the lemmatization will not be effective.
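
The lookup described in this note can be sketched with a toy dictionary. Everything in the example below is a hypothetical stand-in: TOY_LEXICON plays the role of WordNet, EXCEPTIONS holds irregular forms, and SUFFIX_RULES is a small subset of suffix replacements loosely modelled on the idea of affix removal. It is not the actual morphy() implementation.

```python
# Hypothetical miniature dictionary, keyed by part of speech
TOY_LEXICON = {'n': {'car', 'box', 'city'}, 'v': {'jump', 'bake', 'eat'}}
EXCEPTIONS = {'n': {'men': 'man'}, 'v': {'ate': 'eat'}}
SUFFIX_RULES = {
    'n': [('ies', 'y'), ('es', ''), ('s', '')],
    'v': [('ing', 'e'), ('ing', ''), ('ed', 'e'), ('ed', ''), ('s', '')],
}

def toy_lemmatize(word, pos):
    # A word that is already in the dictionary is its own lemma
    if word in TOY_LEXICON[pos]:
        return word
    # Irregular forms are resolved through the exception list
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    # Try each suffix rule and accept the first candidate found in the dictionary
    for suffix, replacement in SUFFIX_RULES[pos]:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            if candidate in TOY_LEXICON[pos]:
                return candidate
    # No match: return the input unchanged
    return word

print(toy_lemmatize('cities', 'n'), toy_lemmatize('baking', 'v'))
print(toy_lemmatize('ate', 'v'), toy_lemmatize('ate', 'n'))
```

With the verb tag, ate resolves to eat; with the noun tag, no rule or exception applies and the word comes back unchanged, which mirrors the behaviour seen in the cell above.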