### Text Pre-Processing 

- Tokenization
- Tagging
- Chunking
- Stemming
- Lemmatization

### Sentence Tokenization
Sentence tokenization is the process of splitting a text corpus into sentences, which act as
the first level of tokens that make up the corpus. This is also known as sentence
segmentation.

We will primarily focus on the following sentence tokenizers:
- sent_tokenize
- PunktSentenceTokenizer
- RegexpTokenizer
- Pre-trained sentence tokenization models

In [1]:
import nltk
from nltk.corpus import gutenberg
from pprint import pprint

alice = gutenberg.raw(fileids='carroll-alice.txt')
sample_text = " We will discuss briefly about the basic syntax, structure and design philosophies. There is a defined hierarchical syntax for Python code which you should remember when writing code! Python is a really powerful programming language!"


In [2]:
# Total characters in Alice in Wonderland
print( len(alice))

144395


In [3]:
# First 100 characters in the corpus
print(alice[0:100])

[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was


The nltk.sent_tokenize function is the default sentence tokenization function that
nltk recommends. It uses an instance of the PunktSentenceTokenizer class internally.
However, this is not just a plain instance of that class: it is loaded from a model that
has been pre-trained for a given language, and such models are available for several
popular languages besides English.

In [4]:
default_st = nltk.sent_tokenize
alice_sentences = default_st(text=alice)
sample_sentences = default_st(text=sample_text)
print('Total sentences in sample_text:', len(sample_sentences))
print('Sample text sentences :-')
pprint(sample_sentences)
print('\nTotal sentences in alice:', len(alice_sentences))
print('First 5 sentences in alice:-')
pprint(alice_sentences[0:5])

Total sentences in sample_text: 3
Sample text sentences :-
[' We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember when writing code!',
 'Python is a really powerful programming language!']

Total sentences in alice: 1625
First 5 sentences in alice:-
["[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I.",
 'Down the Rabbit-Hole\n'
 '\n'
 'Alice was beginning to get very tired of sitting by her sister on the\n'
 'bank, and of having nothing to do: once or twice she had peeped into the\n'
 'book her sister was reading, but it had no pictures or conversations in\n'
 "it, 'and what is the use of a book,' thought Alice 'without pictures or\n"
 "conversation?'",
 'So she was considering in her own mind (as well as she could, for the\n'
 'hot day made her feel very sleepy and stupid), whether the pleasure\n'
 'of making a daisy-chain would be worth the 

In [5]:
# loading a German text corpus and inspecting it:

from nltk.corpus import europarl_raw

german_text = europarl_raw.german.raw(fileids='ep-00-01-17.de')
# Total characters in the corpus
print (len(german_text))
# First 100 characters in the corpus
print (german_text[0:100])

157171
 
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sit


In [6]:
german_sentences_def = default_st(text=german_text,language='german')

# loading german text tokenizer into a PunktSentenceTokenizer instance
german_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/german.pickle')
german_sentences = german_tokenizer.tokenize(german_text)

# verify the type of german_tokenizer
# should be PunktSentenceTokenizer
print (type(german_tokenizer))

<class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>


In [7]:
print(german_sentences_def == german_sentences)
# print first 5 sentences of the corpus
for sent in german_sentences[0:5]:
    print (sent)

True
 
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .
Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .
Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .
Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .
Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .


In [8]:
punkt_st = nltk.tokenize.PunktSentenceTokenizer()
sample_sentences = punkt_st.tokenize(sample_text)
pprint(sample_sentences)

[' We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember when writing code!',
 'Python is a really powerful programming language!']


#### RegexpTokenizer

In [9]:
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
regex_st = nltk.tokenize.RegexpTokenizer(pattern=SENTENCE_TOKENS_PATTERN,gaps=True)
sample_sentences = regex_st.tokenize(sample_text)
pprint(sample_sentences)

[' We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember when writing code!',
 'Python is a really powerful programming language!']
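
The pattern above splits on whitespace that follows a period, question mark, or exclamation mark, while its three negative lookbehinds block a split after strings such as `e.g.`, `Mr.`, or a single capital initial. As a minimal sketch, the same pattern can be exercised with the standard library's `re.split`, which roughly mirrors `RegexpTokenizer(..., gaps=True)`. The demo text is made up for illustration.

```python
import re

# Same pattern as in the cell above
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'

# Made-up text with abbreviations that should not end a sentence
demo_text = "Mr. Smith likes tokenizers, e.g. this one. Does it work? Yes!"

# Splitting on the gap pattern leaves the sentences themselves
demo_sentences = re.split(SENTENCE_TOKENS_PATTERN, demo_text)
print(demo_sentences)
```

The abbreviations survive inside their sentences, but any abbreviation shape the lookbehinds do not anticipate would still cause a false split, which is the main weakness of a purely regex-based sentence tokenizer.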


### Word Tokenization
Word tokenization is the process of splitting or segmenting sentences into their
constituent words. A sentence is a collection of words, and with tokenization we
essentially split a sentence into a list of words that can be used to reconstruct the
sentence. Word tokenization is very important in many processes, especially in cleaning
and normalizing text, where operations like stemming and lemmatization work on
each individual word to obtain its stem or lemma.

We will look at the following word tokenizers:
- word_tokenize
- TreebankWordTokenizer
- RegexpTokenizer
- Inherited tokenizers from RegexpTokenizer

The nltk.word_tokenize function is the default word tokenizer recommended
by nltk. Internally it is built on the TreebankWordTokenizer
class and acts as a wrapper around that core class. The following
snippet illustrates its usage:

#### word_tokenize

In [10]:
sentence = "The brown fox wasn't that quick and he couldn't win the race"

default_wt = nltk.word_tokenize
words = default_wt(sentence)
print (words)

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']


The TreebankWordTokenizer is based on the Penn Treebank and uses various regular
expressions to tokenize the text. Its main rules are:
- Splits and separates out periods that appear at the end of a sentence
- Splits and separates commas and single quotes when followed by whitespace
- Splits most punctuation characters into independent tokens
- Splits words with standard contractions, for example don't into do and n't
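
The contraction rule in the last bullet can be approximated with a single substitution. This is only a sketch of the idea using the standard library, not the tokenizer's actual rule set, and the helper name `split_negative_contractions` is hypothetical.

```python
import re

def split_negative_contractions(text):
    """Insert a space before n't so that wasn't becomes was n't."""
    return re.sub(r"(\w)(n't)\b", r"\1 \2", text)

demo = "The brown fox wasn't that quick and he couldn't win the race"

# After the substitution, a plain whitespace split yields Treebank-style tokens
demo_tokens = split_negative_contractions(demo).split()
print(demo_tokens)
```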

#### TreebankWordTokenizer

In [11]:
treebank_wt = nltk.TreebankWordTokenizer()
words = treebank_wt.tokenize(sentence)
print (words)

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']


word_tokenize() and TreebankWordTokenizer() follow the same tokenization mechanism, so both produce identical tokens for this sentence.

#### RegexpTokenizer
The RegexpTokenizer class can also be used to tokenize sentences into words. Remember, there are two main parameters that are useful
in tokenization: the regex pattern for building the tokenizer and the gaps parameter,
which, if set to True, makes the pattern match the gaps between the tokens. Otherwise, the pattern is used to
match the tokens themselves.

In [12]:
# pattern to identify tokens themselves
TOKEN_PATTERN = r'\w+'
regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN,gaps=False)
words = regex_wt.tokenize(sentence)
print (words)

['The', 'brown', 'fox', 'wasn', 't', 'that', 'quick', 'and', 'he', 'couldn', 't', 'win', 'the', 'race']


In [13]:
# pattern to identify gaps in tokens
GAP_PATTERN = r'\s+'
regex_wt = nltk.RegexpTokenizer(pattern=GAP_PATTERN,gaps=True)
words = regex_wt.tokenize(sentence)
print (words)

['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']


In [14]:
# get start and end indices of each token and then print them
word_indices = list(regex_wt.span_tokenize(sentence))
print (word_indices)
print([sentence[start:end] for start, end in word_indices])

[(0, 3), (4, 9), (10, 13), (14, 20), (21, 25), (26, 31), (32, 35), (36, 38), (39, 47), (48, 51), (52, 55), (56, 60)]
['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']


Two tokenizers inherit from RegexpTokenizer. The WordPunctTokenizer uses the pattern
r'\w+|[^\w\s]+' to tokenize sentences into independent alphabetic and non-alphabetic
tokens. The WhitespaceTokenizer tokenizes sentences into words based on whitespace
such as tabs, newlines, and spaces.

In [15]:
wordpunkt_wt = nltk.WordPunctTokenizer()
words = wordpunkt_wt.tokenize(sentence)
print (words)

['The', 'brown', 'fox', 'wasn', "'", 't', 'that', 'quick', 'and', 'he', 'couldn', "'", 't', 'win', 'the', 'race']


In [16]:
whitespace_wt = nltk.WhitespaceTokenizer()
words = whitespace_wt.tokenize(sentence)
print( words)

['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']
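
Both inherited tokenizers are thin wrappers around a fixed pattern, so their results for this sentence can be reproduced with the standard library alone. The sketch below is an approximation for illustration, not the NLTK implementation.

```python
import re

sentence = "The brown fox wasn't that quick and he couldn't win the race"

# Token pattern quoted above for WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace
wordpunct_like = re.findall(r'\w+|[^\w\s]+', sentence)

# Whitespace tokenization is equivalent to str.split() with no arguments
whitespace_like = sentence.split()

print(wordpunct_like)
print(whitespace_like)
```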


### Text Normalization
Text normalization is a series of steps used to wrangle, clean, and standardize textual
data into a form that can be consumed as input by other NLP and analytics systems and
applications. Tokenization itself is often part of text normalization. Other techniques
include cleaning text, case conversion, correcting spellings, removing stopwords and
other unnecessary terms, stemming, and lemmatization. Text normalization is also often
called text cleansing or wrangling.

In [17]:
import nltk
import re
import string
from pprint import pprint
corpus = ["The brown fox wasn't that quick and he couldn't win the race",
"Hey that's a great deal! I just bought a phone for $199",
"@@You'll (learn) a **lot** in the book. Python is an amazing language !@@"]

### Cleaning Text
Often the textual data we want to use or analyze contains a lot of extraneous and
unnecessary tokens and characters that should be removed before performing any
further operations like tokenization or other normalization techniques. This includes
extracting meaningful text from data sources like HTML, which contains
markup tags, or from XML and JSON feeds. There are many ways
to parse and clean this data. Older nltk releases offered a clean_html() function;
in NLTK 3 it is no longer implemented and points to the BeautifulSoup library instead,
so BeautifulSoup is the usual choice for parsing HTML. You can
also use your own custom logic, including regexes, XPath, and the lxml library, to parse
XML data. Getting data out of JSON is substantially easier because it has
an explicit key-value structure.
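
As a minimal sketch of tag removal that needs no third-party library, the standard library's `html.parser` can collect the text nodes of a document. The class and function names below (`TextExtractor`, `strip_html`) are hypothetical, and for real-world HTML a dedicated parser such as BeautifulSoup is likely the more robust option.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes and ignore the tags around them."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # join the text nodes and collapse runs of whitespace
    return ' '.join(' '.join(parser.parts).split())

html_doc = '<html><body><h1>Python</h1><p>An <b>amazing</b> language!</p></body></html>'
print(strip_html(html_doc))
```

Note that this sketch also keeps the contents of `<script>` and `<style>` elements, which a production cleaner would normally drop.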

### Tokenizing Text
Usually, we tokenize text before or after removing unnecessary characters and symbols
from the data. This choice depends on the problem you are trying to solve and the data
you are dealing with.

In [18]:
def tokenize_text(text):
    sentences = nltk.sent_tokenize(text)
    word_tokens = [nltk.word_tokenize(sentence) for sentence in sentences]
    return word_tokens

In [19]:
token_list = [tokenize_text(text) for text in corpus]
pprint(token_list)

[[['The',
   'brown',
   'fox',
   'was',
   "n't",
   'that',
   'quick',
   'and',
   'he',
   'could',
   "n't",
   'win',
   'the',
   'race']],
 [['Hey', 'that', "'s", 'a', 'great', 'deal', '!'],
  ['I', 'just', 'bought', 'a', 'phone', 'for', '$', '199']],
 [['@',
   '@',
   'You',
   "'ll",
   '(',
   'learn',
   ')',
   'a',
   '**lot**',
   'in',
   'the',
   'book',
   '.'],
  ['Python', 'is', 'an', 'amazing', 'language', '!'],
  ['@', '@']]]


### Removing Special Characters

One important task in text normalization involves removing unnecessary and special
characters. These may be special symbols or even punctuation that occurs in sentences.
This step can be performed before or after tokenization. The main reason for doing so is
that punctuation and special characters often carry little significance when we
analyze the text and use it to extract features or information based on NLP and ML.
We will implement both variants of special character removal, before and after tokenization.

In [20]:
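def remove_characters_after_tokenization(tokens):
    # character class matching any single punctuation character
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    # strip punctuation from every token, then drop tokens that became empty;
    # a list is returned because filter() is a lazy iterator in Python 3
    cleaned = (pattern.sub('', token) for token in tokens)
    filtered_tokens = [token for token in cleaned if token]
    return filtered_tokens

In [21]:
# clean every sentence and drop sentences that end up with no tokens
filtered_list_1 = [[cleaned for cleaned in
                    map(remove_characters_after_tokenization, sentence_tokens)
                    if cleaned]
                   for sentence_tokens in token_list]
print(filtered_list_1)

[[['The', 'brown', 'fox', 'was', 'nt', 'that', 'quick', 'and', 'he', 'could', 'nt', 'win', 'the', 'race']], [['Hey', 'that', 's', 'a', 'great', 'deal'], ['I', 'just', 'bought', 'a', 'phone', 'for', '199']], [['You', 'll', 'learn', 'a', 'lot', 'in', 'the', 'book'], ['Python', 'is', 'an', 'amazing', 'language']]]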

In [22]:
def remove_characters_before_tokenization(sentence, keep_apostrophes=False):
    sentence = sentence.strip()
    if keep_apostrophes:
        # inside a character class '|' is a literal character, not alternation,
        # so the characters are simply listed; add others here to remove them
        PATTERN = r'[?|$&*%@()~]'
        filtered_sentence = re.sub(PATTERN, r'', sentence)
    else:
        PATTERN = r'[^a-zA-Z0-9 ]' # keep only alphanumeric characters and spaces
        filtered_sentence = re.sub(PATTERN, r'', sentence)
    return filtered_sentence

In [23]:
filtered_list_2 = [remove_characters_before_tokenization(sentence) for sentence in corpus]
print (filtered_list_2)

['The brown fox wasnt that quick and he couldnt win the race', 'Hey thats a great deal I just bought a phone for 199', 'Youll learn a lot in the book Python is an amazing language ']


In [24]:
cleaned_corpus = [remove_characters_before_tokenization(sentence,keep_apostrophes=True) for sentence in corpus]
print(cleaned_corpus)

["The brown fox wasn't that quick and he couldn't win the race", "Hey that's a great deal! I just bought a phone for 199", "You'll learn a lot in the book. Python is an amazing language !"]


### Expanding Contractions
Contractions are shortened versions of words or syllables. They exist in both written and
spoken forms. Shortened versions of existing words are created by removing specific
letters and sounds. In English, contractions are often created by removing
one of the vowels from the word. Examples are is not to isn't and will not to won't,
where the apostrophe denotes the contraction and some
of the vowels and other letters are removed.



In [25]:
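# CONTRACTION_MAP is a dict mapping contractions to their expansions; it is
# assumed here to live in a local contractions.py module next to the notebook
from contractions import CONTRACTION_MAP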

def expand_contractions(sentence, contraction_mapping):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_sentence = contractions_pattern.sub(expand_match, sentence)
    return expanded_sentence

In [26]:
expanded_corpus = [expand_contractions(sentence, CONTRACTION_MAP) 
                    for sentence in cleaned_corpus]    
print (expanded_corpus)

['The brown fox was not that quick and he could not win the race', 'Hey that is a great deal! I just bought a phone for 199', 'You will learn a lot in the book. Python is an amazing language !']
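
The mapping itself is just a dictionary from contraction to expansion. The following self-contained sketch shows the same substitution idea with a deliberately tiny, hypothetical map (`TOY_CONTRACTION_MAP`); a realistic map would contain many more entries.

```python
import re

# Hypothetical, deliberately tiny mapping used only for illustration
TOY_CONTRACTION_MAP = {
    "wasn't": "was not",
    "couldn't": "could not",
    "that's": "that is",
    "you'll": "you will",
}

# One alternation over all known contractions, matched case-insensitively
toy_pattern = re.compile(
    '|'.join(re.escape(key) for key in TOY_CONTRACTION_MAP),
    flags=re.IGNORECASE)

def expand(sentence):
    def replace(match):
        text = match.group(0)
        expanded = TOY_CONTRACTION_MAP[text.lower()]
        # keep the original first character so capitalization is preserved
        return text[0] + expanded[1:]
    return toy_pattern.sub(replace, sentence)

print(expand("You'll see that the fox wasn't quick"))
```

Escaping the keys with `re.escape` is a defensive choice in this sketch; it matters only if a key ever contains a regex metacharacter.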


### Case Conversions
The most common case conversions are to lowercase and to uppercase, where a
body of text is converted completely to one or the other. There are other forms
too, such as sentence case or proper case. In lowercase all the letters of the
text are small letters, and in uppercase they are all capital letters.

In [27]:
# lower case
print (corpus[0].lower())

the brown fox wasn't that quick and he couldn't win the race


In [28]:
# upper case
print (corpus[0].upper())

THE BROWN FOX WASN'T THAT QUICK AND HE COULDN'T WIN THE RACE


### Removing Stopwords
Stopwords, sometimes written stop words, are words that have little or no significance.
They are usually removed from text during processing so as to retain the words that carry
the most significance and context.<br>

Words like a, the, me, and so on are stopwords. There is no universal or
exhaustive list of stopwords. Each domain or language may have its own set of stopwords.

In [29]:
# removing stopwords
def remove_stopwords(tokens):
    stopword_list = nltk.corpus.stopwords.words('english')
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    return filtered_tokens

In [30]:
expanded_corpus_tokens = [tokenize_text(text) for text in expanded_corpus]

In [31]:
filtered_list_3 =  [[remove_stopwords(tokens) 
                        for tokens in sentence_tokens] 
                        for sentence_tokens in expanded_corpus_tokens]
print (filtered_list_3)

[[['The', 'brown', 'fox', 'quick', 'could', 'win', 'race']], [['Hey', 'great', 'deal', '!'], ['I', 'bought', 'phone', '199']], [['You', 'learn', 'lot', 'book', '.'], ['Python', 'amazing', 'language', '!']]]
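
In the output above, 'The', 'I', and 'You' survive because the membership test is case-sensitive and the stopword list is lowercase. A minimal sketch of a case-insensitive variant follows; the small `TOY_STOPWORDS` set is a hypothetical stand-in for a real stopword list.

```python
# Hypothetical stand-in for a real stopword list (all lowercase)
TOY_STOPWORDS = {'the', 'a', 'an', 'i', 'you', 'he', 'and', 'that',
                 'was', 'is', 'not', 'will', 'in', 'for', 'just'}

def remove_stopwords_ci(tokens, stopwords=TOY_STOPWORDS):
    """Drop tokens whose lowercase form is a stopword."""
    return [token for token in tokens if token.lower() not in stopwords]

demo_tokens = ['The', 'brown', 'fox', 'was', 'not', 'that', 'quick']
print(remove_stopwords_ci(demo_tokens))
```

Using a set rather than a list also makes each membership test constant time on average, which helps on large corpora.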


### Correcting Words
One of the main challenges faced in text normalization is the presence of incorrect words
in the text. The definition of incorrect here covers words that have spelling mistakes as
well as words with several letters repeated that do not contribute much to their overall
significance. To illustrate, the word finally could be mistakenly written as
fianlly, or someone expressing intense emotion could write it as finalllllyyyyyy. The main
objective here is to standardize different forms of these words to the correct form
so that we do not end up losing vital information from different tokens in the text. This
section covers dealing with repeated characters as well as correcting spellings.

### Correcting Repeating Characters
The first step in our algorithm is to identify repeated characters in a word
using a regex pattern and then use a substitution to remove the characters one by one.
Consider the word finalllyyy from the earlier example. The pattern r'(\w*)(\w)\2(\w*)'
identifies a character that occurs twice in a row among the other characters in the
word. In each step we eliminate one of the repeated characters by
substituting the match with its regex match groups (groups 1, 2, and 3) using
the replacement r'\1\2\3', and we keep iterating until no repeated
characters remain. Note that this naive loop also collapses legitimate double letters,
which is why it ends at finaly rather than finally.

In [32]:
old_word = 'finalllyyy'
repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
match_substitution = r'\1\2\3'
step = 1
while True:
    # remove one repeated character
    new_word = repeat_pattern.sub(match_substitution,old_word)
    if new_word != old_word:
        print ('Step: {} Word: {}'.format(step, new_word))
        step += 1 # update step
        # update old word to last substituted state
        old_word = new_word
        continue
    else:
        print ("Final word:", new_word)
        break

Step: 1 Word: finalllyy
Step: 2 Word: finallly
Step: 3 Word: finally
Step: 4 Word: finaly
Final word: finaly


In [33]:
# removing repeated characters
sample_sentence = 'My schooool is realllllyyy amaaazingggg'
sample_sentence_tokens = tokenize_text(sample_sentence)[0]

In [34]:
from nltk.corpus import wordnet

def remove_repeated_characters(tokens):
    repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
    match_substitution = r'\1\2\3'
    def replace(old_word):
        if wordnet.synsets(old_word):
            return old_word
        new_word = repeat_pattern.sub(match_substitution, old_word)
        return replace(new_word) if new_word != old_word else new_word
            
    correct_tokens = [replace(word) for word in tokens]
    return correct_tokens

print (remove_repeated_characters(sample_sentence_tokens)) 

['My', 'school', 'is', 'really', 'amazing']


In [35]:
# NOTE: correct() is assumed to be a word-level spelling corrector that is
# defined elsewhere; it is not shown in this section
def case_of(text):
    """Return the case function appropriate for text: upper, lower, title, or just str."""
    return (str.upper if text.isupper() else
            str.lower if text.islower() else
            str.title if text.istitle() else
            str)

def correct_match(match):
    """Spell-correct the word in match and preserve its upper/lower/title case."""
    word = match.group()
    return case_of(word)(correct(word.lower()))

def correct_text_generic(text):
    """Correct all the words within a text, returning the corrected text."""
    return re.sub('[a-zA-Z]+', correct_match, text)

In [36]:
# requires correct() to be defined first
correct_text_generic('fianlly')
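
The cell above calls a `correct()` function that is not defined in this section. A minimal sketch of such a function, in the style of the well-known edit-distance approach to spelling correction, is given below. The tiny `WORD_COUNTS` table is hypothetical; a usable corrector would build its counts from a large text corpus.

```python
from collections import Counter

# Hypothetical toy word-frequency table; real counts would come from a corpus
WORD_COUNTS = Counter({'finally': 5, 'final': 3, 'family': 2, 'the': 10})

LETTERS = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    """All strings that are one edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Subset of words that appear in the frequency table."""
    return {w for w in words if w in WORD_COUNTS}

def correct(word):
    """Most frequent known candidate within one edit, else the word itself."""
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: WORD_COUNTS[w])

print(correct('fianlly'))
```

This sketch only looks one edit away, so words with two or more errors are returned unchanged.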

### Stemming
A word stem is often known as the base form of a
word, and we can create new words by attaching affixes to it in a process known as
inflection. The reverse of this is obtaining the base form of a word from its inflected form,
and this is known as stemming.<br>
Consider the word JUMP. You can add affixes to it and form new words like JUMPS,
JUMPED, and JUMPING. In this case, the base word JUMP is the word stem.
![](https://media.springernature.com/lw785/springer-static/image/chp%3A10.1007%2F978-1-4842-2388-8_3/MediaObjects/427287_1_En_3_Fig1_HTML.jpg)


#### PorterStemmer

In [37]:
# porter stemmer
from nltk.stem import PorterStemmer
ps = PorterStemmer()

print (ps.stem('jumping'), ps.stem('jumps'), ps.stem('jumped'))

jump jump jump


In [38]:
print (ps.stem('lying'))



lie


In [39]:
print (ps.stem('strange'))

strang


#### LancasterStemmer
The Lancaster stemmer is based on the Lancaster stemming algorithm, also often
known as the Paice/Husk stemmer, invented by Chris D. Paice. This stemmer is an iterative
stemmer that has over 120 rules, each specifying the removal or replacement of an affix to
obtain the word stem.

In [40]:
# lancaster stemmer
from nltk.stem import LancasterStemmer
ls = LancasterStemmer()

print (ls.stem('jumping'), ls.stem('jumps'), ls.stem('jumped'))


jump jump jump


In [41]:
print (ls.stem('lying'))



lying


In [42]:
print (ls.stem('strange'))

strange


#### RegexpStemmer
The RegexpStemmer uses regular expressions to identify the morphological
affixes in words, and any part of the string that matches the pattern is removed:

In [43]:
from nltk.stem import RegexpStemmer
rs = RegexpStemmer('ing$|s$|ed$', min=4)

print (rs.stem('jumping'), rs.stem('jumps'), rs.stem('jumped'))

jump jump jump


In [44]:
print (rs.stem('lying'))



ly


In [45]:
print (rs.stem('strange'))

strange
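
The behavior seen above can be approximated in a few lines of plain Python: words shorter than the minimum length are left alone, and otherwise every match of the pattern is deleted. This is a sketch of the idea under that assumption, not NLTK's source, and `regexp_stem` is a hypothetical name.

```python
import re

def regexp_stem(word, pattern=r'ing$|s$|ed$', min_length=4):
    """Strip any suffix matching pattern from words of at least min_length characters."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, '', word)

for demo_word in ['jumping', 'jumps', 'jumped', 'lying', 'strange', 'was']:
    print(demo_word, '->', regexp_stem(demo_word))
```

The result for lying shows the limitation of purely pattern-based stemming: the suffix is removed even when what remains is not a meaningful stem.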


#### SnowballStemmer
The SnowballStemmer supports stemming in several languages besides English. Its
languages attribute lists 16 entries, which include english itself and the original porter algorithm.

In [46]:
# snowball stemmer
from nltk.stem import SnowballStemmer
ss = SnowballStemmer("german")

print ('Supported Languages:', SnowballStemmer.languages)

Supported Languages: ('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


In [47]:
# autobahnen -> highways
# autobahn -> highway
ss.stem('autobahnen')


'autobahn'

In [48]:
# springen -> jumping
# spring -> jump
ss.stem('springen')

'spring'

### Lemmatization
The process of lemmatization is very similar to stemming: you remove word affixes to
get to a base form of the word. In this case, however, the base form is known as the root
word rather than the root stem. The difference is that the root stem may not always be a
lexicographically correct word; that is, it may not be present in the dictionary. The root
word, also known as the lemma, will always be present in the dictionary.<br>


The lemmatization process is considerably slower than stemming because an
additional step is involved: the root form or lemma is formed by removing the affix
from the word if and only if the lemma is present in the dictionary. The nltk package has
a robust lemmatization module that uses WordNet and the word's syntax and semantics,
such as its part of speech, to get the root word or lemma. Remember parts of speech
from Chapter 1? There were mainly three entities (nouns, verbs, and adjectives) that
occur most frequently in natural language.
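
The "if and only if the lemma is present in the dictionary" condition is what separates a lemmatizer from a stemmer. The toy sketch below makes that check explicit; the vocabulary (`TOY_VOCAB`) and suffix rules (`SUFFIX_RULES`) are hypothetical and far smaller than what WordNet provides.

```python
# Hypothetical miniature dictionary and suffix rules, for illustration only
TOY_VOCAB = {'car', 'jump', 'fancy', 'strange'}
SUFFIX_RULES = [('ies', 'y'), ('es', ''), ('s', ''), ('ing', ''), ('ed', '')]

def toy_lemmatize(word):
    """Strip a suffix only when the result is a dictionary word."""
    if word in TOY_VOCAB:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            if candidate in TOY_VOCAB:
                return candidate
    # no dictionary-backed candidate found: leave the word unchanged
    return word

for demo_word in ['cars', 'jumping', 'lying', 'strange']:
    print(demo_word, '->', toy_lemmatize(demo_word))
```

Unlike a suffix-stripping stemmer, this sketch leaves lying untouched, because ly is not in its dictionary.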

In [49]:
# lemmatization
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

# lemmatize nouns
print (wnl.lemmatize('cars', 'n'))
print (wnl.lemmatize('men', 'n'))

car
men


In [50]:
# lemmatize verbs
print (wnl.lemmatize('running', 'v'))
print (wnl.lemmatize('ate', 'v'))

run
eat


In [51]:
# lemmatize adjectives
print (wnl.lemmatize('saddest', 'a'))
print (wnl.lemmatize('fancier', 'a'))

sad
fancy


In [52]:
# ineffective lemmatization
print (wnl.lemmatize('ate', 'n'))
print (wnl.lemmatize('fancier', 'v'))

ate
fancier


### Understanding Text Syntax and Structure
In this section we will look at and implement some of the concepts and
techniques that are used for understanding text syntax and structure. This is extremely
useful in NLP and is usually done after text processing and normalization. We will focus
on implementing the following techniques:
- Parts of speech (POS) tagging
- Shallow parsing
- Dependency-based parsing
- Constituency-based parsing

### Important Machine Learning Concepts
We will be implementing and training some of our own taggers in the following section
using corpora and also leverage existing pre-built taggers. There are some important
concepts related to analytics and ML that you must know in order to better understand
the implementations:
- **Data preparation:** Usually consists of pre-processing the data before extracting features and training
- **Feature extraction:** The process of extracting useful features from raw data that are used to train machine learning models
- **Features:** Various useful attributes of the data (examples could be age, weight, and so on for personal data)
- **Training data:** A set of data points used to train a model
- **Testing/validation data:** A set of data points on which a pretrained model is tested and evaluated to see how well it performs
- **Model:** Built using a combination of data/features and a machine learning algorithm that could be supervised or unsupervised
- **Accuracy:** How well the model predicts something (there are also more detailed evaluation metrics like precision, recall, and F1-score)
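
For the taggers built in this section, accuracy means the fraction of test tokens whose predicted tag matches the gold tag. A minimal sketch of that computation follows, assuming a tagger is any function from a list of words to a list of (word, tag) pairs. The `tagging_accuracy` helper and the tiny `gold_sents` data are hypothetical.

```python
def tagging_accuracy(tagger_fn, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        predicted = tagger_fn(words)
        for (_, gold_tag), (_, predicted_tag) in zip(sent, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total

# Hypothetical gold-standard data: two tiny tagged sentences
gold_sents = [[('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
              [('a', 'DT'), ('cat', 'NN')]]

# Baseline that tags every word as a noun
def tag_all_nn(words):
    return [(word, 'NN') for word in words]

print(tagging_accuracy(tag_all_nn, gold_sents))
```

Two of the five tokens are nouns, so this baseline scores 0.4 on the toy data.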

### Parts of Speech (POS) Tagging
Parts of speech (POS) are specific lexical categories to which words are assigned based on their
syntactic context and role, the main
POS being noun, verb, adjective, and adverb. The process of classifying and labeling words with POS tags
is called parts of speech tagging or POS tagging. POS tags are used to annotate words
and depict their POS, which is really helpful when we need to use the same annotated text
later in NLP-based applications, because we can filter by specific parts of speech and use
that information for specific analyses, such as narrowing down to nouns and seeing
which ones are the most prominent, word sense disambiguation, and grammar analysis.

In [64]:
# reuse the sample sentence from the tokenization examples
sentence = "The brown fox wasn't that quick and he couldn't win the race"

# recommended tagger based on PTB
import nltk
tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens, tagset='universal')
print(tagged_sent)

[('The', 'DET'), ('brown', 'ADJ'), ('fox', 'NOUN'), ('was', 'VERB'), ("n't", 'ADV'), ('that', 'ADP'), ('quick', 'ADJ'), ('and', 'CONJ'), ('he', 'PRON'), ('could', 'VERB'), ("n't", 'ADV'), ('win', 'VERB'), ('the', 'DET'), ('race', 'NOUN')]
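
Once a sentence is tagged, filtering by part of speech is a plain list operation. The sketch below copies the tagged output shown above into a literal so that it runs without NLTK.

```python
from collections import Counter

# Tagged output copied from the cell above
tagged_sent = [('The', 'DET'), ('brown', 'ADJ'), ('fox', 'NOUN'), ('was', 'VERB'),
               ("n't", 'ADV'), ('that', 'ADP'), ('quick', 'ADJ'), ('and', 'CONJ'),
               ('he', 'PRON'), ('could', 'VERB'), ("n't", 'ADV'), ('win', 'VERB'),
               ('the', 'DET'), ('race', 'NOUN')]

# Keep only the words tagged as nouns
nouns = [word for word, tag in tagged_sent if tag == 'NOUN']

# Count how often each tag occurs in the sentence
tag_counts = Counter(tag for _, tag in tagged_sent)

print(nouns)
print(tag_counts.most_common(3))
```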


### Building Your Own POS Taggers


In [65]:
# building your own tagger

# preparing the data
from nltk.corpus import treebank
data = treebank.tagged_sents()
train_data = data[:3500]
test_data = data[3500:]
print (train_data[0])


[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]


In [66]:
"""the DefaultTagger , which inherits from the
SequentialBackoffTagger base class and assigns the same user input POS tag to each
word. This may seem really naïve, but it is an excellent way to form a baseline POS tagger
and improve upon it:"""

# default tagger
from nltk.tag import DefaultTagger
dt = DefaultTagger('NN')

print (dt.evaluate(test_data))

print (dt.tag(tokens))

0.1454158195372253
[('The', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('was', 'NN'), ("n't", 'NN'), ('that', 'NN'), ('quick', 'NN'), ('and', 'NN'), ('he', 'NN'), ('could', 'NN'), ("n't", 'NN'), ('win', 'NN'), ('the', 'NN'), ('race', 'NN')]


In [67]:
# regex tagger
from nltk.tag import RegexpTagger
# define regex tag patterns
patterns = [
        (r'.*ing$', 'VBG'),               # gerunds
        (r'.*ed$', 'VBD'),                # simple past
        (r'.*es$', 'VBZ'),                # 3rd singular present
        (r'.*ould$', 'MD'),               # modals
        (r'.*\'s$', 'NN$'),               # possessive nouns
        (r'.*s$', 'NNS'),                 # plural nouns
        (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
        (r'.*', 'NN')                     # nouns (default) ... 
]
rt = RegexpTagger(patterns)

print (rt.evaluate(test_data))
print (rt.tag(tokens))

0.24039113176493368
[('The', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('was', 'NNS'), ("n't", 'NN'), ('that', 'NN'), ('quick', 'NN'), ('and', 'NN'), ('he', 'NN'), ('could', 'MD'), ("n't", 'NN'), ('win', 'NN'), ('the', 'NN'), ('race', 'NN')]


The **UnigramTagger**, **BigramTagger**, and **TrigramTagger** are
classes that inherit from the base class **NgramTagger**, which itself inherits from the
**ContextTagger** class, which inherits from the **SequentialBackoffTagger** class. We will
use train_data as training data to train the n-gram taggers based on sentence tokens and
their POS tags.

In [69]:
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger

uni_gram = UnigramTagger(train_data)
bi_gram = BigramTagger(train_data)
tri_gram = TrigramTagger(train_data)


In [70]:
# testing performance of unigram tagger
print(uni_gram.evaluate(test_data))

0.8607803272340013


In [72]:
print (uni_gram.tag(tokens))

[('The', 'DT'), ('brown', None), ('fox', None), ('was', 'VBD'), ("n't", 'RB'), ('that', 'IN'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('could', 'MD'), ("n't", 'RB'), ('win', 'VB'), ('the', 'DT'), ('race', 'NN')]


In [73]:
# testing performance of bigram tagger
print(bi_gram.evaluate(test_data))

0.13466937748087907


In [74]:
print (bi_gram.tag(tokens))

[('The', 'DT'), ('brown', None), ('fox', None), ('was', None), ("n't", None), ('that', None), ('quick', None), ('and', None), ('he', None), ('could', None), ("n't", None), ('win', None), ('the', None), ('race', None)]


In [75]:
# testing performance of trigram tagger
print(tri_gram.evaluate(test_data))

0.08064672281924679


In [76]:
print (tri_gram.tag(tokens))

[('The', 'DT'), ('brown', None), ('fox', None), ('was', None), ("n't", None), ('that', None), ('quick', None), ('and', None), ('he', None), ('could', None), ("n't", None), ('win', None), ('the', None), ('race', None)]


#### Observation:
The preceding output shows that UnigramTagger alone reaches about 86 percent accuracy
on the test set, which is really good compared to our last tagger.
A None tag means the tagger could not tag that word because it never saw the token
in the training data. The bigram and trigram taggers score far lower (about 13 and
8 percent) because they look up each token together with the tags of the preceding one or
two tokens, and the exact contexts observed in the training data often do not recur in the
test data. Once one token is tagged None, the context for the next token is also
unseen, so the None tags cascade through the rest of the sentence.
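
The effect is easy to reproduce on a tiny scale. The following sketch trains a unigram and a bigram tagger on a hand-made, two-sentence training set (toy_train and the sample sentences are made up for illustration, not taken from the treebank data) and tags one sentence seen in training and one containing an unseen word.

```python
from nltk.tag import UnigramTagger, BigramTagger

# Hypothetical toy training set, for illustration only
toy_train = [
    [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
]

toy_unigram = UnigramTagger(toy_train)
toy_bigram = BigramTagger(toy_train)

seen = ['the', 'dog', 'barks']     # identical to a training sentence
unseen = ['the', 'fox', 'barks']   # 'fox' never appears in toy_train

# The unigram tagger only misses the unknown word itself
unigram_unseen = toy_unigram.tag(unseen)
print(unigram_unseen)

# The bigram tagger handles the training sentence...
bigram_seen = toy_bigram.tag(seen)
print(bigram_seen)

# ...but after the unknown word, the (previous tag, word) context for
# 'barks' is also unknown, so the None tag propagates
bigram_unseen = toy_bigram.tag(unseen)
print(bigram_unseen)
```

The unigram tagger loses only the unknown word, while the bigram tagger loses every token from the unknown word onward, which mirrors the sample sentence output above.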

In [77]:
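def combined_tagger(train_data, taggers, backoff=None):
    # Chain the taggers: each new tagger falls back to the one built before it
    for tagger in taggers:
        backoff = tagger(train_data, backoff=backoff)
    # The last tagger built is the entry point of the backoff chain
    return backoff

# Lookup order at tagging time: trigram -> bigram -> unigram -> rt
ct = combined_tagger(train_data=train_data,
                     taggers=[UnigramTagger, BigramTagger, TrigramTagger],
                     backoff=rt)

print(ct.evaluate(test_data))
print(ct.tag(tokens))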

0.9094781682641108
[('The', 'DT'), ('brown', 'NN'), ('fox', 'NN'), ('was', 'VBD'), ("n't", 'RB'), ('that', 'RB'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('could', 'MD'), ("n't", 'RB'), ('win', 'VB'), ('the', 'DT'), ('race', 'NN')]


#### Observation:
We now obtain an accuracy of about 91 percent on the test data, which is excellent. The
combined tagger also tags every token in our sample sentence, although a couple of the
tags are wrong; for example, brown should be an adjective (JJ), not a noun (NN).
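
To see why a backoff chain removes the None tags, here is a minimal sketch on a hand-made training set. The names toy_train, fallback, toy_uni, and toy_bi are made up for illustration, and DefaultTagger('NN') is only a stand-in for the rt tagger used above.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

# Hypothetical toy training set, for illustration only
toy_train = [
    [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
]

# Last resort: tag anything unknown as a noun (stand-in for rt)
fallback = DefaultTagger('NN')

# Bigram falls back to unigram, which falls back to the default tagger
toy_uni = UnigramTagger(toy_train, backoff=fallback)
toy_bi = BigramTagger(toy_train, backoff=toy_uni)

# 'fox' is unknown to both n-gram taggers, so the default tagger decides
chained_tags = toy_bi.tag(['the', 'fox', 'barks'])
print(chained_tags)
```

Every token now receives a tag, but the tag for an unknown word is only as good as the last tagger in the chain, which is why brown ends up as NN in the output above.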

In [78]:
from nltk.classify import NaiveBayesClassifier, MaxentClassifier
from nltk.tag.sequential import ClassifierBasedPOSTagger

nbt = ClassifierBasedPOSTagger(train=train_data,
                               classifier_builder=NaiveBayesClassifier.train)

# evaluate tagger on test data and sample sentence
print (nbt.evaluate(test_data))

print (nbt.tag(tokens) )

0.9306806079969019
[('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('was', 'VBD'), ("n't", 'RB'), ('that', 'IN'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('could', 'MD'), ("n't", 'RB'), ('win', 'VB'), ('the', 'DT'), ('race', 'NN')]


### Shallow Parsing
Shallow parsing, also known as light parsing or chunking, is a technique for analyzing the
structure of a sentence: it breaks the sentence down into its smallest constituents (tokens
such as words) and groups them into higher-level phrases. Shallow parsing focuses on
identifying these phrases, or chunks, rather than on the internal syntax and relations
inside each chunk, which grammar-based parse trees obtained from deep parsing expose.
The main objective of shallow parsing is to
obtain semantically meaningful phrases and observe relations among them.

### Building Your Own Shallow Parsers
We will use several techniques, such as regular expressions and tagging-based learners, to
build our own shallow parsers. As with POS tagging, we will train our parsers on some
training data where needed and evaluate all of them on some test data and on our sample
sentence. The treebank corpus is available in nltk with chunk annotations. We will load it
first and prepare our training and testing datasets using the following code snippet:


In [79]:
from nltk.corpus import treebank_chunk
data = treebank_chunk.chunked_sents()
train_data = data[:4000]
test_data = data[4000:]


In [80]:
print (train_data[7])

(S
  (NP A/DT Lorillard/NNP spokewoman/NN)
  said/VBD
  ,/,
  ``/``
  (NP This/DT)
  is/VBZ
  (NP an/DT old/JJ story/NN)
  ./.)


**Chinking** is the reverse of chunking: we specify which tokens we do not want to be
part of any chunk and then form the necessary chunks from the remaining tokens. Both
chunking and chinking for noun phrases can be illustrated on a simple sentence by
writing regular expressions over POS tags and passing them to the RegexpParser class.
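
A minimal sketch of both approaches follows. The sentence, its hand-written POS tags, the two grammars, and the np_chunks helper are assumptions made for illustration; only RegexpParser and its grammar syntax come from nltk.

```python
from nltk.chunk import RegexpParser

# Hand-tagged sample sentence (tags written manually for illustration)
tagged = [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'),
          ('jumped', 'VBD'), ('over', 'IN'),
          ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

# Chunking: state what a noun phrase looks like
# (optional determiner, any adjectives, one or more nouns)
chunk_parser = RegexpParser(r"NP: {<DT>?<JJ>*<NN.*>+}")
chunk_tree = chunk_parser.parse(tagged)

# Chinking: chunk everything, then cut out what must not be in a chunk
chink_grammar = r"""
NP:
    {<.*>+}        # chunk every token
    }<VBD|IN>+{    # chink runs of past-tense verbs and prepositions
"""
chink_parser = RegexpParser(chink_grammar)
chink_tree = chink_parser.parse(tagged)

def np_chunks(tree):
    """Return the words of each NP chunk in the tree (illustrative helper)."""
    return [[word for word, tag in subtree.leaves()]
            for subtree in tree.subtrees(lambda t: t.label() == 'NP')]

print(chunk_tree)
print(np_chunks(chunk_tree))
print(np_chunks(chink_tree))
```

On this sentence both grammars produce the same two noun phrases. Chunking works by describing what to keep, chinking by describing what to remove.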

### Constituency-based Parsing
Constituent-based grammars are used to analyze and determine the constituents
a sentence is usually composed of. Besides determining the constituents, another
important objective is to find out the internal structure of these constituents and see how
they link to each other. There are usually several rules for different types of phrases based
on the type of components they can contain, and we can use them to build parse trees.
Refer to the “Constituency Grammars” subsection under “Grammar” in the “Language
Syntax and Structure” section from Chapter 1 if you need to refresh your memory and
look at some examples of sample parse trees.

There are various types of parsing algorithms, including the following:
- Recursive Descent parsing
- Shift Reduce parsing
- Chart parsing
- Bottom-up parsing
- Top-down parsing
- PCFG parsing

**Shift Reduce parsing:** follows a bottom-up parsing approach in which it finds sequences
of tokens (words/phrases) that correspond to the right-hand side of a grammar production
and then replaces them with the left-hand side of that rule. This process continues until the
whole sentence is reduced to give us a parse tree.

**Chart parsing:** uses dynamic programming, which stores intermediate results and
reuses them when needed to get significant efficiency gains. In this case, chart parsers
store partial solutions and look them up when needed to get to the complete solution.
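
Both strategies can be tried on a toy grammar with nltk's ShiftReduceParser and ChartParser. The grammar and sentence below are made up for illustration and are far smaller than anything needed for real text. Note that nltk's shift-reduce parser is greedy and does not backtrack, so it can fail on grammars where an early reduction turns out to be wrong; this toy grammar avoids that case.

```python
from nltk.grammar import CFG
from nltk.parse import ChartParser, ShiftReduceParser

# Hypothetical toy grammar, for illustration only
toy_grammar = CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VBD NP
DT -> 'the'
NN -> 'fox' | 'race'
VBD -> 'won'
""")

sentence = ['the', 'fox', 'won', 'the', 'race']

# Shift-reduce: push tokens onto a stack and replace any right-hand side
# found on top of the stack with the production's left-hand side
sr_trees = list(ShiftReduceParser(toy_grammar).parse(sentence))

# Chart parsing: record partial analyses in a chart and reuse them
chart_trees = list(ChartParser(toy_grammar).parse(sentence))

for tree in sr_trees:
    print(tree)
print(len(chart_trees))
```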