### Text syntax and structure processing

Knowledge of the structure and syntax of language helps in many areas, such as text processing, annotation, and parsing, and it supports downstream tasks such as text classification or summarization. In this section, we implement some of the concepts and techniques used to understand text syntax and structure. These steps are usually carried out after the text has been pre-processed and wrangled.

In [1]:
import pandas as pd
import numpy as np
import nltk

In [2]:
import requests
data = requests.get('http://www.gutenberg.org/files/1399/1399-h/1399-h.htm')
content = data.content

In [3]:
import re
from bs4 import BeautifulSoup
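def strip_html_tags(text):
    """Return the visible text of an HTML document with line breaks collapsed."""
    soup = BeautifulSoup(text, "html.parser")
    # drop elements whose contents are not readable text
    for tag in soup(['iframe', 'script']):
        tag.extract()
    stripped_text = soup.get_text()
    # collapse runs of CR/LF into a single newline
    # (the earlier class [\r|\n|\r\n] also matched literal '|' characters)
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
    return stripped_text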

In [4]:
clean_content = strip_html_tags(content)
sample_text = clean_content[1932:2721]
sample_text2 = sample_text.replace("\n", " ")
sample_text2

'Happy families are all alike; every unhappy family is unhappy in its own way. Everything was in confusion in the Oblonskys’ house. The wife had discovered that the husband was carrying on an intrigue with a French girl, who had been a governess in their family, and she had announced to her husband that she could not go on living in the same house with him. This position of affairs had now lasted three days, and not only the husband and wife themselves, but all the members of their family and household, were painfully conscious of it. Every person in the house felt that there was no sense in their living together, and that the stray people brought together by chance in any inn had more in common with one another than they, the members of the family and household of the Oblonskys.'

In [11]:
### remove \n from text
sample_text2 = sample_text.replace("\n", " ")

SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
regex_st = nltk.tokenize.RegexpTokenizer(
            pattern=SENTENCE_TOKENS_PATTERN,
            gaps=True)
sample_sentences = regex_st.tokenize(sample_text2)
print('Total sentences in sample_text:', len(sample_sentences), '\n')
print('Sample text sentences : \n', np.array(sample_sentences))

Total sentences in sample_text: 5 

Sample text sentences : 
 ['Happy families are all alike; every unhappy family is unhappy in its own way.'
 'Everything was in confusion in the Oblonskys’ house.'
 'The wife had discovered that the husband was carrying on an intrigue with a French girl, who had been a governess in their family, and she had announced to her husband that she could not go on living in the same house with him.'
 'This position of affairs had now lasted three days, and not only the husband and wife themselves, but all the members of their family and household, were painfully conscious of it.'
 'Every person in the house felt that there was no sense in their living together, and that the stray people brought together by chance in any inn had more in common with one another than they, the members of the family and household of the Oblonskys.']
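
The lookbehind assertions in SENTENCE_TOKENS_PATTERN are what keep abbreviations and initials from ending a sentence. As a minimal sketch, the same pattern can be applied with the standard-library `re.split`; the example string is made up for illustration and is not taken from the novel.

```python
import re

# Same pattern as above: split on whitespace that follows '.', '?' or '!',
# unless the preceding characters look like an abbreviation or an initial.
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'

# Hypothetical example text, chosen to exercise the abbreviation check.
demo_text = "Dr. Watson arrived. Was Holmes there? Yes!"

demo_sentences = re.split(SENTENCE_TOKENS_PATTERN, demo_text)
print(demo_sentences)
# ['Dr. Watson arrived.', 'Was Holmes there?', 'Yes!']
```

The space after "Dr." is not treated as a boundary because the three characters before it match `[A-Z][a-z]\.`, which one of the negative lookbehinds rules out.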


In [37]:
### default nltk and spacy POS taggers

In [14]:
import pandas as pd
import spacy

sentence = "US unveils world's most powerful supercomputer, beats China."
# load the small English pipeline
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

def spacy_POS_tag(sentence):
    sentence_nlp = nlp(sentence)
    # POS tagging with Spacy
    spacy_pos_tagged = [(word, word.tag_, word.pos_) for word in sentence_nlp]
    return pd.DataFrame(spacy_pos_tagged, columns=['Word', 'POS tag', 'Tag type'])

In [17]:
spacy_POS_tag(sentence).T

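|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Word | US | unveils | world | 's | most | powerful | supercomputer | , | beats | China | . |
| POS tag | NNP | VBZ | NN | POS | RBS | JJ | NN | , | VBZ | NNP | . |
| Tag type | PROPN | VERB | NOUN | PART | ADV | ADJ | NOUN | PUNCT | VERB | PROPN | PUNCT |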


In [19]:
sample_sentences[0]

'Happy families are all alike; every unhappy family is unhappy in its own way.'

In [21]:
spacy_POS_tag(sample_sentences[0]).T

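|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Word | Happy | families | are | all | alike | ; | every | unhappy | family | is | unhappy | in | its | own | way | . |
| POS tag | JJ | NNS | VBP | DT | RB | : | DT | JJ | NN | VBZ | JJ | IN | PRP$ | JJ | NN | . |
| Tag type | ADJ | NOUN | AUX | DET | ADV | PUNCT | DET | ADJ | NOUN | AUX | ADJ | ADP | DET | ADJ | NOUN | PUNCT |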


In [23]:
# POS tagging with nltk
import nltk
nltk_pos_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
pd.DataFrame(nltk_pos_tagged, columns=['Word', 'POS tag']).T

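|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Word | US | unveils | world | 's | most | powerful | supercomputer | , | beats | China | . |
| POS tag | NNP | JJ | NN | POS | RBS | JJ | NN | , | VBZ | NNP | . |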


In [65]:
nltk_pos_tagged = nltk.pos_tag(nltk.word_tokenize(sample_sentences[0]))
pd.DataFrame(nltk_pos_tagged, columns=['Word', 'POS tag']).T

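|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Word | Happy | families | are | all | alike | ; | every | unhappy | family | is | unhappy | in | its | own | way | . |
| POS tag | JJ | NNS | VBP | DT | RB | : | DT | JJ | NN | VBZ | JJ | IN | PRP$ | JJ | NN | . |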


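Both taggers emit fine-grained tags in the Penn Treebank notation, which distinguishes specific forms of adjectives, nouns, and verbs (for example, VBZ for a third-person singular present verb and NNS for a plural noun). spaCy additionally reports a coarse tag type such as NOUN or VERB.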
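
To see where the two default taggers differ on the headline, the tag rows printed above can be compared token by token. This is a small sketch that simply re-enters the tags from the two outputs; the variable names are illustrative.

```python
# Tokens and fine-grained tags copied from the spaCy and NLTK outputs above.
words = ["US", "unveils", "world", "'s", "most", "powerful",
         "supercomputer", ",", "beats", "China", "."]
spacy_tags = ["NNP", "VBZ", "NN", "POS", "RBS", "JJ", "NN", ",", "VBZ", "NNP", "."]
nltk_tags = ["NNP", "JJ", "NN", "POS", "RBS", "JJ", "NN", ",", "VBZ", "NNP", "."]

# Keep only the tokens on which the two taggers disagree.
disagreements = [
    (word, s_tag, n_tag)
    for word, s_tag, n_tag in zip(words, spacy_tags, nltk_tags)
    if s_tag != n_tag
]
print(disagreements)
# [('unveils', 'VBZ', 'JJ')]
```

The only disagreement is on unveils, which NLTK labels as an adjective (JJ) and spaCy as a verb (VBZ). On the first sentence of the novel, the two tag rows shown above are identical.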

### Building POS taggers (only if necessary; otherwise use the default NLTK or spaCy taggers)
The default NLTK and spaCy taggers used above both follow the Penn Treebank notation for POS tagging, and in most cases they are the right choice.

We will now explore some techniques to build our own POS taggers, using classes provided by NLTK. To evaluate the performance of our taggers, we use test data from the treebank corpus in NLTK, and we train some of the taggers on a separate portion of the same corpus. To start with, we get the data for training and evaluating the taggers by reading in the tagged treebank corpus.


In [24]:
from nltk.corpus import treebank
data = treebank.tagged_sents()
train_data = data[:3500]
test_data = data[3500:]
print(train_data[0])

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]


We will use the test data to evaluate our taggers and see how they work on our sample sentence by using its tokens as input.

We will first look at the DefaultTagger, which inherits from the SequentialBackoffTagger base class and assigns the same user-supplied POS tag to every word. This might seem naïve, but it is a useful way to form a baseline POS tagger that we can then improve upon.


In [25]:
# default tagger
from nltk.tag import DefaultTagger
dt = DefaultTagger('NN')
# accuracy on test data
dt.evaluate(test_data)

0.1454158195372253

In [26]:
# tagging our sample headline
dt.tag(nltk.word_tokenize(sentence))

[('US', 'NN'),
 ('unveils', 'NN'),
 ('world', 'NN'),
 ("'s", 'NN'),
 ('most', 'NN'),
 ('powerful', 'NN'),
 ('supercomputer', 'NN'),
 (',', 'NN'),
 ('beats', 'NN'),
 ('China', 'NN'),
 ('.', 'NN')]

We can see from this output that we obtain only about 14.5% accuracy in correctly tagging words from the treebank test dataset, which is not great.
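
To make the number concrete, here is a minimal sketch of a token-level accuracy computation, assuming accuracy means the fraction of tokens whose predicted tag equals the gold tag. It uses the first treebank sentence printed earlier; `token_accuracy` is a hypothetical helper, not an NLTK function.

```python
# First sentence of the treebank training data, as printed above.
gold_sentence = [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
                 ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'),
                 ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'),
                 ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'),
                 ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]

def token_accuracy(predicted_tags, gold_tagged):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = sum(pred == gold for pred, (_, gold) in zip(predicted_tags, gold_tagged))
    return correct / len(gold_tagged)

# A default tagger predicts the same tag for every token.
always_nn = ['NN'] * len(gold_sentence)
print(round(token_accuracy(always_nn, gold_sentence), 3))
# 0.111 (only 'board' and 'director' out of 18 tokens are NN)
```

A default tagger can only be right on tokens whose gold tag happens to be the chosen default, which is why it works as a floor rather than as a usable tagger.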

We will now use regular expressions and the RegexpTagger to see if we can build a better performing tagger.


In [27]:
# regex tagger
from nltk.tag import RegexpTagger
# define regex tag patterns
patterns = [
        (r'.*ing$', 'VBG'),               # gerunds
        (r'.*ed$', 'VBD'),                # simple past
        (r'.*es$', 'VBZ'),                # 3rd singular present
        (r'.*ould$', 'MD'),               # modals
        (r'.*\'s$', 'NN$'),               # possessive nouns
        (r'.*s$', 'NNS'),                 # plural nouns
        (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
        (r'.*', 'NN')                     # nouns (default) ...
]
rt = RegexpTagger(patterns)

In [28]:
rt.evaluate(test_data)

0.24039113176493368

In [29]:
rt.tag(nltk.word_tokenize(sentence))

[('US', 'NN'),
 ('unveils', 'NNS'),
 ('world', 'NN'),
 ("'s", 'NN$'),
 ('most', 'NN'),
 ('powerful', 'NN'),
 ('supercomputer', 'NN'),
 (',', 'NN'),
 ('beats', 'NNS'),
 ('China', 'NN'),
 ('.', 'NN')]
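
The order of the patterns matters. The sketch below re-implements the lookup in plain Python, assuming that the tag of the first matching pattern wins (an assumption that is consistent with the output above); `first_match_tag` is a hypothetical helper, not part of NLTK.

```python
import re

# Same ordered patterns as above; the first pattern that matches decides the tag.
patterns = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*ould$', 'MD'),
    (r'.*\'s$', 'NN$'),
    (r'.*s$', 'NNS'),
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
    (r'.*', 'NN'),
]

def first_match_tag(token):
    """Return the tag of the first pattern that matches the token."""
    for pattern, tag in patterns:
        if re.match(pattern, token):
            return tag
    return None

for token in ['unveils', 'beats', 'passes', "'s", '29']:
    print(token, first_match_tag(token))
```

Here unveils and beats end in "s" but not in "es", so they fall through to the plural-noun rule, whereas passes is caught earlier by the `.*es$` rule. The possessive `'s` is tagged NN$ only because its rule is listed before `.*s$`.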

With the RegexpTagger, the accuracy on the test data has increased to about 24%, but can we do better? We will now train some n-gram taggers.

We will use the train_data as training data to train the n-gram taggers based on sentence tokens and their POS tags. Then we will evaluate the trained taggers on test_data and see the result upon tagging our sample sentence.


In [30]:
## N gram taggers
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger

ut = UnigramTagger(train_data)
bt = BigramTagger(train_data)
tt = TrigramTagger(train_data)

# testing performance of unigram tagger
print(ut.evaluate(test_data))
print(ut.tag(nltk.word_tokenize(sentence)))

0.8607803272340013
[('US', 'NNP'), ('unveils', None), ('world', 'NN'), ("'s", 'POS'), ('most', 'JJS'), ('powerful', 'JJ'), ('supercomputer', 'NN'), (',', ','), ('beats', None), ('China', 'NNP'), ('.', '.')]


In [32]:
# testing performance of bigram tagger
print(bt.evaluate(test_data))
print(bt.tag(nltk.word_tokenize(sentence)))

0.13466937748087907
[('US', None), ('unveils', None), ('world', None), ("'s", None), ('most', None), ('powerful', None), ('supercomputer', None), (',', None), ('beats', None), ('China', None), ('.', None)]


In [33]:
# testing performance of trigram tagger
print(tt.evaluate(test_data))
print(tt.tag(nltk.word_tokenize(sentence)))

0.08064672281924679
[('US', None), ('unveils', None), ('world', None), ("'s", None), ('most', None), ('powerful', None), ('supercomputer', None), (',', None), ('beats', None), ('China', None), ('.', None)]


This output shows that the unigram tagger alone reaches about 86% accuracy on the test set, which is far better than the regex tagger. It does, however, assign None to words it never saw in training, such as unveils and beats.
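
The idea behind a unigram tagger can be sketched in a few lines: remember the most frequent tag for each training word and look it up at tagging time. The functions and the two hand-tagged training sentences below are illustrative assumptions, not NLTK code or treebank data.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def unigram_tag(model, tokens):
    """Tag each token by lookup; words never seen in training get None."""
    return [(token, model.get(token)) for token in tokens]

# Tiny hypothetical training set (two hand-tagged sentences).
toy_train = [
    [('the', 'DT'), ('board', 'NN'), ('will', 'MD'), ('join', 'VB'), ('.', '.')],
    [('the', 'DT'), ('will', 'NN'), ('will', 'MD'), ('change', 'VB'), ('.', '.')],
]
model = train_unigram(toy_train)
print(unigram_tag(model, ['the', 'board', 'will', 'meet', '.']))
# [('the', 'DT'), ('board', 'NN'), ('will', 'MD'), ('meet', None), ('.', '.')]
```

The word will is seen twice as MD and once as NN, so the lookup settles on MD, and the unseen word meet is left untagged.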

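The accuracies of the bigram and trigram taggers are far lower because, without a backoff, they return None whenever the exact context (the word together with the tags of the preceding one or two tokens) was not observed in the training data. Once a token is tagged None, the contexts that follow it are unseen as well, so the rest of the sentence tends to be tagged None too, as the sample sentence shows.

We now look at an approach that combines all the taggers into a chain by using backoff taggers: each tagger falls back on its backoff tagger whenever it cannot tag an input token. In the code below, the loop wraps each new tagger around the previous one, so the tagger that is returned is the TrigramTagger, which backs off to the BigramTagger, then to the UnigramTagger, and finally to the regex tagger rt.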
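
The backoff idea itself does not depend on NLTK. As a rough sketch, the chain can be modelled as a list of functions that each return a tag or None; the lookup table and the two fallback rules below are simplified stand-ins chosen for illustration, not trained models.

```python
import re

# Hypothetical lookup table standing in for a trained unigram model.
UNIGRAM_MODEL = {'US': 'NNP', 'world': 'NN', "'s": 'POS', 'China': 'NNP', '.': '.'}

# Two fallback rules in the spirit of the regex tagger above.
FALLBACK_PATTERNS = [(r'.*s$', 'NNS'), (r'.*', 'NN')]

def lookup_tagger(token):
    """Return the stored tag, or None if the word was never seen."""
    return UNIGRAM_MODEL.get(token)

def regex_tagger(token):
    """Return the tag of the first fallback pattern that matches."""
    for pattern, tag in FALLBACK_PATTERNS:
        if re.match(pattern, token):
            return tag
    return None

def tag_with_backoff(tokens, taggers):
    """Try each tagger in order and keep the first tag that is not None."""
    tagged = []
    for token in tokens:
        tag = None
        for tagger in taggers:
            tag = tagger(token)
            if tag is not None:
                break
        tagged.append((token, tag))
    return tagged

print(tag_with_backoff(['US', 'beats', 'China', '.'], [lookup_tagger, regex_tagger]))
# [('US', 'NNP'), ('beats', 'NNS'), ('China', 'NNP'), ('.', '.')]
```

The most specific tagger goes first in the list, and the catch-all rule at the end guarantees that every token receives some tag.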


In [34]:
def combined_tagger(train_data, taggers, backoff=None):
    for tagger in taggers:
        backoff = tagger(train_data, backoff=backoff)
    return backoff

ct = combined_tagger(train_data=train_data,
                     taggers=[UnigramTagger, BigramTagger, TrigramTagger],
                     backoff=rt)
# evaluating the new combined tagger with backoff taggers
print(ct.evaluate(test_data))
print(ct.tag(nltk.word_tokenize(sentence)))

0.9094781682641108
[('US', 'NNP'), ('unveils', 'NNS'), ('world', 'NN'), ("'s", 'POS'), ('most', 'RBS'), ('powerful', 'JJ'), ('supercomputer', 'NN'), (',', ','), ('beats', 'NNS'), ('China', 'NNP'), ('.', '.')]


We now obtain an accuracy of about 91% on the test data, a clear improvement over each individual tagger. The new tagger also assigns a tag to every token in our sample sentence, although not all of the tags are correct: unveils and beats are tagged as plural nouns (NNS) when they should be verbs (VBZ).

For our final tagger, we will use a <b>supervised classification algorithm</b> to train our tagger. The ClassifierBasedPOSTagger class enables us to train a tagger by passing a supervised learning algorithm through the classifier_builder parameter.
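
A classifier-based tagger turns each token into a set of features and lets the classifier predict the tag from them. The sketch below shows what such a feature function might look like. The function name and the particular features (the word, its suffix, capitalization, and the previous tag) are illustrative assumptions, not the exact feature set used by ClassifierBasedPOSTagger.

```python
def token_features(tokens, index, previous_tags):
    """Build a feature dict for tokens[index] (hypothetical, simplified feature set)."""
    word = tokens[index]
    return {
        'word': word.lower(),
        'suffix3': word[-3:].lower(),
        'is_capitalized': word[:1].isupper(),
        'is_numeric': word.isdigit(),
        # tag already assigned to the previous token, or a start marker
        'prev_tag': previous_tags[index - 1] if index > 0 else '<START>',
    }

tokens = ['US', 'unveils', 'supercomputer']
features = token_features(tokens, 1, ['NNP'])
print(features)
# {'word': 'unveils', 'suffix3': 'ils', 'is_capitalized': False,
#  'is_numeric': False, 'prev_tag': 'NNP'}
```

Because features such as the suffix and the previous tag are available even for words that never appeared in training, a classifier can still make an informed guess where a pure lookup tagger would return None.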

In [35]:
from nltk.classify import NaiveBayesClassifier, MaxentClassifier
from nltk.tag.sequential import ClassifierBasedPOSTagger

nbt = ClassifierBasedPOSTagger(train=train_data,
                               classifier_builder=NaiveBayesClassifier.train)

# evaluate tagger on test data and sample sentence
print(nbt.evaluate(test_data))
print(nbt.tag(nltk.word_tokenize(sentence)))

0.9306806079969019
[('US', 'PRP'), ('unveils', 'VBZ'), ('world', 'VBN'), ("'s", 'POS'), ('most', 'JJS'), ('powerful', 'JJ'), ('supercomputer', 'NN'), (',', ','), ('beats', 'VBZ'), ('China', 'NNP'), ('.', '.')]


Using this tagger, we get an accuracy of about 93% on our test data, which is the highest of all our taggers. On the sample sentence, both verbs (unveils and beats) are now correctly tagged as VBZ, but the output is not error-free: US is tagged as a pronoun (PRP) and world as a past participle (VBN).