In [1]:
### Extracting Information from Text

## The syntax and structure of a natural language such as English are governed by a set of specific rules, conventions, and principles which dictate how words are combined into phrases, phrases into clauses, and clauses into sentences.
## All these constituents exist together in any sentence and are related to each other in a hierarchical structure.
## Let’s consider a very basic example of language structure in terms of the subject and predicate relationship. Consider a simple sentence:

# Harry is playing football

## This sentence has a subject, Harry, and an object, football. To find them, it is easier to first find the verb and then ask “who” or “what” around it.
#In the above sentence, “playing” is the main verb of the predicate.

## If you ask “Who is playing?”, the answer is "Harry", which gives the subject, and “What is he playing?” gives us "football" as the object.
#An extensive combination of similar rules allows us to define the entities (subjects and objects), intent (predicates), the relationship between intent and entity, etc.

## Such an analysis is very useful in any NLP application since it defines some meaning of the text.
#In a collection of words without any relation or structure, it is very difficult to ascertain what it might be trying to convey or what it means.

## We'll approach the language syntax and structure problem in 3 parts:

# Part of Speech tagging (this tutorial): analyzing syntax of single words
# Chunking / shallow parsing (part 2): analyzing multi-word phrases (or chunks) of text
# Parsing (part 3): analyzing sentence structure as a whole, and the relation of words to one another

In [29]:
## POS

# Parts of speech (POS) tags are specific lexical categories to which words are assigned based on their syntactic context and role.
#In the English language there are broadly 8 parts of speech: nouns, adjectives, pronouns, interjections, conjunctions, prepositions, adverbs, and verbs.

# For instance, in the sentence:


# I am learning NLP

# the POS tags are: ('I'/'PRONOUN', 'am'/'VERB', 'learning'/'VERB', 'NLP'/'NOUN')
# However, there could be additional detailed tags apart from the generic tags.
#In Penn Treebank, a commonly used dataset for language syntax and structure, there are 47 tags defined which are widely used in text analytics and NLP applications.
#You can find more information on specific POS tags and their notations at: Penn Treebank Tagset.pdf

# The process of classifying and labelling POS tags is called POS tagging.
#Let’s try a Python example using NLTK’s pos_tag() function, a commonly used POS tagger trained on Penn Treebank data. Passing tagset='universal' maps the Penn Treebank tags to the coarser universal tagset:

import nltk
nltk.download('punkt_tab',quiet=True)
nltk.download('universal_tagset',quiet=True)
nltk.download('averaged_perceptron_tagger_eng',quiet=True)


sentence = 'The brown fox is quick and he is jumping over the lazy dog'
tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens, tagset='universal')
print(tagged_sent)

# The output below shows the POS tag for each word in the sentence.

[('The', 'DET'), ('brown', 'ADJ'), ('fox', 'NOUN'), ('is', 'VERB'), ('quick', 'ADJ'), ('and', 'CONJ'), ('he', 'PRON'), ('is', 'VERB'), ('jumping', 'VERB'), ('over', 'ADP'), ('the', 'DET'), ('lazy', 'ADJ'), ('dog', 'NOUN')]


In [8]:
## Implementation of POS taggers
# Some prominent approaches used to build a POS tagger are described below:

# Rule based
# Rule-based tagging is the oldest approach to POS tagging. It uses predefined rules to get possible tags for each word.
#The tagger uses information from context (surrounding words) and morphology (within the word), and might also include rules pertaining to such factors as capitalization and punctuation, etc.
#A couple of examples are:

# If a word X is preceded by a determiner and followed by a noun, tag it as an adjective (contextual rule). E.g., "brown" in "The brown fox".
# If a word ends with -ous, tag it as an adjective (morphological rule). E.g., "adventurous".
# An old but useful paper was published by Eric Brill in 1992: A Simple Rule-Based Part of Speech Tagger. It is the basis of Brill’s Tagger.
#See section 2 (about 1 page, easy read) for a detailed description of a rule-based POS tagger.
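
# The two example rules above can be sketched in a few lines of plain Python. This is only an illustration, not Brill's tagger:
# the function name rule_based_tag, the determiner list and the tiny noun lexicon are assumptions made up for this example.

```python
def rule_based_tag(tokens):
    """Tag tokens with a handful of hand-written rules (illustrative only)."""
    determiners = {"the", "a", "an"}
    # Hypothetical mini-lexicon of known nouns, needed by the contextual rule
    nouns = {"fox", "dog", "journey"}

    tags = []
    for i, token in enumerate(tokens):
        word = token.lower()
        prev_word = tokens[i - 1].lower() if i > 0 else None
        next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else None

        if word in determiners:
            tags.append("DET")
        elif word.endswith("ous"):
            # Morphological rule: the -ous suffix signals an adjective
            tags.append("ADJ")
        elif prev_word in determiners and next_word in nouns:
            # Contextual rule: determiner + X + noun, so X is an adjective
            tags.append("ADJ")
        elif word in nouns:
            tags.append("NOUN")
        else:
            # No rule fired: fall back to the "unknown" tag
            tags.append("X")
    return list(zip(tokens, tags))


print(rule_based_tag("The brown fox".split()))
print(rule_based_tag("An adventurous journey".split()))
```

# Real rule-based taggers use far larger lexicons and rule sets, but the control flow is the same: try rules in priority order and fall back to a default tag.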

# Statistics based
# Statistics based taggers have obtained a high accuracy without requiring manually crafted linguistic rules.
#There are many statistical methods; the most notable for POS tagging are Hidden Markov Models (HMMs) and the maximum entropy approach.
#In the HMM model the word-tag probabilities are estimated from a manually annotated corpus (training set).
#It is a stochastic model in which the tag sequence is assumed to be a Markov process with unobservable states and observable outputs.
#Here, the POS tags are the states and the words are the outputs. Hence, the POS tagger consists of:

# Ps(Ti): Probability of the sequence starting in tag Ti
# Pt(Tj|Ti): Probability of the sequence transitioning from tag Ti to tag Tj
# Pe(Wj|Ti): Probability of the sequence emitting word Wj while in tag Ti
# The tagger makes two simplifying assumptions:
# The probability of a word depends only on its tag, i.e. given its tag, it is independent of other words and other tags.
# The probability of a tag depends only on its previous tag, i.e. given the previous tag, it is independent of next tags and tags before the previous tag.
# Given a sequence of words, the POS tagger is interested in finding the most likely sequence of tags that generates that sequence of words.
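
# Finding the most likely tag sequence is usually done with the Viterbi algorithm. Below is a minimal sketch in plain Python.
# All probabilities are made-up toy values (assumptions for illustration, not estimated from any corpus),
# and start_p, trans_p and emit_p play the roles of Ps, Pt and the emission probability described above.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words` under a toy HMM."""
    # best[i][t] = (probability of the best path ending in tag t at word i,
    #               previous tag on that path)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)

    # Backtrack from the most probable final tag
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))


tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 1.0},
    "NOUN": {"dog": 0.5, "walk": 0.4, "barks": 0.1},
    "VERB": {"barks": 0.6, "walk": 0.3, "dog": 0.1},
}

print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
```

# The two simplifying assumptions show up directly in the inner loop: each step multiplies one transition probability (previous tag only) by one emission probability (current tag only).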

# Supervised learning based
# The supervised learning based approach to building a POS tagger yields the most accurate taggers currently available.
#It is often based on a neural network, which has generally been reported to be more accurate than rule-based or purely statistical models.
#This approach treats POS tagging as a “supervised learning problem”: manually annotated training data is given to the machine learning model, which learns to predict the missing tags by finding correlations in the training data.

# For example, given predictors (features) such as "POS of word i-1" or "last three letters of word i+1", a neural network can be trained to predict the "POS of word i".
#In some sense, it can be looked at as a generalization of the rule-based approach, where the supervised learning algorithm learns the importance of each rule.
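
# As a rough sketch of what such predictors look like in code, the function below turns one token into a feature dictionary.
# The function name pos_features and the exact feature names are assumptions chosen for this example; any classifier that accepts dictionaries of features could consume them.

```python
def pos_features(words, i, prev_tag):
    """Build the feature dictionary used to predict the POS tag of words[i]."""
    word = words[i]
    # Placeholder used when there is no following word
    next_word = words[i + 1] if i + 1 < len(words) else "<END>"
    return {
        "word.lower": word.lower(),
        "suffix3": word[-3:],                # last three letters of word i
        "is_capitalized": word[0].isupper(),
        "prev_tag": prev_tag,                # POS of word i-1
        "next_suffix3": next_word[-3:],      # last three letters of word i+1
    }


print(pos_features(["I", "am", "learning", "NLP"], 2, "VERB"))
```

# Each feature corresponds to a hand-written rule from the rule-based approach; the classifier learns how much weight each one deserves.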

# Some applications of POS tagging include narrowing down the nouns to focus on the most prominent ones, or performing qualifier-subject analysis, word sense disambiguation, grammar analysis, etc.
#The most important use case is to extract phrases from the sentence. In fact, it serves as an input to various more complex analyses such as chunking and parsing.

In [9]:
## Chunking (Shallow Parsing)

# Phrasal Structures
# A phrase can be a single word or a combination of words based on the syntax and position of the phrase in a clause or sentence. For example, in the following sentence

# My dog likes his food.

# there are three phrases. "My dog" is a noun phrase, "likes" is a verb phrase, and "his food" is also a noun phrase.
# There are five major categories of phrases:
# Noun phrase (NP): These are phrases where a noun acts as the head word. Noun phrases act as a subject or object to a verb or an adjective.
#In some cases a noun phrase can be replaced by a pronoun without changing the syntax of the sentence. Some examples of Noun phrases are "little boy", "hard rock", etc.

# Verb phrase (VP): These phrases are lexical units that have a verb acting as the head word. Usually there are two forms of verb phrases.
#One form has the verb components as well as other entities such as nouns, adjectives, or adverbs as parts of the object. The verb here is known as a finite verb.
#For example in the sentence “The boy is playing football”, "playing football" is the finite verb phrase.
#The second form of this includes verb phrases which consist strictly of verb components only. For example, "is playing" in the same sentence is such a verb phrase.

# Adjective phrase (ADJP): These are phrases with an adjective as the head word. Their main role is to describe or qualify nouns and pronouns in a sentence, and they will be either placed before or after the noun or pronoun.
#The sentence, "The cat is too cute" has an adjective phrase, "too cute", qualifying "cat".

# Adverb phrase (ADVP): These are phrases where an adverb acts as the head word in the phrase. Adverb phrases are used as modifiers for verbs, adjectives, or adverbs themselves by providing further details that describe or qualify them.
#In the sentence "The train should be at the station pretty soon", the adverb phrase "pretty soon" describes when the train would be arriving.

# Prepositional phrase (PP): These phrases usually contain a preposition as the head word and other lexical components like nouns, pronouns, and so on.
#It acts like an adjective or adverb describing other words or phrases. The phrase "going up the stairs" contains a prepositional phrase, "up the stairs", describing the direction of going.

# These five major syntactic categories of phrases can be generated from words using several rules, utilizing syntax and grammars of different types.

In [None]:
!pip install parse

In [32]:
# Shallow parsing, also known as light parsing or chunking, is a technique for analyzing the structure of a sentence in order to identify these phrases or chunks.
#We start by first breaking the sentence down into its smallest constituents (which are tokens such as words) and then grouping them together into higher-level phrases.

# Note that the parse package used below is a string-pattern matcher (the reverse of str.format()), not a chunker.
# The following snippet only shows how to pull a field out of a sentence that follows a fixed template; a real chunker is built in the next cell:
from parse import parse

# "{}" in the template captures whatever text appears at that position
result = parse("It's {}, I love it!", "It's spam, I love it!")
# print the first captured (positional) field
print(result.fixed[0])

# The output below is the text captured by the "{}" placeholder.

spam


In [33]:
# Let’s take an example of building a basic noun phrase chunker. As explained in the previous section, a noun phrase has a noun as its head word and acts as a subject or object to a verb or an adjective.
#In order to create a noun phrase chunker, we will first define a chunk grammar, consisting of rules that indicate how sentences should be chunked.
#For simplicity let’s assume a single rule grammar which says that a noun phrase chunk should be formed whenever the chunker finds an optional determiner (DET) followed by any number of adjectives (ADJ) and then a noun (NOUN):

grammar = "NP: {<DET>?<ADJ>*<NOUN>}"

# Using this grammar, we create a chunk parser, and test it on a sentence. The result is a tree of phrases around noun chunks

sentence = "The famous algorithm produced accurate results"

tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens, tagset='universal')

cp = nltk.RegexpParser(grammar)

result = cp.parse(tagged_sent)
print(result)

(S
  (NP The/DET famous/ADJ algorithm/NOUN)
  produced/VERB
  (NP accurate/ADJ results/NOUN))


In [34]:
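# Likewise you can define multiple grammar rules based on the grammatical phrases you want to extract.
# The rules are applied in order as a cascade, so later rules can refer to chunks (such as NP) built by earlier ones. Note that the universal tagset labels prepositions as ADP. For example:
grammar = '''
    NP: {<DET>? <ADJ>* <NOUN>+}
    PP: {<ADP> <NP>}
    VP: {<VERB> <NP|PP>*}
    '''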
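
# To see a multi-rule grammar in action, here is a small self-contained sketch of a cascaded chunker.
# The grammar and variable names (cascade_grammar, cascade_parser) are defined only for this example, and the sentence is tagged by hand with universal tags so that no tagger model is needed.

```python
import nltk

# Each stage runs in order; PP and VP refer to NP chunks built by the first stage
cascade_grammar = '''
    NP: {<DET>?<ADJ>*<NOUN>+}
    PP: {<ADP><NP>}
    VP: {<VERB><NP|PP>*}
    '''

# Hand-tagged input (universal tagset)
tagged = [("The", "DET"), ("dog", "NOUN"), ("sat", "VERB"),
          ("on", "ADP"), ("the", "DET"), ("mat", "NOUN")]

cascade_parser = nltk.RegexpParser(cascade_grammar)
cascade_tree = cascade_parser.parse(tagged)
print(cascade_tree)

# List the chunk labels found, from the root downwards
print([subtree.label() for subtree in cascade_tree.subtrees()])
```

# Because the stages are applied in sequence, the order of the rules matters: a rule can only use chunk labels that an earlier rule has already produced.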

In [35]:
## Machine Learning approach for chunking
# Another way to build a chunker is to train a classifier using any commonly used supervised classification algorithm such as SVM, Logistic Regression, etc.
# Dataset description
# You can use IOB tags as the labels to train a model that can extract chunks.
#In IOB tags, each word is tagged with one of three special chunk tags, I (Inside), O (Outside), or B (Begin). A word is tagged as B if it marks the beginning of a chunk; subsequent words within the chunk are tagged I, and all other words are tagged O.
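
# The IOB scheme is easy to see on a concrete sentence. The helper below, chunks_to_iob, is a hypothetical function written for this illustration (it is not part of NLTK);
# it takes a sentence already split into labelled chunks and emits one IOB tag per word, with the chunk type appended as in the CoNLL-2000 data (B-NP, I-NP, ...).

```python
def chunks_to_iob(chunks):
    """Convert [(label or None, [words])] into a flat list of (word, IOB tag)."""
    iob = []
    for label, words in chunks:
        for position, word in enumerate(words):
            if label is None:
                tag = "O"              # outside any chunk
            elif position == 0:
                tag = "B-" + label     # first word of a chunk
            else:
                tag = "I-" + label     # inside a chunk
            iob.append((word, tag))
    return iob


chunks = [
    ("NP", ["The", "famous", "algorithm"]),
    (None, ["produced"]),
    ("NP", ["accurate", "results"]),
]
for word, tag in chunks_to_iob(chunks):
    print(word, tag)
```

# With this encoding, chunking becomes a per-token tagging problem, which is what makes standard classifiers applicable.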

# Fortunately, NLTK provides a labelled training corpus to train such a classifier chunker. CoNLL-2000 data consist of three columns.
#The first column contains the word, the second its part-of-speech tag and the third its IOB chunk tag. For more details on this corpus, you can read the following paper: Introduction to the CoNLL-2000 Shared Task: Chunking

# Let's try an example where we shuffle the CoNLL-2000 corpus, then use 80% of the sentences for training and the remaining 20% to test our classifier.

import random

import nltk
from nltk.corpus import conll2000

nltk.download('conll2000', quiet=True)

conll_data = list(conll2000.chunked_sents())
random.shuffle(conll_data)

# Split once at the 80% mark so that no sentence is dropped between the two sets
split = int(len(conll_data) * 0.8)
train_sents = conll_data[:split]
test_sents = conll_data[split:]

In [36]:
# Defining features
# Next, let's define a custom feature extractor to train our model. The features consist of the word, its stem and its POS tag, plus the words and POS tags of the two preceding and the two following tokens.

from nltk.stem.porter import PorterStemmer

def features(tokens, index, history):
    # tokens: the (word, pos) pairs of a sentence
    # index: position of the token whose features are extracted
    # history: the chunk tags predicted so far for the preceding tokens

    stemmer = PorterStemmer()

    # Pad the sentence with two placeholder tokens on each side so that the
    # context lookups below never run off either end
    tokens = [('__PREVSEQ2__', '__PREVSEQ2__'),
        ('__PREVSEQ1__', '__PREVSEQ1__')] + list(tokens) + [('__END1__', '__END1__'),
        ('__END2__', '__END2__')]
    history = ['__PREVSEQ2__', '__PREVSEQ1__'] + list(history)

    # shift the index by 2 to point to the current token
    index += 2

    word, pos = tokens[index]
    prevword, prevpos = tokens[index - 1]
    prev2word, prev2pos = tokens[index - 2]
    nextword, nextpos = tokens[index + 1]
    next2word, next2pos = tokens[index + 2]

    return {
        'word': word,
        'lemma': stemmer.stem(word),  # a Porter stem, used as a cheap stand-in for the lemma
        'pos': pos,

        'next-word': nextword,
        'next-pos': nextpos,

        'next-next-word': next2word,
        'next-next-pos': next2pos,

        'prev-word': prevword,
        'prev-pos': prevpos,

        'prev-prev-word': prev2word,
        'prev-prev-pos': prev2pos,
    }

In [37]:
# Training and evaluation
# NLTK also provides in its package a sequential tagger that uses a classifier to choose the tag for each token in a sentence.
#NLTK's ClassifierBasedTagger can be trained on custom features extracted from CoNLL-2000 or a similar data set.

# Let's train NLTK's ClassifierBasedTagger using our custom features on the training data set and evaluate the model on the test sample data.

from nltk import ChunkParserI, ClassifierBasedTagger
from nltk.chunk import conlltags2tree, tree2conlltags

class FooChunkParser(ChunkParserI):
    def __init__(self, chunked_sents, **kwargs):

        # Transform the trees in IOB annotated sentences [(word, pos, chunk)]
        chunked_sents = [tree2conlltags(sent) for sent in chunked_sents]

        # Make tags compatible with the tagger interface [((word, pos), chunk)]
        def get_tagged_pairs(chunked_sent):
            return [((word, pos), chunk) for word, pos, chunk in chunked_sent]

        chunked_sents = [get_tagged_pairs(sent) for sent in chunked_sents]

        self.feature_detector = features
        self.tagger = ClassifierBasedTagger(
            train=chunked_sents,
            feature_detector=features,
            **kwargs)

    def parse(self, tagged_sent):
        chunks = self.tagger.tag(tagged_sent)
        iob_triplets = [(word, pos, chunk) for ((word, pos), chunk) in chunks]
        # Transform the list of triplets to nltk.Tree format
        return conlltags2tree(iob_triplets)

chunker = FooChunkParser(train_sents)
print(chunker.accuracy(test_sents))


ChunkParse score:
    IOB Accuracy:  93.0%%
    Precision:     87.5%%
    Recall:        90.6%%
    F-Measure:     89.0%%


In [None]:
# Comparison of machine learning vs. rule-based chunker
# Experiments often show that a classifier-based chunker performs comparably to a carefully written rule-based chunker.
#At times it can be considerably hard to define regular expressions that extract chunks with very complex structures. In such cases, the machine learning approach is helpful.

In [None]:
## Deep Parsing
# Natural language parsing (also known as deep parsing) is a process of analyzing the complete syntactic structure of a sentence.
#This includes how different words in a sentence are related to each other, for example, which words are the subject or object of a verb.
#Probabilistic parsing uses knowledge of the language, such as grammatical rules.
#Alternatively, it may use a supervised training set of hand-parsed sentences to infer the most likely syntax and structure of new sentences.

# Parsing is used to solve various complex NLP problems such as conversational dialogues and text summarization.
#It is different from 'shallow parsing' in that it yields more expressive structural representations which directly capture long-distance dependencies and underlying predicate-argument structures.

# There are two main types of parse structures: constituency parses and dependency parses.

In [None]:
# Constituency Parsing
# Constituent-based grammars are used to analyze and determine the components which a sentence is composed of.
#There are usually several rules for different types of phrases based on the type of components they can contain, and this can be used to build a parse tree.
#The non-terminal nodes in the tree are types of phrases and the terminal nodes are the words in the sentence which are constituents of the phrase.
#For example, consider the following sentence and its constituency parse tree:


# Harry met Sejal
#                  Sentence
#                     |
#       +-------------+------------+
#       |                          |
#  Noun Phrase                Verb Phrase
#       |                          |
#     Harry                +-------+--------+
#                          |                |
#                        Verb          Noun Phrase
#                          |                |
#                         met             Sejal
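
# The same constituency tree can be written in bracketed form and loaded with NLTK's Tree class. This is a small sketch for exploring the structure;
# the short labels S, NP, VP and V stand in for Sentence, Noun Phrase, Verb Phrase and Verb in the diagram above.

```python
from nltk import Tree

# Bracketed notation: (label child child ...)
constituency_tree = Tree.fromstring("(S (NP Harry) (VP (V met) (NP Sejal)))")

print(constituency_tree.label())    # label of the root node
print(constituency_tree.leaves())   # terminal nodes, i.e. the words

# Walk the non-terminal nodes and show which words each phrase covers
for subtree in constituency_tree.subtrees():
    print(subtree.label(), subtree.leaves())
```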

In [None]:
# Dependency parsing
# The dependency-based grammar is based on the notion that linguistic units, e.g. words, are connected to each other by directed links (one-to-one mappings) between words which signify their dependencies.
# The resulting parse tree representation is a labelled directed graph where the nodes are the lexical tokens and the labelled edges show dependency relationships between the heads and their dependents.
#The labels on the edges indicate the grammatical role of the dependent. For example, consider the same sentence and its dependency parse tree


# Harry met Sejal

#               met
#                |
#        +--------------+
#subject |              | object
#      Harry          Sejal
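
# A dependency parse is often stored as a list of (head, relation, dependent) triples. The sketch below encodes the parse of "Harry met Sejal" this way, with "Harry" as the subject and "Sejal" as the object of "met";
# the variable and function names are assumptions made for this illustration.

```python
# Each edge points from a head word to one of its dependents
dependency_edges = [
    ("met", "subject", "Harry"),
    ("met", "object", "Sejal"),
]


def dependents_of(head, edges):
    """Return {relation: dependent} for every edge leaving `head`."""
    return {relation: dependent for h, relation, dependent in edges if h == head}


def find_root(edges):
    """The root is the only head that is never a dependent."""
    heads = {h for h, _, _ in edges}
    dependents = {d for _, _, d in edges}
    return (heads - dependents).pop()


print(find_root(dependency_edges))
print(dependents_of("met", dependency_edges))
```

# Unlike the constituency tree, there are no phrase nodes here: every node is a word, and the verb is the root.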

In [None]:
# Next, we look at how parsers based on probabilistic context free grammars work. Let’s start by considering the following example sentence:

# I went to the market in my shorts.

# If we analyze the sentence carefully, there is an ambiguity in phrases - “I went to the market” + “in my shorts” and “I went to” + “the market in my shorts”.
#While it may be obvious to us that the first phrasal interpretation is sensible, both phrasal structures are syntactically correct.
#The latter would mean there is some market which is in my shorts (analogous to the sentence, "I went to the market in the city").

In [24]:
## Context Free Grammars
# A Context Free Grammar (CFG) is a set of rules which can be repeatedly applied to generate a sentence.
#Given a sentence and a particular CFG, we can also infer the possible parse tree structures.
#Let’s illustrate this with a toy CFG for the above example sentence, and check the possible parse trees:


grammar = nltk.CFG.fromstring("""S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'to' | 'my'
N -> 'market' | 'shorts'
V -> 'went'
P -> 'in'
""")

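# Here, S is the start symbol, NP, VP, etc. represent noun phrase, verb phrase, etc., and N, V, etc. represent noun, verb, etc.
# To keep the toy grammar small, the article "the" is dropped from the sentence and 'to' is listed under Det, so "to market" parses as a noun phrase.
# This grammar permits the sentence to be analyzed in two ways, depending on whether the prepositional phrase "in my shorts" describes the noun "market" or the act of going to market.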

import nltk
sent = 'I went to market in my shorts'.split()
parser = nltk.ChartParser(grammar)
for tree in parser.parse(sent):
    print(tree)

(S
  (NP I)
  (VP
    (VP (V went) (NP (Det to) (N market)))
    (PP (P in) (NP (Det my) (N shorts)))))
(S
  (NP I)
  (VP
    (V went)
    (NP (Det to) (N market) (PP (P in) (NP (Det my) (N shorts))))))


In [39]:
sentence = 'The dog saw a man in the park'

grammar = nltk.CFG.fromstring('''S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
NP -> Det N | Det N PP | 'I'
Det -> 'a' | 'the'
N -> 'man' | 'dog' | 'park'
V -> 'saw'
P -> 'in'
''')

# As before, S is the start symbol, NP, VP, etc. represent noun phrase, verb phrase, etc., and N, V, etc. represent noun, verb, etc.
# This grammar permits the sentence to be analyzed in two ways, depending on whether the prepositional phrase "in the park" describes where the seeing took place or which man was seen.

import nltk
sent = sentence.lower().split()
parser = nltk.ChartParser(grammar)
for tree in parser.parse(sent):
    print(tree)

(S
  (NP (Det the) (N dog))
  (VP
    (V saw)
    (NP (Det a) (N man))
    (PP (P in) (NP (Det the) (N park)))))
(S
  (NP (Det the) (N dog))
  (VP
    (V saw)
    (NP (Det a) (N man) (PP (P in) (NP (Det the) (N park))))))


In [26]:
## Probabilistic Context Free Grammars
# In order to disambiguate between the above possible trees, we can use a Probabilistic Context Free Grammar (PCFG).
#In this setting, each rule has an associated probability.
#For example, for the rule N -> 'market' | 'shorts', our grammar would specify the probabilities of N expanding to 'market' and to 'shorts'.
#In general, the probabilities and rules are inferred from annotated datasets.
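
# Before learning probabilities from data, it helps to see a PCFG resolve the "market in my shorts" ambiguity by hand.
# The sketch below reuses the toy grammar for that sentence and attaches made-up probabilities (assumptions for illustration, not estimated from any corpus); the probabilities for each left-hand side sum to 1.
# NLTK's ViterbiParser then returns only the most probable tree.

```python
import nltk

toy_pcfg = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
PP -> P NP [1.0]
NP -> Det N [0.7] | Det N PP [0.1] | 'I' [0.2]
VP -> V NP [0.7] | VP PP [0.3]
Det -> 'to' [0.5] | 'my' [0.5]
N -> 'market' [0.5] | 'shorts' [0.5]
V -> 'went' [1.0]
P -> 'in' [1.0]
""")

viterbi_parser = nltk.ViterbiParser(toy_pcfg)
# parse() yields the single most probable tree for the sentence
best_tree = next(viterbi_parser.parse('I went to market in my shorts'.split()))
print(best_tree)
print(best_tree.prob())
```

# With these values the rule NP -> Det N PP is made unlikely, so the parser prefers attaching "in my shorts" to the verb phrase, which matches the sensible reading.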

# Now, we will build a custom constituency parser by creating our own PCFG rules and then using NLTK’s ViterbiParser to parse with them.
#We will use the treebank corpus, which provides annotated parse trees for sentences.

nltk.download('treebank',quiet=True)

from nltk.grammar import Nonterminal
from nltk.corpus import treebank

# get training data
training_set = treebank.parsed_sents()

# example training sentence
print(training_set[1])



(S
  (NP-SBJ (NNP Mr.) (NNP Vinken))
  (VP
    (VBZ is)
    (NP-PRD
      (NP (NN chairman))
      (PP
        (IN of)
        (NP
          (NP (NNP Elsevier) (NNP N.V.))
          (, ,)
          (NP (DT the) (NNP Dutch) (VBG publishing) (NN group))))))
  (. .))


In [27]:
# Next, we will build the rules for our grammar by extracting them from the annotated training sentences.
# Extract the rules for all annotated training sentences
rules = list(set(rule for sent in training_set for rule in sent.productions()))
print(rules[0:5])

[NN -> 'movie', NN -> 'Dividend', NNS -> 'six-packs', NN -> 'low-altitude', NNS -> 'exports']
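
# One way to turn extracted rules into a PCFG is NLTK's induce_pcfg, which estimates each rule's probability from how often it occurs.
# The sketch below uses two tiny hand-written trees instead of the treebank (an assumption, to keep the example self-contained). Note that it keeps duplicate productions:
# the counts are what the probabilities are computed from, so the productions should not be de-duplicated with set() before this step.

```python
from nltk import Nonterminal, Tree, induce_pcfg

# Two toy annotated sentences standing in for a treebank
toy_trees = [
    Tree.fromstring("(S (NP I) (VP (V saw) (NP him)))"),
    Tree.fromstring("(S (NP I) (VP (V ran)))"),
]

# Keep every occurrence of every production so that counts are preserved
toy_productions = [rule for tree in toy_trees for rule in tree.productions()]

# Relative-frequency estimate: count(A -> B) / count(A)
induced_pcfg = induce_pcfg(Nonterminal("S"), toy_productions)
for production in induced_pcfg.productions():
    print(production)
```

# Here NP -> 'I' occurs twice and NP -> 'him' once, so they receive probabilities of 2/3 and 1/3 respectively.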
