Chunking is an essential part of natural language processing (NLP): it groups tokens into meaningful units such as noun phrases. Advanced chunking goes beyond flat chunks to handle nested and multi-layered structures. This section covers cascaded, recursive, and overlapping chunks, which are needed to represent hierarchical relationships in complex text.



### 7.1 Introduction to Advanced Chunking

- **Definition**: Advanced chunking is the process of using sophisticated methods to chunk text, enabling the detection of both simple and nested structures in sentences.
- **Purpose**: To extend basic chunking techniques and identify deeper linguistic relationships.
- **Challenges**: Handling overlapping chunks, nested structures, and maintaining accuracy while increasing complexity.



### 7.2 Cascaded Chunking

#### 7.2.1 Multi-Level Chunkers

- **Concept**: Cascaded chunking stacks several chunkers in sequence. Each layer targets one phrase type, such as noun phrases, verb phrases, or prepositional phrases, and later layers build on the chunks produced by earlier ones. For example, a multi-level chunker can extract hierarchical information from clinical reports by identifying symptoms, diagnoses, and treatments in separate layers.
  - **Observation**: This hierarchical approach enables more fine-grained extraction of linguistic structures, providing better context representation.
  - **Code Demonstration**: Demonstrate a multi-level chunker in NLTK to first chunk noun phrases, followed by prepositional phrases.


In [None]:
import nltk  # Importing the NLTK (Natural Language Toolkit) library.
from nltk import RegexpParser  # Importing the RegexpParser to create chunk parsers using regular expressions.

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')


# Sample text to process.
text = "The quick brown fox jumps over the lazy dog."

# Tokenizing the text into individual words and assigning part-of-speech (POS) tags to each token.
# 'nltk.word_tokenize' splits the text into tokens (words), and 'nltk.pos_tag' assigns POS tags (e.g., DT for determiner, NN for noun).
tokens = nltk.pos_tag(nltk.word_tokenize(text))

# Defining a grammar rule for noun phrases (NP).
# This rule matches optional determiners (DT), followed by zero or more adjectives (JJ*), and a noun (NN).
grammar_np = "NP: {<DT>?<JJ>*<NN>}"

# Creating a RegexpParser object to find noun phrases based on the grammar.
chunker_np = RegexpParser(grammar_np)

# Parsing the tokenized and POS-tagged sentence to identify noun phrases.
chunked_np = chunker_np.parse(tokens)

# Defining a grammar rule for prepositional phrases (PP).
# A prepositional phrase typically starts with a preposition (IN) followed by a noun phrase (NP).
grammar_pp = "PP: {<IN><NP>}"

# Creating a new RegexpParser object to find prepositional phrases based on the grammar.
chunker_pp = RegexpParser(grammar_pp)

# Parsing the previously chunked noun phrases to also find prepositional phrases.
chunked_pp = chunker_pp.parse(chunked_np)

# Printing the tree structure of the chunked sentence, which shows the identified noun and prepositional phrases.
print(chunked_pp)


(S
  (NP The/DT quick/JJ brown/NN)
  (NP fox/NN)
  jumps/VBZ
  (PP over/IN (NP the/DT lazy/JJ dog/NN))
  ./.)



#### 7.2.2 Advantages of Cascaded Chunking

- **Layered Processing**: Each level can correct or refine the output from the previous level.
- **Granular Control**: Facilitates better error handling and customization in chunking rules.
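- **Code Demonstration**: Start with a verb phrase (VP) chunker run on its own over the flat POS-tagged tokens. Because no NP or PP chunks exist at that point, the rule can only match the verb itself, which shows why the order of the layers matters.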


In [None]:
# Import the required libraries
import nltk
from nltk import RegexpParser

# Make sure you have the necessary resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Example sentence for demonstration
sentence = "The quick brown fox jumps over the lazy dog."

# Tokenize and tag parts of speech
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

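# Grammar rule to identify verb phrases (VP).
# Note: the input below is a flat list of tagged tokens, so no NP or PP chunks exist yet;
# only the verb itself (plus any adverbs, RB) can match here.
grammar_vp = "VP: {<VB.*><NP|PP|RB>*}"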

# Creating a RegexpParser object to identify verb phrases based on the grammar rule
chunker_vp = RegexpParser(grammar_vp)

# Parsing the sentence to identify verb phrases
chunked_vp = chunker_vp.parse(tagged)

# Print the tree structure of the chunked sentence, including the identified verb phrases
print(chunked_vp)


(S
  The/DT
  quick/JJ
  brown/NN
  fox/NN
  (VP jumps/VBZ)
  over/IN
  the/DT
  lazy/JJ
  dog/NN
  ./.)




### 7.3 Recursive Chunking

- **Definition**: Recursive chunking, also called nested chunking, involves detecting chunks within chunks, which leads to hierarchical structures. An example is parsing nested noun phrases in legal documents, where a phrase may contain multiple layers of embedded information.
- **Use Cases**: Useful in extracting nested noun phrases, such as "[The quick brown fox [with [a bushy tail]]]."
- **Observation**: Recursive chunking allows for a more detailed representation of complex syntactic relationships within sentences.
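
To make the bracket notation above concrete, here is a minimal sketch that reads a bracketed string into nested Python lists and reports how deeply the chunks are nested. The helper names `parse_brackets` and `depth` are illustrative only and are not part of NLTK.

```python
def parse_brackets(text):
    """Turn '[a [b [c]]]' into nested lists of tokens: [['a', ['b', ['c']]]]."""
    # Pad brackets with spaces so each one becomes its own token.
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    stack = [[]]  # stack[0] collects the top-level items
    for tok in tokens:
        if tok == "[":
            stack.append([])          # open a new chunk
        elif tok == "]":
            if len(stack) == 1:
                raise ValueError("unbalanced ']'")
            chunk = stack.pop()       # close the current chunk
            stack[-1].append(chunk)   # attach it to its parent
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced '['")
    return stack[0]


def depth(node):
    """Nesting depth: a bare token is 0, a chunk is 1 + its deepest child."""
    if not isinstance(node, list):
        return 0
    return 1 + max((depth(child) for child in node), default=0)


parsed = parse_brackets("[The quick brown fox [with [a bushy tail]]]")
chunk = parsed[0]
print(chunk)
print("Nesting depth:", depth(chunk))
```

The outer chunk holds four words plus one inner chunk, which in turn holds another, so the structure is three levels deep.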



- **Cascading the VP chunker**: Running the VP rule from Section 7.2.2 over the already-chunked tree (`chunked_pp`) from Section 7.2.1 lets `<NP|PP|RB>*` match the NP and PP subtrees, so the verb phrase now contains the prepositional phrase and the noun phrase nested inside it.



In [None]:
# Grammar rule to identify verb phrases (VP).
# - <VB.*> matches any verb form (VB = base form, VBD = past tense, VBG = gerund, etc.).
# - <NP|PP|RB>* allows zero or more noun phrases (NP), prepositional phrases (PP), or adverbs (RB) following the verb.
#   This is flexible, as it can match many verb phrase structures, such as "jumps quickly", "jumps over the dog", etc.
grammar_vp = "VP: {<VB.*><NP|PP|RB>*}"

# Creating a RegexpParser object to identify verb phrases using the defined grammar rule.
chunker_vp = RegexpParser(grammar_vp)

# Parsing the previously chunked sentence (which includes noun phrases and prepositional phrases).
# Now, it will additionally identify and chunk verb phrases.
chunked_vp = chunker_vp.parse(chunked_pp)

# Printing the full parse tree, which now includes verb phrases.
print(chunked_vp)



(S
  (NP The/DT quick/JJ brown/NN)
  (NP fox/NN)
  (VP jumps/VBZ (PP over/IN (NP the/DT lazy/JJ dog/NN)))
  ./.)


#### 7.3.1 Implementation of Recursive Chunking

- **Approach**: Apply chunkers iteratively to handle deeper nested structures in sentences.
  - **Code Demonstration**: Use multiple passes to create a tree structure representing nested chunks.


In [None]:
# Sample text with a more complex structure, containing prepositional phrases that modify the noun.
nested_text = "The quick brown fox with a bushy tail jumps over the lazy dog."

# Tokenizing the text into individual words and assigning part-of-speech (POS) tags to each token.
tokens = nltk.pos_tag(nltk.word_tokenize(nested_text))

# Grammar rule for noun phrases that absorb a following prepositional modifier.
# - <DT>? allows for an optional determiner.
# - <JJ>* allows for zero or more adjectives.
# - <NN> matches a singular common noun.
# - (<IN><DT>?<JJ>*<NN>)* matches zero or more preposition-led modifiers (IN plus determiner, adjectives, noun).
# Note: this single rule yields one flat NP chunk; the modifier is not a separate subtree inside it.
grammar_nested_np = "NP: {<DT>?<JJ>*<NN>(<IN><DT>?<JJ>*<NN>)*}"

# Creating a RegexpParser to identify nested noun phrases using the defined grammar rule.
chunker_nested_np = RegexpParser(grammar_nested_np)

# Parsing the tokenized and POS-tagged sentence to identify nested noun phrases.
nested_chunked = chunker_nested_np.parse(tokens)

# Printing the full parse tree of the sentence with the nested noun phrases identified.
print(nested_chunked)


(S
  (NP The/DT quick/JJ brown/NN)
  (NP fox/NN with/IN a/DT bushy/JJ tail/NN)
  jumps/NNS
  over/IN
  (NP the/DT lazy/JJ dog/NN)
  ./.)
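
The output above is a flat NP: the modifier "with a bushy tail" is absorbed into the chunk rather than appearing as a subtree. One possible way to get a truly nested tree is a multi-stage grammar in which a later stage is allowed to consume chunks built by earlier stages. The sketch below is one such formulation, not the only one. It assumes hand-assigned POS tags (so it does not depend on the tagger, which labelled "brown" as NN and "jumps" as NNS above) and a three-stage grammar written for this example.

```python
from nltk import RegexpParser, Tree

# Hand-tagged tokens (an assumption for this sketch, not tagger output).
tagged = [
    ("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
    ("with", "IN"), ("a", "DT"), ("bushy", "JJ"), ("tail", "NN"),
    ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
    ("dog", "NN"), (".", "."),
]

# Each line is one stage; stages run top to bottom over the previous result.
grammar = r"""
NP: {<DT>?<JJ>*<NN>}    # base noun phrase
PP: {<IN><NP>}          # preposition followed by a noun phrase
NP: {<NP><PP>}          # noun phrase modified by a prepositional phrase
"""

parser = RegexpParser(grammar)
tree = parser.parse(tagged)
print(tree)

# The first child of the root is an NP whose children are an NP and a PP.
first = tree[0]
print(first.label(), [child.label() for child in first])
```

Because the third stage wraps an existing NP and PP instead of matching raw tags, the inner chunks survive as subtrees. For deeper recursion, `RegexpParser` also accepts a `loop` argument that repeats the whole cascade.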


#### 7.3.2 Handling Complex Structures

- **Observation**: Recursive chunking is prone to errors such as over-chunking or under-chunking due to ambiguity in sentence structure.
- **Techniques to Handle Complexity**:
  - **Feature Enrichment**: Using additional features such as POS tags, word embeddings, or dependency parsing to improve chunk detection.
  - **Post-Processing**: Apply rules after chunking to correct errors and remove ambiguities.



### 7.4 Chunking Nested Entities



#### 7.4.1 Nested Named Entities

- **Concept**: Extracting named entities that are nested within other entities, such as the organization inside "[[ORG: BBC News] Headquarters]."
- **Applications**: Biomedical texts, where entities like proteins and genes often appear within other named entities.
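- **Code Demonstration**: Use spaCy to list each named entity and the tokens it spans. Note that `doc.ents` contains flat, non-overlapping spans, so the inner loop shows the tokens of an entity rather than separately labeled nested entities.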


In [None]:
import spacy  # Importing the spaCy library for Natural Language Processing (NLP).

# Loading the small English NLP model in spaCy. This model includes capabilities such as part-of-speech tagging and Named Entity Recognition (NER).
nlp = spacy.load("en_core_web_sm")

# Processing the input text using the loaded spaCy model. This produces a Doc object, which contains linguistic annotations.
doc = nlp("BBC News headquarters is located in London.")

# Iterating over all the named entities (ents) in the Doc object.
# Named entities are pre-identified phrases in the text, such as organizations, locations, etc.
for ent in doc.ents:
    # Printing the entity's text and its label (the type of entity, such as 'ORG' for organization or 'GPE' for geographical entity).
    print(f"Entity: {ent.text}, Label: {ent.label_}")

    # For each token (word) inside the entity span, check its entity type.
    for token in ent:
        # token.ent_type_ repeats the label of the enclosing entity; it does not reveal a separate inner entity.
        # The loop is still useful for seeing which tokens a multi-word entity covers (e.g., "BBC News" or "New York").
        if token.ent_type_:
            print(f" - Nested Token: {token.text}, Type: {token.ent_type_}")


Entity: BBC News, Label: ORG
 - Nested Token: BBC, Type: ORG
 - Nested Token: News, Type: ORG
Entity: London, Label: GPE
 - Nested Token: London, Type: GPE


#### 7.4.2 Challenges in Extracting Nested Entities

- **Ambiguity**: Nested entities can be ambiguous, particularly when an inner entity belongs to a different class than the entity that contains it (e.g., "[ORG: University of [GPE: California]]"). Hierarchical tagging helps by tagging each entity at the appropriate level of specificity. For example, in a legal document an organization might be nested within a larger entity, and hierarchical tagging distinguishes these layers.
- **Approach to Resolve Ambiguity**:
  - **Hierarchical Tagging**: Assign tags in a hierarchical manner to avoid overlap.
  - **Code Demonstration**: Implement a tagging mechanism using conditional checks to determine if entities belong to multiple classes.


In [None]:
# Iterating over all the named entities (ents) in the spaCy Doc object.
for ent in doc.ents:
    # Checking whether the entity label is "ORG" (organization).
    # This prints any entities that are recognized as organizations.
    if ent.label_ == "ORG":
        print(f"Organization Entity: {ent.text}")

    # Checking whether the entity label is "GPE" (geopolitical entity, i.e., countries, cities, etc.).
    # This prints any entities recognized as geographical locations.
    elif ent.label_ == "GPE":
        print(f"Geopolitical Entity: {ent.text}")


Organization Entity: BBC News
Geopolitical Entity: London
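
The label checks above operate on flat entities. To sketch what hierarchical tagging could look like, the example below assigns a nesting level to each entity span, with level 0 for outermost entities. The spans are written by hand as `(start, end, label)` token offsets (end exclusive); they are assumptions for illustration, not the output of a model, and `assign_levels` is a hypothetical helper.

```python
def assign_levels(spans):
    """Return (start, end, label, level) tuples; level 0 means outermost."""

    def contains(outer, inner):
        # True when `outer` covers `inner` and the two offsets are not identical.
        same = (outer[0], outer[1]) == (inner[0], inner[1])
        return outer[0] <= inner[0] and inner[1] <= outer[1] and not same

    result = []
    for span in spans:
        # The level is the number of other spans that enclose this one.
        level = sum(1 for other in spans if contains(other, span))
        result.append((*span, level))
    return sorted(result, key=lambda s: (s[3], s[0]))


tokens = ["University", "of", "California", "opened", "in", "Berkeley"]
spans = [(2, 3, "GPE"), (0, 3, "ORG"), (5, 6, "GPE")]

levels = assign_levels(spans)
for start, end, label, level in levels:
    print(f"level {level}: {label:4s} {' '.join(tokens[start:end])}")
```

Keeping the level alongside the label lets downstream code choose between the outer entity ("University of California") and the inner one ("California") instead of forcing a single tag per token.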


### 7.5 Handling Overlapping Chunks

- **Definition**: Overlapping chunks are segments of text that belong to more than one chunk category, resulting in overlapping spans.
- **Techniques to Handle Overlapping**:
  - **Chunk Pruning**: Prune chunks to ensure only one chunk type is assigned to a specific span of text.
  - **Voting Mechanism**: Use a voting system from multiple chunkers to determine the final chunk label.
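
A voting mechanism can be sketched as a per-token majority vote over the label sequences produced by several chunkers. The three label sequences below are invented for illustration (BIO-style tags), and the tie-breaking rule, which favors the earliest-listed chunker, is one arbitrary choice among many.

```python
from collections import Counter


def vote(label_sequences):
    """Majority vote per token; ties go to the earliest-listed chunker."""
    lengths = {len(seq) for seq in label_sequences}
    if len(lengths) != 1:
        raise ValueError("all chunkers must label the same number of tokens")
    voted = []
    for labels in zip(*label_sequences):
        counts = Counter(labels)
        best = max(counts.values())
        # Scan in chunker order so ties are broken deterministically.
        winner = next(label for label in labels if counts[label] == best)
        voted.append(winner)
    return voted


tokens = ["The", "tall", "building", "in", "New", "York"]
chunker_a = ["B-NP", "I-NP", "I-NP", "B-PP", "B-NP", "I-NP"]
chunker_b = ["B-NP", "I-NP", "I-NP", "O", "B-NP", "I-NP"]
chunker_c = ["B-NP", "I-NP", "I-NP", "B-PP", "I-PP", "I-PP"]

final = vote([chunker_a, chunker_b, chunker_c])
for token, label in zip(tokens, final):
    print(f"{token:10s} {label}")
```

Voting token by token can produce label sequences that are not well formed (for example an `I-` tag with no preceding `B-` tag), so a repair pass after voting is often needed.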



#### 7.5.1 Overlapping Chunk Detection and Resolution

- **Code Demonstration**: Use custom logic to detect overlaps and determine the appropriate chunk based on predefined rules.


In [None]:
# Sample sentence that includes both noun phrases (NP) and prepositional phrases (PP), potentially overlapping.
sentence = "The tall building in New York City."

# Tokenizing the sentence and assigning part-of-speech (POS) tags.
tokens = nltk.pos_tag(nltk.word_tokenize(sentence))

# Defining a grammar that captures both noun phrases (NP) and prepositional phrases (PP).
# - NP: Noun phrases consisting of an optional determiner (<DT>?), zero or more adjectives (<JJ>*), and a noun (<NN>).
# - PP: Prepositional phrases that consist of a preposition (<IN>) followed by a noun phrase (<NP>).
grammar_overlap = """
NP: {<DT>?<JJ>*<NN>}   # Noun Phrase
PP: {<IN><NP>}         # Prepositional Phrase
"""

# Creating a RegexpParser to chunk the sentence using the defined grammar.
overlap_chunker = RegexpParser(grammar_overlap)

# Parsing the tokenized sentence, producing a tree structure of chunks (including potential overlaps).
chunked_overlap = overlap_chunker.parse(tokens)

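# Iterating through the top-level chunks (subtrees) in the parsed sentence.
# Note: this loop only lists the chunks; it does not resolve overlaps. In the output below, no NP follows "in"
# because "New", "York", and "City" did not match <NN> (proper nouns are normally tagged NNP), so no PP is formed.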
for subtree in chunked_overlap:
    # Check if the subtree is an actual chunk (i.e., a Tree object in NLTK).
    if isinstance(subtree, nltk.Tree):
        # Printing the type of the chunk (NP or PP) and the words it contains.
        print(f"Chunk Type: {subtree.label()}, Text: {' '.join(word for word, tag in subtree.leaves())}")


Chunk Type: NP, Text: The tall building
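
`RegexpParser` returns a tree, so its chunks never overlap on their own. Overlaps typically arise when candidate spans come from several sources. The sketch below shows one possible pruning rule under that assumption: prefer longer spans and drop any candidate that overlaps a span already kept. The candidate spans are made up for this example, and `overlaps` and `prune` are hypothetical helpers.

```python
def overlaps(a, b):
    """Half-open spans (start, end, label) overlap if they share a token."""
    return a[0] < b[1] and b[0] < a[1]


def prune(spans):
    """Greedy pruning: longest spans first, skip anything overlapping a kept span."""
    kept = []
    # Sort by descending length, then by start offset for determinism.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if not any(overlaps(span, k) for k in kept):
            kept.append(span)
    return sorted(kept)


tokens = ["The", "tall", "building", "in", "New", "York", "City"]
candidates = [
    (0, 3, "NP"),   # The tall building
    (3, 7, "PP"),   # in New York City
    (4, 7, "NP"),   # New York City (inside the PP)
    (2, 5, "NP"),   # building in New (a spurious candidate)
]

pruned = prune(candidates)
for start, end, label in pruned:
    print(label, " ".join(tokens[start:end]))
```

Note that this rule also discards spans nested inside a kept span, such as the NP within the PP. If nesting should be preserved, the overlap test would need to allow full containment and reject only partial overlaps.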


### 7.6 Challenges in Advanced Chunking

- **Ambiguity and Complexity**: More complex grammatical structures, such as subordinate clauses or appositives, lead to ambiguities in chunk boundaries.
- **Scalability**: As sentences become longer and more complex, the computational cost of chunking increases.
- **Code Demonstration**: Measure the complexity of chunking using a timing function to assess performance on large texts.


In [None]:
import time  # Importing the time module to measure the execution time.

# Sample long text to tokenize and chunk.
long_text = "The quick brown fox jumped over the lazy dog multiple times. The dog, however, remained indifferent."

# Tokenizing the text and assigning part-of-speech (POS) tags to each token.
tokens = nltk.pos_tag(nltk.word_tokenize(long_text))

# Recording the start time before chunking begins.
start_time = time.time()

# Chunking the tokenized text using a previously defined chunker (in this case, 'chunker_np').
# This assumes you have already defined 'chunker_np' earlier (for noun phrase chunking, for example).
chunked_long = chunker_np.parse(tokens)

# Recording the end time after chunking is completed.
end_time = time.time()

# Calculating and printing the time taken to perform the chunking process.
# The time difference (end_time - start_time) gives the elapsed time, which is formatted to 2 decimal places.
print(f"Time taken to chunk: {end_time - start_time:.2f} seconds")


Time taken to chunk: 0.00 seconds
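
A single call on two sentences finishes too quickly for a two-decimal reading to be informative. The sketch below shows one way to get a more stable measurement: repeat the call many times with `timeit` and compare several input sizes. To stay self-contained it times a stand-in function, `toy_chunker`, which is a simplification written for this example and not NLTK's chunker; the same loop could wrap `chunker_np.parse` instead.

```python
import timeit


def toy_chunker(tagged):
    """Stand-in chunker: groups maximal runs of DT/JJ/NN tokens into chunks."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in {"DT", "JJ", "NN"}:
            current.append(word)
        elif current:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


# Hand-tagged sentence (an assumption for this sketch).
base = [
    ("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumped", "VBD"),
    ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"), (".", "."),
]

calls = 200
timings = {}
for copies in (1, 10, 100):
    tagged = base * copies
    # Five rounds of `calls` invocations; keep the fastest round to reduce noise.
    best = min(timeit.repeat(lambda: toy_chunker(tagged), number=calls, repeat=5))
    timings[len(tagged)] = best / calls  # seconds per call
    print(f"{len(tagged):5d} tokens: {timings[len(tagged)] * 1e6:.1f} microseconds per call")
```

Comparing the per-call time across input sizes gives a rough picture of how the cost grows with text length, which a single reading cannot show.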


### 7.7 Applications of Advanced Chunking

- **Information Extraction**: Extracting key information such as named entities, relationships, and nested facts from unstructured data.
- **Question Answering**: Leveraging chunked and nested structures to identify precise answers in response to user queries.
- **Text Summarization**: Using hierarchical chunk structures to summarize documents by extracting key phrases and their relationships.



#### 7.7.1 Use Case Demonstration: Information Extraction

- **Code Demonstration**: Extract nested entities to build a knowledge base.


In [None]:
# An empty list to store extracted entities (i.e., noun phrases) from the text.
knowledge_base = []

# Sample text that includes multiple noun phrases like "Barack Obama", "former president", "United States", and "Honolulu".
text = "Barack Obama, the former president of the United States, was born in Honolulu."

# Tokenizing the text and assigning part-of-speech (POS) tags to each token.
tokens = nltk.pos_tag(nltk.word_tokenize(text))

# Defining a grammar to chunk noun phrases (NP).
# - <DT>?  : Optional determiner (e.g., "the").
# - <JJ>*  : Zero or more adjectives (e.g., "former", "quick").
# - <NN.*>+: One or more nouns of any type (NN, NNS, NNP, NNPS). The trailing + keeps
#            multi-word names such as "Barack Obama" and "United States" in one chunk.
nested_chunk_grammar = "NP: {<DT>?<JJ>*<NN.*>+}"

# Creating a RegexpParser object to chunk the sentence based on the defined grammar.
nested_chunker = RegexpParser(nested_chunk_grammar)

# Parsing the tokenized sentence to identify and chunk noun phrases (NPs).
chunked_entities = nested_chunker.parse(tokens)

# Iterating through the chunks (subtrees) in the chunked sentence.
for subtree in chunked_entities:
    # Checking if the subtree is a noun phrase (NP) chunk.
    if isinstance(subtree, nltk.Tree) and subtree.label() == 'NP':
        # Extracting the words that form the noun phrase by joining the individual words in the chunk.
        entity = " ".join([word for word, pos in subtree.leaves()])
        # Adding the extracted noun phrase entity to the knowledge base list.
        knowledge_base.append(entity)

# Printing the extracted noun phrases that have been stored in the knowledge base.
print(f"Extracted Entities: {knowledge_base}")


Extracted Entities: ['Barack Obama', 'the former president', 'the United States', 'Honolulu']
