# Introduction

Chunking is a core task in Natural Language Processing (NLP) whose goal is to divide text into meaningful groups of tokens, known as “chunks.” These chunks typically represent phrases such as noun phrases (NP), verb phrases (VP), or prepositional phrases (PP). Unlike tokenization and part-of-speech (POS) tagging, which operate at the word level, chunking works one level higher: it groups words into syntactically related units that carry more meaning than individual tokens.



## 3.1 **Definition and Purpose of Chunking**
   - **Definition**:
     - Chunking is the process of segmenting and labeling multi-token sequences (chunks) in a sentence, such as noun phrases (NP), verb phrases (VP), and prepositional phrases (PP).
     - It focuses on identifying non-overlapping, contiguous sequences of words that form coherent phrases.
   
   - **Purpose**:
     - Chunking helps in breaking down sentences into smaller, semantically meaningful units, which is crucial for tasks like named entity recognition (NER), information extraction, and syntactic parsing.
     - It is simpler than full parsing but offers more syntactic information than tokenization or POS tagging alone.

   - **Demonstration**:


In [None]:
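import nltk
# Resource names below match the NLTK version used for this notebook; newer
# releases (3.9+) may instead ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')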

# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog."

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# Perform POS tagging
pos_tags = nltk.pos_tag(tokens)

# Print the POS tags
print("POS Tags:", pos_tags)


POS Tags: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]




## 3.2 **Types of Chunking**
   - **Noun Phrase Chunking (NP-Chunking)**:
     - Focuses on identifying noun phrases, which consist of a noun and its associated modifiers (e.g., adjectives, determiners).
     - NP-Chunking simplifies sentence structure into meaningful groups such as "The quick brown fox."
     


  - **Demonstration**:


In [None]:
import nltk

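# Defining a simple rule for NP (Noun Phrase) chunking using regular expressions
# The tagger labeled 'brown' as NN, so this rule ends the first chunk at 'brown' and puts 'fox' in its own NP
chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"  # NP: optional determiner, zero or more adjectives, then one noun (NN)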
chunk_parser = nltk.RegexpParser(chunk_grammar)

# Using previously defined POS tags for chunking
chunked_tree = chunk_parser.parse(pos_tags)

# Print the chunked tree structure
print(chunked_tree)


(S
  (NP The/DT quick/JJ brown/NN)
  (NP fox/NN)
  jumps/VBZ
  over/IN
  (NP the/DT lazy/JJ dog/NN)
  ./.)


   - **Verb Phrase Chunking (VP-Chunking)**:
     - Involves grouping verb-related components such as the main verb and its arguments (objects or complements).
     - This type of chunking is helpful in identifying actions and their corresponding entities.

     - **Demonstration**:


In [None]:
# Install NLTK
!pip install nltk

# Import the NLTK library
import nltk

# Download necessary NLTK resources
nltk.download('punkt')  # For tokenization
nltk.download('averaged_perceptron_tagger')  # For POS tagging

# Sample sentence for demonstration
sentence = "The quick brown fox jumps over the lazy dog."

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# Generate POS tags for the tokens
pos_tags = nltk.pos_tag(tokens)

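# Defining a rule for VP (Verb Phrase) chunking
# <NP|PP> matches chunks built by earlier rules in the same grammar, not POS tags.
# This grammar defines no NP or PP rule, so only the bare verb is chunked below.
chunk_grammar_vp = "VP: {<VB.*><NP|PP>*}"  # VP: verb followed by zero or more NP or PP chunks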
chunk_parser_vp = nltk.RegexpParser(chunk_grammar_vp)

# Using the POS tags for chunking
chunked_tree_vp = chunk_parser_vp.parse(pos_tags)

# Print the chunked tree structure for verb phrases
print(chunked_tree_vp)


(S
  The/DT
  quick/JJ
  brown/NN
  fox/NN
  (VP jumps/VBZ)
  over/IN
  the/DT
  lazy/JJ
  dog/NN
  ./.)




   - **Prepositional Phrase Chunking (PP-Chunking)**:
     - PP-Chunking identifies prepositional phrases, which typically consist of a preposition and its noun phrase (e.g., "over the lazy dog").
     - Prepositional phrases provide information about relationships between entities, which can be crucial for tasks like relation extraction.

     - **Demonstration**:


In [None]:
import nltk

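# Defining a rule for PP (Prepositional Phrase) chunking
# <NP> refers to NP chunks produced by an earlier rule in the same grammar.
# No NP rule is defined here, so the output below contains no PP.
chunk_grammar_pp = "PP: {<IN><NP>}"  # PP: preposition (IN) followed by an NP chunk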
chunk_parser_pp = nltk.RegexpParser(chunk_grammar_pp)

# Using the POS tags from earlier for chunking
chunked_tree_pp = chunk_parser_pp.parse(pos_tags)

# Print the chunked tree structure for prepositional phrases
print(chunked_tree_pp)



(S
  The/DT
  quick/JJ
  brown/NN
  fox/NN
  jumps/VBZ
  over/IN
  the/DT
  lazy/JJ
  dog/NN
  ./.)


## 3.3 **Techniques for Chunking**
   - **Rule-Based Chunking (Regular Expression-Based Chunking)**:
     - Rule-based chunking involves manually defining patterns based on POS tags to identify chunks. It uses grammar-like rules to specify the structure of phrases.
     - This technique is easy to implement and interpret, but it may not generalize well across different text corpora or domains.
   
   - **Chinking**:
     - Chinking is the inverse of chunking. Instead of specifying patterns to include in a chunk, chinking defines what to exclude from a chunk.
     - This is useful when the initial chunking rule is too inclusive and captures more than desired.

     - **Demonstration**:



In [None]:
import nltk

# Defining a rule for NP chunking with chinking (excluding certain parts)
chunk_grammar_with_chink = """
NP: {<DT>?<JJ>*<NN>}  # Chunk determiners, adjectives, and nouns
}<VBZ>{  # Chink (exclude) VBZ from NP chunks; no effect here, because the chunk rule never includes a verb
"""
chunk_parser_with_chink = nltk.RegexpParser(chunk_grammar_with_chink)

# Using the POS tags from earlier for chunking with chinking
chunked_tree_with_chink = chunk_parser_with_chink.parse(pos_tags)

# Print the chunked tree structure with chinking applied
print(chunked_tree_with_chink)



(S
  (NP The/DT quick/JJ brown/NN)
  (NP fox/NN)
  jumps/VBZ
  over/IN
  (NP the/DT lazy/JJ dog/NN)
  ./.)


   - **Data-Driven Chunking**:
     - Data-driven chunking relies on machine learning models trained on annotated corpora to identify chunk boundaries.
     - Instead of defining rules manually, data-driven methods can learn patterns directly from data, making them more robust to variations in text.
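
To make the data-driven idea concrete, here is a minimal sketch that uses only the standard library. It assumes a tiny hand-made training set; the sentences and the names `train_unigram` and `predict` are illustrative and do not come from a real corpus or library. The "model" simply records, for each POS tag, the chunk tag most often seen with it.

```python
from collections import Counter, defaultdict

# Tiny hand-made training set (an assumption for illustration):
# each token is a (word, POS tag, IOB chunk tag) triple.
train = [
    [("The", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("sleeps", "VBZ", "B-VP")],
    [("A", "DT", "B-NP"), ("small", "JJ", "I-NP"), ("bird", "NN", "I-NP"),
     ("sings", "VBZ", "B-VP"), ("in", "IN", "B-PP"),
     ("the", "DT", "B-NP"), ("tree", "NN", "I-NP")],
]

def train_unigram(sentences):
    """Map each POS tag to the chunk tag it most often carries in the data."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for _, pos, chunk in sent:
            counts[pos][chunk] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

def predict(model, tagged, default="O"):
    """Label (word, POS) pairs; POS tags never seen in training get `default`."""
    return [(word, pos, model.get(pos, default)) for word, pos in tagged]

model = train_unigram(train)
sample = [("The", "DT"), ("lazy", "JJ"), ("dog", "NN"), ("barks", "VBZ"), (".", ".")]
print(predict(model, sample))
# [('The', 'DT', 'B-NP'), ('lazy', 'JJ', 'I-NP'), ('dog', 'NN', 'I-NP'),
#  ('barks', 'VBZ', 'B-VP'), ('.', '.', 'O')]
```

A realistic chunker would be trained on an annotated corpus and would use richer features than the POS tag alone; this sketch only shows that the mapping is learned from data rather than written by hand.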


## 3.4 **Chunk Representation**
   - **IOB Tagging Format**:
     - The Inside-Outside-Beginning (IOB) format is commonly used to represent chunks in a sentence. Each word is labeled as either Inside (I), Outside (O), or Beginning (B) of a chunk.
     - This format is useful for training machine learning models that predict chunk boundaries.

     - **Example**:
       - Sentence: "The quick brown fox jumps over the lazy dog."
       - IOB Format:
         ```
         The    B-NP
         quick  I-NP
         brown  I-NP
         fox    I-NP
         jumps  B-VP
         over   B-PP
         the    B-NP
         lazy   I-NP
         dog    I-NP
         ```

     - **Demonstration**:


In [None]:
# Simple demonstration of assigning IOB tags manually
iob_tags = [
    ('The', 'B-NP'), ('quick', 'I-NP'), ('brown', 'I-NP'),
    ('fox', 'I-NP'), ('jumps', 'B-VP'), ('over', 'B-PP'),
    ('the', 'B-NP'), ('lazy', 'I-NP'), ('dog', 'I-NP')
]

# Print each word along with its IOB tag
for word, tag in iob_tags:
    print(f"{word}: {tag}")


The: B-NP
quick: I-NP
brown: I-NP
fox: I-NP
jumps: B-VP
over: B-PP
the: B-NP
lazy: I-NP
dog: I-NP


   - **Tree Representation**:
     - Chunking can also be represented using tree structures, which allow for the visualization of hierarchical relationships between chunks.
     - Tree representations are especially useful for linguists and NLP researchers as they visually depict the structure of a sentence.
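
For flat (non-nested) chunks, the two representations carry the same information. As a rough illustration, the sketch below uses a hypothetical in-memory format, a list that mixes `(label, words)` chunks with bare strings for tokens outside any chunk, and renders it both as IOB tags and as a bracketed tree string. The function names are illustrative, not part of NLTK.

```python
def chunks_to_iob(chunked):
    """Flatten chunks and loose tokens into (word, IOB tag) pairs."""
    tagged = []
    for item in chunked:
        if isinstance(item, str):            # token outside any chunk
            tagged.append((item, "O"))
            continue
        label, words = item
        for i, word in enumerate(words):
            prefix = "B" if i == 0 else "I"  # first token begins the chunk
            tagged.append((word, f"{prefix}-{label}"))
    return tagged

def to_bracketed(chunked):
    """Render the same structure as a one-line bracketed tree."""
    parts = []
    for item in chunked:
        if isinstance(item, str):
            parts.append(item)
        else:
            label, words = item
            parts.append("(" + label + " " + " ".join(words) + ")")
    return "(S " + " ".join(parts) + ")"

chunked = [
    ("NP", ["The", "quick", "brown", "fox"]),
    ("VP", ["jumps"]),
    ("PP", ["over"]),
    ("NP", ["the", "lazy", "dog"]),
    ".",
]

print(chunks_to_iob(chunked)[:5])
# [('The', 'B-NP'), ('quick', 'I-NP'), ('brown', 'I-NP'), ('fox', 'I-NP'), ('jumps', 'B-VP')]
print(to_bracketed(chunked))
# (S (NP The quick brown fox) (VP jumps) (PP over) (NP the lazy dog) .)
```

Nested chunks, such as an NP inside a PP, can be shown in a tree but not in a single layer of IOB tags, which is one reason both formats remain in use.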



## 3.5 **Developing and Evaluating Chunkers**
   - **Corpus-Based Development**:
     - Chunkers are often trained and evaluated using annotated corpora like the CoNLL-2000 chunking corpus, which provides POS-tagged and chunk-annotated sentences.
     - Data-driven chunkers can use supervised learning to predict chunk boundaries based on features extracted from the input data.

   - **Evaluation Metrics**:
     - **Accuracy**: Percentage of tokens whose IOB tag is predicted correctly.
     - **Precision**: Proportion of predicted chunks that exactly match a gold-standard chunk.
     - **Recall**: Proportion of gold-standard chunks that the chunker predicts exactly.
     - **F-Measure**: Harmonic mean of precision and recall, providing a balanced metric.

     - **Demonstration**:
  

In [None]:
from sklearn.metrics import classification_report

# True labels (ground truth IOB tags)
true_labels = ['B-NP', 'I-NP', 'I-NP', 'I-NP', 'B-VP', 'B-PP', 'B-NP', 'I-NP', 'I-NP']

# Predicted labels by the chunker
predicted_labels = ['B-NP', 'I-NP', 'I-NP', 'I-NP', 'B-VP', 'B-PP', 'B-NP', 'I-NP', 'I-NP']

# Generate and print the classification report (Precision, Recall, F1-score)
print(classification_report(true_labels, predicted_labels))


              precision    recall  f1-score   support

        B-NP       1.00      1.00      1.00         2
        B-PP       1.00      1.00      1.00         1
        B-VP       1.00      1.00      1.00         1
        I-NP       1.00      1.00      1.00         5

    accuracy                           1.00         9
   macro avg       1.00      1.00      1.00         9
weighted avg       1.00      1.00      1.00         9




   - **Unigram and Bigram Chunkers**:
     - **Unigram Chunkers**: Assign each token the chunk tag most often seen with its POS tag in the training data, ignoring surrounding context.
     - **Bigram Chunkers**: Also condition on the chunk tag assigned to the previous token, which captures local context for chunk prediction.
     - These models can be evaluated using standard corpora and metrics mentioned above.
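
Token-level accuracy and chunk-level scores can differ noticeably. The sketch below is one possible way to compute chunk-level precision, recall, and F1, assuming each chunk is represented as a `(label, start, end)` span with an exclusive end index; that representation and the name `chunk_prf` are assumptions for illustration.

```python
def chunk_prf(gold_spans, pred_spans):
    """Chunk-level precision, recall, and F1; a chunk counts only on an exact match."""
    gold, pred = set(gold_spans), set(pred_spans)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Gold chunks for "The quick brown fox jumps over the lazy dog"
gold_spans = [("NP", 0, 4), ("VP", 4, 5), ("PP", 5, 6), ("NP", 6, 9)]
# A prediction that wrongly splits the first NP into two chunks
pred_spans = [("NP", 0, 2), ("NP", 2, 4), ("VP", 4, 5), ("PP", 5, 6), ("NP", 6, 9)]

p, r, f = chunk_prf(gold_spans, pred_spans)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# precision=0.60 recall=0.75 f1=0.67
```

In this example only one of nine IOB tags would be wrong (token accuracy of about 0.89), yet chunk-level F1 is 0.67, because a single boundary error invalidates the whole chunk.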



## 3.6 **Creative Observations in Chunking**

   - **Chunking as a Precursor to Full Parsing**:
     - Chunking can be seen as a lightweight alternative to full syntactic parsing. While chunking only identifies major phrase types (e.g., NP, VP), parsing determines the entire grammatical structure of a sentence.
     - In practice, chunking is computationally less expensive and faster, making it a viable solution for applications where full parsing may not be necessary.

   - **Combining Chunking with Named Entity Recognition**:
     - Chunking can be integrated with Named Entity Recognition (NER) to extract meaningful entities from text. For example, NP chunks can serve as candidates for named entities, which can then be classified as PERSON, ORGANIZATION, or LOCATION.
     - This hybrid approach leverages the strengths of both chunking and NER.

   - **Chunking for Relation Extraction**:
     - Chunking plays a vital role in relation extraction by grouping relevant entities together. Once noun phrases or verb phrases are chunked, they can be analyzed further to extract relationships between entities (e.g., "John works at Microsoft").

   - **Impact of Domain-Specific Language on Chunking**:
     - Chunking performance can vary significantly across domains. For instance, legal or medical texts contain specialized terminology and complex sentence structures, which require customized chunking rules or domain-specific training data for optimal performance.
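
As a sketch of the chunking-plus-NER idea in this section, the snippet below filters NP chunks down to those containing a proper noun, which could then be handed to an entity classifier. The data format (chunks as `(label, [(word, tag), ...])`, loose tokens as `(word, tag)`), the hand-assigned tags, and the name `entity_candidates` are assumptions for illustration.

```python
def entity_candidates(chunked):
    """Return the text of NP chunks that contain at least one proper noun."""
    candidates = []
    for label, content in chunked:
        # Chunks carry a list of (word, tag) pairs; loose tokens carry a tag string.
        if label == "NP" and isinstance(content, list):
            if any(tag.startswith("NNP") for _, tag in content):
                candidates.append(" ".join(word for word, _ in content))
    return candidates

chunked = [
    ("NP", [("Barack", "NNP"), ("Obama", "NNP")]),
    ("visited", "VBD"),
    ("NP", [("the", "DT"), ("old", "JJ"), ("library", "NN")]),
    ("in", "IN"),
    ("NP", [("Hawaii", "NNP")]),
]

print(entity_candidates(chunked))
# ['Barack Obama', 'Hawaii']
```

Deciding whether each candidate is a PERSON, ORGANIZATION, or LOCATION is left to the NER step; the chunker only narrows down where to look.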


# Observations

## 3.1 **Definition and Purpose of Chunking**

- **Definition**: Chunking is the process of grouping tokens (words) into meaningful phrases based on syntactic patterns.
- **Purpose**: Chunking helps group individual tokens into semantically significant units like noun phrases (NP), verb phrases (VP), and prepositional phrases (PP).

- **Code Demonstration**:

This code demonstrates how to use NLTK to tokenize a sentence into individual words and assign part-of-speech (POS) tags to each token. The output will include a list of tuples where each tuple contains a word and its corresponding POS tag.








In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog."

# Tokenizing the sentence into words
tokens = nltk.word_tokenize(sentence)

# Assigning POS tags to each token
pos_tags = nltk.pos_tag(tokens)

# Print the tokens with their corresponding POS tags
print("POS Tags:", pos_tags)


POS Tags: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]




## 3.2 **Types of Chunking**



#### **3.2.1 Noun Phrase Chunking (NP-Chunking)**

- **Goal**: Identify noun phrases based on POS tags.
  
- **Code Demonstration**:


In [None]:
# Defining a grammar for NP-chunking
chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"  # NP: Determiner (optional), adjectives (optional), and noun
chunk_parser = nltk.RegexpParser(chunk_grammar)
chunked_tree = chunk_parser.parse(pos_tags)

# Print the chunked tree structure
print(chunked_tree)


(S
  (NP The/DT quick/JJ brown/NN)
  (NP fox/NN)
  jumps/VBZ
  over/IN
  (NP the/DT lazy/JJ dog/NN)
  ./.)


- **Explanation**:
  - The grammar `NP: {<DT>?<JJ>*<NN>}` specifies that a noun phrase (NP) can consist of:
    - An optional determiner (`<DT>`, e.g., "the"),
    - Zero or more adjectives (`<JJ>*`, e.g., "quick", "brown"),
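    - Followed by a single noun tagged exactly `NN` (`<NN>`, e.g., "fox"); plural or proper nouns (`NNS`, `NNP`) do not match.
  - This rule is applied to chunk parts of the sentence, grouping them as noun phrases. Because the tagger labeled "brown" as `NN`, the first chunk ends at "brown" and "fox" becomes a separate NP; a pattern such as `<NN.*>+` would keep consecutive nouns in one chunk.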
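
To see how such a tag pattern is matched, the sketch below approximates the idea with the standard `re` module: the POS tags are joined into one string like `<DT><JJ><NN>` and an ordinary regular expression is run over it. This is a simplified illustration, not NLTK's actual implementation, and `chunk_np` is a hypothetical helper.

```python
import re

def chunk_np(tagged, pattern=r"(<DT>)?(<JJ>)*(<NN>)"):
    """Group (word, tag) pairs whose tag sequence matches `pattern` into NP chunks."""
    tags = [f"<{tag}>" for _, tag in tagged]
    tag_string = "".join(tags)          # e.g. "<DT><JJ><NN>..."
    # Map character offsets in tag_string back to token indices.
    starts, ends, pos = {}, {}, 0
    for i, t in enumerate(tags):
        starts[pos] = i
        pos += len(t)
        ends[pos] = i + 1
    result, cursor = [], 0
    for m in re.finditer(pattern, tag_string):
        i, j = starts[m.start()], ends[m.end()]
        result.extend(tagged[cursor:i])         # tokens outside any chunk
        result.append(("NP", tagged[i:j]))      # the matched chunk
        cursor = j
    result.extend(tagged[cursor:])
    return result

tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
          ("dog", "NN"), (".", ".")]

strict = chunk_np(tagged)
print(strict[0], strict[1])
# ('NP', [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN')]) ('NP', [('fox', 'NN')])

# Allowing one or more noun tags keeps "brown fox" together.
relaxed = chunk_np(tagged, r"(<DT>)?(<JJ>)*(<NN[^>]*>)+")
print(relaxed[0])
# ('NP', [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN')])
```

The first call reproduces the split seen in the output above; the second shows how a small change to the pattern changes the chunk boundaries.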


#### **3.2.2 Verb Phrase Chunking (VP-Chunking)**

- **Goal**: Group together verb phrases based on POS tags and other syntactic clues.

- **Code Demonstration**:


In [None]:
# Defining a grammar for VP-chunking
chunk_grammar_vp = "VP: {<VB.*><NP|PP>*}"  # VP: Verb followed by noun phrases or prepositional phrases
chunk_parser_vp = nltk.RegexpParser(chunk_grammar_vp)
chunked_tree_vp = chunk_parser_vp.parse(pos_tags)

# Print the chunked tree structure
print(chunked_tree_vp)


(S
  The/DT
  quick/JJ
  brown/NN
  fox/NN
  (VP jumps/VBZ)
  over/IN
  the/DT
  lazy/JJ
  dog/NN
  ./.)


- **Explanation**:
  - The grammar `VP: {<VB.*><NP|PP>*}` is meant to capture verb phrases (VP) by identifying:
    - Any verb (`<VB.*>`, such as `VB`, `VBZ`, `VBD`, etc.),
    - Followed by zero or more noun-phrase (`<NP>`) or prepositional-phrase (`<PP>`) chunks.
  - `<NP>` and `<PP>` match chunks produced by earlier rules, not POS tags. This grammar defines no NP or PP rule, so only the bare verb "jumps" is chunked in the output above. When NP and PP rules come first in the same grammar, the verb is grouped with the phrases that follow it.

#### **3.2.3 Prepositional Phrase Chunking (PP-Chunking)**

- **Goal**: Identify prepositional phrases.

- **Code Demonstration**:


In [None]:
# Defining a grammar for PP-chunking
chunk_grammar_pp = "PP: {<IN><NP>}"  # PP: Preposition followed by a noun phrase
chunk_parser_pp = nltk.RegexpParser(chunk_grammar_pp)
chunked_tree_pp = chunk_parser_pp.parse(pos_tags)

# Print the chunked tree structure
print(chunked_tree_pp)


(S
  The/DT
  quick/JJ
  brown/NN
  fox/NN
  jumps/VBZ
  over/IN
  the/DT
  lazy/JJ
  dog/NN
  ./.)


- **Explanation**:
  - The grammar `PP: {<IN><NP>}` is meant to capture **prepositional phrases (PP)** by identifying:
    - A preposition (`<IN>`, e.g., "over", "in", "with"),
    - Followed by a **noun phrase** chunk (`<NP>`, e.g., "the lazy dog").
  - Because `<NP>` refers to a chunk rather than a POS tag, and this grammar defines no NP rule, nothing matches and the output above contains no PP. With an NP rule ahead of it, as in the rule-based grammar of Section 3.3.1, the rule extracts "over the lazy dog".

## 3.3 **Techniques for Chunking**




#### **3.3.1 Rule-Based Chunking (Regular Expression-Based Chunking)**

- **Goal**: Use manually defined rules to identify chunks in a sentence.

- **Code Demonstration**:


In [None]:
# Rule-based chunking using regular expressions
chunk_grammar_rule_based = """
NP: {<DT>?<JJ>*<NN>}   # Noun Phrase: Optional determiner, adjectives, and noun
VP: {<VB.*><NP|PP>*}   # Verb Phrase: Verb followed by noun phrases or prepositional phrases
PP: {<IN><NP>}         # Prepositional Phrase: Preposition followed by noun phrase
"""
chunk_parser_rule_based = nltk.RegexpParser(chunk_grammar_rule_based)
chunked_tree_rule_based = chunk_parser_rule_based.parse(pos_tags)

# Print the chunked tree structure
print(chunked_tree_rule_based)


(S
  (NP The/DT quick/JJ brown/NN)
  (NP fox/NN)
  (VP jumps/VBZ)
  (PP over/IN (NP the/DT lazy/JJ dog/NN))
  ./.)


#### **3.3.2 Chinking**

- **Goal**: Exclude certain tokens from chunks to refine chunk boundaries.

- **Code Demonstration**:


In [None]:
# Chunking with chinking: the chink rule removes VBZ tokens from NP chunks
chunk_grammar_with_chink = """
NP: {<DT>?<JJ>*<NN>}  # Chunk determiner, adjectives, and nouns
}<VBZ>{  # Chink VBZ out of NP chunks; no effect here, because the chunk rule never includes a verb
"""
chunk_parser_with_chink = nltk.RegexpParser(chunk_grammar_with_chink)
chunked_tree_with_chink = chunk_parser_with_chink.parse(pos_tags)

# Print the chunked tree structure with chinking applied
print(chunked_tree_with_chink)



(S
  (NP The/DT quick/JJ brown/NN)
  (NP fox/NN)
  jumps/VBZ
  over/IN
  (NP the/DT lazy/JJ dog/NN)
  ./.)
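
In the grammar above the chink rule has no visible effect, because the chunk rule never places a verb inside an NP. Chinking matters when the chunk rule is deliberately broad. The standard-library sketch below illustrates that case: it starts from one chunk covering every token and then cuts out the tokens whose tags are chinked. The helper `chink` is hypothetical; in NLTK the comparable approach is a grammar that chunks everything with `{<.*>+}` and then applies a chink rule such as `}<VBZ|IN>+{`.

```python
def chink(tagged, chink_tags, label="NP"):
    """Treat all tokens as one chunk, then remove tokens whose tag is chinked."""
    result, current = [], []
    for word, tag in tagged:
        if tag in chink_tags:
            if current:                     # close the chunk built so far
                result.append((label, current))
                current = []
            result.append((word, tag))      # the chinked token stays outside
        else:
            current.append((word, tag))
    if current:
        result.append((label, current))
    return result

tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
          ("dog", "NN"), (".", ".")]

for item in chink(tagged, {"VBZ", "IN", "."}):
    print(item)
# ('NP', [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN')])
# ('jumps', 'VBZ')
# ('over', 'IN')
# ('NP', [('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')])
# ('.', '.')
```

Here the chink set, rather than a positive pattern, determines the chunk boundaries, which is why "brown fox" stays in a single NP.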


#### **3.3.3 Data-Driven Chunking**

- **Goal**: Use machine learning models to predict chunk boundaries.

- **Code Demonstration Using NLTK**:


In [None]:
# Training a Unigram chunker on the CoNLL 2000 chunking corpus
import nltk
nltk.download('conll2000')

from nltk.corpus import conll2000
from nltk.chunk.util import conlltags2tree
from nltk.chunk import ChunkParserI
from nltk.tag import UnigramTagger

# Define a UnigramChunker class
class UnigramChunker(ChunkParserI):
    def __init__(self, train_sents):
        # Convert training data into a format suitable for the UnigramTagger
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)] for sent in train_sents]
        self.tagger = UnigramTagger(train_data)  # Train a UnigramTagger on the chunk labels

    def parse(self, sentence):
        # Extract POS tags from the sentence
        pos_tags = [pos for (word, pos) in sentence]
        # Use the trained UnigramTagger to predict chunk tags
        tagged_pos_tags = self.tagger.tag(pos_tags)
        # Combine words, POS tags, and chunk predictions into the format expected by conlltags2tree
        conll_tags = [(word, pos, chunk) for ((word, pos), (pos2, chunk)) in zip(sentence, tagged_pos_tags)]
        # Convert the tagged sentence into a chunk tree and return it
        return conlltags2tree(conll_tags)

# Training the Unigram chunker on the CoNLL-2000 chunking corpus
train_sentences = conll2000.chunked_sents('train.txt')
unigram_chunker = UnigramChunker(train_sentences)

# Test the chunker on a sample sentence
test_sentence = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
                 ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]

# Output the chunk tree for the test sentence
print(unigram_chunker.parse(test_sentence))




(S
  (NP The/DT quick/JJ brown/JJ fox/NN)
  (VP jumps/VBZ)
  (PP over/IN)
  (NP the/DT lazy/JJ dog/NN))


## 3.4 **Chunk Representation**



#### **3.4.1 IOB Tagging Format**

- **Goal**: Use the Inside-Outside-Beginning (IOB) format to label chunks.

- **Code Demonstration**:


In [None]:
# Example IOB format for a sentence
iob_tags = [
    ('The', 'B-NP'), ('quick', 'I-NP'), ('brown', 'I-NP'),
    ('fox', 'I-NP'), ('jumps', 'B-VP'), ('over', 'B-PP'),
    ('the', 'B-NP'), ('lazy', 'I-NP'), ('dog', 'I-NP')
]

# Print each word with its corresponding IOB tag
for word, tag in iob_tags:
    print(f"{word}: {tag}")


The: B-NP
quick: I-NP
brown: I-NP
fox: I-NP
jumps: B-VP
over: B-PP
the: B-NP
lazy: I-NP
dog: I-NP


- **Explanation**:
  - **IOB (Inside-Outside-Beginning)** tags are used to mark the boundaries of chunks in text.
    - **B-** (Beginning) indicates the beginning of a chunk.
    - **I-** (Inside) marks a token inside a chunk.
    - **O** (Outside) means the token is not part of any chunk.
  - For example, the noun phrase "The quick brown fox" is tagged with `B-NP` (Beginning of Noun Phrase) for "The" and `I-NP` (Inside Noun Phrase) for "quick", "brown", and "fox".
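
Reading chunks back out of an IOB sequence follows directly from these definitions. A minimal sketch, assuming the tags arrive as a plain list of strings (the name `iob_to_spans` is illustrative):

```python
def iob_to_spans(tags):
    """Return (label, start, end) spans, end exclusive, from a list of IOB tags."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):      # trailing "O" flushes the last chunk
        continues = tag.startswith("I-") and tag[2:] == label
        if not continues:
            if label is not None:               # close the open chunk
                spans.append((label, start, i))
                start, label = None, None
            if tag != "O":                      # B-X, or a stray I-X, opens a chunk
                start, label = i, tag[2:]
    return spans

tags = ["B-NP", "I-NP", "I-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP", "I-NP", "O"]
print(iob_to_spans(tags))
# [('NP', 0, 4), ('VP', 4, 5), ('PP', 5, 6), ('NP', 6, 9)]
```

The `B-` prefix is what allows two adjacent chunks of the same type to be told apart: `B-NP I-NP B-NP` yields two NPs, whereas `B-NP I-NP I-NP` yields one.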

#### **3.4.2 Tree Representation**

- **Goal**: Represent chunks as hierarchical trees for visualization.

- **Code Demonstration**:


In [None]:
# Example tree structure for a chunked sentence
from nltk import Tree

# Define a tree structure for the sentence "The quick brown fox jumps over the lazy dog."
tree = Tree('S', [
    Tree('NP', [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN')]),  # Noun Phrase (NP)
    ('jumps', 'VBZ'),  # Verb
    Tree('PP', [('over', 'IN'),  # Prepositional Phrase (PP) with preposition 'over'
        Tree('NP', [('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')])])  # Nested NP within the PP
])

# Pretty-print the tree structure
tree.pretty_print()



                                     S                                     
     ________________________________|____________________                  
    |               |                                     PP               
    |               |                         ____________|_____            
    |               NP                       |                  NP         
    |        _______|________________        |       ___________|______     
jumps/VBZ The/DT quick/JJ brown/JJ fox/NN over/IN the/DT     lazy/JJ dog/NN



## 3.5 **Developing and Evaluating Chunkers**



#### **3.5.1 Corpus-Based Development**

- **Goal**: Train chunkers using annotated corpora like CoNLL-2000.

- **Code Demonstration**:



In [None]:
import nltk
nltk.download('conll2000')

from nltk.corpus import conll2000

# The conll2000.chunked_sents() function retrieves the chunked sentences from the dataset, where each sentence includes part-of-speech tags and chunk labels.

# Load training and testing sentences from the CoNLL-2000 chunking corpus
train_sents = conll2000.chunked_sents('train.txt')
test_sents = conll2000.chunked_sents('test.txt')

# train_sents[0] displays the first sentence from the training set with its chunk structure, showing how the text is annotated with phrases such as NP (noun phrase) and VP (verb phrase).
# Display the first sentence in the training set with chunk annotations
print(train_sents[0])


(S
  (NP Confidence/NN)
  (PP in/IN)
  (NP the/DT pound/NN)
  (VP is/VBZ widely/RB expected/VBN to/TO take/VB)
  (NP another/DT sharp/JJ dive/NN)
  if/IN
  (NP trade/NN figures/NNS)
  (PP for/IN)
  (NP September/NNP)
  ,/,
  due/JJ
  (PP for/IN)
  (NP release/NN)
  (NP tomorrow/NN)
  ,/,
  (VP fail/VB to/TO show/VB)
  (NP a/DT substantial/JJ improvement/NN)
  (PP from/IN)
  (NP July/NNP and/CC August/NNP)
  (NP 's/POS near-record/JJ deficits/NNS)
  ./.)




#### **3.5.2 Evaluation Metrics**

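- **Goal**: Evaluate chunkers using precision, recall, and F1-score. The demonstration below scores individual IOB tags (token level) and only the `B-NP`, `I-NP`, and `O` labels, so its figures are per-tag scores rather than chunk-level ones.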

- **Code Demonstration**:


In [None]:
from nltk.chunk import conlltags2tree, tree2conlltags
from sklearn.metrics import classification_report

# Convert a chunk tree to IOB tags
def tree_to_iob(tree):
    return tree2conlltags(tree)

# Evaluate chunker performance
def evaluate_chunker(chunker, test_sents):
    # Convert gold standard and predicted chunk trees to IOB tags
    gold = [tree_to_iob(sent) for sent in test_sents]  # Gold standard IOB tags
    predictions = [tree_to_iob(chunker.parse(sent.leaves())) for sent in test_sents]  # Predicted IOB tags by the chunker

    # Flatten the gold and predicted IOB tags to lists for evaluation
    gold_flat = [tag for sent in gold for _, _, tag in sent]  # Flatten gold standard tags
    pred_flat = [tag for sent in predictions for _, _, tag in sent]  # Flatten predicted tags

    # Use classification_report to evaluate precision, recall, and F1-score
    return classification_report(gold_flat, pred_flat, labels=["B-NP", "I-NP", "O"], zero_division=0)

# Example evaluation on the Unigram chunker
print(evaluate_chunker(unigram_chunker, test_sents))


              precision    recall  f1-score   support

        B-NP       0.87      0.95      0.91     12422
        I-NP       0.97      0.86      0.91     14376
           O       0.86      0.83      0.85      8416

   micro avg       0.91      0.89      0.90     35214
   macro avg       0.90      0.88      0.89     35214
weighted avg       0.91      0.89      0.90     35214



## 3.6 **Creative Observations in Chunking**



#### **3.6.1 Multi-Tasking with Chunking and Named Entity Recognition (NER)**

- **Observation**:
   - Chunking is often combined with Named Entity Recognition (NER) to improve entity extraction. Noun phrases (NPs) identified through chunking can serve as candidates for entity recognition, which can then classify these phrases as `PERSON`, `LOCATION`, `ORGANIZATION`, or other types.
   - This combination provides a more refined understanding of the text, allowing both syntactic (chunking) and semantic (NER) information to work together.

- **Code Demonstration**:


In [None]:
from nltk.chunk import ne_chunk
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Use NLTK's ne_chunk to identify named entities in a sentence
sentence = [("Barack", "NNP"), ("Obama", "NNP"), ("was", "VBD"), ("born", "VBN"),
            ("in", "IN"), ("Hawaii", "NNP")]

# Perform named entity recognition (NER) and chunking on the POS-tagged sentence
named_entities_tree = ne_chunk(sentence)

# Print the named entities tree structure
print(named_entities_tree)





(S
  (PERSON Barack/NNP)
  (PERSON Obama/NNP)
  was/VBD
  born/VBN
  in/IN
  (GPE Hawaii/NNP))


#### **3.6.2 Chunking for Relation Extraction**

- **Observation**:
   - After chunking, relation extraction can be performed to identify the relationships between entities in a sentence. For example, once NP chunks (noun phrases) and VP chunks (verb phrases) are identified, the relation between entities such as "John" and "Microsoft" can be detected via a phrase like "works at."
   - Chunking helps by simplifying sentences into basic components, making it easier to detect relationships.

- **Code Demonstration**:


In [None]:
import re

# Sample text
text = "John works at Microsoft in Seattle."

# Define a regular expression pattern to extract person, organization, and location entities
pattern = r"(?P<person>\b[A-Z][a-z]+\b) works at (?P<organization>\b[A-Z][a-z]+\b) in (?P<location>\b[A-Z][a-z]+\b)"

# Search for the pattern in the text
match = re.search(pattern, text)

# If the pattern matches, extract and print the named entities
if match:
    print("Person:", match.group("person"))
    print("Organization:", match.group("organization"))
    print("Location:", match.group("location"))


Person: John
Organization: Microsoft
Location: Seattle


- **Explanation**:
  - This example uses **regular expressions** for **relation extraction** by defining a pattern that captures:
    - A **person** (capitalized word like "John"),
    - An **organization** (capitalized word like "Microsoft"),
    - A **location** (capitalized word like "Seattle").
  - The regex pattern matches sentences like "John works at Microsoft in Seattle" and uses named capturing groups (`?P<name>`) to label different entities: **Person**, **Organization**, and **Location**.
  - This is a simple method for extracting structured relationships from text using patterns based on expected sentence structures.
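
The regular expression above works on the raw string and matches only that exact sentence shape. A chunk-based variant can be sketched as follows, assuming the chunker output has already been reduced to `(label, text)` pairs; that format and the name `extract_relations` are assumptions for illustration.

```python
def extract_relations(chunks):
    """Return (left NP, connecting text, right NP) for each pair of neighbouring NPs."""
    relations = []
    subject, link = None, []
    for label, text in chunks:
        if label == "NP":
            if subject is not None and link:
                relations.append((subject, " ".join(link), text))
            subject, link = text, []        # this NP may start the next relation
        elif subject is not None:
            link.append(text)               # VP/PP text between two NPs
    return relations

# Chunks for "John works at Microsoft in Seattle" (hand-written for the example)
chunks = [("NP", "John"), ("VP", "works"), ("PP", "at"),
          ("NP", "Microsoft"), ("PP", "in"), ("NP", "Seattle")]

print(extract_relations(chunks))
# [('John', 'works at', 'Microsoft'), ('Microsoft', 'in', 'Seattle')]
```

Because it keys on chunk labels instead of literal words, the same function would also handle other sentences of the form NP, linking words, NP, for example one using "is employed by" instead of "works at".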

#### **3.6.3 Domain-Specific Chunking**

- **Observation**:
   - Different domains, such as legal, medical, or financial texts, require customized chunking grammars due to variations in sentence structure and terminology. For example, legal texts may have longer, more complex noun phrases, while medical texts might involve specific terms like drug names or symptoms.
   - Adapting chunking grammars or training domain-specific models can significantly improve chunking performance in these specialized fields.

- **Code Demonstration for Medical Text Chunking**:


In [None]:
# Example chunking for medical text
import nltk

# Medical text sample
medical_text = "The patient was prescribed 20mg of Lisinopril for hypertension."

# Tokenize and assign POS tags
medical_tokens = nltk.word_tokenize(medical_text)
medical_pos_tags = nltk.pos_tag(medical_tokens)

# Customized NP and VP chunking for drug dosage and medical conditions
medical_chunk_grammar = r"""
NP: {<CD><NN><IN><NNP>}  # Noun Phrase: Dosage (CD), Unit (NN), Preposition (IN), Drug name (NNP)
VP: {<VBD><NP>}          # Verb Phrase: Verb (VBD) followed by a noun phrase (NP)
"""

# Create a chunk parser with the defined grammar
medical_chunk_parser = nltk.RegexpParser(medical_chunk_grammar)
medical_chunked_tree = medical_chunk_parser.parse(medical_pos_tags)

# Print and display the chunked tree
print(medical_chunked_tree)


(S
  The/DT
  patient/NN
  was/VBD
  prescribed/VBN
  20mg/CD
  of/IN
  Lisinopril/NNP
  for/IN
  hypertension/NN
  ./.)



- **Explanation**:
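  - The chunking grammar is customized for **medical text**:
    - **NP**: Intended to match patterns like **"20mg of Lisinopril"**, where:
      - `<CD>` represents a cardinal number (the dosage amount),
      - `<NN>` is a noun (the unit, e.g., **mg**),
      - `<IN>` is a preposition (e.g., **of**),
      - `<NNP>` is a proper noun (e.g., **Lisinopril**).
    - **VP**: Intended to match verb phrases in which a `VBD` verb is followed directly by a noun phrase.
  - As the output above shows, neither rule matches this sentence. The tokenizer keeps "20mg" as a single token tagged `CD`, so no separate `<NN>` unit follows it, and "was" (`VBD`) is followed by "prescribed" (`VBN`) rather than by an NP. Patterns such as `{<CD><IN><NNP>}` and `{<VBD><VBN><NP>}` would fit the tags actually produced.
  - The example illustrates how a grammar can target domain-specific structures in **medical text**, and why rules must be checked against real tagger output. The resulting tree can be printed, or displayed graphically with **`draw()`**.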
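
One possible preprocessing step for dosage patterns is to separate the number from its unit before tokenizing, so that each can receive its own POS tag. The sketch below is an assumption rather than part of the original pipeline: the unit list and the name `split_dose_units` are illustrative, and the tags the tagger assigns afterwards should be checked on real data.

```python
import re

# A small, illustrative set of units; extend as needed for a given corpus.
DOSE = re.compile(r"\b(\d+(?:\.\d+)?)(mg|mcg|g|ml)\b", re.IGNORECASE)

def split_dose_units(text):
    """Insert a space between a number and the unit attached to it."""
    return DOSE.sub(r"\1 \2", text)

medical_text = "The patient was prescribed 20mg of Lisinopril for hypertension."
print(split_dose_units(medical_text))
# The patient was prescribed 20 mg of Lisinopril for hypertension.
```

After this step the tokenizer sees "20" and "mg" as separate tokens, which gives a number-plus-unit pattern something to match.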

## 3.7 **Chunk Representation**



#### **3.7.1 IOB Tagging Format**

- **Observation**:
   - The Inside-Outside-Beginning (IOB) format is widely used to represent chunked text, especially in datasets for training machine learning models. Each token is labeled as being inside (I), outside (O), or at the beginning (B) of a chunk.
   - The IOB format is useful for training sequence labeling models such as Hidden Markov Models (HMM) or Conditional Random Fields (CRF) for chunking tasks.

- **Code Demonstration**:


In [None]:
from nltk.chunk import conlltags2tree, tree2conlltags
import nltk

# Convert the chunk tree from the medical example to IOB tags.
# Every tag is 'O' because the medical grammar matched no chunks in that sentence.
iob_tags = tree2conlltags(medical_chunked_tree)
print(iob_tags)

# Convert IOB tags back to chunk tree
chunk_tree_from_iob = conlltags2tree(iob_tags)
print(chunk_tree_from_iob)

[('The', 'DT', 'O'), ('patient', 'NN', 'O'), ('was', 'VBD', 'O'), ('prescribed', 'VBN', 'O'), ('20mg', 'CD', 'O'), ('of', 'IN', 'O'), ('Lisinopril', 'NNP', 'O'), ('for', 'IN', 'O'), ('hypertension', 'NN', 'O'), ('.', '.', 'O')]
(S
  The/DT
  patient/NN
  was/VBD
  prescribed/VBN
  20mg/CD
  of/IN
  Lisinopril/NNP
  for/IN
  hypertension/NN
  ./.)

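In the output above every token carries the tag `O`, because the medical tree contains no chunks. To see how `B-` and `I-` tags mark chunk boundaries, the following minimal sketch decodes a hand-written IOB sequence into chunk spans. The helper name `iob_to_spans` and the example tags are illustrative assumptions, not part of NLTK.

```python
def iob_to_spans(iob_tags):
    """Decode a list of IOB tags into (label, start, end) spans; end is exclusive."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(iob_tags):
        # An I- tag that continues the open chunk extends it; nothing to do
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the chunk that is currently open
        if label is not None:
            spans.append((label, start, i))
            label, start = None, None
        # B- opens a new chunk; a stray I- is treated like B- (a lenient choice)
        if tag != "O":
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, start, len(iob_tags)))
    return spans

# Hypothetical hand-tagged sentence (not produced by the grammar above)
words = ["The", "patient", "was", "prescribed", "Lisinopril", "."]
tags = ["B-NP", "I-NP", "B-VP", "I-VP", "B-NP", "O"]

chunks = [(label, " ".join(words[s:e])) for label, s, e in iob_to_spans(tags)]
print(chunks)
```

Each `B-` tag starts a new span and each matching `I-` tag extends it, which is why two adjacent chunks of the same type can still be told apart.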

#### **3.7.2 Tree Representation**

- **Observation**:
   - Chunked sentences are often represented using tree structures, which visually depict the hierarchical relationships between different chunks. This tree representation is useful for linguists and NLP researchers to understand the syntactic structure of sentences.
   - NLTK provides functions to display tree structures, allowing for a clear visualization of chunking results.

- **Code Demonstration**:


In [None]:
from nltk import Tree

# Example of creating a chunk tree manually
tree = Tree('S', [
    Tree('NP', [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN')]),  # Noun Phrase
    ('jumps', 'VBZ'),  # Verb
    Tree('PP', [('over', 'IN'),  # Prepositional Phrase
        Tree('NP', [('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')])])  # Nested Noun Phrase within PP
])

# Print a pretty representation of the tree structure
tree.pretty_print()



                                     S                                     
     ________________________________|____________________                  
    |               |                                     PP               
    |               |                         ____________|_____            
    |               NP                       |                  NP         
    |        _______|________________        |       ___________|______     
jumps/VBZ The/DT quick/JJ brown/JJ fox/NN over/IN the/DT     lazy/JJ dog/NN




- **Explanation**:
  - This code creates a **chunk tree** manually using NLTK's `Tree` class.
  - The tree structure includes:
    - A **Noun Phrase (NP)** consisting of a determiner (`DT`), adjectives (`JJ`), and a noun (`NN`).
    - A **Verb (VBZ)** for the action performed by the subject.
    - A **Prepositional Phrase (PP)** that includes a preposition (`IN`) followed by another **Noun Phrase (NP)**.
  - **`pretty_print()`** renders the tree as text, as shown above; **`draw()`** opens it in a graphical window instead. Either view makes it easier to see how the chunks relate to one another and combine into the overall sentence structure.

## 3.8 **Developing and Evaluating Chunkers**



#### **3.8.1 Training a Chunker with a Corpus**

- **Observation**:
   - Chunkers can be trained on annotated corpora, such as the CoNLL-2000 chunking dataset, in which each word carries a POS tag and a chunk label. NLTK's n-gram taggers (`UnigramTagger`, `BigramTagger`) can be trained to predict chunk labels from POS tags, and more advanced sequence models such as Conditional Random Fields (CRF) can learn chunk boundaries from richer features.
   - NLTK provides access to pre-annotated corpora, making it easy to train chunkers using real-world data.

- **Code Demonstration (Training a Unigram Chunker)**:


In [None]:
import nltk
nltk.download('conll2000')

from nltk.corpus import conll2000
from nltk.chunk.util import conlltags2tree
from nltk.tag import UnigramTagger

# Define a UnigramChunker class that inherits from ChunkParserI
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Prepare training data in IOB format for the UnigramTagger
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)] for sent in train_sents]
        self.tagger = UnigramTagger(train_data)  # Train a UnigramTagger on the chunk labels

    def parse(self, sentence):
        # Extract POS tags from the sentence
        pos_tags = [pos for (word, pos) in sentence]
        # Use the trained UnigramTagger to predict chunk tags
        tagged_pos_tags = self.tagger.tag(pos_tags)
        # Combine words, POS tags, and chunk predictions into the format expected by conlltags2tree
        conll_tags = [(word, pos, chunk) for ((word, pos), (pos2, chunk)) in zip(sentence, tagged_pos_tags)]
        # Convert the tagged sentence into a chunk tree and return it
        return conlltags2tree(conll_tags)

# Load training sentences from the CoNLL-2000 chunking corpus
train_sentences = conll2000.chunked_sents('train.txt')
# Initialize the UnigramChunker with the training data
unigram_chunker = UnigramChunker(train_sentences)

# Test the chunker on a sample sentence
test_sentence = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
                 ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]

# Output the chunk tree for the test sentence
print(unigram_chunker.parse(test_sentence))


[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!


(S
  (NP The/DT quick/JJ brown/JJ fox/NN)
  (VP jumps/VBZ)
  (PP over/IN)
  (NP the/DT lazy/JJ dog/NN))

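Conceptually, a unigram chunker trained this way reduces to a lookup table: for each POS tag, it stores the chunk tag seen most often with it in training. The sketch below builds such a table in plain Python from a tiny hand-made training set. The function names and the toy data are assumptions for illustration, and the code does not reproduce the internals of `UnigramTagger`; in particular, falling back to `O` for unseen POS tags is a choice made here.

```python
from collections import Counter, defaultdict

def train_pos_to_chunk(tagged_sents):
    """Map each POS tag to the chunk tag it most often receives in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for pos, chunk_tag in sent:
            counts[pos][chunk_tag] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

def predict_chunk_tags(table, pos_tags, default="O"):
    """Look up each POS tag; unseen tags fall back to `default`."""
    return [table.get(pos, default) for pos in pos_tags]

# Toy (POS tag, chunk tag) training data, invented for this example
toy_train = [
    [("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP"), ("VBZ", "B-VP"),
     ("IN", "B-PP"), ("DT", "B-NP"), ("NN", "I-NP")],
    [("NN", "B-NP"), ("VBZ", "B-VP"), ("DT", "B-NP"), ("NN", "I-NP")],
]

table = train_pos_to_chunk(toy_train)
print(predict_chunk_tags(table, ["DT", "JJ", "NN", "VBZ", "IN", "DT", "NN", "."]))
```

Because the table sees only one POS tag at a time, it assigns every `NN` the same chunk tag and cannot tell a noun that starts a chunk from one that continues it. Models that use context, such as bigram taggers or CRFs, address this limitation.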

#### **3.8.2 Evaluating Chunkers**

- **Observation**:
   - After training a chunker, it is essential to evaluate its performance using metrics such as precision, recall, and F1-score. These metrics help determine how well the chunker identifies correct chunks, avoids false positives, and captures all relevant chunks.
   - Evaluation is typically performed using a test set, which is separate from the training data.

- **Code Demonstration (Evaluating a Chunker)**:


In [None]:
from nltk.chunk import tree2conlltags
from nltk.corpus import conll2000
from sklearn.metrics import classification_report

# Convert a chunk tree to IOB tags
def tree_to_iob(tree):
    return tree2conlltags(tree)

# Evaluate chunker performance
def evaluate_chunker(chunker, test_sents):
    # Convert the gold standard sentences to IOB tags
    gold = [tree_to_iob(sent) for sent in test_sents]
    # Generate predictions by parsing the test sentences
    predictions = [tree_to_iob(chunker.parse(sent.leaves())) for sent in test_sents]

    # Flatten the lists to compare gold and predicted tags
    gold_flat = [tag for sent in gold for _, _, tag in sent]
    pred_flat = [tag for sent in predictions for _, _, tag in sent]

    # Report token-level scores for the NP tags and O only; the chunker also
    # predicts other chunk types (e.g. VP, PP), which this report leaves out
    return classification_report(gold_flat, pred_flat, labels=["B-NP", "I-NP", "O"], zero_division=0)

# Example evaluation on the CoNLL-2000 test set
# (unigram_chunker is the chunker trained in the previous cell)
test_sentences = conll2000.chunked_sents('test.txt')
print(evaluate_chunker(unigram_chunker, test_sentences))


              precision    recall  f1-score   support

        B-NP       0.87      0.95      0.91     12422
        I-NP       0.97      0.86      0.91     14376
           O       0.86      0.83      0.85      8416

   micro avg       0.91      0.89      0.90     35214
   macro avg       0.90      0.88      0.89     35214
weighted avg       0.91      0.89      0.90     35214

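The report above scores each token's tag independently. Chunking is also commonly scored at the chunk level, where a predicted chunk counts as correct only if its label and both boundaries match a gold chunk. The following is a minimal sketch of that computation on hand-written tag sequences. `chunk_spans` and `chunk_prf` are hypothetical helper names, and the example tags are invented rather than taken from CoNLL-2000.

```python
def chunk_spans(iob_tags):
    """Return the set of (label, start, end) chunks encoded by IOB tags."""
    spans, label, start = set(), None, None
    # A trailing "O" sentinel closes a chunk that runs to the end of the sentence
    for i, tag in enumerate(list(iob_tags) + ["O"]):
        if tag.startswith("I-") and label == tag[2:]:
            continue  # the current chunk continues
        if label is not None:
            spans.add((label, start, i))
            label = None
        if tag != "O":
            label, start = tag[2:], i
    return spans

def chunk_prf(gold_tags, pred_tags):
    """Chunk-level precision, recall and F1 for one tag sequence."""
    gold, pred = chunk_spans(gold_tags), chunk_spans(pred_tags)
    correct = len(gold & pred)  # exact matches on label and both boundaries
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

gold = ["B-NP", "I-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP"]
pred = ["B-NP", "I-NP", "B-NP", "B-VP", "B-PP", "B-NP", "I-NP"]

p, r, f = chunk_prf(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this example a single wrong tag out of seven splits one gold chunk into two predicted ones, so three of five predicted chunks and three of four gold chunks match. Chunk-level scores therefore tend to drop more sharply than token-level ones for the same tagging errors.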
