#### 4.1 **Overview of Chunkers**
   - **Definition**: Chunkers are tools that identify and group sequences of words into meaningful units, such as noun phrases (NPs), verb phrases (VPs), and other syntactic structures.
   - **Purpose**:
     - Provide syntactic analysis of sentences by segmenting and labeling units.
     - Useful for named entity recognition (NER), relation extraction, and sentence parsing.
   - **Key Techniques**: Rule-based chunking, statistical chunking, and classifier-based chunking.
   - **Example Code**:


In [None]:
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample text to process
text = "The quick brown fox jumps over the lazy dog."

# Tokenize the text into individual words
tokens = nltk.word_tokenize(text)

# Assign part-of-speech (POS) tags to each token
pos_tags = nltk.pos_tag(tokens)

# Define a grammar for noun phrase (NP) chunking using a regular expression
# The grammar specifies that an NP can consist of:
# - An optional determiner (DT),
# - Zero or more adjectives (JJ),
# - A noun (NN).
grammar = "NP: {<DT>?<JJ>*<NN>}"

# Create a chunk parser with the defined grammar
chunk_parser = nltk.RegexpParser(grammar)

# Parse the POS-tagged tokens to create a chunked tree
chunked = chunk_parser.parse(pos_tags)

# Print the chunked tree as text
# (chunked.draw() would instead open it in a graphical viewer)
print(chunked)




#### 4.2 **Preparing the Data**
   - **Chunked Corpora**:
     - Chunked corpora are collections of text that have been manually segmented and labeled into syntactic units.
     - **Examples**: CoNLL-2000 Corpus, Penn Treebank.
     - **Importance**:
       - Provides labeled data necessary for training and evaluating chunkers.
       - Helps in developing chunkers that generalize across different domains.
     - **Example Code**:


In [None]:
from nltk.corpus import conll2000

# Download the CoNLL-2000 dataset if not already downloaded
nltk.download('conll2000')

# Load the chunked sentences from the training set of the CoNLL-2000 corpus
train_sents = conll2000.chunked_sents('train.txt')

# Print the first chunked sentence from the training data
print(train_sents[0])
# Output: A tree structure with chunked phrases


(S
  (NP Confidence/NN)
  (PP in/IN)
  (NP the/DT pound/NN)
  (VP is/VBZ widely/RB expected/VBN to/TO take/VB)
  (NP another/DT sharp/JJ dive/NN)
  if/IN
  (NP trade/NN figures/NNS)
  (PP for/IN)
  (NP September/NNP)
  ,/,
  due/JJ
  (PP for/IN)
  (NP release/NN)
  (NP tomorrow/NN)
  ,/,
  (VP fail/VB to/TO show/VB)
  (NP a/DT substantial/JJ improvement/NN)
  (PP from/IN)
  (NP July/NNP and/CC August/NNP)
  (NP 's/POS near-record/JJ deficits/NNS)
  ./.)




   - **IOB Format**:
     - **Definition**: IOB (Inside, Outside, Beginning) format is a tagging format used to denote the start, continuation, or absence of a chunk.
     - **Structure**:
       - **B**: Indicates the beginning of a chunk; the chunk type follows as a suffix (e.g., B-NP).
       - **I**: Indicates that the token is inside a chunk and continues it (e.g., I-NP).
       - **O**: Indicates that the token is outside any chunk.
     - **Benefits**:
       - Standardized format for encoding chunks.
       - Facilitates training classifier-based chunkers.
     - **Example Code**:


In [None]:
from nltk import conlltags2tree, tree2conlltags
from nltk.corpus import conll2000

# Load the first chunked sentence from the CoNLL-2000 training data
sentence = conll2000.chunked_sents('train.txt')[0]

# Convert the chunked sentence into IOB format
# This function transforms a chunk tree into a list of (word, POS tag, chunk tag) tuples
iob_tagged = tree2conlltags(sentence)

# Print the first 10 tokens in IOB format
# The output will show tuples with the structure (word, POS tag, IOB tag)
print(iob_tagged[:10])  # Display the first 10 tokens


[('Confidence', 'NN', 'B-NP'), ('in', 'IN', 'B-PP'), ('the', 'DT', 'B-NP'), ('pound', 'NN', 'I-NP'), ('is', 'VBZ', 'B-VP'), ('widely', 'RB', 'I-VP'), ('expected', 'VBN', 'I-VP'), ('to', 'TO', 'I-VP'), ('take', 'VB', 'I-VP'), ('another', 'DT', 'B-NP')]
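
To make the B/I/O structure concrete, the following minimal sketch decodes a sequence of IOB tags back into chunk spans. The helper `iob_to_spans` and its `(chunk_type, start, end)` span format are illustrative assumptions, not part of NLTK; the tag sequence is the one printed above.

```python
def iob_to_spans(iob_tags):
    """Convert a list of IOB tags into (chunk_type, start, end) spans, end exclusive."""
    spans = []
    start, chunk_type = None, None
    for i, tag in enumerate(iob_tags):
        # The open chunk continues only on an I- tag of the same type
        continues = tag.startswith("I-") and tag[2:] == chunk_type
        if chunk_type is not None and not continues:
            spans.append((chunk_type, start, i))
            start, chunk_type = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and chunk_type is None):
            # A stray I- tag with no open chunk is treated as starting a new chunk
            start, chunk_type = i, tag[2:]
    if chunk_type is not None:
        spans.append((chunk_type, start, len(iob_tags)))
    return spans


# Chunk tags of the first 10 tokens shown above
tags = ["B-NP", "B-PP", "B-NP", "I-NP", "B-VP", "I-VP", "I-VP", "I-VP", "I-VP", "B-NP"]
spans = iob_to_spans(tags)
print(spans)
# [('NP', 0, 1), ('PP', 1, 2), ('NP', 2, 4), ('VP', 4, 9), ('NP', 9, 10)]
```

Each span covers the token positions of one chunk, so `('VP', 4, 9)` corresponds to "is widely expected to take".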


#### 4.3 **Baseline Chunking Approaches**


   - **4.3.1 Unigram Chunkers**:
     - **Definition**: A unigram chunker assigns a chunk tag to each word based solely on the word’s POS tag.
     - **Advantages**:
       - Simple to implement.
       - Useful as a baseline for evaluating more complex models.
     - **Disadvantages**:
       - Does not consider context, resulting in reduced accuracy for complex structures.
     - **Example Code**:


In [None]:
import nltk
from nltk.corpus import conll2000

# Define a UnigramChunker class that inherits from ChunkParserI
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Prepare training data in IOB format for the UnigramTagger
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)] for sent in train_sents]
        # Initialize the UnigramTagger with the prepared training data
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        # Extract POS tags from the sentence
        pos_tags = [pos for (word, pos) in sentence]
        # Use the trained UnigramTagger to predict chunk tags based on the POS tags
        tagged_pos_tags = self.tagger.tag(pos_tags)
        # Combine words, POS tags, and chunk predictions into the format expected by conlltags2tree
        conlltags = [(word, pos, chunk) for ((word, pos), (pos, chunk)) in zip(sentence, tagged_pos_tags)]
        # Convert the tagged sentence into a chunk tree and return it
        return nltk.chunk.conlltags2tree(conlltags)

# Load training sentences from the CoNLL-2000 chunking corpus
train_sents = conll2000.chunked_sents('train.txt')
# Initialize the UnigramChunker with the training data
unigram_chunker = UnigramChunker(train_sents)

# Load test sentences from the CoNLL-2000 chunking corpus
test_sents = conll2000.chunked_sents('test.txt')
# Evaluate the performance of the Unigram chunker on the test set and print the results
# accuracy() replaces the deprecated evaluate() and returns the same ChunkScore
print(unigram_chunker.accuracy(test_sents))


ChunkParse score:
    IOB Accuracy:  86.5%%
    Precision:     74.3%%
    Recall:        86.4%%
    F-Measure:     79.9%%
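
What the unigram approach learns can be illustrated without NLTK: it is essentially a lookup table from each POS tag to the chunk tag it most often carries. The sketch below uses a tiny hand-made training set (hypothetical data, not from CoNLL-2000) and a hypothetical helper `train_unigram_table`.

```python
from collections import Counter, defaultdict


def train_unigram_table(tagged_sents):
    """Map each POS tag to the chunk tag it most often carries in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for pos, chunk in sent:
            counts[pos][chunk] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


# Toy training data (hypothetical): sentences as (POS tag, chunk tag) pairs
toy_train = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBZ", "B-VP"), ("DT", "B-NP"), ("NN", "I-NP")],
    [("NN", "B-NP"), ("VBZ", "B-VP"), ("IN", "B-PP"), ("DT", "B-NP"), ("NN", "I-NP")],
]
table = train_unigram_table(toy_train)

# Tag a new POS sequence; POS tags never seen in training get None
predicted = [table.get(pos) for pos in ["NN", "VBZ", "DT", "NN", "CD"]]
print(predicted)
# ['I-NP', 'B-VP', 'B-NP', 'I-NP', None]
```

The sentence-initial NN is tagged I-NP because that is its most frequent tag overall, which shows the kind of error a context-free chunker cannot avoid.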


   - **4.3.2 Bigram Chunkers**:
     - **Definition**: A bigram chunker assigns each chunk tag from the current POS tag together with the chunk tag predicted for the previous token.
     - **Advantages**:
       - More accurate than unigram chunkers by accounting for limited context.
     - **Disadvantages**:
       - Increased complexity compared to unigram.
       - Still limited by the narrow context window.
     - **Example Code**:


In [None]:
import nltk
from nltk.corpus import conll2000

# Define a BigramChunker class that inherits from ChunkParserI
class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Prepare training data in IOB format for the BigramTagger
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)] for sent in train_sents]
        # Initialize the BigramTagger with the prepared training data
        self.tagger = nltk.BigramTagger(train_data)

    def parse(self, sentence):
        # Extract POS tags from the sentence
        pos_tags = [pos for (word, pos) in sentence]
        # Use the trained BigramTagger to predict chunk tags based on the POS tags
        tagged_pos_tags = self.tagger.tag(pos_tags)
        # Combine words, POS tags, and chunk predictions into the format expected by conlltags2tree
        conlltags = [(word, pos, chunk) for ((word, pos), (pos, chunk)) in zip(sentence, tagged_pos_tags)]
        # Convert the tagged sentence into a chunk tree and return it
        return nltk.chunk.conlltags2tree(conlltags)

# Load training sentences from the CoNLL-2000 chunking corpus
train_sents = conll2000.chunked_sents('train.txt')
# Initialize the BigramChunker with the training data
bigram_chunker = BigramChunker(train_sents)

# Load test sentences from the CoNLL-2000 chunking corpus
test_sents = conll2000.chunked_sents('test.txt')
# Evaluate the performance of the Bigram chunker on the test set and print the results
# accuracy() replaces the deprecated evaluate() and returns the same ChunkScore
print(bigram_chunker.accuracy(test_sents))


ChunkParse score:
    IOB Accuracy:  89.3%%
    Precision:     81.2%%
    Recall:        86.2%%
    F-Measure:     83.6%%
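
A bigram table is sparse: any context not seen in training has no entry. A common remedy is to back off to the unigram table, and NLTK's n-gram taggers accept a `backoff` argument for this purpose. The pure-Python sketch below illustrates the idea, assuming the context is the previous chunk tag plus the current POS tag; the helper names and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def most_frequent(table):
    """Reduce a table of counters to the most frequent chunk tag per key."""
    return {key: counter.most_common(1)[0][0] for key, counter in table.items()}


def train_tables(tagged_sents):
    """Build unigram (POS -> chunk) and bigram ((previous chunk, POS) -> chunk) tables."""
    uni, bi = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev_chunk = None  # None marks the start of a sentence
        for pos, chunk in sent:
            uni[pos][chunk] += 1
            bi[(prev_chunk, pos)][chunk] += 1
            prev_chunk = chunk
    return most_frequent(uni), most_frequent(bi)


def tag_with_backoff(pos_tags, uni, bi):
    """Prefer the bigram table; fall back to the unigram table for unseen contexts."""
    predicted, prev_chunk = [], None
    for pos in pos_tags:
        chunk = bi.get((prev_chunk, pos), uni.get(pos))
        predicted.append(chunk)
        prev_chunk = chunk
    return predicted


# Toy training data (hypothetical): sentences as (POS tag, chunk tag) pairs
toy_train = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBZ", "B-VP"), ("DT", "B-NP"), ("NN", "I-NP")],
    [("NN", "B-NP"), ("VBZ", "B-VP"), ("IN", "B-PP"), ("DT", "B-NP"), ("NN", "I-NP")],
]
uni, bi = train_tables(toy_train)

print(tag_with_backoff(["NN", "VBZ", "DT", "NN"], uni, bi))
# ['B-NP', 'B-VP', 'B-NP', 'I-NP']
print(tag_with_backoff(["DT", "NN", "IN"], uni, bi))
# ['B-NP', 'I-NP', 'B-PP']  (the IN after I-NP is resolved by the unigram table)
```

With the previous chunk tag as context, a sentence-initial NN is tagged B-NP, while unseen contexts still receive a tag through the fallback.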


#### 4.4 **Classifier-Based Chunking**
   - **4.4.1 Concept**:
     - Chunking as a classification task involves training a model to assign chunk tags (e.g., IOB tags) based on features.
     - **Features**:
       - POS tags, word identity, neighboring word tags, position in sentence, prefixes, and suffixes.
     - **Classifier Types**:
       - Naive Bayes, Maximum Entropy, Decision Trees, Support Vector Machines.
     - **Benefits**:
       - Greater flexibility in incorporating features.
       - Higher accuracy by leveraging richer contextual information.
     - **Example Code**:


In [None]:
import nltk
from nltk.chunk.util import conlltags2tree, tree2conlltags
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Feature extraction function for chunking
def chunk_features(sentence, index, history):
    """
    Extracts features for the chunking model.

    Parameters:
    - sentence: A list of (word, POS) or (word, POS, chunk_tag) tuples for the sentence.
    - index: The index of the current word to extract features for.
    - history: The list of previous chunk tags (not used in this implementation).

    Returns:
    - A dictionary containing features for the word at the given index.
    """
    # Training tokens are (word, POS, chunk_tag); prediction-time tokens are (word, POS)
    word, pos = sentence[index][:2]

    features = {
        'pos': pos,  # Part of speech tag of the current word
        'word': word,  # Word itself
        # POS tag of the previous word ('' at the start of the sentence)
        'prev_pos': sentence[index - 1][1] if index > 0 else '',
        # POS tag of the next word ('' at the end of the sentence)
        'next_pos': sentence[index + 1][1] if index < len(sentence) - 1 else '',
    }
    return features

In [None]:

# Extract features and labels from the CoNLL-2000 dataset for training
train_sents = conll2000.chunked_sents('train.txt')
train_data = []  # Feature dictionaries for each word
train_labels = []  # Corresponding chunk tags for each word

for sent in train_sents:
    iob_tags = tree2conlltags(sent)  # Convert chunk tree to IOB tags
    for index, (word, pos, chunk_tag) in enumerate(iob_tags):
        features = chunk_features(iob_tags, index, history=[]) # Pass iob_tags which contains (word, pos, chunk_tag) tuples
        train_data.append(features)
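        train_labels.append(chunk_tag)

# Vectorize the feature dictionaries into a sparse matrix for scikit-learn
vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform(train_data)
y_train = train_labels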


In [None]:

# Train a logistic regression classifier on the extracted features
clf = LogisticRegression(max_iter=10)  # max_iter=10 is too low for convergence; raise it (e.g., to 1000) for a properly trained model
clf.fit(X_train, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [None]:

# Example evaluation on a new test sentence
test_sentence = [("The", "DT"), ("cat", "NN"), ("sat", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
test_data = [chunk_features(test_sentence, i, []) for i in range(len(test_sentence))]
X_test = vectorizer.transform(test_data)
y_pred = clf.predict(X_test)

# Print the predicted chunk tags for each word in the test sentence
print(list(zip([word for word, _ in test_sentence], y_pred)))

[('The', 'B-NP'), ('cat', 'I-NP'), ('sat', 'B-VP'), ('on', 'B-PP'), ('the', 'B-NP'), ('mat', 'I-NP')]


#### 4.5 **Evaluation Metrics for Chunkers**


   - **4.5.1 ChunkParse Score**:
     - Measures chunker performance with chunk-level precision, recall, and F1-score, alongside per-token IOB accuracy.
     - **Precision**: Fraction of predicted chunks that exactly match a gold-standard chunk.
     - **Recall**: Fraction of gold-standard chunks that the chunker predicts exactly.
     - **F1-Score**: Harmonic mean of precision and recall, providing a balanced measure.
     - **Example Code**:


In [None]:
# Evaluating a chunker using NLTK's built-in evaluation
from nltk.corpus import conll2000

# Load the test sentences from the CoNLL-2000 chunking corpus
test_sents = conll2000.chunked_sents('test.txt')

# Assuming UnigramChunker class is defined and trained using the train_sents
unigram_chunker = UnigramChunker(train_sents)

# Evaluate the performance of the Unigram chunker on the test sentences
# accuracy() replaces the deprecated evaluate(); the returned ChunkScore reports
# IOB accuracy, precision, recall, and F-measure
print("Precision, Recall, and F1:", unigram_chunker.accuracy(test_sents))


Precision, Recall, and F1: ChunkParse score:
    IOB Accuracy:  86.5%%
    Precision:     74.3%%
    Recall:        86.4%%
    F-Measure:     79.9%%
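
The chunk-level metrics can be reproduced by hand from two sets of chunks. The sketch below assumes each chunk is represented as a `(chunk_type, start, end)` tuple, which is an illustrative choice (NLTK's `ChunkScore` keeps its own internal bookkeeping); `chunk_prf` is a hypothetical helper.

```python
def chunk_prf(gold_chunks, predicted_chunks):
    """Compute chunk-level precision, recall, and F1 from two collections of chunks."""
    gold, pred = set(gold_chunks), set(predicted_chunks)
    correct = len(gold & pred)  # Chunks with identical type and boundaries
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical chunks for a six-token sentence: (chunk_type, start, end)
gold = {("NP", 0, 2), ("VP", 2, 3), ("PP", 3, 4), ("NP", 4, 6)}
pred = {("NP", 0, 2), ("VP", 2, 3), ("NP", 3, 6)}

precision, recall, f1 = chunk_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
# P=0.667 R=0.500 F1=0.571
```

The predicted NP spanning positions 3 to 6 overlaps two gold chunks but matches neither exactly, so it counts against precision while both gold chunks count against recall.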


   - **4.5.2 Error Analysis**:
     - **Analyzing Incorrectly Chunked Tokens**:
       - Identify common errors and refine chunking rules or features to improve performance.
       - Helps to understand weaknesses in the chunker and guide subsequent iterations.
     - **Example Code**:


In [None]:
from nltk.chunk import ChunkScore
import nltk

# Initialize ChunkScore object to evaluate the chunker
chunk_score = ChunkScore()

# Iterate through each sentence in the test set
for sent in test_sents:
    # ChunkScore.score() compares chunk trees (not IOB tuples), gold standard first
    # sent.leaves() yields the (word, POS tag) pairs the chunker takes as input
    predicted = unigram_chunker.parse(sent.leaves())

    # Score the predicted tree against the gold-standard tree
    chunk_score.score(sent, predicted)

# Print a few missed chunks (gold chunks that the model did not find)
print("Missed Chunks:", chunk_score.missed()[:5])

# Print a few incorrect chunks (predicted chunks that are not in the gold standard)
print("Incorrect Chunks:", chunk_score.incorrect()[:5])


#### 4.6 **Creative Observations in Developing Chunkers**
   - **Importance of Feature Engineering**:
     - In classifier-based chunking, effective feature engineering can significantly improve model performance.
     - Features like prefixes, suffixes, and word shape (e.g., capitalization, hyphenation) often add value in recognizing chunk boundaries.
   
   - **Role of Contextual Embeddings**:
     - Contextual word embeddings (e.g., BERT) can be used to replace manual features in classifier-based chunkers.
     - These embeddings provide richer, context-dependent word representations, often improving chunking accuracy.

   - **Chunking Beyond Noun Phrases**:
     - Chunkers can be extended beyond noun phrases to identify verb phrases, prepositional phrases, and other types of syntactic structures.
     - Enables more detailed parsing of sentences, beneficial for downstream tasks like dependency parsing.

   - **Semi-Supervised Chunking**:
     - Incorporating unlabeled data along with labeled data can improve chunker performance by leveraging large amounts of unlabeled text.
     - Semi-supervised learning approaches like bootstrapping can help label new examples based on confidence scores.

   - **Handling Multi-Word and Nested Chunks**:
     - Chunking multi-word entities (e.g., named entities with multiple tokens) and handling nested chunks (e.g., an NP within a VP) remain challenging.
     - Recursive chunking methods and hierarchical tagging strategies can be employed to manage nested structures.

   - **Scalability and Computational Efficiency**:
     - When dealing with large datasets, scalability becomes crucial.
     - Efficient algorithms and use of parallel processing can help improve chunker performance and reduce processing time.
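
The bootstrapping idea from the semi-supervised bullet above can be sketched as a self-training loop: train on the labeled data, label the unlabeled pool, and keep only predictions above a confidence threshold. This is a toy illustration under strong assumptions: the "model" is a frequency table over POS tags, confidence is a relative frequency, and all names (`train`, `predict`, `self_train`) and data are hypothetical. A real chunker would use richer features and a probabilistic classifier.

```python
from collections import Counter, defaultdict


def train(examples):
    """Count how often each feature (here a POS tag) carries each chunk tag."""
    counts = defaultdict(Counter)
    for feature, label in examples:
        counts[feature][label] += 1
    return counts


def predict(counts, feature):
    """Return (best label, confidence), with confidence as a relative frequency."""
    if feature not in counts:
        return None, 0.0
    label, n = counts[feature].most_common(1)[0]
    return label, n / sum(counts[feature].values())


def self_train(labeled, unlabeled, threshold=0.9, rounds=3):
    """Move confidently labeled examples from the pool into the training set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        confident, remaining = [], []
        for feature in pool:
            label, confidence = predict(model, feature)
            if confidence >= threshold:
                confident.append((feature, label))
            else:
                remaining.append(feature)
        if not confident:
            break  # Nothing new was labeled in this round
        labeled.extend(confident)
        pool = remaining
    return train(labeled), pool


# Hypothetical labeled (POS tag, chunk tag) pairs and an unlabeled pool of POS tags
labeled = [("DT", "B-NP"), ("NN", "I-NP"), ("NN", "I-NP"), ("NN", "B-NP"), ("VBZ", "B-VP")]
unlabeled = ["DT", "NN", "VBZ", "CD"]

model, leftover = self_train(labeled, unlabeled)
print(leftover)
# ['NN', 'CD']
```

DT and VBZ are labeled with full confidence and absorbed into the training set, while the ambiguous NN and the unseen CD stay in the pool.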



#### 4.7 **Demonstration of Creative Observations**
   - **Feature Engineering for Classifier-Based Chunkers**:


In [None]:
# Adding word shape and suffix features to classifier-based chunker
def enhanced_chunk_features(sentence, index, history):
    """
    Extracts enhanced features for chunking, including word shape and suffix.

    Parameters:
    - sentence: A list of (word, POS, chunk_tag) tuples for the sentence.
    - index: The index of the current word to extract features for.
    - history: The list of previous chunk tags (not used in this implementation).

    Returns:
    - A dictionary containing enhanced features for the word at the given index.
    """
    word, pos, _ = sentence[index] # Unpack all 3 elements, ignoring the chunk tag with '_'
    features = {
        'pos': pos,  # Part of speech tag of the current word
        'word': word,  # The word itself
        'suffix': word[-3:],  # The last three characters of the word, helpful for suffix analysis
        'word_shape': 'capitalized' if word[0].isupper() else 'lowercase',  # Indicator if the word is capitalized or not
    }
    return features

# Rebuild the training data with the enhanced features (vectorize and fit the classifier as in 4.4)
train_data = []  # List to store training features
train_labels = []  # List to store corresponding chunk labels

# Extract enhanced features for each word in the training set
for sent in train_sents:
    # Convert the chunk tree to IOB format
    iob_tags = tree2conlltags(sent)

    # Iterate over each word in the sentence
    for index, (word, pos, chunk_tag) in enumerate(iob_tags):
        # Extract features using the enhanced feature function
        features = enhanced_chunk_features(iob_tags, index, history=[]) # Pass iob_tags to enhanced_chunk_features
        train_data.append(features)  # Append features to training data
        train_labels.append(chunk_tag)  # Append the corresponding chunk tag to training labels
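
The binary capitalized/lowercase indicator can be generalized into a coarser "word shape" that also captures digits and hyphenation. The sketch below is one possible encoding, not a standard NLTK feature; `word_shape` and `shape_features` are hypothetical helpers.

```python
import re


def word_shape(word):
    """Collapse a word into a coarse shape, e.g. 'September' -> 'Xx'."""
    shape = re.sub(r"[A-Z]", "X", word)   # Uppercase letters
    shape = re.sub(r"[a-z]", "x", shape)  # Lowercase letters
    shape = re.sub(r"[0-9]", "d", shape)  # Digits
    # Collapse runs of the same symbol so that long and short words share a shape
    return re.sub(r"(.)\1+", r"\1", shape)


def shape_features(sentence, index):
    """Build shape-oriented features for the token at the given index."""
    word, pos = sentence[index][:2]  # Works for (word, POS) and (word, POS, chunk_tag)
    return {
        'pos': pos,
        'word_shape': word_shape(word),
        'has_hyphen': '-' in word,
        'prefix': word[:3].lower(),
        'suffix': word[-3:].lower(),
    }


for token in ["September", "near-record", "1990s", "IBM"]:
    print(token, word_shape(token))
# September Xx
# near-record x-x
# 1990s dx
# IBM X
```

These dictionaries can be passed to a `DictVectorizer` in the same way as the other feature functions.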

   - **Contextual Embeddings for Chunking**:


In [None]:
# Using pre-trained BERT embeddings for feature extraction
from transformers import BertTokenizer, BertModel
import torch

# Load the pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Define a sample sentence to be processed by BERT
sentence = "The quick brown fox jumps over the lazy dog."

# Tokenize the sentence using the BERT tokenizer
# The tokenizer will convert the sentence to input IDs and create attention masks
inputs = tokenizer(sentence, return_tensors='pt')  # Return PyTorch tensors

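# Pass the tokenized input through the pre-trained BERT model
# torch.no_grad() skips gradient tracking, which feature extraction does not need
with torch.no_grad():
    outputs = model(**inputs)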

# Extract the last hidden state (contextual embeddings) for each token in the sentence
# 'last_hidden_state' contains the hidden states of the model for each token
embeddings = outputs.last_hidden_state

# Print the shape of the embeddings tensor
print(embeddings.shape)  # [batch_size, sequence_length, hidden_size]



torch.Size([1, 12, 768])
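
The 12 positions in this shape include the special [CLS] and [SEP] tokens, and in general BERT may split a word into several subword pieces, whereas chunk tags are assigned per word. One common way to bridge the gap is to average the vectors of the pieces belonging to each word. The sketch below shows that pooling step on plain Python lists; `pool_subwords` is a hypothetical helper, and the `word_ids` list (word index per position, `None` for special tokens) is written by hand here as an assumption about what the tokenizer alignment would provide.

```python
def pool_subwords(vectors, word_ids):
    """Average subword vectors that share a word index; skip special tokens (None)."""
    sums, counts = {}, {}
    for vec, wid in zip(vectors, word_ids):
        if wid is None:
            continue  # Special tokens such as [CLS] and [SEP] belong to no word
        if wid not in sums:
            sums[wid] = [0.0] * len(vec)
            counts[wid] = 0
        sums[wid] = [s + v for s, v in zip(sums[wid], vec)]
        counts[wid] += 1
    return [[s / counts[wid] for s in sums[wid]] for wid in sorted(sums)]


# Toy 2-dimensional "embeddings" for: [CLS], word 0, word 1 (two pieces), [SEP]
vectors = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
word_ids = [None, 0, 1, 1, None]

word_vectors = pool_subwords(vectors, word_ids)
print(word_vectors)
# [[1.0, 2.0], [4.0, 5.0]]
```

The result has one vector per word, which can then serve as the feature vector for predicting that word's chunk tag.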
