Ambiguities and complex grammatical structures are inherent in natural language, presenting significant challenges for natural language processing (NLP). This section discusses various types of ambiguities and complex cases encountered in chunking and named entity recognition, and provides strategies and techniques to handle these challenges effectively.



### 8.1 Types of Ambiguities in NLP



#### 8.1.1 Lexical Ambiguity

- **Definition**: Lexical ambiguity arises when a word has multiple meanings. For instance, the word "bank" can refer to a financial institution or the side of a river.
  - **Impact on Chunking**: Ambiguous words can lead to incorrect identification of entities or incorrect attachment to phrases.
  - **Handling Strategies**:
    - **Word Sense Disambiguation (WSD)**: Use algorithms like Lesk or machine learning approaches to identify the correct sense of the word based on context.
    - **Code Demonstration**: Use NLTK's Lesk algorithm to disambiguate the word "bank."


In [None]:
import nltk # import the nltk library
nltk.download('wordnet')

from nltk.wsd import lesk  # Importing the Lesk algorithm for Word Sense Disambiguation (WSD).
from nltk.tokenize import word_tokenize  # Importing a function to tokenize the sentence.

nltk.download('punkt')

# Example sentence with the word "bank", which has multiple meanings (e.g., a financial institution or the side of a river).
sentence = "He sat on the bank of the river."

# The ambiguous word "bank" to be disambiguated in the context of the sentence.
word = "bank"

# Using the Lesk algorithm to disambiguate the sense of the word "bank" based on the context of the sentence.
# Lesk picks the WordNet sense whose definition overlaps most with the context words.
# No `pos` argument is given, so verb senses of "bank" compete with the noun senses; pass pos='n' to consider nouns only.
sense = lesk(word_tokenize(sentence), word)

# Printing the definition of the chosen sense.
print(f"Sense: {sense.definition()}")



Sense: cover with ashes so to control the rate of burning
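
The overlap idea behind Lesk can be sketched without WordNet. The two glosses and the stop-word list below are hypothetical stand-ins written for this illustration, not WordNet entries, and `simple_lesk` is an assumed helper name; it picks the sense whose gloss shares the most content words with the sentence.

```python
import re

# Hypothetical glosses for illustration only (not taken from WordNet).
SENSES = {
    "bank.financial": "an institution that accepts deposits and lends money",
    "bank.river": "sloping land beside a river or other body of water",
}
STOPWORDS = {"a", "an", "the", "of", "on", "or", "and", "that", "he", "to"}

def content_words(text):
    """Lowercase the text, split on non-letters, and drop stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def simple_lesk(sentence, senses):
    """Return (sense, overlap) for the gloss sharing the most words with the sentence."""
    context = content_words(sentence)
    scores = {name: len(context & content_words(gloss)) for name, gloss in senses.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

print(simple_lesk("He sat on the bank of the river.", SENSES))
print(simple_lesk("He went to the bank to deposit money.", SENSES))
```

In both sentences a single shared word decides the outcome, which hints at why overlap-based disambiguation can be unreliable on short contexts.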




#### 8.1.2 Structural Ambiguity

- **Definition**: Structural ambiguity occurs when a sentence can be parsed in more than one way due to its grammatical structure. For example, "The man saw the boy with the telescope" could mean that the man used a telescope or the boy had a telescope.
  - **Impact on Chunking**: Structural ambiguity can lead to incorrect chunk boundaries, making it difficult to accurately identify the relationships between entities.
  - **Handling Strategies**:
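    - **Dependency Parsing**: Use a dependency parser to make the attachment explicit. The parser commits to one reading, which may or may not be the intended one, so its choice should be checked against context.
    - **Code Demonstration**: Use spaCy to print each token's head and dependency label.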


In [None]:
import spacy  # Importing the spaCy library for natural language processing (NLP).

# Loading the small English language model from spaCy, which includes dependency parsing, part-of-speech tagging, and more.
nlp = spacy.load("en_core_web_sm")

# Ambiguous sentence: "The man saw the boy with the telescope."
# This sentence is ambiguous because it could mean either:
# 1. The man used a telescope to see the boy.
# 2. The boy had the telescope, and the man saw him.
sentence = "The man saw the boy with the telescope."

# Processing the sentence using the spaCy model, which will annotate the sentence with linguistic features.
doc = nlp(sentence)

# Iterating over each token (word) in the processed sentence.
for token in doc:
    # Printing the token (word), its syntactic head (the word it depends on), and its dependency label (dep_).
    # The dependency label describes the grammatical relationship between the token and its head.
    print(f"Token: {token.text}, Head: {token.head.text}, Dependency: {token.dep_}")


Token: The, Head: man, Dependency: det
Token: man, Head: saw, Dependency: nsubj
Token: saw, Head: saw, Dependency: ROOT
Token: the, Head: boy, Dependency: det
Token: boy, Head: saw, Dependency: dobj
Token: with, Head: saw, Dependency: prep
Token: the, Head: telescope, Dependency: det
Token: telescope, Head: with, Dependency: pobj
Token: ., Head: saw, Dependency: punct


#### 8.1.3 Syntactic Ambiguity

- **Definition**: Syntactic ambiguity occurs when a phrase or sentence has multiple possible syntactic structures. An example is "old men and women," which could mean that both the men and the women are old, or that only the men are old.
  - **Impact on Chunking**: This ambiguity can lead to incorrect chunking of phrases, particularly when identifying noun phrases or adjective-noun combinations.
  - **Handling Strategies**:
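    - **POS Tagging with Context**: Use part-of-speech (POS) tags as input to context-specific rules that decide the structure. The tags alone identify the adjective, nouns, and conjunction but do not settle how far the adjective's scope extends.
    - **Code Demonstration**: Use NLTK to tag the parts of speech that such rules would operate on.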


In [None]:
import nltk  # Importing the NLTK (Natural Language Toolkit) library for tokenization and POS tagging.

nltk.download('averaged_perceptron_tagger')

# Ambiguous sentence: "old men and women".
# The sentence is ambiguous because "old" could modify either "men" alone or both "men and women."
sentence = "old men and women"

# Tokenizing the sentence into individual words.
tokens = nltk.word_tokenize(sentence)

# Assigning part-of-speech (POS) tags to each token in the sentence.
# The POS tag for each word will help determine its grammatical function (e.g., adjective, noun, conjunction).
pos_tags = nltk.pos_tag(tokens)

# Printing the list of tokens and their corresponding POS tags.
print(pos_tags)



[('old', 'JJ'), ('men', 'NNS'), ('and', 'CC'), ('women', 'NNS')]
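
POS tags alone do not say how far "old" reaches. A small sketch, assuming the tag sequence shown in the output above and a hypothetical helper named `coordination_readings`, can at least detect the adjective-noun-conjunction-noun pattern and list both bracketings so that a later rule (or a human) can choose between them.

```python
def coordination_readings(tagged):
    """For a JJ N CC N sequence, return both possible bracketings; otherwise an empty list."""
    if len(tagged) != 4:
        return []
    (adj, t1), (n1, t2), (conj, t3), (n2, t4) = tagged
    if t1 == "JJ" and t2.startswith("NN") and t3 == "CC" and t4.startswith("NN"):
        return [
            f"[{adj} [{n1} {conj} {n2}]]",  # wide scope: the adjective modifies both nouns
            f"[[{adj} {n1}] {conj} {n2}]",  # narrow scope: the adjective modifies the first noun only
        ]
    return []

# Tags copied from the NLTK output above.
pos_tags = [('old', 'JJ'), ('men', 'NNS'), ('and', 'CC'), ('women', 'NNS')]
for reading in coordination_readings(pos_tags):
    print(reading)
```

The function only enumerates the readings; choosing one still requires context or domain-specific rules.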


### 8.2 Handling Complex Cases in Chunking



#### 8.2.1 Nested Structures

- **Definition**: Nested structures involve phrases or entities embedded within one another, such as "[The president [of the United States]] announced a new policy."
  - **Impact on Chunking**: It becomes challenging to correctly identify boundaries when multiple layers of chunks are involved.
  - **Handling Strategies**:
    - **Recursive Chunking**: Use recursive chunking techniques to iteratively parse deeper levels of nested structures.
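    - **Code Demonstration**: Use NLTK's `RegexpParser` to chunk the noun phrase and its embedded prepositional phrase with a single flat pattern; deeper nesting needs a multi-stage (cascaded) grammar.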



In [None]:
import nltk  # Importing NLTK for tokenization, POS tagging, and chunking.

# Sample sentence containing a nested noun phrase: "The president of the United States".
nested_text = "The president of the United States announced a new policy."

# Tokenizing the sentence into individual words and assigning part-of-speech (POS) tags to each token.
tokens = nltk.pos_tag(nltk.word_tokenize(nested_text))

# Defining a grammar that chunks a noun phrase together with an attached prepositional phrase.
# - <DT>? allows for an optional determiner (like "The").
# - <NN.*>+ matches one or more nouns (common or proper, singular or plural), so "United States" stays together.
# - <IN> matches a preposition (like "of").
# The result is a single flat NP chunk such as "The president of the United States"; it is not a nested tree.
grammar_nested_np = "NP: {<DT>?<NN.*>+<IN><DT>?<NN.*>+}"

# Creating a RegexpParser object to chunk noun phrases based on the defined grammar.
chunker_nested_np = nltk.RegexpParser(grammar_nested_np)

# Parsing the tokenized sentence to chunk nested noun phrases.
nested_chunked = chunker_nested_np.parse(tokens)

# Printing the chunk tree, which shows the identified noun phrases.
print(nested_chunked)


(S
  (NP The/DT president/NN of/IN the/DT United/NNP States/NNPS)
  announced/VBD
  a/DT
  new/JJ
  policy/NN
  ./.)
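
A single chunk pattern produces one flat chunk; nesting requires several stages, each of which can refer to chunks built by the previous one. NLTK's `RegexpParser` accepts multi-stage grammars of this kind (and a `loop` argument to repeat the stages). The library-free sketch below only illustrates the mechanism: `chunk_stage`, `render`, and the three stage patterns are assumptions made for this example, and the tags are copied from the output above.

```python
import re

def chunk_stage(nodes, label, pattern):
    """Wrap every run of nodes whose tag string matches `pattern` in a new `label` node."""
    tags = "".join(f"<{tag}>" for tag, _ in nodes)
    # Map character offsets in the tag string back to node indices.
    offsets, pos = {}, 0
    for i, (tag, _) in enumerate(nodes):
        offsets[pos] = i
        pos += len(tag) + 2
    offsets[pos] = len(nodes)

    out, prev = [], 0
    for m in re.finditer(pattern, tags):
        start, end = offsets[m.start()], offsets[m.end()]
        out.extend(nodes[prev:start])
        out.append((label, nodes[start:end]))
        prev = end
    out.extend(nodes[prev:])
    return out

def render(node):
    """Format a leaf as word/TAG and a chunk as (LABEL child child ...)."""
    tag, body = node
    if isinstance(body, str):
        return f"{body}/{tag}"
    return f"({tag} " + " ".join(render(child) for child in body) + ")"

tagged = [("The", "DT"), ("president", "NN"), ("of", "IN"), ("the", "DT"),
          ("United", "NNP"), ("States", "NNPS"), ("announced", "VBD"),
          ("a", "DT"), ("new", "JJ"), ("policy", "NN"), (".", ".")]
nodes = [(tag, word) for word, tag in tagged]

# Each stage sees the chunks created by the stages before it.
STAGES = [
    ("NP", r"(<DT>)?(<JJ>)*(<NN[A-Z]*>)+"),  # base noun phrases
    ("PP", r"<IN><NP>"),                     # preposition + noun phrase
    ("NP", r"<NP><PP>"),                     # noun phrase with an attached PP
]
for label, pattern in STAGES:
    nodes = chunk_stage(nodes, label, pattern)

rendered = " ".join(render(n) for n in nodes)
print(rendered)
```

The third stage is what produces the nesting: it can only fire because the first two stages have already turned "The president" and "of the United States" into single nodes.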


#### 8.2.2 Overlapping Entities

- **Definition**: Overlapping entities occur when the same span of text could belong to multiple entity types or chunks. For example, "Washington" could be both a location and a person’s last name.
  - **Impact on Chunking**: Overlapping entities can lead to conflicting chunking decisions, reducing accuracy.
  - **Handling Strategies**:
    - **Hierarchical Tagging and Priority Rules**: Assign a priority to entity types and use hierarchical tagging to ensure the correct type is chosen.
    - **Code Demonstration**: Use spaCy to check which label the model assigns to "Washington"; priority rules come into play when candidate labels conflict.



In [None]:
# Assuming spaCy's language model is already loaded (e.g., nlp = spacy.load("en_core_web_sm")).
# The sentence "Washington is a great place to visit." can be ambiguous as "Washington" could refer to a person or a place.
doc = nlp("Washington is a great place to visit.")

# Iterating over all named entities in the processed document.
for ent in doc.ents:
    # Checking if the entity is labeled as either "GPE" (Geopolitical Entity, e.g., a city or country) or "PERSON" (a person's name).
    if ent.label_ in ["GPE", "PERSON"]:
        # Printing the text of the entity and its label (either GPE or PERSON).
        print(f"Entity: {ent.text}, Label: {ent.label_}")


Entity: Washington, Label: GPE
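
The priority-rule strategy itself can be sketched independently of any NER model. The candidate spans and the priority order below are hypothetical, chosen only to illustrate the idea; they are not spaCy output. When two candidates overlap, the one with the higher-priority label is kept.

```python
# Hypothetical priority order: labels listed earlier win when spans overlap.
PRIORITY = ["PERSON", "ORG", "GPE"]

def resolve_overlaps(candidates, priority=PRIORITY):
    """Keep the highest-priority candidate among overlapping (start, end, label) spans."""
    rank = {label: i for i, label in enumerate(priority)}
    # Sort by label priority first, then prefer longer spans.
    ordered = sorted(candidates, key=lambda c: (rank.get(c[2], len(rank)), -(c[1] - c[0])))
    kept = []
    for start, end, label in ordered:
        # Accept the span only if it does not overlap anything already kept.
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

# Hypothetical character spans in "George Washington visited Washington."
candidates = [
    (0, 17, "PERSON"),  # "George Washington"
    (7, 17, "GPE"),     # "Washington" inside the person name
    (26, 36, "GPE"),    # the second "Washington"
]
print(resolve_overlaps(candidates))
```

Here the inner `GPE` reading of the first "Washington" is discarded because it overlaps a `PERSON` span, while the second, non-overlapping `GPE` is kept.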


### 8.3 Techniques for Handling Ambiguity



#### 8.3.1 Contextual Embeddings

- **Concept**: Contextual embeddings, such as those produced by BERT, represent a word by a vector that depends on the sentence it appears in, so different senses of the same word receive different vectors and ambiguities become easier to resolve.
- **Application**: Using BERT embeddings to differentiate between meanings of the same word in different contexts.
- **Code Demonstration**: Use Hugging Face Transformers to extract contextual embeddings for an ambiguous word.


In [None]:
from transformers import BertTokenizer, BertModel  # Importing BERT tokenizer and model from the Hugging Face Transformers library.
import torch  # Importing PyTorch for tensor operations.

# Loading a pre-trained BERT tokenizer and model (uncased version, which means the text will be lowercased before tokenization).
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample sentence with ambiguity ("bank" could mean a financial institution or the side of a river).
sentence = "He went to the bank to deposit money."

# Tokenizing the input sentence. The tokenizer converts the text into token IDs, adds special tokens (CLS, SEP), and returns a PyTorch tensor.
inputs = tokenizer(sentence, return_tensors='pt')

# Passing the tokenized input through the pre-trained BERT model.
# The model outputs the last hidden states (contextual embeddings) for each token.
outputs = model(**inputs)

# Printing the last hidden states, which are the embeddings for each token in the sentence.
# Each token will have a corresponding embedding vector that captures its context in the sentence.
print(outputs.last_hidden_state)



tensor([[[ 0.0331,  0.2479, -0.0500,  ..., -0.0737,  0.2191,  0.4464],
         [ 0.7697,  0.1283, -0.0531,  ..., -0.0815,  1.0483, -0.1717],
         [ 0.3747, -0.4291,  0.2311,  ...,  0.0135, -0.1754,  0.3431],
         ...,
         [ 0.6070, -0.2980, -0.1447,  ...,  0.0301, -0.6997,  0.2486],
         [ 0.5998,  0.1889, -0.2550,  ...,  0.4067, -0.1178, -0.4125],
         [ 0.6390,  0.3418,  0.4444,  ...,  0.5424, -0.1573, -0.4498]]],
       grad_fn=<NativeLayerNormBackward0>)
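
A raw tensor does not disambiguate anything by itself; the usual next step is to compare the vector for "bank" across sentences, for example with cosine similarity. The sketch below shows only that comparison. The three 4-dimensional vectors are made-up placeholders, not BERT outputs (`bert-base-uncased` produces 768-dimensional vectors), and the example sentences in the comments are likewise assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Made-up stand-ins for the "bank" token vector in three different sentences.
bank_deposit = [0.9, 0.1, 0.3, 0.0]  # "... went to the bank to deposit money"
bank_loan = [0.8, 0.2, 0.4, 0.1]     # "... the bank approved the loan"
bank_river = [0.1, 0.9, 0.0, 0.4]    # "... sat on the bank of the river"

print(f"deposit vs loan:  {cosine(bank_deposit, bank_loan):.3f}")
print(f"deposit vs river: {cosine(bank_deposit, bank_river):.3f}")
```

With real embeddings, the vector for "bank" would be taken from the row of `outputs.last_hidden_state` at that token's position, and occurrences with the same sense would be expected to score closer to each other than to occurrences with a different sense.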


#### 8.3.2 Dependency Parsing

- **Concept**: Dependency parsing is used to determine the grammatical structure of a sentence and the relationships between words, which helps in resolving ambiguities related to word attachment.
- **Application**: Identifying which word a prepositional phrase attaches to, as in "She saw the man with a telescope," where "with a telescope" can modify either the verb "saw" or the noun "man."
- **Code Demonstration**: Use spaCy to show which attachment the parser chooses for the prepositional phrase.


In [None]:
# Assuming spaCy's language model (e.g., 'en_core_web_sm') has already been loaded (e.g., nlp = spacy.load("en_core_web_sm")).
# The sentence contains an ambiguous phrase: "with a telescope", which could modify "saw" (verb) or "man" (noun).
sentence = "She saw the man with a telescope."

# Processing the sentence using spaCy's NLP pipeline, which includes tokenization and dependency parsing.
doc = nlp(sentence)

# Iterating over each token (word) in the sentence.
for token in doc:
    # Printing the token's text, its syntactic head (the word it depends on), and the dependency relation (e.g., subject, object).
    print(f"Token: {token.text}, Head: {token.head.text}, Dependency: {token.dep_}")


Token: She, Head: saw, Dependency: nsubj
Token: saw, Head: saw, Dependency: ROOT
Token: the, Head: man, Dependency: det
Token: man, Head: saw, Dependency: dobj
Token: with, Head: man, Dependency: prep
Token: a, Head: telescope, Dependency: det
Token: telescope, Head: with, Dependency: pobj
Token: ., Head: saw, Dependency: punct
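
The parse above attaches "with" to "man", whereas the parse in section 8.1.2 attaches it to "saw". A short sketch can read that decision off the printed triples. The helper name `pp_attachments` is hypothetical, the triples are copied from the two outputs, and treating "head is the ROOT" as "verb attachment" is a simplification that happens to hold for these two sentences.

```python
def pp_attachments(parse):
    """Map each preposition to (head word, 'verb' or 'noun') from (token, head, dep) triples."""
    dep_of = {token: dep for token, _, dep in parse}
    result = {}
    for token, head, dep in parse:
        if dep == "prep":
            # Simplification: a preposition headed by the ROOT attaches to the main verb.
            kind = "verb" if dep_of.get(head) == "ROOT" else "noun"
            result[token] = (head, kind)
    return result

# "The man saw the boy with the telescope." (output in section 8.1.2)
parse_a = [("The", "man", "det"), ("man", "saw", "nsubj"), ("saw", "saw", "ROOT"),
           ("the", "boy", "det"), ("boy", "saw", "dobj"), ("with", "saw", "prep"),
           ("the", "telescope", "det"), ("telescope", "with", "pobj"), (".", "saw", "punct")]

# "She saw the man with a telescope." (output above)
parse_b = [("She", "saw", "nsubj"), ("saw", "saw", "ROOT"), ("the", "man", "det"),
           ("man", "saw", "dobj"), ("with", "man", "prep"), ("a", "telescope", "det"),
           ("telescope", "with", "pobj"), (".", "saw", "punct")]

print(pp_attachments(parse_a))
print(pp_attachments(parse_b))
```

Two nearly identical sentences receive different attachments, which shows that the parser resolves the ambiguity by committing to one reading rather than by proving which one was intended.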


### 8.4 Disambiguation Using External Knowledge Sources

- **Concept**: External knowledge sources, such as knowledge graphs (e.g., DBpedia, Wikidata), can be used to provide context and resolve ambiguities by linking entities to structured data.
- **Application**: Disambiguating entities like "Apple" by linking to either the fruit or the company based on context.
- **Code Demonstration**: Use an external API to link entities to a knowledge graph.


In [None]:
import requests  # Importing the requests library to make HTTP requests.

# Defining the entity to search for, in this case, "Apple".
entity = "Apple"

# Sending a GET request to the Wikidata API, searching for entities with the term "Apple".
# Passing the query via `params` lets requests URL-encode the search term.
# The API returns results in JSON format, and the search is conducted in English (`language=en`).
response = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "wbsearchentities", "search": entity, "language": "en", "format": "json"},
    timeout=10,
)
response.raise_for_status()  # Fail early on HTTP errors.

# Parsing the JSON response into a Python dictionary.
data = response.json()

# Iterating over each search result returned by the API.
# Each result has a 'label' (the name of the entity); 'description' can be missing, so .get() avoids a KeyError.
for item in data.get('search', []):
    print(f"Entity: {item['label']}, Description: {item.get('description', '(no description)')}")


Entity: Apple, Description: American multinational technology company based in Cupertino, California
Entity: Apple Records, Description: UK record label
Entity: apple, Description: fruit of the apple tree
Entity: App Store, Description: digital app distribution platform for iOS/iPadOS
Entity: Apple Music, Description: Internet online music service by Apple
Entity: Apple, Description: given name
Entity: Malus, Description: flowering genus in the rose family Rosaceae
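
The search only returns candidates; linking still requires choosing one of them from context. A minimal sketch, assuming the label/description pairs printed above and a hypothetical helper named `link_entity`, scores each candidate by how many content words its description shares with the sentence. The stop-word list and the two example sentences are assumptions made for this illustration.

```python
import re

# (label, description) pairs copied from the search output above.
CANDIDATES = [
    ("Apple", "American multinational technology company based in Cupertino, California"),
    ("Apple Records", "UK record label"),
    ("apple", "fruit of the apple tree"),
    ("App Store", "digital app distribution platform for iOS/iPadOS"),
    ("Apple Music", "Internet online music service by Apple"),
    ("Apple", "given name"),
    ("Malus", "flowering genus in the rose family Rosaceae"),
]
STOPWORDS = {"a", "an", "the", "of", "in", "by", "for", "is", "and", "from"}

def content_words(text):
    """Lowercase the text, split on non-letters, and drop stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def link_entity(context, candidates):
    """Return the candidate whose description overlaps most with the context."""
    ctx = content_words(context)
    return max(candidates, key=lambda c: len(ctx & content_words(c[1])))

print(link_entity("She picked a ripe apple from the tree and ate the fruit.", CANDIDATES))
print(link_entity("Apple is a technology company based in California.", CANDIDATES))
```

Word overlap is a crude signal; it stands in here for whatever context model a real entity linker would use.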


### 8.5 Ambiguity in Coreference Resolution

- **Definition**: Coreference ambiguity occurs when it is unclear which entity a pronoun or noun phrase refers to. For example, in "John told Mike he would help him," it is ambiguous who "he" refers to.
- **Handling Strategies**:
  - **Coreference Resolution Tools**: Use tools like spaCy or AllenNLP to resolve coreferences based on context.
  - **Code Demonstration**: Use spaCy to perform coreference resolution.
