- Information Extraction (IE) architecture is the structured framework used to process unstructured text and extract meaningful entities, relationships, and events.
- This architecture typically follows a pipeline approach, in which each stage is responsible for a specific task such as tokenization, part-of-speech tagging, named entity recognition, or relation extraction.
- Each stage plays a critical role in ensuring that the text is processed accurately and that relevant information is extracted effectively.



# Discussion

## 2.1 **Overview of the Pipeline**

- **Pipeline Definition**:
  - A sequence of stages or modules that sequentially process the text to extract structured information.
  - Each module is responsible for a specific NLP task (e.g., tokenization, tagging, entity recognition) and passes its output to the next stage.
  - This modularity allows for flexibility and scalability in handling different types of text and extraction tasks.

- **Main Stages in the Pipeline**:
  - **Sentence Segmentation**: Dividing text into individual sentences.
  - **Tokenization**: Splitting sentences into tokens (words or phrases).
  - **Part-of-Speech (POS) Tagging**: Labeling tokens with grammatical categories (e.g., noun, verb).
  - **Named Entity Recognition (NER)**: Identifying and classifying named entities (e.g., person names, locations).
  - **Relation Detection**: Extracting relationships between identified entities (e.g., "works at," "located in").

- **Modularity**:
  - Each stage in the pipeline is designed as a separate module, making the pipeline adaptable to various IE tasks and domains.
  - Pipelines can be customized to include additional stages such as co-reference resolution or event detection based on the use case.



- **Example Code (Simple Pipeline Implementation)**:



In [None]:
import nltk
import spacy
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag, ne_chunk
from nltk.tree import Tree

# Download necessary NLTK data
# 'punkt' is a tokenizer for breaking text into words and sentences.
nltk.download('punkt')
# 'averaged_perceptron_tagger' is used for part-of-speech (POS) tagging.
nltk.download('averaged_perceptron_tagger')
# 'maxent_ne_chunker' is used for Named Entity Recognition (NER) in NLTK.
nltk.download('maxent_ne_chunker')
# 'words' is a list of known English words needed for NER in NLTK.
nltk.download('words')

# Load the spaCy model for advanced NER and relation detection.
# 'en_core_web_sm' is a small English model that includes pre-trained NER capabilities.
nlp = spacy.load("en_core_web_sm")

# Sample text for demonstration
# The text contains a person's name, organization, location, and a date, useful for NER tasks.
text = "John Doe works at Google in Mountain View, California. He attended Stanford University in 2010."


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


## 2.2 **Sentence Segmentation**

- **Definition**:
  - The process of breaking down a text document into individual sentences.
  - Sentence segmentation is crucial for downstream tasks like tokenization, part-of-speech tagging, and relation extraction, as many NLP tasks operate at the sentence level.

- **Challenges**:
  - Handling punctuation marks that do not indicate sentence boundaries (e.g., periods in abbreviations like "Dr." or "U.S.").
  - Dealing with quotations, parentheses, and other special symbols that complicate sentence boundaries.

- **Techniques**:
  - **Rule-Based Approaches**: Using predefined rules to split text based on punctuation and capitalization.
  - **Statistical Methods**: Utilizing machine learning models trained on labeled sentence boundaries to predict the start and end of sentences.

- **Creative Observations**:
  - Multi-lingual sentence segmentation: Sentence boundaries vary significantly across languages, requiring language-specific models or rules for effective segmentation.
  - In the context of informal or social media text, sentence segmentation becomes more difficult due to inconsistent punctuation.

- **Example Code (Sentence Segmentation)**:


In [None]:
def sentence_segmentation(text):
    # Use NLTK's sent_tokenize to split the input text into sentences
    sentences = sent_tokenize(text)
    return sentences

# Example usage
# Call the sentence_segmentation function and pass the input text to it
sentences = sentence_segmentation(text)

# Print the segmented sentences
print("Sentence Segmentation:", sentences)


Sentence Segmentation: ['John Doe works at Google in Mountain View, California.', 'He attended Stanford University in 2010.']
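
The rule-based approach listed under Techniques can be sketched without any library. This is a minimal illustration, not how NLTK's tokenizer works: the `ABBREVIATIONS` set and the `rule_based_split` function are hypothetical names introduced here, and the abbreviation list is far from complete.

```python
# Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "u.s.", "e.g.", "i.e."}

def rule_based_split(text):
    """Split after '.', '?' or '!' unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".?!" and token.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # flush trailing text that has no terminal punctuation
        sentences.append(" ".join(current))
    return sentences

demo = "Dr. Smith lives in Mountain View. He joined Google in 2010!"
print(rule_based_split(demo))
```

A known weakness of this kind of rule is that an abbreviation at the very end of a sentence (for example "... the U.S. He ...") suppresses a split that should happen, which is one reason statistical models are often preferred.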


## 2.3 **Tokenization**

- **Definition**:
  - Tokenization is the process of splitting a sentence into individual words, subwords, or symbols, known as tokens.
  - It is one of the first steps in most NLP pipelines, since it provides the basic units of text (tokens) that later stages process.

- **Challenges**:
  - Handling compound words, contractions, and multi-word expressions (e.g., "New York" vs. "New" and "York").
  - Tokenizing languages with different writing systems (e.g., Chinese characters vs. English words).

- **Techniques**:
  - **Word Tokenization**: Splitting sentences based on spaces and punctuation.
  - **Subword Tokenization**: Breaking words into smaller meaningful units (e.g., WordPiece in BERT or Byte-Pair Encoding in GPT-2).
  - **Custom Tokenization**: Developing domain-specific tokenizers for technical or specialized texts.

- **Creative Observations**:
  - Subword tokenization, as used in models like BERT, allows for better handling of rare or out-of-vocabulary words by decomposing them into more frequent subunits.
  - Tokenization for social media data often requires special handling of hashtags, mentions, and emoticons.

- **Example Code (Word Tokenization)**:


In [None]:
def tokenize(sentences):
    # Tokenize each sentence using NLTK's word_tokenize
    # word_tokenize splits sentences into individual words (tokens)
    tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
    return tokenized_sentences

# Example usage
# Call the tokenize function to tokenize each sentence
tokenized_sentences = tokenize(sentences)

# Print the list of tokenized sentences (each sentence is a list of words)
print("Tokenization:", tokenized_sentences)


Tokenization: [['John', 'Doe', 'works', 'at', 'Google', 'in', 'Mountain', 'View', ',', 'California', '.'], ['He', 'attended', 'Stanford', 'University', 'in', '2010', '.']]
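
Subword tokenization, mentioned under Techniques, can be illustrated with a greedy longest-match segmenter in the style of WordPiece. This is a simplified sketch under stated assumptions: `wordpiece_tokenize` and `toy_vocab` are hypothetical, the vocabulary is hand-written rather than learned, and real tokenizers add normalization and vocabulary training steps that are not shown.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, assumed for illustration only.
toy_vocab = {"token", "##ization", "##ize", "un", "##known", "##s"}
print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("unknowns", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

The second call shows how a word absent from the vocabulary is still covered by more frequent subunits.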


## 2.4 **Part-of-Speech (POS) Tagging**

- **Definition**:
  - POS tagging is the process of assigning a part-of-speech label (e.g., noun, verb, adjective) to each token in a sentence.
  - POS tags provide syntactic information that is valuable for subsequent NLP tasks such as entity recognition and parsing.

- **Challenges**:
  - Words that can have multiple POS tags depending on context (e.g., "run" can be a noun or a verb).
  - Dealing with ambiguous or complex sentence structures.

- **Techniques**:
  - **Rule-Based Tagging**: Using predefined grammar rules to assign POS tags.
  - **Statistical Tagging**: Utilizing machine learning models (e.g., Hidden Markov Models, Conditional Random Fields) trained on labeled POS data.

- **Creative Observations**:
  - POS tagging is language-dependent; for morphologically rich languages (e.g., Finnish), POS tagging requires more sophisticated models to capture inflection and case.
  - POS tagging in informal text (e.g., tweets) requires handling of slang, abbreviations, and incomplete sentences.

- **Example Code (POS Tagging)**:


In [None]:
def pos_tagging(tokenized_sentences):
    # Perform Part-of-Speech (POS) tagging on each tokenized sentence using NLTK's pos_tag
    # pos_tag assigns POS tags to each word (e.g., noun, verb, adjective)
    pos_tagged_sentences = [pos_tag(sentence) for sentence in tokenized_sentences]
    return pos_tagged_sentences

# Example usage
# Call the pos_tagging function to apply POS tagging to each tokenized sentence
pos_tagged_sentences = pos_tagging(tokenized_sentences)

# Print the POS-tagged sentences (each sentence is a list of tuples where each tuple contains a word and its POS tag)
print("POS Tagging:", pos_tagged_sentences)


POS Tagging: [[('John', 'NNP'), ('Doe', 'NNP'), ('works', 'VBZ'), ('at', 'IN'), ('Google', 'NNP'), ('in', 'IN'), ('Mountain', 'NNP'), ('View', 'NNP'), (',', ','), ('California', 'NNP'), ('.', '.')], [('He', 'PRP'), ('attended', 'VBD'), ('Stanford', 'NNP'), ('University', 'NNP'), ('in', 'IN'), ('2010', 'CD'), ('.', '.')]]
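
For contrast with the statistical tagger above, rule-based tagging can be sketched as an ordered list of regular-expression rules. The rules and the `rule_based_tag` function are illustrative assumptions, not a recommended rule set.

```python
import re

# Ordered (pattern, tag) rules; the first matching rule wins. Illustrative only.
SUFFIX_RULES = [
    (r"^\d+(\.\d+)?$", "CD"),  # cardinal numbers
    (r".*ing$", "VBG"),        # gerunds
    (r".*ed$", "VBD"),         # past-tense verbs
    (r".*ly$", "RB"),          # adverbs
    (r"^[A-Z].*$", "NNP"),     # capitalized words treated as proper nouns
    (r".*s$", "NNS"),          # plural nouns
]

def rule_based_tag(tokens, default="NN"):
    """Tag each token with the first matching rule, else the default tag."""
    tagged = []
    for token in tokens:
        for pattern, tag in SUFFIX_RULES:
            if re.match(pattern, token):
                tagged.append((token, tag))
                break
        else:
            tagged.append((token, default))
    return tagged

print(rule_based_tag(["He", "attended", "Stanford", "University", "in", "2010"]))
```

With these rules "He" comes out as NNP and "in" as NN, both wrong. Surface rules alone cannot capture closed-class words or context, which is the gap statistical taggers fill.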


## 2.5 **Named Entity Recognition (NER)**

- **Definition**:
  - NER is the task of identifying and classifying named entities in text into predefined categories such as persons, organizations, locations, dates, etc.
  - It is essential for extracting structured information about the key entities mentioned in the text.

- **Challenges**:
  - Disambiguation of entity types (e.g., "Washington" as a person, location, or organization).
  - Handling multi-word named entities (e.g., "New York City" vs. "York").
  - Domain adaptation, where entity types may vary based on the field (e.g., biological entities vs. company names).

- **Techniques**:
  - **Dictionary-Based Methods**: Matching tokens to a predefined dictionary of known entities.
  - **Statistical Models**: Using supervised machine learning models (e.g., CRFs, BiLSTM-CRF) trained on annotated NER corpora.
  - **Neural Networks**: Utilizing deep learning techniques like transformers (e.g., BERT) for state-of-the-art NER performance.

- **Creative Observations**:
  - Pre-trained models like BERT offer improved performance by leveraging contextual embeddings, enabling the model to better capture ambiguous and rare entities.
  - NER in multi-lingual or domain-specific contexts (e.g., medical or legal domains) requires specialized training data to achieve high accuracy.

- **Example Code (NER using NLTK)**:


In [None]:
def ner_nltk(pos_tagged_sentences):
    # Perform Named Entity Recognition (NER) using NLTK's ne_chunk
    # ne_chunk takes POS-tagged sentences and returns a tree structure where named entities are identified
    named_entities = [ne_chunk(sentence) for sentence in pos_tagged_sentences]
    return named_entities

# Example usage
# Call the ner_nltk function to extract named entities from POS-tagged sentences
ner_entities_nltk = ner_nltk(pos_tagged_sentences)

# Print the named entities recognized by NLTK (in the form of a tree structure for each sentence)
print("NER (NLTK):", ner_entities_nltk)


NER (NLTK): [Tree('S', [Tree('PERSON', [('John', 'NNP')]), Tree('ORGANIZATION', [('Doe', 'NNP')]), ('works', 'VBZ'), ('at', 'IN'), Tree('ORGANIZATION', [('Google', 'NNP')]), ('in', 'IN'), Tree('GPE', [('Mountain', 'NNP'), ('View', 'NNP')]), (',', ','), Tree('GPE', [('California', 'NNP')]), ('.', '.')]), Tree('S', [('He', 'PRP'), ('attended', 'VBD'), Tree('ORGANIZATION', [('Stanford', 'NNP'), ('University', 'NNP')]), ('in', 'IN'), ('2010', 'CD'), ('.', '.')])]


- **Example Code (NER with SpaCy)**:


In [None]:
def ner_spacy(text):
    # Process the input text with spaCy's NLP pipeline
    # The pipeline will perform tokenization, POS tagging, and Named Entity Recognition (NER)
    doc = nlp(text)

    # Extract named entities from the processed text
    # For each entity, return its text and label (e.g., PERSON, ORG, GPE)
    return [(ent.text, ent.label_) for ent in doc.ents]

# Example usage
# Call the ner_spacy function to extract named entities from the input text using spaCy
ner_entities_spacy = ner_spacy(text)

# Print the named entities recognized by spaCy (as tuples of entity text and entity label)
print("NER (spaCy):", ner_entities_spacy)


NER (spaCy): [('John', 'PERSON'), ('Google', 'ORG'), ('Mountain View', 'GPE'), ('California', 'GPE'), ('Stanford University', 'ORG'), ('2010', 'DATE')]
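
The dictionary-based method listed under Techniques can be sketched as a longest-match lookup against a gazetteer. The `GAZETTEER` contents and the `dictionary_ner` function are hypothetical and exist only for this illustration.

```python
# Hypothetical gazetteer: lowercase token tuples mapped to entity labels.
GAZETTEER = {
    ("google",): "ORG",
    ("stanford", "university"): "ORG",
    ("mountain", "view"): "GPE",
    ("california",): "GPE",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def dictionary_ner(tokens):
    """Scan left to right, preferring the longest gazetteer match at each position."""
    entities, i = [], 0
    while i < len(tokens):
        for length in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + length])
            if key in GAZETTEER:
                entities.append((" ".join(tokens[i:i + length]), GAZETTEER[key]))
                i += length
                break
        else:
            i += 1  # no entity starts here; move to the next token
    return entities

demo_tokens = ["John", "Doe", "works", "at", "Google", "in",
               "Mountain", "View", ",", "California", "."]
print(dictionary_ner(demo_tokens))
```

Preferring the longest match keeps "Mountain View" together as one entity. The approach finds nothing for "John Doe" because the name is not in the dictionary, which illustrates its coverage limits.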


## 2.6 **Relation Detection**

- **Definition**:
  - Relation detection involves identifying relationships between the extracted entities (e.g., "works for," "located in").
  - The goal is to link named entities through their interactions or roles in a sentence.

- **Challenges**:
  - Ambiguity in identifying the correct relation (e.g., "John works at Google" vs. "John visited Google").
  - Capturing implicit or complex relationships that are not explicitly stated in the text.

- **Techniques**:
  - **Rule-Based Approaches**: Using regular expressions and syntactic patterns to identify relations between entities.
  - **Statistical Approaches**: Using machine learning models trained on relation-annotated corpora to predict relationships.
  - **Neural Approaches**: Leveraging deep learning models (e.g., dependency parsing combined with transformers) to extract complex relations.

- **Creative Observations**:
  - Relation extraction in domain-specific texts, such as legal or biomedical documents, requires specialized knowledge and relation schemas to ensure high accuracy.
  - Pre-trained language models can significantly improve relation extraction by capturing the complex, context-dependent interactions between entities.

- **Example Code (Relation Detection)**:


Approach 1: rule-based pattern matching with a regular expression

In [None]:
# Simplified rule-based relation extraction
import re

# Sample text for relation extraction
text = "Alice works at Google in Mountain View."

# Define a regex pattern to extract person, organization, and location
# The pattern uses named capturing groups: (?P<name>pattern)
# - (?P<person>[A-Z][a-z]+): Captures a person with a capitalized first name
# - (?P<organization>[A-Z][a-z]+): Captures an organization with a capitalized name
# - (?P<location>[A-Z][a-z]+ [A-Z][a-z]+): Captures a two-word location (both words capitalized)
pattern = r"(?P<person>[A-Z][a-z]+) works at (?P<organization>[A-Z][a-z]+) in (?P<location>[A-Z][a-z]+ [A-Z][a-z]+)"

# Search for the pattern in the text
match = re.search(pattern, text)

# If a match is found, print the named groups as a dictionary
if match:
    print(match.groupdict())


{'person': 'Alice', 'organization': 'Google', 'location': 'Mountain View'}


Approach 2: dependency-based extraction with spaCy

In [None]:
def relation_detection_spacy(text):
    # Process the input text with spaCy's NLP pipeline
    doc = nlp(text)

    relations = []  # Initialize an empty list to store detected relations

    # Iterate over the named entities in the document
    for ent in doc.ents:
        # Check if the entity is a person (labeled "PERSON" by spaCy's NER)
        if ent.label_ == "PERSON":
            # Check the syntactic dependencies of the person's root word
            # Look for a preposition (prep) that is related to the verb "work"
            for token in ent.root.head.children:
                if token.dep_ == "prep" and token.head.lemma_ == "work":
                    # Find the organization (ORG) entity that appears after the person entity
                    org = [ent2 for ent2 in doc.ents if ent2.start > ent.start and ent2.label_ == "ORG"]
                    # If an organization is found, add the relation (person, verb, organization) to the relations list
                    if org:
                        relations.append((ent.text, token.head.lemma_, org[0].text))

    return relations  # Return the list of detected relations

# Example usage
relations = relation_detection_spacy(text)
print("Relations Detected:", relations)


Relations Detected: [('Alice', 'work', 'Google')]
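
The rule-based idea extends naturally from one pattern to a small table of patterns, one per relation type. The sketch below is an assumption-laden illustration: the relation names, the `RELATION_PATTERNS` table, and `extract_relations` are hypothetical, and the capitalized-word pattern for names is a crude stand-in for real entity recognition.

```python
import re

# One or more capitalized words, used here as a rough proxy for an entity mention.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hypothetical relation table: relation label mapped to a compiled surface pattern.
RELATION_PATTERNS = [
    ("works_at", re.compile(rf"(?P<subj>{NAME}) works at (?P<obj>{NAME})")),
    ("located_in", re.compile(rf"(?P<subj>{NAME}) is located in (?P<obj>{NAME})")),
]

def extract_relations(text):
    """Return (subject, relation, object) triples for every pattern match."""
    triples = []
    for label, pattern in RELATION_PATTERNS:
        for match in pattern.finditer(text):
            triples.append((match.group("subj"), label, match.group("obj")))
    return triples

sample = "Alice works at Google. Google is located in Mountain View."
print(extract_relations(sample))
```

Each new relation type costs one more pattern, which keeps the approach transparent but makes it brittle when the same relation is phrased differently.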


## 2.7 **Modularity and Scalability of the Pipeline**

- **Modular Design**:
  - Each stage in the pipeline is modular, allowing for easy replacement, enhancement, or addition of stages based on the specific requirements of the task or domain.
  - For example, a pipeline for legal text extraction may include additional steps like case identification or contract clause extraction.

- **Scalability**:
  - The architecture should be scalable to handle large datasets or real-time data streams.
  - Parallelization and distributed computing frameworks (e.g., Apache Spark, Hadoop) can be used to scale the extraction pipeline for massive text corpora.

- **Creative Observations**:
  - A modular pipeline can incorporate domain adaptation mechanisms, allowing the architecture to switch between general-purpose and domain-specific models based on the input text.
  - Integration with knowledge graphs (e.g., Wikidata, DBpedia) enhances the scalability of relation extraction by linking extracted entities to known relationships in structured datasets.

- **Example Code (Adding Modularity)**:


In [None]:
def process_text(text, pipeline):
    # Apply each stage in the pipeline to the text sequentially
    for stage in pipeline:
        text = stage(text)  # Update the text with the output of each stage
    return text

# Define modular pipeline stages
def tokenize(text):
    # Tokenize the input text into individual words (tokens) using NLTK's word_tokenize
    return word_tokenize(text)

def pos_tagging(tokens):
    # Perform part-of-speech (POS) tagging on the tokenized words using NLTK's pos_tag
    return pos_tag(tokens)

def named_entity_recognition(pos_tags):
    # Perform Named Entity Recognition (NER) using NLTK's ne_chunk on POS-tagged words
    return ne_chunk(pos_tags)

# Create a modular pipeline
# Each function (tokenize, pos_tagging, named_entity_recognition) is treated as a stage in the pipeline
pipeline = [tokenize, pos_tagging, named_entity_recognition]

# Process text through the pipeline
# The input text goes through each stage in the pipeline sequentially
text = "Alice works at Google in Mountain View."
result = process_text(text, pipeline)

# Print the result after passing through the entire pipeline
print(result)


(S
  (GPE Alice/NNP)
  works/VBZ
  at/IN
  (ORGANIZATION Google/NNP)
  in/IN
  (GPE Mountain/NNP View/NNP)
  ./.)
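
The scalability point above can be sketched by mapping a pipeline over a collection of documents with a worker pool. This uses only the standard library. `simple_pipeline` is a hypothetical stand-in for the real stages, and a thread pool is chosen here only to keep the sketch self-contained. For CPU-bound NLP work, a process pool or a distributed framework such as Spark would be the more realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor

def simple_pipeline(document):
    """Stand-in pipeline (an assumption): keep capitalized tokens as candidate entities."""
    return [token for token in document.split() if token[0].isupper()]

def process_corpus(documents, max_workers=4):
    """Apply the pipeline to every document; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(simple_pipeline, documents))

corpus = ["Alice works at Google", "Bob lives in Paris"]
print(process_corpus(corpus))
```

Because each document is processed independently, the same function can be handed to a different executor without changing the pipeline stages themselves.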


# Demonstration (Continuation)

## 3.1 **Sentence Segmentation (Advanced)**

- **Advanced Sentence Segmentation**:
  - Sentence segmentation can get tricky when dealing with abbreviations or multiple punctuation marks. Below is an example that handles edge cases using a custom rule-based approach.



In [None]:
import re

def advanced_sentence_segmentation(text):
    """
    Segments text into sentences using a refined regex pattern.

    Args:
        text (str): The input text to be segmented.

    Returns:
        list: A list of segmented sentences.
    """
    # Regex to split sentences after ., ?, or ! followed by whitespace.
    # Fixed-width negative lookbehinds suppress the split after the titles
    # "Dr.", "Mr.", "Ms.", "Mrs." and after dotted initialisms such as "U.S.".
    # Limitation: a sentence that ends with an initialism ("... the U.S. He ...")
    # is therefore not split at that point.
    sentence_endings = re.compile(
        r'(?<!\b(?:Dr|Mr|Ms)\.)(?<!\bMrs\.)(?<!\w\.\w\.)(?<=[.?!])\s'
    )

    # Split the text based on the pattern.
    sentences = sentence_endings.split(text)

    return sentences

# Sample text with abbreviations and punctuation.
text = "Dr. Smith went to the U.S. He met Mrs. Jones. Isn't it great?"

# Apply the sentence segmentation.
sentences = advanced_sentence_segmentation(text)

# Display the segmented sentences.
print(sentences)

['Dr. Smith went to the U.S. He met Mrs. Jones.', "Isn't it great?"]


## 3.2 **Tokenization (Advanced)**

- **Custom Tokenization for Complex Cases**:
  - Below is a custom tokenizer that handles email addresses, URLs, and contractions more effectively than a simple word tokenizer.



In [None]:
import re

def custom_tokenizer(text):
    # Handling URLs, email addresses, and contractions in the tokenization process
    # - Email addresses are matched with: [\w\.-]+@[\w\.-]+
    # - URLs are matched with: \w+://\S*[^\s.,;:!?]
    #   (the final character class keeps sentence punctuation, such as a
    #   trailing period, out of the URL token)
    # - Contractions (e.g., I'll, John's) are matched with: [A-Za-z]+['’]?\w*
    # - Words (basic token matching) are matched with: \w+
    # Punctuation that matches none of these alternatives is dropped.
    pattern = r"[\w\.-]+@[\w\.-]+|\w+://\S*[^\s.,;:!?]|[A-Za-z]+['’]?\w*|\w+"

    # Use re.findall to return all matching tokens based on the pattern
    return re.findall(pattern, text)

# Sample text with email, URL, and contraction for tokenization
text = "You can reach me at john.doe@gmail.com or visit http://example.com. I'll be there!"

# Tokenize the text using the custom tokenizer
tokens = custom_tokenizer(text)

# Print the resulting tokens
print(tokens)


['You', 'can', 'reach', 'me', 'at', 'john.doe@gmail.com', 'or', 'visit', 'http://example.com', "I'll", 'be', 'there']


## 3.3 **Part-of-Speech (POS) Tagging (Advanced)**

- **Using a Custom POS Tagger with Contextual Rules**:
  - This example creates a simple rule-based POS tagger for certain words, adding more logic to handle context and token-specific rules.



In [None]:
import nltk
from nltk.tokenize import word_tokenize

# Download the necessary POS tagger model
nltk.download('averaged_perceptron_tagger')

def custom_pos_tagger(tokens):
    # Perform default POS tagging using NLTK's pos_tag function
    pos_tags = nltk.pos_tag(tokens)

    # Create a list to hold the customized POS tags
    custom_pos = []

    # Iterate over the word-POS tag pairs to apply custom rules
    for word, tag in pos_tags:
        # Custom rule: Always tag 'Google' as a proper noun (NNP), even if the default tag is different
        if word.lower() == 'google':
            custom_pos.append((word, 'NNP'))  # NNP stands for proper noun (singular)
        # Custom rule: Change 'run' tagged as a noun (NN) to a verb (VB) based on our logic
        elif word.lower() == 'run' and tag == 'NN':
            custom_pos.append((word, 'VB'))  # VB is the tag for base form verbs
        else:
            # For all other cases, keep the default POS tag
            custom_pos.append((word, tag))

    return custom_pos

# Sample sentence for POS tagging
sentence = "He will Google the problem and run away."

# Tokenize the sentence into words
tokens = word_tokenize(sentence)

# Perform custom POS tagging on the tokenized sentence
pos_tags = custom_pos_tagger(tokens)

# Print the customized POS tags
print(pos_tags)


[('He', 'PRP'), ('will', 'MD'), ('Google', 'NNP'), ('the', 'DT'), ('problem', 'NN'), ('and', 'CC'), ('run', 'VB'), ('away', 'RB'), ('.', '.')]




## 3.4 **Named Entity Recognition (NER) (Advanced)**

- **NER with spaCy**:
  - This example uses spaCy's pre-trained `en_core_web_sm` model, which labels entity types such as ORG, GPE, and MONEY.



In [None]:
import spacy

# Load the SpaCy model for English NER
# "en_core_web_sm" is a small, pre-trained English model that includes vocabulary, part-of-speech tagging, and named entity recognition
nlp = spacy.load("en_core_web_sm")

def spacy_ner(text):
    # Process the text through spaCy's NLP pipeline
    doc = nlp(text)

    # Iterate over the named entities detected in the text
    for ent in doc.ents:
        # Print the entity text (the part of the text that is recognized as an entity) and its label (the entity type)
        print(ent.text, ent.label_)

# Sample text for entity recognition
text = "Apple is looking at buying U.K. startup for $1 billion."

# Call the function to perform NER and print the results
spacy_ner(text)


Apple ORG
U.K. GPE
$1 billion MONEY


## 3.5 **Relation Detection (Advanced)**

- **Relation Detection Using Dependency Parsing**:
  - Relation detection often involves identifying the syntactic structure of the sentence. Dependency parsing can be used to detect relationships between entities.



In [None]:
import spacy

# Load SpaCy model for dependency parsing and relation extraction
# "en_core_web_sm" is a pre-trained model that includes vocabulary, part-of-speech tagging, and dependency parsing
nlp = spacy.load("en_core_web_sm")

def relation_extraction(text):
    # Process the text using spaCy's NLP pipeline
    doc = nlp(text)

    # Iterate through each token in the document
    for token in doc:
        # Check for entities that are either nominal subjects (nsubj) or direct objects (dobj)
        if token.dep_ in ('nsubj', 'dobj'):
            # Print the entity (token text), the relation (the head word of the token), and the type (subject or object)
            print(f"Entity: {token.text}, Relation: {token.head.text}, Type: {token.dep_}")

# Sample text for relation extraction
text = "Google acquired YouTube in 2006 for $1.65 billion."

# Call the function to perform relation extraction and print the results
relation_extraction(text)


Entity: Google, Relation: acquired, Type: nsubj
Entity: YouTube, Relation: acquired, Type: dobj
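
The subject and object lines printed above can be combined into (subject, verb, object) triples by grouping on the shared head word. The sketch below works on plain tuples so that it runs without spaCy. The `edges` list is hand-written to mirror the output above, and `build_triples` is a hypothetical helper. Grouping by head text rather than token position is a simplifying assumption that would merge two occurrences of the same verb.

```python
from collections import defaultdict

def build_triples(edges):
    """edges: iterable of (token_text, dependency_label, head_text) tuples."""
    by_head = defaultdict(lambda: {"nsubj": [], "dobj": []})
    for text, dep, head in edges:
        if dep in ("nsubj", "dobj"):
            by_head[head][dep].append(text)
    triples = []
    for head, args in by_head.items():
        # Pair every subject of a verb with every direct object of the same verb.
        for subj in args["nsubj"]:
            for obj in args["dobj"]:
                triples.append((subj, head, obj))
    return triples

# Hand-written edges mirroring the parser output shown above.
edges = [("Google", "nsubj", "acquired"), ("YouTube", "dobj", "acquired")]
print(build_triples(edges))
```

A verb with a subject but no direct object yields no triple, so intransitive sentences are silently skipped by this sketch.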


## 3.6 **Modularity and Scalability (Advanced)**

- **Dynamic Pipeline for Entity and Relation Extraction**:
  - This demonstrates a dynamically configurable pipeline, where users can add or remove stages such as tokenization, POS tagging, and NER based on their needs.



In [None]:
import spacy

# Define stages
def tokenize(text):
    # Tokenize the text using spaCy and return a list of token texts
    return [token.text for token in nlp(text)]

def pos_tagging(tokens):
    # Convert the tokens back into a string and run through spaCy's NLP pipeline for POS tagging
    doc = nlp(" ".join(tokens))
    # Return a list of tuples with each token and its corresponding POS tag
    return [(token.text, token.pos_) for token in doc]

def named_entity_recognition(tokens):
    # Check if the input 'tokens' is a list of tuples (from pos_tagging)
    if isinstance(tokens[0], tuple):
        # If yes, extract only the token texts for NER
        tokens = [token[0] for token in tokens]
    # Convert the tokens back into a string and run through spaCy's NLP pipeline for NER
    doc = nlp(" ".join(tokens))
    # Return a list of tuples with each named entity and its corresponding label
    return [(ent.text, ent.label_) for ent in doc.ents]

# Dynamic pipeline function
def dynamic_pipeline(text, stages):
    # Sequentially apply each stage in the pipeline to the text
    for stage in stages:
        text = stage(text)  # Update the text at each stage with the processed result
    return text  # Return the final result after all stages

# Example of a dynamic pipeline usage
nlp = spacy.load("en_core_web_sm")
text = "John works at Google in California."

# Define the sequence of stages in the pipeline
pipeline = [tokenize, pos_tagging, named_entity_recognition]

# Run the dynamic pipeline on the input text
result = dynamic_pipeline(text, pipeline)

# Print the result after going through the pipeline
print(result)

[('John', 'PERSON'), ('Google', 'ORG'), ('California', 'GPE')]
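
One way to make such a pipeline configurable from data rather than code is a small stage registry, so that a pipeline is described by a list of stage names. This is a sketch under assumptions: `STAGE_REGISTRY`, `register`, and the three toy stages are hypothetical and stand in for the spaCy-based stages above.

```python
STAGE_REGISTRY = {}

def register(name):
    """Decorator that records a stage function under a configuration name."""
    def decorator(func):
        STAGE_REGISTRY[name] = func
        return func
    return decorator

@register("lowercase")
def lowercase(text):
    return text.lower()

@register("split")
def split_words(text):
    return text.split()

@register("drop_short")
def drop_short(tokens):
    return [token for token in tokens if len(token) > 2]

def run_configured(text, stage_names):
    """Look up each named stage and apply the stages in order."""
    unknown = [name for name in stage_names if name not in STAGE_REGISTRY]
    if unknown:
        raise KeyError(f"Unknown stages: {unknown}")
    data = text
    for name in stage_names:
        data = STAGE_REGISTRY[name](data)
    return data

print(run_configured("John works at Google in California.",
                     ["lowercase", "split", "drop_short"]))
```

Validating the stage names before running anything means a typo in the configuration fails immediately instead of midway through a long job.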


## 3.7 **Error Handling in Pipeline**

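- **Robustness in Pipeline by Handling Failures**:
  - It's important to build robust pipelines that handle errors and exceptions gracefully without breaking the entire process.
  - In the run below, `safe_ner` receives `(token, tag)` tuples from `safe_pos_tagging`, so `" ".join(tokens)` raises a `TypeError`. The exception is caught, a message is printed, and an empty list is returned instead of a crash. To make the stage succeed, extract the token text from each tuple first, as `named_entity_recognition` does in 3.6.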



In [None]:
import spacy

# Load the spaCy model for English NLP tasks
nlp = spacy.load("en_core_web_sm")

# Define stages with error handling for robust processing

def safe_tokenize(text):
    # Try to tokenize the text and handle any errors gracefully
    try:
        return [token.text for token in nlp(text)]  # Tokenize the text
    except Exception as e:
        print(f"Error in tokenization: {e}")  # Print an error message if an exception occurs
        return []  # Return an empty list in case of an error

def safe_pos_tagging(tokens):
    # Try to perform POS tagging on the tokens and handle any errors gracefully
    try:
        doc = nlp(" ".join(tokens))  # Join tokens into a single string and pass through spaCy
        return [(token.text, token.pos_) for token in doc]  # Return token and its POS tag
    except Exception as e:
        print(f"Error in POS tagging: {e}")  # Print an error message if an exception occurs
        return []  # Return an empty list in case of an error

def safe_ner(tokens):
    # Try to perform Named Entity Recognition (NER) and handle any errors gracefully
    try:
        doc = nlp(" ".join(tokens))  # Join tokens into a string and pass through spaCy for NER
        return [(ent.text, ent.label_) for ent in doc.ents]  # Return entities and their labels
    except Exception as e:
        print(f"Error in NER: {e}")  # Print an error message if an exception occurs
        return []  # Return an empty list in case of an error

# Robust dynamic pipeline function with error handling
def robust_pipeline(text, stages):
    # Apply each stage in the pipeline, passing the text through all stages sequentially
    for stage in stages:
        text = stage(text)  # Update text with the result of each stage
    return text  # Return the final result after all stages

# Example usage with robustness in mind
text = "John works at Google in California."

# Define the pipeline with the robust, error-handling stages
pipeline = [safe_tokenize, safe_pos_tagging, safe_ner]

# Run the robust pipeline on the input text
result = robust_pipeline(text, pipeline)

# Print the final result after processing through the pipeline
print(result)


Error in NER: sequence item 0: expected str instance, tuple found
[]
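
Returning an empty list hides which stage failed. A complementary design, sketched here with hypothetical names and toy stages, wraps the loop itself so the caller learns the failing stage and keeps the last good intermediate result.

```python
def run_with_diagnostics(data, stages):
    """Run stages in order; stop at the first failure and report where it happened."""
    for stage in stages:
        try:
            data = stage(data)
        except Exception as exc:
            return {
                "ok": False,
                "failed_stage": stage.__name__,
                "error": str(exc),
                "partial": data,  # output of the last stage that succeeded
            }
    return {"ok": True, "result": data}

# Toy stages (assumptions for this sketch) that reproduce a type mismatch.
def split_tokens(text):
    return text.split()

def tag_tokens(tokens):
    return [(token, "X") for token in tokens]

def join_tokens(tokens):
    return " ".join(tokens)  # raises TypeError when given (token, tag) tuples

report = run_with_diagnostics("John works at Google",
                              [split_tokens, tag_tokens, join_tokens])
print(report["ok"], report["failed_stage"])
```

Stopping at the first failure avoids feeding placeholder values into later stages, at the cost of not producing a final result for that input.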


# Demonstration using NLTK library

## 4.1 **Sentence Segmentation (Advanced with NLTK)**

- **Advanced Sentence Segmentation Using NLTK**:
  - This demonstrates sentence segmentation with NLTK's `sent_tokenize`, followed by a light post-processing pass. In this example the pre-trained Punkt model already handles the abbreviations "Dr." and "U.S." correctly, so the post-processing only strips surrounding whitespace.



In [None]:
import nltk
nltk.download('punkt')

def advanced_sentence_segmentation(text):
    # Using NLTK sentence tokenizer for default segmentation
    sentences = nltk.sent_tokenize(text)

    # Post-processing hook: strip leading/trailing whitespace from each sentence.
    # In this example the Punkt model already handles 'Dr.' and 'U.S.', so no
    # abbreviation-specific rule is needed; add custom rules here if required.
    processed_sentences = [sentence.strip() for sentence in sentences]

    return processed_sentences

# Sample text with abbreviations and punctuation for testing
text = "Dr. Smith went to the U.S. He said, 'It was great!'"

# Call the function to perform sentence segmentation
sentences = advanced_sentence_segmentation(text)

# Print the processed sentences
print(sentences)


['Dr. Smith went to the U.S.', "He said, 'It was great!'"]




## 4.2 **Tokenization (Advanced Custom Tokenization with NLTK)**

- **Custom Tokenization Handling Special Cases with NLTK**:
  - This example demonstrates handling of contractions, numbers, and special cases using a combination of NLTK’s `word_tokenize` and custom logic.



In [None]:
from nltk.tokenize import word_tokenize

def custom_nltk_tokenizer(text):
    # Basic word tokenization using NLTK
    tokens = word_tokenize(text)
    processed_tokens = []

    # Post-processing to handle contractions
    for token in tokens:
        if token == "n't" and processed_tokens:
            # Re-attach the contraction suffix ("n't") to the previous token
            processed_tokens[-1] += "n't"
        else:
            # Keep all other tokens unchanged; word_tokenize already returns
            # decimal numbers such as "100.50" as a single token
            processed_tokens.append(token)
    return processed_tokens

# Sample text containing contractions and numbers
text = "She won't go, and it costs 100.50 dollars."

# Tokenize the text using the custom tokenizer
tokens = custom_nltk_tokenizer(text)

# Print the processed tokens
print(tokens)


['She', "won't", 'go', ',', 'and', 'it', 'costs', '100.50', 'dollars', '.']
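The same re-attachment idea extends to other clitics that Penn Treebank-style tokenizers split off. The sketch below works on an already tokenized list, so it does not depend on NLTK. The `CLITICS` set and the name `reattach_clitics` are assumptions for illustration. Note that `'s` is also the possessive marker, so this merges possessives as well.

```python
# Clitics that Penn Treebank-style tokenizers typically split off
# (an assumed list for illustration, not exhaustive)
CLITICS = {"n't", "'s", "'re", "'ve", "'ll", "'d", "'m"}

def reattach_clitics(tokens):
    """Glue each clitic token back onto the token that precedes it."""
    merged = []
    for token in tokens:
        if token.lower() in CLITICS and merged:
            merged[-1] += token
        else:
            merged.append(token)
    return merged

print(reattach_clitics(["She", "wo", "n't", "go", ",", "but", "he", "'ll", "stay", "."]))
# ['She', "won't", 'go', ',', 'but', "he'll", 'stay', '.']
```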


## 4.3 **Part-of-Speech (POS) Tagging (Advanced with NLTK)**

- **Handling Ambiguity in POS Tagging with NLTK**:
  - This example post-processes the output of NLTK’s default POS tagger for ambiguous words like "run" or "lead": whatever tag the tagger chose in context is collapsed to a coarse verb (`VB`) or noun (`NN`) label.



In [None]:
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Download necessary resources for POS tagging
nltk.download('averaged_perceptron_tagger')

def custom_pos_tagging(sentence):
    # Tokenize the sentence into words
    tokens = word_tokenize(sentence)

    # Perform default POS tagging using NLTK
    pos_tags = pos_tag(tokens)

    # Apply custom rules to handle ambiguous words
    custom_tags = []
    for word, tag in pos_tags:
        if word.lower() in ('lead', 'run'):
            # Collapse the tagger's decision to a coarse label:
            # any verb tag (VB, VBD, VBZ, ...) becomes 'VB', anything else 'NN'
            custom_tags.append((word, 'VB' if tag.startswith('VB') else 'NN'))
        else:
            # For other words, keep the default POS tag
            custom_tags.append((word, tag))

    return custom_tags

# Example sentence containing ambiguous words like "lead"
sentence = "The CEO will lead the meeting, and the lead pipe was rusty."

# Perform custom POS tagging
pos_tags = custom_pos_tagging(sentence)

# Print the resulting POS tags after applying custom rules
print(pos_tags)


[('The', 'DT'), ('CEO', 'NNP'), ('will', 'MD'), ('lead', 'VB'), ('the', 'DT'), ('meeting', 'NN'), (',', ','), ('and', 'CC'), ('the', 'DT'), ('lead', 'NN'), ('pipe', 'NN'), ('was', 'VBD'), ('rusty', 'JJ'), ('.', '.')]
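The rules above only relabel what the tagger already decided. A rule that actually looks at context could inspect the tag of the preceding token. The following is a minimal sketch under that assumption: the function name `retag_by_context` and the two rules (modal or "to" before a verb, determiner or adjective before a noun) are illustrative, and the input is a hand-written, deliberately mis-tagged list rather than real tagger output.

```python
def retag_by_context(tagged, ambiguous=("lead", "run")):
    """Re-tag ambiguous words based on the tag of the preceding token."""
    result = []
    for i, (word, tag) in enumerate(tagged):
        prev_tag = tagged[i - 1][1] if i > 0 else None
        if word.lower() in ambiguous:
            if prev_tag in ("MD", "TO"):
                tag = "VB"   # after a modal or "to", read the word as a verb
            elif prev_tag in ("DT", "JJ", "PRP$"):
                tag = "NN"   # after a determiner, adjective, or possessive, read it as a noun
        result.append((word, tag))
    return result

# Hand-written input with both occurrences of "lead" deliberately mis-tagged
tagged = [("The", "DT"), ("CEO", "NNP"), ("will", "MD"), ("lead", "NN"),
          ("the", "DT"), ("lead", "VB"), ("pipe", "NN")]
print(retag_by_context(tagged))
# [('The', 'DT'), ('CEO', 'NNP'), ('will', 'MD'), ('lead', 'VB'), ('the', 'DT'), ('lead', 'NN'), ('pipe', 'NN')]
```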




## 4.4 **Named Entity Recognition (NER) with NLTK (Advanced)**

- **Using a Custom NER System Based on NLTK**:
  - This example wraps NLTK’s pre-trained `ne_chunk` classifier, which labels named entities on top of POS-tagged tokens and returns them as chunks in a tree.



In [None]:
import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize

# Download necessary resources for NER
nltk.download('maxent_ne_chunker')
nltk.download('words')

def custom_nltk_ner(sentence):
    # Tokenizing the sentence into words
    tokens = word_tokenize(sentence)

    # Performing POS tagging on the tokens
    pos_tags = pos_tag(tokens)

    # Performing Named Entity Recognition (NER) using NLTK's ne_chunk on POS-tagged tokens
    ner_tree = ne_chunk(pos_tags)

    return ner_tree

# Example sentence for NER
sentence = "Barack Obama was born in Honolulu and became the president of the USA."

# Perform custom NER using NLTK
ner_result = custom_nltk_ner(sentence)

# Print the resulting NER tree
print(ner_result)


(S
  (PERSON Barack/NNP)
  (PERSON Obama/NNP)
  was/VBD
  born/VBN
  in/IN
  (GPE Honolulu/NNP)
  and/CC
  became/VBD
  the/DT
  president/NN
  of/IN
  the/DT
  (ORGANIZATION USA/NNP)
  ./.)




## 4.5 **Relation Detection Using POS Patterns and Chunking (NLTK)**

- **Relation Extraction Using POS Patterns and Chunking in NLTK**:
  - This code demonstrates extracting relations between entities by combining POS tagging with chunking patterns to detect simple entity relations in a sentence.



In [None]:
import nltk
from nltk import pos_tag
from nltk.chunk import RegexpParser
from nltk.tokenize import word_tokenize

def custom_relation_extraction(text):
    # Tokenization and POS tagging
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)

    # Define a simple grammar that chunks noun phrases (NP) and verb phrases (VP)
    grammar = r"""
        NP: {<DT>?<JJ>*<NN.*>}    # Noun phrase: optional determiner (DT), any adjectives (JJ), then one noun (NN.*)
        VP: {<VB.*>}              # Verb phrase: a single verb of any form (VB.*)
    """

    # Create a RegexpParser with the defined grammar
    chunker = RegexpParser(grammar)

    # Apply the chunking to the POS-tagged tokens
    chunked_tree = chunker.parse(pos_tags)

    # Print the chunked sentence tree for visualization
    print(chunked_tree)

    # Pair up consecutive noun phrases: the first NP is treated as the subject and
    # the next NP as the object. The verb between them is not recorded.
    relations = []
    current_np = None

    # Traverse the top-level chunks of the tree
    for subtree in chunked_tree:
        if isinstance(subtree, nltk.Tree) and subtree.label() == 'NP':
            np_text = " ".join(word for word, tag in subtree.leaves())
            if current_np is None:
                current_np = np_text
            else:
                relations.append((current_np, np_text))
                current_np = None

    return relations

# Example sentence for relation extraction
text = "John bought a car from the dealership."

# Perform relation extraction
relations = custom_relation_extraction(text)

# Print the extracted (NP, NP) pairs
print(relations)


(S
  (NP John/NNP)
  (VP bought/VBD)
  (NP a/DT car/NN)
  from/IN
  (NP the/DT dealership/NN)
  ./.)
[('John', 'a car')]
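The result above pairs two noun phrases but drops the verb that connects them. One way to keep the verb is to walk the chunk sequence and emit (subject, verb, object) triples. The sketch below assumes the chunks have already been flattened into `(label, text)` pairs; the function name `extract_triples` and the `"O"` label for unchunked tokens are illustrative choices, not NLTK conventions.

```python
def extract_triples(chunks):
    """Collect (subject NP, verb, object NP) triples from a flat chunk sequence.

    `chunks` is a list of (label, text) pairs, e.g. ("NP", "John").
    """
    triples = []
    subject = verb = None
    for label, text in chunks:
        if label == "NP" and verb is None:
            subject = text                 # latest NP before a verb is the candidate subject
        elif label == "VP" and subject is not None:
            verb = text
        elif label == "NP" and verb is not None:
            triples.append((subject, verb, text))
            subject, verb = text, None     # the object may become the next subject
    return triples

# Flattened version of the chunk tree printed above
chunks = [("NP", "John"), ("VP", "bought"), ("NP", "a car"),
          ("O", "from"), ("NP", "the dealership")]
print(extract_triples(chunks))
# [('John', 'bought', 'a car')]
```

The prepositional attachment ("from the dealership") is still ignored here; capturing it would need a richer pattern or a real parser.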


## 4.6 **Error Handling and Robustness in NLTK Pipelines**

- **Handling Errors in NER and POS Tagging Using NLTK**:
  - This pipeline wraps each stage (tokenization, POS tagging, NER) in a `try`/`except` block, so a failure in one stage yields an empty result instead of crashing the whole run.



In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk

# Safe tokenization with error handling
def safe_tokenize(text):
    try:
        return word_tokenize(text)  # Tokenize the input text
    except Exception as e:
        print(f"Error in tokenization: {e}")
        return []  # Return an empty list in case of an error

# Safe POS tagging with error handling
def safe_pos_tag(tokens):
    try:
        return pos_tag(tokens)  # Perform POS tagging on the tokenized input
    except Exception as e:
        print(f"Error in POS tagging: {e}")
        return []  # Return an empty list in case of an error

# Safe NER (Named Entity Recognition) with error handling
def safe_ner(pos_tags):
    try:
        return ne_chunk(pos_tags)  # Perform NER using the POS-tagged tokens
    except Exception as e:
        print(f"Error in NER: {e}")
        return []  # Return an empty list in case of an error

# Robust NLTK pipeline with error handling at each stage
def robust_nltk_pipeline(text):
    tokens = safe_tokenize(text)  # Tokenize the input text
    pos_tags = safe_pos_tag(tokens)  # Perform POS tagging on the tokens
    named_entities = safe_ner(pos_tags)  # Perform Named Entity Recognition (NER) on the POS tags

    return named_entities

# Example input text
text = "John Smith lives in California and works at Google."

# Run the robust NLTK pipeline
entities = robust_nltk_pipeline(text)

# Print the resulting named entities
print(entities)


(S
  (PERSON John/NNP)
  (PERSON Smith/NNP)
  lives/VBZ
  in/IN
  (GPE California/NNP)
  and/CC
  works/NNS
  at/IN
  (ORGANIZATION Google/NNP)
  ./.)
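The three `safe_*` functions above repeat the same `try`/`except` pattern. A decorator can factor that pattern out. This is a minimal sketch using only the standard library; `safe_stage`, the logger name `ie_pipeline`, and the stand-in stage `whitespace_tokenize` are hypothetical names introduced for illustration.

```python
import functools
import logging

logger = logging.getLogger("ie_pipeline")  # hypothetical logger name

def safe_stage(fallback_factory):
    """Decorator factory: log the error and return a fresh fallback if the stage raises."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # logger.exception records the message together with the traceback
                logger.exception("Stage %s failed", func.__name__)
                return fallback_factory()
        return wrapper
    return decorator

@safe_stage(fallback_factory=list)
def whitespace_tokenize(text):
    # Stand-in stage for illustration; raises AttributeError on non-string input
    return text.split()

print(whitespace_tokenize("John Smith lives in California"))
print(whitespace_tokenize(None))  # logs the error and prints []
```

Passing a factory such as `list` rather than a literal `[]` avoids sharing one mutable fallback object between calls.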


## 4.7 **Recursive Chunking and Cascaded Chunkers with NLTK**

- **Recursive Chunking for Handling Nested Entities in NLTK**:
  - This example demonstrates cascaded chunking with NLTK, where later rules build on chunks created by earlier rules, so that a prepositional phrase (PP) can contain a noun phrase (NP).



In [None]:
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.chunk import RegexpParser

def recursive_chunking(text):
    # Tokenization and POS tagging
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)

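    # Define a cascaded grammar; the rules are applied in order, one pass each
    grammar = r"""
        NP: {<DT>?<JJ>*<NN.*>}        # Noun phrase: optional determiner (DT), adjectives (JJ), and a noun
        VP: {<VB.*><NP|PP>}           # Verb phrase: verb followed by an NP or PP chunk
        PP: {<IN><NP>}                # Prepositional phrase: preposition followed by NP
    """
    # Note: PP chunks are built after the VP rule has run, so in a single pass a verb
    # is not grouped with a PP that follows it. Listing PP before VP, or passing
    # loop=2 to RegexpParser, lets the VP rule see PP chunks.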

    # Chunk the sentence according to the recursive grammar
    chunker = RegexpParser(grammar)
    chunked_tree = chunker.parse(pos_tags)

    return chunked_tree

# Example text for chunking
text = "The quick brown fox jumps over the lazy dog near the river."

# Perform recursive chunking
chunked_tree = recursive_chunking(text)

# Print the resulting chunked tree
print(chunked_tree)


(S
  (NP The/DT quick/JJ brown/NN)
  (NP fox/NN)
  jumps/VBZ
  (PP over/IN (NP the/DT lazy/JJ dog/NN))
  (PP near/IN (NP the/DT river/NN))
  ./.)


# Demonstration using PyTorch

## 5.1 **Tokenization using PyTorch**

- **Custom Subword Tokenization Using PyTorch and Hugging Face's Tokenizers**:
  - This example demonstrates subword tokenization with the `BertTokenizer` from Hugging Face’s `transformers` library. BERT uses WordPiece, a subword scheme closely related to Byte-Pair Encoding, and the resulting token IDs are what a PyTorch model such as BERT consumes.



In [None]:
from transformers import BertTokenizer

# Load the tokenizer from Hugging Face for BERT
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def subword_tokenization(text):
    # Perform subword tokenization using BERT's tokenizer
    tokens = tokenizer.tokenize(text)  # Tokenize the input text into subwords
    token_ids = tokenizer.convert_tokens_to_ids(tokens)  # Convert the tokens into their corresponding token IDs
    return tokens, token_ids

# Example text for subword tokenization
text = "The quick brown fox jumps over the lazy dog near the river."

# Perform subword tokenization
tokens, token_ids = subword_tokenization(text)

# Print the resulting subword tokens and their token IDs
print("Tokens:", tokens)
print("Token IDs:", token_ids)



Tokens: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'near', 'the', 'river', '.']
Token IDs: [1996, 4248, 2829, 4419, 14523, 2058, 1996, 13971, 3899, 2379, 1996, 2314, 1012]
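Every word in the sentence above is in BERT's vocabulary, so no word is actually split. To show what happens with a word that is not, the sketch below implements the greedy longest-match-first splitting that WordPiece tokenizers use at inference time. It runs on a toy vocabulary made up for this example; the function name `wordpiece_tokenize` is hypothetical and this is not the `transformers` implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece-style pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, invented for this example
toy_vocab = {"jump", "##s", "##ing", "hawk", "river"}
print(wordpiece_tokenize("jumps", toy_vocab))    # ['jump', '##s']
print(wordpiece_tokenize("hawking", toy_vocab))  # ['hawk', '##ing']
print(wordpiece_tokenize("zebra", toy_vocab))    # ['[UNK]']
```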




## 5.2 **POS Tagging Using a Custom RNN in PyTorch**

- **POS Tagging with a Simple RNN in PyTorch**:
  - This example builds a simple RNN model to perform part-of-speech tagging. The model takes in tokenized sentences, passes them through an embedding layer, and processes them using an RNN to predict the POS tags.



In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

# Sample dataset
data = [("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
        ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])]

# Vocabulary and tag set
word_to_ix = {}
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}  # Define tag mappings for parts of speech
for sent, tags in data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)  # Create a word-to-index mapping

# Define the RNN model
class RNNTagger(nn.Module):
    def __init__(self, vocab_size, tagset_size, embedding_dim, hidden_dim):
        super(RNNTagger, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)  # Embedding layer
        self.rnn = nn.RNN(embedding_dim, hidden_dim)  # Simple RNN layer
        self.fc = nn.Linear(hidden_dim, tagset_size)  # Fully connected layer to map to tag set

    def forward(self, sentence):
        embeds = self.embedding(sentence)  # Convert input words to embeddings
        rnn_out, _ = self.rnn(embeds)  # Pass embeddings through the RNN
        tag_space = self.fc(rnn_out)  # Linear transformation to tag space
        tag_scores = nn.functional.log_softmax(tag_space, dim=-1)  # Log-softmax over the tag dimension (input is unbatched: seq_len x tagset_size)
        return tag_scores

# Hyperparameters
EMBEDDING_DIM = 6  # Size of word embeddings
HIDDEN_DIM = 6  # Size of the RNN hidden state
vocab_size = len(word_to_ix)
tagset_size = len(tag_to_ix)

# Initialize the model, loss function, and optimizer
model = RNNTagger(vocab_size, tagset_size, EMBEDDING_DIM, HIDDEN_DIM)
loss_function = nn.NLLLoss()  # Negative log likelihood loss
optimizer = optim.SGD(model.parameters(), lr=0.1)  # Stochastic gradient descent optimizer

# Prepare input data as tensors
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]  # Convert words to their index values
    return torch.tensor(idxs, dtype=torch.long)  # Return tensor of word indices

# Training loop
for epoch in range(300):
    for sentence, tags in data:
        model.zero_grad()  # Clear the gradients

        sentence_in = prepare_sequence(sentence, word_to_ix)  # Convert input sentence to indices
        targets = prepare_sequence(tags, tag_to_ix)  # Convert target tags to indices

        tag_scores = model(sentence_in)  # Forward pass through the model
        loss = loss_function(tag_scores.view(-1, tagset_size), targets)  # Compute the loss
        loss.backward()  # Backpropagate the loss
        optimizer.step()  # Update model parameters

# Testing the model
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}  # Reverse mapping for decoding predictions
with torch.no_grad():
    sentence = prepare_sequence("Everybody read that book".split(), word_to_ix)  # Prepare test sentence
    tag_scores = model(sentence)  # Log-probabilities, shape (sentence_length, tagset_size)
    print(tag_scores)  # Print the tag scores for each word in the sentence
    print([ix_to_tag[i] for i in tag_scores.argmax(dim=-1).tolist()])  # Most likely tag per word


## 5.3 **Named Entity Recognition (NER) Using PyTorch and BERT**

- **NER Using Pre-trained BERT Model in PyTorch**:
  - This example demonstrates how to use a pre-trained BERT model for named entity recognition (NER) using Hugging Face's `transformers` library integrated with PyTorch.



In [None]:
from transformers import BertTokenizer, BertForTokenClassification
import torch

# Load pre-trained BERT model and tokenizer for token classification (NER)
tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
model = BertForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")

# Tokenize the input text
text = "Hawking was a theoretical physicist at the University of Cambridge."
input_ids = tokenizer.encode(text, return_tensors="pt")  # Convert text to token ids in PyTorch tensor format

# Get predictions from the model (no gradients are needed for inference)
with torch.no_grad():
    outputs = model(input_ids)  # Forward pass through the model
logits = outputs.logits  # Extract logits (unnormalized predictions)

# Get predicted entity labels (the highest logit value per token corresponds to the predicted label)
predictions = torch.argmax(logits, dim=2)  # Get the index of the max logit per token

# Convert token ids to actual tokens
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

# Convert predicted label ids to actual label names
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Print tokens and their corresponding entity labels
for token, label in zip(tokens, labels):
    print(f"{token}: {label}")




[CLS]: O
Hawk: I-PER
##ing: I-PER
was: O
a: O
theoretical: O
physicist: O
at: O
the: O
University: I-ORG
of: I-ORG
Cambridge: I-ORG
.: O
[SEP]: O
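The output above is per sub-token: "Hawking" is spread over `Hawk` and `##ing`, and the organization spans three tokens. A common follow-up step is to merge the pieces back into whole entities. The sketch below does this in plain Python on the tokens and labels printed above. The function name `group_entities` is hypothetical, and the sketch assumes each word takes the label of its first piece.

```python
def group_entities(tokens, labels, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Merge WordPiece pieces into words, then adjacent same-type words into entities."""
    entities = []     # list of (text, entity_type)
    prev_type = None  # entity type of the previous word, None if it was 'O'
    for token, label in zip(tokens, labels):
        if token in special_tokens:
            prev_type = None
            continue
        if token.startswith("##"):
            # Continuation piece: glue it onto the previous word if that word is in an entity
            if prev_type is not None:
                text, etype = entities[-1]
                entities[-1] = (text + token[2:], etype)
            continue
        etype = None if label == "O" else label.split("-", 1)[-1]
        if etype is not None and etype == prev_type and not label.startswith("B-"):
            # Same entity type as the previous word: extend the current entity
            text, _ = entities[-1]
            entities[-1] = (text + " " + token, etype)
        elif etype is not None:
            entities.append((token, etype))
        prev_type = etype
    return entities

tokens = ["[CLS]", "Hawk", "##ing", "was", "a", "theoretical", "physicist", "at",
          "the", "University", "of", "Cambridge", ".", "[SEP]"]
labels = ["O", "I-PER", "I-PER", "O", "O", "O", "O", "O",
          "O", "I-ORG", "I-ORG", "I-ORG", "O", "O"]
print(group_entities(tokens, labels))
# [('Hawking', 'PER'), ('University of Cambridge', 'ORG')]
```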


## 5.4 **Relation Extraction Using a Custom CNN in PyTorch**

- **Relation Extraction Using CNN in PyTorch**:
  - In this example, a simple convolutional neural network (CNN) is built to predict relationships between entities based on the context of the sentence.



In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

# Create a simple CNN for relation extraction
class CNNRelationExtractor(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_filters, num_classes):
        super().__init__()
        # Embedding layer to convert input tokens into embeddings
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Convolutional layer with 'num_filters' filters of size (3, embedding_dim)
        self.conv1 = nn.Conv2d(1, num_filters, (3, embedding_dim))
        # Max pooling layer to reduce dimensionality. For the 5-token sentences used
        # here the conv output has height 3, which this pooling reduces to 1, so the
        # flattened size equals num_filters. Longer sentences would need padding or
        # global pooling.
        self.pool = nn.MaxPool2d((2, 1))
        # Fully connected layer for classification
        self.fc1 = nn.Linear(num_filters, num_classes)

    def forward(self, x):
        # Convert input tokens to embeddings
        x = self.embedding(x)
        # Add a channel dimension for the convolutional layer (batch_size, 1, sentence_len, embedding_dim)
        x = x.unsqueeze(1)
        # Apply convolutional layer
        x = self.conv1(x)
        # Apply ReLU activation function
        x = torch.relu(x)
        # Apply max pooling
        x = self.pool(x)
        # Flatten the output for the fully connected layer
        x = x.view(x.size(0), -1)
        # Return raw logits; nn.CrossEntropyLoss applies log-softmax internally,
        # so no softmax is applied here
        return self.fc1(x)

# Hyperparameters
VOCAB_SIZE = 100  # Vocabulary size for embedding layer
EMBEDDING_DIM = 50  # Size of the word embeddings
NUM_FILTERS = 10  # Number of filters in the convolutional layer
NUM_CLASSES = 3  # Number of output classes (relation types)

# Create model, loss function, and optimizer
model = CNNRelationExtractor(VOCAB_SIZE, EMBEDDING_DIM, NUM_FILTERS, NUM_CLASSES)
loss_function = nn.CrossEntropyLoss()  # Cross-entropy loss for classification (expects logits)
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Adam optimizer

# Dummy data for training (list of tokenized sentences with corresponding relation labels)
data = [([10, 20, 30, 40, 50], 0), ([50, 60, 70, 80, 90], 1)]

# Training loop (one sentence per step)
for epoch in range(100):
    for sentence, label in data:
        # Convert the sentence and label to tensors with a batch dimension of 1
        sentence_tensor = torch.tensor([sentence], dtype=torch.long)
        label_tensor = torch.tensor([label], dtype=torch.long)

        # Zero the gradients before the forward pass
        model.zero_grad()
        # Forward pass through the model
        output = model(sentence_tensor)
        # Compute the loss
        loss = loss_function(output, label_tensor)
        # Backward pass (compute gradients)
        loss.backward()
        # Update model parameters
        optimizer.step()

# Example test case
test_sentence = torch.tensor([[10, 20, 30, 40, 50]], dtype=torch.long)  # Prepare test input
with torch.no_grad():  # Disable gradient calculation for testing
    logits = model(test_sentence)  # Get raw class scores for the test input
    prediction = torch.softmax(logits, dim=1)  # Convert logits to class probabilities
    print(prediction)  # Print the predicted class probabilities


tensor([[9.9632e-01, 3.3740e-03, 3.0674e-04]])


## 5.5 **Building a Dynamic PyTorch-Based Pipeline**

- **Building a Modular Information Extraction Pipeline in PyTorch**:
  - This example demonstrates how to build a dynamic pipeline where various NLP tasks such as tokenization, embedding, and entity extraction are modularized and can be assembled based on need.



In [None]:
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertForTokenClassification

# Load tokenizer and BERT model for token classification (NER)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Define the pipeline stages

# Tokenization stage
def tokenize(text):
    tokens = tokenizer.encode(text, return_tensors="pt")  # Convert text to BERT token IDs as a tensor
    return tokens

# Run the BERT model on the tokenized input
def run_model(tokens):
    output = model(tokens)  # Pass tokens through the model
    return output.logits  # Return the raw logits (unnormalized predictions)

# Dynamic pipeline executor
def dynamic_pipeline(data, stages):
    for stage in stages:  # Iterate through each stage in the pipeline
        data = stage(data)  # Feed each stage the output of the previous one
    return data  # Return the final result after all stages

# Assemble and execute the pipeline
pipeline = [tokenize, run_model]  # Define the stages of the pipeline

# Execute the pipeline on the input text
result = dynamic_pipeline("Hello, my name is John.", pipeline)

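# Print the resulting logits (unnormalized scores) from the BERT model.
# The classification head is randomly initialized here, so the values are not meaningful predictions.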
print(result)


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tensor([[[ 0.0879,  0.1843],
         [ 0.0200,  0.2126],
         [ 0.2310,  0.0548],
         [ 0.2503,  0.3087],
         [ 0.0089,  0.2088],
         [ 0.0530,  0.2011],
         [-0.0215,  0.4724],
         [ 0.2831, -0.0015],
         [ 0.2905, -0.1730]]], grad_fn=<ViewBackward0>)


## 5.6 **Error Handling in PyTorch Models**

- **Handling Errors in a PyTorch Model Pipeline**:
  - This pipeline wraps each stage in a `try`/`except` block and stops at the first stage that fails, returning `None` instead of raising an exception.



In [None]:
import torch
from transformers import BertTokenizer, BertForTokenClassification

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Define error-safe functions for the pipeline stages

# Tokenization with error handling
def safe_tokenize(text):
    try:
        return tokenizer.encode(text, return_tensors="pt")  # Tokenize input text and return tensor
    except Exception as e:
        print(f"Error in tokenization: {e}")
        return None  # Return None if an error occurs

# Model execution with error handling
def safe_model_run(tokens):
    try:
        return model(tokens).logits  # Run the model and return logits
    except Exception as e:
        print(f"Error in model run: {e}")
        return None  # Return None if an error occurs

# Define a robust pipeline with error handling
def robust_pipeline(data, stages):
    for stage in stages:
        data = stage(data)  # Apply each stage to the previous stage's output
        if data is None:
            break  # Stop if any stage fails
    return data  # Return the final result or None if a stage failed

# Use the pipeline with error handling
pipeline = [safe_tokenize, safe_model_run]
result = robust_pipeline("Hello, my name is John.", pipeline)

# Print the result (either the logits or None if any stage failed)
print(result)


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tensor([[[ 0.0990, -0.5672],
         [ 0.3384, -0.5945],
         [-0.5949, -0.2023],
         [-0.3101, -0.4697],
         [-0.9338, -0.0466],
         [-0.8299, -0.1092],
         [ 0.0443, -0.7669],
         [-0.0305, -0.2591],
         [-0.0138, -0.2138]]], grad_fn=<ViewBackward0>)
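Returning `None` on failure tells the caller that something went wrong, but not which stage failed or why. One alternative is to return a small result object that records the failing stage and the exception. This is a minimal standard-library sketch; `PipelineResult`, `run_stages`, and the two stand-in stages are hypothetical names for illustration and do not use the BERT model.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence

@dataclass
class PipelineResult:
    value: Any                          # output of the last stage that succeeded
    failed_stage: Optional[str] = None  # name of the stage that raised, if any
    error: Optional[Exception] = None   # the exception that was raised, if any

    @property
    def ok(self) -> bool:
        return self.failed_stage is None

def run_stages(data: Any, stages: Sequence[Callable[[Any], Any]]) -> PipelineResult:
    """Run stages in order and stop at the first one that raises."""
    for stage in stages:
        try:
            data = stage(data)
        except Exception as exc:
            return PipelineResult(value=data, failed_stage=stage.__name__, error=exc)
    return PipelineResult(value=data)

# Stand-in stages for illustration
def split_words(text):
    return text.split()

def count_words(words):
    return len(words)

print(run_stages("Hello, my name is John.", [split_words, count_words]).value)  # 5
print(run_stages(None, [split_words, count_words]).failed_stage)                # split_words
```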
