Final Code

Comprehensive Explanation for Advanced NLP Task Extraction Pipeline:

Architecture Overview:
The code implements a largely rule-based Natural Language Processing (NLP) pipeline that extracts tasks from free text and then groups them with unsupervised models. It has five stages:

1. Preprocessing Stage
- Normalizes whitespace, then splits text into sentences with NLTK
- Removes NLTK stop words and punctuation
- Performs lemmatization with spaCy
- Applies Part-of-Speech (POS) tagging and keeps only verbs, nouns, and proper nouns
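
As a rough, dependency-free illustration of the filtering idea above, the sketch below normalizes whitespace, splits on sentence-ending punctuation, and drops stop words. The tiny `STOP_WORDS` set and the regex splitter are assumptions standing in for NLTK's stop-word list and `sent_tokenize`; lemmatization and POS filtering are left out because they need a spaCy model.

```python
import re

# Hypothetical, deliberately tiny stop-word set; the pipeline uses NLTK's English list.
STOP_WORDS = {"the", "a", "an", "to", "for", "of", "is", "all", "us", "he", "has"}

def split_sentences(text: str) -> list:
    """Collapse whitespace, then split after '.', '!' or '?' (naive stand-in for sent_tokenize)."""
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def content_words(sentence: str) -> list:
    """Lowercase, strip punctuation, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

sentences = split_sentences("Rahul is outside.   He has to buy the snacks for all of us.")
print(sentences)
print([content_words(s) for s in sentences])
```

Without a lemmatizer the second sentence reduces to `buy` and `snacks` rather than `buy` and `snack`, which is one reason the pipeline relies on spaCy lemmas.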

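2. Task Identification Mechanism
- Two-layer detection: a sentence is a task if either check succeeds
- Checks the lemmatized text against a predefined list of task verbs
- Checks the original sentence for task-related phrases such as "has to" or "should"
- Works sentence by sentence and skips sentences that match neither check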
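
The two checks can be sketched without any NLP library. The version below is an assumption-laden simplification: it uses small illustrative subsets of the verb and phrase lists, and it matches raw words rather than lemmas, so "goes" would not match "go" here. What it does show is whole-word matching, which avoids false hits such as "will" inside "willing" or "go" inside "category".

```python
import re

# Illustrative subsets of the verb and phrase lists, not the full configuration.
TASK_VERBS = {"buy", "clean", "review", "submit", "go"}
TASK_PHRASES = ["need to", "has to", "should", "must", "will"]

def looks_like_task(sentence: str) -> bool:
    """Return True if a whole-word task verb or task phrase occurs in the sentence."""
    lowered = sentence.lower()
    words = set(re.findall(r"[a-z']+", lowered))
    verb_hit = bool(words & TASK_VERBS)
    # \b anchors keep 'will' from matching inside 'willing'.
    phrase_hit = any(re.search(rf"\b{re.escape(p)}\b", lowered) for p in TASK_PHRASES)
    return verb_hit or phrase_hit

print(looks_like_task("He has to buy the snacks."))                      # True
print(looks_like_task("She is willing to help on the category page."))  # False
```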

3. Entity Extraction Techniques
- Prioritizes Named Entity Recognition (NER), keeping PERSON entities
- Falls back to the grammatical subject (nsubj, nsubjpass) when no person is found
- The fallback also picks up pronouns such as "He"
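
The priority order can be shown on plain tuples. In this sketch the function name `pick_assignee` and the hand-written annotations are hypothetical: they stand in for what an NER model and a dependency parser would return, and no spaCy call is made.

```python
from typing import List, Optional, Tuple

def pick_assignee(
    entities: List[Tuple[str, str]],  # (text, label) pairs, e.g. from an NER step
    tokens: List[Tuple[str, str]],    # (text, dependency label) pairs from a parser
) -> Optional[str]:
    """Prefer the first PERSON entity; otherwise fall back to the first grammatical subject."""
    for text, label in entities:
        if label == "PERSON":
            return text
    for text, dep in tokens:
        if dep in ("nsubj", "nsubjpass"):
            return text
    return None

# Hand-written annotations standing in for parser output (assumed, not produced by spaCy here).
print(pick_assignee([("John", "PERSON")], [("John", "nsubj"), ("report", "dobj")]))  # John
print(pick_assignee([], [("He", "nsubj"), ("snacks", "dobj")]))                      # He
print(pick_assignee([], []))                                                         # None
```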

4. Deadline Detection
- Tries several regex patterns (by, until, on, before) in order
- Captures the words that follow the matched preposition
- Falls back to time keywords such as "today" or "next week"
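
A minimal sketch of the pattern-matching step follows. `find_deadline` and `DEADLINE_PATTERNS` are illustrative names. The leading `\b` in each pattern is a choice made for this sketch: without it, "on" would also match the end of "presentation" and capture the wrong text.

```python
import re
from typing import Optional

# Each pattern starts with \b so that 'on' is not found inside 'presentation'.
DEADLINE_PATTERNS = [
    r"\bby\s+([\w\s]+)",
    r"\buntil\s+([\w\s]+)",
    r"\bon\s+([\w\s]+)",
    r"\bbefore\s+([\w\s]+)",
]

def find_deadline(sentence: str) -> Optional[str]:
    """Return the text following the first matching deadline preposition, or None."""
    for pattern in DEADLINE_PATTERNS:
        match = re.search(pattern, sentence, re.IGNORECASE)
        if match:
            return match.group(1).strip()
    return None

print(find_deadline("Rahul should clean the room by 5 pm today."))                        # 5 pm today
print(find_deadline("Sarah has to prepare the presentation for the meeting on Monday."))  # Monday
print(find_deadline("Rahul wakes up early every day."))                                   # None
```

The capture group is greedy up to the next punctuation mark, so a sentence like "by Friday and then send it" would return more than the date. A stricter pattern or a date parser would be needed for that case.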

5. Semantic Categorization
- TF-IDF vectorization followed by Latent Dirichlet Allocation (LDA) topic modeling
- Alternatively, averaged Word2Vec embeddings clustered with K-Means
- Maps each topic or cluster index to a category label
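
The last step, turning a topic distribution into a label, can be sketched with plain lists. The topic weights below are invented for illustration, and `assign_categories` is a hypothetical helper. Topic indices from an unsupervised model carry no inherent meaning, so the label mapping is itself an assumption that should be checked against each topic's top words.

```python
from typing import Dict, List

# Placeholder labels: unsupervised topic indices are arbitrary, so this mapping is an assumption.
CATEGORY_MAP = {0: "Professional", 1: "Personal", 2: "Team", 3: "Administrative"}

def assign_categories(tasks: List[Dict], topic_rows: List[List[float]]) -> List[Dict]:
    """Attach the label of the highest-weight topic to each task (row i belongs to tasks[i])."""
    if len(tasks) != len(topic_rows):
        raise ValueError("need exactly one topic distribution per task")
    for task, row in zip(tasks, topic_rows):
        dominant = max(range(len(row)), key=row.__getitem__)
        task["category"] = CATEGORY_MAP[dominant]
    return tasks

tasks = [{"task": "buy snack"}, {"task": "review report"}]
rows = [[0.1, 0.7, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]]  # invented topic weights
print(assign_categories(tasks, rows))
```

The length check matters because a plain `zip` silently truncates: if the distributions were computed for every sentence but only some sentences became tasks, the wrong rows would be paired with the tasks.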

Key Technical Components:
- spaCy: lemmatization, POS tagging, NER, and dependency parsing
- NLTK: sentence tokenization and the stop-word list
- Gensim: Word2Vec embeddings
- scikit-learn: TF-IDF vectorization, LDA topic modeling, and K-Means clustering

Technical Complexity Highlights:
- Dynamic task boundary detection
- Semantic understanding beyond keyword matching
- Machine learning-powered categorization
- Robust handling of varied text inputs

Design Principles:
- Modular architecture
- Extensible component design
- Configurable task detection rules
- Machine learning-enhanced processing

Potential Interview Talking Points:
1. Discuss multilayered task detection approach
2. Explain semantic embedding's role in categorization
3. Highlight flexible entity extraction mechanism
4. Demonstrate handling of complex linguistic scenarios

Implementation Challenges Addressed:
- Handling ambiguous sentence structures
- Capturing implicit task information
- Providing context-aware categorization
- Managing varied linguistic expressions


In [81]:
import spacy
import nltk
import re
import numpy as np
from typing import List, Dict
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

class AdvancedTaskExtractor:
    def __init__(self):
        # Download necessary resources
        nltk.download('punkt', quiet=True)
        nltk.download('stopwords', quiet=True)

        # Load spaCy model
        self.nlp = spacy.load("en_core_web_sm")

        # Stop words and preprocessing
        self.stop_words = set(stopwords.words('english'))

        # Task identification patterns
        self.task_indicators = {
            'imperative_verbs': [
                'buy', 'clean', 'review', 'prepare', 'submit',
                'finalize', 'complete', 'schedule', 'discuss',
                'send', 'create', 'update', 'resolve', 'go',
                'finish', 'plan', 'organize'
            ],
            'task_phrases': [
                'need to', 'has to', 'should', 'must', 'will',
                'plan to', 'going to', 'wants to'
            ]
        }

    def preprocess_text(self, text: str) -> tuple:
        """Return (original_sentences, processed_sentences), keeping lemmatized verbs, nouns, and proper nouns"""
        # Normalize text
        text = re.sub(r'\s+', ' ', text).strip()

        # Tokenize sentences
        sentences = sent_tokenize(text)

        # Preprocess sentences
        processed_sentences = []
        for sentence in sentences:
            doc = self.nlp(sentence)
            processed_sentence = [
                token.lemma_.lower()
                for token in doc
                if token.text.lower() not in self.stop_words
                and not token.is_punct
                and token.pos_ in ['VERB', 'NOUN', 'PROPN']
            ]
            processed_sentences.append(' '.join(processed_sentence))

        return sentences, processed_sentences

    def identify_tasks(self, original_sentences: List[str], processed_sentences: List[str]) -> List[Dict]:
        """Advanced task identification with detailed logging"""
        tasks = []

        for orig_sent, proc_sent in zip(original_sentences, processed_sentences):
            # Detailed logging for task detection
            print(f"\nAnalyzing Sentence: {orig_sent}")
            print(f"Processed Sentence: {proc_sent}")

            # Enhanced task detection
            is_task = False

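            # Check for imperative verbs (whole-token match, so 'go' does not match 'category')
            proc_tokens = set(proc_sent.split())
            verb_match = any(verb in proc_tokens for verb in self.task_indicators['imperative_verbs'])
            print(f"Imperative Verb Match: {verb_match}")

            # Check for task indicator phrases (word boundaries, so 'will' does not match 'willing')
            sent_lower = orig_sent.lower()
            phrase_match = any(
                re.search(rf"\b{re.escape(phrase)}\b", sent_lower)
                for phrase in self.task_indicators['task_phrases']
            )
            print(f"Task Phrase Match: {phrase_match}")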

            # Determine if sentence is a task
            if verb_match or phrase_match:
                is_task = True
                print("Sentence identified as a task")

            if not is_task:
                print("Not a task. Skipping.")
                continue

            # Entity extraction
            doc = self.nlp(orig_sent)
            entities = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]

            # Fallback to subject extraction
            if not entities:
                entities = [token.text for token in doc if token.dep_ in ["nsubj", "nsubjpass"]]

            # Deadline extraction
            deadline = self._extract_deadline(orig_sent)

            # Create task entry
            task_entry = {
                'task': orig_sent,
                'processed_task': proc_sent,
                'entity': entities[0] if entities else None,
                'deadline': deadline
            }

            print("Task Entry:")
            print(task_entry)

            tasks.append(task_entry)

        return tasks

    def _extract_deadline(self, sentence: str) -> str:
        """Sophisticated deadline extraction"""
        # \b keeps 'on' from matching inside words such as 'presentation'
        deadline_patterns = [
            r'\bby\s+([\w\s]+)',
            r'\buntil\s+([\w\s]+)',
            r'\bon\s+([\w\s]+)',
            r'\bbefore\s+([\w\s]+)'
        ]

        for pattern in deadline_patterns:
            match = re.search(pattern, sentence, re.IGNORECASE)
            if match:
                return match.group(1).strip()

        # Additional time-related keywords
        time_keywords = ['today', 'tomorrow', 'next week', 'this week', 'monday', 'friday']
        for keyword in time_keywords:
            if keyword in sentence.lower():
                return keyword

        return None

    def train_word_embeddings(self, processed_sentences: List[str]):
        """Train Word2Vec embeddings"""
        # Tokenize processed sentences
        tokenized_sentences = [sent.split() for sent in processed_sentences]

        # Train Word2Vec model
        model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)
        return model

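    def categorize_tasks(self, tasks: List[Dict], processed_sentences: List[str]):
        """Categorize tasks using TF-IDF features and LDA topic modeling"""
        if not tasks:
            return tasks

        # Fit TF-IDF and LDA on every processed sentence
        vectorizer = TfidfVectorizer(max_features=1000)
        tfidf_matrix = vectorizer.fit_transform(processed_sentences)
        lda_model = LatentDirichletAllocation(n_components=4, random_state=42)
        lda_model.fit(tfidf_matrix)

        # Score only the task sentences so that row i lines up with tasks[i]
        # (identify_tasks skips non-task sentences, so the two lists differ in length)
        task_texts = [task['processed_task'] for task in tasks]
        lda_output = lda_model.transform(vectorizer.transform(task_texts))

        # Placeholder labels: LDA topic indices have no inherent meaning
        category_map = {
            0: 'Professional',
            1: 'Personal',
            2: 'Team',
            3: 'Administrative'
        }

        # Assign the label of the dominant topic to each task
        for task, topic_dist in zip(tasks, lda_output):
            dominant_topic = int(np.argmax(topic_dist))
            task['category'] = category_map[dominant_topic]

        return tasks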

    def extract_tasks(self, text: str) -> List[Dict]:
        """Main task extraction pipeline"""
        # Preprocess text
        original_sentences, processed_sentences = self.preprocess_text(text)

        # Identify tasks
        tasks = self.identify_tasks(original_sentences, processed_sentences)

        # Categorize tasks
        categorized_tasks = self.categorize_tasks(tasks, processed_sentences)

        return categorized_tasks

def main():
    text = """
    Rahul wakes up early every day. He goes to college in the morning and comes back at 3 pm.
    At present, Rahul is outside. He has to buy the snacks for all of us.
    Rahul should clean the room by 5 pm today.
    John needs to review the report by Friday.
    Alice needs to finish her homework by 6 pm.
    Bob is planning to go for a run tomorrow morning.
    The team should discuss the project updates in the meeting next week.
    Sarah has to prepare the presentation for the meeting on Monday.
    Tom will submit the project report by the end of the week.
    The group needs to finalize the budget by 3 pm tomorrow.
    """

    extractor = AdvancedTaskExtractor()
    tasks = extractor.extract_tasks(text)

    print("\nExtracted Tasks:")
    for task in tasks:
        print(f"Task: {task['task']}")
        print(f"Entity: {task['entity']}")
        print(f"Deadline: {task['deadline']}")
        print(f"Category: {task['category']}\n")

if __name__ == "__main__":
    main()
    # Uncomment to run additional test scenarios
    # test_multiple_scenarios()


Analyzing Sentence: Rahul wakes up early every day.
Processed Sentence: wake day
Imperative Verb Match: False
Task Phrase Match: False
Not a task. Skipping.

Analyzing Sentence: He goes to college in the morning and comes back at 3 pm.
Processed Sentence: go college morning come pm
Imperative Verb Match: True
Task Phrase Match: False
Sentence identified as a task
Task Entry:
{'task': 'He goes to college in the morning and comes back at 3 pm.', 'processed_task': 'go college morning come pm', 'entity': 'He', 'deadline': None}

Analyzing Sentence: At present, Rahul is outside.
Processed Sentence: present rahul
Imperative Verb Match: False
Task Phrase Match: False
Not a task. Skipping.

Analyzing Sentence: He has to buy the snacks for all of us.
Processed Sentence: buy snack
Imperative Verb Match: True
Task Phrase Match: True
Sentence identified as a task
Task Entry:
{'task': 'He has to buy the snacks for all of us.', 'processed_task': 'buy snack', 'entity': 'He', 'deadline': None}

Analy

Text I/O

In [85]:
import spacy
import nltk
import re
import numpy as np
from typing import List, Dict
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import csv


class AdvancedTaskExtractor:
    def __init__(self):
        # Download necessary resources
        nltk.download('punkt', quiet=True)
        nltk.download('stopwords', quiet=True)

        # Load spaCy model
        self.nlp = spacy.load("en_core_web_sm")

        # Stop words and preprocessing
        self.stop_words = set(stopwords.words('english'))

        # Task identification patterns
        self.task_indicators = {
            'imperative_verbs': [
                'buy', 'clean', 'review', 'prepare', 'submit',
                'finalize', 'complete', 'schedule', 'discuss',
                'send', 'create', 'update', 'resolve', 'go',
                'finish', 'plan', 'organize'
            ],
            'task_phrases': [
                'need to', 'has to', 'should', 'must', 'will',
                'plan to', 'going to', 'wants to'
            ]
        }

    def preprocess_text(self, text: str) -> tuple:
        """Return (original_sentences, processed_sentences), keeping lemmatized verbs, nouns, and proper nouns"""
        # Normalize text
        text = re.sub(r'\s+', ' ', text).strip()

        # Tokenize sentences
        sentences = sent_tokenize(text)

        # Preprocess sentences
        processed_sentences = []
        for sentence in sentences:
            doc = self.nlp(sentence)
            processed_sentence = [
                token.lemma_.lower()
                for token in doc
                if token.text.lower() not in self.stop_words
                and not token.is_punct
                and token.pos_ in ['VERB', 'NOUN', 'PROPN']
            ]
            processed_sentences.append(' '.join(processed_sentence))

        return sentences, processed_sentences

    def identify_tasks(self, original_sentences: List[str], processed_sentences: List[str]) -> List[Dict]:
        """Advanced task identification with detailed logging"""
        tasks = []

        for orig_sent, proc_sent in zip(original_sentences, processed_sentences):
            # Detailed logging for task detection
            print(f"\nAnalyzing Sentence: {orig_sent}")
            print(f"Processed Sentence: {proc_sent}")

            # Enhanced task detection
            is_task = False

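            # Check for imperative verbs (whole-token match, so 'go' does not match 'category')
            proc_tokens = set(proc_sent.split())
            verb_match = any(verb in proc_tokens for verb in self.task_indicators['imperative_verbs'])
            print(f"Imperative Verb Match: {verb_match}")

            # Check for task indicator phrases (word boundaries, so 'will' does not match 'willing')
            sent_lower = orig_sent.lower()
            phrase_match = any(
                re.search(rf"\b{re.escape(phrase)}\b", sent_lower)
                for phrase in self.task_indicators['task_phrases']
            )
            print(f"Task Phrase Match: {phrase_match}")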

            # Determine if sentence is a task
            if verb_match or phrase_match:
                is_task = True
                print("Sentence identified as a task")

            if not is_task:
                print("Not a task. Skipping.")
                continue

            # Entity extraction
            doc = self.nlp(orig_sent)
            entities = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]

            # Fallback to subject extraction
            if not entities:
                entities = [token.text for token in doc if token.dep_ in ["nsubj", "nsubjpass"]]

            # Deadline extraction
            deadline = self._extract_deadline(orig_sent)

            # Create task entry
            task_entry = {
                'task': orig_sent,
                'processed_task': proc_sent,
                'entity': entities[0] if entities else None,
                'deadline': deadline
            }

            print("Task Entry:")
            print(task_entry)

            tasks.append(task_entry)

        return tasks

    def _extract_deadline(self, sentence: str) -> str:
        """Sophisticated deadline extraction"""
        # \b keeps 'on' from matching inside words such as 'presentation'
        deadline_patterns = [
            r'\bby\s+([\w\s]+)',
            r'\buntil\s+([\w\s]+)',
            r'\bon\s+([\w\s]+)',
            r'\bbefore\s+([\w\s]+)'
        ]

        for pattern in deadline_patterns:
            match = re.search(pattern, sentence, re.IGNORECASE)
            if match:
                return match.group(1).strip()

        # Additional time-related keywords
        time_keywords = ['today', 'tomorrow', 'next week', 'this week', 'monday', 'friday']
        for keyword in time_keywords:
            if keyword in sentence.lower():
                return keyword

        return None

    def train_word_embeddings(self, processed_sentences: List[str]):
        """Train Word2Vec embeddings"""
        # Tokenize processed sentences
        tokenized_sentences = [sent.split() for sent in processed_sentences]

        # Train Word2Vec model
        model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)
        return model

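    def categorize_tasks(self, tasks: List[Dict], processed_sentences: List[str]):
        """Categorize tasks using TF-IDF features and LDA topic modeling"""
        if not tasks:
            return tasks

        # Fit TF-IDF and LDA on every processed sentence
        vectorizer = TfidfVectorizer(max_features=1000)
        tfidf_matrix = vectorizer.fit_transform(processed_sentences)
        lda_model = LatentDirichletAllocation(n_components=4, random_state=42)
        lda_model.fit(tfidf_matrix)

        # Score only the task sentences so that row i lines up with tasks[i]
        # (identify_tasks skips non-task sentences, so the two lists differ in length)
        task_texts = [task['processed_task'] for task in tasks]
        lda_output = lda_model.transform(vectorizer.transform(task_texts))

        # Placeholder labels: LDA topic indices have no inherent meaning
        category_map = {
            0: 'Professional',
            1: 'Personal',
            2: 'Team',
            3: 'Administrative'
        }

        # Assign the label of the dominant topic to each task
        for task, topic_dist in zip(tasks, lda_output):
            dominant_topic = int(np.argmax(topic_dist))
            task['category'] = category_map[dominant_topic]

        return tasks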

    def extract_tasks(self, text: str) -> List[Dict]:
        """Main task extraction pipeline"""
        # Preprocess text
        original_sentences, processed_sentences = self.preprocess_text(text)

        # Identify tasks
        tasks = self.identify_tasks(original_sentences, processed_sentences)

        # Categorize tasks
        categorized_tasks = self.categorize_tasks(tasks, processed_sentences)

        return categorized_tasks

def main():
    # Read text from .txt file
    try:
        with open("input.txt", "r", encoding="utf-8") as file:  # Specify encoding if needed
            text = file.read()
    except FileNotFoundError:
        print("Error: input.txt not found. Please create the file and add text.")
        return

    extractor = AdvancedTaskExtractor()
    tasks = extractor.extract_tasks(text)

    # Write output to CSV file
    try:
        with open("output.csv", "w", newline="", encoding="utf-8") as csvfile:
            fieldnames = ['task', 'entity', 'deadline', 'category', 'processed_task'] # Add processed_task
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

            writer.writeheader()
            for task in tasks:
                writer.writerow(task)
        print("Tasks extracted and saved to output.csv")
    except Exception as e:
        print(f"Error writing to CSV: {e}")


if __name__ == "__main__":
    main()


Analyzing Sentence: Rahul wakes up early every day.
Processed Sentence: wake day
Imperative Verb Match: False
Task Phrase Match: False
Not a task. Skipping.

Analyzing Sentence: He goes to college in the morning and comes back at 3 pm.
Processed Sentence: go college morning come pm
Imperative Verb Match: True
Task Phrase Match: False
Sentence identified as a task
Task Entry:
{'task': 'He goes to college in the morning and comes back at 3 pm.', 'processed_task': 'go college morning come pm', 'entity': 'He', 'deadline': None}

Analyzing Sentence: At present, Rahul is outside.
Processed Sentence: present rahul
Imperative Verb Match: False
Task Phrase Match: False
Not a task. Skipping.

Analyzing Sentence: He has to buy the snacks for all of us.
Processed Sentence: buy snack
Imperative Verb Match: True
Task Phrase Match: True
Sentence identified as a task
Task Entry:
{'task': 'He has to buy the snacks for all of us.', 'processed_task': 'buy snack', 'entity': 'He', 'deadline': None}

Analy

Word Embeddings + Clustering
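
The cell below represents each sentence as the mean of its Word2Vec word vectors before clustering. The averaging step on its own can be sketched with plain lists. The two-dimensional `toy_vectors` are invented for illustration and `sentence_vector` is a hypothetical helper. A trained model would supply vectors of its own dimensionality.

```python
from typing import Dict, List

def sentence_vector(words: List[str], vectors: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the vectors of known words; return a zero vector if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]

# Toy 2-dimensional vectors, invented for illustration only.
toy_vectors = {"buy": [1.0, 0.0], "snack": [0.0, 1.0]}
print(sentence_vector(["buy", "snack"], toy_vectors, dim=2))  # [0.5, 0.5]
print(sentence_vector(["unknown"], toy_vectors, dim=2))       # [0.0, 0.0]
```

Averaging discards word order, and embeddings trained on only a handful of sentences are close to random, so clusters built on them should be read with caution.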




In [86]:
import spacy
import nltk
import re
import numpy as np
from typing import List, Dict
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

import csv


class AdvancedTaskExtractor:
    def __init__(self):
        # Download necessary resources
        nltk.download('punkt', quiet=True)
        nltk.download('stopwords', quiet=True)

        # Load spaCy model
        self.nlp = spacy.load("en_core_web_sm")

        # Stop words and preprocessing
        self.stop_words = set(stopwords.words('english'))

        # Task identification patterns
        self.task_indicators = {
            'imperative_verbs': [
                'buy', 'clean', 'review', 'prepare', 'submit',
                'finalize', 'complete', 'schedule', 'discuss',
                'send', 'create', 'update', 'resolve', 'go',
                'finish', 'plan', 'organize'
            ],
            'task_phrases': [
                'need to', 'has to', 'should', 'must', 'will',
                'plan to', 'going to', 'wants to'
            ]
        }

    def preprocess_text(self, text: str) -> tuple:
        """Return (original_sentences, processed_sentences), keeping lemmatized verbs, nouns, and proper nouns"""
        # Normalize text
        text = re.sub(r'\s+', ' ', text).strip()

        # Tokenize sentences
        sentences = sent_tokenize(text)

        # Preprocess sentences
        processed_sentences = []
        for sentence in sentences:
            doc = self.nlp(sentence)
            processed_sentence = [
                token.lemma_.lower()
                for token in doc
                if token.text.lower() not in self.stop_words
                and not token.is_punct
                and token.pos_ in ['VERB', 'NOUN', 'PROPN']
            ]
            processed_sentences.append(' '.join(processed_sentence))

        return sentences, processed_sentences

    def identify_tasks(self, original_sentences: List[str], processed_sentences: List[str]) -> List[Dict]:
        """Advanced task identification with detailed logging"""
        tasks = []

        for orig_sent, proc_sent in zip(original_sentences, processed_sentences):
            # Detailed logging for task detection
            print(f"\nAnalyzing Sentence: {orig_sent}")
            print(f"Processed Sentence: {proc_sent}")

            # Enhanced task detection
            is_task = False

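            # Check for imperative verbs (whole-token match, so 'go' does not match 'category')
            proc_tokens = set(proc_sent.split())
            verb_match = any(verb in proc_tokens for verb in self.task_indicators['imperative_verbs'])
            print(f"Imperative Verb Match: {verb_match}")

            # Check for task indicator phrases (word boundaries, so 'will' does not match 'willing')
            sent_lower = orig_sent.lower()
            phrase_match = any(
                re.search(rf"\b{re.escape(phrase)}\b", sent_lower)
                for phrase in self.task_indicators['task_phrases']
            )
            print(f"Task Phrase Match: {phrase_match}")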

            # Determine if sentence is a task
            if verb_match or phrase_match:
                is_task = True
                print("Sentence identified as a task")

            if not is_task:
                print("Not a task. Skipping.")
                continue

            # Entity extraction
            doc = self.nlp(orig_sent)
            entities = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]

            # Fallback to subject extraction
            if not entities:
                entities = [token.text for token in doc if token.dep_ in ["nsubj", "nsubjpass"]]

            # Deadline extraction
            deadline = self._extract_deadline(orig_sent)

            # Create task entry
            task_entry = {
                'task': orig_sent,
                'processed_task': proc_sent,
                'entity': entities[0] if entities else None,
                'deadline': deadline
            }

            print("Task Entry:")
            print(task_entry)

            tasks.append(task_entry)

        return tasks

    def _extract_deadline(self, sentence: str) -> str:
        """Sophisticated deadline extraction"""
        # \b keeps 'on' from matching inside words such as 'presentation'
        deadline_patterns = [
            r'\bby\s+([\w\s]+)',
            r'\buntil\s+([\w\s]+)',
            r'\bon\s+([\w\s]+)',
            r'\bbefore\s+([\w\s]+)'
        ]

        for pattern in deadline_patterns:
            match = re.search(pattern, sentence, re.IGNORECASE)
            if match:
                return match.group(1).strip()

        # Additional time-related keywords
        time_keywords = ['today', 'tomorrow', 'next week', 'this week', 'monday', 'friday']
        for keyword in time_keywords:
            if keyword in sentence.lower():
                return keyword

        return None

    def train_word_embeddings(self, processed_sentences: List[str]):
        """Train Word2Vec embeddings"""
        # Tokenize processed sentences
        tokenized_sentences = [sent.split() for sent in processed_sentences]

        # Train Word2Vec model
        model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)
        return model


    def categorize_tasks(self, tasks: List[Dict], processed_sentences: List[str]):
        """Categorize tasks using Word2Vec embeddings and K-Means clustering"""
        if not tasks:
            return tasks

        # Train Word2Vec model on all processed sentences
        model = self.train_word_embeddings(processed_sentences)

        # Embed only the task sentences so that row i lines up with tasks[i];
        # sentences with no known words fall back to a zero vector
        embeddings = []
        for task in tasks:
            vectors = [model.wv[word] for word in task['processed_task'].split() if word in model.wv]
            embeddings.append(np.mean(vectors, axis=0) if vectors else np.zeros(model.wv.vector_size))

        # Cluster embeddings using K-Means (never request more clusters than tasks)
        n_clusters = min(4, len(tasks))
        kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
        task_clusters = kmeans.fit_predict(np.array(embeddings))

        # Placeholder labels: cluster indices have no inherent meaning
        category_map = {
            0: 'Professional',
            1: 'Personal',
            2: 'Team',
            3: 'Administrative'
        }

        # Assign categories
        for task, cluster in zip(tasks, task_clusters):
            task['category'] = category_map[int(cluster)]

        return tasks


    def extract_tasks(self, text: str) -> List[Dict]:
        """Main task extraction pipeline"""
        # Preprocess text
        original_sentences, processed_sentences = self.preprocess_text(text)

        # Identify tasks
        tasks = self.identify_tasks(original_sentences, processed_sentences)

        # Categorize tasks
        categorized_tasks = self.categorize_tasks(tasks, processed_sentences)

        return categorized_tasks

def main():
    # Read text from .txt file
    try:
        with open("input.txt", "r", encoding="utf-8") as file:  # Specify encoding if needed
            text = file.read()
    except FileNotFoundError:
        print("Error: input.txt not found. Please create the file and add text.")
        return

    extractor = AdvancedTaskExtractor()
    tasks = extractor.extract_tasks(text)

    # Write output to CSV file
    try:
        with open("output.csv", "w", newline="", encoding="utf-8") as csvfile:
            fieldnames = ['task', 'entity', 'deadline', 'category', 'processed_task'] # Add processed_task
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

            writer.writeheader()
            for task in tasks:
                writer.writerow(task)
        print("Tasks extracted and saved to output.csv")
    except Exception as e:
        print(f"Error writing to CSV: {e}")


if __name__ == "__main__":
    main()




Analyzing Sentence: Rahul wakes up early every day.
Processed Sentence: wake day
Imperative Verb Match: False
Task Phrase Match: False
Not a task. Skipping.

Analyzing Sentence: He goes to college in the morning and comes back at 3 pm.
Processed Sentence: go college morning come pm
Imperative Verb Match: True
Task Phrase Match: False
Sentence identified as a task
Task Entry:
{'task': 'He goes to college in the morning and comes back at 3 pm.', 'processed_task': 'go college morning come pm', 'entity': 'He', 'deadline': None}

Analyzing Sentence: At present, Rahul is outside.
Processed Sentence: present rahul
Imperative Verb Match: False
Task Phrase Match: False
Not a task. Skipping.

Analyzing Sentence: He has to buy the snacks for all of us.
Processed Sentence: buy snack
Imperative Verb Match: True
Task Phrase Match: True
Sentence identified as a task
Task Entry:
{'task': 'He has to buy the snacks for all of us.', 'processed_task': 'buy snack', 'entity': 'He', 'deadline': None}

Analy