### **NLPTasker** - Extract and Categorize Tasks from Unannotated Text

This notebook walks through the code required to extract and categorize tasks from unannotated text using **LDA Topic Modeling**.

This cell imports the necessary libraries, downloads the NLTK resources, and loads the spaCy model for the task extraction pipeline:
- `spacy`: Used for Natural Language Processing (NLP) tasks like lemmatization, part-of-speech tagging, and Named Entity Recognition (NER).
- `nltk`: Provides stopwords and sentence tokenization.
- `re`: Enables regular expressions for text preprocessing.
- `numpy`: Supports numerical operations (used in topic modeling).
- `sklearn.feature_extraction.text.TfidfVectorizer`: Converts text into TF-IDF vectors for categorization.
- `sklearn.decomposition.LatentDirichletAllocation`: Performs topic modeling to categorize tasks.


In [1]:
import spacy
import nltk
import re
import numpy as np
from typing import List, Dict
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Download necessary NLTK resources
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Stop words and preprocessing
stop_words = set(stopwords.words('english'))

This dictionary defines two important sets of keywords for task identification:
1. **Imperative Verbs**: Words that often indicate an action or a task (e.g., "buy", "clean", "review").
2. **Task Phrases**: Common phrases that indicate a task's requirement (e.g., "need to", "should", "must").


In [2]:
# Task identification patterns
task_indicators = {
    'imperative_verbs': [
        'buy', 'clean', 'review', 'prepare', 'submit',
        'finalize', 'complete', 'schedule', 'discuss',
        'send', 'create', 'update', 'resolve', 'go',
        'finish', 'plan', 'organize'
    ],
    'task_phrases': [
        'need to', 'has to', 'should', 'must', 'will',
        'plan to', 'going to', 'wants to'
    ]
}

This function takes raw text as input and applies the following preprocessing steps:
1. **Sentence Tokenization**: Splits the text into individual sentences.
2. **Lemmatization**: Converts words to their root form (e.g., "running" → "run").
3. **Stopword Removal**: Eliminates common words like "the", "is", and "and" that do not add much meaning.
4. **POS Filtering**: Retains only Verbs, Nouns, and Proper Nouns, as they are crucial for understanding tasks.
5. **Final Output**: Returns both the original and processed versions of each sentence.


In [3]:
def preprocess_text(text: str) -> tuple[List[str], List[str]]:
    """Return (original sentences, POS-filtered lemmatized sentences)"""
    # Normalize text
    text = re.sub(r'\s+', ' ', text).strip()

    # Tokenize sentences
    sentences = sent_tokenize(text)

    # Preprocess sentences
    processed_sentences = []
    for sentence in sentences:
        doc = nlp(sentence)
        processed_sentence = [
            token.lemma_.lower()
            for token in doc
            if token.text.lower() not in stop_words
            and not token.is_punct
            and token.pos_ in ['VERB', 'NOUN', 'PROPN']
        ]
        processed_sentences.append(' '.join(processed_sentence))

    return sentences, processed_sentences
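
To make the shape of the return value concrete, here is a minimal, dependency-free sketch. It is not the spaCy pipeline: `simple_preprocess`, the regex sentence splitter, and the tiny `STOP_WORDS` set are stand-ins assumed purely for illustration, and the sketch does no lemmatization or POS filtering. It only shows the two parallel lists (original and processed sentences) that the later steps consume.

```python
import re
from typing import List, Tuple

# Tiny illustrative stop-word set (an assumption; the notebook uses NLTK's list)
STOP_WORDS = {"the", "a", "to", "by", "is", "has", "he"}

def simple_preprocess(text: str) -> Tuple[List[str], List[str]]:
    """Return (original sentences, filtered sentences) as two parallel lists."""
    # Collapse runs of whitespace, as the notebook does
    text = re.sub(r"\s+", " ", text).strip()
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    processed = []
    for sentence in sentences:
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        processed.append(" ".join(w for w in words if w not in STOP_WORDS))
    return sentences, processed

originals, processed = simple_preprocess("He has to buy the snacks.   Rahul is outside.")
print(originals)   # the untouched sentences
print(processed)   # the same sentences with stop words and punctuation removed
```

Keeping both lists index-aligned is what allows a later step to report the original sentence while matching against the processed one.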

This function extracts deadlines from sentences using:
1. **Regex Matching**: Captures the text that follows the prepositions "by", "until", "on", or "before" (e.g., "by 5 pm", "on Monday", "before Friday").
2. **Keyword Matching**: If no pattern matches, looks for common time-related words like "today", "tomorrow", "next week".
3. **Returns a Deadline (if found)**: Returns the matched text, or `None` when the sentence has no recognizable time constraint.


In [4]:
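from typing import Optional

import nltk
nltk.download('punkt_tab')  # tokenizer data used by sent_tokenize in recent NLTK releases

def _extract_deadline(sentence: str) -> Optional[str]:
    """Return the deadline phrase found in the sentence, or None"""
    # \b keeps a preposition from matching inside a longer word
    # (e.g. the trailing "on" in "presentation")
    deadline_patterns = [
        r'\bby\s+([\w\s]+)',
        r'\buntil\s+([\w\s]+)',
        r'\bon\s+([\w\s]+)',
        r'\bbefore\s+([\w\s]+)'
    ]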

    for pattern in deadline_patterns:
        match = re.search(pattern, sentence, re.IGNORECASE)
        if match:
            return match.group(1).strip()

    # Additional time-related keywords
    time_keywords = ['today', 'tomorrow', 'next week', 'this week', 'monday', 'friday']
    for keyword in time_keywords:
        if keyword in sentence.lower():
            return keyword

    return None

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
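
One detail worth checking in preposition-based patterns like these is word boundaries. The sketch below uses only `re` and one sentence from the sample text to compare an "on" pattern with and without `\b`. The variable names `loose` and `strict` are illustrative and not part of the notebook.

```python
import re

sentence = "Sarah has to prepare the presentation for the meeting on Monday."

# Without a word boundary, "on" also matches the last two letters of "presentation"
loose = re.search(r"on\s+([\w\s]+)", sentence, re.IGNORECASE)

# With \b, "on" has to begin at a word boundary, so only the standalone word matches
strict = re.search(r"\bon\s+([\w\s]+)", sentence, re.IGNORECASE)

print(loose.group(1).strip())   # for the meeting on Monday
print(strict.group(1).strip())  # Monday
```

The same consideration applies to "by", "until", and "before" whenever they can occur as the tail of a longer word.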


This function determines whether a given sentence is a task by:
1. **Checking for Imperative Verbs**: If a sentence contains an action verb, it is likely a task.
2. **Checking for Task Phrases**: If a sentence includes "has to", "should", or similar phrases, it is considered a task.
3. **Extracting Entities**: Uses Named Entity Recognition (NER) to detect people mentioned in the task.
4. **Extracting Subjects as Fallback**: If no entities are found, it retrieves the subject of the sentence.
5. **Extracting Deadlines**: Calls the `_extract_deadline` function to find due dates.
6. **Storing Results**: Returns a list of extracted tasks with entities and deadlines.


In [5]:
def identify_tasks(original_sentences: List[str], processed_sentences: List[str]) -> List[Dict]:
    """Advanced task identification with detailed logging"""
    tasks = []

    for orig_sent, proc_sent in zip(original_sentences, processed_sentences):
        # Detailed logging for task detection
        print(f"\nAnalyzing Sentence: {orig_sent}")
        print(f"Processed Sentence: {proc_sent}")

        # Enhanced task detection
        is_task = False

        # Check for imperative verbs (whole-token match, so "go" does not match "government")
        verb_match = any(verb in proc_sent.split() for verb in task_indicators['imperative_verbs'])
        print(f"Imperative Verb Match: {verb_match}")

        # Check for task indicator phrases
        phrase_match = any(phrase in orig_sent.lower() for phrase in task_indicators['task_phrases'])
        print(f"Task Phrase Match: {phrase_match}")

        # Determine if sentence is a task
        if verb_match or phrase_match:
            is_task = True
            print("Sentence identified as a task")

        if not is_task:
            print("Not a task. Skipping.")
            continue

        # Entity extraction
        doc = nlp(orig_sent)
        entities = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]

        # Fallback to subject extraction
        if not entities:
            entities = [token.text for token in doc if token.dep_ in ["nsubj", "nsubjpass"]]

        # Deadline extraction
        deadline = _extract_deadline(orig_sent)

        # Create task entry
        task_entry = {
            'task': orig_sent,
            'processed_task': proc_sent,
            'entity': entities[0] if entities else None,
            'deadline': deadline
        }

        print("Task Entry:")
        print(task_entry)

        tasks.append(task_entry)

    return tasks
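
How a keyword is tested against a sentence changes which sentences count as tasks. The following sketch, which assumes a shortened verb list and two hypothetical helper functions, contrasts a substring test with a whole-token test on already-lemmatized text.

```python
from typing import List

# Shortened, illustrative verb list (the notebook's list is longer)
verbs: List[str] = ["go", "plan", "send"]

def substring_match(processed: str) -> bool:
    """True if any verb appears anywhere in the string, even inside another word."""
    return any(verb in processed for verb in verbs)

def token_match(processed: str) -> bool:
    """True only if a verb appears as a whole whitespace-separated token."""
    tokens = set(processed.split())
    return any(verb in tokens for verb in verbs)

print(substring_match("government publish report"))  # True: "go" is inside "government"
print(token_match("government publish report"))      # False: no token equals a listed verb
print(token_match("go college morning"))             # True: "go" is a token
```

Token matching is stricter, and it relies on lemmatization having already reduced forms such as "goes" to "go".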

This function categorizes extracted tasks into meaningful groups using **TF-IDF and LDA (Latent Dirichlet Allocation)**:
1. **TF-IDF Vectorization**: Converts text into numerical vectors based on word importance.
2. **LDA Topic Modeling**: Identifies underlying themes in the tasks.
3. **Category Mapping**: Each of the four topic indices is mapped to a fixed label:
   - **Professional**: Work-related tasks.
   - **Personal**: Individual tasks.
   - **Team**: Collaborative work.
   - **Administrative**: Office or documentation-related work.
4. **Assigns a Category**: The most dominant topic is used to classify the task.


In [6]:
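def categorize_tasks(tasks: List[Dict], processed_sentences: List[str]):
    """Advanced categorization using TF-IDF and LDA"""
    # Nothing to categorize if no task was identified
    if not tasks:
        return tasks

    # TF-IDF Vectorization (vocabulary learned from all sentences)
    vectorizer = TfidfVectorizer(max_features=1000)
    tfidf_matrix = vectorizer.fit_transform(processed_sentences)

    # LDA Topic Modeling
    lda_model = LatentDirichletAllocation(n_components=4, random_state=42)
    lda_model.fit(tfidf_matrix)

    # Infer topics for the task sentences only, so that each output row
    # lines up with its task (identify_tasks skips non-task sentences)
    task_texts = [task['processed_task'] for task in tasks]
    lda_output = lda_model.transform(vectorizer.transform(task_texts))

    # Map topics to categories
    category_map = {
        0: 'Professional',
        1: 'Personal',
        2: 'Team',
        3: 'Administrative'
    }

    # Assign categories
    for task, topic_dist in zip(tasks, lda_output):
        dominant_topic = int(np.argmax(topic_dist))
        task['category'] = category_map[dominant_topic]

    return tasks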
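
To see what the topic model hands back, the sketch below runs scikit-learn's `TfidfVectorizer` and `LatentDirichletAllocation` on four made-up two-word documents. The documents, the choice of two topics, and the seed are assumptions for illustration only. The output has one row per input document, each row is a topic distribution, and `argmax` picks the dominant topic.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative documents, already in "processed" form
docs = ["buy snack", "review report", "finalize budget", "clean room"]

tfidf = TfidfVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(tfidf)

print(doc_topics.shape)        # one row per document, one column per topic
print(doc_topics.sum(axis=1))  # each row sums to 1 (up to floating-point error)

# Index of the highest-weight topic for every document
dominant = doc_topics.argmax(axis=1)
print(dominant)
```

Topic indices carry no inherent meaning: which index ends up describing which theme depends on the data and the random seed, so any index-to-label mapping is a naming convention rather than something the model learns.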

This is the main function that runs the entire pipeline:
1. **Preprocesses the text**: Cleans and tokenizes input data.
2. **Identifies tasks**: Detects action-oriented sentences.
3. **Categorizes tasks**: Uses LDA and TF-IDF to assign task types.
4. **Returns a structured output**: Outputs tasks along with associated entities, deadlines, and categories.


In [7]:
def extract_tasks(text: str) -> List[Dict]:
    """Main task extraction pipeline"""
    # Preprocess text
    original_sentences, processed_sentences = preprocess_text(text)

    # Identify tasks
    tasks = identify_tasks(original_sentences, processed_sentences)

    # Categorize tasks
    categorized_tasks = categorize_tasks(tasks, processed_sentences)

    return categorized_tasks
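
Once the pipeline returns its list of dictionaries, downstream code can reorganize it freely. As a sketch, the helper below groups task records by their `category` key. The `group_by_category` function and the hand-written `sample_tasks` records (including their category values) are hypothetical and only mirror the shape of the pipeline output.

```python
from collections import defaultdict
from typing import Dict, List

# Hand-written records in the same shape as the pipeline output (illustrative values only)
sample_tasks: List[Dict] = [
    {"task": "John needs to review the report by Friday.",
     "entity": "John", "deadline": "Friday", "category": "Administrative"},
    {"task": "Rahul should clean the room by 5 pm today.",
     "entity": "Rahul", "deadline": "5 pm today", "category": "Personal"},
    {"task": "Tom will submit the project report by the end of the week.",
     "entity": "Tom", "deadline": "the end of the week", "category": "Administrative"},
]

def group_by_category(tasks: List[Dict]) -> Dict[str, List[Dict]]:
    """Bucket task records by their 'category' value, preserving input order."""
    grouped: Dict[str, List[Dict]] = defaultdict(list)
    for task in tasks:
        grouped[task["category"]].append(task)
    return dict(grouped)

grouped = group_by_category(sample_tasks)
for category, items in grouped.items():
    print(category, [item["entity"] for item in items])
```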

This cell demonstrates the entire pipeline using sample text. It:
1. **Extracts tasks from the input text**.
2. **Displays key details**:
   - The actual task statement.
   - The person/entity responsible.
   - The deadline (if present).
   - The category assigned to the task.


In [8]:
def main():
    text = """
    Rahul wakes up early every day. He goes to college in the morning and comes back at 3 pm.
    At present, Rahul is outside. He has to buy the snacks for all of us.
    Rahul should clean the room by 5 pm today.
    John needs to review the report by Friday.
    Alice needs to finish her homework by 6 pm.
    Bob is planning to go for a run tomorrow morning.
    The team should discuss the project updates in the meeting next week.
    Sarah has to prepare the presentation for the meeting on Monday.
    Tom will submit the project report by the end of the week.
    The group needs to finalize the budget by 3 pm tomorrow.
    """
    tasks = extract_tasks(text)

    print("\nExtracted Tasks:")
    for task in tasks:
        print(f"Task: {task['task']}")
        print(f"Entity: {task['entity']}")
        print(f"Deadline: {task['deadline']}")
        print(f"Category: {task['category']}\n")

if __name__ == "__main__":
    main()


Analyzing Sentence: Rahul wakes up early every day.
Processed Sentence: wake day
Imperative Verb Match: False
Task Phrase Match: False
Not a task. Skipping.

Analyzing Sentence: He goes to college in the morning and comes back at 3 pm.
Processed Sentence: go college morning come pm
Imperative Verb Match: True
Task Phrase Match: False
Sentence identified as a task
Task Entry:
{'task': 'He goes to college in the morning and comes back at 3 pm.', 'processed_task': 'go college morning come pm', 'entity': 'He', 'deadline': None}

Analyzing Sentence: At present, Rahul is outside.
Processed Sentence: present rahul
Imperative Verb Match: False
Task Phrase Match: False
Not a task. Skipping.

Analyzing Sentence: He has to buy the snacks for all of us.
Processed Sentence: buy snack
Imperative Verb Match: True
Task Phrase Match: True
Sentence identified as a task
Task Entry:
{'task': 'He has to buy the snacks for all of us.', 'processed_task': 'buy snack', 'entity': 'He', 'deadline': None}

Analy

### **Analysis and Conclusion: LDA vs. Word Embedding Approach**  

Based on the outputs, here’s a structured analysis of how the two approaches perform in **task identification, categorization, and deadline extraction**.  

---

### **1. Task Identification**  
**Observation:** Both LDA and Word Embedding approaches correctly extracted the same set of tasks from the text.  
✔ **Conclusion:** No significant difference—both methods are equally effective in recognizing tasks.  

---

### **2. Task Categorization**  
This is where the key difference lies.  

#### **LDA Categorization:**
- **Three distinct categories:** **Professional, Team, Administrative.**  
- Categorization appears structured and somewhat logical:
  - **Professional:** College, project discussions.  
  - **Team:** Group tasks, cleaning, runs.  
  - **Administrative:** Reports, presentations, budgeting.  

#### **Word Embedding Categorization:**
- **Two dominant categories:** **Personal** and **Administrative.**  
- **Issue:** Almost all tasks are categorized as **Personal**, including reviewing reports, finalizing budgets, and preparing presentations—these should ideally be **Administrative or Professional**.  

✔ **Conclusion:**  
- **LDA is superior** in categorization because it provides meaningful distinctions between different types of tasks.  
- **Word Embedding fails in categorization**, as it overuses the "Personal" label, making it **less useful for structured task management**.  

---

### **3. Deadline Extraction**  
**Observation:** Both LDA and Word Embedding approaches correctly identified deadlines, with no noticeable errors in temporal recognition.  
✔ **Conclusion:** No significant difference—both methods perform **equally well** in deadline extraction.  

---

### **Final Verdict: Which One is Better?**  

| **Criterion**         | **LDA** | **Word Embedding** |
|----------------------|--------|------------------|
| **Task Identification** | ✅ Good | ✅ Good |
| **Task Categorization** | ✅ Structured (Professional, Team, Administrative) | ❌ Overgeneralized ("Personal" for almost everything) |
| **Deadline Extraction** | ✅ Accurate | ✅ Accurate |

✔ **Final Recommendation: Use LDA.**  
- LDA provides **better categorization** and keeps tasks structured.  
- Word Embedding fails to categorize tasks meaningfully, reducing its usefulness.  
- Both perform equally well in extracting deadlines and identifying tasks.  
