# **PART A: Task: Extract and Categorize Tasks from Unannotated Text**

# **Introduction**  

This assignment builds a **complete NLP pipeline** that extracts and categorizes tasks from unstructured text. The pipeline splits the text into sentences, identifies the sentences that describe a task, extracts who is responsible and by when, assigns a category, and writes the results as structured JSON.  

The deliverables for this assignment include:  
1. **A short video walkthrough** demonstrating the code and its output.  
2. **A well-documented Notebook**, including modular functions for **preprocessing, task extraction, categorization, and validation** using manually curated test cases.  

### **Overview of the NLP Pipeline**  

The **full NLP pipeline** code is provided, covering all the essential steps for **task extraction, categorization, and structuring output.** The workflow consists of four key steps:  

- **Step 1: Preprocessing the Text and Extracting Tasks from Text**  
  - Tokenization and sentence segmentation  
  - Identifying key task-related phrases  
  - Extracting potential tasks from sentences  

- **Step 2: Categorizing Extracted Tasks**  
  - Assigning tasks to relevant categories (Work, Personal, Academic, Health, etc.)  
  - Matching task text against per-category **keyword lists** (topic modeling is a possible extension; the code in this notebook uses keywords only)  

- **Step 3: Extracting "Who" and "When" from Tasks**  
  - Identifying **who** is responsible for the task  
  - Extracting **when** (deadline/time) using **regex and Named Entity Recognition (NER)**  

- **Step 4: Structuring the Final Output in JSON Format**  
  - Formatting extracted tasks into a **structured JSON**  
  - Ensuring data is well-structured for storage and API use  

This approach ensures that tasks are accurately identified, categorized, and structured in a **machine-readable format**, making it easier for downstream applications to process the information.  

The code is designed to be **modular, reusable, and efficient**, following best practices in NLP and text processing. **All functions are well-documented,** and test cases have been included to validate the results.  



# **Extract and Categorize Tasks: NLP Pipeline**

In [1]:
import spacy
import re
import json
from collections import defaultdict

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

# Task-related phrases
TASK_KEYWORDS = [
    "has to", "needs to", "should", "must", "is required to", "is expected to", 
    "is supposed to", "is scheduled to", "is assigned to"
]

# Deadline patterns
TIME_PATTERNS = [
    r'\bby\s+\d{1,2}\s*(am|pm)?\b',  # "by 5 pm"
    r'\bbefore\s+\w+\b',  # "before tomorrow"
    r'\btomorrow\b',
    r'\btoday\b',
    r'\bin\s+\d+\s+\w+\b',  # "in 3 hours"
    r'\bby\s+(?:the\s+)?end\s+of\s+the\s+day\b',  # "by end of the day" or "by the end of the day"
    r'\bwithin\s+\d+\s+(hours|days|minutes)\b'
]

# Categorization keywords
TASK_CATEGORIES = {
    "Personal": ["buy", "get", "shop", "visit"],
    "Academic": ["submit", "study", "complete", "assignment", "exam", "project"],
    "Work": ["send", "email", "call", "schedule", "meeting", "review"],
    "Household": ["clean", "wash", "cook", "arrange", "fix"],
    "Health": ["exercise", "run", "walk", "meditate"],
    "Finance": ["pay", "invest", "deposit", "withdraw", "budget"]
}

def extract_and_categorize_pipeline(text):
    """
    NLP pipeline to extract tasks, categorize them, and structure the output.
    """
    doc = nlp(text)
    extracted_tasks = []

    for sent in doc.sents:
        sentence = sent.text.strip()
        sentence_lower = sentence.lower()
        
        # Check if the sentence contains a task-related phrase
        if any(keyword in sentence_lower for keyword in TASK_KEYWORDS):
            task = {"who": None, "task": None, "deadline": None, "category": "Uncategorized"}

            # Extract the subject (who)
            subjects = [token.text for token in sent if token.dep_ in {"nsubj", "nsubjpass"} and token.pos_ in {"PROPN", "PRON"}]
            task["who"] = ", ".join(subjects) if subjects else "Unknown"

            # Extract the task (everything after the task phrase)
            task_start = -1
            for keyword in TASK_KEYWORDS:
                if keyword in sentence_lower:
                    task_start = sentence_lower.find(keyword) + len(keyword)
                    break
            
            if task_start != -1:
                task_text = sentence[task_start:].strip()
                task["task"] = task_text

            # Extract deadline
            for pattern in TIME_PATTERNS:
                match = re.search(pattern, sentence_lower)
                if match:
                    task["deadline"] = match.group(0)
                    break

            # Use NER for additional date extraction; a DATE or TIME entity, if found, overrides the regex match above
            for ent in sent.ents:
                if ent.label_ in {"DATE", "TIME"}:
                    task["deadline"] = ent.text
                    break

            # Categorize the task: the first category with a keyword found as a substring of the task text wins
            for category, keywords in TASK_CATEGORIES.items():
                if any(keyword in task["task"].lower() for keyword in keywords):
                    task["category"] = category
                    break

            # Store valid tasks
            if task["task"]:
                extracted_tasks.append(task)

    return extracted_tasks


def format_output(tasks):
    """
    Convert extracted tasks into a structured JSON output.
    """
    return json.dumps(tasks, indent=4)

In [2]:
# Test Cases
test_texts = [
    "Rahul wakes up early every day. He goes to college in the morning and comes back at 3 pm. At present, Rahul is outside. He has to buy the snacks for all of us. He also needs to submit his assignment by 5 pm.",
    
    "John should complete the project before Friday. Alice must send the email by 10 am tomorrow. They have to attend the meeting at 2 pm.",
    
    "David is required to pay the electricity bill today. Sarah needs to visit the dentist by the end of the day. James should invest in stocks next month.",
    
    "Maya has to clean the kitchen by 6 pm. Adam is supposed to arrange the books in the library. Tom needs to exercise in the morning."
]

# Run NLP pipeline
for idx, text in enumerate(test_texts, 1):
    print(f"\nTest Case {idx}:")
    extracted_tasks = extract_and_categorize_pipeline(text)
    print(format_output(extracted_tasks))


Test Case 1:
[
    {
        "who": "He",
        "task": "buy the snacks for all of us.",
        "deadline": null,
        "category": "Personal"
    },
    {
        "who": "He",
        "task": "submit his assignment by 5 pm.",
        "deadline": "5 pm",
        "category": "Academic"
    }
]

Test Case 2:
[
    {
        "who": "John",
        "task": "complete the project before Friday.",
        "deadline": "Friday",
        "category": "Academic"
    },
    {
        "who": "Alice",
        "task": "send the email by 10 am tomorrow.",
        "deadline": "10 am tomorrow",
        "category": "Work"
    }
]

Test Case 3:
[
    {
        "who": "David",
        "task": "pay the electricity bill today.",
        "deadline": "today",
        "category": "Finance"
    },
    {
        "who": "Sarah",
        "task": "visit the dentist by the end of the day.",
        "deadline": "the end of the day",
        "category": "Personal"
    },
    {
        "who": "James",
        "task

In [3]:
text_1 = "John needs to submit the report by Monday."
extract_and_categorize_pipeline(text_1)

[{'who': 'John',
  'task': 'submit the report by Monday.',
  'deadline': 'Monday',
  'category': 'Academic'}]

In [4]:
text_2 = "Mark needs to finalize the budget proposal."
extract_and_categorize_pipeline(text_2)

[{'who': 'Mark',
  'task': 'finalize the budget proposal.',
  'deadline': None,
  'category': 'Personal'}]
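
The `'Personal'` label above comes from substring matching: the pipeline tests `keyword in task["task"].lower()`, and the Personal keyword "get" occurs inside "budget", so the Finance keyword "budget" is never reached. The following is a minimal sketch of whole-word matching, assuming the same keyword lists. `SKETCH_CATEGORIES` and `categorize_whole_word` are hypothetical names used only for illustration and are not part of the pipeline above.

```python
import re

# Copy of the keyword lists from the full pipeline (renamed so the sketch
# does not overwrite the notebook's own TASK_CATEGORIES).
SKETCH_CATEGORIES = {
    "Personal": ["buy", "get", "shop", "visit"],
    "Academic": ["submit", "study", "complete", "assignment", "exam", "project"],
    "Work": ["send", "email", "call", "schedule", "meeting", "review"],
    "Household": ["clean", "wash", "cook", "arrange", "fix"],
    "Health": ["exercise", "run", "walk", "meditate"],
    "Finance": ["pay", "invest", "deposit", "withdraw", "budget"],
}


def categorize_whole_word(task_text, categories=SKETCH_CATEGORIES):
    """Hypothetical helper: return the first category with a keyword that
    appears as a whole word in the task text."""
    # Split into lowercase alphabetic words so "budget" is one unit.
    words = set(re.findall(r"[a-z]+", task_text.lower()))
    for category, keywords in categories.items():
        if words.intersection(keywords):
            return category
    return "Uncategorized"


print(categorize_whole_word("finalize the budget proposal."))  # Finance
print(categorize_whole_word("buy the snacks for all of us."))  # Personal
```

Whole-word matching still depends on dictionary order when several categories match, and it does not handle inflected forms such as "buying"; lemma-based matching would be one way to cover those.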

# **Step 1: Preprocessing the Text and Extracting Tasks from Text**

In [5]:
# Example input
text = """Rahul wakes up early every day. He goes to college in the morning and comes back at 3 pm.
At present, Rahul is outside. He has to buy the snacks for all of us. 
He also needs to submit his assignment by 5 pm."""

In [6]:
import spacy
import re

# Load English NLP model
nlp = spacy.load("en_core_web_sm")

# Task-related phrases
TASK_KEYWORDS = ["has to", "needs to", "should", "must", "is required to", "is expected to"]

# Deadline patterns
TIME_PATTERNS = [
    r'\bby\s+\d{1,2}\s*(am|pm)?\b',  # "by 5 pm"
    r'\bbefore\s+\w+\b',  # "before tomorrow"
    r'\btomorrow\b',
    r'\btoday\b',
    r'\bin\s+\d+\s+\w+\b'  # "in 3 hours"
]

def extract_tasks(text):
    """
    Extracts tasks from unstructured text, capturing:
    - Who is responsible (subject)
    - The action to be done (verb + object)
    - Any deadline (if mentioned)
    """
    doc = nlp(text)
    extracted_tasks = []

    for sent in doc.sents:
        sentence = sent.text  # Original sentence
        sentence_lower = sentence.lower()  # Lowercased for keyword matching
        
        # DEBUG: Print sentences being processed
        print(f"\nProcessing Sentence: {sentence}")

        # Check if any task-related phrase is in the sentence
        if any(keyword in sentence_lower for keyword in TASK_KEYWORDS):
            task = {"who": None, "task": None, "deadline": None}

            # Extract the subject (who is responsible)
            for token in sent:
                if token.dep_ in {"nsubj", "nsubjpass"} and token.pos_ in {"PROPN", "PRON"}:
                    task["who"] = token.text
                    break  # Capture only the first valid subject

            # Extract the task (after task keyword)
            task_start = -1
            for keyword in TASK_KEYWORDS:
                if keyword in sentence_lower:
                    task_start = sentence_lower.find(keyword) + len(keyword)
                    break

            if task_start != -1:
                task_text = sentence[task_start:].strip()  # Extract everything after the task phrase
                task["task"] = task_text
            
            # Extract deadlines
            for pattern in TIME_PATTERNS:
                match = re.search(pattern, sentence_lower)
                if match:
                    task["deadline"] = match.group(0)
                    break

            # DEBUG: Print extracted task details
            print(f"Extracted Task: {task}")

            if task["task"]:
                extracted_tasks.append(task)

    return extracted_tasks

# Extract tasks
tasks = extract_tasks(text)

# Print final extracted tasks
print("\nFinal Extracted Tasks:")
for task in tasks:
    print(task)


Processing Sentence: Rahul wakes up early every day.

Processing Sentence: He goes to college in the morning and comes back at 3 pm.


Processing Sentence: At present, Rahul is outside.

Processing Sentence: He has to buy the snacks for all of us. 

Extracted Task: {'who': 'He', 'task': 'buy the snacks for all of us.', 'deadline': None}

Processing Sentence: He also needs to submit his assignment by 5 pm.
Extracted Task: {'who': 'He', 'task': 'submit his assignment by 5 pm.', 'deadline': 'by 5 pm'}

Final Extracted Tasks:
{'who': 'He', 'task': 'buy the snacks for all of us.', 'deadline': None}
{'who': 'He', 'task': 'submit his assignment by 5 pm.', 'deadline': 'by 5 pm'}
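
The keyword list matches "has to" and "needs to" but not other inflections, so a sentence such as "They have to attend the meeting at 2 pm." from the earlier test cases produces no task. Plain substring checks can also fire inside longer words ("should" inside "shoulder"). One possible refinement, sketched below under those assumptions, is a single compiled pattern with word boundaries. `OBLIGATION_RE` and `split_on_obligation` are hypothetical names, not part of the notebook's code.

```python
import re

# Hypothetical pattern: covers singular and plural forms and requires
# word boundaries on both sides of the phrase.
OBLIGATION_RE = re.compile(
    r"\b(?:(?:has|have)\s+to|needs?\s+to|should|must"
    r"|(?:is|are)\s+(?:required|expected|supposed|scheduled|assigned)\s+to)\b",
    re.IGNORECASE,
)


def split_on_obligation(sentence):
    """Split a sentence at its earliest obligation phrase.

    Returns (text_before, task_text), or None when no phrase is found.
    """
    match = OBLIGATION_RE.search(sentence)
    if match is None:
        return None
    return sentence[:match.start()].strip(), sentence[match.end():].strip()


print(split_on_obligation("They have to attend the meeting at 2 pm."))
# ('They', 'attend the meeting at 2 pm.')
print(split_on_obligation("He hurt his shoulder yesterday."))
# None
```

Unlike the loop over `TASK_KEYWORDS`, which takes the first keyword in list order, `re.search` returns the earliest match in the sentence, so the task text starts after the first obligation phrase the reader actually encounters.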


# **Step 2: Categorizing Extracted Tasks**

In [7]:
# Dictionary of categories with associated keywords
TASK_CATEGORIES = {
    "Personal": ["buy", "get", "shop", "visit"],
    "Academic": ["submit", "study", "complete", "assignment", "exam", "project"],
    "Work": ["send", "email", "call", "schedule", "meeting"],
    "Household": ["clean", "wash", "cook", "arrange", "fix"],
}

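# Function to categorize tasks
def categorize_task(task_description):
    """
    Return the first category in TASK_CATEGORIES that has a keyword equal to
    a token of the task description, or "Uncategorized" if none match.
    """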
    doc = nlp(task_description.lower())  # Process task with spaCy
    for category, keywords in TASK_CATEGORIES.items():
        if any(token.text in keywords for token in doc):  # Match keywords
            return category  # Return first matched category
    return "Uncategorized"  # Default if no match found

# Categorize extracted tasks
def categorize_extracted_tasks(extracted_tasks):
    categorized_tasks = []
    for task in extracted_tasks:
        category = categorize_task(task["task"])  # Get the task description for categorization
        categorized_tasks.append((task, category))
    return categorized_tasks

# Example usage
categorized_tasks = categorize_extracted_tasks(tasks)

# Print categorized tasks
print("\nCategorized Tasks:")
for task, category in categorized_tasks:
    print(f"Task: {task['task']} → Category: {category}")


Categorized Tasks:
Task: buy the snacks for all of us. → Category: Personal
Task: submit his assignment by 5 pm. → Category: Academic
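
`categorize_task` returns the first category with any matching token, so the order of `TASK_CATEGORIES` decides the result whenever a task mentions keywords from more than one category. A possible alternative, sketched here as an assumption rather than as the notebook's method, is to count keyword hits per category and pick the highest count. The sketch uses a simple regex word split instead of spaCy tokens so that it is self-contained; `SKETCH_CATEGORIES`, `score_categories`, and `categorize_by_score` are hypothetical names.

```python
import re
from collections import Counter

# Copy of the Step 2 keyword lists (renamed so the sketch does not
# overwrite the notebook's own TASK_CATEGORIES).
SKETCH_CATEGORIES = {
    "Personal": ["buy", "get", "shop", "visit"],
    "Academic": ["submit", "study", "complete", "assignment", "exam", "project"],
    "Work": ["send", "email", "call", "schedule", "meeting"],
    "Household": ["clean", "wash", "cook", "arrange", "fix"],
}


def score_categories(task_description, categories=SKETCH_CATEGORIES):
    """Count, per category, how many words of the task are keywords."""
    words = re.findall(r"[a-z]+", task_description.lower())
    scores = Counter()
    for category, keywords in categories.items():
        hits = sum(1 for word in words if word in keywords)
        if hits:
            scores[category] = hits
    return scores


def categorize_by_score(task_description, categories=SKETCH_CATEGORIES):
    """Return the category with the most keyword hits."""
    scores = score_categories(task_description, categories)
    if not scores:
        return "Uncategorized"
    # max() keeps the first maximal key, so ties fall back to dictionary order.
    return max(scores, key=scores.get)


print(categorize_by_score("buy books and study for the exam"))  # Academic
print(categorize_by_score("submit his assignment by 5 pm."))    # Academic
```

In the first example a first-match strategy would return "Personal" because of "buy", while the count favors "Academic" with two hits ("study", "exam").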


# **Step 3: Extracting "Who" and "When" from Tasks**

In [8]:
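# Function to read the "Who" and "When" fields from a task
def extract_who_and_when(task):
    """
    Return the (who, when) pair for a task dictionary. Both values were
    already extracted in Step 1 and are stored under "who" and "deadline".
    """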
    who = task["who"]
    when = task["deadline"]
    return who, when

# Extract "Who" and "When" from categorized tasks
def extract_who_and_when_from_tasks(categorized_tasks):
    who_when_info = []
    for task, _ in categorized_tasks:
        who, when = extract_who_and_when(task)
        who_when_info.append((who, when))
    return who_when_info

# Example usage
who_when_info = extract_who_and_when_from_tasks(categorized_tasks)

# Print extracted "Who" and "When"
print("\nWho and When Information:")
for who, when in who_when_info:
    print(f"Who: {who}, When: {when}")



Who and When Information:
Who: He, When: None
Who: He, When: by 5 pm
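
The "When" values above are raw strings such as "by 5 pm". If downstream code needs to compare or sort deadlines, the clock time could be normalized first. The sketch below handles only the `<hour>[:<minute>] am|pm` form and leaves relative expressions such as "tomorrow" untouched; `CLOCK_RE` and `parse_clock_deadline` are hypothetical names introduced for illustration.

```python
import re
from datetime import time

# Matches "5 pm", "10am", "10:30 am" (case-insensitive).
CLOCK_RE = re.compile(r"\b(\d{1,2})(?::(\d{2}))?\s*(am|pm)\b", re.IGNORECASE)


def parse_clock_deadline(deadline):
    """Return a datetime.time for strings like 'by 5 pm'.

    Returns None when the deadline is missing or contains no valid
    12-hour clock time.
    """
    if not deadline:
        return None
    match = CLOCK_RE.search(deadline)
    if match is None:
        return None
    hour = int(match.group(1))
    minute = int(match.group(2) or 0)
    meridiem = match.group(3).lower()
    if not 1 <= hour <= 12 or minute > 59:
        return None
    # 12 am maps to hour 0 and 12 pm to hour 12.
    hour = hour % 12 + (12 if meridiem == "pm" else 0)
    return time(hour, minute)


print(parse_clock_deadline("by 5 pm"))   # 17:00:00
print(parse_clock_deadline("tomorrow"))  # None
print(parse_clock_deadline(None))        # None
```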


# **Step 4: Structuring the Final Output in JSON Format**

In [9]:
import json

# Function to structure the final output in JSON format
def structure_final_output_json(who_when_info, categorized_tasks):
    final_output = []
    for (who, when), (task, category) in zip(who_when_info, categorized_tasks):
        final_output.append({
            "Task": task["task"],
            "Who": who,
            "When": when,
            "Category": category
        })
    return json.dumps(final_output, indent=4)  # Return the output as a pretty-printed JSON string

# Example usage
final_output_json = structure_final_output_json(who_when_info, categorized_tasks)

# Print the final structured JSON output
print("\nFinal Structured Output (JSON):")
print(final_output_json)


Final Structured Output (JSON):
[
    {
        "Task": "buy the snacks for all of us.",
        "Who": "He",
        "When": null,
        "Category": "Personal"
    },
    {
        "Task": "submit his assignment by 5 pm.",
        "Who": "He",
        "When": "by 5 pm",
        "Category": "Academic"
    }
]
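
Before the JSON is stored or returned from an API, a quick structural check can catch missing fields. The sketch below assumes the Step 4 key names ("Task", "Who", "When", "Category"); the full pipeline at the top of the notebook uses lowercase keys ("who", "task", "deadline", "category"), so `REQUIRED_KEYS` would need adjusting for that output. `REQUIRED_KEYS` and `validate_task_records` are hypothetical names, not part of the notebook's code.

```python
import json

# Keys expected in each record of the Step 4 output (an assumption of this sketch).
REQUIRED_KEYS = {"Task", "Who", "When", "Category"}


def validate_task_records(json_text, required_keys=REQUIRED_KEYS):
    """Return a list of problems found in the JSON text; empty means valid."""
    records = json.loads(json_text)
    if not isinstance(records, list):
        return ["top-level JSON value is not a list"]
    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {index}: not an object")
            continue
        missing = required_keys - set(record)
        if missing:
            problems.append(f"record {index}: missing {sorted(missing)}")
        if not record.get("Task"):
            problems.append(f"record {index}: empty Task")
    return problems


sample = json.dumps([
    {"Task": "buy the snacks for all of us.", "Who": "He",
     "When": None, "Category": "Personal"},
    {"Task": "submit his assignment by 5 pm.", "Who": "He"},
])
print(validate_task_records(sample))
# ["record 1: missing ['Category', 'When']"]
```

A `None` value for "When" passes the check because the key is present; only absent keys and empty task descriptions are reported.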
