# DSPy Practical Assignment: Structuring Unstructured Data

This notebook implements the full assignment: it scrapes 10 URLs, processes the text with a DSPy pipeline, and generates the deliverables (Mermaid diagrams, a CSV file, and this notebook).

## Setup
- Install dependencies.
- Set up API key (replace with your own from the assignment instructions).
- Configure DSPy.

In [None]:
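# Install required packages: DSPy for AI pipelines, requests for HTTP, BeautifulSoup for HTML parsing.
# Keep this comment on its own line: on shells that do not treat "#" as a comment marker
# (e.g. Windows cmd), a trailing "# ..." is passed to pip as a requirement and the install fails.
#!pip install dspy-ai requests beautifulsoup4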

ERROR: Invalid requirement: '#': Expected package name at the start of dependency specifier
    #
    ^


In [2]:
import json  # For handling JSON data structures
import dspy  # Import DSPy library for AI pipeline components
import copy  # For deep copying objects if needed
from typing import List, Optional  # Type hints for lists and optional types
from typing import Literal, Dict, Union  # Additional type hints for literals, dictionaries, and unions
from dspy.adapters import XMLAdapter  # Import XML adapter for DSPy
import requests  # For making HTTP requests to scrape websites
from bs4 import BeautifulSoup  # For parsing HTML content
import csv  # For writing CSV files with extracted data

# API Key Setup: Replace with your own key from https://scribehow.com/viewer/Sign_Up_for_Longcat_API_Platform__9sYiobPNS0OnXzyxKHu4zg?add_to_team_with_invite=True&sharer_domain=gmail.com&sharer_id=e0b8270f-e494-45b1-b41a-c6adf9f11845
API_KEY = 'ak_1Nm8PI6Zb0aR97X1wc9Zs32K8DZ06'  # Set the API key for LongCat API access

main_lm = dspy.LM("openai/LongCat-Flash-Chat", api_key=API_KEY, api_base="https://api.longcat.chat/openai/v1")  # Initialize the language model with LongCat API

dspy.settings.configure(lm=main_lm, adapter=dspy.XMLAdapter())  # Configure DSPy settings with the language model and XML adapter
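
A key written directly into a shared notebook is easy to leak. As a minimal sketch, the key could instead be read from an environment variable; the variable name `LONGCAT_API_KEY` and the helper `load_api_key` are assumptions made for illustration, not part of the assignment instructions.

```python
import os


def load_api_key(var_name: str = "LONGCAT_API_KEY") -> str:
    """Return the API key stored in the environment variable `var_name`.

    `LONGCAT_API_KEY` is a hypothetical name chosen for this sketch.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key to the API
        raise RuntimeError(f"Set the {var_name} environment variable before running the notebook")
    return key
```

With this in place, `API_KEY = load_api_key()` would replace the literal assignment, and the rest of the setup cell stays unchanged.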

## DSPy Pipeline Components

Define the signatures and predictors for entity extraction, deduplication, and relation extraction.

In [3]:
from pydantic import BaseModel, Field  # Import Pydantic for data validation and settings management

# Data model for an entity with its semantic attribute type
class EntityWithAttr(BaseModel):
    entity: str = Field(description="the named entity")
    attr_type: str = Field(description="semantic type of the entity (e.g. Drug, Disease, Symptom, etc.)")

class ExtractEntities(dspy.Signature):
    """From the paragraph extract all relevant entities and their semantic attribute types."""
    paragraph: str = dspy.InputField(desc="input paragraph")
    entities: List[EntityWithAttr] = dspy.OutputField(desc="list of entities and their attribute types")

# Predictor for extracting entities from text
extractor = dspy.Predict(ExtractEntities)

# Deduplication logic
class DeduplicateEntities(dspy.Signature):
    """Given a list of (entity, attr_type) decide which ones are duplicates.
    Return a deduplicated list and a confidence that the remaining items are ALL distinct."""
    items: List[EntityWithAttr] = dspy.InputField(desc="batch of entities to deduplicate")
    deduplicated: List[EntityWithAttr] = dspy.OutputField(desc="deduplicated list")
    confidence: float = dspy.OutputField(
        desc="confidence (0-1) that every item in deduplicated is semantically distinct"
    )

# Predictor for deduplicating entities using chain of thought reasoning
dedup_predictor = dspy.ChainOfThought(DeduplicateEntities)

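# Batched deduplication with a bounded number of retries
def deduplicate_with_lm(
    items: List[EntityWithAttr],
    *,
    batch_size: int = 10,
    target_confidence: float = 0.9,
    max_attempts: int = 3,
) -> List[EntityWithAttr]:
    """
    Deduplicate in batches using the LM.
    Works by:
      1. splitting into batches of `batch_size`
      2. for each batch asking the LM for duplicates + confidence
      3. rerunning the batch until confidence >= target_confidence,
         giving up after `max_attempts` and keeping the most confident answer
      4. concatenating results from all batches

    Batches are processed independently, so duplicates that fall into
    different batches are not merged.
    """
    if not items:
        return []

    # helper to process one batch
    def _process_batch(batch: List[EntityWithAttr]) -> List[EntityWithAttr]:
        best_items, best_conf = batch, -1.0
        for _ in range(max_attempts):
            pred = dedup_predictor(items=batch)
            if pred.confidence >= target_confidence:
                return pred.deduplicated
            if pred.confidence > best_conf:
                best_items, best_conf = pred.deduplicated, pred.confidence
            # Identical calls may be served from DSPy's cache, so a retry can
            # repeat the same answer; the attempt cap guarantees termination.
        return best_items

    # split into batches and process
    results = []
    for i in range(0, len(items), batch_size):
        batch = items[i : i + batch_size]
        results.extend(_process_batch(batch))
    return results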

# Data model for a subject-predicate-object relation
class Relation(BaseModel):
    subj: str = Field(description="subject entity (exact string as in deduplicated list)")
    pred: str = Field(description="short predicate / relation phrase")
    obj:  str = Field(description="object entity (exact string as in deduplicated list)")

# Signature for extracting relations from a paragraph given a list of entities
class ExtractRelations(dspy.Signature):
    """Given the original paragraph and a list of unique entities, extract all factual (subject, predicate, object) triples that are explicitly stated or clearly implied."""
    paragraph: str = dspy.InputField(desc="original paragraph")
    entities:  List[str] = dspy.InputField(desc="list of deduplicated entity strings")
    relations: List[Relation] = dspy.OutputField(desc="list of subject-predicate-object triples")

# Predictor for extracting relations using chain of thought reasoning
rel_predictor = dspy.ChainOfThought(ExtractRelations)
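
Because the LM-based deduplication looks at one batch at a time, entities that differ only in case or spacing can survive when they land in different batches. One possible complement is a cheap exact-match pre-pass before the LM is called. The sketch below is an assumption, not part of the assignment: `dedupe_exact` is a hypothetical helper, and it works on plain `(entity, attr_type)` tuples rather than `EntityWithAttr` so that it runs on its own.

```python
from typing import Iterable, List, Tuple


def dedupe_exact(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop pairs whose entity string repeats, ignoring case and extra whitespace.

    The first occurrence of each entity is kept, together with its attr_type.
    """
    seen = set()
    unique = []
    for entity, attr_type in pairs:
        # Normalise: lowercase and collapse runs of whitespace
        key = " ".join(entity.lower().split())
        if key and key not in seen:
            seen.add(key)
            unique.append((entity.strip(), attr_type))
    return unique
```

Semantic duplicates such as abbreviations or synonyms are untouched by this pass and would still need the LM step.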

## Mermaid Serialization

Function to convert relations to Mermaid diagram.

In [4]:
import re  # For building safe Mermaid node IDs

# Function to convert triples to Mermaid flowchart
def triples_to_mermaid(
    triples: list[Relation],
    entity_list: list[str],
    max_label_len: int = 40
) -> str:
    """
    Convert triples to a valid Mermaid `flowchart LR` diagram.
    Triples that mention no known entity are skipped.
    """
    entity_set = {e.strip().lower() for e in entity_list}
    lines = ["flowchart LR"]

    def _make_id(s: str) -> str:
        # Create a valid Mermaid node ID: collapse every run of non-word characters to "_"
        return re.sub(r"\W+", "_", s.strip()).strip("_") or "node"

    def _escape(s: str) -> str:
        # A raw double quote would end the quoted node label early
        return s.strip().replace('"', "#quot;")

    for t in triples:
        subj_norm, obj_norm = t.subj.strip().lower(), t.obj.strip().lower()

        # Keep the triple only if at least one end is a known entity
        if subj_norm not in entity_set and obj_norm not in entity_set:
            continue

        # Draw subject -> object so the arrow direction matches the predicate
        src, dst = t.subj, t.obj

        # Sanitize label: "|" delimits the edge label, quotes can break parsing
        lbl = t.pred.strip().replace("|", "/").replace('"', "'")
        if len(lbl) > max_label_len:
            lbl = lbl[:max_label_len - 3] + "..."

        # Use valid IDs with display labels
        src_id, dst_id = _make_id(src), _make_id(dst)
        lines.append(f'    {src_id}["{_escape(src)}"] -->|{lbl}| {dst_id}["{_escape(dst)}"]')

    return "\n".join(lines)
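
To make the target format concrete, here is a small standalone sketch of the kind of text the serializer is expected to emit. The helper `mermaid_edge` and the sample triple are illustrative assumptions, not data from the scraped pages, and the ID handling is deliberately simpler than in the function above.

```python
def mermaid_edge(src: str, label: str, dst: str) -> str:
    """Format one labelled flowchart edge.

    Node IDs are the names with spaces replaced by underscores;
    the original names are kept as quoted display labels.
    """
    src_id = src.strip().replace(" ", "_")
    dst_id = dst.strip().replace(" ", "_")
    return f'    {src_id}["{src}"] -->|{label}| {dst_id}["{dst}"]'


diagram = "\n".join([
    "flowchart LR",
    mermaid_edge("Crop rotation", "improves", "Soil health"),  # made-up sample triple
])
print(diagram)
# flowchart LR
#     Crop_rotation["Crop rotation"] -->|improves| Soil_health["Soil health"]
```

Separating the node ID from the display label is what lets entity names with spaces appear in the rendered diagram without breaking the Mermaid syntax.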

## URLs and Processing

List of URLs to scrape. Note: Some URLs may fail. A few may be invalid (future dates or typos), and some sites reject scripted requests with 400 or 403 responses. The code logs the error and skips those URLs.
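
The skip-and-log behaviour can be sketched on its own, separate from scraping. In this minimal version, which is an assumption rather than the notebook's actual loop, a hypothetical `run_all` helper calls a per-URL `handler` and collects failures in a dictionary so they can be reviewed after the run instead of only being printed.

```python
from typing import Callable, Dict, List, Tuple


def run_all(
    urls: List[str],
    handler: Callable[[str], None],
) -> Tuple[List[str], Dict[str, str]]:
    """Call `handler` on every URL; return (succeeded URLs, {failed URL: error text})."""
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    for url in urls:
        try:
            handler(url)
            succeeded.append(url)
        except Exception as exc:
            # One bad URL should not stop the remaining ones
            failed[url] = f"{type(exc).__name__}: {exc}"
    return succeeded, failed
```

Keeping the failures in a structure makes it straightforward to report which of the 10 URLs produced no deliverables.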

In [5]:
# List of URLs to process
urls = [
    "https://en.wikipedia.org/wiki/Sustainable_agriculture",
    "https://www.nature.com/articles/d41586-025-03353-5",
    "https://www.sciencedirect.com/science/article/pii/S1043661820315152",
    "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10457221/",
    "https://www.fao.org/3/y4671e/y4671e06.htm",
    "https://www.medscape.com/viewarticle/time-reconsider-tramadol-chronic-pain-2025a1000ria",
    "https://www.sciencedirect.com/science/article/pii/S0378378220307088",
    "https://www.frontiersin.org/news/2025/09/01/rectangle-telescope-finding-habitable-planets",
    "https://www.medscape.com/viewarticle/second-dose-boosts-shingles-protection-adults-aged-65-years-2025a1000ro7",
    "https://www.theguardian.com/global-development/2025/oct/13/astro-ambassadors-stargazers-himalayas-hanle-ladakh-india"
]

csv_data = []  # List to hold CSV rows

for i, url in enumerate(urls, 1):
    try:
        print(f"Processing URL {i}: {url}")
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract text from paragraphs (basic scraping; may need refinement for complex sites)
        paragraphs = soup.find_all('p')
        text = ' '.join([p.get_text().strip() for p in paragraphs if p.get_text().strip()])
        if not text:
            raise ValueError("No text extracted")
        # Truncate if too long (keeps the prompt within the model's context window)
        if len(text) > 10000:
            text = text[:10000] + "..."
        
        # Extract entities
        extracted = extractor(paragraph=text)
        
        # Deduplicate
        unique = deduplicate_with_lm(extracted.entities)
        
        # Add to CSV data (deduplicated per URL)
        for e in unique:
            csv_data.append({'link': url, 'tag': e.entity, 'tag_type': e.attr_type})
        
        # Entity strings for relations
        entity_strings = [e.entity for e in unique]
        
        # Extract relations
        rel_out = rel_predictor(paragraph=text, entities=entity_strings)
        
        # Generate Mermaid
        mermaid_code = triples_to_mermaid(rel_out.relations, entity_strings)
        
        # Save Mermaid diagram
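        # Write as UTF-8 so non-ASCII entity names do not fail on platforms with another default encoding
        with open(f'mermaid_{i}.md', 'w', encoding='utf-8') as f:
            f.write("```mermaid\n" + mermaid_code + "\n```")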
        
        print(f"Completed URL {i}")
    
    except Exception as e:
        print(f"Error processing URL {i} ({url}): {e}")
        # Skip invalid URLs

# Save CSV
with open('tags.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['link', 'tag', 'tag_type'])
    writer.writeheader()
    writer.writerows(csv_data)

print("Processing complete. Check files: mermaid_1.md to mermaid_10.md and tags.csv")

Processing URL 1: https://en.wikipedia.org/wiki/Sustainable_agriculture
Completed URL 1
Processing URL 2: https://www.nature.com/articles/d41586-025-03353-5
Completed URL 2
Processing URL 3: https://www.sciencedirect.com/science/article/pii/S1043661820315152
Error processing URL 3 (https://www.sciencedirect.com/science/article/pii/S1043661820315152): 400 Client Error: Bad Request for url: https://www.sciencedirect.com/unsupported_browser
Processing URL 4: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10457221/
Error processing URL 4 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10457221/): 403 Client Error: Forbidden for url: https://pmc.ncbi.nlm.nih.gov/articles/PMC10457221/
Processing URL 5: https://www.fao.org/3/y4671e/y4671e06.htm
Completed URL 5
Processing URL 6: https://www.medscape.com/viewarticle/time-reconsider-tramadol-chronic-pain-2025a1000ria
Error processing URL 6 (https://www.medscape.com/viewarticle/time-reconsider-tramadol-chronic-pain-2025a1000ria): 403 Client Error: Fo