# M3: Entity Extraction with Modal + Qwen3-32B

This notebook runs the entity extraction pipeline using Modal cloud compute with Qwen3-32B on A100-80GB GPUs.

## Pipeline Overview

1. **Export** chunks from MongoDB to JSON
2. **Upload & Run** extraction on Modal
3. **Import** results back to MongoDB
4. **Verify** extraction quality

## 0. Setup

In [1]:
# Install Modal if not already installed
!pip install -q modal pymongo


In [1]:
import json
import os
from datetime import datetime
from pymongo import MongoClient

# MongoDB connection
# Use 'mongodb' hostname when running inside Docker, 'localhost' when running on host
MONGO_HOST = os.environ.get("MONGO_HOST", "mongodb")  # Docker service name
MONGO_URI = f"mongodb://erica:erica_password_123@{MONGO_HOST}:27017/"
DB_NAME = "erica"

client = MongoClient(MONGO_URI)
db = client[DB_NAME]

print(f"Connected to MongoDB: {DB_NAME}")
print(f"Collections: {db.list_collection_names()}")

Connected to MongoDB: erica
Collections: ['failures', 'extractions', 'resources', 'chunks', 'pages']
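
The connection cell above embeds the username and password directly in the URI. A minimal sketch of building the same URI from environment variables follows; the variable names `MONGO_USER`, `MONGO_PASSWORD`, and `MONGO_PORT` are assumptions for illustration (only `MONGO_HOST` appears in this notebook).

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(env=os.environ):
    """Build a MongoDB URI from environment variables (names are assumptions)."""
    user = env.get("MONGO_USER", "erica")        # hypothetical variable name
    password = env.get("MONGO_PASSWORD", "")     # hypothetical variable name
    host = env.get("MONGO_HOST", "mongodb")      # same default as the notebook
    port = env.get("MONGO_PORT", "27017")        # hypothetical variable name
    # Percent-encode credentials so characters like '@' or ':' do not break the URI.
    return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/"


print(build_mongo_uri({"MONGO_USER": "erica", "MONGO_PASSWORD": "p@ss:word"}))
```

The returned string can be passed to `MongoClient` in place of the hardcoded `MONGO_URI`.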


In [4]:
# Authenticate with Modal (creates a new token via a browser flow)
!modal token new

Was not able to launch web browser
Please go to this URL manually and complete the flow:

https://modal.com/token-flow/tf-SEVHm1FoR24FKKwKilHnDv

Waiting for authentication in the web browser
Waiting for token flow to complete...
Web authentication finished successfully!
Token is connected to the chetangoel01 workspace.
Verifying token against https://api.modal.com
Token verified successfully!
Storing token
Token written to /root/.modal.toml in profile chetangoel01.


If you see "No token found", run this to authenticate:
```bash
modal token new
```
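
To avoid re-running the browser flow on every session, one option is to check for an existing token first. The sketch below assumes the token lives either in the `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET` environment variables or in `~/.modal.toml` (the file the output above reports writing); the helper name `modal_token_available` is hypothetical.

```python
import os
from pathlib import Path


def modal_token_available(config_path=None, env=os.environ):
    """Return True if a Modal token appears to be configured."""
    # Credentials supplied through the environment take the first check.
    if env.get("MODAL_TOKEN_ID") and env.get("MODAL_TOKEN_SECRET"):
        return True
    # Otherwise look for the config file that `modal token new` writes.
    path = Path(config_path) if config_path else Path.home() / ".modal.toml"
    return path.is_file() and path.stat().st_size > 0


if modal_token_available():
    print("Modal token found")
else:
    print("No token found; run `modal token new`")
```

This only checks that a token is present, not that it is still valid.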

## 1. Export Chunks from MongoDB

In [5]:
# Check how many chunks we have
chunks_count = db.chunks.count_documents({})
print(f"Total chunks in MongoDB: {chunks_count}")

# Sample a chunk to see structure
sample = db.chunks.find_one()
if sample:
    print(f"\nSample chunk keys: {list(sample.keys())}")
    print(f"Text preview: {sample.get('text', '')[:200]}...")

Total chunks in MongoDB: 3708

Sample chunk keys: ['_id', 'text', 'source_url', 'source_type', 'source_title', 'chunk_index', 'start_char', 'end_char', 'start_time', 'end_time', 'page_numbers', 'token_count', 'created_at']
Text preview: Introduction to Artificial Intelligence Foundations 2D Perception Large Language Models Logical Reasoning Task Planning Markov Decision Processes Reinforcement Learning Start (in-person) Start (online...


In [6]:
def export_chunks(limit=None, output_file="chunks.json"):
    """
    Export chunks from MongoDB to JSON file.
    """
    cursor = db.chunks.find({})
    if limit:
        cursor = cursor.limit(limit)
    
    chunks = []
    for doc in cursor:
        chunk = {
            "chunk_id": doc.get("chunk_id", str(doc["_id"])),
            "text": doc.get("text", ""),
            "source_url": doc.get("source_url", ""),
            "source_type": doc.get("source_type", "unknown"),
            "source_title": doc.get("source_title", ""),
            "chunk_index": doc.get("chunk_index", 0),
            "token_count": doc.get("token_count", 0),
        }
        chunks.append(chunk)
    
    with open(output_file, "w") as f:
        json.dump(chunks, f, indent=2)
    
    # Stats
    by_type = {}
    total_tokens = 0
    for c in chunks:
        t = c["source_type"]
        by_type[t] = by_type.get(t, 0) + 1
        total_tokens += c.get("token_count", 0)
    
    print(f"Exported {len(chunks)} chunks to {output_file}")
    print(f"\nBy source type:")
    for t, count in sorted(by_type.items()):
        print(f"  {t}: {count}")
    print(f"\nTotal tokens: {total_tokens:,}")
    
    return chunks

In [7]:
# Export ALL chunks for full run
# chunks = export_chunks(output_file="chunks.json")

# Or export a subset for testing
chunks = export_chunks(limit=100, output_file="chunks_test.json")

Exported 100 chunks to chunks_test.json

By source type:
  web: 100

Total tokens: 35,418


## 2. Run Extraction on Modal

We'll use the `modal run` command to execute the extraction on Modal's cloud GPUs.
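
The test run and the full run differ only in their arguments, so the command can be assembled programmatically. This is a small sketch using the flags that appear in this notebook (`--input`, `--output`, `--batch-size`, `--max-chunks`); the helper `build_modal_command` is hypothetical and not part of `extract.py`.

```python
import shlex


def build_modal_command(input_file, output_file, batch_size, max_chunks=None,
                        script="src/graph/extract.py"):
    """Assemble the `modal run` argument list used in this notebook."""
    cmd = ["modal", "run", script,
           "--input", input_file,
           "--output", output_file,
           "--batch-size", str(batch_size)]
    # Only the test run limits the number of chunks.
    if max_chunks is not None:
        cmd += ["--max-chunks", str(max_chunks)]
    return cmd


cmd = build_modal_command("chunks_test.json", "extractions_test.json",
                          batch_size=10, max_chunks=20)
print(shlex.join(cmd))
# To execute from Python instead of a `!` cell: subprocess.run(cmd, check=True)
```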

In [None]:
# The extract.py file is located at src/graph/extract.py
# Modal will use this path when running the extraction

import os
print(f"Current directory: {os.getcwd()}")
print(f"Files: {os.listdir('.')}")
if os.path.exists("src/graph/extract.py"):
    print("\n✓ Found extract.py at: src/graph/extract.py")
else:
    print("\n✗ extract.py not found at src/graph/extract.py")

Current directory: /app
Files: ['chunks_test.json', 'notebooks', 'config', 'src', 'data', 'requirements.txt']


### 2a. Test Run (Small Batch)

First, test with a small batch to make sure everything works.

In [13]:
# Test run with 20 chunks
!modal run src/graph/extract.py --input chunks_test.json --output extractions_test.json --max-chunks 20 --batch-size 10

`gpu=A100(...)` is deprecated. Use `gpu="A100-80GB"` instead.

Source: /app/src/graph/extract.py:81
  gpu=modal.gpu.A100(size="80GB"),

We have renamed several parameters related to autoscaling. Please update
your code to use the following new names:

- container_idle

In [14]:
# Check test results
if os.path.exists("extractions_test.json"):
    with open("extractions_test.json") as f:
        test_results = json.load(f)
    
    print(f"Test extractions: {len(test_results)}")
    
    n_concepts = sum(len(r.get("concepts", [])) for r in test_results)
    n_relations = sum(len(r.get("relations", [])) for r in test_results)
    n_errors = sum(1 for r in test_results if r.get("error"))
    
    print(f"Concepts extracted: {n_concepts}")
    print(f"Relations extracted: {n_relations}")
    print(f"Errors: {n_errors}")
    
    # Show sample
    for r in test_results[:3]:
        if r.get("concepts"):
            print(f"\n--- Chunk: {r['chunk_id'][:30]}... ---")
            print(f"Concepts: {[c['title'] for c in r['concepts'][:5]]}")
            if r.get("relations"):
                print(f"Relations: {r['relations'][:3]}")
else:
    print("No test results found. Run the modal command above first.")

Test extractions: 20
Concepts extracted: 92
Relations extracted: 103
Errors: 1

--- Chunk: 692f7634e8ea998b6d034f03... ---
Concepts: ['Artificial Intelligence', 'Perception', 'Probabilistic Reasoning', 'Logical Reasoning', 'Planning']
Relations: [{'source': 'Artificial Intelligence', 'target': 'Perception', 'relation_type': 'part_of'}, {'source': 'Artificial Intelligence', 'target': 'Probabilistic Reasoning', 'relation_type': 'part_of'}, {'source': 'Artificial Intelligence', 'target': 'Logical Reasoning', 'relation_type': 'part_of'}]

--- Chunk: 692f7634e8ea998b6d034f04... ---
Concepts: ['Supervised Learning', 'Optimization', 'Maximum Likelihood Estimation', 'Classification', 'Logistic Regression']
Relations: [{'source': 'Supervised Learning', 'target': 'Classification', 'relation_type': 'is_a'}, {'source': 'Optimization', 'target': 'Supervised Learning', 'relation_type': 'prereq_of'}, {'source': 'Logistic Regression', 'target': 'Classification', 'relation_type': 'is_a'}]

--- Chunk: 6

### 2b. Full Run (All Chunks)

Once the test looks good, run on all chunks.

In [15]:
# Export all chunks first
chunks = export_chunks(output_file="chunks.json")

Exported 3708 chunks to chunks.json

By source type:
  pdf: 1339
  video: 172
  web: 2197

Total tokens: 4,928,975


In [16]:
# Full run: takes roughly 20-30 minutes and costs about $1.50-2.00
# Comment out the line below to skip the full run.

!modal run src/graph/extract.py --input chunks.json --output extractions.json --batch-size 32

`gpu=A100(...)` is deprecated. Use `gpu="A100-80GB"` instead.

Source: /app/src/graph/extract.py:81
  gpu=modal.gpu.A100(size="80GB"),

We have renamed several parameters related to autoscaling. Please update
your code to use the following new names:

- container_idle

## 3. Import Results to MongoDB

In [None]:
def import_extractions(input_file, clear_existing=False):
    """
    Import extraction results into MongoDB.
    """
    print(f"Loading extractions from {input_file}...")
    with open(input_file) as f:
        extractions = json.load(f)
    
    print(f"Loaded {len(extractions)} extractions")
    
    if clear_existing:
        print("Clearing existing extractions...")
        db.extractions.delete_many({})
    
    # Add metadata and insert
    docs = []
    for ext in extractions:
        doc = {
            "chunk_id": ext["chunk_id"],
            "source_url": ext.get("source_url", ""),
            "concepts": ext.get("concepts", []),
            "relations": ext.get("relations", []),
            "error": ext.get("error"),
            "imported_at": datetime.utcnow(),
        }
        docs.append(doc)
    
    if docs:
        result = db.extractions.insert_many(docs)
        print(f"Inserted {len(result.inserted_ids)} documents")
    
    # Create indexes
    db.extractions.create_index("chunk_id")
    db.extractions.create_index("source_url")
    
    # Stats
    n_concepts = sum(len(e.get("concepts", [])) for e in extractions)
    n_relations = sum(len(e.get("relations", [])) for e in extractions)
    n_errors = sum(1 for e in extractions if e.get("error"))
    
    print(f"\nSummary:")
    print(f"  Total extractions: {len(extractions)}")
    print(f"  Total concepts:    {n_concepts}")
    print(f"  Total relations:   {n_relations}")
    print(f"  Errors:            {n_errors}")
    
    return extractions

In [None]:
# Import test results
if os.path.exists("extractions_test.json"):
    import_extractions("extractions_test.json", clear_existing=True)
else:
    print("No extractions file found. Run the Modal extraction first.")

In [None]:
# Import full results (uncomment after full run)
# import_extractions("extractions.json", clear_existing=True)

## 4. Verify & Explore Results

In [None]:
# Check extraction stats in MongoDB
extractions_count = db.extractions.count_documents({})
print(f"Total extractions in MongoDB: {extractions_count}")

# Count concepts and relations
pipeline = [
    {
        "$project": {
            "n_concepts": {"$size": {"$ifNull": ["$concepts", []]}},
            "n_relations": {"$size": {"$ifNull": ["$relations", []]}},
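            # $ifNull maps a missing "error" field to null so it is not counted as an error
            "has_error": {"$cond": [{"$ne": [{"$ifNull": ["$error", None]}, None]}, 1, 0]}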
        }
    },
    {
        "$group": {
            "_id": None,
            "total_concepts": {"$sum": "$n_concepts"},
            "total_relations": {"$sum": "$n_relations"},
            "total_errors": {"$sum": "$has_error"}
        }
    }
]

stats = list(db.extractions.aggregate(pipeline))
if stats:
    s = stats[0]
    print(f"\nTotal concepts:  {s['total_concepts']}")
    print(f"Total relations: {s['total_relations']}")
    print(f"Errors:          {s['total_errors']}")

In [None]:
# Get all unique concepts
pipeline = [
    {"$unwind": "$concepts"},
    {"$group": {
        "_id": "$concepts.title",
        "count": {"$sum": 1},
        "difficulty": {"$first": "$concepts.difficulty"},
        "definition": {"$first": "$concepts.definition"}
    }},
    {"$sort": {"count": -1}},
    {"$limit": 30}
]

top_concepts = list(db.extractions.aggregate(pipeline))

print("Top 30 Most Frequent Concepts:")
print("=" * 60)
for c in top_concepts:
    print(f"{c['count']:3d}x  {c['_id'][:40]:<40} [{c.get('difficulty', '?')}]")

In [None]:
# Get all unique relations
pipeline = [
    {"$unwind": "$relations"},
    {"$group": {
        "_id": {
            "source": "$relations.source",
            "target": "$relations.target",
            "type": "$relations.relation_type"
        },
        "count": {"$sum": 1}
    }},
    {"$sort": {"count": -1}},
    {"$limit": 20}
]

top_relations = list(db.extractions.aggregate(pipeline))

print("\nTop 20 Most Frequent Relations:")
print("=" * 70)
for r in top_relations:
    rel = r['_id']
    print(f"{r['count']:3d}x  {rel['source'][:25]:<25} --[{rel['type']}]--> {rel['target'][:25]}")

In [None]:
# Relation types distribution
pipeline = [
    {"$unwind": "$relations"},
    {"$group": {
        "_id": "$relations.relation_type",
        "count": {"$sum": 1}
    }},
    {"$sort": {"count": -1}}
]

relation_types = list(db.extractions.aggregate(pipeline))

print("\nRelation Types Distribution:")
print("=" * 40)
for rt in relation_types:
    print(f"  {rt['_id']:<20} {rt['count']:>6}")

In [None]:
# Sample extraction with concepts and relations
sample = db.extractions.find_one({"concepts.0": {"$exists": True}, "relations.0": {"$exists": True}})

if sample:
    print("Sample Extraction:")
    print("=" * 60)
    print(f"Chunk ID: {sample['chunk_id']}")
    print(f"Source: {sample['source_url'][:60]}...")
    print(f"\nConcepts ({len(sample['concepts'])}):\n")
    for c in sample['concepts'][:5]:
        print(f"  • {c['title']} [{c.get('difficulty', '?')}]")
        print(f"    {c.get('definition', 'No definition')[:80]}...")
    
    print(f"\nRelations ({len(sample['relations'])}):\n")
    for r in sample['relations'][:5]:
        print(f"  • {r['source']} --[{r['relation_type']}]--> {r['target']}")

In [None]:
# Check for errors
errors = list(db.extractions.find({"error": {"$ne": None}}).limit(5))

if errors:
    print(f"\nSample Errors ({len(errors)} shown):")
    print("=" * 60)
    for e in errors:
        print(f"Chunk: {e['chunk_id'][:40]}...")
        print(f"Error: {e['error'][:100]}")
        print()
else:
    print("No errors found!")
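
Beyond counting errors, a structural check can flag relations whose endpoints were never extracted as concepts in the same chunk. The sketch below assumes the record shape shown in the test output (`concepts` with a `title`, `relations` with `source` and `target`). Treating such "dangling" relations as a quality signal is an assumption, since a relation may legitimately point to a concept from another chunk.

```python
def find_dangling_relations(extraction):
    """Return relations whose source or target is not among the chunk's concept titles."""
    titles = {c.get("title") for c in extraction.get("concepts", [])}
    return [
        r for r in extraction.get("relations", [])
        if r.get("source") not in titles or r.get("target") not in titles
    ]


# Hand-written example in the same shape as the extraction records.
example = {
    "concepts": [{"title": "Supervised Learning"}, {"title": "Classification"}],
    "relations": [
        {"source": "Supervised Learning", "target": "Classification", "relation_type": "is_a"},
        {"source": "Optimization", "target": "Supervised Learning", "relation_type": "prereq_of"},
    ],
}
dangling = find_dangling_relations(example)
print(dangling)
```

The same function can be applied to each document returned by `db.extractions.find()`.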

## 5. Next Steps

Now that you have extractions in MongoDB, the next steps are:

1. **Deduplicate concepts** - Merge similar concepts (e.g., "Neural Network" and "neural networks")
2. **Build Neo4j graph** - Load concepts as nodes, relations as edges
3. **Add resources** - Link concepts back to source chunks
4. **Visualize** - Use Neo4j browser or pyvis to explore the knowledge graph
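
As a starting point for the deduplication step, concept titles can be grouped under a normalized key. This is a minimal sketch, not a full entity-resolution method: it lowercases, collapses whitespace, and strips a trailing plural "s", which is enough to merge "Neural Network" with "neural networks". The helper names are hypothetical.

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Lowercase, collapse whitespace, and strip a simple trailing plural 's'."""
    key = re.sub(r"\s+", " ", title.strip().lower())
    # Naive singularization; it will mis-handle words such as "bias" or "analysis".
    if key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]
    return key


def group_concepts(titles):
    """Group raw titles that share a normalized key."""
    groups = defaultdict(list)
    for t in titles:
        groups[normalize_title(t)].append(t)
    return dict(groups)


groups = group_concepts(["Neural Network", "neural networks", "Logistic Regression"])
print(groups)
```

Groups with more than one member are merge candidates; near-synonyms and abbreviations would need a fuzzier comparison, such as embedding similarity.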

In [None]:
# Clean up
client.close()
print("MongoDB connection closed.")