# 🤖 Agentic AI Web Crawler - Google Colab Edition

**Schema-less AI-Powered Intelligent Web Crawling**

This notebook uses:
- **Ollama** with **deepseek-r1:14b** for AI-powered decisions
- **Crawl4AI** for intelligent web crawling
- **AI Navigation** - LLM figures out which links are relevant
- **Schema-less Extraction** - LLM determines what's important to scrape
- **Two-phase crawling** - Reconnaissance + Targeted Deep Dive
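
The two-phase split can be pictured as a simple page-budget calculation. The sketch below is illustrative only: `split_budget` and its parameters are hypothetical names, and it assumes the same rule the crawler class uses later (about one tenth of the budget, with a floor of five pages), clamped so reconnaissance never exceeds the total.

```python
from typing import Tuple


def split_budget(max_pages: int, recon_divisor: int = 10, min_recon: int = 5) -> Tuple[int, int]:
    """Return (recon_pages, deep_pages) for a two-phase crawl.

    Hypothetical helper: reconnaissance gets max_pages // recon_divisor pages,
    but at least min_recon and never more than the whole budget.
    """
    if max_pages <= 0:
        return 0, 0
    recon = max(min_recon, max_pages // recon_divisor)
    recon = min(recon, max_pages)  # never spend more than the total budget
    return recon, max_pages - recon


if __name__ == "__main__":
    for total in (3, 20, 50, 200):
        print(total, split_budget(total))
```

In the notebook itself the deep-crawl budget is whatever remains after the pages actually visited in reconnaissance, so the second number here is an upper bound rather than a guarantee.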

---


## 📦 Step 1: Install Dependencies

Install all required packages for web crawling and AI processing.
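
After installation it can help to confirm which versions were actually resolved. The following is a minimal sketch using only the standard library; the `REQUIRED` list simply mirrors the package names from the install command, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional

# Distribution names taken from the pip command in this step.
REQUIRED = ["crawl4ai", "ollama", "aiohttp", "beautifulsoup4", "lxml"]


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```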


In [None]:
# Install required packages
# Quote each requirement: an unquoted ">=" is treated by the shell as output redirection
!pip install -q "crawl4ai>=0.3.0" "ollama>=0.1.0" "aiohttp>=3.9.0" "beautifulsoup4>=4.12.0" "lxml>=4.9.0"

print("✅ All dependencies installed successfully!")

# Install Playwright browsers (required by Crawl4AI)
print("\n📥 Installing Playwright browsers...")
!playwright install chromium
!playwright install-deps chromium

print("✅ Playwright browsers installed!")


## 🔧 Step 2: Setup Ollama Connection

**Important**: This notebook expects Ollama to be running locally or accessible via network.

For Google Colab, you have two options:
1. **Tunnel from local machine**: Run Ollama locally and expose it through an ngrok or Cloudflare tunnel
2. **Install Ollama in Colab**: Install and run Ollama directly in this notebook

We'll use option 2 for simplicity.
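
A server started inside Colab may need a variable amount of time before it accepts requests. As a hedged sketch, one way to wait for it is to poll until a probe succeeds. `wait_until_ready` and `ollama_is_up` are hypothetical helper names, and the probe assumes Ollama is listening on its default address, `http://127.0.0.1:11434`.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def ollama_is_up(base_url: str = "http://127.0.0.1:11434") -> bool:
    """Return True if an HTTP GET on the server root answers with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe: Callable[[], bool], timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Call probe() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

Calling `wait_until_ready(ollama_is_up)` right after launching `ollama serve` returns `True` as soon as the server responds and `False` if it never comes up within the timeout.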


In [None]:
# Install Ollama in Colab
!curl -fsSL https://ollama.com/install.sh | sh

print("✅ Ollama installed!")


In [None]:
# Start Ollama server in background
import subprocess
import time

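# Start Ollama server; discard its output so an unread pipe buffer cannot fill up and stall the process
ollama_process = subprocess.Popen(['ollama', 'serve'], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
time.sleep(5)  # Give the server a few seconds to start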

print("✅ Ollama server started!")


In [None]:
# Pull the deepseek-r1:14b model
print("📥 Downloading deepseek-r1:14b model (this may take several minutes)...")
!ollama pull deepseek-r1:14b

print("✅ Model downloaded and ready!")


## 🧠 Step 3: Import Libraries


In [None]:
import asyncio
import json
import re
from datetime import datetime
from typing import List, Dict, Any, Optional
from urllib.parse import urljoin, urlparse
from collections import defaultdict
import ollama
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import NoExtractionStrategy
from bs4 import BeautifulSoup

print("✅ All libraries imported successfully!")


## 🕷️ Step 4: Define the Agentic Web Crawler Class

This class implements:
- AI-powered objective analysis
- Schema-less content extraction
- Intelligent link navigation
- Two-phase crawling strategy
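
Every AI step in the class relies on turning a free-form model reply into JSON. The sketch below shows one defensive way to do that. `parse_llm_json` is a hypothetical helper, not part of the class, and it assumes the reply may contain a `<think>...</think>` reasoning block, a fenced code block, or extra prose around the JSON object.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # three backticks, built here to keep this example easy to embed


def parse_llm_json(text: str) -> Optional[Any]:
    """Best-effort extraction of a JSON object from an LLM reply; None on failure."""
    # 1. Drop any reasoning block the model emitted before its answer.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # 2. Prefer the contents of a fenced block if one is present.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text
    # 3. Keep only the span from the first "{" to the last "}".
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None


if __name__ == "__main__":
    reply = 'Sure! {"page_type": "article", "relevance_score": 8} Hope that helps.'
    print(parse_llm_json(reply))
```

Returning `None` instead of raising lets the caller choose its own fallback, which is how the class treats parse failures.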


In [None]:
class ImprovedAgenticWebCrawler:
    """An intelligent web crawler that uses AI to make navigation decisions and extract content schema-lessly."""
    
    def __init__(self, decision_model: str = "deepseek-r1:14b", extraction_model: str = "deepseek-r1:14b", max_pages: int = 50):
        self.decision_model = decision_model
        self.extraction_model = extraction_model
        self.max_pages = max_pages
        self.visited_urls = set()
        self.scraped_data = []
        self.base_domain = None
        self.crawl_objective = ""
        self.crawl_objective_analysis = {}
        self.desired_data_types = []
        self.site_understanding = {"site_type": None, "main_sections": [], "content_patterns": defaultdict(list), "high_value_url_patterns": [], "recommended_focus": ""}
        self.page_relevance_scores = {}
        self.high_value_pages = []
        self.current_phase = "initialization"
        
    async def analyze_user_objective(self, objective: str) -> Dict[str, Any]:
        print("\n🤖 Analyzing your objective with AI...")
        prompt = f"""You are helping to plan a web crawling operation. Analyze the user's objective and provide a detailed crawl strategy.

USER'S OBJECTIVE: "{objective}"

Analyze this objective and provide:
1. What TYPE of data they're looking for (e.g., products, articles, contact info, documentation, etc.)
2. What specific FIELDS or attributes they likely want extracted
3. What sections of a website would be most valuable
4. What URL patterns to prioritize
5. What URL patterns to avoid

Respond in JSON format:
{{
  "data_types": ["primary type", "secondary type"],
  "key_fields": ["field1", "field2", "field3"],
  "valuable_sections": ["section1", "section2"],
  "url_patterns_to_seek": ["pattern1", "pattern2"],
  "url_patterns_to_avoid": ["pattern1", "pattern2"],
  "extraction_strategy": "Description of how to approach extraction",
  "success_criteria": "How to know when we have enough data"
}}

Be specific and actionable."""

        try:
            response = ollama.generate(model=self.decision_model, prompt=prompt)
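            # deepseek-r1 may prepend its reasoning in <think>...</think>; drop it before parsing JSON
            response_text = re.sub(r'<think>.*?</think>', '', response['response'], flags=re.DOTALL).strip()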
            if '```json' in response_text:
                json_match = re.search(r'```json\s*(.*?)\s*```', response_text, re.DOTALL)
                if json_match:
                    response_text = json_match.group(1)
            elif '```' in response_text:
                json_match = re.search(r'```\s*(.*?)\s*```', response_text, re.DOTALL)
                if json_match:
                    response_text = json_match.group(1)
            analysis = json.loads(response_text)
            print("\n✓ Objective Analysis Complete:")
            print(f"  • Data Types: {', '.join(analysis.get('data_types', []))}")
            print(f"  • Key Fields: {', '.join(analysis.get('key_fields', []))}")
            print(f"  • Focus Areas: {', '.join(analysis.get('valuable_sections', []))}")
            self.crawl_objective_analysis = analysis
            self.desired_data_types = analysis.get('data_types', [])
            return analysis
        except Exception as e:
            print(f"⚠ Analysis error: {e}")
            return {"data_types": ["general content"], "key_fields": ["title", "content", "links"], "valuable_sections": ["main content"], "url_patterns_to_seek": [], "url_patterns_to_avoid": ["login", "signup", "cart"], "extraction_strategy": "Extract all available content", "success_criteria": "Crawl specified number of pages"}
        
    def _is_same_domain(self, url: str) -> bool:
        if not self.base_domain:
            return False
        parsed = urlparse(url)
        return parsed.netloc == self.base_domain
    
    def _extract_url_pattern(self, url: str) -> str:
        parsed = urlparse(url)
        path_parts = [p for p in parsed.path.split('/') if p]
        pattern_parts = []
        for part in path_parts:
            if part.isdigit() or len(part) > 30:
                pattern_parts.append('*')
            else:
                pattern_parts.append(part)
        return '/' + '/'.join(pattern_parts) if pattern_parts else '/'
    
    def _find_similar_visited_urls(self, url: str, limit: int = 3) -> List[str]:
        pattern = self._extract_url_pattern(url)
        similar = []
        for visited in self.visited_urls:
            if self._extract_url_pattern(visited) == pattern:
                similar.append(visited)
                if len(similar) >= limit:
                    break
        return similar
    
    async def _extract_content_with_ai(self, html: str, url: str, markdown: str = "") -> Dict[str, Any]:
        soup = BeautifulSoup(html, 'lxml')
        for tag in soup.find_all(['nav', 'header', 'footer', 'script', 'style']):
            tag.decompose()
        page_text = soup.get_text(separator='\n', strip=True)
        content_to_analyze = markdown[:4000] if markdown else page_text[:4000]
        headers = [h.get_text(strip=True) for h in soup.find_all(['h1', 'h2', 'h3']) if h.get_text(strip=True)][:15]
        main_links = []
        for a in soup.find_all('a', href=True)[:30]:
            text = a.get_text(strip=True)
            if text and len(text) > 2:
                main_links.append(text)
        
        prompt = f"""You are analyzing a web page to extract relevant information based on a specific objective.

CRAWL OBJECTIVE: {self.crawl_objective}

TARGET DATA TYPES: {', '.join(self.desired_data_types)}
KEY FIELDS TO EXTRACT: {', '.join(self.crawl_objective_analysis.get('key_fields', []))}

PAGE URL: {url}

PAGE HEADERS:
{chr(10).join(headers[:10])}

PAGE CONTENT (excerpt):
{content_to_analyze}

MAIN LINKS:
{', '.join(main_links[:15])}

YOUR TASK:
1. Determine the TYPE of page this is (e.g., "product listing", "article", "documentation", "about page", "homepage", etc.)
2. Rate how RELEVANT this page is to the crawl objective (0-10 scale)
3. Extract ALL relevant structured data that matches the objective
4. Be FLEXIBLE - adapt your extraction schema to what's actually on the page

Respond in JSON format:
{{
  "page_type": "...",
  "relevance_score": 0-10,
  "key_content": {{
    // Extract whatever structured data is relevant
    // Examples: "items": [...], "article_text": "...", "metadata": {{...}}
    // Adapt to the page content and objective
  }},
  "reasoning": "Brief explanation of why this page is/isn't relevant",
  "content_summary": "One sentence summary of page content"
}}

Be thorough but concise. Extract actual data, not descriptions."""

        try:
            response = ollama.generate(model=self.extraction_model, prompt=prompt)
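            # deepseek-r1 may prepend its reasoning in <think>...</think>; drop it before parsing JSON
            response_text = re.sub(r'<think>.*?</think>', '', response['response'], flags=re.DOTALL).strip()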
            if '```json' in response_text:
                json_match = re.search(r'```json\s*(.*?)\s*```', response_text, re.DOTALL)
                if json_match:
                    response_text = json_match.group(1)
            elif '```' in response_text:
                json_match = re.search(r'```\s*(.*?)\s*```', response_text, re.DOTALL)
                if json_match:
                    response_text = json_match.group(1)
            extracted = json.loads(response_text)
            self._update_site_knowledge(url, extracted)
            return extracted
        except Exception as e:
            print(f"  ⚠ AI extraction error: {str(e)[:100]}")
            return {"page_type": "unknown", "relevance_score": 5, "key_content": {"title": soup.find('title').get_text() if soup.find('title') else "No title", "headers": headers, "text_excerpt": page_text[:500]}, "reasoning": "Fallback extraction due to error", "content_summary": "Content extracted with fallback method"}
    
    def _update_site_knowledge(self, url: str, extraction_result: Dict):
        relevance = extraction_result.get('relevance_score', 0)
        self.page_relevance_scores[url] = relevance
        if relevance >= 7:
            self.high_value_pages.append(url)
            url_pattern = self._extract_url_pattern(url)
            if url_pattern not in self.site_understanding['high_value_url_patterns']:
                self.site_understanding['high_value_url_patterns'].append(url_pattern)
            page_type = extraction_result.get('page_type')
            if page_type:
                self.site_understanding['content_patterns'][page_type].append({'url': url, 'pattern': url_pattern, 'relevance': relevance})
    
    def _normalize_url(self, url: str, base_url: str) -> str:
        url = urljoin(base_url, url)
        parsed = urlparse(url)
        return f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    
    async def _extract_links_with_context(self, html: str, base_url: str) -> List[Dict]:
        soup = BeautifulSoup(html, 'lxml')
        links_with_context = []
        # Copy the list so the stored objective analysis is not extended again on every call
        avoid_patterns = list(self.crawl_objective_analysis.get('url_patterns_to_avoid', []))
        avoid_patterns.extend(['javascript:', 'mailto:', 'tel:', '#', '.jpg', '.png', '.pdf', '.css', '.js'])
        
        for link in soup.find_all('a', href=True):
            href = link.get('href')
            if any(pattern.lower() in href.lower() for pattern in avoid_patterns):
                continue
            try:
                full_url = self._normalize_url(href, base_url)
            except Exception:  # skip malformed hrefs without swallowing KeyboardInterrupt/SystemExit
                continue
            if not self._is_same_domain(full_url) or full_url in self.visited_urls:
                continue
            anchor_text = link.get_text(strip=True)
            if not anchor_text or len(anchor_text) < 2:
                continue
            parent = link.parent
            context_text = parent.get_text(strip=True)[:200] if parent else ""
            title = link.get('title', '')
            aria_label = link.get('aria-label', '')
            is_in_nav = bool(link.find_parent(['nav', 'header']))
            is_in_main = bool(link.find_parent('main'))
            is_prominent = bool(link.find_parent(['h1', 'h2', 'h3']))
            
            links_with_context.append({'url': full_url, 'anchor_text': anchor_text, 'context': context_text, 'title': title, 'aria_label': aria_label, 'is_navigation': is_in_nav, 'is_main_content': is_in_main, 'is_prominent': is_prominent, 'url_path': urlparse(full_url).path})
        return links_with_context
    
    async def _score_url_relevance(self, link_info: Dict) -> Dict:
        url = link_info['url']
        similar_urls = self._find_similar_visited_urls(url)
        historical_scores = [self.page_relevance_scores.get(u, 0) for u in similar_urls]
        url_pattern = self._extract_url_pattern(url)
        pattern_bonus = 2 if url_pattern in self.site_understanding['high_value_url_patterns'] else 0
        heuristic_score = 5 + pattern_bonus
        if link_info.get('is_main_content'):
            heuristic_score += 1
        if link_info.get('is_prominent'):
            heuristic_score += 1
        seek_patterns = self.crawl_objective_analysis.get('url_patterns_to_seek', [])
        if any(pattern.lower() in url.lower() for pattern in seek_patterns):
            heuristic_score += 2
        return {'url': url, 'relevance_score': heuristic_score, 'historical_avg': sum(historical_scores) / len(historical_scores) if historical_scores else 5, 'should_crawl': heuristic_score >= 5, 'priority': 'high' if heuristic_score >= 8 else 'medium' if heuristic_score >= 6 else 'low'}
    
    async def _ask_ollama_for_navigation_advanced(self, current_url: str, available_links: List[Dict], page_extraction: Dict) -> List[str]:
        if not available_links:
            return []
        scored_links = []
        for link_info in available_links[:30]:
            score_info = await self._score_url_relevance(link_info)
            scored_links.append({**link_info, **score_info})
        scored_links.sort(key=lambda x: x['relevance_score'], reverse=True)
        top_candidates = scored_links[:12]
        links_summary = []
        for i, link in enumerate(top_candidates, 1):
            links_summary.append(f"{i}. [{link['relevance_score']:.1f}] {link['anchor_text'][:50]} → {link['url_path']}")
        
        prompt = f"""You are guiding a web crawler. Review these pre-scored URLs and select the best ones to crawl next.

CRAWL OBJECTIVE: {self.crawl_objective}

PROGRESS:
- Pages crawled: {len(self.visited_urls)}/{self.max_pages}
- High-value pages: {len(self.high_value_pages)}
- Current phase: {self.current_phase}

CURRENT PAGE: {current_url}
Page type: {page_extraction.get('page_type', 'unknown')}
Relevance: {page_extraction.get('relevance_score', '?')}/10
Summary: {page_extraction.get('content_summary', 'N/A')}

LEARNED PATTERNS:
High-value URL patterns: {self.site_understanding['high_value_url_patterns'][:5]}

TOP CANDIDATE URLS (with pre-scored relevance [score]):
{chr(10).join(links_summary)}

Select 3-5 URLs that best match the objective. Consider:
1. URLs matching learned high-value patterns
2. URLs with high relevance scores
3. URLs that explore new areas vs going deeper
4. Current progress toward objective

Respond with ONLY the numbers (comma-separated, e.g., "1,3,5,8").
If no links are worth crawling, respond "NONE"."""
        
        try:
            response = ollama.generate(model=self.decision_model, prompt=prompt)
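            # Drop any <think>...</think> reasoning so stray digits or "none" in it are not parsed as the answer
            answer = re.sub(r'<think>.*?</think>', '', response['response'], flags=re.DOTALL).strip()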
            if 'NONE' in answer.upper():
                print("  → AI: No valuable links found")
                return []
            numbers = re.findall(r'\d+', answer)
            selected_urls = []
            for num in numbers[:5]:
                idx = int(num) - 1
                if 0 <= idx < len(top_candidates):
                    selected_urls.append(top_candidates[idx]['url'])
            return selected_urls
        except Exception as e:
            print(f"  ⚠ Navigation AI error: {str(e)[:100]}")
            return [link['url'] for link in scored_links[:3]]
    
    async def _crawl_page(self, url: str, crawler: AsyncWebCrawler) -> Optional[Dict[str, Any]]:
        try:
            print(f"📄 Crawling: {url}")
            result = await crawler.arun(url=url, extraction_strategy=NoExtractionStrategy(), bypass_cache=True)
            if not result.success:
                print(f"  ✗ Failed to crawl")
                return None
            ai_extraction = await self._extract_content_with_ai(result.html, url, result.markdown if result.markdown else "")
            page_data = {"url": url, "title": result.metadata.get("title", "No title"), "description": result.metadata.get("description", ""), "timestamp": datetime.now().isoformat(), "metadata": result.metadata, "ai_extraction": ai_extraction, "relevance_score": ai_extraction.get('relevance_score', 0), "page_type": ai_extraction.get('page_type', 'unknown')}
            relevance = ai_extraction.get('relevance_score', 0)
            page_type = ai_extraction.get('page_type', 'unknown')
            print(f"  ✓ Type: {page_type} | Relevance: {relevance}/10")
            key_content = ai_extraction.get('key_content', {})
            if key_content:
                content_types = list(key_content.keys())
                print(f"  → Extracted: {', '.join(content_types[:3])}")
            return page_data
        except Exception as e:
            print(f"  ✗ Error: {str(e)[:80]}")
            return None
    
    async def _analyze_site_structure(self) -> Dict:
        print("\n🔍 Analyzing site structure...")
        page_types = {}
        for page in self.scraped_data:
            page_type = page.get('page_type', 'unknown')
            relevance = page.get('relevance_score', 0)
            if page_type not in page_types:
                page_types[page_type] = []
            page_types[page_type].append(relevance)
        pages_summary = []
        for page in self.scraped_data[:10]:
            pages_summary.append(f"- {page['url']}: {page['page_type']} (relevance: {page.get('relevance_score', 0)}/10)")
        
        prompt = f"""Analyze the reconnaissance crawl results and provide strategic guidance.

CRAWL OBJECTIVE: {self.crawl_objective}

PAGES CRAWLED IN RECONNAISSANCE ({len(self.scraped_data)} pages):
{chr(10).join(pages_summary)}

PAGE TYPE DISTRIBUTION:
{json.dumps({pt: {'count': len(scores), 'avg_relevance': sum(scores)/len(scores)} for pt, scores in page_types.items()}, indent=2)}

HIGH-VALUE URL PATTERNS DISCOVERED:
{self.site_understanding['high_value_url_patterns']}

Provide strategic analysis:
1. What TYPE of website is this?
2. Which sections/page types are most valuable for the objective?
3. What URL patterns should we prioritize in deep crawl?
4. What's the recommended crawl strategy going forward?

Respond in JSON:
{{
  "site_type": "...",
  "most_valuable_page_types": ["type1", "type2"],
  "recommended_focus": "Description of where to focus",
  "high_priority_patterns": ["pattern1", "pattern2"],
  "strategy": "continue_deep | adjust_objective | site_not_suitable"
}}"""

        try:
            response = ollama.generate(model=self.decision_model, prompt=prompt)
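            # deepseek-r1 may prepend its reasoning in <think>...</think>; drop it before parsing JSON
            response_text = re.sub(r'<think>.*?</think>', '', response['response'], flags=re.DOTALL).strip()
            if '```json' in response_text:
                json_match = re.search(r'```json\s*(.*?)\s*```', response_text, re.DOTALL)
                if json_match:
                    response_text = json_match.group(1)
            elif '```' in response_text:
                json_match = re.search(r'```\s*(.*?)\s*```', response_text, re.DOTALL)
                if json_match:
                    response_text = json_match.group(1)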
            analysis = json.loads(response_text)
            print(f"  ✓ Site Type: {analysis.get('site_type', 'Unknown')}")
            print(f"  ✓ Focus: {analysis.get('recommended_focus', 'General crawl')}")
            return analysis
        except Exception as e:
            print(f"  ⚠ Analysis error: {e}")
            return {"site_type": "unknown", "most_valuable_page_types": list(page_types.keys())[:2], "recommended_focus": "Continue crawling all page types", "high_priority_patterns": self.site_understanding['high_value_url_patterns'], "strategy": "continue_deep"}
    
    async def crawl_website(self, start_url: str) -> List[Dict[str, Any]]:
        parsed_url = urlparse(start_url)
        self.base_domain = parsed_url.netloc
        print(f"\n{'='*80}")
        print(f"🚀 STARTING INTELLIGENT WEB CRAWL")
        print(f"{'='*80}")
        print(f"Target: {start_url}")
        print(f"Objective: {self.crawl_objective}")
        print(f"Max pages: {self.max_pages}")
        print(f"{'='*80}\n")
        
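        # Reserve ~10% of the budget (at least 5 pages) for reconnaissance, never more than max_pages
        recon_budget = min(self.max_pages, max(5, self.max_pages // 10))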
        self.current_phase = "reconnaissance"
        print(f"📡 PHASE 1: RECONNAISSANCE ({recon_budget} pages)")
        print(f"{'─'*80}")
        
        async with AsyncWebCrawler(verbose=False) as crawler:
            url_queue = [start_url]
            while url_queue and len(self.visited_urls) < recon_budget:
                current_url = url_queue.pop(0)
                if current_url in self.visited_urls:
                    continue
                self.visited_urls.add(current_url)
                page_data = await self._crawl_page(current_url, crawler)
                if page_data:
                    self.scraped_data.append(page_data)
                    result = await crawler.arun(url=current_url, bypass_cache=True)
                    if result.success:
                        links = await self._extract_links_with_context(result.html, current_url)
                        if links:
                            selected = links[:8]
                            for link in selected:
                                if link['url'] not in self.visited_urls and link['url'] not in url_queue:
                                    url_queue.append(link['url'])
                print(f"  Progress: {len(self.visited_urls)}/{recon_budget} recon pages\n")
        
        site_analysis = await self._analyze_site_structure()
        self.site_understanding.update(site_analysis)
        
        deep_budget = self.max_pages - len(self.visited_urls)
        self.current_phase = "deep_crawl"
        print(f"\n{'─'*80}")
        print(f"🎯 PHASE 2: TARGETED DEEP CRAWL ({deep_budget} pages)")
        print(f"{'─'*80}\n")
        
        async with AsyncWebCrawler(verbose=False) as crawler:
            url_queue = []
            for page in self.scraped_data:
                if page.get('relevance_score', 0) >= 6:
                    result = await crawler.arun(url=page['url'], bypass_cache=True)
                    if result.success:
                        links = await self._extract_links_with_context(result.html, page['url'])
                        for link in links[:5]:
                            if link['url'] not in self.visited_urls:
                                url_queue.append(link['url'])
            url_queue = list(dict.fromkeys(url_queue))
            
            while url_queue and len(self.visited_urls) < self.max_pages:
                current_url = url_queue.pop(0)
                if current_url in self.visited_urls:
                    continue
                self.visited_urls.add(current_url)
                page_data = await self._crawl_page(current_url, crawler)
                if page_data:
                    self.scraped_data.append(page_data)
                    result = await crawler.arun(url=current_url, bypass_cache=True)
                    if result.success:
                        links = await self._extract_links_with_context(result.html, current_url)
                        if links:
                            selected_urls = await self._ask_ollama_for_navigation_advanced(current_url, links, page_data.get('ai_extraction', {}))
                            print(f"  → AI selected {len(selected_urls)} links")
                            for url in selected_urls:
                                if url not in self.visited_urls and url not in url_queue:
                                    url_queue.append(url)
                print(f"  Progress: {len(self.visited_urls)}/{self.max_pages} | Queue: {len(url_queue)} | High-value: {len(self.high_value_pages)}\n")
        
        print(f"\n{'='*80}")
        print(f"✅ CRAWL COMPLETE")
        print(f"{'='*80}")
        print(f"Total pages: {len(self.scraped_data)}")
        print(f"High-value pages: {len(self.high_value_pages)}")
        print(f"Avg relevance: {sum(self.page_relevance_scores.values())/len(self.page_relevance_scores) if self.page_relevance_scores else 0:.1f}/10")
        print(f"{'='*80}\n")
        return self.scraped_data

print("✅ ImprovedAgenticWebCrawler class defined!")


## 📋 Step 5: Define Data Formatting Functions


In [None]:
def format_content_recursive(content: Any, level: int = 3) -> list:
    """Recursively format extracted content into readable markdown."""
    lines = []
    if isinstance(content, dict):
        for key, value in content.items():
            title = key.replace('_', ' ').title()
            if isinstance(value, (dict, list)):
                lines.append(f"{'#' * level} {title}\n")
                lines.extend(format_content_recursive(value, level + 1))
            else:
                lines.append(f"**{title}**: {value}\n")
    elif isinstance(content, list):
        if not content:
            lines.append("*No items found*\n")
        elif all(isinstance(item, dict) for item in content):
            for idx, item in enumerate(content, 1):
                lines.append(f"\n### Item {idx}\n")
                lines.extend(format_content_recursive(item, level + 1))
        else:
            for item in content:
                if isinstance(item, str):
                    lines.append(f"- {item}\n")
                else:
                    lines.append(f"- {str(item)}\n")
    else:
        lines.append(f"{content}\n")
    return lines

def generate_human_readable_report(scraped_data: List[Dict], objective: str) -> str:
    """Generate a human-readable report from scraped data."""
    md = []
    md.append("# 📊 Extracted Data Analysis Report\n")
    md.append(f"**Generated**: {datetime.now().strftime('%B %d, %Y at %I:%M %p')}\n")
    md.append(f"**Search Objective**: {objective}\n")
    md.append("---\n")
    relevant_pages = [p for p in scraped_data if p.get('relevance_score', 0) >= 5]
    if not relevant_pages:
        md.append("## No Relevant Data Found\n")
        md.append("No pages matched the objective with sufficient relevance.\n")
    else:
        md.append(f"## Summary\n")
        md.append(f"Found **{len(relevant_pages)}** pages with relevant information.\n\n")
        md.append("---\n")
        for idx, page in enumerate(relevant_pages, 1):
            url = page.get('url', 'Unknown URL')
            ai_extraction = page.get('ai_extraction', {})
            content = ai_extraction.get('key_content', {})
            relevance = page.get('relevance_score', 0)
            page_type = page.get('page_type', 'unknown')
            summary = ai_extraction.get('content_summary', 'N/A')
            if not content or content == {}:
                continue
            md.append(f"\n## 📄 Source {idx}\n")
            md.append(f"**Page Type**: {page_type}\n")
            md.append(f"**Relevance Score**: {relevance}/10\n")
            md.append(f"**Summary**: {summary}\n\n")
            md.extend(format_content_recursive(content, level=3))
            md.append(f"\n**Source URL**: [{url}]({url})\n")
            md.append("\n---\n")
    return ''.join(md)

print("✅ Data formatting functions defined!")


## ⚙️ Step 6: Configure the Crawl


In [None]:
# Get user inputs
print("="*80)
print("🤖 AGENTIC WEB CRAWLER - CONFIGURATION")
print("="*80)
print()

# URL Input
start_url = input("🌐 Enter the website URL to crawl: ").strip()
if not start_url.startswith(('http://', 'https://')):
    start_url = 'https://' + start_url

print()
print("📝 What information are you looking for?")
print("   Examples:")
print("   - 'Product names, prices, and availability'")
print("   - 'Blog articles with titles and publication dates'")
print("   - 'Documentation pages with code examples'")
print("   - 'Contact information and team member details'")
print()

# Objective Input
crawl_objective = input("Your objective: ").strip()
if not crawl_objective:
    crawl_objective = "Extract all relevant content and structured data"

print()

# Max pages input
max_pages_input = input("🔢 Maximum pages to crawl (default: 20): ").strip()
max_pages = int(max_pages_input) if max_pages_input.isdigit() else 20

print()
print("✅ Configuration complete!")
print(f"   URL: {start_url}")
print(f"   Objective: {crawl_objective}")
print(f"   Max Pages: {max_pages}")


## 🚀 Step 7: Run the Intelligent Crawler

This will:
1. Analyze your objective with AI
2. Perform reconnaissance crawl
3. Execute targeted deep crawl
4. Extract relevant data schema-lessly


In [None]:
# Initialize crawler
crawler = ImprovedAgenticWebCrawler(
    decision_model="deepseek-r1:14b",
    extraction_model="deepseek-r1:14b",
    max_pages=max_pages
)

# Set objective
crawler.crawl_objective = crawl_objective

# Analyze objective with AI
await crawler.analyze_user_objective(crawl_objective)

# Start crawling
scraped_data = await crawler.crawl_website(start_url)

print(f"\n✅ Successfully crawled {len(scraped_data)} pages")
print(f"   High-value pages: {len(crawler.high_value_pages)}")
if crawler.page_relevance_scores:
    avg_relevance = sum(crawler.page_relevance_scores.values()) / len(crawler.page_relevance_scores)
    print(f"   Average relevance: {avg_relevance:.1f}/10")


## 📊 Step 8: Display Scraped Data (Raw JSON)

View the raw extracted data in JSON format.
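
The output is filtered by comparing each page's relevance score against a numeric threshold. Because that score comes from model-generated JSON, it may occasionally arrive as a string such as `"8"` or `"7/10"`. A minimal sketch of normalizing it before comparison follows; `coerce_score` is a hypothetical helper name and the clamping to the 0-10 range is an assumption based on the scale used in the prompts.

```python
import re
from typing import Any


def coerce_score(value: Any, default: float = 0.0) -> float:
    """Convert a model-supplied relevance value to a float clamped to 0-10."""
    if isinstance(value, bool):  # bool is an int subclass; treat it as invalid
        return default
    if isinstance(value, (int, float)):
        score = float(value)
    else:
        # Take the first number that appears in the text, e.g. "7/10" yields 7
        match = re.search(r"\d+(?:\.\d+)?", str(value))
        if not match:
            return default
        score = float(match.group())
    return min(10.0, max(0.0, score))


if __name__ == "__main__":
    for raw in (8, "7/10", "9.5", None, 42):
        print(repr(raw), "->", coerce_score(raw))
```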


In [None]:
# Prepare output data (only relevant pages)
output_data = {
    "objective": crawl_objective,
    "total_pages_crawled": len(scraped_data),
    "high_value_pages": len(crawler.high_value_pages),
    "extracted_data": [
        {
            "url": page['url'],
            "page_type": page.get('page_type', 'unknown'),
            "relevance": page.get('relevance_score', 0),
            "extracted_content": page.get('ai_extraction', {}).get('key_content', {})
        }
        for page in scraped_data
        if page.get('relevance_score', 0) >= 5  # Only include relevant pages
    ]
}

# Display as formatted JSON
print("\n" + "="*80)
print("📦 SCRAPED DATA (JSON FORMAT)")
print("="*80 + "\n")
print(json.dumps(output_data, indent=2, ensure_ascii=False))


## 📝 Step 9: Generate Human-Readable Analysis

Convert the scraped data into a comprehensive, human-readable report that answers your objective.


In [None]:
# Generate human-readable report
report = generate_human_readable_report(scraped_data, crawl_objective)

print("\n" + "="*80)
print("📊 HUMAN-READABLE ANALYSIS REPORT")
print("="*80 + "\n")
print(report)


## 🤖 Step 10: AI-Powered Summary Generation

Use the LLM to generate a comprehensive summary answering your specific question.
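
The summary prompt has a limited character budget for extracted data, and slicing a long string at a fixed length can cut a source off mid-record. As an alternative sketch, the context can be assembled source by source and stopped before the budget is exceeded. `build_context` is a hypothetical helper, and the 8000-character default is an assumption matching the limit used in the cell below.

```python
import json
from typing import Any, Dict, List


def build_context(pages: List[Dict[str, Any]], max_chars: int = 8000) -> str:
    """Join whole per-source blocks until adding another would exceed max_chars."""
    parts: List[str] = []
    used = 0
    for idx, page in enumerate(pages, 1):
        extraction = page.get("ai_extraction", {})
        block = "\n".join([
            f"Source {idx} ({page.get('url', 'unknown')}):",
            f"Summary: {extraction.get('content_summary', '')}",
            f"Data: {json.dumps(extraction.get('key_content', {}), ensure_ascii=False)}",
            "---",
        ])
        cost = len(block) + 1  # +1 for the newline that joins blocks
        if used + cost > max_chars:
            break
        parts.append(block)
        used += cost
    return "\n".join(parts)


if __name__ == "__main__":
    demo = [{"url": "https://example.com", "ai_extraction": {"content_summary": "demo", "key_content": {"k": "v"}}}]
    print(build_context(demo))
```

The trade-off is that later sources are dropped entirely rather than partially included, which keeps every block the model sees complete.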


In [None]:
# Generate AI summary based on extracted data
print("\n" + "="*80)
print("🤖 GENERATING AI-POWERED SUMMARY")
print("="*80 + "\n")

# Prepare data for AI analysis
relevant_pages = [p for p in scraped_data if p.get('relevance_score', 0) >= 5]

if not relevant_pages:
    print("⚠ No relevant data found to summarize.")
else:
    # Build context from extracted data
    context_parts = []
    for idx, page in enumerate(relevant_pages[:10], 1):
        ai_extraction = page.get('ai_extraction', {})
        content = ai_extraction.get('key_content', {})
        summary = ai_extraction.get('content_summary', '')
        
        context_parts.append(f"Source {idx} ({page['url']}):")
        context_parts.append(f"Summary: {summary}")
        context_parts.append(f"Data: {json.dumps(content, indent=2)}")
        context_parts.append("---")
    
    context = "\n".join(context_parts)
    
    # Generate summary with AI
    summary_prompt = '''You are analyzing data extracted from a web crawling operation.

USER'S QUESTION/OBJECTIVE:
''' + crawl_objective + '''

EXTRACTED DATA FROM ''' + str(min(len(relevant_pages), 10)) + ''' RELEVANT PAGES:
''' + context[:8000] + '''

YOUR TASK:
Based on the extracted data above, provide a comprehensive answer to the user's question/objective.

Format your response as:
1. **Executive Summary**: A brief overview answering the main question
2. **Key Findings**: Bullet points of the most important information discovered
3. **Detailed Analysis**: Full sentences and paragraphs elaborating on the findings
4. **Data Points**: Specific facts, numbers, or details extracted
5. **Conclusion**: Final thoughts and recommendations

Be thorough, accurate, and emphasize information most relevant to the user's objective.
Use clear headings, bullet points, and well-structured paragraphs.'''
    
    try:
        print("🤖 Generating comprehensive analysis...\n")
        response = ollama.generate(
            model="deepseek-r1:14b",
            prompt=summary_prompt
        )
        
        ai_summary = response['response'].strip()
        
        print("="*80)
        print("✨ AI-GENERATED COMPREHENSIVE ANALYSIS")
        print("="*80 + "\n")
        print(ai_summary)
        print("\n" + "="*80)
        
    except Exception as e:
        print(f"⚠ Error generating AI summary: {e}")
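

Reasoning models such as deepseek-r1 often return their chain of thought wrapped in `<think>...</think>` tags ahead of the actual answer, so the saved summary may begin with a long reasoning trace. The sketch below assumes that tag format and removes those blocks. `strip_reasoning` is a hypothetical helper, not part of the notebook or of the `ollama` package.

```python
import re

# Assumption: the model wraps its reasoning in <think>...</think> tags.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL | re.IGNORECASE)


def strip_reasoning(text: str) -> str:
    """Remove closed <think>...</think> blocks and trim surrounding whitespace."""
    return _THINK_BLOCK.sub("", text).strip()


raw = "<think>step 1\nstep 2</think>\n\n## Executive Summary\nDone."
print(strip_reasoning(raw))
```

In the cell above, this would be applied as `ai_summary = strip_reasoning(response['response'])`. Text without any tags passes through unchanged apart from the whitespace trim.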


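## 💾 Step 11: Save Results

Write the JSON data, the human-readable report, and the AI summary (if one was generated) to disk.


In [None]: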
# Save JSON data
with open('scraped_data.json', 'w', encoding='utf-8') as f:
    json.dump(output_data, f, indent=2, ensure_ascii=False)

print("✅ Saved: scraped_data.json")

# Save human-readable report
with open('analysis_report.md', 'w', encoding='utf-8') as f:
    f.write(report)

print("✅ Saved: analysis_report.md")

# Save AI summary if generated
if 'ai_summary' in locals():
    with open('ai_summary.md', 'w', encoding='utf-8') as f:
        f.write("# AI-Generated Analysis\n\n")
        f.write(f"**Objective**: {crawl_objective}\n\n")
        f.write(f"**Generated**: {datetime.now().strftime('%B %d, %Y at %I:%M %p')}\n\n")
        f.write("---\n\n")
        f.write(ai_summary)
    
    print("✅ Saved: ai_summary.md")

print("\n" + "="*80)
print("🎉 CRAWLING AND ANALYSIS COMPLETE!")
print("="*80)


## 📥 Step 12: Download Results (Optional)

Download the generated files to your local machine.


In [None]:
# For Google Colab - download files
try:
    from google.colab import files
    
    print("📥 Downloading files...\n")
    files.download('scraped_data.json')
    files.download('analysis_report.md')
    if 'ai_summary' in locals():
        files.download('ai_summary.md')
    
    print("\n✅ Files downloaded successfully!")
except ImportError:
    print("ℹ️ Not running in Google Colab. Files saved to current directory.")


---

## 🎓 How This Works

### Phase 1: Reconnaissance
- Crawls a small sample of pages to understand site structure
- AI analyzes page types and identifies high-value patterns
- Learns which URL patterns contain relevant information
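
As a rough illustration of the pattern-learning idea, the sketch below counts the first path segment of every sampled page that scored well. It is a deterministic stand-in for the AI analysis the notebook performs, not the notebook's own method. `learn_url_patterns`, `min_relevance`, and `top_n` are hypothetical names, and the `url` and `relevance_score` keys are assumed to match the page dictionaries used in the earlier cells.

```python
from collections import Counter
from urllib.parse import urlparse


def learn_url_patterns(pages, min_relevance=5, top_n=3):
    """Return the most common first path segments among high-relevance pages."""
    counts = Counter()
    for page in pages:
        if page.get("relevance_score", 0) < min_relevance:
            continue  # low-value pages do not contribute to a pattern
        segments = [s for s in urlparse(page["url"]).path.split("/") if s]
        if segments:
            counts["/" + segments[0] + "/"] += 1
    return [pattern for pattern, _ in counts.most_common(top_n)]


sample = [
    {"url": "https://example.com/docs/a", "relevance_score": 8},
    {"url": "https://example.com/docs/b", "relevance_score": 7},
    {"url": "https://example.com/blog/x", "relevance_score": 6},
    {"url": "https://example.com/about", "relevance_score": 2},
]
print(learn_url_patterns(sample))
```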

### Phase 2: Targeted Deep Crawl
- Focuses on high-value pages discovered in reconnaissance
- AI makes intelligent navigation decisions
- Prioritizes links matching learned patterns
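
One way to picture link prioritization is a crawl frontier backed by a priority queue: links that match a learned pattern are popped before everything else, and discovery order breaks ties. This is a minimal sketch under that assumption. The `Frontier` class and its substring matching are hypothetical, and the notebook's crawler may rank links differently (for example, by asking the LLM).

```python
import heapq


class Frontier:
    """Crawl queue that returns pattern-matching URLs first."""

    def __init__(self, patterns):
        self.patterns = list(patterns)
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker that preserves discovery order

    def push(self, url):
        if url in self._seen:
            return  # never queue the same URL twice
        self._seen.add(url)
        # heapq is a min-heap, so priority 0 is popped before priority 1
        priority = 0 if any(p in url for p in self.patterns) else 1
        heapq.heappush(self._heap, (priority, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


frontier = Frontier(patterns=["/docs/"])
for link in [
    "https://example.com/about",
    "https://example.com/docs/a",
    "https://example.com/blog/x",
    "https://example.com/docs/b",
    "https://example.com/docs/a",  # duplicate, ignored
]:
    frontier.push(link)

crawl_order = [frontier.pop() for _ in range(len(frontier))]
print(crawl_order)
```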

### Schema-less Extraction
- No predefined extraction patterns
- AI determines what's important based on your objective
- Adapts extraction strategy to each page's content
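
Because no schema is enforced, the model's reply has to be parsed defensively: it may wrap the JSON in prose or return nothing usable. The sketch below shows one tolerant approach, assuming the reply contains at most one top-level JSON object. `parse_extraction` is a hypothetical helper, and only the `content_summary` and `key_content` keys are taken from the cells above.

```python
import json
import re


def parse_extraction(reply: str) -> dict:
    """Pull a JSON object out of a model reply, falling back to empty defaults."""
    defaults = {"content_summary": "", "key_content": {}}
    # Greedy match from the first "{" to the last "}" in the reply
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if not match:
        return defaults
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return defaults
    if not isinstance(parsed, dict):
        return defaults
    # Keys the model omitted keep their default values
    return {**defaults, **parsed}


reply = (
    "Sure, here is the extraction:\n"
    '{"content_summary": "Pricing page", "key_content": {"plan": "Pro", "price": "$20/mo"}}\n'
    "Hope that helps."
)
print(parse_extraction(reply))
```

Returning defaults instead of raising keeps a single malformed reply from aborting the crawl, at the cost of silently recording an empty extraction for that page.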

### AI-Powered Analysis
- Converts raw data into human-readable insights
- Generates comprehensive answers to your questions
- Emphasizes information relevant to your objective

---

## 🔧 Troubleshooting

**Ollama Connection Issues:**
- Ensure the Ollama server is running (re-run the `ollama serve` cell in Step 2)
- Confirm the model is downloaded: `!ollama list` should show `deepseek-r1:14b`
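
Both checks can also be done from Python by querying the server's `/api/tags` endpoint, which lists locally available models. The sketch below assumes Ollama is listening on its default address, `http://localhost:11434`. `list_ollama_models` and `has_model` are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=3.0):
    """Return the model names the server reports, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None  # server down, connection refused, or a non-JSON reply
    return [m.get("name", "") for m in payload.get("models", [])]


def has_model(names, wanted):
    """True if `wanted` is in the list returned by list_ollama_models."""
    return names is not None and wanted in names


names = list_ollama_models()
if names is None:
    print("Ollama server is not reachable; re-run the server cell.")
elif not has_model(names, "deepseek-r1:14b"):
    print("Server is up, but deepseek-r1:14b has not been pulled yet.")
else:
    print("Ollama is running and the model is available.")
```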

**Memory Issues:**
- Reduce `max_pages` parameter
- Use a smaller model (e.g., qwen2.5:7b)

**Slow Performance:**
- Expected for a 14B-parameter reasoning model such as deepseek-r1:14b, especially on a CPU-only runtime
- Switch to a GPU runtime in Colab (Runtime → Change runtime type)

---

## 📚 Credits

Built with:
- **Crawl4AI** - Intelligent web crawling
- **Ollama** - Local LLM inference
- **DeepSeek R1** - Advanced reasoning model
- **BeautifulSoup** - HTML parsing

---
