# Smart Web Search using AI

Web scraping in Python is evolving. Tools like BeautifulSoup and Selenium were long the standard, but the field is moving toward **LLM-powered, agentic scraping**: adaptive crawlers in place of rigid, rule-based bots. Modern AI agents work best with structured, markdown-formatted data, and these crawlers are built to produce it.

Scraping today requires a sophisticated toolkit: concurrent sessions, proxy rotation, identity management, and automated retries. Building all of this from the ground up is a heavy lift. That's where **Crawl4AI**, an open-source project, comes in. It packages this functionality into a single, easy-to-use Python library.

In this project, we'll install Crawl4AI, run a small deep crawl, and inspect the result objects it returns.

## Installation and Setup

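Let's start by installing Crawl4AI and `nest_asyncio` (needed because Jupyter already runs its own event loop). We then run `crawl4ai-setup` to finish the installation and `crawl4ai-doctor` to check that crawling works.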

In [4]:
%%capture
!pip install -U crawl4ai
!pip install nest_asyncio

In [5]:
%%capture
!crawl4ai-setup

In [6]:
!crawl4ai-doctor

[1;36m[[0m[36mINIT[0m[1;36m][0m[36m...[0m[36m. → Running Crawl4AI health check[0m[36m...[0m[36m [0m
[1;36m[[0m[36mINIT[0m[1;36m][0m[36m...[0m[36m. → Crawl4AI [0m[1;36m0.7[0m[36m.[0m[1;36m4[0m[36m [0m
[1;36m[[0m[36mTEST[0m[1;36m][0m[36m...[0m[36m. ℹ Testing crawling capabilities[0m[36m...[0m[36m [0m
[1;36m[[0m[36mEXPORT[0m[1;36m][0m[36m.. ℹ Exporting media [0m[1;36m([0m[36mPDF/MHTML/screenshot[0m[1;36m)[0m[36m took [0m[1;36m6.[0m[36m38s [0m
[1;32m[[0m[32mFETCH[0m[1;32m][0m[32m...[0m[32m ↓ [0m[4;32mhttps://crawl4ai.com[0m[32m                                               [0m
[32m| [0m[32m✓[0m[32m | ⏱: [0m[1;32m12.[0m[32m20s [0m
[1;32m[[0m[32mSCRAPE[0m[1;32m][0m[32m.. ◆ [0m[4;32mhttps://crawl4ai.com[0m[32m                                               [0m
[32m| [0m[32m✓[0m[32m | ⏱: [0m[1;32m0.[0m[32m16s [0m
[1;32m[[0m[32mCOMPLETE[0m[1;32m][0m[32m ● [0m[4;32mhttps://craw

## Implementing Smart Web Crawling

Now let's implement the crawler using Crawl4AI. We'll configure a breadth-first (BFS) deep crawl that starts at one page, follows only internal links up to two levels deep, and stops after five pages.

In [11]:
import asyncio
import nest_asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

# Jupyter already runs an event loop; patch asyncio so asyncio.run() works inside it
nest_asyncio.apply()

async def main():
    # Configure a breadth-first crawl: up to 2 levels deep, internal links only, at most 5 pages
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            include_external=False,
            max_pages=5
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://www.wikipedia.org", config=config)

        print(f"Crawled {len(results)} pages in total")

        # Access individual results
        for result in results[:3]:  # Show first 3 results
            print(f"URL: {result.url}")
            print(f"Depth: {result.metadata.get('depth', 0)}")

        return results

In [12]:
search_results = asyncio.run(main())

Crawled 5 pages in total
URL: https://www.wikipedia.org
Depth: 0
URL: https://ru.wikipedia.org
Depth: 1
URL: https://de.wikipedia.org
Depth: 1
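
To make the `max_depth` and `max_pages` limits concrete, here is a minimal, library-free sketch of a breadth-first crawl over an in-memory link graph. It is not Crawl4AI's implementation: `bfs_crawl` and the `LINKS` graph are hypothetical names invented for illustration, and a real strategy also has to fetch pages and filter links.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: URL -> outgoing links.
LINKS = {
    "https://example.org": ["https://example.org/a", "https://example.org/b"],
    "https://example.org/a": ["https://example.org/c"],
    "https://example.org/b": ["https://example.org/c", "https://example.org/d"],
    "https://example.org/c": [],
    "https://example.org/d": [],
}

def bfs_crawl(start, links, max_depth=2, max_pages=5):
    """Visit pages level by level, stopping at max_depth or max_pages."""
    visited = {start}
    queue = deque([(start, 0)])
    order = []
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth == max_depth:
            continue  # do not expand links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, depth + 1))
    return order

for url, depth in bfs_crawl("https://example.org", LINKS, max_depth=2, max_pages=4):
    print(depth, url)
```

Because the queue is first-in, first-out, every depth-1 page is visited before any depth-2 page, which matches the ordering in the output above (the start page at depth 0, followed by depth-1 pages).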


## Exploring the Crawled Results

Let's examine the results from our web crawling operation to understand the structure and content that was extracted.

In [15]:
len(search_results)

5

In [19]:
search_results[0].metadata

{'title': 'Wikipedia',
 'description': 'Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.',
 'keywords': None,
 'author': None,
 'og:title': 'Wikipedia, the free encyclopedia',
 'og:type': 'website',
 'og:description': 'Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.',
 'og:image': 'https://upload.wikimedia.org/wikipedia/en/thumb/8/80/Wikipedia-logo-v2.svg/2244px-Wikipedia-logo-v2.svg.png',
 'depth': 0,
 'parent_url': None}
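
The `depth` and `parent_url` keys record how each page was reached, which is enough to rebuild the crawl tree. Below is a minimal sketch that uses plain dictionaries as stand-ins for each result's `url` and `metadata`. The `children_by_parent` helper is hypothetical, and the `parent_url` values for the depth-1 pages are assumed for illustration rather than taken from the run above.

```python
from collections import defaultdict

# Stand-in records: (url, metadata) pairs shaped like the crawl results.
# The parent_url values for the depth-1 pages are assumed for illustration.
records = [
    ("https://www.wikipedia.org", {"depth": 0, "parent_url": None}),
    ("https://ru.wikipedia.org", {"depth": 1, "parent_url": "https://www.wikipedia.org"}),
    ("https://de.wikipedia.org", {"depth": 1, "parent_url": "https://www.wikipedia.org"}),
]

def children_by_parent(records):
    """Group crawled URLs under the page that linked to them."""
    tree = defaultdict(list)
    for url, metadata in records:
        tree[metadata.get("parent_url")].append(url)
    return dict(tree)

tree = children_by_parent(records)
print(tree[None])                         # start page(s), which have no parent
print(tree["https://www.wikipedia.org"])  # pages discovered from the start page
```

With real results, the same helper could be fed `[(r.url, r.metadata) for r in search_results]`.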

In [20]:
search_results[0].dispatch_result

DispatchResult(task_id='8ce2a642-dc13-4b6d-8a57-4f0303ecd714', memory_usage=0.0, peak_memory=0.0, start_time=1758738217.9210482, end_time=1758738219.2423272, error_message='')
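
The `start_time` and `end_time` fields look like Unix timestamps in seconds, so their difference gives how long the task ran. A small sketch using the values printed above; reading the fields as epoch seconds is an assumption based on their magnitude.

```python
from datetime import datetime, timezone

# Values copied from the DispatchResult shown above.
start_time = 1758738217.9210482
end_time = 1758738219.2423272

# Assumption: both fields are seconds since the Unix epoch.
elapsed = end_time - start_time
started_at = datetime.fromtimestamp(start_time, tz=timezone.utc)

print(f"Task took {elapsed:.2f}s")
print(f"Started at {started_at.isoformat()}")
```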

In [25]:
type(search_results[0].links)

dict

In [27]:
search_results[0].links.keys()

dict_keys(['internal', 'external'])

In [38]:
from IPython.display import display_json

display_json(search_results[0].links['internal'][0:5], raw=True)
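
Each side of the `links` dictionary is a list, so a common next step is to pull out the link targets for further crawling. Here is a minimal sketch with made-up sample data. It assumes each entry is a dictionary with an `href` key, which should be checked against the actual output of the cell above; `unique_hrefs` is a hypothetical helper.

```python
# Stand-in for search_results[0].links; the entries are made-up samples and
# the "href"/"text" keys are an assumption about the link dictionaries.
links = {
    "internal": [
        {"href": "https://example.org/about", "text": "About"},
        {"href": "https://example.org/about", "text": "About us"},
        {"href": "https://example.org/help", "text": ""},
    ],
    "external": [
        {"href": "https://example.com/", "text": "Example"},
    ],
}

def unique_hrefs(entries):
    """Return link targets in first-seen order, skipping duplicates and blanks."""
    seen = set()
    hrefs = []
    for entry in entries:
        href = entry.get("href")
        if href and href not in seen:
            seen.add(href)
            hrefs.append(href)
    return hrefs

internal = unique_hrefs(links["internal"])
print(f"{len(internal)} unique internal links")
for href in internal:
    print(href)
```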

## Conclusion

This project used Crawl4AI to build the crawling core of a smart web search engine.

### Key Achievements:
- **Intelligent Crawling**: Used a BFS deep crawling strategy to explore a website systematically, up to 2 levels deep and capped at 5 pages
- **Structured Data Extraction**: Collected each page as a structured result object with metadata such as title, description, and crawl depth, alongside the markdown-formatted content Crawl4AI generates for AI processing
- **Link Discovery**: Automatically identified the links on each page and grouped them into internal and external sets for further exploration
- **Scalable Architecture**: Used Crawl4AI's async API (`AsyncWebCrawler` with async/await), which is designed for efficient concurrent web scraping
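
As an illustration of the async/await point, the sketch below shows the general pattern of running several fetches concurrently while capping how many are in flight. It is a generic `asyncio` example, not Crawl4AI's internals: `fake_fetch` and `fetch_all` are hypothetical names, and the "fetch" only sleeps instead of touching the network.

```python
import asyncio

async def fake_fetch(url):
    """Stand-in for a network request; sleeps briefly instead of fetching."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"

async def fetch_all(urls, max_concurrency=3):
    """Fetch all URLs concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            return await fake_fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))

urls = [f"https://example.org/page/{i}" for i in range(5)]
pages = asyncio.run(fetch_all(urls))
print(f"Fetched {len(pages)} pages")
```

In a notebook, `asyncio.run` works here only because `nest_asyncio.apply()` was called earlier; in a plain Python script no patching is needed.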

This foundation is a solid starting point for building more sophisticated AI-powered search and data extraction systems. The combination of configurable crawling strategies and structured output makes it well suited to feeding modern AI applications with high-quality web data.