In [1]:
# Install the package
%pip install -U crawl4ai

Collecting crawl4ai
  Using cached Crawl4AI-0.4.247-py3-none-any.whl.metadata (25 kB)
Collecting aiosqlite~=0.20 (from crawl4ai)
  Using cached aiosqlite-0.20.0-py3-none-any.whl.metadata (4.3 kB)
Collecting lxml~=5.3 (from crawl4ai)
  Using cached lxml-5.3.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.8 kB)
Collecting litellm>=1.53.1 (from crawl4ai)
  Downloading litellm-1.59.0-py3-none-any.whl.metadata (36 kB)
Collecting numpy<3,>=1.26.0 (from crawl4ai)
  Downloading numpy-2.2.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Collecting pillow~=10.4 (from crawl4ai)
  Using cached pillow-10.4.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (9.2 kB)
Collecting playwright>=1.49.0 (from crawl4ai)
  Using cached playwright-1.49.1-py3-none-manylinux1_x86_64.whl.metadata (3.5 kB)
Collecting python-dotenv~=1.0 (from crawl4ai)
  Using cached python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Collecting requests~=2.26 (from crawl4ai)
  Using cached requests

> **Note:** The following commands require sudo/admin privileges. Because a Jupyter notebook cannot prompt for a password, you may need to run them in a terminal outside the notebook.


In [None]:
# Run post-installation setup
!crawl4ai-setup

# Verify your installation
!crawl4ai-doctor

After running the above commands, you should see output similar to the following:

1. Post-installation setup:
   ```
   [COMPLETE] ● Database initialization completed successfully.
   [COMPLETE] ● Post-installation setup completed!
   ```

2. Installation verification:
   ```
   [COMPLETE] ● https://crawl4ai.com... | Status: True | Total: 17.10s
   [COMPLETE] ● ✅ Crawling test passed!
   ```

These messages indicate that the setup was successful and the installation is working correctly.
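
As a complementary check from inside the notebook, the standard library can report whether the package is visible to the current kernel. This is only a minimal sketch: the helper name `installed_version` is hypothetical and not part of crawl4ai, and it confirms that the distribution is installed, not that the browser setup succeeded.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if crawl4ai is installed in this kernel, otherwise None.
print(installed_version("crawl4ai"))
```

If this prints `None` right after a successful `%pip install`, restarting the kernel is usually the first thing to try.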


In [3]:
# Install pandas to view results easily in the notebook
%pip install pandas

Collecting pandas
  Using cached pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2024.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Using cached tzdata-2024.2-py2.py3-none-any.whl.metadata (1.4 kB)
Using cached pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)
Using cached pytz-2024.2-py2.py3-none-any.whl (508 kB)
Using cached tzdata-2024.2-py2.py3-none-any.whl (346 kB)
Installing collected packages: pytz, tzdata, pandas
Successfully installed pandas-2.2.3 pytz-2024.2 tzdata-2024.2
Note: you may need to restart the kernel to use updated packages.


In [1]:
# Load .env file
from dotenv import load_dotenv
load_dotenv()

# Hide httpx log messages below WARNING level
import logging
logging.getLogger("httpx").setLevel(logging.WARNING)

True
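
The LLM example later in this notebook reads `AZURE_OPENAI_DEPLOYMENT_NAME`, `AZURE_OPENAI_ENDPOINT`, and `AZURE_OPENAI_API_KEY` from the environment. A quick check after `load_dotenv()` can catch a missing value before any request is sent. The sketch below assumes only the standard library; the helper name `missing_env_vars` is hypothetical.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(
    names: Iterable[str], environ: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not environ.get(name)]


# Variable names taken from the LLM example cell in this notebook.
required = [
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
]

missing = missing_env_vars(required)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```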

## Example 1: Using crawl4ai without LLM

The following code demonstrates how to use crawl4ai to crawl a website and extract links without using any language model:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
import pandas as pd

async def crawl_website():
    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
        result = await crawler.arun(
            url="https://www1.hkej.com/features/topic/tag/%E8%99%9B%E5%B9%A3%E5%8B%95%E6%85%8B",
            bypass_cache=False,
            verbose=False,
        )

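        # Get all internal links from the page
        df = pd.DataFrame(result.links["internal"])
        print(df)  # in a notebook, display(df) renders a richer table

# For a script; inside a notebook the event loop is already running, so use `await crawl_website()` instead
asyncio.run(crawl_website())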
```

This example crawls the specified URL and displays all internal links found on the page using a pandas DataFrame.


In [2]:
import asyncio
from crawl4ai import AsyncWebCrawler
import os
import json

import pandas as pd

In [3]:
# Try grabbing news links from the hkej website

async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
    result = await crawler.arun(
        url="https://www1.hkej.com/features/topic/tag/%E8%99%9B%E5%B9%A3%E5%8B%95%E6%85%8B",
        bypass_cache=False,
        verbose=False,
    )

    # Step 1: Get all internal links from the page
    df = pd.DataFrame(result.links["internal"])
    display(df)

    # Step 2: Keep only the news links (URLs containing "article")
    df_filtered = df[df['href'].str.contains('article', case=False, na=False)]
    display(df_filtered)

[INIT].... → Crawl4AI 0.4.247
[FETCH]... ↓ https://www1.hkej.com/features/topic/tag/%E8%99%9B... | Status: True | Time: 0.01s
[COMPLETE] ● https://www1.hkej.com/features/topic/tag/%E8%99%9B... | Status: True | Total: 0.02s


Unnamed: 0,href,text,title,base_domain
0,https://subscribe.hkej.com,訂閱 / 續訂,,hkej.com
1,https://subscribe.hkej.com/register,註冊,,hkej.com
2,https://subscribe.hkej.com/member/login?forwar...,登入,,hkej.com
3,https://www2.hkej.com/landing/index,,,hkej.com
4,https://www2.hkej.com/weather/weather.php?loca...,,,hkej.com
...,...,...,...,...
120,https://www2.hkej.com/info/privacy,私隱條款,,hkej.com
121,https://www2.hkej.com/info/disclaimer,免責聲明,,hkej.com
122,https://www.hkej.com/ratecard/html/index.html,廣告查詢,,hkej.com
123,https://www2.hkej.com/info/jobs,加入信報,,hkej.com


Unnamed: 0,href,text,title,base_domain
30,https://www1.hkej.com/features/article?q=%23%E...,,,hkej.com
31,https://www1.hkej.com/features/article?q=%23%E...,,,hkej.com
32,https://www1.hkej.com/features/article?q=%23%E...,,,hkej.com
33,https://www1.hkej.com/features/article?q=%23%E...,,,hkej.com
34,https://www1.hkej.com/features/article?q=%23%E...,,,hkej.com
35,https://www1.hkej.com/features/article?q=%23%E...,,,hkej.com
36,https://www1.hkej.com/features/article?q=%23%E...,,,hkej.com
37,https://www1.hkej.com/features/article?q=%23%E...,,,hkej.com
38,https://www1.hkej.com/features/article?q=%23%E...,,,hkej.com
39,https://www1.hkej.com/features/article?q=%23%E...,,,hkej.com
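
The same filtering can be done without pandas when only the link records are needed. The sketch below assumes each record is a dict with an `href` key, as the table columns above suggest. The helper name `filter_article_links` and the sample URLs are made up for illustration, and the duplicate removal is an addition that the pandas cell does not perform.

```python
from typing import Dict, List


def filter_article_links(
    links: List[Dict[str, str]], keyword: str = "article"
) -> List[Dict[str, str]]:
    """Keep records whose href contains the keyword (case-insensitive), dropping duplicate hrefs."""
    seen = set()
    kept = []
    for link in links:
        href = link.get("href") or ""  # treat a missing or None href as empty
        if keyword.lower() in href.lower() and href not in seen:
            seen.add(href)
            kept.append(link)
    return kept


# Illustrative records shaped like the rows above; these hrefs are invented.
sample_links = [
    {"href": "https://example.com/features/article?id=1", "text": "", "base_domain": "example.com"},
    {"href": "https://example.com/info/privacy", "text": "Privacy", "base_domain": "example.com"},
    {"href": "https://example.com/features/Article?id=2", "text": "", "base_domain": "example.com"},
    {"href": "https://example.com/features/article?id=1", "text": "", "base_domain": "example.com"},
]

for link in filter_article_links(sample_links):
    print(link["href"])
```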


## Example 2: Using crawl4ai with LLM and Pydantic Schema

This example demonstrates how to use crawl4ai with a language model (LLM) and a Pydantic schema to extract specific information from a website.

```python
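import asyncio
import json

from crawl4ai import AsyncWebCrawler, LLMExtractionStrategy, CrawlerRunConfig, CacheMode

# crawler_config is the CrawlerRunConfig built in the code cell below;
# it carries the LLMExtractionStrategy and the Pydantic schema.

async def crawl_with_llm():
    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
        result = await crawler.arun(
            url="https://www1.hkej.com/features/topic/tag/%E8%99%9B%E5%B9%A3%E5%8B%95%E6%85%8B",
            config=crawler_config,
        )

        # extracted_content is a JSON string, so parse it before printing each item
        for item in json.loads(result.extracted_content):
            print(json.dumps(item, indent=2, ensure_ascii=False))

# For a script; inside a notebook use `await crawl_with_llm()` instead
asyncio.run(crawl_with_llm())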
```

This code uses the `LLMExtractionStrategy` with a custom Pydantic schema to extract specific information (title, image URL, article URL, and date) from news articles. The LLM is guided by a simple prompt to focus on these details during the extraction process.


In [9]:
from pydantic import BaseModel, Field
from crawl4ai import LLMExtractionStrategy, CrawlerRunConfig, CacheMode

# Schema for one news item; the LLM is asked to return a list of these
class NewsItem(BaseModel):
    title: str = Field(..., description="News title.")
    image: str = Field(..., description="Url of the news thumbnail.")
    url: str = Field(..., description="Url of the news.")
    date: str = Field(..., description="Date of the news.")

# Azure OpenAI settings loaded from the .env file
deployment_name = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME")
provider = f"azure/{deployment_name}"
api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
api_token = os.getenv("AZURE_OPENAI_API_KEY")
extra_args = {"temperature": 0.7}

llm_strategy = LLMExtractionStrategy(
    api_base=api_base,
    provider=provider,
    api_token=api_token,
    schema=NewsItem.model_json_schema(),
    extraction_type="schema",
    instruction="""From the crawled content, extract all the news title, image url, url and date.""",
    extra_args=extra_args,
    verbose=False,
)

crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=1,
        page_timeout=80000,
        extraction_strategy=llm_strategy,
        verbose=False
    )
    
async with AsyncWebCrawler() as crawler:
    # crawler_config already sets cache_mode and verbose, so they are not repeated here
    result = await crawler.arun(
        url="https://www1.hkej.com/features/topic/tag/%E8%99%9B%E5%B9%A3%E5%8B%95%E6%85%8B",
        config=crawler_config,
    )
    # The JSON output is stored in 'extracted_content'
    data = json.loads(result.extracted_content)
    df = pd.DataFrame(data)
    display(df)

    llm_strategy.show_usage()


[INIT].... → Crawl4AI 0.4.247
[FETCH]... ↓ https://www1.hkej.com/features/topic/tag/%E8%99%9B... | Status: True | Time: 6.65s
[SCRAPE].. ◆ Processed https://www1.hkej.com/features/topic/tag/%E8%99%9B... | Time: 102ms


17:45:18 - LiteLLM:INFO: utils.py:2820 - 
LiteLLM completion() model= gpt-4o-mini; provider = azure
INFO:LiteLLM:
LiteLLM completion() model= gpt-4o-mini; provider = azure
INFO:httpx:HTTP Request: POST https://azugenaia201.openai.azure.com/openai/deployments/gpt-4o-mini/chat/completions?api-version=2024-08-01-preview "HTTP/1.1 200 OK"
17:45:41 - LiteLLM:INFO: utils.py:952 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler


[EXTRACT]. ■ Completed for https://www1.hkej.com/features/topic/tag/%E8%99%9B... | Time: 23.121848894050345s
[COMPLETE] ● https://www1.hkej.com/features/topic/tag/%E8%99%9B... | Status: True | Total: 29.90s


Unnamed: 0,title,image,url,date,error
0,$TRUMP幣熱炒 存利益衝突,https://static.hkej.com/hkej/images/2025/01/20...,https://www1.hkej.com/features/article?q=%23%E...,2025年1月20日,False
1,HashKey Exchange去年交易量升85% 料今年達收支平衡,https://static.hkej.com/hkej/images/2025/01/15...,https://www1.hkej.com/features/article?q=%23%E...,2025年1月15日,False
2,Z世代寧買比特幣更勝置業,https://static.hkej.com/hkej/images/2025/01/15...,https://www1.hkej.com/features/article?q=%23%E...,2025年1月15日,False
3,歐股靠穩 比特幣失守9.2萬美元,https://static.hkej.com/hkej/images/2025/01/10...,https://www1.hkej.com/features/article?q=%23%E...,2025年1月10日,False
4,【ETF特搜】比特幣重上10萬美元 相關ETF升逾2%,https://static.hkej.com/hkej/images/2025/01/07...,https://www1.hkej.com/features/article?q=%23%E...,2025年1月7日,False
5,馬斯克X賬號改名 迷因幣爆炒,https://static.hkej.com/hkej/images/2025/01/03...,https://www1.hkej.com/features/article?q=%23%E...,2025年1月3日,False
6,比特幣獲唱好今年衝上20萬美元,https://static.hkej.com/hkej/images/2025/01/02...,https://www1.hkej.com/features/article?q=%23%E...,2025年1月2日,False
7,聯儲局官員戴利:加密貨幣並非黃金,https://static.hkej.com/hkej/images/2024/12/31...,https://www1.hkej.com/features/article?q=%23%E...,2024年12月31日,False
8,星洲年批13加密幣牌照 遠超香港,https://static.hkej.com/hkej/images/2024/12/27...,https://www1.hkej.com/features/article?q=%23%E...,2024年12月27日,False
9,比特幣瘋炒 2024最癲交易,https://static.hkej.com/hkej/images/2024/12/24...,https://www1.hkej.com/features/article?q=%23%E...,2024年12月24日,False



=== Token Usage Summary ===
Type                   Count
------------------------------
Completion             2,878
Prompt                12,430
Total                 15,308

=== Usage History ===
Request #    Completion       Prompt        Total
------------------------------------------------
1                 2,878       12,430       15,308
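
The extracted rows above carry an `error` flag and dates as strings such as `2025年1月20日`. For sorting or filtering by date, it can help to convert them into `datetime.date` objects. The sketch below is one possible post-processing step, assuming the JSON has the shape shown in the table. The helper names `parse_cjk_date` and `clean_records` are hypothetical, and the sample string stands in for `result.extracted_content`, with two titles and dates copied from the output and the URLs left out.

```python
import json
import re
from datetime import date
from typing import List, Optional

# Matches dates written as <year>年<month>月<day>日
DATE_PATTERN = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def parse_cjk_date(text: str) -> Optional[date]:
    """Convert a string like '2025年1月20日' to a date, or None if it cannot be parsed."""
    match = DATE_PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # for example, month 13
        return None


def clean_records(extracted_content: str) -> List[dict]:
    """Parse the JSON string, skip rows flagged as errors, and add a parsed date."""
    cleaned = []
    for row in json.loads(extracted_content):
        if row.get("error"):
            continue
        parsed = parse_cjk_date(row.get("date") or "")
        cleaned.append({**row, "parsed_date": parsed})
    return cleaned


# Stand-in for result.extracted_content.
sample = json.dumps(
    [
        {"title": "Z世代寧買比特幣更勝置業", "date": "2025年1月15日", "error": False},
        {"title": "比特幣瘋炒 2024最癲交易", "date": "2024年12月24日", "error": False},
    ],
    ensure_ascii=False,
)

for record in clean_records(sample):
    print(record["parsed_date"], record["title"])
```

Returning `None` for unparseable dates keeps the row rather than discarding it, since the LLM may format some dates differently.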
