# Tutorial for Craw4AI use
`https://docs.crawl4ai.com/core/quickstart/`

## Generating Markdown 
By default, Crawl4AI automatically generates Markdown from each crawled page. However, the exact output depends on whether you specify a markdown generator or content filter.
To customize the markdown generators, there has different ways to filter the content, for example the `PruningContentFilter`

In [1]:
from crawl4ai import AsyncWebCrawler

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://news.ycombinator.com")
    print(type(result))
    print(result.markdown)

[INIT].... → Crawl4AI 0.5.0.post8
[FETCH]... ↓ https://news.ycombinator.com... | Status: True | Time: 1.20s
[SCRAPE].. ◆ https://news.ycombinator.com... | Time: 0.091s
[COMPLETE] ● https://news.ycombinator.com... | Status: True | Total: 1.30s
<class 'crawl4ai.async_webcrawler.CrawlResultContainer'>
| [![](https://news.ycombinator.com/y18.svg)](https://news.ycombinator.com) | **[Hacker News](https://news.ycombinator.com/news)** [new](https://news.ycombinator.com/newest) | [past](https://news.ycombinator.com/front) | [comments](https://news.ycombinator.com/newcomments) | [ask](https://news.ycombinator.com/ask) | [show](https://news.ycombinator.com/show) | [jobs](https://news.ycombinator.com/jobs) | [submit](https://news.ycombinator.com/submit) |  [login](https://news.ycombinator.com/login?goto=news)  
---|---|---  
| 1. | [](https://news.ycombinator.com/vote?id=43734953&how=up&goto=news)| [A Map of British Dialects](https://starkeycomics.com/2023/11/07/map-of-british-english-dialects/) (

In [None]:
from crawl4ai import AsyncWebCrawler
##########################################################################
#  configs
from crawl4ai.content_filter_strategy import PruningContentFilter # this method ranks items by their importance from 0 to 1, any item blow the threshold will be cut
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai import CacheMode
##########################################################################

# apply configs
from crawl4ai import CrawlerRunConfig


# configs
md_generator = DefaultMarkdownGenerator(
    # how to process the content/response
    content_filter=PruningContentFilter(min_word_threshold=0.3,threshold_type="type")
)


# apply configs
configs = CrawlerRunConfig(
    markdown_generator=md_generator,
    cache_mode=CacheMode.BYPASS # no cache
)

# run crawler
async with AsyncWebCrawler() as crawler:
    response = await crawler.arun("https://news.ycombinator.com", config=configs)
    print(len(response.markdown.raw_markdown))
    print(len(response.markdown.fit_markdown))

[INIT].... → Crawl4AI 0.5.0.post8
[FETCH]... ↓ https://news.ycombinator.com... | Status: True | Time: 1.36s
[SCRAPE].. ◆ https://news.ycombinator.com... | Time: 0.176s
[COMPLETE] ● https://news.ycombinator.com... | Status: True | Total: 1.55s
17382
10415


## Simple data Extraction css-based
It allows extract structured data like josn usig CSS or XPath selectors
it supports automatic data extraction with LLM

In [None]:
#!/usr/bin/env python3

# tools + config tools
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig, # 伪装我们的爬虫
    CrawlerRunConfig,
    CacheMode,
    LLMConfig, # parse data config
)

# llm to parse data
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# config our llm
from pydantic import BaseModel, Field

# save data as json
import json

# load API from env
import os

# run asy tool
import asyncio

# handle jupyter issue
import nest_asyncio

# 应用 nest_asyncio 补丁来允许在已有事件循环中运行异步代码
nest_asyncio.apply()

# Create a .env file to add your API in the current dir
from dotenv import load_dotenv
load_dotenv() 

async def main():
    # to use a llm for extraction, some of them need an API key. Here we use deepseek
    api = os.getenv("Deepseek_API")
    provider = "deepseek/deepseek-chat"
    llm_choice = LLMConfig(provider=provider, api_token=api)

    # configs for our LLM: what kind of information need to be extracted, here we will extract topn books from Douban, a chinese online community
    class LLM(BaseModel):
        # structure LLM's searche
        title: str = Field(...,description="书名")
        rank: str = Field(...,description="排名")
        comment: str = Field(...,description="书的简介")


    # configs for llm strategy
    llm_strategy = LLMExtractionStrategy(
        llm_config=llm_choice,
        schema=LLM.model_json_schema(), # method of extraction
        extraction_type="schema",
        verbose=True,
        instruction="请从抓取的内容中提取及其每本书的名字,其排名/评分/简介，并以 JSON 列表形式返回。"
        # extra_arg = {"temperature":0, "max_tokens":n}
    )

    # configs for better crawler
    browser_config = BrowserConfig(headless=True)
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=llm_strategy,
        page_timeout=3000,
        word_count_threshold=1
    )

    # run our crawler
    async with AsyncWebCrawler(config=browser_config) as crawler:
        for i in range(10):
            result = await crawler.arun(url=f"https://book.douban.com/top250?start={i}", config=crawler_config)

            # read and save data
            json_Data = json.loads(result.extracted_content)
            with open(f"./{i}_page.json","w") as f:
                json.dump(json_Data,f,ensure_ascii=False, indent=4)

asyncio.get_event_loop().run_until_complete(main())

[INIT].... → Crawl4AI 0.5.0.post8
[FETCH]... ↓ https://book.douban.com/top250?start=0... | Status: True | Time: 2.99s
[SCRAPE].. ◆ https://book.douban.com/top250?start=0... | Time: 0.062s
[LOG] Call LLM for https://book.douban.com/top250?start=0 - block index: 0
[LOG] Extracted 25 blocks from URL: https://book.douban.com/top250?start=0 block index: 0
[EXTRACT]. ■ Completed for https://book.douban.com/top250?start=0... | Time: 56.607991648999814s
[COMPLETE] ● https://book.douban.com/top250?start=0... | Status: True | Total: 59.67s
[FETCH]... ↓ https://book.douban.com/top250?start=1... | Status: True | Time: 1.16s
[SCRAPE].. ◆ https://book.douban.com/top250?start=1... | Time: 0.057s
[LOG] Call LLM for https://book.douban.com/top250?start=1 - block index: 0


In [None]:
# Customization of crawler: BrowserConfig and CrawlerRunConfig
# BrowserConfig: headless(no interface), full UI(with interface), user agent(behave like a user by sending chrome type, os... to avoid being deteceted as webCrawler), JavaScript control
# use example: headless + random UA, host IP

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    """
        What is cache: 缓存就是爬虫程序在请求访问一个网站的时候
        会先确认是否有有缓存存在，如果缓存存在且有效，就直接读取本地缓存
        如果有一个条件不满足就正常请求，并且储存缓存,这样子避免重复下载
        爬虫软件的工作流程：发出请求访问服务器，获得响应，解析响应，储存数据，重复这个过程
    """
    # confi
    browser_conf = BrowserConfig(headless=True, user_agent="Mozilla/5.0 (iPhone; CPU iPhone OS 14_7_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Mobile/15E148 Safari/604.1")
    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS 
        # decide if save cache or not
    )

    # set up crawler
    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url = "https://book.douban.com/top250",
            config=run_conf
        )
        print(result.markdown)
   

if __name__ == "__main__":
    asyncio.run(main())