## [crawl4ai](https://docs.crawl4ai.com/core/installation/)

In [1]:
%%capture
! pip install crawl4ai

In [2]:
%%capture
! crawl4ai-setup


In [3]:
%%capture
! crawl4ai-doctor

In [1]:
import nest_asyncio
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

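nest_asyncio.apply()  # allow nested execution inside the notebook's already-running event loop

# Declare a module-level variable so the crawl result outlives main()
result = None

async def main():
    global result  # assign to the module-level variable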
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            # url="https://www.example.com",
            url="https://quotes.toscrape.com/",
        )
        print(result.markdown[:300])

        return result.markdown

final = await main()

# `result` stays available after this cell
# e.g. print(result.html) or result.metadata


[INIT].... → Crawl4AI 0.5.0.post8
[FETCH]... ↓ https://quotes.toscrape.com/... | Status: True | Time: 1.35s
[SCRAPE].. ◆ https://quotes.toscrape.com/... | Time: 0.054s
[COMPLETE] ● https://quotes.toscrape.com/... | Status: True | Total: 1.42s
#  [Quotes to Scrape](https://quotes.toscrape.com/)
[Login](https://quotes.toscrape.com/login)
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” by Albert Einstein [(about)](https://quotes.toscrape.com/author/Albert-Einstein)
Tags: [c


In [2]:
final

"#  [Quotes to Scrape](https://quotes.toscrape.com/)\n[Login](https://quotes.toscrape.com/login)\n“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” by Albert Einstein [(about)](https://quotes.toscrape.com/author/Albert-Einstein)\nTags: [change](https://quotes.toscrape.com/tag/change/page/1/) [deep-thoughts](https://quotes.toscrape.com/tag/deep-thoughts/page/1/) [thinking](https://quotes.toscrape.com/tag/thinking/page/1/) [world](https://quotes.toscrape.com/tag/world/page/1/)\n“It is our choices, Harry, that show what we truly are, far more than our abilities.” by J.K. Rowling [(about)](https://quotes.toscrape.com/author/J-K-Rowling)\nTags: [abilities](https://quotes.toscrape.com/tag/abilities/page/1/) [choices](https://quotes.toscrape.com/tag/choices/page/1/)\n“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” by Albert Einstein [(about)](ht

In [7]:
import re

# `final` holds the Markdown text returned by main(), not raw HTML
pattern = r'“(.*?)” by ([\w\.\-\' ]+)'

matches = re.findall(pattern, final)

# Check the results
for quote, author in matches:
    print(f'"{quote}" - {author}')


"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking." - Albert Einstein 
"It is our choices, Harry, that show what we truly are, far more than our abilities." - J.K. Rowling 
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle." - Albert Einstein 
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid." - Jane Austen 
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring." - Marilyn Monroe 
"Try not to become a man of success. Rather become a man of value." - Albert Einstein 
"It is better to be hated for what you are than to be loved for what you are not." - André Gide 
"I have not failed. I've just found 10,000 ways that won't work." - Thomas A. Edison 
"A woman is like a tea bag; you never know how strong it is until it's in hot water." - Elea
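Each author name in the output above ends with a trailing space, because the character class in the pattern includes a space and keeps matching up to the `[` of the `[(about)]` link. A minimal sketch of a tighter pattern follows. It assumes the Markdown always places ` [(about)]` right after the author name; `sample_markdown` and `extract_quotes` are hypothetical names used only for illustration.

```python
import re

# Hypothetical sample that mimics two lines of the Markdown shown above.
sample_markdown = (
    "“It is our choices, Harry, that show what we truly are, far more than our abilities.” "
    "by J.K. Rowling [(about)](https://quotes.toscrape.com/author/J-K-Rowling)\n"
    "“It is better to be hated for what you are than to be loved for what you are not.” "
    "by André Gide [(about)](https://quotes.toscrape.com/author/Andre-Gide)\n"
)

# The lazy author group stops at the literal " [(about)]", so no trailing space is captured.
QUOTE_PATTERN = re.compile(r"“(.*?)” by ([\w.\-' ]+?) \[\(about\)\]")

def extract_quotes(markdown: str) -> list[tuple[str, str]]:
    """Return (quote, author) pairs found in the Markdown text."""
    return QUOTE_PATTERN.findall(markdown)

quotes = extract_quotes(sample_markdown)
for quote, author in quotes:
    print(f'"{quote}" - {author}')
```

In the notebook, the same function could be called as `extract_quotes(final)`.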

In [8]:
result

CrawlResultContainer([CrawlResult(url='https://quotes.toscrape.com/', html='<!DOCTYPE html><html lang="en"><head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n    \n    \n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“The world as we have created it is a process of 

In [9]:
from bs4 import BeautifulSoup

# Extract only the text inside the <body> tag
soup = BeautifulSoup(result.html, "html.parser")
body_text = soup.body.get_text(separator="\n", strip=True)

print(body_text)


Quotes to Scrape
Login
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by
Albert Einstein
(about)
Tags:
change
deep-thoughts
thinking
world
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
by
J.K. Rowling
(about)
Tags:
abilities
choices
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
by
Albert Einstein
(about)
Tags:
inspirational
life
live
miracle
miracles
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
by
Jane Austen
(about)
Tags:
aliteracy
books
classic
humor
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
by
Marilyn Monroe
(about)
Tags:
be-yourself
inspirational
“Try not to become a man of success. Rather become a man of value.”
by
Albert Einstein
(about)
Tags:
adulthood
success

In [10]:
print(soup)

<!DOCTYPE html>
<html lang="en"><head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="ta
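The markup above shows that each quote sits in a `span` with class `text` and each author in a `small` with class `author`, so the pairs can be read from the structure instead of from flattened text. The sketch below uses only the standard-library `html.parser` so that it runs without extra packages. `QuoteParser` and `sample_html` are hypothetical names, and the sample is a shortened copy of the printed HTML.

```python
from html.parser import HTMLParser

class QuoteParser(HTMLParser):
    """Collect (quote, author) pairs from quotes.toscrape.com-style markup."""

    def __init__(self):
        super().__init__()
        self.pairs = []      # finished (quote, author) tuples
        self._field = None   # "text" or "author" while inside a target tag
        self._quote = ""     # most recent quote text seen

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "text" in classes:
            self._field = "text"
        elif tag == "small" and "author" in classes:
            self._field = "author"

    def handle_data(self, data):
        if self._field == "text":
            self._quote = data.strip()
        elif self._field == "author":
            self.pairs.append((self._quote, data.strip()))

    def handle_endtag(self, tag):
        if tag in ("span", "small"):
            self._field = None

sample_html = """
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
</div>
"""

parser = QuoteParser()
parser.feed(sample_html)
parser.close()
print(parser.pairs)
```

With BeautifulSoup already loaded as `soup`, selecting `span.text` and `small.author` inside each `div.quote` would likely be the shorter route; the standard-library version is shown only to keep the example self-contained.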

### Extracting content directly from the rendered HTML

In [11]:
from playwright.async_api import async_playwright
# Extract directly from the HTML after JavaScript has rendered it

url = "https://comic.naver.com/webtoon/weekday"
# url = "https://quotes.toscrape.com/"
# url = 'https://example.com'

async def test_browser():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        print(f'Title: {await page.title()}')
        await browser.close()

asyncio.run(test_browser())

Title: 요일전체 : 네이버 웹툰


## **[quickstart](https://docs.crawl4ai.com/core/quickstart/)**

In [None]:
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])  # Print first 300 chars

if __name__ == "__main__":
    asyncio.run(main())


[INIT].... → Crawl4AI 0.5.0.post4
[FETCH]... ↓ https://example.com... | Status: True | Time: 1.91s
[SCRAPE].. ◆ https://example.com... | Time: 0.003s
[COMPLETE] ● https://example.com... | Status: True | Total: 1.94s
# Example Domain
This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.
[More information...](https://www.iana.org/domains/example)



In [None]:
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_conf = BrowserConfig(headless=True)  # or False to see the browser
    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_conf
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())


[INIT].... → Crawl4AI 0.5.0.post4
[FETCH]... ↓ https://example.com... | Status: True | Time: 1.78s
[SCRAPE].. ◆ https://example.com... | Time: 0.003s
[COMPLETE] ● https://example.com... | Status: True | Total: 1.83s
# Example Domain
This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.
[More information...](https://www.iana.org/domains/example)



In [None]:
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

md_generator = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
)

config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    markdown_generator=md_generator
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://news.ycombinator.com", config=config)
    print("Raw Markdown length:", len(result.markdown.raw_markdown))
    print("Fit Markdown length:", len(result.markdown.fit_markdown))

[INIT].... → Crawl4AI 0.5.0.post4
[FETCH]... ↓ https://news.ycombinator.com... | Status: True | Time: 1.39s
[SCRAPE].. ◆ https://news.ycombinator.com... | Time: 0.231s
[COMPLETE] ● https://news.ycombinator.com... | Status: True | Total: 1.65s
Raw Markdown length: 16895
Fit Markdown length: 14209
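The two lengths above can be turned into a single figure for how much text the pruning filter removed. This is a small arithmetic sketch using the numbers from this particular run; `raw_len`, `fit_len`, and `reduction` are names chosen here for illustration.

```python
# Lengths reported by the run above (they will differ on each crawl).
raw_len = 16895
fit_len = 14209

# Fraction of the raw Markdown that the content filter removed.
reduction = 1 - fit_len / raw_len
print(f"Removed {raw_len - fit_len} characters ({reduction:.1%} of the raw Markdown)")
```

A small reduction like this one suggests that `threshold=0.4` prunes the Hacker News front page only lightly; other threshold values may be worth comparing.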


In [None]:
from google.colab import userdata
openai_token = userdata.get('openapi')

In [None]:
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai import LLMConfig

# Generate a schema (one-time cost)
html = "<div class='product'><h2>Gaming Laptop</h2><span class='price'>$999.99</span></div>"


# Using OpenAI (requires API token)
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token=openai_token),  # Required for OpenAI
)

# Use the schema for fast, repeated extractions
strategy = JsonCssExtractionStrategy(schema)

In [None]:
strategy.schema

{'name': 'Product Listing',
 'baseSelector': '.product',
 'fields': [{'name': 'product_name', 'selector': 'h2', 'type': 'text'},
  {'name': 'price', 'selector': '.price', 'type': 'text'}]}
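To make the roles of `baseSelector` and `fields` concrete, the sketch below applies the schema above to the same HTML snippet. It is not how crawl4ai implements extraction: it assumes well-formed markup, and the hypothetical helpers `to_xpath` and `apply_schema` handle only bare tag selectors (`h2`) and single-class selectors (`.price`).

```python
import xml.etree.ElementTree as ET

html = "<div class='product'><h2>Gaming Laptop</h2><span class='price'>$999.99</span></div>"

# Schema copied from the output above.
schema = {
    "name": "Product Listing",
    "baseSelector": ".product",
    "fields": [
        {"name": "product_name", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
    ],
}

def to_xpath(selector: str) -> str:
    """Translate a bare tag ('h2') or a single class ('.price') into XPath."""
    if selector.startswith("."):
        return f".//*[@class='{selector[1:]}']"
    return f".//{selector}"

def apply_schema(schema: dict, markup: str) -> list[dict]:
    """Return one dict per element matching baseSelector, keyed by field name."""
    # Wrap the markup in a root so the base element is found as a descendant.
    root = ET.fromstring(f"<root>{markup}</root>")
    rows = []
    for base in root.findall(to_xpath(schema["baseSelector"])):
        row = {}
        for field in schema["fields"]:
            node = base.find(to_xpath(field["selector"]))
            row[field["name"]] = node.text if node is not None else None
        rows.append(row)
    return rows

rows = apply_schema(schema, html)
print(rows)
```

Each element matched by `baseSelector` becomes one record, and each field selector is resolved relative to that element.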

In [None]:
import requests
from bs4 import BeautifulSoup

url = "https://comic.naver.com/webtoon/weekday"
res = requests.get(url)

soup = BeautifulSoup(res.text, "lxml")  # parse the HTML

In [None]:
# Using OpenAI (requires API token)
schema = JsonCssExtractionStrategy.generate_schema(
    str(soup),  # pass the HTML as a string rather than the BeautifulSoup object
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token=openai_token),  # Required for OpenAI
)

# Use the schema for fast, repeated extractions
strategy = JsonCssExtractionStrategy(schema)

In [None]:
strategy.schema

{'name': 'Naver Webtoon Page',
 'baseSelector': 'head',
 'fields': [{'name': 'title', 'selector': 'title', 'type': 'text'},
  {'name': 'favicon',
   'selector': "link[rel='icon']",
   'type': 'attribute',
   'attribute': 'href'},
  {'name': 'canonical_url',
   'selector': "link[rel='canonical']",
   'type': 'attribute',
   'attribute': 'href'},
  {'name': 'google_site_verification',
   'selector': "meta[name='google-site-verification']",
   'type': 'attribute',
   'attribute': 'content'},
  {'name': 'charset',
   'selector': 'meta[charset]',
   'type': 'attribute',
   'attribute': 'charset'},
  {'name': 'og_type',
   'selector': "meta[property='og:type']",
   'type': 'attribute',
   'attribute': 'content'},
  {'name': 'og_author',
   'selector': "meta[property='og:article:author']",
   'type': 'attribute',
   'attribute': 'content'},
  {'name': 'og_author_url',
   'selector': "meta[property='og:article:author:url']",
   'type': 'attribute',
   'attribute': 'content'},
  {'name': 'og_ti

In [None]:
# prompt: how to extract text from strategy

# ... (Your existing code)

# Assuming 'strategy' is your JsonCssExtractionStrategy instance
if strategy and strategy.schema:
  print("Extracted text from schema:")
  for key, value in strategy.schema.items():
    print(f"Key: {key}, Value: {value['css_selector']}")
    # You might need to adapt this based on how your schema is structured

    # Example: Extract the text content of elements matching the CSS selector
    elements = soup.select(value['css_selector'])
    for element in elements:
      print(element.get_text(strip=True))



Extracted text from schema:


TypeError: string indices must be integers, not 'str'
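The `TypeError` comes from the shape of the schema printed earlier. Its top-level values for `name` and `baseSelector` are plain strings, so `value['css_selector']` indexes a string with a string. The per-field selectors live in the `fields` list under the key `selector`, and no `css_selector` key appears anywhere in the schema. A minimal sketch of iterating that list follows; `field_selectors` is a hypothetical helper, and the schema literal repeats only the first two fields of the output above.

```python
# Schema shape copied from the notebook output above (first two fields only).
schema = {
    "name": "Naver Webtoon Page",
    "baseSelector": "head",
    "fields": [
        {"name": "title", "selector": "title", "type": "text"},
        {"name": "favicon", "selector": "link[rel='icon']",
         "type": "attribute", "attribute": "href"},
    ],
}

def field_selectors(schema: dict) -> dict[str, str]:
    """Map each field name to its CSS selector."""
    return {field["name"]: field["selector"] for field in schema.get("fields", [])}

selectors = field_selectors(schema)
for name, selector in selectors.items():
    print(f"{name}: {selector}")
```

In the notebook, each selector could then be passed to `soup.select(selector)`, reading `get_text()` for fields of type `text` and the named attribute for fields of type `attribute`.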

In [None]:
# Assuming 'strategy' is your JsonCssExtractionStrategy instance
if strategy and strategy.schema:
  print("Extracted data from schema:")
  for key, value in strategy.schema.items():
    print(f"Key: {key}, Value: {value}")  # Print the extracted value directly
    # If you need to access specific parts of the extracted data,
    # you'll need to inspect its structure and adjust the code accordingly.
    # For example, if 'value' is a dictionary, you might access specific fields like this:
    # if isinstance(value, dict) and 'text' in value:
    #   print(f"  Text: {value['text']}")

Extracted data from schema:
Key: name, Value: Naver Webtoon Page
Key: baseSelector, Value: head
Key: fields, Value: [{'name': 'title', 'selector': 'title', 'type': 'text'}, {'name': 'favicon', 'selector': "link[rel='icon']", 'type': 'attribute', 'attribute': 'href'}, {'name': 'canonical_url', 'selector': "link[rel='canonical']", 'type': 'attribute', 'attribute': 'href'}, {'name': 'google_site_verification', 'selector': "meta[name='google-site-verification']", 'type': 'attribute', 'attribute': 'content'}, {'name': 'charset', 'selector': 'meta[charset]', 'type': 'attribute', 'attribute': 'charset'}, {'name': 'og_type', 'selector': "meta[property='og:type']", 'type': 'attribute', 'attribute': 'content'}, {'name': 'og_author', 'selector': "meta[property='og:article:author']", 'type': 'attribute', 'attribute': 'content'}, {'name': 'og_author_url', 'selector': "meta[property='og:article:author:url']", 'type': 'attribute', 'attribute': 'content'}, {'name': 'og_title', 'selector': "meta[proper

In [None]:
strategy.schema['name']

'Naver Webtoon Page'

In [None]:
strategy.schema['baseSelector']

'head'