**Table of contents**<a id='toc0_'></a>    
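- [Introduction](#toc1_)    
- [Features](#toc2_)    
- [Installation](#toc3_)    
- [Examples](#toc4_)    
  - [Async API](#toc4_1_)    
  - [Using a user data directory](#toc4_2_)    
  - [Evading automation detection](#toc4_3_)    
  - [Connecting to an existing browser](#toc4_4_)    
  - [Scraping Douyin videos](#toc4_5_)    
- [Problems encountered](#toc5_)    
  - [Using Playwright in `Jupyter`](#toc5_1_)    
    - [The sync `API` cannot be used](#toc5_1_1_)    
    - [The async API raises `NotImplementedError`](#toc5_1_2_)    
- [References](#toc6_)    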

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

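# <a id='toc1_'></a>[Introduction](#toc0_)

[Playwright](https://playwright.dev/python/) is an open-source library for automating web browsers. It drives Chromium (Chrome, Edge), Firefox, and WebKit (the engine behind Safari), and is used for automated testing of web applications, web scraping, and similar tasks.

# <a id='toc2_'></a>[Features](#toc0_)

- Supports the major browser engines: Chromium (Chrome, Edge), Firefox, and WebKit (Safari)
- Bindings for several languages (JavaScript/TypeScript, Python, C#, Java)
- Headless mode

# <a id='toc3_'></a>[Installation](#toc0_)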

In [None]:
! pip install playwright
! playwright install --with-deps

# <a id='toc4_'></a>[Examples](#toc0_)

## <a id='toc4_1_'></a>[Async API](#toc0_)

In [None]:
import asyncio
from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=False)
        page = await browser.new_page()

        # Open the GitHub home page.
        # The selectors below assume GitHub's search markup (input[name="q"], ul.repo-list);
        # adjust them if the page structure differs.
        await page.goto("https://github.com/")

        # Type "playwright" into the search box
        await page.fill('input[name="q"]', "playwright")

        # Submit the search
        await page.press('input[name="q"]', "Enter")

        # Wait for the search results to load
        await page.wait_for_selector('ul.repo-list')

        # Read the title of the first search result
        first_result = await page.locator('ul.repo-list li:first-child a:has-text("playwright")').inner_text()

        print(f"First search result: {first_result}")

        # Check that the first result mentions "playwright"
        assert "playwright" in first_result.lower(), "Expected 'playwright' in search results"

        await browser.close()

await main()
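
The cell above ends with a bare `await main()`, which works because a notebook cell already runs inside an event loop. In a plain `.py` script there is no running loop, so the coroutine has to be started with `asyncio.run`. The following is a minimal sketch of that pattern; the body of `main` is a placeholder standing in for the Playwright calls, not part of the original example.

```python
import asyncio


async def main() -> str:
    # Placeholder for the Playwright calls shown above.
    await asyncio.sleep(0)
    return "done"


if __name__ == "__main__":
    # A plain script has no running event loop, so start one explicitly.
    print(asyncio.run(main()))
```

Inside Jupyter, keep using `await main()`; calling `asyncio.run` from a cell raises `RuntimeError` because a loop is already running.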

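## <a id='toc4_2_'></a>[Using a user data directory](#toc0_)

Some sites can only be accessed after logging in with a CAPTCHA. In that case, log in manually beforehand, then have `playwright` reuse the user data directory that already holds the logged-in session.

**1. Find the user data directory**

To find `Chrome`'s user data directory, enter `chrome://version/` in the `Chrome` address bar and look at the `Profile Path` field, which shows where the profile is stored.

Screenshot: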

![](https://github.com/cruldra/picx-images-hosting/raw/master/image.361hoxlyqa.png)
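
`Profile Path` normally ends with the profile folder (for example `Default`), whereas the launch call in the next step expects the directory above it. A minimal sketch of that conversion, assuming a Windows-style path; `user_data_dir_from_profile_path` is a hypothetical helper name, not a Playwright function.

```python
from pathlib import PureWindowsPath


def user_data_dir_from_profile_path(profile_path: str) -> str:
    """Drop the trailing profile folder (e.g. `Default`) from a Profile Path value."""
    return str(PureWindowsPath(profile_path).parent)


print(user_data_dir_from_profile_path(r"E:\AppData\ChromeUserData\Test1\Default"))
# prints E:\AppData\ChromeUserData\Test1
```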

**2. Run `playwright` with the user data directory**


In [None]:
import asyncio
from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as playwright:
        # Path to the Chrome user data directory.
        # Do not append \Default here; Chrome adds the profile folder itself.
        user_data_dir = r"E:\AppData\ChromeUserData\Test1"

        # launch_persistent_context returns a BrowserContext, not a Browser
        context = await playwright.chromium.launch_persistent_context(
            user_data_dir=user_data_dir,
            viewport={"width": 1920, "height": 1080},  # viewport size
            # channel="chrome",  # use the installed Chrome instead of Playwright's bundled Chromium
            headless=False,  # headed mode, so the browser window is visible
            args=['--disable-blink-features=AutomationControlled'],
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        )
        page = await context.new_page()
        # Open the Douyin discover page
        await page.goto("https://www.douyin.com/discover")

        # Wait until the user closes the page (timeout=0 disables the timeout)
        await page.wait_for_event("close", timeout=0)

await main()

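## <a id='toc4_3_'></a>[Evading automation detection](#toc0_)

Some sites check whether a visitor is driven by an automation tool. Several techniques can be used to get around this kind of detection.

[Antibot](https://bot.sannysoft.com/) runs a set of such checks against the visiting browser, which makes it a convenient page for testing evasion techniques.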

In [None]:
import asyncio
from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as playwright:
        # Path to the Chrome user data directory.
        # Do not append \Default here; Chrome adds the profile folder itself.
        user_data_dir = r"E:\AppData\ChromeUserData\Test11"

        # launch_persistent_context returns a BrowserContext, not a Browser
        context = await playwright.chromium.launch_persistent_context(
            user_data_dir=user_data_dir,
            viewport={"width": 1920, "height": 1080},  # viewport size
            # channel="chrome",  # use the installed Chrome instead of Playwright's bundled Chromium
            headless=False,  # headed mode, so the browser window is visible
            args=['--disable-blink-features=AutomationControlled'],
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        )
        page = await context.new_page()

        # Read the script as UTF-8 so the platform's default encoding does not matter
        with open('stealth.min.js', 'r', encoding='utf-8') as f:
            js = f.read()
        print(js)  # debug output: dumps the whole script
        # Inject stealth.min.js to hide the navigator.webdriver property
        await page.add_init_script(js)
        # Open the bot-detection test page
        await page.goto("https://bot.sannysoft.com/")

        # Wait until the user closes the page (timeout=0 disables the timeout)
        await page.wait_for_event("close", timeout=0)

await main()

Result:

![](https://github.com/cruldra/picx-images-hosting/raw/master/image.4xugk8vf79.png)
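
When `stealth.min.js` is not available, a much smaller init script can at least hide `navigator.webdriver`. The sketch below is a simplified stand-in and an assumption on my part, not the contents of `stealth.min.js`; it covers only one of the signals such test pages look at. `load_init_script` and `FALLBACK_INIT_SCRIPT` are hypothetical names.

```python
from pathlib import Path

# Simplified fallback (an assumption, not the contents of stealth.min.js):
# make navigator.webdriver read as undefined before any page script runs.
FALLBACK_INIT_SCRIPT = (
    "Object.defineProperty(navigator, 'webdriver', { get: () => undefined });"
)


def load_init_script(path: str = "stealth.min.js") -> str:
    """Return the stealth script if the file exists, otherwise the fallback."""
    script_file = Path(path)
    if script_file.is_file():
        return script_file.read_text(encoding="utf-8")
    return FALLBACK_INIT_SCRIPT
```

The returned string would be passed to `await page.add_init_script(load_init_script())` before navigating.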

**Reference**:

- [How to keep Playwright from being detected by websites (CSDN blog, in Chinese)](https://blog.csdn.net/qq_37781464/article/details/137639747)

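## <a id='toc4_4_'></a>[Connecting to an existing browser](#toc0_)


Playwright can [connect to an already running browser](https://github.com/microsoft/playwright/issues/23217#issuecomment-1561521867) over the Chrome DevTools Protocol (`CDP`), which leaves the browser's lifecycle under manual control.

**1. Start the browser with a `CDP` listening port**

Press `Win+R` and run the following command: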

```cmd
chrome --remote-debugging-port=9222 --user-data-dir="C:\Users\markw\AppData\Local\Google\Chrome\User Data\Default\TestUser"
```
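
Before connecting, it can help to confirm that the debugging port is actually open. Chrome serves a small JSON document at `/json/version` on that port, which includes a `webSocketDebuggerUrl` field. A minimal sketch using only the standard library; the helper names (`version_url`, `websocket_url`, `probe`) are hypothetical.

```python
import json
from typing import Optional
from urllib.error import URLError
from urllib.request import urlopen


def version_url(endpoint: str) -> str:
    """Build the DevTools version URL for an endpoint such as http://localhost:9222."""
    return endpoint.rstrip("/") + "/json/version"


def websocket_url(payload: bytes) -> Optional[str]:
    """Extract webSocketDebuggerUrl from a /json/version response body."""
    return json.loads(payload).get("webSocketDebuggerUrl")


def probe(endpoint: str = "http://localhost:9222", timeout: float = 2.0) -> Optional[str]:
    """Return the browser's WebSocket URL, or None if nothing is listening."""
    try:
        with urlopen(version_url(endpoint), timeout=timeout) as response:
            return websocket_url(response.read())
    except (URLError, OSError):
        return None
```

If `probe()` returns `None`, Chrome was most likely not started with `--remote-debugging-port=9222`, or another Chrome instance was already running without that flag.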


**2. Connect to the browser**

The running browser can then be attached with the [connect_over_cdp](https://playwright.dev/python/docs/api/class-browsertype#browser-type-connect-over-cdp) method.

In [None]:
import asyncio
from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as playwright:
        # Connecting this way avoids the "Chrome is being controlled by automated test software" banner
        browser = await playwright.chromium.connect_over_cdp("http://localhost:9222")
        default_context = browser.contexts[0]
        page = default_context.pages[0]

        #region Combine with the anti-detection script
        # Read the script as UTF-8 so the platform's default encoding does not matter
        with open('stealth.min.js', 'r', encoding='utf-8') as f:
            js = f.read()
        print(js)  # debug output: dumps the whole script
        # Inject stealth.min.js to hide the navigator.webdriver property
        await page.add_init_script(js)
        #endregion
        await page.goto("https://bot.sannysoft.com/")

        result = await page.evaluate("navigator.webdriver===undefined")
        print(f"navigator.webdriver removed: {result}")
        # Wait until the user closes the page (timeout=0 disables the timeout)
        await page.wait_for_event("close", timeout=0)

await main()
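
`default_context.pages[0]` in the cell above raises `IndexError` if the attached browser has no open tab. A small guard can fall back to opening one. This is a sketch only: `first_page` is a hypothetical helper, and `FakeContext` is a stand-in used to exercise it without a browser; with Playwright you would pass the real context instead.

```python
import asyncio


async def first_page(context):
    """Return the first open page of a context, opening one if there is none."""
    if context.pages:
        return context.pages[0]
    return await context.new_page()


class FakeContext:
    """Stand-in for a browser context, used only to exercise the helper."""

    def __init__(self, pages):
        self.pages = list(pages)

    async def new_page(self):
        page = f"page-{len(self.pages)}"
        self.pages.append(page)
        return page
```

In the cell above this would read `page = await first_page(default_context)`.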

## <a id='toc4_5_'></a>[Scraping Douyin videos](#toc0_)

In [10]:
import asyncio
from playwright.async_api import async_playwright
from yarl import URL


async def main():
    async with async_playwright() as playwright:
        # Connecting this way avoids the "Chrome is being controlled by automated test software" banner
        browser = await playwright.chromium.connect_over_cdp("http://localhost:9222")
        default_context = browser.contexts[0]
        page = default_context.pages[0]

        #region Combine with the anti-detection script
        # Read the script as UTF-8 so the platform's default encoding does not matter
        with open('stealth.min.js', 'r', encoding='utf-8') as f:
            js = f.read()
        # Inject stealth.min.js to hide the navigator.webdriver property
        await page.add_init_script(js)
        #endregion
        await page.goto("https://www.douyin.com/discover")

        result = await page.evaluate("navigator.webdriver===undefined")
        print(f"navigator.webdriver removed: {result}")

        # Locate the search box and type the query ("留学" means "studying abroad")
        search_input = await page.wait_for_selector("""#douyin-header input[data-e2e="searchbar-input"][placeholder="搜索你感兴趣的内容"]""")
        await search_input.type("留学")
        search_button = await page.wait_for_selector("""#douyin-header button[data-e2e="searchbar-button"]""")
        # Print the button's outer HTML
        outer_html = await search_button.evaluate("el => el.outerHTML")
        print(outer_html)

        # Clicking the search button opens a new tab; capture it and switch to it
        async with default_context.expect_page() as new_page_info:
            await search_button.click()  # or any other action that opens a new page
        page2 = await new_page_info.value
        # bring_to_front is a coroutine; without await it never runs
        await page2.bring_to_front()

        # Switch to video results by setting type=video in the query string
        url = URL(page2.url)
        new_query = url.query.copy()
        new_query['type'] = 'video'
        url = url.with_query(new_query)
        await page2.goto(url.human_repr())

        # Wait until the user closes the original page (timeout=0 disables the timeout)
        await page.wait_for_event("close", timeout=0)

await main()

navigator.webdriver removed: True
<button class="JMEzcqbO" data-e2e="searchbar-button" type="button"><svg width="18" height="18" fill="none" xmlns="http://www.w3.org/2000/svg" class="tO5FPupE"><path fill-rule="evenodd" clip-rule="evenodd" d="M7.875 1.5a6.375 6.375 0 103.642 11.608l3.063 3.063a1.125 1.125 0 001.59-1.591l-3.062-3.063A6.375 6.375 0 007.875 1.5zM3.75 7.875a4.125 4.125 0 118.25 0 4.125 4.125 0 01-8.25 0z" fill="#4F5168"></path></svg><span class="btn-title">搜索</span></button>
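
The query-string rewrite in the cell above uses `yarl`. If that dependency is not installed, the same step can be sketched with the standard library. `with_query_param` is a hypothetical helper, and the URL in the example is made up purely to show the transformation; the real search URL may look different.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_param(url: str, key: str, value: str) -> str:
    """Return url with query parameter `key` set to `value`."""
    parts = urlsplit(url)
    # dict() keeps one value per key, which is enough for this use
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query[key] = value
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical search URL, used only to show the transformation.
print(with_query_param("https://www.douyin.com/search/demo?type=general", "type", "video"))
# prints https://www.douyin.com/search/demo?type=video
```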


# <a id='toc5_'></a>[Problems encountered](#toc0_)


## <a id='toc5_1_'></a>[Using Playwright in `Jupyter`](#toc0_)


### <a id='toc5_1_1_'></a>[The sync `API` cannot be used](#toc0_)

**Problem**

[Error: It looks like you are using Playwright Sync API inside the asyncio loop. Please use the Async API instead. · Issue #462 · microsoft/playwright-python](https://github.com/microsoft/playwright-python/issues/462)

**Solution**

Use the async API.
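
The error message points at the cause: the sync API refuses to start when an asyncio event loop is already running, and a notebook kernel normally runs one. A minimal sketch for checking that condition; `inside_running_loop` and `check_from_coroutine` are hypothetical helper names.

```python
import asyncio


def inside_running_loop() -> bool:
    """True when the calling code is being run by an asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


async def check_from_coroutine() -> bool:
    # Inside a coroutine a loop is always running.
    return inside_running_loop()
```

In a plain script `inside_running_loop()` returns `False`; in a notebook cell it should return `True`, which is the situation in which the async API is required.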


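### <a id='toc5_1_2_'></a>[The async API raises `NotImplementedError`](#toc0_)


**Problem**

In a Jupyter notebook, launching a browser through the async API fails with `NotImplementedError`. The `venv\Lib\...` path below suggests this was observed on Windows.

**Solution**

Open `venv\Lib\site-packages\ipykernel\kernelapp.py` and comment out line `662`. The exact line number may differ between `ipykernel` versions; the screenshot shows the line in question.

Screenshot:

![](https://github.com/cruldra/picx-images-hosting/raw/master/image.1e8iu0f9o7.png)

**It is unclear whether this change causes other problems; see [this comment](https://github.com/microsoft/playwright-python/issues/178#issuecomment-1302869947).**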
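
One plausible cause, stated here as an assumption rather than something the text above confirms: on Windows, asyncio's selector event loop does not implement subprocess support, and Playwright starts its driver as a subprocess. The sketch below reports which loop class is running so that this can be checked; both function names are hypothetical, and the check is a rough heuristic based on the class name.

```python
import asyncio


def can_spawn_subprocess(loop_class_name: str, platform: str) -> bool:
    """Rough check: Windows selector loops lack asyncio subprocess support."""
    return not (platform == "win32" and "Selector" in loop_class_name)


async def running_loop_class_name() -> str:
    """Name of the event loop class that is running this coroutine."""
    return type(asyncio.get_running_loop()).__name__
```

In a notebook cell, `can_spawn_subprocess(await running_loop_class_name(), sys.platform)` (after `import sys`) returning `False` would be consistent with this explanation.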


# <a id='toc6_'></a>[References](#toc0_)

* [Installation | Playwright Python](https://playwright.dev/python/docs/intro)
* [Bot-detection test page](https://bot.sannysoft.com/)
* [Inspect remote browsers](chrome://inspect/#devices)