Skip to content

ericrav/playwright-web-scraping

Repository files navigation

Intro to Web Scraping

Setup

System requirements

  • Node.js 18+
  • Windows 10+, Windows Server 2016+ or Windows Subsystem for Linux (WSL).
  • macOS 13 Ventura, or macOS 14 Sonoma.
  • Debian 11, Debian 12, Ubuntu 20.04 or Ubuntu 22.04, Ubuntu 24.04, on x86-64 and arm64 architecture.
  1. Clone the repo
git clone https://github.com/ericrav/playwright-web-scraping.git
cd playwright-web-scraping
  1. Install Node and dependencies

You must have Node.js version >= 18 to run Playwright and this repo. Check your node version with node -v in the terminal. If you don't have the right version, try installing nvm (Node Version Manager) to install and use the latest version.

Then, install the dependencies:

npm install

Then, install browser binaries used by Playwright with:

npx playwright install --with-deps

More info: https://playwright.dev/docs/browsers

  1. Run example scraper
npm run scrape

You should see a Chromium browser window open and close automatically, a snippet of text printed in your terminal, and an image saved to the ./screenshots directory. Now you're ready to scrape the web!

Workshop

To run your workshop scraper, run in the terminal:

node scrapers/00-workshop.js

REPL

Experiment with Playwright functions in an interactive Node.js repl terminal (repl = "read evaluate print loop"). This will open the browser instance and Playwright inspector to show you what's happening.

npm run repl

Note: The repl script is ran with the Playwright debugger, so make sure to click the "Resume" button in the browser inspector so your commands can actually execute.

Playwright

Debug

Use the PWDEBUG=inspector environment variable to debug your script with the Playwright inspector:

PWDEBUG=inspector node scrapers/00-workshop.js

Add await page.pause(); in your script to pause as a specific step.

Codegen

npx playwright codegen nyu.edu

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published