puppeteer-scrape

A headless web scraping service built with Node.js, Express, and Puppeteer. It exposes an API that scrapes a given URL and reports the page's network relationships, response headers, and performance metrics, along with the page content encoded as base64.

Installation

Clone the repository.
Run docker build -t puppeteer-scrape . from the repository root to build the Docker image.

Usage

Start the server by running docker run -it --rm -p 3000:3000 puppeteer-scrape. The server will start on port 3000.

The service provides POST endpoints (/detailed_scrape and /simple_scrape in the examples below) that accept a JSON body with a url field. The url should be the page you want to scrape.

Example requests:

curl -X POST http://localhost:3000/detailed_scrape -H 'Content-Type: application/json' -d '{"url": "https://www.example.com"}'
curl -X POST http://localhost:3000/simple_scrape -H 'Content-Type: application/json' -d '{"url": "https://www.example.com"}'
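
The same requests can be sent from Node.js. The following is a minimal sketch, assuming Node.js 18 or later (for the global fetch) and a server running locally on port 3000 as described above. The helper names buildScrapeRequest and scrape are illustrative and are not part of this project.

```javascript
// Illustrative client sketch; these helpers are hypothetical, not part of the service.

// Build the URL and fetch() options for a scrape request.
function buildScrapeRequest(baseUrl, endpoint, targetUrl) {
  return {
    url: new URL(endpoint, baseUrl).toString(),
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // The service expects a JSON body with a `url` field.
      body: JSON.stringify({ url: targetUrl }),
    },
  };
}

// Send the request and parse the JSON response (requires Node.js 18+ for fetch).
async function scrape(baseUrl, endpoint, targetUrl) {
  const { url, options } = buildScrapeRequest(baseUrl, endpoint, targetUrl);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Scrape failed with status ${response.status}`);
  }
  return response.json();
}
```

For example, scrape("http://localhost:3000", "/detailed_scrape", "https://www.example.com") returns a promise that resolves to the parsed response object.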

The response will be a JSON object containing the following fields:

networkMap: An array of objects, each representing a network request made by the page. Each object includes the hostname, URL, HTTP method, status code, and MIME type of the request.
headersInfo: An array of objects, each representing the response headers for a network request. Each object includes the URL of the request and an object mapping header names to their values.
performanceMetrics: An object summarizing the performance of the page. This includes the total load time, time to first byte, and the sizes of different types of resources (images, scripts, stylesheets, and other).
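
As an illustration of working with networkMap, the sketch below counts requests and failed requests per hostname. The property names hostname and status are assumptions; this README lists what each entry contains but not the exact key names, so check them against a real response before relying on this.

```javascript
// Sketch only: `hostname` and `status` are assumed key names for networkMap entries.
function summarizeNetworkMap(networkMap) {
  const byHost = new Map();
  for (const entry of networkMap) {
    const host = entry.hostname || "(unknown)";
    const stats = byHost.get(host) || { requests: 0, failures: 0 };
    stats.requests += 1;
    // Treat any 4xx or 5xx status code as a failed request.
    if (entry.status >= 400) {
      stats.failures += 1;
    }
    byHost.set(host, stats);
  }
  // Convert to a plain object so the summary is easy to log or serialize.
  return Object.fromEntries(byHost);
}
```

Grouping by hostname gives a quick view of which third-party hosts a page depends on.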

Error Handling

If the url field is not provided in the request, the server will respond with a 400 status code and a message "URL is required".

If an error occurs during scraping, the server will respond with a 500 status code and a message "An error occurred during scraping".
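
The two error cases above could be handled along the following lines. This is a sketch of an Express-style handler, not the code in scrape.js: makeScrapeHandler and scrapePage are hypothetical names, and sending the messages as plain text is an assumption, since the README does not specify the shape of the error body.

```javascript
// Hypothetical handler factory; `scrapePage` stands in for the real Puppeteer logic.
function makeScrapeHandler(scrapePage) {
  return async function handler(req, res) {
    const url = req.body && req.body.url;
    if (!url) {
      // Missing url field: respond with 400.
      return res.status(400).send("URL is required");
    }
    try {
      const result = await scrapePage(url);
      return res.json(result);
    } catch (err) {
      // Any failure while scraping: respond with 500.
      return res.status(500).send("An error occurred during scraping");
    }
  };
}
```

With Express, such a handler would typically be registered with app.post after enabling the express.json() body parser.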

License

This project is licensed under the AGPL License.
