Clauncher

A minimal web scraper that extracts the main content from web pages. It is intended primarily for articles and other text-heavy pages, not for web apps or pages that rely heavily on JavaScript.

It uses the Mozilla Readability library to extract the main content from the web page. This library powers the reader mode in Mozilla Firefox.

This project was developed using NodeJS v20.10.0 and pnpm.

Installation

This project requires NodeJS, pnpm, and Git.

Clone the repo and install the necessary dependencies:

git clone https://github.com/VVoruganti/clauncher.git
cd clauncher
pnpm install

Usage

pnpm start

This launches an Express API that you can send requests to. The base route takes a url query parameter. An example curl request is below.

curl "http://localhost:3000/?url=https://vineeth.io"
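Because the value of the url parameter is itself a URL, a target address that carries its own query string or fragment should be percent-encoded before it is sent. The following is a minimal JavaScript sketch of that step, assuming the server is running locally on port 3000 as in the curl example. The helper name buildRequestUrl is hypothetical and not part of this project.

```javascript
// Hypothetical helper (not part of clauncher): builds the request URL for the
// scraper API and percent-encodes the target page address.
function buildRequestUrl(targetUrl, apiBase = "http://localhost:3000/") {
  const requestUrl = new URL(apiBase);
  // searchParams.set() encodes reserved characters such as "&", "?" and "=",
  // so the target's own query string is not mistaken for API parameters.
  requestUrl.searchParams.set("url", targetUrl);
  return requestUrl.toString();
}

console.log(buildRequestUrl("https://vineeth.io"));
console.log(buildRequestUrl("https://example.com/post?id=1&lang=en"));
```

With curl, the same encoding can be requested with the -G and --data-urlencode options, for example curl -G --data-urlencode "url=https://example.com/post?id=1&lang=en" http://localhost:3000/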

The request returns the JSON response from the Readability library. To prettify the output in your terminal, pipe the result into jq:

curl "http://localhost:3000/?url=https://vineeth.io" | jq
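The response can also be consumed from code. The sketch below assumes the API returns the object produced by Readability's parse() unchanged, with fields such as title, byline, excerpt, and textContent; that pass-through is an assumption and has not been checked against index.js. The helpers summarizeArticle and fetchSummary are hypothetical names, and the global fetch used here is available in Node 18 and later.

```javascript
// Hypothetical helper: reduces a Readability-style result to a few fields.
// Field names follow Readability's parse() output; that clauncher returns
// this object unchanged is an assumption.
function summarizeArticle(article) {
  if (article === null || typeof article !== "object") {
    throw new TypeError("expected a parsed article object");
  }
  const text = (article.textContent ?? "").trim();
  return {
    title: article.title ?? "(untitled)",
    byline: article.byline ?? null,
    // Fall back to the first 200 characters when no excerpt is provided.
    excerpt: article.excerpt ?? text.slice(0, 200),
    wordCount: text === "" ? 0 : text.split(/\s+/).length,
  };
}

// Hypothetical helper: fetches a page through a locally running instance
// (port 3000 assumed, as in the curl example) and summarizes the result.
async function fetchSummary(targetUrl, apiBase = "http://localhost:3000/") {
  const requestUrl = new URL(apiBase);
  requestUrl.searchParams.set("url", targetUrl);
  const response = await fetch(requestUrl);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return summarizeArticle(await response.json());
}
```

With the server started via pnpm start, calling fetchSummary("https://vineeth.io").then(console.log) would print the reduced object.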

Deploying

The project contains a Dockerfile and a fly.toml for hosting on fly.io or anywhere else you can host Docker containers.

Docker

You can build and run the project with the following Docker commands:

docker build -t clauncher .
docker run -p 3000:3000 clauncher

Fly

You can deploy to fly.io using flyctl with the command fly launch.

Roadmap

These are a few other features that could be nice to implement at some point.

Puppeteer-based scraping for JS sites. This guide has details on how to work with Puppeteer in Docker.

License

This project is licensed under the Apache 2.0 license. Read more in LICENSE.md.
