
404 Crawler 🏊‍♂️

A command line interface to crawl and detect 404 pages from sitemap.


📊 Usage

Install

Make sure npm is installed on your computer. To learn more about it, visit https://docs.npmjs.com/downloading-and-installing-node-js-and-npm

In a terminal, run

npm install -g @algolia/404-crawler

After that, the 404crawler command will be available in your terminal.
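
If the CLI follows the usual --help convention (this README doesn't confirm it, but it is standard for npm CLIs), you can verify the install with

404crawler --help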

Examples

  • Crawl and detect all 404 pages from the Algolia website's sitemap:

    404crawler crawl -u https://algolia.com/sitemap.xml
  • Use JavaScript rendering to crawl and identify all 404 or 'Not Found' pages on the Algolia website:

    404crawler crawl -u https://algolia.com/sitemap.xml --render-js
  • Crawl and identify all 404 pages on the Algolia website by analyzing its sitemap, including all potential sub-path variations:

    404crawler crawl -u https://algolia.com/sitemap.xml --include-variations
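  • Combine options; for example, crawl with JavaScript rendering, run pages in parallel, and write the results to a JSON file (all flags used here are described under Options below):

    404crawler crawl -u https://algolia.com/sitemap.xml --render-js --run-in-parallel --output results.json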

Options

  • --sitemap-url or -u: Required. URL of the sitemap.xml file.

  • --render-js or -r: Use JavaScript rendering to crawl and identify a 'Not Found' page even when the status code isn't a 404. This option is useful for websites that return a 200 status code even when the page is not found (for example, Next.js with a custom not-found page).

  • --output or -o: Output path for the JSON file of results, e.g. crawler/results.json. If not set, no file is written after the crawl.

  • --include-variations or -v: Include all sub-path variations of URLs found in the sitemap.xml. For example, if https://algolia.com/foo/bar/baz is found in the sitemap, the crawler will test https://algolia.com/foo/bar/baz, https://algolia.com/foo/bar, https://algolia.com/foo, and https://algolia.com (see the sketch after this list).

  • --exit-on-detection or -e: Exit when a 404 or a 'Not Found' page is detected.

  • --run-in-parallel or -p: Run the crawler with multiple pages in parallel. By default, the number of parallel instances is 10; see the --batch-size option to configure it.

  • --batch-size or -s: Number of parallel crawler instances to run: the higher this number, the more resources are consumed. Only available when the --run-in-parallel option is set. Defaults to 10.

  • --browser-type or -b: Browser to use to crawl pages. Can be 'firefox', 'chromium', or 'webkit'. Defaults to 'firefox'.
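
To make the --include-variations expansion concrete, here is a minimal TypeScript sketch of how sub-path variations could be derived from a sitemap URL. It is illustrative only: the function name subPathVariations is invented, and the snippet is not taken from the package's source.

    // Illustrative sketch: derive every sub-path variation of a URL,
    // from the full path down to the site root.
    function subPathVariations(url: string): string[] {
      const { origin, pathname } = new URL(url);
      const segments = pathname.split('/').filter(Boolean);
      const variations: string[] = [];
      // Walk from the deepest path back up to the origin.
      for (let i = segments.length; i >= 0; i--) {
        const path = segments.slice(0, i).join('/');
        variations.push(path ? `${origin}/${path}` : origin);
      }
      return variations;
    }

    // subPathVariations('https://algolia.com/foo/bar/baz') returns:
    // ['https://algolia.com/foo/bar/baz',
    //  'https://algolia.com/foo/bar',
    //  'https://algolia.com/foo',
    //  'https://algolia.com']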

👨‍💻 Get started (maintainers)

This CLI is built with TypeScript and uses ts-node to run the code locally.

Install

Install all dependencies

pnpm i

Run locally

pnpm 404crawler crawl <options>
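
For example, to crawl the Algolia sitemap with JavaScript rendering from a local checkout:

pnpm 404crawler crawl -u https://algolia.com/sitemap.xml --render-js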

Deploy

  1. Update package.json version

  2. Commit and push changes

  3. Build JS files in dist/ with

    pnpm build
  4. Initialize npm with Algolia org as scope

    npm init --scope=algolia
  5. Follow the prompts

  6. Publish package with

    npm publish
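
    Note that npm treats scoped packages as private by default; if publishing fails for that reason, npm supports an explicit access flag:

    npm publish --access public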

🔗 References

This package uses:
