🚀 You can find all URLs from a base URL! 🚀
A dynamic web crawler that drives a headless browser (Puppeteer) to fetch all links on a page and its child pages.
- Clone the repo and run
npm install puppeteer yargs
- Create a file listing the targets to scrape at {root}/inputs/targets.txt
- Create a file listing targets to exclude from crawling at {root}/inputs/blacklist.txt (see the sketch after these steps)
- Run
node index.js -t targets.txt -r results.txt -b blacklist.txt -d 1
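For example, you could seed the two input files like this. This is only a sketch: the one-URL-per-line format is an assumption, and example.com is a placeholder.

```sh
# Sketch: seed the input files from the repo root
# (one URL per line is assumed; example.com is a placeholder)
mkdir -p inputs
echo "https://example.com" > inputs/targets.txt
echo "https://example.com/admin" > inputs/blacklist.txt
```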
- For the ExecBot version, clone the repo and run
npm install puppeteer yargs
- pip install -r requirements.txt
python3 exec.py results.txt 1
(results.txt = results file name, 1 = depth)
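A plausible end-to-end flow, assuming the crawler's results file is what feeds ExecBot (the ExecBot/exec.py path is taken from the project structure below):

```sh
# Sketch: crawl first, then hand the results file and depth to ExecBot
node index.js -t targets.txt -r results.txt -b blacklist.txt -d 1
python3 ExecBot/exec.py results.txt 1
```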
Works on amd64 and arm64.
- Pull the image from Docker Hub ➡️ link
- Plain version: tag name ServerCrawler
- ExecBot version: tag name servercrawler_v2
- Run the container, adding any docker options you need
docker run -d eogns47/linkcrawler:{Your tag}
- Connect to the container shell
docker exec -it {container id} /bin/sh
python3 exec.py results.txt 1
(results.txt = results file name, 1 = depth)
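Putting the Docker steps together, a session with the ExecBot image might look like this (the tag comes from the list above; the container id comes from docker ps):

```sh
# Sketch: full container session for the ExecBot image
docker run -d eogns47/linkcrawler:servercrawler_v2
docker ps                          # note the container id
docker exec -it <container id> /bin/sh
# inside the container:
python3 exec.py results.txt 1
```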
- Install the unit test tool (Jest)
npm install --save-dev jest
- Run
npx jest
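To run a single suite instead of the whole set, you can pass a test file path to Jest (paths taken from the project structure below):

```sh
npx jest src/tests/Link.test.js
```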
More Options:
--version Show version number [boolean]
-t Input file path [required]
-u Target URLs passed as an array
-r Output file path
-d Crawling depth
-b Blacklist file path to prevent a URL from being crawled (hard match)
--full Use the full URL for crawling instead of its base
-v Verbosity level [boolean]
--base Only crawl URLs that include the base URL, excluding external links [boolean]
-h, --help Show help [boolean]
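As an illustration, several options can be combined in one run; the file names here mirror the earlier example:

```sh
# Depth-2 crawl, restricted to each target's base URL, with verbose logging
node index.js -t targets.txt -r results.txt -b blacklist.txt -d 2 --base -v
```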
link-crawler
├─ .dockerignore
│
├─ .gitignore
├─ Dockerfile
├─ ExecBot
│ └─ exec.py
├─ Logger
│ └─ logger.js
├─ README.md
├─ babel.config.js
├─ inputs
│ └─ .gitkeep
├─ results
│ └─ .gitkeep
├─ jest.config.js
├─ logs
│ └─ .gitkeep
├─ node_modules
│
├─ package-lock.json
├─ package.json
├─ src
│ ├─ Config
│ │ └─ Extensions.js
│ ├─ IOView.js
│ ├─ LinkCrawler.js
│ ├─ LinkPreprocessor.js
│ ├─ Validator.js
│ ├─ index.js
│ ├─ messageHandler.js
│ └─ tests
│ ├─ File.test.js
│ ├─ Link.test.js
│ └─ Validate.test.js
└─ yarn.lock