Bash script to spider a site, follow links, and fetch URLs -- with some filtering. A list of unique URLs is generated and saved to a text file.
## How To Use
- Download the script and save it to the desired location on your machine.
- You'll need `wget` installed on your machine in order to continue. To check whether it's already installed (if you're on Linux or a Mac, chances are you already have it), open Git Bash, Terminal, etc. and run the command `wget`. If you receive an error message or `command not found`, you're probably on Windows. Here are the Windows installation instructions:
  - Download the latest wget binary for Windows from https://eternallybored.org/misc/wget/ (the builds are available as a zip with documentation, or as just an exe; I'd recommend just the exe).
  - If you downloaded the zip, extract all (if the Windows built-in zip utility gives an error, use 7-Zip). If you downloaded the 64-bit version, rename the `wget64.exe` file to `wget.exe`.
- Open Git Bash, Terminal, etc. and run the following command:

  ```sh
  $ bash /path/to/script/fetchurls.sh
  ```
- You will be prompted to enter the full URL (including HTTPS/HTTP protocol) of the site you would like to crawl:
  ```
  #
  # Fetch a list of unique URLs for a domain.
  #
  # Enter the full URL ( http://example.com )
  # URL:
  ```
- You will then be prompted to change or accept the name of the output file (simply press Enter to accept the default filename):
  ```
  #
  # Fetch a list of unique URLs for a domain.
  #
  # Enter the full URL ( http://example.com )
  # URL: https://www.example.com
  #
  # Save txt file as: example-com
  ```
- When complete, the script will show a message and the location of your output file:
  ```
  #
  # Fetch a list of unique URLs for a domain.
  #
  # Enter the full URL ( http://example.com )
  # URL: https://www.example.com
  #
  # Save txt file as: example-com
  #
  # Fetching URLs for example.com
  # Finished!
  #
  # File Location: /c/Users/username/Desktop/example-com.txt
  #
  ```
The script will crawl the site and compile a list of valid URLs into a text file that will be placed on your Desktop.
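For context, a crawl of this kind is typically a `wget` spider run whose log output is then filtered for URLs. The snippet below is only a rough sketch of that approach under assumed flags and file names, not the script's exact commands:

```sh
# Rough sketch only -- not the exact commands used by fetchurls.sh.
# Crawl the site without saving pages, extract URLs from the log,
# de-duplicate them, and write the list to a text file on the Desktop.
wget --spider --recursive --no-verbose --output-file=wget.log "https://www.example.com"
grep -oE 'https?://[^ ]+' wget.log | sort -u > ~/Desktop/example-com.txt
```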
To change the default file output location, edit line #7 of the script. The default location is your Desktop.
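As a rough illustration (the variable name here is an assumption; check line #7 of the script for the actual one), that line simply sets the directory the generated .txt file is written to:

```sh
# Illustrative only -- the real variable name on line #7 may differ.
# Directory where the generated .txt file is saved.
savelocation=~/Desktop
```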
Ensure that you enter the correct protocol and subdomain for the URL, or the output file may be empty or incomplete. For example, entering the incorrect protocol (HTTP) for https://adamdehaven.com generates an empty file; entering the proper protocol (HTTPS) allows the script to run successfully.
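If you're unsure which protocol a site serves, a quick header check (sketched below with `curl`, which isn't required by the script itself) will show whether the plain-HTTP URL redirects to HTTPS:

```sh
# Fetch only the response headers and look for a redirect target
curl -sI http://adamdehaven.com | grep -i '^location:'
```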
The script, by default, filters out the following file extensions:
The script filters out several common WordPress files and directories such as:
To change or edit the regular expressions that filter out certain pages, directories, and file types, you may edit lines #35 through #44. Caution: if you're not familiar with `grep` and regular expressions, you can easily break the script.
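For reference, an exclusion filter of this kind is typically a chain of `grep -v` calls using extended regular expressions. The pattern below is only an illustrative sketch of that technique (the file names, extensions, and paths are assumptions), not a copy of the script's lines #35 through #44:

```sh
# Illustrative only -- not the script's actual filter lines.
# Drop common asset extensions and a few WordPress paths from a URL list.
grep -viE '\.(css|js|jpg|jpeg|png|gif|ico|svg)(\?.*)?$' urls.txt \
  | grep -viE '/(wp-admin|wp-login\.php|wp-json)(/|$)' \
  > filtered-urls.txt
```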