Library containing utilities for the robots exclusion and inclusion protocols. To use the library, download the JAR from the latest release and include it in your project.
Robots exclusion protocol
The library offers facilities for parsing robots.txt files from raw strings and for building an abstract representation of a robots.txt file containing all the parsed rules.
Supported directives are:
For the Allow/Disallow directives, the relative URL paths may contain the wildcard "*" character that matches any string (even the empty one) and the end-of-string character "$" that matches the end of the URL.
When multiple Allow/Disallow directives apply to a given URL path, the most specific one (the one with the longest directive path) is used. If two directives are equally specific, the Allow directive takes priority. The outcome between competing wildcard paths is undefined.
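The matching and precedence rules above can be sketched as follows. This is a hypothetical illustration, not this library's API: `patternToRegex` and `isAllowed` are invented names, and the sketch only shows how "*", "$", longest-match precedence, and the Allow tie-break fit together.

```scala
// Hypothetical sketch of robots.txt Allow/Disallow matching (not this library's API).
object RobotsMatchSketch {
  // "*" matches any string (including the empty one); "$" anchors the end of the URL.
  def patternToRegex(path: String): scala.util.matching.Regex = {
    val sb = new StringBuilder("^")
    path.foreach {
      case '*' => sb.append(".*")
      case '$' => sb.append("$")
      case c   => sb.append(java.util.regex.Pattern.quote(c.toString))
    }
    sb.toString.r
  }

  // A directive path matches if it matches a prefix of the URL path.
  def matches(directivePath: String, url: String): Boolean =
    patternToRegex(directivePath).findPrefixOf(url).isDefined

  // Pick the most specific (longest) matching directive; on a tie, Allow wins.
  // If nothing matches at all, the URL is allowed by default.
  def isAllowed(allows: Seq[String], disallows: Seq[String], url: String): Boolean = {
    val bestAllow    = allows.filter(matches(_, url)).map(_.length).maxOption.getOrElse(-1)
    val bestDisallow = disallows.filter(matches(_, url)).map(_.length).maxOption.getOrElse(-1)
    bestAllow >= bestDisallow
  }
}
```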
Unrecognized directives are discarded and comments are ignored.
Read more about the robots.txt protocol here.
HTML documents can be parsed as Scala XML documents, making it possible to extract outlinks and robot-specific meta-tags.
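To illustrate what such an extraction involves, a well-formed (XHTML) snippet can be queried with the standard scala-xml module. This is only a sketch under that assumption, not this library's API:

```scala
// Hypothetical sketch (not this library's API): extracting outlinks and the
// robots meta-tag from a well-formed (XHTML) document using the scala-xml module,
// which is a separate dependency in recent Scala versions.
import scala.xml.XML

val doc = XML.loadString(
  """<html>
    |  <head><meta name="robots" content="noindex, nofollow"/></head>
    |  <body><a href="http://example.com/a">a</a><a href="/b">b</a></body>
    |</html>""".stripMargin)

// Outlinks: the href attribute of every <a> element.
val outlinks = (doc \\ "a").map(a => (a \ "@href").text)

// Robots directives: the content of <meta name="robots" .../>, split on commas.
val robotsDirectives = (doc \\ "meta")
  .filter(m => (m \ "@name").text.equalsIgnoreCase("robots"))
  .flatMap(m => (m \ "@content").text.split(",").map(_.trim))
```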
Currently, supported meta-tags are:
Read more about the robot meta-tags here.
Robots inclusion protocol
The library allows the creation of sitemaps from raw string data and a given URL serving as the location of the sitemap.
Currently, it supports sitemaps in the following format:
Sitemap indexes are also supported, but each linked sitemap must be located somewhere under the directory of the sitemap index in order to be considered a valid link.
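This containment rule can be sketched as a simple check on the two locations. The check below is a hypothetical illustration, not this library's API; `isValidIndexLink` is an invented name:

```scala
// Hypothetical sketch (not this library's API): a linked sitemap is valid only
// if it lives under the directory of the sitemap index itself.
import java.net.URI

def isValidIndexLink(indexLocation: String, sitemapLocation: String): Boolean = {
  val indexUri   = new URI(indexLocation)
  val indexDir   = indexUri.resolve(".") // strip the file name, keep the directory
  val sitemapUri = new URI(sitemapLocation)
  indexUri.getHost == sitemapUri.getHost &&
    sitemapUri.getPath.startsWith(indexDir.getPath)
}
```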
Read more about the sitemaps protocol here.
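For reference, a minimal sitemap in the standard XML urlset format defined by the sitemaps protocol can be read with the scala-xml module. This is again only a sketch, not this library's API:

```scala
// Hypothetical sketch (not this library's API): reading URL entries out of a
// sitemap in the standard XML <urlset> format with the scala-xml module.
import scala.xml.XML

val sitemap = XML.loadString(
  """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    |  <url><loc>http://example.com/</loc><lastmod>2020-01-01</lastmod></url>
    |  <url><loc>http://example.com/page</loc></url>
    |</urlset>""".stripMargin)

// Each <url> entry carries a mandatory <loc> and optional metadata such as <lastmod>.
val locations = (sitemap \ "url").map(u => (u \ "loc").text)
```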