🤖 robots.txt as a service 🤖


🚧 Project in early development

Distributed robots.txt parser and rule checker through API access. If you are working on a distributed web crawler and want to be polite in your actions, you will find this project very useful. It can also be integrated into any SEO tool to check whether content is being indexed correctly by robots.

For this first version, we are trying to comply with the specification used by Google to analyze websites. You can see it here. Expect support for other robots specifications soon!
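
To give an idea of what the parser has to deal with, here is a small, made-up robots.txt file using the directives covered by that specification (it is purely illustrative, not taken from any real site):

User-agent: AwesomeBot
Disallow: /private/
Allow: /private/annual-report.html

User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml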

Why this project?

If you are building a distributed web crawler, you know that managing robots.txt rules from websites is a hard task, and that it can be complicated to maintain in a scalable way; you need to focus on your business requirements. robots.txt as a service helps by checking whether a given URL resource can be crawled with a specified user agent (or robot name). It can be easily integrated into existing software through a web API, and it starts working in less than a second!
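
As a rough sketch of such an integration, the Kotlin snippet below asks the checker API (described later in this README) whether a URL may be fetched before crawling it. The localhost:8080 address assumes the local docker-compose setup from the Getting started section, and the string check on the response is deliberately naive; a real integration would parse the JSON properly:

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

fun isAllowed(url: String, agent: String): Boolean {
    // Same JSON body as the curl example shown later in this README.
    val body = """{"url": "$url", "agent": "$agent"}"""
    val request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:8080/v1/allowed"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()
    val response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())
    // Naive check on the "allowed" field; use a JSON parser in real code.
    return response.statusCode() == 200 && response.body().contains("\"allowed\":true")
}

fun main() {
    if (isAllowed("https://news.ycombinator.com/newest", "AwesomeBot")) {
        println("Polite to crawl")
    }
}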

Requirements

In order to build this project on your machine, you will need to have the following installed on your system:

Getting started

If you want to test this project locally, you will need Docker, docker-compose and Make installed on your system. When that is done, execute the following command to compile all projects, build the Docker images and run them:

👉 Change the DOCKER_NETWORK environment variable in the .env file to match your Docker host interface
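
As a hedged sketch, the .env entry could end up looking like this; the value shown is only an example, so replace it with whatever matches your Docker host interface:

# .env (example value only)
DOCKER_NETWORK=192.168.1.0/24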

$ make start-all

You can execute make logs to see how things are going

Now you can send some URLs to the crawler system to download the rules found in the robots.txt file and persist them in the database. For example, you can invoke the crawl API using this command:

$ curl -X POST http://localhost:8081/v1/send \
       -d 'url=https://news.ycombinator.com/newcomments' \
       -H 'Content-Type: application/x-www-form-urlencoded'

The API also exposes another endpoint to make a crawl request using a GET method.

This command will send the URL to the streaming service; once it is received, the robots.txt file will be downloaded, parsed and saved into the database.

The next step is to check whether you can access a resource of a known host using a given user-agent directive. For this purpose, you will need to use the checker API. Imagine that you need to check if your crawler can access the newest resource from Hacker News. You will execute:

$ curl -X POST http://localhost:8080/v1/allowed \
       -d '{"url": "https://news.ycombinator.com/newest","agent": "AwesomeBot"}' \
       -H 'Content-Type: application/json'

The response will be:

{
  "url":"https://news.ycombinator.com/newest",
  "agent":"AwesomeBot",
  "allowed":true
}

This is like saying: “Hey! You can crawl content from https://news.ycombinator.com/newest”

When you finish your tests, execute the following command to stop and remove all Docker containers and their associated volumes:

$ make stop-all

If you want to start hacking, stop all containers and follow the instructions in this directory to start all the services the project requires.
