Skip to content
master
Switch branches/tags
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

🤖 robots.txt as a service 🤖

License Twitter Follow

🚧 Project in development

Distributed robots.txt parser and rule checker through API access. If you are working on a distributed web crawler, and you want to be polite in your action, then you will find this project very useful. Also, this project can be used to integrate into any SEO tool to check if the content is being indexed correctly by robots.

For this first version, we are trying to comply with the specification used by Google to analyze websites. You can see it here. Expect support from other robot specifications soon!

Why this project?

If you are building a distributed web crawler, you know that manage robots.txt rules from websites is a hard task, and can be complicated to maintain in a scalable way. You need to focus on your business requirements. robots.txt can help by acting as a service to check if a given url resource can be crawled using a specified user agent (or robot name). It can be easily integrated in existing software through a web API, and start to work in less than a second!

Requirements

In order to build this project in your machine you will need to have installed in your system:

Getting started

If you want to test this project locally, then you will need to be installed in your system Docker, docker-compose and Make. When done, then execute the following command to compile all projects, build docker images and run it:

👉 Be patient!

$ make start-all

You can execute make logs to see how things have gone

Now you can send some URL's to the crawler system to download the rules found in the robots.txt file and persist it in the database. For example, you can invoke the crawl API using this command:

$ curl -X POST http://localhost:9081/v1/send \
       -d 'url=https://news.ycombinator.com/newcomments' \
       -H 'Content-Type: application/x-www-form-urlencoded'

Also, there is another method in the API to make a crawl request but using a GET method. If you want to check all methods this application expose, import this Postman collection.

This command will send the URL to the streaming service, and when received, the robots.txt file will be downloaded, parsed and saved into the database.

The next step is to check if you can access any resource of a known host using a user-agent directive. For this purpose, you will need to use the checker API. Imagine that you need to check if your crawler can access the newest resource from hacker news. You will execute:

$ curl -X POST http://localhost:9080/v1/allowed \
       -d '{"url": "https://news.ycombinator.com/newest","agent": "AwesomeBot"}' \
       -H 'Content-Type: application/json'

The response will be:

{
  "url":"https://news.ycombinator.com/newest",
  "agent":"AwesomeBot",
  "allowed":true
}

This is like saying: Hey!, you can crawl content from https://news.ycombinator.com/newest

When you finish your test, execute the next command to stop and remove all docker containers:

$ make stop-all

🔥 Happy Hacking! 🔥