add robots.txt to endpoint #7

Open
lockefox opened this issue Apr 19, 2017 · 3 comments

lockefox commented Apr 19, 2017

In an effort to subdue crawlers, add a robots.txt to publicAPI/static

http://stackoverflow.com/questions/14048779/with-flask-how-can-i-serve-robots-txt-and-sitemap-xml-as-static-files

Notes:

  • need to add publicAPI/static to MANIFEST.in
  • may need to add publicAPI/static/* to package_data in setup.py
  • Any new endpoints will need tests added to test_crest_endpoint.py
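
A minimal sketch of that static-file approach, assuming the standard Flask app object and a robots.txt placed in publicAPI/static (route and file names here are illustrative, not the final implementation):

from flask import Flask, send_from_directory

app = Flask(__name__)

@app.route('/robots.txt')
def robots_txt():
    # serve publicAPI/static/robots.txt at the site root so crawlers can find it
    return send_from_directory(app.static_folder, 'robots.txt')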

lockefox commented Apr 19, 2017

Bots to block

How can I block MJ12bot?

MJ12bot adheres to the robots.txt standard. If you want to prevent the bot from crawling your website, then add the following text to your robots.txt:

User-agent: MJ12bot
Disallow: /
Please do not block our bot via IP in htaccess - we do not use any consecutive IP blocks as we are a community based distributed crawler. Please always make sure the bot can actually retrieve robots.txt itself. If it can't then it will assume that it is okay to crawl your site.

If you have reason to believe that MJ12bot did NOT obey your robots.txt commands, then please let us know via email: bot@majestic12.co.uk. Please provide the URL of your website and log entries showing the bot trying to retrieve pages that it was not supposed to.

Testing/validating robots.txt functionality
https://docs.python.org/3.5/library/urllib.robotparser.html
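
A quick way to sanity-check the served file with the stdlib parser linked above (the host/port below are placeholders for wherever publicAPI is running):

from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url('http://localhost:8000/robots.txt')
parser.read()

# MJ12bot should be denied everywhere; other agents only off the disallowed paths
print(parser.can_fetch('MJ12bot', 'http://localhost:8000/CREST/'))  # expected: False
print(parser.can_fetch('*', 'http://localhost:8000/'))              # expected: True if only /CREST/ is disallowed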


kanelarrete commented Apr 19, 2017

To keep multiple bots from crawling the site, you can use a blanket rule. This example tells all robots to stay out of a website:

User-agent: *
Disallow: /

This is taken from https://en.wikipedia.org/wiki/Robots_exclusion_standard#About_the_standard

To protect the API only, it could be set up as:

User-agent: *
Disallow: /CREST/
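
Combining that with the MJ12bot rule above, one possible robots.txt for publicAPI/static would be:

User-agent: MJ12bot
Disallow: /

User-agent: *
Disallow: /CREST/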

lockefox commented:

Robots.txt functionality should also be added to ProsperCookiecutter for Flask projects
