robots.txt for AI-related crawlers and bots #3900
Labels
🗄️ aspect: data - Concerns the data in our catalog and/or databases
✨ goal: improvement - Improvement to an existing user-facing feature
🟨 priority: medium - Not blocking but should be addressed soon
🧱 stack: frontend - Related to the Nuxt frontend
🔒 staff only - Restricted to staff members
Problem
We currently get a significant amount of bot traffic that, while not currently disrupting service, does place considerable load on our servers.
This primarily comes in the form of frontend, client-side searches.
One recent example is https://imagesift.com/, a reverse image search site run by the AI company https://thehive.ai/.
Description
Consider adding new robots.txt rules to block the majority of these platforms. This scraping behavior violates our terms of service; these users should instead be using our API, which enforces throttling rules.
This blog post from Neil Clarke shows some examples:
https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website
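As a rough sketch, the rules could look like the following. The user agent names below (GPTBot, CCBot, ImagesiftBot, Google-Extended, anthropic-ai) are drawn from commonly published lists of AI crawlers, including the blog post above, and should be verified against each vendor's current documentation before shipping:

```
# OpenAI's training-data crawler
User-agent: GPTBot
Disallow: /

# Common Crawl's bot; its corpus is widely used for AI training
User-agent: CCBot
Disallow: /

# Hive's crawler, which feeds the imagesift.com reverse image search
User-agent: ImagesiftBot
Disallow: /

# Google's opt-out token for AI training (does not affect Search indexing)
User-agent: Google-Extended
Disallow: /

# Anthropic's training-data crawler
User-agent: anthropic-ai
Disallow: /
```

Note that robots.txt is purely advisory: badly behaved bots can ignore it, so server-side rate limiting or user-agent blocking may still be needed for crawlers that do not comply.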