Fix frontend robots.txt #4186

Merged (3 commits) on Apr 24, 2024
4 changes: 4 additions & 0 deletions frontend/.gitignore
@@ -27,3 +27,7 @@ src/locales/scripts/wp-locales.json
/.npmrc
/.pnpmfile.cjs
/pnpm-lock.yaml

# To prevent accidentally adding a hardcoded robots.txt, see
# /src/server-middleware/robots.js for the robots.txt file.
src/static/robots.txt
zackkrida marked this conversation as resolved.
46 changes: 45 additions & 1 deletion frontend/src/server-middleware/robots.js
@@ -1,5 +1,45 @@
const { LOCAL, PRODUCTION } = require("../constants/deploy-env")

const AI_ROBOTS_CONTENT = `
Contributor:

Would it be easier to maintain a list of user-agent names, and create the block string using it:

uaList.map(ua => `User-agent: ${ua}\nDisallow: /\n`).join("\n")
# Block known AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: Omgili
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: ImagesiftBot
Disallow: /

User-agent: cohere-ai
Disallow: /

`

/**
* Send the correct robots.txt information per-environment.
*/
@@ -10,13 +50,17 @@ export default function robots(_, res) {
deployEnv === PRODUCTION
? `# Block search result pages
Contributor:

Suggested change:
? `# Block search result pages
? `# Block search result pages and single result pages

User-agent: *
Crawl-delay: 10
Disallow: /search/audio/
Disallow: /search/image/
Disallow: /search/
Disallow: /image/
Disallow: /audio/

crawl-delay:
Contributor:

What is the purpose of crawl-delay with no value here? Could you add a comment?

Member Author:

Typo!
${AI_ROBOTS_CONTENT}
`
: `# Block crawlers from the staging site
: `# Block everyone from the staging site
Contributor:

Ah! I suppose we should have a similar one for the staging API docs, then?

User-agent: *
Disallow: /
`
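
Pieced together from the hunks above, the resulting middleware would look roughly like this. The environment lookup and the response calls sit outside the visible hunks, so those parts are assumptions rather than lines from the diff:

const { LOCAL, PRODUCTION } = require("../constants/deploy-env")

// Abbreviated: the full list of blocked agents is shown in the diff above.
const AI_ROBOTS_CONTENT = `
# Block known AI crawlers
User-agent: GPTBot
Disallow: /
`

/**
 * Send the correct robots.txt information per-environment.
 */
export default function robots(_, res) {
  // Assumption: the deployment environment is read outside the visible hunk,
  // roughly like this; the exact variable name is not shown in the diff.
  const deployEnv = process.env.DEPLOYMENT_ENV || LOCAL
  const contents =
    deployEnv === PRODUCTION
      ? `# Block search result pages
User-agent: *
Crawl-delay: 10
Disallow: /search/audio/
Disallow: /search/image/
Disallow: /search/
${AI_ROBOTS_CONTENT}
`
      : `# Block everyone from the staging site
User-agent: *
Disallow: /
`
  // Assumption: the response is written with Node's standard ServerResponse
  // API; these two calls are not part of the visible diff.
  res.setHeader("Content-Type", "text/plain")
  res.end(contents)
}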
35 changes: 0 additions & 35 deletions frontend/src/static/robots.txt

This file was deleted.