Another bot problem #7514

Closed
2 tasks done
tomkolp opened this issue Apr 19, 2022 · 10 comments
Labels
question This is more a question for the support than an issue.

Comments

@tomkolp
Contributor

tomkolp commented Apr 19, 2022

Describe the issue

Bot queries are consuming two CPU cores (200% CPU). I would like to completely block robots from accessing my projects (no indexing needed). How can I edit the robots.txt file?

nginx stdout | 172.21.0.4 - - [19/Apr/2022:11:51:46 +0200] "GET /changes/?component=3952276&project=eso-spolszczenie&start_date=2020-12-01&end_date=2020-12-31 HTTP/1.0" 200 16033 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
nginx stdout | 172.21.0.4 - - [19/Apr/2022:11:51:50 +0200] "GET /changes/?component=3952276&project=eso-spolszczenie&start_date=2021-01-01&end_date=2021-01-31 HTTP/1.0" 200 79082 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
nginx stdout | 172.21.0.4 - - [19/Apr/2022:11:51:54 +0200] "GET /changes/?component=3952276&project=eso-spolszczenie&start_date=2021-02-01&end_date=2021-02-28 HTTP/1.0" 200 84605 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
nginx stdout | 172.21.0.4 - - [19/Apr/2022:11:51:58 +0200] "GET /changes/?component=3952276&project=eso-spolszczenie&start_date=2021-03-01&end_date=2021-03-31 HTTP/1.0" 200 79798 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
nginx stdout | 172.21.0.4 - - [19/Apr/2022:11:52:03 +0200] "GET /changes/?component=3952276&project=eso-spolszczenie&start_date=2021-04-01&end_date=2021-04-30 HTTP/1.0" 200 80495 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
nginx stdout | 172.21.0.4 - - [19/Apr/2022:11:52:08 +0200] "GET /changes/?component=3952276&project=eso-spolszczenie&start_date=2021-05-01&end_date=2021-05-31 HTTP/1.0" 200 78260 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
nginx stdout | 172.21.0.4 - - [19/Apr/2022:11:52:12 +0200] "GET /changes/?component=3952276&project=eso-spolszczenie&start_date=2021-06-01&end_date=2021-06-30 HTTP/1.0" 200 80506 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"

I already tried

  • I've read and searched the documentation.
  • I've searched for similar issues in this repository.

Steps to reproduce the behavior

No response

Expected behavior

No response

Screenshots

No response

Exception traceback

No response

Additional context

Docker

@tomkolp added the question label Apr 19, 2022
@github-actions

This issue looks more like a support question than an issue. We strive to answer these reasonably fast, but purchasing the support subscription is not only more responsible and faster for your business but also makes Weblate stronger.

In case your question is already answered, making a donation is the right way to say thank you!

@nijel
Member

nijel commented Apr 19, 2022

/changes/ is already present in the robots.txt:

Disallow: /changes/

@nijel nijel transferred this issue from WeblateOrg/docker Apr 19, 2022
@tomkolp
Contributor Author

tomkolp commented Apr 19, 2022

This is not a malicious bot, so it should respect your robots.txt. How can I edit the robots.txt file in Docker myself?

@nijel
Member

nijel commented Apr 19, 2022

Create a custom /app/data/python/customize/templates/robots.txt; see https://docs.weblate.org/en/latest/admin/install/docker.html#further-configuration-customization
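
A minimal sketch of doing that with the Docker deployment (assuming the data volume is mounted at ./weblate-data on the host and the compose service is named weblate):

mkdir -p weblate-data/python/customize/templates
$EDITOR weblate-data/python/customize/templates/robots.txt   # put the custom rules here
docker compose restart weblate                               # restart so the custom template is picked up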

@tomkolp
Contributor Author

tomkolp commented Apr 19, 2022

I temporarily blocked all bots.
Before the block, e.g. "SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)" was reading /changes/.
After being blocked, it read robots.txt and stopped reading /changes/.

nginx stdout | 172.21.0.2 - - [19/Apr/2022:18:30:11 +0200] "GET /changes/?project=eso-spolszczenie&start_date=2021-11-01&end_date=2021-11-30 HTTP/1.0" 200 89684 "-" "SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)"
nginx stdout | 172.21.0.2 - - [19/Apr/2022:18:30:11 +0200] "GET /languages/pl/eso-spolszczenie/ HTTP/1.0" 200 5779132 "-" "SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)"
nginx stdout | 172.21.0.2 - - [19/Apr/2022:18:30:12 +0200] "GET /changes/?project=eso-spolszczenie&action=46 HTTP/1.0" 200 21649 "-" "SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)"
nginx stdout | 172.21.0.2 - - [19/Apr/2022:18:30:13 +0200] "GET /changes/?project=eso-spolszczenie&start_date=2020-10-01&end_date=2020-10-31 HTTP/1.0" 200 15817 "-" "SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)"
nginx stdout | 172.21.0.2 - - [19/Apr/2022:18:30:14 +0200] "GET /changes/?project=eso-spolszczenie&start_date=2020-12-01&end_date=2020-12-31 HTTP/1.0" 200 47463 "-" "SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)"
nginx stdout | 172.21.0.2 - - [19/Apr/2022:18:30:15 +0200] "GET /changes/?project=eso-spolszczenie&start_date=2021-01-01&end_date=2021-01-31 HTTP/1.0" 200 74489 "-" "SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)"
celery-celery stderr | [2022-04-19 18:30:16,045: INFO/MainProcess] Task weblate.utils.tasks.heartbeat[15110110-0f7d-41db-a0c3-fc01c812901c] received
nginx stdout | 172.21.0.2 - - [19/Apr/2022:18:30:20 +0200] "GET /robots.txt HTTP/1.0" 200 27 "-" "SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)"
nginx stdout | 127.0.0.1 - - [19/Apr/2022:18:30:26 +0200] "GET /healthz/ HTTP/1.1" 200 12 "-" "curl/7.74.0"
nginx stdout | 127.0.0.1 - - [19/Apr/2022:18:30:56 +0200] "GET /healthz/ HTTP/1.1" 200 12 "-" "curl/7.74.0"

It would be nice to have a real IP in the logs so I could block it on the firewall.
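
In the meantime, misbehaving crawlers can be refused by User-Agent at the proxy; a minimal nginx sketch (the bot list and where to hook this into the Docker image's nginx config are assumptions):

if ($http_user_agent ~* "(BLEXBot|SiteCheckerBotCrawler)") {
    return 403;   # refuse these crawlers outright, regardless of robots.txt
}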

@tomkolp
Contributor Author

tomkolp commented Apr 20, 2022

Weird that the bots obey a "block User-agent" entry but not "Disallow: /changes/".

I also noticed that bots read /static/ and /matrix/; should it be like that?

nginx stdout | 172.21.0.2 - - [20/Apr/2022:08:27:58 +0200] "GET /static/CACHE/css/output.d784a4e73944.css HTTP/1.0" 200 151125 "https://site.eu/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 SeoSiteCheckup (https://seositecheckup.com)"
nginx stdout | 172.21.0.2 - - [20/Apr/2022:08:27:58 +0200] "GET /static/state/alert.svg HTTP/1.0" 200 266 "site.eu/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 SeoSiteCheckup (https://seositecheckup.com)"
nginx stdout | 172.21.0.2 - - [20/Apr/2022:08:27:58 +0200] "GET /static/CACHE/js/output.91ba9ed0a400.js HTTP/1.0" 200 357236 "site.eu/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 SeoSiteCheckup (https://seositecheckup.com)"
nginx stdout | 172.21.0.2 - - [20/Apr/2022:08:27:58 +0200] "GET /css/custom.css?a075fb12203a9a2 HTTP/1.0" 200 5443 "site.eu/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 SeoSiteCheckup (https://seositecheckup.com)"
nginx stdout | 172.21.0.2 - - [20/Apr/2022:08:27:58 +0200] "GET /js/i18n/ HTTP/1.0" 200 3343 "site.eu/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 SeoSiteCheckup (https://seositecheckup.com)"

@nijel
Member

nijel commented Apr 20, 2022

It would be nice to have a real IP in the logs so I could block it on the firewall.

That should be done since WeblateOrg/docker#1306
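
(For context, the standard nginx mechanism for this is the real_ip module; a sketch assuming the Docker network range seen in the logs, not necessarily the exact change in that PR:)

set_real_ip_from 172.21.0.0/16;   # trust the proxy on the Docker network
real_ip_header X-Forwarded-For;   # take the client address from this header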

I also noticed that bots read /static/ and /matrix/; should it be like that?

Static files are okay; /matrix/ is disallowed since b255385

@tomkolp
Contributor Author

tomkolp commented Apr 20, 2022

For me, the topic is complete. I created my own robots.txt file and blocked all User-Agents (I don't need search engine indexing). When the real IP appears in the logs in the future, I will restore the default robots.txt; with real IPs in the logs I can check whether an IP belongs to a legitimate bot or a malicious one. At the moment some bots do not respect robots.txt and keep reading the disallowed folders; only after prohibiting visits completely does the User-Agent stop visiting them.
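
For reference, a deny-all robots.txt is just two lines (a standard pattern; assuming this matches the file used here):

User-agent: *
Disallow: /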

@tomkolp tomkolp closed this as completed Apr 20, 2022
@github-actions

The issue you have reported is now resolved. If you don’t feel it’s right, please follow its labels to get a clue for further steps.

  • In case you see a similar problem, please open a separate issue.
  • If you are happy with the outcome, don’t hesitate to support Weblate by making a donation.

@tomkolp
Contributor Author

tomkolp commented Apr 20, 2022

All bots that do not respect the robots.txt file are malicious. Now that you can see the IP, they can be easily tracked and blocked.
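
For example, once a real client address such as 203.0.113.7 (a placeholder from the documentation range) shows up in the logs, it can be dropped on the host firewall; a sketch using iptables:

iptables -A INPUT -s 203.0.113.7 -j DROP   # 203.0.113.7 is a hypothetical bot IP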
