Another bot problem #7514

Closed
2 tasks done
tomkolp opened this issue Apr 19, 2022 · 10 comments
Labels
question This is more a question for the support than an issue.

Comments

@tomkolp
Contributor

tomkolp commented Apr 19, 2022

Describe the issue

Bot queries are consuming two CPU cores (200% CPU). I would like to completely block robots from accessing my projects (no indexing needed). How can I edit the robots.txt file?

nginx stdout | 172.21.0.4 - - [19/Apr/2022:11:51:46 +0200] "GET /changes/?component=3952276&project=eso-spolszczenie&start_date=2020-12-01&end_date=2020-12-31 HTTP/1.0" 200 16033 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
nginx stdout | 172.21.0.4 - - [19/Apr/2022:11:51:50 +0200] "GET /changes/?component=3952276&project=eso-spolszczenie&start_date=2021-01-01&end_date=2021-01-31 HTTP/1.0" 200 79082 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
nginx stdout | 172.21.0.4 - - [19/Apr/2022:11:51:54 +0200] "GET /changes/?component=3952276&project=eso-spolszczenie&start_date=2021-02-01&end_date=2021-02-28 HTTP/1.0" 200 84605 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
nginx stdout | 172.21.0.4 - - [19/Apr/2022:11:51:58 +0200] "GET /changes/?component=3952276&project=eso-spolszczenie&start_date=2021-03-01&end_date=2021-03-31 HTTP/1.0" 200 79798 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
nginx stdout | 172.21.0.4 - - [19/Apr/2022:11:52:03 +0200] "GET /changes/?component=3952276&project=eso-spolszczenie&start_date=2021-04-01&end_date=2021-04-30 HTTP/1.0" 200 80495 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
nginx stdout | 172.21.0.4 - - [19/Apr/2022:11:52:08 +0200] "GET /changes/?component=3952276&project=eso-spolszczenie&start_date=2021-05-01&end_date=2021-05-31 HTTP/1.0" 200 78260 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
nginx stdout | 172.21.0.4 - - [19/Apr/2022:11:52:12 +0200] "GET /changes/?component=3952276&project=eso-spolszczenie&start_date=2021-06-01&end_date=2021-06-30 HTTP/1.0" 200 80506 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"

I already tried

  • I've read and searched the documentation.
  • I've searched for similar issues in this repository.

Steps to reproduce the behavior

No response

Expected behavior

No response

Screenshots

No response

Exception traceback

No response

Additional context

Docker

@tomkolp added the question label Apr 19, 2022
@github-actions

This issue looks more like a support question than an issue. We strive to answer these reasonably fast, but purchasing the support subscription is not only more responsible and faster for your business but also makes Weblate stronger.

In case your question is already answered, making a donation is the right way to say thank you!

@nijel
Member

nijel commented Apr 19, 2022

/changes/ is already present in the robots.txt:

Disallow: /changes/

@nijel nijel transferred this issue from WeblateOrg/docker Apr 19, 2022
@tomkolp
Contributor Author

tomkolp commented Apr 19, 2022

This is not a malicious bot, so it should respect your robots.txt. How can I edit the robots.txt file in Docker myself?

@nijel
Member

nijel commented Apr 19, 2022

Create a custom /app/data/python/customize/templates/robots.txt; see https://docs.weblate.org/en/latest/admin/install/docker.html#further-configuration-customization
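
A minimal sketch of doing that with the Docker deployment (assuming the data volume is mounted at ./weblate-data on the host and the compose service is named weblate):

mkdir -p weblate-data/python/customize/templates
$EDITOR weblate-data/python/customize/templates/robots.txt   # put the custom rules here
docker compose restart weblate                               # restart so the custom template is picked up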

@tomkolp
Contributor Author

tomkolp commented Apr 19, 2022

I temporarily blocked all bots.
Before the block, e.g. "SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)" was reading /changes/.
After being blocked, it read robots.txt and stopped reading /changes/.

nginx stdout | 172.21.0.2 - - [19/Apr/2022:18:30:11 +0200] "GET /changes/?project=eso-spolszczenie&start_date=2021-11-01&end_date=2021-11-30 HTTP/1.0" 200 89684 "-" "SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)"
nginx stdout | 172.21.0.2 - - [19/Apr/2022:18:30:11 +0200] "GET /languages/pl/eso-spolszczenie/ HTTP/1.0" 200 5779132 "-" "SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)"
nginx stdout | 172.21.0.2 - - [19/Apr/2022:18:30:12 +0200] "GET /changes/?project=eso-spolszczenie&action=46 HTTP/1.0" 200 21649 "-" "SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)"
nginx stdout | 172.21.0.2 - - [19/Apr/2022:18:30:13 +0200] "GET /changes/?project=eso-spolszczenie&start_date=2020-10-01&end_date=2020-10-31 HTTP/1.0" 200 15817 "-" "SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)"
nginx stdout | 172.21.0.2 - - [19/Apr/2022:18:30:14 +0200] "GET /changes/?project=eso-spolszczenie&start_date=2020-12-01&end_date=2020-12-31 HTTP/1.0" 200 47463 "-" "SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)"
nginx stdout | 172.21.0.2 - - [19/Apr/2022:18:30:15 +0200] "GET /changes/?project=eso-spolszczenie&start_date=2021-01-01&end_date=2021-01-31 HTTP/1.0" 200 74489 "-" "SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)"
celery-celery stderr | [2022-04-19 18:30:16,045: INFO/MainProcess] Task weblate.utils.tasks.heartbeat[15110110-0f7d-41db-a0c3-fc01c812901c] received
nginx stdout | 172.21.0.2 - - [19/Apr/2022:18:30:20 +0200] "GET /robots.txt HTTP/1.0" 200 27 "-" "SiteCheckerBotCrawler/1.0 (+http://sitechecker.pro)"
nginx stdout | 127.0.0.1 - - [19/Apr/2022:18:30:26 +0200] "GET /healthz/ HTTP/1.1" 200 12 "-" "curl/7.74.0"
nginx stdout | 127.0.0.1 - - [19/Apr/2022:18:30:56 +0200] "GET /healthz/ HTTP/1.1" 200 12 "-" "curl/7.74.0"

It would be nice to have a real IP in the logs so I could block it on the firewall.
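
In the meantime, misbehaving crawlers can be refused by User-Agent at the proxy; a minimal nginx sketch (the bot list and where to hook this into the Docker image's nginx config are assumptions):

if ($http_user_agent ~* "(BLEXBot|SiteCheckerBotCrawler)") {
    return 403;   # refuse these crawlers outright, regardless of robots.txt
}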

@tomkolp
Contributor Author

tomkolp commented Apr 20, 2022

Weird that the bots obey a "block User-agent" entry but not "Disallow: /changes/".

I also noticed that bots read /static/ and /matrix/; should it be like that?

nginx stdout | 172.21.0.2 - - [20/Apr/2022:08:27:58 +0200] "GET /static/CACHE/css/output.d784a4e73944.css HTTP/1.0" 200 151125 "https://site.eu/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 SeoSiteCheckup (https://seositecheckup.com)"
nginx stdout | 172.21.0.2 - - [20/Apr/2022:08:27:58 +0200] "GET /static/state/alert.svg HTTP/1.0" 200 266 "site.eu/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 SeoSiteCheckup (https://seositecheckup.com)"
nginx stdout | 172.21.0.2 - - [20/Apr/2022:08:27:58 +0200] "GET /static/CACHE/js/output.91ba9ed0a400.js HTTP/1.0" 200 357236 "site.eu/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 SeoSiteCheckup (https://seositecheckup.com)"
nginx stdout | 172.21.0.2 - - [20/Apr/2022:08:27:58 +0200] "GET /css/custom.css?a075fb12203a9a2 HTTP/1.0" 200 5443 "site.eu/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 SeoSiteCheckup (https://seositecheckup.com)"
nginx stdout | 172.21.0.2 - - [20/Apr/2022:08:27:58 +0200] "GET /js/i18n/ HTTP/1.0" 200 3343 "site.eu/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 SeoSiteCheckup (https://seositecheckup.com)"

@nijel
Member

nijel commented Apr 20, 2022

It would be nice to have a real IP in the logs so I could block it on the firewall.

That should be done since WeblateOrg/docker#1306
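
(For context, the standard nginx mechanism for this is the real_ip module; a sketch assuming the Docker network range seen in the logs, not necessarily the exact change in that PR:)

set_real_ip_from 172.21.0.0/16;   # trust the proxy on the Docker network
real_ip_header X-Forwarded-For;   # take the client address from this header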

I also noticed that bots read /static/ and /matrix/; should it be like that?

Static files are okay; /matrix/ is disallowed since b255385

@tomkolp
Contributor Author

tomkolp commented Apr 20, 2022

For me, the topic is complete. I created my own robots.txt file and blocked all User-Agents (I don't need search engine indexing). When the real IP appears in the logs in the future, I will restore the default robots.txt; with real IPs in the logs I can check whether an IP belongs to a legitimate bot or a malicious one. At the moment some bots do not respect robots.txt and keep reading the disallowed folders; only after prohibiting visits completely does the User-Agent stop visiting them.
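
For reference, a deny-all robots.txt is just two lines (a standard pattern; assuming this matches the file used here):

User-agent: *
Disallow: /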

@tomkolp tomkolp closed this as completed Apr 20, 2022
@github-actions

The issue you have reported is now resolved. If you don’t feel it’s right, please follow its labels to get a clue for further steps.

  • In case you see a similar problem, please open a separate issue.
  • If you are happy with the outcome, don’t hesitate to support Weblate by making a donation.

@tomkolp
Contributor Author

tomkolp commented Apr 20, 2022

All bots that do not respect the robots.txt file are malicious. Now that you can see the IP, they can be easily tracked and blocked.
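
For example, once a real client address such as 203.0.113.7 (a placeholder from the documentation range) shows up in the logs, it can be dropped on the host firewall; a sketch using iptables:

iptables -A INPUT -s 203.0.113.7 -j DROP   # 203.0.113.7 is a hypothetical bot IP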
