Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The cleanup_metrics job remains hanging on a DELETE statement in the database #8825

Closed
2 tasks done
SlavekB opened this issue Feb 20, 2023 · 12 comments · Fixed by #8563
Closed
2 tasks done

The cleanup_metrics job remains hanging on a DELETE statement in the database #8825

SlavekB opened this issue Feb 20, 2023 · 12 comments · Fixed by #8563
Assignees
Labels
bug Something is broken.
Milestone

Comments

@SlavekB
Copy link
Contributor

SlavekB commented Feb 20, 2023

Describe the issue

I regularly observe that there is a DELETE command hanging in the database, which consumes 100% of one CPU core and it is not going to be completed. When I kill the relevant database process, the Weblate celery log reports that the cleanup_metrics job failed.

postgres: 11/main: weblate weblate 127.0.0.1(58416) DELETE

I tried to wait patiently for a long time, but it didn't seem like it was going anywhere.

I already tried

  • I've read and searched the documentation.
  • I've searched for similar issues in this repository.

Steps to reproduce the behavior

  1. Let the celery jobs run and wait for cleanup_metrics to run.
  2. Look at the CPU usage and you will see one core loaded at 100%.
  3. Kill the postgres process that is doing infinite DELETE.
  4. The celery log reports that the cleanup_metrics job failed.

Expected behavior

I expect that cleanup_metrics should be completed at some point to allow processing of other tasks that are waiting in the celery queue.

Screenshots

No response

Exception traceback

No response

How do you run Weblate?

PyPI module

Weblate versions

  • Weblate: 4.15.2
  • Django: 3.2.4
  • siphashc: 1.3
  • translate-toolkit: 3.8.3
  • lxml: 4.9.1
  • Pillow: 9.1.0
  • bleach: 3.1.5
  • python-dateutil: 2.8.1
  • social-auth-core: 4.3.0
  • social-auth-app-django: 5.0.0
  • django-crispy-forms: 1.9.0
  • oauthlib: 3.1.0
  • django-compressor: 2.4
  • djangorestframework: 3.11.0
  • django-filter: 2.4.0
  • django-appconf: 1.0.4
  • user-agents: 2.1
  • filelock: 3.7.1
  • rapidfuzz: 2.13.2
  • openpyxl: 3.0.3
  • celery: 5.1.1
  • kombu: 5.1.0
  • translation-finder: 2.15
  • weblate-language-data: 2022.8
  • html2text: 2020.1.16
  • pycairo: 1.19.1
  • pygobject: 3.36.1
  • diff-match-patch: 20200713
  • requests: 2.27.1
  • django-redis: 4.11.0
  • hiredis: 1.0.1
  • sentry_sdk: 1.6.0
  • Cython: 0.29.17
  • misaka: 2.1.1
  • GitPython: 3.1.2
  • borgbackup: 1.1.11
  • pyparsing: 3.0.9
  • pyahocorasick: 1.4.1
  • python-redis-lock: 4.0.0
  • charset-normalizer: 2.0.12
  • Python: 3.7.3
  • Git: 2.20.1
  • psycopg2-binary: 2.8.5
  • phply: 1.2.5
  • ruamel.yaml: 0.16.10
  • boto3: 1.14.7
  • zeep: 4.1.0
  • aeidon: 1.7.0
  • iniparse: 0.5
  • Mercurial: 4.8.2
  • git-svn: 2.20.1
  • git-review: 1.27.0
  • Redis server: 5.0.14
  • PostgreSQL server: 11.19
  • Database backends: django.db.backends.postgresql
  • Cache backends: default:RedisCache, avatar:FileBasedCache
  • Email setup: django_sendmail_backend.backends.EmailBackend: localhost
  • OS encoding: filesystem=utf-8, default=utf-8
  • Celery: redis://localhost:6379, redis://localhost:6379, regular
  • Platform: Linux 5.10.159 (x86_64)

Weblate deploy checks

System check identified no issues (3 silenced).

Additional context

The problem started appearing after upgrade to Weblate 4.14.x.

@nijel
Copy link
Member

nijel commented Feb 23, 2023

The initial cleanup might take long, but it should eventually complete. What query is actually taking that log?

@rob006
Copy link
Contributor

rob006 commented Feb 25, 2023

I have noticed that too. It is not related to initial cleanup, but to queries deleting orphaned metrics - these queries do not use indexes, so for large metrics table they can be super slow, even if they do not delete anything as a result. For me each query like DELETE FROM `metrics_metric` WHERE (`metrics_metric`.`scope` = 4 AND NOT (`metrics_metric`.`relation` IN (SELECT U0.`id` FROM `weblate_auth_user` U0))) takes 5-10 minutes to execute, so the whole job takes almost an hour. At this point I would rather keep these orphaned metrics for 800 days than run such slow cleanup job every day.

@SlavekB
Copy link
Contributor Author

SlavekB commented Feb 26, 2023

I tried to be very patient and the delete query runs for more than 49 hours. In the process, I had to kill the parallel execution of cleanup_metric job for the second time so that multiple delete queries would not run unnecessarily. I'm not experienced enough with postgresql to verify what command is running all the time. It definitely takes a lot of patience.

@nijel nijel mentioned this issue Feb 26, 2023
7 tasks
@nijel
Copy link
Member

nijel commented Feb 26, 2023

This should be addressed together with #8119 (#8563).

@nijel nijel added this to the 4.17 milestone Feb 27, 2023
@github-actions
Copy link

This issue has been automatically marked as stale because there wasn’t any recent activity.

It will be closed soon if no further action occurs.

Thank you for your contributions!

@github-actions github-actions bot added the wontfix Nobody will work on this. label Mar 10, 2023
@SlavekB
Copy link
Contributor Author

SlavekB commented Mar 10, 2023

The problem is still valid. Since updating to Weblate 4.16.1, the job is now running for 100 hours and since I have a lot of patience, I can keep it running…

@nijel
Copy link
Member

nijel commented Mar 10, 2023

That's expected, nothing was changed in 4.16...

@github-actions github-actions bot removed the wontfix Nobody will work on this. label Mar 11, 2023
@github-actions

This comment was marked as off-topic.

@github-actions github-actions bot added the wontfix Nobody will work on this. label Mar 22, 2023
@SlavekB

This comment was marked as off-topic.

@nijel nijel added bug Something is broken. and removed wontfix Nobody will work on this. labels Mar 22, 2023
@nijel

This comment was marked as off-topic.

@nijel nijel self-assigned this Apr 3, 2023
nijel added a commit to nijel/weblate that referenced this issue Apr 3, 2023
This removes need for possibly expensive garbage collection.

Fixes WeblateOrg#8825
@nijel nijel linked a pull request Apr 3, 2023 that will close this issue
7 tasks
nijel added a commit to nijel/weblate that referenced this issue Apr 3, 2023
This removes need for possibly expensive garbage collection.

Fixes WeblateOrg#8825
nijel added a commit that referenced this issue Apr 3, 2023
This removes need for possibly expensive garbage collection.

Fixes #8825
@github-actions
Copy link

github-actions bot commented Apr 3, 2023

Thank you for your report; the issue you have reported has just been fixed.

  • In case you see a problem with the fix, please comment on this issue.
  • In case you see a similar problem, please open a separate issue.
  • If you are happy with the outcome, don’t hesitate to support Weblate by making a donation.

@tschulte
Copy link

We just needed to remove some components from a translation project in a Weblate 4.17 installation and the deletion did not work due to the same query. Only after adding a database index CREATE INDEX "metrics_metric_relation_scope" ON "metrics_metric" ("relation" , "scope") we were able to successfully delete the components. We have not tested with the latest Weblate version, but you might consider adding this index.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is broken.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants