Database default space allowance gets filled fairly quickly #401

Open

pcm32 opened this issue Dec 16, 2022 · 7 comments

Comments

@pcm32
Member

pcm32 commented Dec 16, 2022

After only 10 or so executions of our single-cell pipeline, the database disk got full:

root@galaxy-galaxy-dev-postgres-0:/home/postgres/pgdata# df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc        9.8G  9.7G  129M  99% /home/postgres/pgdata

Is there any process in place that cleans the database with some regularity, or should we suggest a higher default for the Postgres disk space? I have been cleaning up disk space regularly, but I suspect that the job records/logs stored in the database are not getting cleaned up as part of this process. Do none of the maintenance jobs take care of this?
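
For a quick sanity check of how much of that disk the Galaxy database itself occupies, something along these lines should work (a sketch; the database name "galaxy" and the postgres superuser are assumptions about this deployment):

# Assumption: the database is named "galaxy" and psql can connect as the postgres user.
kubectl exec -it galaxy-galaxy-dev-postgres-0 -- \
  psql -U postgres -c "SELECT pg_size_pretty(pg_database_size('galaxy'));"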

@pcm32
Member Author

pcm32 commented Dec 16, 2022

I have run:

galaxy@galaxy-dev-job-0-6c8f594ff5-cwzd9:/galaxy/server$ bash scripts/maintenance.sh --no-dry-run --days 1

inside the job container, but I get this failure:

galaxy@galaxy-dev-job-0-6c8f594ff5-cwzd9:/galaxy/server$ bash scripts/maintenance.sh
Unsetting $PYTHONPATH
Activating virtualenv at .venv

Dry run: false
Days: 1

Will run following commands and output in maintenance.log
python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --delete_userless_histories >> maintenance.log
python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --purge_histories >> maintenance.log
python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --purge_datasets >> maintenance.log
python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --purge_folders >> maintenance.log
python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --delete_datasets >> maintenance.log
python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --purge_datasets >> maintenance.log
Traceback (most recent call last):
  File "/galaxy/server/scripts/cleanup_datasets/cleanup_datasets.py", line 702, in <module>
    main()
  File "/galaxy/server/scripts/cleanup_datasets/cleanup_datasets.py", line 212, in main
    delete_datasets(app, cutoff_time, args.remove_from_disk, info_only=args.info_only, force_retry=args.force_retry)
  File "/galaxy/server/scripts/cleanup_datasets/cleanup_datasets.py", line 383, in delete_datasets
    (app.model.Dataset.table.c.id, app.model.Dataset.table.c.state),
AttributeError: type object 'Dataset' has no attribute 'table'

It seems to happen on:

python scripts/cleanup_datasets/cleanup_datasets.py /galaxy/server/config/galaxy.yml -d 1 -r --delete_datasets >> maintenance.log

It does return exit code 1 though, so I would guess that the maintenance jobs should be detecting this error?

@pcm32
Member Author

pcm32 commented Dec 16, 2022

However, even after running the step that follows the failing one (plus all the others that run before it), the database disk usage is more or less the same...

@pcm32
Member Author

pcm32 commented Dec 16, 2022

Main culprit seems to be:

 table_schema |             table_name              | total_size | data_size | external_size
--------------+-------------------------------------+------------+-----------+---------------
 public       | history_dataset_association_history | 1193 MB    | 936 kB    | 1192 MB
 public       | history_dataset_association         | 394 MB     | 5168 kB   | 389 MB
 public       | galaxy_session                      | 2712 kB    | 1448 kB   | 1264 kB
 public       | job                                 | 1544 kB    | 920 kB    | 624 kB
 public       | tool_shed_repository                | 1432 kB    | 1056 kB   | 376 kB
 public       | job_parameter                       | 1336 kB    | 936 kB    | 400 kB
 public       | job_state_history                   | 944 kB     | 640 kB    | 304 kB
 public       | dataset                             | 760 kB     | 256 kB    | 504 kB
 public       | dataset_collection_element          | 576 kB     | 192 kB    | 384 kB
 public       | job_to_input_dataset                | 480 kB     | 216 kB    | 264 kB
(10 rows)

Interestingly, the data size is very small; maybe there is some Postgres purge or something that is not happening?

OK, apparently in Postgres speak, "external size" means the size taken up by external indices, references, etc. (i.e. everything other than the table data itself).
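
For reference, a query along these lines reproduces that kind of per-table breakdown (a sketch, not necessarily the exact query used above); "external size" here is total minus data, i.e. indexes plus TOAST:

# Run from inside the postgres pod; the database name "galaxy" is an assumption.
psql -d galaxy -c "
  SELECT table_schema,
         table_name,
         pg_size_pretty(pg_total_relation_size(quote_ident(table_schema) || '.' || quote_ident(table_name))) AS total_size,
         pg_size_pretty(pg_relation_size(quote_ident(table_schema) || '.' || quote_ident(table_name)))       AS data_size,
         pg_size_pretty(pg_total_relation_size(quote_ident(table_schema) || '.' || quote_ident(table_name))
                        - pg_relation_size(quote_ident(table_schema) || '.' || quote_ident(table_name)))     AS external_size
  FROM information_schema.tables
  WHERE table_schema = 'public'
  ORDER BY pg_total_relation_size(quote_ident(table_schema) || '.' || quote_ident(table_name)) DESC
  LIMIT 10;"

One thing worth keeping in mind here: deleting rows only marks their space as reusable inside Postgres; a plain (auto)vacuum does not shrink the files on disk, so actually returning space to the OS generally requires a VACUUM FULL of the bloated tables.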

@nuwang
Member

nuwang commented Dec 16, 2022

Try setting .Values.postgresql.persistence.size. If I remember right, the operator will attempt to resize the disk; if that doesn't happen, you might have to resize it manually.

Regarding the maintenance failure: the maintenance cron job doesn't run that maintenance script; it only runs a job for cleaning up the tmpdir. But I think we should include this script as well. I think the silent job failure should be reported to Galaxy; it's probable that it's affecting a lot of people.
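
For the chart value suggestion, a minimal sketch of what that could look like (the release name "galaxy-dev", the chart reference, and the target size are all assumptions):

# Assumption: the release is called "galaxy-dev" and was installed from a chart
# reference such as galaxyproject/galaxy; adjust both, and pick a size that fits.
helm upgrade galaxy-dev galaxyproject/galaxy \
  --reuse-values \
  --set postgresql.persistence.size=20Gi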

@pcm32
Member Author

pcm32 commented Dec 16, 2022

But I think we should include this script as well.

Agreed, I will look into where it should go.

I think the silent job failure should be reported to Galaxy; it's probable that it's affecting a lot of people.

Sure, will do!

@pcm32
Member Author

pcm32 commented Dec 19, 2022

So, for the record here: changing .Values.postgresql.persistence.size meant that the operator attempted a live resize of the disk, which corrupted it :-( and meant that I had to redo the deployment (no harm done, since this is a development setup for testing). It's probably better to do that manually (scaling the setup down first to make sure the resize isn't done "hot"). Maybe we should have a small section in the README about resizing disks.
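
A manual resize along those lines might look roughly like the following sketch (the StatefulSet and PVC names are assumptions based on the pod name above, and the StorageClass must allow volume expansion):

# Scale the database down first so the resize isn't attempted "hot".
kubectl scale statefulset galaxy-galaxy-dev-postgres --replicas=0

# Grow the PersistentVolumeClaim (requires allowVolumeExpansion on the StorageClass).
kubectl patch pvc pgdata-galaxy-galaxy-dev-postgres-0 \
  -p '{"spec": {"resources": {"requests": {"storage": "20Gi"}}}}'

# Bring the database back up once the volume has been expanded.
kubectl scale statefulset galaxy-galaxy-dev-postgres --replicas=1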

@pcm32
Member Author

pcm32 commented Dec 19, 2022

Also, I was told (by Nicola, I think) that scripts/maintenance.sh --no-dry-run doesn't actually attempt to delete anything in the database, so there must be another mechanism.
