Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce metrics disk usage #8119

Closed
nijel opened this issue Sep 15, 2022 · 8 comments · Fixed by #8563
Closed

Reduce metrics disk usage #8119

nijel opened this issue Sep 15, 2022 · 8 comments · Fixed by #8563
Assignees
Labels
enhancement Adding or requesting a new feature.
Milestone

Comments

@nijel
Copy link
Member

nijel commented Sep 15, 2022

Describe the problem

Currently, the metrics and related indexes account to the most of the database usage. That is not really reasonable for a feature that displays trends and summaries.

Describe the solution you'd like

The database usage should be reduced. The way to go might be to reduce the number of the rows by storing all data for a single day in a JSONField instead of having it in ~20 rows. This will also heavily reduce the index size as it will have one less column to index.

Describe alternatives you've considered

No response

Screenshots

No response

Additional context

No response

@nijel nijel added the enhancement Adding or requesting a new feature. label Sep 15, 2022
@nijel nijel added this to the 4.15 milestone Sep 15, 2022
@nijel nijel self-assigned this Sep 15, 2022
@nijel nijel modified the milestones: 4.14.2, 4.15 Nov 4, 2022
nijel added a commit that referenced this issue Nov 9, 2022
This reverts commit 674d342.

This migration is slow, so we need a better solution together with #8119.
@nijel nijel modified the milestones: 4.15, 4.16 Dec 13, 2022
@rob006
Copy link
Contributor

rob006 commented Jan 4, 2023

The way to go might be to reduce the number of the rows by storing all data for a single day in a JSONField instead of having it in ~20 rows. This will also heavily reduce the index size as it will have one less column to index.

Wouldn't that increase data size? Storing number as integer takes less space than storing it in JSON as string.

Alternative solutions:

  1. Do not store stats with 0 as value (it over 70% metrics in my case).
  2. Store only monthly stats for metrics older than 60 days (AFAIK they're not used individually).

@nijel
Copy link
Member Author

nijel commented Jan 5, 2023

Wouldn't that increase data size? Storing number as integer takes less space than storing it in JSON as string.

That's true, but there are additional fields which are now duplicated because there can be nearly 30 values stored for each object. So, the choice really is between duplicating date and three int columns 30 times, or having it once and having JSON overhead. The change will definitely decrease index size, as there will be 1/30 of rows and one less column to index.

Still, this needs proper benchmarking before making the change, that's why this is still being postponed ;-).

Do not store stats with 0 as value

We need to differentiate the case between no metrics calculated and 0 calculated. So even the 0 needs to be stored; otherwise it would have to be calculated again.

Store only monthly stats for metrics older than 60 days

That would be an optimization as well. We really require daily metrics after 60 days only to calculate monthly ones. Presently, monthly ones are stored in the cache only, that's why we have to keep daily ones around as they might disappear from the cache.

This table is still massive in my case (96% of database size) (in #6611 (comment))

Can you please share table and index sizes? For my test environment, the table data has 16GB, and the two indexes 6GB + 18GB.

@rob006
Copy link
Contributor

rob006 commented Jan 5, 2023

Can you please share table and index sizes? For my test environment, the table data has 16GB, and the two indexes 6GB + 18GB.

It is 5.3 GB for data and 4.2 GB for indexes. It is MySQL.

9aec555b

BTW: It would be also a great improvement to store stats only if they change. I guess in many cases these stats are that same for each day, so this table is full of duplicated stats stored for each day individually.

@nijel nijel mentioned this issue Jan 5, 2023
7 tasks
@nijel
Copy link
Member Author

nijel commented Jan 5, 2023

Okay, you actually made me to do the benchmark.

First, I did vacuum on the tables to make sure I'm measuring actual usage.

Current metrics take: data 16 GB, primary key 6 GB, unique index 11 GB

Moving them to JSON: data 7 GB, primary key 1 GB, unique index 1 GB

Moving them to JSON with string keys: data 10 GB, primary key 1 GB, unique index 1 GB

So, it almost matches my expectations – you save something on the data, but a lot on the indexes.

The code is here in case you want to try it out: #8563

It would be also a great improvement to store stats only if they change.

This will always be a trade-off between storage needed and complexity when working with the metrics. And my current preference is to make the code simple – once the metric is in the database, it can be used, if not, it needs to be calculated.

@rob006
Copy link
Contributor

rob006 commented Jan 6, 2023

Do you have plans to try store metrics in 27 separate int fields instead JSON? I should cut data size even more (assuming that you always store stats for all these fields) and may simplify some calculations.

@nijel
Copy link
Member Author

nijel commented Jan 6, 2023

It will be stored in JSON – the metrics can be added in future and changing structure of this table takes ages on big sites. But storing as a list of integers is definitely an option, see https://github.com/WeblateOrg/weblate/pull/8563/files#diff-30660b786ec366973fcef283c4f9531b57361cfc72a774d73b443fdfa25eecc9R116-R121

@nijel nijel modified the milestones: 4.16, 4.17 Feb 27, 2023
nijel added a commit to nijel/weblate that referenced this issue Apr 3, 2023
- Data is now stored as JSON list
- It can be wiped after two months when no longer needed while keeping
  the changes metric around
- This heavily reduces table size and improves metrics performance

Fixes WeblateOrg#8119
@nijel nijel linked a pull request Apr 3, 2023 that will close this issue
7 tasks
nijel added a commit to nijel/weblate that referenced this issue Apr 3, 2023
- Data is now stored as JSON list
- It can be wiped after two months when no longer needed while keeping
  the changes metric around
- This heavily reduces table size and improves metrics performance

Fixes WeblateOrg#8119
nijel added a commit that referenced this issue Apr 3, 2023
- Data is now stored as JSON list
- It can be wiped after two months when no longer needed while keeping
  the changes metric around
- This heavily reduces table size and improves metrics performance

Fixes #8119
@github-actions
Copy link

github-actions bot commented Apr 3, 2023

Thank you for your report; the issue you have reported has just been fixed.

  • In case you see a problem with the fix, please comment on this issue.
  • In case you see a similar problem, please open a separate issue.
  • If you are happy with the outcome, don’t hesitate to support Weblate by making a donation.

@nijel
Copy link
Member Author

nijel commented Apr 5, 2023

I just did the migration for one of the larger sites: The original table was 90 GB data and 107 GB indexes, after migration it is 15 GB data and 8 GB indexes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Adding or requesting a new feature.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants