Reduce metrics disk usage #8119
Wouldn't that increase data size? Storing a number as an integer takes less space than storing it in JSON as a string. Alternative solutions:
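The concern above can be checked with a quick back-of-the-envelope sketch: individual values do get bigger as JSON text, but packing a whole day into one document avoids repeating the shared fields 30 times. The field names below (`scope`, `relation`, `name`) are illustrative, not the actual Weblate schema:

```python
import json

# Hypothetical comparison of serialized sizes: one value per row (shared
# fields repeated for each of ~30 metrics) versus one packed document per day.
per_row = len(json.dumps({"date": "2022-01-01", "scope": 1, "relation": 2,
                          "name": "translated", "value": 1234}))
packed = len(json.dumps({"date": "2022-01-01", "scope": 1, "relation": 2,
                         "values": [1234] * 30}))
thirty_rows = per_row * 30  # 30 separate rows carry the shared fields 30 times
print(thirty_rows, packed)
```

The packed form wins even before considering index overhead, since the date and key columns appear once instead of 30 times.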
That's true, but there are additional fields which are now duplicated, because nearly 30 values can be stored for each object. So the choice really is between duplicating the date and three int columns 30 times, or storing them once and accepting the JSON overhead. The change will definitely decrease the index size, as there will be 1/30 of the rows and one less column to index. Still, this needs proper benchmarking before making the change, which is why it is still being postponed ;-).
We need to differentiate between no metric calculated and a calculated value of 0. So even the 0 needs to be stored; otherwise it would have to be calculated again.
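The distinction maps naturally onto `None` versus `0`. A minimal sketch (a hypothetical helper, not Weblate's actual API):

```python
# A stored 0 is a valid result; an absent entry means "never calculated"
# and should trigger recalculation.
stored = {"2022-01-01": 0}  # zero was calculated and stored

def get_metric(day):
    # None means "not calculated yet"; 0 is a real, already-calculated value.
    return stored.get(day)

assert get_metric("2022-01-01") == 0      # valid value, no recalculation needed
assert get_metric("2022-01-02") is None   # missing, needs to be calculated
```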
That would be an optimization as well. After 60 days, we really only need the daily metrics to calculate the monthly ones. Presently, the monthly ones are stored only in the cache, which is why we have to keep the daily ones around, as they might disappear from the cache.
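The roll-up being described is a plain sum over days. A minimal sketch, with made-up data (once the monthly totals are persisted, the per-day rows older than ~60 days could be dropped):

```python
from collections import defaultdict

# Hypothetical daily values for one metric: 31 days of January, value 10 each.
daily = {f"2022-01-{d:02d}": 10 for d in range(1, 32)}

# Roll daily values up into monthly totals keyed by "YYYY-MM".
monthly = defaultdict(int)
for day, value in daily.items():
    monthly[day[:7]] += value

assert monthly["2022-01"] == 310
```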
Can you please share table and index sizes? For my test environment, the table data is 16 GB, and the two indexes 6 GB + 18 GB.
It is 5.3 GB for data and 4.2 GB for indexes. It is MySQL. BTW: it would also be a great improvement to store stats only if they change. I guess in many cases these stats are the same for each day, so this table is full of duplicated stats stored for each day individually.
Okay, you actually made me do the benchmark. First, I vacuumed the tables to make sure I'm measuring actual usage.

- Current metrics: data 16 GB, primary key 6 GB, unique index 11 GB
- Moved to JSON: data 7 GB, primary key 1 GB, unique index 1 GB
- Moved to JSON with string keys: data 10 GB, primary key 1 GB, unique index 1 GB

So it almost matches my expectations – you save something on the data, but a lot on the indexes. The code is here in case you want to try it out: #8563
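The row-count side of the benchmark is easy to reproduce with in-memory SQLite. This is only an illustrative sketch (column names are made up, not the Weblate schema) showing the 30-to-1 row reduction that drives the index savings:

```python
import json
import sqlite3

# One table with one row per metric per day, one with a single JSON row per day.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE per_row (day TEXT, scope INT, relation INT, name TEXT, value INT)")
db.execute("CREATE TABLE per_day (day TEXT, scope INT, relation INT, data TEXT)")

for day in range(365):
    values = list(range(30))  # 30 dummy metric values for this day
    for i, value in enumerate(values):
        db.execute("INSERT INTO per_row VALUES (?, ?, ?, ?, ?)",
                   (f"day-{day}", 1, 2, f"metric_{i}", value))
    db.execute("INSERT INTO per_day VALUES (?, ?, ?, ?)",
               (f"day-{day}", 1, 2, json.dumps(values)))

rows_per_row = db.execute("SELECT COUNT(*) FROM per_row").fetchone()[0]
rows_per_day = db.execute("SELECT COUNT(*) FROM per_day").fetchone()[0]
print(rows_per_row, rows_per_day)  # 30x fewer rows with the JSON layout
```

Fewer rows means smaller primary-key and unique indexes, which is where the benchmark above shows the biggest win.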
This will always be a trade-off between the storage needed and the complexity of working with the metrics. My current preference is to keep the code simple – once the metric is in the database, it can be used; if not, it needs to be calculated.
Do you have plans to try storing the metrics in 27 separate int fields instead of JSON? That should cut the data size even more (assuming that you always store stats for all these fields) and may simplify some calculations.
It will be stored in JSON – metrics can be added in the future, and changing the structure of this table takes ages on big sites. But storing them as a list of integers is definitely an option, see https://github.com/WeblateOrg/weblate/pull/8563/files#diff-30660b786ec366973fcef283c4f9531b57361cfc72a774d73b443fdfa25eecc9R116-R121
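The list-of-integers idea works by giving each metric a fixed position, so a whole day fits in one JSON list without repeating the key names. A minimal sketch (the metric names and order here are illustrative, not the actual ordering from the linked PR):

```python
# Hypothetical fixed ordering; each metric name maps to a list position.
METRIC_ORDER = ("all", "translated", "fuzzy", "failing")

def encode(metrics: dict) -> list:
    # Pack the values into a list following the fixed order.
    return [metrics[name] for name in METRIC_ORDER]

def decode(data: list) -> dict:
    # Restore names by zipping the stored list back with the ordering.
    return dict(zip(METRIC_ORDER, data))

day = {"all": 1000, "translated": 900, "fuzzy": 50, "failing": 5}
assert encode(day) == [1000, 900, 50, 5]
assert decode(encode(day)) == day
```

New metrics can then be appended to the end of the ordering without rewriting old rows, which addresses the schema-change concern.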
- Data is now stored as a JSON list
- It can be wiped after two months when no longer needed, while keeping the changes metric around
- This heavily reduces table size and improves metrics performance

Fixes WeblateOrg#8119
Thank you for your report; the issue you have reported has just been fixed.
I just did the migration for one of the larger sites: the original table was 90 GB of data and 107 GB of indexes; after migration it is 15 GB of data and 8 GB of indexes.
Describe the problem
Currently, the metrics and related indexes account for most of the database usage. That is not really reasonable for a feature that displays trends and summaries.
Describe the solution you'd like
The database usage should be reduced. The way to go might be to reduce the number of rows by storing all data for a single day in a JSONField instead of spreading it across ~20 rows. This will also heavily reduce the index size, as there will be one less column to index.
Describe alternatives you've considered
No response
Screenshots
No response
Additional context
No response