We are redeploying the augur server we use in RH OSPO. In order to make sure it is done right, I took a closer look at the schema, and what was using space. I found several potential improvements (like using bool instead of small int, etc), but I also found that the schema is using 1 entry in the commit table per modified files in a commit.
In turn, this mean that this commit who modify 50 000 files and present in 43 repositories (as openjdk copy repos for each version) appear 2 millions time. This one appear 1 million time in the database, that's a change from http to https on hadoop website.
So for example, the email of the committer of the 1st commit is duplicated 6 millions time in the DB (2 millions as commiter, 2 millions as raw_commiter email, and 2 millions for a 3rd column). Based on a quick estimation of the size of 1 entry (around 180 bytes), the duplication mean that the DB is 360 Mbytes bigger just with this one commit.
And while that's the worst case, there is several others:
sandiego=# select count(*), cmt_commit_hash from commits group by cmt_commit_hash order by count(*) desc limit 30;
count | cmt_commit_hash
---------+------------------------------------------
2732304 | 3789983e89c9de252ef546a1b98a732a7d066650
1078684 | c41c790a92ff858e9e35c0ac862906e7b7b34c56
892404 | 319a3b994703aac84df7bcde272adfcb3cdbbbf0
890190 | fe008ae27a7fe6b5ceb3ab903c2b3302767ce1b7
628378 | 3bc69d393d3e0cb89afc7a86a001e2fbfbd7ac93
424374 | a4f5d9d960afde5884e478d945741fc32e45a788
323472 | 733027052103ccce02f33e408ccebf232a917594
308844 | 6d6dcdbfa377393306bf79585f61baea524ac124
288996 | 3dd0c30674d50e08bcf4622847e3ac2e413d2085
238188 | 987c3a2d874fae6cfedf90d5bc920b66c275098a
189600 | 657b9db1ba9be6520dbd1086702e64746feafd34
186800 | 8d04e5aec5fc118ccb82910a1af82bfc4f57a398
179982 | ba7c173659897a68a56defa032e5536defc61768
172104 | a82f3e0256eace61ac66c344edff8b2d35d81331
160836 | b9b5a60049c3d3f3c6479c2037a7b78afbc89779
160836 | 35865ef3dd98c56290eb9d98ae9b0b3737caa224
160542 | f57b87e8f6b8ca9e3c1e640dca51c2a18ee1ef6a
159678 | bd4f4be9d7e1336de4ad13287801783a8a0dc13b
158884 | e48cfc709dbd811ded123e61fbbb2c74e79a7ca1
158539 | 121125c0f86bde236a1d22b855c5d4bd236ce63b
158458 | 31c158a95db33f92dddee035cbc8fec2fd78592f
156276 | 8153779ad32d1e8ddd37ced826c76c7aafc61894
156000 | bc113af571d331e174668ab25a5649835b701e5e
154720 | d07db25e2a8fbe78791bcaf960469111f379ee00
148606 | 08850bd5c4f46d0b2e0eec64d9ca793a3b75fce5
145422 | 2aa71b1071bcbf45ad8400dfaf33257e3d0b9d2d
145368 | 71315ac3a3258881e14aa07d0028a483fd3900d5
134154 | c420248b9b459efcfbd3657170d9be0b96b5fb38
132720 | 7d969ccd5589f4b07aea72165bee38b405a10cab
127282 | 59ac4c2629189eee5921fe024a49d54345987bea
Most seems to be caused by the openjdk policy of forking their own repository.
So, this likely also cause trouble on the indexes, since the table is bigger than it should.
If there is a need on a per file information, it should likely be in a separate table so the information can be deduplicated.
Note
Migrated from augurlabs/augur#2065
Originally opened by
@mschereron 2022-12-08We are redeploying the augur server we use in RH OSPO. In order to make sure it is done right, I took a closer look at the schema, and what was using space. I found several potential improvements (like using bool instead of small int, etc), but I also found that the schema is using 1 entry in the commit table per modified files in a commit.
In turn, this mean that this commit who modify 50 000 files and present in 43 repositories (as openjdk copy repos for each version) appear 2 millions time. This one appear 1 million time in the database, that's a change from http to https on hadoop website.
So for example, the email of the committer of the 1st commit is duplicated 6 millions time in the DB (2 millions as commiter, 2 millions as raw_commiter email, and 2 millions for a 3rd column). Based on a quick estimation of the size of 1 entry (around 180 bytes), the duplication mean that the DB is 360 Mbytes bigger just with this one commit.
And while that's the worst case, there is several others:
Most seems to be caused by the openjdk policy of forking their own repository.
So, this likely also cause trouble on the indexes, since the table is bigger than it should.
If there is a need on a per file information, it should likely be in a separate table so the information can be deduplicated.