`commit` storage is a tad wasteful

> [!NOTE]
> Migrated from [augurlabs/augur#2065](https://github.com/augurlabs/augur/issues/2065)
> Originally opened by `@mscherer` on 2022-12-08

---

We are redeploying the augur server we use in RH OSPO. In order to make sure it is done right, I took a closer look at the schema, and what was using space. I found several potential improvements (like using bool instead of small int, etc), but I also found that the schema is using 1 entry in the commit table per modified files in a commit.

In turn, this mean that [this commit](https://github.com/openjdk/jdk10/commit/3789983e89c9de252ef546a1b98a732a7d066650) who modify 50 000 files and present in 43 repositories (as openjdk copy repos for each version) appear 2 millions time. This [one appear 1 million time](https://github.com/apache/hadoop-site/commit/c41c790a92ff858e9e35c0ac862906e7b7b34c56) in the database, that's a change from http to https on hadoop website.

So for example, the email of the committer of the 1st commit is duplicated 6 millions time in the DB (2 millions as commiter, 2 millions as raw_commiter email, and 2 millions for a 3rd column).  Based on a quick estimation of the size of 1 entry (around 180 bytes), the duplication mean that the DB is 360 Mbytes bigger just with this one commit.

And while that's the worst case, there is several others:

```
sandiego=# select count(*), cmt_commit_hash   from commits group by cmt_commit_hash order by count(*) desc limit 30;
  count  |             cmt_commit_hash              
---------+------------------------------------------
 2732304 | 3789983e89c9de252ef546a1b98a732a7d066650
 1078684 | c41c790a92ff858e9e35c0ac862906e7b7b34c56
  892404 | 319a3b994703aac84df7bcde272adfcb3cdbbbf0
  890190 | fe008ae27a7fe6b5ceb3ab903c2b3302767ce1b7
  628378 | 3bc69d393d3e0cb89afc7a86a001e2fbfbd7ac93
  424374 | a4f5d9d960afde5884e478d945741fc32e45a788
  323472 | 733027052103ccce02f33e408ccebf232a917594
  308844 | 6d6dcdbfa377393306bf79585f61baea524ac124
  288996 | 3dd0c30674d50e08bcf4622847e3ac2e413d2085
  238188 | 987c3a2d874fae6cfedf90d5bc920b66c275098a
  189600 | 657b9db1ba9be6520dbd1086702e64746feafd34
  186800 | 8d04e5aec5fc118ccb82910a1af82bfc4f57a398
  179982 | ba7c173659897a68a56defa032e5536defc61768
  172104 | a82f3e0256eace61ac66c344edff8b2d35d81331
  160836 | b9b5a60049c3d3f3c6479c2037a7b78afbc89779
  160836 | 35865ef3dd98c56290eb9d98ae9b0b3737caa224
  160542 | f57b87e8f6b8ca9e3c1e640dca51c2a18ee1ef6a
  159678 | bd4f4be9d7e1336de4ad13287801783a8a0dc13b
  158884 | e48cfc709dbd811ded123e61fbbb2c74e79a7ca1
  158539 | 121125c0f86bde236a1d22b855c5d4bd236ce63b
  158458 | 31c158a95db33f92dddee035cbc8fec2fd78592f
  156276 | 8153779ad32d1e8ddd37ced826c76c7aafc61894
  156000 | bc113af571d331e174668ab25a5649835b701e5e
  154720 | d07db25e2a8fbe78791bcaf960469111f379ee00
  148606 | 08850bd5c4f46d0b2e0eec64d9ca793a3b75fce5
  145422 | 2aa71b1071bcbf45ad8400dfaf33257e3d0b9d2d
  145368 | 71315ac3a3258881e14aa07d0028a483fd3900d5
  134154 | c420248b9b459efcfbd3657170d9be0b96b5fb38
  132720 | 7d969ccd5589f4b07aea72165bee38b405a10cab
  127282 | 59ac4c2629189eee5921fe024a49d54345987bea
```

Most seems to be caused by the openjdk policy of forking their own repository. 

So, this likely also cause trouble on the indexes, since the table is bigger than it should. 

If there is a need on a per file information, it should likely be in a separate table so the information can be deduplicated.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`commit` storage is a tad wasteful #17

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

commit storage is a tad wasteful #17

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`commit` storage is a tad wasteful #17