Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
How to avoid coverage drift? #118
This is more of an idea than an issue. If we include the checksum as the key for each file within the redis store, we would remove the need for the user clearing the store. For example:
This would also remove confusion around a files coverage not being accurate when the file has been changed but the store has not been cleared.
In order to avoid stale files hanging out forever within redis, we could also just supply a high default ttl and the stale files should automatically be cleaned up by redis.
interesting idea, we do have a ttl option but it defaults to forever... so something like 1 year as a high default would work...
I like the idea, I had been considering namespacing by deployed git hash for a while, but this in some ways is better as it doesn't impact files that weren't changed between deployments.
referenced this issue
Oct 8, 2018
I rephrased this as I think there are a couple different things that could be helpful.
One thing was that
While the checksum solution solves the drift issue, it does cause a slightly different trust issue. Basically, that a frequently changed file might no have a very long history showing usage... So it could look like it isn't used if it was just part of a not often run cron job or something like that.
I guess in a perfect world we would really have line number level MD5 and only throw out coverage when the actual line has moved / changed. This is not realistic.
Ideas to help with understanding the time range that the coverage data is valid for, and to avoid the drift.
update the UI so for any file you would have something like
In theory, one could actually users see all versions of a file and navigate back and forth... seeing coverage change over file history... over being able to generate a coverage report based on time range... Show me coverage data between date X and Y.
I had been considering similar ideas based on git hashes at time of deploy... so you could query coverage based on related git versions or release time... but I do think the MD5 version has advantages as it wouldn't be tied to any code repository or deployment methods.
The benchmarks you shared @kbaum make it seem reasonably fast... I think when first introduced it still probably makes sense to do as an optional feature.
Related: I believe one still might want to be able to clear all coverage data, as changing routes or controllers could mean downstream models no longer receive coverage... That being said if we add the metadata about coverage time alongside the MD5 hash, this is where coverage queries over time could be very powerful... Never clear, but query the time of interest and know that MD5 shows only hits on most recent versions of files.
Love the idea of tying the md5 change with updating the UI to display "first seen" and "coverage last updated". Showing previous versions of a file would also be a great feature though not a must have for the first release.
Seeing coverage data over time could be useful eventually but not sure it's worth the complexity it introduces. I would imagine the size of the data stored in redis would greatly increase since we would need to store snapshots of data.
I agree that one might want to clear coverage data but with the features discussed hear, it feels like the exception.
Agreed on avoiding complexity at first, I think MD5 and metadata but only showing the current version would be a good first version of this. Showing old versions introduces some complexity on the code viewing side of the report so that might not ever be worth adding.