Overlaps between Common Crawl Monthly Archives
Overlaps between monthly crawl archives are calculated and plotted as the Jaccard similarity of unique URLs or content digests. The cardinalities of the individual monthly crawls and of the union of two crawls are HyperLogLog estimates; see plot/overlap.py for details.
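As a rough illustration of how a Jaccard similarity can be derived from cardinalities alone, the sketch below uses inclusion-exclusion to obtain the intersection size from the two crawl cardinalities and the cardinality of their union. The function name and the clamping of negative intersections are assumptions made for this example; plot/overlap.py remains the authoritative implementation and may differ in detail.

```python
def jaccard_from_cardinalities(card_a: float, card_b: float, card_union: float) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| from three (estimated) cardinalities.

    Hypothetical helper for illustration, not taken from plot/overlap.py.
    """
    if card_union <= 0:
        return 0.0
    # Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|
    intersection = card_a + card_b - card_union
    # With estimated cardinalities the difference can come out slightly
    # negative; clamping to zero is an assumption of this sketch.
    intersection = max(intersection, 0.0)
    return intersection / card_union
```

For example, two crawls of 100 items each whose union holds 150 items share 50 items, which gives a Jaccard similarity of 50/150.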
Note that the content overlaps are small, of the same order of magnitude as the 1% error rate of the HyperLogLog cardinality estimates.
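To see why a small overlap is hard to distinguish from estimation noise, the following sketch propagates a relative error through the Jaccard computation. It treats the 1% error rate as a hard bound on each of the three cardinalities, which is a simplification, and all names and numbers are hypothetical rather than actual crawl statistics.

```python
def jaccard_bounds(card_a, card_b, card_union, rel_error=0.01):
    """Lowest and highest Jaccard value reachable if every cardinality
    may be off by up to rel_error (hypothetical helper, illustration only)."""

    def jaccard(a, b, u):
        # Inclusion-exclusion for the intersection, clamped at zero.
        return max(a + b - u, 0.0) / u

    # The similarity grows with the crawl sizes and shrinks with the union,
    # so the extremes are reached at opposite corners of the error box.
    low = jaccard(card_a * (1 - rel_error), card_b * (1 - rel_error),
                  card_union * (1 + rel_error))
    high = jaccard(card_a * (1 + rel_error), card_b * (1 + rel_error),
                   card_union * (1 - rel_error))
    return low, min(high, 1.0)


# Made-up example: two crawls of 3 billion digests each with a nominal
# overlap of about 1% (60 million shared digests).
low, high = jaccard_bounds(3.0e9, 3.0e9, 5.94e9)
```

Under these assumptions the nominal similarity of about 0.01 is compatible with anything from 0 to roughly 0.03, so differences between small content overlaps should be read with caution.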