
Overlaps between Common Crawl Monthly Archives

Overlaps between monthly crawl archives are calculated and plotted as the Jaccard similarity of unique URLs or content digests. The cardinalities of the monthly crawls and of the union of two crawls are HyperLogLog estimates; see plot/overlap.py for details.
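
As a minimal sketch of the arithmetic (not the code in plot/overlap.py, which may differ), the Jaccard similarity can be derived from the three cardinalities by inclusion-exclusion: the intersection size is |A| + |B| - |A ∪ B|. The function name and the toy URL sets below are hypothetical; exact Python sets stand in for the HyperLogLog estimates.

```python
def jaccard_from_cardinalities(card_a, card_b, card_union):
    """Jaccard similarity of two sets from |A|, |B| and |A ∪ B|.

    The intersection size follows from inclusion-exclusion:
    |A ∩ B| = |A| + |B| - |A ∪ B|.
    """
    if card_union <= 0:
        return 0.0
    intersection = card_a + card_b - card_union
    # Estimated cardinalities can make the intersection slightly negative,
    # so clamp it at zero.
    return max(0.0, intersection) / card_union


# Toy data (made up for illustration): URLs seen in two crawls.
crawl_a = {"http://example.com/1", "http://example.com/2", "http://example.com/3"}
crawl_b = {"http://example.com/2", "http://example.com/3", "http://example.com/4"}

similarity = jaccard_from_cardinalities(
    len(crawl_a), len(crawl_b), len(crawl_a | crawl_b)
)
```

With real crawls, the three `len(...)` calls would be replaced by the cardinality estimates of the two crawls and of their merged sketch.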

URL overlap (Jaccard similarity) between Common Crawl monthly crawls

Content overlap between Common Crawl monthly crawls (Jaccard similarity on unique content digests)

Note that the content overlaps are small and of the same order of magnitude as the 1% error rate of the HyperLogLog cardinality estimates.
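
To illustrate why this matters, the following sketch propagates a relative error on each cardinality through the Jaccard formula. It treats the 1% error rate as a hard bound on every estimate, which is a simplifying assumption; the function name and the input numbers are made up for illustration and are not measured crawl sizes.

```python
def jaccard_bounds(card_a, card_b, card_union, rel_error=0.01):
    """Worst-case range of the Jaccard similarity when each cardinality
    may be off by +/- rel_error (assumed here to be a hard bound).

    Uses J = (|A| + |B|) / |A ∪ B| - 1, which is smallest when the
    numerator is underestimated and the denominator overestimated.
    """
    total = card_a + card_b
    low = total * (1 - rel_error) / (card_union * (1 + rel_error)) - 1
    high = total * (1 + rel_error) / (card_union * (1 - rel_error)) - 1
    # Clamp to the valid range of a Jaccard similarity.
    return max(0.0, low), min(1.0, high)


# Hypothetical cardinalities: two crawls of 3 billion items each with a
# union of 5.9 billion, giving a point estimate of roughly 0.017.
low, high = jaccard_bounds(3.0e9, 3.0e9, 5.9e9)
```

Under these assumptions the range spans from zero to more than twice the point estimate, so a small overlap of this size cannot be distinguished from no overlap at all.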