Biasing the data toward front pages makes it less useful than it could be. Front pages often differ from deep pages.
I realize that including deep pages would balloon the size of the data, but it has been done before (e.g. http://dotnetdotcom.org/).