Improve host-level PageRanks #52
As explained in our blog post, our host-level PageRank is still very experimental and highly vulnerable to spam.
Here is a list of our current ideas to improve it. Feel free to contribute yours!
Going to URL-level PageRanks would obviously help a lot, but it is out of scope for this issue.
Sebastian from Common Crawl just did a very interesting first pass on spam in the dumps:
This script detects a few webspam clusters based on similarity in their domain names and PageRank values.
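
To make the idea concrete, here is a rough sketch of what "domain name and PageRank similarity" clustering could look like. This is not Sebastian's script. The function names, the digit-collapsing heuristic, and the thresholds (`min_size`, `rel_tol`) are all assumptions made for illustration, and a real pass over the dumps would likely need a smarter name-similarity measure.

```python
import re
from collections import defaultdict


def name_key(host):
    """Collapse digit runs so 'shop1.example' and 'shop22.example' share a key.

    Hypothetical heuristic: spam farms often register numbered variants
    of the same name.
    """
    return re.sub(r"\d+", "#", host.lower())


def find_suspect_clusters(ranks, min_size=3, rel_tol=1e-3):
    """Return groups of hosts with look-alike names and near-identical scores.

    ranks: dict mapping host name -> PageRank score (assumed input shape).
    min_size and rel_tol are illustrative thresholds, not tuned values.
    """
    by_key = defaultdict(list)
    for host, score in ranks.items():
        by_key[name_key(host)].append((score, host))

    clusters = []
    for members in by_key.values():
        members.sort()
        run = [members[0]]
        for item in members[1:]:
            prev_score = run[-1][0]
            # Extend the run while consecutive scores stay within tolerance.
            if abs(item[0] - prev_score) <= rel_tol * max(item[0], prev_score):
                run.append(item)
            else:
                if len(run) >= min_size:
                    clusters.append(sorted(h for _, h in run))
                run = [item]
        if len(run) >= min_size:
            clusters.append(sorted(h for _, h in run))
    return clusters


if __name__ == "__main__":
    sample = {
        "cheap-pills1.example": 0.004200,
        "cheap-pills2.example": 0.004200,
        "cheap-pills37.example": 0.004201,
        "news.example": 0.1,
        "blog.example": 0.05,
    }
    for cluster in find_suspect_clusters(sample):
        print(cluster)
```

Grouping by name first and only then comparing scores keeps the comparison cheap, since scores are only checked within each small name bucket rather than across all hosts.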