Improve host-level PageRanks #52

Open
sylvinus opened this Issue Jul 31, 2016 · 1 comment

sylvinus commented Jul 31, 2016

As explained in our blog post, our host-level PageRank is still very experimental and remains highly vulnerable to spam.

Here is a list of our current ideas to improve it. Feel free to contribute yours!

  • Don't follow rel=nofollow links
  • Better weights on the edges (treat links between subdomains differently? give less weight to links in the boilerplate and/or at the end of the page? give more weight depending on the number of distinct pages linking to the domain?)
  • Try to group domains belonging to the same owner (by IP address/DNS info? See #15)

Going to URL-level PageRanks would obviously help a lot, but it is out of scope for this issue.
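
To make the first two ideas in the list above concrete, here is a minimal sketch of a host-level PageRank that skips `rel=nofollow` links and accepts a per-edge weight. It is not the project's actual implementation: the function name `host_pagerank`, the `(src_host, dst_host, nofollow, weight)` tuple layout, and the damping and iteration defaults are all assumptions made for illustration.

```python
from collections import defaultdict


def host_pagerank(links, damping=0.85, iterations=50):
    """Weighted host-level PageRank by power iteration (illustrative sketch).

    links: iterable of (src_host, dst_host, nofollow, weight) tuples.
    This tuple layout is hypothetical, not the project's data format.
    """
    out = defaultdict(lambda: defaultdict(float))
    hosts = set()
    for src, dst, nofollow, weight in links:
        hosts.update((src, dst))
        # Skip nofollow links, self-links and non-positive weights.
        if nofollow or src == dst or weight <= 0:
            continue
        out[src][dst] += weight

    n = len(hosts)
    rank = {h: 1.0 / n for h in hosts}
    for _ in range(iterations):
        # Hosts without usable outlinks spread their rank uniformly.
        dangling = sum(rank[h] for h in hosts if h not in out)
        new = {h: (1 - damping) / n + damping * dangling / n for h in hosts}
        for src, targets in out.items():
            total = sum(targets.values())
            for dst, w in targets.items():
                # Each outlink receives a share proportional to its weight.
                new[dst] += damping * rank[src] * w / total
        rank = new
    return rank


links = [
    ("a.example", "b.example", False, 1.0),
    ("b.example", "a.example", False, 1.0),
    ("c.example", "b.example", False, 1.0),
    ("spam.example", "a.example", True, 1.0),  # nofollow: ignored
]
ranks = host_pagerank(links)
```

The weight could come from any of the heuristics listed above, for example a lower value for links found in boilerplate. How to choose those values is exactly the open question of this issue.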

sylvinus commented Aug 22, 2016

Sebastian from Common Crawl just did a very interesting first pass on spam in the dumps:
https://gist.github.com/sebastian-nagel/beb244bf1f7092a06a1479335a5e268b

This script is able to detect a few webspam clusters based on the similarity of their domain names and PageRank values.
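
As a rough illustration of that idea (and not a description of what the linked gist actually does), the sketch below sorts hosts by rank and groups neighbours whose rank is nearly identical and whose names look alike. The function name `spam_candidates`, the thresholds, and the use of `difflib.SequenceMatcher` for name similarity are all assumptions.

```python
from difflib import SequenceMatcher


def spam_candidates(ranks, rank_tol=1e-9, name_sim=0.8, min_size=3):
    """Group hosts with near-identical rank and similar names (sketch).

    ranks: dict mapping host name to its PageRank value.
    All thresholds are hypothetical and would need tuning on real data.
    """
    # Sort by rank, then by name, so similar hosts end up adjacent.
    ordered = sorted(ranks.items(), key=lambda kv: (kv[1], kv[0]))
    clusters, current = [], []
    for host, rank in ordered:
        if current:
            prev_host, prev_rank = current[-1]
            same_rank = abs(rank - prev_rank) <= rank_tol
            similar = SequenceMatcher(None, host, prev_host).ratio() >= name_sim
            if not (same_rank and similar):
                # Close the running group; keep it only if large enough.
                if len(current) >= min_size:
                    clusters.append([h for h, _ in current])
                current = []
        current.append((host, rank))
    if len(current) >= min_size:
        clusters.append([h for h, _ in current])
    return clusters


example_ranks = {
    "cheap-pills-1.example": 1e-6,
    "cheap-pills-2.example": 1e-6,
    "cheap-pills-3.example": 1e-6,
    "wikipedia.org": 0.01,
    "github.com": 0.02,
}
clusters = spam_candidates(example_ranks)
```

Comparing only adjacent hosts keeps the pass linear after sorting, at the cost of missing clusters whose members are interleaved with unrelated hosts of the same rank.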

sylvinus added a commit that referenced this issue Aug 25, 2016

Exclude nofollow links from the webgraph (#52), rename "datasources" …
…to "dataproviders" to avoid confusion with document sources, and other smaller refactors