Framework for page deduplication #60
As discussed in issue #22.
A lot of code to do relatively little, but it provides the basic framework and a simple first implementation for checking whether two pages that were presumably retrieved from the same URL are the same document, completely different pages, or something in between (the more recent one presumably being an update to the other). In short, it is a complex workaround for the lack of versioning in URLs.
Ideally we could tell directly whether a page is the same as last time or not, for example by checking the ETag HTTP header, if present. The code needed for such checks still has to be implemented, but the required boilerplate is included here (
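To make the ETag idea concrete, here is a minimal sketch of what such a check might look like. It is not code from this PR: the function names and the choice to treat a missing header as "not reidentified" are assumptions for illustration. It uses the weak comparison defined for HTTP entity tags, where a `W/` prefix is ignored.

```python
from typing import Optional


def _opaque_tag(etag: str) -> str:
    """Strip whitespace and the weak-validator prefix (W/) from an ETag value."""
    etag = etag.strip()
    return etag[2:] if etag.startswith("W/") else etag


def reidentify_by_etag(previous_etag: Optional[str],
                       current_etag: Optional[str]) -> bool:
    """Hypothetical helper: True only if both ETags are present and match.

    A False result does not mean the page changed; it only means the
    ETag could not confirm that it is the same, so the slower content
    comparison would still have to run.
    """
    if not previous_etag or not current_etag:
        return False
    return _opaque_tag(previous_etag) == _opaque_tag(current_etag)
```

A `True` result would let us skip creating a new page doc; anything else falls through to the full analysis and comparison.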
If a page was not reidentified beforehand, a new page doc is created in the database, and the usual page analysis is run on it. After this the analysed page contents are compared against the candidate page (= the previous page we got from the URL), currently by a simple text comparison between the
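As a rough illustration of the three outcomes (same document, revision, different page), the comparison step could be sketched as below. This is an assumption-laden sketch, not the implementation in this PR: the function name, the labels, the word-level similarity ratio, and the `0.8` threshold are all hypothetical, and the PR itself currently uses a simpler text comparison.

```python
from difflib import SequenceMatcher


def compare_page_text(previous_text: str, current_text: str,
                      revision_threshold: float = 0.8) -> str:
    """Classify the new page relative to the candidate (previous) page.

    Returns "same", "revision", or "different". The threshold is an
    arbitrary illustrative value, not a tuned one.
    """
    if previous_text == current_text:
        return "same"
    # Compare word sequences rather than characters, so that small
    # wording changes count as a revision of the same document.
    ratio = SequenceMatcher(None, previous_text.split(),
                            current_text.split()).ratio()
    return "revision" if ratio >= revision_threshold else "different"
```

With a scheme like this, a "same" result could point the new visit at the existing page doc, while a "revision" result would link the new doc to the old one as its successor.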
If anybody would like to share their thoughts on the sanity of this approach and its direction, I would be glad to hear them. There are some important design choices embedded in here. Regard this code as a first iteration though; I hope the approach will be fleshed out and evolve further over time.
It does indeed seem like a lot of code to do one thing. It might be worth doing for future development, but as of today my impression is that 50% of it is currently unused and you just end up either replacing a page or not. What is the final picture of this approach? What do you want to give the user?
I want to bring versioning to the web, and provide users with a model of webpages that matches the way people think about them. The view that one URL identifies one document is simply wrong. The current approach, in which each time you dereference a URL you get a completely unrelated document, swings too far to the other side. We need something in between.
Today's newspaper front page is a different document from yesterday's, though they share the same URL. Today's view of a specific article may still be the same article, however, even though the advertisements around it have changed. And tomorrow the article may have had a small correction, making it a revision of the same document.