As discussed in issue #22.
A lot of code to do relatively little, but it provides the basic framework and a simple first implementation for checking whether two pages that were presumably retrieved from the same URL are the same document, completely different pages, or something in between (the more recent one presumably being an update to the other). A complex workaround for the lack of versioning in URLs.
Ideally we could tell directly whether a page is the same as last time, for example by checking the ETag HTTP header, if present. The code needed for such checks still has to be implemented, but the required boilerplate is included here.
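For illustration, such a check might look roughly like the sketch below. This is not code from this PR; issuing a HEAD request is just one possible way to obtain the ETag, and `getStoredEtag` is a hypothetical helper:

```js
// Sketch only: compare a URL's current ETag against the one we saw last time.
async function looksUnchanged(url) {
    const response = await fetch(url, { method: 'HEAD' })
    const etag = response.headers.get('etag')
    if (!etag) {
        // No ETag available; we cannot tell, so fall back to
        // retrospective comparison after page analysis.
        return undefined
    }
    const previousEtag = await getStoredEtag(url) // hypothetical database lookup
    return previousEtag !== undefined && etag === previousEtag
}
```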
If a page was not reidentified beforehand, a new page doc is created in the database, and the usual page analysis is run on it. After this, the analysed page contents are compared against the candidate page (= the previous page we got from the URL), currently by a simple text comparison between the pages' body.innerText and their titles.
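For concreteness, the current comparison amounts to something like the following sketch; the exact field names on the page docs are my assumption here:

```js
// Sketch of the current sameness check: plain equality of the extracted
// body text and the page title. Field names are assumed for illustration.
function pagesAreTheSame(analysedPage, candidatePage) {
    return analysedPage.text === candidatePage.text
        && analysedPage.title === candidatePage.title
}
```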
If anybody would like to share their thoughts on the sanity of this direction, I would be glad to hear them. There are some important design choices embedded in here. Regard this code as a first iteration though; I hope the approach will flesh out and evolve further over time.
It does indeed seem like a lot of code to do one thing. It might be worth it for future development, but as of today my impression is that 50% of it is currently unused and you just end up either replacing a page or not. What is the final picture of this approach? What do you want to give the user?
I want to bring versioning to the web, and provide users with a model of webpages that matches the way people think about them. The view that one URL identifies one document is simply wrong. The current approach, where each time you dereference a URL you get a completely unrelated document, swings too far to the other side. We need something in between.
Today's newspaper frontpage is a different document than yesterday's, though they share the same URL. Today's view of a specific article may still be the same article, however, even though the advertisements around it changed. And tomorrow the article may have had a small correction, making it a revision of the same document.
Each revisit to a URL is expected to return a different page than last time. After having done page analysis, however, we check whether its contents are still the same as those of the previous page we got from this URL; if they are, we consider it the same page and deduplicate it by replacing the contents of the old one with a 'seeInstead' reference to the new one (see the sketch after this list). There are many things still to be done:

- The comparison simply checks for equivalence of body.innerText and the page title. Better sameness detection is a must.
- Sameness could often be determined beforehand by checking the ETag in the HTTP header, reducing the need for retrospective deduplication.
- The framework for expressing 'fuzzy sameness' is drafted, but not used.
- Visits could possibly be deduplicated too, e.g. not counting a page reload as a new visit.
- Probably a lot more...
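A rough sketch of that deduplication step, assuming a PouchDB-style API and guessing at the doc shape; treat it as illustrative rather than as the code in this PR:

```js
// Sketch: replace the old page doc's contents with a pointer to the new page.
// The doc shape and the PouchDB-style `db` are assumptions.
async function deduplicatePage(db, oldPageId, newPageId) {
    const oldPage = await db.get(oldPageId)
    await db.put({
        _id: oldPage._id,
        _rev: oldPage._rev,
        // Forget the duplicated contents; keep only a redirect to the new doc.
        seeInstead: { _id: newPageId },
    })
}
```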
When getting pages from the database, if their contents have been forgotten and replaced by a 'seeInstead' redirect, we replace their docs in the result rows (sketched below). Downsides of this approach:

- Not-forgotten page metadata is not readable anymore.
- The row.doc does not match row.id/.value/.key. Those are not updated, since the original id is still necessary to keep insertPagesIntoVisits working.
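Something like the following is what I mean by replacing the docs in the result rows; again a sketch under the same assumptions about the doc shape and database API:

```js
// Sketch: given rows from e.g. db.allDocs({include_docs: true}), follow
// 'seeInstead' redirects by swapping in the referenced doc, while leaving
// row.id/.key/.value untouched (insertPagesIntoVisits needs the original id).
async function resolveRedirects(db, rows) {
    return Promise.all(rows.map(async row => {
        if (row.doc && row.doc.seeInstead) {
            row.doc = await db.get(row.doc.seeInstead._id)
        }
        return row
    }))
}
```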