Framework for page deduplication #60

merged 6 commits into from Mar 16, 2017

@Treora Treora commented Mar 9, 2017

As discussed in issue #22.

A lot of code to do relatively little, but it provides the basic framework and a simple first implementation for checking whether two pages that were presumably retrieved from the same URL are the same document, completely different pages, or something in between (the more recent one presumably being an update to the other). A complex workaround to the lack of versioning in URLs.

Ideally we could tell directly whether a page is the same as last time or not, for example by checking the ETag HTTP header, if present. The code needed for such checks still has to be implemented, but the required boilerplate is included here (tryReidentifyPage in src/page-storage/store-page.js).
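As an illustration of what such a check might look like, here is a minimal sketch that compares ETags. The function name `tryReidentifyByETag`, the `etag` field on the stored page, and the shape of `responseHeaders` are all assumptions made for this example; they are not taken from the code in this PR.

```javascript
// Hypothetical sketch: decide from HTTP headers alone whether a revisit
// returned the same page as last time. All field names are assumptions.
function tryReidentifyByETag(previousPage, responseHeaders) {
    const previousETag = previousPage && previousPage.etag
    const currentETag = responseHeaders['etag']

    // Without both ETags we cannot tell; the caller falls back to
    // comparing the analysed contents afterwards.
    if (!previousETag || !currentETag) return undefined

    // Weak validators (prefixed with W/) only promise semantic equivalence,
    // so this sketch trusts strong ETags only.
    if (previousETag.startsWith('W/') || currentETag.startsWith('W/')) {
        return undefined
    }

    // A matching strong ETag means the server considers the content unchanged.
    return previousETag === currentETag ? previousPage : undefined
}
```

Returning `undefined` here simply means "not reidentified", in which case the retrospective comparison described below would still run.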

If a page was not reidentified beforehand, a new page doc is created in the database, and the usual page analysis is run on it. After this the analysed page contents are compared against the candidate page (= the previous page we got from the URL), currently by a simple text comparison between the body.innerText and title of the two pages (see src/page-storage/sameness.js). A level of 'sameness' is determined (rather subjectively), and this level is then used for deciding what to do with the two (in src/page-storage/deduplication.js). Currently the two possible actions are to either keep them both as-is, or to forget the analysed contents of the older page and add a seeInstead field that redirects any reader to the new page; much like HTTP redirects. Which actions are taken for which sameness levels can be worked out in more detail later.
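The decision step could be sketched roughly as follows. The level names other than `EXACTLY`, the threshold, and the shape of the page docs are assumptions for illustration only; the actual mapping from sameness levels to actions is still to be worked out.

```javascript
// Hypothetical sameness levels, ordered from least to most alike.
// Only EXACTLY appears in this PR; the other names are placeholders.
const SAMENESS_ORDER = ['UNRELATED', 'PARTLY', 'OSTENSIBLY', 'EXACTLY']

// Decide what becomes of the older page. Returns the doc that should be
// stored for it; this sketch does not touch any database.
function decideDeduplication({ oldPage, newPage, sameness }) {
    const level = SAMENESS_ORDER.indexOf(sameness)
    const threshold = SAMENESS_ORDER.indexOf('OSTENSIBLY')

    if (level >= threshold) {
        // Forget the analysed contents of the old page and redirect
        // readers to the new one, much like an HTTP redirect.
        return {
            _id: oldPage._id,
            url: oldPage.url,
            seeInstead: { _id: newPage._id },
        }
    }

    // Not alike enough (or unknown level): keep both pages as they are.
    return oldPage
}
```

Keeping the old doc's `_id` and `url` while dropping the rest is one possible reading of "forgetting the analysed contents"; which fields survive is a design choice.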

If anybody would like to share their thoughts on the sanity of the direction of this approach, I would be glad to hear them. There are some important design choices embedded in here. Regard this code as a first iteration though; I hope the approach will flesh out and evolve further over time.

return Sameness.EXACTLY

const normaliseWhitespace = s => s.trim().replace(/\s+/g, ' ')
if (normaliseWhitespace(text1) === normaliseWhitespace(text2))

obsidianart Mar 10, 2017

I'm surprised this check is needed and you don't get something like 0.99 in this case

Treora Mar 11, 2017
Author Collaborator

The idea is that 'ostensibly' means that any changes are of no visible influence, or at least of no semantic influence. A single letter is considered a tiny but actual change to the content, but an extra space is regarded unimportant, and in html it would normally not even appear on screen. Also for other data types, 'ostensibly the same' would mean that the rendered result is the same.

In any case, the tests here are largely intended as simple examples, to be worked out in more detail and for more data types another time.
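To make the distinction concrete, here is a small sketch of a text comparison along these lines. `compareText` and the `DIFFERENT` level are placeholders invented for this example; the real code may use other names and more fine-grained levels.

```javascript
// Placeholder levels for this sketch; DIFFERENT stands in for every
// level below 'ostensibly the same'.
const Sameness = {
    EXACTLY: 'EXACTLY',
    OSTENSIBLY: 'OSTENSIBLY',
    DIFFERENT: 'DIFFERENT',
}

// Collapse any run of whitespace into a single space and trim the ends.
const normaliseWhitespace = s => s.trim().replace(/\s+/g, ' ')

function compareText(text1, text2) {
    // Identical strings.
    if (text1 === text2) return Sameness.EXACTLY
    // Only whitespace differs: no visible change once rendered as HTML.
    if (normaliseWhitespace(text1) === normaliseWhitespace(text2)) {
        return Sameness.OSTENSIBLY
    }
    // Any other change, even a single letter, counts as a content change.
    return Sameness.DIFFERENT
}
```

With this sketch, `compareText('a  b\n', 'a b')` yields `OSTENSIBLY`, whereas `compareText('a b', 'a c')` yields `DIFFERENT`.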

@obsidianart obsidianart commented Mar 10, 2017

It seems indeed a lot of code to do one thing. It might be worth doing it for future development but as for today my impression is that 50% of it is currently unused and you just end up either replacing a page or not. What is the final picture of this approach? What do you want to give the user?

@Treora Treora mentioned this pull request Mar 11, 2017
@Treora Treora force-pushed the deduplication branch from d6b8242 to 6c3e5ac Mar 11, 2017
Collaborator Author

@Treora Treora commented Mar 11, 2017

> What is the final picture of this approach? What do you want to give the user?

I want to bring versioning to the web, and provide users with a model of webpages that matches the way people think about it. The view that one URL identifies one document is simply wrong. The current approach, that each time you dereference a URL you get a completely unrelated document, swings too far to the other side. We need something in between.

Today's newspaper frontpage is a different document than yesterday's, though they share the same URL. Today's view of a specific article may still be the same article however, even though the advertisements around it changed. And tomorrow the article may have had a small correction, making it a revision of the same document.

@obsidianart obsidianart commented Mar 11, 2017

ok, it's a similar approach to the internet archive. Good idea.

Treora added 6 commits Mar 3, 2017
Each revisit to a URL is expected to return a different page than last time.
After having done page analysis however, we check if its contents are still
the same as those of the previous page we got from this URL, and if they
are, we consider it the same page, and deduplicate it by replacing the
contents of the old one with a 'seeInstead' reference to the new one.

There are many things to be done still:
- The comparison is simply checking for equivalence of body.innerText and
  page title. Better sameness detection is a must.
- Sameness could often be determined beforehand if we check the ETag in the
  HTTP header, thus reducing the need for retrospective deduplication.
- The framework for expressing 'fuzzy sameness' is drafted, but not used.
- Visits could possibly be deduplicated too, e.g. not counting a page reload
  as a new visit.
- Probably a lot more...

When getting pages from the database, if their contents have been forgotten
and been replaced by a 'seeInstead' redirect, we replace their docs in the
result rows.

Downsides of this approach:
- Not-forgotten page metadata is not readable anymore.
- The row.doc does not match Those are not updated since
  the original id is still necessary to keep insertPagesIntoVisits working.
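
The redirect resolution described in this commit message might look roughly like the sketch below. The row shape (`id`, `key`, `doc`), the `docsById` lookup, and the `maxHops` guard are assumptions for illustration, not the code from this PR.

```javascript
// Hypothetical sketch: replace forgotten page docs in query result rows
// with the docs their 'seeInstead' fields point to.
// rows: [{ id, key, doc }]; docsById: Map from page id to page doc.
function resolveRedirects(rows, docsById, maxHops = 10) {
    return rows.map(row => {
        let doc = row.doc
        // Follow chains of redirects, with a hop limit to guard against loops.
        for (let hops = 0; doc && doc.seeInstead && hops < maxHops; hops++) {
            const target = docsById.get(doc.seeInstead._id)
            if (!target) break // dangling redirect: keep what we have
            doc = target
        }
        // Only row.doc is replaced; row.id and row.key keep the original id.
        return { ...row, doc }
    })
}
```

Leaving `row.id` and `row.key` untouched mirrors the second downside listed above: the row then no longer agrees with the doc it carries.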
@Treora Treora force-pushed the deduplication branch from 6c3e5ac to 2d8d314 Mar 16, 2017
@Treora Treora merged commit 2d8d314 into master Mar 16, 2017
@Treora Treora deleted the deduplication branch Mar 16, 2017
@Treora Treora mentioned this pull request Mar 18, 2017
@poltak poltak mentioned this pull request Apr 24, 2017