Framework for page deduplication #60

Treora · 2017-03-09T22:54:41Z

As discussed in issue #22.

A lot of code to do relatively little, but it provides the basic framework and a simple first implementation for checking whether two pages that were presumably retrieved from the same URL are the same document, completely different pages, or something in between (the more recent one presumably being an update to the other). A complex workaround for the lack of versioning in URLs.

Ideally we could tell directly whether a page is the same as last time or not, for example by checking the ETag HTTP header, if present. The code needed for such checks still has to be implemented, but the required boilerplate is included here (tryReidentifyPage in src/page-storage/store-page.js).
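As an illustration of the kind of check that could later be plugged in there, here is a minimal sketch and not the PR's implementation. The helper name etagsMatch, the etag field on the stored page doc, and the plain-object headers with lower-case keys are all assumptions made for the example.

```javascript
// Hypothetical helper (not part of the PR): decide from HTTP validators alone
// whether a freshly fetched response is the same document as a stored page.
// Assumption: the stored page doc keeps the ETag it was served with in an
// `etag` field, and `responseHeaders` is a plain object with lower-case keys.
function etagsMatch(storedPage, responseHeaders) {
    const oldTag = storedPage.etag
    const newTag = responseHeaders['etag']
    // Without both validators nothing can be concluded.
    if (!oldTag || !newTag) return false
    // Weak validators carry a W/ prefix. Strip it so that the weak and strong
    // forms of the same tag compare equal (a "weak comparison").
    const stripWeakPrefix = tag => tag.replace(/^W\//, '')
    return stripWeakPrefix(oldTag) === stripWeakPrefix(newTag)
}
```

A false result only means the headers were inconclusive; the retrospective content comparison would still have to run.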

If a page was not reidentified beforehand, a new page doc is created in the database, and the usual page analysis is run on it. After this the analysed page contents are compared against the candidate page (= the previous page we got from the URL), currently by a simple text comparison between the body.innerText and title of the two pages (see src/page-storage/sameness.js). A level of 'sameness' is determined (rather subjectively), and this level is then used for deciding what to do with the two (in src/page-storage/deduplication.js). Currently the two possible actions are to either keep them both as-is, or to forget the analysed contents of the older page and add a seeInstead field that redirects any reader to the new page; much like HTTP redirects. Which actions are taken for which sameness levels can be worked out in more detail later.
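The decision step described above can be sketched roughly as follows. This is a simplified illustration under assumptions: only Sameness.EXACTLY appears in the diff excerpt in this thread, so the other level names, their numeric ordering, the threshold, and the shape of the seeInstead value are hypothetical.

```javascript
// Hypothetical sameness levels, ordered from least to most similar.
// Only EXACTLY is confirmed by the diff excerpt; the rest are assumptions.
const Sameness = { UNRELATED: 0, MOSTLY: 1, OSTENSIBLY: 2, EXACTLY: 3 }

// Map a sameness level to one of the two actions mentioned in the description.
// The threshold chosen here is arbitrary and only for illustration.
function decideAction(sameness) {
    return sameness >= Sameness.OSTENSIBLY ? 'forgetOld' : 'keepBoth'
}

// Build the replacement for the old page doc: drop its analysed contents and
// leave a pointer to the new page. The { _id } shape is an assumption.
function forgetAndRedirect(oldPage, newPage) {
    return {
        _id: oldPage._id,
        url: oldPage.url,
        seeInstead: { _id: newPage._id },
    }
}
```

Keeping the decision separate from the comparison means the mapping from levels to actions can be refined later without touching the sameness detection.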

If anybody would like to share their thoughts on the sanity of the direction of this approach, I would be glad to hear them. There are some important design choices embedded in here. Regard this code as a first iteration though; I hope the approach will flesh out and evolve further over time.

obsidianart · 2017-03-10T13:44:59Z

src/page-storage/sameness.js

+        return Sameness.EXACTLY
+
+    const normaliseWhitespace = s => s.trim().replace(/\s+/g, ' ')
+    if (normaliseWhitespace(text1) === normaliseWhitespace(text2))


I'm surprised this check is needed and you don't get something like 0.99 in this case

Treora · 2017-03-11

The idea is that 'ostensibly' means that any changes are of no visible influence, or at least of no semantic influence. A single letter is considered a tiny but actual change to the content, but an extra space is regarded as unimportant, and in HTML it would normally not even appear on screen. Also for other data types, 'ostensibly the same' would mean that the rendered result is the same.

In any case, the tests here are largely intended as simple examples, to be worked out in more detail and for more data types another time.
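To make the distinction concrete, a small standalone sketch of the two-step text check follows. The function name compareText and the string labels it returns are hypothetical; the actual code returns Sameness levels.

```javascript
// Collapse runs of whitespace to a single space and trim the ends, so that
// differences that would not show up in rendered HTML are ignored.
const normaliseWhitespace = s => s.trim().replace(/\s+/g, ' ')

// Hypothetical comparison returning a label instead of a Sameness level.
function compareText(text1, text2) {
    // Byte-for-byte identical text.
    if (text1 === text2) return 'exactly'
    // Identical once whitespace differences are ignored.
    if (normaliseWhitespace(text1) === normaliseWhitespace(text2)) return 'ostensibly'
    // Anything else is an actual change, however small.
    return 'different'
}
```

With this, 'Hello  world\n' and 'Hello world' count as ostensibly the same, while 'Hello world' and 'Hello worle' count as different.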

obsidianart · 2017-03-10T13:55:25Z

It does indeed seem like a lot of code to do one thing. It might be worth doing for future development, but as of today my impression is that 50% of it is unused and you just end up either replacing a page or not. What is the final picture of this approach? What do you want to give the user?

Treora · 2017-03-11T14:57:53Z

> What is the final picture of this approach? What do you want to give the user?

I want to bring versioning to the web, and provide users with a model of webpages that matches the way people think about it. The view that one URL identifies one document is simply wrong. The current approach, that each time you dereference a URL you get a completely unrelated document, swings too far to the other side. We need something in between.

Today's newspaper frontpage is a different document than yesterday's, though they share the same URL. Today's view of a specific article may still be the same article however, even though the advertisements around it changed. And tomorrow the article may have had a small correction, making it a revision of the same document.

obsidianart · 2017-03-11T15:10:20Z

OK, it's a similar approach to the Internet Archive. Good idea.

Each revisit to a URL is expected to return a different page than last time. After having done page analysis however, we check if its contents are still the same as those of the previous page we got from this URL, and if they are, we consider it the same page, and deduplicate it by replacing the contents of the old one with a 'seeInstead' reference to the new one.

There are many things to be done still:
- The comparison is simply checking for equivalence of body.innerText and page title. Better sameness detection is a must.
- Sameness could often be determined beforehand if we check the ETag in the HTTP header, thus reducing the need for retrospective deduplication.
- The framework for expressing 'fuzzy sameness' is drafted, but not used.
- Visits could possibly be deduplicated too, e.g. not counting a page reload as a new visit.
- Probably a lot more...

When getting pages from the database, if their contents have been forgotten and been replaced by a 'seeInstead' redirect, we replace their docs in the result rows.

Downsides of this approach:
- Not-forgotten page metadata is not readable anymore.
- The row.doc does not match row.id/.value/.key. Those are not updated since the original id is still necessary to keep insertPagesIntoVisits working.
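The row replacement described in this commit message could look roughly like the sketch below. It is an illustration under assumptions, not the merged code: the function name followSeeInstead, the docsById lookup map, and the { _id } shape of the seeInstead value are hypothetical.

```javascript
// Hypothetical sketch: swap the doc of every redirected row for the doc it
// points to, leaving row.id, row.key and row.value as they were.
// rows: array of { id, key, value, doc }; docsById: Map from doc id to doc.
function followSeeInstead(rows, docsById) {
    return rows.map(row => {
        const target = row.doc && row.doc.seeInstead
        // Rows without a redirect are passed through untouched.
        if (!target) return row
        const replacement = docsById.get(target._id)
        // If the target cannot be found, keep the redirecting doc as-is.
        if (!replacement) return row
        // Only the doc is replaced, so row.doc._id no longer equals row.id.
        return Object.assign({}, row, { doc: replacement })
    })
}
```

This also makes the second downside visible: after the swap, row.doc._id differs from row.id, which is why callers should prefer row.id.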

obsidianart reviewed Mar 10, 2017


Treora mentioned this pull request Mar 11, 2017

Forget individual pages #21

Closed

Treora force-pushed the deduplication branch from d6b8242 to 6c3e5ac Compare March 11, 2017 14:29

Treora added 6 commits March 16, 2017 01:10

Use diff-match-patch instead of string-similarity.

e0412bf

Only follow redirects when desired.

b4fa5e5

Habituate to prefer row.id over row.doc._id.

0ae7227

Factor out page-storage, comments & tweaks.

2d8d314

Treora force-pushed the deduplication branch from 6c3e5ac to 2d8d314 Compare March 16, 2017 00:12

Treora merged commit 2d8d314 into master Mar 16, 2017

Treora deleted the deduplication branch March 16, 2017 22:11

Treora mentioned this pull request Mar 18, 2017

Deduplicate pages #22

Closed

poltak mentioned this pull request Apr 24, 2017

Import browser bookmarks #29

Closed

blackforestboi mentioned this pull request May 1, 2017

MTNI-281 ⁃ Store only one page object per URL WorldBrain/Memex#17

Closed
