Determine key features in diffs that could be used for filtration #51

Closed
suchthis opened this issue Jun 23, 2017 · 7 comments

@suchthis
Contributor

No description provided.

@danielballan
Contributor

@janakrajchadha When you decide you've looked at enough diffs to get an initial sense of their variety, can you put some remarks here?

@janakrajchadha

@danielballan Well, I can never be too sure if I've looked at enough diffs. However, I'll add interesting observations here.
Here are a few:
Most of the date changes follow a recognizable pattern:

  • Month: for example, Jun to Aug
  • Days: 17 to 25
  • Years: 1997 to 1998
  • HTML table cells containing any one of the aforementioned diffs, for example <td class="c" id="displayDayEl" style="width:34px;font-size:24px;" title="You are here: 11:19:49 Feb 16, 2017">16 changing to <td class="c" id="displayDayEl" style="width:34px;font-size:24px;" title="You are here: 07:36:25 Feb 18, 2017">18
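
A rough way to test for the pattern above is to mask every date-like token on both sides of a diff and check whether anything else changed. This is only a sketch under stated assumptions: the function names and the token patterns are hypothetical, and they cover only the formats listed here (abbreviated or full English month names, day numbers, four-digit years, and HH:MM:SS times).

```python
import re

# Hypothetical starter patterns; real pages will need more formats over time.
_MONTH = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)
# Order matters: try times first, then month names, then years, then bare days.
_DATE_TOKEN = re.compile(
    r"\b(?:\d{1,2}:\d{2}(?::\d{2})?|" + _MONTH + r"|(?:19|20)\d{2}|\d{1,2})\b"
)


def mask_dates(text: str) -> str:
    """Replace every date-like token with a fixed placeholder."""
    return _DATE_TOKEN.sub("<DATE>", text)


def is_date_only_change(old: str, new: str) -> bool:
    """True if old and new differ, but only in date-like tokens."""
    return old != new and mask_dates(old) == mask_dates(new)


old = ('<td class="c" id="displayDayEl" style="width:34px;font-size:24px;" '
       'title="You are here: 11:19:49 Feb 16, 2017">16')
new = ('<td class="c" id="displayDayEl" style="width:34px;font-size:24px;" '
       'title="You are here: 07:36:25 Feb 18, 2017">18')
print(is_date_only_change(old, new))
```

Because bare one- and two-digit numbers are masked as possible day values, a change such as "5 percent" to "9 percent" would also be treated as date-only, so a check like this seems better suited to tagging a change than to dropping it silently.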

For now, I've only used versions from the Internet Archive (IA).
An interesting observation about the newer versions stored by IA:
IA embeds some information of its own in the HTML when it stores a version.
I'm not sure when this started, but it is present in the 2015-2017 versions.
Again, these are irrelevant changes.
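
One possible way to ignore this embedded markup is to strip it from both versions before diffing. The sketch below assumes the inserted block is delimited by HTML comments reading "BEGIN WAYBACK TOOLBAR INSERT" and "END WAYBACK TOOLBAR INSERT"; the thread does not state this, so treat the markers (and the function name) as assumptions to verify against real snapshots.

```python
import re

# Assumed delimiters for the archive-inserted block; verify against real snapshots.
_ARCHIVE_INSERT = re.compile(
    r"<!--\s*BEGIN WAYBACK TOOLBAR INSERT\s*-->"
    r".*?"
    r"<!--\s*END WAYBACK TOOLBAR INSERT\s*-->",
    re.DOTALL,
)


def strip_archive_insert(html: str) -> str:
    """Remove the archive-inserted block so it never shows up in a diff."""
    return _ARCHIVE_INSERT.sub("", html)


snapshot = (
    "<body><!-- BEGIN WAYBACK TOOLBAR INSERT -->"
    '<td id="displayDayEl">16</td>'
    "<!-- END WAYBACK TOOLBAR INSERT --><p>Agency news</p></body>"
)
print(strip_archive_insert(snapshot))
```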

Other than dates, I have not noticed other irrelevant changes which occur frequently.
Maybe I need to try very specific cases, or just focus on the recent versions, as they change almost every day. I will now try this out with PageFreezer versions.
In the meantime, since I have a basic sense of the date changes and the others, I'll move ahead with the pre-filtering issue.

@suchthis
Contributor Author

Thanks for the comments above @janakrajchadha. Re.:

Other than dates, I have not noticed other irrelevant changes which occur frequently.

Have you not seen any examples of entire page sections that might be both updated frequently and irrelevant, such as a scrolling banner ad?

@janakrajchadha

I tried looking for patterns in scrolling banners and other sections, but since any information or article can be included in a scrolling banner, it is hard to create a simple filter for them. I haven't seen any ads, as we're mostly dealing with government agency websites, so scrolling banner ads are out of the picture.
As I've mentioned in the Slack conversations, I've pinged Toly and Maya asking for recent files to help me with this.
Looking for diffs between random recent versions hasn't been very effective.
Even though I've looked at more PF diffs than anyone else, I can never be sure how many I should look at before I can spot more patterns.

@janakrajchadha

  1. The only changes that follow an easily identifiable pattern are date/time changes.
    Different websites display date/time in different ways. We can start with filters that work for most cases, but we may have to handle specific cases and modify our filtering as and when we encounter them.
    A fair share of the entries in the frequently occurring changes dictionary are date/time changes, and a few more could be considered date/time changes but couldn't be tagged as such because of the other changes they appeared with.

  2. Scrolling news, banners etc. have been repeatedly tagged as irrelevant, not because of the page element that contains them but because of their content, i.e., the text that appeared in them. Tagging any scrolling banner etc. as irrelevant without looking at the text is risky and can easily make us ignore important changes.

  3. Another element that can be filtered out is social media links. As they do not have a common format on all sites, we can follow the date/time path: start with a few formats, then modify and improve over time.
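
The "start with a few formats" approach in point 3 could begin with something as small as a host allow-list. This is a minimal sketch, not the project's filter: the host list and the function name are illustrative assumptions, meant to be extended as new formats turn up.

```python
from urllib.parse import urlparse

# Illustrative starter set (an assumption); extend as new sites are encountered.
SOCIAL_HOSTS = {
    "twitter.com",
    "facebook.com",
    "youtube.com",
    "instagram.com",
    "linkedin.com",
    "flickr.com",
}


def is_social_link(href: str) -> bool:
    """True if href points at a known social media host or one of its subdomains."""
    host = (urlparse(href).hostname or "").lower()
    return any(host == known or host.endswith("." + known) for known in SOCIAL_HOSTS)


links = [
    "https://www.facebook.com/EPA",
    "https://www.epa.gov/climatechange",
    "/about/contact",
]
print([href for href in links if is_social_link(href)])
```

Matching on the parsed hostname rather than on a substring keeps a link like https://notfacebook.com/ from being flagged by accident.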

@suchthis
Contributor Author

Just noting that after convo in Slack, we are leaning toward id'ing scrolling news feeds & banners as far as possible, though probably tagging those changes rather than filtering them automatically, per above concern.

@Mr0grog
Member

Mr0grog commented Jul 24, 2017

FWIW: in the new DB, we have annotations (all the current ways we classify in spreadsheets), priority, and significance. It might make sense to auto-assign scrolling news feeds the normal annotation (repeated changes 12) and maybe a low priority (if they are the only change on the page), like 0.1 (priority is from 0-1).
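
To make that suggestion concrete, here is a small sketch of the rule. Everything about the data model is hypothetical (the `ChangeRecord` class, the `kinds` labels, and the annotation name are stand-ins, not the new DB's schema); only the 0-1 priority range and the 0.1 value come from the comment above.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ChangeRecord:
    """Hypothetical stand-in for a change row; not the real DB schema."""
    kinds: Set[str]  # detected change types, e.g. {"news_feed", "date"}
    annotations: Set[str] = field(default_factory=set)
    priority: Optional[float] = None  # 0-1 when set; None means not yet assigned


def auto_annotate(change: ChangeRecord) -> ChangeRecord:
    """Tag scrolling news feed changes instead of filtering them out."""
    if "news_feed" in change.kinds:
        change.annotations.add("repeated_change")  # hypothetical annotation name
        # Lower the priority only when the feed is the sole change on the page.
        if change.kinds == {"news_feed"}:
            change.priority = 0.1
    return change


feed_only = auto_annotate(ChangeRecord(kinds={"news_feed"}))
mixed = auto_annotate(ChangeRecord(kinds={"news_feed", "body_text"}))
print(feed_only.priority, mixed.priority)
```

Leaving the priority unset when other changes are present keeps an analyst's attention on pages where the feed change might be masking something important.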

4 participants