Research similar projects #18

dcwalk · 2017-03-12T04:48:09Z

In our 2017-03-11 Dev standup, the question came up of what comparable projects are out there. We should compile a list and pay attention to their features and implementation specifics.
@ambergman mentioned Klaxon as one.

This could also be a great first-timer issue: we could collect those projects and document important details?

dcwalk · 2017-03-12T14:24:06Z

Internet Archive's Wayback Machine

From #9:

Have you folks talked to @markjohngraham about how some of these HTML diff problems are (or could be) addressed in the Wayback Machine (web.archive.org)?

dcwalk · 2017-03-12T14:32:43Z

Klaxon
Klaxon enables reporters and editors to monitor scores of sites on the web for newsworthy changes

Description: You list websites you want monitored and Klaxon will visit them and, if they change, email you what's different. It saves you having to reload dozens of links yourself every day.
Features: ??
Deployment: Heroku / Docker / Ubuntu
Development: github.com/themarshallproject/klaxon, Ruby on Rails
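
For reference, the basic "visit a page and notice whether it changed" idea from the description can be sketched with a content hash. This is only an illustration, not how Klaxon itself works (Klaxon is a Rails app and I have not checked its internals); the function names and sample HTML are hypothetical.

```python
import hashlib


def fingerprint(page_text):
    """Return a SHA-256 hex digest of the page content."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint, current_text):
    """Compare a stored fingerprint against freshly fetched content."""
    return fingerprint(current_text) != previous_fingerprint


# Hypothetical snapshots of the same page on two visits.
first_visit = "<h1>Budget report</h1><p>Total: 100</p>"
second_visit = "<h1>Budget report</h1><p>Total: 120</p>"

stored = fingerprint(first_visit)
unchanged = has_changed(stored, first_visit)   # same content as before
changed = has_changed(stored, second_visit)    # content differs
```

A hash only says that something changed, not what changed, so a real monitor would still need to keep the previous snapshot around to produce a diff.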

ChaiBapchya · 2017-03-17T14:43:39Z

List of Similar projects - Website Change Monitoring / Notification

Klaxon : https://newsklaxon.org/
Versionista (the one we are using) : https://versionista.com/
Visual Ping : https://visualping.io/
Follow that page : https://www.followthatpage.com/
Page monitor (Google Chrome extension) : http://chrome.google.com/webstore/detail/page-monitor/pemhgklkefakciniebenbfclihhmmfcd
Wachete : https://www.wachete.com/
NewsWhip Analytics : https://www.newswhip.com/newswhip-analytics/
Alertbox (Mozilla Firefox extension) : https://addons.mozilla.org/en-us/firefox/addon/alertbox/
ChangeDetection : https://www.changedetection.com/

thisisashukla · 2017-03-17T18:31:25Z

Diffbot :
Diffbot's computer vision and machine learning services structure Web data with better-than-human-level accuracy across any website or language

A tool for web data extraction

janakrajchadha · 2017-03-18T14:48:36Z

@daas-ankur-shukla - Well, this adds a whole new dimension to the problem.
This may be extremely helpful in the process of creating a training dataset.

ChaiBapchya · 2017-03-18T16:48:49Z

Yes. After going through the website and its different products (Diffbot, Crawlbot, custom-made APIs, and the Article, Discussion, Image, Product, and Video APIs), it's believable that the bot returns structured data. To understand the mechanics of how they do it, I went through the official Diffbot GitHub organization at https://github.com/diffbot. Unfortunately, all it offers is client libraries (Ruby, JS), documentation of API calls, etc.
The actual inner workings, proofs of concept, etc. remain under wraps.

So all in all, despite being a promising product, it remains a "product to be purchased" (after the 14-day trial, of course).

dcwalk · 2017-03-18T22:22:03Z

Another monitoring one...
Huginn
Huginn is a system for building agents that perform automated tasks for you online. They can read the web, watch for events, and take actions on your behalf. Huginn's Agents create and consume events, propagating them along a directed graph. Think of it as a hackable Yahoo! Pipes plus IFTTT on your own server. You always know who has your data. You do.

Description: Intro Screencast
Features:

Scrape websites and receive emails when they change
Track counts of high frequency events
Send and receive WebHooks
Run custom JavaScript or CoffeeScript functions

Deployment: Docker / Heroku / Local ubuntu or debian
Development: github.com/cantino/huginn, Ruby on Rails

ChaiBapchya · 2017-03-19T16:54:05Z

I tried setting up Huginn on Fedora 25 (i3, 4-core, 4 GB DDR3), but hit roadblocks. Trying to get support from their repo now.

dcwalk · 2017-03-20T16:59:02Z

Here is another one:
Diff Engine
diffengine is a utility for watching RSS feeds to see when story content changes. When new content is found a snapshot is saved at the Internet Archive, and a diff is generated for sending to social media. The hope is that it can help draw attention to the way news is being shaped on the web. It also creates a database of changes over time that can be useful for research purposes.

Features:

generates diffs
stores a snapshot to Internet Archive

Deployment: Local install
Development: https://github.com/DocNow/diffengine, Python 3
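
To make the "generates diffs" feature concrete, here is a minimal sketch of diffing two versions of a story's text with Python's standard `difflib`. It is not taken from diffengine's code; the helper name `text_diff` and the sample story text are hypothetical.

```python
import difflib


def text_diff(old, new):
    """Return unified-diff lines between two versions of a text."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


# Hypothetical story text captured at two points in time.
old_story = "Mayor announces plan\nThe budget is 10 million."
new_story = "Mayor announces plan\nThe budget is 12 million."

diff_lines = text_diff(old_story, new_story)
for line in diff_lines:
    print(line)
```

Removed lines are prefixed with `-` and added lines with `+`; identical inputs produce an empty list.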

janakrajchadha · 2017-03-21T08:20:12Z

@dcwalk - I installed Diff Engine on my machine and it's quite simple to use.
However, it restricts the monitoring to RSS feeds, which I think will seriously limit the number of pages we can monitor with Diff Engine.

ChaiBapchya · 2017-03-26T14:12:28Z

OnWebChange

Similar tool for tracking web changes.

Features -

Notifications
Extensive report
Ability to select specific things to track

Disadvantage -

Price
The user has to select the areas they want to track

dcwalk · 2017-03-26T19:41:59Z

Glad to have these projects listed and thoughts on use/disadvantages here!

ChaiBapchya · 2017-03-26T19:58:57Z

Thinking of making a detailed study and comparative analysis of each of these projects for the benefit of everyone. What do you reckon? Useful or not needed? @dcwalk

chiniwini1 · 2017-04-17T23:53:19Z

Hello! I've tried nearly all of them, along with dozens of others from the internet, and none of them works properly. I tried to monitor this site: http://www.yapo.cl/chile/vehiculos?ca=15_s&l=0&cmn=&st=s to create alerts for a specific model, but nothing works. I tried wachete, versionista, webchangedetection, webwatcher, visual ping, follow that page, changedetection, distill, onewebchange, changetower, etc. Only thewebwatcher.com works partially.
Thank you. I'm still trying with Klaxon, but no filters can be added.

chiniwini1 · 2017-04-18T00:26:42Z

Also tried trackly and watchthatpage. Still trying.

dcwalk · 2017-04-19T04:47:59Z

A new one which looks interesting:
Pagelyzer
"Pagelyzer is a tool which compares two web pages versions and decides if they are similar or not.
It is based on:

a web page segmentation algorithm
a combination of structural and visual comparison methods embedded in a statistical discriminative model
a visual similarity measure designed for Web pages that improves change detection
a supervised feature selection method adapted to Web archiving"
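
As a rough illustration of the "decide if two versions are similar or not" step only: the sketch below swaps in a plain text similarity ratio from Python's `difflib.SequenceMatcher`, which is far simpler than the structural and visual model Pagelyzer describes. The threshold value and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical cut-off; a real tool would tune or learn this.
SIMILARITY_THRESHOLD = 0.9


def similarity(version_a, version_b):
    """Return a ratio in [0, 1]; 1.0 means the two texts are identical."""
    return SequenceMatcher(None, version_a, version_b).ratio()


def are_similar(version_a, version_b, threshold=SIMILARITY_THRESHOLD):
    """Decide whether two page versions count as 'similar'."""
    return similarity(version_a, version_b) >= threshold


# Hypothetical extracted text from two captures of the same page.
page_v1 = "Opening hours: 9 to 5"
page_v2 = "Opening hours: 9 to 6"

small_edit_is_similar = are_similar(page_v1, page_v2)
unrelated_is_similar = are_similar("abc", "xyz")
```

A text-only ratio like this ignores layout and images entirely, which is exactly the gap the visual comparison in Pagelyzer is meant to cover.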

patcon · 2017-05-24T07:05:07Z

As per the above linked issue, I've created this as an example of the sort of public resource we could maintain with the larger community in this space: https://github.com/patcon/awesome-website-change-monitoring

I've added the tools from this thread, and some others that were linked from the diffengine readme (incl newsdiffs, a mozilla project).

I've also created a copy of @mhucka's web archiving spreadsheet, stubbing out more in-depth info, and linked it from the awesome-list repo.

If we're down with this approach, thinking next steps could be:

flesh out short tool descriptions in list itself.
transfer repo to edgi org and create a team for it.
open PRs in open source projects on the list, linking our list in their readme, and inviting them onto the list maintainer team. (while inviting their communities to engage with our project)
update the linked in-depth spreadsheet to the degree we care to

Would this be a path that made sense to people? Again, the main perk would be that it's a simple and useful collaborative resource that could be a social hack to help us intersect more with communities in the same space :)

dcwalk · 2017-05-25T02:34:32Z

I really dig the idea of this @patcon! My one thought is: how does the awesome list avoid over-duplicating the spreadsheet? I have a link dump of papers on this issue (comparing web archivers); just point me to where they should go!

(sorry, out of sync email checking :))

KrzysztofMadejski · 2017-06-01T14:00:07Z

Metamorphosis Foundation in Macedonia has developed Time Machine: a website that tracks where a news article originated and how it was copied by other outlets. They also track changes on a given website.

Website: http://timemachine.truthmeter.mk/
Source: https://bitbucket.org/metamorfozis/news/src

It is quite custom so I think it can be hard to reuse it, but I'm leaving it here for reference.

patcon · 2017-06-13T16:50:19Z

@KrzysztofMadejski nice! ~~Do you see this more as a web archiving tool (like archive.org) or a website change monitoring tool?~~

EDIT: ~~it seems to be both, but curious your thoughts.~~ Further research seems to show its main value is comparison, so I'll add it to the list as a web change monitoring tool. (Not sure if we want to start listing things in both but we could do that too.)

EDIT: Added to list (edgi-govdata-archiving/awesome-website-change-monitoring#2) and spreadsheet.

patcon · 2017-06-13T17:30:14Z

@dcwalk The thinking was that it might be easier to ask maintainers to point their READMEs to an awesome list, but you're right that it is a bit weird to have it in two places. Maybe the repo could be a thin README for the spreadsheet itself, and just point there, with a nice screenshot like @mhucka did in the research repo? The trade-off is that now we don't have a clear process to say to project maintainers "let's look after this resource together", because editing Google spreadsheets is much more opaque than submitting pull requests. (Who made the change? Was it a new person who we should reach out to? Does the tool even fit?)

I like the pretty of the Google spreadsheet, but maybe it should just be a CSV in a GitHub repo.
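
A minimal sketch of what that could look like, assuming a hypothetical three-column layout (`name`, `language`, `source`); the rows reuse details already posted in this thread, and the real spreadsheet may well need different columns.

```python
import csv
import io

# Hypothetical CSV layout; in a repo this would live in a file such as tools.csv.
TOOLS_CSV = """name,language,source
Klaxon,Ruby on Rails,https://github.com/themarshallproject/klaxon
Huginn,Ruby on Rails,https://github.com/cantino/huginn
diffengine,Python 3,https://github.com/DocNow/diffengine
"""


def load_tools(csv_text):
    """Parse the CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


tools = load_tools(TOOLS_CSV)

# Example query: which listed tools are Rails apps?
rails_tools = [row["name"] for row in tools if row["language"] == "Ruby on Rails"]
```

Each added tool is then a one-line change in a pull request, which keeps the "who made the change?" question answerable from the commit history.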

stale · 2019-01-09T22:37:23Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

lightandluck · 2019-01-14T22:43:31Z

I feel it would be a loss not to surface all these resources even if it's not something that we can actively maintain. Will throw into idea pile for now

Mr0grog · 2019-01-22T16:42:52Z

I feel it would be a loss not to surface all these resources

I agree, but we absolutely should not have an issue doing that work for us—it’s effectively invisible here. This should be a doc in the repo.

dcwalk assigned ambergman Mar 12, 2017

dcwalk added the good-first-issue label Mar 12, 2017

dcwalk added the help-wanted label Mar 12, 2017

dcwalk mentioned this issue Mar 12, 2017

Build PageFreezer-Outputter that fits into current Versionista workflow #9

Closed

dcwalk unassigned ambergman Mar 20, 2017

patcon mentioned this issue May 23, 2017

Create Awesome lists for our tool tracking needs edgi-govdata-archiving/overview#130

Closed

remusao mentioned this issue Jun 1, 2017

Identify related projects cliqz-oss/privacy-bot#16

Open

stale bot added the stale label Jan 9, 2019

stale bot removed the stale label Jan 14, 2019

lightandluck added the idea label Jan 14, 2019

stale bot added the stale label Aug 19, 2020

Mr0grog added never-stale and removed idea stale labels Aug 19, 2020

edgi-govdata-archiving deleted a comment from stale bot Aug 19, 2020

Mr0grog mentioned this issue Jan 17, 2023

Put this project to rest #168

Closed

37 tasks
