What do we need for Machine Learning? #21

titaniumbones · 2017-03-09T17:08:02Z

I hope someone can write a better description of what we need from a machine-learning component in this repository. Please feel free to edit this description directly.

titaniumbones · 2017-03-09T17:09:27Z

This is of high interest to our incoming GSoC applicants, so if we can move on this, that would be great.

danielballan · 2017-03-09T18:53:37Z

Definition of Terms

Page: a web page that might change over time
Version: a snapshot of a Page at a specific time (saved as HTML, for now)
Change: two different Versions of the same Page
Diff: a representation of a Change. This could be a plain-text diff (as in the UNIX command-line utility) or a richer representation (as in the JSON blobs returned by PageFreezer) that takes HTML semantics into account
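
To make the plain-text end of that spectrum concrete, here is a minimal sketch that uses Python's standard `difflib` to diff two Versions line by line. The function name `plain_text_diff`, the labels `version_a`/`version_b`, and the sample HTML are made up for illustration; this is not the PageFreezer representation.

```python
import difflib
from typing import List


def plain_text_diff(before: str, after: str) -> List[str]:
    """Return a unified, line-based diff between two Versions of a Page."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="version_a",  # illustrative labels only
            tofile="version_b",
            lineterm="",
        )
    )


# Made-up sample Versions, for illustration only.
old = "<p>Climate change is a priority.</p>\n<p>Contact us.</p>"
new = "<p>Energy independence is a priority.</p>\n<p>Contact us.</p>"

diff_lines = plain_text_diff(old, new)
print("\n".join(diff_lines))
```

A diff like this knows nothing about HTML structure, which is the gap a richer representation is meant to fill.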

Goal

Analyze ~10^5 Changes. Filter out unimportant ones and prioritize important ones so that human analysts can be directed to the most important changes.

Expected API

As outlined in #7, a data processing function is expected to take in an unordered collection of Diff namedtuple objects, which include a uuid, a hash, and a dictionary containing the diff content (about which more below). It is expected to return a dictionary mapping each Diff's uuid to a priority, a float between zero and one. Zero means "Do not waste a human analyst's time on this" (because it is a straight duplicate or otherwise known with high confidence to be uninteresting) and one means, "This is extremely likely to be important."

The diff content in itself is a dictionary (mirroring the JSON blobs returned by PageFreezer) that includes the full text of the page before, the full text after, chunked changes, and some statistics.
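
As a rough illustration of the shape described above (not the notebook's actual function), the sketch below takes a collection of `Diff` namedtuples and returns a dictionary mapping each uuid to a float in [0, 1]. The field names `uuid`, `hash`, and `content`, the function name `prioritize_diffs`, and the placeholder score of 0.5 are assumptions; the only real logic is that exact duplicates (same hash) get priority zero.

```python
from collections import namedtuple
from typing import Dict, Iterable

# Assumed field names: the description only says a Diff carries a uuid,
# a hash, and a dictionary with the diff content.
Diff = namedtuple("Diff", ["uuid", "hash", "content"])


def prioritize_diffs(diffs: Iterable[Diff]) -> Dict[str, float]:
    """Map each Diff's uuid to a priority between 0.0 and 1.0.

    Hypothetical placeholder logic: a Diff whose hash was already seen is a
    straight duplicate and gets 0.0; everything else gets a neutral 0.5.
    """
    seen_hashes = set()
    priorities = {}
    for diff in diffs:
        if diff.hash in seen_hashes:
            priorities[diff.uuid] = 0.0  # duplicate: not worth an analyst's time
        else:
            seen_hashes.add(diff.hash)
            priorities[diff.uuid] = 0.5  # unknown importance
    return priorities


example = [
    Diff(uuid="a", hash="h1", content={}),
    Diff(uuid="b", hash="h1", content={}),  # same hash as "a"
    Diff(uuid="c", hash="h2", content={}),
]
print(prioritize_diffs(example))
```

A real implementation would replace the neutral 0.5 with a score derived from the diff content.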

danielballan · 2017-03-09T19:53:01Z

The function assign_priorities at the end of this notebook is a trivial example of the API I have in mind for integrating ML filtering/prioritization code.

dcwalk · 2017-03-09T20:35:25Z

I reframed this as a question for now, as I'm not sure we will want to be tracking this as a "meta-issue".

Also @danielballan's definitions are awesome and I think we want to have that upstreamed into the main web monitoring repo?

Please adjust as needed.

jaclynweiser · 2017-03-10T19:50:05Z

I will have input once you have a few ML training examples. We need some positive and negative instances of alarming changes, e.g. a major messaging change, or a column of data changed or removed. We can potentially create some of these synthetically and bootstrap from there.

danielballan · 2017-03-10T19:58:08Z

@jaclynweiser Sounds good. I'm focused on getting a minimal deployment up, in coordination with the Rails app, where we can start pulling in a small amount of data.

aleatha · 2017-03-10T21:48:48Z

Some of the conversations we had during the event were around trying to capture significant keywords and assigning weights to a change based on those. Someone (maybe @dcwalk?) also noted that we don't have a static weights distribution. In other words, the set of important keywords may shift over time, so it's important to keep collecting feedback and retraining whatever model you use.
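
A minimal sketch of that idea, assuming single-word keywords and a very simple feedback rule: score a change by summing the weights of the keywords it contains, then nudge those weights toward an analyst's label. The function names, the example keywords and weights, and the learning rate are all invented for illustration; this is not an agreed-upon model.

```python
import re
from typing import Dict


def _tokens(text: str) -> set:
    """Lowercase the text and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_score(text: str, weights: Dict[str, float]) -> float:
    """Sum the weights of keywords present in the text, clipped to [0, 1]."""
    present = _tokens(text)
    total = sum(w for keyword, w in weights.items() if keyword in present)
    return max(0.0, min(1.0, total))


def update_weights(
    weights: Dict[str, float],
    text: str,
    analyst_label: float,
    learning_rate: float = 0.1,
) -> Dict[str, float]:
    """Return new weights, nudged toward the analyst's label.

    analyst_label is 1.0 for "important" and 0.0 for "not important".
    Only keywords that occur in the text are adjusted.
    """
    present = _tokens(text)
    error = analyst_label - keyword_score(text, weights)
    return {
        keyword: w + learning_rate * error if keyword in present else w
        for keyword, w in weights.items()
    }


# Made-up keywords and weights, for illustration only.
weights = {"climate": 0.4, "emissions": 0.3, "footer": 0.0}
change_text = "Removed climate emissions data"

before = keyword_score(change_text, weights)
weights = update_weights(weights, change_text, analyst_label=1.0)
after = keyword_score(change_text, weights)
print(before, after)
```

Because the weights are updated from each new label rather than fixed once, the set of keywords that matter can drift over time along with analyst feedback.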

dcwalk · 2017-03-10T22:01:00Z

Wasn't me (I'm not generally that coherent)!

@jaclynweiser -- we have "example data" here for some of those categories, but I think not enough: https://github.com/edgi-govdata-archiving/web-monitoring/tree/master/example-data

Could you give us (okay--me) some sense of sizing of how many changes you'd like to see to begin training? We are hoping to have a fairly robust training set for people to play around with, and any guidance on that would help me wrap my head around it a little more!

titaniumbones · 2017-03-10T22:45:49Z

We also have this issue: edgi-govdata-archiving/web-monitoring#6 so if @aleatha or @jaclynweiser or others have advice on how to set up the initial data collection so we get meaningful results, please drop it there!!!

maylad31 · 2017-03-11T11:23:20Z

Sir,
I would like to apply for GSOC 2017. I think alarming changes could be detected if we focus only on changes in content and not on presentation or cosmetic changes. Please guide me further. Thanks.
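
One way to picture the content-versus-presentation distinction is the sketch below. It assumes that "content" means the visible text of a page with whitespace normalized, and that anything inside `script` or `style` tags counts as presentation. The helper names `visible_text` and `content_changed` are hypothetical; only Python's standard `html.parser` is used.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the page's text with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def content_changed(before_html: str, after_html: str) -> bool:
    """True if the visible text differs; markup-only edits return False."""
    return visible_text(before_html) != visible_text(after_html)


# Made-up sample Versions, for illustration only.
original = '<p class="a">Hello <b>world</b></p>'
restyled = '<p class="b" style="color:red">Hello   <b>world</b></p><style>p{}</style>'
reworded = '<p class="a">Goodbye <b>world</b></p>'

print(content_changed(original, restyled))
print(content_changed(original, reworded))
```

This treats every text change as equally significant, so it can only filter out purely cosmetic changes; ranking the remaining ones still needs something more.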

danielballan · 2017-03-11T15:01:30Z

@dcwalk My understanding is that ~50 is generally the bare minimum for even playing around with the data, so boosting the sample-data folder to 50-100 curated examples would be a start. To train something we could actually use, my rough sense is that ~thousands are needed, likely depending on how rare true positives are. (I defer to more experienced ML practitioners here; I'm just trying to give a rough sense....) Perhaps that's a moot point: in production, we will simply use all the labeled data we have: ~tens of diffs of 25k pages.

danielballan · 2017-03-11T15:23:01Z

@maylad31 That's a great point. We're still setting things up (gathering our data into a distributable form) and making sure we have clear internal agreement about the task. Engaging new contributors is Priority #1, and we'll post guidance in this README or in the README of the web-monitoring repo when we have it.

maylad31 · 2017-03-11T15:45:37Z

Yes, but how do I go about preparing a proposal for GSOC? Please help. Thanks.

danielballan · 2017-03-11T15:49:58Z

Please take GSOC questions to the #gsoc channel of the Archivers slack. If you're not on that yet, you can obtain an invitation here.

dcwalk · 2017-03-12T13:45:50Z

Just tracking whether we have pulled out the relevant issues here:

the sample dataset issue is covered by Create sample dataset for machine learning projects web-monitoring#6
@danielballan's rad def'ns are gonna be integrated into the readme for the project
issues are currently tracking some of the identified elements of filtering/prioritization

I am going to try tagging all the ML stuff as ML, that might be helpful to orient newcomers?

thisisashukla · 2017-03-15T14:23:39Z

Hi all. As part of my college course on Machine Learning, I have picked up some hints about what we can use to learn patterns and changes in documents. Maybe kernels can be helpful in this task. Should I work more on it and write an abstract?

danielballan · 2017-03-15T16:48:07Z

@daas-ankur-shukla We're still getting our act together, not quite ready to comment on the deluge of ideas coming from various channels. Information will be available in the #gsoc slack channel as we have it.

thisisashukla · 2017-03-15T16:50:30Z

@danielballan what shall be the course of action for aspiring students until then?

janakrajchadha · 2017-03-16T02:34:47Z

@danielballan, I believe aspiring students need to discuss the project ideas with your team and get a better understanding of the problem at hand before preparing a good proposal.
As the time for submitting proposals starts next week, I request you to have a session with aspiring students before the proposal submission period starts.

dcwalk · 2017-03-16T19:37:14Z

@janakrajchadha and @daas-ankur-shukla we have an upcoming session for students on Fri, Mar 24 @ 8:00 pm EST. If you aren't already in the #gsoc channel in our chat, I'd recommend joining us there (instructions on https://envirodatagov.org/gsoc/).

We're making ourselves available in the chat to talk through proposal ideas with students in advance of the submissions opening. Also, we recommended on our ideas page to get a proposal idea up early and iterate on it with feedback.

Google Summer of Code and proposal questions are better suited to the chat than to our GitHub issues in various repositories.

janakrajchadha · 2017-03-17T01:36:22Z

@dcwalk Thanks for the information.
I'm already a part of the slack #gsoc channel.
My nickname is @the_automator.
I have gone through the ideas page and I'm trying to understand the details of the web-monitoring project. I had posted a question a couple of days back and since I did not get any reply there, I just wanted to know if you could arrange a session before Mon, Mar 20.
Should I post my questions in the chat and tag one of the admins or should I keep my questions for the upcoming session for students?

ChaiBapchya · 2017-03-17T19:52:53Z

Summary of yesterday's Slack conversation, along with @b5 and @janakrajchadha
Topic: ML / NLP aspect of the project
Context: understanding "meaningful change"

Brief

1. Test existing software for web site change monitoring/tracking for its ability to provide training data
2. Challenges:
a. Infrastructure setup and deployment
b. Defining the criteria for negative and positive instances

Solution
For 2.b., a training dataset plus feedback from analysts is the key

To tackle the problem of insufficient data quantity:
NLP + a base set of important keywords

Hence, To Do List

1. Generate / find a base set of important keywords
2. Collect feedback from analysts
3. Train and enrich the base set
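
The to-do list above could be sketched roughly as follows, assuming analyst feedback arrives as (change text, is-important) pairs. A word is added to the base set when it appears in at least `min_count` changes marked important and in none marked unimportant. The function name `enrich_keywords`, the threshold, and the sample feedback are invented for illustration, and this simple counting rule stands in for whatever NLP method is eventually chosen.

```python
import re
from collections import Counter
from typing import Iterable, Set, Tuple


def enrich_keywords(
    base_keywords: Set[str],
    labeled_changes: Iterable[Tuple[str, bool]],
    min_count: int = 2,
) -> Set[str]:
    """Return the base keyword set plus words suggested by analyst feedback."""
    important = Counter()
    unimportant = Counter()
    for text, is_important in labeled_changes:
        # Count each word once per change, not once per occurrence.
        words = set(re.findall(r"[a-z]+", text.lower()))
        (important if is_important else unimportant).update(words)
    candidates = {
        word
        for word, count in important.items()
        if count >= min_count and unimportant[word] == 0
    }
    return set(base_keywords) | candidates


# Made-up base set and analyst feedback, for illustration only.
base = {"climate"}
feedback = [
    ("Removed methane emissions table", True),
    ("Methane reporting rule deleted", True),
    ("Updated footer date", False),
    ("Removed footer link", False),
]

print(sorted(enrich_keywords(base, feedback)))
```

With so few labeled examples the candidates are noisy, so a human review step before accepting new keywords would probably still be needed.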

thisisashukla · 2017-03-20T12:37:40Z

@dcwalk thanks for the chat time information. I have already joined the slack channel. My username is daas-ankur. I shall join the chat and put some ideas regarding my proposal. I'll also try to put up my proposal asap for the mentors to look and suggest. Looking forward to the chat session.

mhucka · 2017-03-25T02:25:35Z

In case it's useful to anyone, here's a link to some info about the PageFreezer API and data it returns: https://www.pagefreezer.com/development/

stale · 2019-01-10T03:11:53Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

titaniumbones assigned kmcculloch, danielballan, dcwalk and jaclynweiser Mar 9, 2017

titaniumbones added enhancement help-wanted labels Mar 9, 2017

titaniumbones added this to Issues in Machine Learning Mar 9, 2017

dcwalk added question and removed enhancement labels Mar 9, 2017

dcwalk changed the title ~~Meta-issue for Machine Learning~~ What do we need for Machine Learning? Mar 9, 2017

dcwalk mentioned this issue Mar 12, 2017

Create sample dataset for machine learning projects edgi-govdata-archiving/web-monitoring#6

Closed

dcwalk added automated-analysis and removed help-wanted labels Mar 12, 2017

ChaiBapchya mentioned this issue Mar 27, 2017

Added environmental corpus edgi-govdata-archiving/web-monitoring#28

Closed

stale bot added the stale label Jan 10, 2019

stale bot closed this as completed Jan 17, 2019
