What do we need for Machine Learning? #21

Closed
titaniumbones opened this issue Mar 9, 2017 · 25 comments
Comments

@titaniumbones
Contributor

I hope someone can write a better description of what we need from a machine-learning component in this repository. Please feel free to edit this description directly.

@titaniumbones
Contributor Author

This is of high interest to our incoming GSoC applicants, so if we can move on this, that would be great.

@titaniumbones titaniumbones added this to Issues in Machine Learning Mar 9, 2017
@danielballan
Contributor

danielballan commented Mar 9, 2017

Definition of Terms

  • Page: a web page that might change over time
  • Version: a snapshot of a Page at a specific time (saved as HTML, for now)
  • Change: two different Versions of the same Page
  • Diff: a representation of a Change. This could be a plain text diff (as in the UNIX command line utility) or a richer representation (as in the JSON blobs returned by PageFreezer) that takes into account HTML semantics

Goal

Analyze ~10^5 Changes. Filter out unimportant ones and prioritize important ones so that human analysts can be directed to the most important changes.

Expected API

As outlined in #7, a data processing function is expected to take in an unordered collection of Diff namedtuple objects, which include a uuid, a hash, and a dictionary containing the diff content (about which more below). It is expected to return a dictionary mapping each Diff's uuid to a priority, a float between zero and one. Zero means "Do not waste a human analyst's time on this" (because it is a straight duplicate or otherwise known with high confidence to be uninteresting) and one means, "This is extremely likely to be important."

The diff content in itself is a dictionary (mirroring the JSON blobs returned by PageFreezer) that includes the full text of the page before, the full text after, chunked changes, and some statistics.
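
A minimal sketch of that interface, for illustration only. The field name `content`, the function name `assign_priorities_sketch`, and the placeholder score of 0.5 are assumptions, not part of any agreed API; the only rule implemented is the "straight duplicate" case, detected here by a repeated hash.

```python
from collections import namedtuple

# uuid and hash come from the description above; `content` is an assumed
# name for the dictionary that holds the diff content.
Diff = namedtuple("Diff", ["uuid", "hash", "content"])


def assign_priorities_sketch(diffs):
    """Map each Diff's uuid to a priority, a float between 0 and 1.

    Placeholder heuristic: a Diff whose hash was already seen is treated
    as a straight duplicate and gets 0.0; everything else gets 0.5,
    meaning "no opinion yet".
    """
    seen_hashes = set()
    priorities = {}
    for diff in diffs:
        if diff.hash in seen_hashes:
            priorities[diff.uuid] = 0.0
        else:
            seen_hashes.add(diff.hash)
            priorities[diff.uuid] = 0.5
    return priorities


example_diffs = [
    Diff("a", "h1", {}),
    Diff("b", "h1", {}),  # same hash as "a", so treated as a duplicate
    Diff("c", "h2", {}),
]
example_priorities = assign_priorities_sketch(example_diffs)
```

A real implementation would replace the 0.5 placeholder with a model score; the input and output shapes are the part that matters here.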

@danielballan
Contributor

The function assign_priorities at the end of this notebook is a trivial example of the API I have in mind for integrating ML filtering/prioritization code.

@dcwalk dcwalk changed the title from "Meta-issue for Machine Learning" to "What do we need for Machine Learning?" Mar 9, 2017
@dcwalk
Contributor

dcwalk commented Mar 9, 2017

I reframed this as a question for now, as I'm not sure we will want to be tracking this as a "meta-issue".

Also, @danielballan's definitions are awesome, and I think we want to have them upstreamed into the main web monitoring repo.

Please adjust as needed.

@jaclynweiser

I will have input once you have a few ML training examples. We need some positive and negative instances of alarming changes, e.g. a major messaging change, or a column of data changed or removed. We can potentially create some of these synthetically and bootstrap from there.

@danielballan
Contributor

@jaclynweiser Sounds good. I'm focused on getting a minimal deployment up, in coordination with the Rails app, where we can start pulling in a small amount of data to start.

@aleatha

aleatha commented Mar 10, 2017

Some of the conversations we had during the event were around trying to capture significant keywords and assigning weights to a change based on those. Someone (maybe @dcwalk?) also noted that we don't have a static weights distribution. In other words, the set of important keywords may shift over time, so it's important to keep collecting feedback and retraining whatever model you use.
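
A rough sketch of the keyword-weight idea, with weights that drift as analyst feedback comes in. Everything here is hypothetical: the function names, the example keywords and weights, the learning rate, and the simple "nudge toward the label" update rule are assumptions chosen for illustration, not a model anyone has agreed on.

```python
def keyword_score(text, weights):
    """Return a priority in [0, 1] from the weighted keywords found in text.

    Matching is deliberately crude: lowercase, split on whitespace,
    single-word keywords only, no punctuation handling.
    """
    words = set(text.lower().split())
    total = sum(weight for keyword, weight in weights.items() if keyword in words)
    return min(1.0, total)


def update_weights(weights, text, analyst_says_important, rate=0.1):
    """Return new weights, nudging keywords present in text toward the label.

    Keywords that appear in the text move toward 1.0 if the analyst marked
    the change important, and toward 0.0 otherwise. The input dictionary is
    left unmodified.
    """
    words = set(text.lower().split())
    target = 1.0 if analyst_says_important else 0.0
    updated = dict(weights)
    for keyword, weight in weights.items():
        if keyword in words:
            updated[keyword] = weight + rate * (target - weight)
    return updated


# Hypothetical starting weights.
weights = {"climate": 0.4, "contact": 0.2}
score = keyword_score("Climate data removed", weights)
new_weights = update_weights(weights, "climate report", True)
```

Because the weights are recomputed from each piece of feedback, the set of "important" keywords can shift over time without retraining from scratch.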

@dcwalk
Contributor

dcwalk commented Mar 10, 2017

Wasn't me (I'm not generally that coherent)!

@jaclynweiser -- we have "example data" here for some of those categories, but I think not enough: https://github.com/edgi-govdata-archiving/web-monitoring/tree/master/example-data

Could you give us (okay--me) some sense of sizing of how many changes you'd like to see to begin training? We are hoping to have a fairly robust training set for people to play around with, and any guidance on that would help me wrap my head around it a little more!

@titaniumbones
Contributor Author

We also have this issue: edgi-govdata-archiving/web-monitoring#6, so if @aleatha or @jaclynweiser or others have advice on how to set up the initial data collection so we get meaningful results, please drop it there!

@maylad31

Sir,
I would like to apply for GSoC 2017. I think alarming changes could be detected if we focus only on changes in content and not on presentation or cosmetic changes. Please guide me further. Thanks.
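
One way to sketch the content-versus-presentation idea, assuming each Version is available as an HTML string: extract the visible text with the standard library's `html.parser`, normalize whitespace, and compare. The names `visible_text` and `is_content_change` are made up for this example, and the approach is crude (for instance, it ignores images and treats text split across inline tags as separate words).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html):
    """Return the page's text with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapsing whitespace keeps re-indentation from counting as a change.
    return " ".join(" ".join(parser.chunks).split())


def is_content_change(before_html, after_html):
    """True if the visible text differs between two Versions."""
    return visible_text(before_html) != visible_text(after_html)


before = '<p class="a">Hello   world</p>'
restyled = '<p class="b" style="color:red">Hello world</p>'
reworded = '<p class="a">Hello there</p>'
```

Here `is_content_change(before, restyled)` is false because only attributes and spacing differ, while `is_content_change(before, reworded)` is true.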

@danielballan
Contributor

@dcwalk My understanding is that ~50 is generally the bare minimum for even playing around with the data, so boosting the sample-data folder to 50-100 curated examples would be a start. To train something we could actually use, my rough sense is that ~thousands are needed, likely depending on how rare true positives are. (I defer to more experienced ML practitioners here; I'm just trying to give a rough sense.) Perhaps that's a moot point: in production, we will simply use all the labeled data we have: ~tens of diffs of 25k pages.

@danielballan
Contributor

@maylad31 That's a great point. We're still setting things up (gathering our data into a distributable form) and making sure we have clear internal agreement about the task. Engaging new contributors is Priority #1, and we'll post guidance in this README or in the README of the web-monitoring repo when we have it.

@maylad31

Yes, but how do I go about preparing a proposal for GSoC? Please help. Thanks.

@danielballan
Contributor

Please take GSoC questions to the #gsoc channel of the Archivers Slack. If you're not on that yet, you can obtain an invitation here.

@dcwalk
Contributor

dcwalk commented Mar 12, 2017

Just tracking whether we have pulled out the relevant issues here:

I am going to try tagging all the ML stuff as ML, that might be helpful to orient newcomers?

@thisisashukla

Hi all. As part of my college course on Machine Learning, I have picked up some hints about what we can use to learn patterns and changes in documents. Maybe kernels can be helpful in the task. Should I work more on it and make an abstract?

@danielballan
Contributor

@daas-ankur-shukla We're still getting our act together, not quite ready to comment on the deluge of ideas coming from various channels. Information will be available in the #gsoc slack channel as we have it.

@thisisashukla

@danielballan What should the course of action be for aspiring students until then?

@janakrajchadha
Contributor

@danielballan, I believe aspiring students need to discuss the project ideas with your team and get a better understanding of the problem at hand before preparing a good proposal.
As the time for submitting proposals starts next week, I request you to have a session with aspiring students before the proposal submission period starts.

@dcwalk
Contributor

dcwalk commented Mar 16, 2017

@janakrajchadha and @daas-ankur-shukla we have an upcoming session for students on Fri, Mar 24 @ 8:00 pm EST. If you aren't already in the #gsoc channel of our chat, I'd recommend joining us there (instructions on https://envirodatagov.org/gsoc/).

We're making ourselves available in the chat to talk through proposal ideas with students in advance of the submissions opening. Also, we recommended on our ideas page to get a proposal idea up early and iterate on it with feedback.

Google Summer of Code and proposal questions are better suited to the chat than to our GitHub issues in various repositories.

@janakrajchadha
Contributor

@dcwalk Thanks for the information.
I'm already a part of the slack #gsoc channel.
My nickname is @the_automator.
I have gone through the ideas page and I'm trying to understand the details of the web-monitoring project. I had posted a question a couple of days back and since I did not get any reply there, I just wanted to know if you could arrange a session before Mon, Mar 20.
Should I post my questions in the chat and tag one of the admins or should I keep my questions for the upcoming session for students?

@ChaiBapchya
Contributor

Summary of yesterday's Slack conversation Along with @b5 @janakrajchadha
Topic - ML / NLP aspect of the project
Context - Understanding "meaningful change"

Brief

  1. Test existing software for "Web site change monitoring/tracking" for ability to Provide Training data
  2. Challenges -
    a. Infrastructure setup and deployment
    b. Defining the criteria for negative and positive instances

Solution
For 2.b. Training Dataset + Feedback from Analysts is the key

To tackle the problem of insufficient data size:
NLP + a base set of important keywords

Hence To Do List

  1. Generate / Find base set of Important keywords
  2. Feedback of Analysts
  3. Train and Enrich the base set
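
The to-do list above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `enrich_keywords`, the `min_count` threshold, the example phrases, and the rule "add words seen in several important changes and in no unimportant ones" are all hypothetical stand-ins for whatever NLP method is eventually chosen.

```python
from collections import Counter


def enrich_keywords(base_keywords, labeled_changes, min_count=2):
    """Return the base keywords plus terms suggested by analyst feedback.

    labeled_changes is an iterable of (changed_text, is_important) pairs.
    A word is added when it appears in at least min_count changes marked
    important and in no change marked unimportant.
    """
    important_counts = Counter()
    unimportant_words = set()
    for text, is_important in labeled_changes:
        words = set(text.lower().split())
        if is_important:
            important_counts.update(words)
        else:
            unimportant_words.update(words)
    new_terms = {
        word
        for word, count in important_counts.items()
        if count >= min_count and word not in unimportant_words
    }
    return set(base_keywords) | new_terms


# Step 1: a hypothetical base set. Step 2: hypothetical analyst feedback.
base = {"climate"}
feedback = [
    ("emissions data removed", True),
    ("emissions page removed", True),
    ("footer page updated", False),
]
# Step 3: enrich the base set from the feedback.
enriched = enrich_keywords(base, feedback)
```

With this toy feedback, "emissions" and "removed" are added, while "page" is excluded because it also shows up in an unimportant change.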

@thisisashukla

@dcwalk thanks for the chat time information. I have already joined the slack channel. My username is daas-ankur. I shall join the chat and put some ideas regarding my proposal. I'll also try to put up my proposal asap for the mentors to look and suggest. Looking forward to the chat session.

@mhucka
Member

mhucka commented Mar 25, 2017

In case it's useful to anyone, here's a link to some info about the PageFreezer API and data it returns: https://www.pagefreezer.com/development/

@stale

stale bot commented Jan 10, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

@stale stale bot added the stale label Jan 10, 2019
@stale stale bot closed this as completed Jan 17, 2019