What do we need for Machine Learning? #21
This is of high interest to our incoming GSoC applicants, so if we can move on this, that would be great.
Definition of Terms
Goal
Analyze ~10^5 Changes. Filter out unimportant ones and prioritize important ones so that human analysts can be directed to the most important changes.
Expected API
As outlined in #7, a data processing function is expected to take in an unordered collection of
The diff content itself is a dictionary (mirroring the JSON blobs returned by PageFreezer) that includes the full text of the page before, the full text after, chunked changes, and some statistics.
The function
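For illustration only, here is a minimal Python sketch of what a data processing function of this shape could look like. The key names (`text_before`, `text_after`, `changes`), the function names, and the scoring rule are all assumptions made for the example; they are not the actual PageFreezer schema or an agreed API.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Hypothetical diff shape: a dict with the full text before and after,
# plus a list of changed text chunks. Key names are assumed, not PageFreezer's.
Diff = Dict[str, Any]


def score_diff(diff: Diff) -> float:
    """Toy priority score: changed characters relative to page length."""
    before = diff.get("text_before", "")
    after = diff.get("text_after", "")
    changed = sum(len(chunk) for chunk in diff.get("changes", []))
    denominator = max(len(before), len(after), 1)  # avoid division by zero
    return changed / denominator


def prioritize(diffs: Iterable[Diff], threshold: float = 0.0) -> List[Tuple[float, Diff]]:
    """Drop diffs scoring at or below `threshold`; return the rest, highest score first."""
    scored = [(score_diff(d), d) for d in diffs]
    kept = [(s, d) for s, d in scored if s > threshold]
    # Sort on the score only, since dicts themselves are not orderable.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept
```

The input is treated as an unordered iterable, so the function imposes its own ordering. A real model would replace `score_diff`; the filter-then-rank structure is the part that matches the goal stated above.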
I reframed this as a question for now, as I'm not sure we will want to be tracking this as a "meta-issue". Also @danielballan's definitions are awesome and I think we want to have that upstreamed into the main web monitoring repo? Please adjust as needed.
I will have input once you have a few ML training examples. We need some positive and negative instances of alarming changes, e.g. a major messaging change, or a column of data changed or removed. We can potentially create some of these synthetically and bootstrap from there.
@jaclynweiser Sounds good. I'm focused on getting a minimal deployment up, in coordination with the Rails app, where we can start pulling in a small amount of data.
Some of the conversations we had during the event were around trying to capture significant keywords and assigning weights to a change based on those. Someone (maybe @dcwalk?) also noted that we don't have a static weights distribution. In other words, the set of important keywords may shift over time, so it's important to keep collecting feedback and retraining whatever model you use.
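As a rough sketch of the keyword-weighting idea (not an agreed design), the snippet below scores a change by how much the counts of weighted keywords differ between the two versions, and nudges the weights when an analyst gives feedback. The function names, the tokenizer, and the additive update rule are all assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def keyword_score(text_before: str, text_after: str, weights: Dict[str, float]) -> float:
    """Sum of weight * |count change| over the weighted keywords."""
    before = Counter(re.findall(r"[a-z]+", text_before.lower()))
    after = Counter(re.findall(r"[a-z]+", text_after.lower()))
    return sum(w * abs(after[word] - before[word]) for word, w in weights.items())


def update_weights(
    weights: Dict[str, float],
    changed_words: Iterable[str],
    important: bool,
    rate: float = 0.1,
) -> Dict[str, float]:
    """Return new weights nudged by analyst feedback (hypothetical rule).

    Words in a change marked important gain `rate`; words in a change
    marked unimportant lose `rate`, floored at zero.
    """
    new_weights = dict(weights)
    step = rate if important else -rate
    for word in changed_words:
        new_weights[word] = max(0.0, new_weights.get(word, 0.0) + step)
    return new_weights
```

Because `update_weights` returns a new dictionary on each piece of feedback, the keyword set can drift over time instead of being fixed up front, which is the non-static behaviour described above.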
Wasn't me (I'm not generally that coherent)! @jaclynweiser -- we have "example data" here for some of those categories, but I think not enough: https://github.com/edgi-govdata-archiving/web-monitoring/tree/master/example-data Could you give us (okay--me) some sense of how many changes you'd like to see to begin training? We are hoping to have a fairly robust training set for people to play around with, and any guidance on that would help me wrap my head around it a little more!
We also have this issue: edgi-govdata-archiving/web-monitoring#6 so if @aleatha or @jaclynweiser or others have advice on how to set up the initial data collection so we get meaningful results, please drop it there!!!
Sir,
@dcwalk My understanding is that ~50 is generally the bare minimum for even playing around with the data, so boosting the sample-data folder to 50-100 curated examples would be a start. To train something we could actually use, my rough sense is that ~thousands are needed, likely depending on how rare true positives are. (I defer to more experienced ML practitioners here; I'm just trying to give a rough sense....) Perhaps that's a moot point: in production, we will simply use all the labeled data we have: ~tens of diffs of 25k pages.
@maylad31 That's a great point. We're still setting things up (gathering our data into a distributable form) and making sure we have clear internal agreement about the task. Engaging new contributors is Priority #1, and we'll post guidance in this README or in the README of the web-monitoring repo when we have it.
Yes, but how do I go about preparing a proposal for GSOC? Please help. Thanks.
Please take GSOC questions to the #gsoc channel of the Archivers slack. If you're not on that yet, you can obtain an invitation here.
Just tracking whether we have pulled out the relevant issues here:
I am going to try tagging all the ML stuff as ML; that might be helpful to orient newcomers?
Hi all. As part of my college course on Machine Learning, I have picked up some hints about what we can use to learn patterns and changes in documents. Maybe kernels can be helpful in the task. Should I work more on it and make an abstract?
@daas-ankur-shukla We're still getting our act together, not quite ready to comment on the deluge of ideas coming from various channels. Information will be available in the #gsoc slack channel as we have it.
@danielballan What should the course of action be for aspiring students until then?
@danielballan, I believe aspiring students need to discuss the project ideas with your team and get a better understanding of the problem at hand before preparing a good proposal.
@janakrajchadha and @daas-ankur-shukla we have an upcoming session for students on Fri, Mar 24 @ 8:00 pm EST, if you aren't already on our chat in the We're making ourselves available in the chat to talk through proposal ideas with students in advance of the submissions opening. Also, we recommended on our ideas page to get a proposal idea up early and iterate on it with feedback. Google Summer of Code and proposal questions are better suited for the chat than our GitHub issues in various repositories.
@dcwalk Thanks for the information.
Summary of yesterday's Slack conversation
Along with @b5 @janakrajchadha
Brief
Solution
To tackle problem of Insufficient data quantum/size
Hence
To Do List
@dcwalk thanks for the chat time information. I have already joined the slack channel. My username is daas-ankur. I shall join the chat and put some ideas regarding my proposal. I'll also try to put up my proposal asap for the mentors to look and suggest. Looking forward to the chat session.
In case it's useful to anyone, here's a link to some info about the PageFreezer API and data it returns: https://www.pagefreezer.com/development/
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.
I hope someone can write a better description of what we need from a machine-learning component in this repository. Please feel free to edit this description directly.