Customize Stopwords #58

Open
nkmeyers opened this issue May 25, 2020 · 5 comments
Labels
enhancement New feature or request

Comments

nkmeyers (Collaborator) commented May 25, 2020

We need a way for users to customize the stopwords list, and/or swap in their own, for use by the various NLP processes that consult a stopwords list. @ericleasemorgan I think this enhancement relies on us writing some stopwords documentation: which processes read a stopwords list, where the default lives, what it contains, and how to customize or swap out the default list(s). Then we could also add some UI to the Airavata job input that lets a user indicate that they want to use a customized stopwords list and submit it along with their input target(s).

FYI, this ticket was spurred by @ralphlevan's test Pratchett Carrel.
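A minimal sketch of what "swap in their own" could look like, assuming a plain-text, one-word-per-line stopwords file; the file format, default list, and function name here are all hypothetical, not the carrel's actual implementation:

```python
# Hypothetical helper: extend or replace a default stopwords list with a
# user-supplied file (one word per line, "#" comment lines allowed).
from pathlib import Path

DEFAULT_STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def load_stopwords(custom_file=None, replace=False):
    """Return the stopword set an NLP process should consult."""
    custom = set()
    if custom_file:
        custom = {
            line.strip().lower()
            for line in Path(custom_file).read_text().splitlines()
            if line.strip() and not line.lstrip().startswith("#")
        }
    if replace and custom:
        return custom                      # user list fully replaces the default
    return DEFAULT_STOPWORDS | custom      # otherwise it extends the default
```

With something like this, the job input UI would only need to pass a file path and an extend-vs-replace flag alongside the input target(s).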

nkmeyers added the enhancement label May 25, 2020
ralphlevan (Collaborator)

I think we need more than just stopwords. The huge number of "he said" and "she said" occurrences suggests that we may want some n-gram exclusion patterns. That would be a nice thing for users to add as part of an iterative refinement of their product.
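As a rough illustration of what n-gram exclusion could mean here (the pattern list and function are invented for this sketch), excluded phrases could be stripped from the text before any frequency counting:

```python
# Illustrative n-gram exclusion: strip user-specified phrases such as
# "he said" / "she said" before downstream frequency counting.
import re

EXCLUDED_NGRAMS = ["he said", "she said"]

def remove_ngrams(text, patterns=EXCLUDED_NGRAMS):
    for phrase in patterns:
        # \b anchors keep "she said" from matching inside longer words
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", text,
                      flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
```

Users could then grow their own pattern list iteratively, the same way they would grow a custom stopwords list.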

molikd (Collaborator) commented May 26, 2020

Could we automate stopwords? Open question. The 1,000 most common words, for sure, but proper nouns are very difficult.

The problem is that you can only develop so many stopwords based on frequency of occurrence; at some point you have to spill over into entity detection.
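The frequency-based part is easy to sketch: treat words that appear in nearly every document as stopword candidates. The threshold and the naive whitespace tokenization below are assumptions to tune, not a finished method:

```python
# Sketch: propose stopword candidates by document frequency. Words that
# occur in >= threshold of all documents carry little discriminating power.
from collections import Counter

def stopword_candidates(documents, threshold=0.9):
    df = Counter()
    for doc in documents:
        df.update({w.lower() for w in doc.split()})  # naive tokenization
    n = len(documents)
    return {word for word, count in df.items() if count / n >= threshold}
```

This is exactly where the approach runs out: "Elsevier" may clear the threshold in a corpus of journal articles, but nothing in document frequency says *why* it should be removed.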

Imagine some kind of machine-learning algorithm trained to find stopwords in a corpus of documents. The algorithm starts with the most common words, then maybe a dictionary, but you still get "Elsevier" in your results. So you add "Elsevier" to the dictionary, but then you get "PLoS," so you add that... and so on. The problem is that you, the user, are deciding to add these words to the list. You know that Elsevier and PLoS are publishers, but how does the machine-learning algorithm know to get rid of publisher proper nouns and not others like "Covid-19" or "Manhattan Plot"?

Could we go to the context in which the word sits? That's how we, as humans, know to get rid of a word, but you'd probably have to rely on some neural-net black box. Better to rely on some kind of explainable statistical model. So, is there something about stopwords that tells them apart from other words? Maybe. Among a stopword's associated words there may be some signal. Say we only care about a stopword if it is causing some problem in clustering or factorization. We could build a cluster of associated words for each of the words that are separating clusters/topics, and then test those associated words for relevance. If we assume that all words separating clusters will have similar associated words, then a word whose associates diverge (on some threshold of similarity) is no longer talking about the same things, so we remove it. It'd basically be training an LDA topic modeler not to be perfect, which I think is what we want.
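One way to make the "associated words" test concrete without a neural black box: build a co-occurrence (context) vector for each candidate word and compare candidates with cosine similarity, flagging those whose contexts diverge from their peers. Window size, tokenization, and any similarity threshold here are assumptions for illustration only:

```python
# Sketch: compare words by the company they keep. Each word gets a bag of
# neighboring tokens (within a fixed window); cosine similarity between two
# such bags measures how interchangeable the words' contexts are.
from collections import Counter
from math import sqrt

def context_vector(word, documents, window=2):
    vec = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for i, t in enumerate(tokens):
            if t == word:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        vec[tokens[j]] += 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

The statistics stay inspectable: you can print the two context vectors and see exactly why two candidates were judged similar or divergent.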

ericleasemorgan (Owner) commented May 26, 2020 via email

ralphlevan (Collaborator)

I absolutely understand and think I understand the priorities. Get it working first! Lipstick later.

ericleasemorgan (Owner)

Who are "the users"?
