Customize Stopwords #58

Open
nkmeyers opened this issue May 25, 2020 · 5 comments
Labels
enhancement New feature or request

Comments

nkmeyers (Collaborator) commented May 25, 2020

We need a way for users to customize the stopwords list, and/or swap in their own, for use by the various NLP processes that consult a stopwords list. @ericleasemorgan I think this enhancement relies on us writing some stopwords documentation: which processes read a stopwords list, where the default lives, what it contains, and how to customize or swap out the default list(s). Then we could also add some UI to the Airavata job input that lets a user indicate that they want to use a customized stopwords list and submit it along with their input target(s).

FYI, this ticket was spurred by @ralphlevan's test Pratchett Carrel.
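A minimal sketch of what "swap in their own" could look like, assuming a plain-text, one-word-per-line stopwords file; the file format, default list, and function name here are all hypothetical, not the carrel's actual implementation:

```python
# Hypothetical helper: extend or replace a default stopwords list with a
# user-supplied file (one word per line, "#" comment lines allowed).
from pathlib import Path

DEFAULT_STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def load_stopwords(custom_file=None, replace=False):
    """Return the stopword set an NLP process should consult."""
    custom = set()
    if custom_file:
        custom = {
            line.strip().lower()
            for line in Path(custom_file).read_text().splitlines()
            if line.strip() and not line.lstrip().startswith("#")
        }
    if replace and custom:
        return custom                      # user list fully replaces the default
    return DEFAULT_STOPWORDS | custom      # otherwise it extends the default
```

With something like this, the job input UI would only need to pass a file path and an extend-vs-replace flag alongside the input target(s).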

nkmeyers added the enhancement label May 25, 2020
ralphlevan (Collaborator)

I think we need more than just stopwords. The huge number of "he said" and "she said" occurrences suggests that we may want some n-gram exclusion patterns. That would be a nice thing for users to add as part of an iterative refinement of their product.
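As a rough illustration of what n-gram exclusion could mean here (the pattern list and function are invented for this sketch), excluded phrases could be stripped from the text before any frequency counting:

```python
# Illustrative n-gram exclusion: strip user-specified phrases such as
# "he said" / "she said" before downstream frequency counting.
import re

EXCLUDED_NGRAMS = ["he said", "she said"]

def remove_ngrams(text, patterns=EXCLUDED_NGRAMS):
    for phrase in patterns:
        # \b anchors keep "she said" from matching inside longer words
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", text,
                      flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
```

Users could then grow their own pattern list iteratively, the same way they would grow a custom stopwords list.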

molikd (Collaborator) commented May 26, 2020

Could we automate stopwords? Open question. The 1,000 most common words, for sure, but proper nouns are very difficult.

The problem is that you can only develop so many stopwords based on frequency of occurrence; at some point you have to spill over into entity detection.
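The frequency-based part is easy to sketch: treat words that appear in nearly every document as stopword candidates. The threshold and the naive whitespace tokenization below are assumptions to tune, not a finished method:

```python
# Sketch: propose stopword candidates by document frequency. Words that
# occur in >= threshold of all documents carry little discriminating power.
from collections import Counter

def stopword_candidates(documents, threshold=0.9):
    df = Counter()
    for doc in documents:
        df.update({w.lower() for w in doc.split()})  # naive tokenization
    n = len(documents)
    return {word for word, count in df.items() if count / n >= threshold}
```

This is exactly where the approach runs out: "Elsevier" may clear the threshold in a corpus of journal articles, but nothing in document frequency says *why* it should be removed.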

Imagine some kind of machine-learning algorithm trained to find stopwords in a corpus of documents. The algorithm starts with the most common words, then maybe a dictionary, but you still get "Elsevier" in your results. So you add "Elsevier" to the dictionary, but then you get "PLoS," so you add that... and so on. The problem is that you, the user, are deciding to add these words to the list. You know that Elsevier and PLoS are publishers, but how does the machine-learning algorithm know to get rid of publisher proper nouns and not others like "Covid-19" or "Manhattan Plot"?

Could we go to the context in which the word sits? That's how we, as humans, know to get rid of a word, but you'd probably have to rely on some neural-net black box. Better to rely on some kind of explainable statistical model. So, is there something about stopwords that tells them apart from other words? Maybe. Among a stopword's associated words there may be some signal. Say we only care about a stopword if it is causing some problem in clustering or factorization. We could build a cluster of associated words for each of the words that are separating clusters/topics, and then test those associated words for relevance. If we assume that all words separating clusters will have similar associated words, then a word whose associates diverge (on some threshold of similarity) is no longer talking about the same things, so we remove it. It'd basically be training an LDA topic modeler not to be perfect, which I think is what we want.
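One way to make the "associated words" test concrete without a neural black box: build a co-occurrence (context) vector for each candidate word and compare candidates with cosine similarity, flagging those whose contexts diverge from their peers. Window size, tokenization, and any similarity threshold here are assumptions for illustration only:

```python
# Sketch: compare words by the company they keep. Each word gets a bag of
# neighboring tokens (within a fixed window); cosine similarity between two
# such bags measures how interchangeable the words' contexts are.
from collections import Counter
from math import sqrt

def context_vector(word, documents, window=2):
    vec = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for i, t in enumerate(tokens):
            if t == word:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        vec[tokens[j]] += 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

The statistics stay inspectable: you can print the two context vectors and see exactly why two candidates were judged similar or divergent.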

ericleasemorgan (Owner) commented May 26, 2020 via email

ralphlevan (Collaborator)

I absolutely understand and think I understand the priorities. Get it working first! Lipstick later.

ericleasemorgan (Owner)

Who are "the users"?
