Add More Stopword Lists #20
Comments
hmm, wonder if there's a good way we can pull those down and cache them if they're requested, rather than adding them all to the repository. Or just generally adding the ability to pull a stopwords list from a url... |
I thought about that, but in sum they'll just be a few kilobytes, so I
don't think it's worth adding that. Regarding adding URLs, we can, but it
doesn't seem worth it since the list would have to be in a plain-text (txt)
format (which is rare to find online in my experience) and users really
should save it locally anyway.
(Sent by email in reply to Fabian von Feilitzsch's comment above, Jul 31, 2017 11:55 AM.)
|
Well, if you went the URL route I'd think you'd provide a URL and a separator regex, e.g. RAKE.load_stopwords('http://example.com/beststopwords', re.compile('super-cool-regex')), so it wouldn't matter how they formatted it so long as it was a list of some kind. It just feels like it would be convenient, especially if you were hacking/prototyping and wanted to experiment with different stoplists without having to download and format them manually. |
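To make the proposal above concrete, here is a minimal sketch of a URL loader with a separator regex and an on-disk cache. It is not part of the package: the names `split_stopwords` and `load_stopwords_from_url`, the `cache_dir` parameter, and the cache layout are all hypothetical, and it assumes the URL serves plain text rather than HTML.

```python
import hashlib
import re
import tempfile
import urllib.request
from pathlib import Path


def split_stopwords(text, separator=re.compile(r"\s+")):
    """Split raw text on `separator`; return lowercased words, de-duplicated in order."""
    seen = set()
    words = []
    for token in separator.split(text):
        word = token.strip().lower()
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words


def load_stopwords_from_url(url, separator=re.compile(r"\s+"), cache_dir=None):
    """Hypothetical helper: fetch a stopword list from `url`, caching the raw text.

    Assumes the response body is plain UTF-8 text, not an HTML page.
    """
    cache_dir = Path(cache_dir or Path(tempfile.gettempdir()) / "stopword_cache")
    cache_dir.mkdir(parents=True, exist_ok=True)
    # One cache file per URL, named by the hash of the URL.
    cache_file = cache_dir / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".txt")
    if cache_file.exists():
        text = cache_file.read_text(encoding="utf-8")
    else:
        with urllib.request.urlopen(url) as response:
            text = response.read().decode("utf-8")
        cache_file.write_text(text, encoding="utf-8")
    return split_stopwords(text, separator)
```

Caching the raw text rather than the parsed list means the same download can be re-split with a different regex while experimenting. A list embedded in an HTML page would still need extra extraction before this would work.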
Interesting. You may be right that it's a useful feature and I just don't see it, but as a data scientist I've never seen anyone who wanted to do that. Also, it'd require more than just a regex for the vast majority of sites--it'd require playing around in Beautiful Soup or something too. The way I've seen everyone do it, because it's always been the fastest, is to copy and paste the list into IPython and run a quick for loop. |
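The copy-and-paste workflow described above might look roughly like the following. This is only a sketch: `normalize_pasted` and `save_stopwords` are hypothetical helper names, and the one-word-per-line output file is an assumption about what a locally saved stoplist should look like.

```python
import tempfile
from pathlib import Path


def normalize_pasted(pasted_text):
    """Turn text pasted from a web page into a clean, ordered list of stopwords."""
    words = []
    for line in pasted_text.splitlines():
        for word in line.split():
            word = word.lower()
            if word not in words:
                words.append(word)
    return words


def save_stopwords(words, out_path):
    """Write one stopword per line (assumed format for a local stoplist file)."""
    Path(out_path).write_text("\n".join(words) + "\n", encoding="utf-8")


if __name__ == "__main__":
    pasted = """
    a    about   above
    after again  A
    """
    out_file = Path(tempfile.gettempdir()) / "my_stoplist.txt"
    save_stopwords(normalize_pasted(pasted), out_file)
    print(out_file.read_text(encoding="utf-8"))
```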
It looks like this project has amassed a large group of stopword lists from a variety of sources. Do you think we could leverage this work? |
For posterity's sake:
Hi Justin,
Thanks for asking. Does your script work with HTML documents, or only with text without markup? If HTML, I'm curious whether you've had a chance to test the results from the Page Analyzer tool on ranks.nl?
Kind regards,
On Tue, Aug 1, 2017 at 10:02 PM, Justin Terry justinkterry@gmail.com wrote:
Can I include your stop word lists in the package by default if I credit you? |
@fabianvf please close this, I fixed this in my last PR that you merged and forgot to mention it. |
Never mind, apparently I can now.
After the current round of PRs is worked out, we should build in more stopword lists. I vote for adding all the ones here, along with any others that are asked for: http://www.ranks.nl/stopwords . Also, @fabianvf, one of these is what I used as a test file; do you think that'll cause a problem?