Add More Stopword Lists #20

jkterry1 · 2017-07-27T01:07:01Z

After the current round of PRs is worked out, we should build in more stop words. I vote for adding all the ones here, along with any others asked for: http://www.ranks.nl/stopwords . Also @fabianvf, one of these is what I used as a test file; think that'll cause a problem?

fabianvf · 2017-07-31T15:55:13Z

hmm, wonder if there's a good way we can pull those down and cache them if they're requested, rather than adding them all to the repository. Or just generally adding the ability to pull a stopwords list from a url...

jkterry1 · 2017-07-31T19:45:59Z

I thought about that, but in sum they'll just be a few kilobytes, so I don't think it's worth adding that. Regarding adding URLs, we can, but it doesn't seem worth it since the list would have to be in a plain-text (txt-style) format (which is rare to find online in my experience) and users really should save it locally anyway.


fabianvf · 2017-07-31T20:07:24Z

Well, if you went the URL route I'd thought you'd provide a URL and separation regex, so like

RAKE.load_stopwords('http://example.com/beststopwords', re.compile('super-cool-regex'))

so it wouldn't matter how they formatted it so long as it was a list of some kind. Just feel like it would be convenient, especially if you were just hacking/prototyping and wanted to experiment with different stoplists, without requiring you to download/format them manually.
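A rough sketch of what that could look like, using only the standard library. The names `split_stopwords` and `load_stopwords_from_url` are hypothetical and not part of python-rake today, and the default separator regex is just an assumption about what a typical list looks like:

```python
import re
import urllib.request


def split_stopwords(text, separator):
    """Split raw text on a compiled regex and return unique lowercase words."""
    words = []
    for token in separator.split(text):
        word = token.strip().lower()
        # Skip empty fragments and keep only the first occurrence of each word.
        if word and word not in words:
            words.append(word)
    return words


def load_stopwords_from_url(url, separator=re.compile(r"[\s,]+"), timeout=10):
    """Hypothetical helper: download a stopword list and split it with `separator`."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        text = response.read().decode(charset)
    return split_stopwords(text, separator)
```

Keeping the splitting separate from the download means the regex handling can be tried on a pasted string without touching the network. This only covers pages that already serve the list as plain text; it does no HTML parsing and no caching.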

jkterry1 · 2017-08-01T02:20:10Z

Interesting. You may be right that that's a useful feature and I just don't see it, but I've never seen anyone who wanted to do that as a data scientist. Also it'd require more than just a regex for the vast majority of sites; it'd require playing around in Beautiful Soup or something too. The way I've seen everyone do it, because it's always been the fastest, is to copy and paste into IPython and write a quick for loop.
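For reference, the copy-and-paste approach is roughly the following. The pasted words are a made-up sample for illustration, not an actual list from any site:

```python
# Text copied from a web page and pasted into the session (sample words only).
pasted = """
a
about
above
"""

stopwords = []
for line in pasted.splitlines():
    word = line.strip()
    # Ignore the blank lines left over from the paste.
    if word:
        stopwords.append(word)
```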

fabianvf · 2017-08-01T14:34:50Z

It looks like this project has amassed a large group of stopwords lists from a variety of sources, do you think we could leverage this work?
https://github.com/igorbrigadir/stopwords

jkterry1 · 2017-08-08T09:14:29Z

For posterity's sake:

Hi Justin,

Thanks for asking.
Yes you can use our stopword lists if you credit 'ranks.nl'

Does your script work with HTML documents or text without markup only?

If HTML, I'm curious if you've had a chance to test the results from the Page Analyzer tool on ranks.nl?
It is basically a tool for Automatic Keyword Extraction from Individual HTML Documents.

Kind regards,
Damian Doyle
Ranks NL

On Tue, Aug 1, 2017 at 10:02 PM, Justin Terry justinkterry@gmail.com wrote:
Hello, I'm working on an MIT licensed open source natural language processing tool in python: https://github.com/fabianvf/python-rake

Can I include your stop word lists into the package by default if I credit you?

--
Thank you for your time,
Justin Terry

jkterry1 · 2017-09-11T22:15:09Z

@fabianvf please close this, I fixed this in my last PR that you merged and forgot to mention it.

jkterry1 · 2017-09-12T00:08:54Z

Never mind, apparently I can now.

jkterry1 mentioned this issue Aug 1, 2017

Fixes My Things #22

Merged

fabianvf added the enhancement label Aug 31, 2017

jkterry1 closed this as completed Sep 12, 2017
