Add More Stopword Lists #20

Closed
jkterry1 opened this issue Jul 27, 2017 · 8 comments

Comments

@jkterry1
Collaborator

After the current round of PRs is worked out, we should build in more stop words. I vote for adding all the ones here, along with any others people ask for: http://www.ranks.nl/stopwords . Also, @fabianvf, one of these is what I used as a test file. Do you think that'll cause a problem?

@fabianvf
Owner

hmm, wonder if there's a good way we can pull those down and cache them if they're requested, rather than adding them all to the repository. Or just generally adding the ability to pull a stopwords list from a url...
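
A minimal sketch of the download-and-cache idea, assuming a plain-text list with one stopword per line. The names `fetch_text` and `cached_stopwords` and the cache layout are hypothetical and are not part of python-rake:

```python
import os
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Download a URL and return its body decoded as UTF-8 text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def cached_stopwords(name, url, cache_dir, fetch=fetch_text):
    """Return the stopword list `name`, downloading it only on first use.

    Hypothetical helper: the list is stored as <cache_dir>/<name>.txt and
    later calls read that file instead of hitting the network.
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, name + ".txt")
    if not os.path.exists(path):
        # First request for this list: fetch it and write it to the cache.
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(fetch(url))
    with open(path, encoding="utf-8") as handle:
        # One stopword per line; blank lines are skipped.
        return [line.strip() for line in handle if line.strip()]
```

Passing `fetch` as a parameter keeps the network call swappable, so the caching logic can be exercised offline.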

@jkterry1
Collaborator Author

jkterry1 commented Jul 31, 2017 via email

@fabianvf
Owner

Well, if you went the URL route, I'd imagined you'd provide a URL and a separator regex, like so:

RAKE.load_stopwords('http://example.com/beststopwords', re.compile('super-cool-regex'))

so it wouldn't matter how they formatted it so long as it was a list of some kind. Just feel like it would be convenient, especially if you were just hacking/prototyping and wanted to experiment with different stoplists, without requiring you to download/format them manually.
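
One way the URL-plus-regex idea could look, as a rough sketch only. `split_stopwords` and `load_stopwords_from_url` are hypothetical names, not an existing python-rake API, and the sketch assumes the URL serves plain text rather than an HTML page:

```python
import re
from urllib.request import urlopen


def split_stopwords(text, separator):
    """Split raw text into lowercase stopwords using a compiled regex."""
    words = (word.strip().lower() for word in separator.split(text))
    # Drop empty strings left by leading or trailing separators.
    return [word for word in words if word]


def load_stopwords_from_url(url, separator=re.compile(r"\s+"), timeout=10):
    """Download a plain-text stoplist and split it with `separator`.

    Assumption: the response body is UTF-8 text with no markup.
    """
    with urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8")
    return split_stopwords(text, separator)
```

With this shape, a comma-separated list and a newline-separated list differ only in the regex passed in, for example `re.compile(r"[,;\s]+")`.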

@jkterry1
Collaborator Author

jkterry1 commented Aug 1, 2017

Interesting. You may be right that it's a useful feature and I'm just not seeing it, but I've never seen a data scientist who wanted to do that. Also, it'd require more than just a regex for the vast majority of sites: it'd require playing around in Beautiful Soup or something too. The way I've seen everyone do it, because it's always been the fastest, is to copy and paste the list into IPython and write a quick for loop.
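
For illustration, the copy-and-paste approach might look roughly like this. The pasted words are placeholder sample data, not taken from any particular list:

```python
# Text pasted from a web page, one stopword per line (sample data only).
pasted = """
a
about
Above
about
"""

stopwords = []
for line in pasted.splitlines():
    word = line.strip().lower()
    # Skip blank lines and duplicates while keeping the original order.
    if word and word not in stopwords:
        stopwords.append(word)
```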

@fabianvf
Owner

fabianvf commented Aug 1, 2017

It looks like this project has amassed a large group of stopword lists from a variety of sources. Do you think we could leverage this work?
https://github.com/igorbrigadir/stopwords

@jkterry1 jkterry1 mentioned this issue Aug 1, 2017
@jkterry1
Collaborator Author

jkterry1 commented Aug 8, 2017

For posterity's sake:

Hi Justin,

Thanks for asking.
Yes you can use our stopword lists if you credit 'ranks.nl'

Does your script work with HTML documents or text without markup only ?

If HTML, I'm curious if you've had a chance to test the results from the Page Analyzer tool on ranks.nl ?
It is basically a tool for Automatic Keyword Extraction from Individual HTML Documents.

Kind regards,
Damian Doyle
Ranks NL

On Tue, Aug 1, 2017 at 10:02 PM, Justin Terry justinkterry@gmail.com wrote:
Hello, I'm working on an MIT licensed open source natural language processing tool in python: https://github.com/fabianvf/python-rake

Can I include your stop word lists into the package by default if I credit you?

--
Thank you for your time,
Justin Terry

@jkterry1
Collaborator Author

@fabianvf please close this. I fixed it in my last PR, which you merged, and forgot to mention it.

@jkterry1
Collaborator Author

Never mind, apparently I can now.
