Domain names treated as sentences #24
Hi, for this issue, and for real-world language in general, which is often cluttered with punctuation marks, I tried various tokenizers and was satisfied with how NLTK's TweetTokenizer works. I implemented it as follows: `from nltk.tokenize import TweetTokenizer, sent_tokenize`
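A minimal sketch of the TweetTokenizer approach described above (the sample sentence is my own; NLTK must be installed):

```python
from nltk.tokenize import TweetTokenizer

# TweetTokenizer keeps URL- and domain-like tokens intact instead of
# splitting them at the dots, which is the behaviour wanted here.
tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize("Search for it on www.google.com now!")
print(tokens)  # "www.google.com" stays a single token
```

A wordpunct-style tokenizer would instead split the domain into `www`, `.`, `google`, `.`, `com`.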
@nsehwan: I am open to any extension to the package as long as the following are met:
Even though it meets requirement (1), I think we should first generalize your simple solution so that everyone can use it before implementing it.
Thanks @csurfer for the information, working on your suggestions.
Sorry for my absence! `get_sanitized_word_list` is basically a function which takes individual sentences (segregated by `sent_tokenize`) as input and returns a list of words similar to what `wordpunct_tokenize(sentence)` returned previously, but better sanitized. `def get_sanitized_word_list(data):`
It works on most of the general cases I have tried so far, and better than TweetTokenizer as well. Please let me know what you think about this.
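The snippet above is truncated in the thread; as a rough, hypothetical reconstruction of what a `get_sanitized_word_list`-style helper could look like (the regex is my own simplification, not the author's code):

```python
import re

# Hypothetical sketch: tokenize one sentence into words, keeping
# domain-like tokens (e.g. www.google.com) whole. Domain-shaped tokens
# are matched first, then plain words, then runs of punctuation.
TOKEN_RE = re.compile(r"(?:\w+\.)+\w+|\w+|[^\w\s]+")

def get_sanitized_word_list(data):
    return TOKEN_RE.findall(data)

print(get_sanitized_word_list("Search on www.google.com now!"))
# ['Search', 'on', 'www.google.com', 'now', '!']
```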
If the text contains a domain name like www.google.com, then the parts of that name are extracted as separate words, e.g. the word "com".
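The reported behaviour can be reproduced without the package itself: NLTK's `wordpunct_tokenize` is equivalent to the regex below, which splits on every punctuation run.

```python
import re

# Pattern matching NLTK's WordPunctTokenizer: the dots in a domain
# name break it into separate "words".
print(re.findall(r"\w+|[^\w\s]+", "Visit www.google.com today"))
# ['Visit', 'www', '.', 'google', '.', 'com', 'today']
```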