Domain names treated as sentences #24
Hi, for this issue, and for real-world language in general, which is often cluttered with punctuation marks, I tried various tokenizers and was satisfied with how NLTK's TweetTokenizer works. I implemented it as follows: `from nltk.tokenize import TweetTokenizer, sent_tokenize`
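A minimal sketch of the TweetTokenizer approach described above (the sample sentence is my own; NLTK must be installed):

```python
from nltk.tokenize import TweetTokenizer

# TweetTokenizer keeps URL- and domain-like tokens intact instead of
# splitting them at the dots, which is the behaviour wanted here.
tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize("Search for it on www.google.com now!")
print(tokens)  # "www.google.com" stays a single token
```

A wordpunct-style tokenizer would instead split the domain into `www`, `.`, `google`, `.`, `com`.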
@nsehwan: I am open to any extension to the package as long as the following are met:
Even though it meets requirement (1), I think we should first generalize your simple solution so that everyone can use it before implementing it.
Thanks @csurfer for the information, working on your suggestions.
Sorry for my absence! `get_sanitized_word_list` is basically a function which takes individual sentences (segregated by `sent_tokenize`) as input and returns a list of words similar to what `wordpunct_tokenize(sentence)` returned previously, but better sanitized. `def get_sanitized_word_list(data):`
It works on most of the general cases I have tried so far, and better than TweetTokenizer as well. Please let me know what you think about this.
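The snippet above is truncated in the thread; as a rough, hypothetical reconstruction of what a `get_sanitized_word_list`-style helper could look like (the regex is my own simplification, not the author's code):

```python
import re

# Hypothetical sketch: tokenize one sentence into words, keeping
# domain-like tokens (e.g. www.google.com) whole. Domain-shaped tokens
# are matched first, then plain words, then runs of punctuation.
TOKEN_RE = re.compile(r"(?:\w+\.)+\w+|\w+|[^\w\s]+")

def get_sanitized_word_list(data):
    return TOKEN_RE.findall(data)

print(get_sanitized_word_list("Search on www.google.com now!"))
# ['Search', 'on', 'www.google.com', 'now', '!']
```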
If the text contains a domain name like www.google.com, then the parts of that name are extracted as separate words, e.g. the word "com".
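The reported behaviour can be reproduced without the package itself: NLTK's `wordpunct_tokenize` is equivalent to the regex below, which splits on every punctuation run.

```python
import re

# Pattern matching NLTK's WordPunctTokenizer: the dots in a domain
# name break it into separate "words".
print(re.findall(r"\w+|[^\w\s]+", "Visit www.google.com today"))
# ['Visit', 'www', '.', 'google', '.', 'com', 'today']
```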