
After segmenting with jieba, I only want to extract the Chinese tokens without any punctuation. How do I handle this in Python? Thanks #528

Open
tianke0711 opened this issue Sep 27, 2017 · 4 comments

Comments

@tianke0711

No description provided.

@cbzhuang

I handled it with a regular expression: new_sentence = re.sub(r'[^\u4e00-\u9fa5]', ' ', old_sentence), and then ran the segmentation. \u4e00-\u9fa5 is the Unicode code-point range for Chinese characters.
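A minimal runnable sketch of this pre-filter approach (the sample sentence is made up; after the substitution you would pass `new_sentence` to `jieba.cut`, which is omitted here so the snippet needs only the standard library):

```python
import re

old_sentence = "我想用jieba分词,不要标点符号!"  # hypothetical input

# Replace every character outside the CJK range U+4E00..U+9FA5 with a space,
# so punctuation and Latin letters disappear before segmentation.
new_sentence = re.sub(r'[^\u4e00-\u9fa5]', ' ', old_sentence)

print(new_sentence)  # only Chinese characters and spaces remain
# Afterwards you would segment with: words = jieba.cut(new_sentence)
```

Replacing with a space (rather than deleting outright) keeps the removed characters acting as word boundaries, which can prevent unrelated characters on either side of punctuation from being glued together.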

@tianke0711
Author

@cbzhuang Thank you very much for the reply! I used this, though I'm not sure it's correct. #169

@kn45

kn45 commented Aug 7, 2018

Actually, CJK characters are encoded together so there's no critical range for Chinese characters. A punctuation dict could be used to do the filtering.
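As a sketch of that alternative, one can filter tokens after segmentation by their Unicode category instead of relying on a code-point range (the token list below is a hypothetical jieba output, so the example stays self-contained; a real punctuation dict could replace the category check):

```python
import unicodedata

# Hypothetical tokens as jieba.lcut might return them
tokens = ["我", "想", "用", "jieba", "分词", ",", "不要", "标点符号", "!"]

def is_punct(tok):
    # True if every character is Unicode punctuation (P*) or a symbol (S*)
    return all(unicodedata.category(ch).startswith(("P", "S")) for ch in tok)

filtered = [t for t in tokens if not is_punct(t)]
print(filtered)  # ['我', '想', '用', 'jieba', '分词', '不要', '标点符号']
```

Note that unlike the regex pre-filter, this keeps non-Chinese words such as "jieba", which matches the point above: it drops punctuation rather than everything outside one character range.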

@Zhya1124

@cbzhuang Nice, but you typed an extra space inside that ' ', didn't you? It should be new_sentence = re.sub(r'[^\u4e00-\u9fa5]', '', old_sentence)
