Commit

Update __init__.py
fix:fxsjy#300
When calling jieba.cut(sentence, HMM=False), words that mix Chinese and English characters with whitespace can now also be output.
userdict:
Edu Trust认证 2000
jieba.cut("我通过了Edu Trust认证",HMM=False)
output:我, 通过, 了, Edu Trust认证
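
The effect of the pattern change can be checked without jieba itself. The sketch below copies the old and new re_han_default patterns from this diff and splits the sample sentence with each. The helper name han_blocks is hypothetical, and the assumption that jieba splits a sentence into blocks with this pattern before any dictionary lookup is only an approximation of what cut() does.

```python
import re

# Patterns copied from this commit's diff: before and after the change.
re_han_old = re.compile(r"([\u4E00-\u9FD5a-zA-Z0-9+#&\._]+)", re.U)
re_han_new = re.compile(r"([\u4E00-\u9FD5a-zA-Z0-9+#&\._\-\: ]+)", re.U)

sentence = "我通过了Edu Trust认证"


def han_blocks(pattern, text):
    """Hypothetical helper: split text on a capturing pattern, drop empty pieces."""
    return [block for block in pattern.split(text) if block]


# Old pattern: the space is not in the character class, so it ends the block
# and "Edu Trust认证" can never reach the dictionary lookup as one candidate.
old_blocks = han_blocks(re_han_old, sentence)

# New pattern: space, '-' and ':' are part of the class, so the whole
# sentence stays in a single block.
new_blocks = han_blocks(re_han_new, sentence)

print(old_blocks)
print(new_blocks)
```

With the old pattern the sentence is broken at the space into three pieces; with the new one it stays in one block, which is what allows a user-dictionary entry such as "Edu Trust认证" to match.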
cavonchen committed May 25, 2016
1 parent 0243d56 commit 4d99963
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions jieba/__init__.py
@@ -38,9 +38,9 @@

 re_eng = re.compile('[a-zA-Z0-9]', re.U)

-# \u4E00-\u9FD5a-zA-Z0-9+#&\._ : All non-space characters. Will be handled with re_han
+# \u4E00-\u9FD5a-zA-Z0-9+#&\._ : words and whitespace characters and ':' and '-'. Will be handled with re_han
 # \r\n|\s : whitespace characters. Will not be handled.
-re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._]+)", re.U)
+re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._\-\: ]+)", re.U)
 re_skip_default = re.compile("(\r\n|\s)", re.U)
 re_han_cut_all = re.compile("([\u4E00-\u9FD5]+)", re.U)
 re_skip_cut_all = re.compile("[^a-zA-Z0-9+#\n]", re.U)