# Tokenizer for Modern Chinese: Jieba #

"Jieba" (Chinese for "to stutter") is a Chinese text segmentation library that aims to be the best Python module for Chinese word segmentation.

In [5]:
import jieba

- The `jieba.cut` function accepts three parameters: the string to be cut; `cut_all`, which selects the cut mode (full mode when `True`, default mode when `False`); and `HMM`, which controls whether the Hidden Markov Model is used.
- `jieba.cut_for_search` accepts two parameters: the string to be cut, and whether to use the Hidden Markov Model. It cuts the sentence into short words suitable for search engines.
- The input string can be a unicode/str object, or a str/bytes object encoded in UTF-8 or GBK. Note that GBK encoding is not recommended because it may be unexpectedly decoded as UTF-8.
- `jieba.cut` and `jieba.cut_for_search` return a generator, which you can iterate over with a `for` loop to get the segmentation result (in unicode).
- `jieba.lcut` and `jieba.lcut_for_search` return a list.
- `jieba.Tokenizer(dictionary=DEFAULT_DICT)` creates a new customized Tokenizer, which enables you to use different dictionaries at the same time. `jieba.dt` is the default Tokenizer, to which almost all global functions are mapped.


In [6]:
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode


Full Mode: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学


In [7]:
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # default mode


Default Mode: 我/ 来到/ 北京/ 清华大学


In [8]:
seg_list = jieba.cut("他来到了网易杭研大厦")
print(", ".join(seg_list))


他, 来到, 了, 网易, 杭研, 大厦


In [9]:
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print(", ".join(seg_list))

小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, ，, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造


## Jieba shortcut for sklearn vectorizers

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from chinese_tokenizer.tokenizer import Tokenizer

jie_ba_tokenizer = Tokenizer().jie_ba_tokenizer
count_vect = CountVectorizer(tokenizer=jie_ba_tokenizer)