
version update 0.38

anderscui committed Feb 4, 2016
1 parent 6602f26 commit 1b0fb4e8ba664ca0f87bc0a46d509bebc83b80c3
Showing with 41 additions and 14 deletions.
  1. +5 −0 Changelog
  2. +20 −1 README.md
  3. +16 −13 docs/todo.md
@@ -1,3 +1,8 @@
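2016-01-04: version 0.38
1. Extended the recognized Unicode range for Chinese characters: [\u4E00-\u9FD5];
2. Fixed load_userdict failing to recognize user dictionary entries that contain spaces or other special characters;
3. Added command-line segmentation support;

2015-09-18: v0.37.1
1) First version that fully implements jieba segmentation (Python version)
2) Published to NuGet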
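
To make the range in the 0.38 entry concrete, here is a minimal Python sketch that matches runs of characters inside [\u4E00-\u9FD5]. It illustrates the range only; the helper name `han_runs` is hypothetical, and this is not the pattern jieba.NET itself uses, which may cover additional characters.

```python
import re

# Range quoted in the 0.38 changelog entry: U+4E00 through U+9FD5.
HAN_RUN = re.compile("[\u4E00-\u9FD5]+")


def han_runs(text):
    """Return the maximal runs of characters that fall inside the range."""
    return HAN_RUN.findall(text)


if __name__ == "__main__":
    # Latin letters, digits, spaces and punctuation fall outside the range.
    print(han_runs("jieba.NET 支持中文分词, version 0.38"))
```

The upper bound is inclusive, so U+9FD5 matches while U+9FD6 does not.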
@@ -1,6 +1,8 @@
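jieba.NET is the .NET version (implemented in C#) of [jieba Chinese word segmentation](https://github.com/fxsjy/jieba).

The current version is 0.37.1, based on jieba 0.37. The goal is to provide the same features and interfaces as jieba, though extensions beyond jieba may be added later. For the ideas behind jieba's implementation, see the materials mentioned in [this wiki page](https://github.com/anderscui/jieba.NET/wiki/%E7%90%86%E8%A7%A3%E7%BB%93%E5%B7%B4%E5%88%86%E8%AF%8D).
The current version is 0.38, based on jieba 0.38. It provides the same features and interfaces as jieba, and extensions beyond jieba may be added later. For the ideas behind jieba's implementation, see the materials mentioned in [this wiki page](https://github.com/anderscui/jieba.NET/wiki/%E7%90%86%E8%A7%A3%E7%BB%93%E5%B7%B4%E5%88%86%E8%AF%8D).

If you run into segmentation-related needs or difficulties during development, please open an Issue. I see u :)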

## Features

@@ -197,3 +199,20 @@ jieba also provides other dictionary files:
* Full mode: 2.5 MB/s
* Accurate mode: 1.1 MB/s
* Test environment: Intel(R) Core(TM) i3-2120 CPU @ 3.30GHz; 围城.txt (734KB)

### 10. Command-Line Segmentation

Building the Segmenter.Cli project produces jiebanet.exe. Its options and sample usages are as follows:

-f --file the file name (required).
-d --delimiter the delimiter between tokens, default: / .
-a --cut-all use cut_all mode.
-n --no-hmm don't use HMM.
-p --pos enable POS tagging.
-v --version show version info.
-h --help show help details.

sample usages:
$ jiebanet -f input.txt > output.txt
$ jiebanet -d "|" -f input.txt > output.txt
$ jiebanet -p -f input.txt > output.txt
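
As a rough illustration of the option table above, the sketch below mirrors the same flags with Python's `argparse`. It is not the Segmenter.Cli implementation (which is written in C#); `build_parser` and `format_tokens` are hypothetical names, the default delimiter is assumed to be `"/ "` based on the help text, and the token list is made up for the example.

```python
import argparse


def build_parser():
    """Mirror the documented jiebanet options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="jiebanet", add_help=False)
    parser.add_argument("-f", "--file", required=True, help="the file name (required)")
    # Assumption: the documented default "/ ." is read as the string "/ ".
    parser.add_argument("-d", "--delimiter", default="/ ",
                        help="the delimiter between tokens")
    parser.add_argument("-a", "--cut-all", action="store_true", help="use cut_all mode")
    parser.add_argument("-n", "--no-hmm", action="store_true", help="don't use HMM")
    parser.add_argument("-p", "--pos", action="store_true", help="enable POS tagging")
    parser.add_argument("-v", "--version", action="version",
                        version="jiebanet (sketch)", help="show version info")
    parser.add_argument("-h", "--help", action="help", help="show help details")
    return parser


def format_tokens(tokens, delimiter):
    """Join already-segmented tokens with the chosen delimiter."""
    return delimiter.join(tokens)


if __name__ == "__main__":
    # Equivalent to: jiebanet -d "|" -f input.txt
    args = build_parser().parse_args(["-d", "|", "-f", "input.txt"])
    # Hypothetical tokens standing in for real segmenter output.
    print(format_tokens(["jieba", "NET"], args.delimiter))
```

Note that a bare `|` on a shell command line is interpreted as a pipe, so a delimiter of `|` has to be quoted when it is passed to `-d`.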
@@ -1,16 +1,19 @@
1. Deploy dicts with dlls?
2. Solution structure - clearer for open source project
3. Spell check and suggests
5. News Classification
2. Spell check and suggests
3. News Classification
4. Synonyms
5. Parallel

Misc
1. synchronization;
2. cache;
4. jieba.pool;
5. other dict files;
6. multiple english words (e.g. Steve Jobs)
7. named entity recognition
8. new word recognition
10. logging
11. Pinyin
12. Remove Console.WriteLine output, use Debug instead
1. cache;
2. other dict files;
3. multiple english words (e.g. Steve Jobs)
4. named entity recognition
5. new word recognition
6. logging
7. Pinyin
8. Simplified <-> Traditional

Ideas
1. [linggle](http://linggle.com/)
2. gensim
