
About idf.txt #87

Open
linkerlin opened this issue Jul 30, 2013 · 14 comments

Comments

@linkerlin
Contributor

commented Jul 30, 2013

It looks like the code for generating idf.txt from a corpus is not in the repository. Was it left out?

@lewsn2008


commented Aug 7, 2013

Both the dictionary and idf.txt were produced by the author beforehand, by training on and analyzing a corpus; they are not included in this project.
That said, I would also very much like to see the corpus-analysis code. Hoping the author will share it!

@jannson


commented Aug 12, 2013

If people are interested in this, I'm happy to share mine.
At the time I felt the statistics in idf.txt weren't very good, so I worked out a way to generate my own copy. I also produced a set of corpus statistics using a maximum-entropy approach, mainly for discovering new words.


@fxsjy

Owner

commented Aug 13, 2013

@linkerlin, @lewsn2008, I found the script I wrote earlier to generate idf.txt. The basic idea is to segment a corpus of novels and newspapers, then compute IDF with each paragraph treated as one document.

import jieba
import math
import sys
import re

# Matches runs of CJK characters (defined here but not used below).
re_han = re.compile(ur"([\u4E00-\u9FA5]+)")

d = {}      # word -> number of paragraphs containing it (document frequency)
total = 0   # number of paragraphs seen so far
for line in open("yuliao_onlyseg.txt", 'rb'):
    sentence = line.decode('utf-8').strip()
    # set() deduplicates within the paragraph: a word counts once per document.
    words = set(jieba.cut(sentence))
    for w in words:
        if w in jieba.FREQ:  # count only words present in jieba's dictionary
            d[w] = d.get(w, 0.0) + 1.0
    total += 1
    if total % 10000 == 0:
        print >>sys.stderr, 'sentence count', total

# IDF per word: -log(df/N), equivalently log(N/df).
new_d = [(k, math.log(v / total) * -1) for k, v in d.iteritems()]

for k, v in new_d:
    print k.encode('utf-8'), v
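
As a usage note: each output line is a word followed by its IDF, which matches the format of the idf.txt bundled with jieba. A minimal sketch of wiring the output back into keyword extraction, assuming the script is saved as gen_idf.py and its stdout redirected to my_idf.txt (both names are hypothetical), and that the jieba version provides jieba.analyse.set_idf_path:

import jieba.analyse

# After running:  python gen_idf.py > my_idf.txt
# Point the TF-IDF keyword extractor at the custom IDF file.
jieba.analyse.set_idf_path("my_idf.txt")
print(jieba.analyse.extract_tags(u"这是一段用来测试关键词提取的中文文本", topK=5))
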
@lewsn2008


commented Aug 13, 2013

@fxsjy Thank you so much. The author is a truly selfless expert; much respect!

@linkerlin

Contributor Author

commented Aug 13, 2013

Thanks!
I don't understand why it needs to be math.log(v/total)*-1, though.

@fxsjy

Owner

commented Aug 13, 2013

@linkerlin, you could equally write math.log(total/v).
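
To spell out the step @linkerlin asked about: with total paragraphs N and a word appearing in v of them, math.log(v/total)*-1 is -log(v/N), which is the standard IDF log(N/v); the two forms are algebraically identical since log(a/b) = -log(b/a). A quick check with made-up illustrative numbers:

import math

total = 1000.0  # N: paragraphs in the corpus
v = 3.0         # df: paragraphs containing the word

# -log(v/total) == log(total/v)
assert abs(math.log(v / total) * -1 - math.log(total / v)) < 1e-12
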

@lewsn2008


commented Aug 20, 2013

Could you share the corpus data? Is the yuliao_onlyseg.txt used in the script still available? I'd like to run it and learn from it. Thanks!

@jannson


commented Aug 21, 2013

A new segmentation library: https://github.com/jannson/yaha, shared purely for learning and exchange:

  1. It offers an approach to some issues in jieba, such as person-name recognition, suffix recognition, and the use of regular expressions;
  2. It also makes small optimizations to keyword extraction and the ChineseAnalyzer;
  3. It additionally implements maximum-entropy new-word generation, automatic summarization, and an algorithm for comparing the similarity of two texts.

The library came about because, after using jieba in a small crawler and search engine of mine, I found a few small issues and optimized them. I originally meant to modify jieba directly and submit the changes, but some of the design differences were large enough that I made a new library instead; no disrespect to jieba's author is intended.

Finally, thanks to the author of jieba: the dictionary and some of the code ideas come from jieba. I hope we can have more exchanges in the future.


@fxsjy

Owner

commented Aug 21, 2013

@jannson, watching it now.

@fxsjy

Owner

commented Aug 21, 2013

@lewsn2008, the file is over 200 MB; how should I send it to you?

@yanyiwu

Contributor

commented Aug 21, 2013

Could you post a Dropbox link? I'd like to download a copy as well.

wuyanyi09@gmail.com


@fxsjy

Owner

commented Aug 22, 2013

@aszxqw, @lewsn2008, I tried Baidu Cloud; share link: http://pan.baidu.com/share/link?shareid=4094310849&uk=1124369080

@wowxunyl


commented Feb 24, 2018

    if w in jieba.FREQ:
        d[w]=d.get(w,0.0) + 1.0

jieba.FREQ no longer exists. What should the "if w in" check use now?
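
A hedged guess at the modern equivalent: in current jieba the dictionary appears to live on the default Tokenizer instance exposed as jieba.dt, so the old jieba.FREQ lookup would become jieba.dt.FREQ. A minimal sketch under that assumption:

import jieba

jieba.initialize()  # build the default dictionary so its FREQ dict is populated

def count_document_freq(paragraphs):
    # Same document-frequency loop as the original script, with the old
    # module-level jieba.FREQ replaced by the default Tokenizer's dict.
    d = {}
    for sentence in paragraphs:
        for w in set(jieba.cut(sentence)):
            if w in jieba.dt.FREQ:  # assumption: the lookup now lives on jieba.dt
                d[w] = d.get(w, 0.0) + 1.0
    return d

print(count_document_freq([u"我们在测试逆文档频率", u"我们再来测试一次"]))
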

