Estimated the missing-line situation with a simple program #47

Closed
kc910521 opened this issue Mar 16, 2018 · 9 comments
Comments

@kc910521
Contributor

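I'm very interested in this project. I wrote a small program to count missing lines (mainly in the Song ci part). Taking the data in ci.song.0.json as an example, my count over its 1000 Song ci gives a missing-line figure of at least 134, so across tens of thousands of ci the total is considerable.
Indexes of the Song ci with missing lines: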
[ci_103, ci_112, ci_159, ci_215, ci_302, ci_332, ci_366, ci_413, ci_451, ci_486, ci_487, ci_558, ci_686, ci_694, ci_695, ci_726, ci_736, ci_737, ci_750, ci_791, ci_824, ci_841, ci_909, ci_910, ci_952, ci_57, ci_83, ci_104, ci_144, ci_160, ci_218, ci_292, ci_306, ci_328, ci_344, ci_365, ci_485, ci_550, ci_555, ci_563, ci_590, ci_667, ci_669, ci_670, ci_672, ci_673, ci_689, ci_692, ci_780, ci_886, ci_893, ci_948, ci_28, ci_56, ci_82, ci_95, ci_102, ci_244, ci_296, ci_301, ci_311, ci_312, ci_326, ci_337, ci_347, ci_369, ci_399, ci_408, ci_409, ci_417, ci_482, ci_527, ci_593, ci_682, ci_683, ci_684, ci_687, ci_688, ci_753, ci_795, ci_804, ci_825, ci_857, ci_890, ci_895, ci_85, ci_105, ci_161, ci_263, ci_281, ci_345, ci_351, ci_352, ci_368, ci_450, ci_524, ci_553, ci_612, ci_627, ci_681, ci_820, ci_822, ci_823, ci_881, ci_947, ci_949, ci_8, ci_12, ci_111, ci_162, ci_182, ci_209, ci_275, ci_277, ci_303, ci_346, ci_364, ci_370, ci_373, ci_387, ci_488, ci_489, ci_530, ci_531, ci_546, ci_557, ci_570, ci_678, ci_690, ci_732, ci_735, ci_835, ci_891, ci_911]
So I plan to improve this program, and hopefully it can repair the missing lines automatically.

@kc910521
Contributor Author

The figure 134 means that 134 Song ci have missing lines. The number of lines lost varies per ci, and some have lost as much as half.

@jackeyGao
Member

I'm following this issue. How does the program decide that lines are missing?

@kc910521
Contributor Author

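Well, it parses out the main body of each ci, builds the corresponding Baidu URL, and looks the ci up on Baidu Wenxue (百度文学). If the text Baidu returns has more characters than the ci in the original JSON by more than some margin (for example 5), the ci is judged to be missing lines. That is how it works for now. It is neither complete nor precise, just a rough check.
The source code is parked here for now: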
https://github.com/kc910521/springBootFriends/tree/elastic_search
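The check described above might be sketched roughly as follows. This is a minimal illustration, not the code in the linked repository: the function names, the choice to ignore punctuation when counting, and the `paragraphs` list shape are all assumptions, and the Baidu lookup is left out, so the reference text is simply passed in.

```python
import re

# Assumed margin: the comment only says "some margin (for example 5)".
THRESHOLD = 5

# Assumption: punctuation and whitespace are ignored when counting characters.
_PUNCT = re.compile(r"[\s,。、;:?!,.;:?!]")


def char_count(paragraphs):
    """Count characters across a list of lines, skipping punctuation and whitespace."""
    return sum(len(_PUNCT.sub("", line)) for line in paragraphs)


def is_missing_lines(local_paragraphs, reference_paragraphs, threshold=THRESHOLD):
    """Return True when the reference text is longer than the local copy
    by more than `threshold` characters."""
    gap = char_count(reference_paragraphs) - char_count(local_paragraphs)
    return gap > threshold


# Hypothetical usage: the local copy has lost its second line.
local = ["明月几时有,把酒问青天。"]
reference = ["明月几时有,把酒问青天。", "不知天上宫阙,今夕是何年。"]
flagged = is_missing_lines(local, reference)
```

A one-sided comparison like this only catches a local copy that is shorter than the reference. It would not notice a reference page that is itself truncated, which matches the "rough check" caveat in the comment.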

@kc910521
Contributor Author

Later on it could be changed to fill in the missing lines, or something along those lines.

@jackeyGao
Member

Looking forward to it 👍

@kc910521
Contributor Author

Here is the JSON after completion on my side:
https://github.com/kc910521/springBootFriends/blob/elastic_search/src/main/resources/ci.0.json
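There are two problems. First, the order of the ci changes compared with the source file.
Second, the visual formatting of the JSON changes. Did the author settle on a particular format when exporting originally?
I'm working out how to solve this.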

@jackeyGao
Member

You can modify the content on top of the existing ci order. The JSON map fields use an ordered dictionary.
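One way to follow this suggestion, sketched in Python under some assumptions: the file is a JSON array of objects, each with a `paragraphs` list (the field name is assumed here), and the helper names `patch_paragraphs` and `dump_entries` are made up for illustration. Editing entries in place, instead of rebuilding the list, keeps both the order of the ci and the key order inside each object.

```python
import json


def patch_paragraphs(entries, index, new_paragraphs):
    """Replace the lines of one entry in place; all other entries and keys stay untouched."""
    entries[index]["paragraphs"] = list(new_paragraphs)
    return entries


def dump_entries(entries):
    """Serialize without escaping Chinese characters. The indent width is an assumption,
    not a format confirmed by the project."""
    return json.dumps(entries, ensure_ascii=False, indent=2)


# Hypothetical two-entry file content.
text = (
    '[{"author": "a", "paragraphs": ["x"], "rhythmic": "r"},'
    ' {"author": "b", "paragraphs": ["y"], "rhythmic": "s"}]'
)
entries = json.loads(text)  # dicts keep key order on Python 3.7+
patch_paragraphs(entries, 1, ["y", "z"])
output = dump_entries(entries)
```

Whether the dumped text matches the original file byte for byte still depends on the indentation and separators used by the original export, which the thread does not specify.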

@ghost

ghost commented Apr 2, 2018

A small portion of the Tang poems are also missing lines. It seems to be the last line that is missing?

Here are a few examples: , ,

@jackeyGao
Member

The cause of the missing lines in the Song ci has been found: it is a bug in the crawler. See #70 for details.
