Where can I get the corresponding resource files? #2

Closed
cjxx2016 opened this issue Jan 12, 2024 · 2 comments

Comments

@cjxx2016

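This is a great project. In the code I see resource data files such as ./data/wiki.simple.txt. Is there any documentation for them, or instructions on how to obtain them?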

@charent
Owner

charent commented Jan 12, 2024

The detailed data-cleaning process is in another project of mine: see raw_data_process.py#L599 and raw_data_process.py#L975.

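You can also get the data from Wikipedia yourself. Download address: https://dumps.wikimedia.org/zhwiki/. Download the zhwiki-[dump date]-pages-articles-multistream.xml.bz2 file (about 2.7 GB), convert the downloaded bz2 file to wiki.txt using WikiExtractor as a reference, and finally use the OpenCC library to convert the text to Simplified Chinese.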
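The last two steps above (merging the WikiExtractor output and converting it to Simplified Chinese) could look roughly like the sketch below. It is not the code from raw_data_process.py. It assumes WikiExtractor was run with its default plain-text output, where each article is wrapped in `<doc ...>` and `</doc>` lines inside files named `wiki_*`, and that the third-party `opencc` Python package is installed. The names `clean_lines` and `build_simplified_corpus` are made up for illustration.

```python
import re
from pathlib import Path
from typing import Callable, Iterable, Iterator

# Assumption: WikiExtractor's default output wraps each article in
# <doc id=... url=... title=...> and </doc> lines.
DOC_TAG = re.compile(r"^</?doc\b[^>]*>$")


def clean_lines(lines: Iterable[str], convert: Callable[[str], str]) -> Iterator[str]:
    """Skip <doc> wrapper lines and blank lines; convert every other line."""
    for line in lines:
        line = line.strip()
        if not line or DOC_TAG.match(line):
            continue
        yield convert(line)


def build_simplified_corpus(extracted_dir: str, out_path: str) -> None:
    """Merge all wiki_* files under extracted_dir into a single Simplified Chinese text file."""
    import opencc  # third-party dependency, imported lazily

    # Assumption: "t2s" (Traditional to Simplified) is accepted as the config
    # name; some package versions expect "t2s.json" instead.
    converter = opencc.OpenCC("t2s")
    with open(out_path, "w", encoding="utf-8") as out:
        for part in sorted(Path(extracted_dir).rglob("wiki_*")):
            with part.open(encoding="utf-8") as f:
                for text in clean_lines(f, converter.convert):
                    out.write(text + "\n")
```

For example, `build_simplified_corpus("extracted", "./data/wiki.simple.txt")` would write the merged text to the path mentioned in the question, where `extracted` is a hypothetical WikiExtractor output directory. The converter is passed into `clean_lines` as a plain function, so the filtering logic can be checked without OpenCC installed.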

@cjxx2016
Author

Got it, thanks for the answer~
