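This is a great project. I noticed that the code references resource data files such as ./data/wiki.simple.txt. Is there any documentation for these files, or instructions on how to obtain them?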
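The detailed data-cleaning process lives in another project of mine; see raw_data_process.py#L599 and raw_data_process.py#L975.
You can also build the file yourself. Download the Chinese Wikipedia dump from https://dumps.wikimedia.org/zhwiki/ (the file named zhwiki-[archive date]-pages-articles-multistream.xml.bz2, about 2.7 GB), convert the downloaded bz2 file to wiki.txt with WikiExtractor, and finally use the OpenCC library to convert the text to Simplified Chinese.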
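The last step above (Traditional to Simplified conversion) could look roughly like the following. This is only a minimal sketch, not the code from raw_data_process.py: it assumes WikiExtractor's output has already been merged into a single plain-text file, and that an OpenCC Python binding exposing `OpenCC("t2s").convert` is installed. The helper names `make_t2s_converter` and `convert_file` are hypothetical.

```python
from typing import Callable


def make_t2s_converter() -> Callable[[str], str]:
    """Return a Traditional-to-Simplified converter backed by OpenCC.

    Assumption: an OpenCC Python binding is installed and accepts the
    "t2s" (Traditional to Simplified) configuration name.
    """
    # Imported lazily so convert_file() can be used without OpenCC installed.
    from opencc import OpenCC

    return OpenCC("t2s").convert


def convert_file(src_path: str, dst_path: str,
                 convert: Callable[[str], str]) -> int:
    """Stream src_path through `convert` line by line into dst_path.

    Processing one line at a time keeps memory use flat, which matters
    for a multi-gigabyte Wikipedia text file. Returns the line count.
    """
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(convert(line))
            count += 1
    return count
```

With those helpers, something like `convert_file("wiki.txt", "./data/wiki.simple.txt", make_t2s_converter())` would produce the Simplified Chinese file; the input file name here is taken from the description above, and the output path from the question.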
Got it, thanks for the explanation!