ReadingBank download #1

Closed
WeihongM opened this issue Oct 9, 2021 · 6 comments
Comments

@WeihongM

WeihongM commented Oct 9, 2021

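@HYPJUDY Hello, I downloaded ReadingBank and found that it contains only JSON files (in three folders: dev, test, and train). How can I download the images?
Also, what does the dev folder correspond to?

Thanks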

@HYPJUDY
Contributor

HYPJUDY commented Oct 10, 2021

Hello,

  1. The full image set has not been open-sourced, but some samples are shown here for reference. The README also explains this:
    To guarantee there is no potential ethical violation, we publicize a proportion of our dataset (about 100 pages) and this subset will be manually checked and redacted while the access of the whole version requires our further permission. All the data in our dataset will be protected by Apache 2.0 license.
    We further provide some examples of them and the visualization of their reading orders.

  2. The dev folder is like the test and train folders; it is just a different split, and it can be used for validation.

@WeihongM
Author

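Hello, @HYPJUDY
Thank you for your reply.
I would like to ask:

  1. Training LayoutReader does not require the image set. Do you plan to open-source the images later?
  2. How can I train on my own small dataset? It has only 2k images.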

@HYPJUDY
Contributor

HYPJUDY commented Oct 10, 2021

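You're welcome.

  1. There are currently no plans to open-source the images.
  2. You can initialize from the LayoutReader pre-trained model, and on top of that you can also consider data augmentation to expand the dataset.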
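
As an illustration only (this is not the maintainers' code, and the thread does not say which augmentations were used), one way to expand a small reading-order dataset is to jitter the layout and re-shuffle the input order while keeping track of the true reading order. The sketch below assumes a hypothetical page representation: a list of words in reading order plus one `(x0, y0, x1, y1)` box per word, with coordinates assumed to lie in a 0 to 1000 range. The function name `augment_page` and all of its parameters are made up for this example.

```python
import random


def augment_page(words, boxes, rng, max_shift=20, scale_range=(0.9, 1.1), limit=1000):
    """Return one augmented copy of a page (hypothetical format, see lead-in).

    words: list of str, given in the correct reading order.
    boxes: list of (x0, y0, x1, y1), one per word, values in [0, limit].
    rng:   a random.Random instance, so results are reproducible.
    """
    # One random scale and shift for the whole page, so the relative
    # layout of the words is preserved.
    scale = rng.uniform(*scale_range)
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)

    def clamp(value):
        # Keep coordinates inside the assumed [0, limit] range.
        return max(0, min(limit, int(round(value))))

    new_boxes = [
        (clamp(x0 * scale + dx), clamp(y0 * scale + dy),
         clamp(x1 * scale + dx), clamp(y1 * scale + dy))
        for (x0, y0, x1, y1) in boxes
    ]

    # Shuffle the input order; the model would have to recover the reading order.
    order = list(range(len(words)))
    rng.shuffle(order)
    src_words = [words[i] for i in order]
    src_boxes = [new_boxes[i] for i in order]

    # tgt_index[k] is the position, in the shuffled input, of the k-th word
    # in reading order.
    tgt_index = [0] * len(order)
    for position, original in enumerate(order):
        tgt_index[original] = position

    return src_words, src_boxes, tgt_index
```

Calling `augment_page` several times per page with the same `random.Random` instance yields several distinct training samples from one annotated page. Whether this particular jitter helps on a given dataset is something to check on a held-out dev split.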

@HYPJUDY HYPJUDY closed this as completed Oct 14, 2021
@WeihongM
Author

@HYPJUDY
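Have you evaluated this approach's performance on a small dataset? Or, for cases where the dataset is small, are there other methods you would recommend?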

@HYPJUDY
Contributor

HYPJUDY commented Oct 18, 2021

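Hello, I have tried the initialization and data augmentation mentioned above on a small dataset. Both helped, and performance was normal. I have not tried any other methods.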

@WeihongM
Author

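@HYPJUDY Thank you for your reply. I think this is a very general piece of document understanding work, and if possible I would also like to add you on WeChat to discuss further. WeChat: 13427522474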
