IIT CDIP 1.0 (Illinois Institute of Technology Complex Document Information Processing Test Collection, version 1.0) #12

Closed
kba opened this issue Sep 23, 2019 · 38 comments

Comments

@kba

kba commented Sep 23, 2019

https://ir.nist.gov/cdip/README.txt

@cneud
Owner

cneud commented Oct 14, 2019

added in 07ac808

@cneud cneud closed this as completed Oct 14, 2019
@mineshmathew

@kba are you able to access the dataset? I get a "permission denied" error

@ning-mz

ning-mz commented Oct 5, 2020

@kba are you able to access the dataset? I get a "permission denied" error

Hi, did you solve this problem? I also get a "forbidden" error.

@Bonjour123

@kba or anyone, could you possibly provide a mirror?

@cneud
Owner

cneud commented Nov 12, 2020

@Bonjour123 I am afraid we don't have access either; this file just gathers the metadata for the datasets.

An excerpt from CDIP seems to be available publicly here https://www.cs.cmu.edu/~aharley/rvl-cdip/, but this also links back to the URL for the main datasets that returns a 403 now.

@cneud cneud reopened this Nov 12, 2020
@Bonjour123

Yes, the rvl-cdip site does state that it's publicly available at ir.nist.gov/cdip, so I think it was public at one point and was later removed from public access. But I still hope that someone has a copy somewhere.

@pushpendradahiya

Did anyone find a copy? Would really appreciate it if you could post the link.

@brian8128

Anyone manage to get your hands on the data? I emailed one of the authors yesterday but no response so far. If anyone else would like to email them you can track them down on researchgate.

https://www.researchgate.net/publication/221299542_Building_a_test_collection_for_complex_document_information_processing

@tengerye

tengerye commented Feb 8, 2021

Anyone manage to get your hands on the data? I emailed one of the authors yesterday but no response so far. If anyone else would like to email them you can track them down on researchgate.

https://www.researchgate.net/publication/221299542_Building_a_test_collection_for_complex_document_information_processing

Hi Brian, have you received any reply?

@isoboroff

Hi, this is Ian Soboroff from NIST. We do indeed still host the collection, but only open access to the files on request. Web crawlers kept getting trapped in the image file directories and loading down our server.

@kba
Author

kba commented Apr 7, 2021

Hi Ian, thanks for reaching out! I would like to get access to it :) Perhaps uploading it in bulk to Zenodo might be a good idea; they have lots of bandwidth. If you want, I can also host a mirror.

@cneud
Owner

cneud commented Apr 7, 2021

Zenodo would be a good idea for hosting since they will provide a DOI as well.

Otherwise, perhaps some documentation could be added again as https://ir.nist.gov/cdip/README.txt with the information that files are available upon request?

@isoboroff

I changed how the directory is protected, so that everyone can get to the README file. That file describes that you can get the OCR data, and to contact me for access to the raw TIFF images (which are about 1.3TB).

@gulixin0922

@isoboroff I want to download the raw TIFF images (which are about 1.3TB). Can you share the dataset? Thank you.

@isoboroff

isoboroff commented Apr 19, 2021 via email

@cneud cneud closed this as completed Jul 26, 2021
@TaekyungKi

TaekyungKi commented Aug 2, 2021

@isoboroff Hi, Can you share the dataset? Thank you.

@isoboroff

I am working out an alternate hosting setup that will hopefully be easier on all of us.

I’ve been asked if this hosting setup could include demo code. Would anyone here like to help out?

@isoboroff

isoboroff commented Aug 2, 2021 via email

@abdksyed

@isoboroff
Is the alternate hosting setup done?
Also, for the demo code, if it's in Python, I can help with that.

@WenmuZhou

mark

@sookienlane

Those files are DVD disk images, and they contain the OCR text from the original page scans. The collection was originally distributed on DVD. Ian

On Mon, Aug 2, 2021 at 12:56 AM TaekyungKi wrote: @isoboroff Thanks for the reply anyway! I downloaded the CDIP_x.cdr files from https://ir.nist.gov/cdip/, but I'm not sure these are the right files because of their size. Can you tell me how I can use these files?

@isoboroff I want the source images. How can I contact you? I can't see your email address.

@isoboroff

isoboroff commented Sep 26, 2021 via email

@Cogdof

Cogdof commented Dec 17, 2021

@isoboroff
Ian, my colleague sent an email asking about access to the IIT-CDIP test dataset,
but there has been no answer, so we are asking here.
Is there a hosting site for accessing the IIT-CDIP dataset, with the image set that contains the OCR text data?
How can I get that data?
Thanks

@etrigger

@Cogdof Did you get the data already? I want to use the data for research. Can you share the data if you got them?

@Cogdof

Cogdof commented Jan 14, 2022

@etrigger
I contacted Ian by email.
Also, this dataset is huge, so there is no way to send it efficiently.
I think it is best that you email him.

@TaekyungKi

TaekyungKi commented Jan 14, 2022

@Cogdof Still no link available, so the only option is to email him, right?

@isoboroff

Hi, folks. We are still working on the alternate hosting (hopefully in Amazon S3) but these things take time. I will try to respond to download requests as quickly as I can... if you don't get a reply from me, it's ok to ping me again.

@wayne1199111810

@isoboroff
I have sent you two emails this week from "at amazon dot com". It would be helpful if you could take a quick look. Thanks.

@isoboroff

At long last, the image data has been re-hosted. You can now find it at https://data.nist.gov/od/id/mds2-2531. Hopefully, transfers will now be faster (but it is quite big, so it will take some time!)

@Randy-1009

At long last, the image data has been re-hosted. You can now find it at https://data.nist.gov/od/id/mds2-2531. Hopefully, transfers will now be faster (but it is quite big, so it will take some time!)

Excuse me,
This URL (https://data.nist.gov/od/id/mds2-2531) does not seem to be accessible. https://data.nist.gov/pdr/od/id shows "Empty Record".
Is this dataset locked again? Could you please tell me how I can get the source images, which are about 1.3TB?
Looking forward to your reply, thank you.

@isoboroff

isoboroff commented Jun 7, 2022 via email

@Randy-1009

It works! Thank you so much!
Have a nice day!

@wonbeeny

At long last, the image data has been re-hosted. You can now find it at https://data.nist.gov/od/id/mds2-2531. Hopefully, transfers will now be faster (but it is quite big, so it will take some time!)

I checked the files at that link.

Looking at the XML file that comes out after decompressing cdip-n.tar, it seems that each image should have its own XML file.
But I could not find an XML file for each image.
I'd like to know where the per-image XML files are.

Can you give me some information about this?

Thank you for your support :)
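
One way to narrow down a question like this is to inspect an archive without unpacking it. The following is only a minimal sketch using Python's standard `tarfile` module. The helper names `summarize_tar` and `list_members`, and the file name `cdip-n.tar` in the usage note, are illustrative assumptions; nothing here is confirmed about how the NIST archives are actually laid out.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_tar(tar_path):
    """Count the regular files in a tar archive by lowercase file extension."""
    counts = Counter()
    with tarfile.open(tar_path) as archive:
        for member in archive:
            if member.isfile():
                counts[PurePosixPath(member.name).suffix.lower()] += 1
    return counts


def list_members(tar_path, suffix=".xml", limit=10):
    """Return up to `limit` member paths whose names end in `suffix`."""
    found = []
    with tarfile.open(tar_path) as archive:
        for member in archive:
            if member.isfile() and member.name.lower().endswith(suffix):
                found.append(member.name)
                if len(found) >= limit:
                    break
    return found
```

Calling `summarize_tar("cdip-n.tar")` (hypothetical file name) shows at a glance whether a given archive holds XML records, images, or both, and `list_members` shows where the matching files sit in the directory tree.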

@eliyara

eliyara commented Apr 5, 2023

Hi,
Is the link https://data.nist.gov/od/id/mds2-2531 still valid? I cannot access the data!

@Cppowboy

Cppowboy commented Jun 6, 2023

Is this dataset still publicly available? Is there any valid way to download this data?

@Randy-1009

Is this dataset still publicly available? Is there any valid way to download this data?

The link https://data.nist.gov/od/id/mds2-2531 can still be accessed; you can download the IIT-CDIP data there.

@CyndiWangle

Is there any way I can access the dataset?

@lyc728

lyc728 commented Mar 11, 2024

Hello, I can download the IIT-CDIP-annotations. How do the images correspond to the JSON files?
I want to download the images that correspond to the JSON files.
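
The thread does not say how the annotation files map to images, so the following is only a sketch under a stated assumption: that each JSON file shares its base name (the file name without extension) with its image. If the real naming scheme differs, the matching key would need to change. The function name `pair_by_stem`, the directory arguments, and the list of image extensions are all hypothetical.

```python
from pathlib import Path

# Assumed image extensions; adjust to whatever the downloaded data contains.
IMAGE_SUFFIXES = {".tif", ".tiff", ".png", ".jpg"}


def pair_by_stem(annotation_dir, image_dir):
    """Pair each JSON annotation with an image that has the same base name.

    Returns (pairs, missing): `pairs` maps annotation path -> image path,
    and `missing` lists annotations for which no image was found.
    ASSUMPTION: annotation and image share a file stem; the thread does
    not confirm this.
    """
    images = {}
    for path in Path(image_dir).rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            # Keep the first image seen for each stem.
            images.setdefault(path.stem, path)

    pairs, missing = {}, []
    for annotation in sorted(Path(annotation_dir).rglob("*.json")):
        image = images.get(annotation.stem)
        if image is None:
            missing.append(annotation)
        else:
            pairs[annotation] = image
    return pairs, missing
```

A long `missing` list would suggest that the shared-stem assumption is wrong, or that the matching images live in an archive that has not been downloaded yet.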
