IIT CDIP 1.0 (Illinois Institute of Technology Complex Document Information Processing Test Collection, version 1.0) #12
Comments
added in 07ac808 |
@kba are you able to access the dataset? I get a "permission denied" error |
Hi, did you solve this problem? I also got a "forbidden" error. |
@kba or anyone, could you possibly provide a mirror? |
@Bonjour123 I am afraid we don't have access either, this file just gathers the metadata for the datasets. An excerpt from CDIP seems to be available publicly here https://www.cs.cmu.edu/~aharley/rvl-cdip/, but this also links back to the URL for the main datasets that returns a 403 now. |
Yes, the rvl-cdip site states that it's publicly available at ir.nist.gov/cdip, so I think it used to be public and was later removed from public access. But I hope that someone has a copy somewhere. |
Did anyone find a copy? Would really appreciate it if you could post the link. |
Did anyone manage to get their hands on the data? I emailed one of the authors yesterday but have had no response so far. If anyone else would like to email them, you can track them down on ResearchGate. |
Hi Brain, have you received any reply? |
Hi, this is Ian Soboroff from NIST. We do indeed still host the collection, but only open access to the files on request. Web crawlers kept getting trapped in the image file directories and loading down our server. |
Hi Ian, thanks for reaching out! I would like to get access to it :) Perhaps uploading it in bulk to Zenodo might be a good idea; they have lots of bandwidth. If you want, I can also host a mirror. |
Zenodo would be a good idea for hosting since they will provide a DOI as well. Otherwise, perhaps some documentation could be added again as https://ir.nist.gov/cdip/README.txt with the information that files are available upon request? |
I changed how the directory is protected, so that everyone can get to the README file. That file describes that you can get the OCR data, and to contact me for access to the raw TIFF images (which are about 1.3TB). |
@isoboroff I want to download the raw TIFF images (which are about 1.3TB). Can you share the dataset? Thank you. |
Yes, contact me at ***@***.***
Ian
|
@isoboroff Hi, Can you share the dataset? Thank you. |
I am working out an alternate hosting setup that will hopefully be easier on all of us. I’ve been asked if this hosting setup could include demo code. Would anyone here like to help out? |
Those files are DVD disk images, and they contain the OCR text from the
original page scans. The collection was originally distributed on DVD.
Ian
On Mon, Aug 2, 2021 at 12:56 AM TaekyungKi wrote:
> @isoboroff Thanks for the reply anyway! I downloaded the CDIP_x.cdr files from https://ir.nist.gov/cdip/, but I'm not sure these are the right files because of their size. Can you tell me how to use these files?
|
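For anyone unsure whether a downloaded CDIP_x.cdr file is intact, here is a minimal sketch of a sanity check. It assumes the `.cdr` files are plain ISO 9660 (or UDF bridge) disk images, which this thread does not confirm; the function name `looks_like_iso9660` is made up for illustration. The only fact it relies on is that an ISO 9660 image stores the identifier `CD001` one byte into sector 16.

```python
ISO9660_MAGIC = b"CD001"
# ISO 9660 volume descriptors start at sector 16 (2048-byte sectors);
# the 5-byte standard identifier follows the 1-byte descriptor type.
MAGIC_OFFSET = 16 * 2048 + 1


def looks_like_iso9660(path):
    """Return True if the file carries the ISO 9660 'CD001' identifier.

    Assumption: the .cdr file is a raw disk image with 2048-byte sectors.
    A False result only means this check did not match, not that the
    download is necessarily corrupt.
    """
    with open(path, "rb") as f:
        f.seek(MAGIC_OFFSET)
        return f.read(len(ISO9660_MAGIC)) == ISO9660_MAGIC


# Usage (the file name is a placeholder):
#   looks_like_iso9660("CDIP_x.cdr")
```

If the check passes, the image can usually be mounted like any other disk image (for example with `hdiutil attach` on macOS or a read-only loop mount on Linux) to reach the OCR text inside.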
@isoboroff |
mark |
@isoboroff I want the source images. How can I contact you? I can't see the email address. |
You can contact me at ian dot soboroff at nist dot gov.
|
@isoboroff |
@Cogdof Have you gotten the data yet? I want to use it for research. Could you share it if you have it? |
@etrigger |
@Cogdof Still no link available; the only option is to email him, right? |
Hi, folks. We are still working on the alternate hosting (hopefully in Amazon S3) but these things take time. I will try to respond to download requests as quickly as I can... if you don't get a reply from me, it's ok to ping me again. |
@isoboroff |
At long last, the image data has been re-hosted. You can now find it at https://data.nist.gov/od/id/mds2-2531. Hopefully, transfers will now be faster (but it is quite big, so it will take some time!) |
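Since the image data is large and transfers will take a while, a resumable download may save some pain. The following is only a sketch: the file URL and destination name are placeholders (the thread gives the landing page, not direct file links), the helper names `resume_request` and `download` are hypothetical, and it assumes the server honours HTTP `Range` requests, which has not been verified here.

```python
import os
import urllib.request


def resume_request(url, dest):
    """Build a request that continues from the bytes already saved in dest."""
    have = os.path.getsize(dest) if os.path.exists(dest) else 0
    req = urllib.request.Request(url)
    if have:
        # Ask only for the part of the file we do not have yet.
        req.add_header("Range", f"bytes={have}-")
    return req, have


def download(url, dest, chunk=1 << 20):
    """Download url to dest, appending to a partial file when possible."""
    req, have = resume_request(url, dest)
    # Note: if dest is already complete, the server may answer 416 and
    # urlopen will raise urllib.error.HTTPError.
    with urllib.request.urlopen(req) as resp:
        # 206 means the server honoured the Range header; otherwise restart.
        mode = "ab" if have and resp.status == 206 else "wb"
        with open(dest, mode) as out:
            while True:
                block = resp.read(chunk)
                if not block:
                    break
                out.write(block)


# Usage (both arguments are placeholders):
#   download("https://example.org/path/to/archive.tar", "archive.tar")
```

Re-running `download` after an interruption picks up where the partial file ends, provided the server supports ranges; if it does not, the file is simply fetched again from the start.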
Excuse me~ |
I think we had a web site issue. I couldn’t bring it up yesterday evening but today it’s up. Can you try again?
On Jun 6, 2022, 08:44 -0400, Randy-1009 wrote:
> Excuse me~
> This URL (https://data.nist.gov/od/id/mds2-2531) cannot be accessed. https://data.nist.gov/pdr/od/id shows "Empty Record".
> Is this dataset locked again? Could you please tell me how I can get the source images, which are 1.3TB?
> Looking forward to your reply, thank you~
|
It works! Thank you so much! |
I checked the files at that link. Looking at the XML files that come out after extracting cdip-n.tar, it seems that each image has its own XML file. Can you give me some information about them? Thank you for your support :) |
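Until someone documents the XML format, one way to get a feel for it is to list the element names without unpacking the whole archive. This is a rough sketch that assumes nothing about the schema: it only reports the root tag and the tags of its direct children for the first few `.xml` members. The function name `summarize_xml_members` and the archive path are illustrative, not part of the dataset.

```python
import tarfile
import xml.etree.ElementTree as ET


def summarize_xml_members(tar_path, limit=5):
    """Return (member name, root tag, child tags) for the first few XML files."""
    summaries = []
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not (member.isfile() and member.name.lower().endswith(".xml")):
                continue
            # Parse straight from the archive; nothing is written to disk.
            root = ET.parse(tar.extractfile(member)).getroot()
            child_tags = sorted({child.tag for child in root})
            summaries.append((member.name, root.tag, child_tags))
            if len(summaries) >= limit:
                break
    return summaries


# Usage (the archive name is a placeholder):
#   for name, root_tag, children in summarize_xml_members("cdip-n.tar"):
#       print(name, root_tag, children)
```

Reading members in archive order and stopping after `limit` matches avoids scanning the whole tarball, which matters for archives of this size.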
Hi, |
Is this dataset still publicly available? Is there any valid way to download this data? |
https://data.nist.gov/od/id/mds2-2531 , this link can still be accessed, you can download IIT-CDIP data here |
Is there any way I can access the dataset? |
https://ir.nist.gov/cdip/README.txt