Wrong contents in gold standard #9

Closed
tfmorris opened this issue Mar 3, 2016 · 7 comments

@tfmorris
Contributor

tfmorris commented Mar 3, 2016

When I look at the contents of https://github.com/dkpro/dkpro-c4corpus/blob/master/dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/CleanEvalGoldStandard/103.txt, the header says that it is 104.txt and the contents match those of 104.txt, even though the file name is 103.txt.

@OmniaZayed

Hi Tom,

This gold standard dataset was downloaded from the CLEANEVAL homepage without any changes, so the header/content mix-up in 103.txt originates from CLEANEVAL itself.
Thanks,
Best regards,
Omnia

@tfmorris
Contributor Author

tfmorris commented Mar 3, 2016 via email

@OmniaZayed

You are welcome.
I am not sure whether there are other corrupted files. But if you open
103.txt with Notepad++ (please see the attached picture), you will see
some strange lines at the beginning of the file that do not occur in
the other files I have checked so far.

@tfmorris
Contributor Author

tfmorris commented Mar 3, 2016

I looked into it a little further using the original data set, and 103.txt is a tar file made up of all the other files. My guess is that someone ran a command like tar cvf *.txt, which the shell expanded to tar cvf 103.txt 104.txt 105.txt ... Because the f option takes the next argument as the archive name, this overwrote the original 103.txt with a tar of all the other files.

I'd recommend just deleting it since it contains no useful information. The only way to recover the original file would be to contact the authors.
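
In case it helps anyone check their own copy, here is a minimal sketch of how the remaining files could be scanned for the same problem. It uses Python's standard tarfile module; the function name find_tar_masquerading_as_text is made up for illustration and is not part of dkpro-c4corpus, and the directory to scan is whatever local path holds the gold standard files.

```python
import tarfile
from pathlib import Path


def find_tar_masquerading_as_text(directory):
    """Return {file name: [archive member names]} for each *.txt that is really a tar.

    Hypothetical helper for illustration; not part of dkpro-c4corpus.
    """
    suspects = {}
    for path in sorted(Path(directory).glob("*.txt")):
        # is_tarfile() looks at the file's contents, not its extension.
        if tarfile.is_tarfile(str(path)):
            with tarfile.open(str(path)) as archive:
                suspects[path.name] = archive.getnames()
    return suspects
```

Calling find_tar_masquerading_as_text on a local CleanEvalGoldStandard directory should return an empty dict if no other file was clobbered. For a file overwritten the way I described, the member list would start with the next file name, which would also explain why the "header" of 103.txt says 104.txt.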

@OmniaZayed

Yes, you are right. Thanks a lot for the clarification.

@habernal habernal added this to the 1.0.0 release milestone Mar 3, 2016
@habernal habernal reopened this Mar 3, 2016
habernal added a commit that referenced this issue Mar 3, 2016
@habernal
Contributor

habernal commented Mar 3, 2016

Has to be documented in #4

@habernal habernal closed this as completed Mar 3, 2016
tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Mar 7, 2016
habernal added a commit that referenced this issue Mar 7, 2016
Replacement GoldStandard 103.txt provided by Miloš Jakubíček - fixes #9
@tfmorris
Contributor Author

tfmorris commented Mar 7, 2016

Just discovered that there's another repo with the GoldStandard files in it in case any more problems turn up: https://github.com/ppke-nlpg/boilerplateResults/tree/master/cleanEvalResults/GoldStandard
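
If more problems do turn up, a rough way to cross-check the two copies is to hash the files by name. This is only a sketch, assuming both repos are checked out locally; compare_copies and sha256_of are made-up names, and the directory arguments are whatever local paths you use.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def compare_copies(dir_a, dir_b):
    """Return (names whose contents differ, names present in only one directory).

    Hypothetical helper for illustration; compares *.txt files by name.
    """
    files_a = {p.name: p for p in Path(dir_a).glob("*.txt")}
    files_b = {p.name: p for p in Path(dir_b).glob("*.txt")}
    shared = sorted(files_a.keys() & files_b.keys())
    differing = [n for n in shared if sha256_of(files_a[n]) != sha256_of(files_b[n])]
    unshared = sorted(files_a.keys() ^ files_b.keys())
    return differing, unshared
```

A byte-level mismatch does not necessarily mean corruption, since line endings or encodings may differ between the copies, so anything this flags would still need a look by hand.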
