Wrong contents in gold standard #9
Comments
Hi Tom, this gold standard dataset was downloaded from the CLEANEVAL homepage without any changes, so the header/content mix-up in 103.txt originates from CLEANEVAL itself. |
Thanks for the quick response.
Any idea how widespread the corruption is? Obviously, having completely unrelated gold standard data is going to render the scores invalid.
|
You are welcome. |
I looked into it a little further using the original data set, and 103.txt is a tar file made up of all the other files. My guess is that someone ran a command like … I'd recommend just deleting it, since it contains no useful information. The only way to recover the original file would be to contact the authors. |
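For anyone who wants to check how widespread this kind of corruption is, here is a minimal sketch that scans a directory for files that are really tar archives despite their `.txt` name. It only uses Python's standard `tarfile` module; the function names `find_tar_archives` and `list_members` are made up for illustration, and the directory you pass in is whatever local copy of the gold standard you have.

```python
import tarfile
from pathlib import Path


def find_tar_archives(directory):
    """Return the sorted names of files in `directory` that are tar archives.

    tarfile.is_tarfile() inspects the file contents, so a tar archive
    saved under a misleading name such as 103.txt is still detected.
    """
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and tarfile.is_tarfile(str(path))
    )


def list_members(path):
    """Return the member names stored inside the tar archive at `path`."""
    with tarfile.open(str(path)) as archive:
        return archive.getnames()
```

Calling `find_tar_archives` on the gold standard directory should return an empty list if no other file is affected. Running `list_members` on a flagged file shows which files were packed into it. Note that this only catches the tar case, not files whose text content belongs to a different document.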
Yes, you are right. Thanks a lot for the clarification. |
This has to be documented in #4. |
Replacement GoldStandard 103.txt provided by Miloš Jakubíček - fixes #9
Just discovered that there's another repo with the GoldStandard files in it in case any more problems turn up: https://github.com/ppke-nlpg/boilerplateResults/tree/master/cleanEvalResults/GoldStandard |
When I look at the contents of https://github.com/dkpro/dkpro-c4corpus/blob/master/dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/CleanEvalGoldStandard/103.txt, the header says that it is 104.txt and the contents match those of 104, even though the file name is 103.