suspicious dwca parsing exception with escaped double quotes in json fragment #48

Closed
jhpoelen opened this issue Dec 3, 2019 · 5 comments

jhpoelen commented Dec 3, 2019

GloBI is using your dwca-io library (thanks!) for parsing dwc archives.

During routine integration testing (see https://travis-ci.org/globalbioticinteractions/scan/jobs/588191246#L229), I found:

org.eol.globi.data.StudyImporterException: failed to read archive [https://scan-bugs.org:443/portal/content/dwca/MCZ_DwC-A.zip]
	...
Caused by: java.lang.IllegalStateException: java.text.ParseException: Unexpected character (';' (code 59)): Expected separator ('"' (code 34)) or end-of-line
 at [Source: (BufferedReader); line: 116580, column: 320]
	at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:70)
	at org.eol.globi.data.StudyImporterForDwCA.importCore(StudyImporterForDwCA.java:108)
	at org.eol.globi.data.StudyImporterForDwCA.importStudy(StudyImporterForDwCA.java:94)
	... 10 more
Caused by: java.text.ParseException: Unexpected character (';' (code 59)): Expected separator ('"' (code 34)) or end-of-line
 at [Source: (BufferedReader); line: 116580, column: 320]
	at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:119)
	at org.gbif.utils.file.tabular.JacksonCsvFileReader.read(JacksonCsvFileReader.java:38)
	at org.gbif.dwc.DwcRecordIterator.hasNext(DwcRecordIterator.java:63)
	... 12 more

I've isolated the offending line and reproduced the issue. On close inspection, the line uses "" to escape double quotes inside CSV fields that hold JSON fragments, but I don't see any malformed CSV.

Does dwca-io support ""-style escaping?
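
For reference, ""-style escaping is the quoting convention described in RFC 4180. The sketch below is not dwca-io; it uses Python's standard `csv` module only to illustrate how a conforming reader decodes such a field. The two-column line is a shortened, hypothetical stand-in for the real record, reusing the JSON fragment from its dynamicProperties column.

```python
import csv
import json

def parse_csv_line(line):
    # strict=True makes the reader raise csv.Error on malformed quoting
    # instead of silently guessing what was meant.
    return next(csv.reader([line], strict=True))

# Shortened, hypothetical line: an id plus the JSON fragment from the record.
# Inside a quoted field, each "" stands for one literal double quote.
line = '26225229,"{""collection"":""Thomas Say Collection"", ""life stage"":""adult""}"'

fields = parse_csv_line(line)
properties = json.loads(fields[1])  # the decoded field is valid JSON

print(fields[1])
print(properties["life stage"])
```

If a reader handles this line, the doubled quotes themselves are unlikely to be the cause of the exception.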

jhpoelen pushed a commit to globalbioticinteractions/globalbioticinteractions that referenced this issue Dec 3, 2019
jhpoelen commented Dec 3, 2019

On second look, I realized that there's actually an error in the data:

"Labels: ""granulat-/ tus, S \\"; ""T. Say/ Type""

should be:

"Labels: ""granulat-/ tus, S ""; ""T. Say/ Type""

Note the change from \\" to "".
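
The difference can be checked with a strict CSV reader. This is a minimal sketch using Python's `csv` module rather than the Jackson-based reader in dwca-io, so the error message differs, but the cause is the same: after `\\` the lone `"` closes the field, and the `;` that follows is neither a separator nor an end of line. The helper name `is_well_formed` and the shortened single-cell inputs (closed with a final quote) are illustrative assumptions.

```python
import csv

def parse_strict(line):
    # strict=True: raise csv.Error on bad quoting instead of recovering.
    return next(csv.reader([line], strict=True))

def is_well_formed(line):
    try:
        parse_strict(line)
        return True
    except csv.Error:
        return False

# Shortened versions of the cell from the archive and of the proposed fix.
# Raw strings keep the two backslashes exactly as they appear in the data.
broken = r'"Labels: ""granulat-/ tus, S \\"; ""T. Say/ Type"""'
fixed = r'"Labels: ""granulat-/ tus, S ""; ""T. Say/ Type"""'

print(is_well_formed(broken))
print(is_well_formed(fixed))
print(parse_strict(fixed)[0])
```

Without `strict=True`, Python's reader tolerates the stray quote and returns a mangled value, which is why the strict setting is used here.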

jhpoelen commented Dec 3, 2019

@neilcobb - please note that record:

id,institutionCode,collectionCode,ownerInstitutionCode,collectionID,basisOfRecord,occurrenceID,catalogNumber,otherCatalogNumbers,kingdom,phylum,class,order,family,scientificName,taxonID,scientificNameAuthorship,genus,specificEpithet,taxonRank,infraspecificEpithet,identifiedBy,dateIdentified,identificationReferences,identificationRemarks,taxonRemarks,identificationQualifier,typeStatus,recordedBy,recordNumber,eventDate,year,month,day,startDayOfYear,endDayOfYear,verbatimEventDate,occurrenceRemarks,habitat,fieldNumber,informationWithheld,dataGeneralizations,dynamicProperties,associatedTaxa,reproductiveCondition,establishmentMeans,lifeStage,sex,individualCount,samplingProtocol,samplingEffort,preparations,country,stateProvince,county,municipality,locality,locationRemarks,decimalLatitude,decimalLongitude,geodeticDatum,coordinateUncertaintyInMeters,verbatimCoordinates,georeferencedBy,georeferenceProtocol,georeferenceSources,georeferenceVerificationStatus,georeferenceRemarks,minimumElevationInMeters,maximumElevationInMeters,minimumDepthInMeters,maximumDepthInMeters,verbatimDepth,verbatimElevation,disposition,language,recordEnteredBy,modified,rights,rightsHolder,accessRights,recordId,references
26225229,MCZ,Ent,"Museum of Comparative Zoology, H",029816b2-ba46-4c89-9ebb-c1c630a0ce7e,PreservedSpecimen,MCZ:Ent:36086,36086,"type number=36086",Animalia,Arthropoda,Insecta,Coleoptera,Curculionidae,"Anametis granulatus",,"(Say, 1832)",,,,,"Rachel L. Hawkins","2017-03-30 00:00:00.0",,"Labels: ""granulat-/ tus, S \\"; ""T. Say/ Type""; ""Anametis/ grisea/ H""; ""MCZ SYNTYPE/ 36086/ R. L. Hawkins/ 2017.iii.07""",,,"Syntype of Barynotus granulatus","[no agent data]",,1700-01-01,1700,1,1,1,,"[no verbatim date data]","collection: Thomas Say Collection; life stage: adult",,,,,"{""collection"":""Thomas Say Collection"", ""life stage"":""adult""}",,,,,,1,,,"whole animal (pinned)","United States",Indiana,,,"[no specific locality data]",,,,,,,,,,,,,,,,,,"not applicable",en,,"2017-03-30 10:09:26",http://creativecommons.org/licenses/by-nc/4.0/,"President and Fellows of Harvard College","The publisher and rights holder of this work is Museum of Comparative Zoology, Harvard University. Copyright © 2018 President and Fellows of Harvard College, Some Rights Reserved. This work is licensed under a Creative Commons Attribution Non Commercial (CC-BY-NC) 4.0 License.",urn:uuid:929ab79e-0e24-44e7-92a7-53bb1f826fe5,https://scan-bugs.org:443/portal/collections/individual/index.php?occid=26225229

from https://scan-bugs.org:443/portal/content/dwca/MCZ_DwC-A.zip appears to be invalid. The corresponding occurrence page is https://scan-bugs.org:443/portal/collections/individual/index.php?occid=26225229. Does Symbiota validate escaped field values?

jhpoelen commented Dec 3, 2019

@neilcobb please note that the page at https://scan-bugs.org/portal/collections/individual/index.php?occid=26225229 shows the "correct" identification remarks:

Labels: "granulat-/ tus, S \"; "T. Say/ Type"; "Anametis/ grisea/ H"; "MCZ SYNTYPE/ 36086/ R. L. Hawkins/ 2017.iii.07"

It appears that Symbiota adds an extra backslash.
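
For comparison, here is a sketch of what an RFC 4180-style export of that value could look like. It assumes the stored value is exactly what the page displays, including the single backslash (which may itself be an artifact), and it uses Python's `csv` module as a stand-in; it says nothing about how Symbiota's exporter is implemented. The helper names and the two-column row are hypothetical.

```python
import csv
import io

def to_csv_line(fields):
    # Default dialect: quotes are escaped by doubling; backslashes are left alone.
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue().rstrip("\n")

def from_csv_line(line):
    # strict=True: raise csv.Error on malformed quoting.
    return next(csv.reader([line], strict=True))

# Assumed stored value, shortened from the remarks shown on the occurrence page.
remarks = r'Labels: "granulat-/ tus, S \"; "T. Say/ Type"'

line = to_csv_line(["26225229", remarks])
print(line)
print(from_csv_line(line) == ["26225229", remarks])
```

Under this convention only the double quotes are doubled, so no extra backslash is needed and the value survives a write/read round trip.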

jhpoelen commented Dec 3, 2019

Bug transferred to Symbiota/Symbiota-deprecated#130.

jhpoelen closed this as completed Dec 3, 2019