verify generated/imported SOLR data against input CSV (missing records) #9
```shell
#!/bin/bash
# Extract the first-column ID values from the CSV file, stripping the enclosing double-quotes.
for intBiotaID in $(cut -d ',' -f1 tblBiota_20180620.csv | sed -e 's/"//g'); do
    # NOTE: curl -L is needed in order to follow HTTP 301 redirects to the linked records
    # (for example, intBiotaID=106779 redirects to another record).
    json=$(curl -s -L --header 'Accept: application/json' "https://ag-bie.oztaxa.com/ws/species/${intBiotaID}")
    if [ "$(echo "${json}" | jq '. | has("error")')" == "true" ]; then
        echo "TEST: ${intBiotaID} error => $(echo "${json}" | jq '.error')"
    fi
done
```
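A usage sketch, assuming the loop above is saved as `verify_solr.sh` (a hypothetical filename; the output file name is likewise arbitrary):

```shell
# Hypothetical filename for the loop above; save the failures for later comparison.
chmod +x verify_solr.sh
./verify_solr.sh | tee missing_ids.txt
# Failing lookups are printed as lines like:
#   TEST: 106779 error => "..."
grep -c '^TEST:' missing_ids.txt   # count of IDs missing from the SOLR index
```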
Details of the above 5 records are as follows:
Some of these are being rejected early by the Talend processing. They can be found in /data/work/taxxas/Processed/rejected.csv (there's also a vernacular_rejected.csv). The sanity-checking rules may be overly strict.
I've already located these and am preparing to have the source data corrected. They appear to be rejected for using unexpected characters in one of FullName, Epithet, Author or YearOfPub.
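The "unexpected characters" theory can also be checked mechanically. This is a sketch under assumptions: columns 2-5 are taken to be FullName, Epithet, Author and YearOfPub (adjust to the real schema), the whitelist is a guess rather than the actual Talend rule, and quoted fields are assumed not to contain embedded commas (the verification script makes the same assumption):

```shell
#!/bin/bash
# Flag rows whose name/author/year fields contain characters outside a
# conservative whitelist. Columns 2-5 and the whitelist are assumptions.
awk -F',' '{
    gsub(/"/, "")                                   # drop the enclosing quotes
    for (i = 2; i <= 5; i++) {
        if ($i !~ /^[A-Za-z0-9 .()&'\''-]*$/) {
            print "SUSPECT row " NR " field " i ": " $i
        }
    }
}' tblBiota_20180620.csv
```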
It gets stripped out into 'invalid_synonyms.csv' by the process that creates the directory /data/work/taxxas/DwC.
The test script has been added. It can check which names are missing, by ID or by name.
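The name-based check can be sketched along the same lines as the ID-based loop. Both the name column (column 2 here) and the search endpoint are assumptions — the `/ws/search.json?q=` form and the `.searchResults.totalRecords` field follow the usual ALA BIE layout, so adjust them to whatever ag-bie.oztaxa.com actually exposes:

```shell
#!/bin/bash
# Name-based check: assumes column 2 of the CSV holds the scientific name and
# that the BIE instance exposes an ALA-style /ws/search.json endpoint.
while IFS= read -r name; do
    q=$(printf '%s' "${name}" | sed -e 's/ /%20/g')    # minimal URL-encoding (spaces only)
    hits=$(curl -s -L --header 'Accept: application/json' \
        "https://ag-bie.oztaxa.com/ws/search.json?q=${q}" \
        | jq '.searchResults.totalRecords')
    if [ "${hits}" == "0" ]; then
        echo "TEST: no match for name => ${name}"
    fi
done < <(cut -d ',' -f2 tblBiota_20180620.csv | sed -e 's/"//g')
```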
@ess-acppo-djd identified 5 missing records between the input `tblBiota_20180620.csv` and the generated SOLR index.