verify generated/imported SOLR data against input CSV (missing records) #9

Closed
mbohun opened this issue Jun 28, 2018 · 5 comments

mbohun commented Jun 28, 2018

@ess-acppo-djd identified 5 missing records between the input tblBiota_20180620.csv and the generated SOLR index.

@mbohun mbohun self-assigned this Jun 28, 2018

mbohun commented Jun 28, 2018

check_tblBiota.sh

#!/bin/bash

# Extract the first-column values (intBiotaID) from the CSV file and strip the enclosing double quotes.
# NOTE: the header row is not skipped, so the literal "intBiotaID" is queried as well
#       (it shows up as the first "Not Found" line in the output).
for intBiotaID in $(cut -d ',' -f1 tblBiota_20180620.csv | sed -e 's/"//g')
do
    # NOTE: curl -L is needed in order to follow HTTP 301 redirects to the linked records
    #       (for example, intBiotaID=106779 redirects to another record)
    json=$(curl -s -L --header 'Accept: application/json' "https://ag-bie.oztaxa.com/ws/species/${intBiotaID}")
    if [ "$(echo "${json}" | jq '. | has("error")')" == "true" ]; then
        echo "TEST: ${intBiotaID} error => $(echo "${json}" | jq '.error')"
    fi
done
ubuntu@ip-172-31-2-29:/tmp$ ./check_tblBiota.sh
TEST: intBiotaID error => "Not Found"
TEST: 102340 error => "Not Found"
TEST: 103926 error => "Not Found"
TEST: 71079 error => "Not Found"
TEST: 112099 error => "Not Found"
TEST: 30 error => "Not Found"
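
The first "Not Found" line above is the CSV header (`intBiotaID`) being sent to the service, not a missing record. A minimal sketch of an ID extraction step that skips the header could look like the following. The function name `extract_ids` is hypothetical, and the sketch assumes the first field never contains an embedded comma or line break:

```shell
#!/bin/bash

# Print the first-column IDs of a CSV file, one per line,
# skipping the header row and stripping the enclosing double quotes.
# Assumption: the first field contains no embedded commas or newlines.
extract_ids() {
    local csv_file="$1"
    tail -n +2 "${csv_file}" | cut -d ',' -f1 | tr -d '"'
}

# Hypothetical usage, mirroring the loop in check_tblBiota.sh:
#   for intBiotaID in $(extract_ids tblBiota_20180620.csv); do ...; done
```

With this in place the loop would only report IDs that are genuinely absent from the index.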

Details of the 5 missing records reported by the script are as follows:

"intBiotaID","intParentID","vchrEpithet","vchrFullName","vchrYearOfPub","vchrAuthor","vchrNameQualifier","chrElemType","vchrRank","chrKingdomCode","intOrder","vchrParentage","bitChangedComb","bitShadowed","bitUnplaced","bitUnverified","bitAvailableName","bitLiteratureName","dtDateCreated","vchrWhoCreated","dtDateLastUpdated","vchrWhoLastUpdated","txtDistQual","GUID"
"102340","20","Phytobiota","","","","","KING ","","P ","0","\20\102340","False","False","False","False","True","False","2003-07-28 11:33:17.857000000","Clayton Winter","2003-07-28 11:33:24.997000000","Clayton Winter","","{9B626B79-DE67-4B58-849C-2B5429F9A83B}"
"103926","64792","Xyleutes eucalypti: Walker [misspelling!]","Xyleutes eucalypti: Walker [misspelling!]","","","","SP   ","","A ","0","\1\106786\6\100975\12\52112\101129\101130\101134\58791\74799\64792\103926","False","False","False","False","False","True","2004-09-27 12:48:37.270000000","graham brown","2004-09-27 12:48:40.630000000","graham brown","","{4F19BBB1-4097-4804-9B48-2F6E1394B4AF}"
"71079","66889","hirtus","Croton hirtus L’herit","","L’herit","","SP   ","","P ","0","\20\102341\102343\101427\21\22\102360\99968\66575\66889\71079","False","False","False","False","False","False","2003-03-25 12:54:09.450000000","Migration","2004-04-07 21:19:27.373000000","sa","","{51ABE293-3031-4310-894B-2353BF4C32E8}"
"112099","101848","Ornithogalum Mosaic Virus","Potyvirus (definitive_species) Ornithogalum Mosaic Virus Smith and Brierley, 1944a","1944a","Smith and Brierley","","SP   ","","V ","0","\101171\101661\104483\61073\61217\101848\112099","False","False","False","False","False","False","2016-09-05 10:37:37.967000000","NAQSTaxaTree","2016-09-05 13:47:30.587000000","AGDAFF\Teakle Graham","","{C0B11D33-42CD-4A55-A410-863A2A0CFD87}"
"30","106089","<No_Species_Entered>","<No_Species_Entered>","","","","     ","","A ","0","\24\106089\30","False","False","False","False","False","False","2003-03-25 12:54:09.450000000","Data Conversion","2007-06-12 12:29:30.250000000","Graham Brown","","{7EB978EA-7584-4285-9DA2-D66FAE5F1B3D}"

@charvolant
Collaborator

Some of these are being rejected early by the Talend processing. They can be found in /data/work/taxxas/Processed/rejected.csv (there's also a vernacular_rejected.csv). The sanity-checking rules may be overly strict.
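
To confirm which of the missing IDs ended up in the reject file, something along these lines might help. This is only a sketch: the function name `find_rejected` is hypothetical, and it assumes rejected.csv keeps the same layout as the input CSV, with the quoted intBiotaID as the first field, which has not been verified here:

```shell
#!/bin/bash

# Print each given ID that appears as the quoted first field of a line
# in the reject file.
# Assumption: the reject file has the same column layout as the input CSV.
find_rejected() {
    local reject_file="$1"
    shift
    local id
    for id in "$@"; do
        # Anchor on the opening quote and the closing quote plus comma,
        # so that "30" does not also match "300" or "1030".
        if grep -q "^\"${id}\"," "${reject_file}"; then
            echo "${id}"
        fi
    done
}

# Hypothetical usage:
#   find_rejected /data/work/taxxas/Processed/rejected.csv 102340 103926 71079 112099 30
```

Any ID that is not printed would have been dropped at a later stage than the early rejection.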


ess-acppo-djd commented Jun 29, 2018

I've already located these and am preparing to have the source data corrected. They appear to be rejected for using unexpected characters in one of FullName, Epithet, Author or YearOfPub.
There is one other record being dropped somewhere (Phytobiota, a synonym for Plantae) and I've yet to hunt it down.
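
As a rough way to spot candidate rows before the next import, the source CSV could be scanned for bytes outside printable ASCII (for example the curly apostrophe in "L’herit"). This is a sketch only: the actual Talend rules are not shown in this thread, so the check below covers just one class of "unexpected character", and the function name `flag_non_ascii` is hypothetical:

```shell
#!/bin/bash

# Print, prefixed with their line numbers, the lines of a file that contain
# any byte outside printable ASCII (space through tilde). Tabs and carriage
# returns are flagged as well.
# LC_ALL=C makes grep work on bytes; -a forces the file to be treated as text.
flag_non_ascii() {
    LC_ALL=C grep -a -n '[^ -~]' "$1"
}

# Hypothetical usage:
#   flag_non_ascii tblBiota_20180620.csv
```

Records such as 103926 or 30, which only use ASCII punctuation like `[`, `!` or `<`, would not be caught by this check and would need a stricter pattern.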

@ess-acppo-djd

The Phytobiota record gets stripped out into 'invalid_synonyms.csv' by the process that creates the directory /data/work/taxxas/DwC.

@moziauddin
Contributor

A test script has already been added. It can check which names are missing, using either ID or name.
