# MatCit-Validator

This Jupyter notebook validates the annotation of material citations carried out as part of the Insects of Guam Datamining Project. It reads an XML file generated by GGI and writes an HTML report that flags errors and warnings.
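
To make the input concrete, the sketch below parses a tiny, made-up document that uses the same element and attribute names the notebook's code looks for (`document`, `treatment`, `materialsCitation`, `docDate`, `collectingDate`). It uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup only so that it runs without extra dependencies; the sample XML and the `list_material_citations` helper are hypothetical and not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; real GGI files carry many more attributes.
SAMPLE_XML = """
<document docId="EXAMPLE-ID" docTitle="Example title" docDate="1942">
  <treatment>
    <materialsCitation collectingDate="1936-05-04" collectorName="O. H. Swezey"
                       location="Piti">Piti, May 4, Swezey.</materialsCitation>
    <materialsCitation location="Yigo">Yigo.</materialsCitation>
  </treatment>
</document>
"""

def list_material_citations(xml_text):
    """Return one dict of attributes per materialsCitation element."""
    root = ET.fromstring(xml_text.strip())
    return [dict(mc.attrib) for mc in root.iter('materialsCitation')]

citations = list_material_citations(SAMPLE_XML)
for attrs in citations:
    # Missing attributes come back as None, which is what a validator would flag.
    print(attrs.get('location'), attrs.get('collectingDate'))
```

The second citation in the sample has no `collectingDate` or `collectorName`, which is the kind of gap the report is meant to surface.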

In [11]:
from bs4 import BeautifulSoup
import re
from datetime import datetime

In [12]:
PATH_TO_XML_FILE = 'A676FD1EF22D3F34FF8F8907FFDAFC58.xml'

## Fields from the Materials Citations Dialog

```
collectionCode, specimenCount, specimenCode, accessionNumber
typestatus, collectingCountry, collectingRegion, collectingMunicipality
collectingCounty, location, locationDeviation, originalDetermination
determinerName, collectorName, collectingDate, collectedFrom
collectingMethod, collectingPermit, geoCoordinate, elevation
geologicalTimeScale, backReference
```

## Mapping Material Citation Fields to DwCA

The following list was extracted from the DwCA meta.xml file. Here, we can see how GGI terms are mapped to DwCA terms.

DwCA term|GGI source
:---|:---
http://rs.tdwg.org/dwc/terms/taxonID|treatment ID + ".taxon"
http://rs.tdwg.org/dwc/terms/catalogNumber|mc@specimenCode (explode to one record per specimen code if possible)
http://rs.tdwg.org/dwc/terms/collectionCode|mc@collectionCode (explode to one record per collection code if possible)
http://rs.tdwg.org/dwc/terms/institutionCode|blank
http://rs.tdwg.org/dwc/terms/typeStatus|mc@typeStatus (blank if none given)
http://rs.gbif.org/terms/1.0/verbatimLabel|mc text
http://rs.tdwg.org/dwc/terms/sex|mc@sex (also other specimen types like "queen", "worker", etc.)
http://rs.tdwg.org/dwc/terms/individualCount|mc@specimenCount (explode things like "5 workers, 2 females" to one record per typified specimen count if possible)
http://rs.tdwg.org/dwc/terms/eventDate|mc@collectingDate
http://rs.tdwg.org/dwc/terms/recordedBy|mc@collectorName
http://rs.tdwg.org/dwc/terms/recordNumber|blank
http://rs.tdwg.org/dwc/terms/decimalLatitude|mc@latitude
http://rs.tdwg.org/dwc/terms/decimalLongitude|mc@longitude
http://rs.tdwg.org/dwc/terms/minimumElevationInMeters|mc@elevation, or mc@elevationMin if given
http://rs.tdwg.org/dwc/terms/maximumElevationInMeters|mc@elevationMax if given
http://rs.tdwg.org/dwc/terms/country|mc@collectingCountry
http://rs.tdwg.org/dwc/terms/stateProvince|mc@stateProvince or mc@collectingRegion
http://rs.tdwg.org/dwc/terms/municipality|mc@collectingMunicipality
http://rs.tdwg.org/dwc/terms/locality|mc@location
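
As a rough illustration of how the table could be used in code, the sketch below stores a subset of the mapping as a dictionary and builds one DwC record from a dict of material citation attributes. The names `DWC_TO_MC` and `to_dwc_record` are hypothetical, the priority order for `stateProvince` is an assumption, and the "explode to one record per ..." behaviour noted in the table is not attempted.

```python
DWC = 'http://rs.tdwg.org/dwc/terms/'

# Subset of the table above: DwC term -> candidate mc attributes, in priority order.
DWC_TO_MC = {
    DWC + 'catalogNumber': ['specimenCode'],
    DWC + 'collectionCode': ['collectionCode'],
    DWC + 'eventDate': ['collectingDate'],
    DWC + 'recordedBy': ['collectorName'],
    # The table says elevationMin is used when given, otherwise elevation.
    DWC + 'minimumElevationInMeters': ['elevationMin', 'elevation'],
    DWC + 'country': ['collectingCountry'],
    # Assumption: stateProvince takes priority over collectingRegion.
    DWC + 'stateProvince': ['stateProvince', 'collectingRegion'],
    DWC + 'municipality': ['collectingMunicipality'],
    DWC + 'locality': ['location'],
}

def to_dwc_record(mc_attrs):
    """Hypothetical helper: build a DwC record from a dict of mc attributes."""
    record = {}
    for term, candidates in DWC_TO_MC.items():
        # Take the first candidate attribute that is present and non-empty.
        record[term] = next((mc_attrs[c] for c in candidates if mc_attrs.get(c)), '')
    return record

example = to_dwc_record({'collectorName': 'O. H. Swezey', 'elevation': '300', 'location': 'Piti'})
print(example[DWC + 'recordedBy'])
```

Terms with no matching attribute are left as empty strings, mirroring the "blank" entries in the table.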

Here's my idea of which fields are required:

* collectingDate
* collectorName
* collectingCountry
* collectingMunicipality OR location
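
The list above could be checked along these lines. This is a minimal sketch that works on a plain dict of attributes; `REQUIRED_FIELDS` and `missing_required_fields` are hypothetical names, and the rule set is only the tentative one listed here.

```python
REQUIRED_FIELDS = ['collectingDate', 'collectorName', 'collectingCountry']

def missing_required_fields(mc_attrs):
    """Hypothetical helper: names of required fields that are absent or empty."""
    missing = [name for name in REQUIRED_FIELDS if not mc_attrs.get(name)]
    # Either collectingMunicipality or location satisfies the last requirement.
    if not (mc_attrs.get('collectingMunicipality') or mc_attrs.get('location')):
        missing.append('collectingMunicipality OR location')
    return missing

print(missing_required_fields({'collectingDate': '1936-05-04', 'location': 'Piti'}))
# → ['collectorName', 'collectingCountry']
```

Returning the missing names, and leaving the HTML formatting to the caller, keeps the rule easy to test separately from the report layout.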

In [13]:
# These elements were taken from the Check Materials Citation dialog

mat_cit_child_fields = [
    'collectionCode', 'specimenCount', 'specimenCode', 'accessionNumber',
    'typestatus', 'collectingCountry', 'collectingRegion', 'collectingMunicipality',
    'collectingCounty', 'location', 'locationDeviation', 'originalDetermination',
    'determinerName', 'collectorName', 'collectingDate', 'collectedFrom',
    'collectingMethod', 'collectingPermit', 'geoCoordinate', 'elevation',
    'geologicalTimeScale', 'backReference'
]

In [14]:
# I am basically guessing at these. Needs to be checked.

mat_cit_attr_fields = [
    'collectionCode', 'specimenCount', 'specimenCode', 'accessionNumber',
    'typestatus', 'collectingCountry', 'collectingRegion', 'collectingMunicipality',
    'collectingCounty', 'location', 'locationDeviation', 'originalDetermination',
    'determinerName', 'collectorName', 'collectingDate', 'collectedFrom',
    'collectingMethod', 'collectingPermit', 'geoCoordinate', 'elevation',
    'geologicalTimeScale', 'backReference',
    'ID-GBIF-Occurrence', 'pageId', 'pageNumber'
]

In [15]:
collector_list = [
    'A. Cruz',
    'E. H. Bryan',
    'H. G. Hornbostel',
    'R. G. Oakley',
    'O. H. Swezey',
    'O. H. Swezey & R. L. Usinger',
    'Rowley',
    'R. L. Usinger',
    'R. L. Usinger & O. H. Swezey',
]

In [16]:
location_list = [
    'Agana',
    'Agana Swamp',
    'Agat',
    'Atao Beach',
    'Barrigada',
    'Dandan',
    'Dededo',
    'Guam',
    'Inarajan',
    'Machanao',
    'Merizo',
    'Mogfog',
    'Mt. Alifan',
    'Mt. Sasalaguan',
    'Orote Peninsula',
    'Piti',
    'Ritidian Point',
    'Rota Island',
    'Root School Farm',
    'Santa Rosa Peak',
    'Sumay Road',
    'Tarague Beach',
    'Yigo',
    'Yona',
]

In [17]:
# def check_required_fields(matcit):
#     html = ''
#     if matcit.get('collectingDate','') == '':
#         html += '<div class="notification is-danger">ERROR: no collectingDate</div>\n'
#     if matcit.get('collectorName','') == '':
#         html += '<div class="notification is-danger">ERROR: no collectorName</div>\n'
#     if matcit.get('collectingCountry','') == '':
#         html += '<div class="notification is-danger">ERROR: no collectingCountry</div>\n'
#     if (matcit.get('collectingMunicipality','') == '') and (matcit.get('location','') == ''):
#         html += '<div class="notification is-danger">ERROR: no collectingMunicipality or location</div>\n'
#     return html 

def check_for_unlisted_child_fields(matcit):
    # Re-parse the citation on its own so that only its direct children are inspected
    soup = BeautifulSoup(str(matcit), 'xml')
    li = soup.find('materialsCitation')
    children = li.findChildren(recursive=False)
    html = ''
    for child in children:
        if child.name not in mat_cit_child_fields:
            html += '<div class="notification is-info">'
            html += f'INFO: <b>{child.name}</b> is not a regular material citations child field'
            html += '</div>\n'
    # Return the accumulated notices (previously they were built and then discarded)
    return html

def check_for_unlisted_attribites(matcit):
    html = ''
    for attr in matcit.attrs:
        if attr not in mat_cit_attr_fields:
            html += '<div class="notification is-info">'
            html += f'INFO: <b>{attr}</b> is not a regular material citations attribute'
            html += '</div>\n'
    # Return the accumulated notices (previously they were built and then discarded)
    return html

def check_date(matcit, doc_attrs):
    if collectingDate := matcit.get('collectingDate'):
        if matches := re.search(r'(\d{4})', collectingDate):
            year = int(matches.group(1))
            docDate = doc_attrs.get('docDate')
            # Compare years as integers. Assumes the first 4-digit run in docDate
            # is the publication year; the comparison is skipped if there is none.
            doc_year = re.search(r'\d{4}', docDate or '')
            if doc_year and year > int(doc_year.group(0)):
                html = '<div class="notification is-danger">'
                html += f'ERROR: collectingDate year [{year}] is greater than publication date [{docDate}]'
                html += '</div>\n'
                return html
        else:
            return '<div class="notification is-danger">ERROR: collectingDate year not found</div>\n'
    else:
        return '<div class="notification is-danger">ERROR: no collectingDate</div>\n'
    return ''

def check_location(matcit):
    if location := matcit.get('location'):
        if location in location_list:
            return ''
        else:
            html = '<div class="notification is-info">'
            html += f'INFO: <b>location [{location}]</b> is not in location list.'
            html += '</div>\n'
            return html        
    else:
        return '<div class="notification is-danger">ERROR: no location.</div>\n'
    return ''

def check_collector(matcit):
    if collectorName := matcit.get('collectorName'):
        if collectorName in collector_list:
            return ''
        else:
            html = '<div class="notification is-info">'
            html += f'INFO: <b>collectorName [{collectorName}]</b> is not in collector list.'
            html += '</div>\n'
            return html        
    else:
        return '<div class="notification is-danger">ERROR: no collectorName</div>\n'
    return ''

def check_material_citation(matcit, doc_attrs):
    html = ''
    html += check_date(matcit, doc_attrs)
    html += check_location(matcit)
    html += check_collector(matcit)
    html += check_for_unlisted_attribites(matcit)
    html += check_for_unlisted_child_fields(matcit)
    return html

In [18]:
def check_material_citations():
    
    # Read xml file into a string
    
    with open(PATH_TO_XML_FILE, 'r') as f:
        s = f.read()

    soup = BeautifulSoup(s, 'xml')

    # List document attributes

    doc_attrs = soup.find('document').attrs
    html = f'<p class="title is-1">{doc_attrs["docTitle"]}</p>\n'
    html += f'<p class="subtitle is-3">uuid: {doc_attrs["docId"]}</p>\n'
    html += f'<p class="subtitle is-3">Report generated by MatCit-Validator at {datetime.utcnow()} UTC</p>\n'
    html += f'<hr>\n'
    html += f'<p class="title is-2">Document attributes</p>\n'

    for key in doc_attrs:
        html += f'<b>{key}:</b> {doc_attrs[key]}<br>\n'
    html += '<hr>'

    # Check material citations
    
    treatments = soup.find_all('treatment')
    for treatment in treatments:
        extract = treatment.text.split()[:4]
        extract = ' '.join(extract)
        html += f'<p class="title is-2">treatment: {extract} ...</p>\n'

        materialcitations = treatment.find_all('materialsCitation')
        for materialcitation in materialcitations:
            html += '<p class="title is-6">Materials citation</p>\n'
            html += f'<div class="notification is-info is-light">{materialcitation.text}</div>\n' 
            if gbif_occ_rec := materialcitation.attrs.get('ID-GBIF-Occurrence'):
                html += f'<p><a href="https://www.gbif.org/occurrence/{gbif_occ_rec}">View latest version of GBIF occurrence record</a></p><br>'
            html += '<p class="title is-6">Attributes</p>\n'
            for key in materialcitation.attrs:
                html += f'<b>{key}:</b> {materialcitation[key]}<br>\n'
            html += '<br>\n'
            html += '<p class="title is-6">Child nodes</p>\n'
            for child in materialcitation.findChildren(recursive=False):
                html += f'<b>{child.name}:</b> {child.text}<br>'            
            
            html += check_material_citation(materialcitation, doc_attrs)
            html += '<br><br><hr>\n'
    return html

In [19]:
def generate_html_report(mat_cit_chk_html, output_file):
    html = f'''
        <html>
            <head>
                <meta charset="utf-8">
                <meta name="viewport" content="width=device-width, initial-scale=1">
                <title>mat_cit_chk</title>
                <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.3/css/bulma.min.css">
            </head>
            <body>
                <section class="section">
                    <div class="container">
                        {mat_cit_chk_html}
                    </div>
                </section>
            </body>
        </html>        
        '''
    # Write as UTF-8 to match the charset declared in the report's <meta> tag
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(html)

In [20]:
# MAIN

mat_cit_chk_html = check_material_citations()
generate_html_report(mat_cit_chk_html, PATH_TO_XML_FILE.replace('.xml','.html'))
print('FINISHED')
print('Please remember to push to GitHub.')

FINISHED
