Issues with pycsw mapping ISO-DIF #657

Open
2 of 8 tasks
epifanio opened this issue Feb 15, 2021 · 3 comments

Comments

@epifanio
Contributor

epifanio commented Feb 15, 2021

Description

Problem: mapping of ISO records to DIF (using GCMD DIF type/subtype vocabulary).

Given an ISO-compliant metadata record, I ran into issues in the mapping to DIF at several levels. Two examples:

  • Data Access
  • Dataset Landing Page

Environment

  • operating system:
    • Linux - Ubuntu Server 20.04
  • Python version:
    • Python 3.8.x
  • pycsw version:
    • Git Master
  • source/distribution
    • git clone
    • DebianGIS/UbuntuGIS
    • PyPI
    • zip/tar.gz
    • other (please specify):
  • web server
    • Apache/mod_wsgi
    • CGI
    • other (please specify):

Steps to Reproduce

Indexing the following ISO Record:

Results in the following DIF profile

The DIF output doesn't match the information available in the original ISO source.

Data Access

  • Problem: mapping of protocols between ISO and DIF (using GCMD DIF type/subtype vocabulary).

Currently the protocol strings in the DIF output are copied unchanged from the ISO record.

Current DIF output

<dif:Related_URL>
  <dif:URL_Content_Type>
    <dif:Type>OPENDAP:OPENDAP</dif:Type>
  </dif:URL_Content_Type>
  <dif:URL>opendap url</dif:URL> 
  <dif:Description>None</dif:Description>
</dif:Related_URL>

<dif:Related_URL>
  <dif:URL_Content_Type>
    <dif:Type>download</dif:Type>
  </dif:URL_Content_Type>
  <dif:URL>http download url</dif:URL>
  <dif:Description>None</dif:Description>
</dif:Related_URL>

Expected DIF9.7 output

<Related_URL>
  <URL_Content_Type>
    <Type>GET DATA</Type>
    <Subtype>OPENDAP DATA (DODS)</Subtype>
  </URL_Content_Type>
  <URL>opendapurl</URL>
</Related_URL>

<Related_URL>
  <URL_Content_Type>
    <Type>GET SERVICE</Type>
    <Subtype>GET WEB MAP SERVICE (WMS)</Subtype>
  </URL_Content_Type>
  <URL>wmsurl</URL>
</Related_URL>

<Related_URL>
  <URL_Content_Type>
    <Type>GET DATA</Type>
  </URL_Content_Type>
  <URL>Http download url</URL>
</Related_URL>
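
One way to get from the current output to the expected one is a lookup table from the ISO protocol string to a GCMD Type/Subtype pair. The sketch below is only an illustration under a few assumptions: it uses the standard-library xml.etree.ElementTree rather than pycsw's own helpers, it ignores the dif: namespace, PROTOCOL_TO_GCMD and related_url are hypothetical names, and the OGC:WMS key is a guess, since the ISO protocol string for the WMS link is not shown in this issue.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup table: ISO protocol string -> (GCMD Type, GCMD Subtype).
# 'OPENDAP:OPENDAP' and 'download' are the strings shown in the current output;
# 'OGC:WMS' is an assumed key, not taken from this issue.
PROTOCOL_TO_GCMD = {
    'OPENDAP:OPENDAP': ('GET DATA', 'OPENDAP DATA (DODS)'),
    'OGC:WMS': ('GET SERVICE', 'GET WEB MAP SERVICE (WMS)'),
    'download': ('GET DATA', None),
}


def related_url(protocol, url):
    """Build a DIF Related_URL element for one ISO online resource."""
    # Unknown protocols fall back to the raw string, as in the current behaviour.
    gcmd_type, subtype = PROTOCOL_TO_GCMD.get(protocol, (protocol, None))
    node = ET.Element('Related_URL')
    content_type = ET.SubElement(node, 'URL_Content_Type')
    ET.SubElement(content_type, 'Type').text = gcmd_type
    # Subtype is optional: it is only written when the table provides one.
    if subtype is not None:
        ET.SubElement(content_type, 'Subtype').text = subtype
    ET.SubElement(node, 'URL').text = url
    return node


print(ET.tostring(related_url('OPENDAP:OPENDAP', 'opendapurl'), encoding='unicode'))
```

Keeping the table as plain data would make it easy to extend with further GCMD types without touching the element-building code.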

Dataset landing page

Current ISO record

<gmd:dataSetURI>
   <gco:CharacterString>Dataset landing page</gco:CharacterString>
</gmd:dataSetURI>

In DIF the landing page is exposed in two ways:

  • As a Related_URL using the type DATASET LANDING PAGE.

Expected DIF output

<Related_URL>
  <URL_Content_Type>
    <Type>DATASET LANDING PAGE</Type>
  </URL_Content_Type>
  <URL>dataset landing page url</URL>
</Related_URL>

  • As the Online_Resource in the Data_Set_Citation element. This is currently missing, and should be added to our reference ISO record.

Current DIF output:

<dif:Data_Set_Citation>
   <dif:Dataset_Creator/>
   <dif:Dataset_Release_Date/>
   <dif:Dataset_Publisher/>
   <dif:Data_Presentation_Form/>
</dif:Data_Set_Citation>

Expected DIF output

<Data_Set_Citation>
   <Dataset_Creator>xx</Dataset_Creator>
   <Dataset_Title>xx</Dataset_Title>
   <Dataset_Release_Date>2017-02-23T00:00:00Z</Dataset_Release_Date>
   <Dataset_Publisher>xx</Dataset_Publisher>
...
   <Online_Resource>Dataset landing page URI</Online_Resource>
</Data_Set_Citation>
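
Both representations can be produced from the same gmd:dataSetURI value. A minimal sketch, assuming the landing page URI has already been read from the ISO record into a plain string; landing_page_elements is a hypothetical helper, and it uses the standard-library ElementTree without the dif: namespace handling that dif.py would need:

```python
import xml.etree.ElementTree as ET


def landing_page_elements(landing_page_uri):
    """Return (Related_URL, Data_Set_Citation) elements for one landing page URI."""
    # 1) Related_URL with the GCMD type DATASET LANDING PAGE.
    related = ET.Element('Related_URL')
    content_type = ET.SubElement(related, 'URL_Content_Type')
    ET.SubElement(content_type, 'Type').text = 'DATASET LANDING PAGE'
    ET.SubElement(related, 'URL').text = landing_page_uri

    # 2) Online_Resource inside Data_Set_Citation; the other citation
    #    children (creator, title, ...) are left out of this sketch.
    citation = ET.Element('Data_Set_Citation')
    ET.SubElement(citation, 'Online_Resource').text = landing_page_uri
    return related, citation


related, citation = landing_page_elements('dataset landing page url')
print(ET.tostring(related, encoding='unicode'))
print(ET.tostring(citation, encoding='unicode'))
```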

Additional Information

There are other issues related to how the ISO keywords are mapped to DIF, in particular the GCMD Science Keywords.

In ISO we have:

<?xml version="1.0"?>
<gmd:descriptiveKeywords>
  <gmd:MD_Keywords>
    <gmd:keyword>
      <gco:CharacterString>
EARTH SCIENCE &gt; Atmosphere &gt; Atmospheric Temperature &gt; Surface Temperature &gt; Air Temperature
</gco:CharacterString>
    </gmd:keyword>
    <gmd:keyword>
      <gco:CharacterString>
EARTH SCIENCE &gt; Atmosphere &gt; Atmospheric Winds &gt; Surface Winds
</gco:CharacterString>
    </gmd:keyword>
    <gmd:keyword>
      <gco:CharacterString>
EARTH SCIENCE &gt; Atmosphere &gt; Atmospheric Water Vapor
</gco:CharacterString>
    </gmd:keyword>
    <gmd:thesaurusName>
      <gmd:CI_Citation>
        <gmd:title>
          <gco:CharacterString>gcmd</gco:CharacterString>
        </gmd:title>
      </gmd:CI_Citation>
    </gmd:thesaurusName>
  </gmd:MD_Keywords>
</gmd:descriptiveKeywords>

see reference ISO

  • This is mapped from apiso:Subject into csw:Keywords, which is then mapped to dif:Keyword in dif.py.

  • In principle it should instead be mapped into DIF 9 as Parameters (with sub-elements) when the thesaurus name is GCMD, and as Keyword (string) for any other thesaurus name.

As this is too complicated, I would try to handle only the GCMD thesaurus, so I need to map all ISO entries to Parameters in this structure:

<Parameters>
<Category>EARTH SCIENCE</Category>
<Topic>SPECTRAL/ENGINEERING</Topic>
<Term>RADAR</Term>
<Variable_Level_1>RADAR BACKSCATTER</Variable_Level_1>
</Parameters>

See http://metadata.nersc.no/oai?verb=ListRecords&metadataPrefix=dif for example

epifanio added a commit to epifanio/joint-ogc-osgeo-asf-sprint-2021 that referenced this issue Feb 16, 2021
I will be working full-time on pycsw during the sprint. I am focusing on understanding and possibly improving the mapping from ISO to other profiles like DIF. I opened a related issue at: geopython/pycsw#657
kalxas changed the title from "Issues with pyCSW mapping ISO-DIF" to "Issues with pycsw mapping ISO-DIF" Feb 16, 2021
@epifanio
Contributor Author

Regarding the last part of the issue, the one about keywords: to distinguish between ISO keywords with and without a thesaurus_name, would it make sense to have a column (which can be empty) to specify the 'dialect'/'flavour' of the ISO record (in my case GCMD), and then add some logic in the core code to distinguish between keywords with and without a thesaurus_name, which would affect the transformation into a specific output profile?
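
To make the question concrete, here is a rough sketch of the distinction being asked for: reading gmd:MD_Keywords blocks from an ISO record and grouping the keywords by thesaurus title, so that a later output step could treat GCMD keywords differently from the rest. It parses the XML directly with the standard library and does not reflect how pycsw stores keywords internally; keywords_by_thesaurus is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by ISO 19139 records.
NS = {
    'gmd': 'http://www.isotc211.org/2005/gmd',
    'gco': 'http://www.isotc211.org/2005/gco',
}


def keywords_by_thesaurus(xml_text):
    """Map thesaurus title (None when absent) to a list of keyword strings."""
    root = ET.fromstring(xml_text)
    grouped = {}
    for block in root.iter('{%s}MD_Keywords' % NS['gmd']):
        title = block.findtext(
            'gmd:thesaurusName/gmd:CI_Citation/gmd:title/gco:CharacterString',
            default=None, namespaces=NS)
        key = title.strip() if title else None
        for kw in block.findall('gmd:keyword/gco:CharacterString', NS):
            # Skip empty keywords and trim the surrounding whitespace.
            if kw.text and kw.text.strip():
                grouped.setdefault(key, []).append(kw.text.strip())
    return grouped
```

With a grouping like this, keywords under the 'gcmd' key could be written as Parameters and everything else as plain keywords, without relying on the presence of a > separator.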

@epifanio
Contributor Author

epifanio commented Feb 24, 2021

I may have found a little hack to tune the output the way I needed, by modifying 'dif.py':

    # keywords
    val = util.getqattr(result, context.md_core_model['mappings']['pycsw:Keywords'])

    if val:
        for kw in val.split(','):
            # GCMD science keywords are hierarchical: Category > Topic > Term > Variable levels
            values = [v.strip() for v in kw.split('>')]
            # at least three levels are needed, otherwise values[2] raises IndexError
            if len(values) >= 3:
                parameters = etree.SubElement(node, util.nspath_eval('dif:Parameters', NAMESPACES))
                etree.SubElement(parameters, util.nspath_eval('dif:Category', NAMESPACES)).text = values[0]
                etree.SubElement(parameters, util.nspath_eval('dif:Topic', NAMESPACES)).text = values[1]
                etree.SubElement(parameters, util.nspath_eval('dif:Term', NAMESPACES)).text = values[2]
                # any remaining levels become Variable_Level_1, Variable_Level_2, ...
                for i, v in enumerate(values[3:], start=1):
                    etree.SubElement(parameters, util.nspath_eval(f'dif:Variable_Level_{i}', NAMESPACES)).text = v
            else:
                # fewer than three levels: emit a plain keyword
                etree.SubElement(node, util.nspath_eval('dif:Keywords', NAMESPACES)).text = kw

Note that this only works for my specific case, where I am sure that all the GCMD keywords I need to parse use the > symbol as the separator.

The code above returns output of this form:

<dif:Parameters>
    <dif:Category>Earth Science</dif:Category>
    <dif:Topic>Atmosphere</dif:Topic>
    <dif:Term>Atmospheric radiation</dif:Term>
    <dif:Variable_Level_1>Reflectance</dif:Variable_Level_1>
</dif:Parameters>

From an ISO keyword like:

<gmd:keyword>
    <gco:CharacterString>
        EARTH SCIENCE > Atmosphere > Atmospheric Winds > Surface Winds
    </gco:CharacterString>
</gmd:keyword>

@tomkralidis
Member

@epifanio is this still an issue?
