Scheming support #281

Merged
merged 57 commits into from
Jul 5, 2024

amercader
Member

@amercader amercader commented May 22, 2024

This PR adds initial support for seamless integration between ckanext-dcat and ckanext-scheming, providing a custom profile that modifies the dataset dicts generated and consumed from the existing profiles so it plays well with the scheming presets defined.

Summary of changes

  • A scheming schema definition with the different DCAT properties defined (ckanext/dcat/schemas/dcat_ap_2.1.yaml)
  • Helper functions in the RDFProfile class to access schema field definitions from datasets and resources
  • A new euro_dcat_ap_scheming profile that adds support for the field serializations supported by the ckanext-scheming presets. The existing profiles (euro_dcat_ap and euro_dcat_ap_2) remain unchanged (except for some very minor backward-compatible changes regarding the handling of access services in distributions/resources). This means that existing sites will keep working as they do now, but maintainers can enable scheming support if they want to migrate to that approach. Upcoming DCAT 3 based profiles will be scheming based (in a new ckanext-dcat version)
  • Includes "end to end" tests for DCAT -> CKAN and CKAN -> DCAT using the new schema to ensure it works as expected

Compatibility and release plan

Extra care has been taken to not break any existing systems. Sites using the existing euro_dcat_ap and euro_dcat_ap_2 profiles should not see any change in their current parsing and serialization functionalities and these profiles will never change their outputs. Sites willing to migrate to a scheming based profile can do so by adding the new euro_dcat_ap_scheming profile at the end of their profile chain (value of the ckanext.dcat.rdf.profiles config option, e.g. ckanext.dcat.rdf.profiles = euro_dcat_ap_2 euro_dcat_ap_scheming), which will modify the existing profile outputs to the format expected by the scheming validators. Note that the scheming profile will only affect fields defined in the schema definition file, so sites can migrate different metadata fields gradually.

This compatibility profile will be released in the next ckanext-dcat version (1.8.0). The upcoming DCAT v3 based profiles for DCAT-AP 3 and DCAT-US 3 will be scheming based and will incorporate the mapping changes described below.

Mapping changes

The main changes between the old processors (parsers and serializers) and the new scheming-based ones are:

Root level fields

Custom DCAT fields that didn't link directly to standard CKAN fields were stored as extras (see all the ones marked extra: here). So the DCAT version_notes field would be stored as:

{
    "name": "test_dataset_dcat",
    "title": "Test dataset DCAT",
    "extras": [
         {"key": "version_notes", "value": "Some version notes"}
    ]
}

In the scheming-based profile, if the field is defined in the scheming schema, it will get stored as a root level field, like all custom dataset properties:

{
    "name": "test_dataset_dcat",
    "title": "Test dataset DCAT",
    "version_notes": "Some version notes"
}
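
As a rough illustration of the mapping change above (not the actual profile code), the move from extras to root level fields could be sketched like this. The `promote_extras` function and the `schema_field_names` argument are hypothetical names used only for this example; the real profile reads the field definitions from the scheming schema.

```python
def promote_extras(dataset_dict, schema_field_names):
    """Move extras whose key is defined in the schema to root-level fields.

    Hypothetical helper for illustration only: extras that are not part of
    the schema are left untouched in the "extras" list.
    """
    result = {k: v for k, v in dataset_dict.items() if k != "extras"}
    remaining = []
    for extra in dataset_dict.get("extras", []):
        if extra["key"] in schema_field_names:
            result[extra["key"]] = extra["value"]
        else:
            remaining.append(extra)
    if remaining:
        result["extras"] = remaining
    return result


old_style = {
    "name": "test_dataset_dcat",
    "title": "Test dataset DCAT",
    "extras": [
        {"key": "version_notes", "value": "Some version notes"},
        {"key": "custom_key", "value": "kept as an extra"},
    ],
}
new_style = promote_extras(old_style, {"version_notes"})
```

This mirrors the note in the release plan that only fields defined in the schema definition file are affected, so anything else stays where it was.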

List fields

The old profiles stored lists as JSON strings:

{
    "name": "test_dataset_dcat",
    "title": "Test dataset DCAT",
    "extras": [
         {"key": "conforms_to", "value":"[\"Standard 1\", \"Standard 2\"]"}
    ],
    "resources": [
        {
             "name": "Some resource",
             "documentation": "[\"http://dataset.info.org/distribution1/doc1\", \"http://dataset.info.org/distribution1/doc2\"]"
        }
    ]
}

By using the multiple_text preset, lists are now automatically handled:

{
    "name": "test_dataset_dcat",
    "title": "Test dataset DCAT",
    "conforms_to": [
         "Standard 1", 
         "Standard 2"
    ],
    "resources": [
        {
             "name": "Some resource",
             "documentation": [
                 "http://dataset.info.org/distribution1/doc1", 
                 "http://dataset.info.org/distribution1/doc2"
             ]
        }
    ]
}
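
For sites that need to handle both representations during a gradual migration, a small normalizer along these lines might help. This is only a sketch under the assumption that a value is either a real list or a JSON-encoded list string; `as_list` is a made-up name, not a function from ckanext-dcat or ckanext-scheming.

```python
import json


def as_list(value):
    """Return a list from either a real list or a JSON-encoded list string."""
    if isinstance(value, list):
        return value
    if not value:
        return []
    if isinstance(value, str):
        try:
            decoded = json.loads(value)
        except ValueError:
            # Not JSON at all: treat it as a single plain value
            return [value]
        return decoded if isinstance(decoded, list) else [value]
    return []


old_value = "[\"Standard 1\", \"Standard 2\"]"
new_value = ["Standard 1", "Standard 2"]
```

Both `as_list(old_value)` and `as_list(new_value)` give the same Python list, which is the shape the multiple_text preset works with.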

The form snippet UI allows users to provide multiple values:

Screenshot 2024-05-22 at 10-15-58 Dataset - CKAN

Repeating subfields

Mapping complex entities like dcat:contactPoint or dct:publisher was very limited, storing a subset of properties of just one linked entity as prefixed extras:

{
    "name": "test_dataset_dcat",
    "title": "Test dataset DCAT",
    "extras": [
        {"key":"contact_name","value":"PointofContact"},
        {"key":"contact_email","value":"contact@some.org"}
    ]
}

By using the repeating_subfields preset we can consume and present these as proper objects, and store multiple entities for those properties that have 0..n cardinality (see comment in "Issues"):

{
    "name": "test_dataset_dcat",
    "title": "Test dataset DCAT",
    "contact": [
        {
            "name": "Point of Contact 1",
            "email": "contact1@some.org"
        },
        {
            "name": "Point of Contact 2",
            "email": "contact2@some.org"
        }
    ]
}
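
To make the difference concrete, here is a minimal sketch of how the old prefixed extras could be folded into the new structure. The `extras_to_subfields` helper is hypothetical and only handles the single entity that the old profiles were able to store.

```python
def extras_to_subfields(extras, prefix, field_name):
    """Collect `<prefix>_*` extras into a one-item repeating subfields list.

    Hypothetical helper: the old extras could only describe one entity, so
    the resulting list has at most one item.
    """
    entity = {}
    for extra in extras:
        key = extra["key"]
        if key.startswith(prefix + "_"):
            # "contact_name" -> "name"
            entity[key[len(prefix) + 1:]] = extra["value"]
    return {field_name: [entity]} if entity else {}


extras = [
    {"key": "contact_name", "value": "PointofContact"},
    {"key": "contact_email", "value": "contact@some.org"},
]
contact_field = extras_to_subfields(extras, "contact", "contact")
```

With the repeating_subfields preset, further entities are simply additional items in the same list.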

Repeating subfields are also supported in resources/distributions. In this case complex objects like dcat:accessService were stored as JSON strings:

{
    "name": "test_dataset_dcat",
    "title": "Test dataset DCAT",
    "resources": [
        {
             "name": "Some resource",
             "access_services": "[{\"availability\": \"http://publications.europa.eu/resource/authority/planned-availability/AVAILABLE\", \"title\": \"Sparql-end Point\", \"endpoint_description\": \"SPARQL url description\", \"license\": \"http://publications.europa.eu/resource/authority/licence/COM_REUSE\", \"access_rights\": \"http://publications.europa.eu/resource/authority/access-right/PUBLIC\", \"description\": \"This SPARQL end point allow to directly query the EU Whoiswho content (organization / membership / person)\", \"endpoint_url\": [\"http://publications.europa.eu/webapi/rdf/sparql\"], \"uri\": \"\", \"access_service_ref\": \"N2ff5798aac56447e89438cc838512d26\"}]"
        }
    ]
}

They now appear as proper objects:

{
    "name": "test_dataset_dcat",
    "title": "Test dataset DCAT",
    "resources": [
        {
             "name": "Some resource",
             "access_services": [
                 {
                     "availability": "http://publications.europa.eu/resource/authority/planned-availability/AVAILABLE",
                     "title": "Sparql-end Point",
                     "endpoint_description": "SPARQL url description",
                     "license": "http://publications.europa.eu/resource/authority/licence/COM_REUSE",
                     "access_rights": "http://publications.europa.eu/resource/authority/access-right/PUBLIC",
                     "description": "This SPARQL end point allow to directly query the EU Whoiswho content (organization / membership / person)",
                     "endpoint_url": [
                         "http://publications.europa.eu/webapi/rdf/sparql"
                     ],
                     "uri": ""
                 }
             ]
        }
    ]
}

Again, these can be easily managed via the UI thanks to the scheming form snippets:

Screenshot 2024-05-22 at 10-56-35 Dataset - CKAN

Issues

  • For complex objects like dct:publisher that have 0..1 cardinality, I don't think CKAN supports "non-repeating" subfields so it makes sense to use the repeating_subfields one for now and create a new one in the future.
  • Scheming has presets for date and datetime with nice UI form snippets, so it's tempting to use them for properties like issued and modified. However, these properties also support other formats like xsd:gYear or xsd:gYearMonth, which will fail with these presets, so we can consider creating a new preset that extends the existing ones to support these formats.

@amercader amercader changed the title 56 add schema file dcat ap 2.1 Scheming support May 22, 2024
@@ -2003,3 +2093,122 @@ def _distribution_url_graph(self, distribution, resource_dict):
    def _distribution_numbers_graph(self, distribution, resource_dict):
        if resource_dict.get('size'):
            self.g.add((distribution, SCHEMA.contentSize, Literal(resource_dict['size'])))


# TODO: split all these classes in different files

+1

Member Author

Done in #282

@EricSoroos

EricSoroos commented May 22, 2024

@amercader I've been working in DCAT this week, including adding spec compliant HVD 2.2.0 output and scheming portions to the current 1.7 version. (somewhat split across dcat and our schema extension at the moment).

A couple of things have come up for making the output compliant with the HVD SHACL files (https://semiceu.github.io/DCAT-AP/releases/2.2.0-hvd/#validation):

  1. There are some items that need to be typed, e.g. licenses. This is a first cut, and I want to refactor this into the add_triples... methods: https://github.com/derilinx/ckanext-dcat/blob/dcat-hvd-2.2.0/ckanext/dcat/profiles.py#L914
    def _add_with_class(self, dataset_dict, dataset_ref, key, predicate, _type, _class, list_value=False):
        value = self._get_dataset_value(dataset_dict, key)

        def _add(v):
            ref = _type(v)
            self.g.add((ref, RDF.type, _class))
            self.g.add((dataset_ref, predicate, ref))

        if value:
            if list_value:
                for v in self._read_list_value(value):
                    _add(v)
            else:
                _add(value)
...
            self._add_with_class(resource_dict, distribution, 'license', DCT.license, URIRefOrLiteral, DCT.LicenseDocument)

gives us something like this:

...
<http://www.opendefinition.org/licenses/cc-by> a dct:LicenseDocument .

<http://data.europa.eu/eli/reg_impl/2023/138/oj> a <http://data.europa.eu/eli/ontology#LegalResource> .

<https://test.staging.derilinx.com/dataset/242e33cf-a097-4f59-94f3-25fcddeffaec/resource/52dcb446-d1f1-40d2-a515-bd708a57b9c6> a dcat:Distribution ;
    dcatap:applicableLegislation <http://data.europa.eu/eli/reg_impl/2023/138/oj> ;
    dct:format "HTML" ;
    dct:issued "2024-05-20T16:12:07"^^xsd:dateTime ;
    dct:license <http://www.opendefinition.org/licenses/cc-by> ;
    dct:modified "2024-05-20T17:00:51"^^xsd:dateTime ;
    dct:title "Test" ;
    dcat:accessURL <https://test.staging.derilinx.com/> .
  2. Codelists are important, e.g. the HVD Category needs to be from this list: https://op.europa.eu/en/web/eu-vocabularies/dataset/-/resource?uri=http://publications.europa.eu/resource/dataset/high-value-dataset-category (which, when it's not being slammed, has an RDF file with a skos:Concept and entries, each with a prefLabel in each official EU language). The download is here: https://op.europa.eu/o/opportal-service/euvoc-download-handler?cellarURI=http%3A%2F%2Fpublications.europa.eu%2Fresource%2Fcellar%2F29a21fd5-5c6f-11ee-9220-01aa75ed71a1.0001.02%2FDOC_1&fileName=high-value-dataset-category.rdf

The codelists get rendered like this in the .ttl:

<https://test.staging.derilinx.com/dataset/242e33cf-a097-4f59-94f3-25fcddeffaec> a dcat:Dataset ;
    dcatap:applicableLegislation <http://data.europa.eu/eli/dir/2007/2/2019-06-26>,
        <http://data.europa.eu/eli/reg_impl/2023/138/oj> ;
    dcatap:hvdCategory <http://data.europa.eu/bna/c_dd313021> ;
    dct:identifier "242e33cf-a097-4f59-94f3-25fcddeffaec" ;
    dct:issued "2024-05-20T16:11:40"^^xsd:dateTime ;
    dct:language "en" ;
    dct:modified "2024-05-20T17:00:51"^^xsd:dateTime ;
    dct:publisher <https://test.staging.derilinx.com/organization/b30a8777-1478-43e1-8dcb-9beded4f5052> ;
    dct:title "Test Dataset" ;
    dcat:distribution <https://test.staging.derilinx.com/dataset/242e33cf-a097-4f59-94f3-25fcddeffaec/resource/52dcb446-d1f1-40d2-a515-bd708a57b9c6> .

<http://data.europa.eu/bna/c_dd313021> a skos:Concept ;
    skos:inScheme <http://data.europa.eu/bna/asd487ae75> .

@amercader
Member Author

@EricSoroos thanks for the feedback:

  1. HVD 2.2.0 support sounds amazing. Is this developed in a separate profile built on top of euro_dcat_ap_2? If so, it would be great to have it upstream. I see it as not directly related to this PR; once support for DCAT-AP 2.1 is ready we can create a separate profile and schema for HVD 2.2.0
  2. More generally regarding SHACL validation, I've been thinking about integrating it as part of the test suite or even as a command that site maintainers can run as a way to "certify" support for the different DCAT specs
  3. Types: absolutely. I literally was thinking about this today in relation to the different types that dates can have according to the spec (see "Issues" in the description), but if it's a requirement of the SHACL validation even more so. I like the approach of extending the _add_triples_... utility functions
  4. Codelists: yes, that's definitely on the list for the DCAT v3 profiles, as there are controlled vocabularies used, but of course it also makes sense if needed for HVD. We could explore using choices or more likely choices_helper presets for required or recommended fields, with CLI commands to import them into the main or datastore database, plus a form snippet that shows the options (or autocompletes them if there are a lot of them)

Great to see you are working on this same area. If you have any feedback on the general approach followed for scheming support it would be great to hear it.

@amercader
Member Author

@seitenbau-govdata @bellisk would love to know your take on this, and see if this approach would play well with how you are using ckanext-dcat

@EricSoroos

In terms of HVD support, the current EU DCAT 2 implementation is close, at least, it has all of the fields. This commit: derilinx/ckanext-dcat@d5ef9f4 is the difference, and it's only the codelist and two types that were required. There are some other compliance issues, like one of the license or rights needs to be available, and the applicable_legislation has to have at least one specific value. I'm looking at validation level stuff for those (legislation already done, license/rights not). I'm at the point of thinking that these things are more general, so _add_hvd_category should be _add_from_codelist.

I'm not clear that we'd necessarily want to be adding a separate profile for this -- Inheritance is really tricky when you're blatting in items to a graph, and may need to override just one piece of it. From what I can tell, the extra profiles tend to be aggregative and compatible, so realistically, there are potentially a few extra fields per entity and/or additional codes/required fields. Also, I think that the changes here are more of the form of "potentially backwards incompatible fixing the implementation" rather than actually adding support for the profile.

FWIW, I think this has been the general take previously, e.g, the geo fields are added from GeoDCAT.

For the Codelists, (at least on the scheming side) I've got something like this in my schema:

    {
      "field_name": "hvd_category",
      "grouping": "High Value Datasets",
      "label": "High Value Dataset Category",
      "form_snippet": "select.html",
      "validators": "ignore_missing",
      "choices_helper": "dlxschema_codelist_choices",
      "codelist": "high-value-dataset-category",
      "help_text": {
        "en": "EU Category for HVD."
      }
    },

And then the choices_helper is this:

import json
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def _load_codelist(choices_path):
    """ Cache the json load, so that we're only actually reading once per invocation """
    return json.loads(choices_path.read_text())

def codelist_choices(field):
    """ Get the choices corresponding to the code list from the codelists directory

    :param field: dict, scheming field definition; its 'codelist' key holds the
        name of the codelist, not including the extension
    :returns: list of scheming choices
    """

    name = field.get('codelist', None)
    if not name:
        return []
    choices_path = Path(__file__).parent / 'codelists' / (name + ".json")
    # Fall back to no choices when the codelist file is missing
    if not choices_path.exists():
        return []

    choices = _load_codelist(choices_path)
    return choices

The codelist directory has the .rdf and a .json converted from it, with the languages I'm interested in (though realistically, it wouldn't hurt to put all the EU languages in)

[
  {
    "label": {
      "en": "Geospatial",
      "ga": "Geosp\u00e1s\u00fail"
    },
    "value": "http://data.europa.eu/bna/c_ac64a52d"
  },
  {
    "label": {
      "en": "Earth observation and environment",
      "ga": "Faire na cruinne agus an comhshaol"
    },
    "value": "http://data.europa.eu/bna/c_dd313021"
  },
...

Right now, this is spread over my schema plugin and the dcat plugin, but the next iteration is going to need to pull the codelists into dcat so that I can kick out the prefLabels there.

for index, item in enumerate(dataset_dict[field['field_name']]):
    for key in item:
        # Index a flattened version
        new_key = f'{field["field_name"]}_{index}_{key}'
Contributor

IIUC this would let us ask solr for things like 'datasets with the 6th name field in contacts equal to "frank"', but not 'datasets containing any contact named "frank"'. If we combined the same keys from all subfields into a single field like

new_key = f'{field["field_name"]}__{key}'

then we could, right? We're not doing dynamic solr schemas so we'd have to combine everything as text fields but it should work for text searches.

Also nothing here is specific to dcat so shouldn't it be a scheming PR?

Member Author

@amercader amercader May 23, 2024

Right, given this:

Screenshot 2024-05-23 at 13-26-38 Test dcat 1 - Dataset - CKAN

With your suggestion you need to run a query like q=contact__name:*Racoon* to get a hit; with the original it would be impossible to get this hit without knowing the index, so that's obviously better.
But I think users would expect to find these results in a free text search as well. For this, the subfields need to be indexed in an extras_* field, as these are copied to the catch-all text field. This would allow a plain q=Racoon query to get a result back, so maybe indexing

new_key = f'extras_{field["field_name"]}__{key}'

with the combined key values is the better approach.
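
A minimal sketch of that last option, assuming the values of each subfield key are simply joined as text (the `index_subfields` function and the joining with spaces are illustrative assumptions, not the code that ended up in the plugin):

```python
def index_subfields(dataset_dict, field_name):
    """Combine the values of each subfield key under one `extras_` index key.

    Illustrative only: values from all items are joined into a single text
    value per key, and empty values are skipped.
    """
    combined = {}
    for item in dataset_dict.get(field_name, []):
        for key, value in item.items():
            if not value:
                continue
            new_key = f"extras_{field_name}__{key}"
            combined.setdefault(new_key, []).append(str(value))
    return {key: " ".join(values) for key, values in combined.items()}


dataset = {
    "contact": [
        {"name": "Racoon", "email": "contact1@some.org"},
        {"name": "frank", "email": ""},
    ]
}
index_fields = index_subfields(dataset, "contact")
```

Here every contact name ends up in the same `extras_contact__name` value, so a search does not need to know the position of the item in the list.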

Also nothing here is specific to dcat so shouldn't it be a scheming PR?

Sure, I was just getting things working in this extension. Do you want to replace the logic in SchemingNerfIndexPlugin with this one or create a separate plugin? I'll send a PR

- field_name: endpoint_url
  label: Endpoint URL
  preset: multiple_text
Contributor

WDYT about tighter validation on this schema, e.g. BCP47 values in language, valid emails, URIs and URLs?

Member Author

I think it's a great idea, but I want to do it in a second stage once all the fields are defined in the schema and the general approach validated. I'll start to compile a list of possible validations; these are all great candidates

@EricSoroos

A little more thinking on the relationship between DCAT-AP base and the extension profiles (e.g. HVD, Geo, etc.).

I think that it would definitely make sense to have the individual profiles have either pluggable schema sections or diffs/inheritance against the core schema. E.g., Site A needs HVD, Site B needs HVD + Geo. We're using our schema_field_groups for this, so there's an HVD tab in the dataset view.

image

At the graph generation level, I don't know if there's a clean way to do this in an inherited manner. Right now, the EUDCAT2 is a combination of v1 + base + HVD + Geo. There's no issue adding the additional profiles here if the data doesn't support it.

Maybe a better way to do this would be composition rather than inheritance. E.g., have the profile configure a set of [ckan object]_to_graph methods, and those additional profiles would only be responsible for those items that aren't part of the base. As it is, it feels like the profile inheritance is quite chunky for adding a few fields.
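
To make the composition idea a bit more concrete, here is a very rough sketch. Everything in it is hypothetical (`ComposedProfile`, the step functions, and plain tuples standing in for rdflib triples); it is not how the ckanext-dcat profiles are currently structured.

```python
class ComposedProfile:
    """Runs a configurable list of dataset-to-graph steps in order."""

    def __init__(self, dataset_steps):
        self.dataset_steps = list(dataset_steps)

    def graph_from_dataset(self, dataset_dict):
        triples = []
        for step in self.dataset_steps:
            # Each step only contributes the triples it is responsible for
            triples.extend(step(dataset_dict))
        return triples


def base_dataset_to_graph(dataset_dict):
    # Stand-in for the base DCAT-AP mapping
    return [(dataset_dict["name"], "dct:title", dataset_dict["title"])]


def hvd_dataset_to_graph(dataset_dict):
    # Stand-in for an HVD add-on: emits nothing when the field is absent
    category = dataset_dict.get("hvd_category")
    if not category:
        return []
    return [(dataset_dict["name"], "dcatap:hvdCategory", category)]


site_a = ComposedProfile([base_dataset_to_graph, hvd_dataset_to_graph])
triples = site_a.graph_from_dataset({
    "name": "test",
    "title": "Test Dataset",
    "hvd_category": "http://data.europa.eu/bna/c_dd313021",
})
```

A site that also needs Geo would add one more step to the list instead of subclassing a whole profile.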

@amercader amercader marked this pull request as ready for review June 11, 2024 09:14
@amercader
Member Author

I think this is now ready to go, any further work should be done in separate PRs as this has grown quite a lot.

Highlights are:

@wardi
Contributor

wardi commented Jun 11, 2024

This looks good.

I would be tempted to put more of the logic in the schemas but this extension needs to maintain backwards compatibility and ckanext-scheming-less operation so your approach makes sense.

amercader and others added 6 commits June 12, 2024 12:46
Co-authored-by: Ian Ward <ian@excess.org>
As this is a `text` field that allows free text search
Scheming adds a dict with empty keys when empty
repeating subfields are submitted from the form. Check that there's an
actual value before creating the triples when serializing
except Invalid:
    raise Invalid(
        _(
            "Date format incorrect. Supported formats are YYYY, YYYY-MM, YYYY-MM-DD and YYYY-MM-DDTHH:MM:SS"
@EricSoroos EricSoroos Jun 24, 2024

This is overly restrictive. The first datetime I attempted to parse with the dcat harvester was of the form: YYYY-MM-DDTHH:MM:SS.000Z, which is permitted by ISO8601 (https://en.wikipedia.org/wiki/ISO_8601) and the xsd:dateTime (https://www.w3.org/TR/xmlschema11-2/#dateTime) spec.

It appears that this goes back to the ckan helper date_str_to_datetime which "Converts an ISO like date to a datetime", using a 12-year-old, regex-based, timezone-ignoring date parser. This predates Python having reasonably sensible timezone-aware datetimes and the datetime.datetime.fromisoformat function.

The core helper probably should be fixed, with its potential for knock on changes, but in the meantime since this is new code, perhaps we should just directly use datetime.datetime.fromisoformat here. (python3.7 min for that). We'd still need to support YYYY and YYYY-MM specially in the code, because those aren't covered.

There's also a version of this in scheming (https://github.com/ckan/ckanext-scheming/blob/master/ckanext/scheming/helpers.py#L299) which also does manual date parsing, but again looks replaceable with fromisoformat.

Member Author

datetime.datetime.fromisoformat() is quite limited up until Python 3.11:

Changed in version 3.11: Previously, this method only supported formats that could be emitted by date.isoformat() or datetime.isoformat().

So on python 3.10 and lower these dates are not parsed:

Python 3.10.10 (main, Mar 14 2023, 15:55:23) [GCC 9.4.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.13.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import datetime

In [2]: datetime.datetime.fromisoformat("2011-11-04T00:05:23Z")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[2], line 1
----> 1 datetime.datetime.fromisoformat("2011-11-04T00:05:23Z")

ValueError: Invalid isoformat string: '2011-11-04T00:05:23Z'

What do you think of the approach I followed in c7b8c02? If something is not an xsd:gYear, an xsd:gYearMonth or an xsd:date we just let dateutil parse it (which will accept the timezone values you suggested). If that parses we serve it as an xsd:dateTime.
The tests check that 1) the input value is unchanged and 2) it's served with the correct xsd data type.
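
The idea can be sketched roughly as below. Note that this is not the code in that commit: to keep the example dependency-free it swaps dateutil for the standard library's `datetime.fromisoformat` with a manual replacement of the trailing "Z", and returning `None` for unparseable values is an assumption of this sketch. The `xsd_date_type` name is made up for the example.

```python
import re
from datetime import datetime

# Checked in order; these only test the shape, not e.g. the month range
_PATTERNS = [
    (re.compile(r"\d{4}"), "xsd:gYear"),
    (re.compile(r"\d{4}-\d{2}"), "xsd:gYearMonth"),
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "xsd:date"),
]


def xsd_date_type(value):
    """Guess the xsd data type to serve for a date string, or None."""
    for pattern, xsd_type in _PATTERNS:
        if pattern.fullmatch(value):
            return xsd_type
    # Stand-in for dateutil: older Pythons reject a trailing "Z", so it is
    # rewritten as an explicit UTC offset before parsing
    candidate = value[:-1] + "+00:00" if value.endswith("Z") else value
    try:
        datetime.fromisoformat(candidate)
    except ValueError:
        return None
    return "xsd:dateTime"
```

The input value itself is never modified here; only the data type used for serialization is decided.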

Looks better than doing it ourselves. You're not testing for invalid dates, or ... conditionally valid dates like M/D/Y, but overall, I'm definitely happier with getting a builtin date parser that handles the in-the-wild formats that I've seen.

Member Author

Added tests in 39b4d91

@amercader amercader merged commit 53cedb9 into master Jul 5, 2024
8 checks passed
@amercader amercader deleted the 56-add-schema-file-dcat-ap-2.1 branch July 5, 2024 10:08
@amercader
Member Author

Merging this big chunk of changes, any followup like #288 we can do in smaller PRs
