## DCAT endpoints

Recommendations for the Quality POC:
- Working on real-time data is not required; a dump of the graphs or of the PostgreSQL database is sufficient.

Useful links:
- DCAT endpoint of the CKAN POC, example: https://preprod.data.developpement-durable.gouv.fr/dcat/catalog/jsonld?page=1
- DCAT endpoint of data.gouv, example: https://www.data.gouv.fr/catalog.jsonld?page=1
- GeoNetwork DCAT endpoint via rdf.search, example: https://catalogue.datara.gouv.fr/geonetwork/srv/fre/rdf.search? (to be verified; prefer the DCAT output exposed through CSW).
- GeoNetwork DCAT endpoint via CSW (IGN Géoservices): https://data.geopf.fr/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetRecords&ElementSetName=full&ResultType=results&MaxRecords=10&OutputFormat=application/xml&OutputSchema=http%3A%2F%2Fwww.w3.org%2Fns%2Fdcat%23&NAMESPACE=xmlns%28dcat%3Dhttp%3A%2F%2Fwww.w3.org%2Fns%2Fdcat%23%29&TypeNames=dcat
- GeoNetwork DCAT endpoint via CSW from Geo-IDE: http://catalogue.geo-ide.developpement-durable.gouv.fr/catalogue/srv/eng/csw-moissonnable?service=CSW&REQUEST=GetRecordById&version=2.0.2&namespace=xmlns:csw=http://www.opengis.net/cat/csw&outputFormat=application/xml&outputSchema=http://www.w3.org/ns/dcat%23&ElementSetName=full&Id=fr-120066022-jdd-a307d028-d9d2-4605-a1e5-8d31bc573bef
- [data.gouv API documentation](https://doc.data.gouv.fr/api/reference/#/), in particular */site/catalog.{format}*
- [data.gouv documentation](https://doc.data.gouv.fr/api/telecharger-un-catalogue-de-donnees/) specifically on downloading and browsing the catalogue in RDF.
- [GeoNetwork documentation](https://geonetwork-opensource.org/manuals/4.0.x/en/api/rdf-dcat.html) for the DCAT endpoint.
- [RDFLib Python library](https://rdflib.readthedocs.io/en/stable/intro_to_graphs.html): navigating an RDF graph.

Use URI paths in the queries.

Tests:
- CKAN POC: compare the time needed to request a graph across different pages.
- Extract metadata and convert it to a DataFrame.
- Compare the DCAT outputs of data.gouv, GeoNetwork and the CKAN POC.

### CKAN POC DCAT

#### API response time

In [1]:
import rdflib
import time
import pandas as pd

In [None]:
CKAN_INTEGRATION = "https://integration.data.e2.rie.gouv.fr/dcat/catalog/jsonld?page=1"
CKAN_PREPROD = "https://preprod.data.developpement-durable.gouv.fr/dcat/catalog/jsonld?page=1"

graph_ckan = rdflib.Graph().parse(CKAN_INTEGRATION)

In [None]:
# Measure the API response time (~30-40 s per page)
pages = [1, 50, 100, 200]

for page in pages:
    try:
        # perf_counter is a monotonic clock, better suited to timing than time.time
        start_time = time.perf_counter()
        rdflib.Graph().parse(f"https://preprod.data.developpement-durable.gouv.fr/dcat/catalog/jsonld?page={page}")
        print(f"Page {page}: {time.perf_counter() - start_time:.1f} s")
    except Exception as exception:
        print(f"Page {page}: ", exception)
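
A single measurement per page is sensitive to network noise. The sketch below shows one way to repeat each measurement and keep the median; `time_pages` and `fake_fetch` are hypothetical names, and the fetch function is injected so the example runs offline.

```python
import statistics
import time
from typing import Callable, Iterable


def time_pages(
    pages: Iterable[int],
    fetch: Callable[[int], object],
    repeats: int = 3,
) -> dict[int, float]:
    """Return the median elapsed time in seconds for each page."""
    timings = {}
    for page in pages:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            fetch(page)
            samples.append(time.perf_counter() - start)
        timings[page] = statistics.median(samples)
    return timings


def fake_fetch(page: int) -> None:
    """Stand-in for a real request: sleeps a little longer for higher pages."""
    time.sleep(0.001 * page)


timings = time_pages([1, 2], fake_fetch)
print(timings)
```

Against the real endpoint, `fetch` could be a function that calls `rdflib.Graph().parse(...)` with the page URL; exceptions raised by `fetch` are not caught here and would stop the run.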

In [None]:
# Dump each page of the DCAT output to a JSON-LD (.json) file
from pathlib import Path
from datetime import date

DUMP_DIR = Path("tmp/dumps")
DUMP_DIR.mkdir(parents=True, exist_ok=True)

for page in range(1, 254):
    # One file per page and per day, so an interrupted run can be resumed
    file_path = DUMP_DIR / f"integration_page_{page}_{date.today()}.json"
    if file_path.exists():
        continue
    print(f"Processing page {page}")
    try:
        graph_ckan = rdflib.Graph().parse(f"https://integration.data.e2.rie.gouv.fr/dcat/catalog/jsonld?page={page}")
        graph_ckan.serialize(destination=file_path, format="json-ld", auto_compact=True)
    except Exception as exception:
        print(f"Failed to load graph for page {page}: {exception}")

#### Extracting the dataset URI, title, description, modification date, access rights and keywords

In [2]:
from typing import Optional

# Bracket access (DCT["title"]) is used because Namespace is a str subclass,
# so attribute access would clash with str methods such as .title
DCAT = rdflib.Namespace("http://www.w3.org/ns/dcat#")
DCT = rdflib.Namespace("http://purl.org/dc/terms/")


class DCATReaderCKAN:
    """Read a DCAT graph (URL or local dump) and flatten its datasets into a DataFrame."""

    def __init__(self, graph_path: str):
        self.graph_path = graph_path
        self._graph = rdflib.Graph().parse(graph_path)

    def get_data(self) -> pd.DataFrame:
        """Return one row per dcat:Dataset found in the graph."""
        data = []
        for uri in self._get_datasets_uri():
            data.append(
                {
                    "dataset": uri,
                    "title": self._get_dataset_title(uri),
                    "description": self._get_dataset_description(uri),
                    "modification": self._get_dataset_modification(uri),
                    "right_statement": self._get_dataset_right_statement(uri),
                    "key_words": self._get_dataset_key_words(uri),
                }
            )
        return pd.DataFrame(data)

    def _get_datasets_uri(self) -> list[rdflib.term.Node]:
        return list(self._graph.subjects(rdflib.RDF.type, DCAT["Dataset"]))

    # Graph.value returns None when the dataset has no such property
    def _get_dataset_title(self, dataset_uri: rdflib.term.Node) -> Optional[rdflib.term.Node]:
        return self._graph.value(subject=dataset_uri, predicate=DCT["title"])

    def _get_dataset_description(self, dataset_uri: rdflib.term.Node) -> Optional[rdflib.term.Node]:
        return self._graph.value(subject=dataset_uri, predicate=DCT["description"])

    def _get_dataset_modification(self, dataset_uri: rdflib.term.Node) -> Optional[rdflib.term.Node]:
        return self._graph.value(subject=dataset_uri, predicate=DCT["modified"])

    def _get_dataset_right_statement(self, dataset_uri: rdflib.term.Node) -> str:
        # dct:accessRights points to an intermediate node whose rdfs:label holds the text
        labels = []
        for rights_node in self._graph.objects(dataset_uri, DCT["accessRights"]):
            labels.extend(self._graph.objects(rights_node, rdflib.RDFS.label))
        return " ".join(labels)

    def _get_dataset_key_words(self, dataset_uri: rdflib.term.Node) -> str:
        return " ".join(self._graph.objects(dataset_uri, DCAT["keyword"]))

In [4]:
import glob

# Build one DataFrame per dump file, then stack them into a single table
frames = [
    DCATReaderCKAN(dump_path).get_data()
    for dump_path in sorted(glob.glob("tmp/dumps/*.json"))
]

data = pd.concat(frames, ignore_index=True)
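
Once the metadata sits in a DataFrame, a first quality indicator could be the fill rate of each column. The sketch below assumes that "quality" includes completeness, which the notebook does not state; `completeness`, `is_filled` and the sample rows are hypothetical.

```python
import pandas as pd


def is_filled(value) -> bool:
    """True when the value is neither missing nor a blank string."""
    return value is not None and not pd.isna(value) and str(value).strip() != ""


def completeness(frame: pd.DataFrame) -> pd.Series:
    """Share of rows holding a usable value, for each column."""
    filled = frame.apply(lambda column: column.map(is_filled))
    return filled.mean()


# Hypothetical sample standing in for the concatenated metadata table
example = pd.DataFrame(
    {
        "title": ["A", "B", None, "D"],
        "description": ["text", "", "  ", "text"],
    }
)

rates = completeness(example)
print(rates)
```

Blank strings are counted as missing because joined fields such as `right_statement` and `key_words` come back as `""` when the dataset carries no value.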

### data.gouv
#### Extracting the dataset URI, title, description, modification date, licence and temporal coverage

In [None]:
graph = rdflib.Graph().parse("https://www.data.gouv.fr/catalog.jsonld?page=1")

In [None]:
DCAT = rdflib.Namespace("http://www.w3.org/ns/dcat#")
DCT = rdflib.Namespace("http://purl.org/dc/terms/")

data = []
for dataset in graph.subjects(rdflib.RDF.type, DCAT["Dataset"]):
    # In DCAT 2, dcat:startDate and dcat:endDate belong to the dct:PeriodOfTime
    # node reached through dct:temporal, not to the dataset itself
    period = graph.value(subject=dataset, predicate=DCT["temporal"])
    data.append([dataset,
                 graph.value(subject=dataset, predicate=DCT["title"]),
                 graph.value(subject=dataset, predicate=DCT["description"]),
                 graph.value(subject=dataset, predicate=DCT["modified"]),
                 graph.value(subject=dataset, predicate=DCT["license"]),
                 graph.value(subject=period, predicate=DCAT["startDate"]) if period is not None else None,
                 graph.value(subject=period, predicate=DCAT["endDate"]) if period is not None else None,
                 ])

datasets_data_gouv = pd.DataFrame(data, columns=["Dataset",
                                                 "Title",
                                                 "Description",
                                                 "Modified",
                                                 "Licence",
                                                 "Start Date",
                                                 "End Date"])