# **Agroterm 2025 - LLOD Notebook**

In this tutorial, we demonstrate how to transform structured data from Excel (XSL) files into a Linked Open Data (LOD) representation using the **Simple Knowledge Organization System (SKOS)** format. The focus is on **Agroterm**, a taxonomy developed within the **MADIN TERM** project, which aims to organize and publish agricultural terminology as an open, interoperable resource.

This notebook is designed for **students and researchers** interested in practical applications of the **Linguistic Linked Open Data (LLOD)** paradigm. It provides a modular and reproducible workflow for converting domain-specific vocabularies into SKOS-compliant RDF data.

## 🧭 Structure of the tutorial

1. **Introduction to SKOS**  
   Overview of the SKOS data model and its importance for representing terminologies.

2. **Data Preparation**  
   Loading and inspecting structured input data from Excel files.

3. **Transformation Process**  
   Using Python to convert tabular data into RDF, mapping it to SKOS concepts and properties.

4. **SKOS Construction**  
   Generation of concepts, preferred labels, and semantic relations such as `skos:broader`, `skos:narrower`, and `skos:related`.

5. **Export and Integration**  
   Saving the resulting taxonomy in RDF/Turtle format, ready for validation, publication, or integration into LOD infrastructures such as Skosmos.

## 🎯 Goal

To support the creation of **FAIR** and **interoperable** terminological resources by enabling the transformation of structured agricultural data into SKOS Linked Data, fostering wider dissemination and reuse through open data platforms.

## 📦 Installing Required Dependencies

The following Python packages are required to run this notebook. Each of them plays a specific role in the data transformation pipeline:

```python
# Used to make HTTP requests (e.g., to retrieve data or metadata from remote sources)
!pip install requests

# Powerful library for data manipulation and analysis; used to read and process Excel/CSV files
!pip install pandas

# Required by pandas to properly read `.xlsx` Excel files
!pip install openpyxl

# Provides YAML parsing and writing capabilities; used when working with configuration files (e.g., mappings)
!pip install ruamel.yaml

# Lightweight YAML processor tailored for RML/YARRRML transformations; useful in RDF generation pipelines
!pip install yatter

# Morph-KGC is a tool for knowledge graph construction based on R2RML/RML; used here for converting tabular data to RDF
!pip install morph-kgc

In [1]:
!pip install requests
!pip install pandas
!pip install openpyxl
!pip install ruamel.yaml
!pip install yatter
!pip install morph-kgc



## 📥 Importing Libraries

We now import the necessary Python libraries used throughout the notebook. Each module serves a specific function in the transformation and RDF generation process:

```python
# For sending HTTP requests
import requests

# For handling JSON data
import json

# For file and path manipulations
import os

# To generate hashes (e.g., for unique identifiers)
import hashlib

# For disabling SSL warnings in HTTP requests (optional)
import urllib3
import warnings

# Data analysis and tabular data processing
import pandas as pd

# For printing detailed error traces during debugging
import traceback

# For regular expressions and text processing
import re

# For reading YAML configuration files
import yaml
from ruamel.yaml import YAML  # More robust YAML parser used by Morph-KGC

# Lightweight library to process YARRRML files (RML simplified in YAML)
import yatter

# Morph-KGC: generates RDF from tabular data using RML or R2RML mappings
import morph_kgc

# RDFLib: for building and manipulating RDF graphs in Python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import SKOS, RDF  # Common RDF vocabularies

# For URL encoding and parsing
import urllib.parse


In [2]:
import requests
import json      
import os        
import hashlib   
import urllib3   
import warnings
import pandas as pd
import traceback
import re
import yaml
import yatter
from ruamel.yaml import YAML
import morph_kgc
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import SKOS, RDF  # Import RDF
import urllib.parse

You have loaded all tools! Now, Now let's get into the operational part.

## 📄 Loading and Preparing Excel Data

We begin by loading the Agroterm taxonomy data from an Excel file. The file includes two sheets:

- **`metadata`**: general information about the taxonomy (e.g., title, authors, version).
- **`concepts`**: the list of terms/concepts to be converted into SKOS.

The code below also includes a normalization step to make column names easier to handle programmatically.


In [3]:
# Import necessary libraries
import pandas as pd
from IPython.display import display  # For better display of DataFrames in the notebook

# Define the path to the Excel file
excel_file_path = './src/agroterm_2025_rev.xlsx'  # Make sure that the file is named agroterm_2025_rev.xlsx

# Load the two sheets into separate DataFrames
try:
    metadata = pd.read_excel(excel_file_path, sheet_name='metadata')
    concepts = pd.read_excel(excel_file_path, sheet_name='concepts')
    print("✅ Excel file loaded successfully.")
except FileNotFoundError:
    print(f"❌ Error: The file '{excel_file_path}' was not found.")
except ValueError as e:
    print(f"❌ Error: {e}")

# Function to normalize column names
def normalize_column_names(df):
    df.columns = (
        df.columns
        .str.strip()          # Remove leading/trailing spaces
        .str.lower()          # Convert to lowercase
        .str.replace(' ', '_') # Replace spaces with underscores
        .str.replace(r'[^\w]', '', regex=True)  # Remove any non-alphanumeric characters
    )
    return df

# Normalize column names in both metadata and concepts
metadata = normalize_column_names(metadata)
concepts = normalize_column_names(concepts)

# Display the DataFrames nicely
print("\n🔎 Preview of normalized metadata:")
display(metadata)

print("\n🔎 Preview of normalized concepts:")
display(concepts)

✅ Excel file loaded successfully.

🔎 Preview of normalized metadata:


Unnamed: 0,autore,data,descrizione,link,titolo
0,Osservatorio di terminologie e politiche lingu...,2025-05-01,Il progetto MADIN-TERM nasce dalla collaborazi...,https://centridiricerca.unicatt.it/otpl-proget...,AGROTERM Taxonomy



🔎 Preview of normalized concepts:


Unnamed: 0,termine,regione,dominio,sottodominio,certificazione,area_geografica,definizione_it,note_it,fonti,definizione_en,definizione_es,definizione_fr,definizione_de
0,Arrosticini (s. m. pl.),ABRUZZO,prodotti agroalimentari,Carni fresche,PAT,Abruzzo (fascia montana e pedemontana- collin...,"Carne di ovino adulto, che si presenta tagliat...",Possono essere conditi con aromi naturali (pep...,https://www.regione.abruzzo.it/system/files/ag...,"Meat obtained from muttons, cubed to 1 cm squa...","Carne de ovino adulto, cortada en dados de apr...",,
1,Caciocavallo abruzzese (s. m.),ABRUZZO,prodotti agroalimentari,Latte e derivati,PAT,Province di Chieti e L'Aquila,"Prodotto caseario a pasta filata e semidura, o...",Il sapore del Caciocavallo abruzzese è dolce ...,https://www.regione.abruzzo.it/system/files/ag...,"Dairy product made up of a soft, compact paste...","Producto lácteo de textura compacta y blanda, ...",,
2,Caciofiore aquilano (s. m.),ABRUZZO,prodotti agroalimentari,Latte e derivati,PAT,Provincia di L'Aquila,"Prodotto caseario a pasta molle, ottenuto da l...",È un formaggio da pronto consumo. Essendo un c...,https://www.regione.abruzzo.it/system/files/ag...,"Soft cheese, made from whole sheep's milk, wit...","Queso de pasta blanda, elaborado con leche ent...",,
3,Canestrato di Castel Monte (s. m.),ABRUZZO,prodotti agroalimentari,Latte e derivati,PAT,Provincia di L'Aquila,Prodotto caseario a pasta dura di forma cilin...,Il formaggio presenta una crosta esterna che r...,https://www.regione.abruzzo.it/system/files/ag...,"Hard cheese, made from raw whole sheep's milk,...","Queso duro, elaborado con leche entera cruda d...",,
4,Carota dell'Altopiano del Fucino (s. f.),ABRUZZO,prodotti agroalimentari,Ortofrutticoli,IGP,Provincia di L'Aquila,"Ortaggio della specie Daucus carota L., coltiv...",La Carota dell'Altopiano del Fucino deriva dal...,https://www.politicheagricole.it/flex/cm/pages...,"Carrot of the species Dacus carota L., grown i...","Hortaliza de la especie Dacus carota L., culti...",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
351,Recioto della Valpolicella (s. m.),VENETO,prodotti enologici e distillati,Vini,DOP,19 comuni in provincia di Verona,Vino rosso ottenuto da uve Corvina dal 45% al ...,Il grado alcolico minimo è del 12% Vol. È vie...,https://www.consorziovalpolicella.it/,Red wine made from Corvina (45% to 95%) and Ro...,Vino tinto elaborado con uvas Corvina (45% a 9...,,
352,Riso Nano Vialone Veronese (s. m.),VENETO,prodotti agroalimentari,Cereali,IGP,25 comuni della provincia di Verona,Riso ottenuto esclusivamente da semi della var...,"La lotta alle erbe infestanti, prima che con g...",1. https://www.regione.veneto.it 2. https://ww...,Rice obtained exclusively from seeds of the Vi...,Arroz obtenido exclusivamente a partir de semi...,,
353,Soave (s. m.),VENETO,prodotti enologici e distillati,Vini,DOP,13 comuni in provincia di Verona,Vino bianco ottenuto da uve Garganega al minim...,"Il grado alcolico minimo è del 10,5% Vol. Le o...",https://www.ilsoave.com/disciplinare/,"White wine made from Garganega (minimum 70%), ...",Vino blanco elaborado con uvas Garganega (70% ...,,
354,Soppressa Vicentina (s. f.),VENETO,prodotti agroalimentari,Salumi / insaccati,DOP,Provincia di Vicenza,Salume ottenuto da carni di suini di razza Lar...,La macellazione e la trasformazione della carn...,https://www.regione.veneto.it,Salume obtained from the meat of pigs belongin...,Salume obtenido de la carne de cerdos pertenec...,,


## 🧾 Exporting Metadata to JSON

In this section, we extract metadata information from the Excel sheet and convert it into a JSON file. This structured metadata will later be used to document and enrich the RDF output, and may include project-level details like the taxonomy title, version, creator, and supported languages.

The following steps are performed:

1. Convert the first row of the `metadata` DataFrame into a Python dictionary.
2. Normalize values to ensure compatibility with JSON (e.g., converting timestamps to ISO 8601 format).
3. Manually add a list of language codes (`ita`, `spa`, `eng`) following ISO 639-3.
4. Create a target directory (`./json`) if it doesn’t exist.
5. Save the metadata dictionary to `metadata.json` with proper UTF-8 encoding and indentation.
6. Preview the saved content.



In [4]:
import os
import json
import pandas as pd

# 1. Convert the first row of the metadata DataFrame into a dictionary
metadata_dict = metadata.iloc[0].to_dict()

# 2. Normalize the values: convert Timestamps and other non-serializable objects
for key, value in metadata_dict.items():
    if pd.isna(value):
        metadata_dict[key] = None  # If NaN, set to null
    elif isinstance(value, pd.Timestamp):
        metadata_dict[key] = value.isoformat()  # Convert to ISO 8601 string
    else:
        metadata_dict[key] = str(value) if not isinstance(value, (str, int, float, bool, type(None))) else value

# 3. Add the 'languages' key with ISO 639-3 codes
metadata_dict['languages'] = [
    {"code": "ita"},
    {"code": "spa"},
    {"code": "eng"}
]

# 4. Ensure the 'json' folder exists
os.makedirs('./json', exist_ok=True)

# 5. Define the output path
metadata_json_path = './json/metadata.json'

# 6. Save the enriched metadata dictionary to a JSON file
with open(metadata_json_path, 'w', encoding='utf-8') as f:
    json.dump(metadata_dict, f, ensure_ascii=False, indent=4)

# 7. Confirm the operation
print(f"✅ Metadata with languages saved successfully to {metadata_json_path}")

# 8. Preview the JSON content
print("\n🔎 Preview of metadata.json content:")
display(metadata_dict)


✅ Metadata with languages saved successfully to ./json/metadata.json

🔎 Preview of metadata.json content:


{'autore': 'Osservatorio di terminologie e politiche linguistiche, Università Cattolica del Sacro Cuore, Milano',
 'data': '2025-05-01T00:00:00',
 'descrizione': "Il progetto MADIN-TERM nasce dalla collaborazione tra l'Osservatorio di Terminologie e Politiche Linguistiche (OTPL) dell'Università Cattolica del Sacro Cuore e CLARIN-IT con l'obiettivo di diffondere la terminologia del Made in Italy, partendo dai risultati ottenuti dall'OTPL nell'ambito del progetto AGROTERM in cui è stata raccolta la terminologia dei prodotti italiani DOP, DOC e IGP.  Il progetto MADIN-TERM ha l’obiettivo di creare un banca dati terminologica nel rispetto dei principi FAIR. Per i prodotti selezionati, verrà proposta una definizione in italiano e per promuovere la comunicazione internazionale, in altre lingue, quali inglese, tedesco, francese e spagnolo. ",
 'link': 'https://centridiricerca.unicatt.it/otpl-progetti-sostenibilita-ambientale-e-alimentare',
 'titolo': 'AGROTERM Taxonomy',
 'languages': [{'code

## 🧠 Generating and Saving `concepts.json`

This section processes the list of concepts from the Excel sheet and transforms each row into a structured JSON object. The final output is saved as `concepts.json`, ready for further conversion to SKOS/RDF.

The processing pipeline includes:

1. **Text normalization functions**: utility functions to clean and standardize strings, remove accents, and prepare URI-safe fragments.
2. **Geographical area dictionary**: a normalized mapping from area labels to alias IDs (e.g., `area_1`, `area_2`).
3. **Concept object creation**: iterate through the `concepts` DataFrame and generate a structured dictionary for each concept, including:
   - Preferred label (`termine`) and alternative labels (`altLabels`);
   - Normalized values for region, domain, subdomain, certification;
   - List of regions (`regioni`) linked to each concept;
   - Sources (`fonti`) as a list of URLs or strings.
4. **Save to JSON**: store the resulting list of concepts and the area mapping dictionary into the `./json/` folder.



In [5]:
import os
import json
import pandas as pd
import re
import unicodedata
from datetime import datetime

def clean_string_for_uri(text):
    if not isinstance(text, str):
        text = str(text)
    text = text.strip()
    text = re.sub(r'[^\w\sàèéìòùÀÈÉÌÒÙ]', '', text, flags=re.UNICODE)
    text = re.sub(r'\s+', '_', text)
    return text.lower()

def remove_accents(text):
    return ''.join(
        c for c in unicodedata.normalize('NFD', text)
        if unicodedata.category(c) != 'Mn'
    )

def clean_text_value(text):
    if not isinstance(text, str):
        text = str(text)
    return re.sub(r'\s+', ' ', text.strip())

def remove_marc_morf(text):
    return re.sub(r'\s*\(.*?\)', '', text).strip()

def clean_certificazione(text):
    if not isinstance(text, str):
        text = str(text)
    text = text.strip()
    if text.lower() in ["nessuna certificazione", "nessuna classificazione"]:
        return text.lower().replace(" ", "_")
    text = re.sub(r'\.', '', text)
    text = re.sub(r'\s+', '_', text)
    text = re.sub(r'[^A-Z0-9_]', '', text)
    return text

# --- 1. Costruisci dizionario area_geografica normalizzata → area_n
area_geo_set = set()

for val in concepts['area_geografica']:
    label = clean_text_value(val) if pd.notna(val) else "None"
    norm = clean_string_for_uri(label)
    norm_ascii = remove_accents(norm)
    area_geo_set.add(norm_ascii)

area_geo_dict = {norm: f"area_{i+1}" for i, norm in enumerate(sorted(area_geo_set))}

# --- 2. Generazione oggetti JSON concetto
concepts_list = []
all_columns = concepts.columns.tolist()
concept_counter = 1

for index, row in concepts.iterrows():
    concept_obj = {}

    termine_raw = row.get('termine', "")
    if pd.isna(termine_raw) or termine_raw == "":
        concept_uri_fragment = f"concept_{concept_counter}"
        pref_label = ""
        alt_labels = []
    else:
        term_variants = [t.strip() for t in re.split(r'[;,|\n]', termine_raw) if t.strip()]
        full_form = clean_text_value(term_variants[0]) if term_variants else ""
        pref_label = remove_marc_morf(full_form)
        concept_uri_raw = clean_string_for_uri(pref_label)
        concept_uri_fragment = f"{remove_accents(concept_uri_raw)}_{concept_counter}"
        concept_obj["marc_morf"] = full_form
        alt_labels = []
        for variant in term_variants[1:]:
            cleaned = clean_text_value(variant)
            alt_labels.append({
                "altLabel": remove_marc_morf(cleaned),
                "altLabel_marc_morf": cleaned,
                "alt_concept": concept_uri_fragment
            })

    concept_obj["termine"] = pref_label
    concept_obj["concept"] = concept_uri_fragment
    concept_obj["altLabels"] = alt_labels

    for col in all_columns:
        col_lower = col.lower()
        if col_lower in ["termine", "variante", "ultima_revisione", "fonti", "regione", "area_geografica"]:
            continue

        value = row.get(col, None)
        concept_obj[col] = "" if pd.isna(value) else clean_text_value(value) if isinstance(value, str) else value

    # Fonti
    fonti_val = row.get("fonti", "")
    if pd.notna(fonti_val) and isinstance(fonti_val, str) and fonti_val.strip():
        # Suddivide SOLO per newline (gestisce anche \r\n e \n)
        fonti_raw_list = [f.strip() for f in fonti_val.strip().splitlines() if f.strip()]
        concept_obj["fonti"] = [
            {"url": fonte, "concept": concept_uri_fragment}
            for fonte in fonti_raw_list
        ]
    else:
        concept_obj["fonti"] = []

    # Area geografica
    raw_area_geo_label = clean_text_value(row.get('area_geografica', ""))
    area_geo_norm = clean_string_for_uri(raw_area_geo_label)
    area_geo_ascii = remove_accents(area_geo_norm)
    area_geo_id = area_geo_dict.get(area_geo_ascii, "area_0")  # fallback

    # Regione
    raw_regione = row.get('regione', "")
    regione_norm = clean_string_for_uri(raw_regione)
    concept_obj['regione_normalizzata'] = regione_norm
    concept_obj['dominio_normalizzato'] = clean_string_for_uri(row.get('dominio', ""))
    concept_obj['sottodominio_normalizzato'] = clean_string_for_uri(row.get('sottodominio', ""))
    concept_obj['certificazione_normalizzata'] = clean_certificazione(row.get('certificazione', ""))

    # Lista regioni
    regioni_list = []
    if pd.notna(raw_regione) and isinstance(raw_regione, str) and raw_regione.strip():
        raw_regioni_split = [r.strip() for r in re.split(r'[;,|\n]', raw_regione) if r.strip()]
        for regione_item in raw_regioni_split:
            regione_item_norm = clean_string_for_uri(regione_item)
            regioni_list.append({
                "label": regione_item,
                "ref_regione": regione_item_norm,
                "ref_concept": concept_uri_fragment,
                "ref_area_geografica_label": raw_area_geo_label,
                "ref_area_geografica": area_geo_id  # usa alias area_n
            })
    concept_obj["regioni"] = regioni_list

    concepts_list.append(concept_obj)
    concept_counter += 1

# --- Salvataggio JSON
concepts_json_path = './json/concepts.json'
area_dict_path = './json/area_geo_mapping.json'
os.makedirs('./json', exist_ok=True)

with open(concepts_json_path, 'w', encoding='utf-8') as f:
    json.dump(concepts_list, f, ensure_ascii=False, indent=4)

with open(area_dict_path, 'w', encoding='utf-8') as f:
    json.dump(area_geo_dict, f, ensure_ascii=False, indent=4)

print(f"✅ Concepts saved to: {concepts_json_path}")
print(f"📘 Area geographic aliases saved to: {area_dict_path}")

# Anteprima
print("\n🔎 Preview of first 3 concepts:")
for concept in concepts_list[:3]:
    display(concept)


✅ Concepts saved to: ./json/concepts.json
📘 Area geographic aliases saved to: ./json/area_geo_mapping.json

🔎 Preview of first 3 concepts:


{'marc_morf': 'Arrosticini (s. m. pl.)',
 'termine': 'Arrosticini',
 'concept': 'arrosticini_1',
 'altLabels': [],
 'dominio': 'prodotti agroalimentari',
 'sottodominio': 'Carni fresche',
 'certificazione': 'PAT',
 'definizione_it': 'Carne di ovino adulto, che si presenta tagliata a cubetti di circa 1 cm, di colore rosso più o meno intenso, infilati in spiedini di legno.',
 'note_it': 'Possono essere conditi con aromi naturali (peperoncino, salvia, cipolla) oppure misti, con l’aggiunta di carne di suino o bovino.',
 'definizione_en': 'Meat obtained from muttons, cubed to 1 cm square, which varies in intesity of a red colour and threaded in wooden skewers.',
 'definizione_es': 'Carne de ovino adulto, cortada en dados de aproximadamente 1 cm, de color rojo más o menos intenso, ensartados en pinchos de madera.',
 'definizione_fr': '',
 'definizione_de': '',
 'fonti': [{'url': 'https://www.regione.abruzzo.it/system/files/agricoltura/pord_agroalimentari/Atlante_prodotti_tipici.pdf',
   'con

{'marc_morf': 'Caciocavallo abruzzese (s. m.)',
 'termine': 'Caciocavallo abruzzese',
 'concept': 'caciocavallo_abruzzese_2',
 'altLabels': [],
 'dominio': 'prodotti agroalimentari',
 'sottodominio': 'Latte e derivati',
 'certificazione': 'PAT',
 'definizione_it': 'Prodotto caseario a pasta filata e semidura, ottenuto da latte intero crudo di vacca con aggiunta di caglio e sale, che presenta forma a pera dal peso non inferiore a 1 kg.',
 'note_it': 'Il sapore del Caciocavallo abruzzese è dolce e pastoso quando è ancora fresco, intenso e piccante con la stagionatura. Il latte munto non viene pastorizzato in quanto avviene a una temperatura inferiore ai 40°C.',
 'definizione_en': 'Dairy product made up of a soft, compact paste. It is made with full raw cow’s milk, rennet and salt, it has a smooth outer surface and an unusual pear shape, weighing over 1 kg.',
 'definizione_es': 'Producto lácteo de textura compacta y blanda, obtenido a partir de leche entera cruda de vaca con adición de cu

{'marc_morf': 'Caciofiore aquilano (s. m.)',
 'termine': 'Caciofiore aquilano',
 'concept': 'caciofiore_aquilano_3',
 'altLabels': [],
 'dominio': 'prodotti agroalimentari',
 'sottodominio': 'Latte e derivati',
 'certificazione': 'PAT',
 'definizione_it': "Prodotto caseario a pasta molle, ottenuto da latte intero ovino, con l'aggiunta di caglio di carciofo e zafferano, che si presenta in forma cilindrica, con crosta fine.",
 'note_it': 'È un formaggio da pronto consumo. Essendo un caciofiore è prodotto usando il "fiore del latte" cioè la parte grassa che affiora in superficie.',
 'definizione_en': "Soft cheese, made from whole sheep's milk, with the addition of artichoke rennet and saffron, which has cylindrical shape and thin crust.",
 'definizione_es': 'Queso de pasta blanda, elaborado con leche entera de oveja, con la adición de cuajo de alcachofa y azafrán, que tiene forma cilíndrica y corteza fina.',
 'definizione_fr': '',
 'definizione_de': '',
 'fonti': [{'url': 'https://www.reg

This is a schematic structure of the resultant json:

![Concept JSON Structure](./assets/concept_structure.png)

## 🌐 Defining the Base URI and Prefixes

To ensure that all generated RDF resources are uniquely and consistently identified, we define:

### 1. **Base URI**

This is the base namespace under which all concepts, labels, and metadata for the Agroterm taxonomy will be created:

In [6]:
# Define the base URI for the vocabulary
BASE_URI = "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"

### 2. **RDF Prefix Mapping (YAML)**

In this block, we define a set of RDF prefixes in YAML format. These prefixes are used to shorten URIs when generating RDF triples, especially during the mapping phase with tools like **Morph-KGC** or **YARRRML/YATTER**.

```yaml
prefixes:
  dc: "http://purl.org/dc/elements/1.1/"              # Dublin Core basic metadata elements
  dct: "http://purl.org/dc/terms/"                    # Dublin Core extended terms
  iso639-3: "http://iso639-3.sil.org/code/"           # ISO 639-3 language codes
  skos: "http://www.w3.org/2004/02/skos/core#"        # SKOS vocabulary for thesauri and concept schemes
  xsd: "http://www.w3.org/2001/XMLSchema#"            # XML Schema datatypes (e.g., xsd:string, xsd:date)
  agroterm: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"  # Custom namespace for this vocabulary


In [7]:
prefixes_raw_mapping = """
prefixes:
  dc: "http://purl.org/dc/elements/1.1/"
  dct: "http://purl.org/dc/terms/"
  iso639-3: "http://iso639-3.sil.org/code/"
  skos: "http://www.w3.org/2004/02/skos/core#"
  xsd: "http://www.w3.org/2001/XMLSchema#"
  agroterm: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"
"""

### 🗺️ Constructing YARRRML Raw Mapping for `skos:ConceptScheme`

In this step, we define the **YARRRML raw mapping** that will be used to generate RDF triples describing the **Agroterm concept scheme**, based on metadata extracted from the file `metadata.json`.

### 🟦 Define JSON source path


This variable stores the path to the JSON file containing the metadata info (`metadata.json`). It will be referenced by only one mapping block using JSONPath expressions.

In [23]:
metadata_source = "./json/metadata.json"

#### 🔹 `agroterm_conceptscheme`

This rule maps the **main concept scheme resource** with basic metadata using JSONPath to extract values:

```yaml
agroterm_conceptscheme:
  sources:
    - ['./json/metadata.json~jsonpath', "$"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"
  predicateobjects:
    - [a, skos:ConceptScheme]                              # Declares the resource as a ConceptScheme
    - [dc:description, $(descrizione), it~lang]            # Description in Italian
    - [dc:title, $(titolo), it~lang]                       # Title in Italian
    - [dc:created, $(data), xsd:date]                      # Creation date
    - [dc:creator, $(autore), it~lang]                     # Author
    - [dct:source, $(link)]                                # Source or external reference


In [22]:
concept_scheme_raw_mapping = f"""
agroterm_conceptscheme:
  sources:
    - ['{metadata_source}~jsonpath', "$"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"
  predicateobjects:
    - [a, skos:ConceptScheme]
    - [dc:description, $(descrizione), it~lang]
    - [dc:title, $(titolo), it~lang]
    - [dc:created, $(data), xsd:date]
    - [dc:creator, $(autore), it~lang]
    - [dct:source, $(link)]

agroterm_conceptscheme_languages:
  sources:
    - ['{metadata_source}~jsonpath', "$.languages[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"
  predicateobjects:
    - [dct:language, "http://iso639-3.sil.org/code/$(code)"]
"""

### 🧩 Constructing YARRRML Mappings for Concepts

In the following cells, we define the **YARRRML raw mappings** used to transform the concept-level data in `concepts.json` into SKOS triples. Each mapping targets a specific aspect of the concept representation: identity, labels, definitions, alternative labels, and sources.

### 🟦 Define JSON source path


This variable stores the path to the JSON file containing the list of concept objects (`concepts.json`). It will be referenced by multiple mapping blocks using JSONPath expressions.

In [10]:
concepts_source = "./json/concepts.json"

### 🟩 Mapping SKOS Concepts

```yaml
agroterm_concepts:
  sources:
    - ['./json/concepts.json~jsonpath', "$[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(concept)"
  predicateobjects:
    - [a, skos:Concept]
    - [skos:prefLabel, $(termine), it~lang]
    - [skos:definition, $(definizione_it), it~lang]
    - [skos:definition, $(definizione_en), en~lang]
    - [skos:definition, $(definizione_es), es~lang]
    - [skos:note, $(marc_morf), it~lang]
    - [skos:inScheme, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"]
```

This mapping transforms each concept into a `skos:Concept` with:

* A **URI** based on the normalized identifier `$(concept)`
* A preferred label (`skos:prefLabel`)
* Definitions in **Italian, English, and Spanish**
* Morphological/variant notes (`skos:note`)
* A link to the concept scheme (`skos:inScheme`)

> 📘 JSONPath `"$[*]"` iterates over all top-level concept objects.


In [11]:
concepts_raw_mapping = f"""
agroterm_concepts:
  sources:
    - ['{concepts_source}~jsonpath', "$[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(concept)"
  predicateobjects:
    - [a, skos:Concept]
    - [skos:prefLabel, $(termine), it~lang]
    - [skos:definition, $(definizione_it), it~lang]
    - [skos:definition, $(definizione_en), en~lang]
    - [skos:definition, $(definizione_es), es~lang]
    - [skos:note, $(marc_morf), it~lang]
    - [skos:inScheme, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"]
"""

### 🟨 Mapping Alternative Labels (altLabels)

```yaml
agroterm_altlabels:
  sources:
    - ['./json/concepts.json~jsonpath', "$[*].altLabels[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(alt_concept)"
  predicateobjects:
    - [skos:altLabel, $(altLabel), it~lang]
    - [skos:note, $(altLabel_marc_morf), it~lang]
```

This mapping handles **alternative lexical variants** of the main term. For each `altLabel` object:

* An `skos:altLabel` is assigned to the corresponding concept (`$(alt_concept)`)
* The full morphological form is recorded in a `skos:note`

> 🏷️ Useful for exposing synonyms and inflected forms in the vocabulary.

In [12]:
concepts_altlabel_raw_mapping = f"""
agroterm_altlabels:
  sources:
    - ['{concepts_source}~jsonpath', "$[*].altLabels[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(alt_concept)"
  predicateobjects:
    - [skos:altLabel, $(altLabel), it~lang]
    - [skos:note, $(altLabel_marc_morf), it~lang]
"""

### 🟫 Cella 4 – Mapping Sources (`fonti`)

```yaml
agroterm_fonti:
  sources:
    - ['./json/concepts.json~jsonpath', "$[*].fonti[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(concept)"
  predicateobjects:
    - [dct:source, $(url)]
```

This mapping associates **source references** to each concept:

* Each source is expressed using `dct:source` with a URL or reference string
* Sources are attached to the correct concept via the `$(concept)` ID

> 🔎 These links can refer to bibliographic references, datasets, or institutional glossaries.

In [13]:
concepts_fonti_raw_mapping = f"""
agroterm_fonti:
  sources:
    - ['{concepts_source}~jsonpath', "$[*].fonti[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(concept)"
  predicateobjects:
    - [dct:source, $(url)]
"""

### 🗃️ Constructing SKOS Collections and Memberships

In this block, we define multiple **YARRRML mappings** to create `skos:Collection` resources and organize concepts into meaningful groups (regioni, certificazioni, domini, sottodomini, aree geografiche). These collections allow better navigation and semantic grouping all within the SKOS model.

### 🔹 agroterm_collections_regioni

Creates a single SKOS Collection called `"Regioni"` and adds all normalized region terms as `skos:member`.

```yaml
subject: .../gruppo_regioni
  - [skos:member, .../$(ref_regione)]
````

---

### 🔹 agroterm\_collections\_certificazioni

Creates a `"Certificazioni"` collection and includes all `certificazione_normalizzata` values as members.

```yaml
subject: .../gruppo_certificazioni
  - [skos:member, .../$(certificazione_normalizzata)]
```

---

### 🔹 agroterm\_collections\_domini

Defines a `"Domini"` collection grouping the different domain categories.

```yaml
subject: .../gruppo_domini
  - [skos:member, .../$(dominio_normalizzato)]
```

---

## 🔁 Nested Membership Mappings

These mappings define **the actual structure and hierarchy** within each collection. They ensure that the right concepts are included as members of the corresponding collection.

---

### 🔸 agroterm\_concept\_member\_regioni

For each individual region, creates a `skos:Collection` with:

* `skos:prefLabel` from the region label
* `skos:member` pointing to the related geographical area

---

### 🔸 agroterm\_concept\_member\_area\_geografica

Each geographical area becomes a collection with:

* Label from `ref_area_geografica_label`
* `skos:member` for every concept in that area

---

### 🔸 agroterm\_concept\_member\_certificazioni

For each certification (normalized), this creates a collection with:

* Label from the `certificazione` field
* Members being all concepts associated with it

---

### 🔸 agroterm\_concept\_member\_domini

Builds domain collections and links them to their respective `sottodomini`.

```yaml
subject: .../$(dominio_normalizzato)
  - [skos:member, .../$(sottodominio_normalizzato)]
```

---

### 🔸 agroterm\_concept\_member\_sottodomini

Each subdomain becomes a collection of all concepts associated with it.

```yaml
subject: .../$(sottodominio_normalizzato)
  - [skos:member, .../$(concept)]
```

---

> 🧭 **Purpose**: These mappings provide a semantically rich structure to the vocabulary by grouping concepts into hierarchical and thematic collections, improving **navigability**, **querying**, and **visual browsing** in tools like **Skosmos**.

> 💡 All collections use the same base URI and belong to the same SKOS scheme via `skos:inScheme`.

In [14]:
collections_raw_mapping = f"""
agroterm_collections_regioni:
  sources:
    - ['{concepts_source}~jsonpath', "$[*].regioni[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/gruppo_regioni"
  po:
    - [a, skos:Collection]
    - [skos:prefLabel, "Regioni", it~lang]
    - [skos:member, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(ref_regione)"]
    - [skos:inScheme, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"]

agroterm_collections_certificazioni:
  sources:
    - ['{concepts_source}~jsonpath', "$[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/gruppo_certificazioni"
  po:
    - [a, skos:Collection]
    - [skos:prefLabel, "Certificazioni", it~lang]
    - [skos:member, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(certificazione_normalizzata)"]
    - [skos:inScheme, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"]

agroterm_collections_domini:
  sources:
    - ['{concepts_source}~jsonpath', "$[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/gruppo_domini"
  po:
    - [a, skos:Collection]
    - [skos:prefLabel, "Domini", it~lang]
    - [skos:member, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(dominio_normalizzato)"]
    - [skos:inScheme, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"]


    
agroterm_concept_member_regioni:
  sources:
    - ['{concepts_source}~jsonpath', "$[*].regioni[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(ref_regione)"
  po:
    - [a, skos:Collection]
    - [skos:prefLabel, "$(label)", it~lang]
    - [skos:member, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(ref_area_geografica)"]
    - [skos:inScheme, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"]


agroterm_concept_member_area_geografica:
  sources:
    - ['{concepts_source}~jsonpath', "$[*].regioni[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(ref_area_geografica)"
  po:
    - [a, skos:Collection]
    - [skos:prefLabel, "$(ref_area_geografica_label)", it~lang]
    - [skos:member, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(ref_concept)"]
    - [skos:inScheme, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"]

agroterm_concept_member_certificazioni:
  sources:
    - ['{concepts_source}~jsonpath', "$[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(certificazione_normalizzata)"
  po:
    - [a, skos:Collection]
    - [skos:prefLabel, "$(certificazione)", it~lang]
    - [skos:member, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(concept)"]
    - [skos:inScheme, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"]

agroterm_concept_member_domini:
  sources:
    - ['{concepts_source}~jsonpath', "$[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(dominio_normalizzato)"
  po:
    - [a, skos:Collection]
    - [skos:prefLabel, "$(dominio)", it~lang]
    - [skos:member, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(sottodominio_normalizzato)"]
    - [skos:inScheme, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"]

agroterm_concept_member_sottodomini:
  sources:
    - ['{concepts_source}~jsonpath', "$[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(sottodominio_normalizzato)"
  po:
    - [a, skos:Collection]
    - [skos:prefLabel, "$(sottodominio)", it~lang]
    - [skos:member, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(concept)"]
    - [skos:inScheme, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"]
"""

### 📦 Wrap-up: Generate the Final YARRRML Mapping File

In this step, we combine all previously defined mapping blocks and save them into a single **YARRRML YAML file**, ready for use by RDF generation tools like **Morph-KGC** or **YATTER**.

The process consists of:

---

#### 🔹 1. Indentation Function

YARRRML requires that all mapping blocks appear under the `mappings:` key with correct indentation. The `indent_mapping_block()` function:
- Takes a raw YAML block (e.g., `concepts_raw_mapping`)
- Adds 2 spaces of indentation to each non-empty line
- Returns the correctly indented block as a string

```python
def indent_mapping_block(block_text):
    ...
````

---

#### 🔹 2. Concatenation of All Parts

We build the full YAML by combining:

* The prefix block (`prefixes_raw_mapping`)
* The keyword `mappings:`
* All mapping blocks, properly indented

```python
full_mapping = (
    prefixes_raw_mapping + "\n\n" +
    "mappings:\n" +
    indent_mapping_block(concept_scheme_raw_mapping) + ...
)
```

---

#### 🔹 3. Saving to Disk

* A folder `./yaml/` is created if it doesn’t exist
* The complete YAML content is saved to `agroterm_mapping_collections.yml`

```python
output_yaml_path = './yaml/agroterm_mapping_collections.yml'
with open(output_yaml_path, 'w', encoding='utf-8') as f:
    f.write(full_mapping)
```

✅ **Result**: A complete and valid **YARRRML mapping file** that covers:

* Concept scheme metadata
* Concepts and labels
* Sources and multilingual definitions
* SKOS collections and groupings

> 🗂️ You can now process this file with tools like **YATTER** or **Morph-KGC** to generate the corresponding RDF triples.

In [15]:
import os

# Tutti gli altri blocchi vanno indentati di 2 spazi per essere sotto 'mappings:'
def indent_mapping_block(block_text):
    indented_lines = []
    for line in block_text.strip().splitlines():
        if line.strip():  # evita righe vuote
            indented_lines.append(f"  {line}")
        else:
            indented_lines.append("")
    return "\n".join(indented_lines)
    
# Costruzione finale del file
full_mapping = (
    prefixes_raw_mapping + "\n\n" +
    "mappings:\n" +
    indent_mapping_block(concept_scheme_raw_mapping) + "\n\n" +
    indent_mapping_block(concepts_raw_mapping) + "\n\n" +
    indent_mapping_block(concepts_altlabel_raw_mapping) + "\n\n" +
    indent_mapping_block(concepts_fonti_raw_mapping) + "\n\n" +
    indent_mapping_block(collections_raw_mapping)
)

# 2. Ensure the 'yaml' folder exists
os.makedirs('./yaml', exist_ok=True)

# 3. Save the final YAML file
output_yaml_path = './yaml/agroterm_mapping_collections.yml'

with open(output_yaml_path, 'w', encoding='utf-8') as f:
    f.write(full_mapping)

print(f"✅ YAML mapping file successfully created at: {output_yaml_path}")

✅ YAML mapping file successfully created at: ./yaml/agroterm_mapping_collections.yml


### 🔄 Convert YARRRML to RML (Turtle format)

In this final step, we transform the previously generated **YARRRML mapping file** into an **RML file** (RDF Mapping Language) in Turtle syntax. This RML file can be directly used with tools like **Morph-KGC** to generate RDF triples from the input data sources.

---

#### 🛠️ Step-by-step Breakdown

```python
import os
import traceback
from ruamel.yaml import YAML
import yatter
````

1. **Define input/output paths**
   Set the directory for the YAML file (`yaml/`) and the destination for the generated RML file (`rml/`):

```python
yaml_dir = "yaml"
rml_dir = "rml"
input_yaml_filename = "agroterm_mapping_collections.yml"
```

2. **Create the output folder if it doesn't exist**:

```python
if not os.path.exists(rml_dir):
    os.makedirs(rml_dir)
    print(f"✅ Folder created: {rml_dir}")
```

3. **Load the YAML file using `ruamel.yaml`**
   This ensures correct parsing of the YARRRML file:

```python
yaml_loader = YAML(typ='safe', pure=True)
...
yarrrml_content = yaml_loader.load(yarrrml_file)
```

4. **Convert YARRRML to RML using `yatter.translate()`**
   The resulting RDF mapping is serialized in Turtle format. We also replace the default namespace from `semweb.mmlab.be` to the standardized `w3id.org` URI:

```python
rml_output = yatter.translate(yarrrml_content)
rml_output = rml_output.replace("http://semweb.mmlab.be/ns/rml#", "http://w3id.org/rml/")
```

5. **Save the RML output to disk**:

```python
with open(rml_file_path, "w", encoding="utf-8") as rml_file:
    rml_file.write(rml_output)
```

6. **Error handling**
   Any exception in loading, translating or writing is caught and logged with `traceback.print_exc()` for debugging.

---

> ✅ **Output**:
>
> * A valid RML file in Turtle syntax, located in `rml/agroterm_mapping_collections.rml.ttl`
> * Ready for use with Morph-KGC to produce RDF from JSON or CSV data.

> 🧪 Tip: You can validate the generated `.rml.ttl` file using RDF tools like [RDF Playground](https://rdfplayground.dcc.uchile.cl/) or `riot` from Apache Jena.


In [16]:
import os
import traceback
from ruamel.yaml import YAML
import yatter

# 1. Define the paths
yaml_dir = "yaml"
rml_dir = "rml"
input_yaml_filename = "agroterm_mapping_collections.yml"

# 2. Ensure the 'rml' output directory exists
if not os.path.exists(rml_dir):
    os.makedirs(rml_dir)
    print(f"✅ Folder created: {rml_dir}")

# 3. Initialize YAML loader
yaml_loader = YAML(typ='safe', pure=True)

# 4. Define full paths
yaml_file_path = os.path.join(yaml_dir, input_yaml_filename)
rml_file_path = os.path.join(rml_dir, input_yaml_filename.replace(".yml", ".rml.ttl"))

# 5. Process the YARRRML file
try:
    # Load the YARRRML content
    with open(yaml_file_path, "r", encoding="utf-8") as yarrrml_file:
        yarrrml_content = yaml_loader.load(yarrrml_file)

    # Translate YARRRML to RML
    rml_output = yatter.translate(yarrrml_content)

    # Replace namespace if needed
    rml_output = rml_output.replace("http://semweb.mmlab.be/ns/rml#", "http://w3id.org/rml/")

    # Save the RML output
    with open(rml_file_path, "w", encoding="utf-8") as rml_file:
        rml_file.write(rml_output)

    print(f"✅ RML file successfully created at: {rml_file_path}")

except Exception as e:
    print(f"❌ Failed to convert {input_yaml_filename} to RML. Error: {e}")
    traceback.print_exc()

2025-05-12 20:05:15,135 | INFO: Translating YARRRML mapping to [R2]RML
2025-05-12 20:05:15,144 | INFO: RML content is created!
2025-05-12 20:05:15,192 | INFO: Mapping has been syntactically validated.
2025-05-12 20:05:15,194 | INFO: Translation has finished successfully.


✅ RML file successfully created at: rml/agroterm_mapping_collections.rml.ttl


### 🧪 Creation of the RDF File with Morph-KGC

This final step uses **Morph-KGC**, a tool for knowledge graph construction, to transform all RML mappings into an RDF file in **Turtle format** (`.ttl`).

The script:

---

#### 🗂️ 1. Defines Directories and Prefixes

```python
rml_files_directory = "rml"
rdf_output_directory = "rdf"
````

* **RML input**: folder containing `.rml.ttl` mapping files
* **RDF output**: where the resulting `agroterm_collections.ttl` will be saved

It also defines **default RDF prefixes**, and binds them to the RDFLib graph (e.g., `dc:`, `skos:`, `agroterm:`) for cleaner serialization:

```python
DEFAULT_PREFIXES = {...}
graph.bind(...)
```

---

#### ⚙️ 2. Dynamically Creates Morph-KGC Config

Instead of using a static config file, the script **builds a configuration string in memory**, listing all RML files found in the `rml/` directory under a single `[agroterm]` section:

```ini
[DEFAULT]
output_format = N-TRIPLES
output_file = /absolute/path/to/rdf/agroterm_collections.ttl

[agroterm]
mappings = /absolute/path/to/rml/file1.rml.ttl
mappings = /absolute/path/to/rml/file2.rml.ttl
...
```

---

#### 🧠 3. Executes Morph-KGC Materialization

The script then:

* **Calls `morph_kgc.materialize(config_string)`** to generate RDF triples in memory
* **Adds the defined prefixes** to the resulting RDFLib graph
* **Serializes the graph** to Turtle format and writes it to disk:

```python
graph.serialize(destination=output_path, format="turtle")
```

---

#### ✅ Output

If successful, this step will produce:

```
rdf/agroterm_collections.ttl
```

containing a complete, semantically enriched RDF version of the Agroterm taxonomy, ready to be published or loaded into a triple store (e.g., GraphDB, Fuseki).

---

> 🧩 Tip: You can preview or validate the output using online tools like [Turtle validator](https://ttl.summerofcode.be/) or load it into a SPARQL endpoint to test queries.

> ❗ Make sure the `rml/` folder contains at least one valid `.rml.ttl` file before executing the script.

In [17]:
import os
import morph_kgc
from rdflib import Namespace
from ruamel.yaml import YAML

# Define the default prefixes
DEFAULT_PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dct": "http://purl.org/dc/terms/",
    "iso639-3": "http://iso639-3.sil.org/code/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}

def add_prefixes_to_graph(graph):
    """Add default and dynamic prefixes to the RDFLib graph."""
    for prefix, namespace in DEFAULT_PREFIXES.items():
        graph.bind(prefix, Namespace(namespace))
    # Add project-specific prefix for agroterm
    dynamic_prefix = "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"
    graph.bind("agroterm", Namespace(dynamic_prefix))
    print(f"🔗 Added dynamic prefix: agroterm -> {dynamic_prefix}")

def create_and_process_config_string(output_dir, mapping_files_dir):
    """Create morph-kgc config dynamically and materialize RDF into agroterm.ttl."""
    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)

    # We'll generate a single output file named agroterm.ttl
    output_file = "agroterm_collections.ttl"
    output_path = os.path.join(os.path.abspath(output_dir), output_file)

    # Build the morph-kgc configuration string pointing to all RML files
    yaml = YAML(typ='safe', pure=True)
    mapping_paths = []
    for mapping_file in os.listdir(mapping_files_dir):
        if mapping_file.endswith(".rml") or mapping_file.endswith(".rml.ttl"):
            mapping_paths.append(os.path.abspath(os.path.join(mapping_files_dir, mapping_file)))

    if not mapping_paths:
        print(f"❌ No RML mapping files found in '{mapping_files_dir}'.")
        return

    # Prepare a combined config that lists all mappings under a single section
    mappings_entries = "\n".join(f"mappings = {path}" for path in mapping_paths)
    config_string = f"""
[DEFAULT]
output_format = N-TRIPLES
output_file = {output_path}

safe_percent_encoding = ì

[agroterm]
{mappings_entries}
"""

    print("🚀 Executing morph-kgc with combined mapping files...")
    try:
        # Materialize RDF triples using morph-kgc
        graph = morph_kgc.materialize(config_string)
        
        # Add prefixes
        add_prefixes_to_graph(graph)

        # Serialize the RDFLib graph to the single Turtle file
        graph.serialize(destination=output_path, format="turtle")
        print(f"✅ RDF file generated successfully: {output_path}\n")

    except Exception as e:
        print(f"❌ Error during RDF generation: {str(e)}")

# --- Define the folders ---
rml_files_directory = "rml"
rdf_output_directory = "rdf"

# --- Execute if RML files exist ---
if not os.path.exists(rml_files_directory):
    print(f"❌ RML folder '{rml_files_directory}' does not exist.")
else:
    create_and_process_config_string(rdf_output_directory, rml_files_directory)


🚀 Executing morph-kgc with combined mapping files...
❌ Error during RDF generation: While reading from '<string>' [line 10]: option 'mappings' in section 'agroterm' already exists


## 🔗 Aggregating Concepts to the SKOS ConceptScheme

This step enriches the generated RDF by explicitly linking all `skos:Concept` resources to the `skos:ConceptScheme`, using:

- `skos:hasTopConcept` (from ConceptScheme to Concept)
- `skos:topConceptOf` (from Concept to ConceptScheme)

These relations are essential for tools like **Skosmos**, which rely on them to build navigable hierarchies.

---

### 🧠 Script Summary

```python
def aggregate_concepts_to_scheme(rdf_dir):
````

* Scans all `.ttl` files in the given `rdf/` directory
* For each file:

  1. Loads the RDF graph using `rdflib`
  2. Identifies the `skos:ConceptScheme` (assumes only one)
  3. Finds all `skos:Concept` instances
  4. Adds the two SKOS relations for each concept
  5. Serializes the graph back into the same Turtle file

```python
graph.add((concept_scheme, SKOS.hasTopConcept, concept))
graph.add((concept, SKOS.topConceptOf, concept_scheme))
```

---

### ✅ Output

* The original `.ttl` RDF files are **updated in-place**
* Each concept is now explicitly connected to its scheme
* The vocabulary becomes **fully compliant with SKOS best practices** and interoperable with LOD platforms

---

> 📘 **Note**: This process assumes that each RDF file contains only **one `skos:ConceptScheme`**, which is true for Agroterm. If multiple schemes are present, this logic would need to be adapted.

> 🧩 You can now load the enriched RDF into **Skosmos** or query it with **SPARQL** using `skos:topConceptOf` and `skos:hasTopConcept`.

In [18]:
import os
from rdflib import Graph, Namespace, RDF
from rdflib.namespace import SKOS

# Define the SKOS namespace
SKOS_NS = Namespace("http://www.w3.org/2004/02/skos/core#")

def aggregate_concepts_to_scheme(rdf_dir):
    """
    Aggregate skos:Concept resources to their skos:ConceptScheme by adding
    skos:hasTopConcept (from ConceptScheme to Concept) and skos:topConceptOf (from Concept to ConceptScheme).
    The RDF Turtle files are updated in-place.
    """
    for rdf_file in os.listdir(rdf_dir):
        if rdf_file.endswith(".ttl"):
            rdf_file_path = os.path.join(rdf_dir, rdf_file)
            graph = Graph()
            graph.parse(rdf_file_path, format="turtle")

            # Find the ConceptScheme
            concept_schemes = list(graph.subjects(RDF.type, SKOS.ConceptScheme))
            if not concept_schemes:
                print(f"⚠️ No skos:ConceptScheme found in {rdf_file_path}. Skipping.")
                continue

            concept_scheme = concept_schemes[0]  # Assume only one concept scheme per file (as in our case)

            # Find all skos:Concept instances
            concepts = list(graph.subjects(RDF.type, SKOS.Concept))

            # Add skos:hasTopConcept and skos:topConceptOf relations
            for concept in concepts:
                graph.add((concept_scheme, SKOS.hasTopConcept, concept))
                graph.add((concept, SKOS.topConceptOf, concept_scheme))

            # Serialize and overwrite the updated RDF graph
            graph.serialize(destination=rdf_file_path, format="turtle")
            print(f"✅ Updated '{rdf_file}' with skos:hasTopConcept and skos:topConceptOf relationships.")

# Directory containing the RDF Turtle files
rdf_directory = "rdf"

# Run the aggregation process
aggregate_concepts_to_scheme(rdf_directory)


✅ Updated 'agroterm_hierarchical_decoded.ttl' with skos:hasTopConcept and skos:topConceptOf relationships.
✅ Updated 'agroterm_collections_decoded.ttl' with skos:hasTopConcept and skos:topConceptOf relationships.
✅ Updated 'agroterm_hierarchical.ttl' with skos:hasTopConcept and skos:topConceptOf relationships.
✅ Updated 'agroterm_collections.ttl' with skos:hasTopConcept and skos:topConceptOf relationships.


### 🧹 Clean-up: Remove Duplicate `skos:prefLabel` in Collections

Some SKOS collections may unintentionally have **multiple `skos:prefLabel` values**, which can cause issues in display tools like **Skosmos** or lead to ambiguity in RDF consumers.

This script:
- Loads the RDF vocabulary file
- Scans all `skos:Collection` resources
- For each collection with **multiple prefLabels**, retains only the **first one**
- Writes a cleaned version of the RDF graph to a new file

---

#### 🔍 How it Works

```python
input_path = "./rdf/agroterm_collections.ttl"
output_path = "./rdf/agroterm_collections_decoded.ttl"
````

* Loads the graph from the original RDF file
* Creates a new empty graph (`g_clean`) and copies all **namespaces**

```python
g_clean = Graph()
for prefix, namespace in g.namespaces():
    g_clean.bind(prefix, namespace)
```

* For each `skos:Collection`:

  * If more than one `skos:prefLabel` is found:

    * Removes all existing labels
    * Keeps only the **first one** (as found by `rdflib`)

```python
for s in g.subjects(RDF.type, SKOS.Collection):
    labels = list(g.objects(s, SKOS.prefLabel))
    if len(labels) > 1:
        ...
```

* Copies all remaining triples (including modified ones) to the clean graph
* Saves the result to a new Turtle file

```python
g_clean.serialize(destination=output_path, format="turtle")
```

---

#### ✅ Output

* `./rdf/agroterm_collections_decoded.ttl` — a cleaned RDF version of the vocabulary where each collection has **only one `prefLabel`**

> 🧩 **Tip**: You can run this step before uploading the RDF to tools like **Skosmos** or **LOD platforms** to avoid inconsistencies or UI issues.

> ⚠️ **Caution**: This script assumes the first `prefLabel` is the one to keep. If label prioritization by language is needed, the logic can be extended accordingly.

In [19]:
from rdflib import Graph, Namespace, RDF, URIRef, Literal
from rdflib.namespace import SKOS
import os

# Percorsi
input_path = "./rdf/agroterm_collections.ttl"
output_path = "./rdf/agroterm_collections_decoded.ttl"

# Carica il grafo RDF
g = Graph()
g.parse(input_path, format="turtle")

# Nuovo grafo per i dati puliti
g_clean = Graph()
g_clean.bind("skos", SKOS)

# Copia tutti i namespace originali
for prefix, namespace in g.namespaces():
    g_clean.bind(prefix, namespace)

# Trova tutte le collezioni
for s in g.subjects(RDF.type, SKOS.Collection):
    labels = list(g.objects(s, SKOS.prefLabel))
    if len(labels) > 1:
        # Mantieni solo la prima prefLabel
        first_label = labels[0]
        # Rimuovi tutte le prefLabel
        for label in labels:
            g.remove((s, SKOS.prefLabel, label))
        # Aggiungi solo la prima
        g.add((s, SKOS.prefLabel, first_label))

# Dopo modifica, copia tutto nel nuovo grafo
for triple in g:
    g_clean.add(triple)

# Salva il nuovo file
g_clean.serialize(destination=output_path, format="turtle")
print(f"✅ File pulito salvato in: {output_path}")


✅ File pulito salvato in: ./rdf/agroterm_collections_decoded.ttl


### 🧪 Integrity Check: Validate SKOS Collections

This script performs a **consistency check** on the SKOS collections defined in the RDF vocabulary. It verifies two common issues:

1. **Empty Collections** – collections that do not contain any `skos:member`
2. **Broken References** – members listed in `skos:member` statements that do not exist as subjects in the RDF graph

---

#### ⚙️ How it Works

```python
input_path = "./rdf/agroterm_collections_decoded.ttl"
````

* Loads the cleaned RDF graph from the Turtle file
* Iterates over all `skos:Collection` resources:

  * If no members → flagged as **empty collection**
  * If any member is missing (i.e., not defined as a subject) → flagged as **broken reference**

```python
if not members:
    empty_collections.append(collection)
elif not (member, None, None) in g:
    missing_members.append((collection, member))
```

---

#### 🧾 Output

The script prints a detailed diagnostic:

* ✅ If all collections are valid and complete
* ❗ Lists any **empty collections**
* ⚠️ Lists any **collections with missing members**

This helps ensure your RDF vocabulary is **structurally correct** before publication or ingestion in platforms like **Skosmos**, **SPARQL endpoints**, or **LOD repositories**.

---

> 🧩 **Tip**: If missing members are detected, they may have been excluded during filtering, cleaning, or might be the result of typo errors in URIs.

> 🛠️ You can enhance this script to:
>
> * Check for duplicate `skos:member`
> * Validate type of members (e.g., ensure they are `skos:Concept` or `skos:Collection`)

In [20]:
from rdflib import Graph, Namespace, RDF
from rdflib.namespace import SKOS
import os

# Percorso del file RDF
input_path = "./rdf/agroterm_collections_decoded.ttl"

# Carica il grafo RDF
g = Graph()
g.parse(input_path, format="turtle")

# Report
missing_members = []
empty_collections = []

# Trova tutte le collezioni
collections = list(g.subjects(RDF.type, SKOS.Collection))

for collection in collections:
    members = list(g.objects(collection, SKOS.member))
    
    if not members:
        # Collezione vuota
        empty_collections.append(collection)
    else:
        for member in members:
            if not (member, None, None) in g:
                missing_members.append((collection, member))

# Output risultati
print("🔍 Verifica delle collezioni SKOS:\n")

if empty_collections:
    print("❗ Collezioni vuote trovate:")
    for col in empty_collections:
        print(f"- {col}")
else:
    print("✅ Nessuna collezione vuota trovata.")

print()

if missing_members:
    print("⚠️ Inconsistenze trovate nei membri delle collezioni:")
    for col, member in missing_members:
        print(f"- Collezione: {col} → membro mancante: {member}")
else:
    print("✅ Tutti i membri delle collezioni esistono nel grafo.")


🔍 Verifica delle collezioni SKOS:

✅ Nessuna collezione vuota trovata.

✅ Tutti i membri delle collezioni esistono nel grafo.
