# **Agroterm 2025 - LLOD Notebook**

In this tutorial, we demonstrate how to transform structured data from Excel (XLSX) files into a Linked Open Data (LOD) representation using the **Simple Knowledge Organization System (SKOS)** format. The focus is on **Agroterm**, a taxonomy developed within the **MADIN TERM** project, which aims to organize and publish agricultural terminology as an open, interoperable resource.

This notebook is designed for **students and researchers** interested in practical applications of the **Linguistic Linked Open Data (LLOD)** paradigm. It provides a modular and reproducible workflow for converting domain-specific vocabularies into SKOS-compliant RDF data.

## 🧭 Structure of the tutorial

1. **Introduction to SKOS**  
   Overview of the SKOS data model and its importance for representing terminologies.

2. **Data Preparation**  
   Loading and inspecting structured input data from Excel files.

3. **Transformation Process**  
   Using Python to convert tabular data into RDF, mapping it to SKOS concepts and properties.

4. **SKOS Construction**  
   Generation of concepts, preferred labels, and semantic relations such as `skos:broader`, `skos:narrower`, and `skos:related`.

5. **Export and Integration**  
   Saving the resulting taxonomy in RDF/Turtle format, ready for validation, publication, or integration into LOD infrastructures such as Skosmos.

## 🎯 Goal

To support the creation of **FAIR** and **interoperable** terminological resources by enabling the transformation of structured agricultural data into SKOS Linked Data, fostering wider dissemination and reuse through open data platforms.

## 📦 Installing Required Dependencies

The following Python packages are required to run this notebook. Each of them plays a specific role in the data transformation pipeline:

```python
# Used to make HTTP requests (e.g., to retrieve data or metadata from remote sources)
!pip install requests

# Powerful library for data manipulation and analysis; used to read and process Excel/CSV files
!pip install pandas

# Required by pandas to properly read `.xlsx` Excel files
!pip install openpyxl

# Provides YAML parsing and writing capabilities; used when working with configuration files (e.g., mappings)
!pip install ruamel.yaml

# Translates YARRRML mappings (written in YAML) into RML/R2RML; useful in RDF generation pipelines
!pip install yatter

# Morph-KGC is a tool for knowledge graph construction based on R2RML/RML; used here for converting tabular data to RDF
!pip install morph-kgc
```

In [1]:
!pip install requests
!pip install pandas
!pip install openpyxl
!pip install ruamel.yaml
!pip install yatter
!pip install morph-kgc



## 📥 Importing Libraries

We now import the necessary Python libraries used throughout the notebook. Each module serves a specific function in the transformation and RDF generation process:

```python
# For sending HTTP requests
import requests

# For handling JSON data
import json

# For file and path manipulations
import os

# To generate hashes (e.g., for unique identifiers)
import hashlib

# For disabling SSL warnings in HTTP requests (optional)
import urllib3
import warnings

# Data analysis and tabular data processing
import pandas as pd

# For printing detailed error traces during debugging
import traceback

# For regular expressions and text processing
import re

# For reading YAML configuration files
import yaml
from ruamel.yaml import YAML  # More robust YAML parser used by Morph-KGC

# Lightweight library to process YARRRML files (RML simplified in YAML)
import yatter

# Morph-KGC: generates RDF from tabular data using RML or R2RML mappings
import morph_kgc

# RDFLib: for building and manipulating RDF graphs in Python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import SKOS, RDF  # Common RDF vocabularies

# For URL encoding and parsing
import urllib.parse
```

In [2]:
import requests
import json      
import os        
import hashlib   
import urllib3   
import warnings
import pandas as pd
import traceback
import re
import yaml
import yatter
from ruamel.yaml import YAML
import morph_kgc
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import SKOS, RDF  # Import RDF
import urllib.parse

All the tools are now loaded. Let's move on to the operational part.

## 📄 Loading and Preparing Excel Data

We begin by loading the Agroterm taxonomy data from an Excel file. The file includes two sheets:

- **`metadata`**: general information about the taxonomy (e.g., title, authors, version).
- **`concepts`**: the list of terms/concepts to be converted into SKOS.

The code below also includes a normalization step to make column names easier to handle programmatically.


In [3]:
# Import necessary libraries
import pandas as pd
from IPython.display import display  # For better display of DataFrames in the notebook

# Define the path to the Excel file
excel_file_path = './src/agroterm_2025_rev.xlsx'  # Make sure that the file is named agroterm_2025_rev.xlsx

# Load the two sheets into separate DataFrames
try:
    metadata = pd.read_excel(excel_file_path, sheet_name='metadata')
    concepts = pd.read_excel(excel_file_path, sheet_name='concepts')
    print("✅ Excel file loaded successfully.")
except FileNotFoundError:
    print(f"❌ Error: The file '{excel_file_path}' was not found.")
    raise  # Stop here: the following steps need both DataFrames
except ValueError as e:
    print(f"❌ Error: {e}")  # e.g., a sheet name is missing from the workbook
    raise

# Function to normalize column names
def normalize_column_names(df):
    df.columns = (
        df.columns
        .str.strip()          # Remove leading/trailing spaces
        .str.lower()          # Convert to lowercase
        .str.replace(' ', '_') # Replace spaces with underscores
        .str.replace(r'[^\w]', '', regex=True)  # Remove any non-alphanumeric characters
    )
    return df

# Normalize column names in both metadata and concepts
metadata = normalize_column_names(metadata)
concepts = normalize_column_names(concepts)

# Display the DataFrames nicely
print("\n🔎 Preview of normalized metadata:")
display(metadata)

print("\n🔎 Preview of normalized concepts:")
display(concepts)

✅ Excel file loaded successfully.

🔎 Preview of normalized metadata:


| | autore | data | descrizione | link | titolo |
|---|---|---|---|---|---|
| 0 | Osservatorio di terminologie e politiche lingu... | 2025-05-01 | Il progetto MADIN-TERM nasce dalla collaborazi... | https://centridiricerca.unicatt.it/otpl-proget... | AGROTERM Taxonomy |



🔎 Preview of normalized concepts:


| | termine | regione | dominio | sottodominio | certificazione | area_geografica | definizione_it | note_it | fonti | definizione_en | definizione_es | definizione_fr | definizione_de |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arrosticini (s. m. pl.) | ABRUZZO | prodotti agroalimentari | Carni fresche | PAT | Abruzzo (fascia montana e pedemontana- collin... | Carne di ovino adulto, che si presenta tagliat... | Possono essere conditi con aromi naturali (pep... | https://www.regione.abruzzo.it/system/files/ag... | Meat obtained from muttons, cubed to 1 cm squa... | Carne de ovino adulto, cortada en dados de apr... | | |
| 1 | Caciocavallo abruzzese (s. m.) | ABRUZZO | prodotti agroalimentari | Latte e derivati | PAT | Province di Chieti e L'Aquila | Prodotto caseario a pasta filata e semidura, o... | Il sapore del Caciocavallo abruzzese è dolce ... | https://www.regione.abruzzo.it/system/files/ag... | Dairy product made up of a soft, compact paste... | Producto lácteo de textura compacta y blanda, ... | | |
| 2 | Caciofiore aquilano (s. m.) | ABRUZZO | prodotti agroalimentari | Latte e derivati | PAT | Provincia di L'Aquila | Prodotto caseario a pasta molle, ottenuto da l... | È un formaggio da pronto consumo. Essendo un c... | https://www.regione.abruzzo.it/system/files/ag... | Soft cheese, made from whole sheep's milk, wit... | Queso de pasta blanda, elaborado con leche ent... | | |
| 3 | Canestrato di Castel Monte (s. m.) | ABRUZZO | prodotti agroalimentari | Latte e derivati | PAT | Provincia di L'Aquila | Prodotto caseario a pasta dura di forma cilin... | Il formaggio presenta una crosta esterna che r... | https://www.regione.abruzzo.it/system/files/ag... | Hard cheese, made from raw whole sheep's milk,... | Queso duro, elaborado con leche entera cruda d... | | |
| 4 | Carota dell'Altopiano del Fucino (s. f.) | ABRUZZO | prodotti agroalimentari | Ortofrutticoli | IGP | Provincia di L'Aquila | Ortaggio della specie Daucus carota L., coltiv... | La Carota dell'Altopiano del Fucino deriva dal... | https://www.politicheagricole.it/flex/cm/pages... | Carrot of the species Dacus carota L., grown i... | Hortaliza de la especie Dacus carota L., culti... | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 351 | Recioto della Valpolicella (s. m.) | VENETO | prodotti enologici e distillati | Vini | DOP | 19 comuni in provincia di Verona | Vino rosso ottenuto da uve Corvina dal 45% al ... | Il grado alcolico minimo è del 12% Vol. È vie... | https://www.consorziovalpolicella.it/ | Red wine made from Corvina (45% to 95%) and Ro... | Vino tinto elaborado con uvas Corvina (45% a 9... | | |
| 352 | Riso Nano Vialone Veronese (s. m.) | VENETO | prodotti agroalimentari | Cereali | IGP | 25 comuni della provincia di Verona | Riso ottenuto esclusivamente da semi della var... | La lotta alle erbe infestanti, prima che con g... | 1. https://www.regione.veneto.it 2. https://ww... | Rice obtained exclusively from seeds of the Vi... | Arroz obtenido exclusivamente a partir de semi... | | |
| 353 | Soave (s. m.) | VENETO | prodotti enologici e distillati | Vini | DOP | 13 comuni in provincia di Verona | Vino bianco ottenuto da uve Garganega al minim... | Il grado alcolico minimo è del 10,5% Vol. Le o... | https://www.ilsoave.com/disciplinare/ | White wine made from Garganega (minimum 70%), ... | Vino blanco elaborado con uvas Garganega (70% ... | | |
| 354 | Soppressa Vicentina (s. f.) | VENETO | prodotti agroalimentari | Salumi / insaccati | DOP | Provincia di Vicenza | Salume ottenuto da carni di suini di razza Lar... | La macellazione e la trasformazione della carn... | https://www.regione.veneto.it | Salume obtained from the meat of pigs belongin... | Salume obtenido de la carne de cerdos pertenec... | | |
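
The column normalization used in the cell above can be illustrated in isolation. The following pure-Python sketch applies the same four operations to individual header strings; the helper name `normalize_header` and the sample headers are assumptions made for illustration, and the actual sheet headers may differ.

```python
import re

def normalize_header(name: str) -> str:
    """Mirror the notebook's column normalization on a single header string."""
    name = name.strip()                 # remove leading/trailing spaces
    name = name.lower()                 # convert to lowercase
    name = name.replace(" ", "_")       # replace spaces with underscores
    return re.sub(r"[^\w]", "", name)   # drop anything that is not a letter, digit, or underscore

# Hypothetical headers, for illustration only
headers = [" Termine ", "Area geografica", "Definizione IT", "Note (IT)"]
normalized = [normalize_header(h) for h in headers]
print(normalized)
```

Because `\w` also matches the underscore, the underscores introduced in the third step survive the final cleanup.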


## 🧾 Exporting Metadata to JSON

In this section, we extract metadata information from the Excel sheet and convert it into a JSON file. This structured metadata will later be used to document and enrich the RDF output, and may include project-level details like the taxonomy title, version, creator, and supported languages.

The following steps are performed:

1. Convert the first row of the `metadata` DataFrame into a Python dictionary.
2. Normalize values to ensure compatibility with JSON (e.g., converting timestamps to ISO 8601 format).
3. Manually add a list of language codes (`ita`, `spa`, `eng`) following ISO 639-3.
4. Create a target directory (`./json`) if it doesn’t exist.
5. Save the metadata dictionary to `metadata.json` with proper UTF-8 encoding and indentation.
6. Preview the saved content.



In [4]:
import os
import json
import pandas as pd

# 1. Convert the first row of the metadata DataFrame into a dictionary
metadata_dict = metadata.iloc[0].to_dict()

# 2. Normalize the values: convert Timestamps and other non-serializable objects
for key, value in metadata_dict.items():
    if pd.isna(value):
        metadata_dict[key] = None  # If NaN, set to null
    elif isinstance(value, pd.Timestamp):
        metadata_dict[key] = value.isoformat()  # Convert to ISO 8601 string
    elif not isinstance(value, (str, int, float, bool)):
        metadata_dict[key] = str(value)  # Any other type falls back to its string form

# 3. Add the 'languages' key with ISO 639-3 codes
metadata_dict['languages'] = [
    {"code": "ita"},
    {"code": "spa"},
    {"code": "eng"}
]

# 4. Ensure the 'json' folder exists
os.makedirs('./json', exist_ok=True)

# 5. Define the output path
metadata_json_path = './json/metadata.json'

# 6. Save the enriched metadata dictionary to a JSON file
with open(metadata_json_path, 'w', encoding='utf-8') as f:
    json.dump(metadata_dict, f, ensure_ascii=False, indent=4)

# 7. Confirm the operation
print(f"✅ Metadata with languages saved successfully to {metadata_json_path}")

# 8. Preview the JSON content
print("\n🔎 Preview of metadata.json content:")
display(metadata_dict)


✅ Metadata with languages saved successfully to ./json/metadata.json

🔎 Preview of metadata.json content:


{'autore': 'Osservatorio di terminologie e politiche linguistiche, Università Cattolica del Sacro Cuore, Milano',
 'data': '2025-05-01T00:00:00',
 'descrizione': "Il progetto MADIN-TERM nasce dalla collaborazione tra l'Osservatorio di Terminologie e Politiche Linguistiche (OTPL) dell'Università Cattolica del Sacro Cuore e CLARIN-IT con l'obiettivo di diffondere la terminologia del Made in Italy, partendo dai risultati ottenuti dall'OTPL nell'ambito del progetto AGROTERM in cui è stata raccolta la terminologia dei prodotti italiani DOP, DOC e IGP.  Il progetto MADIN-TERM ha l’obiettivo di creare un banca dati terminologica nel rispetto dei principi FAIR. Per i prodotti selezionati, verrà proposta una definizione in italiano e per promuovere la comunicazione internazionale, in altre lingue, quali inglese, tedesco, francese e spagnolo. ",
 'link': 'https://centridiricerca.unicatt.it/otpl-progetti-sostenibilita-ambientale-e-alimentare',
 'titolo': 'AGROTERM Taxonomy',
 'languages': [{'code
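
The value normalization and the JSON serialization options used above can be tried out without pandas. This is a minimal sketch under stated assumptions: the helper `to_json_safe` and the `versione` key are hypothetical, and a plain `datetime` stands in for `pd.Timestamp` (which is a `datetime` subclass).

```python
import json
import math
from datetime import datetime

def to_json_safe(value):
    """Return a JSON-serializable version of a single cell value."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None                      # missing values become JSON null
    if isinstance(value, datetime):
        return value.isoformat()         # ISO 8601 string
    if isinstance(value, (str, int, float, bool)):
        return value                     # already serializable
    return str(value)                    # fall back to a string representation

# Hypothetical record standing in for the first row of the metadata sheet
record = {"titolo": "AGROTERM Taxonomy", "data": datetime(2025, 5, 1), "versione": float("nan")}
safe = {key: to_json_safe(value) for key, value in record.items()}
print(json.dumps(safe, ensure_ascii=False, indent=4))

# ensure_ascii=False keeps accented characters readable instead of \uXXXX escapes
readable = json.dumps({"autore": "Università"}, ensure_ascii=False)
escaped = json.dumps({"autore": "Università"})
print(readable)
print(escaped)
```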

## 🧠 Generating and Saving `concepts.json`

This section processes the list of concepts from the Excel sheet and transforms each row into a structured JSON object. The final output is saved as `concepts.json`, ready for further conversion to SKOS/RDF.

The processing pipeline includes:

1. **Text normalization functions**: utility functions to clean and standardize strings, remove accents, and prepare URI-safe fragments.
2. **Geographical area dictionary**: a normalized mapping from area labels to alias IDs (e.g., `area_1`, `area_2`).
3. **Concept object creation**: iterate through the `concepts` DataFrame and generate a structured dictionary for each concept, including:
   - Preferred label (`termine`) and alternative labels (`altLabels`);
   - Normalized values for region, domain, subdomain, certification;
   - List of regions (`regioni`) linked to each concept;
   - Sources (`fonti`) as a list of URLs or strings.
4. **Save to JSON**: store the resulting list of concepts and the area mapping dictionary into the `./json/` folder.
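
Before the full cell, the label handling in steps 1 and 3 can be previewed on a single term. The sketch below is a simplified stand-in, not the notebook's own functions: the helper names `strip_markers` and `uri_fragment` are hypothetical, the punctuation pattern is reduced to `[^\w\s]`, and the second term is an invented example used only to show accent folding.

```python
import re
import unicodedata

def strip_markers(label: str) -> str:
    """Drop parenthesized grammatical markers such as '(s. m. pl.)'."""
    return re.sub(r"\s*\(.*?\)", "", label).strip()

def uri_fragment(label: str, counter: int) -> str:
    """Build an ASCII, lowercase, underscore-separated fragment with a numeric suffix."""
    text = re.sub(r"[^\w\s]", "", label.strip())   # remove punctuation
    text = re.sub(r"\s+", "_", text).lower()       # whitespace runs become underscores
    ascii_text = "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"         # drop combining accent marks
    )
    return f"{ascii_text}_{counter}"

raw = "Arrosticini (s. m. pl.)"
pref_label = strip_markers(raw)
print(pref_label, "->", uri_fragment(pref_label, 1))

# Hypothetical accented term, to show the ASCII folding
print(uri_fragment("Caffè d'orzo", 7))
```

The numeric suffix keeps fragments unique even when two rows share the same label.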



In [5]:
import os
import json
import pandas as pd
import re
import unicodedata
from datetime import datetime

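def clean_string_for_uri(text):
    """Lowercase the text, drop punctuation, and join words with underscores."""
    if not isinstance(text, str):
        text = str(text)
    text = text.strip()
    text = re.sub(r'[^\w\sàèéìòùÀÈÉÌÒÙ]', '', text, flags=re.UNICODE)
    text = re.sub(r'\s+', '_', text)
    return text.lower()

def remove_accents(text):
    """Strip diacritics: decompose to NFD and drop the combining marks."""
    return ''.join(
        c for c in unicodedata.normalize('NFD', text)
        if unicodedata.category(c) != 'Mn'
    )

def clean_text_value(text):
    """Trim the text and collapse runs of whitespace into single spaces."""
    if not isinstance(text, str):
        text = str(text)
    return re.sub(r'\s+', ' ', text.strip())

def remove_marc_morf(text):
    """Remove parenthesized grammatical markers such as '(s. m.)'."""
    return re.sub(r'\s*\(.*?\)', '', text).strip()

def clean_certificazione(text):
    """Normalize a certification label (e.g., 'PAT', 'IGP') into an identifier."""
    if not isinstance(text, str):
        text = str(text)
    text = text.strip()
    if text.lower() in ["nessuna certificazione", "nessuna classificazione"]:
        return text.lower().replace(" ", "_")
    text = re.sub(r'\.', '', text)          # Drop dots
    text = re.sub(r'\s+', '_', text)        # Whitespace becomes underscores
    text = re.sub(r'[^A-Z0-9_]', '', text)  # Keep only uppercase letters, digits, underscores
    return text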

# --- 1. Build the dictionary: normalized area_geografica → area_n
area_geo_set = set()

for val in concepts['area_geografica']:
    label = clean_text_value(val) if pd.notna(val) else "None"
    norm = clean_string_for_uri(label)
    norm_ascii = remove_accents(norm)
    area_geo_set.add(norm_ascii)

area_geo_dict = {norm: f"area_{i+1}" for i, norm in enumerate(sorted(area_geo_set))}


# --- 2. Generate the concept JSON objects
concepts_list = []
all_columns = concepts.columns.tolist()
concept_counter = 1

for index, row in concepts.iterrows():
    concept_obj = {}

    termine_raw = row.get('termine', "")
    if pd.isna(termine_raw) or termine_raw == "":
        concept_uri_fragment = f"concept_{concept_counter}"
        pref_label = ""
        alt_labels = []
    else:
        term_variants = [t.strip() for t in re.split(r'[;,|\n]', termine_raw) if t.strip()]
        full_form = clean_text_value(term_variants[0]) if term_variants else ""
        pref_label = remove_marc_morf(full_form)
        concept_uri_raw = clean_string_for_uri(pref_label)
        concept_uri_fragment = f"{remove_accents(concept_uri_raw)}_{concept_counter}"
        concept_obj["marc_morf"] = full_form
        alt_labels = []
        for variant in term_variants[1:]:
            cleaned = clean_text_value(variant)
            alt_labels.append({
                "altLabel": remove_marc_morf(cleaned),
                "altLabel_marc_morf": cleaned,
                "alt_concept": concept_uri_fragment
            })

    concept_obj["termine"] = pref_label
    concept_obj["concept"] = concept_uri_fragment
    concept_obj["altLabels"] = alt_labels

    for col in all_columns:
        col_lower = col.lower()
        if col_lower in ["termine", "variante", "ultima_revisione", "fonti", "regione", "area_geografica"]:
            continue

        value = row.get(col, None)
        concept_obj[col] = "" if pd.isna(value) else clean_text_value(value) if isinstance(value, str) else value

    # Sources (fonti)
    fonti_val = row.get("fonti", "")
    if pd.notna(fonti_val) and isinstance(fonti_val, str) and fonti_val.strip():
        # Split ONLY on newlines (handles both \r\n and \n)
        fonti_raw_list = [f.strip() for f in fonti_val.strip().splitlines() if f.strip()]
        concept_obj["fonti"] = [
            {"url": fonte, "concept": concept_uri_fragment}
            for fonte in fonti_raw_list
        ]
    else:
        concept_obj["fonti"] = []

    # Geographic area
    raw_area_geo_label = clean_text_value(row.get('area_geografica')) if pd.notna(row.get('area_geografica')) else "None"
    area_geo_norm = clean_string_for_uri(raw_area_geo_label)
    area_geo_ascii = remove_accents(area_geo_norm)
    area_geo_id = area_geo_dict.get(area_geo_ascii, "area_0")  # fallback

    # Region
    raw_regione = row.get('regione', "")
    regione_norm = clean_string_for_uri(raw_regione)
    concept_obj['regione_normalizzata'] = regione_norm
    concept_obj['dominio_normalizzato'] = clean_string_for_uri(row.get('dominio', ""))
    concept_obj['sottodominio_normalizzato'] = clean_string_for_uri(row.get('sottodominio', ""))
    concept_obj['certificazione_normalizzata'] = clean_certificazione(row.get('certificazione', ""))

    # List of regions
    regioni_list = []
    if pd.notna(raw_regione) and isinstance(raw_regione, str) and raw_regione.strip():
        raw_regioni_split = [r.strip() for r in re.split(r'[;,|\n]', raw_regione) if r.strip()]
        for regione_item in raw_regioni_split:
            regione_item_norm = clean_string_for_uri(regione_item)
            regioni_list.append({
                "label": regione_item,
                "ref_regione": regione_item_norm,
                "ref_concept": concept_uri_fragment,
                "ref_area_geografica_label": raw_area_geo_label,
                "ref_area_geografica": area_geo_id
            })
    concept_obj["regioni"] = regioni_list

    concepts_list.append(concept_obj)
    concept_counter += 1

# --- Save JSON
concepts_json_path = './json/concepts.json'
area_dict_path = './json/area_geo_mapping.json'
os.makedirs('./json', exist_ok=True)

with open(concepts_json_path, 'w', encoding='utf-8') as f:
    json.dump(concepts_list, f, ensure_ascii=False, indent=4)

with open(area_dict_path, 'w', encoding='utf-8') as f:
    json.dump(area_geo_dict, f, ensure_ascii=False, indent=4)

print(f"✅ Concepts saved to: {concepts_json_path}")
print(f"📘 Area geographic aliases saved to: {area_dict_path}")

# Preview
print("\n🔎 Preview of first 3 concepts:")
for concept in concepts_list[:3]:
    display(concept)


✅ Concepts saved to: ./json/concepts.json
📘 Area geographic aliases saved to: ./json/area_geo_mapping.json

🔎 Preview of first 3 concepts:


{'marc_morf': 'Arrosticini (s. m. pl.)',
 'termine': 'Arrosticini',
 'concept': 'arrosticini_1',
 'altLabels': [],
 'dominio': 'prodotti agroalimentari',
 'sottodominio': 'Carni fresche',
 'certificazione': 'PAT',
 'definizione_it': 'Carne di ovino adulto, che si presenta tagliata a cubetti di circa 1 cm, di colore rosso più o meno intenso, infilati in spiedini di legno.',
 'note_it': 'Possono essere conditi con aromi naturali (peperoncino, salvia, cipolla) oppure misti, con l’aggiunta di carne di suino o bovino.',
 'definizione_en': 'Meat obtained from muttons, cubed to 1 cm square, which varies in intesity of a red colour and threaded in wooden skewers.',
 'definizione_es': 'Carne de ovino adulto, cortada en dados de aproximadamente 1 cm, de color rojo más o menos intenso, ensartados en pinchos de madera.',
 'definizione_fr': '',
 'definizione_de': '',
 'fonti': [{'url': 'https://www.regione.abruzzo.it/system/files/agricoltura/pord_agroalimentari/Atlante_prodotti_tipici.pdf',
   'con

{'marc_morf': 'Caciocavallo abruzzese (s. m.)',
 'termine': 'Caciocavallo abruzzese',
 'concept': 'caciocavallo_abruzzese_2',
 'altLabels': [],
 'dominio': 'prodotti agroalimentari',
 'sottodominio': 'Latte e derivati',
 'certificazione': 'PAT',
 'definizione_it': 'Prodotto caseario a pasta filata e semidura, ottenuto da latte intero crudo di vacca con aggiunta di caglio e sale, che presenta forma a pera dal peso non inferiore a 1 kg.',
 'note_it': 'Il sapore del Caciocavallo abruzzese è dolce e pastoso quando è ancora fresco, intenso e piccante con la stagionatura. Il latte munto non viene pastorizzato in quanto avviene a una temperatura inferiore ai 40°C.',
 'definizione_en': 'Dairy product made up of a soft, compact paste. It is made with full raw cow’s milk, rennet and salt, it has a smooth outer surface and an unusual pear shape, weighing over 1 kg.',
 'definizione_es': 'Producto lácteo de textura compacta y blanda, obtenido a partir de leche entera cruda de vaca con adición de cu

{'marc_morf': 'Caciofiore aquilano (s. m.)',
 'termine': 'Caciofiore aquilano',
 'concept': 'caciofiore_aquilano_3',
 'altLabels': [],
 'dominio': 'prodotti agroalimentari',
 'sottodominio': 'Latte e derivati',
 'certificazione': 'PAT',
 'definizione_it': "Prodotto caseario a pasta molle, ottenuto da latte intero ovino, con l'aggiunta di caglio di carciofo e zafferano, che si presenta in forma cilindrica, con crosta fine.",
 'note_it': 'È un formaggio da pronto consumo. Essendo un caciofiore è prodotto usando il "fiore del latte" cioè la parte grassa che affiora in superficie.',
 'definizione_en': "Soft cheese, made from whole sheep's milk, with the addition of artichoke rennet and saffron, which has cylindrical shape and thin crust.",
 'definizione_es': 'Queso de pasta blanda, elaborado con leche entera de oveja, con la adición de cuajo de alcachofa y azafrán, que tiene forma cilíndrica y corteza fina.',
 'definizione_fr': '',
 'definizione_de': '',
 'fonti': [{'url': 'https://www.reg

The resulting JSON has the following schematic structure:

![Concept JSON Structure](./assets/concept_structure.png)
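
As a text alternative to the diagram, the layout of one entry can be sketched as a Python dictionary. Key names follow the code above; values written as `<...>` are placeholders, the `altLabels` entry is shown only to illustrate its shape (the previewed concepts have none), and several definition and note columns are omitted for brevity.

```python
import json

# Illustrative skeleton of one entry in concepts.json.
# Key names follow the notebook code; "<...>" values are placeholders, not real data.
concept_example = {
    "marc_morf": "Arrosticini (s. m. pl.)",   # full form with grammatical marker
    "termine": "Arrosticini",                 # preferred label
    "concept": "arrosticini_1",               # URI fragment
    "altLabels": [                            # one object per variant (empty if none)
        {
            "altLabel": "<variant>",
            "altLabel_marc_morf": "<variant with marker>",
            "alt_concept": "arrosticini_1",
        }
    ],
    "dominio": "prodotti agroalimentari",
    "sottodominio": "Carni fresche",
    "certificazione": "PAT",
    "definizione_it": "<definition>",
    "fonti": [{"url": "<source>", "concept": "arrosticini_1"}],
    "regione_normalizzata": "abruzzo",
    "dominio_normalizzato": "prodotti_agroalimentari",
    "sottodominio_normalizzato": "carni_fresche",
    "certificazione_normalizzata": "PAT",
    "regioni": [
        {
            "label": "ABRUZZO",
            "ref_regione": "abruzzo",
            "ref_concept": "arrosticini_1",
            "ref_area_geografica_label": "<area label>",
            "ref_area_geografica": "<area_n>",
        }
    ],
}

# Every nested object repeats the parent identifier under its own key
parent_refs = (
    [a["alt_concept"] for a in concept_example["altLabels"]]
    + [f["concept"] for f in concept_example["fonti"]]
    + [r["ref_concept"] for r in concept_example["regioni"]]
)
print(json.dumps(sorted(concept_example), indent=2))
print(parent_refs)
```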

## 🌐 Defining the Base URI and Prefixes

To ensure that all generated RDF resources are uniquely and consistently identified, we define:

### 1. **Base URI**

This is the base namespace under which all concepts, labels, and metadata for the Agroterm taxonomy will be created:

In [6]:
# Define the base URI for the vocabulary
BASE_URI = "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"

### 2. **RDF Prefix Mapping (YAML)**

In this block, we define a set of RDF prefixes in YAML format. These prefixes are used to shorten URIs when generating RDF triples, especially during the mapping phase with tools like **Morph-KGC** or **YARRRML/YATTER**.

```yaml
prefixes:
  dc: "http://purl.org/dc/elements/1.1/"              # Dublin Core basic metadata elements
  dct: "http://purl.org/dc/terms/"                    # Dublin Core extended terms
  iso639-3: "http://iso639-3.sil.org/code/"           # ISO 639-3 language codes
  skos: "http://www.w3.org/2004/02/skos/core#"        # SKOS vocabulary for thesauri and concept schemes
  xsd: "http://www.w3.org/2001/XMLSchema#"            # XML Schema datatypes (e.g., xsd:string, xsd:date)
  agroterm: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"  # Custom namespace for this vocabulary
```

In [7]:
prefixes_raw_mapping = """
prefixes:
  dc: "http://purl.org/dc/elements/1.1/"
  dct: "http://purl.org/dc/terms/"
  iso639-3: "http://iso639-3.sil.org/code/"
  skos: "http://www.w3.org/2004/02/skos/core#"
  xsd: "http://www.w3.org/2001/XMLSchema#"
  agroterm: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"
"""

### 🗺️ Constructing YARRRML Raw Mapping for `skos:ConceptScheme`

In this step, we define the **YARRRML raw mapping** that will be used to generate RDF triples describing the **Agroterm concept scheme**, based on metadata extracted from the file `metadata.json`.

### 🟦 Define JSON source path


This variable stores the path to the JSON file containing the metadata (`metadata.json`). The concept scheme mapping blocks reference it as their source and read it with JSONPath expressions.

In [8]:
metadata_source = "./json/metadata.json"

#### 🔹 `agroterm_conceptscheme`

This rule maps the **main concept scheme resource** with basic metadata using JSONPath to extract values:

```yaml
agroterm_conceptscheme:
  sources:
    - ['./json/metadata.json~jsonpath', "$"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"
  predicateobjects:
    - [a, skos:ConceptScheme]                              # Declares the resource as a ConceptScheme
    - [dc:description, $(descrizione), it~lang]            # Description in Italian
    - [dc:title, $(titolo), it~lang]                       # Title in Italian
    - [dct:created, $(data), xsd:dateTime]                 # Creation date (ISO 8601 timestamp)
    - [dc:creator, $(autore), it~lang]                     # Author
    - [dct:source, $(link)]                                # Source or external reference
```


In [9]:
concept_scheme_raw_mapping = f"""
agroterm_conceptscheme:
  sources:
    - ['{metadata_source}~jsonpath', "$"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"
  predicateobjects:
    - [a, skos:ConceptScheme]
    - [dc:description, $(descrizione), it~lang]
    - [dc:title, $(titolo), it~lang]
    - [dct:created, $(data), xsd:dateTime]
    - [dc:creator, $(autore), it~lang]
    - [dct:source, $(link)]

agroterm_conceptscheme_languages:
  sources:
    - ['{metadata_source}~jsonpath', "$.languages[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"
  predicateobjects:
    - [dct:language, "http://iso639-3.sil.org/code/$(code)"]
"""

### 🧩 Constructing YARRRML Mappings for Concepts

In the following cells, we define the **YARRRML raw mappings** used to transform the concept-level data in `concepts.json` into SKOS triples. Each mapping targets a specific aspect of the concept representation: identity, labels, definitions, alternative labels, and sources.

In [10]:
concepts_source = "./json/concepts.json"
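
The concept mappings read this file through a JSONPath iterator and build IRIs from `$(field)` templates. As a rough intuition for what that means, the sketch below mimics both steps in plain Python on a reduced in-memory list. It is not how Morph-KGC evaluates mappings internally; the `expand` helper and the `BASE` constant are hypothetical stand-ins.

```python
import re

BASE = "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"

# Reduced, in-memory stand-in for ./json/concepts.json
concepts = [
    {"concept": "arrosticini_1", "termine": "Arrosticini"},
    {"concept": "caciocavallo_abruzzese_2", "termine": "Caciocavallo abruzzese"},
]

def expand(template: str, record: dict) -> str:
    """Replace each $(field) reference with the matching value of the record."""
    return re.sub(r"\$\((\w+)\)", lambda m: str(record[m.group(1)]), template)

# Iterator "$[*]": one record per element of the top-level array
records = list(concepts)

subjects = [expand(BASE + "$(concept)", r) for r in records]
labels = [expand("$(termine)", r) for r in records]

for subject, label in zip(subjects, labels):
    print(subject, "->", label)
```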

## 🌐 Modeling Concept Hierarchies with SKOS: Domains, Subdomains, Concepts

This block of YARRRML mappings defines a **hierarchical structure** using SKOS semantic relationships:
- `skos:broader` and `skos:narrower` for hierarchical links
- `skos:inScheme` to link everything to the Agroterm concept scheme

It organizes:
- Concepts under Subdomains
- Subdomains under Domains

---

### 🔹 `agroterm_concepts`

Maps individual **concepts** (terms) from `concepts.json` as `skos:Concept` resources:

```yaml
subject: .../$(concept)
  - [skos:prefLabel, $(termine)]
  - [skos:broader, .../$(sottodominio_normalizzato)]
```

Each concept:

* Gets a preferred label (`termine`)
* Is linked to its subdomain via `skos:broader`
* Has optional multilingual `skos:definition`
* Carries morphological notes as `skos:note`
* Is placed inside the Agroterm scheme

---

### 🔸 `agroterm_concept_member_domini`

Defines each **domain** as a `skos:Concept` and relates it to its subdomains using `skos:narrower`:

```yaml
subject: .../$(dominio_normalizzato)
  - [skos:narrower, .../$(sottodominio_normalizzato)]
```

This creates the first level of the hierarchy:

```
Domain → narrower → Subdomain
```

---

### 🔸 `agroterm_concept_member_sottodomini`

Defines each **subdomain** as a `skos:Concept`, with:

* A broader relation to its domain
* A narrower relation to each concept it contains

```yaml
subject: .../$(sottodominio_normalizzato)
  - [skos:broader, .../$(dominio_normalizzato)]
  - [skos:narrower, .../$(concept)]
```

This completes the full chain:

```
Domain → Subdomain → Concept
```

> 📘 **Note**: These are SKOS semantic links and not SKOS collections — they're used to express taxonomic relations and enable hierarchical navigation in tools like Skosmos or SPARQL queries.

---

### ✅ Result

You now have a **multilevel concept hierarchy** encoded with:

* `skos:broader` for upward navigation
* `skos:narrower` for downward traversal
* Proper typing and labeling for all levels (domain, subdomain, concept)

> 🧩 You can now traverse the hierarchy with SPARQL queries or browse it as a tree in LOD browsers.
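
The Domain → Subdomain → Concept chain can be illustrated with plain tuples instead of a real RDF graph. This is a simplified sketch: it uses local names rather than full IRIs, and the three records are reduced copies of the first rows previewed earlier. A set is used because an RDF graph stores each triple only once, even though the domain and subdomain mappings run once per concept row.

```python
# Reduced concept records; field names follow concepts.json
records = [
    {"concept": "arrosticini_1", "dominio_normalizzato": "prodotti_agroalimentari",
     "sottodominio_normalizzato": "carni_fresche"},
    {"concept": "caciocavallo_abruzzese_2", "dominio_normalizzato": "prodotti_agroalimentari",
     "sottodominio_normalizzato": "latte_e_derivati"},
    {"concept": "caciofiore_aquilano_3", "dominio_normalizzato": "prodotti_agroalimentari",
     "sottodominio_normalizzato": "latte_e_derivati"},
]

triples = set()  # duplicates collapse, as they would in an RDF graph
for r in records:
    concept = r["concept"]
    sub = r["sottodominio_normalizzato"]
    dom = r["dominio_normalizzato"]
    triples.add((concept, "skos:broader", sub))   # agroterm_concepts
    triples.add((dom, "skos:narrower", sub))      # agroterm_concept_member_domini
    triples.add((sub, "skos:broader", dom))       # agroterm_concept_member_sottodomini
    triples.add((sub, "skos:narrower", concept))  # agroterm_concept_member_sottodomini

def narrower(node):
    """Return the direct children of a node, sorted for stable output."""
    return sorted(o for s, p, o in triples if s == node and p == "skos:narrower")

def print_tree(node, depth=0):
    """Print the hierarchy below a node, indenting one level per step."""
    print("  " * depth + node)
    for child in narrower(node):
        print_tree(child, depth + 1)

print_tree("prodotti_agroalimentari")
```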

In [11]:
concepts_raw_mapping = f"""
agroterm_concepts:
  sources:
    - ['{concepts_source}~jsonpath', "$[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(concept)"
  predicateobjects:
    - [a, skos:Concept]
    - [skos:prefLabel, $(termine), it~lang]
    - [skos:broader, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(sottodominio_normalizzato)"]
    - [skos:definition, $(definizione_it), it~lang]
    - [skos:definition, $(definizione_en), en~lang]
    - [skos:definition, $(definizione_es), es~lang]
    - [skos:note, $(marc_morf), it~lang]
    - [skos:inScheme, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"]
    
agroterm_concept_member_domini:
  sources:
    - ['{concepts_source}~jsonpath', "$[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(dominio_normalizzato)"
  po:
    - [a, skos:Concept]
    - [skos:prefLabel, "$(dominio)", it~lang]
    - [skos:narrower, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(sottodominio_normalizzato)"]
    - [skos:inScheme, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"]

agroterm_concept_member_sottodomini:
  sources:
    - ['{concepts_source}~jsonpath', "$[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(sottodominio_normalizzato)"
  po:
    - [a, skos:Concept]
    - [skos:prefLabel, "$(sottodominio)", it~lang]
    - [skos:broader, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(dominio_normalizzato)"]
    - [skos:narrower, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(concept)"]
    - [skos:inScheme, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"]
"""

## 🏷️ Mapping Alternative Labels (`skos:altLabel`)

This YARRRML block defines the mapping of **alternative lexical variants** for each concept in the Agroterm vocabulary. These variants are important to support search, disambiguation, and linguistic richness within the thesaurus.

---

### 🔹 `agroterm_altlabels`

This mapping iterates over the `altLabels` array for each concept (in `concepts.json`) using JSONPath:

```yaml
sources:
  - ['./json/concepts.json~jsonpath', "$[*].altLabels[*]"]
```

Each object in the `altLabels` list contains:

* `altLabel`: the cleaned variant label
* `altLabel_marc_morf`: the original variant form (often including morphological markers)
* `alt_concept`: the concept this variant belongs to

These are mapped as follows:

```yaml
subject: .../$(alt_concept)
  - [skos:altLabel, $(altLabel), it~lang]
  - [skos:note, $(altLabel_marc_morf), it~lang]
```
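
To make the iteration concrete, the sketch below reproduces in plain Python what the JSONPath `$[*].altLabels[*]` selects and which statements the two rules would yield. The sample records and their values are hypothetical; only the field names come from the mapping above, and the tuples are an informal stand-in for RDF triples.

```python
import json

BASE = "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"

# Hypothetical sample shaped like concepts.json (all values are invented)
sample = json.loads("""
[
  {
    "concept": "olio_di_oliva",
    "altLabels": [
      {"altLabel": "olio d'oliva",
       "altLabel_marc_morf": "olio d'oliva (s.m.)",
       "alt_concept": "olio_di_oliva"}
    ]
  },
  {"concept": "vino", "altLabels": []}
]
""")

# Plain-Python equivalent of "$[*].altLabels[*]": one item per variant
variants = [alt for record in sample for alt in record.get("altLabels", [])]

# One (subject, predicate, (text, language)) tuple per mapping rule
triples = []
for alt in variants:
    subject = BASE + alt["alt_concept"]
    triples.append((subject, "skos:altLabel", (alt["altLabel"], "it")))
    triples.append((subject, "skos:note", (alt["altLabel_marc_morf"], "it")))

for triple in triples:
    print(triple)
```

A concept with an empty `altLabels` list contributes no items to the iteration, so no triples are produced for it.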

---

### 🧠 Semantic Meaning

* `skos:altLabel` is used to provide **alternative terms or synonyms** that may be used interchangeably with the `prefLabel`.
* `skos:note` is used to preserve **morphological or editorial annotations** that were stripped from the main label but might be useful for reference or linguistic studies.

---

### ✅ Result

Each concept will have one or more `skos:altLabel` triples linked to it, enhancing its discoverability and making the vocabulary more flexible and user-friendly.

> 💡 You can extend this mapping to cover labels in other languages (by changing the `it~lang` tag) or to emit `skos:hiddenLabel` for use cases such as autocomplete or indexing.

In [12]:
concepts_altlabel_raw_mapping = f"""
agroterm_altlabels:
  sources:
    - ['{concepts_source}~jsonpath', "$[*].altLabels[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(alt_concept)"
  predicateobjects:
    - [skos:altLabel, $(altLabel), it~lang]
    - [skos:note, $(altLabel_marc_morf), it~lang]
"""

## 📚 Mapping Sources (`dct:source`)

This YARRRML block defines how to attach **source references** to each concept using the `dct:source` property. These sources may include bibliographic references, institutional documents, glossaries, or websites that justify or describe the term.

---

### 🔹 `agroterm_fonti`

This mapping iterates over the `fonti` list inside each concept object in `concepts.json`:

```yaml
sources:
  - ['./json/concepts.json~jsonpath', "$[*].fonti[*]"]
```

Each `fonti[*]` object contains:

* `url`: the source string or URL
* `concept`: the identifier of the related concept

The mapping links each source to the corresponding concept:

```yaml
subject: .../$(concept)
  - [dct:source, $(url)]
```

---

### 🧠 Semantic Meaning

* `dct:source` (from the Dublin Core Terms vocabulary) is used to indicate **a related resource from which the concept is derived or inspired**.
* This property improves **traceability**, **transparency**, and **reusability** of the vocabulary content.

---

### ✅ Result

Each `skos:Concept` in the Agroterm RDF will have zero or more `dct:source` triples attached, making it easier for users to understand the provenance of terms.

> 🧩 **Tip**: Sources given as URLs may be rendered as clickable links in interfaces like Skosmos. For plain-text sources, consider a consistent citation structure (authors, date, title).
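
Since a source can be either a URL or free text, it can help to check which kind each value is before publishing. The following is a minimal sketch using only the standard library; the helper name `looks_like_url` and the sample strings are hypothetical and not part of the notebook.

```python
from urllib.parse import urlparse

def looks_like_url(value: str) -> bool:
    """Return True when the string parses as an absolute http(s) URL."""
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

# Hypothetical source strings (invented for illustration)
fonti = [
    "https://example.org/glossario",
    "Printed glossary, no URL available",
]

for fonte in fonti:
    kind = "URL" if looks_like_url(fonte) else "plain-text citation"
    print(f"{kind}: {fonte}")
```

A check like this could be run over the `url` field of `concepts.json` to list the entries that still need a structured citation.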

In [13]:
concepts_fonti_raw_mapping = f"""
agroterm_fonti:
  sources:
    - ['{concepts_source}~jsonpath', "$[*].fonti[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(concept)"
  predicateobjects:
    - [dct:source, $(url)]
"""

## 🗂️ Mapping SKOS Collections: Regions and Certifications

This mapping block organizes the vocabulary into **semantic collections** using `skos:Collection` and `skos:member`. Collections help users **navigate and explore the vocabulary** via thematic, geographic, or classificatory groupings — especially in interfaces like **Skosmos**.

---

### 🔹 `agroterm_collections_regioni`

Creates a **global collection** named `"Regioni"` that includes as members all normalized region identifiers:

```yaml
subject: .../gruppo_regioni
  - [skos:member, .../$(ref_regione)]
```

Each region will be defined separately as its own collection (see below).

---

### 🔹 `agroterm_collections_certificazioni`

Creates a **global collection** named `"Certificazioni"`, including all certification categories found in the dataset:

```yaml
subject: .../gruppo_certificazioni
  - [skos:member, .../$(certificazione_normalizzata)]
```

---

## 🔁 Nested Collection Definitions

The mappings below define **individual collections** that are part of the above global groupings.

---

### 🔸 `agroterm_concept_member_regioni`

Each **region** is modeled as a `skos:Collection`, labeled with its name, whose members are the collections for its **geographic areas** (*area geografica*):

```yaml
subject: .../$(ref_regione)
  - [skos:member, .../$(ref_area_geografica)]
```

---

### 🔸 `agroterm_concept_member_area_geografica`

Each **geographic area** becomes a collection with:

* A label from `ref_area_geografica_label`
* Members pointing to the related concepts (`ref_concept`) that belong to that area

```yaml
subject: .../$(ref_area_geografica)
  - [skos:member, .../$(ref_concept)]
```

---

### 🔸 `agroterm_concept_member_certificazioni`

Each **certification** becomes a collection containing all the concepts assigned to it:

```yaml
subject: .../$(certificazione_normalizzata)
  - [skos:member, .../$(concept)]
```

---

### 🧠 Semantic Purpose

* `skos:Collection` is used here to **group concepts non-hierarchically**, unlike `skos:broader/narrower`.
* `skos:member` defines inclusion of concepts or subcollections.
* All collections are linked to the same `skos:ConceptScheme` via `skos:inScheme`.
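
The nesting of collections (global group, region, geographic area, concept) can be pictured as a membership table. The sketch below walks such a table in plain Python to list the concepts reachable from a collection; the identifiers are invented, and the dictionary is only a stand-in for `skos:member` triples.

```python
# Hypothetical membership table: collection id -> member ids (invented values)
members = {
    "gruppo_regioni": ["toscana", "puglia"],
    "toscana": ["centro"],
    "puglia": ["sud"],
    "centro": ["olio_toscano"],
    "sud": ["orecchiette"],
}

def leaf_concepts(collection_id, table):
    """Recursively collect members that are not themselves collections."""
    leaves = []
    for member in table.get(collection_id, []):
        if member in table:      # nested collection: descend into it
            leaves.extend(leaf_concepts(member, table))
        else:                    # plain concept: keep it
            leaves.append(member)
    return leaves

print(leaf_concepts("gruppo_regioni", members))
```

The sketch assumes the membership table contains no cycles; a production version would need to track visited collections.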

---

### ✅ Result

The result is a **multi-level organizational structure** of the Agroterm vocabulary that supports:

* Thematic browsing (e.g., by certification)
* Geographic navigation (e.g., by region or area)
* Enhanced usability in vocabulary browsers

> 🧩 **Tip**: You can enrich collections with `skos:note` or `skos:definition` if needed for display or documentation purposes.


In [14]:
collections_raw_mapping = f"""
agroterm_collections_regioni:
  sources:
    - ['{concepts_source}~jsonpath', "$[*].regioni[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/gruppo_regioni"
  po:
    - [a, skos:Collection]
    - [skos:prefLabel, "Regioni", it~lang]
    - [skos:member, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(ref_regione)"]
    - [skos:inScheme, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"]

agroterm_collections_certificazioni:
  sources:
    - ['{concepts_source}~jsonpath', "$[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/gruppo_certificazioni"
  po:
    - [a, skos:Collection]
    - [skos:prefLabel, "Certificazioni", it~lang]
    - [skos:member, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(certificazione_normalizzata)"]
    - [skos:inScheme, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"]

    
agroterm_concept_member_regioni:
  sources:
    - ['{concepts_source}~jsonpath', "$[*].regioni[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(ref_regione)"
  po:
    - [a, skos:Collection]
    - [skos:prefLabel, "$(label)", it~lang]
    - [skos:member, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(ref_area_geografica)"]
    - [skos:inScheme, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"]


agroterm_concept_member_area_geografica:
  sources:
    - ['{concepts_source}~jsonpath', "$[*].regioni[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(ref_area_geografica)"
  po:
    - [a, skos:Collection]
    - [skos:prefLabel, "$(ref_area_geografica_label)", it~lang]
    - [skos:member, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(ref_concept)"]
    - [skos:inScheme, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"]

agroterm_concept_member_certificazioni:
  sources:
    - ['{concepts_source}~jsonpath', "$[*]"]
  subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(certificazione_normalizzata)"
  po:
    - [a, skos:Collection]
    - [skos:prefLabel, "$(certificazione)", it~lang]
    - [skos:member, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/$(concept)"]
    - [skos:inScheme, "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"]

"""

## 🧩 Wrap-up: Generate the Final Hierarchical Mapping File

In this final step, we **aggregate all YARRRML mapping blocks** into a single YAML file. This file combines:

- The SKOS ConceptScheme
- Concepts and definitions
- Alternative labels
- Sources
- Thematic and geographic collections

This mapping is suitable for **generating a fully structured, hierarchical RDF vocabulary**.

---

### ⚙️ Indentation Helper

YARRRML requires all mapping blocks to be indented **under the `mappings:` key**. The function below ensures each line of a block is indented by 2 spaces:

```python
def indent_mapping_block(block_text):
    ...
```
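
For comparison, a similar indentation step can be sketched with the standard library's `textwrap.indent`, which by default prefixes only lines that contain non-whitespace characters. This is an alternative illustration, not the notebook's implementation, and the demo mapping block is hypothetical.

```python
import textwrap

def indent_mapping_block(block_text: str, prefix: str = "  ") -> str:
    """Indent every non-blank line of a mapping block by the given prefix."""
    return textwrap.indent(block_text.strip(), prefix)

# Hypothetical mapping block used only to show the effect
block = """
demo_mapping:
  sources:
    - ['data.json~jsonpath', "$[*]"]

  subject: "http://example.org/$(id)"
"""

indented = indent_mapping_block(block)
print("mappings:\n" + indented)
```

One small difference from a hand-written loop: `textwrap.indent` leaves whitespace-only lines untouched rather than replacing them with empty strings.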

---

### 🏗️ YAML File Construction

All blocks are merged into a single YAML string in the following order:

1. `prefixes_raw_mapping` – RDF namespace declarations
2. `"mappings:"` – top-level key for YARRRML
3. Indented blocks:

   * `concept_scheme_raw_mapping`
   * `concepts_raw_mapping`
   * `concepts_altlabel_raw_mapping`
   * `concepts_fonti_raw_mapping`
   * `collections_raw_mapping` (which includes hierarchy and grouping)

```python
full_mapping = (
    prefixes_raw_mapping + "\n\n" +
    "mappings:\n" +
    indent_mapping_block(...) + ...
)
```

---

### 📝 Save the YAML

The file is saved to:

```
./yaml/agroterm_mapping_hierarchical.yml
```

```python
with open(output_yaml_path, 'w', encoding='utf-8') as f:
    f.write(full_mapping)
```

---

### ✅ Output

This YAML file is now ready for:

* Conversion to RML (via **YATTER**)
* RDF materialization (via **Morph-KGC**)
* Reuse and editing in YAML-based workflows

> 💡 You can inspect or edit this file manually, or process it in bulk to generate RDF datasets.

> 🧪 Consider validating the YAML with a syntax checker or testing a small subset of the mappings first.

In [15]:
import os

# All mapping blocks must be indented by 2 spaces so that they sit under 'mappings:'
def indent_mapping_block(block_text):
    indented_lines = []
    for line in block_text.strip().splitlines():
        if line.strip():  # indent only non-blank lines
            indented_lines.append(f"  {line}")
        else:
            indented_lines.append("")
    return "\n".join(indented_lines)

# 1. Build the final mapping file
full_mapping = (
    prefixes_raw_mapping + "\n\n" +
    "mappings:\n" +
    indent_mapping_block(concept_scheme_raw_mapping) + "\n\n" +
    indent_mapping_block(concepts_raw_mapping) + "\n\n" +
    indent_mapping_block(concepts_altlabel_raw_mapping) + "\n\n" +
    indent_mapping_block(concepts_fonti_raw_mapping) + "\n\n" +
    indent_mapping_block(collections_raw_mapping)
)

# 2. Ensure the 'yaml' folder exists
os.makedirs('./yaml', exist_ok=True)

# 3. Save the final YAML file
output_yaml_path = './yaml/agroterm_mapping_hierarchical.yml'

with open(output_yaml_path, 'w', encoding='utf-8') as f:
    f.write(full_mapping)

print(f"✅ YAML mapping file successfully created at: {output_yaml_path}")

✅ YAML mapping file successfully created at: ./yaml/agroterm_mapping_hierarchical.yml


## 🔄 Convert Hierarchical YARRRML to RML (Turtle)

This script transforms the final **hierarchical YARRRML mapping file** into an **RML Turtle file**, ready for RDF generation using tools like **Morph-KGC**.

---

### 🧭 Script Overview

#### 📁 1. Define Input/Output Paths

```python
yaml_dir = "yaml"
rml_dir = "rml"
input_yaml_filename = "agroterm_mapping_hierarchical.yml"
```

* Reads the `YARRRML` file from the `yaml/` directory
* Saves the converted `RML` file in the `rml/` directory
* Ensures the output folder exists (`os.makedirs`)

---

#### 🧪 2. Parse YARRRML with `ruamel.yaml`

```python
yaml_loader = YAML(typ='safe', pure=True)
yarrrml_content = yaml_loader.load(yarrrml_file)
```

* Uses `ruamel.yaml` with the safe loader (`typ='safe'`) and the pure-Python implementation (`pure=True`) to parse the file into plain Python dictionaries and lists

---

#### 🔁 3. Convert YARRRML → RML using `yatter`

```python
rml_output = yatter.translate(yarrrml_content)
```

* Translates the in-memory YAML dictionary to RML (as a Turtle string)
* Updates the RML namespace to the canonical URI (`w3id.org`) to ensure compatibility:

```python
rml_output = rml_output.replace("http://semweb.mmlab.be/ns/rml#", "http://w3id.org/rml/")
```
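
To see what the replacement does and to confirm that it touched something, a small helper can report how many occurrences were rewritten. This is a minimal sketch; the function name `swap_rml_namespace` and the two-line Turtle excerpt are hypothetical.

```python
OLD_NS = "http://semweb.mmlab.be/ns/rml#"
NEW_NS = "http://w3id.org/rml/"

def swap_rml_namespace(turtle_text):
    """Return the rewritten text and the number of replacements made."""
    count = turtle_text.count(OLD_NS)
    return turtle_text.replace(OLD_NS, NEW_NS), count

# Hypothetical two-line excerpt of an RML Turtle header
sample = (
    "@prefix rml: <http://semweb.mmlab.be/ns/rml#> .\n"
    "@prefix rr: <http://www.w3.org/ns/r2rml#> .\n"
)

rewritten, replaced = swap_rml_namespace(sample)
print(f"Replacements: {replaced}")
print(rewritten)
```

A count of zero would suggest that the translator already emits the newer namespace, in which case the replacement is a harmless no-op.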

---

#### 💾 4. Save the RML File

```python
with open(rml_file_path, "w", encoding="utf-8") as rml_file:
    rml_file.write(rml_output)
```

* Final output: `rml/agroterm_mapping_hierarchical.rml.ttl`

---

### ✅ Result

You now have a **complete RML mapping file**, syntactically validated by Yatter during translation, that:

* Encodes hierarchical SKOS collections and concept relations
* Can be executed by **Morph-KGC** to generate the RDF dataset

---

> ⚠️ **Error Handling**: If parsing or translation fails, a full traceback is printed.

> 🧪 **Next step**: Run `morph_kgc.materialize()` on this `.rml.ttl` file to produce the final RDF vocabulary in Turtle format.

In [16]:
import os
import traceback
from ruamel.yaml import YAML
import yatter

# 1. Define the paths
yaml_dir = "yaml"
rml_dir = "rml"
input_yaml_filename = "agroterm_mapping_hierarchical.yml"

# 2. Ensure the 'rml' output directory exists
if not os.path.exists(rml_dir):
    os.makedirs(rml_dir)
    print(f"✅ Folder created: {rml_dir}")

# 3. Initialize YAML loader
yaml_loader = YAML(typ='safe', pure=True)

# 4. Define full paths
yaml_file_path = os.path.join(yaml_dir, input_yaml_filename)
rml_file_path = os.path.join(rml_dir, input_yaml_filename.replace(".yml", ".rml.ttl"))

# 5. Process the YARRRML file
try:
    # Load the YARRRML content
    with open(yaml_file_path, "r", encoding="utf-8") as yarrrml_file:
        yarrrml_content = yaml_loader.load(yarrrml_file)

    # Translate YARRRML to RML
    rml_output = yatter.translate(yarrrml_content)

    # Replace namespace if needed
    rml_output = rml_output.replace("http://semweb.mmlab.be/ns/rml#", "http://w3id.org/rml/")

    # Save the RML output
    with open(rml_file_path, "w", encoding="utf-8") as rml_file:
        rml_file.write(rml_output)

    print(f"✅ RML file successfully created at: {rml_file_path}")

except Exception as e:
    print(f"❌ Failed to convert {input_yaml_filename} to RML. Error: {e}")
    traceback.print_exc()

2025-05-13 14:19:54,405 | INFO: Translating YARRRML mapping to [R2]RML
2025-05-13 14:19:54,409 | INFO: RML content is created!
2025-05-13 14:19:54,514 | INFO: Mapping has been syntactically validated.
2025-05-13 14:19:54,516 | INFO: Translation has finished successfully.


✅ RML file successfully created at: rml/agroterm_mapping_hierarchical.rml.ttl


## 🧪 Generate RDF from RML using Morph-KGC

This Python cell takes a previously generated **RML mapping file** and uses the [`morph-kgc`](https://github.com/oeg-upm/morph-kgc) library to produce the final **RDF Turtle** representation of the vocabulary.

---

### 🔧 Step-by-Step Explanation

#### ✅ 1. Define RDF Namespaces

```python
DEFAULT_PREFIXES = {
    "dc": "...",
    "skos": "...",
    ...
}
```

These prefixes are injected into the output RDF graph to improve readability and standardization.

---

#### 🔗 2. Create Morph-KGC Config Programmatically

```python
config_string = f'''
[DEFAULT]
output_format = N-TRIPLES
output_file = {output_path}
...

[agroterm]
mappings = {mapping_file_path}
'''
```

The configuration is defined **in-memory**, specifying:

* Output format (`N-TRIPLES`); the resulting graph is re-serialized as Turtle in a later step
* Output file path
* Mapping file path (RML `.ttl` file)
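
Because the configuration follows INI syntax, it can be sanity-checked with the standard library's `configparser` before it is handed to Morph-KGC. The sketch below is only a syntax check under that assumption: the helper `build_config` is hypothetical, and nothing here verifies that the option names are ones Morph-KGC accepts.

```python
import configparser
import os

def build_config(output_path: str, mapping_path: str) -> str:
    """Assemble the INI-style configuration text for the materialization step."""
    return (
        "[DEFAULT]\n"
        "output_format = N-TRIPLES\n"
        f"output_file = {output_path}\n"
        "\n"
        "[agroterm]\n"
        f"mappings = {mapping_path}\n"
    )

config_string = build_config(
    os.path.join("rdf", "agroterm_hierarchical.ttl"),
    os.path.join("rml", "agroterm_mapping_hierarchical.rml.ttl"),
)

# Parse the text back to check that sections and keys are well-formed
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(config_string)

print(parser.sections())
print(parser["agroterm"]["mappings"])
print(parser["agroterm"]["output_format"])  # inherited from [DEFAULT]
```

Values in `[DEFAULT]` are inherited by every data-source section, which is why `output_format` is readable from `[agroterm]`.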

---

#### 🧠 3. Run Morph-KGC Materialization

```python
graph = morph_kgc.materialize(config_string)
```

* Uses the config string to process the mappings
* Generates an in-memory `rdflib.Graph` containing the RDF triples

---

#### 🧼 4. Add Prefix Bindings and Serialize Output

```python
graph.bind("skos", ...)
graph.serialize(destination=output_path, format="turtle")
```

* Adds namespace bindings (`bind`) for more compact Turtle
* Saves the RDF file to `./rdf/agroterm_hierarchical.ttl`

---

### 📁 Output

* **File**: `rdf/agroterm_hierarchical.ttl`
* **Format**: Turtle (with prefixes)
* **Content**: All concepts, collections, relations and metadata as RDF triples

---

> 💡 **Tip**: This step assumes the RML file was generated using `yatter.translate()` from a YARRRML file in a previous cell.

> ✅ **Next step**: You can now inspect, validate, or publish the RDF file to a triple store or LOD platform like Skosmos.

In [17]:
import os
import morph_kgc
from rdflib import Namespace

# Define the default prefixes
DEFAULT_PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dct": "http://purl.org/dc/terms/",
    "iso639-3": "http://iso639-3.sil.org/code/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}

def add_prefixes_to_graph(graph):
    """Add default and dynamic prefixes to the RDFLib graph."""
    for prefix, namespace in DEFAULT_PREFIXES.items():
        graph.bind(prefix, Namespace(namespace))
    # Add project-specific prefix for agroterm
    dynamic_prefix = "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/"
    graph.bind("agroterm", Namespace(dynamic_prefix))
    print(f"🔗 Added dynamic prefix: agroterm -> {dynamic_prefix}")

def create_and_process_config_string(output_dir, mapping_file_path):
    """Create morph-kgc config dynamically and materialize RDF from a specific mapping file."""
    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)

    # Output file
    output_file = "agroterm_hierarchical.ttl"
    output_path = os.path.join(os.path.abspath(output_dir), output_file)

    # Validate mapping path
    if not os.path.isfile(mapping_file_path):
        print(f"❌ Mapping file not found: {mapping_file_path}")
        return

    # Build configuration string
    config_string = f"""
[DEFAULT]
output_format = N-TRIPLES
output_file = {output_path}

safe_percent_encoding = ì

[agroterm]
mappings = {os.path.abspath(mapping_file_path)}
"""

    print("🚀 Executing morph-kgc with specified mapping file...")
    try:
        # Materialize RDF triples using morph-kgc
        graph = morph_kgc.materialize(config_string)
        
        # Add prefixes
        add_prefixes_to_graph(graph)

        # Serialize RDF to Turtle
        graph.serialize(destination=output_path, format="turtle")
        print(f"✅ RDF file generated successfully: {output_path}\n")

    except Exception as e:
        print(f"❌ Error during RDF generation: {str(e)}")

# --- Define paths ---
mapping_file = os.path.join("rml", "agroterm_mapping_hierarchical.rml.ttl")
rdf_output_directory = "rdf"

# --- Run materialization ---
create_and_process_config_string(rdf_output_directory, mapping_file)


2025-05-13 14:19:54,532 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/SKOSMOS TNA CALL/rdf/agroterm_hierarchical.ttl', 'safe_percent_encoding': 'ì', 'na_values': ',nan', 'read_parsed_mappings_path': '', 'write_parsed_mappings_path': '', 'mapping_partitioning': 'PARTIAL-AGGREGATIONS', 'logging_file': '', 'udfs': '', 'output_kafka_server': '', 'output_kafka_topic': '', 'output_dir': '', 'only_printable_chars': 'no', 'infer_sql_datatypes': 'no', 'logging_level': 'INFO', 'number_of_processes': '16'}
2025-05-13 14:19:54,534 | DEBUG: DATA SOURCE `agroterm`: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/SKOSMOS TNA CALL/rdf/agroterm_hierarchical.ttl', 'safe_percent_encoding': 'ì', 'mappings': '/home/jovyan/work/SKOSMOS TNA CALL/rml/agroterm_mapping_hierarchical.rml.ttl'}


🚀 Executing morph-kgc with specified mapping file...


2025-05-13 14:19:56,885 | INFO: 47 mapping rules retrieved.
2025-05-13 14:19:56,903 | DEBUG: All predicate maps are constant-valued, invariant subset is not enforced.
2025-05-13 14:19:56,914 | DEBUG: All graph maps are constant-valued, invariant subset is not enforced.
2025-05-13 14:19:56,921 | INFO: Mapping partition with 18 groups generated.
2025-05-13 14:19:56,924 | INFO: Maximum number of rules within mapping group: 8.
2025-05-13 14:19:56,927 | INFO: Mappings processed in 2.386 seconds.
2025-05-13 14:19:56,931 | DEBUG: Parallelizing with 16 cores.
2025-05-13 14:19:57,585 | INFO: Number of triples generated in total: 5762.


🔗 Added dynamic prefix: agroterm -> https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/madin-term/agroterm/
✅ RDF file generated successfully: /home/jovyan/work/SKOSMOS TNA CALL/rdf/agroterm_hierarchical.ttl



## 🌲 Aggregate Top Concepts into ConceptScheme

This cell updates the RDF Turtle file by identifying **top-level `skos:Concept`s** (those that have `skos:narrower` relations but **no** `skos:broader`) and explicitly linking them to the `skos:ConceptScheme` using:

- `skos:hasTopConcept` (from the scheme to the concept)
- `skos:topConceptOf` (from the concept to the scheme)

---

### 🔍 Purpose

This step ensures that top-level concepts are **formally declared** as top concepts within the hierarchy of the vocabulary, improving compatibility with tools like **Skosmos** and semantic graph validators.

---

### ⚙️ Logic

```python
for concept in graph.subjects(RDF.type, SKOS.Concept):
    has_narrower = (concept, SKOS.narrower, None) in graph
    has_broader = (concept, SKOS.broader, None) in graph
    if has_narrower and not has_broader:
        ...
```

* The script loops over all `skos:Concept`s.
* If a concept has children (`narrower`) but no parent (`broader`), it is declared a top concept.
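
The selection rule can also be illustrated without rdflib, using plain tuples as stand-ins for triples. The identifiers below are invented; the point is only to show which concepts the rule picks and which it leaves out.

```python
# Hypothetical triples as plain tuples (identifiers are invented)
triples = {
    ("dominio_a", "rdf:type", "skos:Concept"),
    ("sotto_a1", "rdf:type", "skos:Concept"),
    ("termine_x", "rdf:type", "skos:Concept"),
    ("dominio_a", "skos:narrower", "sotto_a1"),
    ("sotto_a1", "skos:broader", "dominio_a"),
    ("sotto_a1", "skos:narrower", "termine_x"),
}

concepts = {s for s, p, o in triples if p == "rdf:type" and o == "skos:Concept"}
with_narrower = {s for s, p, _ in triples if p == "skos:narrower"}
with_broader = {s for s, p, _ in triples if p == "skos:broader"}

# Top concepts: they have children and no parent
top_concepts = sorted((concepts & with_narrower) - with_broader)
print(top_concepts)
```

Note that `termine_x` has no `skos:broader` statement of its own, yet it is not selected, because the rule also requires at least one `skos:narrower` statement.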

---

### 💾 Output

* The RDF file is **modified in place**: `rdf/agroterm_hierarchical.ttl`
* The new relationships are serialized directly into the same file.

---

### ✅ Benefits

* Ensures completeness of the SKOS vocabulary structure.
* Makes top concepts visible in vocabulary browsers.
* Enhances hierarchy navigation.

---

<!-- > 💡 Run this cell **after RDF materialization** but **before final publication or validation**. -->

In [18]:
import os
from rdflib import Graph, RDF
from rdflib.namespace import SKOS

def aggregate_concepts_to_scheme(input_file_path):
    """
    Aggregate top-level skos:Concepts (those with skos:narrower but no skos:broader)
    to the skos:ConceptScheme using skos:hasTopConcept and skos:topConceptOf.
    """
    if not os.path.isfile(input_file_path):
        print(f"❌ File not found: {input_file_path}")
        return

    graph = Graph()
    graph.parse(input_file_path, format="turtle")

    # Find the ConceptScheme
    concept_schemes = list(graph.subjects(RDF.type, SKOS.ConceptScheme))
    if not concept_schemes:
        print(f"⚠️ No skos:ConceptScheme found in {input_file_path}.")
        return

    concept_scheme = concept_schemes[0]  # Assume there is only one scheme

    # Find every skos:Concept that qualifies as a top concept
    # (materialized as a list first, since the graph is modified inside the loop)
    for concept in list(graph.subjects(RDF.type, SKOS.Concept)):
        has_narrower = (concept, SKOS.narrower, None) in graph
        has_broader = (concept, SKOS.broader, None) in graph
        if has_narrower and not has_broader:
            graph.add((concept_scheme, SKOS.hasTopConcept, concept))
            graph.add((concept, SKOS.topConceptOf, concept_scheme))

    # Overwrite the file with the updated graph
    graph.serialize(destination=input_file_path, format="turtle")
    print(f"✅ File updated: {input_file_path}")

# Path of the file to process
rdf_file_path = os.path.join("rdf", "agroterm_hierarchical.ttl")

# Run the aggregation
aggregate_concepts_to_scheme(rdf_file_path)


✅ File updated: rdf/agroterm_hierarchical.ttl


## 🧼 Clean-up of SKOS Collections: prefLabel Normalization

This cell ensures that each `skos:Collection` in the RDF graph has **only one `skos:prefLabel`**, as required by certain SKOS-consuming applications and recommended best practices.

---

### 🛠 What It Does

1. **Parses** the RDF Turtle file: `rdf/agroterm_hierarchical.ttl`
2. **Identifies all `skos:Collection` resources**
3. **If a collection has multiple `skos:prefLabel` values**, it:
   - Keeps only the **first** one found
   - **Removes** all others
4. **Copies** the rest of the graph unchanged
5. **Serializes** a new cleaned RDF file:  
   ➤ `rdf/agroterm_hierarchical_decoded.ttl`

---

### ⚠️ Why This Is Important

- SKOS technically allows multiple `prefLabel`s **only if they are in different languages**.
- If multiple `prefLabel`s in the same language exist (which may happen due to data errors or duplicates), RDF tools like **Skosmos** or **validators** might issue warnings or fail to render properly.
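- This script assumes that keeping the **first label** returned by rdflib is sufficient. It does not distinguish languages, and the order in which labels are returned should not be relied on, so adapt it to filter by language or other criteria if needed.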
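
As one possible adaptation that filters by language, the sketch below keeps a single label per language tag and makes the choice deterministic by sorting. It works on plain `(text, language)` pairs rather than rdflib literals, and the sample labels are invented.

```python
# Hypothetical labels for one collection: (text, language tag) pairs
labels = [
    ("Toscana", "it"),
    ("toscana", "it"),   # duplicate in the same language
    ("Tuscany", "en"),
]

def one_label_per_language(pairs):
    """Keep a single label per language tag, chosen deterministically."""
    kept = {}
    for text, lang in sorted(pairs):   # sorting fixes which label is seen first
        kept.setdefault(lang, text)    # the first label seen per language wins
    return sorted((text, lang) for lang, text in kept.items())

print(one_label_per_language(labels))
```

Sorting is just one way to pick a winner; a project-specific rule (for example, preferring the capitalized form) could replace it.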

---

### 📄 Output

- A new RDF Turtle file with cleaned collections is saved as:
  
  ```text
  rdf/agroterm_hierarchical_decoded.ttl
  ```

---

> 💡 Run this step **before validation** or publishing the RDF vocabulary for optimal compatibility.

In [19]:
from rdflib import Graph, RDF
from rdflib.namespace import SKOS

# Paths
input_path = "./rdf/agroterm_hierarchical.ttl"
output_path = "./rdf/agroterm_hierarchical_decoded.ttl"

# Load the RDF graph
g = Graph()
g.parse(input_path, format="turtle")

# New graph for the cleaned data
g_clean = Graph()
g_clean.bind("skos", SKOS)

# Copy all original namespace bindings
for prefix, namespace in g.namespaces():
    g_clean.bind(prefix, namespace)

# Find all collections (materialized first, since the graph is modified in the loop)
for s in list(g.subjects(RDF.type, SKOS.Collection)):
    labels = list(g.objects(s, SKOS.prefLabel))
    if len(labels) > 1:
        # Keep only the first prefLabel returned
        first_label = labels[0]
        # Remove all prefLabels
        for label in labels:
            g.remove((s, SKOS.prefLabel, label))
        # Add back only the first one
        g.add((s, SKOS.prefLabel, first_label))

# After the changes, copy everything into the new graph
for triple in g:
    g_clean.add(triple)

# Save the new file
g_clean.serialize(destination=output_path, format="turtle")
print(f"✅ Cleaned file saved to: {output_path}")


✅ Cleaned file saved to: ./rdf/agroterm_hierarchical_decoded.ttl
