# **LLOD Tutorial - REALITER Lexicons**

## Creating a Lexicon in SKOS from Structured Data

### **Introduction:**

In this tutorial, we create a lexicon using the Simple Knowledge Organization System (SKOS). Specifically, we show how to generate a lexicon from structured data, using the lexicons developed within the REALITER project. The guide covers the transformation of linguistic data into SKOS, so that the result follows Linked Data principles and fits the Linguistic Linked Open Data (LLOD) paradigm.

Once the lexicon is created, it will be hosted on **Skosmos** and accessible at the following address: [https://vocabs.ilc4clarin.ilc.cnr.it/skosmos/](https://vocabs.ilc4clarin.ilc.cnr.it/skosmos/). Skosmos is a web-based tool for browsing and publishing SKOS vocabularies, which will allow users to easily explore the created lexicons.

The tutorial will be structured in the following phases:
1. Understanding the SKOS format and its relevance for representing linguistic data.
2. Preparing structured data for the transformation process.
3. Implementing the transformation from structured data to SKOS.
4. Validating, visualizing, and publishing the resulting lexicon on Skosmos.

This step-by-step guide aims to provide practical insights into the use of LLOD technologies for language resource development, contributing to the wider use and accessibility of linguistic data in an open and interconnected manner.

### **Prerequisites:**

To follow this tutorial successfully, you should have the following knowledge and tools:

Knowledge:
- **Linked Data**: Familiarity with the principles of Linked Data and how it enables the interconnection of data across the web.
- **SKOS (Simple Knowledge Organization System)**: Understanding of how SKOS is used to represent structured vocabularies and taxonomies.
- **Python**: Basic programming skills in Python, particularly for data processing and RDF generation.
- **Skosmos**: Basic understanding of how Skosmos works as a tool for browsing and publishing SKOS-based vocabularies.

Required Tools:
- **Text Editor or IDE**: For example, Visual Studio Code, PyCharm, or any code editor you prefer for editing Python code.
- **Jupyter Notebook**: A web-based interactive environment for Python programming, where you'll be able to run and document your code.
- **Python** and the following libraries:
  - `yatter`: A Python tool that translates YARRRML mappings into RML or R2RML.
  - `morph-kgc`: An engine that generates RDF from structured data (e.g., relational databases, spreadsheets, or JSON files) using RML or R2RML mappings.
- **Structured Data**: You'll need your data in a structured format, either **XLSX (Excel spreadsheet)** or **JSON**, which will be transformed into SKOS.

### **Preparation of Vocabulary Metadata**

In this section, we will focus on retrieving the metadata of the vocabulary that has been deposited in the DSpace repository of ILC4CLARIN. This metadata is essential for documenting and structuring the vocabulary in accordance with Linked Data principles, and will serve as the foundation for defining the SKOS conceptScheme.

The metadata will be extracted from the DSpace repository, which typically includes information such as authorship, date of issuance, licensing, and subject classification. We will then formalize this metadata into a schema that will describe the overarching conceptScheme for the vocabulary. This conceptScheme acts as the container for the SKOS concepts that form the core of the vocabulary, providing a structured and navigable resource.
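To make the target concrete, the sketch below renders a handful of metadata fields as a `skos:ConceptScheme` in JSON-LD, using only the standard library. The sample values, the `build_concept_scheme` helper, and the choice of Dublin Core Terms properties are illustrative assumptions; they are not the mapping this tutorial applies to the REALITER metadata.

```python
import json

# Hypothetical metadata values, for illustration only
sample_metadata = {
    "uri": "http://example.org/vocab/sample-lexicon",
    "title": "Sample multilingual lexicon",
    "creators": ["Jane Doe", "John Roe"],
    "issued": "2024-01-01",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

def build_concept_scheme(meta):
    """Return a JSON-LD dictionary describing a skos:ConceptScheme."""
    return {
        "@context": {
            "skos": "http://www.w3.org/2004/02/skos/core#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": meta["uri"],
        "@type": "skos:ConceptScheme",
        "dct:title": meta["title"],
        "dct:creator": meta["creators"],
        "dct:issued": meta["issued"],
        # The license is a resource, so it is referenced by IRI rather than as a string
        "dct:license": {"@id": meta["license"]},
    }

scheme = build_concept_scheme(sample_metadata)
print(json.dumps(scheme, indent=2))
```

Every SKOS concept created later would point back to this resource through `skos:inScheme`, which is what makes the scheme act as a container.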

#### **Steps for Retrieving Metadata from DSpace Using REST API and Saving as JSON**

**Step 1: Install and import Required Libraries**
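Before we begin, install the libraries used throughout this tutorial: `requests` for HTTP requests, `pandas` and `openpyxl` for reading Excel files, `ruamel.yaml` for YAML handling, and `yatter` and `morph-kgc` for generating RDF.

If you don't have them installed, run these commands: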

In [1]:
!pip install requests
!pip install pandas
!pip install openpyxl
!pip install ruamel.yaml
!pip install yatter
!pip install morph-kgc



In [2]:
import requests
import json      
import os        
import hashlib   
import urllib3   
import warnings
import pandas as pd
import traceback
import re
import yaml
import yatter
from ruamel.yaml import YAML
import morph_kgc
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import SKOS, RDF  # Import RDF
import urllib.parse

**Step 2: Search REALITER Collection Data from DSpace using REST API**

In this step, we will send a GET request to retrieve data about the REALITER collection from the DSpace REST API. Specifically, we will query the collections endpoint to search for the REALITER collection within the ILC4CLARIN DSpace repository.

We will use the following endpoint to retrieve the collections:
`https://dspace-clarin-it.ilc.cnr.it/repository/rest/collections`

Below is the code to perform this GET request using Python:

In [3]:
# Suppress only the InsecureRequestWarning from urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Define the API endpoint for collections
url = "https://dspace-clarin-it.ilc.cnr.it/repository/rest/collections"

# Make the GET request to the DSpace API, disabling SSL verification
response = requests.get(url, verify=False)

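# Check if the request was successful; stop early otherwise,
# because response.json() below would fail on an error response
if response.status_code == 200:
    print("Request successful!")
else:
    raise RuntimeError(f"Request failed with status code: {response.status_code}")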

# Get the data from the response
collections_data = response.json()

# Filter the first collection that contains "realiter" in the "name" field
realiter_collection = next((collection for collection in collections_data if "realiter" in collection['name'].lower()), None)

# Print the filtered REALITER collection (optional)
if realiter_collection:
    print(json.dumps(realiter_collection, indent=4))
else:
    print("No REALITER collection found.")




Request successful!
{
    "type": "collection",
    "expand": [
        "parentCommunityList",
        "parentCommunity",
        "items",
        "license",
        "logo",
        "all"
    ],
    "handle": "000-c0-111/565",
    "id": 25,
    "name": "REALITER - OTPL",
    "copyrightText": "",
    "introductoryText": "<p>REALITER \u00e8 la Rete panlatina di terminologia che riunisce individui, istituzioni e organismi dei Paesi di lingua neolatina che lavorano nel settore della terminologia, al fine di favorire lo sviluppo armonizzato delle lingue neolatine.</p>\r\n<p>L'OTPL - Osservatorio di terminologie e politiche linguistiche sviluppa studi sulle terminologie specialistiche nelle lingue euroamericane, attraverso attivit\u00e0 di ricerca scientifica, teorica e applicata, in prospettiva diacronica e sincronica.</p>\r\n<p></p>\r\n <p>REALITER is the Pan-Latin Terminology Network that brings together individuals, institutions and organizations from the Latin-speaking countries working

**Step 3: Retrieve Items from the REALITER Collection**

In this step, we will retrieve the items from the REALITER collection using the DSpace REST API. After identifying the REALITER collection in the previous step, we will now query the items that belong to this collection. 

The API endpoint to retrieve the items from a specific collection is:  
`https://dspace-clarin-it.ilc.cnr.it/repository/rest/collections/{collectionId}/items`

We will use the `collectionId` from the `realiter_collection` obtained in Step 2.

In [4]:
# Assuming realiter_collection has been retrieved in Step 2
collection_id = realiter_collection['id']  # Extract the ID of the REALITER collection

# Construct the API endpoint for retrieving items in the REALITER collection
items_url = f"https://dspace-clarin-it.ilc.cnr.it/repository/rest/collections/{collection_id}/items"

# Make the GET request to the DSpace API to retrieve the items
response = requests.get(items_url, verify=False)

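# Check if the request was successful; stop early otherwise,
# because response.json() below would fail on an error response
if response.status_code == 200:
    print("Items retrieved successfully!")
else:
    raise RuntimeError(f"Request failed with status code: {response.status_code}")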

# Get the data from the response (list of items)
items_data = response.json()

# Print the retrieved items (optional)
#print(json.dumps(items_data, indent=4))



Items retrieved successfully!


**Step 4: Get the Vocabularies Metadata**

In this section, we will retrieve the metadata for each item in the REALITER collection. The metadata provides detailed information about each vocabulary item, such as authorship, title, language, and more. We will use the DSpace REST API endpoint:

```
https://dspace-clarin-it.ilc.cnr.it/repository/rest/items/{item_id}/metadata
```

For each item in our list, we will extract its `id`, then make a request to this endpoint to retrieve the associated metadata.

In [5]:
# Assuming items_data has been populated in Step 3
metadata_data = []

# Loop through each item to get the metadata
for item in items_data:
    item_id = item['id']  # Extract the item ID
    metadata_url = f"https://dspace-clarin-it.ilc.cnr.it/repository/rest/items/{item_id}/metadata"

    # Make the GET request to retrieve the metadata for the current item
    response = requests.get(metadata_url, verify=False)
    
    # Check if the request was successful
    if response.status_code == 200:
        metadata = response.json()  # Get the metadata data
        print(f"Metadata retrieved for item {item_id}")
        metadata_data.append({
            "item_id": item_id,
            "metadata": metadata
        })
    else:
        print(f"Failed to retrieve metadata for item {item_id} with status code: {response.status_code}")

# Print the final metadata for all items (optional)
# print(json.dumps(metadata_data, indent=4))

Metadata retrieved for item 618
Metadata retrieved for item 1077
Metadata retrieved for item 1087
Metadata retrieved for item 1091
Metadata retrieved for item 1089
Metadata retrieved for item 1107
Metadata retrieved for item 1118
Metadata retrieved for item 1130
Metadata retrieved for item 1132
Metadata retrieved for item 1133
Metadata retrieved for item 1134
Metadata retrieved for item 1135
Metadata retrieved for item 1136
Metadata retrieved for item 1138
Metadata retrieved for item 1137
Metadata retrieved for item 1139
Metadata retrieved for item 1140
Metadata retrieved for item 1141
Metadata retrieved for item 1142
Metadata retrieved for item 1143


**Step 5: Metadata Mapping and Saving Concept Schemes Metadata**

In this step, we will map the metadata retrieved for each item in the REALITER collection into a structure suitable for creating a SKOS Concept Scheme. Each item's metadata will be transformed into a Python dictionary: the `key` fields from the metadata become the dictionary keys (with every `.` replaced by `_`, so `dc.title` becomes `dc_title`), and the corresponding `value` fields become the values.

Each dictionary is then saved as a JSON file, which can be used to generate SKOS concept schemes or be integrated with other Linked Data frameworks.

##### Metadata Structure:

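Given the metadata structure of each item, we will create a dictionary where:
- The keys are derived from the unique `key` fields such as `"dc.title"`, `"dc.contributor.author"`, and `"dc.language.iso"`, written with underscores (`"dc_title"`, `"dc_contributor_author"`, `"dc_language_iso"`).
- The values are the corresponding `value` fields. A key that occurs once keeps a plain string value.
- A key that occurs several times (e.g., `dc.contributor.author`) has its values collected in a list.
- `dc.language.iso` is a special case: its values are always stored as a list of objects of the form `{"code": <value>}`, even when there is only one language.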
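As a worked example of these rules, the sketch below applies them to a small hand-written list of entries. The sample values are invented, and `map_entries` is a condensed stand-in for the notebook cell that follows, not a replacement for it.

```python
# Hypothetical sample mimicking the list of key/value entries returned for one item
sample_entries = [
    {"key": "dc.title", "value": "Sample lexicon"},
    {"key": "dc.contributor.author", "value": "Doe, Jane"},
    {"key": "dc.contributor.author", "value": "Roe, John"},
    {"key": "dc.language.iso", "value": "ita"},
    {"key": "dc.language.iso", "value": "fra"},
]

def map_entries(entries):
    """Collapse a list of key/value entries into a single dictionary."""
    mapped = {}
    for entry in entries:
        key = entry["key"].replace(".", "_")
        value = entry["value"]
        if key == "dc_language_iso":
            # Language codes are always kept as a list of {"code": ...} objects
            mapped.setdefault(key, []).append({"code": value})
        elif key not in mapped:
            mapped[key] = value                  # first occurrence: plain value
        elif isinstance(mapped[key], list):
            mapped[key].append(value)            # third occurrence onwards
        else:
            mapped[key] = [mapped[key], value]   # second occurrence: promote to list
    return mapped

mapped = map_entries(sample_entries)
print(mapped)
```

Note that a single-valued key stays a string while a repeated key becomes a list, so downstream code has to handle both shapes.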

In [6]:
mapped_metadata_list = []

# Loop through the metadata of each item in metadata_data
for item in metadata_data:
    item_id = item['item_id']  # Extract item_id
    metadata = item['metadata']  # Get the metadata list
    
    # Create a dictionary to store the mapped metadata
    mapped_metadata = {"item_id": item_id}

    # Loop through each metadata element
    for element in metadata:
        key = element['key'].replace('.', '_')  # Replace '.' with '_'
        value = element['value']

        # Special case for 'dc_language_iso': always store a list of objects with a "code" key
        if key == 'dc_language_iso':
            mapped_metadata.setdefault(key, []).append({"code": value})
        else:
            # If the key already exists, append the value to the list (to handle multiple values for the same key)
            if key in mapped_metadata:
                if isinstance(mapped_metadata[key], list):
                    mapped_metadata[key].append(value)
                else:
                    # Convert the single value to a list and append the new value
                    mapped_metadata[key] = [mapped_metadata[key], value]
            else:
                # If the key doesn't exist, add it to the dictionary
                mapped_metadata[key] = value

    # Append the mapped metadata dictionary to the final list
    mapped_metadata_list.append(mapped_metadata)

# Create the directory "concept_schemes" if it doesn't exist
output_directory = "concept_schemes"
os.makedirs(output_directory, exist_ok=True)

# Loop through each mapped metadata and save as JSON file
for mapped_metadata in mapped_metadata_list:
    # Extract the value of "dc_identifier_uri"
    if "dc_identifier_uri" in mapped_metadata:
        identifier_uri = mapped_metadata["dc_identifier_uri"]
        
        # Extract the part after "http://hdl.handle.net/"
        if identifier_uri.startswith("http://hdl.handle.net/"):
            identifier_suffix = identifier_uri.replace("http://hdl.handle.net/", "").replace("/", "_")
            identifier_suffix = identifier_suffix.replace(".", "_").replace("-", "_")
            # Define the file name and path
            file_name = f"{identifier_suffix}.json"
            file_path = os.path.join(output_directory, file_name)
            
            # Save the mapped metadata as a JSON file
            with open(file_path, "w", encoding="utf-8") as f:
                json.dump(mapped_metadata, f, ensure_ascii=False, indent=4)
            
            print(f"File saved: {file_path}")
    else:
        print(f"Missing 'dc.identifier.uri' for item {mapped_metadata['item_id']}")

# Optionally print out the first item in metadata_data for verification
print(json.dumps(metadata_data[0], indent=4))


File saved: concept_schemes/20_500_11752_OPEN_975.json
File saved: concept_schemes/20_500_11752_OPEN_987.json
File saved: concept_schemes/20_500_11752_OPEN_993.json
File saved: concept_schemes/20_500_11752_OPEN_994.json
File saved: concept_schemes/20_500_11752_OPEN_995.json
File saved: concept_schemes/20_500_11752_OPEN_1006.json
File saved: concept_schemes/20_500_11752_OPEN_1008.json
File saved: concept_schemes/20_500_11752_OPEN_1014.json
File saved: concept_schemes/20_500_11752_OPEN_1015.json
File saved: concept_schemes/20_500_11752_OPEN_1016.json
File saved: concept_schemes/20_500_11752_OPEN_1017.json
File saved: concept_schemes/20_500_11752_OPEN_1018.json
File saved: concept_schemes/20_500_11752_OPEN_1019.json
File saved: concept_schemes/20_500_11752_OPEN_1021.json
File saved: concept_schemes/20_500_11752_OPEN_1020.json
File saved: concept_schemes/20_500_11752_OPEN_1022.json
File saved: concept_schemes/20_500_11752_OPEN_1024.json
File saved: concept_schemes/20_500_11752_OPEN_1025.js

### **Preparation of Source Files**

In this section, we will explain how to prepare the structured files so they can be properly associated with their corresponding metadata. 


**Step 1: Renaming input files**

To ensure a correct association between structured data files and their respective metadata files (Concept Schemes), the structured files must be renamed using the same nomenclature as the corresponding Concept Scheme files. This is a manual process and requires the following steps:

1. **Identify the Corresponding Metadata**: 
   For each structured file (e.g., a lexicon or terminology file), you need to identify its matching Concept Scheme file. This can be done by checking the metadata that describes the structured file. This metadata is stored in the JSON files created in the previous section; each one contains the `dc.identifier.uri` value (saved under the key `dc_identifier_uri`), which uniquely identifies the Concept Scheme and from which the file name is derived.

2. **Renaming the Structured Files**: 
   Once you have identified the corresponding Concept Scheme, rename the structured file to match the file name of the Concept Scheme. For instance, if the Concept Scheme file is named `20_500_11752_OPEN_1014.json`, you should rename the structured file to `20_500_11752_OPEN_1014.<appropriate_extension>` (e.g., `.xlsx`, `.json`, etc.). This ensures that both the structured file and its metadata share the same base name.

3. **Consistency Check**: 
   After renaming the files, verify that the filenames are consistent across all related files (metadata, structured data). This step is crucial to ensure proper mapping during the processing phase. Misnamed files could cause errors or incorrect associations between data and metadata.

This manual process is essential for preparing the data to ensure it can be processed correctly in subsequent stages of the pipeline. Properly named files will ensure that structured data can be automatically linked to the correct metadata during mapping and transformation tasks.
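The naming rule and the consistency check can be sketched as follows. `handle_to_basename` mirrors the replacements applied when the Concept Scheme files were saved; `find_unmatched` and the example handle (reconstructed from one of the file names above) are assumptions made for illustration.

```python
from pathlib import Path

def handle_to_basename(handle_uri):
    """Turn a Handle URI into the base name used for the Concept Scheme files."""
    suffix = handle_uri.replace("http://hdl.handle.net/", "")
    for char in "/.-":
        suffix = suffix.replace(char, "_")
    return suffix

def find_unmatched(input_dir, schemes_dir, extensions=(".xlsx", ".json")):
    """Return the structured files that have no Concept Scheme with the same base name."""
    scheme_names = {path.stem for path in Path(schemes_dir).glob("*.json")}
    return sorted(
        path.name
        for path in Path(input_dir).iterdir()
        if path.suffix.lower() in extensions and path.stem not in scheme_names
    )

# Assumed example handle, reconstructed from a file name shown earlier
print(handle_to_basename("http://hdl.handle.net/20.500.11752/OPEN-1014"))

# Directory names follow the ones used in this tutorial; skipped if they do not exist
if Path("input_data").is_dir() and Path("concept_schemes").is_dir():
    print(find_unmatched("input_data", "concept_schemes"))
```

An empty list from `find_unmatched` means every structured file has a matching Concept Scheme; any name it returns points to a file that still needs renaming.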

**Step 2: Adapting and normalizing source files**

The main objective of this step is to create a new data structure that makes the information more accessible and aligned with the requirements for Linked Data. The Excel files follow a simple lexicon-building convention: each language column holds the labels for a concept, with the first line of a cell read as the preferred label (prefLabel) and any further lines read as alternative labels (altLabel); a `def` column holds the definition and a `nota` column holds notes. A trailing parenthetical in a label is stripped from the label itself, and the original text of the preferred label is kept as a note.

This process will allow us to create intermediate JSON files that contain normalized data. These JSON files will be saved in the same folder as the structured source files. The goal is to transform and standardize the input data into a consistent format, making it easier to convert into Linked Data and prepare it for subsequent processing steps.

By adapting and normalizing the source files, we ensure that all values, labels, and language tags are uniform across the dataset. This includes handling language variations, cleaning and transforming labels, and organizing the data in a structured format that will be used for further transformations.
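The two normalizations can be illustrated on small inputs. This is a simplified, regex-based sketch: the header `ES [MEX]` and the cell contents are invented examples, and `normalize_header` and `split_cell` are hypothetical helpers that approximate, rather than reproduce, the functions defined in the next cell.

```python
import re

# Subset of language/region combinations mapped to language tags
LANG_MAP = {
    "es-es": "es",
    "es-mex": "es-MX",
    "es-arg": "es-AR",
    "pt-br": "pt-BR",
    "fr-ca": "fr-CA",
}

def normalize_header(header):
    """Turn a column header such as 'ES [MEX]' into a language tag."""
    tag = header.strip().lower()
    # "es [mex]" -> "es-mex": the bracketed region becomes a hyphenated suffix
    tag = re.sub(r"\s*\[\s*([^\]]*?)\s*\]", r"-\1", tag)
    return LANG_MAP.get(tag, tag)

def strip_trailing_parenthetical(label):
    """Drop a trailing '(...)' group, e.g. grammatical information."""
    return re.sub(r"\s*\([^)]+\)\s*$", "", label).strip()

def split_cell(cell):
    """Return (prefLabel, [altLabels]) for a multi-line cell."""
    lines = [line.strip() for line in str(cell).split("\n") if line.strip()]
    if not lines:
        return None, []
    cleaned = [strip_trailing_parenthetical(line) for line in lines]
    return cleaned[0], cleaned[1:]

print(normalize_header("ES [MEX]"))
print(split_cell("aquifer (n.)\nwater table"))
```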

In [7]:
def clean_column_name(col_name):
    try:
        # Lowercase and strip surrounding whitespace
        col_name = col_name.lower().strip()
        
        # Replace square brackets with a hyphen and remove spaces around the hyphen
        if '[' in col_name and ']' in col_name:
            col_name = col_name.replace('[', '-').replace(']', '')
            col_name = col_name.replace(" -", "-").replace("- ", "-")
        
        # Map language/region combinations to language tags
        lang_map = {
            "es-es": "es",
            "es-mex": "es-MX",
            "es-arg": "es-AR",
            "pt-br": "pt-BR",
            "fr-ca": "fr-CA",
        }

        # Handle headers that list several languages, such as "es-arg/mex"
        if "/" in col_name:
            # Split on "/" and return the first part found in the map
            parts = col_name.split("/")
            for part in parts:
                part = part.strip()
                if part in lang_map:
                    return lang_map[part]
            # If no part is in the map, return the first part
            return parts[0]

        # Replace col_name with its mapped tag if it is a key in the map
        if col_name in lang_map:
            col_name = lang_map[col_name]

        return col_name
    except Exception as e:
        print(f"Error cleaning column name: {col_name}. Error: {e}")
        traceback.print_exc()
        # Fall back to the raw name as a string so callers never receive None
        return str(col_name)

def process_value(value):
    """Return (pref_label, alt_labels) for a cell; pref_label is None for empty cells."""
    try:
        # A cell can accidentally arrive as a Series (e.g. with duplicate column names)
        if isinstance(value, pd.Series):
            # In that case, keep only the first element
            value = value.iloc[0]
        
        # Empty or NaN cells carry no labels
        if pd.isna(value) or value == "":
            return None, []
        
        # Numerical values become a single label
        if isinstance(value, (int, float)):
            value = str(value).strip()
            return value, []
        
        # Strings may contain several labels, one per line
        if isinstance(value, str):
            # Split the string on newlines (if any), dropping empty lines
            values = [v.strip() for v in value.split("\n") if v.strip()]
            
            # The first label is the prefLabel, the others are altLabels
            pref_label = values[0] if values else None
            alt_labels = values[1:] if len(values) > 1 else []
            return pref_label, alt_labels
        
        else:
            print(f"Unsupported value type: {type(value)} with value: {value}")
            return None, []
    
    except Exception as e:
        print(f"Error processing value: {value}. Error: {e}")
        traceback.print_exc()
        return None, []


def clean_concept_name(concept):
    try:
        # Split the concept based on \n and take the first value
        first_part = concept.split("\n")[0].strip()

        # Use a regex to remove content inside the last parentheses, if present
        cleaned_concept = re.sub(r'\s*\([^)]+\)\s*$', '', first_part).strip()

        # Replace problematic characters: replace spaces and dashes with underscores
        cleaned_concept = cleaned_concept.replace(" ", "_").replace("-", "_").replace("|", "_").replace("'", "_").replace(",", "_").replace("’", "_").replace("/", "_")

        # Check if the string contains a "(" character
        if "(" in cleaned_concept:
            print(f"Warning: Concept contains '(': {cleaned_concept}")

        
        # Return the cleaned concept without any encoding
        return cleaned_concept

    except Exception as e:
        print(f"Error processing concept: {concept}. Error: {e}")
        return None

def get_definition(row, first_col):
    try:
        for col in row.index:
            if col.lower() == 'def':
                definition = row[col]
                definition_lang = clean_column_name(first_col)  # Use the cleaned name of the first column as the language
                return definition, definition_lang
        return None, None
    except Exception as e:
        print(f"Error getting definition from row: {row}. Error: {e}")
        traceback.print_exc()
        # Keep the return shape consistent so the caller can still unpack two values
        return None, None


def extract_value(label):
    # Remove the content of the trailing parentheses, if any
    cleaned_value = re.sub(r'\s*\([^)]+\)\s*$', '', label).strip()
    return cleaned_value


def process_excel_file(file_path):
    try:
        df = pd.read_excel(file_path)
        
        # Clean the column names and check that none were lost
        original_columns = df.columns.tolist()  # Keep the original columns
        cleaned_columns = [clean_column_name(col) for col in original_columns]

        if len(cleaned_columns) != len(original_columns):
            print(f"Warning: The number of cleaned columns ({len(cleaned_columns)}) does not match the number of original columns ({len(original_columns)}).")

        df.columns = cleaned_columns  # Apply the cleaned column names
        
        processed_data = []
        for index, row in df.iterrows():
            # Replace NaN with empty strings in the row
            row = row.fillna("")

            concept_raw = row.iloc[0]
            concept = clean_concept_name(concept_raw)

            # Retrieve the definition and its language
            first_col = df.columns[0]  # Name of the first column
            definition, definition_lang = get_definition(row, first_col)

            pref_labels = []
            alt_labels = []
            note = None 

            # Check whether a "nota" column exists, regardless of case
            note_column = next((col for col in df.columns if col.lower() in ['nota']), None)
            notes = []  # Collects all notes for this row
            
            if note_column:
                raw_note = row.get(note_column, "")
                
                # Check for numbered notes (e.g. "1.", "2.", etc.)
                numbered_notes = re.split(r'\s*\d+\.\s*', raw_note.strip())
            
                # More than one element after the split means the note was numbered
                if len(numbered_notes) > 1:
                    for note_text in numbered_notes:
                        if note_text.strip():
                            notes.append({
                                "concept": concept,
                                "value": note_text.strip(),
                                "lang": None
                            })
                else:
                    notes.append({
                        "concept": concept,
                        "value": raw_note.strip(),
                        "lang": None  
                    })

            for col in df.columns:
                if col.lower() == 'def' or col == 'nota':
                    continue
                
                cell_value = row[col]
                pref_label, alt_label_list = process_value(cell_value)

                # For the prefLabel: store the cleaned label and keep the raw text as a note
                if pref_label:
                    pref_labels.append({
                        "concept": concept,
                        "value": extract_value(pref_label),
                        "lang": col,
                    })
                    notes.append({
                        "concept": concept,
                        "value": pref_label.strip(),
                        "lang": col
                    })
                
                # For the altLabels
                for alt_label in alt_label_list:
                    alt_labels.append({
                        "concept": concept,
                        "value": extract_value(alt_label),
                        "lang": col,
                    })

            note = notes if notes else None

            json_structure = {
                "concept": concept,
                "definition": definition,
                "definitionLang": definition_lang,  # Language of the definition
                "prefLabels": pref_labels,
                "altLabels": alt_labels,
                "note": note
            }
            
            processed_data.append(json_structure)
        
        return processed_data
    except Exception as e:
        print(f"Error processing Excel file: {file_path}. Error: {e}")
        traceback.print_exc()
        return []


def process_all_excel_files(input_data_dir):
    all_processed_data = {}
    
    for filename in os.listdir(input_data_dir):
        if filename.endswith(".xlsx"):
            file_path = os.path.join(input_data_dir, filename)
            print(f"Processing file: {filename}")
            
            try:
                processed_data = process_excel_file(file_path)
                if processed_data:
                    all_processed_data[filename] = processed_data
                
                    output_filename = filename.replace(".xlsx", ".json")
                    output_path = os.path.join(input_data_dir, output_filename)
                    with open(output_path, "w", encoding="utf-8") as f:
                        json.dump(processed_data, f, ensure_ascii=False, indent=4)
                    
                    print(f"Processed data saved to {output_path}")
            except Exception as e:
                print(f"Failed to process {filename}: {e}")
                traceback.print_exc()
    
    return all_processed_data

# Directory containing input Excel files
input_data_dir = "input_data"

# Process all Excel files in the directory
processed_data = process_all_excel_files(input_data_dir)


Processing file: 20_500_11752_OPEN_1015.xlsx
Processed data saved to input_data/20_500_11752_OPEN_1015.json
Processing file: 20_500_11752_OPEN_1017.xlsx
Processed data saved to input_data/20_500_11752_OPEN_1017.json
Processing file: 20_500_11752_OPEN_1021.xlsx
Processed data saved to input_data/20_500_11752_OPEN_1021.json
Processing file: 20_500_11752_OPEN_1022.xlsx
Processed data saved to input_data/20_500_11752_OPEN_1022.json
Processing file: 20_500_11752_OPEN_1020.xlsx
Processed data saved to input_data/20_500_11752_OPEN_1020.json
Processing file: 20_500_11752_OPEN_1024.xlsx
Processed data saved to input_data/20_500_11752_OPEN_1024.json
Processing file: 20_500_11752_OPEN_1016.xlsx
Processed data saved to input_data/20_500_11752_OPEN_1016.json
Processing file: 20_500_11752_OPEN_1019.xlsx
Processed data saved to input_data/20_500_11752_OPEN_1019.json
Processing file: 20_500_11752_OPEN_1026.xlsx
Processed data saved to input_data/20_500_11752_OPEN_1026.json
Processing file: 20_500_1175
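
SKOS allows at most one `skos:prefLabel` per language for each concept, so it can be worth checking the generated JSON before mapping it. The following is a minimal sketch of such a check, assuming the record structure produced above (`prefLabels` entries with `value` and `lang`). The function name `find_pref_label_conflicts` and the sample records are hypothetical and not part of the original notebook.

```python
from collections import defaultdict


def find_pref_label_conflicts(records):
    """Return {(concept, lang): [values]} for concepts that have more
    than one distinct prefLabel in the same language."""
    seen = defaultdict(set)
    for record in records:
        for label in record.get("prefLabels") or []:
            seen[(record["concept"], label.get("lang"))].add(label["value"])
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


# Hypothetical records, for illustration only
records = [
    {"concept": "c_1", "prefLabels": [
        {"value": "energia", "lang": "it"},
        {"value": "energy", "lang": "en"},
    ]},
    {"concept": "c_2", "prefLabels": [
        {"value": "acqua", "lang": "it"},
        {"value": "acque", "lang": "it"},
    ]},
]

conflicts = find_pref_label_conflicts(records)
print(conflicts)
```

An empty result means that no concept carries two different preferred labels in the same language.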

### **Creation of YARRRML Mapping File**

After obtaining the vocabulary metadata and structured data, the next step is to create a YARRRML mapping file for each vocabulary. This mapping file will allow us to define how the data from our structured sources is transformed into RDF triples, which are essential for generating SKOS Concept Schemes.

YARRRML (Yet Another RML Mapping Language) is a user-friendly, YAML-based syntax for creating RML (RDF Mapping Language) mappings. It simplifies the process of describing how data should be mapped to RDF by allowing the use of a more readable format.

#### **Purpose of the Mapping File**

The YARRRML mapping file serves as the blueprint that specifies how the structured data (e.g., terms, definitions, languages, and relationships) will be transformed into RDF triples in SKOS format. Each mapping file will correspond to a specific vocabulary and will contain rules for converting the data into SKOS concepts, labels, definitions, and relations.
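
To make these rules concrete, the following minimal sketch expands one record of the structured JSON into N-Triples lines by hand. It only illustrates the kind of output the mapping is meant to produce: the record content, the `BASE` IRI, and the helper names are hypothetical, and the actual conversion in this tutorial is carried out by the mapping engine, not by this code.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Hypothetical vocabulary namespace, used only for this illustration
BASE = "https://example.org/vocabularies/demo/"


def literal(value, lang=None):
    """Format an N-Triples literal, with an optional language tag."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}' if lang else f'"{escaped}"'


def record_to_triples(record):
    """Expand one structured record into a list of N-Triples lines."""
    subject = f"<{BASE}{record['concept']}>"
    triples = [f"{subject} <{RDF_TYPE}> <{SKOS}Concept> ."]
    if record.get("definition"):
        obj = literal(record["definition"], record.get("definitionLang"))
        triples.append(f"{subject} <{SKOS}definition> {obj} .")
    # Labels and notes share the same {"value", "lang"} shape
    for key, prop in (("prefLabels", "prefLabel"),
                      ("altLabels", "altLabel"),
                      ("note", "note")):
        for item in record.get(key) or []:
            obj = literal(item["value"], item.get("lang"))
            triples.append(f"{subject} <{SKOS}{prop}> {obj} .")
    return triples


# Hypothetical record following the structure of the generated JSON files
example = {
    "concept": "c_1",
    "definition": "A sample definition.",
    "definitionLang": "en",
    "prefLabels": [{"concept": "c_1", "value": "energia", "lang": "it"}],
    "altLabels": [],
    "note": None,
}

for line in record_to_triples(example):
    print(line)
```

Each JSON field corresponds to one SKOS property, which is the same correspondence that the mapping file declares rule by rule.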

**Step 1: Collecting references**

In this step, we collect the references that link each vocabulary's concept schema to its structured data file. The code below iterates over the JSON files in the `concept_schemes` directory, checks that each one parses correctly, and stores its file name in the `data_pairs` list. The same file name is later used to locate the corresponding structured data file in the `input_data` directory.

In [9]:
import os
import json

# Directory containing the concept schema files
concept_schemes_dir = "concept_schemes"

# List collecting the references to each concept schema / structured data pair
data_pairs = []

# Iterate through the files in the "concept_schemes" directory
for filename in os.listdir(concept_schemes_dir):
    # Ensure the file is a JSON file
    if filename.endswith(".json"):
        file_path = os.path.join(concept_schemes_dir, filename)
        
        # Open and load the JSON file
        with open(file_path, "r", encoding="utf-8") as f:
            try:
                file_data = json.load(f)
                # Append the concept schema reference to the data_pairs list
                data_pairs.append({
                    "references": filename  # Store the filename as the reference
                })
            except json.JSONDecodeError:
                print(f"Failed to decode JSON from {file_path}")

# Print or process the final data_pairs list
print(f"Total data pairs collected: {len(data_pairs)}")
print(data_pairs)


Total data pairs collected: 20
[{'references': '20_500_11752_OPEN_1021.json'}, {'references': '20_500_11752_OPEN_1017.json'}, {'references': '20_500_11752_OPEN_1014.json'}, {'references': '20_500_11752_OPEN_1022.json'}, {'references': '20_500_11752_OPEN_1008.json'}, {'references': '20_500_11752_OPEN_1015.json'}, {'references': '20_500_11752_OPEN_1019.json'}, {'references': '20_500_11752_OPEN_1006.json'}, {'references': '20_500_11752_OPEN_1025.json'}, {'references': '20_500_11752_OPEN_1016.json'}, {'references': '20_500_11752_OPEN_1024.json'}, {'references': '20_500_11752_OPEN_994.json'}, {'references': '20_500_11752_OPEN_995.json'}, {'references': '20_500_11752_OPEN_1026.json'}, {'references': '20_500_11752_OPEN_1027.json'}, {'references': '20_500_11752_OPEN_975.json'}, {'references': '20_500_11752_OPEN_1020.json'}, {'references': '20_500_11752_OPEN_1018.json'}, {'references': '20_500_11752_OPEN_987.json'}, {'references': '20_500_11752_OPEN_993.json'}]
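
The cell above records only the concept schema file name and relies on a structured data file with the same name existing in `input_data`. As a minimal sketch, that assumption could be checked while the pairs are collected. The function `collect_data_pairs` and the file names in the demonstration are hypothetical.

```python
import os
import tempfile


def collect_data_pairs(concept_schemes_dir, input_data_dir):
    """Pair each concept schema JSON with the structured data file of the
    same name, reporting concept schemas that have no structured data."""
    pairs, missing = [], []
    for filename in sorted(os.listdir(concept_schemes_dir)):
        if not filename.endswith(".json"):
            continue
        if os.path.exists(os.path.join(input_data_dir, filename)):
            pairs.append({"references": filename})
        else:
            missing.append(filename)
    return pairs, missing


# Self-contained demonstration on temporary directories (hypothetical file names)
with tempfile.TemporaryDirectory() as root:
    schemes = os.path.join(root, "concept_schemes")
    data = os.path.join(root, "input_data")
    os.makedirs(schemes)
    os.makedirs(data)
    for name in ("vocab_a.json", "vocab_b.json"):
        open(os.path.join(schemes, name), "w").close()
    open(os.path.join(data, "vocab_a.json"), "w").close()

    demo_pairs, demo_missing = collect_data_pairs(schemes, data)

print(demo_pairs)
print(demo_missing)
```

Reporting the unmatched files separately makes it easier to spot vocabularies whose structured data was never produced.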


**Step 2: Creation of YARRRML Mapping Files**

In this step, we generate the YARRRML mapping files, which map the data in the structured files to RDF. Having collected the references to each concept schema and its structured data in the previous step, we now create a directory called `mapping_files`. This directory will contain a separate YARRRML mapping file for each object in the `data_pairs` list.

The process iterates over the `data_pairs` list. For each object in the list, a YAML file is generated in the `mapping_files` directory. Each YAML file defines the rules for mapping the structured data from the source file to the corresponding RDF triples according to the concept schema. These mappings ensure that each term, label, and relationship in the source data is correctly transformed into the appropriate linked data representation.

The generation of these mapping files is crucial for the next steps, where we will use these YARRRML files to create RML files that can be processed to produce the final Turtle RDF files. Each mapping file will be customized according to the structure of the concept schema and the data within each structured file.

In [14]:
import re

from ruamel.yaml import YAML


def ensure_directory_exists(directory):
    """Ensure the directory exists, otherwise create it."""
    if not os.path.exists(directory):
        os.makedirs(directory)


def load_json(file_path):
    """Load a JSON file from a given path."""
    with open(file_path, "r", encoding="utf-8") as f:
        return json.load(f)


def extract_uri_info(dc_identifier_uri):
    """Extract the central value and suffix from the dc.identifier.uri."""
    uri_match = re.search(r'(\d+\.\d+\.\d+)/(.+)', dc_identifier_uri)
    if uri_match:
        uri_number = uri_match.group(1)
        uri_suffix = uri_match.group(2)
        return uri_number, uri_suffix
    return None, None


def build_custom_name_and_prefix(uri_number, uri_suffix, structured_file_ref):
    """Build the custom name and prefix based on URI information."""
    if uri_number and uri_suffix:
        custom_name = f"pl_{uri_number.replace('.', '-')}-{uri_suffix}"
        custom_prefix = f"https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/{custom_name}/"
    else:
        custom_name = f"pl_{structured_file_ref.replace('.json', '').replace(' ', '_')}"
        custom_prefix = f"https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/{custom_name}/"
    return custom_name, custom_prefix


def create_mappings_section(custom_name, concept_schema_file, concept_schemes_dir, custom_prefix):
    """Create the mappings section for the concept schema as raw text."""
    
    concept_schema_path = os.path.abspath(os.path.join(concept_schemes_dir, concept_schema_file))
    
    raw_mapping = f"""
    {custom_name}_concept_scheme:
        sources:
            - [{concept_schema_path}~jsonpath, "$"]
        subject: "{custom_prefix}"
        predicateobjects:
            - [a, skos:ConceptScheme]
            - [dc:description, $(dc_description), en~lang]
            - [dc:identifier,  $(dc_identifier_uri)]
            - [dc:title, $(dc_title), en~lang]
            - [dc:created, $(dc_date_issued), xsd:date]
            - [dc:creator, $(dc_contributor_author)]


    {custom_name}_concept_scheme_languages:
        sources:
            - [{concept_schema_path}~jsonpath, "$.dc_language_iso[*]"]
        subject: "{custom_prefix}"
        predicateobjects:
            - [dct:language, "http://iso639-3.sil.org/code/$(code)"]
"""
    
    return raw_mapping.strip()



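def create_structured_mappings_section(structured_file_ref, structured_data_file, input_data_dir):
    """Create the mappings section for the structured data as raw text.

    As called from process_data_pairs, `structured_file_ref` holds the
    vocabulary name (custom_name) and `input_data_dir` holds the full path
    of the structured JSON file; `structured_data_file` is not used.
    """
    
    # Absolute path of the structured JSON file referenced by the mapping sources
    structured_data_path = os.path.abspath(input_data_dir)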
    
    raw_mapping = f"""
    {structured_file_ref}_concepts:
        sources:
            - [{structured_data_path}~jsonpath, "$[*]"]
        subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/{structured_file_ref}/$(concept)"
        predicateobjects:
            - [a, skos:Concept]
            - [skos:definition, $(definition), $(definitionLang)~lang]
    
    {structured_file_ref}_prefLabels:
        sources:
            - [{structured_data_path}~jsonpath, "$[*].prefLabels[*]"]
        subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/{structured_file_ref}/$(concept)"
        predicateobjects:
            - [skos:prefLabel, $(value), $(lang)~lang]
    
    {structured_file_ref}_altLabels:
        sources:
            - [{structured_data_path}~jsonpath, "$[*].altLabels[*]"]
        subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/{structured_file_ref}/$(concept)"
        predicateobjects:
            - [skos:altLabel, $(value), $(lang)~lang]
    
    {structured_file_ref}_notes:
        sources:
            - [{structured_data_path}~jsonpath, "$[*].note[*]"]
        subject: "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/{structured_file_ref}/$(concept)"
        predicateobjects:
            - [skos:note, $(value), $(lang)~lang]
"""
    
    return raw_mapping.strip()


def process_data_pairs(data_pairs, concept_schemes_dir, input_data_dir, mapping_files_dir):
    """Main function to iterate over data_pairs and create YARRRML files."""
    yaml = YAML()
    yaml.default_flow_style = False
    yaml.allow_unicode = True

    # Base prefixes in YARRRML style
    base_prefixes = {
        "dc": "http://purl.org/dc/elements/1.1/",
        "dct": "http://purl.org/dc/terms/",
        "iso639-3": "http://iso639-3.sil.org/code/",
        "skos": "http://www.w3.org/2004/02/skos/core#",
        "xsd": "http://www.w3.org/2001/XMLSchema#"
    }

    for idx, data_pair in enumerate(data_pairs):
        print(f"Processing data pair {idx + 1}/{len(data_pairs)}")

        concept_schema_file = data_pair["references"]
        structured_file_ref = data_pair["references"]

        # Load concept schema JSON
        concept_schema_path = os.path.join(concept_schemes_dir, concept_schema_file)
        structured_path = os.path.join(input_data_dir, structured_file_ref)
        if not os.path.exists(concept_schema_path):
            print(f"Concept schema file not found: {concept_schema_path}")
            continue
        concept_schema = load_json(concept_schema_path)

        # Extract URI and build names
        dc_identifier_uri = concept_schema.get("dc.identifier.uri", "")
        #print(f"Processing URI: {dc_identifier_uri}")
        uri_number, uri_suffix = extract_uri_info(dc_identifier_uri)
        custom_name, custom_prefix = build_custom_name_and_prefix(uri_number, uri_suffix, structured_file_ref)

        # Create prefixes and mappings sections
        prefixes = base_prefixes.copy()
        prefixes[custom_name] = custom_prefix

        
        concept_mappings_section = create_mappings_section(custom_name, concept_schema_file, concept_schemes_dir, custom_prefix)
        structured_mappings_section = create_structured_mappings_section(custom_name, structured_file_ref, structured_path)

        # Combine raw mappings sections and convert to YAML
        raw_yarrrml_content = f"""prefixes:
  dc: http://purl.org/dc/elements/1.1/
  dct: http://purl.org/dc/terms/
  iso639-3: http://iso639-3.sil.org/code/
  skos: http://www.w3.org/2004/02/skos/core#
  xsd: http://www.w3.org/2001/XMLSchema#
  {custom_name}: {custom_prefix}

mappings:
    {concept_mappings_section}
    
    {structured_mappings_section}
"""
        # Write raw YAML content to file
        yaml_filename = f"{custom_name}.yaml"
        yaml_file_path = os.path.join(mapping_files_dir, yaml_filename)

        with open(yaml_file_path, "w", encoding="utf-8") as yaml_file:
            yaml_file.write(raw_yarrrml_content)

        print(f"YARRRML mapping file created at: {yaml_file_path}")
        
# Main execution
ensure_directory_exists("mapping_files")
process_data_pairs(data_pairs, "concept_schemes", "input_data", "mapping_files")

Processing data pair 1/20
YARRRML mapping file created at: mapping_files/pl_20_500_11752_OPEN_1021.yaml
Processing data pair 2/20
YARRRML mapping file created at: mapping_files/pl_20_500_11752_OPEN_1017.yaml
Processing data pair 3/20
YARRRML mapping file created at: mapping_files/pl_20_500_11752_OPEN_1014.yaml
Processing data pair 4/20
YARRRML mapping file created at: mapping_files/pl_20_500_11752_OPEN_1022.yaml
Processing data pair 5/20
YARRRML mapping file created at: mapping_files/pl_20_500_11752_OPEN_1008.yaml
Processing data pair 6/20
YARRRML mapping file created at: mapping_files/pl_20_500_11752_OPEN_1015.yaml
Processing data pair 7/20
YARRRML mapping file created at: mapping_files/pl_20_500_11752_OPEN_1019.yaml
Processing data pair 8/20
YARRRML mapping file created at: mapping_files/pl_20_500_11752_OPEN_1006.yaml
Processing data pair 9/20
YARRRML mapping file created at: mapping_files/pl_20_500_11752_OPEN_1025.yaml
Processing data pair 10/20
YARRRML mapping file created at: mapp
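
The naming logic of `extract_uri_info` and `build_custom_name_and_prefix` can be tried in isolation. The sketch below reimplements it in simplified form. The identifier URL is a hypothetical example chosen to exercise the URI branch, and `build_custom_name` is a shortened stand-in for the original helper. Note that the file names in the output above follow the fallback form, which is built from the file name rather than from the identifier.

```python
import re


def extract_uri_info(identifier_uri):
    """Return (prefix, suffix) of a handle-style URI, or (None, None)."""
    match = re.search(r"(\d+\.\d+\.\d+)/(.+)", identifier_uri)
    if match:
        return match.group(1), match.group(2)
    return None, None


def build_custom_name(uri_number, uri_suffix, fallback_ref):
    """Build the vocabulary name from the URI parts, or from the file name."""
    if uri_number and uri_suffix:
        return f"pl_{uri_number.replace('.', '-')}-{uri_suffix}"
    return f"pl_{fallback_ref.replace('.json', '').replace(' ', '_')}"


# Hypothetical identifier, assumed here only to exercise the URI branch
uri = "http://hdl.handle.net/20.500.11752/OPEN-1021"
name_from_uri = build_custom_name(*extract_uri_info(uri), "unused.json")

# An empty identifier triggers the fallback based on the file name
name_from_file = build_custom_name(
    *extract_uri_info(""), "20_500_11752_OPEN_1021.json"
)

print(name_from_uri)
print(name_from_file)
```

The two branches yield differently punctuated names, so it is worth confirming which one is in effect before relying on the resulting vocabulary IRIs.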

### **Creation of RML Files**

In this section, we will generate intermediate RML (RDF Mapping Language) files using the YARRRML-to-RML conversion library called **Yatter**. These files will serve as a bridge between our YARRRML mapping files and the final RDF output, converting the YARRRML mappings into a format that can be processed to produce RDF data.

Yatter converts YARRRML YAML files into RML, which is then used to transform structured data (e.g., CSV, JSON, Excel) into RDF triples. These intermediate RML files are stored in the **rml_files** directory. Each file is named after its corresponding YARRRML file, but with an `.rml` extension, and is serialized in Turtle format (`text/turtle`).

In [15]:
import yatter
from ruamel.yaml import YAML

# Create the folder "rml_files" if it doesn't exist
rml_files_dir = "rml_files"
if not os.path.exists(rml_files_dir):
    os.makedirs(rml_files_dir)

# Directory containing the YARRRML mapping files
mapping_files_dir = "mapping_files"

yaml = YAML(typ='safe', pure=True)

# Loop through each YARRRML file in the mapping_files directory
for filename in os.listdir(mapping_files_dir):
    if filename.endswith(".yaml"):  # Only process YARRRML files
        yarrrml_file_path = os.path.join(mapping_files_dir, filename)

        try:
            # Load the YARRRML file and convert to RML
            with open(yarrrml_file_path, "r", encoding="utf-8") as yarrrml_file:
                yarrrml_content = yaml.load(yarrrml_file)
            
            # Translate YARRRML to RML
            rml_output = yatter.translate(yarrrml_content)
            
            # Replace the RML prefix with the new namespace
            rml_output = rml_output.replace("http://semweb.mmlab.be/ns/rml#", "http://w3id.org/rml/")
            
            # Define the output RML filename
            rml_filename = filename.replace(".yaml", ".rml")
            rml_file_path = os.path.join(rml_files_dir, rml_filename)

            # Write the updated RML content to the output file
            with open(rml_file_path, "w", encoding="utf-8") as rml_file:
                rml_file.write(rml_output)

            print(f"RML file created: {rml_file_path}")

        except Exception as e:
            print(f"Failed to convert {filename} to RML. Error: {e}")
            traceback.print_exc()


2024-10-09 15:43:38,601 | INFO: Translating YARRRML mapping to [R2]RML
2024-10-09 15:43:38,604 | INFO: RML content is created!
2024-10-09 15:43:38,616 | INFO: Mapping has been syntactically validated.
2024-10-09 15:43:38,618 | INFO: Translation has finished successfully.
2024-10-09 15:43:38,644 | INFO: Translating YARRRML mapping to [R2]RML
2024-10-09 15:43:38,647 | INFO: RML content is created!
2024-10-09 15:43:38,658 | INFO: Mapping has been syntactically validated.
2024-10-09 15:43:38,660 | INFO: Translation has finished successfully.
2024-10-09 15:43:38,673 | INFO: Translating YARRRML mapping to [R2]RML
2024-10-09 15:43:38,675 | INFO: RML content is created!
2024-10-09 15:43:38,684 | INFO: Mapping has been syntactically validated.
2024-10-09 15:43:38,686 | INFO: Translation has finished successfully.
2024-10-09 15:43:38,699 | INFO: Translating YARRRML mapping to [R2]RML
2024-10-09 15:43:38,701 | INFO: RML content is created!
2024-10-09 15:43:38,711 | INFO: Mapping has been syntacti

RML file created: rml_files/pl_20_500_11752_OPEN_993.rml
RML file created: rml_files/pl_20_500_11752_OPEN_1019.rml
RML file created: rml_files/pl_20_500_11752_OPEN_1020.rml
RML file created: rml_files/pl_20_500_11752_OPEN_1018.rml
RML file created: rml_files/pl_20_500_11752_OPEN_1006.rml
RML file created: rml_files/pl_20_500_11752_OPEN_1025.rml


2024-10-09 15:43:38,835 | INFO: Mapping has been syntactically validated.
2024-10-09 15:43:38,837 | INFO: Translation has finished successfully.
2024-10-09 15:43:38,855 | INFO: Translating YARRRML mapping to [R2]RML
2024-10-09 15:43:38,857 | INFO: RML content is created!
2024-10-09 15:43:38,866 | INFO: Mapping has been syntactically validated.
2024-10-09 15:43:38,867 | INFO: Translation has finished successfully.
2024-10-09 15:43:38,881 | INFO: Translating YARRRML mapping to [R2]RML
2024-10-09 15:43:38,883 | INFO: RML content is created!
2024-10-09 15:43:38,893 | INFO: Mapping has been syntactically validated.
2024-10-09 15:43:38,894 | INFO: Translation has finished successfully.
2024-10-09 15:43:38,907 | INFO: Translating YARRRML mapping to [R2]RML
2024-10-09 15:43:38,909 | INFO: RML content is created!
2024-10-09 15:43:38,918 | INFO: Mapping has been syntactically validated.
2024-10-09 15:43:38,919 | INFO: Translation has finished successfully.
2024-10-09 15:43:38,932 | INFO: Transla

RML file created: rml_files/pl_20_500_11752_OPEN_1024.rml
RML file created: rml_files/pl_20_500_11752_OPEN_1014.rml
RML file created: rml_files/pl_20_500_11752_OPEN_1021.rml
RML file created: rml_files/pl_20_500_11752_OPEN_995.rml
RML file created: rml_files/pl_20_500_11752_OPEN_1027.rml
RML file created: rml_files/pl_20_500_11752_OPEN_975.rml
RML file created: rml_files/pl_20_500_11752_OPEN_1008.rml


2024-10-09 15:43:39,051 | INFO: Mapping has been syntactically validated.
2024-10-09 15:43:39,052 | INFO: Translation has finished successfully.
2024-10-09 15:43:39,063 | INFO: Translating YARRRML mapping to [R2]RML
2024-10-09 15:43:39,065 | INFO: RML content is created!
2024-10-09 15:43:39,075 | INFO: Mapping has been syntactically validated.
2024-10-09 15:43:39,077 | INFO: Translation has finished successfully.
2024-10-09 15:43:39,089 | INFO: Translating YARRRML mapping to [R2]RML
2024-10-09 15:43:39,091 | INFO: RML content is created!
2024-10-09 15:43:39,100 | INFO: Mapping has been syntactically validated.
2024-10-09 15:43:39,102 | INFO: Translation has finished successfully.
2024-10-09 15:43:39,115 | INFO: Translating YARRRML mapping to [R2]RML
2024-10-09 15:43:39,117 | INFO: RML content is created!
2024-10-09 15:43:39,127 | INFO: Mapping has been syntactically validated.
2024-10-09 15:43:39,129 | INFO: Translation has finished successfully.
2024-10-09 15:43:39,142 | INFO: Transla

RML file created: rml_files/pl_20_500_11752_OPEN_1016.rml
RML file created: rml_files/pl_20_500_11752_OPEN_987.rml
RML file created: rml_files/pl_20_500_11752_OPEN_994.rml
RML file created: rml_files/pl_20_500_11752_OPEN_1017.rml
RML file created: rml_files/pl_20_500_11752_OPEN_1022.rml
RML file created: rml_files/pl_20_500_11752_OPEN_1015.rml
RML file created: rml_files/pl_20_500_11752_OPEN_1026.rml
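
Because the conversion loop catches exceptions and continues, a failed translation is easy to miss in a long log. As a minimal sketch, the following check lists the YARRRML files that have no `.rml` counterpart. The function name `missing_rml_files` and the file names in the demonstration are hypothetical.

```python
import os
import tempfile


def missing_rml_files(mapping_files_dir, rml_files_dir):
    """List YARRRML files that have no corresponding .rml output."""
    missing = []
    for filename in sorted(os.listdir(mapping_files_dir)):
        if filename.endswith(".yaml"):
            rml_name = filename[: -len(".yaml")] + ".rml"
            if not os.path.exists(os.path.join(rml_files_dir, rml_name)):
                missing.append(filename)
    return missing


# Self-contained demonstration on temporary directories (hypothetical file names)
with tempfile.TemporaryDirectory() as root:
    mappings = os.path.join(root, "mapping_files")
    rml = os.path.join(root, "rml_files")
    os.makedirs(mappings)
    os.makedirs(rml)
    for name in ("pl_a.yaml", "pl_b.yaml"):
        open(os.path.join(mappings, name), "w").close()
    open(os.path.join(rml, "pl_a.rml"), "w").close()

    not_converted = missing_rml_files(mappings, rml)

print(not_converted)
```

In the notebook, the same function would be called with `"mapping_files"` and `"rml_files"`.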


### **Creation of RDF Files**

In this section, we explain how to generate RDF files from the RML mapping files using the `morph-kgc` library. `morph-kgc` is a Python package that materializes RDF from heterogeneous data sources, including relational databases and files such as CSV and JSON, according to R2RML, RML, or RML-star mappings. It automates the conversion of structured data into RDF triples, which we then serialize with RDFLib into a format such as Turtle, N-Triples, or RDF/XML.


**Step 1: Create the Configuration File for `morph-kgc`**

Before executing the mapping, we need to create a configuration file that `morph-kgc` will use to understand where the data and mappings are located and how to produce the RDF output.

Here’s how the configuration file, typically named `config.ini`, should be structured:

```ini
[DEFAULT]
# Serialization written by morph-kgc itself (N-TRIPLES or N-QUADS)
output_format = N-TRIPLES

# File in which the generated RDF triples will be saved
output_file = path/to/output.nt

# One section per data source. For JSON/CSV sources the path to the data
# is declared inside the mapping file, so only `mappings` is needed here.
# For relational databases, the section also carries the connection details.
[DataSource1]
mappings = path/to/mapping_file1.rml.ttl

[DataSourceN]
mappings = path/to/mapping_fileN.rml.ttl

```

In this file:
- **`mappings`**: This is the path to the RML mapping file that specifies how to transform the data into RDF.
- **`output_format`**: This specifies the format in which `morph-kgc` writes the triples (`N-TRIPLES` or `N-QUADS`). Turtle output is obtained afterwards by serializing the resulting graph with RDFLib, as done in the code below.
- **`output_file`**: This is the file where the RDF triples will be saved.
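
A configuration of this shape can also be assembled with Python's standard `configparser` module, which avoids hand-formatting the INI text. The following is a minimal sketch that mirrors the options used in the code cell of this section (`output_format`, `output_file`, `mappings`). The function name `build_config_string` and the paths are hypothetical.

```python
import configparser
import io


def build_config_string(source_name, mapping_path, output_file,
                        output_format="N-TRIPLES"):
    """Render an INI configuration with one data source section."""
    config = configparser.ConfigParser()
    config["DEFAULT"] = {
        "output_format": output_format,
        "output_file": output_file,
    }
    config[source_name] = {"mappings": mapping_path}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Hypothetical paths, for illustration only
config_string = build_config_string(
    "DataSource1", "rml_files/example.rml", "rdf_files/example.nt"
)
print(config_string)

# Reading the string back confirms that it is well-formed INI
parsed = configparser.ConfigParser()
parsed.read_string(config_string)
print(parsed["DataSource1"]["mappings"])
```

The resulting string has the same structure as the configuration that the cell below passes to `morph_kgc.materialize`.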

In [16]:
import morph_kgc
from rdflib import Namespace

# Define the default prefixes
DEFAULT_PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dct": "http://purl.org/dc/terms/",
    "iso639-3": "http://iso639-3.sil.org/code/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
}

def add_prefixes_to_graph(graph, mapping_name):
    """Add default and dynamic prefixes to the RDFLib graph."""
    # Add the default prefixes
    for prefix, namespace in DEFAULT_PREFIXES.items():
        graph.bind(prefix, Namespace(namespace))
    
    # Add the dynamic prefix based on the file name
    dynamic_prefix = f"https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/{mapping_name}/"
    graph.bind(mapping_name, Namespace(dynamic_prefix))
    print(f"Added dynamic prefix: {mapping_name} -> {dynamic_prefix}")

def create_and_process_config_string(output_dir, mapping_files_dir):
    # Ensure output directory exists
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # Iterate over each mapping file in the mapping_files_dir
    for mapping_file in os.listdir(mapping_files_dir):
        if mapping_file.endswith(".rml") or mapping_file.endswith(".rml.ttl"):
            # Define output file name
            mapping_name = os.path.splitext(mapping_file)[0]
            mapping_path = os.path.abspath(os.path.join(mapping_files_dir, mapping_file))
            output_file = f"{mapping_name}.ttl"
            
            # Build the configuration string
            config_string = f"""
            [DEFAULT]
            output_format = N-TRIPLES
            output_file = {os.path.join(os.path.abspath(output_dir), output_file)}

            [{mapping_name}]
            mappings = {mapping_path}
            """

            print(f"Processing mapping file: {mapping_file}")
            try:
                # Use morph-kgc as a library to generate RDF triples using config string
                graph = morph_kgc.materialize(config_string)
                
                # Add default and dynamic prefixes to the graph
                add_prefixes_to_graph(graph, mapping_name)

                # Serialize the RDFLib graph to a Turtle file
                output_ttl_path = os.path.join(output_dir, output_file)
                graph.serialize(destination=output_ttl_path, format='turtle')
                print(f"RDF file generated successfully: {output_ttl_path}")

            except Exception as e:
                print(f"Error processing the RML file {mapping_file}: {str(e)}")

# Define the directories
output_directory = "rdf_files"
rml_files_directory = "rml_files"

# Check if mapping files directory exists and create config files for each RML file
if not os.path.exists(rml_files_directory):
    print(f"Mapping files directory '{rml_files_directory}' does not exist.")
else:
    # Create the config files and process each RML file
    create_and_process_config_string(output_directory, rml_files_directory)


2024-10-09 15:43:55,205 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1022.ttl', 'na_values': ',nan', 'safe_percent_encoding': '', 'read_parsed_mappings_path': '', 'write_parsed_mappings_path': '', 'mapping_partitioning': 'PARTIAL-AGGREGATIONS', 'logging_file': '', 'oracle_client_lib_dir': '', 'oracle_client_config_dir': '', 'udfs': '', 'output_kafka_server': '', 'output_kafka_topic': '', 'output_dir': '', 'only_printable_chars': 'no', 'infer_sql_datatypes': 'no', 'logging_level': 'INFO', 'number_of_processes': '16'}
2024-10-09 15:43:55,209 | DEBUG: DATA SOURCE `pl_20_500_11752_OPEN_1022`: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1022.ttl', 'mappings': '/home/jovyan/work/REALITER/rml_files/pl_20_500_11752_OPEN_1022.rml'}


Processing mapping file: pl_20_500_11752_OPEN_1022.rml


2024-10-09 15:43:56,002 | INFO: 12 mapping rules retrieved.
2024-10-09 15:43:56,011 | DEBUG: All predicate maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:43:56,017 | DEBUG: All graph maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:43:56,022 | INFO: Mapping partition with 11 groups generated.
2024-10-09 15:43:56,024 | INFO: Maximum number of rules within mapping group: 2.
2024-10-09 15:43:56,027 | INFO: Mappings processed in 0.807 seconds.
2024-10-09 15:43:56,032 | DEBUG: Parallelizing with 16 cores.
2024-10-09 15:43:56,645 | INFO: Number of triples generated in total: 9560.


Added dynamic prefix: pl_20_500_11752_OPEN_1022 -> https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/pl_20_500_11752_OPEN_1022/


2024-10-09 15:43:57,881 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1018.ttl', 'na_values': ',nan', 'safe_percent_encoding': '', 'read_parsed_mappings_path': '', 'write_parsed_mappings_path': '', 'mapping_partitioning': 'PARTIAL-AGGREGATIONS', 'logging_file': '', 'oracle_client_lib_dir': '', 'oracle_client_config_dir': '', 'udfs': '', 'output_kafka_server': '', 'output_kafka_topic': '', 'output_dir': '', 'only_printable_chars': 'no', 'infer_sql_datatypes': 'no', 'logging_level': 'INFO', 'number_of_processes': '16'}
2024-10-09 15:43:57,883 | DEBUG: DATA SOURCE `pl_20_500_11752_OPEN_1018`: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1018.ttl', 'mappings': '/home/jovyan/work/REALITER/rml_files/pl_20_500_11752_OPEN_1018.rml'}


RDF file generated successfully: rdf_files/pl_20_500_11752_OPEN_1022.ttl
Processing mapping file: pl_20_500_11752_OPEN_1018.rml


2024-10-09 15:43:58,621 | INFO: 12 mapping rules retrieved.
2024-10-09 15:43:58,629 | DEBUG: All predicate maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:43:58,634 | DEBUG: All graph maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:43:58,639 | INFO: Mapping partition with 11 groups generated.
2024-10-09 15:43:58,642 | INFO: Maximum number of rules within mapping group: 2.
2024-10-09 15:43:58,643 | INFO: Mappings processed in 0.756 seconds.
2024-10-09 15:43:58,647 | DEBUG: Parallelizing with 16 cores.
2024-10-09 15:43:59,056 | INFO: Number of triples generated in total: 1768.
2024-10-09 15:43:59,192 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1017.ttl', 'na_values': ',nan', 'safe_percent_encoding': '', 'read_parsed_mappings_path': '', 'write_parsed_mappings_path': '', 'mapping_partitioning': 'PARTIAL-AGGREGATIONS', 'logging_file': '', 'oracle_client_li

Added dynamic prefix: pl_20_500_11752_OPEN_1018 -> https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/pl_20_500_11752_OPEN_1018/
RDF file generated successfully: rdf_files/pl_20_500_11752_OPEN_1018.ttl
Processing mapping file: pl_20_500_11752_OPEN_1017.rml


2024-10-09 15:44:00,130 | INFO: 12 mapping rules retrieved.
2024-10-09 15:44:00,138 | DEBUG: All predicate maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:00,143 | DEBUG: All graph maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:00,147 | INFO: Mapping partition with 11 groups generated.
2024-10-09 15:44:00,149 | INFO: Maximum number of rules within mapping group: 2.
2024-10-09 15:44:00,151 | INFO: Mappings processed in 0.952 seconds.
2024-10-09 15:44:00,157 | DEBUG: Parallelizing with 16 cores.
2024-10-09 15:44:00,648 | INFO: Number of triples generated in total: 6452.


Added dynamic prefix: pl_20_500_11752_OPEN_1017 -> https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/pl_20_500_11752_OPEN_1017/
RDF file generated successfully: rdf_files/pl_20_500_11752_OPEN_1017.ttl
Processing mapping file: pl_20_500_11752_OPEN_1014.rml


2024-10-09 15:44:01,129 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1014.ttl', 'na_values': ',nan', 'safe_percent_encoding': '', 'read_parsed_mappings_path': '', 'write_parsed_mappings_path': '', 'mapping_partitioning': 'PARTIAL-AGGREGATIONS', 'logging_file': '', 'oracle_client_lib_dir': '', 'oracle_client_config_dir': '', 'udfs': '', 'output_kafka_server': '', 'output_kafka_topic': '', 'output_dir': '', 'only_printable_chars': 'no', 'infer_sql_datatypes': 'no', 'logging_level': 'INFO', 'number_of_processes': '16'}
2024-10-09 15:44:01,131 | DEBUG: DATA SOURCE `pl_20_500_11752_OPEN_1014`: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1014.ttl', 'mappings': '/home/jovyan/work/REALITER/rml_files/pl_20_500_11752_OPEN_1014.rml'}
2024-10-09 15:44:02,022 | INFO: 12 mapping rules retrieved.
2024-10-09 15:44:02,029 | DEBUG: All predicate maps are consta

Added dynamic prefix: pl_20_500_11752_OPEN_1014 -> https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/pl_20_500_11752_OPEN_1014/
RDF file generated successfully: rdf_files/pl_20_500_11752_OPEN_1014.ttl
Processing mapping file: pl_20_500_11752_OPEN_1021.rml


2024-10-09 15:44:03,501 | INFO: 12 mapping rules retrieved.
2024-10-09 15:44:03,511 | DEBUG: All predicate maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:03,517 | DEBUG: All graph maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:03,520 | INFO: Mapping partition with 11 groups generated.
2024-10-09 15:44:03,524 | INFO: Maximum number of rules within mapping group: 2.
2024-10-09 15:44:03,525 | INFO: Mappings processed in 0.836 seconds.
2024-10-09 15:44:03,528 | DEBUG: Parallelizing with 16 cores.
2024-10-09 15:44:04,226 | INFO: Number of triples generated in total: 4337.
2024-10-09 15:44:04,542 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1025.ttl', 'na_values': ',nan', 'safe_percent_encoding': '', 'read_parsed_mappings_path': '', 'write_parsed_mappings_path': '', 'mapping_partitioning': 'PARTIAL-AGGREGATIONS', 'logging_file': '', 'oracle_client_li

Added dynamic prefix: pl_20_500_11752_OPEN_1021 -> https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/pl_20_500_11752_OPEN_1021/
RDF file generated successfully: rdf_files/pl_20_500_11752_OPEN_1021.ttl
Processing mapping file: pl_20_500_11752_OPEN_1025.rml


2024-10-09 15:44:05,480 | INFO: 12 mapping rules retrieved.
2024-10-09 15:44:05,488 | DEBUG: All predicate maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:05,493 | DEBUG: All graph maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:05,497 | INFO: Mapping partition with 11 groups generated.
2024-10-09 15:44:05,500 | INFO: Maximum number of rules within mapping group: 2.
2024-10-09 15:44:05,501 | INFO: Mappings processed in 0.953 seconds.
2024-10-09 15:44:05,506 | DEBUG: Parallelizing with 16 cores.
2024-10-09 15:44:06,139 | INFO: Number of triples generated in total: 5768.


Added dynamic prefix: pl_20_500_11752_OPEN_1025 -> https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/pl_20_500_11752_OPEN_1025/


2024-10-09 15:44:06,589 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_975.ttl', 'na_values': ',nan', 'safe_percent_encoding': '', 'read_parsed_mappings_path': '', 'write_parsed_mappings_path': '', 'mapping_partitioning': 'PARTIAL-AGGREGATIONS', 'logging_file': '', 'oracle_client_lib_dir': '', 'oracle_client_config_dir': '', 'udfs': '', 'output_kafka_server': '', 'output_kafka_topic': '', 'output_dir': '', 'only_printable_chars': 'no', 'infer_sql_datatypes': 'no', 'logging_level': 'INFO', 'number_of_processes': '16'}
2024-10-09 15:44:06,591 | DEBUG: DATA SOURCE `pl_20_500_11752_OPEN_975`: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_975.ttl', 'mappings': '/home/jovyan/work/REALITER/rml_files/pl_20_500_11752_OPEN_975.rml'}


RDF file generated successfully: rdf_files/pl_20_500_11752_OPEN_1025.ttl
Processing mapping file: pl_20_500_11752_OPEN_975.rml


2024-10-09 15:44:07,380 | INFO: 12 mapping rules retrieved.
2024-10-09 15:44:07,389 | DEBUG: All predicate maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:07,397 | DEBUG: All graph maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:07,401 | INFO: Mapping partition with 11 groups generated.
2024-10-09 15:44:07,403 | INFO: Maximum number of rules within mapping group: 2.
2024-10-09 15:44:07,405 | INFO: Mappings processed in 0.808 seconds.
2024-10-09 15:44:07,409 | DEBUG: Parallelizing with 16 cores.
2024-10-09 15:44:07,755 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1016.ttl', 'na_values': ',nan', 'safe_percent_encoding': '', 'read_parsed_mappings_path': '', 'write_parsed_mappings_path': '', 'mapping_partitioning': 'PARTIAL-AGGREGATIONS', 'logging_file': '', 'oracle_client_lib_dir': '', 'oracle_client_config_dir': '', 'udfs': '', 'output_kafka_server

Error processing the RML file pl_20_500_11752_OPEN_975.rml: [Errno 2] No such file or directory: '/home/jovyan/work/REALITER/input_data/20_500_11752_OPEN_975.json'
Processing mapping file: pl_20_500_11752_OPEN_1016.rml


2024-10-09 15:44:08,873 | INFO: 12 mapping rules retrieved.
2024-10-09 15:44:08,881 | DEBUG: All predicate maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:08,888 | DEBUG: All graph maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:08,893 | INFO: Mapping partition with 11 groups generated.
2024-10-09 15:44:08,895 | INFO: Maximum number of rules within mapping group: 2.
2024-10-09 15:44:08,896 | INFO: Mappings processed in 1.127 seconds.
2024-10-09 15:44:08,900 | DEBUG: Parallelizing with 16 cores.
2024-10-09 15:44:09,411 | INFO: Number of triples generated in total: 6844.


Added dynamic prefix: pl_20_500_11752_OPEN_1016 -> https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/pl_20_500_11752_OPEN_1016/
RDF file generated successfully: rdf_files/pl_20_500_11752_OPEN_1016.ttl
Processing mapping file: pl_20_500_11752_OPEN_1026.rml


2024-10-09 15:44:09,919 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1026.ttl', 'na_values': ',nan', 'safe_percent_encoding': '', 'read_parsed_mappings_path': '', 'write_parsed_mappings_path': '', 'mapping_partitioning': 'PARTIAL-AGGREGATIONS', 'logging_file': '', 'oracle_client_lib_dir': '', 'oracle_client_config_dir': '', 'udfs': '', 'output_kafka_server': '', 'output_kafka_topic': '', 'output_dir': '', 'only_printable_chars': 'no', 'infer_sql_datatypes': 'no', 'logging_level': 'INFO', 'number_of_processes': '16'}
2024-10-09 15:44:09,921 | DEBUG: DATA SOURCE `pl_20_500_11752_OPEN_1026`: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1026.ttl', 'mappings': '/home/jovyan/work/REALITER/rml_files/pl_20_500_11752_OPEN_1026.rml'}
2024-10-09 15:44:10,909 | INFO: 12 mapping rules retrieved.
2024-10-09 15:44:10,931 | DEBUG: All predicate maps are consta

Added dynamic prefix: pl_20_500_11752_OPEN_1026 -> https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/pl_20_500_11752_OPEN_1026/


2024-10-09 15:44:12,042 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1006.ttl', 'na_values': ',nan', 'safe_percent_encoding': '', 'read_parsed_mappings_path': '', 'write_parsed_mappings_path': '', 'mapping_partitioning': 'PARTIAL-AGGREGATIONS', 'logging_file': '', 'oracle_client_lib_dir': '', 'oracle_client_config_dir': '', 'udfs': '', 'output_kafka_server': '', 'output_kafka_topic': '', 'output_dir': '', 'only_printable_chars': 'no', 'infer_sql_datatypes': 'no', 'logging_level': 'INFO', 'number_of_processes': '16'}
2024-10-09 15:44:12,045 | DEBUG: DATA SOURCE `pl_20_500_11752_OPEN_1006`: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1006.ttl', 'mappings': '/home/jovyan/work/REALITER/rml_files/pl_20_500_11752_OPEN_1006.rml'}


RDF file generated successfully: rdf_files/pl_20_500_11752_OPEN_1026.ttl
Processing mapping file: pl_20_500_11752_OPEN_1006.rml


2024-10-09 15:44:12,969 | INFO: 12 mapping rules retrieved.
2024-10-09 15:44:12,977 | DEBUG: All predicate maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:12,982 | DEBUG: All graph maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:12,987 | INFO: Mapping partition with 11 groups generated.
2024-10-09 15:44:12,990 | INFO: Maximum number of rules within mapping group: 2.
2024-10-09 15:44:12,993 | INFO: Mappings processed in 0.942 seconds.
2024-10-09 15:44:12,997 | DEBUG: Parallelizing with 16 cores.
2024-10-09 15:44:13,319 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_995.ttl', 'na_values': ',nan', 'safe_percent_encoding': '', 'read_parsed_mappings_path': '', 'write_parsed_mappings_path': '', 'mapping_partitioning': 'PARTIAL-AGGREGATIONS', 'logging_file': '', 'oracle_client_lib_dir': '', 'oracle_client_config_dir': '', 'udfs': '', 'output_kafka_server'

Error processing the RML file pl_20_500_11752_OPEN_1006.rml: [Errno 2] No such file or directory: '/home/jovyan/work/REALITER/input_data/20_500_11752_OPEN_1006.json'
Processing mapping file: pl_20_500_11752_OPEN_995.rml


2024-10-09 15:44:14,123 | INFO: 12 mapping rules retrieved.
2024-10-09 15:44:14,132 | DEBUG: All predicate maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:14,138 | DEBUG: All graph maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:14,143 | INFO: Mapping partition with 11 groups generated.
2024-10-09 15:44:14,145 | INFO: Maximum number of rules within mapping group: 2.
2024-10-09 15:44:14,147 | INFO: Mappings processed in 0.812 seconds.
2024-10-09 15:44:14,151 | DEBUG: Parallelizing with 16 cores.
2024-10-09 15:44:14,495 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1008.ttl', 'na_values': ',nan', 'safe_percent_encoding': '', 'read_parsed_mappings_path': '', 'write_parsed_mappings_path': '', 'mapping_partitioning': 'PARTIAL-AGGREGATIONS', 'logging_file': '', 'oracle_client_lib_dir': '', 'oracle_client_config_dir': '', 'udfs': '', 'output_kafka_server

Error processing the RML file pl_20_500_11752_OPEN_995.rml: [Errno 2] No such file or directory: '/home/jovyan/work/REALITER/input_data/20_500_11752_OPEN_995.json'
Processing mapping file: pl_20_500_11752_OPEN_1008.rml


2024-10-09 15:44:15,290 | INFO: 12 mapping rules retrieved.
2024-10-09 15:44:15,298 | DEBUG: All predicate maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:15,305 | DEBUG: All graph maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:15,310 | INFO: Mapping partition with 11 groups generated.
2024-10-09 15:44:15,312 | INFO: Maximum number of rules within mapping group: 2.
2024-10-09 15:44:15,314 | INFO: Mappings processed in 0.810 seconds.
2024-10-09 15:44:15,318 | DEBUG: Parallelizing with 16 cores.
2024-10-09 15:44:15,640 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_994.ttl', 'na_values': ',nan', 'safe_percent_encoding': '', 'read_parsed_mappings_path': '', 'write_parsed_mappings_path': '', 'mapping_partitioning': 'PARTIAL-AGGREGATIONS', 'logging_file': '', 'oracle_client_lib_dir': '', 'oracle_client_config_dir': '', 'udfs': '', 'output_kafka_server'

Error processing the RML file pl_20_500_11752_OPEN_1008.rml: [Errno 2] No such file or directory: '/home/jovyan/work/REALITER/input_data/20_500_11752_OPEN_1008.json'
Processing mapping file: pl_20_500_11752_OPEN_994.rml


2024-10-09 15:44:16,433 | INFO: 12 mapping rules retrieved.
2024-10-09 15:44:16,441 | DEBUG: All predicate maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:16,448 | DEBUG: All graph maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:16,453 | INFO: Mapping partition with 11 groups generated.
2024-10-09 15:44:16,456 | INFO: Maximum number of rules within mapping group: 2.
2024-10-09 15:44:16,458 | INFO: Mappings processed in 0.808 seconds.
2024-10-09 15:44:16,461 | DEBUG: Parallelizing with 16 cores.
2024-10-09 15:44:16,799 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_987.ttl', 'na_values': ',nan', 'safe_percent_encoding': '', 'read_parsed_mappings_path': '', 'write_parsed_mappings_path': '', 'mapping_partitioning': 'PARTIAL-AGGREGATIONS', 'logging_file': '', 'oracle_client_lib_dir': '', 'oracle_client_config_dir': '', 'udfs': '', 'output_kafka_server'

Error processing the RML file pl_20_500_11752_OPEN_994.rml: [Errno 2] No such file or directory: '/home/jovyan/work/REALITER/input_data/20_500_11752_OPEN_994.json'
Processing mapping file: pl_20_500_11752_OPEN_987.rml


2024-10-09 15:44:17,636 | INFO: 12 mapping rules retrieved.
2024-10-09 15:44:17,645 | DEBUG: All predicate maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:17,651 | DEBUG: All graph maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:17,656 | INFO: Mapping partition with 11 groups generated.
2024-10-09 15:44:17,658 | INFO: Maximum number of rules within mapping group: 2.
2024-10-09 15:44:17,660 | INFO: Mappings processed in 0.843 seconds.
2024-10-09 15:44:17,664 | DEBUG: Parallelizing with 16 cores.
2024-10-09 15:44:17,997 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1027.ttl', 'na_values': ',nan', 'safe_percent_encoding': '', 'read_parsed_mappings_path': '', 'write_parsed_mappings_path': '', 'mapping_partitioning': 'PARTIAL-AGGREGATIONS', 'logging_file': '', 'oracle_client_lib_dir': '', 'oracle_client_config_dir': '', 'udfs': '', 'output_kafka_server

Error processing the RML file pl_20_500_11752_OPEN_987.rml: [Errno 2] No such file or directory: '/home/jovyan/work/REALITER/input_data/20_500_11752_OPEN_987.json'
Processing mapping file: pl_20_500_11752_OPEN_1027.rml


2024-10-09 15:44:18,757 | INFO: 12 mapping rules retrieved.
2024-10-09 15:44:18,765 | DEBUG: All predicate maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:18,771 | DEBUG: All graph maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:18,776 | INFO: Mapping partition with 11 groups generated.
2024-10-09 15:44:18,779 | INFO: Maximum number of rules within mapping group: 2.
2024-10-09 15:44:18,781 | INFO: Mappings processed in 0.775 seconds.
2024-10-09 15:44:18,784 | DEBUG: Parallelizing with 16 cores.
2024-10-09 15:44:19,260 | INFO: Number of triples generated in total: 5535.
2024-10-09 15:44:20,122 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1020.ttl', 'na_values': ',nan', 'safe_percent_encoding': '', 'read_parsed_mappings_path': '', 'write_parsed_mappings_path': '', 'mapping_partitioning': 'PARTIAL-AGGREGATIONS', 'logging_file': '', 'oracle_client_li

Added dynamic prefix: pl_20_500_11752_OPEN_1027 -> https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/pl_20_500_11752_OPEN_1027/
RDF file generated successfully: rdf_files/pl_20_500_11752_OPEN_1027.ttl
Processing mapping file: pl_20_500_11752_OPEN_1020.rml


2024-10-09 15:44:20,124 | DEBUG: DATA SOURCE `pl_20_500_11752_OPEN_1020`: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1020.ttl', 'mappings': '/home/jovyan/work/REALITER/rml_files/pl_20_500_11752_OPEN_1020.rml'}
2024-10-09 15:44:20,864 | INFO: 12 mapping rules retrieved.
2024-10-09 15:44:20,871 | DEBUG: All predicate maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:20,877 | DEBUG: All graph maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:20,881 | INFO: Mapping partition with 11 groups generated.
2024-10-09 15:44:20,884 | INFO: Maximum number of rules within mapping group: 2.
2024-10-09 15:44:20,885 | INFO: Mappings processed in 0.757 seconds.
2024-10-09 15:44:20,888 | DEBUG: Parallelizing with 16 cores.
2024-10-09 15:44:21,317 | INFO: Number of triples generated in total: 2594.
2024-10-09 15:44:21,544 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': 

Added dynamic prefix: pl_20_500_11752_OPEN_1020 -> https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/pl_20_500_11752_OPEN_1020/
RDF file generated successfully: rdf_files/pl_20_500_11752_OPEN_1020.ttl
Processing mapping file: pl_20_500_11752_OPEN_1019.rml


2024-10-09 15:44:22,400 | INFO: 12 mapping rules retrieved.
2024-10-09 15:44:22,412 | DEBUG: All predicate maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:22,419 | DEBUG: All graph maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:22,423 | INFO: Mapping partition with 11 groups generated.
2024-10-09 15:44:22,425 | INFO: Maximum number of rules within mapping group: 2.
2024-10-09 15:44:22,427 | INFO: Mappings processed in 0.876 seconds.
2024-10-09 15:44:22,430 | DEBUG: Parallelizing with 16 cores.
2024-10-09 15:44:22,942 | INFO: Number of triples generated in total: 10952.


Added dynamic prefix: pl_20_500_11752_OPEN_1019 -> https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/pl_20_500_11752_OPEN_1019/


2024-10-09 15:44:23,991 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1015.ttl', 'na_values': ',nan', 'safe_percent_encoding': '', 'read_parsed_mappings_path': '', 'write_parsed_mappings_path': '', 'mapping_partitioning': 'PARTIAL-AGGREGATIONS', 'logging_file': '', 'oracle_client_lib_dir': '', 'oracle_client_config_dir': '', 'udfs': '', 'output_kafka_server': '', 'output_kafka_topic': '', 'output_dir': '', 'only_printable_chars': 'no', 'infer_sql_datatypes': 'no', 'logging_level': 'INFO', 'number_of_processes': '16'}
2024-10-09 15:44:23,993 | DEBUG: DATA SOURCE `pl_20_500_11752_OPEN_1015`: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1015.ttl', 'mappings': '/home/jovyan/work/REALITER/rml_files/pl_20_500_11752_OPEN_1015.rml'}


RDF file generated successfully: rdf_files/pl_20_500_11752_OPEN_1019.ttl
Processing mapping file: pl_20_500_11752_OPEN_1015.rml


2024-10-09 15:44:24,891 | INFO: 12 mapping rules retrieved.
2024-10-09 15:44:24,899 | DEBUG: All predicate maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:24,905 | DEBUG: All graph maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:24,909 | INFO: Mapping partition with 11 groups generated.
2024-10-09 15:44:24,912 | INFO: Maximum number of rules within mapping group: 2.
2024-10-09 15:44:24,915 | INFO: Mappings processed in 0.917 seconds.
2024-10-09 15:44:24,918 | DEBUG: Parallelizing with 16 cores.
2024-10-09 15:44:25,420 | INFO: Number of triples generated in total: 4897.
2024-10-09 15:44:25,814 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_1024.ttl', 'na_values': ',nan', 'safe_percent_encoding': '', 'read_parsed_mappings_path': '', 'write_parsed_mappings_path': '', 'mapping_partitioning': 'PARTIAL-AGGREGATIONS', 'logging_file': '', 'oracle_client_li

Added dynamic prefix: pl_20_500_11752_OPEN_1015 -> https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/pl_20_500_11752_OPEN_1015/
RDF file generated successfully: rdf_files/pl_20_500_11752_OPEN_1015.ttl
Processing mapping file: pl_20_500_11752_OPEN_1024.rml


2024-10-09 15:44:26,594 | INFO: 12 mapping rules retrieved.
2024-10-09 15:44:26,603 | DEBUG: All predicate maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:26,609 | DEBUG: All graph maps are constant-valued, invariant subset is not enforced.
2024-10-09 15:44:26,613 | INFO: Mapping partition with 11 groups generated.
2024-10-09 15:44:26,615 | INFO: Maximum number of rules within mapping group: 2.
2024-10-09 15:44:26,616 | INFO: Mappings processed in 0.796 seconds.
2024-10-09 15:44:26,619 | DEBUG: Parallelizing with 16 cores.
2024-10-09 15:44:27,077 | INFO: Number of triples generated in total: 5371.


Added dynamic prefix: pl_20_500_11752_OPEN_1024 -> https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/pl_20_500_11752_OPEN_1024/
RDF file generated successfully: rdf_files/pl_20_500_11752_OPEN_1024.ttl
Processing mapping file: pl_20_500_11752_OPEN_993.rml


2024-10-09 15:44:27,723 | DEBUG: CONFIGURATION: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_993.ttl', 'na_values': ',nan', 'safe_percent_encoding': '', 'read_parsed_mappings_path': '', 'write_parsed_mappings_path': '', 'mapping_partitioning': 'PARTIAL-AGGREGATIONS', 'logging_file': '', 'oracle_client_lib_dir': '', 'oracle_client_config_dir': '', 'udfs': '', 'output_kafka_server': '', 'output_kafka_topic': '', 'output_dir': '', 'only_printable_chars': 'no', 'infer_sql_datatypes': 'no', 'logging_level': 'INFO', 'number_of_processes': '16'}
2024-10-09 15:44:27,726 | DEBUG: DATA SOURCE `pl_20_500_11752_OPEN_993`: {'output_format': 'N-TRIPLES', 'output_file': '/home/jovyan/work/REALITER/rdf_files/pl_20_500_11752_OPEN_993.ttl', 'mappings': '/home/jovyan/work/REALITER/rml_files/pl_20_500_11752_OPEN_993.rml'}
2024-10-09 15:44:28,473 | INFO: 12 mapping rules retrieved.
2024-10-09 15:44:28,481 | DEBUG: All predicate maps are constant-v

Error processing the RML file pl_20_500_11752_OPEN_993.rml: [Errno 2] No such file or directory: '/home/jovyan/work/REALITER/input_data/20_500_11752_OPEN_993.json'
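
Several mappings in the log above fail with `[Errno 2] No such file or directory` because the JSON input they expect is not present. The following is a minimal sketch of a pre-flight check that can be run before the mapping loop. It is not part of the original notebook: the helper names are hypothetical, and the naming convention (an RML file `pl_<id>.rml` reading `input_data/<id>.json`) is an assumption inferred from the paths in the log. The demo builds a temporary directory so that it runs on its own.

```python
import tempfile
from pathlib import Path


def expected_input_for(rml_path, input_dir):
    """Return the JSON file a mapping is assumed to read.

    Assumed convention (inferred from the log paths):
    pl_20_500_11752_OPEN_975.rml -> <input_dir>/20_500_11752_OPEN_975.json
    """
    stem = Path(rml_path).stem
    if stem.startswith("pl_"):
        stem = stem[len("pl_"):]
    return Path(input_dir) / f"{stem}.json"


def split_by_input_availability(rml_dir, input_dir):
    """Split the mapping files into those with and without an input file."""
    ready, missing = [], []
    for rml_path in sorted(Path(rml_dir).glob("*.rml")):
        if expected_input_for(rml_path, input_dir).is_file():
            ready.append(rml_path.name)
        else:
            missing.append(rml_path.name)
    return ready, missing


# Self-contained demo: two mappings, only one of which has its JSON input.
with tempfile.TemporaryDirectory() as tmp:
    rml_dir = Path(tmp) / "rml_files"
    input_dir = Path(tmp) / "input_data"
    rml_dir.mkdir()
    input_dir.mkdir()
    (rml_dir / "pl_20_500_11752_OPEN_1018.rml").touch()
    (rml_dir / "pl_20_500_11752_OPEN_975.rml").touch()
    (input_dir / "20_500_11752_OPEN_1018.json").touch()
    ready, missing = split_by_input_availability(rml_dir, input_dir)

print("ready:", ready)
print("missing:", missing)
```

In the notebook itself, the same function could be called as `split_by_input_availability("rml_files", "input_data")` so that mappings without input data are skipped or reported up front rather than failing one by one.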


**Step 2: Aggregating Concepts to ConceptScheme**

After generating the `.ttl` RDF files containing both `skos:ConceptScheme` and `skos:Concept` resources, we need to establish the relationships between each `skos:ConceptScheme` and its top-level concepts. This step ensures that the concepts are correctly linked to their overarching scheme.

There are two key properties we will use:
1. **`skos:hasTopConcept`**: This property will connect the `skos:ConceptScheme` to its top-level `skos:Concept`.
2. **`skos:topConceptOf`**: This is the inverse property, which links the `skos:Concept` back to its `skos:ConceptScheme`.

The goal is to:
- Add a `skos:hasTopConcept` triple from the concept scheme to each concept in the file.
- Add a `skos:topConceptOf` triple from each concept back to its concept scheme.

In [17]:
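import os

from rdflib import Graph, Namespace
from rdflib.namespace import RDF, SKOS

# Define the SKOS namespace (the function below uses rdflib's built-in SKOS, which has the same URI)
SKOS_NS = Namespace("http://www.w3.org/2004/02/skos/core#")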

def aggregate_concepts_to_scheme(rdf_dir):
    """
    Aggregate concepts to concept schemes by adding skos:hasTopConcept and skos:topConceptOf properties.
    This updates the existing TTL files.
    """
    for rdf_file in os.listdir(rdf_dir):
        if rdf_file.endswith(".ttl"):
            rdf_file_path = os.path.join(rdf_dir, rdf_file)
            graph = Graph()
            graph.parse(rdf_file_path, format="ttl")

            # Find the concept scheme
            concept_scheme = None
            for scheme in graph.subjects(RDF.type, SKOS.ConceptScheme):
                concept_scheme = scheme
                break

            if concept_scheme is None:
                print(f"No ConceptScheme found in {rdf_file_path}")
                continue

            # Collect every skos:Concept in the file; each one is treated as a top concept
            concepts = list(graph.subjects(RDF.type, SKOS.Concept))

            # Add skos:hasTopConcept and skos:topConceptOf properties
            for concept in concepts:
                # Link from the scheme to the concept
                graph.add((concept_scheme, SKOS.hasTopConcept, concept))
                # Link from the concept to the scheme
                graph.add((concept, SKOS.topConceptOf, concept_scheme))

            # Serialize and update the RDF file
            graph.serialize(destination=rdf_file_path, format="turtle")
            print(f"Updated {rdf_file_path} with skos:hasTopConcept and skos:topConceptOf relationships")

# Directory containing RDF Turtle files
rdf_directory = "rdf_files"

# Run the aggregation
aggregate_concepts_to_scheme(rdf_directory)

Updated rdf_files/pl_20_500_11752_OPEN_1018.ttl with skos:hasTopConcept and skos:topConceptOf relationships
Updated rdf_files/pl_20_500_11752_OPEN_1026.ttl with skos:hasTopConcept and skos:topConceptOf relationships
Updated rdf_files/pl_20_500_11752_OPEN_1025.ttl with skos:hasTopConcept and skos:topConceptOf relationships
Updated rdf_files/pl_20_500_11752_OPEN_1016.ttl with skos:hasTopConcept and skos:topConceptOf relationships
Updated rdf_files/pl_20_500_11752_OPEN_1024.ttl with skos:hasTopConcept and skos:topConceptOf relationships
Updated rdf_files/pl_20_500_11752_OPEN_1014.ttl with skos:hasTopConcept and skos:topConceptOf relationships
Updated rdf_files/pl_20_500_11752_OPEN_1019.ttl with skos:hasTopConcept and skos:topConceptOf relationships
Updated rdf_files/pl_20_500_11752_OPEN_1017.ttl with skos:hasTopConcept and skos:topConceptOf relationships
Updated rdf_files/pl_20_500_11752_OPEN_1021.ttl with skos:hasTopConcept and skos:topConceptOf relationships
Updated rdf_files/pl_20_500_

**Step 3: Normalizing Character Encoding**

In this step, we normalize the character encoding of the URIs in the generated `.ttl` files. We check whether any subject or object URI (in practice, the `skos:Concept` URIs) contains percent-encoded characters, such as `%C3%A9` for `é`. Any such sequences are decoded back to their UTF-8 characters, which keeps the URIs human-readable and ensures that accented or special characters are represented correctly.
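
The decoding itself can be illustrated with the standard library alone. The sketch below is an illustration rather than the notebook's method: the concept URI is hypothetical, and `decode_non_ascii_escapes` is an assumed helper name. It contrasts plain `urllib.parse.unquote`, which decodes every escape (including `%20` into a literal space), with a more conservative variant that decodes only escapes yielding non-ASCII characters and leaves ASCII escapes such as `%20` or `%2F` in place.

```python
import re
from urllib.parse import unquote

# Hypothetical concept URI, used only for illustration.
encoded = (
    "https://vocabs.ilc4clarin.ilc.cnr.it/vocabularies/"
    "pl_20_500_11752_OPEN_1018/caf%C3%A9%20cr%C3%A8me"
)

# Plain decoding turns every escape into its character, including %20 -> " ".
fully_decoded = unquote(encoded)

# One or more consecutive percent-escapes, e.g. "%C3%A9%20".
_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _decode_run(match):
    """Decode a run of escapes, re-escaping any ASCII characters it contains."""
    raw = bytes.fromhex(match.group(0).replace("%", ""))
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: leave the original escapes untouched.
        return match.group(0)
    return "".join(ch if ord(ch) > 127 else f"%{ord(ch):02X}" for ch in text)


def decode_non_ascii_escapes(uri):
    """Decode only the escapes that produce non-ASCII characters."""
    return _ESCAPE_RUN.sub(_decode_run, uri)


readable = decode_non_ascii_escapes(encoded)
print(fully_decoded)  # ends with "café crème"
print(readable)       # ends with "café%20crème"
```

The conservative variant may be preferable when identifiers can contain spaces or other reserved characters, since a decoded space is not allowed in a URI and could cause problems when the file is parsed again.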

In [18]:
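import os
import re
import urllib.parse

from rdflib import Graph, URIRef

def is_percent_encoded(uri):
    """Check if the given URI contains a percent-encoded byte such as %C3."""
    return re.search(r'%[0-9A-Fa-f]{2}', uri) is not None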

def normalize_concept_uris_in_ttl(ttl_file):
    """
    Function to normalize percent-encoded characters in URIs in a Turtle file
    while preserving the existing prefixes.
    """
    # Load the Turtle file into an RDFLib graph
    g = Graph()
    g.parse(ttl_file, format="turtle")

    # Create a new graph for the normalized triples
    updated_graph = Graph()

    # Copy namespaces (prefixes) from the original graph to the new graph
    for prefix, namespace in g.namespace_manager.namespaces():
        updated_graph.bind(prefix, namespace)

    # Iterate over triples in the original graph
    for subj, pred, obj in g:
        subj_str = str(subj)
        obj_str = str(obj)

        # Check and normalize percent-encoded characters in the subject URI (blank nodes are left as-is)
        if isinstance(subj, URIRef) and is_percent_encoded(subj_str):
            cleaned_subj = urllib.parse.unquote(subj_str)
            subj = URIRef(cleaned_subj)  # Ensure it's an RDFLib URIRef

        # Check and normalize percent-encoded characters in the object URI if it is a URIRef
        if isinstance(obj, URIRef) and is_percent_encoded(obj_str):
            cleaned_obj = urllib.parse.unquote(obj_str)
            obj = URIRef(cleaned_obj)  # Ensure it's an RDFLib URIRef

        # Add the updated triples to the new graph
        updated_graph.add((subj, pred, obj))
    
    # Write the normalized graph back to the Turtle file with UTF-8 encoding, preserving prefixes
    with open(ttl_file, "wb") as f:
        updated_graph.serialize(destination=f, format='turtle', encoding='utf-8')

    print(f"Normalized URIs and saved the updated file: {ttl_file}")

def normalize_all_ttl_files(directory):
    """
    Function to normalize URIs in all Turtle files in a given directory
    while preserving the prefixes in each file.
    """
    for filename in os.listdir(directory):
        if filename.endswith(".ttl"):
            ttl_file_path = os.path.join(directory, filename)
            normalize_concept_uris_in_ttl(ttl_file_path)

# Specify the directory containing the .ttl files
ttl_files_directory = "rdf_files"

# Run normalization on all .ttl files in the directory
normalize_all_ttl_files(ttl_files_directory)


Normalized URIs and saved the updated file: rdf_files/pl_20_500_11752_OPEN_1018.ttl
Normalized URIs and saved the updated file: rdf_files/pl_20_500_11752_OPEN_1026.ttl
Normalized URIs and saved the updated file: rdf_files/pl_20_500_11752_OPEN_1025.ttl
Normalized URIs and saved the updated file: rdf_files/pl_20_500_11752_OPEN_1016.ttl
Normalized URIs and saved the updated file: rdf_files/pl_20_500_11752_OPEN_1024.ttl
Normalized URIs and saved the updated file: rdf_files/pl_20_500_11752_OPEN_1014.ttl
Normalized URIs and saved the updated file: rdf_files/pl_20_500_11752_OPEN_1019.ttl
Normalized URIs and saved the updated file: rdf_files/pl_20_500_11752_OPEN_1017.ttl
Normalized URIs and saved the updated file: rdf_files/pl_20_500_11752_OPEN_1021.ttl
Normalized URIs and saved the updated file: rdf_files/pl_20_500_11752_OPEN_1020.ttl
Normalized URIs and saved the updated file: rdf_files/pl_20_500_11752_OPEN_1027.ttl
Normalized URIs and saved the updated file: rdf_files/pl_20_500_11752_OPEN_1