# XSD to CSV Conversion: Testing Both Versions

### Import needed libraries

- `zipfile` – for working with ZIP archives
- `json` – for loading the JSON config files
- `pandas` – for working with tables and dataframes (`pip install pandas`)
- `lxml.etree` – to parse and read XML/XSD structure (`pip install lxml`)
- `pathlib.Path` – for path handling

Install any missing third-party packages with `pip` before running the notebook.

In [119]:
import zipfile
import json
import pandas as pd
from lxml import etree
from pathlib import Path

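### Version 1.4.3 - Set paths for input ZIP and output folder

Run this cell to test version 1.4.3. It sets:
- `input_zip_path`: path to the input ZIP file with XSDs
- `output_path`: folder where the files are extracted

If the output folder doesn't exist, it is created. The 1.5.0 cell below assigns the same two variables, so whichever of the two cells runs last determines the version being processed.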

In [103]:
# Define the path to the input zip file and output directory
input_zip_path = Path("../tests/data/JVF_DTM_143_XSD.zip")
output_path = Path("../tests/output/JVF_DTM_143_XSD")

# Create the output directory if it doesn't exist
output_path.mkdir(parents=True, exist_ok=True)

### Version 1.5.0 - Set paths for input ZIP and output folder

Run this cell instead of the 1.4.3 cell to test version 1.5.0. It sets:
- `input_zip_path`: path to the input ZIP file with XSDs
- `output_path`: folder where the files are extracted

If the output folder doesn't exist, it is created.

In [87]:
# Define the path to the input zip file and output directory
input_zip_path = Path("../tests/data/JVF_DTM_150_beta3_XSD.zip")
output_path = Path("../tests/output/JVF_DTM_150_beta3_XSD")

# Create the output directory if it doesn't exist
output_path.mkdir(parents=True, exist_ok=True)
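Because the two path cells differ only in the archive name, the choice of version could also be driven by a single value. The following is a minimal sketch of that idea; `ARCHIVE_STEMS` and `paths_for_version` are hypothetical names that are not part of the notebook, and the sketch deliberately does not create the output folder.

```python
from pathlib import Path

# Archive stems copied from the two path cells above; the helper is hypothetical.
ARCHIVE_STEMS = {
    "1.4.3": "JVF_DTM_143_XSD",
    "1.5.0": "JVF_DTM_150_beta3_XSD",
}


def paths_for_version(version, tests_dir=Path("../tests")):
    """Return (input_zip_path, output_path) for a known version string."""
    stem = ARCHIVE_STEMS[version]  # raises KeyError for an unknown version
    return tests_dir / "data" / f"{stem}.zip", tests_dir / "output" / stem


input_zip_path, output_path = paths_for_version("1.5.0")
print(input_zip_path.as_posix())
print(output_path.as_posix())
```

Calling `output_path.mkdir(parents=True, exist_ok=True)` afterwards would reproduce the behaviour of the original cells.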

## Unzip the input file

Extracts the ZIP archive to the output folder and prints the extracted file names.

In [23]:
# Extract the zip file
with zipfile.ZipFile(input_zip_path, 'r') as zip_ref:
    zip_ref.extractall(output_path)

# Show extracted files
print(f"Extracted to: {output_path}")
print(sorted(f.name for f in output_path.iterdir()))

Extracted to: ..\tests\output\JVF_DTM_150_beta3_XSD
['xsd']
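The listing above shows only the top-level `xsd` folder, so it does not confirm that the schema files themselves were extracted. A recursive listing gives a quicker check. The sketch below builds a tiny throwaway archive with made-up file names (they are assumptions, not the real JVF DTM contents) and lists every `.xsd` file beneath the extraction folder; `list_xsd_files` is a hypothetical helper.

```python
import tempfile
import zipfile
from pathlib import Path


def list_xsd_files(extracted_dir):
    """Return the sorted relative paths of all .xsd files below extracted_dir."""
    root = Path(extracted_dir)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.xsd"))


# Build a small demo archive; the file names are invented for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    archive = tmp / "demo.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("xsd/objekty/demo-bod.xsd", "<schema/>")
        zf.writestr("xsd/common/demo_common.xsd", "<schema/>")
        zf.writestr("xsd/readme.txt", "not a schema")

    target = tmp / "out"
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)

    found = list_xsd_files(target)

print(found)
```

In the notebook itself, `list_xsd_files(output_path)` would be the equivalent call.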


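## Summary CSV

This cell can process both versions. It parses all `.xsd` files in the `xsd/objekty` directory and extracts object definitions from the `ZaznamyObjektu` element:

- **Version 1.5.0**: `ZaznamyObjektu` contains `ref="ZaznamObjektu"`, and the record types are collected from the elements whose `substitutionGroup` is `ZaznamObjektu`.
- **Version 1.4.3**: `ZaznamObjektu` is defined by name inside `ZaznamyObjektu`, and its referenced child elements are extracted directly.

The script collects the relevant attributes (`filename`, `type`, `ref`, `minOccurs`, etc.), builds a `DataFrame`, and saves it as a summary CSV. The output file name is hardcoded as `summary_1.5.0.csv`, so change it when running against version 1.4.3.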
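The version check described above can be illustrated in isolation. The sketch below uses the standard library's `xml.etree.ElementTree` instead of `lxml` so that it runs without extra installs, and the two schema fragments are heavily simplified, made-up stand-ins for the real files. `detect_layout` is a hypothetical helper, not a function from the notebook.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Simplified, invented fragments that mimic the two layouts.
XSD_150 = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ZaznamyObjektu">
    <xs:complexType><xs:sequence>
      <xs:element ref="ZaznamObjektu" maxOccurs="unbounded"/>
    </xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>"""

XSD_143 = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="ZaznamyObjektu">
    <xs:complexType><xs:sequence>
      <xs:element name="ZaznamObjektu" maxOccurs="unbounded"/>
    </xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>"""


def detect_layout(xsd_text):
    """Return '1.5.0', '1.4.3', or None depending on how ZaznamObjektu appears."""
    root = ET.fromstring(xsd_text)
    for top in root.iter(f"{XS}element"):
        if top.get("name") != "ZaznamyObjektu":
            continue
        # Look at the elements nested inside ZaznamyObjektu.
        for child in top.iter(f"{XS}element"):
            if child.get("ref") == "ZaznamObjektu":
                return "1.5.0"  # referenced: types come from the substitution group
            if child.get("name") == "ZaznamObjektu":
                return "1.4.3"  # defined inline by name
    return None


print(detect_layout(XSD_150), detect_layout(XSD_143))
```

The notebook reaches the same decision with `local-name()` XPath expressions, which ignore the namespace prefix entirely.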

In [88]:
# Path to the folder with extracted XSD files
xsd_objects_path = output_path / "xsd" / "objekty"

records = []

# Loop through all .xsd files in the directory
for file_path in xsd_objects_path.glob("*.xsd"):
    # try:
        # Parse the XSD file into an XML tree
        tree = etree.parse(str(file_path))
        root = tree.getroot()

        # Find all elements named ZaznamyObjektu
        zaznamy_elements = root.xpath(".//*[local-name()='element'][@name='ZaznamyObjektu']")
        
        # Iterate over the ZaznamyObjektu elements (there should be just one per file)
        for zaznamy_elem in zaznamy_elements:
            nested_elements = zaznamy_elem.xpath(".//*[local-name()='element']")
            
            # Iterate over elements inside ZaznamyObjektu
            for element in nested_elements:
                
                # If ZaznamObjektu is ref, that is the case for version 1.5.0
                if element.get("ref") == "ZaznamObjektu":
                    
                    # Collect the types of all elements in the ZaznamObjektu substitution group
                    types = root.xpath("./*[local-name()='element'][@substitutionGroup='ZaznamObjektu']/@type")
                    # Find the complexType definitions whose names match those types
                    matching_complex_types = []

                    for ctype in types:
                        complex_types = root.xpath(f".//*[local-name()='complexType'][@name='{ctype}']")
                        matching_complex_types.extend(complex_types)
                    
                    for mctype in matching_complex_types:
                        # Use type_name rather than `type` to avoid shadowing the built-in
                        type_name = mctype.get("name")
                        elems = mctype.xpath(".//*[local-name()='element' and not(@name='AtributyObjektu')]")
                        for el in elems:
                            parent_name = el.get("name")
                            parent_type = el.get("type")
                            nazev = el.get("ref")
                            minOccurs = el.get("minOccurs")
                            # Flag elements nested inside a choice group
                            is_choice = '1' if el.xpath("ancestor::*[local-name()='choice']") else None
                            records.append({
                                "filename": file_path.name,
                                "type": type_name,
                                "parent_name": parent_name,
                                "parent_type": parent_type,
                                "nazev": nazev,
                                "minOccurs": minOccurs,
                                "choice": is_choice
                            })
                            
                    break
                    
                # If ZaznamObjektu is name, that is the case for version 1.4.3
                elif element.get("name") == "ZaznamObjektu":
                    refs = []
                    gml_group = []
                    mOall = []
                    check = True
                    # Find all elements with a ref attribute other than cmn:ZapisObjektu
                    for ref_el in element.xpath(".//*[local-name()='element'][@ref and not(@ref='cmn:ZapisObjektu')]"):
                        ref_val = ref_el.get("ref")
                        # Handle geometry and add to list
                        if ref_val.startswith("gml:"):
                            gml_group.append(ref_val)
                            # Look up the GeometrieObjektu parent only once per file
                            if check:
                                geom_parent = ref_el.xpath("ancestor::*[local-name()='element'][@name='GeometrieObjektu']")
                                # minOccurs of GeometrieObjektu (None if no such ancestor exists)
                                minOccurs = geom_parent[0].get("minOccurs") if geom_parent else None
                                mOall.append(minOccurs)
                                check = False
                        else:
                            # minOccurs of other output elements
                            minOccurs = ref_el.get("minOccurs")
                            mOall.append(minOccurs)
                            if gml_group:
                                refs.append(gml_group)
                                gml_group = []

                            refs.append(ref_val)

                    if gml_group:
                        refs.append(gml_group)
                        
                    # Add everything to output
                    for item, m in zip(refs, mOall):
                        records.append({
                            "filename": file_path.name,
                            "nazev": item,
                            "minOccurs": m
                        })
                    break
                
# Create DataFrame from extracted records
df_str2 = pd.DataFrame(records)

# Save the DataFrame to CSV
df_str2.to_csv(output_path.parent / "summary_1.5.0.csv", index=False)

# Display the resulting DataFrame
df_str2

Unnamed: 0,filename,type,parent_name,parent_type,nazev,minOccurs,choice
0,autobusova_zastavka-bod.xsd,RefV,SpolecneAtributyVsechObjektu,atr:SpolecneAtributyVsechObjektuRefType,,,
1,autobusova_zastavka-bod.xsd,RefV,SpolecneAtributyObjektuDI,atr:SpolecneAtributyObjektuDIRefType,,,
2,autobusova_zastavka-bod.xsd,RefV,GeometrieObjektu,cmn:GeometrieObjektuBodType,,,
3,autobusova_zastavka-bod.xsd,RefV,SpolecneAtributyVsechObjektu,atr:SpolecneAtributyVsechObjektuRefType,,,
4,autobusova_zastavka-bod.xsd,RefV,SpolecneAtributyObjektuDI,atr:SpolecneAtributyObjektuDIRefType,,,
...,...,...,...,...,...,...,...
7962,zemedelska_plocha-plocha.xsd,Upd,GeometrieObjektu,cmn:GeometrieObjektuPlochaZPSType,,0,
7963,zemedelska_plocha-plocha.xsd,Del,SpolecneAtributyVsechObjektu,atr:SpolecneAtributyVsechObjektuDelType,,,
7964,zemedelska_plocha-plocha.xsd,Del,SpolecneAtributyObjektuZPS,atr:SpolecneAtributyObjektuZPSDelType,,0,
7965,zemedelska_plocha-plocha.xsd,Del,,,atr:TypZemedelskePlochy,0,


## Detailed CSV

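This cell loads a JSON config (`config_1.5.0.json` or `config_1.4.3.json`) and parses the `.xsd` files in `xsd/objekty` to extract detailed metadata:

- Matches elements against the entries in the config file.
- Extracts flags, attributes, and existence markers (e.g., `minOccurs`, `type`).

As written, the cell loads `config_1.4.3.json` and writes `detailed_1.4.3.csv`; change both names when testing version 1.5.0.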
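The config files themselves are not shown in this notebook. Judging only from how the code in this cell reads the config, its shape is probably close to the sketch below. Every entry in `example_config` is an assumption made for illustration (in particular `atr:ExampleRef` and `ExampleElement` are invented names), and `expected_columns` is a hypothetical helper that mirrors the column-ordering rules of `order_dataframe_columns_by_config`.

```python
import json

# Assumed shape, inferred from the keys the parsing code looks up.
example_config = {
    "filename": True,
    "namespace": True,
    "name": True,
    "type": True,
    "element_types": {
        # Matched on the element's name (the default when "match" is absent).
        "GeometrieObjektu": {"minOccurs": True},
        "OblastObjektuKI": {"exist": True, "minOccurs": True},
        # Matched on ref instead of name; "unique" suppresses duplicate values.
        "atr:ExampleRef": {"match": "ref", "exist": True, "minOccurs": "unique"},
        # Properties read from nested attribute declarations.
        "ExampleElement": {"attributes": {"code_base": ["fixed", "use"]}},
    },
}


def expected_columns(config):
    """List the output columns this config would request, in order."""
    columns = [key for key in ("filename", "namespace", "name", "type") if config.get(key)]
    for col, desc in config["element_types"].items():
        base = col.split(":")[-1]  # drop a namespace prefix such as "atr:"
        if desc.get("exist"):
            columns.append(base)
        for flag, val in desc.items():
            if flag in {"exist", "match", "attributes"}:
                continue
            if val is True or val == "unique":
                columns.append(f"{base}_{flag}")
        for attr, props in desc.get("attributes", {}).items():
            columns.extend(f"{attr}_{prop}" for prop in props)
    return columns


# Round-trip through JSON to confirm the structure is serializable as a config file.
config = json.loads(json.dumps(example_config))
columns = expected_columns(config)
print(columns)
```

Columns that appear in the data but not in the config are appended after these by the notebook's ordering function.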

In [120]:
def process_flag_value(data, element, matched_column, flag, is_unique=False):
    """
    Extracts the value of a specific attribute (flag) from an XML element and stores it in the data dictionary.
    
    Parameters:
    - data: dict where extracted values are stored.
    - element: XML element being processed.
    - matched_column: base name used to construct the output column name.
    - flag: the attribute name to extract from the element.
    - is_unique: if True, prevents duplicate values in the list for this column.
    """
    val = element.get(flag)
    if not val:
        return

    output_column_name = f"{matched_column}_{flag}"

    if output_column_name not in data:
        data[output_column_name] = [val]
    else:
        if not isinstance(data[output_column_name], list):
            data[output_column_name] = [data[output_column_name]]

        if is_unique and val in data[output_column_name]:
            return

        data[output_column_name].append(val)


def order_dataframe_columns_by_config(df, config, element_types):
    """
    Orders DataFrame columns based on the config and element_types definition.
    Ensures that base element columns (like 'GeometrieObjektu') appear before derived ones
    (like 'GeometrieObjektu_minOccurs', 'GeometrieObjektu_type', etc.).
    """
    ordered_cols = []

    # Add basic config-defined columns
    for key in ["filename", "namespace", "name", "type"]:
        if config.get(key):
            ordered_cols.append(key)

    # Add columns defined in element_types
    for col, desc in element_types.items():
        base = col.split(":")[-1]

        # Add existence flag or base column first
        if desc.get("exist"):
            ordered_cols.append(base)

        # Then add element flags like required, prohibited, minOccurs, etc.
        for flag, val in desc.items():
            if flag in {"exist", "match", "attributes"}:
                continue
            if val is True or val == "unique":
                ordered_cols.append(f"{base}_{flag}")

        # Then add attribute-derived columns
        if "attributes" in desc:
            for attr, props in desc["attributes"].items():
                for prop in props:
                    ordered_cols.append(f"{attr}_{prop}")

    # Add any remaining columns from DataFrame
    remaining = [c for c in df.columns if c not in ordered_cols]
    return df.reindex(columns=ordered_cols + remaining)


# Load JSON config (switch to config_1.5.0.json when testing version 1.5.0)
with open("../tests/data/config_1.4.3.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# Path to the folder with extracted .xsd files
xsd_objects_path = output_path / "xsd" / "objekty"
records = []
element_types = config["element_types"]

for file_path in xsd_objects_path.glob("*.xsd"):
    try:
        tree = etree.parse(str(file_path))
        root = tree.getroot()
        
        namespace = root.attrib.get("targetNamespace", "")
        data = {}
        if config.get("filename") is True:
            data["filename"] = file_path.name
        if config.get("namespace") is True:
            data["namespace"] = namespace

        # Get top-level element
        if config.get("name") is True or config.get("type") is True:
            top_level_elems = root.xpath('./*[local-name()="element"]')
            main_elem = top_level_elems[0]
            if config.get("name") is True:
                data["name"] = main_elem.get("name", "")
            if config.get("type") is True:
                data["type"] = main_elem.get("type", "")
        
        # Initialize the value of exist to 0 when asking for existence
        for column_name, description in element_types.items():
            if description.get("exist") is True:
                cleaned_column_name = column_name.split(":")[-1]
                data[cleaned_column_name] = 0

        all_elements = root.xpath(".//*[local-name()='element']")
        for element in all_elements:
                etype = None
                match_key = None
                matched_column = None
                # Find "match" values from config file
                for column_name, description in element_types.items():
                    config_match = description.get("match")
                    # if value of "match" is not given, use "name"
                    if not config_match:
                        config_match = "name"
                        
                    # Try to find element match
                    match_element = element.get(config_match)
                    # if it is the right element save description, column_name and match value
                    if match_element == column_name:
                        etype = description
                        matched_column = column_name.split(":")[-1]
                        match_key = config_match
                        break
                        
                if not etype:
                    # skip if matching element type wasn't found
                    continue
                # Handle attributes
                if "attributes" in etype:
                    # Iterate over attributes and their properties
                    for attr_name, props in etype["attributes"].items():
                        for prop in props:
                            # find values using xpath
                            val = element.xpath(f".//*[local-name()='attribute' and @{match_key}='{attr_name}']/@{prop}")
                            # save to output data
                            if val:
                                data[f"{attr_name}_{prop}"] = val[0]
                
                # Handle existence of element
                if etype.get("exist") is True:
                    # matched_column already has its namespace prefix removed
                    data[matched_column] = 1
                
                # Special case for geometry in version 1.4.3 without explicit 'type' attribute
                if matched_column == "GeometrieObjektu" and not element.get("type"):
                    # Find all child <element> nodes with a 'ref' attribute
                    ref_values = element.xpath(".//*[local-name()='element' and @ref]/@ref")
                    if ref_values:
                        data["GeometrieObjektu"] = ref_values
                
                # Handle asking directly for element properties
                true_flags = [key for key, value in etype.items() if value is True]
                unique_flags = [key for key, value in etype.items() if value == "unique"]
                unique_cols = [f"{matched_column}_{flag}" for flag in unique_flags]
                for flag in true_flags:
                    process_flag_value(data, element, matched_column, flag, is_unique=False)

                for flag in unique_flags:
                    process_flag_value(data, element, matched_column, flag, is_unique=True)

                for col in list(data):
                    if col in unique_cols:
                        if not isinstance(data[col], list):
                            data[col] = [data[col]]
                    else:
                        if isinstance(data[col], list) and len(data[col]) == 1 and col not in ("GeometrieObjektu_type","GeometrieObjektu"):
                            data[col] = data[col][0]

        records.append(data)

    except Exception as e:
        print(f"[Error] {file_path.name}: {e}")
        raise


df_str1 = pd.DataFrame(records, dtype=object)
df_str1 = order_dataframe_columns_by_config(df_str1, config, element_types)
output_csv_path = output_path.parent / "detailed_1.4.3.csv"
output_csv_path.parent.mkdir(parents=True, exist_ok=True)
df_str1.to_csv(output_csv_path, index=False)
df_str1

Unnamed: 0,filename,namespace,name,type,code_base_fixed,code_base_use,code_suffix_fixed,code_suffix_use,KategorieObjektu_fixed,SkupinaObjektu_fixed,ObsahovaCast_fixed,OblastObjektuKI,OblastObjektuKI_minOccurs,GeometrieObjektu_minOccurs,GeometrieObjektu
0,BP_plynovodni_site-plocha.xsd,bpplsi,BPPlynovodniSite,bpplsi:BPPlynovodniSiteType,0100000290,required,03,required,Ochranná a bezpečnostní pásma,Ochranné a bezpečnostní pásmo,TI,1,0,0,[gml:surfaceProperty]
1,BP_podzemniho_zasobniku_plynu-plocha.xsd,bpppol,BPPodzemnihoZasobnikuPlynu,bpppol:BPPodzemnihoZasobnikuPlynuType,0100000369,required,03,required,Ochranná a bezpečnostní pásma,Ochranné a bezpečnostní pásmo,TI,1,0,0,[gml:surfaceProperty]
2,BP_zarizeni_PKO-plocha.xsd,bpzpko,BPZarizeniPKO,bpzpko:BPZarizeniPKOType,0100000291,required,03,required,Ochranná a bezpečnostní pásma,Ochranné a bezpečnostní pásmo,TI,1,0,0,[gml:surfaceProperty]
3,budova-defbod.xsd,buddef,BudovaDefinicniBod,buddef:BudovaDefinicniBodType,0100000001,required,04,required,Budovy,Objekt budovy,ZPS,0,,,[gml:pointProperty]
4,budova-plocha.xsd,budpol,BudovaPlocha,budpol:BudovaPlochaType,0100000001,required,03,required,Budovy,Objekt budovy,ZPS,0,,,"[gml:surfaceProperty, gml:multiCurveProperty]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
353,zed-linie.xsd,zedlin,ZedLinie,zedlin:ZedLinieType,0100000168,required,02,required,Součásti a příslušenství staveb,Stavba společná pro více skupin,ZPS,0,,,[gml:curveProperty]
354,zed-plocha.xsd,zedpol,ZedPlocha,zedpol:ZedPlochaType,0100000168,required,03,required,Součásti a příslušenství staveb,Stavba společná pro více skupin,ZPS,0,,,"[gml:surfaceProperty, gml:multiCurveProperty]"
355,zeleznicni_prejezd-plocha.xsd,zprpol,ZeleznicniPrejezd,zprpol:ZeleznicniPrejezdType,0100000022,required,03,required,Dopravní stavby,Drážní doprava,DI,0,,,[gml:surfaceProperty]
356,zemedelska_plocha-defbod.xsd,zepdef,ZemedelskaPlochaDefinicniBod,zepdef:ZemedelskaPlochaDefinicniBodType,0100000207,required,04,required,"Vodstvo, vegetace a terén",Hospodářská plocha,ZPS,0,,,[gml:pointProperty]
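To see how columns such as `GeometrieObjektu_minOccurs` are accumulated, the `process_flag_value` helper from the cell above can be exercised on its own. The sketch below contains a condensed copy of that helper and, as an assumption made for convenience, uses plain dictionaries in place of `lxml` elements, since the helper only calls `.get()` on them. The `Demo` column name is invented.

```python
def process_flag_value(data, element, matched_column, flag, is_unique=False):
    """Condensed copy of the notebook helper: collect element[flag] into a list column."""
    val = element.get(flag)
    if not val:
        return
    column = f"{matched_column}_{flag}"
    if column not in data:
        data[column] = [val]
        return
    if not isinstance(data[column], list):
        data[column] = [data[column]]
    if is_unique and val in data[column]:
        return
    data[column].append(val)


# Dictionaries stand in for XML elements; the last one lacks the attribute.
elements = [{"minOccurs": "0"}, {"minOccurs": "0"}, {"minOccurs": "1"}, {}]

all_values = {}
unique_values = {}
for el in elements:
    process_flag_value(all_values, el, "Demo", "minOccurs")
    process_flag_value(unique_values, el, "Demo", "minOccurs", is_unique=True)

print(all_values)
print(unique_values)
```

With `is_unique=False` every occurrence is kept, while `is_unique=True` keeps each distinct value once; elements without the attribute contribute nothing in either mode.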
