# Test imported XSD files – Issue #3

This notebook performs the following tasks:

- Extracts the ZIP archive containing XSD schemas
- Locates the main `.xsd` file with `<xs:import>` declarations (`input_data.xsd` or `index/index_data.xsd`)
- Verifies:
  1. That all imported files actually exist in the extracted folder
  2. That every `.xsd` file in the extracted folder is referenced by an `<xs:import>` or `<xs:include>` in the main file
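
Both checks reduce to set differences between the referenced paths and the files found on disk. A minimal sketch, using made-up file names rather than the contents of the real archive:

```python
# Hypothetical example data: paths referenced by the main schema vs. files on disk
imported = {"common/a.xsd", "common/b.xsd", "objekty/c.xsd"}
on_disk = {"common/a.xsd", "common/b.xsd", "ext/d.xsd"}

# Check 1: referenced by the main schema but absent from the extracted folder
missing = sorted(imported - on_disk)

# Check 2: present on disk but never referenced
unreferenced = sorted(on_disk - imported)

print("Missing:", missing)
print("Unreferenced:", unreferenced)
```

Sets make the comparison independent of the order in which files are listed or discovered.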

### Import needed libraries

- `zipfile` – for working with ZIP archives
- `tempfile` – for creating temporary files and directories
- `json` – for working with JSON data
- `xmlschema` – to read and validate XSD files (`pip install xmlschema`)
- `pandas` – for working with tables and dataframes (`pip install pandas`)
- `lxml.etree` – to parse and read XML/XSD structure (`pip install lxml`)
- `pathlib.Path` – for path handling

Install any missing third-party packages with `pip` before running the notebook.

In [1]:
import zipfile
import tempfile
import json
import xmlschema
import pandas as pd
from lxml import etree
from pathlib import Path

### Set paths for input ZIP and output folder

This cell sets:
- `input_zip_path`: path to the input ZIP file with XSDs
- `output_path`: where to extract files

If the folder doesn't exist, it will be created.

In [2]:
# Define the path to the input zip file and output directory
input_zip_path = Path("../tests/data/JVF_DTM_143_XSD.zip")
output_path = Path("../tests/output/JVF_DTM_143_XSD")

# Create the output directory if it doesn't exist
output_path.mkdir(parents=True, exist_ok=True)

### Unzip the input file

Extracts the ZIP archive to the output folder and prints the extracted file names.

In [3]:
# Extract the zip file
with zipfile.ZipFile(input_zip_path, 'r') as zip_ref:
    zip_ref.extractall(output_path)

# Show extracted files
print(f"Extracted to: {output_path}")
print(sorted(f.name for f in output_path.iterdir()))

Extracted to: ..\tests\output\JVF_DTM_143_XSD
['xsd']


### Load and parse XSD files

Defines a function that finds all `.xsd` files in the folder, parses them into XML trees, and returns a list of (file name, XML root element) pairs.

In [4]:
# Set the path to the folder with extracted XSD files
xsd_dir = output_path / "xsd"

def load_xsd_files(directory):
    """Load all XSD files from the given directory and parse them into XML trees.
    
    Args:
        directory (Path): Base directory to search for .xsd files.

    Returns:
        list of (filename, XML root element) tuples.
    """
    
    # Initialize an empty list to store parsed XSD file information
    xsd_files = []
    
    # Recursively find all .xsd files
    for path in directory.rglob("*.xsd"):
        try:
            # Parse the file into an XML tree
            tree = etree.parse(str(path))
    
            # Add the file name and root element to the list
            xsd_files.append((path.name, tree.getroot()))
    
        except etree.XMLSyntaxError as e:
            # Handle invalid XML syntax
            print(f"[XMLSyntaxError] Skipping {path.name}: {e}")
    
        except Exception as e:
            # Handle all unexpected errors
            print(f"[UnexpectedError] Could not process {path.name}: {e}")
            raise  # Re-raise the exception

    return xsd_files

# Call the function to load and parse the XSD files
xsd_files = load_xsd_files(xsd_dir)

### Create summary table of XSD files

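Re-scans the output folder, this time keeping the full path of each file, and builds a table showing each XSD file's name, relative path, root tag, and the number of direct child elements of its root.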

In [5]:
# Rebuild xsd_files, this time storing the full path of each file
xsd_files = []
for path in output_path.rglob("*.xsd"):
    # Parse the file and get its root element
    tree = etree.parse(str(path))
    root = tree.getroot()
    # Store the path (as returned by rglob) together with its root element
    xsd_files.append((path, root))

# Build the summary using paths relative to the output folder
summary = []
for path, root in xsd_files:
    # Compute the path relative to output_path
    rel = path.relative_to(output_path)

    summary.append({
        "File Name":          path.name,
        "Relative Path":      rel.as_posix(),
        "Root Tag":           root.tag,
        "Number of Elements": len(root)  # direct children of the root only
    })

# Create and display the DataFrame
df_summary = pd.DataFrame(summary)
df_summary

| | File Name | Relative Path | Root Tag | Number of Elements |
|---|---|---|---|---|
| 0 | atributy.xsd | xsd/common/atributy.xsd | {http://www.w3.org/2001/XMLSchema}schema | 185 |
| 1 | common.xsd | xsd/common/common.xsd | {http://www.w3.org/2001/XMLSchema}schema | 4 |
| 2 | doprovodne_informace.xsd | xsd/common/doprovodne_informace.xsd | {http://www.w3.org/2001/XMLSchema}schema | 29 |
| 3 | extenze.xsd | xsd/common/extenze.xsd | {http://www.w3.org/2001/XMLSchema}schema | 1 |
| 4 | servis.xsd | xsd/common/servis.xsd | {http://www.w3.org/2001/XMLSchema}schema | 1 |
| ... | ... | ... | ... | ... |
| 442 | spatialReferencing.xsd | xsd/ext/gsr/spatialReferencing.xsd | {http://www.w3.org/2001/XMLSchema}schema | 14 |
| 443 | geometry.xsd | xsd/ext/gss/geometry.xsd | {http://www.w3.org/2001/XMLSchema}schema | 19 |
| 444 | gss.xsd | xsd/ext/gss/gss.xsd | {http://www.w3.org/2001/XMLSchema}schema | 7 |
| 445 | gts.xsd | xsd/ext/gts/gts.xsd | {http://www.w3.org/2001/XMLSchema}schema | 7 |
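
The `Number of Elements` column comes from `len(root)`, which counts only the direct children of the schema root rather than every nested element. The sketch below illustrates the difference on a tiny inline schema. It is an illustration only: the schema is invented, and it uses the standard library's `xml.etree.ElementTree` instead of `lxml` so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

# Tiny illustrative schema (not taken from the archive)
xsd = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="a">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="b" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="c" type="xs:int"/>
</xs:schema>
"""

root = ET.fromstring(xsd)

# Top-level declarations only (what the summary table reports)
direct_children = len(root)

# Every nested element, excluding the root itself
all_descendants = sum(1 for _ in root.iter()) - 1

print(direct_children, all_descendants)
```

If a total element count is what the summary should report, iterating over all descendants is one way to obtain it.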


### Locate the main `.xsd` file

This step searches for the main file that contains the `<xs:import>` declarations.
- It first looks for `input_data.xsd` in any folder
- If not found, it looks for `index_data.xsd` (expected at `index/index_data.xsd`)
- If neither is found, an error is raised


The cells below locate the main file, then collect and display the `schemaLocation` paths it references.

In [6]:
# Attempt to locate the main XSD file
xsd_path = None

# Search for 'input_data.xsd' or 'index_data.xsd' under output_path
candidates = list(output_path.rglob("input_data.xsd")) + list(output_path.rglob("index_data.xsd"))

if candidates:
    xsd_path = candidates[0]
    print("Found main XSD file:", xsd_path.relative_to(output_path))
else:
    raise FileNotFoundError("Neither 'input_data.xsd' nor 'index_data.xsd' was found.")

Found main XSD file: xsd\index\index_data.xsd
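
The same priority search can be wrapped in a reusable function. This is a minimal sketch, not part of the original notebook: `find_main_xsd` and its `names` parameter are hypothetical, and the demonstration runs on a throwaway directory rather than the real archive.

```python
import tempfile
from pathlib import Path

def find_main_xsd(base_dir, names=("input_data.xsd", "index_data.xsd")):
    """Return the first match for the candidate names, tried in priority order."""
    for name in names:
        # sorted() makes the choice deterministic when several files share a name
        matches = sorted(base_dir.rglob(name))
        if matches:
            return matches[0]
    raise FileNotFoundError(f"None of {names} was found under {base_dir}")

# Demonstration on a temporary directory layout
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "xsd" / "index").mkdir(parents=True)
    (base / "xsd" / "index" / "index_data.xsd").write_text("<schema/>")
    found = find_main_xsd(base).relative_to(base).as_posix()
    print(found)
```

Sorting the matches is a design choice: `rglob` does not guarantee any particular order, so taking the first unsorted match could vary between runs or platforms.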


In [7]:
# Verify that xsd_path is set
if xsd_path is None:
    raise ValueError("xsd_path is not set. Cannot continue.")

# Parse the main XSD file and retrieve its root element
tree = etree.parse(str(xsd_path))
root = tree.getroot()

# Define the namespace mapping for XSD elements
xs_ns = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Collect all <xs:import> and <xs:include> elements and their types
records = []

# Find and record each import's schemaLocation
for el in root.findall(".//xs:import", namespaces=xs_ns):
    loc = el.get("schemaLocation")
    if loc:
        records.append({"schemaLocation": loc, "type": "import"})

# Find and record each include's schemaLocation
for el in root.findall(".//xs:include", namespaces=xs_ns):
    loc = el.get("schemaLocation")
    if loc:
        records.append({"schemaLocation": loc, "type": "include"})

# Build a DataFrame from the collected records
df_locations = pd.DataFrame(records)

# Display the DataFrame
df_locations

| | schemaLocation | type |
|---|---|---|
| 0 | ../common/atributy.xsd | import |
| 1 | ../common/common.xsd | import |
| 2 | ../common/doprovodne_informace.xsd | import |
| 3 | ../common/extenze.xsd | import |
| 4 | ../common/servis.xsd | import |
| ... | ... | ... |
| 359 | ../objekty/zed-linie.xsd | import |
| 360 | ../objekty/zed-plocha.xsd | import |
| 361 | ../objekty/zeleznicni_prejezd-plocha.xsd | import |
| 362 | ../objekty/zemedelska_plocha-defbod.xsd | import |


### Define a helper function to normalize paths

Strip leading `../` segments and an optional `xsd/` prefix so that all comparisons use the same base.

In [8]:
from pathlib import Path

def normalize_path(p: str) -> str:
    """
    Remove any leading '../' segments and a leading 'xsd/' prefix
    from the given POSIX-style path string.
    """
    # Strip all leading "../"
    while p.startswith("../"):
        p = p[3:]
    # Remove leading "xsd/" if present
    if p.startswith("xsd/"):
        p = p[4:]
    return p
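
Stripping `../` works here because the main file sits exactly one level below the `xsd/` folder. A more general alternative is to resolve each `schemaLocation` against the directory of the schema that declares it. The sketch below is one possible approach, not the notebook's method: `resolve_location` is a hypothetical helper, and the second call uses an invented reference purely for illustration.

```python
import posixpath

def resolve_location(declaring_xsd: str, schema_location: str) -> str:
    """Resolve a schemaLocation against the directory of the declaring schema.

    Both arguments are POSIX-style paths; declaring_xsd is relative to the
    extraction root.
    """
    base_dir = posixpath.dirname(declaring_xsd)
    # normpath collapses the "../" segments after joining
    return posixpath.normpath(posixpath.join(base_dir, schema_location))

# A reference from the main file, as in the output above
resolved = resolve_location("xsd/index/index_data.xsd", "../common/atributy.xsd")
print(resolved)

# Hypothetical reference from a nested schema two levels below xsd/
nested = resolve_location("xsd/ext/gml/gml.xsd", "../gco/gco.xsd")
print(nested)
```

For the nested case, simply stripping `../` would yield `gco/gco.xsd` and lose the `ext/` segment, so resolution relative to the declaring file is the safer choice once references from files other than the main one are considered. This sketch assumes local relative paths; a `schemaLocation` given as a URL would need separate handling.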

### Collect all imports/includes from the main XSD

Parses `xsd_path` and gathers every `schemaLocation` from `<xs:import>` and `<xs:include>` into a set. The paths are normalized in the next step.

In [9]:
# Load and parse the primary XSD file
tree = etree.parse(str(xsd_path))

# Define the XML namespace for XSD elements
ns = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Find all <xs:import> and <xs:include> elements and get their schemaLocation values
raw_imports = {
    el.get("schemaLocation")
    for tag in ("import", "include")
    for el in tree.getroot().findall(f".//xs:{tag}", namespaces=ns)
    if el.get("schemaLocation")
}
raw_imports

{'../common/atributy.xsd',
 '../common/common.xsd',
 '../common/doprovodne_informace.xsd',
 '../common/extenze.xsd',
 '../common/servis.xsd',
 '../ext/gml/gml.xsd',
 '../objekty/BP_plynovodni_site-plocha.xsd',
 '../objekty/BP_podzemniho_zasobniku_plynu-plocha.xsd',
 '../objekty/BP_zarizeni_PKO-plocha.xsd',
 '../objekty/OP_drazni_stavby-plocha.xsd',
 '../objekty/OP_elektricke_site-plocha.xsd',
 '../objekty/OP_jaderneho_zarizeni-plocha.xsd',
 '../objekty/OP_kolektoru_kabelovodu-plocha.xsd',
 '../objekty/OP_leteckych_zabezpecovacich_zarizeni-plocha.xsd',
 '../objekty/OP_letiste-plocha.xsd',
 '../objekty/OP_objektu_kanalizace-plocha.xsd',
 '../objekty/OP_objektu_vodovodu-plocha.xsd',
 '../objekty/OP_plynovodni_site-plocha.xsd',
 '../objekty/OP_podzemniho_zasobniku_plynu-plocha.xsd',
 '../objekty/OP_pozemni_komunikace-plocha.xsd',
 '../objekty/OP_site_EK-plocha.xsd',
 '../objekty/OP_site_produktovodu-plocha.xsd',
 '../objekty/OP_stanice_elektricke_site-plocha.xsd',
 '../objekty/OP_stavby_vo

### Normalize imported paths and scan output directory

This cell normalizes each extracted `schemaLocation` and then recursively scans `output_path` for all `.xsd` files, normalizing their relative POSIX paths.

In [10]:
# Normalize each imported schemaLocation path to remove unwanted prefixes
imported = {
    normalize_path(Path(loc).as_posix())
    for loc in raw_imports
}

# Recursively scan the output directory for all .xsd files and normalize their relative paths
all_files = {
    normalize_path(p.relative_to(output_path).as_posix())
    for p in output_path.rglob("*.xsd")
}
all_files

{'common/atributy.xsd',
 'common/common.xsd',
 'common/doprovodne_informace.xsd',
 'common/extenze.xsd',
 'common/servis.xsd',
 'ext/gco/basicTypes.xsd',
 'ext/gco/gco.xsd',
 'ext/gco/gcoBase.xsd',
 'ext/gmd/applicationSchema.xsd',
 'ext/gmd/citation.xsd',
 'ext/gmd/constraints.xsd',
 'ext/gmd/content.xsd',
 'ext/gmd/dataQuality.xsd',
 'ext/gmd/distribution.xsd',
 'ext/gmd/extent.xsd',
 'ext/gmd/freeText.xsd',
 'ext/gmd/gmd.xsd',
 'ext/gmd/identification.xsd',
 'ext/gmd/maintenance.xsd',
 'ext/gmd/metadataApplication.xsd',
 'ext/gmd/metadataEntity.xsd',
 'ext/gmd/metadataExtension.xsd',
 'ext/gmd/portrayalCatalogue.xsd',
 'ext/gmd/referenceSystem.xsd',
 'ext/gmd/spatialRepresentation.xsd',
 'ext/gml/applicationSchema.xsd',
 'ext/gml/basicTypes.xsd',
 'ext/gml/citation.xsd',
 'ext/gml/constraints.xsd',
 'ext/gml/content.xsd',
 'ext/gml/coordinateOperations.xsd',
 'ext/gml/coordinateReferenceSystems.xsd',
 'ext/gml/coordinateSystems.xsd',
 'ext/gml/coverage.xsd',
 'ext/gml/dataQuality.xs

### Compare import list with existing files and build DataFrame

This cell iterates over the union of imported paths and discovered files, determines each file's status (`OK`, `Missing`, or `Unreferenced`), and assembles the results into a pandas DataFrame.

In [11]:
# Initialize a list to collect comparison records
records = []

# Iterate through all unique paths from imports and disk
for path in sorted(imported.union(all_files)):
    # Check if the path was imported and if it exists on disk
    in_imports     = path in imported
    exists_on_disk = path in all_files

    # Determine status: OK if both, Missing if imported but not on disk, Unreferenced otherwise
    status = (
        "OK"            if in_imports and exists_on_disk else
        "Missing"       if in_imports and not exists_on_disk else
        "Unreferenced"
    )

    # Append a record for this path
    records.append({
        "Path":      path,
        "Imported":  in_imports,
        "Exists":    exists_on_disk,
        "Status":    status
    })

# Create a DataFrame from the list of records
df_check = pd.DataFrame(records)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
df_check

| | Path | Imported | Exists | Status |
|---|---|---|---|---|
| 0 | common/atributy.xsd | True | True | OK |
| 1 | common/common.xsd | True | True | OK |
| 2 | common/doprovodne_informace.xsd | True | True | OK |
| 3 | common/extenze.xsd | True | True | OK |
| 4 | common/servis.xsd | True | True | OK |
| 5 | ext/gco/basicTypes.xsd | False | True | Unreferenced |
| 6 | ext/gco/gco.xsd | False | True | Unreferenced |
| 7 | ext/gco/gcoBase.xsd | False | True | Unreferenced |
| 8 | ext/gmd/applicationSchema.xsd | False | True | Unreferenced |
| 9 | ext/gmd/citation.xsd | False | True | Unreferenced |
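
The `Unreferenced` status above only means that the main file does not reference a schema directly. Such files may still be pulled in indirectly by other schemas, which this notebook does not check. The following is a hedged sketch of how a transitive check could look: `collect_reachable` is a hypothetical helper, the file layout is invented, and the standard library's `xml.etree.ElementTree` stands in for `lxml` to keep the example dependency-free. It assumes every `schemaLocation` is a local relative path.

```python
import posixpath
import tempfile
import xml.etree.ElementTree as ET
from collections import deque
from pathlib import Path

XS = "{http://www.w3.org/2001/XMLSchema}"

def collect_reachable(root_dir: Path, main_rel: str) -> set:
    """Return POSIX paths (relative to root_dir) reachable from main_rel
    through xs:import / xs:include, following references transitively."""
    seen = {main_rel}
    queue = deque([main_rel])
    while queue:
        current = queue.popleft()
        file_path = root_dir / current
        if not file_path.is_file():
            continue  # referenced but missing on disk; nothing to parse
        schema = ET.parse(str(file_path)).getroot()
        for tag in ("import", "include"):
            for el in schema.iter(f"{XS}{tag}"):
                loc = el.get("schemaLocation")
                if not loc:
                    continue
                # Resolve relative to the directory of the declaring schema
                target = posixpath.normpath(
                    posixpath.join(posixpath.dirname(current), loc)
                )
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
    return seen

# Demonstration: main.xsd -> a.xsd -> b.xsd, while orphan.xsd is never referenced
template = '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">{}</xs:schema>'
with tempfile.TemporaryDirectory() as tmp:
    root_dir = Path(tmp)
    (root_dir / "index").mkdir()
    (root_dir / "common").mkdir()
    (root_dir / "index" / "main.xsd").write_text(
        template.format('<xs:import schemaLocation="../common/a.xsd"/>'))
    (root_dir / "common" / "a.xsd").write_text(
        template.format('<xs:include schemaLocation="b.xsd"/>'))
    (root_dir / "common" / "b.xsd").write_text(template.format(""))
    (root_dir / "common" / "orphan.xsd").write_text(template.format(""))

    reachable = collect_reachable(root_dir, "index/main.xsd")
    on_disk = {p.relative_to(root_dir).as_posix() for p in root_dir.rglob("*.xsd")}
    unreferenced = sorted(on_disk - reachable)
    print(unreferenced)
```

The `seen` set doubles as cycle protection, so schemas that import each other do not cause an endless loop.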
