# Convert TAG DB XML to JSON

This notebook converts TAG DB XML files to JSON format for processing by the Purview_TAG_DB_Scan notebook.

## Prerequisites
- XML files uploaded to the `tag-db-xml/` folder in your storage account
- A storage account that this notebook can read from and write to

## Workflow
1. Reads the non-empty XML files in the `tag-db-xml/` folder
2. Converts each XML file to JSON using xmltodict
3. Saves the JSON files to the `tag-db-json/` folder
4. Optionally moves processed XML files to the `tag-db-xml-processed/` archive folder

## Configuration
Update the configuration cell below with your infrastructure settings before running.
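
Before running, it can help to sanity-check the storage settings. The sketch below is not part of the original notebook: `build_abfss_uri` is a hypothetical helper that applies Azure's naming rules (storage accounts: 3 to 24 lowercase letters and digits; containers: 3 to 63 lowercase letters, digits, and non-consecutive hyphens) and returns an `abfss://` folder URI in the same form the notebook builds further down.

```python
import re

# Azure naming rules: storage accounts are 3-24 lowercase letters/digits;
# containers are 3-63 chars of lowercase letters, digits, and single hyphens.
_ACCOUNT_RE = re.compile(r"^[a-z0-9]{3,24}$")
_CONTAINER_RE = re.compile(r"^(?!.*--)[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


def build_abfss_uri(container: str, account: str, folder: str) -> str:
    """Hypothetical helper: validate names and return an abfss:// folder URI."""
    if not _ACCOUNT_RE.match(account):
        raise ValueError(f"Invalid storage account name: {account!r}")
    if not _CONTAINER_RE.match(container):
        raise ValueError(f"Invalid container name: {container!r}")
    folder = folder.strip("/")  # tolerate leading/trailing slashes in the setting
    if not folder:
        raise ValueError("Folder path must not be empty")
    return f"abfss://{container}@{account}.dfs.core.windows.net/{folder}"


print(build_abfss_uri("pccsa", "pccsast6nvsfni5vtcj6", "tag-db-xml"))
```

Failing early on a malformed name gives a clearer message than the storage error that would otherwise appear when the first file listing runs.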

In [None]:
%pip install xmltodict

In [None]:
# Storage Configuration
blob_container_name = "pccsa"
blob_account_name = "pccsast6nvsfni5vtcj6"
blob_xml_path = "tag-db-xml"  # Input: XML files
blob_json_path = "tag-db-json"  # Output: JSON files
blob_xml_archive = "tag-db-xml-processed"  # Optional: Archive processed XML files

# Processing Options
archive_processed_xml = False  # Set to True to move processed XML files to archive folder

In [None]:
import json
import xmltodict
from notebookutils import mssparkutils

In [None]:
# Build ADLS paths
adls_xml_path = f'abfss://{blob_container_name}@{blob_account_name}.dfs.core.windows.net/{blob_xml_path}'
adls_json_path = f'abfss://{blob_container_name}@{blob_account_name}.dfs.core.windows.net/{blob_json_path}'
adls_xml_archive = f'abfss://{blob_container_name}@{blob_account_name}.dfs.core.windows.net/{blob_xml_archive}'

print(f"XML Source: {adls_xml_path}")
print(f"JSON Destination: {adls_json_path}")
print(f"XML Archive: {adls_xml_archive}")

## Convert XML Files to JSON

This cell processes all XML files in the source folder.
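
Source files are selected case-insensitively, so names such as `Tags.Xml` or `plant.xml.backup.XML` can reach the conversion step. As a minimal sketch (the helper name `json_name_for` is hypothetical, not part of the notebook), splitting off only the final extension gives a predictable output name in those cases:

```python
def json_name_for(xml_name: str) -> str:
    """Hypothetical helper: swap the final extension of an XML filename for .json."""
    stem, dot, ext = xml_name.rpartition(".")
    if not dot or ext.lower() != "xml":
        raise ValueError(f"Not an XML filename: {xml_name!r}")
    return f"{stem}.json"


for name in ["tags.xml", "Tags.Xml", "plant.xml.backup.XML"]:
    print(name, "->", json_name_for(name))
```

Only the last suffix changes, so a `.xml` that appears earlier in the name is left alone.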

In [None]:
import traceback

try:
    # Get list of XML files
    print(f"\n🔍 Scanning for XML files in: {adls_xml_path}")
    files = mssparkutils.fs.ls(adls_xml_path)
    
    xml_files = [f for f in files if f.name.lower().endswith('.xml') and f.size > 0]
    
    if not xml_files:
        print("⚠️  No XML files found to process")
    else:
        print(f"\n📁 Found {len(xml_files)} XML file(s) to process\n")
        
        processed_count = 0
        failed_count = 0
        
        for file in xml_files:
            try:
                print(f"\n📄 Processing: {file.name}")
                print(f"   Size: {file.size:,} bytes")
                
                # Read XML file (read entire file, adjust if files are very large)
                xml_content = mssparkutils.fs.head(file.path, file.size)
                
                # Convert XML to dictionary
                print("   🔄 Converting XML to JSON...")
                data_dict = xmltodict.parse(xml_content)
                
                # Convert to JSON
                json_output = json.dumps(data_dict, indent=2)
                
                # Generate output filename: swap the final extension (any case) for .json
                json_filename = file.name.rsplit('.', 1)[0] + '.json'
                json_file_path = f"{adls_json_path}/{json_filename}"
                
                # Save JSON file
                print(f"   💾 Saving JSON: {json_filename}")
                mssparkutils.fs.put(json_file_path, json_output, overwrite=True)
                
                # Archive processed XML file if enabled
                if archive_processed_xml:
                    archive_path = f"{adls_xml_archive}/{file.name}"
                    print(f"   📦 Archiving XML to: {blob_xml_archive}/{file.name}")
                    try:
                        # Delete existing file in archive if it exists
                        mssparkutils.fs.rm(archive_path, recurse=False)
                    except Exception:
                        # Most likely the file is not in the archive yet; ignore and continue
                        pass
                    mssparkutils.fs.mv(file.path, archive_path)
                
                print("   ✅ Success!")
                processed_count += 1
                
            except Exception as e:
                print(f"   ❌ Error processing {file.name}:")
                print(f"      {e}")
                failed_count += 1
                traceback.print_exc()
        
        # Summary
        print(f"\n{'='*60}")
        print("📊 Conversion Summary")
        print(f"{'='*60}")
        print(f"   ✅ Successfully converted: {processed_count}")
        print(f"   ❌ Failed: {failed_count}")
        print(f"   📁 Total files: {len(xml_files)}")
        print(f"{'='*60}")
        
        if processed_count > 0:
            print(f"\n✨ JSON files are ready in: {blob_json_path}/")
            print("\n▶️  Next step: Run the 'Purview_TAG_DB_Scan' notebook")
            
except Exception as e:
    print("\n❌ Fatal error during conversion:")
    print(f"   {e}")
    traceback_lines = traceback.format_exc()
    print(traceback_lines)

## Verification

Check the generated JSON files.
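
Listing the files confirms that they exist but not that their contents are valid. As a rough sketch (the helper `summarize_json` and the sample document are illustrative assumptions, not part of the notebook), a converted file's text can be parsed and its top-level shape reported:

```python
import json


def summarize_json(text: str) -> dict:
    """Hypothetical helper: parse JSON text and describe its top-level shape."""
    data = json.loads(text)  # raises json.JSONDecodeError on invalid or truncated JSON
    if isinstance(data, dict):
        return {"type": "object", "keys": list(data.keys())}
    if isinstance(data, list):
        return {"type": "array", "length": len(data)}
    return {"type": type(data).__name__}


# Illustrative sample; real files will have their own root element name.
sample = '{"root": {"item": [1, 2]}}'
print(summarize_json(sample))

# Possible use inside the notebook (assumes the whole file is returned by one head() call):
# f = mssparkutils.fs.ls(adls_json_path)[0]
# print(summarize_json(mssparkutils.fs.head(f.path, f.size)))
```

A parse error here usually means the file was cut short when it was read or written, which is worth checking before running the scanner.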

In [None]:
# List JSON files
try:
    json_files = mssparkutils.fs.ls(adls_json_path)
    print(f"\n📋 JSON files in {blob_json_path}:")
    print("=" * 60)
    for f in json_files:
        if f.name.lower().endswith('.json'):
            print(f"   📄 {f.name} ({f.size:,} bytes)")
    print("=" * 60)
except Exception as e:
    print(f"⚠️  No files found or error: {e}")

## Next Steps

After successful conversion:

1. **Verify JSON files** are in the `tag-db-json/` folder
2. **Run the Purview_TAG_DB_Scan notebook** with this configuration:
   ```python
   blob_relative_path = "tag-db-json"  # Read from JSON folder
   ```
3. The scanner will process the JSON files and generate Purview entity definitions

**Folder Structure:**
- `tag-db-xml/` → Input XML files (source)
- `tag-db-json/` → Converted JSON files (intermediate)
- `tag-db-purview-json/` → Purview entity JSON (output from scanner)
- `tag-db-processed/` → Archive for processed JSON files (from scanner)
- `tag-db-xml-processed/` → Archive for processed XML files (optional)