# Purview TAG DB Scanner

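This notebook scans TAG DB metadata export files and converts them into Purview-ready entity JSON. Loading that JSON into Microsoft Purview is done afterwards by the separate `Purview_Load_Entity` notebook.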

## Prerequisites
- TAG DB custom entity types created in Purview (afdatabase, afelement, afattribute, afanalysis)
- Service principal with Purview Data Curator and Storage Blob Data Contributor roles
- Key Vault with service principal credentials (client-id, client-secret, tenant-id)
- Storage account with TAG DB folders created

## Workflow
1. Reads the TAG DB export files (JSON) from the input folder set by `blob_relative_path` (`tag-db-json/` in the configuration below)
2. Parses hierarchical TAG DB structure
3. Converts to Purview entity JSON format
4. Moves processed files to `tag-db-processed/`
5. Outputs Purview-ready JSON to `tag-db-purview-json/`

## Configuration
Update cell 2 with your infrastructure settings before running.

In [None]:
%pip install pyapacheatlas

In [None]:
# Storage Configuration
blob_container_name = "pccsa"
blob_account_name = "pccsast6nvsfni5vtcj6"
blob_relative_path = "tag-db-json"  # Changed from tag-db-xml to tag-db-json
blob_processed = "tag-db-processed"
out_file = "tag-db-purview-json"

# Purview Configuration
app_name = "purviewspn"
key_vault_uri = "https://pccsakv6nvsfni5vtcj6.vault.azure.net/"
purview_account_name = "edinmedi-purview-labs"

## Install PyApacheAtlas Package
Required for Purview Atlas API integration

## Import All Dependencies
Make sure the PyApacheAtlas package is available, installed either at the workspace (cluster) level or for the current session. All other libraries are native.

In [None]:
import json
import os
# PyApacheAtlas packages
# GuidTracker generates the GUIDs and guarantees that they are unique
from pyapacheatlas.core.util import GuidTracker
from notebookutils import mssparkutils
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from datetime import datetime

## Setting up program variables
- Logger: log4j `Logger` class, taken from the Spark JVM, used to log debugging information
- mylogger: Logger instance used to write the log messages
- adls_home: Path of the input folder holding the files read by the notebook
- adls_processed: Folder where files are moved after the notebook finishes processing them
- adls_out_home: Folder where the output JSON to be loaded into Purview is generated
- gt: Object that tracks the unique GUIDs of the JSON objects to load into Purview
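
The three ADLS paths share one URI shape. A minimal sketch of how such a path is assembled from the configuration values; the helper name `abfss_path` and the sample container and account names are illustrative assumptions, not part of the notebook:

```python
def abfss_path(container: str, account: str, relative_path: str) -> str:
    """Build an ABFSS URI for a folder in an ADLS Gen2 storage account."""
    return "abfss://%s@%s.dfs.core.windows.net/%s" % (container, account, relative_path)

# Hypothetical values standing in for the configuration cell
adls_home_example = abfss_path("mycontainer", "mystorageaccount", "tag-db-json")
print(adls_home_example)
```

The notebook builds `adls_home`, `adls_processed` and `adls_out_home` with the same format string, changing only the last segment.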
    

In [None]:
# Set up the variables used for logging.
my_jars = os.environ.get("SPARK_HOME")
myconf = SparkConf()
myconf.set("spark.jars","%s/jars/log4j-1.2.17.jar" % my_jars)
spark = SparkSession\
 .builder\
 .appName("DB2_Test")\
 .config(conf = myconf) \
 .getOrCreate()

Logger= spark._jvm.org.apache.log4j.Logger
mylogger = Logger.getLogger(app_name)
# Initialize the file paths
adls_home = 'abfss://%s@%s.dfs.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
adls_processed = 'abfss://%s@%s.dfs.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_processed)
adls_out_home = 'abfss://%s@%s.dfs.core.windows.net/%s' % (blob_container_name, blob_account_name, out_file)
# Initialize the GUID tracker to guarantee unique GUIDs for the Purview objects
gt = GuidTracker()

## Log Function
Writes both to the Spark server logs and to the notebook output, for debugging.
- msg_type: Type of message to write (ERROR or INFO)
- msg: Message to be logged
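
To exercise the helper outside Spark (for example in a unit test), the same two-channel behaviour can be sketched with Python's standard `logging` module in place of log4j. The names `log_msgs_local` and `local_logger` are assumptions made for this example only:

```python
import logging

logging.basicConfig(level=logging.INFO)
local_logger = logging.getLogger("purviewspn-local")  # hypothetical logger name

def log_msgs_local(msg_type: str, msg: str) -> str:
    """Print the message and send it to the logger; return the printed line."""
    level = "ERROR" if msg_type.upper() == "ERROR" else "INFO"
    line = "%s: %s" % (level, msg)
    print(line)
    if level == "ERROR":
        local_logger.error(msg)
    else:
        local_logger.info(msg)
    return line

log_msgs_local("info", "Loading file")
log_msgs_local("ERROR", "Invalid JSON")
```

As in the notebook, any type other than ERROR is treated as INFO.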


In [None]:
# Simplifies logging to both the Spark log and the notebook output.
def log_msgs(msg_type, msg):
    if msg_type.upper() == "ERROR":
        print('ERROR: %s' % msg)
        mylogger.error(str(msg))
    else:
        print('INFO: %s' % msg)
        mylogger.info(str(msg))
           

## TAG DB Class Definitions
Used to hold the metadata read from the TAG DB export and transform it into the JSON to be loaded into Purview:
- AFDatabase
- AFElement (implemented by the `AFElemento` class)
- AFAttribute
- AFAnalysis
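
For orientation, this is roughly the dictionary shape that `toJson()` returns for an element related to its database. It is a reduced sketch: the GUIDs, names and server are placeholders, and the real classes emit more attributes than shown here.

```python
import json

# Placeholder GUIDs; in the notebook they are generated by GuidTracker
database_guid = "-1001"
element_guid = "-1002"

element_entity = {
    "typeName": "afelement",
    "guid": element_guid,
    "attributes": {
        "name": "Pump01",
        "qualifiedName": "osipi://MyPIServer/MyDatabase/Pump01",
        "description": "",
    },
    "relationshipAttributes": {
        # Key is the relationship name passed to addRelationship
        "Database": {
            "guid": database_guid,
            "typeName": "afdatabase",
            "entityStatus": "ACTIVE",
            "displayText": "Database",
            "relationshipType": "afdatabase_afelement",
            "relationshipStatus": "ACTIVE",
            "relationshipAttributes": {"typeName": "afdatabase_afelement"},
        }
    },
}

print(json.dumps(element_entity, indent=2))
```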

In [None]:
class AFBaseObject:
    """
    Base class for the OSI PI metadata objects;
    provides the methods that are common to all the other classes
    """
    def __init__(self):
        self.attributes= {'attributes':{}, 'relationshipAttributes':{}}

    """
    Generically creates a relationship with another data type

    :param str nameElement:
    Name that the related element will have on the current Data Asset
    e.g. "Database", "Parent", "Child", "Group"...

    :param str typeElement:
    Name of the type of the element the relationship is created with

    :param str idElement:
    Guid of the object the relationship is created with

    :param str relationShipType:
    Name of the type of the relationship being created
    """
    def addRelationship(self,nameElement, typeElement, idElement,relationShipType):
        self.attributes['relationshipAttributes'][nameElement]={}
        self.attributes['relationshipAttributes'][nameElement]['guid']=idElement
        self.attributes['relationshipAttributes'][nameElement]['typeName']= typeElement
        self.attributes['relationshipAttributes'][nameElement]['entityStatus']= "ACTIVE"
        self.attributes['relationshipAttributes'][nameElement]['displayText']= nameElement
        self.attributes['relationshipAttributes'][nameElement]['relationshipType']= relationShipType
        self.attributes['relationshipAttributes'][nameElement]['relationshipStatus']= "ACTIVE"
        self.attributes['relationshipAttributes'][nameElement]['relationshipAttributes']={}
        self.attributes['relationshipAttributes'][nameElement]['relationshipAttributes']['typeName']=relationShipType

    """
    Return the class attributes as a dictionary (json like) to be 
    consumed by any API
    """
    def toJson(self):
        return self.attributes
    
    def fixDate(self, date):
        if date is None:
            return ''
        else:
            return date.replace('Z','.0000000Z')
    
    def removeNulls(self, field, isnumber=False):
        if field is None:
            if isnumber:
                return 0
            else:
                return ''
        else:
            if field == '':
                if isnumber:
                    return 0
                else:
                    return ''
        return field
    
class AFDatabase(AFBaseObject):
    """
    AFDatabase Data Asset Type Definition, holds all the metadata information for
    an OSI PI AFDatabase data asset
    """


    """
    Initialize the class with all the attributes needed

    :param str name:
    Name to use for the data asset

    :param str description:
    Description of the data asset

    :param str defaultpiserver:
    Name of the default OSI PI server

    :param str defaultpiserverid:
    ID of the default OSI PI server

    :param str guid:
    Unique identifier of the Data Asset.
    """
    def __init__(self,name=None, description=None,defaultpiserver=None, defaultpiserverid=None,guid=None):
        self.name = self.removeNulls(name)
        self.description= self.removeNulls(description)
        self.defaultpiserver = self.removeNulls(defaultpiserver)
        self.defaultpiserverid = self.removeNulls(defaultpiserverid)
        #unique identifier built from the hierarchical path AFDatabase-Name
        self.qualifiedName = 'osipi://%s/%s' % (self.defaultpiserver,self.name)
        self.guid = guid
        self.typeName= 'afdatabase'
        self.attributes = {'attributes':{}}
        self.attributes['attributes']['name']=self.name
        self.attributes['guid']=guid
        self.attributes['attributes']['description']=self.description
        self.attributes['attributes']['DefaultPIServer']=self.defaultpiserver
        self.attributes['attributes']['DefaultPIServerID']=self.defaultpiserverid
        self.attributes['attributes']['qualifiedName'] = self.qualifiedName
        self.attributes['typeName']= 'afdatabase'
        self.attributes['relationshipAttributes'] = {}
    

class AFElemento(AFBaseObject):
    """
    AFElement Data Asset Type Definition, holds all the metadata information for
    an OSI PI AFElement data asset. It has a relationship with AFDatabase and can have children such as:
      * AFElements ('Parent')
      * AFAttribute ('Attribute')
      * AFAnalysis ('Analysis')
    """


    """
    Initialize the class with all the attributes needed

    :param str guid:
    Unique identifier of the Data Asset.

    :param str name:
    Name to use for the data asset

    :param str description:
    Description of the data asset

    :param int isAnnotated:
    0 if false, 1 if true

    :param str template:
    Name of the template that the AFElement uses

    :param AFDatabase database:
    Database that the element belongs to

    :param str comment:
    Comment about the AFElement

    :param datetime effectiveDate:
    Effective date of the AFElement
    (Incoming dates carry no fractional seconds, so fixDate pads them; if your system
    already emits fractional seconds, adjust fixDate accordingly)

    :param datetime obsoleteDate:
    Date when the AFElement becomes obsolete
    (Incoming dates carry no fractional seconds, so fixDate pads them; if your system
    already emits fractional seconds, adjust fixDate accordingly)

    :param str modifier:
    AFElement modifier
    """
    def __init__(self,guid=None, name=None, description=None,isAnnotated=None, template=None,database=None, comment=None
        , effectiveDate=None, obsoleteDate=None, modifier=None):
        self.name = self.removeNulls(name)
        self.description= self.removeNulls(description)
        #unique identifier built from the hierarchical path AFDatabase-AFElement Name
        self.qualifiedName = 'osipi://%s/%s/%s' % (database.defaultpiserver,database.name,self.name)
        self.obsoleteDate = self.fixDate(date=obsoleteDate)
        self.template=self.removeNulls(template)
        self.comment=self.removeNulls(comment)
        self.database= database
        self.isAnnotated=self.removeNulls(isAnnotated)
        self.effectiveDate=self.fixDate(date=effectiveDate)
        self.attributes = {'attributes':{}}
        self.modifier = self.removeNulls(modifier)
        self.guid=guid
        self.attributes['relationshipAttributes'] = {}
        
        self.attributes['attributes']['name']=self.name
        self.attributes['guid']=self.guid
        self.attributes['attributes']['description']=self.description
        self.attributes['attributes']['ObsoleteDate']=self.obsoleteDate
        self.attributes['attributes']['Template']=self.template
        self.attributes['attributes']['Comment'] = self.comment
        self.attributes['attributes']['IsAnnotated'] = self.isAnnotated
        self.attributes['attributes']['EffectiveDate'] = self.effectiveDate
        self.attributes['attributes']['qualifiedName'] = self.qualifiedName
        self.attributes['attributes']['Modifier']=self.modifier
        self.attributes['typeName']= 'afelement'
        self.addRelationship('Database', self.database.typeName, self.database.guid,'afdatabase_afelement')
    
class AFAttribute(AFBaseObject):
    """
    AFAttribute Data Asset Type Definition, holds all the metadata information for
    an OSI PI AFAttribute data asset. It has a relationship with other OSI PI data assets:
      * AFElements ('Parent Element')
    """

    """
    Initialize the class with all the attributes needed

    :param str guid:
    Unique identifier of the Data Asset.

    :param str name:
    Name to use for the data asset

    :param str description:
    Description of the data asset
    
    :param str isHidden:
    Whether the attribute is hidden; passed as 'True'/'False' text, stored as 0=false 1=true

    :param str isManualDataEntry:
    Whether the attribute is a manual data entry; passed as 'True'/'False' text, stored as 0=false 1=true

    :param str isExcluded:
    Whether the attribute is excluded; passed as 'True'/'False' text, stored as 0=false 1=true
    
    :param str isConfigurationItem:
    Whether the attribute is a configuration item; passed as 'True'/'False' text, stored as 0=false 1=true
    
    :param str trait:
    Trait

    :param str defaultUOM:
    Default unit of measure (UOM)
    
    :param str displayDigits:
    Number of display digits

    :param str _type:
    Type 

    :param str typeQualifier:
    Type qualifier
    
    :param str dataReference:
    Data reference, in string format
    
    :param str configString:
    Configuration string 
    
    :param AFDatabase database:
    Parent AFDatabase

    :param AFElement afelement:
    Parent AFElement
    """
    def __init__(self,guid=None, name=None, description=None,isHidden=None, isManualDataEntry=None,isExcluded=None, isConfigurationItem=None, trait=None, defaultUOM=None, displayDigits=None,
    _type=None, typeQualifier=None, dataReference=None, configString=None,database = None,afelement=None):
        self.name = self.removeNulls(name)
        self.description= self.removeNulls(description)
        #unique identifier built from the hierarchical path AFDatabase-AFElement-Name
        self.qualifiedName = 'osipi://%s/%s/%s/%s' % (database.defaultpiserver,database.name,afelement.name,self.name)
        self.isHidden = 0 if self.removeNulls(isHidden).upper()=='FALSE' else 1
        self.isManualDataEntry=0 if self.removeNulls(isManualDataEntry).upper()=='FALSE' else 1
        self.isExcluded=0 if self.removeNulls(isExcluded).upper()=='FALSE' else 1
        self.database= database
        self.isConfigurationItem=0 if self.removeNulls(isConfigurationItem).upper()=='FALSE' else 1
        self.trait=self.removeNulls(trait)
        self.defaultUOM=self.removeNulls(defaultUOM)
        self.displayDigits=self.removeNulls(field=displayDigits,isnumber=True)
        self._type=self.removeNulls(_type)
        self.typeQualifier=self.removeNulls(typeQualifier)
        self.dataReference=self.removeNulls(dataReference)
        self.configString=self.removeNulls(configString)
        self.attributes = {'attributes':{}}
        self.guid=guid
        self.attributes['relationshipAttributes'] = {}

        self.attributes['attributes']['name']=self.name
        self.attributes['guid']=self.guid
        self.attributes['attributes']['description']=self.description
        self.attributes['attributes']['IsHidden'] = self.isHidden
        self.attributes['attributes']['IsManualDataEntry'] = self.isManualDataEntry
        self.attributes['attributes']['IsExcluded'] = self.isExcluded
        self.attributes['attributes']['IsConfigurationItem'] = self.isConfigurationItem
        self.attributes['attributes']['Trait'] = self.trait
        self.attributes['attributes']['DefaultUOM'] = self.defaultUOM
        self.attributes['attributes']['DisplayDigits'] = self.displayDigits
        self.attributes['attributes']['Type'] = self._type
        self.attributes['attributes']['TypeQualifier'] = self.typeQualifier
        self.attributes['attributes']['DataReference'] = self.dataReference
        self.attributes['attributes']['ConfigString'] = self.configString
        self.attributes['attributes']['qualifiedName'] = self.qualifiedName
        self.attributes['typeName']= 'afattribute'

        #adding the relationship to the parent AFElement
        self.addRelationship('Parent Element', 'afelement', afelement.guid,'afelement_afattribute')


class AFAnalysis(AFBaseObject):
    """
    AFAnalysis Data Asset Type Definition, holds all the metadata information for
    an OSI PI AFAnalysis data asset. It has a relationship with other OSI PI data assets:
      * AFElements ('Reference Element')
    """

    """
    Initialize the class with all the attributes needed

    :param str guid:
    Unique identifier of the Data Asset.

    :param str name:
    Name to use for the data asset

    :param str description:
    Description of the data asset
    
    :param str template:
    Template used for the AFAnalysis
    
    :param str caseTemplate:
    Case Template

    :param str outputTime:
    Output time
    
    :param str status:
    Analysis Status

    :param str publishResults:
    Publish Results; passed as 'True'/'False' text, stored as 0=false 1=true

    :param str priority:
    Priority
    
    :param str maxQueueSize:
    Max Queue Size
    
    :param str groupID:
    Group ID
    
    :param str target:
    Target

    :param AFDatabase database:
    Parent AFDatabase

    :param AFElement afelement:
    Parent AFElement
    """
    def __init__(self,guid=None, name=None, description=None,template=None, caseTemplate=None,outputTime=None, status=None, publishResults=None, priority=None, maxQueueSize=None,
    groupID=None, target=None,database = None,afelement=None):
        self.name = self.removeNulls(name)
        self.description= self.removeNulls(description)
        #unique identifier built from the hierarchical path AFDatabase-AFElement-Name
        self.qualifiedName = 'osipi://%s/%s/%s/%s' % (database.defaultpiserver,database.name,afelement.name,self.name)
        self.template = self.removeNulls(template)
        self.caseTemplate=self.removeNulls(caseTemplate)
        self.publishResults=0 if self.removeNulls(publishResults).upper()=='FALSE' else 1
        self.database= database
        self.outputTime=self.removeNulls(outputTime)
        self.status=self.removeNulls(status)
        self.priority=self.removeNulls(priority)
        self.maxQueueSize=self.removeNulls(field=maxQueueSize,isnumber=True)
        self.groupID=self.removeNulls(field=groupID, isnumber=True)
        self.attributes = {'attributes':{}}
        self.guid=guid
        self.attributes['relationshipAttributes'] = {}

        self.attributes['attributes']['name']=self.name
        self.attributes['guid']=self.guid
        self.attributes['attributes']['description']=self.description
        self.attributes['attributes']['Template'] = self.template
        self.attributes['attributes']['CaseTemplate'] = self.caseTemplate
        self.attributes['attributes']['PublishResults'] = self.publishResults
        self.attributes['attributes']['OutputTime'] = self.outputTime
        self.attributes['attributes']['Status'] = self.status
        self.attributes['attributes']['Priority'] = self.priority
        self.attributes['attributes']['MaxQueueSize'] = self.maxQueueSize
        self.attributes['attributes']['GroupID'] = self.groupID
        self.attributes['attributes']['qualifiedName'] = self.qualifiedName
        self.attributes['typeName']= 'afanalysis'

        self.addRelationship('Reference Element', 'afelement', afelement.guid,'afelement_afanalysis')


## Function to help check if an element exists in the dictionary
Checks whether the element exists and returns an empty string if it does not
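
A small usage sketch of this lookup. The helper is repeated so the example is self-contained, and the `sample` node is hypothetical. Note that a key that is present with a null value is returned as `None`, not as an empty string:

```python
def get_element(name=None, dictionary=None):
    # Same behaviour as the notebook helper: '' when the key is absent
    if name in dictionary:
        return dictionary[name]
    return ''

sample = {"Name": "Pump01", "Description": None}  # hypothetical parsed node

print(repr(get_element("Name", sample)))         # prints 'Pump01'
print(repr(get_element("Template", sample)))     # prints ''
print(repr(get_element("Description", sample)))  # prints None
```

For plain dictionaries this matches `dictionary.get(name, '')`.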

In [None]:
def get_element(name=None,dictionary=None):
    if name in dictionary:
        return dictionary[name]
    return ''

## Recursive function to iterate over the AFElement hierarchy
Loops through all AFElements, adding the hierarchy to the JSON to be loaded so that it is recreated as relationships in Purview
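
The traversal pattern can be sketched on its own, without Purview objects. In the sketch below a child `AFElement` may be a single object or a list, and both cases are normalized before recursing. The helpers `as_list` and `walk_elements` and the sample `tree` are assumptions for illustration; the notebook's function builds `AFElemento` objects instead of yielding names.

```python
def as_list(value):
    """Normalize a child that may be absent, a single dict, or a list of dicts."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def walk_elements(node, parent_name=None):
    """Yield (parent_name, name) pairs for every named AFElement at or below node."""
    name = node.get("Name")
    if name is not None:
        yield parent_name, name
    for child in as_list(node.get("AFElement")):
        yield from walk_elements(child, parent_name=name)

# Hypothetical hierarchy: one child given as a dict, two given as a list
tree = {
    "Name": "Plant",
    "AFElement": [
        {"Name": "Line1", "AFElement": {"Name": "Pump01"}},
        {"Name": "Line2"},
    ],
}

for parent, name in walk_elements(tree):
    print(parent, "->", name)
```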

In [None]:
#Function used to traverse the hierarchical tree and create the AFElements and their relationships
#can be used recursively
def get_AFElement(db=None, parent=None, element=None):
    afelement = None
    if 'Name' in element:
        #condition to AFElements with parents
        comment = None
        if 'Comment' in element:
            comment=get_element(name='Comment',dictionary= element)

        afelement = AFElemento(
            guid=gt.get_guid(), 
            name= get_element(name='Name',dictionary= element), 
            description= get_element(name='Description',dictionary= element),
            isAnnotated= 1 if get_element(name='IsAnnotated',dictionary= element)=='True' else 0, 
            template= get_element(name='Template',dictionary= element),
            database=db, 
            comment=comment, 
            effectiveDate= get_element(name='EffectiveDate',dictionary= element), 
            obsoleteDate=get_element(name='ObsoleteDate',dictionary= element),
            modifier = get_element(name='Modifier',dictionary= element)
        )
        if parent is not None:
            #adding Parent relationship
            afelement.addRelationship(
                nameElement='Parent',
                typeElement='afelement', 
                idElement = parent.guid,
                relationShipType='afelement_afelement')
        #Add AFElement for the list of entities to be loaded into Purview
        purview_load_entities.append(afelement.toJson())
    #checking whether the AFElement has child AFElements
    if 'AFElement' in element:
        #if it is a list of AFElements
        if type(element['AFElement']) is list:
            for item in element['AFElement']:
                if not type(item) is list:
                    try:
                        get_AFElement(db=db,parent=afelement ,element=item)
                    except Exception as e:
                        #log the failing child and continue with its siblings
                        log_msgs('ERROR','Failed to process AFElement %s: %s' % (item, e))
        else:
        #Only One AFElement
            get_AFElement(db=db,parent= afelement ,element=element['AFElement'])
    
    #AFAnalysis and AFAttribute need a parent AFElement to build their qualifiedName
    if afelement is not None:
        get_AFAnalysis(db=db, afelement=afelement,element=element)

        get_AFAttributes(db=db, afelement=afelement,element=element)

def get_AFAnalysis(db=None, afelement=None, element=None):
    #Check whether there is any AFAnalysis
    if 'AFAnalysis' in element:
        #If there is more than one
        if type(element['AFAnalysis']) is list:
            for attrib in element['AFAnalysis']:
                #build the entity from the analysis node itself, not from its parent element
                analysis = set_AFAnalysis(analisys=attrib,database = db,afelement=afelement)
                #Append the AFAnalysis to the list of Data Assets to be loaded into Purview
                purview_load_entities.append(analysis.toJson())
        else:
        #Only one AFAnalysis
            attrib = element['AFAnalysis']
            analysis = set_AFAnalysis(analisys=attrib,database = db,afelement=afelement)
            #Append the AFAnalysis to the list of Data Assets to be loaded into Purview
            purview_load_entities.append(analysis.toJson())

def get_AFAttributes(db=None, afelement=None, element=None):
    #checking if AFElement has AFAttributes
    if 'AFAttribute' in element:
        #If it is a list of AFAttributes
        if type(element['AFAttribute']) is list:
            for attrib in element['AFAttribute']:
                #build the entity from the attribute node itself, not from its parent element
                attribute = set_AFAttribute(attrib=attrib, database = db,afelement=afelement)
                #Append the AFAttribute to the list of Data Assets to be loaded into Purview
                purview_load_entities.append(attribute.toJson())
        else:
        #Only one AFAttribute
            attrib = element['AFAttribute']
            attribute = set_AFAttribute(attrib=attrib, database = db,afelement=afelement)
            #Append the AFAttribute to the list of Data Assets to be loaded into Purview
            purview_load_entities.append(attribute.toJson())


def set_AFAttribute(attrib=None, database=None,afelement=None):
    return AFAttribute(
                    guid=gt.get_guid(), name=get_element('Name',attrib), 
                    description=get_element('Description',attrib),isHidden=get_element('IsHidden',attrib), 
                    isManualDataEntry=get_element('IsManualDataEntry',attrib),isExcluded=get_element('IsExcluded',attrib), 
                    isConfigurationItem=get_element('IsConfigurationItem',attrib), trait=get_element('Trait',attrib), 
                    defaultUOM=get_element('DefaultUOM',attrib), displayDigits=get_element('DisplayDigits',attrib),
                    _type=get_element('Type',attrib), typeQualifier=get_element('TypeQualifier',attrib), 
                    dataReference=get_element('DataReference',attrib), configString=get_element('ConfigString',attrib),
                    database = database,afelement=afelement)

def set_AFAnalysis(analisys, database, afelement):
    return AFAnalysis(
                guid=gt.get_guid(), name=analisys['Name'], 
                    description= get_element('Description',analisys),
                    template= get_element('Template',analisys), caseTemplate=get_element('CaseTemplate',analisys),
                    outputTime=get_element('OutputTime',analisys), status=get_element('Status',analisys), 
                    publishResults=get_element('PublishResults',analisys), priority=get_element('Priority',analisys), maxQueueSize=get_element('MaxQueueSize',analisys),
                    groupID=get_element('GroupID',analisys), target=get_element('Target',analisys),
                    database = database,afelement=afelement)


## Function (load_tag_db_DataAssets) that generates the full set of Data Assets to be loaded into Purview
Validates the top-level AF nodes and generates the JSON objects
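
The function expects each row to carry an `AF` node with an `AFDatabase` inside it, and reads the default PI server from the database's `AFExtendedProperty` entries. A minimal sketch of that lookup follows; the sample row, the helper `extended_properties`, and the handling of a single property as one object are assumptions for illustration:

```python
import json

# Hypothetical input row; real exports carry many more fields
row = json.dumps({
    "AF": {
        "AFDatabase": {
            "Name": "MyDatabase",
            "Description": "Demo database",
            "AFExtendedProperty": [
                {"Name": "DefaultPIServer", "Value": "MyPIServer"},
                {"Name": "DefaultPIServerID", "Value": "1234"},
            ],
        }
    }
})

def extended_properties(af_database):
    """Return the AFExtendedProperty entries as a name -> value dictionary."""
    props = af_database.get("AFExtendedProperty", [])
    if isinstance(props, dict):  # assumption: a single property may be one object
        props = [props]
    return {p["Name"]: p.get("Value", "") for p in props if "Name" in p}

af_database = json.loads(row)["AF"]["AFDatabase"]
props = extended_properties(af_database)
qualified_name = "osipi://%s/%s" % (props.get("DefaultPIServer", ""), af_database["Name"])
print(qualified_name)  # prints osipi://MyPIServer/MyDatabase
```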

In [None]:
def load_tag_db_DataAssets(j):
    
    for i in j:
        json_obj = json.loads(i)
        #print(json_obj)
        if 'AF' in json_obj:
            #print('Found AF')
            if 'AFDatabase' in json_obj['AF']:
                if 'Name' in json_obj['AF']["AFDatabase"]:
                    #print('Found AFDatabase')
                    #Description is optional; a missing one becomes an empty string
                    DefaultPIServer=''
                    DefaultPIServerID=''
                    if 'AFExtendedProperty' in json_obj['AF']["AFDatabase"]:
                        extended = json_obj['AF']["AFDatabase"]['AFExtendedProperty']
                        #a single property may arrive as one object instead of a list
                        if type(extended) is not list:
                            extended = [extended]
                        for l in extended:
                            if 'Name' in l:
                                if l['Name']=='DefaultPIServer':
                                    DefaultPIServer= l["Value"]
                                if l['Name']=='DefaultPIServerID':
                                    DefaultPIServerID = l["Value"]
                    db = AFDatabase(guid=gt.get_guid(),
                        name=json_obj['AF']["AFDatabase"]['Name'],
                        description=get_element('Description',json_obj['AF']["AFDatabase"]),
                        defaultpiserver=DefaultPIServer,
                        defaultpiserverid=DefaultPIServerID
                    )
                    #print(db.toJson())   
                    purview_load_entities.append(db.toJson())
                    if 'AFElement' in json_obj['AF']["AFDatabase"]:
                        #print('Found AFElement')
                        root_elements = json_obj['AF']["AFDatabase"]['AFElement']
                        #several top-level AFElements arrive as a list
                        if type(root_elements) is not list:
                            root_elements = [root_elements]
                        for root_element in root_elements:
                            get_AFElement(db,None,root_element)
                    #print(purview_load_entities)
                    now = datetime.now() # current date and time
                    timestamp = now.strftime("%y%m%d%H%M%S%f")
                    json_value = json.dumps(purview_load_entities)
                    mssparkutils.fs.put('%s/osipi-%s.json' % (adls_out_home,timestamp),json_value, True)
                    return True
    return False

## Loop over all the files in the ADLS_HOME folder to generate all JSON objects to load into Purview
Loops over all files, loads them one by one, and moves each to the processed folder after it is processed correctly
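
The process-then-archive pattern can be tried locally before running against ADLS. The sketch below uses the local filesystem as a stand-in for `mssparkutils.fs` and makes a single pass instead of the notebook's `while` loop; `process_folder` and the demo handler are hypothetical names introduced for this example.

```python
import shutil
import tempfile
from pathlib import Path

def process_folder(in_dir: Path, done_dir: Path, handler) -> tuple:
    """Run handler on each non-empty file and archive the ones it accepts.

    Returns (processed, failed) counts. Failed files stay in in_dir.
    """
    processed = failed = 0
    for path in sorted(in_dir.iterdir()):
        if not path.is_file() or path.stat().st_size == 0:
            continue
        try:
            ok = handler(path)
        except Exception:
            ok = False
        if ok:
            # Replace any earlier archived copy, then move the file
            target = done_dir / path.name
            if target.exists():
                target.unlink()
            shutil.move(str(path), str(target))
            processed += 1
        else:
            failed += 1
    return processed, failed

# Demo with temporary folders and a handler that accepts only *.json files
root = Path(tempfile.mkdtemp())
inbox, archive = root / "in", root / "done"
inbox.mkdir()
archive.mkdir()
(inbox / "good.json").write_text('{"AF": {}}')
(inbox / "bad.txt").write_text("not json")

result = process_folder(inbox, archive, lambda p: p.suffix == ".json")
print(result)  # prints (1, 1)
```

Counting failures separately from successes is what lets the caller stop once a pass makes no progress.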

In [None]:
import traceback
try:
    havefiles = True
    inicialnumfiles = 0
    while havefiles:
        havefiles = False
        files = mssparkutils.fs.ls(adls_home)
        numoffiles = len(files)
        processedfiles = 0
        failfiles=0
        for file in files:
            purview_load_entities=[]
            if file.size > 0:
                havefiles = True
                #destination of the file in the processed (archive) folder
                filepath='%s/%s' % (adls_processed,file.name)
                load_json = False
                readComplexJSONDF=None
                try:
                    print(file.path)
                    readComplexJSONDF = spark.read.option("multiLine","true").json(file.path)
                    load_json=True
                    print('Finished Loading json')
                except Exception as e:
                    #count the failure so that an unreadable file cannot keep the loop running forever
                    failfiles+=1
                    log_msgs('ERROR','Invalid Json: %s \n %s' % (file.path,e))

                if load_json:
                    print('Start Loading Json')
                    j = readComplexJSONDF.toJSON().collect()
                    log_msgs('INFO','Loading File: %s' % file.path)
                    if load_tag_db_DataAssets(j):
                        print('Finished Loading File')
                        try:
                            deletfile = mssparkutils.fs.rm(filepath)
                            
                        except:
                            log_msgs('INFO','No file to delete')
                        movefile = mssparkutils.fs.mv(src=file.path,dest=filepath)
                        processedfiles+=1
                    else:
                        failfiles+=1
        if failfiles > 0  and processedfiles == 0:
            #only failing files remain, so another pass would make no progress
            print('Exit: %d file(s) could not be processed' % failfiles)
            break
except  Exception as e:
    #log the full traceback as a single string
    log_msgs('ERROR',traceback.format_exc())

## Next Steps

After this notebook completes:

1. **Output Location**: Generated Purview entity JSON files are in `tag-db-purview-json/` folder
2. **Load to Purview**: Use the main `Purview_Load_Entity` notebook to load these files into Purview
   - Update its configuration to read from `tag-db-purview-json/` folder
   - Files will contain the hierarchical TAG DB structure with all relationships

**Folder Structure:**
- `tag-db-json/` → Input (TAG DB export files in JSON, as set by `blob_relative_path`)
- `tag-db-purview-json/` → Output (Purview-ready JSON)
- `tag-db-processed/` → Archive (processed input files)
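
Before loading, a generated file can be sanity-checked by counting its entities per type. This is a minimal sketch that assumes the file content is the JSON array written by this notebook; `count_by_type` and the sample content are illustrative only.

```python
import json
from collections import Counter

def count_by_type(entities):
    """Count entities per Purview typeName."""
    return Counter(entity.get("typeName", "<missing>") for entity in entities)

# Hypothetical content of one generated file (a JSON array of entities)
sample_output = json.dumps([
    {"typeName": "afdatabase", "guid": "-1001", "attributes": {"name": "MyDatabase"}},
    {"typeName": "afelement", "guid": "-1002", "attributes": {"name": "Pump01"}},
    {"typeName": "afelement", "guid": "-1003", "attributes": {"name": "Pump02"}},
])

counts = count_by_type(json.loads(sample_output))
print(dict(counts))  # prints {'afdatabase': 1, 'afelement': 2}
```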