# Working with BatchXSLT

This is a tutorial for the Python wrapper batchxslt. The package is designed to transform language resources from the dgd metafile XML format to the CLARIN metadata format using XSL stylesheets and Saxon. I wrote this wrapper to make corpus transformation less painful and more extensible. Although processing with lxml is easy, I decided to leave the transformation to the XSLT processor and its management to batchxslt, so that users who are not familiar with lxml can stick to XSLT and rely on this tutorial for their needs.


## 1. Transform metadata from dgd to cmdi format

* define resource locators (corpus, event, speakers directories and saxon and xsl directories)
* define an output directory 

In [1]:
# using absolute paths 
corpus_dir = "/home/kuhn/Data/IDS/svn/dgd2_data/metadata/corpora/extern"
event_dir = "/home/kuhn/Data/IDS/svn/dgd2_data/metadata/events/extern"
speakers_dir = "/home/kuhn/Data/IDS/svn/dgd2_data/metadata/speakers/extern"
xsl_dir = "/home/kuhn/Data/IDS/svn/dgd2_data/dgd2cmdi/xslt/"
corpus_xsl = xsl_dir + "dgdCorpus2cmdi.xsl"
event_xsl = xsl_dir + "dgdEvent2cmdi.xsl"
speaker_xsl = xsl_dir + "dgdSpeaker2cmdi.xsl"
saxon_jar = "/home/kuhn/Data/IDS/svn/dgd2_data/dgd2cmdi/batchxslt/saxon/saxon9he.jar"
out_corp = "/tmp/cmdi/corpus"
out_event = "/tmp/cmdi/event"
out_speaker = "/tmp/cmdi/speakers"
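
Some of the runs later in this section report `outputdir not readable` for output directories that do not exist yet. A minimal sketch for creating them up front follows. It assumes Python 3, and `ensure_dirs` is a hypothetical helper for illustration, not part of batchxslt.

```python
import os


def ensure_dirs(paths):
    """Create each directory (including missing parents) if it does not exist.

    Hypothetical helper, not part of batchxslt.
    Returns the paths that exist as directories afterwards.
    """
    for path in paths:
        # exist_ok=True makes the call a no-op for directories already present
        os.makedirs(path, exist_ok=True)
    return [path for path in paths if os.path.isdir(path)]
```

Calling `ensure_dirs([out_corp, out_event, out_speaker])` before starting the processor should leave all three output directories in place.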

In [2]:
# import the xsl wrapper
from batchxslt import processor

In [3]:
xsl_processor = processor.XSLBatchProcessor(saxon_jar)

In [4]:
# show me the api doc of the start method
xsl_processor.start?

In [4]:
%time xsl_processor.start(corpus_xsl, corpus_dir, "cmdi_", out_corp)

stylesheet: /home/kuhn/Data/IDS/svn/dgd2_data/dgd2cmdi/xslt/dgdCorpus2cmdi.xsl
outputdir: /tmp/cmdi/corpus
xmldata: /home/kuhn/Data/IDS/svn/dgd2_data/metadata/corpora/extern
CPU times: user 4.27 ms, sys: 13.6 ms, total: 17.9 ms
Wall time: 18.1 s


In [None]:
%time xsl_processor.start(event_xsl, event_dir, "cmdi_", out_event)

stylesheet: /home/kuhn/Data/IDS/svn/dgd2_data/dgd2cmdi/xslt/dgdEvent2cmdi.xsl
outputdir not readable: /tmp/cmdi/event
xmldata: /home/kuhn/Data/IDS/svn/dgd2_data/metadata/events/extern

In [4]:
%time xsl_processor.start(speaker_xsl, speakers_dir, "cmdi_", out_speaker)

stylesheet: /home/kuhn/Data/IDS/svn/dgd2_data/dgd2cmdi/xslt/dgdSpeaker2cmdi.xsl
outputdir not readable: /tmp/cmdi/speakers
xmldata: /home/kuhn/Data/IDS/svn/dgd2_data/metadata/speakers/extern
cannot create directory /tmp/cmdi/speakers/ISW
Maybe it already exists...
CPU times: user 933 ms, sys: 8.78 s, total: 9.71 s
Wall time: 18h 23min 46s
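
For readers who want a rough idea of what such a batch run involves, here is a minimal sketch that calls the Saxon command line once per XML file. It is not the batchxslt implementation: `build_saxon_command` and `transform_directory` are hypothetical names, the output name (prefix plus original file name) is an assumption based on the file names shown in this tutorial, and only a flat directory is handled, whereas the speaker run above also creates per-corpus subdirectories.

```python
import os
import subprocess


def build_saxon_command(saxon_jar, stylesheet, source_xml, output_xml):
    """Return the argument list for one Saxon command-line transformation."""
    return [
        "java", "-jar", saxon_jar,
        "-s:" + source_xml,    # source document
        "-xsl:" + stylesheet,  # stylesheet to apply
        "-o:" + output_xml,    # result file
    ]


def transform_directory(saxon_jar, stylesheet, xml_dir, prefix, out_dir):
    """Transform every .xml file in xml_dir (flat, no subdirectories).

    Assumption: each result is written to out_dir as prefix + original name.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(xml_dir)):
        if not name.endswith(".xml"):
            continue
        target = os.path.join(out_dir, prefix + name)
        command = build_saxon_command(
            saxon_jar, stylesheet, os.path.join(xml_dir, name), target
        )
        # check=True raises CalledProcessError if Saxon exits with an error
        subprocess.run(command, check=True)
        written.append(target)
    return written
```

Starting one Java process per file is simple but slow for large collections, which may be one reason long wall times show up on big directories.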


## 2. Defining Resource Proxies

Once the original metadata files have been transformed to cmdi, we can build a resource tree from them. For this purpose, we use the module **cmdiresource**.

In [2]:
from batchxslt import cmdiresource

We define the paths to our recently transformed data:

In [3]:
corpus = "/home/kuhn/Data/IDS/svn/dgd2_data/dgd2cmdi/cmdiOutBeta2/corpus/"
event = "/home/kuhn/Data/IDS/svn/dgd2_data/dgd2cmdi/cmdiOutBeta2/event/"
speakers = "/home/kuhn/Data/IDS/svn/dgd2_data/dgd2cmdi/cmdiOutBeta2/speakers/"

Now define a ResourceTreeCollection instance.

In [4]:
resourcetree = cmdiresource.ResourceTreeCollection(corpus, event, speakers)

['cmdi_DR--_extern.xml', 'cmdi_BR--_extern.xml', 'cmdi_SA--_extern.xml', 'cmdi_MV--_extern.xml', 'cmdi_JK--_extern.xml', 'cmdi_SR--_extern.xml', 'cmdi_HL--_extern.xml', 'cmdi_EK--_extern.xml', 'cmdi_BW--_extern.xml', 'cmdi_PF--_extern.xml', 'cmdi_IS--_extern.xml', 'cmdi_ISZ-_extern.xml', 'cmdi_ZW--_extern.xml', 'cmdi_FR--_extern.xml', 'cmdi_OS--_extern.xml', 'cmdi_DS--_extern.xml', 'cmdi_ISW-_extern.xml', 'cmdi_KN--_extern.xml', 'cmdi_AD--_extern.xml', 'cmdi_FOLK_extern.xml', 'cmdi_BB--_extern.xml', 'cmdi_SW--_extern.xml', 'cmdi_SV--_extern.xml']
['ISW', 'BW', 'BR', 'FOLK', 'MV', 'HL', 'KN', 'DS', 'JK', 'OS', 'SW', 'AD', 'FR', 'PF', 'ZW', 'EK', 'IS', 'SR', 'DR', 'SA', 'BB', 'SV', 'ISZ']
['ISW', 'BW', 'BR', 'FOLK', 'MV', 'HL', 'KN', 'DS', 'JK', 'OS', 'SW', 'AD', 'FR', 'PF', 'ZW', 'IS', 'SR', 'DR', 'SA', 'SV', 'ISZ']
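
The second and third lists printed above differ in length. Assuming they are the corpus labels found under the event and speakers directories (the output itself does not say so), a quick set difference shows which corpora have no speaker data. The variable names are chosen here for illustration only.

```python
# Labels copied from the second and third lists printed above.
event_corpora = ['ISW', 'BW', 'BR', 'FOLK', 'MV', 'HL', 'KN', 'DS', 'JK', 'OS',
                 'SW', 'AD', 'FR', 'PF', 'ZW', 'EK', 'IS', 'SR', 'DR', 'SA',
                 'BB', 'SV', 'ISZ']
speaker_corpora = ['ISW', 'BW', 'BR', 'FOLK', 'MV', 'HL', 'KN', 'DS', 'JK', 'OS',
                   'SW', 'AD', 'FR', 'PF', 'ZW', 'IS', 'SR', 'DR', 'SA', 'SV',
                   'ISZ']

# Labels present in the event list but absent from the speaker list
missing_speakers = sorted(set(event_corpora) - set(speaker_corpora))
print(missing_speakers)  # ['BB', 'EK']
```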


ResourceTreeCollection inherits from networkx.DiGraph and builds up a resource tree for all resources of the dgd2.
Let's look at a random resource node.

In [29]:
resourcetree.node.get('cmdi_FOLK_S_00022_extern') # look up the attributes of a resource node

{'corpusroot': False,
 'etreeobject': <lxml.etree._ElementTree at 0x7f8afb596c20>,
 'filename': 'cmdi_FOLK_S_00022_extern.xml',
 'repopath': None,
 'type': 'metadata'}

In [24]:
resourcetree.find_eventsessions('cmdi_FOLK_S_00022_extern') # find all sessions a speaker takes part in 

['FOLK_E_00001_SE_01',
 'FOLK_E_00004_SE_01',
 'FOLK_E_00005_SE_01',
 'FOLK_E_00006_SE_01',
 'FOLK_E_00007_SE_01',
 'FOLK_E_00008_SE_01',
 'FOLK_E_00009_SE_01']

In [25]:
resourcetree.find_events('cmdi_FOLK_S_00022_extern') # find all events a speaker takes part in

['FOLK_E_00009',
 'FOLK_E_00008',
 'FOLK_E_00001',
 'FOLK_E_00007',
 'FOLK_E_00006',
 'FOLK_E_00005',
 'FOLK_E_00004']

['SW--_E_00006']
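
In the two listings for speaker `cmdi_FOLK_S_00022_extern`, each event identifier looks like a session identifier without its trailing `_SE_<number>` part. The sketch below checks that relation on the values shown. This naming rule is inferred from the outputs only, not from the cmdiresource code, and `event_of` is a hypothetical helper.

```python
import re

# Session identifiers as returned by find_eventsessions above.
sessions = [
    'FOLK_E_00001_SE_01', 'FOLK_E_00004_SE_01', 'FOLK_E_00005_SE_01',
    'FOLK_E_00006_SE_01', 'FOLK_E_00007_SE_01', 'FOLK_E_00008_SE_01',
    'FOLK_E_00009_SE_01',
]


def event_of(session_id):
    """Strip a trailing _SE_<digits> suffix (assumed naming convention)."""
    return re.sub(r"_SE_\d+$", "", session_id)


# Unique event identifiers, sorted for a stable order
events = sorted({event_of(s) for s in sessions})
print(events)
```

The result contains the same seven identifiers that `find_events` returned, only in sorted order.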