# Working with BatchXSLT

This tutorial covers batchxslt, a Python wrapper that transforms language resources from the DGD metafile XML format to the CLARIN metadata format (CMDI) using XSL stylesheets and Saxon. I wrote the wrapper to make corpus transformation less painful and more extensible. Although processing with lxml is easy, I decided to leave the transformation itself to the XSLT processor and its management to batchxslt, so that users who are not familiar with lxml can stick to XSLT and rely on this tutorial for their needs.


## 1. Transform metadata from dgd to cmdi format

* define resource locators (corpus, event, speakers directories and saxon and xsl directories)
* define an output directory 

In [3]:
# using absolute paths 
corpus_dir = "/home/kuhn/Data/IDS/svn_rev1233/dgd2_data/metadata/corpora/extern"
event_dir = "/home/kuhn/Data/IDS/svn_rev1233/dgd2_data/metadata/events/extern"
speakers_dir = "/home/kuhn/Data/IDS/svn_rev1233/dgd2_data/metadata/speakers/extern"
xsl_dir = "/home/kuhn/Data/IDS/svn/dgd2_data/dgd2cmdi/xslt/"
corpus_xsl = xsl_dir + "dgdCorpus2cmdi.xsl"
event_xsl = xsl_dir + "dgdEvent2cmdi.xsl"
speaker_xsl = xsl_dir + "dgdSpeaker2cmdi.xsl"
saxon_jar = "/home/kuhn/Data/IDS/svn/dgd2_data/dgd2cmdi/batchxslt/saxon/saxon9he.jar"
out_corp = "/home/kuhn/Data/cmdi_final/corpus"
out_event = "/home/kuhn/Data/cmdi_final/events"
out_speaker = "/home/kuhn/Data/cmdi_final/speakers"
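
Since the event and speaker runs further down take more than two hours each, it can be worth checking the configured locations before starting. The following is a minimal sketch using only the standard library; `missing_paths` is a hypothetical helper and not part of batchxslt.

```python
import os

def missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Hypothetical usage with the variables defined above:
# problems = missing_paths([corpus_dir, event_dir, speakers_dir,
#                           corpus_xsl, event_xsl, speaker_xsl, saxon_jar])
# if problems:
#     raise IOError("missing: " + ", ".join(problems))
```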

In [1]:
# import the xsl wrapper
from batchxslt import processor

In [4]:
xsl_processor = processor.XSLBatchProcessor(saxon_jar)

In [5]:
# show me the api doc of the start method
xsl_processor.start?

In [5]:
%time xsl_processor.start(corpus_xsl, corpus_dir, "cmdi_", out_corp)

stylesheet: /home/kuhn/Data/IDS/svn/dgd2_data/dgd2cmdi/xslt/dgdCorpus2cmdi.xsl
outputdir: /home/kuhn/Data/cmdi_final/corpus
xmldata: /home/kuhn/Data/IDS/svn_rev1233/dgd2_data/metadata/corpora/extern
CPU times: user 5.26 ms, sys: 10.5 ms, total: 15.8 ms
Wall time: 15.3 s
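
For orientation, a single transformation with Saxon-HE on the command line takes a source file (`-s:`), a stylesheet (`-xsl:`) and an output file (`-o:`). The sketch below only assembles such a command. That batchxslt invokes Saxon in this way is an assumption, and `build_saxon_command` as well as the file names are hypothetical.

```python
def build_saxon_command(saxon_jar, stylesheet, source, output):
    """Assemble the argument list for one Saxon-HE transformation."""
    return [
        "java", "-jar", saxon_jar,
        "-s:" + source,        # XML document to transform
        "-xsl:" + stylesheet,  # XSL stylesheet to apply
        "-o:" + output,        # file the result is written to
    ]

# hypothetical file names, for illustration only
cmd = build_saxon_command("saxon9he.jar", "dgdCorpus2cmdi.xsl",
                          "corpus.xml", "cmdi_corpus.xml")
print(" ".join(cmd))
```

A list built this way could be handed to `subprocess.check_call` to run the transformation.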


In [6]:
%time xsl_processor.start(event_xsl, event_dir, "cmdi_", out_event)

stylesheet: /home/kuhn/Data/IDS/svn/dgd2_data/dgd2cmdi/xslt/dgdEvent2cmdi.xsl
outputdir: /home/kuhn/Data/cmdi_final/events
xmldata: /home/kuhn/Data/IDS/svn_rev1233/dgd2_data/metadata/events/extern
CPU times: user 613 ms, sys: 5.7 s, total: 6.31 s
Wall time: 2h 21min 2s


In [7]:
%time xsl_processor.start(speaker_xsl, speakers_dir, "cmdi_", out_speaker)

stylesheet: /home/kuhn/Data/IDS/svn/dgd2_data/dgd2cmdi/xslt/dgdSpeaker2cmdi.xsl
outputdir: /home/kuhn/Data/cmdi_final/speakers
xmldata: /home/kuhn/Data/IDS/svn_rev1233/dgd2_data/metadata/speakers/extern
CPU times: user 654 ms, sys: 7.1 s, total: 7.75 s
Wall time: 2h 20min 37s
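
In the two long runs above, the wall time exceeds two hours while the CPU time stays under ten seconds. `%time` reports CPU time for the Python process itself, so one plausible reading (an assumption, not verified against the batchxslt source) is that nearly all of the time is spent waiting on external processes. The sketch below reproduces the effect with a child process that merely sleeps; `measure` and `run_child` are hypothetical names.

```python
import subprocess
import sys
import time

def measure(task):
    """Run task() and return (wall_seconds, cpu_seconds) for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu

def run_child():
    # stand-in for an external transformation: a child process that just waits
    subprocess.check_call([sys.executable, "-c", "import time; time.sleep(0.5)"])

wall, cpu = measure(run_child)
print("wall: %.2f s, cpu: %.2f s" % (wall, cpu))
```

The CPU figure covers only the calling process, so time spent inside the child shows up in the wall time alone.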


## 2. Defining Resource Proxies

Once the original metadata files have been transformed to CMDI, we can build a resource tree from them. For this purpose, we use the module **cmdiresource**.

In [24]:
from batchxslt import cmdiresource
from batchxslt import cmdiheader
import os
import logging

We define the paths to our recently transformed data:

In [22]:
corpus = "/home/kuhn/Data/cmdi_final/corpus/"
event = "/home/kuhn/Data/cmdi_final/events/"
speakers = "/home/kuhn/Data/cmdi_final/speakers/"
transcripts = "/home/kuhn/Data/IDS/svn_rev1233/dgd2_data/transcripts/"

cmdi_final = '/tmp/cmdi/all'

Now define a ResourceTreeCollection instance.

In [8]:
%time resourcetree = cmdiresource.ResourceTreeCollection(corpus, event, speakers, transcripts)

CPU times: user 2.28 s, sys: 535 ms, total: 2.82 s
Wall time: 2.84 s


In [9]:
resourcetree.size()

45328

In [10]:
# define ids for all node objects
counter = 0
for node in resourcetree.nodes_iter():
    corpuslabel = node.split('_')[0].rstrip('-')
    resourcetree.node.get(node).update({'id': corpuslabel + '_' + str(counter)})
    counter += 1
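
To make the id scheme in the loop above concrete, here is a minimal standalone sketch of the same string handling. The node names are hypothetical examples chosen to show the effect of `rstrip('-')`; the actual names in the resource tree may look different.

```python
def assign_ids(node_names):
    """Map each node name to '<corpus label>_<running number>'."""
    ids = {}
    for counter, name in enumerate(node_names):
        # the label is the part before the first underscore,
        # with any trailing padding hyphens removed
        label = name.split('_')[0].rstrip('-')
        ids[name] = label + '_' + str(counter)
    return ids

# hypothetical node names, for illustration only
ids = assign_ids(["FOLK_E_00001", "PF--_E_00002", "PF--"])
```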

ResourceTreeCollection inherits from networkx.DiGraph and builds up a resource tree for all resources of the dgd2.
Let's look at a random resource node.

In [11]:
resourcetree.build_resourceproxy()

In [12]:
for nodename in resourcetree.nodes_iter():
    # events and corpora are handled identically
    if resourcetree.node.get(nodename).get('type') in ('event', 'corpus'):
        resourcetree.define_parts(nodename)

In [13]:
for nodename in resourcetree.nodes_iter():
    if resourcetree.node.get(nodename).get('type') == 'speaker':
        resourcetree.speaker2event(nodename)
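
The loops above select nodes by their `type` attribute. As a rough illustration of that pattern, the following sketch uses plain dictionaries as a stand-in for the graph. It does not use networkx or cmdiresource, and all node names and the helpers `nodes_of_type` and `children_of_type` are hypothetical.

```python
# node name -> attribute dictionary, similar in spirit to graph node attributes
nodes = {
    "FOLK": {"type": "corpus"},
    "FOLK_E_00001": {"type": "event"},
    "FOLK_S_00001": {"type": "speaker"},
}
# directed edges: parent -> list of children
edges = {"FOLK": ["FOLK_E_00001", "FOLK_S_00001"]}

def nodes_of_type(wanted):
    """Return the sorted names of all nodes whose 'type' is in `wanted`."""
    return sorted(name for name, attrs in nodes.items()
                  if attrs.get("type") in wanted)

def children_of_type(parent, wanted):
    """Return the direct children of `parent` that have the given type."""
    return [child for child in edges.get(parent, [])
            if nodes[child].get("type") == wanted]

print(nodes_of_type(("event", "corpus")))
print(children_of_type("FOLK", "event"))
```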

## 3. Write all corpora to separate cmdi folders


In [28]:
import os
# get a list of all labels of corpora (e.g. list them in the event directory)
for corpusname in os.listdir(event):
    print 'processing: ' + corpusname
    try:
        os.mkdir(os.path.join(cmdi_final, corpusname))
    except IOError:
        logging.error('cannot create directory ' + corpusname)
        
    cmdiheader.define_header(corpusname, resourcetree)
    resourcetree.write_cmdi(corpusname, os.path.join(cmdi_final, corpusname, corpusname + '.cmdi'))
    # iterate over all successors of a corpus
    for nodename in resourcetree.successors_iter(corpusname):
        if resourcetree.node.get(nodename).get('type') == 'event':
            cmdiheader.define_header(nodename, resourcetree)
            resourcetree.write_cmdi(nodename, os.path.join(cmdi_final, corpusname, nodename + '.cmdi'))

processing: BR
processing: EK
processing: SA
processing: KN
processing: BW
processing: ISZ
processing: FR
processing: IS
processing: PF
processing: SR
processing: AD
processing: FOLK
processing: SV
processing: SW
processing: MV
processing: HL
processing: OS
processing: ISW
processing: ZW
processing: BB


OSError: [Errno 17] File exists: '/tmp/cmdi/all/BB'
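
The traceback above shows `os.mkdir` raising `OSError` because the target directory already exists. An `except IOError` clause does not catch that under Python 2, where the two are distinct exception types. One way to make the step repeatable is to tolerate an existing directory and re-raise everything else. This is a minimal sketch; `ensure_dir` is a hypothetical helper, not part of batchxslt.

```python
import errno
import os

def ensure_dir(path):
    """Create `path` if needed; an already existing directory is not an error."""
    try:
        os.mkdir(path)
    except OSError as err:
        # only swallow "already exists" for directories; re-raise anything else
        if err.errno != errno.EEXIST or not os.path.isdir(path):
            raise
```

In the loop above, `ensure_dir(os.path.join(cmdi_final, corpusname))` could take the place of the `try`/`except` around `os.mkdir`.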