"""
# XML import
You can convert any XML source into TF by specifying a few details about the source.
TF then invokes the `tf.convert.walker` machinery to produce a TF
dataset out of the source.
The converter goes the extra mile: it generates a TF app, so that
the TF browser is instantly usable.
!!! caution "As an example"
This is more intended as an example of how to tackle the conversion
of XML to TF than as a production engine.
Most XML corpora use elements for special things, and a good conversion
to TF should deal with the intention behind the elements.
See `tf.convert.tei` for a production converter of TEI XML to TF.
## White-space
This converter does not read schemas and has no extra knowledge about the elements.
Because of the lack of schema information we do not know exactly which white-space
is significant. The only thing we do to white-space is to condense each stretch
of white-space to a single space.
Whether some of these spaces around tags must be ignored is a matter of further
customization.
# Configuration
We assume that you have a `programs` directory at the top-level of your repo.
In this directory we'll look for two optional files, with which you can assemble
all the input parameters needed to get the conversion off the ground:
* a file `xml.yaml` in which you specify a bunch of values;
* a file `xml.py` in which you define a function `transform(text)` which
takes a text string argument and delivers a text string as result.
The converter will call this on every XML input file it reads *before*
feeding it to the XML parser.
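For instance, a minimal `xml.py` could look like this (the clean-up steps shown
here are made-up examples; your corpus may need different ones):

``` python
import re

def transform(text):
    # Hypothetical clean-up: normalize Windows line endings and
    # drop a DOCTYPE declaration before the XML parser sees the text.
    text = text.replace("\r\n", "\n")
    text = re.sub(r"<!DOCTYPE[^>]*>", "", text)
    return text
```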
## Keys and values of the `xml.yaml` file
### `generic`
dict, optional `{}`
Metadata for all generated TF features.
The actual source version of the XML files does not have to be stated here,
it will be inserted based on the version that the converter will actually use.
That version depends on the `xml` argument passed to the program.
The key under which the source version will be inserted is `xmlVersion`.
### `intFeatures`
list, optional `[]`
The features (nodes and edges) that are integer-valued.
### `featureDescriptions`
dict, optional `{}`
Short descriptions for the features.
Will be included in the metadata of the feature files, after `@description`.
### `procins`
boolean, optional `False`
If True, processing instructions will be treated.
Processing instruction `<?foo bar="xxx"?>` will be converted as if it were an empty
element named `foo` with attribute `bar` with value `xxx`.
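By way of illustration, a minimal `xml.yaml` could look like this
(all values are made up):

``` yaml
generic:
  institute: My Institute
  language: grc
intFeatures:
  - n
featureDescriptions:
  n: sequence number of an object
procins: false
```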
# Usage
## Command-line
``` sh
tf-fromxml tasks flags
```
## From Python
``` python
from tf.convert.xml import XML
X = XML()
X.task(**tasks, **flags)
```
For a short overview of the tasks and flags, see `HELP`.
## Tasks
We have the following conversion tasks:
1. `check`: makes an inventory of all XML elements and attributes used.
1. `convert`: produces actual TF files by converting XML files.
1. `load`: loads the generated TF for the first time, by which the pre-computation
step is triggered. During pre-computation some checks are performed. Once this
has succeeded, we have a workable TF dataset.
1. `app`: creates or updates a corpus specific TF app with minimal sensible settings.
1. `browse`: starts the TF browser on the newly created dataset.
Tasks can be run by passing any choice of task keywords to the
`XML.task()` method.
## Note on versions
The XML source files come in versions, indicated with a date.
The converter picks the most recent one, unless you specify another one:
``` sh
tf-fromxml xml=-1 # previous version
tf-fromxml xml=0 # latest version
tf-fromxml xml=3 # third version
tf-fromxml xml=2019-12-23 # explicit version
```
The resulting TF data is independently versioned, like `1.2.3`.
When the converter runs, by default it overwrites the most recent version,
unless you specify another one.
It looks at the latest version and then bumps a part of the version number.
``` sh
tf-fromxml tf=3 # minor version, 1.2.3 becomes 1.2.4
tf-fromxml tf=2 # intermediate version, 1.2.3 becomes 1.3.0
tf-fromxml tf=1 # major version, 1.2.3 becomes 2.0.0
tf-fromxml tf=1.8.3 # explicit version
```
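The bumping rules can be sketched in a few lines of standalone Python
(an illustration of the behaviour described above, not the converter's own code;
`bumpVersion` is a made-up name):

``` python
def bumpVersion(latest, tf):
    # tf=1 bumps the major part, tf=2 the intermediate part, tf=3 the minor
    # part; parts to the right of the bumped one are reset to 0.
    parts = [int(p) for p in latest.split(".")]
    b = int(tf) - 1
    parts[b] += 1
    for i in range(b + 1, len(parts)):
        parts[i] = 0
    return ".".join(str(p) for p in parts)
```

For example, starting from latest version `1.2.3`: `bumpVersion("1.2.3", 3)`
gives `1.2.4` and `bumpVersion("1.2.3", 1)` gives `2.0.0`.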
## Examples
Exactly how you can call the methods of this module and add your own customised
conversion code is demonstrated in the Greek New Testament:
* [`ETCBC/nestle1904`](https://nbviewer.org/github/ETCBC/nestle1904/blob/master/programs/tfFromLowfat.ipynb).
"""
import sys
import collections
import re
from subprocess import run
from importlib import util
from .helpers import setUp, FILE
from .xmlCustom import convertTaskDefault
from ..capable import CheckImport
from ..parameters import BRANCH_DEFAULT_NEW
from ..fabric import Fabric
from ..core.helpers import console, versionSort
from ..convert.walker import CV
from ..core.timestamp import AUTO, DEEP, TERSE
from ..core.text import DEFAULT_FORMAT
from ..core.command import readArgs
from ..core.helpers import mergeDict
from ..core.files import (
fileOpen,
abspath,
expanduser as ex,
unexpanduser as ux,
getLocation,
initTree,
dirNm,
dirExists,
dirContents,
fileExists,
fileCopy,
fileRemove,
scanDir,
readYaml,
writeYaml,
)
(HELP, TASKS, TASKS_EXCLUDED, PARAMS, FLAGS) = setUp("XML")
TRIM_ATTS = set(
"""
id
key
target
value
""".strip().split()
)
class XML(CheckImport):
def __init__(
self,
convertTaskCustom=None,
trimAtts=set(),
keywordAtts=set(),
renameAtts={},
xml=PARAMS["xml"][1],
tf=PARAMS["tf"][1],
verbose=FLAGS["verbose"][1],
):
"""Converts XML to TF.
For documentation of the resulting encoding, read the
[transcription template](https://github.com/annotation/text-fabric/blob/master/tf/convert/app/transcription.md).
Below we describe how to control the conversion machinery.
We adopt a fair bit of "convention over configuration" here, in order to lessen
the burden for the user of specifying so many details.
Based on the current directory from which the script is called,
it defines all the ingredients to carry out
a `tf.convert.walker` conversion of the XML input.
This function is assumed to work in the context of a repository,
i.e. a directory on your computer relative to which the input directory exists,
and various output directories: `tf`, `app`.
Your current directory must be at
```
~/backend/org/repo/relative
```
where
* `~` is your home directory;
* `backend` is an online back-end name,
like `github`, `gitlab`, `git.huc.knaw.nl`;
* `org` is an organization, person, or group in the back-end;
* `repo` is a repository in the `org`;
* `relative` is a directory path within the repo (0 or more components).
This is only about the directory structure on your local computer;
it is not required that you have online incarnations of your repository
in that back-end.
Even your local repository does not have to be a git repository.
The only thing that matters is that the full path to your repo can be parsed
as a sequence of `home/backend/org/repo/relative` .
Relative to this directory the program expects and creates
input / output directories.
## Input directories
### `xml`
*Location of the XML sources.*
**If it does not exist, the program aborts with an error.**
Several levels of subdirectories are assumed:
1. the version of the source (this could be a date string).
1. volumes / collections of documents. The subdirectory `__ignore__` is ignored.
1. the XML documents themselves.
## Output directories
### `report`
Directory to write the results of the `check` task to: an inventory
of elements / attributes encountered.
If the directory does not exist, it will be created.
The default value is `.` (i.e. the current directory in which
the script is invoked).
### `tf`
The directory under which the TF output files (with extension `.tf`)
are placed.
If it does not exist, it will be created.
The TF files will be generated in a folder named by a version number,
passed as `tfVersion`.
### `app`
Location of additional TF app configuration files.
If they do not exist, they will be created with some sensible default
settings and generated documentation.
These settings can be overridden in the `app/config_custom.yaml` file.
Also a default `display.css` file and a logo are added.
Custom content for these files can be provided in files
with `_custom` appended to their base name.
Parameters
----------
convertTaskCustom: function, optional None
You can pass a replacement for the `convertTask` method.
If you do, it will be used instead of the default implementation.
By means of this approach you can use the generic machinery of the
XML converter as much as possible, and you only have to adapt the bits
that process the XML sources.
trimAtts: iterable of string, optional set()
You can pass the names of attributes whose values you do not have to
see spelled out in the report files generated by the check task.
keywordAtts: iterable of string, optional set()
You can pass the names of attributes whose values are a limited set of
keywords that you want to see spelled out in the report files
generated by the check task.
renameAtts: dict, optional {}
You can change attribute names systematically on the fly.
xml: string, optional ""
If empty, use the latest version under the `xml` directory with sources.
Otherwise it should be a valid integer, and it is the index in the
sorted list of versions there.
* `0` or `latest`: latest version;
* `-1`, `-2`, ... : previous version, version before previous, ...;
* `1`, `2`, ...: first version, second version, ....
* everything else that is not a number is taken as the exact version name
tf: string, optional ""
If empty, the TF version used will be the latest one under the `tf`
directory.
If it can be parsed as the integers 1, 2, or 3 it will bump the latest
relevant TF version:
* `0` or `latest`: overwrite the latest version
* `1` will bump the major version
* `2` will bump the intermediate version
* `3` will bump the minor version
* everything else is taken as the exact version name
verbose: integer, optional -1
Produce no (-1), some (0) or many (1) progress and reporting messages
"""
super().__init__("lxml")
if self.importOK(hint=True):
self.etree = self.importGet()
else:
return
self.good = True
self.convertTaskCustom = convertTaskCustom
self.trimAtts = set(trimAtts)
self.keywordAtts = set(keywordAtts)
self.renameAtts = renameAtts
(backend, org, repo, relative) = getLocation()
if any(s is None for s in (backend, org, repo, relative)):
console(
"Not working in a repo: "
f"backend={backend} org={org} repo={repo} relative={relative}"
)
self.good = False
return
if verbose == 1:
console(
f"Working in repository {org}/{repo}{relative} in back-end {backend}"
)
base = ex(f"~/{backend}")
repoDir = f"{base}/{org}/{repo}"
refDir = f"{repoDir}{relative}"
programDir = f"{refDir}/programs"
convertSpec = f"{programDir}/xml.yaml"
convertCustom = f"{programDir}/xml.py"
settings = readYaml(asFile=convertSpec, plain=True)
self.transform = None
if fileExists(convertCustom):
try:
spec = util.spec_from_file_location("xmlcustom", convertCustom)
code = util.module_from_spec(spec)
sys.path.insert(0, dirNm(convertCustom))
spec.loader.exec_module(code)
sys.path.pop(0)
self.transform = code.transform
except Exception as e:
console(str(e))
self.transform = None
generic = settings.get("generic", {})
self.generic = generic
intFeatures = settings.get("intFeatures", {})
self.intFeatures = intFeatures
featureDescriptions = settings.get("featureDescriptions", {})
self.featureMeta = {
k: dict(description=v) for (k, v) in featureDescriptions.items()
}
procins = settings.get("procins", False)
self.procins = procins
reportDir = f"{refDir}/report"
appDir = f"{refDir}/app"
xmlDir = f"{refDir}/xml"
tfDir = f"{refDir}/tf"
xmlVersions = sorted(dirContents(xmlDir)[1], key=versionSort)
nXmlVersions = len(xmlVersions)
if xml in {"latest", "", "0", 0} or str(xml).lstrip("-").isdecimal():
xmlIndex = (0 if xml == "latest" else int(xml)) - 1
try:
xmlVersion = xmlVersions[xmlIndex]
except Exception:
absIndex = xmlIndex + (nXmlVersions if xmlIndex < 0 else 0) + 1
console(
(
f"no item {absIndex} in {nXmlVersions} source versions "
f"in {ux(xmlDir)}"
)
if len(xmlVersions)
else f"no source versions in {ux(xmlDir)}",
error=True,
)
self.good = False
return
else:
xmlVersion = xml
xmlPath = f"{xmlDir}/{xmlVersion}"
reportPath = f"{reportDir}/{xmlVersion}"
if not dirExists(xmlPath):
console(
f"source version {xmlVersion} does not exist in {ux(xmlDir)}",
error=True,
)
self.good = False
return
xmlStatuses = {tv: i for (i, tv) in enumerate(reversed(xmlVersions))}
xmlStatus = xmlStatuses[xmlVersion]
xmlStatusRep = (
"most recent"
if xmlStatus == 0
else "previous"
if xmlStatus == 1
else f"{xmlStatus - 1} before previous"
)
if xmlStatus == len(xmlVersions) - 1 and len(xmlVersions) > 1:
xmlStatusRep = "oldest"
if verbose >= 0:
console(f"XML data version is {xmlVersion} ({xmlStatusRep})")
tfVersions = sorted(dirContents(tfDir)[1], key=versionSort)
latestTfVersion = tfVersions[-1] if len(tfVersions) else "0.0.0"
if tf in {"latest", "", "0", 0}:
tfVersion = latestTfVersion
vRep = "latest"
elif tf in {"1", "2", "3", 1, 2, 3}:
bump = int(tf)
parts = latestTfVersion.split(".")
def getVer(b):
return int(parts[b])
def setVer(b, val):
parts[b] = val
if bump > len(parts):
console(
f"Cannot bump part {bump} of latest TF version {latestTfVersion}",
error=True,
)
self.good = False
return
else:
b1 = bump - 1
old = getVer(b1)
setVer(b1, old + 1)
for b in range(b1 + 1, len(parts)):
setVer(b, 0)
tfVersion = ".".join(str(p) for p in parts)
vRep = (
"major" if bump == 1 else "intermediate" if bump == 2 else "minor"
)
vRep = f"next {vRep}"
else:
tfVersion = tf
status = "existing" if dirExists(f"{tfDir}/{tfVersion}") else "new"
vRep = f"explicit {status}"
tfPath = f"{tfDir}/{tfVersion}"
if verbose >= 0:
console(f"TF data version is {tfVersion} ({vRep})")
console(
f"Processing instructions will be {'treated' if procins else 'ignored'}"
)
self.refDir = refDir
self.xmlVersion = xmlVersion
self.xmlPath = xmlPath
self.tfVersion = tfVersion
self.tfPath = tfPath
self.reportPath = reportPath
self.tfDir = tfDir
self.appDir = appDir
self.backend = backend
self.org = org
self.repo = repo
self.relative = relative
self.verbose = verbose
myDir = dirNm(abspath(__file__))
self.myDir = myDir
def getParser(self):
"""Configure the LXML parser.
See [parser options](https://lxml.de/parsing.html#parser-options).
Returns
-------
object
A configured LXML parser object.
"""
if not self.importOK():
return None
etree = self.etree
procins = self.procins
return etree.XMLParser(
remove_blank_text=False,
collect_ids=False,
remove_comments=True,
remove_pis=not procins,
huge_tree=True,
)
def getXML(self):
"""Make an inventory of the XML source files.
Returns
-------
tuple of tuple | string
The outer tuple has sorted entries corresponding to folders under the
XML input directory.
Each such entry consists of the folder name and an inner tuple
that contains the file names in that folder, sorted.
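For instance (folder and file names are hypothetical):

``` python
# Hypothetical shape of the value returned by getXML()
xmlFiles = (
    ("volume1", ("doc1.xml", "doc2.xml")),
    ("volume2", ("doc3.xml",)),
)
```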
"""
xmlPath = self.xmlPath
IGNORE = "__ignore__"
xmlFilesRaw = collections.defaultdict(list)
with scanDir(xmlPath) as dh:
for folder in dh:
folderName = folder.name
if folderName == IGNORE:
continue
if not folder.is_dir():
continue
with scanDir(f"{xmlPath}/{folderName}") as fh:
for file in fh:
fileName = file.name
if not (fileName.lower().endswith(".xml") and file.is_file()):
continue
xmlFilesRaw[folderName].append(fileName)
xmlFiles = tuple(
(folderName, tuple(sorted(fileNames)))
for (folderName, fileNames) in sorted(xmlFilesRaw.items())
)
return xmlFiles
def checkTask(self):
"""Implementation of the "check" task.
It makes an inventory of all elements and attributes in the XML files.
If tags are used in multiple namespaces, that will be reported.
!!! caution "Conflation of namespaces"
The XML to TF conversion does construct node types and attributes
without taking namespaces into account.
However, the parsing process is namespace aware.
The inventory lists all elements and attributes, and many attribute values.
But it represents every digit by `N`, and some attributes that contain
ids or keywords are reduced to the value `X`.
This information reduction helps to get a clear overview.
It writes reports to the `reportPath`:
* `errors.txt`: validation errors
* `elements.txt`: element / attribute inventory.
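The value reduction can be sketched as follows (illustrative only;
`reduceValue` is a made-up name, the converter works with the `trimAtts`
passed to `XML()`):

``` python
import re

NUM_RE = re.compile(r"[0-9]")

def reduceValue(att, value, trimAtts):
    # Values of trimmed attributes collapse to "X"; in all other
    # values every digit is replaced by "N".
    return "X" if att in trimAtts else NUM_RE.sub("N", value)
```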
"""
if not self.importOK():
return
if not self.good:
return
etree = self.etree
verbose = self.verbose
procins = self.procins
trimAtts = self.trimAtts
keywordAtts = self.keywordAtts
renameAtts = self.renameAtts
xmlPath = self.xmlPath
reportPath = self.reportPath
if verbose == 1:
console(f"XML to TF checking: {ux(xmlPath)} => {ux(reportPath)}")
if verbose >= 0:
console(
f"Processing instructions are {'treated' if procins else 'ignored'}"
)
kindLabels = dict(
keyword="Keyword Attributes",
rest="Remaining Attributes and Elements",
)
getStore = lambda: collections.defaultdict( # noqa: E731
lambda: collections.defaultdict(collections.Counter)
)
analysis = {x: getStore() for x in kindLabels}
errors = []
tagByNs = collections.defaultdict(collections.Counter)
parser = self.getParser()
initTree(reportPath)
def analyse(root, analysis):
NUM_RE = re.compile(r"""[0-9]""", re.S)
def nodeInfo(xnode):
if procins and isinstance(xnode, etree._ProcessingInstruction):
target = xnode.target
tag = f"?{target}"
ns = ""
else:
qName = etree.QName(xnode.tag)
tag = qName.localname
ns = qName.namespace
atts = xnode.attrib
tagByNs[tag][ns] += 1
if len(atts) == 0:
kind = "rest"
analysis[kind][tag][""][""] += 1
else:
for kOrig, v in atts.items():
k = etree.QName(kOrig).localname
kind = "keyword" if k in keywordAtts else "rest"
dest = analysis[kind]
if kind == "rest":
vTrim = "X" if k in trimAtts else NUM_RE.sub("N", v)
dest[tag][k][vTrim] += 1
else:
words = v.strip().split()
for w in words:
dest[tag][k][w.strip()] += 1
for child in xnode.iterchildren(
tag=(etree.Element, etree.ProcessingInstruction)
if procins
else etree.Element
):
nodeInfo(child)
nodeInfo(root)
def writeErrors():
errorFile = f"{reportPath}/errors.txt"
nErrors = 0
with fileOpen(errorFile, mode="w") as fh:
for xmlFile, lines in errors:
fh.write(f"{xmlFile}\n")
for line in lines:
fh.write(line)
nErrors += 1
fh.write("\n")
console(
f"{nErrors} error(s) in {len(errors)} file(s) written to {errorFile}"
if verbose >= 0 or nErrors
else "Validation OK"
)
def writeNamespaces():
errorFile = f"{reportPath}/namespaces.txt"
nErrors = 0
nTags = len(tagByNs)
with fileOpen(errorFile, mode="w") as fh:
for tag, nsInfo in sorted(
tagByNs.items(), key=lambda x: (-len(x[1]), x[0])
):
label = "OK"
nNs = len(nsInfo)
if nNs > 1:
nErrors += 1
label = "XX"
for ns, amount in sorted(
nsInfo.items(), key=lambda x: (-x[1], x[0])
):
fh.write(
f"{label} {nNs:>2} namespace for "
f"{tag:<16} : {amount:>5}x {ns}\n"
)
console(
f"{nTags} tags of which {nErrors} with multiple namespaces "
f"written to {errorFile}"
if verbose >= 0 or nErrors
else "Namespaces OK"
)
def writeReport():
reportFile = f"{reportPath}/elements.txt"
with fileOpen(reportFile, mode="w") as fh:
fh.write(
"Inventory of tags and attributes in the source XML file(s).\n"
"Contains the following sections:\n"
)
for label in kindLabels.values():
fh.write(f"\t{label}\n")
fh.write("\n\n")
infoLines = 0
def writeAttInfo(tag, att, attInfo):
nonlocal infoLines
nl = "" if tag == "" else "\n"
tagRep = "" if tag == "" else f"<{tag}>"
attNew = renameAtts.get(att, att)
attX = att if att == attNew else f"{attNew} (was {att})"
attRep = "" if att == "" else f"{attX}="
atts = sorted(attInfo.items())
(val, amount) = atts[0]
fh.write(
f"{nl}\t{tagRep:<18} " f"{attRep:<18} {amount:>5}x {val}\n"
)
infoLines += 1
for val, amount in atts[1:]:
fh.write(f"""\t{'':<18} {'"':<18} {amount:>5}x {val}\n""")
infoLines += 1
def writeTagInfo(tag, tagInfo):
nonlocal infoLines
tags = sorted(tagInfo.items())
(att, attInfo) = tags[0]
writeAttInfo(tag, att, attInfo)
infoLines += 1
for att, attInfo in tags[1:]:
writeAttInfo("", att, attInfo)
for kind, label in kindLabels.items():
fh.write(f"\n{label}\n")
for tag, tagInfo in sorted(analysis[kind].items()):
writeTagInfo(tag, tagInfo)
if verbose >= 0:
console(f"{infoLines} info line(s) written to {reportFile}")
def filterError(msg):
return msg == (
"Element 'graphic', attribute 'url': [facet 'pattern'] "
"The value '' is not accepted by the pattern '\\S+'."
)
def doXMLFile(xmlPath):
tree = etree.parse(xmlPath, parser)
root = tree.getroot()
analyse(root, analysis)
i = 0
for xmlFolder, xmlFiles in self.getXML():
console(f"Start folder {xmlFolder}:")
for xmlFile in xmlFiles:
i += 1
console(f"\r{i:>4} {xmlFile:<50}", newline=False)
thisXmlPath = f"{xmlPath}/{xmlFolder}/{xmlFile}"
doXMLFile(thisXmlPath)
console("")
console(f"End folder {xmlFolder}")
console("")
writeReport()
writeErrors()
writeNamespaces()
# SET UP CONVERSION
def convertTask(self):
"""Implementation of the "convert" task.
It sets up the `tf.convert.walker` machinery and runs it.
Returns
-------
boolean
Whether the conversion was successful.
"""
if not self.importOK():
return
etree = self.etree
convertTaskCustom = self.convertTaskCustom
return (
convertTaskDefault(etree)
if convertTaskCustom is None
else convertTaskCustom
)(self)
def getConverter(self):
"""Initializes a converter.
Returns
-------
object
The `tf.convert.walker.CV` converter object, initialized.
"""
verbose = self.verbose
tfPath = self.tfPath
silent = AUTO if verbose == 1 else TERSE if verbose == 0 else DEEP
TF = Fabric(locations=tfPath, silent=silent)
return CV(TF, silent=silent)
def loadTask(self):
"""Implementation of the "load" task.
It loads the TF data that resides in the directory where the "convert" task
delivered its results.
During loading there are additional checks. If they succeed, we have evidence
that we have a valid TF dataset.
Also, during the first load intensive pre-computation of TF data takes place,
the results of which will be cached in the invisible `.tf` directory there.
That makes the TF data ready to be loaded fast, next time it is needed.
Returns
-------
boolean
Whether the loading was successful.
"""
if not self.importOK():
return
if not self.good:
return
tfPath = self.tfPath
verbose = self.verbose
silent = AUTO if verbose == 1 else TERSE if verbose == 0 else DEEP
if not dirExists(tfPath):
console(f"Directory {ux(tfPath)} does not exist.")
console("No TF found, nothing to load")
self.good = False
return
TF = Fabric(locations=[tfPath], silent=silent)
allFeatures = TF.explore(silent=True, show=True)
loadableFeatures = allFeatures["nodes"] + allFeatures["edges"]
api = TF.load(loadableFeatures, silent=silent)
if api:
if verbose >= 0:
console(f"max node = {api.F.otype.maxNode}")
self.good = True
return
self.good = False
# APP CREATION/UPDATING
def appTask(self, tokenBased=False):
"""Implementation of the "app" task.
It creates / updates a corpus-specific app.
There should be a valid TF dataset in place, because some
settings in the app derive from it.
It will also read custom additions that are present in the target app directory.
These files are:
* `config_custom.yaml`:
A YAML file with configuration data that will be *merged* into the generated
config.yaml.
* `app_custom.py`:
A python file with named snippets of code to be inserted
at corresponding places in the generated `app.py`
* `display_custom.css`:
Additional CSS definitions that will be put in place.
If the TF app for this resource needs custom code, this is the way to retain
that code between automatic generation of files.
Returns
-------
boolean
Whether the operation was successful.
"""
if not self.importOK():
return
if not self.good:
return
verbose = self.verbose
refDir = self.refDir
myDir = self.myDir
# key | parent directory | file | template based
# if parent directory is a tuple, the first part is the parent of the source
# and the second part is the parent of the destination
itemSpecs = (
("logo", "app/static", "logo.png", True),
("display", "app/static", "display.css", False),
("config", "app", "config.yaml", False),
("app", "app", "app.py", False),
)
genTasks = {
s[0]: dict(parent=s[1], file=s[2], justCopy=s[3]) for s in itemSpecs
}
tfVersion = self.tfVersion
version = tfVersion
def createConfig(sourceText, customText):
text = sourceText.replace("«version»", f'"{version}"')
settings = readYaml(text=text, plain=True)
settings.setdefault("provenanceSpec", {})["branch"] = BRANCH_DEFAULT_NEW
dataDisplay = settings.setdefault("dataDisplay", {})
interfaceDefaults = settings.setdefault("interfaceDefaults", {})
if "textFormats" in dataDisplay:
del dataDisplay["textFormats"]
dataDisplay["textFormat"] = DEFAULT_FORMAT
interfaceDefaults["fmt"] = DEFAULT_FORMAT
customSettings = (
{} if customText is None else readYaml(text=customText, plain=True)
)
mergeDict(settings, customSettings)
text = writeYaml(settings)
return text
def createDisplay(sourceText, customText):
"""Copies the custom display.css file of a TF app."""
return customText or ""
def createApp(sourceText, customText):
"""Copies and tweaks the app.py file of a TF app.
The template app.py provides text formatting functions.
It retrieves text from features, but that is dependent on
the settings of the conversion.
"""
if customText is None:
return None
materialCode = '''F.ch.v(n) or ""'''
code = sourceText.replace("F.matérial", materialCode)
code = code.replace('"rèndValues"', "{}")
hookStartRe = re.compile(r"^# DEF (import|init|extra)\s*$", re.S)
hookEndRe = re.compile(r"^# END DEF\s*$", re.S)
hookInsertRe = re.compile(r"^\t# INSERT (import|init|extra)\s*$", re.S)
custom = {}
section = None
for line in (customText or "").split("\n"):