recorder.py
742 lines (601 loc) · 23.9 KB
"""
# Recorder
Support for round trips of TF data to annotation tools and back.
The scenario is:
* Prepare a piece of corpus material for plain text use in an annotation tool,
e.g. [BRAT](https://brat.nlplab.org).
* Alongside the plain text, generate a mapping file that maps nodes to
character positions in the plain text
* Use an annotation tool to annotate the plain text
* Read the output of the annotation tools and convert it into TF features,
using the mapping file.
## Explanation
The recorder object is an engine to which you can send text material, interspersed
with commands that say:
* start node `n`;
* end node `n`.
The recorder stores the accumulating text as a plain text, without any
trace of the `start` and `end` commands.
However, it also maintains a mapping between character positions in the
accumulated text and the nodes.
At any moment, there is a set of *active* nodes: the ones that have been started,
but not yet ended.
Every character of text that has been sent to the recorder
will add an entry to the position mapping: it maps the position of that character
to the set of active nodes at that point.
## Usage
We suppose you have a corpus loaded, either by
```
from tf.app import use
A = use(corpus)
api = A.api
```
or by
```
from tf.fabric import Fabric
TF = Fabric(locations, modules)
api = TF.load(features)
```
Then you can use the recorder as follows:
```
from tf.convert.recorder import Recorder
rec = Recorder(api)
rec.add("a")
rec.start(n1)
rec.add("bc")
rec.start(n2)
rec.add("def")
rec.end(n1)
rec.add("ghij")
rec.end(n2)
rec.add("klmno")
```
This leads to the following mapping:
position | text | active nodes
--- | --- | ---
0 | `a` | `{}`
1 | `b` | `{n1}`
2 | `c` | `{n1}`
3 | `d` | `{n1, n2}`
4 | `e` | `{n1, n2}`
5 | `f` | `{n1, n2}`
6 | `g` | `{n2}`
7 | `h` | `{n2}`
8 | `i` | `{n2}`
9 | `j` | `{n2}`
10 | `k` | `{}`
11 | `l` | `{}`
12 | `m` | `{}`
13 | `n` | `{}`
14 | `o` | `{}`
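The bookkeeping behind this table can be sketched in a few lines of plain Python
(a simplified stand-in for the real `Recorder`, not its actual implementation):

```python
# Sketch of the recorder idea: accumulate text chunks and, for every
# character, the frozen set of nodes active at the moment it was added.
material = []     # accumulated text chunks
nodesByPos = []   # entry p: set of nodes active at character position p
context = set()   # currently active nodes

def start(n):
    context.add(n)

def end(n):
    context.discard(n)

def add(string):
    material.append(string)
    nodesByPos.extend([frozenset(context)] * len(string))

add("a"); start(1); add("bc"); start(2); add("def")
end(1); add("ghij"); end(2); add("klmno")

assert "".join(material) == "abcdefghijklmno"
assert nodesByPos[3] == frozenset({1, 2})  # 'd' lies inside both nodes
assert nodesByPos[10] == frozenset()       # 'k' lies outside all nodes
```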
There are methods to obtain the accumulated text and the mapped positions from the
recorder.
You can write the information of a recorder to disk and read it back later.
And you can generate features from a CSV file using the mapped positions.
To see it in action, see this
[tutorial](https://nbviewer.jupyter.org/github/etcbc/bhsa/blob/master/tutorial/annotate.ipynb)
"""
from itertools import chain
from ..core.helpers import (
specFromRangesLogical,
specFromRanges,
rangesFromSet,
)
from ..core.files import expanduser as ex, splitExt, initTree, dirNm
ZWJ = "\u200d" # zero width joiner
class Recorder:
def __init__(self, api=None):
"""Accumulator of generated text that remembers node positions.
Parameters
----------
api: obj, optional None
The handle of the API of a loaded TF corpus.
This is needed for operations where the recorder needs
TF intelligence associated with the nodes, e.g. their types.
If you do not pass an api, such methods are unavailable later on.
"""
self.api = api
self.material = []
"""Accumulated text.
It is a list of chunks of text.
The text is just the concatenation of all these chunks.
"""
self.nodesByPos = []
"""Mapping from textual positions to nodes.
It is a list. Entry `p` in this list stores the set of active nodes
for character position `p`.
"""
self.context = set()
"""The currently active nodes.
"""
def start(self, n):
"""Start a node.
That means: add it to the context, i.e. make the node active.
Parameters
----------
n: integer
A node. The node can be of any node type.
"""
self.context.add(n)
def end(self, n):
"""End a node.
That means: delete it from the context, i.e. make the node inactive.
Parameters
----------
n: integer
A node. The node can be of any node type.
"""
self.context.discard(n)
def add(self, string, empty=ZWJ):
"""Add text to the accumulator.
Parameters
----------
string: string | None
Material to add.
If it is a string, the string will be added to the accumulator.
If it is `None`, a default value will be added.
The default value is passed through parameter `empty`.
empty: string, optional zero-width-joiner
If the string parameter is `None`, this is the default value
that will be added to the accumulator.
If this parameter is absent, the zero-width joiner is used.
"""
if string is None:
string = empty
self.material.append(string)
self.nodesByPos.extend([frozenset(self.context)] * len(string))
def text(self):
"""Get the accumulated text.
Returns
-------
string
The join of all accumulated text chunks.
"""
return "".join(self.material)
def positions(self, byType=False, simple=False):
"""Get the node positions as mapping from character positions.
Character positions start at `0`.
For each character position we get the set of nodes whose material
occupies that character position.
Parameters
----------
byType: boolean, optional False
If True, makes a separate node mapping per node type.
For this it is needed that the Recorder has been
passed a TF api when it was initialized.
simple: boolean, optional False
In some cases it is known beforehand that at each textual position
there is at most 1 node.
Then it is more economical to fill the list with single nodes
rather than with sets of nodes.
If this parameter is True, we pick the first node from the set.
Returns
-------
list|dict|None
If `byType`, the result is a dictionary, keyed by node type,
valued by mappings of textual positions to nodes of that type.
This mapping takes the shape of a list where entry `i`
contains the frozen set of all nodes of that type
that were active at character position `i` in the text.
If not `byType` then a single mapping is returned (as list),
where entry `i` contains the frozen set of all
nodes, irrespective of their type,
that were active at character position `i` in the text.
"""
if not byType:
if simple:
return tuple(list(x)[0] if x else None for x in self.nodesByPos)
return self.nodesByPos
api = self.api
if api is None:
print(
"""\
Cannot determine node types without a TF api.
You have to call Recorder(`api`) instead of Recorder()
where `api` is the result of
tf.app.use(corpus)
or
tf.Fabric(locations, modules).load(features)
"""
)
return None
F = api.F
Fotypev = F.otype.v
info = api.TF.info
indent = api.TF.indent
indent(level=True, reset=True)
info("gathering nodes ...")
allNodes = set(chain.from_iterable(self.nodesByPos))
allTypes = {Fotypev(n) for n in allNodes}
info(f"found {len(allNodes)} nodes in {len(allTypes)} types")
nodesByPosByType = {nodeType: [] for nodeType in allTypes}
info("partitioning nodes over types ...")
for nodeSet in self.nodesByPos:
typed = {}
for node in nodeSet:
nodeType = Fotypev(node)
typed.setdefault(nodeType, set()).add(node)
for nodeType in allTypes:
thisSet = (
frozenset(typed[nodeType]) if nodeType in typed else frozenset()
)
value = (list(thisSet)[0] if thisSet else None) if simple else thisSet
nodesByPosByType[nodeType].append(value)
info("done")
indent(level=False)
return nodesByPosByType
def iPositions(self, byType=False, logical=True, asEntries=False):
"""Get the character positions as mapping from nodes.
Parameters
----------
byType: boolean, optional False
If True, makes a separate node mapping per node type.
For this it is needed that the Recorder has been
passed a TF api when it was initialized.
logical: boolean, optional True
If True, specs are represented as tuples of ranges
and a range is represented as a tuple of a begin and end point,
or as a single point.
Points are integers.
If False, ranges are represented by strings: `,` separated ranges,
a range is *b*`-`*e* or *p*.
asEntries: boolean, optional False
If True, do not return the dict, but rather its entries.
Returns
-------
list|dict|None
If `byType`, the result is a dictionary, keyed by node type,
valued by mappings for nodes of that type.
Entry `n` in this mapping contains the intervals of all
character positions in the text where node `n` is active.
If not `byType` then a single mapping is returned, where each node
is mapped to the intervals where that node is active.
"""
method = specFromRangesLogical if logical else specFromRanges
posByNode = {}
for (i, nodeSet) in enumerate(self.nodesByPos):
for node in nodeSet:
posByNode.setdefault(node, set()).add(i)
for (n, posSet) in posByNode.items():
posByNode[n] = method(rangesFromSet(posSet))
if asEntries:
posByNode = tuple(posByNode.items())
if not byType:
return posByNode
api = self.api
if api is None:
print(
"""\
Cannot determine node types without a TF api.
You have to call Recorder(`api`) instead of Recorder()
where `api` is the result of
tf.app.use(corpus)
or
tf.Fabric(locations, modules).load(features)
"""
)
return None
F = api.F
Fotypev = F.otype.v
posByNodeType = {}
if asEntries:
for (n, spec) in posByNode:
nType = Fotypev(n)
posByNodeType.setdefault(nType, []).append((n, spec))
else:
for (n, spec) in posByNode.items():
nType = Fotypev(n)
posByNodeType.setdefault(nType, {})[n] = spec
return posByNodeType
def rPositions(self, acceptMaterialOutsideNodes=False):
"""Get the first textual position for each node
The position information is a big amount of data, in the general case.
Under certain assumptions we can economize on this data usage.
Strong assumptions:
1. every textual position is covered by **exactly one node**;
2. the nodes are consecutive:
every next node is equal to the previous node plus 1;
3. the positions of the nodes are monotonous in the nodes, i.e.
if node *n* < *m*, then the position of *n* is before the position of *m*.
Imagine the text partitioned in consecutive non-overlapping chunks, where
each node corresponds to exactly one chunk, and the order of the nodes
is the same as the order of the corresponding chunks.
We compute a list of positions that encode the mapping from nodes to textual
positions as follows:
Suppose we need to map nodes *n*, *n+1*, ..., *n+m* to textual positions;
say
* node *n* starts at position *t0*,
* node *n+1* at position *t1*,
* node *n+m* at position *tm*.
Say position *te* is the position just after the whole text covered by these
nodes.
Then we deliver the mapping as a sequence of these numbers:
* *n - 1*
* *t0*
* *t1*
* ...
* *tm*
* *te*
So the first element of the list is used to specify the offset to be
applied for all subsequent nodes.
The *te* value is added as a sentinel, to facilitate the determination
of the last position of each node.
Users of this list can find the start and end positions of node *m*
as follows
```
start = posList[m - posList[0]]
end = posList[m - posList[0] + 1] - 1
```
Parameters
----------
acceptMaterialOutsideNodes: boolean, optional False
If this is True, we accept that the text contains extra material that is not
covered by any node.
That means that condition 1 above is relaxed: we accept that
some textual positions do not correspond to any node.
Applications that make use of the positions must realize that in this case
the material associated with a node also includes the subsequent material
outside any node.
Returns
-------
list | str
The result is a list of numbers as described above.
We only return the *posList* if the assumptions hold.
If not, we return a string with diagnostic information.
"""
good = True
multipleNodes = 0
multipleFirst = 0
noNodes = 0
noFirst = 0
nonConsecutive = 0
nonConsecutiveFirst = 0
posByNode = {}
for (i, nodeSet) in enumerate(self.nodesByPos):
if (not acceptMaterialOutsideNodes and len(nodeSet) == 0) or len(nodeSet) > 1:
good = False
if len(nodeSet) == 0:
if noNodes == 0:
noFirst = i
noNodes += 1
else:
if multipleNodes == 0:
multipleFirst = i
multipleNodes += 1
continue
for node in nodeSet:
if node in posByNode:
continue
posByNode[node] = i
lastI = i
if not good:
msg = ""
if noNodes:
msg += (
f"{noNodes} positions without node, "
f"of which the first one is {noFirst}\n"
)
if multipleNodes:
msg += (
f"{multipleNodes} positions with multiple nodes, "
f"of which the first one is {multipleFirst}\n"
)
return msg
sortedPosByNode = sorted(posByNode.items())
offset = sortedPosByNode[0][0] - 1
posList = [offset]
prevNode = offset
for (node, i) in sortedPosByNode:
if prevNode + 1 != node:
good = False
if nonConsecutive == 0:
nonConsecutiveFirst = f"{prevNode} => {node}"
nonConsecutive += 1
else:
posList.append(i)
prevNode = node
posList.append(lastI)
if not good:
return (
f"{nonConsecutive} non-consecutive nodes, "
f"of which the first one is {nonConsecutiveFirst}"
)
return posList
def write(
self, textPath, inverted=False, posPath=None, byType=False, optimize=True
):
"""Write the recorder information to disk.
The recorded text is written as a plain text file,
and the remembered node positions are written as a TSV file.
You can also have the node positions be written out by node type.
In that case you can also optimize the file size.
Optimization means that consecutive equal values are prepended
by the number of repetitions and a `*`.
Parameters
----------
textPath: string
The file path to which the accumulated text is written.
inverted: boolean, optional False
If False, the positions are taken as mappings from character
positions to nodes. If True, they are a mapping from nodes to
character positions.
posPath: string, optional None
The file path to which the mapped positions are written.
If absent, it equals `textPath` with a `.pos` extension, or
`.ipos` if `inverted` is True.
The file format is: one line for each character position,
on each line a tab-separated list of active nodes.
byType: boolean, optional False
If True, writes separate node mappings per node type.
For this it is needed that the Recorder has been
passed a TF api when it was initialized.
The file names are extended with the node type.
This extension occurs just before the last `.` of the inferred `posPath`.
optimize: boolean, optional True
Optimize file size. Only relevant if `byType` is True
and `inverted` is False.
The format of each line is:
*rep* `*` *nodes*
where *rep* is a number that indicates repetition and *nodes*
is a tab-separated list of node numbers.
The meaning is that the following *rep* character positions
are associated with these *nodes*.
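For instance (hypothetical node numbers), three consecutive positions that all
belong to nodes 120 and 121, followed by one position that belongs to
node 121 only, would be written as:
```
3*120	121
121
```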
"""
textPath = ex(textPath)
posExt = ".ipos" if inverted else ".pos"
posPath = ex(posPath or f"{textPath}{posExt}")
textDir = dirNm(textPath)
initTree(textDir)
with open(textPath, "w", encoding="utf8") as fh:
fh.write(self.text())
if not byType:
posDir = dirNm(posPath)
initTree(posDir)
with open(posPath, "w", encoding="utf8") as fh:
if inverted:
fh.write(
"\n".join(
f"{node}\t{intervals}"
for (node, intervals) in self.iPositions(
byType=False, logical=False, asEntries=True
)
)
)
else:
fh.write(
"\n".join(
"\t".join(str(i) for i in nodes)
for nodes in self.nodesByPos
)
)
fh.write("\n")
return
mapByType = (
self.iPositions(byType=True, logical=False, asEntries=True)
if inverted
else self.positions(byType=True)
)
if mapByType is None:
print("No position files written")
return
(base, ext) = splitExt(posPath)
# if we reach this, there is a TF api
api = self.api
info = api.TF.info
indent = api.TF.indent
indent(level=True, reset=True)
for (nodeType, mapping) in mapByType.items():
fileName = f"{base}-{nodeType}{ext}"
info(f"{nodeType:<20} => {fileName}")
with open(fileName, "w", encoding="utf8") as fh:
if inverted:
fh.write(
"\n".join(
f"{node}\t{intervals}" for (node, intervals) in mapping
)
)
else:
if not optimize:
fh.write(
"\n".join(
"\t".join(str(i) for i in nodes) for nodes in mapping
)
)
else:
repetition = 1
previous = None
for nodes in mapping:
if nodes == previous:
repetition += 1
continue
else:
if previous is not None:
prefix = f"{repetition}*" if repetition > 1 else ""
value = "\t".join(str(i) for i in previous)
fh.write(f"{prefix}{value}\n")
repetition = 1
previous = nodes
if previous is not None:
prefix = f"{repetition}*" if repetition > 1 else ""
value = "\t".join(str(i) for i in previous)
fh.write(f"{prefix}{value}\n")
indent(level=False)
def read(self, textPath, posPath=None):
"""Read recorder information from disk.
Parameters
----------
textPath: string
The file path from which the accumulated text is read.
posPath: string, optional None
The file path from which the mapped positions are read.
If absent, it equals `textPath` with the extension `.pos` .
The file format is: one line for each character position,
on each line a tab-separated list of active nodes.
!!! caution
Pos files that have been written with `optimize=True` cannot
be read back.
"""
textPath = ex(textPath)
posPath = ex(posPath or f"{textPath}.pos")
self.context = set()
with open(textPath, encoding="utf8") as fh:
self.material = list(fh)
with open(posPath, encoding="utf8") as fh:
self.nodesByPos = [
{int(n) for n in line.rstrip("\n").split("\t")}
if line != "\n"
else set()
for line in fh
]
def makeFeatures(self, featurePath, headers=True):
"""Read a tab-separated file of annotation data and convert it to features.
An external annotation tool typically annotates text by assigning values
to character positions or ranges of character positions.
In Text-Fabric, annotations are values assigned to nodes.
If a *recorded* text has been annotated by an external tool,
we can use the position-to-node mapping to construct Text-Fabric features
out of it.
The annotation file is assumed to be a tab-separated file.
Every line corresponds to an annotation.
The first two columns have the start and end positions, as character positions
in the text.
The remaining columns contain annotation values for that stretch of text.
If there is a header row, the values of the headers translate to names
of the new TF features.
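For instance (hypothetical values), an annotation file with a header row could
look like this, yielding features `pos` and `lemma` for the nodes at
character positions 1-5 and 6-9:
```
start	end	pos	lemma
1	5	noun	word
6	9	verb	say
```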
Parameters
----------
featurePath: string
Path to the annotation file.
headers: boolean or iterable, optional True
Indicates whether the annotation file has headers.
If not True, it may be an iterable of names, which will
be used as headers.
"""
featurePath = ex(featurePath)
nodesByPos = self.nodesByPos
features = {}
with open(featurePath, encoding="utf8") as fh:
if headers is True:
names = next(fh).rstrip("\n").split("\t")[2:]
elif headers is not None:
names = headers
else:
names = None
for line in fh:
(start, end, *data) = line.rstrip("\n").split("\t")
if names is None:
names = tuple(f"f{i}" for i in range(1, len(data) + 1))
nodes = set(
chain.from_iterable(
nodesByPos[i] for i in range(int(start), int(end) + 1)
)
)
for n in nodes:
for i in range(len(names)):
val = data[i]
if not val:
continue
name = names[i]
features.setdefault(name, {})[n] = val
return features