
Formulated STAM-baseoffset extension #7

Merged
merged 1 commit into from
Mar 31, 2023

Conversation

proycon
Collaborator

@proycon proycon commented Jan 19, 2023

(please read the README.md in this PR)

@dirkroorda
Member

dirkroorda commented Jan 23, 2023

Some questions arise. How do you deal with annotations on the corpus and on its parts?
It is not trivial to map annotations on parts to annotations on the corpus and vice versa.

A few examples. Suppose you can annotate words and sentences in the corpus. Here are a few types of annotations that will cause problems when you want to transfer them from part to corpus or vice versa.

type A (the least problematic of the non-trivial cases)

lex is an annotation set. Each annotation in it has a body which is a lexeme, and its targets are the occurrences of the lexeme.

lex annotations have targets across parts. You could trim lex annotations to the parts by just leaving out the targets that do not belong to the part in question.
Then, if you compose the corpus-wide lex annotation sets from the part-wise lex annotation sets, you have to merge the targets of the annotations with the same bodies.
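This merging step can be sketched as follows. A minimal illustration, not the STAM API: annotations are modelled here as a mapping from body (the lexeme) to a set of targets, and the file names and offsets are made-up example values.

```python
def merge_lex_sets(part_sets):
    """Union the targets of annotations that share the same body."""
    merged = {}
    for part in part_sets:
        for lexeme, targets in part.items():
            merged.setdefault(lexeme, set()).update(targets)
    return merged

# Hypothetical part-wise lex sets: body -> targets as (resource, begin, end)
part1 = {"walk": {("part1.txt", 0, 4)}, "run": {("part1.txt", 10, 13)}}
part2 = {"walk": {("part2.txt", 5, 9)}}
corpus = merge_lex_sets([part1, part2])
# "walk" now carries the occurrences from both parts
```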

type B (slightly more problematic)

Suppose you have an annotation set freqLex whose annotations have as body a number and as target a single word, where the number is the frequency of the lexeme of the word in the corpus.

Now, if you have computed the freqLex relative to the parts, then in order to compose the freqLex for the corpus as a whole, you have to sum the bodies of the part-wise annotations. For each target in a part, you have to find annotations in other parts that target a word with the same lexeme, and then take the sum.
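The summing step is straightforward to sketch. An illustrative fragment, assuming part-wise bodies keyed by lexeme (not the STAM model itself):

```python
from collections import Counter

def compose_freq_lex(part_freqs):
    """Sum the bodies of part-wise freqLex annotations per lexeme."""
    total = Counter()
    for freqs in part_freqs:
        total.update(freqs)
    return dict(total)

# Made-up part-wise frequencies
whole = compose_freq_lex([{"walk": 3}, {"walk": 2, "run": 1}])
```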

type C (even worse)

Suppose you have an annotation set rankLex whose annotations have as body a number and as target a single word, where the number is the rank of the lexeme of the word in the corpus. Words with the highest frequency have rank 0, those with the second-highest frequency have rank 1, and so on.

Here it is probably not worth even trying to compute the composed rank from the part-wise ranks; you will want to recompute the corpus-wide rank from scratch.
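Recomputing from scratch is cheap once the corpus-wide frequencies exist (e.g. composed as in type B). An illustrative sketch; tie-breaking between equal frequencies is left arbitrary here, as the description above leaves it open:

```python
def rank_lex(freqs):
    """Rank 0 = most frequent lexeme, given corpus-wide frequencies."""
    ordered = sorted(freqs, key=lambda lex: -freqs[lex])
    return {lex: rank for rank, lex in enumerate(ordered)}

# Made-up corpus-wide frequencies
ranks = rank_lex({"walk": 5, "run": 1, "eat": 3})
```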

type D (a different case)

Suppose you have an annotation set similar whose annotations have a real number as body and whose targets are a pair of sentences. The number is a measure of how similar the target sentences are.

Now, if we have these annotations on the corpus, how do we reduce them to annotations on parts?

We reduce the annotation sets to a part by removing all annotations whose targets are not in the part.
If we then look at the union of the annotation sets of the parts it is not equal to the annotation set of the whole, because the annotations that have their targets across parts are left out.

You could remedy this by adding to each part the "half-annotations": annotations whose target contains a sentence in that part, but also a sentence outside that part.
You could add the dangling sentence to the body of the half-annotation. So these
half-annotations have a body containing a similarity and a link to a sentence outside the part. And they only have a single target: a sentence within the part.

Then you can recompose the similarity annotation set of the whole out of the similarity annotation sets plus the half-annotation sets of the parts.
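A minimal sketch of this half-annotation scheme, with made-up sentence ids and data structures (not the STAM model): reduction keeps in-part pairs whole and turns cross-part pairs into half-annotations; recomposition restores the pairs, with the duplicate halves from both sides collapsing into one annotation.

```python
def reduce_to_part(annotations, part_sentences):
    """Split a corpus-wide 'similar' set into in-part and half-annotations."""
    full, halves = [], []
    for (s1, s2), score in annotations:
        in1, in2 = s1 in part_sentences, s2 in part_sentences
        if in1 and in2:
            full.append(((s1, s2), score))
        elif in1:
            # half-annotation: single in-part target, dangling sentence in body
            halves.append((s1, {"similarity": score, "other": s2}))
        elif in2:
            halves.append((s2, {"similarity": score, "other": s1}))
    return full, halves

def recompose(parts):
    """Rebuild the corpus-wide set; frozenset keys dedupe the two halves."""
    whole = set()
    for full, halves in parts:
        for pair, score in full:
            whole.add((frozenset(pair), score))
        for target, body in halves:
            whole.add((frozenset({target, body["other"]}), body["similarity"]))
    return whole

# Hypothetical example: one in-part pair, one cross-part pair
anns = [(("s1", "s2"), 0.9), (("s2", "s3"), 0.5)]
full_a, halves_a = reduce_to_part(anns, {"s1", "s2"})
full_b, halves_b = reduce_to_part(anns, {"s3", "s4"})
whole = recompose([(full_a, halves_a), (full_b, halves_b)])
```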

@proycon
Collaborator Author

proycon commented Jan 23, 2023

You raise some good points, it is indeed sometimes not trivial to map annotations
on parts to annotations on the corpus (and vice versa).

Type A

lex is an annotation set. Each annotation in it has a body which is a lexeme, and its targets are the occurrences of the lexeme.

Then, if you compose the corpus-wide lex annotation sets from the part-wise
lex annotation sets, you have to merge the targets of the annotations with
the same bodies.

Let me see if I get this right; we might not be entirely on the same page
regarding the terminology yet. We want to annotate lexemes, so we create an
annotation dataset with some key lex and the value is a string
corresponding to the lexeme (or we can make the key be the lexeme and have the
value be null, doesn't matter much). Each lexeme will only occur once in the
dataset. These form the 'bodies' of annotations. This annotation dataset is
entirely independent for corpus or parts (because it doesn't reference
anything). At this stage we don't have annotations yet.

Now we add annotations for each occurrence of the lexeme in the data, each has
a TextSelector pointing to the resource and the offset in it, and it points to the data (body) for the lexeme.

What this proposed extension allows is that even if these targetselectors point
at 'parts', we use the same offset as if we were pointing to the 'whole'. So
any references to the parts (part1.txt) can simply be replaced with the
whole (corpus.txt), without further mappings (or vice versa). Without this extension, you'd have to map the offsets.
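To show what that mapping step would look like without the extension, here is an illustrative sketch; the file names and byte positions are made-up example values, not part of any real corpus.

```python
# Start offset of each part within corpus.txt (hypothetical values)
PART_BASE = {"part1.txt": 0, "part2.txt": 1542}

def to_whole(resource, begin, end):
    """Map a (resource, begin, end) TextSelector on a part to the whole."""
    base = PART_BASE[resource]
    return ("corpus.txt", base + begin, base + end)

mapped = to_whole("part2.txt", 10, 15)
```

With the baseoffset extension, the parts already carry corpus-wide offsets, so this translation step disappears.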

I think what you mean in your example, however, is what if you have ONE
annotation that references ALL the lexemes? That is via a MultiSelector.
STAM doesn't really prescribe how to model the data, so this is certainly an
option. That solution would indeed suffer from the problem of having to merge
annotations (merge MultiSelectors), if you move from parts to the whole corpus,
or splitting annotations if you move from the whole corpus to parts.

This problem, which you show even more clearly in B and C, is inherent in
having ONE annotation convey information about the dataset as a whole. If
people often switch between parts and the whole, then this may not be the best
way to model things, or it at least requires some implementation to do the
mappings. Sometimes there's no way around it if computing the whole at once is
simply too much/big.

Maybe the question at hand is: Do we want to capture
some of that logic in a STAM extension? I don't think such an extension would
change much or anything in our core data model though.

Type B/C

Suppose you have an annotation set freqLex whose annotations have as body a
number and as target a single word, where the number is the frequency of the
lexeme of the word in the corpus.

To store the frequency of a lexeme you wouldn't target a single instance of it
in the text. (well, you can, STAM doesn't prescribe a model, but it's a bit
odd). I'd say either target all instances with a MultiSelector, ideally
consisting of AnnotationSelectors that point to each occurrence of the lexeme
(or TextSelectors to point directly), or (and I think this is even better)
don't point to any instances at all but instead point to the resource itself
with a ResourceSelector, the annotation is then on the resource and the body
would be something like [{"lex": "whatever"}, {"freq": 10}]. (Here you'd
indeed need to sum things if you go from parts to the whole though)

The {lex: "whatever"} data (AnnotationData) would be the same one as the
one referenced by the individual lexeme annotations in example A (so that
actual data would only be a single node in memory). Though such an annotation
model doesn't explicitly link the frequency to the instances, this information
is still in the graph and easily accessible.

In fact, frequency information in general is easily accessed if you have one
annotation for each instance, all referencing a single AnnotationData (body). Simply
query the STAM implementation for the AnnotationData and count how many annotations
reference it (this information is in the reverse index).
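That reverse-index lookup can be sketched minimally as follows. This is an illustration of the idea, not the actual STAM implementation or its API: many occurrence annotations reference one shared AnnotationData node, and frequency is just the number of referencing annotations.

```python
from collections import defaultdict

class MiniStore:
    """Toy store with a reverse index from data key to annotations."""

    def __init__(self):
        self._reverse = defaultdict(list)  # data key -> referencing annotations

    def annotate(self, data_key, target):
        self._reverse[data_key].append(target)

    def frequency(self, data_key):
        # Frequency = how many annotations reference this AnnotationData
        return len(self._reverse[data_key])

store = MiniStore()
for target in [("corpus.txt", 0, 4), ("corpus.txt", 88, 92)]:
    store.annotate(("lex", "walk"), target)
```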

Type D

I like this example, it clearly shows limitations when moving from parts to the
whole or back. It will apply to any annotation that has a MultiSelector.
Perhaps we need to define some recommendations for implementations to handle
this, but I wouldn't say it's the most urgent matter nor a show-stopper for
this particular proposal. It was good to think deeper about these things
already though! It shows that this is something users modelling annotations
will need to keep in mind if they plan to make frequent switches between whole
and parts.

@dirkroorda
Member

dirkroorda commented Jan 23, 2023

A few remarks:

Under Type A:

I think what you mean in your example, however, is what if you have ONE
annotation that references ALL the lexemes?

I meant to have one annotation per lexeme that targets all its occurrences.

The same thing also happens if you have other things contained in each other which do not neatly fall into volumes. Think of pages and lines (although volumes usually start at a new page).

Remember that there are two ways to interpret a Multiselector:

(I): the body is an annotation to each of the targets in the multiselector individually
(II): the body is an annotation to the list of targets

Here we are still in annotations of type (I), and that makes it easier to switch between wholes and parts. So it might pay off to implement something here.
Type D is an example of interpretation (II).
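The reason interpretation (I) is the easy case can be made concrete: a type-(I) multiselector annotation is equivalent to one annotation per target, so splitting over parts reduces to filtering targets. An illustrative sketch, not any prescribed STAM behaviour:

```python
def split_type_i(body, targets):
    """Rewrite one type-(I) multiselector annotation as per-target annotations."""
    return [(body, target) for target in targets]

# Hypothetical lexeme annotation over two occurrences
individual = split_type_i("walk", ["t1", "t2"])
```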

Under Type B/C

To store the frequency of a lexeme you wouldn't target a single instance of it
in the text. (well, you can, STAM doesn't prescribe a model, but it's a bit
odd). I'd say either target all instances with a MultiSelector, ideally
consisting of AnnotationSelectors that point to each occurrence of the lexeme
(or TextSelectors to point directly), or (and I think this is even better)
don't point to any instances at all but instead point to the resource itself
with a ResourceSelector, the annotation is then on the resource and the body
would be something like [{"lex": "whatever"}, {"freq": 10}].

Sure, it is odd, yet it can be practical. It depends on how your query language is able to retrieve data.
What I have in mind is a use case where we want to label each word with information about how frequent its lexeme is. Then people can search for sentences composed of frequent words. Or sentences with at most one rare word.

An example from Text-Fabric:

sentence
/without/
  word freqLex<100
/-/

looks for sentences without words that have a frequency lower than 100.

And

sentence
  word freqLex<100

looks for a sentence and a word inside that sentence with a frequency lower than 100.

If I had supplied only the lexemes with freqLex info then these queries would become
a bit more convoluted:

sentence
/without/
lex freq_lex<100
  w:word
.. [[ w
/-/

resp.

l:lex freqLex<100
sentence
  w:word

l [[ w

But that is a (very) redundant way to store lexeme frequencies.

Probably the query language of STAM will be rich enough to formulate these queries, and then I could not agree more with your remark.

The distinguishing point in type B/C is that the meaning/intention of the annotation depends on the corpus as a whole. The frequency of a word inside a part is different from the frequency of the same word inside the whole.

This leads to different things that people want.

Suppose I'm interested in sentences without uncommon words.
Suppose I'm working in a part.
Probably I want to find the sentences in that part whose words are frequent as being measured in the corpus as a whole. So I need the original values, and I do not want any
adaptation of the values to the part I'm working in!

If that is the case generally, then we have to do nothing with these kinds of annotations, when going from parts to the whole and back.
So the most intricate cases are gone!

That leaves only types A and D where STAM might do something to support the transition
from whole to part and back.

Type D

it clearly shows limitations when moving from parts to the
whole or back. It will apply to any annotation that has a MultiSelector.
Perhaps we need to define some recommendations for implementations to handle
this ...

Yes, it is a case of MultiSelectors in interpretation (II) (see above).
The thing is, one can do something about this.

I think the least one can do is to make corpus encoders/users aware that annotations with multiselectors can be of type (I) or (II).
STAM will also know which is which, so it remains possible to add to the parts the one-legged left-overs of such annotations.

I did that in Text-Fabric generically when it extracts
volumes from works.

@proycon
Collaborator Author

proycon commented Jan 23, 2023

I'll react to your core point first, because you touch upon a very important subject indeed:

Remember that there are two ways to interpret a Multiselector:

(I): the body is an annotation to each of the targets in the multiselector individually
(II): the body is an annotation to the list of targets

This is an interesting point. But I think we are already pretty normative there, with only one way to interpret it, namely type (II), in the current specification:

  • MultiSelector - A selector that consists of multiple other selectors,
    used to select more complex targets that transcend the idea of a single
    simple selection. This MUST be interpreted as the annotation applying
    equally to the conjunction as a whole, its parts being inter-dependent and
    for any of them it goes that they MUST NOT be omitted for the annotation
    to make sense. Note that the order of the selectors is not significant.
    When there is no dependency relation between the selectors, you MUST
    simply use multiple Annotations instead.

This is a bit at odds with what I said earlier, where I said STAM doesn't force a model.

That of course raises the question: what about Type I? Are we perhaps too strict
in imposing a way to model here?

I think the least one can do is to make corpus encoders/users aware
that annotations with multiselectors can be of type (I) or (II). STAM will
also know which is which, so it remains possible to add to the parts the
one-legged left-overs of such annotations.

If we have two types and want to have both, then I agree it's good to make it very explicit what the semantics of each is. We don't want ambiguity here.
We could split the MultiSelector into two separate types to do so: say, a CompositeSelector for type (II) (just inventing names here), and perhaps keep MultiSelector for type (I)?
This would make STAM a bit more flexible and have very clear semantics. What do you think?

This would be an actual change to the core model (which we are still at liberty to do at this stage) rather than an extension, btw.

@dirkroorda
Member

dirkroorda commented Jan 23, 2023

MultiSelector for type (I) sounds good to me.

CompositeSelector for type (II) also sounds fine. I was thinking of TupleSelector because this corresponds to the idea that annotations of this type are an n-ary relation between textual selectors.

But the term is maybe a bit too nerdy. Anyway, I think it is beneficial to have the distinction in the core model of STAM.
