Formulated STAM-baseoffset extension #7
Some questions arise. How do you deal with annotations on the corpus and on its parts? A few examples. Suppose you can annotate words and sentences in the corpus. Here are a few types of annotations that will cause problems when you want to transfer them from part to corpus or vice versa.
type A (the least problematic of the non-trivial cases)
type B (slightly more problematic)
Suppose you have an annotation set
Now, if you have computed the freqLex relative to the parts, then in order to compose the freqLex for the corpus as a whole, you have to sum the bodies of the part-wise annotations. For each target in a part, you have to find annotations in other parts that target a word with the same lexeme, and then take the sum.
type C (even worse)
Suppose you have annotation set
Here it is probably not worth even trying to compute the composed
type D (a different case)
Suppose you have an annotation set
Now, if we have these annotations on the corpus, how do we reduce them to annotations on parts? We reduce the annotation sets to a part by removing all annotations whose targets are not in the part. You could remedy this by adding to each part the "half-annotations": annotations whose target contains a sentence in that part, but also a sentence outside that part. Then you can recompose the
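Going back to type B, the summing step can be sketched in a few lines of Python. This is only an illustration on a made-up toy corpus: the names `parts`, `freq_lex_in_part` and `compose_freq_lex` are hypothetical and not part of STAM or Text-Fabric, and the annotation set is modelled as a plain mapping from word id to frequency.

```python
from collections import Counter

# Hypothetical toy corpus: per part, a mapping word id -> lexeme.
parts = {
    "part1": {"w1": "go", "w2": "see", "w3": "go"},
    "part2": {"w4": "go", "w5": "see", "w6": "hear"},
}


def freq_lex_in_part(words):
    """Annotate each word with the frequency of its lexeme within one part."""
    counts = Counter(words.values())
    return {word: counts[lexeme] for word, lexeme in words.items()}


def compose_freq_lex(parts, part_annotations):
    """Build corpus-wide freqLex by summing the part-wise annotation bodies."""
    totals = Counter()
    for name, words in parts.items():
        # Within a part, all words of one lexeme carry the same body,
        # so one value per lexeme is enough.
        per_lexeme = {}
        for word, freq in part_annotations[name].items():
            per_lexeme[words[word]] = freq
        totals.update(per_lexeme)  # adds the counts lexeme by lexeme
    # Re-annotate every word with the summed frequency of its lexeme.
    return {
        word: totals[lexeme]
        for words in parts.values()
        for word, lexeme in words.items()
    }


part_annotations = {name: freq_lex_in_part(words) for name, words in parts.items()}
corpus_freq_lex = compose_freq_lex(parts, part_annotations)
```

In this toy data the word `w1` has frequency 2 relative to `part1` but 3 relative to the whole corpus, so the body of the "same" annotation changes when moving between part and whole.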
You raise some good points; it is indeed sometimes not trivial to map annotations
Type A
Let me see if I get this right; we might not be entirely on the same page
Now we add annotations for each occurrence of the lexeme in the data, each has
What this proposed extension allows is that even if these targetselectors point
I think what you mean in your example, however, is what if you have ONE
This problem, which you show even more clearly in B and C, is inherent in
Maybe the question at hand is: Do we want to capture
Type B/C
To store the frequency of a lexeme you wouldn't target a single instance of it
The
In fact, frequency information in general is easily accessed if you have one
Type D
I like this example; it clearly shows limitations when moving from parts to the
A few remarks:
Under Type A:
I meant to have one annotation per lexeme that targets all its occurrences. The same thing also happens if you have other things contained in each other which do not neatly fall into volumes. Think of pages and lines (although volumes usually start at a new page).
Remember that there are two ways to interpret a MultiSelector:
(I): the body is an annotation to each of the targets in the multiselector individually
Here we are still in annotations of type (I), and that makes it easier to switch between wholes and parts. So it might pay off to implement something here.
Under Type B/C
Sure, it is odd, yet it can be practical. It depends on how your query language is able to retrieve data. An example from Text-Fabric:
looks for sentences without words that have a frequency lower than 100. And
looks for a sentence and a word inside that sentence with a frequency lower than 100. If I had supplied only the lexemes with freqLex info then these queries would become
resp.
But is a (very) redundant way to store lexeme frequencies. Probably the query language of STAM will be rich enough to formulate these queries, and then I could not agree more with your remark.
The distinguishing point in type B/C is that the meaning/intention of the annotation depends on the corpus as a whole. The frequency of a word inside a part is different from the frequency of the same word inside the whole. This leads to different things that people want. Suppose I'm interested in sentences without uncommon words.
If that is the case generally, then we have to do nothing with these kinds of annotations when going from parts to the whole and back. That leaves only types A and D where STAM might do something to support the transition.
Type D
Yes, it is a case of MultiSelectors in interpretation (II) (see above). I think the least one can do is to make corpus encoders/users aware that annotations with multiselectors can be of type (I) or (II). I did that in Text-Fabric generically when it extracts
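For the type D case, the reduction of a corpus-level annotation set to a part, with and without "half-annotations", could look roughly like the sketch below. Everything here is hypothetical illustration: the `Annotation` class, `reduce_to_part` and `recompose` are not STAM or Text-Fabric APIs, and a half-annotation is assumed to simply keep its full original target set, including the sentences outside the part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    id: str
    body: str
    targets: frozenset  # ids of the sentences this annotation targets jointly


def reduce_to_part(annotations, part_sentences, keep_half=False):
    """Keep annotations whose targets lie in the part.

    With keep_half=True, annotations that straddle the part boundary are
    kept as well (assumption: unchanged, with their full target set).
    """
    kept = []
    for ann in annotations:
        inside = ann.targets & part_sentences
        if inside == ann.targets:
            kept.append(ann)   # entirely inside the part
        elif inside and keep_half:
            kept.append(ann)   # half-annotation
    return kept


def recompose(*part_sets):
    """Merge part-wise annotation sets, de-duplicating by annotation id."""
    merged = {}
    for annotations in part_sets:
        for ann in annotations:
            merged[ann.id] = ann
    return sorted(merged.values(), key=lambda ann: ann.id)


# Hypothetical corpus: two parts of two sentences each.
part1 = frozenset({"s1", "s2"})
part2 = frozenset({"s3", "s4"})
corpus = [
    Annotation("a1", "rel", frozenset({"s1", "s2"})),
    Annotation("a2", "rel", frozenset({"s2", "s3"})),  # crosses the boundary
    Annotation("a3", "rel", frozenset({"s4"})),
]

lossy = recompose(reduce_to_part(corpus, part1), reduce_to_part(corpus, part2))
lossless = recompose(
    reduce_to_part(corpus, part1, keep_half=True),
    reduce_to_part(corpus, part2, keep_half=True),
)
```

Without half-annotations the boundary-crossing annotation `a2` is lost on recomposition; with them the original set comes back, at the price of each part holding targets it cannot resolve on its own.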
I'll react to your core point first, because you touch upon a very important subject indeed:
This is an interesting point. But I think we are already pretty normative here: the current specification allows only one way to interpret it, namely type II:
This is a bit at odds with what I said earlier, where I said STAM doesn't force a model. That of course raises the question: what about Type I? Are we perhaps too strict?
If we have two types and want to have both, then I agree it's good to make it very explicit what the semantics of each is. We don't want ambiguity here. This would be an actual change to the core model (which we are still at liberty to do at this stage) rather than an extension btw.
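To make the idea of explicit semantics concrete, here is a minimal Python sketch of a multi-target annotation that carries its interpretation as a field. It is not the STAM data model: the names `Interpretation`, `MultiAnnotation` and `restrict` are invented for illustration, and interpretation (II) is read here as "the body applies to the targets taken together", which is an assumption.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class Interpretation(Enum):
    INDIVIDUAL = "I"   # the body annotates each target individually
    JOINT = "II"       # assumption: the body annotates the targets together


@dataclass(frozen=True)
class MultiAnnotation:
    body: str
    targets: tuple
    interpretation: Interpretation


def restrict(ann: MultiAnnotation, part: frozenset) -> Optional[MultiAnnotation]:
    """Restrict an annotation to a part, or return None if that is not possible."""
    inside = tuple(t for t in ann.targets if t in part)
    if not inside:
        return None
    if ann.interpretation is Interpretation.INDIVIDUAL:
        # Type (I): dropping targets outside the part does not change
        # what the body says about the remaining targets.
        return replace(ann, targets=inside)
    # Type (II): the annotation only makes sense with all of its targets.
    return ann if len(inside) == len(ann.targets) else None


part = frozenset({"w1", "w2", "w3", "s1", "s2"})
lexeme = MultiAnnotation("lexeme=go", ("w1", "w3", "w4"), Interpretation.INDIVIDUAL)
relation = MultiAnnotation("parallel", ("s2", "s3"), Interpretation.JOINT)

lexeme_in_part = restrict(lexeme, part)
relation_in_part = restrict(relation, part)
```

With the interpretation stored explicitly, a tool can decide mechanically that the type (I) annotation survives the move to a part with fewer targets, while the type (II) annotation does not.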
But the term is maybe a bit too nerdy. Anyway, I think it is beneficial to have the distinction in the core model of STAM.
(please read the README.md in this PR)