Deeper nested child annotations #11

Closed
nikopartanen opened this issue Nov 8, 2017 · 6 comments

Comments

@nikopartanen

nikopartanen commented Nov 8, 2017

Hi! I have been trying to automate some of our project workflows with pympi, but I have run into problems with the tier structure I want. The structure is:

- reference tier (refT, independent)
    \- transcription tier (orthT, symbolic association)
        \- token tier (wordT, symbolic subdivision)

The starting point looks like this:

image

And with the token tier populated it will be like this:

image

The problem is that it doesn't seem possible to create the word-level annotations so that they correctly reference the transcription tier. Instead, it seems necessary to point the references at the topmost tier.

If I do:

import pympi

elan_file = pympi.Eaf(file_path='example.eaf')
elan_file.add_ref_annotation(id_tier='word@Niko', tier2='ref@Niko', time=10, value='Words')
elan_file.add_ref_annotation(id_tier='word@Niko', tier2='ref@Niko', time=10, value='here', prev='a' + str(elan_file.maxaid))
elan_file.add_ref_annotation(id_tier='word@Niko', tier2='ref@Niko', time=10, value='.', prev='a' + str(elan_file.maxaid))

This works, and the resulting file opens fine, but the internal arrangement of references is quite different from what you get when the annotations are added in ELAN itself, for example by tokenizing the transcription tier with "Tokenize tier…".

If I try to refer directly to the transcription tier, I get an error:

elan_file.add_ref_annotation(id_tier='word@Niko', tier2='orth@Niko', time=10, value='Words')
elan_file.add_ref_annotation(id_tier='word@Niko', tier2='orth@Niko', time=10, value='here', prev='a' + str(elan_file.maxaid))
elan_file.add_ref_annotation(id_tier='word@Niko', tier2='orth@Niko', time=10, value='.', prev='a' + str(elan_file.maxaid))
...
/Users/niko/.local/lib/python3.6/site-packages/pympi/Elan.py in add_ref_annotation(self, id_tier, tier2, time, value, prev, svg)
    332                 break
    333         if not ann:
--> 334             raise ValueError('There is no annotation to reference to.')
    335         aid = self.generate_annotation_id()
    336         self.annotations[aid] = id_tier

ValueError: There is no annotation to reference to.
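One way to think about what a lookup on the transcription tier would need: its annotations carry no times of their own, so a time such as 10 can only be matched after following each reference up to a time-aligned ancestor. The sketch below is a toy model with plain dictionaries, not pympi's internals. The annotation IDs, time values, and the helper names `resolve_span` and `find_parent` are all hypothetical.

```python
# Toy model only: plain dicts standing in for an ELAN file, not pympi's data.
# Time-aligned annotations on the independent tier: id -> (start_ms, end_ms, value)
aligned = {"a1": (0, 2000, "ref-001")}

# Reference annotations per dependent tier: id -> (id referred to, value)
ref_tiers = {
    "orth@Niko": {"a2": ("a1", "Words here.")},
}


def resolve_span(ann_id):
    """Follow references upward until a time-aligned annotation is reached."""
    seen = set()
    while ann_id not in aligned:
        if ann_id in seen:
            raise ValueError("reference cycle at " + ann_id)
        seen.add(ann_id)
        for annotations in ref_tiers.values():
            if ann_id in annotations:
                ann_id = annotations[ann_id][0]
                break
        else:
            raise ValueError("dangling reference: " + ann_id)
    start, end, _ = aligned[ann_id]
    return start, end


def find_parent(tier, time):
    """Return the id of the annotation on `tier` whose inherited span covers `time`."""
    for ann_id in ref_tiers[tier]:
        start, end = resolve_span(ann_id)
        if start <= time <= end:
            return ann_id
    raise ValueError("There is no annotation to reference to.")


parent_id = find_parent("orth@Niko", 10)
print(parent_id)
```

With a lookup like this, a new word-level annotation could record `parent_id` (an annotation on the transcription tier) as its reference instead of the annotation on the reference tier.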

In the ELAN XML the problem looks like this. The question mark marks the annotation that, as far as I understand, the reference should point to:

image

Is there some way to add the lower-level tiers with correct references? The current arrangement does work, but it is a bit dangerous in the long run: programmatically manipulated files may end up with a different structure from files that have been edited manually. Such differences make it impossible to parse the content correctly just by following tier relations and annotation IDs, or else the parsing logic would have to differ between files.
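The structural difference described here can be checked directly in the EAF XML, without pympi. The following is a minimal sketch using only the standard library: it reports every REF_ANNOTATION whose ANNOTATION_REF points to an annotation that is not on the tier named in PARENT_REF. The tier and type names come from this issue, but the fragment, the annotation IDs and values, and the function name `misplaced_references` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EAF fragment. The word tier refers to "a1"
# on the reference tier instead of "a2" on its parent, the orth tier.
EAF_FRAGMENT = """
<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="ref@Niko" LINGUISTIC_TYPE_REF="refT">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>ref-001</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="orth@Niko" PARENT_REF="ref@Niko" LINGUISTIC_TYPE_REF="orthT">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>Words here.</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="word@Niko" PARENT_REF="orth@Niko" LINGUISTIC_TYPE_REF="wordT">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a3" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>Words</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a4" ANNOTATION_REF="a1" PREVIOUS_ANNOTATION="a3">
        <ANNOTATION_VALUE>here</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def misplaced_references(root):
    """List ref annotations whose target is not on the tier's PARENT_REF."""
    # Map every annotation id to the tier that contains it.
    owner = {}
    for tier in root.iter("TIER"):
        for ann in tier.findall("ANNOTATION/*"):
            owner[ann.get("ANNOTATION_ID")] = tier.get("TIER_ID")

    problems = []
    for tier in root.iter("TIER"):
        parent = tier.get("PARENT_REF")
        for ref in tier.iter("REF_ANNOTATION"):
            target = ref.get("ANNOTATION_REF")
            if owner.get(target) != parent:
                # (annotation id, target id, tier of target, expected tier)
                problems.append(
                    (ref.get("ANNOTATION_ID"), target, owner.get(target), parent)
                )
    return problems


problems = misplaced_references(ET.fromstring(EAF_FRAGMENT))
for annotation_id, target, found_on, expected_on in problems:
    print(annotation_id, "->", target, "is on", found_on, "but expected", expected_on)
```

Running a check like this on both a manually tokenized file and a programmatically populated one would show whether the two share the same reference structure.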

@dopefishh
Owner

Thanks for the feedback. I don't think there is currently a way to do this, but it could be added.
I haven't worked on this for quite some time, so I will have to delve into the matter again and could use all the help I can get to speed things up. I'll try to get back to you ASAP.

@nikopartanen
Author

Thanks for the reply! Take your time! I usually use R when I work with ELAN files, but when I need to create or populate new tiers I've often relied on pympi anyway. All in all it is a very useful package, thanks for the great work!

The kinds of use cases I would like to handle with pympi are, for example, taking the transcription tier, tokenizing it, and writing the tokens into a new symbolic subdivision tier (the example above). Another is taking the tokens, sending them to a morphological analyzer, and writing the result into a new tier below. So far pympi has not been used for this, but I would be more than happy to switch to it in this scenario too, as it could be a more generic approach than what my project has now. I can send you the script my colleagues are writing, but it isn't online yet, so I can't link to it directly.
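The tokenization use case can be sketched independently of any ELAN library. The code below only plans the annotations for a symbolic subdivision tier: each token refers to the same parent annotation and to the previous token, mirroring the `prev` argument used earlier in this issue. The tokenizer regex, the function names, and the annotation IDs are assumptions for illustration, not part of pympi or ELAN.

```python
import re


def tokenize(text):
    """Split text into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def subdivide(parent_id, text, next_id):
    """Plan symbolic-subdivision annotations for one parent annotation.

    Each planned annotation refers to `parent_id` and, except for the
    first one, names the annotation that precedes it.
    """
    planned = []
    prev = None
    for token in tokenize(text):
        ann_id = "a%d" % next_id
        planned.append(
            {"id": ann_id, "ref": parent_id, "prev": prev, "value": token}
        )
        prev = ann_id
        next_id += 1
    return planned


# "a2" stands for a hypothetical annotation on the transcription tier.
tokens = subdivide("a2", "Words here.", next_id=3)
for token in tokens:
    print(token)
```

The same planning step would apply one level down, for instance with each analyzer result referring to a word annotation instead of a transcription annotation.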

This is the kind of structure I use:

So the word, lemma, pos and morph tiers all use symbolic subdivision, and their references always point to the tier directly above. I guess the main change would be the ability to specify how new annotations are added, following the requirements of the stereotype used. Maybe this is already taken into account somehow. It is very possible that I am not using pympi correctly!

@nikopartanen
Author

I just added the script my colleagues and I have been using to this repository. There are also links to a few papers where we describe the workflow:

https://github.com/langdoc/elan-fst

In principle it has two distinct parts: one reads from and writes to ELAN files, and the other sends tokens to the morphological analyzer and parses the output. We have quite clear plans for how to develop the analyzer part further, but I think that rewriting the ELAN manipulation part with pympi could make it more generalizable and compact. I'm adding this information here because it connects to what I described earlier in this issue.

@dopefishh
Owner

closed by #14

@nikopartanen
Author

Hi! I would just like to confirm whether this feature works now. I have updated pympi, but I still get "ValueError: There is no annotation to reference to." when I try to add child annotations at deeper levels. Is there an example somewhere of how to do this correctly? Thanks!

@dopefishh
Owner

Hi, I'm afraid PR #14 apparently didn't solve it. I'll reopen this one. I'm happy to accept PRs to fix this.
