Annotating words with USAS #204

Closed
TomazErjavec opened this issue Mar 30, 2022 · 88 comments
Assignees
Labels
enhancement New feature or request

Comments

@TomazErjavec
Collaborator

The USAS semantic tags will be encoded in a taxonomy (cf. #202), but there remains the question of how to encode these tags (or, rather, references to the IDs of the taxonomy categories) on word tokens. An important complication is that USAS can also tag multi-word expressions (MWEs).

One option would be to directly mark the USAS tag in w/@ana, and, for MWEs, introduce a new element (probably phr) and mark phr/@ana. However, there is a real danger that phr will at times conflict with name, leading to non-well formed XML or difficult fixes.

An alternative which does not have these problems is to use linkGrp, similarly to how we use it for syntax. Here the problem is that the link elements that we have used so far inside linkGrp require at least two IDREFs as the value of their @target, but with USAS we will typically (except for MWEs) have only 1 IDREF. But this can be accommodated by using ptr instead of link (note that ptr/@target can also have several IDREFs).

In line with this, the encoding (suitably simplified) could be like:

<s xml:id="s1">
 <w xml:id="t1">I</w>
 <w xml:id="t2">therefore</w>
 <w xml:id="t3">very</w>
 <w xml:id="t4">much</w>
 <w xml:id="t5">welcome</w>
 <w xml:id="t6">the</w>
 <w xml:id="t7">Government's</w>
 <w xml:id="t8">intention</w>
 <linkGrp type="USAS-SEM">
   <ptr ana="usas:Z8" target="#t1"/>
   <ptr ana="usas:Z5" target="#t2"/>
   <ptr ana="usas:A13.3" target="#t3 #t4"/>
   <ptr ana="usas:Q2.2" target="#t5"/>
   <ptr ana="usas:Z5" target="#t6"/>
   <ptr ana="usas:G1.1" target="#t7"/>
   <ptr ana="usas:X7p" target="#t8"/>
 </linkGrp>
</s>
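
As an illustration of how such a linkGrp could be consumed, here is a minimal Python sketch that resolves each ptr back to the words it targets. It assumes the simplified, namespace-free encoding shown above; the function name usas_spans is hypothetical, and a real ParlaMint file would additionally need the TEI namespace.

```python
import xml.etree.ElementTree as ET

# Simplified sentence from the proposal above (no TEI namespace).
SENTENCE = """<s xml:id="s1">
 <w xml:id="t1">I</w>
 <w xml:id="t2">therefore</w>
 <w xml:id="t3">very</w>
 <w xml:id="t4">much</w>
 <w xml:id="t5">welcome</w>
 <w xml:id="t6">the</w>
 <w xml:id="t7">Government's</w>
 <w xml:id="t8">intention</w>
 <linkGrp type="USAS-SEM">
   <ptr ana="usas:Z8" target="#t1"/>
   <ptr ana="usas:Z5" target="#t2"/>
   <ptr ana="usas:A13.3" target="#t3 #t4"/>
   <ptr ana="usas:Q2.2" target="#t5"/>
   <ptr ana="usas:Z5" target="#t6"/>
   <ptr ana="usas:G1.1" target="#t7"/>
   <ptr ana="usas:X7p" target="#t8"/>
 </linkGrp>
</s>"""

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def usas_spans(s_elem):
    """Return (tag, [word forms]) for every ptr in the USAS-SEM linkGrp."""
    words = {w.get(XML_ID): w.text for w in s_elem.iter("w")}
    spans = []
    for ptr in s_elem.findall("linkGrp[@type='USAS-SEM']/ptr"):
        # @target is a space-separated list of #IDREFs; more than one means an MWE.
        ids = [ref.lstrip("#") for ref in ptr.get("target").split()]
        tag = ptr.get("ana").split(":", 1)[1]  # drop the "usas:" prefix
        spans.append((tag, [words[i] for i in ids]))
    return spans


for tag, forms in usas_spans(ET.fromstring(SENTENCE)):
    print(tag, " ".join(forms))
```

Under this reading, a ptr with two IDREFs is taken to mean a single two-word expression, which is exactly the interpretation being questioned in the replies.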

@matyaskopp, do you see any problems with this suggestion?

@matyaskopp
Collaborator

One option would be to directly mark the USAS tag in w/@ana, and, for MWEs, introduce a new element (probably phr) and mark phr/@ana. However, there is a real danger that phr will at times conflict with name, leading to non-well formed XML or difficult fixes.

Agree

An alternative which does not have these problems is to use linkGrp, similarly to how we use it for syntax. Here the problem is that the link elements that we have used so far inside linkGrp require at least two IDREFs as the value of their @target, but with USAS we will typically (except for MWEs) have only 1 IDREF. But this can be accommodated by using ptr instead of link (note that ptr/@target can also have several IDREFs).

using ptr is a nice hack but I have to admit that I don't like it either. These points annoy me:

  • ptr/@ana looks like you are annotating the pointer, but you want to annotate a word or an MWE
  • assuming the previous item is OK, does <ptr ana="usas:A13.3" target="#t3 #t4"/> express that t3 and t4 form one MWE?

I think the best solution for this situation is to introduce <standOff> annotations. I know that you did not want to introduce them before, but we will definitely introduce some kind of stand-off annotation with timelines and audio alignment, so it can probably solve this issue too.

<TEI>
<!-- ... -->
 <standOff type="USAS-SEM">
   <span ana="usas:Z8" target="#t1"/>
   <span ana="usas:Z5" target="#t2"/>
   <span ana="usas:A13.3" target="#t3 #t4"/>
   <span ana="usas:Q2.2" target="#t5"/>
   <span ana="usas:Z5" target="#t6"/>
   <span ana="usas:G1.1" target="#t7"/>
   <span ana="usas:X7p" target="#t8"/>
 </standOff>
</TEI>

@TomazErjavec
Collaborator Author

using ptr is a nice hack but I have to admit that I don't like it either.

I would say it's a tweak, not a hack, and I didn't actually say I don't like it - in fact, I do!

ptr/@ana looks like you are annotating the pointer but you want to annotate word or mwe

But then you could say exactly the same for linkGrp/link/@ana, which we use already for syntax.

Does this express that t3 and t4 form one mwe?

Yes, exactly.

I think the best solution for this situation is to introduce standOff

Yikes! This would open a whole new can of worms:

  1. standOff needs its own separate TEI document, with its teiHeader, so we would introduce a way of encoding linguistic annotations that is completely different from all the rest
  2. If we were to use it for semantic annotations, why not for syntax, and, in fact, all other linguistic annotations. Not saying this is a completely crazy idea, but it would mean redesigning the complete ling. annotation, I don't think we have the energy and time for that.

we will definitely introduce a kind of stand-off annotations with timelines and audio alignment

Why? What is wrong with the Parla-CLARIN recommendation, except that the description is rather brief? This is the way I encoded speech alignment in GosVL, cf. http://hdl.handle.net/11356/1444, and I thought to use the same system in ParlaMint. This is quite similar to what is proposed in the TEI-based ISO 24624:2016, although they do use annotationBlock to wrap elements.

@matthewcoole
Collaborator

matthewcoole commented Mar 30, 2022

I think I should point out that the example I sent to @TomazErjavec may have caused confusion. MWEs may have different semantic tags for each token within them, the example above just coincidentally happened to have 2 tokens tagged A13.3.

Perhaps a separate <ptr> could be used for the MWE and we add something to the taxonomy for it e.g.

<ptr ana="usas:A13.3" target="#t3"/>
<ptr ana="usas:A13.3" target="#t4"/>
<ptr ana="usas:MWE" target="#t3 #t4"/>

@matyaskopp
Collaborator

But then you could say exactly the same for linkGrp/link/@ana, which we use already for syntax.

I don't think so. We annotate the link between two entities (not child or parent node) in the annotation of syntactic relation. But in USAS case we want to annotate the node (word or mwe) not the ptr.

Does this express that t3 and t4 form one mwe?

Yes, exactly.

How I understand this <ptr ana="usas:A13.3" target="#t3 #t4"/>:

  • it represents two pointers (relations with the USAS taxonomy), both labelled with A13.3
  • it says nothing about relation between t3 and t4

Yikes! This would open a whole new can of worms

  1. standOff needs its own separate TEI document, with its teiHeader, so we would introduce a way of encoding linguistic annotations that is completely different from all the rest

I don't think that standOff needs a separate TEI document. It can be placed at the end of the TEI document if it fulfils the condition given in the TEI Guidelines passage quoted below:

source: https://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SASOstdf

As a member of model.resource, standOff may occur as a child of TEI (or teiCorpus). If the metadata that describes the standOff is largely the same as the metadata that describes the associated resource (e.g., the transcribed text in text), then the standOff and the encoded associated resource may appear as children of the same TEI element. The example below has a transcription with <placeName> elements in the text linked to a list of place elements in the standOff section.

  1. If we were to use it for semantic annotations, why not for syntax, and, in fact, all other linguistic annotations. Not saying this is a completely crazy idea, but it would mean redesigning the complete ling. annotation, I don't think we have the energy and time for that.
  • for syntax - we are annotating arrow (the relation), not starting and ending nodes
  • morphology can be implemented this way, but it is usually a single-word expression and if multiword one it does not break XML tree structure. And it is faster for processing.
  • named entities can break tree structure if there are other mwe (such as hypertext links - I get rid of them in ParlaMint-CZ data)

we will definitely introduce a kind of stand-off annotations with timelines and audio alignment

Why? What is wrong with the ...

sorry, this was a misleading comparison

I am not sure if we can agree now. I hope it will be brighter tomorrow.

@matyaskopp
Collaborator

I think I should point out that the example I sent to @TomazErjavec may have caused confusion. MWEs may have different semantic tags for each token within them, the example above just coincidentally happened to have 2 tokens tagged A13.3.

@matthewcoole I used your online tagger http://ucrel-api.lancaster.ac.uk/usas/tagger.html, and it made me think that I finally understand the point of MWEs. You are annotating both:

  • mwe with one semantic tag
  • words that are part of mwe with different tags

I have tried this sentence: The Manx name of the Isle of Man is Ellan Vannin.
And the vertical output is:

0000001 002  -----   -----                    
0000003 010  AT      The                      Z5 
0000003 020  JJ      Manx                     Z99 
0000003 030  NN1     name                     Q2.2 
0000003 040  IO      of                       Z5 
0000003 050  AT      the                      Z5 
0000003 060  NNL1    Isle                     Z2[i1.3.1 W3 
0000003 070  IO      of                       Z2[i1.3.2 Z5 
0000003 080  NP1     Man                      Z2[i1.3.3 
0000003 090  VBZ     is                       A3+ Z5 
0000003 100  NP1     Ellan                    Z1mf[i2.2.1 Z3c[i2.2.1 
0000003 110  NP1     Vannin                   Z1mf[i2.2.2 Z3c[i2.2.2 
0000003 111  .       .                        

We can decompose this issue into two:

  • how can mwe be implemented in ParlaMint in general (not only USAS case)? I mean mwe that can break tree structure (not named entities)
  • how to encode USAS tags

And then, we can annotate words in @ana attribute: <w ana="usas:A3+ usas:Z5">is</w>
and mwes with <PSEUDO_MWE_ELEMENT ana="usas:..."/>

@matthewcoole
Collaborator

I think that makes sense. I should've been clearer. MWEs should have the same tags I think, but the tokens themselves may have other tags, as in the Isle of Man example above. Correct me if I'm wrong @perayson .

@TomazErjavec
Collaborator Author

OK, if I try to summarize, also to check if I understand:

  1. single words will get one or more USAS annotations
  2. an USAS annotation can be further decomposed into a hierarchically defined class and optional modifiers
  3. MWEs can also be annotated as a whole, but their individual words will still get USAS annotations.

As for the encoding:

  1. In-line annotation is problematic because MWEs need a wrapper element, and this might conflict with NE annotations
  2. linkGrp is problematic as we would need to use ptr, but its semantics is wrong (in the bright light of a new day I agree with @matyaskopp)
  3. standOff is problematic because, even though it can be in the same TEI as the text, it is not inside the sentence, as is the case for the otherwise similar syntactic annotation, but will appear at the end of the TEI; it also introduces a completely new encoding of linguistic annotations, which is not covered by the Parla-CLARIN recommendations and is not currently used in ParlaMint.

All in all I somewhat prefer 1. above - it is the simplest in terms of encoding, and implementing a script that in one way or another fixes conflicts between name and phr should not be that hard - because any such conflicts mean that one annotation was in error, we do not spoil the annotations.

@matyaskopp
Collaborator

matyaskopp commented Mar 31, 2022

All in all I somewhat prefer 1. above - it is the simplest in terms of encoding, and implementing a script that in one way or another fixes conflicts between name and phr should not be that hard - because any such conflicts mean that one annotation was in error, we do not spoil the annotations.

Agree.

For the record, full sample of The Manx name of the Isle of Man is Ellan Vannin. sentence:

<s>
  <w ana="usas:Z5">The</w>
  <w ana="usas:Z99">Manx</w>
  <w ana="usas:Q2.2">name</w>
  <w ana="usas:Z5">of</w>
  <w ana="usas:Z5">the</w>
  <phr ana="usas:Z2">
    <w ana="usas:W3">Isle</w>
    <w ana="usas:Z5">of</w>
    <w>Man</w>
  </phr>
  <w ana="usas:A3+ usas:Z5">is</w> <!-- pointing to invalid ID, but it is not the point of this example -->
  <phr ana="usas:Z1mf usas:Z3c">
    <w>Ellan</w>
    <w>Vannin</w>
  </phr>
  <pc>.</pc>   
</s>

@perayson
Collaborator

I just answered something along similar lines in #202. The tagger itself will output all the possible semantic tags for each word and MWE, but following contextual disambiguation, the first tag in the list should be the most likely. So, we could simplify things for ParlaMint and just provide the first choice tag and remove the remainder. That will give us a certain level of accuracy, but reduce recall of course.

@TomazErjavec
Collaborator Author

So, it is now settled that USAS annotation:

  • will not mark up discontinuous MWEs
  • can mark up continuous MWEs
  • is composed of a list of tags, ordered by the likelihood that the tag is correct, i.e. the first tag is the most likely
  • the tags are composed of alphanumeric characters and full stop. An inspection of tag sequences also shows usage of +, -, /, %, and space.

I would split this encoding question into two parts, how to encode USAS tags in CoNLL-U (relevant for @perayson ), and how in TEI, which will be generated from CoNLL-U (for @matyaskopp and @TomazErjavec).

In CoNLL-U, it is agreed that the USAS tags go into the MISC column. The simple way is to support only mark-up of individual words, but, as long as we have MWEs, we could mark this up as well (under the assumption that two MWEs never overlap). For this the standard way is to use the IOB encoding, which we already use for NER markup in CoNLL-U. This means that

  • each token should have a semantic attribute assigned, let's call it SEM
  • if the token (like punctuation, function words) has no semantic tag assigned, then it is Outside, i.e. it is marked up as SEM=O
  • if the token is the first (often only) one marked up with this tag sequence, it is marked up as Beginning, i.e. SEM=B-<tags>
  • if the token is the n-th (second, third, ...) one marked up with this tag sequence, it is marked up as Inside, i.e. SEM=I-<tags>
  • USAS uses space as the sequence delimiter; space is a bit dangerous, so I propose we change it to comma

So, we would have something like (with irrelevant columns skipped and randomly picked USAS tags):

# sent_id = ParlaMint-GB_2015-01-05-commons.seg1.2
# text = What progress her Department has made on implementing exit checks at borders.
1       What            NER=O|SEM=O
2       progress        NER=O|SEM=B-A10-,A12-
3       her             NER=O|SEM=O
4       Department      NER=O|SEM=B-A11.1+
5       has             NER=O|SEM=O
6       made            NER=O|SEM=B-A1.1.1,G2.2-,X9.2+,E3-,N5+,G2.1%,Z5
7       on              NER=O|SEM=O
8       implementing    NER=O|SEM=B-A1.1.1,H1%
9       exit            NER=O|SEM=B-A11.1+,H2
10      checks          NER=O|SEM=I-A11.1+,H2
11      at              NER=O|SEM=O
12      borders         NER=O|SpaceAfter=No|SEM=B-ZZ2
13      .               NER=O|SEM=O
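
To make the intended reading of this encoding concrete, here is a minimal Python sketch that turns the SEM values above back into spans. The helper names parse_misc and sem_spans are hypothetical, and the input is reduced to (form, MISC) pairs. Note that the B-/I- prefix has to be split off at the first hyphen only, because USAS tags such as A10- themselves contain hyphens.

```python
# (form, MISC) pairs copied from the example above.
ROWS = [
    ("What", "NER=O|SEM=O"),
    ("progress", "NER=O|SEM=B-A10-,A12-"),
    ("her", "NER=O|SEM=O"),
    ("Department", "NER=O|SEM=B-A11.1+"),
    ("has", "NER=O|SEM=O"),
    ("made", "NER=O|SEM=B-A1.1.1,G2.2-,X9.2+,E3-,N5+,G2.1%,Z5"),
    ("on", "NER=O|SEM=O"),
    ("implementing", "NER=O|SEM=B-A1.1.1,H1%"),
    ("exit", "NER=O|SEM=B-A11.1+,H2"),
    ("checks", "NER=O|SEM=I-A11.1+,H2"),
    ("at", "NER=O|SEM=O"),
    ("borders", "NER=O|SpaceAfter=No|SEM=B-ZZ2"),
    (".", "NER=O|SEM=O"),
]


def parse_misc(misc):
    """Split a CoNLL-U MISC value into a key -> value dict."""
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)


def sem_spans(rows):
    """Group tokens into semantic spans according to the IOB prefixes."""
    spans = []
    current = None
    for form, misc in rows:
        sem = parse_misc(misc).get("SEM", "O")
        if sem == "O":
            current = None
            continue
        # Split at the FIRST hyphen only: tags such as "A10-" contain hyphens.
        prefix, tag_list = sem.split("-", 1)
        tags = tag_list.split(",")
        if prefix == "I" and current is not None and current["tags"] == tags:
            current["forms"].append(form)
        else:
            current = {"tags": tags, "forms": [form]}
            spans.append(current)
    return spans


for span in sem_spans(ROWS):
    print(" ".join(span["forms"]), "->", span["tags"][0])
```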

@matyaskopp, do you agree?

As for the TEI encoding, I would postpone this discussion until the CoNLL-U format is finalised.

@matyaskopp
Collaborator

  • the tags are composed of alphanumeric characters and full stop. An inspection of tag sequences also shows usage of +, -, /, %, and space.

Tags do not contain spaces, spaces are tag separators, so they can be replaced with character that is not present in tag, e.g. ,

10     checks                   X2.4/A5.3 A15 Q1.2 T2- M5 

can be encoded this way:

10     checks                   SEM=B-X2.4/A5.3,A15,Q1.2,T2-,M5 

In CoNLL-U, it is agreed that the USAS tags go into the MISC column. The simple way is to support only mark-up of individual words, but as long as we have MWEs, we could mark this up as well (under the assumption that two MWEs never overlap).

True if "overlap" means that the tags in the second and later positions have the same "semantic segmentation". The second and later tags describe the same span (MWE or single token) as the first one.

Otherwise, we have to add B- and I- prefixes to all tags:

9       exit            NER=O|SEM=B-A11.1+,B-H1
10      checks          NER=O|SEM=I-A11.1+,B-H2

@matyaskopp
Collaborator

Now I see my previous comment and example, it seems that there are nested semantic spans:

0000001 002  -----   -----                    
0000003 010  AT      The                      Z5 
0000003 020  JJ      Manx                     Z99 
0000003 030  NN1     name                     Q2.2 
0000003 040  IO      of                       Z5 
0000003 050  AT      the                      Z5 
0000003 060  NNL1    Isle                     Z2[i1.3.1 W3 
0000003 070  IO      of                       Z2[i1.3.2 Z5 
0000003 080  NP1     Man                      Z2[i1.3.3 
0000003 090  VBZ     is                       A3+ Z5 
0000003 100  NP1     Ellan                    Z1mf[i2.2.1 Z3c[i2.2.1 
0000003 110  NP1     Vannin                   Z1mf[i2.2.2 Z3c[i2.2.2 
0000003 111  .       .                        

@TomazErjavec
Collaborator Author

@matyaskopp it looks like you either haven't seen or read what I wrote, because:

Tags do not contain spaces, spaces are tag separators, so they can be replaced with character that is not present in tag, e.g. ,

Yes, that is why I wrote "tag sequences", and also proposed substituting spaces with commas.

Otherwise, we have to add B- and I- prefixes to all tags:

I already proposed adding B and I prefixes to all the tags; we need them to be able to properly encode MWEs (even without nesting). We cannot count on two neighbouring words always having different tags. Also, we already encode NEs with IOB, so it is only sensible to encode semantic annotations in the same way.

it seems that there are nested semantic spans

If this is true (and maybe @perayson can confirm), then we do have a problem. One option would of course be to ignore the inner tags, the same way we do for NER (well, except CZ). If this is for some reason not possible, we have to think again...

@matyaskopp
Collaborator

@matyaskopp it looks like you either haven't seen or read what I wrote, because:

Tags do not contain spaces, spaces are tag separators, so they can be replaced with character that is not present in tag, e.g. ,

Yes, that is why I wrote "tag sequences", and also proposed substituting spaces with commas.

Sorry, I hadn't understood it properly: I imagined a vertical sequence of tags corresponding to multiple tokens.
Now I see what you meant: a "tag sequence" is a list of "tag alternatives".

@perayson
Collaborator

perayson commented Jul 4, 2023

The format with the 10 digits at the start of each line comes from the C version of USAS, and the addition of sequences like [i1.3.2 was intended to show the MWE sequences: [i is the separator, 1 = ID unique within the current file, 3 = this MWE is three words long, and 2 = this is the second element of the MWE sequence. PyMUSAS doesn't produce this format, see https://ucrel.github.io/pymusas/usage/how_to/tag_text instead for output examples. I think the BIO format is a good idea as long as we say that it represents information about the most likely (i.e. first) tag in the list. i.e. if the word is the part of the MWE, then we use B and I for internal MWE parts, otherwise O. The remainder of the tag list might include other (non-preferred) MWE tags or single word tags (but we ignore the MWE information about the non-preferred tags).

Note that punctuation also gets a Z9 tag in wmatrix6, PyMUSAS itself (at the moment) outputs PUNCT (which I should change). Function words will also get other Z tags. SEM=O-X3.4 will indicate that X3.4 is a single word semantic tag.
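
The marker format described above can be unpacked mechanically. The following Python sketch is only an illustration of that description: the names parse_c_tag and semmwe are hypothetical, and the B/I/O rule follows the suggestion that only the first (most likely) tag decides the MWE status.

```python
import re

# tag, then "[i", then <MWE id>.<MWE length>.<position in MWE>
MWE_MARKER = re.compile(
    r"^(?P<tag>[^\[]+)\[i(?P<id>\d+)\.(?P<length>\d+)\.(?P<pos>\d+)$"
)


def parse_c_tag(raw):
    """Split one tag from the C-version output into the tag and its MWE marker."""
    m = MWE_MARKER.match(raw)
    if m is None:
        return {"tag": raw, "mwe": None}
    return {
        "tag": m["tag"],
        "mwe": (int(m["id"]), int(m["length"]), int(m["pos"])),
    }


def semmwe(tag_list):
    """B/I/O for a token, decided by its first (most likely) tag only."""
    first = parse_c_tag(tag_list.split()[0])
    if first["mwe"] is None:
        return "O"
    return "B" if first["mwe"][2] == 1 else "I"


print(parse_c_tag("Z2[i1.3.2"))
print(semmwe("Z2[i1.3.1 W3"), semmwe("Z2[i1.3.2 Z5"), semmwe("A3+ Z5"))
```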

@TomazErjavec
Collaborator Author

PyMUSAS doesn't produce this format

OK, this is a relief!

see https://ucrel.github.io/pymusas/usage/how_to/tag_text instead for output examples

Is there maybe a document explaining the format? The only output example I understand there is the one for English, and that one is rather short. In particular, I'm still not completely clear on whether all words inside an MWE have the same tags, or whether they can differ, so that the complete MWE has one tag but there can be further word-specific tags in the tag list, so that in effect tags can be nested. From what you write below, my guess is that the second is the case, so I continue with that understanding.

I think we then have two options:

1 use IOB for all sem tags, and always keep only the first tag. The fact that something is a MWE or not is distinguished by the I tag on the second, third etc. MWE token. As you write that all tokens get a sem tag, we would not in fact have any O tags
2 use two attributes, say 1) SEMMWE for MWEs, which uses IOB format, it then contains the first (MWE) tag only, if it is inside a MWE (B for first, I for internal), and O otherwise, and (2) a SEM tag for single word tag lists, which does not need to be IOB, but just gives a list of tags for each token (without the MWE tag)

So, let's say we have a sentence "a b c", with b and c being a MWE. a has sem tags 1,2, b has 3, c has 4, while the MWE tag is M, so that b has the list "M,3" and c has "M,4".

So, the first option would give:

a   SEM=B-1
b   SEM=B-M
c   SEM=I-M

and the second would be:

a   SEMMWE=O SEM=1,2
b   SEMMWE=B-M SEM=3
c   SEMMWE=I-M SEM=4

I hope it is more or less clear what I mean....

I am slightly in favour of the first, as it is simpler, but the second does not lose any information. Thoughts?

Note that punctuation also gets a Z9 tag in wmatrix6, PyMUSAS itself (at the moment) outputs PUNCT (which I should change).

OK, so I guess this means that all tokens get a sem tag, nice.

@matyaskopp
Collaborator

I am slightly in favour of the first, as it is simpler, but the second does not lose any information. Thoughts?

I see another option:
3 use IOB for all sem tags (O will never be used if all tokens are tagged). Add an I- or B- prefix to every tag and separate the tags with a comma:

a   SEM=B-1,B-2
b   SEM=B-M,B-3
c   SEM=I-M,B-4

It is a simple encoding, and an MWE will never overlap with a different MWE. You can always get the first option from it: s/,.*$//
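
For illustration, the same reduction and the per-tag prefixes of option 3 can be sketched in Python. The function names are hypothetical; the prefix is again split at the first hyphen only, since real USAS tags may end in a hyphen.

```python
def first_choice(sem_value):
    """Option 3 -> option 1: keep only the first tag (same as s/,.*$//)."""
    return sem_value.split(",", 1)[0]


def split_option3(sem_value):
    """Decode an option-3 value into (prefix, tag) pairs."""
    pairs = []
    for item in sem_value.split(","):
        prefix, _, tag = item.partition("-")  # split at the first hyphen only
        pairs.append((prefix, tag))
    return pairs


for token, sem in [("a", "B-1,B-2"), ("b", "B-M,B-3"), ("c", "I-M,B-4")]:
    print(token, first_choice(sem), split_option3(sem))
```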

OK, so I guess this means that all tokens get a sem tag, nice.

yes because we don't have syntactic analysis for the translated version (so no syntactic tokens).

@TomazErjavec
Collaborator Author

I see another option: use IOB for all sem tags (O will never be used if all tokens are tagged). Add an I- or B- prefix to every tag and separate the tags with a comma

Wow, good one! It is a slight perversion over the usual IOB rules but I think it solves all the problems, so I also vote for it!

@perayson
Collaborator

I'm not sure that I understand option 3 because it makes sense to me that a word which is not part of a MWE should be labelled O (not B). To answer an earlier question, for semantic MWEs, each word has the same semantic field tag. So, for those, I would use B for the first part, and I for the rest. My preferred option is to keep the tags and MWE markers separate (as per the current pymusas output), so here is option 4:

a   SEMMWE=O SEM=1,2
b   SEMMWE=B SEM=M,3
c   SEMMWE=I SEM=M,4

@TomazErjavec
Collaborator Author

I'm not sure that I understand option 3 because it makes sense to me that a word which is not part of a MWE should be labelled O (not B).

The idea is that there is no formal distinction between single and multi-word expressions; the only difference is that tags for single words will always be B, while for MWEs the second, third etc. word will be I. And, as every token gets a semantic tag, there are no tokens outside some semantic tag, so nothing is marked with O.

As for your suggestion, the use of IOB here is different from what it is otherwise, e.g. in our NER annotation. If you want to split MWE annotations from single word annotations, a more common way would be:

a   SEMMWE=O SEM=1,2
b   SEMMWE=B-M SEM=3
c   SEMMWE=I-M SEM=4

Still, not to overcomplicate: why don't you do it in the way that you feel most comfortable with, and then me and @matyaskopp can, if necessary, modify it to be the most in line with our NER annotation?

@perayson
Collaborator

Thanks, so I've tweaked the tagging script to produce this format:

# sent_id =     ParlaMint-ES-CT_2022-01-25-2201.1.0.8.1
# source =      Sí; senyor Garriga ?
# text =        Yes, Mr. Garriga?
1       Yes     yes     INTJ    UH      _       0       _       _       ForwardAlignment=1|BackwardAlignment=1|NER=O|SpaceAfter=No|SEMMWE=O|SEM=Z4
2       ,       ,       PUNCT   ,       _       1       _       _       ForwardAlignment=2|BackwardAlignment=2|NER=O|SEMMWE=O|SEM=Z9
3       Mr.     Mr.     PROPN   NNP     Number=Sing     2       _       _       ForwardAlignment=3|BackwardAlignment=3|NER=O|SEMMWE=B|SEM=Z1mf,Z3c
4       Garriga Garriga PROPN   NNP     Number=Sing     3       _       _       ForwardAlignment=4|BackwardAlignment=4|NER=B-PER|SpaceAfter=No|SEMMWE=I|SEM=Z1mf,Z3c
5       ?       ?       PUNCT   .       _       4       _       _       ForwardAlignment=5|BackwardAlignment=5|NER=O|SEMMWE=O|SEM=Z9
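
A minimal Python sketch of how this output format could be read back, assuming only what the sample shows: SEMMWE is B, I or O, and SEM is a comma-separated list with the preferred tag first. The input is reduced to (form, MISC) pairs and the name mwe_groups is hypothetical.

```python
# (form, MISC) pairs taken from the sample sentence above.
TOKENS = [
    ("Yes", "ForwardAlignment=1|BackwardAlignment=1|NER=O|SpaceAfter=No|SEMMWE=O|SEM=Z4"),
    (",", "ForwardAlignment=2|BackwardAlignment=2|NER=O|SEMMWE=O|SEM=Z9"),
    ("Mr.", "ForwardAlignment=3|BackwardAlignment=3|NER=O|SEMMWE=B|SEM=Z1mf,Z3c"),
    ("Garriga", "ForwardAlignment=4|BackwardAlignment=4|NER=B-PER|SpaceAfter=No|SEMMWE=I|SEM=Z1mf,Z3c"),
    ("?", "ForwardAlignment=5|BackwardAlignment=5|NER=O|SEMMWE=O|SEM=Z9"),
]


def mwe_groups(tokens):
    """Group tokens into single words and MWEs using SEMMWE; keep the first SEM tag."""
    groups = []
    for form, misc in tokens:
        feats = dict(kv.split("=", 1) for kv in misc.split("|"))
        tags = feats["SEM"].split(",")
        if feats["SEMMWE"] == "I" and groups:
            # Continuation of the MWE opened by the preceding B token.
            groups[-1]["forms"].append(form)
        else:
            groups.append(
                {"forms": [form], "tag": tags[0], "mwe": feats["SEMMWE"] == "B"}
            )
    return groups


for group in mwe_groups(TOKENS):
    print(" ".join(group["forms"]), group["tag"], "MWE" if group["mwe"] else "")
```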

If you approve, then I can start to set up the tagging jobs. By the way, please can you confirm where I should download the final MT CONLLU format data from?

@TomazErjavec TomazErjavec added this to the ParlaMint 3.1 release milestone Sep 24, 2023
TomazErjavec added a commit that referenced this issue Sep 24, 2023
@JohnVidler
Collaborator

Hey folks, sorry for the delay, been dealing with a major issue on another project.

Btw, what do the three asterisk signs mean in ES-en?

Just highlighting what had changed, with the table being so large - sorry, should have said :)

Unfortunately, all the files there, unlike yours, unpack to mnt/zfs/ucrel-data/, so I need to change it a bit

.. that's odd - I've got tar set to build relative paths, will look to fix this on the next build, then to repackage the existing ones with the fixed paths.

I'm still a little maxed out on the other project, but I'll see if I can get AT and CZ running overnight, along with the path fix.

@TomazErjavec
Collaborator Author

Hey folks, sorry for the delay, been dealing with a major issue on another project.

Not a problem, we are busy with finishing the original language corpora anyway.

I've got tar set to build relative paths, will look to fix this on the next build

No need, I've got my unpacking set up the way it is now, so it would only mean I have to change things at my end again.

One thing: I tried running my scripts over BA, and after everything crashed, found out that one of your CoNLL-U files abruptly terminates in the middle of the original file, this one: ParlaMint-BA-en.conllu/2006/ParlaMint-BA-en_2006-09-18-0.conllu
No idea what goes wrong exactly with this one, at first I thought it was because of "==" in the text, which is rather an unusual combination of chars, but others have this as well. Under the assumption that other truncated files would also end with the # text = line I made a script that checks for this, but the file above was the only one it finds. So, fingers crossed that this is indeed the only bad file, but I can only really tell when I try to merge CoNLL-U into the XML files.
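
A check along these lines could look like the following Python sketch. It is not the script actually used; it only encodes the stated assumption that a truncated file ends with a # text = comment line rather than with a token line. The function names and the directory argument are hypothetical.

```python
from pathlib import Path


def ends_with_text_comment(text):
    """True if the last non-blank line is a '# text =' comment."""
    lines = [line for line in text.splitlines() if line.strip()]
    return bool(lines) and lines[-1].startswith("# text =")


def find_truncated(root):
    """Yield every .conllu file under root that ends with a '# text =' line."""
    for path in sorted(Path(root).rglob("*.conllu")):
        if ends_with_text_comment(path.read_text(encoding="utf-8")):
            yield path


# Hypothetical usage:
# for path in find_truncated("ParlaMint-BA-en.conllu"):
#     print("possibly truncated:", path)
```

A file that is cut off in the middle of a token line would not be caught by this check, which is why it can only be a first filter.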

Anyway, could you re-annotate ParlaMint-BA-en.conllu/2006/ParlaMint-BA-en_2006-09-18-0.conllu please?

@TomazErjavec
Collaborator Author

In addition to AT and CZ, some other MTed files are now also ready:

FI is still to come, will be finished shortly.

And, yes, I need a newly annotated ParlaMint-BA-en.conllu/2006/ParlaMint-BA-en_2006-09-18-0.conllu

@TomazErjavec
Collaborator Author

FI is now also available: https://nl.ijs.si/et/tmp/ParlaMint/MT/CoNLL-U-en/ParlaMint-FI-en.conllu.zip
Note that we discovered in the MT process that a lot (about 7%) of the sentences are in fact in Swedish, although not marked as such. As the MT model expects Finnish, these sentences are left untranslated. I guess USAS will assign the tag for unknown words here.

@JohnVidler
Collaborator

I just tried kicking off the UA and HU jobs, but the zip files seem to be malformed?

Archive:  ParlaMint-HU-en.conllu.zip
   creating: ParlaMint-HU-en.conllu/
   creating: ParlaMint-HU-en.conllu/2014/
  inflating: ParlaMint-HU-en.conllu/2014/ParlaMint-HU-en_2014-05-10.conllu
error: invalid zip file with overlapped components (possible zip bomb)

With equivalent output for the UA file.

Can you take a look @TomazErjavec ?

@TomazErjavec
Collaborator Author

the zip files seem to be malformed?

I just tried getting the file and unzipping it on my machine, and it works fine, also for UA and FI. Weird.

Anyway, I have now made .tgz files for FI, HU and UA; I hope that will work better. Same location as before, i.e. https://nl.ijs.si/et/tmp/ParlaMint/MT/CoNLL-U-en/?C=M;O=D

@JohnVidler
Collaborator

Super weird, I re-downloaded the zips to check again and they seem to be happy now?

In any case, just a quick note here to say I've been running everything at our side here, and it all looks good bar one issue where the script isn't super happy affecting a single file. I'm currently investigating that and hope to have everything published tonight/tomorrow in the web folder.

As we have a few versions of these archives kicking around now, I'm also generating an md5 for each of the files, so you can confirm you have the latest/correct version.

@JohnVidler
Collaborator

I've processed and uploaded a new set of tar's over at http://ucrel-api-01.lancaster.ac.uk/vidler/ - the only failing file is ParlaMint-BA-en_2006-09-07-0.conllu which has a processing error (The 'Translated' field is being interpreted as 'None' which is breaking the output causing the script to crash on that one) and I'll sink some time into that tomorrow and get it reuploaded.

For now, I suggest not using the BA-en files.

There's also now an archives.md5 which includes md5 hashes for each file, to allow integrity checks, in case they're required.
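
If such an integrity check is wanted, it could be sketched in Python as below. The layout of archives.md5 is not shown in this thread; the sketch assumes the usual md5sum layout of one "<hash>  <filename>" pair per line, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def parse_md5_list(text):
    """Parse md5sum-style lines ('<hash>  <name>' or '<hash> *<name>')."""
    expected = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def file_md5(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks so that large archives fit in memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def mismatches(listing_text, directory):
    """Names from the listing that are missing or whose hash differs."""
    bad = []
    for name, digest in parse_md5_list(listing_text).items():
        path = Path(directory) / name
        if not path.is_file() or file_md5(path) != digest:
            bad.append(name)
    return bad
```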

I've fixed the tar enclosed paths @TomazErjavec so I'm afraid your programs will need reverting to their previous paths - it was a bug that needed fixing as any changes to our build system here would mean a different path in the resultant .tar - sorry!

Also, according to my load tests - I can now apparently rebuild this lot over about a 24-48hr period no problem at all 🙂

@TomazErjavec
Collaborator Author

Thanks @JohnVidler, got the missing files. I don't see any particular need to use the checksums, unless something freaky happens again. And good to hear that speed is not an issue.
So:

@JohnVidler
Collaborator

JohnVidler commented Oct 3, 2023

No problem, I'll set ES-CT going in a moment, and I'll be looking at GA and GB today, so they should be up shortly, barring any major issue

Edit: Ah, GB got missed because it didn't follow the 'XX-en' pattern I was using to automatically download everything - whoops. Getting that started too.

@TomazErjavec
Collaborator Author

No problem, I'll set ES-CT going in a moment, and I'll be looking at GA and GB today, so they should be up shortly, barring any major issue
Edit: Ah, GB got missed because it didn't follow the 'XX-en' pattern I was using to automatically download everything - whoops. Getting that started too.

@JohnVidler, any news on GB and ES-CT? As well as on the missing BA file?

And we now also got the last corpus translated, if you could process this one as well please: https://nl.ijs.si/et/tmp/ParlaMint/MT/CoNLL-U-en/ParlaMint-ES-PV-en.conllu.zip

@JohnVidler
Collaborator

Hey @TomazErjavec - disruption to my working pattern slowed me down - ES-CT is now in the usual spot: ( http://ucrel-api-01.lancaster.ac.uk/vidler/ ) GB is mostly playing well with the tooling here now, but I've got a couple of errors I'm fixing today so that should be up shortly.

I'll kick off ES-PV now and it'll be up for tomorrow morning, assuming we hit no problems.

@TomazErjavec
Collaborator Author

Thanks @JohnVidler for ES-CT. There were some problems, because a) it seems ES-CT actually deleted some files from the first round and b) you seem to have expanded the new corpus into the old directory, so those files persisted there. The result was havoc with my integration program, but I managed to identify the spurious files and delete them, so now all is ok.
And looking forward to the final corpora!
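As an illustration of how such leftover files can be spotted, here is a minimal sketch that lists files present in an extraction directory but absent from the archive. It is not the integration program mentioned above; the function name and arguments are hypothetical.

```python
import posixpath
import tarfile
from pathlib import Path


def spurious_files(archive_path, extracted_root):
    """List files under extracted_root that the archive does not contain."""
    with tarfile.open(archive_path) as archive:
        # Normalise member names so './a/b' and 'a/b' compare equal
        expected = {
            posixpath.normpath(member.name)
            for member in archive.getmembers()
            if member.isfile()
        }
    root = Path(extracted_root)
    present = {
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    }
    return sorted(present - expected)
```

Anything it returns was left behind by an earlier extraction (or added by hand) and can be reviewed before deletion.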

@TomazErjavec
Collaborator Author

Heads up @JohnVidler, we are now running very late. We still need:

  • 1 missing BA file
  • ES-PV
  • GB

We would really need to release ParlaMint-en soon, and the processing at this end takes some time too (assuming no problems, otherwise even more...)

@JohnVidler
Collaborator

@TomazErjavec

GB to follow shortly, apologies for the delay!

@JohnVidler
Collaborator

Note that GB has rather large log files, as the tooling repeatedly complains about the missing sources - I've left these in for now as the output still needs to be sanity checked by @perayson, but I've uploaded the .tar.gz anyway so you can get started @TomazErjavec on the assumption that all is well.

If the log size is a problem, let me know and I'll strip the warnings out and re-upload.

@TomazErjavec
Collaborator Author

@JohnVidler, thanks for the corpora. Got all 3 of them, and at first glance they look ok (but I do have to comment on the inventiveness of the paths, ES-PV and BA in mnt/zfs/ucrel-data, and GB in home/ubuntu/:).
Logs are not a problem. But if @perayson finds problems with GB pls. let me know before I start processing it.
And thanks for your work!

@perayson
Collaborator

I've had a look at GB this morning, the semantic tagging looks fine, however we never really agreed an input/output format for GB as it's different from the translated corpora. Can you have a look @TomazErjavec and let us know what else needs to be retained, if anything, from the input?

@JohnVidler
Collaborator

JohnVidler commented Oct 24, 2023

Argh, apologies for the path mixup - I had to run GB on its own, hence the different path, but the darned version of tar on the box seems to ignore the -C directive to handle non-absolute paths.
I've rebuilt the files and can update the published ones with the corrected paths if this is simpler.
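One way to keep the enclosed paths independent of where the build runs is to set each member name explicitly rather than relying on the working directory. A minimal sketch using Python's standard `tarfile` module follows; it is not the actual build tooling used here, and the function names are hypothetical.

```python
import tarfile
from pathlib import Path


def build_archive(archive_path, source_dir):
    """Pack source_dir so member paths start at its own name, not at '/'.

    The archive should be written outside source_dir, otherwise it would
    be picked up as one of its own members.
    """
    source_dir = Path(source_dir).resolve()
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Stored name: <top-level dir>/<path relative to it>
                arcname = Path(source_dir.name) / path.relative_to(source_dir)
                archive.add(path, arcname=arcname.as_posix())


def member_names(archive_path):
    """Return the sorted member paths stored in the archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(archive.getnames())
```

With this layout the stored paths stay the same whether the corpus lives under a home directory or a data mount.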

@TomazErjavec
Collaborator Author

I've rebuilt the files and can update the published ones with the corrected paths if this is simpler.

No @JohnVidler, it's ok, I have all the files now the way I want them here. But I thought I should mention it!

I've had a look at GB this morning, the semantic tagging looks fine, however we never really agreed an input/output format for GB as it's different from the translated corpora. Can you have a look @TomazErjavec and let us know what else needs to be retained, if anything, from the input?

@perayson, I think it's ok the way it is. I did some pre-processing and nothing broke. So, I think we can consider the delivery of all the files done! (well, except if some later stage of processing, in particular the conversion into TEI, shows some unexpected problems, but I am optimistic that it won't).

@perayson
Collaborator

ok, great, thanks for confirming!

@TomazErjavec
Collaborator Author

The points here have been mostly solved, what remains should be taken up in #827.
