English PTB to UD 2.0 #717

Open
arademaker opened this issue Jul 24, 2020 · 83 comments

@arademaker
Contributor

arademaker commented Jul 24, 2020

Does anyone know what the best approach is to convert a treebank in PTB format to UD 2.0? I found the page https://nlp.stanford.edu/software/stanford-dependencies.html, but it is not clear whether the code supports UD 2.0. Suggestions are welcome.

@dan-zeman dan-zeman added dependencies English features UPOS Universal part-of-speech tags: definitions and examples labels Jul 24, 2020
@dan-zeman dan-zeman added this to the later milestone Jul 24, 2020
@amir-zeldes
Contributor

You can use CoreNLP to convert PTB brackets for English to UD v1 (more or less, I think it represents a particular moment in time before 2.0 was released, but fairly close to v1 still), like this:

java -cp "*;" -Dfile.encoding=UTF-8 edu.stanford.nlp.trees.ud.UniversalDependenciesConverter -encoding UTF-8 -treeFile FILENAME

If you have a good conversion to Stanford Dependencies, you can also use DepEdit to convert the data to the current UD standard, more or less accurately depending on whether you have some additional annotations (e.g. entities to resolve flat/compound better, etc.). This process is described and evaluated in this paper:

https://www.aclweb.org/anthology/W18-4918/

Finally you can also use a quick and dirty UD1>UD2 DepEdit script to transform the CoreNLP output from the command above to the current guidelines, but there are certain to be errors if you don't have the additional annotations from the paper. This basically just renames the labels that were changed in V2, rewires cc+conj, etc.:

pos=/VERB/;func=/nmod/	#1>#2	#2:func=obl
func=/.*/;func=/conj/;func=/cc/	#1>#2;#1>#3;#1.*#2	#2>#3
func=/dobj/	none	#1:func=obj
func=/mwe/	none	#1:func=fixed
func=/name|foreign/	none	#1:func=flat
func=/neg/	none	#1:func=advmod
func=/nsubjpass/	none	#1:func=nsubj:pass
func=/auxpass/	none	#1:func=aux:pass
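
As a rough illustration of what the renaming rules above do, here is a minimal Python sketch that applies them to one sentence of CoNLL-U rows. It is not DepEdit and not the script from the paper: the function name `relabel_ud1_to_ud2` is made up for this example, it assumes the `VERB` test in the first rule is matched against the UPOS column, and it deliberately leaves out the cc/conj rewiring rule.

```python
# Hypothetical helper for illustration; not part of DepEdit or CoreNLP.
# Pure label renames copied from the rules above.
RENAMES = {
    "dobj": "obj",
    "mwe": "fixed",
    "name": "flat",
    "foreign": "flat",
    "neg": "advmod",
    "nsubjpass": "nsubj:pass",
    "auxpass": "aux:pass",
}


def relabel_ud1_to_ud2(rows):
    """Return a relabeled copy of one sentence.

    rows: list of 10-column CoNLL-U token rows (lists of strings).
    """
    # Assumption: the VERB test applies to UPOS (4th column).
    upos_by_id = {row[0]: row[3] for row in rows}
    out = []
    for row in rows:
        row = list(row)  # copy, so the input is left untouched
        deprel, head = row[7], row[6]
        if deprel in RENAMES:
            row[7] = RENAMES[deprel]
        elif deprel == "nmod" and upos_by_id.get(head) == "VERB":
            # an nmod attached to a verb becomes obl
            row[7] = "obl"
        out.append(row)
    return out


# Made-up toy sentence in UD v1 style.
sentence = [
    ["1", "She", "she", "PRON", "PRP", "_", "2", "nsubj", "_", "_"],
    ["2", "saw", "see", "VERB", "VBD", "_", "0", "root", "_", "_"],
    ["3", "it", "it", "PRON", "PRP", "_", "2", "dobj", "_", "_"],
    ["4", "in", "in", "ADP", "IN", "_", "5", "case", "_", "_"],
    ["5", "Paris", "Paris", "PROPN", "NNP", "_", "2", "nmod", "_", "_"],
]
converted = relabel_ud1_to_ud2(sentence)
print([row[7] for row in converted])
```

Like the quick and dirty script itself, this only touches labels, so anything that needs the extra annotations from the paper will still come out wrong.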

If you want the code from the paper, let me know, but it is probably not 100% runnable out of the box (hardwired paths etc.)

@sebschu
Member

sebschu commented Jul 27, 2020

Since CoreNLP v4.0.0, the converter actually outputs UDv2!

You can run it, as suggested by Amir, using the command:

java -cp "*;" -Dfile.encoding=UTF-8 edu.stanford.nlp.trees.ud.UniversalDependenciesConverter -encoding UTF-8 -treeFile FILENAME

@arademaker
Contributor Author

arademaker commented Sep 15, 2020

Just to let people know... I got some errors when I ran the UD validation script on the output produced by CoreNLP 4.0 over the https://catalog.ldc.upenn.edu/LDC2013T19 dataset. The 15 most frequent errors are:

41505  [L3 Syntax rel-upos-cop] 'cop' should be 'AUX' or 'PRON'/'DET' but it is 'VERB'
1780  [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'ADP'
 780  [L3 Syntax right-to-left-conj] Relation 'conj' must go left-to-right.
 568  [L3 Syntax rel-upos-aux] 'aux' should be 'AUX' but it is 'VERB'
 489  [L3 Syntax rel-upos-punct] 'punct' must be 'PUNCT' but it is 'SYM'
 320  [L3 Syntax rel-upos-nummod] 'nummod' should be 'NUM' but it is 'DET'
 304  [L3 Syntax rel-upos-nummod] 'nummod' should be 'NUM' but it is 'ADJ'
 234  [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'X'
 208  [L3 Syntax rel-upos-cc] 'cc' should not be 'DET'
 175  [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'NOUN'
 136  [L3 Syntax upos-rel-punct] 'PUNCT' must be 'punct' but it is 'conj'
  63  [L3 Syntax rel-upos-case] 'case' should not be 'ADJ'
  61  [L3 Syntax rel-upos-nummod] 'nummod' should be 'NUM' but it is 'ADV'
  48  [L3 Syntax rel-upos-mark] 'mark' should not be 'DET'
  46  [L3 Syntax right-to-left-appos] Relation 'appos' must go left-to-right.
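
A frequency table like the one above can be produced from the raw validator output with a few lines of Python. This is only a sketch: the function name `tally_errors` is made up, and the regular expression assumes each message looks like the validator lines quoted in this issue (an optional location prefix, then `[L3 Syntax test-id]`, then the message text).

```python
import re
from collections import Counter

# Assumed message shape, e.g.
# [Line 542 Sent 17 Node 20]: [L3 Syntax rel-upos-advmod] 'advmod' should be ...
MESSAGE = re.compile(r"\[(L\d) (\w+) ([\w-]+)\]\s*(.*)")


def tally_errors(lines):
    """Count validator messages by (test id, message text)."""
    counts = Counter()
    for line in lines:
        match = MESSAGE.search(line)
        if match:
            _level, _error_class, test_id, text = match.groups()
            counts[(test_id, text.strip())] += 1
    return counts


sample = [
    "[Line 542 Sent 17 Node 20]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'NOUN'",
    "[Line 940 Sent 35 Node 23]: [L3 Syntax rel-upos-nummod] 'nummod' should be 'NUM' but it is 'ADV'",
    "[Line 1340 Sent 52 Node 28]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'ADP'",
    "a line that is not a validator message",
]
counts = tally_errors(sample)
for (test_id, text), n in counts.most_common():
    print(n, test_id, text)
```

Keying on the message text as well as the test id keeps, for example, the ADP and NOUN variants of rel-upos-advmod apart, which is how the list above is broken down.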

@nschneid
Contributor

Any update on CoreNLP's PTB->UD conversion producing invalid UD? @sebschu @manning @AngledLuffa

@AngledLuffa

that looks like a project! i will find time this year to start chipping away at that, but there's some work i simply can't put off any longer as i promised it for an upcoming industry event

@AngledLuffa

actually, one way to speed this up would be to suggest a few command lines for doing the validation

@nschneid
Contributor

I think this should run validation for EWT:

$ cd UD_English-EWT
$ git clone https://github.com/UniversalDependencies/tools/
$ tools/validate.py --lang en en_ewt-ud-{dev,test,train}.conllu

@AngledLuffa

Drilling down a bit into the most common error, that of a cop being tagged VERB instead of AUX, here is a concrete example from this EWT tree:

( (S
    (NP-SBJ (DT The) (JJ actual) (NN vote))
    (VP (VBZ is)
      (ADJP-PRD
        (NP (DT a) (JJ little))
        (JJ confusing)))
    (. .)))

Our POS tag converter code has a comment:

https://github.com/stanfordnlp/CoreNLP/blob/main/src/edu/stanford/nlp/trees/UniversalPOSMapper.java
https://github.com/stanfordnlp/CoreNLP/blob/3499d27e615c35702f23948e886a7389b5695c33/data/edu/stanford/nlp/upos/ENUniversalPOS.tsurgeon#L45

% Don't do this, we are now treating these as copular constructions

and that part of the conversion being commented out results in the tag VERB instead of AUX

1       The     the     DET     DT      _       3       det     _       _
2       actual  actual  ADJ     JJ      _       3       amod    _       _
3       vote    vote    NOUN    NN      _       7       nsubj   _       _
4       is      be      VERB    VBZ     _       7       cop     _       _
5       a       a       DET     DT      _       6       det     _       _
6       little  little  ADJ     JJ      _       7       obl:npmod       _       _
7       confusing       confusing       ADJ     JJ      _       0       root    _       _
8       .       .       PUNCT   .       _       7       punct   _       _

whereas the UD version of that sentence is

# sent_id = weblog-blogspot.com_aggressivevoicedaily_20060629164800_ENG_20060629_164800-0002
# text = The actual vote is a little confusing.
1       The     the     DET     DT      Definite=Def|PronType=Art       3       det     3:det   _
2       actual  actual  ADJ     JJ      Degree=Pos      3       amod    3:amod  _
3       vote    vote    NOUN    NN      Number=Sing     7       nsubj   7:nsubj _
4       is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   7       cop     7:cop   _
5       a       a       DET     DT      Definite=Ind|PronType=Art       6       det     6:det   _
6       little  little  ADJ     JJ      Degree=Pos      7       obl:npmod       7:obl:npmod     _
7       confusing       confusing       ADJ     JJ      Degree=Pos      0       root    0:root  SpaceAfter=No
8       .       .       PUNCT   .       _       7       punct   7:punct _
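
Until the converter itself is fixed, the difference between the two outputs above can be papered over after the fact. The sketch below is a post-processing workaround, not the CoreNLP fix: the function name `retag_aux` is made up, and it simply assumes that any token with UPOS VERB attached as cop, aux or aux:pass should have been AUX.

```python
# Hypothetical post-processing step; not part of CoreNLP.
AUX_RELATIONS = {"cop", "aux", "aux:pass"}


def retag_aux(rows):
    """Return a copy of the sentence with VERB changed to AUX on cop/aux tokens.

    rows: list of 10-column CoNLL-U token rows (lists of strings).
    """
    out = []
    for row in rows:
        row = list(row)  # copy, so the input is left untouched
        if row[7] in AUX_RELATIONS and row[3] == "VERB":
            row[3] = "AUX"
        out.append(row)
    return out


# Rows 3, 4 and 7 of the converter output above.
converter_rows = [
    ["3", "vote", "vote", "NOUN", "NN", "_", "7", "nsubj", "_", "_"],
    ["4", "is", "be", "VERB", "VBZ", "_", "7", "cop", "_", "_"],
    ["7", "confusing", "confusing", "ADJ", "JJ", "_", "0", "root", "_", "_"],
]
fixed_rows = retag_aux(converter_rows)
print(fixed_rows[1][3])
```

This leaves XPOS alone and only touches UPOS, so the output still reflects the tags in the input tree.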

First there's a somewhat unfortunate DRY violation here, in that the same rules are repeated in the tsurgeon file and in the constituency -> dependency converter rules:

https://github.com/stanfordnlp/CoreNLP/blob/main/src/edu/stanford/nlp/trees/UniversalEnglishGrammaticalRelations.java

So I'll need to figure out how extensive that problem is and how best to resolve it. There have been a few dependency converter fixes over the years which I assume are not reflected in any way in the POS converter. I also need to figure out how or why this particular rule about cop is being ignored and what to do to fix it.

The other errors probably have similar origins when it comes to UPOS tags being flagged by the validator. They'll each require some individual attention regarding what kind of tree is causing the error and how to fix.

AngledLuffa added a commit to stanfordnlp/CoreNLP that referenced this issue Feb 23, 2024
@AngledLuffa

for my own reference, i've been doing this to check a single tree:

java edu.stanford.nlp.trees.ud.UniversalDependenciesConverter -encoding UTF-8 -treeFile foo.mrg

or this for an entire slice of PTB:

java edu.stanford.nlp.trees.ud.UniversalDependenciesConverter -encoding UTF-8 -treeFile path/to/en_ptb3_test.mrg > en_ptb_test.conll
tools/validate.py --lang en en_ptb_test.conll --no-tree-text --max-err lots

So here's the next phrase in the dev set which isn't a cop AUX error

                (SBAR
                  (WHNP-1 (WDT which) )
                  (S
                    (NP-SBJ (-NONE- *T*-1) )
                    (VP (VBZ seems)
                      (PP (TO to)
                        (NP (PRP me) ))
                      (ADJP-PRD
                        (ADVP (NN sort) (IN of) )
                        (JJ draconian) ))))))))))

Our converter turns this into

16      which   which   PRON    WDT     _       17      nsubj   _       _
17      seems   seem    VERB    VBZ     _       10      acl:relcl       _       _
18      to      to      ADP     TO      _       19      case    _       _
19      me      I       PRON    PRP     _       17      obl     _       _
20      sort    sort    NOUN    NN      _       22      advmod  _       _
21      of      of      ADP     IN      _       20      case    _       _
22      draconian       draconian       ADJ     JJ      _       17      xcomp   _       _

The error given is

[Line 542 Sent 17 Node 20]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'NOUN'

however, I can find this sentence in EWT which has a similar structure

# sent_id = answers-20111107080027AA9zCIG_ans-0005
# text = its kind of expensive though
1-2     its     _       _       _       _       _       _       _       _
1       it      it      PRON    PRP     Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs  5       nsubj   5:nsubj _
2       s       be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|Typo=Yes|VerbForm=Fin  5       cop     5:cop   CorrectForm='s
3       kind    kind    NOUN    NN      ExtPos=ADV|Number=Sing  5       advmod  5:advmod        _
4       of      of      ADP     IN      _       3       fixed   3:fixed _
5       expensive       expensive       ADJ     JJ      Degree=Pos      0       root    0:root  _
6       though  though  ADV     RB      _       5       advmod  5:advmod        _

so that's, quoting the French treebanks this time, kind of BS

although I do notice one difference, that "kind of ___" is fixed, as opposed to our converter, which turned "sort of ___" into case(sort, of)

Editing the dependencies to make that a fixed does in fact make the error go away. So apparently that's the fix needed here... the converter needs to turn sort of, kind of, and whatever else matches into fixed instead of case

Continuing to dig into this, the converter has another component which breaks out fixed expressions prior to the tregex expressions run in UniversalEnglishGrammaticalRelations: https://github.com/stanfordnlp/CoreNLP/blob/main/src/edu/stanford/nlp/trees/CoordinationTransformer.java

hey, as it turns out, there's already a thing which does kind of:

https://github.com/stanfordnlp/CoreNLP/blob/3499d27e615c35702f23948e886a7389b5695c33/src/edu/stanford/nlp/trees/CoordinationTransformer.java#L677

    TregexPattern.compile("@ADVP < ((RB|NN=node1 < /^(?i)kind$/) $+ (IN|RB=node2 < /^(?i)of$/))"), //kind of

So this fix is actually rather simple, aside from all the spelunking needed. Just need to turn that kind into kind|sort and make sure that doesn't make a hash of everything else. Looking over the changes it makes to the PTB train set, it's all perfectly reasonable, such as "this project is sort of annoying" and other examples. And hey, not only has this fixed the error in the dev set I was looking at, it also fixes 5 of the 13,633 errors in the train set.
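
For anyone who wants to see the effect at the dependency level rather than in the Tregex pattern, here is a minimal Python sketch. It is a post-processing illustration only, not what the converter does internally: the function name `mark_kind_of_fixed` is made up, and it assumes the target shape is an "of" attached as case to an immediately preceding "kind" or "sort" that is itself an advmod.

```python
# Hypothetical post-processing step; the real fix lives in the
# Tregex pattern in CoordinationTransformer.
HEADS = {"kind", "sort"}


def mark_kind_of_fixed(rows):
    """Relabel case(kind|sort, of) as fixed when kind/sort is an advmod.

    rows: list of 10-column CoNLL-U token rows (lists of strings).
    """
    by_id = {row[0]: row for row in rows}
    out = []
    for row in rows:
        row = list(row)  # copy, so the input is left untouched
        head = by_id.get(row[6])
        if (
            row[1].lower() == "of"
            and row[7] == "case"
            and head is not None
            and head[1].lower() in HEADS
            and head[7] == "advmod"
            and int(head[0]) + 1 == int(row[0])  # "of" directly follows its head
        ):
            row[7] = "fixed"
        out.append(row)
    return out


# Tokens 20-22 of the converter output above.
before = [
    ["20", "sort", "sort", "NOUN", "NN", "_", "22", "advmod", "_", "_"],
    ["21", "of", "of", "ADP", "IN", "_", "20", "case", "_", "_"],
    ["22", "draconian", "draconian", "ADJ", "JJ", "_", "17", "xcomp", "_", "_"],
]
after = mark_kind_of_fixed(before)
print([row[7] for row in after])
```

The adjacency and advmod checks are there so that an ordinary noun use such as "a sort of bird" is left alone.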

@AngledLuffa

"This time around_ADP, they're moving even faster" was converted to advmod(time, around). "the last time around" received a similar treatment.

( (S
    (NP-TMP
      (NP (DT This) (NN time) )
      (ADVP (RP around) ))
    (, ,)
    (NP-SBJ (PRP they) )
    (VP (VBP 're)
      (VP (VBG moving)
        (ADVP (RB even) (RBR faster) )))
    (. .) ))

Here are some similar examples in EWT:

17      sometime        sometime        ADV     RB      _       15      advmod  15:advmod       _
18      around  around  ADP     IN      _       19      case    19:case _
19      mid-August      mid-August      PROPN   NNP     Number=Sing     17      obl     17:obl:around   SpaceAfter=No
# sent_id = email-enronsent40_01-0086
# text = - Arrv. Nice around noon?
1       -       -       PUNCT   NFP     _       2       punct   2:punct _
2       Arrv.   arrive  VERB    VB      Abbr=Yes|VerbForm=Inf   0       root    0:root  _
3       Nice    Nice    PROPN   NNP     Number=Sing     2       obl:npmod       2:obl:npmod     _
4       around  around  ADP     IN      _       5       case    5:case  _
5       noon    noon    NOUN    NN      Number=Sing     2       obl     2:obl:around    SpaceAfter=No
6       ?       ?       PUNCT   .       _       2       punct   2:punct _
# sent_id = email-enronsent40_01-0099
11      around  around  ADP     IN      _       12      case    12:case _
12      noon    noon    NOUN    NN      Number=Sing     10      obl     10:obl:around   SpaceAfter=No
23      actions action  NOUN    NNS     Number=Plur     20      conj    20:conj:and|28:nsubj    _
24      around  around  ADP     IN      _       27      case    27:case _
25      the     the     DET     DT      Definite=Def|PronType=Art       27      det     27:det  _
26      same    same    ADJ     JJ      Degree=Pos      27      amod    27:amod _
27      time    time    NOUN    NN      Number=Sing     23      nmod    23:nmod:around  _

Also looking through GUM a bit, it looks like this should be case? But I'm not 100% convinced that's correct. Any suggestions on what to do would be welcome.

@AngledLuffa

"double" written out is being transformed by our converter into a nummod

[Line 940 Sent 35 Node 23]: [L3 Syntax rel-upos-nummod] 'nummod' should be 'NUM' but it is 'ADV'
21      received        receive VERB    VBN     _       4       ccomp   _       _
22      about   about   ADV     RB      _       23      advmod  _       _
23      double  double  ADV     RB      _       26      nummod  _       _
24      the     the     DET     DT      _       26      det     _       _
25      usual   usual   ADJ     JJ      _       26      amod    _       _
26      volume  volume  NOUN    NN      _       21      obj     _       _

This is because the converter gets a QP and thinks, ah, QP, that's obviously a nummod:

          (VP (VBN received)
            (NP
              (NP
                (QP (RB about) (RB double) )
                (DT the) (JJ usual) (NN volume) )
              (PP (IN of)
                (NP (NNS calls) )))
            (PP-TMP (IN over)
              (NP (DT the) (NN weekend) ))))))

If I look around for possibly similar usages of double in GUM and EWT, it would appear they are typically labeled as amod

# sent_id = GUM_conversation_blacksmithing-85
# text = We — that was kind of a double thing that, we had in — in another class, so it was kinda review for us.
7       a       a       DET     DT      Definite=Ind|PronType=Art       9       det     9:det   _
8       double  double  ADJ     JJ      Degree=Pos      9       amod    9:amod  _
9       thing   thing   NOUN    NN      Number=Sing     0       root    0:root|13:obj   _
# sent_id = answers-20111108083754AAEw5Xc_ans-0016
# text = Travelling on your own you would have to pay double as cabins are sold on the basis of double occupancy.
18      of      of      ADP     IN      _       20      case    20:case _
19      double  double  ADJ     JJ      Degree=Pos      20      amod    20:amod _
20      occupancy       occupancy       NOUN    NN      Number=Sing     17      nmod    17:nmod:of      SpaceAfter=No

However, I'm not sure this is 100% indicative, as those usages of "double" are a bit different. Closer is "twice", such as

# sent_id = newsgroup-groups.google.com_alt.animals_0e65f540816d780c_ENG_20041116_124800-0040
25      twice   twice   ADV     RB      NumForm=Word|NumType=Mult       27      advmod  27:advmod       _
26      that    that    ADV     RB      _       27      advmod  27:advmod       _
27      much    much    ADV     RB      _       22      advmod  22:advmod       _
# sent_id = answers-20111108105629AAiZUDY_ans-0049
3       twice   twice   ADV     RB      NumForm=Word|NumType=Mult       5       advmod  5:advmod        _
4       my      my      PRON    PRP$    Case=Gen|Number=Sing|Person=1|Poss=Yes|PronType=Prs     5       nmod:poss       5:nmod:poss     _
5       size    size    NOUN    NN      Number=Sing     0       root    0:root  _

I like those examples more, and they seem to suggest advmod. It is worth pointing out those are not in QPs in the original EWT trees.

Digging deeper and looking at "half" in the original EWT trees, "half opened" is not in a QP, whereas "half of the furniture" is. "half of what A&E charges" and "half the price" are not. "less than half of the price" IS. "about half the time quoted" is. "half" in this case is tagged DT/PDT as opposed to ADV/RB from "double the usual volume". So that makes me wonder if that "double" was supposed to be a DT, or at least would be in the EWT paradigm? But then there's this usage of "half", which also looks like a weird tagging to me:

# sent_id = weblog-blogspot.com_alaindewitt_20060924104100_ENG_20060924_104100-0028
# text = These 22 countries, with all their oil and natural resources, have a combined GDP smaller than that of Netherlands plus Belgium and equal to half of the GDP of California alone.
26      to      to      ADP     IN      _       27      case    27:case _
27      half    half    NOUN    NN      Number=Sing|NumForm=Word|NumType=Frac   25      obl     25:obl:to       _
28      of      of      ADP     IN      _       30      case    30:case _
29      the     the     DET     DT      Definite=Def|PronType=Art       30      det     30:det  _
30      GDP     GDP     PROPN   NNP     Number=Sing     27      nmod    27:nmod:of      _
31      of      of      ADP     IN      _       32      case    32:case _
32      California      California      PROPN   NNP     Number=Sing     30      nmod    30:nmod:of      _
33      alone   alone   ADV     RB      _       32      advmod  32:advmod       SpaceAfter=No

Effectively, once again, I have no idea what the ultimate resolution of this structure should be.

Hopefully this is somewhat illustrative as to why there is very little movement over time for this issue: there are probably zero people in the world in the center of the Venn diagram of "understands the converter", "feels comfortable making authoritative decisions about dependencies", and "has the time to make these changes".

@nschneid
Contributor

I am happy to weigh in to clarify the UD annotation policies. :) It is not surprising that this will be a nontrivial change as in the last couple of years there have been some notable general guidelines changes, some major revisions of English-specific policies (like relative clauses, pronouns, and passives), and hundreds of smaller corrections and policy changes. Some will be reflected in the main UD validator, and others are checked in English-specific validation scripts.

You are quite right that fixed expressions trigger exceptions to the validator rules. Almost all of these fixed expressions are documented here.

I've responded to your question about "this time around" in UniversalDependencies/UD_English-GUM#81.

My gut feeling for "double the price" is advmod. nummod should be limited to actual numbers.* Is it possible to change the QP rule to check for a number (tagged NUM)? (* An exception: Currently ordinal dates e.g. "February 28th" have NOUN/nummod to attach the date to the month but this needs to be changed.)
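
To make the "check for NUM" idea concrete, here is a minimal Python sketch of the same test applied to converter output instead of inside the QP rule. The function name `demote_non_numeric_nummod` is made up, and the advmod fallback only reflects the gut feeling above; it is not a settled policy.

```python
# Hypothetical post-processing step; not part of CoreNLP.
def demote_non_numeric_nummod(rows, fallback="advmod"):
    """Relabel nummod as `fallback` when the dependent is not tagged NUM.

    rows: list of 10-column CoNLL-U token rows (lists of strings).
    """
    out = []
    for row in rows:
        row = list(row)  # copy, so the input is left untouched
        if row[7] == "nummod" and row[3] != "NUM":
            row[7] = fallback
        out.append(row)
    return out


# Tokens 22, 23 and 26 of the "about double the usual volume" example,
# plus a made-up genuine numeral for contrast.
qp_rows = [
    ["22", "about", "about", "ADV", "RB", "_", "23", "advmod", "_", "_"],
    ["23", "double", "double", "ADV", "RB", "_", "26", "nummod", "_", "_"],
    ["26", "volume", "volume", "NOUN", "NN", "_", "21", "obj", "_", "_"],
    ["27", "three", "three", "NUM", "CD", "_", "28", "nummod", "_", "_"],
]
relabeled = demote_non_numeric_nummod(qp_rows)
print([row[7] for row in relabeled])
```

Passing a different `fallback` makes it easy to try another analysis without touching the check itself.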

@amir-zeldes
Contributor

See my response on "around" in UniversalDependencies/UD_English-GUM#81

I think in "received double the price", "double" is obj, and "the price" is a modifier of some kind, perhaps nmod:npmod is the best option. My reasoning is that you can drop "the price" and reconstruct it contextually with no change in meaning, but if you drop "double" you get a totally different reading:

  • We received double the price
  • We received double (=of the price)
  • We received the price (totally different reading)

Interrogative test:

  • What did you receive?
  • Double the price
  • Double (same meaning)
  • The price (not the same meaning)
  • That (antecedent: "double the price" not "the price")

@amir-zeldes
Contributor

zero people in the world in the center of the Venn diagram

That's probably true, but there are perhaps more grad students with ML skills who might be persuaded to work on postediting the converter output based on trying to match the final UD product in a corpus like EWT... I would actually think that an ML step might be needed anyway for really good results, since UD trees express some things that PTB trees just don't distinguish.

@AngledLuffa

I would actually think that an ML step might be needed anyway for really good results, since UD trees express some things that PTB trees just don't distinguish.

I think part of the appeal of this converter is that it is fast, whereas using an ML step to convert the trees would be orders of magnitude slower. Certainly I would expect it to be more accurate, though.

@AngledLuffa

IN vs RB vs RP in PTB is also giving me headaches for various short phrases. For example, close down_RB, drive down_IN, walk up_IN, laid out_RP, peer out_IN ....

This leads to an error

[Line 1340 Sent 52 Node 28]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'ADP'

in the phrase

                              (VP
                                (ADVP (RB just) )
                                (VBZ drives)
                                (NP (DT the) (NNS prices) )
                                (ADVP-DIR (IN down) )
                                (ADVP (RBR further) )))))))))))))))
23      which   which   PRON    WDT     _       25      nsubj   _       _
24      just    just    ADV     RB      _       25      advmod  _       _
25      drives  drive   VERB    VBZ     _       17      ccomp   _       _
26      the     the     DET     DT      _       27      det     _       _
27      prices  price   NOUN    NNS     _       25      obj     _       _
28      down    down    ADP     IN      _       25      advmod  _       _
29      further far     ADV     RBR     _       25      advmod  _       _

@nschneid
Contributor

Is (ADVP-DIR (IN down) ) an error in the Penn tree? I would have expected RB since it's an adverb phrase.

@AngledLuffa

Is (ADVP-DIR (IN down) ) an error in the Penn tree?

I think so, but I don't think the converter is the right place to editorialize PTB tags. Perhaps there's some room to apply some heuristics such as a singleton ADVP is treated as a particle in the "go down", "take down", "drive down" senses... I do wonder how easy it will be to distinguish servers and coal miners going down, though, or the sentence "If you're not busy, why not drive down this weekend?"

@nschneid
Contributor

Yeah this is why I don't like the idiomaticity criterion. Probably best to trust the Penn tree and live with the occasional stray validator error caused by a Penn error.

@AngledLuffa

In terms of fixed expressions, how about en masse? That occurs a couple times in PTB

23      with    with    ADP     IN      _       26      case    _       _
24      high    high    ADJ     JJ      _       26      amod    _       _
25      debt    debt    NOUN    NN      _       26      compound        _       _
26      ratios  ratio   NOUN    NNS     _       22      nmod    _       _
27      will    will    AUX     MD      _       29      aux     _       _
28      be      be      AUX     VB      _       29      aux:pass        _       _
29      dumped  dump    VERB    VBN     _       6       ccomp   _       _
30      en      en      ADP     IN      _       31      case    _       _
31      masse   masse   NOUN    NN      _       29      advmod  _       _
32      to      to      PART    TO      _       33      mark    _       _
33      discuss discuss VERB    VB      _       20      advcl   _       _
34      ,       ,       PUNCT   ,       _       33      punct   _       _
35      en      en      X       FW      _       36      compound        _       _
36      masse   masse   X       FW      _       33      obj     _       _
37      ,       ,       PUNCT   ,       _       33      punct   _       _
38      certain certain ADJ     JJ      _       40      amod    _       _
39      controversial   controversial   ADJ     JJ      _       40      amod    _       _
40      proposals       proposal        NOUN    NNS     _       33      obj     _       _
17      individuals     individual      NOUN    NNS     _       18      nsubj   _       _
18      ran     run     VERB    VBD     _       0       root    _       _
19      from    from    ADP     IN      _       21      case    _       _
20      the     the     DET     DT      _       21      det     _       _
21      market  market  NOUN    NN      _       18      obl     _       _
22      en      en      X       FW      _       23      compound        _       _
23      masse   masse   X       FW      _       18      advmod  _       _

Note the inconsistent tagging. I'd like to throw the PTB into space... but I do like fixing trivial errors in large projects

@nschneid
Contributor

"en masse" is a good one. Not fixed (that's limited to grammatical expressions) but it falls under our newly articulated policy on foreign expressions. My inclination would be to say the whole thing is a borrowed adverb-expression, so flat(en/ADV masse/ADV).

@AngledLuffa

grammatical expressions

Whatever heuristics I have developed to understand these things, they are failing me in this interpretation of "en masse" as not being a grammatical expression. Would you clarify that a little bit?

Also, to be clear, en is the head here, right? advmod attachment in each of the three cases I posted above?

@AngledLuffa

As for the first, perhaps it's a bit misleading if the guidelines say that splitting is right - shouldn't the validator serve to warn users that an output is not what UD expects?

This seems like a reasonable argument. It's just a situation with another unfixable validator error after using the converter on PTB. I don't mind, since there are a lot of unfixable errors at this point anyway

It may not be finalized for a few weeks, but Chris has access to the draft proposal—you can ask him what would be the cleanest way to tweak the rules to pass validation while moving in the direction things are headed (or if he thinks that direction is wrong).

I would expect the tag changes to lists won't require more linguistic knowledge than I have to implement - what's the draft proposal look like?

@amir-zeldes
Contributor

We have been following Penn tokenization for "gonna" etc. right

Indeed, and they segment "gonna" into "gon" + "na", as we do as well (15/15 times), so I think colloquial contractions like this should generally be broken up.

@AngledLuffa

Indeed, and they segment "gonna" into "gon" + "na", as we do as well (15/15 times), so I think colloquial contractions like this should generally be broken up.

Agreed on that, but if a user gives the converter a tree with one of these contractions as a single word, I think it would be incorrect for the converter to split it for them. Similar to my current belief that it should return the same XPOS the user gives (via the tree) and the UPOS should correspond to the XPOS, even if that means the dependencies created violate the validator's rules about UPOS.

So basically there's a whole bunch of errors the validator will flag on the output of the converter when given PTB unless we start editing the input trees in ways which users would find surprising

@amir-zeldes
Contributor

Yeah, that all sounds right. In terms of silencing the validator about the results of that, I wouldn't lose too much sleep over it if you want to do it, but I sort of find it right if the validator throws a warning, since the output indeed does not correspond to the recommended UD English standard, so users should be warned.

@dan-zeman
Member

Indeed, and they segment "gonna" into "gon" + "na", as we do as well (15/15 times), so I think colloquial contractions like this should generally be broken up.

Agreed on that, but if a user gives the converter a tree with one of these contractions as a single word, I think it would be incorrect for the converter to split it for them. Similar to my current belief that it should return the same XPOS the user gives (via the tree) and the UPOS should correspond to the XPOS, even if that means the dependencies created violate the validator's rules about UPOS.

So basically there's a whole bunch of errors the validator will flag on the output of the converter when given PTB unless we start editing the input trees in ways which users would find surprising

Seems like we have different expectations of what "a converter to UD" is. FWIW, in my converter of Czech PDT to UD, I want the output to be as good/valid UD as possible given the input. I am not trying to output something that will be as close as possible to the original PDT, just with some UD labeling in places where it does not hurt feelings of users who live outside UD.

@AngledLuffa

I want the output to be as good/valid UD as possible given the input

Well, we do have a PTB Correcting script which fixes up a bunch of known errors (mostly tags, but could include retokenizing mighta or whatever) in PTB. I could envision connecting that to the converter as an optional feature.

@AngledLuffa

ps. not that I was offended by your wording, but "hurt feelings" or unmatched expectations tend to express themselves in the form of github issues

@nschneid
Contributor

I want the output to be as good/valid UD as possible given the input

Well, we do have a PTB Correcting script which fixes up a bunch of known errors (mostly tags, but could include retokenizing mighta or whatever) in PTB. I could envision connecting that to the converter as an optional feature.

How well-established is this convention in Penn data? Is it just a one-off where they neglected to tokenize "mighta" or is it a repeated thing? I couldn't find other tokens in OntoNotes but not sure if I was searching correctly. If it only comes up once or twice in the Penn trees then I don't think UD should necessarily enact a policy just to accommodate that. But if it's a clear policy of PTB then we should be prepared to either convert or accommodate that tokenization in UD.

@sylvainkahane
Contributor

Indeed, and they segment "gonna" into "gon" + "na", as we do as well (15/15 times), so I think colloquial contractions like this should generally be broken up.

Segmenting "gonna" into "gon" + "na" has to be justified. We have already discussed this in #1006. If we look at all the realisations of the lexeme TO (lemma=to) in UD_English-GUM (and if we exclude orthographic variations), we have 4 realisations (gon-na, ought-a, got-ta). It would probably be more justified to consider that TO has only two allomorphs, to and a.

But the real problem is that we don't have any criterion to decide how to segment these kinds of words in UD and how to choose the form of their parts.

@amir-zeldes
Contributor

It would probably be more justified to consider that TO has only two allomorphs, to and a

Maybe - it wouldn't have been hard to do "gonn a" if we were doing this from scratch, but since Penn corpora already went with "gon na", I don't mind having a third form too much. At least we're consistent with other English corpora this way.

@nschneid
Contributor

Plus, as a practical matter, less risk of POS taggers mislabeling the "a" as a determiner!

@AngledLuffa

could always split it as migh ta woul da etc or maybe that just looks really bad

splitting it as might have seems like a violation of the general principle that we leave pieces in such a form that they get combined to rebuild the original text. although we only generally do that in English, whereas other languages with MWT have their own split schemes that often leave the pieces different from the original
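The reconstruction principle can be checked mechanically. Below is a minimal sketch using the candidate splits floated in this thread; `rebuilds_surface` is a hypothetical helper, not part of any UD or CoreNLP tool.

```python
def rebuilds_surface(surface, pieces):
    """True if concatenating the pieces reproduces the surface token exactly."""
    return "".join(pieces) == surface


# Candidate segmentations mentioned in the discussion.
candidates = {
    "gonna": [["gon", "na"], ["gonn", "a"], ["going", "to"]],
    "mighta": [["migh", "ta"], ["might", "a"], ["might", "have"]],
}

for surface, splits in candidates.items():
    for pieces in splits:
        verdict = "ok" if rebuilds_surface(surface, pieces) else "changes the text"
        print(f"{surface} -> {' + '.join(pieces)}: {verdict}")
```

Only the normalizing splits (going + to, might + have) fail the check. The choice between gon + na and gonn + a therefore has to be made on other grounds, such as consistency with the Penn corpora.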

AngledLuffa added a commit to stanfordnlp/CoreNLP that referenced this issue Mar 20, 2024
… in the PTB conversion to dependencies by about 250. Weirdly this is by removing 280 syntax errors and adding 40 morpho errors for aux verbs. Presumably those should be fixable. Of course, there is always more that can be done - there are now 2622 errors left when using the converter. UniversalDependencies/docs#717
@AngledLuffa

i spent longer than necessary fixing up ccomp relations (and therefore the associated UD converter errors) in sentences such as

"Working on this issue is a pain," complained AngledLuffa

ccomp(complained, pain)

One sentence that still goes wonky from PTB is the following:

( (S
    (S-TPC-2
      (NP-SBJ
        (NP (DT Those) )
        (VP (VBN employed)
          (NP (-NONE- *) )
          (PP-LOC-CLR (IN in)
            (NP (JJ state-funded) (JJ special) (NNS programs) ))))
      (VP (VBN increased)
        (PP-EXT (IN by)
          (NP (CD 7,400) ))
        (PP-DIR (TO to)
          (NP (CD 65,200) ))
        (PP-TMP (IN in)
          (NP (DT the) (JJ same) (NN period) )))
      (, ,) )
    (NP-SBJ (DT the) (NNP Directorate) )
    (VP (VBD said)
      (SBAR (-NONE- 0)
        (S (-NONE- *T*-2) )))
    (. .) ))

In this case, I believe increased is mistagged and should be VBD, since it is something that actively happened rather than the participle use. Does that sound right?

I can make a new release of CoreNLP which greatly reduces the number of errors in converted PTB once I wrap up this tiny change, but completely eliminating them with a deterministic converter is optimistic

@AngledLuffa

AngledLuffa commented Apr 18, 2024

... ultimately I don't see a difference in the verb usage in the following sentences, but I'm happy to be told how to count the angels dancing on this pin:

It adopted_VBD a takeover plan ...
He collaborated_VBD with ...
He launched_VBD into ...

Most yields ... moved_VBN in the opposite direction
The board increased_VBN by one
Those employed in state-funded special programs increased_VBN by ...
The dollar gained_VBN against most foreign currencies

Compare to the following, which is a condition or something the NP had done to it rather than something the NP did:

With Japan's cash-flush banks aligned_VBN ...
... had their budgets cut_VBN in half ...
... more than 6.6 million ADRs traded_VBN

@nschneid
Contributor

Most yields ... moved_VBN in the opposite direction
The board increased_VBN by one
Those employed in state-funded special programs increased_VBN by ...
The dollar gained_VBN against most foreign currencies

These should all be VBD. PTB has a lot of tagger errors that annotators missed.
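A rough heuristic for surfacing more candidates like these is sketched below. It is not something the CoreNLP converter or the PTB correcting script is known to do. It flags a VBN that heads the VP of a clause with an overt NP-SBJ when the VP has no passive gap such as `(NP (-NONE- *))`. The function names and the gap test are assumptions made for illustration, and anything it flags would still need a human look.

```python
import re


def parse_ptb(text):
    """Parse one bracketed tree into (label, children); leaf words are plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()


def empty_word(np):
    """Return the empty element of an NP that holds only a -NONE- leaf, else None."""
    label, kids = np
    if label.startswith("NP") and len(kids) == 1 and kids[0][0] == "-NONE-":
        return kids[0][1][0]
    return None


def suspicious_vbn(tree, found=None):
    """Collect VBN words heading the VP of a clause with an overt subject and no passive gap."""
    found = [] if found is None else found
    label, kids = tree
    subtrees = [k for k in kids if not isinstance(k, str)]
    if label.startswith("S") and not label.startswith("SBAR"):
        overt_subject = any(
            k[0].startswith("NP-SBJ") and empty_word(k) is None for k in subtrees
        )
        for vp in subtrees:
            if not vp[0].startswith("VP"):
                continue
            vp_kids = [k for k in vp[1] if not isinstance(k, str)]
            first_is_vbn = bool(vp_kids) and vp_kids[0][0] == "VBN"
            # A "*" or "*-1" object marks a passive; "*T*" traces do not count.
            passive_gap = any(
                (empty_word(k) or "").startswith("*")
                and not (empty_word(k) or "").startswith("*T*")
                for k in vp_kids
            )
            if overt_subject and first_is_vbn and not passive_gap:
                found.append(vp_kids[0][1][0])
    for k in subtrees:
        suspicious_vbn(k, found)
    return found


TREE = """
( (S
    (S-TPC-2
      (NP-SBJ
        (NP (DT Those) )
        (VP (VBN employed)
          (NP (-NONE- *) )
          (PP-LOC-CLR (IN in)
            (NP (JJ state-funded) (JJ special) (NNS programs) ))))
      (VP (VBN increased)
        (PP-EXT (IN by)
          (NP (CD 7,400) ))
        (PP-DIR (TO to)
          (NP (CD 65,200) ))
        (PP-TMP (IN in)
          (NP (DT the) (JJ same) (NN period) )))
      (, ,) )
    (NP-SBJ (DT the) (NNP Directorate) )
    (VP (VBD said)
      (SBAR (-NONE- 0)
        (S (-NONE- *T*-2) )))
    (. .) ))
"""

print(suspicious_vbn(parse_ptb(TREE)))
```

On the tree quoted earlier this reports only increased. employed is left alone because its VP sits inside an NP and carries the `(NP (-NONE- *) )` gap.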

@AngledLuffa

Good, glad to know it wasn't me misunderstanding. Thanks for checking

@AngledLuffa

Are we happy with the conversion of about an hour in the following EWT sentence? advmod(hour, about) det(hour, an)

There are quite a few trees in PTB which have the about an inside a QP, presumably treating it similarly to about one, and that makes our converter want to give it a nummod relation. However, the validator doesn't like that at all.

# sent_id = reviews-288930-0003
# text = Try the 360 restraunt u spin in the cn tower with a beautiful view the sky pod elevator is about an hour line up in the summer
1       Try     try     VERB    VB      Mood=Imp|VerbForm=Fin   0       root    0:root  _
2       the     the     DET     DT      Definite=Def|PronType=Art       4       det     4:det   _
3       360     360     NUM     CD      NumForm=Digit|NumType=Card      4       nummod  4:nummod        _
4       restraunt       restaurant      NOUN    NN      Number=Sing|Typo=Yes    1       obj     1:obj   CorrectForm=restaurant
5       u       you     PRON    PRP     Abbr=Yes|Case=Nom|Person=2|PronType=Prs 6       nsubj   6:nsubj CorrectForm=you
6       spin    spin    VERB    VBP     Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin   1       parataxis       1:parataxis     _
7       in      in      ADP     IN      _       10      case    10:case _
8       the     the     DET     DT      Definite=Def|PronType=Art       10      det     10:det  _
9       cn      cn      PROPN   NNP     Number=Sing     10      compound        10:compound     _
10      tower   tower   PROPN   NNP     Number=Sing     6       obl     6:obl:in        _
11      with    with    ADP     IN      _       14      case    14:case _
12      a       a       DET     DT      Definite=Ind|PronType=Art       14      det     14:det  _
13      beautiful       beautiful       ADJ     JJ      Degree=Pos      14      amod    14:amod _
14      view    view    NOUN    NN      Number=Sing     6       obl     6:obl:with      _
15      the     the     DET     DT      Definite=Def|PronType=Art       18      det     18:det  _
16      sky     sky     NOUN    NN      Number=Sing     17      compound        17:compound     _
17      pod     pod     NOUN    NN      Number=Sing     18      compound        18:compound     _
18      elevator        elevator        NOUN    NN      Number=Sing     24      nsubj   24:nsubj        _
19      is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   24      cop     24:cop  _
20      about   about   ADV     RB      _       22      advmod  22:advmod       _
21      an      a       DET     DT      Definite=Ind|PronType=Art       22      det     22:det  _
22      hour    hour    NOUN    NN      Number=Sing     24      compound        24:compound     _
23      line    line    NOUN    NN      Number=Sing     24      compound        24:compound     _
24      up      up      NOUN    NN      Number=Sing     1       parataxis       1:parataxis     _
25      in      in      ADP     IN      _       27      case    27:case _
26      the     the     DET     DT      Definite=Def|PronType=Art       27      det     27:det  _
27      summer  summer  NOUN    NN      Number=Sing     24      obl     24:obl:in       _
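For comparing converter output against EWT in the `rel(head, dependent)` shorthand used in this thread, a small reader is enough. This is a sketch under assumptions: `read_triples` is a hypothetical helper, it falls back to whitespace splitting because the excerpts pasted here are space-aligned (real CoNLL-U files are tab-separated), and heads that fall outside the excerpt are shown as `#ID`.

```python
def read_triples(conllu):
    """Return (deprel, head_form, dependent_form) for each basic dependency."""
    rows = []
    for line in conllu.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t") if "\t" in line else line.split()
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        rows.append(cols)
    forms = {int(c[0]): c[1] for c in rows}
    forms[0] = "ROOT"
    # Columns: 0=ID, 1=FORM, 6=HEAD, 7=DEPREL
    return [(c[7], forms.get(int(c[6]), f"#{c[6]}"), c[1]) for c in rows]


sample = """\
# sent_id = reviews-288930-0003
20      about   about   ADV     RB      _       22      advmod  22:advmod       _
21      an      a       DET     DT      Definite=Ind|PronType=Art       22      det     22:det  _
22      hour    hour    NOUN    NN      Number=Sing     24      compound        24:compound     _
"""

for rel, head, dep in read_triples(sample):
    print(f"{rel}({head}, {dep})")
```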

@nschneid
Contributor

Are we happy with the conversion of about an hour in the following EWT sentence? advmod(hour, about) det(hour, an)

Yes, that's correct; it was specifically implemented in an earlier EWT release: UniversalDependencies/UD_English-EWT#168

@AngledLuffa

In terms of about an hour and its relatives, they are often (but not always) parsed into a structure such as

(NP (QP (RB about) (DT an)) (NN hour))

and for these I should probably get stuff like

(NP (QP (RB about) (DT a)) (NN month))
advmod(month, about)
det(month, a)

(NP (QP (RB virtually) (DT no)) (NN one))
advmod(one, virtually)
det(one, no)

(NP (QP (RB about) (DT a)) (NN week))
advmod(week, about)
det(week, a)

(NP (QP (RB nearly) (DT every)) (NN day))
advmod(day, nearly)
det(day, every)

Basically it seems to be QP with no CD is the first thing to look for... but then there are PDT that get captured if you just search for QP not over CD, such as in

(NP (QP (RB Not) (PDT all)) (DT those))

That looks pretty similar ... but the not gets treated differently from other RB in EWT, such as

```
# sent_id = reviews-079375-0007
... not all people ...
15      not     not     PART    RB      _       16      advmod  16:advmod       _
16      all     all     DET     DT      PronType=Tot    17      det     17:det  _
17      people  people  NOUN    NNS     Number=Plur     22      nsubj   22:nsubj        _
```

Also in EWT there is

```
# sent_id = reviews-190256-0005
... in about half the time quoted ...
7       about   about   ADV     RB      _       8       advmod  8:advmod        _
8       half    half    DET     PDT     NumForm=Word|NumType=Frac|PronType=Ind  10      det:predet      10:det:predet   _
9       the     the     DET     DT      Definite=Def|PronType=Art       10      det     10:det  _
10      time    time    NOUN    NN      Number=Sing     2       obl     2:obl:in        _
```

```
# sent_id = weblog-blogspot.com_dakbangla_20041028153019_ENG_20041028_153019-0001
28      almost  almost  ADV     RB      _       29      advmod  29:advmod       _
29      all     all     DET     PDT     PronType=Tot    32      det:predet      32:det:predet   _
30      the     the     DET     DT      Definite=Def|PronType=Art       32      det     32:det  _
31      Hindu   hindu   ADJ     JJ      Degree=Pos      32      amod    32:amod _
32      families        family  NOUN    NNS     Number=Plur     23      nmod    23:nmod:of|36:nsubj     _
```

which makes me think that if the structure is `(QP RB DT)` then the `RB` should depend on the thing the `DT` depends on, whereas if the structure is `(QP RB PDT)` then the `RB` depends directly on the `PDT`.  Does that sound about right?  It would also seem that the `PDT` has the `det:predet` relation, as opposed to a number type relation which the `QP` has been inducing in our converter.

Although there is also this:

```
# sent_id = reviews-036133-0002
# newpar id = reviews-036133-p0002
# text = I bought about half of the furniture I own from this place.
1       I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      2       nsubj   2:nsubj _
2       bought  buy     VERB    VBD     Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin   0       root    0:root  _
3       about   about   ADV     RB      _       4       advmod  4:advmod        _
4       half    half    NOUN    NN      Number=Sing|NumForm=Word|NumType=Frac   2       obj     2:obj   _
5       of      of      ADP     IN      _       7       case    7:case  _
6       the     the     DET     DT      Definite=Def|PronType=Art       7       det     7:det   _
7       furniture       furniture       NOUN    NN      Number=Sing     4       nmod    4:nmod:of       _
```

It doesn't seem very different from `I bought about half the furniture`, and yet in one case `half` is the head of the `NP` whereas in the other `furniture` would have been treated as the head of the `NP`.  I don't like it.  Maybe the `PP` is enough to cause that difference, though.

There'd also be the question of what to do if a word such as `all` shows up without another `DT`, such as

`(ADJP virtually_RB all_DT) corn_NN seeds_NNS`

Here, does that get treated as `advmod(all, virtually)` or `advmod(seeds, virtually)`?  I don't like the second option.  Maybe it can be distinguished here because it wasn't put in a `QP`.

What would be the relations for `(NP (QP twice_PDT as many) stuff)` or `(NP (QP more than half_PDT) stuff)`?  Does the position of the `PDT` affect the relations?  Found an example in EWT:

```
# sent_id = answers-20111108100523AA1i7no_ans-0011
14      less    less    ADJ     JJR     Degree=Cmp|ExtPos=ADV   16      advmod  16:advmod       _
15      than    than    ADP     IN      _       14      fixed   14:fixed        _
16      half    half    DET     PDT     NumForm=Word|NumType=Frac|PronType=Ind  18      det:predet      18:det:predet   _
17      a       a       DET     DT      Definite=Ind|PronType=Art       18      det     18:det  _
18      cm      cm      NOUN    NN      Number=Sing     13      obj     13:obj  _
```

Yet another different scheme is the phrase `yet another`, such as:

EWT example for `(NP yet another NN)`
```
(NP (ADJP (RB yet) (JJ another)) (NN police) (NN dispatch))
72      yet     yet     ADV     RB      _       73      advmod  73:advmod       _
73      another another DET     DT      PronType=Ind    75      det     75:det  _
74      police  police  NOUN    NN      Number=Sing     75      compound        75:compound     _
75      dispatch        dispatch        NOUN    NN      Number=Sing     69      obl     69:obl:for      _
```

`Yet another` is not even always parsed in a `QP` in WSJ:

```
(NP (RB yet) (DT another) ... (NN phenomenon))
(NP (QP (RB yet) (DT another)) (NN step))
(NP (ADJP (RB yet) (DT another)) (NN example))
(NP (RB yet) (DT another) (NN landscape) (NN architect))
(NP (RB yet) (DT another) (NNP Marlowe) (NN book))
(NP (ADJP (RB yet) (DT another) (NN week)))
(NP (ADJP (RB yet) (DT another)) (JJ unsettling) (NN parallel))
(NP (RB yet) (DT another) (NN setback))
```

so this particular example is highly annoying

Ones that might legitimately be `nummod`: `half a dozen ___`, `baker's dozen ___`

Basically, need to work on some generalizable rules for rearranging / searching those subtrees

@nschneid
Contributor

These are good questions. I am not an expert on how PTB uses QPs and have been frustrated at the lack of documentation on the UD treatment of these kinds of constructions.

Basically, it seems to me some simple principles are:

  • predeterminers (PDT) should attach as det:predet and may have advmod dependents
  • plain determiners (DT) should attach as det and should NOT have advmod dependents (At least this should usually be true. There are at least some instances of "not all" and "yet another" that violate this—maybe we should change them?)

You are right that semantically, "half the students" and "half of the students" are very similar, but the second involves a PP so syntactically speaking, they have different heads.
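The two principles above can be sketched as a rule over `QP` nodes that contain no `CD`. This is only an illustration of the proposal as stated in this thread, not the CoreNLP converter's actual code; `qp_relations` and its input format are hypothetical.

```python
def qp_relations(qp, head):
    """Map a two-word QP to basic dependencies.

    qp:   list of (tag, word) pairs directly under the QP
    head: the head noun of the enclosing NP
    Returns (deprel, governor, dependent) triples, or None for shapes
    this sketch does not cover.
    """
    tags = [tag for tag, _ in qp]
    if tags == ["RB", "DT"]:
        # Plain determiner: it takes no advmod, so both words attach to the noun.
        (_, adv), (_, det) = qp
        return [("advmod", head, adv), ("det", head, det)]
    if tags == ["RB", "PDT"]:
        # Predeterminer: attaches as det:predet and hosts the adverb itself.
        (_, adv), (_, pdt) = qp
        return [("advmod", pdt, adv), ("det:predet", head, pdt)]
    return None


examples = [
    ([("RB", "about"), ("DT", "a")], "month"),
    ([("RB", "nearly"), ("DT", "every")], "day"),
    ([("RB", "about"), ("PDT", "half")], "time"),
]
for qp, head in examples:
    for rel, gov, dep in qp_relations(qp, head):
        print(f"{rel}({gov}, {dep})")
```

Returning `None` for anything else keeps cases such as `yet another`, `just over a` or `more than half` out of this rule, so they can be handled (or left alone) explicitly.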

@AngledLuffa

Related question: what to do about "about half the time"? There's an example I found in PTB which is parsed like this:

          (VP (VBG rising)
            (NP
              (NP
                (QP (IN about) (PDT half) (DT a) )
                (NN point) )

My exploration of EWT has found something kind of similar...

# sent_id = reviews-190256-0005
# text = They had the work done in about half the time quoted which made me and my wife extremely happy.
7       about   about   ADV     RB      _       8       advmod  8:advmod        _
8       half    half    DET     PDT     NumForm=Word|NumType=Frac|PronType=Ind  10      det:predet      10:det:predet   _
9       the     the     DET     DT      Definite=Def|PronType=Art       10      det     10:det  _
10      time    time    NOUN    NN      Number=Sing     2       obl     2:obl:in        _

Another similar example in PTB:

                (NP
                  (QP (PDT half) )
                  (DT a) (NN percentage) (NN point) )

whereas the PTB revision changes it to

(QP (PDT half) (DT a))

But of course in typical PTB style this gets annotated differently elsewhere, such as

                      (NP (PDT half) (DT a) (NN percentage) (NN point) ))

      (VP (VBG declining)
        (NP-EXT (PDT half) (DT a) (NN point) )
        (ADVP-TMP (RB semiannually) )
        (PP-DIR (TO to)
          (NP (JJ par) ))))

        (NP (IN about) (PDT half) (DT a) (NN point) )    # IN??  why am I doing this to myself

So, just in general, we want this pattern?

det(time, the)
det:predet(time, half)
advmod(half, about)

@AngledLuffa

yet another does seem unique among the RB DT thing pattern. At first, I was thinking of just searching for a|an|the, but nearly every other DT has the advmod pointing to the thing rather than the DT. For example:

12      nearly  nearly  ADV     RB      _       14      advmod  14:advmod       _
13      every   every   DET     DT      PronType=Tot    14      det     14:det  _
14      day     day     NOUN    NN      Number=Sing     11      obl:tmod        11:obl:tmod     SpaceAfter=No
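To see how exceptional `yet another` and `not all` are, one can list every `advmod` whose head is a plain `DT` attached as `det`. The sketch below works on one sentence at a time and uses token rows copied from the EWT excerpts in this thread; the `Tok` tuple and `advmod_on_plain_det` are hypothetical names, not part of the UD validator.

```python
from collections import namedtuple

Tok = namedtuple("Tok", "id form xpos head deprel")


def advmod_on_plain_det(tokens):
    """Return (adverb, determiner) pairs where an advmod hangs off a DT det."""
    by_id = {t.id: t for t in tokens}
    flagged = []
    for t in tokens:
        head = by_id.get(t.head)  # None if the head lies outside the excerpt
        if (t.deprel == "advmod" and head is not None
                and head.xpos == "DT" and head.deprel == "det"):
            flagged.append((t.form, head.form))
    return flagged


nearly_every_day = [
    Tok(12, "nearly", "RB", 14, "advmod"),
    Tok(13, "every", "DT", 14, "det"),
    Tok(14, "day", "NN", 11, "obl:tmod"),
]
yet_another = [
    Tok(72, "yet", "RB", 73, "advmod"),
    Tok(73, "another", "DT", 75, "det"),
    Tok(74, "police", "NN", 75, "compound"),
    Tok(75, "dispatch", "NN", 69, "obl"),
]
about_half = [
    Tok(7, "about", "RB", 8, "advmod"),
    Tok(8, "half", "PDT", 10, "det:predet"),
    Tok(9, "the", "DT", 10, "det"),
    Tok(10, "time", "NN", 2, "obl"),
]

for name, sent in [("nearly every day", nearly_every_day),
                   ("yet another", yet_another),
                   ("about half the time", about_half)]:
    print(name, advmod_on_plain_det(sent))
```

Of these three excerpts, only `yet another` is reported. The predeterminer case is left alone because `half` is tagged `PDT`.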

@AngledLuffa

(sorry for the repeated small messages)

also, this one is a bit different, with an IN in the middle:

(NP (QP just_RB over_IN a_DT) decade_NN)

I believe over and a would both attach to decade, with advmod(over, just)

@nschneid
Contributor

Yes agree with all these suggestions

@AngledLuffa

Digging into one of the many tiny cases left, there's a tree which sounds a bit like Yoda:

( (SINV
    (ADVP (RB Also) )
    (VP-TPC-2 (VBN excluded)
      (NP (-NONE- *-1) ))
    (VP (MD will)
      (VP (VB be)
        (VP (-NONE- *T*-2) )))
    (NP-SBJ-1
      (NP (NNS investments) )
      (PP (IN in)
...

In this case, I expect the correct dependencies would be excluded as the root, aux(excluded, will), aux:pass(excluded, be)?

I have a change which fixes that one tree (and no others in PTB)

@nschneid
Contributor

nschneid commented Jun 5, 2024

Yup! This is a subject-dependent inversion.

@AngledLuffa

There was only one of them in PTB, interestingly. Maybe it was the only one with that particular parse.

I came across an oddity in our converter when fixing that one... apparently the results can be different depending on the object identity of the dependency objects, which changed when I created new objects to resolve that dependency. Long story short, in the following sentence, where should what-8 attach?

( (S
    (NP-SBJ-1 (DT The) (JJ Soviet) (NNS purchases) )
    (VP (VBP are)
      (ADJP-PRD (JJ close)
        (PP (TO to)
          (S-NOM
            (NP-SBJ (-NONE- *-1) )
            (VP (VBG exceeding)
              (SBAR
                (WHNP-2 (WP what) )
                (S
                  (NP-SBJ (DT some) (NNS analysts) )
                  (VP (VBD had)
                    (VP (VBN expected)
                      (S
                        (NP-SBJ (DT the) (NNP Soviet) (NNP Union) )
                        (VP (TO to)
                          (VP (VB buy)
                            (NP (-NONE- *T*-2) )
                            (NP-TMP
                              (NP (DT this) (NN fall) )
                              (, ,)
                              (NP
                                (NP (DT the) (NN season) )
                                (SBAR
                                  (WHPP-3 (IN in)
                                    (WHNP (WDT which) ))
                                  (S
                                    (NP-SBJ (PRP it) )
                                    (ADVP-TMP (RB usually) )
                                    (VP (VBZ buys)
                                      (NP
                                        (NP (JJ much) )
                                        (PP (IN of)
                                          (NP
                                            (NP (DT the) (NN corn) )
                                            (SBAR
                                              (WHNP-4 (-NONE- 0) )
                                              (S
                                                (NP-SBJ (PRP it) )
                                                (VP (VBZ imports)
                                                  (NP (-NONE- *T*-4) )
                                                  (PP-DIR (IN from)
                                                    (NP (DT the) (NNP U.S.) ))))))))
                                      (PP-TMP (-NONE- *T*-3) ))))))))))))))))))
    (. .) ))

the two candidates which our converter produces are either obj(expected-12, what-8) or obj(buy-17, what-8)

it should attach to buy, right?

@AngledLuffa

AngledLuffa commented Jun 5, 2024

Also, "as much as": should this generally be tagged & parsed the same as in this sentence? It looks pretty consistent in EWT.

# sent_id = weblog-blogspot.com_alaindewitt_20040929103700_ENG_20040929_103700-0076
# text = We should know as much as we can.
1       We      we      PRON    PRP     Case=Nom|Number=Plur|Person=1|PronType=Prs      3       nsubj   3:nsubj _
2       should  should  AUX     MD      VerbForm=Fin    3       aux     3:aux   _
3       know    know    VERB    VB      VerbForm=Inf    0       root    0:root  _
4       as      as      ADV     RB      _       5       advmod  5:advmod        _
5       much    much    ADJ     JJ      Degree=Pos      3       obj     3:obj   _
6       as      as      SCONJ   IN      _       8       mark    8:mark  _
7       we      we      PRON    PRP     Case=Nom|Number=Plur|Person=1|PronType=Prs      8       nsubj   8:nsubj _
8       can     can     AUX     MD      VerbForm=Fin    5       advcl   5:advcl:as      SpaceAfter=No
9       .       .       PUNCT   .       _       3       punct   3:punct _

(edit: sometimes much is JJ, sometimes RB in EWT)

@nschneid
Contributor

nschneid commented Jun 5, 2024

  • "The Soviet purchases are close to exceeding what2 some analysts had expected the Soviet Union to buy __2 this fall": I think this is a free relative, so in the basic dependencies "what" should be the obj of "exceeding", with PronType=Rel, and "expected" should be its acl:relcl. In the enhanced dependencies "what" should also attach as the obj of "buy".
  • "as much as": yeah this looks like a standard comparative. "much" can be ADJ if it implies "much stuff", e.g. "We should know as much (information) as we can". Or it can be ADV, e.g. "People don't read magazines as much as they used to."
