adding English data from the merge of UD and Propbank #5

Closed
arademaker opened this issue Jan 31, 2020 · 10 comments

@arademaker
Member

The idea is to merge the PropBank data from https://github.com/propbank/propbank-release (the subset covering the EWT treebank) with http://github.com/universaldependencies/UD_English-EWT (the same EWT sentences, with UD annotations and revisions).

@arademaker
Member Author

arademaker commented Jan 31, 2020

Some mismatches between the PoS tag in the PropBank data and the xpostag in the UD data deserve attention. One example is:

# sent_id = reviews-398243-0007
# text = The price was actually lower than what I had anticipated and used to compared to other places, plus he showed me the work he did when I came into pick up the car.
1	The	the	DET	DT	Definite=Def|PronType=Art	2	det	2:det	Tree=(TOP(S(S(NP*|Framefile=-|Roleset=-|Args=*/(ARG1*/(ARG1*/*/*/*/*/*/*/*
2	price	price	NOUN	NN	Number=Sing	5	nsubj	5:nsubj	Tree=*)|Framefile=price|Roleset=price.01|Args=(V*)/*)/*)/*/*/*/*/*/*/*
3	was	be	AUX	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	5	cop	5:cop	Tree=(VP*|Framefile=be|Roleset=be.01|Args=*/(V*)/*/*/*/*/*/*/*/*
4	actually	actually	ADV	RB	_	5	advmod	5:advmod	Tree=(ADVP*)|Framefile=-|Roleset=-|Args=*/(ARGM-ADV*)/(ARGM-ADV*)/*/*/*/*/*/*/*
5	lower	lower	ADJ	JJR	Degree=Cmp	0	root	0:root	Tree=(ADJP(ADJP*)|Framefile=low|Roleset=low.04|Args=*/(ARG2*/(V*)/*/*/*/*/*/*/*
6	than	than	SCONJ	IN	_	7	case	7:case	Tree=(PP*|Framefile=-|Roleset=-|Args=*/*/(ARGM-CXN*/*/*/*/*/*/*/*
7	what	what	PRON	WP	PronType=Int	5	obl	5:obl:than	Tree=(SBAR(WHNP*)|Framefile=-|Roleset=-|Args=*/*/*/*/(ARG1*)/*/*/*/*/*
8	I	I	PRON	PRP	Case=Nom|Number=Sing|Person=1|PronType=Prs	10	nsubj	10:nsubj|12:nsubj|14:nsubj:xsubj	Tree=(S(NP*)|Framefile=-|Roleset=-|Args=*/*/*/*/(ARG0*)/*/*/*/*/*
9	had	have	AUX	VBD	Mood=Ind|Tense=Past|VerbForm=Fin	10	aux	10:aux	Tree=(UCP(VP*|Framefile=have|Roleset=have.01|Args=*/*/*/(V*)/*/*/*/*/*/*
10	anticipated	anticipate	VERB	VBN	Tense=Past|VerbForm=Part	7	acl:relcl	7:acl:relcl	Tree=(VP*))|Framefile=anticipate|Roleset=anticipate.01|Args=*/*/*/*/(V*)/*/*/*/*/*
11	and	and	CCONJ	CC	_	12	cc	12:cc	Tree=*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
12	used	use	VERB	VBN	Tense=Past|VerbForm=Part	10	conj	7:acl:relcl|10:conj:and	Tree=(FRAG(ADJP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*|PBPOS=JJ
13	to	to	ADP	IN	_	14	aux	14:aux	Tree=(PP*)))|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
14	compared	compare	VERB	VBN	Tense=Past|VerbForm=Part	12	xcomp	12:xcomp	Tree=(PP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
15	to	to	ADP	IN	_	17	case	17:case	Tree=(PP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
16	other	other	ADJ	JJ	Degree=Pos	17	amod	17:amod	Tree=(NP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
17	places	place	NOUN	NNS	Number=Plur	14	obl	14:obl:to	SpaceAfter=No|Tree=*))))))))))|Framefile=-|Roleset=-|Args=*/*)/*)/*/*/*/*/*/*/*
18	,	,	PUNCT	,	_	21	punct	21:punct	Tree=*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
19	plus	plus	CCONJ	CC	_	21	cc	21:cc	Tree=*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*
20	he	he	PRON	PRP	Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs	21	nsubj	21:nsubj	Tree=(S(NP*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/(ARG0*)/*/*/*/*
21	showed	show	VERB	VBD	Mood=Ind|Tense=Past|VerbForm=Fin	5	conj	5:conj:plus	Tree=(VP*|Framefile=show|Roleset=show.01|Args=*/*/*/*/*/(V*)/*/*/*/*
22	me	I	PRON	PRP	Case=Acc|Number=Sing|Person=1|PronType=Prs	21	iobj	21:iobj	Tree=(NP*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/(ARG2*)/*/*/*/*
23	the	the	DET	DT	Definite=Def|PronType=Art	24	det	24:det	Tree=(NP(NP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/(ARG1*/*/*/*/*
24	work	work	NOUN	NN	Number=Sing	21	obj	21:obj	Tree=*)|Framefile=work|Roleset=work.01|Args=*/*/*/*/*/*/(V*)/(ARGM-PRR*)/*/*
25	he	he	PRON	PRP	Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs	26	nsubj	26:nsubj	Tree=(SBAR(S(NP*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/(ARG0*)/*/*/*
26	did	do	VERB	VBD	Mood=Ind|Tense=Past|VerbForm=Fin	24	acl:relcl	24:acl:relcl	Tree=(VP*))))|Framefile=do|Roleset=do.LV|Args=*/*/*/*/*/*)/(ARGM-LVB*)/(V*)/*/*
27	when	when	ADV	WRB	PronType=Int	29	mark	29:mark	Tree=(SBAR(WHADVP*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/(ARGM-TMP*/*/*/(ARGM-TMP*)/*
28	I	I	PRON	PRP	Case=Nom|Number=Sing|Person=1|PronType=Prs	29	nsubj	29:nsubj	Tree=(S(NP*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/(ARG1*)/(ARG0*)
29	came	come	VERB	VBD	Mood=Ind|Tense=Past|VerbForm=Fin	21	advcl	21:advcl:when	Tree=(VP*|Framefile=come|Roleset=come.01|Args=*/*/*/*/*/*/*/*/(V*)/*
30	in	in	ADV	RB	_	29	advmod	29:advmod	SpaceAfter=No|Tree=(ADVP*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/(ARGM-DIR*)/*
31	to	to	PART	TO	_	32	mark	32:mark	Tree=(S(VP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/(ARGM-PRP*/*
32	pick	pick	VERB	VB	VerbForm=Inf	29	advcl	29:advcl:to	Tree=(VP*|Framefile=pick|Roleset=pick_up.04|Args=*/*/*/*/*/*/*/*/*/(V*
33	up	up	ADP	RP	_	32	compound:prt	32:compound:prt	Tree=(PRT*)|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*)
34	the	the	DET	DT	Definite=Def|PronType=Art	35	det	35:det	Tree=(NP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/(ARG1*
35	car	car	NOUN	NN	Number=Sing	32	obj	32:obj	SpaceAfter=No|Tree=*)))))))))|Framefile=-|Roleset=-|Args=*/*/*/*/*/*)/*/*/*)/*)

Token 12, "used", was tagged JJ in EWT (see PBPOS=JJ in its MISC field) and was therefore not annotated as a predicate. In UD this token is analyzed as VERB. The UD tree structure is probably very different from the PTB tree here.
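To make the Args field in the example above easier to read, here is a minimal Python sketch that decodes it into labelled token spans. It assumes what the example suggests: Args holds one "/"-separated column per predicate, and each column uses bracket notation in which "(LABEL" opens a span and ")" closes the most recently opened one. The helper names parse_misc and arg_spans are made up for illustration and are not part of the repository.

```python
import re


def parse_misc(misc):
    """Split a MISC value such as 'Tree=*)|Roleset=-|Args=*/*' into a dict."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)


def arg_spans(args_per_token):
    """Decode per-token bracketed Args strings into (label, start, end) spans.

    args_per_token[i] is the Args value of token i+1. The result has one list
    of spans per predicate column; token ids are 1-based and inclusive.
    """
    n_preds = len(args_per_token[0].split("/"))
    spans = [[] for _ in range(n_preds)]
    open_spans = [[] for _ in range(n_preds)]  # stack of (label, start) per column
    for tok_id, args in enumerate(args_per_token, start=1):
        for col, cell in enumerate(args.split("/")):
            # each "(LABEL" opens a span at this token
            for label in re.findall(r"\(([^()*]+)", cell):
                open_spans[col].append((label, tok_id))
            # each ")" closes the most recently opened span
            for _ in range(cell.count(")")):
                label, start = open_spans[col].pop()
                spans[col].append((label, start, tok_id))
    return spans


# MISC fields of tokens 1-4 of reviews-398243-0007, copied from the example
misc_fields = [
    "Tree=(TOP(S(S(NP*|Framefile=-|Roleset=-|Args=*/(ARG1*/(ARG1*/*/*/*/*/*/*/*",
    "Tree=*)|Framefile=price|Roleset=price.01|Args=(V*)/*)/*)/*/*/*/*/*/*/*",
    "Tree=(VP*|Framefile=be|Roleset=be.01|Args=*/(V*)/*/*/*/*/*/*/*/*",
    "Tree=(ADVP*)|Framefile=-|Roleset=-|Args=*/(ARGM-ADV*)/(ARGM-ADV*)/*/*/*/*/*/*/*",
]
spans = arg_spans([parse_misc(m)["Args"] for m in misc_fields])
print(spans[1])  # spans of the second predicate (be.01)
```

On this four-token excerpt, the second column gives ARG1 over tokens 1-2, V on token 3 and ARGM-ADV on token 4. Spans that are still open at the end of the excerpt (such as the ARG2 that starts at token 5 in the full sentence) are only reported when the whole sentence is passed in.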

@arademaker
Member Author

arademaker commented Jan 31, 2020

This is a summary of the cases where the original PoS tag differs from the current xpostag in the UD data:

% gawk '$0 ~ /sent_id/ {sent=$0} $10 ~ /PBPOS/ {print $4,$5,gensub(/.*(PBPOS=[^/]+).*/,"\\1","g",$10)}' ud+prop.conllu  | sort | uniq -c | sort -nr
  23 NOUN NNS PBPOS=NN
  14 ADV RP PBPOS=RB
   8 ADJ JJ PBPOS=NN
   6 PRON PRP PBPOS=PRP$
   6 ADV JJ PBPOS=RB
   5 SYM SYM PBPOS=NN
   5 NOUN NN PBPOS=JJ
   5 ADV RB PBPOS=IN
   4 DET DT PBPOS=NN
   4 ADP RP PBPOS=IN
   4 ADJ RB PBPOS=JJ
   3 X GW PBPOS=NN
   3 PROPN JJ PBPOS=NNP
   3 NOUN NN PBPOS=IN
   3 NOUN NN PBPOS=GW
   3 NOUN NN PBPOS=CD
   3 ADP RB PBPOS=RP
   3 ADJ NN PBPOS=JJ
   3 ADJ DT PBPOS=JJ
   2 X NNP PBPOS=GW
   2 X GW PBPOS=JJ
   2 SYM SYM PBPOS=DT
   2 SCONJ IN PBPOS=CC
   2 PROPN NNP PBPOS=NN
   2 PRON PRP$ PBPOS=PRP
   2 PART RB PBPOS=VB
   2 NOUN NN PBPOS=VB
   2 NOUN NN PBPOS=NNP
   2 CCONJ CC PBPOS=DT
   2 ADV RB PBPOS=CC
   2 ADP RB PBPOS=IN
   2 ADJ NNP PBPOS=JJ
   2 ADJ JJ PBPOS=GW
   1 X GW PBPOS=VBN
   1 X ADD PBPOS=RP
   1 VERB VBP PBPOS=IN
   1 VERB VBN PBPOS=JJ
   1 VERB VBN PBPOS=GW
   1 VERB VBG PBPOS=NN
   1 VERB VB PBPOS=TO
   1 VERB VB PBPOS=RB
   1 VERB VB PBPOS=NNS
   1 VERB NNS PBPOS=VBZ
   1 SYM UH PBPOS=.
   1 SYM UH PBPOS=-RRB-
   1 SYM SYM PBPOS=NNP
   1 SYM NFP PBPOS=-RRB-
   1 SYM NFP PBPOS=-LRB-
   1 SYM IN PBPOS=SYM
   1 SCONJ IN PBPOS=RB
   1 PUNCT PDT PBPOS=''
   1 PUNCT . PBPOS=NFP
   1 PUNCT -RRB- PBPOS=NFP
   1 PUNCT , PBPOS=HYPH
   1 PUNCT , PBPOS=.
   1 PROPN NNPS PBPOS=NNS
   1 PROPN NN PBPOS=NNP
   1 PRON EX PBPOS=RB
   1 PRON EX PBPOS=PRP
   1 PRON DT PBPOS=IN
   1 PART TO PBPOS=PRP
   1 NUM NN PBPOS=CD
   1 NOUN VBG PBPOS=NN
   1 NOUN UH PBPOS=NNS
   1 NOUN NNS PBPOS=VBZ
   1 NOUN NNP PBPOS=NN
   1 NOUN NN PBPOS=VBN
   1 NOUN NN PBPOS=VBG
   1 NOUN NN PBPOS=RB
   1 NOUN JJ PBPOS=NN
   1 INTJ NN PBPOS=UH
   1 INTJ JJ PBPOS=UH
   1 DET DT PBPOS=PRP
   1 CCONJ CC PBPOS=VB
   1 CCONJ CC PBPOS=NNP
   1 CCONJ CC PBPOS=NN
   1 CCONJ CC PBPOS=IN
   1 ADV RBR PBPOS=RB
   1 ADV RB PBPOS=VBG
   1 ADV RB PBPOS=JJ
   1 ADV NN PBPOS=RBS
   1 ADV IN PBPOS=RB
   1 ADV CC PBPOS=RB
   1 ADP TO PBPOS=IN
   1 ADP IN PBPOS=RB
   1 ADP IN PBPOS=JJ
   1 ADP IN PBPOS=DT
   1 ADP CC PBPOS=IN
   1 ADJ JJ PBPOS=VB
   1 ADJ JJ PBPOS=RB
   1 ADJ JJ PBPOS=IN

Each row lists the UD upos, the UD xpostag and the original PropBank tag. The row ADJ JJ PBPOS=VB means that the token was a VB in EWT/PropBank but is analyzed as ADJ in UD. The row NOUN NN PBPOS=VB means that it was a VB in EWT/PropBank but UD considers it a NOUN.

We have 177 sentences with these differences between the xpostag in UD data and the POS tag in the Propbank/EWT data:

% gawk '$0 ~ /sent_id/ {sent=$0} $10 ~ /PBPOS/ {print sent}' ud+prop.conllu  | sort | uniq -c | sort -nr | wc -l
     177
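For anyone who prefers not to rely on gawk, the two counts above can be sketched in Python. This is only an approximation of the one-liners, assuming tab-separated token lines with 10 columns and the PBPOS key stored in MISC. The function name pbpos_summary is hypothetical.

```python
from collections import Counter


def pbpos_summary(lines):
    """Count (UPOS, XPOS, PBPOS) triples and collect the affected sentence ids."""
    triples = Counter()
    sentences = set()
    sent_id = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):]
            continue
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # skip anything that is not a plain 10-column token line
        for item in cols[9].split("|"):
            if item.startswith("PBPOS="):
                triples[(cols[3], cols[4], item[len("PBPOS="):])] += 1
                sentences.add(sent_id)
    return triples, sentences


# Token 12 of reviews-398243-0007, rebuilt with explicit tabs
sample = [
    "# sent_id = reviews-398243-0007",
    "\t".join([
        "12", "used", "use", "VERB", "VBN", "Tense=Past|VerbForm=Part",
        "10", "conj", "7:acl:relcl|10:conj:and",
        "Tree=(FRAG(ADJP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*|PBPOS=JJ",
    ]),
]
triples, sentences = pbpos_summary(sample)
for (upos, xpos, pbpos), n in triples.most_common():
    print(n, upos, xpos, f"PBPOS={pbpos}")
print(len(sentences), "sentence(s) affected")
```

To run it on the real data, pass an open file handle for ud+prop.conllu instead of the sample list.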

@arademaker
Member Author

arademaker commented Jan 31, 2020

My suggestions are:

  1. Let us define the output format. Note that the extra columns were all added to the MISC field in order to produce valid CoNLL-U with 10 columns, but it is easy to expand the values into extra columns.
  2. Mark all these 177 sentences with a metadata line for manual verification of the SRL annotation.
  3. We should not use the PTB trees anymore; I can remove them from the MISC fields.
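Suggestions 2 and 3 could be prototyped roughly as follows. This is a sketch under assumptions, not what the repository does: the comment key srl_review and its value pos_mismatch are invented for illustration, and a sentence is flagged whenever any token carries PBPOS in MISC. Dropping the Tree entries is optional, so both positions on the PTB trees can be tried.

```python
def mark_sentence(block, key="srl_review", value="pos_mismatch", drop_tree=False):
    """Return a copy of one sentence block (comment lines plus token lines).

    A '# key = value' comment is appended to the sentence comments when any
    token has PBPOS in MISC. With drop_tree=True the Tree=... entries are
    removed from MISC as well.
    """
    comments = [line for line in block if line.startswith("#")]
    tokens = [line for line in block if line and not line.startswith("#")]
    needs_review = False
    new_tokens = []
    for line in tokens:
        cols = line.split("\t")
        items = cols[9].split("|")
        if any(item.startswith("PBPOS=") for item in items):
            needs_review = True
        if drop_tree:
            items = [item for item in items if not item.startswith("Tree=")]
        cols[9] = "|".join(items) or "_"  # an empty MISC is written as "_"
        new_tokens.append("\t".join(cols))
    if needs_review:
        comments.append(f"# {key} = {value}")
    return comments + new_tokens


# Token 12 of reviews-398243-0007, rebuilt with explicit tabs
block = [
    "# sent_id = reviews-398243-0007",
    "\t".join([
        "12", "used", "use", "VERB", "VBN", "Tense=Past|VerbForm=Part",
        "10", "conj", "7:acl:relcl|10:conj:and",
        "Tree=(FRAG(ADJP*|Framefile=-|Roleset=-|Args=*/*/*/*/*/*/*/*/*/*|PBPOS=JJ",
    ]),
]
marked = mark_sentence(block, drop_tree=True)
print("\n".join(marked))
```

Because the marker is an ordinary sentence-level comment line, the file keeps 10 columns per token.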

@huaiyu-zhu
Member

huaiyu-zhu commented Jan 31, 2020

Concerning the PTB metadata, my suggestion is that we use the new UD data in practice and ignore the old data, but do not remove the related info from the data itself. While it is good to use only UD data in all future work, there may be occasions when people want to compare evaluations of models based on the new and the old data, and having these links to the past in the same file may be useful.

@arademaker
Member Author

In f631cfa I introduced the first version of the merge. The data is not yet ready for merging into master.

arademaker added a commit that referenced this issue Feb 15, 2020
@alanakbik
Collaborator

@arademaker thanks for performing this merge - this will be very useful for anyone that wants to train SRL systems over UD!

A quick question on the format: the Args part (see below) could become very long in a sentence with many verbs / frame-evoking elements (it could become something like Args=_/_/_/_/_/_/_/_/_ for each word in the sentence), perhaps impacting readability. The Finnish Proposition Bank has an alternative encoding (see here) that may be more compact/readable?

# newdoc id = weblog-blogspot.com_gettingpolitical_20030906235000_ENG_20030906_235000
# sent_id = weblog-blogspot.com_gettingpolitical_20030906235000_ENG_20030906_235000-0001
# text = The sheikh in wheel-chair has been attacked with a F-16-launched bomb.
1	The	the	DET	DT	Definite=Def|PronType=Art	2	det	2:det	Framefile=-|Roleset=-|Args=_/_/_
2	sheikh	sheikh	NOUN	NN	Number=Sing	9	nsubj:pass	9:nsubj:pass	Framefile=-|Roleset=-|Args=_/_/ARG1
3	in	in	ADP	IN	_	6	case	6:case	Framefile=-|Roleset=-|Args=_/_/_
4	wheel	wheel	NOUN	NN	Number=Sing	6	compound	6:compound	SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/_
5	-	-	PUNCT	HYPH	_	6	punct	6:punct	SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/_
6	chair	chair	NOUN	NN	Number=Sing	2	nmod	2:nmod:in	Framefile=-|Roleset=-|Args=_/_/_
7	has	have	AUX	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	9	aux	9:aux	Framefile=have|Roleset=have.01|Args=V/_/_
8	been	be	AUX	VBN	Tense=Past|VerbForm=Part	9	aux:pass	9:aux:pass	Framefile=be|Roleset=be.03|Args=_/V/_
9	attacked	attack	VERB	VBN	Tense=Past|VerbForm=Part	0	root	0:root	Framefile=attack|Roleset=attack.01|Args=_/_/V
10	with	with	ADP	IN	_	17	case	17:case	Framefile=-|Roleset=-|Args=_/_/_
11	a	a	DET	DT	Definite=Ind|PronType=Art	17	det	17:det	Framefile=-|Roleset=-|Args=_/_/_
12	F	f	NOUN	NN	Number=Sing	16	compound	16:compound	SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/_
13	-	-	PUNCT	HYPH	_	12	punct	12:punct	SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/_
14	16	16	NUM	CD	NumType=Card	12	compound	12:compound	SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/_
15	-	-	PUNCT	HYPH	_	16	punct	16:punct	SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/_
16	launched	launch	VERB	VBN	Tense=Past|VerbForm=Part	17	acl	17:acl	Framefile=-|Roleset=-|Args=_/_/_
17	bomb	bomb	NOUN	NN	Number=Sing	9	obl	9:obl:with	SpaceAfter=No|Framefile=-|Roleset=-|Args=_/_/ARGM-MNR
18	.	.	PUNCT	.	_	9	punct	9:punct	Framefile=-|Roleset=-|Args=_/_/_

@arademaker
Member Author

Thank you @alanakbik, I agree that we have to think a little more about the final format. I actually ended up using the same format used for the other languages and improved the README file, explaining that the .conllu files in this repo are not valid CoNLL-U according to the UD specifications.

This is a bad situation because the extension may lead people to believe that standard CoNLL-U readers can parse the files, which is not the case for now. I also don't like having to deal with a variable number of columns per sentence. The format you suggest above seems very concise, and it can be encoded in the MISC column.

Other options are:

  1. encode the predicates in the sentence metadata
  2. adopt CoNLL-U Plus

@huaiyu-zhu , @yunyaoli ?
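To make the CoNLL-U Plus option concrete, here is a small Python sketch. It assumes the CoNLL-U Plus convention of declaring the columns in a leading "# global.columns = ..." comment. The column names SRL:FRAMEFILE, SRL:ROLESET and SRL:ARGS are invented for illustration. The sketch moves the three SRL keys out of MISC into fixed extra columns, so every token line has the same number of columns.

```python
SRL_KEYS = ("Framefile", "Roleset", "Args")

# Hypothetical column names; only the "# global.columns" mechanism comes from CoNLL-U Plus
HEADER = (
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC "
    + " ".join(f"SRL:{key.upper()}" for key in SRL_KEYS)
)


def to_conllu_plus(token_line):
    """Move the SRL keys from MISC into three extra tab-separated columns."""
    cols = token_line.split("\t")
    kept, srl = [], {}
    for item in cols[9].split("|"):
        key, _, value = item.partition("=")
        if key in SRL_KEYS:
            srl[key] = value
        else:
            kept.append(item)
    cols[9] = "|".join(kept) or "_"  # MISC becomes "_" when nothing is left
    return "\t".join(cols + [srl.get(key, "_") for key in SRL_KEYS])


# Token 2 of the "sheikh" sentence quoted earlier in the thread
line = "\t".join([
    "2", "sheikh", "sheikh", "NOUN", "NN", "Number=Sing",
    "9", "nsubj:pass", "9:nsubj:pass",
    "Framefile=-|Roleset=-|Args=_/_/ARG1",
])
print(HEADER)
print(to_conllu_plus(line))
```

The number of columns is then fixed at 13, but the Args cell still grows with the number of predicates in the sentence, so the readability concern raised above is not addressed by this option alone.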

@nschneid

Or choose a different extension, e.g. .conllusrl?

@alanakbik
Collaborator

I would probably vote against encoding this information in the sentence metadata. CoNLL-U plus or changing the extension are good solutions, but best might be to have this in valid CoNLL-U format since this is what most people/tools use.

So I like your way of encoding SRL in the MISC column; perhaps the readability could just be improved by encoding the arguments with an id-pointer system, as in the Finnish PropBank or the enhanced dependency graph (which uses head-deprel pairs)?
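As an illustration of what such a pointer encoding might look like, here is a Python sketch that rewrites the column-aligned Args values as predicate-id:role pairs, in the spirit of the head:deprel pairs of the enhanced graph. It is not the Finnish PropBank format itself, and the function name args_to_pointers is hypothetical. It assumes every Args column has exactly one token marked V, as in the example sentence.

```python
def args_to_pointers(args_by_token):
    """Map token id -> list of 'predicate_id:ROLE' strings.

    args_by_token maps a token id to its Args value, e.g. {2: "_/_/ARG1"}.
    Each "/"-separated column belongs to the predicate whose own cell is "V".
    """
    # First pass: find the token id of the predicate behind each column
    pred_ids = {}
    for tok_id, args in args_by_token.items():
        for col, role in enumerate(args.split("/")):
            if role == "V":
                pred_ids[col] = tok_id
    # Second pass: replace every filled cell with a pointer to its predicate
    pointers = {}
    for tok_id, args in args_by_token.items():
        pointers[tok_id] = [
            f"{pred_ids[col]}:{role}"
            for col, role in enumerate(args.split("/"))
            if role not in ("_", "V")
        ]
    return pointers


# Args values of selected tokens from the "sheikh" sentence
args_by_token = {
    2: "_/_/ARG1",       # sheikh
    7: "V/_/_",          # has
    8: "_/V/_",          # been
    9: "_/_/V",          # attacked
    17: "_/_/ARGM-MNR",  # bomb
}
pointers = args_to_pointers(args_by_token)
print(pointers[2], pointers[17])
```

With this encoding, tokens that fill no role need no entry at all, so the value no longer grows with the number of predicates in the sentence. If the pairs are stored in MISC, they need a separator other than "|" (a comma, for example), since "|" already separates MISC entries.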

arademaker added a commit that referenced this issue Feb 21, 2020