Error when parsing multiword expressions in conllu file #26

Closed
sb-b opened this issue Feb 14, 2018 · 4 comments

Comments

@sb-b

sb-b commented Feb 14, 2018

Hi,

I am trying to train this parser on the Turkish UD treebank. When I run this command:

java -jar ParserOracleArcStdWithSwap.jar -t -1 -l 1 -c training.conll > trainingOracle.txt

I got the following error:

java.lang.NumberFormatException: For input string: "2-3"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:580)
        at java.lang.Integer.parseInt(Integer.java:615)
        at arc_std_swap.Oracle.getTransition(Oracle.java:41)
        at arc_std_swap.Parser.printOracle(Parser.java:366)
        at arc_std_swap.Parser.main(Parser.java:270)

The CoNLL-U sentence on which the LSTM parser raises the error is the one below:

# sent_id = mst-0003
# text = Sanal parçacıklarsa bunların hiçbirini yapamazlar.
1	Sanal	sanal	ADJ	Adj	_	2	amod	_	_
2-3	parçacıklarsa	_	_	_	_	_	_	_	_
2	parçacıklar	parçacık	NOUN	Noun	Case=Nom|Number=Plur|Person=3	6	csubj	_	_
3	sa	i	AUX	Zero	Aspect=Perf|Mood=Cnd|Number=Sing|Person=3|Tense=Pres	2	cop	_	_
4	bunların	bu	PRON	Demons	Case=Gen|Number=Plur|Person=3|PronType=Dem	5	nmod:poss	_	_
5	hiçbirini	hiçbiri	PRON	Quant	Case=Acc|Number=Sing|Number[psor]=Sing|Person=3|Person[psor]=3|PronType=Ind	6	obj	_	_
6	yapamazlar	yap	VERB	Verb	Aspect=Imp|Mood=Pot|Number=Plur|Person=3|Polarity=Neg|Tense=Aor	0	root	_	SpaceAfter=No
7	.	.	PUNCT	Punc	_	6	punct	_	_

The word 'parçacıklarsa' is a multiword token, so it is numbered '2-3'. Does the LSTM parser have a mechanism for dealing with multiword tokens? How can I solve this issue?

Thanks,

Betul

@miguelballesteros
Contributor

Hi! This is CoNLL-U format; the parser only handles CoNLL format. Please see the Universal Dependencies scripts.

Miguel
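
For readers who hit the same NumberFormatException, the following is a minimal sketch of the kind of preprocessing the reply points to. It is not the official Universal Dependencies conversion script. It assumes the oracle only needs the ten-column word lines with plain integer IDs, so it drops `#` comment lines, multiword token range lines such as `2-3`, and empty-node lines such as `8.1`. The function name `strip_conllu_extras` is made up for this illustration, and whether the parser needs any further column changes is not covered here.

```python
import re

# In CoNLL-U, a plain integer ID marks a syntactic word; "2-3" is a
# multiword token range and "8.1" is an empty node.
WORD_ID = re.compile(r"^[0-9]+$")


def strip_conllu_extras(lines):
    """Yield only sentence separators and word lines with integer IDs.

    Hypothetical helper: assumes the downstream tool accepts the
    remaining tab-separated lines unchanged.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            yield ""                      # blank line separates sentences
        elif line.startswith("#"):
            continue                      # sentence-level comment, dropped
        elif WORD_ID.match(line.split("\t", 1)[0]):
            yield line                    # ordinary word line, kept as-is
        # anything else (range or empty-node line) is dropped
```

Applied to the sentence in the report, this keeps lines 1 through 7 and removes only the `2-3	parçacıklarsa` line and the two comment lines. The HEAD column still refers to the integer IDs, so the dependency tree is unchanged.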

@sb-b
Author

sb-b commented Feb 14, 2018 via email

@miguelballesteros
Contributor

@sb-b
Author

sb-b commented Feb 15, 2018 via email
