How to run conll17 dragnn baseline model? #21

Closed
alexfridlyand opened this issue Mar 31, 2017 · 9 comments

Comments

@alexfridlyand

Hello!
Amazing work!
Could you please tell me how to use your scripts to pass a text file to the test dragnn script (Russian baseline model) and get the output in CoNLL format?

If possible, please provide detailed instructions: where to copy files and so on.
Thanks in advance!

@dsindex
Owner

dsindex commented Apr 1, 2017

@alexfridlyand

hi~

i tried to run the baseline model described at https://github.com/tensorflow/models/tree/master/syntaxnet/g3doc/conll2017

but there is a problem related to 'utf8' and 'std::out_of_range' in the inference step.

...
2017-04-01 09:57:58.442684: I syntaxnet/embedding_feature_extractor.cc:35] Features: input.focus;input.focus stack.focus stack(1).focus;stack.focus stack(1).focus
2017-04-01 09:57:58.442689: I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: lookahead;tagger;rnn-stack
2017-04-01 09:57:58.442692: I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 64;64;64
2017-04-01 09:57:58.442810: W util/utf8/unicodetext.cc:260] UTF-8 buffer is not interchange-valid.
2017-04-01 09:57:58.442830: W util/utf8/unicodetext.cc:260] UTF-8 buffer is not interchange-valid.
libc++abi.dylib: terminating with uncaught exception of type std::out_of_range: basic_string
INFO:tensorflow:Read 0 documents
...

since i haven't found a way to fix it,
i decided to work around it by dropping the 'char2word' layer when building 'master_spec'.

after that, everything works fine.

https://github.com/dsindex/syntaxnet#dragnn
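
Before changing the model spec, it can be worth ruling out malformed input, since the warning in the log complains about the UTF-8 buffer. The following is a minimal sketch, not part of the repository: `find_invalid_utf8` is a hypothetical helper, the sample bytes are made up, and it only tests for well-formed UTF-8, which may be a narrower condition than the "interchange-valid" check in the log.

```python
def find_invalid_utf8(data):
    """Return the byte offset of the first malformed UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding raises on malformed input
    except UnicodeDecodeError as err:
        return err.start
    return None


# Hypothetical sample inputs, not taken from the thread.
good = "Привет".encode("utf-8")
bad = good[:-1]  # cut the last two-byte character in half

print(find_invalid_utf8(good))
print(find_invalid_utf8(bad))
```

If this reports no problem for the corpus and input files, the cause is more likely in the pipeline itself than in the data.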

if you are interested in training on the Russian corpus and testing,

  1. download Russian UD corpus from http://universaldependencies.org

  2. compile

$ pwd
/path/to/models/syntaxnet
$ bazel build -c opt //work/dragnn_examples:write_master_spec
$ bazel build -c opt //work/dragnn_examples:train_dragnn
$ bazel build -c opt //work/dragnn_examples:inference_dragnn
  3. train
  • say the UD_Russian directory is at this path
$ pwd
/path/to/work/UD_Russian
  • edit train_dragnn.sh
SRC_CORPUS_DIR=${CDIR}/UD_Russian
TRAIN_FILE=${DATA_DIR}/ru-ud-train.conllu.conv
DEV_FILE=${DATA_DIR}/ru-ud-dev.conllu.conv
  • run
$ nohup ./train_dragnn.sh -v -v &
  4. test
  • run
$ cat textfile | ./test_dragnn.sh -v -v

note again that loading a downloaded model for annotation is not yet supported here.

but i think the original code at https://github.com/tensorflow/models/tree/master/syntaxnet/dragnn/tools
may work well (i didn't test it).

@alexfridlyand
Author

Thank you very much for such a detailed response! I will reply shortly if I run into issues. Great stuff.

@alexfridlyand
Author

Got this error at the inference stage (with the model trained on the Russian dataset): UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128).

@alexfridlyand
Author

Added this to the test_dragnn.sh and it works:

reload(sys)
sys.setdefaultencoding('utf8')
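
Note that `sys.setdefaultencoding` is only reachable in Python 2 (after `reload(sys)`) and does not exist in Python 3. A more portable approach is to decode the input explicitly where it is read. The sketch below assumes Python 3; `read_text_lines` is a hypothetical helper and the sample bytes stand in for the piped text file.

```python
import io


def read_text_lines(byte_stream, encoding="utf-8"):
    """Wrap a binary stream and return its lines as decoded text."""
    reader = io.TextIOWrapper(byte_stream, encoding=encoding)
    return [line.rstrip("\n") for line in reader]


# Hypothetical input standing in for the piped text file.
# Its first byte is 0xd0, the same byte the 'ascii' codec rejected above.
raw = io.BytesIO("Привет мир\n".encode("utf-8"))
lines = read_text_lines(raw)
print(lines)
```

In a real script, the binary stream would typically be `sys.stdin.buffer` rather than an in-memory buffer.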

@alexfridlyand
Author

alexfridlyand commented Apr 3, 2017

@dsindex Maybe you could also help with converting the output to the Brat standoff (.ann) format so it can be displayed in Brat?
I commented out ${CONLL2TREE} --alsologtostderr in test_dragnn for this,
but then I need to convert the CoNLL-U format to standoff. I'm trying with this repo: https://github.com/spyysalo/conllu.py

but I'm getting multiple parse issues. Could you advise?

@dsindex
Owner

dsindex commented Apr 3, 2017

@alexfridlyand

that is a cool repo!

i am not sure about the multiple parse issues you mentioned.
but conllu.py looks like it does file-based processing in two passes:
one for the text, the other for the annotation. it is tricky..... ;;
i think we'd better save the conllu files (from test_dragnn.sh) and then run conllu.py on them.

$ cat file.txt | ./test_dragnn.sh > file.conllu
$ python conllu.py/convert.py -o outdir file.conllu

if we want to run it in an on-line manner,
we would have to modify conllu.py/convert.py and conllu.py/conllu/conllu.py.
that seems time-consuming.
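
Before handing a saved file to the converter, a quick structural check can help locate parse issues. This is a minimal sketch, independent of conllu.py: `check_conllu` is a hypothetical helper and the sample rows are made up. It only verifies that each token line has the 10 tab-separated fields that CoNLL-U requires.

```python
def check_conllu(lines):
    """Return (line_number, message) pairs for token lines without 10 fields."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # sentence separator or comment line
        count = len(line.split("\t"))
        if count != 10:
            problems.append((number, "got %d/10 field(s)" % count))
    return problems


# Hypothetical sample: one well-formed token line and one truncated line.
sample = [
    "1\tHello\thello\tINTJ\tUH\t_\t0\troot\t_\t_\n",
    "2\tworld\tworld\tNOUN\n",
    "\n",
]
print(check_conllu(sample))
```

For a real file, the same function can be called on an open file object, for example `check_conllu(open("file.conllu", encoding="utf-8"))`.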

by the way, i have a question about the brat tool.
as i reported in nlplab/brat#1221, i can't annotate relations
because there is no dialog action.

do you know how to fix it?

@alexfridlyand
Author

alexfridlyand commented Apr 3, 2017

I use brat only as a comparison tool. If I figure it out, I'll let you know.

@dsindex With the same commands you wrote, I'm getting
conllu.conllu.FormatError: invalid CPOSTAG: PRP$ (line 4)
on a file with Russian sentences.

@dsindex
Owner

dsindex commented Apr 4, 2017

@alexfridlyand thank you :)

hmm.... with the UD_English and Korean corpora, there is no error.
i guess the cpostag is not in the expected format:

CPOSTAG_RE = re.compile(r'^[a-zA-Z]+$')
...
        # some character set constraints
        if not CPOSTAG_RE.match(self.cpostag):
            raise FormatError('invalid CPOSTAG: %s' % self.cpostag)

here, self.cpostag is filled in by the from_string method:

def from_string(cls, s):
        fields = s.split('\t')
        if len(fields) != 10:
            raise FormatError('got %d/10 field(s)' % len(fields), s)
        fields[5] = [] if fields[5] == '_' else fields[5].split('|') # feats
        fields[8] = [] if fields[8] == '_' else fields[8].split('|') # deps
        return cls(*fields)

since i don't know exactly why such a character is in there,
filtering the fields list is the way i'd take ;;

hope it helps.
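
The regex above accepts ASCII letters only, so `PRP$` fails because of the `$`. One possible form of the filtering suggested here is sketched below. It is an assumption, not code from conllu.py: `sanitize_cpostag` and the `"X"` fallback are hypothetical, the sample row is made up, and stripping characters can merge distinct tags (`PRP$` becomes `PRP`).

```python
import re

# Matches every character that the CPOSTAG check would reject.
NON_LETTER_RE = re.compile(r"[^a-zA-Z]")


def sanitize_cpostag(line, fallback="X"):
    """Strip non-letter characters from the CPOSTAG column (index 3)."""
    text = line.rstrip("\n")
    fields = text.split("\t")
    if len(fields) != 10:
        return text  # leave comments, blank lines, and malformed lines alone
    cleaned = NON_LETTER_RE.sub("", fields[3])
    fields[3] = cleaned or fallback  # hypothetical fallback for empty tags
    return "\t".join(fields)


# Hypothetical token line carrying the tag from the error message.
row = "4\this\the\tPRP$\tPRP$\tPoss=Yes\t5\tnmod:poss\t_\t_"
print(sanitize_cpostag(row).split("\t")[3])
```

Applying this to each line of the saved .conllu file before conversion would avoid the FormatError, at the cost of altering the tags.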

@alexfridlyand
Author

Thank you, I think I'll just use the second Russian treebank, which is much bigger and appears to have proper tags.
